
NUIG-Panlingua-KMI Hindi ↔ Marathi MT Systems for Similar Language Translation Task @ WMT 2020

Atul Kr. Ojha¹⸴², Priya Rani¹, Akanksha Bansal², Bharathi Raja Chakravarthi¹, Ritesh Kumar³, John P. McCrae¹
¹Data Science Institute, NUIG, Galway; ²Panlingua Language Processing LLP, New Delhi; ³Dr. Bhimrao Ambedkar University, Agra
(atulkumar.ojha, priya.rani, bharathi.raja)@insight-centre.org, [email protected], [email protected], ritesh78 [email protected]

Abstract

The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the similar language translation task for the Hindi ↔ Marathi pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Among the 4 MT systems prepared for this task, one PBSMT system and one NMT system were developed for each direction of Hindi ↔ Marathi, the NMT systems using Byte Pair Encoding (BPE) into subwords. The results show that different architectures in NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 participating teams, and our Marathi-Hindi NMT system was ranked 8th among the 11 participating teams.

1 Introduction

Developing automated translation between closely related languages is a contemporary concern, especially in the domain of Machine Translation (MT). Hindi and Marathi exhibit a significant overlap in their vocabularies and strong syntactic and lexical similarities. These striking similarities seem promising for enhancing mutual inter-comprehension between closely related languages. However, automated translation between such closely related languages remains a challenging task.

The linguistic similarities and the regularities in morphological variation and orthography motivate the use of character-level translation models, which have been applied to translation (Vilar et al., 2007; Chakravarthi et al., 2020) and transliteration (Matthews, 2007; Chakravarthi et al., 2019a; Chakravarthi, 2020). In the past few years, neural machine translation (NMT) systems have achieved outstanding performance on high-resource languages with the help of open-source toolkits such as OpenNMT (Klein et al., 2017), Marian (Junczys-Dowmunt et al., 2018) and Nematus (Sennrich et al., 2017), which provide various ways of experimenting with different features and architectures, yet NMT fails to achieve the same results on low-resource languages (Chakravarthi et al., 2018, 2019b). However, Sennrich and Zhang (2019) revisited NMT models, tuned hyper-parameters and changed network architectures to optimize NMT for low-resource conditions, and concluded that low-resource NMT is very sensitive to hyper-parameters such as Byte Pair Encoding (BPE) vocabulary size, word dropout, and others.

This paper is an extension of our work (Ojha et al., 2019) submitted to the WMT 2019 similar language translation task. Our team therefore adapted the low-resource NMT methods proposed by Sennrich and Zhang (2019) to explore the following broad objectives:

• to compare the performance of SMT and NMT for closely related, relatively low-resourced language pairs;

• to find out how to leverage the accuracy of NMT for closely related languages using BPE into subwords; and

• to analyze the effect of data quality on the performance of the systems.

2 System Description

This section provides an overview of the systems developed for the WMT 2020 Shared Task. In these experiments, the NUIG-Panlingua-KMI team explored two different approaches: phrase-based statistical (Koehn et al., 2003) and neural methods for the Hindi-Marathi and Marathi-Hindi language pairs. In all the submitted systems, we used the Moses (Koehn et al., 2007) and Nematus (Sennrich et al., 2017) toolkits for developing the statistical and neural machine translation systems, respectively. Pre-processing was done to handle noise in the data (for example, sentences in other languages, non-UTF characters, etc.); the details are provided in Section 3.1.

2.1 Phrase-based SMT Systems

These systems were built on the Moses open-source toolkit using the KenLM (Heafield, 2011) language model and the GIZA++ (Och and Ney, 2003) aligner. The 'grow-diag-final-and' heuristic was used to extract phrases from the corresponding parallel corpora. In addition, KenLM was used to build 5-gram language models.
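To make the language-model step concrete, the sketch below trains a 5-gram KenLM model and queries it from Python. This is a minimal illustration under assumptions not stated in the paper: the KenLM binaries (lmplz, build_binary) are on the PATH, the kenlm Python bindings are installed, and all file names are placeholders.

```python
# Minimal sketch (not the authors' exact pipeline): build a 5-gram KenLM
# language model on tokenized target-side text and query it from Python.
# Assumes `lmplz` and `build_binary` (shipped with KenLM) are on the PATH;
# file names are placeholders.
import subprocess

import kenlm

# Train a 5-gram ARPA model on the tokenized corpus.
with open("train.hi.tok") as corpus, open("hi.5gram.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=corpus, stdout=arpa, check=True)

# Binarize for faster loading, then score a tokenized sentence.
subprocess.run(["build_binary", "hi.5gram.arpa", "hi.5gram.klm"], check=True)
model = kenlm.Model("hi.5gram.klm")
print(model.score("वह कल घर गया था", bos=True, eos=True))  # log10 probability
```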

2.2 Neural Machine Translation System

Nematus was used to build the two NMT systems. As mentioned in an earlier section, the data was first pre-processed into subwords with BPE for neural translation, and then the systems were trained using the Nematus toolkit. Most of the system features were adopted from Sennrich et al. (2017) and Koehn and Knowles (2017) (see Section 3.3.2).

2.3 Assessment

Assessment of these systems was done with the standard automatic evaluation metrics: BLEU (Papineni et al., 2002), Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010) and Translation Error Rate (TER) (Snover et al., 2006).

3 Experiments

This section briefly describes the experimental settings for developing the systems.

3.1 Data Preparation

The parallel data-set for these experiments was provided by the WMT Similar Translation Shared Task organisers (http://www.statmt.org/wmt20/similar.html), and the Marathi monolingual data-set was taken from the WMT 2020 Shared Task: Parallel Corpus Filtering for Low-Resource Conditions (https://wmt20similar.cs.upc.edu/). The data was sub-divided into training, tuning, and monolingual sets, as detailed in Table 1.

Table 1: Statistics of parallel and monolingual sentences of the Hindi and Marathi languages

Language Pair    | Training | Tuning | Monolingual
Hindi ↔ Marathi  | 43274    | 1411   | -
Marathi          | -        | -      | 326748
Hindi            | -        | -      | 75348193

However, the shared data was very noisy. To enhance the data quality, the team had to undertake an extensive pre-processing effort focused on identifying and cleaning the data-sets. Out of 43274 training sentences, the Hindi corpus had Telugu sentences while the Marathi corpus had Meitei sentences intermingled, as shown in the first row of Figure 1. The parallel data had more than 1192 lines that were not comparable with each other, as shown in the second and third rows of Figure 1: some Hindi sentences had only half the sentence translated in Marathi (second row) and some had blank spaces as their Marathi counterparts (third row). The translation quality of the parallel data was also not up to the mark; in fact, the team could locate a few instances of synthetic data. There were also a few sentences with character-encoding issues, which were completely unintelligible.

Figure 1: Examples of discrepancies in Hindi-Marathi parallel data
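The corpus-level cleaning described above can be approximated with a short filter. The sketch below is our illustration, not the authors' released script; the thresholds (80 words, 80% Devanagari characters) and file names are assumptions chosen to mirror the issues just listed: wrong-script lines, blank targets, and overlong sentences.

```python
# Illustrative parallel-corpus filter mirroring the issues described in
# Section 3.1: wrong-script lines (e.g. Telugu or Meitei), blank sides,
# and overlong sentences. Thresholds and file names are assumptions.

def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Devanagari block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in letters) / len(letters)

def keep(src: str, tgt: str, max_len: int = 80, min_ratio: float = 0.8) -> bool:
    if not src.strip() or not tgt.strip():
        return False  # blank side: no usable translation
    if len(src.split()) > max_len or len(tgt.split()) > max_len:
        return False  # overlong sentence pair
    # Both Hindi and Marathi are written in Devanagari, so a low ratio
    # flags intermingled Telugu/Meitei or mis-encoded lines.
    return devanagari_ratio(src) >= min_ratio and devanagari_ratio(tgt) >= min_ratio

with open("train.hi") as f_hi, open("train.mr") as f_mr, \
        open("clean.hi", "w") as o_hi, open("clean.mr", "w") as o_mr:
    for hi_line, mr_line in zip(f_hi, f_mr):
        if keep(hi_line, mr_line):
            o_hi.write(hi_line)
            o_mr.write(mr_line)
```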

3.2 Pre-processing

The following pre-processing steps were performed as part of the experiments (a sketch follows this list):

a) Both corpora were tokenized and cleaned (sentences of length over 80 words were removed).

b) For neural translation, the training, validation and test data were pre-processed into BPE subword format, which was used to prepare the BPE codes and the vocabulary used later.

All these processes were performed using Moses scripts, except that tokenization was done with the RGNLP team's tokenizer (Ojha et al., 2018) and the Indic NLP library (https://github.com/anoopkunchukuttan/indic_nlp_library). These tokenizers were used because Moses does not provide a tokenizer for Indic languages; the RGNLP tokenizer also ensured that the canonical representation of the characters was retained.
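As an illustration of steps (a) and (b), the sketch below normalizes and tokenizes Hindi text with the Indic NLP library and then learns and applies BPE with the subword-nmt package (one common BPE tool; the paper does not name the exact implementation used). The 30000 merge operations echo the vocabulary size reported in Section 3.3.2; paths are placeholders.

```python
# Illustrative pre-processing: Unicode normalization and tokenization with
# the Indic NLP library, then BPE segmentation with subword-nmt. The paper
# does not name its exact BPE tool, so treat this as one plausible setup.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize
from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

normalizer = IndicNormalizerFactory().get_normalizer("hi")

def tokenize(line: str) -> str:
    # Canonicalize Devanagari code points, then whitespace-join the tokens.
    return " ".join(indic_tokenize.trivial_tokenize(normalizer.normalize(line), lang="hi"))

with open("clean.hi") as inp, open("train.hi.tok", "w") as out:
    for line in inp:
        out.write(tokenize(line.strip()) + "\n")

# Learn 30000 BPE merge operations, then segment the tokenized corpus.
with open("train.hi.tok") as inp, open("bpe.codes", "w") as codes:
    learn_bpe(inp, codes, num_symbols=30000)

with open("bpe.codes") as codes:
    bpe = BPE(codes)
with open("train.hi.tok") as inp, open("train.hi.bpe", "w") as out:
    for line in inp:
        out.write(bpe.process_line(line))
```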

Figure 2: Analysis of the PBSMT and NMT systems

3.3 Development of the NUIG-Panlingua-KMI MT Systems

After removing noise and pre-processing the data, the following steps were followed to build the NUIG-Panlingua-KMI MT systems.

3.3.1 Building the Primary MT Systems

As previously mentioned, the Hindi-Marathi and Marathi-Hindi PBSMT systems were built as the primary submissions using Moses. The language model was built first, using KenLM; for both language pairs, 5-gram language models were trained. After that, the component models were built independently and combined in a log-linear scheme in which each model was assigned a different weight using the Minimum Error Rate Training (MERT) (Och, 2003) tuning algorithm. To train and tune the systems, we used 40454 and 1411 parallel sentences, respectively, for both language pairs.
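For reference, the standard formulation behind this combination, restated here from Och (2003) since the paper itself gives no equations: each feature function h_i (language model, phrase table, reordering, and so on) receives a weight λ_i, and MERT chooses the weights that minimize error on the tuning set.

```latex
% Log-linear phrase-based model and MERT objective (standard forms,
% following Och (2003); not equations given in the paper itself).
\hat{e} = \operatorname*{arg\,max}_{e} \sum_{i=1}^{M} \lambda_i \, h_i(e, f)
\qquad
\hat{\lambda}_1^{M} = \operatorname*{arg\,min}_{\lambda_1^{M}}
    \sum_{s=1}^{S} \mathrm{Err}\bigl(r_s, \hat{e}(f_s; \lambda_1^{M})\bigr)
```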

3.3.2 Building the Contrastive MT Systems

As mentioned in the previous section, the Nematus toolkit was used to develop the NMT systems. Training was done at the subword and character level. All the NMT experiments were carried out only with a data-set containing sentences of up to 80 words in length. The neural model was trained for 5000 epochs using Adam with a default learning rate of 0.002, dropout of 0.01, mini-batches of 80, and a validation batch size of 40. A vocabulary of size 30000 was extracted for both the Marathi-Hindi and Hindi-Marathi language pairs. The remaining parameters were left at the default hyper-parameter configuration.

4 Evaluation

All the systems were evaluated using the reference set provided by the shared task organizers. The standard MT evaluation metrics BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and TER (Snover et al., 2006) were used for automatic evaluation. The results were prepared for the Primary and Contrastive system submissions, marked in Table 2 as P and C, respectively. The table gives a quantitative picture of the differences across the systems in terms of evaluation scores.

Table 2: Accuracy of Hindi ↔ Marathi MT systems on the BLEU, RIBES and TER metrics

System           | BLEU  | RIBES | TER
Hindi-Marathi P  | 9.38  | 51.88 | 91.24
Hindi-Marathi C  | 9.76  | 52.18 | 91.49
Marathi-Hindi P  | 17.38 | 59.31 | 81.47
Marathi-Hindi C  | 17.39 | 58.84 | 81.15
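As an aside on reproducing such scores, the sketch below computes corpus-level BLEU and TER with the sacreBLEU package; RIBES is not included in sacreBLEU and is usually computed with the original NTT script. File names are placeholders, and the shared task's official scores may come from different tooling.

```python
# Illustrative scoring of a system output against the shared-task reference
# with sacreBLEU (BLEU and TER only; RIBES needs a separate script).
# File names are placeholders; official task scoring may differ.
import sacrebleu

with open("hypothesis.mr") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.mr") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}  TER = {ter.score:.2f}")
```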
4.1 Results

Overall, we see varying performance among the systems submitted to the task, with some performing much better out-of-sample than others. The NUIG-Panlingua-KMI subword NMT systems took 8th position for both the Hindi-Marathi and Marathi-Hindi language pairs, across 14 and 11 teams respectively. Our subword NMT system for the Marathi-Hindi language pair showed better results on all three metrics (17.39 BLEU, 58.84 RIBES and 81.15 TER), while the Hindi-Marathi pair scored 9.76 BLEU, 52.18 RIBES and 91.24 TER. Across both language pairs, subword-based NMT performed better than PBSMT, with higher BLEU and lower TER, as shown in Table 2.

4.2 Analysis

We used the reference set provided by the shared task organizers to evaluate both the PBSMT and NMT systems. Even though the subword-based NMT systems could take advantage of the features shared between these similar languages, challenges in translating a few linguistic structures acted as a constraint. Example 1 in Figure 2 shows one of the challenging structures that the systems were unable to translate: the systems could not capture the correct tense and aspect, which is past perfect in the source sentence, whereas the NMT system translated it as simple past. The second most common challenging structures, which needed special attention, were postpositions, as shown in Examples 2 and 3 in the figure. In most cases, the systems over-generalised the Marathi sentences and generated unnecessary postposition phrases in Hindi, as in Example 2. Similarly, as Example 3 shows, when translating from Hindi to Marathi both the PBSMT and NMT systems used wrong postpositions.

5 Conclusion

Our experimental results reveal that subword-based NMT can take advantage of the relation between similar languages to boost the accuracy of neural machine translation systems in low-resource data settings. As BPE units are variable-length units and the vocabularies used are much smaller than in morpheme- and word-level models, the problem of data sparsity does not occur; on the contrary, BPE provides an appropriate context for translation between similar languages. However, the quality of the data used to train the systems does affect the quality of translation. Thus, we conclude that the features shared between two closely related languages can be leveraged to improve the accuracy of NMT systems.

Acknowledgments

This publication has emanated from research supported in part by the Irish Research Council under grant number SFI/18/CRT/6223 (CRT-Centre for Research Training in Artificial Intelligence), co-funded by the European Regional Development Fund, as well as by the EU H2020 programme under grant agreement 731015 (ELEXIS-European Lexical Infrastructure). We are also grateful to the organizers of the WMT Similar Translation Shared Task 2020 for providing the Hindi ↔ Marathi parallel corpus, monolingual data and evaluation scores.

References

Bharathi Raja Chakravarthi. 2020. Leveraging orthographic information to improve machine translation of under-resourced languages. Ph.D. thesis, NUI Galway.

Bharathi Raja Chakravarthi, Mihael Arcan, and John P. McCrae. 2018. Improving wordnets for under-resourced languages using machine translation. In Proceedings of the 9th Global WordNet Conference (GWC 2018), page 78.

Bharathi Raja Chakravarthi, Mihael Arcan, and John P. McCrae. 2019a. Comparison of different orthographies for machine translation of under-resourced Dravidian languages. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Bharathi Raja Chakravarthi, Mihael Arcan, and John Philip McCrae. 2019b. WordNet gloss translation for under-resourced languages using multilingual neural machine translation. In Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, pages 1–7.

Bharathi Raja Chakravarthi, Priya Rani, Mihael Arcan, and John P. McCrae. 2020. A survey of orthographic information in machine translation. arXiv e-prints, arXiv–2008.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

David Matthews. 2007. Machine transliteration of proper names. Master's thesis, University of Edinburgh, Edinburgh, United Kingdom.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Atul Kr. Ojha, Koel Dutta Chowdhury, Chao-Hong Liu, and Karan Saxena. 2018. The RGNLP machine translation systems for WAT 2018. In Proceedings of the 5th Workshop on Asian Translation (WAT 2018).

Atul Kr. Ojha, Ritesh Kumar, Akanksha Bansal, and Priya Rani. 2019. Panlingua-KMI MT system for similar language translation task at WMT 2019. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 213–218.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, et al. 2017. Nematus: a toolkit for neural machine translation. arXiv preprint arXiv:1703.04357.

Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 211–221, Florence, Italy. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, volume 200.

David Vilar, Jan-Thorsten Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33–39, Prague, Czech Republic. Association for Computational Linguistics.
