Neural Machine Translation for Similar Languages: The Case of Indo-Aryan Languages

Santanu Pal¹, Marcos Zampieri²
¹Wipro AI Lab, ²Rochester Institute of Technology, USA
[email protected]

Abstract

In this paper we present the WIPRO-RIT systems submitted to the Similar Language Translation shared task at WMT 2020. The second edition of this shared task featured parallel data from pairs/groups of similar languages from three different language families: Indo-Aryan languages (Hindi and Marathi), Romance languages (Catalan, Portuguese, and Spanish), and South Slavic languages (Croatian, Serbian, and Slovene). We report the results obtained by our systems in translating from Hindi to Marathi and from Marathi to Hindi. WIPRO-RIT achieved competitive performance, ranking 1st in Marathi to Hindi and 2nd in Hindi to Marathi translation among 22 systems.

1 Introduction

WMT 2020 is the fifth edition of WMT as a conference, following a series of well-attended workshops that date back to 2006. WMT became a well-established conference due to its blend of research papers and popular shared tasks on topics such as translation in various domains (e.g. biomedical, news), translation quality estimation, and automatic post-editing. The competitions co-organized with WMT provide important datasets and benchmarks that are widely used in the MT community. The vast majority of these tasks so far, however, involved training systems to translate to and from English (Bojar et al., 2016, 2017), while only a few of them addressed the problem of translating between pairs of languages with fewer resources.

To address this issue, the Similar Language Translation (SLT) shared task was introduced at WMT in 2019. SLT's purpose was to evaluate the performance of state-of-the-art MT systems on translating between pairs of similar languages without English as a pivot language (Barrault et al., 2019). The organizers provided participants with training, development, and testing parallel data for three pairs of languages from three different language families: Spanish-Portuguese (Romance languages), Czech-Polish (Slavic languages), and Hindi-Nepali (Indo-Aryan languages). Systems were evaluated using automatic metrics, namely BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).

In SLT 2020, the task organizers once again included an Indo-Aryan language track, this time with Hindi and Marathi. Indo-Aryan languages are a sub-family of the Indo-European language family which includes Bengali, Bhojpuri, Hindi, Marathi, and Nepali. These languages are mainly spoken in North and Central India and in neighbouring countries such as Nepal, Bangladesh, and Pakistan. The scripts used for most of these languages are derived from the ancient Brahmi script and exhibit a high grapheme-to-phoneme correspondence, leading to many orthographic similarities across these languages.

In addition to Hindi and Marathi, SLT 2020 features two other tracks with similar languages from the following language families: Romance languages (Catalan, Portuguese, and Spanish) and South Slavic languages (Croatian, Serbian, and Slovene). In this paper we describe the WIPRO-RIT submission to the SLT 2020 Indo-Aryan track. Our WIPRO-RIT system is based on the model described in Johnson et al. (2017). WIPRO-RIT achieved competitive performance, ranking 1st in Marathi to Hindi and 2nd in Hindi to Marathi translation among 22 systems.

2 Related Work

With the substantial performance improvements brought to MT by neural approaches, a growing interest in translating between pairs of similar languages, language varieties, and dialects has been observed. Recent studies have addressed MT between Arabic dialects (Harrat et al., 2019; Shapiro and Duh, 2019), Catalan and Spanish (Costa-jussà, 2017), Croatian and Serbian (Popović et al., 2020), Brazilian and European Portuguese (Costa-jussà et al., 2018), and several pairs of languages and language varieties such as Brazilian and European Portuguese, Canadian and European French, Croatian and Serbian, and Indonesian and Malay (Lakew et al., 2018).

The interest in diatopic language variation is evidenced by the recent iterations of the VarDial workshop, in which papers on MT applied to similar languages, varieties, and dialects (Shapiro and Duh, 2019; Myint Oo et al., 2019; Popović et al., 2020) have been presented along with evaluation campaigns featuring multiple shared tasks on a number of related topics such as cross-lingual morphological analysis, cross-lingual parsing, dialect identification, and morphosyntactic tagging (Zampieri et al., 2018, 2019; Găman et al., 2020).

3 Data

For our experiments, we use the Hindi-Marathi and Marathi-Hindi WMT 2020 SLT data. The released parallel dataset was collected from the news (Siripragada et al., 2020), PMIndia (Haddow and Kirefu, 2020), and IndoWordNet (Bhattacharyya, 2010; Kunchukuttan, 2020a) datasets. To augment our dataset, we use the English-Hindi parallel data released in WMT 2014 (Bojar et al., 2014), consisting of more than 2 million parallel sentences, which is available as an additional resource. We use a subset of 5 million segments of Hindi monolingual news, drawn from ca. 32 million crawled sentences, and a subset of 5 million Marathi monolingual sentences. We applied cleaning and pre-processing methods similar to those described below for the parallel data.

The 5 million Hindi monolingual sentences were first back-translated to English using a Hindi-English NMT system. This system was trained on the English-Hindi parallel data released in WMT 2014 (Bojar et al., 2014), the IITB parallel corpus (Kunchukuttan et al., 2018), the parallel dataset collected from news (Siripragada et al., 2020), and the PMIndia parallel corpus (Haddow and Kirefu, 2020) (see Table 1).

Data Sources         #sentences
WMT                     273,885
News                    156,344
IITB                  1,561,840
PM India                 56,831
Total                 2,048,900
Remove duplicates     1,464,419
Cleaning*               961,036

Table 1: English-Hindi parallel data statistics. *Removing noisy mixed-language sentences.

We also back-translated the 5 million Marathi monolingual segments using our WIPRO-RIT CONTRASTIVE 1 system, described in more detail in Section 6. For Marathi-Hindi we did not use any back-translation data in our CONTRASTIVE 2 and PRIMARY submissions. In both cases, the 5 million English-Hindi back-translated sentences provide significant (p < 0.01) improvements over CONTRASTIVE 1 (detailed in Section 6).

The released WMT 2014 EN-HI data and the WMT SLT 2020 data were noisy for our purposes, so we applied two cleaning steps (see data statistics in Table 2): (i) the cleaning process described in Pal et al. (2015), and (ii) the Moses (Koehn et al., 2007) corpus cleaning scripts with the minimum and maximum number of tokens set to 1 and 100, respectively.

Parallel          #sentences
News                  12,349
PM India              25,897
Indic WordNet         11,188
Total                 49,434
Filtered*             33,923

Table 2: Data statistics of the released SLT data. *Filtration methods: (i) removing duplicates and (ii) filtering noisy mixed-language sentences.
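For illustration, the deduplication, length filtering, and mixed-language filtering above can be sketched in Python as follows. This is a minimal reconstruction rather than the exact pipeline we ran; in particular, the script-ratio heuristic, its threshold, and all names are illustrative assumptions.

```python
import re

# Character classes used by the illustrative mixed-language filter.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")

def latin_ratio(line):
    """Fraction of alphabetic characters that are Latin, for a line
    expected to be written (mostly) in Devanagari."""
    latin = len(LATIN.findall(line))
    deva = len(DEVANAGARI.findall(line))
    total = latin + deva
    return latin / total if total else 0.0

def clean_parallel(pairs, min_tokens=1, max_tokens=100, max_latin=0.5):
    """Deduplicate sentence pairs, apply a Moses-style 1-100 token
    length window, and drop pairs whose Indic side is mostly
    non-Devanagari (a stand-in for the mixed-language filter)."""
    seen = set()
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # (i) remove duplicates
        seen.add((src, tgt))
        if not (min_tokens <= len(src.split()) <= max_tokens):
            continue  # (ii) length filter, source side
        if not (min_tokens <= len(tgt.split()) <= max_tokens):
            continue  # (ii) length filter, target side
        if latin_ratio(tgt) > max_latin:
            continue  # drop noisy mixed-language sentences
        yield src, tgt
```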

L1 → L2                | Source                                       | Target
Raw HI→MR data         | देश एकल प्रयासों से आगे बढ़ चुके हैं।              | देश आता सामाईक प्रयत्न करत आहेत.
Processed TO_MR data   | TO_MR देश एकल प्रयासों से आगे बढ़ चुके हैं।        | देश आता सामाईक प्रयत्न करत आहेत.
Raw MR→HI data         | देश आता सामाईक प्रयत्न करत आहेत.              | देश एकल प्रयासों से आगे बढ़ चुके हैं।
Processed TO_HI data   | TO_HI देश आता सामाईक प्रयत्न करत आहेत.        | देश एकल प्रयासों से आगे बढ़ चुके हैं।
Raw EN→HI data         | The MoU was signed in February, 2016.        | इस एमओयू पर फरवरी, 2016 में हस्ताक्षर किए गए थे।
Processed TO_HI data   | TO_HI The MoU was signed in February, 2016.  | इस एमओयू पर फरवरी, 2016 में हस्ताक्षर किए गए थे।

Table 3: Multilingual processed data, with TO_XX indicating the target-language token.
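The processed rows in Table 3 are produced by prepending a target-language token to each source sentence, as described in Section 4 below. A minimal Python sketch of this step (the function name is ours; the token spelling follows the table):

```python
def add_target_token(sentences, target_lang):
    """Prepend a target-language token (TO_HI, TO_MR, ...) to every
    source sentence so that one multilingual model knows which
    language to produce, following Johnson et al. (2017)."""
    token = "TO_" + target_lang.upper()
    return [token + " " + s for s in sentences]

# Tagged corpora for all directions are then simply concatenated
# into one training set:
tagged = add_target_token(["देश एकल प्रयासों से आगे बढ़ चुके हैं।"], "mr")
# -> ["TO_MR देश एकल प्रयासों से आगे बढ़ चुके हैं।"]
```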

After cleaning and removing duplicates, we have 1M EN-HI parallel sentences. Next, we perform punctuation normalization, and then we use the Moses tokenizer with the 'no-escape' option to tokenize the English side of the parallel corpus. Finally, we apply true-casing. For Hindi and Marathi, we use the Indic NLP Library¹ (Kunchukuttan, 2020b) for tokenization.

4 Model Architecture

Our model is based on the transformer architecture (Vaswani et al., 2017), which is built solely upon attention mechanisms, completely replacing recurrence and convolutions. The transformer uses positional encoding to encode the input and output sequences, and computes both self- and cross-attention through so-called multi-head attention, which facilitates parallelization. We use multi-head attention to jointly attend to information at different positions from different representation subspaces.

We present a single multilingual NMT system based on the transformer architecture that can translate between multiple languages. To make use of multilingual data within a single NMT model, we perform one simple modification to the source side of the multilingual data: we add an additional token at the beginning of each source sentence to indicate the target language into which the NMT model should translate, as shown in Table 3. We train the model on all the processed multilingual data, consisting of sentence-aligned data from multiple language pairs at once. During inference, we also need to add the aforementioned token to each input source sentence to specify the desired target language.

5 Experiments

In the next sub-sections we describe the experiments we carried out for translating from Hindi to Marathi and from Marathi to Hindi for WIPRO-RIT's WMT 2020 SLT shared task submission.

5.1 Experiment Setup

To handle out-of-vocabulary words and to reduce the vocabulary size, we consider subword units (Sennrich et al., 2016) obtained with byte-pair encoding (BPE) instead of words. In the preprocessing step, instead of learning an explicit mapping between BPEs in English (EN), Hindi (HI), and Marathi (MR), we define BPE tokens by jointly processing all parallel data, thus deriving a single BPE vocabulary. Since HI and MR are similar languages, they naturally share a good fraction of BPE tokens, which reduces the vocabulary size.

We report the evaluation results (computed by the shared task organizers) of our approach on the released test data. BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010), and TER (Snover et al., 2006) are used to evaluate the performance of all participating systems in the shared task.

¹https://github.com/anoopkunchukuttan/indic_nlp_library/
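As a sketch of the joint BPE step, assuming all EN, HI, and MR training text has been concatenated into a single file (file names are placeholders), the subword-nmt package can learn the single 32K-operation vocabulary described above:

```python
# pip install subword-nmt
from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

# Learn one joint BPE model over the concatenated EN + HI + MR
# training text, so all three languages share one subword vocabulary.
with open("train.all.txt", encoding="utf-8") as text, \
        open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(text, codes, num_symbols=32000)  # 32K merge operations

# Apply the shared codes to text in any of the three languages.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("देश एकल प्रयासों से आगे बढ़ चुके हैं।"))
```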
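The official scores were computed by the organizers; for illustration, BLEU and TER can be reproduced locally with the sacrebleu package (RIBES is not part of sacrebleu and requires the separate RIBES evaluation script). The example strings are placeholders.

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["देश आता सामाईक प्रयत्न करत आहेत."]    # system output, one string per segment
references = [["देश आता सामाईक प्रयत्न करत आहेत."]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print("BLEU = {:.2f}, TER = {:.2f}".format(bleu.score, ter.score))
```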

Parallel Data     #sentences    C1   C2   P
Filtered SLT          33,923     ✓    ✓   ✓
Filtered EN-HI       961,036     ✓    ✓   ✓
BT EN-HI           5 million     ✓    ✓   ✓
BT HI-MR           5 million          ✓   ✓

Table 4: Training data statistics of our submitted systems (C1 = CONTRASTIVE 1, C2 = CONTRASTIVE 2, P = PRIMARY, and BT = back-translated data).

5.2 Hyper-parameter Setup

We follow a similar hyper-parameter setup for all reported systems. All encoders and the decoder are composed of a stack of N = 6 identical layers followed by layer normalization. Each layer consists of two sub-layers with a residual connection (He et al., 2016) around each of them. We apply dropout (Srivastava et al., 2014) to the output of each sub-layer before it is added to the sub-layer input and normalized. Furthermore, dropout is applied to the sums of the word embeddings and the corresponding positional encodings in both the encoder and decoder stacks.

We set all dropout values in the network to 0.1. During training, we employ label smoothing with value ϵ_ls = 0.1. The output dimension produced by all sub-layers and embedding layers is d_model = 512. Each encoder and decoder layer contains a fully connected feed-forward network (FFN) with dimensionality d_model = 512 for the input and output and dimensionality d_ff = 2048 for the inner layers. For the scaled dot-product attention, the input consists of queries and keys of dimension d_k and values of dimension d_v. As multi-head attention parameters, we employ h = 8 parallel attention layers, or heads, and for each of these we use a dimensionality of d_k = d_v = d_model/h = 64. For optimization, we use the Adam optimizer (Kingma and Ba, 2015) with β₁ = 0.9, β₂ = 0.98, and ϵ = 10⁻⁹. The learning rate is varied throughout the training process: it increases for the first warmup_steps = 16,000 training steps and decreases afterwards as described in Vaswani et al. (2017). All remaining hyper-parameters are set analogously to those of the transformer base model. At training time, the batch size is set to 25K tokens, with a maximum sentence length of 256 subwords and a vocabulary size of 32K. After each epoch, the training data is shuffled. During decoding, we perform beam search with a beam size of 4. We use 32K BPE operations to train our BPE models. We use shared embeddings in all our experiments.
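Written out explicitly, the learning-rate schedule referenced above follows the formula of Vaswani et al. (2017), here with our warm-up value as the default; a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=16000):
    """Learning rate at a given update, following Vaswani et al. (2017):
    linear warm-up for the first `warmup_steps` updates, then decay
    proportional to the inverse square root of the step number."""
    step = max(step, 1)  # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```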

6 Results

We present the results obtained by our systems for Hindi-Marathi in Table 5 and for Marathi-Hindi in Table 6 in terms of BLEU, RIBES, and TER. We apply our proposed method to train multilingual models in three different configurations. Table 4 shows the training data used for our CONTRASTIVE 1 (C1), CONTRASTIVE 2 (C2), and PRIMARY (P) submissions.

System    BLEU ↑   RIBES ↑   TER ↓
P          16.62     62.45   72.23
C2         15.42     61.02   73.59
C1         13.25     58.51   76.17

Table 5: Results for Hindi to Marathi translation, ranked by BLEU score.

System    BLEU ↑   RIBES ↑   TER ↓
P          24.53     66.23   66.39
C2         22.93     65.89   68.11
C1         22.69     65.01   68.13

Table 6: Results for Marathi to Hindi translation, ranked by BLEU score.

CONTRASTIVE 1 (C1) Our CONTRASTIVE 1 submission is a multilingual single system and does not use any monolingual back-translation data. The system is trained on the released HI-MR and MR-HI parallel data; in addition, we also use the EN-HI parallel data.

CONTRASTIVE 2 (C2) This submission is similar to CONTRASTIVE 1; however, in this case we used the 5M back-translated Marathi-Hindi and 5M back-translated Hindi-Marathi corpora. Source-side back-translated sentences begin with the additional token indicating the target language.

PRIMARY (P) Our primary submission is trained using the same setting as described for the CONTRASTIVE 2 system. The difference is that our primary system is an ensemble of three different CONTRASTIVE 2 systems initialized with three different random seeds.
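The exact combination method for the ensemble is not spelled out above; a common choice in NMT toolkits is to average the member models' next-token distributions at every beam-search step, as in this hedged sketch (next_token_probs is a hypothetical per-model interface, not an API from the paper):

```python
import numpy as np

def ensemble_step(models, state):
    """One decoding step of a probability-level ensemble: average the
    next-token distributions of the member models, then return log
    probabilities for beam search. `models` stands for the three
    CONTRASTIVE 2 systems trained with different random seeds."""
    probs = np.mean([m.next_token_probs(state) for m in models], axis=0)
    return np.log(probs + 1e-12)  # epsilon avoids log(0)
```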

7 Conclusion and Future Work

This paper presented the WIPRO-RIT system submitted to the Similar Language Translation shared task at WMT 2020. We presented the results obtained by our system in translating from Hindi to Marathi and from Marathi to Hindi. Our primary system achieved competitive performance, ranking first in Marathi to Hindi and second in Hindi to Marathi among 22 teams in terms of BLEU score.

In future work, we would like to further explore the similarity between these two languages when translating to other Indo-Aryan languages (e.g. Bengali, Bhojpuri, and Nepali). We expect the models presented in this paper to perform well for other Indo-Aryan languages provided that suitable training data is available. Furthermore, we would like to apply and evaluate our method on the two other groups of languages in the WMT SLT 2020 shared task: Romance languages (Catalan, Portuguese, and Spanish) and South Slavic languages (Croatian, Serbian, and Slovene). Finally, we will be incorporating the translation models presented in this paper into CATaLog, an open-source online CAT tool that provides users with both MT and TM outputs (Nayek et al., 2015; Pal et al., 2016).

Acknowledgments

We would like to thank the WMT 2020 SLT shared task organizers for making the Hindi-Marathi data available. We further thank the anonymous WMT reviewers for their insightful feedback and suggestions.

References

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of WMT.

Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of LREC.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of WMT.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 Conference on Machine Translation. In Proceedings of WMT.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, et al. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of WMT.

Marta R. Costa-jussà. 2017. Why Catalan-Spanish Neural Machine Translation? Analysis, Comparison and Combination with Standard Rule and Phrase-based Technologies. In Proceedings of VarDial.

Marta R. Costa-jussà, Marcos Zampieri, and Santanu Pal. 2018. A Neural Approach to Language Variety Translation. In Proceedings of VarDial.

Mihaela Găman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, and Marcos Zampieri. 2020. A Report on the VarDial Evaluation Campaign 2020. In Proceedings of VarDial.

Barry Haddow and Faheem Kirefu. 2020. PMIndia - A Collection of Parallel Corpora of Languages of India. arXiv preprint arXiv:2001.09907.

Salima Harrat, Karima Meftouh, and Kamel Smaili. 2019. Machine Translation for Arabic Dialects (Survey). Information Processing & Management, 56(2):262-273.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of CVPR.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic Evaluation of Translation Quality for Distant Language Pairs. In Proceedings of EMNLP.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339-351.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL.

Anoop Kunchukuttan. 2020a. IndoWordNet Parallel Corpus. https://github.com/anoopkunchukuttan/indowordnet_parallel.

Anoop Kunchukuttan. 2020b. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi Parallel Corpus. In Proceedings of LREC.

Surafel M. Lakew, Aliia Erofeeva, and Marcello Federico. 2018. Neural Machine Translation into Language Varieties. arXiv preprint arXiv:1811.01064.

Thazin Myint Oo, Ye Kyaw Thu, and Khin Mar Soe. 2019. Neural Machine Translation between Myanmar (Burmese) and Rakhine (Arakanese). In Proceedings of VarDial.

Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. CATaLog: New Approaches to TM and Post Editing Interfaces. In Proceedings of NLP4TM.

Santanu Pal, Sudip Naskar, and Josef van Genabith. 2015. UdS-Sant: English-German Hybrid Machine Translation System. In Proceedings of WMT.

Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela, and Josef van Genabith. 2016. CATaLog Online: Porting a Post-editing Tool to the Web. In Proceedings of LREC.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL.

Maja Popović, Alberto Poncelas, Marija Brkic, and Andy Way. 2020. Neural Machine Translation for Translating into Croatian and Serbian. In Proceedings of VarDial.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL.

Pamela Shapiro and Kevin Duh. 2019. Comparing Pipelined and Integrated Approaches to Dialectal Arabic Neural Machine Translation. In Proceedings of VarDial.

Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, and C V Jawahar. 2020. A Multilingual Parallel Corpora Collection Effort for Indian Languages. In Proceedings of LREC.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proceedings of NIPS.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Dirk Speelman, Antal van den Bosch, Ritesh Kumar, Bornini Lahiri, and Mayank Jain. 2018. Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign. In Proceedings of VarDial.
Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei Butnaru, and Tommi Jauhiainen. 2019. A Report on the Third VarDial Evaluation Campaign. In Proceedings of VarDial.
