Optimal Word Segmentation for Neural Machine Translation Into Dravidian Languages
Prajit Dhar  Arianna Bisazza  Gertjan van Noord
University of Groningen
{p.dhar, a.bisazza, g.j.m.van.noord}@rug.nl

Proceedings of the 8th Workshop on Asian Translation, pages 181-190, Bangkok, Thailand (online), August 5-6, 2021. ©2021 Association for Computational Linguistics

Abstract

Dravidian languages, such as Kannada and Tamil, are notoriously difficult to translate by state-of-the-art neural models. This stems from the fact that these languages are morphologically very rich as well as being low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger subword vocabulary sizes lead to higher translation quality.

1 Introduction

Dravidian languages are an important family of languages spoken by about 250 million people, primarily located in Southern India and Sri Lanka (Steever, 2019). Kannada (KN), Malayalam (ML), Tamil (TA) and Telugu (TE) are the four most spoken Dravidian languages, with approximately 47, 34, 71 and 79 million native speakers, respectively. Together, they account for 93% of all Dravidian language speakers. While Kannada, Malayalam and Tamil are classified as South Dravidian languages, Telugu is part of the South-Central Dravidian branch. All four languages are SOV (Subject-Object-Verb) languages with free word order. They are highly agglutinative and inflectionally rich. Additionally, each language has a different writing system. Table 1 presents an example English sentence and its Dravidian-language translations.

The highly complex morphology of the Dravidian languages under study is illustrated if we compare translated sentence pairs. The analysis of our parallel datasets (Section 4.1, Table 3) shows for instance that an average English sentence contains almost ten times as many words as its Kannada equivalent. For the other three languages, the ratio is a bit smaller, but the difference with English remains considerable. This indicates why it is important to consider word segmentation algorithms as part of the translation system.

In this paper we describe our work on Neural Machine Translation (NMT) from English into the Dravidian languages Kannada, Malayalam, Tamil and Telugu. We investigated the optimal translation settings for these language pairs and in particular looked at the effect of word segmentation. The aim of the paper is to answer the following research questions:

• Does LMVR, a linguistically motivated word segmentation algorithm, outperform the purely data-driven SentencePiece?

• What is the optimal subword dictionary size for translating from English into these Dravidian languages?

In what follows, we review the relevant previous work (Sect. 2), introduce the two segmenters (Sect. 3), describe the experimental setup (Sect. 4), and present our answers to the above research questions (Sect. 5).

EN  He was born in Thirukkuvalai village in Nagapattinam District on 3rd June, 1924.
KN  ಅವರು ಗಪಟ ಣಂ ಯ ರುಕು ವಲ ಮದ 1924ರ ಜೂ 3ರಂದು ಜದ ರು.
    avaru nāgapaṭṭaṇaṃ jilleya tirukkuvalay grāmadalli 1924ra jūn 3randu janisiddaru.
ML  1924ല ് നാഗപണം ജിയിെല തിരുുവൈള ഗാമിലാണ് അേഹം ജനിത്
    1924l nāgapaṭṭaṇaṃ jillayile tirukkuvaḷai grāmattilāṇ addēham janiccat.
TA  நாகபன மாவட வைள ராம அவ 1924-ஆ ஆ ஜூ மாத 3-ஆ ேத றதா.
    nāgappaṭṭinaṃ māvaṭṭaṃ tirukkuvaḷaik kirāmattil avar 1924-ām āṇṭu jūn mātam 3-ām tēti pirantār.
TE  ఆయన గపటణం ౖ మం 1924 3న జం.
    āyana nāgapaṭṭaṇaṃ jillā tirukkuvālai grāmanlō 1924 jūn 3na janmincāru.

Table 1: Example sentence in English along with its translation and transliteration in the four Dravidian languages.

2 Previous Work

2.1 Translation Systems

Statistical Machine Translation  One of the earliest automatic translation systems for English into a Dravidian language was the English→Tamil system by Germann (2001): a hybrid rule-based/statistical machine translation system trained on only 5k English-Tamil parallel sentences. Ramasamy et al. (2012) created SMT systems (phrase-based and hierarchical) which were trained on a dataset of 190k parallel sentences (henceforth referred to as UFAL). They also reported that applying pre-processing steps involving morphological rules based on Tamil suffixes improved the BLEU score of the baseline model to a small extent (from 9.42 to 9.77). For the Indic languages multilingual tasks of WAT-2018, the phrase-based SMT system of Ojha et al. (2018) achieved a BLEU score of 30.53.

Subsequent papers also focused on SMT systems for Malayalam and Telugu, with notable work including Anto and Nisha (2016) and Sreelekha and Bhattacharyya (2017, 2018) for Malayalam, and Lingam et al. (2014) and Yadav and Lingam (2017) for Telugu.

Neural Machine Translation  On the neural machine translation (NMT) side, there have been a handful of NMT systems trained on English→Tamil. On the aforementioned Indic languages multilingual tasks of WAT-2018, Sen et al. (2018) and Dabre et al. (2018) reported only 11.88 and 18.60 BLEU scores, respectively, for English→Tamil. The poor performance of these systems compared to the 30.53 BLEU score of the SMT system (Ojha et al., 2018) showed that those NMT systems were not yet suitable for translating into the morphologically rich Tamil.

However, the following year, Philip et al. (2019) outperformed Ramasamy et al. (2012) on the UFAL dataset with a BLEU score of 13.05 (the previous best score on this test set was 9.77). They report that techniques such as domain adaptation and back-translation can make training NMT systems on low-resource languages possible. Similar findings were also reported by Ramesh et al. (2020) for Tamil and by Dandapat and Federmann (2018) for Telugu.

To the best of our knowledge and as of 2021, there has not been any scientific publication involving translation to and from Kannada, except for Chakravarthi et al. (2019). One possible reason for this could be the fact that sizeable corpora involving Kannada (i.e. on the order of at least a thousand sentences) have been readily available only since 2019, with the release of the JW300 corpus (Agić and Vulić, 2019).

Multilingual NMT  Since 2018, several studies have presented multilingual NMT systems that can handle English→{Malayalam, Tamil, Telugu} translation (Dabre et al., 2018; Choudhary et al., 2020; Ojha et al., 2018; Sen et al., 2018; Yu et al., 2020; Dabre and Chakrabarty, 2020). In particular, Sen et al. (2018) presented results where the BLEU score improved when moving from bilingual to multilingual models. Conversely, Yu et al. (2020) found that NMT systems that were multi-way (Indic↔Indic) performed worse than English↔Indic systems.

To our knowledge, no work so far has explored the effect of the segmentation algorithm and dictionary size on the four languages Kannada, Malayalam, Tamil and Telugu.

3 Subword Segmentation Techniques

Prior to the emergence of subword segmenters, translation systems were plagued by the issue of out-of-vocabulary (OOV) tokens. This was particularly an issue for translations involving agglutinative languages such as Turkish (Ataman and Federico, 2018) or Malayalam (Manohar et al., 2020). Various segmentation algorithms were brought forward to circumvent this issue and, in turn, improve translation quality.

Perhaps the most widely used algorithm in NMT to date is the language-agnostic Byte Pair Encoding (BPE) by Sennrich et al. (2016). Initially proposed by Gage (1994) as a compression algorithm, BPE was repurposed by Sennrich et al. (2016) for the task of subword segmentation. It is based on a simple principle whereby pairs of character sequences that are frequently observed in a corpus get merged iteratively until a predetermined dictionary size is attained. In this paper we use a popular implementation of BPE, called SentencePiece (SP) (Kudo and Richardson, 2018).

While purely statistical algorithms are able to segment any token into smaller segments, there is no guarantee that the resulting segments correspond to linguistically meaningful units. To address this, Ataman et al. (2017) proposed a modification of Morfessor FlatCat (Grönroos et al., 2014), called Linguistically Motivated Vocabulary Reduction (LMVR). Specifically, LMVR imposes an extra condition on the cost function of Morfessor FlatCat so as to favour vocabularies of the desired size. In a comparison of LMVR to BPE, Ataman et al. (2017) reported a +2.3 BLEU improvement on the English-Turkish translation task of WMT18.

Given the encouraging results reported on the agglutinative Turkish language, we hypothesise that translation into Dravidian languages may also benefit from a linguistically motivated segmenter, and evaluate LMVR against SP across varying vocabulary sizes.

4 Experimental Setup

4.1 Training Corpora

The parallel training data is mostly taken from the datasets available for the MultiIndicMT task from WAT 2021.

Name           Domain     KN   ML   TA   TE
Bible          Religion   18    1    –   14
ELRC           COVID-19    –   <1   <1   <1
GNOME          Technical  <1   <1   <1   <1
JW300          Religion   70   45   52   45
KDE            Technical   1   <1   <1   <1
NLPC           General     –    –   <1    –
OpenSubtitles  Cinema      –   26    3    3
CVIT-PIB       Press       –    5   10   10
PMIndia        Politics   10    4    3    8
Tanzil         Religion    –   18    9    –
Tatoeba        General    <1   <1   <1   <1
Ted2020        General    <1   <1   <1    1
TICO-19        COVID-19    –    –   <1    –
Ubuntu         Technical  <1   <1   <1   <1
UFAL           Mixed       –    –   11    –
Wikimatrix     General     –   <1   10   18
Wikititles     General     1    –    –    –

Table 2: Composition of training corpora. The numbers indicate the relative size (in percentages) of the corresponding part for that language; "–" indicates the corpus is not available for that language.
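The BPE principle described in Section 3 (frequent adjacent symbol pairs are merged iteratively until a predetermined dictionary size is attained) can be made concrete with a toy sketch. This is our own minimal re-implementation for illustration only, not the SentencePiece code; the corpus and target vocabulary size are invented for the example.

```python
# Toy BPE learner: repeatedly merge the most frequent adjacent symbol pair
# until the vocabulary reaches the target size. Illustrative only.
from collections import Counter

def learn_bpe(corpus, target_vocab_size):
    # Each word starts as a sequence of single characters.
    words = Counter(tuple(w) for w in corpus)
    vocab = {ch for w in words for ch in w}
    merges = []
    while len(vocab) < target_vocab_size:
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Re-segment every word with the new merge applied.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, vocab

# Invented example corpus; 10 initial characters, so 2 merges are learned.
corpus = ["low", "lower", "lowest", "newer", "wider"]
merges, vocab = learn_bpe(corpus, target_vocab_size=12)
```

Because merges are stored in order, an unseen word can be segmented by replaying them, so a word such as "lowest" decomposes into known subwords ("low", "e", "s", "t") rather than becoming an OOV token.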
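The second research question concerns the subword dictionary size. Its effect on segmentation granularity can be sketched as follows: a small vocabulary (few merges) splits words into short, frequent pieces, while a larger vocabulary (more merges) produces longer, more word-like units. The merge lists below are hypothetical and not learned from our training corpora.

```python
# Apply an ordered list of BPE merges to a single word (illustrative sketch).
def segment(word, merges):
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge lists standing in for a small vs. a large vocabulary.
small = [("e", "r")]
large = [("e", "r"), ("l", "o"), ("lo", "w"), ("low", "er")]

print(segment("lower", small))  # ['l', 'o', 'w', 'er']
print(segment("lower", large))  # ['lower']
```

With the larger merge list the word is kept whole, mirroring the tendency of bigger subword vocabularies to leave frequent words unsegmented.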