Using Interlinear Glosses as Pivot in Low-Resource Multilingual Machine Translation

Zhong Zhou, Lori Levin, David R. Mortensen, Alex Waibel
Carnegie Mellon University; Alex Waibel is also affiliated with the Karlsruhe Institute of Technology

arXiv:1911.02709v3 [cs.CL] 3 Mar 2020

Abstract

We demonstrate a new approach to Neural Machine Translation (NMT) for low-resource languages using a ubiquitous linguistic resource, Interlinear Glossed Text (IGT). IGT represents a non-English sentence as a sequence of English lemmas and morpheme labels. As such, it can serve as a pivot or interlingua for NMT. Our contribution is four-fold. Firstly, we pool IGT for 1,497 languages in ODIN (54,545 glosses) and 70,918 glosses in Arapaho and train a gloss-to-target NMT system from IGT to English, with a BLEU score of 25.94. We introduce a multilingual NMT model that tags all glossed text with gloss-source language tags and train a universal system with shared attention across 1,497 languages. Secondly, we use the IGT gloss-to-target translation as a key step in an English-Turkish MT system trained on only 865 lines from ODIN. Thirdly, we present five metrics for evaluating extremely low-resource translation when BLEU is no longer sufficient, and evaluate the Turkish low-resource system using BLEU and also using accuracy of matching nouns, verbs, agreement, tense, and spurious repetition, showing large improvements.

[Figure 1: Multilingual NMT.]

[Figure 2: Using interlinear glosses as pivot to translate in multilingual NMT in Hmong, Chinese, and German.]

© 2020 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

1 Introduction

Machine polyglotism, training a universal NMT system with a shared attention mechanism through implicit parameter sharing, is very helpful in low-resource settings (Firat et al. (2016); Zoph and Knight (2016); Dong et al. (2015); Gillick et al. (2016); Al-Rfou et al. (2013); Zhou et al. (2018); Tsvetkov et al. (2016)). However, there is still a large disparity between translation quality in high-resource and low-resource settings, even when the model is well-tuned (Koehn and Knowles (2017); Sennrich and Zhang (2019); Nordhoff et al. (2013)). Indeed, there is room for creativity in low-resource scenarios.

Morphological analysis is useful in reducing word sparsity in low-resource languages (Habash and Sadat (2006); Lee (2004); Hajič (2000)). Toward that end, we leverage a linguistic resource, Interlinear Glossed Text (IGT) (Lehmann and Croft (2015)), as shown in Table 2 (Samardzic et al. (2015); Moeller and Hulden (2018)).

Data                                Example
Source language (German)            Ich sah ihm den Film gefallen.
Interlinear gloss w/ target-lemma   I saw he.DAT the film.ACC like.
Target language (English)           I saw him like the film.
Source language (Hmong)             Nwg yeej qhuas nwg.
Interlinear gloss w/ target-lemma   3SG always praise 3SG.
Target language (English)           He always praises himself.
Source language (Arapaho)           Niine’etii3i’ teesiihi’ coo’oteyou’u Hohootino’ nenee3i’ neeyeicii.
Interlinear gloss w/ target-lemma   live(at)-3PL on/over-ADV IC.hill(y)-0.PL tree-NA.PL IC.it is-3PL timber.
Target language (English)           Trees make the woods.

Table 1: Examples of interlinear glosses in different source languages.
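As a concrete illustration of the data setup described in the abstract, the sketch below turns IGT entries like the rows of Table 1 into gloss-to-English training pairs and prefixes each gloss with a gloss-source language tag, in the spirit of the language-token trick for universal multilingual NMT (Johnson et al. (2017)). This is a minimal sketch under our own assumptions; the record layout, the language codes, and the "<lang>" tag format are illustrative, not the authors' released preprocessing.

    # Minimal sketch: build tagged gloss-to-English training pairs from IGT triples.
    # The record layout and the "<lang>" tag format are illustrative assumptions,
    # not the paper's actual preprocessing.

    from typing import Iterable, List, Tuple

    # Each IGT record: (language_code, gloss_line, english_translation),
    # e.g. the Hmong row of Table 1.
    IGTRecord = Tuple[str, str, str]

    def make_training_pairs(records: Iterable[IGTRecord]) -> List[Tuple[str, str]]:
        """Turn IGT records into (source, target) pairs for a gloss-to-target NMT system.

        The gloss side is prefixed with a gloss-source language tag so that a single
        universal model with shared attention can condition on the language the
        gloss came from.
        """
        pairs = []
        for lang, gloss, translation in records:
            tagged_gloss = f"<{lang}> {gloss.strip()}"
            pairs.append((tagged_gloss, translation.strip()))
        return pairs

    if __name__ == "__main__":
        demo = [
            ("hmn", "3SG always praise 3SG .", "He always praises himself."),
            ("deu", "I saw he.DAT the film.ACC like .", "I saw him like the film."),
        ]
        for src, tgt in make_training_pairs(demo):
            print(src, "\t", tgt)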
We propose to use interlinear glosses as a pivot to address the harder problem of morphological complexity, a source of data sparsity, in multilingual NMT. We combine the benefits of both multilingual NMT and linguistic information through the use of interlinear glosses as a pivot representation. Our contribution is four-fold.

1. We present our multilingual model using a single attention mechanism to translate from interlinear glosses into a target language. Our best multilingual NMT result achieves a BLEU score of 25.94, +5.65 above a single-source single-target NMT baseline.

2. We present two linguistic datasets that we normalized, cleaned and filtered: the cleaned ODIN dataset includes 54,545 lines of IGT in 1,496 languages (Lewis and Xia (2010); Xia et al. (2014)), and the cleaned Arapaho dataset includes 70,918 lines of IGT (Cowell and O'Gorman (2012); Wagner et al. (2016)).

3. We present a three-step approach for extremely low-resource translation. We demonstrate it by training only on 865 lines of data. A minimal illustrative sketch of this pipeline follows Table 3 below.
   (a) We use a morphological analyzer to automatically generate interlinear glosses with source lemmas.
   (b) We translate the source lemmas into target lemmas through alignments trained from parallel data.
   (c) We translate from interlinear glosses with target lemmas into the target language by using the gloss-to-target multilingual NMT developed in contribution 1 above.

4. We present five metrics for evaluating extremely low-resource translation when BLEU no longer suffices. Our system using interlinear glosses achieves an improvement of +44.44% in Noun-Verb Agreement, raising fluency.

We present our cleaned data, followed by gloss-to-target models and our three-step Turkish-English NMT, in Sections 3 and 4. We evaluate in Section 5.

Data                                     Example
1. Source language (Turkish)             Kadin dans ediyor.
2. Interlinear gloss with source-lemma   Kadin.NOM dance ediyor-AOR.3.SG.
3. Interlinear gloss with target-lemma   Woman.NOM dance do-AOR.3.SG.
4. Target language (English)             The woman dances.
1. Source language (Turkish)             Adam kadin-i gör-dü.
2. Interlinear gloss with source-lemma   Adam.NOM kadin-ACC gör-AOR.3.SG.
3. Interlinear gloss with target-lemma   Man.NOM woman-ACC see-PST.3.SG.
4. Target language (English)             The man saw the woman.

Table 2: Examples of the translation sequence using interlinear glosses.

Notation   Meaning in the translation sequence
1          Source language (Turkish) text
2          Interlinear gloss with source-lemma
3          Interlinear gloss with target-lemma
4          Target language (English) text

Table 3: Notation used in the translation sequence.
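The sketch below is a schematic of the three-step pipeline of contribution 3, following the 1 -> 2 -> 3 -> 4 notation of Table 3. It is a minimal sketch under stated assumptions: the morphological analyzer, the lemma lexicon, and the gloss-to-target decoder are passed in as placeholder callables, and the "lemma-LABELS" / "lemma.LABELS" token format is assumed for illustration only, not taken from the paper's implementation.

    # Schematic of the three-step gloss-pivot pipeline (contribution 3), following
    # the 1 -> 2 -> 3 -> 4 notation of Table 3. The analyzer, lemma lexicon, and
    # NMT decode function are illustrative stand-ins, not the paper's components.

    from typing import Callable, Dict

    def map_lemmas(source_gloss: str, lemma_lexicon: Dict[str, str]) -> str:
        """Step (b): replace source lemmas with target (English) lemmas while
        keeping morpheme labels such as NOM, ACC, AOR.3.SG unchanged."""
        mapped = []
        for token in source_gloss.split():
            # A gloss token is assumed to look like "lemma", "lemma-LABELS",
            # or "lemma.LABELS"; only the lemma part is mapped.
            for sep in ("-", "."):
                if sep in token:
                    lemma, labels = token.split(sep, 1)
                    if lemma.lower() in lemma_lexicon:
                        token = lemma_lexicon[lemma.lower()] + sep + labels
                    break
            else:
                token = lemma_lexicon.get(token.lower(), token)
            mapped.append(token)
        return " ".join(mapped)

    def translate(sentence: str,
                  analyze: Callable[[str], str],
                  lemma_lexicon: Dict[str, str],
                  gloss_to_target: Callable[[str], str]) -> str:
        """1 source text -> 2 gloss w/ source lemmas -> 3 gloss w/ target lemmas
        -> 4 target text."""
        source_gloss = analyze(sentence)                        # step (a)
        target_gloss = map_lemmas(source_gloss, lemma_lexicon)  # step (b)
        return gloss_to_target(target_gloss)                    # step (c)

    if __name__ == "__main__":
        # Lemma-mapping step on the first Table 2 example; reproduces row 3.
        lexicon = {"kadin": "Woman", "ediyor": "do"}
        print(map_lemmas("Kadin.NOM dance ediyor-AOR.3.SG.", lexicon))
        # -> Woman.NOM dance do-AOR.3.SG.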
2 Related Works

2.1 Multilingual Neural Machine Translation

Multilingual NMT's objective is to translate from any of N input languages to any of M output languages (Firat et al. (2016); Zoph and Knight (2016); Dong et al. (2015); Gillick et al. (2016); Al-Rfou et al. (2013); Tsvetkov et al. (2016)). Many multilingual NMT systems work on a universal model with a shared attention mechanism and Byte-Pair Encoding (BPE) (Johnson et al. (2017); Ha et al. (2016); Zhou et al. (2018)). Its simplicity and implicit parameter sharing help with low-resource translation and zero-shot translation (Johnson et al. (2017); Firat et al. (2016)).

2.2 Morpheme-Level Machine Translation

To build robustness (Chaudhary et al. (2018); Cotterell and Schütze (2015); Wu et al. (2016)), researchers work on character-level, byte-level (Gillick et al. (2016); Ling et al. (2015); Chung et al. (2016); Tiedemann (2012)), and BPE-level (Sennrich et al. (2016); Burlot et al. (2017)) translation. Morpheme-level translation allows words to share embeddings while allowing variation in meanings (Cotterell and Schütze (2015); Chaudhary et al. (2018); Renduchintala et al. (2019); Passban et al. (2018); Dalvi et al. (2017)), shrinks the vocabulary size, introduces smoothing (Goldwater and McClosky (2005)), and makes fine-grained corrections.

... language (Koehn and Hoang (2007); Yeniterzi and Oflazer (2010)). In the era of NMT, morphological information and grammatical decomposition produced by a morphological analyzer are employed (García-Martínez et al. (2016); Burlot et al. (2017); Hokamp (2017)).

2.4 Interlinear Gloss Generation

Interlinear gloss is a linguistic representation of morphosyntactic categories and cross-linguistic lexical relations (Samardzic et al. (2015); Moeller and Hulden (2018)). IGT is used in linguistic publications and field notes to communicate technical facts about languages that the reader might not speak, or to convey a particular linguistic analysis to the reader. Typically, there are three lines to an IGT. The first line consists of text segments in an object language, which we call the source language in this paper. The third line is a fluent translation in the metalanguage, which we call the target language. In our work, the target language (metalanguage) is always English. The source (object) languages are the 1,496 languages of the ODIN database (Lewis and Xia (2010); Xia et al. (2014)) plus Arapaho, for which a large collection of field notes has been published (Cowell and O'Gorman (2012)).
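To make the three-line layout described in Section 2.4 concrete, here is a minimal sketch that reads plain-text IGT records into (source, gloss, translation) triples. It assumes blank-line-separated records of exactly three lines; real ODIN and Arapaho field-note records carry language codes, citations, and alignment markup, so this is illustrative only.

    # Minimal sketch: parse a plain three-line IGT record into a structured triple.
    # Assumes the simplest possible layout: blank-line-separated blocks of exactly
    # three lines (object-language text, gloss, metalanguage translation).

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class IGT:
        source: str       # line 1: object-language (source) text
        gloss: str        # line 2: interlinear gloss
        translation: str  # line 3: metalanguage (target) translation, English here

    def parse_igt_blocks(text: str) -> List[IGT]:
        """Split raw text into blank-line-separated blocks and keep those that
        contain exactly three non-empty lines."""
        records = []
        for block in text.strip().split("\n\n"):
            lines = [line.strip() for line in block.splitlines() if line.strip()]
            if len(lines) == 3:
                records.append(IGT(*lines))
        return records

    if __name__ == "__main__":
        raw = """Nwg yeej qhuas nwg.
    3SG always praise 3SG.
    He always praises himself."""
        for rec in parse_igt_blocks(raw):
            print(rec.gloss, "->", rec.translation)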