An Exploration of Data Augmentation Techniques for Improving English to Tigrinya Translation

Lidia Kidane♣ Sachin Kumar♦ Yulia Tsvetkov♦ ♣African Institute for Mathematical Sciences, Kigali, Rwanda ♦Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA [email protected], [email protected], [email protected]

Abstract 2018c) has been shown to perform well under low data conditions but is challenging to develop for It has been shown that the performance of neu- ral machine translation (NMT) drops starkly in Tigrinya owing to its complex morphological struc- low-resource conditions, often requiring large ture (Keleta et al., 2016). For many low-resource amounts of auxiliary data to achieve compet- languages, this challenge has led to various propos- itive results. An effective method of generat- als for leveraging monolingual data that exist in ei- ing auxiliary data is back-translation of target ther or both source and target languages, which are language sentences. In this work, we present usually more abundant. Prior approaches include a case study of Tigrinya where we investigate self-training (Imamura and Sumita, 2018), transfer several back-translation methods to generate learning (Zoph et al., 2016) and data-augmentation synthetic source sentences. We find that in low-resource conditions, back-translation by techniques like forward translation (Zhang and pivoting through a higher-resource language Zong, 2016) and back-translation (Sennrich et al., related to the target language, proves most ef- 2016a). fective resulting in substantial improvements over baselines. Back-translation has been used in current state- of-the-art NMT systems, outperforming other ap- 1 Introduction proaches in high resource languages (Ng et al., 2019) and improving performance in low resource Tigrinya is a Semitic language spoken by around 8 conditions (Hoang et al., 2018). The approach million people in the African countries of , involves training a target-to-source (backward) accounting for more than half of its population, model on the available parallel data and using that and where it is used as informal lingua model to generate synthetic translations of a large franca. However, over 60% of the internet’s con- number of monolingual sentences in the target lan- tent is in English, while Tigrinya, for example, ac- guage. The available authentic parallel data is then counts for less than 0.1% of it (W3Techs, 2020). mixed with the generated synthetic parallel data With 40% of the Tigrinya speakers being mono- without differentiating between the two (Sennrich lingual1, this essentially locks away the majority et al., 2016a) to train a final source-to-target (for- of the internet content for them. Availability of ward) model. However, in low resource scenarios, machine translation systems capable for translating arXiv:2103.16789v2 [cs.CL] 4 Apr 2021 the authentic parallel data available is not sufficient English to Tigrinya and vice-versa is an impera- to train a backward model that will generate quality tive for its speakers to be able to function in an synthetic data. increasingly-global online world. Despite recent advances in neural machine trans- In this work, we explore this setting in de- lation (Bahdanau et al., 2015; Vaswani et al., 2017), tail. Combining techniques from transfer learn- such systems are difficult to develop for many ing and back-translation, we propose several data- African languages including Tigrinya primarily due augmentation strategies to improve English-to- to the lack of large amounts of high quality paral- Tigrinya translation. In our experiments, we show lel data. Phrase-based statistical machine transla- that leveraging —a higher resource lan- tion (PBSMT) (Koehn et al., 2003; Lample et al., guage closely related to Tigrinya—for data aug- 1https://african-languages.com/ mentation, gives improvements of up to +7 BLEU tigrinya-language/ points over baselines. 2 Background and Methods one to translate TGT→REL and another to translate REL→SRC. For the former, depending on avail- We first formalize the task setup. Given a source able parallel and monolingual resources in TGT language (SRC), a target language (TGT) and a and REL, the TGT→REL model can be trained ei- typologically related language of TGT, REL, our ther in (1) a supervised manner (we refer to this goal is train a model f(·; θ) which takes a SRC setting as BT-PIVOT-SUP), or (2) in an unsuper- sentence x as input and generates its translation, vised manner (Lample et al., 2018d)(BT-PIVOT- ˆy = f(x; θ). Here, θ are learnable parameters TGT UNSUP). The latter is trained with more easily of f. We are given sentence aligned SRC–TGT, available SRC–REL parallel data. To backtranslate SRC–REL, REL–TGT parallel corpora, and mono- a given TGT sentence, we first translate it to REL lingual corpora in REL and TGT. using the TGT→REL model, and then to SRC using In this work, we use transformer based encoder- the REL→SRC model. decoder models (Vaswani et al., 2017) as f, We assume that the parallel SRC–TGT corpus is 3 Experimental Setup small, which makes training f challenging (Sen- nrich and Zhang, 2019). We now describe ways Datasets We evaluate our methods with English of leveraging the available monolingual data in (EN), Tigrinya (TI) and Amharic (AN) as SRC, TGT TGT and resources in REL to generate synthetic and REL. Both TI and AM are Ge’ez-scripted SRC→TGT sentences which can be augmented with and have considerable morpho- the authentic SRC–TGT corpus to improve the gen- logical and lexical similarity (Feleke, 2017). The eration quality of f. EN–TI and AM–TI parallel data are taken from Opus (JW300) (Tiedemann, 2012) and consist of BT-DIRECT: TGT→SRC This is most common 300K and 36K sentence pairs respectively contain- way to create synthetic parallel data by translat- ing text from religious domain. The EN–AM data ing TGT monolingual data to SRC (Sennrich et al., consists of a total of 900K sentence pairs taken 2016a). The backward model TGT→SRC is trained from Opus (JW300) and Teferra Abate et al.(2018) using the available SRC-TGT parallel data. While (News domain). After deduplication, we created this provides a natural way to utilize monolingual dev/test sets of 2K sentences each for both language data, when the parallel data is scarce, the backward pairs (EN-TI and EN-AM) by randomly sampling model’s quality is as limited as the vanilla forward from the JW300 corpora. We use the remaining model. This results in poor quality synthetic data sentences as training set. To train unsupervised MT which is detrimental, as we show in our experi- models, we use the REL and TGT size of the parallel ments. Hoang et al.(2018) proposed an iterative corpora as the monolingual corpus. To create syn- BT to alleviate this issue, but this technique re- thetic parallel data by back-translation, we create a quires multiple rounds of retraining models in both monolingual Tigrinya corpus by crawling sentences directions which are slow and expensive. from the official website of Eritrean Ministry of In- formation2. After cleaning and deduplication, we BT-INDIRECT: REL→SRC We train a get a corpus with 100K sentences. REL→SRC model using more abundant REL-SRC parallel data and use this model to translate Implementation and Evaluation We use a monolingual data in TGT to SRC. Given that transformer based encoder-decoder model to con- REL and TGT are closely related and written duct all our experiments (Vaswani et al., 2017). in the same script, this can serve as a proxy We use the BASE architecture which consists of back-translation model allowing transfer between 6 encoder and decoder layers with 8 attention the two languages. heads. We first tokenize all the sentences us- ing Moses (Koehn et al., 2007). For each lan- BT-PIVOT: TGT→REL→SRC Despite close- guage pair considered, we then tokenize the cor- ness of REL and TGT, back-translating TGT us- pora using a BPE (Sennrich et al., 2016b) model STD→SRC ing a model can result in noisy trans- trained on the concatenation of the parallel corpora lations which can hurt the final performance of with 32K merge operations. We use OpenNMT- SRC→TGT REL . Here, we exploit closeness of py toolkit (Klein et al., 2017) for all our exper- and TGT using the following method to generate synthetic SRC–TGT data. We train two models, 2https://www.shabait.com/ iments, with the hyperparameters recommended Method BLEU by Vaswani et al.(2017). We train all our super- UNSUP(SRC→TGT) 2.01 vised models (with or without data-augmentation) SUP(SRC→TGT) 7.4 for 200K steps with early stopping based on val- PIVOT:SUP(SRC→REL)+SUP(REL→TGT) 8.4 PIVOT:SUP(SRC→REL)+UNSUP(REL→TGT) 5.6 idation loss. Finally, we evaluate the generated BT-DIRECT 10.9 translations using the BLEU score (Papineni et al., BT-INDIRECT 6.2 2002)3. BT-PIVOT-SUP 11.54 BT-PIVOT-UNSUP 15.52 Baselines We compare the data-augmentation methods described in §2 with the following base- Table 1: BLEU scores obtained for EN-TI translation lines using different baselines and data-augmentation meth- ods described in §2 UNSUP(SRC→TGT) To evaluate the impact of available parallel data and feasibility of translating between unrelated languages SRC and TGT, we tables of the encoder and decoder with the aligned train an unsupervised NMT model to translate SRC embeddings and train the model parameters using to TGT using the available monolingual corpora autoencoding and iterative back-translation based only in the two languages. objectives as described in Lample et al.(2018a). SUP(SRC→TGT) Here, we train a supervised EN→TI model with available parallel data only. 4 Results and Analysis Pivot through REL Here, we train two transla- tion models, a SRC→REL model and REL →TGT The results are detailed in table1. We observe that model. Given a SRC test sentence, we first translate both unsupervised and supervised EN→TI models it to REL, which we then feed to the second model perform poorly owing to unrelatedness of English to generate text in TGT. We experiment with two and Tigrinya and scarcity of parallel data, respec- REL →TGT models leading to two baselines. First, tively. We get some performance improvement trained with parallel supervision, we call this base- by first translating EN to a related language first line (PIVOT:SUP(SRC→REL)+SUP(REL→TGT);), (Amharic in our case) and then translating it to TI. and second, trained in an unsuper- However, the gains diminish if we switch the vised manner, which we refer to as supervised AM→TI model with an unsupervised (PIVOT:SUP(SRC→REL)+UNSUP(REL →TGT)). one. We hypothesize this is due to small size of For unsupervised TGT↔ REL models, we use the monolingual corpora used to train this unsuper- the same architecture for this baseline and for BT- vised model. PIVOT as described above (since it can translate Next, using a simple TI→EN model directly to in both directions). We train this model based augment data to the parallel corpus (BT-DIRECT) on (Lample et al., 2018a) with their recommended also gives some improvement over the best per- settings4 with a few changes as follows: (1) For forming baseline (+2.3 BLEU). We conjecture that each word in TGT vocabulary, we find its neighbors while additional monolingual data for Tigrinya im- in the REL vocabulary within the Levenshtein dis- proves the decoder language model improving the tance of 2 (after removing -marking diacrit- translation fluency, the improvement is hampered ics). This resulted in a dictionary with 1,200 pairs. due to poor back-translations. (2) We train fasttext (Bojanowski et al., 2016) On the other hand, using a AM→EN model for embeddings for both corpora and then align them back-translating TI sentences results in a drop in with supervision (Lample et al., 2018b) from the performance likely due to very noisy examples be- created dictionary. (3) We initialize the embedding ing added to the training corpus.

3While we recognize the limitations of BLEU espe- We get further improvements as we consider cially for evaluating generations in morphologically rich lan- more sophisticated methods involving pivoting guages (Mathur et al., 2020), METEOR (Banerjee and Lavie, through Amharic (BT-PIVOT-SUP). We identify 2005) or embedding based metrics (Zhang* et al., 2020; Sel- lam et al., 2020) are simply not available for low resource two potential reasons: a strong AM→EN model languages like Tigrinya. on account of being trained with a larger parallel 4 We do not experiment with more sophisticated UNMT corpus, a strong TI→AM model owing to the sim- methods (Lample and Conneau, 2019) due to their high mono- lingual resource requirements which are not available for ei- ilarity between two languages despite the parallel ther Amharic or Tigrinya. TI →AM model being small. AM and TI such as named entities or prepositions. The shared vocabulary especially benefits pivoting based back-translation where they are just copied with the TI→AM model resulting in their perfect translations. The AM →EN model then is able to accurately translate it English (since it is trained on Figure 1: Examples of phrases that BT-PIVOT-UNSUP generates accurately with high frequency a larger parallel corpus). This is in contrast with di- rect TI→EN back-translation (BT-DIRECT) trained on a smaller parallel corpus, where these tokens often get mis-translated resulting in poorer final performance. Finally, we present selected examples where BT- PIVOT-UNSUP performs well and compare it with examples where it suffers (compared to the base- line model SUP(EN→TGT) (see figure2). We again observe in the examples that BT-PIVOT-UNSUP is good at generating named entities as well as tokens shared by Amharic and Tigrinya. We also note that BT-PIVOT-UNSUP fails to perform well when translating numerals (which have to be copied), of- ten hallucinating content. We attribute these errors to noise in the back-translated data and domain mismatch between the authentic parallel corpus (containing religious text) and the synthetic parallel corpus (containing government announcements). 5 Conclusion We present and compare different methods of gen- erating synthetic parallel data and evaluate their utility for data-augmentation for low-resource ma- chine translation. With extensive experiments on Figure 2: Selected examples of EN-TI translations gen- English to Tigrinya translation, we show when par- erated by SUP(EN→TI) and the best performing data allel corpora are limited, using higher-resource re- augmentation strategy (BT-PIVOT-UNSUP). lated languages to develop backtranslation models can lead to substantial boost in MT performance. Finally, we get the biggest jump over the best Acknowledgments performing baseline (+7.1 BLEU) on BT-PIVOT- UNSUP where we still pivot through Amharic but This work is supported by the National Science TI→AM model trained in an unsupervised manner. Foundation under Grants No. IIS2007960 and We conjecture that this strong performance is due IIS2040926, and by the Google faculty research to closeness of the two languages and availability award. of large monolingual corpora required for training the unsupervised models. References To understand the influence of Amharic-based data-augmentation on the model performance, we Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by look at the phrases in the test-set which are jointly learning to align and translate. 3rd Inter- most-frequently generated correctly by BT-PIVOT- national Conference on Learning Representations, UNSUP using compareMT (Neubig et al., 2019) to ICLR 2015 ; Conference date: 07-05-2015 Through extract the phrases. A small sample of such phrases 09-05-2015. is presented in figure1. We observe that a major- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: ity of the phrases contain words shared between An automatic metric for MT evaluation with im- proved correlation with human judgments. In Pro- Guillaume Lample and Alexis Conneau. 2019. Cross- ceedings of the ACL Workshop on Intrinsic and Ex- lingual language model pretraining. CoRR, trinsic Evaluation Measures for Machine Transla- abs/1901.07291. tion and/or Summarization, pages 65–72, Ann Ar- bor, Michigan. Association for Computational Lin- Guillaume Lample, Alexis Conneau, Ludovic Denoyer, guistics. and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. Piotr Bojanowski, Edouard Grave, Armand Joulin, In International Conference on Learning Represen- and Tomas Mikolov. 2016. Enriching word vec- tations (ICLR). tors with subword information. arXiv preprint arXiv:1607.04606. Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve´ Jegou.´ 2018b. Word translation without parallel data. In Interna- Tekabe Legesse Feleke. 2017. The similarity and mu- tional Conference on Learning Representations. tual intelligibility between Amharic and Tigrigna va- rieties. In Proceedings of the Fourth Workshop on Guillaume Lample, Myle Ott, Alexis Conneau, Lu- NLP for Similar Languages, Varieties and Dialects dovic Denoyer, and Marc’Aurelio Ranzato. 2018c. (VarDial), pages 47–54, Valencia, Spain. Associa- Phrase-based & neural unsupervised machine trans- tion for Computational Linguistics. lation. CoRR, abs/1804.07755.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Guillaume Lample, Myle Ott, Alexis Conneau, Lu- Haffari, and Trevor Cohn. 2018. Iterative back- dovic Denoyer, and Marc’Aurelio Ranzato. 2018d. translation for neural machine translation. In Pro- Phrase-based & neural unsupervised machine trans- ceedings of the 2nd Workshop on Neural Machine lation. In Proceedings of the 2018 Conference on Translation and Generation, pages 18–24, Mel- Empirical Methods in Natural Language Processing, bourne, Australia. Association for Computational pages 5039–5049, Brussels, Belgium. Association Linguistics. for Computational Linguistics.

Kenji Imamura and Eiichiro Sumita. 2018. NICT Nitika Mathur, Timothy Baldwin, and Trevor Cohn. self-training approach to neural machine translation 2020. Tangled up in BLEU: Reevaluating the eval- at NMT-2018. In Proceedings of the 2nd Work- uation of automatic machine translation evaluation shop on Neural Machine Translation and Genera- metrics. In Proceedings of the 58th Annual Meet- tion, pages 110–115, Melbourne, Australia. Associ- ing of the Association for Computational Linguistics, ation for Computational Linguistics. pages 4984–4997, Online. Association for Computa- tional Linguistics. Yemane Keleta, Kazuhide Yamamoto, and Ashuboda Marasinghe. 2016. Tigrinya part-of-speech tagging Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, with morphological patterns and the new nagaoka Danish Pruthi, and Xinyi Wang. 2019. compare-mt: tigrinya corpus. International Journal of Computer A tool for holistic comparison of language genera- Applications, 146:33–41. tion systems. In Proceedings of the 2019 Confer- ence of the North American Chapter of the Asso- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel- ciation for Computational Linguistics (Demonstra- lart, and Alexander Rush. 2017. OpenNMT: Open- tions), pages 35–41, Minneapolis, Minnesota. Asso- source toolkit for neural machine translation. In ciation for Computational Linguistics. Proceedings of ACL 2017, System Demonstrations, Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, pages 67–72, Vancouver, Canada. Association for Michael Auli, and Sergey Edunov. 2019. Facebook Computational Linguistics. FAIR’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Ma- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris chine Translation (Volume 2: Shared Task Papers, Callison-Burch, Marcello Federico, Nicola Bertoldi, Day 1), pages 314–319, Florence, Italy. Association Brooke Cowan, Wade Shen, Christine Moran, for Computational Linguistics. Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Kishore Papineni, Salim Roukos, Todd Ward, and Wei- source toolkit for statistical machine translation. In Jing Zhu. 2002. Bleu: a method for automatic eval- Proceedings of the 45th Annual Meeting of the As- uation of machine translation. In Proceedings of sociation for Computational Linguistics Companion the 40th Annual Meeting of the Association for Com- Volume Proceedings of the Demo and Poster Ses- putational Linguistics, pages 311–318, Philadelphia, sions, pages 177–180, Prague, Czech Republic. As- Pennsylvania, USA. Association for Computational sociation for Computational Linguistics. Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2003. Statistical phrase-based translation. In Pro- 2020. BLEURT: Learning robust metrics for text ceedings of HLT/NAACL, pages 48–54. Association generation. In Proceedings of the 58th Annual Meet- for Computational Linguistics. ing of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computa- Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. tional Linguistics. Weinberger, and Yoav Artzi. 2020. Bertscore: Eval- uating text generation with bert. In International Rico Sennrich, Barry Haddow, and Alexandra Birch. Conference on Learning Representations. 2016a. Improving neural machine translation mod- els with monolingual data. In Proceedings of the Barret Zoph, Deniz Yuret, Jonathan May, and Kevin 54th Annual Meeting of the Association for Compu- Knight. 2016. Transfer learning for low-resource tational Linguistics (Volume 1: Long Papers), pages neural machine translation. In Proceedings of the 86–96, Berlin, Germany. Association for Computa- 2016 Conference on Empirical Methods in Natu- tional Linguistics. ral Language Processing, pages 1568–1575, Austin, Texas. Association for Computational Linguistics. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715– 1725, Berlin, Germany. Association for Computa- tional Linguistics.

Rico Sennrich and Biao Zhang. 2019. Revisiting low- resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the As- sociation for Computational Linguistics, pages 211– 221, Florence, Italy. Association for Computational Linguistics.

Solomon Teferra Abate, Michael Melese, Martha Yi- firu Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem, Tewodros Abebe, Wondim- agegnhue Tsegaye, Amanuel Lemma, Tsegaye An- dargie, and Seifedin Shifaw. 2018. Parallel corpora for bi-directional statistical machine translation for seven Ethiopian language pairs. In Proceedings of the First Workshop on Linguistic Resources for Nat- ural Language Processing, pages 83–90, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jorg¨ Tiedemann. 2012. Parallel data, tools and inter- faces in OPUS. In Proceedings of the Eighth In- ternational Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems, volume 30, pages 5998–6008. Cur- ran Associates, Inc.

W3Techs. 2020. W3techs. usage of content lan- guages for websites. https://w3techs.com/ technologies/overview/content_language.

Jiajun Zhang and Chengqing Zong. 2016. Exploit- ing source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Process- ing, pages 1535–1545, Austin, Texas. Association for Computational Linguistics.