
Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Alberto Poncelas 1,2, Jan Buts 1,2, James Hadley 1,2, Andy Way 2

1 Trinity Centre for Literary and Cultural Translation, Trinity College Dublin, Ireland ([email protected], [email protected])
2 ADAPT Centre, School of Computing, Dublin City University, Ireland ([email protected])

Abstract

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages.

When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration.

In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto.

As an additional contribution, we have made available a set of English-Esperanto parallel data in the literary domain.

1 Introduction

In this paper, we use the constructed language Esperanto to illustrate potential improvements in the automatic translation of material from low-resource languages. Languages are considered low-resource when there is little textual material available in the form of electronically stored corpora. They pose significant challenges in the field of Machine Translation (MT), since it is difficult to build models that perform adequately using small amounts of data.

Multiple techniques have been developed to improve MT in conditions of data scarcity. A popular approach is to translate indirectly via a pivot language (Utiyama and Isahara, 2007; Firat et al., 2017; Liu et al., 2018; Poncelas et al., 2020a). Moreover, indirect translation can be used for creating additional training data. A further useful technique for expanding the dataset is back-translation (Sennrich et al., 2016a). This procedure consists of automatically translating a monolingual text from the target language into the selected source language, and then using the resulting parallel set as training data so the model benefits from this additional information. Although the quality of these sentence pairs is not as high as that of human-translated sentences (the source side contains mistakes produced by the MT system), the pairs are still useful when used as training data, because they do often improve the models (Poncelas et al., 2019a).

Nonetheless, for some languages, the available data are in such short supply that MT models used for generating back-translated sentences may produce a high proportion of noisy sentences. The use of noisy sentences for building MT models could ultimately have a negative impact on the quality of the MT system's outputs (Goutte et al., 2012), and therefore they are often removed (Khadivi and Ney, 2005; Taghipour et al., 2010; Popović and Poncelas, 2020).

We propose employing another technique to augment datasets: using the same set of sentences multiple times, but in slightly altered form each time. Specifically, we modify the sentences by using different Byte Pair Encoding (BPE) (Sennrich et al., 2016b) merge operations. We perform a fine-grained analysis, exploring the use of different splitting options on the source side, on the target side, and on both sides.

2 Previous work

This research is inspired by techniques for augmenting the training set artificially. One of these techniques is back-translation (Sennrich et al., 2016a), which involves creating artificial source-side sentences by translating a monolingual set in the target language.
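Since back-translation recurs throughout this paper as the main point of comparison, a purely illustrative sketch may help fix the direction of the procedure; it is not the implementation used in any of the cited works, and the translator is passed in as a hypothetical parameter.

```python
# Conceptual sketch of back-translation (Sennrich et al., 2016a):
# the target side is human-written, the source side is synthetic.
def build_backtranslated_corpus(monolingual_eo, translate_eo_to_en):
    """monolingual_eo: target-language sentences;
    translate_eo_to_en: any Esperanto-to-English MT system (hypothetical)."""
    corpus = []
    for target_sentence in monolingual_eo:
        source_sentence = translate_eo_to_en(target_sentence)  # may be noisy
        corpus.append((source_sentence, target_sentence))      # (src, trg)
    return corpus
```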

Similar techniques include the use of several models to generate sentences (Poncelas et al., 2019b; Soto et al., 2020), or the use of synthetic data on the target side (Chinea-Rios et al., 2017; Li et al., 2020).

A technique that involves multiple segmentation is subword regularization (Kudo, 2018), in which candidate sentences with different splits are sampled, either probabilistically or using a language model, for training.

In the work of Poncelas et al. (2020b), different splits are used to build an English-Thai MT model. As the Thai language does not use whitespace separation between words, different splits can be applied, to address the fact that all the words and sub-words are joined together in the final output.

More recently, Provilkov et al. (2020) introduced BPE-dropout, an improvement on standard BPE consisting of randomly dropping merges when training the model, such that a single word can have several segmentations.
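As a concrete point of reference, the sentencepiece library (not used in this paper, so purely illustrative) implements both of these related ideas: with sampling enabled, the same word or sentence receives a different segmentation on each call.

```python
# Illustration of subword regularization / BPE-dropout using the
# sentencepiece library; file names are hypothetical.
import sentencepiece as spm

# Train a small BPE model on any plain-text corpus file.
spm.SentencePieceTrainer.train(
    input="corpus.en.txt", model_prefix="bpe_en",
    vocab_size=8000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="bpe_en.model")

for _ in range(3):
    # enable_sampling activates stochastic segmentation (BPE-dropout
    # for BPE models); alpha controls the dropout probability.
    print(sp.encode("internationally", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```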
3 The Esperanto language

This article is concerned with improving MT models for Esperanto, the most successful constructed international language (Blanke, 2009). It was created in the late nineteenth century, and is said to be currently spoken by over 2 million people, spread across more than 100 countries (Eberhard et al., 2020). During its first century of development, Esperanto was principally maintained by means of membership-based organisations. Currently, internet applications such as Duolingo are supporting the wider spread of the language among new enthusiasts.

While many Esperanto speakers have sought to develop the language through translation, the body of work available, particularly in digital formats, remains relatively small, making Esperanto a clear example of a low-resource language.

Esperanto loosely derives its lexicon from several Indo-European languages, and shares some typological characteristics with, among others, Russian, English, and French (Parkvall, 2010). In contrast to most natural languages, Esperanto's most distinctive characteristic is its regularity. The grammar consists of a very limited set of operations, to which there are, in principle, no exceptions. Furthermore, the language is agglutinative, and its suffixes are independently meaningful and invariable. For instance, virino, the word for "woman", consists of the compound parts vir [adult human], in [female], and o [entity] (as the 'o' ending is used for all nouns). The word for "mother", patrino, largely refers to the same semantic categories, and is therefore structurally highly similar.

As a consequence of this internal consistency, Esperanto learners can quickly expand their vocabulary by learning to segment words into their various parts, which can then be used to construct new words by morphological analogy. Because of its affinity with many other languages, and because of the thoroughly logical composition of its vocabulary, Esperanto has historically been central to several experiments in MT, most notably regarding its potential function as a pivot language between European languages (Gobbo, 2015). In this study, however, we focus on automatic translation into Esperanto for its own sake.

4 Research Questions

We propose building MT models using training data composed of a dataset split into multiple variants with a different configuration of BPE, as presented in Figure 1. At the top of the figure, one can see that the same parallel set has been processed using BPE with 89,500, 50,000 and 10,000 operations (trained separately for each language). The MT model represented on the left has been built using the same dataset replicated three times, the only difference being that on the target side, different splits were implemented. Similarly, the MT model in the centre is built with different splits on the source side. The last model, represented on the right, combines different splits both on the source and the target side.

In order to evaluate the models, we use a test set that is split with a single BPE strategy (i.e. using 89,500 merge operations, the default proposed in the work of Sennrich et al. (2016b)). Therefore, using different merge operations on the source side of the training data may not have as big an impact as when they are applied to the target side (not all the words will match those in the test set). However, the addition of other BPE configurations could in principle still be useful to improve modeling for the source language.

In Section 5 we describe the settings of the MT systems and the data used for training. In Section 6 we analyze the results achieved by the baseline system.
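To make the replication procedure concrete, the following is a minimal sketch using the subword-nmt package that accompanies Sennrich et al. (2016b). The file names are illustrative, and the paper's exact pipeline may differ; the target side is shown, but the same steps apply to the source side.

```python
# Sketch: segment the same corpus with several numbers of BPE merge
# operations and concatenate the copies into one training file.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

merge_ops = [89500, 50000, 20000, 10000]
copies = []
for n in merge_ops:
    # Learn a separate BPE model with n merge operations.
    with codecs.open("train.eo", encoding="utf-8") as inp, \
         codecs.open(f"codes.{n}.eo", "w", encoding="utf-8") as out:
        learn_bpe(inp, out, num_symbols=n)
    # Apply it, producing one differently-split copy of the corpus.
    with codecs.open(f"codes.{n}.eo", encoding="utf-8") as codes:
        bpe = BPE(codes)
    with codecs.open("train.eo", encoding="utf-8") as inp:
        copies.append([bpe.process_line(line) for line in inp])

# Concatenating the copies yields the replicated training corpus.
with codecs.open("train.multi-bpe.eo", "w", encoding="utf-8") as out:
    for copy in copies:
        out.writelines(copy)
```

Applied on the source side, the target side, or both, this procedure yields the three configurations of Figure 1.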

This paper's experiments are divided into three sections. Each of these sections describes and also provides the evaluation of a model. The sections are the following:

• Combination of datasets with different merge operations on the target side (Section 7.1).

• Combination of datasets with different merge operations on the source side (Section 7.2).

• Combination of datasets with different merge operations on both the source and target side (Section 7.3).

In Section 8, we compare translation examples from the different models and analyze the different outcomes. Finally, in Section 9 we conclude and propose how these experiments could be expanded in future work.

5 Experimental Settings

The NMT systems we build are Transformer (Vaswani et al., 2017) models, based on OpenNMT (Klein et al., 2017). Models are trained for a maximum of 30K steps using the recommended parameters.1 We have selected the model with the lowest perplexity on the development set.

5.1 Dataset

For training the models we use the Tatoeba, GlobalVoices and bible-uedin (Christodouloupoulos and Steedman, 2015) datasets from the OPUS project.2 Our dataset thus contains material from the Bible, from news sources, and from less domain-specific multilingual translation examples. The sentences are randomly shuffled, after which 302,768 sentences are used as a training set and the other 1,000 as our dev set. All the sentences are tokenized and truecased.

BPE is applied using several merge operations. We use 89,500 operations as a starting point and explore other splits that produce smaller subword units (by using a lower number of merge operations). In our experiments we work with 50,000, 20,000 and 10,000 operations. We also concatenate the dev set using the same configuration of BPE.

5.2 Test Set

In order to evaluate the quality of the models, two test sets are translated. The test sets are the same for all models. In addition to tokenization and truecasing, we also use BPE with 89,500 merge operations. We do not use (or combine) other BPE configurations. The translations are evaluated using the BLEU (Papineni et al., 2002) metric.

The first test set is taken from the OPUS (Books) dataset (Tiedemann, 2012) (1,562 sentences). Specifically, the test set consists of material from two texts available in English and in Esperanto translation, namely Carroll's Alice's Adventures in Wonderland (Carroll and Kearney, 1865 (1910)) and Poe's The Fall of the House of Usher (Poe and Grobe, 1839 (2000)).3

The second test set (which contains 1,256 sentences) consists of an English and an Esperanto version of Oscar Wilde's Salomé (Wilde et al., 1891 (1894, 1910)), a play originally written in French.4 As an additional contribution to this paper, we have made a set of aligned sentences from the texts available via OPUS.5

Both test sets are in the literary domain, which is especially challenging (Toral and Way, 2018) for MT models. Not only do the test sets contain numerous personal names and uncommon vocabulary, they are also highly creative and, at times, experimental. For instance, in Alice's Adventures in Wonderland, grammatical and lexical principles are often challenged on purpose to portray a character's individual traits (e.g. the Mock Turtle sings of Beau–ootiful soo–oop!). In Salomé, characters regularly produce complex similes and metaphors to describe one another. The text is a variation on a religious theme, and heavily draws on Biblical imagery. While such material is highly challenging, the inclusion of Biblical matter in the training data may have a positive impact on the overall results.

6 Baseline MT

In Table 1 we present the models trained with the training data using different merge operations on the target side. The rows of the table correspond to the evaluation of the model, using the same data.

1 https://opennmt.net/OpenNMT-py/FAQ.html
2 http://opus.nlpl.eu/
3 https://farkastranslations.com/bilingual_books.php
4 https://en.wikisource.org/wiki/Salom%C3%A9 and http://www.gutenberg.org/ebooks/63064
5 http://opus.nlpl.eu/Salome-v1.php
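For reference, a minimal sketch of the corpus-level BLEU scoring described in Section 5.2, here using the sacrebleu package; the paper does not name its exact scorer, so this is an assumption and the numbers need not match the tables exactly. File names are illustrative.

```python
# Score a system's output (one sentence per line) against a reference.
import sacrebleu

with open("books.hyp.eo", encoding="utf-8") as f:   # system output
    hypotheses = [line.strip() for line in f]
with open("books.ref.eo", encoding="utf-8") as f:   # human reference
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```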

Figure 1: Diagram with the experiments.

The only difference is the number of BPE merge operations that have been used on the target side.

Traindata    Books   Salome
TRG89500     6.72    18.03
TRG50000     6.42    18.74
TRG20000     6.89    18.49
TRG10000     6.65    18.41

Table 1: BLEU scores of the Books and Salome test sets when translated using the Baseline MT.

As the test set is split using 89,500 merge operations, it would not be beneficial to apply BPE with merge operations other than 89,500 on the source side. In fact, when using BPE with 50,000, 20,000 and 10,000 operations on the source side, the BLEU score for the translation of the Books data is only 5.75, 5.70 and 5.76, respectively, and 14.30, 13.24, and 14.53 for Salome.

Table 1 shows that the four models achieve similar results. As mentioned before, the Books set contains complex grammatical and lexical constructions, which makes it more difficult to translate. This is also evidenced in the table, as BLEU scores of the Books set are lower than those of the Salome set. Moreover, there is no correlation between the number of merge operations and the performance. For example, we observe a small drop in the performance when decreasing the number of merge operations from 89,500 to 50,000, but the performance improves slightly when the number of operations is further decreased to 20,000.

7 Experiments

7.1 Different Merge Operations on the Target Side

In the first set of experiments we explore the models when the sentences in the parallel set are replicated by changing only the number of BPE merge operations used on the target side. We perform two sets of experiments: one where we keep the duplicates (sentences that remain the same after being split with different BPE configurations), and another where duplicates are removed.

In Table 2 we present the results of the models when trained with different concatenations of datasets. The first column specifies the datasets used in the training. For example, the row TRG89500 & TRG50000 indicates that the training set used for building the MT model consists of sentences split using 89,500 and 50,000 merge operations, respectively.

We mark in bold those scores that exceed 6.89 BLEU points, i.e. the maximum score achieved by the baseline models presented in Table 1. The scores receive an asterisk when the improvements are statistically significant at p=0.01. Statistical significance has been computed using Bootstrap Resampling (Koehn, 2004).
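A minimal sketch of paired Bootstrap Resampling (Koehn, 2004), the test behind the asterisks in the tables that follow: test-set indices are resampled with replacement, and we count how often the candidate system outscores the baseline. Scoring with sacrebleu is an assumption; any corpus-level BLEU implementation would do.

```python
import random
import sacrebleu

def paired_bootstrap(baseline, candidate, refs, n_samples=1000, seed=1):
    """Fraction of resamples in which `candidate` beats `baseline`."""
    random.seed(seed)
    wins = 0
    for _ in range(n_samples):
        # Draw a test set of the same size, with replacement.
        sample = random.choices(range(len(refs)), k=len(refs))
        sampled_refs = [[refs[i] for i in sample]]
        base = sacrebleu.corpus_bleu(
            [baseline[i] for i in sample], sampled_refs).score
        cand = sacrebleu.corpus_bleu(
            [candidate[i] for i in sample], sampled_refs).score
        wins += cand > base
    return wins / n_samples

# The candidate is significantly better at p=0.01 when it wins in at
# least 99% of the resamples.
```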

Traindata                                      Books   Salome

keep duplicates
TRG89500 & TRG50000                            6.60    18.25
TRG89500 & TRG20000                            6.89    18.49
TRG89500 & TRG10000                            7.79*   19.41*
TRG50000 & TRG20000                            7.17    17.96
TRG50000 & TRG10000                            6.94    18.91
TRG20000 & TRG10000                            7.48*   19.53*
TRG89500 & TRG50000 & TRG20000                 6.64    18.63
TRG89500 & TRG50000 & TRG10000                 7.38    18.64
TRG89500 & TRG20000 & TRG10000                 7.91*   19.29
TRG50000 & TRG20000 & TRG10000                 7.51*   18.86
TRG89500 & TRG50000 & TRG20000 & TRG10000      7.25    18.68

remove duplicates
TRG89500 & TRG50000                            7.47*   18.51
TRG89500 & TRG20000                            7.56*   19.08
TRG89500 & TRG10000                            7.98*   18.78
TRG50000 & TRG20000                            6.93    18.47
TRG50000 & TRG10000                            7.54*   18.62
TRG20000 & TRG10000                            7.79*   18.32
TRG89500 & TRG50000 & TRG20000                 7.22    18.40
TRG89500 & TRG50000 & TRG10000                 7.53*   18.99
TRG89500 & TRG20000 & TRG10000                 7.90*   18.76
TRG50000 & TRG20000 & TRG10000                 7.39    19.07
TRG89500 & TRG50000 & TRG20000 & TRG10000      7.75*   18.84

Table 2: Model performance using different merge operations on the target side.

In the table, we find that scores tend to be higher when duplicate sentence pairs are removed. By doing this the dataset is reduced by between 30% and 45%. In the second subtable, all the BLEU scores indicate improvements over the baseline, whereas in the first subtable some models, such as TRG89500 & TRG50000, have a lower score.

The best performance is seen when, for the multiple settings used, the number of merge operations differs greatly. For example, the highest scores are achieved when mixing 89,500 and 10,000 operations (i.e. the TRG89500 & TRG10000 rows in both subtables), the uppermost and the lowermost number of operations used in the experiments. The same principle holds true for those models built by combining three or four datasets.

7.2 Different Merge Operations on the Source Side

The next set of experiments explores the use of several merge operations on the source side. In this case, when combining the datasets, we ensure that the SRC89500 set is used, as the test set has been processed using 89,500 operations. We present the results in Table 3. Those scores that are higher than the baselines of Table 1 are marked in bold.

Our observations are similar to those obtained in Section 7.1. The best results are observed when the duplicate sentences are removed (between 25% and 40% of the sentences are removed) and the merge operation settings are the furthest apart (89,500 and 10,000).

Most of the models using several BPE configurations on the source side perform better than the baseline models. However, when compared to the experiments in the previous section (Table 2), the performance is lower.

7.3 Different Merge Operations on both Source and Target Side

The last set of experiments consists of building a model with data created using different splits both on the source and on the target side.

We perform experiments based on the outcomes observed in the previous section. Thus, two models are built. One combines the datasets split using BPE with 89,500 and 10,000 merge operations (both source and target side), and the other model combines the dataset with all the splits (i.e. 89,500, 50,000, 20,000 and 10,000).6

6 Note that we use all possible combinations. For example, the training set of the All model is built combining 4 x 4 = 16 datasets.
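As a sketch of what footnote 6 describes, the All training set pairs every source-side split with every target-side split; split_corpus() below is a hypothetical stand-in for the BPE step sketched in Section 4.

```python
# Sketch of building the "All" training set: every source split is
# paired with every target split, giving 4 x 4 = 16 corpus copies.
from itertools import product

def split_corpus(path, merges):
    # Hypothetical stand-in: would return the corpus at `path`
    # segmented with `merges` BPE merge operations (see Section 4).
    raise NotImplementedError

merge_ops = [89500, 50000, 20000, 10000]
pairs = []
for src_n, trg_n in product(merge_ops, merge_ops):
    src_copy = split_corpus("train.en", merges=src_n)
    trg_copy = split_corpus("train.eo", merges=trg_n)
    pairs.extend(zip(src_copy, trg_copy))

# As in the earlier experiments, duplicate sentence pairs are dropped;
# dict.fromkeys keeps one copy of each pair while preserving order.
pairs = list(dict.fromkeys(pairs))
```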

Traindata                                      Books   Salome

keep duplicates
SRC89500 & SRC50000                            7.19    19.07
SRC89500 & SRC20000                            7.06    18.46
SRC89500 & SRC10000                            7.21    19.28
SRC89500 & SRC50000 & SRC20000                 7.25    19.16
SRC89500 & SRC50000 & SRC10000                 7.09    19.15
SRC89500 & SRC20000 & SRC10000                 6.96    18.98
SRC89500 & SRC50000 & SRC20000 & SRC10000      6.79    18.08

remove duplicates
SRC89500 & SRC50000                            6.44    18.54
SRC89500 & SRC20000                            7.05    18.86
SRC89500 & SRC10000                            7.43*   19.35*
SRC89500 & SRC50000 & SRC20000                 7.42*   18.63
SRC89500 & SRC50000 & SRC10000                 6.66    19.37
SRC89500 & SRC20000 & SRC10000                 7.22    19.40*
SRC89500 & SRC50000 & SRC20000 & SRC10000      7.31    19.08

Table 3: Model performance using different merge operations on the source side.

Traindata                                      Books   Salome
SRC89500 & SRC10000 & TRG89500 & TRG10000      7.99*   17.78
All                                            8.11*   19.70*

Table 4: Model performance using different merge operations both on the source and target side.

The duplicates are removed, as this approach showed the best results.

We present the translation quality of the test sets using these models in Table 4. We see that the use of different splits on both the source and target sides tends to achieve the best results when compared both to baselines and to the experiments in the previous sections.7

7 We observed that the output tends to be more similar to the splits following the TRG89500 configuration.

8 Comparison of Outputs

In Table 5, we show some translation examples from the models that, as discussed in the previous sections, achieved the best performance. We mark in bold some important differences across the translations.

The first example, drawn from Alice in Wonderland, contains a joke. Alice, who is collecting her thoughts, aims to voice her opinion, and starts out by saying I don't think.... Before she can finish her sentence, however, the Mad Hatter interrupts her by stating that in that case, she should not speak. The human Esperanto translation makes this joke very explicit by repeating the emphasis on 'not thinking', whereas in English the transition is more subtle. Two of the systems, while differing in exact word order, succeed in reproducing the joke (TRG89500 and TRG89500 & TRG10000). In the other two models, either the crucial element do [so], which realises the inference, is omitted, or the meaning is mistakenly changed to a positive imperative: vi devus diri [you should say]. It can further be observed in the sentences that none of the systems translates the Hatter's name meaningfully. Either the name remains the same, or it is slightly altered from the original, in a seemingly random manner. Interestingly, Alice's name is adapted to Alico, which conforms to the rule that all Esperanto names end in -o (or, in some cases, -a), but the adaptation does not equal the human choice for Alicio.

The second example, also taken from Alice's Adventures in Wonderland, is concerned with a particular fixed expression in the English language: venture to say. The baseline system does not translate this mark of politeness, while the other models do provide varying translations (i.e. decidis, sukcesis and entreprenis, which correspond to the past tenses of the verbs to decide, to succeed and to undertake). While none of them is completely correct (when compared to the human translation), all of them are fairly transparent in context, and foreground different aspects of meaning contained in the English venture.

system                  sentence

source                  said Alice, very much confused, "I don't think–" "Then you shouldn't talk," said the Hatter.
reference               Alicio, tre konfuzite, respondis...: "mi ne pensas–" "se vi ne pensas, vi ne rajtas paroli," diris la Ĉapelisto.
TRG89500                Alico, tre konfuzita; mi ne pensas. "do vi ne parolu," diris la Hater.
SRC89500 & SRC10000     Alico... "vi ne parolu," diris la Hatar.
TRG89500 & TRG10000     "vi do ne parolu," diris la Hatter.
All                     "vi devus diri," diris la Hater.

source                  but she did not venture to say it out loud.
reference               sed tion ŝi ne kuraĝis diri laŭte.
TRG89500                sed ŝi ne diris tion laŭte.
SRC89500 & SRC10000     sed ŝi ne decidis diri tion laŭte.
TRG89500 & TRG10000     sed ŝi ne intence diri tion.
All                     sed ŝi ne entreprenis diri tion laŭte.

source                  it is like a knot of serpents coiled round thy neck.
reference               ili similas al fasko da nigraj serpentoj, kiun oni ĵetis ĉirkaŭ vian kolon.
TRG89500                ĝi similas nodon de turmentoj ĉirkaŭ via kolo.
SRC89500 & SRC10000     via kolo estas kiel nodo streĉata.
TRG89500 & TRG10000     estas kiel nodo da serpentoj ĉirkaŭ via kolo.
All                     ĝi estas kiel nodo de serpentoj ĉirkaŭ via kolo.

source                  it is like a pomegranate cut in twain with a knife of ivory.
reference               kiel granato, tranĉita per ebura tranĉilo.
TRG89500                kiel granato, kiu falas en du pecojn de eburo.
SRC89500 & SRC10000     kiel granato, kiu tranĉiĝis en du per tranĉilo de eburo.
TRG89500 & TRG10000     ĝi similas al granato, tranĉita en du kun tranĉilo de eburo.
All                     kiel granato eltranĉita en du kun tranĉilo ebura.

Table 5: Translation examples from the test set.

With reference to the Salome test set, we find in the entire translated text numerous small and relatively inconsequential vocabulary differences across systems (e.g. veston or mantelon for referring to a piece of clothing), as well as varying preferences for orthographically similar verb tenses (e.g. lacigis or lacigas, past and present tense of the verb to tire or wear out). At times, the systems differ in their translation of multi-word units such as sacred person, which is translated either as the literal sankta homo or as the more interpretative sanktulo [saint]. Overall, the systems perform well when translating the play's dense symbolism, as illustrated in Table 5.

The examples in the table are similes, which start with the explicit comparative phrase it is like. In the first example, the baseline system does not manage to reproduce the reference to serpentoj [snakes], although the mention of turmentoj [afflictions] does offer an interesting metaphorical perspective. The system SRC89500 & SRC10000 does not produce a correct translation, but those systems trained with different splits on the target side (i.e. the TRG89500 & TRG10000 and All systems) provide a remarkably good translation of the source. Similarly, in the last example included in the table, the baseline system fails to reproduce the meaning of the original (the knife falls apart instead of cutting the fruit), whereas all systems with multiple segmentation are successful in conveying a variant of the poetic image presented in the source text.

In short, the examples in Table 5 indicate that a combination of different merge operations may improve results for translation into Esperanto, a language for which limited resources are available. In a number of cases, the systems succeed in translating highly uncommon constructions in the context of humorous and poetic literary discourse.

9 Conclusion and Future Work

In this work, we have aimed to improve an English-Esperanto MT system by using multiple instances of the same sentence pair, split with different configurations of BPE.

In our experiments, the best performance tends to be achieved when splitting strategies are applied both on the source and target side, duplicate parallel sentences are removed, and the numbers of merge operations used are very different from each other. Overall, the best results are achieved when all the split-combinations are used on both sides.

Although the goal of these experiments is to find a technique to improve the MT models when the available data are very limited, this technique could also be applied in scenarios where data are abundant. It should be noted that Esperanto is perhaps a particularly suitable candidate for word-split methods, as the language's vocabulary consists of fixed chunks that are combined to form transparent compounds. However, the techniques applied here are in principle language-independent.

Finally, although we demonstrated that combining sentences with different merge operations improves the model, in this paper we could not determine the best configuration to use. Similarly, the test set that we used was processed using 89,500 merge operations. If the test set had been processed with a different BPE configuration the performance could have been different, especially when using models with different split configurations on the source side. Extensions of this work could involve finding an optimal configuration for achieving the best results, or testing the performance when combined with other word-splitting techniques.

Acknowledgments

This research has been supported by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106), and by the QuantiQual Project, generously funded by the Irish Research Council's COALESCE scheme (COALESCE/2019/117).

References

Detlev Blanke. 2009. Causes of the relative success of Esperanto. Language Problems and Language Planning, 33(3):251–266.

Lewis Carroll and E.L. Kearney, trans. 1865 (1910). Alice's Adventures in Wonderland (La aventuroj de Alicio en Mirlando). Project Gutenberg.

Mara Chinea-Rios, Alvaro Peris, and Francisco Casacuberta. 2017. Adapting neural machine translation with parallel synthetic data. In Proceedings of the Second Conference on Machine Translation, pages 138–147, Copenhagen, Denmark.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2020. Ethnologue: Languages of the World, twenty-third edition. SIL International, Dallas, TX, USA.

Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, and Yoshua Bengio. 2017. Multi-way, multilingual neural machine translation. Computer Speech & Language, 45:236–252.

Federico Gobbo. 2015. Machine translation as a complex system, and the phenomenon of Esperanto. Interdisciplinary Description of Complex Systems, 13(2):264–274.

Cyril Goutte, Marine Carpuat, and George Foster. 2012. The impact of sentence alignment errors on phrase-based machine translation performance. In Proceedings of Association for Machine Translation in the Americas, AMTA, San Diego, USA.

Shahram Khadivi and Hermann Ney. 2005. Automatic filtering of bilingual corpora for statistical machine translation. In International Conference on Application of Natural Language to Information Systems, pages 263–274.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 67–72, Vancouver, Canada.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia.

Hongzheng Li, Jiu Sha, and Can Shi. 2020. Revisiting back-translation for low-resource machine translation between Chinese and Vietnamese. IEEE Access, 8:119931–119939.

Chao-Hong Liu, Catarina Cruz Silva, Longyue Wang, and Andy Way. 2018. Pivot machine translation using Chinese as pivot language. In China Workshop on Machine Translation, pages 74–85, Wuyishan, China.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Mikael Parkvall. 2010. How European is Esperanto? A typological study. Language Problems and Language Planning, 34(1):63–79.

Edgar Allan Poe and Edwin Grobe, trans. 1839 (2000). The Fall of the House of Usher (La Falo De Uŝero-Domo). Project Gutenberg.

Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. 2019a. Adaptation of machine translation models with back-translated data using transductive data selection methods. In 20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France.

Alberto Poncelas, Pintu Lohar, Andy Way, and James Hadley. 2020a. The impact of indirect machine translation on sentiment classification. In Proceedings of Association for Machine Translation in the Americas, AMTA, pages 78–88, Orlando, Florida.

Alberto Poncelas, Wichaya Pidchamook, Chao-Hong Liu, James Hadley, and Andy Way. 2020b. Multiple segmentations of Thai sentences for neural machine translation. In Proceedings of the 1st Joint Spoken Language Technologies for Under-resourced Languages and Collaboration and Computing for Under-Resourced Languages Workshop, SLTU-CCURL, pages 240–244, Marseille, France.

Alberto Poncelas, Maja Popović, Dimitar Shterionov, Gideon Maillette de Buy Wenniger, and Andy Way. 2019b. Combining SMT and NMT back-translated data for efficient NMT. In Proceedings of Recent Advances in Natural Language Processing (RANLP), pages 922–931, Varna, Bulgaria.

Maja Popović and Alberto Poncelas. 2020. Extracting correctly aligned segments from unclean parallel data using character n-gram matching. In Konferenca Jezikovne tehnologije in digitalna humanistika, pages 74–80, Ljubljana, Slovenia.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Seattle, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 86–96, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Xabier Soto, Dimitar Shterionov, Alberto Poncelas, and Andy Way. 2020. Selecting backtranslated data from multiple sources for improved neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3898–3908, Seattle, USA.

Kaveh Taghipour, Nasim Afhami, Shahram Khadivi, and Saeed Shiry. 2010. A discriminative approach to filter out noisy sentence pairs from bilingual corpora. In Proceedings of the 5th International Symposium on Telecommunications (IST 2010), pages 537–541, Tehran, Iran.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pages 2214–2218, Istanbul, Turkey.

Antonio Toral and Andy Way. 2018. What level of quality can neural machine translation attain on literary text? In Translation Quality Assessment, pages 263–287. Springer.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, pages 484–491, Rochester, USA.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, Long Beach, USA.

Oscar Wilde, Alfred Douglas, trans., and Hendrik Bulthuis, trans. 1891 (1894, 1910). Salomé (Salome). Wikisource, Project Gutenberg.
