Facilitating Terminology Translation with Target Lemma Annotations
Toms Bergmanis†‡ and Mārcis Pinnis†‡
†Tilde / Vienības gatve 75A, Riga, Latvia
‡Faculty of Computing, University of Latvia / Raiņa bulv. 19, Riga, Latvia
[email protected]

Abstract

Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In the day-to-day work of professional translators, however, this is seldom the case, as translators work with bilingual glossaries where terms are given in their dictionary forms; finding the right target language form is part of the translation process. We argue that the requirement for apriori specified target language forms is unrealistic and impedes the practical applicability of previous work. In this work, we propose to train machine translation systems using a source-side data augmentation method[1] that annotates randomly selected source language words with their target language lemmas. We show that systems trained on such augmented data are readily usable for terminology integration in real-life translation scenarios. Our experiments on terminology translation into the morphologically complex Baltic and Uralic languages show an improvement of up to 7 BLEU points over baseline systems with no means for terminology integration and an average improvement of 4 BLEU points over the previous work. Results of the human evaluation indicate a 47.7% absolute improvement over the previous work in term translation accuracy when translating into Latvian.

[1] Relevant materials and code: https://github.com/tilde-nlp/terminology_translation

1 Introduction

Translation into morphologically complex languages involves 1) making a lexical choice for a word in the target language and 2) finding its morphological form that is suitable for the morphosyntactic context of the target sentence. Most of the recent work on terminology translation, however, has assumed that the correct morphological forms are apriori known (Hokamp and Liu, 2017; Post and Vilar, 2018; Hasler et al., 2018; Dinu et al., 2019; Song et al., 2020; Susanto et al., 2020; Dougal and Lonsdale, 2020). Thus, previous work has approached terminology translation predominantly as a problem of making sure that the decoder's output contains lexically and morphologically pre-specified target language terms. While useful in some cases and for some languages, such approaches fall short of addressing terminology translation into morphologically complex languages, where each word can have many morphological surface forms.

For terminology translation to be viable for translation into morphologically complex languages, terminology constraints have to be soft. That is, terminology translation has to account for various natural language phenomena which cause words to have more than one manifestation of their root morphemes. Multiple root morphemes complicate the application of hard-constraint methods such as constrained decoding (Hokamp and Liu, 2017). That is because even after the terminology constraint is stripped of the morphemes that encode all grammatical information, the remaining root morphemes can still be too restrictive to be used as hard constraints: for many words, more than one root morpheme is possible. An illustrative example is the consonant mutation in the Latvian noun vācietis ("the German"), which undergoes the mutation t→š, thus yielding two variants of its root morpheme, vācieš- and vāciet- (Bergmanis, 2020). If either of the forms is used as a hard constraint for constrained decoding, the other one is excluded from appearing in the sentence's translation.
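The exclusion effect can be made concrete with a small sketch (ours, for illustration; not code from the paper). It checks surface forms of vācietis against a single root-morpheme variant used as a hard string constraint; the list of surface forms and the matching rule are our simplifications:

```python
# Illustration: a hard constraint built from one root-morpheme variant
# rejects translations that use the other variant. The Latvian noun
# "vācietis" has the root variants "vāciet-" and "vācieš-" because of
# the t -> š consonant mutation.
surface_forms = ["vācietis", "vācieti", "vāciešiem", "vāciešu"]

def matches_root(form: str, root: str) -> bool:
    """A crude hard constraint: the surface form must start with the root."""
    return form.startswith(root.rstrip("-"))

# Constraining on the citation-form root accepts only the unmutated forms:
accepted = [f for f in surface_forms if matches_root(f, "vāciet-")]
rejected = [f for f in surface_forms if not matches_root(f, "vāciet-")]
print(accepted)  # ['vācietis', 'vācieti']
print(rejected)  # ['vāciešiem', 'vāciešu']
```

Swapping in the mutated root "vācieš-" simply inverts the problem, which is why neither variant works as a hard constraint on its own.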
We propose a necessary modification to the method introduced by Dinu et al. (2019), which allows training neural machine translation (NMT) systems that are capable of applying terminology constraints: instead of annotating source-side terminology with its exact target language translations, we annotate randomly selected source language words with their target language lemmas. First of all, preparing training data in such a way relaxes the requirement for access to bilingual terminology resources at training time. Second, we show that the model trained on such data does not learn to simply copy inline annotations, as in the case of Dinu et al. (2019), but learns copy-and-inflect behaviour instead, thus addressing the need for soft terminology constraints.

Our results show that the proposed approach not only relaxes the requirement for apriori specified target language forms but also yields substantial improvements over the previous work (Dinu et al., 2019) when tested on the morphologically complex Baltic and Uralic languages.

Our work is similar to that of Exel et al. (2020), who also aim to achieve copy-and-inflect behaviour. However, they limit their annotations to only those terms whose base forms differ by no more than two characters from the forms required in the target language sentence. Thus, wordforms undergoing a longer affix change, or inflections accompanied by such linguistic phenomena as consonant mutation, consonant gradation, or other stem changes, are never included in the training data.

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 3105–3111, April 19-23, 2021. ©2021 Association for Computational Linguistics

EN Src.: faulty engine or in transmission [..]
LV Trg.: atteice dzinējā vai transmisijas [..]
ETA:     faulty|w engine|s dzinējā|t or|w transmission|s transmisijas|t [..]
TLA:     faulty|w engine|s dzinējs|t or|w transmission|s transmisija|t [..]

Table 1: Examples of differences in input data in ETA (Dinu et al., 2019) and TLA (this work). Differences of inline annotations are marked in bold. |w, |s, |t denote the values of the additional input stream and stand for regular words, annotated source language words, and target language annotations, respectively.

        Train    ATS    WMT17+IATE
EN-DE   27.6M    768    581
EN-ET    2.4M    768    -
EN-LV   22.6M    768    -
EN-LT   22.1M    768    -

Table 2: Training and evaluation data sizes in numbers of sentences. WMT2017+IATE stands for the English-German test set from the news translation task of WMT2017, which is annotated with terminology from the IATE terminology database.

2 Method: Target Lemma Annotations

To train NMT systems that allow applying terminology constraints, Dinu et al. (2019) prepare training data by amending source language terms with their exact target annotations (ETA). To inform the NMT model about the nature of each token (i.e., whether it is a source language term, its target language translation, or a regular source language word), the authors use an additional input stream: source-side factors (Sennrich and Haddow, 2016). Their method, however, is limited to cases in which the provided annotation matches the required target form and can be copied verbatim, and it thus performs poorly in cases where the surface forms of terms in the target language differ from those used to annotate the source language sentences (Dinu et al., 2019). This constitutes a problem for the method's practical applicability in real-life scenarios.

In this work, we propose two changes to the approach of Dinu et al. (2019). First, when preparing training data, instead of using terms found in either IATE (https://iate.europa.eu) or Wiktionary as done by Dinu et al. (2019), we annotate random source language words. This relaxes the requirement for curated bilingual dictionaries for training data preparation. Second, rather than providing exactly those target language forms that are used in the target sentence, we use target lemma annotations (TLA) instead (see Table 1 for examples). We hypothesise that in order to benefit from such annotations, the NMT model will have to learn copy-and-inflect behaviour instead of the simple copying proposed by Dinu et al. (2019).
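The annotation format of Table 1 can be illustrated with a short sketch (ours, not the authors' released code; the helper name and the lemma dictionary are hypothetical). Each token carries a factor value on a parallel input stream: "w" for a regular word, "s" for an annotated source word, and "t" for the target lemma inserted inline after it:

```python
# Build a TLA-annotated source sentence plus its factor stream,
# following the format shown in Table 1.
def annotate_tla(tokens, lemma_for):
    """lemma_for maps a source token to its target-language lemma, if any."""
    out_tokens, factors = [], []
    for tok in tokens:
        lemma = lemma_for.get(tok)
        if lemma is None:
            out_tokens.append(tok)     # regular word
            factors.append("w")
        else:
            out_tokens.append(tok)     # annotated source word
            factors.append("s")
            out_tokens.append(lemma)   # inline target lemma annotation
            factors.append("t")
    return out_tokens, factors

src = ["faulty", "engine", "or", "in", "transmission"]
lemmas = {"engine": "dzinējs", "transmission": "transmisija"}
tokens, factors = annotate_tla(src, lemmas)
# tokens:  ['faulty', 'engine', 'dzinējs', 'or', 'in', 'transmission', 'transmisija']
# factors: ['w', 's', 't', 'w', 'w', 's', 't']
```

Note that, unlike ETA, the inline annotations here are lemmas (dzinējs, transmisija) rather than the inflected forms appearing in the target sentence (dzinējā, transmisijas), so the model must inflect them itself.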
3 Experimental Setup

Languages and Data. As our focus is on morphologically complex languages, in our experiments we translate from English into Latvian and Lithuanian (Baltic branch of the Indo-European language family) as well as Estonian (Finnic branch of the Uralic language family). For comparability with the previous work, we also use English-German (Germanic branch of the Indo-European language family). For all language pairs, we use all data that is available in the Tilde Data Library, with the exception of English-Estonian, for which we use data from WMT 2018. The size of the parallel corpora after pre-processing using the Tilde MT platform (Pinnis et al., 2018) and filtering tools (Pinnis, 2018) is given in Table 2.

To prepare data with TLA, we first lemmatise and part-of-speech (POS) tag the target language side of the parallel corpora. For lemmatisation and POS tagging, we use pre-trained Stanza (Qi et al., 2020) models. We then use fast_align (Dyer et al., 2013) to learn word alignments between the target language lemmas and the source language inflected words. We only annotate verbs or nouns. To generate sentences with varying proportions of annotated and unannotated words, we first generate a sentence-level annotation threshold uniformly at random from the interval [0.6, 1.0).

[...] source factors (Sennrich and Haddow, 2016) with a dimensionality of 8 for systems using inline target lemma annotations. We train all models using early stopping with a patience of 10 based on their development set perplexity (Prechelt, 1998).

[Figure 1: Example of forms used in human evaluation.]

Evaluation Methods and Data. In previous work, methods were tested on general domain data annotated with exact surface forms of general-domain words from IATE and Wiktionary. Although data constructed in such a way is not only artificial but also gives an oversimplified view of terminology translation, we do use the data from IATE to validate our re-implementation of the method of Dinu et al. (2019). Other than that, we test on the Automotive Test Suite (ATS): a data set
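The TLA data preparation described above can be sketched as follows. This is our reconstruction under stated assumptions: the paper specifies only that a sentence-level threshold is drawn uniformly from [0.6, 1.0) over aligned verbs and nouns; the per-word acceptance rule and the function name are hypothetical:

```python
import random

def choose_annotations(candidates, rng=random):
    """candidates: (source word, aligned target lemma) pairs for verbs/nouns.

    Draws a sentence-level annotation threshold uniformly from [0.6, 1.0),
    then (our assumption) keeps a candidate only when a per-word draw
    reaches the threshold, so different sentences end up with different
    proportions of annotated words.
    """
    threshold = rng.uniform(0.6, 1.0)  # sentence-level threshold
    return [pair for pair in candidates if rng.random() >= threshold]

rng = random.Random(0)
pairs = [("engine", "dzinējs"), ("transmission", "transmisija")]
chosen = choose_annotations(pairs, rng)  # some subset of `pairs`
```

Because the threshold never drops below 0.6, most candidate words in a sentence stay unannotated, which keeps the model from relying on annotations being present.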