Neural Machine Translation between similar South-Slavic languages

Maja Popović, Alberto Poncelas
ADAPT Centre, School of Computing, Dublin City University, Ireland
[email protected]

Abstract

This paper describes the ADAPT-DCU machine translation systems built for the WMT 2020 shared task on Similar Language Translation. We explored several set-ups for NMT for the Croatian–Slovenian and Serbian–Slovenian language pairs in both translation directions. Our experiments focus on different amounts and types of training data: we first apply basic filtering on the OpenSubtitles training corpora, then we perform additional cleaning of remaining misaligned segments based on character n-gram matching. Finally, we make use of additional monolingual data by creating synthetic parallel data through back-translation. Automatic evaluation shows that multilingual systems with joint Serbian and Croatian data are better than bilingual ones, and that character-based cleaning leads to improved scores while using less data. The results also confirm once more that adding back-translated data further improves the performance, especially when the synthetic data is similar to the desired domain of the development and test sets. This, however, might come at the price of prolonged training time, especially for multitarget systems.

1 Introduction

Machine translation (MT) between closely related languages is, in principle, less challenging than translation between distantly related languages, but it is still far from being solved. While MT between the closely related South-Western Slavic languages Croatian, Slovenian and Serbian has been investigated in recent years for the rule-based (RBMT) and phrase-based (PBSMT) approaches (Etchegoyhen et al., 2014; Petkovski et al., 2014; Klubička et al., 2016; Arčan et al., 2016; Popović et al., 2016a), to the best of our knowledge, the current state-of-the-art neural machine translation (NMT) approach has not been investigated yet for these languages.

In this work, we first compare bilingual and multilingual systems in order to determine whether joining Serbian and Croatian data is useful. Afterwards, we investigate additional cleaning of the remaining misaligned segments by using character n-gram matching scores (Popović, 2015). The beauty of this method for similar languages is that it can be applied directly to the given training corpus, providing matching scores for each pair of source and target segments; for distant languages, translation of one side of the training corpus would be required. Finally, we make use of monolingual data in each of the three languages by creating additional synthetic parallel training sets via back-translation (Sennrich et al., 2016a; Poncelas et al., 2018; Burlot and Yvon, 2018).

2 Language properties

Common properties: All three languages, Croatian, Serbian and Slovenian, belong to the South-Western Slavic branch. As Slavic languages, they have a very rich inflectional morphology for all word classes: six cases and three genders for all nouns, pronouns, adjectives and determiners. For verbs, person and many tenses are expressed by the suffix, so that the subject pronoun is often omitted. There are two verb aspects, so that many verbs have perfective and imperfective form(s) depending on the duration of the described action. As for syntax, all three languages have quite a free word order, and none of them uses articles, either definite or indefinite. In addition, multiple negation is always used.

Croatian and Serbian: Croatian and Serbian exhibit a large overlap in vocabulary and a strong morpho-syntactic similarity, so that speakers can understand each other without difficulties. Nevertheless, there is a number of small but notable, and also frequently occurring, differences between them. The largest differences between the two languages are in vocabulary: some words are completely different, some however differ only by one or two letters. Apart from lexical differences, there are also structural differences, mainly concerning verbs: modal verb constructions, future tense, as well as conditional.

Slovenian: Even though Slovenian is very closely related to Croatian and Serbian, and the languages share a large degree of mutual intelligibility, a number of Croatian/Serbian speakers may have difficulties with Slovenian and the other way round. The nature of the lexical differences is similar to that between Croatian and Serbian, namely a number of words are completely different and a number differ only by one or two letters. However, the amount of different words is much larger. In addition, the set of overlapping words includes a number of false friends (e.g. brati means "to pluck" in Croatian and Serbian but "to read" in Slovenian). The amount of grammatical differences is also larger and includes local word order, verb mood and/or tense formation, question structure, usage of cases, structural properties of certain conjunctions, as well as some other structural differences. Another important difference is the Slovenian dual, a grammatical number which refers to two entities (apart from singular for one and plural for more than two). It requires an additional set of pronouns, as well as additional sets of noun, adjective and verb inflexion rules, which exist neither in Croatian nor in Serbian.

3 Data

For training, we used the publicly available OPUS parallel corpora (http://opus.nlpl.eu/) (Tiedemann, 2012) indicated by the workshop organisers. OpenSubtitles is indicated for all translation directions. For Croatian–Slovenian, other corpora are indicated too, but they are either not sentence-aligned (JW300) or extremely noisy (DGT, MultiParaCrawl). Therefore, we decided to use only OpenSubtitles for all translation directions.

It is worth noting that the organisers also indicated the SETIMES News parallel Croatian–Serbian corpus. Developing an additional Croatian–Serbian MT system for converting Serbian data into Croatian and vice versa was shown to be helpful for the PBSMT approach (Popović and Ljubešić, 2014; Popović et al., 2016b). However, our preliminary experiments in this direction indicated that this technique is not helpful for the NMT approach.

The original parallel data were filtered in order to eliminate noisy parts: too long segments (more than 100 words), segment pairs with disproportional sentence lengths, segments with more than 1/3 of non-alphanumeric characters, as well as duplicate segment pairs were removed. The statistics of the remaining subtitles, together with the development and test sets, are shown in Table 1. The development and test sets were provided by the organisers and originate from Public Relations publications of a business intelligence company.

lang.   set     domain             # sentences
sl-hr   train   Subtitles           11 213 386
        dev     PR publications           2457
        test    PR publications           2582
sl-sr   train   Subtitles           11 780 062
        dev     PR publications           1259
        test    PR publications           1260

Table 1: Corpus statistics.

3.1 Additional cleaning of OpenSubtitles

While a large number of noisy parts and misaligned segments was removed from OpenSubtitles by the basic filtering procedure, a number of misaligned segments still remained. In order to remove these, we applied additional cleaning based on the character n-gram F-score chrF, usually used for MT evaluation (Popović, 2015). For the purpose of cleaning, the chrF score is calculated for each pair of segments in the training data. Due to the similarity between the languages, the scores of properly aligned segments are higher than the scores of misaligned segments. Nevertheless, the languages are sufficiently different that some properly aligned short segments (or single words) can have low scores, too. Still, if those words also appear in longer sentences, they will not be removed. Preliminary experiments with different thresholds showed that keeping the segments with a chrF score equal to or greater than 20 is the best option.
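To make the two-stage cleaning concrete, the sketch below shows one way such a filter could be implemented. It is an illustration rather than the exact pipeline used for these systems: the file names are hypothetical placeholders, the 3:1 length-ratio limit is an assumption (no concrete ratio is given above), and the chrF scores come from the CHRF implementation in the sacrebleu package, whose 0–100 scale matches the threshold of 20.

```python
# Illustrative sketch of basic filtering plus chrF-based cleaning of a
# parallel corpus; not the exact pipeline of the paper. File names are
# placeholders, and the 3:1 length-ratio limit is an assumption.
from sacrebleu.metrics import CHRF

chrf = CHRF()  # character n-gram F-score on a 0-100 scale

def passes_basic_filters(src: str, trg: str) -> bool:
    s_len, t_len = len(src.split()), len(trg.split())
    if s_len > 100 or t_len > 100:              # too long segments
        return False
    if s_len > 3 * t_len or t_len > 3 * s_len:  # disproportional lengths
        return False
    for line in (src, trg):                     # > 1/3 non-alphanumeric chars
        non_alnum = sum(not (ch.isalnum() or ch.isspace()) for ch in line)
        if 3 * non_alnum > len(line):
            return False
    return True

seen = set()
with open("train.src") as f_src, open("train.trg") as f_trg, \
     open("clean.src", "w") as o_src, open("clean.trg", "w") as o_trg:
    for src, trg in zip(f_src, f_trg):
        src, trg = src.strip(), trg.strip()
        if (src, trg) in seen or not passes_basic_filters(src, trg):
            continue                            # drop duplicates and noise
        seen.add((src, trg))
        # For closely related languages, chrF computed directly between
        # source and target is high for well-aligned pairs and low for
        # misalignments; keep pairs at or above the threshold of 20.
        if chrf.sentence_score(trg, [src]).score >= 20:
            o_src.write(src + "\n")
            o_trg.write(trg + "\n")
```

Because the score is computed directly between the source and the target segment, this shortcut only works for closely related languages; for a distant pair, one side would first have to be translated, as noted in the introduction.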

3.2 Using monolingual data

In addition to the parallel OpenSubtitles corpora, we also used the monolingual data in each of the three languages indicated by the organisers, namely the mixed-domain data collected from the Web: hrWaC, slWaC and srWaC (Ljubešić and Erjavec, 2011; Ljubešić and Klubička, 2014). As a first step, we removed too long and too short sentences, keeping those between 5 and 60 words. Then, we removed sentences with more than 1/3 of non-alphanumeric characters, sentences with URLs, as well as duplicate sentences.

Then, we wanted to rank these sentences according to their relevance for our experiments, namely according to their similarity to the development corpus. For this purpose, we used the Feature Decay Algorithm (FDA) (Biçici and Yuret, 2011). This method iteratively selects sentences from an initial set S based on the number of n-grams which overlap with an in-domain text Seed, and adds these sentences to a selected set Sel. In addition, in order to promote diversity, after a sentence is selected, its n-grams are penalised so that they are less likely to be selected in the following iterations. The default FDA system halves the score of an n-gram each time it is selected. Therefore, the score of a sentence s is computed as in Equation (1):

$$\mathrm{score}(s, Seed, Sel) = \frac{\sum_{ngr \in \{s \cap Seed\}} 0.5^{C_{Sel}(ngr)}}{\mathrm{length}(s)} \qquad (1)$$

where Sel is the set of sentences that have been selected and C_{Sel}(ngr) is the count of occurrences of the n-gram ngr in the selected sentences. At the end, the set S is converted into the set Sel containing the same sentences, but ranked according to their relevance. For our experiments, the hrWaC, slWaC and srWaC corpora represented the sets S, and the development sets in the corresponding target language were used as Seed.

Back-translated synthetic parallel corpora: After ranking the monolingual corpora by FDA, back-translation was applied in order to create additional parallel training corpora. For each translation direction, the two million best ranked sentences in the target language were translated into the source language by the corresponding NMT system.

Translation from Slovenian: The two million best ranked Serbian sentences and the two million best ranked Croatian sentences were translated into Slovenian.

Translation into Slovenian: Slovenian is the target language of two translation directions, and we wanted to have equally relevant Slovenian sentences for both directions. Therefore, we did not take the first two million sentences for one source language and the second two million for the other, because the Slovenian sentences for the first source language would be more relevant than those for the second. Instead, we took the four million best ranked Slovenian sentences and translated every odd sentence into Serbian and every even sentence into Croatian.
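A minimal, naive rendering of the FDA ranking in Equation (1), together with the odd/even split used for the Slovenian monolingual data, might look as follows. This is a sketch under several assumptions: n-grams up to order 3 (the feature order of the actual FDA set-up is not stated here), set semantics for the n-gram overlap, and exhaustive re-scoring in every iteration, which a production implementation would replace with a priority queue.

```python
# Naive sketch of FDA selection following Equation (1). Assumptions:
# n-grams up to order 3 and quadratic re-scoring; illustrative only.
def ngram_set(tokens, max_n=3):
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def fda_rank(pool, seed_text):
    """Rank pool sentences by decaying n-gram overlap with the seed."""
    seed = ngram_set(seed_text.split())
    counts = {}                     # C_Sel(ngr): occurrences in selections
    remaining = [s.split() for s in pool]
    ranked = []

    def score(toks):
        # score(s) = sum_{ngr in s ∩ Seed} 0.5 ** C_Sel(ngr) / length(s)
        overlap = ngram_set(toks) & seed
        return sum(0.5 ** counts.get(g, 0) for g in overlap) / max(len(toks), 1)

    while remaining:
        best = max(remaining, key=score)
        remaining.remove(best)
        ranked.append(" ".join(best))
        for g in ngram_set(best) & seed:   # halve these n-grams' future weight
            counts[g] = counts.get(g, 0) + 1
    return ranked

# Toy usage: rank a pool against a seed, then split the ranked list as
# done for the Slovenian data (odd positions to one target language,
# even positions to the other).
ranked = fda_rank(["to je hiša", "pes teče", "berem knjigo"],
                  "hiša je velika in berem knjigo")
for_serbian, for_croatian = ranked[0::2], ranked[1::2]
```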

4 MT systems

All our systems are built using the Sockeye implementation (Hieber et al., 2018) of the Transformer architecture (Vaswani et al., 2017). The systems operate on sub-word units generated by byte-pair encoding (BPE) (Sennrich et al., 2016b), with the number of BPE merge operations set to 32000. We use vocabularies shared between the languages because the languages are similar. Multilingual systems are built using the same technique as Johnson et al. (2017) and Aharoni et al. (2019), namely adding a target-language label "SR" or "HR" to each source sentence (a sketch of this labelling step is given after the list below). We investigated the following set-ups:

1. Systems trained on OpenSubtitles: The four bilingual systems, HR→SL, SR→SL, SL→HR and SL→SR, are trained separately for each language pair and each translation direction on about 11M parallel segments. The multisource system HR+SR→SL is trained for translation into Slovenian by joining the Serbian and Croatian sources and removing duplicates, resulting in 20.2M parallel segments. The multitarget system SL→HR+SR is trained for translation from Slovenian on the reversed corpus of 20.2M segments, with the target-language identifiers "SR" and "HR" added to the source side.

2. Systems trained on cleaned OpenSubtitles: Two multilingual systems, HR+SR→SL CLEAN and SL→HR+SR CLEAN, are trained on the joint OpenSubtitles corpora additionally cleaned by the chrF score. The cleaned corpus consists of 10.8M segments (instead of 20.2M).

3. Systems trained on cleaned OpenSubtitles and synthetic back-translated parallel WaC data: Two multilingual systems, HR+SR→SL CLEAN+BT and SL→HR+SR CLEAN+BT, are trained on the joint cleaned OpenSubtitles corpora together with the corresponding synthetic back-translated data selected from hrWaC, slWaC and srWaC. The monolingual data was back-translated by the corresponding systems trained on cleaned OpenSubtitles. The training corpora consist of 14.8M segments.
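The target-language labelling used by the multilingual systems amounts to a simple preprocessing step applied before BPE segmentation. The sketch below is illustrative; the file names are hypothetical placeholders, and the label is prepended as a plain token, as described above.

```python
# Sketch of target-language labelling for the multitarget set-up
# (SL -> HR+SR). File names are hypothetical placeholders; in a real
# pipeline this step precedes BPE segmentation.
def add_labelled_pair(src_path, trg_path, label, out_src, out_trg):
    """Append one language pair, prepending the target-language label."""
    with open(src_path) as f_src, open(trg_path) as f_trg, \
         open(out_src, "a") as o_src, open(out_trg, "a") as o_trg:
        for src, trg in zip(f_src, f_trg):
            o_src.write(f"{label} {src.strip()}\n")  # e.g. "HR <sentence>"
            o_trg.write(trg.strip() + "\n")

# Join both directions into one Slovenian-source training corpus.
add_labelled_pair("sl-hr.sl", "sl-hr.hr", "HR", "multi.src", "multi.trg")
add_labelled_pair("sl-sr.sl", "sl-sr.sr", "SR", "multi.src", "multi.trg")
```

A single BPE model with 32000 merge operations can then be learned over the concatenated source and target sides of such a corpus, yielding the shared vocabulary mentioned above.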
5 Results

We evaluate our systems using three automatic overall evaluation scores: sacreBLEU (Post, 2018), chrF (Popović, 2015) and characTER (Wang et al., 2016). The BLEU score is used because of its long tradition. The two character-level scores have been shown to correlate much better with human assessments (Bojar et al., 2017; Ma et al., 2018), especially for morphologically rich languages, and the chrF score is recommended as a replacement for BLEU in a recent detailed study encompassing a number of automatic MT metrics (Mathur et al., 2020). In addition to the automatic MT evaluation scores, for each of the systems we report the size of the training corpus and the training time.

Table 2 shows the results both on the development and on the test set for each of the four translation directions. First of all, it can be seen that the automatic scores are relatively low given the similarity of the languages. One reason is the domain/genre discrepancy between the training and the development/test sets. Another possible reason is the nature of the OpenSubtitles corpus: the majority of non-English texts in OpenSubtitles are human translations from English originals. Therefore, for translation from English, the source language is the original one and the target language is its human translation (and the other way round for translation into English). For translation not involving English, on the other hand, both sides are human translations, which can have a strong impact on performance (Kurokawa et al., 2009; Vyas et al., 2018; Zhang and Toral, 2019). These effects should be investigated in future work.

Results on the development set: For the systems trained on OpenSubtitles, it can be seen that for each translation direction, multilingual systems yield better automatic scores than bilingual systems, at the cost of slightly prolonged training time (from about 3 days to 3-4 days). Therefore, we chose the two multilingual systems HR+SR→SL and SL→HR+SR as the baselines and did not keep the bilingual systems for further experiments.

The chrF cleaning of OpenSubtitles reduces the size of the corpus and the training time while slightly improving the automatic scores. The reduction in training time is slightly smaller for the multitarget translation from Slovenian (down to 2-3 days) than for the multisource translation into Slovenian (down to less than 2 days).

Adding the back-translated data from the WaC corpora improves the automatic scores by more than 10 points for multisource translation (into Slovenian) and by 5 to 10 points for multitarget translation (from Slovenian). This could be expected, especially since the monolingual data was chosen to be similar to the development data. Nevertheless, this large improvement comes at a price: although the increase of the corpus is not very large, from 10.8M to 14.8M segments, the training time increases to (more than) 3 days. It can be noted that for some set-ups, the multitarget system needs more training time. The probable reason is the diversity of the target side of the training corpus: the system has to deal with two target languages, and when synthetic data is added, also with two different domains/genres for each of them.

Results on the test set: Based on the results on the development set, we submitted the outputs of the systems with back-translated data (HR+SR→SL CLEAN+BT, SL→HR+SR CLEAN+BT) as primary submissions. The outputs of the systems trained on cleaned data (HR+SR→SL CLEAN, SL→HR+SR CLEAN) were submitted as first contrastive, and the outputs of the baseline multilingual systems (HR+SR→SL, SL→HR+SR) as second contrastive submissions. The test sets were not translated by the initial bilingual systems at all, therefore those results are not available.

It can be seen that the tendencies on the test set are almost the same as on the development set. The only difference is the larger improvement obtained by cleaning OpenSubtitles with the chrF scores. Further detailed analysis involving manual inspection is needed to better understand this difference.
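The BLEU and chrF figures of the kind reported in Table 2 can be reproduced with the sacrebleu package; characTER is computed with its own separate tool and is omitted from this sketch. The file names are hypothetical placeholders.

```python
# Illustrative scoring of a system output with sacrebleu (BLEU and chrF).
import sacrebleu

with open("system.out") as f:
    hypotheses = [line.strip() for line in f]
with open("test.ref") as f:
    references = [[line.strip() for line in f]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```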

(a) Croatian→Slovenian

                     training             dev, hr→sl               test, hr→sl
system               size     time        BLEU   chrF   chrTER     BLEU   chrF   chrTER
HR→SL                11.2M    ∼3 days     38.5   65.7   29.4       /      /      /
HR+SR→SL             20.2M    3-4 days    38.8   65.9   29.5       34.7   62.2   34.5
HR+SR→SL CLEAN       10.8M    <2 days     39.7   66.5   27.0       37.1   65.2   28.2
HR+SR→SL CLEAN+BT    14.8M    ∼3 days     53.9   77.7   18.9       51.9   76.4   20.0

(b) Serbian→Slovenian

                     training             dev, sr→sl               test, sr→sl
system               size     time        BLEU   chrF   chrTER     BLEU   chrF   chrTER
SR→SL                11.8M    ∼3 days     40.6   67.2   30.3       /      /      /
HR+SR→SL             20.2M    3-4 days    42.1   68.3   28.5       37.7   64.1   33.5
HR+SR→SL CLEAN       10.8M    <2 days     42.2   68.6   26.9       41.2   68.1   26.5
HR+SR→SL CLEAN+BT    14.8M    ∼3 days     58.0   80.4   18.5       55.2   78.4   19.1

(c) Slovenian→Croatian

                     training             dev, sl→hr               test, sl→hr
system               size     time        BLEU   chrF   chrTER     BLEU   chrF   chrTER
SL→HR                11.2M    ∼3 days     33.4   62.6   33.0       /      /      /
SL→HR+SR             20.2M    3-4 days    36.0   63.8   32.6       30.3   58.9   40.0
SL→HR+SR CLEAN       10.8M    2-3 days    36.9   65.2   28.6       35.7   64.4   28.8
SL→HR+SR CLEAN+BT    14.8M    >3 days     46.1   72.7   22.8       45.1   72.3   23.3

(d) Slovenian→Serbian

                     training             dev, sl→sr               test, sl→sr
system               size     time        BLEU   chrF   chrTER     BLEU   chrF   chrTER
SL→SR                11.8M    ∼3 days     33.3   62.3   34.3       /      /      /
SL→HR+SR             20.2M    3-4 days    34.8   63.4   33.4       32.0   60.0   36.4
SL→HR+SR CLEAN       10.8M    2-3 days    35.5   64.2   31.5       37.0   65.1   28.2
SL→HR+SR CLEAN+BT    14.8M    >3 days     45.5   73.3   23.4       47.6   73.6   22.1

Table 2: Results for Croatian→Slovenian (a), Serbian→Slovenian (b), Slovenian→Croatian (c) and Slovenian→Serbian (d): corpus size, training time, and the three automatic MT evaluation scores (BLEU, chrF and characTER).

6 Summary and outlook

This work investigates different set-ups for training NMT systems for translation between three closely related South-Slavic languages: Slovenian on one side, and Serbian and Croatian on the other. We explore different sizes and types of training corpora, as well as bilingual and multilingual systems. Our results show that for all translation directions, multilingual systems with joint Croatian and Serbian data perform better than bilingual systems. The results also show that cleaning misaligned segments using character n-gram matching (the chrF score) is a fast and useful method for closely related languages, improving the evaluation scores while reducing corpus size and training time. Finally, we confirm that adding back-translated synthetic data, the usual practice in neural machine translation, can yield large improvements in evaluation scores for these languages as well. Nevertheless, for multitarget translation, it might result in prolonged training time due to the increased variety of the target-language side.

Future work should include more genres and domains, as well as a detailed analysis of errors and problems, in order to further improve the performance of NMT between South Slavic languages.

Acknowledgments

The ADAPT SFI Centre for Digital Content Technology (www.adaptcentre.ie) is funded by Science Foundation Ireland through the SFI Research Centres Programme and is co-funded under the European Regional Development Fund (ERDF) through Grant 13/RC/2106.

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pages 3874–3884, Minneapolis, Minnesota.

Mihael Arčan, Maja Popović, and Paul Buitelaar. 2016. Asistent – A machine translation system for Slovene, Serbian and Croatian. In Proceedings of the Tenth Conference on Language Technologies and Digital Humanities (JDTH 2016), pages 13–20, Ljubljana, Slovenia.

Ergun Biçici and Deniz Yuret. 2011. Instance selection for machine translation using feature decay algorithms. In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011), pages 272–283, Edinburgh, Scotland.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation (WMT 2017), pages 489–513, Copenhagen, Denmark.

Franck Burlot and François Yvon. 2018. Using monolingual data in neural machine translation: a systematic study. In Proceedings of the Third Conference on Machine Translation (WMT 2018), pages 144–155, Brussels, Belgium.

Thierry Etchegoyhen, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard van Loenhout, Arantza del Pozo, Mirjam Sepesy Maučec, Anja Turner, and Martin Volk. 2014. Machine translation for subtitling: A large-scale evaluation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pages 46–53, Reykjavik, Iceland.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2018. The Sockeye neural machine translation toolkit at AMTA 2018. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 200–207, Boston, MA.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Filip Klubička, Gema Ramírez-Sánchez, and Nikola Ljubešić. 2016. Collaborative development of a rule-based machine translator between Croatian and Serbian. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation (EAMT 2016), pages 361–367.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. 2009. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT Summit XII, pages 81–88, Ottawa, Canada.

Nikola Ljubešić and Tomaž Erjavec. 2011. hrWaC and slWaC: Compiling web corpora for Croatian and Slovene. In Proceedings of the 14th Conference on Text, Speech and Dialogue (TSD 2011), Lecture Notes in Computer Science, pages 395–402, Pilsen, Czech Republic. Springer.

Nikola Ljubešić and Filip Klubička. 2014. {bs,hr,sr}WaC – web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29–35, Gothenburg, Sweden.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation (WMT 2018), pages 671–688, Brussels, Belgium.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 4984–4997, Online.

Filip Petkovski, Francis Tyers, and Hrvoje Peradin. 2014. Shallow-transfer rule-based machine translation for the Western group of South Slavic languages. In Proceedings of the 9th Workshop on Free/Open-Source Language Resources for the Machine Translation of Less-Resourced Languages (SaLTMiL 2014), pages 25–30, Reykjavik, Iceland.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation (EAMT 2018), Alicante, Spain.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 392–395, Lisbon, Portugal.

Maja Popović, Mihael Arčan, and Filip Klubička. 2016a. Language related issues for machine translation between closely related South Slavic languages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 43–52, Osaka, Japan.

Maja Popović, Kostadin Cholakov, Valia Kordoni, and Nikola Ljubešić. 2016b. Enlarging scarce in-domain English-Croatian corpus for SMT of MOOCs using Serbian. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 97–105, Osaka, Japan.

Maja Popović and Nikola Ljubešić. 2014. Exploring cross-language statistical machine translation for closely related South Slavic languages. In Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants (LT4CloseLang 2014), pages 76–84, Doha, Qatar.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers (WMT 2018), pages 186–191, Brussels, Belgium.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 86–96, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725, Berlin, Germany.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 2214–2218, Istanbul, Turkey.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), pages 5998–6008, Long Beach, CA.

Yogarshi Vyas, Xing Niu, and Marine Carpuat. 2018. Identifying semantic divergences in parallel text without annotations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pages 1503–1515, New Orleans, Louisiana.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTer: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation (WMT 2016), pages 505–510, Berlin, Germany.

Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets. In Proceedings of the Fourth Conference on Machine Translation (WMT 2019), pages 73–81, Florence, Italy.
