The University of Helsinki submissions to the WMT19 news translation task

Aarne Talman,∗† Umut Sulubacak,∗ Raúl Vázquez,∗ Yves Scherrer,∗ Sami Virpioja,∗ Alessandro Raganato,∗† Arvi Hurskainen∗ and Jörg Tiedemann∗

∗University of Helsinki †Basement AI {name.surname}@helsinki.fi

Abstract

In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English–German, English–Finnish and Finnish–English. This year, we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English–German, we trained both sentence-level transformer models and compared different document-level translation approaches. For Finnish–English and English–Finnish we focused on different segmentation approaches, and we also included a rule-based system for English–Finnish.

1 Introduction

The University of Helsinki participated in the WMT 2019 news translation task with four primary submissions. We submitted neural machine translation systems for English-to-Finnish, Finnish-to-English and English-to-German, and a rule-based machine translation system for English-to-Finnish.

Most of our efforts for this year's WMT focused on data selection and pre-processing (Section 2), sentence-level translation models for English-to-German, English-to-Finnish and Finnish-to-English (Section 3), document-level translation models for English-to-German (Section 4), and a comparison of different word segmentation approaches for Finnish (Section 3.3). The final submitted NMT systems are summarized in Section 5, while the rule-based machine translation system is described in Section 3.4.

2 Pre-processing, data filtering and back-translation

It is well known that data pre-processing and selection have a huge effect on translation quality in neural machine translation. We spent substantial effort on filtering data in order to reduce noise (especially in the web-crawled data sets) and to match the target domain of news data. The resulting training sets, after applying the steps described below, consist of 15.7M sentence pairs for English–German, 8.5M sentence pairs for English–Finnish, and 12.3M–26.7M sentence pairs (different samplings of back-translations) for Finnish–English.

2.1 Pre-processing

For each language, we applied a series of pre-processing steps using scripts available in the Moses decoder (Koehn et al., 2007):

• replacing unicode punctuation,
• removing non-printing characters,
• normalizing punctuation,
• tokenization.

In addition to these steps, we replaced a number of English contractions with the full form, e.g. "They're" → "They are". After the above steps, we applied a Moses truecaser model trained for individual languages, and finally a byte-pair encoding (BPE) (Sennrich et al., 2016b) segmentation using a set of codes for either language pair. A sketch of the normalization and tokenization steps is shown below.
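The following is a minimal sketch of this pipeline using the sacremoses Python port of the Moses scripts (an assumption: the paper used the original Moses tools). The contraction table is a hypothetical stand-in, since the full replacement list is not given in the paper.

```python
# A minimal sketch of the pre-processing pipeline, using the sacremoses
# Python port of the Moses scripts. The contraction table is a hypothetical
# stand-in for the paper's (unspecified) full replacement list.
import re

from sacremoses import MosesPunctNormalizer, MosesTokenizer

CONTRACTIONS = {"they're": "they are", "can't": "cannot", "won't": "will not"}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE)

def expand_contractions(text: str) -> str:
    """Replace English contractions with their full forms."""
    def repl(match):
        word = match.group(0)
        full = CONTRACTIONS[word.lower()]
        # Preserve capitalization of the first character.
        return full.capitalize() if word[0].isupper() else full
    return CONTRACTION_RE.sub(repl, text)

def preprocess(text: str, lang: str) -> str:
    """Punctuation normalization and tokenization, Moses-style."""
    normalizer = MosesPunctNormalizer(
        lang=lang,
        pre_replace_unicode_punct=True,   # replace unicode punctuation
        post_remove_control_chars=True)   # remove non-printing characters
    tokenizer = MosesTokenizer(lang=lang)
    if lang == "en":
        text = expand_contractions(text)
    return tokenizer.tokenize(normalizer.normalize(text), return_str=True)

print(preprocess("They're testing the pipeline!", lang="en"))
# Truecasing and BPE segmentation would follow as separate steps.
```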

For English–German, we initially pre-processed the data using only punctuation normalization and tokenization. We subsequently trained an English truecaser model using all monolingual English data as well as the English side of all parallel English–German datasets except the Rapid corpus (in which non-English characters were missing from a substantial portion of the German sentences). We also repeated the same for German. Afterwards, we used a heuristic cleanup script shared by Marcin Junczys-Dowmunt (https://gist.github.com/emjotde/4c5303e3b2fc501745ae016a8d1e8e49) in order to filter suspicious samples out of Rapid, and then truecased all parallel English–German data (including the filtered Rapid) using these models. Finally, we trained BPE codes with 35 000 symbols jointly for English–German on the truecased parallel sets. For all further experiments with English–German data, we applied the full set of tokenization steps as well as truecasing and BPE segmentation.

For English–Finnish, we first applied the standard tokenization pipeline. For English and Finnish respectively, we trained truecaser models on all English and Finnish monolingual data as well as the English and Finnish sides of all parallel English–Finnish datasets. As we had found to be optimal in our previous year's submission (Raganato et al., 2018), we trained a BPE model using a vocabulary of 37 000 symbols, trained jointly only on the parallel data. Furthermore, for some experiments, we also used domain labeling. We marked the datasets with 3 different labels: ⟨NEWS⟩ for the development and test data from 2015, 2016 and 2017, ⟨EP⟩ for Europarl, and ⟨WEB⟩ for ParaCrawl and Wikititles.

2.2 Data filtering

For data filtering we applied four types of filters: (i) rule-based heuristics, (ii) filters based on language identification, (iii) filters based on word alignment models, and (iv) language model filters.

Heuristic filters: The first step in cleaning the data applies a number of heuristics (largely inspired by Stahlberg et al., 2018), including:

• removing all sentence pairs with a length difference ratio above a certain threshold: for CommonCrawl, ParaCrawl and Rapid we used a threshold of 3, for WikiTitles a threshold of 2, and for all other data sets a threshold of 9;
• removing pairs with short sentences: for CommonCrawl, ParaCrawl and Rapid we required a minimum of four words;
• removing pairs with very long sentences: we restricted all data to a maximum length of 100 words;
• removing sentences with extremely long words: we excluded all sentence pairs with words of 40 or more characters;
• removing sentence pairs that include HTML or XML tags;
• decoding common HTML/XML entities;
• removing empty alignments (while keeping document boundaries intact);
• removing pairs where the sequences of non-zero digits occurring in either sentence do not match;
• removing pairs where one sentence is terminated with a punctuation mark and the other is either missing terminal punctuation or terminated with another punctuation mark.

Language identifiers: There is a surprisingly large amount of text segments in a wrong language in the provided parallel training data. This is especially true for the ParaCrawl and Rapid data sets, which is rather unexpected, as a basic language identifier certainly must be part of the crawling and extraction pipeline. Nevertheless, after some random inspection of the data, we found it necessary to apply off-the-shelf language identifiers to remove additional erroneous text from the training data. In particular, we applied the Compact Language Detector version 2 (CLD2) from the Google Chrome project (using the Python interface from pycld2, https://github.com/aboSamoor/pycld2) and the widely used langid.py package (Lui and Baldwin, 2012) to classify each sentence in the ParaCrawl, CommonCrawl, Rapid and Wikititles data sets. We removed all sentence pairs in which the language of one of the aligned sentences was not reliably detected. For this, we required the correct language ID from both classifiers, the reliable-flag set to "True" by CLD2 with a reliability score of 90 or more, and the detection probability of langid.py to be at least 0.9. A sketch of these filters is shown below.
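The following sketch illustrates both filter families. The thresholds follow the paper; the pycld2/langid.py usage is our best-guess reconstruction, not the authors' actual filtering script, and the digit check is an approximation.

```python
# A sketch of the heuristic and language-ID filters described above.
import re

import pycld2
from langid.langid import LanguageIdentifier, model

LANGID = LanguageIdentifier.from_modelstring(model, norm_probs=True)
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

def passes_heuristics(src: str, trg: str, max_ratio: float = 3.0,
                      min_len: int = 4, max_len: int = 100) -> bool:
    s, t = src.split(), trg.split()
    if not s or not t:
        return False                                   # empty alignment
    if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
        return False                                   # length-difference ratio
    if min(len(s), len(t)) < min_len or max(len(s), len(t)) > max_len:
        return False                                   # too short or too long
    if any(len(w) >= 40 for w in s + t):
        return False                                   # extremely long words
    if TAG_RE.search(src) or TAG_RE.search(trg):
        return False                                   # HTML/XML tags
    if re.findall(r"[1-9]+", src) != re.findall(r"[1-9]+", trg):
        return False                                   # non-zero digits differ
    end_s, end_t = src.rstrip()[-1], trg.rstrip()[-1]
    if (end_s in ".!?:;" or end_t in ".!?:;") and end_s != end_t:
        return False                                   # punctuation mismatch
    return True

def reliably_in_language(sentence: str, lang: str) -> bool:
    """Require agreement of both classifiers, as described in the text."""
    is_reliable, _, details = pycld2.detect(sentence)
    # details[0] is the best guess: (name, code, percent, score).
    if not (is_reliable and details[0][1] == lang and details[0][2] >= 90):
        return False
    lid_lang, lid_prob = LANGID.classify(sentence)
    return lid_lang == lang and lid_prob >= 0.9
```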

Word alignment filter: Statistical word alignment models implement a way of measuring the likelihood of parallel sentences. IBM-style alignment models estimate the probability p(f | a, e) of a foreign sentence f given an "emitted" sentence e and an alignment a between them. Training word alignment models and aligning large corpora is very expensive using traditional methods and implementations. Fortunately, we can rely on eflomal (available from https://github.com/robertostling/eflomal), an efficient word aligner based on Gibbs sampling (Östling and Tiedemann, 2016). Recently, the software has been updated to allow the storage of model priors, which makes it possible to initialize the aligner with previously stored model parameters. This is handy for our filtering needs, as we can now train a model on clean parallel data and apply that model to estimate alignment probabilities of noisy data sets.

We train the alignment model on Europarl and news test sets from previous WMTs for English–Finnish, and on NewsCommentary for English–German. For both language pairs, we train a Bayesian HMM alignment model with fertilities in both directions and estimate the model priors from the symmetrized alignment. We then use those priors to run the alignment of the noisy data sets with only a single iteration of the final model, to avoid a strong influence of the noisy data on the alignment parameters. As it is intractable to estimate a fully normalized conditional probability of a sentence pair under the given higher-level word alignment model, eflomal estimates a score based on the maximum unnormalized log-probability of links in the last sampling iteration. In practice, this seems to work well, and we take that value to rank sentence pairs by their alignment quality. In our experiments, we set an arbitrary threshold of 7 for that score, which seems to balance recall and precision well according to some superficial inspection of the ranked data. The word alignment filter is applied to all web data as well as to the back-translations of monolingual news.

Language model filter: The most traditional data filtering method is probably to apply a language model. The advantage of language models is that they can be estimated from monolingual data, which may be available in sufficient amounts even for the target domain. In our approach, we opted for a combination of source and target language models and focused on the comparison between scores coming from both models. The idea is to prefer sentence pairs for which not only the cross-entropies of the individual sentences (H(S, q_s) and H(T, q_t)) are low with respect to in-domain LMs, but for which also the absolute difference between the cross-entropies (abs(H(S, q_s) − H(T, q_t))) of aligned source and target sentences is low. The intuition is that both models should be roughly similarly surprised when observing sentences that are translations of each other. In order to make the values comparable, we trained our language models on parallel data sets.

For English–Finnish, we used news test data from 2015–2017 as the only available in-domain parallel training data, and for English–German we added the NewsCommentary data set to the news test sets from 2008–2018. As both data sets are small, and we aimed for an efficient and cheap filter, we opted for a traditional n-gram language model in our experiments. To further avoid data sparseness and to improve comparability between source and target language, we also based our language models on BPE-segmented texts, using the same BPE codes as for the rest of the training data. VariKN (Siivola et al., 2007a,b; available from https://vsiivola.github.io/variKN/) is the perfect toolkit for the purpose of estimating n-gram language models with subword units. It implements Kneser-Ney growing and revised Kneser-Ney pruning methods with support for n-grams of varying size and for the estimation of word likelihoods from text segmented in subword units. In our case, we set the maximum n-gram size to 20, and the pruning threshold to 0.002.

Finally, we computed cross-entropies for each sentence pair in the noisy parallel training data and stored 5 values as potential features for filtering: H(S, q_s), H(T, q_t), avg(H(S, q_s), H(T, q_t)), max(H(S, q_s), H(T, q_t)) and abs(H(S, q_s) − H(T, q_t)). Based on some random inspection, we selected a threshold of 13 for the average cross-entropy score, and a threshold of 4 for the cross-entropy difference score. For English–Finnish, we opted for a slightly more relaxed setup to increase coverage, and set the average cross-entropy threshold to 15 and the difference threshold to 5. We applied the language model filter to all web data and to the back-translations of monolingual news. A sketch of the combined score-based decision is given below.
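The decision rule that combines these scores can be stated compactly. The sketch below applies the thresholds reported above for English–German; it assumes the two cross-entropies and the eflomal score were computed beforehand, and that lower alignment scores indicate better alignments.

```python
# A sketch of the combined score-based filtering decision. The thresholds
# are the English–German ones; English–Finnish relaxes them to 15 and 5.
from dataclasses import dataclass

@dataclass
class PairScores:
    h_src: float        # H(S, q_s): cross-entropy of the source sentence
    h_trg: float        # H(T, q_t): cross-entropy of the target sentence
    align_score: float  # eflomal score (max unnormalized log-probability)

def keep_pair(scores: PairScores,
              max_avg_ce: float = 13.0,
              max_ce_diff: float = 4.0,
              max_align_score: float = 7.0) -> bool:
    """Accept a sentence pair only if all three filters pass."""
    avg_ce = (scores.h_src + scores.h_trg) / 2.0
    ce_diff = abs(scores.h_src - scores.h_trg)
    return (avg_ce <= max_avg_ce
            and ce_diff <= max_ce_diff
            and scores.align_score <= max_align_score)

# Usage: keep_pair(PairScores(h_src=11.2, h_trg=12.4, align_score=5.1))
```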


Applying the filters to WMT 2019 data: The impact of our filters on the data provided by WMT 2019 is summarized in Tables 1, 2 and 3. We can see that the ParaCrawl corpus is the one most affected by the filters. A lot of noise can be removed, especially by the language model filter. The strict punctuation filter also has a strong impact on that data set. Naturally, web data does not always come as complete sentences that end with proper final punctuation marks, and the filter might remove quite a bit of useful data. However, our final translation scores reflect that we do not seem to lose substantial amounts of performance even with the strict filters. Nevertheless, for English–Finnish, we still opted for a more relaxed setup to increase coverage, as the strict version removed over 87% of the ParaCrawl data.

It is also interesting to note the differences of individual filters on different data sets. The word alignment filter seems to reject a large portion of the CommonCrawl data set, whereas it does not affect other data sets that much. The importance of language identification can be seen with the ParaCrawl data, whereas other corpora seem to be much cleaner with respect to language.

                   EN–DE    EN–FI
CommonCrawl         3.2%      –
Europarl            0.8%     2.8%
News-Commentary     0.2%      –
ParaCrawl           0.6%      –
Rapid              13.2%     5.2%
WikiTitles          8.0%     4.0%

Table 1: Basic heuristics for filtering – percentage of lines removed. For English–Finnish the statistics for ParaCrawl are not available because the cleanup script was applied after other filters.

                        % rejected
Filter            CC     ParaCrawl   Rapid
LM average CE    31.9%     62.0%     12.7%
LM CE diff       19.0%     12.7%      6.9%
Source lang ID    4.0%     30.7%      7.3%
Target lang ID    8.0%     22.7%      6.2%
Word align       46.4%      3.1%      8.4%
Number           15.3%     16.0%      5.0%
Punct             0.0%     47.4%     18.7%
Total            66.7%     74.7%     35.1%

Table 2: Percentage of lines rejected by each filter for English–German data sets. Each line can be rejected by several filters. The total of rejected lines is the last row of the table.

                        % rejected
                  ParaCrawl          Rapid
Filter          strict   relax   strict   relax
LM avg CE       62.5%    40.0%   50.7%   21.4%
LM CE diff      35.4%    25.7%   44.8%   31.1%
Src lang ID     37.2%    37.2%   11.9%   11.9%
Trg lang ID     29.1%    29.1%    8.5%    8.5%
Word align       8.3%     8.3%    8.3%    8.3%
Number          16.8%    16.8%    6.7%    6.7%
Punct           54.6%     3.3%   23.7%    7.6%
Total           87.9%    64.2%   62.2%   54.8%

Table 3: Percentage of lines rejected by each filter for English–Finnish data sets. The strict version is the same as for English–German, and the relax version applies relaxed thresholds.

2.3 Back-translation

We furthermore created synthetic training data by back-translating news data. We translated the monolingual English news data from the years 2007–2018, from which we used a filtered and sampled subset of 7M sentences for our Finnish–English systems, and the Finnish data from the years 2014–2018 using our WMT 2018 submissions. We also used the back-translations we generated for the WMT 2017 news translation task, where we used an SMT model to create 5.5M sentences of back-translated data from the Finnish news2014 and news2016 corpora (Östling et al., 2017).

For the English–German back-translations, we trained a standard transformer model on all the available parallel data and translated the monolingual German data into English. The BLEU score of our back-translation model is 44.24 on newstest 2018. We applied our filtering pipeline to the back-translated pairs, resulting in 10.3M sentence pairs. In addition to the new back-translations, we also included back-translations from the WMT16 data by Sennrich et al. (2016a).

3 Sentence-level approaches

In this section we describe our sentence-level translation models and the experiments in the English-to-German, English-to-Finnish and Finnish-to-English translation directions.

3.1 Model architectures

We experimented with both NMT and rule-based systems. All of our neural sentence-level models are based on the transformer architecture (Vaswani et al., 2017). We used both the OpenNMT-py (Klein et al., 2017) and MarianNMT (Junczys-Dowmunt et al., 2018) frameworks.

Our experiments focused on the following:

• Ensemble models: using ensembles with a combination of independent runs and save-points from a single training run.
• Left-to-right and right-to-left models: transformer models with decoding of the output in left-to-right and right-to-left order.

The English-to-Finnish rule-based system is an enhanced version of the WMT 2018 rule-based system (Raganato et al., 2018).

3.2 English–German

Our sentence-level models for the English-to-German direction are based on ensembles of independent runs and different save-points, as well as save-points fine-tuned on in-domain data. For our submission, we used an ensemble of 9 models containing:

• 4 save-points with the lowest development perplexity taken from a model trained for 300 000 training steps,
• 5 independent models fine-tuned with in-domain data.

All our sentence-level models for the English–German language pair are trained on filtered versions of Europarl, NewsCommentary, Rapid, CommonCrawl, ParaCrawl, Wikititles, and back-translations. For in-domain fine-tuning, we use newstest 2011–2016. Our submission is composed of transformer-big models implemented in OpenNMT-py with 6 layers of hidden size 4096, 16 attention heads, and a dropout of 0.1. The differences in development performance between the best single model, an ensemble of save-points of a single training run, and our final submission are reported in Table 4. We gain 2 BLEU points with the ensemble of save-points, and an additional 0.8 points by adding in-domain fine-tuned models into the ensemble. This highlights the well-known effectiveness of ensembling and domain adaptation for translation quality.

Model                          BLEU news2018
Single model                       44.61
5 save-points                      46.65
5 save-points + 4 fine-tuned       47.45

Table 4: English–German development results comparing the best single model, an ensemble of 5 save-points, and an ensemble of 5 save-points and 4 independent runs fine-tuned on in-domain data.

Furthermore, we trained additional models using MarianNMT with the same training data and fine-tuning method. In this case, we also included right-to-left decoders that are used as a complement to the standard left-to-right decoders in rescoring approaches. In total, we again end up with 9 models, including:

• 3 independent models trained for left-to-right decoding,
• 3 independent models trained for right-to-left decoding,
• 3 save-points based on continued training of one of the left-to-right decoding models.

The save-points were added later, as we found out that the models kept on improving when using larger mini-batches and less frequent validation in early stopping. Table 5 lists the results of various models on the development test data from 2018 (the rescoring combination is sketched below).

                        BLEU news2018
Model                  Basic    Fine-tuned
L2R run 1              43.63      45.31
L2R run 2              43.52      45.14
L2R run 3              43.33      44.93
L2R run 3 cont'd 1     43.65      45.11
L2R run 3 cont'd 2     43.76      45.43
L2R run 3 cont'd 3     43.53      45.67
Ensemble all L2R       44.61      46.34
Rescore all L2R          –        46.49
R2L run 1              42.14      43.80
R2L run 2              41.96      43.67
R2L run 3              42.17      43.91
Ensemble all R2L       43.03      44.70
Rescore all R2L          –        44.73
Rescore all L2R+R2L      –        46.98

Table 5: English–German results from individual MarianNMT transformer models and their combinations (cased BLEU).
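The "Rescore" rows of Table 5 merge the n-best lists of several models and re-rank the merged list by summed model scores, as described further below. Here is a minimal sketch of that combination step; the scorer interface is our own stand-in for forced-decoding log-probabilities from the actual models.

```python
# A minimal sketch of the rescoring combination behind the "Rescore" rows:
# n-best lists from several models are merged, and each hypothesis is
# re-ranked by the sum of all model scores.
from typing import Callable, Iterable, List

def rescore_nbest(nbest_lists: Iterable[List[str]],
                  scorers: List[Callable[[str], float]]) -> List[str]:
    """Merge n-best lists and sort hypotheses by their summed model scores."""
    merged = set()
    for nbest in nbest_lists:
        merged.update(nbest)  # merging effectively increases the beam size
    return sorted(merged,
                  key=lambda hyp: sum(score(hyp) for score in scorers),
                  reverse=True)

# Usage sketch, combining three L2R and three R2L models:
# ranked = rescore_nbest([m.nbest for m in models], [m.logprob for m in models])
# best_translation = ranked[0]
```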

There are various trends that are interesting to point out. First of all, fine-tuning gives a consistent boost of 1.5 or more BLEU points. Our initial runs used a validation frequency of 5 000 steps and a single GPU with dynamic mini-batches that fit in 13G of memory. The stopping criterion was set to 10 validation steps without improving cross-entropy on held-out data (newstest 2015 + 2016). Later on, we switched to multi-GPU training with two GPUs and early stopping after 20 validation steps. The dynamic batching method of MarianNMT produces larger mini-batches once there is more memory available, and multi-GPU settings simply multiply the working memory for that purpose.

We realized that this change enabled the system to continue training substantially, and Table 5 illustrates the gains of that process for the third L2R model.

Another observation is that right-to-left decoding models generally work less well compared to the corresponding left-to-right models. This is also apparent with the fine-tuned and ensemble models that combine independent runs. The difference is significant, at about 1.5 BLEU points or more. Nevertheless, they still contribute to the overall best score when re-scoring n-best lists from all models in both decoding directions. In this setup, re-scoring is done by simply summing the individual model scores. Table 5 also shows that re-scoring is better than ensembles for model combinations with the same decoding direction, because re-scoring effectively increases the beam size as the hypotheses from different models are merged before re-ranking the combined and re-scored n-best lists.

The positive effect of beam search is further illustrated in Figure 1. All previous models were run with a beam size of 12. As we can see, the general trend is that larger beams lead to improved performance, at least up to the limit of 64 in our experiments. Beam size 4 is an exception for the left-to-right models.

[Figure 1: The effect of beam size (1 to 64) on translation performance (BLEU on news2018) for L2R and R2L model ensembles; the scores are case-sensitive.]

3.3 English–Finnish and Finnish–English

The problem of open-vocabulary translation is particularly acute for morphologically rich languages like Finnish. In recent NMT research, the standard approach consists of applying a word segmentation algorithm such as BPE (Sennrich et al., 2016b) or SentencePiece (Kudo and Richardson, 2018) during pre-processing. In recent WMT editions, various alternative segmentation approaches were examined for Finnish: hybrid models that back off to character-level representations (Östling et al., 2017), and variants of the Morfessor unsupervised morphology algorithm (Grönroos et al., 2018). This year, we experimented with rule-based word segmentation based on Omorfi (Pirinen, 2015). Omorfi is a morphological analyzer for Finnish with a large-coverage lexicon. Its segmentation tool (https://flammie.github.io/omorfi/pages/usage-examples.html#morphological-segmentation) splits a word form into morphemes as defined by the morphological rules. In particular, it distinguishes prefixes, infixes and suffixes through different segmentation markers:

Intia→ ←n ja Japani→ ←n pää→ ←ministeri→ ←t tapaa→ ←vat Tokio→ ←ssa
India GEN and Japan GEN prime minister PL meet 3PL Tokyo INE

While Omorfi provides word segmentation based on morphological principles, it does not rely on any frequency cues. Therefore, the standard BPE algorithm is run over the Omorfi-segmented text in order to split low-frequency morphemes; a sketch of this two-step segmentation is given below. In this experiment, we compare two models for each translation direction:

• one model segmented with the standard BPE algorithm (joint vocabulary size of 50 000, vocabulary frequency threshold of 50);
• one model where the Finnish side is pre-segmented with Omorfi, and both the Omorfi-segmented Finnish side and the English side are segmented with BPE (same parameters as above).

All models are trained on filtered versions of Europarl, ParaCrawl, Rapid, Wikititles, newsdev2015 and newstest2015, as well as back-translations. Following our experiments at WMT 2018 (Raganato et al., 2018), we also use domain labels (⟨EP⟩ for Europarl, ⟨WEB⟩ for ParaCrawl, Rapid and Wikititles, and ⟨NEWS⟩ for newsdev, newstest and the back-translations). We use newstest2016 for validation. All models are trained with MarianNMT, using the standard transformer architecture.
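To make the two-step scheme concrete, the following is a small sketch assuming subword-nmt for the BPE step. The `omorfi_segment` function is a stand-in placeholder for Omorfi's segmentation tool, and the toy corpus and BPE settings do not match the paper's joint 50 000-symbol vocabulary with frequency threshold 50.

```python
# A sketch of BPE applied on top of Omorfi-style morpheme segmentation,
# using the subword_nmt package for the BPE step.
from io import StringIO

from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

def omorfi_segment(sentence: str) -> str:
    """Stand-in: a real system would call Omorfi's segmenter here."""
    return sentence

# The example sentence is already morpheme-segmented with Omorfi's markers;
# it is repeated to give the toy corpus non-trivial pair frequencies.
corpus = ["Intia→ ←n ja Japani→ ←n pää→ ←ministeri→ ←t tapaa→ ←vat Tokio→ ←ssa"] * 3
presegmented = [omorfi_segment(line) for line in corpus]

codes = StringIO()
learn_bpe(presegmented, codes, num_symbols=10)  # learn BPE codes on morphemes
codes.seek(0)
bpe = BPE(codes)

print(bpe.process_line(presegmented[0]))        # BPE on top of Omorfi output
```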

Figures 2 and 3 show the evolution of BLEU scores on news2016 during training. For English–Finnish, the Omorfi-segmented system shows slightly higher results during the first 40 000 training steps, but is then outperformed by the plain BPE-segmented system. For Finnish–English, the Omorfi-segmented system obtains higher BLEU scores much longer, until both systems converge after about 300 000 training steps.

[Figure 2: Evolution of English–Finnish BLEU scores (y-axis) during training, comparing BPE and Omorfi+BPE over 0–300,000 training steps.]

[Figure 3: Evolution of Finnish–English BLEU scores (y-axis) during training, comparing BPE and Omorfi+BPE over 0–400,000 training steps.]

Table 6 compares BLEU scores for the 2017 to 2019 test sets. The Omorfi-based system shows consistent improvements when used on the source side, i.e. from Finnish to English. However, due to timing constraints, we were not able to integrate the Omorfi-based segmentation into our final submission systems. In any case, the difference observed on the news2019 set after the submission deadline is within the bounds of random variation.

Data set    ∆ BLEU EN-FI    ∆ BLEU FI-EN
news2017       −0.47           +0.36
news2018       −0.61           +0.38
news2019       +0.19           +0.04

Table 6: BLEU score differences between Omorfi-segmented and BPE-segmented models. Positive values indicate that the Omorfi+BPE model is better, negative values indicate that the BPE model is better.

We tested additional transformer models segmented with the SentencePiece toolkit, using a shared vocabulary of 40k tokens trained only on the parallel corpora. We did this with the purpose of comparing a segmentation tool tailored specifically to Finnish (Omorfi) with a more general one. These models were trained with the same specifications as the previous ones, including the transformer hyperparameters, the training and development data and the domain labeling. Since we used OpenNMT-py to train these models, it is difficult to know whether the differences come from the segmentation or the toolkit. We, however, find it informative to present these results. Table 7 presents the BLEU scores obtained with both systems. We notice that both systems yield similar scores for both translation directions. SentencePiece models are consistently ahead of Omorfi+BPE, but the difference is so small that it cannot be considered convincing or significant.

Our final models for English-to-Finnish are standard transformer models with BPE-based segmentation, trained using MarianNMT with the same settings and hyper-parameters as in the other experiments. We used the training data filtered with the relaxed settings of the language model filter to obtain better coverage for this language pair. The provided training data is much smaller, and we also have less back-translated data at our disposal, which motivated us to lower the threshold for taking examples from web-crawled data.

Domain fine-tuning is done as well, using news test sets from 2015, 2016 and 2018. The results on the development test data from 2017 are listed in Table 8.

Model                  news2017    news2019
SentencePiece EN-FI      25.60       20.60
Omorfi+BPE EN-FI         25.50       20.13
SentencePiece FI-EN      31.50       25.00
Omorfi+BPE FI-EN         31.21       24.06

Table 7: BLEU score comparison between SentencePiece and Omorfi+BPE-segmented models.

                 BLEU news2017
Model             L2R      R2L
Run 1            27.68    28.01
Run 2            28.64    28.77
Run 3            28.64    28.41
Ensemble         29.54    29.76
Rescored         29.60    29.72
 – L2R+R2L          30.66
Top matrix          21.7

Table 8: Results from individual MarianNMT transformer models and their combinations for English to Finnish (cased BLEU). The top matrix result refers to the best system reported in the on-line evaluation matrix (accessed on May 16, 2019).

A striking difference to English–German is that the right-to-left decoding models are on par with the other direction. The scores are substantially higher than the currently best (post-WMT 2017) system reported in the on-line evaluation matrix for this test set, even though that system also refers to a transformer with a similar architecture and back-translated monolingual data. That system does not contain data derived from ParaCrawl, which was not available at the time, and the improvements we achieve demonstrate the effectiveness of our data filtering techniques on the noisy on-line data.

For Finnish-to-English, we trained MarianNMT models using the same transformer architecture as for the other language pairs. Table 9 shows the scores of individual models and their combinations on the development test set of news from WMT 2017. All models are trained on the same filtered training data using the strict settings of the language model filter, including the back-translations produced for English monolingual news.

                          BLEU news2017
Model                      L2R      R2L
Run 1                     32.26    31.70
Run 2                     31.91    31.83
Run 3                     32.68    31.81
Ensemble                  33.23    33.03
Rescored                  33.34    32.98
 – L2R+R2L                   33.95
Top (with ParaCrawl)         34.6
Top (without ParaCrawl)      25.9

Table 9: Results from individual MarianNMT transformer models and their combinations for Finnish to English (cased BLEU). Results denoted as top refer to the top systems reported at the on-line evaluation matrix (accessed on May 16, 2019), one trained with the 2019 data sets and one with 2017 data.

In contrast to English-to-German, the models in the two decoding directions are quite similar again, and the difference between left-to-right and right-to-left models is rather small. The importance of the new data sets from 2019 is visible again, and our system performs similarly to, but still slightly below, the best system that has been submitted this year to the on-line evaluation matrix on the 2017 test set.

3.4 The English–Finnish rule-based system

Since the WMT 2018 challenge, there has been development in four areas of the translation process in the rule-based system for English–Finnish:

1. The standard method of handling English noun compounds was to treat them as multiword expressions (MWE). This method allows many kinds of translations, even multiple translations, which can be handled in semantic disambiguation. However, because noun compounding is a common phenomenon, a default handling method was also developed for such cases, where two or more consecutive nouns are individually translated and glued together as a single word. The system works so that if the noun combination is not handled as an MWE, the second strategy is applied (Hurskainen, 2018a).

2. The translation of various types of questions has been improved. Especially the translation of indirect questions was defective, because the use of "if" in the role of initiating an indirect question was not implemented. The conjunction "if" is ambiguous, because it is also used for initiating a conditional clause (Hurskainen, 2018b).

3. Substantial rule optimization was carried out. When rules are added during development, the result is often not optimal. There are obsolete rules, and the rules may need new ordering. As a result, a substantial number of rules (30%) were removed and others were reordered. This has an effect on translation speed but not on the translation result (Hurskainen, 2018c).

4. Temporal subordinate clauses, which start with the conjunction "when" or "while", can be translated with corresponding subordinate clauses in Finnish. However, such clauses are often translated with participial phrase constructions. Translation with such constructions was tested. The results show that although they can be implemented, they are prone to mistakes (Hurskainen, 2018d).

These improvements to the translation system contribute to the fluency and accuracy of translations.

4 Document-level approaches

To evaluate the effectiveness of various document-level translation approaches for the English–German language pair, we experimented with a number of different approaches, which are described below. In order to test the ability of the systems to pick up document-level information, we also created a shuffled version of the news data from 2018. We then test our systems both on the original test set, with coherent test data divided into short news documents, and on the shuffled test set with broken coherence.

4.1 Concatenation models

Some of the previously published approaches use concatenation of multiple source-side sentences in order to extend the context of the currently translated sentence (Tiedemann and Scherrer, 2017). In addition to the source-side concatenation model, we also tested an approach where we concatenate the previously translated sentence with the current source sentence. The concatenation approaches we tested are listed below; a sketch of how such training examples can be built follows the results in Table 10.

• MT-concat-source: (2+1) Concatenating the previous source sentence with the current source sentence (Tiedemann and Scherrer, 2017). (3+1a) Concatenating the previous two sentences with the current source sentence. (3+1b) Concatenating the previous, the current and the next sentence in the source language.

• MT-concat-target: (1t+1s+1) Concatenating the previously translated (target) sentence with the current source sentence.

• MT-concat-source-target: (2+2) Concatenating the previous with the current source sentence and translating into the previous and the current target sentence (Tiedemann and Scherrer, 2017). Only the second sentence of the translation is kept for the evaluation of translation quality.

Extended context models only make sense with coherent training data. Therefore, we ran experiments only with the training data that contain translated documents, i.e. Europarl, NewsCommentary, Rapid and the back-translations of the German news from 2018. Hence, the baseline is lower than a sentence-level model trained on the complete data sets provided by WMT. Table 10 summarizes the results on the development test data (news 2018).

             BLEU news2018
System      Shuffled   Coherent
Baseline     38.96      38.96
2+1          36.62      37.17
3+1a         33.90      34.30
3+1b         34.14      34.39
1t+1s+1      36.82      37.24
2+2          38.53      39.08

Table 10: Comparison of concatenation approaches for English–German document-level translation.
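As referenced above, the following is a minimal sketch of how 2+1 training examples could be constructed from document-aligned data. The `<SEP>` separator token is our arbitrary choice; the actual break symbol is not specified here.

```python
# A minimal sketch of building 2+1 training examples: the previous source
# sentence is prepended to the current one, respecting document boundaries.
from typing import List, Tuple

def make_2plus1(documents: List[List[Tuple[str, str]]],
                sep: str = " <SEP> ") -> List[Tuple[str, str]]:
    """documents: each document is a list of (source, target) sentence pairs."""
    examples = []
    for doc in documents:
        for i, (src, trg) in enumerate(doc):
            if i > 0:
                src = doc[i - 1][0] + sep + src  # prepend previous source
            examples.append((src, trg))
    return examples

doc = [("Sentence one .", "Satz eins ."), ("Sentence two .", "Satz zwei .")]
print(make_2plus1([doc]))
# [('Sentence one .', 'Satz eins .'),
#  ('Sentence one . <SEP> Sentence two .', 'Satz zwei .')]
```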
All order to extend the context of the currently trans- but one of the concatenation models underperform lated sentence (Tiedemann and Scherrer, 2017). In and cannot beat the sentence-level baseline. Note addition to the source-side concatenation model, that the concat-target model (1t+1s+1) even refers we also tested an approach where we concatenate to an oracle experiment in which the reference

As this is not very successful, we did not even try to run a proper evaluation with system output provided as target context during testing. Despite the shortcomings, we can nevertheless see a consistent pattern indicating that the extended context models indeed pick up information from discourse. For all models we observe a gain of about half a BLEU point when comparing the shuffled to the non-shuffled versions of the test set. This is interesting and encourages us to study these models further in future work, possibly with different data sets, training procedures and slightly different architectures.

4.2 Hierarchical attention models

A number of approaches have been developed to utilize the attention mechanism to capture extended context for document-level translation. We experimented with the two following models:

• NMT-HAN: Sentence-level transformer model with a hierarchical attention network to capture the document-level context (Miculicich et al., 2018).

• selectAttn: Selective attention model for context-aware neural machine translation (Maruf et al., 2019).

For testing the selectAttn model, we used the same data with document-level information as we applied in the concatenation models. For NMT-HAN we had to use a smaller training set due to lack of resources and due to the implementation not supporting data shards; we used only Europarl, NewsCommentary and Rapid for training. Table 11 summarizes the results on the development test data. Both of the tested models need to be trained on the sentence level first, before tuning the document-level components.

Model        Sentence-level    Document-level
NMT-HAN          35.03             31.73
selectAttn       35.26             34.75

Table 11: Results (case-sensitive BLEU) of the hierarchical attention models on the coherent newstest 2018 dataset.

The architecture of the selective attention model is based on the general transformer model, but with quite a different setup in terms of hyperparameters and dimensions of layer components. We applied the basic settings following the documentation of the software. In particular, the model includes 4 layers and 8 attention heads, and the dimensionality of the hidden layers is 512. We applied a sublayer and attention dropout of 0.1 and trained the sentence-level model for about 3.5 epochs. We selected monolingual source-side context for our experiments and hierarchical document attention with sparse softmax. Otherwise, we apply the default parameters suggested in the documentation with respect to optimizers, learning rates and dropout. Unfortunately, the results do not look very promising, as we can see in Table 11. The document-level model does not even reach the performance of the sentence-level model, even though we trained until convergence on development data with a patience of 10 reporting steps, which is quite disappointing. Overall, the scores are below the standard transformer models of the other experiments, and hence, we did not try to further optimize the results using that model.

For the NMT-HAN model we used the implementation of Miculicich et al. (2018) with the recommended hyperparameter values and settings. The system is based on the OpenNMT-py implementation of the transformer. The model includes 6 hidden layers on both the encoder and decoder side with a dimensionality of 512, and the multi-head attention has 8 attention heads. We applied a sublayer and attention dropout of 0.1. The target and source vocabulary size is 30K. We trained the sentence-level model for 20 epochs, after which we further fine-tuned the encoder-side hierarchical attention for 1 epoch and the joint encoder-decoder hierarchical attention for 1 epoch. The results for the NMT-HAN model are disappointing: the document-level model performs significantly worse than the sentence-level model.

5 Results from WMT 2019

Table 12 summarizes our results from the WMT 2019 news task. We list the official scores of the submitted systems and post-WMT scores that come from the models described above. For Finnish–English and English–Finnish, the submitted systems correspond to premature single models that had not converged yet. Our submitted English–German model is the ensemble of 9 models described in Section 3.2.

Language pair      Model        BLEU
English–German     submitted    41.4
                   L2R+R2L      42.95
Finnish–English    submitted    26.7
                   L2R+R2L      27.80
English–Finnish    submitted    20.8
                   rule-based    8.9
                   L2R+R2L      23.4

Table 12: Final results (case-sensitive BLEU scores) on the 2019 news test set; partially obtained after the deadline.

The ensemble results clearly outperform the submitted systems but were not ready in time. We are still below the best performing systems from the official participants of this year's campaign, but the final models perform in the top range of all three tasks. For English–Finnish, our final score would end up in third place (12 submissions from 8 participants), for Finnish–English it would be the fourth-best participant (out of 9), and for English–German the fifth-best participant (out of 19 with 28 submissions).

6 Conclusions

In this paper, we presented our submissions for the WMT 2019 news translation task in three language pairs: English–German, English–Finnish and Finnish–English.

For all the language pairs we spent considerable time on cleaning and filtering the training data, which resulted in a significant reduction of training examples without a negative impact on translation quality.

For English–German we focused both on sentence-level neural machine translation models and on document-level models. For English–Finnish, our submissions consist of an NMT system as well as a rule-based system, whereas the Finnish–English system is an NMT system. For the English–Finnish and Finnish–English language pairs, we compared the impact of different segmentation approaches. Our results show that the different segmentation approaches do not significantly impact BLEU scores. However, our experiments highlight the well-known fact that ensembling and domain adaptation have a significant positive impact on translation quality.

One surprising finding was that none of the document-level approaches really worked, with some even having a negative effect on translation quality.

Acknowledgments

The work in this paper was supported by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No 780069. The authors gratefully acknowledge the support of the Academy of Finland through project 314062 from the ICT 2023 call on Computation, Machine Learning and Artificial Intelligence, and projects 270354/273457.

References

Stig-Arne Grönroos, Sami Virpioja, and Mikko Kurimo. 2018. Cognate-aware morphological segmentation for multilingual neural translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 386–393, Brussels, Belgium. Association for Computational Linguistics.

Arvi Hurskainen. 2018a. Compound nouns in English to Finnish machine translation. Technical Reports in Language Technology 32, University of Helsinki.

Arvi Hurskainen. 2018b. Direct and indirect questions in English to Finnish machine translation. Technical Reports in Language Technology 33, University of Helsinki.

Arvi Hurskainen. 2018c. Optimizing rules in English to Finnish machine translation. Technical Reports in Language Technology 34, University of Helsinki.

Arvi Hurskainen. 2018d. Participial phrases in English to Finnish machine translation. Technical Reports in Language Technology 35, University of Helsinki.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Demo and Poster Sessions, Prague, Czech Republic. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea. Association for Computational Linguistics.

Sameen Maruf, André F. T. Martins, and Gholamreza Haffari. 2019. Selective attention for context-aware neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3092–3102, Minneapolis, Minnesota. Association for Computational Linguistics.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Robert Östling, Yves Scherrer, Jörg Tiedemann, Gongbo Tang, and Tommi Nieminen. 2017. The Helsinki neural machine translation system. In Proceedings of the Second Conference on Machine Translation, pages 338–347, Copenhagen, Denmark. Association for Computational Linguistics.

Robert Östling and Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146.

Tommi A. Pirinen. 2015. Omorfi – free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 313–315, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Alessandro Raganato, Yves Scherrer, Tommi Nieminen, Arvi Hurskainen, and Jörg Tiedemann. 2018. The University of Helsinki submissions to the WMT18 news task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 488–495, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Vesa Siivola, Mathias Creutz, and Mikko Kurimo. 2007a. Morfessor and VariKN machine learning tools for speech and language technology. In 8th Annual Conference of the International Speech Communication Association (Interspeech 2007), Antwerp, Belgium, August 27–31, 2007, pages 1549–1552. ISCA.

Vesa Siivola, Teemu Hirsimäki, and Sami Virpioja. 2007b. On growing and pruning Kneser-Ney smoothed n-gram models. IEEE Transactions on Audio, Speech & Language Processing, 15(5):1617–1624.

Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2018. The University of Cambridge's machine translation systems for WMT18. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 508–516, Brussels, Belgium. Association for Computational Linguistics.

Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
