The University of Helsinki submissions to the WMT19 news translation task

Aarne Talman,∗† Umut Sulubacak,∗ Raúl Vázquez,∗ Yves Scherrer,∗ Sami Virpioja,∗ Alessandro Raganato,∗† Arvi Hurskainen∗ and Jörg Tiedemann∗
∗University of Helsinki, †Basement AI
[email protected]

Abstract

In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English–German, English–Finnish and Finnish–English. This year, we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English–German, we trained sentence-level transformer models and compared different document-level translation approaches. For Finnish–English and English–Finnish we focused on different segmentation approaches, and we also included a rule-based system for English–Finnish.

1 Introduction

The University of Helsinki participated in the WMT 2019 news translation task with four primary submissions. We submitted neural machine translation systems for English-to-Finnish, Finnish-to-English and English-to-German, and a rule-based machine translation system for English-to-Finnish.

Most of our efforts for this year's WMT focused on data selection and pre-processing (Section 2), sentence-level translation models for English-to-German, English-to-Finnish and Finnish-to-English (Section 3), document-level translation models for English-to-German (Section 4), and a comparison of different word segmentation approaches for Finnish (Section 3.3). The final submitted NMT systems are summarized in Section 5, while the rule-based machine translation system is described in Section 3.4.

2 Pre-processing, data filtering and back-translation

It is well known that data pre-processing and selection have a huge effect on translation quality in neural machine translation. We spent substantial effort on filtering the data in order to reduce noise, especially in the web-crawled data sets, and to match the target domain of news data.

The resulting training sets, after applying the steps described below, are 15.7M sentence pairs for English–German, 8.5M sentence pairs for English–Finnish, and 12.3M–26.7M sentence pairs (different samplings of back-translations) for Finnish–English.

2.1 Pre-processing

For each language, we applied a series of pre-processing steps using scripts available in the Moses decoder (Koehn et al., 2007):

• replacing unicode punctuation,
• removing non-printing characters,
• normalizing punctuation,
• tokenization.

In addition to these steps, we replaced a number of English contractions with their full forms, e.g. "They're" → "They are". After the above steps, we applied a Moses truecaser model trained separately for each language, and finally a byte-pair encoding (BPE) (Sennrich et al., 2016b) segmentation using one set of codes per language pair.
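As a concrete illustration, the following minimal sketch chains these steps together. It assumes a local checkout of the Moses scripts and the subword-nmt package; the paths and the helper function are our own illustration rather than the authors' actual pipeline, and the contraction-replacement step is omitted.

```python
import subprocess

MOSES = "mosesdecoder/scripts"  # path to a Moses checkout (assumption)

def preprocess(in_path, out_path, lang, truecase_model, bpe_codes):
    """Apply the steps listed above in order: unicode punctuation
    replacement, non-printing character removal, punctuation
    normalization, tokenization, then truecasing and BPE."""
    pipeline = (
        f"cat {in_path}"
        f" | perl {MOSES}/tokenizer/replace-unicode-punctuation.perl"
        f" | perl {MOSES}/tokenizer/remove-non-printing-char.perl"
        f" | perl {MOSES}/tokenizer/normalize-punctuation.perl -l {lang}"
        f" | perl {MOSES}/tokenizer/tokenizer.perl -l {lang}"
        f" | perl {MOSES}/recaser/truecase.perl --model {truecase_model}"
        f" | subword-nmt apply-bpe -c {bpe_codes}"
        f" > {out_path}"
    )
    subprocess.run(pipeline, shell=True, check=True)

# e.g.: preprocess("train.en", "train.bpe.en", "en",
#                  "truecase-model.en", "bpe.codes.ende")
```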
For English–German, we initially pre-processed the data using only punctuation normalization and tokenization. We subsequently trained an English truecaser model using all monolingual English data as well as the English side of all parallel English–German datasets except the Rapid corpus (in which non-English characters were missing from a substantial portion of the German sentences). We repeated the same for German. Afterwards, we used a heuristic cleanup script (shared by Marcin Junczys-Dowmunt: https://gist.github.com/emjotde/4c5303e3b2fc501745ae016a8d1e8e49) to filter suspicious samples out of Rapid, and then truecased all parallel English–German data (including the filtered Rapid) using these models. Finally, we trained BPE codes with 35,000 symbols jointly for English–German on the truecased parallel sets. For all further experiments with English–German data, we applied the full set of tokenization steps as well as truecasing and BPE segmentation.

For English–Finnish, we first applied the standard tokenization pipeline. For English and Finnish respectively, we trained truecaser models on all English and Finnish monolingual data as well as the English and Finnish sides of all parallel English–Finnish datasets. As we had found to be optimal in our previous year's submission (Raganato et al., 2018), we trained a BPE model with a vocabulary of 37,000 symbols, trained jointly only on the parallel data. Furthermore, for some experiments, we also used domain labeling. We marked the datasets with three different labels: <NEWS> for the development and test data from 2015, 2016 and 2017, <EP> for Europarl, and <WEB> for ParaCrawl and Wikititles.

2.2 Data filtering

For data filtering we applied four types of filters: (i) rule-based heuristics, (ii) filters based on language identification, (iii) filters based on word alignment models, and (iv) language model filters.

Heuristic filters: The first step in cleaning the data applies a number of heuristics, largely inspired by Stahlberg et al. (2018), including the following (a code sketch is given after the list):

• removing all sentence pairs with a length difference ratio above a certain threshold: for CommonCrawl, ParaCrawl and Rapid we used a threshold of 3, for WikiTitles a threshold of 2, and for all other data sets a threshold of 9;
• removing pairs with short sentences: for CommonCrawl, ParaCrawl and Rapid we required a minimum length of four words;
• removing pairs with very long sentences: we restricted all data to a maximum length of 100 words;
• removing sentences with extremely long words: we excluded all sentence pairs containing words of 40 or more characters;
• removing sentence pairs that include HTML or XML tags;
• decoding common HTML/XML entities;
• removing empty alignments (while keeping document boundaries intact);
• removing pairs where the sequences of non-zero digits occurring in the two sentences do not match;
• removing pairs where one sentence is terminated with a punctuation mark and the other is either missing terminal punctuation or terminated with a different punctuation mark.
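The sketch below is our own minimal illustration of these heuristics, not the authors' script. Thresholds are parameters because they vary per corpus, entity decoding is omitted, and the digit and terminal-punctuation checks are simplified readings of the corresponding bullets.

```python
import re

TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")   # crude HTML/XML tag detector
NONZERO_RE = re.compile(r"[1-9]+")          # runs of non-zero digits
TERMINAL = set(".!?")                       # terminal punctuation (assumption)

def keep_pair(src, tgt, max_ratio=3.0, min_len=4, max_len=100, max_word_len=40):
    """Return True if a sentence pair passes the heuristics listed above."""
    s, t = src.split(), tgt.split()
    if not s or not t:                                         # empty alignment
        return False
    if min(len(s), len(t)) < min_len:                          # too short
        return False
    if max(len(s), len(t)) > max_len:                          # too long
        return False
    if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:  # length ratio
        return False
    if any(len(w) >= max_word_len for w in s + t):             # very long words
        return False
    if TAG_RE.search(src) or TAG_RE.search(tgt):               # markup
        return False
    if NONZERO_RE.findall(src) != NONZERO_RE.findall(tgt):     # digit mismatch
        return False
    src_p = src.rstrip()[-1] if src.rstrip()[-1] in TERMINAL else None
    tgt_p = tgt.rstrip()[-1] if tgt.rstrip()[-1] in TERMINAL else None
    return src_p == tgt_p                    # terminal punctuation must agree
```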
Language identifiers: There is a surprisingly large number of text segments in the wrong language in the provided parallel training data. This is especially true for the ParaCrawl and Rapid data sets, which is rather unexpected, as a basic language identifier certainly must be part of the crawling and extraction pipeline. Nevertheless, after some random inspection of the data, we found it necessary to apply off-the-shelf language identifiers to remove additional erroneous text from the training data. In particular, we applied the Compact Language Detector version 2 (CLD2) from the Google Chrome project (using the Python interface from pycld2, https://github.com/aboSamoor/pycld2) and the widely used langid.py package (Lui and Baldwin, 2012) to classify each sentence in the ParaCrawl, CommonCrawl, Rapid and Wikititles data sets. We removed all sentence pairs in which the language of one of the aligned sentences was not reliably detected. For this, we required the correct language ID from both classifiers, the reliable-flag set to "True" by CLD2 with a reliability score of 90 or more, and a detection probability of at least 0.9 from langid.py.

Word alignment filter: Statistical word alignment models implement a way of measuring the likelihood of parallel sentences. IBM-style alignment models estimate the probability p(f | a, e) of a foreign sentence f given an "emitted" sentence e and an alignment a between them. Training word alignment models and aligning large corpora is very expensive with traditional methods and implementations. Fortunately, we can rely on eflomal, an efficient word aligner based on Gibbs sampling (Östling and Tiedemann, 2016). Recently, the software has been updated to allow the storage of model priors, which makes it possible to initialize the aligner with previously stored model parameters. This is handy for our filtering needs, as we can now train a model on clean parallel data and apply that model to estimate the alignment probabilities of noisy data sets.

We train the alignment model on Europarl and the news test sets from previous WMTs for English–Finnish, and on NewsCommentary for English–German. For both language pairs, we train a Bayesian HMM alignment model with fertilities in both directions and estimate the model priors from the symmetrized alignment. We then use those priors to align the noisy data sets, running only a single iteration of the final model to avoid a strong influence of the noisy data on the alignment parameters. As it is intractable to estimate a fully normalized conditional probability of a sentence pair under the given higher-level word alignment model, eflomal estimates a score based on the maximum unnormalized log-probability of the sampled alignment.

Language model filter: The final filter scores both sides of a sentence pair with language models for the source and the target language, and keeps a pair only if the difference between the two scores is low. The intuition is that both models should be roughly similarly surprised when observing sentences that are translations of each other. In order to make the values comparable, we trained our language models on parallel data sets. For English–Finnish, we used the news test data from 2015–2017 as the only available in-domain parallel training data, and for English–German we added the NewsCommentary data set to the news test sets from 2008–2018. As both data sets are small and we aimed for an efficient and cheap filter, we opted for traditional n-gram language models. To further avoid data sparseness and to improve comparability between the source and target languages, we based our language models on BPE-segmented texts, using the same BPE codes as for the rest of the training data. VariKN (Siivola et al., 2007a,b) is the perfect toolkit for estimating n-gram language models with subword units: it implements Kneser-Ney growing and revised Kneser-Ney pruning methods, supports n-grams of varying size, and estimates word likelihoods from text segmented into subword units.
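To illustrate the filtering step itself (not the LM estimation, for which the paper uses VariKN), the sketch below keeps a pair when the length-normalized scores of its two sides are close. The per-sentence log-probabilities are assumed to be computed externally, and the threshold value is hypothetical.

```python
def lm_difference_filter(pairs, src_logprobs, tgt_logprobs, max_diff=2.0):
    """Keep sentence pairs whose two sides get similar language model
    scores, i.e. pairs by which both models are 'similarly surprised'.

    pairs: list of (src, tgt) BPE-segmented sentence strings.
    src_logprobs, tgt_logprobs: total log-probability per sentence,
    computed externally (e.g. with VariKN on BPE-segmented text).
    max_diff: hypothetical threshold, not taken from the paper."""
    kept = []
    for (src, tgt), src_lp, tgt_lp in zip(pairs, src_logprobs, tgt_logprobs):
        # length-normalize by the number of BPE tokens on each side
        src_score = src_lp / max(len(src.split()), 1)
        tgt_score = tgt_lp / max(len(tgt.split()), 1)
        if abs(src_score - tgt_score) <= max_diff:
            kept.append((src, tgt))
    return kept
```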