
Tilde at WMT 2020: News Task Systems

Rihards Krišlauks†‡ and Mārcis Pinnis†‡
†Tilde / Vienības gatve 75A, Riga, Latvia
‡Faculty of Computing, University of Latvia / Raiņa bulv. 19, Riga, Latvia
{firstname.lastname}@tilde.lv

Abstract

This paper describes Tilde's submission to the WMT 2020 shared task on news translation for both directions of the English↔Polish language pair in both the constrained and the unconstrained tracks. We follow our submissions from the previous years and build our baseline systems to be morphologically motivated sub-word unit-based Transformer base models that we train using the Marian machine translation toolkit. Additionally, we experiment with different parallel and monolingual data selection schemes, as well as sampled back-translation. Our final models are ensembles of Transformer base and Transformer big models which feature right-to-left re-ranking.

1 Introduction

This year, we developed both constrained and unconstrained NMT systems for the English↔Polish language pair. We base our methods on the submissions of the previous years (Pinnis et al., 2017b, 2018, 2019), including methods for parallel data filtering from Pinnis (2018). Specifically, we lean on Pinnis (2018) and Junczys-Dowmunt (2018) for data selection and filtering, Pinnis et al. (2017b) for morphologically motivated sub-word units and synthetic data generation, Edunov et al. (2018) for sampled back-translation, and finally Morishita et al. (2018) for re-ranking with right-to-left models. We use the Marian toolkit (Junczys-Dowmunt et al., 2018) to train models of the Transformer architecture (Vaswani et al., 2017).

Although document-level NMT as showcased by Junczys-Dowmunt (2019) has yielded promising results for the English-German language pair, we were not able to collect sufficient document-level data for the English-Polish language pair. As a result, all our systems this year translate individual sentences.

The paper is further structured as follows: Section 2 describes the data used to train our NMT systems, Section 3 describes our efforts to identify the best-performing recipes for training of our final systems, Section 5 summarises the results of our final systems, and Section 6 concludes the paper.

2 Data

For training of the constrained NMT systems, we used data from the WMT 2020 shared task on news translation¹. For unconstrained systems, we used data from the Tilde Data Library². The 10 largest publicly available datasets that were used to train the unconstrained systems were OpenSubtitles from the Opus corpus (Tiedemann, 2016), ParaCrawl (Bañón et al., 2020) (although it was discarded due to noise found in the corpus), DGT Translation Memories (Steinberger et al., 2012), Microsoft Translation and User Interface Strings Glossaries³ from multiple releases up to 2018, the Tilde MODEL corpus (Rozis and Skadiņš, 2017), WikiMatrix (Schwenk et al., 2019), the Digital Corpus of the European Parliament (Hajlaoui et al., 2014), JRC-Acquis (Steinberger et al., 2006), Europarl (Koehn, 2005), and the QCRI Educational Domain Corpus (Abdelali et al., 2014).

¹ http://www.statmt.org/wmt20/translation-task.html
² https://www.tilde.com/products-and-services/data-library
³ https://www.microsoft.com/en-us/language/translations

2.1 Data Filtering and Pre-Processing

First, we filtered the data using Tilde's parallel data filtering methods (Pinnis, 2018), which allow discarding sentence pairs that are corrupted, have low content overlap, feature wrong-language content, feature a too high ratio of non-alphabetic characters, etc. The exact filter configuration is defined in the paper by Pinnis (2018). Then, we pre-processed all data using Tilde's parallel data pre-processing workflow, which normalises the text (quotation marks, apostrophes, decodes HTML entities, etc.), identifies non-translatable entities and replaces them with placeholders (e.g., e-mail addresses, Web site addresses, XML tags, etc.), tokenises the text using Tilde's regular expression-based tokeniser, and applies truecasing.
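To make the filtering step above more tangible, the following is a minimal, illustrative sketch of rule-based parallel data filtering in the spirit of Pinnis (2018). The concrete filters and thresholds are those defined in that paper; the checks and cut-off values below (length ratio, non-alphabetic character ratio, digit overlap as a crude content-overlap proxy) are assumptions chosen only for illustration.

```python
# Minimal illustrative sketch of rule-based parallel data filtering.
# The checks and thresholds are assumptions, not Tilde's actual configuration.
import re

def non_alpha_ratio(text: str) -> float:
    """Share of characters that are neither letters nor whitespace."""
    if not text:
        return 1.0
    non_alpha = sum(1 for c in text if not (c.isalpha() or c.isspace()))
    return non_alpha / len(text)

def digit_overlap(src: str, tgt: str) -> float:
    """Overlap of digit tokens as a cheap proxy for content overlap."""
    src_digits, tgt_digits = set(re.findall(r"\d+", src)), set(re.findall(r"\d+", tgt))
    if not src_digits and not tgt_digits:
        return 1.0
    return len(src_digits & tgt_digits) / len(src_digits | tgt_digits)

def keep_pair(src: str, tgt: str) -> bool:
    """Return True if the sentence pair passes all (illustrative) filters."""
    if not src.strip() or not tgt.strip():
        return False                                   # corrupted / empty segments
    ratio = len(src) / max(len(tgt), 1)
    if ratio > 3.0 or ratio < 1 / 3.0:
        return False                                   # implausible length ratio
    if non_alpha_ratio(src) > 0.4 or non_alpha_ratio(tgt) > 0.4:
        return False                                   # too many non-alphabetic characters
    if digit_overlap(src, tgt) < 0.5:
        return False                                   # low content overlap
    return True

pairs = [("The plan costs 25 EUR.", "Plan kosztuje 25 EUR."), ("###", "!!!")]
print([keep_pair(s, t) for s, t in pairs])  # [True, False]
```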

In preliminary experiments, we also identified that morphology-driven word splitting (Pinnis et al., 2017a) for English↔Polish allowed us to increase translation quality by approximately 1 BLEU point. This finding complies with our findings from previous years (Pinnis et al., 2018, 2017b). Therefore, we applied morphology-driven word splitting also for this year's experiments.

Then, we trained baseline NMT models (see Section 3.2) and language models, which are necessary for dual conditional cross-entropy filtering (DCCEF) (Junczys-Dowmunt, 2018), in order to select parallel data that is more similar to the news domain (for the usefulness of DCCEF, refer to Section 3.3). For in-domain (i.e., news) and out-of-domain language model training, we used four monolingual datasets of 3.7M and 10.6M sentences⁴ for the constrained and unconstrained scenarios respectively. Once the models were trained, we filtered the parallel data using DCCEF. The parallel data statistics before and after filtering are given in Table 1.

⁴ The sizes correspond to the smallest monolingual in-domain dataset, which in both cases was news in Polish. For other datasets, random sub-sets were selected.

Scenario     Raw     Filtered (Tilde)   Lang. pair   Filtered (+DCCEF)
(c)          10.8M   6.5M               En → Pl      4.3M
                                        Pl → En      4.3M
(u)          55.4M   31.5M              En → Pl      23.3M
                                        Pl → En      24.1M
(u) w/o PC   48.8M   27.0M              En → Pl      21.6M
                                        Pl → En      21.3M

Table 1: Parallel data before and after filtering. (c) - constrained, (u) - unconstrained, "w/o PC" - "without ParaCrawl".
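For readers unfamiliar with DCCEF, the sketch below illustrates the dual conditional cross-entropy score that the method is built around, following our reading of Junczys-Dowmunt (2018). The exact setup (model training, score combination, and thresholds) follows that paper rather than this sketch, and the numeric cross-entropy values used here are invented.

```python
# Illustrative dual conditional cross-entropy score in the spirit of
# Junczys-Dowmunt (2018). h_fwd and h_bwd are length-normalised conditional
# cross-entropies of a sentence pair under the forward (src->tgt) and backward
# (tgt->src) models; the example values at the bottom are made up.
import math

def dccef_score(h_fwd: float, h_bwd: float) -> float:
    """Penalise high average cross-entropy and disagreement between the two
    directions, then map the penalty to a (0, 1] score (higher = keep)."""
    penalty = abs(h_fwd - h_bwd) + 0.5 * (h_fwd + h_bwd)
    return math.exp(-penalty)

print(dccef_score(1.8, 2.0))  # ~0.12: fluent pair, both directions agree
print(dccef_score(1.8, 6.5))  # ~0.0001: the backward model finds the pair implausible
```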
For our final systems, we also generated synthetic data by randomly replacing one to three content words on both the source and target sides with unknown token identifiers. This has been shown to increase the robustness of NMT systems when dealing with rare or unknown phenomena (Pinnis et al., 2017a). This process almost doubles the size of the corpora; therefore, it was not done for the datasets that were used for the experiments documented in Section 3.

For back-translation experiments, we used all available monolingual data from the WMT shared task on news translation. In order to make use of the Polish CommonCrawl corpus, we scored sentences using the in-domain language models and selected top-scoring sentences as additional monolingual data for back-translation.

Many of the data processing steps were sped up via parallelisation with GNU Parallel (Tange, 2011).

3 Experiments

In this section, we describe the details of the methods used and the experiments performed to identify the best-performing recipes for training of Tilde's NMT systems for the WMT 2020 shared task on news translation. All experiments described in this section were carried out on the constrained datasets unless it is specifically indicated that unconstrained datasets were also used.

3.1 NMT architecture

All NMT systems that are described further have the Transformer architecture (Vaswani et al., 2017). We trained the systems using the Marian toolkit (Junczys-Dowmunt et al., 2018). The Transformer base model configuration was used throughout the experiments except for the experiments with the big model configuration that are described in Section 5. We used gradient accumulation over multiple physical batches (the --optimizer-delay parameter in Marian) to increase the effective batch size to around 1600 sentences in the base model experiments and 1000 sentences in the big model experiments. The Adam optimizer was used with a learning rate of 0.0005 and 1600 warm-up update steps (i.e., the learning rate rises linearly during warm-up and afterwards decays proportionally to the inverse of the square root of the step number). For language model training, a learning rate of 0.0003 was used.
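The shape of the schedule described above can be summarised with a short sketch: a linear warm-up to the peak learning rate over 1600 steps, followed by inverse square-root decay. The function below only illustrates the shape of the schedule, not Marian's implementation.

```python
# Learning-rate schedule used for the base models: linear warm-up to 0.0005 over
# 1600 update steps, then inverse square-root decay (illustrative sketch only).
def learning_rate(step: int, peak_lr: float = 0.0005, warmup: int = 1600) -> float:
    """Return the learning rate at a given (1-based) update step."""
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup:
        return peak_lr * step / warmup        # linear warm-up towards the peak value
    return peak_lr * (warmup / step) ** 0.5   # inverse square-root decay afterwards

print(learning_rate(800))    # 0.00025 (halfway through warm-up)
print(learning_rate(1600))   # 0.0005  (peak)
print(learning_rate(6400))   # 0.00025 (decayed back to half of the peak)
```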

3.2 Baseline models

We trained baseline models using the Transformer base configuration as defined in Section 3.1. The validation results for the baseline NMT systems are provided in Table 2. As we noticed last year that the ParaCrawl corpus contained a large proportion (by our estimates, up to 50%) of machine-translated content (Pinnis et al., 2019), we trained baseline systems with and without ParaCrawl. It can be seen that when training the En → Pl unconstrained system using ParaCrawl, we lose over 2 BLEU points. This is because most of the machine-translated content is on the non-English (in this case, Polish) side. For the Pl → En direction, the machine-translated content acts as back-translated data and, therefore, does not result in quality degradation. Further, our Pl → En systems are trained using ParaCrawl, and the En → Pl systems – without ParaCrawl.

System                    En → Pl   Pl → En
Constrained
  Baseline                21.67     32.69
  +DCCEF                  22.19     33.45
Unconstrained
  Baseline                21.86     33.08
  +DCCEF                  22.51     30.86
  Baseline w/o ParaCrawl  24.29     29.47
  +DCCEF                  22.60     28.59

Table 2: Comparison of baseline NMT systems trained on data that were prepared with and without DCCEF.

3.3 Dual Conditional Cross-Entropy Filtering

After the baseline systems, we analysed whether DCCEF allows improving translation quality. The validation results in Table 2 show that translation quality increases for the constrained systems but degrades for the unconstrained systems. Consequently, we used DCCEF only for the constrained-scenario systems.

3.4 Back-translation

We used monolingual data back-translation to adapt the NMT systems to the news domain. Edunov et al. (2018) have shown that using output sampling instead of beam search during back-translation yields better-performing NMT systems. Hence, we exclusively used output sampling for monolingual data back-translation.
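As a toy illustration of the difference between beam search and the sampled decoding used for back-translation (Edunov et al., 2018): instead of always emitting the most probable token, a token is drawn from the model's output distribution, which produces more varied, and noisier, synthetic source sentences. The vocabulary and probabilities below are made up; in practice, the distribution comes from the reverse-direction NMT model at each decoding step.

```python
# Greedy (beam-like) choice vs. sampling from the model's next-token distribution.
# The vocabulary and probabilities are invented placeholders.
import random

vocab = ["the", "a", "this", "plan"]
probs = [0.52, 0.25, 0.15, 0.08]

greedy_token = vocab[probs.index(max(probs))]             # always "the"
sampled_token = random.choices(vocab, weights=probs)[0]   # any token, proportionally to probs

print(greedy_token, sampled_token)
```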
However, due to the abundance of monolingual data for both translation directions, we experimented with different rates of upsampling and back-translated data cutoff to determine whether translation performance degrades when the proportion of bitext data in the training mix becomes too small.

Another dimension of inquiry was different strategies for data filtering in the preparation of the back-translated data. Ng et al. (2019) have described a method for domain data extraction from general-domain monolingual data using in-domain and out-of-domain language models. We compared said method with a simpler alternative of using only an in-domain language model for in-domain data scoring. We sorted the monolingual data according to the scores produced by the in-domain language model or by the combination of in-domain and out-of-domain language model scores, and experimented with different cutoff points when selecting data for back-translation.

Considering the above, we carried out experiments along two dimensions – 1) the monolingual data selection strategy, which was either combined or in-domain, signifying whether the combined score of both language models or just the score from the in-domain language model was used, respectively, and 2) the bitext and synthetic data mixture selection strategy, which was one of:

• original ratio – all available bitext data for the translation direction were combined with all back-translated data having a score ≥ 0, when using the combined selection strategy, or with the N top-scoring back-translated sentences, when using the in-domain selection strategy, where N was selected to match the amount of synthetic data selected in the combined case.

• upsampled 1:1 – the same amount of synthetic data was selected as with the original ratio data mixture selection strategy, but the bitext was upsampled to match the amount of synthetic data.

• cutoff {1:1, 1:2, 1:3} – all available bitext data for the translation direction were combined with the N top-scoring back-translated sentences, where N was chosen so that the ratio of bitext to synthetic data was either 1:1, 1:2 or 1:3.

As a result of the above, we ended up with 96.8M sentences (14% retained) from the English monolingual corpus and 137M sentences (99% retained) from the Polish monolingual corpus after applying the combined data selection strategy. Consequently, the same amount of data was selected for the in-domain data selection strategy in the case of the original ratio and upsampled 1:1 data mixture selection strategies (i.e., when not doing cutoff).
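A hedged sketch of the two monolingual data selection strategies is given below. The in-domain strategy ranks sentences by an in-domain language model score alone, while the combined strategy keeps sentences whose in-domain minus out-of-domain score is ≥ 0 (in the spirit of Ng et al., 2019). The helper name select_for_backtranslation, the stand-in LM scorers, and the toy sentences are illustrative assumptions, not our actual implementation.

```python
# Hypothetical helper illustrating the "in-domain" and "combined" selection strategies;
# the LM scorers are stand-ins for real language models, and top_n models the cutoff
# used to reach a given bitext-to-synthetic-data ratio.
from typing import Callable, List, Optional

def select_for_backtranslation(
    sentences: List[str],
    in_domain_lm: Callable[[str], float],    # e.g. length-normalised log-probability
    out_domain_lm: Callable[[str], float],
    strategy: str = "combined",
    top_n: Optional[int] = None,
) -> List[str]:
    if strategy == "combined":
        scored = [(in_domain_lm(s) - out_domain_lm(s), s) for s in sentences]
        scored = [(score, s) for score, s in scored if score >= 0]   # keep score >= 0
    elif strategy == "in-domain":
        scored = [(in_domain_lm(s), s) for s in sentences]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    scored.sort(reverse=True)
    if top_n is not None:
        scored = scored[:top_n]              # cutoff, e.g. towards a 1:2 bitext ratio
    return [s for _, s in scored]

# Toy usage: only the news-like sentence survives the combined strategy.
print(select_for_backtranslation(
    ["markets rallied on tuesday", "click here to subscribe"],
    in_domain_lm=lambda s: -1.0 if "markets" in s else -5.0,
    out_domain_lm=lambda s: -3.0,
))  # ['markets rallied on tuesday']
```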

             orig.   ups.   cutoff  cutoff  cutoff
             ratio   1:1    1:1     1:2     1:3
En → Pl
  combined   23.35   24.01  24.52   24.72   -
  in-domain  22.10   22.92  25.02   25.28   25.24
Pl → En
  combined   31.19   33.45  33.29   33.60   -
  in-domain  29.67   -      33.40   33.28   -

Table 3: Back-translation experiment results.

           combined     in-domain    in-domain
           cutoff 1:2   cutoff 1:2   cutoff 1:3
Adam       24.72        25.28        25.24
QHAdam     24.86        24.98        25.00

Table 4: BLEU scores for the QHAdam experiments in the En → Pl translation direction.

The results for the back-translation experiments are presented in Table 3. The systems use the DCCEF-filtered constrained datasets and are therefore directly comparable to the constrained DCCEF systems in Table 2.

For our final systems, we use the combined selection strategy for Pl → En and the in-domain selection strategy for En → Pl. For unconstrained systems, we identified that there is no significant difference in translation quality between the strategies; we used the combined selection strategy for both translation directions.

3.5 QHAdam optimizer

Last year (Pinnis et al., 2019), we used the QHAdam optimizer (Ma and Yarats, 2018) for model training; however, we had not treated QHAdam and Adam the same in the experimental process, having dedicated substantially more effort to optimizer hyper-parameter tuning for QHAdam than for Adam. To make an unbiased comparison of the two optimizers, we trained corresponding system variants using QHAdam for the combined cutoff 1:2, in-domain cutoff 1:2 and in-domain cutoff 1:3 systems from Section 3.4 in the En → Pl translation direction. The BLEU scores for these experiments are given in Table 4. We see that QHAdam performs no better than Adam. We had also done more extensive experiments comparing QHAdam to Adam for a range of learning rate and warm-up step parameter settings on a different dataset, which showed a similar trend; however, we do not present those results here. As a result, we did not choose QHAdam over Adam in this year's competition.

We note, however, that we used the recommended safe defaults for QHAdam's hyper-parameters – v1 = 0.8, v2 = 0.7 – and we have not performed a search over these values, which could have yielded different results.

3.6 Right-to-Left Re-Ranking

Morishita et al. (2018) report improving translation performance by using right-to-left (R2L) re-ranking. The method employs a right-to-left model to re-score the n-best list outputs of a regular – left-to-right – model by multiplying both models' translation probabilities. We implement R2L re-ranking the same way as Morishita et al. (2018) but opted to use n-best lists with n = 12 (instead of n = 10).

The R2L re-ranking experiments were performed during the final stages of the competition, hence the baseline systems for those experiments were the final systems that were being prepared for submission to the news translation task. Therefore, we present the results in Table 5 in the Results section. We find similar improvements as Morishita et al. (2018), albeit slightly smaller.
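A minimal sketch of the R2L re-ranking step is given below, under the assumption that multiplying the two models' translation probabilities amounts to summing their log-probabilities, as described above. The function name, the placeholder R2L scorer, and the example scores are illustrative only.

```python
# Hypothetical R2L re-ranking helper: each hypothesis from the left-to-right model's
# n-best list (n = 12 in our systems) is re-scored by adding the right-to-left
# model's log-probability, i.e. multiplying the two translation probabilities.
from typing import Callable, List, Tuple

def rerank_r2l(
    nbest: List[Tuple[str, float]],          # (hypothesis, L2R log-probability)
    r2l_logprob: Callable[[str], float],     # log-probability under the R2L model
) -> str:
    """Return the hypothesis with the highest combined L2R + R2L log-probability."""
    rescored = [(l2r + r2l_logprob(hyp), hyp) for hyp, l2r in nbest]
    return max(rescored)[1]

# Made-up scores: the second-best L2R hypothesis wins after re-ranking.
nbest = [("translation A", -4.1), ("translation B", -4.3)]
print(rerank_r2l(nbest, lambda hyp: -5.0 if hyp == "translation A" else -4.2))  # translation B
```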

4 Final Systems

We chose the best-performing system variants from Section 3 to serve as a base for the final submission to the news translation task. For the constrained scenario, we trained the final systems using parallel data that were filtered with Tilde's filtering methods and DCCEF, back-translated monolingual data using a ratio of 1:2 (different data selection methods were applied for the two translation directions), and synthetic data featuring unknown phenomena. For the unconstrained scenario, we trained the final systems using parallel data that were filtered only with Tilde's filtering methods, back-translated monolingual data that were selected using the combined data selection strategy with a ratio of 1:1, and synthetic data featuring unknown phenomena. All models were trained using the Adam optimiser.

When preparing the final systems, we also employed R2L re-ranking (see Section 3.6), ensembling of the best three models, and trained Transformer models using the big model configuration.

                 Constrained   Unconstrained
Pl → En
  Base           33.48         32.63
  +R2L           34.34         33.29
  Big            33.79         33.15
  +R2L           34.83         33.45
  Ensemble of 3  34.19         33.39
  +R2L           34.80         33.53
En → Pl
  Base           25.64         26.12
  +R2L           26.24         26.52
  Big            25.59         26.47
  +R2L           26.70         26.78
  Ensemble of 3  26.07         26.86
  +R2L           26.73         27.12

Table 5: Final system evaluation results (BLEU scores) on validation data (bold marks the best scores; submitted systems are underlined).

5 Results

The BLEU scores for the systems that were evaluated for the final submission are shown in Table 5. The results show that right-to-left re-ranking increased translation quality for all systems. For the En → Pl translation direction, the best results were achieved when using ensembles of three models, and better results were achieved by the unconstrained systems. However, for the Pl → En translation direction, the unconstrained systems achieved lower results than the constrained systems. The best results were achieved by the Transformer big model; ensembling did not improve the results.

Overall, the results differ from what we have observed in previous years. Back-translation for Pl → En did not improve the results, which raises the question of a possible domain mismatch between the monolingual data we back-translated and the development data. Unconstrained systems are only slightly better than constrained systems for En → Pl and even subpar for the Pl → En translation direction, which shows that current NMT methods are not able to benefit from the larger datasets. Hence, having in-domain data is more important.

6 Conclusion

In this paper, we described Tilde's NMT systems for the WMT shared task on news translation. This year, we trained constrained and unconstrained systems for the English↔Polish language pair. We detailed the methods applied and the training recipes.

During our experiments, we identified several avenues of possible further research. We saw that larger datasets, even after applying data selection methods, did not allow improving translation quality (at least not significantly). We made a similar observation also in previous years when participating in WMT. We also saw in our results that back-translation did not yield positive results for En → Pl. We hypothesise that there may be a domain mismatch between the data we used for training and the newsdev2020 dataset. However, this requires further investigation.

Acknowledgements

The research has been supported by the European Regional Development Fund within the research project "Multilingual Artificial Intelligence Based Human Computer Interaction", No. 1.1.1.1/18/A/148. We thank the High Performance Computing Center of Riga Technical University for providing access to their GPU computing infrastructure.

References

Ahmed Abdelali, Francisco Guzman, Hassan Sajjad, and Stephan Vogel. 2014. The AMARA Corpus: Building parallel language resources for the educational domain. In LREC, volume 14, pages 1044–1054.

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, et al. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding Back-Translation at Scale. arXiv preprint arXiv:1808.09381.

Najeh Hajlaoui, David Kolovratnik, Jaakko Väyrynen, Ralf Steinberger, and Dániel Varga. 2014. DCEP - Digital Corpus of the European Parliament. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).

Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895.

Marcin Junczys-Dowmunt. 2019. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 225–233.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast Neural Machine Translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit (MT Summit), pages 79–86.

Jerry Ma and Denis Yarats. 2018. Quasi-Hyperbolic Momentum and Adam for Deep Learning. arXiv preprint arXiv:1810.06801.

Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2018. NTT's Neural Machine Translation Systems for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 461–466, Brussels, Belgium. Association for Computational Linguistics.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 News Translation Task Submission. arXiv preprint arXiv:1907.06616.

Mārcis Pinnis. 2018. Tilde's Parallel Corpus Filtering Methods for WMT 2018. In Proceedings of the Third Conference on Machine Translation, pages 952–958, Brussels, Belgium. Association for Computational Linguistics.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, and Toms Miks. 2017a. Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data. In Proceedings of the 20th International Conference of Text, Speech and Dialogue (TSD2017), volume 10415 LNAI, Prague, Czechia.

Mārcis Pinnis, Rihards Krišlauks, Toms Miks, Daiga Deksne, and Valters Šics. 2017b. Tilde's Machine Translation Systems for WMT 2017. In Proceedings of the Second Conference on Machine Translation (WMT 2017), Volume 2: Shared Task Papers, pages 374–381, Copenhagen, Denmark. Association for Computational Linguistics.

Mārcis Pinnis, Rihards Krišlauks, and Matīss Rikters. 2019. Tilde's Machine Translation Systems for WMT 2019. In Proceedings of the Fourth Conference on Machine Translation, pages 526–533, Florence, Italy. Association for Computational Linguistics.

Mārcis Pinnis, Matīss Rikters, and Rihards Krišlauks. 2018. Tilde's Machine Translation Systems for WMT 2018. In Proceedings of the Third Conference on Machine Translation, pages 477–485, Brussels, Belgium. Association for Computational Linguistics.

Roberts Rozis and Raivis Skadiņš. 2017. Tilde MODEL - Multilingual Open Data for EU Languages. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 263–265.

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2012. DGT-TM: A freely available translation memory in 22 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 454–459.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), volume 4, pages 2142–2147.

Ole Tange. 2011. GNU Parallel - The Command-Line Power Tool. The USENIX Magazine, 36(1):42–47.

Jörg Tiedemann. 2016. Finding alternative translations in a large corpus of movie subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3518–3522.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.
