Edinburgh's Statistical Machine Translation Systems for WMT16


Philip Williams¹, Rico Sennrich¹, Maria Nădejde¹, Matthias Huck², Barry Haddow¹, Ondřej Bojar³
¹School of Informatics, University of Edinburgh
²Center for Information and Language Processing, LMU Munich
³Institute of Formal and Applied Linguistics, Charles University in Prague

Abstract

This paper describes the University of Edinburgh's phrase-based and syntax-based submissions to the shared translation tasks of the ACL 2016 First Conference on Machine Translation (WMT16). We submitted five phrase-based and five syntax-based systems for the news task, plus one phrase-based system for the biomedical task.

1 Introduction

Edinburgh's submissions to the WMT 2016 news translation task fall into two distinct groups: neural translation systems and statistical translation systems. In this paper, we describe the statistical systems, which include a mix of phrase-based and syntax-based approaches. We also include a brief description of our phrase-based submission to the WMT16 biomedical translation task. Our neural systems are described separately in Sennrich et al. (2016a).

In most cases, our statistical systems build on last year's, incorporating recent modelling refinements and adding this year's new training data. For Romanian—a new language this year—we paid particular attention to language-specific processing of diacritics. For English→Czech, we experimented with a string-to-tree system, first using Treex (formerly TectoMT; Popel and Žabokrtský, 2010; http://ufal.mff.cuni.cz/treex) to produce Czech dependency parses, then converting them to constituency representation and extracting GHKM rules.

In the next two sections, we describe the phrase-based systems, first describing the core setup in Section 2 and then describing system-specific extensions and experimental results for each individual language pair in Section 3. We describe the core syntax-based setup and experiments in Sections 4 and 5.

2 Phrase-based System Overview

2.1 Preprocessing

The training data was preprocessed using scripts from the Moses toolkit. We first normalized the data using the normalize-punctuation.perl script, then performed tokenization (using the -a option), and then truecasing. We did not perform any corpus filtering other than the standard Moses method, which removes sentence pairs with extreme length ratios and sentences longer than 80 tokens.

2.2 Word Alignment

For word alignment we used fast_align (Dyer et al., 2013)—except for German↔English, where we used MGIZA++ (Gao and Vogel, 2008)—followed by the standard grow-diag-final-and symmetrization heuristic.

2.3 Language Models

Our default approach to language modelling was to train individual models on each monolingual corpus (except CommonCrawl) and then linearly interpolate them to produce a single model. For some systems, we added separate neural or CommonCrawl LMs. Here we outline the various approaches, and in Section 3 we describe the combination used for each language pair.

Interpolated LMs. For individual monolingual corpora, we first used lmplz (Heafield et al., 2013) to train count-based 5-gram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998). We then used the SRILM toolkit (Stolcke, 2002) to linearly interpolate the models, using weights tuned to minimize perplexity on the development set.
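The interpolation itself is easy to picture: each component model proposes a probability for the next word, and the combined model is a weighted sum of those probabilities, with the weights chosen to minimize perplexity on held-out text. The sketch below (Python, toy data) illustrates that criterion; it is a conceptual stand-in for the lmplz/SRILM pipeline, and the EM update used to fit the weights is an assumed, standard choice rather than a detail taken from the paper.

```python
import math

def interpolate(probs, weights):
    """Mixture probability: P(w | h) = sum_i weight_i * P_i(w | h)."""
    return sum(w * p for w, p in zip(weights, probs))

def perplexity(dev_token_probs, weights):
    """dev_token_probs holds, for every dev-set token, the probability that
    each component model assigns to it."""
    log_prob = sum(math.log(interpolate(p, weights)) for p in dev_token_probs)
    return math.exp(-log_prob / len(dev_token_probs))

def tune_weights(dev_token_probs, n_models, iterations=50):
    """EM for the mixture weights: maximizing dev-set likelihood is the
    same as minimizing dev-set perplexity."""
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        expected = [0.0] * n_models
        for probs in dev_token_probs:
            mix = interpolate(probs, weights)
            for i, p in enumerate(probs):
                expected[i] += weights[i] * p / mix
        total = sum(expected)
        weights = [e / total for e in expected]
    return weights

# Toy example: two component models scoring three dev-set tokens.
dev = [[0.20, 0.05], [0.01, 0.30], [0.10, 0.10]]
lambdas = tune_weights(dev, n_models=2)
print(lambdas, perplexity(dev, lambdas))
```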
CommonCrawl LMs. Our CommonCrawl language models were trained in the same way as the individual corpus-specific standard models, but were not linearly interpolated with the other LMs. Instead, the log probabilities of the CommonCrawl LMs were added as separate features of the systems' linear models.

Neural LMs. For some of our phrase-based systems we experimented with feed-forward neural network language models, trained both on target n-grams only and on “joint” or “bilingual” n-grams (Devlin et al., 2014; Le et al., 2012). For training these models we used the NPLM toolkit (Vaswani et al., 2013), for which we have now implemented gradient clipping to address numerical issues often encountered during training.

2.4 Baseline Features

We follow the standard approach to SMT of scoring translation hypotheses using a weighted linear combination of features. The core features of our model are a 5-gram LM score (i.e. log probability), phrase translation and lexical translation scores, word and phrase penalties, and a linear distortion score. The phrase translation probabilities are smoothed with Good-Turing smoothing (Foster et al., 2006). We used the hierarchical lexicalized reordering model (Galley and Manning, 2008) with 4 possible orientations (monotone, swap, discontinuous left and discontinuous right) in both left-to-right and right-to-left direction. We also used the operation sequence model (OSM) (Durrani et al., 2013) with 4 count-based supportive features. We further employed domain indicator features (marking which training corpus each phrase pair was found in), binary phrase count indicator features, sparse phrase length features, and sparse source word deletion, target word insertion, and word translation features (limited to the top K words in each language, typically with K = 50).

2.5 Tuning

Since our feature set (generally around 500 to 1000 features) was too large for MERT, we used k-best batch MIRA for tuning (Cherry and Foster, 2012). To speed up tuning we applied threshold pruning to the phrase table, based on the direct translation model probability.

2.6 Decoding

In decoding we applied cube pruning (Huang and Chiang, 2007) with a stack size of 5000 (reduced to 1000 for tuning), Minimum Bayes Risk decoding (Kumar and Byrne, 2004), a maximum phrase length of 5, a distortion limit of 6, 100-best translation options, and the no-reordering-over-punctuation heuristic (Koehn and Haddow, 2009).

3 Phrase-based Experiments

3.1 Finnish→English

Similar to last year (Haddow et al., 2015), we built an unconstrained system for Finnish→English using data extracted from OPUS (Tiedemann, 2012). Our parallel training set was the same as we used previously, but the language model training set was extended with the addition of the news2015 monolingual corpus and the large WMT16 English CommonCrawl corpus. We used newsdev2015 for tuning and newstest2015 for testing during system development.

One clear problem that we noted with our submission from last year was the large number of OOVs, which were then copied directly into the English output. This is undoubtedly due to the agglutinative nature of Finnish, and it was probably the cause of our system being judged poorly by human evaluators despite having a high BLEU score. To address this, we split the Finnish input into subword units at both training and test time. In particular, we applied byte pair encoding (BPE) to split the Finnish source into smaller units, greatly reducing the vocabulary size. BPE is a technique that has recently been used to good effect in neural machine translation (Sennrich et al., 2016b), where the models cannot handle large vocabularies. It is actually a merging algorithm, originally designed for compression, which works by starting with a maximally split version of the training corpus (i.e. split into characters) and iteratively merging common clusters. The merging continues for a specified number of iterations, and the merges are collected to form the BPE model. At test time, the recorded merges are applied to the test corpus, with the result that there are no OOVs in the test data. For the experiments here, we used 100,000 BPE merges to create the model.
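To make the merging procedure concrete, here is a minimal BPE learner in Python. It is a simplified sketch (no end-of-word markers or frequency cut-offs, and a toy word list rather than the Finnish training corpus), but it follows the idea described above: start from characters, repeatedly merge the most frequent adjacent pair, and record the merges so they can later be replayed on test text.

```python
import re
from collections import Counter

def pair_stats(vocab):
    """Count adjacent symbol pairs over a {space-separated word: frequency} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Maximally split corpus: every word starts as a sequence of characters.
    vocab = {' '.join(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = apply_merge(best, vocab)
        merges.append(best)
    return merges  # replayed in order to segment unseen (test) words

# Toy word-frequency list; the submitted system learned 100,000 merges instead.
print(learn_bpe({'talossa': 4, 'talosta': 3, 'autossa': 2}, num_merges=8))
```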
Applying BPE to Finnish→English was clearly effective at addressing the unknown word problem, and in many cases the resulting translations are quite understandable, e.g.

source: Myös Intian on sanottu olevan kiinnostunut puolustusyhteistyösopimuksesta Japanin kanssa.
base: India is also said to be interested in puolustusyhteistyösopimuksesta with Japan.
bpe: India is also said to be interested in defence cooperation agreement with Japan.
reference: India is also reportedly hoping for a deal on defence collaboration between the two nations.

However, applying BPE to Finnish can also result in some rather odd translations when it over-zealously splits:

source: Balotelli oli vielä kaukana huippuvireestään.
base: Balotelli was still far from huippuvireestään.
bpe: Baloo, Hotel was still far from the peak of its vitality.
reference: Balotelli is still far from his top tune.

We built four language models: an interpolated [...]

[...] into two parts randomly (so as to balance the “born English” and “born Romanian” portions), using one for tuning and one for testing. For building the final system, and for the contrastive experiments, we used the whole of newsdev2016 for tuning and newstest2016 for testing.

In early experiments we noted that both the training and the development data were inconsistent in their use of diacritics, leading to problems with OOVs and sparse statistics. To address this we stripped off all diacritics from the Romanian texts, and the result was a significant increase in performance in our development setup. We also experimented with different language model combinations during development, with our submitted system using three different language model features: a neural LM trained on just the news2015 monolingual data, an n-gram language model trained on the WMT16 English CommonCrawl corpus, and a linear interpolation of language models trained on all other WMT16 English corpora.
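The diacritic stripping described above can be done with plain Unicode normalization; the snippet below is one simple way to do it, a sketch rather than the exact script used for the submission. Decomposing to NFD and dropping combining marks maps both the comma-below and the legacy cedilla encodings of ș and ț, along with ă, â and î, to bare ASCII letters, which also hides the inconsistency between the two encodings.

```python
import unicodedata

def strip_diacritics(text):
    """Canonically decompose (NFD), then drop every combining mark."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics('știință și înțelegere'))  # -> 'stiinta si intelegere'
```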