This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452.

Deliverable D1.5 Improved Learning for Machine Translation

Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI)

Dissemination Level: Public

31st July, 2016

Grant agreement no.: 645452
Project acronym: QT21
Project full title: Quality Translation 21
Type of action: Research and Innovation Action
Coordinator: Prof. Josef van Genabith (DFKI)
Start date, duration: 1st February, 2015, 36 months
Dissemination level: Public
Contractual date of delivery: 31st July, 2016
Actual date of delivery: 31st July, 2016
Deliverable number: D1.5
Deliverable title: Improved Learning for Machine Translation
Type: Report
Status and version: Final (Version 1.0)
Number of pages: 78
Contributing partners: CUNI, DFKI, RWTH, UVA
WP leader: CUNI
Author(s): Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI)
EC project officer: Susan Fraser

The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v. (TAUS), Netherlands
• text & form GmbH (TAF), Germany
• TILDE SIA (TILDE), Latvia
• Hong Kong University of Science and Technology (HKUST), Hong Kong

For copies of reports, updates on project activities and other QT21-related information, contact:

Prof. Stephan Busemann, DFKI GmbH
[email protected]
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
Phone: +49 (681) 85775 5286
Fax: +49 (681) 85775 5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.qt21.eu/

© 2016, The Individual Authors No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Contents

1 Executive Summary 4

2 Joint Translation and Reordering Sequences 5

3 Alignment-Based Neural Machine Translation 5

4 The QT21/HimL Combined Machine Translation System 6

5 Improved BEER 6

6 Particle Swarm Optimization for MERT 6

7 CharacTER: Translation Edit Rate on Character Level 6

8 Bag-of-Words Input Features for Neural Network 7

9 Vocabulary Reduction for Phrase Table Smoothing 7

10 Faster and Better Word Classes for Word Alignment 8

References 8

Appendices 10

Appendix A Joint Translation and Reordering Sequences 10

Appendix B Alignment-Based Neural Machine Translation 21

Appendix C The QT21/HimL Combined Machine Translation System 33

Appendix D Beer for MT evaluation and tuning 45

Appendix E Particle Swarm Optimization Submission for WMT16 Tuning Task 51

Appendix F CharacTER: Translation Edit Rate on Character Level 58

Appendix G Exponentially Decaying Bag-of-Words Input Features 58

Appendix H Study on Vocabulary Reduction for Phrase Table Smoothing 65

Appendix I BIRA: Improved Predictive Exchange Word Clustering 73


1 Executive Summary

This deliverable reports on the progress in Task 1.3 Improved Learning for Machine Translation, as specified in the project proposal:

Task 1.3 Improved Learning for Machine Translation [M01–M36] (CUNI, DFKI, RWTH, UVA) Experiments performed within this task will address the second objective, namely full structured prediction, with discriminative and integrated training. Focus will be on better correlation of the training criterion with the target metrics, avoidance of overfitting by incorporating certain so far unused smoothing techniques into the training process, and efficient algorithm to allow the use of all training data in all phases of the training process.

The first focus point of Task 1.3 is constructing full structured predictions. We analyzed two different approaches to achieve this. RWTH shows in Section 2 how to model bilingual sentence pairs on the word level together with reordering information as one sequence, so-called joint translation and reordering (JTR) sequences. Paired with a phrase-based machine translation system, this gave a significant improvement of up to 2.2 Bleu points over the baseline. RWTH proposes in Section 3 an alignment-based neural machine translation approach as an alternative to the popular attention-based approach. They demonstrate competitive results on the IWSLT 2013 German→English and BOLT Chinese→English tasks.

Beyond systems designed within a unified framework with a single search for the best translation, we also experiment with a high-level combination of systems, which yields the best overall performance. In this joint effort, the groups from the QT21 and HimL projects built the best system for the English→Romanian translation task of the ACL 2016 First Conference on Machine Translation (WMT 2016); see Section 4.

These systems depend on automatic ways to optimize parameters with a good training criterion. The next focus of this task is therefore a better correlation of the training criterion with the target metrics. To this end, we organized and also participated in the Tuning Task at WMT 2016. The organization of the task itself falls under WP4 and the details of the task as a whole will be described in the corresponding deliverable there. In this deliverable, we include our submissions to the Tuning Task. The Beer evaluation metric for MT was improved by the UvA team to be both of higher quality and significantly faster to compute, which makes it competitive with Bleu for tuning; see Section 5. Another submission to the Tuning Task was contributed by CUNI, who replaced the core optimization algorithm of the standard Mert with Particle Swarm Optimization; see Section 6. Additionally, RWTH introduced a character-level TER, where the edit distance is calculated on the character level while the shift edit is performed on the word level. It has not been tested for tuning yet, but shows very high correlation with human judgment, as described in Section 7.

Even with better metrics in place, we still need to avoid overfitting to rare events while learning from them. We developed multiple ways to do so by applying smoothing in different variations. One approach, described in Section 8 by RWTH, uses bag-of-words input features for feed-forward neural networks with individually trained decay rates, which gave results comparable to LSTM networks. The second approach, also by RWTH, smooths the standard phrase translation probabilities by reducing the vocabulary with a word-label mapping. The experiments showed that the smoothing is not significantly affected by the choice of vocabulary and is more effective for large-scale translation tasks; see Section 9. Section 10 presents work by DFKI on smoothing using word classes for morphologically rich languages. These languages have large vocabularies, which increase training times and data sparsity. The research developed multiple techniques to improve both the scalability and quality of word clusters for word alignment.


2 Joint Translation and Reordering Sequences

In joint work with WP2, RWTH introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. This combines interdependent lexical and alignment dependencies in a single framework. A main advantage of JTR sequences is that they can be modeled in a similar way as a language model, so that well-known modeling techniques can be applied. RWTH tested three of them:

• count-based n-gram models with modified Kneser-Ney smoothing, a well-tested technique for language modeling

• feed-forward neural networks (FFNN), which should generalize better to unseen events

• recurrent neural networks (RNN), which generalize well and can in principle take the complete history into account

Comparisons between the count-based JTR model and the operation sequence model (OSM; Durrani et al., 2013), both used in phrase-based decoding, showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs five. The strongest combination of count and neural network models yields an improvement over the phrase-based system of up to 2.2 Bleu points on the German→English IWSLT task. This combination also outperforms the OSM by up to 1.2 Bleu points on the BOLT Chinese→English tasks. This work appeared as a long paper at EMNLP 2015 (Guta et al., 2015) and is available in Appendix A.
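As an illustration of the conversion, the following Python sketch turns an aligned sentence pair into a simplified JTR-style token sequence (bilingual word pairs plus coarse jump classes). It is a minimal sketch of the idea only, not RWTH's implementation; the token names and data structures are illustrative assumptions, and unaligned source words are ignored here for brevity.

```python
# Minimal sketch of a JTR-style conversion (illustrative, not RWTH's implementation).
# Unaligned target words pair with an empty token; non-monotone alignment steps
# emit a coarse jump class before the next bilingual word pair.

def jtr_sequence(src, tgt, alignment):
    """src, tgt: lists of words; alignment: dict target index -> sorted source indices."""
    tokens = []
    prev_j = -1                                # last translated source position
    for i, e in enumerate(tgt):
        links = alignment.get(i, [])
        if not links:                          # unaligned target word
            tokens.append(("<eps>", e))
            continue
        j = links[0]
        if j not in (prev_j, prev_j + 1):      # non-monotone step: emit a jump class
            if j == prev_j - 1:
                tokens.append("<step_back>")
            elif j > prev_j + 1:
                tokens.append("<jump_fwd>")
            else:
                tokens.append("<jump_back>")
        if j == prev_j:                        # same source word as the previous target word
            tokens.append(("<sigma>", e))
        else:
            tokens.append((src[j], e))
        for j2 in links[1:]:                   # further source words aligned to e
            tokens.append((src[j2], "<sigma>"))
            j = j2
        prev_j = j
    return tokens

# Example: German -> English with one reordering ("kommen ... zurueck" -> "come back")
src = ["kommen", "Sie", "zurueck"]
tgt = ["come", "back"]
alignment = {0: [0], 1: [2]}
print(jtr_sequence(src, tgt, alignment))
# [('kommen', 'come'), '<jump_fwd>', ('zurueck', 'back')]
```

The resulting token stream can then be treated like ordinary text and fed to a standard n-gram toolkit, which is what makes the approach easy to combine with well-known language-modeling machinery.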

3 Alignment-Based Neural Machine Translation

Neural machine translation (NMT) has emerged recently as a successful alternative to traditional phrase-based systems. In NMT, neural networks are used to generate translation candidates during decoding. This is in contrast to the integration of neural networks in phrase-based systems, where the networks are used to score, but not to generate, the translation hypotheses. The neural networks used in NMT so far rely on an attention mechanism that allows the decoder to pay more attention to certain parts of the source sentence when generating a target word. The attention component is implemented as a probability distribution over the source positions. In contrast to the attention mechanism, RWTH proposes an alignment-based approach using the well-known hidden Markov model (HMM). In this approach, translation is a generative process of two steps: alignment and translation. Compared to the attention-based case, this method is more flexible, as it allows training the alignment and translation models separately, while still retaining the possibility of joint training using forced decoding. The models are combined in a log-linear framework and the model weights are tuned with minimum error rate training (Mert). The models use large target vocabularies of up to 149K words, made feasible by a class-factored output layer. By limiting the output evaluation to the top-scoring classes, decoding is sped up significantly. The alignment-based approach is thus another viable NMT approach, capable of slightly outperforming attention-based NMT on the BOLT Chinese→English task. This work was done in cooperation with WP2, will appear at WMT 2016 (Alkhouli et al., 2016), and is available in Appendix B.
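To illustrate the class-factored output layer, the sketch below factors the word probability into p(class | h) · p(word | class, h), so that only the words of a few top-scoring classes need to be evaluated during decoding. All sizes, random weights and the word-to-class mapping are illustrative assumptions, not the actual RWTH models.

```python
import numpy as np

# Sketch of a class-factored output layer:
#   p(word | h) = p(class(word) | h) * p(word | class(word), h)
# Sizes, weights and the word-to-class mapping are illustrative only.
rng = np.random.default_rng(0)
hidden, n_classes, vocab = 128, 100, 20000
word2class = rng.integers(0, n_classes, size=vocab)          # assumed fixed clustering
W_class = rng.normal(scale=0.01, size=(n_classes, hidden))   # class part of the output layer
W_word = rng.normal(scale=0.01, size=(vocab, hidden))        # word part of the output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_prob(h, word):
    c = word2class[word]
    p_class = softmax(W_class @ h)[c]
    members = np.flatnonzero(word2class == c)                # only words sharing class c
    p_word_in_class = softmax(W_word[members] @ h)[list(members).index(word)]
    return p_class * p_word_in_class

h = rng.normal(size=hidden)                                  # hidden state from the network
print(word_prob(h, word=123))
```

The speed-up comes from the second factor: its softmax runs only over the members of one class rather than over the full vocabulary.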


4 The QT21/HimL Combined Machine Translation System

In a joint effort of groups from RWTH Aachen University, LMU Munich, Charles University in Prague, University of Edinburgh, University of Sheffield, Karlsruhe Institute of Technology, LIMSI, University of Amsterdam, and Tilde, we provided the strongest system for the English→Romanian translation task of the ACL 2016 First Conference on Machine Translation (WMT 2016). The submission is a system combination of twelve different statistical machine translation systems built by the participating groups. The systems are combined using RWTH's system combination approach, which merges all translations into one confusion network and extracts the most likely translation by finding the best path through the network. The final official evaluation shows an improvement of 1.0 Bleu point by the system combination over the best single system on newstest2016. An examination of the translations produced by the single systems showed that the combination also increased the number of words produced with the correct morphology. This work was done in cooperation with WP2, will appear at WMT 2016 (Peter et al., 2016a), and is available in Appendix C.
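The core idea of the confusion-network combination can be sketched as follows; the slots, words and scores are toy values, and the real system additionally uses language-model and system-weight features that are omitted here.

```python
# Minimal sketch of confusion-network combination: each slot holds weighted word
# alternatives (including the empty word "" for deletions); the best path simply
# takes the highest-scoring alternative per slot.

confusion_network = [
    {"the": 9.0, "a": 3.0},
    {"committee": 7.0, "commission": 5.0},
    {"has": 8.0, "": 4.0},          # "" means: skip this slot
    {"decided": 10.0, "agreed": 2.0},
]

best_path = [max(slot, key=slot.get) for slot in confusion_network]
print(" ".join(w for w in best_path if w))   # -> "the committee has decided"
```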

5 Improved BEER

Beer, introduced by UvA, is a trained evaluation metric with a linear model that combines features capturing character n-grams and permutation trees. This year, the Beer learning algorithm was improved (a linear SVM instead of logistic regression) and some features that are relatively slow to compute (paraphrasing, syntax and permutation trees) were removed, resulting in a very large speed-up. This speed-up is essential for fast tuning of MT systems: tuning with Beer is now as fast as tuning with Bleu. For some languages, the metric is now even more accurate. An additional change in Beer is that the usual training for ranking is replaced by a compromise: the initial model is trained for relative ranking (RR) with a ranking SVM, and the SVM output is then scaled using a trained regression model to approximate absolute judgments (direct assessment, DA). This work on Beer appeared at WMT 2015 (Stanojević and Sima’an, 2015) and is available in Appendix D.
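The following sketch shows the general shape of such a trained linear metric over character n-gram precision/recall features. The feature set and the weights are made up for illustration and do not correspond to Beer's actual features or learned parameters.

```python
# Sketch of a trained linear sentence-level metric over character n-gram features.
# The features and weights below are illustrative only; Beer's real model is trained
# (ranking SVM plus regression rescaling) and uses a richer feature set.

def char_ngrams(s, n):
    s = s.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def pr_features(hyp, ref, orders=(2, 3, 4)):
    feats = []
    for n in orders:
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum(min(h.count(g), r.count(g)) for g in set(h))
        feats.append(overlap / max(len(h), 1))   # precision
        feats.append(overlap / max(len(r), 1))   # recall
    return feats

weights = [0.2, 0.3, 0.15, 0.2, 0.05, 0.1]       # would normally be learned from human judgments

def score(hyp, ref):
    return sum(w * f for w, f in zip(weights, pr_features(hyp, ref)))

print(score("the cat sat on the mat", "the cat is on the mat"))
```

Because the remaining features are cheap string statistics, a metric of this shape can be evaluated thousands of times per tuning iteration, which is what makes it usable inside Mert-style loops.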

6 Particle Swarm Optimization for MERT

In Kocur and Bojar (2016) (full text in Appendix E), CUNI describes a replacement of the core optimization algorithm of the standard Minimum Error Rate Training (Och, 2003) with Particle Swarm Optimization (PSO; Eberhart et al., 1995). The expected benefit of PSO is the highly parallelizable structure of the algorithm. CUNI implemented PSO for the Moses toolkit and took part in the WMT16 Tuning Task to see how the weights selected by PSO perform in an extrinsic evaluation. PSO indeed runs faster and delivers weights leading to Bleu scores very similar to the Mert ones on the development set. Unfortunately, the manual evaluation of the Tuning Task ranks the system optimized with PSO lower, although still within the same top cluster of participating systems that cannot be distinguished with statistical significance. (Almost all systems participating in the Tuning Task ended up in this cluster.) The experiments so far were limited to optimizing towards Bleu, but other MT evaluation metrics may benefit more than Bleu from the PSO style of exploring the parameter space.
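A generic PSO loop for tuning a weight vector looks roughly as follows; the swarm size, inertia and acceleration constants are common defaults, and the objective is a stand-in for re-ranking a development n-best list and measuring Bleu. This is a sketch of the algorithm in general, not of the CUNI implementation in Moses.

```python
import numpy as np

# Generic particle swarm optimization sketch for tuning a weight vector.
rng = np.random.default_rng(1)
dim, n_particles, iters = 14, 20, 100

def objective(weights):
    # Stand-in for "Bleu of the dev set re-ranked with these weights" (to be maximized).
    return -np.sum((weights - 0.3) ** 2)

pos = rng.uniform(-1.0, 1.0, (n_particles, dim))      # particle positions = weight vectors
vel = np.zeros_like(pos)                              # particle velocities
pbest = pos.copy()                                    # personal best positions
pbest_val = np.array([objective(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()              # global best position

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([objective(p) for p in pos])      # evaluations are embarrassingly parallel
    improved = vals > pbest_val
    pbest[improved] = pos[improved]
    pbest_val[improved] = vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()

print("best weights found:", np.round(gbest, 3))
```

The parallelizability mentioned above comes from the inner evaluation step: all particles can be scored independently in the same iteration.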

7 CharacTER: Translation Edit Rate on Character Level

RWTH introduced CharacTER, a novel character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTER calculates the edit distance on the character level, while the shift edit is performed on the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word, and can be shifted, if the edit distance between them is below a threshold. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed on the character level. In addition, the length of the hypothesis sequence, rather than that of the reference, is used for normalizing the edit distance, which effectively counters the fact that shorter translations normally achieve lower TER. The experimental results showed that CharacTER achieves high correlation with human judgment on the system level, especially for morphologically rich languages, which benefit from the character-level information. It outperforms other strong metrics for translation directions out of English, while the underlying concept remains simple and straightforward. This work was done in cooperation with WP2, will appear at WMT 2016 (Wang et al., 2016), and is available in Appendix F.
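A simplified version of the score can be sketched as below: a character-level Levenshtein distance normalized by the hypothesis length in characters. The word-level shift step and the relaxed word matching of the full metric are omitted here for brevity.

```python
# Simplified CharacTER-style score: character-level edit distance between hypothesis
# and reference, normalized by the hypothesis length in characters. The word-level
# shift step of the full metric is not implemented in this sketch.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_score(hyp, ref):
    hyp_chars = " ".join(hyp.split())
    ref_chars = " ".join(ref.split())
    return edit_distance(hyp_chars, ref_chars) / max(len(hyp_chars), 1)

print(character_score("das Haus ist klein", "das Haus ist sehr klein"))
```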

8 Bag-of-Words Input Features for Neural Network

Neural network models have recently achieved consistent improvements in statistical machine translation. However, most networks use only simple one-hot encoded word vectors as their input. RWTH investigated exponentially decaying bag-of-words input features for feed-forward neural network translation models. Their work showed that the performance of bag-of-words inputs can be improved by training the decay rates along with the other weight parameters. The decay rates fine-tune the effect of distant words on the current translation. Different kinds of decay rates were investigated; a decay rate that depends on the aligned source word performed best. It provides an average improvement of 0.5 Bleu points on three different translation tasks (IWSLT 2013 German→English, WMT 2015 German→English, and BOLT Chinese→English) on top of a state-of-the-art phrase-based system, a baseline which already includes a neural network translation model. The model slightly outperformed a bidirectional LSTM translation model on the given tasks, and the experiments showed that using a bag-of-words vector outperformed using a larger source-side window for the feed-forward neural network on IWSLT. This work will appear at ACL 2016 (Peter et al., 2016b) and is available in Appendix G.
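The following sketch shows how an exponentially decaying bag-of-words input vector can be formed: each context word's embedding is weighted by decay^distance, where the decay rate would be a trainable parameter of the network (here it is simply fixed). The vocabulary, embeddings and decay values are illustrative assumptions.

```python
import numpy as np

# Sketch of an exponentially decaying bag-of-words input vector: each context word's
# embedding is weighted by decay**distance; in the actual model the decay rates are
# trained jointly with the other network parameters.

rng = np.random.default_rng(2)
vocab = {"wir": 0, "kommen": 1, "später": 2, "zurück": 3}
emb = rng.normal(size=(len(vocab), 8))          # word embeddings (illustrative)
decay = np.full(len(vocab), 0.7)                # per-word decay rates (trainable in the model)

def decayed_bow(context_words, current_pos):
    vec = np.zeros(emb.shape[1])
    for pos, w in context_words:
        idx = vocab[w]
        dist = abs(current_pos - pos)
        vec += (decay[idx] ** dist) * emb[idx]  # distant words contribute less
    return vec

context = [(0, "wir"), (1, "kommen"), (2, "später")]
print(decayed_bow(context, current_pos=3))
```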

9 Vocabulary Reduction for Phrase Table Smoothing

To address the sparsity problem of the standard phrase translation model, RWTH reduced the vocabulary size by mapping words into a smaller label space, eventually training denser distributions. They developed a smoothed translation model that maps one word of a phrase at a time to its respective label, yielding gains of up to 0.7 Bleu points over a standard phrase-based SMT baseline. They evaluated the smoothing models using various vocabularies of different sizes and structures, showing that different word-label mappings are almost equally effective for the vocabulary reduction. This allows the use of any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort otherwise spent on optimizing the structure and granularity of the label vocabulary. This result emphasizes the fundamental sparsity of the standard phrase translation model. Tests of the vocabulary reduction in translation scenarios of different scales showed that the smoothing works better with more parallel data, which implies that for low-resource translation tasks OOV handling is more crucial than smoothing the phrase translation model. This work will appear at WMT 2016 (Kim et al., 2016) and is available in Appendix H.
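As a toy illustration of the word-label mapping, the sketch below replaces one source word of each phrase at a time by its label and re-aggregates the counts from which smoothed relative frequencies would be estimated. The label map and the counts are invented for the example and do not reflect the exact model definition in the paper.

```python
from collections import Counter

# Sketch of phrase-table smoothing by vocabulary reduction: one source word of the
# phrase at a time is replaced by its label (e.g. a word class), and counts are
# re-aggregated in the reduced space. The label map and counts are toy data.

word2label = {"house": "C7", "home": "C7", "small": "C3", "tiny": "C3"}

phrase_pair_counts = Counter({
    (("small", "house"), ("kleines", "Haus")): 3,
    (("tiny", "house"), ("winziges", "Haus")): 1,
})

def reduced_counts(position):
    """Replace the source word at `position` by its label and re-aggregate the counts."""
    reduced = Counter()
    for (src, tgt), c in phrase_pair_counts.items():
        src = list(src)
        src[position] = word2label.get(src[position], src[position])
        reduced[(tuple(src), tgt)] += c
    return reduced

print(reduced_counts(0))
# -> both phrases now share the reduced source side ('C3', 'house'),
#    so their counts feed a denser, smoothed distribution
```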


10 Faster and Better Word Classes for Word Alignment

Word clusters are useful for many NLP tasks, including training neural network language models, but current growth in dataset sizes is outpacing the ability of word clusterers to handle them. Little attention has been paid so far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but sometimes does not yield perplexities as good as those of other, slower clustering algorithms. DFKI introduced the bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. It improves upon the predictive exchange algorithm’s perplexity by up to 18%, giving it perplexities comparable to the slower two-sided exchange algorithm and better perplexities than the slower Brown clustering algorithm. The Bira implementation by DFKI is fast, clustering a 2.5-billion-token English News Crawl corpus in 3 hours. It also reduces machine translation training time while preserving translation quality. The implementation is portable and freely available. This work appeared at NAACL 2016 (Dehdari et al., 2016) and is available in Appendix I.
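The flavor of exchange-based clustering can be sketched as follows: each word is moved to the class that maximizes a one-sided class-bigram likelihood, here recomputed naively from scratch on a toy corpus. Real predictive exchange implementations (including Bira) update counts incrementally and scale to billions of tokens; nothing below reflects their actual code.

```python
import math
from collections import Counter

# Naive exchange-style clustering step on a toy corpus. The objective approximates
# p(w2 | w1) by p(class(w2) | w1) * p(w2 | class(w2)) and is recomputed from scratch
# for every candidate move, which real implementations avoid.

corpus = "the cat sat on the mat the dog sat on the rug".split()
words = sorted(set(corpus))
n_classes = 2
cls = {w: i % n_classes for i, w in enumerate(words)}   # arbitrary initialization

def log_likelihood(cls):
    pairs = list(zip(corpus, corpus[1:]))
    pred_counts = Counter((w1, cls[w2]) for w1, w2 in pairs)
    hist_counts = Counter(w1 for w1, _ in pairs)
    word_counts = Counter(w2 for _, w2 in pairs)
    class_counts = Counter(cls[w2] for _, w2 in pairs)
    ll = 0.0
    for w1, w2 in pairs:
        p = (pred_counts[(w1, cls[w2])] / hist_counts[w1]) \
            * (word_counts[w2] / class_counts[cls[w2]])
        ll += math.log(p)
    return ll

for w in words:                                          # one exchange pass over the vocabulary
    cls[w] = max(range(n_classes), key=lambda k: log_likelihood({**cls, w: k}))

print(cls)
```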

References

Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta, and Hermann Ney. 2016. Alignment-based neural machine translation. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Jon Dehdari, Liling Tan, and Josef van Genabith. 2016. BIRA: Improved predictive exchange word clustering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics, San Diego, CA, USA, pages 1169–1174.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov models over minimal translation units help phrase-based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, pages 399–405.

Russ C. Eberhart, James Kennedy, et al. 1995. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science. New York, NY, volume 1, pages 39–43.

Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, and Hermann Ney. 2015. A comparison between count and neural network models based on joint translation and reordering sequences. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 1401–1411.

Yunsu Kim, Andreas Guta, Joern Wuebker, and Hermann Ney. 2016. A comparative study on vocabulary reduction for phrase table smoothing. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Viktor Kocur and Ondřej Bojar. 2016. Particle Swarm Optimization Submission for WMT16 Tuning Task. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. Association for Computational Linguistics, Berlin, Germany, pages 515–521.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the Association for Computational Linguistics. Sapporo, Japan.

Jan-Thorsten Peter, Tamer Alkhouli, Hermann Ney, Matthias Huck, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, Stella Frank, and Marcis Pinnis. 2016a. The QT21/HimL combined machine translation system. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Jan-Thorsten Peter, Weiyue Wang, and Hermann Ney. 2016b. Exponentially decaying bag-of-words input features for feed-forward neural network in statistical machine translation. In Annual Meeting of the Association for Computational Linguistics. Berlin, Germany.

Miloš Stanojević and Khalil Sima’an. 2015. Evaluating MT systems with BEER. The Prague Bulletin of Mathematical Linguistics 104:17–26.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTER: Translation edit rate on character level. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.


A Joint Translation and Reordering Sequences

A Comparison between Count and Neural Network Models Based on Joint Translation and Reordering Sequences

Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract

We propose a conversion of bilingual sentence pairs and the corresponding word alignments into novel linear sequences. These are joint translation and reordering (JTR) uniquely defined sequences, combining interdepending lexical and alignment dependencies on the word level into a single framework. They are constructed in a simple manner while capturing multiple alignments and empty words. JTR sequences can be used to train a variety of models. We investigate the performances of n-gram models with modified Kneser-Ney smoothing, feed-forward and recurrent neural network architectures when estimated on JTR sequences, and compare them to the operation sequence model (Durrani et al., 2013b). Evaluations on the IWSLT German→English, WMT German→English and BOLT Chinese→English tasks show that JTR models improve state-of-the-art phrase-based systems by up to 2.2 BLEU.

1 Introduction

Standard phrase-based machine translation (Och et al., 1999; Zens et al., 2002; Koehn et al., 2003) uses relative frequencies of phrase pairs to estimate a translation model. The phrase table is extracted from a bilingual text aligned on the word level, using e.g. GIZA++ (Och and Ney, 2003). Although the phrase pairs capture internal dependencies between the source and target phrases aligned to each other, they fail to model dependencies that extend beyond phrase boundaries. Phrase-based decoding involves concatenating target phrases. The burden of ensuring that the result is linguistically consistent falls on the language model (LM).

This work proposes word-based translation models that are potentially capable of capturing long-range dependencies. We do this in two steps: First, given bilingual sentence pairs and the associated word alignments, we convert the information into uniquely defined linear sequences. These sequences encode both word reordering and translation information. Thus, they are referred to as joint translation and reordering (JTR) sequences. Second, we train an n-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the resulting JTR sequences. This yields a model that fuses interdepending reordering and translation dependencies into a single framework.

Although JTR n-gram models are closely related to the operation sequence model (OSM) (Durrani et al., 2013b), there are three main differences. To begin with, the OSM employs minimal translation units (MTUs), which are essentially atomic phrases. As the MTUs are extracted sentence-wise, a word can potentially appear in multiple MTUs. In order to avoid overlapping translation units, we define the JTR sequences on the level of words. Consequently, JTR sequences have smaller vocabulary sizes than OSM sequences and lead to models with less sparsity. Moreover, we argue that JTR sequences offer a simpler reordering approach than operation sequences, as they handle reorderings without the need to predict gaps. Finally, when used as an additional model in the log-linear framework of phrase-based decoding, an n-gram model trained on JTR sequences introduces only one single feature to be tuned, whereas the OSM additionally uses 4 supportive features (Durrani et al., 2013b). Experimental results confirm that this simplification does not make JTR models less expressive, as their performance is on par with the OSM.

Due to data sparsity, increasing the n-gram order of count-based models beyond a certain point becomes useless. To address this, we resort to neu-


ral networks (NNs), as they have been successfully words and do not include reordering information. applied to machine translation recently (Sunder- Durrani et al. (2011) developed the OSM which meyer et al., 2014; Devlin et al., 2014). They are combined dependencies on bilingual word pairs able to score any word combination without re- and reordering information into a single frame- quiring additional smoothing techniques. We ex- work. It used an own decoder that was based on n- periment with feed-forward and recurrent trans- grams of MTUs and predicted single translation or lation networks, benefiting from their smoothing reordering operations. This was further advanced capabilities. To this end, we split the linear se- in (Durrani et al., 2013a) by a decoder that was quence into two sequences for the neural transla- capable of predicting whole sequences of MTUs, tion models to operate on. This is possible due to similar to a phrase-based decoder. In (Durrani et the simplicity of the JTR sequence. We show that al., 2013b), a slightly enhanced version of OSM the count and NN models perform well on their was integrated into the log-linear framework of own, and that combining them yields even better the Moses system (Koehn et al., 2007). Both the results. BILM (Stewart et al., 2014) and the OSM (Durrani In this work, we apply n-gram models with et al., 2014) can be smoothed using word classes. modified Kneser-Ney smoothing during phrase- Guta et al. (2015) introduced the extended trans- based decoding and neural JTR models in rescor- lation model (ETM), which operates on the word ing. However, using a phrase-based system is not level and augments the IBM models by an addi- required by the model, but only the initial step to tional bilingual word pair and a reordering opera- demonstrate the strength of JTR models, which tion. It is implemented into the log-linear frame- can be applied independently of the underlying de- work of a phrase-based decoder and shown to be coding framework. While the focus of this work is competitive with a 7-gram OSM. on the development and comparison of the models, The JTR n-gram models proposed within this the long-term goal is to decode using JTR mod- work can be seen as an extension of the ETM. els without the limitations introduced by phrases, Nevertheless, JTR models utilize linear sequences in order to exploit the full potential of JTR mod- of dependencies and combine the translation of els. The JTR models are estimated on word align- bilingual word pairs and reoderings into a sin- ments, which we obtain using GIZA++ in this pa- gle model. The ETM, however, features separate per. The future aim is to also generate improved models for the translation of individual words and word alignments by a joint optimization of both reorderings and provides an explicit treatment of the alignments and the models, similar to the train- multiple alignments. As they operate on linear se- ing of IBM models (Brown et al., 1990; Brown et quences, JTR count models can be implemented al., 1993). In the long run, we intend to achieve a using existing toolkits for n-gram language mod- consistency between decoding and training using els, e.g. the KenLM toolkit (Heafield et al., 2013). the introduced JTR models. An HMM approach for word-to-phrase align- ments was presented in (Deng and Byrne, 2005), 2 Previous Work showing performance similar to IBM Model 4 on In order to address the downsides of the phrase the task of bitext alignment. Feng et al. 
(2013) translation model, various approaches have been propose several models which rely only on the in- taken. Marino˜ et al. (2006) proposed a bilingual formation provided by the source side and pre- language model (BILM) that operates on bilin- dict reorderings. Contrastingly, JTR models in- gual n-grams, with an own n-gram decoder re- corporate target information as well and predict quiring monotone alignments. The lexical re- both translations and reorderings jointly in a sin- ordering model introduced in (Tillmann, 2004) gle framework. was integrated into phrase-based decoding. Crego Zhang et al. (2013) explore different Markov and Yvon (2010) adapted the approach to BILMs. chain orderings for an n-gram model on MTUs The bilingual n-grams are further advanced in in rescoring. Feng and Cohn (2013) present an- (Niehues et al., 2011), where they operate on non- other generative word-based Markov chain trans- monotone alignments within a phrase-based trans- lation model which exploits a hierarchical Pitman- lation framework. Compared to our JTR models, Yor process for smoothing, but it is only applied their BILMs treat jointly aligned source words as to induce word alignments. Their follow-up work minimal translation units, ignore unaligned source (Feng et al., 2014) introduces a Markov-model on


MTUs, similar to the OSM described above. Algorithm 1 JTR Conversion Algorithm Recently, neural machine translation has J I I 1: procedure JTRCONVERSION( f1 , e1, b1) K emerged as an alternative to phrase-based decod- 2: g1 ← /0 ing, where NNs are used as standalone models to 3:// last translated source position j0 4: j0← 0 decode source input. In (Sutskever et al., 2014), 5: for i ← 1 to I do a recurrent NN was used to encode a source 6: if ei is unaligned then 7:// align ei to the empty word ε sequence, and output a target sentence once the K 8: APPEND(g1 , hε,eii) source sentence was fully encoded in the network. 9: continue The network did not have any explicit treatment 10:// ei is aligned to at least one source word of alignments. Bahdanau et al. (2015) introduced 11: j ← first source position in bi 12: if j = j0 then soft alignments as part of the network architecture. 13:// ei is aligned to the same f j as ei−1 In this work, we make use of hard alignments K 14: APPEND(g1 , hσ,eii) instead, where we encode the alignments in the 15: continue 16: if j 6= j0 + 1 then source and target sequences, requiring no mod- 17:// alignment step is non-monotone J I K 0 ifications of existing feed-forward and recurrent 18: REORDERINGS( f1 , b1, g1 , j , j) NN architectures. Our feed-forward models are 19: // 1-to-1 translation: f j is aligned to ei K based on the architectures proposed in (Devlin et 20: APPEND(g1 , h f j,eii) 0 al., 2014), while the recurrent models are based 21: j ← j 22:// generate all other f j that are also on (Sundermeyer et al., 2014). Further recent 23:// aligned to the current target word ei research on applying NN models for extended 24: for all remaining j in bi do APPEND gK h f , i context was carried out in (Le et al., 2012; Auli 25: ( 1 , j σ ) 26: j0 ← j et al., 2013; Hu et al., 2014). All of these works 27:// check last alignment step at sentence end focus on lexical context and ignore the reordering 28: if j0 6= J then 29:// last alignment step is non-monotone aspect covered in our work. J I K 0 30: REORDERINGS( f1 , b1, g1 , j , J + 1) gK 3 JTR Sequences 31: return 1 32: 33:// called when a reordering class is appended The core idea of this work is the interpretation of J I K 0 34: procedure REORDERINGS( f1 , b1, g1 , j , j) a bilingual sentence pair and its word alignment 35:// check if the predecessor is unaligned as a linear sequence of K joint translation and re- 36: if f j−1 is unaligned then ordering (JTR) tokens gK. Formally, the sequence 37:// get unaligned predecessors 1 j−1 K J I I 38: f j ← unaligned predecessors of f j g1 ( f1 ,e1,b1) is a uniquely defined interpretation 0 J I 39:// check if the alignment step to the first of a given source sentence f1 , its translation e1 and 40:// unaligned predecessor is monotone I 0 the inverted alignment b1, where bi denotes the 41: if j0 6= j + 1 then 42:// non-monotone: add reordering class ordered sequence of source positions j aligned to K 43: APPEND(g , ∆ 0 ) target position i. We drop the explicit mention of 1 j , j0 44:// translate unaligned predecessors by ε ( f J,eI ,bI ) to allow for a better readability. Each 1 1 1 45: for f ← f j0 to f j−1 do K JTR token is either an aligned bilingual word pair 46: APPEND(g1 , h f ,εi) h f ,ei or a reordering class ∆ j0 j. 47: else 48:// non-monotone: add reordering class Unaligned words on the source and target side K 49: APPEND(g , ∆ 0 ) are processed as if they were aligned to the empty 1 j , j word ε. 
Hence, an unaligned source word f gener- ates the token h f ,εi, and an unaligned target word e the token hε,ei. words. Similar to Feng and Cohn (2013), we clas- 0 Each word of the source and target sentences is sify the reordered source positions j and j by ∆ j0 j: to appear in the corresponding JTR sequence ex- actly once. For multiply-aligned target words e,  the first source word f that is aligned to e gener- step backward (←), j = j0 − 1  0 0 ates the token h f ,ei. All other source words f , ∆ j0 j = jump forward (y), j > j + 1 that are also aligned to e, are processed as if they  0 jump backward (x), j < j − 1. were aligned to the artificial word σ. Thus, each of these f 0 generates a token h f 0,σi. The same approach is applied to multiply-aligned source The reordering classes are illustrated in Figure 1.


Figure 1: Overview of the different reordering classes in JTR sequences: (a) step backward, (b) jump forward, (c) jump backward.

3.1 Sequence Conversion . Algorithm 1 presents the formal conversion of a code bilingual sentence pair and its alignment into the your K K enter corresponding JTR sequence g1 . At first, g1 is initialized by an empty sequence (line 2). For each , target position i = 1,...,I it is extended by at least field one token. During the generation process, we store Command the last visited source position j0 (line 4). If a tar- the get word ei is in .

• unaligned, we align it to the empty word ε im ein Sie Code K Feld Ihren and append hε,eii to the current g1 (line 8), geben Befehl • if it is aligned to the same f j as ei−1, we only add hσ,eii (line 14), Figure 2: This example illustrates the JTR se- quence gK for a German→English sentence pair • otherwise we append h f j,eii (line 20) and 1 • in case there are more source words aligned including the word-to-word alignment. to ei, we additionally append h f j,σi for each of these (line 24). token has to be generated right before h.,.i is

Before a token h f j,eii is generated, we have to generated. Therefore, there is no forward jump check whether the alignment step from j0 to j is from hCode,codei to h.,.i, but a monotone step monotone (line 16). In case it is not, we have to to hein,εi followed by h.,.i. deal with reorderings (line 34). We define that 3.2 Training of Count Models a token h f j−1,εi is to be generated right before As the JTR sequence gK is a unique interpretation the generation of the token containing f j. Thus, 1 if f is not aligned, we first determine the con- of a bilingual sentence pair and its alignment, the j−1 J I I j−1 probability p( f ,e ,b ) can be computed as: tiguous sequence of unaligned predecessors f 1 1 1 j0 0 J I I K (line 38). Next, if the step from j to j0 is not p( f1 ,e1,b1) = p(g1 ). (1) monotone, we add the corresponding reordering The probability of gK can be factorized and ap- class (line 43). Afterwards we append all h f ,εi 1 j0 proximated by an n-gram model. to h f j−1,εi. If f j−1 is aligned, we do not have to K process unaligned source words and only append K k−1 p(g1 ) = ∏ p(gk|gk−n+1) (2) the corresponding reordering class (line 49). k=1 Figure 2 illustrates the generation steps of a Within this work, we first estimate the Viterbi JTR sequence, whose result is presented in Ta- alignment for the bilingual training data using ble 1. The alignment steps are denoted by the ar- GIZA++ (Och and Ney, 2003). Secondly, the con- rows connecting the alignment points. The first version presented in Algorithm 1 is applied to ob- dashed alignment point indicates the hε,,i token tain the JTR sequences, on which we estimate an that is generated right after the hFeld,fieldi to- n-gram model with modified Kneser-Ney smooth- ken. The second dashed alignment point indicates ing as described in (Chen and Goodman, 1998) us- 1 the hein,εi token, which corresponds to the un- ing the KenLM toolkit (Heafield et al., 2013). aligned source word ein. Note, that the hein,εi 1https://kheafield.com/code/kenlm/


k gk sk tk De→En, this led to a baseline weaker by 0.2 BLEU 1 y δ y than the one described in Section 5. In order to 2 him,ini im in have an unconstrained and fair baseline, we there- 3 hσ,thei σ the after removed this constraint and forced such dele- 4 y δ y 5 hBefehl,Commandi Befehl Command tion tokens to be generated at the end of the se- 6 ← δ ← quence. Hence, we accept that the JTR model 7 hFeld,fieldi Feld field 8 hε,,i ε , might compute the wrong score in these special 9 x δ x cases. 10 hgeben,enteri geben enter 11 hSie,σi Sie σ 4 Neural Networks 12 y δ y 13 hIhren,youri Ihren your 14 hCode,codei Code code Usually, smoothing techniques are applied to 15 hein,εi ein ε count-based models to handle unseen events. A 16 h.,.i .. neural network does not suffer from this, as it is able to score unseen events without additional Table 1: The left side of this table presents the JTR smoothing techniques. In the following, we will tokens gk corresponding to Figure 2. The right describe how to adapt JTR sequences to be used side shows the source and target tokens sk and tk with feed-forward and recurrent NNs. obtained from the JTR tokens gk. They are used The first thing to notice is the vocabulary size, for the training of NNs (cf. Section 4). mainly determined by the number of bilingual word pairs, which constituted atomic units in the count-based models. NNs that compute probabil- 3.3 Integration into Phrase-based Decoding ity values at the output layer evaluate a softmax Basically, each phrase table entry is annotated function that produces normalized scores that sum with both the word alignment information, which up to unity. The softmax function is given by: also allows to identify unaligned source words, i−1 oe (e1 ) and the corresponding JTR sequence. The JTR i−1 e i p(ei|e ) = (3) model is added to the log-linear framework as an 1 |V| o (ei−1) ∑ e w 1 additional n-gram model. Within the phrase-based w=1

decoder, we extend each search state such that it where oei and ow are the raw unnormalized output additionally stores the JTR model history. layer values for the words ei and w, respectively, In comparison to the OSM, the JTR model does and |V| is the vocabulary size. The output layer i−1 not predict gaps. Local reorderings within phrases is a function of the context e1 . Computing the are handled implicitly. On the other hand, we rep- denominator is expensive for large vocabularies, resent long-range reorderings between phrases by as it requires computing the output for all words. the coverage vector and limit them by reordering Therefore, we split JTR tokens gk and use indi- constraints. vidual words as input and output units, such that Phrase-pairs ending with unaligned source the NN receives jumps, source and target words as words at their right boundary prove to be a prob- input and outputs target words and jumps. Hence, lem during decoding. As shown in Subsection 3.1, the resulting neural model is not a LM, but a trans- the conversion from word alignments to JTR se- lation model with different input and output vo- K quences assumes that each token corresponding to cabularies. A JTR sequence g1 is split into its K K an unaligned source word is generated immedi- source and target parts s1 and t1 . The construc- K ately before the token corresponding to the closest tion of the JTR source sequence s1 proceeds as aligned source position to its right. However, if a follows: Whenever a bilingual pair is encountered, phrase ends with an unaligned f j as its rightmost the source word is kept and the target word is dis- source word, the generation of the h f j,εi token has carded. In addition, all jump classes are replaced K to be postponed until the next word f j+1 is to be by a special token δ. The JTR target sequence t1 is translated or, even worse, f j+1 has already been constructed similarly by keeping the target words translated before. and dropping source words, and the jump classes To address this issue, we constrained the phrase are also kept. Table 1 shows the JTR source and table extraction to discard entries with unaligned target sequences corresponding to JTR sequence source tokens at the right boundary. For IWSLT of Figure 2.
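Section 4 above describes splitting each JTR token into a source token s_k and a target token t_k (cf. Table 1): bilingual pairs contribute their source word to the source stream and their target word to the target stream, while jump classes become a generic placeholder on the source side and are kept on the target side. A minimal sketch of that split, reusing the simplified (and purely illustrative) token format from the sketch in Section 2 of the main report:

```python
# Sketch of splitting a JTR sequence into the source and target streams used as
# neural network input and output (cf. Table 1). Token names are illustrative.

JUMPS = {"<step_back>", "<jump_fwd>", "<jump_back>"}

def split_jtr(jtr_tokens):
    source, target = [], []
    for tok in jtr_tokens:
        if tok in JUMPS:            # reordering class: placeholder on source side, kept on target side
            source.append("<delta>")
            target.append(tok)
        else:                       # bilingual word pair (f, e)
            f, e = tok
            source.append(f)
            target.append(e)
    return source, target

jtr = [("kommen", "come"), "<jump_fwd>", ("zurueck", "back")]
print(split_jtr(jtr))
# (['kommen', '<delta>', 'zurueck'], ['come', '<jump_fwd>', 'back'])
```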


Due to the design of the JTR sequence, pro- layer with the target input tk−1 as well, that is, ducing the source and target JTR sequences is we aggregate the embeddings of the input source straightforward. The resulting sequences can then word sk and the input target word tk−1 before they be used with existing NN architectures, without are fed into the forward layer. Due to recurrency, k−1 k further modifications to the design of the net- the forward layer encodes the parts (t1 ,s1), and K works. This results in powerful models that re- the backward layer encodes sk , and together they k−1 K quire little effort to implement. encode (t1 ,s1 ), which is used to score the out- put target word tk. For the sake of comparison 4.1 Feed-forward Neural JTR to FFNN and count models, we also experiment First, we will apply a feed-forward NN (FFNN) to with a recurrent model that does not include future the JTR sequence. FFNN models resemble count- source information, this is obtained by replacing K k based models in using a predefined limited context the term s1 with s1 in Eq. 5. It will be referred size, but they do not encounter the same smooth- to as the unidirectional recurrent neural network ing problems. In this work, we use a FFNN similar (URNN) model in the experiments. to that proposed in (Devlin et al., 2014), defined Note that the JTR source and target sides as: include jump information, therefore, the RNN K model described above explicitly models reorder- p(tK|sK) ≈ p(t |tk−1,sk ). 1 1 ∏ k k−n k−n (4) ing. In contrast, the models proposed in (Sunder- k=1 meyer et al., 2014) do not include any jumps, and It scores the JTR target word tk at position k us- hence do not provide an explicit way of includ- ing the current source word sk, and the history of ing word reordering. In addition, the JTR RNN n JTR source words. In addition, the n JTR target models do not require the use of IBM-1 lexica to words preceding tk are used as context. The FFNN resolve multiply-aligned words. As discussed in computes the score by looking up the vector em- Section 3, these cases are resolved by aligning the beddings of the source and target context words, multiply-aligned word to the first word on the op- concatenating them, then evaluating the rest of the posite side. network. We reduce the output layer to a short- The integration of the NNs into the decoder is list of the most frequent words, and compute word not trivial, due to the dependence on the target class probabilities for the remaining words. context. In the case of RNNs, the context is un- bounded, which would affect state recombination, 4.2 Recurrent Neural JTR and lead to less variety in the beam used to prune Unlike feed-forward NNs, recurrent NNs (RNNs) the search space. Therefore, the RNN scores are enable the use of unbounded context. Following computed using approximations instead (Auli et (Sundermeyer et al., 2014), we use bidirectional al., 2013; Alkhouli et al., 2015). In (Alkhouli et recurrent NNs (BRNNs) to capture the full JTR al., 2015), it is shown that approximate RNN inte- source side. The BRNN uses the JTR target side gration into the phrase-based decoder has a slight as well as the full JTR source side as context, and advantage over n-best rescoring. Therefore, we it is given by: apply RNNs in rescoring in this work, and to al- K low for a direct comparison between FFNNs and K K k−1 K p(t1 |s1 ) = ∏ p(tk|t1 ,s1 ) (5) RNNs, we apply FFNNs in rescoring as well. k=1

This equation is realized by a network that uses 5 Evaluation forward and backward recurrent layers to capture We perform experiments on the large- the complete source sentence. By a forward layer scale IWSLT 20132 (Cettolo et al., we imply a recurrent hidden layer that processes 2014) German→English, WMT 20153 a given sequence from left to right, while a back- German→English and the DARPA BOLT ward layer does the processing backwards, from Chinese→English tasks. The statistics for the right to left. The source sentence is basically split bilingual corpora are shown in Table 2. Word at a given position k, then past and future represen- alignments are generated with the GIZA++ toolkit tations of the sentence are recursively computed by the forward and backward layers, respectively. 2http://www.iwslt2013.org To include the target side, we provide the forward 3http://www.statmt.org/wmt15/


                IWSLT                 WMT                   BOLT
                German    English     German    English     Chinese   English
Sentences              4.32M                4.22M                 4.08M
Run. Words      108M      109M        106M      108M        78M       86M
Vocabulary      836K      792K        814K      773K        384K      817K

Table 2: Statistics for the bilingual training data of the IWSLT 2013 German→English, WMT 2015 German→English, and the DARPA BOLT Chinese→English translation tasks.

(Och and Ney, 2003). We use a standard phrase- 5.1 Tasks description based translation system (Koehn et al., 2003). The domain of IWSLT consists of lecture-type The decoding process is implemented as a beam talks presented at TED conferences which are also search. All baselines contain phrasal and lexical available online4. All systems are optimized on smoothing models for both directions, word and the dev2010 corpus, named dev here. Some phrase penalties, a distance-based reordering of the OSM and JTR systems are trained on the model, enhanced low frequency features (Chen TED portions of the data containing 138K sen- et al., 2011), a hierarchical reordering model tences. To estimate the 4-gram LM, we addi- (HRM) (Galley and Manning, 2008), a word tionally make use of parts of the Shuffled News, class LM (Wuebker et al., 2013) and an n-gram LDC English Gigaword and 109-French-English LM. The lexical and phrase translation models of corpora, selected by a cross-entropy difference cri- all baseline systems are trained on all provided terion (Moore and Lewis, 2010). In total, 1.7 bil- bilingual data. The log-linear feature weights are lion running words are taken for LM training. The tuned with minimum error rate training (MERT) BOLT Chinese→English task is evaluated on the (Och, 2003) on BLEU (Papineni et al., 2001). All “discussion forum” domain. The 5-gram LM is systems are evaluated with MultEval (Clark et al., trained on 2.9 billion running words in total. The 2011). The reported BLEU scores are averaged in-domain data consists of a subset of 67.8K sen- over three MERT optimization runs. tences and we used a set of 1845 sentences for tun- ing. The evaluation set test1 contains 1844 and test2 1124 sentences. For the WMT task, we All LMs, OSMs and count-based JTR models used the target side of the bilingual data and all are estimated with the KenLM toolkit (Heafield et monolingual data to train a pruned 5-gram LM on al., 2013). The OSM and the count-based JTR a total of 4.4 billion running words. We concate- model are implemented in the phrasal decoder. nated the newstest2011 and newstest2012 NNs are used only in rescoring. The 9-gram corpora for tuning the systems. FFNNs are trained with two hidden layers. The short lists contain the 10k most frequent words, 5.2 Results and all remaining words are clusterd into 1000 → word classes. The projecton layer has 17 × 100 We start with the IWSLT 2013 German English nodes, the first hidden layer 1000 and the sec- task, where we compare between the different JTR ond 500. The RNNs have LSTM architectures. and OSM models. The results are shown in Ta- n The URNN has 2 hidden layers while the BRNN ble 3. When comparing the in-domain -gram has one forward, one backward and one addi- JTR model trained using Kneser-Ney smoothing n tional hidden layer. All layers have 200 nodes, (KN) to OSM, we observe that the -gram KN . LEU while the output layer is class-factored using 2000 JTR model improves the baseline by 1 4 B test eval11 classes. For the count-based JTR model and OSM on both and . The OSM model we tuned the n-gram size on the tuning set of each performs similarly, with a slight disadvantage on eval11 task. For the full data, 7-grams were used for the . In comparison, the FFNN of Eq. (4) im- . . LEU IWSLT and WMT tasks, and 8-grams for BOLT. proves the baseline by 0 7–0 9 B , compared to . . LEU When using in-domain data, smaller n-gram sizes the slightly better 0 8–1 1 B achieved by the were used. All rescoring experiments used 1000- URNN. 
The difference between the FFNN and the best lists without duplicates. 4http://www.ted.com/


data dev test eval11 part of the source input when scoring target words. This information is not used by the KN model. baseline full 33.3 30.8 35.7 Moreover, the BRNN is able to score word com- +OSM TED 34.5 32.2 36.8 binations unseen in training, while the KN model +FFNN TED 34.0 31.7 36.4 uses backing off to score unseen events. +URNN TED 34.2 31.9 36.5 When training the KN, FFNN, and OSM mod- +BRNN TED 34.4 32.1 36.8 els on the full data, we observe less gains in com- +KN TED 34.6 32.2 37.1 parison to in-domain data training. However, com- +BRNN TED 35.0 32.8 37.7 bining the KN models trained on in-domain and +OSM full 34.1 31.6 36.5 full data gives additional gains, which suggests +FFNN full 33.9 31.5 36.0 that although the in-domain model is more adapted +KN full 34.2 31.6 36.6 to the task, it still can gain from out-of-domain data. Adding the FFNN on top improves the com- +KN TED 34.9 32.4 37.1 bination. Note here that the FFNN sees the same +FFNN TED 35.2 32.7 37.2 information as the KN model, but the difference is +FFNN full 35.1 32.7 37.2 that the NN operates on the word level rather than +BRNN TED 35.5 33.0 37.4 the word-pair level. Second, the FFNN is able to +BRNN TED 35.4 33.0 37.3 handle unseen sequences by design, without the need for the backing off workaround. The BRNN Table 3: Results measured in BLEU for the IWSLT improves the combination more than the FFNN, German→English task. as the model captures an unbounded source and target history in addition to an unbounded future source context. Combining the KN, FFNN and train data test1 test2 BRNN JTR models leads to an overall gain of 2.2 baseline 18.1 17.0 BLEU on both dev and test. +OSM indomain 18.8 17.2 Next, we present the BOLT Chinese→English +FFNN indomain 18.6 17.6 results, shown in Table 4. Comparing n-gram +BRNN indomain 18.6 17.6 KN JTR and OSM trained on the in-domain data +KN indomain 18.8 17.5 shows they perform equally well on test1, im- proving the baseline by 0.7 BLEU, with a slight ad- +OSM full 18.5 17.2 vantage for the JTR model on test2. The feed- +FFNN full 18.4 17.4 forward and the recurrent in-domain networks +KN full 18.8 17.3 yield the same results in comparison to each other. +KN indomain 19.0 17.7 Training the OSM and JTR models on the full data +FFNN full 19.2 18.3 yields slightly worse results than in-domain train- +RNN indomain 19.3 18.4 ing. However, combining the two types of training improves the results. This is shown when adding Table 4: Results measured in BLEU for the BOLT the in-domain KN JTR model on top of the model Chinese→English task. trained on full data, improving it by up to 0.4 BLEU. Rescoring with the feed-forward and the recurrent network improves this even further, sup- URNN is that the latter captures the unbounded porting the previous observation that the n-gram source and target history that extends until the be- KN JTR and NNs complement each other. The ginning of the sentences, giving it an advantage combination of the 4 models yields an overall im- over the FFNN. The performance of the URNN provement of 1.2–1.4 BLEU. can be improved by including the future part of the Finally, we compare KN JTR and OSM models source sentence, as described in Eq. (5), resulting on the WMT German→English task in Table 5. in the BRNN model. Next, we explore whether the The two models perform almost similar to each models are additive. When rescoring the n-gram other. The JTR model improves the baseline by KN JTR output with the BRNN, an additional im- up to 0.7 BLEU. 
Rescoring the KN JTR with the provement of 0.6 BLEU is obtained. There are two FFNN improves it by up to 0.3 BLEU leading to an reasons for this: The BRNN includes the future overall improvement between 0.5 and 1.0 BLEU.


                newstest
            2013   2014   2015
baseline    28.1   28.6   29.4
+OSM        28.6   28.9   30.0
  +FFNN     28.7   28.9   29.7
+KN         28.8   28.9   29.9
  +FFNN     29.1   29.1   30.0

Table 5: Results measured in BLEU for the WMT German→English task.

5.3 Analysis

To investigate the effect of including jump information in the JTR sequence, we trained a BRNN using jump classes and another excluding them. The BRNNs were used in rescoring. Below, we demonstrate the difference between the systems:

source: wir kommen später noch auf diese Leute zurück .
reference: We'll come back to these people later .

Hypothesis 1:
JTR source: wir kommen δ zurück δ später noch auf diese Leute δ .
JTR target: we come y back x later σ to these people y .

Hypothesis 2:
JTR source: wir kommen später noch auf diese Leute zurück .
JTR target: we come later σ on these guys back .

Note the German verb "zurückkommen", which is split into "kommen" and "zurück". German places "kommen" at the second position and "zurück" towards the end of the sentence. Unlike German, the corresponding English phrase "come back" has the words adjacent to each other. We found that the system including jumps prefers the correct translation of the verb, as shown in Hypothesis 1 above. The system translates "kommen" to "come", jumps forward to "zurück", translates it to "back", then jumps back to continue translating the word "später". In contrast, the system that excludes jump classes is blind to this separation of words. It favors Hypothesis 2, which is a strictly monotone translation of the German sentence. This is also reflected by the BLEU scores, where we found the system including jump classes outperforming the one without by up to 0.8 BLEU.

6 Conclusion

We introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. These combine interdependent lexical and alignment dependencies in a single framework. A main advantage of JTR sequences is that a variety of models can be trained on them. Here, we have estimated n-gram models with modified Kneser-Ney smoothing, as well as FFNN and RNN architectures, on JTR sequences.

We compared our count-based JTR model to the OSM, both used in phrase-based decoding, and showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs 5. Due to the flexibility of JTR sequences, we can also apply them to FFNNs and RNNs. Utilizing two count models and applying both networks in rescoring gains the overall highest improvement over the phrase-based system, by up to 2.2 BLEU on the German→English IWSLT task. The combination outperforms the OSM by up to 1.2 BLEU on the BOLT Chinese→English tasks.

The JTR models are not dependent on the phrase-based framework, and one of the long-term goals is to perform standalone decoding with the JTR models independently of phrase-based systems. Without the limitations introduced by phrases, we believe that JTR models could perform even better. In addition, we aim to use JTR models to obtain the alignment, which would then be used to train the JTR models in an iterative manner, achieving consistency and hoping for improved models.

Acknowledgements

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). This material is partially based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
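For readers who want to see the representation spelled out, the following is a minimal sketch of turning one word-aligned sentence pair into a single token stream that interleaves lexical choices with reordering jumps. It is our own simplified illustration; the exact JTR encoding and its jump-symbol inventory (δ, σ, ...) are defined earlier in this paper and differ in detail.

# Hypothetical sketch: turning a word-aligned sentence pair into a JTR-like
# token sequence. An explicit jump token is emitted whenever the next source
# position read is not the successor of the previous one, so that lexical and
# reordering decisions end up in one stream. This is an illustration only.

def jtr_sequence(src, tgt, alignment):
    """src, tgt: lists of words; alignment: list of (src_pos, tgt_pos) links."""
    # one source position per target word (unaligned / multiply aligned words
    # would need the affiliation heuristics discussed in Appendix B)
    affiliation = {j: i for i, j in sorted(alignment)}
    seq, prev = [], -1
    for j, e in enumerate(tgt):
        i = affiliation.get(j, prev + 1)
        jump = i - (prev + 1)
        if jump != 0:
            seq.append("<jump%+d>" % jump)   # reordering token
        seq.append("%s|%s" % (src[i], e))    # joint translation token
        prev = i
    return seq

print(jtr_sequence(
    ["wir", "kommen", "später", "noch", "auf", "diese", "Leute", "zurück", "."],
    ["we", "come", "back", "later", "to", "these", "people", "."],
    [(0, 0), (1, 1), (7, 2), (2, 3), (4, 4), (5, 5), (6, 6), (8, 7)]))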


References Yonggang Deng and William Byrne. 2005. Hmm word and phrase alignment for statistical machine transla- Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. tion. In Proceedings of Human Language Technol- Investigations on phrase-based decoding with recur- ogy Conference and Conference on Empirical Meth- rent neural network language and translation mod- ods in Natural Language Processing, pages 169– els. In Proceedings of the EMNLP 2015 Tenth 176, Vancouver, British Columbia, Canada, October. Workshop on Statistical Machine Translation, Lis- bon, Portugal, September. to appear. Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Michael Auli, Michel Galley, Chris Quirk, and Geof- Lamar, Richard Schwartz, and John Makhoul. 2014. frey Zweig. 2013. Joint Language and Translation Fast and Robust Neural Network Joint Models for Modeling with Recurrent Neural Networks. In Con- Statistical Machine Translation. In 52nd Annual ference on Empirical Methods in Natural Language Meeting of the Association for Computational Lin- Processing, pages 1044–1054, Seattle, USA, Octo- guistics, pages 1370–1380, Baltimore, MD, USA, ber. June. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Nadir Durrani, Helmut Schmid, and Alexander Fraser. gio. 2015. Neural machine translation by jointly 2011. A joint sequence translation model with in- learning to align and translate. In International Con- tegrated reordering. In Proceedings of the 49th An- ference on Learning Representations, San Diego, nual Meeting of the Association for Computational Calefornia, USA, May. Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA, June. Peter F. Brown, John Cocke, Stephan A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Nadir Durrani, Alexander Fraser, and Helmut Schmid. Lafferty, Robert L. Mercer, and Paul S. Rossin. 2013a. Model with minimal translation units, but 1990. A Statistical Approach to Machine Transla- decode with phrases. In Proceedings of the 2013 tion. Computational Linguistics, 16(2):79–85, June. Conference of the North American Chapter of the Association for Computational Linguistics: Human Peter F. Brown, Stephan A. Della Pietra, Vincent J. Language Technologies, pages 1–11, Atlanta, Geor- Della Pietra, and Robert L. Mercer. 1993. The gia, June. Mathematics of Statistical Machine Translation: Pa- rameter Estimation. Computational Linguistics, Nadir Durrani, Alexander Fraser, Helmut Schmid, 19(2):263–311, June. Hieu Hoang, and Philipp Koehn. 2013b. Can markov models over minimal translation units help Mauro Cettolo, Jan Niehues, Sebastian Stuker,¨ Luisa phrase-based smt? In Proceedings of the 51st An- Bentivogli, and Marcello Federico. 2014. Report on nual Meeting of the Association for Computational the 11th iwslt evaluation campaign, iwslt 2014. In Linguistics (Volume 2: Short Papers), pages 399– International Workshop on Spoken Language Trans- 405, Sofia, Bulgaria, August. lation, pages 2–11, Lake Tahoe, CA, USA, Decem- ber. Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the useful- Stanley F. Chen and Joshuo Goodman. 1998. An ness of generalized word representations in smt. In Empirical Study of Smoothing Techniques for Lan- COLING, Dublin, Ireland, August. guage Modeling. Technical Report TR-10-98, Com- puter Science Group, Harvard University, Cam- Yang Feng and Trevor Cohn. 2013. A markov bridge, MA, August. 
model of machine translation using non-parametric Boxing Chen, Roland Kuhn, George Foster, and bayesian inference. In 51st Annual Meeting of the Howard Johnson. 2011. Unpacking and transform- Association for Computational Linguistics, pages ing feature functions: New ways to smooth phrase 333–342, Sofia, Bulgaria, August. tables. In MT Summit XIII, pages 269–275, Xiamen, China, September. Minwei Feng, Jan-Thorsten Peter, and Hermann Ney. 2013. Advancements in reordering models for sta- Jonathan H. Clark, Chris Dyer, Alon Lavie, and tistical machine translation. In Annual Meeting Noah A. Smith. 2011. Better hypothesis test- of the Assoc. for Computational Linguistics, pages ing for statistical machine translation: Controlling 322–332, Sofia, Bulgaria, August. for optimizer instability. In 49th Annual Meet- ing of the Association for Computational Linguis- Yang Feng, Trevor Cohn, and Xinkai Du. 2014. Fac- tics:shortpapers, pages 176–181, Portland, Oregon, tored markov translation with robust modeling. In June. Proceedings of the Eighteenth Conference on Com- putational Natural Language Learning, pages 151– Josep Maria Crego and Franc¸ois Yvon. 2010. Improv- 159, Ann Arbor, Michigan, June. ing reordering with linguistically informed bilingual n-grams. In Proceedings of the 23rd International Michel Galley and Christopher D. Manning. 2008. Conference on Computational Linguistics (Coling A simple and effective hierarchical phrase reorder- 2010: Posters), pages 197–205, Beijing, China. ing model. In Proceedings of the Conference on


Empirical Methods in Natural Language Process- Franz J. Och, Christoph Tillmann, and Hermann Ney. ing, EMNLP ’08, pages 848–856, Stroudsburg, PA, 1999. Improved Alignment Models for Statistical USA. Association for Computational Linguistics. Machine Translation. In Proc. Joint SIGDAT Conf. on Empirical Methods in Natural Language Pro- Andreas Guta, Joern Wuebker, Miguel Grac¸a, Yunsu cessing and Very Large Corpora, pages 20–28, Uni- Kim, and Hermann Ney. 2015. Extended translation versity of Maryland, College Park, MD, June. models in phrase-based decoding. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Franz Josef Och. 2003. Minimum Error Rate Training Machine Translation, Lisbon, Portugal, September. in Statistical Machine Translation. In Proc. of the to appear. 41th Annual Meeting of the Association for Compu- tational Linguistics (ACL), pages 160–167, Sapporo, Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Japan, July. Clark, and Philipp Koehn. 2013. Scalable modi- fied Kneser-Ney language model estimation. In Pro- Kishore Papineni, Salim Roukos, Todd Ward, and Wei- ceedings of the 51st Annual Meeting of the Associa- Jing Zhu. 2001. Bleu: a Method for Automatic tion for Computational Linguistics, pages 690–696, Evaluation of Machine Translation. IBM Research Sofia, Bulgaria, August. Report RC22176 (W0109-022), IBM Research Di- vision, Thomas J. Watson Research Center, P.O. Box Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 218, Yorktown Heights, NY 10598, September. 2014. Minimum translation modeling with recur- rent neural networks. In Proceedings of the 14th Darelene Stewart, Roland Kuhn, Eric Joanis, and Conference of the European Chapter of the Associ- George Foster. 2014. Coarse split and lump bilin- ation for Computational Linguistics, pages 20–29, gual languagemodels for richer source information Gothenburg, Sweden, April. in smt. In AMTA, Vancouver, BC, Canada, October. P. Koehn, F. J. Och, and D. Marcu. 2003. Statisti- Martin Sundermeyer, Tamer Alkhouli, Wuebker Wue- cal Phrase-Based Translation. In Proceedings of the bker, and Hermann Ney. 2014. Translation Model- 2003 Meeting of the North American chapter of the ing with Bidirectional Recurrent Neural Networks. Association for Computational Linguistics (NAACL- In Conference on Empirical Methods on Natural 03), pages 127–133, Edmonton, Alberta. Language Processing, pages 14–25, Doha, Qatar, October. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Ilya Sutskever, Oriol Vinyals, and Quoc V. V Le. Brooke Cowan, Wade Shen, Christine Moran, 2014. Sequence to sequence learning with neural Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra networks. In Advances in Neural Information Pro- Constantine, and Evan Herbst. 2007. Moses: Open cessing Systems 27, pages 3104–3112. Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June. Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Pro- Hai Son Le, Alexandre Allauzen, and Franc¸ois Yvon. ceedings of HLT-NAACL 2004: Short Papers, HLT- 2012. Continuous Space Translation Models with NAACL-Short ’04, pages 101–104, Stroudsburg, Neural Networks. In Conference of the North Amer- PA, USA. ican Chapter of the Association for Computational Linguistics: Human Language Technologies, pages Joern Wuebker, Stephan Peitz, Felix Rietig, and Her- 39–48, Montreal, Canada, June. mann Ney. 2013. 
Improving statistical machine translation with word class models. In Conference Jose´ B Marino,˜ Rafael E Banchs, Josep M Crego, Adria` on Empirical Methods in Natural Language Pro- de Gispert, Patrik Lambert, Jose´ A R Fonollosa, and cessing, pages 1377–1381, Seattle, USA, October. Marta R Costa-jussa.` 2006. N-gram-based Machine Translation. Comput. Linguist., 32(4):527–549, De- Richard Zens, Franz Josef Och, and Hermann Ney. cember. 2002. Phrase-Based Statistical Machine Transla- tion. In 25th German Conf. on Artificial Intelligence R.C. Moore and W. Lewis. 2010. Intelligent Selection (KI2002), pages 18–32, Aachen, Germany, Septem- of Language Model Training Data. In ACL (Short ber. Papers), pages 220–224, Uppsala, Sweden, July. Hui Zhang, Kristina Toutanova, Chris Quirk, and Jian- Jan Niehues, Teresa Herrmann, Stephan Vogel, and feng Gao. 2013. Beyond left-to-right: Multiple de- Alex Waibel, 2011. Proceedings of the Sixth Work- composition structures for smt. In Proceedings of shop on Statistical Machine Translation, chapter the 2013 Conference of the North American Chap- Wider Context by Using Bilingual Language Mod- ter of the Association for Computational Linguis- els in Machine Translation, pages 198–206. tics: Human Language Technologies, pages 12–21, Atlanta, Georgia, June. Franz J. Och and Hermann Ney. 2003. A System- atic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.


B Alignment-Based Neural Machine Translation

Alignment-Based Neural Machine Translation

Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta and Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract

Neural machine translation (NMT) has emerged recently as a promising statistical machine translation approach. In NMT, neural networks (NN) are directly used to produce translations, without relying on a pre-existing translation framework. In this work, we take a step towards bridging the gap between conventional word alignment models and NMT. We follow the hidden Markov model (HMM) approach that separates the alignment and lexical models. We propose a neural alignment model and combine it with a lexical neural model in a log-linear framework. The models are used in a standalone word-based decoder that explicitly hypothesizes alignments during search. We demonstrate that our system outperforms attention-based NMT on two tasks: IWSLT 2013 German→English and BOLT Chinese→English. We also show promising results for re-aligning the training data using neural models.

1 Introduction

Neural networks have been gaining a lot of attention recently in areas like speech recognition, image recognition and natural language processing. In machine translation, NNs are applied in two main ways: In N-best rescoring, the neural model is used to score the first-pass decoding output, limiting the model to a fixed set of hypotheses (Le et al., 2012; Sundermeyer et al., 2014a; Hu et al., 2014; Guta et al., 2015). The second approach integrates the NN into decoding, potentially allowing it to directly determine the search space.

There are two approaches to use neural models in decoding. The first integrates the models into phrase-based decoding, where the models are used to score phrasal candidates hypothesized by the decoder (Vaswani et al., 2013; Devlin et al., 2014; Alkhouli et al., 2015). The second approach is referred to as neural machine translation, where neural models are used to hypothesize translations, word by word, without relying on a pre-existing framework. In comparison to the former approach, NMT does not restrict NNs to predetermined translation candidates, and it does not depend on word alignment concepts that have been part of building state-of-the-art phrase-based systems. In such systems, the HMM and the IBM models developed more than two decades ago are used to produce Viterbi word alignments, which are used to build standard phrase-based systems. Existing NMT systems either disregard the notion of word alignments entirely (Sutskever et al., 2014), or rely on a probabilistic notion of alignments (Bahdanau et al., 2015) independent of the conventional alignment models.

Most recently, Cohn et al. (2016) designed neural models that incorporate concepts like fertility and Markov conditioning into their structure. In this work, we also focus on the question whether conventional word alignment concepts can be used for NMT. In particular, (1) we follow the HMM approach to separate the alignment and translation models, and use neural networks to model alignments and translation. (2) We introduce a lexicalized alignment model to capture source reordering information. (3) We bootstrap the NN training using Viterbi word alignments obtained from the HMM and IBM model training, and use the trained neural models to generate new alignments. The new alignments are then used to re-train the neural networks. (4) We design an alignment-based decoder that hypothesizes the alignment path along with the associated translation. We show competitive results in comparison to attention-based models on the IWSLT 2013 German→English and BOLT Chinese→English tasks.


1.1 Motivation

Attention-based NMT computes the translation probability depending on an intermediate computation of an alignment distribution. The alignment distribution is used to choose the positions in the source sentence that the decoder attends to during translation. Therefore, the alignment model can be considered as an implicit part of the translation model. On the other hand, separating the alignment model from the lexical model has its own advantages: First, this leads to more flexibility in modeling and training: not only can the models be trained separately, but they can also have different model types, e.g. neural models, count-based models, etc. Second, the separation avoids propagating errors from one model to the other. In attention-based systems, the translation score is based on the alignment distribution, which risks propagating errors from the alignment part to the translation part. Third, using separate models makes it possible to assign them different weights. We exploit this and use a log-linear framework to combine them. We still retain the possibility of joint training, which can be performed flexibly by alternating between model training and alignment generation. The latter can be performed using forced-decoding.

In contrast to the count-based models used in HMMs, we use neural models, which allow covering long context without having to explicitly address the smoothing problem that arises in count-based models.

2 Related Work

Most recently, NNs have been trained on large amounts of data, and applied to translate independent of the phrase-based framework. Sutskever et al. (2014) introduced the pure encoder-decoder approach, which avoids the concept of word alignments. Bahdanau et al. (2015) introduced an attention mechanism to the encoder-decoder approach, allowing the decoder to attend to certain source words. This method was refined in (Luong et al., 2015) to allow for local attention, which makes the decoder attend to representations of source words residing within a window. These translation models have shown competitive results, outperforming phrase-based systems when using ensembles on tasks like IWSLT English→German 2015 (Luong and Manning, 2015).

In this work, we follow the same standalone neural translation approach. However, we have a different treatment of alignments. While the attention-based soft-alignment model computes an alignment distribution as an intermediate step within the neural model, we follow the hard alignment concept used in phrase extraction. We separate the alignment model from the lexical model, and train them independently. At translation time, the decoder hypothesizes and scores the alignment path in addition to the translation.

Cohn et al. (2016) introduce several modifications to the attention-based model inspired by traditional word alignment concepts. They modify the network architecture, adding a first-order dependence by making the attention vector computed for a target position directly dependent on that of the previous position. Our alignment model has a first-order dependence that takes place at the input and output of the model, rather than an architectural modification of the neural network.

Yang et al. (2013) use NN-based lexical and alignment models, but they give up the probabilistic interpretation and produce unnormalized scores instead. Furthermore, they model alignments using a simple distortion model that has no dependence on lexical context. The models are used to produce new alignments which are in turn used to train phrase systems. This leads to no significant difference in terms of translation performance. Tamura et al. (2014) propose a lexicalized RNN alignment model. The model still produces non-probabilistic scores, and is used to generate word alignments used to train phrase-based systems. In this work, we develop a feed-forward neural alignment model that computes probabilistic scores, and use it directly in standalone decoding, without constraining it to the phrase-based framework. In addition, we use the neural models to produce alignments that are used to re-train the same neural models.

Schwenk (2012) proposed a feed-forward network that computes phrase scores offline, and the scores were added to the phrase table of a phrase-based system. Offline phrase scoring was also done in (Alkhouli et al., 2014) using semantic phrase features obtained using simple neural networks. In comparison, our work does not rely on the phrase-based system; rather, the neural networks are used to hypothesize translation candidates directly, and the scores are computed online during decoding.


We use the feed-forward joint model introduced in (Devlin et al., 2014) as a lexical model, and introduce a lexicalized alignment model based on it. In addition, we modify the bidirectional joint model presented in (Sundermeyer et al., 2014a) and compare it to the feed-forward variant. These lexical models were applied in phrase-based systems. In this work, we apply them in a standalone NMT framework.

Forced alignment was applied to train phrase tables in (Wuebker et al., 2010; Peitz et al., 2012). We generate forced alignments using a neural decoder, and use them to re-train neural models.

Tackling the costly normalization of the output layer during decoding has been the focus of several papers (Vaswani et al., 2013; Devlin et al., 2014; Jean et al., 2015). We propose a simple method to speed up decoding using a class-factored output layer with almost no loss in translation quality.

3 Statistical Machine Translation

In statistical machine translation, the target word sequence e_1^I = e_1, ..., e_I of length I is assigned a probability conditioned on the source word sequence f_1^J = f_1, ..., f_J of length J. By introducing word alignments as hidden variables, the posterior probability p(e_1^I | f_1^J) can be computed using a lexical and an alignment model as follows.

  p(e_1^I \mid f_1^J) = \sum_{b_1^I} p(e_1^I, b_1^I \mid f_1^J)
                      = \sum_{b_1^I} \prod_{i=1}^{I} p(e_i, b_i \mid b_1^{i-1}, e_1^{i-1}, f_1^J)
                      = \sum_{b_1^I} \prod_{i=1}^{I} \underbrace{p(e_i \mid b_1^i, e_1^{i-1}, f_1^J)}_{\text{lexical model}} \cdot \underbrace{p(b_i \mid b_1^{i-1}, e_1^{i-1}, f_1^J)}_{\text{alignment model}}

where b_1^I = b_1, ..., b_I denotes the alignment path, such that b_i aligns the target word e_i to the source word f_{b_i}. In this general formulation, the lexical model predicts the target word e_i conditioned on the source sentence, the target history, and the alignment history. The alignment model is lexicalized using the source and target context as well. The sum over alignment paths is replaced by the maximum during decoding (cf. Section 5).

4 Neural Network Models

There are two common network architectures used in machine translation: feed-forward NNs (FFNN) and recurrent NNs (RNN). In this section we will discuss alignment-based feed-forward and recurrent neural networks. These networks are conditioned on the word alignment, in addition to the source and target words.

4.1 Feed-forward Joint Model

We adopt the feed-forward joint model (FFJM) proposed in (Devlin et al., 2014) as the lexical model. The authors demonstrate the model has a strong performance when applied in a phrase-based framework. In this work we explore its performance in standalone NMT. The model was introduced along with heuristics to resolve unaligned and multiply aligned words. We denote the heuristic-based source alignment point corresponding to the target position i by b̂_i. The model is defined as

  p(e_i \mid b_1^i, e_1^{i-1}, f_1^J) = p(e_i \mid e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m})    (1)

and it computes the probability of a target word e_i at position i given the n-gram target history e_{i-n}^{i-1} = e_{i-n}, ..., e_{i-1}, and a window of 2m+1 source words f_{b̂_i-m}^{b̂_i+m} = f_{b̂_i-m}, ..., f_{b̂_i+m} centered around the word f_{b̂_i}.

As the heuristics have implications on our alignment-based decoder, we explain them by the examples shown in Figure 1. We mark the source and target context by rectangles on the x- and y-axis, respectively. The left figure shows a single source word 'Jungen' aligned to a single target word 'offspring', in which case the original source position is used, i.e., b̂_i = b_i. If the target word is aligned to multiple source words, as is the case with the words 'Mutter Tiere' and 'Mothers' in the middle figure, then b̂_i is set to the middle alignment point. In this example, the left alignment point associated with 'Mutter' is selected. The right figure shows the case of the unaligned target word 'of'. b̂_i is set to the source position associated with the closest aligned target word 'full', preferring right to left. Note that this model does not have special handling of unaligned source words. While these words can be covered indirectly by source windows associated with aligned source words, the model does not explicitly score them.
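These affiliation heuristics are easy to state in code. The following is an illustrative re-implementation (not the authors' code): every target position receives exactly one source position b̂_i, using the middle link for multiply aligned words and, for unaligned words, the affiliation of the nearest aligned target word, preferring the right neighbour.

# Illustrative re-implementation of the affiliation heuristics of Section 4.1:
# every target position i receives one source position b_hat[i], even if it is
# unaligned or aligned to several source words.

def affiliations(alignment, target_len):
    """alignment: iterable of (src_pos, tgt_pos) links; returns list b_hat."""
    links = {}
    for src, tgt in alignment:
        links.setdefault(tgt, []).append(src)

    b_hat = [None] * target_len
    for i in range(target_len):
        if i in links:
            srcs = sorted(links[i])
            b_hat[i] = srcs[(len(srcs) - 1) // 2]   # middle link, left-biased
    for i in range(target_len):
        if b_hat[i] is None:                        # unaligned target word
            for offset in range(1, target_len):
                right, left = i + offset, i - offset
                if right < target_len and b_hat[right] is not None:
                    b_hat[i] = b_hat[right]         # prefer the right neighbour
                    break
                if left >= 0 and b_hat[left] is not None:
                    b_hat[i] = b_hat[left]
                    break
    return b_hat

# toy example: target word 1 is unaligned and inherits the affiliation of word 2
print(affiliations([(0, 0), (1, 2)], 3))            # -> [0, 1, 1]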


Figure 1: Examples on resolving word alignments to obtain word affiliations.
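Equation (1) corresponds to a small feed-forward scorer over these affiliations. The sketch below is a minimal NumPy illustration with made-up layer sizes (the actual models reported later use 1000/500-unit hidden layers, 300-dimensional embeddings and a class-factored output layer); it is not the authors' implementation.

import numpy as np

# Minimal FFJM-style scorer for Eq. (1): p(e_i | e_{i-n}^{i-1}, f_{b_i-m}^{b_i+m}).
# All dimensions are toy values chosen for the sketch.

rng = np.random.default_rng(0)
V_src, V_tgt, emb, h1, h2 = 100, 120, 16, 32, 24   # toy vocabulary / layer sizes
n, m = 5, 4                                        # history length, window radius

E_src = rng.normal(size=(V_src, emb))
E_tgt = rng.normal(size=(V_tgt, emb))
W1 = rng.normal(size=((n + 2 * m + 1) * emb, h1)); b1 = np.zeros(h1)
W2 = rng.normal(size=(h1, h2));                    b2 = np.zeros(h2)
Wo = rng.normal(size=(h2, V_tgt));                 bo = np.zeros(V_tgt)

def ffjm_distribution(src_window, tgt_history):
    """src_window: 2m+1 source word ids centred on b_hat_i;
       tgt_history: n previous target word ids."""
    x = np.concatenate([E_tgt[tgt_history].ravel(), E_src[src_window].ravel()])
    a1 = np.tanh(x @ W1 + b1)
    a2 = np.tanh(a1 @ W2 + b2)
    logits = a2 @ Wo + bo
    p = np.exp(logits - logits.max())
    return p / p.sum()                             # distribution over target words

p = ffjm_distribution(src_window=list(range(9)), tgt_history=[1, 2, 3, 4, 5])
print(p.shape, p.sum())                            # (120,) ~1.0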

Computing normalized probabilities is done using the softmax function, which requires computing the full output layer first, and then computing the normalization factor by summing over the output scores of the full vocabulary. This is very costly for large vocabularies. To overcome this, we adopt the class-factored output layer consisting of a class layer and a word layer (Goodman, 2001; Morin and Bengio, 2005). The model in this case is defined as

  p(e_i \mid e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m}) = p(e_i \mid c(e_i), e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m}) \cdot p(c(e_i) \mid e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m})

where c denotes a word mapping that assigns each target word to a single class, where the number of classes is chosen to be much smaller than the vocabulary size, |C| << |V|. Even though the full class layer needs to be computed, only a subset of the significantly larger word layer has to be considered, namely the words that share the same class c(e_i) with the target word e_i. This helps speeding up training on large-vocabulary tasks.

4.2 Bidirectional Joint Model

The bidirectional RNN joint model (BJM) presented in (Sundermeyer et al., 2014a) is another lexical model. The BJM uses the full source sentence and the full target history for prediction, and it is computed by reordering the source sentence following the target order. This requires the complete alignment information to compute the model scores. Here, we introduce a variant of the model that is conditioned on the alignment history instead of the full alignment path. This is achieved by computing forward and backward representations of the source sentence in its original order, as done in (Bahdanau et al., 2015). The model is given by

  p(e_i \mid b_1^i, e_1^{i-1}, f_1^J) = p(e_i \mid \hat{b}_1^i, e_1^{i-1}, f_1^J)

Note that we also use the same alignment heuristics presented in Section 4.1. As this variant does not require future alignment information, it can be applied in decoding. However, in this work we apply this model in rescoring and leave decoder integration to future work.

4.3 Feed-forward Alignment Model

We propose a neural alignment model to score alignment paths. Instead of predicting the absolute positions in the source sentence, we model the jumps from one source position to the next position to be translated. The jump at target position i is defined as Δ_i = b̂_i − b̂_{i−1}, which captures the jump from the source position b̂_{i−1} to b̂_i. We modify the FFNN lexical model to obtain a feed-forward alignment model. The feed-forward alignment model (FFAM) is given by

  p(b_i \mid b_1^{i-1}, e_1^{i-1}, f_1^J) = p(\Delta_i \mid e_{i-n}^{i-1}, f_{\hat{b}_{i-1} - m}^{\hat{b}_{i-1} + m})    (2)

This is a lexicalized alignment model conditioned on the n-gram target history and the (2m+1)-gram source window. Note that, different from the FFJM, the source window of this model is centered around the source position b̂_{i−1}. This is because the model needs to predict the jump to the next source position b̂_i to be translated. The alignment model architecture is shown in Figure 2.

In contrast to the lexical model, the output vocabulary of the alignment model is much smaller, and therefore we use a regular softmax output layer for this model without class-factorization.

4.4 Feed-forward vs. Recurrent Models

RNNs have been shown to outperform feed-forward variants in language and translation modeling. Nevertheless, feed-forward networks have their own advantages: First, they are typically faster to train due to their simple architecture, and second, they are more flexible to integrate into beam search decoders. This is because feed-forward networks only depend on a limited context. RNNs, on the other hand, are conditioned on an unbounded context. This means that the complete hypotheses during decoding have to be maintained without any state recombination. Since feed-forward networks allow the use of state recombination, they are potentially capable of exploring more candidates during beam search.
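The FFAM of Eq. (2) can be illustrated in the same way: jumps Δ_i = b̂_i − b̂_{i−1} are read off the resolved affiliations and predicted from the target history and a source window centred on the previous position b̂_{i−1}. The sketch below is only an illustration with toy dimensions; the clipping to a maximum jump length of 100 (201 output classes) follows the setup reported in Section 7, and the assumed start position of −1 is our own choice.

import numpy as np

# Illustrative FFAM of Eq. (2): p(Delta_i | e_{i-n}^{i-1}, f_{b_{i-1}-m}^{b_{i-1}+m}),
# with Delta_i = b_hat[i] - b_hat[i-1] clipped to a maximum jump length.

MAX_JUMP = 100                               # 201 output classes: jumps -100..100

def jump_labels(b_hat, first_prev=-1):
    """Turn affiliations into clipped jump class indices (0 .. 2*MAX_JUMP)."""
    labels, prev = [], first_prev            # first_prev = -1 is an assumption
    for b in b_hat:
        delta = max(-MAX_JUMP, min(MAX_JUMP, b - prev))
        labels.append(delta + MAX_JUMP)      # shift into a non-negative index
        prev = b
    return labels

# Toy scorer: shared embedding table for brevity, one hidden layer, and a
# regular softmax over the jump classes (no class factorization needed).
rng = np.random.default_rng(1)
emb, hid, n, m, V = 16, 32, 5, 4, 100
E = rng.normal(size=(V, emb))
W1 = rng.normal(size=((n + 2 * m + 1) * emb, hid))
Wo = rng.normal(size=(hid, 2 * MAX_JUMP + 1))

def ffam_distribution(src_window_prev, tgt_history):
    """src_window_prev: 2m+1 word ids centred on b_hat_{i-1}."""
    x = np.concatenate([E[tgt_history].ravel(), E[src_window_prev].ravel()])
    logits = np.tanh(x @ W1) @ Wo
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(jump_labels([0, 1, 7, 2, 4]))          # -> [101, 101, 106, 95, 102]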


Figure 2: A feed-forward alignment NN, with 3 target history words, a 5-gram source window, a projection layer, 2 hidden layers, and a small output layer to predict jumps. The inputs are e_{i−3}^{i−1} and f_{b̂_{i−1}−2}^{b̂_{i−1}+2}; the output is p(Δ_i | e_{i−3}^{i−1}, f_{b̂_{i−1}−2}^{b̂_{i−1}+2}).

5 Alignment-based Decoder

In this section we present the alignment-based decoder. This is a beam-search word-based decoder that predicts one target word at a time. As the models we use are alignment-based, the decoder hypothesizes the alignment path. This is different from the NMT approaches present in the literature, which are based on models that either ignore word alignments or compute alignments as part of the attention-based model.

In the general case, a word can be aligned to a single word, multiple words, or it can be unaligned. However, we do not use the general word alignment notion; rather, the models are based on alignments derived using the heuristics discussed in Section 4. These heuristics simplify the task of the decoder, as they induce equivalence classes over the alignment paths, reducing the number of possible alignments the decoder has to hypothesize significantly. As a result of using these heuristics, the task of hypothesizing alignments is reduced to enumerating all J source positions a target word can be aligned to. The following is a list of the possible alignment scenarios and how the decoder covers them.

• Multiply-aligned target words: the heuristic chooses the middle link as an alignment point. Therefore, the decoder is able to cover these cases by hypothesizing J many source positions for each target word hypothesis.

• Unaligned target words: the heuristic aligns these words using the nearest aligned target word in training (cf. Figure 1, right). In decoding, these words are handled as aligned words.

• Multiply-aligned source words: covered by revisiting a source position that has already been translated.

• Unaligned source words: result if no target word is generated using a source window centered around the source word in question.

Algorithm 1 Alignment-based Decoder
 1: procedure TRANSLATE(f_1^J, beamSize)
 2:   hyps ← initHyp                  ▷ previous set of partial hypotheses
 3:   newHyps ← ∅                     ▷ current set of partial hypotheses
 4:   while GETBEST(hyps) not terminated do
 5:     ▷ compute alignment distribution in batch mode
 6:     alignDists ← ALIGNMENTDISTRIBUTION(hyps)
 7:     ▷ hypothesize source alignment points
 8:     for pos from 1 to J do
 9:       ▷ compute lexical distributions of all
10:       ▷ hypotheses in hyps in batch mode
11:       dists ← LEXICALDISTRIBUTION(hyps, pos)
12:       ▷ expand each of the previous hypotheses
13:       for hyp in hyps do
14:         jmpCost ← SCORE(alignDists, hyp, pos)
15:         dist ← GETDISTRIBUTION(dists, hyp)
16:         dist ← PARTIALSORT(dist, beamSize)
17:         cnt ← 0
18:         ▷ hypothesize new target word
19:         for word in dist do
20:           if cnt > beamSize then
21:             break
22:           newHyp ← EXTEND(hyp, word, pos, jmpCost)
23:           newHyps.INSERT(newHyp)
24:           cnt ← cnt + 1
25:     PRUNE(newHyps, beamSize)
26:     hyps ← newHyps
27:
28:   ▷ return the best scoring hypothesis
29:   return GETBEST(hyps)

The decoder is shown in Algorithm 1. It involves hypothesizing alignments and translation words. Alignments are hypothesized in the loop starting at line 8. Once an alignment point is set to position pos, the lexical distribution over the full target vocabulary is computed using this position in line 11. The distribution is sorted and the best candidate translations lying within the beam are used to expand the partial hypotheses.
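The expansion step of Algorithm 1 (lines 13 to 24) is compact enough to paraphrase in Python. The sketch below uses simplified data structures of our own; it shows how a partial hypothesis is extended with a hypothesized source position and target word, how log-linear weights λ combine the alignment score, the lexical score and the word penalty (cf. the model combination paragraph below), and how a partial sort (here heapq.nlargest) keeps only the beamSize best candidates.

import heapq
from collections import namedtuple

# Simplified expansion of one (hyp, pos) pair; scores are combined log-linearly.

Hyp = namedtuple("Hyp", "words positions score")

def expand(hyp, pos, lex_dist, jump_logprob, weights, beam_size):
    """lex_dist: dict word -> log p(word | history, window at pos);
       jump_logprob: log p(jump to pos | previous position, history)."""
    lam_lex, lam_align, lam_wp = weights
    # partial sort: keep only the beam_size best words (linear on average)
    best_words = heapq.nlargest(beam_size, lex_dist.items(), key=lambda kv: kv[1])
    new_hyps = []
    for word, lex_logprob in best_words:
        score = (hyp.score
                 + lam_lex * lex_logprob
                 + lam_align * jump_logprob
                 + lam_wp * 1.0)              # word penalty: one more target word
        new_hyps.append(Hyp(hyp.words + [word], hyp.positions + [pos], score))
    return new_hyps

def prune(hyps, beam_size):
    return heapq.nlargest(beam_size, hyps, key=lambda h: h.score)

# usage: expand an empty hypothesis at source position 3 with toy scores
start = Hyp([], [], 0.0)
cands = expand(start, 3, {"we": -0.2, "you": -1.1, "they": -2.3},
               jump_logprob=-0.5, weights=(1.0, 0.7, -0.1), beam_size=2)
print(prune(cands, 2))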


We batch the NN computations, calling the alignment and lexical networks for all partial hypotheses in a single call to speed up computations, as shown in lines 6 and 11. We also exploit the beam and apply partial sorting in line 16, instead of completely sorting the list. Partial sorting has a linear complexity on average, and it returns a list whose first beamSize words have better scores compared to the rest of the list.

We terminate translation if the best scoring partial hypothesis ends with the sentence end symbol. If a hypothesis terminates but it scores worse than other hypotheses, it is removed from the beam, but it still competes with non-terminated hypotheses. Note that we do not have any explicit coverage constraints. This means that a source position can be revisited many times, hence generating one-to-many alignment cases. This also allows having unaligned source words.

In the alignment-based decoder, an alignment distribution is computed, and word alignments are hypothesized and scored using this distribution, leading alignment decisions to become part of beam search. The search space is composed of both alignment and translation decisions. In contrast, the search space in attention-based decoding is composed of translation decisions only.

Class-Factored Output Layer in Decoding

The large output layer used in language and translation modeling is a major bottleneck in evaluating the network. Several papers discuss how to evaluate it efficiently during decoding using approximations. In this work, we exploit the class-factored output layer to speed up training. At decoding time, the network needs to hypothesize all target words, which means the full output layer should be evaluated. In the case of using a class-factored output layer, this results in an additional computational overhead from computing the class layer. In order to speed up decoding, we propose to use the class layer to choose the top scoring k classes, then we evaluate the word layer for each of these classes only. We show this leads to a significant speed-up with minimal loss in translation quality.

Model Combination

We embed the models in a log-linear framework, which is commonly used in phrase-based systems. The goal of the decoder is to find the best scoring hypothesis as follows.

  \hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \Big\{ \max_{b_1^I} \sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, b_1^I) \Big\}

where λ_m is the model weight associated with the model h_m, and M is the total number of models. The model weights are automatically tuned using minimum error rate training (MERT) (Och, 2003). Our main system includes a lexical neural model, an alignment neural model, and a word penalty, which is the count of target words. The word penalty becomes important at the end of translation, where hypotheses in the beam might have different final lengths.

6 Forced-Alignment Training

Since the models we use require alignments for training, we initially use word alignments produced using HMM/IBM models using GIZA++ as initial alignments. At first, the FFJM and the FFAM are trained separately until convergence, then the models are used to generate new word alignments by force-decoding the training data as follows.

  \tilde{b}_1^I(f_1^J, e_1^I) = \arg\max_{b_1^I} \prod_{i=1}^{I} p^{\lambda_1}(\Delta_i \mid e_{i-n}^{i-1}, f_{b_{i-1}-m}^{b_{i-1}+m}) \cdot p^{\lambda_2}(e_i \mid e_{i-n}^{i-1}, f_{b_i-m}^{b_i+m})

where λ_1 and λ_2 are the model weights. We modify the decoder to only compute the probabilities of the target words in the reference sentence. The for loop in line 19 of Algorithm 1 collapses to a single iteration. We use both the feed-forward joint model (FFJM) and the feed-forward alignment model (FFAM) to perform force-decoding, and the new alignments are used to retrain the models, replacing the initial GIZA++ alignments.

Retraining the neural models using the forced alignments has two benefits. First, since the alignments are produced using both of the lexical and alignment models, this can be viewed as joint training of the two models. Second, since the neural decoder generates these alignments, training neural models based on them yields models that are more consistent with the neural decoder. We verify this claim in the experiments section.
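Force-decoding changes only the expansion step sketched earlier: the target word is fixed to the reference, so the word loop of Algorithm 1 (line 19) collapses and only the source positions (jumps) are searched. A minimal, self-contained sketch of one such constrained step follows; lex_logprob_fn and jump_logprob_fn are hypothetical stand-ins for the FFJM and FFAM scores.

import heapq
from collections import namedtuple

# Forced decoding (Section 6): the target side is fixed to the reference, so
# for each target position we only search over source positions, combining the
# jump score (weight lambda1) and the lexical score of the reference word
# (weight lambda2). The best surviving path gives the new word alignment.

Hyp = namedtuple("Hyp", "words positions score")

def force_decode_step(hyps, ref_word, J, lex_logprob_fn, jump_logprob_fn,
                      lambda1, lambda2, beam_size):
    new_hyps = []
    for hyp in hyps:
        prev = hyp.positions[-1] if hyp.positions else -1
        for pos in range(J):                     # hypothesize alignment points
            score = (hyp.score
                     + lambda1 * jump_logprob_fn(prev, pos, hyp.words)
                     + lambda2 * lex_logprob_fn(pos, hyp.words, ref_word))
            new_hyps.append(Hyp(hyp.words + [ref_word],
                                hyp.positions + [pos], score))
    return heapq.nlargest(beam_size, new_hyps, key=lambda h: h.score)

# Running this step over every reference word and backtracking the best final
# hypothesis yields the forced alignment used to re-train the FFJM and FFAM.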


                    IWSLT             BOLT
                    De       En       Zh       En
Sentences               4.32M             4.08M
Run. Words          108M     109M     78M      86M
Vocab.              836K     792K     384K     817K
FFNN/BJM Vocab.     173K     149K     169K     128K
Attention Vocab.    30K      30K      30K      30K
FFJM params             177M              159M
BJM params              170M              153M
FFAM params             101M              94M
Attention params        84M               84M

Table 1: Corpora and NN statistics.

7 Experiments

We carry out experiments on two tasks: the IWSLT 2013 German→English shared translation task (http://www.iwslt2013.org), and the BOLT Chinese→English task. The corpora statistics are shown in Table 1. The IWSLT phrase-based baseline system is trained on all available bilingual data, and uses a 4-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998), trained with the SRILM toolkit (Stolcke, 2002). As additional data sources for the LM, we selected parts of the Shuffled News and LDC English Gigaword corpora based on the cross-entropy difference (Moore and Lewis, 2010), resulting in a total of 1.7 billion running words for LM training. The phrase-based baseline is a standard phrase-based SMT system (Koehn et al., 2003) tuned with MERT (Och, 2003) and contains a hierarchical reordering model (Galley and Manning, 2008). The in-domain data consists of 137K sentences.

The BOLT Chinese→English task is evaluated on the "discussion forum" domain. We use a 5-gram LM trained on 2.9 billion running words in total. The in-domain data consists of a subset of 67.8K sentences. We used a set of 1845 sentences as a tune set. The evaluation set test1 contains 1844 and test2 contains 1124 sentences.

We use the FFNN architecture for the lexical and alignment models. Both models use a window of 9 source words, and 5 target history words. Both models use two hidden layers, the first has 1000 units and the second has 500 units. The lexical model uses a class-factored output layer, with 1000 singleton classes dedicated to the most frequent words, and 1000 classes shared among the rest of the words. The classes are trained using a separate tool to optimize the maximum likelihood training criterion with the bigram assumption. The alignment model uses a small output layer of 201 nodes, determined by a maximum jump length of 100 (forward and backward). 300 nodes are used for word embeddings. Each of the FFNN models is trained on CPUs using 12 threads, which takes up to 3 days until convergence. We train with stochastic gradient descent using a batch size of 128. The learning rate is halved when the development perplexity increases.

Each BJM has 4 LSTM layers: two for the forward and backward states, one for the target state, and one after merging the source and target states. The size of the word embeddings and hidden layers is 350 nodes. The output layers are identical to those of the FFJM models.

We compare our system to an attention-based baseline similar to the networks described in (Bahdanau et al., 2015). All such systems use single models, rather than ensembles. The word embedding dimension is 620, and each direction of the encoder and the decoder has a layer of 1000 gated recurrent units (Cho et al., 2014). Unknowns and numbers are carried out from the source side to the target side based on the largest attention weight.

To speed up decoding of long sentences, the decoder hypothesizes 21 and 41 source positions around the diagonal, for the IWSLT and the BOLT tasks, respectively. We choose these numbers such that the translation quality does not degrade. The beam size is set to 16 in all experiments. Larger beam sizes did not lead to improvements.

We apply part-of-speech-based long-range verb reordering rules to the German side in a preprocessing step for all German→English systems (Popović and Ney, 2006), including the baselines. The Chinese→English systems use no such preordering. We use the GIZA++ word alignments to train the models. The networks are fine-tuned by training additional epochs on the in-domain data only (Luong and Manning, 2015). The LMs are only used in the phrase-based systems in both tasks, but not in the NMT systems.

All translation experiments are performed with the Jane toolkit (Vilar et al., 2010; Wuebker et al., 2012). The alignment-based NNs are trained using an extension of the rwthlm toolkit (Sundermeyer et al., 2014b). We use an implementation based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012) for the attention-based experiments. All results are measured in case-insensitive BLEU [%] (Papineni et al., 2002) and TER [%] (Snover et al., 2006) on a single reference. We used the multeval toolkit (Clark et al., 2011) for evaluation.
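The training schedule described above (plain SGD with a fixed batch size, halving the learning rate whenever the development perplexity increases) can be written down in a few lines. The loop below is an illustrative sketch, not the rwthlm extension used for the experiments; train_epoch and dev_perplexity are hypothetical helpers.

# Illustrative SGD schedule: halve the learning rate whenever the development
# perplexity increases. train_epoch / dev_perplexity are hypothetical helpers.

def train(model, data, dev, train_epoch, dev_perplexity,
          lr=0.1, batch_size=128, max_epochs=30):
    best_ppl = float("inf")
    for epoch in range(max_epochs):
        train_epoch(model, data, lr=lr, batch_size=batch_size)
        ppl = dev_perplexity(model, dev)
        if ppl > best_ppl:
            lr *= 0.5                 # development perplexity went up: halve lr
        else:
            best_ppl = ppl
        print(f"epoch {epoch}: dev ppl {ppl:.2f}, lr {lr}")
    return model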


                                     test 2010        eval 2011
#  system                           BLEU    TER      BLEU    TER
1  phrase-based system              28.9    51.0     32.9    46.3
2    + monolingual data             30.4    49.5     35.4    44.2
3  attention-based RNN              27.9    51.4     31.8    46.5
4    + fine-tuning                  29.8    48.9     32.9    45.1
5  FFJM+dp+wp                       21.6    56.9     24.7    53.8
6  FFJM+FFAM+wp                     26.1    53.1     29.9    49.4
7    + fine-tuning                  29.3    50.5     33.2    46.5
8      + BJM rescoring              30.0    48.7     33.8    44.8
9  BJM+FFAM+wp+fine-tuning          29.8    49.5     33.7    45.8

Table 2: IWSLT 2013 German→English results in BLEU [%] and TER [%].

7.1 IWSLT 2013 German→English

Table 2 shows the IWSLT German→English results. FFJM refers to the feed-forward lexical model. We compare against the phrase-based system with an LM trained on the target side of the bilingual data (row #1), the phrase-based system with an LM trained on additional monolingual data (row #2), the attention-based system (row #3), and the attention-based system after fine-tuning towards the in-domain data (row #4). First, we experiment with a system using the FFJM as a lexical model and a linear distortion penalty (dp) to encourage monotone translation as the alignment model. We also include a word penalty (wp). This system is shown in row #5. In comparison, if the distortion penalty is replaced by the feed-forward alignment model (FFAM), we observe large improvements of 4.5% to 5.2% BLEU (row #5 vs. #6). This highlights the significant role of the alignment model in our system. Moreover, it indicates that the FFAM is able to model alignments beyond the simple monotone alignments preferred by the distortion penalty.

Fine-tuning the neural networks towards in-domain data improves the system by up to 3.3% BLEU and 2.9% TER (row #6 vs #7). The gain from fine-tuning is larger than the one observed for the attention-based system. This is likely due to the fact that our system has two neural models, and each of them is fine-tuned.

We apply the BJM in 1000-best list rescoring (row #8), which gives another boost, leading our system to outperform the attention-based system by 0.9% BLEU on eval 2011, while a comparable performance is achieved on test 2010.

In order to highlight the difference between using the FFJM and the BJM, we replace the FFJM scores after obtaining the N-best lists with the BJM scores and apply rescoring (row #9). In comparison to row #7, we observe up to 0.5% BLEU and 1.0% TER improvement. This is expected as the BJM captures unbounded source and target context in comparison to the limited context of the FFJM. This calls for a direct integration of the BJM into decoding, which we intend to do in future work. Our best system (row #8) outperforms the phrase-based system (row #1) by up to 1.1% BLEU and 2.3% TER. While the phrase-based system can benefit from training the LM on additional monolingual data (row #1 vs. #2), exploiting monolingual data in NMT systems is still an open research question.

7.2 BOLT Chinese→English

The BOLT Chinese→English experiments are shown in Table 3. Again, we observe large improvements when including the FFAM in comparison to the distortion penalty (row #5 vs #6), and fine-tuning improves the results considerably. Including the BJM in rescoring improves the system by up to 0.4% BLEU. Our best system (row #8) outperforms the attention-based model by up to 0.4% BLEU and 2.8% TER. We observe that the length ratio of our system's output to the reference is 93.3-94.9%, while it is 99.1-102.6% for the attention-based system. In light of the BLEU and TER scores, the attention-based model does not benefit from matching the reference length. Our system (row #8) still lags behind the phrase-based system (row #1). Note, however, that in the WMT 2016 evaluation campaign (http://matrix.statmt.org/), it was demonstrated that NMT can outperform phrase-based systems on several tasks including German→English and English→German. Including monolingual data (Sennrich et al., 2016) in training neural translation models can boost performance, and this can be applied to our system.

7.3 Neural Alignments

Next, we experiment with re-aligning the training data using neural networks as described in Section 6. We use the fine-tuned FFJM and FFAM to realign the in-domain data of the IWSLT German→English task.


These models are initially trained using GIZA++ alignments. We train new models using the re-aligned data and compare the translation quality before and after re-alignment. We use 0.7 and 0.3 as model weights for the FFJM and FFAM, respectively. These values are based on the model weights obtained using MERT. The results are shown in Table 4. Note that the baseline is worse than the one in Table 2 as the models are only trained on the in-domain data. We observe that re-aligning the data improves translation quality by up to 0.3% BLEU and 1.2% TER. The new alignments are generated using the neural decoder, and using them to train the neural networks results in training that is more consistent with decoding. As future work, we intend to re-align the full bilingual data and use it for neural training.

                                     test1            test2
#  system                           BLEU    TER      BLEU    TER
1  phrase-based system              17.6    68.3     16.9    67.4
2    + monolingual data             17.9    67.9     17.0    67.1
3  attention-based RNN              14.8    76.1     13.6    76.9
4    + fine-tuning                  16.1    73.1     15.4    72.3
5  FFJM+dp+wp                       10.1    77.2      9.8    75.8
6  FFJM+FFAM+wp                     14.4    71.9     13.7    71.3
7    + fine-tuning                  15.8    70.3     15.4    69.4
8      + BJM rescoring              16.0    70.3     15.8    69.5
9  BJM+FFAM+wp+fine-tuning          16.0    70.4     15.7    69.7

Table 3: BOLT Chinese→English results in BLEU [%] and TER [%].

                           test 2010        eval 2011
Alignment Source          BLEU    TER      BLEU    TER
GIZA++                    25.6    53.6     29.3    49.7
Neural forced decoding    25.9    52.4     29.5    49.4

Table 4: Re-alignment results in BLEU [%] and TER [%] on the IWSLT 2013 German→English in-domain data. Each system includes FFJM, FFAM and word penalty.

Figure 3: Decoding speed-up and translation quality using top scoring classes in a class-factored output layer. The results are computed for the IWSLT German→English dev dataset. (x-axis: number of classes, log scale from 1 to 1000; left y-axis: BLEU [%]; right y-axis: speed-up factor.)

Figure 4: A translation example produced by our system. The shown German sentence is pre-ordered. (The figure plots the alignment path between the pre-ordered source "und war der Vorschlag , zu bauen viele weitere Kohle Fabriken ." and the output "and the proposal was to build a lot of other coal factories .")

7.4 Class-Factored Output Layer

Figure 3 shows the trade-off between speed and performance when evaluating words belonging to the top classes only. Limiting the evaluation to words belonging to the top class incurs a performance loss of 0.4% BLEU only when compared to the full evaluation of the output layer. However, this corresponds to a large speed-up. The system is about 30 times faster, with a translation speed of 0.4 words/sec. In conclusion, not only does the class layer speed up training, but it can also be used to speed up decoding considerably. We use the top 3 classes throughout our experiments.

8 Analysis

We show an example from the German→English task in Figure 4, along with the alignment path. The reference translation is 'and the proposal has been to build a lot more coal plants .'. Our system handles the local reordering of the word 'was', which is produced in the correct target order. An example on the one-to-many alignments is given by the correct translation of 'viele' to 'a lot of'.

As an example on handling multiply-aligned target words, we observe the translation of 'Nord Westen' to 'northwest' in our output. This is possible because the source window allows the FFNN to translate the word 'Westen' in context of the word 'Nord'.

Table 5 lists some translation examples produced by our system and the attention-based system, where maximum attention weights are used as alignment.


1  source         sie würden verhungern nicht , und wissen Sie was ?
   reference      they wouldn 't starve , and you know what ?
   attention NMT  you wouldn 't interview , and guess what ?
   our system     they wouldn 't starve , and you know what ?

2  source         denn sie sind diejenigen , die sind auch Experten für Geschmack .
   reference      because they 're the ones that are experts in flavor , too .
   attention NMT  because they 're the ones who are also experts .
   our system     because they 're the ones who are also experts in flavor .

3  source         es ist ein Online Spiel , in dem Sie müssen überwinden eine Ölknappheit .
   reference      this is an online game in which you try to survive an oil shortage .
   attention NMT  it 's an online game where you need to get through a UNKOWN .
   our system     it 's an online game in which you have to overcome an astrolabe .

4  source         es liegt daran , dass gehen nicht Möglichkeiten auf diesem Planeten zurück , sie gehen vorwärts .
   reference      it 's because possibilities on this planet , they don 't go back , they go forward .
   attention NMT  it 's because there 's no way back on this planet , they 're going to move forward .
   our system     it 's because opportunities don 't go on this planet , they go forward .

Table 5: Sample translations from the IWSLT German→English test set using the attention-based system (Table 2, row #4) and our system (Table 2, row #7). We highlight the (pre-ordered) source words and their aligned target words. We underline the source words of interest, italicize correct translations, and use bold-face for incorrect translations.

While we use larger vocabularies compared to the attention-based system, we observe incorrect translations of rare words. E.g., the German word 'Ölknappheit' in sentence 3 occurs only 7 times in the training data among 108M words, and therefore it is an unknown word for the attention system. Our system has the word in the source vocabulary but fails to predict the right translation. Another problem occurs in sentence 4, where the German verb 'zurückgehen' is split into 'gehen ... zurück'. Since the feed-forward model uses a source window of size 9, it cannot include both words when it is centered at any of them. Such insufficient context might be resolved when integrating the bidirectional RNN in decoding. Note that the attention-based model also fails to produce the correct translation here.

9 Conclusion

This work takes a step towards bridging the gap between conventional word alignment concepts and NMT. We use an HMM-inspired factorization of the lexical and alignment models, and employ the Viterbi alignments obtained using conventional HMM/IBM models to train neural models. An alignment-based decoder is introduced and a log-linear framework is used to combine the models. We use MERT to tune the model weights. Our system outperforms the attention-based system on the German→English task by up to 0.9% BLEU, and on Chinese→English by up to 2.8% TER. We also demonstrate that re-aligning the training data using the neural decoder yields better translation quality.

As future work, we aim to integrate alignment-based RNNs such as the BJM into the alignment-based decoder. We also plan to develop a bidirectional RNN alignment model to make jump decisions based on unbounded context. In addition, we want to investigate the use of coverage constraints in alignment-based NMT. Furthermore, we consider the re-alignment experiment promising and plan to apply re-alignment on the full bilingual data of each task.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). Tamer Alkhouli was partly funded by the 2016 Google PhD Fellowship for North America, Europe and the Middle East.

References

Tamer Alkhouli, Andreas Guta, and Hermann Ney. 2014. Vector space models for phrase-based machine translation. In EMNLP 2014 Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–10, Doha, Qatar, October.

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation mod-


els. In Proceedings of the EMNLP 2015 Tenth Work- model. In Proceedings of the Conference on Empiri- shop on Statistical Machine Translation, pages 294– cal Methods in Natural Language Processing, pages 303, Lisbon, Portugal, September. 848–856, Honolulu, Hawaii, USA, October.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Joshua Goodman. 2001. Classes for fast maximum en- gio. 2015. Neural machine translation by jointly tropy training. In Proceedings of the IEEE Interna- learning to align and translate. In International Con- tional Conference on Acoustics, Speech, and Signal ference on Learning Representations, San Diego, Processing, volume 1, pages 561–564, Utah, USA, Calefornia, USA, May. May.


C The QT21/HimL Combined Machine Translation System

The QT21/HimL Combined Machine Translation System

Jan-Thorsten Peter1, Tamer Alkhouli1, Hermann Ney1, Matthias Huck2, Fabienne Braune2, Alexander Fraser2, Aleš Tamchyna2,3, Ondřej Bojar3, Barry Haddow4, Rico Sennrich4, Frédéric Blain5, Lucia Specia5, Jan Niehues6, Alex Waibel6, Alexandre Allauzen7, Lauriane Aufrant7,8, Franck Burlot7, Elena Knyazeva7, Thomas Lavergne7, François Yvon7, Stella Frank9, Mārcis Pinnis10

1RWTH Aachen University, Aachen, Germany  2LMU Munich, Munich, Germany  3Charles University in Prague, Prague, Czech Republic  4University of Edinburgh, Edinburgh, UK  5University of Sheffield, Sheffield, UK  6Karlsruhe Institute of Technology, Karlsruhe, Germany  7LIMSI, CNRS, Université Paris Saclay, Orsay, France  8DGA, Paris, France  9ILLC, University of Amsterdam, Amsterdam, The Netherlands  10Tilde, Riga, Latvia

1{peter,alkhouli,ney}@cs.rwth-aachen.de  2{mhuck,braune,fraser}@cis.lmu.de  3{tamchyna,bojar}@ufal.mff.cuni.cz  [email protected]  [email protected]  5{f.blain,l.specia}@sheffield.ac.uk  6{jan.niehues,alex.waibel}@kit.edu  7{allauzen,aufrant,burlot,knyazeva,lavergne,yvon}@limsi.fr  [email protected]  [email protected]

Abstract

This paper describes the joint submission of the QT21 and HimL projects for the English→Romanian translation task of the ACL 2016 First Conference on Machine Translation (WMT 2016). The submission is a system combination which combines twelve different statistical machine translation systems provided by the different groups (RWTH Aachen University, LMU Munich, Charles University in Prague, University of Edinburgh, University of Sheffield, Karlsruhe Institute of Technology, LIMSI, University of Amsterdam, Tilde). The systems are combined using RWTH's system combination approach. The final submission shows an improvement of 1.0 BLEU compared to the best single system on newstest2016.

1 Introduction

Quality Translation 21 (QT21) is a European machine translation research project with the aim of substantially improving statistical and machine learning based translation models for challenging languages and low-resource scenarios.

Health in my Language (HimL) aims to make public health information available in a wider variety of languages, using fully automatic machine translation that combines the statistical paradigm with deep linguistic techniques.

In order to achieve high-quality machine translation from English into Romanian, members of the QT21 and HimL projects have jointly built a combined statistical machine translation system. We participated with the QT21/HimL combined machine translation system in the WMT 2016 shared task for machine translation of news.1 Core components of the QT21/HimL combined system are twelve individual English→Romanian translation engines which have been set up by different QT21 or HimL project partners. The outputs of all these individual engines are combined using the system combination approach as

1 http://www.statmt.org/wmt16/translation-task.html


implemented in Jane, RWTH's open source statistical machine translation toolkit (Freitag et al., 2014a). The Jane system combination is a mature implementation which previously has been successfully employed in other collaborative projects and for different language pairs (Freitag et al., 2013; Freitag et al., 2014b; Freitag et al., 2014c).

In the remainder of the paper, we present the technical details of the QT21/HimL combined machine translation system and the experimental results obtained with it. The paper is structured as follows: We describe the common preprocessing used for most of the individual engines in Section 2. Section 3 covers the characteristics of the different individual engines, followed by a brief overview of our system combination approach (Section 4). We then summarize our empirical results in Section 5, showing that we achieve better translation quality than with any individual engine. Finally, in Section 6, we provide a statistical analysis of certain linguistic phenomena, specifically the prediction precision on morphological attributes. We conclude the paper with Section 7.

2 Preprocessing

The data provided for the task was preprocessed once, by LIMSI, and shared with all the participants, in order to ensure consistency between systems. On the English side, preprocessing consists of tokenizing and truecasing using the Moses toolkit (Koehn et al., 2007).

On the Romanian side, the data is tokenized using LIMSI's tokro (Allauzen et al., 2016), a rule-based tokenizer that mainly normalizes diacritics and splits punctuation and clitics. This data is truecased in the same way as the English side. In addition, the Romanian sentences are also tagged, lemmatized, and chunked using the TTL tagger (Tufiş et al., 2008).

3 Translation Systems

Each group contributed one or more systems. In this section the systems are presented in alphabetic order.

3.1 KIT

The KIT system consists of a phrase-based machine translation system using additional models in rescoring. The phrase-based system is trained on all available parallel training data. The phrase table is adapted to the SETimes2 corpus (Niehues and Waibel, 2012). The system uses a pre-reordering technique (Rottmann and Vogel, 2007) in combination with lexical reordering. It uses two word-based n-gram language models and three additional non-word language models. Two of them are automatic word class-based (Och, 1999) language models, using 100 and 1,000 word classes. In addition, we use a POS-based language model. During decoding, we use a discriminative word lexicon (Niehues and Waibel, 2013) as well.

We rescore the system output using a 300-best list. The weights are optimized on the concatenation of the development data and the SETimes2 dev set using the ListNet algorithm (Niehues et al., 2015). In rescoring, we add the source discriminative word lexica (Herrmann et al., 2015) as well as neural network language and translation models. These models use a factored word representation of the source and the target. On the source side we use the word surface form and two automatic word classes using 100 and 1,000 classes. On the Romanian side, we add the POS information as an additional word factor.

3.2 LIMSI

The LIMSI system uses NCODE (Crego et al., 2011), which implements the bilingual n-gram approach to SMT (Casacuberta and Vidal, 2004; Crego and Mariño, 2006; Mariño et al., 2006) that is closely related to the standard phrase-based approach (Zens et al., 2002). In this framework, translation is divided into two steps. To translate a source sentence into a target sentence, the source sentence is first reordered according to a set of rewriting rules so as to reproduce the target word order. This generates a word lattice containing the most promising source permutations, which is then translated. Since the translation step is monotonic, this approach is able to rely on the n-gram assumption to decompose the joint probability of a sentence pair into a sequence of bilingual units called tuples.

We train three Romanian 4-gram language models, pruning all singletons with KenLM (Heafield, 2011). We use the in-domain monolingual corpus, the Romanian side of the parallel corpora and a subset of the (out-of-domain) Common Crawl corpus as training data. We select in-domain sentences from the latter using the Moore-Lewis (Moore and Lewis, 2010) filtering method,
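The Moore-Lewis selection used here ranks out-of-domain sentences by the cross-entropy difference between an in-domain and an out-of-domain language model. The following is only a minimal sketch of that ranking step, assuming two pre-trained KenLM models; the file names and the two-thirds cut-off are illustrative stand-ins for the actual LIMSI/XenC setup.

```python
import kenlm  # Python bindings of the KenLM toolkit mentioned above

# Hypothetical model paths: in-domain news vs. general Common Crawl text.
in_domain_lm = kenlm.Model("in_domain.arpa")
out_domain_lm = kenlm.Model("out_of_domain.arpa")

def cross_entropy_per_word(lm, sentence):
    # KenLM returns the log10 probability of the full sentence; the constant
    # log base does not affect the ranking, so we keep log10 units.
    num_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    return -lm.score(sentence, bos=True, eos=True) / num_tokens

def moore_lewis_score(sentence):
    # Lower is better: the sentence looks in-domain and unlike the general corpus.
    return cross_entropy_per_word(in_domain_lm, sentence) - \
           cross_entropy_per_word(out_domain_lm, sentence)

with open("commoncrawl.ro", encoding="utf-8") as f:
    scored = sorted((moore_lewis_score(line.strip()), line) for line in f if line.strip())

# Keep the best-scoring two thirds (the paper reports removing about one third).
kept = [line for _, line in scored[: int(len(scored) * 2 / 3)]]
```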


more specifically its implementation in XenC (Rousseau, 2013). As a result, one third of the initial corpus is removed. Finally, we make a linear interpolation of these models, using the SRILM toolkit (Stolcke, 2002).

3.3 LMU-CUNI

The LMU-CUNI contribution is a constrained Moses phrase-based system. It uses a simple factored setting: our phrase table produces not only the target surface form but also its lemma and morphological tag. On the input, we include lemmas, POS tags and information from dependency parses (lemma of the parent node and syntactic relation), all encoded as additional factors.

The main difference from a standard phrase-based setup is the addition of a feature-rich discriminative translation model which is conditioned on both source- and target-side context (Tamchyna et al., 2016). The motivation for using this model is to better condition lexical choices by using the source context and to improve morphological and topical coherence by modeling the (limited left-hand side) target context.

We also take advantage of the target factors by using a 7-gram language model trained on sequences of Romanian morphological tags. Finally, our system also uses a standard lexicalized reordering model.

3.4 LMU

The LMU system integrates a discriminative rule selection model into a hierarchical SMT system, as described in (Tamchyna et al., 2014). The rule selection model is implemented using the high-speed classifier Vowpal Wabbit,2 which is fully integrated in Moses' hierarchical decoder. During decoding, the rule selection model is called at each rule application with syntactic context information as feature templates. The features are the same as used by Braune et al. (2015) in their string-to-tree system, including both lexical and soft source syntax features. The translation model features comprise the standard hierarchical features (Chiang, 2005) with an additional feature for the rule selection model (Braune et al., 2016).

Before training, we reduce the number of translation rules using significance testing (Johnson et al., 2007). To extract the features of the rule selection model, we parse the English part of our training data using the Berkeley parser (Petrov et al., 2006). For model prediction during tuning and decoding, we use parsed versions of the development and test sets. We train the rule selection model using VW and tune the weights of the translation model using batch MIRA (Cherry and Foster, 2012). The 5-gram language model is trained using KenLM (Heafield et al., 2013) on the Romanian part of the Common Crawl corpus concatenated with the Romanian part of the training data.

3.5 RWTH Aachen University: Hierarchical Phrase-based System

The RWTH hierarchical setup uses the open source translation toolkit Jane 2.3 (Vilar et al., 2010). Hierarchical phrase-based translation (HPBT) (Chiang, 2007) induces a weighted synchronous context-free grammar from parallel text. In addition to the contiguous lexical phrases, as used in phrase-based translation (PBT), hierarchical phrases with up to two gaps are also extracted. Our baseline model contains models with phrase translation probabilities and lexical smoothing probabilities in both translation directions, word and phrase penalty, and enhanced low frequency features (Chen et al., 2011). It also contains binary features to distinguish between hierarchical and non-hierarchical phrases, the glue rule, and rules with non-terminals at the boundaries. We use the cube pruning algorithm (Huang and Chiang, 2007) for decoding.

The system uses three backoff language models (LM) that are estimated with the KenLM toolkit (Heafield et al., 2013) and are integrated into the decoder as separate models in the log-linear combination: a full 4-gram LM (trained on all data), a limited 5-gram LM (trained only on in-domain data), and a 7-gram word class language model (wcLM) (Wuebker et al., 2013) trained on all data and with an output vocabulary of 143K words. The system produces 1000-best lists which are reranked using an LSTM-based (Hochreiter and Schmidhuber, 1997; Gers et al., 2000; Gers et al., 2003) language model (Sundermeyer et al., 2012) and an LSTM-based bidirectional joined model (BJM) (Sundermeyer et al., 2014a). The models have a class-factored output layer (Goodman, 2001; Morin and Bengio, 2005) to speed up training and evaluation. The language model uses 3 stacked LSTM layers, with 350 nodes each. The BJM has a projection layer, and computes a

2 http://hunch.net/~vw/ (VW). Implemented by John Langford and many others.
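The class-factored output layer mentioned above factors the output distribution as p(w | h) = p(c(w) | h) · p(w | c(w), h), so that normalization runs over the number of classes plus the members of one class instead of the full vocabulary. Below is a minimal NumPy sketch of the scoring step; the layer sizes, random weights and class assignment are illustrative and not the actual RWTH configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_classes, hidden = 143_000, 1_000, 350
word_class = rng.integers(num_classes, size=vocab_size)      # fixed word-to-class map c(w)
W_class = rng.normal(scale=0.01, size=(num_classes, hidden))  # class softmax weights
W_word = rng.normal(scale=0.01, size=(vocab_size, hidden))    # per-word output weights

def log_prob(word, h):
    """log p(word | h) = log p(c(word) | h) + log p(word | c(word), h)."""
    c = word_class[word]
    class_logits = W_class @ h
    log_p_class = class_logits[c] - np.logaddexp.reduce(class_logits)
    members = np.flatnonzero(word_class == c)                 # words sharing class c
    member_logits = W_word[members] @ h
    log_p_word = (W_word[word] @ h) - np.logaddexp.reduce(member_logits)
    return log_p_class + log_p_word

h = rng.normal(size=hidden)  # stand-in for the LSTM hidden state
print(log_prob(42, h))
```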


forward recurrent state encoding the source and target history, a backward recurrent state encoding the source future, and a third LSTM layer to combine them. All layers have 350 nodes. The neural networks are implemented using an extension of the RWTHLM toolkit (Sundermeyer et al., 2014b). The parameter weights are optimized with MERT (Och, 2003) towards the BLEU metric.

3.6 RWTH Neural System

The second system provided by RWTH is an attention-based recurrent neural network similar to (Bahdanau et al., 2015). The implementation is based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012).

The network uses the 30K most frequent words on the source and target side as input vocabulary. The decoder and encoder word embeddings are of size 620. The encoder uses a bidirectional layer with 1024 GRUs (Cho et al., 2014) to encode the source side, while the decoder uses a layer of 1024 GRUs.

The network is trained for up to 300K updates with a minibatch size of 80 using Adadelta (Zeiler, 2012). The network is evaluated every 10,000 updates on BLEU and the best network on the newsdev2016/1 dev set is selected as the final network.

The monolingual News Crawl 2015 corpus is translated into English with a simple phrase-based translation system to create additional parallel training data. The new data is weighted by using the News Crawl 2015 corpus (2.3M sentences) once, the Europarl corpus (0.4M sentences) twice and the SETimes2 corpus (0.2M sentences) three times. The final system is an ensemble of 4 networks, all with the same configuration and training settings.

3.7 Tilde

The Tilde system is a phrase-based machine translation system built on the LetsMT infrastructure (Vasiļjevs et al., 2012) that features language-specific data filtering and cleaning modules. Tilde's system was trained on all available parallel data. Two language models are trained using KenLM (Heafield, 2011): 1) a 5-gram model using the Europarl and SETimes2 corpora, and 2) a 3-gram model using the Common Crawl corpus. We also apply a custom tokenization tool that takes into account specifics of the Romanian language and handles non-translatable entities (e.g., file paths, URLs, e-mail addresses, etc.). During translation a rule-based localisation feature is applied.

3.8 Edinburgh/LMU Hierarchical System

The UEDIN-LMU HPBT system is a hierarchical phrase-based machine translation system (Chiang, 2005) built jointly by the University of Edinburgh and LMU Munich. The system is based on the open source Moses implementation of the hierarchical phrase-based paradigm (Hoang et al., 2009). In addition to a set of standard features in a log-linear combination, a number of non-standard enhancements are employed to achieve improved translation quality.

Specifically, we integrate individual language models trained over the separate corpora (News Crawl 2015, Europarl, SETimes2) directly into the log-linear combination of the system and let MIRA (Cherry and Foster, 2012) optimize their weights along with all other features in tuning, rather than relying on a single linearly interpolated language model. We add another background language model estimated over a concatenation of all Romanian corpora including Common Crawl. All language models are unpruned.

For hierarchical rule extraction, we impose less strict extraction constraints than the Moses defaults. We extract more hierarchical rules by allowing for a maximum of ten symbols on the source side, a maximum span of twenty words, and no lower limit to the amount of words covered by right-hand side non-terminals at extraction time. We discard rules with non-terminals on their right-hand side if they are singletons in the training data.

In order to promote better reordering decisions, we implemented a feature in Moses that resembles the phrase orientation model for hierarchical machine translation as described by Huck et al. (2013) and extend our system with it. The model scores orientation classes (monotone, swap, discontinuous) for each rule application in decoding.

We finally follow the approach outlined by Huck et al. (2011) for lightly-supervised training of hierarchical systems. We automatically translate parts (1.2M sentences) of the monolingual Romanian News Crawl 2015 corpus to English with a Romanian→English phrase-based statistical machine translation system (Williams et al., 2016). The foreground phrase table extracted from the human-generated parallel data is filled
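The corpus weighting described above for the RWTH neural system (back-translated News Crawl once, Europarl twice, SETimes2 three times) can be realized simply by duplicating corpora before shuffling the training data. A small sketch with hypothetical file names; the weights are the ones reported in the text, everything else is illustrative.

```python
import random

# (source file, target file, weight) as reported for the RWTH NMT system.
corpora = [
    ("newscrawl2015.bt.en", "newscrawl2015.ro", 1),  # back-translated News Crawl 2015
    ("europarl.en", "europarl.ro", 2),
    ("setimes2.en", "setimes2.ro", 3),
]

pairs = []
for src_path, tgt_path, weight in corpora:
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        lines = list(zip(src, tgt))
    pairs.extend(lines * weight)  # repeat the whole corpus `weight` times

random.seed(1)
random.shuffle(pairs)
with open("train.en", "w", encoding="utf-8") as src_out, \
     open("train.ro", "w", encoding="utf-8") as tgt_out:
    for src_line, tgt_line in pairs:
        src_out.write(src_line)
        tgt_out.write(tgt_line)
```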


up with entries from a background phrase table extracted from the automatically produced News Crawl 2015 parallel data.

Huck et al. (2016) give a more in-depth description of the Edinburgh/LMU hierarchical machine translation system, along with detailed experimental results.

3.9 Edinburgh Neural System

Edinburgh's neural machine translation system is an attentional encoder-decoder (Bahdanau et al., 2015), which we train with nematus.3 We use byte-pair encoding (BPE) to achieve open-vocabulary translation with a fixed vocabulary of subword symbols (Sennrich et al., 2016c). We produce additional parallel training data by automatically translating the monolingual Romanian News Crawl 2015 corpus into English (Sennrich et al., 2016b), which we combine with the original parallel data in a 1-to-1 ratio. We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We apply dropout to all layers (Gal, 2015), with dropout probability 0.2, and also drop out full words with probability 0.1. We clip the gradient norm to 1.0 (Pascanu et al., 2013). We train the models with Adadelta (Zeiler, 2012), reshuffling the training corpus between epochs. We validate the model every 10,000 minibatches via BLEU on a validation set, and perform early stopping on BLEU. Decoding is performed with beam search with a beam size of 12.

A more detailed description of the system, and more experimental results, can be found in (Sennrich et al., 2016a).

3.10 Edinburgh Phrase-based System

Edinburgh's phrase-based system is built using the Moses toolkit, with fast_align (Dyer et al., 2013) for word alignment, and KenLM (Heafield et al., 2013) for language model training. In our Moses setup, we use hierarchical lexicalized reordering (Galley and Manning, 2008), an operation sequence model (Durrani et al., 2013), domain indicator features, and binned phrase count features. We use all available parallel data for the translation model, and all available Romanian text for the language model. We use two different 5-gram language models; one built from all the monolingual target text concatenated, without pruning, and one built from only News Crawl 2015, with singleton 3-grams and above pruned out. The weights of all these features and models are tuned with k-best MIRA (Cherry and Foster, 2012) on the first half of newsdev2016. In decoding, we use MBR (Kumar and Byrne, 2004), cube pruning (Huang and Chiang, 2007) with a pop limit of 5000, and the Moses "monotone at punctuation" switch (to prevent reordering across punctuation) (Koehn and Haddow, 2009).

3.11 USFD Phrase-based System

USFD's phrase-based system is built using the Moses toolkit, with MGIZA (Gao and Vogel, 2008) for word alignment and KenLM (Heafield et al., 2013) for language model training. We use all available parallel data for the translation model. A single 5-gram language model is built using all the target side of the parallel data and a subpart of the monolingual Romanian corpora selected with XenC-v2 (Rousseau, 2013). For the latter we use all the parallel data as in-domain data and the first half of newsdev2016 as development set. The feature weights are tuned with MERT (Och, 2003) on the first half of newsdev2016.

The system produces distinct 1000-best lists, for which we extend the feature set with the 17 baseline black-box features from sentence-level Quality Estimation (QE) produced with Quest++ (Specia et al., 2015).4 The 1000-best lists are then reranked and the top-best hypothesis extracted using the nbest rescorer available within the Moses toolkit.

3.12 UvA

We use a phrase-based machine translation system (Moses) with a distortion limit of 6 and lexicalized reordering. Before translation, the English source side is preordered using the neural preordering model of (de Gispert et al., 2015). The preordering model is trained for 30 iterations on the full MGIZA-aligned training data. We use two language models, built using KenLM. The first is a 5-gram language model trained on all available data. Words in the Common Crawl dataset that appear fewer than 500 times were replaced by UNK, and all singleton n-grams of order 3 or higher were pruned. We also use a 7-gram class-based language model, trained on the same data. 512 word classes were generated using the method of Green et al. (2014).

3 https://github.com/rsennrich/nematus
4 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17
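The USFD reranking step adds sentence-level QE features to each n-best entry and rescores with a tuned weight vector. The sketch below shows only the rescoring idea; the feature names and weights are invented for illustration, whereas the actual system uses the 17 QuEst++ black-box features and the Moses nbest rescorer.

```python
def rerank(nbest, weights):
    """nbest: list of (hypothesis, feature_dict); return the highest-scoring hypothesis."""
    def score(entry):
        _, features = entry
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(nbest, key=score)[0]

# Toy example mixing decoder features with a made-up QE feature.
nbest = [
    ("hypothesis one", {"tm": -4.2, "lm": -10.1, "qe_avg_word_len": 4.5}),
    ("hypothesis two", {"tm": -3.9, "lm": -11.0, "qe_avg_word_len": 4.1}),
]
weights = {"tm": 1.0, "lm": 0.5, "qe_avg_word_len": -0.2}
print(rerank(nbest, weights))
```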


[Figure 1 graphic: a confusion network built from the four system outputs "the large building", "the large home", "a big house", and "a huge house".]

Figure 1: System A: the large building; System B: the large home; System C: a big house; System D: a huge house; Reference: the big house.

4 System Combination

System combination produces consensus translations from multiple hypotheses which are obtained from different translation approaches, i.e., the systems described in the previous section. A system combination implementation developed at RWTH Aachen University (Freitag et al., 2014a) is used to combine the outputs of the different engines. The consensus translations outperform the individual hypotheses in terms of translation quality.

The first step in system combination is the generation of confusion networks (CN) from I input translation hypotheses. We need pairwise alignments between the input hypotheses, which are obtained from METEOR (Banerjee and Lavie, 2005). The hypotheses are then reordered to match a selected skeleton hypothesis in terms of word ordering. We generate I different CNs, each having one of the input systems as the skeleton hypothesis, and the final lattice is the union of all I generated CNs. In Figure 1 an example of a confusion network with I = 4 input translations is depicted. Decoding of a confusion network finds the best path in the network. Each arc is assigned a score of a linear model combination of M different models, which includes a word penalty, a 3-gram language model trained on the input hypotheses, a binary primary system feature that marks the primary hypothesis, and a binary voting feature for each system. The binary voting feature for a system is 1 if and only if the decoded word is from that system, and 0 otherwise. The different model weights for system combination are trained with MERT (Och, 2003).

5 Experimental Evaluation

Since only one development set was provided we split the given development set into two parts: newsdev2016/1 and newsdev2016/2. The first part was used as development set while the second part was our internal test set. Additionally we extracted 2000 sentences from the Europarl and SETimes2 data to create two additional development and test sets. Most single systems are optimized for newsdev2016/1 and/or the SETimes2 test set. The system combination was optimized on the newsdev2016/1 set.

The single system scores in Table 1 show clearly that the UEDIN NMT system is the strongest single system by a large margin. The other standalone attention-based neural network contribution, RWTH NMT, follows, with only a small margin before the phrase-based contributions. The combination of all systems improved the strongest system by another 1.9 BLEU points on our internal test set, newsdev2016/2, and by 1 BLEU point on the official test set, newstest2016.

Removing the strongest system from our system combination shows a large degradation of the results. The combination is still slightly stronger than the UEDIN NMT system on newsdev2016/2, but lags behind on newstest2016. Removing the weakest individual system shows a slight degradation on newsdev2016/2 and newstest2016, hinting that it still provides valuable information.

Table 2 shows a comparison between all systems by scoring the translation outputs against each other in TER and BLEU. We see that the neural network outputs differ the most from all the other systems.

6 Morphology Prediction Precision

In order to assess how well the different system outputs predict the right morphology, we compute a precision rate for each Romanian morphological attribute that occurs with nouns, pronouns, adjectives, determiners, and verbs (Table 3). For this purpose, we use the METEOR toolkit (Banerjee and Lavie, 2005) to obtain word alignments between each system translation and the reference translation for newstest2016. The reference and hypotheses are tagged with TTL (Tufiş et al., 2008).5 Each word in the reference that is assigned a POS tag of interest (noun, pronoun, adjective, determiner, or verb) is then compared to the word it is aligned to in the system output.

5 The hypotheses were tagged despite the risks that go along with tagging automatically generated sentences. A dictionary would have been a solution, but unfortunately we had no such resource for Romanian.
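To make the voting feature of Section 4 concrete, here is a toy sketch of confusion-network voting, not the Jane implementation: the METEOR-based alignment into slots is assumed to have been done already, and the real system additionally scores each arc with a language model, word penalty and primary-system feature.

```python
def decode_confusion_network(slots, system_weights):
    """slots: list of dicts mapping a candidate word ("" means an epsilon arc)
    to the set of systems voting for it; pick the best-voted word per slot."""
    output = []
    for slot in slots:
        best_word = max(slot, key=lambda w: sum(system_weights[s] for s in slot[w]))
        if best_word:  # skip epsilon arcs
            output.append(best_word)
    return " ".join(output)

# Slots built from the four aligned hypotheses of Figure 1.
slots = [
    {"the": {"A", "B"}, "a": {"C", "D"}},
    {"large": {"A", "B"}, "big": {"C"}, "huge": {"D"}},
    {"building": {"A"}, "home": {"B"}, "house": {"C", "D"}},
]
weights = {"A": 0.2, "B": 0.2, "C": 0.3, "D": 0.3}
print(decode_confusion_network(slots, weights))  # -> "a large house" with these weights
```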


Individual Systems      newsdev2016/1     newsdev2016/2     newstest2016
                        BLEU    TER       BLEU    TER       BLEU    TER
KIT                     25.2    57.5      29.9    51.8      26.3    55.9
LIMSI                   23.3    59.5      27.2    55.0      23.9    59.2
LMU-CUNI                23.4    60.4      28.4    53.5      24.7    58.1
LMU                     23.3    60.5      28.6    53.8      24.5    58.5
RWTH HPBT               25.4    58.7      29.3    53.3      25.9    57.6
RWTH NMT                25.1    57.4      30.6    49.6      26.5    55.4
Tilde                   21.3    62.7      25.8    56.3      23.2    60.2
UEDIN-LMU HPBT          24.8    58.7      30.1    52.3      25.4    57.7
UEDIN PBT               24.7    59.3      29.1    53.2      25.2    58.1
UEDIN NMT               26.8    56.1      31.4    50.3      27.9    54.5
USFD                    22.9    60.4      27.8    54.0      24.4    58.5
UvA                     22.1    61.0      27.7    54.2      24.1    58.7
System Combination      28.7    55.5      33.3    49.0      28.9    54.2
- without UEDIN NMT     27.4    56.6      31.6    50.9      27.5    55.4
- without Tilde         28.8    55.5      33.0    49.5      28.7    54.5

Table 1: Results of the individual systems for the English→Romanian task. BLEU [%] and TER [%] scores are case-sensitive.

                 KIT   LIMSI LMU-CUNI LMU  RWTH HPBT RWTH NMT Tilde UEDIN-LMU HPBT UEDIN PBT UEDIN NMT USFD  UvA   Average
KIT              -     55.0  55.9     51.7 56.2      48.2     50.3  54.6           55.1      42.8      56.6  54.1  52.8
LIMSI            29.3  -     54.3     52.1 51.8      43.0     49.8  55.3           56.2      38.2      57.3  52.1  51.4
LMU-CUNI         28.5  30.8  -        52.4 53.3      43.8     55.4  56.0           56.6      39.3      58.6  56.6  52.9
LMU              31.2  32.0  31.7     -    53.6      43.1     49.1  59.4           58.6      37.8      56.1  55.8  51.8
RWTH HPBT        28.5  32.4  31.2     30.8 -         47.5     50.1  54.9           55.6      41.8      53.9  55.3  52.2
RWTH NMT         33.7  37.9  37.3     37.5 34.8      -        40.8  44.3           45.3      46.0      43.8  43.6  44.5
Tilde            32.2  33.7  29.6     33.8 33.4      39.6     -     53.4           58.5      36.5      55.5  52.0  50.1
UEDIN-LMU HPBT   29.5  29.9  29.4     27.3 29.8      36.9     30.9  -              62.8      38.9      59.6  56.2  54.1
UEDIN PBT        28.4  28.9  28.5     27.0 29.3      35.4     27.0  24.2           -         39.4      60.2  58.6  55.2
UEDIN NMT        38.6  42.6  42.0     43.0 40.1      35.5     44.0  42.1           41.1      -         38.2  38.2  39.7
USFD             27.6  28.8  27.4     28.8 30.4      37.0     29.1  26.5           25.7      42.6      -     58.8  54.4
UvA              29.9  32.0  28.6     29.2 29.6      37.5     31.5  29.0           26.5      43.2      26.9  -     52.9
Average          30.7  32.6  31.4     32.0 31.8      36.6     33.2  30.5           29.3      41.3      30.0  31.3  -

Table 2: Comparison of system outputs against each other, generated by computing BLEU and TER on the system translations for newstest2016. One system in a pair is used as the reference, the other as candidate translation; we report the average over both directions. The upper-right half lists BLEU [%] scores, the lower-left half TER [%] scores.


Attribute          KIT    LIMSI  LMU-CUNI LMU    RWTH HPBT RWTH NMT Tilde  UEDIN-LMU HPBT UEDIN PBT UEDIN NMT USFD   UvA    Combination
Case               46.7%  46.0%  46.3%    45.7%  47.7%     48.0%    44.4%  46.3%          47.4%     49.8%     45.4%  45.4%  50.8%
Definite           50.5%  49.1%  50.0%    49.2%  50.5%     50.1%    47.2%  50.0%          50.5%     51.0%     49.2%  48.9%  53.3%
Gender             51.9%  51.0%  51.9%    51.3%  52.6%     52.1%    49.6%  51.9%          52.7%     53.0%     51.2%  50.9%  54.9%
Number             53.2%  51.7%  52.6%    52.3%  53.6%     53.7%    50.6%  52.9%          53.6%     54.9%     52.1%  51.8%  56.3%
Person             52.8%  51.3%  52.0%    52.0%  53.5%     55.0%    50.6%  52.6%          53.4%     57.2%     52.4%  51.6%  57.1%
Tense              45.8%  44.1%  44.7%    44.8%  45.7%     45.5%    42.3%  45.2%          45.1%     46.6%     44.9%  44.8%  48.0%
Verb form          45.9%  44.4%  45.5%    44.9%  46.6%     47.0%    43.9%  46.1%          46.5%     47.2%     45.5%  43.3%  48.7%
Reference words
with alignment     57.7%  56.7%  57.3%    57.3%  58.3%     57.6%    55.7%  58.0%          58.5%     58.3%     57.3%  56.8%  60.4%

Table 3: Precision of each system on morphological attribute prediction computed over the reference translation using METEOR alignments. The last row shows the ratio of reference words for which METEOR managed to find an alignment in the hypothesis.
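The following is a simplified sketch of the per-attribute precision computation of Section 6: for every aligned reference word with a POS of interest, an attribute counts as correct when the aligned hypothesis word carries the same value. The tagging, the alignment and the attribute inventory are assumed to come from TTL and METEOR; the data structures here are illustrative only.

```python
from collections import defaultdict

POS_OF_INTEREST = {"noun", "pronoun", "adjective", "determiner", "verb"}

def attribute_precision(reference, hypothesis, alignment):
    """reference/hypothesis: lists of tokens, each a dict with 'pos' and an 'attrs' dict
    such as {'Number': 'Singular'}; alignment: reference position -> hypothesis position."""
    correct, total = defaultdict(int), defaultdict(int)
    for ref_pos, hyp_pos in alignment.items():
        ref_tok = reference[ref_pos]
        if ref_tok["pos"] not in POS_OF_INTEREST:
            continue
        hyp_tok = hypothesis[hyp_pos]
        for attr, value in ref_tok["attrs"].items():
            total[attr] += 1
            if hyp_tok["attrs"].get(attr) == value:
                correct[attr] += 1
    return {attr: correct[attr] / total[attr] for attr in total}

reference = [{"pos": "noun", "attrs": {"Number": "Singular", "Gender": "Feminine"}}]
hypothesis = [{"pos": "noun", "attrs": {"Number": "Singular", "Gender": "Masculine"}}]
print(attribute_precision(reference, hypothesis, {0: 0}))  # {'Number': 1.0, 'Gender': 0.0}
```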

When, for a given morphological attribute, the output and the reference have the same value (e.g. Number=Singular), we consider the prediction correct. The prediction is considered wrong in every other case.

The last row in Table 3 shows the ratio of reference words for which METEOR found an alignment in the hypothesis. We observe a high correlation between this ratio and the quality of the morphological predictions, showing that the accuracy is highly dependent on the alignments. We nevertheless observe that the predictions made by UEDIN NMT are strictly all better than UEDIN PBT, although the latter has slightly more alignments to the reference. The system combination makes the most accurate predictions for almost every attribute. The difference in precision with the best single system (UEDIN NMT) can be significant (2.3% for definite and 1.4% for tense), showing that the combination managed to effectively identify the strong points of each translation system.

7 Conclusion

Our combined effort shows that even with an extremely strong single best system, we still manage to improve the final result by one BLEU point by combining it with the other systems of all participating research groups.

The joint submission for English→Romanian is the best submission measured in terms of BLEU, as presented on the WMT submission page.6

6 http://matrix.statmt.org/

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements № 645452 (QT21) and 644402 (HimL).

References

Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, and François Yvon. 2016. LIMSI@WMT'16: Machine translation of news. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In 43rd Annual Meeting of the Assoc. for Computational Linguistics: Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 65–72, Ann Arbor, MI, USA, June.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and


Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

Fabienne Braune, Nina Seemann, and Alexander Fraser. 2015. Rule Selection with Soft Syntactic Features for String-to-Tree Statistical Machine Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP).

Fabienne Braune, Alexander Fraser, Hal Daumé III, and Aleš Tamchyna. 2016. A Framework for Discriminative Rule Selection in Hierarchical Moses. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Francesco Casacuberta and Enrique Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(3):205–225.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables. In MT Summit XIII, pages 269–275, Xiamen, China, September.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proc. of the Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 427–436, Montréal, Canada, June.

David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 263–270, Ann Arbor, Michigan, June.

David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Joseph M. Crego and José B. Mariño. 2006. Improving statistical MT by coupling reordering and decoding. Machine Translation, 20(3):199–215, July.

Josep Maria Crego, François Yvon, and José B. Mariño. 2011. N-code: an open-source Bilingual N-gram SMT Toolkit. Prague Bulletin of Mathematical Linguistics, 96:49–58.

Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. 2015. Fast and accurate preordering for SMT using neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Colorado, May–June.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 399–405, Sofia, Bulgaria, August.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL-HLT, pages 644–648, Atlanta, Georgia, June.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Nadir Durrani, Matthias Huck, Philipp Koehn, Thanh-Le Ha, Jan Niehues, Mohammed Mediani, Teresa Herrmann, Alex Waibel, Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. EU-BRIDGE MT: Text Translation of Talks in the EU-BRIDGE Project. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 128–135, Heidelberg, Germany, December.

Markus Freitag, Matthias Huck, and Hermann Ney. 2014a. Jane: Open Source Machine Translation System Combination. In Proc. of the Conf. of the European Chapter of the Assoc. for Computational Linguistics (EACL), pages 29–32, Gothenburg, Sweden, April.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sennrich, Nadir Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, Teresa Herrmann, Eunah Cho, and Alex Waibel. 2014b. EU-BRIDGE MT: Combined Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 105–113, Baltimore, MD, USA, June.

Markus Freitag, Joern Wuebker, Stephan Peitz, Hermann Ney, Matthias Huck, Alexandra Birch, Nadir Durrani, Philipp Koehn, Mohammed Mediani, Isabel Slawik, Jan Niehues, Eunah Cho, Alex Waibel, Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2014c. Combined Spoken Language Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 57–64, Lake Tahoe, CA, USA, December.

Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,


pages 848–856, Stroudsburg, PA, USA. Association for Computational Linguistics.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.

Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research, 3:115–143.

Joshua Goodman. 2001. Classes for fast maximum entropy training. CoRR, cs.CL/0108006.

Spence Green, Daniel Cer, and Christopher D. Manning. 2014. An empirical comparison of features and tuning for phrase-based machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. pages 690–696, Sofia, Bulgaria, August.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July.

Teresa Herrmann, Jan Niehues, and Alex Waibel. 2015. Source Discriminative Word Lexicon for Translation Disambiguation. In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT15), Danang, Vietnam.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 152–159, Tokyo, Japan, December.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 144–151, Prague, Czech Republic, June.

Matthias Huck, David Vilar, Daniel Stein, and Hermann Ney. 2011. Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation. In Proc. of the EMNLP 2011 Workshop on Unsupervised Learning in NLP, pages 91–96, Edinburgh, Scotland, UK, July.

Matthias Huck, Joern Wuebker, Felix Rietig, and Hermann Ney. 2013. A Phrase Orientation Model for Hierarchical Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 452–463, Sofia, Bulgaria, August.

Matthias Huck, Alexander Fraser, and Barry Haddow. 2016. The Edinburgh/LMU Hierarchical Machine Translation System for WMT 2016. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving Translation Quality by Discarding Most of the Phrasetable. In Proc. of EMNLP-CoNLL 2007.

Philipp Koehn and Barry Haddow. 2009. Edinburgh's Submission to all Tracks of the WMT 2009 Shared Task with Reordering and Speed Improvements to Moses. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 160–164, Athens, Greece.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In HLT 2004 - Human Language Technology Conference, Boston, MA, May.

José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram-based machine translation. Comput. Linguist., 32(4):527–549, December.

R.C. Moore and W. Lewis. 2010. Intelligent Selection of Language Model Training Data. In ACL (Short Papers), pages 220–224, Uppsala, Sweden, July.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics.

J. Niehues and A. Waibel. 2012. Detailed Analysis of Different Strategies for Phrase Table Adaptation in SMT. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas, San Diego, CA, USA.


Jan Niehues and Alex Waibel. 2013. An MT Error-Driven Discriminative Word Lexicon using Sentence Structure Features. In Proceedings of the 8th Workshop on Statistical Machine Translation, Sofia, Bulgaria.

Jan Niehues, Quoc Khanh Do, Alexandre Allauzen, and Alex Waibel. 2015. ListNet-based MT Rescoring. EMNLP 2015, page 248.

Franz Josef Och. 1999. An Efficient Method for Determining Bilingual Word Classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 1310–1318, Atlanta, GA, USA.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proc. of the 21st Int. Conf. on Computational Linguistics and the 44th Annual Meeting of the Assoc. for Computational Linguistics, pages 433–440.

Kay Rottmann and Stephan Vogel. 2007. Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde, Sweden.

Anthony Rousseau. 2013. XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, (100):73–82.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Lucia Specia, Gustavo Paetzold, and Carolina Scarton. 2015. Multi-level translation quality prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, July. Association for Computational Linguistics and The Asian Federation of Natural Language Processing.

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Speech and Language Processing (ICSLP), volume 2, pages 901–904, Denver, CO, September.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In Interspeech, Portland, OR, USA, September.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014a. Translation Modeling with Bidirectional Recurrent Neural Networks. In Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2014b. rwthlm – The RWTH Aachen University Neural Network Language Modeling Toolkit. In Interspeech, pages 2093–2097, Singapore, September.

Aleš Tamchyna, Fabienne Braune, Alexander M. Fraser, Marine Carpuat, Hal Daumé III, and Chris Quirk. 2014. Integrating a Discriminative Classifier into Phrase-based and Hierarchical Decoding. The Prague Bulletin of Mathematical Linguistics (PBML), 101:29–42.

Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. 2016. Target-Side Context for Discriminative Models in Statistical Machine Translation. In Proc. of ACL, Berlin, Germany, August. Association for Computational Linguistics.

Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu. 2008. RACAI's Linguistic Web Services. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May. European Language Resources Association (ELRA).

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. CoRR, abs/1506.00619.

Andrejs Vasiļjevs, Raivis Skadiņš, and Jörg Tiedemann. 2012. LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation. In Min Zhang, editor, Proceedings of the ACL 2012 System Demonstrations, pages 43–48, Jeju Island, Korea, July. Association for Computational Linguistics.


David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Philip Williams, Rico Sennrich, Maria Nădejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. 2016. Edinburgh's Statistical Machine Translation Systems for WMT16. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, WA, USA, October.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-Based Statistical Machine Translation. In 25th German Conf. on Artificial Intelligence (KI2002), pages 18–32, Aachen, Germany, September. Springer Verlag.


D Beer for MT evaluation and tuning

BEER 1.1: ILLC UvA submission to metrics and tuning task

Miloš Stanojević and Khalil Sima'an
ILLC, University of Amsterdam
[email protected]    [email protected]

Abstract

We describe the submissions of ILLC UvA to the metrics and tuning tasks on WMT15. Both submissions are based on the BEER evaluation metric originally presented on WMT14 (Stanojević and Sima'an, 2014a). The main changes introduced this year are: (i) extending the learning-to-rank trained sentence level metric to the corpus level (but still decomposable to sentence level), (ii) incorporating syntactic ingredients based on dependency trees, and (iii) a technique for finding parameters of BEER that avoid "gaming of the metric" during tuning.

1 Introduction

In the 2014 WMT metrics task, BEER turned up as the best sentence level evaluation metric on average over 10 language pairs (Machacek and Bojar, 2014). We believe that this was due to:

1. learning-to-rank - a type of training that allows a large number of features and also training on the same objective on which the model is going to be evaluated: ranking of translations

2. dense features - character n-grams and skip-bigrams that are less sparse on the sentence level than word n-grams

3. permutation trees - hierarchical decomposition of word order based on (Zhang and Gildea, 2007)

A deeper analysis of (2) is presented in (Stanojević and Sima'an, 2014c) and of (3) in (Stanojević and Sima'an, 2014b). Here we modify BEER by

1. incorporating a better scoring function that gives scores that are better scaled,

2. including syntactic features, and

3. removing the recall bias from BEER.

In Section 2 we give a short introduction to BEER, after which we move to the innovations for this year in Sections 3, 4 and 5. We show the results from the metric and tuning tasks in Section 6, and conclude in Section 7.

2 BEER basics

The model underlying the BEER metric is flexible for the integration of an arbitrary number of new features and has a training method that is targeted for producing good rankings among systems. Two other characteristic properties of BEER are its hierarchical reordering component and its character n-gram lexical matching component.

2.1 Old BEER scoring

BEER is essentially a linear model with which the score can be computed in the following way:

score(h, r) = \sum_i w_i \times \phi_i(h, r) = \vec{w} \cdot \vec{\phi}

where \vec{w} is a weight vector and \vec{\phi} is a feature vector.

2.2 Learning-to-rank

Since the task on which our model is going to be evaluated is ranking translations, it comes natural to train the model using learning-to-rank techniques.

Our training data consists of pairs of "good" and "bad" translations. By using a feature vector \vec{\phi}_{good} for a good translation and a feature vector \vec{\phi}_{bad} for a bad translation, then using the following equations we can transform the ranking problem into a binary classification problem (Herbrich et al., 1999):
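A compact sketch of this pairwise training recipe: each "good"/"bad" pair yields one positive difference vector and one mirrored negative one, and any linear classifier can then recover the weight vector. Here scikit-learn's logistic regression stands in for whatever learner BEER actually uses, and the random features replace the real feature extraction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_pairs, num_features = 200, 10
phi_good = rng.normal(size=(num_pairs, num_features)) + 0.3  # features of preferred translations
phi_bad = rng.normal(size=(num_pairs, num_features))

# phi_good - phi_bad is a positive example, phi_bad - phi_good a negative one.
X = np.vstack([phi_good - phi_bad, phi_bad - phi_good])
y = np.concatenate([np.ones(num_pairs), np.zeros(num_pairs)])

ranker = LogisticRegression(fit_intercept=False).fit(X, y)
w = ranker.coef_[0]  # the learned weight vector

def linear_score(phi):
    # The "old" BEER score of Section 2.1: a plain dot product.
    return float(np.dot(w, phi))
```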



score(h_{good}, r) > score(h_{bad}, r) \Leftrightarrow
\vec{w} \cdot \vec{\phi}_{good} > \vec{w} \cdot \vec{\phi}_{bad} \Leftrightarrow
\vec{w} \cdot (\vec{\phi}_{good} - \vec{\phi}_{bad}) > 0 \Leftrightarrow
\vec{w} \cdot (\vec{\phi}_{bad} - \vec{\phi}_{good}) < 0

If we look at \vec{\phi}_{good} - \vec{\phi}_{bad} as a positive training instance and at \vec{\phi}_{bad} - \vec{\phi}_{good} as a negative training instance, we can train any linear classifier to find the weight vector \vec{w} that minimizes mistakes in ranking on the training set.

2.3 Lexical component based on character n-grams

Lexical scoring of BEER relies heavily on character n-grams. Precision, Recall and F1-score are used with character n-gram orders from 1 until 6. These scores are more smooth on the sentence level than the word n-gram matching that is present in other metrics like BLEU (Papineni et al., 2002) or METEOR (Michael Denkowski and Alon Lavie, 2014).

BEER also uses precision, recall and F1-score on the word level (but not with word n-grams). Matching of words is computed over METEOR alignments that use WordNet, paraphrasing and stemming to obtain a more accurate alignment. We also make a distinction between function and content words. A more precise description of the used features and their effectiveness is presented in (Stanojević and Sima'an, 2014c).

2.4 Reordering component based on PETs

The word alignments between system and reference translation can be simplified and considered as a permutation of the words of the reference translation in the system translation. Previous work by (Isozaki et al., 2010) and (Birch and Osborne, 2010) used this permutation view of word order and applied Kendall τ for evaluating its distance from the ideal (monotone) word order.

BEER goes beyond this skip-gram based evaluation and decomposes the permutation into a hierarchical structure which shows how subparts of the permutation form small groups that can be reordered all together. Figure 1a shows the PET for permutation ⟨2, 5, 6, 4, 1, 3⟩. Ideally the permutation tree will be filled with nodes ⟨1, 2⟩, which would say that there is no need to do any reordering (everything is in the right place). BEER has features that compute the number of different node types and for each different type it assigns a different weight. Sometimes there is more than one PET for the same permutation. Consider Figures 1b and 1c, which are just 2 out of 3 possible PETs for permutation ⟨4, 3, 2, 1⟩. Counting the number of trees that could be built is also a good indicator of the permutation quality. See (Stanojević and Sima'an, 2014b) for details on using PETs for evaluating word order.

3 Corpus level BEER

Our goal here is to create a corpus level extension of BEER that decomposes trivially at the sentence level. More concretely, we wanted to have a corpus level BEER that would be the average of the sentence level BEER of all sentences in the corpus:

BEER_{corpus}(c) = \frac{\sum_{s_i \in c} BEER_{sent}(s_i)}{|c|}    (1)

In order to do so it is not suitable to use the previous scoring function of BEER. The previous scoring function (and training method) takes care only that the better translation gets a higher score than the worse translation (on the sentence level). For this kind of corpus level computation we have an additional requirement: our sentence level scores need to be scaled proportionally to the translation quality.

3.1 New BEER scoring function

To make the scores on the sentence level better scaled we transform our linear model into a probabilistic linear model – logistic regression with the following scoring function:

score(h, r) = \frac{1}{1 + e^{-\sum_i w_i \times \phi_i(h, r)}}

There is still a problem with this formulation. During training, the model is trained on the difference between two feature vectors \vec{\phi}_{good} - \vec{\phi}_{bad}, while during testing it is applied only to one feature vector \vec{\phi}_{test}. \vec{\phi}_{good} - \vec{\phi}_{bad} is usually very close to the separating hyperplane, whereas \vec{\phi}_{test} is usually very far from it. This is not a problem for ranking, but it presents a problem if we want well scaled scores. Being extremely far from the

397

Page 46 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

2, 4, 1, 3 2, 1 h i h i 2, 1 h i 2, 1 1 2 2, 1 13 h i h i 2, 1 2, 1 1, 2 4 h i h i 2, 1 2 h i h i 56 43 21 43 (a) Complex PET (b) Fully inverted PET 1 (c) Fully inverted PET 2

Figure 1: Examples of PETs

separated hyperplane gives extreme scores such as 1. POS bigrams matching 0.9999999999912 and 0.00000000000000213 as a result which are obviously not well scaled. 2. dependency words bigram matching Our model was trained to give a probability of 3. arc type matching the “good” translation being better than the “bad” translation so we should also use it in that way – 4. valency matching to estimate the probability of one translation being For each of these we compute precision, recall better than the other. But which translation? We and F1-score. are given only one translation and we need to com- It has been shown by other researchers (Popovic´ pute its score. To avoid this problem we pretend and Ney, 2009) that POS tags are useful for ab- that we are computing a probability of the test sen- stracting away from concrete words and measure tence being a better translation than the reference the grammatical aspect of translation (for example for the given reference. In the ideal case the sys- it can captures agreement). tem translation and the reference translation will Dependency word bigrams (bigrams connected have the same features which will make logistic by a dependency arc) are also useful for capturing regression output probability 0.5 (it is uncertain long distance dependencies. about which translation is the better one). To make Most of the previous metrics that work with de- the scores between 0 and 1 we multiply this result pendency trees usually ignore the type of the de- with 2. The final scoring formula is the following: pendency that is (un)matched and treat all types equally (Yu et al., 2014). This is clearly not the 2 score(h, r) = P case. Surely subject and complement arcs are wi (φi(h,r) φi(r,r)) 1 + e− i × − more important than modifier arc. To capture this 4 BEER + Syntax = BEER Treepel we created individual features for precision, re- The standard version of BEER does not use any call and F1-score matching of each arc type so our syntactic knowledge. Since the training method system could learn on which arc type to put more of BEER allows the usage of a large number of weight. features, it is trivial to integrate new features that All words take some number of arguments (va- would measure the matching between some syntax lency), and not matching that number of argu- attributes of system and reference translations. ments is a sign of a, potentially, bad translation. The syntactic representation we exploit is a de- With this feature we hope to capture the aspect of pendency tree. The reason for that is that we can not producing the right number of arguments for easily connect the structure with the lexical con- all words (and especially verbs) in the sentence. tent and it is fast to compute which can often be This model BEER Treepel contains in total 177 very important for evaluation metrics when they features out of which 45 are from original BEER . need to evaluate on large data. We used Stanford’s 5 BEER for tuning dependency parser (Chen and Manning, 2014) be- cause it gives high accuracy parses in a very short The metrics that perform well on metrics task are time. very often not good for tuning. This is because The features we compute on the dependency recall has much more importance for human judg- trees of the system and its reference translation ment than precision. The metrics that put more are: weight on recall than precision will be better with

398

Page 47 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

tuning metric BLEU MTR BEER Length System Name TrueSkill Score BLEU Tuning-Only All BEER 16.4 28.4 10.2 115.7 BLEU-MIRA-DENSE 0.153 -0.177 12.28 BLEU 18.2 28.1 10.1 103.0 ILLC-UVA 0.108 -0.188 12.05 BEER no bias 18.0 27.7 9.8 BLEU-MERT-DENSE 0.087 -0.200 12.11 99.7 AFRL 0.070 -0.205 12.20 USAAR-TUNA 0.011 -0.220 12.16 Table 1: Tuning results with BEER without bias DCU -0.027 -0.256 11.44 on WMT14 as tuning and WMT13 as test set METEOR-CMU -0.101 -0.286 10.88 BLEU-MIRA-SPARSE -0.150 -0.331 10.84 HKUST -0.150 -0.331 10.99 HKUST-LATE — — 12.20 correlation with human judgment, but when used for tuning they will create overly long translations. Table 6: Results on Czech-English tuning This bias for long translation is often resolved by manually setting the weights of recall and pre- cision to be equal (Denkowski and Lavie, 2011; The difference between BEER and He and Way, 2009). BEER Treepel are relatively big for de-en, This problem is even bigger with metrics with cs-en and ru-en while for fr-en and fi-en the many features. When we have metric like difference does not seem to be big. BEER Treepel which has 117 features it is not clear how to set weights for each feature manu- The results of WMT15 tuning task is shown in ally. Also some features might not have easy inter- Table 6. The system tuned with BEER without re- pretation as precision or recall of something. Our call bias was the best submitted system for Czech- method for automatic removing of this recall bias, English and only the strong baseline outperformed which is presented in (Stanojevic,´ 2015), gives it. very good results that can be seen in Table 1. Before the automatic adaptation of weights 7 Conclusion for tuning, tuning with standard BEER produces translations that are 15% longer than the refer- We have presented ILLC UvA submission to the ence translations. This behavior is rewarded by shared metric and tuning task. All submissions metrics that are recall-heavy like METEOR and are centered around BEER evaluation metric. On BEER and punished by precision heavy metrics the metrics task we kept the good results we had like BLEU. After automatic adaptation of weights, on sentence level and extended our metric to cor- tuning with BEER matches the length of reference pus level with high correlation with high human translation even better than BLEU and achieves judgment without losing the decomposability of the BLEU score that is very close to tuning with the metric to the sentence level. Integration of syn- BLEU. This kind of model is disliked by ME- tactic features gave a bit of improvement on some TEOR and BEER but by just looking at the length language pairs. The removal of recall bias allowed of the produced translations it is clear which ap- us to go from overly long translations produced proach is preferred. in tuning to translations that match reference rel- atively close by length and won the 3rd place in 6 Metric and Tuning task results the tuning task. BEER is available at https: //github.com/stanojevic/beer. The results of WMT15 metric task of best per- forming metrics is shown in Tables 2 and 3 for the system level and Tables 4 and 5 for segment level. Acknowledgments On the sentence level for out of English lan- guage pairs on average BEER was the best met- This work is supported by STW grant nr. 12271 ric (same as the last year). Into English it got 2nd and NWO VICI grant nr. 277-89-002. QT21 place with its syntactic version and 4th place as the project support to the second author is also original BEER . 
acknowledged (European Unions Horizon 2020 On the corpus level BEER is on average second grant agreement no. 64545). We are thankful to for out of English language pairs and 6th for into Christos Louizos for help with incorporating a de- English. BEER and BEER Treepel are the best for pendency parser to BEER Treepel. en-ru and fi-en.

399

Page 48 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Correlation coefficient Pearson Correlation Coefficient Direction fr-en fi-en de-en cs-en ru-en Average DPMFCOMB .995 .006 .951 .013 .949 .016 .992 .004 .871 .025 .952 .013 RATATOUILLE .989 ± .010 .899 ± .019 .942 ± .018 .963 ± .008 .941 ± .018 .947 ± .014 DPMF .997 ± .005 .939 ± .015 .929 ± .019 .986 ± .005 .868 ± .026 .944 ± .014 METEOR-WSD .982 ± .011 .944 ± .014 .914 ± .021 .981 ± .006 .857 ± .026 .936 ± .016 ± ± ± ± ± ± CHRF3 .979 .012 .893 .020 .921 .020 .969 .007 .915 .023 .935 .016 ± ± ± ± ± ± BEER TREEPEL .981 .011 .957 .013 .905 .021 .985 .005 .846 .027 .935 .016 BEER .979 ± .012 .952 ± .013 .903 ± .022 .975 ± .006 .848 ± .027 .931 ± .016 ± ± ± ± ± ± CHRF .997 .005 .942 .015 .884 .024 .982 .006 .830 .029 .927 .016 ± ± ± ± ± ± LEBLEU-OPTIMIZED .989 .009 .895 .020 .856 .025 .970 .007 .918 .023 .925 .017 ± ± ± ± ± ± LEBLEU-DEFAULT .960 .015 .895 .020 .856 .025 .946 .010 .912 .022 .914 .018 ± ± ± ± ± ± Table 2: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating into English.

Correlation coefficient Pearson Correlation Coefficient Metric en-fr en-fi en-de en-cs en-ru Average CHRF3 .949 .021 .813 .025 .784 .028 .976 .004 .913 .011 .887 .018 BEER .970 ± .016 .729 ± .030 .811 ± .026 .951 ± .005 .942 ± .009 .880 ± .017 ± ± ± ± ± ± LEBLEU-OPTIMIZED .949 .020 .727 .030 .896 .020 .944 .005 .867 .013 .877 .018 ± ± ± ± ± ± LEBLEU-DEFAULT .949 .020 .760 .028 .827 .025 .946 .005 .849 .014 .866 .018 RATATOUILLE .962 ± .017 .675 ± .031 .777 ± .028 .953 ± .005 .869 ± .013 .847 ± .019 ± ± ± ± ± ± CHRF .949 .021 .771 .027 .572 .037 .968 .004 .871 .013 .826 .020 METEOR-WSD .961 ± .018 .663 ± .032 .495 ± .039 .941 ± .005 .839 ± .014 .780 ± .022 BS .977± .014 .334 ± .039 .615± .036 .947± .005 .791± .016 .600± .022 DPMF −.973 ±.015 n/a± −.584 ±.037 − n/a± − n/a± −.778 ±.026 ± ± ± Table 3: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating out of English.

Direction fr-en fi-en de-en cs-en ru-en Average DPMFCOMB .367 .015 .406 .015 .424 .015 .465 .012 .358 .014 .404 .014 ± ± ± ± ± ± BEER TREEPEL .358 .015 .399 .015 .386 .016 .435 .013 .352 .013 .386 .014 RATATOUILLE .367 ± .015 .384 ± .015 .380 ± .015 .442 ± .013 .336 ± .014 .382 ± .014 BEER .359 ± .015 .392 ± .015 .376 ± .015 .417 ± .013 .336 ± .013 .376 ± .014 METEOR-WSD .347 ± .015 .376 ± .015 .360 ± .015 .416 ± .013 .331 ± .014 .366 ± .014 ± ± ± ± ± ± CHRF .350 .015 .378 .015 .366 .016 .407 .013 .322 .014 .365 .014 DPMF .344 ± .014 .368 ± .015 .363 ± .015 .413 ± .013 .320 ± .014 .362 ± .014 ± ± ± ± ± ± CHRF3 .345 .014 .361 .016 .360 .015 .409 .012 .317 .014 .359 .014 ± ± ± ± ± ± LEBLEU-OPTIMIZED .349 .015 .346 .015 .346 .014 .400 .013 .316 .015 .351 .014 ± ± ± ± ± ± LEBLEU-DEFAULT .343 .015 .342 .015 .341 .014 .394 .013 .317 .014 .347 .014 ± ± ± ± ± ± TOTAL-BS .305 .013 .277 .015 .287 .014 .357 .013 .263 .014 .298 .014 − ± − ± − ± − ± − ± − ± Table 4: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT human judgments when translating into English. The last three columns contain average Kendall’s τ computed by other variants.

Direction en-fr en-fi en-de en-cs en-ru Average BEER .323 .013 .361 .013 .355 .011 .410 .008 .415 .012 .373 .011 ± ± ± ± ± ± CHRF3 .309 .013 .357 .013 .345 .011 .408 .008 .398 .012 .363 .012 RATATOUILLE .340 ± .013 .300 ± .014 .337 ± .011 .406 ± .008 .408 ± .012 .358 ± .012 ± ± ± ± ± ± LEBLEU-DEFAULT .321 .013 .354 .013 .345 .011 .385 .008 .386 .012 .358 .011 ± ± ± ± ± ± LEBLEU-OPTIMIZED .325 .013 .344 .012 .345 .012 .383 .008 .385 .012 .356 .011 ± ± ± ± ± ± CHRF .317 .013 .346 .012 .315 .013 .407 .008 .387 .012 .355 .012 METEOR-WSD .316 ± .013 .270 ± .013 .287 ± .012 .363 ± .008 .373 ± .012 .322 ± .012 ± ± ± ± ± ± TOTAL-BS .269 .013 .205 .012 .231 .011 .324 .008 .332 .012 .273 .011 DPMF −.308 ±.013 − n/a± −.289 ±.012 − n/a± − n/a± −.298 ±.013 ± ± ± PARMESAN n/a n/a n/a .089 .006 n/a .089 .006 ± ± Table 5: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT human judgments when translating out of English. The last three columns contain average Kendall’s τ computed by other variants.

400

Page 49 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

References Milosˇ Stanojevic´ and Khalil Sima’an. 2014a. BEER: BEtter Evaluation as Ranking. In Proceedings of the Alexandra Birch and Miles Osborne. 2010. LRscore Ninth Workshop on Statistical Machine Translation, for Evaluating Lexical and Reordering Quality in pages 414–419, Baltimore, Maryland, USA, June. MT. In Proceedings of the Joint Fifth Workshop on Association for Computational Linguistics. Statistical Machine Translation and MetricsMATR, pages 327–332, Uppsala, Sweden, July. Association Milosˇ Stanojevic´ and Khalil Sima’an. 2014b. Eval- for Computational Linguistics. uating Word Order Recursively over Permutation- Proceedings of SSST-8, Eighth Work- Danqi Chen and Christopher D Manning. 2014. A fast Forests. In shop on Syntax, Semantics and Structure in Statis- and accurate dependency parser using neural net- tical Translation works. In Empirical Methods in Natural Language , pages 138–147, Doha, Qatar, Oc- Processing (EMNLP). tober. Association for Computational Linguistics. Michael Denkowski and Alon Lavie. 2011. Meteor Milosˇ Stanojevic´ and Khalil Sima’an. 2014c. Fitting 1.3: Automatic metric for reliable optimization and Sentence Level Translation Evaluation with Many evaluation of machine translation systems. In Pro- Dense Features. In Proceedings of the 2014 Con- ceedings of the Sixth Workshop on Statistical Ma- ference on Empirical Methods in Natural Language chine Translation, WMT ’11, pages 85–91, Strouds- Processing (EMNLP), pages 202–206, Doha, Qatar, burg, PA, USA. Association for Computational Lin- October. Association for Computational Linguistics. guistics. Milosˇ Stanojevic.´ 2015. Removing Biases from Train- Y.He and A. Way. 2009. Improving the objective func- able MT Metrics by Using Self-Training. arXiv tion in minimum error rate training. Proceedings preprint arXiv:1508.02445. of the Twelfth Machine Translation Summit, pages 238–245. Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu, and Shouxun Lin. 2014. Red: A reference depen- Ralf Herbrich, Thore Graepel, and Klaus Obermayer. dency based mt evaluation metric. In COLING’14, 1999. Support Vector Learning for Ordinal Regres- pages 2042–2051. sion. In In International Conference on Artificial Neural Networks, pages 97–102. Hao Zhang and Daniel Gildea. 2007. Factorization of synchronous context-free grammars in linear time. Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito In In NAACL Workshop on Syntax and Structure in Sudoh, and Hajime Tsukada. 2010. Automatic Statistical Translation (SSST. Evaluation of Translation Quality for Distant Lan- guage Pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP ’10, pages 944–952, Stroudsburg, PA, USA. Association for Computational Linguis- tics. Matous Machacek and Ondrej Bojar. 2014. Results of the wmt14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Trans- lation, pages 293–301, Baltimore, Maryland, USA, June. Association for Computational Linguistics. Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evalua- tion for Any Target Language. In Proceedings of the ACL 2014 Workshop on Statistical Machine Transla- tion. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Com- putational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computa- tional Linguistics. 
Maja Popovic´ and Hermann Ney. 2009. Syntax- oriented evaluation measures for machine transla- tion output. In Proceedings of the Fourth Work- shop on Statistical Machine Translation, StatMT ’09, pages 29–32, Stroudsburg, PA, USA. Associ- ation for Computational Linguistics.

401

Page 50 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

E Particle Swarm Optimization Submission for WMT16 Tuning Task

Particle Swarm Optimization Submission for WMT16 Tuning Task

Viktor Kocur Ondrejˇ Bojar CTU in Prague Charles University in Prague FNSPE MFF UFAL´ [email protected] [email protected]

Abstract translation metric2. The standard optimization is a variant of grid search and in our work, we replace This paper describes our submission to the it with the Particle Swarm Optimization (PSO, Tuning Task of WMT16. We replace the Eberhart et al., 1995) algorithm. grid search implemented as part of stan- Particle Swarm Optimization is a good candi- dard minimum-error rate training (MERT) date for an efficient implementation of the inner in the Moses toolkit with a search based loop of MERT due to the nature of the optimiza- on particle swarm optimization (PSO). An tion space. The so-called Traditional PSO (TPSO) older variant of PSO has been previously has already been tested by Suzuki et al. (2011), successfully applied and we now test it with a success. Improved versions of the PSO al- in optimizing the Tuning Task model for gorithm, known as Standard PSO (SPSO), have English-to-Czech translation. We also been summarized in Clerc (2012). adapt the method in some aspects to al- In this paper, we test a modified version of low for even easier parallelization of the the latest SPSO2011 algorithm within the Moses search. toolkit and compare its results and computational costs with the standard Moses implementation of 1 Introduction MERT.

Common models of statistical machine transla- 2 MERT tion (SMT) consist of multiple features which as- sign probabilities or scores to possible transla- The basic goal of MERT is to find optimal weights tions. These are then combined in a weighted for various numerical features of an SMT system. sum to determine the best translation given by the The weights are considered optimal if they min- model. Tuning within SMT refers to the process of imize an automated error metric which compares finding the optimal weights for these features on a the machine translation to a human translation for given tuning set. This paper describes our submis- a certain tuning (development) set. sion to WMT16 Tuning Task1, a shared task where Formally, each feature provides a score (some- all the SMT model components and the tuning set times a probability) that a given sentence e in goal are given and task participants are expected to pro- language is the translation of the foreign sentence vide only the weight settings. We took part only in f. Given a weight for each such feature, it is pos- English-to-Czech system tuning. sible to combine the scores to a single figure and Our solution is based on the standard tuning find the highest scoring translation. The best trans- method of Minimum Error-Rate Training (MERT, lation can then be obtained by the following for- Och, 2003). The MERT algorithm described in mula: Bertoldi et al. (2009) is the default tuning method in the Moses SMT toolkit (Koehn et al., 2007). The inner loop of the algorithm performs opti- e∗ = argmax λi log (pi(e f)) = gp(λ) (1) mization on a space of weight vectors with a given e | Xi 1http://www.statmt.org/wmt16/ 2All our experiments optimize the default BLEU but other tuning-task/ metrics could be directly tested as well.

515 Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 515–521, Berlin, Germany, August 11-12, 2016. c 2016 Association for Computational Linguistics

Page 51 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

The process of finding the best translation e∗ is model has 21 features but adding sparse features, called decoding. The translations can vary signif- we can get to thousands of dimensions. icantly based on the values of the weights, there- These properties of the search space make PSO fore it is necessary to find the weights that would an interesting candidate for the inner loop algo- give the best result. This is achieved by minimiz- rithm. PSO is stochastic so it doesn’t require ing the error of the machine translation against the smoothness of the optimized function. It is also human translation: highly parallelizable and gains more power with more CPUs available, which is welcome since the λ∗ = argmin errf (gp(λ), ehuman) (2) optimization itself is quite expensive. The simplic- λ ity of PSO also leaves space for various improve- The error function can also be considered as a ments. negative value of an automated scorer. The prob- lem with this straight-forward approach is that de- coding is computationally expensive. To reduce 3 PSO Algorithm this cost, the decoder is not run for every consid- ered weight setting. Instead, only some promis- The PSO algorithm was first described by Eber- ing settings are tested in a loop (called the “outer hart et al. (1995). PSO is an iterative optimization loop”): given the current best weights, the decoder method inspired by the behavior of groups of ani- is asked to produce n best translation for each mals such as flocks of birds or schools of fish. The sentence of the tuning set. This enlarged set of space is searched by individual particles with their candidates allows us to estimate translation scores own positions and velocities. The particles can in- for similar weight settings. An optimizer uses form others of their current and previous positions these estimates to propose a new vector of weights and their properties. and the decoder then tests this proposal in another outer loop. The outer loop is stopped when no new weight setting is proposed by the optimizer or no 3.1 TPSO new translations are found by the decoder. The The original algorithm is defined quite generally. run of the optimizer is called the “inner loop”, al- Let us formally introduce the procedure. The though it need not be iterative in any sense. The search space S is defined as optimizer tries to find the best weights so that the least erroneous translations appear as high as pos- sible in the n-best lists of candidate translations. D Our algorithm replaces the inner loop of MERT. S = [mind, maxd] (3) It is therefore important to describe the properties Od=1 of the inner loop optimization task. Due to finite number of translations accumu- where D is the dimension of the space and lated in the n-best lists (across sentences as well as mind and maxd are the minimal and maximal outer loop iterations), the error function changes values for the d-th coordinate. We try to find a only when the change in weights leads to a change point in the space which maximizes a given func- in the order of the n-best list. This is represented tion f : S R. by numerous plateaus in the error function with 7→ There are p particles and the i-th particle in discontinuities on the edges of the plateaus. This the n-th iteration has the following D-dimensional prevents the use of simple gradient methods. 
We vectors: position xn, velocity vn, and two vectors can define a local optimum not in a strict math- i i of maxima found so far: the best position pn vis- ematical sense but as a plateau which has only i ited by the particle itself and the best known po- higher or only lower plateaus at the edges. These sition ln that the particle has learned about from local optima can then be numerous within the i others. search space and trap any optimizing algorithm, n thus preventing convergence to the global opti- In TPSO algorithm, the li vector is always the mum which is desired. globally best position visited by any particle so far. Another problem is the relatively high dimen- The TPSO algorithm starts with simple initial- sionality of the search space. The Tuning Task ization:

516

Page 52 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

located in any close vicinity). This set of best po- sitions is limited to k elements, each new addition 0 xi = rand(S) (4) over the limit k replaces the oldest information. 0 t rand(S) x To establish the “global” optimum li, every parti- v0 = − i (5) i 2 cle consults only its set of learned best positions. 0 0 The algorithm starts with the initialization of pi = xi (6) 0 0 particle vectors given by the equations (4-6). The li = argmax f(pj ) (7) 0 0 j li is initialized with the value of pi . The sets of learned best positions are initialized as empty. where the function rand(S) generates a random Two constants affect computations given below: vector from space S with uniform distribution. w is again the slowdown and c controls the “ex- The velocity for the next iteration is updated as pansion” of examined neighbourhood of each par- follows: ticle. We set w and c to values that (as per Bonyadi and Michalewicz, 2014) ensure convergence:

t+1 t t t v = wv + U(0, 1)φpp + U(0, 1)φll (8) 1 i i i i w = 0.721 (12) 2ln(2) ≈ where U(0, 1) denotes a random number be- tween 0 and 1 with uniform distribution. The pa- 1 rameters w, φ , φ (0, 1) are set by the user and c = + ln(2) 1.193 (13) p l ∈ 2 ≈ indicate a slowdown, and the respective weight for own vs. learned optimum. All the following vectors are then updated: t cli t+1 t t+1 xi = xi + vi (9) t t+1 t+1 t+1 t li pi = xi if f(xi ) > f(pi) (10) t t yi wv t+1 t+1 i li = argmax(f(pj )) (11) j t+1 xi t The process continues with the next iteration Gi until all of the particles converge to proximity of a vt+1 i cpt certain point. Other stopping criteria are also used. pt i t i vi 3.2 Modified SPSO2011 xt We introduce a number of changes to the algo- i rithm SPSO2011 described by Clerc (2012). Figure 1: Construction of the particle position up- t date. The grey area indicates P (G, x). In SPSO2011 the global best position li is re- placed by the best position the particle has re- For the update of velocity, it is first necessary to ceived information about from other particles. In calculate a “center of gravity” Gt of three points: the original SPSO2011 this is done in a synchro- i the current position xt, a slightly “expanded” cur- nized fashion: after every iteration, all particles i rent best position pt and a slightly expanded best send their best personal positions to m other parti- i position known by colleagues lt. The “expansion” cles. Every particle chooses the best position it has i t of the positions is controlled by c and directed out- received in the current iteration and sets its li ac- t t wards from xi: cordingly. This generalization of li is introduced in order to combat premature convergence to a lo- pt + lt 2xt cal optimum. Gt = xt + c i i − i (14) i i · 3 To avoid waiting until all particles finish their t computation, we introduce per-particle memory To introduce further randomness, xi is relocated t of “learned best positions” called the “neighbour- to a position yi sampled from the uniform distri- t t hood set” (although its members do not have to be bution in the area P (Gi, xi) formally defined as:

517

Page 53 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

the algorithm for a fixed number of position up- dates, specifically 32000. Later, we changed the D algorithm to terminate after the manager has seen P (G, x) = G G x ,G + G x d − | d − d| d | d − d| 3200 position updates without any update of the d=1h i O (15) global best position. In the following section, we t refer to the former as PSO without the termination Our P (G, x) is a hypercube centered in Gi and t condition (PSO) and the latter as PSO with the ter- touching xi, see Figure 1 for an illustration. The original SPSO2011 used a d-dimensional ball with mination condition (PSO-T). the center in G and radius G x to avoid the Properties of SPSO2011 have been investigated k − k bias of searching towards points on axes. We are by Bonyadi and Michalewicz (2014). We use a less concerned about this and opt for a simpler and slightly different algorithm, but our modifications faster calculation. should have an effect only on rotational invariance, The new velocity is set to include the previous which is not so much relevant for our purpose. velocity (reduced by w) as well as the speedup Aside from the discussion on the values of w and caused by the random relocation: c with respect to the convergence of all particles to the same point, Bonyadi and Michalewicz also t+1 t t t mention that SPSO2011 is not guaranteed to con- v = wvi + yi xi (16) i − verge to a local optimum. Since our search space Finally, the particle position is updated: is discontinuous with plateaus, the local conver- gence in the mathematical sense is not especially t+1 t t+1 t t xi = xi + vi = wvi + yi (17) useful anyway. The optimized function is evaluated at the new 4 Implementation t+1 position xi and the particle’s best position is up- dated if a new optimum was found. In any case, We implemented the algorithm described above t+1 with one parameter, the number of particles. We the best position pi together with its value is sent to m randomly selected particles (possibly in- set the size of the neighborhood set, denoted k cluding the current particle) to be included in their above, to 4 and the number of random particles re- sets of learned best positions as described above. ceiving the information about a particle’s best po- t+1 sition so far (m) to 3. The particle then sets its li to best position from its own list of learned positions. The implementation of our version of the PSO The next iteration continues with the updated algorithm is built within the standard Moses code. vectors. Normally, the algorithm would terminate The algorithm itself creates a reasonable parallel when all particles converge to a close proximity structure with each thread representing a single to each other, but it turns out that this often leads particle. to premature stopping. There are many other ap- We use similar object structure as the base- proaches possible to this problem (Xinchao, 2010; line MERT implementation. The points are rep- Evers and Ben Ghalia, 2009), but we choose a sim- resented by their own class which handles basic ple restarting strategy: when the particle is send- arithmetic and stream operations. The class car- ing out its new best position and value to m fel- ries not only the vector of the current position but lows, the manager responsible for this checks if also its associated score. this value was not reported in the previous call Multiple threads are maintained by the stan- (from any other particle). If it was, then the current dard Moses thread pools (Haddow, 2012). 
Ev- particle is instructed to restart itself by setting all ery thread (“Task” in Moses thread pools) cor- of its vectors to random initial state.3 The neigh- responds to a particle and is responsible for cal- borhood set is left unchanged. The restart prevents culating its search in the space using the class multiple particles exploring the same area. PSOOptimizer. There are no synchronous it- The drawback of restarts is that the stopping cri- erations, each particle proceeds at its own pace. terion is never met. In our first version, we ran All optimizers have access to a global manager object of class PSOManager, see Figure 2 for an 3The use of score and not position is possible due to the nature of the space in which a same score of two points very illustration. The manager provides methods for t likely means that the points are equivalent. the optimizers to get the best vector li from the

518

Page 54 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Run PSO-16 PSO-64 PSO-T-16 PSO-T-64 MERT-16 1 14.5474 15.6897 15.6133 15.6613 14.5470 2 17.3292 18.7340 18.7437 18.4464 18.8704 3 18.9261 18.9788 18.9711 18.9069 19.0625 4 19.0926 19.2060 19.0646 19.0785 19.0623 5 19.1599 19.2140 19.0968 19.0738 19.1992 6 19.2444 19.2319 - 19.0772 19.1751 7 19.2470 19.2383 - - 19.0480 8 19.2613 19.2245 - - 19.1359 12 - - - - 19.1625

Table 1: The final best BLEU score after the runs of the inner loop for PSO without and with the termination condition with 16 and 64 threads respectively and standard Moses MERT implementation with 16 threads.

pared to the calculations performed in the optimiz- AllTasks ers. The only locking occurs when threads are try- PSOOptimizationTask PSOOptimizationTask ing to add points; read access to the manager can ... PSOOptimizer PSOOptimizer be concurrent. FeatureData FeatureData 5 Results ScorerData ScorerData We ran the tuning only for the English to Czech part of the tuning task. We filtered and binarized the model supplied by the organizers to achieve PSOManager better performance and smaller memory costs. +addPoint(Point p) +getBestNeighbor(int i, Point P) For the computation, we used the services of +cont() Metacentrum VO. Due to the relatively high mem- ory demands we used two SGI UV 2000 machines: Figure 2: Base structure of our PSO algorithm one with 48x 6-core Intel Xeon E5-4617 2.9GHz and 6TB RAM and one with 48x 8-core Intel Xeon E5-4627v2 3.30GHz and 6TB RAM. We ran the neighborhood set, to report its best position to the tuning process on 16 and 64 CPUs, i.e. with 16 random m particles (addPoint) and to check if and 64 particles, respectively. We submitted the the optimization should still run (cont) or termi- weights from the 16-CPU run. We also ran a test nate. The method addPoint serves two other run using the standard Moses MERT implementa- purposes: incrementing an internal counter of it- tion with 16 threads for a comparison. erations and indicating through its return value Table 1 shows the best BLEU scores at the end whether the reporting particle should restart itself. of each inner loop (as projected from the n-best Every optimizer has its own FeatureData lists on the tuning set of sentences). Both meth- and ScorerData, which are used to determine ods provide similar results. Since the methods are the score of the investigated points. As of now, stochastic, different runs will lead to different best the data is loaded serially, so the more threads we positions (and different scores). have, the longer the initialization takes. In the Comparison of our implementation with with baseline implementation of MERT, all the threads the baseline MERT on a test set is not nec- share the scoring data. This means that the data essary. Both implementations try to maximize is loaded only once, but due to some unexpected BLEU score, therefore any overtraining occurring locking, the baseline implementation never gains in the baseline MERT occurs also in our imple- speedups higher than 1.5, even with 32 threads, mentation and vice versa. see Table 2 below. Table 2 shows the average run times and This structure allows an efficient use of multi- reached scores for 8 runs of the baseline MERT ple cores. Methods of the manager are fast com- and our PSO and PSO-T, starting with the same

519

Page 55 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Wall Clock [s] Projected BLEU Reached Outer Loop CPUs MERT PSO PSO-T MERT PSO PSO-T 1 1 186.24 10.63 397.28 2.13 62.37 19.64 14.50 0.03 13.90 0.05 13.84 0.05 1 4 123.51±3.58 72.75±1.12 21.94±4.63 14.51±0.03 14.48±0.08 14.46±0.06 1 8 135.40±8.43 43.07±0.78 15.62±3.40 14.52±0.04 14.53±0.05 14.42±0.12 1 16 139.43±8.00 33.00±1.37 14.59±2.21 14.53±0.02 14.51±0.08 14.48±0.10 1 24 119.69±4.43 32.20±1.62 16.89±3.16 14.52±0.02 14.55±0.06 14.47±0.07 1 32 119.04±4.47 33.42±2.16 19.16±2.92 14.53±0.03 14.50±0.04 14.50±0.07 3 1 701.18±47.13 1062.38±1.88 117.64±0.47 18.93±0.04 18.08±0.00 18.08±0.00 3 4 373.69±28.37 189.86±0.64 57.28±23.61 18.90±0.00 18.82±0.12 18.81±0.07 3 8 430.88±24.82 111.50±0.53 37.92±8.68 18.95±0.05 18.89±0.09 18.87±0.06 3 16 462.77±18.78 80.54±5.39 29.62±4.34 18.94±0.04 18.94±0.07 18.90±0.05 3 24 392.66±13.39 74.08±3.64 31.67±3.47 18.94±0.04 18.93±0.05 18.86±0.05 3 32 399.93±27.68 82.83±3.82 37.70±4.52 18.91±0.01 18.90±0.05 18.87±0.06 ± ± ± ± ± ± Table 2: Average run times and reached scores. The are standard deviations. ±

n-best lists as accumulated in iteration 1 and 3 of ing language resources stored and distributed the outer loop. Note that PSO and PSO-T use only by the LINDAT/CLARIN project of the Min- as many particles as there are threads, so running istry of Education, Youth and Sports of the them with just one thread leads to a degraded per- Czech Republic (project LM2015071). Compu- formace in terms of BLEU. With 4 or 8 threads, tational resources were supplied by the Ministry the three methods are on par in terms of tuning- of Education, Youth and Sports of the Czech set BLEU. Starting from 4 threads, both PSO and Republic under the Projects CESNET (Project PSO-T terminate faster than the baseline MERT No. LM2015042) and CERIT-Scientific Cloud implementation. Moreover the baseline MERT (Project No. LM2015085) provided within the proved unable to utilize multiple CPUs efficiently, program Projects of Large Research, Development whereas PSO gives us up to 14-fold speedup. and Innovations Infrastructures. In general, the higher the ratio of the serial data loading to the search computation time, the worse References the speedup. The search in PSO-T takes much Nicola Bertoldi, Barry Haddow, and Jean-Baptiste shorter time so the overhead of serial data loading Fouet. 2009. Improved minimum error rate is more apparent and PSO-T seems parallelized training in moses. The Prague Bulletin of Math- badly and gives only quadruple speedup. The re- ematical Linguistics 91:7–16. duction of this overhead is highly desirable. Mohammad Reza Bonyadi and Zbigniew Michalewicz. 2014. Spso 2011: Analysis of 6 Conclusion stability; local convergence; and rotation sensi- We presented our submission to the WMT16 Tun- tivity. In Proceedings of the 2014 conference on ing Task, a variant of particle swarm optimization Genetic and evolutionary computation. ACM, applied to minimum error-rate training in statisti- pages 9–16. cal machine translation. Our method is a drop-in Maurice Clerc. 2012. Standard particle swarm op- replacement of the standard Moses MERT and has timisation . the benefit of easy parallelization. Preliminary ex- Russ C Eberhart, James Kennedy, et al. 1995. periments suggest that it indeed runs faster and de- A new optimizer using particle swarm theory. livers comparable weight settings. In Proceedings of the sixth international sym- The effects on the number of iterations of the posium on micro machine and human science. MERT outer loop as well as on the test-set perfor- New York, NY, volume 1, pages 39–43. mance have still to be investigated. George I Evers and Mounir Ben Ghalia. 2009. Re- Acknowledgments grouping particle swarm optimization: a new global optimization algorithm with improved This work has received funding from the Eu- performance consistency across benchmarks. In ropean Union’s Horizon 2020 research and in- Systems, Man and Cybernetics, 2009. SMC novation programme under grant agreement no. 2009. IEEE International Conference on. IEEE, 645452 (QT21). This work has been us- pages 3901–3908.

520

Page 56 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Barry Haddow. 2012. Adding Multi-Threaded De- coding to Moses. Prague Bulletin of Mathemat- ical Linguistics 93:57–66. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pages 177–180. Franz Josef Och. 2003. Minimum error rate train- ing in statistical machine translation. In Pro- ceedings of the 41st Annual Meeting on Asso- ciation for Computational Linguistics-Volume 1. Association for Computational Linguistics, pages 160–167. Jun Suzuki, Kevin Duh, and Masaaki Nagata. 2011. Distributed minimum error rate training of smt using particle swarm optimization. In IJCNLP. pages 649–657. Zhao Xinchao. 2010. A perturbed particle swarm algorithm for numerical optimization. Applied Soft Computing 10(1):119–124.

521

Page 57 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

F CharacTER: Translation Edit Rate on Character Level

G Exponentially Decaying Bag-of-Words Input Features for Feed- Forward Neural Network in Statistical Machine Translation

Exponentially Decaying Bag-of-Words Input Features for Feed-Forward Neural Network in Statistical Machine Translation

Jan-Thorsten Peter, Weiyue Wang, Hermann Ney Human Language Technology and Pattern Recognition, Computer Science Department RWTH Aachen University, 52056 Aachen, Germany {peter,wwang,ney}@cs.rwth-aachen.de

Abstract context length on source and target sides. Using the Bag-of-Words (BoW) model as additional in- Recently, neural network models have put of a neural network based language model, achieved consistent improvements in sta- (Mikolov et al., 2015) have achieved very simi- tistical machine translation. However, lar perplexities on automatic speech recognition most networks only use one-hot encoded tasks in comparison to the long short-term mem- input vectors of words as their input. ory (LSTM) neural network, whose structure is In this work, we investigated the ex- much more complex. This suggests that the bag- ponentially decaying bag-of-words input of-words model can effectively store the longer features for feed-forward neural network term contextual information, which could show translation models and proposed to train improvements in statistical machine translation as the decay rates along with other weight pa- well. Since the bag-of-words representation can rameters. This novel bag-of-words model cover as many contextual words without further improved our phrase-based state-of-the-art modifying the network structure, the problem of system, which already includes a neural limited context window size of feed-forward neu- network translation model, by up to 0.5% ral networks is reduced. Instead of predefining BLEU and 0.6% TER on three different fixed decay rates for the exponentially decaying translation tasks and even achieved a simi- bag-of-words models, we propose to learn the de- lar performance to the bidirectional LSTM cay rates from the training data like other weight translation model. parameters in the neural network model.

2 The Bag-of-Words Input Features 1 Introduction The bag-of-words model is a simplifying repre- Neural network models have recently gained much sentation applied in natural language processing. attention in research on statistical machine trans- In this model, each sentence is represented as the lation. Several groups have reported strong im- set of its words disregarding the word order. Bag- provements over state-of-the-art baselines when of-words models are used as additional input fea- combining phrase-based translation with feed- tures to feed-forward neural networks in addition forward neural network-based models (FFNN) to the one-hot encoding. Thus, the probability of (Schwenk et al., 2006; Vaswani et al., 2013; the feed-forward neural network translation model Schwenk, 2012; Devlin et al., 2014), as well with an m-word source window can be written as: as with recurrent neural network models (RNN) I (Sundermeyer et al., 2014). Even in alternative I J Y bi+∆m p(e | f ) ≈ p(ei | f , fBoW,i) (1) translation systems they showed remarkable per- 1 1 bi−∆m i=1 formance (Sutskever et al., 2014; Bahdanau et al., m−1 2015). where ∆m = 2 and bi is the index of the single The main drawback of a feed-forward neural aligned source word to the target word ei. We ap- network model compared to a recurrent neural plied the affiliation technique proposed in (Devlin network model is that it can only have a limited et al., 2014) for obtaining the one-to-one align-

Page 58 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Page 59 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

ments. The bag-of-words input features fBoW,i can contextual words from the current word. There- be seen as normalized n-of-N vectors as demon- fore the bag-of-words vector with decay weights strated in Figure 1, where n is the number of words can be defined as following: inside each bag-of-words. ˜ X |i−k| ˜ fBoW,i = d fk (2)

k∈SBoW where

[0 1 0 0 ··· 0 0] [0 0 0 1 ··· 0 0] [0 0 1 0 ··· 0 0] [1 0 0 0 ··· 1 1] [0 0 1 0 ··· 1 0] n n n n n i, k Positions of the current word and words original word features bag-of-words input features within the BoW model respectively. Figure 1: The bag-of-words input features along ˜ fBoW,i The value vector of the BoW input fea- with the original word features. The input vectors ture for the i-th word in the sentence. are projected and concatenated at the projection ˜ layer. We omit the hidden and output layers for fk One-hot encoded feature vector of the k- simplification, since they remain unchanged. th word in the sentence.

SBoW Indices set of the words contained in the 2.1 Contents of Bag-of-Words Features BoW. If a word appears more than once in the BoW, the index of the nearest one Before utilizing the bag-of-words input features to the current word will be selected. we have to decide which words should be part of it. We tested multiple different variants: d Decay rate with float value ranging from zero to one. It specifies how fast weights 1. Collecting all words of the sentence in one bag- of contextual words decay along with dis- of-words except the currently aligned word. tances, which can be learned like other weight parameters of the neural network. 2. Collecting all preceding words in one bag-of- words and all succeeding words in a second Instead of using fixed decay rate as in (Irie et al., bag-of-words. 2015), we propose to train the decay rate like other weight parameters in the neural network. The ap- 3. Collecting all preceding words in one bag-of- proach presented by (Mikolov et al., 2015) is com- words and all succeeding words in a second parable to the corpus decay rate shown here, ex- bag-of-words except those already included in cept that their work makes use of a diagonal ma- the source window. trix instead of a scalar as decay rate. In our ex- periments, three different kinds of decay rates are All of these variants provide the feed-forward trained and applied: neural network with an unlimited context in both directions. The differences between these setups 1. Corpus decay rate: all words in vocabulary only varied by 0.2% BLEU and 0.1% TER. We share the same decay rate. choose to base further experiments on the last vari- ant since it performed best and seemed to be the 2. Individual decay rate for each bag-of-words: most logical choice for us. each bag-of-words has its own decay rate given the aligned word. 2.2 Exponentially Decaying Bag-of-Words 3. Individual decay rate for each word: each word Another variant is to weight the words within uses its own decay rate. the bag-of-words model. In the standard bag- of-words representation these weights are equally We use the English sentence distributed for all words. This means the bag-of- “friends had been talking about this fish for a long time” words input is a vector which marks if a word is as an example to clarify the differences between given or not and does not encode the word or- these variants. A five words contextual window der. To avoid this problem, the exponential decay centered at the current aligned word fish has approach proposed in (Clarkson and Robinson, been applied: {about, this, fish, for, a}. 1997) has been adopted to express the distance of The bag-of-words models are used to collect all

Page 60 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

other source words outside the context window: • Enhanced low frequency counts {friends, had, been, talking} and {long, (Chen et al., 2011) time}. Furthermore, there are multiple choices for assigning decay weights to all these words in • 4-gram language model the bag-of-words feature: • 7-gram word class language model

Sentence: friends had been talking about this fish for a long time (Wuebker et al., 2013)

• Word and phrase penalties Distance: 6 5 4 3 3 4 • Hierarchical reordering model (Galley and Manning, 2008) 1. Corpus decay rate: d Additionally, a neural network translation model, Weights: d6 d5 d4 d3 d3 d4 similar to (Devlin et al., 2014), with following configurations is applied for reranking the n-best

2. Bag-of-words individual decay rate: d = dfish lists:

6 5 4 3 3 4 • Projection layer size 100 for each word Weights: dfish dfish dfish dfish dfish dfish • Two non-linear hidden layers with 1000 and 500 3. Word individual decay rate: nodes respectively d ∈ {dfriends, dhad, dbeen, dtalking, dlong, dtime} • Short-list size 10000 along with 1000 word Weights: d6 d5 d4 d3 d3 d4 friends had been talking long time classes at the output layer

• 5 one-hot input vectors of words

Unless otherwise stated, the investigations on bag- 3 Experiments of-words input features are based on this neural 3.1 Setup network model. We also integrated our neural net- work translation model into the decoder as pro- Experiments are conducted on the IWSLT 2013 posed in (Devlin et al., 2014). The relative im- German→English, WMT 2015 German→English provements provided by integrated decoding and and DARPA BOLT Chinese→English translation reranking are quite similar, which can also be con- tasks. GIZA++ (Och and Ney, 2003) is applied firmed by (Alkhouli et al., 2015). We therefore for aligning the parallel corpus. The translation decided to only work in reranking for repeated ex- quality is evaluated by case-insensitive BLEU (Pa- perimentation. pineni et al., 2002) and TER (Snover et al., 2006) metric. The scaling factors are tuned with MERT 3.2 Exponentially Decaying Bag-of-Words (Och, 2003) with BLEU as optimization criterion As shown in Section 2.2, the exponential decay on the development sets. The systems are evalu- approach is applied to express the distance of con- ated using MultEval (Clark et al., 2011). In the textual words from the current word. Thereby the experiments the maximum size of the n-best lists information of sequence order can be included into applied for reranking is 500. For the translation bag-of-words models. We demonstrated three dif- experiments, the averaged scores are presented on ferent kinds of decay rates for words in the bag- the development set from three optimization runs. of-words input feature, namely the corpus general Experiments are performed using the Jane decay rate, the bag-of-words individual decay rate toolkit (Vilar et al., 2010; Wuebker et al., 2012) and the word individual decay rate. with a log-linear framework containing following Table 1 illustrates the experimental results of feature functions: the neural network translation model with ex- • Phrase translation probabilities both directions ponentially decaying bag-of-words input features on IWSLT 2013 German→English, WMT 2015 • Word lexicon features in both directions German→English and BOLT Chinese→English

Page 61 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

IWSLT WMT BOLT test eval11 newstest2013 test BLEU[%] TER[%] BLEU[%] TER[%] BLEU[%] TER[%] BLEU[%] TER[%] Baseline + NNTM 31.9 47.5 36.7 43.0 28.8 53.8 17.4 67.1 + BoW Features 32.0 47.3 36.9 42.9 28.8 53.5∗ 17.5 67.0 + Fixed DR (0.9) 32.2∗ 47.3 37.0∗ 42.6∗† 29.0 53.5∗ 17.7∗ 66.8∗ + Corpus DR 32.1 47.3 36.9 42.7∗ 29.1∗† 53.5∗ 17.7∗ 66.7∗† + BoW DR 32.4∗† 47.0∗† 37.2∗† 42.4∗† 29.2∗† 53.2∗† 17.9∗† 66.6∗† + Word DR 32.3∗† 47.0∗ 37.1∗ 42.7∗ 29.1∗† 53.4∗ 17.8∗† 66.7∗† Baseline + LSTM 32.2∗ 47.4 37.1∗ 42.5∗† 29.0 53.3∗ 17.6 66.8∗

Table 1: Experimental results of translations using exponentially decaying bag-of-words models with different kinds of decay rates. Improvements by systems marked by ∗ have a 95% statistical significance from the baseline system, whereas † denotes the 95% statistical significant improvements with respect to the BoW Features system (without decay weights). We experimented with several values for the fixed decay rate (DR) and 0.9 performed best. The applied RNN model is the LSTM bidirectional translation model proposed in (Sundermeyer et al., 2014).

translation tasks. Here we applied two bag-of- 3.3 Comparison between Bag-of-Words and words models to separately contain the preced- Large Context Window ing and succeeding words outside the context win- The main motivation behind the usage of the bag- dow. We can see that the bag-of-words feature of-words input features is to provide the model without exponential decay weights only provides with additional context information. We compared small improvements. After appending the de- the bag-of-words input features to different source cay weights, four different kinds of decay rates side windows to refute the argument that simply provide further improvements to varying degrees. increasing the size of the window could achieve The bag-of-words individual decay rate performs the same results. Our experiments showed that in- the best, which gives us improvements by up to creasing the source side window beyond 11 gave 0.5% on BLEU and up to 0.6% on TER. On these no more improvements while the model that used tasks, these improvements even help the feed- the bag-of-words input features is able to achieve forward neural network achieve a similar perfor- the best result (Figure 2). A possible explanation mance to the popular long short-term memory re- for this could be that the feed-forward neural net- current neural network model (Sundermeyer et al., work learns its input position-dependent. If one 2014), which contains three LSTM layers with source word is moved by one position the feed- 200 nodes each. The results of the word individual forward neural network needs to have seen a word decay rate are worse than that of the bag-of-words with a similar word vector at this position dur- decay rate. One reason is that in word individual ing training to interpret it correctly. The likeli- case, the sequence order can still be missing. We hood of precisely getting the position decreases initialize all values for the tunable decay rates with with a larger distance. The bag-of-words model 0.9. In the IWSLT 2013 German→English task, on the other hand will still get the same input only the corpus decay rate is tuned to 0.578. When in- slightly stronger or weaker on the new distance vestigating the values of the trained bag-of-words and decay rate. individual decay rate vector, we noticed that the variance of the value for frequent words is much lower than for rare words. We also observed that 4 Conclusion most function words, such as prepositions and conjunctions, are assigned low decay rates. We The aim of this work was to investigate the influ- could not find a pattern for the trained value vec- ence of exponentially decaying bag-of-words in- tor of the word individual decay rates. put features with trained decay rates on the feed- forward neural network translation model. Ap- plying the standard bag-of-words model as an ad- ditional input feature in our feed-forward neural network translation model only yields slight im-

Page 62 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

provements, since the original bag-of-words representation does not include information about the ordering of each word. To avoid this problem, we applied the exponential decay weight to express the distances between words and propose to train the decay rate like the other weight parameters of the network. Three different kinds of decay rates are proposed; the bag-of-words individual decay rate performs best and provides an average improvement of 0.5% BLEU on three different translation tasks, even outperforming a bidirectional LSTM translation model on these tasks. By contrast, applying additional one-hot encoded input vectors or enlarging the network structure cannot achieve performance as good as the bag-of-words features.

Figure 2: The change of BLEU scores on the eval11 set of the IWSLT 2013 German→English task along with the source context window size. The source windows are always symmetrical with respect to the aligned word. For instance, window size five denotes that two preceding and two succeeding words along with the aligned word are included in the window. The average sentence length of the corpus is about 18 words. The red line is the result of using a model with bag-of-words input features and a bag-of-words individual decay rate.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation models. In EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 294–303, Lisbon, Portugal, September.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables. In Proceedings of MT Summit XIII, pages 269–275, Xiamen, China, September.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 176–181, Portland, OR, USA, June.

P. Clarkson and A. Robinson. 1997. Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 799–802, Washington, DC, USA, April.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1370–1380, Baltimore, MD, USA, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 848–856, Honolulu, HI, USA, October.

Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2015. Bag-of-Words Input for Long History Representation in Neural Network-based Language Models for Speech Recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, pages 2371–2375, Dresden, Germany, September.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2015. Learning longer memory in recurrent neural networks. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.


Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51, March.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA, July.

Holger Schwenk, Daniel Déchelotte, and Jean-Luc Gauvain. 2006. Continuous space language models for statistical machine translation. In Proceedings of the 44th Annual Meeting of the International Committee on Computational Linguistics and the Association for Computational Linguistics, pages 723–730, Sydney, Australia, July.

Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics, pages 1071–1080, Mumbai, India, December.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, MA, USA, August.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, WA, USA, October.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation. In International Conference on Computational Linguistics, pages 483–491, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, WA, USA, October.


H A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

Yunsu Kim, Andreas Guta, Joern Wuebker∗, and Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de
∗Lilt, Inc., [email protected]

Abstract

This work systematically analyzes the smoothing effect of vocabulary reduction for phrase translation models. We extensively compare various word-level vocabularies to show that the performance of smoothing is not significantly affected by the choice of vocabulary. This result provides empirical evidence that the standard phrase translation model is extremely sparse. Our experiments also reveal that vocabulary reduction is more effective for smoothing large-scale phrase tables.

1 Introduction

Phrase-based systems for statistical machine translation (SMT) (Zens et al., 2002; Koehn et al., 2003) have shown state-of-the-art performance over the last decade. However, due to the huge size of the phrase vocabulary, it is difficult to collect robust statistics for many phrase pairs. The standard phrase translation model thus tends to be sparse (Koehn, 2010).

A fundamental solution to a sparsity problem in natural language processing is to reduce the vocabulary size. By mapping words onto a smaller label space, the models can be trained to have denser distributions (Brown et al., 1992; Miller et al., 2004; Koo et al., 2008). Examples of such labels are part-of-speech (POS) tags or lemmas.

In this work, we investigate vocabulary reduction for phrase translation models with respect to various vocabulary choices. We evaluate two types of smoothing models for the phrase translation probability using different kinds of word-level labels. In particular, we use automatically generated word classes (Brown et al., 1992) to obtain label vocabularies with arbitrary sizes and structures. Our experiments reveal that the vocabulary of the smoothing model has no significant effect on the end-to-end translation quality. For example, a randomized label space also leads to a decent improvement of BLEU or TER scores by the presented smoothing models.

We also test vocabulary reduction in translation scenarios of different scales, showing that the smoothing works better with more parallel corpora.

2 Related Work

Koehn and Hoang (2007) propose integrating a label vocabulary as a factor into the phrase-based SMT pipeline, which consists of the following three steps: mapping from words to labels, label-to-label translation, and generation of words from labels. Rishøj and Søgaard (2011) verify the effectiveness of word classes as factors. Assuming probabilistic mappings between words and labels, the factorization implies a combinatorial expansion of the phrase table with regard to different vocabularies.

Wuebker et al. (2013) show a simplified case of the factored translation by adopting hard assignment from words to labels. In the end, they train the existing translation, language, and reordering models on word classes to build the corresponding smoothing models.

Other types of features are also trained on word-level labels, e.g. hierarchical reordering features (Cherry, 2013), an n-gram-based translation model (Durrani et al., 2014), and sparse word pair features (Haddow et al., 2015). The first and the third are trained with a large-scale discriminative training algorithm.

For all usages of word-level labels in SMT,


a common and important question is which label vocabulary maximizes the translation quality. Bisazza and Monz (2014) compare class-based language models with diverse kinds of labels in terms of their performance in translation into morphologically rich languages. To the best of our knowledge, there is no published work on a systematic comparison between different label vocabularies, model forms, and training data sizes for smoothing phrase translation models, the most basic component in state-of-the-art SMT systems. Our work fulfills these needs with extensive translation experiments (Section 5) and quantitative analysis (Section 6) in a standard phrase-based SMT framework.

3 Word Classes

In this work, we mainly use unsupervised word classes by Brown et al. (1992) as the reduced vocabulary. This section briefly reviews the principle and properties of word classes.

A word-class mapping c is estimated by a clustering algorithm that maximizes the following objective (Brown et al., 1992):

    L := \sum_{e_1^I} \sum_{i=1}^{I} p(c(e_i) \mid c(e_{i-1})) \cdot p(e_i \mid c(e_i))    (1)

for a given monolingual corpus, where each e_1^I is a sentence of length I in the corpus. The objective guides c to prefer certain collocations of class sequences, e.g. an auxiliary verb class should succeed a class of pronouns or person names. Consequently, the resulting c groups words according to their syntactic or semantic similarity.

Word classes have a big advantage for our comparative study: the structure and size of the class vocabulary can be arbitrarily adjusted by the clustering parameters. This makes it possible to easily prepare an abundant set of label vocabularies that differ in linguistic coherence and degree of generalization.

4 Smoothing Models

In the standard phrase translation model, the translation probability for each segmented phrase pair (\tilde{f}, \tilde{e}) is estimated by relative frequencies:

    p_{\text{std}}(\tilde{f} \mid \tilde{e}) = \frac{N(\tilde{f}, \tilde{e})}{N(\tilde{e})}    (2)

where N is the count of a phrase or a phrase pair in the training data. These counts are very low for many phrases due to a limited amount of bilingual training data.

Using a smaller vocabulary, we can aggregate the low counts and make the distribution smoother. We now define two types of smoothing models for Equation 2 using a general word-label mapping c.

4.1 Mapping All Words at Once (map-all)

For the phrase translation model, the simplest formulation of vocabulary reduction is obtained by replacing all words in the source and target phrases with the corresponding labels in a smaller space. Namely, we employ the following probability instead of Equation 2:

    p_{\text{all}}(\tilde{f} \mid \tilde{e}) = \frac{N(c(\tilde{f}), c(\tilde{e}))}{N(c(\tilde{e}))}    (3)

which we call map-all. This model resembles the word class translation model of Wuebker et al. (2013) except that we allow any kind of word-level labels.

This model generalizes all words of a phrase without distinction between them. Also, the same formulation is applied to word-based lexicon models.

4.2 Mapping Each Word at a Time (map-each)

More elaborate smoothing can be achieved by generalizing only a sub-part of the phrase pair. The idea is to replace one source word at a time with its respective label. For each source position j, we also replace the target words aligned to the source word f_j. For this purpose, we let a_j ⊆ {1, ..., |\tilde{e}|} denote the set of target positions aligned to j. The resulting model takes a weighted average of the redefined translation probabilities over all source positions of \tilde{f}:

    p_{\text{each}}(\tilde{f} \mid \tilde{e}) = \sum_{j=1}^{|\tilde{f}|} w_j \cdot \frac{N(c^{(j)}(\tilde{f}), c^{(a_j)}(\tilde{e}))}{N(c^{(a_j)}(\tilde{e}))}    (4)

where the superscripts of c indicate the positions that are mapped onto the label space. w_j is a weight for each source position, where \sum_j w_j = 1. We call this model map-each.

We illustrate this model with a pair of three-word phrases: \tilde{f} = [f_1, f_2, f_3] and \tilde{e} = [e_1, e_2, e_3] (see Figure 1 for the in-phrase word alignments).


Figure 1: Word alignments of a pair of three-word phrases.

The map-each model score for this phrase pair is:

    p_{\text{each}}([f_1, f_2, f_3] \mid [e_1, e_2, e_3]) =
          w_1 \cdot \frac{N([c(f_1), f_2, f_3], [c(e_1), e_2, e_3])}{N([c(e_1), e_2, e_3])}
        + w_2 \cdot \frac{N([f_1, c(f_2), f_3], [e_1, e_2, e_3])}{N([e_1, e_2, e_3])}
        + w_3 \cdot \frac{N([f_1, f_2, c(f_3)], [e_1, c(e_2), c(e_3)])}{N([e_1, c(e_2), c(e_3)])}

where the alignments are those depicted by the line segments in Figure 1. First of all, we replace f_1 and also e_1, which is aligned to f_1, with their corresponding labels. As f_2 has no alignment points, we do not replace any target word for it. f_3 triggers the class replacement of two target words at the same time. Note that the model implicitly encapsulates the alignment information.

We empirically found that the map-each model performs best with the following weight:

    w_j = \frac{N(c^{(j)}(\tilde{f}), c^{(a_j)}(\tilde{e}))}{\sum_{j'=1}^{|\tilde{f}|} N(c^{(j')}(\tilde{f}), c^{(a_{j'})}(\tilde{e}))}    (5)

which is a normalized count of the generalized phrase pair itself. Here, the count is relatively large when f_j, the word to be backed off, is less frequent than the other words in \tilde{f}. In contrast, if f_j is a very frequent word and one of the other words in \tilde{f} is rare, the count becomes low due to that rare word. The same logic holds for target words in \tilde{e}. After all, Equation 5 carries more weight when a rare word is replaced with its label. The intuition is that a rare word is the main reason for unstable counts and should be backed off above all. We use this weight for all experiments in the next section.

In contrast, the map-all model merely replaces all words at one time and ignores alignments within phrase pairs.
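For illustration only, the following is a minimal sketch of how the two smoothing scores of Equations (3)-(5) could be computed once counts over the (partially) class-mapped phrase pairs have been collected, e.g. in collections.Counter tables keyed by phrase tuples. The function and argument names are ours and are not taken from the Jane toolkit.

```python
def map_all_prob(src, tgt, cls_src, cls_tgt, pair_counts, tgt_counts):
    """Eq. (3): replace every source and target word by its class label.
       pair_counts / tgt_counts are Counters over fully class-mapped phrases."""
    src_c = tuple(cls_src[w] for w in src)
    tgt_c = tuple(cls_tgt[w] for w in tgt)
    denom = tgt_counts[tgt_c]
    return pair_counts[(src_c, tgt_c)] / denom if denom else 0.0

def map_each_prob(src, tgt, alignment, cls_src, cls_tgt, pair_counts, tgt_counts):
    """Eqs. (4)-(5): generalize one source position j (and its aligned target
       positions a_j) at a time; each term is weighted by the normalized count
       of the generalized phrase pair itself. Here pair_counts / tgt_counts
       hold counts of phrases with exactly those positions class-mapped."""
    terms = []
    for j in range(len(src)):
        gen_src = tuple(cls_src[w] if k == j else w for k, w in enumerate(src))
        aligned = {i for (jj, i) in alignment if jj == j}          # a_j
        gen_tgt = tuple(cls_tgt[w] if i in aligned else w for i, w in enumerate(tgt))
        terms.append((pair_counts[(gen_src, gen_tgt)], tgt_counts[gen_tgt]))
    norm = sum(num for num, _ in terms)            # denominator of Eq. (5)
    if norm == 0:
        return 0.0
    return sum((num / norm) * (num / den) for num, den in terms if den)
```

As in the three-word example above, a source word without alignment points (empty a_j) leaves the target side unchanged in its term.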

5 Experiments

5.1 Setup

We evaluate how much the translation quality is improved by the smoothing models in Section 4. The two smoothing models are trained in both source-to-target and target-to-source directions, and integrated as additional features in the log-linear combination of a standard phrase-based SMT system (Koehn et al., 2003). We also test linear interpolation between the standard and smoothing models, but the results are generally worse than with log-linear interpolation. Note that vocabulary reduction models by themselves cannot replace the corresponding standard models, since this leads to a considerable drop in translation quality (Wuebker et al., 2013).

Our baseline systems include phrase translation models in both directions, word-based lexicon models in both directions, word/phrase penalties, a distortion penalty, a hierarchical lexicalized reordering model (Galley and Manning, 2008), a 4-gram language model, and a 7-gram word class language model (Wuebker et al., 2013). The model weights are trained with minimum error rate training (Och, 2003). All experiments are conducted with the open-source phrase-based SMT toolkit Jane 2 (Wuebker et al., 2012).

To validate our experimental results, we measure the statistical significance using the paired bootstrap resampling method of Koehn (2004). Every result in this section is marked with ‡ if it is statistically significantly better than the baseline with 95% confidence, or with † for 90% confidence.
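As a reminder of how such significance marks are obtained, here is a minimal sketch of paired bootstrap resampling over sentence-aligned outputs; corpus_metric stands for an external corpus-level scorer (e.g. BLEU), and the number of samples is illustrative.

```python
import random

def paired_bootstrap(base_hyps, sys_hyps, refs, corpus_metric, samples=1000, seed=0):
    """Fraction of resampled test sets on which the system scores at least as
       well as the baseline (assuming a higher-is-better metric such as BLEU;
       flip the comparison for TER). A fraction >= 0.95 corresponds to the
       95%-confidence mark used in the tables."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]        # resample with replacement
        base = corpus_metric([base_hyps[i] for i in idx], [refs[i] for i in idx])
        syst = corpus_metric([sys_hyps[i] for i in idx], [refs[i] for i in idx])
        wins += syst >= base
    return wins / samples
```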


5.2 Comparison of Vocabularies

The presented smoothing models are dependent on the label vocabulary, which is defined by the word-label mapping c. Here, we train the models with various label vocabularies and compare their smoothing performance.

The experiments are done on the IWSLT 2012 German→English shared translation task. To rapidly perform repetitive experiments, we train the translation models with the in-domain TED portion of the dataset (roughly 2.5M running words for each side). We run the monolingual word clustering algorithm of Botros et al. (2015) on each side of the parallel training data to obtain class label vocabularies (Section 3).

We carry out comparative experiments regarding the three factors of the clustering algorithm:

1) Clustering iterations. It is shown that the number of iterations is the most influential factor in clustering quality (Och, 1995). We now verify its effect on translation quality when the clustering is used for phrase table smoothing.

As we run the clustering algorithm, we extract an intermediate class mapping for each iteration and train the smoothing models with it. The model weights are tuned for each iteration separately. The BLEU scores of the tuned systems are given in Figure 2. We use 100 classes on both source and target sides.

Figure 2: BLEU scores for clustering iterations when using individually tuned model weights for each iteration. Dots indicate those iterations in which the translation is performed.

The score does not consistently increase or decrease over the iterations; it is rather on a similar level (±0.2% BLEU) for all settings with slight fluctuations. This is an important clue that the whole process of word clustering has no meaning in smoothing phrase translation models.

To see this more clearly, we keep the model weights fixed over different systems and run the same set of experiments. In this way, we focus only on the change of label vocabulary, removing the impact of nondeterministic model weight optimization. The results are given in Figure 3.

Figure 3: BLEU scores for clustering iterations when using a fixed set of model weights. The weights that produce the best results in Figure 2 are chosen.

This time, the curves are even flatter, resulting in only ±0.1% BLEU difference over the iterations. More surprisingly, the models trained with the initial clustering, i.e. when the clustering algorithm has not even started yet, are on a par with those trained with more optimized classes in terms of translation quality.

2) Initialization of the clustering. Since the clustering process has no significant impact on the translation quality, we hypothesize that the initialization may dominate the clustering. We compare five different initial class mappings:

• random: randomly assign words to classes
• top-frequent (default): top-frequent words have their own classes, while all other words are in the last class
• same-countsum: each class has almost the same sum of word unigram counts
• same-#words: each class has almost the same number of words
• count-bins: each class represents a bin of the total count range

Table 1 shows the translation results with the map-each model trained with these initializations, without running the clustering algorithm. We use the same set of model weights used in Figure 3.

    Initialization               BLEU [%]   TER [%]
    Baseline                       28.3      52.2
    + map-each  random             28.9‡     51.7‡
                top-frequent       29.0‡     51.5‡
                same-countsum      28.8‡     51.7‡
                same-#words        28.9‡     51.6‡
                count-bins         29.0‡     51.4‡

Table 1: Translation results for various initializations of the clustering. 100 classes on both sides.

We find that the initialization method also does not affect the translation performance.
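Purely as an illustration of the five initial mappings listed above, the sketch below derives each of them from unigram counts; the exact binning and tie-breaking details are our assumptions and are not taken from the clustering tool used in the paper.

```python
import random

def initial_classes(counts, num_classes, scheme="top-frequent", seed=0):
    """Return a word -> class id mapping for one of the five schemes above."""
    words = sorted(counts, key=counts.get, reverse=True)      # most frequent first
    cls = {}
    if scheme == "random":
        rng = random.Random(seed)
        for w in words:
            cls[w] = rng.randrange(num_classes)
    elif scheme == "top-frequent":
        # the most frequent words get their own class; the rest share the last one
        for rank, w in enumerate(words):
            cls[w] = min(rank, num_classes - 1)
    elif scheme == "same-#words":
        per_class = -(-len(words) // num_classes)             # ceiling division
        for rank, w in enumerate(words):
            cls[w] = rank // per_class
    elif scheme == "same-countsum":
        target = sum(counts.values()) / num_classes
        c, acc = 0, 0
        for w in words:
            cls[w] = c
            acc += counts[w]
            if acc >= target and c < num_classes - 1:
                c, acc = c + 1, 0
    elif scheme == "count-bins":
        # each class covers one equal-width bin of the observed count range
        hi, lo = counts[words[0]], counts[words[-1]]
        width = (hi - lo) / num_classes or 1.0
        for w in words:
            cls[w] = min(int((hi - counts[w]) / width), num_classes - 1)
    return cls
```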


As an extreme case, random clustering is also a fine candidate for training the map-each model.

3) Number of classes. This determines the vocabulary size of a label space, which eventually adjusts the smoothing degree. Table 2 shows the translation performance of the map-each model with a varying number of classes. As before, there is no serious performance gap among different word classes, and POS tags and lemmas also conform to this trend.

However, we observe a slight but steady degradation of translation quality (≈ -0.2% BLEU) when the vocabulary size is larger than a few hundred. We also lose statistical significance for BLEU in these cases. The reason could be: if the label space becomes larger, it gets closer to the original vocabulary and therefore the smoothing model provides less additional information to add to the standard phrase translation model.

                               #vocab (source)   BLEU [%]   TER [%]
    Baseline                                       28.3       52.2
    + map-each (word class)          100           29.0‡      51.5‡
                                     200           28.9†      51.6‡
                                     500           28.7       51.8‡
                                    1000           28.7       51.8‡
                                   10000           28.7       51.9†
    + map-each (POS)                  52           28.9†      51.5‡
    + map-each (lemma)             26744           28.8       51.7‡

Table 2: Translation results for different vocabulary sizes.

This series of experiments shows that the map-each model performs very similarly across vocabulary sizes and structures. From our internal experiments, this argument also holds for the map-all model. The results do not change even when we use a different clustering algorithm, e.g. bilingual clustering (Och, 1999). For the translation performance, the more important factor is the log-linear model training to find an optimal set of weights for the smoothing models.

5.3 Comparison of Smoothing Models

Next, we compare the two smoothing models by their performance in four different translation tasks: IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech. We train 100 classes on each side with 30 clustering iterations starting from the default (top-frequent) initialization.

Table 3 provides the corpus statistics of all datasets used. Note that a morphologically rich language is on the source side for the first two tasks, and on the target side for the last two tasks. According to the results (Table 4), the map-each model, which encourages backing off infrequent words, performs consistently better (maximum +0.5% BLEU, -0.6% TER) than the map-all model in all cases.

5.4 Comparison of Training Data Size

Lastly, we analyze the smoothing performance for different training data sizes (Figure 4). The improvement of the BLEU score over the baseline decreases drastically when the training data get smaller. We argue that this is because the smoothing models are only additional scores for the phrases seen in the training data. For smaller training data, we have more out-of-vocabulary (OOV) words in the test set, which cannot be handled by the presented models.

Figure 4: BLEU scores and OOV rates for the varying training data portion of WMT 2015 Finnish→English data.

6 Analysis

In Section 5.2, we have shown experimentally that more optimized or more fine-grained classes do not guarantee better smoothing performance.


                      IWSLT 2012            WMT 2015              WMT 2014              WMT 2015
                      German    English     Finnish   English     English   German      English   Czech
    Sentences               130k                  1.1M                   4M                   0.9M
    Running Words     2.5M      2.5M        23M       32M         104M      105M        23.9M     21M
    Vocabulary        71k       49k         509k      88k         648k      659k        161k      345k

Table 3: Bilingual training data statistics for IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.

                     de-en              fi-en              en-de              en-cs
                  BLEU    TER        BLEU    TER        BLEU    TER        BLEU    TER
                  [%]     [%]        [%]     [%]        [%]     [%]        [%]     [%]
    Baseline      28.3    52.2       15.1    72.6       14.6    69.8       15.3    68.7
    + map-all     28.6‡   51.6‡      15.3‡   72.5       14.8‡   69.4‡      15.4‡   68.2‡
    + map-each    29.0‡   51.4‡      15.8‡   72.0‡      15.1‡   69.0‡      15.8‡   67.6‡

Table 4: Translation results for IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.

                                                Top 200 TER-improved Sentences
    Model       Classes           #vocab     Common Input [%]    Same Translation [%]
    map-each    optimized            100            -                    -
                non-optimized        100           89.5                 89.9
                random               100           88.5                 89.8
                lemma              26744           87.0                 92.6
    map-all     optimized            100           56.0                 54.5

Table 5: Comparison of translation outputs for the smoothing models with different vocabularies. "optimized" denotes 30 iterations of the clustering algorithm, whereas "non-optimized" means the initial (default) clustering.

We now verify by examining translation outputs that the same level of performance is not by chance but due to similar hypothesis scoring across different systems.

Given a test set, we compare its translations generated from different systems as follows. First, for each translated set, we sort the sentences by how much the sentence-level TER is improved over the baseline translation. Then, we select the top 200 sentences from this sorted list, which represent the main contribution to the decrease of TER. In Table 5, we compare the top 200 TER-improved translations of the map-each model setups with different vocabularies.

In the fourth column, we trace the input sentences that are translated by the top 200 lists, and count how many of those inputs overlap across the given systems. Here, a large overlap indicates that two systems are particularly effective in a large common part of the test set, showing that they behaved analogously in the search process. The numbers in this column are computed against the map-each model setup trained with 100 optimized word classes (first row). For all map-each settings, the overlap is very large, around 90%.

To investigate further, we count how often the two translations of a single input are identical (the last column). This is normalized by the number of common input sentences in the top 200 lists between two systems. It is a straightforward measure to see if two systems discriminate translation hypotheses in a similar manner. Remarkably, all systems equipped with the map-each model produce exactly the same translations for most of the top 200 TER-improved sentences.
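The comparison just described can be written down in a few lines; sentence_ter stands in for an external sentence-level TER scorer, and all names are illustrative rather than taken from the actual evaluation scripts.

```python
def top_ter_improved(sys_hyps, base_hyps, refs, sentence_ter, k=200):
    """Indices of the k sentences whose TER improves most over the baseline."""
    gains = sorted(
        ((sentence_ter(base_hyps[i], refs[i]) - sentence_ter(sys_hyps[i], refs[i]), i)
         for i in range(len(refs))),
        reverse=True)
    return [i for _, i in gains[:k]]

def overlap_measures(top_a, top_b, hyps_a, hyps_b):
    """Common-input overlap and identical-translation rate (both in percent)."""
    common = set(top_a) & set(top_b)
    common_input = 100.0 * len(common) / len(top_a)
    same = (100.0 * sum(hyps_a[i] == hyps_b[i] for i in common) / len(common)
            if common else 0.0)
    return common_input, same
```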


We can see from this analysis that, even though a smoothing model is trained with essentially different vocabularies, it helps the translation process in basically the same manner. For comparison, we also compute the measures for a map-all model, which are far behind the high similarity among the map-each models. Indeed, for smoothing phrase translation models, changing the model structure for vocabulary reduction exerts a strong influence on the hypothesis scoring, yet changing the vocabulary does not.

7 Conclusion

Reducing the vocabulary using a word-label mapping is a simple and effective way of smoothing phrase translation models. By mapping each word in a phrase at a time, the translation quality can be improved by up to +0.7% BLEU and -0.8% TER over a standard phrase-based SMT baseline, which is superior to Wuebker et al. (2013).

Our extensive comparison among various vocabularies shows that different word-label mappings are almost equally effective for smoothing phrase translation models. This allows us to use any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort in optimizing the structure and granularity of the label vocabulary. Our analysis of sentence-level TER demonstrates that the same level of performance stems from the analogous hypothesis scoring.

We claim that this result emphasizes the fundamental sparsity of the standard phrase translation model. Too many target phrase candidates are originally undervalued, so giving them any reasonable amount of extra probability mass, e.g. by smoothing with random classes, is enough to broaden the search space and improve translation quality. Even if we change a single parameter in estimating the label space, it does not have a significant effect on scoring hypotheses, where many models other than the smoothed translation model, e.g. language models, are involved with large weights. Nevertheless, an exact linguistic explanation is still to be discovered.

Our results on varying training data show that vocabulary reduction is more suitable for large-scale translation setups. This implies that OOV handling is more crucial than smoothing phrase translation models for low-resource translation tasks.

For future work, we plan to perform a similar set of comparative experiments on neural machine translation systems.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Arianna Bisazza and Christof Monz. 2014. Class-based language modeling for translating into morphologically rich languages. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1918–1927, Dublin, Ireland, August.

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On efficient training of word classes and their application to recurrent neural network language models. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), pages 1443–1447, Dresden, Germany, September.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

Colin Cherry. 2013. Improved reordering for phrase-based translation using sparse features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 22–31, Atlanta, GA, USA, June.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the usefulness of generalized word representations in SMT. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 421–432, Dublin, Ireland, August.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 848–856, Honolulu, HI, USA, October.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU phrase-based machine translation systems for WMT 2015. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 126–133, Lisbon, Portugal, September.


Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 868–876, Prague, Czech Republic, June.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT 2003), pages 48–54, Edmonton, Canada, May.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 388–395, Barcelona, Spain, July.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 595–603, Columbus, OH, USA, June.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of the 2004 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2004), pages 337–342, Boston, MA, USA, May.

Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany, May.

Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999), pages 71–76, Bergen, Norway, June.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 160–167, Sapporo, Japan, July.

Christian Rishøj and Anders Søgaard. 2011. Factored translation with unsupervised word clusters. In Proceedings of the 2011 EMNLP 6th Workshop on Statistical Machine Translation (WMT 2011), pages 447–451, Edinburgh, Scotland, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open source phrase-based and hierarchical statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 483–492, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1377–1381, Seattle, WA, USA, October.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Matthias Jarke, Jana Koehler, and Gerhard Lakemeyer, editors, 25th German Conference on Artificial Intelligence (KI 2002), volume 2479 of Lecture Notes in Artificial Intelligence (LNAI), pages 18–32, Aachen, Germany, September. Springer Verlag.


I BIRA: Improved Predictive Exchange Word Clustering

BIRA: Improved Predictive Exchange Word Clustering

Jon Dehdari (1,2), Liling Tan (2), and Josef van Genabith (1,2)
(1) DFKI, Saarbrücken, Germany; {jon.dehdari,josef.van_genabith}@dfki.de
(2) University of Saarland, Saarbrücken, Germany; [email protected]

Abstract

Word clusters are useful for many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. Little attention has been paid thus far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but sometimes does not provide as good perplexity as other, slower clustering algorithms.

We introduce the bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. It improves upon the predictive exchange algorithm's perplexity by up to 18%, giving it perplexities comparable to the slower two-sided exchange algorithm, and better perplexities than the slower Brown clustering algorithm. Our BIRA implementation is fast, clustering a 2.5 billion token English News Crawl corpus in 3 hours. It also reduces machine translation training time while preserving translation quality. Our implementation is portable and freely available.

1 Introduction

Words can be grouped together into equivalence classes to help reduce data sparsity and better generalize data. Word clusters are useful in many NLP applications. Within machine translation, word classes are used in word alignment (Brown et al., 1993; Och and Ney, 2000), translation models (Koehn and Hoang, 2007; Wuebker et al., 2013), reordering (Cherry, 2013), preordering (Stymne, 2012), target-side inflection (Chahuneau et al., 2013), SAMT (Zollmann and Vogel, 2011), and OSM (Durrani et al., 2014), among many others.

Word clusterings have also found utility in parsing (Koo et al., 2008; Candito and Seddah, 2010; Kong et al., 2014), chunking (Turian et al., 2010), NER (Miller et al., 2004; Liang, 2005; Ratinov and Roth, 2009; Ritter et al., 2011), structure transfer (Täckström et al., 2012), and discourse relation discovery (Rutherford and Xue, 2014).

Word clusters also speed up normalization in training neural network and MaxEnt language models, via class-based decomposition (Goodman, 2001a). This reduces the normalization time from O(|V|) (the vocabulary size) to approximately O(\sqrt{|V|}). Further improvements to O(log(|V|)) are found using hierarchical softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009).

2 Word Clustering

Word clustering partitions a vocabulary V, grouping together words that function similarly. This helps generalize language and alleviate data sparsity. We discuss flat clustering in this paper. Flat, or strict partitioning, clustering surjectively maps word types onto a smaller set of clusters.

The exchange algorithm (Kneser and Ney, 1993) is an efficient technique that exhibits a general time complexity of O(|V| × |C| × I), where |V| is the number of word types, |C| is the number of classes, and I is the number of training iterations, typically < 20. This omits the specific method of exchanging words, which adds further complexity. Words are exchanged from one class to another until convergence, or until I iterations have been reached.
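To make the exchange procedure concrete, here is a deliberately naive sketch of two-sided exchange clustering. It recomputes the class-bigram likelihood from scratch for every candidate move, which is far slower than the incremental count updates used by real implementations such as mkcls or clustercat; all names are illustrative.

```python
import math
from collections import Counter

def class_bigram_ll(corpus, cls):
    """Log-likelihood of the two-sided model
       P(w_i | w_{i-1}) = p(c(w_i) | c(w_{i-1})) * p(w_i | c(w_i)),
       with maximum-likelihood estimates from the corpus."""
    bigram, history, class_uni, word_uni = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        prev = None
        for w in sent:
            c = cls[w]
            word_uni[w] += 1
            class_uni[c] += 1
            if prev is not None:
                bigram[(prev, c)] += 1
                history[prev] += 1
            prev = c
    ll = sum(n * math.log(n / history[cp]) for (cp, _), n in bigram.items())
    ll += sum(n * math.log(n / class_uni[cls[w]]) for w, n in word_uni.items())
    return ll

def exchange_cluster(corpus, num_classes, max_iters=20):
    """Greedy exchange: move each word to the class that maximizes the likelihood."""
    vocab = sorted({w for sent in corpus for w in sent})
    cls = {w: i % num_classes for i, w in enumerate(vocab)}   # simple initialization
    for _ in range(max_iters):
        moved = 0
        for w in vocab:
            orig = cls[w]
            best_c, best_ll = orig, class_bigram_ll(corpus, cls)
            for c in range(num_classes):
                if c == orig:
                    continue
                cls[w] = c
                ll = class_bigram_ll(corpus, cls)
                if ll > best_ll:
                    best_c, best_ll = c, ll
            cls[w] = best_c
            moved += best_c != orig
        if moved == 0:          # no word changed its class: converged
            break
    return cls
```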


Proceedings of NAACL-HLT 2016, pages 1169–1174, San Diego, California, June 12–17, 2016. © 2016 Association for Computational Linguistics


One of the oldest and still most popular exchange algorithm implementations is mkcls (Och, 1995)[1], which adds various metaheuristics to escape local optima. Botros et al. (2015) introduce their implementation of three exchange-based algorithms. Martin et al. (1998) and Müller and Schütze (2015)[2] use trigrams within the exchange algorithm. Clark (2003) adds an orthotactic bias.[3]

The previous algorithms use an unlexicalized (two-sided) language model: P(w_i | w_{i-1}) = P(w_i | c_i) P(c_i | c_{i-1}), where the class c_i of the predicted word w_i is conditioned on the class c_{i-1} of the previous word w_{i-1}. Goodman (2001b) altered this model so that c_i is conditioned directly upon w_{i-1}, hence: P(w_i | w_{i-1}) = P(w_i | c_i) P(c_i | w_{i-1}). This new model fractionates the history more, but it allows for a large speedup in hypothesizing an exchange, since the history does not change. The resulting partially lexicalized (one-sided) class model gives the accompanying predictive exchange algorithm (Goodman, 2001b; Uszkoreit and Brants, 2008) a time complexity of O((B + |V|) × |C| × I), where B is the number of unique bigrams in the training set.[4] We introduce a set of improvements to this algorithm to enable high-quality large-scale word clusters.

3 BIRA Predictive Exchange

We developed a bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. The goal of BIRA is to produce better clusters by using multiple, changing models to escape local optima. This uses both forward and reversed bigram class models to improve cluster quality by evaluating log-likelihood on two different models. Unlike using trigrams, bidirectional bigram models only linearly increase time and memory requirements, and in fact some data structures can be shared. The two directions are interpolated to allow softer integration of these two models:

    P(w_i \mid w_{i-1}, w_{i+1}) := P(w_i \mid c_i) \cdot (\lambda P(c_i \mid w_{i-1}) + (1 - \lambda) P(c_i \mid w_{i+1}))    (1)

The interpolation weight λ for the forward direction alternates to 1 − λ every a iterations (i):

    \lambda_i := \begin{cases} 1 - \lambda_0 & \text{if } i \bmod a = 0 \\ \lambda_0 & \text{otherwise} \end{cases}    (2)

Figure 1 illustrates the benefit of this λ-inversion in helping to escape local minima, with lower training set perplexity when inverting λ every four iterations.

Figure 1: Training set perplexity using lambda inversion (+Rev), using 100M tokens of the Russian News Crawl (cf. §4.1). Here a = 4, λ_0 = 1, and |C| = 800.

The time complexity is O(2 × (B + |V|) × |C| × I). The original predictive exchange algorithm can be obtained by setting λ = 1 and a = 0.[5]

Another innovation, both in terms of cluster quality and speed, is cluster refinement. The vocabulary is initially clustered into G sets, where |G| ≪ |C|, typically 2–10. After a few iterations (i) of this, the full partitioning C_f is explored. Clustering into G converges very quickly, typically requiring no more than 3 iterations.[6]

    |C_i| := \begin{cases} |G| & \text{if } i \le 3 \\ |C_f| & \text{otherwise} \end{cases}    (3)

The intuition behind this is to group words first into broad classes, like nouns, verbs, adjectives, etc. In contrast to divisive hierarchical clustering and coarse-to-fine methods (Petrov, 2009), after the initial iterations the algorithm is still able to exchange any word to any cluster; there is no hard constraint that the more refined partitions be subsets of the initial coarser partitions. This gives more flexibility in optimizing on log-likelihood, especially given the noise that naturally arises from coarser clusterings. We explored cluster refinement over more stages than just two, successively increasing the number of clusters. We observed no improvement over the two-stage method described above.

Each BIRA component can be applied to any exchange-based clusterer. The contributions of each of these are shown in Figure 2, which reports the development set perplexities (PP) of all combinations of BIRA components over the original predictive exchange algorithm. The data and configurations are discussed in more detail in Section 4. The greatest PP reduction is due to using lambda inversion (+Rev), followed by cluster refinement (+Refine), then interpolating the bidirectional models (+BiDi), with robust improvements by using all three of these: an 18% reduction in perplexity over the predictive exchange algorithm. We have found that both lambda inversion and cluster refinement prevent early convergence at local optima, while bidirectional models give immediate and consistent training set PP improvements, but this is attenuated in a unidirectional evaluation.

Figure 2: Development set PP of combinations of improvements to predictive exchange (cf. §3), using 100M tokens of the Russian News Crawl, with 800 word classes.

We observed that most of the computation for the predictive exchange algorithm is spent on the logarithm function, calculating δ ← δ − N(w, c) · log N(w, c).[7] Since the codomain of N(w, c) is ℕ_0, and due to the power-law distribution of the algorithm's access to these entropy terms, we can precompute N · log N up to, say, 10e+7, with minimal memory requirements.[8] This results in a considerable speedup of around 40%.

Footnotes:
[1] https://github.com/moses-smt/mgiza
[2] http://cistern.cis.lmu.de/marlin
[3] http://bit.ly/1VJwZ7n
[4] Green et al. (2014) provide a free implementation of the original predictive exchange algorithm within the Phrasal MT system, at http://nlp.stanford.edu/phrasal. Another implementation is in the Cicada semiring MT system.
[5] The time complexity is O((B + |V|) × |C| × I) if λ = 1.
[6] The piecewise definition could alternatively be conditioned upon a percentage threshold of moved words.
[7] δ is the change in log-likelihood, and N(w, c) is the count of a given word followed by a given class.
[8] This was independently discovered in Botros et al. (2015).
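As a rough, illustrative sketch of the ingredients above, the following code mirrors the interpolation of Eq. (1), the λ-inversion schedule of Eq. (2), the two-stage refinement schedule of Eq. (3), and a precomputed n·log n table for the δ update; the names and the table size are our assumptions and are not taken from the released clustercat code.

```python
import math

def lambda_schedule(i, a, lambda0):
    """Eq. (2): invert the forward/backward interpolation weight every a iterations."""
    return 1.0 - lambda0 if a > 0 and i % a == 0 else lambda0

def classes_schedule(i, num_coarse, num_full, warmup=3):
    """Eq. (3): cluster into a handful of coarse classes first, then refine."""
    return num_coarse if i <= warmup else num_full

def p_bidi(w, w_prev, w_next, lam, p_w_given_c, p_c_given_prev, p_c_given_next, cls):
    """Eq. (1): P(w | w_prev, w_next) =
       P(w | c(w)) * (lam * P(c(w) | w_prev) + (1 - lam) * P(c(w) | w_next))."""
    c = cls[w]
    return p_w_given_c.get((w, c), 0.0) * (
        lam * p_c_given_prev.get((c, w_prev), 0.0)
        + (1.0 - lam) * p_c_given_next.get((c, w_next), 0.0))

# Lookup table for n * log(n): small counts dominate the delta computation
# delta -= N(w, c) * log N(w, c). The C implementation can afford a table of
# roughly 10^7 entries; 10^6 is used here to keep the sketch light.
_TABLE_SIZE = 1_000_000
_N_LOG_N = [0.0] + [n * math.log(n) for n in range(1, _TABLE_SIZE)]   # 0 * log 0 := 0

def n_log_n(n):
    return _N_LOG_N[n] if n < _TABLE_SIZE else n * math.log(n)
```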



4 Experiments

Our experiments consist of both intrinsic and extrinsic evaluations. The intrinsic evaluation measures the perplexity (PP) of two-sided class-based models for English and Russian, and the extrinsic evaluation measures BLEU scores of phrase-based MT of Russian↔English and Japanese↔English texts.

4.1 Class-based Language Model Evaluation

In this task we used 400, 800, and 1200 classes for English, and 800 classes for Russian. The data comes from the 2011–2013 News Crawl monolingual data of the WMT task.[9] For these experiments the data was deduplicated, shuffled, tokenized, digit-conflated, and lowercased. In order to have a large test set, one line per 100 of the resulting (shuffled) corpus was separated into the test set.[10] The minimum count threshold was set to 3 occurrences in the training set. Table 1 shows information on the resulting corpus.

    Corpus           Tokens   Types   Lines
    English Train    1B       2M      42M
    English Test     12M      197K    489K
    Russian Train    550M     2.7M    31M
    Russian Test     6M       284K    313K

Table 1: Monolingual training and test set sizes.

The clusterings are evaluated on the PP of an external 5-gram unidirectional two-sided class-based language model (LM). The n-gram-order interpolation weights are tuned using a distinct development set of comparable size and quality to the test set.

Table 2 and Figure 3 show perplexity results using a varying number of classes. Two-sided exchange gives the lowest perplexity across the board, although this is within a two-sided LM evaluation.

Footnotes:
[9] http://www.statmt.org/wmt15/translation-task.html
[10] The data setup script is at http://www.dfki.de/~jode03/naacl2016.sh.



Figure 3: 5-gram two-sided class-based LM perplexities for various clusterers on English News Crawl (T = 10^9), varying the number of classes.

We also evaluated clusters derived from word2vec (Mikolov et al., 2013) using various configurations[11], and all gave poor perplexities. BIRA gives better perplexities than both the original predictive exchange algorithm and Brown clusters.[12] The Russian experiments yielded higher perplexities for all clusterings, but otherwise the same comparative results.

    Training Set      2-Sided Ex.   BIRA    Brown   Pred. Ex.
    EN, |C| = 400        193.3      197.3   201.8     220.5
    EN, |C| = 800        155.0      158.1   160.2     178.3
    EN, |C| = 1200       138.4      140.4   141.5     157.6
    RU, |C| = 800        322.4      340.7   350.4     389.3

Table 2: 5-gram two-sided class-based LM perplexities.

In general Brown clusters give slightly worse results relative to exchange-based clusters, since Brown clustering requires an early, permanent placement of frequent words, with further restrictions imposed on the |C|-most frequent words (Liang, 2005).[13] Liang-style Brown clustering is only efficient on a small number of clusters, since there is a |C|^2 term in its time complexity.

The original predictive exchange algorithm has a more fractionated history than the two-sided exchange algorithm. Interestingly, increasing the number of clusters causes a convergence in the word clusterings themselves, while also causing a divergence in the time complexities of these two varieties of the exchange algorithm. The metaheuristic techniques employed by the two-sided clusterer mkcls can be applied to other exchange-based clusterers, including ours, for further improvements.

Table 3 presents wall clock times using the full training set, varying the number of word classes |C| (for English).[14] The predictive exchange-based clusterers (BIRA and Phrasal) exhibit slow increases in time as the number of classes increases, while the others (Brown and mkcls) are much more sensitive to |C|. Our BIRA-based clusterer is three times faster than Phrasal for all these sets.

    Training Set      mkcls   BIRA   Brown   Phrasal
    EN, |C| = 400      39.0    1.0     2.3     3.1
    EN, |C| = 800      48.8    1.4    12.5     5.1
    EN, |C| = 1200     68.8    1.7    25.5     6.2
    RU, |C| = 800      75.0    1.5    14.6     5.5

Table 3: Clustering times (hours) of full training sets. mkcls implements two-sided exchange, and Phrasal implements one-sided predictive exchange.

We performed an additional experiment, adding more English News Crawl training data.[15] Our implementation took 3.0 hours to cluster 2.5 billion training tokens with |C| = 800, using modest hardware.[14]

4.2 Machine Translation Evaluation

We also evaluated the BIRA predictive exchange algorithm extrinsically in machine translation. As discussed in Section 1, word clusters are employed in a variety of ways within machine translation systems, the most common of which is in word alignment, where mkcls is widely used. As training sets get larger every year, mkcls struggles to keep pace, and is a substantial time bottleneck in MT pipelines with large datasets.

Footnotes:
[11] Negative sampling & hierarchical softmax; CBOW & skip-gram; various window sizes; various dimensionalities.
[12] For the two-sided exchange we used mkcls; for the original predictive exchange we used Phrasal's clusterer; for Brown clustering we used Percy Liang's brown-cluster (329dc). All had min-count=3, and all but mkcls (which is not multithreaded) had threads=12, iterations=15.
[13] Recent work by Derczynski and Chester (2016) loosens some restrictions on Brown clustering.
[14] All time experiments used a 2.4 GHz Opteron 8378 featuring 16 threads.
[15] Adding years 2008–2010 and 2014 to the existing training data. This training set was too large for the external class-based LM to fit into memory, so no perplexity evaluation of this clustering was possible.



We used data from the Workshop on Machine Translation 2015 (WMT15) Russian↔English dataset and the Workshop on Asian Translation 2014 (WAT14) Japanese↔English dataset (Nakazawa et al., 2014). Both pairs used standard configurations: truecasing, MeCab segmentation for Japanese, MGIZA alignment, grow-diag-final-and phrase extraction, phrase-based Moses, quantized KenLM 5-gram modified Kneser-Ney LMs, and MERT tuning.

The BLEU score differences between using mkcls and our BIRA implementation are small, but there are a few statistically significant changes, using bootstrap resampling (Koehn, 2004). Table 4 presents the BLEU score changes across varying cluster sizes (*: p-value < 0.05, **: p-value < 0.01). MERT tuning is quite erratic, and some of the BLEU differences could be affected by noise in the tuning process in obtaining quality weight values. Using our BIRA implementation reduces the translation model training time with 500 clusters from 20 hours using mkcls (of which 60% of the time is spent on clustering) to just 8 hours (of which 5% is spent on clustering).

    |C|     EN-RU            RU-EN             EN-JA            JA-EN
    10      20.8 → 20.9*     26.2 → 26.0       23.5 → 23.4      16.9 → 16.8
    50      21.0 → 21.2*     25.9 → 25.7       24.0 → 23.7*     16.9 → 16.9
    100     20.4 → 21.1      25.9 → 25.8       23.8 → 23.5      16.9 → 17.0
    200     21.0 → 20.8      25.8 → 25.9       23.8 → 23.4      17.0 → 16.8
    500     20.9 → 20.9      25.8 → 25.9*      24.0 → 23.8      16.8 → 17.1*
    1000    20.9 → 21.1      25.9 → 26.0**     23.6 → 23.5      16.9 → 17.1

Table 4: BLEU scores (mkcls → BIRA) and significance across cluster sizes (|C|).

5 Conclusion

We have presented improvements to the predictive exchange algorithm that address longstanding drawbacks of the original algorithm compared to other clustering algorithms, enabling new directions in using large-scale, high cluster-size word classes in NLP.

Botros et al. (2015) found that the one-sided model of the predictive exchange algorithm produces better results for training LSTM-based language models compared to two-sided models, while two-sided models generally give better perplexity in class-based LM experiments. Our paper shows that BIRA-based predictive exchange clusters are competitive with two-sided clusters even in a two-sided evaluation. They also give better perplexity than the original predictive exchange algorithm and Brown clustering.

The software is freely available at https://github.com/jonsafari/clustercat.

Acknowledgements

We would like to thank Hermann Ney and Kazuki Irie, as well as the reviewers for their useful comments. This work was supported by the QT21 project (Horizon 2020 No. 645452).

References

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On Efficient Training of Word Classes and their Application to Recurrent Neural Network Language Models. In Proceedings of INTERSPEECH-2015, pages 1443–1447, Dresden, Germany.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Marie Candito and Djamé Seddah. 2010. Parsing Word Clusters. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 76–84, Los Angeles, CA, USA.

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into Morphologically Rich Languages with Synthetic Phrases. In Proceedings of EMNLP, pages 1677–1687, Seattle, WA, USA.

Colin Cherry. 2013. Improved Reordering for Phrase-Based Translation using Sparse Features. In Proceedings of NAACL-HLT, pages 22–31, Atlanta, GA, USA.

Alexander Clark. 2003. Combining Distributional and Morphological Information for Part of Speech Induction. In Proceedings of EACL, pages 59–66.

Leon Derczynski and Sean Chester. 2016. Generalised Brown Clustering and Roll-up Feature Generation. In Proceedings of AAAI, Phoenix, AZ, USA.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the Usefulness of Generalized Word Representations in SMT. In Proceedings of Coling, pages 421–432, Dublin, Ireland.



Joshua Goodman. 2001a. Classes for Fast Maximum Entropy Training. In Proceedings of ICASSP, pages 561–564.

Joshua T. Goodman. 2001b. A Bit of Progress in Language Modeling, Extended Version. Technical Report MSR-TR-2001-72, Microsoft Research.

Spence Green, Daniel Cer, and Christopher Manning. 2014. An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation. In Proceedings of WMT, pages 466–476, Baltimore, MD, USA.

Reinhard Kneser and Hermann Ney. 1993. Improved clustering techniques for class-based statistical language modelling. In Proceedings of EUROSPEECH'93, pages 973–976, Berlin, Germany.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of EMNLP-CoNLL, pages 868–876, Prague, Czech Republic.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388–395.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A Dependency Parser for Tweets. In Proceedings of EMNLP, pages 1001–1012, Doha, Qatar.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL: HLT, pages 595–603, Columbus, OH, USA.

Percy Liang. 2005. Semi-Supervised Learning for Natural Language. Master's thesis, MIT.

Sven Martin, Jörg Liermann, and Hermann Ney. 1998. Algorithms for Bigram and Trigram Word Clustering. Speech Communication, 24(1):19–37.

Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of HLT-NAACL, pages 337–342, Boston, MA, USA.

Andriy Mnih and Geoffrey Hinton. 2009. A Scalable Hierarchical Distributed Language Model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in NIPS-21, volume 21, pages 1081–1088.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of AISTATS, volume 5, pages 246–252.

Thomas Müller and Hinrich Schütze. 2015. Robust Morphological Tagging with Word Representations. In Proceedings of NAACL, pages 526–536, Denver, CO, USA.

Toshiaki Nakazawa, Hideya Mino, Isao Goto, Sadao Kurohashi, and Eiichiro Sumita. 2014. Overview of the first Workshop on Asian Translation. In Proceedings of the Workshop on Asian Translation (WAT).

Franz Josef Och and Hermann Ney. 2000. A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of Coling, pages 1086–1090, Saarbrücken, Germany.

Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Bachelor's thesis (Studienarbeit), Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

Slav Petrov. 2009. Coarse-to-Fine Natural Language Processing. Ph.D. thesis, University of California at Berkeley, Berkeley, CA, USA.

Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of CoNLL, pages 147–155, Boulder, CO, USA.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of EMNLP 2011, pages 1524–1534, Edinburgh, Scotland.

Attapol Rutherford and Nianwen Xue. 2014. Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns. In Proceedings of EACL, pages 645–654, Gothenburg, Sweden.

Sara Stymne. 2012. Clustered Word Classes for Preordering in Statistical Machine Translation. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 28–34, Avignon, France.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure. In Proceedings of NAACL:HLT, pages 477–487, Montréal, Canada.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of ACL, pages 384–394, Uppsala, Sweden.

Jakob Uszkoreit and Thorsten Brants. 2008. Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation. In Proceedings of ACL: HLT, pages 755–762, Columbus, OH, USA.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of EMNLP, pages 1377–1381, Seattle, WA, USA.

Andreas Zollmann and Stephan Vogel. 2011. A Word-Class Approach to Labeling PSCFG Rules for Machine Translation. In Proceedings of ACL-HLT, pages 1–11, Portland, OR, USA.

1174
