This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452.

Deliverable D1.5 Improved Learning for Machine Translation

Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI)

Dissemination Level: Public

31st July, 2016

Grant agreement no.: 645452
Project acronym: QT21
Project full title: Quality Translation 21
Type of action: Research and Innovation Action
Coordinator: Prof. Josef van Genabith (DFKI)
Start date, duration: 1st February, 2015, 36 months
Dissemination level: Public
Contractual date of delivery: 31st July, 2016
Actual date of delivery: 31st July, 2016
Deliverable number: D1.5
Deliverable title: Improved Learning for Machine Translation
Type: Report
Status and version: Final (Version 1.0)
Number of pages: 78
Contributing partners: CUNI, DFKI, RWTH, UVA
WP leader: CUNI
Author(s): Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI)
EC project officer: Susan Fraser

The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v. (TAUS), Netherlands
• text & form GmbH (TAF), Germany
• TILDE SIA (TILDE), Latvia
• Hong Kong University of Science and Technology (HKUST), Hong Kong

For copies of reports, updates on project activities and other QT21-related information, contact:

Prof. Stephan Busemann, DFKI GmbH
[email protected]
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
Phone: +49 (681) 85775 5286
Fax: +49 (681) 85775 5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.qt21.eu/

© 2016, The Individual Authors No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Contents

1 Executive Summary 4

2 Joint Translation and Reordering Sequences 5

3 Alignment-Based Neural Machine Translation 5

4 The QT21/HimL Combined Machine Translation System 6

5 Improved BEER 6

6 Particle Swarm Optimization for MERT 6

7 CharacTER: Translation Edit Rate on Character Level 6

8 Bag-of-Words Input Features for Neural Network 7

9 Vocabulary Reduction for Phrase Table Smoothing 7

10 Faster and Better Word Classes for Word Alignment 8

References 8

Appendices 10

Appendix A Joint Translation and Reordering Sequences 10

Appendix B Alignment-Based Neural Machine Translation 21

Appendix C The QT21/HimL Combined Machine Translation System 33

Appendix D Beer for MT evaluation and tuning 45

Appendix E Particle Swarm Optimization Submission for WMT16 Tuning Task 51

Appendix F CharacTER: Translation Edit Rate on Character Level 58

Appendix G Exponentially Decaying Bag-of-Words Input Features 58

Appendix H Study on Vocabulary Reduction for Phrase Table Smoothing 65

Appendix I BIRA: Improved Predictive Exchange Word Clustering 73


1 Executive Summary

This deliverable reports on the progress in Task 1.3 Improved Learning for Machine Translation, as specified in the project proposal:

Task 1.3 Improved Learning for Machine Translation [M01–M36] (CUNI, DFKI, RWTH, UVA) Experiments performed within this task will address the second objective, namely full structured prediction, with discriminative and integrated training. Focus will be on better correlation of the training criterion with the target metrics, avoidance of overfitting by incorporating certain so far unused smoothing techniques into the training process, and efficient algorithm to allow the use of all training data in all phases of the training process.

The first focus point of Task 1.3 is constructing full structured predictions. We analyzed two different approaches to achieve this. RWTH shows in Section 2 how to model bilingual sentence pairs on the word level together with reordering information as one sequence, so-called joint translation and reordering (JTR) sequences. Paired with a phrase-based machine translation system, this gave a significant improvement of up to 2.2 Bleu points over the baseline. RWTH proposes in Section 3 an alignment-based neural machine translation approach as an alternative to the popular attention-based approach. They demonstrate competitive results on the IWSLT 2013 German→English and BOLT Chinese→English tasks.

Beyond systems designed within a unified framework with a single search for the best translation, we also experiment with a high-level combination of systems, which yields the best overall performance. In this joint effort, the groups from the QT21 and HimL projects built the best system for the English→Romanian translation task of the ACL 2016 First Conference on Machine Translation (WMT 2016); see Section 4.

These systems depend on automatic ways to optimize parameters with a good training criterion. The next focus of this task is therefore a better correlation of the training criterion with the target metrics. To this end, we organized and also participated in the Tuning Task at WMT 2016. The organization of the task itself falls under WP4 and the details of the task as a whole will be described in the corresponding deliverable there. In this deliverable, we include our submissions to the Tuning Task. The Beer evaluation metric for MT was improved by the UvA team to be both of higher quality and significantly faster to compute, which makes it competitive with Bleu for tuning; see Section 5. Another submission to the Tuning Task was contributed by CUNI, who replaced the core optimization algorithm of the standard Mert with Particle Swarm Optimization; see Section 6. Additionally, RWTH introduced a character-level TER, where the edit distance is calculated on the character level while the shift edit is performed on the word level. It has not been tested for tuning yet, but shows very high correlation with human judgment, as described in Section 7.

Even with better metrics in place, we still need to avoid overfitting to rare events while learning from them. We developed multiple ways to do so by applying smoothing in different variations. One approach, described in Section 8 by RWTH, uses bag-of-words input features for feed-forward neural networks with individually trained decay rates, which gave results comparable to LSTM networks. The second approach, also by RWTH, smooths the standard phrase translation probabilities by reducing the vocabulary with a word-label mapping. The experiments showed that the smoothing is not significantly affected by the choice of vocabulary and is more effective for large-scale translation tasks; see Section 9. Section 10 presents work by DFKI on smoothing using word classes for morphologically rich languages. These languages have large vocabularies, which increase training times and data sparsity. The research developed multiple techniques to improve both the scalability and quality of word clusters for word alignment.


2 Joint Translation and Reordering Sequences

In joint work with WP2, RWTH introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. This combines interdependent lexical and alignment dependencies in a single framework. A main advantage of JTR sequences is that they can be modeled in a similar way as a language model, so that well-known modeling techniques can be applied. RWTH tested three of them:

• count-based n-gram models with modified Kneser-Ney smoothing, a well-tested technique for language modeling

• feed-forward neural networks (FFNN), which should generalize better to unseen events

• recurrent neural networks (RNN), which generalize well and can in principle take the complete history into account

Comparisons between the count-based JTR model and the operation sequence model (OSM; Durrani et al., 2013), both used in phrase-based decoding, showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs five. The strongest combination of count and neural network models yields an improvement over the phrase-based system of up to 2.2 Bleu points on the German→English IWSLT task. This combination also outperforms the OSM by up to 1.2 Bleu points on the BOLT Chinese→English tasks. This work appeared as a long paper at EMNLP 2015 (Guta et al., 2015) and is available in Appendix A.
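As an illustration of the conversion, the following Python sketch turns an aligned sentence pair into a simplified JTR-style token sequence (bilingual word pairs plus coarse jump classes). It is a minimal sketch of the idea only, not RWTH's implementation; the token names and data structures are illustrative assumptions, and unaligned source words are ignored here for brevity.

```python
# Minimal sketch of a JTR-style conversion (illustrative, not RWTH's implementation).
# Unaligned target words pair with an empty token; non-monotone alignment steps
# emit a coarse jump class before the next bilingual word pair.

def jtr_sequence(src, tgt, alignment):
    """src, tgt: lists of words; alignment: dict target index -> sorted source indices."""
    tokens = []
    prev_j = -1                                # last translated source position
    for i, e in enumerate(tgt):
        links = alignment.get(i, [])
        if not links:                          # unaligned target word
            tokens.append(("<eps>", e))
            continue
        j = links[0]
        if j not in (prev_j, prev_j + 1):      # non-monotone step: emit a jump class
            if j == prev_j - 1:
                tokens.append("<step_back>")
            elif j > prev_j + 1:
                tokens.append("<jump_fwd>")
            else:
                tokens.append("<jump_back>")
        if j == prev_j:                        # same source word as the previous target word
            tokens.append(("<sigma>", e))
        else:
            tokens.append((src[j], e))
        for j2 in links[1:]:                   # further source words aligned to e
            tokens.append((src[j2], "<sigma>"))
            j = j2
        prev_j = j
    return tokens

# Example: German -> English with one reordering ("kommen ... zurueck" -> "come back")
src = ["kommen", "Sie", "zurueck"]
tgt = ["come", "back"]
alignment = {0: [0], 1: [2]}
print(jtr_sequence(src, tgt, alignment))
# [('kommen', 'come'), '<jump_fwd>', ('zurueck', 'back')]
```

The resulting token stream can then be treated like ordinary text and fed to a standard n-gram toolkit, which is what makes the approach easy to combine with well-known language-modeling machinery.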

3 Alignment-Based Neural Machine Translation

Neural machine translation (NMT) has emerged recently as a successful alternative to traditional phrase-based systems. In NMT, neural networks are used to generate translation candidates during decoding. This is in contrast to the integration of neural networks in phrase-based systems, where the networks are used to score, but not to generate, the translation hypotheses. The neural networks used in NMT so far rely on an attention mechanism that allows the decoder to pay more attention to certain parts of the source sentence when generating a target word. The attention component is implemented as a probability distribution over the source positions. In contrast to the attention mechanism, RWTH proposes an alignment-based approach using the well-known hidden Markov model (HMM). In this approach, translation is a generative process of two steps: alignment and translation. Compared to the attention-based case, this method is more flexible, as it allows training the alignment and translation models separately, while still retaining the possibility of joint training using forced decoding. The models are combined in a log-linear framework and the model weights are tuned with minimum error rate training (Mert). The models use large target vocabularies of up to 149K words, made feasible by a class-factored output layer. By limiting the output evaluation to the top-scoring classes, decoding is sped up significantly. The alignment-based approach is thus another viable NMT approach, capable of slightly outperforming attention-based NMT on the BOLT Chinese→English task. This work was done in cooperation with WP2, will appear at WMT 2016 (Alkhouli et al., 2016), and is available in Appendix B.
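To illustrate the class-factored output layer, the sketch below factors the word probability into p(class | h) · p(word | class, h), so that only the words of a few top-scoring classes need to be evaluated during decoding. All sizes, random weights and the word-to-class mapping are illustrative assumptions, not the actual RWTH models.

```python
import numpy as np

# Sketch of a class-factored output layer:
#   p(word | h) = p(class(word) | h) * p(word | class(word), h)
# Sizes, weights and the word-to-class mapping are illustrative only.
rng = np.random.default_rng(0)
hidden, n_classes, vocab = 128, 100, 20000
word2class = rng.integers(0, n_classes, size=vocab)          # assumed fixed clustering
W_class = rng.normal(scale=0.01, size=(n_classes, hidden))   # class part of the output layer
W_word = rng.normal(scale=0.01, size=(vocab, hidden))        # word part of the output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_prob(h, word):
    c = word2class[word]
    p_class = softmax(W_class @ h)[c]
    members = np.flatnonzero(word2class == c)                # only words sharing class c
    p_word_in_class = softmax(W_word[members] @ h)[list(members).index(word)]
    return p_class * p_word_in_class

h = rng.normal(size=hidden)                                  # hidden state from the network
print(word_prob(h, word=123))
```

The speed-up comes from the second factor: its softmax runs only over the members of one class rather than over the full vocabulary.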


4 The QT21/HimL Combined Machine Translation System

In a joint effort of groups from RWTH Aachen University, LMU Munich, Charles University in Prague, University of Edinburgh, University of Sheffield, Karlsruhe Institute of Technology, LIMSI, University of Amsterdam, and Tilde, we provided the strongest system for the English→Romanian translation task of the ACL 2016 First Conference on Machine Translation (WMT 2016). The submission is a system combination of twelve different statistical machine translation systems built by the participating groups. The systems are combined using RWTH's system combination approach, which merges all translations into one confusion network and extracts the most likely translation by finding the best path through the network. The final official evaluation shows an improvement of 1.0 Bleu point by the system combination over the best single system on newstest2016. An examination of the translations produced by the single systems showed that the combination also increased the number of words produced with the correct morphology. This work was done in cooperation with WP2, will appear at WMT 2016 (Peter et al., 2016a), and is available in Appendix C.
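The core idea of the confusion-network combination can be sketched as follows; the slots, words and scores are toy values, and the real system additionally uses language-model and system-weight features that are omitted here.

```python
# Minimal sketch of confusion-network combination: each slot holds weighted word
# alternatives (including the empty word "" for deletions); the best path simply
# takes the highest-scoring alternative per slot.

confusion_network = [
    {"the": 9.0, "a": 3.0},
    {"committee": 7.0, "commission": 5.0},
    {"has": 8.0, "": 4.0},          # "" means: skip this slot
    {"decided": 10.0, "agreed": 2.0},
]

best_path = [max(slot, key=slot.get) for slot in confusion_network]
print(" ".join(w for w in best_path if w))   # -> "the committee has decided"
```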

5 Improved BEER

Beer, introduced by UvA, is a trained evaluation metric with a linear model that combines features capturing character n-grams and permutation trees. This year, the Beer learning algorithm was improved (a linear SVM instead of logistic regression) and some features that are relatively slow to compute (paraphrasing, syntax and permutation trees) were removed, resulting in a very large speed-up. This speed-up is essential for fast tuning of MT systems: tuning with Beer is now as fast as tuning with Bleu. For some languages, the metric is now even more accurate. An additional change in Beer is that the usual training for ranking is replaced by a compromise: the initial model is trained for relative ranking (RR) with a ranking SVM, and the SVM output is then scaled using a trained regression model to approximate absolute judgments (direct assessment, DA). This work on Beer appeared at WMT 2015 (Stanojević and Sima’an, 2015) and is available in Appendix D.
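The following sketch shows the general shape of such a trained linear metric over character n-gram precision/recall features. The feature set and the weights are made up for illustration and do not correspond to Beer's actual features or learned parameters.

```python
# Sketch of a trained linear sentence-level metric over character n-gram features.
# The features and weights below are illustrative only; Beer's real model is trained
# (ranking SVM plus regression rescaling) and uses a richer feature set.

def char_ngrams(s, n):
    s = s.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def pr_features(hyp, ref, orders=(2, 3, 4)):
    feats = []
    for n in orders:
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum(min(h.count(g), r.count(g)) for g in set(h))
        feats.append(overlap / max(len(h), 1))   # precision
        feats.append(overlap / max(len(r), 1))   # recall
    return feats

weights = [0.2, 0.3, 0.15, 0.2, 0.05, 0.1]       # would normally be learned from human judgments

def score(hyp, ref):
    return sum(w * f for w, f in zip(weights, pr_features(hyp, ref)))

print(score("the cat sat on the mat", "the cat is on the mat"))
```

Because the remaining features are cheap string statistics, a metric of this shape can be evaluated thousands of times per tuning iteration, which is what makes it usable inside Mert-style loops.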

6 Particle Swarm Optimization for MERT

In Kocur and Bojar (2016) (full text in Appendix E), CUNI describes a replacement of the core optimization algorithm of the standard Minimum Error Rate Training (Och, 2003) with Particle Swarm Optimization (PSO; Eberhart et al., 1995). The expected benefit of PSO is the highly parallelizable structure of the algorithm. CUNI implemented PSO for the Moses toolkit and took part in the WMT16 Tuning Task to see how the weights selected by PSO perform in an extrinsic evaluation. PSO indeed runs faster and delivers weights leading to Bleu scores very similar to the Mert ones on the development set. Unfortunately, the manual evaluation of the Tuning Task ranks the system optimized with PSO lower, although still within the same top cluster of participating systems that cannot be distinguished with statistical significance. (Almost all systems participating in the Tuning Task ended up in this cluster.) The experiments so far were limited to optimizing towards Bleu, but other MT evaluation metrics may benefit more than Bleu from the PSO style of exploring the parameter space.
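A generic PSO loop for tuning a weight vector looks roughly as follows; the swarm size, inertia and acceleration constants are common defaults, and the objective is a stand-in for re-ranking a development n-best list and measuring Bleu. This is a sketch of the algorithm in general, not of the CUNI implementation in Moses.

```python
import numpy as np

# Generic particle swarm optimization sketch for tuning a weight vector.
rng = np.random.default_rng(1)
dim, n_particles, iters = 14, 20, 100

def objective(weights):
    # Stand-in for "Bleu of the dev set re-ranked with these weights" (to be maximized).
    return -np.sum((weights - 0.3) ** 2)

pos = rng.uniform(-1.0, 1.0, (n_particles, dim))      # particle positions = weight vectors
vel = np.zeros_like(pos)                              # particle velocities
pbest = pos.copy()                                    # personal best positions
pbest_val = np.array([objective(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()              # global best position

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([objective(p) for p in pos])      # evaluations are embarrassingly parallel
    improved = vals > pbest_val
    pbest[improved] = pos[improved]
    pbest_val[improved] = vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()

print("best weights found:", np.round(gbest, 3))
```

The parallelizability mentioned above comes from the inner evaluation step: all particles can be scored independently in the same iteration.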

7 CharacTER: Translation Edit Rate on Character Level

RWTH introduced CharacTER, a novel character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTER calculates the edit distance on the character level, while the shift edit is performed on the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word, and can be shifted, if the edit distance between them is below a threshold. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed on the character level. In addition, the length of the hypothesis sequence, rather than that of the reference, is used for normalizing the edit distance, which effectively counters the fact that shorter translations normally achieve lower TER. The experimental results showed that CharacTER achieves high correlation with human judgment on the system level, especially for morphologically rich languages, which benefit from the character-level information. It outperforms other strong metrics for translation directions out of English, while the underlying concept remains simple and straightforward. This work was done in cooperation with WP2, will appear at WMT 2016 (Wang et al., 2016), and is available in Appendix F.
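A simplified version of the score can be sketched as below: a character-level Levenshtein distance normalized by the hypothesis length in characters. The word-level shift step and the relaxed word matching of the full metric are omitted here for brevity.

```python
# Simplified CharacTER-style score: character-level edit distance between hypothesis
# and reference, normalized by the hypothesis length in characters. The word-level
# shift step of the full metric is not implemented in this sketch.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_score(hyp, ref):
    hyp_chars = " ".join(hyp.split())
    ref_chars = " ".join(ref.split())
    return edit_distance(hyp_chars, ref_chars) / max(len(hyp_chars), 1)

print(character_score("das Haus ist klein", "das Haus ist sehr klein"))
```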

8 Bag-of-Words Input Features for Neural Network

Neural network models have recently achieved consistent improvements in statistical machine translation. However, most networks use only simple one-hot encoded word vectors as their input. RWTH investigated exponentially decaying bag-of-words input features for feed-forward neural network translation models. Their work showed that the performance of bag-of-words inputs can be improved by training the decay rates along with the other weight parameters. The decay rates fine-tune the effect of distant words on the current translation. Different kinds of decay rates were investigated; a decay rate that depends on the aligned source word performed best. It provides an average improvement of 0.5 Bleu points on three different translation tasks (IWSLT 2013 German→English, WMT 2015 German→English, and BOLT Chinese→English) on top of a state-of-the-art phrase-based system, a baseline which already includes a neural network translation model. The model slightly outperformed a bidirectional LSTM translation model on the given tasks, and the experiments showed that using a bag-of-words vector outperformed using a larger source-side window for the feed-forward neural network on IWSLT. This work will appear at ACL 2016 (Peter et al., 2016b) and is available in Appendix G.
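The following sketch shows how an exponentially decaying bag-of-words input vector can be formed: each context word's embedding is weighted by decay^distance, where the decay rate would be a trainable parameter of the network (here it is simply fixed). The vocabulary, embeddings and decay values are illustrative assumptions.

```python
import numpy as np

# Sketch of an exponentially decaying bag-of-words input vector: each context word's
# embedding is weighted by decay**distance; in the actual model the decay rates are
# trained jointly with the other network parameters.

rng = np.random.default_rng(2)
vocab = {"wir": 0, "kommen": 1, "später": 2, "zurück": 3}
emb = rng.normal(size=(len(vocab), 8))          # word embeddings (illustrative)
decay = np.full(len(vocab), 0.7)                # per-word decay rates (trainable in the model)

def decayed_bow(context_words, current_pos):
    vec = np.zeros(emb.shape[1])
    for pos, w in context_words:
        idx = vocab[w]
        dist = abs(current_pos - pos)
        vec += (decay[idx] ** dist) * emb[idx]  # distant words contribute less
    return vec

context = [(0, "wir"), (1, "kommen"), (2, "später")]
print(decayed_bow(context, current_pos=3))
```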

9 Vocabulary Reduction for Phrase Table Smoothing

To address the sparsity problem of the standard phrase translation model, RWTH reduced the vocabulary size by mapping words into a smaller label space, eventually training denser distributions. They developed a smoothed translation model that maps one word of a phrase at a time to its respective label, yielding gains of up to 0.7 Bleu points over a standard phrase-based SMT baseline. They evaluated the smoothing models using various vocabularies of different sizes and structures, showing that different word-label mappings are almost equally effective for the vocabulary reduction. This allows the use of any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort otherwise spent on optimizing the structure and granularity of the label vocabulary. This result emphasizes the fundamental sparsity of the standard phrase translation model. Tests of the vocabulary reduction in translation scenarios of different scales showed that the smoothing works better with more parallel data, which implies that for low-resource translation tasks OOV handling is more crucial than smoothing the phrase translation model. This work will appear at WMT 2016 (Kim et al., 2016) and is available in Appendix H.
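As a toy illustration of the word-label mapping, the sketch below replaces one source word of each phrase at a time by its label and re-aggregates the counts from which smoothed relative frequencies would be estimated. The label map and the counts are invented for the example and do not reflect the exact model definition in the paper.

```python
from collections import Counter

# Sketch of phrase-table smoothing by vocabulary reduction: one source word of the
# phrase at a time is replaced by its label (e.g. a word class), and counts are
# re-aggregated in the reduced space. The label map and counts are toy data.

word2label = {"house": "C7", "home": "C7", "small": "C3", "tiny": "C3"}

phrase_pair_counts = Counter({
    (("small", "house"), ("kleines", "Haus")): 3,
    (("tiny", "house"), ("winziges", "Haus")): 1,
})

def reduced_counts(position):
    """Replace the source word at `position` by its label and re-aggregate the counts."""
    reduced = Counter()
    for (src, tgt), c in phrase_pair_counts.items():
        src = list(src)
        src[position] = word2label.get(src[position], src[position])
        reduced[(tuple(src), tgt)] += c
    return reduced

print(reduced_counts(0))
# -> both phrases now share the reduced source side ('C3', 'house'),
#    so their counts feed a denser, smoothed distribution
```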


10 Faster and Better Word Classes for Word Alignment

Word clusters are useful for many NLP tasks, including training neural network language models, but current growth in dataset sizes is outpacing the ability of word clusterers to handle them. Little attention has been paid so far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but sometimes does not yield perplexities as good as those of other, slower clustering algorithms. DFKI introduced the bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. It improves upon the predictive exchange algorithm’s perplexity by up to 18%, giving it perplexities comparable to the slower two-sided exchange algorithm and better perplexities than the slower Brown clustering algorithm. The Bira implementation by DFKI is fast, clustering a 2.5-billion-token English News Crawl corpus in 3 hours. It also reduces machine translation training time while preserving translation quality. The implementation is portable and freely available. This work appeared at NAACL 2016 (Dehdari et al., 2016) and is available in Appendix I.
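The flavor of exchange-based clustering can be sketched as follows: each word is moved to the class that maximizes a one-sided class-bigram likelihood, here recomputed naively from scratch on a toy corpus. Real predictive exchange implementations (including Bira) update counts incrementally and scale to billions of tokens; nothing below reflects their actual code.

```python
import math
from collections import Counter

# Naive exchange-style clustering step on a toy corpus. The objective approximates
# p(w2 | w1) by p(class(w2) | w1) * p(w2 | class(w2)) and is recomputed from scratch
# for every candidate move, which real implementations avoid.

corpus = "the cat sat on the mat the dog sat on the rug".split()
words = sorted(set(corpus))
n_classes = 2
cls = {w: i % n_classes for i, w in enumerate(words)}   # arbitrary initialization

def log_likelihood(cls):
    pairs = list(zip(corpus, corpus[1:]))
    pred_counts = Counter((w1, cls[w2]) for w1, w2 in pairs)
    hist_counts = Counter(w1 for w1, _ in pairs)
    word_counts = Counter(w2 for _, w2 in pairs)
    class_counts = Counter(cls[w2] for _, w2 in pairs)
    ll = 0.0
    for w1, w2 in pairs:
        p = (pred_counts[(w1, cls[w2])] / hist_counts[w1]) \
            * (word_counts[w2] / class_counts[cls[w2]])
        ll += math.log(p)
    return ll

for w in words:                                          # one exchange pass over the vocabulary
    cls[w] = max(range(n_classes), key=lambda k: log_likelihood({**cls, w: k}))

print(cls)
```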

References

Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta, and Hermann Ney. 2016. Alignment-based neural machine translation. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Jon Dehdari, Liling Tan, and Josef van Genabith. 2016. BIRA: Improved predictive exchange word clustering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics, San Diego, CA, USA, pages 1169–1174.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov models over minimal translation units help phrase-based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, pages 399–405.

Russ C. Eberhart, James Kennedy, et al. 1995. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science. New York, NY, volume 1, pages 39–43.

Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, and Hermann Ney. 2015. A comparison between count and neural network models based on joint translation and reordering sequences. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 1401–1411.

Yunsu Kim, Andreas Guta, Joern Wuebker, and Hermann Ney. 2016. A comparative study on vocabulary reduction for phrase table smoothing. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Viktor Kocur and Ondřej Bojar. 2016. Particle Swarm Optimization Submission for WMT16 Tuning Task. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. Association for Computational Linguistics, Berlin, Germany, pages 515–521.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the Association for Computational Linguistics. Sapporo, Japan.

Jan-Thorsten Peter, Tamer Alkhouli, Hermann Ney, Matthias Huck, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, Stella Frank, and Marcis Pinnis. 2016a. The QT21/HimL combined machine translation system. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.

Jan-Thorsten Peter, Weiyue Wang, and Hermann Ney. 2016b. Exponentially decaying bag-of-words input features for feed-forward neural network in statistical machine translation. In Annual Meeting of the Association for Computational Linguistics. Berlin, Germany.

Miloš Stanojević and Khalil Sima’an. 2015. Evaluating MT systems with BEER. The Prague Bulletin of Mathematical Linguistics 104:17–26.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTER: Translation edit rate on character level. In ACL 2016 First Conference on Machine Translation. Berlin, Germany.


A Joint Translation and Reordering Sequences

A Comparison between Count and Neural Network Models Based on Joint Translation and Reordering Sequences

Andreas Guta, Tamer Alkhouli, Jan-Thorsten Peter, Joern Wuebker, Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract

We propose a conversion of bilingual sentence pairs and the corresponding word alignments into novel linear sequences. These are joint translation and reordering (JTR) uniquely defined sequences, combining interdepending lexical and alignment dependencies on the word level into a single framework. They are constructed in a simple manner while capturing multiple alignments and empty words. JTR sequences can be used to train a variety of models. We investigate the performances of n-gram models with modified Kneser-Ney smoothing, feed-forward and recurrent neural network architectures when estimated on JTR sequences, and compare them to the operation sequence model (Durrani et al., 2013b). Evaluations on the IWSLT German→English, WMT German→English and BOLT Chinese→English tasks show that JTR models improve state-of-the-art phrase-based systems by up to 2.2 BLEU.

1 Introduction

Standard phrase-based machine translation (Och et al., 1999; Zens et al., 2002; Koehn et al., 2003) uses relative frequencies of phrase pairs to estimate a translation model. The phrase table is extracted from a bilingual text aligned on the word level, using e.g. GIZA++ (Och and Ney, 2003). Although the phrase pairs capture internal dependencies between the source and target phrases aligned to each other, they fail to model dependencies that extend beyond phrase boundaries. Phrase-based decoding involves concatenating target phrases. The burden of ensuring that the result is linguistically consistent falls on the language model (LM).

This work proposes word-based translation models that are potentially capable of capturing long-range dependencies. We do this in two steps: First, given bilingual sentence pairs and the associated word alignments, we convert the information into uniquely defined linear sequences. These sequences encode both word reordering and translation information. Thus, they are referred to as joint translation and reordering (JTR) sequences. Second, we train an n-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the resulting JTR sequences. This yields a model that fuses interdepending reordering and translation dependencies into a single framework.

Although JTR n-gram models are closely related to the operation sequence model (OSM) (Durrani et al., 2013b), there are three main differences. To begin with, the OSM employs minimal translation units (MTUs), which are essentially atomic phrases. As the MTUs are extracted sentence-wise, a word can potentially appear in multiple MTUs. In order to avoid overlapping translation units, we define the JTR sequences on the level of words. Consequently, JTR sequences have smaller vocabulary sizes than OSM sequences and lead to models with less sparsity. Moreover, we argue that JTR sequences offer a simpler reordering approach than operation sequences, as they handle reorderings without the need to predict gaps. Finally, when used as an additional model in the log-linear framework of phrase-based decoding, an n-gram model trained on JTR sequences introduces only one single feature to be tuned, whereas the OSM additionally uses 4 supportive features (Durrani et al., 2013b). Experimental results confirm that this simplification does not make JTR models less expressive, as their performance is on par with the OSM.

Due to data sparsity, increasing the n-gram order of count-based models beyond a certain point becomes useless. To address this, we resort to neu-


ral networks (NNs), as they have been successfully words and do not include reordering information. applied to machine translation recently (Sunder- Durrani et al. (2011) developed the OSM which meyer et al., 2014; Devlin et al., 2014). They are combined dependencies on bilingual word pairs able to score any word combination without re- and reordering information into a single frame- quiring additional smoothing techniques. We ex- work. It used an own decoder that was based on n- periment with feed-forward and recurrent trans- grams of MTUs and predicted single translation or lation networks, benefiting from their smoothing reordering operations. This was further advanced capabilities. To this end, we split the linear se- in (Durrani et al., 2013a) by a decoder that was quence into two sequences for the neural transla- capable of predicting whole sequences of MTUs, tion models to operate on. This is possible due to similar to a phrase-based decoder. In (Durrani et the simplicity of the JTR sequence. We show that al., 2013b), a slightly enhanced version of OSM the count and NN models perform well on their was integrated into the log-linear framework of own, and that combining them yields even better the Moses system (Koehn et al., 2007). Both the results. BILM (Stewart et al., 2014) and the OSM (Durrani In this work, we apply n-gram models with et al., 2014) can be smoothed using word classes. modified Kneser-Ney smoothing during phrase- Guta et al. (2015) introduced the extended trans- based decoding and neural JTR models in rescor- lation model (ETM), which operates on the word ing. However, using a phrase-based system is not level and augments the IBM models by an addi- required by the model, but only the initial step to tional bilingual word pair and a reordering opera- demonstrate the strength of JTR models, which tion. It is implemented into the log-linear frame- can be applied independently of the underlying de- work of a phrase-based decoder and shown to be coding framework. While the focus of this work is competitive with a 7-gram OSM. on the development and comparison of the models, The JTR n-gram models proposed within this the long-term goal is to decode using JTR mod- work can be seen as an extension of the ETM. els without the limitations introduced by phrases, Nevertheless, JTR models utilize linear sequences in order to exploit the full potential of JTR mod- of dependencies and combine the translation of els. The JTR models are estimated on word align- bilingual word pairs and reoderings into a sin- ments, which we obtain using GIZA++ in this pa- gle model. The ETM, however, features separate per. The future aim is to also generate improved models for the translation of individual words and word alignments by a joint optimization of both reorderings and provides an explicit treatment of the alignments and the models, similar to the train- multiple alignments. As they operate on linear se- ing of IBM models (Brown et al., 1990; Brown et quences, JTR count models can be implemented al., 1993). In the long run, we intend to achieve a using existing toolkits for n-gram language mod- consistency between decoding and training using els, e.g. the KenLM toolkit (Heafield et al., 2013). the introduced JTR models. An HMM approach for word-to-phrase align- ments was presented in (Deng and Byrne, 2005), 2 Previous Work showing performance similar to IBM Model 4 on In order to address the downsides of the phrase the task of bitext alignment. Feng et al. 
(2013) translation model, various approaches have been propose several models which rely only on the in- taken. Marino˜ et al. (2006) proposed a bilingual formation provided by the source side and pre- language model (BILM) that operates on bilin- dict reorderings. Contrastingly, JTR models in- gual n-grams, with an own n-gram decoder re- corporate target information as well and predict quiring monotone alignments. The lexical re- both translations and reorderings jointly in a sin- ordering model introduced in (Tillmann, 2004) gle framework. was integrated into phrase-based decoding. Crego Zhang et al. (2013) explore different Markov and Yvon (2010) adapted the approach to BILMs. chain orderings for an n-gram model on MTUs The bilingual n-grams are further advanced in in rescoring. Feng and Cohn (2013) present an- (Niehues et al., 2011), where they operate on non- other generative word-based Markov chain trans- monotone alignments within a phrase-based trans- lation model which exploits a hierarchical Pitman- lation framework. Compared to our JTR models, Yor process for smoothing, but it is only applied their BILMs treat jointly aligned source words as to induce word alignments. Their follow-up work minimal translation units, ignore unaligned source (Feng et al., 2014) introduces a Markov-model on


MTUs, similar to the OSM described above. Algorithm 1 JTR Conversion Algorithm Recently, neural machine translation has J I I 1: procedure JTRCONVERSION( f1 , e1, b1) K emerged as an alternative to phrase-based decod- 2: g1 ← /0 ing, where NNs are used as standalone models to 3:// last translated source position j0 4: j0← 0 decode source input. In (Sutskever et al., 2014), 5: for i ← 1 to I do a recurrent NN was used to encode a source 6: if ei is unaligned then 7:// align ei to the empty word ε sequence, and output a target sentence once the K 8: APPEND(g1 , hε,eii) source sentence was fully encoded in the network. 9: continue The network did not have any explicit treatment 10:// ei is aligned to at least one source word of alignments. Bahdanau et al. (2015) introduced 11: j ← first source position in bi 12: if j = j0 then soft alignments as part of the network architecture. 13:// ei is aligned to the same f j as ei−1 In this work, we make use of hard alignments K 14: APPEND(g1 , hσ,eii) instead, where we encode the alignments in the 15: continue 16: if j 6= j0 + 1 then source and target sequences, requiring no mod- 17:// alignment step is non-monotone J I K 0 ifications of existing feed-forward and recurrent 18: REORDERINGS( f1 , b1, g1 , j , j) NN architectures. Our feed-forward models are 19: // 1-to-1 translation: f j is aligned to ei K based on the architectures proposed in (Devlin et 20: APPEND(g1 , h f j,eii) 0 al., 2014), while the recurrent models are based 21: j ← j 22:// generate all other f j that are also on (Sundermeyer et al., 2014). Further recent 23:// aligned to the current target word ei research on applying NN models for extended 24: for all remaining j in bi do APPEND gK h f , i context was carried out in (Le et al., 2012; Auli 25: ( 1 , j σ ) 26: j0 ← j et al., 2013; Hu et al., 2014). All of these works 27:// check last alignment step at sentence end focus on lexical context and ignore the reordering 28: if j0 6= J then 29:// last alignment step is non-monotone aspect covered in our work. J I K 0 30: REORDERINGS( f1 , b1, g1 , j , J + 1) gK 3 JTR Sequences 31: return 1 32: 33:// called when a reordering class is appended The core idea of this work is the interpretation of J I K 0 34: procedure REORDERINGS( f1 , b1, g1 , j , j) a bilingual sentence pair and its word alignment 35:// check if the predecessor is unaligned as a linear sequence of K joint translation and re- 36: if f j−1 is unaligned then ordering (JTR) tokens gK. Formally, the sequence 37:// get unaligned predecessors 1 j−1 K J I I 38: f j ← unaligned predecessors of f j g1 ( f1 ,e1,b1) is a uniquely defined interpretation 0 J I 39:// check if the alignment step to the first of a given source sentence f1 , its translation e1 and 40:// unaligned predecessor is monotone I 0 the inverted alignment b1, where bi denotes the 41: if j0 6= j + 1 then 42:// non-monotone: add reordering class ordered sequence of source positions j aligned to K 43: APPEND(g , ∆ 0 ) target position i. We drop the explicit mention of 1 j , j0 44:// translate unaligned predecessors by ε ( f J,eI ,bI ) to allow for a better readability. Each 1 1 1 45: for f ← f j0 to f j−1 do K JTR token is either an aligned bilingual word pair 46: APPEND(g1 , h f ,εi) h f ,ei or a reordering class ∆ j0 j. 47: else 48:// non-monotone: add reordering class Unaligned words on the source and target side K 49: APPEND(g , ∆ 0 ) are processed as if they were aligned to the empty 1 j , j word ε. 
Hence, an unaligned source word f gener- ates the token h f ,εi, and an unaligned target word e the token hε,ei. words. Similar to Feng and Cohn (2013), we clas- 0 Each word of the source and target sentences is sify the reordered source positions j and j by ∆ j0 j: to appear in the corresponding JTR sequence ex- actly once. For multiply-aligned target words e,  the first source word f that is aligned to e gener- step backward (←), j = j0 − 1  0 0 ates the token h f ,ei. All other source words f , ∆ j0 j = jump forward (y), j > j + 1 that are also aligned to e, are processed as if they  0 jump backward (x), j < j − 1. were aligned to the artificial word σ. Thus, each of these f 0 generates a token h f 0,σi. The same approach is applied to multiply-aligned source The reordering classes are illustrated in Figure 1.


Figure 1: Overview of the different reordering classes in JTR sequences: (a) step backward, (b) jump forward, (c) jump backward.

3.1 Sequence Conversion . Algorithm 1 presents the formal conversion of a code bilingual sentence pair and its alignment into the your K K enter corresponding JTR sequence g1 . At first, g1 is initialized by an empty sequence (line 2). For each , target position i = 1,...,I it is extended by at least field one token. During the generation process, we store Command the last visited source position j0 (line 4). If a tar- the get word ei is in .

• unaligned, we align it to the empty word ε im ein Sie Code K Feld Ihren and append hε,eii to the current g1 (line 8), geben Befehl • if it is aligned to the same f j as ei−1, we only add hσ,eii (line 14), Figure 2: This example illustrates the JTR se- quence gK for a German→English sentence pair • otherwise we append h f j,eii (line 20) and 1 • in case there are more source words aligned including the word-to-word alignment. to ei, we additionally append h f j,σi for each of these (line 24). token has to be generated right before h.,.i is

Before a token h f j,eii is generated, we have to generated. Therefore, there is no forward jump check whether the alignment step from j0 to j is from hCode,codei to h.,.i, but a monotone step monotone (line 16). In case it is not, we have to to hein,εi followed by h.,.i. deal with reorderings (line 34). We define that 3.2 Training of Count Models a token h f j−1,εi is to be generated right before As the JTR sequence gK is a unique interpretation the generation of the token containing f j. Thus, 1 if f is not aligned, we first determine the con- of a bilingual sentence pair and its alignment, the j−1 J I I j−1 probability p( f ,e ,b ) can be computed as: tiguous sequence of unaligned predecessors f 1 1 1 j0 0 J I I K (line 38). Next, if the step from j to j0 is not p( f1 ,e1,b1) = p(g1 ). (1) monotone, we add the corresponding reordering The probability of gK can be factorized and ap- class (line 43). Afterwards we append all h f ,εi 1 j0 proximated by an n-gram model. to h f j−1,εi. If f j−1 is aligned, we do not have to K process unaligned source words and only append K k−1 p(g1 ) = ∏ p(gk|gk−n+1) (2) the corresponding reordering class (line 49). k=1 Figure 2 illustrates the generation steps of a Within this work, we first estimate the Viterbi JTR sequence, whose result is presented in Ta- alignment for the bilingual training data using ble 1. The alignment steps are denoted by the ar- GIZA++ (Och and Ney, 2003). Secondly, the con- rows connecting the alignment points. The first version presented in Algorithm 1 is applied to ob- dashed alignment point indicates the hε,,i token tain the JTR sequences, on which we estimate an that is generated right after the hFeld,fieldi to- n-gram model with modified Kneser-Ney smooth- ken. The second dashed alignment point indicates ing as described in (Chen and Goodman, 1998) us- 1 the hein,εi token, which corresponds to the un- ing the KenLM toolkit (Heafield et al., 2013). aligned source word ein. Note, that the hein,εi 1https://kheafield.com/code/kenlm/


k gk sk tk De→En, this led to a baseline weaker by 0.2 BLEU 1 y δ y than the one described in Section 5. In order to 2 him,ini im in have an unconstrained and fair baseline, we there- 3 hσ,thei σ the after removed this constraint and forced such dele- 4 y δ y 5 hBefehl,Commandi Befehl Command tion tokens to be generated at the end of the se- 6 ← δ ← quence. Hence, we accept that the JTR model 7 hFeld,fieldi Feld field 8 hε,,i ε , might compute the wrong score in these special 9 x δ x cases. 10 hgeben,enteri geben enter 11 hSie,σi Sie σ 4 Neural Networks 12 y δ y 13 hIhren,youri Ihren your 14 hCode,codei Code code Usually, smoothing techniques are applied to 15 hein,εi ein ε count-based models to handle unseen events. A 16 h.,.i .. neural network does not suffer from this, as it is able to score unseen events without additional Table 1: The left side of this table presents the JTR smoothing techniques. In the following, we will tokens gk corresponding to Figure 2. The right describe how to adapt JTR sequences to be used side shows the source and target tokens sk and tk with feed-forward and recurrent NNs. obtained from the JTR tokens gk. They are used The first thing to notice is the vocabulary size, for the training of NNs (cf. Section 4). mainly determined by the number of bilingual word pairs, which constituted atomic units in the count-based models. NNs that compute probabil- 3.3 Integration into Phrase-based Decoding ity values at the output layer evaluate a softmax Basically, each phrase table entry is annotated function that produces normalized scores that sum with both the word alignment information, which up to unity. The softmax function is given by: also allows to identify unaligned source words, i−1 oe (e1 ) and the corresponding JTR sequence. The JTR i−1 e i p(ei|e ) = (3) model is added to the log-linear framework as an 1 |V| o (ei−1) ∑ e w 1 additional n-gram model. Within the phrase-based w=1

decoder, we extend each search state such that it where oei and ow are the raw unnormalized output additionally stores the JTR model history. layer values for the words ei and w, respectively, In comparison to the OSM, the JTR model does and |V| is the vocabulary size. The output layer i−1 not predict gaps. Local reorderings within phrases is a function of the context e1 . Computing the are handled implicitly. On the other hand, we rep- denominator is expensive for large vocabularies, resent long-range reorderings between phrases by as it requires computing the output for all words. the coverage vector and limit them by reordering Therefore, we split JTR tokens gk and use indi- constraints. vidual words as input and output units, such that Phrase-pairs ending with unaligned source the NN receives jumps, source and target words as words at their right boundary prove to be a prob- input and outputs target words and jumps. Hence, lem during decoding. As shown in Subsection 3.1, the resulting neural model is not a LM, but a trans- the conversion from word alignments to JTR se- lation model with different input and output vo- K quences assumes that each token corresponding to cabularies. A JTR sequence g1 is split into its K K an unaligned source word is generated immedi- source and target parts s1 and t1 . The construc- K ately before the token corresponding to the closest tion of the JTR source sequence s1 proceeds as aligned source position to its right. However, if a follows: Whenever a bilingual pair is encountered, phrase ends with an unaligned f j as its rightmost the source word is kept and the target word is dis- source word, the generation of the h f j,εi token has carded. In addition, all jump classes are replaced K to be postponed until the next word f j+1 is to be by a special token δ. The JTR target sequence t1 is translated or, even worse, f j+1 has already been constructed similarly by keeping the target words translated before. and dropping source words, and the jump classes To address this issue, we constrained the phrase are also kept. Table 1 shows the JTR source and table extraction to discard entries with unaligned target sequences corresponding to JTR sequence source tokens at the right boundary. For IWSLT of Figure 2.
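Section 4 above describes splitting each JTR token into a source token s_k and a target token t_k (cf. Table 1): bilingual pairs contribute their source word to the source stream and their target word to the target stream, while jump classes become a generic placeholder on the source side and are kept on the target side. A minimal sketch of that split, reusing the simplified (and purely illustrative) token format from the sketch in Section 2 of the main report:

```python
# Sketch of splitting a JTR sequence into the source and target streams used as
# neural network input and output (cf. Table 1). Token names are illustrative.

JUMPS = {"<step_back>", "<jump_fwd>", "<jump_back>"}

def split_jtr(jtr_tokens):
    source, target = [], []
    for tok in jtr_tokens:
        if tok in JUMPS:            # reordering class: placeholder on source side, kept on target side
            source.append("<delta>")
            target.append(tok)
        else:                       # bilingual word pair (f, e)
            f, e = tok
            source.append(f)
            target.append(e)
    return source, target

jtr = [("kommen", "come"), "<jump_fwd>", ("zurueck", "back")]
print(split_jtr(jtr))
# (['kommen', '<delta>', 'zurueck'], ['come', '<jump_fwd>', 'back'])
```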


Due to the design of the JTR sequence, pro- layer with the target input tk−1 as well, that is, ducing the source and target JTR sequences is we aggregate the embeddings of the input source straightforward. The resulting sequences can then word sk and the input target word tk−1 before they be used with existing NN architectures, without are fed into the forward layer. Due to recurrency, k−1 k further modifications to the design of the net- the forward layer encodes the parts (t1 ,s1), and K works. This results in powerful models that re- the backward layer encodes sk , and together they k−1 K quire little effort to implement. encode (t1 ,s1 ), which is used to score the out- put target word tk. For the sake of comparison 4.1 Feed-forward Neural JTR to FFNN and count models, we also experiment First, we will apply a feed-forward NN (FFNN) to with a recurrent model that does not include future the JTR sequence. FFNN models resemble count- source information, this is obtained by replacing K k based models in using a predefined limited context the term s1 with s1 in Eq. 5. It will be referred size, but they do not encounter the same smooth- to as the unidirectional recurrent neural network ing problems. In this work, we use a FFNN similar (URNN) model in the experiments. to that proposed in (Devlin et al., 2014), defined Note that the JTR source and target sides as: include jump information, therefore, the RNN K model described above explicitly models reorder- p(tK|sK) ≈ p(t |tk−1,sk ). 1 1 ∏ k k−n k−n (4) ing. In contrast, the models proposed in (Sunder- k=1 meyer et al., 2014) do not include any jumps, and It scores the JTR target word tk at position k us- hence do not provide an explicit way of includ- ing the current source word sk, and the history of ing word reordering. In addition, the JTR RNN n JTR source words. In addition, the n JTR target models do not require the use of IBM-1 lexica to words preceding tk are used as context. The FFNN resolve multiply-aligned words. As discussed in computes the score by looking up the vector em- Section 3, these cases are resolved by aligning the beddings of the source and target context words, multiply-aligned word to the first word on the op- concatenating them, then evaluating the rest of the posite side. network. We reduce the output layer to a short- The integration of the NNs into the decoder is list of the most frequent words, and compute word not trivial, due to the dependence on the target class probabilities for the remaining words. context. In the case of RNNs, the context is un- bounded, which would affect state recombination, 4.2 Recurrent Neural JTR and lead to less variety in the beam used to prune Unlike feed-forward NNs, recurrent NNs (RNNs) the search space. Therefore, the RNN scores are enable the use of unbounded context. Following computed using approximations instead (Auli et (Sundermeyer et al., 2014), we use bidirectional al., 2013; Alkhouli et al., 2015). In (Alkhouli et recurrent NNs (BRNNs) to capture the full JTR al., 2015), it is shown that approximate RNN inte- source side. The BRNN uses the JTR target side gration into the phrase-based decoder has a slight as well as the full JTR source side as context, and advantage over n-best rescoring. Therefore, we it is given by: apply RNNs in rescoring in this work, and to al- K low for a direct comparison between FFNNs and K K k−1 K p(t1 |s1 ) = ∏ p(tk|t1 ,s1 ) (5) RNNs, we apply FFNNs in rescoring as well. k=1

This equation is realized by a network that uses 5 Evaluation forward and backward recurrent layers to capture We perform experiments on the large- the complete source sentence. By a forward layer scale IWSLT 20132 (Cettolo et al., we imply a recurrent hidden layer that processes 2014) German→English, WMT 20153 a given sequence from left to right, while a back- German→English and the DARPA BOLT ward layer does the processing backwards, from Chinese→English tasks. The statistics for the right to left. The source sentence is basically split bilingual corpora are shown in Table 2. Word at a given position k, then past and future represen- alignments are generated with the GIZA++ toolkit tations of the sentence are recursively computed by the forward and backward layers, respectively. 2http://www.iwslt2013.org To include the target side, we provide the forward 3http://www.statmt.org/wmt15/


                IWSLT                 WMT                   BOLT
                German    English     German    English     Chinese   English
Sentences              4.32M                4.22M                 4.08M
Run. Words      108M      109M        106M      108M        78M       86M
Vocabulary      836K      792K        814K      773K        384K      817K

Table 2: Statistics for the bilingual training data of the IWSLT 2013 German→English, WMT 2015 German→English, and the DARPA BOLT Chinese→English translation tasks.

(Och and Ney, 2003). We use a standard phrase- 5.1 Tasks description based translation system (Koehn et al., 2003). The domain of IWSLT consists of lecture-type The decoding process is implemented as a beam talks presented at TED conferences which are also search. All baselines contain phrasal and lexical available online4. All systems are optimized on smoothing models for both directions, word and the dev2010 corpus, named dev here. Some phrase penalties, a distance-based reordering of the OSM and JTR systems are trained on the model, enhanced low frequency features (Chen TED portions of the data containing 138K sen- et al., 2011), a hierarchical reordering model tences. To estimate the 4-gram LM, we addi- (HRM) (Galley and Manning, 2008), a word tionally make use of parts of the Shuffled News, class LM (Wuebker et al., 2013) and an n-gram LDC English Gigaword and 109-French-English LM. The lexical and phrase translation models of corpora, selected by a cross-entropy difference cri- all baseline systems are trained on all provided terion (Moore and Lewis, 2010). In total, 1.7 bil- bilingual data. The log-linear feature weights are lion running words are taken for LM training. The tuned with minimum error rate training (MERT) BOLT Chinese→English task is evaluated on the (Och, 2003) on BLEU (Papineni et al., 2001). All “discussion forum” domain. The 5-gram LM is systems are evaluated with MultEval (Clark et al., trained on 2.9 billion running words in total. The 2011). The reported BLEU scores are averaged in-domain data consists of a subset of 67.8K sen- over three MERT optimization runs. tences and we used a set of 1845 sentences for tun- ing. The evaluation set test1 contains 1844 and test2 1124 sentences. For the WMT task, we All LMs, OSMs and count-based JTR models used the target side of the bilingual data and all are estimated with the KenLM toolkit (Heafield et monolingual data to train a pruned 5-gram LM on al., 2013). The OSM and the count-based JTR a total of 4.4 billion running words. We concate- model are implemented in the phrasal decoder. nated the newstest2011 and newstest2012 NNs are used only in rescoring. The 9-gram corpora for tuning the systems. FFNNs are trained with two hidden layers. The short lists contain the 10k most frequent words, 5.2 Results and all remaining words are clusterd into 1000 → word classes. The projecton layer has 17 × 100 We start with the IWSLT 2013 German English nodes, the first hidden layer 1000 and the sec- task, where we compare between the different JTR ond 500. The RNNs have LSTM architectures. and OSM models. The results are shown in Ta- n The URNN has 2 hidden layers while the BRNN ble 3. When comparing the in-domain -gram has one forward, one backward and one addi- JTR model trained using Kneser-Ney smoothing n tional hidden layer. All layers have 200 nodes, (KN) to OSM, we observe that the -gram KN . LEU while the output layer is class-factored using 2000 JTR model improves the baseline by 1 4 B test eval11 classes. For the count-based JTR model and OSM on both and . The OSM model we tuned the n-gram size on the tuning set of each performs similarly, with a slight disadvantage on eval11 task. For the full data, 7-grams were used for the . In comparison, the FFNN of Eq. (4) im- . . LEU IWSLT and WMT tasks, and 8-grams for BOLT. proves the baseline by 0 7–0 9 B , compared to . . LEU When using in-domain data, smaller n-gram sizes the slightly better 0 8–1 1 B achieved by the were used. All rescoring experiments used 1000- URNN. 
The difference between the FFNN and the best lists without duplicates. 4http://www.ted.com/


data dev test eval11 part of the source input when scoring target words. This information is not used by the KN model. baseline full 33.3 30.8 35.7 Moreover, the BRNN is able to score word com- +OSM TED 34.5 32.2 36.8 binations unseen in training, while the KN model +FFNN TED 34.0 31.7 36.4 uses backing off to score unseen events. +URNN TED 34.2 31.9 36.5 When training the KN, FFNN, and OSM mod- +BRNN TED 34.4 32.1 36.8 els on the full data, we observe less gains in com- +KN TED 34.6 32.2 37.1 parison to in-domain data training. However, com- +BRNN TED 35.0 32.8 37.7 bining the KN models trained on in-domain and +OSM full 34.1 31.6 36.5 full data gives additional gains, which suggests +FFNN full 33.9 31.5 36.0 that although the in-domain model is more adapted +KN full 34.2 31.6 36.6 to the task, it still can gain from out-of-domain data. Adding the FFNN on top improves the com- +KN TED 34.9 32.4 37.1 bination. Note here that the FFNN sees the same +FFNN TED 35.2 32.7 37.2 information as the KN model, but the difference is +FFNN full 35.1 32.7 37.2 that the NN operates on the word level rather than +BRNN TED 35.5 33.0 37.4 the word-pair level. Second, the FFNN is able to +BRNN TED 35.4 33.0 37.3 handle unseen sequences by design, without the need for the backing off workaround. The BRNN Table 3: Results measured in BLEU for the IWSLT improves the combination more than the FFNN, German→English task. as the model captures an unbounded source and target history in addition to an unbounded future source context. Combining the KN, FFNN and train data test1 test2 BRNN JTR models leads to an overall gain of 2.2 baseline 18.1 17.0 BLEU on both dev and test. +OSM indomain 18.8 17.2 Next, we present the BOLT Chinese→English +FFNN indomain 18.6 17.6 results, shown in Table 4. Comparing n-gram +BRNN indomain 18.6 17.6 KN JTR and OSM trained on the in-domain data +KN indomain 18.8 17.5 shows they perform equally well on test1, im- proving the baseline by 0.7 BLEU, with a slight ad- +OSM full 18.5 17.2 vantage for the JTR model on test2. The feed- +FFNN full 18.4 17.4 forward and the recurrent in-domain networks +KN full 18.8 17.3 yield the same results in comparison to each other. +KN indomain 19.0 17.7 Training the OSM and JTR models on the full data +FFNN full 19.2 18.3 yields slightly worse results than in-domain train- +RNN indomain 19.3 18.4 ing. However, combining the two types of training improves the results. This is shown when adding Table 4: Results measured in BLEU for the BOLT the in-domain KN JTR model on top of the model Chinese→English task. trained on full data, improving it by up to 0.4 BLEU. Rescoring with the feed-forward and the recurrent network improves this even further, sup- URNN is that the latter captures the unbounded porting the previous observation that the n-gram source and target history that extends until the be- KN JTR and NNs complement each other. The ginning of the sentences, giving it an advantage combination of the 4 models yields an overall im- over the FFNN. The performance of the URNN provement of 1.2–1.4 BLEU. can be improved by including the future part of the Finally, we compare KN JTR and OSM models source sentence, as described in Eq. (5), resulting on the WMT German→English task in Table 5. in the BRNN model. Next, we explore whether the The two models perform almost similar to each models are additive. When rescoring the n-gram other. The JTR model improves the baseline by KN JTR output with the BRNN, an additional im- up to 0.7 BLEU. 
Rescoring the KN JTR with the provement of 0.6 BLEU is obtained. There are two FFNN improves it by up to 0.3 BLEU leading to an reasons for this: The BRNN includes the future overall improvement between 0.5 and 1.0 BLEU.


                newstest
            2013   2014   2015
baseline    28.1   28.6   29.4
+OSM        28.6   28.9   30.0
  +FFNN     28.7   28.9   29.7
+KN         28.8   28.9   29.9
  +FFNN     29.1   29.1   30.0

Table 5: Results measured in BLEU for the WMT German→English task.

5.3 Analysis

To investigate the effect of including jump information in the JTR sequence, we trained a BRNN using jump classes and another excluding them. The BRNNs were used in rescoring. Below, we demonstrate the difference between the systems:

source: wir kommen später noch auf diese Leute zurück .
reference: We'll come back to these people later .

Hypothesis 1:
JTR source: wir kommen δ zurück δ später noch auf diese Leute δ .
JTR target: we come y back x later σ to these people y .

Hypothesis 2:
JTR source: wir kommen später noch auf diese Leute zurück .
JTR target: we come later σ on these guys back .

Note the German verb "zurückkommen", which is split into "kommen" and "zurück". German places "kommen" at the second position and "zurück" towards the end of the sentence. Unlike German, the corresponding English phrase "come back" has the words adjacent to each other. We found that the system including jumps prefers the correct translation of the verb, as shown in Hypothesis 1 above. The system translates "kommen" to "come", jumps forward to "zurück", translates it to "back", then jumps back to continue translating the word "später". In contrast, the system that excludes jump classes is blind to this separation of words. It favors Hypothesis 2, which is a strictly monotone translation of the German sentence. This is also reflected by the BLEU scores, where we found the system including jump classes outperforming the one without by up to 0.8 BLEU.

6 Conclusion

We introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. These combine interdependent lexical and alignment dependencies in a single framework. A main advantage of JTR sequences is that a variety of models can be trained on them. Here, we have estimated n-gram models with modified Kneser-Ney smoothing, as well as FFNN and RNN architectures, on JTR sequences.

We compared our count-based JTR model to the OSM, both used in phrase-based decoding, and showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs 5. Due to the flexibility of JTR sequences, we can also apply them to FFNNs and RNNs. Utilizing two count models and applying both networks in rescoring gains the overall highest improvement over the phrase-based system, by up to 2.2 BLEU on the German→English IWSLT task. The combination outperforms the OSM by up to 1.2 BLEU on the BOLT Chinese→English tasks.

The JTR models are not dependent on the phrase-based framework, and one of the long-term goals is to perform standalone decoding with the JTR models independently of phrase-based systems. Without the limitations introduced by phrases, we believe that JTR models could perform even better. In addition, we aim to use JTR models to obtain the alignment, which would then be used to train the JTR models in an iterative manner, achieving consistency and hoping for improved models.

Acknowledgements

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). This material is partially based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
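For readers who want to see the representation spelled out, the following is a minimal sketch of turning one word-aligned sentence pair into a single token stream that interleaves lexical choices with reordering jumps. It is our own simplified illustration; the exact JTR encoding and its jump-symbol inventory (δ, σ, ...) are defined earlier in this paper and differ in detail.

# Hypothetical sketch: turning a word-aligned sentence pair into a JTR-like
# token sequence. An explicit jump token is emitted whenever the next source
# position read is not the successor of the previous one, so that lexical and
# reordering decisions end up in one stream. This is an illustration only.

def jtr_sequence(src, tgt, alignment):
    """src, tgt: lists of words; alignment: list of (src_pos, tgt_pos) links."""
    # one source position per target word (unaligned / multiply aligned words
    # would need the affiliation heuristics discussed in Appendix B)
    affiliation = {j: i for i, j in sorted(alignment)}
    seq, prev = [], -1
    for j, e in enumerate(tgt):
        i = affiliation.get(j, prev + 1)
        jump = i - (prev + 1)
        if jump != 0:
            seq.append("<jump%+d>" % jump)   # reordering token
        seq.append("%s|%s" % (src[i], e))    # joint translation token
        prev = i
    return seq

print(jtr_sequence(
    ["wir", "kommen", "später", "noch", "auf", "diese", "Leute", "zurück", "."],
    ["we", "come", "back", "later", "to", "these", "people", "."],
    [(0, 0), (1, 1), (7, 2), (2, 3), (4, 4), (5, 5), (6, 6), (8, 7)]))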


References Yonggang Deng and William Byrne. 2005. Hmm word and phrase alignment for statistical machine transla- Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. tion. In Proceedings of Human Language Technol- Investigations on phrase-based decoding with recur- ogy Conference and Conference on Empirical Meth- rent neural network language and translation mod- ods in Natural Language Processing, pages 169– els. In Proceedings of the EMNLP 2015 Tenth 176, Vancouver, British Columbia, Canada, October. Workshop on Statistical Machine Translation, Lis- bon, Portugal, September. to appear. Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Michael Auli, Michel Galley, Chris Quirk, and Geof- Lamar, Richard Schwartz, and John Makhoul. 2014. frey Zweig. 2013. Joint Language and Translation Fast and Robust Neural Network Joint Models for Modeling with Recurrent Neural Networks. In Con- Statistical Machine Translation. In 52nd Annual ference on Empirical Methods in Natural Language Meeting of the Association for Computational Lin- Processing, pages 1044–1054, Seattle, USA, Octo- guistics, pages 1370–1380, Baltimore, MD, USA, ber. June. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Nadir Durrani, Helmut Schmid, and Alexander Fraser. gio. 2015. Neural machine translation by jointly 2011. A joint sequence translation model with in- learning to align and translate. In International Con- tegrated reordering. In Proceedings of the 49th An- ference on Learning Representations, San Diego, nual Meeting of the Association for Computational Calefornia, USA, May. Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA, June. Peter F. Brown, John Cocke, Stephan A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Nadir Durrani, Alexander Fraser, and Helmut Schmid. Lafferty, Robert L. Mercer, and Paul S. Rossin. 2013a. Model with minimal translation units, but 1990. A Statistical Approach to Machine Transla- decode with phrases. In Proceedings of the 2013 tion. Computational Linguistics, 16(2):79–85, June. Conference of the North American Chapter of the Association for Computational Linguistics: Human Peter F. Brown, Stephan A. Della Pietra, Vincent J. Language Technologies, pages 1–11, Atlanta, Geor- Della Pietra, and Robert L. Mercer. 1993. The gia, June. Mathematics of Statistical Machine Translation: Pa- rameter Estimation. Computational Linguistics, Nadir Durrani, Alexander Fraser, Helmut Schmid, 19(2):263–311, June. Hieu Hoang, and Philipp Koehn. 2013b. Can markov models over minimal translation units help Mauro Cettolo, Jan Niehues, Sebastian Stuker,¨ Luisa phrase-based smt? In Proceedings of the 51st An- Bentivogli, and Marcello Federico. 2014. Report on nual Meeting of the Association for Computational the 11th iwslt evaluation campaign, iwslt 2014. In Linguistics (Volume 2: Short Papers), pages 399– International Workshop on Spoken Language Trans- 405, Sofia, Bulgaria, August. lation, pages 2–11, Lake Tahoe, CA, USA, Decem- ber. Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the useful- Stanley F. Chen and Joshuo Goodman. 1998. An ness of generalized word representations in smt. In Empirical Study of Smoothing Techniques for Lan- COLING, Dublin, Ireland, August. guage Modeling. Technical Report TR-10-98, Com- puter Science Group, Harvard University, Cam- Yang Feng and Trevor Cohn. 2013. A markov bridge, MA, August. 
model of machine translation using non-parametric Boxing Chen, Roland Kuhn, George Foster, and bayesian inference. In 51st Annual Meeting of the Howard Johnson. 2011. Unpacking and transform- Association for Computational Linguistics, pages ing feature functions: New ways to smooth phrase 333–342, Sofia, Bulgaria, August. tables. In MT Summit XIII, pages 269–275, Xiamen, China, September. Minwei Feng, Jan-Thorsten Peter, and Hermann Ney. 2013. Advancements in reordering models for sta- Jonathan H. Clark, Chris Dyer, Alon Lavie, and tistical machine translation. In Annual Meeting Noah A. Smith. 2011. Better hypothesis test- of the Assoc. for Computational Linguistics, pages ing for statistical machine translation: Controlling 322–332, Sofia, Bulgaria, August. for optimizer instability. In 49th Annual Meet- ing of the Association for Computational Linguis- Yang Feng, Trevor Cohn, and Xinkai Du. 2014. Fac- tics:shortpapers, pages 176–181, Portland, Oregon, tored markov translation with robust modeling. In June. Proceedings of the Eighteenth Conference on Com- putational Natural Language Learning, pages 151– Josep Maria Crego and Franc¸ois Yvon. 2010. Improv- 159, Ann Arbor, Michigan, June. ing reordering with linguistically informed bilingual n-grams. In Proceedings of the 23rd International Michel Galley and Christopher D. Manning. 2008. Conference on Computational Linguistics (Coling A simple and effective hierarchical phrase reorder- 2010: Posters), pages 197–205, Beijing, China. ing model. In Proceedings of the Conference on


Empirical Methods in Natural Language Process- Franz J. Och, Christoph Tillmann, and Hermann Ney. ing, EMNLP ’08, pages 848–856, Stroudsburg, PA, 1999. Improved Alignment Models for Statistical USA. Association for Computational Linguistics. Machine Translation. In Proc. Joint SIGDAT Conf. on Empirical Methods in Natural Language Pro- Andreas Guta, Joern Wuebker, Miguel Grac¸a, Yunsu cessing and Very Large Corpora, pages 20–28, Uni- Kim, and Hermann Ney. 2015. Extended translation versity of Maryland, College Park, MD, June. models in phrase-based decoding. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Franz Josef Och. 2003. Minimum Error Rate Training Machine Translation, Lisbon, Portugal, September. in Statistical Machine Translation. In Proc. of the to appear. 41th Annual Meeting of the Association for Compu- tational Linguistics (ACL), pages 160–167, Sapporo, Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Japan, July. Clark, and Philipp Koehn. 2013. Scalable modi- fied Kneser-Ney language model estimation. In Pro- Kishore Papineni, Salim Roukos, Todd Ward, and Wei- ceedings of the 51st Annual Meeting of the Associa- Jing Zhu. 2001. Bleu: a Method for Automatic tion for Computational Linguistics, pages 690–696, Evaluation of Machine Translation. IBM Research Sofia, Bulgaria, August. Report RC22176 (W0109-022), IBM Research Di- vision, Thomas J. Watson Research Center, P.O. Box Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 218, Yorktown Heights, NY 10598, September. 2014. Minimum translation modeling with recur- rent neural networks. In Proceedings of the 14th Darelene Stewart, Roland Kuhn, Eric Joanis, and Conference of the European Chapter of the Associ- George Foster. 2014. Coarse split and lump bilin- ation for Computational Linguistics, pages 20–29, gual languagemodels for richer source information Gothenburg, Sweden, April. in smt. In AMTA, Vancouver, BC, Canada, October. P. Koehn, F. J. Och, and D. Marcu. 2003. Statisti- Martin Sundermeyer, Tamer Alkhouli, Wuebker Wue- cal Phrase-Based Translation. In Proceedings of the bker, and Hermann Ney. 2014. Translation Model- 2003 Meeting of the North American chapter of the ing with Bidirectional Recurrent Neural Networks. Association for Computational Linguistics (NAACL- In Conference on Empirical Methods on Natural 03), pages 127–133, Edmonton, Alberta. Language Processing, pages 14–25, Doha, Qatar, October. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Ilya Sutskever, Oriol Vinyals, and Quoc V. V Le. Brooke Cowan, Wade Shen, Christine Moran, 2014. Sequence to sequence learning with neural Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra networks. In Advances in Neural Information Pro- Constantine, and Evan Herbst. 2007. Moses: Open cessing Systems 27, pages 3104–3112. Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June. Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Pro- Hai Son Le, Alexandre Allauzen, and Franc¸ois Yvon. ceedings of HLT-NAACL 2004: Short Papers, HLT- 2012. Continuous Space Translation Models with NAACL-Short ’04, pages 101–104, Stroudsburg, Neural Networks. In Conference of the North Amer- PA, USA. ican Chapter of the Association for Computational Linguistics: Human Language Technologies, pages Joern Wuebker, Stephan Peitz, Felix Rietig, and Her- 39–48, Montreal, Canada, June. mann Ney. 2013. 
Improving statistical machine translation with word class models. In Conference Jose´ B Marino,˜ Rafael E Banchs, Josep M Crego, Adria` on Empirical Methods in Natural Language Pro- de Gispert, Patrik Lambert, Jose´ A R Fonollosa, and cessing, pages 1377–1381, Seattle, USA, October. Marta R Costa-jussa.` 2006. N-gram-based Machine Translation. Comput. Linguist., 32(4):527–549, De- Richard Zens, Franz Josef Och, and Hermann Ney. cember. 2002. Phrase-Based Statistical Machine Transla- tion. In 25th German Conf. on Artificial Intelligence R.C. Moore and W. Lewis. 2010. Intelligent Selection (KI2002), pages 18–32, Aachen, Germany, Septem- of Language Model Training Data. In ACL (Short ber. Papers), pages 220–224, Uppsala, Sweden, July. Hui Zhang, Kristina Toutanova, Chris Quirk, and Jian- Jan Niehues, Teresa Herrmann, Stephan Vogel, and feng Gao. 2013. Beyond left-to-right: Multiple de- Alex Waibel, 2011. Proceedings of the Sixth Work- composition structures for smt. In Proceedings of shop on Statistical Machine Translation, chapter the 2013 Conference of the North American Chap- Wider Context by Using Bilingual Language Mod- ter of the Association for Computational Linguis- els in Machine Translation, pages 198–206. tics: Human Language Technologies, pages 12–21, Atlanta, Georgia, June. Franz J. Och and Hermann Ney. 2003. A System- atic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.


B Alignment-Based Neural Machine Translation

Alignment-Based Neural Machine Translation

Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta and Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract

Neural machine translation (NMT) has emerged recently as a promising statistical machine translation approach. In NMT, neural networks (NN) are directly used to produce translations, without relying on a pre-existing translation framework. In this work, we take a step towards bridging the gap between conventional word alignment models and NMT. We follow the hidden Markov model (HMM) approach that separates the alignment and lexical models. We propose a neural alignment model and combine it with a lexical neural model in a log-linear framework. The models are used in a standalone word-based decoder that explicitly hypothesizes alignments during search. We demonstrate that our system outperforms attention-based NMT on two tasks: IWSLT 2013 German→English and BOLT Chinese→English. We also show promising results for re-aligning the training data using neural models.

1 Introduction

Neural networks have been gaining a lot of attention recently in areas like speech recognition, image recognition and natural language processing. In machine translation, NNs are applied in two main ways: In N-best rescoring, the neural model is used to score the first-pass decoding output, limiting the model to a fixed set of hypotheses (Le et al., 2012; Sundermeyer et al., 2014a; Hu et al., 2014; Guta et al., 2015). The second approach integrates the NN into decoding, potentially allowing it to directly determine the search space.

There are two approaches to use neural models in decoding. The first integrates the models into phrase-based decoding, where the models are used to score phrasal candidates hypothesized by the decoder (Vaswani et al., 2013; Devlin et al., 2014; Alkhouli et al., 2015). The second approach is referred to as neural machine translation, where neural models are used to hypothesize translations, word by word, without relying on a pre-existing framework. In comparison to the former approach, NMT does not restrict NNs to predetermined translation candidates, and it does not depend on word alignment concepts that have been part of building state-of-the-art phrase-based systems. In such systems, the HMM and the IBM models developed more than two decades ago are used to produce Viterbi word alignments, which are used to build standard phrase-based systems. Existing NMT systems either disregard the notion of word alignments entirely (Sutskever et al., 2014), or rely on a probabilistic notion of alignments (Bahdanau et al., 2015) independent of the conventional alignment models.

Most recently, Cohn et al. (2016) designed neural models that incorporate concepts like fertility and Markov conditioning into their structure. In this work, we also focus on the question whether conventional word alignment concepts can be used for NMT. In particular, (1) we follow the HMM approach to separate the alignment and translation models, and use neural networks to model alignments and translation. (2) We introduce a lexicalized alignment model to capture source reordering information. (3) We bootstrap the NN training using Viterbi word alignments obtained from the HMM and IBM model training, and use the trained neural models to generate new alignments. The new alignments are then used to re-train the neural networks. (4) We design an alignment-based decoder that hypothesizes the alignment path along with the associated translation. We show competitive results in comparison to attention-based models on the IWSLT 2013 German→English and BOLT Chinese→English tasks.


1.1 Motivation

Attention-based NMT computes the translation probability depending on an intermediate computation of an alignment distribution. The alignment distribution is used to choose the positions in the source sentence that the decoder attends to during translation. Therefore, the alignment model can be considered as an implicit part of the translation model. On the other hand, separating the alignment model from the lexical model has its own advantages: First, this leads to more flexibility in modeling and training: not only can the models be trained separately, but they can also have different model types, e.g. neural models, count-based models, etc. Second, the separation avoids propagating errors from one model to the other. In attention-based systems, the translation score is based on the alignment distribution, which risks propagating errors from the alignment part to the translation part. Third, using separate models makes it possible to assign them different weights. We exploit this and use a log-linear framework to combine them. We still retain the possibility of joint training, which can be performed flexibly by alternating between model training and alignment generation. The latter can be performed using forced-decoding.

In contrast to the count-based models used in HMMs, we use neural models, which allow covering long context without having to explicitly address the smoothing problem that arises in count-based models.

2 Related Work

Most recently, NNs have been trained on large amounts of data, and applied to translate independent of the phrase-based framework. Sutskever et al. (2014) introduced the pure encoder-decoder approach, which avoids the concept of word alignments. Bahdanau et al. (2015) introduced an attention mechanism to the encoder-decoder approach, allowing the decoder to attend to certain source words. This method was refined in (Luong et al., 2015) to allow for local attention, which makes the decoder attend to representations of source words residing within a window. These translation models have shown competitive results, outperforming phrase-based systems when using ensembles on tasks like IWSLT English→German 2015 (Luong and Manning, 2015).

In this work, we follow the same standalone neural translation approach. However, we have a different treatment of alignments. While the attention-based soft-alignment model computes an alignment distribution as an intermediate step within the neural model, we follow the hard alignment concept used in phrase extraction. We separate the alignment model from the lexical model, and train them independently. At translation time, the decoder hypothesizes and scores the alignment path in addition to the translation.

Cohn et al. (2016) introduce several modifications to the attention-based model inspired by traditional word alignment concepts. They modify the network architecture, adding a first-order dependence by making the attention vector computed for a target position directly dependent on that of the previous position. Our alignment model has a first-order dependence that takes place at the input and output of the model, rather than an architectural modification of the neural network.

Yang et al. (2013) use NN-based lexical and alignment models, but they give up the probabilistic interpretation and produce unnormalized scores instead. Furthermore, they model alignments using a simple distortion model that has no dependence on lexical context. The models are used to produce new alignments which are in turn used to train phrase systems. This leads to no significant difference in terms of translation performance. Tamura et al. (2014) propose a lexicalized RNN alignment model. The model still produces non-probabilistic scores, and is used to generate word alignments used to train phrase-based systems. In this work, we develop a feed-forward neural alignment model that computes probabilistic scores, and use it directly in standalone decoding, without constraining it to the phrase-based framework. In addition, we use the neural models to produce alignments that are used to re-train the same neural models.

Schwenk (2012) proposed a feed-forward network that computes phrase scores offline, and the scores were added to the phrase table of a phrase-based system. Offline phrase scoring was also done in (Alkhouli et al., 2014) using semantic phrase features obtained using simple neural networks. In comparison, our work does not rely on the phrase-based system; rather, the neural networks are used to hypothesize translation candidates directly, and the scores are computed online during decoding.


We use the feed-forward joint model introduced in (Devlin et al., 2014) as a lexical model, and introduce a lexicalized alignment model based on it. In addition, we modify the bidirectional joint model presented in (Sundermeyer et al., 2014a) and compare it to the feed-forward variant. These lexical models were applied in phrase-based systems. In this work, we apply them in a standalone NMT framework.

Forced alignment was applied to train phrase tables in (Wuebker et al., 2010; Peitz et al., 2012). We generate forced alignments using a neural decoder, and use them to re-train neural models.

Tackling the costly normalization of the output layer during decoding has been the focus of several papers (Vaswani et al., 2013; Devlin et al., 2014; Jean et al., 2015). We propose a simple method to speed up decoding using a class-factored output layer with almost no loss in translation quality.

3 Statistical Machine Translation

In statistical machine translation, the target word sequence e_1^I = e_1, ..., e_I of length I is assigned a probability conditioned on the source word sequence f_1^J = f_1, ..., f_J of length J. By introducing word alignments as hidden variables, the posterior probability p(e_1^I | f_1^J) can be computed using a lexical and an alignment model as follows.

  p(e_1^I \mid f_1^J) = \sum_{b_1^I} p(e_1^I, b_1^I \mid f_1^J)
                      = \sum_{b_1^I} \prod_{i=1}^{I} p(e_i, b_i \mid b_1^{i-1}, e_1^{i-1}, f_1^J)
                      = \sum_{b_1^I} \prod_{i=1}^{I} \underbrace{p(e_i \mid b_1^i, e_1^{i-1}, f_1^J)}_{\text{lexical model}} \cdot \underbrace{p(b_i \mid b_1^{i-1}, e_1^{i-1}, f_1^J)}_{\text{alignment model}}

where b_1^I = b_1, ..., b_I denotes the alignment path, such that b_i aligns the target word e_i to the source word f_{b_i}. In this general formulation, the lexical model predicts the target word e_i conditioned on the source sentence, the target history, and the alignment history. The alignment model is lexicalized using the source and target context as well. The sum over alignment paths is replaced by the maximum during decoding (cf. Section 5).

4 Neural Network Models

There are two common network architectures used in machine translation: feed-forward NNs (FFNN) and recurrent NNs (RNN). In this section we will discuss alignment-based feed-forward and recurrent neural networks. These networks are conditioned on the word alignment, in addition to the source and target words.

4.1 Feed-forward Joint Model

We adopt the feed-forward joint model (FFJM) proposed in (Devlin et al., 2014) as the lexical model. The authors demonstrate the model has a strong performance when applied in a phrase-based framework. In this work we explore its performance in standalone NMT. The model was introduced along with heuristics to resolve unaligned and multiply aligned words. We denote the heuristic-based source alignment point corresponding to the target position i by b̂_i. The model is defined as

  p(e_i \mid b_1^i, e_1^{i-1}, f_1^J) = p(e_i \mid e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m})    (1)

and it computes the probability of a target word e_i at position i given the n-gram target history e_{i-n}^{i-1} = e_{i-n}, ..., e_{i-1}, and a window of 2m+1 source words f_{b̂_i-m}^{b̂_i+m} = f_{b̂_i-m}, ..., f_{b̂_i+m} centered around the word f_{b̂_i}.

As the heuristics have implications on our alignment-based decoder, we explain them by the examples shown in Figure 1. We mark the source and target context by rectangles on the x- and y-axis, respectively. The left figure shows a single source word 'Jungen' aligned to a single target word 'offspring', in which case the original source position is used, i.e., b̂_i = b_i. If the target word is aligned to multiple source words, as is the case with the words 'Mutter Tiere' and 'Mothers' in the middle figure, then b̂_i is set to the middle alignment point. In this example, the left alignment point associated with 'Mutter' is selected. The right figure shows the case of the unaligned target word 'of'. b̂_i is set to the source position associated with the closest aligned target word 'full', preferring right to left. Note that this model does not have special handling of unaligned source words. While these words can be covered indirectly by source windows associated with aligned source words, the model does not explicitly score them.
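These affiliation heuristics are easy to state in code. The following is an illustrative re-implementation (not the authors' code): every target position receives exactly one source position b̂_i, using the middle link for multiply aligned words and, for unaligned words, the affiliation of the nearest aligned target word, preferring the right neighbour.

# Illustrative re-implementation of the affiliation heuristics of Section 4.1:
# every target position i receives one source position b_hat[i], even if it is
# unaligned or aligned to several source words.

def affiliations(alignment, target_len):
    """alignment: iterable of (src_pos, tgt_pos) links; returns list b_hat."""
    links = {}
    for src, tgt in alignment:
        links.setdefault(tgt, []).append(src)

    b_hat = [None] * target_len
    for i in range(target_len):
        if i in links:
            srcs = sorted(links[i])
            b_hat[i] = srcs[(len(srcs) - 1) // 2]   # middle link, left-biased
    for i in range(target_len):
        if b_hat[i] is None:                        # unaligned target word
            for offset in range(1, target_len):
                right, left = i + offset, i - offset
                if right < target_len and b_hat[right] is not None:
                    b_hat[i] = b_hat[right]         # prefer the right neighbour
                    break
                if left >= 0 and b_hat[left] is not None:
                    b_hat[i] = b_hat[left]
                    break
    return b_hat

# toy example: target word 1 is unaligned and inherits the affiliation of word 2
print(affiliations([(0, 0), (1, 2)], 3))            # -> [0, 1, 1]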


Figure 1: Examples on resolving word alignments to obtain word affiliations.
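Equation (1) corresponds to a small feed-forward scorer over these affiliations. The sketch below is a minimal NumPy illustration with made-up layer sizes (the actual models reported later use 1000/500-unit hidden layers, 300-dimensional embeddings and a class-factored output layer); it is not the authors' implementation.

import numpy as np

# Minimal FFJM-style scorer for Eq. (1): p(e_i | e_{i-n}^{i-1}, f_{b_i-m}^{b_i+m}).
# All dimensions are toy values chosen for the sketch.

rng = np.random.default_rng(0)
V_src, V_tgt, emb, h1, h2 = 100, 120, 16, 32, 24   # toy vocabulary / layer sizes
n, m = 5, 4                                        # history length, window radius

E_src = rng.normal(size=(V_src, emb))
E_tgt = rng.normal(size=(V_tgt, emb))
W1 = rng.normal(size=((n + 2 * m + 1) * emb, h1)); b1 = np.zeros(h1)
W2 = rng.normal(size=(h1, h2));                    b2 = np.zeros(h2)
Wo = rng.normal(size=(h2, V_tgt));                 bo = np.zeros(V_tgt)

def ffjm_distribution(src_window, tgt_history):
    """src_window: 2m+1 source word ids centred on b_hat_i;
       tgt_history: n previous target word ids."""
    x = np.concatenate([E_tgt[tgt_history].ravel(), E_src[src_window].ravel()])
    a1 = np.tanh(x @ W1 + b1)
    a2 = np.tanh(a1 @ W2 + b2)
    logits = a2 @ Wo + bo
    p = np.exp(logits - logits.max())
    return p / p.sum()                             # distribution over target words

p = ffjm_distribution(src_window=list(range(9)), tgt_history=[1, 2, 3, 4, 5])
print(p.shape, p.sum())                            # (120,) ~1.0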

Computing normalized probabilities is done using the softmax function, which requires computing the full output layer first, and then computing the normalization factor by summing over the output scores of the full vocabulary. This is very costly for large vocabularies. To overcome this, we adopt the class-factored output layer consisting of a class layer and a word layer (Goodman, 2001; Morin and Bengio, 2005). The model in this case is defined as

  p(e_i \mid e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m}) = p(e_i \mid c(e_i), e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m}) \cdot p(c(e_i) \mid e_{i-n}^{i-1}, f_{\hat{b}_i - m}^{\hat{b}_i + m})

where c denotes a word mapping that assigns each target word to a single class, where the number of classes is chosen to be much smaller than the vocabulary size, |C| << |V|. Even though the full class layer needs to be computed, only a subset of the significantly larger word layer has to be considered, namely the words that share the same class c(e_i) with the target word e_i. This helps speeding up training on large-vocabulary tasks.

4.2 Bidirectional Joint Model

The bidirectional RNN joint model (BJM) presented in (Sundermeyer et al., 2014a) is another lexical model. The BJM uses the full source sentence and the full target history for prediction, and it is computed by reordering the source sentence following the target order. This requires the complete alignment information to compute the model scores. Here, we introduce a variant of the model that is conditioned on the alignment history instead of the full alignment path. This is achieved by computing forward and backward representations of the source sentence in its original order, as done in (Bahdanau et al., 2015). The model is given by

  p(e_i \mid b_1^i, e_1^{i-1}, f_1^J) = p(e_i \mid \hat{b}_1^i, e_1^{i-1}, f_1^J)

Note that we also use the same alignment heuristics presented in Section 4.1. As this variant does not require future alignment information, it can be applied in decoding. However, in this work we apply this model in rescoring and leave decoder integration to future work.

4.3 Feed-forward Alignment Model

We propose a neural alignment model to score alignment paths. Instead of predicting the absolute positions in the source sentence, we model the jumps from one source position to the next position to be translated. The jump at target position i is defined as Δ_i = b̂_i − b̂_{i−1}, which captures the jump from the source position b̂_{i−1} to b̂_i. We modify the FFNN lexical model to obtain a feed-forward alignment model. The feed-forward alignment model (FFAM) is given by

  p(b_i \mid b_1^{i-1}, e_1^{i-1}, f_1^J) = p(\Delta_i \mid e_{i-n}^{i-1}, f_{\hat{b}_{i-1} - m}^{\hat{b}_{i-1} + m})    (2)

This is a lexicalized alignment model conditioned on the n-gram target history and the (2m+1)-gram source window. Note that, different from the FFJM, the source window of this model is centered around the source position b̂_{i−1}. This is because the model needs to predict the jump to the next source position b̂_i to be translated. The alignment model architecture is shown in Figure 2.

In contrast to the lexical model, the output vocabulary of the alignment model is much smaller, and therefore we use a regular softmax output layer for this model without class-factorization.

4.4 Feed-forward vs. Recurrent Models

RNNs have been shown to outperform feed-forward variants in language and translation modeling. Nevertheless, feed-forward networks have their own advantages: First, they are typically faster to train due to their simple architecture, and second, they are more flexible to integrate into beam search decoders. This is because feed-forward networks only depend on a limited context. RNNs, on the other hand, are conditioned on an unbounded context. This means that the complete hypotheses during decoding have to be maintained without any state recombination. Since feed-forward networks allow the use of state recombination, they are potentially capable of exploring more candidates during beam search.
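The FFAM of Eq. (2) can be illustrated in the same way: jumps Δ_i = b̂_i − b̂_{i−1} are read off the resolved affiliations and predicted from the target history and a source window centred on the previous position b̂_{i−1}. The sketch below is only an illustration with toy dimensions; the clipping to a maximum jump length of 100 (201 output classes) follows the setup reported in Section 7, and the assumed start position of −1 is our own choice.

import numpy as np

# Illustrative FFAM of Eq. (2): p(Delta_i | e_{i-n}^{i-1}, f_{b_{i-1}-m}^{b_{i-1}+m}),
# with Delta_i = b_hat[i] - b_hat[i-1] clipped to a maximum jump length.

MAX_JUMP = 100                               # 201 output classes: jumps -100..100

def jump_labels(b_hat, first_prev=-1):
    """Turn affiliations into clipped jump class indices (0 .. 2*MAX_JUMP)."""
    labels, prev = [], first_prev            # first_prev = -1 is an assumption
    for b in b_hat:
        delta = max(-MAX_JUMP, min(MAX_JUMP, b - prev))
        labels.append(delta + MAX_JUMP)      # shift into a non-negative index
        prev = b
    return labels

# Toy scorer: shared embedding table for brevity, one hidden layer, and a
# regular softmax over the jump classes (no class factorization needed).
rng = np.random.default_rng(1)
emb, hid, n, m, V = 16, 32, 5, 4, 100
E = rng.normal(size=(V, emb))
W1 = rng.normal(size=((n + 2 * m + 1) * emb, hid))
Wo = rng.normal(size=(hid, 2 * MAX_JUMP + 1))

def ffam_distribution(src_window_prev, tgt_history):
    """src_window_prev: 2m+1 word ids centred on b_hat_{i-1}."""
    x = np.concatenate([E[tgt_history].ravel(), E[src_window_prev].ravel()])
    logits = np.tanh(x @ W1) @ Wo
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(jump_labels([0, 1, 7, 2, 4]))          # -> [101, 101, 106, 95, 102]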


Figure 2: A feed-forward alignment NN, with 3 target history words, a 5-gram source window, a projection layer, 2 hidden layers, and a small output layer to predict jumps. The inputs are e_{i−3}^{i−1} and f_{b̂_{i−1}−2}^{b̂_{i−1}+2}; the output is p(Δ_i | e_{i−3}^{i−1}, f_{b̂_{i−1}−2}^{b̂_{i−1}+2}).

5 Alignment-based Decoder

In this section we present the alignment-based decoder. This is a beam-search word-based decoder that predicts one target word at a time. As the models we use are alignment-based, the decoder hypothesizes the alignment path. This is different from the NMT approaches present in the literature, which are based on models that either ignore word alignments or compute alignments as part of the attention-based model.

In the general case, a word can be aligned to a single word, multiple words, or it can be unaligned. However, we do not use the general word alignment notion; rather, the models are based on alignments derived using the heuristics discussed in Section 4. These heuristics simplify the task of the decoder, as they induce equivalence classes over the alignment paths, reducing the number of possible alignments the decoder has to hypothesize significantly. As a result of using these heuristics, the task of hypothesizing alignments is reduced to enumerating all J source positions a target word can be aligned to. The following is a list of the possible alignment scenarios and how the decoder covers them.

• Multiply-aligned target words: the heuristic chooses the middle link as an alignment point. Therefore, the decoder is able to cover these cases by hypothesizing J many source positions for each target word hypothesis.

• Unaligned target words: the heuristic aligns these words using the nearest aligned target word in training (cf. Figure 1, right). In decoding, these words are handled as aligned words.

• Multiply-aligned source words: covered by revisiting a source position that has already been translated.

• Unaligned source words: result if no target word is generated using a source window centered around the source word in question.

Algorithm 1 Alignment-based Decoder
 1: procedure TRANSLATE(f_1^J, beamSize)
 2:   hyps ← initHyp                  ▷ previous set of partial hypotheses
 3:   newHyps ← ∅                     ▷ current set of partial hypotheses
 4:   while GETBEST(hyps) not terminated do
 5:     ▷ compute alignment distribution in batch mode
 6:     alignDists ← ALIGNMENTDISTRIBUTION(hyps)
 7:     ▷ hypothesize source alignment points
 8:     for pos from 1 to J do
 9:       ▷ compute lexical distributions of all
10:       ▷ hypotheses in hyps in batch mode
11:       dists ← LEXICALDISTRIBUTION(hyps, pos)
12:       ▷ expand each of the previous hypotheses
13:       for hyp in hyps do
14:         jmpCost ← SCORE(alignDists, hyp, pos)
15:         dist ← GETDISTRIBUTION(dists, hyp)
16:         dist ← PARTIALSORT(dist, beamSize)
17:         cnt ← 0
18:         ▷ hypothesize new target word
19:         for word in dist do
20:           if cnt > beamSize then
21:             break
22:           newHyp ← EXTEND(hyp, word, pos, jmpCost)
23:           newHyps.INSERT(newHyp)
24:           cnt ← cnt + 1
25:     PRUNE(newHyps, beamSize)
26:     hyps ← newHyps
27:
28:   ▷ return the best scoring hypothesis
29:   return GETBEST(hyps)

The decoder is shown in Algorithm 1. It involves hypothesizing alignments and translation words. Alignments are hypothesized in the loop starting at line 8. Once an alignment point is set to position pos, the lexical distribution over the full target vocabulary is computed using this position in line 11. The distribution is sorted and the best candidate translations lying within the beam are used to expand the partial hypotheses.
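The expansion step of Algorithm 1 (lines 13 to 24) is compact enough to paraphrase in Python. The sketch below uses simplified data structures of our own; it shows how a partial hypothesis is extended with a hypothesized source position and target word, how log-linear weights λ combine the alignment score, the lexical score and the word penalty (cf. the model combination paragraph below), and how a partial sort (here heapq.nlargest) keeps only the beamSize best candidates.

import heapq
from collections import namedtuple

# Simplified expansion of one (hyp, pos) pair; scores are combined log-linearly.

Hyp = namedtuple("Hyp", "words positions score")

def expand(hyp, pos, lex_dist, jump_logprob, weights, beam_size):
    """lex_dist: dict word -> log p(word | history, window at pos);
       jump_logprob: log p(jump to pos | previous position, history)."""
    lam_lex, lam_align, lam_wp = weights
    # partial sort: keep only the beam_size best words (linear on average)
    best_words = heapq.nlargest(beam_size, lex_dist.items(), key=lambda kv: kv[1])
    new_hyps = []
    for word, lex_logprob in best_words:
        score = (hyp.score
                 + lam_lex * lex_logprob
                 + lam_align * jump_logprob
                 + lam_wp * 1.0)              # word penalty: one more target word
        new_hyps.append(Hyp(hyp.words + [word], hyp.positions + [pos], score))
    return new_hyps

def prune(hyps, beam_size):
    return heapq.nlargest(beam_size, hyps, key=lambda h: h.score)

# usage: expand an empty hypothesis at source position 3 with toy scores
start = Hyp([], [], 0.0)
cands = expand(start, 3, {"we": -0.2, "you": -1.1, "they": -2.3},
               jump_logprob=-0.5, weights=(1.0, 0.7, -0.1), beam_size=2)
print(prune(cands, 2))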


We batch the NN computations, calling the alignment and lexical networks for all partial hypotheses in a single call to speed up computations, as shown in lines 6 and 11. We also exploit the beam and apply partial sorting in line 16, instead of completely sorting the list. Partial sorting has a linear complexity on average, and it returns a list whose first beamSize words have better scores compared to the rest of the list.

We terminate translation if the best scoring partial hypothesis ends with the sentence end symbol. If a hypothesis terminates but it scores worse than other hypotheses, it is removed from the beam, but it still competes with non-terminated hypotheses. Note that we do not have any explicit coverage constraints. This means that a source position can be revisited many times, hence generating one-to-many alignment cases. This also allows having unaligned source words.

In the alignment-based decoder, an alignment distribution is computed, and word alignments are hypothesized and scored using this distribution, leading alignment decisions to become part of beam search. The search space is composed of both alignment and translation decisions. In contrast, the search space in attention-based decoding is composed of translation decisions only.

Class-Factored Output Layer in Decoding

The large output layer used in language and translation modeling is a major bottleneck in evaluating the network. Several papers discuss how to evaluate it efficiently during decoding using approximations. In this work, we exploit the class-factored output layer to speed up training. At decoding time, the network needs to hypothesize all target words, which means the full output layer should be evaluated. In the case of using a class-factored output layer, this results in an additional computational overhead from computing the class layer. In order to speed up decoding, we propose to use the class layer to choose the top scoring k classes, then we evaluate the word layer for each of these classes only. We show this leads to a significant speed-up with minimal loss in translation quality.

Model Combination

We embed the models in a log-linear framework, which is commonly used in phrase-based systems. The goal of the decoder is to find the best scoring hypothesis as follows.

  \hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \Big\{ \max_{b_1^I} \sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, b_1^I) \Big\}

where λ_m is the model weight associated with the model h_m, and M is the total number of models. The model weights are automatically tuned using minimum error rate training (MERT) (Och, 2003). Our main system includes a lexical neural model, an alignment neural model, and a word penalty, which is the count of target words. The word penalty becomes important at the end of translation, where hypotheses in the beam might have different final lengths.

6 Forced-Alignment Training

Since the models we use require alignments for training, we initially use word alignments produced using HMM/IBM models using GIZA++ as initial alignments. At first, the FFJM and the FFAM are trained separately until convergence, then the models are used to generate new word alignments by force-decoding the training data as follows.

  \tilde{b}_1^I(f_1^J, e_1^I) = \arg\max_{b_1^I} \prod_{i=1}^{I} p^{\lambda_1}(\Delta_i \mid e_{i-n}^{i-1}, f_{b_{i-1}-m}^{b_{i-1}+m}) \cdot p^{\lambda_2}(e_i \mid e_{i-n}^{i-1}, f_{b_i-m}^{b_i+m})

where λ_1 and λ_2 are the model weights. We modify the decoder to only compute the probabilities of the target words in the reference sentence. The for loop in line 19 of Algorithm 1 collapses to a single iteration. We use both the feed-forward joint model (FFJM) and the feed-forward alignment model (FFAM) to perform force-decoding, and the new alignments are used to retrain the models, replacing the initial GIZA++ alignments.

Retraining the neural models using the forced alignments has two benefits. First, since the alignments are produced using both of the lexical and alignment models, this can be viewed as joint training of the two models. Second, since the neural decoder generates these alignments, training neural models based on them yields models that are more consistent with the neural decoder. We verify this claim in the experiments section.
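Force-decoding changes only the expansion step sketched earlier: the target word is fixed to the reference, so the word loop of Algorithm 1 (line 19) collapses and only the source positions (jumps) are searched. A minimal, self-contained sketch of one such constrained step follows; lex_logprob_fn and jump_logprob_fn are hypothetical stand-ins for the FFJM and FFAM scores.

import heapq
from collections import namedtuple

# Forced decoding (Section 6): the target side is fixed to the reference, so
# for each target position we only search over source positions, combining the
# jump score (weight lambda1) and the lexical score of the reference word
# (weight lambda2). The best surviving path gives the new word alignment.

Hyp = namedtuple("Hyp", "words positions score")

def force_decode_step(hyps, ref_word, J, lex_logprob_fn, jump_logprob_fn,
                      lambda1, lambda2, beam_size):
    new_hyps = []
    for hyp in hyps:
        prev = hyp.positions[-1] if hyp.positions else -1
        for pos in range(J):                     # hypothesize alignment points
            score = (hyp.score
                     + lambda1 * jump_logprob_fn(prev, pos, hyp.words)
                     + lambda2 * lex_logprob_fn(pos, hyp.words, ref_word))
            new_hyps.append(Hyp(hyp.words + [ref_word],
                                hyp.positions + [pos], score))
    return heapq.nlargest(beam_size, new_hyps, key=lambda h: h.score)

# Running this step over every reference word and backtracking the best final
# hypothesis yields the forced alignment used to re-train the FFJM and FFAM.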


                    IWSLT             BOLT
                    De       En       Zh       En
Sentences               4.32M             4.08M
Run. Words          108M     109M     78M      86M
Vocab.              836K     792K     384K     817K
FFNN/BJM Vocab.     173K     149K     169K     128K
Attention Vocab.    30K      30K      30K      30K
FFJM params             177M              159M
BJM params              170M              153M
FFAM params             101M              94M
Attention params        84M               84M

Table 1: Corpora and NN statistics.

7 Experiments

We carry out experiments on two tasks: the IWSLT 2013 German→English shared translation task (http://www.iwslt2013.org), and the BOLT Chinese→English task. The corpora statistics are shown in Table 1. The IWSLT phrase-based baseline system is trained on all available bilingual data, and uses a 4-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998), trained with the SRILM toolkit (Stolcke, 2002). As additional data sources for the LM, we selected parts of the Shuffled News and LDC English Gigaword corpora based on the cross-entropy difference (Moore and Lewis, 2010), resulting in a total of 1.7 billion running words for LM training. The phrase-based baseline is a standard phrase-based SMT system (Koehn et al., 2003) tuned with MERT (Och, 2003) and contains a hierarchical reordering model (Galley and Manning, 2008). The in-domain data consists of 137K sentences.

The BOLT Chinese→English task is evaluated on the "discussion forum" domain. We use a 5-gram LM trained on 2.9 billion running words in total. The in-domain data consists of a subset of 67.8K sentences. We used a set of 1845 sentences as a tune set. The evaluation set test1 contains 1844 and test2 contains 1124 sentences.

We use the FFNN architecture for the lexical and alignment models. Both models use a window of 9 source words, and 5 target history words. Both models use two hidden layers, the first has 1000 units and the second has 500 units. The lexical model uses a class-factored output layer, with 1000 singleton classes dedicated to the most frequent words, and 1000 classes shared among the rest of the words. The classes are trained using a separate tool to optimize the maximum likelihood training criterion with the bigram assumption. The alignment model uses a small output layer of 201 nodes, determined by a maximum jump length of 100 (forward and backward). 300 nodes are used for word embeddings. Each of the FFNN models is trained on CPUs using 12 threads, which takes up to 3 days until convergence. We train with stochastic gradient descent using a batch size of 128. The learning rate is halved when the development perplexity increases.

Each BJM has 4 LSTM layers: two for the forward and backward states, one for the target state, and one after merging the source and target states. The size of the word embeddings and hidden layers is 350 nodes. The output layers are identical to those of the FFJM models.

We compare our system to an attention-based baseline similar to the networks described in (Bahdanau et al., 2015). All such systems use single models, rather than ensembles. The word embedding dimension is 620, and each direction of the encoder and the decoder has a layer of 1000 gated recurrent units (Cho et al., 2014). Unknowns and numbers are carried out from the source side to the target side based on the largest attention weight.

To speed up decoding of long sentences, the decoder hypothesizes 21 and 41 source positions around the diagonal, for the IWSLT and the BOLT tasks, respectively. We choose these numbers such that the translation quality does not degrade. The beam size is set to 16 in all experiments. Larger beam sizes did not lead to improvements.

We apply part-of-speech-based long-range verb reordering rules to the German side in a preprocessing step for all German→English systems (Popović and Ney, 2006), including the baselines. The Chinese→English systems use no such preordering. We use the GIZA++ word alignments to train the models. The networks are fine-tuned by training additional epochs on the in-domain data only (Luong and Manning, 2015). The LMs are only used in the phrase-based systems in both tasks, but not in the NMT systems.

All translation experiments are performed with the Jane toolkit (Vilar et al., 2010; Wuebker et al., 2012). The alignment-based NNs are trained using an extension of the rwthlm toolkit (Sundermeyer et al., 2014b). We use an implementation based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012) for the attention-based experiments. All results are measured in case-insensitive BLEU [%] (Papineni et al., 2002) and TER [%] (Snover et al., 2006) on a single reference. We used the multeval toolkit (Clark et al., 2011) for evaluation.
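The training schedule described above (plain SGD with a fixed batch size, halving the learning rate whenever the development perplexity increases) can be written down in a few lines. The loop below is an illustrative sketch, not the rwthlm extension used for the experiments; train_epoch and dev_perplexity are hypothetical helpers.

# Illustrative SGD schedule: halve the learning rate whenever the development
# perplexity increases. train_epoch / dev_perplexity are hypothetical helpers.

def train(model, data, dev, train_epoch, dev_perplexity,
          lr=0.1, batch_size=128, max_epochs=30):
    best_ppl = float("inf")
    for epoch in range(max_epochs):
        train_epoch(model, data, lr=lr, batch_size=batch_size)
        ppl = dev_perplexity(model, dev)
        if ppl > best_ppl:
            lr *= 0.5                 # development perplexity went up: halve lr
        else:
            best_ppl = ppl
        print(f"epoch {epoch}: dev ppl {ppl:.2f}, lr {lr}")
    return model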


                                     test 2010        eval 2011
#  system                           BLEU    TER      BLEU    TER
1  phrase-based system              28.9    51.0     32.9    46.3
2    + monolingual data             30.4    49.5     35.4    44.2
3  attention-based RNN              27.9    51.4     31.8    46.5
4    + fine-tuning                  29.8    48.9     32.9    45.1
5  FFJM+dp+wp                       21.6    56.9     24.7    53.8
6  FFJM+FFAM+wp                     26.1    53.1     29.9    49.4
7    + fine-tuning                  29.3    50.5     33.2    46.5
8      + BJM rescoring              30.0    48.7     33.8    44.8
9  BJM+FFAM+wp+fine-tuning          29.8    49.5     33.7    45.8

Table 2: IWSLT 2013 German→English results in BLEU [%] and TER [%].

7.1 IWSLT 2013 German→English

Table 2 shows the IWSLT German→English results. FFJM refers to the feed-forward lexical model. We compare against the phrase-based system with an LM trained on the target side of the bilingual data (row #1), the phrase-based system with an LM trained on additional monolingual data (row #2), the attention-based system (row #3), and the attention-based system after fine-tuning towards the in-domain data (row #4). First, we experiment with a system using the FFJM as a lexical model and a linear distortion penalty (dp) to encourage monotone translation as the alignment model. We also include a word penalty (wp). This system is shown in row #5. In comparison, if the distortion penalty is replaced by the feed-forward alignment model (FFAM), we observe large improvements of 4.5% to 5.2% BLEU (row #5 vs. #6). This highlights the significant role of the alignment model in our system. Moreover, it indicates that the FFAM is able to model alignments beyond the simple monotone alignments preferred by the distortion penalty.

Fine-tuning the neural networks towards in-domain data improves the system by up to 3.3% BLEU and 2.9% TER (row #6 vs #7). The gain from fine-tuning is larger than the one observed for the attention-based system. This is likely due to the fact that our system has two neural models, and each of them is fine-tuned.

We apply the BJM in 1000-best list rescoring (row #8), which gives another boost, leading our system to outperform the attention-based system by 0.9% BLEU on eval 2011, while a comparable performance is achieved on test 2010.

In order to highlight the difference between using the FFJM and the BJM, we replace the FFJM scores after obtaining the N-best lists with the BJM scores and apply rescoring (row #9). In comparison to row #7, we observe up to 0.5% BLEU and 1.0% TER improvement. This is expected as the BJM captures unbounded source and target context in comparison to the limited context of the FFJM. This calls for a direct integration of the BJM into decoding, which we intend to do in future work. Our best system (row #8) outperforms the phrase-based system (row #1) by up to 1.1% BLEU and 2.3% TER. While the phrase-based system can benefit from training the LM on additional monolingual data (row #1 vs. #2), exploiting monolingual data in NMT systems is still an open research question.

7.2 BOLT Chinese→English

The BOLT Chinese→English experiments are shown in Table 3. Again, we observe large improvements when including the FFAM in comparison to the distortion penalty (row #5 vs #6), and fine-tuning improves the results considerably. Including the BJM in rescoring improves the system by up to 0.4% BLEU. Our best system (row #8) outperforms the attention-based model by up to 0.4% BLEU and 2.8% TER. We observe that the length ratio of our system's output to the reference is 93.3-94.9%, while it is 99.1-102.6% for the attention-based system. In light of the BLEU and TER scores, the attention-based model does not benefit from matching the reference length. Our system (row #8) still lags behind the phrase-based system (row #1). Note, however, that in the WMT 2016 evaluation campaign (http://matrix.statmt.org/), it was demonstrated that NMT can outperform phrase-based systems on several tasks including German→English and English→German. Including monolingual data (Sennrich et al., 2016) in training neural translation models can boost performance, and this can be applied to our system.

7.3 Neural Alignments

Next, we experiment with re-aligning the training data using neural networks as described in Section 6. We use the fine-tuned FFJM and FFAM to realign the in-domain data of the IWSLT German→English task.


These models are initially trained using GIZA++ alignments. We train new models using the re-aligned data and compare the translation quality before and after re-alignment. We use 0.7 and 0.3 as model weights for the FFJM and FFAM, respectively. These values are based on the model weights obtained using MERT. The results are shown in Table 4. Note that the baseline is worse than the one in Table 2 as the models are only trained on the in-domain data. We observe that re-aligning the data improves translation quality by up to 0.3% BLEU and 1.2% TER. The new alignments are generated using the neural decoder, and using them to train the neural networks results in training that is more consistent with decoding. As future work, we intend to re-align the full bilingual data and use it for neural training.

                                     test1            test2
#  system                           BLEU    TER      BLEU    TER
1  phrase-based system              17.6    68.3     16.9    67.4
2    + monolingual data             17.9    67.9     17.0    67.1
3  attention-based RNN              14.8    76.1     13.6    76.9
4    + fine-tuning                  16.1    73.1     15.4    72.3
5  FFJM+dp+wp                       10.1    77.2      9.8    75.8
6  FFJM+FFAM+wp                     14.4    71.9     13.7    71.3
7    + fine-tuning                  15.8    70.3     15.4    69.4
8      + BJM rescoring              16.0    70.3     15.8    69.5
9  BJM+FFAM+wp+fine-tuning          16.0    70.4     15.7    69.7

Table 3: BOLT Chinese→English results in BLEU [%] and TER [%].

                           test 2010        eval 2011
Alignment Source          BLEU    TER      BLEU    TER
GIZA++                    25.6    53.6     29.3    49.7
Neural forced decoding    25.9    52.4     29.5    49.4

Table 4: Re-alignment results in BLEU [%] and TER [%] on the IWSLT 2013 German→English in-domain data. Each system includes FFJM, FFAM and word penalty.

Figure 3: Decoding speed-up and translation quality using top scoring classes in a class-factored output layer. The results are computed for the IWSLT German→English dev dataset. (x-axis: number of classes, log scale from 1 to 1000; left y-axis: BLEU [%]; right y-axis: speed-up factor.)

Figure 4: A translation example produced by our system. The shown German sentence is pre-ordered. (The figure plots the alignment path between the pre-ordered source "und war der Vorschlag , zu bauen viele weitere Kohle Fabriken ." and the output "and the proposal was to build a lot of other coal factories .")

7.4 Class-Factored Output Layer

Figure 3 shows the trade-off between speed and performance when evaluating words belonging to the top classes only. Limiting the evaluation to words belonging to the top class incurs a performance loss of 0.4% BLEU only when compared to the full evaluation of the output layer. However, this corresponds to a large speed-up. The system is about 30 times faster, with a translation speed of 0.4 words/sec. In conclusion, not only does the class layer speed up training, but it can also be used to speed up decoding considerably. We use the top 3 classes throughout our experiments.

8 Analysis

We show an example from the German→English task in Figure 4, along with the alignment path. The reference translation is 'and the proposal has been to build a lot more coal plants .'. Our system handles the local reordering of the word 'was', which is produced in the correct target order. An example on the one-to-many alignments is given by the correct translation of 'viele' to 'a lot of'.

As an example on handling multiply-aligned target words, we observe the translation of 'Nord Westen' to 'northwest' in our output. This is possible because the source window allows the FFNN to translate the word 'Westen' in context of the word 'Nord'.

Table 5 lists some translation examples produced by our system and the attention-based system, where maximum attention weights are used as alignment.


1  source         sie würden verhungern nicht , und wissen Sie was ?
   reference      they wouldn 't starve , and you know what ?
   attention NMT  you wouldn 't interview , and guess what ?
   our system     they wouldn 't starve , and you know what ?

2  source         denn sie sind diejenigen , die sind auch Experten für Geschmack .
   reference      because they 're the ones that are experts in flavor , too .
   attention NMT  because they 're the ones who are also experts .
   our system     because they 're the ones who are also experts in flavor .

3  source         es ist ein Online Spiel , in dem Sie müssen überwinden eine Ölknappheit .
   reference      this is an online game in which you try to survive an oil shortage .
   attention NMT  it 's an online game where you need to get through a UNKOWN .
   our system     it 's an online game in which you have to overcome an astrolabe .

4  source         es liegt daran , dass gehen nicht Möglichkeiten auf diesem Planeten zurück , sie gehen vorwärts .
   reference      it 's because possibilities on this planet , they don 't go back , they go forward .
   attention NMT  it 's because there 's no way back on this planet , they 're going to move forward .
   our system     it 's because opportunities don 't go on this planet , they go forward .

Table 5: Sample translations from the IWSLT German→English test set using the attention-based system (Table 2, row #4) and our system (Table 2, row #7). We highlight the (pre-ordered) source words and their aligned target words. We underline the source words of interest, italicize correct translations, and use bold-face for incorrect translations.

While we use larger vocabularies compared to the attention-based system, we observe incorrect translations of rare words. E.g., the German word 'Ölknappheit' in sentence 3 occurs only 7 times in the training data among 108M words, and therefore it is an unknown word for the attention system. Our system has the word in the source vocabulary but fails to predict the right translation. Another problem occurs in sentence 4, where the German verb 'zurückgehen' is split into 'gehen ... zurück'. Since the feed-forward model uses a source window of size 9, it cannot include both words when it is centered at any of them. Such insufficient context might be resolved when integrating the bidirectional RNN in decoding. Note that the attention-based model also fails to produce the correct translation here.

9 Conclusion

This work takes a step towards bridging the gap between conventional word alignment concepts and NMT. We use an HMM-inspired factorization of the lexical and alignment models, and employ the Viterbi alignments obtained using conventional HMM/IBM models to train neural models. An alignment-based decoder is introduced and a log-linear framework is used to combine the models. We use MERT to tune the model weights. Our system outperforms the attention-based system on the German→English task by up to 0.9% BLEU, and on Chinese→English by up to 2.8% TER. We also demonstrate that re-aligning the training data using the neural decoder yields better translation quality.

As future work, we aim to integrate alignment-based RNNs such as the BJM into the alignment-based decoder. We also plan to develop a bidirectional RNN alignment model to make jump decisions based on unbounded context. In addition, we want to investigate the use of coverage constraints in alignment-based NMT. Furthermore, we consider the re-alignment experiment promising and plan to apply re-alignment on the full bilingual data of each task.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21). Tamer Alkhouli was partly funded by the 2016 Google PhD Fellowship for North America, Europe and the Middle East.

References

Tamer Alkhouli, Andreas Guta, and Hermann Ney. 2014. Vector space models for phrase-based machine translation. In EMNLP 2014 Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–10, Doha, Qatar, October.

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation mod-


els. In Proceedings of the EMNLP 2015 Tenth Work- model. In Proceedings of the Conference on Empiri- shop on Statistical Machine Translation, pages 294– cal Methods in Natural Language Processing, pages 303, Lisbon, Portugal, September. 848–856, Honolulu, Hawaii, USA, October.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Joshua Goodman. 2001. Classes for fast maximum en- gio. 2015. Neural machine translation by jointly tropy training. In Proceedings of the IEEE Interna- learning to align and translate. In International Con- tional Conference on Acoustics, Speech, and Signal ference on Learning Representations, San Diego, Processing, volume 1, pages 561–564, Utah, USA, Calefornia, USA, May. May.


C The QT21/HimL Combined Machine Translation System

The QT21/HimL Combined Machine Translation System

Jan-Thorsten Peter1, Tamer Alkhouli1, Hermann Ney1, Matthias Huck2, Fabienne Braune2, Alexander Fraser2, Aleš Tamchyna2,3, Ondřej Bojar3, Barry Haddow4, Rico Sennrich4, Frédéric Blain5, Lucia Specia5, Jan Niehues6, Alex Waibel6, Alexandre Allauzen7, Lauriane Aufrant7,8, Franck Burlot7, Elena Knyazeva7, Thomas Lavergne7, François Yvon7, Stella Frank9, Mārcis Pinnis10

1RWTH Aachen University, Aachen, Germany  2LMU Munich, Munich, Germany  3Charles University in Prague, Prague, Czech Republic  4University of Edinburgh, Edinburgh, UK  5University of Sheffield, Sheffield, UK  6Karlsruhe Institute of Technology, Karlsruhe, Germany  7LIMSI, CNRS, Université Paris Saclay, Orsay, France  8DGA, Paris, France  9ILLC, University of Amsterdam, Amsterdam, The Netherlands  10Tilde, Riga, Latvia

1{peter,alkhouli,ney}@cs.rwth-aachen.de  2{mhuck,braune,fraser}@cis.lmu.de  3{tamchyna,bojar}@ufal.mff.cuni.cz  [email protected]  [email protected]  5{f.blain,l.specia}@sheffield.ac.uk  6{jan.niehues,alex.waibel}@kit.edu  7{allauzen,aufrant,burlot,knyazeva,lavergne,yvon}@limsi.fr  [email protected]  [email protected]

Abstract

This paper describes the joint submission of the QT21 and HimL projects for the English→Romanian translation task of the ACL 2016 First Conference on Machine Translation (WMT 2016). The submission is a system combination which combines twelve different statistical machine translation systems provided by the different groups (RWTH Aachen University, LMU Munich, Charles University in Prague, University of Edinburgh, University of Sheffield, Karlsruhe Institute of Technology, LIMSI, University of Amsterdam, Tilde). The systems are combined using RWTH's system combination approach. The final submission shows an improvement of 1.0 BLEU compared to the best single system on newstest2016.

1 Introduction

Quality Translation 21 (QT21) is a European machine translation research project with the aim of substantially improving statistical and machine learning based translation models for challenging languages and low-resource scenarios.

Health in my Language (HimL) aims to make public health information available in a wider variety of languages, using fully automatic machine translation that combines the statistical paradigm with deep linguistic techniques.

In order to achieve high-quality machine translation from English into Romanian, members of the QT21 and HimL projects have jointly built a combined statistical machine translation system. We participated with the QT21/HimL combined machine translation system in the WMT 2016 shared task for machine translation of news.1 Core components of the QT21/HimL combined system are twelve individual English→Romanian translation engines which have been set up by different QT21 or HimL project partners. The outputs of all these individual engines are combined using the system combination approach as

1 http://www.statmt.org/wmt16/translation-task.html


implemented in Jane, RWTH's open source statistical machine translation toolkit (Freitag et al., 2014a). The Jane system combination is a mature implementation which previously has been successfully employed in other collaborative projects and for different language pairs (Freitag et al., 2013; Freitag et al., 2014b; Freitag et al., 2014c).

In the remainder of the paper, we present the technical details of the QT21/HimL combined machine translation system and the experimental results obtained with it. The paper is structured as follows: We describe the common preprocessing used for most of the individual engines in Section 2. Section 3 covers the characteristics of the different individual engines, followed by a brief overview of our system combination approach (Section 4). We then summarize our empirical results in Section 5, showing that we achieve better translation quality than with any individual engine. Finally, in Section 6, we provide a statistical analysis of certain linguistic phenomena, specifically the prediction precision on morphological attributes. We conclude the paper with Section 7.

2 Preprocessing

The data provided for the task was preprocessed once, by LIMSI, and shared with all the participants, in order to ensure consistency between systems. On the English side, preprocessing consists of tokenizing and truecasing using the Moses toolkit (Koehn et al., 2007).

On the Romanian side, the data is tokenized using LIMSI's tokro (Allauzen et al., 2016), a rule-based tokenizer that mainly normalizes diacritics and splits punctuation and clitics. This data is truecased in the same way as the English side. In addition, the Romanian sentences are also tagged, lemmatized, and chunked using the TTL tagger (Tufiş et al., 2008).

3 Translation Systems

Each group contributed one or more systems. In this section the systems are presented in alphabetic order.

3.1 KIT

The KIT system consists of a phrase-based machine translation system using additional models in rescoring. The phrase-based system is trained on all available parallel training data. The phrase table is adapted to the SETimes2 corpus (Niehues and Waibel, 2012). The system uses a pre-reordering technique (Rottmann and Vogel, 2007) in combination with lexical reordering. It uses two word-based n-gram language models and three additional non-word language models. Two of them are automatic word class-based (Och, 1999) language models, using 100 and 1,000 word classes. In addition, we use a POS-based language model. During decoding, we use a discriminative word lexicon (Niehues and Waibel, 2013) as well.

We rescore the system output using a 300-best list. The weights are optimized on the concatenation of the development data and the SETimes2 dev set using the ListNet algorithm (Niehues et al., 2015). In rescoring, we add the source discriminative word lexica (Herrmann et al., 2015) as well as neural network language and translation models. These models use a factored word representation of the source and the target. On the source side we use the word surface form and two automatic word classes using 100 and 1,000 classes. On the Romanian side, we add the POS information as an additional word factor.

3.2 LIMSI

The LIMSI system uses NCODE (Crego et al., 2011), which implements the bilingual n-gram approach to SMT (Casacuberta and Vidal, 2004; Crego and Mariño, 2006; Mariño et al., 2006) that is closely related to the standard phrase-based approach (Zens et al., 2002). In this framework, translation is divided into two steps. To translate a source sentence into a target sentence, the source sentence is first reordered according to a set of rewriting rules so as to reproduce the target word order. This generates a word lattice containing the most promising source permutations, which is then translated. Since the translation step is monotonic, this approach is able to rely on the n-gram assumption to decompose the joint probability of a sentence pair into a sequence of bilingual units called tuples.

We train three Romanian 4-gram language models, pruning all singletons with KenLM (Heafield, 2011). We use the in-domain monolingual corpus, the Romanian side of the parallel corpora and a subset of the (out-of-domain) Common Crawl corpus as training data. We select in-domain sentences from the latter using the Moore-Lewis (Moore and Lewis, 2010) filtering method,
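The Moore-Lewis selection used here ranks out-of-domain sentences by the cross-entropy difference between an in-domain and an out-of-domain language model. The following is only a minimal sketch of that ranking step, assuming two pre-trained KenLM models; the file names and the two-thirds cut-off are illustrative stand-ins for the actual LIMSI/XenC setup.

```python
import kenlm  # Python bindings of the KenLM toolkit mentioned above

# Hypothetical model paths: in-domain news vs. general Common Crawl text.
in_domain_lm = kenlm.Model("in_domain.arpa")
out_domain_lm = kenlm.Model("out_of_domain.arpa")

def cross_entropy_per_word(lm, sentence):
    # KenLM returns the log10 probability of the full sentence; the constant
    # log base does not affect the ranking, so we keep log10 units.
    num_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    return -lm.score(sentence, bos=True, eos=True) / num_tokens

def moore_lewis_score(sentence):
    # Lower is better: the sentence looks in-domain and unlike the general corpus.
    return cross_entropy_per_word(in_domain_lm, sentence) - \
           cross_entropy_per_word(out_domain_lm, sentence)

with open("commoncrawl.ro", encoding="utf-8") as f:
    scored = sorted((moore_lewis_score(line.strip()), line) for line in f if line.strip())

# Keep the best-scoring two thirds (the paper reports removing about one third).
kept = [line for _, line in scored[: int(len(scored) * 2 / 3)]]
```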


more specifically its implementation in XenC (Rousseau, 2013). As a result, one third of the initial corpus is removed. Finally, we make a linear interpolation of these models, using the SRILM toolkit (Stolcke, 2002).

3.3 LMU-CUNI

The LMU-CUNI contribution is a constrained Moses phrase-based system. It uses a simple factored setting: our phrase table produces not only the target surface form but also its lemma and morphological tag. On the input, we include lemmas, POS tags and information from dependency parses (lemma of the parent node and syntactic relation), all encoded as additional factors.

The main difference from a standard phrase-based setup is the addition of a feature-rich discriminative translation model which is conditioned on both source- and target-side context (Tamchyna et al., 2016). The motivation for using this model is to better condition lexical choices by using the source context and to improve morphological and topical coherence by modeling the (limited left-hand side) target context.

We also take advantage of the target factors by using a 7-gram language model trained on sequences of Romanian morphological tags. Finally, our system also uses a standard lexicalized reordering model.

3.4 LMU

The LMU system integrates a discriminative rule selection model into a hierarchical SMT system, as described in (Tamchyna et al., 2014). The rule selection model is implemented using the high-speed classifier Vowpal Wabbit,2 which is fully integrated in Moses' hierarchical decoder. During decoding, the rule selection model is called at each rule application with syntactic context information as feature templates. The features are the same as used by Braune et al. (2015) in their string-to-tree system, including both lexical and soft source syntax features. The translation model features comprise the standard hierarchical features (Chiang, 2005) with an additional feature for the rule selection model (Braune et al., 2016).

Before training, we reduce the number of translation rules using significance testing (Johnson et al., 2007). To extract the features of the rule selection model, we parse the English part of our training data using the Berkeley parser (Petrov et al., 2006). For model prediction during tuning and decoding, we use parsed versions of the development and test sets. We train the rule selection model using VW and tune the weights of the translation model using batch MIRA (Cherry and Foster, 2012). The 5-gram language model is trained using KenLM (Heafield et al., 2013) on the Romanian part of the Common Crawl corpus concatenated with the Romanian part of the training data.

3.5 RWTH Aachen University: Hierarchical Phrase-based System

The RWTH hierarchical setup uses the open source translation toolkit Jane 2.3 (Vilar et al., 2010). Hierarchical phrase-based translation (HPBT) (Chiang, 2007) induces a weighted synchronous context-free grammar from parallel text. In addition to the contiguous lexical phrases, as used in phrase-based translation (PBT), hierarchical phrases with up to two gaps are also extracted. Our baseline model contains models with phrase translation probabilities and lexical smoothing probabilities in both translation directions, word and phrase penalty, and enhanced low frequency features (Chen et al., 2011). It also contains binary features to distinguish between hierarchical and non-hierarchical phrases, the glue rule, and rules with non-terminals at the boundaries. We use the cube pruning algorithm (Huang and Chiang, 2007) for decoding.

The system uses three backoff language models (LM) that are estimated with the KenLM toolkit (Heafield et al., 2013) and are integrated into the decoder as separate models in the log-linear combination: a full 4-gram LM (trained on all data), a limited 5-gram LM (trained only on in-domain data), and a 7-gram word class language model (wcLM) (Wuebker et al., 2013) trained on all data and with an output vocabulary of 143K words. The system produces 1000-best lists which are reranked using an LSTM-based (Hochreiter and Schmidhuber, 1997; Gers et al., 2000; Gers et al., 2003) language model (Sundermeyer et al., 2012) and an LSTM-based bidirectional joined model (BJM) (Sundermeyer et al., 2014a). The models have a class-factored output layer (Goodman, 2001; Morin and Bengio, 2005) to speed up training and evaluation. The language model uses 3 stacked LSTM layers, with 350 nodes each. The BJM has a projection layer, and computes a

2 http://hunch.net/~vw/ (VW). Implemented by John Langford and many others.
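The class-factored output layer mentioned above factors the output distribution as p(w | h) = p(c(w) | h) · p(w | c(w), h), so that normalization runs over the number of classes plus the members of one class instead of the full vocabulary. Below is a minimal NumPy sketch of the scoring step; the layer sizes, random weights and class assignment are illustrative and not the actual RWTH configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_classes, hidden = 143_000, 1_000, 350
word_class = rng.integers(num_classes, size=vocab_size)      # fixed word-to-class map c(w)
W_class = rng.normal(scale=0.01, size=(num_classes, hidden))  # class softmax weights
W_word = rng.normal(scale=0.01, size=(vocab_size, hidden))    # per-word output weights

def log_prob(word, h):
    """log p(word | h) = log p(c(word) | h) + log p(word | c(word), h)."""
    c = word_class[word]
    class_logits = W_class @ h
    log_p_class = class_logits[c] - np.logaddexp.reduce(class_logits)
    members = np.flatnonzero(word_class == c)                 # words sharing class c
    member_logits = W_word[members] @ h
    log_p_word = (W_word[word] @ h) - np.logaddexp.reduce(member_logits)
    return log_p_class + log_p_word

h = rng.normal(size=hidden)  # stand-in for the LSTM hidden state
print(log_prob(42, h))
```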


forward recurrent state encoding the source and target history, a backward recurrent state encoding the source future, and a third LSTM layer to combine them. All layers have 350 nodes. The neural networks are implemented using an extension of the RWTHLM toolkit (Sundermeyer et al., 2014b). The parameter weights are optimized with MERT (Och, 2003) towards the BLEU metric.

3.6 RWTH Neural System

The second system provided by RWTH is an attention-based recurrent neural network similar to (Bahdanau et al., 2015). The implementation is based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012).

The network uses the 30K most frequent words on the source and target side as input vocabulary. The decoder and encoder word embeddings are of size 620. The encoder uses a bidirectional layer with 1024 GRUs (Cho et al., 2014) to encode the source side, while the decoder uses a layer of 1024 GRUs.

The network is trained for up to 300K updates with a minibatch size of 80 using Adadelta (Zeiler, 2012). The network is evaluated every 10,000 updates on BLEU and the best network on the newsdev2016/1 dev set is selected as the final network.

The monolingual News Crawl 2015 corpus is translated into English with a simple phrase-based translation system to create additional parallel training data. The new data is weighted by using the News Crawl 2015 corpus (2.3M sentences) once, the Europarl corpus (0.4M sentences) twice and the SETimes2 corpus (0.2M sentences) three times. The final system is an ensemble of 4 networks, all with the same configuration and training settings.

3.7 Tilde

The Tilde system is a phrase-based machine translation system built on the LetsMT infrastructure (Vasiļjevs et al., 2012) that features language-specific data filtering and cleaning modules. Tilde's system was trained on all available parallel data. Two language models are trained using KenLM (Heafield, 2011): 1) a 5-gram model using the Europarl and SETimes2 corpora, and 2) a 3-gram model using the Common Crawl corpus. We also apply a custom tokenization tool that takes into account specifics of the Romanian language and handles non-translatable entities (e.g., file paths, URLs, e-mail addresses, etc.). During translation a rule-based localisation feature is applied.

3.8 Edinburgh/LMU Hierarchical System

The UEDIN-LMU HPBT system is a hierarchical phrase-based machine translation system (Chiang, 2005) built jointly by the University of Edinburgh and LMU Munich. The system is based on the open source Moses implementation of the hierarchical phrase-based paradigm (Hoang et al., 2009). In addition to a set of standard features in a log-linear combination, a number of non-standard enhancements are employed to achieve improved translation quality.

Specifically, we integrate individual language models trained over the separate corpora (News Crawl 2015, Europarl, SETimes2) directly into the log-linear combination of the system and let MIRA (Cherry and Foster, 2012) optimize their weights along with all other features in tuning, rather than relying on a single linearly interpolated language model. We add another background language model estimated over a concatenation of all Romanian corpora including Common Crawl. All language models are unpruned.

For hierarchical rule extraction, we impose less strict extraction constraints than the Moses defaults. We extract more hierarchical rules by allowing for a maximum of ten symbols on the source side, a maximum span of twenty words, and no lower limit to the amount of words covered by right-hand side non-terminals at extraction time. We discard rules with non-terminals on their right-hand side if they are singletons in the training data.

In order to promote better reordering decisions, we implemented a feature in Moses that resembles the phrase orientation model for hierarchical machine translation as described by Huck et al. (2013) and extend our system with it. The model scores orientation classes (monotone, swap, discontinuous) for each rule application in decoding.

We finally follow the approach outlined by Huck et al. (2011) for lightly-supervised training of hierarchical systems. We automatically translate parts (1.2M sentences) of the monolingual Romanian News Crawl 2015 corpus to English with a Romanian→English phrase-based statistical machine translation system (Williams et al., 2016). The foreground phrase table extracted from the human-generated parallel data is filled
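The corpus weighting described above for the RWTH neural system (back-translated News Crawl once, Europarl twice, SETimes2 three times) can be realized simply by duplicating corpora before shuffling the training data. A small sketch with hypothetical file names; the weights are the ones reported in the text, everything else is illustrative.

```python
import random

# (source file, target file, weight) as reported for the RWTH NMT system.
corpora = [
    ("newscrawl2015.bt.en", "newscrawl2015.ro", 1),  # back-translated News Crawl 2015
    ("europarl.en", "europarl.ro", 2),
    ("setimes2.en", "setimes2.ro", 3),
]

pairs = []
for src_path, tgt_path, weight in corpora:
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        lines = list(zip(src, tgt))
    pairs.extend(lines * weight)  # repeat the whole corpus `weight` times

random.seed(1)
random.shuffle(pairs)
with open("train.en", "w", encoding="utf-8") as src_out, \
     open("train.ro", "w", encoding="utf-8") as tgt_out:
    for src_line, tgt_line in pairs:
        src_out.write(src_line)
        tgt_out.write(tgt_line)
```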


up with entries from a background phrase table extracted from the automatically produced News Crawl 2015 parallel data.

Huck et al. (2016) give a more in-depth description of the Edinburgh/LMU hierarchical machine translation system, along with detailed experimental results.

3.9 Edinburgh Neural System

Edinburgh's neural machine translation system is an attentional encoder-decoder (Bahdanau et al., 2015), which we train with nematus.3 We use byte-pair encoding (BPE) to achieve open-vocabulary translation with a fixed vocabulary of subword symbols (Sennrich et al., 2016c). We produce additional parallel training data by automatically translating the monolingual Romanian News Crawl 2015 corpus into English (Sennrich et al., 2016b), which we combine with the original parallel data in a 1-to-1 ratio. We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We apply dropout to all layers (Gal, 2015), with dropout probability 0.2, and also drop out full words with probability 0.1. We clip the gradient norm to 1.0 (Pascanu et al., 2013). We train the models with Adadelta (Zeiler, 2012), reshuffling the training corpus between epochs. We validate the model every 10,000 minibatches via BLEU on a validation set, and perform early stopping on BLEU. Decoding is performed with beam search with a beam size of 12.

A more detailed description of the system, and more experimental results, can be found in (Sennrich et al., 2016a).

3.10 Edinburgh Phrase-based System

Edinburgh's phrase-based system is built using the Moses toolkit, with fast_align (Dyer et al., 2013) for word alignment, and KenLM (Heafield et al., 2013) for language model training. In our Moses setup, we use hierarchical lexicalized reordering (Galley and Manning, 2008), an operation sequence model (Durrani et al., 2013), domain indicator features, and binned phrase count features. We use all available parallel data for the translation model, and all available Romanian text for the language model. We use two different 5-gram language models; one built from all the monolingual target text concatenated, without pruning, and one built from only News Crawl 2015, with singleton 3-grams and above pruned out. The weights of all these features and models are tuned with k-best MIRA (Cherry and Foster, 2012) on the first half of newsdev2016. In decoding, we use MBR (Kumar and Byrne, 2004), cube pruning (Huang and Chiang, 2007) with a pop limit of 5000, and the Moses "monotone at punctuation" switch (to prevent reordering across punctuation) (Koehn and Haddow, 2009).

3.11 USFD Phrase-based System

USFD's phrase-based system is built using the Moses toolkit, with MGIZA (Gao and Vogel, 2008) for word alignment and KenLM (Heafield et al., 2013) for language model training. We use all available parallel data for the translation model. A single 5-gram language model is built using all the target side of the parallel data and a subpart of the monolingual Romanian corpora selected with XenC-v2 (Rousseau, 2013). For the latter we use all the parallel data as in-domain data and the first half of newsdev2016 as development set. The feature weights are tuned with MERT (Och, 2003) on the first half of newsdev2016.

The system produces distinct 1000-best lists, for which we extend the feature set with the 17 baseline black-box features from sentence-level Quality Estimation (QE) produced with Quest++ (Specia et al., 2015).4 The 1000-best lists are then reranked and the top-best hypothesis extracted using the nbest rescorer available within the Moses toolkit.

3.12 UvA

We use a phrase-based machine translation system (Moses) with a distortion limit of 6 and lexicalized reordering. Before translation, the English source side is preordered using the neural preordering model of (de Gispert et al., 2015). The preordering model is trained for 30 iterations on the full MGIZA-aligned training data. We use two language models, built using KenLM. The first is a 5-gram language model trained on all available data. Words in the Common Crawl dataset that appear fewer than 500 times were replaced by UNK, and all singleton n-grams of order 3 or higher were pruned. We also use a 7-gram class-based language model, trained on the same data. 512 word classes were generated using the method of Green et al. (2014).

3 https://github.com/rsennrich/nematus
4 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17
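The USFD reranking step adds sentence-level QE features to each n-best entry and rescores with a tuned weight vector. The sketch below shows only the rescoring idea; the feature names and weights are invented for illustration, whereas the actual system uses the 17 QuEst++ black-box features and the Moses nbest rescorer.

```python
def rerank(nbest, weights):
    """nbest: list of (hypothesis, feature_dict); return the highest-scoring hypothesis."""
    def score(entry):
        _, features = entry
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(nbest, key=score)[0]

# Toy example mixing decoder features with a made-up QE feature.
nbest = [
    ("hypothesis one", {"tm": -4.2, "lm": -10.1, "qe_avg_word_len": 4.5}),
    ("hypothesis two", {"tm": -3.9, "lm": -11.0, "qe_avg_word_len": 4.1}),
]
weights = {"tm": 1.0, "lm": 0.5, "qe_avg_word_len": -0.2}
print(rerank(nbest, weights))
```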


[Figure 1 graphic: a confusion network built from the four system outputs "the large building", "the large home", "a big house", and "a huge house".]

Figure 1: System A: the large building; System B: the large home; System C: a big house; System D: a huge house; Reference: the big house.

4 System Combination

System combination produces consensus translations from multiple hypotheses which are obtained from different translation approaches, i.e., the systems described in the previous section. A system combination implementation developed at RWTH Aachen University (Freitag et al., 2014a) is used to combine the outputs of the different engines. The consensus translations outperform the individual hypotheses in terms of translation quality.

The first step in system combination is the generation of confusion networks (CN) from I input translation hypotheses. We need pairwise alignments between the input hypotheses, which are obtained from METEOR (Banerjee and Lavie, 2005). The hypotheses are then reordered to match a selected skeleton hypothesis in terms of word ordering. We generate I different CNs, each having one of the input systems as the skeleton hypothesis, and the final lattice is the union of all I generated CNs. In Figure 1 an example of a confusion network with I = 4 input translations is depicted. Decoding of a confusion network finds the best path in the network. Each arc is assigned a score of a linear model combination of M different models, which includes a word penalty, a 3-gram language model trained on the input hypotheses, a binary primary system feature that marks the primary hypothesis, and a binary voting feature for each system. The binary voting feature for a system is 1 if and only if the decoded word is from that system, and 0 otherwise. The different model weights for system combination are trained with MERT (Och, 2003).

5 Experimental Evaluation

Since only one development set was provided we split the given development set into two parts: newsdev2016/1 and newsdev2016/2. The first part was used as development set while the second part was our internal test set. Additionally we extracted 2000 sentences from the Europarl and SETimes2 data to create two additional development and test sets. Most single systems are optimized for newsdev2016/1 and/or the SETimes2 test set. The system combination was optimized on the newsdev2016/1 set.

The single system scores in Table 1 show clearly that the UEDIN NMT system is the strongest single system by a large margin. The other standalone attention-based neural network contribution, RWTH NMT, follows, with only a small margin before the phrase-based contributions. The combination of all systems improved the strongest system by another 1.9 BLEU points on our internal test set, newsdev2016/2, and by 1 BLEU point on the official test set, newstest2016.

Removing the strongest system from our system combination shows a large degradation of the results. The combination is still slightly stronger than the UEDIN NMT system on newsdev2016/2, but lags behind on newstest2016. Removing the weakest individual system shows a slight degradation on newsdev2016/2 and newstest2016, hinting that it still provides valuable information.

Table 2 shows a comparison between all systems by scoring the translation outputs against each other in TER and BLEU. We see that the neural network outputs differ the most from all the other systems.

6 Morphology Prediction Precision

In order to assess how well the different system outputs predict the right morphology, we compute a precision rate for each Romanian morphological attribute that occurs with nouns, pronouns, adjectives, determiners, and verbs (Table 3). For this purpose, we use the METEOR toolkit (Banerjee and Lavie, 2005) to obtain word alignments between each system translation and the reference translation for newstest2016. The reference and hypotheses are tagged with TTL (Tufiş et al., 2008).5 Each word in the reference that is assigned a POS tag of interest (noun, pronoun, adjective, determiner, or verb) is then compared to the word it is aligned to in the system output.

5 The hypotheses were tagged despite the risks that go along with tagging automatically generated sentences. A dictionary would have been a solution, but unfortunately we had no such resource for Romanian.
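To make the voting feature of Section 4 concrete, here is a toy sketch of confusion-network voting, not the Jane implementation: the METEOR-based alignment into slots is assumed to have been done already, and the real system additionally scores each arc with a language model, word penalty and primary-system feature.

```python
def decode_confusion_network(slots, system_weights):
    """slots: list of dicts mapping a candidate word ("" means an epsilon arc)
    to the set of systems voting for it; pick the best-voted word per slot."""
    output = []
    for slot in slots:
        best_word = max(slot, key=lambda w: sum(system_weights[s] for s in slot[w]))
        if best_word:  # skip epsilon arcs
            output.append(best_word)
    return " ".join(output)

# Slots built from the four aligned hypotheses of Figure 1.
slots = [
    {"the": {"A", "B"}, "a": {"C", "D"}},
    {"large": {"A", "B"}, "big": {"C"}, "huge": {"D"}},
    {"building": {"A"}, "home": {"B"}, "house": {"C", "D"}},
]
weights = {"A": 0.2, "B": 0.2, "C": 0.3, "D": 0.3}
print(decode_confusion_network(slots, weights))  # -> "a large house" with these weights
```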


Individual Systems      newsdev2016/1     newsdev2016/2     newstest2016
                        BLEU    TER       BLEU    TER       BLEU    TER
KIT                     25.2    57.5      29.9    51.8      26.3    55.9
LIMSI                   23.3    59.5      27.2    55.0      23.9    59.2
LMU-CUNI                23.4    60.4      28.4    53.5      24.7    58.1
LMU                     23.3    60.5      28.6    53.8      24.5    58.5
RWTH HPBT               25.4    58.7      29.3    53.3      25.9    57.6
RWTH NMT                25.1    57.4      30.6    49.6      26.5    55.4
Tilde                   21.3    62.7      25.8    56.3      23.2    60.2
UEDIN-LMU HPBT          24.8    58.7      30.1    52.3      25.4    57.7
UEDIN PBT               24.7    59.3      29.1    53.2      25.2    58.1
UEDIN NMT               26.8    56.1      31.4    50.3      27.9    54.5
USFD                    22.9    60.4      27.8    54.0      24.4    58.5
UvA                     22.1    61.0      27.7    54.2      24.1    58.7
System Combination      28.7    55.5      33.3    49.0      28.9    54.2
- without UEDIN NMT     27.4    56.6      31.6    50.9      27.5    55.4
- without Tilde         28.8    55.5      33.0    49.5      28.7    54.5

Table 1: Results of the individual systems for the English→Romanian task. BLEU [%] and TER [%] scores are case-sensitive.

                 KIT   LIMSI LMU-CUNI LMU  RWTH HPBT RWTH NMT Tilde UEDIN-LMU HPBT UEDIN PBT UEDIN NMT USFD  UvA   Average
KIT              -     55.0  55.9     51.7 56.2      48.2     50.3  54.6           55.1      42.8      56.6  54.1  52.8
LIMSI            29.3  -     54.3     52.1 51.8      43.0     49.8  55.3           56.2      38.2      57.3  52.1  51.4
LMU-CUNI         28.5  30.8  -        52.4 53.3      43.8     55.4  56.0           56.6      39.3      58.6  56.6  52.9
LMU              31.2  32.0  31.7     -    53.6      43.1     49.1  59.4           58.6      37.8      56.1  55.8  51.8
RWTH HPBT        28.5  32.4  31.2     30.8 -         47.5     50.1  54.9           55.6      41.8      53.9  55.3  52.2
RWTH NMT         33.7  37.9  37.3     37.5 34.8      -        40.8  44.3           45.3      46.0      43.8  43.6  44.5
Tilde            32.2  33.7  29.6     33.8 33.4      39.6     -     53.4           58.5      36.5      55.5  52.0  50.1
UEDIN-LMU HPBT   29.5  29.9  29.4     27.3 29.8      36.9     30.9  -              62.8      38.9      59.6  56.2  54.1
UEDIN PBT        28.4  28.9  28.5     27.0 29.3      35.4     27.0  24.2           -         39.4      60.2  58.6  55.2
UEDIN NMT        38.6  42.6  42.0     43.0 40.1      35.5     44.0  42.1           41.1      -         38.2  38.2  39.7
USFD             27.6  28.8  27.4     28.8 30.4      37.0     29.1  26.5           25.7      42.6      -     58.8  54.4
UvA              29.9  32.0  28.6     29.2 29.6      37.5     31.5  29.0           26.5      43.2      26.9  -     52.9
Average          30.7  32.6  31.4     32.0 31.8      36.6     33.2  30.5           29.3      41.3      30.0  31.3  -

Table 2: Comparison of system outputs against each other, generated by computing BLEU and TER on the system translations for newstest2016. One system in a pair is used as the reference, the other as candidate translation; we report the average over both directions. The upper-right half lists BLEU [%] scores, the lower-left half TER [%] scores.


Attribute          KIT    LIMSI  LMU-CUNI LMU    RWTH HPBT RWTH NMT Tilde  UEDIN-LMU HPBT UEDIN PBT UEDIN NMT USFD   UvA    Combination
Case               46.7%  46.0%  46.3%    45.7%  47.7%     48.0%    44.4%  46.3%          47.4%     49.8%     45.4%  45.4%  50.8%
Definite           50.5%  49.1%  50.0%    49.2%  50.5%     50.1%    47.2%  50.0%          50.5%     51.0%     49.2%  48.9%  53.3%
Gender             51.9%  51.0%  51.9%    51.3%  52.6%     52.1%    49.6%  51.9%          52.7%     53.0%     51.2%  50.9%  54.9%
Number             53.2%  51.7%  52.6%    52.3%  53.6%     53.7%    50.6%  52.9%          53.6%     54.9%     52.1%  51.8%  56.3%
Person             52.8%  51.3%  52.0%    52.0%  53.5%     55.0%    50.6%  52.6%          53.4%     57.2%     52.4%  51.6%  57.1%
Tense              45.8%  44.1%  44.7%    44.8%  45.7%     45.5%    42.3%  45.2%          45.1%     46.6%     44.9%  44.8%  48.0%
Verb form          45.9%  44.4%  45.5%    44.9%  46.6%     47.0%    43.9%  46.1%          46.5%     47.2%     45.5%  43.3%  48.7%
Reference words
with alignment     57.7%  56.7%  57.3%    57.3%  58.3%     57.6%    55.7%  58.0%          58.5%     58.3%     57.3%  56.8%  60.4%

Table 3: Precision of each system on morphological attribute prediction computed over the reference translation using METEOR alignments. The last row shows the ratio of reference words for which METEOR managed to find an alignment in the hypothesis.
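The following is a simplified sketch of the per-attribute precision computation of Section 6: for every aligned reference word with a POS of interest, an attribute counts as correct when the aligned hypothesis word carries the same value. The tagging, the alignment and the attribute inventory are assumed to come from TTL and METEOR; the data structures here are illustrative only.

```python
from collections import defaultdict

POS_OF_INTEREST = {"noun", "pronoun", "adjective", "determiner", "verb"}

def attribute_precision(reference, hypothesis, alignment):
    """reference/hypothesis: lists of tokens, each a dict with 'pos' and an 'attrs' dict
    such as {'Number': 'Singular'}; alignment: reference position -> hypothesis position."""
    correct, total = defaultdict(int), defaultdict(int)
    for ref_pos, hyp_pos in alignment.items():
        ref_tok = reference[ref_pos]
        if ref_tok["pos"] not in POS_OF_INTEREST:
            continue
        hyp_tok = hypothesis[hyp_pos]
        for attr, value in ref_tok["attrs"].items():
            total[attr] += 1
            if hyp_tok["attrs"].get(attr) == value:
                correct[attr] += 1
    return {attr: correct[attr] / total[attr] for attr in total}

reference = [{"pos": "noun", "attrs": {"Number": "Singular", "Gender": "Feminine"}}]
hypothesis = [{"pos": "noun", "attrs": {"Number": "Singular", "Gender": "Masculine"}}]
print(attribute_precision(reference, hypothesis, {0: 0}))  # {'Number': 1.0, 'Gender': 0.0}
```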

When, for a given morphological attribute, the output and the reference have the same value (e.g. Number=Singular), we consider the prediction correct. The prediction is considered wrong in every other case.

The last row in Table 3 shows the ratio of reference words for which METEOR found an alignment in the hypothesis. We observe a high correlation between this ratio and the quality of the morphological predictions, showing that the accuracy is highly dependent on the alignments. We nevertheless observe that the predictions made by UEDIN NMT are strictly all better than UEDIN PBT, although the latter has slightly more alignments to the reference. The system combination makes the most accurate predictions for almost every attribute. The difference in precision with the best single system (UEDIN NMT) can be significant (2.3% for definite and 1.4% for tense), showing that the combination managed to effectively identify the strong points of each translation system.

7 Conclusion

Our combined effort shows that even with an extremely strong single best system, we still manage to improve the final result by one BLEU point by combining it with the other systems of all participating research groups.

The joint submission for English→Romanian is the best submission measured in terms of BLEU, as presented on the WMT submission page.6

6 http://matrix.statmt.org/

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements № 645452 (QT21) and 644402 (HimL).

References

Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, and François Yvon. 2016. LIMSI@WMT'16: Machine translation of news. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In 43rd Annual Meeting of the Assoc. for Computational Linguistics: Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 65–72, Ann Arbor, MI, USA, June.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and


Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

Fabienne Braune, Nina Seemann, and Alexander Fraser. 2015. Rule Selection with Soft Syntactic Features for String-to-Tree Statistical Machine Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP).

Fabienne Braune, Alexander Fraser, Hal Daumé III, and Aleš Tamchyna. 2016. A Framework for Discriminative Rule Selection in Hierarchical Moses. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Francesco Casacuberta and Enrique Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(3):205–225.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables. In MT Summit XIII, pages 269–275, Xiamen, China, September.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proc. of the Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 427–436, Montréal, Canada, June.

David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 263–270, Ann Arbor, Michigan, June.

David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Joseph M. Crego and José B. Mariño. 2006. Improving statistical MT by coupling reordering and decoding. Machine Translation, 20(3):199–215, July.

Josep Maria Crego, François Yvon, and José B. Mariño. 2011. N-code: an open-source Bilingual N-gram SMT Toolkit. Prague Bulletin of Mathematical Linguistics, 96:49–58.

Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. 2015. Fast and accurate preordering for SMT using neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Colorado, May–June.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 399–405, Sofia, Bulgaria, August.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL-HLT, pages 644–648, Atlanta, Georgia, June.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Nadir Durrani, Matthias Huck, Philipp Koehn, Thanh-Le Ha, Jan Niehues, Mohammed Mediani, Teresa Herrmann, Alex Waibel, Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. EU-BRIDGE MT: Text Translation of Talks in the EU-BRIDGE Project. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 128–135, Heidelberg, Germany, December.

Markus Freitag, Matthias Huck, and Hermann Ney. 2014a. Jane: Open Source Machine Translation System Combination. In Proc. of the Conf. of the European Chapter of the Assoc. for Computational Linguistics (EACL), pages 29–32, Gothenburg, Sweden, April.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sennrich, Nadir Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, Teresa Herrmann, Eunah Cho, and Alex Waibel. 2014b. EU-BRIDGE MT: Combined Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 105–113, Baltimore, MD, USA, June.

Markus Freitag, Joern Wuebker, Stephan Peitz, Hermann Ney, Matthias Huck, Alexandra Birch, Nadir Durrani, Philipp Koehn, Mohammed Mediani, Isabel Slawik, Jan Niehues, Eunah Cho, Alex Waibel, Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2014c. Combined Spoken Language Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 57–64, Lake Tahoe, CA, USA, December.

Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,


pages 848–856, Stroudsburg, PA, USA. Association for Computational Linguistics.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.

Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research, 3:115–143.

Joshua Goodman. 2001. Classes for fast maximum entropy training. CoRR, cs.CL/0108006.

Spence Green, Daniel Cer, and Christopher D. Manning. 2014. An empirical comparison of features and tuning for phrase-based machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. pages 690–696, Sofia, Bulgaria, August.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July.

Teresa Herrmann, Jan Niehues, and Alex Waibel. 2015. Source Discriminative Word Lexicon for Translation Disambiguation. In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT15), Danang, Vietnam.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 152–159, Tokyo, Japan, December.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 144–151, Prague, Czech Republic, June.

Matthias Huck, David Vilar, Daniel Stein, and Hermann Ney. 2011. Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation. In Proc. of the EMNLP 2011 Workshop on Unsupervised Learning in NLP, pages 91–96, Edinburgh, Scotland, UK, July.

Matthias Huck, Joern Wuebker, Felix Rietig, and Hermann Ney. 2013. A Phrase Orientation Model for Hierarchical Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 452–463, Sofia, Bulgaria, August.

Matthias Huck, Alexander Fraser, and Barry Haddow. 2016. The Edinburgh/LMU Hierarchical Machine Translation System for WMT 2016. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving Translation Quality by Discarding Most of the Phrasetable. In Proc. of EMNLP-CoNLL 2007.

Philipp Koehn and Barry Haddow. 2009. Edinburgh's Submission to all Tracks of the WMT 2009 Shared Task with Reordering and Speed Improvements to Moses. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 160–164, Athens, Greece.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. pages 177–180, Prague, Czech Republic, June.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In HLT 2004 - Human Language Technology Conference, Boston, MA, May.

José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram-based machine translation. Comput. Linguist., 32(4):527–549, December.

R.C. Moore and W. Lewis. 2010. Intelligent Selection of Language Model Training Data. In ACL (Short Papers), pages 220–224, Uppsala, Sweden, July.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics.

J. Niehues and A. Waibel. 2012. Detailed Analysis of Different Strategies for Phrase Table Adaptation in SMT. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas, San Diego, CA, USA.


Jan Niehues and Alex Waibel. 2013. An MT Error-Driven Discriminative Word Lexicon using Sentence Structure Features. In Proceedings of the 8th Workshop on Statistical Machine Translation, Sofia, Bulgaria.

Jan Niehues, Quoc Khanh Do, Alexandre Allauzen, and Alex Waibel. 2015. ListNet-based MT Rescoring. EMNLP 2015, page 248.

Franz Josef Och. 1999. An Efficient Method for Determining Bilingual Word Classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 1310–1318, Atlanta, GA, USA.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proc. of the 21st Int. Conf. on Computational Linguistics and the 44th Annual Meeting of the Assoc. for Computational Linguistics, pages 433–440.

Kay Rottmann and Stephan Vogel. 2007. Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde, Sweden.

Anthony Rousseau. 2013. XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, (100):73–82.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Lucia Specia, Gustavo Paetzold, and Carolina Scarton. 2015. Multi-level translation quality prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, July. Association for Computational Linguistics and The Asian Federation of Natural Language Processing.

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Speech and Language Processing (ICSLP), volume 2, pages 901–904, Denver, CO, September.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In Interspeech, Portland, OR, USA, September.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014a. Translation Modeling with Bidirectional Recurrent Neural Networks. In Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2014b. rwthlm – The RWTH Aachen University Neural Network Language Modeling Toolkit. In Interspeech, pages 2093–2097, Singapore, September.

Aleš Tamchyna, Fabienne Braune, Alexander M. Fraser, Marine Carpuat, Hal Daumé III, and Chris Quirk. 2014. Integrating a Discriminative Classifier into Phrase-based and Hierarchical Decoding. The Prague Bulletin of Mathematical Linguistics (PBML), 101:29–42.

Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. 2016. Target-Side Context for Discriminative Models in Statistical Machine Translation. In Proc. of ACL, Berlin, Germany, August. Association for Computational Linguistics.

Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu. 2008. RACAI's Linguistic Web Services. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May. European Language Resources Association (ELRA).

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. CoRR, abs/1506.00619.

Andrejs Vasiļjevs, Raivis Skadiņš, and Jörg Tiedemann. 2012. LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation. In Min Zhang, editor, Proceedings of the ACL 2012 System Demonstrations, pages 43–48, Jeju Island, Korea, July. Association for Computational Linguistics.


David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Philip Williams, Rico Sennrich, Maria Nădejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. 2016. Edinburgh's Statistical Machine Translation Systems for WMT16. In Proc. of the ACL 2016 First Conf. on Machine Translation (WMT16), Berlin, Germany, August.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, WA, USA, October.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-Based Statistical Machine Translation. In 25th German Conf. on Artificial Intelligence (KI2002), pages 18–32, Aachen, Germany, September. Springer Verlag.


D Beer for MT evaluation and tuning

BEER 1.1: ILLC UvA submission to metrics and tuning task

Miloš Stanojević and Khalil Sima'an
ILLC, University of Amsterdam
[email protected]    [email protected]

Abstract

We describe the submissions of ILLC UvA to the metrics and tuning tasks on WMT15. Both submissions are based on the BEER evaluation metric originally presented on WMT14 (Stanojević and Sima'an, 2014a). The main changes introduced this year are: (i) extending the learning-to-rank trained sentence level metric to the corpus level (but still decomposable to sentence level), (ii) incorporating syntactic ingredients based on dependency trees, and (iii) a technique for finding parameters of BEER that avoid "gaming of the metric" during tuning.

1 Introduction

In the 2014 WMT metrics task, BEER turned up as the best sentence level evaluation metric on average over 10 language pairs (Machacek and Bojar, 2014). We believe that this was due to:

1. learning-to-rank - a type of training that allows a large number of features and also training on the same objective on which the model is going to be evaluated: ranking of translations

2. dense features - character n-grams and skip-bigrams that are less sparse on the sentence level than word n-grams

3. permutation trees - hierarchical decomposition of word order based on (Zhang and Gildea, 2007)

A deeper analysis of (2) is presented in (Stanojević and Sima'an, 2014c) and of (3) in (Stanojević and Sima'an, 2014b). Here we modify BEER by

1. incorporating a better scoring function that gives scores that are better scaled,

2. including syntactic features, and

3. removing the recall bias from BEER.

In Section 2 we give a short introduction to BEER, after which we move to the innovations for this year in Sections 3, 4 and 5. We show the results from the metric and tuning tasks in Section 6, and conclude in Section 7.

2 BEER basics

The model underlying the BEER metric is flexible for the integration of an arbitrary number of new features and has a training method that is targeted for producing good rankings among systems. Two other characteristic properties of BEER are its hierarchical reordering component and its character n-gram lexical matching component.

2.1 Old BEER scoring

BEER is essentially a linear model with which the score can be computed in the following way:

score(h, r) = \sum_i w_i \times \phi_i(h, r) = \vec{w} \cdot \vec{\phi}

where \vec{w} is a weight vector and \vec{\phi} is a feature vector.

2.2 Learning-to-rank

Since the task on which our model is going to be evaluated is ranking translations, it comes natural to train the model using learning-to-rank techniques.

Our training data consists of pairs of "good" and "bad" translations. By using a feature vector \vec{\phi}_{good} for a good translation and a feature vector \vec{\phi}_{bad} for a bad translation, then using the following equations we can transform the ranking problem into a binary classification problem (Herbrich et al., 1999):
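A compact sketch of this pairwise training recipe: each "good"/"bad" pair yields one positive difference vector and one mirrored negative one, and any linear classifier can then recover the weight vector. Here scikit-learn's logistic regression stands in for whatever learner BEER actually uses, and the random features replace the real feature extraction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_pairs, num_features = 200, 10
phi_good = rng.normal(size=(num_pairs, num_features)) + 0.3  # features of preferred translations
phi_bad = rng.normal(size=(num_pairs, num_features))

# phi_good - phi_bad is a positive example, phi_bad - phi_good a negative one.
X = np.vstack([phi_good - phi_bad, phi_bad - phi_good])
y = np.concatenate([np.ones(num_pairs), np.zeros(num_pairs)])

ranker = LogisticRegression(fit_intercept=False).fit(X, y)
w = ranker.coef_[0]  # the learned weight vector

def linear_score(phi):
    # The "old" BEER score of Section 2.1: a plain dot product.
    return float(np.dot(w, phi))
```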



score(h_{good}, r) > score(h_{bad}, r) \Leftrightarrow
\vec{w} \cdot \vec{\phi}_{good} > \vec{w} \cdot \vec{\phi}_{bad} \Leftrightarrow
\vec{w} \cdot (\vec{\phi}_{good} - \vec{\phi}_{bad}) > 0 \Leftrightarrow
\vec{w} \cdot (\vec{\phi}_{bad} - \vec{\phi}_{good}) < 0

If we look at \vec{\phi}_{good} - \vec{\phi}_{bad} as a positive training instance and at \vec{\phi}_{bad} - \vec{\phi}_{good} as a negative training instance, we can train any linear classifier to find the weight vector \vec{w} that minimizes mistakes in ranking on the training set.

2.3 Lexical component based on character n-grams

Lexical scoring of BEER relies heavily on character n-grams. Precision, Recall and F1-score are used with character n-gram orders from 1 until 6. These scores are more smooth on the sentence level than the word n-gram matching that is present in other metrics like BLEU (Papineni et al., 2002) or METEOR (Michael Denkowski and Alon Lavie, 2014).

BEER also uses precision, recall and F1-score on the word level (but not with word n-grams). Matching of words is computed over METEOR alignments that use WordNet, paraphrasing and stemming to obtain a more accurate alignment. We also make a distinction between function and content words. A more precise description of the used features and their effectiveness is presented in (Stanojević and Sima'an, 2014c).

2.4 Reordering component based on PETs

The word alignments between system and reference translation can be simplified and considered as a permutation of the words of the reference translation in the system translation. Previous work by (Isozaki et al., 2010) and (Birch and Osborne, 2010) used this permutation view of word order and applied Kendall τ for evaluating its distance from the ideal (monotone) word order.

BEER goes beyond this skip-gram based evaluation and decomposes the permutation into a hierarchical structure which shows how subparts of the permutation form small groups that can be reordered all together. Figure 1a shows the PET for permutation ⟨2, 5, 6, 4, 1, 3⟩. Ideally the permutation tree will be filled with nodes ⟨1, 2⟩, which would say that there is no need to do any reordering (everything is in the right place). BEER has features that compute the number of different node types and for each different type it assigns a different weight. Sometimes there is more than one PET for the same permutation. Consider Figures 1b and 1c, which are just 2 out of 3 possible PETs for permutation ⟨4, 3, 2, 1⟩. Counting the number of trees that could be built is also a good indicator of the permutation quality. See (Stanojević and Sima'an, 2014b) for details on using PETs for evaluating word order.

3 Corpus level BEER

Our goal here is to create a corpus level extension of BEER that decomposes trivially at the sentence level. More concretely, we wanted to have a corpus level BEER that would be the average of the sentence level BEER of all sentences in the corpus:

BEER_{corpus}(c) = \frac{\sum_{s_i \in c} BEER_{sent}(s_i)}{|c|}    (1)

In order to do so it is not suitable to use the previous scoring function of BEER. The previous scoring function (and training method) takes care only that the better translation gets a higher score than the worse translation (on the sentence level). For this kind of corpus level computation we have an additional requirement: our sentence level scores need to be scaled proportionally to the translation quality.

3.1 New BEER scoring function

To make the scores on the sentence level better scaled we transform our linear model into a probabilistic linear model – logistic regression with the following scoring function:

score(h, r) = \frac{1}{1 + e^{-\sum_i w_i \times \phi_i(h, r)}}

There is still a problem with this formulation. During training, the model is trained on the difference between two feature vectors \vec{\phi}_{good} - \vec{\phi}_{bad}, while during testing it is applied only to one feature vector \vec{\phi}_{test}. \vec{\phi}_{good} - \vec{\phi}_{bad} is usually very close to the separating hyperplane, whereas \vec{\phi}_{test} is usually very far from it. This is not a problem for ranking, but it presents a problem if we want well scaled scores. Being extremely far from the

397

Page 46 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

2, 4, 1, 3 2, 1 h i h i 2, 1 h i 2, 1 1 2 2, 1 13 h i h i 2, 1 2, 1 1, 2 4 h i h i 2, 1 2 h i h i 56 43 21 43 (a) Complex PET (b) Fully inverted PET 1 (c) Fully inverted PET 2

Figure 1: Examples of PETs

separated hyperplane gives extreme scores such as 1. POS bigrams matching 0.9999999999912 and 0.00000000000000213 as a result which are obviously not well scaled. 2. dependency words bigram matching Our model was trained to give a probability of 3. arc type matching the “good” translation being better than the “bad” translation so we should also use it in that way – 4. valency matching to estimate the probability of one translation being For each of these we compute precision, recall better than the other. But which translation? We and F1-score. are given only one translation and we need to com- It has been shown by other researchers (Popovic´ pute its score. To avoid this problem we pretend and Ney, 2009) that POS tags are useful for ab- that we are computing a probability of the test sen- stracting away from concrete words and measure tence being a better translation than the reference the grammatical aspect of translation (for example for the given reference. In the ideal case the sys- it can captures agreement). tem translation and the reference translation will Dependency word bigrams (bigrams connected have the same features which will make logistic by a dependency arc) are also useful for capturing regression output probability 0.5 (it is uncertain long distance dependencies. about which translation is the better one). To make Most of the previous metrics that work with de- the scores between 0 and 1 we multiply this result pendency trees usually ignore the type of the de- with 2. The final scoring formula is the following: pendency that is (un)matched and treat all types equally (Yu et al., 2014). This is clearly not the 2 score(h, r) = P case. Surely subject and complement arcs are wi (φi(h,r) φi(r,r)) 1 + e− i × − more important than modifier arc. To capture this 4 BEER + Syntax = BEER Treepel we created individual features for precision, re- The standard version of BEER does not use any call and F1-score matching of each arc type so our syntactic knowledge. Since the training method system could learn on which arc type to put more of BEER allows the usage of a large number of weight. features, it is trivial to integrate new features that All words take some number of arguments (va- would measure the matching between some syntax lency), and not matching that number of argu- attributes of system and reference translations. ments is a sign of a, potentially, bad translation. The syntactic representation we exploit is a de- With this feature we hope to capture the aspect of pendency tree. The reason for that is that we can not producing the right number of arguments for easily connect the structure with the lexical con- all words (and especially verbs) in the sentence. tent and it is fast to compute which can often be This model BEER Treepel contains in total 177 very important for evaluation metrics when they features out of which 45 are from original BEER . need to evaluate on large data. We used Stanford’s 5 BEER for tuning dependency parser (Chen and Manning, 2014) be- cause it gives high accuracy parses in a very short The metrics that perform well on metrics task are time. very often not good for tuning. This is because The features we compute on the dependency recall has much more importance for human judg- trees of the system and its reference translation ment than precision. The metrics that put more are: weight on recall than precision will be better with

398

Page 47 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

tuning metric BLEU MTR BEER Length System Name TrueSkill Score BLEU Tuning-Only All BEER 16.4 28.4 10.2 115.7 BLEU-MIRA-DENSE 0.153 -0.177 12.28 BLEU 18.2 28.1 10.1 103.0 ILLC-UVA 0.108 -0.188 12.05 BEER no bias 18.0 27.7 9.8 BLEU-MERT-DENSE 0.087 -0.200 12.11 99.7 AFRL 0.070 -0.205 12.20 USAAR-TUNA 0.011 -0.220 12.16 Table 1: Tuning results with BEER without bias DCU -0.027 -0.256 11.44 on WMT14 as tuning and WMT13 as test set METEOR-CMU -0.101 -0.286 10.88 BLEU-MIRA-SPARSE -0.150 -0.331 10.84 HKUST -0.150 -0.331 10.99 HKUST-LATE — — 12.20 correlation with human judgment, but when used for tuning they will create overly long translations. Table 6: Results on Czech-English tuning This bias for long translation is often resolved by manually setting the weights of recall and pre- cision to be equal (Denkowski and Lavie, 2011; The difference between BEER and He and Way, 2009). BEER Treepel are relatively big for de-en, This problem is even bigger with metrics with cs-en and ru-en while for fr-en and fi-en the many features. When we have metric like difference does not seem to be big. BEER Treepel which has 117 features it is not clear how to set weights for each feature manu- The results of WMT15 tuning task is shown in ally. Also some features might not have easy inter- Table 6. The system tuned with BEER without re- pretation as precision or recall of something. Our call bias was the best submitted system for Czech- method for automatic removing of this recall bias, English and only the strong baseline outperformed which is presented in (Stanojevic,´ 2015), gives it. very good results that can be seen in Table 1. Before the automatic adaptation of weights 7 Conclusion for tuning, tuning with standard BEER produces translations that are 15% longer than the refer- We have presented ILLC UvA submission to the ence translations. This behavior is rewarded by shared metric and tuning task. All submissions metrics that are recall-heavy like METEOR and are centered around BEER evaluation metric. On BEER and punished by precision heavy metrics the metrics task we kept the good results we had like BLEU. After automatic adaptation of weights, on sentence level and extended our metric to cor- tuning with BEER matches the length of reference pus level with high correlation with high human translation even better than BLEU and achieves judgment without losing the decomposability of the BLEU score that is very close to tuning with the metric to the sentence level. Integration of syn- BLEU. This kind of model is disliked by ME- tactic features gave a bit of improvement on some TEOR and BEER but by just looking at the length language pairs. The removal of recall bias allowed of the produced translations it is clear which ap- us to go from overly long translations produced proach is preferred. in tuning to translations that match reference rel- atively close by length and won the 3rd place in 6 Metric and Tuning task results the tuning task. BEER is available at https: //github.com/stanojevic/beer. The results of WMT15 metric task of best per- forming metrics is shown in Tables 2 and 3 for the system level and Tables 4 and 5 for segment level. Acknowledgments On the sentence level for out of English lan- guage pairs on average BEER was the best met- This work is supported by STW grant nr. 12271 ric (same as the last year). Into English it got 2nd and NWO VICI grant nr. 277-89-002. QT21 place with its syntactic version and 4th place as the project support to the second author is also original BEER . 
acknowledged (European Unions Horizon 2020 On the corpus level BEER is on average second grant agreement no. 64545). We are thankful to for out of English language pairs and 6th for into Christos Louizos for help with incorporating a de- English. BEER and BEER Treepel are the best for pendency parser to BEER Treepel. en-ru and fi-en.

399

Page 48 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Correlation coefficient Pearson Correlation Coefficient Direction fr-en fi-en de-en cs-en ru-en Average DPMFCOMB .995 .006 .951 .013 .949 .016 .992 .004 .871 .025 .952 .013 RATATOUILLE .989 ± .010 .899 ± .019 .942 ± .018 .963 ± .008 .941 ± .018 .947 ± .014 DPMF .997 ± .005 .939 ± .015 .929 ± .019 .986 ± .005 .868 ± .026 .944 ± .014 METEOR-WSD .982 ± .011 .944 ± .014 .914 ± .021 .981 ± .006 .857 ± .026 .936 ± .016 ± ± ± ± ± ± CHRF3 .979 .012 .893 .020 .921 .020 .969 .007 .915 .023 .935 .016 ± ± ± ± ± ± BEER TREEPEL .981 .011 .957 .013 .905 .021 .985 .005 .846 .027 .935 .016 BEER .979 ± .012 .952 ± .013 .903 ± .022 .975 ± .006 .848 ± .027 .931 ± .016 ± ± ± ± ± ± CHRF .997 .005 .942 .015 .884 .024 .982 .006 .830 .029 .927 .016 ± ± ± ± ± ± LEBLEU-OPTIMIZED .989 .009 .895 .020 .856 .025 .970 .007 .918 .023 .925 .017 ± ± ± ± ± ± LEBLEU-DEFAULT .960 .015 .895 .020 .856 .025 .946 .010 .912 .022 .914 .018 ± ± ± ± ± ± Table 2: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating into English.

Correlation coefficient Pearson Correlation Coefficient Metric en-fr en-fi en-de en-cs en-ru Average CHRF3 .949 .021 .813 .025 .784 .028 .976 .004 .913 .011 .887 .018 BEER .970 ± .016 .729 ± .030 .811 ± .026 .951 ± .005 .942 ± .009 .880 ± .017 ± ± ± ± ± ± LEBLEU-OPTIMIZED .949 .020 .727 .030 .896 .020 .944 .005 .867 .013 .877 .018 ± ± ± ± ± ± LEBLEU-DEFAULT .949 .020 .760 .028 .827 .025 .946 .005 .849 .014 .866 .018 RATATOUILLE .962 ± .017 .675 ± .031 .777 ± .028 .953 ± .005 .869 ± .013 .847 ± .019 ± ± ± ± ± ± CHRF .949 .021 .771 .027 .572 .037 .968 .004 .871 .013 .826 .020 METEOR-WSD .961 ± .018 .663 ± .032 .495 ± .039 .941 ± .005 .839 ± .014 .780 ± .022 BS .977± .014 .334 ± .039 .615± .036 .947± .005 .791± .016 .600± .022 DPMF −.973 ±.015 n/a± −.584 ±.037 − n/a± − n/a± −.778 ±.026 ± ± ± Table 3: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating out of English.

Direction fr-en fi-en de-en cs-en ru-en Average DPMFCOMB .367 .015 .406 .015 .424 .015 .465 .012 .358 .014 .404 .014 ± ± ± ± ± ± BEER TREEPEL .358 .015 .399 .015 .386 .016 .435 .013 .352 .013 .386 .014 RATATOUILLE .367 ± .015 .384 ± .015 .380 ± .015 .442 ± .013 .336 ± .014 .382 ± .014 BEER .359 ± .015 .392 ± .015 .376 ± .015 .417 ± .013 .336 ± .013 .376 ± .014 METEOR-WSD .347 ± .015 .376 ± .015 .360 ± .015 .416 ± .013 .331 ± .014 .366 ± .014 ± ± ± ± ± ± CHRF .350 .015 .378 .015 .366 .016 .407 .013 .322 .014 .365 .014 DPMF .344 ± .014 .368 ± .015 .363 ± .015 .413 ± .013 .320 ± .014 .362 ± .014 ± ± ± ± ± ± CHRF3 .345 .014 .361 .016 .360 .015 .409 .012 .317 .014 .359 .014 ± ± ± ± ± ± LEBLEU-OPTIMIZED .349 .015 .346 .015 .346 .014 .400 .013 .316 .015 .351 .014 ± ± ± ± ± ± LEBLEU-DEFAULT .343 .015 .342 .015 .341 .014 .394 .013 .317 .014 .347 .014 ± ± ± ± ± ± TOTAL-BS .305 .013 .277 .015 .287 .014 .357 .013 .263 .014 .298 .014 − ± − ± − ± − ± − ± − ± Table 4: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT human judgments when translating into English. The last three columns contain average Kendall’s τ computed by other variants.

Direction en-fr en-fi en-de en-cs en-ru Average BEER .323 .013 .361 .013 .355 .011 .410 .008 .415 .012 .373 .011 ± ± ± ± ± ± CHRF3 .309 .013 .357 .013 .345 .011 .408 .008 .398 .012 .363 .012 RATATOUILLE .340 ± .013 .300 ± .014 .337 ± .011 .406 ± .008 .408 ± .012 .358 ± .012 ± ± ± ± ± ± LEBLEU-DEFAULT .321 .013 .354 .013 .345 .011 .385 .008 .386 .012 .358 .011 ± ± ± ± ± ± LEBLEU-OPTIMIZED .325 .013 .344 .012 .345 .012 .383 .008 .385 .012 .356 .011 ± ± ± ± ± ± CHRF .317 .013 .346 .012 .315 .013 .407 .008 .387 .012 .355 .012 METEOR-WSD .316 ± .013 .270 ± .013 .287 ± .012 .363 ± .008 .373 ± .012 .322 ± .012 ± ± ± ± ± ± TOTAL-BS .269 .013 .205 .012 .231 .011 .324 .008 .332 .012 .273 .011 DPMF −.308 ±.013 − n/a± −.289 ±.012 − n/a± − n/a± −.298 ±.013 ± ± ± PARMESAN n/a n/a n/a .089 .006 n/a .089 .006 ± ± Table 5: Segment-level Kendall’s τ correlations of automatic evaluation metrics and the official WMT human judgments when translating out of English. The last three columns contain average Kendall’s τ computed by other variants.

400

Page 49 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

References Milosˇ Stanojevic´ and Khalil Sima’an. 2014a. BEER: BEtter Evaluation as Ranking. In Proceedings of the Alexandra Birch and Miles Osborne. 2010. LRscore Ninth Workshop on Statistical Machine Translation, for Evaluating Lexical and Reordering Quality in pages 414–419, Baltimore, Maryland, USA, June. MT. In Proceedings of the Joint Fifth Workshop on Association for Computational Linguistics. Statistical Machine Translation and MetricsMATR, pages 327–332, Uppsala, Sweden, July. Association Milosˇ Stanojevic´ and Khalil Sima’an. 2014b. Eval- for Computational Linguistics. uating Word Order Recursively over Permutation- Proceedings of SSST-8, Eighth Work- Danqi Chen and Christopher D Manning. 2014. A fast Forests. In shop on Syntax, Semantics and Structure in Statis- and accurate dependency parser using neural net- tical Translation works. In Empirical Methods in Natural Language , pages 138–147, Doha, Qatar, Oc- Processing (EMNLP). tober. Association for Computational Linguistics. Michael Denkowski and Alon Lavie. 2011. Meteor Milosˇ Stanojevic´ and Khalil Sima’an. 2014c. Fitting 1.3: Automatic metric for reliable optimization and Sentence Level Translation Evaluation with Many evaluation of machine translation systems. In Pro- Dense Features. In Proceedings of the 2014 Con- ceedings of the Sixth Workshop on Statistical Ma- ference on Empirical Methods in Natural Language chine Translation, WMT ’11, pages 85–91, Strouds- Processing (EMNLP), pages 202–206, Doha, Qatar, burg, PA, USA. Association for Computational Lin- October. Association for Computational Linguistics. guistics. Milosˇ Stanojevic.´ 2015. Removing Biases from Train- Y.He and A. Way. 2009. Improving the objective func- able MT Metrics by Using Self-Training. arXiv tion in minimum error rate training. Proceedings preprint arXiv:1508.02445. of the Twelfth Machine Translation Summit, pages 238–245. Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu, and Shouxun Lin. 2014. Red: A reference depen- Ralf Herbrich, Thore Graepel, and Klaus Obermayer. dency based mt evaluation metric. In COLING’14, 1999. Support Vector Learning for Ordinal Regres- pages 2042–2051. sion. In In International Conference on Artificial Neural Networks, pages 97–102. Hao Zhang and Daniel Gildea. 2007. Factorization of synchronous context-free grammars in linear time. Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito In In NAACL Workshop on Syntax and Structure in Sudoh, and Hajime Tsukada. 2010. Automatic Statistical Translation (SSST. Evaluation of Translation Quality for Distant Lan- guage Pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP ’10, pages 944–952, Stroudsburg, PA, USA. Association for Computational Linguis- tics. Matous Machacek and Ondrej Bojar. 2014. Results of the wmt14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Trans- lation, pages 293–301, Baltimore, Maryland, USA, June. Association for Computational Linguistics. Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evalua- tion for Any Target Language. In Proceedings of the ACL 2014 Workshop on Statistical Machine Transla- tion. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Com- putational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computa- tional Linguistics. 
Maja Popovic´ and Hermann Ney. 2009. Syntax- oriented evaluation measures for machine transla- tion output. In Proceedings of the Fourth Work- shop on Statistical Machine Translation, StatMT ’09, pages 29–32, Stroudsburg, PA, USA. Associ- ation for Computational Linguistics.

401

Page 50 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

E Particle Swarm Optimization Submission for WMT16 Tuning Task

Particle Swarm Optimization Submission for WMT16 Tuning Task

Viktor Kocur Ondrejˇ Bojar CTU in Prague Charles University in Prague FNSPE MFF UFAL´ [email protected] [email protected]

Abstract translation metric2. The standard optimization is a variant of grid search and in our work, we replace This paper describes our submission to the it with the Particle Swarm Optimization (PSO, Tuning Task of WMT16. We replace the Eberhart et al., 1995) algorithm. grid search implemented as part of stan- Particle Swarm Optimization is a good candi- dard minimum-error rate training (MERT) date for an efficient implementation of the inner in the Moses toolkit with a search based loop of MERT due to the nature of the optimiza- on particle swarm optimization (PSO). An tion space. The so-called Traditional PSO (TPSO) older variant of PSO has been previously has already been tested by Suzuki et al. (2011), successfully applied and we now test it with a success. Improved versions of the PSO al- in optimizing the Tuning Task model for gorithm, known as Standard PSO (SPSO), have English-to-Czech translation. We also been summarized in Clerc (2012). adapt the method in some aspects to al- In this paper, we test a modified version of low for even easier parallelization of the the latest SPSO2011 algorithm within the Moses search. toolkit and compare its results and computational costs with the standard Moses implementation of 1 Introduction MERT.

Common models of statistical machine transla- 2 MERT tion (SMT) consist of multiple features which as- sign probabilities or scores to possible transla- The basic goal of MERT is to find optimal weights tions. These are then combined in a weighted for various numerical features of an SMT system. sum to determine the best translation given by the The weights are considered optimal if they min- model. Tuning within SMT refers to the process of imize an automated error metric which compares finding the optimal weights for these features on a the machine translation to a human translation for given tuning set. This paper describes our submis- a certain tuning (development) set. sion to WMT16 Tuning Task1, a shared task where Formally, each feature provides a score (some- all the SMT model components and the tuning set times a probability) that a given sentence e in goal are given and task participants are expected to pro- language is the translation of the foreign sentence vide only the weight settings. We took part only in f. Given a weight for each such feature, it is pos- English-to-Czech system tuning. sible to combine the scores to a single figure and Our solution is based on the standard tuning find the highest scoring translation. The best trans- method of Minimum Error-Rate Training (MERT, lation can then be obtained by the following for- Och, 2003). The MERT algorithm described in mula: Bertoldi et al. (2009) is the default tuning method in the Moses SMT toolkit (Koehn et al., 2007). The inner loop of the algorithm performs opti- e∗ = argmax λi log (pi(e f)) = gp(λ) (1) mization on a space of weight vectors with a given e | Xi 1http://www.statmt.org/wmt16/ 2All our experiments optimize the default BLEU but other tuning-task/ metrics could be directly tested as well.

515 Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 515–521, Berlin, Germany, August 11-12, 2016. c 2016 Association for Computational Linguistics

Page 51 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

The process of finding the best translation e∗ is model has 21 features but adding sparse features, called decoding. The translations can vary signif- we can get to thousands of dimensions. icantly based on the values of the weights, there- These properties of the search space make PSO fore it is necessary to find the weights that would an interesting candidate for the inner loop algo- give the best result. This is achieved by minimiz- rithm. PSO is stochastic so it doesn’t require ing the error of the machine translation against the smoothness of the optimized function. It is also human translation: highly parallelizable and gains more power with more CPUs available, which is welcome since the λ∗ = argmin errf (gp(λ), ehuman) (2) optimization itself is quite expensive. The simplic- λ ity of PSO also leaves space for various improve- The error function can also be considered as a ments. negative value of an automated scorer. The prob- lem with this straight-forward approach is that de- coding is computationally expensive. To reduce 3 PSO Algorithm this cost, the decoder is not run for every consid- ered weight setting. Instead, only some promis- The PSO algorithm was first described by Eber- ing settings are tested in a loop (called the “outer hart et al. (1995). PSO is an iterative optimization loop”): given the current best weights, the decoder method inspired by the behavior of groups of ani- is asked to produce n best translation for each mals such as flocks of birds or schools of fish. The sentence of the tuning set. This enlarged set of space is searched by individual particles with their candidates allows us to estimate translation scores own positions and velocities. The particles can in- for similar weight settings. An optimizer uses form others of their current and previous positions these estimates to propose a new vector of weights and their properties. and the decoder then tests this proposal in another outer loop. The outer loop is stopped when no new weight setting is proposed by the optimizer or no 3.1 TPSO new translations are found by the decoder. The The original algorithm is defined quite generally. run of the optimizer is called the “inner loop”, al- Let us formally introduce the procedure. The though it need not be iterative in any sense. The search space S is defined as optimizer tries to find the best weights so that the least erroneous translations appear as high as pos- sible in the n-best lists of candidate translations. D Our algorithm replaces the inner loop of MERT. S = [mind, maxd] (3) It is therefore important to describe the properties Od=1 of the inner loop optimization task. Due to finite number of translations accumu- where D is the dimension of the space and lated in the n-best lists (across sentences as well as mind and maxd are the minimal and maximal outer loop iterations), the error function changes values for the d-th coordinate. We try to find a only when the change in weights leads to a change point in the space which maximizes a given func- in the order of the n-best list. This is represented tion f : S R. by numerous plateaus in the error function with 7→ There are p particles and the i-th particle in discontinuities on the edges of the plateaus. This the n-th iteration has the following D-dimensional prevents the use of simple gradient methods. 
We vectors: position xn, velocity vn, and two vectors can define a local optimum not in a strict math- i i of maxima found so far: the best position pn vis- ematical sense but as a plateau which has only i ited by the particle itself and the best known po- higher or only lower plateaus at the edges. These sition ln that the particle has learned about from local optima can then be numerous within the i others. search space and trap any optimizing algorithm, n thus preventing convergence to the global opti- In TPSO algorithm, the li vector is always the mum which is desired. globally best position visited by any particle so far. Another problem is the relatively high dimen- The TPSO algorithm starts with simple initial- sionality of the search space. The Tuning Task ization:

516

Page 52 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

located in any close vicinity). This set of best po- sitions is limited to k elements, each new addition 0 xi = rand(S) (4) over the limit k replaces the oldest information. 0 t rand(S) x To establish the “global” optimum li, every parti- v0 = − i (5) i 2 cle consults only its set of learned best positions. 0 0 The algorithm starts with the initialization of pi = xi (6) 0 0 particle vectors given by the equations (4-6). The li = argmax f(pj ) (7) 0 0 j li is initialized with the value of pi . The sets of learned best positions are initialized as empty. where the function rand(S) generates a random Two constants affect computations given below: vector from space S with uniform distribution. w is again the slowdown and c controls the “ex- The velocity for the next iteration is updated as pansion” of examined neighbourhood of each par- follows: ticle. We set w and c to values that (as per Bonyadi and Michalewicz, 2014) ensure convergence:

t+1 t t t v = wv + U(0, 1)φpp + U(0, 1)φll (8) 1 i i i i w = 0.721 (12) 2ln(2) ≈ where U(0, 1) denotes a random number be- tween 0 and 1 with uniform distribution. The pa- 1 rameters w, φ , φ (0, 1) are set by the user and c = + ln(2) 1.193 (13) p l ∈ 2 ≈ indicate a slowdown, and the respective weight for own vs. learned optimum. All the following vectors are then updated: t cli t+1 t t+1 xi = xi + vi (9) t t+1 t+1 t+1 t li pi = xi if f(xi ) > f(pi) (10) t t yi wv t+1 t+1 i li = argmax(f(pj )) (11) j t+1 xi t The process continues with the next iteration Gi until all of the particles converge to proximity of a vt+1 i cpt certain point. Other stopping criteria are also used. pt i t i vi 3.2 Modified SPSO2011 xt We introduce a number of changes to the algo- i rithm SPSO2011 described by Clerc (2012). Figure 1: Construction of the particle position up- t date. The grey area indicates P (G, x). In SPSO2011 the global best position li is re- placed by the best position the particle has re- For the update of velocity, it is first necessary to ceived information about from other particles. In calculate a “center of gravity” Gt of three points: the original SPSO2011 this is done in a synchro- i the current position xt, a slightly “expanded” cur- nized fashion: after every iteration, all particles i rent best position pt and a slightly expanded best send their best personal positions to m other parti- i position known by colleagues lt. The “expansion” cles. Every particle chooses the best position it has i t of the positions is controlled by c and directed out- received in the current iteration and sets its li ac- t t wards from xi: cordingly. This generalization of li is introduced in order to combat premature convergence to a lo- pt + lt 2xt cal optimum. Gt = xt + c i i − i (14) i i · 3 To avoid waiting until all particles finish their t computation, we introduce per-particle memory To introduce further randomness, xi is relocated t of “learned best positions” called the “neighbour- to a position yi sampled from the uniform distri- t t hood set” (although its members do not have to be bution in the area P (Gi, xi) formally defined as:

517

Page 53 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

the algorithm for a fixed number of position up- dates, specifically 32000. Later, we changed the D algorithm to terminate after the manager has seen P (G, x) = G G x ,G + G x d − | d − d| d | d − d| 3200 position updates without any update of the d=1h i O (15) global best position. In the following section, we t refer to the former as PSO without the termination Our P (G, x) is a hypercube centered in Gi and t condition (PSO) and the latter as PSO with the ter- touching xi, see Figure 1 for an illustration. The original SPSO2011 used a d-dimensional ball with mination condition (PSO-T). the center in G and radius G x to avoid the Properties of SPSO2011 have been investigated k − k bias of searching towards points on axes. We are by Bonyadi and Michalewicz (2014). We use a less concerned about this and opt for a simpler and slightly different algorithm, but our modifications faster calculation. should have an effect only on rotational invariance, The new velocity is set to include the previous which is not so much relevant for our purpose. velocity (reduced by w) as well as the speedup Aside from the discussion on the values of w and caused by the random relocation: c with respect to the convergence of all particles to the same point, Bonyadi and Michalewicz also t+1 t t t mention that SPSO2011 is not guaranteed to con- v = wvi + yi xi (16) i − verge to a local optimum. Since our search space Finally, the particle position is updated: is discontinuous with plateaus, the local conver- gence in the mathematical sense is not especially t+1 t t+1 t t xi = xi + vi = wvi + yi (17) useful anyway. The optimized function is evaluated at the new 4 Implementation t+1 position xi and the particle’s best position is up- dated if a new optimum was found. In any case, We implemented the algorithm described above t+1 with one parameter, the number of particles. We the best position pi together with its value is sent to m randomly selected particles (possibly in- set the size of the neighborhood set, denoted k cluding the current particle) to be included in their above, to 4 and the number of random particles re- sets of learned best positions as described above. ceiving the information about a particle’s best po- t+1 sition so far (m) to 3. The particle then sets its li to best position from its own list of learned positions. The implementation of our version of the PSO The next iteration continues with the updated algorithm is built within the standard Moses code. vectors. Normally, the algorithm would terminate The algorithm itself creates a reasonable parallel when all particles converge to a close proximity structure with each thread representing a single to each other, but it turns out that this often leads particle. to premature stopping. There are many other ap- We use similar object structure as the base- proaches possible to this problem (Xinchao, 2010; line MERT implementation. The points are rep- Evers and Ben Ghalia, 2009), but we choose a sim- resented by their own class which handles basic ple restarting strategy: when the particle is send- arithmetic and stream operations. The class car- ing out its new best position and value to m fel- ries not only the vector of the current position but lows, the manager responsible for this checks if also its associated score. this value was not reported in the previous call Multiple threads are maintained by the stan- (from any other particle). If it was, then the current dard Moses thread pools (Haddow, 2012). 
Ev- particle is instructed to restart itself by setting all ery thread (“Task” in Moses thread pools) cor- of its vectors to random initial state.3 The neigh- responds to a particle and is responsible for cal- borhood set is left unchanged. The restart prevents culating its search in the space using the class multiple particles exploring the same area. PSOOptimizer. There are no synchronous it- The drawback of restarts is that the stopping cri- erations, each particle proceeds at its own pace. terion is never met. In our first version, we ran All optimizers have access to a global manager object of class PSOManager, see Figure 2 for an 3The use of score and not position is possible due to the nature of the space in which a same score of two points very illustration. The manager provides methods for t likely means that the points are equivalent. the optimizers to get the best vector li from the

518

Page 54 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Run PSO-16 PSO-64 PSO-T-16 PSO-T-64 MERT-16 1 14.5474 15.6897 15.6133 15.6613 14.5470 2 17.3292 18.7340 18.7437 18.4464 18.8704 3 18.9261 18.9788 18.9711 18.9069 19.0625 4 19.0926 19.2060 19.0646 19.0785 19.0623 5 19.1599 19.2140 19.0968 19.0738 19.1992 6 19.2444 19.2319 - 19.0772 19.1751 7 19.2470 19.2383 - - 19.0480 8 19.2613 19.2245 - - 19.1359 12 - - - - 19.1625

Table 1: The final best BLEU score after the runs of the inner loop for PSO without and with the termination condition with 16 and 64 threads respectively and standard Moses MERT implementation with 16 threads.

pared to the calculations performed in the optimiz- AllTasks ers. The only locking occurs when threads are try- PSOOptimizationTask PSOOptimizationTask ing to add points; read access to the manager can ... PSOOptimizer PSOOptimizer be concurrent. FeatureData FeatureData 5 Results ScorerData ScorerData We ran the tuning only for the English to Czech part of the tuning task. We filtered and binarized the model supplied by the organizers to achieve PSOManager better performance and smaller memory costs. +addPoint(Point p) +getBestNeighbor(int i, Point P) For the computation, we used the services of +cont() Metacentrum VO. Due to the relatively high mem- ory demands we used two SGI UV 2000 machines: Figure 2: Base structure of our PSO algorithm one with 48x 6-core Intel Xeon E5-4617 2.9GHz and 6TB RAM and one with 48x 8-core Intel Xeon E5-4627v2 3.30GHz and 6TB RAM. We ran the neighborhood set, to report its best position to the tuning process on 16 and 64 CPUs, i.e. with 16 random m particles (addPoint) and to check if and 64 particles, respectively. We submitted the the optimization should still run (cont) or termi- weights from the 16-CPU run. We also ran a test nate. The method addPoint serves two other run using the standard Moses MERT implementa- purposes: incrementing an internal counter of it- tion with 16 threads for a comparison. erations and indicating through its return value Table 1 shows the best BLEU scores at the end whether the reporting particle should restart itself. of each inner loop (as projected from the n-best Every optimizer has its own FeatureData lists on the tuning set of sentences). Both meth- and ScorerData, which are used to determine ods provide similar results. Since the methods are the score of the investigated points. As of now, stochastic, different runs will lead to different best the data is loaded serially, so the more threads we positions (and different scores). have, the longer the initialization takes. In the Comparison of our implementation with with baseline implementation of MERT, all the threads the baseline MERT on a test set is not nec- share the scoring data. This means that the data essary. Both implementations try to maximize is loaded only once, but due to some unexpected BLEU score, therefore any overtraining occurring locking, the baseline implementation never gains in the baseline MERT occurs also in our imple- speedups higher than 1.5, even with 32 threads, mentation and vice versa. see Table 2 below. Table 2 shows the average run times and This structure allows an efficient use of multi- reached scores for 8 runs of the baseline MERT ple cores. Methods of the manager are fast com- and our PSO and PSO-T, starting with the same

519

Page 55 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Wall Clock [s] Projected BLEU Reached Outer Loop CPUs MERT PSO PSO-T MERT PSO PSO-T 1 1 186.24 10.63 397.28 2.13 62.37 19.64 14.50 0.03 13.90 0.05 13.84 0.05 1 4 123.51±3.58 72.75±1.12 21.94±4.63 14.51±0.03 14.48±0.08 14.46±0.06 1 8 135.40±8.43 43.07±0.78 15.62±3.40 14.52±0.04 14.53±0.05 14.42±0.12 1 16 139.43±8.00 33.00±1.37 14.59±2.21 14.53±0.02 14.51±0.08 14.48±0.10 1 24 119.69±4.43 32.20±1.62 16.89±3.16 14.52±0.02 14.55±0.06 14.47±0.07 1 32 119.04±4.47 33.42±2.16 19.16±2.92 14.53±0.03 14.50±0.04 14.50±0.07 3 1 701.18±47.13 1062.38±1.88 117.64±0.47 18.93±0.04 18.08±0.00 18.08±0.00 3 4 373.69±28.37 189.86±0.64 57.28±23.61 18.90±0.00 18.82±0.12 18.81±0.07 3 8 430.88±24.82 111.50±0.53 37.92±8.68 18.95±0.05 18.89±0.09 18.87±0.06 3 16 462.77±18.78 80.54±5.39 29.62±4.34 18.94±0.04 18.94±0.07 18.90±0.05 3 24 392.66±13.39 74.08±3.64 31.67±3.47 18.94±0.04 18.93±0.05 18.86±0.05 3 32 399.93±27.68 82.83±3.82 37.70±4.52 18.91±0.01 18.90±0.05 18.87±0.06 ± ± ± ± ± ± Table 2: Average run times and reached scores. The are standard deviations. ±

n-best lists as accumulated in iteration 1 and 3 of ing language resources stored and distributed the outer loop. Note that PSO and PSO-T use only by the LINDAT/CLARIN project of the Min- as many particles as there are threads, so running istry of Education, Youth and Sports of the them with just one thread leads to a degraded per- Czech Republic (project LM2015071). Compu- formace in terms of BLEU. With 4 or 8 threads, tational resources were supplied by the Ministry the three methods are on par in terms of tuning- of Education, Youth and Sports of the Czech set BLEU. Starting from 4 threads, both PSO and Republic under the Projects CESNET (Project PSO-T terminate faster than the baseline MERT No. LM2015042) and CERIT-Scientific Cloud implementation. Moreover the baseline MERT (Project No. LM2015085) provided within the proved unable to utilize multiple CPUs efficiently, program Projects of Large Research, Development whereas PSO gives us up to 14-fold speedup. and Innovations Infrastructures. In general, the higher the ratio of the serial data loading to the search computation time, the worse References the speedup. The search in PSO-T takes much Nicola Bertoldi, Barry Haddow, and Jean-Baptiste shorter time so the overhead of serial data loading Fouet. 2009. Improved minimum error rate is more apparent and PSO-T seems parallelized training in moses. The Prague Bulletin of Math- badly and gives only quadruple speedup. The re- ematical Linguistics 91:7–16. duction of this overhead is highly desirable. Mohammad Reza Bonyadi and Zbigniew Michalewicz. 2014. Spso 2011: Analysis of 6 Conclusion stability; local convergence; and rotation sensi- We presented our submission to the WMT16 Tun- tivity. In Proceedings of the 2014 conference on ing Task, a variant of particle swarm optimization Genetic and evolutionary computation. ACM, applied to minimum error-rate training in statisti- pages 9–16. cal machine translation. Our method is a drop-in Maurice Clerc. 2012. Standard particle swarm op- replacement of the standard Moses MERT and has timisation . the benefit of easy parallelization. Preliminary ex- Russ C Eberhart, James Kennedy, et al. 1995. periments suggest that it indeed runs faster and de- A new optimizer using particle swarm theory. livers comparable weight settings. In Proceedings of the sixth international sym- The effects on the number of iterations of the posium on micro machine and human science. MERT outer loop as well as on the test-set perfor- New York, NY, volume 1, pages 39–43. mance have still to be investigated. George I Evers and Mounir Ben Ghalia. 2009. Re- Acknowledgments grouping particle swarm optimization: a new global optimization algorithm with improved This work has received funding from the Eu- performance consistency across benchmarks. In ropean Union’s Horizon 2020 research and in- Systems, Man and Cybernetics, 2009. SMC novation programme under grant agreement no. 2009. IEEE International Conference on. IEEE, 645452 (QT21). This work has been us- pages 3901–3908.

520

Page 56 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Barry Haddow. 2012. Adding Multi-Threaded De- coding to Moses. Prague Bulletin of Mathemat- ical Linguistics 93:57–66. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pages 177–180. Franz Josef Och. 2003. Minimum error rate train- ing in statistical machine translation. In Pro- ceedings of the 41st Annual Meeting on Asso- ciation for Computational Linguistics-Volume 1. Association for Computational Linguistics, pages 160–167. Jun Suzuki, Kevin Duh, and Masaaki Nagata. 2011. Distributed minimum error rate training of smt using particle swarm optimization. In IJCNLP. pages 649–657. Zhao Xinchao. 2010. A perturbed particle swarm algorithm for numerical optimization. Applied Soft Computing 10(1):119–124.

521

Page 57 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

F CharacTER: Translation Edit Rate on Character Level

G Exponentially Decaying Bag-of-Words Input Features for Feed- Forward Neural Network in Statistical Machine Translation

Exponentially Decaying Bag-of-Words Input Features for Feed-Forward Neural Network in Statistical Machine Translation

Jan-Thorsten Peter, Weiyue Wang, Hermann Ney Human Language Technology and Pattern Recognition, Computer Science Department RWTH Aachen University, 52056 Aachen, Germany {peter,wwang,ney}@cs.rwth-aachen.de

Abstract context length on source and target sides. Using the Bag-of-Words (BoW) model as additional in- Recently, neural network models have put of a neural network based language model, achieved consistent improvements in sta- (Mikolov et al., 2015) have achieved very simi- tistical machine translation. However, lar perplexities on automatic speech recognition most networks only use one-hot encoded tasks in comparison to the long short-term mem- input vectors of words as their input. ory (LSTM) neural network, whose structure is In this work, we investigated the ex- much more complex. This suggests that the bag- ponentially decaying bag-of-words input of-words model can effectively store the longer features for feed-forward neural network term contextual information, which could show translation models and proposed to train improvements in statistical machine translation as the decay rates along with other weight pa- well. Since the bag-of-words representation can rameters. This novel bag-of-words model cover as many contextual words without further improved our phrase-based state-of-the-art modifying the network structure, the problem of system, which already includes a neural limited context window size of feed-forward neu- network translation model, by up to 0.5% ral networks is reduced. Instead of predefining BLEU and 0.6% TER on three different fixed decay rates for the exponentially decaying translation tasks and even achieved a simi- bag-of-words models, we propose to learn the de- lar performance to the bidirectional LSTM cay rates from the training data like other weight translation model. parameters in the neural network model.

2 The Bag-of-Words Input Features 1 Introduction The bag-of-words model is a simplifying repre- Neural network models have recently gained much sentation applied in natural language processing. attention in research on statistical machine trans- In this model, each sentence is represented as the lation. Several groups have reported strong im- set of its words disregarding the word order. Bag- provements over state-of-the-art baselines when of-words models are used as additional input fea- combining phrase-based translation with feed- tures to feed-forward neural networks in addition forward neural network-based models (FFNN) to the one-hot encoding. Thus, the probability of (Schwenk et al., 2006; Vaswani et al., 2013; the feed-forward neural network translation model Schwenk, 2012; Devlin et al., 2014), as well with an m-word source window can be written as: as with recurrent neural network models (RNN) I (Sundermeyer et al., 2014). Even in alternative I J Y bi+∆m p(e | f ) ≈ p(ei | f , fBoW,i) (1) translation systems they showed remarkable per- 1 1 bi−∆m i=1 formance (Sutskever et al., 2014; Bahdanau et al., m−1 2015). where ∆m = 2 and bi is the index of the single The main drawback of a feed-forward neural aligned source word to the target word ei. We ap- network model compared to a recurrent neural plied the affiliation technique proposed in (Devlin network model is that it can only have a limited et al., 2014) for obtaining the one-to-one align-

Page 58 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

Page 59 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

ments. The bag-of-words input features fBoW,i can contextual words from the current word. There- be seen as normalized n-of-N vectors as demon- fore the bag-of-words vector with decay weights strated in Figure 1, where n is the number of words can be defined as following: inside each bag-of-words. ˜ X |i−k| ˜ fBoW,i = d fk (2)

k∈SBoW where

[0 1 0 0 ··· 0 0] [0 0 0 1 ··· 0 0] [0 0 1 0 ··· 0 0] [1 0 0 0 ··· 1 1] [0 0 1 0 ··· 1 0] n n n n n i, k Positions of the current word and words original word features bag-of-words input features within the BoW model respectively. Figure 1: The bag-of-words input features along ˜ fBoW,i The value vector of the BoW input fea- with the original word features. The input vectors ture for the i-th word in the sentence. are projected and concatenated at the projection ˜ layer. We omit the hidden and output layers for fk One-hot encoded feature vector of the k- simplification, since they remain unchanged. th word in the sentence.

SBoW Indices set of the words contained in the 2.1 Contents of Bag-of-Words Features BoW. If a word appears more than once in the BoW, the index of the nearest one Before utilizing the bag-of-words input features to the current word will be selected. we have to decide which words should be part of it. We tested multiple different variants: d Decay rate with float value ranging from zero to one. It specifies how fast weights 1. Collecting all words of the sentence in one bag- of contextual words decay along with dis- of-words except the currently aligned word. tances, which can be learned like other weight parameters of the neural network. 2. Collecting all preceding words in one bag-of- words and all succeeding words in a second Instead of using fixed decay rate as in (Irie et al., bag-of-words. 2015), we propose to train the decay rate like other weight parameters in the neural network. The ap- 3. Collecting all preceding words in one bag-of- proach presented by (Mikolov et al., 2015) is com- words and all succeeding words in a second parable to the corpus decay rate shown here, ex- bag-of-words except those already included in cept that their work makes use of a diagonal ma- the source window. trix instead of a scalar as decay rate. In our ex- periments, three different kinds of decay rates are All of these variants provide the feed-forward trained and applied: neural network with an unlimited context in both directions. The differences between these setups 1. Corpus decay rate: all words in vocabulary only varied by 0.2% BLEU and 0.1% TER. We share the same decay rate. choose to base further experiments on the last vari- ant since it performed best and seemed to be the 2. Individual decay rate for each bag-of-words: most logical choice for us. each bag-of-words has its own decay rate given the aligned word. 2.2 Exponentially Decaying Bag-of-Words 3. Individual decay rate for each word: each word Another variant is to weight the words within uses its own decay rate. the bag-of-words model. In the standard bag- of-words representation these weights are equally We use the English sentence distributed for all words. This means the bag-of- “friends had been talking about this fish for a long time” words input is a vector which marks if a word is as an example to clarify the differences between given or not and does not encode the word or- these variants. A five words contextual window der. To avoid this problem, the exponential decay centered at the current aligned word fish has approach proposed in (Clarkson and Robinson, been applied: {about, this, fish, for, a}. 1997) has been adopted to express the distance of The bag-of-words models are used to collect all

Page 60 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

other source words outside the context window: • Enhanced low frequency counts {friends, had, been, talking} and {long, (Chen et al., 2011) time}. Furthermore, there are multiple choices for assigning decay weights to all these words in • 4-gram language model the bag-of-words feature: • 7-gram word class language model

Sentence: friends had been talking about this fish for a long time (Wuebker et al., 2013)

• Word and phrase penalties Distance: 6 5 4 3 3 4 • Hierarchical reordering model (Galley and Manning, 2008) 1. Corpus decay rate: d Additionally, a neural network translation model, Weights: d6 d5 d4 d3 d3 d4 similar to (Devlin et al., 2014), with following configurations is applied for reranking the n-best

2. Bag-of-words individual decay rate: d = dfish lists:

6 5 4 3 3 4 • Projection layer size 100 for each word Weights: dfish dfish dfish dfish dfish dfish • Two non-linear hidden layers with 1000 and 500 3. Word individual decay rate: nodes respectively d ∈ {dfriends, dhad, dbeen, dtalking, dlong, dtime} • Short-list size 10000 along with 1000 word Weights: d6 d5 d4 d3 d3 d4 friends had been talking long time classes at the output layer

• 5 one-hot input vectors of words

Unless otherwise stated, the investigations on bag- 3 Experiments of-words input features are based on this neural 3.1 Setup network model. We also integrated our neural net- work translation model into the decoder as pro- Experiments are conducted on the IWSLT 2013 posed in (Devlin et al., 2014). The relative im- German→English, WMT 2015 German→English provements provided by integrated decoding and and DARPA BOLT Chinese→English translation reranking are quite similar, which can also be con- tasks. GIZA++ (Och and Ney, 2003) is applied firmed by (Alkhouli et al., 2015). We therefore for aligning the parallel corpus. The translation decided to only work in reranking for repeated ex- quality is evaluated by case-insensitive BLEU (Pa- perimentation. pineni et al., 2002) and TER (Snover et al., 2006) metric. The scaling factors are tuned with MERT 3.2 Exponentially Decaying Bag-of-Words (Och, 2003) with BLEU as optimization criterion As shown in Section 2.2, the exponential decay on the development sets. The systems are evalu- approach is applied to express the distance of con- ated using MultEval (Clark et al., 2011). In the textual words from the current word. Thereby the experiments the maximum size of the n-best lists information of sequence order can be included into applied for reranking is 500. For the translation bag-of-words models. We demonstrated three dif- experiments, the averaged scores are presented on ferent kinds of decay rates for words in the bag- the development set from three optimization runs. of-words input feature, namely the corpus general Experiments are performed using the Jane decay rate, the bag-of-words individual decay rate toolkit (Vilar et al., 2010; Wuebker et al., 2012) and the word individual decay rate. with a log-linear framework containing following Table 1 illustrates the experimental results of feature functions: the neural network translation model with ex- • Phrase translation probabilities both directions ponentially decaying bag-of-words input features on IWSLT 2013 German→English, WMT 2015 • Word lexicon features in both directions German→English and BOLT Chinese→English

Page 61 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

IWSLT WMT BOLT test eval11 newstest2013 test BLEU[%] TER[%] BLEU[%] TER[%] BLEU[%] TER[%] BLEU[%] TER[%] Baseline + NNTM 31.9 47.5 36.7 43.0 28.8 53.8 17.4 67.1 + BoW Features 32.0 47.3 36.9 42.9 28.8 53.5∗ 17.5 67.0 + Fixed DR (0.9) 32.2∗ 47.3 37.0∗ 42.6∗† 29.0 53.5∗ 17.7∗ 66.8∗ + Corpus DR 32.1 47.3 36.9 42.7∗ 29.1∗† 53.5∗ 17.7∗ 66.7∗† + BoW DR 32.4∗† 47.0∗† 37.2∗† 42.4∗† 29.2∗† 53.2∗† 17.9∗† 66.6∗† + Word DR 32.3∗† 47.0∗ 37.1∗ 42.7∗ 29.1∗† 53.4∗ 17.8∗† 66.7∗† Baseline + LSTM 32.2∗ 47.4 37.1∗ 42.5∗† 29.0 53.3∗ 17.6 66.8∗

Table 1: Experimental results of translations using exponentially decaying bag-of-words models with different kinds of decay rates. Improvements by systems marked by ∗ have a 95% statistical significance from the baseline system, whereas † denotes the 95% statistical significant improvements with respect to the BoW Features system (without decay weights). We experimented with several values for the fixed decay rate (DR) and 0.9 performed best. The applied RNN model is the LSTM bidirectional translation model proposed in (Sundermeyer et al., 2014).

translation tasks. Here we applied two bag-of- 3.3 Comparison between Bag-of-Words and words models to separately contain the preced- Large Context Window ing and succeeding words outside the context win- The main motivation behind the usage of the bag- dow. We can see that the bag-of-words feature of-words input features is to provide the model without exponential decay weights only provides with additional context information. We compared small improvements. After appending the de- the bag-of-words input features to different source cay weights, four different kinds of decay rates side windows to refute the argument that simply provide further improvements to varying degrees. increasing the size of the window could achieve The bag-of-words individual decay rate performs the same results. Our experiments showed that in- the best, which gives us improvements by up to creasing the source side window beyond 11 gave 0.5% on BLEU and up to 0.6% on TER. On these no more improvements while the model that used tasks, these improvements even help the feed- the bag-of-words input features is able to achieve forward neural network achieve a similar perfor- the best result (Figure 2). A possible explanation mance to the popular long short-term memory re- for this could be that the feed-forward neural net- current neural network model (Sundermeyer et al., work learns its input position-dependent. If one 2014), which contains three LSTM layers with source word is moved by one position the feed- 200 nodes each. The results of the word individual forward neural network needs to have seen a word decay rate are worse than that of the bag-of-words with a similar word vector at this position dur- decay rate. One reason is that in word individual ing training to interpret it correctly. The likeli- case, the sequence order can still be missing. We hood of precisely getting the position decreases initialize all values for the tunable decay rates with with a larger distance. The bag-of-words model 0.9. In the IWSLT 2013 German→English task, on the other hand will still get the same input only the corpus decay rate is tuned to 0.578. When in- slightly stronger or weaker on the new distance vestigating the values of the trained bag-of-words and decay rate. individual decay rate vector, we noticed that the variance of the value for frequent words is much lower than for rare words. We also observed that 4 Conclusion most function words, such as prepositions and conjunctions, are assigned low decay rates. We The aim of this work was to investigate the influ- could not find a pattern for the trained value vec- ence of exponentially decaying bag-of-words in- tor of the word individual decay rates. put features with trained decay rates on the feed- forward neural network translation model. Ap- plying the standard bag-of-words model as an ad- ditional input feature in our feed-forward neural network translation model only yields slight im-

Page 62 of 78 Quality Translation 21 D1.5: Improved Learning for Machine Translation

provements, since the original bag-of-words representation does not include information about the ordering of each word. To avoid this problem, we applied the exponential decay weight to express the distances between words and propose to train the decay rate like the other weight parameters of the network. Three different kinds of decay rates are proposed; the bag-of-words individual decay rate performs best and provides an average improvement of 0.5% BLEU on three different translation tasks, even outperforming a bidirectional LSTM translation model on these tasks. By contrast, applying additional one-hot encoded input vectors or enlarging the network structure cannot achieve performance as good as the bag-of-words features.

Figure 2: The change of BLEU scores on the eval11 set of the IWSLT 2013 German→English task along with the source context window size. The source windows are always symmetrical with respect to the aligned word. For instance, window size five denotes that two preceding and two succeeding words along with the aligned word are included in the window. The average sentence length of the corpus is about 18 words. The red line is the result of using a model with bag-of-words input features and a bag-of-words individual decay rate.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Tamer Alkhouli, Felix Rietig, and Hermann Ney. 2015. Investigations on phrase-based decoding with recurrent neural network language and translation models. In EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 294–303, Lisbon, Portugal, September.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables. In Proceedings of MT Summit XIII, pages 269–275, Xiamen, China, September.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 176–181, Portland, OR, USA, June.

P. Clarkson and A. Robinson. 1997. Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 799–802, Washington, DC, USA, April.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1370–1380, Baltimore, MD, USA, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 848–856, Honolulu, HI, USA, October.

Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2015. Bag-of-Words Input for Long History Representation in Neural Network-based Language Models for Speech Recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, pages 2371–2375, Dresden, Germany, September.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2015. Learning longer memory in recurrent neural networks. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, May.


Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51, March.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA, July.

Holger Schwenk, Daniel Déchelotte, and Jean-Luc Gauvain. 2006. Continuous space language models for statistical machine translation. In Proceedings of the 44th Annual Meeting of the International Committee on Computational Linguistics and the Association for Computational Linguistics, pages 723–730, Sydney, Australia, July.

Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics, pages 1071–1080, Mumbai, India, December.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, MA, USA, August.

Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 14–25, Doha, Qatar, October.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, WA, USA, October.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 262–270, Uppsala, Sweden, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation. In International Conference on Computational Linguistics, pages 483–491, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1377–1381, Seattle, WA, USA, October.


H A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

Yunsu Kim, Andreas Guta, Joern Wuebker∗, and Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de
∗Lilt, Inc., [email protected]

Abstract

This work systematically analyzes the smoothing effect of vocabulary reduction for phrase translation models. We extensively compare various word-level vocabularies to show that the performance of smoothing is not significantly affected by the choice of vocabulary. This result provides empirical evidence that the standard phrase translation model is extremely sparse. Our experiments also reveal that vocabulary reduction is more effective for smoothing large-scale phrase tables.

1 Introduction

Phrase-based systems for statistical machine translation (SMT) (Zens et al., 2002; Koehn et al., 2003) have shown state-of-the-art performance over the last decade. However, due to the huge size of the phrase vocabulary, it is difficult to collect robust statistics for many phrase pairs. The standard phrase translation model thus tends to be sparse (Koehn, 2010).

A fundamental solution to a sparsity problem in natural language processing is to reduce the vocabulary size. By mapping words onto a smaller label space, the models can be trained to have denser distributions (Brown et al., 1992; Miller et al., 2004; Koo et al., 2008). Examples of such labels are part-of-speech (POS) tags or lemmas.

In this work, we investigate vocabulary reduction for phrase translation models with respect to various vocabulary choices. We evaluate two types of smoothing models for the phrase translation probability using different kinds of word-level labels. In particular, we use automatically generated word classes (Brown et al., 1992) to obtain label vocabularies with arbitrary sizes and structures. Our experiments reveal that the vocabulary of the smoothing model has no significant effect on the end-to-end translation quality. For example, a randomized label space also leads to a decent improvement of BLEU or TER scores by the presented smoothing models.

We also test vocabulary reduction in translation scenarios of different scales, showing that the smoothing works better with more parallel corpora.

2 Related Work

Koehn and Hoang (2007) propose integrating a label vocabulary as a factor into the phrase-based SMT pipeline, which consists of the following three steps: mapping from words to labels, label-to-label translation, and generation of words from labels. Rishøj and Søgaard (2011) verify the effectiveness of word classes as factors. Assuming probabilistic mappings between words and labels, the factorization implies a combinatorial expansion of the phrase table with regard to different vocabularies.

Wuebker et al. (2013) show a simplified case of the factored translation by adopting hard assignment from words to labels. In the end, they train the existing translation, language, and reordering models on word classes to build the corresponding smoothing models.

Other types of features are also trained on word-level labels, e.g. hierarchical reordering features (Cherry, 2013), an n-gram-based translation model (Durrani et al., 2014), and sparse word pair features (Haddow et al., 2015). The first and the third are trained with a large-scale discriminative training algorithm.

For all usages of word-level labels in SMT,


a common and important question is which label vocabulary maximizes the translation quality. Bisazza and Monz (2014) compare class-based language models with diverse kinds of labels in terms of their performance in translation into morphologically rich languages. To the best of our knowledge, there is no published work on a systematic comparison between different label vocabularies, model forms, and training data sizes for smoothing phrase translation models, the most basic component in state-of-the-art SMT systems. Our work fulfills these needs with extensive translation experiments (Section 5) and quantitative analysis (Section 6) in a standard phrase-based SMT framework.

3 Word Classes

In this work, we mainly use unsupervised word classes by Brown et al. (1992) as the reduced vocabulary. This section briefly reviews the principle and properties of word classes.

A word-class mapping c is estimated by a clustering algorithm that maximizes the following objective (Brown et al., 1992):

    L := \sum_{e_1^I} \sum_{i=1}^{I} p(c(e_i) \mid c(e_{i-1})) \cdot p(e_i \mid c(e_i))    (1)

for a given monolingual corpus, where each e_1^I is a sentence of length I in the corpus. The objective guides c to prefer certain collocations of class sequences, e.g. an auxiliary verb class should succeed a class of pronouns or person names. Consequently, the resulting c groups words according to their syntactic or semantic similarity.

Word classes have a big advantage for our comparative study: the structure and size of the class vocabulary can be arbitrarily adjusted by the clustering parameters. This makes it possible to easily prepare an abundant set of label vocabularies that differ in linguistic coherence and degree of generalization.

4 Smoothing Models

In the standard phrase translation model, the translation probability for each segmented phrase pair (\tilde{f}, \tilde{e}) is estimated by relative frequencies:

    p_{\text{std}}(\tilde{f} \mid \tilde{e}) = \frac{N(\tilde{f}, \tilde{e})}{N(\tilde{e})}    (2)

where N is the count of a phrase or a phrase pair in the training data. These counts are very low for many phrases due to a limited amount of bilingual training data.

Using a smaller vocabulary, we can aggregate the low counts and make the distribution smoother. We now define two types of smoothing models for Equation 2 using a general word-label mapping c.

4.1 Mapping All Words at Once (map-all)

For the phrase translation model, the simplest formulation of vocabulary reduction is obtained by replacing all words in the source and target phrases with the corresponding labels in a smaller space. Namely, we employ the following probability instead of Equation 2:

    p_{\text{all}}(\tilde{f} \mid \tilde{e}) = \frac{N(c(\tilde{f}), c(\tilde{e}))}{N(c(\tilde{e}))}    (3)

which we call map-all. This model resembles the word class translation model of Wuebker et al. (2013) except that we allow any kind of word-level labels.

This model generalizes all words of a phrase without distinction between them. Also, the same formulation is applied to word-based lexicon models.

4.2 Mapping Each Word at a Time (map-each)

More elaborate smoothing can be achieved by generalizing only a sub-part of the phrase pair. The idea is to replace one source word at a time with its respective label. For each source position j, we also replace the target words aligned to the source word f_j. For this purpose, we let a_j ⊆ {1, ..., |\tilde{e}|} denote the set of target positions aligned to j. The resulting model takes a weighted average of the redefined translation probabilities over all source positions of \tilde{f}:

    p_{\text{each}}(\tilde{f} \mid \tilde{e}) = \sum_{j=1}^{|\tilde{f}|} w_j \cdot \frac{N(c^{(j)}(\tilde{f}), c^{(a_j)}(\tilde{e}))}{N(c^{(a_j)}(\tilde{e}))}    (4)

where the superscripts of c indicate the positions that are mapped onto the label space. w_j is a weight for each source position, where \sum_j w_j = 1. We call this model map-each.

We illustrate this model with a pair of three-word phrases: \tilde{f} = [f_1, f_2, f_3] and \tilde{e} = [e_1, e_2, e_3] (see Figure 1 for the in-phrase word alignments).


Figure 1: Word alignments of a pair of three-word phrases.

The map-each model score for this phrase pair is:

    p_{\text{each}}([f_1, f_2, f_3] \mid [e_1, e_2, e_3]) =
          w_1 \cdot \frac{N([c(f_1), f_2, f_3], [c(e_1), e_2, e_3])}{N([c(e_1), e_2, e_3])}
        + w_2 \cdot \frac{N([f_1, c(f_2), f_3], [e_1, e_2, e_3])}{N([e_1, e_2, e_3])}
        + w_3 \cdot \frac{N([f_1, f_2, c(f_3)], [e_1, c(e_2), c(e_3)])}{N([e_1, c(e_2), c(e_3)])}

where the alignments are those depicted by the line segments in Figure 1. First of all, we replace f_1 and also e_1, which is aligned to f_1, with their corresponding labels. As f_2 has no alignment points, we do not replace any target word for it. f_3 triggers the class replacement of two target words at the same time. Note that the model implicitly encapsulates the alignment information.

We empirically found that the map-each model performs best with the following weight:

    w_j = \frac{N(c^{(j)}(\tilde{f}), c^{(a_j)}(\tilde{e}))}{\sum_{j'=1}^{|\tilde{f}|} N(c^{(j')}(\tilde{f}), c^{(a_{j'})}(\tilde{e}))}    (5)

which is a normalized count of the generalized phrase pair itself. Here, the count is relatively large when f_j, the word to be backed off, is less frequent than the other words in \tilde{f}. In contrast, if f_j is a very frequent word and one of the other words in \tilde{f} is rare, the count becomes low due to that rare word. The same logic holds for target words in \tilde{e}. After all, Equation 5 carries more weight when a rare word is replaced with its label. The intuition is that a rare word is the main reason for unstable counts and should be backed off above all. We use this weight for all experiments in the next section.

In contrast, the map-all model merely replaces all words at one time and ignores alignments within phrase pairs.
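For illustration only, the following is a minimal sketch of how the two smoothing scores of Equations (3)-(5) could be computed once counts over the (partially) class-mapped phrase pairs have been collected, e.g. in collections.Counter tables keyed by phrase tuples. The function and argument names are ours and are not taken from the Jane toolkit.

```python
def map_all_prob(src, tgt, cls_src, cls_tgt, pair_counts, tgt_counts):
    """Eq. (3): replace every source and target word by its class label.
       pair_counts / tgt_counts are Counters over fully class-mapped phrases."""
    src_c = tuple(cls_src[w] for w in src)
    tgt_c = tuple(cls_tgt[w] for w in tgt)
    denom = tgt_counts[tgt_c]
    return pair_counts[(src_c, tgt_c)] / denom if denom else 0.0

def map_each_prob(src, tgt, alignment, cls_src, cls_tgt, pair_counts, tgt_counts):
    """Eqs. (4)-(5): generalize one source position j (and its aligned target
       positions a_j) at a time; each term is weighted by the normalized count
       of the generalized phrase pair itself. Here pair_counts / tgt_counts
       hold counts of phrases with exactly those positions class-mapped."""
    terms = []
    for j in range(len(src)):
        gen_src = tuple(cls_src[w] if k == j else w for k, w in enumerate(src))
        aligned = {i for (jj, i) in alignment if jj == j}          # a_j
        gen_tgt = tuple(cls_tgt[w] if i in aligned else w for i, w in enumerate(tgt))
        terms.append((pair_counts[(gen_src, gen_tgt)], tgt_counts[gen_tgt]))
    norm = sum(num for num, _ in terms)            # denominator of Eq. (5)
    if norm == 0:
        return 0.0
    return sum((num / norm) * (num / den) for num, den in terms if den)
```

As in the three-word example above, a source word without alignment points (empty a_j) leaves the target side unchanged in its term.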

5 Experiments

5.1 Setup

We evaluate how much the translation quality is improved by the smoothing models in Section 4. The two smoothing models are trained in both source-to-target and target-to-source directions, and integrated as additional features in the log-linear combination of a standard phrase-based SMT system (Koehn et al., 2003). We also test linear interpolation between the standard and smoothing models, but the results are generally worse than with log-linear interpolation. Note that vocabulary reduction models by themselves cannot replace the corresponding standard models, since this leads to a considerable drop in translation quality (Wuebker et al., 2013).

Our baseline systems include phrase translation models in both directions, word-based lexicon models in both directions, word/phrase penalties, a distortion penalty, a hierarchical lexicalized reordering model (Galley and Manning, 2008), a 4-gram language model, and a 7-gram word class language model (Wuebker et al., 2013). The model weights are trained with minimum error rate training (Och, 2003). All experiments are conducted with the open-source phrase-based SMT toolkit Jane 2 (Wuebker et al., 2012).

To validate our experimental results, we measure the statistical significance using the paired bootstrap resampling method of Koehn (2004). Every result in this section is marked with ‡ if it is statistically significantly better than the baseline with 95% confidence, or with † for 90% confidence.
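As a reminder of how such significance marks are obtained, here is a minimal sketch of paired bootstrap resampling over sentence-aligned outputs; corpus_metric stands for an external corpus-level scorer (e.g. BLEU), and the number of samples is illustrative.

```python
import random

def paired_bootstrap(base_hyps, sys_hyps, refs, corpus_metric, samples=1000, seed=0):
    """Fraction of resampled test sets on which the system scores at least as
       well as the baseline (assuming a higher-is-better metric such as BLEU;
       flip the comparison for TER). A fraction >= 0.95 corresponds to the
       95%-confidence mark used in the tables."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]        # resample with replacement
        base = corpus_metric([base_hyps[i] for i in idx], [refs[i] for i in idx])
        syst = corpus_metric([sys_hyps[i] for i in idx], [refs[i] for i in idx])
        wins += syst >= base
    return wins / samples
```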


5.2 Comparison of Vocabularies

The presented smoothing models are dependent on the label vocabulary, which is defined by the word-label mapping c. Here, we train the models with various label vocabularies and compare their smoothing performance.

The experiments are done on the IWSLT 2012 German→English shared translation task. To rapidly perform repetitive experiments, we train the translation models with the in-domain TED portion of the dataset (roughly 2.5M running words for each side). We run the monolingual word clustering algorithm of Botros et al. (2015) on each side of the parallel training data to obtain class label vocabularies (Section 3).

We carry out comparative experiments regarding the three factors of the clustering algorithm:

1) Clustering iterations. It is shown that the number of iterations is the most influential factor in clustering quality (Och, 1995). We now verify its effect on translation quality when the clustering is used for phrase table smoothing.

As we run the clustering algorithm, we extract an intermediate class mapping for each iteration and train the smoothing models with it. The model weights are tuned for each iteration separately. The BLEU scores of the tuned systems are given in Figure 2. We use 100 classes on both source and target sides.

Figure 2: BLEU scores for clustering iterations when using individually tuned model weights for each iteration. Dots indicate those iterations in which the translation is performed.

The score does not consistently increase or decrease over the iterations; it is rather on a similar level (±0.2% BLEU) for all settings with slight fluctuations. This is an important clue that the whole process of word clustering has no meaning in smoothing phrase translation models.

To see this more clearly, we keep the model weights fixed over different systems and run the same set of experiments. In this way, we focus only on the change of label vocabulary, removing the impact of nondeterministic model weight optimization. The results are given in Figure 3.

Figure 3: BLEU scores for clustering iterations when using a fixed set of model weights. The weights that produce the best results in Figure 2 are chosen.

This time, the curves are even flatter, resulting in only ±0.1% BLEU difference over the iterations. More surprisingly, the models trained with the initial clustering, i.e. when the clustering algorithm has not even started yet, are on a par with those trained with more optimized classes in terms of translation quality.

2) Initialization of the clustering. Since the clustering process has no significant impact on the translation quality, we hypothesize that the initialization may dominate the clustering. We compare five different initial class mappings:

• random: randomly assign words to classes
• top-frequent (default): top-frequent words have their own classes, while all other words are in the last class
• same-countsum: each class has almost the same sum of word unigram counts
• same-#words: each class has almost the same number of words
• count-bins: each class represents a bin of the total count range

Table 1 shows the translation results with the map-each model trained with these initializations, without running the clustering algorithm. We use the same set of model weights used in Figure 3.

    Initialization               BLEU [%]   TER [%]
    Baseline                       28.3      52.2
    + map-each  random             28.9‡     51.7‡
                top-frequent       29.0‡     51.5‡
                same-countsum      28.8‡     51.7‡
                same-#words        28.9‡     51.6‡
                count-bins         29.0‡     51.4‡

Table 1: Translation results for various initializations of the clustering. 100 classes on both sides.

We find that the initialization method also does not affect the translation performance.
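Purely as an illustration of the five initial mappings listed above, the sketch below derives each of them from unigram counts; the exact binning and tie-breaking details are our assumptions and are not taken from the clustering tool used in the paper.

```python
import random

def initial_classes(counts, num_classes, scheme="top-frequent", seed=0):
    """Return a word -> class id mapping for one of the five schemes above."""
    words = sorted(counts, key=counts.get, reverse=True)      # most frequent first
    cls = {}
    if scheme == "random":
        rng = random.Random(seed)
        for w in words:
            cls[w] = rng.randrange(num_classes)
    elif scheme == "top-frequent":
        # the most frequent words get their own class; the rest share the last one
        for rank, w in enumerate(words):
            cls[w] = min(rank, num_classes - 1)
    elif scheme == "same-#words":
        per_class = -(-len(words) // num_classes)             # ceiling division
        for rank, w in enumerate(words):
            cls[w] = rank // per_class
    elif scheme == "same-countsum":
        target = sum(counts.values()) / num_classes
        c, acc = 0, 0
        for w in words:
            cls[w] = c
            acc += counts[w]
            if acc >= target and c < num_classes - 1:
                c, acc = c + 1, 0
    elif scheme == "count-bins":
        # each class covers one equal-width bin of the observed count range
        hi, lo = counts[words[0]], counts[words[-1]]
        width = (hi - lo) / num_classes or 1.0
        for w in words:
            cls[w] = min(int((hi - counts[w]) / width), num_classes - 1)
    return cls
```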


As an extreme case, random clustering is also a fine candidate for training the map-each model.

3) Number of classes. This determines the vocabulary size of a label space, which eventually adjusts the smoothing degree. Table 2 shows the translation performance of the map-each model with a varying number of classes. As before, there is no serious performance gap among different word classes, and POS tags and lemmas also conform to this trend.

However, we observe a slight but steady degradation of translation quality (≈ -0.2% BLEU) when the vocabulary size is larger than a few hundred. We also lose statistical significance for BLEU in these cases. The reason could be: if the label space becomes larger, it gets closer to the original vocabulary and therefore the smoothing model provides less additional information to add to the standard phrase translation model.

                               #vocab (source)   BLEU [%]   TER [%]
    Baseline                                       28.3       52.2
    + map-each (word class)          100           29.0‡      51.5‡
                                     200           28.9†      51.6‡
                                     500           28.7       51.8‡
                                    1000           28.7       51.8‡
                                   10000           28.7       51.9†
    + map-each (POS)                  52           28.9†      51.5‡
    + map-each (lemma)             26744           28.8       51.7‡

Table 2: Translation results for different vocabulary sizes.

This series of experiments shows that the map-each model performs very similarly across vocabulary sizes and structures. From our internal experiments, this argument also holds for the map-all model. The results do not change even when we use a different clustering algorithm, e.g. bilingual clustering (Och, 1999). For the translation performance, the more important factor is the log-linear model training to find an optimal set of weights for the smoothing models.

5.3 Comparison of Smoothing Models

Next, we compare the two smoothing models by their performance in four different translation tasks: IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech. We train 100 classes on each side with 30 clustering iterations starting from the default (top-frequent) initialization.

Table 3 provides the corpus statistics of all datasets used. Note that a morphologically rich language is on the source side for the first two tasks, and on the target side for the last two tasks. According to the results (Table 4), the map-each model, which encourages backing off infrequent words, performs consistently better (maximum +0.5% BLEU, -0.6% TER) than the map-all model in all cases.

5.4 Comparison of Training Data Size

Lastly, we analyze the smoothing performance for different training data sizes (Figure 4). The improvement of the BLEU score over the baseline decreases drastically when the training data get smaller. We argue that this is because the smoothing models are only additional scores for the phrases seen in the training data. For smaller training data, we have more out-of-vocabulary (OOV) words in the test set, which cannot be handled by the presented models.

Figure 4: BLEU scores and OOV rates for the varying training data portion of WMT 2015 Finnish→English data.

6 Analysis

In Section 5.2, we have shown experimentally that more optimized or more fine-grained classes do not guarantee better smoothing performance.


                      IWSLT 2012            WMT 2015              WMT 2014              WMT 2015
                      German    English     Finnish   English     English   German      English   Czech
    Sentences               130k                  1.1M                   4M                   0.9M
    Running Words     2.5M      2.5M        23M       32M         104M      105M        23.9M     21M
    Vocabulary        71k       49k         509k      88k         648k      659k        161k      345k

Table 3: Bilingual training data statistics for IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.

                     de-en              fi-en              en-de              en-cs
                  BLEU    TER        BLEU    TER        BLEU    TER        BLEU    TER
                  [%]     [%]        [%]     [%]        [%]     [%]        [%]     [%]
    Baseline      28.3    52.2       15.1    72.6       14.6    69.8       15.3    68.7
    + map-all     28.6‡   51.6‡      15.3‡   72.5       14.8‡   69.4‡      15.4‡   68.2‡
    + map-each    29.0‡   51.4‡      15.8‡   72.0‡      15.1‡   69.0‡      15.8‡   67.6‡

Table 4: Translation results for IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech tasks.

                                                Top 200 TER-improved Sentences
    Model       Classes           #vocab     Common Input [%]    Same Translation [%]
    map-each    optimized            100            -                    -
                non-optimized        100           89.5                 89.9
                random               100           88.5                 89.8
                lemma              26744           87.0                 92.6
    map-all     optimized            100           56.0                 54.5

Table 5: Comparison of translation outputs for the smoothing models with different vocabularies. "optimized" denotes 30 iterations of the clustering algorithm, whereas "non-optimized" means the initial (default) clustering.

We now verify by examining translation outputs that the same level of performance is not by chance but due to similar hypothesis scoring across different systems.

Given a test set, we compare its translations generated from different systems as follows. First, for each translated set, we sort the sentences by how much the sentence-level TER is improved over the baseline translation. Then, we select the top 200 sentences from this sorted list, which represent the main contribution to the decrease of TER. In Table 5, we compare the top 200 TER-improved translations of the map-each model setups with different vocabularies.

In the fourth column, we trace the input sentences that are translated by the top 200 lists, and count how many of those inputs overlap across the given systems. Here, a large overlap indicates that two systems are particularly effective in a large common part of the test set, showing that they behaved analogously in the search process. The numbers in this column are computed against the map-each model setup trained with 100 optimized word classes (first row). For all map-each settings, the overlap is very large, around 90%.

To investigate further, we count how often the two translations of a single input are identical (the last column). This is normalized by the number of common input sentences in the top 200 lists between two systems. It is a straightforward measure to see if two systems discriminate translation hypotheses in a similar manner. Remarkably, all systems equipped with the map-each model produce exactly the same translations for most of the top 200 TER-improved sentences.
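The comparison just described can be written down in a few lines; sentence_ter stands in for an external sentence-level TER scorer, and all names are illustrative rather than taken from the actual evaluation scripts.

```python
def top_ter_improved(sys_hyps, base_hyps, refs, sentence_ter, k=200):
    """Indices of the k sentences whose TER improves most over the baseline."""
    gains = sorted(
        ((sentence_ter(base_hyps[i], refs[i]) - sentence_ter(sys_hyps[i], refs[i]), i)
         for i in range(len(refs))),
        reverse=True)
    return [i for _, i in gains[:k]]

def overlap_measures(top_a, top_b, hyps_a, hyps_b):
    """Common-input overlap and identical-translation rate (both in percent)."""
    common = set(top_a) & set(top_b)
    common_input = 100.0 * len(common) / len(top_a)
    same = (100.0 * sum(hyps_a[i] == hyps_b[i] for i in common) / len(common)
            if common else 0.0)
    return common_input, same
```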


We can see from this analysis that, even though a smoothing model is trained with essentially different vocabularies, it helps the translation process in basically the same manner. For comparison, we also compute the measures for a map-all model, which are far behind the high similarity among the map-each models. Indeed, for smoothing phrase translation models, changing the model structure for vocabulary reduction exerts a strong influence on the hypothesis scoring, yet changing the vocabulary does not.

7 Conclusion

Reducing the vocabulary using a word-label mapping is a simple and effective way of smoothing phrase translation models. By mapping each word in a phrase at a time, the translation quality can be improved by up to +0.7% BLEU and -0.8% TER over a standard phrase-based SMT baseline, which is superior to Wuebker et al. (2013).

Our extensive comparison among various vocabularies shows that different word-label mappings are almost equally effective for smoothing phrase translation models. This allows us to use any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort in optimizing the structure and granularity of the label vocabulary. Our analysis of sentence-level TER demonstrates that the same level of performance stems from the analogous hypothesis scoring.

We claim that this result emphasizes the fundamental sparsity of the standard phrase translation model. Too many target phrase candidates are originally undervalued, so giving them any reasonable amount of extra probability mass, e.g. by smoothing with random classes, is enough to broaden the search space and improve translation quality. Even if we change a single parameter in estimating the label space, it does not have a significant effect on scoring hypotheses, where many models other than the smoothed translation model, e.g. language models, are involved with large weights. Nevertheless, an exact linguistic explanation is still to be discovered.

Our results on varying training data show that vocabulary reduction is more suitable for large-scale translation setups. This implies that OOV handling is more crucial than smoothing phrase translation models for low-resource translation tasks.

For future work, we plan to perform a similar set of comparative experiments on neural machine translation systems.

Acknowledgments

This paper has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 645452 (QT21).

References

Arianna Bisazza and Christof Monz. 2014. Class-based language modeling for translating into morphologically rich languages. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1918–1927, Dublin, Ireland, August.

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On efficient training of word classes and their application to recurrent neural network language models. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), pages 1443–1447, Dresden, Germany, September.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

Colin Cherry. 2013. Improved reordering for phrase-based translation using sparse features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 22–31, Atlanta, GA, USA, June.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the usefulness of generalized word representations in SMT. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 421–432, Dublin, Ireland, August.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 848–856, Honolulu, HI, USA, October.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU phrase-based machine translation systems for WMT 2015. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), pages 126–133, Lisbon, Portugal, September.


Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 868–876, Prague, Czech Republic, June.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT 2003), pages 48–54, Edmonton, Canada, May.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 388–395, Barcelona, Spain, July.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 595–603, Columbus, OH, USA, June.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of the 2004 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2004), pages 337–342, Boston, MA, USA, May.

Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany, May.

Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999), pages 71–76, Bergen, Norway, June.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 160–167, Sapporo, Japan, July.

Christian Rishøj and Anders Søgaard. 2011. Factored translation with unsupervised word clusters. In Proceedings of the 2011 EMNLP 6th Workshop on Statistical Machine Translation (WMT 2011), pages 447–451, Edinburgh, Scotland, July.

Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, and Hermann Ney. 2012. Jane 2: Open source phrase-based and hierarchical statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 483–492, Mumbai, India, December.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving statistical machine translation with word class models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1377–1381, Seattle, WA, USA, October.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. In Matthias Jarke, Jana Koehler, and Gerhard Lakemeyer, editors, 25th German Conference on Artificial Intelligence (KI 2002), volume 2479 of Lecture Notes in Artificial Intelligence (LNAI), pages 18–32, Aachen, Germany, September. Springer Verlag.


I BIRA: Improved Predictive Exchange Word Clustering

BIRA: Improved Predictive Exchange Word Clustering

Jon Dehdari (1,2), Liling Tan (2), and Josef van Genabith (1,2)
(1) DFKI, Saarbrücken, Germany; {jon.dehdari,josef.van_genabith}@dfki.de
(2) University of Saarland, Saarbrücken, Germany; [email protected]

Abstract

Word clusters are useful for many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. Little attention has been paid thus far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but sometimes does not provide as good perplexity as other, slower clustering algorithms.

We introduce the bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. It improves upon the predictive exchange algorithm's perplexity by up to 18%, giving it perplexities comparable to the slower two-sided exchange algorithm, and better perplexities than the slower Brown clustering algorithm. Our BIRA implementation is fast, clustering a 2.5 billion token English News Crawl corpus in 3 hours. It also reduces machine translation training time while preserving translation quality. Our implementation is portable and freely available.

1 Introduction

Words can be grouped together into equivalence classes to help reduce data sparsity and better generalize data. Word clusters are useful in many NLP applications. Within machine translation, word classes are used in word alignment (Brown et al., 1993; Och and Ney, 2000), translation models (Koehn and Hoang, 2007; Wuebker et al., 2013), reordering (Cherry, 2013), preordering (Stymne, 2012), target-side inflection (Chahuneau et al., 2013), SAMT (Zollmann and Vogel, 2011), and OSM (Durrani et al., 2014), among many others.

Word clusterings have also found utility in parsing (Koo et al., 2008; Candito and Seddah, 2010; Kong et al., 2014), chunking (Turian et al., 2010), NER (Miller et al., 2004; Liang, 2005; Ratinov and Roth, 2009; Ritter et al., 2011), structure transfer (Täckström et al., 2012), and discourse relation discovery (Rutherford and Xue, 2014).

Word clusters also speed up normalization in training neural network and MaxEnt language models, via class-based decomposition (Goodman, 2001a). This reduces the normalization time from O(|V|) (the vocabulary size) to approximately O(\sqrt{|V|}). Further improvements to O(log(|V|)) are found using hierarchical softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009).

2 Word Clustering

Word clustering partitions a vocabulary V, grouping together words that function similarly. This helps generalize language and alleviate data sparsity. We discuss flat clustering in this paper. Flat, or strict partitioning, clustering surjectively maps word types onto a smaller set of clusters.

The exchange algorithm (Kneser and Ney, 1993) is an efficient technique that exhibits a general time complexity of O(|V| × |C| × I), where |V| is the number of word types, |C| is the number of classes, and I is the number of training iterations, typically < 20. This omits the specific method of exchanging words, which adds further complexity. Words are exchanged from one class to another until convergence, or until I iterations have been reached.
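To make the exchange procedure concrete, here is a deliberately naive sketch of two-sided exchange clustering. It recomputes the class-bigram likelihood from scratch for every candidate move, which is far slower than the incremental count updates used by real implementations such as mkcls or clustercat; all names are illustrative.

```python
import math
from collections import Counter

def class_bigram_ll(corpus, cls):
    """Log-likelihood of the two-sided model
       P(w_i | w_{i-1}) = p(c(w_i) | c(w_{i-1})) * p(w_i | c(w_i)),
       with maximum-likelihood estimates from the corpus."""
    bigram, history, class_uni, word_uni = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        prev = None
        for w in sent:
            c = cls[w]
            word_uni[w] += 1
            class_uni[c] += 1
            if prev is not None:
                bigram[(prev, c)] += 1
                history[prev] += 1
            prev = c
    ll = sum(n * math.log(n / history[cp]) for (cp, _), n in bigram.items())
    ll += sum(n * math.log(n / class_uni[cls[w]]) for w, n in word_uni.items())
    return ll

def exchange_cluster(corpus, num_classes, max_iters=20):
    """Greedy exchange: move each word to the class that maximizes the likelihood."""
    vocab = sorted({w for sent in corpus for w in sent})
    cls = {w: i % num_classes for i, w in enumerate(vocab)}   # simple initialization
    for _ in range(max_iters):
        moved = 0
        for w in vocab:
            orig = cls[w]
            best_c, best_ll = orig, class_bigram_ll(corpus, cls)
            for c in range(num_classes):
                if c == orig:
                    continue
                cls[w] = c
                ll = class_bigram_ll(corpus, cls)
                if ll > best_ll:
                    best_c, best_ll = c, ll
            cls[w] = best_c
            moved += best_c != orig
        if moved == 0:          # no word changed its class: converged
            break
    return cls
```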


Proceedings of NAACL-HLT 2016, pages 1169–1174, San Diego, California, June 12–17, 2016. © 2016 Association for Computational Linguistics


One of the oldest and still most popular exchange algorithm implementations is mkcls (Och, 1995)[1], which adds various metaheuristics to escape local optima. Botros et al. (2015) introduce their implementation of three exchange-based algorithms. Martin et al. (1998) and Müller and Schütze (2015)[2] use trigrams within the exchange algorithm. Clark (2003) adds an orthotactic bias.[3]

The previous algorithms use an unlexicalized (two-sided) language model: P(w_i | w_{i-1}) = P(w_i | c_i) P(c_i | c_{i-1}), where the class c_i of the predicted word w_i is conditioned on the class c_{i-1} of the previous word w_{i-1}. Goodman (2001b) altered this model so that c_i is conditioned directly upon w_{i-1}, hence: P(w_i | w_{i-1}) = P(w_i | c_i) P(c_i | w_{i-1}). This new model fractionates the history more, but it allows for a large speedup in hypothesizing an exchange, since the history does not change. The resulting partially lexicalized (one-sided) class model gives the accompanying predictive exchange algorithm (Goodman, 2001b; Uszkoreit and Brants, 2008) a time complexity of O((B + |V|) × |C| × I), where B is the number of unique bigrams in the training set.[4] We introduce a set of improvements to this algorithm to enable high-quality large-scale word clusters.

3 BIRA Predictive Exchange

We developed a bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. The goal of BIRA is to produce better clusters by using multiple, changing models to escape local optima. This uses both forward and reversed bigram class models to improve cluster quality by evaluating log-likelihood on two different models. Unlike using trigrams, bidirectional bigram models only linearly increase time and memory requirements, and in fact some data structures can be shared. The two directions are interpolated to allow softer integration of these two models:

    P(w_i \mid w_{i-1}, w_{i+1}) := P(w_i \mid c_i) \cdot (\lambda P(c_i \mid w_{i-1}) + (1 - \lambda) P(c_i \mid w_{i+1}))    (1)

The interpolation weight λ for the forward direction alternates to 1 − λ every a iterations (i):

    \lambda_i := \begin{cases} 1 - \lambda_0 & \text{if } i \bmod a = 0 \\ \lambda_0 & \text{otherwise} \end{cases}    (2)

Figure 1 illustrates the benefit of this λ-inversion in helping to escape local minima, with lower training set perplexity when inverting λ every four iterations.

Figure 1: Training set perplexity using lambda inversion (+Rev), using 100M tokens of the Russian News Crawl (cf. §4.1). Here a = 4, λ_0 = 1, and |C| = 800.

The time complexity is O(2 × (B + |V|) × |C| × I). The original predictive exchange algorithm can be obtained by setting λ = 1 and a = 0.[5]

Another innovation, both in terms of cluster quality and speed, is cluster refinement. The vocabulary is initially clustered into G sets, where |G| ≪ |C|, typically 2–10. After a few iterations (i) of this, the full partitioning C_f is explored. Clustering into G converges very quickly, typically requiring no more than 3 iterations.[6]

    |C_i| := \begin{cases} |G| & \text{if } i \le 3 \\ |C_f| & \text{otherwise} \end{cases}    (3)

The intuition behind this is to group words first into broad classes, like nouns, verbs, adjectives, etc. In contrast to divisive hierarchical clustering and coarse-to-fine methods (Petrov, 2009), after the initial iterations the algorithm is still able to exchange any word to any cluster; there is no hard constraint that the more refined partitions be subsets of the initial coarser partitions. This gives more flexibility in optimizing on log-likelihood, especially given the noise that naturally arises from coarser clusterings. We explored cluster refinement over more stages than just two, successively increasing the number of clusters. We observed no improvement over the two-stage method described above.

Each BIRA component can be applied to any exchange-based clusterer. The contributions of each of these are shown in Figure 2, which reports the development set perplexities (PP) of all combinations of BIRA components over the original predictive exchange algorithm. The data and configurations are discussed in more detail in Section 4. The greatest PP reduction is due to using lambda inversion (+Rev), followed by cluster refinement (+Refine), then interpolating the bidirectional models (+BiDi), with robust improvements by using all three of these: an 18% reduction in perplexity over the predictive exchange algorithm. We have found that both lambda inversion and cluster refinement prevent early convergence at local optima, while bidirectional models give immediate and consistent training set PP improvements, but this is attenuated in a unidirectional evaluation.

Figure 2: Development set PP of combinations of improvements to predictive exchange (cf. §3), using 100M tokens of the Russian News Crawl, with 800 word classes.

We observed that most of the computation for the predictive exchange algorithm is spent on the logarithm function, calculating δ ← δ − N(w, c) · log N(w, c).[7] Since the codomain of N(w, c) is ℕ_0, and due to the power-law distribution of the algorithm's access to these entropy terms, we can precompute N · log N up to, say, 10e+7, with minimal memory requirements.[8] This results in a considerable speedup of around 40%.

Footnotes:
[1] https://github.com/moses-smt/mgiza
[2] http://cistern.cis.lmu.de/marlin
[3] http://bit.ly/1VJwZ7n
[4] Green et al. (2014) provide a free implementation of the original predictive exchange algorithm within the Phrasal MT system, at http://nlp.stanford.edu/phrasal. Another implementation is in the Cicada semiring MT system.
[5] The time complexity is O((B + |V|) × |C| × I) if λ = 1.
[6] The piecewise definition could alternatively be conditioned upon a percentage threshold of moved words.
[7] δ is the change in log-likelihood, and N(w, c) is the count of a given word followed by a given class.
[8] This was independently discovered in Botros et al. (2015).
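As a rough, illustrative sketch of the ingredients above, the following code mirrors the interpolation of Eq. (1), the λ-inversion schedule of Eq. (2), the two-stage refinement schedule of Eq. (3), and a precomputed n·log n table for the δ update; the names and the table size are our assumptions and are not taken from the released clustercat code.

```python
import math

def lambda_schedule(i, a, lambda0):
    """Eq. (2): invert the forward/backward interpolation weight every a iterations."""
    return 1.0 - lambda0 if a > 0 and i % a == 0 else lambda0

def classes_schedule(i, num_coarse, num_full, warmup=3):
    """Eq. (3): cluster into a handful of coarse classes first, then refine."""
    return num_coarse if i <= warmup else num_full

def p_bidi(w, w_prev, w_next, lam, p_w_given_c, p_c_given_prev, p_c_given_next, cls):
    """Eq. (1): P(w | w_prev, w_next) =
       P(w | c(w)) * (lam * P(c(w) | w_prev) + (1 - lam) * P(c(w) | w_next))."""
    c = cls[w]
    return p_w_given_c.get((w, c), 0.0) * (
        lam * p_c_given_prev.get((c, w_prev), 0.0)
        + (1.0 - lam) * p_c_given_next.get((c, w_next), 0.0))

# Lookup table for n * log(n): small counts dominate the delta computation
# delta -= N(w, c) * log N(w, c). The C implementation can afford a table of
# roughly 10^7 entries; 10^6 is used here to keep the sketch light.
_TABLE_SIZE = 1_000_000
_N_LOG_N = [0.0] + [n * math.log(n) for n in range(1, _TABLE_SIZE)]   # 0 * log 0 := 0

def n_log_n(n):
    return _N_LOG_N[n] if n < _TABLE_SIZE else n * math.log(n)
```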



4 Experiments

Our experiments consist of both intrinsic and extrinsic evaluations. The intrinsic evaluation measures the perplexity (PP) of two-sided class-based models for English and Russian, and the extrinsic evaluation measures BLEU scores of phrase-based MT of Russian↔English and Japanese↔English texts.

4.1 Class-based Language Model Evaluation

In this task we used 400, 800, and 1200 classes for English, and 800 classes for Russian. The data comes from the 2011–2013 News Crawl monolingual data of the WMT task.[9] For these experiments the data was deduplicated, shuffled, tokenized, digit-conflated, and lowercased. In order to have a large test set, one line per 100 of the resulting (shuffled) corpus was separated into the test set.[10] The minimum count threshold was set to 3 occurrences in the training set. Table 1 shows information on the resulting corpus.

    Corpus           Tokens   Types   Lines
    English Train    1B       2M      42M
    English Test     12M      197K    489K
    Russian Train    550M     2.7M    31M
    Russian Test     6M       284K    313K

Table 1: Monolingual training and test set sizes.

The clusterings are evaluated on the PP of an external 5-gram unidirectional two-sided class-based language model (LM). The n-gram-order interpolation weights are tuned using a distinct development set of comparable size and quality to the test set.

Table 2 and Figure 3 show perplexity results using a varying number of classes. Two-sided exchange gives the lowest perplexity across the board, although this is within a two-sided LM evaluation.

Footnotes:
[9] http://www.statmt.org/wmt15/translation-task.html
[10] The data setup script is at http://www.dfki.de/~jode03/naacl2016.sh.



Figure 3: 5-gram two-sided class-based LM perplexities for various clusterers on English News Crawl (T = 10^9), varying the number of classes.

We also evaluated clusters derived from word2vec (Mikolov et al., 2013) using various configurations[11], and all gave poor perplexities. BIRA gives better perplexities than both the original predictive exchange algorithm and Brown clusters.[12] The Russian experiments yielded higher perplexities for all clusterings, but otherwise the same comparative results.

    Training Set      2-Sided Ex.   BIRA    Brown   Pred. Ex.
    EN, |C| = 400        193.3      197.3   201.8     220.5
    EN, |C| = 800        155.0      158.1   160.2     178.3
    EN, |C| = 1200       138.4      140.4   141.5     157.6
    RU, |C| = 800        322.4      340.7   350.4     389.3

Table 2: 5-gram two-sided class-based LM perplexities.

In general Brown clusters give slightly worse results relative to exchange-based clusters, since Brown clustering requires an early, permanent placement of frequent words, with further restrictions imposed on the |C|-most frequent words (Liang, 2005).[13] Liang-style Brown clustering is only efficient on a small number of clusters, since there is a |C|^2 term in its time complexity.

The original predictive exchange algorithm has a more fractionated history than the two-sided exchange algorithm. Interestingly, increasing the number of clusters causes a convergence in the word clusterings themselves, while also causing a divergence in the time complexities of these two varieties of the exchange algorithm. The metaheuristic techniques employed by the two-sided clusterer mkcls can be applied to other exchange-based clusterers, including ours, for further improvements.

Table 3 presents wall clock times using the full training set, varying the number of word classes |C| (for English).[14] The predictive exchange-based clusterers (BIRA and Phrasal) exhibit slow increases in time as the number of classes increases, while the others (Brown and mkcls) are much more sensitive to |C|. Our BIRA-based clusterer is three times faster than Phrasal for all these sets.

    Training Set      mkcls   BIRA   Brown   Phrasal
    EN, |C| = 400      39.0    1.0     2.3     3.1
    EN, |C| = 800      48.8    1.4    12.5     5.1
    EN, |C| = 1200     68.8    1.7    25.5     6.2
    RU, |C| = 800      75.0    1.5    14.6     5.5

Table 3: Clustering times (hours) of full training sets. mkcls implements two-sided exchange, and Phrasal implements one-sided predictive exchange.

We performed an additional experiment, adding more English News Crawl training data.[15] Our implementation took 3.0 hours to cluster 2.5 billion training tokens with |C| = 800, using modest hardware.[14]

4.2 Machine Translation Evaluation

We also evaluated the BIRA predictive exchange algorithm extrinsically in machine translation. As discussed in Section 1, word clusters are employed in a variety of ways within machine translation systems, the most common of which is in word alignment, where mkcls is widely used. As training sets get larger every year, mkcls struggles to keep pace, and is a substantial time bottleneck in MT pipelines with large datasets.

Footnotes:
[11] Negative sampling & hierarchical softmax; CBOW & skip-gram; various window sizes; various dimensionalities.
[12] For the two-sided exchange we used mkcls; for the original predictive exchange we used Phrasal's clusterer; for Brown clustering we used Percy Liang's brown-cluster (329dc). All had min-count=3, and all but mkcls (which is not multithreaded) had threads=12, iterations=15.
[13] Recent work by Derczynski and Chester (2016) loosens some restrictions on Brown clustering.
[14] All time experiments used a 2.4 GHz Opteron 8378 featuring 16 threads.
[15] Adding years 2008–2010 and 2014 to the existing training data. This training set was too large for the external class-based LM to fit into memory, so no perplexity evaluation of this clustering was possible.



We used data from the Workshop on Machine Translation 2015 (WMT15) Russian↔English dataset and the Workshop on Asian Translation 2014 (WAT14) Japanese↔English dataset (Nakazawa et al., 2014). Both pairs used standard configurations: truecasing, MeCab segmentation for Japanese, MGIZA alignment, grow-diag-final-and phrase extraction, phrase-based Moses, quantized KenLM 5-gram modified Kneser-Ney LMs, and MERT tuning.

The BLEU score differences between using mkcls and our BIRA implementation are small, but there are a few statistically significant changes, using bootstrap resampling (Koehn, 2004). Table 4 presents the BLEU score changes across varying cluster sizes (*: p-value < 0.05, **: p-value < 0.01). MERT tuning is quite erratic, and some of the BLEU differences could be affected by noise in the tuning process in obtaining quality weight values. Using our BIRA implementation reduces the translation model training time with 500 clusters from 20 hours using mkcls (of which 60% of the time is spent on clustering) to just 8 hours (of which 5% is spent on clustering).

    |C|     EN-RU            RU-EN             EN-JA            JA-EN
    10      20.8 → 20.9*     26.2 → 26.0       23.5 → 23.4      16.9 → 16.8
    50      21.0 → 21.2*     25.9 → 25.7       24.0 → 23.7*     16.9 → 16.9
    100     20.4 → 21.1      25.9 → 25.8       23.8 → 23.5      16.9 → 17.0
    200     21.0 → 20.8      25.8 → 25.9       23.8 → 23.4      17.0 → 16.8
    500     20.9 → 20.9      25.8 → 25.9*      24.0 → 23.8      16.8 → 17.1*
    1000    20.9 → 21.1      25.9 → 26.0**     23.6 → 23.5      16.9 → 17.1

Table 4: BLEU scores (mkcls → BIRA) and significance across cluster sizes (|C|).

5 Conclusion

We have presented improvements to the predictive exchange algorithm that address longstanding drawbacks of the original algorithm compared to other clustering algorithms, enabling new directions in using large-scale, high cluster-size word classes in NLP.

Botros et al. (2015) found that the one-sided model of the predictive exchange algorithm produces better results for training LSTM-based language models compared to two-sided models, while two-sided models generally give better perplexity in class-based LM experiments. Our paper shows that BIRA-based predictive exchange clusters are competitive with two-sided clusters even in a two-sided evaluation. They also give better perplexity than the original predictive exchange algorithm and Brown clustering.

The software is freely available at https://github.com/jonsafari/clustercat.

Acknowledgements

We would like to thank Hermann Ney and Kazuki Irie, as well as the reviewers for their useful comments. This work was supported by the QT21 project (Horizon 2020 No. 645452).

References

Rami Botros, Kazuki Irie, Martin Sundermeyer, and Hermann Ney. 2015. On Efficient Training of Word Classes and their Application to Recurrent Neural Network Language Models. In Proceedings of INTERSPEECH-2015, pages 1443–1447, Dresden, Germany.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Marie Candito and Djamé Seddah. 2010. Parsing Word Clusters. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 76–84, Los Angeles, CA, USA.

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into Morphologically Rich Languages with Synthetic Phrases. In Proceedings of EMNLP, pages 1677–1687, Seattle, WA, USA.

Colin Cherry. 2013. Improved Reordering for Phrase-Based Translation using Sparse Features. In Proceedings of NAACL-HLT, pages 22–31, Atlanta, GA, USA.

Alexander Clark. 2003. Combining Distributional and Morphological Information for Part of Speech Induction. In Proceedings of EACL, pages 59–66.

Leon Derczynski and Sean Chester. 2016. Generalised Brown Clustering and Roll-up Feature Generation. In Proceedings of AAAI, Phoenix, AZ, USA.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014. Investigating the Usefulness of Generalized Word Representations in SMT. In Proceedings of Coling, pages 421–432, Dublin, Ireland.



Joshua Goodman. 2001a. Classes for Fast Maximum Entropy Training. In Proceedings of ICASSP, pages 561–564.

Joshua T. Goodman. 2001b. A Bit of Progress in Language Modeling, Extended Version. Technical Report MSR-TR-2001-72, Microsoft Research.

Spence Green, Daniel Cer, and Christopher Manning. 2014. An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation. In Proceedings of WMT, pages 466–476, Baltimore, MD, USA.

Reinhard Kneser and Hermann Ney. 1993. Improved clustering techniques for class-based statistical language modelling. In Proceedings of EUROSPEECH'93, pages 973–976, Berlin, Germany.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of EMNLP-CoNLL, pages 868–876, Prague, Czech Republic.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388–395.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A Dependency Parser for Tweets. In Proceedings of EMNLP, pages 1001–1012, Doha, Qatar.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL: HLT, pages 595–603, Columbus, OH, USA.

Percy Liang. 2005. Semi-Supervised Learning for Natural Language. Master's thesis, MIT.

Sven Martin, Jörg Liermann, and Hermann Ney. 1998. Algorithms for Bigram and Trigram Word Clustering. Speech Communication, 24(1):19–37.

Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of HLT-NAACL, pages 337–342, Boston, MA, USA.

Andriy Mnih and Geoffrey Hinton. 2009. A Scalable Hierarchical Distributed Language Model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in NIPS-21, volume 21, pages 1081–1088.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of AISTATS, volume 5, pages 246–252.

Thomas Müller and Hinrich Schütze. 2015. Robust Morphological Tagging with Word Representations. In Proceedings of NAACL, pages 526–536, Denver, CO, USA.

Toshiaki Nakazawa, Hideya Mino, Isao Goto, Sadao Kurohashi, and Eiichiro Sumita. 2014. Overview of the first Workshop on Asian Translation. In Proceedings of the Workshop on Asian Translation (WAT).

Franz Josef Och and Hermann Ney. 2000. A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of Coling, pages 1086–1090, Saarbrücken, Germany.

Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Bachelor's thesis (Studienarbeit), Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

Slav Petrov. 2009. Coarse-to-Fine Natural Language Processing. Ph.D. thesis, University of California at Berkeley, Berkeley, CA, USA.

Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of CoNLL, pages 147–155, Boulder, CO, USA.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of EMNLP 2011, pages 1524–1534, Edinburgh, Scotland.

Attapol Rutherford and Nianwen Xue. 2014. Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns. In Proceedings of EACL, pages 645–654, Gothenburg, Sweden.

Sara Stymne. 2012. Clustered Word Classes for Preordering in Statistical Machine Translation. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 28–34, Avignon, France.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure. In Proceedings of NAACL:HLT, pages 477–487, Montréal, Canada.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of ACL, pages 384–394, Uppsala, Sweden.

Jakob Uszkoreit and Thorsten Brants. 2008. Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation. In Proceedings of ACL: HLT, pages 755–762, Columbus, OH, USA.

Joern Wuebker, Stephan Peitz, Felix Rietig, and Hermann Ney. 2013. Improving Statistical Machine Translation with Word Class Models. In Proceedings of EMNLP, pages 1377–1381, Seattle, WA, USA.

Andreas Zollmann and Stephan Vogel. 2011. A Word-Class Approach to Labeling PSCFG Rules for Machine Translation. In Proceedings of ACL-HLT, pages 1–11, Portland, OR, USA.

1174
