This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452.

Deliverable D1.4 Semantics in Shallow Models

Ondřej Bojar (CUNI), Rico Sennrich (UEDIN), Philip Williams (UEDIN), Khalil Sima’an (UvA), Stella Frank (UvA), Inguna Skadiņa (Tilde), Daiga Deksne (Tilde)

Dissemination Level: Public

31st January, 2018

Grant agreement no.: 645452
Project acronym: QT21
Project full title: Quality Translation 21
Type of action: Research and Innovation Action
Coordinator: Prof. Josef van Genabith (DFKI)
Start date, duration: 1st February, 2015, 36 months
Dissemination level: Public
Contractual date of delivery: 31st July, 2016
Actual date of delivery: 31st July, 2016
Deliverable number: D1.4
Deliverable title: Semantics in Shallow Models
Type: Report
Status and version: Final (Version 1.0)
Number of pages: 107
Contributing partners: CUNI, UEDIN, RWTH, UvA, Tilde
WP leader: CUNI
Author(s): Ondřej Bojar (CUNI), Rico Sennrich (UEDIN), Philip Williams (UEDIN), Khalil Sima’an (UvA), Stella Frank (UvA), Inguna Skadiņa (Tilde), Daiga Deksne (Tilde)
EC project officer: Susan Fraser

The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v. (TAUS), Netherlands
• text & form GmbH (TAF), Germany
• TILDE SIA (TILDE), Latvia
• Hong Kong University of Science and Technology (HKUST), Hong Kong

For copies of reports, updates on project activities and other QT21-related information, contact:

Prof. Stephan Busemann, DFKI GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
[email protected]
Phone: +49 (681) 85775 5286
Fax: +49 (681) 85775 5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.qt21.eu/

© 2018, The Individual Authors

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Contents

1 Executive Summary 4

2 Improved Modelling of Target Grammar in SMT 5
2.1 Free Word-Order and Predicting Alternative Target Word Orders with Hierarchical Models 5
2.2 Dependency Language Models 5
2.3 Unification-based Constraints 6
2.4 Modeling Selectional Preferences 6

3 Richer Source Annotation in SMT 6
3.1 Using Verb Patterns in PBMT 7
3.2 Discriminative Models with Context Information for PBMT 7
3.3 Source Annotation in Syntax-based Models 7
3.4 Sampling Phrase Tables for the Moses Statistical Machine Translation System 8
3.5 Reordering Constraints in PBMT 8

References 8

Appendices 12

Appendix A Examining the Relationship between Preordering and Word Order Freedom in Machine Translation 12

Appendix B A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation 25

Appendix C Edinburgh’s Syntax-Based Systems at WMT 2015 32

Appendix D Edinburgh’s Statistical Machine Translation Systems for WMT16 43

Appendix E Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation 55

Appendix F Using Verb Patterns in PBMT 66

Appendix G Target-Side Context for Discriminative Models in Statistical Machine Translation 75

Appendix H CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten 86

Appendix I Sampling Phrase Tables for the Moses Statistical Machine Translation System 92

Appendix J Reordering constraints for English-Latvian SMT 104
J.1 SMT System and Tools 104
J.2 Constraints on Main Constituents 104
J.3 Constraints on Specific Phrases 104
J.3.1 Noun Phrases 104
J.3.2 Prepositional Phrases 105
J.3.3 Verb Phrases 105
J.3.4 Other Phrases 105
J.3.5 Results of Reordering Constraints in SMT 105


1 Executive Summary

This deliverable reports all the results achieved by the consortium during the whole 36 months of the project falling under Task 1.2 as specified in the project proposal:

Task 1.2 Semantics in Shallow Models [M01–M36] (CUNI, UEDIN, RWTH, UvA) In this task, syntactically oriented approaches will be experimented with. Shortcomings of current sequential models will be addressed by exploring string-to-tree and SCFG models. In addition, richer source-side annotation will be considered. We will also experiment with methods for automatic identification of systematic phrase table errors within the PBMT framework.

As described in the introduction to Deliverable 1.2 (Semantic Translation Models), a major paradigm shift towards neural MT happened over the course of the project, setting new baselines and somewhat postponing the benefit of explicit linguistically-motivated syntactic descriptions of the sentence over these baselines. Since the detailed description of this task mentions pre-neural approaches to MT explicitly (synchronous context-free grammars, SCFG, and phrase-based SMT, PBMT), we decided to reserve this deliverable for work on non-neural models. Because attention shifted quickly once the difference in translation quality became apparent, the majority of this work happened in the first half of the project.

The first part of this deliverable (Section 2) addresses the first of the topics in Task 1.2, namely the handling of syntax in shallow models and in string-to-tree models. In Section 2.1, UvA investigates the use of pre-orderings in the more challenging free word-order setting, showing where the shortcomings of current SCFG-based approaches arise and proposing a potential solution. In string-to-tree models, which incorporate syntactic annotation on the target side, UEDIN aimed to make better use of this syntactic information. In Section 2.2, UEDIN improves the grammaticality of dependency-syntax models by developing a dependency representation of German compounds and particle verbs (the German counterpart of English phrasal verbs, where the separable prefix gets moved to the end of the clause or sentence). In Section 2.3, in collaboration with CUNI, UEDIN develops agreement constraints for Czech to try to ensure that grammatical features, including syntactic case, are expressed consistently. In Section 2.4, UEDIN introduces a model of semantic affinity between verbal predicates and nominal argument heads.

The second part of this deliverable (Section 3) then comprises works that allow the SMT system to benefit from a richer annotation of the source sentence. With respect to phrase-based models, disambiguation of words in the source sentence has been shown to improve performance. In Section 3.1, CUNI experiments with disambiguation of verb senses using verb patterns. A shortcoming of this approach is that it can only use a limited window of source context. We proceed with Section 3.2, where CUNI incorporated discriminative models into PBMT, allowing arbitrary additional information from the source, and in part from the target, to be taken into consideration.

Where a language pair has fixed word order, syntax is particularly informative for determining the semantics of a sentence and we would like to exploit this source of information. In Section 3.3, UEDIN investigates the use of source-side syntactic annotation in tree-to-string models. This is particularly relevant for the common scenario of translation from English.
In Section 3.4, UEDIN extends support for sampling phrase tables in the Moses toolkit, demonstrating that this method can match the performance of the conventional method of building large static phrase tables up-front, while expanding the potential for more rapid experimentation with rich features. Finally, in Section 3.5, Tilde experiments with reordering constraints by applying them to the different parts of the sentence, as well as to several categories of phrases. For English-Latvian SMT, putting reordering limits on top-level phrases did not show improvements in terms of BLEU score, while limiting reordering for specific phrases resulted in an increase of 0.5 BLEU.


2 Improved Modelling of Target Grammar in SMT

This part of the deliverable deals with SMT models that focus on target syntactic structure using various technical means. In Section 2.1, grammaticality of the target is improved by pre-ordering the source. In Section 2.2, a dedicated dependency language model is designed. In the two following sections (Sections 2.3 and 2.4), unification constraints and selectional preferences for syntax-based SMT are studied.

2.1 Free Word-Order and Predicting Alternative Target Word Orders with Hierarchical Models

UvA studied the impact of the degree of word-order freedom, in a morphologically-rich target language, on translation performance for hierarchical pre-ordering models. UvA explored two kinds of hierarchical pre-ordering models: (1) an unsupervised hierarchical SCFG-based model working over Permutation Trees obtained by factorising word alignments and using state-splitting to produce a latent SCFG [17] and (2) a supervised neural pre-ordering model that estimates the probabilities of swaps for pairs of nodes in a source dependency tree [3].

Pre-ordering models typically pass a single permutation from the source side to the back-end translation system. But this is suboptimal, particularly for target languages with freer word order where multiple input orders are acceptable. This is empirically demonstrated in experiments with English-German (freer word order) and English-Japanese (stricter word order). For languages with relatively free word order, such as German, predicting a lattice of possible pre-orderings is more suitable than predicting a unique word order. UvA also presented a method for training the back-end MT system specifically for input that consists of a lattice of pre-orderings. Experiments show that lattices in conjunction with this training scheme are crucial for enabling good empirical performance when translating into the relatively free word order language, German (+0.67 BLEU over the first-best version of the same model), and even for fixed word order languages such as Japanese, the lattice representation can provide additional improvements over a first-best model (+0.36 BLEU). This work is described in Daiber et al. [2], which can be found in Appendix A.
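To make the training scheme concrete: when the back-end system is trained on lattice input, one permutation per training sentence must be selected. The sketch below illustrates one plausible selection criterion, picking the k-best permutation with the fewest pairwise order disagreements (Kendall's tau distance) against the alignment-derived target order. This is an editorial illustration only; the exact matching procedure ("lattice silver training") is described in Appendix A.

```python
from itertools import combinations

def kendall_tau_distance(perm_a, perm_b):
    """Count index pairs that are ordered differently in the two permutations."""
    position = {item: idx for idx, item in enumerate(perm_b)}
    return sum(
        1
        for i, j in combinations(range(len(perm_a)), 2)
        if position[perm_a[i]] > position[perm_a[j]]
    )

def select_training_permutation(kbest_permutations, target_order):
    """Pick the predicted permutation closest to the target-order
    (alignment-derived) permutation of the source sentence."""
    return min(
        kbest_permutations,
        key=lambda perm: kendall_tau_distance(perm, target_order),
    )

# Example: token indices 0..3 of a 4-token source sentence.
kbest = [[0, 1, 2, 3], [0, 2, 1, 3], [0, 2, 3, 1]]
print(select_training_permutation(kbest, [0, 2, 3, 1]))  # -> [0, 2, 3, 1]
```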

2.2 Dependency Language Models

To improve the grammaticality of string-to-tree SMT systems with dependency syntax, Sennrich [15] proposed a relational dependency language model which models the probability of a sentence given a labelled dependency tree, and of the labelled tree itself. UEDIN have continued this line of research within the QT21 project and published several extensions to the model.

In Sennrich and Haddow [16], the authors proposed a joint dependency model of morphological and syntactic structure. When translating between two languages that differ in their degree of morphological synthesis, syntactic structures in one language may be realised as morphological structures in the other, and SMT models need a mechanism to learn such translations. Prior work has used morpheme splitting with flat representations. However, this fails to encode the hierarchical structure between morphemes, which is relevant for learning morphosyntactic constraints and selectional preferences. Sennrich and Haddow [16] present a dependency representation of German compounds and particle verbs that results in improvements in translation quality of 1.4–1.8 BLEU.

An early version of the joint dependency language model was used in the UEDIN submission to the 2015 WMT shared translation task for English-German. At the time, the implementation included the new method of binarisation and compound representation, but did not yet include particle verb restructuring. This system was jointly ranked first out of 16 by human judges (it was tied with one other system). Sennrich and Haddow [16] can be found in Appendix B. A description of the WMT system is given in Section 3.3 of Williams et al. [24], which can be found in Appendix C.

The work was developed jointly within WP1 and WP2 due to its applicability to both the problems of semantics in shallow syntactic models and morphological representations. It is therefore also reported in deliverable D2.3.
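As a rough, editorial illustration of the relational dependency language model idea: each word is scored given its head word and the dependency relation connecting them. The toy sketch below is not Sennrich's actual model (which also scores the tree structure itself and conditions on richer syntactic context); it only shows the core of syntactic n-gram scoring.

```python
import math
from collections import defaultdict

class ToyRelationalDepLM:
    """Scores each token given its head and dependency label,
    with add-one smoothing. A minimal sketch only."""

    def __init__(self):
        self.context_counts = defaultdict(int)  # (head, label)
        self.event_counts = defaultdict(int)    # (head, label, dependent)
        self.vocabulary = set()

    def train(self, trees):
        # Each tree is a list of (dependent, label, head) arcs.
        for tree in trees:
            for dependent, label, head in tree:
                self.context_counts[(head, label)] += 1
                self.event_counts[(head, label, dependent)] += 1
                self.vocabulary.add(dependent)

    def log_prob(self, tree):
        log_p = 0.0
        vocab_size = len(self.vocabulary)
        for dependent, label, head in tree:
            numerator = self.event_counts[(head, label, dependent)] + 1
            denominator = self.context_counts[(head, label)] + vocab_size
            log_p += math.log(numerator / denominator)
        return log_p
```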

2.3 Unification-based Constraints

Williams and Koehn [23] and Williams [22] proposed unification-based constraints as a means of improving morphosyntactic coherence in target languages with rich inflectional morphology, with a focus on translation into German. As part of QT21, UEDIN and CUNI collaborated to develop this further with constraints adapted for translation into Czech.

The UEDIN/CUNI system used Treex¹ to preprocess and parse the Czech side of the training data. Treex uses the MST parser [12], which produces dependency graphs with non-projective arcs. After projectivisation, dependency trees were converted to CFG trees, enabling the extraction of SCFG translation rules. With this string-to-tree system in place, UEDIN/CUNI experimented with unification-based agreement and case government constraints. Specifically, the constraints were designed to enforce: i) case, gender, and number agreement between nouns and pre-nominal adjectival modifiers; ii) number and person agreement between subjects and verbs; iii) case agreement between prepositions and nouns; iv) use of nominative case for subject nouns.

In preliminary experiments, small but consistent gains in BLEU were obtained. Previous analysis for German has shown that BLEU lacks sensitivity to grammatical improvements when compared to human evaluators, and so CUNI carried out a small manual analysis of the submitted system with and without unification-based constraints. While the use of hard constraints sometimes forced the system to select a worse translation, in the majority of cases the constraints led to better translations. This work is described in Williams et al. [25], which can be found in Appendix D. The work was developed jointly within WP1 and WP2 due to its applicability to both the problems of semantics in shallow syntactic models and morphological agreement. It is therefore also reported in deliverable D2.3.
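For illustration, agreement checking by unification reduces to merging feature structures and failing on a clash. The sketch below is a simplification with flat feature dictionaries; the actual mechanism of Williams [22] attaches feature structures to SCFG rules and unifies them during decoding.

```python
def unify(features_a, features_b):
    """Merge two flat feature structures; return None on a clash."""
    merged = dict(features_a)
    for feature, value in features_b.items():
        if feature in merged and merged[feature] != value:
            return None  # clash, e.g. case mismatch between adjective and noun
        merged[feature] = value
    return merged

# Czech noun phrase: a pre-nominal adjective must agree with its noun
# in case, gender and number (the values here are illustrative).
adjective = {"case": "nom", "gender": "fem", "number": "sg"}
noun = {"case": "nom", "gender": "fem", "number": "sg"}
assert unify(adjective, noun) == {"case": "nom", "gender": "fem", "number": "sg"}
assert unify(adjective, {"case": "acc"}) is None  # constraint violation
```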

2.4 Modeling Selectional Preferences

Phrase-based and syntax-based approaches to MT both make overly-strong independence assumptions that can impact the semantic integrity of translation. In particular, predicates and their arguments are frequently translated in isolation, with only an n-gram language model to ensure that the choices are coherent. UEDIN introduced a model of semantic affinity between verbal predicates and nominal argument heads. The model assigns selectional preference scores to triples of the form (syntactic relation, predicate, argument). Since string-to-tree syntax-based models generate parse trees on the target side, they provide a natural mechanism for identifying predicate-argument relations and applying such a model during decoding, and that is the approach taken here.

While the selectional preference model did not improve translation quality (according to standard automatic evaluation metrics), the analysis carried out as part of this work shed light on some wider issues that deserve attention in syntactic MT, including errors in verb translation and tree structure. This work (jointly funded by HimL) is described in Nadejde et al. [13], which can be found in Appendix E.
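One common way to score such triples is pointwise mutual information between predicate and argument within a given relation, estimated from parsed monolingual data. The snippet below is a hedged sketch of that idea, with a made-up class name and toy data; the scoring actually used by Nadejde et al. is described in Appendix E.

```python
import math
from collections import Counter

class SelectionalPreferenceScores:
    """PMI-style scores for (relation, predicate, argument) triples."""

    def __init__(self, triples):
        self.joint = Counter(triples)
        self.predicate = Counter((r, p) for r, p, a in triples)
        self.argument = Counter((r, a) for r, p, a in triples)
        self.relation = Counter(r for r, p, a in triples)

    def score(self, relation, predicate, argument):
        joint = self.joint[(relation, predicate, argument)]
        if joint == 0:
            return float("-inf")  # unseen triple
        total = self.relation[relation]
        return math.log(
            joint * total
            / (self.predicate[(relation, predicate)]
               * self.argument[(relation, argument)])
        )

# e.g. "drink" should prefer drinkable direct objects:
triples = ([("dobj", "drink", "tea")] * 5
           + [("dobj", "drive", "car")] * 5
           + [("dobj", "drink", "car")])
model = SelectionalPreferenceScores(triples)
assert model.score("dobj", "drink", "tea") > model.score("dobj", "drink", "car")
```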

3 Richer Source Annotation in SMT

This part comprises our work aimed at improving the semantic correctness of the translation by utilising richer annotation of the source sentence.

¹ http://ufal.mff.cuni.cz/treex


In Section 3.1, this additional information explicitly indicates verb sense. In Section 3.2, a more general discriminative model is designed and applied for translation into Czech, German, Polish and Romanian, benefitting from source English tags, dependency edge labels and also the lemma of the parent node. Both of the above methods were tested in PBMT. In Section 3.3, the study is complemented by explicit structural annotation of the source with a tree-to-string model. The final two sections return to PBMT. Section 3.4 extends prior work on sampling phrase tables, showing that they can achieve state-of-the-art performance, and Section 3.5 uses the source syntactic annotation to constrain allowable reorderings to improve the translation quality.

3.1 Using Verb Patterns in PBMT

Semantic ambiguity is a pervasive phenomenon in natural language. Since meaning does not map to words consistently and in the same way across different languages, semantic ambiguity provides additional challenges for MT. Within QT21, CUNI developed methods of disambiguating the meaning of verbs in source input sentences for phrase-based MT. The approach employs the notion of "verb patterns" from Vincent Kríž and Martin Holub, based on the work of Hanks [10], and also valency frames from the EngVallex valency lexicon [1], built for the Prague Czech-English Dependency Treebank (PCEDT) [9].

Experiments include enhancement of Moses phrase tables with an additional factor representing verb senses that were automatically identified in the source-side input. In general, improvements do not show up in automatic evaluation, except for a significant improvement in one of the setups with valency frames. In contrast, however, manual evaluation shows that the additional factor does in fact improve the translation of verbs and verb phrases. This work is described in Sudarikov et al. [18], reproduced as Appendix F.
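In Moses' factored setup, each source token carries additional factors separated by "|". The snippet below sketches how a verb-sense factor might be attached to the input; the sense labels and the helper function are made-up placeholders for illustration, and the actual pattern inventory is described in Appendix F.

```python
def attach_sense_factors(tokens, senses):
    """Render a sentence in Moses' factored input format: surface|sense."""
    return " ".join(f"{token}|{sense}" for token, sense in zip(tokens, senses))

line = attach_sense_factors(
    ["he", "filed", "a", "complaint"],
    ["-", "file_pattern_3", "-", "-"],  # hypothetical verb-pattern label
)
print(line)  # he|- filed|file_pattern_3 a|- complaint|-
```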

3.2 Discriminative Models with Context Information for PBMT

Phrase-based systems struggle when information required to disambiguate the meaning of a word or phrase is not within the immediate surrounding context. One approach that we explore in QT21 to address this particular challenge is to employ discriminative models that assign word/phrase probabilities based on wider source-side context information. CUNI implemented such a discriminative model within the Moses toolkit and used it to include various types of rich source annotation in the decision process. The verb patterns described in Section 3.1 are another piece of information that could benefit from discriminative modelling, although these experiments are yet to be carried out.

In Tamchyna et al. [19], CUNI show that these discriminative models can be trained efficiently on large volumes of data and that they are beneficial for multiple language pairs (English to German, Czech, Romanian and Polish), improving translation quality over a standard PBMT setup even when the training data is large. In Tamchyna et al. [20], CUNI applied these models in the WMT Shared Translation Task and confirmed that for English-Romanian, the model brings significant improvements over a competitive baseline. For the English-to-Czech system, the Chimera system (see Deliverable 1.2 for details) is more powerful and the discriminative model does not bring any additional improvement. One possible explanation for the lack of improvement is that Chimera already addresses the problems that the proposed discriminative framework aims to fix, due to its hybrid design comprising a phrase-based system combined with deep syntactic TectoMT transfer. Both papers [19, 20] can be found in Appendices G and H, respectively.
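To give a flavour of what "wider source-side context" means in a discriminative phrase model, the sketch below extracts simple window features around a source phrase. The feature templates are illustrative assumptions; the templates actually used (including target-side context) are detailed in Appendix G.

```python
def context_features(source_tokens, span, window=3):
    """Binary features describing a source phrase and its context.

    span is the half-open interval [start, end) of the phrase being
    translated; feature names are illustrative templates only.
    """
    start, end = span
    features = {"phrase=" + " ".join(source_tokens[start:end]): 1.0}
    for k in range(max(0, start - window), start):
        features[f"left_{start - k}={source_tokens[k]}"] = 1.0
    for k in range(end, min(len(source_tokens), end + window)):
        features[f"right_{k - end + 1}={source_tokens[k]}"] = 1.0
    return features

print(context_features("he filed a complaint yesterday".split(), (1, 2)))
```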

3.3 Source Annotation in Syntax-based Models

For fixed word-order languages (such as English or Chinese), syntax is particularly informative for determining the semantics of a sentence.

Annotating the input sentences with syntactic parse trees provides an additional source of information for translation. In order to facilitate experiments with source-side syntax, UEDIN implemented a tree-to-string syntax-based translation framework in Moses. Its translation grammar is a Synchronous Tree-Substitution Grammar (STSG; [4]) with parse-tree fragments on the source side and strings of terminals and non-terminals on the target side. As with string-to-tree models, the grammar is extracted from word-aligned parallel data using the GHKM algorithm [5, 6].

UEDIN developed a decoder for tree-to-string models that matches the source side of a grammar rule against an input parse tree. This decoder implementation is based on the rule matching algorithm by Zhang et al. [26], combined with language model (LM) integration via cube pruning. Both the tree-to-string decoder and the GHKM extractor for STSGs have been released as part of the open source Moses SMT toolkit.

The tree-to-string implementation was applied successfully in Edinburgh's syntax-based submission to the WMT 2015 shared translation task on multiple language pairs: English to Czech (ranked 6th out of 15 systems), English to Finnish (6th-8th out of 10), and English to Russian (7th out of 10). This work is described in Williams et al. [24], which can be found in Appendix C.
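The core of tree-to-string decoding is matching a rule's source-side fragment against the input parse tree. The recursive check below is a minimal editorial sketch (trees as (label, children) tuples; the NONTERMINALS set and the substitution-site convention are assumptions for illustration). The production algorithm of Zhang et al. [26] avoids this naive traversal by indexing the grammar.

```python
NONTERMINALS = {"S", "NP", "VP", "PP"}  # assumed label inventory

def matches(fragment, tree):
    """Does the STSG rule fragment match at the root of this subtree?

    Both arguments are (label, children) tuples. A fragment leaf whose
    label is a non-terminal is a substitution site and matches any
    subtree with that root label.
    """
    frag_label, frag_children = fragment
    tree_label, tree_children = tree
    if frag_label != tree_label:
        return False
    if not frag_children:
        # Substitution site (non-terminal leaf) or a matching terminal.
        return frag_label in NONTERMINALS or not tree_children
    if len(frag_children) != len(tree_children):
        return False
    return all(matches(f, t) for f, t in zip(frag_children, tree_children))

# Rule source side (VP (VBD gave) NP): the NP leaf is a substitution site.
rule_src = ("VP", [("VBD", [("gave", [])]), ("NP", [])])
tree = ("VP", [("VBD", [("gave", [])]),
               ("NP", [("DT", [("a", [])]), ("NN", [("book", [])])])])
assert matches(rule_src, tree)
```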

3.4 Sampling Phrase Tables for the Moses Statistical Machine Translation System

In order to support the work on phrase-based statistical MT early in the project, UEDIN extended support for sampling phrase tables in the Moses toolkit. A sampling phrase table is one that constructs phrase table entries on demand by sampling from a pre-indexed parallel corpus instead of building a large static phrase table up-front. Support for sampling phrase tables had recently been added by Germann [7] but was missing support for lexicalized reordering models, a crucial component of state-of-the-art phrase-based MT. UEDIN added this missing support and demonstrated that sampling phrase tables could match (or slightly outperform) static phrase tables in terms of translation speed and quality, while adding the benefits of: i) being much faster to build; ii) offering flexibility in the choice of feature functions used, since feature functions can be added or disabled without the need to re-run the entire phrase table construction pipeline; and iii) having a lower memory footprint.

It was originally planned that support for factors would be added, enabling efficient experimentation with syntactic and semantic features (since features could be rapidly added or dropped during experimentation in order to find effective combinations). However, with the switch in focus to neural MT, this direction was ultimately not pursued. This work was published in The Prague Bulletin of Mathematical Linguistics [8] and can be found in Appendix I.
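The sketch below illustrates the on-demand idea: given the corpus positions where a source phrase occurs, sample a bounded number of them and estimate translation probabilities from the sample. The function name and the extract_target callback are assumptions for illustration; the actual implementation indexes the corpus with suffix arrays (Germann [7, 8]).

```python
import random
from collections import Counter

def sampled_phrase_entry(occurrences, extract_target, max_samples=100, seed=0):
    """Build one phrase-table entry on demand.

    occurrences: corpus positions where the source phrase appears.
    extract_target: callback returning the aligned target phrase for a
    given occurrence (assumed to consult the word alignments).
    """
    rng = random.Random(seed)
    if len(occurrences) > max_samples:
        occurrences = rng.sample(occurrences, max_samples)
    counts = Counter(extract_target(occ) for occ in occurrences)
    total = sum(counts.values())
    return {target: count / total for target, count in counts.items()}
```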

3.5 Reordering Constraints in PBMT There is a potential problem with the computational cost of evaluating target-side context during decoding. Tilde contributed to the task by exploring how to use sentence structure to improve word reordering in phrase-based statistical MT for translation into morphologically rich languages (from English into Latvian). Two groups of experiments that explore syntactic structures of the source language together with phrase reordering constraints were performed and analysed:

• limiting reordering of top-level phrases;
• limiting reordering of specific phrase structures (e.g. different types of noun phrases, prepositional phrases, and complex verb phrases).

Putting reordering limits on top-level phrases did not show improvements in terms of BLEU score, while limiting reordering of specific phrase structures resulted in an increase of 0.5 BLEU. This work is described in Appendix J.
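As a point of reference, Moses can express such constraints directly in its input markup when run with XML input enabled: a <zone>...</zone> span may only be reordered internally, and a <wall/> blocks reordering across a boundary. The line below is an illustrative example only; the constraints Tilde actually applied are described in Appendix J.

```
he said <wall/> that <zone> the new safety regulation </zone> would apply
```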


References

[1] Silvie Cinková. From PropBank to EngValLex: Adapting the PropBank-Lexicon to the Valency Theory of the Functional Generative Description. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy, 2006.

[2] Joachim Daiber, Miloš Stanojević, Wilker Aziz, and Khalil Sima’an. Examining the relationship between preordering and word order freedom in machine translation. In Proceedings of the First Conference on Machine Translation, pages 118–130, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W16/W16-2213.

[3] Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. Fast and accurate preordering for SMT using neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Colorado, May–June 2015. URL http://www.aclweb.org/anthology/N15-1105.

[4] Jason Eisner. Learning non-isomorphic tree mappings for machine translation. In The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics, pages 205–208, Sapporo, Japan, July 2003. Association for Computational Linguistics.

[5] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What’s in a translation rule? In HLT-NAACL ’04, 2004.

[6] Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 961–968, Morristown, NJ, USA, 2006. Association for Computational Linguistics.

[7] Ulrich Germann. Dynamic Phrase Tables for Machine Translation in an Interactive Post-editing Scenario. In Proceedings of the Workshop on Interactive and Adaptive Machine Translation, pages 20–31, 2014.

[8] Ulrich Germann. Sampling phrase tables for the Moses statistical machine translation system. The Prague Bulletin of Mathematical Linguistics, (104):39–50, 2015.

[9] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of the Eighth International Language Resources and Evaluation Conference (LREC’12), pages 3153–3160, Istanbul, Turkey, May 2012. ELRA, European Language Resources Association. ISBN 978-2-9517408-7-7.

[10] Patrick Hanks. Norms and Exploitations: Corpus, Computing, and Cognition in Lexical Analysis. MIT Press, January 2013. ISBN 9780262018579.

[11] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.


[12] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005.

[13] Maria Nadejde, Alexandra Birch, and Philipp Koehn. Modeling selectional preferences of verbs and nouns in string-to-tree machine translation. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August 2016. Association for Computational Linguistics.

[14] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics, 2006.

[15] Rico Sennrich. Modelling and optimizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics, 3:169–182, 2015.

[16] Rico Sennrich and Barry Haddow. A joint dependency model of morphological and syntactic structure for statistical machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[17] Miloš Stanojević and Khalil Sima’an. Reordering grammar induction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 44–54, Lisbon, Portugal, September 2015. URL http://aclweb.org/anthology/D15-1005.

[18] Roman Sudarikov, Ondřej Bojar, Ondřej Dušek, Martin Holub, and Vincent Kríž. Verb sense disambiguation in machine translation. In Sixth Workshop on Hybrid Approaches to Translation (HyTra-6), pages 42–50, Stroudsburg, PA, USA, 2016. Association for Computational Linguistics. ISBN 978-4-87974-713-6.

[19] Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. Target-side context for discriminative models in statistical machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.

[20] Aleš Tamchyna, Roman Sudarikov, Ondřej Bojar, and Alexander Fraser. CUNI-LMU submissions in WMT2016: Chimera constrained and beaten. In Proceedings of the Conference on Machine Translation (WMT), Berlin, Germany, August 2016.

[21] Andrejs Vasiļjevs, Raivis Skadiņš, and Jörg Tiedemann. LetsMT!: A cloud-based platform for do-it-yourself machine translation. In Proceedings of the ACL 2012 System Demonstrations, pages 43–48. Association for Computational Linguistics, 2012.

[22] Philip Williams. Unification-based Constraints for Statistical Machine Translation. PhD thesis, University of Edinburgh, 2014.

[23] Philip Williams and Philipp Koehn. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 217–226, Edinburgh, Scotland, July 2011. Association for Computational Linguistics.

[24] Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, and Philipp Koehn. Edinburgh’s syntax-based systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 199–209, Lisbon, Portugal, September 2015. Association for Computational Linguistics.


[25] Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. Edinburgh’s statistical machine translation systems for WMT16. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August 2016. Association for Computational Linguistics.

[26] Hui Zhang, Min Zhang, Haizhou Li, and Chew Lim Tan. Fast translation rule matching for syntax-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1037–1045, Singapore, August 2009. Association for Computational Linguistics.


A Examining the Relationship between Preordering and Word Order Freedom in Machine Translation


Joachim Daiber, Miloš Stanojević, Wilker Aziz, Khalil Sima’an
Institute for Logic, Language and Computation (ILLC), University of Amsterdam
{initial.last}@uva.nl

Abstract

We study the relationship between word order freedom and preordering in statistical machine translation. To assess word order freedom, we first introduce a novel entropy measure which quantifies how difficult it is to predict word order given a source sentence and its syntactic analysis. We then address preordering for two target languages at the far ends of the word order freedom spectrum, German and Japanese, and argue that for languages with more word order freedom, attempting to predict a unique word order given source clues only is less justified. Subsequently, we examine lattices of n-best word order predictions as a unified representation for languages from across this broad spectrum and present an effective solution to a resulting technical issue, namely how to select a suitable source word order from the lattice during training. Our experiments show that lattices are crucial for good empirical performance for languages with freer word order (English–German) and can provide additional improvements for fixed word order languages (English–Japanese).

1 Introduction

Word order differences between a source and a target language are a major challenge for machine translation systems. For phrase-based models, the number of possible phrase permutations is so large that reordering must be constrained locally to make the search space for the best hypothesis feasible. However, constraining the space locally runs the risk that the optimal hypothesis is rendered out of reach. Preordering of the source sentence has been embraced as a way to ensure the reachability of certain target word order constellations for improved prediction of the target word order. Preordering aims at predicting a permutation of the source sentence which has minimal word order differences with the target sentence; the permuted source sentence is passed on to a backend translation system trained to translate target-order source sentences into target sentences. In essence, the preordering approach makes the assumption that it is feasible to predict target word order given only clues from the source sentence. In the vast majority of work on preordering, a single preordered source sentence is passed on to the backend system, thereby making the stronger assumption that it is feasible to predict a unique preferred target word order. But how reasonable are these assumptions and for which target languages?

Intuitively, the assumption of a unique preordering seems reasonable for translating into fixed word order languages such as Japanese, but for translation into languages with less strict word order such as German, this is unlikely to work. In such languages there are multiple comparably plausible target word orders per source sentence because the underlying predicate-argument structure can be expressed with mechanisms other than word order alone (e.g. morphological inflections or intonation). For these languages, it seems rather unlikely to be able to choose a unique word order given only source sentence clues. In this paper, we want to shed light on the relationship between the target language’s word order freedom and the feasibility of preordering. We start out by contributing an information-theoretic measure to quantify the difficulty in predicting a preferred word order given the source sentence and its syntax.



Our measure provides empirical support for the intuition that it is often not possible to predict a unique word order for free word order languages, whereas it is more feasible for fixed word order languages such as Japanese. Subsequently, we study the option of passing the n-best word order predictions, instead of 1-best, to the backend system as a lattice of possible word orders of the source sentence.

For the training of the backend system, the use of such permutation lattices raises a question: What should constitute the training corpus for a lattice-preordered translation system? In previous work using single word order predictions, the training data consists of pairs of source and target sentences where the source sentence is either in target order (i.e. order based on word alignments) or preordered (i.e. predicted order). In this work we contribute a novel approach for selecting training instances from the lattice of word order permutations: We select the permutation providing the best match with the target-order source sentence (we call this process “lattice silver training”).

Our experiments show that for English–Japanese and English–German lattice preordering has a positive impact on the translation quality. Whereas lattices enable further improvement for preordering English into the strict word order language Japanese, lattices in conjunction with our proposed lattice silver training scheme turn out to be crucial to reach satisfactory empirical performance for English–German. This result highlights that when predicting word order of free word order languages given source clues only, it is important to ensure that the word order predictions and the backend system are suitably fitted together.

2 Related Work

Preordering has been explored from the perspective of the upper-bound achievable translation quality in several studies, including Khalilov and Sima’an (2012) and Herrmann et al. (2013), which compare various systems and provide oracle scores for syntax-based preordering models. Target-order source sentences, in which the word order is determined via automatic alignments, enable translation systems great jumps in translation quality and provide improvements in compactness and efficiency of downstream phrase-based translation models. Approaches have largely followed two directions: (1) predicting word order based on some form of source-syntactic representation and (2) approaches which do not depend on source syntax.

2.1 Source Syntax-Based Preordering

Many approaches to preordering have made use of syntactic representations of the source sentence, including Collins et al. (2005) who restructure the source phrase structure parse tree by applying a sequence of transformation rules. More recently, Jehl et al. (2014) learn to order sibling nodes in the source-side dependency parse tree. The space of possible permutations is explored via depth-first branch-and-bound search (Balas and Toth, 1983). In later work, the authors further improve this model by replacing the logistic regression classifier with a feed-forward neural network (de Gispert et al., 2015), which results in improved empirical results and eliminates the need for feature engineering. Lerner and Petrov (2013) train classifiers to predict the permutations of up to 6 tree nodes in the source dependency tree. The authors found that by only predicting the best 20 permutations of n nodes, they could cover a large majority of the reorderings in their data.

2.2 Preordering without Source Syntax

Tromble and Eisner (2009) learn to predict the orientation of any two words (straight or inverted order) using a perceptron. The search for the best reordering is performed with an O(n³) chart parsing algorithm. More basic approaches to syntax-less preordering include the application of multiple MT systems (Costa-jussà and Fonollosa, 2006), where a first system learns preordering and a second learns to translate the preordered sentence into the target sentence. Finally, there have been successful attempts at the automatic induction of parse trees from aligned data (DeNero and Uszkoreit, 2011) and the estimation of latent reordering grammars (Stanojević and Sima’an, 2015) based on permutation trees (Zhang and Gildea, 2007).

2.3 Lattice Translation

A lattice is an acyclic finite-state automaton defining a finite language. A more restricted class of lattices, namely confusion networks (Bertoldi et al., 2007), has been extensively used to pack alternative input sequences for decoding.¹ However, applications mostly focused on speech translation (Ney, 1999; Bertoldi et al., 2007), or to account for lexical and/or segmentation ambiguity due to preprocessing (Xu et al., 2005; Dyer, 2007).

¹ A confusion network is a special case of a lattice where every path from start to final state goes through every node.



On very few occasions, lattice input has been used to determine the space of permutations of the input considered by the decoder (Knight and Al-Onaizan, 1998; Kumar and Byrne, 2003). The effectiveness of lattices of permutations was demonstrated by Zhang et al. (2007). However, except in the cases of n-gram based decoders (Khalilov et al., 2009) this approach is not a common practice.

Dyer et al. (2008) formalized lattice translation both for phrase-based and hierarchical phrase-based MT. The former requires a modification of the standard phrase-based decoding algorithm so as to maintain a coverage vector over states, rather than input word positions. The latter requires intersecting a lattice and a context-free grammar, which can be seen as a generalized form of parsing (Klein and Manning, 2001). In this work, we focus on phrase-based models.

The space of translation options in standard phrase-based decoding with a distortion limit d grows with O(stack size × n × 2^d), where n represents the input length, and the number of translation options is capped due to beam search (Koehn et al., 2003). With lattice input, the dependency on n is replaced by |Q|, where Q is the set of states of the lattice. The stack size makes the number of translation options explored by the decoder independent of the number of transitions in the lattice.

As in standard decoding, the states of a lattice can also be visited non-monotonically. However, two states in a lattice are not always connected by a path, and, in general, paths connecting two nodes might differ in length. Dyer et al. (2008) proposed to pick the shortest path between two nodes to be representative of the distance between them.² Just like in standard decoding, a distortion limit is imposed to keep the space of translations tractable.

In this work, we use lattice input to constrain the space of permutations of the source allowed within the decoder. Moreover, in most cases we completely disable the decoder’s further reordering capabilities. Because our models can perform global permutation operations without ad hoc distortion limits, we can reach far more complex word orders. Crucially, our models are better predictors of word order than standard distortion-based reordering, thus we manage to decode with relatively small permutation lattices.

3 Quantifying Word Order Freedom

While varying degrees of word order freedom are a well-studied topic in linguistics, word order freedom has only recently been studied from a quantitative perspective. This has been enabled partly by the increasing availability of syntactic treebanks. Kuboň and Lopatková (2015) propose a measure of word order freedom based on a set of six common word order types (SVO, SOV, etc.). Futrell et al. (2015) define various entropy measures based on the prediction of word order given unordered dependency trees. Both approaches require a dependency treebank for each language.

In practical applications such as machine translation, it is difficult to quantify the influence of word order freedom. For an arbitrary language pair, our goal is to quantify a notion of the target language’s word order freedom based only on parallel sentences and source syntax. In their head direction entropy measure, Futrell et al. (2015) approach the problem of quantifying word order freedom by measuring the difficulty of recovering the correct linear order from a sentence’s unordered dependency tree. We approach the problem of quantifying a target language’s word order freedom by measuring the difficulty of predicting target word order based on the source sentence’s dependency tree. Hence, we ask questions such as: How difficult is it to predict French word order based on the syntax of the English source sentence?

3.1 Source Syntax and Target Word Order

We represent the target sentence’s word order as a sequence of order decisions. Each order decision encodes for two source words, a and b, whether their translation equivalents are in the order (a, b) or (b, a). The source sentences are parsed with a dependency parser.³ The target-language order of the words in the source dependency tree is then determined by comparing the target sentence positions of the words aligned to each source word. Figure 1 shows the percentage of dependent-head pairs in the source dependency tree whose target order can be correctly guessed by always choosing the more common decision.⁴

² This is achieved by running an all-pairs shortest path algorithm prior to decoding; see for example Chapter 25 of (Cormen et al., 2001). MOSES uses the Floyd-Warshall algorithm, which runs in time O(|Q|³).
³ http://cs.cmu.edu/~ark/TurboParser/
⁴ For English–Japanese, we use manual word alignments of 1,235 sentences from the Kyoto Free Translation Task (Neubig, 2011) and for English–German, we use a manually word-aligned subset of Europarl (Padó and Lapata, 2006) consisting of 987 sentences.



[Figure 1: Source word pairs whose target order can be predicted using only the words’ labels.]

(a) English–Japanese
  Head = verb: noun 79.7% (Sbj 63.3%, Obj 82.6%, Adv 84.3%); adv 54.5%
  Head = noun: adj 79.5%; det 78.6%; prep 49.3%

(b) English–German
  Head = verb: noun 44.7% (Sbj 66.3%, Obj 22.7%, Adv 35.4%); adv 41.3%
  Head = noun: adj 76.7%; det 92.9%; prep 62.2%

German and Japanese. Both language pairs differ significantly in how strictly the target language’s word order is determined by the source language’s syntax. English–German shows strict order constraints within phrases, such as that adjectives and determiners precede the noun they modify in the vast majority of cases (Figure 1b). However, English–German also shows more freedom on the clause level, where basic syntax-based predictions for the positions of nouns relative to the main verb are insufficient. For English–Japanese on the other hand, the position of the nouns relative to the main verb is more rigid, which is demonstrated by the high scores in Figure 1a. These results are in line with the linguistic descriptions of both target languages. From a technical point of view, they highlight that any treatment of English–German word order must take into account information beyond the basic syntactic level and must allow for a given amount of word order freedom.

3.2 Bilingual Head Direction Entropy

While such a qualitative comparison provides insight into the order differences of selected language pairs, it is not straight-forward to compare across many language pairs. From a linguistic perspective, Futrell et al. (2015) use entropy to compare word order freedom in dependency corpora across various languages. While the authors observed that artifacts of the data such as treebank annotation style can hamper comparability, they found that a simple entropy measure for the prediction of word order based on the dependency structure provided a good quantitative measure of word order freedom.

We follow Futrell et al. (2015) in basing our measure on conditional entropy, which provides a straight-forward way to quantify to which extent target word order is determined by source syntax:

H(Y|X) = −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)

Conditional entropy measures the amount of information required to describe the outcome of a random variable Y given the value of a second random variable X. Given a dependent-head pair in the source dependency tree, X consists of the dependent’s and the head’s part of speech, as well as the dependency relation between them. Note that as in all of our experiments the source language is English, the space of outcomes of X is the same across all language pairs. Y in this case is the word pair’s target-side word order in the form of a (a, b) or (b, a) decision. We estimate H(Y|X) using the bootstrap estimator of DeDeo et al. (2013), which is less prone to sample bias than maximum likelihood estimation.⁵

Influence of word alignments. Futrell et al. (2015) use human-annotated dependency trees for each language they consider. Our estimation only involves word-aligned bilingual sentence pairs with a source dependency tree. Manual alignments are available for a limited number of language pairs and often only for a diminishingly small number of sentences. Consequently the question arises whether automatic word alignments are sufficient for this task. To answer this question, we apply our measure to a set of manually aligned as well as a larger set of automatically aligned sentence pairs. In addition to the German and Japanese alignments mentioned above, we use manual alignments for English–Italian (Farajian et al., 2014), English–French (Och and Ney, 2003), English–Spanish (Graça et al., 2008) and English–Portuguese (Graça et al., 2008).

⁵ We observe an average of 1,033 values for X per language pair and perform 10,000 Monte-Carlo samples.
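[Editorial note: for readers who want to experiment with the measure, a maximum likelihood version of the estimate is only a few lines of code. This is a sketch under that simplifying assumption; the paper itself uses the bootstrap estimator of DeDeo et al. (2013) instead.]

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """MLE estimate of H(Y|X) in bits from (x, y) observations.

    Here x would be a (dependent POS, head POS, relation) triple and
    y an order decision, (a, b) or (b, a).
    """
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    n = len(pairs)
    entropy = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_y_given_x = count / marginal_x[x]
        entropy -= p_xy * math.log2(p_y_given_x)
    return entropy
```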



Since a limited number of manually aligned sentences are available, it is important to avoid bias due to sample size. Hence, we randomly sample the same number of dependency relations from each language pair. Considering only those languages for which we have both manual and automatic alignments, we can determine how well their word order freedom rankings correlate. Even though the number of samples for the manually aligned sentences is limited to 500 due to the size of the smallest set of manual alignments, we find a high correlation of Spearman’s ρ = 0.77 between the rankings of the 6 languages that occur in both sets (Zwillinger and Kokoska, 1999).

Influence of source syntax. Another factor that may influence our estimated degree of word order freedom is the form and granularity of the source side’s syntactic representation: more detailed representations may disambiguate cases that are difficult to predict with a more bare representation. As we are interested in the bilingual case and, specifically, in preordering, we content ourselves with using the same syntactic representation, i.e. dependency trees, that many preordering models use (e.g., Jehl et al. (2014), Lerner and Petrov (2013)).

Comparison to monolingual measures. Our measure is similar to Futrell et al. (2015)’s head direction entropy; however, it also offers several advantages. While monolingual head direction entropy requires a dependency treebank for each language, our bilingual head direction entropy only requires dependency annotation for the source language (English in our case). One of their caveats, the influence of the widely varying dependency annotation styles across treebanks, is also not present in our method, since a single dependency style is used for the source language. We have demonstrated that automatic alignments perform on a comparable level to manual alignments. Accordingly, the amount of data that can be used to estimate the measure is only limited by the availability of parallel sentences. Finally, while dependency treebanks rarely cover the same corpora or even domains, our method can utilize sentences from the same or similar corpora for each language, thus minimizing potential corpus biases.

Translation from English. Figure 2 plots bilingual head direction entropy for an English source side and a set of typologically diverse languages on the target side. For each language pair, we use 18,000 sentence pairs and automatic alignments from the Tatoeba corpus (Tiedemann, 2012).⁶ Languages at the top of the plot in Figure 2 show a greater degree of word order freedom with respect to the English source syntax. Thus, predicting their word order from English source clues alone is likely to be difficult. We argue that in such cases it is crucial to pass on the ambiguity over the space of predictions to the translation model. By doing so, word order decisions can be influenced by translation decisions, while still shaping the space of reachable translations.

[Figure 2: Bilingual head direction entropy with English source side. Languages from top (most entropy) to bottom: Berber, Turkish, Hebrew, German, Russian, French, Spanish, Italian, Portuguese, Esperanto, Japanese, Mandarin; x-axis from 0 to 0.6.]

⁶ The alignments were produced using GIZA++ (Och and Ney, 2003) with grow-diag-final-and symmetrization.

4 Preordering Free and Fixed Word Order Languages

The measure of word order freedom introduced in the previous section enables us to estimate how difficult it is to predict the target language’s word order based on the source language. In this section, we introduce the two preordering models we use to predict the word order of German and Japanese. Experiments with these models will allow us to examine the relationship between preordering and word order freedom.

4.1 Neural Lattice Preordering



Based on their earlier work, which used logistic regression and graph search for preordering (Jehl et al., 2014), de Gispert et al. (2015) introduce a neural preordering model. In this model, a feed-forward neural network is trained to estimate the swap probabilities of nodes in the source-side dependency tree. Search is performed via the depth-first branch-and-bound algorithm. The authors have found this model to be fast and to produce high quality word order predictions for a variety of languages.

Training instances generated in this manner are then used to estimate the swap probability p(i, j) for two indexes i and j. For each node in the source dependency tree, the best possible permutation of its children (including the head) is determined via graph search. The score of a permutation π is defined as follows:

score(π) = ∏_{1 ≤ i < j ≤ |π|, π[i] > π[j]} p(i, j) · ∏_{1 ≤ i < j ≤ |π|, π[i] < π[j]} (1 − p(i, j))

During search, the score of a partial permutation π′ extended by one element i can be computed efficiently as

score(π′ · ⟨i⟩) = score(π′) · ∏_{j ∈ V, i > j} p(i, j) · ∏_{j ∈ V, i < j} (1 − p(i, j))

where V denotes the set of indices not yet added to the partial permutation. The search can also retrieve k-best results while keeping the same guarantees and computational complexity. Only minor changes are necessary to adapt the search for the best permutation to finding the k-best permutations: we keep a set best_k of the best permutations and a single bound. If for a permutation π′, score(π′) > bound, instead of updating the bound to the single best permutation and remembering it, the following steps are performed:

1. If |best_k| = k: remove the worst permutation from the set.
2. Add π′ to best_k.
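[Editorial sketch of the incremental scoring rule above; the variable names are ours, not the paper's.]

```python
def extend_score(score, new_index, uncovered, swap_prob):
    """Extend a partial permutation's score with new_index, following
    score(pi + [i]) = score(pi) * prod_{j in V, i > j} p(i, j)
                                * prod_{j in V, i < j} (1 - p(i, j)),
    where V (uncovered) holds the indices not yet placed and
    swap_prob(i, j) returns p(i, j)."""
    for j in uncovered:
        if new_index > j:
            score *= swap_prob(new_index, j)
        elif new_index < j:
            score *= 1.0 - swap_prob(new_index, j)
    return score
```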



The sentence and its permutation are observed during training. The exact PET that generated this permutation is not observed and there could be (exponentially) many PETs that could have generated the observed permutation. Hence, the bracketings of potential PETs are treated as latent variables.

The second source of latent variables is state splitting of non-terminals (labels that indicate how to reorder the children) in a similar way as done in monolingual parsing (Matsuzaki et al., 2005; Petrov et al., 2006; Prescher, 2005). Each latent permutation tree has many latent derivations and the generative probabilistic model needs to account for them. The probability of the observed permutation π is defined in the following way:

P(π) = ∑_{∆ ∈ PEF(π)} ∑_{d ∈ ∆} ∏_{r ∈ d} P(r)

where PEF(π) returns the Permutation Forest of π (i.e., the set of PETs that can generate the permutation π), ∆ represents a permutation tree, d represents a derivation of a permutation tree and r represents a production rule. Efficient estimation for this model is done by using the standard Inside-Outside algorithm (Lari and Young, 1990).

At test time, the source sentence is parsed with the estimated grammar in order to find the derivation of a permutation tree with the lowest expected cost; more formally, the decoding task is to find, for the input sentence, the permutation-tree derivation that minimises the expected cost.

5 Machine Translation with Permutation Lattices

5.1 Permutation Lattices

We call a permutation lattice for sentence s = ⟨s_1, ..., s_n⟩ an acyclic finite-state automaton where every path from the initial state reaches an accepting state in exactly n uniquely labeled transitions. Transitions are labeled with pairs (i, s_i) for i = 1, ..., n, and each path represents an arbitrary permutation of the source’s n tokens.

[Figure 3: Example permutation lattice. (a) Linear form. (b) Minimized lattice.]

In a permutation lattice with states Q and transitions E, every path between any two states u, v ∈ Q has exactly the same length. Let out*(x) denote the transitive closure of x ∈ Q, that is, the set of states reachable from x. If two nodes are at all connected, v ∈ out*(u), then the distance between them equals d_v − d_u, where d_x is x’s distance from the initial state. This observation allows a speed up of non-monotone translation of a permutation lattice. Namely, to precompute shortest distances,

dˆ= arg min P (d′) cost(d, d′) necessary to impose a distortion limit, instead of d Chart(s) running a fully fledged all-pairs shortest path al- ∈ d′ Chart(s) ∈ ∑ gorithm O( Q 3) (Cormen et al., 2001), we can | | where P (d) = P (r) is the probability of a compute transitive closure in time O( Q E ) r d | | × | | derivation, and Chart(∈ s) is the space of all pos- (Simon, 1988) followed by single-source distance ∏ sible derivations of all possible permutation trees in time O( Q + E ) (Mohri, 2002). | | | | for source sentence s. Two main modifications to We produce permutation lattices by compress- this formula are made in order to make inference ing the n-best outputs from the reordering mod- fast: First, Kendall τ is used as a cost function be- els into a minimal deterministic acceptor. Un- cause it decomposes well,8 which allows usage of weighted determinization and minimization are efficient dynamic programming minimum Bayes- performed using OpenFST (Allauzen et al., 2007). risk (MBR) computation (DeNero et al., 2009). The results of this process are very compact rep- Second, instead of computing the MBR deriva- resentations that can be decoded efficiently. As an tion over the full chart, computation is done over illustration, Figure 3 shows an English sentence 10,000 unbiased samples from the chart. To build from WMT newstest 2014 preordered for transla- the permutation lattice with this model we use tion into German before (3a) and after minimiza- the top n permutations which have the lowest ex- tion (3b).9 Table 1 shows the influence of the num- pected Kendall τ cost. ber of predicted permutations on the lattice sizes

8More precisely, we use the Kendall τ distance between 9Example sentence: The Kluser lights protect cyclists, as the permutations that are yields of the derivations. well as those travelling by bus and the residents of Bergle.

124

Page 18 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

for English–German. Permutation quality is mea- the cost function, we use n-gram overlap, as com- sured by Kendall τ distance to the gold permuta- monly used in string kernels (Lodhi et al., 2002): tion (best-out-of-n). 7

Lattice overlap(ˆsL′ , s′) = countˆs (c)  L′  n=2 c Cn Permutations Kendall τ States Transitions ∑ ∑∈ s′   Monotone 83.78 23 22 n where Cs denotes all candidate n-grams of length 5 84.69 24 52 ′ n in s′ and countˆ (c) denotes the number of oc- 10 85.23 33 69 sL′ 100 86.20 72 138 currences of n-gram c in ˆsL′ . Ties between permu- 1000 86.75 123 233 tations with the same overlap are broken using the permutations’ scores from the preordering model. Table 1: Permutations and lattice size (En–De). 6 Experiments

5.2 Lattice Silver Training 6.1 Experimental Setup While for first-best word order predictions, there In our translation experiments, we use the follow- are two straight-forward options for how to se- ing experimental setup, datasets and parameters. lect training instances for the MT system, it is less Translation system Translation experiments are clear how to do this in the case of permutation lat- performed with a phrase-based machine transla- tices. In standard preordering, the word order of tion system, a version of Moses (Koehn et al., the source sentence in the training set is commonly 2007) with extended lattice support.10 We use the determined by reordering the source sentence to basic Moses features and perform 15 iterations of minimize the number of crossing alignment links batch MIRA (Cherry and Foster, 2012). (we denote this as s′). Alternatively, the trained preordering model can be applied to the source English–Japanese Our experiments are per- side of the training set, which we call ˆs1′ . There formed on the NTCIR-8 Patent Translation is a trade-off between both methods: While s′ will (PATMT) Task. Tuning is performed on the generally produce more compact and less noisy NTCIR-7 dev sets, and translation is evaluated on phrase tables, it may include phrases that are not the test set from NTCIR-9. All data is tokenized reachable by the preordering model. The predicted (using the Moses tokenizer for English and KyTea order ˆs1′ , on the other hand, may be too constrained 5 for Japanese (Neubig et al., 2011)) and filtered to reach helpful hypotheses. For lattices, one op- for sentences between 4 and 50 words. As a base- tion would be to extract all possible phrases from line we use a translation system with distortion the lattice directly. Here, we consider a simpler al- limit 6 and a lexicalized reordering model (Galley ternative: Instead of selecting either the gold order and Manning, 2008). We use a 5-gram language s′ or the predicted order ˆs1′ , we select the order ˆs′ model estimated using lmplz (Heafield et al., 2013) which is closest to both the lattice predictions and on the target side of the parallel corpus. the gold order s . Since this order is a mix of the ′ English–German For translation into German, lattice predictions and the gold order, we call this we built a machine translation system based on the training scheme lattice silver training. WMT 2016 news translation data.11 The system is Let (s, t) be a training instance consisting of a trained on all available parallel data, consisting of source sentence s and a target sentence t and let 4.5m sentence pairs from Europarl (Koehn, 2005), s be the target-order source sentence obtained via ′ Common Crawl (Smith et al., 2013) and the News the word alignments. For each training instance, Commentary corpus. We removed all sentences we select the preordered source ˆs as follows: ′ longer than 80 words and tokenization and true- casing is performed using the standard Moses tok- ˆs′ = arg max overlap(ˆs′ , s′) L enizer and truecaser. We use a 5-gram Kneser-Ney ˆsL′ πk(s) ∈ language model, estimated using lmplz (Heafield where π (s) is the set of k-best permutations pre- k 10Made available at https://github.com/ dicted by the preordering model. Each ˆs π (s) L′ ∈ k wilkeraziz/mosesdecoder. represents a single path through the lattice. As 11http://statmt.org/wmt16/

125

Page 19 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

et al., 2013). The language model is trained on Translation Word order 189m sentences from the target sides of Europarl DL BLEU Kendall τ and News Commentary, as well as the News Crawl Baseline 6 21.76 54.75 2007-2015 corpora. Word alignment is performed 6 26.68 58.05 Oracle order using MGIZA (gdfa with 6, 6, 3 and 3 iterations 0 26.41 57.92 of IBM M1, HMM, IBM M3 and IBM M4). As First-best 6 21.21A 53.44 a baseline we use a translation system with dis- Lattice (silver) 0 21.88B 54.51

tortion limit 6 and a distortion-based reordering AStat. significant against baseline. BStat. significant against first-best. model. Tuning is performed on newstest 2014 and Table 2: Translation results English–German. we evaluate on newstest 2015.

Preordering models For German, we use the performs better even when translating monotoni- neural lattice preordering model introduced in cally with a distortion limit of 0. Section 4.1. The model is trained on the full par- allel training data (4.5m sentences) based on the Lattice silver training To examine the utility of automatic word alignments used by the translation the lattice silver training scheme, we train sys- system. Source dependency trees are produced by tems which differ only in the way the training TurboParser,12 which was trained on the English data is extracted. Table 3 shows that for English– version of HamleDT (Zeman et al., 2012) with German, lattice silver training is successful in content-head dependencies. For translation into bridging the gap between the preordering model Japanese, we train a Reordering Grammar model and the alignment-based target word order, both for 10 iterations of EM on a training set consisting for monotonic translation and when allowing the of 786k sentence pairs with automatic alignments. decoder to additionally reorder translations.

6.2 Translation Experiments Distortion limit We report lowercased BLEU (Papineni et al., 0 3 2002) and Kendall τ calculated from the force- Gold training 21.44 21.60 Lattice silver training 21.88 21.88 aligned hypothesis and reference. Statistical sig- nificance tests are performed for the translation Table 3: Lattice silver training (BLEU, En–De). scores using the bootstrap resampling method with p-value < 0.05 (Koehn, 2004). The standard pre- ordering systems (“first-best” in Table 2 and 4) use English–Japanese Results for translation into an additional lexicalized reordering model (MSD), Japanese are shown in Table 4. while the lattice systems use only lattice distor- Discussion Although preordering with a single tion. For training preordered translation models, permutation already works well for the strict word we recreate word alignments from the original order language Japanese, packing the word order MGIZA alignments and the permutation for En– ambiguity into a lattice allows the machine trans- De and re-align preordered and target sentences lation system to achieve even better translation for En–Ja using MGIZA.13 monotonically than allowing a distortion of 6 and English–German Translation results for trans- an additional lexicalized reordering model on top lation into German are shown in Table 2. For this language pair, we found standard pre- Translation Word order ordering to work poorly. This is despite the fact DL BLEU Kendall τ that the oracle order (i.e. the source words in Baseline 6 29.65 44.87 the test set are preordered according to the word 6 34.22 56.23 Oracle order alignments) shows significant potential. A lattice 0 30.55 53.98

packed with 1000 permutations on the other hand, A First-best 6 32.14 49.68 Lattice 0 32.50AB 50.79 12http://cs.cmu.edu/˜ark/TurboParser/ 13 Re-aligning the sentences with MGIZA generally im- AStat. significant against baseline. BStat. significant against first-best. proves results, which implies that we are likely underestimat- ing the results for En–De. Table 4: Translation results English–Japanese.

126

Page 20 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

of a single permutation. We noticed that lexical- ticular doing so with permutation lattices, can be ized reordering helped the first-best systems and an indispensable tool for dealing with word order hence report this stronger baseline. In principle, in machine translation. The experiments we per- lexicalized reordering can also be used with 0- formed in this paper confirm this previous finding distortion lattice translation, and we plan to inves- and we further build on it by introducing a new tigate this option in the future. Linguistic intuition method for training machine translation systems and the empirical results presented in Section 3 for lattice-preordered input, which we call lattice suggest that compared to Japanese, German shows silver training. Finally, we found that while lat- more word order freedom. Consequently, we as- tices are indeed helpful for English–Japanese, for sumed that a first-best preordering model would which standard preordering already works well, not perform well on the language pair English– they are crucial for translation into the freer word German, and indeed the results in Table 2 confirm order language German. this assumption. For both language pairs, translat- ing a lattice of predicted permutations outperforms Acknowledgements the baselines, thus reducing the gap between trans- We thank the three anonymous reviewers for their lation with predicted word order and oracle word constructive comments and suggestions. This order. However, permutation lattices turn out to be work received funding from EXPERT (EU FP7 the key to enabling any improvement at all for the Marie Curie ITN nr. 317471), NWO VICI grant language pair English–German in the context of nr. 277-89-002 (Khalil Sima’an), DatAptor project preordering. This language pair can benefit from STW grant nr. 12271 and QT21 project (H2020 nr. the improved interaction between word order and 645452). translation decisions. These findings go in tandem with our analysis in Section 3 (see Figures 1 and 2), particularly, the prediction of our information- References theoretic word order freedom metric that it should Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wo- be more difficult to determine German word or- jciech Skut, and Mehryar Mohri. 2007. OpenFst: A der from English clues. Our main focus in this general and efficient weighted finite-state transducer paper was on the language pairs English–German library. In Proceedings of the Ninth International Conference on Implementation and Application of and English–Japanese. Hence, while our results Automata, (CIAA 2007), volume 4783 of Lecture provide an empirical data point for the utility of Notes in Computer Science, pages 11–23. Springer. permutation lattices for free word order languages, http://www.openfst.org. we plan to provide further empirical support by Egon Balas and Paolo Toth. 1983. Branch and bound performing experiments with a broader range of methods for the traveling salesman problem. Tech- language pairs in future work. nical report, Carnegie-Mellon Univ. Pittsburgh PA Management Sciences Research Group. 7 Conclusion Yoshua Bengio, Rejean´ Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic lan- The world’s languages differ widely in how they guage model. J. Mach. Learn. Res., 3:1137–1155, express meaning, relying on indicators such as March. word order, intonation or morphological mark- Nicola Bertoldi, Richard Zens, and Marcello Federico. ings. Consequently, some languages exhibit 2007. 
Speech translation by confusion network stricter word order than others. Our goal in this decoding. In IEEE International Conference on paper was to examine the effect of word order Acoustics, Speech and Signal Processing, volume 4 of ICASSP ’07, pages 1297–1300, Honolulu, HI, freedom on machine translation and preordering. April. IEEE. We provided an empirical comparison of language pairs in terms of the difficulty of predicting the tar- Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Pro- get language’s word order based on the source lan- ceedings of the 2012 Conference of the North Amer- guage. Our metric’s predictions agree both with ican Chapter of the Association for Computational the intuition provided by linguistic theory and the Linguistics: Human Language Technologies, pages empirical support we present in the form of trans- 427–436, Montreal,´ Canada, June. lation experiments. We show that addressing un- Michael Collins, Philipp Koehn, and Ivona Kucerova. certainty in word order predictions, and in par- 2005. Clause restructuring for statistical machine

127

Page 21 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

translation. In Proceedings of the 43rd Annual (Depling 2015), pages 91–100, Uppsala, Sweden, Meeting of the Association for Computational Lin- August. Uppsala University, Uppsala, Sweden. guistics (ACL’05), pages 531–540, Ann Arbor, Michigan, June. Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, model. In Proceedings of the 2008 Conference on and Charles E. Leiserson. 2001. Introduction to Al- Empirical Methods in Natural Language Process- gorithms. McGraw-Hill Higher Education, 2nd edi- ing, pages 848–856, Honolulu, Hawaii, October. tion. Joao˜ Grac¸a, Joana Paulo Pardal, and Lu´ısa Coheur. Marta R. Costa-jussa` and Jose´ A. R. Fonollosa. 2006. 2008. Building a golden collection of parallel multi- Statistical machine reordering. In Proceedings of language word alignments. the 2006 Conference on Empirical Methods in Nat- ural Language Processing, pages 70–76, Sydney, Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Australia, July. Clark, and Philipp Koehn. 2013. Scalable modi- Adria` de Gispert, Gonzalo Iglesias, and Bill Byrne. fied Kneser-Ney language model estimation. In Pro- 2015. Fast and accurate preordering for SMT us- ceedings of the 51st Annual Meeting of the Associa- ing neural networks. In Proceedings of the 2015 tion for Computational Linguistics, pages 690–696, Conference of the North American Chapter of the Sofia, Bulgaria, August. Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Teresa Herrmann, Jochen Weiner, Jan Niehues, and Colorado, May–June. Alex Waibel. 2013. Analyzing the potential of source sentence reordering in statistical machine Simon DeDeo, Robert X. D. Hawkins, Sara Klin- translation. In Proceedings of the International genstein, and Tim Hitchcock. 2013. Bootstrap Workshop on Spoken Language Translation (IWSLT methods for the empirical study of decision-making 2013). and information flows in social systems. Entropy, 15(6):2246–2276. Laura Jehl, Adria` de Gispert, Mark Hopkins, and Bill Byrne. 2014. Source-side preordering for transla- John DeNero and Jakob Uszkoreit. 2011. Inducing tion using logistic regression and depth-first branch- sentence structure from parallel corpora for reorder- and-bound search. In Proceedings of the 14th Con- ing. In Proceedings of the 2011 Conference on Em- ference of the European Chapter of the Associa- pirical Methods in Natural Language Processing, tion for Computational Linguistics, pages 239–248, pages 193–203, Edinburgh, Scotland, UK., July. Gothenburg, Sweden, April.

John DeNero, David Chiang, and Kevin Knight. 2009. Maxim Khalilov and Khalil Sima’an. 2012. Statistical Fast consensus decoding over translation forests. In translation after source reordering: Oracles, context- Proceedings of the Joint Conference of the 47th An- aware models, and empirical analysis. Natural Lan- nual Meeting of the ACL and the 4th International guage Engineering, 18:491–519, 10. Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages Maxim Khalilov, Jose´ A. R. Fonollosa, and Mark Dras. 567–575, Stroudsburg, PA, USA. 2009. Coupling hierarchical word reordering and decoding in phrase-based statistical machine transla- Christopher Dyer, Smaranda Muresan, and Philip tion. In Proceedings of the Third Workshop on Syn- Resnik. 2008. Generalizing word lattice translation. tax and Structure in Statistical Translation, SSST In Proceedings of ACL-08: HLT, pages 1012–1020, ’09, pages 78–86, Stroudsburg, PA, USA. Columbus, Ohio, June. Christopher J. Dyer. 2007. The “noisier chan- Dan Klein and Christopher D. Manning. 2001. Parsing nel”: Translation from morphologically complex and hypergraphs. In Seventh International Work- languages. In Proceedings of the Second Workshop shop on Parsing Technologies (IWPT- 2001), Octo- on Statistical Machine Translation, pages 207–211, ber. Prague, Czech Republic, June. Kevin Knight and Yaser Al-Onaizan. 1998. Transla- M. Amin Farajian, Nicola Bertoldi, and Marcello Fed- tion with finite-state devices. In Proceedings of the erico. 2014. Online word alignment for online Association for Machine Translation in the Ameri- adaptive machine translation. In Proceedings of the cas, AMTA, pages 421–437, Langhorne, PA, USA. EACL 2014 Workshop on Humans and Computer- assisted Translation, pages 84–92, Gothenburg, Philipp Koehn, Franz Josef Och, and Daniel Marcu. Sweden, April. 2003. Statistical phrase-based translation. In Pro- ceedings of the 2003 Conference of the North Amer- Richard Futrell, Kyle Mahowald, and Edward Gibson. ican Chapter of the Association for Computational 2015. Quantifying word order freedom in depen- Linguistics on Human Language Technology - Vol- dency corpora. In Proceedings of the Third In- ume 1, NAACL ’03, pages 48–54, Stroudsburg, PA, ternational Conference on Dependency Linguistics USA.

128

Page 22 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Graham Neubig, Yosuke Nakata, and Shinsuke Mori. Callison-Burch, Marcello Federico, Nicola Bertoldi, 2011. Pointwise prediction for robust, adaptable Brooke Cowan, Wade Shen, Christine Moran, japanese morphological analysis. In Proceedings of Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra the 49th Annual Meeting of the Association for Com- Constantin, and Evan Herbst. 2007. Moses: Open putational Linguistics: Human Language Technolo- source toolkit for statistical machine translation. In gies, pages 529–533, Portland, Oregon, USA, June. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Graham Neubig. 2011. The Kyoto free translation ACL ’07, pages 177–180, Stroudsburg, PA, USA. task. http://www.phontron.com/kftt.

Philipp Koehn. 2004. Statistical significance tests for Hermann Ney. 1999. Speech translation: coupling of machine translation evaluation. In Proceedings of recognition and translation. In IEEE International the 2004 Conference on Empirical Methods in Nat- Conference on Acoustics, Speech, and Signal Pro- ural Language Processing, pages 388–395. cessing, volume 1, pages 517–520, Phoenix, AZ, March. IEEE. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings Franz Josef Och and Hermann Ney. 2003. A sys- of Machine Translation Summit X, volume 5, pages tematic comparison of various statistical alignment 79–86. models. Computational Linguistics, 29(1):19–51.

Vladislav Kubonˇ and Marketa´ Lopatkova.´ 2015. Free Sebastian Pado´ and Mirella Lapata. 2006. Optimal or fixed word order: What can treebanks reveal? constituent alignment with edge covers for seman- In Jakub Yaghob, editor, ITAT 2015: Information tic projection. In Proceedings of the 21st Interna- Technologies Applications and Theory, Proceedings tional Conference on Computational Linguistics and of the 15th conference ITAT 2015, volume 1422 of 44th Annual Meeting of the Association for Com- CEUR Workshop Proceedings, pages 23–29, Praha, putational Linguistics, pages 1161–1168, Sydney, Czechia. Charles University in Prague, CreateSpace Australia, July. Independent Publishing Platform. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Shankar Kumar and William Byrne. 2003. A weighted Jing Zhu. 2002. BLEU: a method for automatic finite state transducer implementation of the align- evaluation of machine translation. In Proceedings ment template model for statistical machine trans- of the 40th annual meeting on association for com- lation. In Proceedings of the 2003 Conference putational linguistics, pages 311–318. of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 63–70, Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Stroudsburg, PA, USA. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of K. Lari and S. J. Young. 1990. The estimation of the 21st International Conference on Computational stochastic context-free grammars using the inside- Linguistics and the 44th Annual Meeting of the As- outside algorithm. Computer Speech and Language, sociation for Computational Linguistics, ACL-44, 4:35–56. pages 433–440, Stroudsburg, PA, USA.

Uri Lerner and Slav Petrov. 2013. Source-side classi- Detlef Prescher. 2005. Inducing head-driven pcfgs fier preordering for machine translation. In Proceed- with latent heads: Refining a tree-bank grammar for ings of the 2013 Conference on Empirical Methods parsing. In In ECML05. in Natural Language Processing, pages 513–523, Seattle, Washington, USA, October. K. Simon. 1988. An improved algorithm for transitive closure on acyclic digraphs. Theor. Comput. Sci., Huma Lodhi, Craig Saunders, John Shawe-Taylor, 58(1-3):325–346, June. Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. J. Mach. Learn. Jason R. Smith, Herve Saint-Amand, Magdalena Pla- Res., 2:419–444, March. mada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. text from the common crawl. In Proceedings of the 2005. Probabilistic CFG with latent annotations. In 51st Annual Meeting of the Association for Compu- Proceedings of the 43rd Annual Meeting of the As- tational Linguistics (Volume 1: Long Papers), pages sociation for Computational Linguistics (ACL’05), 1374–1383, Sofia, Bulgaria, August. pages 75–82, Ann Arbor, Michigan, June. Milosˇ Stanojevic´ and Khalil Sima’an. 2015. Reorder- Mehryar Mohri. 2002. Semiring frameworks and ing grammar induction. In Proceedings of the 2015 algorithms for shortest-distance problems. Jour- Conference on Empirical Methods in Natural Lan- nal of Automata, Languages and Combinatorics, guage Processing, pages 44–54, Lisbon, Portugal, 7(3):321–350, January. September.

129

Page 23 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

Jorg¨ Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Nicoletta Calzolari (Con- ference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mar- iani, Jan Odijk, and Stelios Piperidis, editors, Pro- ceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Is- tanbul, Turkey, may. Roy Tromble and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceed- ings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1007–1016, Singapore, August.

Edo S van der Poort, Marek Libura, Gerard Sierksma, and Jack A.A van der Veen. 1999. Solving the k- best traveling salesman problem. Computers & Op- erations Research, 26(4):409 – 425.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational linguistics, 23(3):377–403. Jia Xu, Evgeny Matusov, Richard Zens, and Hermann Ney. 2005. Integrated chinese word segmentation in statistical machine translation. In International Workshop on Spoken Language Translation, Pitts- burgh. Daniel Zeman, David Marecek,ˇ Martin Popel, Loganathan Ramasamy, Jan Stˇ epˇ anek,´ Zdenekˇ Zabokrtskˇ y,´ and Jan Hajic.ˇ 2012. Hamledt: To parse or not to parse? In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, may. European Language Resources Asso- ciation (ELRA). Hao Zhang and Daniel Gildea. 2007. Factorization of synchronous context-free grammars in linear time. In NAACL Workshop on Syntax and Structure in Sta- tistical Translation (SSST), pages 25–32. Yuqi Zhang, Richard Zens, and Hermann Ney. 2007. Chunk-level reordering of source language sen- tences with automatically learned rules for statistical machine translation. In Proceedings of the NAACL- HLT 2007/AMTA Workshop on Syntax and Struc- ture in Statistical Translation, SSST ’07, pages 1–8, Stroudsburg, PA, USA. Daniel Zwillinger and Stephen Kokoska. 1999. CRC Standard Probability and Statistics Tables and For- mulae. CRC Press.

130

Page 24 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

B A Joint Dependency Model of Morphological and Syntactic Struc- ture for Statistical Machine Translation

A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation

Rico Sennrich and Barry Haddow School of Informatics, University of Edinburgh [email protected], [email protected]

Abstract function/postion English/German example he quickly finite (main) walks away er geht schnell weg When translating between two languages [...] because he quickly finite (sub.) walks away that differ in their degree of morpholog- [...] weil er schnell weggeht he can quickly ical synthesis, syntactic structures in one bare infinitive walk away er kann schnell weggehen language may be realized as morphologi- he promises quickly to/zu-infinitive to walk away cal structures in the other, and SMT mod- er verspricht, schnell wegzugehen els need a mechanism to learn such trans- lations. Prior work has used morpheme Table 1: Surface realizations of particle verb splitting with flat representations that do weggehen ’walk away’. not encode the hierarchical structure be- tween morphemes, but this structure is rel- they charge a carry-on bag fee. evant for learning morphosyntactic con- straints and selectional preferences. We In example 1, agreement in case, number and propose to model syntactic and morpho- gender is enforced between eine ’a’ and Gebühr logical structure jointly in a dependency ’fee’, and selectional preference between erheben translation model, allowing the system ’charge’ and Gebühr ’fee’. A flat representation, to generalize to the level of morphemes. as is common in phrase-based SMT, does not en- We present a dependency representation code these relationships, but a dependency repre- of German compounds and particle verbs sentation does so through dependency links. that results in improvements in transla- In this paper, we investigate a dependency rep- tion quality of 1.4–1.8 BLEU in the WMT resentation of morphologically segmented words English–German translation task. for SMT. Our representation encodes syntactic and morphological structure jointly, allowing a single 1 Introduction model to learn the translation of both. Specifi- When translating between two languages that dif- cally, we work with a string-to-tree model with fer in their degree of morphological synthesis, GHKM-style rules (Galley et al., 2006), and a syntactic structures in one language may be re- relational dependency language model (Sennrich, alized as morphological structures in the other. 2015). We focus on the representation of German Machine Translation models that treat words as syntax and morphology in an English-to-German atomic units have poor learning capabilities for system, and two morphologically complex word such translation units, and morphological segmen- classes in German that are challenging for transla- tations are commonly used (Koehn and Knight, tion, compounds and particle verbs. 2003). Like words in a sentence, the morphemes German makes heavy use of compounding, and of a word have a hierarchical structure that is rel- compounds such as Abwasserbehandlungsanlage evant in translation. For instance, compounds in ‘waste water treatment plant’ are translated into Germanic languages are head-final, and the head is complex noun phrases in other languages, such as the segment that determines agreement within the French station d’épuration des eaux résiduaires. noun phrase, and is relevant for selectional prefer- German particle verbs are difficult to model be- ences of verbs. cause their surface realization differs depending on the finiteness of the verb and the type of clause. 1. sie erheben eine Hand|gepäck|gebühr. Verb particles are separated from the finite verb in

2081 Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal, 17-21 September 2015. c 2015 Association for Computational Linguistics.

Page 25 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

root main clauses, but prefixed to the verb in subordi- obja det nated clauses, or when the verb is non-finite. The subj infinitive marker zu ’to’, which is normally a pre- sie erheben eine Handgepäckgebühr modifying particle, appears as an infix in particle PPER VVFIN ART NN verbs. Table 1 shows an illustrating example. they charge a carry-on bag fee obja det 2 A Dependency Representation of root mod mod subj Compounds and Particle Verbs link link The main focus of research on compound split- sie erheben eine Hand  gepäck  gebühr ting has been on the splitting algorithm (Popovic PPER VVFIN ART SEG LN SEG LN SEG they charge a carry-on bag fee et al., 2006; Nießen and Ney, 2000; Weller et al., 2014; Macherey et al., 2011). Our focus is not the Figure 1: Original and proposed representation of splitting algorithm, but the representation of com- German compound. pounds. For splitting, we use an approach simi- obji lar to (Fritzinger and Fraser, 2010), with segmen- root comma

tation candidates identified by a finite-state mor- subj adv phology (Schmid et al., 2004; Sennrich and Kunz, 2014), and statistical evidence from the training er verspricht , schnell wegzugehen PPER VVFIN $, ADJD VVIZU corpus to select a split (Koehn and Knight, 2003). German compounds are head-final, and pre- he promises to go away quickly modifiers can be added recursively. Compounds obji are structurally ambiguous if there is more than comma adv one modifier. Consider the distinction between root part

(Stadtteil)projekt (literally: ’(city part) project)’) subj avz and Stadt(teilprojekt) ’city sub-project’. We opt 1 er verspricht , schnell zu weg gehen for a left-branching representation by default. We PPER VVFIN $, ADJD PTKZU PTKVZ VVINF also split linking elements, and represent them as a postmodifier of each non-final segment, includ- he promises to go away quickly ing the empty string (""). We use the same repre- sentation for noun compounds and adjective com- Figure 2: Original and proposed representation of pounds. German particle verb with infixed zu-marker. An example of the original2 and the proposed compound representation is shown in Figure 1. Importantly, the head of the compound is also verb particles are reordered to be the closest pre- the parent of the determiners and attributes in modifier of the verb. Prefixed particles and the zu- the noun phrase, which makes a bigram depen- infix are identified by the finite-state-morphology, dency language model sufficient to enforce agree- and split from the verb so that the particle is ment. Since we model morphosyntactic agree- the closest, the zu marker the next-closest pre- ment within the main translation step, and not in modifier of the verb, as shown in Figure 2. Agree- a separate step as in (Fraser et al., 2012), we deem ment, selectional preferences, and other phenom- it useful that inflection is marked at the head of ena involve the verb and its dependents, and the the compound. Consequently, we do not split off proposed representation retains these dependency inflectional or derivational morphemes. links, but reduces data sparsity from affixation and For German particle verbs, we define a common avoids discontinuity of the verb and its particle. representation that abstracts away from the vari- ous surface realizations (see Table 1). Separated 3 Tree Binarization

1We follow prior work in leaving frequent words or sub- We follow Williams et al. (2014) and map de- words unsplit, which has a disambiguating effect. With more pendency trees into a constituency representation, aggressive splitting, frequency information could be used for which allows for the extraction of GHKM-style the structural disambiguation of internal structure. 2The original dependency trees follow the annotation translation rules (Galley et al., 2006). This con- guidelines by Foth (2005). version is lossless, and we can still apply a de-

2082

Page 26 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

pendency language model (RDLM). Figure 3 (a) ROOT shows the constituency representation of the ex- SUBJ VVFIN OBJA ample in Figure 1. PPER DET MOD SEG Our model should not only be able to produce ART MOD SEG LINK new words productively, but also to memorize SEG LINK LN words it has observed during training. Looking at LN the compound Handgepäckgebühr in Figure 3 (a), sie erheben eine Hand  gepäck  gebühr we can see that it does not form a constituent, and (a) cannot be extracted with GHKM extraction heuris- tics. To address this, we binarize the trees in our ROOT training data (Wang et al., 2007). ROOT OBJA A complicating factor is that the binarization SUBJ VVFIN DET OBJA should not impair the RDLM. During decoding, PPER ART MOD SEG we map the internal tree structure of each hypoth- MOD LINK esis back to the unbinarized form, which is then MOD SEG LN scored by the RDLM. Virtual nodes introduced by SEG LINK the binarization must also be scorable by RDLM LN if they form the root of a translation hypothesis. A sie erheben eine Hand  gepäck  gebühr simple right or left binarization would produce vir- (b) tual nodes without head and without meaningful dependency representation. We ensure that each Figure 3: Unbinarized (a) and head-binarized (b) virtual node dominates the head of the full con- constituency representation of Figure 1. stituent through a mixed binarization.3 Specifi- cally, we perform right binarization of the head and all pre-modifiers, then left binarization of all RDLM to take this into account. We distinguish post-modifiers. This head-binarized representa- between virtual nodes based on whether their span tion is illustrated in Figure 3 (b).4 is a string prefix, suffix, or infix of the full con- Head binarization ensures that even hypotheses stituent. For prefixes and infixes, we do not add whose root is a virtual node can be scored by the a stop symbol at the end, and use null symbols, RDLM. This score is only relevant for pruning, which denote unavailable context, for padding to and discarded when the full constituent is scored. the right. For suffixes and infixes, we do the same Still, these hypotheses require special treatment in at the start. the RDLM to mitigate search errors. The virtual node labels (such as OBJA) are unknown symbols 4 Post-Processing to the RDLM, and we simply replace them with For SMT, all German training and development the original label (OBJA). The RDLM uses sibling data is converted into the representation described context, and this is normally padded with special in sections 2–3. To restore the original represen- start and stop symbols, analogous to BOS/EOS tation, we start from the tree output of the string- symbols in n-gram models. These start and stop to-tree decoder. Merging compounds is trivial: all symbols let the RDLM compute the probability segments and linking elements can be identified by that a node is the first or last child of its ances- the tree structure, and are concatenated. tor node. However, computing these probabilities For verbs that dominate a verb particle, the orig- for virtual nodes would unfairly bias the search, inal order is restored through three rules: since the first/last child of a virtual node is not nec- essarily the first/last child of the full constituent. 1. non-finite verbs are concatenated with the We adapt the representation of virtual nodes in particle, and zu-markers are infixed. 3In other words, every node is a fixed well-formed depen- dency structure (Shen et al., 2010) with our binarization. 2. 
finite verbs that head a subordinated clause 4Note that our definition of head binarization is different (identified by its dependency label) are con- from that of Wang et al. (2007), who left-binarize a node if catenated with the particle. the head is the first child, and right-binarize otherwise. Our algorithm also covers cases where the head has both pre- and post-modifiers, as erheben and gepäck do in Figure 3. 3. finite verbs that head a main clause have the

2083

Page 27 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

particle moved to the right clause bracket.5 system newstest2014 newstest2015 baseline 20.7 22.0 Previous work on particle verb translation into +split compounds 21.3 22.4 +particle verbs 21.4 22.8 German proposed to predict the position of parti- head binarization 20.9 22.7 cles with an n-gram language model (Nießen and +split compounds 22.0 23.4 Ney, 2001). Our rules have the advantage that they +particle verbs 22.1 23.8 full system 22.6 24.4 are informed by the syntax of the sentence and consider the finiteness of the verb. Table 2: English–German translation results Our rules only produce projective trees. Verb (BLEU). Average of three optimization runs. particles may also appear in positions that violate particle verb projectivity, and we leave it to future research to system compound sep. pref. zu-infix determine if our limitation to projective trees af- reference 2841 553 1195 176 fects translation quality, and how to produce non- baseline 845 96 847 71 projective trees. +head binarization 798 157 858 106 +split compounds 1850 160 877 94 +particle verbs 1992 333 953 169 5 SMT experiments Table 3: Number of compounds [that would be 5.1 Data and Models split by compound splitter] and particle verbs We train English–German string-to-tree SMT sys- (separated, prefixed and with zu-infix) in new- tems on the training data of the shared transla- stest2014/5. Average of three optimization runs. tion task of the Workshop on Statistical Machine Translation (WMT) 2015. The data set consists of uses the dependency representation of compounds 4.2 million sentence pairs of parallel data, and 160 and tree binarization introduced in this paper; we million sentences of monolingual German data. achieve additional gains over the submission sys- We base our systems on that of Williams et tem through particle verb restructuring. al. (2014). It is a string-to-tree GHKM transla- tion system implemented in Moses (Koehn et al., 5.2 SMT Results 2007), and using the dependency annotation by Table 2 shows translation quality (BLEU) with dif- ParZu (Sennrich et al., 2013). Additionally, our ferent representations of German compounds and baseline system contains a dependency language particle verbs. Head binarization not only yields model (RDLM) (Sennrich, 2015), trained on the improvements over the baseline, but also allows target-side of the parallel training data. for larger gains from morphological segmenta- We report case-sensitive BLEU scores on the tion. We attribute this to the fact that full com- newstest2014/5 test sets from WMT, averaged pounds, and prefixed particle verbs, are not al- over 3 optimization runs of k-batch MIRA (Cherry ways a constituent in the segmented representa- and Foster, 2012) on a subset of newstest2008-12.6 tion, and that binarization compensates this the- We split all particle verbs and hyphenated com- oretical drawback. pounds, but other compounds are only split if they With head binarization, we find substantial im- are rare (frequency in parallel text < 5). provements from compound splitting of 0.7–1.1 For comparison with the state-of-the-art, we BLEU. On newstest2014, the improvement is train a full system on our restructured representa- almost twice of that reported in related work tion, which incorporates all models and settings of (Williams et al., 2014), which also uses a hier- our WMT 2015 submission system (Williams et archical representation of compounds, albeit one al., 2015).7 Note that our WMT 2015 submission that does not allow for dependency modelling. 
5We use the last position in the clause as default location, Examples of correct, unseen compounds gener- but put the particle before any subordinated and coordinated ated include Staubsauger|roboter ’vacuum cleaner clauses, which occur in the Nachfeld (the ‘final field’ in topo- logical field theory). robot’, Gravitation|s|wellen ’gravitational waves’, 6We use mteval-v13a.pl for comparability to official and NPD|-|verbot|s|verfahren ’NPD banning pro- WMT results; all significance values reported are obtained cess’.8 with MultEval (Clark et al., 2011). 7In contrast to our other systems in this paper, RDLM is (Vaswani et al., 2013), and soft source-syntactic constraints trained on all monolingual data for the full system, and two (Huck et al., 2014). models are added: a 5-gram Neural Network language model 8Note that Staubsauger, despite being a compound, is not

2084

Page 28 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

Particle verb restructuring yields additional Ney 5-gram LM and RDLM perform poorly due to gains of 0.1–0.4 BLEU. One reason for the smaller data sparseness, with 70% and 57.5% accuracy, re- effect of particle verb restructuring is that the diffi- spectively. In the split representation, the RDLM cult cases – separated particle verbs and those with reliably prefers the correct agreement (96.5% ac- infixation – are rarer than compounds, with 2841 curacy), whilst the performance of the 5-gram rare compounds [that would be split by our com- model even deteriorates (to 60% accuracy). This pound splitter] in the reference texts, in contrast is because the gender of the first segment(s) is ir- to 553 separated particle verbs, and 176 particle relevant, or even misleading, for agreement. For verbs with infixation, as Table 3 illustrates. If we instance, Handgepäck is neuter, which could lead only evaluate the sentences containing a particle a morpheme-level n-gram model to prefer the de- verb with zu-infix in the reference, 165 in total terminer ein, but Handgepäckgebühr is feminine for newstest2014/5, we observe an improvement and requires eine. of 0.8 BLEU on this subset (22.1 22.9), signifi- → cant with p < 0.05. 6 Conclusion The positive effect of restructuring is also ap- Our main contribution is that we exploit the hi- parent in frequency statistics. Table 3 shows that erarchical structure of morphemes to model them the baseline system severely undergenerates com- jointly with syntax in a dependency-based string- pounds and separated/infixed particle verbs. Bi- to-tree SMT model. We describe the dependency narization, compound splitting, and particle verb annotation of two morphologically complex word restructuring all contribute to bringing the distri- classes in German, compounds and particle verbs, bution of compounds and particle verbs closer to and show that our tree representation yields im- the reference. provements in translation quality of 1.4–1.8 BLEU In total, the restructured representation yields in the WMT English–German translation task.9 improvements of 1.4–1.8 BLEU over our base- The principle of jointly representing syntactic line. The full system is competitive with official and morphological structure in dependency trees submissions to the WMT 2015 shared translation can be applied to other language pairs, and we ex- tasks. It outperforms our submission (Williams pect this to be helpful for languages with a high et al., 2015) by 0.4 BLEU, and outperforms other degree of morphological synthesis. However, the phrase-based and syntax-based submissions by 0.8 annotation needs to be adapted to the respective BLEU or more. The best reported result accord- languages. For example, French compounds such ing to BLEU is an ensemble of Neural MT systems as arc-en-ciel ’rainbow’ are head-initial, in con- (Jean et al., 2015), which achieves 24.9 BLEU. In trast to head-final Germanic compounds. the human evaluation, both our submission and the Neural MT system were ranked 1–2 (out of 16), Acknowledgments with no significant difference between them. This project received funding from the Euro- pean Union’s Horizon 2020 research and innova- 5.3 Synthetic LM Experiment tion programme under grant agreements 645452 We perform a synthetic experiment to test our (QT21), 644402 (HimL), 644333 (TraMOOC), claim that a dependency representation allows for and from the Swiss National Science Foundation the modelling of agreement between morphemes. under grant P2ZHP1_148717. 
For 200 rare compounds [that would be split by our compound splitter] in the newstest2014/5 ref- erences, we artificially introduce agreement errors References by changing the gender of the determiner. For in- Colin Cherry and George Foster. 2012. Batch Tun- stance, we create the erroneous sentence sie er- ing Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North heben ein Handgepäckgebühr as a complement to American Chapter of the Association for Compu- Example 1. We measure the ability of language tational Linguistics: Human Language Technolo- models to prefer (give a higher probability to) gies, NAACL HLT ’12, pages 427–436, Montreal, the original reference sentence over the erroneous Canada. Association for Computational Linguistics. one. In the original representation, both a Kneser- 9We released source code and configuration files at https://github.com/rsennrich/ segmented due to its frequency. wmt2014-scripts.


C Edinburgh’s Syntax-Based Systems at WMT 2015

Edinburgh’s Syntax-Based Systems at WMT 2015

Philip Williams1, Rico Sennrich1, Maria Nadejde1, Matthias Huck1, Philipp Koehn1,2
1School of Informatics, University of Edinburgh
2Center for Speech and Language Processing, The Johns Hopkins University

Abstract

This paper describes the syntax-based systems built at the University of Edinburgh for the WMT 2015 shared translation task. We developed systems for all language pairs except French-English. This year we focused on: translation out of English using tree-to-string models; continuing to improve our English-German system; and source-side morphological segmentation of Finnish using Morfessor.

1 Introduction

This year's WMT shared translation task featured five language pairs: English paired with Czech, Finnish, French, German, and Russian. We built syntax-based systems in both translation directions for all language pairs except English-French.

For English→German, we continued to develop our string-to-tree system, which has proven highly competitive in previous years. Additions this year included the use of a dependency language model, an alternative tuning metric, and soft source-syntactic constraints.

For translation from English into Czech, Finnish, and Russian, we built STSG-based tree-to-string systems. Support for this type of model is a recent addition to the Moses toolkit. In previous years, our systems have all used string-to-tree models and have only translated into English and German.

For Finnish→English, we experimented with unsupervised morphological segmentation using Morfessor 2.0 (Virpioja et al., 2013).

For the remaining systems (Czech→English, German→English, and Russian→English), our systems were essentially the same as last year's (Williams et al., 2014) except for the addition of this year's training data.

2 System Overview

2.1 Pre-processing

The training data was pre-processed using scripts from the Moses toolkit. We first normalized the data using the normalize-punctuation.perl script, then performed tokenization, parsing, and truecasing. To parse the English data, we used the Berkeley parser (Petrov et al., 2006; Petrov and Klein, 2007). To parse the German data, we used the ParZu dependency parser (Sennrich et al., 2013).

2.2 Word Alignment

For word alignment we used either MGIZA++ (Gao and Vogel, 2008), a multi-threaded implementation of GIZA++ (Och and Ney, 2003), or fast_align (Dyer et al., 2013). In preliminary experiments, we found that the tree-to-string systems were particularly sensitive to the choice of word aligner, echoing a previous observation by Neubig and Duh (2014). See the individual tree-to-string system descriptions in Section 3.

2.3 Language Model

We used all available monolingual data to train one interpolated 5-gram language model for each system. Using either lmplz (Heafield et al., 2013) or the SRILM toolkit (Stolcke, 2002), we first trained an individual language model for each of the supplied monolingual training corpora. These models all used modified Kneser-Ney smoothing (Chen and Goodman, 1998). We then interpolated the individual models using SRILM, providing the target side of the system's tuning set (Section 2.7) for perplexity-based weight optimization.

2.4 String-to-Tree Model

For English→German and the systems that translate into English, we used a string-to-tree model.
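The perplexity-based weight optimization in Section 2.3 is typically done with SRILM's compute-best-mix tool. As a minimal illustration of what that optimization computes (not the authors' actual setup), the following sketch runs the underlying EM procedure on hypothetical per-model token probabilities; the input array and numbers are invented for the example.

import numpy as np

def interpolation_weights(token_probs, iterations=50):
    # token_probs[m, t]: probability that component model m assigns to
    # token t of the tuning set. EM finds mixture weights that locally
    # minimize the perplexity of the interpolated model.
    num_models, num_tokens = token_probs.shape
    weights = np.full(num_models, 1.0 / num_models)
    for _ in range(iterations):
        # E-step: posterior responsibility of each model for each token.
        joint = weights[:, None] * token_probs
        posterior = joint / joint.sum(axis=0, keepdims=True)
        # M-step: each weight becomes the model's average responsibility.
        weights = posterior.sum(axis=1) / num_tokens
    return weights

# Hypothetical example: two 5-gram models scoring a five-token tuning set.
probs = np.array([[0.10, 0.02, 0.30, 0.05, 0.20],
                  [0.01, 0.08, 0.10, 0.20, 0.02]])
weights = interpolation_weights(probs)
mixture = weights @ probs
print(weights, np.exp(-np.log(mixture).mean()))  # weights, perplexity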



2.4.1 Grammar

The string-to-tree translation model is based on a synchronous context-free grammar (SCFG) with linguistically-motivated labels on the target side. SCFG rules were extracted from the word-aligned parallel data using the Moses implementation (Williams and Koehn, 2012) of the GHKM algorithm (Galley et al., 2004; Galley et al., 2006).

Minimal GHKM rules were composed into larger rules subject to restrictions on the size of the resulting tree fragment. We used the settings shown in Table 1, which were chosen empirically during the development of 2013's systems (Nadejde et al., 2013).

Parameter    Unbinarized  Binarized
Rule depth   5            7
Node count   20           30
Rule size    5            7

Table 1: Parameter settings for rule composition. The parameters were relaxed for systems that used binarization to allow for the increase in tree node density.

Further to the restrictions on rule composition, fully non-lexical unary rules were eliminated using the method described in Chung et al. (2011), and rules with scope greater than 3 (Hopkins and Langmead, 2010) were pruned from the translation grammar. Scope pruning makes parsing tractable without the need for grammar binarization.

2.4.2 Feature Functions

Our core set of string-to-tree feature functions is unchanged from previous years. It includes the n-gram language model's log probability for the target string, the target word count, the rule count, and various pre-computed rule-specific scores. For a grammar rule r of the form

C → ⟨α, β, ∼⟩

where C is a target-side non-terminal label, α is a string of source terminals and non-terminals, β is a string of target terminals and non-terminals, and ∼ is a one-to-one correspondence between source and target non-terminals, we score the rule according to (logarithms of) the following functions:

• p(C, β | α, ∼) and p(α | C, β, ∼), the direct and indirect translation probabilities.
• p_lex(β | α) and p_lex(α | β), the direct and indirect lexical weights (Koehn et al., 2003).
• p_pcfg(π), the monolingual PCFG probability of the tree fragment π from which the rule was extracted.
• exp(−1/count(r)), a rule rareness penalty.

2.5 Tree-to-String Model

For English→Czech, English→Finnish, and English→Russian, we used a tree-to-string model.

2.5.1 Grammar

In the tree-to-string model, the translation grammar is a synchronous tree-substitution grammar (Eisner, 2003) with parse tree fragments on the source side and strings of terminals and non-terminals on the target side.

As with the string-to-tree models, the grammar was extracted from the word-aligned parallel data using the Moses implementation of the GHKM algorithm. Minimal GHKM rules were composed into larger rules subject to the same size restrictions (Table 1). Unlike string-to-tree rule extraction, fully non-lexical unary rules were included in the grammar and scope pruning was not used.

2.5.2 Feature Functions

The tree-to-string feature functions are similar to those of the string-to-tree model. For a grammar rule r of the form

⟨π, β, ∼⟩

where π is a source-side tree fragment, β is a string of target terminals and non-terminals, and ∼ is a one-to-one correspondence between source and target non-terminals, we score the rule according to (logarithms of) the following functions:

• p(β | π, ∼) and p(π | β, ∼), the direct and indirect translation probabilities.
• p_lex(β | π) and p_lex(π | β), the direct and indirect lexical weights (Koehn et al., 2003).
• exp(−1/count(r)), a rule rareness penalty.

2.6 Decoding

Decoding for the string-to-tree models is based on Sennrich's (2014) recursive variant of the CYK+ parsing algorithm combined with LM integration via cube pruning (Chiang, 2007). Decoding for the tree-to-string models is based on the rule matching algorithm by Zhang et al. (2009) combined with LM integration via cube pruning.
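Two of the quantities above are easy to make concrete. The scope of a rule's source side (Hopkins and Langmead, 2010) counts the positions where parsing is not lexically anchored: non-terminals at the rule edges plus adjacent non-terminal pairs; rules with scope greater than 3 are pruned. The sketch below uses an illustrative token encoding (None marks a non-terminal gap), not the Moses data structures, and also shows the exp(−1/count(r)) rareness penalty.

import math

def scope(source_side):
    # source_side: list of tokens; terminals are strings and each
    # non-terminal is represented as None (illustrative encoding).
    is_nt = [token is None for token in source_side]
    edge_nts = int(is_nt[0]) + int(is_nt[-1])
    adjacent_pairs = sum(1 for a, b in zip(is_nt, is_nt[1:]) if a and b)
    return edge_nts + adjacent_pairs

def rareness_penalty(rule_count):
    # Approaches 1 for frequent rules; penalizes singletons (exp(-1)).
    return math.exp(-1.0 / rule_count)

print(scope([None, "Haus", None]))       # X1 Haus X2 -> scope 2, kept
print(scope([None, None, "und", None]))  # X1 X2 und X3 -> scope 3, kept
print(rareness_penalty(1), rareness_penalty(100))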



2.7 Tuning

The feature weights were tuned using the Moses implementation of MERT (Och, 2003) for all systems except English-to-German, for which we used k-best MIRA (Cherry and Foster, 2012) due to the use of sparse features.

For the tree-to-string systems, we used all of the previous years' test sets as tuning data (except newstest2014, which was used as the development test set). For the string-to-tree systems, we used subsets of the test data to speed up decoding.

3 Individual Systems

In this section we describe the individual systems and present experimental results. In many cases, the only difference from the generic setup of the previous section is that we perform right binarization of the training and test parse trees (sketched after Section 3.2 below).

We also built hierarchical phrase-based systems (Chiang, 2007), which we refer to in tables as 'Hiero'. These systems were built using the Moses toolkit, with standard settings. They were not used in the submission and are included for comparison only.

For each system, we present results both for the development test set (newstest2014 in most cases) and for the test set (newstest2015), for which reference translations were provided after the system submission deadline. We refer to these as 'devtest' and 'test', respectively.

3.1 English to Czech

For English→Czech we built a tree-to-string system. We used fast_align for word alignment due to the large training data size and on the strength of its performance for English→Finnish and English→Russian. We used all test sets from 2008 to 2013 as tuning data. Table 2 gives the mean BLEU scores, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.

system                devtest  test
Hiero                 20.2     16.8
Tree-to-string        19.0     15.7
+ right binarization  19.5     16.1

Table 2: English to Czech translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.2 English to Finnish

In preliminary English→Finnish experiments, we compared the use of MGIZA++ and fast_align. Since there was only one test set provided, in these initial experiments we split newsdev2015 into two halves, using the first half for tuning and the second half for testing. Table 3 gives the mean BLEU scores, averaged over three MERT runs.

                      MGIZA++  fast_align
Hiero                 11.7     11.6
Tree-to-string        11.5     12.3
+ right binarization  11.9     12.8

Table 3: Comparison of word alignment tools for English to Finnish. BLEU on a subset of newsdev2015.

For our final system, we used fast_align for word alignment and we used the full newsdev2015 set as tuning data. Table 4 gives the mean BLEU scores for this setup. Our submitted system was the right binarized system that, out of the three MERT runs, scored highest on devtest.

system                dev   test
Hiero                 11.4  11.5
Tree-to-string        11.9  11.8
+ right binarization  12.2  12.3

Table 4: Final English to Finnish translation results (BLEU) on dev (newsdev2015) and test (newstest2015) sets.
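Right binarization, the recurring '+ right binarization' variant in the tables above and below, converts n-ary tree nodes to a right-branching binary structure before rule extraction. A minimal sketch over a toy (label, children) tree encoding; the tuple format and the '^'-prefixed virtual labels are illustrative choices, not the convention of the actual pipeline.

def right_binarize(tree):
    # tree: (label, children) where children is either a word (leaf)
    # or a list of subtrees. Nodes with more than two children are
    # folded into a right-branching chain of virtual '^label' nodes.
    label, children = tree
    if isinstance(children, str):
        return tree
    children = [right_binarize(child) for child in children]
    while len(children) > 2:
        children = children[:-2] + [("^" + label, children[-2:])]
    return (label, children)

flat_np = ("NP", [("DT", "the"), ("JJ", "big"), ("NN", "house")])
print(right_binarize(flat_np))
# ('NP', [('DT', 'the'), ('^NP', [('JJ', 'big'), ('NN', 'house')])])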
3.3 English to German

We experiment with the following additions to last year's submission system: a relational dependency language model (RDLM) (Sennrich, 2015); tuning on the syntactic metric HWCM (Liu and Gildea, 2005; Sennrich, 2015); soft source-syntactic constraints (Huck et al., 2014); a large-scale n-gram neural network language model (NPLM) (Vaswani et al., 2013); treebank binarization (Sennrich and Haddow, 2015); and particle verb restructuring (Sennrich and Haddow, 2015). We do not include syntactic constraints in this year's baseline. Our string-to-tree baseline uses a dependency representation of compounds, as described in Sennrich and Haddow (2015).



RDLM is a relational dependency language model which predicts the dependency relations and words in the translation hypotheses based on the dependency relations and words of the ancestor and sibling nodes in the dependency tree. Our model contains several extensions over the original paper (Sennrich, 2015). Like the original paper, we use an ancestor context size of 2, but we increase the sibling context size from 1 to 3 and allow bidirectional context, using the 3 closest siblings to both the left and right of the current node. The original model predicts a virtual stop node as the last child of each tree, which models the probability that a node has no more children. This is mirrored by a virtual start node in the bidirectional model.

We binarize the treebanks before rule extraction. We note that treebank binarization allows the extraction of rules that overgeneralize, e.g. allowing structures with zero, or multiple, preterminals per node, effectively allowing verb clauses without a verb and similar. We use head binarization (Sennrich and Haddow, 2015), which ensures that each constituent contains exactly one head. During decoding, the generated target trees are unbinarized to allow scoring with RDLM. Table 5 shows that both right binarization and head binarization overgeneralize, exemplified by the fact that they allow finite clauses to have multiple subjects (compound subjects are represented as a single node). The RDLM reduces this problem, and the bidirectional RDLM slightly outperforms the unidirectional variant, both in terms of BLEU and the number of overgeneralizations.

system              BLEU  2+ SUBJ
original trees      20.1  0
+ RDLM              21.0  0
+ RDLM (bidir.)     21.2  0
right binarization  20.4  272
head binarization   20.5  152
+ RDLM              21.3  43
+ RDLM (bidir.)     21.5  32

Table 5: English to German translation results (on newstest2013) with different binarizations and language models. 2+ SUBJ: number of finite clauses with more than one subject.

For the soft source-syntactic constraints, we annotate the source text with the Stanford neural network dependency parser (Chen and Manning, 2014), along with heuristic projectivization (Nivre and Nilsson, 2005).

The NPLM is a 5-gram feed-forward neural language model, and for both RDLM and NPLM we use a single hidden layer of size 750, a 150-dimensional input embedding layer with a vocabulary size of 500000, noise-contrastive estimation with 100 noise samples, and 2 iterations over the monolingual training set. Estimating LM probabilities for OOV words is a well-known problem, and we avoid it by filtering the translation model according to the vocabulary of the neural models.

The impact of all experimental components is shown in Table 6. Each system in Tables 5 and 6 was tuned separately with MIRA. For our submission system, we increased the Moses parameters cube-pruning-pop-limit from 1000 to 4000 and rule-limit from 100 to 400, but this had little effect on devtest, and gave even slightly lower BLEU on test. Particle verb restructuring, which was done after the submission deadline, increases BLEU on test. In total, we observe substantial improvements over our baseline, which roughly corresponds to last year's submission systems: 2.2 BLEU on devtest, and 3.0 BLEU on test.

system                          devtest  test
Hiero                           19.2     21.0
String-to-tree baseline         19.8     21.4
+ HWCM+BLEU/2 tuning            20.1     21.6
+ head binarization             20.5     22.3
+ RDLM (bidirectional)          21.5     23.3
+ source-syntactic constraints  21.6     23.8
+ 5-gram NPLM                   22.0     24.1
+ less pruning (submission)     22.0     24.0
+ particle verb restructuring   22.0     24.4

Table 6: English to German translation results (BLEU) on devtest (newstest2013) and test (newstest2015) sets.
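The 2+ SUBJ column of Table 5 is a simple tree-walking diagnostic: count finite clauses in the system output whose children include more than one subject. A sketch under assumed node and edge labels ('S' and 'SUBJ' are stand-ins here; the actual German annotation scheme differs), reusing the toy (label, children) encoding from the earlier binarization sketch:

def count_multi_subject_clauses(tree, clause_label="S", subj_label="SUBJ"):
    # Returns the number of clause nodes with two or more subject children.
    label, children = tree
    if isinstance(children, str):  # leaf: (tag, word)
        return 0
    subjects = sum(1 for child in children if child[0] == subj_label)
    hit = 1 if label == clause_label and subjects >= 2 else 0
    return hit + sum(count_multi_subject_clauses(child) for child in children)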
3.4 English to Russian

For English→Russian we built a tree-to-string system. During preliminary experiments we found that fast_align gave consistent gains over MGIZA++ (albeit smaller than for Finnish→English, at around 0.3 BLEU). In the final experiments we used fast_align for word alignment and we used the 2012 and 2013 test sets as tuning data. Table 7 gives the mean BLEU scores, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.



system                devtest  test
Hiero                 29.8     23.8
Tree-to-string        27.5     22.1
+ right binarization  28.3     23.0

Table 7: English to Russian translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.5 Czech to English

For Czech→English we built a string-to-tree system. We used all test sets from 2008 to 2013 as tuning data. Table 8 gives the mean BLEU scores, which are averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.

system                devtest  test
Hiero                 28.5     24.9
String-to-tree        27.8     24.4
+ right binarization  27.8     24.5

Table 8: Czech to English translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.6 Finnish to English

In preliminary Finnish→English experiments, we tried using Morfessor to segment Finnish words into morphemes. We used Morfessor 2.0 (with default settings) to learn an unsupervised segmentation model from all of the available Finnish data, which was then used to segment all words in the source-side training and test data. We compared systems with and without segmentation, as well as a system combination of the two, an approach that has been shown to improve translation quality for this language pair (de Gispert et al., 2009).

As with English→Finnish, we split newsdev2015 into two halves, using the first half for tuning and the second half for testing. Table 9 shows the results: the column headed 'word' gives BLEU scores for the unsegmented systems; the column headed 'morph' gives scores for systems trained on segmented data; and the column headed 'syscomb' gives results for a system combination using MEMT (Heafield and Lavie, 2010).

                      word  morph  syscomb
Hiero                 17.8  19.1   19.2
String-to-tree        17.6  18.5   18.7
+ right binarization  17.8  18.9   18.9

Table 9: Finnish to English experiments with morphological segmentation.

For our final system, we used morphological segmentation but not system combination. We used the full newsdev2015 set as tuning data. Table 10 gives mean BLEU scores for this setup, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three, scored highest on newsdev2015.

system                dev   test
Hiero                 18.6  17.5
String-to-tree        18.3  17.2
+ right binarization  18.5  17.7

Table 10: Finnish to English translation results (BLEU) on dev (newsdev2015) and test (newstest2015) sets.
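Morfessor 2.0 ships with a Python API as well as the command-line tools; the following sketch shows how the segmentation step of Section 3.6 could look with it, based on the toolkit's documented usage. Treat the exact calls as an assumption (the report only states that the command-line defaults were used), and the file path is hypothetical.

import morfessor

io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("finnish_all.txt"))  # hypothetical path

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()  # unsupervised training with default settings

def segment(word):
    # viterbi_segment returns (constructions, cost); keep the morphs only.
    constructions, _cost = model.viterbi_segment(word)
    return " ".join(constructions)

print(segment("puolustusyhteistyösopimuksesta"))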
3.7 German to English

For German→English we built a tree-to-string system with a setup similar to last year's (Williams et al., 2014). Our submitted system was right binarized with the following extraction parameters: rule depth = 7, node count = 100, rule size = 7. At decoding time we used the following non-default parameter value: max-chart-span = 25. This limits sub-derivations to a maximum span of 25 source words. For the Hiero baseline system we used max-chart-span = 15. For tuning we used a random subset of 2000 sentences drawn from the full tuning set.

We performed some preliminary experiments with neural bilingual language models, our re-implementation of the "joint" model of Devlin et al. (2014). The bilingual language models are trained with the NPLM toolkit (Vaswani et al., 2013). We used 250-dimensional input embedding and hidden layers, and input and output vocabulary sizes of 500000 and 250000 respectively. One bilingual language model was a 5-gram model with an additional context of 9 source words: the affiliated source word and a window of 4 words on either side. A second model was a 1-gram model with an additional context of 13 source words. The language models were trained on the available parallel corpora.

We also added a 7-gram class-based language model, with 50 word classes trained using mkcls (Och, 1999).



The class-based language model was trained on all available monolingual corpora, filtering out singletons.

Table 11 shows the results. As the preliminary results were not encouraging, we did not include the bilingual LMs and class LMs in our submitted system.

system                   devtest  test
Hiero                    27.7     28.0
String-to-tree           28.7     28.7
+ bilingual LMs          28.6     28.7
+ bilingual & class LMs  28.3     28.7

Table 11: German to English translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.8 Russian to English

For Russian→English we built a string-to-tree system, using the 2012 and 2013 test sets as tuning data. Table 12 gives the mean BLEU scores, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.

system                devtest  test
Hiero                 31.2     27.1
String-to-tree        30.5     25.9
+ right binarization  30.6     26.2

Table 12: Russian to English translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

4 Manual Error Analysis

Our syntax-based systems for the German–English language pairs have greatly improved over the last years and have outperformed traditional phrase-based statistical machine translation systems. Translating between German and English is a challenge for those systems, since extensive long-distance reordering and long-distance agreement constraints do not fit that approach. Are our syntax-based systems tackling these problems better? And what are the main remaining problems?

For both German–English and English–German, we analyzed 100 sentences, carrying out an error analysis using linguistic error categories that roughly match other efforts in this area (Vilar et al., 2006; Toral et al., 2013; Herrmann et al., 2014; Lommel et al., 2014; Aranberri, 2015). We used the following error annotation protocol:

1. A bilingual speaker corrects the machine translation output with the minimal edits necessary to render an acceptable translation. This is done in view of the human reference translation, but typically a much more literal translation is obtained.

2. Each edit is noted in a list in the form "old string → new string", where either the old or the new string may also be empty or discontinuous.

3. In a second pass, all edits are classified with error categories.

Such an error analysis is subjective. There are many ways to correct errors (step 1), many ways to split corrections into units (step 2), and many ways to classify the errors (step 3). Moreover, analyzing only 100 sentences does not lead to statistically strong findings. With this in mind, the following analysis is broadly indicative of the main error types in our syntax-based systems.

Occasionally, parts of a machine translation are so muddled that no sequence of edits could be established. This happened in 8 German–English sentences and 7 English–German sentences.

4.1 German–English

16 sentences have no error, while 18 sentences have only one error. These are of course typically the shorter ones. The longest sentence without error is:

Source: Der Oppositionspolitiker Imran Khan wirft Premier Sharif vor, bei der Parlamentswahl im Mai vergangenen Jahres betrogen zu haben.

MT: The opposition politician Imran Khan accuses Premier Sharif of having cheated in the parliamentary election in May of last year.

This is not a trivial sentence, since it requires the translation of the complex subclause construction accuses ... of having cheated, which is rendered quite differently in German as wirft ... vor ... betrogen zu haben.

An overview of the major error categories is shown in Table 13. On average, 2.85 errors per sentence were identified. This gives us guidance on the major problems we should be working on in the future.
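In terms of bookkeeping, the protocol above yields one list of edits per sentence, each edit carrying the old string, the new string, and the category assigned in the second pass; tables such as Table 13 below are then plain category counts. A minimal sketch (the data structures and the two example sentences are our own illustration):

from collections import Counter

# One list of (old, new, category) edits per annotated sentence; old or
# new may be empty (insertion/deletion) or discontinuous in practice.
annotations = [
    [("", "the", "Missing function word - determiner"),
     ("said", "accused", "Wrong content word - verb")],
    [("house", "parliament", "Wrong content word - noun")],
]

category_counts = Counter(category for sentence in annotations
                          for (_old, _new, category) in sentence)
errors_per_sentence = sum(len(s) for s in annotations) / len(annotations)

for category, count in category_counts.most_common():
    print(count, category)
print("errors per sentence:", errors_per_sentence)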



Count  Category
29     Wrong content word - noun
25     Wrong content word - verb
22     Wrong function word - preposition
21     Inflection - verb
14     Reordering: verb
13     Reordering: adjunct
12     Missing function word - preposition
10     Missing content word - verb
9      Wrong function word - other
9      Wrong content word - wrong POS
9      Added punctuation
8      Muddle
8      Missing function word - connective
8      Added function word - preposition
7      Missing punctuation
7      Wrong content word - adverb
6      Wrong content word - phrasal verb
6      Added function word - determiner
5      Unknown word - noun
5      Missing content word - adverb
5      Missing content word - noun
5      Inflection - noun
4      Reordering: NP
3      Missing content word - adjective
3      Inflection - wrong POS
3      Casing
2      Unknown word - verb
2      Reordering: punctuation
2      Reordering: noun
2      Reordering: adverb
2      Missing function word - determiner
2      Inflection - adverb

Table 13: Main error types in the German–English system (count in 100 sentences).

Lexical choice. The biggest group of error types concerns the translation of basic concepts. On average, such errors occur 0.76 times per sentence. Given the vast number of content words that need to be translated, the actual performance on the task of lexical translation is quite high, but the problem is by no means solved.

Count  Category
29     Wrong content word - noun
25     Wrong content word - verb
9      Wrong content word - wrong POS
7      Wrong content word - adverb
6      Wrong content word - phrasal verb

Prepositions. We were surprised by the large number of errors revolving around prepositions. Prepositions are frequent, but not as frequent as content words, so the performance on the preposition translation task is not as good. Prepositions mostly mark relationships of adjuncts, which involve quite complex considerations: the adjunct, the modified verb or noun phrase, identifying the relationship between them in the source sentence, and the fuzzy meaning of prepositions.

Count  Category
22     Wrong function word - preposition
12     Missing function word - preposition
8      Added function word - preposition

Reordering. We were also surprised by the low number of reordering errors. The different word order between German and English has hampered translation quality for this language pair historically. While we cannot declare complete success, our syntax-based systems constitute great progress in this area.

Count  Category
14     Reordering: verb
13     Reordering: adjunct
4      Reordering: NP
2      Reordering: noun
2      Reordering: adverb

Other issues with verbs. Reordering errors involving verbs top the list in the previous group of error types, but there are also other problems with verbs: their inflection and the unacceptable frequency with which they are dropped. The latter has its roots in faulty word alignments, which are based on IBM models that often fail to align the out-of-English-order German verb, thus enabling the translation model to drop it, which the language model often prefers. Inflection is here to be understood broadly, including the need for the right function words to form a grammatically correct verb complex (e.g., will have been resolved).

Count  Category
21     Inflection - verb
10     Missing content word - verb

Overall, the main thrust of future research should be focused on lexical choice, selecting correct prepositions, and producing the correct verb.



Count  Category
41     Wrong content word - verb
37     Wrong content word - noun
33     Reordering - verb
30     Inflection - verb
22     Missing function word - preposition
17     Inflection - np
14     Wrong function word - preposition
12     Wrong content word - phrasal verb
12     Wrong content word - wrong POS
12     Wrong function word - clausal connective
11     Reordering - pp
11     Inflection - noun
10     Wrong function word - pronoun
10     Missing function word - pronoun
10     Missing function word - determiner
9      Reordering - noun
9      Compound merging
8      Added function word - preposition
7      Punctuation - inserted
7      Muddle
7      Missing function word - clausal connective
7      Added function word - determiner
5      Punctuation - missing
5      Missing content word - verb
4      Reordering - adverb
4      Wrong content word - adverb
3      Missing content word - adjective
2      Reordering - pronoun
2      Wrong content word - name
2      Missing content word - adverb
2      Wrong content word - adjective
2      Added function word - pronoun

Table 14: Main error types in the English–German system (count in 100 sentences).

4.2 English–German

12 sentences had no error, and 13 sentences only one error. This is fewer than for German–English, which supports the general contention that translating into German is harder. On average, a total of 3.8 errors per sentence were marked, one error per sentence more than for German–English. An overview of the major error categories is shown in Table 14.

The longest sentence with no error is:

Source: Congressmen Keith Ellison and John Lewis have proposed legislation to protect union organizing as a civil right.

Target: Die Kongressabgeordneten Keith Ellison und John Lewis haben Gesetze zum Schutz der gewerkschaftlichen Organisation als Bürgerrecht vorgeschlagen.

In terms of word order, this is not a complicated sentence (besides the verb movement proposed → vorgeschlagen), but it does involve switching of part-of-speech for two content words: protect → Schutz (verb → noun), union → gewerkschaftlichen (noun → adjective).

Lexical choice. As with German–English, this is the biggest group of error types, with 1.08 errors per sentence. Verb sense errors tend to be more subtle, such that a media outlet does not sagt (say) but berichtet (report) a news item. For nouns, there were several stark errors, such as the mis-translation of patient as Geduld (patience) in a medical context. In general, there is no reason to believe that models that draw more strongly on a wider context could not resolve many of these cases.

Count  Category
41     Wrong content word - verb
37     Wrong content word - noun
12     Wrong content word - phrasal verb
12     Wrong content word - wrong POS
4      Wrong content word - adverb
2      Wrong content word - adjective

Role and order of adjuncts and arguments. While the overall sentence structure is mostly correct, there are often problems with the handling of adjunct and argument phrases. Their role is identified in German by a preposition or by the case of a noun phrase (the main cause of inflection errors). Their position in the sentence is less strict, but mistakes can be and are made.

Count  Category
22     Missing function word - preposition
17     Inflection - np
14     Wrong function word - preposition
11     Reordering - pp
11     Inflection - noun
8      Added function word - preposition

Verbs. Reordering errors of verbs mainly occur in complex subclause constructions. German verbs are more strongly inflected for number and person, and often a few function words are needed in just the right order and placement for a correct verb complex.



Count  Category
33     Reordering - verb
30     Inflection - verb
5      Missing content word - verb

Pronouns. Due to the grammatical gender of nouns in German, translating it and they is a complex undertaking. German verbs also require reflexive pronouns more frequently.

Count  Category
10     Wrong function word - pronoun
10     Missing function word - pronoun
2      Added function word - pronoun

Clausal connectives. A specific problem of English–German translation is clausal connectives. In English, the relationship of the subclause is often not explicitly marked (e.g., Police say the rider ...), while German requires a function word.

Count  Category
12     Wrong function word - clausal connective
7      Missing function word - clausal connective

Overall, while there are more structural problems than for German–English, often the remaining challenge is the disambiguation of lexical choices and the correct labelling of syntactic relationships.

5 Conclusion

This year we submitted syntax-based systems for all language pairs except English-French. Our English→German system included significant improvements over last year's, and we intend to continue developing this system. We presented the first results using Moses' STSG-based tree-to-string model.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21) and 644402 (HimL), and from the Swiss National Science Foundation under grant P2ZHP1_148717.

References

Nora Aranberri. 2015. SMT error analysis and mapping to syntactic, semantic and structural fixes. In Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 30–38, Denver, Colorado, USA, June. Association for Computational Linguistics.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report, Harvard University.

Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427–436, Montréal, Canada, June.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Tagyoung Chung, Licheng Fang, and Daniel Gildea. 2011. Issues concerning decoding with synchronous context-free grammar. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 413–417, Portland, Oregon, USA, June.

Adrià de Gispert, Sami Virpioja, Mikko Kurimo, and William Byrne. 2009. Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 73–76, Boulder, Colorado, June.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, MD, USA, June.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, GA, USA, June.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In The Companion Volume to the Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 205–208, Sapporo, Japan, July. Association for Computational Linguistics.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a Translation Rule? In HLT-NAACL '04.



Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 961–968, Morristown, NJ, USA.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP '08, pages 49–57, Stroudsburg, PA, USA.

Kenneth Heafield and Alon Lavie. 2010. Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics, 93:27–36.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria, August.

Teresa Herrmann, Jan Niehues, and Alex Waibel. 2014. Manual analysis of structurally informed reordering in German-English machine translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). European Language Resources Association (ELRA).

Mark Hopkins and Greg Langmead. 2010. SCFG decoding without binarization. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 646–655, Cambridge, MA, October.

Matthias Huck, Hieu Hoang, and Philipp Koehn. 2014. Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 148–156, Doha, Qatar.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Morristown, NJ, USA.

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 25–32, Ann Arbor, Michigan.

Arle Lommel, Aljoscha Burchardt, Maja Popović, Kim Harris, Eleftherios Avramidis, and Hans Uszkoreit. 2014. Using a new analytic measure for the annotation and analysis of MT errors on real data. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, pages 165–172.

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's Syntax-Based Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 170–176, Sofia, Bulgaria, August.

Graham Neubig and Kevin Duh. 2014. On the elements of an accurate tree-to-string machine translation system. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 143–149, Baltimore, Maryland, June.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 99–106, Ann Arbor, Michigan.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March.

Franz Josef Och. 1999. An Efficient Method for Determining Bilingual Word Classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 71–76.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160–167, Morristown, NJ, USA.

Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York, April.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 433–440.

Rico Sennrich and Barry Haddow. 2015. A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal. Association for Computational Linguistics.



Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, pages 601–609, Hissar, Bulgaria.

Rico Sennrich. 2014. A CYK+ variant for SCFG decoding without a dot chart. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 94–102, Doha, Qatar, October.

Rico Sennrich. 2015. Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Andreas Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 3, Denver, CO, USA, September.

Antonio Toral, Sudip Kumar Naskar, Joris Vreeke, Federico Gaspari, and Declan Groves. 2013. A web application for the diagnostic evaluation of machine translation over specific linguistic phenomena. In Proceedings of the 2013 NAACL HLT Demonstration Session, pages 20–23, Atlanta, Georgia, June. Association for Computational Linguistics.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, WA, USA.

David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. 2006. Error analysis of statistical machine translation output. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 06), pages 697–702.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University, Helsinki.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 388–394, Montréal, Canada, June.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler, and Philipp Koehn. 2014. Edinburgh's Syntax-Based Systems at WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 207–214, Baltimore, Maryland, USA, June.

Hui Zhang, Min Zhang, Haizhou Li, and Chew Lim Tan. 2009. Fast translation rule matching for syntax-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1037–1045, Singapore, August.



D Edinburgh’s Statistical Machine Translation Systems for WMT16

Edinburgh's Statistical Machine Translation Systems for WMT16

Philip Williams1, Rico Sennrich1, Maria Nadejde1, Matthias Huck2, Barry Haddow1, Ondřej Bojar3
1School of Informatics, University of Edinburgh
2Center for Information and Language Processing, LMU Munich
3Institute of Formal and Applied Linguistics, Charles University in Prague
[email protected] [email protected] [email protected]
[email protected] [email protected] [email protected]

Abstract

This paper describes the University of Edinburgh's phrase-based and syntax-based submissions to the shared translation tasks of the ACL 2016 First Conference on Machine Translation (WMT16). We submitted five phrase-based and five syntax-based systems for the news task, plus one phrase-based system for the biomedical task.

1 Introduction

Edinburgh's submissions to the WMT 2016 news translation task fall into two distinct groups: neural translation systems and statistical translation systems. In this paper, we describe the statistical systems, which include a mix of phrase-based and syntax-based approaches. We also include a brief description of our phrase-based submission to the WMT16 biomedical translation task. Our neural systems are described separately in Sennrich et al. (2016a).

In most cases, our statistical systems build on last year's, incorporating recent modelling refinements and adding this year's new training data. For Romanian, a new language this year, we paid particular attention to language-specific processing of diacritics. For English→Czech, we experimented with a string-to-tree system, first using Treex (formerly TectoMT; Popel and Žabokrtský, 2010), available at http://ufal.mff.cuni.cz/treex, to produce Czech dependency parses, then converting them to constituency representation and extracting GHKM rules.

In the next two sections, we describe the phrase-based systems, first describing the core setup in Section 2 and then describing system-specific extensions and experimental results for each individual language pair in Section 3. We describe the core syntax-based setup and experiments in Sections 4 and 5.

2 Phrase-based System Overview

2.1 Preprocessing

The training data was preprocessed using scripts from the Moses toolkit. We first normalized the data using the normalize-punctuation.perl script, then performed tokenization (using the -a option), and then truecasing. We did not perform any corpus filtering other than the standard Moses method, which removes sentence pairs with extreme length ratios and sentences longer than 80 tokens.

2.2 Word Alignment

For word alignment we used fast_align (Dyer et al., 2013), except for German↔English, where we used MGIZA++ (Gao and Vogel, 2008), followed in both cases by the standard grow-diag-final-and symmetrization heuristic.

2.3 Language Models

Our default approach to language modelling was to train individual models on each monolingual corpus (except CommonCrawl) and then linearly interpolate them to produce a single model. For some systems, we added separate neural or CommonCrawl LMs. Here we outline the various approaches, and in Section 3 we describe the combination used for each language pair.

Interpolated LMs. For individual monolingual corpora, we first used lmplz (Heafield et al., 2013) to train count-based 5-gram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998). We then used the SRILM toolkit (Stolcke, 2002) to linearly interpolate the models using weights tuned to minimize perplexity on the development set.



CommonCrawl LMs. Our CommonCrawl language models were trained in the same way as the individual corpus-specific standard models, but were not linearly interpolated with the other LMs. Instead, the log probabilities of the CommonCrawl LMs were added as separate features of the systems' linear models.

Neural LMs. For some of our phrase-based systems we experimented with feed-forward neural network language models, trained both on target n-grams only and on "joint" or "bilingual" n-grams (Devlin et al., 2014; Le et al., 2012). For training these models we used the NPLM toolkit (Vaswani et al., 2013), for which we have now implemented gradient clipping to address numerical issues often encountered during training.

2.4 Baseline Features

We follow the standard approach to SMT of scoring translation hypotheses using a weighted linear combination of features. The core features of our model are a 5-gram LM score (i.e. log probability), phrase translation and lexical translation scores, word and phrase penalties, and a linear distortion score. The phrase translation probabilities are smoothed with Good-Turing smoothing (Foster et al., 2006). We used the hierarchical lexicalized reordering model (Galley and Manning, 2008) with 4 possible orientations (monotone, swap, discontinuous left and discontinuous right) in both left-to-right and right-to-left direction. We also used the operation sequence model (OSM) (Durrani et al., 2013) with 4 count-based supportive features. We further employed domain indicator features (marking which training corpus each phrase pair was found in), binary phrase count indicator features, sparse phrase length features, and sparse source word deletion, target word insertion, and word translation features (limited to the top K words in each language, typically with K = 50).

2.5 Tuning

Since our feature set (generally around 500 to 1000 features) was too large for MERT, we used k-best batch MIRA for tuning (Cherry and Foster, 2012). To speed up tuning we applied threshold pruning to the phrase table, based on the direct translation model probability.

2.6 Decoding

In decoding we applied cube pruning (Huang and Chiang, 2007) with a stack size of 5000 (reduced to 1000 for tuning), Minimum Bayes Risk decoding (Kumar and Byrne, 2004), a maximum phrase length of 5, a distortion limit of 6, 100-best translation options, and the no-reordering-over-punctuation heuristic (Koehn and Haddow, 2009).

3 Phrase-based Experiments

3.1 Finnish→English

Similar to last year (Haddow et al., 2015), we built an unconstrained system for Finnish→English using data extracted from OPUS (Tiedemann, 2012). Our parallel training set was the same as we used previously, but the language model training set was extended with the addition of the news2015 monolingual corpus and the large WMT16 English CommonCrawl corpus. We used newsdev2015 for tuning and newstest2015 for testing during system development.

One clear problem that we noted with our submission from last year was the large number of OOVs, which were then copied directly into the English output. This is undoubtedly due to the agglutinative nature of Finnish, and was probably the cause of our system being judged poorly by human evaluators, despite having a high BLEU score. To address this, we split the Finnish input into subword units at both training and test time. In particular, we applied byte pair encoding (BPE) to split the Finnish source into smaller units, greatly reducing the vocabulary size. BPE is a technique which has recently been used to good effect in neural machine translation (Sennrich et al., 2016b), where the models cannot handle large vocabularies. It is in fact a merging algorithm, originally designed for compression, which works by starting with a maximally split version of the training corpus (i.e. split into characters) and iteratively merging common clusters. The merging continues for a specified number of iterations, and the merges are collected to form the BPE model. At test time, the recorded merges are applied to the test corpus, with the result that there are no OOVs in the test data. For the experiments here, we used 100,000 BPE merges to create the model.
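The BPE procedure just described is small enough to sketch in full: represent each word as a character sequence, repeatedly merge the most frequent adjacent symbol pair, and record the merges in order. This follows the Sennrich et al. (2016b) recipe in spirit; the word-frequency input format and the end-of-word marker are illustrative choices, not the exact tooling used for the submission.

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # Start from a maximally split vocabulary: one symbol per character,
    # plus an end-of-word marker so merges cannot cross word boundaries.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # everything is fully merged
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():  # apply the merge everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))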



are quite understandable, e.g. into two parts randomly (so as to balance the “born source yös Intian on sanottu olevan kiinnostunut English” and “born Romanian” portions), using puolustusyhteistyösopimuksesta Japanin one for tuning and one for testing. For building the kanssa. final system, and for the contrastive experiments, base India is also said to be interested in puolus- we used the whole of newsdev2016 for tuning, and tusyhteistyösopimuksesta with Japan. newstest2016 for testing. bpe India is also said to be interested in defence In early experiments we noted that both the cooperation agreement with Japan. training and the development data were inconsis- reference India is also reportedly hoping for a tent in their use of diacritics leading to problems deal on defence collaboration between the with OOVs and sparse statistics. To address this two nations. we stripped off all diacritics from the Romanian However applying BPE to Finnish can also re- texts and the result was a significant increase in sult in some rather odd translations when it over- performance in our development setup. We also zealously splits: experimented with different language model com- source Balotelli oli vielä kaukana huippu- binations during development, with our submit- vireestään. ted system using three different language model base Balotelli was still far from huippuvireestään. features: a neural LM trained on just news2015 bpe Baloo, Hotel was still far from the peak of its monolingual, an n-gram language model trained vitality. on the WMT16 English CommonCrawl corpus, reference Balotelli is still far from his top tune. and a linear interpolation of language models We built four language models: an interpolated trained on all other WMT16 English corpora. count-based 5-gram language model with all cor- In Table 1 we show how system performance pora, apart from the WMT16 CommonCrawl; sep- varies under different language model combina- arate count-based language models with WMT16 tion and preprocessing conditions. CommonCrawl and news2015; and a neural LM on news2015. A performance comparison across 3.3 English Romanian different language model combinations, and with → and without BPE is shown in Table 1. For English Romanian, we used all the data → system BLEU in the constrained track, including the Com- fi-en ro-en monCrawl language model data, and as with the Romanian English system, we used news- only interpolated LM 22.9 34.2 → + CommonCrawl LM 23.2 35.0 dev2016 for the final tuning run. + CC LM & news2015 (count) 23.4 34.9 The inconsistent use of diacritics in Romana- + CC LM & news2015 (neural) 23.4 nian text also affected the English Romanian 35.2 → + all 23.4 35.0 system, however removing altogether would be without BPE 22.2 – problematic as we would then need a method for without diacritic removal – 32.2 restoring them for the final system. So the only extra preprocessing we performed on the Roma- Table 1: Comparison of different language model nian was to ensure that “t-comma” and “s-comma” combinations and preprocessing regimes for were written correctly, with a comma rather than a Finnish English and for Romanian English. → → cedilla. The submitted system is shown in bold. The pre- Our final system used two different count- processing variant uses the same language model based 5-gram language models (one trained on all combination as the submitted system. Cased data, including the WMT16 Romanian Common- BLEU scores are on newstest2016. 
Our final system used two different count-based 5-gram language models (one trained on all data, including the WMT16 Romanian CommonCrawl corpus, without pruning, and one trained on news2015 monolingual only), a neural language model trained on news2015 monolingual, and a bilingual language model trained on the parallel data, with a source window of 15 and a target window of 1. In Table 2 we show ablation experiments where we remove each of these language models.
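As background on the interpolated models used throughout this section: linear interpolation collapses several component language models into a single feature by mixing their probabilities with weights that sum to one. A minimal sketch, with toy callables standing in for the real KenLM/SRILM component models:

    def interpolated_prob(word, history, models, weights):
        # P(w | h) = sum_i weight_i * P_i(w | h), with the weights summing to 1.
        assert abs(sum(weights) - 1.0) < 1e-9
        return sum(wt * m(word, history) for m, wt in zip(models, weights))

    # Toy usage with two hypothetical component models:
    news_lm = lambda w, h: 0.02
    crawl_lm = lambda w, h: 0.005
    print(interpolated_prob("the", ("of",), [news_lm, crawl_lm], [0.7, 0.3]))
    # -> 0.0155

In practice the mixture weights are chosen to minimise perplexity on held-out data rather than set by hand.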


    system          BLEU
    submitted       26.8
    + prune all     26.2
    - all           25.6
    - news2015      26.4
    - neural LM     26.6
    - bilingual LM  26.5

Table 2: Effect of each of the language models used in the English→Romanian system. The experiments are not cumulative: we first try pruning the "all" language model, then go back to the unpruned version and remove each LM in turn, observing the effect. The submitted system used all four LMs, and the scores shown are uncased BLEU scores on newstest2016.

3.4 English→German

For the English→German phrase-based system, we exploited several translation factors in addition to word surface forms, in particular: Och clusters (with 50 classes) and part-of-speech tags (Ratnaparkhi, 1996) on the English side, as well as Och clusters (50 classes), morphological tags, and part-of-speech tags on the German side (Schmid, 2000). Recent experiments for our IWSLT 2015 phrase-based system have reconfirmed that English→German translation quality can benefit from these factors when supplementary models over factored representations are used (Huck and Birch, 2015). For WMT16, we utilized the factors in the translation model, in operation sequence models, and in language models (for linearly interpolated 7-gram LMs over Och clusters and morphological tags).

Sparse source word deletion, target word insertion, and word translation features were integrated over the top 200 word surface forms and over selected factors (source and target Och clusters, source part-of-speech tags and target morphological tags). An unpruned 5-gram LM over words that was trained on all German data except the CommonCrawl monolingual corpus was supplemented by a separate pruned LM trained on the CommonCrawl data that had been provided as permissible data for the "constrained" track. Rather than applying a simple linear distortion score, we opted for sparse distortion features as described by Green et al. (2010), which we reimplemented in Moses. We activated sparse distortion features with a feature template based on jump distance, source part-of-speech tags, and target morphological tags.
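A sketch of what such a feature template might look like (the feature names and the clipping threshold are our own illustration, not the actual Moses implementation): each reordering jump fires indicator features that conjoin a clipped jump distance with the source POS tag and the target morphological tag.

    def sparse_distortion_features(jump, src_pos, tgt_morph, max_dist=6):
        # `jump` is the signed distance between consecutively translated
        # source phrases; distances are clipped to limit feature sparsity.
        direction = "L" if jump < 0 else "R"
        dist = min(abs(jump), max_dist)
        yield f"dist_{direction}{dist}"
        yield f"dist_{direction}{dist}_srcpos={src_pos}"
        yield f"dist_{direction}{dist}_tgtmorph={tgt_morph}"

    print(list(sparse_distortion_features(-4, "VVFIN", "V.FIN.3.SG")))
    # ['dist_L4', 'dist_L4_srcpos=VVFIN', 'dist_L4_tgtmorph=V.FIN.3.SG']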
The feature weights for our final system were tuned with hypergraph MIRA (i.e. batch MIRA over lattices representing the decoding search space) on a concatenation of newssyscomb2009 and newstest2008–2012.

3.5 German→English

For phrase-based translation from German, we applied syntactic pre-reordering (Collins et al., 2005) and compound splitting (Koehn and Knight, 2003) in a preprocessing step on the source side. The operation sequence model for the German→English phrase-based system was unpruned. We integrated three language models: an unpruned LM over all English data except the CommonCrawl monolingual corpus; a pruned LM over CommonCrawl; and a pruned LM over the monolingual News Crawl 2015 corpus. In addition to lexical smoothing with the standard lexicon models, we utilized a source-to-target IBM Model 1 (Brown et al., 1993) for sentence-level lexical scoring, in a similar manner as described by Huck et al. (2011) for hierarchical systems. We tuned on the concatenation of newssyscomb2009 and newstest2008–2012.

Unlike last year's system (Haddow et al., 2015), and different from the inverse translation direction (English→German), we refrained from using any factors and instead set up a system that operates over surface form word representations only. In relation to last year's system, we were able to maintain high translation quality as measured in BLEU despite the abandonment of factors, although we suspect that human judgment scores may suffer somewhat from the abandonment of a factored model. We decided to drop the factored representations in favour of gains in decoding efficiency.

We furthermore did not employ any sparse features (sparse phrase length, source word deletion, target word insertion, or word translation features) in the German→English system, since we did not observe any clear gains in preliminary experiments, and sparse features slow down tuning and decoding.

English→German and German→English translation results with our phrase-based systems are given in Table 3.


    system                      de-en                      en-de
                                2013  2014  2015  2016     2013  2014  2015  2016
    last year's phrase-based    27.2  28.8  29.3  33.8     20.8  21.1  22.8  28.3
    this year's phrase-based    27.8  30.0  29.9  35.1     21.5  21.9  23.7  28.4

Table 3: Experimental results with phrase-based systems for German→English and English→German. We report case-sensitive BLEU scores on each of the newstest2013–2016 test sets.

3.6 Spanish→English Biomedical

For our submission to the Spanish→English biomedical task, we created a parallel corpus using all relevant data from WMT13, as well as the extra biomedical data provided by the task organisers, and the EMEA corpus from OPUS (Tiedemann, 2012). In total we had around 16M sentences of parallel data. Our monolingual corpus was made up of three parts: all the English monolingual medical data from WMT14 medical, WMT16 biomedical and EMEA (11M sentences); all the English LDC GigaWord data (180M sentences); and all the English general domain data from WMT16 (240M sentences). We used the monolingual data to build three different language models, which were then linearly interpolated. System tuning was with the SCIELO development data provided for the biomedical task.

4 Syntax-based System Overview

For all syntax-based systems, we used a string-to-tree model based on a synchronous context-free grammar (SCFG) with linguistically-motivated labels on the target side.

4.1 Preprocessing

Except for English-Czech, which we describe separately in Section 5.1, preprocessing was similar to the phrase-based systems (Section 2.3). To parse the target side of the training data, we used the Berkeley parser (Petrov et al., 2006; Petrov and Klein, 2007) for English, and the ParZu dependency parser (Sennrich et al., 2013) for German. Except where stated otherwise, we right-binarized the trees after parsing to increase rule coverage.

4.2 Word Alignment

As in the phrase-based models, we used fast_align for word alignment and the grow-diag-final-and heuristic for symmetrization.

4.3 Language Models

As in the phrase-based systems (Section 2.3), we used linearly-interpolated language models as standard, with some systems adding CommonCrawl and neural LMs. We detail the system-specific combinations in Section 5.

4.4 Rule Extraction

SCFG rules were extracted from the word-aligned parallel data using the Moses implementation (Williams and Koehn, 2012) of the GHKM algorithm (Galley et al., 2004, 2006).

Minimal GHKM rules were composed into larger rules subject to restrictions on the size of the resulting tree fragment. We used the settings shown in Table 4, which were chosen empirically during the development of 2013's systems (Nadejde et al., 2013).

    parameter    unbinarized   binarized
    rule depth   5             7
    node count   20            30
    rule size    5             7

Table 4: Parameter settings for rule composition. The parameters were relaxed for systems that used binarization, to allow for the increase in tree node density.

Further to the restrictions on rule composition, fully non-lexical unary rules were eliminated using the method described in Chung et al. (2011), and rules with scope greater than 3 (Hopkins and Langmead, 2010) were pruned from the translation grammar. Scope pruning makes parsing tractable without the need for grammar binarization.

4.5 Baseline Features

Our core set of string-to-tree feature functions is unchanged from previous years. It includes the n-gram language model's log probability for the target string, the target word count, the rule count, and several pre-computed rule-specific scores. The rule-specific scores were: the direct and indirect translation probabilities; the direct and indirect lexical weights (Koehn et al., 2003); the monolingual PCFG probability of the tree fragment from which the rule was extracted; and a rule rareness penalty.
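To make the scope-based pruning of Section 4.4 concrete, the sketch below computes the scope of a rule's right-hand side under the Hopkins and Langmead (2010) definition: each boundary non-terminal and each adjacent pair of non-terminals contributes one point. The bracket convention for marking non-terminals is our own.

    def rule_scope(rhs):
        # rhs is a list of symbols; non-terminals are bracketed, e.g. "[NP]".
        is_nt = [sym.startswith("[") for sym in rhs]
        scope = int(is_nt[0]) + int(is_nt[-1])                    # boundary NTs
        scope += sum(a and b for a, b in zip(is_nt, is_nt[1:]))   # adjacent NTs
        return scope

    print(rule_scope(["[NP]", "of", "[NP]"]))   # 2 -> kept
    print(rule_scope(["[X]", "[X]", "[X]"]))    # 4 -> pruned (scope > 3)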


4.6 Decoding

Decoding for the string-to-tree models is based on Sennrich's (2014) recursive variant of the CYK+ parsing algorithm, combined with LM integration via cube pruning (Chiang, 2007).

4.7 Tuning

The feature weights for the English→Czech and Finnish→English systems were tuned using the Moses implementation of MERT (Och, 2003). For the remaining systems we used k-best MIRA (Cherry and Foster, 2012) due to the use of sparse features.

We used randomly-chosen subsets of the previous years' test data to speed up decoding.

5 Syntax-based Experiments

5.1 English→Czech

For English→Czech, we used Treex to preprocess and parse the Czech side of the training data. Treex uses the MST parser (McDonald et al., 2005), which produces dependency graphs with non-projective arcs. In order to extract SCFG rules, we first applied the following conversion process: i) the dependency graphs were projectivized using the Malt Parser, which implements the method described in Nivre and Nilsson (2005) (we used the 'Head' encoding scheme); ii) the projective dependency graphs were converted to CFG trees. In addition, we reduced the complex positional tags to simple POS tags by discarding the morphological attributes. The CFG trees were not binarized.

We also experimented with unification-based agreement and case government constraints (Williams and Koehn, 2011; Williams, 2014). Specifically, our constraints were designed to enforce: i) case, gender, and number agreement between nouns and pre-nominal adjectival modifiers; ii) number and person agreement between subjects and verbs; iii) case agreement between prepositions and nouns; iv) use of nominative case for subject nouns. For every Czech word in the training data, we obtained a set of morphological analyses using MorphoDiTa (Straková et al., 2014). From these analyses, we constructed a lexicon of feature structures. For constraint extraction, we used handwritten rules along the lines of those described in Williams (2014).

In preliminary experiments we used a smaller training set, comprising 2 million sentence pairs sampled from OPUS, and monolingual data from last year's WMT translation task. We used two test sets from the HimL project and the Khresmoi test set. Results with and without constraints are shown in Table 5. We used hard constraints and reused the baseline weights (re-tuning did not appear to give additional gains).

    system          HimL1   HimL2   Khresmoi
    baseline        23.3    18.6    20.4
    + constraints   23.6    18.8    20.7

Table 5: Translation results on the development system for English→Czech with unification-based constraints. Cased BLEU scores are shown, averaged over three tuning runs (note that baseline weights are reused in the experiments with constraints).

Although the gains in BLEU were small, previous analysis for German showed that BLEU lacks sensitivity to grammatical improvements when compared to human evaluators (Williams, 2014).

We trained the final system on all of the provided training and monolingual data. In addition to the interpolated LM, we used a model trained on the CommonCrawl data. Results are shown in Table 6.

    system          2015   2016
    baseline        17.3   20.1
    + constraints   17.5   20.2
    + CC LM         17.9   20.9

Table 6: Translation results on the final system for English→Czech with unification-based constraints. Cased BLEU scores are shown. Note that baseline weights are reused in the experiments with constraints.
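The agreement checking described above can be viewed as unification of flat feature structures: two analyses are compatible only if none of their morphological features conflict. A minimal sketch of that core operation (the actual system builds its feature structures from MorphoDiTa analyses and handwritten constraint rules):

    def unify(fs1, fs2):
        # Unify two flat feature structures (dicts from feature to value);
        # return the merged structure, or None on any conflicting value.
        merged = dict(fs1)
        for feature, value in fs2.items():
            if merged.setdefault(feature, value) != value:
                return None
        return merged

    adjective = {"case": "nom", "gender": "fem"}
    noun = {"case": "nom", "gender": "fem", "number": "sg"}
    print(unify(adjective, noun))             # agreement holds: merged structure
    print(unify(adjective, {"case": "acc"}))  # case clash -> None (rejected)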


5.1.1 Manual Analysis

We carried out a small manual analysis of the submitted system with and without unification-based constraints (the CC LM was used in both cases). In order to remove the effect of tuning variance, we used the same model weights in both cases (the weights were learned on the version without constraints). The BLEU scores of the two systems were 20.9 (with constraints) and 20.7 (without constraints). A large majority of the outputs (81% of the 2999 sentences in newstest2016) are identical.

Looking at a sample of 100 sentences with some differences, we classified the differing areas to see in which aspects the outputs of the two systems differ. In total, there were 104 such areas (some sentences had more than one area of interest).

Table 7 summarizes the overall evaluation of these areas (the annotation was not blind; we knew which system was which). The majority of the areas were of equal quality, in fact equally bad overall, so neither of the compared systems delivered an acceptable translation.

    Much Better   Better   Equal   Worse   Crazy Reordering
    4             41       44      12      3

Table 7: Manual evaluation of translations proposed by the English→Czech system with unification constraints vs. the same system without constraints.

In 4 cases, the system with constraints delivered a much better translation, and three of those were an overall improvement of the sentence structure. In 41 cases, the area was better for various reasons. Most frequently (16 cases), this was indeed the agreement within noun and prepositional phrases (e.g. an adjective matching the case required by the preposition). In 9 additional cases, the NP or PP was better translated, but in aspects other than morphological case, number, or gender. For instance, the baseline system translated the phrase "between the departments of individual hospitals" as "between the individual departments of the hospitals" (in morphologically well-formed Czech). Beyond better NPs and PPs, the constraints also helped overall sentence or clause structure (5 cases), lexical choice (4 cases), and verbs and their dependents (2 cases).

In 15 cases, the constraints forced the system to select a worse translation, damaging sentence structure or lexical choice, spuriously introducing negation, etc. We highlight 3 of these cases, where the system with constraints accidentally moved words far away from their correct location ("Crazy Reordering" in Table 7). This suggests that, due to sparse data, the application of constraints should be better balanced with respect to other parts of the model. In contrast to German, targeting Czech usually does not need long-distance reordering, and attempting it risks more serious translation errors than sticking to the English word order.

Since the hard unification constraints effectively only rule out some of the possible translations (i.e. reduce the search space), we conclude that having to obey mere agreement constraints helps to select a better hypothesis over a surprisingly large span of words, improving overall sentence structure on average.

5.2 English→German

This year's string-to-tree submission for English→German is similar to last year's system (Williams et al., 2015). In addition to the baseline feature functions, it contains a count-based 5-gram neural network language model (NPLM) (Vaswani et al., 2013), a relational dependency language model (RDLM) (Sennrich, 2015), and soft source-syntactic constraints (Huck et al., 2014). The parameters of the model are tuned towards the linear interpolation of BLEU and the syntactic metric HWCM (Liu and Gildea, 2005; Sennrich, 2015). Trees are transformed through binarization and a hierarchical representation of morphologically complex words (Sennrich and Haddow, 2015).

For the soft source-syntactic constraints, we annotate the source text with the Stanford neural network dependency parser (Chen and Manning, 2014), along with heuristic projectivization (Nivre and Nilsson, 2005).

Results are shown in Table 8. We report results of last year's system (Williams et al., 2015), which was ranked (joint) first at WMT15. Our improvements this year stem from particle verb restructuring (Sennrich and Haddow, 2015) and the use of the new monolingual News Crawl 2015 corpus for the Kneser-Ney language model. (The neural language models were trained on last year's training data.)

    system                          dev    test
    last year's system              24.0   29.3
    + particle verb restructuring   24.4   30.2
    + News 2015 training data       24.5   30.6

Table 8: Translation results of the English→German string-to-tree translation system on dev (newstest2015) and test (newstest2016).


5.3 Finnish→English

Our Finnish→English syntax-based system was similar to last year's (Williams et al., 2015). The main difference from the basic setup of Section 4 is that we preprocessed the Finnish data to segment words into morphemes. We also added a CommonCrawl language model in addition to the interpolated LM.

For segmentation, we used Morfessor 2.0 with default settings, first training a segmentation model, then using it to segment all words in the source-side training and test data. Morfessor takes a set of word types as input, and we found that it was important for translation quality to use a large training vocabulary. Table 9 gives mean BLEU scores for this setup, averaged over three MERT runs. Our baseline is the standard string-to-tree setup (i.e. without segmentation and without the CommonCrawl LM). For segmentation, we experimented with varying amounts of training data, initially using the Finnish side of the provided parallel corpora, then adding the monolingual Finnish data (apart from CommonCrawl), and finally adding 10% of the CommonCrawl vocabulary (we extracted the full vocabulary from CommonCrawl and then randomly sampled 10%). We found that using larger amounts of training data was prohibitively slow.

    system                       2015   2016
    baseline                     16.0   18.2
    + Morfessor (all parallel)   16.8   19.1
    + Morfessor (non-CC mono)    17.6   20.1
    + Morfessor (10% CC)         17.9   20.1
    + CC LM                      18.0   20.3

Table 9: Comparison of different preprocessing and language model regimes for Finnish→English (syntax-based). Cased BLEU scores are given for the newstest2015 and newstest2016 test sets, averaged over three tuning runs.
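A sketch of the vocabulary sampling behind the "10% CC" setting, sampling word types (not tokens) from the CommonCrawl vocabulary before Morfessor training; the file interface and seed are our own assumptions:

    import random

    def sample_word_types(vocab_path, fraction=0.10, seed=42):
        # Read one word type per line and keep a random fraction of them.
        with open(vocab_path, encoding="utf-8") as f:
            types = sorted({line.strip() for line in f if line.strip()})
        rng = random.Random(seed)
        return set(rng.sample(types, int(len(types) * fraction)))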
We report the cased BLEU → scores for different setups of our system in Ta- For German English we built a string-to-tree → ble 11. system with a similar setup to last year’s (Williams et al., 2015). In addition we used sparse fea- 6 Conclusion tures to determine the non-terminal labels for un- 2The neural language models were trained on last year’s The Edinburgh team built a total of 11 phrase- training data. based and syntax-based translation systems us-


    system                        dev    test
    baseline (phrase-structure)   33.9   32.9
    + UNK NT labels               34.2   33.0
    + CommonCrawl LM*             35.2   33.6
    contrastive (dependency)
    + UNK NT labels               33.7   32.3

Table 11: Translation results of the Romanian→English string-to-tree translation system on dev (half of newsdev2016) and test (newstest2016). *submitted system.

6 Conclusion

The Edinburgh team built a total of 11 phrase-based and syntax-based translation systems using the open-source Moses toolkit. Our Finnish→English and Romanian→English systems ranked first according to cased BLEU on the newstest2016 evaluation set (see http://matrix.statmt.org/?mode=all&test_set[id]=23).

Acknowledgments

We are grateful to Martin Popel for assistance with Treex. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC) and 644402 (HimL).

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19(2):263–311.

Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 740–750.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report, Harvard University.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada, pages 427–436.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics 33(2):201–228.

Tagyoung Chung, Licheng Fang, and Daniel Gildea. 2011. Issues Concerning Decoding with Synchronous Context-free Grammar. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA, pages 413–417.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause Restructuring for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Ann Arbor, MI, USA, pages 531–540.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, MD, USA, pages 1370–1380.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, pages 399–405.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, GA, USA, pages 644–648.

Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS 2010. Saarbrücken, Germany.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. pages 363–370.


George Foster, Roland Kuhn, and Howard Johnson. 2006. Phrasetable Smoothing for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP). Sydney, Australia, pages 53–61.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Morristown, NJ, USA, pages 961–968.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a Translation Rule? In HLT-NAACL '04.

Michel Galley and Christopher D. Manning. 2008. A Simple and Effective Hierarchical Phrase Reordering Model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Honolulu, HI, USA, pages 848–856.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP '08). Stroudsburg, PA, USA, pages 49–57.

Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved Models of Distortion Cost for Statistical Machine Translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California, pages 867–875.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal, pages 126–133.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, pages 690–696.

Mark Hopkins and Greg Langmead. 2010. SCFG decoding without binarization. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA, pages 646–655.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague, Czech Republic, pages 144–151.

Matthias Huck and Alexandra Birch. 2015. The Edinburgh Machine Translation Systems for IWSLT 2015. In Proceedings of the International Workshop on Spoken Language Translation. Da Nang, Vietnam, pages 31–38.

Matthias Huck, Hieu Hoang, and Philipp Koehn. 2014. Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar, pages 148–156.

Matthias Huck, Saab Mansour, Simon Wiesler, and Hermann Ney. 2011. Lexicon Models for Hierarchical Phrase-Based Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation. San Francisco, CA, USA, pages 191–198.

Philipp Koehn and Barry Haddow. 2009. Edinburgh's Submission to all Tracks of the WMT 2009 Shared Task with Reordering and Speed Improvements to Moses. In Proceedings of the Fourth Workshop on Statistical Machine Translation. Athens, Greece, pages 160–164.

Philipp Koehn and Kevin Knight. 2003. Empirical Methods for Compound Splitting. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Budapest, Hungary, pages 187–194.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown, NJ, USA, pages 48–54.


Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In HLT-NAACL 2004: Main Proceedings. Boston, MA, USA, pages 169–176.

Hai Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics. Montréal, Canada, pages 39–48.

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan, pages 25–32.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's Syntax-Based Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation. Sofia, Bulgaria, pages 170–176.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Ann Arbor, Michigan, pages 99–106.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1 (ACL '03). Morristown, NJ, USA, pages 160–167.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL-44). pages 433–440.

Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Rochester, New York, pages 404–411.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010). Springer, Berlin/Heidelberg, volume 6233 of Lecture Notes in Computer Science, pages 293–304.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Philadelphia, PA, USA.

Helmut Schmid. 2000. LoPar: Design and Implementation. Bericht des Sonderforschungsbereiches "Sprachtheoretische Grundlagen für die Computerlinguistik" 149, Institute for Computational Linguistics, University of Stuttgart.

Rico Sennrich. 2014. A CYK+ Variant for SCFG Decoding Without a Dot Chart. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar, pages 94–102.

Rico Sennrich. 2015. Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics 3:169–182.

Rico Sennrich and Barry Haddow. 2015. A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 2081–2087.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16). Berlin, Germany.


Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013. Hissar, Bulgaria, pages 601–609.

Rico Sennrich, Philip Williams, and Matthias Huck. 2015. A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge. Computer Speech & Language 32(1):27–45.

Andreas Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). Denver, CO, USA, volume 3.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pages 13–18.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Seattle, Washington, USA, pages 1387–1392.

Philip Williams. 2014. Unification-based Constraints for Statistical Machine Translation. Ph.D. thesis, University of Edinburgh.

Philip Williams and Philipp Koehn. 2011. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland, pages 217–226.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation. Montréal, Canada, pages 388–394.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler, and Philipp Koehn. 2014. Edinburgh's Syntax-Based Systems at WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA, pages 207–214.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, and Philipp Koehn. 2015. Edinburgh's Syntax-Based Systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal, pages 199–209.


E Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation

Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation

Maria Nădejde, School of Informatics, University of Edinburgh, [email protected]
Alexandra Birch, School of Informatics, University of Edinburgh, [email protected]
Philipp Koehn, Department of Computer Science, Johns Hopkins University, [email protected]

Abstract

We address the problem of mistranslated predicate-argument structures in syntax-based machine translation. This paper explores whether knowledge about semantic affinities between the target predicates and their argument fillers is useful for translating ambiguous predicates and arguments. We propose a selectional preference feature based on the selectional association measure of Resnik (1996) and integrate it in a string-to-tree decoder. The feature models selectional preferences of verbs for their core and prepositional arguments, as well as selectional preferences of nouns for their prepositional arguments.

We compare our features with a variant of the neural relational dependency language model (RDLM) (Sennrich, 2015) and find that neither of the features improves automatic evaluation metrics. We conclude that mistranslated verbs, errors in the target syntactic trees produced by the decoder, and underspecified syntactic relations are negatively impacting these features.

1 Introduction

Syntax-based machine translation systems have had some success when applied to language pairs with major structural differences, such as German-English or Chinese-English. Modeling the target-side syntactic structure is important in order to produce grammatical, fluent translations, and could be an intermediate step on which to build a semantic representation of the target sentence. However, these systems still suffer from errors such as scrambled or mistranslated predicate-argument structures. We give a few examples of such errors in Table 1. In example a) the baseline system MT1 mistranslates the verb besichtigt as viewed. The system MT2, which uses information about the semantic affinity between the verb and its argument, produces the correct translation visited. The semantic affinity score, shown on the right, for the verb visited and argument trip in the syntactic relation prep_on indicates a stronger affinity than for the baseline translation. In example b) the baseline system MT1 mistranslates the noun Aufnahmen as recordings, while the system MT2 produces the correct translation images, which is a better fit for the prepositional modifier from the telescope.

Syntax-based MT systems handle long-distance reordering with synchronous translation rules such as:

    root → ⟨ RB∼0 VBZ∼1 sich nsubj∼2 prep∼3 ,  RB∼0 nsubj∼2 VBZ∼1 prep∼3 ⟩

This rule is useful for reordering the verb and its arguments according to the target-side word order. However, the rule does not contain the lexical head of the verb, the subject, or the prepositional modifier. Therefore the entire predicate-argument structure is translated by subsequent independent rules. The language model context will capture at most the verb and one main argument. Due to the lack of a larger source or target context, the resulting predicate-argument structures are often not semantically coherent.

This paper explores whether knowledge about semantic affinities between the target predicates and their argument fillers is useful for translating ambiguous predicates and arguments. We propose a selectional preference feature for string-to-tree statistical machine translation based on the information-theoretic measure of Resnik (1996).
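Purely as an illustration of how the co-indexed non-terminals drive reordering (decoding actually operates over a hypergraph of such rules, not flat lists), the target side of the rule above permutes the translated constituents as follows; the example constituents are hypothetical:

    def apply_rule_target_order(constituents, target_order=(0, 2, 1, 3)):
        # Source side: RB~0 VBZ~1 sich nsubj~2 prep~3
        # Target side: RB~0 nsubj~2 VBZ~1 prep~3
        # (the German reflexive "sich" has no target-side counterpart)
        return [constituents[i] for i in target_order]

    print(apply_rule_target_order(
        ["currently", "focuses", "the company", "on exports"]))
    # -> ['currently', 'the company', 'focuses', 'on exports']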

Published in the Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers, pages 32–42, Berlin, Germany, August 11-12, 2016. © 2016 Association for Computational Linguistics.


    a)  SRC  Bei nur einer Reise können nicht alle davon besichtigt werden.
        REF  You won't be able to visit all of them on one trip.
        MT1  Not all of them can be viewed on only one trip.    (prep_on, viewed, trip)    -0.154
        MT2  Not all of them can be visited on only one trip.   (prep_on, visited, trip)    1.042

    b)  SRC  Eine der schärfsten Aufnahmen des Hubble-Teleskops
        REF  One of the sharpest pictures from the Hubble telescope
        MT1  One of the strongest recordings of the Hubble telescope   (prep_of, recordings, telescope)   -0.0004
        MT2  One of the strongest images from the Hubble telescope     (prep_from, images, telescope)      0.3917

Table 1: Examples of errors in the predicate-argument structure produced by a syntax-based MT system: a) mistranslated verb, b) mistranslated noun. The (relation, predicate, argument) triples and their semantic affinity scores are shown on the right. Higher scores indicate a stronger affinity; negative scores indicate a lack of affinity.

The feature models selectional preferences of verbs for their core and prepositional arguments, as well as selectional preferences of nouns for their prepositional arguments.

Previous work has addressed the selectional preferences of prepositions for noun classes (Weller et al., 2014), but not the semantic affinities between a predicate and its argument class. Another line of research on improving translation of predicate-argument structures includes modeling reordering and deletion of semantic roles (Wu and Fung, 2009; Liu and Gildea, 2010; Li et al., 2013). These models, however, do not encode information about the lexical semantic affinities between target predicates and their arguments. Sennrich (2015) proposes a relational dependency language model (RDLM) for string-to-tree machine translation. One component of RDLM predicts the head word of a dependent conditioned on a wide syntactic context. Our feature is different, as it quantifies the amount of information that the predicate carries about the argument class filling a particular syntactic function.

For one variant of the proposed feature we found a slight improvement in automatic evaluation metrics when translating short sentences, as well as an increase in precision for verb translation. However, the features generally did not improve automatic evaluation metrics. We conclude that mistranslated verbs, errors in the target syntactic trees produced by the decoder, and underspecified syntactic relations are negatively impacting these features.

The paper is structured as follows. Section 2 describes related work on improving translation of predicate-argument structures. Section 3 introduces the selectional preference feature. Section 4 describes the experimental setup, and Section 5 presents the results of automatic evaluation as well as a qualitative analysis of the machine-translated output.

2 Related work

From a syntactic perspective, a correct predicate-argument structure will have the sub-categorization frame of the predicate filled in. Weller et al. (2013) use sub-categorization information to improve case prediction for noun phrases when translating into German. Case prediction for noun phrases is important in German, as case indicates the grammatical function. Their approach, however, did not produce strong improvements over the baseline. From a large corpus annotated with dependency relations, they extract verb-noun tuples and their associated syntactic functions: direct object, indirect object, subject. They also extract triples of verb-preposition-noun in order to predict the case of noun phrases within prepositional phrases. The probabilities of such tuples and triples are computed using relative frequencies and then used as a feature for a CRF classifier that predicts the case of noun phrases. Weller et al. (2013) apply the CRF classifier to the output of a word-to-stem phrase-based translation system as a post-processing step. In contrast, our model is used directly as a feature in the decoder. While Weller et al. (2013) identify the arguments of the verb and their grammatical function by projecting the information from the source sentence, we use the dependency tree produced by the string-to-tree decoder. We also consider prepositional modifiers of nouns.


Weller et al. (2014) propose using noun class information to model selectional preferences of prepositions in a string-to-tree translation system. They use the noun class information to annotate PP translation rules in order to restrict their applicability to specific semantic classes. In our work we do not impose hard constraints on the translation rules, but rather soft constraints, using our model as a feature in the decoder. While we use word embeddings to cluster arguments, Weller et al. (2014) experiment with a lexical semantic taxonomy and with clustering words based on co-occurrences within a window or on syntactic features extracted from dependency-parsed data.

Modeling reordering and deletion of semantic roles (Wu and Fung, 2009; Liu and Gildea, 2010; Li et al., 2013) has been another line of research on improving translation of predicate-argument structures. Liu and Gildea (2010) propose modeling reordering of a complete semantic frame, while Li et al. (2013) propose finer-grained features that distinguish between predicate-argument reordering and argument-argument reordering. Gao and Vogel (2011) and Bazrafshan and Gildea (2013) annotate target non-terminals with the semantic roles they cover in order to extract synchronous grammar rules that cover the entire predicate-argument structure. These models, however, do not encode information about the lexical semantic affinities between target predicates and their arguments.

In this work we focus on using selectional preferences over predicates and arguments in the target language, as this is a simple way of leveraging external knowledge in the translation framework.

3 Selectional Preference Feature

3.1 Learning Selectional Preferences

Selectional preferences describe the semantic affinities between predicates and their argument fillers. For example, the verb "drinks" has a strong preference for arguments in the conceptual class of "liquids". Therefore the word "wine" can be disambiguated when it appears in relation to the verb "drinks". A corpus-driven approach to modeling selectional preferences usually involves extracting triples of (syntactic relation, predicate, argument) and computing co-occurrence statistics. The predicate and argument are represented by their head words, and the triples are extracted from automatically parsed data. Another typical step is generalizing over seen arguments. Approaches to generalization include using an ontology such as WordNet (Resnik, 1996), distributional semantic similarity (Erk et al., 2010; Ó Séaghdha, 2010; Ritter et al., 2010), clustering (Sun and Korhonen, 2009), multi-modal datasets (Shutova et al., 2015), and neural networks (Cruys, 2014).

Our feature is based on the measure proposed by Resnik (1996). It uses unsupervised clusters to generalize over seen arguments. Resnik (1996) uses selectional preferences of predicates for word sense disambiguation. The information-theoretic measure for selectional preference proposed by Resnik quantifies the difference between the posterior distribution of an argument class given the verb and the prior distribution of the class. For instance, "person" has a higher prior probability than "insect" to appear in the subject relation, but, knowing the verb is "fly", the posterior probability becomes higher for "insect".

Resnik's model defines the selectional preference strength of a predicate as:

    SelPref(p, r) = KL( P(c | p, r) || P(c | r) ) = \sum_c P(c | p, r) \log \frac{P(c | p, r)}{P(c | r)}    (1)

where KL is the Kullback-Leibler divergence, r is the relation type, p is the predicate, and c is the conceptual class of the argument. Resnik uses WordNet to obtain the conceptual classes of arguments, therefore generalizing over seen arguments. The selectional association, or semantic affinity, between a predicate and an argument class is quantified as the relative contribution of the class towards the overall selectional strength of the predicate:

    SelAssoc(p, r, c) = \frac{ P(c | p, r) \log \frac{P(c | p, r)}{P(c | r)} }{ SelPref(p, r) }    (2)

We give examples of the selectional preference strength and selectional association scores for different verbs and their arguments in Table 2. The verb see takes many arguments as direct objects and therefore has a lower selectional preference strength for this syntactic relation. In contrast, the predicate hereditary takes fewer arguments, for which it has a stronger selectional preference.


    Verb            Relation   SelPref   Argument   SelAssoc
    see             dobj       0.56      PRN        0.123
                                         movie      0.022
                                         episode    0.001
    is–hereditary   nsubj      1.69      disease    0.267
                                         monarchy   0.148
                                         title      0.082
    drink           dobj       3.90      water      0.144
                                         wine       0.061
                                         glass      0.027

Table 2: Examples of selectional preference (SelPref) and selectional association (SelAssoc) scores for different verbs. PRN is the class of pronouns.
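A minimal sketch of Eqs. (1)-(2), estimated from counted (relation, predicate, argument-class) triples; the interface and names are our own, and the argument column is assumed to already hold classes (e.g. cluster ids):

    import math
    from collections import Counter, defaultdict

    def selpref_and_assoc(triples):
        # triples: iterable of (relation, predicate, arg_class) observations.
        class_given_rel = defaultdict(Counter)    # counts of c given r
        class_given_pred = defaultdict(Counter)   # counts of c given (p, r)
        for r, p, c in triples:
            class_given_rel[r][c] += 1
            class_given_pred[(p, r)][c] += 1

        selpref, selassoc = {}, {}
        for (p, r), counts in class_given_pred.items():
            n_pr = sum(counts.values())
            n_r = sum(class_given_rel[r].values())
            contrib = {}
            for c, n in counts.items():
                posterior = n / n_pr                    # P(c | p, r)
                prior = class_given_rel[r][c] / n_r     # P(c | r)
                contrib[c] = posterior * math.log(posterior / prior)
            strength = sum(contrib.values())            # Eq. (1): KL divergence
            selpref[(p, r)] = strength
            selassoc[(p, r)] = (                        # Eq. (2): per-class share
                {c: t / strength for c, t in contrib.items()} if strength else {})
        return selpref, selassoc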

Several selectional preference models have been used as features in discriminative syntactic parsing systems. Cohen et al. (2012) observe that when parsing out-of-domain data many attachment errors occur for the following syntactic configurations: head (V or N) – prep – obj and head (N) – adj. The authors propose a class-based measure of selectional preferences for these syntactic configurations and learn the argument classes using Latent Dirichlet Allocation (LDA). Kiperwasser and Goldberg (2015) compare different measures of lexical association between head word and modifier word for improving dependency parsing. Their results show that the association measure based on pointwise mutual information (PMI) has similar generalization capabilities as a measure of distributional similarity between word embeddings. van Noord (2007) has shown that bilexical association scores computed using PMI for all types of dependency relations are a useful feature for improving dependency parsing in Dutch.

3.2 Adaptation of Selectional Preference Models for Syntax-Based Machine Translation

We are interested in modeling selectional preferences of verbs for their core and prepositional arguments, as well as selectional preferences of nouns for their prepositional arguments. We identify the relation between a predicate and its modifier from the dependency tree produced by the string-to-tree machine translation system. Since we are interested in using the feature during decoding, we need the model to be fast to query and to have broad coverage.

Our selectional preference feature is a variant of the information-theoretic measure of Resnik (1996) defined in Eq 2. While Resnik uses the WordNet classes of the arguments, this is not appropriate for a machine translation task where the vocabulary has millions of words and English is not the only targeted language. Therefore we adapt Resnik's selectional association measure in two ways.

In the first model, SelAssoc_L, we compute the co-occurrence statistics defined in Eq 2 over lemmas of the predicate and argument head words.

In the second model, SelAssoc_C, we replace the WordNet classes in Eq 2 with word clusters (we have not done experiments with WordNet classes). We obtain the word clusters by applying the k-means algorithm to GloVe word embeddings (Pennington et al., 2014).

Prepositional phrase attachment remains a frequent and challenging error for syntactic parsers (Kummerfeld et al., 2012), and translation of prepositions is a challenge for SMT (Weller et al., 2014). Therefore we decided to use two separate features: one for main arguments (nsubj, nsubjpass, dobj, iobj) and one for prepositional arguments.

3.3 Comparison with a Neural Relational Dependency Language Model

Sennrich (2015) proposes a relational dependency language model (RDLM) for string-to-tree machine translation, which he trains using a feed-forward neural network. For a sentence S with symbols w_1, w_2, ..., w_n and dependency labels l_1, l_2, ..., l_n, with l_i the label of the incoming arc at position i, RDLM is defined as:


    P(S, D) ≈ \prod_{i=1}^{n} P_l(i) \cdot P_w(i)
    P_l(i) = P( l_i | h_s(i)_1^q, l_s(i)_1^q, h_a(i)_1^r, l_a(i)_1^r )
    P_w(i) = P( w_i | h_s(i)_1^q, l_s(i)_1^q, h_a(i)_1^r, l_a(i)_1^r, l_i )    (3)

where, for each of the q siblings and r ancestors of w_i, h_s and h_a are their head words and l_s and l_a their dependency labels. The P_w(i) distribution models similar information as our proposed feature SelAssoc. However, we use h_a(i)_1, l_i as context and consider only a subset of dependency labels: nsubj, nsubjpass, dobj, iobj, prep. The reduced context alleviates problems of data sparsity and is more reliably extracted at decoding time. The subset of dependency relations identifies arguments for which predicates might exhibit selectional preferences. Our feature is different from RDLM–P_w as it quantifies the amount of information that the predicate carries about the argument class filling a particular syntactic function. We hypothesize that such information is useful when translating arguments that appear less frequently in the training data but are prototypical for certain predicates. For example, the triples (bus, drive, dobj) and (van, drive, dobj) have the following log posterior probabilities and SelAssoc scores: log P(bus | drive, dobj) = -5.44, log P(van | drive, dobj) = -5.58, and SelAssoc(bus, drive, dobj) = 0.0079, SelAssoc(van, drive, dobj) = 0.0103.

[Figure 1: Example of a translation ("the Prime Minister of India and Japan met in Tokyo .") and its dependency tree in constituency representation, produced by the string-to-tree SMT system. The triples extracted during decoding are shown on the right: (nsubj, met, Minister), (prep_in, met, Tokyo), (prep_of, Minister, India).]

4 Experimental setup

Our baseline system for translating German into English is the Moses string-to-tree toolkit implementing GHKM rule extraction (Galley et al., 2004, 2006; Williams and Koehn, 2012). The string-to-tree translation model is based on a synchronous context-free grammar (SCFG) that is extracted from word-aligned parallel data with target-side syntactic annotation. The system was trained on all available data provided at WMT15 (Bojar et al., 2015; see http://www.statmt.org/wmt15/translation-task.html). The number of sentences in the training, tuning and test sets are shown in Table 3. We use the following rule extraction parameters: Rule Depth = 5, Node Count = 20, Rule Size = 5. At decoding time we give a high penalty to glue rules and allow non-terminals to span a maximum of 50 words. We train a 5-gram language model on all available monolingual data (the target side of the parallel corpus, the monolingual English News Crawl, Gigaword and news-commentary) using the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing (Chen and Goodman, 1998) for training, and KenLM (Heafield, 2011) for language model scoring during decoding.

    Train       Tune   Test
    4,472,694   2000   8172

Table 3: Number of sentences in the training, tuning and test sets. The test set consists of the WMT newstest2013, 2014 and 2015.

The English side of the parallel corpus is annotated with dependency relations using the Stanford dependency parser (Chen and Manning, 2014). The dependency structure is then converted to a constituency representation, which is needed to run the GHKM rule extraction. We use the conversion algorithm and the head word extraction method described in Sennrich (2015).
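A sketch of triple extraction over such collapsed Stanford-style dependency analyses, using the Figure 1 example; the list-based parse interface is our own simplification:

    MAIN_ARGS = {"nsubj", "nsubjpass", "dobj", "iobj"}

    def extract_triples(tokens, heads, labels):
        # tokens[i] has head index heads[i] (None for the root or for tokens
        # without an incoming collapsed relation) and label labels[i].
        for i, rel in enumerate(labels):
            if rel is None or heads[i] is None:
                continue
            predicate = tokens[heads[i]]
            if rel in MAIN_ARGS or rel.startswith("prep_"):
                yield (rel, predicate, tokens[i])

    tokens = ["the", "Prime", "Minister", "of", "India", "and",
              "Japan", "met", "in", "Tokyo", "."]
    heads  = [2, 2, 7, None, 2, None, 4, None, None, 7, 7]
    labels = ["det", "nn", "nsubj", None, "prep_of", None,
              "conj_and", None, None, "prep_in", "punct"]
    print(list(extract_triples(tokens, heads, labels)))
    # [('nsubj', 'met', 'Minister'), ('prep_of', 'Minister', 'India'),
    #  ('prep_in', 'met', 'Tokyo')]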


For training the selectional preference features, we extract triples of (dependency relation, predicate, argument) from parsed data, where the predicate and argument are identified by their head words. We use the English side of the parallel data and the Gigaword v.5 corpus parsed with Stanford typed dependencies (Napoles et al., 2012). We use Stanford dependencies in the collapsed version, which resolves coordination and collapses the prepositions (coordination is not resolved at decoding time). Figure 1 shows an example of a translated sentence, its dependency tree produced by the string-to-tree system, and the triples extracted at decoding time. We consider the following main arguments: nsubj, nsubjpass, dobj, iobj, and prep arguments attached to both verbs and nouns. Table 4 shows the number of extracted triples.

    Type of relation   Number of triples
    main                     540,109,283
    prep                     810,118,653
    nsubj                    315,852,775
    nsubjpass                 32,111,962
    dobj                     188,412,178
    iobj                       3,732,368

Table 4: Number of relation triples extracted from parsed data. The data consists of the English side of the parallel data and Gigaword. Main arguments include: nsubj, nsubjpass, dobj, iobj.

We integrate the feature in a bottom-up chart decoder. The feature has several scores (a sketch of the aggregation follows this list):

• A counter for the dependency triples covered by the current hypothesis.
• A selectional association score aggregated over all main arguments: nsubj, nsubjpass, dobj, iobj.
• A selectional association score aggregated over all prepositional arguments, with no distinction between noun and verb modifiers.
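A minimal sketch of that aggregation; the names are illustrative, with `selassoc` being the table from Eq. (2) and `cluster_of` mapping argument lemmas to their k-means cluster:

    MAIN_ARGS = {"nsubj", "nsubjpass", "dobj", "iobj"}

    def hypothesis_scores(triples, selassoc, cluster_of):
        # Returns the three decoder scores for the triples covered so far.
        count, main_sum, prep_sum = 0, 0.0, 0.0
        for rel, pred, arg in triples:
            count += 1
            cls = cluster_of.get(arg, arg)   # lemma model: use the lemma itself
            score = selassoc.get((pred, rel), {}).get(cls, 0.0)
            if rel in MAIN_ARGS:
                main_sum += score
            elif rel.startswith("prep"):
                prep_sum += score
        return count, main_sum, prep_sum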

We integrate the feature in a bottom-up chart decoder. The feature has several scores:

• A counter for the dependency triples covered by the current hypothesis.

• A selectional association score aggregated over all main arguments: nsubj, nsubjpass, dobj, iobj.

• A selectional association score aggregated over all prepositional arguments, with no distinction between noun and verb modifiers.

For both tuning and evaluation of all machine translation systems we use a combination of the cased BLEU score and the head-word chain metric (HWCM) (Liu and Gildea, 2005). The HWCM metric implemented in the Moses toolkit computes the harmonic mean of precision and recall over head-word chains of length 1 to 4. The head-word chains are extracted directly from the dependency tree produced by the string-to-tree decoder and from the parsed reference. Tuning is performed using batch MIRA (Cherry and Foster, 2012) on 1000-best lists. We report evaluation scores averaged over the newstest2013, newstest2014 and newstest2015 data sets provided by WMT15.
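A simplified reading of the head-word chain metric is sketched below; the actual Moses implementation may differ in details such as how chain counts are clipped.

    from collections import Counter

    def chains(tokens, max_len=4):
        # Tokens are (index, word, head_index); a chain starts at a word
        # and repeatedly moves to the head word, yielding all chains of
        # length 1..max_len.
        words = {i: w for i, w, h in tokens}
        heads = {i: h for i, w, h in tokens}
        out = Counter()
        for i, _, _ in tokens:
            chain, node = [], i
            for _ in range(max_len):
                if node not in words:
                    break          # reached the artificial root
                chain.append(words[node])
                out[tuple(chain)] += 1
                node = heads[node]
        return out

    def hwcm(hyp_tokens, ref_tokens):
        # Harmonic mean of precision and recall over head-word chains.
        h, r = chains(hyp_tokens), chains(ref_tokens)
        overlap = sum((h & r).values())   # clipped chain matches
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0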


5 Evaluation

5.1 Error analysis

We wanted to get an idea of how often the verb and its arguments are mistranslated. For this purpose we manually annotated errors in sentences with more than 5 words and at most 15 words. With this criterion we avoided translations with scrambled predicate-argument structures; each sentence had roughly one main verb.

To obtain a more reliable error annotation, we first post-edited 100 translations from the baseline system. We then compared the translations with their post-editions and annotated error categories using the BLAST tool (Stymne, 2011). We counted a sense error when there was a wrong lexical choice for the head of a main argument, a prepositional modifier or the main verb. We also annotated mistranslated prepositions.

    Error Category   Error Count   Total
    Preposition               18     143
    Sense                     53     388
    Main argument             18     145
    Prep modifier              9     143
    Main verb                 26     100

Table 5: Number of mistranslated words in 100 sentences manually annotated with error categories.

In Table 5 we can see that 26 percent of the verbs are mistranslated and about 10 percent of the arguments. Mistranslated verbs are problematic since the feature produces the selectional association scores for the wrong verb. Although the semantic affinity is mutual, the formulation of the score conditions on the verb. When both the verb and the argument are mistranslated, the association score might be high even though the translation is not faithful to the source.

5.2 Evaluation of the Selectional Preference Feature

First, we determine the effectiveness of our selectional association features. We compare the two different selectional association features described in Section 3.2: SelAssoc_L and SelAssoc_C. We report the results of automatic evaluation in Table 6. Neither of the features improved the automatic evaluation scores. SelAssoc_L suffers from data sparsity, while the SelAssoc_C feature overgeneralizes due to noisy clustering. Adding both features compensates for these issues; however, we only see a slight improvement in BLEU scores for shorter sentences (2,701 sentences with more than 5 words and at most 15 words): 25.59 compared to 25.40 for the baseline system. We further investigate whether sparse features are more informative.

    System                            BLEU-c          HWCM
    Baseline                          26.45           24.47
    + SelAssoc_L                      26.41 (-.04)    24.52 (+.05)
    + SelAssoc_C                      26.48 (+.03)    24.54 (+.07)
    + SelAssoc_L + SelAssoc_C         26.48 (+.03)    24.47 (+.00)
    + Bin(SelAssoc_L + SelAssoc_C)    26.37 (-.08)    24.53 (+.06)
    + RDLM-P_w (1, 0, 0)              26.35 (-.10)    24.75 (+.28)
    + RDLM-P_w (2, 1, 1)              26.38 (-.07)    24.83 (+.36)

Table 6: Results for string-to-tree systems with SelAssoc and RDLM-P_w features. The number of clusters used with SelAssoc_C is 500. The triples in parentheses indicate the context size for ancestors, left siblings and right siblings respectively.

We changed the format of the features in order to experiment with sparse features. By using sparse features we let the tuning algorithm discriminate between low and high values of the SelAssoc score. For each of the SelAssoc features we normalized the scores to have zero mean and standard deviation one and mapped them to their corresponding percentile. A sparse feature was created for each percentile below and above the mean (up to two standard deviations below the mean and three standard deviations above the mean), resulting in a total of 20 sparse features. However, this formulation of the feature also did not improve the evaluation scores, as shown in the fifth row of Table 6.

The lack of variance in automatic evaluation scores can be explained by: a) the feature touches only a few words in the translation, and b) the relation between a predicate and its argument is identified at later stages of the bottom-up chart-based decoding, when many lexical choices have already been pruned out. The SelAssoc scores, similar to mutual information scores, are sensitive to outlier events with low frequencies in the training data. In the next section we investigate whether a more robust model would mitigate some of these issues and experiment with a neural relational dependency language model (RDLM) (Sennrich, 2015).
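The percentile binning described above might be implemented roughly as follows; the exact bin boundaries are our guess at a layout consistent with the stated totals (20 bins covering two standard deviations below and three above the mean).

    def bin_feature(score, mean, std, low=-2.0, high=3.0, n_bins=20):
        # Map a raw SelAssoc score to a sparse indicator feature name.
        z = (score - mean) / std
        z = max(low, min(high, z))           # clip to the covered range
        width = (high - low) / n_bins
        index = min(int((z - low) / width), n_bins - 1)
        return "SelAssocBin_%d" % index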

5.3 Comparison with a Relational Dependency LM

The RDLM (Sennrich, 2015) is a feed-forward neural network which learns the two probability distributions conditioned on a large syntactic context described in Eq. 3: P_w predicts the head word of the dependent and P_l the dependency relation. We compare our feature with RDLM-P_w. For training the RDLM-P_w we use the parameters for the feed-forward neural network described in Sennrich (2015): 150 dimensions for the input layer, 750 dimensions for the hidden layer, a vocabulary of 500,000 words and 100 noise samples. We train the RDLM-P_w on the target side of the parallel data. Although we use less data than for training the SelAssoc features, the neural network is inherently good at learning generalizations and selecting the appropriate conditioning context.

We experiment with different configurations for RDLM-P_w by varying the number of ancestors as well as left and right siblings:

• ancestors = 1, left = 0, right = 0

• ancestors = 2, left = 1, right = 1

The first configuration captures similar syntactic context as the SelAssoc features. The only exception is the prep relation, for which the head of pobj, the actual preposition, is a sibling of the argument. The results are shown in the last two lines of Table 6; the configuration is marked in parentheses as the number of ancestors, left siblings and right siblings respectively.

The RDLM-P_w performs slightly better than the selectional preference feature in terms of the HWCM scores. An increase in HWCM is to be expected since the RDLM-P_w models all dependency relations. However, there is no significant contribution from having a larger syntactic context.
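For orientation, a toy forward pass with the quoted layer sizes could look like the sketch below. The architecture (concatenated context embeddings, one tanh hidden layer, softmax output) is assumed, the vocabulary is shrunk to keep the sketch runnable, and the noise-contrastive estimation used for training the real RDLM is omitted.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, emb_dim, hidden = 500, 150, 750   # paper: vocabulary of 500,000
    E = rng.normal(scale=0.1, size=(vocab, emb_dim))    # context embeddings
    W1 = rng.normal(scale=0.01, size=(2 * emb_dim, hidden))
    W2 = rng.normal(scale=0.01, size=(hidden, vocab))

    def p_w(context_ids):
        # Configuration (1, 0, 0): context is one ancestor head word plus
        # the dependency label of the predicted node.
        x = np.concatenate([E[i] for i in context_ids])
        h = np.tanh(x @ W1)
        logits = h @ W2
        z = np.exp(logits - logits.max())
        return z / z.sum()

    probs = p_w([42, 7])   # (head-word id, dependency-label id)
    print(probs.shape)     # distribution over the head-word vocabulary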


Figure 2: Frequency and translation precision of triples with respect to the distance between the predicate and its arguments. Frequency is computed for triples extracted from the reference sentences of the test sets. Translation precision is computed over triples extracted from the output of the two translation systems: the baseline system and the system with SelAssoc_L and SelAssoc_C features.

5.4 Analysis

In this section we investigate possible reasons for the low impact of our selectional preference features. We look at how frequently our features are triggered, and how precision is influenced by the distance between predicates and arguments.

Firstly, we are interested in how often the feature triggers and how it influences the overall selectional association score of the test set. On average, 4.85 triples can be extracted per sentence produced by our system. Out of these, 4.35 triples get scored by the SelAssoc_C feature and 3.56 by the SelAssoc_L feature. The selectional association scores are higher on average for our system than for the baseline, as shown in Table 7. The SelAssoc_C feature seems to overgeneralize for the prep relations, as the scores are on average higher than for the reference triples. We therefore conclude that our feature is having an impact on the translation system.

                                  SelAssoc_L      SelAssoc_C
    System                        main    prep    main    prep
    Baseline                      0.067   0.039   0.164   0.147
    + SelAssoc_L + SelAssoc_C     0.074   0.041   0.175   0.305
    Reference                     0.077   0.043   0.186   0.163

Table 7: Average selectional association scores for the test sets. Scores are aggregated over the main and prep argument types. Main arguments include nsubj, nsubjpass, dobj and iobj.

Secondly, we want to understand the interaction between the SelAssoc features and the language model. For this purpose we compute the frequency and translation precision of triples with respect to the distance between the predicate and its arguments. Figure 2 shows the frequency of triples extracted from the reference sentences as well as the translation precision of triples extracted from the output of the translation systems. For more reliable precision scores we lemmatized all predicates and arguments. Most arguments are within a 5-word window from the predicate; therefore most triples are also scored by the language model. For these triples we see only a slight increase in precision for our system. This result indicates that for predicates and arguments that are close to each other, the feature is not adding much information. As the distance increases, the precision decreases drastically for both systems. A longer distance between predicates and arguments also implies a more complex syntactic structure, which will negatively impact the quality of the extracted triples and the selectional association scores.


Source: Das 16-jährige Mädchen und der 19-jährige Mann brachen kurz nach Sonntagmittag in Govetts Leap in Blackheath zu ihrer Tour auf.
Reference: The 16-year old girl and the 19-year old man went on their tour shortly after Sunday lunch at Govetts Leap in Blackheath.
Baseline: The 16-year old girl and the 19-year old man broke shortly after Sunday lunch in Govetts Leap in Blackheath on their tour.

Figure 3: Example of a complex sentence with multiple prepositional modifiers. Information about semantic roles is needed to identify the relevant prepositional modifier.

5.5 Discussion

One reason for the small impact of both the SelAssoc and RDLM-P_w features could be the poor quality of the syntactic trees produced by the decoder for longer sentences. In the cases where the relation between predicate and argument can be reliably extracted, such as the example in Fig. 1, the features are not adding more information than is already covered by the language model.

In more complex sentences there are cases where the features score modifiers that are not important for disambiguating the verb. The example in Figure 3 has several prepositional modifiers, but only "on tour" could help disambiguate the verb brachen (went). In such cases, identifying the semantic roles of the modifiers in the source and projecting them onto the target might be useful for better estimation of semantic affinities.

The error analysis on short sentences showed that translation of verbs is problematic for syntax-based systems. This is confirmed by the low precision scores for verb translation shown in Table 8 (the precision scores were computed over verb lemmas extracted automatically from the test sets; in total 21,633 source verbs were evaluated). Although there is a slight improvement in precision, mistranslated verbs generally impact our features, as the semantic affinity is scored for the wrong verb. A solution would be to add the source verbs to the conditioning context.

    System                        Precision
    Baseline                      46.10
    + SelAssoc_L + SelAssoc_C     46.26 (+.16)
    + RDLM-P_w (2, 1, 1)          46.31 (+.21)

Table 8: Evaluation of verb translation in the test set. Precision scores are computed over verb lemmas against the reference translations.

6 Conclusions

This paper explores whether knowledge about semantic affinities between the target predicates and their argument fillers is useful for translating ambiguous predicates and arguments. We propose three variants of a selectional preference feature for string-to-tree statistical machine translation based on the selectional association measure of Resnik (1996). We compare our features with a variant of the neural relational dependency language model (RDLM) (Sennrich, 2015) and find that neither of the features improves automatic evaluation metrics. We conclude that mistranslated verbs, errors in the target syntactic trees produced by the decoder and underspecified syntactic relations are negatively impacting these features. We propose to address these issues in future work by augmenting the feature with source-side information such as the source verb and the semantic roles of its arguments.

Acknowledgments

We thank the anonymous reviewers as well as Rico Sennrich for his feedback and assistance with RDLM. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 644402 (HimL) and 645452 (QT21). We are also grateful for support by a Google Faculty Research Award.

References

Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, ACL 2013, pages 419–423.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal, pages 1–46.


Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 740–750.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report, Harvard University.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montreal, Canada, NAACL HLT '12, pages 427–436.

Raphael Cohen, Yoav Goldberg, and Michael Elhadad. 2012. Domain adaptation of a dependency parser with a class-class selectional preference model. In Proceedings of the ACL 2012 Student Research Workshop. ACL '12, pages 43–48.

Tim Van de Cruys. 2014. A neural network approach to selectional preference acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL '14, pages 26–35.

Katrin Erk, Sebastian Padó, and Ulrike Padó. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics 36(4):723–763.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Morristown, NJ, USA, pages 961–968.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics. HLT-NAACL '04.

Qin Gao and Stephan Vogel. 2011. Utilizing target-side semantic role labels to assist hierarchical phrase-based machine translation. In Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation. Portland, Oregon, USA, pages 107–115.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland, WMT '11, pages 187–197.

Eliyahu Kiperwasser and Yoav Goldberg. 2015. Semi-supervised dependency parsing using bilexical contextual features from auto-parsed data. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1348–1353.

Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. EMNLP-CoNLL '12, pages 1048–1059.

Junhui Li, Philip Resnik, and Hal Daumé III. 2013. Modeling syntactic and semantic structures in hierarchical phrase-based translation. In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics. Atlanta, Georgia, USA, NAACL-HLT 2013, pages 540–549.

Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, MI, pages 25–32.

Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics. pages 716–724.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, Stroudsburg, PA, USA, AKBC-WEKEX '12, pages 95–100.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pages 1532–1543.

Philip Resnik. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition 61:127–159.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA, ACL '10, pages 424–434.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA, ACL '10, pages 435–444.

Rico Sennrich. 2015. Modelling and optimizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics 3:169–182.

Ekaterina Shutova, Niket Tandon, and Gerard de Melo. 2015. Perceptually grounded selectional preferences. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. ACL '15, pages 950–960.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). Volume 3.

Sara Stymne. 2011. Blast: A tool for error analysis of machine translation output. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations. Portland, Oregon, HLT '11, pages 56–61.

Lin Sun and Anna Korhonen. 2009. Improving verb clustering with automatically acquired selectional preferences. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. EMNLP '09, pages 638–647.

Gertjan van Noord. 2007. Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of the 10th International Conference on Parsing Technologies. IWPT '07, pages 1–10.

Marion Weller, Alexander Fraser, and Sabine Schulte im Walde. 2013. Using subcategorization knowledge to improve case prediction for translation to German. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 593–603.

Marion Weller, Sabine Schulte im Walde, and Alexander Fraser. 2014. Using noun class information to model selectional preferences for translating prepositions in SMT. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas. Vancouver, BC, AMTA '14, pages 275–287.

Philip Williams and Philipp Koehn. 2012. GHKM rule extraction and scope-3 parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation. pages 388–394.

Dekai Wu and Pascale Fung. 2009. Semantic roles for SMT: a hybrid two-pass model. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. pages 13–16.


F Using Verb Patterns in PBMT

Verb Sense Disambiguation in Machine Translation

Roman Sudarikov, Ondřej Dušek, Martin Holub, Ondřej Bojar, and Vincent Kříž Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics {sudarikov,odusek,holub,bojar,kriz}@ufal.mff.cuni.cz

Abstract

We describe experiments in machine translation using word sense disambiguation (WSD) information. This work focuses on WSD in verbs, based on two different approaches – verbal patterns based on corpus pattern analysis and verbal word senses from valency frames. We evaluate several options for using verb senses in the source-language sentences as an additional factor for the Moses statistical machine translation system. Our results show a statistically significant translation quality improvement in terms of the BLEU metric for the valency frames approach; in manual evaluation, both WSD methods bring improvements.

1 Introduction

The possibility of using word sense disambiguation (WSD) systems in machine translation (MT) has recently been investigated in several ways: the output of WSD systems has been incorporated into MT to improve translation quality – at the decoding step of a phrase-based statistical machine translation (PB-SMT) system (Chan et al., 2007) or as contextual features in maximum entropy (MaxEnt) models (Neale et al., 2015; Neale et al., 2016). In addition, WSD has also been used in MT evaluation, for example in METEOR (Apidianaki et al., 2015). These works indicate that WSD can be beneficial to different MT tasks: using senses as contextual features for MaxEnt models, Neale et al. (2016) achieve a statistically significant improvement over the baseline for English-to-Portuguese translation, and Apidianaki et al. (2015) report that WSD can establish better sense correspondences and improve METEOR's correlation with human judgments of translation quality.

In this research, we have investigated the possibilities of integrating two different approaches to verbal WSD into a PB-SMT system – verb patterns based on corpus pattern analysis (CPA) and verbal word senses in valency frames. The focus on verbs was motivated by the idea that verbs carry a crucial part of the meaning of the sentence (Healy and Miller, 1970) and thus accurate translation of the verb is critical for understanding the translation. Therefore, improvement of the translation of verbs can lead to an overall increase of translation quality.

The outputs of automatic verb sense disambiguation systems using both CPA and valency frames were integrated into the Moses statistical machine translation system (Koehn et al., 2007). Both kinds of verb senses were added as additional factors (Koehn and Hoang, 2007). Section 4.1 shows that we obtain a statistically significant improvement in terms of BLEU scores (Papineni et al., 2002), and manual evaluation of the translations confirmed this. The novelty of this work lies not only in our focus on verb senses, but also in the fact that we compare the impact of two WSD approaches on statistical machine translation.

The following Section 2 describes the initial setup of our experiments. Section 3 and Section 4 describe the ideas behind corpus pattern analysis and verb valency frame representations and show evaluation results of incorporating these senses into phrase-based statistical machine translation. Section 5 is devoted to the discussion of results obtained during the evaluation. Finally, Section 6 describes our plans for future work.


2 Experiments setup

2.1 Dataset and MT system

For our experiments, we have used a subset of the Czech-English corpus CzEng 1.0 (Bojar et al., 2012); the respective numbers of sentences and tokens in each of the training, development and test sets are shown in Table 1. For our experiments, 28 different English verbs were selected and automatically annotated with corpus pattern analysis senses, and 3,306 verbs were annotated using valency frames. The subset has been selected to include verbs annotated with CPA, so the effect of WSD would be visible. All the experiments were carried out in the Eman experiment management system (Bojar and Tamchyna, 2013) using the Moses PB-SMT system (Koehn et al., 2007) as the core and minimum error rate training (MERT, (Och, 2003)) to optimize the decoder feature weights on the development set. The evaluation was performed using the BLEU score (Papineni et al., 2002), but the results of each setup were then thoroughly examined and verified using the MT-ComparEval system (Aranberri et al., 2016), available at http://wmt.ufal.cz/.

    Set           Number of sentences   Tokens CS    Tokens EN
    Training      649,605               10,759,546   12,073,130
    Development   10,115                187,478      167,788
    Test          2,707                 59,446       67,336

Table 1: Data set composition

2.2 MT configurations

As we have mentioned in Section 1, the main goal of the experiments was to explore whether verb senses as additional factors in the statistical MT system Moses can help to improve translation quality. The following configurations were tested:

• Form→Form – "vanilla" Moses setup, translating from surface word forms to target surface forms, including capitalization.

• Form+Sense→Form – two source factors (surface word form and verb sense ID, if applicable) are translated to the target-side word forms. This is technically identical to appending the verb sense ID to the source words (illustrated in the sketch after this list).

• Form→Form+Tag – the source word form is translated to two factors on the target side: word form and morphological tag (part-of-speech tag with morphological categories of Czech, such as case, number, gender, or tense). This allows us to use an additional language model trained on morphological tags only. This setup is known to perform well for morphologically rich languages (Bojar, 2007) and thus was selected as a baseline for all comparisons.

• Form+Sense→Form+Tag – a combination of the two setups above: two source and two target-side factors, for better handling of source verb meaning and target morphological coherence.

• Form→Form+Tag + Form+Sense→Form+Tag – a combination of the previous two models as two separate phrase tables.
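For illustration, the Form+Sense source factors can be produced as in the following sketch, using the pipe-separated factor notation that Moses expects for factored input; the sense IDs and tokenization are toy values.

    def add_sense_factor(tokens, senses):
        # Append a verb sense ID as a second source factor where the WSD
        # system provided one, and "-" elsewhere.
        return " ".join("%s|%s" % (w, senses.get(i, "-"))
                        for i, w in enumerate(tokens))

    print(add_sense_factor(["let", "it", "cool", "down"], {2: "1"}))
    # let|- it|- cool|1 down|-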

For all configurations, we trained a 4-gram language model on word forms of the sentences from the training set. This LM was pruned: we discarded all singleton n-grams (apart from unigrams). In addition, for configurations which generated morphological tags, we used a 10-gram LM over morphological tags to help maintain the morphological coherence of the translation outputs. Again, we pruned all singleton n-grams with the exception of unigrams.


    Verb   No.   Pattern / Implicature

    gleam  1     [[Physical Object | Surface]] gleam [NO OBJ]
                 [[Surface]] of [[Physical Object]] reflects occasional flashes of light
    gleam  2     [[Light | Light Source]] gleam [NO OBJ]
                 [[Light Source]] emits an occasional flash of [[Light]]
    gleam  3     {eyes} gleam [NO OBJ] (with [[Emotion]])
                 {eyes} of [[Human]] shine, expressive of [[Emotion]]
    wake   3     [no object] [[Human]] wake ({up}) {AdvTime} ({from nightmare | dream | sleep | reverie}) ({to Eventuality})
                 the mind of [[Human]] returns at a particular [[Time]] to a state of full conscious awareness and alertness after sleep
    wake   4     [phrasal verb] [[Human 1 ^ Sound ^ Event]] wake [[Human 2 ^ Animal]] ({up})
                 [[Human 1 | Sound | Event]] causes the mind of [[Human 2 | Animal]] to return to a state of full conscious awareness and alertness after sleep
    wake   7     [[Anything]] wake [[Emotion]] ({in Human})
                 [[Anything]] causes [[Human]] to feel or become aware of [[Emotion]]
    wake   9     waking * ({up})
                 [[Human | Animal]]'s returning to a state of full conscious awareness and alertness after sleep

Table 2: Example patterns defined for the verbs gleam and wake.

3 Verb patterns based on Corpus Pattern Analysis

Corpus Pattern Analysis (CPA) is a method of manual context-based lexical disambiguation of verbs (Hanks, 1994; Hanks, 2013). Verbs are supposed to have no meanings on their own; instead, meanings are triggered by the context. Hence, a CPA-based lexicon does not group the uses of a verb into senses but into syntagmatic usage patterns derived from the corpus findings. Such a CPA-based lexicon is the Pattern Dictionary of English Verbs (PDEV) (Hanks and Pustejovsky, 2005). In contrast to classical WSD, here the verb patterns are used as verb meaning representations. An example of a few patterns is given in Table 2.

Here we employ an automatic procedure for verb pattern recognition developed by Holub et al. (2012), which deals with 30 selected English verbs. In fact, their method uses 30 separate classifiers, one for each verb, trained on moderately sized manually annotated samples. They use the collection called VPS-30-En (Verb Pattern Sample, 30 English verbs) published by Cinková et al. (2012) as training data. VPS-30-En was designed as a small sample of PDEV, a pilot lexical resource of 30 English lexical verb entries enriched with semantically annotated corpus samples. The data describes regular contextual patterns of use of the selected verbs in the British National Corpus, version 3 (BNC, 2007); details about both the selected verbs and training contexts can be found at http://ufal.mff.cuni.cz/spr. The number of different patterns varies from 4 to 10 in most cases across the verbs, and the performance of Holub et al. (2012)'s automatic pattern recognition also differs from verb to verb, ranging between 50% and 90% accuracy.
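Schematically, such a per-verb setup can be wired as in the sketch below. The classifier objects are hypothetical placeholders, not the actual VPS-30-En models or their features.

    class VerbPatternDisambiguator:
        # One classifier per verb lemma, in the style of Holub et al.
        # (2012). `classifiers` maps a lemma to a trained model with a
        # predict() method; verbs without a classifier stay unannotated.
        def __init__(self, classifiers):
            self.classifiers = classifiers

        def annotate(self, lemma, context_features):
            clf = self.classifiers.get(lemma)
            if clf is None:
                return None               # not one of the 30 covered verbs
            return clf.predict(context_features)

    # Usage with a trivial stand-in classifier that always predicts "1":
    class AlwaysOne:
        def predict(self, feats):
            return "1"

    wsd = VerbPatternDisambiguator({"cool": AlwaysOne()})
    print(wsd.annotate("cool", {"next": "down"}))   # "1"
    print(wsd.annotate("run", {}))                  # None (uncovered verb)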

3.1 Experiments and evaluation

For the experiments with verb patterns based on CPA, we have explored all the configurations described in Section 2.2. Table 3 shows the results of the best MERT run for each configuration. Evaluation over multiple MERT runs was performed for Form→Form+Tag, Form+Sense→Form+Tag, and Form→Form+Tag + Form+Sense→Form+Tag using the MultEval system (Clark et al., 2011) with Form→Form+Tag as the baseline system; the results are shown in Table 4. We see that the average results of Form+Sense→Form+Tag are worse than those of Form→Form+Tag by 0.1% BLEU. MultEval aims to determine whether an experimental result has a statistically reliable difference for a given evaluation metric, using a stratified approximate randomization (AR) test. AR estimates the probability (p-value) that a measured difference in metric scores arose by chance by randomly exchanging sentences between the two systems. If there is no significant difference between the systems (i.e., the null hypothesis is true), then this shuffling should not change the computed metric score (Clark et al., 2011).
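The AR test can be sketched as follows. This is a generic implementation, not MultEval's code; for BLEU, the per-sentence statistics would be n-gram sufficient statistics that the metric function aggregates into a corpus-level score.

    import random

    def ar_test(stats_a, stats_b, metric, trials=10000):
        # Randomly swap the two systems' per-sentence statistics and count
        # how often the shuffled score difference is at least as large as
        # the observed one.
        observed = abs(metric(stats_a) - metric(stats_b))
        hits = 0
        for _ in range(trials):
            xs, ys = [], []
            for a, b in zip(stats_a, stats_b):
                if random.random() < 0.5:
                    xs.append(a); ys.append(b)
                else:
                    xs.append(b); ys.append(a)
            if abs(metric(xs) - metric(ys)) >= observed:
                hits += 1
        return (hits + 1) / (trials + 1)   # smoothed p-value estimate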


    Configuration                            BLEU
    Form→Form                                24.26
    Form+Sense→Form                          24.15
    Form+Sense→Form+Tag                      25.01
    Form→Form+Tag                            25.11
    Form→Form+Tag + Form+Sense→Form+Tag      25.27

Table 3: Evaluation results for corpus pattern analysis annotation, best MERT run

When comparing Form→Form+Tag and Form→Form+Tag + Form+Sense→Form+Tag, we see that the p-value is 0.16, so we cannot claim that these two systems differ from each other. The same test performed using METEOR and TER only confirms this (for TER, p-value = 0.61).

    Metric   System                                 Avg    s_sel   s_Test   p-value
    BLEU     Form→Form+Tag                          25.0   0.9     0.1      -
             Form+Sense→Form+Tag                    24.9   0.9     0.1      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    25.0   0.9     0.1      0.16
    METEOR   Form→Form+Tag                          22.6   0.4     0.0      -
             Form+Sense→Form+Tag                    22.5   0.4     0.0      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    22.6   0.4     0.1      0.22
    TER      Form→Form+Tag                          62.2   0.7     0.2      -
             Form+Sense→Form+Tag                    62.4   0.7     0.1      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    62.2   0.7     0.2      0.61

Table 4: MultEval results for corpus pattern analysis, based on 36 MERT runs

We also performed a more detailed analysis with pairwise comparisons of the following configurations:

• Form→Form vs. Form+Sense→Form

• Form→Form+Tag vs. Form+Sense→Form+Tag

• Form→Form+Tag vs. Form→Form+Tag + Form+Sense→Form+Tag

3.1.1 Form→Form vs. Form+Sense→Form

The comparison provided by MT-ComparEval, based on paired bootstrap resampling (Koehn, 2004) of the best MERT runs for both configurations, showed that Form→Form is significantly better (p-value=0.022) than Form+Sense→Form. The sentence-by-sentence comparison explains this: on the positive side, 8 examples out of the top 10 sentences where the Form+Sense→Form output was better than Form→Form profited from using additional information about the verb sense. On the negative side, the model with verb senses made a lot of errors due to badly extracted phrase tables, even leaving some verbs untranslated.

3.1.2 Form→Form+Tag vs. Form+Sense→Form+Tag

In this case the same paired bootstrap resampling of the best MERT runs showed that the difference between the Form+Sense→Form+Tag and Form→Form+Tag outputs is not significant (p-value=0.062). In the sentence-by-sentence comparison, we saw that while information about the verb pattern helps with some translations, it still causes mistakes.

For example, in the sentence from Figure 1, the verb cool down is translated as vychladnout ('let the temperature sink') instead of the correct uklidnit ('calm down'). Here, MT-ComparEval shows that Form→Form+Tag translated the verb correctly, meaning that the correct translation exists in the training data. Therefore, we checked which of the translation model factors caused the wrong translation. In the source sentence, the verb cool has the CPA pattern "1", but the only suitable phrase in the Form+Sense→Form+Tag phrase table (with cool|1 down|- on the source side) has the verb vychladnout on the target side. In the Form→Form+Tag table, we have the phrase cool down and let translated using the verb uklidnit, but the corresponding phrase in the Form+Sense→Form+Tag table has a different CPA pattern "u" for the verb cool.
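Paired bootstrap resampling (Koehn, 2004), as used by MT-ComparEval above, differs from the AR test sketched earlier in that it resamples whole test sets with replacement. A minimal sketch:

    import random

    def paired_bootstrap(stats_a, stats_b, metric, samples=1000):
        # Repeatedly draw a test set of the same size with replacement and
        # count how often system A scores above system B on the resample.
        n, wins = len(stats_a), 0
        for _ in range(samples):
            idx = [random.randrange(n) for _ in range(n)]
            if metric([stats_a[i] for i in idx]) > metric([stats_b[i] for i in idx]):
                wins += 1
        return wins / samples   # fraction of resamples where A is better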


Figure 1: An example MT-ComparEval output from the Form+Sense→Form+Tag sentence analysis.

    work1: ACT PAT DIR3 (put, implement)
        Burger King works a sales pitch into its public-service message.
    work2: ACT ?PAT ?BEN ?ACMP (perform a job)
        Mr. Cray has been working on the project for more than six years.
    work3: ACT PAT (cause, create)
        [. . . ] greenhouse effect that will work important climatic changes [. . . ]
    work4: ACT (function)
        US trade law is working.

Figure 2: Example entry from the EngVallex valency lexicon, with four different senses/valency frames of the verb work (abridged, with minor adaptations for presentation). The sense ID and the valency frame are shown on the 1st line of each sense, with the following semantic roles: ACT = actor, PAT = patient, DIR3 = direction (to, into), BEN = benefactor, ACMP = accompanying person or object. Optional arguments are prepended with a "?". A short gloss is shown on the 2nd line, and an example on the 3rd line.

3.1.3 Form→Form+Tag vs. Form→Form+Tag + Form+Sense→Form+Tag

MT-ComparEval's paired bootstrap resampling showed that the difference between these two outputs is significant (p-value=0.023), i.e., the output of Form→Form+Tag + Form+Sense→Form+Tag is significantly better than that of Form→Form+Tag. In the sentence-by-sentence comparison, we saw that the combined system benefited from the verb patterns where possible but resorted to the more general translations of the baseline phrase table when CPA-annotated translations were insufficient.

4 Verbal word senses in valency frames

Valency in verbs (and other parts of speech), i.e., the ability of a verb to require and shape its arguments, is one of the core notions of the Functional Generative Description (FGD) theory (Sgall et al., 1986). The valency of a verb is described in a valency frame, which lists the semantic roles and possible syntactic shapes of all of its obligatory and optional arguments. Since different senses of the same verb require different arguments and thus are described by different valency frames, this amounts to WSD in verbs (an example is shown in Figure 2).

Valency frames for over 7,000 senses of more than 4,000 common English verbs are listed in the EngVallex valency lexicon (Cinková, 2006); EngVallex is originally based on the PropBank frame files (Palmer et al., 2005), but it also contains a lot of manual changes. The Prague Czech-English Dependency Treebank (PCEDT) 2.0 (Hajič et al., 2012) provides manually annotated valency frame IDs for all of its verbs. Using this annotation, Dušek et al. (2015) trained an automatic system for valency frame detection as a part of the Treex natural language processing toolkit (Popel and Žabokrtský, 2010), available at http://ufal.mff.cuni.cz/treex. We processed all the sentences in our dataset with the tool and used the resulting valency frame IDs in our experiments.
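A much simplified picture of what such a lexicon involves, using the abridged work entry from Figure 2; the data structure is ours, not EngVallex's actual format.

    # Frame IDs map to role lists and a short gloss, mirroring Figure 2.
    VALENCY_LEXICON = {
        "work1": {"roles": ["ACT", "PAT", "DIR3"], "gloss": "put, implement"},
        "work2": {"roles": ["ACT", "?PAT", "?BEN", "?ACMP"], "gloss": "perform a job"},
        "work3": {"roles": ["ACT", "PAT"], "gloss": "cause, create"},
        "work4": {"roles": ["ACT"], "gloss": "function"},
    }

    def obligatory_roles(frame_id):
        # Roles not marked optional (optional roles are prefixed with "?").
        frame = VALENCY_LEXICON[frame_id]
        return [r for r in frame["roles"] if not r.startswith("?")]

    print(obligatory_roles("work2"))   # ['ACT']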

4.1 Experiments and evaluation

Based on the results of the experiments shown in Section 3.1, we have decided to focus only on the following configurations: Form→Form+Tag, Form+Sense→Form+Tag and their combination Form→Form+Tag + Form+Sense→Form+Tag.


    Configuration                            BLEU
    Form+Sense→Form+Tag                      24.97
    Form→Form+Tag                            25.08
    Form→Form+Tag + Form+Sense→Form+Tag      25.26

Table 5: Evaluation results for valency frames annotation, best MERT run for each configuration

Table 5 shows the results of the best MERT run for each configuration. The MultEval evaluation for all the configurations mentioned above, with Form→Form+Tag as the baseline, is shown in Table 6. The table shows that the average Form+Sense→Form+Tag model results are still 0.1% BLEU worse than the Form→Form+Tag model, but the average results of the combined Form→Form+Tag + Form+Sense→Form+Tag model are 0.1% BLEU better than the average results of Form→Form+Tag. The results of MultEval's stratified approximate randomization test (Clark et al., 2011) allow us to claim that the combination of these two models is statistically significantly better than the baseline. The same is true for the METEOR and TER results, shown in the same table. It also shows that the valency frames approach to WSD has more impact on MT than CPA in our case.

    Metric   System                                 Avg    s_sel   s_Test   p-value
    BLEU     Form→Form+Tag                          25.0   0.9     0.1      -
             Form+Sense→Form+Tag                    24.9   0.9     0.1      0.01
             Form→Form+Tag + Form+Sense→Form+Tag    25.1   0.9     0.1      0.00
    METEOR   Form→Form+Tag                          22.5   0.4     0.0      -
             Form+Sense→Form+Tag                    22.5   0.4     0.0      0.01
             Form→Form+Tag + Form+Sense→Form+Tag    22.6   0.4     0.0      0.00
    TER      Form→Form+Tag                          62.2   0.7     0.1      -
             Form+Sense→Form+Tag                    62.4   0.7     0.2      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    62.1   0.7     0.2      0.00

Table 6: MultEval results for valency frames, based on 8 MERT runs

A more thorough examination of the best MERT runs of the following pairs of configurations, using MT-ComparEval's paired bootstrap resampling, showed that:

• Form+Sense→Form+Tag is insignificantly worse than Form→Form+Tag, with p-value=0.0161

• Form→Form+Tag + Form+Sense→Form+Tag is significantly better than Form→Form+Tag, with p-value=0.002

An interesting observation was that the Form+Sense→Form+Tag and Form→Form+Tag + Form+Sense→Form+Tag models were more likely to translate verbs as verbs, while translation errors in Form→Form+Tag were often caused by its efforts to translate verbs as nouns.

4.2 Comparison of CPA and valency frames

Based on the MultEval results shown in Table 4 and Table 6, it can be claimed that using the valency frames approach to WSD helped to achieve a statistically significant improvement in machine translation, while CPA did not help to such an extent. Among the possible reasons are the lower number of verbs covered (for the same number of sentences, we had CPA-based annotations for only 28 different verbs versus 3,306 different verbs with valency frame annotations) and the precision of the automatic annotation systems themselves. One of our future plans here is to compare the results of these approaches when exactly the same verbs are annotated.

An example of a sentence where the valency frames approach was more successful than CPA is ". . . forged steel components for the automotive industry". Here, the word forged was annotated both by a verbal valency frame and by a verbal pattern, and while the valency frame provided a correct translation of this word into Czech, "kované oceli součástí", the CPA-based model generated "zfalšoval ocel součástí", which is incorrect in both meaning and part of speech.


5 Discussion and conclusion

Including verb senses – be it based on corpus pattern analysis or on valency frames – as an additional factor in a PB-SMT English-to-Czech model did not help by itself, as our results for the Form+Sense→Form+Tag configurations have shown. Nevertheless, the combination of this model with the better-performing model Form→Form+Tag resulted in a significant improvement in the case of senses based on valency frames, as shown by the paired bootstrap resampling tests given in Table 6, while manual evaluation of the best MERT runs showed translation quality improvements for both WSD approaches. All the results were achieved on a relatively small data set, but the approach can be of use in cases when one does not have enough parallel data but WSD for the source language (which is often English) is available, for example for domain-specific translations.

We have tried to use sense information produced by two different approaches to verbal WSD – corpus pattern analysis and valency frames – and while the former did not significantly outperform the baseline system in terms of the BLEU metric, the latter showed a significant improvement. Adding an automatic WSD system as an additional preprocessing layer can also influence the SMT system negatively, because the WSD system cannot deliver 100% accurate senses; this causes situations where the system had a correct translation available but did not select it because the verb sense assigned to the source sentence from the test set was incorrect. Possible ways of reducing the impact of such issues are improving the automatic WSD systems used and using WSD system combination.

6 Future work

In the future, we plan to continue our experiments on verb senses using the approaches described in this work as well as other approaches, e.g., WSD systems based on BabelNet synsets (Navigli and Ponzetto, 2012) and WordNet senses (http://wordnet.princeton.edu). In addition, we are going to experiment with the size of the corpus used for training, because this research used only a part of the available Czech-English parallel corpus.

7 Acknowledgments This research was supported by the grants H2020-ICT-2014-1-645452, GBP103/12/G084, SVV 260 333, and using language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2015071). We thank the two anonymous reviewers for useful comments.

References

Marianna Apidianaki, Benjamin Marie, and Lingua et Machina. 2015. METEOR-WSD: improved sense matching in MT evaluation. Syntax, Semantics and Structure in Statistical Translation, page 49.

Nora Aranberri, Eleftherios Avramidis, Aljoscha Burchardt, Ondřej Klejch, Martin Popel, and Maja Popović. 2016. Tools and guidelines for principled machine translation development. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1877–1882, Portorož, Slovenia.

BNC. 2007. British National Corpus, version 3 (BNC XML edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/.

Ondřej Bojar and Aleš Tamchyna. 2013. The Design of Eman, an Experiment Manager. The Prague Bulletin of Mathematical Linguistics, 99:39–58.

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2012. The Joy of Parallelism with CzEng 1.0. In Proceedings of LREC 2012, Istanbul, Turkey, May. ELRA, European Language Resources Association.

Ondřej Bojar. 2007. English-to-Czech Factored Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 232–239, Prague, Czech Republic, June. Association for Computational Linguistics.


Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Prague, Czech Republic.

Silvie Cinková, Martin Holub, Adam Rambousek, and Lenka Smejkalová. 2012. A database of semantic clusters of verb usages. In Proceedings of the LREC 2012 International Conference on Language Resources and Evaluation. Istanbul, Turkey.

Silvie Cinková. 2006. From PropBank to EngValLex: Adapting the PropBank-Lexicon to the Valency Theory of the Functional Generative Description. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy.

Jonathan H Clark, Chris Dyer, Alon Lavie, and Noah A Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 176–181. Association for Computational Linguistics.

Ondřej Dušek, Eva Fučíková, Jan Hajič, Martin Popel, Jana Šindlerová, and Zdeňka Urešová. 2015. Using Parallel Texts and Lexicons for Verbal Word Sense Disambiguation. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 82–90, Uppsala, Sweden.

J. Hajič, E. Hajičová, J. Panevová, P. Sgall, O. Bojar, S. Cinková, E. Fučíková, M. Mikulová, P. Pajas, J. Popelka, J. Semecký, J. Šindlerová, J. Štěpánek, J. Toman, Z. Urešová, and Z. Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of LREC, pages 3153–3160, Istanbul.

Patrick Hanks and James Pustejovsky. 2005. A Pattern Dictionary for Natural Language Processing. Revue Française de Linguistique Appliquée, 10(2).

Patrick Hanks. 1994. Linguistic norms and pragmatic exploitations, or why lexicographers need prototype theory and vice versa. In F. Kiefer, G. Kiss, and J. Pajzs, editors, Papers in Computational Lexicography: Complex ’94. Research Institute for Linguistics, Hungarian Academy of Sciences.

Patrick Hanks. 2013. Lexical Analysis: Norms and Exploitations. University Press Group Limited.

Alice F Healy and George A Miller. 1970. Verb as main determinant of sentence meaning. Psychonomic Science, 20(6):372–372.

Martin Holub, Vincent Kříž, Silvie Cinková, and Eckhard Bick. 2012. Tailored feature extraction for lexical disambiguation of English verbs based on corpus pattern analysis. In COLING, pages 1195–1210.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proc. of EMNLP.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 187–193.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In EMNLP, pages 388–395. Citeseer.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and appli- cation of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Steven Neale, Luís Gomes, and António Branco. 2015. First steps in using word senses as contextual features in MaxEnt models for machine translation. In 1st Deep Machine Translation Workshop, page 64.

Steven Neale, Luís Gomes, Eneko Agirre, Oier Lopez de Lacalle, and António Branco. 2016. Word sense-aware machine translation: Including senses as contextual features for improved translation models. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2777–2783, Portorož, Slovenia.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 160–167.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.


Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evalua- tion of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: modular NLP framework. In Proceedings of IceTAL, 7th International Conference on Natural Language Processing, pages 293–304, Reykjavík.

P. Sgall, E. Hajičová, and J. Panevová. 1986. The meaning of the sentence in its semantic and pragmatic aspects. D. Reidel, Dordrecht.


G Target-Side Context for Discriminative Models in Statistical Machine Translation

Target-Side Context for Discriminative Models in Statistical Machine Translation

Aleš Tamchyna1,2, Alexander Fraser1, Ondřej Bojar2 and Marcin Junczys-Dowmunt3
1LMU Munich, Munich, Germany
2Charles University in Prague, Prague, Czech Republic
3Adam Mickiewicz University in Poznań, Poznań, Poland
{tamchyna,bojar}@ufal.mff.cuni.cz, [email protected], [email protected]

Abstract

Discriminative translation models utilizing source context have been shown to help statistical machine translation performance. We propose a novel extension of this work using target context information. Surprisingly, we show that this model can be efficiently integrated directly in the decoding process. Our approach scales to large training data sizes and results in consistent improvements in translation quality on four language pairs. We also provide an analysis comparing the strengths of the baseline source-context model with our extended source-context and target-context model and we show that our extension allows us to better capture morphological coherence. Our work is freely available as part of Moses.

1 Introduction

Discriminative lexicons address some of the core challenges of phrase-based MT (PBMT) when translating to morphologically rich languages, such as Czech, namely sense disambiguation and morphological coherence. The first issue is semantic: given a source word or phrase, which of its possible meanings (i.e., which stem or lemma) should we choose? Previous work has shown that this can be addressed using a discriminative lexicon. The second issue has to do with morphology (and syntax): given that we selected the correct meaning, which of its inflected surface forms is appropriate? In this work, we integrate such a model directly into the SMT decoder. This enables our classifier to extract features not only from the full source sentence but also from a limited target-side context. This allows the model to not only help with semantics but also to improve morphological and syntactic coherence.

For sense disambiguation, source context is the main source of information, as has been shown in previous work (Vickrey et al., 2005; Carpuat and Wu, 2007; Gimpel and Smith, 2008), inter alia. Consider the first set of examples in Figure 1, produced by a strong baseline PBMT system. The English word "shooting" has multiple senses when translated into Czech: it may either be the act of firing a weapon or making a film. When the cue word "film" is close, the phrase-based model is able to use it in one phrase with the ambiguous "shooting", correctly disambiguating the translation. When we add a single word in between, the model fails to capture the relationship and the most frequent sense is selected instead. Wider source context information is required for correct disambiguation.

While word/phrase senses can usually be inferred from the source sentence, the correct selection of surface forms also requires information from the target. Note that we can obtain some information from the source. For example, an English subject is often translated into a Czech subject, in which case the Czech word should be in nominative case. But there are many decisions that happen during decoding which determine morphological and syntactic properties of words – verbs can have translations which differ in valency frames, they may be translated in either active or passive voice (in which case subject and object would be switched), nouns may have different possible translations which differ in gender, etc.

The correct selection of surface forms plays a crucial role in preserving meaning in morphologically rich languages because it is morphology rather than word order that expresses relations between words. (Word order tends to be relatively free and driven more by semantic constraints than by syntactic constraints.)


    Input                              PBMT Output
    shooting of the film .             natáčení filmu . (correct)           shooting_camera of film .
    shooting of the expensive film .   střelby na drahý film . (wrong)      shootings_gun at expensive film .
    the man saw a cat .                muž uviděl kočku . (correct)         man saw cat_acc .
    the man saw a black cat .          muž spatřil černou kočku . (correct) man saw black_acc cat_acc .
    the man saw a yellowish cat .      muž spatřil nažloutlá kočka . (wrong) man saw yellowish_nom cat_nom .

Figure 1: Examples of problems of PBMT: lexical selection and morphological coherence. Each translation is shown with a corresponding gloss.

The language model is only partially able to capture this phenomenon. It has a limited scope and, perhaps more seriously, it suffers from data sparsity. The units captured by both the phrase table and the LM are mere sequences of words. In order to estimate their probability, we need to observe them in the training data (many times, if the estimates should be reliable). However, the number of possible n-grams grows exponentially as we increase n, leading to unrealistic requirements on training data sizes. This implies that the current models can (and often do) miss relationships between words even within their theoretical scope.

The second set of sentences in Figure 1 demonstrates the problem of data sparsity for morphological coherence. While the phrase-based system can correctly transfer the morphological case of "cat" and even "black cat", the less usual "yellowish cat" is mistranslated into nominative case, even though the correct phrase "yellowish ||| nažloutlou" exists in the phrase table. A model with a suitable representation of two preceding words could easily infer the correct case in this example.

Our contributions are the following:

• We show that the addition of a feature-rich discriminative model significantly improves translation quality even for large data sizes and that target-side context information consistently further increases this improvement.

• We provide an analysis of the outputs which confirms that source-context features indeed help with semantic disambiguation (as is well known). Importantly, we also show that our novel use of target context improves morphological and syntactic coherence.

• In addition to extensive experimentation on translation from English to Czech, we also evaluate English to German, English to Polish and English to Romanian tasks, with improvements in translation quality in all tasks, showing that our work is broadly applicable.

• We describe several optimizations which allow target-side features to be used efficiently in the context of phrase-based decoding.

• Our implementation is freely available in the widely used open-source MT toolkit Moses, enabling other researchers to explore discriminative modelling with target context in MT.

2 Discriminative Model with Target-Side Context

Several different ways of using feature-rich models in MT have been proposed, see Section 6. We describe our approach in this section.

2.1 Model Definition

Let f be the source sentence and e its translation. We denote source-side phrases (given a particular phrasal segmentation) (f̄_1, ..., f̄_m) and the individual words (f_1, ..., f_n). We use a similar notation for target-side words/phrases. For simplicity, let e_prev, e_prev-1 denote the words preceding the current target phrase. Assuming a target context size of two, we model the following probability distribution:


\[ P(e \mid f) \propto \prod_{(\bar{e}_i, \bar{f}_i) \in (e,f)} P(\bar{e}_i \mid \bar{f}_i, f, e_{prev}, e_{prev-1}) \quad (1) \]

The probability of a translation is the product of phrasal translation probabilities which are conditioned on the source phrase, the full source sentence and several previous target words.

Let $GEN(\bar{f}_i)$ be the set of possible translations of the source phrase $\bar{f}_i$ according to the phrase table. We also define a "feature vector" function $fv(\bar{e}_i, \bar{f}_i, f, e_{prev}, e_{prev-1})$ which outputs a vector of features given the phrase pair and its context information. We also have a vector of feature weights $w$ estimated from the training data. Then our model defines the phrasal translation probability simply as follows:

\[ P(\bar{e}_i \mid \bar{f}_i, f, e_{prev}, e_{prev-1}) = \frac{\exp(w \cdot fv(\bar{e}_i, \bar{f}_i, f, e_{prev}, e_{prev-1}))}{\sum_{\bar{e}' \in GEN(\bar{f}_i)} \exp(w \cdot fv(\bar{e}', \bar{f}_i, f, e_{prev}, e_{prev-1}))} \quad (2) \]

This definition implies that we have to locally normalize the classifier outputs so that they sum to one.

In PBMT, translations are usually scored by a log-linear model. Our classifier produces a single score (the conditional phrasal probability) which we add to the standard log-linear model as an additional feature. The MT system therefore does not have direct access to the classifier features, only to the final score.

2.2 Global Model

We use the Vowpal Wabbit (VW) classifier in this work.¹ Tamchyna et al. (2014) already integrated VW into Moses. We started from their implementation in order to carry out our work. Classifier features are divided into two "namespaces":

• S. Features that do not depend on the current phrasal translation (i.e., source- and target-context features).
• T. Features of the current phrasal translation.

We make heavy use of feature processing available in VW, namely quadratic feature expansions and label-dependent features. When generating features for a particular set of translations, we first create the shared features (in the namespace S). These only depend on (source and target) context and are therefore constant for all possible translations of a given phrase. (Note that target-side context naturally depends on the current partial translation. However, when we process the possible translations for a single source phrase, the target context is constant.)

Then for each translation, we extract its features and store them in the namespace T. Note that we do not provide a label (or class) to VW – it is up to these translation features to describe the target phrase. (And this is what is referred to as "label-dependent features" in VW.)

Finally, we add the Cartesian product between the two namespaces to the feature set: every shared feature is combined with every translation feature. This setting allows us to train only a single, global model with powerful feature sharing. For example, thanks to the label-dependent format, we can decompose both the source phrase and the target phrase into words and have features such as s_cat t_kočka which capture phrase-internal word translations. Predictions for rare phrase pairs are then more robust thanks to the rich statistics collected for these word-level feature pairs.

¹ http://hunch.net/~vw/
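To make the model concrete, the locally normalised score of Equation (2) together with the namespace setup can be illustrated with a short sketch. This is not the actual Moses/VW integration: the dictionary-of-weights representation and all function names are ours, and VW's hashed feature representation is deliberately simplified away.

```python
import math

def classifier_score(w, shared_feats, translation_feats):
    # w maps (namespace, feature...) tuples to learned weights.
    score = sum(w.get(("S", f), 0.0) for f in shared_feats)
    score += sum(w.get(("T", f), 0.0) for f in translation_feats)
    # Cartesian product of the namespaces (VW's quadratic features):
    # every shared feature is combined with every translation feature.
    for fs in shared_feats:
        for ft in translation_feats:
            score += w.get(("SxT", fs, ft), 0.0)
    return score

def phrasal_probability(w, shared_feats, gen, chosen):
    # Equation (2): a softmax over the candidate translations GEN(f_i)
    # of one source phrase; `gen` maps each candidate to its
    # label-dependent (namespace T) features.
    scores = {e: classifier_score(w, shared_feats, feats)
              for e, feats in gen.items()}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[chosen]) / z
```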


                                               Configurations
Feature Type        Czech                          German                         Polish, Romanian
Source Indicator    f, l, l+t, t                   f, l, l+t, t                   l, t
Source Internal     f, f+a, f+p, l, l+t, t, a+p    f, f+a, f+p, l, l+t, t, a+p    l, l+a, l+p, t, a+p
Source Context      f (-3,3), l (-3,3), t (-5,5)   f (-3,3), l (-3,3), t (-5,5)   l (-3,3), t (-5,5)
Target Context      f (2), l (2), t (2), l+t (2)   f (2), l (2), t (2), l+t (2)   l (2), t (2)
Bilingual Context   —                              l+t/l+t (2)                    l+t/l+t (2)
Target Indicator    f, l, t                        f, l, t                        l, t
Target Internal     f, l, l+t, t                   f, l, l+t, t                   l, t

Table 1: List of used feature templates. Letter abbreviations refer to word factors: f (form), l (lemma), t (morphological tag), a (analytical function), p (lemma of dependency parent). Numbers in parentheses indicate context size.

2.3 Extraction of Training Examples

Discriminative models in MT are typically trained by creating one training instance per extracted phrase from the entire training data. The target side of the extracted phrase is a positive label, and all other phrases observed aligned to the extracted phrase (anywhere in the training data) are the negative labels.

We train our model in a similar fashion: for each sentence in the parallel training data, we look at all possible phrasal segmentations. Then for each source span, we create a training example. We obtain the set of possible translations $GEN(\bar{f})$ from the phrase table. Because we do not have actual classes, each translation is defined by its label-dependent features and we associate a loss with it: 0 loss for the correct translation and 1 for all others.

Because we train both our model and the standard phrase table on the same dataset, we use leaving-one-out in the classifier training to avoid over-fitting. We look at phrase counts and co-occurrence counts in the training data and subtract one from the number of occurrences for the current source phrase, target phrase and the phrase pair. If the count goes to zero, we skip the training example. Without this technique, the classifier might learn to simply trust very long phrase pairs which were extracted from the same training sentence.

For target-side context features, we simply use the true (gold) target context. This leads to training which is similar to language model estimation; this model is somewhat similar to the neural joint model for MT (Devlin et al., 2014), but in our case implemented using a linear (maximum-entropy-like) model.

2.4 Training

We use Vowpal Wabbit in the --csoaa_ldf mc setting which reduces our multi-class problem to one-against-all binary classification. We use the logistic loss as our objective. We experimented with various settings of L2 regularization but were not able to get an improvement over not using regularization at all. We train each model with 10 iterations over the data.

We evaluate all of our models on a held-out set. We use the same dataset as for MT system tuning because it closely matches the domain of our test set. We evaluate model accuracy after each pass over the training data to detect over-fitting and we select the model with the highest held-out accuracy.

2.5 Feature Set

Our feature set requires some linguistic processing of the data. We use the factored MT setting (Koehn and Hoang, 2007) and we represent each type of information as an individual factor. On the source side, we use the word surface form, its lemma, morphological tag, analytical function (such as Subj for subjects) and the lemma of the parent node in the dependency parse tree. On the target side, we only use word lemmas and morphological tags.

Table 1 lists our feature sets for each language pair. We implemented indicator features for both the source and target side; these are simply concatenations of the words in the current phrase into a single feature. Internal features describe words within the current phrase. Context features are extracted either from a window of a fixed size around the current phrase (on the source side) or from a limited left-hand side context (on the target side). Bilingual context features are concatenations of target-side context words and their source-side counterparts (according to word alignment); these features are similar to bilingual tokens in bilingual LMs (Niehues et al., 2011). Each of our feature types can be configured to look at any individual factors or their combinations.

The features in Table 1 are divided into three sets. The first set contains label-independent (=shared) features which only depend on the source sentence. The second set contains shared features which depend on target-side context; these can only be used when VW is applied during decoding. We use target context size two in all our experiments.² Finally, the third set contains label-dependent features which describe the currently predicted phrasal translation.

² In preliminary experiments we found that using a single word was less effective and larger context did not bring improvements, possibly because of over-fitting.
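As an illustration of the templates in Table 1, the following sketch extracts source-context and target-context features from factored tokens. The exact feature strings of the real implementation are not documented here, so the names below are purely illustrative; tokens are assumed to be dicts with "form", "lemma" and "tag" keys.

```python
def source_context_features(tokens, start, end, factor, window):
    # One feature per offset in a fixed window around the current
    # source phrase (tokens[start:end]), e.g. the t (-5,5) template.
    feats = []
    for i in range(start - window, end + window):
        if start <= i < end or not (0 <= i < len(tokens)):
            continue  # skip the phrase itself and out-of-sentence positions
        off = i - start if i < start else i - end + 1
        feats.append("sctx_%s_%+d=%s" % (factor, off, tokens[i][factor]))
    return feats

def target_context_features(prev_tokens, factor, size=2):
    # The `size` preceding target words (e_prev, e_prev-1).
    feats = []
    for j, tok in enumerate(reversed(prev_tokens[-size:]), start=1):
        feats.append("tctx_%s_-%d=%s" % (factor, j, tok[factor]))
    return feats
```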


Going back to the examples from Figure 1, our model can disambiguate the translation of "shooting" based on the source-context features (either the full form or lemma). For the morphological disambiguation of the translation of "yellowish cat", the model has access to the morphological tags of the preceding target words which can disambiguate the correct morphological case.

We used slightly different subsets of the full feature set for different languages. In particular, we left out surface form features and/or bilingual features in some settings because they decreased performance, presumably due to over-fitting.

3 Efficient Implementation

Originally, we assumed that using target-side context features in decoding would be too expensive, considering that we would have to query our model roughly as often as the language model. In preliminary experiments, we therefore focused on n-best list re-ranking. We obtained small gains but all of our results were substantially worse than with the integrated model, so we omit them from the paper.

We find that decoding with a feature-rich target-context model is in fact feasible. In this section, we describe optimizations at different stages of our pipeline which make training and inference with our model practical.

3.1 Feature Extraction

We implemented the code for feature extraction only once; identical code is used at training time and in decoding. At training time, the generated features are written into a file whereas at test time, they are fed directly into the classifier via its library interface.

This design decision not only ensures consistency in feature representation but also makes the process of feature extraction efficient. In training, we are easily able to use multi-threading (already implemented in Moses) and because the processing of training data is a trivially parallel task, we can also use distributed computation and run separate instances of (multi-threaded) Moses on several machines. This enables us to easily produce training files from millions of parallel sentences within a short time.

3.2 Model Training

VW is a very fast classifier by itself; however, for very large data, its training can be further sped up by using parallelization. We take advantage of its implementation of the AllReduce scheme which we utilize in a grid engine environment. We shuffle and shard the data and then assign each shard to a worker job. With AllReduce, there is a master job which synchronizes the learned weight vector with all workers. We have compared this approach with the standard single-threaded, single-process training and found that we obtain identical model accuracy. We usually use around 10-20 training jobs. This way, we can process our large training files quickly and train the full model (using multiple passes over the data) within hours; effectively, neither feature extraction nor model training becomes a significant bottleneck in the full MT system training pipeline.

3.3 Decoding

In phrase-based decoding, translation is generated from left to right. At each step, a partial translation (initially empty) is extended by translating a previously uncovered part of the source sentence. There are typically many ways to translate each source span, which we refer to as translation options. The decoding process gradually extends the generated partial translations until the whole source sentence is covered; the final translation is then the full translation hypothesis with the highest model score. Various pruning strategies are applied to make decoding tractable.

Evaluating a feature-rich classifier during decoding is a computationally expensive operation. Because the features in our model depend on target-side context, the feature function which computes the classifier score cannot evaluate the translation options in isolation (independently of the partial translation). Instead, similarly to a language model, it needs to look at previously generated words. This also entails maintaining a state which captures the required context information.

A naive integration of the classifier would simply generate all source-context features, all target-context features and all features describing the translation option each time a partial hypothesis is evaluated. This is a computationally very expensive approach.
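The state mentioned above can be thought of as the target-side analogue of a language-model state. A minimal sketch of this idea (our own simplification; the actual Moses implementation represents states differently):

```python
class TargetContextState:
    """The two most recent target words, with their factors. Hypotheses
    that end in the same two words share the same state, so cached
    classifier results can be reused across them."""

    def __init__(self, prev_words=()):
        self.prev_words = tuple(prev_words)[-2:]

    def extend(self, phrase_words):
        # Appending a translated phrase yields the successor state.
        return TargetContextState(self.prev_words + tuple(phrase_words))

    def __eq__(self, other):
        return self.prev_words == other.prev_words

    def __hash__(self):
        return hash(self.prev_words)
```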


We instead propose several technical solutions which make decoding reasonably fast. Decoding a single sentence with the naive approach takes 13.7 seconds on average. With our optimization, this average time is reduced to 2.9 seconds, i.e. almost by 80 per cent. The baseline system produces a translation in 0.8 seconds on average.

Separation of source-context and target-context evaluation. Because we have a linear model, the final score is simply the dot product between a weight vector and a (sparse) feature vector. It is therefore trivial to separate it into two components: one that only contains features which depend on the source context and the other with target-context features. We can pre-compute the source-context part of the score before decoding (once we have all translation options for the given sentence). We cache these partial scores and when the translation option is evaluated, we add the partial score of the target-context features to arrive at the final classifier score.

Caching of feature hashes. VW uses feature hashing internally and it is possible to obtain the hash of any feature that we use. When we encounter a previously unseen target context (=state) during decoding, we store the hashes of extracted features in a cache. Therefore for each context, we only run the expensive feature extraction once. Similarly, we pre-compute feature hash vectors for all translation options.

Caching of final results. Our classifier locally normalizes the scores so that the probabilities of all translations for a given span sum to one. This cannot be done without evaluating all translation options for the span at the same time. Therefore, when we get a translation option to be scored, we fetch all translation options for the given source span and evaluate all of them. We then normalize the scores and add them to a cache of final results. When the other translation options come up, their scores are simply fetched from the cache. This can also further save computation when we get into a previously seen state (from the point of view of our classifier) and we evaluate the same set of translation options in that state; we will simply find the result in cache in such cases.

When we combine all of these optimizations, we arrive at the query algorithm shown in Figure 2.

    function EVALUATE(t, s)
        span = t.getSourceSpan()
        if not resultCache.has(span, s) then
            scores = ()
            if not stateCache.has(s) then
                stateCache[s] = CtxFeatures(s)
            end if
            for all t′ ∈ span.tOpts() do
                srcScore = srcScoreCache[t′]
                c.addFeatures(stateCache[s])
                c.addFeatures(translationCache[t′])
                tgtScore = c.predict()
                scores[t′] = srcScore + tgtScore
            end for
            normalize(scores)
            resultCache[span, s] = scores
        end if
        return resultCache[span, s][t]
    end function

Figure 2: Algorithm for obtaining classifier predictions during decoding. The variable t stands for the current translation, s is the current state and c is an instance of the classifier.
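The first of the optimizations above exploits the linearity of the model: because the classifier score is a dot product, it splits exactly into a source-context part (constant for the whole sentence, computed once) and a target-context part (added per state). A sketch under our own naming conventions; translation options are assumed to be hashable objects:

```python
def precompute_source_scores(w, translation_options):
    # Computed once per sentence, before decoding starts.
    cache = {}
    for t in translation_options:
        cache[t] = sum(w.get(f, 0.0) for f in t.source_features)
    return cache

def full_classifier_score(w, src_score_cache, t, target_context_feats):
    # At evaluation time, only the target-context part is summed up
    # and added to the pre-computed source-context part.
    return src_score_cache[t] + sum(w.get(f, 0.0) for f in target_context_feats)
```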


4 Experimental Evaluation

We run the main set of experiments on English to Czech translation. To verify that our method is applicable to other language pairs, we also present experiments in English to German, Polish, and Romanian.

In all experiments, we use Treex (Popel and Žabokrtský, 2010) to lemmatize and tag the source data and also to obtain dependency parses of all English sentences.

4.1 English-Czech Translation

As parallel training data, we use (subsets of) the CzEng 1.0 corpus (Bojar et al., 2012). For tuning, we use the WMT13 test set (Bojar et al., 2013) and we evaluate the systems on the WMT14 test set (Bojar et al., 2014). We lemmatize and tag the Czech data using Morphodita (Straková et al., 2014).

Our baseline system is a standard phrase-based Moses setup. The phrase table in both cases is factored and outputs also lemmas and morphological tags. We train a 5-gram LM on the target side of the parallel data.

We evaluate three settings in our experiments:

• baseline – vanilla phrase-based system,
• +source – our classifier with source-context features only,
• +target – our classifier with both source-context and target-context features.

For each of these settings, we vary the size of the training data for our classifier, the phrase table and the LM. We experiment with three different sizes: small (200 thousand sentence pairs), medium (5 million sentence pairs), and full (the whole CzEng corpus, over 14.8 million sentence pairs).

For each setting, we run system weight optimization (tuning) using minimum error rate training (Och, 2003) five times and report the average BLEU score. We use MultEval (Clark et al., 2011) to compare the systems and to determine whether the differences in results are statistically significant. We always compare the baseline with +source and +source with +target.

Table 2 shows the obtained results. Statistically significant differences (α=0.01) are marked in bold. The source-context model does not help in the small data setting but brings a substantial improvement of 0.7-0.8 BLEU points for the medium and full data settings, which is an encouraging result.

Target-side context information allows our model to push the translation quality further: even for the small data setting, it brings a substantial improvement of 0.5 BLEU points and the gain remains significant as the data size increases. Even in the full data setting, target-side features improve the score by roughly 0.2 BLEU points.

Our results demonstrate that feature-rich models scale to large data size both in terms of technical feasibility and of translation quality improvements. Target-side information seems consistently beneficial, adding further 0.2-0.5 BLEU points on top of the source-context model.

data size   small   medium   full
baseline    10.7    15.2     16.7
+source     10.7    16.0     17.3
+target     11.2    16.4     17.5

Table 2: BLEU scores obtained on the WMT14 test set. We report the performance of the baseline, the source-context model and the full model.

Intrinsic Evaluation. For completeness, we report intrinsic evaluation results. We evaluate the classifier on a held-out set (WMT13 test set) by extracting all phrase pairs from the test input aligned with the test reference (similarly as we would in training) and scoring each phrase pair (along with other possible translations of the source phrase) with our classifier. An instance is classified correctly if the true translation obtains the highest score by our model. A baseline which always chooses the most frequent phrasal translation obtains an accuracy of 51.5. For the source-context model, the held-out accuracy was 66.3, while the target-context model achieved an accuracy of 74.8. Note that this high difference is somewhat misleading because in this setting, the target-context model has access to the true target context (i.e., it is cheating).

4.2 Additional Language Pairs

We experiment with translation from English into German, Polish, and Romanian.

Our English-German system is trained on the data available for the WMT14 translation task: Europarl (Koehn, 2005) and the Common Crawl corpus,³ roughly 4.3 million sentence pairs altogether. We tune the system on the WMT13 test set and we test on the WMT14 set. We use TreeTagger (Schmid, 1994) to lemmatize and tag the German data.

English-Polish has not been included in WMT shared tasks so far, but was present as a language pair for several IWSLT editions which concentrate on TED talk translation. Full test sets are only available for 2010, 2011, and 2012. The references for 2013 and 2014 were not made public. We use the development set and test set from 2010 as development data for parameter tuning. The remaining two test sets (2011, 2012) are our test data. We train on the concatenation of Europarl and WIT3 (Cettolo et al., 2012), ca. 750 thousand sentence pairs. The Polish half has been tagged using WCRFT (Radziszewski, 2013) which produces full morphological tags compatible with the NKJP tagset (Przepiórkowski, 2009).

English-Romanian was added in WMT16. We train our system using the available parallel data – Europarl and SETIMES2 (Tiedemann, 2009), roughly 600 thousand sentence pairs. We tune the English-Romanian system on the official development set and we test on the WMT16 test set. We use the online tagger by Tufis et al. (2008) to preprocess the data.

Table 3 shows the obtained results.

³ http://commoncrawl.org/
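The intrinsic evaluation just described can be made concrete with a small sketch. The data representation is hypothetical: each test instance carries the source phrase, the reference translation, the phrase-table candidates and the (gold) context.

```python
def intrinsic_accuracy(instances, score):
    # An instance counts as correct if the true translation gets the
    # highest model score among all candidates for its source phrase.
    correct = sum(1 for src, true_tgt, candidates, ctx in instances
                  if max(candidates, key=lambda t: score(src, t, ctx)) == true_tgt)
    return 100.0 * correct / len(instances)

def most_frequent_baseline(instances, cooc_counts):
    # The 51.5 baseline: always pick the translation most frequently
    # observed with the source phrase in the training data.
    correct = sum(1 for src, true_tgt, candidates, _ in instances
                  if max(candidates, key=lambda t: cooc_counts.get((src, t), 0)) == true_tgt)
    return 100.0 * correct / len(instances)
```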


input:    the most intensive mining took place there from 1953 to 1962 .
baseline: nejvíce intenzivní těžba došlo tam z roku 1953 , aby 1962 .
          the most intensive mining_nom occurred there from 1953 , in order to 1962 .
+source:  nejvíce intenzivní těžby místo tam z roku 1953 do roku 1962 .
          the most intensive mining_gen place there from year 1953 until year 1962 .
+target:  nejvíce intenzivní těžba probíhala od roku 1953 do roku 1962 .
          the most intensive mining_nom occurred from year 1953 until year 1962 .

Figure 3: An example sentence from the test set. Each translation has a corresponding gloss in italics. Errors are marked in bold.

Similarly to the English-Czech experiments, BLEU scores are averaged over 5 independent optimization runs. Our system outperforms the baseline by 0.5-0.7 BLEU points in all cases, showing that the method is applicable to other languages with rich morphology.

language   de     pl (2011)   pl (2012)   ro
baseline   15.7   12.8        10.4        19.6
+target    16.2   13.4        11.1        20.2

Table 3: BLEU scores of the baseline and of the full model for English to German, Polish, and Romanian.

5 Analysis

We manually analyze the outputs of English-Czech systems. Figure 3 shows an example sentence from the WMT14 test set translated by all the system variants. The baseline system makes an error in verb valency; the Czech verb "došlo" could be used but this verb already has an (implicit) subject and the translation of "mining" ("těžba") would have to be in a different case and at a different position in the sentence. The second error is more interesting, however: the baseline system fails to correctly identify the word sense of the particle "to" and translates it in the sense of purpose, as in "in order to". The source-context model takes the context (span of years) into consideration and correctly disambiguates the translation of "to", choosing the temporal meaning. It still fails to translate the main verb correctly, though. Only the full model with target-context information is able to also correctly translate the verb and inflect its arguments according to their roles in the valency frame. The translation produced by this final system in this case is almost flawless.

In order to verify that the automatically measured results correspond to visible improvements in translation quality, we carried out two annotation experiments. We took a random sample of 104 sentences from the test set and blindly ranked two competing translations (the selection of sentences was identical for both experiments). In the first experiment, we compared the baseline system with +source. In the other experiment, we compared the baseline with +target. The instructions for annotation were simply to compare overall translation quality; we did not ask the annotator to look for any specific phenomena. In terms of automatic measures, our selection has similar characteristics as the full test set: BLEU scores obtained on our sample are 15.08, 16.22 and 16.53 for the baseline, +source and +target respectively.

In the first case, the annotator marked 52 translations as equal in quality, 26 translations produced by +source were marked as better and in the remaining 26 cases, the baseline won the ranking. Even though there is a difference in BLEU, human annotation does not confirm this measurement, ranking both systems equally.

In the second experiment, 52 translations were again marked as equal. In 34 cases, +target produced a better translation while in 18 cases, the baseline output won. The difference between the baseline and +target suggests that the target-context model may provide information which is useful for translation quality as perceived by humans.

Our overall impression from looking at the system outputs was that both the source-context and target-context models tend to fix many morpho-syntactic errors. Interestingly, we do not observe as many improvements in word/phrase sense disambiguation, though the source context does help semantics in some sentences. The target-context model tends to preserve the overall agreement and coherence better than the system with a source-context model only. We list several such examples in Figure 4.


input:    destruction of the equipment means that Syria can no longer produce new chemical weapons .
+source:  zničením zařízení znamená , že Sýrie již nemůže vytvářet nové chemické zbraně .
          destruction_instr of equipment means , that Syria already cannot produce new chemical weapons .
+target:  zničení zařízení znamená , že Sýrie již nemůže vytvářet nové chemické zbraně .
          destruction_nom of equipment means , that Syria already cannot produce new chemical weapons .

input:    nothing like that existed , and despite that we knew far more about each other .
+source:  nic takového neexistovalo , a přesto jsme věděli daleko víc o jeden na druhého .
          nothing like that existed , and despite that we knew far more about one_nom on other .
+target:  nic takového neexistovalo , a přesto jsme věděli daleko víc o sobě navzájem .
          nothing like that existed , and despite that we knew far more about each other .

input:    the authors have been inspired by their neighbours .
+source:  autoři byli inspirováni svých sousedů .
          the authors have been inspired their_gen neighbours_gen .
+target:  autoři byli inspirováni svými sousedy .
          the authors have been inspired their_instr neighbours_instr .

Figure 4: Example sentences from the test set showing improvements in morphological coherence. Each translation has a corresponding gloss in italics. Errors are marked in bold.

Each of them is fully corrected by the target-context model, producing an accurate translation of the input.

6 Related Work

Discriminative models in MT have been proposed before. Carpuat and Wu (2007) trained a maximum entropy classifier for each source phrase type which used source context information to disambiguate its translations. The models did not capture target-side information and they were independent; no parameters were shared between classifiers for different phrases. They used a strong feature set originally developed for word sense disambiguation. Gimpel and Smith (2008) also used wider source-context information but did not train a classifier; instead, the features were included directly in the log-linear model of the decoder. Mauser et al. (2009) introduced the "discriminative word lexicon" and trained a binary classifier for each target word, using as features only the bag of words (from the whole source sentence). Training sentences where the target word occurred were used as positive examples, other sentences served as negative examples. Jeong et al. (2010) proposed a discriminative lexicon with a rich feature set tailored to translation into morphologically rich languages; unlike our work, their model only used source-context features.

Subotin (2011) included target-side context information in a maximum-entropy model for the prediction of morphology. The work was done within the paradigm of hierarchical PBMT and assumes that cube pruning is used in decoding. Their algorithm was tailored to the specific problem of passing non-local information about morphological agreement required by individual rules (such as explicit rules enforcing subject-verb agreement). Our algorithm only assumes that hypotheses are constructed left to right and provides a general way for including target context information in the classifier, regardless of the type of features. Our implementation is freely available and can be further extended by other researchers in the future.

7 Conclusions

We presented a discriminative model for MT which uses both source and target context information. We have shown that such a model can be used directly during decoding in a relatively efficient way. We have shown that this model consistently significantly improves the quality of English-Czech translation over a strong baseline with large training data. We have validated the effectiveness of our model on several additional language pairs. We have provided an analysis showing concrete examples of improved lexical selection and morphological coherence. Our work is available in the main branch of Moses for use by other researchers.

Acknowledgements

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 644402 (HimL) and 645452 (QT21), from the European Research Council (ERC) under grant agreement no. 640550, and from the SVV project number 260 333. This work has been using language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).


References

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2012. The Joy of Parallelism with CzEng 1.0. In Proc. of LREC, pages 3921–3928. ELRA.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, MD, USA. Association for Computational Linguistics.

Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, Czech Republic.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, MD, USA.

K. Gimpel and N. A. Smith. 2008. Rich Source-Side Context for Statistical Machine Translation. Columbus, Ohio.

Minwoo Jeong, Kristina Toutanova, Hisami Suzuki, and Chris Quirk. 2010. A discriminative lexicon model for complex morphology. In The Ninth Conference of the Association for Machine Translation in the Americas. Association for Computational Linguistics, November.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Arne Mauser, Sasa Hasan, and Hermann Ney. 2009. Extending Statistical Machine Translation with Discriminative and Trigger-Based Lexicon Models. Pages 210–218, Suntec, Singapore.

Jan Niehues, Teresa Herrmann, Stephan Vogel, and Alex Waibel. 2011. Wider Context by Using Bilingual Language Models in Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 198–206, Edinburgh, Scotland, July. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL, pages 160–167, Sapporo, Japan. ACL.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP Framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, IceTAL 2010, volume 6233 of Lecture Notes in Computer Science, pages 293–304. Iceland Centre for Language Technology (ICLT), Springer.

Adam Przepiórkowski. 2009. A comparison of two morphosyntactic tagsets of Polish. In Violetta Koseska-Toszewa, Ludmila Dimitrova, and Roman Roszko, editors, Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, pages 138–144, Warsaw.

Adam Radziszewski. 2013. A tiered CRF tagger for Polish. In Robert Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform, volume 467 of Studies in Computational Intelligence, pages 215–230. Springer.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June. Association for Computational Linguistics.

Michael Subotin. 2011. An exponential translation model for target language morphology. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 230–238.

Aleš Tamchyna, Fabienne Braune, Alexander Fraser, Marine Carpuat, Hal Daumé III, and Chris Quirk. 2014. Integrating a discriminative classifier into phrase-based and hierarchical decoding. The Prague Bulletin of Mathematical Linguistics, 101:29–41.

Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan Stefanescu. 2008. Racai's linguistic web services. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco.

D. Vickrey, L. Biewald, M. Teyssier, and D. Koller. 2005. Word-Sense Disambiguation for Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Vancouver, Canada, October.


H CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten

CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten

Aleš Tamchyna¹,²   Roman Sudarikov¹   Ondřej Bojar¹   Alexander Fraser²
¹Charles University in Prague, Prague, Czech Republic
²LMU Munich, Munich, Germany
[email protected]   [email protected]

Abstract

This paper describes the phrase-based systems jointly submitted by CUNI and LMU to the English-Czech and English-Romanian News translation tasks of WMT16. In contrast to previous years, we strictly limited our training data to the constrained datasets, to allow for a reliable comparison with other research systems. We experiment with using several additional models in our system, including a feature-rich discriminative model of phrasal translation.

1 Introduction

We have a long-term experience with English-to-Czech machine translation and over the years, our systems have grown together from a rather diverse set of system types to a single system combination called CHIMERA (Bojar et al., 2013).

This system has been successful in the previous three years of WMT (Bojar et al., 2013; Tamchyna et al., 2014; Bojar and Tamchyna, 2015) and we follow a similar design this year. Unlike previous years, we only use constrained data in system training, to allow for a more meaningful comparison with the competing systems. The gains thanks to the additional data, in contrast to the gains thanks to the system combination, have been evaluated in terms of BLEU in Bojar and Tamchyna (2015). The details of our English-to-Czech system are in Section 2.

In this work, we also present our system submission for English-Romanian translation. This system uses a factored setting similar to CHIMERA but lacks its two key components: the deep-syntactic translation system TectoMT and the rule-based post-processing component Depfix. All details are in Section 3.

2 English-Czech System

Our "baseline" setup is fairly complex, following Bojar et al. (2013). The key components of CHIMERA are:

• Moses, a phrase-based factored system (Koehn et al., 2007).
• TectoMT, a deep-syntactic transfer-based system (Popel and Žabokrtský, 2010).
• Depfix, a rule-based post-processing system (Rosa et al., 2012).

The core of the system is Moses. We combine it with TectoMT in a simple way which we refer to as "poor man's" system combination: we translate our development and test data with TectoMT first and then add the source sentences and their translations as additional (synthetic) parallel data to the Moses system. This new corpus is used to train a separate phrase table. At test time, we run Moses which uses both phrase tables and we correct its output using Depfix. The system is described in detail in Bojar et al. (2013).

Our subsequent analysis in Tamchyna and Bojar (2015) shows that the contribution of TectoMT is essential for the performance of CHIMERA. In particular, TectoMT provides new translations which are otherwise not available to the phrase-based system and it also improves the morphological and syntactic coherence of translations.
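A minimal sketch of the "poor man's" combination step just described. The function names are ours; in reality both phrase tables are trained with the standard Moses pipeline rather than a single call:

```python
def poor_mans_combination(tectomt_translate, dev_sents, test_sents, train_phrase_table):
    # TectoMT translations of the dev and test sources become a small
    # synthetic parallel corpus...
    synthetic = [(src, tectomt_translate(src)) for src in dev_sents + test_sents]
    # ...from which a second, separate phrase table is trained. At test
    # time Moses consults both this table and the main one.
    return train_phrase_table(synthetic)
```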


2.1 Translation Models

Similarly to previous years, we build two phrase tables – one from parallel data and another from TectoMT translations of the development and test sets. Here we describe the first phrase table.

Our main system uses CzEng16pre (Bojar et al., 2016) as parallel data. We train a factored TM which uses surface forms on the source and produces target form, lemma and tag. Similarly to previous years, we find that increasing the phrase table limit (the maximum number of possible translations per source phrase) is necessary to obtain good performance.

Our input is also factored (though the phrase tables do not condition on these additional factors) and contains the form, lemma and morphological tag. We use these factors to extract rich features for our discriminative context model.

Linearly interpolated translation models. There is some evidence that when dealing with heterogeneous domains, it might be beneficial to construct the final TM as a linear, uniform interpolation of many small phrase tables (Carpuat et al., 2014). We experiment with splitting the data into 20 parts (without any domain selection, simply a random shuffle) and using linear interpolation to combine the partial models. The added benefit is that phrase extraction for all these parts can run in parallel (2h25m per part on average). The merging of these parts took 16h12m, which is still substantially faster than the single extraction (53h7m).

2.2 Language Models

Our LM configuration is based on the successful setting from previous years; however, all LMs are trained using the constrained data. This is a major difference from our previous submissions which used several gigawords of monolingual text for language modeling.

We train a 7-gram LM on surface forms from all monolingual news data available for WMT. This LM is linearly interpolated (each year is a separate model) to optimize perplexity on a held-out set (WMT newstest2012). The individual LMs were pruned: we discarded all singleton n-grams (apart from unigrams).

All other LMs are trained on a simple concatenation of the news part of CzEng16pre and all WMT monolingual news sets. We train 4-gram LMs on forms and lemmas (with a different pruning scheme: we discard 2- and 3-grams which appear fewer than 2 or 3 times, respectively).

We have two LMs over morphological tags to help maintain the morphological coherence of translation outputs. The first LM is a 10-gram model and the second one is a 15-gram model, aimed at overall sentence structure. We prune all singleton n-grams (again, with the exception of unigrams).
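Linear interpolation of the per-year LMs with weights tuned towards held-out perplexity is standard practice (this is what tools such as SRILM's compute-best-mix do). A compact EM sketch, assuming each model exposes a hypothetical prob(word, history) method:

```python
def em_interpolation_weights(lms, heldout, iterations=20):
    # heldout is a sequence of (word, history) events, e.g. from newstest2012.
    k = len(lms)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        resp = [0.0] * k  # expected counts (responsibilities) per model
        for word, history in heldout:
            probs = [lam * lm.prob(word, history) for lam, lm in zip(weights, lms)]
            z = sum(probs)
            for i, p in enumerate(probs):
                resp[i] += p / z
        total = sum(resp)
        weights = [r / total for r in resp]
    # The interpolated model is P(w|h) = sum_i weights[i] * lms[i].prob(w, h).
    return weights
```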
2.3 Discriminative Translation Model

We add a feature-rich, discriminative model of phrasal translation to our system (Tamchyna et al., 2016). This classifier produces a single phrase translation probability which is additionally conditioned on the full source sentence and limited left-hand-side target context. The probability is added as an additional feature to Moses' log-linear model. The motivation for adding the context model is to improve lexical choice (which can be better inferred thanks to full source-context information) and morphological coherence.

The model uses a rich feature set on both sides: in the source, the model has access to the full input sentence and uses surface forms, lemmas and tags. On the target side, the model has access to limited context (similarly to an LM) and uses target surface forms, lemmas and tags. However, our English-Czech submission to WMT16 does not use target-context information due to time constraints.

2.4 Lexicalized Reordering and OSM

We experiment with using a lexicalized reordering model (Koehn et al., 2005) in the common setting: monotone/swap/discontinuous reordering, word-based extraction, bidirectional, conditioned both on the source and the target language.

We also train an operation sequence model (OSM, Durrani et al., 2013), which is a generative model that sees the translation process as a linear sequence of operations which generate a source and target sentence in parallel. The probability of a sequence of operations is defined according to an n-gram model, that is, the probability of an operation depends on the n−1 preceding operations. We have trained our 5-gram model on surface forms, using the CzEng16pre corpus.

2.5 Hard POS for Short Words

In addition to the more principled attempts at improving our model, mainly Section 2.3, we also manually checked the output and added an ad-hoc solution for the single most disturbing error: the abbreviated form "'s" was often translated as the verb "to be" even in clearly possessive uses.

The ambiguity of "'s" is apparently easy to resolve; our tagger does not have problems distinguishing and tagging the abbreviation as POS (possessive), VBZ (present tense) and other situations.


While the POS information is readily available to the discriminative model, the model might not be able to pick it up due to its wide focus on many phenomena. As an alternative, we simply modify the input token and append the POS tag to it for all tokens under three characters.

This hack clearly helps with "'s": in a small manual analysis of 52 occurrences of "'s", the discriminative model still translated 7 possessive meanings as present tense, while the hacked model avoided these errors. It would be best to combine these two approaches, but we did not have the time to run this setting for the WMT evaluation.
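A sketch of the token modification itself. The "|" separator and the exact threshold below are our guesses at plausible choices; the paper only states that the POS tag is appended to tokens under three characters.

```python
def append_pos_to_short_words(tagged_tokens, max_len=2, sep="|"):
    # "John 's car" (tagged POS) and "he 's here" (tagged VBZ) now yield
    # distinct input tokens "'s|POS" vs. "'s|VBZ", so the phrase table
    # can translate the two uses of "'s" differently.
    return [form + sep + tag if len(form) <= max_len else form
            for form, tag in tagged_tokens]
```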

2.6 Results

We evaluate all system variants on the WMT15 test set and report all BLEU scores in Table 1 prior to applying the last component, Depfix.

The reordering model achieved mixed results in our initial experiments and we opt not to include it in our final submission, relying instead only on the standard distortion penalty feature.

As in previous years, the addition of TectoMT to the main phrase table extracted from the parallel corpus (denoted "CzEng" in Table 1) is highly beneficial, improving the BLEU score by roughly 1.2 points. The addition of OSM also helps, adding about 0.7 points.

The source-context discriminative model does not improve translation quality according to BLEU. We suspect that the space for its contribution is diminished by the addition of TectoMT and possibly also the OSM and the strong LMs. This system (labelled with ∗) was submitted as a primary system CU-TAMCHYNA. After the deadline, we also ran an experiment which included target-context features in the model and obtained a BLEU of 20.96.

Experiments with the interpolated TM ("CzEng20 parts" in the table) and POS appended to words under three characters show a lower BLEU score (20.70, denoted •) but we also carried out a small manual evaluation where the system output seemed to be better than the baseline (20.91). We therefore submitted this system as our primary CU-CHIMERA.

In the official WMT16 manual evaluation, both our systems end up in the same cluster, ranking #4 and #5 among all systems for this language pair. The hacked system (•) seems negligibly better (0.302 TrueSkill) than the one with the discriminative model (∗, reaching 0.299 TrueSkill).

As a contrastive result, CHIMERA, ranking #1 last year, achieves a BLEU score of 20.46 on newstest2015 (also prior to the application of Depfix). This suggests that even though we limited our training data this year, we did not lose anything in terms of translation quality.

TMs                          OSM   Disc.   POS   BLEU
CzEng                        –     –       –     19.08 ± 0.62
CzEng+TectoMT                –     –       –     20.23 ± 0.64
CzEng+TectoMT                ✓     –       –     20.91 ± 0.67
CzEng+TectoMT (∗)            ✓     ✓       –     20.89 ± 0.69
CzEng20 parts+TectoMT (•)    ✓     –       ✓     20.70 ± 0.66
Chimera in WMT15                                 20.46

Table 1: Different experiment configurations for CHIMERA. We report BLEU scores on newstest2015. The system denoted ∗ corresponds to our WMT16 submission cu-tamchyna and the system denoted • corresponds to cu-chimera.

3 English-Romanian System

We also submitted a constrained phrase-based system for English→Romanian translation which is loosely inspired by the basic components of CHIMERA. Additionally, our submission uses the source- and target-context discriminative translation model as well.

3.1 Data and Pre-Processing

We use all the data available to constrained submissions: the Europarl v8 (Koehn, 2005) and SETIMES2 (Tiedemann, 2009) parallel corpora and the News 2015 and Common Crawl monolingual corpora.¹ We split the official development set into two halves; we use the first part for system tuning and the second part serves as our test set.

Data pre-processing differs between English and Romanian. For English, we use Treex (Popel and Žabokrtský, 2010) to obtain morphological tags, lemmas and dependency parses of the sentences. For Romanian, we use the online tagger by Tufis et al. (2008) as run by our colleagues at LIMSI-CNRS for the joint QT21 Romanian system (Peter et al., 2016).

¹ http://commoncrawl.org/


3.2 Factored Translation

Similarly to CHIMERA, we train a factored phrase table which translates source surface forms to tuples (form, lemma, tag). Our input is factored and contains the form, lemma, morphological tag, lemma of dependency parent and analytical function ("surface" syntactic role, e.g. Subj for subjects). These additional source-side factors are again not used by the phrase table and serve only as information for the discriminative model.

3.3 Language Models

Our full system contains three separate language models (LMs). The first is a 5-gram LM over surface forms, trained on the target side of the parallel data and monolingual news 2015.

The second LM only uses 4-grams but additionally contains the full Common Crawl corpus. We prune this second LM by discarding 2-, 3- and 4-grams which appear fewer than 2, 3, 4 times, respectively.

Finally, we also include a 7-gram LM over morphological tags. We only use target parallel data for estimating the model.

3.4 Reordering Model

Similarly to our experiments with CHIMERA, we utilize a lexicalized reordering model (Koehn et al., 2005). Again, we model monotone/swap/discontinuous reordering, word-based extraction, bidirectional, conditioned both on the source and the target language.

3.5 Discriminative Translation Model

We utilize the same discriminative model as for CHIMERA. For English-Romanian, we also use dependency parses of the source sentences and target-side context features as additional sources of information in our official submission.

3.6 Results

Table 2 lists BLEU scores of various system settings. Each BLEU score is an average over 5 runs of system tuning (MERT, Och, 2003). The table shows how the BLEU score develops as we add the individual components to the system: the 7-gram morphological LM ("tagLM"), the 4-gram LM from Common Crawl ("ccrawl"), the lexicalized reordering ("RM") and finally the discriminative translation model ("discTM").

Setting    BLEU
baseline   26.2
+tagLM     26.6
+ccrawl    28.0
+RM        28.1
+discTM    28.3

Table 2: BLEU scores of system variants for English-Romanian translation.

We test for statistical significance using MultEval (Clark et al., 2011); we test each new component against the system without it (i.e., +tagLM is compared to baseline, +ccrawl is tested against +tagLM etc.). When the p-value is lower than 0.05, we mark the result in bold.
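MultEval implements approximate randomization; a commonly used alternative with the same flavour is paired bootstrap resampling, sketched below for any corpus-level metric callable (the `metric` interface is our own assumption):

```python
import random

def paired_bootstrap(metric, hyps_a, hyps_b, refs, samples=1000):
    # Resample the test set with replacement; count how often system B
    # fails to beat system A on the resampled sets.
    n = len(refs)
    losses = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if b <= a:
            losses += 1
    return losses / samples  # approximate p-value for "B is better than A"
```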
We observe a relatively steady additive effect of the individual components: the addition of each model (apart from lexicalized reordering) leads to a statistically significant improvement in translation quality.

Our discriminative model further improves the system, despite only being trained on the parallel data (roughly 0.6 M sentence pairs) and building upon the strong language models which use orders-of-magnitude larger monolingual data (almost 300 M sentences). This variant (BLEU 28.3) corresponds to our submission LMU-CUNI.

4 Conclusion

We have described our English-Czech and English-Romanian submissions to WMT16: CU-CHIMERA, CU-TAMCHYNA and LMU-CUNI.

For English-Czech, our work is an incremental improvement of the previously successful CHIMERA system. This time, our submission is constrained and additionally uses interpolated TMs, an OSM and a discriminative phrasal translation model.

For English-Romanian, we have built a system somewhat similar to the statistical component of CHIMERA. We have added the discriminative model which conditions both on the source and target context to the system and obtained a small but significant improvement in BLEU.

5 Acknowledgement

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 644402 (HimL) and no. 645452 (QT21). This work has been using language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071). This work was partially supported by SVV project number 260 333.


References

Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.

Ondřej Bojar, Rudolf Rosa, and Aleš Tamchyna. 2013. Chimera – Three Heads for English-to-Czech Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 92–98, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar and Aleš Tamchyna. 2015. CUNI in WMT15: Chimera Strikes Again. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 79–83, Lisboa, Portugal. Association for Computational Linguistics.

Marine Carpuat, Cyril Goutte, and George Foster. 2014. Linear mixture models for robust machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 499–509, Baltimore, Maryland, USA. Association for Computational Linguistics.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nadir Durrani, Alexander M. Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov models over minimal translation units help phrase-based SMT? In ACL (2), pages 399–405.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the Association for Computational Linguistics, Sapporo, Japan.

Jan-Thorsten Peter, Tamer Alkhouli, Matthias Huck, Hermann Ney, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, Stella Frank, and Mārcis Pinnis. 2016. The QT21/HimL Combined Machine Translation System. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Berlin, Germany. Association for Computational Linguistics. In print.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP Framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, IceTAL 2010, volume 6233 of Lecture Notes in Computer Science, pages 293–304. Iceland Centre for Language Technology (ICLT), Springer.

Rudolf Rosa, David Mareček, and Ondřej Dušek. 2012. DEPFIX: A System for Automatic Correction of Czech MT Outputs. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 362–368, Montréal, Canada. Association for Computational Linguistics.

Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. 2016. Target-Side Context for Discriminative Models in Statistical Machine Translation. In Proc. of ACL, Berlin, Germany. Association for Computational Linguistics. In print.

Aleš Tamchyna and Ondřej Bojar. 2015. What a Transfer-Based System Brings to the Combination with PBMT. In Bogdan Babych, Kurt Eberle, Patrik Lambert, Reinhard Rapp, Rafael Banchs, and Marta Costa-Jussà, editors, Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra), pages 11–20, Stroudsburg, PA, USA. Association for Computational Linguistics.

Aleš Tamchyna, Martin Popel, Rudolf Rosa, and Ondřej Bojar. 2014. CUNI in WMT14: Chimera Still Awaits Bellerophon. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 195–200, Baltimore, MD, USA. Association for Computational Linguistics.

Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan Stefanescu. 2008. Racai's linguistic web services. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco.

Page 91 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

I Sampling Phrase Tables for the Moses Statistical Machine Translation System


Sampling Phrase Tables for the Moses Statistical Machine Translation System

Ulrich Germann

University of Edinburgh

Abstract The idea of virtual phrase tables for statistical machine translation (SMT) that construct phrase table entries on demand by sampling a fully indexed bitext was first proposed ten years ago by Callison-Burch et al. (2005). However, until recently (Germann, 2014), no working and practical implementation of this approach was available in the Moses SMT system. We describe and evaluate this implementation in more detail. Sampling phrase tables are much faster to build and are competitive with conventional phrase tables in terms of translation quality and speed.

1. Introduction

Phrase-based statistical MT translates by concatenating phrase-level translations that are looked up in a dictionary called the phrase table. In this context, a phrase is any sequence of consecutive words, regardless of whether or not it is a phrase from a linguistic point of view. In addition to the translation options for each phrase, the table stores for each translation option a number of scores that are used by the translation engine (decoder) to rank translation hypotheses according to a statistical model. In the Moses SMT system, the phrase table is traditionally pre-computed as shown in Fig. 1. First, all phrases up to an arbitrary length limit (usually between 5 and 7 words), together with their corresponding translations, are extracted from a word-aligned parallel corpus, using the word alignment links as a guide to establish translational correspondence between phrases. Phrase pairs are scored both in the forward and backward translation direction, i.e., p(target | source) and p(source | target), respectively. Computing these scores is traditionally done by sorting the lists on disk first to facilitate the accumulation of marginals.

© 2015 PBML. Distributed under CC BY-NC-ND. Corresponding author: [email protected] Cite as: Ulrich Germann. Sampling Phrase Tables for the Moses Statistical Machine Translation System. The Prague Bulletin of Mathematical Linguistics No. 104, 2015, pp. 39–50. doi: 10.1515/pralin-2015-0012.

Page 92 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

PBML 104 OCTOBER 2015

(Figure 1, a flowchart: phrase pairs are extracted from the word-aligned bitext (parallel corpus); the phrase pair list is inverted; both halves are sorted and scored; a merge sort produces the full phrase table in text format with forward and backward scores; the table is then pruned and finally binarised.)

Figure 1. Conventional Phrase Table Construction

This approach requires sorting the list of extracted phrase pairs at least twice: once to obtain joint and marginal counts for the estimation of the forward translation probabilities, and once to calculate the marginals for the backward probabilities. In practice, forward and backward scoring take place in parallel, as shown in Figure 1.

The resulting phrase tables often have considerable levels of noise, due to misaligned sentence pairs or alignment errors at the word level. Phrase table pruning removes entries of dubious quality. Even with pruning, conventional phrase tables built from large amounts of parallel data are often too large to be loaded and stored completely in memory. Therefore, various binary phrase table implementations were developed in Moses over the years, providing access to disk-based database structures (Zens and Ney, 2007)1 or using compressed representations that can be mapped into memory and "unpacked" on demand (Junczys-Dowmunt, 2012).
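To make the scoring step concrete, here is a minimal in-memory sketch of relative-frequency phrase scoring, assuming phrase pairs have already been extracted (function and variable names are ours). The on-disk sorting described above exists precisely because real phrase pair lists do not fit in memory the way this toy example does.

from collections import Counter

def score_phrase_pairs(pairs):
    """Relative-frequency scoring of extracted phrase pairs.

    pairs: iterable of (source_phrase, target_phrase) tuples, one per
    extraction event. Returns {(src, trg): (p_fwd, p_bwd)} with
    p_fwd = p(trg | src) and p_bwd = p(src | trg).
    """
    joint = Counter(pairs)               # joint counts c(s, t)
    src_marginal = Counter()             # marginal counts c(s)
    trg_marginal = Counter()             # marginal counts c(t)
    for (s, t), c in joint.items():
        src_marginal[s] += c
        trg_marginal[t] += c
    return {(s, t): (c / src_marginal[s],   # p(t | s)
                     c / trg_marginal[t])   # p(s | t)
            for (s, t), c in joint.items()}

# Toy usage: three extraction events over two phrase pairs.
table = score_phrase_pairs([("das haus", "the house"),
                            ("das haus", "the house"),
                            ("das haus", "the building")])
print(table[("das haus", "the house")])     # (0.666..., 1.0)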

1 The original implementation by R. Zens (PhraseDictionaryBinary) has recently been replaced in Moses by PhraseDictionaryOnDisk (H. Hoang, personal communication).


Due to the way they are built, conventional phrase tables for Moses are fundamentally static in nature: they cannot be updated easily without repeating the entire costly creation process.

2. Phrase tables with on-demand sampling

As an alternative to pre-computed phrase tables, Callison-Burch et al. (2005) suggested the use of suffix arrays (Manber and Myers, 1990) to index the parallel training data for full-text search, and to create phrase table entries on demand at translation time, by sampling the occurrences in the bitext of each source phrase in question and extracting counts and statistics as necessary.

A suffix array over a corpus ⟨w1, ..., wn⟩ is an array ⟨1 ... n⟩ of all token positions in that corpus, sorted in lexicographic order of the token sequences that start at the respective positions. Figure 2 shows a letter-based suffix array over the word 'suffixarray'. For bitext indexing for MT, we index at the word level.

Figure 2. Letter-based suffix array over the word 'suffixarray' (reconstructed):
     7  array
    10  ay
     3  ffixarray
     4  fixarray
     5  ixarray
     9  ray
     8  rray
     1  suffixarray
     2  uffixarray
     6  xarray
    11  y

Given a suffix array and the underlying corpus, we can easily find all occurrences of a given search sequence by performing a binary search in the array to determine the first item that is greater than or equal to the search sequence, and a second search to find the first item that is strictly greater. Every item in this sub-range of the array is the start position of an occurrence of the search sequence in the corpus. From this pool of occurrences, we extract phrase translations for a reasonably large sample using the usual phrase extraction heuristics.

Lopez (2007, 2008) explored this approach in detail in the context of hierarchical phrase-based translation (Chiang, 2007). Schwartz and Callison-Burch (2010) implemented Lopez's methods in the Joshua decoder (Li et al., 2009). Suffix array-based translation rule extraction is also used in cdec (Dyer et al., 2010). However, until recently (Germann, 2014), no efficient, working implementation of sampling phrase tables was available in the Moses decoder. The purpose of this article is to document this implementation in detail, and to present results of empirical evaluations that demonstrate that sampling phrase tables are an attractive, efficient, and competitive alternative to conventional phrase tables for phrase-based SMT.

The apparent lack of interest in sampling phrase tables in the phrase-based SMT community may have been partly due to the fact that naïve implementations of the approach tend to perform worse than conventional phrase tables. To illustrate this point, we repeat in Table 1 the results of a comparison of systems from Germann (2014). Several German-to-English systems were constructed with conventional and sampling phrase tables. All systems were trained on ca. 2 million parallel sentence pairs from Europarl (Version 7) and the News Commentary corpus (Version 9), both available from the web site of the 2014 Workshop on Statistical Machine Translation (WMT).2
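The following sketch illustrates the data structure and the two binary searches on a toy word-level corpus. It is our illustration of the idea, not the Moses implementation, and it uses the key argument of the bisect module, which requires Python 3.10 or later.

import bisect

def build_suffix_array(tokens):
    """Token positions sorted by the suffix starting at each position.
    Naive O(n^2 log n) construction; fine for illustration."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, sa, query):
    """All start positions of query, via two binary searches: the first
    suffix >= query, and the first suffix strictly greater (comparing
    only the first len(query) tokens of each suffix)."""
    m = len(query)
    first = bisect.bisect_left(sa, query, key=lambda i: tokens[i:i + m])
    last = bisect.bisect_right(sa, query, key=lambda i: tokens[i:i + m])
    return sorted(sa[first:last])

corpus = "the cat sat on the mat".split()
sa = build_suffix_array(corpus)
print(find_occurrences(corpus, sa, ["the"]))   # -> [0, 4]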


 #  method                                         low    high   median  mean   95% conf. interval(a)  runs
 1  precomp., Kneser-Ney smoothing                 18.36  18.50  18.45   18.43  17.93 – 18.95          10
 2  precomp., Good-Turing smoothing                18.29  18.63  18.54   18.52  18.05 – 19.05          10
 3  precomp., Good-Turing smoothing, filtered(b)   18.43  18.61  18.53   18.53  18.04 – 19.08          10
 4  precomp., no smoothing                         17.86  18.12  18.07   18.05  17.58 – 18.61          10
 5  max. 1000 smpl., no sm., no bwd. prob.         16.70  16.92  16.84   16.79  16.35 – 17.32          10
 6  max. 1000 smpl., no sm., with bwd. prob.       17.61  17.72  17.69   17.68  17.14 – 18.22           8
 7  max. 1000 smpl., α = .05, with bwd. prob.(c)   18.35  18.43  18.38   18.38  17.86 – 18.90          10
 8  max. 1000 smpl., α = .01, with bwd. prob.      18.43  18.65  18.53   18.52  18.03 – 19.12          10
 9  max. 0100 smpl., α = .01, with bwd. prob.      18.40  18.55  18.46   18.46  17.94 – 19.00          10

Table adapted from Germann (2014).
(a) Computed via bootstrap resampling for the median system in the group.
(b) Top 100 entries per source phrase, selected according to p(t | s).
(c) α: one-sided confidence level of the Clopper-Pearson confidence interval for the observed counts.

Table 1. BLEU scores (de → en) with different phrase score computation methods.

They were tuned on the NewsTest 2013 data set and evaluated on the NewsTest 2014 data set from the Shared Translation Tasks at WMT-2013 and WMT-2014, respectively. The systems differ in the number of feature functions used (with and without backwards phrase-level translation probabilities) and the smoothing methods applied. No lexicalized reordering model was used in these experiments. Each system was tuned 8–10 times in independent tuning runs with Minimum Error Rate Training (MERT; Och, 2003). Table 1 shows low, high, median, and mean scores over the multiple tuning runs for each system.

The first risk in the use of sampling phrase tables is that it is tempting to forfeit the backwards phrase-level translation probabilities in the scoring. The basic sampling and phrase extraction procedure produces source-side marginal and joint counts over a sample of the data, but not the target-side marginal counts necessary to compute p(source | target). These backwards probabilities are, however, important indicators of phrase-level translation quality, and leaving them out hurts performance, as illustrated by a comparison of Lines 2 and 5 in Tab. 1 (standard setup vs. naïve implementation of the sampling approach without backward probabilities and smoothing).

While it is technically possible to "back-sample" phrase translation candidates by performing the sampling and gathering of counts inversely for each phrase translation candidate, this would greatly increase the computational effort required at translation time and slow down the decoder. A convenient and effective short-cut, however, is to simply scale the target-side global phrase occurrence counts of each translation

2 http://www.statmt.org/wmt14/translation-task.html#download


candidate by the proportion of sample size to total source phrase count:

\[
p(\text{source} \mid \text{target}) \approx \frac{\text{joint phrase count in sample}}{\text{total target phrase count}} \cdot \frac{\text{total source phrase count}}{\text{number of source phrases sampled}} \tag{1}
\]
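A direct transcription of Equation (1) in Python; the function and variable names are ours, and the counts in the usage example are invented for illustration.

def approx_backward_prob(joint_in_sample, sample_size,
                         total_source_count, total_target_count):
    """Approximate p(source | target) from a sample, following Eq. (1):
    the joint count observed in the sample is related to the full
    target-side marginal, scaled up by how much the source phrase
    occurrences were undersampled."""
    scale = total_source_count / sample_size
    return joint_in_sample / total_target_count * scale

# A source phrase occurring 50,000 times, of which 1,000 were sampled;
# the pair was seen 200 times in the sample; the target phrase occurs
# 30,000 times in the whole bitext.
print(approx_backward_prob(200, 1000, 50000, 30000))   # 0.333...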

As Line 6 in Tab. 1 shows, this method narrows the performance gap between conventional systems and sampling phrase tables, although it does not perform as well as "proper" computation of the backwards probabilities (cf. Line 4 in the table).

The second disadvantage of the sampling approach is that it cannot use the standard smoothing techniques used to compute smoothed phrase table scores in conventional phrase tables, i.e., Good-Turing or Kneser-Ney, as these require global information about the phrase table that is not available when sampling. The results in Lines 4 and 6 (vs. Line 2) confirm the finding by Foster et al. (2006) that phrase table smoothing improves translation quality.

One particular problem with maximum likelihood (ML) estimates in the context of translation modeling is the over-estimation of the observations in small samples. The smaller the sample, the bigger the estimation error. Since the decoder is free to choose the segmentation of the input into source phrases, it has an incentive to pick long, rare phrases. The smaller sample sizes result in bigger over-estimation of the true translation probabilities. This in turn leads to higher model scores, which is what the decoder aims for. Alas, in this case higher model scores usually do not mean higher translation quality: ML estimates introduce modelling errors. Smoothing dampens this effect.

In lieu of the established smoothing techniques, we counteract the lure of small sample sizes by replacing maximum likelihood estimates with the lower bound of the binomial confidence interval (Clopper and Pearson, 1934) for the observed counts in the actual sample, at an appropriate level of confidence.3 Figure 3 shows the "response curve" of this method for a constant success rate of 1/3, as the underlying sample size increases. In practice, a confidence level of 99% appears to be a reasonable choice: in our German-to-English experiments, using the lower bound of the binomial confidence interval at this level brought the BLEU performance of the system with a sampling phrase table back to the level of decoding with a conventional phrase table (cf. Line 8 vs. Line 2 in Tab. 1).

Figure 3. Lower bound of the binomial confidence interval for a constant success rate of 1/3, as the number of trials grows from 50 to 300; one curve each for the 75%, 90%, 95%, and 99% confidence levels.
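A sketch of this estimator using SciPy: the one-sided Clopper-Pearson lower limit at level α for k successes out of n trials is the α-quantile of the Beta(k, n − k + 1) distribution. The function name is ours; the parameter α corresponds to the smooth option described in Section 6.1.

from scipy.stats import beta

def clopper_pearson_lower(successes, trials, alpha=0.01):
    """One-sided lower bound of the Clopper-Pearson interval, used in
    place of the ML estimate successes/trials; alpha=0.01 corresponds
    to the 99% confidence level."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)

# The same success rate of 1/3 is trusted more as the sample grows:
for n in (30, 300):
    print(n, clopper_pearson_lower(n // 3, n))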

3 An alternative is the use of additional features that keep track of quantized raw joint phrase counts, e.g., how many phrase translations used in a translation hypothesis were actually observed at most once, twice, three times, or more frequently (Mauser et al., 2007).


Another concern about sampling phrase tables is speed. After all, phrase extraction and assembly of phrase table entries on the fly do require more computation at translation time. However, caching of entries once created, as well as multi-threaded sampling, make the current implementation of sampling phrase tables in Moses very competitive with its alternatives.

A comparison of translation times for a large French–English system with sampling phrase tables vs. the compressed phrase tables of Junczys-Dowmunt (2012) (CompactPT) is given in Tab. 3 (Sec. 7) below. CompactPT is the fastest phrase table implementation available in Moses for translation in practice.

3. Lexicalised reordering model

Lexicalized reordering models improve the quality of translation (Galley and Manning, 2008). Sampled lexicalised reordering models were not available for the work presented in Germann (2014), but have been implemented since. Our sampling procedure keeps track of the necessary information for hierarchical lexicalized reordering (Galley and Manning, 2008) and communicates this information to the lexicalised reordering model.

4. Dynamic updates

One special feature of the sampling phrase table implementation in Moses is that it allows parallel text to be added dynamically through an RPC call when moses is run in server mode. This is useful, for example, when Moses serves as the MT back-end in an interactive post-editing scenario, where bilingual humans post-edit the MT output. Dynamic updates allow immediate exploitation of the newly created high-quality data to improve MT performance on the spot.

To accommodate these updates, the phrase table maintains two separate bitexts: the memory-mapped, static background bitext, whose build process is described in Section 5.2 below, and a dynamic foreground bitext that is kept entirely in memory. The phrase table's built-in feature functions can be configured to compute separate scores for the foreground and background corpus, or to simply pool the counts. Details on the use of this feature are available in the Moses online documentation.4
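As an illustration, such an update might look as follows from a Python client. This is a sketch only: the RPC method name (updater), the parameter keys, and the server invocation are our assumptions based on our reading of the Moses server documentation and may differ across Moses versions; consult the online documentation for the authoritative interface.

import xmlrpc.client

# Moses running in server mode, e.g. mosesserver -f moses.ini --server-port 8080
# (assumed invocation; see the Moses online documentation).
server = xmlrpc.client.ServerProxy("http://localhost:8080/RPC2")

# Push one freshly post-edited sentence pair into the dynamic foreground
# bitext. "alignment" uses symal-style word alignment ("0-0 1-1 ...").
# Method name and parameter keys are assumptions, not confirmed API.
server.updater({
    "source":    "das ist ein kleines haus",
    "target":    "this is a small house",
    "alignment": "0-0 1-1 2-2 3-3 4-4",
})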

5. Building and using sampling phrase tables in Moses

5.1. Moses compilation

Compile Moses as usual, but with the switch --with-mm:5

./bjam --with-mm --prefix=...

4 http://www.statmt.org/moses
5 Suffix-array based sampling phrase tables are scheduled to be included with the standard build soon.


The binaries mtt-build, symal2mam, and mmlex-build will be placed in the same directory as the moses executable.

5.2. Binarizing the word-aligned parallel corpus

Binarisation converts the corpus from a text representation into large arrays of 32-bit word IDs, and creates the suffix arrays. Word alignment information is also converted from a text representation (symal output format) to a binary format. In addition, a probabilistic word translation lexicon is extracted from the word-aligned corpus and also stored in binary format. All files are designed to be mapped directly into memory for fast loading.

Let corpus be the base name of the parallel corpus. The tags src and trg are language tags that identify the source and the target language. Normally, these are mnemonic tags such as en, fr, de, etc.

• corpus.src is the source side of the corpus (one sentence per line, tokenized);
• corpus.trg is the respective target side in the same format;
• corpus.src-trg.symal is the word alignment between the two, in the format produced by the symal word alignment symmetriser;
• /some/path/ is the path where the binarized model files will be stored. It must exist prior to running the binarizers. The path specification may include a file prefix bname. for the individual file names, in which case /some/path/bname. should be used instead of /some/path/ in all steps.

Binarisation consists of four steps, the first three of which can be run in parallel (see the sketch after this list).

Step 1: binarise the source side:
    mtt-build < corpus.src -i -o /some/path/src
Step 2: binarise the target side:
    mtt-build < corpus.trg -i -o /some/path/trg
Step 3: binarise the word alignments:
    symal2mam < corpus.src-trg.symal /some/path/src-trg.mam
Step 4: produce a word lexicon for lexical scoring:
    mmlex-build corpus src trg -o /some/path/src-trg.lex

Steps 1 and 2 will produce 3 files each: a map from word strings to word IDs and vice versa (*.tdx), a file with the binarized corpus (*.mct), and the corresponding suffix array (*.sfa). Steps 3 and 4 produce one file each (*.mam and *.lex, respectively).
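The four steps can be scripted. Below is a minimal Python wrapper that mirrors the commands above, running the first three concurrently as the text permits; the base name corpus, the tags src and trg, and the output path /some/path/ are the placeholders from this section.

import subprocess
from concurrent.futures import ThreadPoolExecutor

BASE, OUT = "corpus", "/some/path/"   # placeholders from Section 5.2

def run(cmd, stdin_path=None):
    """Run one binarisation command, optionally feeding a file on stdin."""
    if stdin_path:
        with open(stdin_path, "rb") as f:
            subprocess.run(cmd, stdin=f, check=True)
    else:
        subprocess.run(cmd, check=True)

# Steps 1-3 are independent of each other and may run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    jobs = [
        pool.submit(run, ["mtt-build", "-i", "-o", OUT + "src"], BASE + ".src"),
        pool.submit(run, ["mtt-build", "-i", "-o", OUT + "trg"], BASE + ".trg"),
        pool.submit(run, ["symal2mam", OUT + "src-trg.mam"], BASE + ".src-trg.symal"),
    ]
    for job in jobs:
        job.result()   # propagate any failure

# Step 4: build the word translation lexicon.
run(["mmlex-build", BASE, "src", "trg", "-o", OUT + "src-trg.lex"])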

5.3. Setting up entries in the moses.ini file

In the section [feature], add the following two entries:

LexicalReordering name=DM0 type=hier-mslr-bidirectional-fe-allff
Mmsapt name=PT0 lrfunc=DM0 path=/some/path/ L1=src L2=trg sample=1000

Note that the value of the path parameter must end in '/' or '.', depending on whether it points to a directory or includes a file name prefix. The value of the parameter lrfunc must match the name of the lexical reordering feature.


5.4. Setting up sampling phrase tables in EMS

In Moses's Experiment Management System (EMS), the use of sampling phrase tables can be specified by adding the following two lines to the EMS configuration file:

mmsapt = "sample=1000"
binarize-all = $moses-script-dir/training/binarize-model.perl

6. Configuring the phrase table

The phrase table implementation offers numerous configuration options. Due to space constraints, we list only the most important ones here; the full documentation can be found in the online Moses documentation at http://statmt.org/moses. All options can be specified in the phrase table's configuration line in moses.ini in the format key=value. Below, the letter 'n' designates natural numbers, 'f' floating-point numbers, and 's' strings.

6.1. General Options

sample=n    the maximum number of samples considered per source phrase.
smooth=f    the "smoothing" parameter. A value of 0.01 corresponds to a 99% confidence interval.
workers=n   the degree of parallelism for sampling. By default (workers=0), all available cores are used. The phrase table implements its own thread pool; the general Moses option threads has no effect here.
cache=n     size of the cache. Once the limit is reached, the least recently used entries are dropped first.
ttable-limit=n   maximum number of distinct translations to return.
extra=s     path to additional word-aligned parallel data to seed the foreground corpus, for use in an interactive dynamic scenario where phrase tables can be updated while the server is running. This use case is explained in more detail in Germann (2014).
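For example, a phrase table line combining several of these options with the declaration from Section 5.3 might look as follows; the particular values are illustrative, not recommendations.

Mmsapt name=PT0 lrfunc=DM0 path=/some/path/ L1=src L2=trg sample=1000 smooth=0.01 workers=0 cache=10000 ttable-limit=20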

6.2. Feature Functions

Currently, word-level lexical translation scores are always computed and provided. Below we list some core feature scores that the phrase table can provide. A comprehensive list including experimental features is provided online at http://statmt.org/moses.

pfwd=g[+]   forward phrase-level translation probability. If g+ is specified, scores are computed and reported separately for the static background and the dynamic foreground corpus. Otherwise, the underlying counts are pooled.
pbwd=g[+]   backwards phrase-level translation probability, with the same interpretation of the value specified as for pfwd.


source                       sentence pairs   French tokens   English tokens
CommonCrawl                       3.2 M            86 M             78 M
EuroParl                          2.0 M            58 M             52 M
Fr–En Gigaword                   21.4 M           678 M            562 M
News Commentary                   0.2 M             6 M              5 M
UN                               12.3 M           367 M            318 M
Total for TM training            39.1 M         1,185 M          1,016 M
News data for LM training       140.0 M              –           2,874 M

Table 2. Corpus statistics for the parallel WMT-2015 French-English training data.

lenrat={0|1}   phrase length ratio score (off/on). Phrase pair creation is modelled as a Bernoulli process with a biased coin: 'heads' produces a word in L1, 'tails' produces a word in L2. The bias of the coin is determined by the ratio of the lengths (in words) of the two sides of the training corpus. This score is the log of the probability that the phrase length ratio is no more extreme (removed from the mean) than the one observed.
rare=f   rarity penalty: f / (f + j), where j is the phrase pair joint count. This feature is always computed on the pooled counts of the foreground and background corpus.
prov=f   provenance reward: j / (f + j). This feature is always computed separately for the foreground and background corpus.
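To make the last two scores concrete, here is a small sketch of the two formulas as functions of the constant f and the joint count j (the function names are ours):

def rarity_penalty(f, j):
    """rare feature: f / (f + j); close to 1 for rarely observed phrase
    pairs, close to 0 for frequent ones."""
    return f / (f + j)

def provenance_reward(f, j):
    """prov feature: j / (f + j); grows with the joint count, rewarding
    pairs that are well attested in a given (sub-)corpus."""
    return j / (f + j)

# With f = 1: a singleton pair is penalised heavily, a frequent one barely.
print(rarity_penalty(1.0, 1), rarity_penalty(1.0, 99))        # 0.5  0.01
print(provenance_reward(1.0, 1), provenance_reward(1.0, 99))  # 0.5  0.99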

7. Performance on a large dataset

Table 3 shows build times and translation performance of two systems built with the large French–English data set available for the WMT-2015 shared translation task (cf. Tab. 2). The first system uses a pruned conventional phrase table binarized as a compact phrase table (Junczys-Dowmunt, 2012); tuning was performed with an unpruned, filtered in-memory phrase table. The other system uses a sampling phrase table. The systems were tuned on 760 sentence pairs from the newsdiscussdev2015 development set and evaluated on the newsdiscusstest2015 test set.

For technical reasons, we were not able to run the build processes on dedicated machines with identical specifications; the build times reported are therefore only approximate. To give the conventional phrase table construction process the benefit of the doubt, the data binarization for the sampling phrase table was performed on a less powerful machine (8 cores) than the conventional phrase table construction (24–32 cores), although not all steps in the process can utilize multiple cores. Nevertheless, even under these slightly unfair conditions the time savings of the sampling approach are obvious. The translation speed experiments were performed with cube pruning (pop limit: 1000) on the same 24-core machine with 148 GB of memory, translating the test set of 1,500 sentences (30,000 words) in bulk using all available cores on the machine.


                           conventional system               sampling phrase tables
phrase table build time    ≫ 20 hrs.                         ca. 1 h 30 min

Model features:            total: 28                         total: 18
• word penalty             yes                               yes
• phrase penalty           yes                               yes
• distortion distance      yes                               yes
• language model           5-gram Markov model               5-gram Markov model
• TM: phrase transl.       forward, backward                 forward, backward,
                           w/ Good-Turing smoothing          lower bound of 99% conf. interv.
• TM: lexical transl.      forward, backward                 forward, backward
• rare counts              6 bins: 1/2/3/4/6/10              rarity penalty
• lex. reord. model        hierarchical-fe-mslr-all-ff       hierarchical-fe-mslr-all-ff
• phrase length ratio      no                                yes

Evaluation (newsdiscusstest2015), 3 independent tuning runs;
95% confidence intervals computed via bootstrap resampling:

               conventional system              sampling phrase tables
               run   BLEU   95% conf. int.      run   BLEU   95% conf. int.
batch MIRA     #1    33.16  32.07 – 34.21       #1    33.16  31.96 – 34.27
               #2    33.42  32.42 – 34.52       #2    32.89  31.74 – 34.04
               #3    33.30  32.16 – 34.39       #3    33.12  32.03 – 34.20
MERT           #1    32.19  31.15 – 33.13       #1    34.25  33.11 – 35.37
               #2    32.93  31.90 – 34.08       #2    34.11  32.91 – 35.37
               #3    31.53  30.39 – 32.68       #3    33.80  32.69 – 34.90

translation speed          unpruned     top30        sample=1000   sample=100
threads                        8           24             24            24
wrds./sec. (sec./wrd.)     13 (0.075)  547 (0.002)   300 (0.003)   501 (0.002)
snts./sec. (sec./snt.)     0.7 (1.498)  27 (0.037)    15 (0.067)    25 (0.040)
BLEU (best system)         33.42        33.55         34.25         33.82

Table 3. Features used and translation performance for the WMT15 fr–en experiments.

Prior to the start of Moses, all model files were copied to /dev/null to push them into the operating system's file cache. Due to race conditions between threads, we limited the number of threads to 8 for the legacy system with the unpruned phrase table.

Notice that in terms of BLEU scores, the two systems perform differently with different tuning methods. The lower performance of MERT for the conventional system with 28 features is not surprising: it is well known that MERT tends to perform poorly when the number of features exceeds 20. That MIRA fares worse than MERT for the sampling phrase tables may be due to a sub-optimal choice of MIRA's meta-parameters (cf. Hasler et al., 2011 for details on MIRA's meta-parameters).


8. Conclusion

We have presented an efficient implementation of sampling phrase tables in Moses. With the recent integration of hierarchical lexicalized reordering models into the approach, sampling phrase tables reach the same level of translation quality as conventional phrase tables while approaching CompactPT in terms of speed. In addition, sampling phrase tables offer the following advantages that make them an attractive option both for experimentation and research, and for use in production environments:

• They are much faster to build.
• They offer flexibility in the choice of feature functions used. Feature functions can be added or disabled without the need to re-run the entire phrase table construction pipeline.
• They have a lower memory footprint. It is not necessary to filter or prune the phrase tables prior to translation.

9. Availability

Sampling phrase tables are included in the master branch of the Moses github repository at http://github.com/moses-smt/mosesdecoder.git.

Acknowledgements

This work was supported by the European Union's Horizon 2020 research and innovation programme (H2020) under grant agreements 645487 (MMT) and 645452 (QT21). It extends work supported by the EU's Framework 7 (FP7) programme under grant agreements 287688 (MateCat), 287576 (CasMaCat), and 287576 (ACCEPT).

Bibliography

Callison-Burch, Chris, Colin Bannard, and Josh Schroeder. Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 255–262, Ann Arbor, Michigan, 2005.

Chiang, David. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228, 2007.

Clopper, C. J. and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 1934.

Dyer, Chris, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proceedings of the ACL 2010 System Demonstrations, pages 7–12, Uppsala, Sweden, 2010.


Foster, George F., Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. In EMNLP, pages 53–61, 2006.

Galley, Michel and Christopher D. Manning. A Simple and Effective Hierarchical Phrase Reordering Model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 848–856, Honolulu, Hawaii, 2008.

Germann, Ulrich. Dynamic Phrase Tables for Machine Translation in an Interactive Post-editing Scenario. In Proceedings of the Workshop on Interactive and Adaptive Machine Translation, pages 20–31, 2014.

Hasler, Eva, Barry Haddow, and Philipp Koehn. Margin Infused Relaxed Algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, 96:69–78, 2011.

Junczys-Dowmunt, Marcin. Phrasal Rank-Encoding: Exploiting Phrase Redundancy and Translational Relations for Phrase Table Compression. The Prague Bulletin of Mathematical Linguistics, 98:63–74, 2012.

Li, Zhifei, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. Joshua: An Open Source Toolkit for Parsing-Based Machine Translation. In Fourth Workshop on Statistical Machine Translation, pages 135–139, Athens, Greece, 2009.

Lopez, Adam. Hierarchical Phrase-Based Translation with Suffix Arrays. In EMNLP-CoNLL, pages 976–985, 2007.

Lopez, Adam. Machine Translation by Pattern Matching. PhD thesis, University of Maryland, College Park, MD, USA, 2008.

Manber, Udi and Gene Myers. Suffix Arrays: A New Method for On-line String Searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '90, pages 319–327, Philadelphia, PA, USA, 1990. ISBN 0-89871-251-3.

Mauser, Arne, David Vilar, Gregor Leusch, Yuqi Zhang, and Hermann Ney. The RWTH Machine Translation System for IWSLT 2007. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2007.

Och, Franz Josef. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, 2003.

Schwartz, Lane and Chris Callison-Burch. Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Trees. The Prague Bulletin of Mathematical Linguistics, 93:157–166, 2010.

Zens, Richard and Hermann Ney. Efficient Phrase-Table Representation for Machine Translation with Applications to Online MT and Speech Translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '07), pages 492–499, Rochester, New York, 2007.

Address for correspondence: Ulrich Germann [email protected] University of Edinburgh • 10 Crichton Street • Edinburgh, EH8 9AB, United Kingdom


J Reordering constraints for English-Latvian SMT

The main goal of these experiments was to improve SMT translations by (1) limiting the reordering of relatively independent parts of a sentence (e.g., text in brackets, direct speech, etc.) and (2) limiting the unjustified splitting of phrases (e.g., prepositional phrases) during translation. Two types of experiments were performed:

• Limiting reordering of the top-level phrases;
• Limiting reordering of specific categories of phrases.

J.1 SMT System and Tools

The general-domain English-Latvian SMT system trained on the LetsMT platform [21] was used for the experiments. The LetsMT platform is based on the Moses toolkit [11]. It supports mechanisms for specifying reordering constraints to the decoder, i.e., for marking fragments of the text that need to be kept together during translation. Two types of reordering constraints, zones and walls, are supported by Moses. A zone is a sequence of text to be translated without reordering with outside material; it is often used to mark text in brackets or quotes. Walls are "hard reordering constraints: first all words before a wall have to be translated, before words afterwards are translated".2 When walls are combined with zones, they act as local walls, i.e., they are only valid within the zone.

To introduce reordering constraints during translation, the input text was pre-processed. The following steps were performed during pre-processing:

• Tokenization: each sentence was split into tokens using the default Moses tokenizer;
• Parsing: the Berkeley parser [14] was used for source-text parsing;
• Adding constraints: special <zone> and <wall> tags were added to the text as reordering constraints (see the sketch after this list).
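For illustration, a minimal sketch (ours, covering only the bracketed-text case; the actual pre-processing derived the constraints from the Berkeley parse) of how such tags can be inserted into tokenized input. The tag syntax follows the Moses documentation on zones and walls referenced in the footnote below.

import re

def add_zones(sentence):
    """Wrap bracketed fragments of a tokenized sentence in <zone>...</zone>
    so that they are translated as a unit, without reordering across the
    zone boundary."""
    return re.sub(r"\( ([^()]*) \)", r"<zone> ( \1 ) </zone>", sentence)

print(add_zones("the new price ( see table 2 ) applies from May"))
# -> the new price <zone> ( see table 2 ) </zone> applies from May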

J.2 Constraints on Main Constituents

We started our experiments by introducing reordering constraints for the main constituents (top-level phrases) of the sentence and then added some restrictions for more fine-grained phrases. We included phrases that are parts (sub-phrases) of the main constituents; some specific combinations of phrases were considered as well. The results of these experiments are summarized in Table 1. All of these experiments led to a slight decrease in BLEU score.

J.3 Constraints on Specific Phrases

The aim of this experiment was to evaluate the influence of constraints on the translation of several types of phrases:

• noun phrases,
• prepositional phrases,
• verb phrases,
• some other phrases.

J.3.1 Noun Phrases

Reordering constraints were introduced mostly for long noun phrases. Such noun phrases could contain a simple prepositional phrase, have a complicated structure, or include a simple attributive clause. Some examples of such noun phrases are provided in Figure 1.

2 http://www.statmt.org/moses/?n=Advanced.Hybrid


Description                                                                       BLEU
Baseline                                                                          20.11
All main constituents                                                             19.83
All main constituents and all bottom second-level phrases                         19.79
Main constituents (NP, VP and S) and all bottom second-level phrases, except PP   19.89

Table 1: Results of experiments with reordering constraints for main constituents.

Figure 1: Examples of noun phrases for which reordering constraints were applied.

J.3.2 Prepositional Phrases

Reordering constraints were introduced for simple prepositional phrases that start with the prepositions in, for, by, from or at. Some complex prepositional phrases were also grouped together. Figure 2 illustrates cases in which reordering constraints were introduced.

J.3.3 Verb Phrases

Reordering constraints were introduced for complex verb phrases in which several verbs are separated by a conjunction, in which the verb phrase is expressed by several words (e.g., starts with to or will), or in which the verb phrase is followed by an object noun phrase. Some examples of such verb phrases are provided in Figure 3.

J.3.4 Other Phrases

In addition to the previously described categories of phrases, reordering restrictions were applied to some sub-clauses, to simple sentences containing a simple noun phrase and a verb phrase, and to simple phrases containing a single noun, adjective, adverb, or verb. Figure 4 illustrates a case where two parts of the sentence are separated by reordering constraints before translation.

J.3.5 Results of Reordering Constraints in SMT

Our first experiment, in which reordering constraints were introduced for the top-level phrases, did not lead to an improvement in terms of BLEU score. The second experiment, in which reordering constraints were introduced for specific phrase structures, resulted in an increase of the BLEU score by 0.5 points: starting from the baseline of 20.11 BLEU points, our best system, with walls and zones added to particular groups of phrases, reached 20.64 BLEU.


Figure 2: Examples of prepositional phrases for which reordering constraints were applied.

Figure 3: Examples of verb phrases for which reordering constraints were applied.


Figure 4: Example of a complex sentence for which reordering constraints were introduced.
