This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452.

Deliverable D1.4 Semantics in Shallow Models

Ondřej Bojar (CUNI), Rico Sennrich (UEDIN), Philip Williams (UEDIN), Khalil Sima’an (UvA), Stella Frank (UvA), Inguna Skadiņa (Tilde), Daiga Deksne (Tilde)

Dissemination Level: Public

31st January, 2018

Grant agreement no.: 645452
Project acronym: QT21
Project full title: Quality Translation 21
Type of action: Research and Innovation Action
Coordinator: Prof. Josef van Genabith (DFKI)
Start date, duration: 1st February, 2015, 36 months
Dissemination level: Public
Contractual date of delivery: 31st July, 2016
Actual date of delivery: 31st July, 2016
Deliverable number: D1.4
Deliverable title: Semantics in Shallow Models
Type: Report
Status and version: Final (Version 1.0)
Number of pages: 107
Contributing partners: CUNI, UEDIN, RWTH, UvA, Tilde
WP leader: CUNI
Author(s): Ondřej Bojar (CUNI), Rico Sennrich (UEDIN), Philip Williams (UEDIN), Khalil Sima’an (UvA), Stella Frank (UvA), Inguna Skadiņa (Tilde), Daiga Deksne (Tilde)
EC project officer: Susan Fraser

The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v. (TAUS), Netherlands
• text & form GmbH (TAF), Germany
• TILDE SIA (TILDE), Latvia
• Hong Kong University of Science and Technology (HKUST), Hong Kong

For copies of reports, updates on project activities and other QT21-related information, contact:

Prof. Stephan Busemann, DFKI GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
[email protected]
Phone: +49 (681) 85775 5286
Fax: +49 (681) 85775 5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.qt21.eu/

© 2018, The Individual Authors

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Contents

1 Executive Summary 4

2 Improved Modelling of Target Grammar in SMT 5
2.1 Free Word-Order and Predicting Alternative Target Word Orders with Hierarchical Models 5
2.2 Dependency Language Models 5
2.3 Unification-based Constraints 6
2.4 Modeling Selectional Preferences 6

3 Richer Source Annotation in SMT 6
3.1 Using Verb Patterns in PBMT 7
3.2 Discriminative Models with Context Information for PBMT 7
3.3 Source Annotation in Syntax-based Models 7
3.4 Sampling Phrase Tables for the Moses Statistical Machine Translation System 8
3.5 Reordering Constraints in PBMT 8

References 8

Appendices 12

Appendix A Examining the Relationship between Preordering and Word Order Freedom in Machine Translation 12

Appendix B A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation 25

Appendix C Edinburgh’s Syntax-Based Systems at WMT 2015 32

Appendix D Edinburgh’s Statistical Machine Translation Systems for WMT16 43

Appendix E Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation 55

Appendix F Using Verb Patterns in PBMT 66

Appendix G Target-Side Context for Discriminative Models in Statistical Machine Translation 75

Appendix H CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten 86

Appendix I Sampling Phrase Tables for the Moses Statistical Machine Translation System 92

Appendix J Reordering constraints for English-Latvian SMT 104
J.1 SMT System and Tools 104
J.2 Constraints on Main Constituents 104
J.3 Constraints on Specific Phrases 104
J.3.1 Noun Phrases 104
J.3.2 Prepositional Phrases 105
J.3.3 Verb Phrases 105
J.3.4 Other Phrases 105
J.3.5 Results of Reordering Constraints in SMT 105


1 Executive Summary

This deliverable reports all the results achieved by the consortium during the whole 36 months of the project falling under Task 1.2 as specified in the project proposal:

Task 1.2 Semantics in Shallow Models [M01–M36] (CUNI, UEDIN, RWTH, UvA) In this task, syntactically oriented approaches will be experimented with. Shortcomings of current sequential models will be addressed by exploring string-to-tree and SCFG models. In addition, richer source-side annotation will be considered. We will also experiment with methods for automatic identification of systematic phrase table errors within the PBMT framework.

As described in the introduction to Deliverable 1.2 (Semantic Translation Models), a major paradigm shift towards neural MT happened over the course of the project, setting new baselines and somewhat postponing the benefit of explicit linguistically-motivated syntactic descriptions of the sentence over these baselines. Since the detailed description of this task mentions pre-neural approaches to MT explicitly (synchronous context-free grammars, SCFG, and phrase-based SMT, PBMT), we decided to reserve this deliverable for work on non-neural models. Because attention shifted quickly once the difference in translation quality became apparent, the majority of this work happened in the first half of the project.

The first part of this deliverable (Section 2) addresses the first of the topics in Task 1.2, namely the handling of syntax in shallow models and in string-to-tree models. In Section 2.1, UvA investigates the use of pre-orderings in the more challenging free word-order setting, showing where the shortcomings of current SCFG-based approaches arise and proposing a potential solution. In string-to-tree models, which incorporate syntactic annotation on the target side, UEDIN aimed to make better use of this syntactic information. In Section 2.2, UEDIN improves the grammaticality of dependency-syntax models by developing a dependency representation of German compounds and particle verbs (the German counterpart of English phrasal verbs, where the separable prefix gets moved to the end of the clause or sentence). In Section 2.3, in collaboration with CUNI, UEDIN develops agreement constraints for Czech to try to ensure that grammatical features, including syntactic case, are expressed consistently. In Section 2.4, UEDIN introduces a model of semantic affinity between verbal predicates and nominal argument heads.

The second part of this deliverable (Section 3) then comprises works that allow the SMT system to benefit from a richer annotation of the source sentence. With respect to phrase-based models, disambiguation of words in the source sentence has been shown to improve performance. In Section 3.1, CUNI experiments with disambiguation of verb senses using verb patterns. A shortcoming of this approach is that it can only use a limited window of source context. We proceed with Section 3.2, where CUNI incorporated discriminative models into PBMT, allowing arbitrary additional information from the source, and in part from the target, to be taken into consideration.

Where a language pair has fixed word order, syntax is particularly informative for determining the semantics of a sentence and we would like to exploit this source of information. In Section 3.3, UEDIN investigates the use of source-side syntactic annotation in tree-to-string models. This is particularly relevant for the common scenario of translation from English.
In Section 3.4, UEDIN extends support for sampling phrase tables in the Moses toolkit, demonstrating that this method can match the performance of the conventional method of building large static phrase tables up-front, while expanding the potential for more rapid experimentation with rich features. Finally, in Section 3.5, Tilde experiments with reordering constraints by applying them to the different parts of the sentence, as well as to several categories of phrases. For English-Latvian SMT, putting reordering limits on top-level phrases did not show improvements in terms of BLEU score, while limiting reordering for specific phrases resulted in an increase of 0.5 BLEU.


2 Improved Modelling of Target Grammar in SMT

This part of the deliverable deals with SMT models that focus on target syntactic structure using various technical means. In Section 2.1, grammaticality of the target is improved by pre-ordering the source. In Section 2.2, a dedicated dependency language model is designed. In the two following sections (Sections 2.3 and 2.4), unification constraints and selectional preferences for syntax-based SMT are studied.

2.1 Free Word-Order and Predicting Alternative Target Word Orders with Hierarchical Models

UvA studied the impact of the degree of word-order freedom, in a morphologically-rich target language, on translation performance for hierarchical pre-ordering models. UvA explored two kinds of hierarchical pre-ordering models: (1) an unsupervised hierarchical SCFG-based model working over Permutation Trees obtained by factorising word alignments and using state-splitting to produce a latent SCFG [17] and (2) a supervised neural pre-ordering model that estimates the probabilities of swaps for pairs of nodes in a source dependency tree [3].

Pre-ordering models typically pass a single permutation from the source side to the back-end translation system. But this is suboptimal, particularly for target languages with freer word order where multiple input orders are acceptable. This is empirically demonstrated in experiments with English-German (freer word order) and English-Japanese (stricter word order). For languages with relatively free word order, such as German, predicting a lattice of possible pre-orderings is more suitable than predicting a unique word order. UvA also presented a method for training the back-end MT system specifically for input that consists of a lattice of pre-orderings. Experiments show that lattices in conjunction with this training scheme are crucial for enabling good empirical performance when translating into the relatively free word order language, German (+0.67 BLEU over the first-best version of the same model), and even for fixed word order languages such as Japanese, the lattice representation can provide additional improvements over a first-best model (+0.36 BLEU). This work is described in Daiber et al. [2], which can be found in Appendix A.
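To make the training scheme concrete: when the back-end system is trained on lattice input, one permutation per training sentence must be selected. The sketch below illustrates one plausible selection criterion, picking the k-best permutation with the fewest pairwise order disagreements (Kendall's tau distance) against the alignment-derived target order. This is an editorial illustration only; the exact matching procedure ("lattice silver training") is described in Appendix A.

```python
from itertools import combinations

def kendall_tau_distance(perm_a, perm_b):
    """Count index pairs that are ordered differently in the two permutations."""
    position = {item: idx for idx, item in enumerate(perm_b)}
    return sum(
        1
        for i, j in combinations(range(len(perm_a)), 2)
        if position[perm_a[i]] > position[perm_a[j]]
    )

def select_training_permutation(kbest_permutations, target_order):
    """Pick the predicted permutation closest to the target-order
    (alignment-derived) permutation of the source sentence."""
    return min(
        kbest_permutations,
        key=lambda perm: kendall_tau_distance(perm, target_order),
    )

# Example: token indices 0..3 of a 4-token source sentence.
kbest = [[0, 1, 2, 3], [0, 2, 1, 3], [0, 2, 3, 1]]
print(select_training_permutation(kbest, [0, 2, 3, 1]))  # -> [0, 2, 3, 1]
```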

2.2 Dependency Language Models

To improve the grammaticality of string-to-tree SMT systems with dependency syntax, Sennrich [15] proposed a relational dependency language model which models the probability of a sentence given a labelled dependency tree, and of the labelled tree itself. UEDIN have continued this line of research within the QT21 project and published several extensions to the model.

In Sennrich and Haddow [16], the authors proposed a joint dependency model of morphological and syntactic structure. When translating between two languages that differ in their degree of morphological synthesis, syntactic structures in one language may be realised as morphological structures in the other, and SMT models need a mechanism to learn such translations. Prior work has used morpheme splitting with flat representations. However, this fails to encode the hierarchical structure between morphemes, which is relevant for learning morphosyntactic constraints and selectional preferences. Sennrich and Haddow [16] present a dependency representation of German compounds and particle verbs that results in improvements in translation quality of 1.4–1.8 BLEU.

An early version of the joint dependency language model was used in the UEDIN submission to the 2015 WMT shared translation task for English-German. At the time, the implementation included the new method of binarisation and compound representation, but did not yet include particle verb restructuring. This system was jointly ranked first out of 16 by human judges (it was tied with one other system). Sennrich and Haddow [16] can be found in Appendix B. A description of the WMT system is given in Section 3.3 of Williams et al. [24], which can be found in Appendix C.

The work was developed jointly within WP1 and WP2 due to its applicability to both the problems of semantics in shallow syntactic models and morphological representations. It is therefore also reported in deliverable D2.3.
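As a rough, editorial illustration of the relational dependency language model idea: each word is scored given its head word and the dependency relation connecting them. The toy sketch below is not Sennrich's actual model (which also scores the tree structure itself and conditions on richer syntactic context); it only shows the core of syntactic n-gram scoring.

```python
import math
from collections import defaultdict

class ToyRelationalDepLM:
    """Scores each token given its head and dependency label,
    with add-one smoothing. A minimal sketch only."""

    def __init__(self):
        self.context_counts = defaultdict(int)  # (head, label)
        self.event_counts = defaultdict(int)    # (head, label, dependent)
        self.vocabulary = set()

    def train(self, trees):
        # Each tree is a list of (dependent, label, head) arcs.
        for tree in trees:
            for dependent, label, head in tree:
                self.context_counts[(head, label)] += 1
                self.event_counts[(head, label, dependent)] += 1
                self.vocabulary.add(dependent)

    def log_prob(self, tree):
        log_p = 0.0
        vocab_size = len(self.vocabulary)
        for dependent, label, head in tree:
            numerator = self.event_counts[(head, label, dependent)] + 1
            denominator = self.context_counts[(head, label)] + vocab_size
            log_p += math.log(numerator / denominator)
        return log_p
```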

2.3 Unification-based Constraints

Williams and Koehn [23] and Williams [22] proposed unification-based constraints as a means of improving morphosyntactic coherence in target languages with rich inflectional morphology, with a focus on translation into German. As part of QT21, UEDIN and CUNI collaborated to develop this further with constraints adapted for translation into Czech.

The UEDIN/CUNI system used Treex¹ to preprocess and parse the Czech side of the training data. Treex uses the MST parser [12], which produces dependency graphs with non-projective arcs. After projectivisation, dependency trees were converted to CFG trees, enabling the extraction of SCFG translation rules. With this string-to-tree system in place, UEDIN/CUNI experimented with unification-based agreement and case government constraints. Specifically, the constraints were designed to enforce: i) case, gender, and number agreement between nouns and pre-nominal adjectival modifiers; ii) number and person agreement between subjects and verbs; iii) case agreement between prepositions and nouns; iv) use of nominative case for subject nouns.

In preliminary experiments, small but consistent gains in BLEU were obtained. Previous analysis for German has shown that BLEU lacks sensitivity to grammatical improvements when compared to human evaluators, and so CUNI carried out a small manual analysis of the submitted system with and without unification-based constraints. While the use of hard constraints sometimes forced the system to select a worse translation, in the majority of cases the constraints led to better translations. This work is described in Williams et al. [25], which can be found in Appendix D. The work was developed jointly within WP1 and WP2 due to its applicability to both the problems of semantics in shallow syntactic models and morphological agreement. It is therefore also reported in deliverable D2.3.
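For illustration, agreement checking by unification reduces to merging feature structures and failing on a clash. The sketch below is a simplification with flat feature dictionaries; the actual mechanism of Williams [22] attaches feature structures to SCFG rules and unifies them during decoding.

```python
def unify(features_a, features_b):
    """Merge two flat feature structures; return None on a clash."""
    merged = dict(features_a)
    for feature, value in features_b.items():
        if feature in merged and merged[feature] != value:
            return None  # clash, e.g. case mismatch between adjective and noun
        merged[feature] = value
    return merged

# Czech noun phrase: a pre-nominal adjective must agree with its noun
# in case, gender and number (the values here are illustrative).
adjective = {"case": "nom", "gender": "fem", "number": "sg"}
noun = {"case": "nom", "gender": "fem", "number": "sg"}
assert unify(adjective, noun) == {"case": "nom", "gender": "fem", "number": "sg"}
assert unify(adjective, {"case": "acc"}) is None  # constraint violation
```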

2.4 Modeling Selectional Preferences

Phrase-based and syntax-based approaches to MT both make overly-strong independence assumptions that can impact the semantic integrity of translation. In particular, predicates and their arguments are frequently translated in isolation, with only an n-gram language model to ensure that the choices are coherent. UEDIN introduced a model of semantic affinity between verbal predicates and nominal argument heads. The model assigns selectional preference scores to triples of the form (syntactic relation, predicate, argument). Since string-to-tree syntax-based models generate parse trees on the target side, they provide a natural mechanism for identifying predicate-argument relations and applying such a model during decoding, and that is the approach taken here.

While the selectional preference model did not improve translation quality (according to standard automatic evaluation metrics), the analysis carried out as part of this work shed light on some wider issues that deserve attention in syntactic MT, including errors in verb translation and tree structure. This work (jointly funded by HimL) is described in Nadejde et al. [13], which can be found in Appendix E.
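One common way to score such triples is pointwise mutual information between predicate and argument within a given relation, estimated from parsed monolingual data. The snippet below is a hedged sketch of that idea, with a made-up class name and toy data; the scoring actually used by Nadejde et al. is described in Appendix E.

```python
import math
from collections import Counter

class SelectionalPreferenceScores:
    """PMI-style scores for (relation, predicate, argument) triples."""

    def __init__(self, triples):
        self.joint = Counter(triples)
        self.predicate = Counter((r, p) for r, p, a in triples)
        self.argument = Counter((r, a) for r, p, a in triples)
        self.relation = Counter(r for r, p, a in triples)

    def score(self, relation, predicate, argument):
        joint = self.joint[(relation, predicate, argument)]
        if joint == 0:
            return float("-inf")  # unseen triple
        total = self.relation[relation]
        return math.log(
            joint * total
            / (self.predicate[(relation, predicate)]
               * self.argument[(relation, argument)])
        )

# e.g. "drink" should prefer drinkable direct objects:
triples = ([("dobj", "drink", "tea")] * 5
           + [("dobj", "drive", "car")] * 5
           + [("dobj", "drink", "car")])
model = SelectionalPreferenceScores(triples)
assert model.score("dobj", "drink", "tea") > model.score("dobj", "drink", "car")
```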

3 Richer Source Annotation in SMT

This part comprises our work aimed at improving the semantic correctness of the translation by utilising richer annotation of the source sentence.

¹ http://ufal.mff.cuni.cz/treex


In Section 3.1, this additional information explicitly indicates verb sense. In Section 3.2, a more general discriminative model is designed and applied for translation into Czech, German, Polish and Romanian, benefitting from source English tags, dependency edge labels and also the lemma of the parent node. Both of the above methods were tested in PBMT. In Section 3.3, the study is complemented by explicit structural annotation of the source with a tree-to-string model. The final two sections return to PBMT. Section 3.4 extends prior work on sampling phrase tables, showing that they can achieve state-of-the-art performance, and Section 3.5 uses the source syntactic annotation to constrain allowable reorderings to improve the translation quality.

3.1 Using Verb Patterns in PBMT

Semantic ambiguity is a pervasive phenomenon in natural language. Since meaning does not map to words consistently and in the same way across different languages, semantic ambiguity provides additional challenges for MT. Within QT21, CUNI developed methods of disambiguating the meaning of verbs in source input sentences for phrase-based MT. The approach employs the notion of "verb patterns" from Vincent Kríž and Martin Holub, based on the work of Hanks [10], and also valency frames from the EngVallex valency lexicon [1], built for the Prague Czech-English Dependency Treebank (PCEDT) [9].

Experiments include enhancement of Moses phrase tables with an additional factor representing verb senses that were automatically identified in the source-side input. In general, improvements do not show up in automatic evaluation, except for a significant improvement in one of the setups with valency frames. In contrast, however, manual evaluation shows that the additional factor does in fact improve the translation of verbs and verb phrases. This work is described in Sudarikov et al. [18], reproduced as Appendix F.
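In Moses' factored setup, each source token carries additional factors separated by "|". The snippet below sketches how a verb-sense factor might be attached to the input; the sense labels and the helper function are made-up placeholders for illustration, and the actual pattern inventory is described in Appendix F.

```python
def attach_sense_factors(tokens, senses):
    """Render a sentence in Moses' factored input format: surface|sense."""
    return " ".join(f"{token}|{sense}" for token, sense in zip(tokens, senses))

line = attach_sense_factors(
    ["he", "filed", "a", "complaint"],
    ["-", "file_pattern_3", "-", "-"],  # hypothetical verb-pattern label
)
print(line)  # he|- filed|file_pattern_3 a|- complaint|-
```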

3.2 Discriminative Models with Context Information for PBMT

Phrase-based systems struggle when information required to disambiguate the meaning of a word or phrase is not within the immediate surrounding context. One approach that we explore in QT21 to address this particular challenge is to employ discriminative models that assign word/phrase probabilities based on wider source-side context information. CUNI implemented such a discriminative model within the Moses toolkit and used it to include various types of rich source annotation in the decision process. The verb patterns described in Section 3.1 are another piece of information that could benefit from discriminative modelling, although these experiments are yet to be carried out.

In Tamchyna et al. [19], CUNI show that these discriminative models can be trained efficiently on large volumes of data and that they are beneficial for multiple language pairs (English to German, Czech, Romanian and Polish), improving translation quality over a standard PBMT setup even when the training data is large. In Tamchyna et al. [20], CUNI applied these models in the WMT Shared Translation Task and confirmed that for English-Romanian, the model brings significant improvements over a competitive baseline. For the English-to-Czech system, the Chimera system (see Deliverable 1.2 for details) is more powerful and the discriminative model does not bring any additional improvement. One possible explanation for the lack of improvement is that Chimera already addresses the problems that the proposed discriminative framework aims to fix, due to its hybrid design comprising a phrase-based system combined with deep syntactic TectoMT transfer. Both papers [19, 20] can be found in Appendices G and H, respectively.
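To give a flavour of what "wider source-side context" means in a discriminative phrase model, the sketch below extracts simple window features around a source phrase. The feature templates are illustrative assumptions; the templates actually used (including target-side context) are detailed in Appendix G.

```python
def context_features(source_tokens, span, window=3):
    """Binary features describing a source phrase and its context.

    span is the half-open interval [start, end) of the phrase being
    translated; feature names are illustrative templates only.
    """
    start, end = span
    features = {"phrase=" + " ".join(source_tokens[start:end]): 1.0}
    for k in range(max(0, start - window), start):
        features[f"left_{start - k}={source_tokens[k]}"] = 1.0
    for k in range(end, min(len(source_tokens), end + window)):
        features[f"right_{k - end + 1}={source_tokens[k]}"] = 1.0
    return features

print(context_features("he filed a complaint yesterday".split(), (1, 2)))
```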

3.3 Source Annotation in Syntax-based Models

For fixed word-order languages (such as English or Chinese), syntax is particularly informative for determining the semantics of a sentence.

Annotating the input sentences with syntactic parse trees provides an additional source of information for translation. In order to facilitate experiments with source-side syntax, UEDIN implemented a tree-to-string syntax-based translation framework in Moses. Its translation grammar is a Synchronous Tree-Substitution Grammar (STSG; [4]) with parse-tree fragments on the source side and strings of terminals and non-terminals on the target side. As with string-to-tree models, the grammar is extracted from word-aligned parallel data using the GHKM algorithm [5, 6].

UEDIN developed a decoder for tree-to-string models that matches the source side of a grammar rule against an input parse tree. This decoder implementation is based on the rule matching algorithm by Zhang et al. [26], combined with language model (LM) integration via cube pruning. Both the tree-to-string decoder and the GHKM extractor for STSGs have been released as part of the open source Moses SMT toolkit.

The tree-to-string implementation was applied successfully in Edinburgh's syntax-based submission to the WMT 2015 shared translation task on multiple language pairs: English to Czech (ranked 6th out of 15 systems), English to Finnish (6th-8th out of 10), and English to Russian (7th out of 10). This work is described in Williams et al. [24], which can be found in Appendix C.
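The core of tree-to-string decoding is matching a rule's source-side fragment against the input parse tree. The recursive check below is a minimal editorial sketch (trees as (label, children) tuples; the NONTERMINALS set and the substitution-site convention are assumptions for illustration). The production algorithm of Zhang et al. [26] avoids this naive traversal by indexing the grammar.

```python
NONTERMINALS = {"S", "NP", "VP", "PP"}  # assumed label inventory

def matches(fragment, tree):
    """Does the STSG rule fragment match at the root of this subtree?

    Both arguments are (label, children) tuples. A fragment leaf whose
    label is a non-terminal is a substitution site and matches any
    subtree with that root label.
    """
    frag_label, frag_children = fragment
    tree_label, tree_children = tree
    if frag_label != tree_label:
        return False
    if not frag_children:
        # Substitution site (non-terminal leaf) or a matching terminal.
        return frag_label in NONTERMINALS or not tree_children
    if len(frag_children) != len(tree_children):
        return False
    return all(matches(f, t) for f, t in zip(frag_children, tree_children))

# Rule source side (VP (VBD gave) NP): the NP leaf is a substitution site.
rule_src = ("VP", [("VBD", [("gave", [])]), ("NP", [])])
tree = ("VP", [("VBD", [("gave", [])]),
               ("NP", [("DT", [("a", [])]), ("NN", [("book", [])])])])
assert matches(rule_src, tree)
```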

3.4 Sampling Phrase Tables for the Moses Statistical Machine Translation System

In order to support the work on phrase-based statistical MT early in the project, UEDIN extended support for sampling phrase tables in the Moses toolkit. A sampling phrase table is one that constructs phrase table entries on demand by sampling from a pre-indexed parallel corpus instead of building a large static phrase table up-front. Support for sampling phrase tables had recently been added by Germann [7] but was missing support for lexicalized reordering models, a crucial component of state-of-the-art phrase-based MT. UEDIN added this missing support and demonstrated that sampling phrase tables could match (or slightly outperform) static phrase tables in terms of translation speed and quality, while adding the benefits of: i) being much faster to build; ii) offering flexibility in the choice of feature functions used, since feature functions can be added or disabled without the need to re-run the entire phrase table construction pipeline; and iii) having a lower memory footprint.

It was originally planned that support for factors would be added, enabling efficient experimentation with syntactic and semantic features (since features could be rapidly added or dropped during experimentation in order to find effective combinations). However, with the switch in focus to neural MT, this direction was ultimately not pursued. This work was published in The Prague Bulletin of Mathematical Linguistics [8] and can be found in Appendix I.
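The sketch below illustrates the on-demand idea: given the corpus positions where a source phrase occurs, sample a bounded number of them and estimate translation probabilities from the sample. The function name and the extract_target callback are assumptions for illustration; the actual implementation indexes the corpus with suffix arrays (Germann [7, 8]).

```python
import random
from collections import Counter

def sampled_phrase_entry(occurrences, extract_target, max_samples=100, seed=0):
    """Build one phrase-table entry on demand.

    occurrences: corpus positions where the source phrase appears.
    extract_target: callback returning the aligned target phrase for a
    given occurrence (assumed to consult the word alignments).
    """
    rng = random.Random(seed)
    if len(occurrences) > max_samples:
        occurrences = rng.sample(occurrences, max_samples)
    counts = Counter(extract_target(occ) for occ in occurrences)
    total = sum(counts.values())
    return {target: count / total for target, count in counts.items()}
```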

3.5 Reordering Constraints in PBMT There is a potential problem with the computational cost of evaluating target-side context during decoding. Tilde contributed to the task by exploring how to use sentence structure to improve word reordering in phrase-based statistical MT for translation into morphologically rich languages (from English into Latvian). Two groups of experiments that explore syntactic structures of the source language together with phrase reordering constraints were performed and analysed:

• limiting reordering of top-level phrases;
• limiting reordering of specific phrase structures (e.g. different types of noun phrases, prepositional phrases, and complex verb phrases).

Putting reordering limits on top-level phrases did not show improvements in terms of BLEU score, while limiting reordering of specific phrase structures resulted in an increase of 0.5 BLEU. This work is described in Appendix J.
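As a point of reference, Moses can express such constraints directly in its input markup when run with XML input enabled: a <zone>...</zone> span may only be reordered internally, and a <wall/> blocks reordering across a boundary. The line below is an illustrative example only; the constraints Tilde actually applied are described in Appendix J.

```
he said <wall/> that <zone> the new safety regulation </zone> would apply
```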


References

[1] Silvie Cinková. From PropBank to EngValLex: Adapting the PropBank-Lexicon to the Valency Theory of the Functional Generative Description. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy, 2006.

[2] Joachim Daiber, Miloš Stanojević, Wilker Aziz, and Khalil Sima’an. Examining the relationship between preordering and word order freedom in machine translation. In Proceedings of the First Conference on Machine Translation, pages 118–130, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W16/W16-2213.

[3] Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. Fast and accurate preordering for SMT using neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Colorado, May–June 2015. URL http://www.aclweb.org/anthology/N15-1105.

[4] Jason Eisner. Learning non-isomorphic tree mappings for machine translation. In The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics, pages 205–208, Sapporo, Japan, July 2003. Association for Computational Linguistics.

[5] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What’s in a translation rule? In HLT-NAACL ’04, 2004.

[6] Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 961–968, Morristown, NJ, USA, 2006. Association for Computational Linguistics.

[7] Ulrich Germann. Dynamic Phrase Tables for Machine Translation in an Interactive Post-editing Scenario. In Proceedings of the Workshop on Interactive and Adaptive Machine Translation, pages 20–31, 2014.

[8] Ulrich Germann. Sampling phrase tables for the Moses statistical machine translation system. The Prague Bulletin of Mathematical Linguistics, (104):39–50, 2015.

[9] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of the Eighth International Language Resources and Evaluation Conference (LREC’12), pages 3153–3160, Istanbul, Turkey, May 2012. ELRA, European Language Resources Association. ISBN 978-2-9517408-7-7.

[10] Patrick Hanks. Norms and Exploitations: Corpus, Computing, and Cognition in Lexical Analysis. MIT Press, January 2013. ISBN 9780262018579.

[11] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.


[12] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005.

[13] Maria Nadejde, Alexandra Birch, and Philipp Koehn. Modeling selectional preferences of verbs and nouns in string-to-tree machine translation. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August 2016. Association for Computational Linguistics.

[14] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics, 2006.

[15] Rico Sennrich. Modelling and optimizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics, 3:169–182, 2015.

[16] Rico Sennrich and Barry Haddow. A joint dependency model of morphological and syntactic structure for statistical machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[17] Miloš Stanojević and Khalil Sima’an. Reordering grammar induction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 44–54, Lisbon, Portugal, September 2015. URL http://aclweb.org/anthology/D15-1005.

[18] Roman Sudarikov, Ondřej Bojar, Ondřej Dušek, Martin Holub, and Vincent Kríž. Verb sense disambiguation in machine translation. In Sixth Workshop on Hybrid Approaches to Translation (HyTra-6), pages 42–50, Stroudsburg, PA, USA, 2016. Association for Computational Linguistics. ISBN 978-4-87974-713-6.

[19] Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. Target-side context for discriminative models in statistical machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.

[20] Aleš Tamchyna, Roman Sudarikov, Ondřej Bojar, and Alexander Fraser. CUNI-LMU submissions in WMT2016: Chimera constrained and beaten. In Proceedings of the Conference on Machine Translation (WMT), Berlin, Germany, August 2016.

[21] Andrejs Vasiļjevs, Raivis Skadiņš, and Jörg Tiedemann. LetsMT!: A cloud-based platform for do-it-yourself machine translation. In Proceedings of the ACL 2012 System Demonstrations, pages 43–48. Association for Computational Linguistics, 2012.

[22] Philip Williams. Unification-based Constraints for Statistical Machine Translation. PhD thesis, University of Edinburgh, 2014.

[23] Philip Williams and Philipp Koehn. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 217–226, Edinburgh, Scotland, July 2011. Association for Computational Linguistics.

[24] Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, and Philipp Koehn. Edinburgh’s syntax-based systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 199–209, Lisbon, Portugal, September 2015. Association for Computational Linguistics.


[25] Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Barry Haddow, and Ondřej Bojar. Edinburgh’s statistical machine translation systems for WMT16. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August 2016. Association for Computational Linguistics.

[26] Hui Zhang, Min Zhang, Haizhou Li, and Chew Lim Tan. Fast translation rule matching for syntax-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1037–1045, Singapore, August 2009. Association for Computational Linguistics.


A Examining the Relationship between Preordering and Word Order Freedom in Machine Translation


Joachim Daiber, Miloš Stanojević, Wilker Aziz, Khalil Sima’an
Institute for Logic, Language and Computation (ILLC), University of Amsterdam
{initial.last}@uva.nl

Abstract

We study the relationship between word order freedom and preordering in statistical machine translation. To assess word order freedom, we first introduce a novel entropy measure which quantifies how difficult it is to predict word order given a source sentence and its syntactic analysis. We then address preordering for two target languages at the far ends of the word order freedom spectrum, German and Japanese, and argue that for languages with more word order freedom, attempting to predict a unique word order given source clues only is less justified. Subsequently, we examine lattices of n-best word order predictions as a unified representation for languages from across this broad spectrum and present an effective solution to a resulting technical issue, namely how to select a suitable source word order from the lattice during training. Our experiments show that lattices are crucial for good empirical performance for languages with freer word order (English–German) and can provide additional improvements for fixed word order languages (English–Japanese).

1 Introduction

Word order differences between a source and a target language are a major challenge for machine translation systems. For phrase-based models, the number of possible phrase permutations is so large that reordering must be constrained locally to make the search space for the best hypothesis feasible. However, constraining the space locally runs the risk that the optimal hypothesis is rendered out of reach. Preordering of the source sentence has been embraced as a way to ensure the reachability of certain target word order constellations for improved prediction of the target word order. Preordering aims at predicting a permutation of the source sentence which has minimal word order differences with the target sentence; the permuted source sentence is passed on to a backend translation system trained to translate target-order source sentences into target sentences. In essence, the preordering approach makes the assumption that it is feasible to predict target word order given only clues from the source sentence. In the vast majority of work on preordering, a single preordered source sentence is passed on to the backend system, thereby making the stronger assumption that it is feasible to predict a unique preferred target word order. But how reasonable are these assumptions and for which target languages?

Intuitively, the assumption of a unique preordering seems reasonable for translating into fixed word order languages such as Japanese, but for translation into languages with less strict word order such as German, this is unlikely to work. In such languages there are multiple comparably plausible target word orders per source sentence because the underlying predicate-argument structure can be expressed with mechanisms other than word order alone (e.g. morphological inflections or intonation). For these languages, it seems rather unlikely to be able to choose a unique word order given only source sentence clues. In this paper, we want to shed light on the relationship between the target language’s word order freedom and the feasibility of preordering. We start out by contributing an information-theoretic measure to quantify the difficulty in predicting a preferred word order given the source sentence and its syntax.



Our measure provides empirical support for the intuition that it is often not possible to predict a unique word order for free word order languages, whereas it is more feasible for fixed word order languages such as Japanese. Subsequently, we study the option of passing the n-best word order predictions, instead of 1-best, to the backend system as a lattice of possible word orders of the source sentence.

For the training of the backend system, the use of such permutation lattices raises a question: What should constitute the training corpus for a lattice-preordered translation system? In previous work using single word order predictions, the training data consists of pairs of source and target sentences where the source sentence is either in target order (i.e. order based on word alignments) or preordered (i.e. predicted order). In this work we contribute a novel approach for selecting training instances from the lattice of word order permutations: We select the permutation providing the best match with the target-order source sentence (we call this process “lattice silver training”).

Our experiments show that for English–Japanese and English–German lattice preordering has a positive impact on the translation quality. Whereas lattices enable further improvement for preordering English into the strict word order language Japanese, lattices in conjunction with our proposed lattice silver training scheme turn out to be crucial to reach satisfactory empirical performance for English–German. This result highlights that when predicting word order of free word order languages given source clues only, it is important to ensure that the word order predictions and the backend system are suitably fitted together.

2 Related Work

Preordering has been explored from the perspective of the upper-bound achievable translation quality in several studies, including Khalilov and Sima’an (2012) and Herrmann et al. (2013), which compare various systems and provide oracle scores for syntax-based preordering models. Target-order source sentences, in which the word order is determined via automatic alignments, enable translation systems great jumps in translation quality and provide improvements in compactness and efficiency of downstream phrase-based translation models. Approaches have largely followed two directions: (1) predicting word order based on some form of source-syntactic representation and (2) approaches which do not depend on source syntax.

2.1 Source Syntax-Based Preordering

Many approaches to preordering have made use of syntactic representations of the source sentence, including Collins et al. (2005) who restructure the source phrase structure parse tree by applying a sequence of transformation rules. More recently, Jehl et al. (2014) learn to order sibling nodes in the source-side dependency parse tree. The space of possible permutations is explored via depth-first branch-and-bound search (Balas and Toth, 1983). In later work, the authors further improve this model by replacing the logistic regression classifier with a feed-forward neural network (de Gispert et al., 2015), which results in improved empirical results and eliminates the need for feature engineering. Lerner and Petrov (2013) train classifiers to predict the permutations of up to 6 tree nodes in the source dependency tree. The authors found that by only predicting the best 20 permutations of n nodes, they could cover a large majority of the reorderings in their data.

2.2 Preordering without Source Syntax

Tromble and Eisner (2009) learn to predict the orientation of any two words (straight or inverted order) using a perceptron. The search for the best reordering is performed with an O(n³) chart parsing algorithm. More basic approaches to syntax-less preordering include the application of multiple MT systems (Costa-jussà and Fonollosa, 2006), where a first system learns preordering and a second learns to translate the preordered sentence into the target sentence. Finally, there have been successful attempts at the automatic induction of parse trees from aligned data (DeNero and Uszkoreit, 2011) and the estimation of latent reordering grammars (Stanojević and Sima’an, 2015) based on permutation trees (Zhang and Gildea, 2007).

2.3 Lattice Translation

A lattice is an acyclic finite-state automaton defining a finite language. A more restricted class of lattices, namely confusion networks (Bertoldi et al., 2007), has been extensively used to pack alternative input sequences for decoding.¹ However, applications mostly focused on speech translation (Ney, 1999; Bertoldi et al., 2007), or to account for lexical and/or segmentation ambiguity due to preprocessing (Xu et al., 2005; Dyer, 2007).

¹ A confusion network is a special case of a lattice where every path from start to final state goes through every node.



On very few occasions, lattice input has been used to determine the space of permutations of the input considered by the decoder (Knight and Al-Onaizan, 1998; Kumar and Byrne, 2003). The effectiveness of lattices of permutations was demonstrated by Zhang et al. (2007). However, except in the cases of n-gram based decoders (Khalilov et al., 2009) this approach is not a common practice.

Dyer et al. (2008) formalized lattice translation both for phrase-based and hierarchical phrase-based MT. The former requires a modification of the standard phrase-based decoding algorithm so as to maintain a coverage vector over states, rather than input word positions. The latter requires intersecting a lattice and a context-free grammar, which can be seen as a generalized form of parsing (Klein and Manning, 2001). In this work, we focus on phrase-based models.

The space of translation options in standard phrase-based decoding with a distortion limit d grows with O(stack size × n × 2^d), where n represents the input length, and the number of translation options is capped due to beam search (Koehn et al., 2003). With lattice input, the dependency on n is replaced by |Q|, where Q is the set of states of the lattice. The stack size makes the number of translation options explored by the decoder independent of the number of transitions in the lattice.

As in standard decoding, the states of a lattice can also be visited non-monotonically. However, two states in a lattice are not always connected by a path, and, in general, paths connecting two nodes might differ in length. Dyer et al. (2008) proposed to pick the shortest path between two nodes to be representative of the distance between them.² Just like in standard decoding, a distortion limit is imposed to keep the space of translations tractable.

In this work, we use lattice input to constrain the space of permutations of the source allowed within the decoder. Moreover, in most cases we completely disable the decoder’s further reordering capabilities. Because our models can perform global permutation operations without ad hoc distortion limits, we can reach far more complex word orders. Crucially, our models are better predictors of word order than standard distortion-based reordering, thus we manage to decode with relatively small permutation lattices.

3 Quantifying Word Order Freedom

While varying degrees of word order freedom are a well-studied topic in linguistics, word order freedom has only recently been studied from a quantitative perspective. This has been enabled partly by the increasing availability of syntactic treebanks. Kuboň and Lopatková (2015) propose a measure of word order freedom based on a set of six common word order types (SVO, SOV, etc.). Futrell et al. (2015) define various entropy measures based on the prediction of word order given unordered dependency trees. Both approaches require a dependency treebank for each language.

In practical applications such as machine translation, it is difficult to quantify the influence of word order freedom. For an arbitrary language pair, our goal is to quantify a notion of the target language’s word order freedom based only on parallel sentences and source syntax. In their head direction entropy measure, Futrell et al. (2015) approach the problem of quantifying word order freedom by measuring the difficulty of recovering the correct linear order from a sentence’s unordered dependency tree. We approach the problem of quantifying a target language’s word order freedom by measuring the difficulty of predicting target word order based on the source sentence’s dependency tree. Hence, we ask questions such as: How difficult is it to predict French word order based on the syntax of the English source sentence?

3.1 Source Syntax and Target Word Order

We represent the target sentence’s word order as a sequence of order decisions. Each order decision encodes for two source words, a and b, whether their translation equivalents are in the order (a, b) or (b, a). The source sentences are parsed with a dependency parser.³ The target-language order of the words in the source dependency tree is then determined by comparing the target sentence positions of the words aligned to each source word. Figure 1 shows the percentage of dependent-head pairs in the source dependency tree whose target order can be correctly guessed by always choosing the more common decision.⁴

² This is achieved by running an all-pairs shortest path algorithm prior to decoding; see for example Chapter 25 of (Cormen et al., 2001). MOSES uses the Floyd-Warshall algorithm, which runs in time O(|Q|³).
³ http://cs.cmu.edu/~ark/TurboParser/
⁴ For English–Japanese, we use manual word alignments of 1,235 sentences from the Kyoto Free Translation Task (Neubig, 2011) and for English–German, we use a manually word-aligned subset of Europarl (Padó and Lapata, 2006) consisting of 987 sentences.



[Figure 1: Source word pairs whose target order can be predicted using only the words’ labels.]

(a) English–Japanese
  Head = verb: noun 79.7% (Sbj 63.3%, Obj 82.6%, Adv 84.3%); adv 54.5%
  Head = noun: adj 79.5%; det 78.6%; prep 49.3%

(b) English–German
  Head = verb: noun 44.7% (Sbj 66.3%, Obj 22.7%, Adv 35.4%); adv 41.3%
  Head = noun: adj 76.7%; det 92.9%; prep 62.2%

German and Japanese. Both language pairs differ significantly in how strictly the target language’s word order is determined by the source language’s syntax. English–German shows strict order constraints within phrases, such as that adjectives and determiners precede the noun they modify in the vast majority of cases (Figure 1b). However, English–German also shows more freedom on the clause level, where basic syntax-based predictions for the positions of nouns relative to the main verb are insufficient. For English–Japanese on the other hand, the position of the nouns relative to the main verb is more rigid, which is demonstrated by the high scores in Figure 1a. These results are in line with the linguistic descriptions of both target languages. From a technical point of view, they highlight that any treatment of English–German word order must take into account information beyond the basic syntactic level and must allow for a given amount of word order freedom.

3.2 Bilingual Head Direction Entropy

While such a qualitative comparison provides insight into the order differences of selected language pairs, it is not straight-forward to compare across many language pairs. From a linguistic perspective, Futrell et al. (2015) use entropy to compare word order freedom in dependency corpora across various languages. While the authors observed that artifacts of the data such as treebank annotation style can hamper comparability, they found that a simple entropy measure for the prediction of word order based on the dependency structure provided a good quantitative measure of word order freedom.

We follow Futrell et al. (2015) in basing our measure on conditional entropy, which provides a straight-forward way to quantify to which extent target word order is determined by source syntax:

H(Y|X) = −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)

Conditional entropy measures the amount of information required to describe the outcome of a random variable Y given the value of a second random variable X. Given a dependent-head pair in the source dependency tree, X consists of the dependent’s and the head’s part of speech, as well as the dependency relation between them. Note that as in all of our experiments the source language is English, the space of outcomes of X is the same across all language pairs. Y in this case is the word pair’s target-side word order in the form of a (a, b) or (b, a) decision. We estimate H(Y|X) using the bootstrap estimator of DeDeo et al. (2013), which is less prone to sample bias than maximum likelihood estimation.⁵

Influence of word alignments. Futrell et al. (2015) use human-annotated dependency trees for each language they consider. Our estimation only involves word-aligned bilingual sentence pairs with a source dependency tree. Manual alignments are available for a limited number of language pairs and often only for a diminishingly small number of sentences. Consequently the question arises whether automatic word alignments are sufficient for this task. To answer this question, we apply our measure to a set of manually aligned as well as a larger set of automatically aligned sentence pairs. In addition to the German and Japanese alignments mentioned above, we use manual alignments for English–Italian (Farajian et al., 2014), English–French (Och and Ney, 2003), English–Spanish (Graça et al., 2008) and English–Portuguese (Graça et al., 2008).

⁵ We observe an average of 1,033 values for X per language pair and perform 10,000 Monte-Carlo samples.
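[Editorial note: for readers who want to experiment with the measure, a maximum likelihood version of the estimate is only a few lines of code. This is a sketch under that simplifying assumption; the paper itself uses the bootstrap estimator of DeDeo et al. (2013) instead.]

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """MLE estimate of H(Y|X) in bits from (x, y) observations.

    Here x would be a (dependent POS, head POS, relation) triple and
    y an order decision, (a, b) or (b, a).
    """
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    n = len(pairs)
    entropy = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_y_given_x = count / marginal_x[x]
        entropy -= p_xy * math.log2(p_y_given_x)
    return entropy
```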



Since a limited number of manually aligned sentences are available, it is important to avoid bias due to sample size. Hence, we randomly sample the same number of dependency relations from each language pair. Considering only those languages for which we have both manual and automatic alignments, we can determine how well their word order freedom rankings correlate. Even though the number of samples for the manually aligned sentences is limited to 500 due to the size of the smallest set of manual alignments, we find a high correlation of Spearman’s ρ = 0.77 between the rankings of the 6 languages that occur in both sets (Zwillinger and Kokoska, 1999).

Influence of source syntax. Another factor that may influence our estimated degree of word order freedom is the form and granularity of the source side’s syntactic representation: more detailed representations may disambiguate cases that are difficult to predict with a more bare representation. As we are interested in the bilingual case and, specifically, in preordering, we content ourselves with using the same syntactic representation, i.e. dependency trees, that many preordering models use (e.g., Jehl et al. (2014), Lerner and Petrov (2013)).

Comparison to monolingual measures. Our measure is similar to Futrell et al. (2015)’s head direction entropy; however, it also offers several advantages. While monolingual head direction entropy requires a dependency treebank for each language, our bilingual head direction entropy only requires dependency annotation for the source language (English in our case). One of their caveats, the influence of the widely varying dependency annotation styles across treebanks, is also not present in our method, since a single dependency style is used for the source language. We have demonstrated that automatic alignments perform on a comparable level to manual alignments. Accordingly, the amount of data that can be used to estimate the measure is only limited by the availability of parallel sentences. Finally, while dependency treebanks rarely cover the same corpora or even domains, our method can utilize sentences from the same or similar corpora for each language, thus minimizing potential corpus biases.

Translation from English. Figure 2 plots bilingual head direction entropy for an English source side and a set of typologically diverse languages on the target side. For each language pair, we use 18,000 sentence pairs and automatic alignments from the Tatoeba corpus (Tiedemann, 2012).⁶ Languages at the top of the plot in Figure 2 show a greater degree of word order freedom with respect to the English source syntax. Thus, predicting their word order from English source clues alone is likely to be difficult. We argue that in such cases it is crucial to pass on the ambiguity over the space of predictions to the translation model. By doing so, word order decisions can be influenced by translation decisions, while still shaping the space of reachable translations.

[Figure 2: Bilingual head direction entropy with English source side. Languages from top (most entropy) to bottom: Berber, Turkish, Hebrew, German, Russian, French, Spanish, Italian, Portuguese, Esperanto, Japanese, Mandarin; x-axis from 0 to 0.6.]

⁶ The alignments were produced using GIZA++ (Och and Ney, 2003) with grow-diag-final-and symmetrization.

4 Preordering Free and Fixed Word Order Languages

The measure of word order freedom introduced in the previous section enables us to estimate how difficult it is to predict the target language’s word order based on the source language. In this section, we introduce the two preordering models we use to predict the word order of German and Japanese. Experiments with these models will allow us to examine the relationship between preordering and word order freedom.

4.1 Neural Lattice Preordering



Based on their earlier work, which used logistic regression and graph search for preordering (Jehl et al., 2014), de Gispert et al. (2015) introduce a neural preordering model. In this model, a feed-forward neural network is trained to estimate the swap probabilities of nodes in the source-side dependency tree. Search is performed via the depth-first branch-and-bound algorithm. The authors have found this model to be fast and to produce high quality word order predictions for a variety of languages.

Training instances generated in this manner are then used to estimate the swap probability p(i, j) for two indexes i and j. For each node in the source dependency tree, the best possible permutation of its children (including the head) is determined via graph search. The score of a permutation π is defined as follows:

score(π) = ∏_{1 ≤ i < j ≤ |π|, π[i] > π[j]} p(i, j) · ∏_{1 ≤ i < j ≤ |π|, π[i] < π[j]} (1 − p(i, j))

During search, the score of a partial permutation π′ extended by one element i can be computed efficiently as

score(π′ · ⟨i⟩) = score(π′) · ∏_{j ∈ V, i > j} p(i, j) · ∏_{j ∈ V, i < j} (1 − p(i, j))

where V denotes the set of indices not yet added to the partial permutation. The search can also retrieve k-best results while keeping the same guarantees and computational complexity. Only minor changes are necessary to adapt the search for the best permutation to finding the k-best permutations: we keep a set best_k of the best permutations and a single bound. If for a permutation π′, score(π′) > bound, instead of updating the bound to the single best permutation and remembering it, the following steps are performed:

1. If |best_k| = k: remove the worst permutation from the set.
2. Add π′ to best_k.
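[Editorial sketch of the incremental scoring rule above; the variable names are ours, not the paper's.]

```python
def extend_score(score, new_index, uncovered, swap_prob):
    """Extend a partial permutation's score with new_index, following
    score(pi + [i]) = score(pi) * prod_{j in V, i > j} p(i, j)
                                * prod_{j in V, i < j} (1 - p(i, j)),
    where V (uncovered) holds the indices not yet placed and
    swap_prob(i, j) returns p(i, j)."""
    for j in uncovered:
        if new_index > j:
            score *= swap_prob(new_index, j)
        elif new_index < j:
            score *= 1.0 - swap_prob(new_index, j)
    return score
```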



The sentence and its permutation are observed during training. The exact PET that generated this permutation is not observed and there could be (exponentially) many PETs that could have generated the observed permutation. Hence, the bracketings of potential PETs are treated as latent variables.

The second source of latent variables is state splitting of non-terminals (labels that indicate how to reorder the children) in a similar way as done in monolingual parsing (Matsuzaki et al., 2005; Petrov et al., 2006; Prescher, 2005). Each latent permutation tree has many latent derivations and the generative probabilistic model needs to account for them. The probability of the observed permutation π is defined in the following way:

P(π) = ∑_{∆ ∈ PEF(π)} ∑_{d ∈ ∆} ∏_{r ∈ d} P(r)

where PEF(π) returns the Permutation Forest of π (i.e., the set of PETs that can generate the permutation π), ∆ represents a permutation tree, d represents a derivation of a permutation tree and r represents a production rule. Efficient estimation for this model is done by using the standard Inside-Outside algorithm (Lari and Young, 1990).

At test time, the source sentence is parsed with the estimated grammar in order to find the derivation of a permutation tree with the lowest expected cost; more formally, the decoding task is to find, for the input sentence, the permutation-tree derivation that minimises the expected cost.

5 Machine Translation with Permutation Lattices

5.1 Permutation Lattices

We call a permutation lattice for sentence s = ⟨s_1, ..., s_n⟩ an acyclic finite-state automaton where every path from the initial state reaches an accepting state in exactly n uniquely labeled transitions. Transitions are labeled with pairs (i, s_i) for i = 1, ..., n, and each path represents an arbitrary permutation of the source’s n tokens.

[Figure 3: Example permutation lattice. (a) Linear form. (b) Minimized lattice.]

In a permutation lattice with states Q and transitions E, every path between any two states u, v ∈ Q has exactly the same length. Let out*(x) denote the transitive closure of x ∈ Q, that is, the set of states reachable from x. If two nodes are at all connected, v ∈ out*(u), then the distance between them equals d_v − d_u, where d_x is x’s distance from the initial state. This observation allows a speed up of non-monotone translation of a permutation lattice. Namely, to precompute shortest distances,

dˆ= arg min P (d′) cost(d, d′) necessary to impose a distortion limit, instead of d Chart(s) running a fully fledged all-pairs shortest path al- ∈ d′ Chart(s) ∈ ∑ gorithm O( Q 3) (Cormen et al., 2001), we can | | where P (d) = P (r) is the probability of a compute transitive closure in time O( Q E ) r d | | × | | derivation, and Chart(∈ s) is the space of all pos- (Simon, 1988) followed by single-source distance ∏ sible derivations of all possible permutation trees in time O( Q + E ) (Mohri, 2002). | | | | for source sentence s. Two main modifications to We produce permutation lattices by compress- this formula are made in order to make inference ing the n-best outputs from the reordering mod- fast: First, Kendall τ is used as a cost function be- els into a minimal deterministic acceptor. Un- cause it decomposes well,8 which allows usage of weighted determinization and minimization are efficient dynamic programming minimum Bayes- performed using OpenFST (Allauzen et al., 2007). risk (MBR) computation (DeNero et al., 2009). The results of this process are very compact rep- Second, instead of computing the MBR deriva- resentations that can be decoded efficiently. As an tion over the full chart, computation is done over illustration, Figure 3 shows an English sentence 10,000 unbiased samples from the chart. To build from WMT newstest 2014 preordered for transla- the permutation lattice with this model we use tion into German before (3a) and after minimiza- the top n permutations which have the lowest ex- tion (3b).9 Table 1 shows the influence of the num- pected Kendall τ cost. ber of predicted permutations on the lattice sizes

8More precisely, we use the Kendall τ distance between 9Example sentence: The Kluser lights protect cyclists, as the permutations that are yields of the derivations. well as those travelling by bus and the residents of Bergle.

124

Page 18 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

for English–German. Permutation quality is mea- the cost function, we use n-gram overlap, as com- sured by Kendall τ distance to the gold permuta- monly used in string kernels (Lodhi et al., 2002): tion (best-out-of-n). 7

Lattice overlap(ˆsL′ , s′) = countˆs (c)  L′  n=2 c Cn Permutations Kendall τ States Transitions ∑ ∑∈ s′   Monotone 83.78 23 22 n where Cs denotes all candidate n-grams of length 5 84.69 24 52 ′ n in s′ and countˆ (c) denotes the number of oc- 10 85.23 33 69 sL′ 100 86.20 72 138 currences of n-gram c in ˆsL′ . Ties between permu- 1000 86.75 123 233 tations with the same overlap are broken using the permutations’ scores from the preordering model. Table 1: Permutations and lattice size (En–De). 6 Experiments

5.2 Lattice Silver Training 6.1 Experimental Setup While for first-best word order predictions, there In our translation experiments, we use the follow- are two straight-forward options for how to se- ing experimental setup, datasets and parameters. lect training instances for the MT system, it is less Translation system Translation experiments are clear how to do this in the case of permutation lat- performed with a phrase-based machine transla- tices. In standard preordering, the word order of tion system, a version of Moses (Koehn et al., the source sentence in the training set is commonly 2007) with extended lattice support.10 We use the determined by reordering the source sentence to basic Moses features and perform 15 iterations of minimize the number of crossing alignment links batch MIRA (Cherry and Foster, 2012). (we denote this as s′). Alternatively, the trained preordering model can be applied to the source English–Japanese Our experiments are per- side of the training set, which we call ˆs1′ . There formed on the NTCIR-8 Patent Translation is a trade-off between both methods: While s′ will (PATMT) Task. Tuning is performed on the generally produce more compact and less noisy NTCIR-7 dev sets, and translation is evaluated on phrase tables, it may include phrases that are not the test set from NTCIR-9. All data is tokenized reachable by the preordering model. The predicted (using the Moses tokenizer for English and KyTea order ˆs1′ , on the other hand, may be too constrained 5 for Japanese (Neubig et al., 2011)) and filtered to reach helpful hypotheses. For lattices, one op- for sentences between 4 and 50 words. As a base- tion would be to extract all possible phrases from line we use a translation system with distortion the lattice directly. Here, we consider a simpler al- limit 6 and a lexicalized reordering model (Galley ternative: Instead of selecting either the gold order and Manning, 2008). We use a 5-gram language s′ or the predicted order ˆs1′ , we select the order ˆs′ model estimated using lmplz (Heafield et al., 2013) which is closest to both the lattice predictions and on the target side of the parallel corpus. the gold order s . Since this order is a mix of the ′ English–German For translation into German, lattice predictions and the gold order, we call this we built a machine translation system based on the training scheme lattice silver training. WMT 2016 news translation data.11 The system is Let (s, t) be a training instance consisting of a trained on all available parallel data, consisting of source sentence s and a target sentence t and let 4.5m sentence pairs from Europarl (Koehn, 2005), s be the target-order source sentence obtained via ′ Common Crawl (Smith et al., 2013) and the News the word alignments. For each training instance, Commentary corpus. We removed all sentences we select the preordered source ˆs as follows: ′ longer than 80 words and tokenization and true- casing is performed using the standard Moses tok- ˆs′ = arg max overlap(ˆs′ , s′) L enizer and truecaser. We use a 5-gram Kneser-Ney ˆsL′ πk(s) ∈ language model, estimated using lmplz (Heafield where π (s) is the set of k-best permutations pre- k 10Made available at https://github.com/ dicted by the preordering model. Each ˆs π (s) L′ ∈ k wilkeraziz/mosesdecoder. represents a single path through the lattice. As 11http://statmt.org/wmt16/

125

Page 19 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

et al., 2013). The language model is trained on Translation Word order 189m sentences from the target sides of Europarl DL BLEU Kendall τ and News Commentary, as well as the News Crawl Baseline 6 21.76 54.75 2007-2015 corpora. Word alignment is performed 6 26.68 58.05 Oracle order using MGIZA (gdfa with 6, 6, 3 and 3 iterations 0 26.41 57.92 of IBM M1, HMM, IBM M3 and IBM M4). As First-best 6 21.21A 53.44 a baseline we use a translation system with dis- Lattice (silver) 0 21.88B 54.51

tortion limit 6 and a distortion-based reordering AStat. significant against baseline. BStat. significant against first-best. model. Tuning is performed on newstest 2014 and Table 2: Translation results English–German. we evaluate on newstest 2015.

Preordering models For German, we use the performs better even when translating monotoni- neural lattice preordering model introduced in cally with a distortion limit of 0. Section 4.1. The model is trained on the full par- allel training data (4.5m sentences) based on the Lattice silver training To examine the utility of automatic word alignments used by the translation the lattice silver training scheme, we train sys- system. Source dependency trees are produced by tems which differ only in the way the training TurboParser,12 which was trained on the English data is extracted. Table 3 shows that for English– version of HamleDT (Zeman et al., 2012) with German, lattice silver training is successful in content-head dependencies. For translation into bridging the gap between the preordering model Japanese, we train a Reordering Grammar model and the alignment-based target word order, both for 10 iterations of EM on a training set consisting for monotonic translation and when allowing the of 786k sentence pairs with automatic alignments. decoder to additionally reorder translations.

6.2 Translation Experiments Distortion limit We report lowercased BLEU (Papineni et al., 0 3 2002) and Kendall τ calculated from the force- Gold training 21.44 21.60 Lattice silver training 21.88 21.88 aligned hypothesis and reference. Statistical sig- nificance tests are performed for the translation Table 3: Lattice silver training (BLEU, En–De). scores using the bootstrap resampling method with p-value < 0.05 (Koehn, 2004). The standard pre- ordering systems (“first-best” in Table 2 and 4) use English–Japanese Results for translation into an additional lexicalized reordering model (MSD), Japanese are shown in Table 4. while the lattice systems use only lattice distor- Discussion Although preordering with a single tion. For training preordered translation models, permutation already works well for the strict word we recreate word alignments from the original order language Japanese, packing the word order MGIZA alignments and the permutation for En– ambiguity into a lattice allows the machine trans- De and re-align preordered and target sentences lation system to achieve even better translation for En–Ja using MGIZA.13 monotonically than allowing a distortion of 6 and English–German Translation results for trans- an additional lexicalized reordering model on top lation into German are shown in Table 2. For this language pair, we found standard pre- Translation Word order ordering to work poorly. This is despite the fact DL BLEU Kendall τ that the oracle order (i.e. the source words in Baseline 6 29.65 44.87 the test set are preordered according to the word 6 34.22 56.23 Oracle order alignments) shows significant potential. A lattice 0 30.55 53.98

packed with 1000 permutations on the other hand, A First-best 6 32.14 49.68 Lattice 0 32.50AB 50.79 12http://cs.cmu.edu/˜ark/TurboParser/ 13 Re-aligning the sentences with MGIZA generally im- AStat. significant against baseline. BStat. significant against first-best. proves results, which implies that we are likely underestimat- ing the results for En–De. Table 4: Translation results English–Japanese.

126

Page 20 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

of a single permutation. We noticed that lexical- ticular doing so with permutation lattices, can be ized reordering helped the first-best systems and an indispensable tool for dealing with word order hence report this stronger baseline. In principle, in machine translation. The experiments we per- lexicalized reordering can also be used with 0- formed in this paper confirm this previous finding distortion lattice translation, and we plan to inves- and we further build on it by introducing a new tigate this option in the future. Linguistic intuition method for training machine translation systems and the empirical results presented in Section 3 for lattice-preordered input, which we call lattice suggest that compared to Japanese, German shows silver training. Finally, we found that while lat- more word order freedom. Consequently, we as- tices are indeed helpful for English–Japanese, for sumed that a first-best preordering model would which standard preordering already works well, not perform well on the language pair English– they are crucial for translation into the freer word German, and indeed the results in Table 2 confirm order language German. this assumption. For both language pairs, translat- ing a lattice of predicted permutations outperforms Acknowledgements the baselines, thus reducing the gap between trans- We thank the three anonymous reviewers for their lation with predicted word order and oracle word constructive comments and suggestions. This order. However, permutation lattices turn out to be work received funding from EXPERT (EU FP7 the key to enabling any improvement at all for the Marie Curie ITN nr. 317471), NWO VICI grant language pair English–German in the context of nr. 277-89-002 (Khalil Sima’an), DatAptor project preordering. This language pair can benefit from STW grant nr. 12271 and QT21 project (H2020 nr. the improved interaction between word order and 645452). translation decisions. These findings go in tandem with our analysis in Section 3 (see Figures 1 and 2), particularly, the prediction of our information- References theoretic word order freedom metric that it should Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wo- be more difficult to determine German word or- jciech Skut, and Mehryar Mohri. 2007. OpenFst: A der from English clues. Our main focus in this general and efficient weighted finite-state transducer paper was on the language pairs English–German library. In Proceedings of the Ninth International Conference on Implementation and Application of and English–Japanese. Hence, while our results Automata, (CIAA 2007), volume 4783 of Lecture provide an empirical data point for the utility of Notes in Computer Science, pages 11–23. Springer. permutation lattices for free word order languages, http://www.openfst.org. we plan to provide further empirical support by Egon Balas and Paolo Toth. 1983. Branch and bound performing experiments with a broader range of methods for the traveling salesman problem. Tech- language pairs in future work. nical report, Carnegie-Mellon Univ. Pittsburgh PA Management Sciences Research Group. 7 Conclusion Yoshua Bengio, Rejean´ Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic lan- The world’s languages differ widely in how they guage model. J. Mach. Learn. Res., 3:1137–1155, express meaning, relying on indicators such as March. word order, intonation or morphological mark- Nicola Bertoldi, Richard Zens, and Marcello Federico. ings. Consequently, some languages exhibit 2007. 
Speech translation by confusion network stricter word order than others. Our goal in this decoding. In IEEE International Conference on paper was to examine the effect of word order Acoustics, Speech and Signal Processing, volume 4 of ICASSP ’07, pages 1297–1300, Honolulu, HI, freedom on machine translation and preordering. April. IEEE. We provided an empirical comparison of language pairs in terms of the difficulty of predicting the tar- Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Pro- get language’s word order based on the source lan- ceedings of the 2012 Conference of the North Amer- guage. Our metric’s predictions agree both with ican Chapter of the Association for Computational the intuition provided by linguistic theory and the Linguistics: Human Language Technologies, pages empirical support we present in the form of trans- 427–436, Montreal,´ Canada, June. lation experiments. We show that addressing un- Michael Collins, Philipp Koehn, and Ivona Kucerova. certainty in word order predictions, and in par- 2005. Clause restructuring for statistical machine

127

Page 21 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

translation. In Proceedings of the 43rd Annual (Depling 2015), pages 91–100, Uppsala, Sweden, Meeting of the Association for Computational Lin- August. Uppsala University, Uppsala, Sweden. guistics (ACL’05), pages 531–540, Ann Arbor, Michigan, June. Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, model. In Proceedings of the 2008 Conference on and Charles E. Leiserson. 2001. Introduction to Al- Empirical Methods in Natural Language Process- gorithms. McGraw-Hill Higher Education, 2nd edi- ing, pages 848–856, Honolulu, Hawaii, October. tion. Joao˜ Grac¸a, Joana Paulo Pardal, and Lu´ısa Coheur. Marta R. Costa-jussa` and Jose´ A. R. Fonollosa. 2006. 2008. Building a golden collection of parallel multi- Statistical machine reordering. In Proceedings of language word alignments. the 2006 Conference on Empirical Methods in Nat- ural Language Processing, pages 70–76, Sydney, Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Australia, July. Clark, and Philipp Koehn. 2013. Scalable modi- Adria` de Gispert, Gonzalo Iglesias, and Bill Byrne. fied Kneser-Ney language model estimation. In Pro- 2015. Fast and accurate preordering for SMT us- ceedings of the 51st Annual Meeting of the Associa- ing neural networks. In Proceedings of the 2015 tion for Computational Linguistics, pages 690–696, Conference of the North American Chapter of the Sofia, Bulgaria, August. Association for Computational Linguistics: Human Language Technologies, pages 1012–1017, Denver, Teresa Herrmann, Jochen Weiner, Jan Niehues, and Colorado, May–June. Alex Waibel. 2013. Analyzing the potential of source sentence reordering in statistical machine Simon DeDeo, Robert X. D. Hawkins, Sara Klin- translation. In Proceedings of the International genstein, and Tim Hitchcock. 2013. Bootstrap Workshop on Spoken Language Translation (IWSLT methods for the empirical study of decision-making 2013). and information flows in social systems. Entropy, 15(6):2246–2276. Laura Jehl, Adria` de Gispert, Mark Hopkins, and Bill Byrne. 2014. Source-side preordering for transla- John DeNero and Jakob Uszkoreit. 2011. Inducing tion using logistic regression and depth-first branch- sentence structure from parallel corpora for reorder- and-bound search. In Proceedings of the 14th Con- ing. In Proceedings of the 2011 Conference on Em- ference of the European Chapter of the Associa- pirical Methods in Natural Language Processing, tion for Computational Linguistics, pages 239–248, pages 193–203, Edinburgh, Scotland, UK., July. Gothenburg, Sweden, April.

John DeNero, David Chiang, and Kevin Knight. 2009. Maxim Khalilov and Khalil Sima’an. 2012. Statistical Fast consensus decoding over translation forests. In translation after source reordering: Oracles, context- Proceedings of the Joint Conference of the 47th An- aware models, and empirical analysis. Natural Lan- nual Meeting of the ACL and the 4th International guage Engineering, 18:491–519, 10. Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages Maxim Khalilov, Jose´ A. R. Fonollosa, and Mark Dras. 567–575, Stroudsburg, PA, USA. 2009. Coupling hierarchical word reordering and decoding in phrase-based statistical machine transla- Christopher Dyer, Smaranda Muresan, and Philip tion. In Proceedings of the Third Workshop on Syn- Resnik. 2008. Generalizing word lattice translation. tax and Structure in Statistical Translation, SSST In Proceedings of ACL-08: HLT, pages 1012–1020, ’09, pages 78–86, Stroudsburg, PA, USA. Columbus, Ohio, June. Christopher J. Dyer. 2007. The “noisier chan- Dan Klein and Christopher D. Manning. 2001. Parsing nel”: Translation from morphologically complex and hypergraphs. In Seventh International Work- languages. In Proceedings of the Second Workshop shop on Parsing Technologies (IWPT- 2001), Octo- on Statistical Machine Translation, pages 207–211, ber. Prague, Czech Republic, June. Kevin Knight and Yaser Al-Onaizan. 1998. Transla- M. Amin Farajian, Nicola Bertoldi, and Marcello Fed- tion with finite-state devices. In Proceedings of the erico. 2014. Online word alignment for online Association for Machine Translation in the Ameri- adaptive machine translation. In Proceedings of the cas, AMTA, pages 421–437, Langhorne, PA, USA. EACL 2014 Workshop on Humans and Computer- assisted Translation, pages 84–92, Gothenburg, Philipp Koehn, Franz Josef Och, and Daniel Marcu. Sweden, April. 2003. Statistical phrase-based translation. In Pro- ceedings of the 2003 Conference of the North Amer- Richard Futrell, Kyle Mahowald, and Edward Gibson. ican Chapter of the Association for Computational 2015. Quantifying word order freedom in depen- Linguistics on Human Language Technology - Vol- dency corpora. In Proceedings of the Third In- ume 1, NAACL ’03, pages 48–54, Stroudsburg, PA, ternational Conference on Dependency Linguistics USA.

128

Page 22 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Graham Neubig, Yosuke Nakata, and Shinsuke Mori. Callison-Burch, Marcello Federico, Nicola Bertoldi, 2011. Pointwise prediction for robust, adaptable Brooke Cowan, Wade Shen, Christine Moran, japanese morphological analysis. In Proceedings of Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra the 49th Annual Meeting of the Association for Com- Constantin, and Evan Herbst. 2007. Moses: Open putational Linguistics: Human Language Technolo- source toolkit for statistical machine translation. In gies, pages 529–533, Portland, Oregon, USA, June. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Graham Neubig. 2011. The Kyoto free translation ACL ’07, pages 177–180, Stroudsburg, PA, USA. task. http://www.phontron.com/kftt.

Philipp Koehn. 2004. Statistical significance tests for Hermann Ney. 1999. Speech translation: coupling of machine translation evaluation. In Proceedings of recognition and translation. In IEEE International the 2004 Conference on Empirical Methods in Nat- Conference on Acoustics, Speech, and Signal Pro- ural Language Processing, pages 388–395. cessing, volume 1, pages 517–520, Phoenix, AZ, March. IEEE. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings Franz Josef Och and Hermann Ney. 2003. A sys- of Machine Translation Summit X, volume 5, pages tematic comparison of various statistical alignment 79–86. models. Computational Linguistics, 29(1):19–51.

Vladislav Kubonˇ and Marketa´ Lopatkova.´ 2015. Free Sebastian Pado´ and Mirella Lapata. 2006. Optimal or fixed word order: What can treebanks reveal? constituent alignment with edge covers for seman- In Jakub Yaghob, editor, ITAT 2015: Information tic projection. In Proceedings of the 21st Interna- Technologies Applications and Theory, Proceedings tional Conference on Computational Linguistics and of the 15th conference ITAT 2015, volume 1422 of 44th Annual Meeting of the Association for Com- CEUR Workshop Proceedings, pages 23–29, Praha, putational Linguistics, pages 1161–1168, Sydney, Czechia. Charles University in Prague, CreateSpace Australia, July. Independent Publishing Platform. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Shankar Kumar and William Byrne. 2003. A weighted Jing Zhu. 2002. BLEU: a method for automatic finite state transducer implementation of the align- evaluation of machine translation. In Proceedings ment template model for statistical machine trans- of the 40th annual meeting on association for com- lation. In Proceedings of the 2003 Conference putational linguistics, pages 311–318. of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 63–70, Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Stroudsburg, PA, USA. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of K. Lari and S. J. Young. 1990. The estimation of the 21st International Conference on Computational stochastic context-free grammars using the inside- Linguistics and the 44th Annual Meeting of the As- outside algorithm. Computer Speech and Language, sociation for Computational Linguistics, ACL-44, 4:35–56. pages 433–440, Stroudsburg, PA, USA.

Uri Lerner and Slav Petrov. 2013. Source-side classi- Detlef Prescher. 2005. Inducing head-driven pcfgs fier preordering for machine translation. In Proceed- with latent heads: Refining a tree-bank grammar for ings of the 2013 Conference on Empirical Methods parsing. In In ECML05. in Natural Language Processing, pages 513–523, Seattle, Washington, USA, October. K. Simon. 1988. An improved algorithm for transitive closure on acyclic digraphs. Theor. Comput. Sci., Huma Lodhi, Craig Saunders, John Shawe-Taylor, 58(1-3):325–346, June. Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. J. Mach. Learn. Jason R. Smith, Herve Saint-Amand, Magdalena Pla- Res., 2:419–444, March. mada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. text from the common crawl. In Proceedings of the 2005. Probabilistic CFG with latent annotations. In 51st Annual Meeting of the Association for Compu- Proceedings of the 43rd Annual Meeting of the As- tational Linguistics (Volume 1: Long Papers), pages sociation for Computational Linguistics (ACL’05), 1374–1383, Sofia, Bulgaria, August. pages 75–82, Ann Arbor, Michigan, June. Milosˇ Stanojevic´ and Khalil Sima’an. 2015. Reorder- Mehryar Mohri. 2002. Semiring frameworks and ing grammar induction. In Proceedings of the 2015 algorithms for shortest-distance problems. Jour- Conference on Empirical Methods in Natural Lan- nal of Automata, Languages and Combinatorics, guage Processing, pages 44–54, Lisbon, Portugal, 7(3):321–350, January. September.

129

Page 23 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

Jorg¨ Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Nicoletta Calzolari (Con- ference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mar- iani, Jan Odijk, and Stelios Piperidis, editors, Pro- ceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Is- tanbul, Turkey, may. Roy Tromble and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceed- ings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1007–1016, Singapore, August.

Edo S van der Poort, Marek Libura, Gerard Sierksma, and Jack A.A van der Veen. 1999. Solving the k- best traveling salesman problem. Computers & Op- erations Research, 26(4):409 – 425.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational linguistics, 23(3):377–403. Jia Xu, Evgeny Matusov, Richard Zens, and Hermann Ney. 2005. Integrated chinese word segmentation in statistical machine translation. In International Workshop on Spoken Language Translation, Pitts- burgh. Daniel Zeman, David Marecek,ˇ Martin Popel, Loganathan Ramasamy, Jan Stˇ epˇ anek,´ Zdenekˇ Zabokrtskˇ y,´ and Jan Hajic.ˇ 2012. Hamledt: To parse or not to parse? In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, may. European Language Resources Asso- ciation (ELRA). Hao Zhang and Daniel Gildea. 2007. Factorization of synchronous context-free grammars in linear time. In NAACL Workshop on Syntax and Structure in Sta- tistical Translation (SSST), pages 25–32. Yuqi Zhang, Richard Zens, and Hermann Ney. 2007. Chunk-level reordering of source language sen- tences with automatically learned rules for statistical machine translation. In Proceedings of the NAACL- HLT 2007/AMTA Workshop on Syntax and Struc- ture in Statistical Translation, SSST ’07, pages 1–8, Stroudsburg, PA, USA. Daniel Zwillinger and Stephen Kokoska. 1999. CRC Standard Probability and Statistics Tables and For- mulae. CRC Press.

130

Page 24 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

B A Joint Dependency Model of Morphological and Syntactic Struc- ture for Statistical Machine Translation

A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation

Rico Sennrich and Barry Haddow School of Informatics, University of Edinburgh [email protected], [email protected]

Abstract function/postion English/German example he quickly finite (main) walks away er geht schnell weg When translating between two languages [...] because he quickly finite (sub.) walks away that differ in their degree of morpholog- [...] weil er schnell weggeht he can quickly ical synthesis, syntactic structures in one bare infinitive walk away er kann schnell weggehen language may be realized as morphologi- he promises quickly to/zu-infinitive to walk away cal structures in the other, and SMT mod- er verspricht, schnell wegzugehen els need a mechanism to learn such trans- lations. Prior work has used morpheme Table 1: Surface realizations of particle verb splitting with flat representations that do weggehen ’walk away’. not encode the hierarchical structure be- tween morphemes, but this structure is rel- they charge a carry-on bag fee. evant for learning morphosyntactic con- straints and selectional preferences. We In example 1, agreement in case, number and propose to model syntactic and morpho- gender is enforced between eine ’a’ and Gebühr logical structure jointly in a dependency ’fee’, and selectional preference between erheben translation model, allowing the system ’charge’ and Gebühr ’fee’. A flat representation, to generalize to the level of morphemes. as is common in phrase-based SMT, does not en- We present a dependency representation code these relationships, but a dependency repre- of German compounds and particle verbs sentation does so through dependency links. that results in improvements in transla- In this paper, we investigate a dependency rep- tion quality of 1.4–1.8 BLEU in the WMT resentation of morphologically segmented words English–German translation task. for SMT. Our representation encodes syntactic and morphological structure jointly, allowing a single 1 Introduction model to learn the translation of both. Specifi- When translating between two languages that dif- cally, we work with a string-to-tree model with fer in their degree of morphological synthesis, GHKM-style rules (Galley et al., 2006), and a syntactic structures in one language may be re- relational dependency language model (Sennrich, alized as morphological structures in the other. 2015). We focus on the representation of German Machine Translation models that treat words as syntax and morphology in an English-to-German atomic units have poor learning capabilities for system, and two morphologically complex word such translation units, and morphological segmen- classes in German that are challenging for transla- tations are commonly used (Koehn and Knight, tion, compounds and particle verbs. 2003). Like words in a sentence, the morphemes German makes heavy use of compounding, and of a word have a hierarchical structure that is rel- compounds such as Abwasserbehandlungsanlage evant in translation. For instance, compounds in ‘waste water treatment plant’ are translated into Germanic languages are head-final, and the head is complex noun phrases in other languages, such as the segment that determines agreement within the French station d’épuration des eaux résiduaires. noun phrase, and is relevant for selectional prefer- German particle verbs are difficult to model be- ences of verbs. cause their surface realization differs depending on the finiteness of the verb and the type of clause. 1. sie erheben eine Hand|gepäck|gebühr. Verb particles are separated from the finite verb in

2081 Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal, 17-21 September 2015. c 2015 Association for Computational Linguistics.

Page 25 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

root main clauses, but prefixed to the verb in subordi- obja det nated clauses, or when the verb is non-finite. The subj infinitive marker zu ’to’, which is normally a pre- sie erheben eine Handgepäckgebühr modifying particle, appears as an infix in particle PPER VVFIN ART NN verbs. Table 1 shows an illustrating example. they charge a carry-on bag fee obja det 2 A Dependency Representation of root mod mod subj Compounds and Particle Verbs link link The main focus of research on compound split- sie erheben eine Hand  gepäck  gebühr ting has been on the splitting algorithm (Popovic PPER VVFIN ART SEG LN SEG LN SEG they charge a carry-on bag fee et al., 2006; Nießen and Ney, 2000; Weller et al., 2014; Macherey et al., 2011). Our focus is not the Figure 1: Original and proposed representation of splitting algorithm, but the representation of com- German compound. pounds. For splitting, we use an approach simi- obji lar to (Fritzinger and Fraser, 2010), with segmen- root comma

tation candidates identified by a finite-state mor- subj adv phology (Schmid et al., 2004; Sennrich and Kunz, 2014), and statistical evidence from the training er verspricht , schnell wegzugehen PPER VVFIN $, ADJD VVIZU corpus to select a split (Koehn and Knight, 2003). German compounds are head-final, and pre- he promises to go away quickly modifiers can be added recursively. Compounds obji are structurally ambiguous if there is more than comma adv one modifier. Consider the distinction between root part

(Stadtteil)projekt (literally: ’(city part) project)’) subj avz and Stadt(teilprojekt) ’city sub-project’. We opt 1 er verspricht , schnell zu weg gehen for a left-branching representation by default. We PPER VVFIN $, ADJD PTKZU PTKVZ VVINF also split linking elements, and represent them as a postmodifier of each non-final segment, includ- he promises to go away quickly ing the empty string (""). We use the same repre- sentation for noun compounds and adjective com- Figure 2: Original and proposed representation of pounds. German particle verb with infixed zu-marker. An example of the original2 and the proposed compound representation is shown in Figure 1. Importantly, the head of the compound is also verb particles are reordered to be the closest pre- the parent of the determiners and attributes in modifier of the verb. Prefixed particles and the zu- the noun phrase, which makes a bigram depen- infix are identified by the finite-state-morphology, dency language model sufficient to enforce agree- and split from the verb so that the particle is ment. Since we model morphosyntactic agree- the closest, the zu marker the next-closest pre- ment within the main translation step, and not in modifier of the verb, as shown in Figure 2. Agree- a separate step as in (Fraser et al., 2012), we deem ment, selectional preferences, and other phenom- it useful that inflection is marked at the head of ena involve the verb and its dependents, and the the compound. Consequently, we do not split off proposed representation retains these dependency inflectional or derivational morphemes. links, but reduces data sparsity from affixation and For German particle verbs, we define a common avoids discontinuity of the verb and its particle. representation that abstracts away from the vari- ous surface realizations (see Table 1). Separated 3 Tree Binarization

1We follow prior work in leaving frequent words or sub- We follow Williams et al. (2014) and map de- words unsplit, which has a disambiguating effect. With more pendency trees into a constituency representation, aggressive splitting, frequency information could be used for which allows for the extraction of GHKM-style the structural disambiguation of internal structure. 2The original dependency trees follow the annotation translation rules (Galley et al., 2006). This con- guidelines by Foth (2005). version is lossless, and we can still apply a de-

2082

Page 26 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

pendency language model (RDLM). Figure 3 (a) ROOT shows the constituency representation of the ex- SUBJ VVFIN OBJA ample in Figure 1. PPER DET MOD SEG Our model should not only be able to produce ART MOD SEG LINK new words productively, but also to memorize SEG LINK LN words it has observed during training. Looking at LN the compound Handgepäckgebühr in Figure 3 (a), sie erheben eine Hand  gepäck  gebühr we can see that it does not form a constituent, and (a) cannot be extracted with GHKM extraction heuris- tics. To address this, we binarize the trees in our ROOT training data (Wang et al., 2007). ROOT OBJA A complicating factor is that the binarization SUBJ VVFIN DET OBJA should not impair the RDLM. During decoding, PPER ART MOD SEG we map the internal tree structure of each hypoth- MOD LINK esis back to the unbinarized form, which is then MOD SEG LN scored by the RDLM. Virtual nodes introduced by SEG LINK the binarization must also be scorable by RDLM LN if they form the root of a translation hypothesis. A sie erheben eine Hand  gepäck  gebühr simple right or left binarization would produce vir- (b) tual nodes without head and without meaningful dependency representation. We ensure that each Figure 3: Unbinarized (a) and head-binarized (b) virtual node dominates the head of the full con- constituency representation of Figure 1. stituent through a mixed binarization.3 Specifi- cally, we perform right binarization of the head and all pre-modifiers, then left binarization of all RDLM to take this into account. We distinguish post-modifiers. This head-binarized representa- between virtual nodes based on whether their span tion is illustrated in Figure 3 (b).4 is a string prefix, suffix, or infix of the full con- Head binarization ensures that even hypotheses stituent. For prefixes and infixes, we do not add whose root is a virtual node can be scored by the a stop symbol at the end, and use null symbols, RDLM. This score is only relevant for pruning, which denote unavailable context, for padding to and discarded when the full constituent is scored. the right. For suffixes and infixes, we do the same Still, these hypotheses require special treatment in at the start. the RDLM to mitigate search errors. The virtual node labels (such as OBJA) are unknown symbols 4 Post-Processing to the RDLM, and we simply replace them with For SMT, all German training and development the original label (OBJA). The RDLM uses sibling data is converted into the representation described context, and this is normally padded with special in sections 2–3. To restore the original represen- start and stop symbols, analogous to BOS/EOS tation, we start from the tree output of the string- symbols in n-gram models. These start and stop to-tree decoder. Merging compounds is trivial: all symbols let the RDLM compute the probability segments and linking elements can be identified by that a node is the first or last child of its ances- the tree structure, and are concatenated. tor node. However, computing these probabilities For verbs that dominate a verb particle, the orig- for virtual nodes would unfairly bias the search, inal order is restored through three rules: since the first/last child of a virtual node is not nec- essarily the first/last child of the full constituent. 1. non-finite verbs are concatenated with the We adapt the representation of virtual nodes in particle, and zu-markers are infixed. 3In other words, every node is a fixed well-formed depen- dency structure (Shen et al., 2010) with our binarization. 2. 
finite verbs that head a subordinated clause 4Note that our definition of head binarization is different (identified by its dependency label) are con- from that of Wang et al. (2007), who left-binarize a node if catenated with the particle. the head is the first child, and right-binarize otherwise. Our algorithm also covers cases where the head has both pre- and post-modifiers, as erheben and gepäck do in Figure 3. 3. finite verbs that head a main clause have the

2083

Page 27 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

particle moved to the right clause bracket.5 system newstest2014 newstest2015 baseline 20.7 22.0 Previous work on particle verb translation into +split compounds 21.3 22.4 +particle verbs 21.4 22.8 German proposed to predict the position of parti- head binarization 20.9 22.7 cles with an n-gram language model (Nießen and +split compounds 22.0 23.4 Ney, 2001). Our rules have the advantage that they +particle verbs 22.1 23.8 full system 22.6 24.4 are informed by the syntax of the sentence and consider the finiteness of the verb. Table 2: English–German translation results Our rules only produce projective trees. Verb (BLEU). Average of three optimization runs. particles may also appear in positions that violate particle verb projectivity, and we leave it to future research to system compound sep. pref. zu-infix determine if our limitation to projective trees af- reference 2841 553 1195 176 fects translation quality, and how to produce non- baseline 845 96 847 71 projective trees. +head binarization 798 157 858 106 +split compounds 1850 160 877 94 +particle verbs 1992 333 953 169 5 SMT experiments Table 3: Number of compounds [that would be 5.1 Data and Models split by compound splitter] and particle verbs We train English–German string-to-tree SMT sys- (separated, prefixed and with zu-infix) in new- tems on the training data of the shared transla- stest2014/5. Average of three optimization runs. tion task of the Workshop on Statistical Machine Translation (WMT) 2015. The data set consists of uses the dependency representation of compounds 4.2 million sentence pairs of parallel data, and 160 and tree binarization introduced in this paper; we million sentences of monolingual German data. achieve additional gains over the submission sys- We base our systems on that of Williams et tem through particle verb restructuring. al. (2014). It is a string-to-tree GHKM transla- tion system implemented in Moses (Koehn et al., 5.2 SMT Results 2007), and using the dependency annotation by Table 2 shows translation quality (BLEU) with dif- ParZu (Sennrich et al., 2013). Additionally, our ferent representations of German compounds and baseline system contains a dependency language particle verbs. Head binarization not only yields model (RDLM) (Sennrich, 2015), trained on the improvements over the baseline, but also allows target-side of the parallel training data. for larger gains from morphological segmenta- We report case-sensitive BLEU scores on the tion. We attribute this to the fact that full com- newstest2014/5 test sets from WMT, averaged pounds, and prefixed particle verbs, are not al- over 3 optimization runs of k-batch MIRA (Cherry ways a constituent in the segmented representa- and Foster, 2012) on a subset of newstest2008-12.6 tion, and that binarization compensates this the- We split all particle verbs and hyphenated com- oretical drawback. pounds, but other compounds are only split if they With head binarization, we find substantial im- are rare (frequency in parallel text < 5). provements from compound splitting of 0.7–1.1 For comparison with the state-of-the-art, we BLEU. On newstest2014, the improvement is train a full system on our restructured representa- almost twice of that reported in related work tion, which incorporates all models and settings of (Williams et al., 2014), which also uses a hier- our WMT 2015 submission system (Williams et archical representation of compounds, albeit one al., 2015).7 Note that our WMT 2015 submission that does not allow for dependency modelling. 
5We use the last position in the clause as default location, Examples of correct, unseen compounds gener- but put the particle before any subordinated and coordinated ated include Staubsauger|roboter ’vacuum cleaner clauses, which occur in the Nachfeld (the ‘final field’ in topo- logical field theory). robot’, Gravitation|s|wellen ’gravitational waves’, 6We use mteval-v13a.pl for comparability to official and NPD|-|verbot|s|verfahren ’NPD banning pro- WMT results; all significance values reported are obtained cess’.8 with MultEval (Clark et al., 2011). 7In contrast to our other systems in this paper, RDLM is (Vaswani et al., 2013), and soft source-syntactic constraints trained on all monolingual data for the full system, and two (Huck et al., 2014). models are added: a 5-gram Neural Network language model 8Note that Staubsauger, despite being a compound, is not

2084

Page 28 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

Particle verb restructuring yields additional Ney 5-gram LM and RDLM perform poorly due to gains of 0.1–0.4 BLEU. One reason for the smaller data sparseness, with 70% and 57.5% accuracy, re- effect of particle verb restructuring is that the diffi- spectively. In the split representation, the RDLM cult cases – separated particle verbs and those with reliably prefers the correct agreement (96.5% ac- infixation – are rarer than compounds, with 2841 curacy), whilst the performance of the 5-gram rare compounds [that would be split by our com- model even deteriorates (to 60% accuracy). This pound splitter] in the reference texts, in contrast is because the gender of the first segment(s) is ir- to 553 separated particle verbs, and 176 particle relevant, or even misleading, for agreement. For verbs with infixation, as Table 3 illustrates. If we instance, Handgepäck is neuter, which could lead only evaluate the sentences containing a particle a morpheme-level n-gram model to prefer the de- verb with zu-infix in the reference, 165 in total terminer ein, but Handgepäckgebühr is feminine for newstest2014/5, we observe an improvement and requires eine. of 0.8 BLEU on this subset (22.1 22.9), signifi- → cant with p < 0.05. 6 Conclusion The positive effect of restructuring is also ap- Our main contribution is that we exploit the hi- parent in frequency statistics. Table 3 shows that erarchical structure of morphemes to model them the baseline system severely undergenerates com- jointly with syntax in a dependency-based string- pounds and separated/infixed particle verbs. Bi- to-tree SMT model. We describe the dependency narization, compound splitting, and particle verb annotation of two morphologically complex word restructuring all contribute to bringing the distri- classes in German, compounds and particle verbs, bution of compounds and particle verbs closer to and show that our tree representation yields im- the reference. provements in translation quality of 1.4–1.8 BLEU In total, the restructured representation yields in the WMT English–German translation task.9 improvements of 1.4–1.8 BLEU over our base- The principle of jointly representing syntactic line. The full system is competitive with official and morphological structure in dependency trees submissions to the WMT 2015 shared translation can be applied to other language pairs, and we ex- tasks. It outperforms our submission (Williams pect this to be helpful for languages with a high et al., 2015) by 0.4 BLEU, and outperforms other degree of morphological synthesis. However, the phrase-based and syntax-based submissions by 0.8 annotation needs to be adapted to the respective BLEU or more. The best reported result accord- languages. For example, French compounds such ing to BLEU is an ensemble of Neural MT systems as arc-en-ciel ’rainbow’ are head-initial, in con- (Jean et al., 2015), which achieves 24.9 BLEU. In trast to head-final Germanic compounds. the human evaluation, both our submission and the Neural MT system were ranked 1–2 (out of 16), Acknowledgments with no significant difference between them. This project received funding from the Euro- pean Union’s Horizon 2020 research and innova- 5.3 Synthetic LM Experiment tion programme under grant agreements 645452 We perform a synthetic experiment to test our (QT21), 644402 (HimL), 644333 (TraMOOC), claim that a dependency representation allows for and from the Swiss National Science Foundation the modelling of agreement between morphemes. under grant P2ZHP1_148717. 
For 200 rare compounds [that would be split by our compound splitter] in the newstest2014/5 ref- erences, we artificially introduce agreement errors References by changing the gender of the determiner. For in- Colin Cherry and George Foster. 2012. Batch Tun- stance, we create the erroneous sentence sie er- ing Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North heben ein Handgepäckgebühr as a complement to American Chapter of the Association for Compu- Example 1. We measure the ability of language tational Linguistics: Human Language Technolo- models to prefer (give a higher probability to) gies, NAACL HLT ’12, pages 427–436, Montreal, the original reference sentence over the erroneous Canada. Association for Computational Linguistics. one. In the original representation, both a Kneser- 9We released source code and configuration files at https://github.com/rsennrich/ segmented due to its frequency. wmt2014-scripts.


C Edinburgh’s Syntax-Based Systems at WMT 2015

Edinburgh’s Syntax-Based Systems at WMT 2015

Philip Williams1, Rico Sennrich1, Maria Nadejde1, Matthias Huck1, Philipp Koehn1,2
1School of Informatics, University of Edinburgh
2Center for Speech and Language Processing, The Johns Hopkins University

Abstract

This paper describes the syntax-based systems built at the University of Edinburgh for the WMT 2015 shared translation task. We developed systems for all language pairs except French-English. This year we focused on: translation out of English using tree-to-string models; continuing to improve our English-German system; and source-side morphological segmentation of Finnish using Morfessor.

1 Introduction

This year's WMT shared translation task featured five language pairs: English paired with Czech, Finnish, French, German, and Russian. We built syntax-based systems in both translation directions for all language pairs except English-French.

For English→German, we continued to develop our string-to-tree system, which has proven highly competitive in previous years. Additions this year included the use of a dependency language model, an alternative tuning metric, and soft source-syntactic constraints.

For translation from English into Czech, Finnish, and Russian, we built STSG-based tree-to-string systems. Support for this type of model is a recent addition to the Moses toolkit. In previous years, our systems have all used string-to-tree models and have only translated into English and German.

For Finnish→English, we experimented with unsupervised morphological segmentation using Morfessor 2.0 (Virpioja et al., 2013).

For the remaining systems (Czech→English, German→English, and Russian→English), our systems were essentially the same as last year's (Williams et al., 2014) except for the addition of this year's training data.

2 System Overview

2.1 Pre-processing

The training data was pre-processed using scripts from the Moses toolkit. We first normalized the data using the normalize-punctuation.perl script, then performed tokenization, parsing, and truecasing. To parse the English data, we used the Berkeley parser (Petrov et al., 2006; Petrov and Klein, 2007). To parse the German data, we used the ParZu dependency parser (Sennrich et al., 2013).

2.2 Word Alignment

For word alignment we used either MGIZA++ (Gao and Vogel, 2008), a multi-threaded implementation of GIZA++ (Och and Ney, 2003), or fast_align (Dyer et al., 2013). In preliminary experiments, we found that the tree-to-string systems were particularly sensitive to the choice of word aligner, echoing a previous observation by Neubig and Duh (2014). See the individual tree-to-string system descriptions in Section 3.

2.3 Language Model

We used all available monolingual data to train one interpolated 5-gram language model for each system. Using either lmplz (Heafield et al., 2013) or the SRILM toolkit (Stolcke, 2002), we first trained an individual language model for each of the supplied monolingual training corpora. These models all used modified Kneser-Ney smoothing (Chen and Goodman, 1998). We then interpolated the individual models using SRILM, providing the target side of the system's tuning set (Section 2.7) for perplexity-based weight optimization.

2.4 String-to-Tree Model

For English→German and the systems that translate into English, we used a string-to-tree model.
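The perplexity-based weight optimization in Section 2.3 is typically done with SRILM's compute-best-mix tool. As a minimal illustration of what that optimization computes (not the authors' actual setup), the following sketch runs the underlying EM procedure on hypothetical per-model token probabilities; the input array and numbers are invented for the example.

import numpy as np

def interpolation_weights(token_probs, iterations=50):
    # token_probs[m, t]: probability that component model m assigns to
    # token t of the tuning set. EM finds mixture weights that locally
    # minimize the perplexity of the interpolated model.
    num_models, num_tokens = token_probs.shape
    weights = np.full(num_models, 1.0 / num_models)
    for _ in range(iterations):
        # E-step: posterior responsibility of each model for each token.
        joint = weights[:, None] * token_probs
        posterior = joint / joint.sum(axis=0, keepdims=True)
        # M-step: each weight becomes the model's average responsibility.
        weights = posterior.sum(axis=1) / num_tokens
    return weights

# Hypothetical example: two 5-gram models scoring a five-token tuning set.
probs = np.array([[0.10, 0.02, 0.30, 0.05, 0.20],
                  [0.01, 0.08, 0.10, 0.20, 0.02]])
weights = interpolation_weights(probs)
mixture = weights @ probs
print(weights, np.exp(-np.log(mixture).mean()))  # weights, perplexity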



2.4.1 Grammar

The string-to-tree translation model is based on a synchronous context-free grammar (SCFG) with linguistically-motivated labels on the target side. SCFG rules were extracted from the word-aligned parallel data using the Moses implementation (Williams and Koehn, 2012) of the GHKM algorithm (Galley et al., 2004; Galley et al., 2006).

Minimal GHKM rules were composed into larger rules subject to restrictions on the size of the resulting tree fragment. We used the settings shown in Table 1, which were chosen empirically during the development of 2013's systems (Nadejde et al., 2013).

Parameter    Unbinarized  Binarized
Rule depth   5            7
Node count   20           30
Rule size    5            7

Table 1: Parameter settings for rule composition. The parameters were relaxed for systems that used binarization to allow for the increase in tree node density.

Further to the restrictions on rule composition, fully non-lexical unary rules were eliminated using the method described in Chung et al. (2011), and rules with scope greater than 3 (Hopkins and Langmead, 2010) were pruned from the translation grammar. Scope pruning makes parsing tractable without the need for grammar binarization.

2.4.2 Feature Functions

Our core set of string-to-tree feature functions is unchanged from previous years. It includes the n-gram language model's log probability for the target string, the target word count, the rule count, and various pre-computed rule-specific scores. For a grammar rule r of the form

C → ⟨α, β, ∼⟩

where C is a target-side non-terminal label, α is a string of source terminals and non-terminals, β is a string of target terminals and non-terminals, and ∼ is a one-to-one correspondence between source and target non-terminals, we score the rule according to (logarithms of) the following functions:

• p(C, β | α, ∼) and p(α | C, β, ∼), the direct and indirect translation probabilities.
• p_lex(β | α) and p_lex(α | β), the direct and indirect lexical weights (Koehn et al., 2003).
• p_pcfg(π), the monolingual PCFG probability of the tree fragment π from which the rule was extracted.
• exp(−1/count(r)), a rule rareness penalty.

2.5 Tree-to-String Model

For English→Czech, English→Finnish, and English→Russian, we used a tree-to-string model.

2.5.1 Grammar

In the tree-to-string model, the translation grammar is a synchronous tree-substitution grammar (Eisner, 2003) with parse tree fragments on the source side and strings of terminals and non-terminals on the target side.

As with the string-to-tree models, the grammar was extracted from the word-aligned parallel data using the Moses implementation of the GHKM algorithm. Minimal GHKM rules were composed into larger rules subject to the same size restrictions (Table 1). Unlike string-to-tree rule extraction, fully non-lexical unary rules were included in the grammar and scope pruning was not used.

2.5.2 Feature Functions

The tree-to-string feature functions are similar to those of the string-to-tree model. For a grammar rule r of the form

⟨π, β, ∼⟩

where π is a source-side tree fragment, β is a string of target terminals and non-terminals, and ∼ is a one-to-one correspondence between source and target non-terminals, we score the rule according to (logarithms of) the following functions:

• p(β | π, ∼) and p(π | β, ∼), the direct and indirect translation probabilities.
• p_lex(β | π) and p_lex(π | β), the direct and indirect lexical weights (Koehn et al., 2003).
• exp(−1/count(r)), a rule rareness penalty.

2.6 Decoding

Decoding for the string-to-tree models is based on Sennrich's (2014) recursive variant of the CYK+ parsing algorithm combined with LM integration via cube pruning (Chiang, 2007). Decoding for the tree-to-string models is based on the rule matching algorithm by Zhang et al. (2009) combined with LM integration via cube pruning.
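Two of the quantities above are easy to make concrete. The scope of a rule's source side (Hopkins and Langmead, 2010) counts the positions where parsing is not lexically anchored: non-terminals at the rule edges plus adjacent non-terminal pairs; rules with scope greater than 3 are pruned. The sketch below uses an illustrative token encoding (None marks a non-terminal gap), not the Moses data structures, and also shows the exp(−1/count(r)) rareness penalty.

import math

def scope(source_side):
    # source_side: list of tokens; terminals are strings and each
    # non-terminal is represented as None (illustrative encoding).
    is_nt = [token is None for token in source_side]
    edge_nts = int(is_nt[0]) + int(is_nt[-1])
    adjacent_pairs = sum(1 for a, b in zip(is_nt, is_nt[1:]) if a and b)
    return edge_nts + adjacent_pairs

def rareness_penalty(rule_count):
    # Approaches 1 for frequent rules; penalizes singletons (exp(-1)).
    return math.exp(-1.0 / rule_count)

print(scope([None, "Haus", None]))       # X1 Haus X2 -> scope 2, kept
print(scope([None, None, "und", None]))  # X1 X2 und X3 -> scope 3, kept
print(rareness_penalty(1), rareness_penalty(100))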



2.7 Tuning

The feature weights were tuned using the Moses implementation of MERT (Och, 2003) for all systems except English-to-German, for which we used k-best MIRA (Cherry and Foster, 2012) due to the use of sparse features.

For the tree-to-string systems, we used all of the previous years' test sets as tuning data (except newstest2014, which was used as the development test set). For the string-to-tree systems, we used subsets of the test data to speed up decoding.

3 Individual Systems

In this section we describe the individual systems and present experimental results. In many cases, the only difference from the generic setup of the previous section is that we perform right binarization of the training and test parse trees (sketched after Section 3.2 below).

We also built hierarchical phrase-based systems (Chiang, 2007), which we refer to in tables as 'Hiero'. These systems were built using the Moses toolkit, with standard settings. They were not used in the submission and are included for comparison only.

For each system, we present results both for the development test set (newstest2014 in most cases) and for the test set (newstest2015), for which reference translations were provided after the system submission deadline. We refer to these as 'devtest' and 'test', respectively.

3.1 English to Czech

For English→Czech we built a tree-to-string system. We used fast_align for word alignment due to the large training data size and on the strength of its performance for English→Finnish and English→Russian. We used all test sets from 2008 to 2013 as tuning data. Table 2 gives the mean BLEU scores, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.

system                devtest  test
Hiero                 20.2     16.8
Tree-to-string        19.0     15.7
+ right binarization  19.5     16.1

Table 2: English to Czech translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.2 English to Finnish

In preliminary English→Finnish experiments, we compared the use of MGIZA++ and fast_align. Since there was only one test set provided, in these initial experiments we split newsdev2015 into two halves, using the first half for tuning and the second half for testing. Table 3 gives the mean BLEU scores, averaged over three MERT runs.

                      MGIZA++  fast_align
Hiero                 11.7     11.6
Tree-to-string        11.5     12.3
+ right binarization  11.9     12.8

Table 3: Comparison of word alignment tools for English to Finnish. BLEU on a subset of newsdev2015.

For our final system, we used fast_align for word alignment and we used the full newsdev2015 set as tuning data. Table 4 gives the mean BLEU scores for this setup. Our submitted system was the right binarized system that, out of the three MERT runs, scored highest on devtest.

system                dev   test
Hiero                 11.4  11.5
Tree-to-string        11.9  11.8
+ right binarization  12.2  12.3

Table 4: Final English to Finnish translation results (BLEU) on dev (newsdev2015) and test (newstest2015) sets.
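Right binarization, the recurring '+ right binarization' variant in the tables above and below, converts n-ary tree nodes to a right-branching binary structure before rule extraction. A minimal sketch over a toy (label, children) tree encoding; the tuple format and the '^'-prefixed virtual labels are illustrative choices, not the convention of the actual pipeline.

def right_binarize(tree):
    # tree: (label, children) where children is either a word (leaf)
    # or a list of subtrees. Nodes with more than two children are
    # folded into a right-branching chain of virtual '^label' nodes.
    label, children = tree
    if isinstance(children, str):
        return tree
    children = [right_binarize(child) for child in children]
    while len(children) > 2:
        children = children[:-2] + [("^" + label, children[-2:])]
    return (label, children)

flat_np = ("NP", [("DT", "the"), ("JJ", "big"), ("NN", "house")])
print(right_binarize(flat_np))
# ('NP', [('DT', 'the'), ('^NP', [('JJ', 'big'), ('NN', 'house')])])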
3.3 English to German

We experiment with the following additions to last year's submission system: a relational dependency language model (RDLM) (Sennrich, 2015); tuning on the syntactic metric HWCM (Liu and Gildea, 2005; Sennrich, 2015); soft source-syntactic constraints (Huck et al., 2014); a large-scale n-gram neural network language model (NPLM) (Vaswani et al., 2013); treebank binarization (Sennrich and Haddow, 2015); and particle verb restructuring (Sennrich and Haddow, 2015). We do not include syntactic constraints in this year's baseline. Our string-to-tree baseline uses a dependency representation of compounds, as described in Sennrich and Haddow (2015).



RDLM is a relational dependency language model which predicts the dependency relations and words in the translation hypotheses based on the dependency relations and words of the ancestor and sibling nodes in the dependency tree. Our model contains several extensions over the original paper (Sennrich, 2015). Like the original paper, we use an ancestor context size of 2, but we increase the sibling context size from 1 to 3 and allow bidirectional context, using the 3 closest siblings to both the left and right of the current node. The original model predicts a virtual stop node as the last child of each tree, which models the probability that a node has no more children. This is mirrored by a virtual start node in the bidirectional model.

We binarize the treebanks before rule extraction. We note that treebank binarization allows the extraction of rules that overgeneralize, e.g. allowing structures with zero, or multiple, preterminals per node, effectively allowing verb clauses without a verb and similar. We use head binarization (Sennrich and Haddow, 2015), which ensures that each constituent contains exactly one head. During decoding, the generated target trees are unbinarized to allow scoring with RDLM. Table 5 shows that both right binarization and head binarization overgeneralize, exemplified by the fact that they allow finite clauses to have multiple subjects (compound subjects are represented as a single node). The RDLM reduces this problem, and the bidirectional RDLM slightly outperforms the unidirectional variant, both in terms of BLEU and the number of overgeneralizations.

system              BLEU  2+ SUBJ
original trees      20.1  0
+ RDLM              21.0  0
+ RDLM (bidir.)     21.2  0
right binarization  20.4  272
head binarization   20.5  152
+ RDLM              21.3  43
+ RDLM (bidir.)     21.5  32

Table 5: English to German translation results (on newstest2013) with different binarizations and language models. 2+ SUBJ: number of finite clauses with more than one subject.

For the soft source-syntactic constraints, we annotate the source text with the Stanford neural network dependency parser (Chen and Manning, 2014), along with heuristic projectivization (Nivre and Nilsson, 2005).

The NPLM is a 5-gram feed-forward neural language model, and for both RDLM and NPLM we use a single hidden layer of size 750, a 150-dimensional input embedding layer with a vocabulary size of 500000, noise-contrastive estimation with 100 noise samples, and 2 iterations over the monolingual training set. Estimating LM probabilities for OOV words is a well-known problem, and we avoid it by filtering the translation model according to the vocabulary of the neural models.

The impact of all experimental components is shown in Table 6. Each system in Tables 5 and 6 was tuned separately with MIRA. For our submission system, we increased the Moses parameters cube-pruning-pop-limit from 1000 to 4000 and rule-limit from 100 to 400, but this had little effect on devtest, and gave even slightly lower BLEU on test. Particle verb restructuring, which was done after the submission deadline, increases BLEU on test. In total, we observe substantial improvements over our baseline, which roughly corresponds to last year's submission systems: 2.2 BLEU on devtest, and 3.0 BLEU on test.

system                          devtest  test
Hiero                           19.2     21.0
String-to-tree baseline         19.8     21.4
+ HWCM+BLEU/2 tuning            20.1     21.6
+ head binarization             20.5     22.3
+ RDLM (bidirectional)          21.5     23.3
+ source-syntactic constraints  21.6     23.8
+ 5-gram NPLM                   22.0     24.1
+ less pruning (submission)     22.0     24.0
+ particle verb restructuring   22.0     24.4

Table 6: English to German translation results (BLEU) on devtest (newstest2013) and test (newstest2015) sets.
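The 2+ SUBJ column of Table 5 is a simple tree-walking diagnostic: count finite clauses in the system output whose children include more than one subject. A sketch under assumed node and edge labels ('S' and 'SUBJ' are stand-ins here; the actual German annotation scheme differs), reusing the toy (label, children) encoding from the earlier binarization sketch:

def count_multi_subject_clauses(tree, clause_label="S", subj_label="SUBJ"):
    # Returns the number of clause nodes with two or more subject children.
    label, children = tree
    if isinstance(children, str):  # leaf: (tag, word)
        return 0
    subjects = sum(1 for child in children if child[0] == subj_label)
    hit = 1 if label == clause_label and subjects >= 2 else 0
    return hit + sum(count_multi_subject_clauses(child) for child in children)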
3.4 English to Russian

For English→Russian we built a tree-to-string system. During preliminary experiments we found that fast_align gave consistent gains over MGIZA++ (albeit smaller than for Finnish→English, at around 0.3 BLEU). In the final experiments we used fast_align for word alignment and we used the 2012 and 2013 test sets as tuning data. Table 7 gives the mean BLEU scores, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.



system                devtest  test
Hiero                 29.8     23.8
Tree-to-string        27.5     22.1
+ right binarization  28.3     23.0

Table 7: English to Russian translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.5 Czech to English

For Czech→English we built a string-to-tree system. We used all test sets from 2008 to 2013 as tuning data. Table 8 gives the mean BLEU scores, which are averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.

system                devtest  test
Hiero                 28.5     24.9
String-to-tree        27.8     24.4
+ right binarization  27.8     24.5

Table 8: Czech to English translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.6 Finnish to English

In preliminary Finnish→English experiments, we tried using Morfessor to segment Finnish words into morphemes. We used Morfessor 2.0 (with default settings) to learn an unsupervised segmentation model from all of the available Finnish data, which was then used to segment all words in the source-side training and test data. We compared systems with and without segmentation, as well as a system combination of the two, an approach that has been shown to improve translation quality for this language pair (de Gispert et al., 2009).

As with English→Finnish, we split newsdev2015 into two halves, using the first half for tuning and the second half for testing. Table 9 shows the results: the column headed 'word' gives BLEU scores for the unsegmented systems; the column headed 'morph' gives scores for systems trained on segmented data; and the column headed 'syscomb' gives results for a system combination using MEMT (Heafield and Lavie, 2010).

                      word  morph  syscomb
Hiero                 17.8  19.1   19.2
String-to-tree        17.6  18.5   18.7
+ right binarization  17.8  18.9   18.9

Table 9: Finnish to English experiments with morphological segmentation.

For our final system, we used morphological segmentation but not system combination. We used the full newsdev2015 set as tuning data. Table 10 gives mean BLEU scores for this setup, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three, scored highest on newsdev2015.

system                dev   test
Hiero                 18.6  17.5
String-to-tree        18.3  17.2
+ right binarization  18.5  17.7

Table 10: Finnish to English translation results (BLEU) on dev (newsdev2015) and test (newstest2015) sets.
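Morfessor 2.0 ships with a Python API as well as the command-line tools; the following sketch shows how the segmentation step of Section 3.6 could look with it, based on the toolkit's documented usage. Treat the exact calls as an assumption (the report only states that the command-line defaults were used), and the file path is hypothetical.

import morfessor

io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("finnish_all.txt"))  # hypothetical path

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()  # unsupervised training with default settings

def segment(word):
    # viterbi_segment returns (constructions, cost); keep the morphs only.
    constructions, _cost = model.viterbi_segment(word)
    return " ".join(constructions)

print(segment("puolustusyhteistyösopimuksesta"))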
3.7 German to English

For German→English we built a tree-to-string system with a setup similar to last year's (Williams et al., 2014). Our submitted system was right binarized with the following extraction parameters: rule depth = 7, node count = 100, rule size = 7. At decoding time we used the following non-default parameter value: max-chart-span = 25. This limits sub-derivations to a maximum span of 25 source words. For the Hiero baseline system we used max-chart-span = 15. For tuning we used a random subset of 2000 sentences drawn from the full tuning set.

We performed some preliminary experiments with neural bilingual language models, our re-implementation of the "joint" model of Devlin et al. (2014). The bilingual language models are trained with the NPLM toolkit (Vaswani et al., 2013). We used 250-dimensional input embedding and hidden layers, and input and output vocabulary sizes of 500000 and 250000 respectively. One bilingual language model was a 5-gram model with an additional context of 9 source words: the affiliated source word and a window of 4 words on either side. A second model was a 1-gram model with an additional context of 13 source words. The language models were trained on the available parallel corpora.

We also added a 7-gram class-based language model, with 50 word classes trained using mkcls (Och, 1999).



The class-based language model was trained on all available monolingual corpora, filtering out singletons.

Table 11 shows the results. As the preliminary results were not encouraging, we did not include the bilingual LMs and class LMs in our submitted system.

system                   devtest  test
Hiero                    27.7     28.0
String-to-tree           28.7     28.7
+ bilingual LMs          28.6     28.7
+ bilingual & class LMs  28.3     28.7

Table 11: German to English translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

3.8 Russian to English

For Russian→English we built a string-to-tree system, using the 2012 and 2013 test sets as tuning data. Table 12 gives the mean BLEU scores, averaged over three MERT runs. Our submitted system was the right binarized system that, out of the three runs, scored highest on devtest.

system                devtest  test
Hiero                 31.2     27.1
String-to-tree        30.5     25.9
+ right binarization  30.6     26.2

Table 12: Russian to English translation results (BLEU) on devtest (newstest2014) and test (newstest2015) sets.

4 Manual Error Analysis

Our syntax-based systems for the German–English language pairs have greatly improved over the last years and have outperformed traditional phrase-based statistical machine translation systems. Translating between German and English is a challenge for those systems, since extensive long-distance reordering and long-distance agreement constraints do not fit that approach. Are our syntax-based systems tackling these problems better? And what are the main remaining problems?

For both German–English and English–German, we analyzed 100 sentences, carrying out an error analysis using linguistic error categories that roughly match other efforts in this area (Vilar et al., 2006; Toral et al., 2013; Herrmann et al., 2014; Lommel et al., 2014; Aranberri, 2015). We used the following error annotation protocol:

1. A bilingual speaker corrects the machine translation output with the minimal edits necessary to render an acceptable translation. This is done in view of the human reference translation, but typically a much more literal translation is obtained.

2. Each edit is noted in a list in the form "old string → new string", where either the old or the new string may also be empty or discontinuous.

3. In a second pass, all edits are classified with error categories.

Such an error analysis is subjective. There are many ways to correct errors (step 1), many ways to split corrections into units (step 2), and many ways to classify the errors (step 3). Moreover, analyzing only 100 sentences does not lead to statistically strong findings. With this in mind, the following analysis is broadly indicative of the main error types in our syntax-based systems.

Occasionally, parts of a machine translation are so muddled that no sequence of edits could be established. This happened in 8 German–English sentences and 7 English–German sentences.

4.1 German–English

16 sentences have no error, while 18 sentences have only one error. These are of course typically the shorter ones. The longest sentence without error is:

Source: Der Oppositionspolitiker Imran Khan wirft Premier Sharif vor, bei der Parlamentswahl im Mai vergangenen Jahres betrogen zu haben.

MT: The opposition politician Imran Khan accuses Premier Sharif of having cheated in the parliamentary election in May of last year.

This is not a trivial sentence, since it requires the translation of the complex subclause construction accuses ... of having cheated, which is rendered quite differently in German as wirft ... vor ... betrogen zu haben.

An overview of the major error categories is shown in Table 13. On average, 2.85 errors per sentence were identified. This gives us guidance on the major problems we should be working on in the future.
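In terms of bookkeeping, the protocol above yields one list of edits per sentence, each edit carrying the old string, the new string, and the category assigned in the second pass; tables such as Table 13 below are then plain category counts. A minimal sketch (the data structures and the two example sentences are our own illustration):

from collections import Counter

# One list of (old, new, category) edits per annotated sentence; old or
# new may be empty (insertion/deletion) or discontinuous in practice.
annotations = [
    [("", "the", "Missing function word - determiner"),
     ("said", "accused", "Wrong content word - verb")],
    [("house", "parliament", "Wrong content word - noun")],
]

category_counts = Counter(category for sentence in annotations
                          for (_old, _new, category) in sentence)
errors_per_sentence = sum(len(s) for s in annotations) / len(annotations)

for category, count in category_counts.most_common():
    print(count, category)
print("errors per sentence:", errors_per_sentence)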



Count  Category
29     Wrong content word - noun
25     Wrong content word - verb
22     Wrong function word - preposition
21     Inflection - verb
14     Reordering: verb
13     Reordering: adjunct
12     Missing function word - preposition
10     Missing content word - verb
9      Wrong function word - other
9      Wrong content word - wrong POS
9      Added punctuation
8      Muddle
8      Missing function word - connective
8      Added function word - preposition
7      Missing punctuation
7      Wrong content word - adverb
6      Wrong content word - phrasal verb
6      Added function word - determiner
5      Unknown word - noun
5      Missing content word - adverb
5      Missing content word - noun
5      Inflection - noun
4      Reordering: NP
3      Missing content word - adjective
3      Inflection - wrong POS
3      Casing
2      Unknown word - verb
2      Reordering: punctuation
2      Reordering: noun
2      Reordering: adverb
2      Missing function word - determiner
2      Inflection - adverb

Table 13: Main error types in the German–English system (count in 100 sentences).

Lexical choice. The biggest group of error types concerns the translation of basic concepts. On average, such errors occur 0.76 times per sentence. Given the vast number of content words that need to be translated, the actual performance on the task of lexical translation is quite high, but the problem is by no means solved.

Count  Category
29     Wrong content word - noun
25     Wrong content word - verb
9      Wrong content word - wrong POS
7      Wrong content word - adverb
6      Wrong content word - phrasal verb

Prepositions. We were surprised by the large number of errors revolving around prepositions. Prepositions are frequent, but not as frequent as content words, so the performance on the preposition translation task is not as good. Prepositions mostly mark relationships of adjuncts, which involve quite complex considerations: the adjunct, the modified verb or noun phrase, identifying the relationship between them in the source sentence, and the fuzzy meaning of prepositions.

Count  Category
22     Wrong function word - preposition
12     Missing function word - preposition
8      Added function word - preposition

Reordering. We were also surprised by the low number of reordering errors. The different word order between German and English has hampered translation quality for this language pair historically. While we cannot declare complete success, our syntax-based systems constitute great progress in this area.

Count  Category
14     Reordering: verb
13     Reordering: adjunct
4      Reordering: NP
2      Reordering: noun
2      Reordering: adverb

Other issues with verbs. Reordering errors involving verbs top the list in the previous group of error types, but there are also other problems with verbs: their inflection and the unacceptable frequency with which they are dropped. The latter has its roots in faulty word alignments, which are based on IBM models that often fail to align the out-of-English-order German verb, thus enabling the translation model to drop it, which the language model often prefers. Inflection is here to be understood broadly, including the need for the right function words to form a grammatically correct verb complex (e.g., will have been resolved).

Count  Category
21     Inflection - verb
10     Missing content word - verb

Overall, the main thrust of future research should be focused on lexical choice, selecting correct prepositions, and producing the correct verb.



Count  Category
41     Wrong content word - verb
37     Wrong content word - noun
33     Reordering - verb
30     Inflection - verb
22     Missing function word - preposition
17     Inflection - np
14     Wrong function word - preposition
12     Wrong content word - phrasal verb
12     Wrong content word - wrong POS
12     Wrong function word - clausal connective
11     Reordering - pp
11     Inflection - noun
10     Wrong function word - pronoun
10     Missing function word - pronoun
10     Missing function word - determiner
9      Reordering - noun
9      Compound merging
8      Added function word - preposition
7      Punctuation - inserted
7      Muddle
7      Missing function word - clausal connective
7      Added function word - determiner
5      Punctuation - missing
5      Missing content word - verb
4      Reordering - adverb
4      Wrong content word - adverb
3      Missing content word - adjective
2      Reordering - pronoun
2      Wrong content word - name
2      Missing content word - adverb
2      Wrong content word - adjective
2      Added function word - pronoun

Table 14: Main error types in the English–German system (count in 100 sentences).

4.2 English–German

12 sentences had no error, and 13 sentences only one error. This is fewer than for German–English, which supports the general contention that translating into German is harder. On average, a total of 3.8 errors per sentence were marked, one error per sentence more than for German–English. An overview of the major error categories is shown in Table 14.

The longest sentence with no error is:

Source: Congressmen Keith Ellison and John Lewis have proposed legislation to protect union organizing as a civil right.

Target: Die Kongressabgeordneten Keith Ellison und John Lewis haben Gesetze zum Schutz der gewerkschaftlichen Organisation als Bürgerrecht vorgeschlagen.

In terms of word order, this is not a complicated sentence (besides the verb movement proposed → vorgeschlagen), but it does involve switching of part-of-speech for two content words: protect → Schutz (verb → noun), union → gewerkschaftlichen (noun → adjective).

Lexical choice. As with German–English, this is the biggest group of error types, with 1.08 errors per sentence. Verb sense errors tend to be more subtle, such that a media outlet does not sagt (say) but berichtet (report) a news item. For nouns, there were several stark errors, such as the mis-translation of patient as Geduld (patience) in a medical context. In general, there is no reason to believe that models that draw more strongly on a wider context could not resolve many of these cases.

Count  Category
41     Wrong content word - verb
37     Wrong content word - noun
12     Wrong content word - phrasal verb
12     Wrong content word - wrong POS
4      Wrong content word - adverb
2      Wrong content word - adjective

Role and order of adjuncts and arguments. While the overall sentence structure is mostly correct, there are often problems with the handling of adjunct and argument phrases. Their role is identified in German by a preposition or by the case of a noun phrase (the main cause of inflection errors). Their position in the sentence is less strict, but mistakes can be and are made.

Count  Category
22     Missing function word - preposition
17     Inflection - np
14     Wrong function word - preposition
11     Reordering - pp
11     Inflection - noun
8      Added function word - preposition

Verbs. Reordering errors of verbs mainly occur in complex subclause constructions. German verbs are more strongly inflected for number and person, and often a few function words are needed in just the right order and placement for a correct verb complex.



Count  Category
33     Reordering - verb
30     Inflection - verb
5      Missing content word - verb

Pronouns. Due to the grammatical gender of nouns in German, translating it and they is a complex undertaking. German verbs also require reflexive pronouns more frequently.

Count  Category
10     Wrong function word - pronoun
10     Missing function word - pronoun
2      Added function word - pronoun

Clausal connectives. A specific problem of English–German translation is clausal connectives. In English, the relationship of the subclause is often not explicitly marked (e.g., Police say the rider ...), while German requires a function word.

Count  Category
12     Wrong function word - clausal connective
7      Missing function word - clausal connective

Overall, while there are more structural problems than for German–English, often the remaining challenge is the disambiguation of lexical choices and the correct labelling of syntactic relationships.

5 Conclusion

This year we submitted syntax-based systems for all language pairs except English-French. Our English→German system included significant improvements over last year's, and we intend to continue developing this system. We presented the first results using Moses' STSG-based tree-to-string model.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21) and 644402 (HimL), and from the Swiss National Science Foundation under grant P2ZHP1_148717.

References

Nora Aranberri. 2015. SMT error analysis and mapping to syntactic, semantic and structural fixes. In Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 30–38, Denver, Colorado, USA, June. Association for Computational Linguistics.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report, Harvard University.

Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427–436, Montréal, Canada, June.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Tagyoung Chung, Licheng Fang, and Daniel Gildea. 2011. Issues concerning decoding with synchronous context-free grammar. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 413–417, Portland, Oregon, USA, June.

Adrià de Gispert, Sami Virpioja, Mikko Kurimo, and William Byrne. 2009. Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 73–76, Boulder, Colorado, June.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, MD, USA, June.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, GA, USA, June.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In The Companion Volume to the Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 205–208, Sapporo, Japan, July. Association for Computational Linguistics.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a Translation Rule? In HLT-NAACL '04.



Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 961–968, Morristown, NJ, USA.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP '08, pages 49–57, Stroudsburg, PA, USA.

Kenneth Heafield and Alon Lavie. 2010. Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics, 93:27–36.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria, August.

Teresa Herrmann, Jan Niehues, and Alex Waibel. 2014. Manual analysis of structurally informed reordering in German-English machine translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). European Language Resources Association (ELRA).

Mark Hopkins and Greg Langmead. 2010. SCFG decoding without binarization. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 646–655, Cambridge, MA, October.

Matthias Huck, Hieu Hoang, and Philipp Koehn. 2014. Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 148–156, Doha, Qatar.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Morristown, NJ, USA.

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 25–32, Ann Arbor, Michigan.

Arle Lommel, Aljoscha Burchardt, Maja Popović, Kim Harris, Eleftherios Avramidis, and Hans Uszkoreit. 2014. Using a new analytic measure for the annotation and analysis of MT errors on real data. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, pages 165–172.

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's Syntax-Based Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 170–176, Sofia, Bulgaria, August.

Graham Neubig and Kevin Duh. 2014. On the elements of an accurate tree-to-string machine translation system. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 143–149, Baltimore, Maryland, June.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 99–106, Ann Arbor, Michigan.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March.

Franz Josef Och. 1999. An Efficient Method for Determining Bilingual Word Classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 71–76.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160–167, Morristown, NJ, USA.

Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York, April.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 433–440.

Rico Sennrich and Barry Haddow. 2015. A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal. Association for Computational Linguistics.



Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, pages 601–609, Hissar, Bulgaria.

Rico Sennrich. 2014. A CYK+ variant for SCFG decoding without a dot chart. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 94–102, Doha, Qatar, October.

Rico Sennrich. 2015. Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Andreas Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 3, Denver, CO, USA, September.

Antonio Toral, Sudip Kumar Naskar, Joris Vreeke, Federico Gaspari, and Declan Groves. 2013. A web application for the diagnostic evaluation of machine translation over specific linguistic phenomena. In Proceedings of the 2013 NAACL HLT Demonstration Session, pages 20–23, Atlanta, Georgia, June. Association for Computational Linguistics.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, WA, USA.

David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. 2006. Error analysis of statistical machine translation output. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 06), pages 697–702.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University, Helsinki.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 388–394, Montréal, Canada, June.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler, and Philipp Koehn. 2014. Edinburgh's Syntax-Based Systems at WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 207–214, Baltimore, Maryland, USA, June.

Hui Zhang, Min Zhang, Haizhou Li, and Chew Lim Tan. 2009. Fast translation rule matching for syntax-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1037–1045, Singapore, August.



D Edinburgh’s Statistical Machine Translation Systems for WMT16

Edinburgh's Statistical Machine Translation Systems for WMT16

Philip Williams1, Rico Sennrich1, Maria Nadejde1, Matthias Huck2, Barry Haddow1, Ondřej Bojar3
1School of Informatics, University of Edinburgh
2Center for Information and Language Processing, LMU Munich
3Institute of Formal and Applied Linguistics, Charles University in Prague
[email protected] [email protected] [email protected]
[email protected] [email protected] [email protected]

Abstract

This paper describes the University of Edinburgh's phrase-based and syntax-based submissions to the shared translation tasks of the ACL 2016 First Conference on Machine Translation (WMT16). We submitted five phrase-based and five syntax-based systems for the news task, plus one phrase-based system for the biomedical task.

1 Introduction

Edinburgh's submissions to the WMT 2016 news translation task fall into two distinct groups: neural translation systems and statistical translation systems. In this paper, we describe the statistical systems, which include a mix of phrase-based and syntax-based approaches. We also include a brief description of our phrase-based submission to the WMT16 biomedical translation task. Our neural systems are described separately in Sennrich et al. (2016a).

In most cases, our statistical systems build on last year's, incorporating recent modelling refinements and adding this year's new training data. For Romanian, a new language this year, we paid particular attention to language-specific processing of diacritics. For English→Czech, we experimented with a string-to-tree system, first using Treex (formerly TectoMT; Popel and Žabokrtský, 2010), available at http://ufal.mff.cuni.cz/treex, to produce Czech dependency parses, then converting them to constituency representation and extracting GHKM rules.

In the next two sections, we describe the phrase-based systems, first describing the core setup in Section 2 and then describing system-specific extensions and experimental results for each individual language pair in Section 3. We describe the core syntax-based setup and experiments in Sections 4 and 5.

2 Phrase-based System Overview

2.1 Preprocessing

The training data was preprocessed using scripts from the Moses toolkit. We first normalized the data using the normalize-punctuation.perl script, then performed tokenization (using the -a option), and then truecasing. We did not perform any corpus filtering other than the standard Moses method, which removes sentence pairs with extreme length ratios and sentences longer than 80 tokens.

2.2 Word Alignment

For word alignment we used fast_align (Dyer et al., 2013), except for German↔English, where we used MGIZA++ (Gao and Vogel, 2008), followed in both cases by the standard grow-diag-final-and symmetrization heuristic.

2.3 Language Models

Our default approach to language modelling was to train individual models on each monolingual corpus (except CommonCrawl) and then linearly interpolate them to produce a single model. For some systems, we added separate neural or CommonCrawl LMs. Here we outline the various approaches, and in Section 3 we describe the combination used for each language pair.

Interpolated LMs. For individual monolingual corpora, we first used lmplz (Heafield et al., 2013) to train count-based 5-gram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998). We then used the SRILM toolkit (Stolcke, 2002) to linearly interpolate the models using weights tuned to minimize perplexity on the development set.



CommonCrawl LMs. Our CommonCrawl language models were trained in the same way as the individual corpus-specific standard models, but were not linearly interpolated with the other LMs. Instead, the log probabilities of the CommonCrawl LMs were added as separate features of the systems' linear models.

Neural LMs. For some of our phrase-based systems we experimented with feed-forward neural network language models, trained both on target n-grams only and on "joint" or "bilingual" n-grams (Devlin et al., 2014; Le et al., 2012). For training these models we used the NPLM toolkit (Vaswani et al., 2013), for which we have now implemented gradient clipping to address numerical issues often encountered during training.

2.4 Baseline Features

We follow the standard approach to SMT of scoring translation hypotheses using a weighted linear combination of features. The core features of our model are a 5-gram LM score (i.e. log probability), phrase translation and lexical translation scores, word and phrase penalties, and a linear distortion score. The phrase translation probabilities are smoothed with Good-Turing smoothing (Foster et al., 2006). We used the hierarchical lexicalized reordering model (Galley and Manning, 2008) with 4 possible orientations (monotone, swap, discontinuous left and discontinuous right) in both left-to-right and right-to-left direction. We also used the operation sequence model (OSM) (Durrani et al., 2013) with 4 count-based supportive features. We further employed domain indicator features (marking which training corpus each phrase pair was found in), binary phrase count indicator features, sparse phrase length features, and sparse source word deletion, target word insertion, and word translation features (limited to the top K words in each language, typically with K = 50).

2.5 Tuning

Since our feature set (generally around 500 to 1000 features) was too large for MERT, we used k-best batch MIRA for tuning (Cherry and Foster, 2012). To speed up tuning we applied threshold pruning to the phrase table, based on the direct translation model probability.

2.6 Decoding

In decoding we applied cube pruning (Huang and Chiang, 2007) with a stack size of 5000 (reduced to 1000 for tuning), Minimum Bayes Risk decoding (Kumar and Byrne, 2004), a maximum phrase length of 5, a distortion limit of 6, 100-best translation options, and the no-reordering-over-punctuation heuristic (Koehn and Haddow, 2009).

3 Phrase-based Experiments

3.1 Finnish→English

Similar to last year (Haddow et al., 2015), we built an unconstrained system for Finnish→English using data extracted from OPUS (Tiedemann, 2012). Our parallel training set was the same as we used previously, but the language model training set was extended with the addition of the news2015 monolingual corpus and the large WMT16 English CommonCrawl corpus. We used newsdev2015 for tuning and newstest2015 for testing during system development.

One clear problem that we noted with our submission from last year was the large number of OOVs, which were then copied directly into the English output. This is undoubtedly due to the agglutinative nature of Finnish, and was probably the cause of our system being judged poorly by human evaluators, despite having a high BLEU score. To address this, we split the Finnish input into subword units at both training and test time. In particular, we applied byte pair encoding (BPE) to split the Finnish source into smaller units, greatly reducing the vocabulary size. BPE is a technique which has recently been used to good effect in neural machine translation (Sennrich et al., 2016b), where the models cannot handle large vocabularies. It is in fact a merging algorithm, originally designed for compression, which works by starting with a maximally split version of the training corpus (i.e. split into characters) and iteratively merging common clusters. The merging continues for a specified number of iterations, and the merges are collected to form the BPE model. At test time, the recorded merges are applied to the test corpus, with the result that there are no OOVs in the test data. For the experiments here, we used 100,000 BPE merges to create the model.
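The BPE procedure just described is small enough to sketch in full: represent each word as a character sequence, repeatedly merge the most frequent adjacent symbol pair, and record the merges in order. This follows the Sennrich et al. (2016b) recipe in spirit; the word-frequency input format and the end-of-word marker are illustrative choices, not the exact tooling used for the submission.

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # Start from a maximally split vocabulary: one symbol per character,
    # plus an end-of-word marker so merges cannot cross word boundaries.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # everything is fully merged
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():  # apply the merge everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))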



are quite understandable, e.g. into two parts randomly (so as to balance the “born source yös Intian on sanottu olevan kiinnostunut English” and “born Romanian” portions), using puolustusyhteistyösopimuksesta Japanin one for tuning and one for testing. For building the kanssa. final system, and for the contrastive experiments, base India is also said to be interested in puolus- we used the whole of newsdev2016 for tuning, and tusyhteistyösopimuksesta with Japan. newstest2016 for testing. bpe India is also said to be interested in defence In early experiments we noted that both the cooperation agreement with Japan. training and the development data were inconsis- reference India is also reportedly hoping for a tent in their use of diacritics leading to problems deal on defence collaboration between the with OOVs and sparse statistics. To address this two nations. we stripped off all diacritics from the Romanian However applying BPE to Finnish can also re- texts and the result was a significant increase in sult in some rather odd translations when it over- performance in our development setup. We also zealously splits: experimented with different language model com- source Balotelli oli vielä kaukana huippu- binations during development, with our submit- vireestään. ted system using three different language model base Balotelli was still far from huippuvireestään. features: a neural LM trained on just news2015 bpe Baloo, Hotel was still far from the peak of its monolingual, an n-gram language model trained vitality. on the WMT16 English CommonCrawl corpus, reference Balotelli is still far from his top tune. and a linear interpolation of language models We built four language models: an interpolated trained on all other WMT16 English corpora. count-based 5-gram language model with all cor- In Table 1 we show how system performance pora, apart from the WMT16 CommonCrawl; sep- varies under different language model combina- arate count-based language models with WMT16 tion and preprocessing conditions. CommonCrawl and news2015; and a neural LM on news2015. A performance comparison across 3.3 English Romanian different language model combinations, and with → and without BPE is shown in Table 1. For English Romanian, we used all the data → system BLEU in the constrained track, including the Com- fi-en ro-en monCrawl language model data, and as with the Romanian English system, we used news- only interpolated LM 22.9 34.2 → + CommonCrawl LM 23.2 35.0 dev2016 for the final tuning run. + CC LM & news2015 (count) 23.4 34.9 The inconsistent use of diacritics in Romana- + CC LM & news2015 (neural) 23.4 nian text also affected the English Romanian 35.2 → + all 23.4 35.0 system, however removing altogether would be without BPE 22.2 – problematic as we would then need a method for without diacritic removal – 32.2 restoring them for the final system. So the only extra preprocessing we performed on the Roma- Table 1: Comparison of different language model nian was to ensure that “t-comma” and “s-comma” combinations and preprocessing regimes for were written correctly, with a comma rather than a Finnish English and for Romanian English. → → cedilla. The submitted system is shown in bold. The pre- Our final system used two different count- processing variant uses the same language model based 5-gram language models (one trained on all combination as the submitted system. Cased data, including the WMT16 Romanian Common- BLEU scores are on newstest2016. 
Our final system used two different count-based 5-gram language models (one trained on all data, including the WMT16 Romanian CommonCrawl corpus, without pruning, and one trained on news2015 monolingual only), a neural language model trained on news2015 monolingual, and a bilingual language model trained on the parallel data, with a source window of 15 and a target window of 1. In Table 2 we show ablation experiments where we remove each of these language models.
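As background on the interpolated models used throughout this section: linear interpolation collapses several component language models into a single feature by mixing their probabilities with weights that sum to one. A minimal sketch, with toy callables standing in for the real KenLM/SRILM component models:

    def interpolated_prob(word, history, models, weights):
        # P(w | h) = sum_i weight_i * P_i(w | h), with the weights summing to 1.
        assert abs(sum(weights) - 1.0) < 1e-9
        return sum(wt * m(word, history) for m, wt in zip(models, weights))

    # Toy usage with two hypothetical component models:
    news_lm = lambda w, h: 0.02
    crawl_lm = lambda w, h: 0.005
    print(interpolated_prob("the", ("of",), [news_lm, crawl_lm], [0.7, 0.3]))
    # -> 0.0155

In practice the mixture weights are chosen to minimise perplexity on held-out data rather than set by hand.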


    system          BLEU
    submitted       26.8
    + prune all     26.2
    - all           25.6
    - news2015      26.4
    - neural LM     26.6
    - bilingual LM  26.5

Table 2: Effect of each of the language models used in the English→Romanian system. The experiments are not cumulative: we first try pruning the "all" language model, then go back to the unpruned version and remove each LM in turn, observing the effect. The submitted system used all four LMs, and the scores shown are uncased BLEU scores on newstest2016.

3.4 English→German

For the English→German phrase-based system, we exploited several translation factors in addition to word surface forms, in particular: Och clusters (with 50 classes) and part-of-speech tags (Ratnaparkhi, 1996) on the English side, as well as Och clusters (50 classes), morphological tags, and part-of-speech tags on the German side (Schmid, 2000). Recent experiments for our IWSLT 2015 phrase-based system have reconfirmed that English→German translation quality can benefit from these factors when supplementary models over factored representations are used (Huck and Birch, 2015). For WMT16, we utilized the factors in the translation model, in operation sequence models, and in language models (for linearly interpolated 7-gram LMs over Och clusters and morphological tags).

Sparse source word deletion, target word insertion, and word translation features were integrated over the top 200 word surface forms and over selected factors (source and target Och clusters, source part-of-speech tags and target morphological tags). An unpruned 5-gram LM over words that was trained on all German data except the CommonCrawl monolingual corpus was supplemented by a separate pruned LM trained on the CommonCrawl data that had been provided as permissible data for the "constrained" track. Rather than applying a simple linear distortion score, we opted for sparse distortion features as described by Green et al. (2010), which we reimplemented in Moses. We activated sparse distortion features with a feature template based on jump distance, source part-of-speech tags, and target morphological tags.
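A sketch of what such a feature template might look like (the feature names and the clipping threshold are our own illustration, not the actual Moses implementation): each reordering jump fires indicator features that conjoin a clipped jump distance with the source POS tag and the target morphological tag.

    def sparse_distortion_features(jump, src_pos, tgt_morph, max_dist=6):
        # `jump` is the signed distance between consecutively translated
        # source phrases; distances are clipped to limit feature sparsity.
        direction = "L" if jump < 0 else "R"
        dist = min(abs(jump), max_dist)
        yield f"dist_{direction}{dist}"
        yield f"dist_{direction}{dist}_srcpos={src_pos}"
        yield f"dist_{direction}{dist}_tgtmorph={tgt_morph}"

    print(list(sparse_distortion_features(-4, "VVFIN", "V.FIN.3.SG")))
    # ['dist_L4', 'dist_L4_srcpos=VVFIN', 'dist_L4_tgtmorph=V.FIN.3.SG']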
The feature weights for our final system were tuned with hypergraph MIRA (i.e. batch MIRA over lattices representing the decoding search space) on a concatenation of newssyscomb2009 and newstest2008–2012.

3.5 German→English

For phrase-based translation from German, we applied syntactic pre-reordering (Collins et al., 2005) and compound splitting (Koehn and Knight, 2003) in a preprocessing step on the source side. The operation sequence model for the German→English phrase-based system was unpruned. We integrated three language models: an unpruned LM over all English data except the CommonCrawl monolingual corpus; a pruned LM over CommonCrawl; and a pruned LM over the monolingual News Crawl 2015 corpus. In addition to lexical smoothing with the standard lexicon models, we utilized a source-to-target IBM Model 1 (Brown et al., 1993) for sentence-level lexical scoring, in a similar manner as described by Huck et al. (2011) for hierarchical systems. We tuned on the concatenation of newssyscomb2009 and newstest2008–2012.

Unlike last year's system (Haddow et al., 2015), and different from the inverse translation direction (English→German), we refrained from using any factors and instead set up a system that operates over surface form word representations only. In relation to last year's system, we were able to maintain high translation quality as measured in BLEU despite the abandonment of factors, although we suspect that human judgment scores may suffer somewhat from the abandonment of a factored model. We decided to drop the factored representations in favour of gains in decoding efficiency.

We furthermore did not employ any sparse features (sparse phrase length, source word deletion, target word insertion, or word translation features) in the German→English system, since we did not observe any clear gains in preliminary experiments, and sparse features slow down tuning and decoding.

English→German and German→English translation results with our phrase-based systems are given in Table 3.


    system                      de-en                      en-de
                                2013  2014  2015  2016     2013  2014  2015  2016
    last year's phrase-based    27.2  28.8  29.3  33.8     20.8  21.1  22.8  28.3
    this year's phrase-based    27.8  30.0  29.9  35.1     21.5  21.9  23.7  28.4

Table 3: Experimental results with phrase-based systems for German→English and English→German. We report case-sensitive BLEU scores on each of the newstest2013–2016 test sets.

3.6 Spanish→English Biomedical

For our submission to the Spanish→English biomedical task, we created a parallel corpus using all relevant data from WMT13, as well as the extra biomedical data provided by the task organisers, and the EMEA corpus from OPUS (Tiedemann, 2012). In total we had around 16M sentences of parallel data. Our monolingual corpus was made up of three parts: all the English monolingual medical data from WMT14 medical, WMT16 biomedical and EMEA (11M sentences); all the English LDC GigaWord data (180M sentences); and all the English general domain data from WMT16 (240M sentences). We used the monolingual data to build three different language models, which were then linearly interpolated. System tuning was with the SCIELO development data provided for the biomedical task.

4 Syntax-based System Overview

For all syntax-based systems, we used a string-to-tree model based on a synchronous context-free grammar (SCFG) with linguistically-motivated labels on the target side.

4.1 Preprocessing

Except for English-Czech, which we describe separately in Section 5.1, preprocessing was similar to the phrase-based systems (Section 2.3). To parse the target side of the training data, we used the Berkeley parser (Petrov et al., 2006; Petrov and Klein, 2007) for English, and the ParZu dependency parser (Sennrich et al., 2013) for German. Except where stated otherwise, we right-binarized the trees after parsing to increase rule coverage.

4.2 Word Alignment

As in the phrase-based models, we used fast_align for word alignment and the grow-diag-final-and heuristic for symmetrization.

4.3 Language Models

As in the phrase-based systems (Section 2.3), we used linearly-interpolated language models as standard, with some systems adding CommonCrawl and neural LMs. We detail the system-specific combinations in Section 5.

4.4 Rule Extraction

SCFG rules were extracted from the word-aligned parallel data using the Moses implementation (Williams and Koehn, 2012) of the GHKM algorithm (Galley et al., 2004, 2006).

Minimal GHKM rules were composed into larger rules subject to restrictions on the size of the resulting tree fragment. We used the settings shown in Table 4, which were chosen empirically during the development of 2013's systems (Nadejde et al., 2013).

    parameter    unbinarized   binarized
    rule depth   5             7
    node count   20            30
    rule size    5             7

Table 4: Parameter settings for rule composition. The parameters were relaxed for systems that used binarization, to allow for the increase in tree node density.

Further to the restrictions on rule composition, fully non-lexical unary rules were eliminated using the method described in Chung et al. (2011), and rules with scope greater than 3 (Hopkins and Langmead, 2010) were pruned from the translation grammar. Scope pruning makes parsing tractable without the need for grammar binarization.

4.5 Baseline Features

Our core set of string-to-tree feature functions is unchanged from previous years. It includes the n-gram language model's log probability for the target string, the target word count, the rule count, and several pre-computed rule-specific scores. The rule-specific scores were: the direct and indirect translation probabilities; the direct and indirect lexical weights (Koehn et al., 2003); the monolingual PCFG probability of the tree fragment from which the rule was extracted; and a rule rareness penalty.
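To make the scope-based pruning of Section 4.4 concrete, the sketch below computes the scope of a rule's right-hand side under the Hopkins and Langmead (2010) definition: each boundary non-terminal and each adjacent pair of non-terminals contributes one point. The bracket convention for marking non-terminals is our own.

    def rule_scope(rhs):
        # rhs is a list of symbols; non-terminals are bracketed, e.g. "[NP]".
        is_nt = [sym.startswith("[") for sym in rhs]
        scope = int(is_nt[0]) + int(is_nt[-1])                    # boundary NTs
        scope += sum(a and b for a, b in zip(is_nt, is_nt[1:]))   # adjacent NTs
        return scope

    print(rule_scope(["[NP]", "of", "[NP]"]))   # 2 -> kept
    print(rule_scope(["[X]", "[X]", "[X]"]))    # 4 -> pruned (scope > 3)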


4.6 Decoding

Decoding for the string-to-tree models is based on Sennrich's (2014) recursive variant of the CYK+ parsing algorithm, combined with LM integration via cube pruning (Chiang, 2007).

4.7 Tuning

The feature weights for the English→Czech and Finnish→English systems were tuned using the Moses implementation of MERT (Och, 2003). For the remaining systems we used k-best MIRA (Cherry and Foster, 2012) due to the use of sparse features.

We used randomly-chosen subsets of the previous years' test data to speed up decoding.

5 Syntax-based Experiments

5.1 English→Czech

For English→Czech, we used Treex to preprocess and parse the Czech side of the training data. Treex uses the MST parser (McDonald et al., 2005), which produces dependency graphs with non-projective arcs. In order to extract SCFG rules, we first applied the following conversion process: i) the dependency graphs were projectivized using the Malt Parser, which implements the method described in Nivre and Nilsson (2005) (we used the 'Head' encoding scheme); ii) the projective dependency graphs were converted to CFG trees. In addition, we reduced the complex positional tags to simple POS tags by discarding the morphological attributes. The CFG trees were not binarized.

We also experimented with unification-based agreement and case government constraints (Williams and Koehn, 2011; Williams, 2014). Specifically, our constraints were designed to enforce: i) case, gender, and number agreement between nouns and pre-nominal adjectival modifiers; ii) number and person agreement between subjects and verbs; iii) case agreement between prepositions and nouns; iv) use of nominative case for subject nouns. For every Czech word in the training data, we obtained a set of morphological analyses using MorphoDiTa (Straková et al., 2014). From these analyses, we constructed a lexicon of feature structures. For constraint extraction, we used handwritten rules along the lines of those described in Williams (2014).

In preliminary experiments we used a smaller training set, comprising 2 million sentence pairs sampled from OPUS, and monolingual data from last year's WMT translation task. We used two test sets from the HimL project and the Khresmoi test set. Results with and without constraints are shown in Table 5. We used hard constraints and reused the baseline weights (re-tuning did not appear to give additional gains).

    system          HimL1   HimL2   Khresmoi
    baseline        23.3    18.6    20.4
    + constraints   23.6    18.8    20.7

Table 5: Translation results on the development system for English→Czech with unification-based constraints. Cased BLEU scores are shown, averaged over three tuning runs (note that baseline weights are reused in the experiments with constraints).

Although the gains in BLEU were small, previous analysis for German showed that BLEU lacks sensitivity to grammatical improvements when compared to human evaluators (Williams, 2014).

We trained the final system on all of the provided training and monolingual data. In addition to the interpolated LM, we used a model trained on the CommonCrawl data. Results are shown in Table 6.

    system          2015   2016
    baseline        17.3   20.1
    + constraints   17.5   20.2
    + CC LM         17.9   20.9

Table 6: Translation results on the final system for English→Czech with unification-based constraints. Cased BLEU scores are shown. Note that baseline weights are reused in the experiments with constraints.
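The agreement checking described above can be viewed as unification of flat feature structures: two analyses are compatible only if none of their morphological features conflict. A minimal sketch of that core operation (the actual system builds its feature structures from MorphoDiTa analyses and handwritten constraint rules):

    def unify(fs1, fs2):
        # Unify two flat feature structures (dicts from feature to value);
        # return the merged structure, or None on any conflicting value.
        merged = dict(fs1)
        for feature, value in fs2.items():
            if merged.setdefault(feature, value) != value:
                return None
        return merged

    adjective = {"case": "nom", "gender": "fem"}
    noun = {"case": "nom", "gender": "fem", "number": "sg"}
    print(unify(adjective, noun))             # agreement holds: merged structure
    print(unify(adjective, {"case": "acc"}))  # case clash -> None (rejected)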


5.1.1 Manual Analysis

We carried out a small manual analysis of the submitted system with and without unification-based constraints (the CC LM was used in both cases). In order to remove the effect of tuning variance, we used the same model weights in both cases (the weights were learned on the version without constraints). The BLEU scores of the two systems were 20.9 (with constraints) and 20.7 (without constraints). A large majority of the outputs (81% of the 2999 sentences in newstest2016) are identical.

Looking at a sample of 100 sentences with some differences, we classified the differing areas to see in which aspects the outputs of the two systems differ. In total, there were 104 such areas (some sentences had more than one area of interest).

Table 7 summarizes the overall evaluation of these areas (the annotation was not blind; we knew which system was which). The majority of the areas were of equal quality, in fact equally bad overall, so neither of the compared systems delivered an acceptable translation.

    Much Better   Better   Equal   Worse   Crazy Reordering
    4             41       44      12      3

Table 7: Manual evaluation of translations proposed by the English→Czech system with unification constraints vs. the same system without constraints.

In 4 cases, the system with constraints delivered a much better translation, and three of those were an overall improvement of the sentence structure. In 41 cases, the area was better for various reasons. Most frequently (16 cases), this was indeed the agreement within noun and prepositional phrases (e.g. an adjective matching the case required by the preposition). In 9 additional cases, the NP or PP was better translated, but in aspects other than morphological case, number, or gender. For instance, the baseline system translated the phrase "between the departments of individual hospitals" as "between the individual departments of the hospitals" (in morphologically well-formed Czech). Beyond better NPs and PPs, the constraints also helped overall sentence or clause structure (5 cases), lexical choice (4 cases), and verbs and their dependents (2 cases).

In 15 cases, the constraints forced the system to select a worse translation, damaging sentence structure or lexical choice, spuriously introducing negation, etc. We highlight 3 of these cases, where the system with constraints accidentally moved words far away from their correct location ("Crazy Reordering" in Table 7). This suggests that, due to sparse data, the application of constraints should be better balanced with respect to other parts of the model. In contrast to German, targeting Czech usually does not need long-distance reordering, and attempting it risks more serious translation errors than sticking to the English word order.

Since the hard unification constraints effectively only rule out some of the possible translations (i.e. reduce the search space), we conclude that having to obey mere agreement constraints helps to select a better hypothesis over a surprisingly large span of words, improving overall sentence structure on average.

5.2 English→German

This year's string-to-tree submission for English→German is similar to last year's system (Williams et al., 2015). In addition to the baseline feature functions, it contains a count-based 5-gram neural network language model (NPLM) (Vaswani et al., 2013), a relational dependency language model (RDLM) (Sennrich, 2015), and soft source-syntactic constraints (Huck et al., 2014). The parameters of the model are tuned towards the linear interpolation of BLEU and the syntactic metric HWCM (Liu and Gildea, 2005; Sennrich, 2015). Trees are transformed through binarization and a hierarchical representation of morphologically complex words (Sennrich and Haddow, 2015).

For the soft source-syntactic constraints, we annotate the source text with the Stanford neural network dependency parser (Chen and Manning, 2014), along with heuristic projectivization (Nivre and Nilsson, 2005).

Results are shown in Table 8. We report results of last year's system (Williams et al., 2015), which was ranked (joint) first at WMT15. Our improvements this year stem from particle verb restructuring (Sennrich and Haddow, 2015) and the use of the new monolingual News Crawl 2015 corpus for the Kneser-Ney language model. (The neural language models were trained on last year's training data.)

    system                          dev    test
    last year's system              24.0   29.3
    + particle verb restructuring   24.4   30.2
    + News 2015 training data       24.5   30.6

Table 8: Translation results of the English→German string-to-tree translation system on dev (newstest2015) and test (newstest2016).


5.3 Finnish→English

Our Finnish→English syntax-based system was similar to last year's (Williams et al., 2015). The main difference from the basic setup of Section 4 is that we preprocessed the Finnish data to segment words into morphemes. We also added a CommonCrawl language model in addition to the interpolated LM.

For segmentation, we used Morfessor 2.0 with default settings, first training a segmentation model, then using it to segment all words in the source-side training and test data. Morfessor takes a set of word types as input, and we found that it was important for translation quality to use a large training vocabulary. Table 9 gives mean BLEU scores for this setup, averaged over three MERT runs. Our baseline is the standard string-to-tree setup (i.e. without segmentation and without the CommonCrawl LM). For segmentation, we experimented with varying amounts of training data, initially using the Finnish side of the provided parallel corpora, then adding the monolingual Finnish data (apart from CommonCrawl), and finally adding 10% of the CommonCrawl vocabulary (we extracted the full vocabulary from CommonCrawl and then randomly sampled 10%). We found that using larger amounts of training data was prohibitively slow.

    system                       2015   2016
    baseline                     16.0   18.2
    + Morfessor (all parallel)   16.8   19.1
    + Morfessor (non-CC mono)    17.6   20.1
    + Morfessor (10% CC)         17.9   20.1
    + CC LM                      18.0   20.3

Table 9: Comparison of different preprocessing and language model regimes for Finnish→English (syntax-based). Cased BLEU scores are given for the newstest2015 and newstest2016 test sets, averaged over three tuning runs.
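A sketch of the vocabulary sampling behind the "10% CC" setting, sampling word types (not tokens) from the CommonCrawl vocabulary before Morfessor training; the file interface and seed are our own assumptions:

    import random

    def sample_word_types(vocab_path, fraction=0.10, seed=42):
        # Read one word type per line and keep a random fraction of them.
        with open(vocab_path, encoding="utf-8") as f:
            types = sorted({line.strip() for line in f if line.strip()})
        rng = random.Random(seed)
        return set(rng.sample(types, int(len(types) * fraction)))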
We report the cased BLEU → scores for different setups of our system in Ta- For German English we built a string-to-tree → ble 11. system with a similar setup to last year’s (Williams et al., 2015). In addition we used sparse fea- 6 Conclusion tures to determine the non-terminal labels for un- 2The neural language models were trained on last year’s The Edinburgh team built a total of 11 phrase- training data. based and syntax-based translation systems us-


    system                        dev    test
    baseline (phrase-structure)   33.9   32.9
    + UNK NT labels               34.2   33.0
    + CommonCrawl LM*             35.2   33.6
    contrastive (dependency)
    + UNK NT labels               33.7   32.3

Table 11: Translation results of the Romanian→English string-to-tree translation system on dev (half of newsdev2016) and test (newstest2016). *submitted system.

6 Conclusion

The Edinburgh team built a total of 11 phrase-based and syntax-based translation systems using the open-source Moses toolkit. Our Finnish→English and Romanian→English systems ranked first according to cased BLEU on the newstest2016 evaluation set (see http://matrix.statmt.org/?mode=all&test_set[id]=23).

Acknowledgments

We are grateful to Martin Popel for assistance with Treex. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC) and 644402 (HimL).

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19(2):263–311.

Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 740–750.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report, Harvard University.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada, pages 427–436.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics 33(2):201–228.

Tagyoung Chung, Licheng Fang, and Daniel Gildea. 2011. Issues Concerning Decoding with Synchronous Context-free Grammar. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA, pages 413–417.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause Restructuring for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Ann Arbor, MI, USA, pages 531–540.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, MD, USA, pages 1370–1380.

Nadir Durrani, Alexander Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT? In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, pages 399–405.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, GA, USA, pages 644–648.

Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS 2010. Saarbrücken, Germany.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. pages 363–370.


George Foster, Roland Kuhn, and Howard Johnson. 2006. Phrasetable Smoothing for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP). Sydney, Australia, pages 53–61.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Morristown, NJ, USA, pages 961–968.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a Translation Rule? In HLT-NAACL '04.

Michel Galley and Christopher D. Manning. 2008. A Simple and Effective Hierarchical Phrase Reordering Model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Honolulu, HI, USA, pages 848–856.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP '08). Stroudsburg, PA, USA, pages 49–57.

Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved Models of Distortion Cost for Statistical Machine Translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California, pages 867–875.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal, pages 126–133.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, pages 690–696.

Mark Hopkins and Greg Langmead. 2010. SCFG decoding without binarization. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA, pages 646–655.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague, Czech Republic, pages 144–151.

Matthias Huck and Alexandra Birch. 2015. The Edinburgh Machine Translation Systems for IWSLT 2015. In Proceedings of the International Workshop on Spoken Language Translation. Da Nang, Vietnam, pages 31–38.

Matthias Huck, Hieu Hoang, and Philipp Koehn. 2014. Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar, pages 148–156.

Matthias Huck, Saab Mansour, Simon Wiesler, and Hermann Ney. 2011. Lexicon Models for Hierarchical Phrase-Based Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation. San Francisco, CA, USA, pages 191–198.

Philipp Koehn and Barry Haddow. 2009. Edinburgh's Submission to all Tracks of the WMT 2009 Shared Task with Reordering and Speed Improvements to Moses. In Proceedings of the Fourth Workshop on Statistical Machine Translation. Athens, Greece, pages 160–164.

Philipp Koehn and Kevin Knight. 2003. Empirical Methods for Compound Splitting. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Budapest, Hungary, pages 187–194.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown, NJ, USA, pages 48–54.


Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In HLT-NAACL 2004: Main Proceedings. Boston, MA, USA, pages 169–176.

Hai Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics. Montréal, Canada, pages 39–48.

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan, pages 25–32.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's Syntax-Based Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation. Sofia, Bulgaria, pages 170–176.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Ann Arbor, Michigan, pages 99–106.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1 (ACL '03). Morristown, NJ, USA, pages 160–167.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL-44). pages 433–440.

Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Rochester, New York, pages 404–411.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010). Springer, Berlin/Heidelberg, volume 6233 of Lecture Notes in Computer Science, pages 293–304.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Philadelphia, PA, USA.

Helmut Schmid. 2000. LoPar: Design and Implementation. Bericht des Sonderforschungsbereiches "Sprachtheoretische Grundlagen für die Computerlinguistik" 149, Institute for Computational Linguistics, University of Stuttgart.

Rico Sennrich. 2014. A CYK+ Variant for SCFG Decoding Without a Dot Chart. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar, pages 94–102.

Rico Sennrich. 2015. Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics 3:169–182.

Rico Sennrich and Barry Haddow. 2015. A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 2081–2087.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16). Berlin, Germany.


Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013. Hissar, Bulgaria, pages 601–609.

Rico Sennrich, Philip Williams, and Matthias Huck. 2015. A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge. Computer Speech & Language 32(1):27–45.

Andreas Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). Denver, CO, USA, volume 3.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pages 13–18.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Seattle, Washington, USA, pages 1387–1392.

Philip Williams. 2014. Unification-based Constraints for Statistical Machine Translation. Ph.D. thesis, University of Edinburgh.

Philip Williams and Philipp Koehn. 2011. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland, pages 217–226.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation. Montréal, Canada, pages 388–394.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler, and Philipp Koehn. 2014. Edinburgh's Syntax-Based Systems at WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA, pages 207–214.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, and Philipp Koehn. 2015. Edinburgh's Syntax-Based Systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal, pages 199–209.


E Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation

Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation

Maria Nădejde, School of Informatics, University of Edinburgh, [email protected]
Alexandra Birch, School of Informatics, University of Edinburgh, [email protected]
Philipp Koehn, Department of Computer Science, Johns Hopkins University, [email protected]

Abstract

We address the problem of mistranslated predicate-argument structures in syntax-based machine translation. This paper explores whether knowledge about semantic affinities between the target predicates and their argument fillers is useful for translating ambiguous predicates and arguments. We propose a selectional preference feature based on the selectional association measure of Resnik (1996) and integrate it in a string-to-tree decoder. The feature models selectional preferences of verbs for their core and prepositional arguments, as well as selectional preferences of nouns for their prepositional arguments.

We compare our features with a variant of the neural relational dependency language model (RDLM) (Sennrich, 2015) and find that neither of the features improves automatic evaluation metrics. We conclude that mistranslated verbs, errors in the target syntactic trees produced by the decoder, and underspecified syntactic relations are negatively impacting these features.

1 Introduction

Syntax-based machine translation systems have had some success when applied to language pairs with major structural differences, such as German-English or Chinese-English. Modeling the target-side syntactic structure is important in order to produce grammatical, fluent translations, and could be an intermediate step on which to build a semantic representation of the target sentence. However, these systems still suffer from errors such as scrambled or mistranslated predicate-argument structures. We give a few examples of such errors in Table 1. In example a) the baseline system MT1 mistranslates the verb besichtigt as viewed. The system MT2, which uses information about the semantic affinity between the verb and its argument, produces the correct translation visited. The semantic affinity score, shown on the right, for the verb visited and argument trip in the syntactic relation prep_on indicates a stronger affinity than for the baseline translation. In example b) the baseline system MT1 mistranslates the noun Aufnahmen as recordings, while the system MT2 produces the correct translation images, which is a better fit for the prepositional modifier from the telescope.

Syntax-based MT systems handle long-distance reordering with synchronous translation rules such as:

    root → ⟨ RB∼0 VBZ∼1 sich nsubj∼2 prep∼3 ,  RB∼0 nsubj∼2 VBZ∼1 prep∼3 ⟩

This rule is useful for reordering the verb and its arguments according to the target-side word order. However, the rule does not contain the lexical head of the verb, the subject, or the prepositional modifier. Therefore the entire predicate-argument structure is translated by subsequent independent rules. The language model context will capture at most the verb and one main argument. Due to the lack of a larger source or target context, the resulting predicate-argument structures are often not semantically coherent.

This paper explores whether knowledge about semantic affinities between the target predicates and their argument fillers is useful for translating ambiguous predicates and arguments. We propose a selectional preference feature for string-to-tree statistical machine translation based on the information-theoretic measure of Resnik (1996).
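Purely as an illustration of how the co-indexed non-terminals drive reordering (decoding actually operates over a hypergraph of such rules, not flat lists), the target side of the rule above permutes the translated constituents as follows; the example constituents are hypothetical:

    def apply_rule_target_order(constituents, target_order=(0, 2, 1, 3)):
        # Source side: RB~0 VBZ~1 sich nsubj~2 prep~3
        # Target side: RB~0 nsubj~2 VBZ~1 prep~3
        # (the German reflexive "sich" has no target-side counterpart)
        return [constituents[i] for i in target_order]

    print(apply_rule_target_order(
        ["currently", "focuses", "the company", "on exports"]))
    # -> ['currently', 'the company', 'focuses', 'on exports']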

Published in the Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers, pages 32–42, Berlin, Germany, August 11-12, 2016. © 2016 Association for Computational Linguistics.


    a)  SRC  Bei nur einer Reise können nicht alle davon besichtigt werden.
        REF  You won't be able to visit all of them on one trip.
        MT1  Not all of them can be viewed on only one trip.    (prep_on, viewed, trip)    -0.154
        MT2  Not all of them can be visited on only one trip.   (prep_on, visited, trip)    1.042

    b)  SRC  Eine der schärfsten Aufnahmen des Hubble-Teleskops
        REF  One of the sharpest pictures from the Hubble telescope
        MT1  One of the strongest recordings of the Hubble telescope   (prep_of, recordings, telescope)   -0.0004
        MT2  One of the strongest images from the Hubble telescope     (prep_from, images, telescope)      0.3917

Table 1: Examples of errors in the predicate-argument structure produced by a syntax-based MT system: a) mistranslated verb, b) mistranslated noun. The (relation, predicate, argument) triples and their semantic affinity scores are shown on the right. Higher scores indicate a stronger affinity; negative scores indicate a lack of affinity.

The feature models selectional preferences of verbs for their core and prepositional arguments, as well as selectional preferences of nouns for their prepositional arguments.

Previous work has addressed the selectional preferences of prepositions for noun classes (Weller et al., 2014), but not the semantic affinities between a predicate and its argument class. Another line of research on improving translation of predicate-argument structures includes modeling reordering and deletion of semantic roles (Wu and Fung, 2009; Liu and Gildea, 2010; Li et al., 2013). These models, however, do not encode information about the lexical semantic affinities between target predicates and their arguments. Sennrich (2015) proposes a relational dependency language model (RDLM) for string-to-tree machine translation. One component of RDLM predicts the head word of a dependent conditioned on a wide syntactic context. Our feature is different, as it quantifies the amount of information that the predicate carries about the argument class filling a particular syntactic function.

For one variant of the proposed feature we found a slight improvement in automatic evaluation metrics when translating short sentences, as well as an increase in precision for verb translation. However, the features generally did not improve automatic evaluation metrics. We conclude that mistranslated verbs, errors in the target syntactic trees produced by the decoder, and underspecified syntactic relations are negatively impacting these features.

The paper is structured as follows. Section 2 describes related work on improving translation of predicate-argument structures. Section 3 introduces the selectional preference feature. Section 4 describes the experimental setup, and Section 5 presents the results of automatic evaluation as well as a qualitative analysis of the machine-translated output.

2 Related work

From a syntactic perspective, a correct predicate-argument structure will have the sub-categorization frame of the predicate filled in. Weller et al. (2013) use sub-categorization information to improve case prediction for noun phrases when translating into German. Case prediction for noun phrases is important in German, as case indicates the grammatical function. Their approach, however, did not produce strong improvements over the baseline. From a large corpus annotated with dependency relations, they extract verb-noun tuples and their associated syntactic functions: direct object, indirect object, subject. They also extract triples of verb-preposition-noun in order to predict the case of noun phrases within prepositional phrases. The probabilities of such tuples and triples are computed using relative frequencies and then used as a feature for a CRF classifier that predicts the case of noun phrases. Weller et al. (2013) apply the CRF classifier to the output of a word-to-stem phrase-based translation system as a post-processing step. In contrast, our model is used directly as a feature in the decoder. While Weller et al. (2013) identify the arguments of the verb and their grammatical function by projecting the information from the source sentence, we use the dependency tree produced by the string-to-tree decoder. We also consider prepositional modifiers of nouns.


Weller et al. (2014) propose using noun class information to model selectional preferences of prepositions in a string-to-tree translation system. They use the noun class information to annotate PP translation rules in order to restrict their applicability to specific semantic classes. In our work we do not impose hard constraints on the translation rules, but rather soft constraints, using our model as a feature in the decoder. While we use word embeddings to cluster arguments, Weller et al. (2014) experiment with a lexical semantic taxonomy and with clustering words based on co-occurrences within a window or on syntactic features extracted from dependency-parsed data.

Modeling reordering and deletion of semantic roles (Wu and Fung, 2009; Liu and Gildea, 2010; Li et al., 2013) has been another line of research on improving translation of predicate-argument structures. Liu and Gildea (2010) propose modeling reordering of a complete semantic frame, while Li et al. (2013) propose finer-grained features that distinguish between predicate-argument reordering and argument-argument reordering. Gao and Vogel (2011) and Bazrafshan and Gildea (2013) annotate target non-terminals with the semantic roles they cover in order to extract synchronous grammar rules that cover the entire predicate-argument structure. These models, however, do not encode information about the lexical semantic affinities between target predicates and their arguments.

In this work we focus on using selectional preferences over predicates and arguments in the target language, as this is a simple way of leveraging external knowledge in the translation framework.

3 Selectional Preference Feature

3.1 Learning Selectional Preferences

Selectional preferences describe the semantic affinities between predicates and their argument fillers. For example, the verb "drinks" has a strong preference for arguments in the conceptual class of "liquids". Therefore the word "wine" can be disambiguated when it appears in relation to the verb "drinks". A corpus-driven approach to modeling selectional preferences usually involves extracting triples of (syntactic relation, predicate, argument) and computing co-occurrence statistics. The predicate and argument are represented by their head words, and the triples are extracted from automatically parsed data. Another typical step is generalizing over seen arguments. Approaches to generalization include using an ontology such as WordNet (Resnik, 1996), distributional semantic similarity (Erk et al., 2010; Ó Séaghdha, 2010; Ritter et al., 2010), clustering (Sun and Korhonen, 2009), multi-modal datasets (Shutova et al., 2015), and neural networks (Cruys, 2014).

Our feature is based on the measure proposed by Resnik (1996). It uses unsupervised clusters to generalize over seen arguments. Resnik (1996) uses selectional preferences of predicates for word sense disambiguation. The information-theoretic measure for selectional preference proposed by Resnik quantifies the difference between the posterior distribution of an argument class given the verb and the prior distribution of the class. For instance, "person" has a higher prior probability than "insect" to appear in the subject relation, but, knowing the verb is "fly", the posterior probability becomes higher for "insect".

Resnik's model defines the selectional preference strength of a predicate as:

    SelPref(p, r) = KL( P(c | p, r) || P(c | r) ) = \sum_c P(c | p, r) \log \frac{P(c | p, r)}{P(c | r)}    (1)

where KL is the Kullback-Leibler divergence, r is the relation type, p is the predicate, and c is the conceptual class of the argument. Resnik uses WordNet to obtain the conceptual classes of arguments, therefore generalizing over seen arguments. The selectional association, or semantic affinity, between a predicate and an argument class is quantified as the relative contribution of the class towards the overall selectional strength of the predicate:

    SelAssoc(p, r, c) = \frac{ P(c | p, r) \log \frac{P(c | p, r)}{P(c | r)} }{ SelPref(p, r) }    (2)

We give examples of the selectional preference strength and selectional association scores for different verbs and their arguments in Table 2. The verb see takes many arguments as direct objects and therefore has a lower selectional preference strength for this syntactic relation. In contrast, the predicate hereditary takes fewer arguments, for which it has a stronger selectional preference.


    Verb            Relation   SelPref   Argument   SelAssoc
    see             dobj       0.56      PRN        0.123
                                         movie      0.022
                                         episode    0.001
    is–hereditary   nsubj      1.69      disease    0.267
                                         monarchy   0.148
                                         title      0.082
    drink           dobj       3.90      water      0.144
                                         wine       0.061
                                         glass      0.027

Table 2: Examples of selectional preference (SelPref) and selectional association (SelAssoc) scores for different verbs. PRN is the class of pronouns.
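A minimal sketch of Eqs. (1)-(2), estimated from counted (relation, predicate, argument-class) triples; the interface and names are our own, and the argument column is assumed to already hold classes (e.g. cluster ids):

    import math
    from collections import Counter, defaultdict

    def selpref_and_assoc(triples):
        # triples: iterable of (relation, predicate, arg_class) observations.
        class_given_rel = defaultdict(Counter)    # counts of c given r
        class_given_pred = defaultdict(Counter)   # counts of c given (p, r)
        for r, p, c in triples:
            class_given_rel[r][c] += 1
            class_given_pred[(p, r)][c] += 1

        selpref, selassoc = {}, {}
        for (p, r), counts in class_given_pred.items():
            n_pr = sum(counts.values())
            n_r = sum(class_given_rel[r].values())
            contrib = {}
            for c, n in counts.items():
                posterior = n / n_pr                    # P(c | p, r)
                prior = class_given_rel[r][c] / n_r     # P(c | r)
                contrib[c] = posterior * math.log(posterior / prior)
            strength = sum(contrib.values())            # Eq. (1): KL divergence
            selpref[(p, r)] = strength
            selassoc[(p, r)] = (                        # Eq. (2): per-class share
                {c: t / strength for c, t in contrib.items()} if strength else {})
        return selpref, selassoc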

Several selectional preference models have been used as features in discriminative syntactic parsing systems. Cohen et al. (2012) observe that when parsing out-of-domain data many attachment errors occur for the following syntactic configurations: head (V or N) – prep – obj and head (N) – adj. The authors propose a class-based measure of selectional preferences for these syntactic configurations and learn the argument classes using Latent Dirichlet Allocation (LDA). Kiperwasser and Goldberg (2015) compare different measures of lexical association between head word and modifier word for improving dependency parsing. Their results show that the association measure based on pointwise mutual information (PMI) has similar generalization capabilities as a measure of distributional similarity between word embeddings. van Noord (2007) has shown that bilexical association scores computed using PMI for all types of dependency relations are a useful feature for improving dependency parsing in Dutch.

3.2 Adaptation of Selectional Preference Models for Syntax-Based Machine Translation

We are interested in modeling selectional preferences of verbs for their core and prepositional arguments, as well as selectional preferences of nouns for their prepositional arguments. We identify the relation between a predicate and its modifier from the dependency tree produced by the string-to-tree machine translation system. Since we are interested in using the feature during decoding, we need the model to be fast to query and to have broad coverage.

Our selectional preference feature is a variant of the information-theoretic measure of Resnik (1996) defined in Eq 2. While Resnik uses the WordNet classes of the arguments, this is not appropriate for a machine translation task where the vocabulary has millions of words and English is not the only targeted language. Therefore we adapt Resnik's selectional association measure in two ways.

In the first model, SelAssoc_L, we compute the co-occurrence statistics defined in Eq 2 over lemmas of the predicate and argument head words.

In the second model, SelAssoc_C, we replace the WordNet classes in Eq 2 with word clusters (we have not done experiments with WordNet classes). We obtain the word clusters by applying the k-means algorithm to GloVe word embeddings (Pennington et al., 2014).

Prepositional phrase attachment remains a frequent and challenging error for syntactic parsers (Kummerfeld et al., 2012), and translation of prepositions is a challenge for SMT (Weller et al., 2014). Therefore we decided to use two separate features: one for main arguments (nsubj, nsubjpass, dobj, iobj) and one for prepositional arguments.

3.3 Comparison with a Neural Relational Dependency Language Model

Sennrich (2015) proposes a relational dependency language model (RDLM) for string-to-tree machine translation, which he trains using a feed-forward neural network. For a sentence S with symbols w_1, w_2, ..., w_n and dependency labels l_1, l_2, ..., l_n, with l_i the label of the incoming arc at position i, RDLM is defined as:


    P(S, D) ≈ \prod_{i=1}^{n} P_l(i) \cdot P_w(i)
    P_l(i) = P( l_i | h_s(i)_1^q, l_s(i)_1^q, h_a(i)_1^r, l_a(i)_1^r )
    P_w(i) = P( w_i | h_s(i)_1^q, l_s(i)_1^q, h_a(i)_1^r, l_a(i)_1^r, l_i )    (3)

where, for each of the q siblings and r ancestors of w_i, h_s and h_a are their head words and l_s and l_a their dependency labels. The P_w(i) distribution models similar information as our proposed feature SelAssoc. However, we use h_a(i)_1, l_i as context and consider only a subset of dependency labels: nsubj, nsubjpass, dobj, iobj, prep. The reduced context alleviates problems of data sparsity and is more reliably extracted at decoding time. The subset of dependency relations identifies arguments for which predicates might exhibit selectional preferences. Our feature is different from RDLM–P_w as it quantifies the amount of information that the predicate carries about the argument class filling a particular syntactic function. We hypothesize that such information is useful when translating arguments that appear less frequently in the training data but are prototypical for certain predicates. For example, the triples (bus, drive, dobj) and (van, drive, dobj) have the following log posterior probabilities and SelAssoc scores: log P(bus | drive, dobj) = -5.44, log P(van | drive, dobj) = -5.58, and SelAssoc(bus, drive, dobj) = 0.0079, SelAssoc(van, drive, dobj) = 0.0103.

[Figure 1: Example of a translation ("the Prime Minister of India and Japan met in Tokyo .") and its dependency tree in constituency representation, produced by the string-to-tree SMT system. The triples extracted during decoding are shown on the right: (nsubj, met, Minister), (prep_in, met, Tokyo), (prep_of, Minister, India).]

4 Experimental setup

Our baseline system for translating German into English is the Moses string-to-tree toolkit implementing GHKM rule extraction (Galley et al., 2004, 2006; Williams and Koehn, 2012). The string-to-tree translation model is based on a synchronous context-free grammar (SCFG) that is extracted from word-aligned parallel data with target-side syntactic annotation. The system was trained on all available data provided at WMT15 (Bojar et al., 2015; see http://www.statmt.org/wmt15/translation-task.html). The number of sentences in the training, tuning and test sets are shown in Table 3. We use the following rule extraction parameters: Rule Depth = 5, Node Count = 20, Rule Size = 5. At decoding time we give a high penalty to glue rules and allow non-terminals to span a maximum of 50 words. We train a 5-gram language model on all available monolingual data (the target side of the parallel corpus, the monolingual English News Crawl, Gigaword and news-commentary) using the SRILM toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing (Chen and Goodman, 1998) for training, and KenLM (Heafield, 2011) for language model scoring during decoding.

    Train       Tune   Test
    4,472,694   2000   8172

Table 3: Number of sentences in the training, tuning and test sets. The test set consists of the WMT newstest2013, 2014 and 2015.

The English side of the parallel corpus is annotated with dependency relations using the Stanford dependency parser (Chen and Manning, 2014). The dependency structure is then converted to a constituency representation, which is needed to run the GHKM rule extraction. We use the conversion algorithm and the head word extraction method described in Sennrich (2015).
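A sketch of triple extraction over such collapsed Stanford-style dependency analyses, using the Figure 1 example; the list-based parse interface is our own simplification:

    MAIN_ARGS = {"nsubj", "nsubjpass", "dobj", "iobj"}

    def extract_triples(tokens, heads, labels):
        # tokens[i] has head index heads[i] (None for the root or for tokens
        # without an incoming collapsed relation) and label labels[i].
        for i, rel in enumerate(labels):
            if rel is None or heads[i] is None:
                continue
            predicate = tokens[heads[i]]
            if rel in MAIN_ARGS or rel.startswith("prep_"):
                yield (rel, predicate, tokens[i])

    tokens = ["the", "Prime", "Minister", "of", "India", "and",
              "Japan", "met", "in", "Tokyo", "."]
    heads  = [2, 2, 7, None, 2, None, 4, None, None, 7, 7]
    labels = ["det", "nn", "nsubj", None, "prep_of", None,
              "conj_and", None, None, "prep_in", "punct"]
    print(list(extract_triples(tokens, heads, labels)))
    # [('nsubj', 'met', 'Minister'), ('prep_of', 'Minister', 'India'),
    #  ('prep_in', 'met', 'Tokyo')]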


For training the selectional preference features, we extract triples of (dependency relation, predicate, argument) from parsed data, where the predicate and argument are identified by their head words. We use the English side of the parallel data and the Gigaword v.5 corpus parsed with Stanford typed dependencies (Napoles et al., 2012). We use Stanford dependencies in the collapsed version, which resolves coordination and collapses the prepositions (coordination is not resolved at decoding time). Figure 1 shows an example of a translated sentence, its dependency tree produced by the string-to-tree system, and the triples extracted at decoding time. We consider the following main arguments: nsubj, nsubjpass, dobj, iobj, and prep arguments attached to both verbs and nouns. Table 4 shows the number of extracted triples.

    Type of relation   Number of triples
    main                     540,109,283
    prep                     810,118,653
    nsubj                    315,852,775
    nsubjpass                 32,111,962
    dobj                     188,412,178
    iobj                       3,732,368

Table 4: Number of relation triples extracted from parsed data. The data consists of the English side of the parallel data and Gigaword. Main arguments include: nsubj, nsubjpass, dobj, iobj.

We integrate the feature in a bottom-up chart decoder. The feature has several scores (a sketch of the aggregation follows this list):

• A counter for the dependency triples covered by the current hypothesis.
• A selectional association score aggregated over all main arguments: nsubj, nsubjpass, dobj, iobj.
• A selectional association score aggregated over all prepositional arguments, with no distinction between noun and verb modifiers.
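A minimal sketch of that aggregation; the names are illustrative, with `selassoc` being the table from Eq. (2) and `cluster_of` mapping argument lemmas to their k-means cluster:

    MAIN_ARGS = {"nsubj", "nsubjpass", "dobj", "iobj"}

    def hypothesis_scores(triples, selassoc, cluster_of):
        # Returns the three decoder scores for the triples covered so far.
        count, main_sum, prep_sum = 0, 0.0, 0.0
        for rel, pred, arg in triples:
            count += 1
            cls = cluster_of.get(arg, arg)   # lemma model: use the lemma itself
            score = selassoc.get((pred, rel), {}).get(cls, 0.0)
            if rel in MAIN_ARGS:
                main_sum += score
            elif rel.startswith("prep"):
                prep_sum += score
        return count, main_sum, prep_sum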

We integrate the feature in a bottom-up chart decoder. The feature has several scores:

• A counter for the dependency triples covered by the current hypothesis.

• A selectional association score aggregated over all main arguments: nsubj, nsubjpass, dobj, iobj.

• A selectional association score aggregated over all prepositional arguments, with no distinction between noun and verb modifiers.

For both tuning and evaluation of all machine translation systems we use a combination of the cased BLEU score and the head-word chain metric (HWCM) (Liu and Gildea, 2005). The HWCM metric implemented in the Moses toolkit computes the harmonic mean of precision and recall over head-word chains of length 1 to 4. The head-word chains are extracted directly from the dependency tree produced by the string-to-tree decoder and from the parsed reference. Tuning is performed using batch MIRA (Cherry and Foster, 2012) on 1000-best lists. We report evaluation scores averaged over the newstest2013, newstest2014 and newstest2015 data sets provided by WMT15.
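A simplified reading of the head-word chain metric is sketched below; the actual Moses implementation may differ in details such as how chain counts are clipped.

    from collections import Counter

    def chains(tokens, max_len=4):
        # Tokens are (index, word, head_index); a chain starts at a word
        # and repeatedly moves to the head word, yielding all chains of
        # length 1..max_len.
        words = {i: w for i, w, h in tokens}
        heads = {i: h for i, w, h in tokens}
        out = Counter()
        for i, _, _ in tokens:
            chain, node = [], i
            for _ in range(max_len):
                if node not in words:
                    break          # reached the artificial root
                chain.append(words[node])
                out[tuple(chain)] += 1
                node = heads[node]
        return out

    def hwcm(hyp_tokens, ref_tokens):
        # Harmonic mean of precision and recall over head-word chains.
        h, r = chains(hyp_tokens), chains(ref_tokens)
        overlap = sum((h & r).values())   # clipped chain matches
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0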


5 Evaluation

5.1 Error analysis

We wanted to get an idea of how often the verb and its arguments are mistranslated. For this purpose we manually annotated errors in sentences with more than 5 words and at most 15 words. With this criterion we avoided translations with scrambled predicate-argument structures; each sentence had roughly one main verb.

To obtain a more reliable error annotation, we first post-edited 100 translations from the baseline system. We then compared the translations with their post-editions and annotated error categories using the BLAST tool (Stymne, 2011). We counted a sense error when there was a wrong lexical choice for the head of a main argument, a prepositional modifier or the main verb. We also annotated mistranslated prepositions.

    Error Category   Error Count   Total
    Preposition               18     143
    Sense                     53     388
    Main argument             18     145
    Prep modifier              9     143
    Main verb                 26     100

Table 5: Number of mistranslated words in 100 sentences manually annotated with error categories.

In Table 5 we can see that 26 percent of the verbs are mistranslated and about 10 percent of the arguments. Mistranslated verbs are problematic since the feature produces the selectional association scores for the wrong verb. Although the semantic affinity is mutual, the formulation of the score conditions on the verb. When both the verb and the argument are mistranslated, the association score might be high even though the translation is not faithful to the source.

5.2 Evaluation of the Selectional Preference Feature

First, we determine the effectiveness of our selectional association features. We compare the two different selectional association features described in Section 3.2: SelAssoc_L and SelAssoc_C. We report the results of automatic evaluation in Table 6. Neither of the features improved the automatic evaluation scores. SelAssoc_L suffers from data sparsity, while the SelAssoc_C feature overgeneralizes due to noisy clustering. Adding both features compensates for these issues; however, we only see a slight improvement in BLEU scores for shorter sentences (2,701 sentences with more than 5 words and at most 15 words): 25.59 compared to 25.40 for the baseline system. We further investigate whether sparse features are more informative.

    System                            BLEU-c          HWCM
    Baseline                          26.45           24.47
    + SelAssoc_L                      26.41 (-.04)    24.52 (+.05)
    + SelAssoc_C                      26.48 (+.03)    24.54 (+.07)
    + SelAssoc_L + SelAssoc_C         26.48 (+.03)    24.47 (+.00)
    + Bin(SelAssoc_L + SelAssoc_C)    26.37 (-.08)    24.53 (+.06)
    + RDLM-P_w (1, 0, 0)              26.35 (-.10)    24.75 (+.28)
    + RDLM-P_w (2, 1, 1)              26.38 (-.07)    24.83 (+.36)

Table 6: Results for string-to-tree systems with SelAssoc and RDLM-P_w features. The number of clusters used with SelAssoc_C is 500. The triples in parentheses indicate the context size for ancestors, left siblings and right siblings respectively.

We changed the format of the features in order to experiment with sparse features. By using sparse features we let the tuning algorithm discriminate between low and high values of the SelAssoc score. For each of the SelAssoc features we normalized the scores to have zero mean and standard deviation one and mapped them to their corresponding percentile. A sparse feature was created for each percentile below and above the mean (up to two standard deviations below the mean and three standard deviations above the mean), resulting in a total of 20 sparse features. However, this formulation of the feature also did not improve the evaluation scores, as shown in the fifth row of Table 6.

The lack of variance in automatic evaluation scores can be explained by: a) the feature touches only a few words in the translation, and b) the relation between a predicate and its argument is identified at later stages of the bottom-up chart-based decoding, when many lexical choices have already been pruned out. The SelAssoc scores, similar to mutual information scores, are sensitive to outlier events with low frequencies in the training data. In the next section we investigate whether a more robust model would mitigate some of these issues and experiment with a neural relational dependency language model (RDLM) (Sennrich, 2015).
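The percentile binning described above might be implemented roughly as follows; the exact bin boundaries are our guess at a layout consistent with the stated totals (20 bins covering two standard deviations below and three above the mean).

    def bin_feature(score, mean, std, low=-2.0, high=3.0, n_bins=20):
        # Map a raw SelAssoc score to a sparse indicator feature name.
        z = (score - mean) / std
        z = max(low, min(high, z))           # clip to the covered range
        width = (high - low) / n_bins
        index = min(int((z - low) / width), n_bins - 1)
        return "SelAssocBin_%d" % index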

5.3 Comparison with a Relational Dependency LM

The RDLM (Sennrich, 2015) is a feed-forward neural network which learns the two probability distributions conditioned on a large syntactic context described in Eq. 3: P_w predicts the head word of the dependent and P_l the dependency relation. We compare our feature with RDLM-P_w. For training the RDLM-P_w we use the parameters for the feed-forward neural network described in Sennrich (2015): 150 dimensions for the input layer, 750 dimensions for the hidden layer, a vocabulary of 500,000 words and 100 noise samples. We train the RDLM-P_w on the target side of the parallel data. Although we use less data than for training the SelAssoc features, the neural network is inherently good at learning generalizations and selecting the appropriate conditioning context.

We experiment with different configurations for RDLM-P_w by varying the number of ancestors as well as left and right siblings:

• ancestors = 1, left = 0, right = 0

• ancestors = 2, left = 1, right = 1

The first configuration captures similar syntactic context as the SelAssoc features. The only exception is the prep relation, for which the head of pobj, the actual preposition, is a sibling of the argument. The results are shown in the last two lines of Table 6; the configuration is marked in parentheses as the number of ancestors, left siblings and right siblings respectively.

The RDLM-P_w performs slightly better than the selectional preference feature in terms of the HWCM scores. An increase in HWCM is to be expected since the RDLM-P_w models all dependency relations. However, there is no significant contribution from having a larger syntactic context.
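For orientation, a toy forward pass with the quoted layer sizes could look like the sketch below. The architecture (concatenated context embeddings, one tanh hidden layer, softmax output) is assumed, the vocabulary is shrunk to keep the sketch runnable, and the noise-contrastive estimation used for training the real RDLM is omitted.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, emb_dim, hidden = 500, 150, 750   # paper: vocabulary of 500,000
    E = rng.normal(scale=0.1, size=(vocab, emb_dim))    # context embeddings
    W1 = rng.normal(scale=0.01, size=(2 * emb_dim, hidden))
    W2 = rng.normal(scale=0.01, size=(hidden, vocab))

    def p_w(context_ids):
        # Configuration (1, 0, 0): context is one ancestor head word plus
        # the dependency label of the predicted node.
        x = np.concatenate([E[i] for i in context_ids])
        h = np.tanh(x @ W1)
        logits = h @ W2
        z = np.exp(logits - logits.max())
        return z / z.sum()

    probs = p_w([42, 7])   # (head-word id, dependency-label id)
    print(probs.shape)     # distribution over the head-word vocabulary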


Figure 2: Frequency and translation precision of triples with respect to the distance between the predicate and its arguments. Frequency is computed for triples extracted from the reference sentences of the test sets. Translation precision is computed over triples extracted from the output of the two translation systems: the baseline system and the system with SelAssoc_L and SelAssoc_C features.

5.4 Analysis

In this section we investigate possible reasons for the low impact of our selectional preference features. We look at how frequently our features are triggered, and how precision is influenced by the distance between predicates and arguments.

Firstly, we are interested in how often the feature triggers and how it influences the overall selectional association score of the test set. On average, 4.85 triples can be extracted per sentence produced by our system. Out of these, 4.35 triples get scored by the SelAssoc_C feature and 3.56 by the SelAssoc_L feature. The selectional association scores are higher on average for our system than for the baseline, as shown in Table 7. The SelAssoc_C feature seems to overgeneralize for the prep relations, as the scores are on average higher than for the reference triples. We therefore conclude that our feature is having an impact on the translation system.

                                  SelAssoc_L      SelAssoc_C
    System                        main    prep    main    prep
    Baseline                      0.067   0.039   0.164   0.147
    + SelAssoc_L + SelAssoc_C     0.074   0.041   0.175   0.305
    Reference                     0.077   0.043   0.186   0.163

Table 7: Average selectional association scores for the test sets. Scores are aggregated over the main and prep argument types. Main arguments include nsubj, nsubjpass, dobj and iobj.

Secondly, we want to understand the interaction between the SelAssoc features and the language model. For this purpose we compute the frequency and translation precision of triples with respect to the distance between the predicate and its arguments. Figure 2 shows the frequency of triples extracted from the reference sentences as well as the translation precision of triples extracted from the output of the translation systems. For more reliable precision scores we lemmatized all predicates and arguments. Most arguments are within a 5-word window from the predicate; therefore most triples are also scored by the language model. For these triples we see only a slight increase in precision for our system. This result indicates that for predicates and arguments that are close to each other, the feature is not adding much information. As the distance increases, the precision decreases drastically for both systems. A longer distance between predicates and arguments also implies a more complex syntactic structure, which will negatively impact the quality of the extracted triples and the selectional association scores.


Source: Das 16-jährige Mädchen und der 19-jährige Mann brachen kurz nach Sonntagmittag in Govetts Leap in Blackheath zu ihrer Tour auf.
Reference: The 16-year old girl and the 19-year old man went on their tour shortly after Sunday lunch at Govetts Leap in Blackheath.
Baseline: The 16-year old girl and the 19-year old man broke shortly after Sunday lunch in Govetts Leap in Blackheath on their tour.

Figure 3: Example of a complex sentence with multiple prepositional modifiers. Information about semantic roles is needed to identify the relevant prepositional modifier.

5.5 Discussion

One reason for the small impact of both the SelAssoc and RDLM-P_w features could be the poor quality of the syntactic trees produced by the decoder for longer sentences. In the cases where the relation between predicate and argument can be reliably extracted, such as the example in Fig. 1, the features are not adding more information than is already covered by the language model.

In more complex sentences there are cases where the features score modifiers that are not important for disambiguating the verb. The example in Figure 3 has several prepositional modifiers, but only "on tour" could help disambiguate the verb brachen (went). In such cases, identifying the semantic roles of the modifiers in the source and projecting them onto the target might be useful for better estimation of semantic affinities.

The error analysis on short sentences showed that translation of verbs is problematic for syntax-based systems. This is confirmed by the low precision scores for verb translation shown in Table 8 (the precision scores were computed over verb lemmas extracted automatically from the test sets; in total 21,633 source verbs were evaluated). Although there is a slight improvement in precision, mistranslated verbs generally impact our features, as the semantic affinity is scored for the wrong verb. A solution would be to add the source verbs to the conditioning context.

    System                        Precision
    Baseline                      46.10
    + SelAssoc_L + SelAssoc_C     46.26 (+.16)
    + RDLM-P_w (2, 1, 1)          46.31 (+.21)

Table 8: Evaluation of verb translation in the test set. Precision scores are computed over verb lemmas against the reference translations.

6 Conclusions

This paper explores whether knowledge about semantic affinities between the target predicates and their argument fillers is useful for translating ambiguous predicates and arguments. We propose three variants of a selectional preference feature for string-to-tree statistical machine translation based on the selectional association measure of Resnik (1996). We compare our features with a variant of the neural relational dependency language model (RDLM) (Sennrich, 2015) and find that neither of the features improves automatic evaluation metrics. We conclude that mistranslated verbs, errors in the target syntactic trees produced by the decoder and underspecified syntactic relations are negatively impacting these features. We propose to address these issues in future work by augmenting the feature with source-side information such as the source verb and the semantic roles of its arguments.

Acknowledgments

We thank the anonymous reviewers as well as Rico Sennrich for his feedback and assistance with RDLM. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 644402 (HimL) and 645452 (QT21). We are also grateful for support by a Google Faculty Research Award.

References

Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, ACL 2013, pages 419–423.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal, pages 1–46.


Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 740–750.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report, Harvard University.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montreal, Canada, NAACL HLT '12, pages 427–436.

Raphael Cohen, Yoav Goldberg, and Michael Elhadad. 2012. Domain adaptation of a dependency parser with a class-class selectional preference model. In Proceedings of the ACL 2012 Student Research Workshop. ACL '12, pages 43–48.

Tim Van de Cruys. 2014. A neural network approach to selectional preference acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL '14, pages 26–35.

Katrin Erk, Sebastian Padó, and Ulrike Padó. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics 36(4):723–763.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Morristown, NJ, USA, pages 961–968.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics. HLT-NAACL '04.

Qin Gao and Stephan Vogel. 2011. Utilizing target-side semantic role labels to assist hierarchical phrase-based machine translation. In Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation. Portland, Oregon, USA, pages 107–115.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland, WMT '11, pages 187–197.

Eliyahu Kiperwasser and Yoav Goldberg. 2015. Semi-supervised dependency parsing using bilexical contextual features from auto-parsed data. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1348–1353.

Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. EMNLP-CoNLL '12, pages 1048–1059.

Junhui Li, Philip Resnik, and Hal Daumé III. 2013. Modeling syntactic and semantic structures in hierarchical phrase-based translation. In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics. Atlanta, Georgia, USA, NAACL-HLT 2013, pages 540–549.

Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, MI, pages 25–32.

Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics. pages 716–724.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, Stroudsburg, PA, USA, AKBC-WEKEX '12, pages 95–100.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pages 1532–1543.

Philip Resnik. 1996. Selectional constraints: an information-theoretic model and its computational realization. Cognition 61:127–159.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA, ACL '10, pages 424–434.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA, ACL '10, pages 435–444.

Rico Sennrich. 2015. Modelling and optimizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics 3:169–182.

Ekaterina Shutova, Niket Tandon, and Gerard de Melo. 2015. Perceptually grounded selectional preferences. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. ACL '15, pages 950–960.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). Volume 3.

Sara Stymne. 2011. Blast: A tool for error analysis of machine translation output. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations. Portland, Oregon, HLT '11, pages 56–61.

Lin Sun and Anna Korhonen. 2009. Improving verb clustering with automatically acquired selectional preferences. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. EMNLP '09, pages 638–647.

Gertjan van Noord. 2007. Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of the 10th International Conference on Parsing Technologies. IWPT '07, pages 1–10.

Marion Weller, Alexander Fraser, and Sabine Schulte im Walde. 2013. Using subcategorization knowledge to improve case prediction for translation to German. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 593–603.

Marion Weller, Sabine Schulte im Walde, and Alexander Fraser. 2014. Using noun class information to model selectional preferences for translating prepositions in SMT. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas. Vancouver, BC, AMTA '14, pages 275–287.

Philip Williams and Philipp Koehn. 2012. GHKM rule extraction and scope-3 parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation. pages 388–394.

Dekai Wu and Pascale Fung. 2009. Semantic roles for SMT: a hybrid two-pass model. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. pages 13–16.


F Using Verb Patterns in PBMT

Verb Sense Disambiguation in Machine Translation

Roman Sudarikov, Ondřej Dušek, Martin Holub, Ondřej Bojar, and Vincent Kříž Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics {sudarikov,odusek,holub,bojar,kriz}@ufal.mff.cuni.cz

Abstract

We describe experiments in machine translation using word sense disambiguation (WSD) information. This work focuses on WSD in verbs, based on two different approaches – verbal patterns based on corpus pattern analysis and verbal word senses from valency frames. We evaluate several options for using verb senses in the source-language sentences as an additional factor for the Moses statistical machine translation system. Our results show a statistically significant translation quality improvement in terms of the BLEU metric for the valency frames approach; in manual evaluation, both WSD methods bring improvements.

1 Introduction

The possibility of using word sense disambiguation (WSD) systems in machine translation (MT) has recently been investigated in several ways: the output of WSD systems has been incorporated into MT to improve translation quality – at the decoding step of a phrase-based statistical machine translation (PB-SMT) system (Chan et al., 2007) or as contextual features in maximum entropy (MaxEnt) models (Neale et al., 2015; Neale et al., 2016). In addition, WSD has also been used in MT evaluation, for example in METEOR (Apidianaki et al., 2015). These works indicate that WSD can be beneficial to different MT tasks: using senses as contextual features for MaxEnt models, Neale et al. (2016) achieve a statistically significant improvement over the baseline for English-to-Portuguese translation, and Apidianaki et al. (2015) report that WSD can establish better sense correspondences and improve METEOR's correlation with human judgments of translation quality.

In this research, we have investigated the possibilities of integrating two different approaches to verbal WSD into a PB-SMT system – verb patterns based on corpus pattern analysis (CPA) and verbal word senses in valency frames. The focus on verbs was motivated by the idea that verbs carry a crucial part of the meaning of the sentence (Healy and Miller, 1970) and thus accurate translation of the verb is critical for understanding the translation. Therefore, improvement of the translation of verbs can lead to an overall increase of translation quality.

The outputs of automatic verb sense disambiguation systems using both CPA and valency frames were integrated into the Moses statistical machine translation system (Koehn et al., 2007). Both kinds of verb senses were added as additional factors (Koehn and Hoang, 2007). Section 4.1 shows that we obtain a statistically significant improvement in terms of BLEU scores (Papineni et al., 2002), and manual evaluation of the translations confirmed this. The novelty of this work lies not only in our focus on verb senses, but also in the fact that we compare the impact of two WSD approaches on statistical machine translation.

The following Section 2 describes the initial setup of our experiments. Section 3 and Section 4 describe the ideas behind corpus pattern analysis and verb valency frame representations and show evaluation results of incorporating these senses into phrase-based statistical machine translation. Section 5 is devoted to the discussion of results obtained during the evaluation. Finally, Section 6 describes our plans for future work.


2 Experiments setup

2.1 Dataset and MT system

For our experiments, we have used a subset of the Czech-English corpus CzEng 1.0 (Bojar et al., 2012); the respective numbers of sentences and tokens in each of the training, development and test sets are shown in Table 1. For our experiments, 28 different English verbs were selected and automatically annotated with corpus pattern analysis senses, and 3,306 verbs were annotated using valency frames. The subset has been selected to include verbs annotated with CPA, so the effect of WSD would be visible. All the experiments were carried out in the Eman experiment management system (Bojar and Tamchyna, 2013) using the Moses PB-SMT system (Koehn et al., 2007) as the core and minimum error rate training (MERT, (Och, 2003)) to optimize the decoder feature weights on the development set. The evaluation was performed using the BLEU score (Papineni et al., 2002), but the results of each setup were then thoroughly examined and verified using the MT-ComparEval system (Aranberri et al., 2016), available at http://wmt.ufal.cz/.

    Set           Number of sentences   Tokens CS    Tokens EN
    Training      649,605               10,759,546   12,073,130
    Development   10,115                187,478      167,788
    Test          2,707                 59,446       67,336

Table 1: Data set composition

2.2 MT configurations

As we have mentioned in Section 1, the main goal of the experiments was to explore whether verb senses as additional factors in the statistical MT system Moses can help to improve translation quality. The following configurations were tested:

• Form→Form – "vanilla" Moses setup, translating from surface word forms to target surface forms, including capitalization.

• Form+Sense→Form – two source factors (surface word form and verb sense ID, if applicable) are translated to the target-side word forms. This is technically identical to appending the verb sense ID to the source words (illustrated in the sketch after this list).

• Form→Form+Tag – the source word form is translated to two factors on the target side: word form and morphological tag (part-of-speech tag with morphological categories of Czech, such as case, number, gender, or tense). This allows us to use an additional language model trained on morphological tags only. This setup is known to perform well for morphologically rich languages (Bojar, 2007) and thus was selected as a baseline for all comparisons.

• Form+Sense→Form+Tag – a combination of the two setups above: two source and two target-side factors, for better handling of source verb meaning and target morphological coherence.

• Form→Form+Tag + Form+Sense→Form+Tag – a combination of the previous two models as two separate phrase tables.
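For illustration, the Form+Sense source factors can be produced as in the following sketch, using the pipe-separated factor notation that Moses expects for factored input; the sense IDs and tokenization are toy values.

    def add_sense_factor(tokens, senses):
        # Append a verb sense ID as a second source factor where the WSD
        # system provided one, and "-" elsewhere.
        return " ".join("%s|%s" % (w, senses.get(i, "-"))
                        for i, w in enumerate(tokens))

    print(add_sense_factor(["let", "it", "cool", "down"], {2: "1"}))
    # let|- it|- cool|1 down|-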

For all configurations, we trained a 4-gram language model on word forms of the sentences from the training set. This LM was pruned: we discarded all singleton n-grams (apart from unigrams). In addition, for configurations which generated morphological tags, we used a 10-gram LM over morphological tags to help maintain the morphological coherence of the translation outputs. Again, we pruned all singleton n-grams with the exception of unigrams.


    Verb   No.   Pattern / Implicature

    gleam  1     [[Physical Object | Surface]] gleam [NO OBJ]
                 [[Surface]] of [[Physical Object]] reflects occasional flashes of light
    gleam  2     [[Light | Light Source]] gleam [NO OBJ]
                 [[Light Source]] emits an occasional flash of [[Light]]
    gleam  3     {eyes} gleam [NO OBJ] (with [[Emotion]])
                 {eyes} of [[Human]] shine, expressive of [[Emotion]]
    wake   3     [no object] [[Human]] wake ({up}) {AdvTime} ({from nightmare | dream | sleep | reverie}) ({to Eventuality})
                 the mind of [[Human]] returns at a particular [[Time]] to a state of full conscious awareness and alertness after sleep
    wake   4     [phrasal verb] [[Human 1 ^ Sound ^ Event]] wake [[Human 2 ^ Animal]] ({up})
                 [[Human 1 | Sound | Event]] causes the mind of [[Human 2 | Animal]] to return to a state of full conscious awareness and alertness after sleep
    wake   7     [[Anything]] wake [[Emotion]] ({in Human})
                 [[Anything]] causes [[Human]] to feel or become aware of [[Emotion]]
    wake   9     waking * ({up})
                 [[Human | Animal]]'s returning to a state of full conscious awareness and alertness after sleep

Table 2: Example patterns defined for the verbs gleam and wake.

3 Verb patterns based on Corpus Pattern Analysis

Corpus Pattern Analysis (CPA) is a method of manual context-based lexical disambiguation of verbs (Hanks, 1994; Hanks, 2013). Verbs are supposed to have no meanings on their own; instead, meanings are triggered by the context. Hence, a CPA-based lexicon does not group the uses of a verb into senses but into syntagmatic usage patterns derived from the corpus findings. Such a CPA-based lexicon is the Pattern Dictionary of English Verbs (PDEV) (Hanks and Pustejovsky, 2005). In contrast to classical WSD, here the verb patterns are used as verb meaning representations. An example of a few patterns is given in Table 2.

Here we employ an automatic procedure for verb pattern recognition developed by Holub et al. (2012), which deals with 30 selected English verbs. In fact, their method uses 30 separate classifiers, one for each verb, trained on moderately sized manually annotated samples. They use the collection called VPS-30-En (Verb Pattern Sample, 30 English verbs) published by Cinková et al. (2012) as training data. VPS-30-En was designed as a small sample of PDEV, a pilot lexical resource of 30 English lexical verb entries enriched with semantically annotated corpus samples. The data describes regular contextual patterns of use of the selected verbs in the British National Corpus, version 3 (BNC, 2007); details about both the selected verbs and training contexts can be found at http://ufal.mff.cuni.cz/spr. The number of different patterns varies from 4 to 10 in most cases across the verbs, and the performance of Holub et al. (2012)'s automatic pattern recognition also differs from verb to verb, ranging between 50% and 90% accuracy.
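Schematically, such a per-verb setup can be wired as in the sketch below. The classifier objects are hypothetical placeholders, not the actual VPS-30-En models or their features.

    class VerbPatternDisambiguator:
        # One classifier per verb lemma, in the style of Holub et al.
        # (2012). `classifiers` maps a lemma to a trained model with a
        # predict() method; verbs without a classifier stay unannotated.
        def __init__(self, classifiers):
            self.classifiers = classifiers

        def annotate(self, lemma, context_features):
            clf = self.classifiers.get(lemma)
            if clf is None:
                return None               # not one of the 30 covered verbs
            return clf.predict(context_features)

    # Usage with a trivial stand-in classifier that always predicts "1":
    class AlwaysOne:
        def predict(self, feats):
            return "1"

    wsd = VerbPatternDisambiguator({"cool": AlwaysOne()})
    print(wsd.annotate("cool", {"next": "down"}))   # "1"
    print(wsd.annotate("run", {}))                  # None (uncovered verb)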

3.1 Experiments and evaluation

For the experiments with verb patterns based on CPA, we have explored all the configurations described in Section 2.2. Table 3 shows the results of the best MERT run for each configuration. Evaluation over multiple MERT runs was performed for Form→Form+Tag, Form+Sense→Form+Tag, and Form→Form+Tag + Form+Sense→Form+Tag using the MultEval system (Clark et al., 2011) with Form→Form+Tag as the baseline system; the results are shown in Table 4. We see that the average results of Form+Sense→Form+Tag are worse than those of Form→Form+Tag by 0.1% BLEU. MultEval aims to determine whether an experimental result has a statistically reliable difference for a given evaluation metric, using a stratified approximate randomization (AR) test. AR estimates the probability (p-value) that a measured difference in metric scores arose by chance by randomly exchanging sentences between the two systems. If there is no significant difference between the systems (i.e., the null hypothesis is true), then this shuffling should not change the computed metric score (Clark et al., 2011).
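The AR test can be sketched as follows. This is a generic implementation, not MultEval's code; for BLEU, the per-sentence statistics would be n-gram sufficient statistics that the metric function aggregates into a corpus-level score.

    import random

    def ar_test(stats_a, stats_b, metric, trials=10000):
        # Randomly swap the two systems' per-sentence statistics and count
        # how often the shuffled score difference is at least as large as
        # the observed one.
        observed = abs(metric(stats_a) - metric(stats_b))
        hits = 0
        for _ in range(trials):
            xs, ys = [], []
            for a, b in zip(stats_a, stats_b):
                if random.random() < 0.5:
                    xs.append(a); ys.append(b)
                else:
                    xs.append(b); ys.append(a)
            if abs(metric(xs) - metric(ys)) >= observed:
                hits += 1
        return (hits + 1) / (trials + 1)   # smoothed p-value estimate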


    Configuration                            BLEU
    Form→Form                                24.26
    Form+Sense→Form                          24.15
    Form+Sense→Form+Tag                      25.01
    Form→Form+Tag                            25.11
    Form→Form+Tag + Form+Sense→Form+Tag      25.27

Table 3: Evaluation results for corpus pattern analysis annotation, best MERT run

When comparing Form→Form+Tag and Form→Form+Tag + Form+Sense→Form+Tag, we see that the p-value is 0.16, so we cannot claim that these two systems differ from each other. The same test performed using METEOR and TER only confirms this (for TER, p-value = 0.61).

    Metric   System                                 Avg    s_sel   s_Test   p-value
    BLEU     Form→Form+Tag                          25.0   0.9     0.1      -
             Form+Sense→Form+Tag                    24.9   0.9     0.1      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    25.0   0.9     0.1      0.16
    METEOR   Form→Form+Tag                          22.6   0.4     0.0      -
             Form+Sense→Form+Tag                    22.5   0.4     0.0      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    22.6   0.4     0.1      0.22
    TER      Form→Form+Tag                          62.2   0.7     0.2      -
             Form+Sense→Form+Tag                    62.4   0.7     0.1      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    62.2   0.7     0.2      0.61

Table 4: MultEval results for corpus pattern analysis, based on 36 MERT runs

We also performed a more detailed analysis with pairwise comparisons of the following configurations:

• Form→Form vs. Form+Sense→Form

• Form→Form+Tag vs. Form+Sense→Form+Tag

• Form→Form+Tag vs. Form→Form+Tag + Form+Sense→Form+Tag

3.1.1 Form→Form vs. Form+Sense→Form

The comparison provided by MT-ComparEval, based on paired bootstrap resampling (Koehn, 2004) of the best MERT runs for both configurations, showed that Form→Form is significantly better (p-value=0.022) than Form+Sense→Form. The sentence-by-sentence comparison explains this: on the positive side, 8 examples out of the top 10 sentences where the Form+Sense→Form output was better than Form→Form profited from using additional information about the verb sense. On the negative side, the model with verb senses made a lot of errors due to badly extracted phrase tables, even leaving some verbs untranslated.

3.1.2 Form→Form+Tag vs. Form+Sense→Form+Tag

In this case the same paired bootstrap resampling of the best MERT runs showed that the difference between the Form+Sense→Form+Tag and Form→Form+Tag outputs is not significant (p-value=0.062). In the sentence-by-sentence comparison, we saw that while information about the verb pattern helps with some translations, it still causes mistakes.

For example, in the sentence from Figure 1, the verb cool down is translated as vychladnout ('let the temperature sink') instead of the correct uklidnit ('calm down'). Here, MT-ComparEval shows that Form→Form+Tag translated the verb correctly, meaning that the correct translation exists in the training data. Therefore, we checked which of the translation model factors caused the wrong translation. In the source sentence, the verb cool has the CPA pattern "1", but the only suitable phrase in the Form+Sense→Form+Tag phrase table (with cool|1 down|- on the source side) has the verb vychladnout on the target side. In the Form→Form+Tag table, we have the phrase cool down and let translated using the verb uklidnit, but the corresponding phrase in the Form+Sense→Form+Tag table has a different CPA pattern "u" for the verb cool.
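Paired bootstrap resampling (Koehn, 2004), as used by MT-ComparEval above, differs from the AR test sketched earlier in that it resamples whole test sets with replacement. A minimal sketch:

    import random

    def paired_bootstrap(stats_a, stats_b, metric, samples=1000):
        # Repeatedly draw a test set of the same size with replacement and
        # count how often system A scores above system B on the resample.
        n, wins = len(stats_a), 0
        for _ in range(samples):
            idx = [random.randrange(n) for _ in range(n)]
            if metric([stats_a[i] for i in idx]) > metric([stats_b[i] for i in idx]):
                wins += 1
        return wins / samples   # fraction of resamples where A is better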


Figure 1: An example MT-ComparEval output from the Form+Sense→Form+Tag sentence analysis.

    work1: ACT PAT DIR3 (put, implement)
        Burger King works a sales pitch into its public-service message.
    work2: ACT ?PAT ?BEN ?ACMP (perform a job)
        Mr. Cray has been working on the project for more than six years.
    work3: ACT PAT (cause, create)
        [. . . ] greenhouse effect that will work important climatic changes [. . . ]
    work4: ACT (function)
        US trade law is working.

Figure 2: Example entry from the EngVallex valency lexicon, with four different senses/valency frames of the verb work (abridged, with minor adaptations for presentation). The sense ID and the valency frame are shown on the 1st line of each sense, with the following semantic roles: ACT = actor, PAT = patient, DIR3 = direction (to, into), BEN = benefactor, ACMP = accompanying person or object. Optional arguments are prepended with a "?". A short gloss is shown on the 2nd line, and an example on the 3rd line.

3.1.3 Form→Form+Tag vs. Form→Form+Tag + Form+Sense→Form+Tag

MT-ComparEval's paired bootstrap resampling showed that the difference between these two outputs is significant (p-value=0.023), i.e., the output of Form→Form+Tag + Form+Sense→Form+Tag is significantly better than that of Form→Form+Tag. In the sentence-by-sentence comparison, we saw that the combined system benefited from the verb patterns where possible but resorted to the more general translations of the baseline phrase table when CPA-annotated translations were insufficient.

4 Verbal word senses in valency frames

Valency in verbs (and other parts of speech), i.e., the ability of a verb to require and shape its arguments, is one of the core notions of the Functional Generative Description (FGD) theory (Sgall et al., 1986). The valency of a verb is described in a valency frame, which lists the semantic roles and possible syntactic shapes of all of its obligatory and optional arguments. Since different senses of the same verb require different arguments and thus are described by different valency frames, this amounts to WSD in verbs (an example is shown in Figure 2).

Valency frames for over 7,000 senses of more than 4,000 common English verbs are listed in the EngVallex valency lexicon (Cinková, 2006); EngVallex is originally based on the PropBank frame files (Palmer et al., 2005), but it also contains a lot of manual changes. The Prague Czech-English Dependency Treebank (PCEDT) 2.0 (Hajič et al., 2012) provides manually annotated valency frame IDs for all of its verbs. Using this annotation, Dušek et al. (2015) trained an automatic system for valency frame detection as a part of the Treex natural language processing toolkit (Popel and Žabokrtský, 2010), available at http://ufal.mff.cuni.cz/treex. We processed all the sentences in our dataset with the tool and used the resulting valency frame IDs in our experiments.
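A much simplified picture of what such a lexicon involves, using the abridged work entry from Figure 2; the data structure is ours, not EngVallex's actual format.

    # Frame IDs map to role lists and a short gloss, mirroring Figure 2.
    VALENCY_LEXICON = {
        "work1": {"roles": ["ACT", "PAT", "DIR3"], "gloss": "put, implement"},
        "work2": {"roles": ["ACT", "?PAT", "?BEN", "?ACMP"], "gloss": "perform a job"},
        "work3": {"roles": ["ACT", "PAT"], "gloss": "cause, create"},
        "work4": {"roles": ["ACT"], "gloss": "function"},
    }

    def obligatory_roles(frame_id):
        # Roles not marked optional (optional roles are prefixed with "?").
        frame = VALENCY_LEXICON[frame_id]
        return [r for r in frame["roles"] if not r.startswith("?")]

    print(obligatory_roles("work2"))   # ['ACT']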

4.1 Experiments and evaluation

Based on the results of the experiments shown in Section 3.1, we have decided to focus only on the following configurations: Form→Form+Tag, Form+Sense→Form+Tag and their combination Form→Form+Tag + Form+Sense→Form+Tag.


    Configuration                            BLEU
    Form+Sense→Form+Tag                      24.97
    Form→Form+Tag                            25.08
    Form→Form+Tag + Form+Sense→Form+Tag      25.26

Table 5: Evaluation results for valency frames annotation, best MERT run for each configuration

Table 5 shows the results of the best MERT run for each configuration. The MultEval evaluation for all the configurations mentioned above, with Form→Form+Tag as the baseline, is shown in Table 6. The table shows that the average Form+Sense→Form+Tag model results are still 0.1% BLEU worse than the Form→Form+Tag model, but the average results of the combined Form→Form+Tag + Form+Sense→Form+Tag model are 0.1% BLEU better than the average results of Form→Form+Tag. The results of MultEval's stratified approximate randomization test (Clark et al., 2011) allow us to claim that the combination of these two models is statistically significantly better than the baseline. The same is true for the METEOR and TER results, shown in the same table. It also shows that the valency frames approach to WSD has more impact on MT than CPA in our case.

    Metric   System                                 Avg    s_sel   s_Test   p-value
    BLEU     Form→Form+Tag                          25.0   0.9     0.1      -
             Form+Sense→Form+Tag                    24.9   0.9     0.1      0.01
             Form→Form+Tag + Form+Sense→Form+Tag    25.1   0.9     0.1      0.00
    METEOR   Form→Form+Tag                          22.5   0.4     0.0      -
             Form+Sense→Form+Tag                    22.5   0.4     0.0      0.01
             Form→Form+Tag + Form+Sense→Form+Tag    22.6   0.4     0.0      0.00
    TER      Form→Form+Tag                          62.2   0.7     0.1      -
             Form+Sense→Form+Tag                    62.4   0.7     0.2      0.00
             Form→Form+Tag + Form+Sense→Form+Tag    62.1   0.7     0.2      0.00

Table 6: MultEval results for valency frames, based on 8 MERT runs

A more thorough examination of the best MERT runs of the following pairs of configurations, using MT-ComparEval's paired bootstrap resampling, showed that:

• Form+Sense→Form+Tag is insignificantly worse than Form→Form+Tag, with p-value=0.0161

• Form→Form+Tag + Form+Sense→Form+Tag is significantly better than Form→Form+Tag, with p-value=0.002

An interesting observation was that the Form+Sense→Form+Tag and Form→Form+Tag + Form+Sense→Form+Tag models were more likely to translate verbs as verbs, while translation errors in Form→Form+Tag were often caused by its efforts to translate verbs as nouns.

4.2 Comparison of CPA and valency frames

Based on the MultEval results shown in Table 4 and Table 6, it can be claimed that using the valency frames approach to WSD helped to achieve a statistically significant improvement in machine translation, while CPA did not help to such an extent. Among the possible reasons are the lower number of verbs covered (for the same number of sentences, we had CPA-based annotations for only 28 different verbs versus 3,306 different verbs with valency frame annotations) and the precision of the automatic annotation systems themselves. One of our future plans here is to compare the results of these approaches when exactly the same verbs are annotated.

An example of a sentence where the valency frames approach was more successful than CPA is ". . . forged steel components for the automotive industry". Here, the word forged was annotated both by a verbal valency frame and by a verbal pattern, and while the valency frame provided a correct translation of this word into Czech, "kované oceli součástí", the CPA-based model generated "zfalšoval ocel součástí", which is incorrect in both meaning and part of speech.


5 Discussion and conclusion

Including verb senses – be it based on corpus pattern analysis or on valency frames – as an additional factor in a PB-SMT English-to-Czech model did not help by itself, as our results for the Form+Sense→Form+Tag configurations have shown. Nevertheless, the combination of this model with the better-performing model Form→Form+Tag resulted in a significant improvement in the case of senses based on valency frames, as shown by the paired bootstrap resampling tests given in Table 6, while manual evaluation of the best MERT runs showed translation quality improvements for both WSD approaches. All the results were achieved on a relatively small data set, but the approach can be of use in cases when one does not have enough parallel data but WSD for the source language (which is often English) is available, for example for domain-specific translations.

We have tried to use sense information produced by two different approaches to verbal WSD – corpus pattern analysis and valency frames – and while the former did not significantly outperform the baseline system in terms of the BLEU metric, the latter showed a significant improvement. Adding an automatic WSD system as an additional preprocessing layer can also influence the SMT system negatively, because the WSD system cannot deliver 100% accurate senses; this causes situations where the system had a correct translation available but did not select it because the verb sense assigned to the source sentence from the test set was incorrect. Possible ways of reducing the impact of such issues are improving the automatic WSD systems used and using WSD system combination.

6 Future work

In the future, we plan to continue our experiments on verb senses using the approaches described in this work as well as other approaches, e.g., WSD systems based on BabelNet synsets (Navigli and Ponzetto, 2012) and WordNet senses (http://wordnet.princeton.edu). In addition, we are going to experiment with the size of the corpus used for training, because this research used only a part of the available Czech-English parallel corpus.

7 Acknowledgments This research was supported by the grants H2020-ICT-2014-1-645452, GBP103/12/G084, SVV 260 333, and using language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2015071). We thank the two anonymous reviewers for useful comments.

References

Marianna Apidianaki, Benjamin Marie, and Lingua et Machina. 2015. METEOR-WSD: improved sense matching in MT evaluation. Syntax, Semantics and Structure in Statistical Translation, page 49.

Nora Aranberri, Eleftherios Avramidis, Aljoscha Burchardt, Ondřej Klejch, Martin Popel, and Maja Popović. 2016. Tools and guidelines for principled machine translation development. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1877–1882, Portorož, Slovenia.

BNC. 2007. British National Corpus, version 3 (BNC XML edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/.

Ondřej Bojar and Aleš Tamchyna. 2013. The Design of Eman, an Experiment Manager. The Prague Bulletin of Mathematical Linguistics, 99:39–58.

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2012. The Joy of Parallelism with CzEng 1.0. In Proceedings of LREC 2012, Istanbul, Turkey, May. ELRA, European Language Resources Association.

Ondřej Bojar. 2007. English-to-Czech Factored Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 232–239, Prague, Czech Republic, June. Association for Computational Linguistics.


Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Prague, Czech Republic.

Silvie Cinková, Martin Holub, Adam Rambousek, and Lenka Smejkalová. 2012. A database of semantic clusters of verb usages. In Proceedings of the LREC 2012 International Conference on Language Resources and Evaluation. Istanbul, Turkey.

Silvie Cinková. 2006. From PropBank to EngValLex: Adapting the PropBank-Lexicon to the Valency Theory of the Functional Generative Description. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy.

Jonathan H Clark, Chris Dyer, Alon Lavie, and Noah A Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 176–181. Association for Computational Linguistics.

Ondřej Dušek, Eva Fučíková, Jan Hajič, Martin Popel, Jana Šindlerová, and Zdeňka Urešová. 2015. Using Parallel Texts and Lexicons for Verbal Word Sense Disambiguation. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 82–90, Uppsala, Sweden.

J. Hajič, E. Hajičová, J. Panevová, P. Sgall, O. Bojar, S. Cinková, E. Fučíková, M. Mikulová, P. Pajas, J. Popelka, J. Semecký, J. Šindlerová, J. Štěpánek, J. Toman, Z. Urešová, and Z. Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of LREC, pages 3153–3160, Istanbul.

Patrick Hanks and James Pustejovsky. 2005. A Pattern Dictionary for Natural Language Processing. Revue Française de Linguistique Appliquée, 10(2).

Patrick Hanks. 1994. Linguistic norms and pragmatic exploitations, or why lexicographers need prototype theory and vice versa. In F. Kiefer, G. Kiss, and J. Pajzs, editors, Papers in Computational Lexicography: Complex ’94. Research Institute for Linguistics, Hungarian Academy of Sciences.

Patrick Hanks. 2013. Lexical Analysis: Norms and Exploitations. University Press Group Limited.

Alice F Healy and George A Miller. 1970. Verb as main determinant of sentence meaning. Psychonomic Science, 20(6):372–372.

Martin Holub, Vincent Kříž, Silvie Cinková, and Eckhard Bick. 2012. Tailored feature extraction for lexical disambiguation of English verbs based on corpus pattern analysis. In COLING, pages 1195–1210.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proc. of EMNLP.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 187–193.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In EMNLP, pages 388–395. Citeseer.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and appli- cation of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Steven Neale, Luís Gomes, and António Branco. 2015. First steps in using word senses as contextual features in MaxEnt models for machine translation. In 1st Deep Machine Translation Workshop, page 64.

Steven Neale, Luís Gomes, Eneko Agirre, Oier Lopez de Lacalle, and António Branco. 2016. Word sense-aware machine translation: Including senses as contextual features for improved translation models. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2777–2783, Portorož, Slovenia.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 160–167.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.


Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evalua- tion of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: modular NLP framework. In Proceedings of IceTAL, 7th International Conference on Natural Language Processing, pages 293–304, Reykjavík.

P. Sgall, E. Hajičová, and J. Panevová. 1986. The meaning of the sentence in its semantic and pragmatic aspects. D. Reidel, Dordrecht.


G Target-Side Context for Discriminative Models in Statistical Machine Translation

Target-Side Context for Discriminative Models in Statistical Machine Translation

Aleš Tamchyna1,2, Alexander Fraser1, Ondřej Bojar2 and Marcin Junczys-Dowmunt3
1LMU Munich, Munich, Germany
2Charles University in Prague, Prague, Czech Republic
3Adam Mickiewicz University in Poznań, Poznań, Poland
{tamchyna,bojar}@ufal.mff.cuni.cz, [email protected], [email protected]

Abstract

Discriminative translation models utilizing source context have been shown to help statistical machine translation performance. We propose a novel extension of this work using target context information. Surprisingly, we show that this model can be efficiently integrated directly in the decoding process. Our approach scales to large training data sizes and results in consistent improvements in translation quality on four language pairs. We also provide an analysis comparing the strengths of the baseline source-context model with our extended source-context and target-context model and we show that our extension allows us to better capture morphological coherence. Our work is freely available as part of Moses.

1 Introduction

Discriminative lexicons address some of the core challenges of phrase-based MT (PBMT) when translating to morphologically rich languages, such as Czech, namely sense disambiguation and morphological coherence. The first issue is semantic: given a source word or phrase, which of its possible meanings (i.e., which stem or lemma) should we choose? Previous work has shown that this can be addressed using a discriminative lexicon. The second issue has to do with morphology (and syntax): given that we selected the correct meaning, which of its inflected surface forms is appropriate? In this work, we integrate such a model directly into the SMT decoder. This enables our classifier to extract features not only from the full source sentence but also from a limited target-side context. This allows the model to not only help with semantics but also to improve morphological and syntactic coherence.

For sense disambiguation, source context is the main source of information, as has been shown in previous work (Vickrey et al., 2005; Carpuat and Wu, 2007; Gimpel and Smith, 2008), inter alia. Consider the first set of examples in Figure 1, produced by a strong baseline PBMT system. The English word "shooting" has multiple senses when translated into Czech: it may either be the act of firing a weapon or making a film. When the cue word "film" is close, the phrase-based model is able to use it in one phrase with the ambiguous "shooting", correctly disambiguating the translation. When we add a single word in between, the model fails to capture the relationship and the most frequent sense is selected instead. Wider source context information is required for correct disambiguation.

While word/phrase senses can usually be inferred from the source sentence, the correct selection of surface forms also requires information from the target. Note that we can obtain some information from the source. For example, an English subject is often translated into a Czech subject, in which case the Czech word should be in nominative case. But there are many decisions that happen during decoding which determine morphological and syntactic properties of words – verbs can have translations which differ in valency frames, they may be translated in either active or passive voice (in which case subject and object would be switched), nouns may have different possible translations which differ in gender, etc.

The correct selection of surface forms plays a crucial role in preserving meaning in morphologically rich languages because it is morphology rather than word order that expresses relations between words. (Word order tends to be relatively free and driven more by semantic constraints than by syntactic constraints.)


    Input                              PBMT Output
    shooting of the film .             natáčení filmu . (correct)           shooting_camera of film .
    shooting of the expensive film .   střelby na drahý film . (wrong)      shootings_gun at expensive film .
    the man saw a cat .                muž uviděl kočku . (correct)         man saw cat_acc .
    the man saw a black cat .          muž spatřil černou kočku . (correct) man saw black_acc cat_acc .
    the man saw a yellowish cat .      muž spatřil nažloutlá kočka . (wrong) man saw yellowish_nom cat_nom .

Figure 1: Examples of problems of PBMT: lexical selection and morphological coherence. Each translation is shown with a corresponding gloss.

The language model is only partially able to capture this phenomenon. It has a limited scope and, perhaps more seriously, it suffers from data sparsity. The units captured by both the phrase table and the LM are mere sequences of words. In order to estimate their probability, we need to observe them in the training data (many times, if the estimates should be reliable). However, the number of possible n-grams grows exponentially as we increase n, leading to unrealistic requirements on training data sizes. This implies that the current models can (and often do) miss relationships between words even within their theoretical scope.

The second set of sentences in Figure 1 demonstrates the problem of data sparsity for morphological coherence. While the phrase-based system can correctly transfer the morphological case of "cat" and even "black cat", the less usual "yellowish cat" is mistranslated into nominative case, even though the correct phrase "yellowish ||| nažloutlou" exists in the phrase table. A model with a suitable representation of two preceding words could easily infer the correct case in this example.

Our contributions are the following:

• We show that the addition of a feature-rich discriminative model significantly improves translation quality even for large data sizes and that target-side context information consistently further increases this improvement.

• We provide an analysis of the outputs which confirms that source-context features indeed help with semantic disambiguation (as is well known). Importantly, we also show that our novel use of target context improves morphological and syntactic coherence.

• In addition to extensive experimentation on translation from English to Czech, we also evaluate English to German, English to Polish and English to Romanian tasks, with improvements in translation quality in all tasks, showing that our work is broadly applicable.

• We describe several optimizations which allow target-side features to be used efficiently in the context of phrase-based decoding.

• Our implementation is freely available in the widely used open-source MT toolkit Moses, enabling other researchers to explore discriminative modelling with target context in MT.

2 Discriminative Model with Target-Side Context

Several different ways of using feature-rich models in MT have been proposed, see Section 6. We describe our approach in this section.

2.1 Model Definition

Let f be the source sentence and e its translation. We denote source-side phrases (given a particular phrasal segmentation) (f̄_1, ..., f̄_m) and the individual words (f_1, ..., f_n). We use a similar notation for target-side words/phrases. For simplicity, let e_prev, e_prev-1 denote the words preceding the current target phrase. Assuming a target context size of two, we model the following probability distribution:


\[ P(e \mid f) \propto \prod_{(\bar{e}_i, \bar{f}_i) \in (e,f)} P(\bar{e}_i \mid \bar{f}_i, f, e_{prev}, e_{prev-1}) \quad (1) \]

The probability of a translation is the product of phrasal translation probabilities which are conditioned on the source phrase, the full source sentence and several previous target words.

Let $GEN(\bar{f}_i)$ be the set of possible translations of the source phrase $\bar{f}_i$ according to the phrase table. We also define a "feature vector" function $fv(\bar{e}_i, \bar{f}_i, f, e_{prev}, e_{prev-1})$ which outputs a vector of features given the phrase pair and its context information. We also have a vector of feature weights $w$ estimated from the training data. Then our model defines the phrasal translation probability simply as follows:

\[ P(\bar{e}_i \mid \bar{f}_i, f, e_{prev}, e_{prev-1}) = \frac{\exp(w \cdot fv(\bar{e}_i, \bar{f}_i, f, e_{prev}, e_{prev-1}))}{\sum_{\bar{e}' \in GEN(\bar{f}_i)} \exp(w \cdot fv(\bar{e}', \bar{f}_i, f, e_{prev}, e_{prev-1}))} \quad (2) \]

This definition implies that we have to locally normalize the classifier outputs so that they sum to one.

In PBMT, translations are usually scored by a log-linear model. Our classifier produces a single score (the conditional phrasal probability) which we add to the standard log-linear model as an additional feature. The MT system therefore does not have direct access to the classifier features, only to the final score.

2.2 Global Model

We use the Vowpal Wabbit (VW) classifier in this work.¹ Tamchyna et al. (2014) already integrated VW into Moses. We started from their implementation in order to carry out our work. Classifier features are divided into two "namespaces":

• S. Features that do not depend on the current phrasal translation (i.e., source- and target-context features).
• T. Features of the current phrasal translation.

We make heavy use of feature processing available in VW, namely quadratic feature expansions and label-dependent features. When generating features for a particular set of translations, we first create the shared features (in the namespace S). These only depend on (source and target) context and are therefore constant for all possible translations of a given phrase. (Note that target-side context naturally depends on the current partial translation. However, when we process the possible translations for a single source phrase, the target context is constant.)

Then for each translation, we extract its features and store them in the namespace T. Note that we do not provide a label (or class) to VW – it is up to these translation features to describe the target phrase. (And this is what is referred to as "label-dependent features" in VW.)

Finally, we add the Cartesian product between the two namespaces to the feature set: every shared feature is combined with every translation feature. This setting allows us to train only a single, global model with powerful feature sharing. For example, thanks to the label-dependent format, we can decompose both the source phrase and the target phrase into words and have features such as s_cat t_kočka which capture phrase-internal word translations. Predictions for rare phrase pairs are then more robust thanks to the rich statistics collected for these word-level feature pairs.

¹ http://hunch.net/~vw/
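To make the model concrete, the locally normalised score of Equation (2) together with the namespace setup can be illustrated with a short sketch. This is not the actual Moses/VW integration: the dictionary-of-weights representation and all function names are ours, and VW's hashed feature representation is deliberately simplified away.

```python
import math

def classifier_score(w, shared_feats, translation_feats):
    # w maps (namespace, feature...) tuples to learned weights.
    score = sum(w.get(("S", f), 0.0) for f in shared_feats)
    score += sum(w.get(("T", f), 0.0) for f in translation_feats)
    # Cartesian product of the namespaces (VW's quadratic features):
    # every shared feature is combined with every translation feature.
    for fs in shared_feats:
        for ft in translation_feats:
            score += w.get(("SxT", fs, ft), 0.0)
    return score

def phrasal_probability(w, shared_feats, gen, chosen):
    # Equation (2): a softmax over the candidate translations GEN(f_i)
    # of one source phrase; `gen` maps each candidate to its
    # label-dependent (namespace T) features.
    scores = {e: classifier_score(w, shared_feats, feats)
              for e, feats in gen.items()}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[chosen]) / z
```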


                                               Configurations
Feature Type        Czech                          German                         Polish, Romanian
Source Indicator    f, l, l+t, t                   f, l, l+t, t                   l, t
Source Internal     f, f+a, f+p, l, l+t, t, a+p    f, f+a, f+p, l, l+t, t, a+p    l, l+a, l+p, t, a+p
Source Context      f (-3,3), l (-3,3), t (-5,5)   f (-3,3), l (-3,3), t (-5,5)   l (-3,3), t (-5,5)
Target Context      f (2), l (2), t (2), l+t (2)   f (2), l (2), t (2), l+t (2)   l (2), t (2)
Bilingual Context   —                              l+t/l+t (2)                    l+t/l+t (2)
Target Indicator    f, l, t                        f, l, t                        l, t
Target Internal     f, l, l+t, t                   f, l, l+t, t                   l, t

Table 1: List of used feature templates. Letter abbreviations refer to word factors: f (form), l (lemma), t (morphological tag), a (analytical function), p (lemma of dependency parent). Numbers in parentheses indicate context size.

2.3 Extraction of Training Examples

Discriminative models in MT are typically trained by creating one training instance per extracted phrase from the entire training data. The target side of the extracted phrase is a positive label, and all other phrases observed aligned to the extracted phrase (anywhere in the training data) are the negative labels.

We train our model in a similar fashion: for each sentence in the parallel training data, we look at all possible phrasal segmentations. Then for each source span, we create a training example. We obtain the set of possible translations $GEN(\bar{f})$ from the phrase table. Because we do not have actual classes, each translation is defined by its label-dependent features and we associate a loss with it: 0 loss for the correct translation and 1 for all others.

Because we train both our model and the standard phrase table on the same dataset, we use leaving-one-out in the classifier training to avoid over-fitting. We look at phrase counts and co-occurrence counts in the training data and subtract one from the number of occurrences for the current source phrase, target phrase and the phrase pair. If the count goes to zero, we skip the training example. Without this technique, the classifier might learn to simply trust very long phrase pairs which were extracted from the same training sentence.

For target-side context features, we simply use the true (gold) target context. This leads to training which is similar to language model estimation; this model is somewhat similar to the neural joint model for MT (Devlin et al., 2014), but in our case implemented using a linear (maximum-entropy-like) model.

2.4 Training

We use Vowpal Wabbit in the --csoaa_ldf mc setting which reduces our multi-class problem to one-against-all binary classification. We use the logistic loss as our objective. We experimented with various settings of L2 regularization but were not able to get an improvement over not using regularization at all. We train each model with 10 iterations over the data.

We evaluate all of our models on a held-out set. We use the same dataset as for MT system tuning because it closely matches the domain of our test set. We evaluate model accuracy after each pass over the training data to detect over-fitting and we select the model with the highest held-out accuracy.

2.5 Feature Set

Our feature set requires some linguistic processing of the data. We use the factored MT setting (Koehn and Hoang, 2007) and we represent each type of information as an individual factor. On the source side, we use the word surface form, its lemma, morphological tag, analytical function (such as Subj for subjects) and the lemma of the parent node in the dependency parse tree. On the target side, we only use word lemmas and morphological tags.

Table 1 lists our feature sets for each language pair. We implemented indicator features for both the source and target side; these are simply concatenations of the words in the current phrase into a single feature. Internal features describe words within the current phrase. Context features are extracted either from a window of a fixed size around the current phrase (on the source side) or from a limited left-hand side context (on the target side). Bilingual context features are concatenations of target-side context words and their source-side counterparts (according to word alignment); these features are similar to bilingual tokens in bilingual LMs (Niehues et al., 2011). Each of our feature types can be configured to look at any individual factors or their combinations.

The features in Table 1 are divided into three sets. The first set contains label-independent (=shared) features which only depend on the source sentence. The second set contains shared features which depend on target-side context; these can only be used when VW is applied during decoding. We use target context size two in all our experiments.² Finally, the third set contains label-dependent features which describe the currently predicted phrasal translation.

² In preliminary experiments we found that using a single word was less effective and larger context did not bring improvements, possibly because of over-fitting.
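As an illustration of the templates in Table 1, the following sketch extracts source-context and target-context features from factored tokens. The exact feature strings of the real implementation are not documented here, so the names below are purely illustrative; tokens are assumed to be dicts with "form", "lemma" and "tag" keys.

```python
def source_context_features(tokens, start, end, factor, window):
    # One feature per offset in a fixed window around the current
    # source phrase (tokens[start:end]), e.g. the t (-5,5) template.
    feats = []
    for i in range(start - window, end + window):
        if start <= i < end or not (0 <= i < len(tokens)):
            continue  # skip the phrase itself and out-of-sentence positions
        off = i - start if i < start else i - end + 1
        feats.append("sctx_%s_%+d=%s" % (factor, off, tokens[i][factor]))
    return feats

def target_context_features(prev_tokens, factor, size=2):
    # The `size` preceding target words (e_prev, e_prev-1).
    feats = []
    for j, tok in enumerate(reversed(prev_tokens[-size:]), start=1):
        feats.append("tctx_%s_-%d=%s" % (factor, j, tok[factor]))
    return feats
```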


Going back to the examples from Figure 1, our model can disambiguate the translation of "shooting" based on the source-context features (either the full form or lemma). For the morphological disambiguation of the translation of "yellowish cat", the model has access to the morphological tags of the preceding target words which can disambiguate the correct morphological case.

We used slightly different subsets of the full feature set for different languages. In particular, we left out surface form features and/or bilingual features in some settings because they decreased performance, presumably due to over-fitting.

3 Efficient Implementation

Originally, we assumed that using target-side context features in decoding would be too expensive, considering that we would have to query our model roughly as often as the language model. In preliminary experiments, we therefore focused on n-best list re-ranking. We obtained small gains but all of our results were substantially worse than with the integrated model, so we omit them from the paper.

We find that decoding with a feature-rich target-context model is in fact feasible. In this section, we describe optimizations at different stages of our pipeline which make training and inference with our model practical.

3.1 Feature Extraction

We implemented the code for feature extraction only once; identical code is used at training time and in decoding. At training time, the generated features are written into a file whereas at test time, they are fed directly into the classifier via its library interface.

This design decision not only ensures consistency in feature representation but also makes the process of feature extraction efficient. In training, we are easily able to use multi-threading (already implemented in Moses) and because the processing of training data is a trivially parallel task, we can also use distributed computation and run separate instances of (multi-threaded) Moses on several machines. This enables us to easily produce training files from millions of parallel sentences within a short time.

3.2 Model Training

VW is a very fast classifier by itself; however, for very large data, its training can be further sped up by using parallelization. We take advantage of its implementation of the AllReduce scheme which we utilize in a grid engine environment. We shuffle and shard the data and then assign each shard to a worker job. With AllReduce, there is a master job which synchronizes the learned weight vector with all workers. We have compared this approach with the standard single-threaded, single-process training and found that we obtain identical model accuracy. We usually use around 10-20 training jobs. This way, we can process our large training files quickly and train the full model (using multiple passes over the data) within hours; effectively, neither feature extraction nor model training becomes a significant bottleneck in the full MT system training pipeline.

3.3 Decoding

In phrase-based decoding, translation is generated from left to right. At each step, a partial translation (initially empty) is extended by translating a previously uncovered part of the source sentence. There are typically many ways to translate each source span, which we refer to as translation options. The decoding process gradually extends the generated partial translations until the whole source sentence is covered; the final translation is then the full translation hypothesis with the highest model score. Various pruning strategies are applied to make decoding tractable.

Evaluating a feature-rich classifier during decoding is a computationally expensive operation. Because the features in our model depend on target-side context, the feature function which computes the classifier score cannot evaluate the translation options in isolation (independently of the partial translation). Instead, similarly to a language model, it needs to look at previously generated words. This also entails maintaining a state which captures the required context information.

A naive integration of the classifier would simply generate all source-context features, all target-context features and all features describing the translation option each time a partial hypothesis is evaluated. This is a computationally very expensive approach.
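The state mentioned above can be thought of as the target-side analogue of a language-model state. A minimal sketch of this idea (our own simplification; the actual Moses implementation represents states differently):

```python
class TargetContextState:
    """The two most recent target words, with their factors. Hypotheses
    that end in the same two words share the same state, so cached
    classifier results can be reused across them."""

    def __init__(self, prev_words=()):
        self.prev_words = tuple(prev_words)[-2:]

    def extend(self, phrase_words):
        # Appending a translated phrase yields the successor state.
        return TargetContextState(self.prev_words + tuple(phrase_words))

    def __eq__(self, other):
        return self.prev_words == other.prev_words

    def __hash__(self):
        return hash(self.prev_words)
```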


We instead propose several technical solutions which make decoding reasonably fast. Decoding a single sentence with the naive approach takes 13.7 seconds on average. With our optimization, this average time is reduced to 2.9 seconds, i.e. almost by 80 per cent. The baseline system produces a translation in 0.8 seconds on average.

Separation of source-context and target-context evaluation. Because we have a linear model, the final score is simply the dot product between a weight vector and a (sparse) feature vector. It is therefore trivial to separate it into two components: one that only contains features which depend on the source context and the other with target-context features. We can pre-compute the source-context part of the score before decoding (once we have all translation options for the given sentence). We cache these partial scores and when the translation option is evaluated, we add the partial score of the target-context features to arrive at the final classifier score.

Caching of feature hashes. VW uses feature hashing internally and it is possible to obtain the hash of any feature that we use. When we encounter a previously unseen target context (=state) during decoding, we store the hashes of extracted features in a cache. Therefore for each context, we only run the expensive feature extraction once. Similarly, we pre-compute feature hash vectors for all translation options.

Caching of final results. Our classifier locally normalizes the scores so that the probabilities of all translations for a given span sum to one. This cannot be done without evaluating all translation options for the span at the same time. Therefore, when we get a translation option to be scored, we fetch all translation options for the given source span and evaluate all of them. We then normalize the scores and add them to a cache of final results. When the other translation options come up, their scores are simply fetched from the cache. This can also further save computation when we get into a previously seen state (from the point of view of our classifier) and we evaluate the same set of translation options in that state; we will simply find the result in cache in such cases.

When we combine all of these optimizations, we arrive at the query algorithm shown in Figure 2.

    function EVALUATE(t, s)
        span = t.getSourceSpan()
        if not resultCache.has(span, s) then
            scores = ()
            if not stateCache.has(s) then
                stateCache[s] = CtxFeatures(s)
            end if
            for all t′ ∈ span.tOpts() do
                srcScore = srcScoreCache[t′]
                c.addFeatures(stateCache[s])
                c.addFeatures(translationCache[t′])
                tgtScore = c.predict()
                scores[t′] = srcScore + tgtScore
            end for
            normalize(scores)
            resultCache[span, s] = scores
        end if
        return resultCache[span, s][t]
    end function

Figure 2: Algorithm for obtaining classifier predictions during decoding. The variable t stands for the current translation, s is the current state and c is an instance of the classifier.
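The first of the optimizations above exploits the linearity of the model: because the classifier score is a dot product, it splits exactly into a source-context part (constant for the whole sentence, computed once) and a target-context part (added per state). A sketch under our own naming conventions; translation options are assumed to be hashable objects:

```python
def precompute_source_scores(w, translation_options):
    # Computed once per sentence, before decoding starts.
    cache = {}
    for t in translation_options:
        cache[t] = sum(w.get(f, 0.0) for f in t.source_features)
    return cache

def full_classifier_score(w, src_score_cache, t, target_context_feats):
    # At evaluation time, only the target-context part is summed up
    # and added to the pre-computed source-context part.
    return src_score_cache[t] + sum(w.get(f, 0.0) for f in target_context_feats)
```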


4 Experimental Evaluation

We run the main set of experiments on English to Czech translation. To verify that our method is applicable to other language pairs, we also present experiments in English to German, Polish, and Romanian.

In all experiments, we use Treex (Popel and Žabokrtský, 2010) to lemmatize and tag the source data and also to obtain dependency parses of all English sentences.

4.1 English-Czech Translation

As parallel training data, we use (subsets of) the CzEng 1.0 corpus (Bojar et al., 2012). For tuning, we use the WMT13 test set (Bojar et al., 2013) and we evaluate the systems on the WMT14 test set (Bojar et al., 2014). We lemmatize and tag the Czech data using Morphodita (Straková et al., 2014).

Our baseline system is a standard phrase-based Moses setup. The phrase table in both cases is factored and outputs also lemmas and morphological tags. We train a 5-gram LM on the target side of the parallel data.

We evaluate three settings in our experiments:

• baseline – vanilla phrase-based system,
• +source – our classifier with source-context features only,
• +target – our classifier with both source-context and target-context features.

For each of these settings, we vary the size of the training data for our classifier, the phrase table and the LM. We experiment with three different sizes: small (200 thousand sentence pairs), medium (5 million sentence pairs), and full (the whole CzEng corpus, over 14.8 million sentence pairs).

For each setting, we run system weight optimization (tuning) using minimum error rate training (Och, 2003) five times and report the average BLEU score. We use MultEval (Clark et al., 2011) to compare the systems and to determine whether the differences in results are statistically significant. We always compare the baseline with +source and +source with +target.

Table 2 shows the obtained results. Statistically significant differences (α=0.01) are marked in bold. The source-context model does not help in the small data setting but brings a substantial improvement of 0.7-0.8 BLEU points for the medium and full data settings, which is an encouraging result.

Target-side context information allows our model to push the translation quality further: even for the small data setting, it brings a substantial improvement of 0.5 BLEU points and the gain remains significant as the data size increases. Even in the full data setting, target-side features improve the score by roughly 0.2 BLEU points.

Our results demonstrate that feature-rich models scale to large data size both in terms of technical feasibility and of translation quality improvements. Target-side information seems consistently beneficial, adding further 0.2-0.5 BLEU points on top of the source-context model.

data size   small   medium   full
baseline    10.7    15.2     16.7
+source     10.7    16.0     17.3
+target     11.2    16.4     17.5

Table 2: BLEU scores obtained on the WMT14 test set. We report the performance of the baseline, the source-context model and the full model.

Intrinsic Evaluation. For completeness, we report intrinsic evaluation results. We evaluate the classifier on a held-out set (WMT13 test set) by extracting all phrase pairs from the test input aligned with the test reference (similarly as we would in training) and scoring each phrase pair (along with other possible translations of the source phrase) with our classifier. An instance is classified correctly if the true translation obtains the highest score by our model. A baseline which always chooses the most frequent phrasal translation obtains an accuracy of 51.5. For the source-context model, the held-out accuracy was 66.3, while the target-context model achieved an accuracy of 74.8. Note that this high difference is somewhat misleading because in this setting, the target-context model has access to the true target context (i.e., it is cheating).

4.2 Additional Language Pairs

We experiment with translation from English into German, Polish, and Romanian.

Our English-German system is trained on the data available for the WMT14 translation task: Europarl (Koehn, 2005) and the Common Crawl corpus,³ roughly 4.3 million sentence pairs altogether. We tune the system on the WMT13 test set and we test on the WMT14 set. We use TreeTagger (Schmid, 1994) to lemmatize and tag the German data.

English-Polish has not been included in WMT shared tasks so far, but was present as a language pair for several IWSLT editions which concentrate on TED talk translation. Full test sets are only available for 2010, 2011, and 2012. The references for 2013 and 2014 were not made public. We use the development set and test set from 2010 as development data for parameter tuning. The remaining two test sets (2011, 2012) are our test data. We train on the concatenation of Europarl and WIT3 (Cettolo et al., 2012), ca. 750 thousand sentence pairs. The Polish half has been tagged using WCRFT (Radziszewski, 2013) which produces full morphological tags compatible with the NKJP tagset (Przepiórkowski, 2009).

English-Romanian was added in WMT16. We train our system using the available parallel data – Europarl and SETIMES2 (Tiedemann, 2009), roughly 600 thousand sentence pairs. We tune the English-Romanian system on the official development set and we test on the WMT16 test set. We use the online tagger by Tufis et al. (2008) to preprocess the data.

Table 3 shows the obtained results.

³ http://commoncrawl.org/
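The intrinsic evaluation just described can be made concrete with a small sketch. The data representation is hypothetical: each test instance carries the source phrase, the reference translation, the phrase-table candidates and the (gold) context.

```python
def intrinsic_accuracy(instances, score):
    # An instance counts as correct if the true translation gets the
    # highest model score among all candidates for its source phrase.
    correct = sum(1 for src, true_tgt, candidates, ctx in instances
                  if max(candidates, key=lambda t: score(src, t, ctx)) == true_tgt)
    return 100.0 * correct / len(instances)

def most_frequent_baseline(instances, cooc_counts):
    # The 51.5 baseline: always pick the translation most frequently
    # observed with the source phrase in the training data.
    correct = sum(1 for src, true_tgt, candidates, _ in instances
                  if max(candidates, key=lambda t: cooc_counts.get((src, t), 0)) == true_tgt)
    return 100.0 * correct / len(instances)
```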


input:    the most intensive mining took place there from 1953 to 1962 .
baseline: nejvíce intenzivní těžba došlo tam z roku 1953 , aby 1962 .
          the most intensive mining_nom occurred there from 1953 , in order to 1962 .
+source:  nejvíce intenzivní těžby místo tam z roku 1953 do roku 1962 .
          the most intensive mining_gen place there from year 1953 until year 1962 .
+target:  nejvíce intenzivní těžba probíhala od roku 1953 do roku 1962 .
          the most intensive mining_nom occurred from year 1953 until year 1962 .

Figure 3: An example sentence from the test set. Each translation has a corresponding gloss in italics. Errors are marked in bold.

Similarly to the English-Czech experiments, BLEU scores are averaged over 5 independent optimization runs. Our system outperforms the baseline by 0.5-0.7 BLEU points in all cases, showing that the method is applicable to other languages with rich morphology.

language   de     pl (2011)   pl (2012)   ro
baseline   15.7   12.8        10.4        19.6
+target    16.2   13.4        11.1        20.2

Table 3: BLEU scores of the baseline and of the full model for English to German, Polish, and Romanian.

5 Analysis

We manually analyze the outputs of English-Czech systems. Figure 3 shows an example sentence from the WMT14 test set translated by all the system variants. The baseline system makes an error in verb valency; the Czech verb "došlo" could be used but this verb already has an (implicit) subject and the translation of "mining" ("těžba") would have to be in a different case and at a different position in the sentence. The second error is more interesting, however: the baseline system fails to correctly identify the word sense of the particle "to" and translates it in the sense of purpose, as in "in order to". The source-context model takes the context (span of years) into consideration and correctly disambiguates the translation of "to", choosing the temporal meaning. It still fails to translate the main verb correctly, though. Only the full model with target-context information is able to also correctly translate the verb and inflect its arguments according to their roles in the valency frame. The translation produced by this final system in this case is almost flawless.

In order to verify that the automatically measured results correspond to visible improvements in translation quality, we carried out two annotation experiments. We took a random sample of 104 sentences from the test set and blindly ranked two competing translations (the selection of sentences was identical for both experiments). In the first experiment, we compared the baseline system with +source. In the other experiment, we compared the baseline with +target. The instructions for annotation were simply to compare overall translation quality; we did not ask the annotator to look for any specific phenomena. In terms of automatic measures, our selection has similar characteristics as the full test set: BLEU scores obtained on our sample are 15.08, 16.22 and 16.53 for the baseline, +source and +target respectively.

In the first case, the annotator marked 52 translations as equal in quality, 26 translations produced by +source were marked as better and in the remaining 26 cases, the baseline won the ranking. Even though there is a difference in BLEU, human annotation does not confirm this measurement, ranking both systems equally.

In the second experiment, 52 translations were again marked as equal. In 34 cases, +target produced a better translation while in 18 cases, the baseline output won. The difference between the baseline and +target suggests that the target-context model may provide information which is useful for translation quality as perceived by humans.

Our overall impression from looking at the system outputs was that both the source-context and target-context models tend to fix many morpho-syntactic errors. Interestingly, we do not observe as many improvements in word/phrase sense disambiguation, though the source context does help semantics in some sentences. The target-context model tends to preserve the overall agreement and coherence better than the system with a source-context model only. We list several such examples in Figure 4.


input:    destruction of the equipment means that Syria can no longer produce new chemical weapons .
+source:  zničením zařízení znamená , že Sýrie již nemůže vytvářet nové chemické zbraně .
          destruction_instr of equipment means , that Syria already cannot produce new chemical weapons .
+target:  zničení zařízení znamená , že Sýrie již nemůže vytvářet nové chemické zbraně .
          destruction_nom of equipment means , that Syria already cannot produce new chemical weapons .

input:    nothing like that existed , and despite that we knew far more about each other .
+source:  nic takového neexistovalo , a přesto jsme věděli daleko víc o jeden na druhého .
          nothing like that existed , and despite that we knew far more about one_nom on other .
+target:  nic takového neexistovalo , a přesto jsme věděli daleko víc o sobě navzájem .
          nothing like that existed , and despite that we knew far more about each other .

input:    the authors have been inspired by their neighbours .
+source:  autoři byli inspirováni svých sousedů .
          the authors have been inspired their_gen neighbours_gen .
+target:  autoři byli inspirováni svými sousedy .
          the authors have been inspired their_instr neighbours_instr .

Figure 4: Example sentences from the test set showing improvements in morphological coherence. Each translation has a corresponding gloss in italics. Errors are marked in bold.

Each of them is fully corrected by the target-context model, producing an accurate translation of the input.

6 Related Work

Discriminative models in MT have been proposed before. Carpuat and Wu (2007) trained a maximum entropy classifier for each source phrase type which used source context information to disambiguate its translations. The models did not capture target-side information and they were independent; no parameters were shared between classifiers for different phrases. They used a strong feature set originally developed for word sense disambiguation. Gimpel and Smith (2008) also used wider source-context information but did not train a classifier; instead, the features were included directly in the log-linear model of the decoder. Mauser et al. (2009) introduced the "discriminative word lexicon" and trained a binary classifier for each target word, using as features only the bag of words (from the whole source sentence). Training sentences where the target word occurred were used as positive examples, other sentences served as negative examples. Jeong et al. (2010) proposed a discriminative lexicon with a rich feature set tailored to translation into morphologically rich languages; unlike our work, their model only used source-context features.

Subotin (2011) included target-side context information in a maximum-entropy model for the prediction of morphology. The work was done within the paradigm of hierarchical PBMT and assumes that cube pruning is used in decoding. Their algorithm was tailored to the specific problem of passing non-local information about morphological agreement required by individual rules (such as explicit rules enforcing subject-verb agreement). Our algorithm only assumes that hypotheses are constructed left to right and provides a general way for including target context information in the classifier, regardless of the type of features. Our implementation is freely available and can be further extended by other researchers in the future.

7 Conclusions

We presented a discriminative model for MT which uses both source and target context information. We have shown that such a model can be used directly during decoding in a relatively efficient way. We have shown that this model consistently significantly improves the quality of English-Czech translation over a strong baseline with large training data. We have validated the effectiveness of our model on several additional language pairs. We have provided an analysis showing concrete examples of improved lexical selection and morphological coherence. Our work is available in the main branch of Moses for use by other researchers.

Acknowledgements

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 644402 (HimL) and 645452 (QT21), from the European Research Council (ERC) under grant agreement no. 640550, and from the SVV project number 260 333. This work has been using language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).


References

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2012. The Joy of Parallelism with CzEng 1.0. In Proc. of LREC, pages 3921–3928. ELRA.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, MD, USA. Association for Computational Linguistics.

Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, Czech Republic.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, MD, USA.

K. Gimpel and N. A. Smith. 2008. Rich Source-Side Context for Statistical Machine Translation. Columbus, Ohio.

Minwoo Jeong, Kristina Toutanova, Hisami Suzuki, and Chris Quirk. 2010. A discriminative lexicon model for complex morphology. In The Ninth Conference of the Association for Machine Translation in the Americas. Association for Computational Linguistics, November.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Arne Mauser, Sasa Hasan, and Hermann Ney. 2009. Extending Statistical Machine Translation with Discriminative and Trigger-Based Lexicon Models. Pages 210–218, Suntec, Singapore.

Jan Niehues, Teresa Herrmann, Stephan Vogel, and Alex Waibel. 2011. Wider Context by Using Bilingual Language Models in Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 198–206, Edinburgh, Scotland, July. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL, pages 160–167, Sapporo, Japan. ACL.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP Framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, IceTAL 2010, volume 6233 of Lecture Notes in Computer Science, pages 293–304. Iceland Centre for Language Technology (ICLT), Springer.

Adam Przepiórkowski. 2009. A comparison of two morphosyntactic tagsets of Polish. In Violetta Koseska-Toszewa, Ludmila Dimitrova, and Roman Roszko, editors, Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, pages 138–144, Warsaw.

Adam Radziszewski. 2013. A tiered CRF tagger for Polish. In Robert Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform, volume 467 of Studies in Computational Intelligence, pages 215–230. Springer.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June. Association for Computational Linguistics.

Michael Subotin. 2011. An exponential translation model for target language morphology. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 230–238.

Aleš Tamchyna, Fabienne Braune, Alexander Fraser, Marine Carpuat, Hal Daumé III, and Chris Quirk. 2014. Integrating a discriminative classifier into phrase-based and hierarchical decoding. The Prague Bulletin of Mathematical Linguistics, 101:29–41.

Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan Stefanescu. 2008. Racai's linguistic web services. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco.

D. Vickrey, L. Biewald, M. Teyssier, and D. Koller. 2005. Word-Sense Disambiguation for Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Vancouver, Canada, October.


H CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten

CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten

Aleš Tamchyna¹,²   Roman Sudarikov¹   Ondřej Bojar¹   Alexander Fraser²
¹Charles University in Prague, Prague, Czech Republic
²LMU Munich, Munich, Germany
[email protected]   [email protected]

Abstract

This paper describes the phrase-based systems jointly submitted by CUNI and LMU to the English-Czech and English-Romanian News translation tasks of WMT16. In contrast to previous years, we strictly limited our training data to the constrained datasets, to allow for a reliable comparison with other research systems. We experiment with using several additional models in our system, including a feature-rich discriminative model of phrasal translation.

1 Introduction

We have a long-term experience with English-to-Czech machine translation and over the years, our systems have grown together from a rather diverse set of system types to a single system combination called CHIMERA (Bojar et al., 2013).

This system has been successful in the previous three years of WMT (Bojar et al., 2013; Tamchyna et al., 2014; Bojar and Tamchyna, 2015) and we follow a similar design this year. Unlike previous years, we only use constrained data in system training, to allow for a more meaningful comparison with the competing systems. The gains thanks to the additional data, in contrast to the gains thanks to the system combination, have been evaluated in terms of BLEU in Bojar and Tamchyna (2015). The details of our English-to-Czech system are in Section 2.

In this work, we also present our system submission for English-Romanian translation. This system uses a factored setting similar to CHIMERA but lacks its two key components: the deep-syntactic translation system TectoMT and the rule-based post-processing component Depfix. All details are in Section 3.

2 English-Czech System

Our "baseline" setup is fairly complex, following Bojar et al. (2013). The key components of CHIMERA are:

• Moses, a phrase-based factored system (Koehn et al., 2007).
• TectoMT, a deep-syntactic transfer-based system (Popel and Žabokrtský, 2010).
• Depfix, a rule-based post-processing system (Rosa et al., 2012).

The core of the system is Moses. We combine it with TectoMT in a simple way which we refer to as "poor man's" system combination: we translate our development and test data with TectoMT first and then add the source sentences and their translations as additional (synthetic) parallel data to the Moses system. This new corpus is used to train a separate phrase table. At test time, we run Moses which uses both phrase tables and we correct its output using Depfix. The system is described in detail in Bojar et al. (2013).

Our subsequent analysis in Tamchyna and Bojar (2015) shows that the contribution of TectoMT is essential for the performance of CHIMERA. In particular, TectoMT provides new translations which are otherwise not available to the phrase-based system and it also improves the morphological and syntactic coherence of translations.
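A minimal sketch of the "poor man's" combination step just described. The function names are ours; in reality both phrase tables are trained with the standard Moses pipeline rather than a single call:

```python
def poor_mans_combination(tectomt_translate, dev_sents, test_sents, train_phrase_table):
    # TectoMT translations of the dev and test sources become a small
    # synthetic parallel corpus...
    synthetic = [(src, tectomt_translate(src)) for src in dev_sents + test_sents]
    # ...from which a second, separate phrase table is trained. At test
    # time Moses consults both this table and the main one.
    return train_phrase_table(synthetic)
```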


2.1 Translation Models

Similarly to previous years, we build two phrase tables – one from parallel data and another from TectoMT translations of the development and test sets. Here we describe the first phrase table.

Our main system uses CzEng16pre (Bojar et al., 2016) as parallel data. We train a factored TM which uses surface forms on the source and produces target form, lemma and tag. Similarly to previous years, we find that increasing the phrase table limit (the maximum number of possible translations per source phrase) is necessary to obtain good performance.

Our input is also factored (though the phrase tables do not condition on these additional factors) and contains the form, lemma and morphological tag. We use these factors to extract rich features for our discriminative context model.

Linearly interpolated translation models. There is some evidence that when dealing with heterogeneous domains, it might be beneficial to construct the final TM as a linear, uniform interpolation of many small phrase tables (Carpuat et al., 2014). We experiment with splitting the data into 20 parts (without any domain selection, simply a random shuffle) and using linear interpolation to combine the partial models. The added benefit is that phrase extraction for all these parts can run in parallel (2h25m per part on average). The merging of these parts took 16h12m, which is still substantially faster than the single extraction (53h7m).

2.2 Language Models

Our LM configuration is based on the successful setting from previous years; however, all LMs are trained using the constrained data. This is a major difference from our previous submissions which used several gigawords of monolingual text for language modeling.

We train a 7-gram LM on surface forms from all monolingual news data available for WMT. This LM is linearly interpolated (each year is a separate model) to optimize perplexity on a held-out set (WMT newstest2012). The individual LMs were pruned: we discarded all singleton n-grams (apart from unigrams).

All other LMs are trained on a simple concatenation of the news part of CzEng16pre and all WMT monolingual news sets. We train 4-gram LMs on forms and lemmas (with a different pruning scheme: we discard 2- and 3-grams which appear fewer than 2 or 3 times, respectively).

We have two LMs over morphological tags to help maintain the morphological coherence of translation outputs. The first LM is a 10-gram model and the second one is a 15-gram model, aimed at overall sentence structure. We prune all singleton n-grams (again, with the exception of unigrams).
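Linear interpolation of the per-year LMs with weights tuned towards held-out perplexity is standard practice (this is what tools such as SRILM's compute-best-mix do). A compact EM sketch, assuming each model exposes a hypothetical prob(word, history) method:

```python
def em_interpolation_weights(lms, heldout, iterations=20):
    # heldout is a sequence of (word, history) events, e.g. from newstest2012.
    k = len(lms)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        resp = [0.0] * k  # expected counts (responsibilities) per model
        for word, history in heldout:
            probs = [lam * lm.prob(word, history) for lam, lm in zip(weights, lms)]
            z = sum(probs)
            for i, p in enumerate(probs):
                resp[i] += p / z
        total = sum(resp)
        weights = [r / total for r in resp]
    # The interpolated model is P(w|h) = sum_i weights[i] * lms[i].prob(w, h).
    return weights
```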
2.3 Discriminative Translation Model

We add a feature-rich, discriminative model of phrasal translation to our system (Tamchyna et al., 2016). This classifier produces a single phrase translation probability which is additionally conditioned on the full source sentence and limited left-hand-side target context. The probability is added as an additional feature to Moses' log-linear model. The motivation for adding the context model is to improve lexical choice (which can be better inferred thanks to full source-context information) and morphological coherence.

The model uses a rich feature set on both sides: in the source, the model has access to the full input sentence and uses surface forms, lemmas and tags. On the target side, the model has access to limited context (similarly to an LM) and uses target surface forms, lemmas and tags. However, our English-Czech submission to WMT16 does not use target-context information due to time constraints.

2.4 Lexicalized Reordering and OSM

We experiment with using a lexicalized reordering model (Koehn et al., 2005) in the common setting: monotone/swap/discontinuous reordering, word-based extraction, bidirectional, conditioned both on the source and the target language.

We also train an operation sequence model (OSM, Durrani et al., 2013), which is a generative model that sees the translation process as a linear sequence of operations which generate a source and target sentence in parallel. The probability of a sequence of operations is defined according to an n-gram model, that is, the probability of an operation depends on the n−1 preceding operations. We have trained our 5-gram model on surface forms, using the CzEng16pre corpus.

2.5 Hard POS for Short Words

In addition to the more principled attempts at improving our model, mainly Section 2.3, we also manually checked the output and added an ad-hoc solution for the single most disturbing error: the abbreviated form "'s" was often translated as the verb "to be" even in clearly possessive uses.

The ambiguity of "'s" is apparently easy to resolve; our tagger does not have problems distinguishing and tagging the abbreviation as POS (possessive), VBZ (present tense) and other situations.


While the POS information is readily available to the discriminative model, the model might not be able to pick it up due to its wide focus on many phenomena. As an alternative, we simply modify the input token and append the POS tag to it for all tokens under three characters.

This hack clearly helps with "'s": in a small manual analysis of 52 occurrences of "'s", the discriminative model still translated 7 possessive meanings as present tense, while the hacked model avoided these errors. It would be best to combine these two approaches, but we did not have the time to run this setting for the WMT evaluation.
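A sketch of the token modification itself. The "|" separator and the exact threshold below are our guesses at plausible choices; the paper only states that the POS tag is appended to tokens under three characters.

```python
def append_pos_to_short_words(tagged_tokens, max_len=2, sep="|"):
    # "John 's car" (tagged POS) and "he 's here" (tagged VBZ) now yield
    # distinct input tokens "'s|POS" vs. "'s|VBZ", so the phrase table
    # can translate the two uses of "'s" differently.
    return [form + sep + tag if len(form) <= max_len else form
            for form, tag in tagged_tokens]
```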

2.6 Results

We evaluate all system variants on the WMT15 test set and report all BLEU scores in Table 1 prior to applying the last component, Depfix.

The reordering model achieved mixed results in our initial experiments and we opt not to include it in our final submission, relying instead only on the standard distortion penalty feature.

As in previous years, the addition of TectoMT to the main phrase table extracted from the parallel corpus (denoted "CzEng" in Table 1) is highly beneficial, improving the BLEU score by roughly 1.2 points. The addition of OSM also helps, adding about 0.7 points.

The source-context discriminative model does not improve translation quality according to BLEU. We suspect that the space for its contribution is diminished by the addition of TectoMT and possibly also the OSM and the strong LMs. This system (labelled with ∗) was submitted as a primary system CU-TAMCHYNA. After the deadline, we also ran an experiment which included target-context features in the model and obtained a BLEU of 20.96.

Experiments with the interpolated TM ("CzEng20 parts" in the table) and POS appended to words under three characters show a lower BLEU score (20.70, denoted •) but we also carried out a small manual evaluation where the system output seemed to be better than the baseline (20.91). We therefore submitted this system as our primary CU-CHIMERA.

In the official WMT16 manual evaluation, both our systems end up in the same cluster, ranking #4 and #5 among all systems for this language pair. The hacked system (•) seems negligibly better (0.302 TrueSkill) than the one with the discriminative model (∗, reaching 0.299 TrueSkill).

As a contrastive result, CHIMERA, ranking #1 last year, achieves a BLEU score of 20.46 on newstest2015 (also prior to the application of Depfix). This suggests that even though we limited our training data this year, we did not lose anything in terms of translation quality.

TMs                          OSM   Disc.   POS   BLEU
CzEng                        –     –       –     19.08 ± 0.62
CzEng+TectoMT                –     –       –     20.23 ± 0.64
CzEng+TectoMT                ✓     –       –     20.91 ± 0.67
CzEng+TectoMT (∗)            ✓     ✓       –     20.89 ± 0.69
CzEng20 parts+TectoMT (•)    ✓     –       ✓     20.70 ± 0.66
Chimera in WMT15                                 20.46

Table 1: Different experiment configurations for CHIMERA. We report BLEU scores on newstest2015. The system denoted ∗ corresponds to our WMT16 submission cu-tamchyna and the system denoted • corresponds to cu-chimera.

3 English-Romanian System

We also submitted a constrained phrase-based system for English→Romanian translation which is loosely inspired by the basic components of CHIMERA. Additionally, our submission uses the source- and target-context discriminative translation model as well.

3.1 Data and Pre-Processing

We use all the data available to constrained submissions: the Europarl v8 (Koehn, 2005) and SETIMES2 (Tiedemann, 2009) parallel corpora and the News 2015 and Common Crawl monolingual corpora.¹ We split the official development set into two halves; we use the first part for system tuning and the second part serves as our test set.

Data pre-processing differs between English and Romanian. For English, we use Treex (Popel and Žabokrtský, 2010) to obtain morphological tags, lemmas and dependency parses of the sentences. For Romanian, we use the online tagger by Tufis et al. (2008) as run by our colleagues at LIMSI-CNRS for the joint QT21 Romanian system (Peter et al., 2016).

¹ http://commoncrawl.org/


3.2 Factored Translation

Similarly to CHIMERA, we train a factored phrase table which translates source surface forms to tuples (form, lemma, tag). Our input is factored and contains the form, lemma, morphological tag, lemma of dependency parent and analytical function ("surface" syntactic role, e.g. Subj for subjects). These additional source-side factors are again not used by the phrase table and serve only as information for the discriminative model.

3.3 Language Models

Our full system contains three separate language models (LMs). The first is a 5-gram LM over surface forms, trained on the target side of the parallel data and monolingual news 2015.

The second LM only uses 4-grams but additionally contains the full Common Crawl corpus. We prune this second LM by discarding 2-, 3- and 4-grams which appear fewer than 2, 3, 4 times, respectively.

Finally, we also include a 7-gram LM over morphological tags. We only use target parallel data for estimating the model.

3.4 Reordering Model

Similarly to our experiments with CHIMERA, we utilize a lexicalized reordering model (Koehn et al., 2005). Again, we model monotone/swap/discontinuous reordering, word-based extraction, bidirectional, conditioned both on the source and the target language.

3.5 Discriminative Translation Model

We utilize the same discriminative model as for CHIMERA. For English-Romanian, we also use dependency parses of the source sentences and target-side context features as additional sources of information in our official submission.

3.6 Results

Table 2 lists BLEU scores of various system settings. Each BLEU score is an average over 5 runs of system tuning (MERT, Och, 2003). The table shows how the BLEU score develops as we add the individual components to the system: the 7-gram morphological LM ("tagLM"), the 4-gram LM from Common Crawl ("ccrawl"), the lexicalized reordering ("RM") and finally the discriminative translation model ("discTM").

Setting    BLEU
baseline   26.2
+tagLM     26.6
+ccrawl    28.0
+RM        28.1
+discTM    28.3

Table 2: BLEU scores of system variants for English-Romanian translation.

We test for statistical significance using MultEval (Clark et al., 2011); we test each new component against the system without it (i.e., +tagLM is compared to baseline, +ccrawl is tested against +tagLM etc.). When the p-value is lower than 0.05, we mark the result in bold.
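MultEval implements approximate randomization; a commonly used alternative with the same flavour is paired bootstrap resampling, sketched below for any corpus-level metric callable (the `metric` interface is our own assumption):

```python
import random

def paired_bootstrap(metric, hyps_a, hyps_b, refs, samples=1000):
    # Resample the test set with replacement; count how often system B
    # fails to beat system A on the resampled sets.
    n = len(refs)
    losses = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if b <= a:
            losses += 1
    return losses / samples  # approximate p-value for "B is better than A"
```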
We observe a relatively steady additive effect of the individual components: the addition of each model (apart from lexicalized reordering) leads to a statistically significant improvement in translation quality.

Our discriminative model further improves the system, despite only being trained on the parallel data (roughly 0.6 M sentence pairs) and building upon the strong language models which use orders-of-magnitude larger monolingual data (almost 300 M sentences). This variant (BLEU 28.3) corresponds to our submission LMU-CUNI.

4 Conclusion

We have described our English-Czech and English-Romanian submissions to WMT16: CU-CHIMERA, CU-TAMCHYNA and LMU-CUNI.

For English-Czech, our work is an incremental improvement of the previously successful CHIMERA system. This time, our submission is constrained and additionally uses interpolated TMs, an OSM and a discriminative phrasal translation model.

For English-Romanian, we have built a system somewhat similar to the statistical component of CHIMERA. We have added the discriminative model which conditions both on the source and target context to the system and obtained a small but significant improvement in BLEU.

5 Acknowledgement

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 644402 (HimL) and no. 645452 (QT21). This work has been using language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071). This work was partially supported by SVV project number 260 333.


References

Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.

Ondřej Bojar, Rudolf Rosa, and Aleš Tamchyna. 2013. Chimera – Three Heads for English-to-Czech Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 92–98, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar and Aleš Tamchyna. 2015. CUNI in WMT15: Chimera Strikes Again. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 79–83, Lisboa, Portugal. Association for Computational Linguistics.

Marine Carpuat, Cyril Goutte, and George Foster. 2014. Linear mixture models for robust machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 499–509, Baltimore, Maryland, USA. Association for Computational Linguistics.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nadir Durrani, Alexander M. Fraser, Helmut Schmid, Hieu Hoang, and Philipp Koehn. 2013. Can Markov models over minimal translation units help phrase-based SMT? In ACL (2), pages 399–405.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the Association for Computational Linguistics, Sapporo, Japan.

Jan-Thorsten Peter, Tamer Alkhouli, Matthias Huck, Hermann Ney, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, Stella Frank, and Mārcis Pinnis. 2016. The QT21/HimL Combined Machine Translation System. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Berlin, Germany. Association for Computational Linguistics. In print.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP Framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, IceTAL 2010, volume 6233 of Lecture Notes in Computer Science, pages 293–304. Iceland Centre for Language Technology (ICLT), Springer.

Rudolf Rosa, David Mareček, and Ondřej Dušek. 2012. DEPFIX: A System for Automatic Correction of Czech MT Outputs. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 362–368, Montréal, Canada. Association for Computational Linguistics.

Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, and Marcin Junczys-Dowmunt. 2016. Target-Side Context for Discriminative Models in Statistical Machine Translation. In Proc. of ACL, Berlin, Germany. Association for Computational Linguistics. In print.

Aleš Tamchyna and Ondřej Bojar. 2015. What a Transfer-Based System Brings to the Combination with PBMT. In Bogdan Babych, Kurt Eberle, Patrik Lambert, Reinhard Rapp, Rafael Banchs, and Marta Costa-Jussà, editors, Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra), pages 11–20, Stroudsburg, PA, USA. Association for Computational Linguistics.

Aleš Tamchyna, Martin Popel, Rudolf Rosa, and Ondřej Bojar. 2014. CUNI in WMT14: Chimera Still Awaits Bellerophon. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 195–200, Baltimore, MD, USA. Association for Computational Linguistics.

Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan Stefanescu. 2008. Racai's linguistic web services. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco.

Page 91 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

I Sampling Phrase Tables for the Moses Statistical Machine Translation System


Sampling Phrase Tables for the Moses Statistical Machine Translation System

Ulrich Germann

University of Edinburgh

Abstract The idea of virtual phrase tables for statistical machine translation (SMT) that construct phrase table entries on demand by sampling a fully indexed bitext was first proposed ten years ago by Callison-Burch et al. (2005). However, until recently (Germann, 2014), no working and practical implementation of this approach was available in the Moses SMT system. We describe and evaluate this implementation in more detail. Sampling phrase tables are much faster to build and are competitive with conventional phrase tables in terms of translation quality and speed.

1. Introduction

Phrase-based statistical MT translates by concatenating phrase-level translations that are looked up in a dictionary called the phrase table. In this context, a phrase is any sequence of consecutive words, regardless of whether or not it is a phrase from a linguistic point of view. In addition to the translation options for each phrase, the table stores for each translation option a number of scores that are used by the translation engine (decoder) to rank translation hypotheses according to a statistical model. In the Moses SMT system, the phrase table is traditionally pre-computed as shown in Fig. 1. First, all phrases up to an arbitrary length limit (usually between 5 and 7 words), together with their corresponding translations, are extracted from a word-aligned parallel corpus, using the word alignment links as a guide to establish translational correspondence between phrases. Phrase pairs are scored both in the forward and backward translation direction, i.e., p(target | source) and p(source | target), respectively. Computing these scores is traditionally done by sorting the lists on disk first to facilitate the accumulation of marginals.

© 2015 PBML. Distributed under CC BY-NC-ND. Corresponding author: [email protected] Cite as: Ulrich Germann. Sampling Phrase Tables for the Moses Statistical Machine Translation System. The Prague Bulletin of Mathematical Linguistics No. 104, 2015, pp. 39–50. doi: 10.1515/pralin-2015-0012.

Page 92 of 107 Quality Translation 21 D1.4: Semantics in Shallow Models

PBML 104 OCTOBER 2015

(Figure 1, a flowchart: phrase pairs are extracted from the word-aligned bitext (parallel corpus); the phrase pair list is inverted; both halves are sorted and scored; a merge sort produces the full phrase table in text format with forward and backward scores; the table is then pruned and finally binarised.)

Figure 1. Conventional Phrase Table Construction

This approach requires sorting the list of extracted phrase pairs at least twice: once to obtain joint and marginal counts for the estimation of the forward translation probabilities, and once to calculate the marginals for the backward probabilities. In practice, forward and backward scoring take place in parallel, as shown in Figure 1.

The resulting phrase tables often have considerable levels of noise, due to misaligned sentence pairs or alignment errors at the word level. Phrase table pruning removes entries of dubious quality. Even with pruning, conventional phrase tables built from large amounts of parallel data are often too large to be loaded and stored completely in memory. Therefore, various binary phrase table implementations were developed in Moses over the years, providing access to disk-based database structures (Zens and Ney, 2007)1 or using compressed representations that can be mapped into memory and "unpacked" on demand (Junczys-Dowmunt, 2012).
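To make the scoring step concrete, here is a minimal in-memory sketch of relative-frequency phrase scoring, assuming phrase pairs have already been extracted (function and variable names are ours). The on-disk sorting described above exists precisely because real phrase pair lists do not fit in memory the way this toy example does.

from collections import Counter

def score_phrase_pairs(pairs):
    """Relative-frequency scoring of extracted phrase pairs.

    pairs: iterable of (source_phrase, target_phrase) tuples, one per
    extraction event. Returns {(src, trg): (p_fwd, p_bwd)} with
    p_fwd = p(trg | src) and p_bwd = p(src | trg).
    """
    joint = Counter(pairs)               # joint counts c(s, t)
    src_marginal = Counter()             # marginal counts c(s)
    trg_marginal = Counter()             # marginal counts c(t)
    for (s, t), c in joint.items():
        src_marginal[s] += c
        trg_marginal[t] += c
    return {(s, t): (c / src_marginal[s],   # p(t | s)
                     c / trg_marginal[t])   # p(s | t)
            for (s, t), c in joint.items()}

# Toy usage: three extraction events over two phrase pairs.
table = score_phrase_pairs([("das haus", "the house"),
                            ("das haus", "the house"),
                            ("das haus", "the building")])
print(table[("das haus", "the house")])     # (0.666..., 1.0)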

1 The original implementation by R. Zens (PhraseDictionaryBinary) has recently been replaced in Moses by PhraseDictionaryOnDisk (H. Hoang, personal communication).


Due to the way they are built, conventional phrase tables for Moses are fundamentally static in nature: they cannot be updated easily without repeating the entire costly creation process.

2. Phrase tables with on-demand sampling

As an alternative to pre-computed phrase tables, Callison-Burch et al. (2005) suggested the use of suffix arrays (Manber and Myers, 1990) to index the parallel training data for full-text search, and to create phrase table entries on demand at translation time, by sampling the occurrences in the bitext of each source phrase in question and extracting counts and statistics as necessary.

A suffix array over a corpus ⟨w1, ..., wn⟩ is an array ⟨1 ... n⟩ of all token positions in that corpus, sorted in lexicographic order of the token sequences that start at the respective positions. Figure 2 shows a letter-based suffix array over the word 'suffixarray'. For bitext indexing for MT, we index at the word level.

Figure 2. Letter-based suffix array over the word 'suffixarray' (reconstructed):
     7  array
    10  ay
     3  ffixarray
     4  fixarray
     5  ixarray
     9  ray
     8  rray
     1  suffixarray
     2  uffixarray
     6  xarray
    11  y

Given a suffix array and the underlying corpus, we can easily find all occurrences of a given search sequence by performing a binary search in the array to determine the first item that is greater than or equal to the search sequence, and a second search to find the first item that is strictly greater. Every item in this sub-range of the array is the start position of an occurrence of the search sequence in the corpus. From this pool of occurrences, we extract phrase translations for a reasonably large sample using the usual phrase extraction heuristics.

Lopez (2007, 2008) explored this approach in detail in the context of hierarchical phrase-based translation (Chiang, 2007). Schwartz and Callison-Burch (2010) implemented Lopez's methods in the Joshua decoder (Li et al., 2009). Suffix array-based translation rule extraction is also used in cdec (Dyer et al., 2010). However, until recently (Germann, 2014), no efficient, working implementation of sampling phrase tables was available in the Moses decoder. The purpose of this article is to document this implementation in detail, and to present results of empirical evaluations that demonstrate that sampling phrase tables are an attractive, efficient, and competitive alternative to conventional phrase tables for phrase-based SMT.

The apparent lack of interest in sampling phrase tables in the phrase-based SMT community may have been partly due to the fact that naïve implementations of the approach tend to perform worse than conventional phrase tables. To illustrate this point, we repeat in Table 1 the results of a comparison of systems from Germann (2014). Several German-to-English systems were constructed with conventional and sampling phrase tables. All systems were trained on ca. 2 million parallel sentence pairs from Europarl (Version 7) and the News Commentary corpus (Version 9), both available from the web site of the 2014 Workshop on Statistical Machine Translation (WMT).2
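The following sketch illustrates the data structure and the two binary searches on a toy word-level corpus. It is our illustration of the idea, not the Moses implementation, and it uses the key argument of the bisect module, which requires Python 3.10 or later.

import bisect

def build_suffix_array(tokens):
    """Token positions sorted by the suffix starting at each position.
    Naive O(n^2 log n) construction; fine for illustration."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, sa, query):
    """All start positions of query, via two binary searches: the first
    suffix >= query, and the first suffix strictly greater (comparing
    only the first len(query) tokens of each suffix)."""
    m = len(query)
    first = bisect.bisect_left(sa, query, key=lambda i: tokens[i:i + m])
    last = bisect.bisect_right(sa, query, key=lambda i: tokens[i:i + m])
    return sorted(sa[first:last])

corpus = "the cat sat on the mat".split()
sa = build_suffix_array(corpus)
print(find_occurrences(corpus, sa, ["the"]))   # -> [0, 4]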


 #  method                                         low    high   median  mean   95% conf. interval(a)  runs
 1  precomp., Kneser-Ney smoothing                 18.36  18.50  18.45   18.43  17.93 – 18.95          10
 2  precomp., Good-Turing smoothing                18.29  18.63  18.54   18.52  18.05 – 19.05          10
 3  precomp., Good-Turing smoothing, filtered(b)   18.43  18.61  18.53   18.53  18.04 – 19.08          10
 4  precomp., no smoothing                         17.86  18.12  18.07   18.05  17.58 – 18.61          10
 5  max. 1000 smpl., no sm., no bwd. prob.         16.70  16.92  16.84   16.79  16.35 – 17.32          10
 6  max. 1000 smpl., no sm., with bwd. prob.       17.61  17.72  17.69   17.68  17.14 – 18.22           8
 7  max. 1000 smpl., α = .05, with bwd. prob.(c)   18.35  18.43  18.38   18.38  17.86 – 18.90          10
 8  max. 1000 smpl., α = .01, with bwd. prob.      18.43  18.65  18.53   18.52  18.03 – 19.12          10
 9  max. 0100 smpl., α = .01, with bwd. prob.      18.40  18.55  18.46   18.46  17.94 – 19.00          10

Table adapted from Germann (2014).
(a) Computed via bootstrap resampling for the median system in the group.
(b) Top 100 entries per source phrase, selected according to p(t | s).
(c) α: one-sided confidence level of the Clopper-Pearson confidence interval for the observed counts.

Table 1. BLEU scores (de → en) with different phrase score computation methods.

They were tuned on the NewsTest 2013 data set and evaluated on the NewsTest 2014 data set from the Shared Translation Tasks at WMT-2013 and WMT-2014, respectively. The systems differ in the number of feature functions used (with and without backwards phrase-level translation probabilities) and the smoothing methods applied. No lexicalized reordering model was used in these experiments. Each system was tuned 8–10 times in independent tuning runs with Minimum Error Rate Training (MERT; Och, 2003). Table 1 shows low, high, median, and mean scores over the multiple tuning runs for each system.

The first risk in the use of sampling phrase tables is that it is tempting to forfeit the backwards phrase-level translation probabilities in the scoring. The basic sampling and phrase extraction procedure produces source-side marginal and joint counts over a sample of the data, but not the target-side marginal counts necessary to compute p(source | target). These backwards probabilities are, however, important indicators of phrase-level translation quality, and leaving them out hurts performance, as illustrated by a comparison of Lines 2 and 5 in Tab. 1 (standard setup vs. naïve implementation of the sampling approach without backward probabilities and smoothing).

While it is technically possible to "back-sample" phrase translation candidates by performing the sampling and gathering of counts inversely for each phrase translation candidate, this would greatly increase the computational effort required at translation time and slow down the decoder. A convenient and effective short-cut, however, is to simply scale the target-side global phrase occurrence counts of each translation

2 http://www.statmt.org/wmt14/translation-task.html#download


candidate by the proportion of sample size to total source phrase count:

\[
p(\text{source} \mid \text{target}) \approx \frac{\text{joint phrase count in sample}}{\text{total target phrase count}} \cdot \frac{\text{total source phrase count}}{\text{number of source phrases sampled}} \tag{1}
\]
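A direct transcription of Equation (1) in Python; the function and variable names are ours, and the counts in the usage example are invented for illustration.

def approx_backward_prob(joint_in_sample, sample_size,
                         total_source_count, total_target_count):
    """Approximate p(source | target) from a sample, following Eq. (1):
    the joint count observed in the sample is related to the full
    target-side marginal, scaled up by how much the source phrase
    occurrences were undersampled."""
    scale = total_source_count / sample_size
    return joint_in_sample / total_target_count * scale

# A source phrase occurring 50,000 times, of which 1,000 were sampled;
# the pair was seen 200 times in the sample; the target phrase occurs
# 30,000 times in the whole bitext.
print(approx_backward_prob(200, 1000, 50000, 30000))   # 0.333...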

As Line 6 in Tab. 1 shows, this method narrows the performance gap between conventional systems and sampling phrase tables, although it does not perform as well as "proper" computation of the backwards probabilities (cf. Line 4 in the table).

The second disadvantage of the sampling approach is that it cannot use the standard smoothing techniques used to compute smoothed phrase table scores in conventional phrase tables, i.e., Good-Turing or Kneser-Ney, as these require global information about the phrase table that is not available when sampling. The results in Lines 4 and 6 (vs. Line 2) confirm the finding by Foster et al. (2006) that phrase table smoothing improves translation quality.

One particular problem with maximum likelihood (ML) estimates in the context of translation modeling is the over-estimation of the observations in small samples. The smaller the sample, the bigger the estimation error. Since the decoder is free to choose the segmentation of the input into source phrases, it has an incentive to pick long, rare phrases. The smaller sample sizes result in bigger over-estimation of the true translation probabilities. This in turn leads to higher model scores, which is what the decoder aims for. Alas, in this case higher model scores usually do not mean higher translation quality: ML estimates introduce modelling errors. Smoothing dampens this effect.

In lieu of the established smoothing techniques, we counteract the lure of small sample sizes by replacing maximum likelihood estimates with the lower bound of the binomial confidence interval (Clopper and Pearson, 1934) for the observed counts in the actual sample, at an appropriate level of confidence.3 Figure 3 shows the "response curve" of this method for a constant success rate of 1/3, as the underlying sample size increases. In practice, a confidence level of 99% appears to be a reasonable choice: in our German-to-English experiments, using the lower bound of the binomial confidence interval at this level brought the BLEU performance of the system with a sampling phrase table back to the level of decoding with a conventional phrase table (cf. Line 8 vs. Line 2 in Tab. 1).

Figure 3. Lower bound of the binomial confidence interval for a constant success rate of 1/3, as the number of trials grows from 50 to 300; one curve each for the 75%, 90%, 95%, and 99% confidence levels.
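A sketch of this estimator using SciPy: the one-sided Clopper-Pearson lower limit at level α for k successes out of n trials is the α-quantile of the Beta(k, n − k + 1) distribution. The function name is ours; the parameter α corresponds to the smooth option described in Section 6.1.

from scipy.stats import beta

def clopper_pearson_lower(successes, trials, alpha=0.01):
    """One-sided lower bound of the Clopper-Pearson interval, used in
    place of the ML estimate successes/trials; alpha=0.01 corresponds
    to the 99% confidence level."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)

# The same success rate of 1/3 is trusted more as the sample grows:
for n in (30, 300):
    print(n, clopper_pearson_lower(n // 3, n))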

3 An alternative is the use of additional features that keep track of quantized raw joint phrase counts, e.g., how many phrase translations used in a translation hypothesis were actually observed at most once, twice, three times, or more frequently (Mauser et al., 2007).


Another concern about sampling phrase tables is speed. After all, phrase extraction and assembly of phrase table entries on the fly do require more computation at translation time. However, caching of entries once created, as well as multi-threaded sampling, make the current implementation of sampling phrase tables in Moses very competitive with its alternatives.

A comparison of translation times for a large French–English system with sampling phrase tables vs. the compressed phrase tables of Junczys-Dowmunt (2012) (CompactPT) is given in Tab. 3 (Sec. 7) below. CompactPT is the fastest phrase table implementation available in Moses for translation in practice.

3. Lexicalised reordering model

Lexicalized reordering models improve the quality of translation (Galley and Manning, 2008). Sampled lexicalised reordering models were not available for the work presented in Germann (2014), but have been implemented since. Our sampling procedure keeps track of the necessary information for hierarchical lexicalized reordering (Galley and Manning, 2008) and communicates this information to the lexicalised reordering model.

4. Dynamic updates

One special feature of the sampling phrase table implementation in Moses is that it allows parallel text to be added dynamically through an RPC call when moses is run in server mode. This is useful, for example, when Moses serves as the MT back-end in an interactive post-editing scenario, where bilingual humans post-edit the MT output. Dynamic updates allow immediate exploitation of the newly created high-quality data to improve MT performance on the spot.

To accommodate these updates, the phrase table maintains two separate bitexts: the memory-mapped, static background bitext, whose build process is described in Section 5.2 below, and a dynamic foreground bitext that is kept entirely in memory. The phrase table's built-in feature functions can be configured to compute separate scores for the foreground and background corpus, or to simply pool the counts. Details on the use of this feature are available in the Moses online documentation.4
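As an illustration, such an update might look as follows from a Python client. This is a sketch only: the RPC method name (updater), the parameter keys, and the server invocation are our assumptions based on our reading of the Moses server documentation and may differ across Moses versions; consult the online documentation for the authoritative interface.

import xmlrpc.client

# Moses running in server mode, e.g. mosesserver -f moses.ini --server-port 8080
# (assumed invocation; see the Moses online documentation).
server = xmlrpc.client.ServerProxy("http://localhost:8080/RPC2")

# Push one freshly post-edited sentence pair into the dynamic foreground
# bitext. "alignment" uses symal-style word alignment ("0-0 1-1 ...").
# Method name and parameter keys are assumptions, not confirmed API.
server.updater({
    "source":    "das ist ein kleines haus",
    "target":    "this is a small house",
    "alignment": "0-0 1-1 2-2 3-3 4-4",
})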

5. Building and using sampling phrase tables in Moses

5.1. Moses compilation

Compile Moses as usual, but with the switch --with-mm:5

./bjam --with-mm --prefix=...

4 http://www.statmt.org/moses
5 Suffix-array based sampling phrase tables are scheduled to be included with the standard build soon.


The binaries mtt-build, symal2mam, and mmlex-build will be placed in the same directory as the moses executable.

5.2. Binarizing the word-aligned parallel corpus

Binarisation converts the corpus from a text representation into large arrays of 32-bit word IDs, and creates the suffix arrays. Word alignment information is also converted from a text representation (symal output format) to a binary format. In addition, a probabilistic word translation lexicon is extracted from the word-aligned corpus and also stored in binary format. All files are designed to be mapped directly into memory for fast loading.

Let corpus be the base name of the parallel corpus. The tags src and trg are language tags that identify the source and the target language. Normally, these are mnemonic tags such as en, fr, de, etc.

• corpus.src is the source side of the corpus (one sentence per line, tokenized);
• corpus.trg is the respective target side in the same format;
• corpus.src-trg.symal is the word alignment between the two, in the format produced by the symal word alignment symmetriser;
• /some/path/ is the path where the binarized model files will be stored. It must exist prior to running the binarizers. The path specification may include a file prefix bname. for the individual file names, in which case /some/path/bname. should be used instead of /some/path/ in all steps.

Binarisation consists of four steps, the first three of which can be run in parallel (see the sketch after this list).

Step 1: binarise the source side:
    mtt-build < corpus.src -i -o /some/path/src
Step 2: binarise the target side:
    mtt-build < corpus.trg -i -o /some/path/trg
Step 3: binarise the word alignments:
    symal2mam < corpus.src-trg.symal /some/path/src-trg.mam
Step 4: produce a word lexicon for lexical scoring:
    mmlex-build corpus src trg -o /some/path/src-trg.lex

Steps 1 and 2 will produce 3 files each: a map from word strings to word IDs and vice versa (*.tdx), a file with the binarized corpus (*.mct), and the corresponding suffix array (*.sfa). Steps 3 and 4 produce one file each (*.mam and *.lex, respectively).
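The four steps can be scripted. Below is a minimal Python wrapper that mirrors the commands above, running the first three concurrently as the text permits; the base name corpus, the tags src and trg, and the output path /some/path/ are the placeholders from this section.

import subprocess
from concurrent.futures import ThreadPoolExecutor

BASE, OUT = "corpus", "/some/path/"   # placeholders from Section 5.2

def run(cmd, stdin_path=None):
    """Run one binarisation command, optionally feeding a file on stdin."""
    if stdin_path:
        with open(stdin_path, "rb") as f:
            subprocess.run(cmd, stdin=f, check=True)
    else:
        subprocess.run(cmd, check=True)

# Steps 1-3 are independent of each other and may run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    jobs = [
        pool.submit(run, ["mtt-build", "-i", "-o", OUT + "src"], BASE + ".src"),
        pool.submit(run, ["mtt-build", "-i", "-o", OUT + "trg"], BASE + ".trg"),
        pool.submit(run, ["symal2mam", OUT + "src-trg.mam"], BASE + ".src-trg.symal"),
    ]
    for job in jobs:
        job.result()   # propagate any failure

# Step 4: build the word translation lexicon.
run(["mmlex-build", BASE, "src", "trg", "-o", OUT + "src-trg.lex"])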

5.3. Setting up entries in the moses.ini file

In the section [feature], add the following two entries:

LexicalReordering name=DM0 type=hier-mslr-bidirectional-fe-allff
Mmsapt name=PT0 lrfunc=DM0 path=/some/path/ L1=src L2=trg sample=1000

Note that the value of the path parameter must end in '/' or '.', depending on whether it points to a directory or includes a file name prefix. The value of the parameter lrfunc must match the name of the lexical reordering feature.


5.4. Setting up sampling phrase tables in EMS

In Moses's Experiment Management System (EMS), the use of sampling phrase tables can be specified by adding the following two lines to the EMS configuration file:

mmsapt = "sample=1000"
binarize-all = $moses-script-dir/training/binarize-model.perl

6. Configuring the phrase table

The phrase table implementation offers numerous configuration options. Due to space constraints, we list only the most important ones here; the full documentation can be found in the online Moses documentation at http://statmt.org/moses. All options can be specified in the phrase table's configuration line in moses.ini in the format key=value. Below, the letter 'n' designates natural numbers, 'f' floating-point numbers, and 's' strings.

6.1. General Options

sample=n    the maximum number of samples considered per source phrase.
smooth=f    the "smoothing" parameter. A value of 0.01 corresponds to a 99% confidence interval.
workers=n   the degree of parallelism for sampling. By default (workers=0), all available cores are used. The phrase table implements its own thread pool; the general Moses option threads has no effect here.
cache=n     size of the cache. Once the limit is reached, the least recently used entries are dropped first.
ttable-limit=n   maximum number of distinct translations to return.
extra=s     path to additional word-aligned parallel data to seed the foreground corpus, for use in an interactive dynamic scenario where phrase tables can be updated while the server is running. This use case is explained in more detail in Germann (2014).
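For example, a phrase table line combining several of these options with the declaration from Section 5.3 might look as follows; the particular values are illustrative, not recommendations.

Mmsapt name=PT0 lrfunc=DM0 path=/some/path/ L1=src L2=trg sample=1000 smooth=0.01 workers=0 cache=10000 ttable-limit=20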

6.2. Feature Functions

Currently, word-level lexical translation scores are always computed and provided. Below we list some core feature scores that the phrase table can provide. A comprehensive list including experimental features is provided online at http://statmt.org/moses.

pfwd=g[+]   forward phrase-level translation probability. If g+ is specified, scores are computed and reported separately for the static background and the dynamic foreground corpus. Otherwise, the underlying counts are pooled.
pbwd=g[+]   backwards phrase-level translation probability, with the same interpretation of the value specified as for pfwd.


source                       sentence pairs   French tokens   English tokens
CommonCrawl                       3.2 M            86 M             78 M
EuroParl                          2.0 M            58 M             52 M
Fr–En Gigaword                   21.4 M           678 M            562 M
News Commentary                   0.2 M             6 M              5 M
UN                               12.3 M           367 M            318 M
Total for TM training            39.1 M         1,185 M          1,016 M
News data for LM training       140.0 M              –           2,874 M

Table 2. Corpus statistics for the parallel WMT-2015 French-English training data.

lenrat={0|1}   phrase length ratio score (off/on). Phrase pair creation is modelled as a Bernoulli process with a biased coin: 'heads' produces a word in L1, 'tails' produces a word in L2. The bias of the coin is determined by the ratio of the lengths (in words) of the two sides of the training corpus. This score is the log of the probability that the phrase length ratio is no more extreme (removed from the mean) than the one observed.
rare=f   rarity penalty: f / (f + j), where j is the phrase pair joint count. This feature is always computed on the pooled counts of the foreground and background corpus.
prov=f   provenance reward: j / (f + j). This feature is always computed separately for the foreground and background corpus.
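To make the last two scores concrete, here is a small sketch of the two formulas as functions of the constant f and the joint count j (the function names are ours):

def rarity_penalty(f, j):
    """rare feature: f / (f + j); close to 1 for rarely observed phrase
    pairs, close to 0 for frequent ones."""
    return f / (f + j)

def provenance_reward(f, j):
    """prov feature: j / (f + j); grows with the joint count, rewarding
    pairs that are well attested in a given (sub-)corpus."""
    return j / (f + j)

# With f = 1: a singleton pair is penalised heavily, a frequent one barely.
print(rarity_penalty(1.0, 1), rarity_penalty(1.0, 99))        # 0.5  0.01
print(provenance_reward(1.0, 1), provenance_reward(1.0, 99))  # 0.5  0.99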

7. Performance on a large dataset

Table 3 shows build times and translation performance of two systems built with the large French–English data set available for the WMT-2015 shared translation task (cf. Tab. 2). The first system uses a pruned conventional phrase table binarized as a compact phrase table (Junczys-Dowmunt, 2012); tuning was performed with an unpruned, filtered in-memory phrase table. The other system uses a sampling phrase table. The systems were tuned on 760 sentence pairs from the newsdiscussdev2015 development set and evaluated on the newsdiscusstest2015 test set.

For technical reasons, we were not able to run the build processes on dedicated machines with identical specifications; the build times reported are therefore only approximate. To give the conventional phrase table construction process the benefit of the doubt, the data binarization for the sampling phrase table was performed on a less powerful machine (8 cores) than the conventional phrase table construction (24–32 cores), although not all steps in the process can utilize multiple cores. Nevertheless, even under these slightly unfair conditions the time savings of the sampling approach are obvious. The translation speed experiments were performed with cube pruning (pop limit: 1000) on the same 24-core machine with 148 GB of memory, translating the test set of 1,500 sentences (30,000 words) in bulk using all available cores on the machine.


                           conventional system               sampling phrase tables
phrase table build time    ≫ 20 hrs.                         ca. 1 h 30 min

Model features:            total: 28                         total: 18
• word penalty             yes                               yes
• phrase penalty           yes                               yes
• distortion distance      yes                               yes
• language model           5-gram Markov model               5-gram Markov model
• TM: phrase transl.       forward, backward                 forward, backward,
                           w/ Good-Turing smoothing          lower bound of 99% conf. interv.
• TM: lexical transl.      forward, backward                 forward, backward
• rare counts              6 bins: 1/2/3/4/6/10              rarity penalty
• lex. reord. model        hierarchical-fe-mslr-all-ff       hierarchical-fe-mslr-all-ff
• phrase length ratio      no                                yes

Evaluation (newsdiscusstest2015), 3 independent tuning runs;
95% confidence intervals computed via bootstrap resampling:

               conventional system              sampling phrase tables
               run   BLEU   95% conf. int.      run   BLEU   95% conf. int.
batch MIRA     #1    33.16  32.07 – 34.21       #1    33.16  31.96 – 34.27
               #2    33.42  32.42 – 34.52       #2    32.89  31.74 – 34.04
               #3    33.30  32.16 – 34.39       #3    33.12  32.03 – 34.20
MERT           #1    32.19  31.15 – 33.13       #1    34.25  33.11 – 35.37
               #2    32.93  31.90 – 34.08       #2    34.11  32.91 – 35.37
               #3    31.53  30.39 – 32.68       #3    33.80  32.69 – 34.90

translation speed          unpruned     top30        sample=1000   sample=100
threads                        8           24             24            24
wrds./sec. (sec./wrd.)     13 (0.075)  547 (0.002)   300 (0.003)   501 (0.002)
snts./sec. (sec./snt.)     0.7 (1.498)  27 (0.037)    15 (0.067)    25 (0.040)
BLEU (best system)         33.42        33.55         34.25         33.82

Table 3. Features used and translation performance for the WMT15 fr–en experiments.

Prior to the start of Moses, all model files were copied to /dev/null to push them into the operating system's file cache. Due to race conditions between threads, we limited the number of threads to 8 for the legacy system with the unpruned phrase table.

Notice that in terms of BLEU scores, the two systems perform differently with different tuning methods. The lower performance of MERT for the conventional system with 28 features is not surprising: it is well known that MERT tends to perform poorly when the number of features exceeds 20. That MIRA fares worse than MERT for the sampling phrase tables may be due to a sub-optimal choice of MIRA's meta-parameters (cf. Hasler et al., 2011 for details on MIRA's meta-parameters).


8. Conclusion

We have presented an efficient implementation of sampling phrase tables in Moses. With the recent integration of hierarchical lexicalized reordering models into the approach, sampling phrase tables reach the same level of translation quality as conventional phrase tables while approaching CompactPT in terms of speed. In addition, sampling phrase tables offer the following advantages that make them an attractive option both for experimentation and research, and for use in production environments:

• They are much faster to build.
• They offer flexibility in the choice of feature functions used. Feature functions can be added or disabled without the need to re-run the entire phrase table construction pipeline.
• They have a lower memory footprint. It is not necessary to filter or prune the phrase tables prior to translation.

9. Availability

Sampling phrase tables are included in the master branch of the Moses github repository at http://github.com/moses-smt/mosesdecoder.git.

Acknowledgements

This work was supported by the European Union's Horizon 2020 research and innovation programme (H2020) under grant agreements 645487 (MMT) and 645452 (QT21). It extends work supported by the EU's Framework 7 (FP7) programme under grant agreements 287688 (MateCat), 287576 (CasMaCat), and 287576 (ACCEPT).

Bibliography

Callison-Burch, Chris, Colin Bannard, and Josh Schroeder. Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 255–262, Ann Arbor, Michigan, 2005.

Chiang, David. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228, 2007.

Clopper, C. J. and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 1934.

Dyer, Chris, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proceedings of the ACL 2010 System Demonstrations, pages 7–12, Uppsala, Sweden, 2010.


Foster, George F., Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. In EMNLP, pages 53–61, 2006.

Galley, Michel and Christopher D. Manning. A Simple and Effective Hierarchical Phrase Reordering Model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 848–856, Honolulu, Hawaii, 2008.

Germann, Ulrich. Dynamic Phrase Tables for Machine Translation in an Interactive Post-editing Scenario. In Proceedings of the Workshop on Interactive and Adaptive Machine Translation, pages 20–31, 2014.

Hasler, Eva, Barry Haddow, and Philipp Koehn. Margin Infused Relaxed Algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, 96:69–78, 2011.

Junczys-Dowmunt, Marcin. Phrasal Rank-Encoding: Exploiting Phrase Redundancy and Translational Relations for Phrase Table Compression. The Prague Bulletin of Mathematical Linguistics, 98:63–74, 2012.

Li, Zhifei, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. Joshua: An Open Source Toolkit for Parsing-Based Machine Translation. In Fourth Workshop on Statistical Machine Translation, pages 135–139, Athens, Greece, 2009.

Lopez, Adam. Hierarchical Phrase-Based Translation with Suffix Arrays. In EMNLP-CoNLL, pages 976–985, 2007.

Lopez, Adam. Machine Translation by Pattern Matching. PhD thesis, University of Maryland, College Park, MD, USA, 2008.

Manber, Udi and Gene Myers. Suffix Arrays: A New Method for On-line String Searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '90, pages 319–327, Philadelphia, PA, USA, 1990. ISBN 0-89871-251-3.

Mauser, Arne, David Vilar, Gregor Leusch, Yuqi Zhang, and Hermann Ney. The RWTH Machine Translation System for IWSLT 2007. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2007.

Och, Franz Josef. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, 2003.

Schwartz, Lane and Chris Callison-Burch. Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Trees. The Prague Bulletin of Mathematical Linguistics, 93:157–166, 2010.

Zens, Richard and Hermann Ney. Efficient Phrase-Table Representation for Machine Translation with Applications to Online MT and Speech Translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '07), pages 492–499, Rochester, New York, 2007.

Address for correspondence: Ulrich Germann [email protected] University of Edinburgh • 10 Crichton Street • Edinburgh, EH8 9AB, United Kingdom


J Reordering constraints for English-Latvian SMT

The main goal of these experiments was to improve SMT translations by (1) limiting the reordering of relatively independent parts of a sentence (e.g., text in brackets, direct speech, etc.) and (2) limiting the unjustified splitting of phrases (e.g., prepositional phrases) during translation. Two types of experiments were performed:

• Limiting reordering of the top-level phrases;
• Limiting reordering of specific categories of phrases.

J.1 SMT System and Tools

The general-domain English-Latvian SMT system trained on the LetsMT platform [21] was used for the experiments. The LetsMT platform is based on the Moses toolkit [11]. It supports mechanisms for specifying reordering constraints to the decoder, i.e., for marking fragments of the text that need to be kept together during translation. Two types of reordering constraints, zones and walls, are supported by Moses. A zone is a sequence of text to be translated without reordering with outside material; it is often used to mark text in brackets or quotes. Walls are "hard reordering constraints: first all words before a wall have to be translated, before words afterwards are translated".2 When walls are combined with zones, they act as local walls, i.e., they are only valid within the zone.

To introduce reordering constraints during translation, the input text was pre-processed. The following steps were performed during pre-processing:

• Tokenization: each sentence was split into tokens using the default Moses tokenizer;
• Parsing: the Berkeley parser [14] was used for source-text parsing;
• Adding constraints: special <zone> and <wall> tags were added to the text as reordering constraints (see the sketch after this list).
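For illustration, a minimal sketch (ours, covering only the bracketed-text case; the actual pre-processing derived the constraints from the Berkeley parse) of how such tags can be inserted into tokenized input. The tag syntax follows the Moses documentation on zones and walls referenced in the footnote below.

import re

def add_zones(sentence):
    """Wrap bracketed fragments of a tokenized sentence in <zone>...</zone>
    so that they are translated as a unit, without reordering across the
    zone boundary."""
    return re.sub(r"\( ([^()]*) \)", r"<zone> ( \1 ) </zone>", sentence)

print(add_zones("the new price ( see table 2 ) applies from May"))
# -> the new price <zone> ( see table 2 ) </zone> applies from May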

J.2 Constraints on Main Constituents

We started our experiments by introducing reordering constraints for the main constituents (top-level phrases) of the sentence and then added some restrictions for more fine-grained phrases. We included phrases that are parts (sub-phrases) of the main constituents; some specific combinations of phrases were considered as well. The results of these experiments are summarized in Table 1. All of these experiments led to a slight decrease in BLEU score.

J.3 Constraints on Specific Phrases

The aim of this experiment was to evaluate the influence of constraints on the translation of several types of phrases:

• noun phrases,
• prepositional phrases,
• verb phrases,
• some other phrases.

J.3.1 Noun Phrases

Reordering constraints were introduced mostly for long noun phrases. Such noun phrases could contain a simple prepositional phrase, have a complicated structure, or include a simple attributive clause. Some examples of such noun phrases are provided in Figure 1.

2 http://www.statmt.org/moses/?n=Advanced.Hybrid


Description                                                                       BLEU
Baseline                                                                          20.11
All main constituents                                                             19.83
All main constituents and all bottom second-level phrases                         19.79
Main constituents (NP, VP and S) and all bottom second-level phrases, except PP   19.89

Table 1: Results of experiments with reordering constraints for main constituents.

Figure 1: Examples of noun phrases for which reordering constraints were applied.

J.3.2 Prepositional Phrases

Reordering constraints were introduced for simple prepositional phrases that start with the prepositions in, for, by, from or at. Some complex prepositional phrases were also grouped together. Figure 2 illustrates cases in which reordering constraints were introduced.

J.3.3 Verb Phrases

Reordering constraints were introduced for complex verb phrases in which several verbs are separated by a conjunction, in which the verb phrase is expressed by several words (e.g., starts with to or will), or in which the verb phrase is followed by an object noun phrase. Some examples of such verb phrases are provided in Figure 3.

J.3.4 Other Phrases

In addition to the previously described categories of phrases, reordering restrictions were applied to some sub-clauses, to simple sentences containing a simple noun phrase and a verb phrase, and to simple phrases containing a single noun, adjective, adverb, or verb. Figure 4 illustrates a case where two parts of the sentence are separated by reordering constraints before translation.

J.3.5 Results of Reordering Constraints in SMT

Our first experiment, in which reordering constraints were introduced for the top-level phrases, did not lead to an improvement in terms of BLEU score. The second experiment, in which reordering constraints were introduced for specific phrase structures, resulted in an increase of the BLEU score by 0.5 points: starting from the baseline of 20.11 BLEU points, our best system, with walls and zones added to particular groups of phrases, reached 20.64 BLEU.


Figure 2: Examples of prepositional phrases for which reordering constraints were applied.

Figure 3: Examples of verb phrases for which reordering constraints were applied.


Figure 4: Example of a complex sentence for which reordering constraints were introduced.
