D1.4: Scientific Report on Rich Tree-Based SMT

Ondřej Bojar, Mauro Cettolo, Silvie Cinková, Philipp Koehn, Miroslav Týnovský, Zdeněk Žabokrtský

Distribution: Public

EuroMatrixPlus: Bringing Machine Translation for European Languages to the User

ICT 231720 Deliverable D1.4

Revision 324, 2010-03-14 23:16:53 +0100 (Sun, 14 Mar 2010)

Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development.

Project ref no: ICT-231720
Project acronym: EuroMatrixPlus
Project full title: Bringing Machine Translation for European Languages to the User
Instrument: STREP
Thematic Priority: ICT-2007.2.2 Cognitive systems, interaction, robotics
Start date / duration: 01 March 2009 / 38 months
Distribution: Public
Contractual date of delivery: February 28, 2012
Actual date of delivery: March 31, 2012
Deliverable number: D1.4
Deliverable title: Scientific Report on Rich Tree-Based SMT
Type: Report
Status & version: Final (revision 324)
Number of pages: 16
Contributing WP(s): WP1
WP / Task responsible: CU
Other contributors: UEDIN, FBK
Internal reviewer: Chris Callison-Burch
Author(s): Ondřej Bojar, Mauro Cettolo, Silvie Cinková, Philipp Koehn, Miroslav Týnovský, Zdeněk Žabokrtský
EC project officer: Michel Brochard

The partners in EuroMatrixPlus are: DFKI GmbH, Saarbrücken (DFKI); University of Edinburgh (UEDIN); Charles University (CUNI-MFF); Johns Hopkins University (JHU); Fondazione Bruno Kessler (FBK); Université du Maine, Le Mans (LeMans); Dublin City University (DCU); Lucy Software and Services GmbH (Lucy); Central and Eastern European Translation, Prague (CEET); Ludovit Stur Institute of Linguistics, Slovak Academy of Sciences (LSIL); Institute of Information and Communication Technologies, Bulgarian Academy of Sciences (IICT-BAS).

For copies of reports, updates on project activities and other EuroMatrixPlus-related information, contact:

The EuroMatrixPlus Project Co-ordinator: Prof. Dr. Hans Uszkoreit, DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, [email protected], Phone +49 (681) 85775-5282, Fax +49 (681) 85775-5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.euromatrixplus.net/

© 2012, The Individual Authors

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Contents

Executive Summary

1 WP1: Rich Tree-Based Statistical Translation
  1.1 Task 1.1: Shallow syntax modelling
  1.2 Task 1.2: Develop rich contextual features
  1.3 Task 1.3: TectoMT platform development
  1.4 Task 1.4: Czech/English parallel data: extended annotation
    1.4.1 PCEDT 2.0
    1.4.2 PEDT 2.0
  1.5 Task 1.5: Tree-based feature combination mapping
    1.5.1 Motivation
    1.5.2 Task Description
    1.5.3 Data Description
    1.5.4 Experiments Configuration
    1.5.5 Conclusion
  1.6 Task 1.6: Internal System Evaluation

References

Executive Summary

This deliverable describes the work of WP1 (Rich Tree-Based Statistical Translation) of the EuroMatrixPlus project. The following pages provide more details on the progress in the following tasks:

Task                                          Months   Status
Task 1.1: Shallow syntax modeling             1–24     done
Task 1.2: Develop rich contextual features    12–24    done
Task 1.3: TectoMT platform development        1–36     done
Task 1.4: Czech-English annotation            1–33     done
Task 1.5: Tree-based features                 1–24     done
Task 1.6: System evaluation                   1–36     done

Work in Years Two and Three

Work was carried out on the following tasks as outlined in the Description of Work:

Task 1.1 Shallow syntax modeling (FBK, months 1–24): We continued to carry out experiments with various system configurations and various languages; in particular, for Arabic a new, effective verb reordering model was designed, implemented and experimentally tested. For details, see Section 1.1 and the cited publications.

Task 1.2 Develop rich contextual features (UEDIN, months 12–24): We developed a number of frameworks that allow for the integration of a large number of features, based on Gibbs sampling and SampleRank. We also re-implemented state-of-the-art approaches, namely MIRA and pairwise ranked optimization (PRO). These approaches have shown modest gains when using simple sparse features. We also explored the use of large contextual features in a maximum entropy approach during training (not tuning), aimed at reordering, with nice gains over a strong baseline.

Task 1.3 TectoMT platform development (CU, months 1–36): The Treex (formerly TectoMT) platform was greatly improved in terms of robustness (multiple new languages used and very large data for English and Czech processed), speed (profiling and a speed-up of about 30%) as well as availability. The core Treex modules are now publicly available from CPAN.

Task 1.4 Czech-English annotation (CU, months 1–33): The annotation of both Czech and English data at the tectogrammatical layer of representation was finished on time. The data were packaged and made publicly available in two separate releases: Prague Czech-English Dependency Treebank 2.0 (covering both sides of the treebank) and Prague English Dependency Treebank 2.0 (covering just the English tectogrammatical annotation, merged with some additional linguistic resources).

Task 1.5 Tree-based features (CU, months 1–24): The aim of this task was to tackle the problem of predicting attributes of nodes in target-side deep-syntactic trees based on the source nodes. The study is presented in this deliverable. Unfortunately, the data-driven method suggested here does not reach satisfactory accuracy, so it was not incorporated in our deep-syntactic MT system (TectoMT, see also Task 1.3).

Task 1.6 Internal system evaluation (all participants, months 1–36): Several partners have taken part in various evaluation campaigns, thus evaluating their systems in open competition. Several studies of techniques of manual MT evaluation were also published.

Chapter 1

WP1: Rich Tree-Based Statistical Translation

1.1 Task 1.1: Shallow syntax modelling

Syntactic disfluencies in Arabic-to-English phrase-based SMT output are often due to incorrect verb reordering in Verb-Subject-Object (VSO) sentences. As a solution, we proposed (Bisazza and Federico, 2010; Bisazza et al., 2011) a chunk-based reordering technique to automatically displace clause-initial verbs in the Arabic side of a word-aligned parallel corpus. This method is used to preprocess the training data and to collect statistics about verb movements. Based on this analysis, we build sentence-specific verb reordering lattices for the test sentences before decoding, and test different lattice-weighting schemes. Finally, we train a feature-rich discriminative model to predict likely verb reorderings for a given Arabic sentence. The model scores are used to prune the reordering lattice, leading to better word reordering at decoding time. The application of our reordering methods to the training and test data resulted in consistent improvements on the NIST-MT 2009 Arabic-English benchmark, both in terms of BLEU (+1.06%) and of reordering quality (+0.85%) measured with the Kendall Reordering Score.
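The Kendall Reordering Score mentioned above is based on Kendall's tau over the permutation that the system's word order induces with respect to the reference order. The following Python fragment is only a simplified illustration of such a tau-based score; the published metric may differ in normalization details such as the treatment of unaligned words:

    from itertools import combinations

    def kendall_reordering_score(permutation):
        # permutation[i] = relative target position of source word i.
        # Returns 1.0 for monotone order, 0.0 for fully inverted order.
        n = len(permutation)
        if n < 2:
            return 1.0
        pairs = list(combinations(range(n), 2))
        # A pair (i, j) with i < j is concordant if it keeps its relative order.
        concordant = sum(1 for i, j in pairs if permutation[i] < permutation[j])
        return concordant / len(pairs)

    # Example: a clause-initial verb displaced to medial position.
    print(kendall_reordering_score([0, 1, 2, 3]))  # 1.0 (monotone)
    print(kendall_reordering_score([2, 0, 1, 3]))  # 0.666..., 4 of 6 pairs concordant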

1.2 Task 1.2: Develop rich contextual features

The long-term goal of this task is to develop models for machine translation that may use arbitrary features over the source context of a word, phrase, sentence, and document. This involves both the development of models and training methods that allow for such richly featured models (machine learning research) and the investigation of which features are most beneficial (feature engineering research). On the machine learning side, we have explored the use of Bayesian models that are trained on a sample of the space of possible translations using Gibbs sampling (Arun et al., 2009; Arun et al., 2010a; Arun et al., 2010b). This sampling is guaranteed to converge to the true distribution, and hence avoids the bias of looking only at the most likely events. A different sampling method, SampleRank (Haddow et al., 2011), performs a random walk. While it does not come with the same guarantees, it tends to converge faster. We also re-implemented MIRA (Hasler et al., 2011), which has been reported in the literature to work well with large tuning sets. We have shown that this implementation copes well with a large number of sparse features. We also re-implemented pairwise ranked optimization (PRO), which gave us improvements when applied to a complex factored model. Finally, we explored the use of rich contextual features to aid reordering at the training stage: a maximum entropy classifier aids reordering decisions in a hierarchical model (Gao et al., 2011), improving over a strong baseline.
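For illustration, PRO reduces weight tuning to binary classification: pairs of hypotheses are sampled from an n-best list, the pairs with the largest metric-score gaps are kept, and a classifier is trained on feature-vector differences so that the better-scoring hypothesis wins. The sketch below uses scikit-learn's logistic regression as a stand-in for the actual learner; the function and its data layout are illustrative assumptions, not the implementation referred to above:

    import random
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pro_tune(nbest, n_pairs=5000, n_keep=100):
        # nbest: list of (feature_vector, metric_score) pairs for one sentence,
        # where feature_vector is a numpy array and metric_score is e.g.
        # sentence-level BLEU. Returns a new weight vector.
        pairs = [random.sample(nbest, 2) for _ in range(n_pairs)]
        # Keep the pairs with the largest metric-score gaps.
        pairs.sort(key=lambda p: abs(p[0][1] - p[1][1]), reverse=True)
        X, y = [], []
        for (f_a, s_a), (f_b, s_b) in pairs[:n_keep]:
            if s_a == s_b:
                continue
            # Train on feature differences; the label says which hypothesis wins.
            X.append(f_a - f_b)
            y.append(1 if s_a > s_b else 0)
            X.append(f_b - f_a)  # mirrored example keeps the classes balanced
            y.append(0 if s_a > s_b else 1)
        clf = LogisticRegression(fit_intercept=False)
        clf.fit(np.array(X), np.array(y))
        return clf.coef_[0]  # the learned weights score new hypotheses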

1.3 Task 1.3: TectoMT platform development

Treex (formerly TectoMT), a common platform developed for linguistically rich processing of text, went through a number of substantial design improvements in the last year. We focused especially on three aspects: (1) robustness, (2) speed, and (3) support for external users:

Robustness: numerous tests were created for checking functional correctness as well as overall design and coding quality. Data-intensive tests were executed too: about 15 million sentence pairs from an English-Czech parallel corpus were analyzed by Treex tools, the same number of English sentences was translated to Czech by the Treex MT scenario, and Treex was also tested on a number of other languages (more than 30 treebanks have now been converted to the Treex format).

Speed: careful profiling of all core components was performed, which led to an overall MT pipeline speed-up of about 30%.

Support for external users: all Treex core components are now fully documented and can be easily installed by anyone from CPAN, which is a broadly respected (de facto standard) repository of Perl libraries.

Besides implementing infrastructure improvements, Treex was used:

• in several NLP studies, such as Popel et al. (2011), Žabokrtský (2011), and Mareček et al. (2011b),

• for building language data resources such as CzEng 1.0 (Bojar et al., 2011b), PCEDT 2.0 (Hajič et al., 2012), and HamleDT (Mareček et al., 2011a).

1.4 Task 1.4: Czech/English parallel data: extended annotation

The annotation of Czech-English parallel data at the tectogrammatical (deep-syntactic) layer of annotation is complete. Internally, it was organized into two independent but collaborating projects: PEDT (Prague English Dependency Treebank) for English and PCEDT_cz (Prague Czech-English Dependency Treebank, Czech side) for Czech. The corpus is being released by the Linguistic Data Consortium in two separate releases: the parallel Prague Czech-English Dependency Treebank 2.0 (also Deliverable 1.2 of EuroMatrixPlus) and the Prague English Dependency Treebank 2.0.

1.4.1 PCEDT 2.0

The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0¹) is a major update of the Prague Czech-English Dependency Treebank 1.0 (Čuřín et al., 2004). It is a manually parsed Czech-English parallel corpus comprising over 1.2 million running words in almost 50,000 sentences for each part. The English part contains the entire Penn Treebank Wall Street Journal section (PTB; Marcus et al., 1999). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (a total of 2,312 documents) have been translated to Czech and made part of this release. Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (Hajič et al., 2006). The main features of this annotation style are:

¹ http://ufal.mff.cuni.cz/pcedt2.0/

• dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values),

• semantic labeling of content words and types of coordinating structures,

• argument structure, including an argument structure ("valency") lexicon for both languages,

• ellipsis and anaphora resolution.

This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. The most essential features of this annotation style are specified in Hajič et al. (2012) and in the documentation accompanying the release.

1.4.2 PEDT 2.0

The English part alone comes as a separate release called PEDT 2.0², which is enhanced with integrated visualization of other major annotation efforts performed by other teams, such as

• PropBank (Palmer et al., 2004),

• VerbNet³ (Kipper et al., 2000),

• NomBank (Meyers et al., 2008),

• flat noun phrase structures (courtesy of D. Vadas⁴ and J. R. Curran),

• BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005).

1.5 Task 1.5: Tree-based feature combination mapping

This section of the deliverable is devoted to predicting the values of node attributes in the target-side tectogrammatical tree when translating from the source tree. The section is somewhat longer than the others because the results have not yet been published elsewhere.

1.5.1 Motivation

Translation via the tectogrammatical layer (t-layer) includes the transfer of tectogrammatical nodes (t-nodes) from the source-side tectogrammatical tree (t-tree) to the target-side one. This is a difficult task as it covers both the translation of the tree structure and the translation of the t-nodes' attributes. This section studies the latter problem.

The scope of this study is not limited to translation via the t-layer. The problem of predicting several possibly correlated target attributes from several also possibly correlated source attributes appears in other methods of machine translation (e.g. factored statistical MT, Koehn and Hoang (2007)) and possibly also in other domains.

The term feature combination mapping may sound a little fuzzy, but the meaning we give it is precise: let us have a vector of nominal values (s_1, ..., s_n) which is used to predict a vector of nominal values (t_1, ..., t_n). Both source and target vector elements can be mutually dependent (we would say correlated in the case of numerical values). The task is to find a way to predict the target vector elements so that they correspond to the source vector elements and at the same time are mutually consistent.

The next section describes our particular task of predicting t-nodes, which is an instance of this general problem. Section 1.5.3 describes our input data and the conditions of the experiments; Section 1.5.4 presents the experiment configurations and results; and Section 1.5.5 summarizes the findings and discusses possible future enhancements.

² http://ufal.mff.cuni.cz/pedt2.0/
³ http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
⁴ http://sydney.edu.au/engineering/it/~dvadas1/


1.5.2 Task Description

The t-tree is a layer of linguistic annotation above the shallow syntactic tree. It omits the nodes for auxiliary words and adds nodes for hidden (not directly expressed) parts of a sentence. An example of a sentence with its (manual) tectogrammatical annotation is shown in Figure 1.1. In reality, t-nodes contain more attributes than shown in the figure. The complete list and a description of each attribute can be found in Mikulová et al. (2007).

Figure 1.1: An example of tectogrammatical annotation. Original sentence: He has in tow his prescient girlfriend, whose sassy retorts mark her as anything but a docile butterfly.

Our task is to predict the English t-tree from the Czech t-tree and vice versa. As we do not go into tree structure prediction in this text, let us assume that we have a perfect 1-1 pairing of t-nodes on the source and target sides⁵. The remaining problem is to predict each node's target-side attributes. Our predictions are restricted to non-lexical attributes, i.e. we do not consider the t-lemma on the source or target side.
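To make the setting concrete, every training sample can be thought of as a pair of attribute dictionaries for one aligned t-node pair. The layout below is only an illustration (the actual CzEng 0.9 export format, described next, differs); the attribute names mirror Table 1.1 and the values are typical PDT-style values. The same (source, target) layout is reused in the training sketch in Section 1.5.4:

    # One 1-1 aligned t-node pair ("sample"): the task is to predict all
    # target attributes from all source attributes. Illustrative layout
    # only; the actual CzEng 0.9 export format differs.
    sample = (
        {  # Czech t-node (source), t-lemma already removed
            "cs_sempos": "v",
            "cs_tense": "ant",
            "cs_negation": "neg0",
            "cs_number": "nr",
            # ... the remaining attributes from Table 1.1
        },
        {  # English t-node (target), the vector to be predicted
            "en_sempos": "v",
            "en_tense": "ant",
            "en_negation": "neg0",
            "en_number": "nr",
            # ... the remaining attributes from Table 1.1
        },
    )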

1.5.3 Data Description

We use the automatic trees as available in the CzEng 0.9 corpus (Bojar and Žabokrtský, 2009) in the so-called export format. The corpus is split by its authors into ten parts. Eight of them are training data, the ninth part is a held-out set and the tenth part is the standard test data for CzEng 0.9. We use only the first part of the training data and set aside one tenth of it ourselves for testing. We only use a subset of the selected data: we omit all nodes which are not simply paired, i.e. the input data is composed of 1-1 pairs of t-nodes. We call these pairs "samples" in the following. The training and testing parts consist of 4,282,033 and 431,344 samples, respectively. The training part contains 426,700 unique Czech nodes (vectors of attributes) and 266,350 unique English nodes (the t-lemma was already removed). This suggests that predicting the Czech nodes is a harder task, perhaps partly because the automatic analysis of the Czech data is more fine-grained. Statistics of individual attributes are given in Table 1.1.

⁵ This is a viable assumption used in the TectoMT translation system (Popel et al., 2011), where the feature combination mapping is done using a mix of handcrafted and statistical rules. The assumption of 1-1 t-node mapping causes only 8% of the translation errors of TectoMT (Popel and Žabokrtský, 2010).

Czech                                               English
Attribute             Perplexity  Unique values     Attribute             Perplexity  Unique values
cs_aspect                   1.82              4     en_aspect                   1.00              1
cs_degcmp                   1.55              4     en_degcmp                   1.45              4
cs_deontmod                 1.71              8     en_deontmod                 1.78              7
cs_dispmod                  1.73              3     en_dispmod                  1.65              2
cs_formeme                 18.04           1640     en_formeme                 20.25           2493
cs_functor                 12.95             67     en_functor                 11.83             64
cs_gender                   5.42              7     en_gender                   1.51              5
cs_indeftype                1.22             12     en_indeftype                1.03              2
cs_is_clause_head           1.53              2     en_is_clause_head           1.50              2
cs_is_member                1.37              2     en_is_member                1.33              2
cs_is_passive               1.00              1     en_is_passive               1.09              2
cs_is_relclause_head        1.06              2     en_is_relclause_head        1.09              2
cs_iterativeness            1.60              3     en_iterativeness            1.65              2
cs_negation                 1.91              3     en_negation                 1.99              3
cs_nodetype                 1.12              2     en_nodetype                 1.25              5
cs_number                   3.63              5     en_number                   2.84              4
cs_numertype                1.23              6     en_numertype                1.00              1
cs_person                   1.67              5     en_person                   1.53              5
cs_politeness               1.19              3     en_politeness               1.00              1
cs_resultative              1.60              3     en_resultative              1.65              2
cs_sempos                   7.27             20     en_sempos                   5.48             11
cs_tense                    1.94              5     en_tense                    2.12              5
cs_val_frame_rf             1.00              1     en_val_frame_rf             1.00              1
cs_verbmod                  1.80              5     en_verbmod                  1.71              3

Table 1.1: Statistics of node attributes.
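The perplexity column in Table 1.1 can be read as the effective number of distinct values an attribute takes. Assuming the usual definition, perplexity = 2^H for the entropy H of the empirical value distribution (the deliverable does not spell out its exact formula), a small sketch:

    import math
    from collections import Counter

    def perplexity(values):
        # Perplexity 2**H of the empirical distribution of attribute values:
        # 1.0 for an attribute with a single value, k for a uniform spread
        # over k values.
        counts = Counter(values)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        return 2 ** entropy

    # A binary attribute observed 90% / 10% of the time:
    print(perplexity(["neg0"] * 90 + ["neg1"] * 10))  # ~1.38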

1.5.4 Experiments Configuration

We examine several ways to predict an English t-node from a Czech t-node and vice versa. In particular, we consider the use of already predicted attributes when predicting further ones. There are four experiment configurations, each of which we run twice: once for the Czech-to-English direction and once for English-to-Czech.

Le Zhang's maxent package⁶ is trained on the training data and used for prediction. The absolute accuracy of the machine learning algorithm is not our primary goal; it could certainly be improved by proper tuning of the learning procedure. What we emphasize is the relative change in accuracy when demanding consistency of the output target attributes. Maxent was chosen mainly for its good implementation, which is able to handle this relatively large data collection. Its speed and memory efficiency outperformed the C5.0 software package⁷, which we used in early-stage experiments.

The baseline experiment predicts the English-side attributes independently of each other. All Czech attributes are used for the prediction of each English attribute.

The first modification over the baseline (experiment addrand) keeps the already predicted values and adds them to the set of predictors. The order of predicted attributes is random but fixed throughout addrand. The training runs iteratively: the values predicted on the training data in previous iterations are used when training the prediction of the next attribute. Table 1.2 compares the accuracy of predicting individual attributes in baseline and addrand, in the English-to-Czech and Czech-to-English directions, respectively.

Czech                                              English
Attribute              baseline    addrand         Attribute              baseline    addrand
cs_is_relclause_head    99.06 %    99.06 %         en_is_relclause_head    98.20 %    98.20 %
cs_person               95.02 %    95.01 %         en_person               96.78 %    96.77 %
cs_deontmod             94.61 %    94.61 %         en_deontmod             93.86 %    93.86 %
cs_nodetype             99.07 %    99.07 %         en_nodetype             97.03 %    97.02 %
cs_tense                91.44 %    91.44 %         en_tense                90.19 %    90.19 %
cs_degcmp               90.89 %    90.76 %         en_degcmp               90.90 %    90.89 %
cs_is_member            90.43 %    90.41 %         en_is_member            91.73 %    91.70 %
cs_functor              55.50 %    55.28 %         en_functor              55.41 %    55.15 %
cs_dispmod              93.62 %    93.56 %         en_dispmod              95.24 %    95.24 %
cs_is_passive          100.00 %   100.00 %         en_is_passive           98.25 %    98.25 %
cs_formeme              55.80 %    55.32 %         en_formeme              52.97 %    52.15 %
cs_aspect               89.85 %    89.68 %         en_aspect              100.00 %   100.00 %
cs_is_clause_head       94.39 %    94.42 %         en_is_clause_head       95.00 %    95.03 %
cs_resultative          95.91 %    95.84 %         en_resultative          95.24 %    95.25 %
cs_verbmod              93.15 %    93.05 %         en_verbmod              94.56 %    94.53 %
cs_gender               58.98 %    57.74 %         en_gender               96.75 %    96.72 %
cs_negation             93.41 %    92.87 %         en_negation             88.09 %    88.05 %
cs_politeness           96.70 %    96.64 %         en_politeness          100.00 %   100.00 %
cs_indeftype            97.27 %    97.27 %         en_indeftype            99.64 %    99.64 %
cs_val_frame_rf        100.00 %   100.00 %         en_val_frame_rf        100.00 %   100.00 %
cs_sempos               78.42 %    78.46 %         en_sempos               82.97 %    82.76 %
cs_number               76.56 %    76.33 %         en_number               81.98 %    79.75 %
cs_iterativeness        95.91 %    95.84 %         en_iterativeness        95.24 %    95.25 %
cs_numertype            98.44 %    98.45 %         en_numertype           100.00 %   100.00 %

Table 1.2: Accuracy of prediction of individual Czech and English attributes in experiments baseline and addrand.

⁶ http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
⁷ http://www.rulequest.com/see5-info.html

Attributes are listed in the same order in which the prediction was made (random but fixed). In other words, the prediction of is_relclause_head is the same task in both experiments; the prediction of person uses the predicted value of is_relclause_head in addition to the source-side attributes; the prediction of deontmod adds both is_relclause_head and person to the predictor set, etc. (a sketch of this cascaded training appears after the metric definitions below). We use two metrics to express the overall accuracy of prediction:

1. the ratio of well-predicted attributes µattr

2. the ratio of well-predicted t-nodes µnode
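The following sketch puts the pieces together: cascaded (addrand-style) training, where each attribute's classifier also sees the values predicted for earlier attributes, and the computation of both metrics. scikit-learn's logistic regression stands in for the maxent toolkit actually used; the helper names and the dictionary-based encoding are illustrative assumptions:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_cascade(samples, target_order):
        # samples: list of (source_attrs, target_attrs) dictionary pairs.
        # Trains one classifier per target attribute, in the given order,
        # feeding each one the source attributes plus the values predicted
        # (on the training data itself) for all earlier target attributes.
        # Attributes with a single observed value would need special-casing.
        models = []
        contexts = [dict(src) for src, _tgt in samples]
        for attr in target_order:
            vec = DictVectorizer()
            X = vec.fit_transform(contexts)
            y = [tgt[attr] for _src, tgt in samples]
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            models.append((attr, vec, clf))
            # addrand: predicted (not gold) values become extra predictors.
            for ctx, pred in zip(contexts, clf.predict(X)):
                ctx[attr] = pred
        return models

    def evaluate(models, samples):
        # Returns (mu_attr, mu_node): the ratio of well-predicted attributes
        # and the ratio of t-nodes whose whole attribute vector is correct.
        attr_ok = node_ok = total = 0
        for src, tgt in samples:
            ctx, correct = dict(src), 0
            for attr, vec, clf in models:
                pred = clf.predict(vec.transform([ctx]))[0]
                ctx[attr] = pred
                correct += pred == tgt[attr]
                total += 1
            attr_ok += correct
            node_ok += correct == len(models)
        return attr_ok / total, node_ok / len(samples)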

Tables 1.3 and 1.4 show the comparison of baseline and addrand according to these metrics.

Metric    baseline    addrand          Metric    baseline    addrand
µnode        0.241      0.237          µnode        0.350      0.346
µattr        0.894      0.893          µattr        0.917      0.915

Table 1.3 (left): Overall metrics of Czech attribute accuracy. Table 1.4 (right): Overall metrics of English attribute accuracy.

Surprisingly, the baseline works better than our modification: both metrics are slightly higher in the baseline experiment.

In the following experiments, we investigate whether the accuracy of predicted nodes can be raised by choosing an appropriate order of prediction. Experiments highlow and lowhigh use two opposing orders: highlow predicts the attributes with the highest baseline accuracy first, whereas lowhigh predicts the attributes with the lowest baseline accuracy first (both orders are derived directly from the baseline accuracies; see the small sketch after this discussion). The comparison of their results is shown in Tables 1.5, 1.6 and 1.7. We see that the overall accuracy of predicted nodes as wholes can be slightly raised by adding already predicted attributes to the set of predictors when predicting the following ones. The improvement is only achieved if the attributes are predicted in a proper and somewhat surprising order: the attributes most difficult to predict have to be predicted first.

The general task of predicting vectors (of categorial values) from vectors is interesting in its own right. Despite our efforts, we did not find any study of this specific machine learning task in the literature. The experiments on our particular dataset with the particular underlying machine learning algorithm suggest that in sequential prediction, hard decisions should be made first. We believe that this conclusion may not hold if the ML technique performs some feature selection: the added values of attributes predicted in preceding steps would be automatically excluded if they tend to be misleading.
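For concreteness, the two orders follow directly from the per-attribute baseline accuracies (the dictionary below is a hypothetical excerpt of Table 1.2):

    # Hypothetical per-attribute baseline accuracies (cf. Table 1.2).
    baseline_acc = {"en_formeme": 52.97, "en_functor": 55.41,
                    "en_number": 81.98, "en_sempos": 82.97}  # ...and so on
    lowhigh = sorted(baseline_acc, key=baseline_acc.get)                # hardest first
    highlow = sorted(baseline_acc, key=baseline_acc.get, reverse=True)  # easiest first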

1.5.5 Conclusion

We have carried out experiments with predicting the Czech and English attributes of t-layer nodes, aiming at the prediction of the whole vector of attributes belonging to a node. The accuracy can be slightly increased if hard attributes are predicted first and the guessed values are used for subsequent attributes. We have not incorporated this fully automatic attribute prediction into the TectoMT system, because its overall performance is lower than that of the handcrafted system.

1.6 Task 1.6: Internal System Evaluation

CU has performed several studies of MT evaluation. Bojar (2011) analyzed two techniques of manual evaluation: the "sentence comprehension" (also "monolingual post-editing") task as carried out in WMT10, and explicit marking of errors in MT outputs. Berka et al. (2011) evaluated four English-to-Czech MT systems using another possible technique: posing comprehension questions to readers of MT output. Perhaps the most influential of our efforts in this respect was the collection and discussion of several technical issues with the current main WMT evaluation technique, which relies on sentence ranking. This analysis is available in Bojar et al. (2011a).

Czech                                              English
Attribute              lowhigh     highlow         Attribute              lowhigh     highlow
cs_functor              55.50 %    54.83 %         en_formeme              52.97 %    51.31 %
cs_formeme              55.69 %    54.74 %         en_functor              55.46 %    54.32 %
cs_gender               58.11 %    58.17 %         en_number               81.93 %    79.59 %
cs_number               76.52 %    76.26 %         en_sempos               83.03 %    82.77 %
cs_sempos               78.53 %    78.14 %         en_negation             88.09 %    87.96 %
cs_aspect               89.76 %    89.63 %         en_tense                90.20 %    90.11 %
cs_is_member            90.43 %    90.40 %         en_degcmp               90.76 %    90.64 %
cs_degcmp               90.85 %    90.61 %         en_is_member            91.72 %    91.65 %
cs_tense                91.38 %    91.39 %         en_deontmod             93.85 %    93.83 %
cs_verbmod              93.08 %    93.12 %         en_verbmod              94.53 %    94.51 %
cs_negation             93.10 %    92.62 %         en_is_clause_head       95.05 %    94.97 %
cs_dispmod              93.57 %    93.59 %         en_dispmod              95.26 %    95.24 %
cs_is_clause_head       94.36 %    94.39 %         en_iterativeness        95.26 %    95.24 %
cs_deontmod             94.53 %    94.61 %         en_resultative          95.26 %    95.24 %
cs_person               94.96 %    94.97 %         en_gender               96.68 %    96.76 %
cs_iterativeness        95.82 %    95.91 %         en_person               96.46 %    96.54 %
cs_resultative          95.83 %    95.91 %         en_nodetype             96.98 %    97.01 %
cs_politeness           96.59 %    96.68 %         en_is_relclause_head    98.21 %    98.21 %
cs_indeftype            97.34 %    97.27 %         en_is_passive           98.25 %    98.25 %
cs_numertype            98.45 %    98.41 %         en_indeftype            99.65 %    99.64 %
cs_is_relclause_head    99.06 %    99.06 %         en_aspect              100.00 %   100.00 %
cs_nodetype             99.05 %    99.07 %         en_numertype           100.00 %   100.00 %
cs_is_passive          100.00 %   100.00 %         en_politeness          100.00 %   100.00 %
cs_val_frame_rf        100.00 %   100.00 %         en_val_frame_rf        100.00 %   100.00 %

Table 1.5: Accuracy of prediction of individual Czech and English attributes in experiments lowhigh and highlow.

Metric    lowhigh     highlow          Metric    lowhigh     highlow
µnode        0.247      0.234          µnode        0.364      0.342
µattr        0.895      0.894          µattr        0.917      0.914

Table 1.6 (left): Overall metrics of Czech attribute accuracy. Table 1.7 (right): Overall metrics of English attribute accuracy.

In the last two years, FBK took part in the 2010 and 2011 editions of both the IWSLT and WMT evaluation campaigns, where the research pursued in EuroMatrixPlus could be publicly assessed. In Bisazza et al. (2010), new morphological segmentation rules were developed for Turkish-English. The combination of several Turkish segmentation schemes into a lattice input led to an improvement with respect to the previous year. The use of additional training data was explored for Arabic-English, while on the English-to-French task an improvement was achieved over a strong baseline by automatically selecting relevant and high-quality data from the available training corpora.

In WMT 2010, FBK participated in the machine translation shared task with phrase-based statistical machine translation systems based on the Moses decoder for English-German and German-English translation (Hardmeier et al., 2010). The work focused on exploiting the available language modelling resources by using linear mixtures of large 6-gram language models, and on addressing linguistic differences between English and German with methods based on word lattices. In particular, lattices were used to integrate a morphological analyzer for German into the system, and some initial work on rule-based word reordering was also presented.

FBK participated jointly with Uppsala University in the shared translation task of WMT 2011 (Hardmeier et al., 2011). Key features of the systems included anaphora resolution, hierarchical lexical reordering, data selection for language modelling, linear transduction grammars for word alignment and syntax-based decoding with monolingual dependency information.

In the IWSLT 2011 evaluation campaign, FBK submitted runs in the English ASR track, the Arabic-English MT track and the English-French MT and SLT tracks (Ruiz et al., 2011). Concerning the MT and SLT systems, besides language-specific pre-processing and the automatic introduction of punctuation in the ASR output, two major improvements are reported over the 2010 systems. First, we applied a fill-up method for phrase-table adaptation; second, we explored the use of hybrid class-based language models to better capture the language style of public speeches.

References

Adam Meyers, Ruth Reeves, and Catherine Macleod. 2008. NomBank v 1.0. LDC2008T23, ISBN: 1-58563-492-1.

Abhishek Arun, Chris Dyer, Barry Haddow, Phil Blunsom, Adam Lopez, and Philipp Koehn. 2009. Monte Carlo inference and maximization for phrase-based translation. In Conference on Computational Natural Language Learning.

Abhishek Arun, Barry Haddow, and Philipp Koehn. 2010a. A unified approach to minimum risk training and decoding. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 365–374, Uppsala, Sweden, July. Association for Computational Linguistics.

Abhishek Arun, Barry Haddow, Philipp Koehn, Adam Lopez, Chris Dyer, and Miles Osborne. 2010b. Monte Carlo techniques for phrase-based translation. Machine Translation Journal, 24(2):103–121.

Jan Berka, Martin Černý, and Ondřej Bojar. 2011. Quiz-based evaluation of machine translation. The Prague Bulletin of Mathematical Linguistics, 95:77–86.

Arianna Bisazza and Marcello Federico. 2010. Chunk-Based Verb Reordering in VSO Sentences for Arabic-English Statistical Machine Translation. In ACL Joint Workshop on SMT and MetricsMATR, pages 241–249, Uppsala, Sweden.

Arianna Bisazza, Ioannis Klasinas, Mauro Cettolo, and Marcello Federico. 2010. FBK @ IWSLT 2010. In International Workshop on Spoken Language Translation (IWSLT), pages 53–58, Paris, France.

Arianna Bisazza, Daniele Pighin, and Marcello Federico. 2011. Chunk-Lattices for Verb Reordering in Arabic-English SMT. Machine Translation: Special Issue on MT for Arabic.

Ondřej Bojar and Zdeněk Žabokrtský. 2009. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 93. Under consideration.

Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011a. A grain of salt for the WMT manual evaluation. In Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan, editors, Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, UK. Association for Computational Linguistics.

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2011b. CzEng 1.0.

Ondřej Bojar. 2011. Analyzing error types in English-Czech machine translation. The Prague Bulletin of Mathematical Linguistics, 95:63–76.

Jan Čuřín, Martin Čmejrek, Jiří Havelka, Jan Hajič, Vladislav Kuboň, and Zdeněk Žabokrtský. 2004. Prague Czech-English Dependency Treebank Version 1.0. LDC2004T25, ISBN: 1-58563-321-6.

Yang Gao, Philipp Koehn, and Alexandra Birch. 2011. Soft dependency constraints for reordering in hierarchical phrase-based translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 857–868, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Barry Haddow, Abhishek Arun, and Philipp Koehn. 2011. SampleRank training for phrase-based machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 261–271, Edinburgh, Scotland, July. Association for Computational Linguistics.

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of the Eighth International Language Resources and Evaluation Conference (LREC'12), Istanbul, Turkey, May. ELRA, European Language Resources Association. In print.

Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová, Zdeněk Žabokrtský, and Magda Ševčíková Razímová. 2006. Prague Dependency Treebank 2.0. LDC2006T01, ISBN: 1-58563-370-4.

Christian Hardmeier, Arianna Bisazza, and Marcello Federico. 2010. FBK at WMT 2010: Word Lattices for Morphological Reduction and Chunk-Based Reordering. In ACL Joint Workshop on SMT and MetricsMATR, pages 88–92, Uppsala, Sweden.

Christian Hardmeier, Jörg Tiedemann, Markus Saers, Marcello Federico, and Prashant Mathur. 2011. The Uppsala-FBK systems at WMT 2011. In Proceedings of the 6th Workshop on Statistical Machine Translation, pages 372–378, Edinburgh, UK.

Eva Hasler, Barry Haddow, and Philipp Koehn. 2011. Margin infused relaxed algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, 96:69–78.

Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 691–696. AAAI Press / The MIT Press.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In EMNLP-CoNLL, pages 868–876. ACL.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Penn Treebank 3. LDC99T42, ISBN: 1-58563-163-9.

David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Daniel Zeman, Zdeněk Žabokrtský, and Jan Hajič. 2011a. HamleDT - HArmonized multi-language dependency treebank.

David Mareček, Rudolf Rosa, Petra Galuščáková, and Ondřej Bojar. 2011b. Two-step translation with grammatical post-processing. In Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan, editors, Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 426–432, Edinburgh, UK. Association for Computational Linguistics.

Marie Mikulová, Alevtina Bémová, Jan Hajič, Eva Hajičová, Jiří Havelka, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, Petr Pajas, Jarmila Panevová, Magda Ševčíková, Petr Sgall, Jan Štěpánek, Zdeňka Urešová, Kateřina Veselá, and Zdeněk Žabokrtský. 2007. Annotation on the tectogrammatical level in the Prague Dependency Treebank. Technical Report 3.1, ÚFAL, Charles University.

Martha Palmer, Paul Kingsbury, Olga Babko-Malaya, Scott Cotton, and Benjamin Snyder. 2004. Proposition Bank I. LDC2004T14, ISBN: 1-58563-304-6.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010), volume 6233 of Lecture Notes in Computer Science, pages 293–304, Berlin / Heidelberg. Iceland Centre for Language Technology (ICLT), Springer.

Martin Popel, David Mareček, Nathan Green, and Zdeněk Žabokrtský. 2011. Influence of parser choice on dependency-based MT. In Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan, editors, Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 433–439, Edinburgh, UK. Association for Computational Linguistics.

Nick Ruiz, Arianna Bisazza, Fabio Brugnara, Daniele Falavigna, Diego Giuliani, Suhel Jaber, Roberto Gretter, and Marcello Federico. 2011. FBK @ IWSLT 2011. In International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. LDC2005T33, ISBN: 1-58563-362-3.

Zdeněk Žabokrtský. 2011. Treex – an open-source framework for natural language processing. In Markéta Lopatková, editor, Information Technologies Applications and Theory, volume 788, pages 7–14, Košice, Slovakia.
