Joshua 2.0: a Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies

Zhifei Li, Chris Callison-Burch, Chris Dyer,† Juri Ganitkevitch, Ann Irvine, Lane Schwartz,⋆ Wren N. G. Thornton, Ziyuan Wang, Jonathan Weese and Omar F. Zaidan
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD
† Computational Linguistics and Information Processing Lab, University of Maryland, College Park, MD
⋆ Natural Language Processing Lab, University of Minnesota, Minneapolis, MN

Abstract

We describe the progress we have made in the past year on Joshua (Li et al., 2009a), an open source toolkit for parsing-based machine translation. The new functionality includes: support for translation grammars with a rich set of syntactic nonterminals, the ability for external modules to posit constraints on how spans in the input sentence should be translated, lattice parsing for dealing with input uncertainty, a semiring framework that provides a unified way of doing various dynamic programming calculations, variational decoding for approximating the intractable MAP decoding, hypergraph-based discriminative training for better feature engineering, a parallelized MERT module, document-level and tail-based MERT, visualization of the derivation trees, and a cleaner pipeline for MT experiments.

1 Introduction

Joshua is an open-source toolkit for parsing-based machine translation that is written in Java. The initial release of Joshua (Li et al., 2009a) was a re-implementation of the Hiero system (Chiang, 2007) and all its associated algorithms, including: chart parsing, n-gram language model integration, beam and cube pruning, and k-best extraction. The Joshua 1.0 release also included re-implementations of suffix array grammar extraction (Lopez, 2007; Schwartz and Callison-Burch, 2010) and minimum error rate training (Och, 2003; Zaidan, 2009). Additionally, it included parallel and distributed computing techniques for scalability (Li and Khudanpur, 2008).

This paper describes the additions to the toolkit over the past year, which together form the 2.0 release. The software has been heavily used by the authors and several other groups in their daily research, and has been substantially refined since the first release. The most important new functions in the toolkit are:

• Support for any style of synchronous context-free grammar (SCFG), including syntax-augmented machine translation (SAMT) grammars (Zollmann and Venugopal, 2006)
• Support for external modules to posit translations for spans in the input sentence that constrain decoding (Irvine et al., 2010)
• Lattice parsing for dealing with input uncertainty, including ambiguous output from speech recognizers or Chinese word segmenters (Dyer et al., 2008)
• A semiring architecture over hypergraphs that allows many inference operations to be implemented easily and elegantly (Li and Eisner, 2009)
• Improvements to decoding through variational decoding and other approximate methods that overcome intractable MAP decoding (Li et al., 2009b)
• Hypergraph-based discriminative training for better feature engineering (Li and Khudanpur, 2009b)
• A parallelization of MERT's computations, supporting document-level and tail-based optimization (Zaidan, 2010)
• Visualization of the derivation trees and hypergraphs (Weese and Callison-Burch, 2010)
• A convenient framework for designing and running reproducible machine translation experiments (Schwartz, under review)

The sections below give short descriptions for each of these new functions.

2 Support for Syntax-based Translation

The initial release of Joshua supported only Hiero-style SCFGs, which use a single nonterminal symbol X. This release includes support for arbitrary SCFGs, including ones that use a rich set of linguistic nonterminal symbols. In particular, we have added support for Zollmann and Venugopal (2006)'s syntax-augmented machine translation. SAMT grammar extraction is identical to Hiero grammar extraction, except that one side of the parallel corpus is parsed, and syntactic labels replace the X nonterminals in Hiero-style rules. Instead of extracting this Hiero rule from the bitext

[X] → [X,1] sans [X,2] | [X,1] without [X,2]

the nonterminals can be labeled according to which constituents cover the nonterminal span on the parsed side of the bitext. This constrains what types of phrases the decoder can use when producing a translation:

[VP] → [VBN] sans [NP] | [VBN] without [NP]
[NP] → [NP] sans [NP] | [NP] without [NP]

Unlike GHKM (Galley et al., 2004), SAMT has the same coverage as Hiero, because it allows non-constituent phrases to get syntactic labels using CCG-style slash notation. Experimentally, we have found that the derivations created using syntactically motivated grammars exhibit more coherent syntactic structure than Hiero and typically result in better reordering, especially for languages with word orders that diverge from English, like Urdu (Baker et al., 2009).

3 Specifying Constraints on Translation

Integrating output from specialized modules (like transliterators, morphological analyzers, and modality translators) into the MT pipeline can improve translation performance, particularly for low-resource languages. We have implemented an XML interface that allows external modules to propose alternate translation rules (constraints) for a particular word span to the decoder (Irvine et al., 2010). Processing that is separate from the MT engine can suggest translations for some set of source-side words and phrases. The XML format allows for both hard constraints, which must be used, and soft constraints, which compete with standard extracted translation rules, as well as specifying associated feature weights. In addition to specifying translations, the XML format allows constraints on the lefthand side of SCFG rules, which allows constraints like forcing a particular span to be translated as an NP. We modified Joshua's chart-based decoder to support these constraints.

4 Semiring Parsing

In Joshua, we use a hypergraph (or packed forest) to compactly represent the exponentially many derivation trees generated by the decoder for an input sentence. Given a hypergraph, we may perform many atomic inference operations, such as finding one-best or k-best translations, or computing expectations over the hypergraph. For each such operation, we could implement a dedicated dynamic programming algorithm. However, a more general framework to specify these algorithms is semiring-weighted parsing (Goodman, 1999). We have implemented the inside algorithm, the outside algorithm, and the inside-outside speedup described by Li and Eisner (2009), plus the first-order expectation semiring (Eisner, 2002) and its second-order version (Li and Eisner, 2009). All of these use our newly implemented semiring framework.

The first- and second-order expectation semirings can also be used to compute many interesting quantities over hypergraphs. These quantities include expected translation length, feature expectation, entropy, cross-entropy, Kullback-Leibler divergence, Bayes risk, variance of hypothesis length, gradient of entropy and Bayes risk, covariance and Hessian matrix, and so on.

5 Word Lattice Input

We generalized the bottom-up parsing algorithm that generates the translation hypergraph so that it supports translation of word lattices instead of just sentences. Our implementation's runtime and memory overhead is proportional to the size of the lattice, rather than the number of paths in the lattice (Dyer et al., 2008). Accepting lattice-based input allows the decoder to explore a distribution over input sentences, allowing it to select the best translation from among all of them. This is especially useful when Joshua is used to translate the output of statistical preprocessing components, such as speech recognizers or Chinese word segmenters, which can encode their alternative analyses as confusion networks or lattices.

6 Variational Decoding

Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations) that have the same yield. In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string during decoding is then NP-hard. The first version of Joshua implemented the Viterbi approximation, which measures the goodness of a translation using only its most probable derivation. The Viterbi approximation is efficient, but it ignores most of the derivations in the hypergraph. We implemented variational decoding (Li et al., 2009b), which works as follows. First, given a foreign string (or lattice), the MT system produces a hypergraph.

7 Hypergraph-based Discriminative Training

Because the reference translation may not appear in the hypergraph, discriminative training methods need to use an oracle translation (i.e., the translation in the hypergraph that is most similar to the reference translation) as a surrogate for training. We implemented the oracle extraction algorithm described by Li and Khudanpur (2009a) for this purpose. Given the current infrastructure, other training methods (e.g., maximum conditional likelihood or MIRA as used by Chiang et al. (2009)) can also be easily supported with minimal coding. We plan to implement a large number of feature functions in Joshua so that exhaustive feature engineering is possible for MT.

8 Minimum Error Rate Training

Joshua's MERT module optimizes parameter weights so as to maximize performance on a development set.
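To illustrate the semiring framework of Section 4, the following is a minimal sketch (not Joshua's actual API; all class and method names here are hypothetical) of the inside algorithm over a hypergraph, parameterized by a semiring. Swapping the sum-product algebra for a max-product one turns the same sweep into Viterbi, which is the "many inference operations from one algorithm" point the section makes.

```java
import java.util.*;

/** A semiring: carrier type T with plus, times, and their identities. */
interface Semiring<T> {
    T zero();
    T one();
    T plus(T a, T b);
    T times(T a, T b);
}

/** Inside (sum-product) semiring over probabilities. */
class ProbSemiring implements Semiring<Double> {
    public Double zero() { return 0.0; }
    public Double one() { return 1.0; }
    public Double plus(Double a, Double b) { return a + b; }
    public Double times(Double a, Double b) { return a * b; }
}

/** Viterbi (max-product) semiring: the same sweep, a different algebra. */
class ViterbiSemiring implements Semiring<Double> {
    public Double zero() { return 0.0; }
    public Double one() { return 1.0; }
    public Double plus(Double a, Double b) { return Math.max(a, b); }
    public Double times(Double a, Double b) { return a * b; }
}

/** A hyperedge deriving a head node from tail nodes with a rule weight. */
class Hyperedge<T> {
    final int head; final int[] tails; final T weight;
    Hyperedge(int head, int[] tails, T weight) {
        this.head = head; this.tails = tails; this.weight = weight;
    }
}

class InsideAlgorithm {
    /**
     * Computes inside values beta[v] for every node in one bottom-up sweep.
     * Nodes 0..numAxioms-1 are leaves (beta = one); node ids are assumed
     * topologically ordered, as chart parsing naturally produces them.
     */
    static <T> List<T> inside(int numNodes, int numAxioms,
                              List<Hyperedge<T>> edges, Semiring<T> s) {
        List<T> beta = new ArrayList<>();
        for (int v = 0; v < numNodes; v++) beta.add(v < numAxioms ? s.one() : s.zero());
        // Group incoming hyperedges by head node.
        Map<Integer, List<Hyperedge<T>>> byHead = new HashMap<>();
        for (Hyperedge<T> e : edges)
            byHead.computeIfAbsent(e.head, k -> new ArrayList<>()).add(e);
        // beta[v] = sum over incoming edges of (rule weight times tail betas).
        for (int v = numAxioms; v < numNodes; v++) {
            T sum = s.zero();
            for (Hyperedge<T> e : byHead.getOrDefault(v, Collections.emptyList())) {
                T prod = e.weight;
                for (int u : e.tails) prod = s.times(prod, beta.get(u));
                sum = s.plus(sum, prod);
            }
            beta.set(v, sum);
        }
        return beta;
    }
}
```

The expectation semirings mentioned in the section follow the same pattern: only the carrier type and the four algebra operations change, while the sweep itself is reused.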
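The efficiency claim in Section 5 — work proportional to the size of the lattice rather than the number of paths — can be seen in a small sketch (hypothetical names, not Joshua's lattice code): a single pass over the arcs of a topologically ordered lattice computes a quantity over all paths, even though the path count grows exponentially.

```java
import java.util.*;

/** A word-lattice arc from one state to another, labeled with a word. */
class Arc {
    final int from, to; final String word;
    Arc(int from, int to, String word) { this.from = from; this.to = to; this.word = word; }
}

class LatticePaths {
    /**
     * Counts the distinct paths from state 0 to the final state with one
     * pass over the arcs -- work linear in lattice size even when the
     * number of paths is exponential. States are assumed topologically
     * ordered (every arc satisfies from < to).
     */
    static long countPaths(int numStates, List<Arc> arcs) {
        long[] paths = new long[numStates];
        paths[0] = 1;
        List<Arc> sorted = new ArrayList<>(arcs);
        sorted.sort(Comparator.comparingInt(a -> a.from));
        for (Arc a : sorted) paths[a.to] += paths[a.from];
        return paths[numStates - 1];
    }
}
```

A confusion network with two alternatives at each of ten positions has only twenty arcs but 2^10 = 1024 paths; the decoder's dynamic program, like this count, touches each arc once rather than enumerating the paths.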
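The spurious-ambiguity problem motivating Section 6 can be made concrete with a toy example (the derivations and probabilities below are invented for illustration): when several derivations share a yield, the string favored by the Viterbi approximation can differ from the string with the greatest total probability.

```java
import java.util.*;

class SpuriousAmbiguity {
    /** Yield of the single most probable derivation (Viterbi approximation). */
    static String viterbiBest(String[] yields, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) if (probs[i] > probs[best]) best = i;
        return yields[best];
    }

    /** Yield whose derivations have the greatest total probability. */
    static String mapBest(String[] yields, double[] probs) {
        Map<String, Double> total = new HashMap<>();
        for (int i = 0; i < yields.length; i++)
            total.merge(yields[i], probs[i], Double::sum);  // sum over shared yields
        return Collections.max(total.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

With derivations ("a", 0.4), ("b", 0.3), ("b", 0.3), Viterbi prefers "a" (its one derivation scores 0.4) while the total-probability criterion prefers "b" (0.3 + 0.3 = 0.6). Variational decoding aims to approximate the latter criterion without the NP-hard exact search.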