Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies

Zhifei Li, Chris Callison-Burch, Chris Dyer,† Juri Ganitkevitch, Ann Irvine, Lane Schwartz,‡ Wren N. G. Thornton, Ziyuan Wang, Jonathan Weese and Omar F. Zaidan
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD
† Computational Linguistics and Information Processing Lab, University of Maryland, College Park, MD
‡ Natural Language Processing Lab, University of Minnesota, Minneapolis, MN

Abstract

We describe the progress we have made in the past year on Joshua (Li et al., 2009a), an open source toolkit for parsing-based machine translation. The new functionality includes: support for translation grammars with a rich set of syntactic nonterminals, the ability for external modules to posit constraints on how spans in the input sentence should be translated, lattice parsing for dealing with input uncertainty, a semiring framework that provides a unified way of doing various dynamic programming calculations, variational decoding for approximating the intractable MAP decoding, hypergraph-based discriminative training for better feature engineering, a parallelized MERT module, document-level and tail-based MERT, visualization of the derivation trees, and a cleaner pipeline for MT experiments.

1 Introduction

Joshua is an open-source toolkit for parsing-based machine translation that is written in Java. The initial release of Joshua (Li et al., 2009a) was a re-implementation of the Hiero system (Chiang, 2007) and all its associated algorithms, including: chart parsing, n-gram language model integration, beam and cube pruning, and k-best extraction. The Joshua 1.0 release also included re-implementations of suffix array grammar extraction (Lopez, 2007; Schwartz and Callison-Burch, 2010) and minimum error rate training (Och, 2003; Zaidan, 2009). Additionally, it included parallel and distributed computing techniques for scalability (Li and Khudanpur, 2008).

This paper describes the additions to the toolkit over the past year, which together form the 2.0 release. The software has been heavily used by the authors and several other groups in their daily research, and has been substantially refined since the first release. The most important new functions in the toolkit are:

• Support for any style of synchronous context free grammar (SCFG), including syntax-augmented machine translation (SAMT) grammars (Zollmann and Venugopal, 2006)
• Support for external modules to posit translations for spans in the input sentence that constrain decoding (Irvine et al., 2010)
• Lattice parsing for dealing with input uncertainty, including ambiguous output from speech recognizers or Chinese word segmenters (Dyer et al., 2008)
• A semiring architecture over hypergraphs that allows many inference operations to be implemented easily and elegantly (Li and Eisner, 2009)
• Improvements to decoding through variational decoding and other approximate methods that overcome intractable MAP decoding (Li et al., 2009b)
• Hypergraph-based discriminative training for better feature engineering (Li and Khudanpur, 2009b)
• A parallelization of MERT's computations, and support for document-level and tail-based optimization (Zaidan, 2010)
• Visualization of the derivation trees and hypergraphs (Weese and Callison-Burch, 2010)
• A convenient framework for designing and running reproducible machine translation experiments (Schwartz, under review)

The sections below give short descriptions of each of these new functions.

2 Support for Syntax-based Translation

The initial release of Joshua supported only Hiero-style SCFGs, which use a single nonterminal symbol X. This release includes support for arbitrary SCFGs, including ones that use a rich set of linguistic nonterminal symbols. In particular, we have added support for Zollmann and Venugopal (2006)'s syntax-augmented machine translation. SAMT grammar extraction is identical to Hiero grammar extraction, except that one side of the parallel corpus is parsed, and syntactic labels replace the X nonterminals in Hiero-style rules. Instead of extracting this Hiero rule from the bitext

[X] ⇒ [X,1] sans [X,2] | [X,1] without [X,2]

the nonterminals can be labeled according to which constituents cover the nonterminal span on the parsed side of the bitext. This constrains what types of phrases the decoder can use when producing a translation:

[VP] ⇒ [VBN] sans [NP] | [VBN] without [NP]
[NP] ⇒ [NP] sans [NP] | [NP] without [NP]

Unlike GHKM (Galley et al., 2004), SAMT has the same coverage as Hiero, because it allows non-constituent phrases to get syntactic labels using CCG-style slash notation. Experimentally, we have found that the derivations created using syntactically motivated grammars exhibit more coherent syntactic structure than Hiero and typically result in better reordering, especially for languages with word orders that diverge from English, like Urdu (Baker et al., 2009).
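To make the contrast between Hiero-style and SAMT-style rules concrete, the following minimal sketch models a synchronous rule as a data structure whose left-hand side and nonterminal slots carry arbitrary labels. The class and field names are illustrative assumptions for this example and are not Joshua's internal rule representation.

```java
// Minimal sketch of an SCFG rule with arbitrary nonterminal labels.
// Illustrative only; these are not Joshua's internal classes.
import java.util.List;

public class ScfgRule {
    final String lhs;              // e.g. "X" (Hiero) or "VP" (SAMT)
    final List<String> sourceSide; // mix of terminals and "[NT,i]" slots
    final List<String> targetSide;

    ScfgRule(String lhs, List<String> src, List<String> tgt) {
        this.lhs = lhs;
        this.sourceSide = src;
        this.targetSide = tgt;
    }

    @Override
    public String toString() {
        return "[" + lhs + "] => " + String.join(" ", sourceSide)
                + " | " + String.join(" ", targetSide);
    }

    public static void main(String[] args) {
        // Hiero: a single undifferentiated nonterminal X.
        ScfgRule hiero = new ScfgRule("X",
                List.of("[X,1]", "sans", "[X,2]"),
                List.of("[X,1]", "without", "[X,2]"));
        // SAMT: labels taken from the parse of the English side of the bitext.
        ScfgRule samt = new ScfgRule("VP",
                List.of("[VBN,1]", "sans", "[NP,2]"),
                List.of("[VBN,1]", "without", "[NP,2]"));
        System.out.println(hiero);
        System.out.println(samt);
    }
}
```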
3 Specifying Constraints on Translation

Integrating output from specialized modules (like transliterators, morphological analyzers, and modality translators) into the MT pipeline can improve translation performance, particularly for low-resource languages. We have implemented an XML interface that allows external modules to propose alternate translation rules (constraints) for a particular word span to the decoder (Irvine et al., 2010). Processing that is separate from the MT engine can suggest translations for some set of source-side words and phrases. The XML input format allows for both hard constraints, which must be used, and soft constraints, which compete with standard extracted translation rules, as well as specifying associated feature weights. In addition to specifying translations, the XML format allows constraints on the left-hand side of SCFG rules, which allows constraints like forcing a particular span to be translated as an NP. We modified Joshua's chart-based decoder to support these constraints.

4 Semiring Parsing

In Joshua, we use a hypergraph (or packed forest) to compactly represent the exponentially many derivation trees generated by the decoder for an input sentence. Given a hypergraph, we may perform many atomic inference operations, such as finding one-best or k-best translations, or computing expectations over the hypergraph. For each such operation, we could implement a dedicated dynamic programming algorithm. However, a more general framework to specify these algorithms is semiring-weighted parsing (Goodman, 1999). We have implemented the inside algorithm, the outside algorithm, and the inside-outside speedup described by Li and Eisner (2009), plus the first-order expectation semiring (Eisner, 2002) and its second-order version (Li and Eisner, 2009). All of these use our newly implemented semiring framework.

The first- and second-order expectation semirings can also be used to compute many interesting quantities over hypergraphs. These quantities include expected translation length, feature expectation, entropy, cross-entropy, Kullback-Leibler divergence, Bayes risk, variance of hypothesis length, gradient of entropy and Bayes risk, covariance and Hessian matrix, and so on.
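The appeal of the semiring framework is that one generic inside pass can compute many different quantities simply by swapping the semiring. The sketch below shows a minimal version of that idea: a Semiring interface, a probability instance, and an inside pass over a hypergraph whose nodes are visited bottom-up. The interfaces and class names are assumptions made for exposition, not Joshua's actual semiring API.

```java
// Minimal sketch of semiring-weighted inside computation over a hypergraph.
// Illustrative only; this is not Joshua's actual semiring framework.
import java.util.*;

interface Semiring<K> {
    K zero();                     // identity for plus
    K one();                      // identity for times
    K plus(K a, K b);             // combine alternative derivations of a node
    K times(K a, K b);            // combine the pieces of one derivation
    K fromEdgeWeight(double w);   // lift a raw model score into the semiring
}

/** The "inside" (sum-product) semiring over real-valued probabilities. */
class ProbSemiring implements Semiring<Double> {
    public Double zero() { return 0.0; }
    public Double one()  { return 1.0; }
    public Double plus(Double a, Double b)  { return a + b; }
    public Double times(Double a, Double b) { return a * b; }
    public Double fromEdgeWeight(double w)  { return w; }
}

class Hyperedge {
    final int head; final int[] tails; final double weight;
    Hyperedge(int head, int[] tails, double weight) {
        this.head = head; this.tails = tails; this.weight = weight;
    }
}

class InsideAlgorithm {
    /**
     * Nodes are numbered in bottom-up topological order, and terminal items
     * are introduced by nullary hyperedges (tails.length == 0). Returns the
     * inside value beta[v] for every node; beta[root] is the total weight of
     * all derivations (the partition function for the probability semiring).
     */
    static <K> List<K> inside(int numNodes, List<Hyperedge> edges, Semiring<K> s) {
        List<K> beta = new ArrayList<>(Collections.nCopies(numNodes, s.zero()));
        for (Hyperedge e : edges) {                 // edges visited bottom-up
            K k = s.fromEdgeWeight(e.weight);
            for (int tail : e.tails) k = s.times(k, beta.get(tail));
            beta.set(e.head, s.plus(beta.get(e.head), k));
        }
        return beta;
    }
}
```

Replacing ProbSemiring with a pair-valued expectation semiring, whose plus and times also propagate weighted feature values, turns this same pass into the computation of quantities like feature expectations or expected translation length.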
5 Word Lattice Input

We generalized the bottom-up parsing algorithm that generates the translation hypergraph so that it supports translation of word lattices instead of just sentences. Our implementation's runtime and memory overhead is proportional to the size of the lattice, rather than the number of paths in the lattice (Dyer et al., 2008). Accepting lattice-based input allows the decoder to explore a distribution over input sentences, allowing it to select the best translation from among all of them. This is especially useful when Joshua is used to translate the output of statistical preprocessing components, such as speech recognizers or Chinese word segmenters, which can encode their alternative analyses as confusion networks or lattices.

6 Variational Decoding

Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations) that have the same yield. In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string during decoding is then NP-hard. The first version of Joshua implemented the Viterbi approximation, which measures the goodness of a translation using only its most probable derivation.

The Viterbi approximation is efficient, but it ignores most of the derivations in the hypergraph. We implemented variational decoding (Li et al., 2009b), which works as follows. First, given a foreign string (or lattice), the MT system produces a hypergraph, which encodes a probability distribution p over possible output strings and their derivations. Second, a distribution q is selected that approximates p as well as possible but comes from a family of distributions Q in which inference is tractable. Third, the best string according to q (instead of p) is found. In our implementation, the q distribution is parameterized by an n-gram model, under which the second and third steps can be performed efficiently and exactly via dynamic programming. In this way, variational decoding considers all derivations in the hypergraph but still allows tractable decoding.
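As a rough illustration of the second and third steps, the sketch below estimates a bigram model q from expected n-gram counts (which, as noted in Section 4, can be obtained from the hypergraph with an expectation semiring) and then rescores candidate strings under q. It is a deliberately simplified stand-in: in Joshua the best string under q is found exactly by dynamic programming over the hypergraph rather than by rescoring a fixed candidate list, and the class and method names here are assumptions made for the example.

```java
// Simplified sketch of the q-model step of variational decoding.
// Assumes expected bigram and unigram counts under p have already been
// computed from the hypergraph (e.g., via an expectation semiring).
import java.util.*;

public class BigramVariationalRescorer {
    private final Map<String, Double> bigramCount;  // "w1 w2" -> E_p[count]
    private final Map<String, Double> unigramCount; // "w1"    -> E_p[count]

    public BigramVariationalRescorer(Map<String, Double> bigramCount,
                                     Map<String, Double> unigramCount) {
        this.bigramCount = bigramCount;
        this.unigramCount = unigramCount;
    }

    /** q(w2 | w1) = E_p[count(w1 w2)] / E_p[count(w1)]. */
    double condProb(String w1, String w2) {
        double denom = unigramCount.getOrDefault(w1, 0.0);
        if (denom == 0.0) return 1e-10;   // crude floor, just for the sketch
        return Math.max(bigramCount.getOrDefault(w1 + " " + w2, 0.0) / denom, 1e-10);
    }

    /** Log-probability of a tokenized string under q, with sentence padding. */
    double logQ(List<String> words) {
        double lp = 0.0;
        String prev = "<s>";
        for (String w : words) { lp += Math.log(condProb(prev, w)); prev = w; }
        return lp + Math.log(condProb(prev, "</s>"));
    }

    /** Pick the candidate string that scores best under q. */
    List<String> best(List<List<String>> candidates) {
        return Collections.max(candidates, Comparator.comparingDouble(this::logQ));
    }
}
```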
7 Hypergraph-based Discriminative Training

Discriminative training with a large number of features has the potential to improve MT performance. We have implemented hypergraph-based minimum risk training (Li and Eisner, 2009), which minimizes the expected loss with respect to the reference translations. The minimum-risk objective can be optimized by a gradient-based method, where the risk and its gradient can be computed using a second-order expectation semiring. For optimization, we use both L-BFGS (http://en.wikipedia.org/wiki/L-BFGS) and Rprop (http://en.wikipedia.org/wiki/Rprop).

We have also implemented the averaged perceptron algorithm and forest reranking (Li and Khudanpur, 2009b). Since the reference translation may not be in the hypergraph due to pruning or inherent deficiency of the translation grammar, we need to use an oracle translation (i.e., the translation in the hypergraph that is most similar to the reference translation) as a surrogate for training. We implemented the oracle extraction algorithm described by Li and Khudanpur (2009a) for this purpose.

Given the current infrastructure, other training methods (e.g., maximum conditional likelihood or MIRA as used by Chiang et al. (2009)) can also be supported with minimal additional coding. We plan to implement a large number of feature functions in Joshua so that exhaustive feature engineering is possible for MT.

8 Minimum Error Rate Training

Joshua's MERT module optimizes parameter weights so as to maximize performance on a development set as measured by an automatic evaluation metric, such as Bleu (Och, 2003).

We have parallelized our MERT module in two ways: parallelizing the computation of metric scores, and parallelizing the search over parameters. The computation of metric scores is a computational concern when tuning to a metric that is slow to compute, such as translation edit rate (Snover et al., 2006). Since scoring a candidate is independent from scoring any other candidate, we parallelize this computation using a multi-threaded solution (based on sample code by Kenneth Heafield). Similarly, we parallelize the optimization of the intermediate initial weight vectors, also using a multi-threaded solution.

Another feature is the module's awareness of document information, and the capability to perform optimization of document-based variants of the automatic metric (Zaidan, 2010). For example, in document-based Bleu, a Bleu score is calculated for each document, and the tuned score is the average of those document scores. The MERT module can furthermore be instructed to target a specific subset of those documents, namely the tail subset, where only the documents with the lowest document Bleu scores are considered. This feature is of interest to GALE teams, for instance, since GALE's evaluation criteria place a lot of focus on translation quality of tail documents.

More details on the MERT method and the implementation can be found in Zaidan (2009). The module is also available as a standalone application, Z-MERT, that can be used with other MT systems (software and documentation at http://cs.jhu.edu/~ozaidan/zmert).
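The following is a minimal sketch of the candidate-scoring parallelization described above, using a fixed thread pool so that each candidate translation is scored independently. The SentenceScorer interface and the use of java.util.concurrent here are assumptions for illustration, not the MERT module's actual classes.

```java
// Sketch of multi-threaded candidate scoring for MERT-style tuning.
// Illustrative only; this is not the actual Z-MERT implementation.
import java.util.*;
import java.util.concurrent.*;

public class ParallelScorer {
    /** Anything that maps a candidate translation to a metric score. */
    public interface SentenceScorer { double score(String candidate); }

    public static double[] scoreAll(List<String> candidates,
                                    SentenceScorer scorer,
                                    int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String cand : candidates) {
                // Each candidate is scored independently, so the tasks can
                // run concurrently without any shared mutable state.
                futures.add(pool.submit(() -> scorer.score(cand)));
            }
            double[] scores = new double[candidates.size()];
            for (int i = 0; i < scores.length; i++) scores[i] = futures.get(i).get();
            return scores;
        } finally {
            pool.shutdown();
        }
    }
}
```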

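Similarly, the sketch below illustrates document-based and tail-based aggregation of the metric: per-document Bleu scores are averaged for the tuned objective, and the tail variant keeps only the lowest-scoring documents. The per-document scores are taken as given; how Bleu itself is computed and how the MERT module actually weights documents are outside the scope of this illustration.

```java
// Sketch of document-based and tail-based metric aggregation.
// Per-document Bleu scores are assumed to be precomputed; this is not
// the MERT module's actual code.
import java.util.*;

public class DocumentMetrics {
    /** Document-based score: the average of per-document Bleu scores. */
    static double documentAverage(Map<String, Double> docBleu) {
        return docBleu.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Tail-based score: average over only the k lowest-scoring documents. */
    static double tailAverage(Map<String, Double> docBleu, int k) {
        return docBleu.values().stream()
                .sorted()                       // ascending: worst documents first
                .limit(k)
                .mapToDouble(Double::doubleValue)
                .average().orElse(0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> docBleu = Map.of("doc1", 0.31, "doc2", 0.18, "doc3", 0.42);
        System.out.printf("document-based: %.3f%n", documentAverage(docBleu));
        System.out.printf("tail (k=2):     %.3f%n", tailAverage(docBleu, 2));
    }
}
```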
9 Visualization

We created tools for visualizing two of the main data structures used in Joshua (Weese and Callison-Burch, 2010). The first visualizer displays hypergraphs. The user can choose from a set of input sentences, then call the decoder to build the hypergraph. The second visualizer displays derivation trees. Setting a flag in the configuration file causes the decoder to output parse trees instead of strings, where each nonterminal is annotated with its source-side span. The visualizer can read in multiple n-best lists in this format, then display the resulting derivation trees side-by-side. We have found that visually inspecting these derivation trees is useful for debugging grammars.

We would like to add visualization tools for more parts of the pipeline. For example, a chart visualizer would make it easier for researchers to tell where search errors were happening during decoding, and why. An alignment visualizer for aligned parallel corpora might help to determine how grammar extraction could be improved.

10 Pipeline for Running MT Experiments

Reproducing other researchers' machine translation experiments is difficult because the pipeline is too complex to fully detail in short conference papers. We have put together a workflow framework for designing and running reproducible machine translation experiments using Joshua (Schwartz, under review). Each step in the machine translation workflow (data preprocessing, grammar training, MERT, decoding, etc.) is modeled by a Make script that defines how to run the tools used in that step, and an auxiliary configuration file that defines the exact parameters to be used in that step for a particular experimental setup. Workflows configured using this framework allow a complete experiment to be run, from downloading data and software through scoring the final translated results, by executing a single Makefile.

This framework encourages researchers to supplement research publications with links to the complete set of scripts and configurations that were actually used to run the experiment. The Johns Hopkins University submission for the WMT10 shared translation task was implemented in this framework, so it can be easily and exactly reproduced.

Acknowledgements

Research funding was provided by the NSF under grant IIS-0713448, by the European Commission through the EuroMatrixPlus project, and by the DARPA GALE program under Contract No. HR0011-06-2-0001. The views and findings are the authors' alone.

References

Kathy Baker, Steven Bethard, Michael Bloodgood, Ralf Brown, Chris Callison-Burch, Glen Coppersmith, Bonnie Dorr, Wes Filardo, Kendall Giles, Anni Irvine, Mike Kayser, Lori Levin, Justin Martineau, Jim Mayfield, Scott Miller, Aaron Phillips, Andrew Philpot, Christine Piatko, Lane Schwartz, and David Zajic. 2009. Semantically informed machine translation (SIMT). SCALE summer workshop final report, Human Language Technology Center Of Excellence.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In NAACL, pages 218–226.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Christopher Dyer, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In Proceedings of ACL-08: HLT, pages 1012–1020, Columbus, Ohio, June. Association for Computational Linguistics.

Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In ACL.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In HLT-NAACL.

Joshua Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.

Ann Irvine, Mike Kayser, Zhifei Li, Wren Thornton, and Chris Callison-Burch. 2010. Integrating output from specialized modules in machine translation: Transliteration in Joshua. The Prague Bulletin of Mathematical Linguistics, 93:107–116.
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In EMNLP, Singapore.

Zhifei Li and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In ACL SSST, pages 10–18.

Zhifei Li and Sanjeev Khudanpur. 2009a. Efficient extraction of oracle-best translations from hypergraphs. In Proceedings of NAACL.

Zhifei Li and Sanjeev Khudanpur. 2009b. Forest reranking for machine translation with the perceptron algorithm. In GALE book chapter on "MT From Text".

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar F. Zaidan. 2009a. Joshua: An open source toolkit for parsing-based machine translation. In WMT09.

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur. 2009b. Variational decoding for statistical machine translation. In ACL.

Adam Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In EMNLP-CoNLL.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL.

Lane Schwartz and Chris Callison-Burch. 2010. Hierarchical phrase-based grammar extraction in Joshua. The Prague Bulletin of Mathematical Linguistics, 93:157–166.

Lane Schwartz. under review. Reproducible results in parsing-based machine translation: The JHU shared task submission. In WMT10.

Matthew Snover, Bonnie J. Dorr, and Richard Schwartz. 2006. A study of translation edit rate with targeted human annotation. In AMTA.

Jonathan Weese and Chris Callison-Burch. 2010. Visualizing data structures in parsing-based machine translation. The Prague Bulletin of Mathematical Linguistics, 93:127–136.

Omar F. Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.

Omar F. Zaidan. 2010. Document- and tail-based minimum error rate training of machine translation systems. In preparation.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the NAACL-2006 Workshop on Statistical Machine Translation (WMT-06), New York, New York.