Phrasal: A Toolkit for Statistical Machine Translation with Facilities for Extraction and Incorporation of Arbitrary Model Features

Daniel Cer, Michel Galley, Daniel Jurafsky and Christopher D. Manning
Stanford University
Stanford, CA 94305, USA

Abstract

We present a new Java-based open source toolkit for phrase-based machine translation. The key innovation provided by the toolkit is to use APIs for integrating new features (knowledge sources) into the decoding model and for extracting feature statistics from aligned bitexts. The package includes a number of useful features written to these APIs, including features for hierarchical reordering, discriminatively trained linear distortion, and syntax-based language models. Other useful utilities packaged with the toolkit include: a conditional phrase extraction system that builds a phrase table just for a specific dataset; and an implementation of MERT that allows for pluggable evaluation metrics for both training and evaluation, with built-in support for a variety of metrics (e.g., TERp, BLEU, METEOR).

1 Motivation

Progress in machine translation (MT) depends critically on the development of new and better model features that allow translation systems to better identify and construct high quality machine translations. The popular Moses decoder (Koehn et al., 2007) was designed to allow new features to be defined using factored translation models. In such models, the individual phrases being translated can be factored into two or more abstract phrases (e.g., lemma, POS tags) that can be translated individually and then combined in a separate generation stage to arrive at the final target translation. While greatly enriching the space of models that can be used for phrase-based machine translation, Moses only allows features that can be defined at the level of individual phrases.

The Phrasal toolkit provides easy-to-use APIs for the development of arbitrary new model features. It includes an API for extracting feature statistics from aligned bitexts and for incorporating the new features into the decoding model. The system has already been used to develop a number of innovative new features (Chang et al., 2009; Galley and Manning, 2008; Galley and Manning, 2009; Green et al., 2010) and to build translation systems that have placed well at recent competitive evaluations, achieving second place for Arabic to English translation on the NIST 2009 constrained data track.1

We implemented the toolkit in Java because it offers a good balance between performance and developer productivity. Compared to C++, developers using Java are 30 to 200% faster, produce fewer defects, and correct defects up to 6 times faster (Phipps, 1999). While Java programs were historically much slower than similar programs written in C or C++, modern Java virtual machines (JVMs) result in Java programs being nearly as fast as C++ programs (Bruckschlegel, 2005). Java also allows for trivial code portability across different platforms.

In the remainder of the paper, we will highlight various useful capabilities, components and modeling features included in the toolkit.

2 Toolkit

The toolkit provides end-to-end support for the creation and evaluation of machine translation models. Given sentence-aligned parallel text, a new translation system can be built using a single command:

  java edu.stanford.nlp.mt.CreateModel \
      (source.txt) (target.txt) \
      (dev.source.txt) (dev.ref) (model_name)

1 http://www.itl.nist.gov/iad/mig/tests/mt/2009/ResultsRelease/currentArabic.html

Figure 1: Chinese-to-English translation using discontinuous phrases.

Running this command will first create word-level alignments for the sentences in source.txt and target.txt using the Berkeley cross-EM aligner (Liang et al., 2006).2 From the word-to-word alignments, the system extracts a phrase table (Koehn et al., 2003) and a hierarchical reordering model (Galley and Manning, 2008). Two n-gram language models are trained on the target.txt sentences: one over lowercased target sentences that will be used by the Phrasal decoder, and one over the original target sentences that will be used for truecasing the MT output. Finally, the system trains the feature weights for the decoding model using minimum error rate training (Och, 2003) to maximize the system's BLEU score (Papineni et al., 2002) on the development data given by dev.source.txt and dev.ref. The toolkit is distributed under the GNU general public license (GPL) and can be downloaded from http://nlp.stanford.edu/software/phrasal.

3 Decoder

Decoding Engines The package includes two decoding engines, one that implements the left-to-right beam search algorithm that was first introduced with the Pharaoh machine translation system (Koehn, 2004), and another that provides a recently developed decoding algorithm for translating with discontinuous phrases (Galley and Manning, 2010). Both engines use features written to a common but extensible feature API, which allows features to be written once and then loaded into either engine.

Discontinuous phrases provide a mechanism for systematically translating grammatical constructions. As seen in Fig. 1, using discontinuous phrases allows us to successfully capture that the Chinese construction 当 X 的 can be translated as when X.

Multithreading The decoder has robust support for multithreading, allowing it to take full advantage of modern hardware that provides multiple CPU cores. As shown in Fig. 2, decoding speed scales well when the number of threads being used is increased from one to four. However, increasing the number of threads past four results in only marginal additional gains, as the cost of managing the resources shared between the threads starts to overwhelm the value provided by each additional thread. Moses also does not run faster with more than 4-5 threads.3

Figure 2: Multicore translations per minute on a system with two Intel Xeon L5530 processors running at 2.40GHz.

Feature API The feature API was designed to abstract away complex implementation details of the underlying decoding engine and provide a simple, consistent framework for creating new decoding model features. During decoding, as each phrase is translated, the system constructs a Featurizable object. As seen in Table 1, Featurizable objects specify what phrase was just translated and give an overall summary of the translation being built. Code that implements a feature inspects the Featurizable and returns one or more named feature values. Prior to translating a new sentence, the sentence is passed to the active features for a decoding model, so that they can perform any necessary preliminary analysis.
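To make this contract concrete, below is a minimal, self-contained Java sketch of a decoding model feature. The class, field, and method names here (FeatureValue, the Featurizable stand-in, featurize) are illustrative stand-ins defined within the sketch itself, not the toolkit's actual classes; the real API exposes richer information (see Table 1), but the shape is the same: inspect the Featurizable describing the phrase just translated and return one or more named feature values.

  import java.util.ArrayList;
  import java.util.List;

  // Minimal sketch of a decoding-model feature. The types below are
  // illustrative stand-ins, not the toolkit's actual classes.
  public class LengthRatioFeatureSketch {

    // A named feature value returned to the decoder.
    static class FeatureValue {
      final String name;
      final double value;
      FeatureValue(String name, double value) { this.name = name; this.value = value; }
      @Override public String toString() { return name + "=" + value; }
    }

    // Stand-in for the information passed for each translated phrase
    // (cf. Table 1): the phrase pair just translated plus a summary of
    // the partial translation built so far.
    static class Featurizable {
      final String[] sourcePhrase;
      final String[] targetPhrase;
      final int sourceWordsCovered;   // current source coverage
      final int targetWordsProduced;  // length of the partial translation
      Featurizable(String[] src, String[] tgt, int covered, int produced) {
        sourcePhrase = src; targetPhrase = tgt;
        sourceWordsCovered = covered; targetWordsProduced = produced;
      }
    }

    // A feature inspects the Featurizable and returns named values.
    // Here: a per-phrase word penalty plus a simple length-ratio signal.
    static List<FeatureValue> featurize(Featurizable f) {
      List<FeatureValue> values = new ArrayList<>();
      values.add(new FeatureValue("WordPenalty", -f.targetPhrase.length));
      double ratio = f.targetWordsProduced / (double) Math.max(1, f.sourceWordsCovered);
      values.add(new FeatureValue("LengthRatio", Math.log(ratio)));
      return values;
    }

    public static void main(String[] args) {
      Featurizable f = new Featurizable(
          new String[] {"当", "时候"}, new String[] {"when"}, 5, 4);
      System.out.println(featurize(f));  // e.g. [WordPenalty=-1.0, LengthRatio=...]
    }
  }

In the released toolkit, such a feature would also receive the full source sentence before decoding begins, so that it can perform whatever preliminary analysis it needs, as noted above.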
Comparison with Moses Credible research into new features requires baseline system performance that is on par with existing state-of-the-art systems. As seen in Table 2, Phrasal meets the performance of Moses when using the exact same decoding model feature set as Moses, and outperforms Moses significantly when using its own default feature set.4

2 Optionally, GIZA++ (Och and Ney, 2003) can also be used to create the word-to-word alignments.
3 http://statmt.org/moses/?n=Moses.AdvancedFeatures (April 6, 2010)
4 Phrasal was originally written to replicate Moses as it was implemented in 2007 (release 2007-05-29), and the current version still almost exactly replicates this implementation when using only the baseline Moses features. To ensure this configuration of the decoder is still competitive, we compared it against the current Moses implementation (release 2009-04-13) and found that the performance of the two systems is still close. The current Moses implementation obtains slightly lower BLEU scores, respectively 33.98 and 32.39 on MT06 and MT05.

Featurizable
  Last Translated Phrase Pair
  Source and Target Alignments
  Partial Translation
  The Source Sentence
  Current Source Coverage
  Pointer to Prior Featurizable
Table 1: Information passed to features in the form of a Featurizable object for each translated phrase.

System   Features  MT06 (tune)  MT03   MT05
Moses    Moses     34.23        33.72  32.51
Phrasal  Moses     34.25        33.72  32.49
Phrasal  Default   35.02        34.98  33.21
Table 2: Comparison of two configurations of Phrasal to Moses on Chinese-to-English. One Phrasal configuration uses the standard Moses feature set for single factor phrase-based translation with distance and phrase-level msd-bidirectional-fe reordering features. The other uses the default configuration of Phrasal, which replaces the phrase-level msd-bidirectional-fe feature with a hierarchical reordering feature.

4 Features

The toolkit includes the basic eight phrase-based translation features available in Moses as well as Moses' implementation of lexical reordering features. In addition to the common Moses features, we also include innovative new features that improve translation quality. One of these features is a hierarchical generalization of the Moses lexical reordering model. Instead of just looking at the reordering relationship between individual phrases, the new feature examines the reordering of blocks of adjacent phrases (Galley and Manning, 2008) and improves translation quality when the material being reordered cannot be captured by a single phrase. This hierarchical lexicalized reordering model is used by default in Phrasal and is responsible for the gains shown in Table 2 using the default features.
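The sketch below is a much-simplified, coverage-based illustration of the distinction the hierarchical model draws, not the toolkit's implementation (which follows Galley and Manning (2008) and merges adjacent phrases into blocks): a newly translated source span is judged monotone or swapped relative to whatever contiguous block of already-covered source positions it touches, rather than relative to the single previous phrase only. All names are hypothetical and tie cases are handled crudely.

  import java.util.BitSet;

  // Simplified, illustrative orientation classification for a hierarchical
  // lexicalized reordering feature. Not the toolkit's implementation.
  public class HierarchicalOrientationSketch {

    enum Orientation { MONOTONE, SWAP, DISCONTINUOUS }

    // covered: source positions already translated by earlier phrases.
    // [start, end] (inclusive): source span of the phrase just translated.
    static Orientation classify(BitSet covered, int start, int end) {
      boolean leftAdjacent = start > 0 && covered.get(start - 1);
      boolean rightAdjacent = covered.get(end + 1);
      if (leftAdjacent && !rightAdjacent) return Orientation.MONOTONE;
      if (rightAdjacent && !leftAdjacent) return Orientation.SWAP;
      return Orientation.DISCONTINUOUS;
    }

    public static void main(String[] args) {
      BitSet covered = new BitSet();
      covered.set(0, 3);  // positions 0..2 already covered
      System.out.println(classify(covered, 3, 4));  // MONOTONE: extends the covered block
      System.out.println(classify(covered, 6, 7));  // DISCONTINUOUS: gap at positions 4..5
    }
  }

The point of the block-level view is that if the immediately previous phrase covered, say, only positions 0-1 while an earlier phrase covered position 2, translating span 3-4 next looks discontinuous to a phrase-only model but monotone relative to the merged block 0-2, which is the kind of case the hierarchical feature is designed to score sensibly.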
tween words in a source-side dependency tree and To illustrate how Phrasal can effectively be used uses it to evaluate the appropriateness of candidate to design rich feature sets, we present an overview phrase reorderings (Chang et al., 2009). of various extensions that have been built upon the 5 Other components sion still almost exactly replicates this implementation when using only the baseline Moses features. To ensure this con- Training Decoding Models The package includes figuration of the decoder is still competitive, we compared it a comprehensive toolset for training decoding mod- against the current Moses implementation (release 2009-04- els. It supports MERT training using coordinate de- 13) and found that the performance of the two systems is still scent, Powell’s method, line search along random close. Tthe current Moses implementation obtains slightly lower BLEU scores, respectively 33.98 and 32.39 on MT06 and search directions, and downhill Simplex. In addi- MT05. tion to the BLEU metric, models can be trained

Conditional Phrase Table Extraction Rather than first building a massive phrase table from a parallel corpus and then filtering it down to just what is needed for a specific data set, our toolkit supports the extraction of just those phrases that might be used on a given evaluation set. In doing so, it dramatically reduces the time required to build the phrase table and related data structures such as reordering models.

Feature Extraction API In order to assist in the development of new features, the toolkit provides an API for extracting feature statistics from a word-aligned parallel corpus. This API ties into the conditional phrase table extraction utility, and thus allows for the extraction of just those feature statistics that are relevant to a given data set.
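To suggest what extraction looks like at this level, the sketch below walks a word-aligned sentence pair and counts how often each aligned word pair co-occurs, the kind of raw statistic a lexicalized feature might be built from. The class and method names are hypothetical, not the toolkit's extraction API; restricting such a loop to phrases that occur in a given evaluation set would yield the conditional behavior described above.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative sketch of extracting feature statistics from a
  // word-aligned sentence pair. Hypothetical code, not the toolkit's API.
  public class AlignmentCountExtractor {

    // Counts of aligned (source word, target word) pairs seen in the bitext.
    private final Map<String, Integer> counts = new HashMap<>();

    // alignment[i] = {sourceIndex, targetIndex} for each alignment link.
    void extract(String[] source, String[] target, int[][] alignment) {
      for (int[] link : alignment) {
        String key = source[link[0]] + "|||" + target[link[1]];
        counts.merge(key, 1, Integer::sum);
      }
    }

    Map<String, Integer> counts() { return counts; }

    public static void main(String[] args) {
      AlignmentCountExtractor extractor = new AlignmentCountExtractor();
      extractor.extract(
          new String[] {"当", "他", "到达", "的", "时候"},
          new String[] {"when", "he", "arrived"},
          new int[][] {{0, 0}, {1, 1}, {2, 2}, {4, 0}});
      System.out.println(extractor.counts());
    }
  }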
6 Conclusion

Phrasal is an open source, state-of-the-art Java-based machine translation system that was designed specifically for research into new decoding model features. The system supports traditional phrase-based translation as well as translation using discontinuous phrases. It includes a number of new and innovative model features in addition to those typically found in phrase-based translation systems. It is also packaged with other useful components such as tools for extracting feature statistics, building phrase tables for specific data sets, and MERT training routines that support a number of optimization techniques and evaluation metrics.

Acknowledgements

The Phrasal decoder has benefited from the helpful comments and code contributions of Pi-Chuan Chang, Spence Green, Karthik Raghunathan, Ankush Singla, and Huihsin Tseng. The software presented in this paper is based on work funded by the Defense Advanced Research Projects Agency through IBM. The content does not necessarily reflect the views of the U.S. Government, and no official endorsement should be inferred.

References

Thomas Bruckschlegel. 2005. Microbenchmarking C++, C#, and Java. C/C++ Users Journal.
P. Chang, H. Tseng, D. Jurafsky, and C.D. Manning. 2009. Discriminative reordering with Chinese grammatical relations features. In SSST Workshop at NAACL.
Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In EMNLP.
Michel Galley and Christopher D. Manning. 2009. Quadratic-time dependency parsing for machine translation. In ACL.
Michel Galley and Christopher D. Manning. 2010. Improving phrase-based machine translation with discontiguous phrases. In NAACL.
Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved models of distortion cost for statistical machine translation. In NAACL.
Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL.
Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In AMTA.
Alon Lavie and Michael J. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23.
Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In NAACL.
Sonja Nießen, Franz Josef Och, and Hermann Ney. 2000. An evaluation tool for machine translation: Fast evaluation for MT research. In LREC.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
Geoffrey Phipps. 1999. Comparing observed bug and productivity rates for Java and C++. Softw. Pract. Exper., 29(4):345–358.
M. Snover, N. Madnani, B.J. Dorr, and R. Schwartz. 2009. Fluency, adequacy, or HTER?: Exploring different human judgments with a tunable MT metric. In SMT Workshop at EACL.
C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, and H. Sawaf. 1997. Accelerated DP based search for statistical translation. In Eurospeech.
