Phrasal: a Statistical Machine Translation Toolkit for Exploring
Total Page:16
File Type:pdf, Size:1020Kb
Phrasal: A Toolkit for Statistical Machine Translation with Facilities for Extraction and Incorporation of Arbitrary Model Features Daniel Cer, Michel Galley, Daniel Jurafsky and Christopher D. Manning Stanford University Stanford, CA 94305, USA Abstract statistics from aligned bitexts and for incor- porating the new features into the decoding We present a new Java-based open source toolkit for phrase-based machine translation. model. The system has already been used to The key innovation provided by the toolkit develop a number of innovative new features is to use APIs for integrating new fea- (Chang et al., 2009; Galley and Manning, 2008; tures (/knowledge sources) into the decod- Galley and Manning, 2009; Green et al., 2010) and ing model and for extracting feature statis- to build translation systems that have placed well tics from aligned bitexts. The package in- at recent competitive evaluations, achieving second cludes a number of useful features written to place for Arabic to English translation on the NIST these APIs including features for hierarchi- 1 cal reordering, discriminatively trained linear 2009 constrained data track. distortion, and syntax based language models. We implemented the toolkit in Java because it of- Other useful utilities packaged with the toolkit fers a good balance between performance and de- include: a conditional phrase extraction sys- veloper productivity. Compared to C++, develop- tem that builds a phrase table just for a spe- ers using Java are 30 to 200% faster, produce fewer cific dataset; and an implementation of MERT defects, and correct defects up to 6 times faster that allows for pluggable evaluation metrics (Phipps, 1999). While Java programs were histori- for both training and evaluation with built in support for a variety of metrics (e.g., TERp, cally much slower than similar programs written in BLEU, METEOR). C or C++, modern Java virtual machines (JVMs) re- sult in Java programs being nearly as fast as C++ 1 Motivation programs (Bruckschlegel, 2005). Java also allows for trivial code portability across different platforms. Progress in machine translation (MT) depends crit- In the remainder of the paper, we will highlight ically on the development of new and better model various useful capabilities, components and model- features that allow translation systems to better iden- ing features included in the toolkit. tify and construct high quality machine translations. The popular Moses decoder (Koehn et al., 2007) 2 Toolkit was designed to allow new features to be defined us- ing factored translation models. In such models, the The toolkit provides end-to-end support for the cre- individual phrases being translated can be factored ation and evaluation of machine translation models. into two or more abstract phrases (e.g., lemma, POS- Given sentence-aligned parallel text, a new transla- tags) that can be translated individually and then tion system can be built using a single command: combined in a seperate generation stage to arrive at java edu.stanford.nlp.mt.CreateModel \ the final target translation. While greatly enriching (source.txt) (target.txt) \ the space of models that can be used for phrase- (dev.source.txt) (dev.ref) (model_name) based machine translation, Moses only allows fea- Running this command will first create word tures that can be defined at the level of individual level alignments for the sentences in source.txt words and phrases. and target.txt using the Berkeley cross-EM aligner The Phrasal toolkit provides easy-to-use APIs for the development of arbitrary new model fea- 1http://www.itl.nist.gov/iad/mig/tests tures. It includes an API for extracting feature /mt/2009/ResultsRelease/currentArabic.html 9 Proceedings of the NAACL HLT 2010: Demonstration Session, pages 9–12, Los Angeles, California, June 2010. c 2010 Association for Computational Linguistics Figure 1: Chinese-to-English translation using discontinuous phrases. (Liang et al., 2006).2 From the word-to-word alignments, the system extracts a phrase ta- ble (Koehn et al., 2003) and hierarchical reorder- ing model (Galley and Manning, 2008). Two n- gram language models are trained on the tar- 15 25 35 tranlations per minute tranlations get.txt sentences: one over lowercased target sen- 1 2 3 4 5 6 7 8 tences that will be used by the Phrasal decoder Cores and one over the original source sentences that will be used for truecasing the MT output. Fi- Figure 2: Multicore translations per minute on a sys- tem with two Intel Xeon L5530 processors running at nally, the system trains the feature weights for the 2.40GHz. decoding model using minimum error rate train- ing (Och, 2003) to maximize the system’s BLEU score (Papineni et al., 2002) on the development threads past four results in only marginal additional data given by dev.source.txt and dev.ref. The toolkit gains as the cost of managing the resources shared is distributed under the GNU general public license between the threads is starting to overwhelm the (GPL) and can be downloaded from http:// value provided by each additional thread. Moses nlp.stanford.edu/software/phrasal. also does not run faster with more than 4-5 threads.3 3 Decoder Feature API The feature API was designed to abstract away complex implementation details of Decoding Engines The package includes two de- the underlying decoding engine and provide a sim- coding engines, one that implements the left-to- ple consistent framework for creating new decoding right beam search algorithm that was first intro- model features. During decoding, as each phrase duced with the Pharaoh machine translation system that is translated, the system constructs a Featuriz- (Koehn, 2004), and another that provides a recently able object. As seen in Table 1, Featurizable objects developed decoding algorithm for translating with specify what phrase was just translated and an over- discontinuous phrases (Galley and Manning, 2010). all summary of the translation being built. Code that Both engines use features written to a common but implements a feature inspects the Featurizable and extensible feature API, which allows features to be returns one or more named feature values. Prior to written once and then loaded into either engine. translating a new sentence, the sentence is passed to Discontinuous phrases provide a mechanism for the active features for a decoding model, so that they systematically translating grammatical construc- can perform any necessary preliminary analysis. tions. As seen in Fig. 1, using discontinuous phrases allows us to successfully capture that the Chinese Comparison with Moses Credible research into construction 当 X 的 can be translated as when X. new features requires baseline system performance that is on par with existing state-of-the-art systems. Multithreading The decoder has robust support Seen in Table 2, Phrasal meets the performance of for multithreading, allowing it to take full advantage Moses when using the exact same decoding model of modern hardware that provides multiple CPU feature set as Moses and outperforms Moses signifi- cores. As shown in Fig. 2, decoding speed scales cantly when using its own default feature set.4 well when the number of threads being used is in- creased from one to four. However, increasing the 3http://statmt.org/moses /?n=Moses.AdvancedFeatures (April 6, 2010) 2Optionally, GIZA++ (Och and Ney, 2003) can also be used 4Phrasal was originally written to replicate Moses as it was to create the word-to-word alignments. implemented in 2007 (release 2007-05-29), and the current ver- 10 Featurizable Phrasal feature API. These extensions are currently Last Translated Phrase Pair not included in the release: Source and Target Alignments Partial Translation Target Side Dependency Language Model The Source Sentence n-gram language models that are traditionally used Current Source Coverage to capture the syntax of the target language do a Pointer to Prior Featurizable poor job of modeling long distance syntactic rela- Table 1: Information passed to features in the form of a tionships. For example, if there are a number of Featurizable object for each translated phrase. intervening words between a verb and its subject, n-gram language models will often not be of much System Features MT06 (tune) MT03 MT05 help in selecting the verb form that agrees with the Moses Moses 34.23 33.72 32.51 subject. The target side dependency language model Phrasal Moses 34.25 33.72 32.49 feature captures these long distance relationships by Phrasal Default 35.02 34.98 33.21 providing a dependency score for the target transla- Table 2: Comparison of two configurations of Phrasal tions produced by the decoder. This is done using to Moses on Chinese-to-English. One Phrasal configura- an efficient quadratic time algorithm that operates tion uses the standard Moses feature set for single factor within the main decoding loop rather than in a sepa- phrase-based translation with distance and phrase level rate reranking stage (Galley and Manning, 2009). msd-bidirectional-fe reordering features. The other uses the default configuration of Phrasal, which replaces the Discriminative Distortion The standard distor- phrase level msd-bidirectional-fe feature with a heirarchi- tion cost model used in phrase-based MT systems cal reordering feature. such as Moses has two problems. First, it does not estimate the future cost of known required moves, 4 Features thus increasing search errors. Second, the model pe- nalizes distortion linearly, even when appropriate re- The toolkit includes the basic eight phrase-based orderings are performed. To address these problems, translation features available in Moses as well as we used the Phrasal feature API to design a new Moses’ implementation of lexical reordering fea- discriminative distortion model that predicts word tures. In addition to the common Moses features, we movement during translation and that estimates fu- also include innovative new features that improve ture cost.