A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT

A systematic comparison of phrase-based, hierarchical and syntax-augmented statistical MT Andreas Zollmann∗ and Ashish Venugopal∗ and Franz Och and Jay Ponte Google Inc. 1600 Amphitheatre Parkway Mountain View, CA 94303, USA {zollmann,ashishv}@cs.cmu.edu {och,ponte}@google.com Abstract VB → veux, want : w2 . Probabilistic synchronous context-free As with probabilistic context-free grammars, each grammar (PSCFG) translation models rule has a left-hand-side nonterminal (VP and VB define weighted transduction rules that in the two rules above), which constrains the rule’s represent translation and reordering oper- usage in further composition, and is assigned a ations via nonterminal symbols. In this weight w, estimating the quality of the rule based work, we investigate the source of the im- on some underlying statistical model. Transla- provements in translation quality reported tion with a PSCFG is thus a process of compos- when using two PSCFG translation mod- ing such rules to parse the source language while els (hierarchical and syntax-augmented), synchronously generating target language output. when extending a state-of-the-art phrase- PSCFG approaches such as Chiang (2005) and based baseline that serves as the lexical Zollmann and Venugopal (2006) typically begin support for both PSCFG models. We with a phrase-based model as the foundation for isolate the impact on translation quality the PSCFG rules described above. Starting with for several important design decisions in bilingual phrase pairs extracted from automatically each model. We perform this comparison aligned parallel text (Och and Ney, 2004; Koehn et on three NIST language translation tasks; al., 2003), these PSCFG approaches augment each Chinese-to-English, Arabic-to-English contiguous (in source and target words) phrase and Urdu-to-English, each representing pair with a left-hand-side symbol (like the VP in unique challenges. the example above), and perform a generalization procedure to form rules that include nonterminal 1 Introduction symbols. We can thus view PSCFG methods as Probabilistic synchronous context-free grammar an attempt to generalize beyond the purely lexi- (PSCFG) models define weighted transduction cal knowledge represented in phrase based mod- rules that are automatically learned from parallel els, allowing reordering decisions to be explicitly training data. As in monolingual parsing, such encoded in each rule. It is important to note that rules make use of nonterminal categories to gener- while phrase-based models cannot explicitly repre- alize beyond the lexical level. In the example be- sent context sensitive reordering effects like those low, the French (source language) words “ne” and in the example above, in practice, phrase based “pas” are translated into the English (target lan- models often have the potential to generate the guage) word “not”, performing reordering in the same target translation output by translating source context of a nonterminal of type “VB” (verb). phrases out of order, and allowing empty trans- lations for some source words. Apart from one VP → ne VB pas, do not VB : w1 or more language models scoring these reorder- ∗Work done during internships at Google Inc. ing alternatives, state-of-the-art phrase-based sys- ∗c 2008. Licensed under the Creative Commons tems are also equipped with a lexicalized distortion Attribution-Noncommercial-Share Alike 3.0 Unported li- cense (http://creativecommons.org/licenses/by-nc-sa/3.0/). model accounting for reordering behavior more di- Some rights reserved. rectly. While previous work demonstrates impres- sive improvements of PSCFG over phrase-based designed to be local to each phrase pair or rule. approaches for large Chinese-to-English data sce- A notable exception is the n-gram language model narios (Chiang, 2005; Chiang, 2007; Marcu et al., (LM), which evaluates the likelihood of the se- 2006; DeNeefe et al., 2007), these phrase-based quential target words output. Phrase-based sys- baseline systems were constrained to distortion tems also typically allow source segments to be limits of four (Chiang, 2005) and seven (Chiang, translated out of order, and include distortion mod- 2007; Marcu et al., 2006; DeNeefe et al., 2007), els to evaluate such operations. These features respectively, while the PSCFG systems were able suggest the efficient dynamic programming al- to operate within an implicit reordering window of gorithms for phrase-based systems described in 10 and higher. Koehn et al. (2004). In this work, we evaluate the impact of the ex- We now discuss the translation models com- tensions suggested by the PSCFG methods above, pared in this work. looking to answer the following questions. Do the relative improvements of PSCFG methods persist 2.1 Phrase Based MT when the phrase- based approach is allowed com- Phrase-based methods identify contiguous bilin- parable long-distance reordering, and when the n- gual phrase pairs based on automatically gener- gram language model is strong enough to effec- ated word alignments (Och et al., 1999). Phrase tively select from these reordered alternatives? Do pairs are extracted up to a fixed maximum length, these improvements persist across language pairs since very long phrases rarely have a tangible im- that exhibit significantly different reodering effects pact during translation (Koehn et al., 2003). Dur- and how does resource availablility effect relative ing decoding, extracted phrase pairs are reordered performance? In order to answer these questions to generate fluent target output. Reordered trans- we extend our PSCFG decoder to efficiently han- lation output is evaluated under a distortion model dle the high order LMs typically applied in state- and corroborated by one or more n-gram language of-the-art phrase based translation systems. We models. These models do not have an explicit rep- evaluate the phrase-based system for a range of re- resentation of how to reorder phrases. To avoid ordering limits, up to those matching the PSCFG search space explosion, most systems place a limit approaches, isolating the impact of the nontermi- on the distance that source segments can be moved nal based approach to reordering. Results are pre- within the source sentence. This limit, along with sented in multiple language pairs and data size the phrase length limit (where local reorderings scenarios, highlighting the limited impact of the are implicit in the phrase), determine the scope of PSCFG model in certain conditions. reordering represented in a phrase-based system. All experiments in this work limit phrase pairs to 2 Summary of approaches have source and target length of at most 12, and Given a source language sentence f, statistical ma- either source length or target length of at most 6 chine translation defines the translation task as se- (higher limits did not result in additional improve- lecting the most likely target translation e under a ments). In our experiments phrases are extracted model P (e|f), i.e.: by the method described in Och and Ney (2004) and reordering during decoding with the lexical- m X ized distortion model from Zens and Ney (2006). eˆ(f) = arg max P (e|f) = arg max hi(e, f)λi e e The reordering limit for the phrase based system i=1 (for each language pair) is increased until no addi- where the arg max operation denotes a search tional improvements result. through a structured space of translation ouputs 2.2 Hierarchical MT in the target language, hi(e, f) are bilingual features of e and f and monolingual features of e, Building upon the success of phrase-based meth- and weights λi are trained discriminitively to max- ods, Chiang (2005) presents a PSCFG model of imize translation quality (based on automatic met- translation that uses the bilingual phrase pairs of rics) on held out data (Och, 2003). phrase-based MT as starting point to learn hierar- Both phrase-based and PSCFG approaches chical rules. For each training sentence pair’s set of make independence assumptions to structure this extracted phrase pairs, the set of induced PSCFG search space and thus most features hi(e, f) are rules can be generated as follows: First, each phrase pair is assigned a generic X-nonterminal as 2.3 Syntax Augmented MT left-hand-side, making it an initial rule. We can Syntax Augmented MT (SAMT) (Zollmann and now recursively generalize each already obtained Venugopal, 2006) extends Chiang (2005) to in- rule (initial or including nonterminals) clude nonterminal symbols from target language phrase structure parse trees. Each target sentence N → f1 . fm/e1 . en in the training corpus is parsed with a stochas- for which there is an initial rule tic parser—we use Charniak (2000))—to produce constituent labels for target spans. Phrases (ex- M → fi . fu/ej . ev tracted from a particular sentence pair) are assigned left-hand-side nonterminal symbols based where 1 ≤ i < u ≤ m and 1 ≤ j < v ≤ n, to on the target side parse tree constituent spans. obtain a new rule Phrases whose target side corresponds to a con- i−1 m j−1 n stituent span are assigned that constituent’s label as N → f1 Xkfu+1/e1 Xkev+1 their left-hand-side nonterminal. If the target span i−1 where e.g. f1 is short-hand for f1 . fi−1, and of the phrase does not match a constituent in the where k is an index for the nonterminal X that parse tree, heuristics are used to assign categories indicates the one-to-one correspondence between that correspond to partial rewriting of the tree. the new X tokens on the two sides (it is not in These heuristics first consider concatenation oper- the space of word indices like i, j, u, v, m, n). The ations, forming categories such as “NP+V”, and recursive form of this generalization operation al- then resort to CCG (Steedman, 1999) style “slash” lows the generation of rules with multiple nonter- categories such as “NP/NN.” or “DT\NP”.

A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT

Champollion: a Robust Parallel Text Sentence Aligner

The Web As a Parallel Corpus

Resourcing Machine Translation with Parallel Treebanks John Tinsley

Towards Controlled Counterfactual Generation for Text

Natural Language Processing Security- and Defense-Related Lessons Learned

A User Interface-Level Integration Method for Multiple Automatic Speech Translation Systems

Lines: an English-Swedish Parallel Treebank

Cross-Lingual Bootstrapping of Semantic Lexicons: the Case of Framenet

Parallel Texts

Open Architecture for Multilingual Parallel Texts M.T

Unsupervised Learning of Cross-Modal Mappings Between

BITS: a Method for Bilingual Text Search Over the Web