CMU Multi-Engine Machine Translation for WMT 2010
Kenneth Heafield, Carnegie Mellon University, Pittsburgh, PA, USA ([email protected])
Alon Lavie, Carnegie Mellon University, Pittsburgh, PA, USA ([email protected])

Proceedings of the Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, pages 301-306, Uppsala, Sweden, 15-16 July 2010. (c) 2010 Association for Computational Linguistics.

Abstract

This paper describes our submission, cmu-heafield-combo, to the WMT 2010 machine translation system combination task. Using constrained resources, we participated in all nine language pairs, namely translating English to and from Czech, French, German, and Spanish, as well as combining English translations from multiple languages. Combination proceeds by aligning all pairs of system outputs, then navigating the aligned outputs from left to right, where each path is a candidate combination. Candidate combinations are scored by their length, agreement with the underlying systems, and a language model. On tuning data, improvement in BLEU over the best system depends on the language pair and ranges from 0.89% to 5.57%, with a mean of 2.37%.

1 Introduction

System combination merges the output of several machine translation systems into a single improved output. Our system combination scheme, submitted to the Workshop on Statistical Machine Translation (WMT) 2010 as cmu-heafield-combo, is an improvement over our previous system (Heafield et al., 2009), called cmu-combo in WMT 2009. The scheme consists of aligning 1-best outputs from each system using the METEOR (Denkowski and Lavie, 2010) aligner, identifying candidate combinations by forming left-to-right paths through the aligned system outputs, and scoring these candidates using a battery of features. Improvements this year include unigram paraphrase alignment, support for all target languages, new features, language modeling without pruning, and more parameter optimization. This paper describes our scheme with emphasis on the improved areas.

2 Related Work

Confusion networks (Rosti et al., 2008) are the most popular form of system combination. In this approach, a single system output acts as a backbone to which the other outputs are aligned. The backbone determines word order while the other outputs vote for substitution, deletion, and insertion operations. Essentially, the backbone is edited to produce a combined output that largely preserves its word order. Our approach differs in that we allow paths to switch between sentences, effectively permitting the backbone to switch at every word.

Other system combination techniques typically use TER (Snover et al., 2006) or ITGs (Karakos et al., 2008) to align system outputs, meaning they depend solely on positional information to find approximate matches; we explicitly use stem, synonym, and paraphrase data to find alignments. Our use of paraphrases is similar to Leusch et al. (2009), though they learn a monolingual phrase table while we apply cross-lingual pivoting (Bannard and Callison-Burch, 2005).

3 Alignment

System outputs are aligned at the token level using a variant of the METEOR (Denkowski and Lavie, 2010) aligner. This identifies, in decreasing order of priority: exact, stem, synonym, and unigram paraphrase matches. Stems (Porter, 2001) are available for all languages except Czech; adding Czech stemming is planned for future work and expected to produce significant improvement. Synonyms come from WordNet (Fellbaum, 1998) and are only available in English. Unigram paraphrases are automatically generated using phrase table pivoting (Bannard and Callison-Burch, 2005). The phrase tables are trained using parallel data from Europarl (fr-en, es-en, and de-en), News Commentary (fr-en, es-en, de-en, and cz-en), United Nations (fr-en and es-en), and CzEng (cz-en) (Bojar and Žabokrtský, 2009).
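The tiered matching can be sketched in code. This is an illustrative sketch, not the METEOR implementation: toy_stem, SYNONYMS, and PARAPHRASES below are crude stand-ins for the Porter stemmer, WordNet, and the pivoted paraphrase table, and ambiguity is resolved greedily rather than by minimizing crossings.

```python
# Illustrative sketch of priority-ordered token alignment between two
# system outputs: exact > stem > synonym > unigram paraphrase.

def toy_stem(token):
    # Crude stand-in for a real stemmer: lowercase and drop trailing d/s.
    return token.lower().rstrip("ds")

SYNONYMS = {("twice", "double")}          # stand-in for WordNet
PARAPHRASES = {("plants", "stations")}    # stand-in for pivoted paraphrases

def match_type(a, b):
    """Return the highest-priority match type for a token pair, or None."""
    if a == b:
        return "exact"
    if toy_stem(a) == toy_stem(b):
        return "stem"
    pair = tuple(sorted((a.lower(), b.lower())))
    if pair in {tuple(sorted(p)) for p in SYNONYMS}:
        return "synonym"
    if pair in {tuple(sorted(p)) for p in PARAPHRASES}:
        return "paraphrase"
    return None

def align(sys1, sys2):
    """Greedy one-to-one alignment, filling higher-priority matches first."""
    used1, used2, links = set(), set(), []
    for wanted in ("exact", "stem", "synonym", "paraphrase"):
        for i, a in enumerate(sys1):
            for j, b in enumerate(sys2):
                if i in used1 or j in used2:
                    continue
                if match_type(a, b) == wanted:
                    links.append((i, j, wanted))
                    used1.add(i)
                    used2.add(j)
    return links
```

On the Figure 1 sentences, this toy version recovers the same five links: that-that and nuclear-nuclear as exact, produced-produce as stem, twice-double as synonym, and plants-stations as paraphrase.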
Sections 0-8 of CzEng are used. The German and Spanish tables also use the German-Spanish Europarl corpus released for WMT08 (Callison-Burch et al., 2008). Currently, the generated paraphrases are filtered to solely unigram matches; full use of this table is planned for future work. When alignment is ambiguous (e.g. "that" appears twice in a system output), an alignment is chosen to minimize crossing with other alignments. Figure 1 shows an example alignment. Compared to our previous system, this replaces heuristic "artificial" alignments with automatically learned unigram paraphrases.

[Figure 1: Alignment generated by METEOR between "Twice that produced by nuclear plants" and "Double that that produce nuclear power stations", showing exact (that-that and nuclear-nuclear), stem (produced-produce), synonym (twice-double), and unigram paraphrase (plants-stations) alignments.]

4 Search Space

A candidate combination consists of a string of tokens (words and punctuation) output by the underlying systems. Unconstrained, the string could repeat tokens and assemble them in any order. We therefore impose several constraints:

Sentence: The string starts with the beginning of sentence token and finishes with the end of sentence token. These tokens implicitly appear in each system's output.

Repetition: A token may be used at most once. Tokens that METEOR aligned are alternatives and cannot both be used.

Weak Monotonicity: This prevents the scheme from reordering too much. Specifically, the path cannot jump backwards more than r tokens, where positions are measured relative to the beginning of sentence. Nor can it make a series of smaller jumps that add up to more than r. Equivalently, once a token in the ith position of some system output is used, all tokens before the (i - r)th position in the respective system outputs become unusable. The value of r is a hyperparameter considered in Section 6.

Completeness: Tokens may not be skipped unless the sentence ends or another constraint would be violated. Specifically, when a token from some system is used, it must be the first (leftmost in the system output) available token from that system. For example, the first decoded token must be the first token output by some system.

Together, these constraints define the search space. The candidate starts at the beginning of sentence by choosing the first token from any system. Then it can either continue with the next token from the same system or switch to another one. When it switches to another system, it does so to the first available token from the new system. The repetition constraint requires that the token does not repeat content. The weak monotonicity constraint ensures that the jump to the new system goes at most r words back. The process repeats until the end of sentence token is encountered.

The previous version (Heafield et al., 2009) also had a hard phrase constraint and heuristics to define a phrase; this has been replaced with new match features.

Search is performed using beam search, where the beam contains partial candidates of the same length, each of which starts with the beginning of sentence token. In our experiments, the beam size is 500. When two partial candidates will extend in the same way (namely, the set of available tokens is the same) and have the same feature state (i.e. language model history), they are recombined. The recombined partial candidate subsequently acts like its highest scoring element, until k-best list extraction, when it is lazily unpacked.

5 Scoring Features

Candidates are scored using three feature classes:

Length: Number of tokens in the candidate. This compensates, to first order, for the impact of length on other features.

Match: For each system s and small n, feature m_{s,n} is the number of n-grams in the candidate matching the sentence output by system s. This is detailed in Section 5.1.

Language Model: Log probability from an n-gram language model and backoff statistics. Section 5.2 details our training data and backoff features.

Features are combined into a score using a linear model. Equivalently, the score is the dot product of a weight vector with the vector of our feature values. The weight vector is a parameter optimized in Section 6.

5.1 Match Features

The n-gram match features reward agreement between the candidate combination and the underlying system outputs. For example, feature m_{1,1} counts tokens in the candidate that also appear in system 1's output for the sentence being combined. Feature m_{1,2} counts bigrams appearing in both the candidate and the translation suggested by system 1.

When only exact alignments are counted, fewer systems are able to vote on a word order decision mediated by the bigram and trigram features. We find that both versions have their advantages, and therefore include two sets of match features: one that counts only exact alignments and another that counts all alignments. We also tried copies of the match features at the stem and synonym level but found these impose additional tuning cost with no measurable improvement in quality.

Since systems have different strengths and weaknesses, we avoid assigning a single system confidence (Rosti et al., 2008) or counting n-gram matches with uniform system confidence (Hildebrand and Vogel, 2009). The weight on match feature m_{s,n} corresponds to our confidence in n-grams from system s. These weights are fully tunable. However, there is another hyperparameter: the maximum length of n-gram considered; we typically use 2 or 3, with little gain seen above this.
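A minimal sketch of computing a match feature m_{s,n}, assuming plain exact n-gram overlap with clipped counts; the variant that also credits METEOR-aligned tokens is not shown, and the function names are ours, not the system's.

```python
# Sketch of the n-gram match features: for each system s and order n,
# count the candidate's n-grams that also occur in system s's output,
# clipping each n-gram's credit at its count in the system output.

from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def match_feature(candidate, system_output, n):
    """m_{s,n}: clipped count of n-grams shared with system s's output."""
    cand = Counter(ngrams(candidate, n))
    sys_ = Counter(ngrams(system_output, n))
    return sum(min(count, sys_[gram]) for gram, count in cand.items())
```

For the example above, candidate "Support for Proposal of France" gets m_{1,1} = 3 and m_{1,2} = 2 against system 1 ("Supported Proposal of France"), versus m_{2,1} = 5 and m_{2,2} = 3 against system 2 ("Support for the Proposal of France"), under this exact-only counting.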
Figure 2 shows example feature values:

System 1:  Supported Proposal of France
System 2:  Support for the Proposal of France
Candidate: Support for Proposal of France

5.2 Language Model

We built language models for each of the five target languages with the aim of using all constrained data. For each language, we used the provided Europarl (Koehn, 2005) (except for Czech), News Commentary, and News monolingual corpora.
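The shape of the language model feature can be illustrated with a short sketch. For simplicity this uses stupid backoff over toy counts; the actual submission used standard backoff n-gram models, whose backoff statistics also feed separate features.

```python
# Illustrative n-gram LM scoring with stupid backoff: back off to shorter
# contexts, multiplying by a constant penalty (0.4 here) per backoff step.
# Toy counts stand in for models trained on the constrained corpora.

import math
from collections import Counter

def train_counts(corpus_tokens, max_n=3):
    """Count all 1..max_n-grams in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1
    return counts

def stupid_backoff(counts, context, word, alpha=0.4):
    """Score word given context, backing off to shorter contexts."""
    for k in range(len(context), -1, -1):
        gram = tuple(context[len(context) - k:]) + (word,)
        hist = gram[:-1]
        hist_count = counts[hist] if hist else sum(
            c for g, c in counts.items() if len(g) == 1)
        if counts[gram] > 0 and hist_count > 0:
            return alpha ** (len(context) - k) * counts[gram] / hist_count
    return 0.0

def lm_log_score(counts, tokens, order=3):
    """Sum of per-token log scores: the shape of the LM feature."""
    total = 0.0
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - order + 1):i]
        p = stupid_backoff(counts, context, word)
        total += math.log(p) if p > 0 else -100.0  # floor unseen words
    return total
```

A seen bigram like "of France" scores from its own counts, while an unseen bigram backs off to the unigram count with the 0.4 penalty; a linear model then weights this log score against the length and match features.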