Voting on N-grams for Machine Translation System Combination

Kenneth Heafield and Alon Lavie
Language Technologies Institute, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
{heafield,alavie}@cs.cmu.edu

Abstract

System combination exploits differences between machine translation systems to form a combined translation from several system outputs. Core to this process are features that reward n-gram matches between a candidate combination and each system output. Systems differ in performance at the n-gram level despite similar overall scores. We therefore advocate a new feature formulation: for each system and each small n, a feature counts n-gram matches between the system and candidate. We show post-evaluation improvement of 6.67 BLEU over the best system on NIST MT09 Arabic-English test data. Compared to a baseline system combination scheme from WMT 2009, we show improvement in the range of 1 BLEU point.

1 Introduction

System combination merges the output of several machine translation systems to form an improved translation. While individual systems perform similarly overall, human evaluators report different error distributions for each system (Peterson et al., 2009b). For example, some systems are weak at word order while others have more trouble with nouns and verbs. Existing system combination techniques (Rosti et al., 2008; Karakos et al., 2008; Leusch et al., 2009a) ignore these distinctions by learning a single weight for each system. This weight is used in word-level decisions and therefore captures only lexical choice. We see two problems: system behavior differs in more ways than captured by a single weight, and current features only guide decisions at the word level. To remedy this situation, we propose new features that account for multiword behavior, each with a separate set of system weights.

2 Features

Most combination schemes generate many hypothesis combinations, score them using a battery of features, and search for a hypothesis with highest score to output. Formally, the system generates hypothesis h, evaluates feature function f, and multiplies linear weight vector λ by feature vector f(h) to obtain score λ^T f(h). The score is used to rank final hypotheses and to prune partial hypotheses during beam search. The beams contain hypotheses of equal length. With the aim of improving score and therefore translation quality, this paper focuses on the structure of features f and their corresponding weights λ.
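As a minimal illustration of this scoring model (not the paper's implementation), the sketch below computes λ^T f(h) as a dot product over named features; all feature names, values, and weights here are hypothetical placeholders, and the concrete features actually used are described in the rest of this section.

```python
# Minimal sketch of the linear model described above: score(h) = lambda^T f(h).
# Feature names and all numeric values are illustrative placeholders,
# not values from the paper.

def linear_score(weights, feature_vector):
    """Return the dot product lambda^T f(h) for one hypothesis."""
    return sum(weights[name] * value for name, value in feature_vector.items())

# A hypothetical feature vector f(h) for one candidate combination.
f_h = {"length": 5.0, "lm_logprob": -12.7}

# The weight vector lambda would be tuned (Section 4); placeholders here.
lam = {"length": 0.1, "lm_logprob": 0.5}

# The score ranks final hypotheses and prunes partial ones during beam search.
print(linear_score(lam, f_h))
```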
The feature function f consists of the following feature categories:

Length: Length of the hypothesis, as in Koehn et al. (2007). This compensates, to first order, for the impact of length on other features.

LM: Log probability from an SRI (Stolcke, 2002) language model. When the language model scores a word, it finds the longest n-gram in the model with the same word and context. We use the length n as a second feature. The purpose of this second feature is to provide the scoring model with limited control over language model backoff penalties.

Match: The main focus of this paper, this feature category captures how well the hypothesis corresponds to the system outputs. It consists of several features of the form c_{s,n} which count n-grams in the hypothesis that match the sentence output by system s. We use several instantiations of feature c_{s,n}, one for each system s and for length n ranging from 1 to some maximum length l. The hyperparameter l is considered in Section 6.1. Figure 1 shows example values with two systems and l = 3. The number of match features is l times the number of systems being combined.

    System 1:  Supported Proposal of France
    System 2:  Support for the Proposal of France
    Candidate: Support for Proposal of France

                Unigram  Bigram  Trigram
    System 1       4       2        1
    System 2       5       3        1

Figure 1: Example match feature values with two systems and matches up to length l = 3. Here, "Supported" counts because it aligns with "Support".

The match features consider only the specific sentences being combined, all of which are translations from a single source sentence. By contrast, Leusch et al. (2009a) count matches to the entire translated corpus. While our approach considers the most specific data available, theirs has the advantage of gathering more data with which to compare, including document-level matches. As discussed in Section 3, we also differ significantly from Leusch et al. (2009a) in that our features have tunable system and n-gram weights while theirs are hyperparameters.

Together, these define the feature function f. We focus primarily on the new match features that capture lexical and multiword agreement with each system. The weight on match count c_{s,n} corresponds to confidence in n-grams from system s. However, this weight also accounts for correlation between features, which is quite high within the same system and across related systems. Viewed as language modeling, each c_{s,n} is a miniature language model trained on the translated sentence output by system s and jointly interpolated with a traditional language model and with peer models.

As described further in Section 4, we jointly tune the weights λ using modified minimum error rate training (Och, 2003). In doing so, we simultaneously learn several weights for each system, one for each n-gram length output by that system. The unigram weight captures confidence in lexical choice while weights on longer n-gram features capture confidence in word order and phrasal choices. These features are most effective with a variety of hypotheses from which to choose, so in Section 5 we describe a search space with more flexible word order. Combined, these comprise a combination scheme that differs from others in three key ways: the search space is more flexible, the features consider matches longer than unigrams, and system weights differ by task.
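To make the match features c_{s,n} concrete, the sketch below recomputes the Figure 1 values for the candidate and the two system outputs. It is a simplified illustration rather than the authors' implementation: the paper credits aligned word pairs such as "Supported"/"Support", which the sketch mimics with an explicit, hypothetical alignment map, and details such as count clipping may differ.

```python
# Sketch of the per-system match features c_{s,n}: for each system s and each
# n from 1 to l, count n-grams of the candidate that also occur in the sentence
# output by system s. Illustrative only; the paper's alignment step is
# approximated with a hand-specified word mapping.

def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def match_features(candidate, system_output, l, aligned=None):
    """Return [c_{s,1}, ..., c_{s,l}] for one system output."""
    aligned = aligned or {}
    # Map candidate tokens onto the system tokens they align with, if any.
    cand = [aligned.get(tok, tok) for tok in candidate.split()]
    sys_toks = system_output.split()
    counts = []
    for n in range(1, l + 1):
        sys_ngrams = set(ngrams(sys_toks, n))
        counts.append(sum(1 for g in ngrams(cand, n) if g in sys_ngrams))
    return counts

candidate = "Support for Proposal of France"
system1 = "Supported Proposal of France"
system2 = "Support for the Proposal of France"

# Hypothetical alignment: the candidate's "Support" aligns with System 1's
# "Supported", so it counts as a unigram match (as in the Figure 1 caption).
alignment_to_system1 = {"Support": "Supported"}

print(match_features(candidate, system1, 3, alignment_to_system1))  # [4, 2, 1]
print(match_features(candidate, system2, 3))                        # [5, 3, 1]
```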
3 Related Work

System combination takes a variety of forms that pair a space of hypothesis combinations with features to score these hypotheses. Here, we are primarily interested in three aspects of each combination scheme: the space of hypotheses, features that reward n-gram matches with system outputs, and the system weights used for those features.

3.1 Search Spaces

Hypothesis selection (Hildebrand and Vogel, 2009) and minimum Bayes risk (Kumar and Byrne, 2004) select from k-best lists output by each system. In the limit case for large k, DeNero et al. (2010) adopt the search spaces of the translation systems being combined.

Confusion networks preserve the word order of one k-best list entry called the backbone. The backbone is chosen by hypothesis selection (Karakos et al., 2008; Sim et al., 2007) or jointly with decoding (Leusch et al., 2009a; Rosti et al., 2008). Other k-best entries are aligned to the backbone. The search space consists of choosing each word from among the alternatives aligned with it, keeping these in backbone order. With this search space, the impact of our match features is limited to selection at the word level, multiword lexical choice, and possibly selection of the backbone.

Flexible ordering schemes use a reordering model (He and Toutanova, 2009) or dynamically switch backbones (Heafield et al., 2009) to create word orders not seen in any single translation. For example, these translations appear in NIST MT09 (Peterson et al., 2009a): "We have not the interest of control on the Palestinians life," and "We do not have a desire to control the lives of the Palestinians." Flexible ordering schemes consider "have not" versus "do not have" separately from "Palestinians life" versus "lives of the Palestinians." Our match features have the most impact here because word order is less constrained. This is the type of search space we use in our experiments.

3.2 N-gram Match Features

Agreement is central to system combination and most schemes have some form of n-gram match features. Of these, the simplest consider only unigram matches (Ayan et al., 2008; Heafield et al., 2009; Rosti et al., 2008; Zhao and Jiang, 2009).

Some schemes go beyond unigrams but with fixed weight. Karakos (2009) uses n-gram matches to select the backbone but only unigrams for decoding. Kumar and Byrne (2004) use an arbitrary evaluation metric to measure similarity.

3.3 System Weights

Some schemes treat system weights as a hyperparameter (Heafield et al., 2009; Hildebrand and Vogel, 2009). The hyperparameter might be set to an increasing function of each system's overall performance (Rosti et al., 2009; Zhao and Jiang, 2009); this does not account for correlation.

Some combination schemes (Karakos et al., 2008; Leusch et al., 2009a; Rosti et al., 2008; Zens and Ney, 2006) tune system weights, but only as they apply to votes on unigrams. In particular, Leusch et al. (2009a) note that their system weights impact neither their trigram language model based on system outputs nor selection of a backbone. We address these issues by introducing separate system weights for each n-gram length in the system output language model and by not restricting search to the order of a single backbone.

Our experiments use only 1-best outputs, so system-level weights suffice here. For methods that use k-best lists, system weight may be moderated by some decreasing function of rank in the k-best list (Ayan et al., 2008; Zhao and He, 2009). Minimum Bayes risk (Kumar and Byrne, 2004) takes the technique a step further by using the overall system scores that determined the ranking.

4 Parameter Tuning