
Using N-gram based Features for Machine Translation System Combination

Yong Zhao 1                              Xiaodong He
Georgia Institute of Technology          Microsoft Research
Atlanta, GA 30332, USA                   Redmond, WA 98052, USA
[email protected]                     [email protected]

Abstract

Conventional confusion network based system combination for machine translation (MT) relies heavily on features based on the agreement of words in different translation hypotheses. This paper presents two new features that consider the agreement of n-grams in different hypotheses to improve the performance of system combination. The first is based on a sentence-specific online n-gram language model, and the second is based on n-gram voting. Experiments on a large-scale Chinese-to-English MT task show that both features yield significant improvements in translation performance, and that their combination produces even better translation results.

1 Introduction

In past years, the confusion network based system combination approach has been shown to give substantial improvements in various machine translation (MT) tasks (Bangalore et al., 2001; Matusov et al., 2006; Rosti et al., 2007; He et al., 2008). Given hypotheses of multiple systems, a confusion network is built by aligning all these hypotheses. The resulting network comprises a sequence of correspondence sets, each of which contains the alternative words that are aligned with each other. To derive a consensus hypothesis from the confusion network, decoding is performed by selecting a path with the maximum overall confidence score among all paths that pass through the confusion network (Goel et al., 2004).

The confidence score of a hypothesis can be assigned in various ways. Fiscus (1997) used voting by frequency of word occurrences. Mangu et al. (2000) computed a word posterior probability based on voting for that word in different hypotheses. Moreover, the overall confidence score is usually formulated as a log-linear model that includes extra features such as a language model (LM) score, a word count, etc.

Features based on word agreement measures have been studied extensively in past work (Matusov et al., 2006; Rosti et al., 2007; He et al., 2008). However, the use of n-gram agreement information among the hypotheses has not yet been fully explored. Moreover, it has been argued that confusion network decoding may introduce undesirable spur words that break coherent phrases (Sim et al., 2007). Therefore, we would prefer a consensus translation that has better n-gram agreement with the outputs of the single systems.

In the literature, Zens and Ney (2004) proposed an n-gram posterior probability based LM for MT. For each source sentence, a LM is trained on the n-best list produced by a single MT system and is used to re-rank that n-best list itself. On the other hand, Matusov et al. (2008) proposed an "adapted" LM for system combination, where this "adapted" LM is trained on the translation hypotheses of the whole test corpus from all single MT systems involved in system combination. Inspired by these ideas, we propose two new features based on n-gram agreement measures to improve the performance of system combination. The first is a sentence-specific LM built on the translation hypotheses of multiple systems; the second is an n-gram-voting-based confidence. Experimental results are presented in the context of a large-scale Chinese-to-English translation task.

1 The work was performed when Yong Zhao was an intern at Microsoft Research.
2 System Combination for MT

One of the most successful approaches to system combination for MT is based on confusion network decoding, as described in (Rosti et al., 2007). Given translation hypotheses from multiple MT systems, one of the hypotheses is selected as the backbone for hypothesis alignment. This is usually done by a sentence-level minimum Bayes risk (MBR) re-ranking method. The confusion network is constructed by aligning all these hypotheses against the backbone. Words that align to each other are grouped into a correspondence set, constituting the competition links of the confusion network. Each path in the network passes through exactly one link from each correspondence set. The final consensus output relies on a decoding procedure that chooses a path with the maximum confidence score among all paths that pass through the confusion network.

The confidence score of a hypothesis is usually formalized as a log-linear sum of several feature functions. Given a source language sentence F, the total confidence of a target hypothesis E = (e_1, ..., e_L) in the confusion network can be represented as:

\log P(E \mid F) = \sum_{l=1}^{L} \log P(e_l \mid l, F) + \lambda_1 \log P_{LM}(E) + \lambda_2 N_{words}(E)    (1)

where the feature functions include the word posterior probability P(e_l | l, F), the LM probability P_LM(E), and the number of real words N_words(E) in E. The model parameters λ_i can be trained by optimizing an evaluation metric, e.g., the BLEU score, on a held-out development set.

3 N-gram Online Language Model

Given a source sentence F, the fractional count C(e_1^n | F) of an n-gram e_1^n is defined as:

C(e_1^n \mid F) = \sum_{E' \in \mathbf{E}_h} P(E' \mid F) \sum_{l=n}^{L} \delta(e'^{\,l}_{l-n+1}, e_1^n)    (2)

where E_h denotes the hypothesis set, δ(·,·) denotes the Kronecker function, and P(E'|F) is the posterior probability of translation hypothesis E', which is expressed as the weighted sum of the system-specific posterior probabilities over the systems that contain hypothesis E':

P(E' \mid F) = \sum_{k=1}^{K} w_k \, P(E' \mid S_k, F) \, 1(E' \in \mathbf{E}_{S_k})    (3)

where w_k is the weight for the posterior probability of the k-th system S_k, and 1(·) is the indicator function.

Following Rosti et al. (2007), the system-specific posteriors are derived from a rank-based scoring scheme. I.e., if translation hypothesis E_r is the r-th best output in the n-best list of system S_k, the posterior P(E_r | S_k, F) is approximated as:

P(E_r \mid S_k, F) = \frac{1/(1+r)^{\eta}}{\sum_{r'=1}^{|\mathbf{E}_{S_k}|} 1/(1+r')^{\eta}}    (4)

where η is a rank smoothing parameter.

Similar to (Zens and Ney, 2004), a straightforward approach to using the n-gram fractional counts is to formulate them as a sentence-specific online LM. The online LM score of a path in the confusion network is then added as an additional feature in the log-linear model for decoding. The online n-gram LM score is computed by:

P(e_l \mid e_{l-n+1}^{l-1}, F) = \frac{C(e_{l-n+1}^{l} \mid F)}{C(e_{l-n+1}^{l-1} \mid F)}    (5)

The LM score of hypothesis E is obtained by:

P_{LM}(E \mid F) = \prod_{l=n}^{L} P(e_l \mid e_{l-n+1}^{l-1}, F)    (6)

Since new n-grams unseen in the original translation hypotheses may be proposed by the CN decoder, LM smoothing is critical. In our approach, the score of the online LM is smoothed by taking a linear interpolation to combine scores of different orders:

P_{smooth}(e_l \mid e_{l-n+1}^{l-1}, F) = \sum_{m=1}^{n} \alpha_m P(e_l \mid e_{l-m+1}^{l-1}, F)    (7)

In our implementation, the interpolation weights {α_m} can be learned along with the other combination parameters in the same Max-BLEU training scheme via Powell's search.

4 N-gram-Voting-Based Confidence

Motivated by features based on the voting of single words, we propose new features based on n-gram voting. The voting score V(e_1^n | F) of an n-gram e_1^n is computed as:

V(e_1^n \mid F) = \sum_{E' \in \mathbf{E}_h} P(E' \mid F) \, 1(e_1^n \in E')    (8)

An n-gram receives a vote from each hypothesis that contains it, weighted by the posterior probability of that hypothesis, where the posterior probability P(E'|F) is computed by (3). Unlike the fractional count, each hypothesis can vote no more than once for a given n-gram. V(e_1^n | F) therefore takes a value between 0 and 1, and can be viewed as the confidence of the n-gram e_1^n. The n-gram-voting-based confidence score of a hypothesis E is then computed as the product of the confidence scores of the n-grams in E:

P_{NV,n}(E \mid F) = \prod_{m=1}^{L-n+1} V(e_m^{m+n-1} \mid F)    (9)

5 Evaluation

We evaluate the proposed n-gram based features on the Chinese-to-English (C2E) test of past NIST Open MT Evaluations. The experimental results are reported in case-sensitive BLEU score (Papineni et al., 2002).

The dev set, which is used for system combination parameter training, consists of the newswire and newsgroup parts of NIST MT06 and contains a total of 1099 sentences. The test set is the "current" test set of NIST MT08, which contains 1357 sentences of newswire and web-blog data. Both dev and test sets have four reference translations per sentence.

Outputs from a total of eight single MT systems were combined for consensus translations. The selected systems are based on various translation paradigms, such as phrasal, hierarchical, and syntax-based systems. Each system produces 10-best hypotheses per translation. The BLEU scores of the eight individual systems range from 26.11% to 31.09% on the dev set and from 20.42% to 26.24% on the test set. In our experiments, a state-of-the-art system combination method proposed by He et al. (2008) is implemented as the baseline. The true-casing model proposed by Toutanova et al. (2008) is used.

Table 1 shows the results of adding the online LM feature. Different LM orders up to four are tested. The results show that using a 2-gram online LM yields a half BLEU point gain over the baseline.
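The paper gives no reference implementation; the following is a minimal Python sketch of the hypothesis posteriors of Eqs. (3) and (4), assuming each system contributes an n-best list of hypothesis strings. The function names and the 1-based rank convention (r = 1 for the best hypothesis) are our own assumptions.

```python
def rank_based_posteriors(nbest_size, eta=1.0):
    """Rank-based system-specific posteriors of Eq. (4):
    P(E_r | S_k, F) proportional to 1/(1+r)^eta, normalized
    over the n-best list of the system."""
    scores = [1.0 / (1.0 + r) ** eta for r in range(1, nbest_size + 1)]
    z = sum(scores)
    return [s / z for s in scores]

def hypothesis_posterior(hyp, systems, weights, eta=1.0):
    """Combined posterior of Eq. (3): weighted sum of system-specific
    posteriors over the systems whose n-best list contains `hyp`.
    `systems` is a list of n-best lists (lists of hypothesis strings),
    `weights` the per-system weights w_k."""
    total = 0.0
    for w_k, nbest in zip(weights, systems):
        if hyp in nbest:
            rank = nbest.index(hyp)  # 0-based position, r = rank + 1
            total += w_k * rank_based_posteriors(len(nbest), eta)[rank]
    return total
```

With uniform system weights that sum to one, the posteriors of all distinct hypotheses again sum to one, which is what the fractional counts and voting scores of Sections 3 and 4 rely on.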
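The fractional counts of Eq. (2) and the online LM of Eqs. (5) to (7) can be sketched as below, assuming each hypothesis is given as a (token list, posterior) pair with posteriors already computed as in Eq. (3). The handling of the order-1 denominator (normalizing by the total unigram mass, since the history is empty) is our assumption; the paper does not spell it out.

```python
from collections import defaultdict

def fractional_counts(hyps_with_post, max_n):
    """Fractional n-gram counts of Eq. (2) for orders 1..max_n: each
    hypothesis adds its posterior P(E'|F) once per occurrence."""
    counts = defaultdict(float)
    for words, post in hyps_with_post:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += post
    return counts

def online_prob(counts, words, l, m, uni_total):
    """Order-m online LM score of Eq. (5): C(e_{l-m+1}^l) / C(history).
    For m = 1 the history is empty, so we normalize by `uni_total`,
    the total unigram mass (an assumption on our part)."""
    ngram = tuple(words[l - m + 1:l + 1])
    denom = uni_total if m == 1 else counts[tuple(words[l - m + 1:l])]
    return counts[ngram] / denom if denom > 0 else 0.0

def smoothed_prob(counts, words, l, n, alphas, uni_total):
    """Eq. (7): linear interpolation of orders 1..n with weights alphas."""
    return sum(a * online_prob(counts, words, l, m, uni_total)
               for m, a in zip(range(1, n + 1), alphas))
```

Multiplying the smoothed scores over positions l = n, ..., L then gives the hypothesis-level LM score of Eq. (6).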
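The n-gram voting confidence of Eqs. (8) and (9) admits an equally short sketch, again assuming (token list, posterior) pairs whose posteriors sum to one; function names are ours.

```python
def voting_score(ngram, hyps_with_post):
    """Eq. (8): posterior-weighted vote for `ngram`; each hypothesis
    votes at most once, regardless of how often the n-gram occurs."""
    return sum(post for words, post in hyps_with_post
               if any(tuple(words[i:i + len(ngram)]) == ngram
                      for i in range(len(words) - len(ngram) + 1)))

def ngram_voting_confidence(words, n, hyps_with_post):
    """Eq. (9): product of the voting scores of all n-grams in the
    hypothesis `words`."""
    score = 1.0
    for m in range(len(words) - n + 1):
        score *= voting_score(tuple(words[m:m + n]), hyps_with_post)
    return score
```

Because each hypothesis contributes its posterior at most once, the voting score stays in [0, 1], unlike the fractional count of Eq. (2), which grows with repeated occurrences.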