Using N-gram based Features for Machine Translation System Combination

Yong Zhao¹
Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]

Xiaodong He
Microsoft Research
Redmond, WA 98052, USA
[email protected]

Abstract

Conventional confusion network based system combination for machine translation (MT) relies heavily on features that measure the agreement of words in different translation hypotheses. This paper presents two new features that consider the agreement of n-grams in different hypotheses to improve the performance of system combination. The first is based on a sentence specific online n-gram language model, and the second is based on n-gram voting. Experiments on a large scale Chinese-to-English MT task show that both features yield significant improvements in translation performance, and that a combination of them produces even better translation results.

1 Introduction

In recent years, the confusion network based system combination approach has been shown to give substantial improvements in various machine translation (MT) tasks (Bangalore et al., 2001; Matusov et al., 2006; Rosti et al., 2007; He et al., 2008). Given hypotheses from multiple systems, a confusion network is built by aligning all these hypotheses. The resulting network comprises a sequence of correspondence sets, each of which contains the alternative words that are aligned with each other. To derive a consensus hypothesis from the confusion network, decoding is performed by selecting the path with the maximum overall confidence score among all paths that pass through the confusion network (Goel et al., 2004).

The confidence score of a hypothesis can be assigned in various ways. Fiscus (1997) used voting by frequency of occurrences. Mangu et al. (2000) computed a word posterior probability based on the votes for that word in different hypotheses. Moreover, the overall confidence score is usually formulated as a log-linear model including extra features such as the language model (LM) score, the word count, etc.

Features based on word agreement measures have been studied extensively in past work (Matusov et al., 2006; Rosti et al., 2007; He et al., 2008). However, the utilization of n-gram agreement information among the hypotheses has not yet been fully explored. Moreover, it has been argued that confusion network decoding may introduce undesirable spur words that break coherent phrases (Sim et al., 2007). Therefore, we would prefer a consensus translation that has better n-gram agreement with the outputs of the single systems.

In the literature, Zens and Ney (2004) proposed an n-gram posterior probability based LM for MT. For each source sentence, a LM is trained on the n-best list produced by a single MT system and is used to re-rank that n-best list itself. On the other hand, Matusov et al. (2008) proposed an "adapted" LM for system combination, where this "adapted" LM is trained on the translation hypotheses of the whole test corpus from all single MT systems involved in the system combination.

Inspired by these ideas, we propose two new features based on n-gram agreement measures to improve the performance of system combination. The first is a sentence specific LM built on the translation hypotheses of the multiple systems; the second is an n-gram-voting-based confidence. Experimental results are presented in the context of a large-scale Chinese-English translation task.

¹ The work was performed when Yong Zhao was an intern at Microsoft Research.

2 System Combination for MT

One of the most successful approaches to system combination for MT is based on confusion network decoding, as described in (Rosti et al., 2007). Given translation hypotheses from multiple MT systems, one of the hypotheses is selected as the backbone for hypothesis alignment. This is usually done by a sentence-level minimum Bayes risk (MBR) re-ranking method. The confusion network is constructed by aligning all the hypotheses against the backbone. Words that align to each other are grouped into a correspondence set, constituting the competition links of the confusion network. Each path in the network passes through exactly one link of each correspondence set. The final consensus output relies on a decoding procedure that chooses the path with the maximum confidence score among all paths that pass through the confusion network.
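As an illustration of the backbone selection step above, the following Python sketch scores each hypothesis by its posterior-weighted expected similarity to all hypotheses and keeps the best one. This is only a hedged sketch: the function names, the data layout, and the use of a generic `similarity` callback (rather than any specific MBR loss) are our own illustrative assumptions, not code from the systems described here.

```python
def select_backbone(hypotheses, posteriors, similarity):
    """Sentence-level MBR-style re-ranking: pick the hypothesis with the
    maximum expected similarity (i.e. minimum expected loss) against all
    hypotheses, weighting each comparison by its posterior probability."""
    best, best_gain = None, float("-inf")
    for e in hypotheses:
        # Expected gain of e: sum over e' of P(e'|F) * similarity(e, e')
        gain = sum(p * similarity(e, e_other)
                   for e_other, p in zip(hypotheses, posteriors))
        if gain > best_gain:
            best, best_gain = e, gain
    return best
```

In practice the similarity would be a sentence-level MT metric; any callable taking two hypotheses works with this sketch.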

The confidence score of a hypothesis is usually formalized as a log-linear sum of several feature functions. Given a source language sentence F, the total confidence of a target language hypothesis E = (e_1, ..., e_L) in the confusion network can be represented as:

    \log P(E|F) = \sum_{l=1}^{L} \log P(e_l | l, F)
                + \lambda_1 \log P_{LM}(E)
                + \lambda_2 N_{words}(E)                                  (1)

where the feature functions include the word posterior probability P(e_l | l, F), the LM probability P_{LM}(E), and the number of real words N_{words}(E). Usually, the model parameters \lambda_i are trained by optimizing an evaluation metric, e.g., the BLEU score, on a held-out development set.
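The log-linear path score of Eq. (1) can be sketched as follows. The data layout (a list of per-position word posterior tables) and the parameter names are assumptions made for exposition, not the paper's implementation; `<eps>` marks a skipped link so that only real words are counted.

```python
import math

def path_score(path, word_posteriors, lm_logprob, lam_lm, lam_words):
    """Eq. (1): log-linear confidence of one path through the network.

    path:            list of words, one per correspondence set
    word_posteriors: word_posteriors[l][w] = posterior of word w at link l
    lm_logprob:      callable returning the LM log-probability of the path
    """
    # Sum of word posterior log-probabilities along the path
    score = sum(math.log(word_posteriors[l][w])
                for l, w in enumerate(path))
    # Language model feature
    score += lam_lm * lm_logprob(path)
    # Word count feature: count real words only
    n_words = sum(1 for w in path if w != "<eps>")
    score += lam_words * n_words
    return score
```

Decoding then amounts to maximizing this score over all paths through the confusion network.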

3 N-gram Online Language Model

Given a source sentence F, the fractional count C(e_1^n | F) of an n-gram e_1^n is defined as:

    C(e_1^n | F) = \sum_{E' \in \mathbf{E}_h} \sum_{l=n}^{L} P(E'|F) \, \delta(e'^{\,l}_{l-n+1}, e_1^n)    (2)

where \mathbf{E}_h denotes the hypothesis set, \delta(\cdot,\cdot) denotes the Kronecker delta function, and P(E'|F) is the posterior probability of translation hypothesis E', which is expressed as the weighted sum of the system specific posteriors over the systems that contain hypothesis E':

    P(E'|F) = \sum_{k=1}^{K} w_k P(E'|S_k, F) \, 1(E' \in \mathbf{E}_{S_k})    (3)

where w_k is the weight for the posterior probability of the k-th system S_k, and 1(\cdot) is the indicator function.

Following Rosti et al. (2007), the system specific posteriors are derived from a rank-based scoring scheme. I.e., if translation hypothesis E_r is the r-th best output in the n-best list of system S_k, the posterior P(E_r | S_k, F) is approximated as:

    P(E_r | S_k, F) = \frac{1/(1+r)^\eta}{\sum_{r'=1}^{|\mathbf{E}_{S_k}|} 1/(1+r')^\eta}    (4)

where \eta is a rank smoothing parameter.

Similar to (Zens and Ney, 2004), a straightforward way to use the n-gram fractional counts is to formulate them as a sentence specific online LM. The online LM score of a path in the confusion network is then added as an additional feature in the log-linear model for decoding. The online n-gram LM score is computed by:

    P(e_l | e_{l-n+1}^{l-1}, F) = \frac{C(e_{l-n+1}^{l} | F)}{C(e_{l-n+1}^{l-1} | F)}    (5)

The LM score of hypothesis E is obtained by:

    P_{LM}(E|F) = \prod_{l=n}^{L} P(e_l | e_{l-n+1}^{l-1}, F)    (6)

Since new n-grams unseen in the original translation hypotheses may be proposed by the CN decoder, LM smoothing is critical. In our approach, the score of the online LM is smoothed by a linear interpolation combining the scores of different orders:

    P_{smooth}(e_l | e_{l-n+1}^{l-1}, F) = \sum_{m=1}^{n} \alpha_m P(e_l | e_{l-m+1}^{l-1}, F)    (7)

In our implementation, the interpolation weights {\alpha_m} can be learned along with the other combination parameters in the same Max-BLEU training scheme via Powell's search.

4 N-gram-Voting-Based Confidence

Motivated by features based on the voting of single words, we propose new features based on n-gram voting. The voting score V(e_1^n | F) of an n-gram e_1^n is computed as:

    V(e_1^n | F) = \sum_{E' \in \mathbf{E}_h} P(E'|F) \, 1(e_1^n \in E')    (8)

An n-gram receives a vote from each hypothesis that contains it, weighted by the posterior probability of that hypothesis, where the posterior probability P(E'|F) is computed by (3). Unlike the fractional count, each hypothesis can vote no more than once on an n-gram. V(e_1^n | F) takes a value between 0 and 1, and can be viewed as the confidence of the n-gram e_1^n. The n-gram-voting-based confidence score of a hypothesis E is then computed as the product of the confidence scores of the n-grams in E:

    P_{NV,n}(E|F) = \prod_{m=1}^{L-n+1} V(e_m^{m+n-1} | F)    (9)

where n can take the values 2, 3, ..., N. In order to prevent zero confidence, a small back-off confidence score is assigned to all n-grams unseen in the original hypotheses.

Augmented with the proposed n-gram based features, the final log-linear model becomes:

    \log P(E|F) = \sum_{l=1}^{L} \log P(e_l | l, F)
                + \lambda_1 \log P_{LM}(E)
                + \lambda_2 N_{words}(E)
                + \lambda_3 \log P_{LM}(E|F)
                + \sum_{n=2}^{N} \lambda_{n+2} \log P_{NV,n}(E|F)    (10)
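To make the n-gram features above concrete, the following Python sketch implements the rank-based posteriors, fractional counts, interpolated online LM, and voting confidence of Eqs. (2)-(9). All names (`rank_posteriors`, `fractional_counts`, the dict-based count tables, the back-off `floor`) are our own illustrative assumptions rather than code from the described system.

```python
from collections import defaultdict

def rank_posteriors(nbest_size, eta):
    """Eq. (4): P(E_r|S_k,F) proportional to 1/(1+r)^eta, r = 1..N."""
    scores = [1.0 / (1 + r) ** eta for r in range(1, nbest_size + 1)]
    z = sum(scores)
    return [s / z for s in scores]

def fractional_counts(hyps, posts, n):
    """Eq. (2): posterior-weighted occurrence counts of n-grams."""
    counts = defaultdict(float)
    for words, p in zip(hyps, posts):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += p
    return counts

def online_lm_prob(ngram, counts_by_order, alphas):
    """Eqs. (5) and (7): interpolated online LM probability of the last
    word of `ngram` given its history; counts_by_order[m] holds the
    order-m fractional counts, alphas[m-1] its interpolation weight."""
    prob = 0.0
    for m, alpha in enumerate(alphas, start=1):
        sub = ngram[-m:]
        num = counts_by_order[m][tuple(sub)]
        den = (counts_by_order[m - 1][tuple(sub[:-1])]
               if m > 1 else sum(counts_by_order[1].values()))
        if den > 0:
            prob += alpha * num / den
    return prob

def voting_score(ngram, hyps, posts):
    """Eq. (8): each hypothesis votes at most once per n-gram."""
    n = len(ngram)
    return sum(p for words, p in zip(hyps, posts)
               if any(tuple(words[i:i + n]) == ngram
                      for i in range(len(words) - n + 1)))

def nv_confidence(words, n, hyps, posts, floor=1e-4):
    """Eq. (9): product of n-gram voting scores, floored to avoid zero
    confidence for n-grams unseen in the original hypotheses."""
    conf = 1.0
    for i in range(len(words) - n + 1):
        conf *= max(voting_score(tuple(words[i:i + n]), hyps, posts), floor)
    return conf
```

The LM probability of a full path (Eq. (6)) is the product of `online_lm_prob` over its positions, and its logarithm enters the log-linear model as an extra feature alongside the voting confidences.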
5 Evaluation

We evaluate the proposed n-gram based features on the Chinese-to-English (C2E) test of past NIST Open MT Evaluations. The experimental results are reported in case sensitive BLEU score (Papineni et al., 2002).

The dev set, which is used for system combination parameter training, consists of the newswire and newsgroup parts of NIST MT06 and contains a total of 1099 sentences. The test set is the "current" test set of NIST MT08, which contains 1357 sentences of newswire and web-blog data. Both dev and test sets have four references per sentence.

Outputs from a total of eight single MT systems were combined for consensus translations. The selected systems are based on various translation paradigms, such as phrasal, hierarchical, and syntax-based systems. Each system produces 10-best hypotheses per translation. The BLEU scores of the eight individual systems range from 26.11% to 31.09% on the dev set and from 20.42% to 26.24% on the test set. In our experiments, a state-of-the-art system combination method proposed by He et al. (2008) is implemented as the baseline. The true-casing model proposed by Toutanova et al. (2008) is used.

Table 1 shows the results of adding the online LM feature. Different LM orders up to four are tested. The results show that a 2-gram online LM yields a half BLEU point gain over the baseline. However, the gain saturates at an LM order of three and fluctuates after that.

Table 1: Results of adding the n-gram online LM.
BLEU %            Dev    Test
Baseline          37.34  30.51
1-gram online LM  37.34  30.51
2-gram online LM  37.86  31.02
3-gram online LM  37.87  31.08
4-gram online LM  37.86  31.01

Table 2 shows the performance of the n-gram-voting-based confidence features. The best result of 31.01% is achieved when up to 4-gram confidence features are used. The BLEU score keeps improving as longer n-gram confidence features are added, indicating that the n-gram voting based confidence feature is robust to high order n-grams.

Table 2: Results of adding n-gram voting based confidence features.
BLEU %             Dev    Test
Baseline           37.34  30.51
+ 2-gram voting    37.58  30.88
+ 2~3-gram voting  37.66  30.96
+ 2~4-gram voting  37.77  31.01

We further experimented with incorporating both features in the log-linear model and report the results in Table 3. Given the observation that the n-gram voting based confidence feature is more robust to high order n-grams, we also tested using different n-gram orders for the two features. As shown in Table 3, a 3-gram online LM plus 2~4-gram voting based confidence scores yields the best BLEU scores on both dev and test sets, 37.98% and 31.35% respectively. This is a 0.84 BLEU point gain over the baseline on the MT08 test set.

Table 3: Results of using both the n-gram online LM and n-gram voting based confidence features.
BLEU %                       Dev    Test
Baseline                     37.34  30.51
2-gram LM + 2-gram voting    37.78  30.98
3-gram LM + 2~3-gram voting  37.89  31.21
4-gram LM + 2~4-gram voting  37.93  31.08
3-gram LM + 2~4-gram voting  37.98  31.35

6 Conclusion

This work explored the utilization of n-gram agreement information among the translation outputs of multiple MT systems to improve the performance of system combination. It extends an earlier idea presented at the NIPS 2008 Workshop on Speech and Language (Zhao and He, 2008). Two kinds of n-gram based features were proposed. The first is based on an online LM using n-gram fractional counts, and the second is a confidence feature based on n-gram voting scores. Our experiments on the NIST MT08 Chinese-to-English task showed that both methods yield significant improvements in the translation results, and that incorporating both kinds of features produced the best translation result, with a BLEU score of 31.35%, a 0.84% improvement over the baseline.

References

S. Bangalore, G. Bordel, and G. Riccardi. 2001. Computing consensus translation from multiple machine translation systems. In Proc. ASRU.

J.G. Fiscus. 1997. A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In Proc. ASRU.

V. Goel, S. Kumar, and W. Byrne. 2004. Segmental minimum Bayes-risk decoding for automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 12(3).

X. He, M. Yang, J. Gao, P. Nguyen, and R. Moore. 2008. Indirect-HMM-based hypothesis alignment for combining outputs from machine translation systems. In Proc. EMNLP.

L. Mangu, E. Brill, and A. Stolcke. 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373-400.

E. Matusov, N. Ueffing, and H. Ney. 2006. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proc. EACL.

E. Matusov, G. Leusch, R. E. Banchs, N. Bertoldi, D. Dechelotte, M. Federico, M. Kolss, Y. Lee, J. B. Marino, M. Paulik, S. Roukos, H. Schwenk, and H. Ney. 2008. System combination for machine translation of spoken and written language. IEEE Transactions on Audio, Speech and Language Processing, Sept. 2008.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.

A.-V.I. Rosti, S. Matsoukas, and R. Schwartz. 2007. Improved word-level system combination for machine translation. In Proc. ACL.

K.C. Sim, W.J. Byrne, M.J.F. Gales, H. Sahbi, and P.C. Woodland. 2007. Consensus network decoding for statistical machine translation system combination. In Proc. ICASSP.

K. Toutanova, H. Suzuki, and A. Ruopp. 2008. Applying morphology generation models to machine translation. In Proc. ACL.

R. Zens and H. Ney. 2004. N-gram posterior probabilities for statistical machine translation. In Proc. HLT-NAACL.

Y. Zhao and X. He. 2008. System combination for machine translation using n-gram posterior probabilities. NIPS 2008 Workshop on Speech and Language: Learning-based Methods and Systems, Dec. 2008.