Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics

Chin-Yew Lin and Franz Josef Och
Information Sciences Institute
University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292, USA
{cyl,och}@isi.edu

Abstract

In this paper we describe two new objective automatic evaluation methods for machine translation. The first method is based on the longest common subsequence between a candidate translation and a set of reference translations. Longest common subsequence takes into account sentence level structure similarity naturally and identifies longest co-occurring in-sequence n-grams automatically. The second method relaxes strict n-gram matching to skip-bigram matching. A skip-bigram is any pair of words in their sentence order. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. The empirical results show that both methods correlate with human judgments very well in both adequacy and fluency.

1 Introduction

Using objective functions to automatically evaluate machine translation quality is not new. Su et al. (1992) proposed a method based on measuring the edit distance (Levenshtein 1966) between candidate and reference translations. Akiba et al. (2001) extended the idea to accommodate multiple references. Nießen et al. (2000) calculated the length-normalized edit distance, called word error rate (WER), between a candidate and multiple reference translations. Leusch et al. (2003) proposed a related measure called position-independent word error rate (PER) that does not consider word position, i.e. it uses a bag-of-words representation instead. Instead of error measures, we can also use accuracy measures that compute similarity between candidate and reference translations in proportion to the number of common words between them, as suggested by Melamed (1995). An n-gram co-occurrence measure, BLEU, proposed by Papineni et al. (2001), which calculates co-occurrence statistics based on n-gram overlaps, has shown great potential. A variant of BLEU developed by NIST (2002) has been used in two recent large-scale machine translation evaluations.

Recently, Turian et al. (2003) indicated that standard accuracy measures such as recall, precision, and the F-measure can also be used in the evaluation of machine translation. However, results based on their method, General Text Matcher (GTM), showed that unigram F-measure correlated best with human judgments, while assigning more weight to higher n-gram (n > 1) matches achieved performance similar to BLEU. Since unigram matches do not distinguish words in consecutive positions from words in the wrong order, measures based on position-independent unigram matches are not sensitive to word order and sentence level structure. Therefore, systems optimized for these unigram-based measures might generate adequate but not fluent target language.

Since BLEU has been used to report the performance of many machine translation systems and has been shown to correlate well with human judgments, we explain BLEU in more detail and point out its limitations in the next section. We then introduce a new evaluation method called ROUGE-L that measures sentence-to-sentence similarity based on the longest common subsequence statistics between a candidate translation and a set of reference translations in Section 3. Section 4 describes another automatic evaluation method called ROUGE-S that computes skip-bigram co-occurrence statistics. Section 5 presents the evaluation results of ROUGE-L and ROUGE-S and compares them with BLEU, GTM, NIST, PER, and WER in correlation with human judgments in terms of adequacy and fluency. We conclude this paper and discuss extensions of the current work in Section 6.

2 BLEU and N-gram Co-Occurrence

To automatically evaluate machine translations, the machine translation community recently adopted an n-gram co-occurrence scoring procedure, BLEU (Papineni et al. 2001). In two recent large-scale machine translation evaluations sponsored by NIST, a closely related automatic evaluation method, simply called the NIST score, was used. The NIST (NIST 2002) scoring method is based on BLEU.

The main idea of BLEU is to measure the similarity between a candidate translation and a set of reference translations with a numerical metric. It uses a weighted average of variable-length n-gram matches between system translations and a set of human reference translations, and this weighted average metric was shown to correlate highly with human assessments.

BLEU measures how well a machine translation overlaps with multiple human translations using n-gram co-occurrence statistics. N-gram precision in BLEU is computed as follows:

p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count(n\text{-gram})}    (1)

Where Count_clip(n-gram) is the maximum number of n-grams co-occurring in a candidate translation and a reference translation, and Count(n-gram) is the number of n-grams in the candidate translation. To prevent very short translations that try to maximize their precision scores, BLEU adds a brevity penalty, BP, to the formula:

BP = \begin{cases} 1 & \text{if } |c| > |r| \\ e^{(1-|r|/|c|)} & \text{if } |c| \le |r| \end{cases}    (2)

Where |c| is the length of the candidate translation and |r| is the length of the reference translation. The BLEU formula is then written as follows:

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)    (3)

The weighting factor, w_n, is set at 1/N.
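For illustration, the following is a minimal Python sketch of Equations 1-3 at the sentence level for a single reference, with uniform weights w_n = 1/N. The function names are ours, not part of the official BLEU or NIST tooling, and corpus-level BLEU aggregates the clipped counts over the whole test set before taking the ratio.

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of Equation (1) for one sentence pair."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/N and the brevity
    penalty of Equation (2); returns 0 if any p_n is 0 (geometric mean)."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else exp(1 - r / c)
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return bp * exp(sum(log(p) / max_n for p in precisions))

# BLEU-2 on the example of Section 2: S2 and S3 receive the same score (0.5).
s1 = "police killed the gunman".split()
s2 = "police kill the gunman".split()
s3 = "the gunman kill police".split()
print(bleu(s2, s1, max_n=2), bleu(s3, s1, max_n=2))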

Although BLEU has been shown to correlate well with human assessments, it has a few things that can be improved. First, the subjective application of the brevity penalty can be replaced with a recall-related parameter that is sensitive to reference length. Although the brevity penalty will penalize candidate translations with low recall by a factor of e^(1-|r|/|c|), it would be nice if we could use the traditional recall measure that has been a well-known measure in NLP, as suggested by Melamed (2003). Of course, we have to make sure the resulting composite function of precision and recall still correlates highly with human judgments.

Second, although BLEU uses higher-order n-gram (n > 1) matches to favor candidate sentences with consecutive word matches and to estimate their fluency, it does not consider sentence level structure. For example, given the following sentences:

S1. police killed the gunman
S2. police kill the gunman [1]
S3. the gunman kill police

We only consider BLEU with unigram and bigram, i.e. N = 2, for the purpose of explanation and call this BLEU-2. Using S1 as the reference and S2 and S3 as the candidate translations, S2 and S3 would have the same BLEU-2 score, since they both have one bigram and three unigram matches [2]. However, S2 and S3 have very different meanings.

[1] This is a real machine translation output.
[2] The "kill" in S2 or S3 does not match with "killed" in S1 in strict word-to-word comparison.

Third, BLEU is a geometric mean of unigram to N-gram precisions. Any candidate translation without an N-gram match has a per-sentence BLEU score of zero. Although BLEU is usually calculated over the whole test corpus, it is still desirable to have a measure that works reliably at sentence level for diagnostic and introspection purposes.

To address these issues, we propose three new automatic evaluation measures based on longest common subsequence statistics and skip-bigram co-occurrence statistics in the following sections.

3 Longest Common Subsequence

3.1 ROUGE-L

A sequence Z = [z_1, z_2, ..., z_k] is a subsequence of another sequence X = [x_1, x_2, ..., x_m] if there exists a strictly increasing sequence [i_1, i_2, ..., i_k] of indices of X such that for all j = 1, 2, ..., k we have x_{i_j} = z_j (Cormen et al. 1989). Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a common subsequence with maximum length. We can find the LCS of two sequences of length m and n using a standard dynamic programming technique in O(mn) time.

LCS has been used to identify cognate candidates during the construction of N-best translation lexicons from parallel text. Melamed (1995) used the ratio (LCSR) between the length of the LCS of two words and the length of the longer of the two words to measure the cognateness between them. He used LCS as an approximate string matching algorithm. Saggion et al. (2002) used normalized pairwise LCS (NP-LCS) to compare similarity between two texts in automatic summarization evaluation. NP-LCS can be shown to be a special case of Equation (6) with β = 1. However, they did not provide a correlation analysis of NP-LCS with human judgments or its effectiveness as an automatic evaluation measure.

To apply LCS in machine translation evaluation, we view a translation as a sequence of words. The intuition is that the longer the LCS of two translations is, the more similar the two translations are. We propose using an LCS-based F-measure to estimate the similarity between two translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, as follows:

R_{lcs} = \frac{LCS(X,Y)}{m}    (4)

P_{lcs} = \frac{LCS(X,Y)}{n}    (5)

F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}    (6)

Where LCS(X,Y) is the length of a longest common subsequence of X and Y, and β = P_lcs/R_lcs when ∂F_lcs/∂R_lcs = ∂F_lcs/∂P_lcs. We call the LCS-based F-measure, i.e. Equation 6, ROUGE-L. Notice that ROUGE-L is 1 when X = Y, since LCS(X,Y) = m or n; while ROUGE-L is zero when LCS(X,Y) = 0, i.e. there is nothing in common between X and Y. The F-measure or its equivalents have been shown to meet several theoretical criteria in measuring accuracy involving more than one factor (Van Rijsbergen 1979). The composite factors are LCS-based recall and precision in this case. Melamed et al. (2003) used unigram F-measure to estimate machine translation quality and showed that unigram F-measure was as good as BLEU.

One advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order as n-grams. The other advantage is that it automatically includes the longest in-sequence common n-grams; therefore no predefined n-gram length is necessary.

ROUGE-L as defined in Equation 6 has the property that its value is less than or equal to the minimum of the unigram F-measure of X and Y. Unigram recall reflects the proportion of words in X (the reference translation) that are also present in Y (the candidate translation), while unigram precision is the proportion of words in Y that are also in X. Unigram recall and precision count all co-occurring words regardless of their order, while ROUGE-L counts only in-sequence co-occurrences.

By only awarding credit to in-sequence unigram matches, ROUGE-L also captures sentence level structure in a natural way. Consider again the example given in Section 2, copied here for convenience:

S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police

As we have shown earlier, BLEU-2 cannot differentiate S2 from S3. However, S2 has a ROUGE-L score of 3/4 = 0.75 and S3 has a ROUGE-L score of 2/4 = 0.5, with β = 1. Therefore S2 is better than S3 according to ROUGE-L. This example also illustrates that ROUGE-L can work reliably at sentence level.

However, LCS only counts the main in-sequence words; therefore, other longest common subsequences and shorter sequences are not reflected in the final score. For example, consider the following candidate sentence:

S4. the gunman police killed

Using S1 as its reference, LCS counts either "the gunman" or "police killed", but not both; therefore, S4 has the same ROUGE-L score as S3. BLEU-2 would prefer S4 over S3. In Section 4, we will introduce skip-bigram co-occurrence statistics that do not have this problem while still keeping the advantage of in-sequence (not necessarily consecutive) matching that reflects sentence level word order.
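As an illustration of Equations 4-6, here is a minimal Python sketch of ROUGE-L for a single reference, using the standard O(mn) dynamic programming LCS; the helper names are ours, and this is a sketch rather than the ROUGE evaluation package itself.

def lcs_length(x, y):
    """Length of a longest common subsequence of token lists x and y,
    computed with the standard O(mn) dynamic programming table."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l(reference, candidate, beta=1.0):
    """LCS-based F-measure of Equations (4)-(6): recall against the
    reference length, precision against the candidate length."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference)
    p = lcs / len(candidate)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

# The Section 3.1 example: S2 scores 0.75, S3 scores 0.5 against S1.
s1 = "police killed the gunman".split()
print(rouge_l(s1, "police kill the gunman".split()))   # 0.75
print(rouge_l(s1, "the gunman kill police".split()))   # 0.5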
3.2 Multiple References

So far, we have only demonstrated how to compute ROUGE-L using a single reference. When multiple references are used, we take the maximum LCS matches between a candidate translation, c, of n words and a set of u reference translations of m_j words. The LCS-based F-measure can be computed as follows:

R_{lcs\text{-}multi} = \max_{j=1}^{u} \left( \frac{LCS(r_j, c)}{m_j} \right)    (7)

P_{lcs\text{-}multi} = \max_{j=1}^{u} \left( \frac{LCS(r_j, c)}{n} \right)    (8)

F_{lcs\text{-}multi} = \frac{(1+\beta^2) R_{lcs\text{-}multi} P_{lcs\text{-}multi}}{R_{lcs\text{-}multi} + \beta^2 P_{lcs\text{-}multi}}    (9)

where β = P_lcs-multi/R_lcs-multi when ∂F_lcs-multi/∂R_lcs-multi = ∂F_lcs-multi/∂P_lcs-multi.

This procedure is also applied to the computation of ROUGE-S when multiple references are used. In the next section, we describe how to extend ROUGE-L to assign more credit to longest common subsequences with consecutive words.
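Continuing the sketch above (and reusing its hypothetical lcs_length helper), the multiple-reference scores of Equations 7-9 simply take maxima over the reference set:

def rouge_l_multi(references, candidate, beta=1.0):
    """Equations (7)-(9): maximum LCS-based recall and precision over a set
    of reference translations (lcs_length as defined in the sketch above)."""
    r = max(lcs_length(ref, candidate) / len(ref) for ref in references)
    p = max(lcs_length(ref, candidate) / len(candidate) for ref in references)
    if r == 0.0 or p == 0.0:
        return 0.0
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)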
3.3 ROUGE-W: Weighted Longest Common Subsequence

LCS has many nice properties, as we have described in the previous sections. Unfortunately, the basic LCS also has a problem: it does not differentiate LCSes of different spatial relations within their embedding sequences. For example, given a reference sequence X and two candidate sequences Y1 and Y2 as follows:

X:  [A B C D E F G]
Y1: [A B C D H I K]
Y2: [A H B K C I D]

Y1 and Y2 have the same ROUGE-L score. However, in this case, Y1 should be the better choice than Y2 because Y1 has consecutive matches. To improve the basic LCS method, we can simply remember the length of consecutive matches encountered so far in a regular two-dimensional dynamic programming table computing LCS. We call this weighted LCS (WLCS) and use k to indicate the length of the current consecutive matches ending at words x_i and y_j. Given two sentences X and Y, the WLCS score of X and Y can be computed using the following dynamic programming procedure:

(1) For (i = 0; i <= m; i++)
      For (j = 0; j <= n; j++)
        c(i,j) = 0 // initialize c-table
        w(i,j) = 0 // initialize w-table
(2) For (i = 1; i <= m; i++)
      For (j = 1; j <= n; j++)
        If x_i = y_j Then
          // the length of consecutive matches at
          // position i-1 and j-1
          k = w(i-1,j-1)
          c(i,j) = c(i-1,j-1) + f(k+1) - f(k)
          // remember the length of consecutive
          // matches at position i, j
          w(i,j) = k+1
        Otherwise
          If c(i-1,j) > c(i,j-1) Then
            c(i,j) = c(i-1,j)
            w(i,j) = 0 // no match at i, j
          Else
            c(i,j) = c(i,j-1)
            w(i,j) = 0 // no match at i, j
(3) WLCS(X,Y) = c(m,n)

Where c is the dynamic programming table, c(i,j) stores the WLCS score ending at word x_i of X and y_j of Y, w is the table storing the length of consecutive matches ending at c-table position i and j, and f is a function of the consecutive matches at table position c(i,j). Notice that by providing a different weighting function f, we can parameterize the WLCS algorithm to assign different credit to consecutive in-sequence matches.

The weighting function f must have the property that f(x+y) > f(x) + f(y) for any positive integers x and y. In other words, consecutive matches are awarded more scores than non-consecutive matches. For example, f(k) = αk − β when k >= 0 and α, β > 0. This function charges a gap penalty of −β for each non-consecutive n-gram sequence. Another possible function family is the polynomial family of the form k^α where α > 1. However, in order to normalize the final ROUGE-W score, we also prefer a function that has a closed-form inverse. For example, f(k) = k^2 has the closed-form inverse function f^{-1}(k) = k^{1/2}. The F-measure based on WLCS can be computed as follows, given two sequences X of length m and Y of length n:

R_{wlcs} = f^{-1}\left( \frac{WLCS(X,Y)}{f(m)} \right)    (10)

P_{wlcs} = f^{-1}\left( \frac{WLCS(X,Y)}{f(n)} \right)    (11)

F_{wlcs} = \frac{(1+\beta^2) R_{wlcs} P_{wlcs}}{R_{wlcs} + \beta^2 P_{wlcs}}    (12)

Where f^{-1} is the inverse function of f. We call the WLCS-based F-measure, i.e. Equation 12, ROUGE-W. Using Equation 12 and f(k) = k^2 as the weighting function, the ROUGE-W scores for sequences Y1 and Y2 are 0.571 and 0.286 respectively. Therefore, Y1 would be ranked higher than Y2 using WLCS. We use the polynomial function of the form k^α in the ROUGE evaluation package. In the next section, we introduce the skip-bigram co-occurrence statistics.
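The procedure above translates almost line for line into Python. The following sketch, with our own helper names and f(k) = k^2 as in the example, also applies Equations 10-12 on top of the WLCS score; it reproduces the 0.571 and 0.286 scores for Y1 and Y2.

def wlcs(x, y, f=lambda k: k ** 2):
    """Weighted LCS score of steps (1)-(3): the c-table accumulates
    f(k+1) - f(k) for runs of consecutive matches whose lengths are
    tracked in the w-table."""
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]              # consecutive matches so far
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]            # no match at (i, j)
                w[i][j] = 0
            else:
                c[i][j] = c[i][j - 1]            # no match at (i, j)
                w[i][j] = 0
    return c[m][n]

def rouge_w(reference, candidate, beta=1.0,
            f=lambda k: k ** 2, f_inv=lambda v: v ** 0.5):
    """WLCS-based F-measure of Equations (10)-(12) with f(k) = k^2."""
    score = wlcs(reference, candidate, f)
    r = f_inv(score / f(len(reference)))
    p = f_inv(score / f(len(candidate)))
    if r == 0.0 or p == 0.0:
        return 0.0
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

# The Section 3.3 example: Y1 scores about 0.571, Y2 about 0.286 against X.
X  = list("ABCDEFG")
Y1 = list("ABCDHIK")
Y2 = list("AHBKCID")
print(round(rouge_w(X, Y1), 3), round(rouge_w(X, Y2), 3))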

4 ROUGE-S: Skip-Bigram Co-Occurrence Statistics

A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. Using the example given in Section 3.1:

S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S4. the gunman police killed

Each sentence has C(4,2) = 6 skip-bigrams [3]. For example, S1 has the following skip-bigrams:

("police killed", "police the", "police gunman", "killed the", "killed gunman", "the gunman")

[3] Combination: C(4,2) = 4!/(2!*2!) = 6.

S2 has three skip-bigram matches with S1 ("police the", "police gunman", "the gunman"), S3 has one skip-bigram match with S1 ("the gunman"), and S4 has two skip-bigram matches with S1 ("police killed", "the gunman"). Given translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, we compute the skip-bigram-based F-measure as follows:

R_{skip2} = \frac{SKIP2(X,Y)}{C(m,2)}    (13)

P_{skip2} = \frac{SKIP2(X,Y)}{C(n,2)}    (14)

F_{skip2} = \frac{(1+\beta^2) R_{skip2} P_{skip2}}{R_{skip2} + \beta^2 P_{skip2}}    (15)

Where SKIP2(X,Y) is the number of skip-bigram matches between X and Y, β = P_skip2/R_skip2 when ∂F_skip2/∂R_skip2 = ∂F_skip2/∂P_skip2, and C is the combination function. We call the skip-bigram-based F-measure, i.e. Equation 15, ROUGE-S.

Using Equation 15 with β = 1 and S1 as the reference, S2's ROUGE-S score is 0.5, S3's is 0.167, and S4's is 0.333. Therefore, S2 is better than S3 and S4, and S4 is better than S3. This result is more intuitive than using BLEU-2 and ROUGE-L. One advantage of skip-bigram over BLEU is that it does not require consecutive matches but is still sensitive to word order. Comparing skip-bigram with LCS, skip-bigram counts all in-order matching word pairs while LCS only counts one longest common subsequence.

We can limit the maximum skip distance, d_skip, between two in-order words that are allowed to form a skip-bigram. Applying such a constraint limits skip-bigram formation to a fixed window size. Therefore, computation time can be reduced and hopefully performance can be as good as the version without such a constraint. For example, if we set d_skip to 0 then ROUGE-S is equivalent to bigram overlap. If we set d_skip to 4 then only word pairs at most 4 words apart can form skip-bigrams.

Adjusting Equations 13, 14, and 15 to use a maximum skip distance limit is straightforward: we only count the skip-bigram matches, SKIP2(X,Y), within the maximum skip distance and replace the denominators of Equations 13, C(m,2), and 14, C(n,2), with the actual numbers of within-distance skip-bigrams from the reference and the candidate respectively.

In the next section, we present the evaluations of ROUGE-L and ROUGE-S and compare their performance with other automatic evaluation measures.
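The following is a minimal Python sketch of Equations 13-15 with an optional maximum skip distance d_skip, interpreted as the number of words allowed between the two words of a pair, so that d_skip = 0 reduces to bigram overlap as described above; the function names are ours, not the ROUGE package's.

from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, d_skip=None):
    """Multiset of in-order word pairs; if d_skip is given, only pairs with
    at most d_skip intervening words are kept (d_skip = 0 gives adjacent
    bigrams)."""
    pairs = Counter()
    for (i, a), (j, b) in combinations(enumerate(tokens), 2):
        if d_skip is None or j - i - 1 <= d_skip:
            pairs[(a, b)] += 1
    return pairs

def rouge_s(reference, candidate, beta=1.0, d_skip=None):
    """Skip-bigram F-measure of Equations (13)-(15); without a distance
    limit the denominators are C(m,2) and C(n,2), with a limit they are
    the within-distance pair counts, as described above."""
    ref_pairs = skip_bigrams(reference, d_skip)
    cand_pairs = skip_bigrams(candidate, d_skip)
    matches = sum(min(c, ref_pairs[p]) for p, c in cand_pairs.items())
    if matches == 0:
        return 0.0
    r = matches / sum(ref_pairs.values())
    p = matches / sum(cand_pairs.values())
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

# The Section 4 example: S2 = 0.5, S3 = 0.167, S4 = 0.333 against S1.
s1 = "police killed the gunman".split()
for cand in ("police kill the gunman", "the gunman kill police",
             "the gunman police killed"):
    print(round(rouge_s(s1, cand.split()), 3))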

5 Evaluations

One of the goals of developing automatic evaluation measures is to replace labor-intensive human evaluations. Therefore the first criterion to assess the usefulness of an automatic evaluation measure is to show that it correlates highly with human judgments in different evaluation settings. However, high quality large-scale human judgments are hard to come by. Fortunately, we have access to eight MT systems' outputs, their human assessment data, and the reference translations from the 2003 NIST Chinese MT evaluation (NIST 2002a). There were 919 sentence segments in the corpus. We first computed averages of the adequacy and fluency scores of each system assigned by human evaluators. For the input of the automatic evaluation methods, we created three evaluation sets from the MT outputs:

1. Case set: The original system outputs with case information.
2. NoCase set: All words were converted into lower case, i.e. no case information was used. This set was used to examine whether human assessments were affected by case information, since not all MT systems generate properly cased output.
3. Stem set: All words were converted into lower case and stemmed using the Porter stemmer (Porter 1980). Since ROUGE computes similarity at the surface word level, the stemmed version allows ROUGE to perform more lenient matches.

To accommodate multiple references, we use a Jackknifing procedure. Given N references, we compute the best score over N sets of N-1 references. The final score is the average of the N best scores using N different sets of N-1 references. The Jackknifing procedure is adopted since we often need to compare system and human performance, and the reference translations are usually the only human translations available. Using this procedure, we are able to estimate average human performance by averaging the N best scores of one reference vs. the remaining N-1 references.

We then computed average BLEU1-12 [4], GTM with exponents of 1.0, 2.0, and 3.0, NIST, WER, and PER scores over these three sets. Finally we applied ROUGE-L, ROUGE-W with weighting function k^1.2, and ROUGE-S without a skip distance limit and with skip distance limits of 0, 4, and 9.

[4] BLEUN computes BLEU over n-grams up to length N. Only BLEU1, BLEU4, and BLEU12 are shown in Table 1.
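The Jackknifing procedure described above can be sketched as follows, for any scoring function that returns the best score of a candidate against a set of references (for example the hypothetical rouge_l_multi above); this is an illustration of the procedure, not the actual evaluation script used in this paper.

def jackknife(score_fn, references, candidate):
    """Jackknifed multi-reference score: with N references, score the
    candidate against each leave-one-out set of N-1 references using
    score_fn (which should return the best score over a reference set),
    then average the N resulting scores."""
    n = len(references)
    if n == 1:
        return score_fn(references, candidate)
    scores = []
    for held_out in range(n):
        subset = [r for k, r in enumerate(references) if k != held_out]
        scores.append(score_fn(subset, candidate))
    return sum(scores) / n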
Adequacy

           With Case Information (Case)     Lower Case (NoCase)               Lower Case & Stemmed (Stem)
Method     P    95%L 95%U  S    95%L 95%U   P    95%L 95%U  S    95%L 95%U    P    95%L 95%U  S    95%L 95%U
BLEU1      0.86 0.83 0.89  0.80 0.71 0.90   0.87 0.84 0.90  0.76 0.67 0.89    0.91 0.89 0.93  0.85 0.76 0.95
BLEU4      0.77 0.72 0.81  0.77 0.71 0.89   0.79 0.75 0.82  0.67 0.55 0.83    0.82 0.78 0.85  0.76 0.67 0.89
BLEU12     0.66 0.60 0.72  0.53 0.44 0.65   0.72 0.57 0.81  0.65 0.25 0.88    0.72 0.58 0.81  0.66 0.28 0.88
NIST       0.89 0.86 0.92  0.78 0.71 0.89   0.87 0.85 0.90  0.80 0.74 0.92    0.90 0.88 0.93  0.88 0.83 0.97
WER        0.47 0.41 0.53  0.56 0.45 0.74   0.43 0.37 0.49  0.66 0.60 0.82    0.48 0.42 0.54  0.66 0.60 0.81
PER        0.67 0.62 0.72  0.56 0.48 0.75   0.63 0.58 0.68  0.67 0.60 0.83    0.72 0.68 0.76  0.69 0.62 0.86
ROUGE-L    0.87 0.84 0.90  0.84 0.79 0.93   0.89 0.86 0.92  0.84 0.71 0.94    0.92 0.90 0.94  0.87 0.76 0.95
ROUGE-W    0.84 0.81 0.87  0.83 0.74 0.90   0.85 0.82 0.88  0.77 0.67 0.90    0.89 0.86 0.91  0.86 0.76 0.95
ROUGE-S*   0.85 0.81 0.88  0.83 0.76 0.90   0.90 0.88 0.93  0.82 0.70 0.92    0.95 0.93 0.97  0.85 0.76 0.94
ROUGE-S0   0.82 0.78 0.85  0.82 0.71 0.90   0.84 0.81 0.87  0.76 0.67 0.90    0.87 0.84 0.90  0.82 0.68 0.90
ROUGE-S4   0.82 0.78 0.85  0.84 0.79 0.93   0.87 0.85 0.90  0.83 0.71 0.90    0.92 0.90 0.94  0.84 0.74 0.93
ROUGE-S9   0.84 0.80 0.87  0.84 0.79 0.92   0.89 0.86 0.92  0.84 0.76 0.93    0.94 0.92 0.96  0.84 0.76 0.94
GTM10      0.82 0.79 0.85  0.79 0.74 0.83   0.91 0.89 0.94  0.84 0.79 0.93    0.94 0.92 0.96  0.84 0.79 0.92
GTM20      0.77 0.73 0.81  0.76 0.69 0.88   0.79 0.76 0.83  0.70 0.55 0.83    0.83 0.79 0.86  0.80 0.67 0.90
GTM30      0.74 0.70 0.78  0.73 0.60 0.86   0.74 0.70 0.78  0.63 0.52 0.79    0.77 0.73 0.81  0.64 0.52 0.80

Fluency

           With Case Information (Case)     Lower Case (NoCase)               Lower Case & Stemmed (Stem)
Method     P    95%L 95%U  S    95%L 95%U   P    95%L 95%U  S    95%L 95%U    P    95%L 95%U  S    95%L 95%U
BLEU1      0.81 0.75 0.86  0.76 0.62 0.90   0.73 0.67 0.79  0.70 0.62 0.81    0.70 0.63 0.77  0.79 0.67 0.90
BLEU4      0.86 0.81 0.90  0.74 0.62 0.86   0.83 0.78 0.88  0.68 0.60 0.81    0.83 0.78 0.88  0.70 0.62 0.81
BLEU12     0.87 0.76 0.93  0.66 0.33 0.79   0.93 0.81 0.97  0.78 0.44 0.94    0.93 0.84 0.97  0.80 0.49 0.94
NIST       0.81 0.75 0.87  0.74 0.62 0.86   0.70 0.64 0.77  0.68 0.60 0.79    0.68 0.61 0.75  0.77 0.67 0.88
WER        0.69 0.62 0.75  0.68 0.57 0.85   0.59 0.51 0.66  0.70 0.57 0.82    0.60 0.52 0.68  0.69 0.57 0.81
PER        0.79 0.74 0.85  0.67 0.57 0.82   0.68 0.60 0.73  0.69 0.60 0.81    0.70 0.63 0.76  0.65 0.57 0.79
ROUGE-L    0.83 0.77 0.88  0.80 0.67 0.90   0.76 0.69 0.82  0.79 0.64 0.90    0.73 0.66 0.80  0.78 0.67 0.90
ROUGE-W    0.85 0.80 0.90  0.79 0.63 0.90   0.78 0.73 0.84  0.72 0.62 0.83    0.77 0.71 0.83  0.78 0.67 0.90
ROUGE-S*   0.84 0.78 0.89  0.79 0.62 0.90   0.80 0.74 0.86  0.77 0.64 0.90    0.78 0.71 0.84  0.79 0.69 0.90
ROUGE-S0   0.87 0.81 0.91  0.78 0.62 0.90   0.83 0.78 0.88  0.71 0.62 0.82    0.82 0.77 0.88  0.76 0.62 0.90
ROUGE-S4   0.84 0.79 0.89  0.80 0.67 0.90   0.82 0.77 0.87  0.78 0.64 0.90    0.81 0.75 0.86  0.79 0.67 0.90
ROUGE-S9   0.84 0.79 0.89  0.80 0.67 0.90   0.81 0.76 0.87  0.79 0.69 0.90    0.79 0.73 0.85  0.79 0.69 0.90
GTM10      0.73 0.66 0.79  0.76 0.60 0.87   0.71 0.64 0.78  0.80 0.67 0.90    0.66 0.58 0.74  0.80 0.64 0.90
GTM20      0.86 0.81 0.90  0.80 0.67 0.90   0.83 0.77 0.88  0.69 0.62 0.81    0.83 0.77 0.87  0.74 0.62 0.89
GTM30      0.87 0.81 0.91  0.79 0.67 0.90   0.83 0.77 0.87  0.73 0.62 0.83    0.83 0.77 0.88  0.71 0.60 0.83

Table 1. Pearson's ρ (P) and Spearman's ρ (S) correlations of automatic evaluation measures vs. adequacy and fluency. BLEU1, 4, and 12 are BLEU with a maximum n-gram length of 1, 4, and 12; NIST is the NIST score; ROUGE-L is the LCS-based F-measure (β = 1); ROUGE-W is the weighted LCS-based F-measure (β = 1); ROUGE-S* is the skip-bigram-based F-measure (β = 1) with no skip distance limit; ROUGE-SN is the skip-bigram-based F-measure (β = 1) with a maximum skip distance of N; PER is position-independent word error rate; and WER is word error rate. GTM10, 20, and 30 are General Text Matcher with exponents of 1.0, 2.0, and 3.0. 95%L and 95%U are the lower and upper values of the 95% confidence interval. (Note: only BLEU1, 4, and 12 are shown here to preserve space.)

Correlation analysis based on two different correlation statistics, Pearson's ρ and Spearman's ρ, with respect to adequacy and fluency is shown in Table 1.

The Pearson's correlation coefficient [5] measures the strength and direction of a linear relationship between any two variables, i.e. the automatic metric score and the human-assigned mean coverage score in our case. It ranges from +1 to -1. A correlation of 1 means that there is a perfect positive linear relationship between the two variables, a correlation of -1 means that there is a perfect negative linear relationship between them, and a correlation of 0 means that there is no linear relationship between them. Since we would like to use automatic evaluation metrics not only in comparing systems but also in in-house system development, a good linear correlation with human judgment would enable us to use automatic scores to predict corresponding human judgment scores. Therefore, Pearson's correlation coefficient is a good measure to look at.

[5] For a quick overview of the Pearson's coefficient, see: http://davidmlane.com/hyperstat/A34739.html.
Spearman's correlation coefficient [6] is also a measure of correlation between two variables. It is a non-parametric measure and is a special case of the Pearson's correlation coefficient in which the values of the data are converted into ranks before computing the coefficient. Spearman's correlation coefficient does not assume that the correlation between the variables is linear. Therefore it is a useful correlation indicator even when a good linear correlation, for example according to Pearson's correlation coefficient, between two variables cannot be found. It also suits the NIST MT evaluation scenario, where multiple systems are ranked according to some performance metrics.

[6] For a quick overview of the Spearman's coefficient, see: http://davidmlane.com/hyperstat/A62436.html.
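Both statistics, together with bootstrap confidence intervals, are available off the shelf. The following simplified Python sketch resamples paired observations rather than reproducing the exact resampling over the 919 sentence segments described below, and the variable names are illustrative.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def bootstrap_correlation_ci(metric, human, n_resamples=1000, seed=0):
    """Pearson and Spearman correlations between two aligned score vectors,
    with simple percentile-bootstrap 95% confidence intervals obtained by
    resampling paired observations with replacement."""
    metric, human = np.asarray(metric), np.asarray(human)
    rng = np.random.default_rng(seed)
    pearson_samples, spearman_samples = [], []
    n = len(metric)
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        pearson_samples.append(pearsonr(metric[idx], human[idx])[0])
        spearman_samples.append(spearmanr(metric[idx], human[idx])[0])
    return {
        "pearson": pearsonr(metric, human)[0],
        "pearson_95ci": np.percentile(pearson_samples, [2.5, 97.5]),
        "spearman": spearmanr(metric, human)[0],
        "spearman_95ci": np.percentile(spearman_samples, [2.5, 97.5]),
    }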

To estimate the significance of these correlation statistics, we applied bootstrap resampling, generating random samples of the 919 different sentence segments. The lower and upper values of the 95% confidence interval are also shown in the table. Dark (green) cells are the best correlation numbers in their categories and light gray cells are statistically equivalent to the best numbers in their categories.

Analyzing all runs according to the adequacy and fluency tables, we make the following observations:

Applying the stemmer achieves higher correlation with adequacy, but keeping case information achieves higher correlation with fluency, except for BLEU7-12 (only BLEU12 is shown). For example, the Pearson's ρ (P) correlation of ROUGE-S* with adequacy increases from 0.85 (Case) to 0.95 (Stem), while its Pearson's ρ correlation with fluency drops from 0.84 (Case) to 0.78 (Stem). We will therefore focus our discussion on the Stem set in adequacy and the Case set in fluency.

The Pearson's ρ correlation values in the Stem set of the Adequacy table indicate that ROUGE-L and ROUGE-S with a skip distance longer than 0 correlate highly and linearly with adequacy and outperform BLEU and NIST. ROUGE-S* achieves the best correlation, with a Pearson's ρ of 0.95.

Measures favoring consecutive matches, i.e. BLEU4 and 12, ROUGE-W, GTM20 and 30, ROUGE-S0 (bigram), and WER, have lower Pearson's ρ. Among them WER (0.48), which tends to penalize small word movement, is the worst performer. One interesting observation is that longer BLEU has lower correlation with adequacy. Spearman's ρ values generally agree with Pearson's ρ but have more equivalents.
The Pearson's ρ correlation values in the Stem set of the Fluency table indicate that BLEU12 has the highest correlation (0.93) with fluency. However, it is statistically indistinguishable with 95% confidence from all other metrics shown in the Case set of the Fluency table except for WER and GTM10.

GTM10 has good correlation with human judgments in adequacy but not fluency, while GTM20 and GTM30, i.e. GTM with an exponent larger than 1.0, have good correlation with human judgments in fluency but not adequacy.

ROUGE-L and ROUGE-S*, 4, and 9 are good automatic evaluation metric candidates since they perform as well as BLEU in the fluency correlation analysis and outperform BLEU4 and 12 significantly in adequacy. Among them, ROUGE-L is the best metric in both adequacy and fluency correlation with human judgment according to Spearman's correlation coefficient, and it is statistically indistinguishable from the best metrics in both adequacy and fluency correlation with human judgment according to Pearson's correlation coefficient.

6 Conclusion

In this paper we presented two new objective automatic evaluation methods for machine translation. The first, ROUGE-L, is based on longest common subsequence (LCS) statistics between a candidate translation and a set of reference translations. Longest common subsequence takes into account sentence level structure similarity naturally and identifies longest co-occurring in-sequence n-grams automatically, while this is a free parameter in BLEU.

To give proper credit to shorter common sequences that are ignored by LCS but still retain the flexibility of non-consecutive matches, we proposed counting skip-bigram co-occurrence. The skip-bigram-based ROUGE-S* (without skip distance restriction) had the best Pearson's ρ correlation of 0.95 in adequacy when all words were lower case and stemmed. ROUGE-L, ROUGE-W, ROUGE-S*, ROUGE-S4, and ROUGE-S9 were equal performers to BLEU in measuring fluency. However, they have the advantage that we can apply them at the sentence level, while longer BLEU such as BLEU12 would not differentiate any sentences with length shorter than 12 words (i.e. no 12-gram matches). We plan to explore their correlation with human judgments at the sentence level in the future.

We also confirmed empirically that adequacy and fluency focus on different aspects of machine translations. Adequacy places more emphasis on terms co-occurring in candidate and reference translations, as shown by the higher correlations in the Stem set than in the Case set in Table 1, while the reverse is true for fluency.

The evaluation results of ROUGE-L, ROUGE-W, and ROUGE-S in machine translation evaluation are very encouraging. However, these measures in their current forms still apply only string-to-string matching. We have shown that better correlation with adequacy can be reached by applying a stemmer. In the next step, we plan to extend them to accommodate synonyms and paraphrases. For example, we can use an existing thesaurus such as WordNet (Miller 1990) or create a customized one by applying automated synonym set discovery methods (Pantel and Lin 2002) to identify potential synonyms. Paraphrases can also be automatically acquired using statistical methods, as shown by Barzilay and Lee (2003). Once we have acquired synonym and paraphrase data, we then need to design a soft matching function that assigns partial credit to these approximate matches. In this scenario, statistically generated data has the advantage of being able to provide scores reflecting the strength of similarity between synonyms and paraphrases.

ROUGE-L, ROUGE-W, and ROUGE-S have also been applied in automatic evaluation of summarization and achieved very promising results (Lin 2004). In Lin and Och (2004), we proposed a framework that automatically evaluates automatic MT evaluation metrics using only manual translations, without further human involvement. According to the results reported in that paper, ROUGE-L, ROUGE-W, and ROUGE-S also outperformed BLEU and NIST.
References

Akiba, Y., K. Imamura, and E. Sumita. 2001. Using Multiple Edit Distances to Automatically Rank Machine Translation Output. In Proceedings of MT Summit VIII, Santiago de Compostela, Spain.

Barzilay, R. and L. Lee. 2003. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment. In Proceedings of NAACL-HLT 2003, Edmonton, Canada.

Leusch, G., N. Ueffing, and H. Ney. 2003. A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation. In Proceedings of MT Summit IX, New Orleans, U.S.A.

Levenshtein, V. I. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady.

Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, post-conference workshop of ACL 2004, Barcelona, Spain.

Lin, C.-Y. and F. J. Och. 2004. ORANGE: A Method for Evaluating Automatic Evaluation Metrics for Machine Translation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland.

Melamed, I. D. 1995. Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons. In Proceedings of the 3rd Workshop on Very Large Corpora (WVLC3), Boston, U.S.A.

Melamed, I. D., R. Green, and J. P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of NAACL/HLT 2003, Edmonton, Canada.

Miller, G. 1990. WordNet: An Online Lexical Database. International Journal of Lexicography, 3(4).

Nießen, S., F. J. Och, G. Leusch, and H. Ney. 2000. An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.

NIST. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf

Pantel, P. and D. Lin. 2002. Discovering Word Senses from Text. In Proceedings of SIGKDD-02, Edmonton, Canada.

Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Porter, M. F. 1980. An Algorithm for Suffix Stripping. Program, 14, pp. 130-137.

Saggion, H., D. Radev, S. Teufel, and W. Lam. 2002. Meta-Evaluation of Summaries in a Cross-Lingual Environment Using Content-Based Metrics. In Proceedings of COLING-2002, Taipei, Taiwan.

Su, K.-Y., M.-W. Wu, and J.-S. Chang. 1992. A New Quantitative Quality Measure for Machine Translation Systems. In Proceedings of COLING-92, Nantes, France.

Thompson, H. S. 1991. Automatic Evaluation of Translation Quality: Outline of Methodology and Report on Pilot Experiment. In Proceedings of the Evaluator's Forum, ISSCO, Geneva, Switzerland.

Turian, J. P., L. Shen, and I. D. Melamed. 2003. Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, New Orleans, U.S.A.

Van Rijsbergen, C. J. 1979. Information Retrieval. Butterworths, London.