RC22176 (W0109-022) September 17, 2001 Computer Science

IBM Research Report

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
IBM Research Division, Thomas J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598


Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
{papineni,roukos,toddward,weijing}@us.ibm.com

Abstract

Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.^1

^1 So we call our method the bilingual evaluation understudy, Bleu.

1 Introduction

1.1 Rationale

Human evaluations of machine translation (MT) weigh many aspects of translation, including adequacy, fidelity, and fluency of the translation [1][2]. Such evaluations are extensive, but also expensive [1]. Moreover, they can take weeks or months to finish. This is a big problem because developers of machine translation systems need to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas. We believe that MT progress stems from evaluation and that there is a logjam of fruitful research ideas waiting to be released from the evaluation bottleneck. Developers would benefit from an inexpensive automatic evaluation that is quick, language-independent, and correlates highly with human evaluation. We propose such an evaluation method in this note.

1.2 Viewpoint

How does one measure translation performance? The closer a machine translation is to a professional human translation, the better it is. This is the central idea behind our proposal. To judge the quality of a machine translation, one measures its closeness by a numerical metric to one or more reference human translations. Thus, our MT evaluation system requires two ingredients:

1. a numerical "translation closeness" metric

2. a corpus of good quality human reference translations

These reference translations can be reused over and over again and incur only a one-time startup expense. Each evaluation can be accomplished in seconds.

We fashion our closeness metric after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order. The main idea is to use a weighted average of variable length phrase matches against the reference translations. This view gives rise to a family of metrics using various weighting schemes. We have selected a promising baseline metric from this family.

Although our baseline metric correlates very highly with human judgments, we do know that there are subtleties and stylistic variations that are better appreciated by humans than machines. For the foreseeable future, we believe these subtleties will remain relatively small effects compared with other MT phenomena. We present our method as a virtual apprentice or understudy to skilled human judges. We call this method Bleu, for BiLingual Evaluation Understudy.

In Section 2, we describe the baseline metric in detail. In Section 3, we evaluate the performance of Bleu. In Section 4, we describe a human evaluation experiment. In Section 5, we compare our baseline metric performance with human evaluations.

2 The Baseline Bleu Metric

Typically, there are many "perfect" translations of a given source sentence. These translations may vary in word choice or in word order even when they use the same words. And yet humans can clearly distinguish a good translation from a bad one. For example, consider these two candidate translations of a Chinese source sentence.

Example 1:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Although they appear to be about the same subject, they differ markedly in quality. For comparison, we provide three reference human translations of the same sentence below.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

It is clear that the good translation, Candidate 1, shares many words and phrases with these three reference translations, while Candidate 2 does not. We will shortly quantify this notion of sharing. But first observe that Candidate 1 shares "It is a guide to action" with Reference 1, "which" with Reference 2, "ensures that the military" with Reference 1, "always" with References 2 and 3, "commands" with Reference 1, and finally "of the party" with Reference 2 (all ignoring capitalization). In contrast, Candidate 2 exhibits far fewer matches and their extent is less.

For this example, it is at once clear that a program can rank Candidate 1 higher than Candidate 2 simply by comparing n-gram matches between each candidate translation and the reference translations. Experiments over large collections of translations presented in Section 5 show that this ranking ability is a general phenomenon, and not an artifact of a few examples.

The primary programming task in a Bleu implementation is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation. For simplicity, we first focus on computing unigram matches.

2.1 Modified n-gram precision

The cornerstone of our metric is the familiar precision measure. To compute precision, one simply counts up the number of candidate translation words (unigrams) which occur in any reference translation and then divides by the total number of words in the candidate translation. Unfortunately, MT systems can overgenerate "reasonable" words, resulting in improbable, but high-precision, translations like that of Example 2 below. Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision. To compute this, one first counts the maximum number of times a word occurs in any single reference translation. Next, one clips the total count of each candidate word by its maximum reference count, adds these clipped counts up, and divides by the total (unclipped) number of candidate words.

Example 2:

Candidate: the the the the the the the.

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

Modified Unigram Precision = 2/7.^2

^2 As a guide to the eye, we have underlined the important words for computing modified precision.

In Example 1, Candidate 1 achieves a modified unigram precision of 17/18; whereas Candidate 2 achieves a modified unigram precision of 8/14. Similarly, the modified unigram precision in Example 2 is 2/7, even though its standard unigram precision is 7/7.

Modified n-gram precision is computed similarly for any n: all candidate n-gram counts and their corresponding maximum reference counts are collected. The candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams. In Example 1, Candidate 1 achieves a modified bigram precision of 10/17, whereas the lower quality Candidate 2 achieves a modified bigram precision of 1/13. In Example 2, the (implausible) candidate achieves a modified bigram precision of 0. This sort of modified n-gram precision scoring captures two aspects of translation: adequacy and fluency. A translation using the same words (1-grams) as in the references tends to satisfy adequacy. The longer n-gram matches account for fluency.
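To make the clipping operation concrete, the following minimal Python sketch (our illustration, not the authors' implementation; the helper names are ours) computes the modified n-gram precision of one candidate sentence against its references. Run on Example 2 (case-folded, punctuation stripped), it reproduces the 2/7 figure.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clip each candidate n-gram count by its maximum count in any single
    reference, sum the clipped counts, and divide by the total candidate count."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Example 2, case-folded and with punctuation stripped:
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7 ~ 0.2857
```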

2.1.1 Modified n-gram precision on blocks of text

How do we compute modified n-gram precision on a multi-sentence test set? Although one typically evaluates MT systems on a corpus of entire documents, our basic unit of evaluation is the sentence. A source sentence may translate to many target sentences, in which case we abuse terminology and refer to the corresponding target sentences as a "sentence." We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, p_n, for the entire test corpus:

p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}(\text{n-gram})}

In other words, we use a word-weighted average of the sentence-level modified precisions rather than a sentence-weighted average. As an example, we compute word matches at the sentence level, but the modified unigram precision is the fraction of words matched in the entire test corpus.
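A corpus-level p_n then pools the clipped and total counts over all candidate sentences before dividing, rather than averaging per-sentence ratios. The sketch below is again our own illustration of that definition, with hypothetical function names.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_modified_precision(candidates, references_list, n):
    """p_n: clipped n-gram counts summed over every candidate sentence,
    divided by the total number of candidate n-grams in the test corpus."""
    clipped_total, count_total = 0, 0
    for cand, refs in zip(candidates, references_list):
        cand_counts = ngrams(cand, n)
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped_total += sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
        count_total += sum(cand_counts.values())
    return clipped_total / count_total if count_total else 0.0
```

Because counts are pooled, longer sentences contribute more n-grams, which is exactly the word-weighted (rather than sentence-weighted) averaging described above.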
2.1.2 Ranking systems using only modified n-gram precision

To verify that modified n-gram precision distinguishes between very good translations and bad translations, we computed the modified precision numbers on the output of a (good) human translator and a standard (poor) machine translation system against 4 reference translations for each of 127 source sentences. The average precision results are shown in Figure 1.

[Figure 1: Distinguishing Human from Machine. Average modified precision versus phrase (n-gram) length, for n = 1 to 4.]

The strong signal differentiating human (high precision) from machine (low precision) is striking. The difference becomes stronger as we go from unigram precision to 4-gram precision. It appears that any single n-gram precision score can distinguish between a good translation and a bad translation. To be useful, however, the metric must also reliably distinguish between translations that do not differ so greatly in quality. Furthermore, it must distinguish between two human translations of differing quality. This latter requirement ensures the continued validity of the metric as MT approaches human translation quality.

To this end, we obtained a human translation by someone lacking native proficiency in both the source (Chinese) and the target language (English). For comparison, we acquired human translations of the same documents by a native English speaker. We also obtained machine translations by three commercial systems. These five "systems" (two humans and three machines) are scored against two reference professional human translations. The average modified n-gram precision results are shown in Figure 2.

[Figure 2: Machine and Human Translations. Average modified precision versus phrase (n-gram) length, n = 1 to 4, for S1, S2, S3, H1, and H2.]

Each of these n-gram statistics implies the same ranking: H2 (Human-2) is better than H1 (Human-1), and there is a big drop in quality between H1 and S3 (Machine/System-3). S3 appears better than S2 which in turn appears better than S1. Remarkably, this is the same rank order assigned to these "systems" by human judges, as we discuss later. While there seems to be ample signal in any single n-gram precision, it is more robust to combine all these signals into a single number metric.

2.1.3 Combining the modified n-gram precisions

How should we combine the modified precisions for the various n-gram sizes? A weighted linear average of the modified precisions resulted in encouraging results for the 5 systems. In this case, we found that underweighting the unigram precision yields a score which more closely matches human judgments.

However, as can be seen in Figure 2, the modified n-gram precision decays roughly exponentially with n: the modified unigram precision is much larger than the modified bigram precision, which in turn is much bigger than the modified trigram precision. A reasonable averaging scheme must take this exponential decay into account: a weighted average of the logarithm of the modified precisions would do so.

Bleu uses the average logarithm with uniform weights, which is equivalent to using the geometric mean of the modified n-gram precisions.^3,4 As a result, the Bleu metric is now more sensitive to longer n-grams. Experimentally, we obtain the best correlation with monolingual human judgments using a maximum n-gram order of 4, although 3-grams and 5-grams give comparable results.

^3 The geometric average is harsh if any of the modified precisions vanish, but this should be an extremely rare event in test corpora of reasonable size (for N_max <= 4).

^4 The geometric average also slightly outperforms our best results using an arithmetic average.
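The equivalence between a uniform-weight average of logarithms and the geometric mean can be checked in a few lines; the precision values below are made up purely for illustration.

```python
import math

precisions = [0.60, 0.35, 0.20, 0.12]   # illustrative p_1..p_4, not real data
weights = [0.25] * 4                    # uniform weights w_n = 1/N
avg_log = sum(w * math.log(p) for w, p in zip(weights, precisions))
geo_mean = math.prod(p ** w for w, p in zip(weights, precisions))  # Python 3.8+
assert abs(math.exp(avg_log) - geo_mean) < 1e-12
```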

2.2 Sentence length

A candidate translation should be neither too long nor too short, and an evaluation metric should enforce this. To some extent, the n-gram precision already accomplishes this. N-gram precision penalizes spurious words in the candidate that do not appear in any of the reference translations. Additionally, modified precision is penalized if a word occurs more frequently in a candidate translation than its maximum reference count. This rewards using a word as many times as warranted and penalizes using a word more times than it occurs in any of the references. However, modified n-gram precision alone fails to enforce the proper translation length, as is illustrated in the short, absurd example below.

Example 3:

Candidate: of the

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Because this candidate is so short compared to the proper length, one expects to find inflated precisions: the modified unigram precision is 2/2, and the modified bigram precision is 1/1.

2.2.1 The trouble with recall

Traditionally, precision has been paired with recall to overcome such length-related problems. However, Bleu considers multiple reference translations, each of which may use a different word choice to translate the same source word. Furthermore, a good candidate translation will only use (recall) one of these possible choices, but not all. Indeed, recalling all choices leads to a bad translation. Here is an example.

Example 4:

Candidate 1: I always invariably perpetually do.

Candidate 2: I always do.

Reference 1: I always do.

Reference 2: I invariably do.

Reference 3: I perpetually do.

The first candidate recalls more words from the references, but is obviously a poorer translation than the second candidate. Thus, naïve recall computed over the set of all reference words is not a good measure. Admittedly, one could align the reference translations to discover synonymous words and compute recall on concepts rather than words. But, given that reference translations vary in length and differ in word order and syntax, such a computation is complicated.

2.2.2 Sentence brevity penalty

Candidate translations longer than their references are already penalized by the modified n-gram precision measure: there is no need to penalize them again. Consequently, we introduce a multiplicative brevity penalty factor that only penalizes candidates shorter than their reference translations. With this brevity penalty in place, a high-scoring candidate translation must now match the reference translations in length, in word choice, and in word order. Note that neither this brevity penalty nor the modified n-gram precision length effect directly considers the source length; instead, they consider the range of reference translation lengths in the target language.

The brevity penalty is a multiplicative factor modifying the overall Bleu score. We wish to make the penalty 1 when the candidate's length is the same as any reference translation's length. For example, if there are three references with lengths 12, 15, and 17 words and the candidate translation is a terse 12 words, we want the brevity penalty to be 1. We call the closest reference sentence length the "best match length."

If we computed the brevity penalty sentence by sentence and averaged the penalties, then length deviations on short sentences would be punished harshly. Instead, we compute the brevity penalty over the entire corpus to allow some freedom at the sentence level. We first compute the test corpus' effective reference length, r, by summing the best match lengths for each candidate sentence in the corpus. The brevity penalty is a decaying exponential in r/c, where c is the total length of the candidate translation corpus.
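One way to realize this description in code is sketched below (our illustration, with hypothetical names): for each candidate sentence, the reference length closest to the candidate length is taken as the best match length; these are summed to give r, and the decaying exponential is applied when the candidate corpus is not longer than r. Tie-breaking between equally close reference lengths is not specified in the text, so the sketch simply takes the first.

```python
import math

def brevity_penalty(candidates, references_list):
    """Corpus-level brevity penalty: BP = 1 if c > r, else exp(1 - r/c),
    where r sums each sentence's best match (closest) reference length
    and c is the total length of the candidate corpus."""
    c = sum(len(cand) for cand in candidates)
    # Best match length per sentence: the reference length closest to the
    # candidate length (ties broken arbitrarily by taking the first).
    r = sum(min((len(ref) for ref in refs), key=lambda L: abs(L - len(cand)))
            for cand, refs in zip(candidates, references_list))
    return 1.0 if c > r else math.exp(1.0 - r / c)

# The example from the text: references of 12, 15, and 17 words and a
# terse 12-word candidate give a brevity penalty of 1.
refs = [["w"] * 12, ["w"] * 15, ["w"] * 17]
print(brevity_penalty([["w"] * 12], [refs]))  # 1.0
```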
2.3 Bleu details

We take the geometric mean of the test corpus' modified precision scores and then multiply the result by an exponential brevity penalty factor. Currently, case folding is the only text normalization performed before computing the precision.

We first compute the geometric average of the modified n-gram precisions, p_n, using n-grams up to length N and positive weights w_n summing to one. Next, let c be the length of the candidate translation and r be the effective reference corpus length. We compute the brevity penalty BP,

BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r. \end{cases}

Then,

\text{Bleu} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right).

The ranking behavior is more immediately apparent in the log domain,

\log \text{Bleu} = \min\left(1 - \frac{r}{c},\, 0\right) + \sum_{n=1}^{N} w_n \log p_n.

In our baseline, we use N = 4 and uniform weights w_n = 1/N.
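Putting the pieces together, a minimal end-to-end sketch of the baseline metric (N = 4, uniform weights) might look as follows. It is an illustration of the formulas above, not the authors' code; the tokenization and helper names are our own assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references_list, max_n=4):
    """Baseline Bleu: brevity penalty times the geometric mean of the
    corpus-level modified n-gram precisions p_1..p_N (uniform weights)."""
    weights = [1.0 / max_n] * max_n
    log_sum = 0.0
    for n, w in enumerate(weights, start=1):
        clipped, total = 0, 0
        for cand, refs in zip(candidates, references_list):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for ref in refs:
                for gram, cnt in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], cnt)
            clipped += sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            total += sum(cand_counts.values())
        p_n = clipped / total if total else 0.0
        if p_n == 0.0:
            return 0.0  # the geometric mean vanishes if any p_n is zero
        log_sum += w * math.log(p_n)
    c = sum(len(cand) for cand in candidates)
    r = sum(min((len(ref) for ref in refs), key=lambda L: abs(L - len(cand)))
            for cand, refs in zip(candidates, references_list))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_sum)

# Usage: Example 1's Candidate 1 against its three references (case-folded,
# punctuation stripped); real evaluations would pass the whole test corpus.
cand = ("it is a guide to action which ensures that the military always "
        "obeys the commands of the party").split()
refs = [("it is a guide to action that ensures that the military will "
         "forever heed party commands").split(),
        ("it is the guiding principle which guarantees the military forces "
         "always being under the command of the party").split(),
        ("it is the practical guide for the army always to heed the "
         "directions of the party").split()]
print(round(bleu([cand], [refs]), 4))
```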
3 The Bleu Evaluation

The Bleu metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. It is important to note that the more reference translations per sentence there are, the higher the score is. Thus, one must be cautious making even "rough" comparisons on evaluations with different numbers of reference translations: on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references. Table 1 shows the Bleu scores of the 5 systems against two references on the test corpus mentioned above.

Table 1: Bleu on 500 sentences

        S1      S2      S3      H1      H2
        0.0527  0.0829  0.0930  0.1934  0.2571

The MT systems S2 and S3 are very close in this metric. Hence, several questions arise:

• How reliable is the difference in Bleu metric?

• What is the variance of Bleu score?

• If we were to pick another random set of 500 sentences, would we still judge S3 to be better than S2?

To answer these questions, we divided the test corpus into 20 blocks of 25 sentences each, and computed the Bleu metric on these blocks individually. We thus have 20 samples of the Bleu metric for each system. We computed the means, variances, and paired t-statistics, which are displayed in Table 2. The t-statistic compares each system with its left neighbor in the table. For example, t = 6 for the pair S1 and S2.

Table 2: Paired t-statistics on 20 blocks

         S1     S2     S3     H1     H2
Mean     0.051  0.081  0.090  0.192  0.256
StdDev   0.017  0.025  0.020  0.030  0.039
t        -      6      3.4    24     11

Note that the numbers in Table 1 are the Bleu metric on an aggregate of 500 sentences, but the means in Table 2 are averages of the Bleu metric on aggregates of 25 sentences. As expected, they are close for each system and differ only by small finite block size effects. Since a paired t-statistic of 1.7 or above is 95% significant, the differences between the systems' scores are statistically very significant. The reported variance on 25-sentence blocks serves as an upper bound to the variance of sizeable test sets like the 500 sentence corpus.

How many reference translations do we need? Our experiments in this direction are preliminary. We simulated a single-reference test corpus in the following manner. For each of the 40 stories, we randomly selected one of the 4 reference translations as the single reference for that story. In this way, we ensured a degree of stylistic variation. Then we ranked S1, S2, and S3 by Bleu computed against this reduced reference corpus. The systems maintain the same rank order as with multiple references. We do not have significance numbers on this experiment. However, the outcome suggests that we may use a big test corpus with a single reference translation provided that the translations are not all from the same translator.
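The block-based significance test can be sketched as a paired t-statistic over per-block Bleu scores; the block scores below are placeholders, not the paper's data, and scipy.stats.ttest_rel could be used instead of the hand-rolled statistic.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t-statistic over matched per-block scores of two systems."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Placeholder per-block Bleu scores for two systems on 20 blocks of 25 sentences.
blocks_s1 = [0.050, 0.047, 0.055, 0.049, 0.052] * 4
blocks_s2 = [0.082, 0.078, 0.085, 0.080, 0.083] * 4
t = paired_t(blocks_s2, blocks_s1)
# With 19 degrees of freedom, t above roughly 1.73 is significant at the
# 95% level (one-sided); scipy.stats.ttest_rel yields the same statistic.
print(round(t, 2))
```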
4 The Human Evaluation

We now describe the human evaluation. We had two groups of human judges. The first group, called the monolingual group, consisted of 10 native speakers of English. The second group, called the bilingual group, consisted of 10 native speakers of Chinese who had lived in the United States for the past several years. None of the human judges was a professional translator. The humans judged our 5 standard systems on a Chinese sentence subset extracted at random from our 500 sentence test corpus. We paired each source sentence with each of its 5 translations, for a total of 250 pairs of Chinese source and English translations. We prepared a web page with these translation pairs randomly ordered to disperse the five translations of each source sentence. All judges used this same webpage and saw the sentence pairs in the same order. They rated each translation from 1 (very bad) to 5 (very good). A description of the rating scheme was displayed on mouseover and is reproduced in the Appendix. The monolingual group made their judgments based only on the translations' readability and fluency.

As must be expected, some judges were much more liberal than others. And some sentences were easier to translate than others. To account for the intrinsic difference between judges and the sentences, we compared each judge's rating for a sentence across systems. We performed four pairwise t-test comparisons between adjacent systems as ordered by their aggregate average score.

4.1 Monolingual group pairwise judgments

We first present the pairwise comparative judgments of the monolingual group. Figure 3 shows the mean difference between the scores of two consecutive systems and the 95% confidence interval about the mean. The confidence intervals are shown as vertical bars in the plot. We see that S2 is quite a bit better than S1 (by a mean opinion score difference of 0.326 on the 5-point scale), while S3 is judged a little better (by 0.114). Both differences are significant at the 95% level.^5 The human H1 is much better than the best system, though a bit worse than human H2. This is not surprising given that H1 is not a native speaker of either Chinese or English, whereas H2 is a native English speaker. Again, the difference between the human translators is significant beyond the 95% level.

^5 The 95% confidence interval comes from a t-test, assuming that the data comes from a T-distribution with N degrees of freedom. N varied from 350 to 470 as some judges have skipped some sentences in their evaluation. Thus, the distribution is close to Gaussian.

[Figure 3: Monolingual Judgments - pairwise differential comparison. Mean score differences between consecutive systems with 95% confidence bounds:

              S2-S1   S3-S2   H1-S3   H2-H1
 +95%         0.400   0.194   1.945   0.670
 -95%         0.252   0.034   1.705   0.400
 monolingual  0.326   0.114   1.825   0.535 ]

4.2 Bilingual group pairwise judgments

Figure 4 shows the same results for the bilingual group. They also find that S3 is slightly better than S2 (at 95% confidence), though they judge that the human translations are much closer (judged as indistinguishable at 95% confidence), suggesting that the bilinguals tended to focus more on adequacy than on fluency.

[Figure 4: Bilingual Judgments - pairwise differential comparison. Mean score differences between consecutive systems with 95% confidence bounds:

             S2-S1   S3-S2   H1-S3   H2-H1
 +95%        0.667   0.238   2.007   0.145
 -95%        0.435   0.042   1.759  -0.069
 bilingual   0.551   0.140   1.883   0.038 ]

5 Bleu vs The Human Evaluation

Figure 5 shows a linear regression of the monolingual group scores as a function of the Bleu score for the 5 systems. We computed the Bleu scores using two reference translations. The high correlation coefficient of 0.99 indicates that Bleu tracks human judgment well. Particularly interesting is how well Bleu distinguishes between S2 and S3, which are quite close. Figure 6 shows the comparable regression results for the bilingual group. The correlation coefficient is 0.96.

[Figure 5: Bleu predicts Monolingual Judgments. Predicted monolingual group judgment versus Bleu score for the 5 systems.]

[Figure 6: Bleu predicts Bilingual Judgments. Predicted bilingual group judgment versus Bleu score for the 5 systems.]
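Given the five (Bleu, human) score pairs, the quoted correlation coefficients and the regression lines of Figures 5 and 6 amount to a standard Pearson correlation and least-squares fit. The sketch below uses Table 1's Bleu values but placeholder human averages (the paper reports group scores only as pairwise differences), so its output is illustrative only; it requires Python 3.10+ for the statistics helpers.

```python
from statistics import correlation, linear_regression  # Python 3.10+

# Bleu scores from Table 1; the human averages are placeholders.
bleu_scores = [0.0527, 0.0829, 0.0930, 0.1934, 0.2571]   # S1, S2, S3, H1, H2
human_scores = [2.0, 2.3, 2.4, 3.5, 4.2]                  # hypothetical averages

r = correlation(bleu_scores, human_scores)                # Pearson coefficient
slope, intercept = linear_regression(bleu_scores, human_scores)
predicted = [slope * b + intercept for b in bleu_scores]  # regression line
print(round(r, 3))
```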

We now take the worst system as a reference point and compare the Bleu scores with the human judgment scores of the remaining systems relative to the worst system. Recall that the judges rated systems from 1 to 5, but the Bleu score ranges from 0 to 1. We took the Bleu, monolingual group, and bilingual group scores for the 5 systems and linearly normalized them by their corresponding range (the maximum and minimum score across the 5 systems). The normalized scores are shown in Figure 7. This figure illustrates the high correlation between the Bleu score and the monolingual group. Of particular interest is the accuracy of Bleu's estimate of the small difference between S2 and S3 and the larger difference between S3 and H1. The figure also shows the relatively large gap between MT systems and human translators.^6 In addition, we surmise that the bilingual group was very forgiving in judging H1 relative to H2 because the monolingual group found a rather large difference between the fluency of their translations.

^6 Crossing this chasm for Chinese-English translation appears to be a significant challenge for the current state-of-the-art systems.

[Figure 7: Bleu vs Bilingual and Monolingual Judgments. Normalized Bleu, monolingual, and bilingual scores for S1, S2, S3, H1, and H2.]
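The range normalization used for Figure 7 is a per-metric min-max rescaling across the five systems; a tiny sketch with Table 1's Bleu values as input:

```python
def normalize_by_range(scores):
    """Rescale so the worst system maps to 0 and the best to 1."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Table 1's Bleu scores for S1, S2, S3, H1, H2; the monolingual and bilingual
# group scores would be rescaled the same way before plotting.
print(normalize_by_range([0.0527, 0.0829, 0.0930, 0.1934, 0.2571]))
```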

6 Conclusion

We believe that the simplicity of Bleu is a very appealing feature for use in the research and development cycle of machine translation technology. Furthermore, given Bleu's sensitivity in distinguishing small differences between systems, we anticipate that machine translation research is at the dawn of a new era in which significant progress occurs because researchers can more easily home in on effective modeling ideas.

While we have performed an encouraging initial study, we anticipate that a larger study with a more highly parameterized form of the brevity penalty, modified n-gram precision, or precision averaging expressions could be conducted to develop a more accurate estimator of translation quality. We also look forward to evaluating the Bleu metric on other language pairs.

Finally, we believe that our approach of using the n-gram similarity of a candidate to a set of references has wider applicability than MT; for example, it could be extended to the evaluation of natural language generation and summarization systems.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency and monitored by SPAWAR under contract No. N66001-99-2-8916. The views and findings contained in this material are those of the authors and do not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

We gratefully acknowledge comments about the geometric mean by John Makhoul of BBN and ongoing discussions with George Doddington of NIST. We especially wish to thank our colleagues who served in the monolingual and bilingual judge pools for their perseverance in judging the output of Chinese-English MT systems.

References

[1] E.H. Hovy. Toward finely differentiated evaluation metrics for machine translation. In Proceedings of the Eagles Workshop on Standards and Evaluation, Pisa, Italy, 1999.

[2] J.S. White and T. O'Connell. The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, pages 193-205, Columbia, Maryland, 1994.

[3] James Child, Ray Clifford, and Pardee Lowe, Jr. Proficiency and performance in language testing. Applied Language Learning, 4:1-2, pages 19-54, 1993.

Appendix

The human judges rated each translation from 1 to 5. The following rating scheme explanations were provided to them by mouseover bubble help in the annotation tool. These were excerpted from the U.S. Government's Interagency Language Roundtable Language Skill Level Descriptions: Writing [3].^7

1. Writing vocabulary is inadequate to express anything but elementary needs. Writing tends to be a loose collection of sentences on a given topic and provides little evidence of conscious organization.

2. Can write simply about a very limited number of current events or daily situations.

3. Errors virtually never interfere with comprehension and rarely disturb the native reader.

4. Consistently able to tailor language to suit audience and able to express subtleties and nuances.

5. Has writing proficiency equal to that of a well-educated native.

^7 A copy is on the web at http://fmc.utm.edu/~rpeckham/ilrwrite.html