Automatic Evaluation Measures for Statistical System Optimization

Arne Mauser, Saša Hasan and Hermann Ney

Human Language Technology and Pattern Recognition
Lehrstuhl für Informatik 6, Computer Science Department
RWTH Aachen University, D-52056 Aachen, Germany
{mauser,hasan,ney}@cs.rwth-aachen.de

Abstract

Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no single correct translation. In the extreme case, two translations of the same input can have completely different words and sentence structure while still both being perfectly valid. Large projects and competitions for MT research raised the need for reliable and efficient evaluation of MT systems. For the funding side, the obvious motivation is to measure performance and progress of research. This often results in a specific measure or metric being taken as the primary evaluation criterion. Do improvements in one measure really lead to improved MT performance? How does a gain in one evaluation metric affect other measures? This paper answers these questions with a number of experiments.

1. Introduction

Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no single correct translation. In the extreme case, two translations of the same input can have completely different words and sentence structure while still both being perfectly valid.

Large projects and competitions for MT research raised the need for reliable and efficient evaluation of MT systems. For the funding side, the obvious motivation is to measure performance and progress of research. This often results in a specific measure or metric being taken as the primary evaluation criterion.

MT research is therefore forced to improve in these specific measures. For statistical MT (SMT), translation systems are usually optimized for an automatic evaluation measure. A number of these measures already exist and new evaluation measures are frequently proposed to the MT community.

Do improvements in one measure really lead to improved MT performance? How does a gain in one evaluation metric affect other measures? This paper answers these questions with a number of experiments.

The work is organized as follows: in the next section, we describe the statistical approach to machine translation. This is followed by the description of the parameter tuning in Section 2. We then look at the evaluation measures in Section 3. The task, system setup and results are discussed in Section 4.

1.1. Log-linear model

The problem of statistical machine translation is to find the translation e_1^I = e_1 ... e_i ... e_I of a given source language sentence f_1^J = f_1 ... f_j ... f_J. Among all possible target language sentences, we will choose the sentence with the highest probability:

    \hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I, e_1^I} \{ Pr(e_1^I | f_1^J) \}                          (1)
                        = \operatorname{argmax}_{I, e_1^I} \{ Pr(e_1^I) \cdot Pr(f_1^J | e_1^I) \}          (2)

Using a log-linear model (Och and Ney, 2002), we obtain:

    Pr(e_1^I | f_1^J) = \frac{\exp\left( \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right)}
                             {\sum_{{e'}_1^{I'}} \exp\left( \sum_{m=1}^{M} \lambda_m h_m({e'}_1^{I'}, f_1^J) \right)}          (3)

The denominator represents a normalization factor that depends only on the source sentence f_1^J. Therefore, we can omit it during the search process. As a decision rule, we obtain:

    \hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I, e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\}          (4)

This is a generalization of the source-channel approach. It has the advantage that additional models h(·) can be easily integrated into the overall system. The process of obtaining the λ values will be described in Section 2.
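To make the decision rule in Equation 4 concrete, the following Python sketch scores a list of candidate translations with a weighted sum of feature functions and returns the best one. The feature functions, weights and candidates shown here are illustrative placeholders, not the models actually used in our system.

# Minimal sketch of the log-linear decision rule (Equation 4).
# The feature functions and candidates below are illustrative placeholders,
# not the models actually used in the translation system.

def log_linear_score(candidate, source, feature_functions, lambdas):
    # Weighted sum of the feature function values h_m(e, f) for one candidate.
    return sum(lam * h(candidate, source)
               for lam, h in zip(lambdas, feature_functions))

def decide(candidates, source, feature_functions, lambdas):
    # Return the candidate translation with the highest log-linear score.
    return max(candidates,
               key=lambda e: log_linear_score(e, source, feature_functions, lambdas))

# Toy usage with two hypothetical feature functions: a length model and a
# dummy "language model" that merely penalizes repeated articles.
h_length = lambda e, f: -abs(len(e.split()) - len(f.split()))
h_lm = lambda e, f: -e.split().count("the")
print(decide(["the the house", "the house is small"],
             "das haus ist klein", [h_length, h_lm], [0.5, 1.0]))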

1.2. Phrase-based approach

The basic idea of phrase-based translation is to segment the given source sentence into phrases, then translate each phrase and finally compose the target sentence from these phrase translations. Formally, we define a segmentation of a given sentence pair (f_1^J, e_1^I) into K blocks:

    k \rightarrow s_k := (i_k; b_k, j_k),  for k = 1 ... K.          (5)

Here, i_k denotes the last position of the k-th target phrase; we set i_0 := 0. The pair (b_k, j_k) denotes the start and end positions of the source phrase that is aligned to the k-th target phrase; we set j_0 := 0. Phrases are defined as nonempty contiguous sequences of words. We constrain the segmentations so that all words in the source and the target sentence are covered by exactly one phrase. Thus, there are no gaps and there is no overlap.

For a given sentence pair (f_1^J, e_1^I) and a given segmentation s_1^K, we define the bilingual phrases as:

    \tilde{e}_k := e_{i_{k-1}+1} \ldots e_{i_k}          (6)
    \tilde{f}_k := f_{b_k} \ldots f_{j_k}                (7)

The segmentation s_1^K is introduced as a hidden variable in the translation model. Therefore, it would be theoretically correct to sum over all possible segmentations. In practice, we use the maximum approximation for this sum. As a result, the models h(·) depend not only on the sentence pair (f_1^J, e_1^I), but also on the segmentation s_1^K, i.e., we have models h(f_1^J, e_1^I, s_1^K). Note that the segmentation also contains the information on the phrase-level reordering.

1.3. Models used during search

When searching for the best translation for a given input sentence, we use a log-linear combination of several models (also called feature functions) as decision function. More specifically, the models are: a phrase translation model, a word-based translation model, word and phrase penalties, a target language model and a reordering model. A detailed description of the models used can be found in (Mauser et al., 2006).

2. Minimum Error Rate Training

The phrase table probabilities themselves already give a fairly good representation of our training data. Translating unseen data, however, requires the system to be more flexible. Depending on the task, for example, longer or shorter translations might be preferable. For lower quality bilingual data, we would want the system to rely a little more on the monolingual target language model than on the actual translation probabilities.

For this purpose, the scaling factors λ in Equation 4 are usually set for a specific translation task. This is done by optimizing the factors with respect to a loss function. If the loss function is the subjectively perceived translation quality, we have to adjust the weights manually. If this loss function is an automatic evaluation measure, we can use an automated procedure that will find a good solution for our parameters. This procedure is referred to as Minimum Error Rate Training (MERT) (Och, 2003).

In the experiments, we used the downhill simplex algorithm (Press et al., 2002) to optimize the system weights for a specific evaluation measure.

3. Evaluation Measures

This section briefly describes the evaluation measures considered in this work. We selected the most commonly used metrics which were suitable to perform the experiments. Most evaluation measures show a reasonable correlation with human judgement, but so far it remains an open question whether an improvement in one of these measures will also lead to improvements in translation quality.

3.1. Word Error Rate (WER)

The edit distance or Levenshtein distance on word level is the minimum number of word insertions, substitutions and deletions necessary to transform the candidate translation into the reference translation. All three operations are assumed to have identical costs. Reordering is not permitted: swapping parts of a sentence, even where grammatically valid, results in a series of insertions, deletions and/or substitutions.

The number of edit operations is divided by the number of words in the reference. If the hypothesis is longer than the reference, this can result in an error rate larger than 1. Therefore, WER has a bias towards shorter hypotheses.

When multiple reference translations are given, the reported error for a translation hypothesis is the minimum error over all references.

3.2. Position-independent Word Error Rate (PER)

Where WER is extreme in that it requires exactly the same word order in hypothesis and reference, the position-independent word error rate (PER) (Tillmann et al., 1997) neglects word order completely. It measures the difference in the counts of the words occurring in hypothesis and reference. The resulting number is divided by the number of words in the reference.

3.3. Translation Edit Rate (TER)

TER (Snover et al., 2006) is an error measure that counts the number of edits required to change a system output into one of the given reference translations. The motivation is to measure the amount of human work that would be required to post-edit the translations proposed by the system into the reference. In contrast to WER, movements of blocks are allowed and counted as one edit, with costs equal to insertions, deletions and substitutions of single words. The number of edit operations is divided by the average number of reference words.

3.4. BLEU

Proposed by (Papineni et al., 2002), the BLEU criterion measures the similarity between the n-gram count vectors of the reference translations and the candidate translation. If multiple references are present, the counts are collected over all reference translations. The typical maximum n-gram length is 4, with shorter n-grams also being counted and then interpolated. BLEU is a precision measure; higher values indicate better results. If no n-gram of maximum length matches between translation hypothesis and reference, the BLEU score is zero.

In addition, there is a brevity penalty to attenuate the strong bias towards short sentences. Variants of the brevity penalty exist with respect to the reference length used. The original IBM-BLEU used the length of the reference which was closest in length to the translation hypothesis. This is the variant that we use here.

3.5. NIST

The NIST precision measure (Doddington, 2002) was intended as an improved version of BLEU. Unwanted effects of the brevity penalty of BLEU should be reduced and n-gram occurrences are weighted by their importance. The importance is computed from the frequency of the n-gram in the references. As for BLEU, multiple reference translations are pooled, but NIST considers frequently occurring n-grams to be less important than rare ones. The brevity penalty is designed to avoid BLEU's slight bias towards short candidates.
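To make the contrast between the measures above concrete, the following is a minimal Python sketch of sentence-level WER and PER for a single hypothesis and a single reference. Multi-reference handling (taking the minimum over references) and corpus-level aggregation are omitted, and the PER variant shown follows one common formulation of counting mismatched words.

# Minimal single-sentence sketch of WER and PER; multi-reference handling
# and corpus-level aggregation are omitted for brevity.
from collections import Counter

def wer(hyp, ref):
    # Word error rate: Levenshtein distance on word level / reference length.
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = d[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(h)][len(r)] / len(r)

def per(hyp, ref):
    # Position-independent word error rate: compare word counts only.
    h, r = Counter(hyp.split()), Counter(ref.split())
    # Reference words missing from the hypothesis plus surplus hypothesis
    # words, normalized by the reference length (one common formulation).
    errors = sum((r - h).values()) + max(0, sum(h.values()) - sum(r.values()))
    return errors / sum(r.values())

print(wer("the small house", "the house is small"))   # 0.75: reordering is penalized
print(per("the small house", "the house is small"))   # 0.25: only the missing "is" counts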

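Before turning to the experiments, the tuning procedure of Section 2 can be sketched as follows: the scaling factors λ are treated as a vector that downhill simplex adjusts so that an automatic error measure on the development set decreases. The sketch below uses SciPy's Nelder-Mead implementation as a stand-in for the downhill simplex routine of (Press et al., 2002); decode_best and error_measure are placeholders for the real decoder and for one of the measures above (e.g. WER, or 100-BLEU as an error rate).

# Sketch of minimum error rate training with downhill simplex (Nelder-Mead).
# decode_best(source, lambdas) and error_measure(hypotheses, references) are
# placeholders for the real decoder and for an automatic evaluation measure.
from scipy.optimize import minimize

def mert(dev_sources, dev_references, decode_best, error_measure, initial_lambdas):
    def loss(lambdas):
        # Translate (or re-rank an n-best list of) the development set with
        # the current weights and return the error of the resulting output.
        hypotheses = [decode_best(f, lambdas) for f in dev_sources]
        return error_measure(hypotheses, dev_references)

    result = minimize(loss, initial_lambdas, method="Nelder-Mead")
    return result.x  # tuned scaling factors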
                     BLEU   TER    WER    PER    NIST
    NIST 2002 (dev)  38.9   55.2   61.5   39.7   10.09
    NIST 2003        37.1   56.9   63.1   40.5    9.78
    NIST 2004        37.6   56.3   62.8   40.4    9.73
    NIST 2005        35.9   56.7   63.1   39.7    9.63

Table 1: Overview of the translation results on all test sets.

Figure 1: Optimization for BLEU and the effect on the test sets NIST 2003-2005. BLEU is displayed as error-rate (100-BLEU) to be consistent with the other graphs. [Plot: 100-BLEU over optimization iterations for NIST 2002 (dev) and NIST 2003-2005.]

Figure 2: Optimization for TER and the effect on the test sets NIST 2003-2005. [Plot over optimization iterations for NIST 2002 (dev) and NIST 2003-2005.]

Figure 3: Optimization on BLEU and the effect on other error measures shown on the NIST 2005 set. NIST and BLEU are displayed as error-rates for better comparison. [Plot: error-rate over optimization iterations for 100-BLEU, TER, WER, PER and 100-NIST.]

4. Experiments

We conducted our experiments on the NIST MT Chinese-to-English task. The international evaluation held by the US National Institute of Standards and Technology (NIST) is focused on news translation from Arabic and Chinese to English. The bilingual training data is provided by the Linguistic Data Consortium (LDC) and consists of newswire and news magazine translations, UN documents and parliamentary proceedings. In total, we have about 8 million sentence pairs or 250 million running words.

The evaluation corpus from 2002 is used as the main development set. In most experiments we optimize the system weights on this corpus. The years 2003 to 2005 serve as test sets unless stated otherwise. Each corpus consists of about 1000 sentences and has 4 reference translations. All measures are computed neglecting case information. Table 1 gives an overview of the results on all corpora when optimizing for BLEU on the NIST 2002 set.

In order to determine how well a system tuned for a specific measure generalizes to other datasets, we optimized a system for TER and for BLEU and computed the other error measures after each iteration of the optimization procedure. The corresponding graphs are shown in Figure 1 for BLEU and Figure 2 for TER. The graphs clearly show that improvements on the development set generalize well to the test sets. The optimization procedure was run until the change in the objective function was below a threshold. Peaks visible in the graphs are artefacts of the optimization algorithm.

The second question was how well an improvement in one measure shows in other measures. To answer this question, we examined the results on the NIST 2005 corpus. To reduce the number of graphs, we look again at BLEU and TER, in Figures 3 and 4 respectively. While the other measures largely improve in the process, some degradation can be seen in later iterations.

In order to find a good overall criterion for system tuning, we examined the effect of optimizing on one measure and evaluating on all. The results for the NIST 2005 set are shown in Table 2. As expected from the initial generalization experiments, optimizing for one measure also leads to the best or near-best results for that measure on the test set.

While all criteria offer comparable translation quality, they differ largely in the sentence length of the resulting hypotheses. The count-based BLEU, NIST and also PER result in longer hypotheses. TER and especially WER lead to rather short hypotheses.

With the preferences for sentence length being rather different, we also tried to optimize on the sum of an error-rate version of BLEU (100-BLEU) and TER. The result shows a good performance on all error measures, indicating that it could serve as a reasonable all-round criterion.
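A minimal sketch of this combined criterion, assuming hypothetical helper functions bleu() and ter() that return corpus-level percentages; the resulting sum is then minimized by the tuning procedure sketched in Section 2.

# Combined tuning criterion: error-rate version of BLEU (100-BLEU) plus TER.
# bleu() and ter() are assumed helpers returning corpus-level percentages.
def combined_error(hypotheses, references, bleu, ter):
    # Both terms are error rates here, so lower values are better.
    return (100.0 - bleu(hypotheses, references)) + ter(hypotheses, references)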

                       Evaluation
    Optimize on ↓      BLEU   TER    WER    PER    NIST   Avg. length
    BLEU               35.9   56.7   63.1   39.7   9.63   31.8
    TER                34.6   55.7   62.0   40.4   9.41   29.2
    WER                33.2   55.5   61.0   41.8   8.93   27.4
    PER                35.1   57.3   64.5   39.8   9.59   31.9
    NIST               35.8   56.2   63.1   39.5   9.66   31.3
    BLEU+TER           35.4   55.8   62.2   39.8   9.56   30.2

Table 2: Error rates on the NIST 2005 corpus when optimizing and evaluating with different measures. TER, WER, PER: lower values are better; BLEU, NIST: higher values are better.

Figure 4: Optimization on TER and the effect on other error measures shown on the NIST 2005 set. NIST and BLEU are displayed as error-rates for better comparison. [Plot: error-rate over optimization iterations for 100-BLEU, TER, WER, PER and 100-NIST.]

5. Conclusions

We investigated the effects of optimizing a statistical MT system for different error measures. The results show that modern evaluation metrics like BLEU or TER are robust in two respects. First, improvements on a development set also lead to improvements on similar test sets. Second, improvements in one measure also improve the other measures. This was found not to be true for simpler measures like WER or PER. As a result, using an interpolation of BLEU and TER appeared to be a good overall choice for system tuning.

6. Acknowledgements

This paper is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023.

7. References

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. ARPA Workshop on Human Language Technology.

Arne Mauser, Richard Zens, Evgeny Matusov, Saša Hasan, and Hermann Ney. 2006. The RWTH Statistical Machine Translation System for the IWSLT 2006 Evaluation. In Proc. of the International Workshop on Spoken Language Translation, pages 103-110, Kyoto, Japan.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 295-302, Philadelphia, PA, July.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160-167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 2002. Numerical Recipes in C++. Cambridge University Press, Cambridge, UK.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, August.

Christoph Tillmann, Stephan Vogel, Hermann Ney, Alexander Zubiaga, and Hassan Sawaf. 1997. Accelerated DP based search for statistical translation. In Fifth European Conf. on Speech Communication and Technology, pages 2667-2670, Rhodos, Greece, September.
