Word Error Rates: Decomposition over POS Classes and Applications for Error Analysis
Maja Popović, Hermann Ney
Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany
[email protected], [email protected]

Abstract

Evaluation and error analysis of machine translation output are important but difficult tasks. In this work, we propose a novel method for obtaining more details about actual translation errors in the generated output by introducing the decomposition of Word Error Rate (WER) and Position independent word Error Rate (PER) over different Part-of-Speech (POS) classes. Furthermore, we investigate two possible uses of these decompositions for automatic error analysis: estimation of inflectional errors and distribution of missing words over POS classes. The obtained results are shown to correspond to the results of a human error analysis. The results obtained on the European Parliament Plenary Session corpus in Spanish and English give a better overview of the nature of translation errors as well as ideas of where to put efforts for possible improvements of the translation system.

1 Introduction

Evaluation of machine translation output is a very important but difficult task. Human evaluation is expensive and time consuming, so a variety of automatic evaluation measures have been studied over the last years. The most widely used are Word Error Rate (WER), Position independent word Error Rate (PER), the BLEU score (Papineni et al., 2002) and the NIST score (Doddington, 2002). These measures have been shown to be valuable tools for comparing different systems as well as for evaluating improvements within one system. However, they do not give any details about the nature of translation errors. Therefore a more detailed analysis of the generated output is needed in order to identify the main problems and to focus the research efforts. A framework for human error analysis has been proposed in (Vilar et al., 2006), but, like every human evaluation, it is a time consuming task.

This article presents a framework for calculating the decomposition of WER and PER over different POS classes, i.e. for estimating the contribution of each POS class to the overall word error rate. Although this work focuses on POS classes, the method can easily be extended to other types of linguistic information. In addition, two methods for error analysis using the WER and PER decompositions together with base forms are proposed: estimation of inflectional errors and distribution of missing words over POS classes. The translation corpus used for our error analysis was built in the framework of the TC-STAR project (tcs, 2005) and contains the transcriptions of the European Parliament Plenary Sessions (EPPS) in Spanish and English. The translation system used is the phrase-based statistical machine translation system described in (Vilar et al., 2005; Matusov et al., 2006).

2 Related Work

Automatic evaluation measures for machine translation output have received more and more attention in recent years. The BLEU metric (Papineni et al., 2002) and the closely related NIST metric (Doddington, 2002), along with WER and PER, have been widely used by many machine translation researchers. An extended version of BLEU which uses n-grams weighted according to their frequency estimated from a monolingual corpus is proposed in (Babych and Hartley, 2004). (Leusch et al., 2005) investigate preprocessing and normalisation methods for improving the evaluation with the standard measures WER, PER, BLEU and NIST. The same set of measures is examined in (Matusov et al., 2005) in combination with automatic sentence segmentation in order to enable evaluation of translation output without sentence boundaries (e.g. translation of speech recognition output). The METEOR metric (Banerjee and Lavie, 2005) uses stems and synonyms of the words: it first counts the exact word matches between the output and the reference, and in a second step unmatched words are converted into stems or synonyms and then matched. The TER metric (Snover et al., 2006) measures the amount of editing that a human would have to perform to change the system output so that it exactly matches the reference. The CDER measure (Leusch et al., 2006) is based on edit distance, like the well-known WER, but allows reordering of blocks. Nevertheless, none of these measures or extensions takes linguistic knowledge about actual translation errors into account, for example what the contribution of verbs to the overall error rate is, or how many full forms are wrong although their base forms are correct. A framework for human error analysis has been proposed in (Vilar et al., 2006), and a detailed analysis of the obtained results has been carried out. However, human error analysis, like any human evaluation, is a time consuming task.

Whereas the use of linguistic knowledge for improving the performance of a statistical machine translation system has been investigated in many publications for various language pairs (for example (Nießen and Ney, 2000) and (Goldwater and McClosky, 2005)), its use for the analysis of translation errors is still a rather unexplored area. Some automatic methods for error analysis using base forms and POS tags are proposed in (Popović et al., 2006; Popović and Ney, 2006). These measures are based on differences between WER and PER calculated separately for each POS class on subsets extracted from the original texts; the standard overall WER and PER of the original texts are not taken into account at all. In this work, the standard WER and PER are decomposed and analysed.
3 Decomposition of WER and PER over POS classes

The standard procedure for evaluating machine translation output is to compare the hypothesis document hyp with the given reference translations ref, each consisting of K sentences (or segments). The reference document ref contains R reference translations for each sentence. Let the length of the hypothesis sentence hyp_k be denoted as N_{hyp_k} and the length of reference r of sentence k as N_{ref_{k,r}}. Then the total hypothesis length of the document is N_{hyp} = \sum_k N_{hyp_k}, and the total reference length is N^*_{ref} = \sum_k N_{ref^*_k}, where N_{ref^*_k} is defined as the length of the reference sentence with the lowest sentence-level error rate, which was shown to be optimal in (Leusch et al., 2005).

3.1 Standard word error rates (overview)

The word error rate (WER) is based on the Levenshtein distance (Levenshtein, 1966): the minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref. A shortcoming of the WER is that it does not allow reorderings of words, whereas the word order of the hypothesis can differ from the word order of the reference even though the translation is correct. In order to overcome this problem, the position independent word error rate (PER) compares the words in the two sentences without taking the word order into account. The PER is always lower than or equal to the WER. On the other hand, a shortcoming of the PER is that the word order can be important in some cases. Therefore the best solution is to calculate both word error rates.

Calculation of WER: The WER of the hypothesis hyp with respect to the reference ref is calculated as

WER = \frac{1}{N^*_{ref}} \sum_{k=1}^{K} \min_r d_L(ref_{k,r}, hyp_k)

where d_L(ref_{k,r}, hyp_k) is the Levenshtein distance between the reference sentence ref_{k,r} and the hypothesis sentence hyp_k. The calculation of WER is performed using a dynamic programming algorithm.

Calculation of PER: The PER can be calculated using the counts n(e, hyp_k) and n(e, ref_{k,r}) of a word e in the hypothesis sentence hyp_k and the reference sentence ref_{k,r} respectively:

PER = \frac{1}{N^*_{ref}} \sum_{k=1}^{K} \min_r d_{PER}(ref_{k,r}, hyp_k)

where

d_{PER}(ref_{k,r}, hyp_k) = \frac{1}{2} \Big( |N_{ref_{k,r}} - N_{hyp_k}| + \sum_e |n(e, ref_{k,r}) - n(e, hyp_k)| \Big)
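To make these definitions concrete, the following Python sketch computes the document-level WER and PER exactly as defined above. It is an illustration rather than the authors' implementation: all function names are ours, and the best reference length N_{ref^*_k} is approximated by the reference with the smallest Levenshtein distance to the hypothesis.

```python
# Minimal sketch (not the authors' code) of document-level WER and PER.
# Sentences are assumed to be pre-tokenized lists of words.
from collections import Counter

def levenshtein(ref, hyp):
    """Levenshtein distance d_L(ref, hyp) between two token lists."""
    d = list(range(len(hyp) + 1))          # distances for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i               # prev holds d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion of r
                                   d[j - 1] + 1,       # insertion of h
                                   prev + (r != h))    # substitution or match
    return d[-1]

def d_per(ref, hyp):
    """Position-independent distance d_PER based on the word counts n(e, .)."""
    n_ref, n_hyp = Counter(ref), Counter(hyp)
    diff = sum(abs(n_ref[e] - n_hyp[e]) for e in n_ref.keys() | n_hyp.keys())
    return 0.5 * (abs(len(ref) - len(hyp)) + diff)

def doc_wer_per(references, hypotheses):
    """Document-level WER and PER.

    references: for each sentence, a list of R tokenized reference sentences.
    hypotheses: for each sentence, one tokenized hypothesis sentence.
    """
    wer_sum, per_sum, n_ref_star = 0, 0.0, 0
    for refs, hyp in zip(references, hypotheses):
        dists = [levenshtein(r, hyp) for r in refs]
        best = dists.index(min(dists))     # approximates the best reference
        wer_sum += dists[best]
        per_sum += min(d_per(r, hyp) for r in refs)
        n_ref_star += len(refs[best])      # contributes N_{ref*_k}
    return wer_sum / n_ref_star, per_sum / n_ref_star
```

Note that, mirroring the two formulas, the minimum over the R references is taken independently for WER and PER in each sentence.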
Table 1 shows an example of a POS tagged reference sentence and a corresponding hypothesis sentence; Table 2 lists the actual errors.

Table 1: Example for illustration of actual errors: a POS tagged reference sentence and a corresponding hypothesis sentence

reference:  Mister#N Commissioner#N ,#PUN twenty-four#NUM hours#N sometimes#ADV can#V be#V too#ADV much#PRON time#N .#PUN

hypothesis: Mrs#N Commissioner#N ,#PUN twenty-four#NUM hours#N is#V sometimes#ADV too#ADV much#PRON time#N .#PUN

Table 2: WER errors: actual words which participate in the word error rate and their corresponding POS classes

reference errors    hypothesis errors    error type
Mister#N            Mrs#N                substitution
sometimes#ADV       is#V                 substitution
can#V               -                    deletion
be#V                sometimes#ADV        substitution

3.2 WER decomposition over POS classes

The dynamic programming algorithm for WER enables a simple and straightforward identification of each erroneous word which actually contributes to WER. Let err_k denote the set of erroneous words in sentence k with respect to the best reference, and let p be a POS class. Then n(p, err_k) is the number of errors in err_k produced by words with POS class p. It should be noted that for substitution errors, the POS class of the involved reference word is taken into account. For the example in Table 2, the contribution of nouns is WER(N) = 1/12 = 8.3%, of verbs WER(V) = 2/12 = 16.7%, and of adverbs WER(ADV) = 1/12 = 8.3%.
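A sketch of this decomposition, under the same caveats (our names, not the authors' code), is given below. It backtraces the Levenshtein alignment to recover the erroneous words of Table 2 and counts them per POS class. Two points are assumptions on our part: insertion errors are attributed to the POS class of the hypothesis word (the example contains no insertions), and ties between equal-cost alignments are broken arbitrarily, so the recovered error types may differ from Table 2 even though the per-class counts agree.

```python
# Sketch of the WER decomposition over POS classes.
# Tokens are "word#POS" strings as in Table 1.
from collections import Counter

def wer_errors(ref, hyp):
    """Backtrace the Levenshtein alignment; return (ref_tok, hyp_tok, type)
    triples for the erroneous words, as in Table 2 (None = no word)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]   # full DP matrix for backtracing
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    errors, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                  # match: not an error
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            errors.append((ref[i - 1], hyp[j - 1], "substitution"))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            errors.append((ref[i - 1], None, "deletion"))
            i -= 1
        else:
            errors.append((None, hyp[j - 1], "insertion"))
            j -= 1
    return errors[::-1]

def wer_decomposition(ref, hyp):
    """Per-POS contribution n(p, err_k) / N_ref for a single sentence."""
    counts = Counter()
    for r, h, _ in wer_errors(ref, hyp):
        # Substitutions and deletions take the reference word's POS class;
        # insertions fall back to the hypothesis word's POS (an assumption).
        word = r if r is not None else h
        counts[word.rsplit("#", 1)[1]] += 1
    return {p: c / len(ref) for p, c in counts.items()}

ref = ("Mister#N Commissioner#N ,#PUN twenty-four#NUM hours#N "
       "sometimes#ADV can#V be#V too#ADV much#PRON time#N .#PUN").split()
hyp = ("Mrs#N Commissioner#N ,#PUN twenty-four#NUM hours#N is#V "
       "sometimes#ADV too#ADV much#PRON time#N .#PUN").split()
print(wer_decomposition(ref, hyp))
# -> WER(N) = 1/12 = 8.3%, WER(ADV) = 1/12 = 8.3%, WER(V) = 2/12 = 16.7%
```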