Word Error Rates: Decomposition Over POS Classes and Applications for Error Analysis

Total Page:16

File Type:pdf, Size:1020Kb

Word Error Rates: Decomposition Over POS Classes and Applications for Error Analysis Word Error Rates: Decomposition over POS Classes and Applications for Error Analysis Maja Popovic´ Hermann Ney Lehrstuhl fur¨ Informatik 6 Lehrstuhl fur¨ Informatik 6 RWTH Aachen University RWTH Aachen University Aachen, Germany Aachen, Germany [email protected] [email protected] Abstract different systems as well as for evaluating improve- ments within one system. However, these measures Evaluation and error analysis of machine do not give any details about the nature of translation translation output are important but difficult errors. Therefore some more detailed analysis of the tasks. In this work, we propose a novel generated output is needed in order to identify the method for obtaining more details about ac- main problems and to focus the research efforts. A tual translation errors in the generated output framework for human error analysis has been pro- by introducing the decomposition of Word posed in (Vilar et al., 2006), but as every human Error Rate (WER) and Position independent evaluation, this is also a time consuming task. word Error Rate (PER) over different Part- This article presents a framework for calculating of-Speech (POS) classes. Furthermore, we the decomposition of WER and PER over different investigate two possible aspects of the use POS classes, i.e. for estimating the contribution of of these decompositions for automatic er- each POS class to the overall word error rate. Al- ror analysis: estimation of inflectional errors though this work focuses on POS classes, the method and distribution of missing words over POS can be easily extended to other types of linguis- classes. The obtained results are shown to tic information. In addition, two methods for error correspond to the results of a human error analysis using the WER and PER decompositons to- analysis. The results obtained on the Euro- gether with base forms are proposed: estimation of pean Parliament Plenary Session corpus in inflectional errors and distribution of missing words Spanish and English give a better overview over POS classes. The translation corpus used for of the nature of translation errors as well as our error analysis is built in the framework of the ideas of where to put efforts for possible im- TC-STAR project (tcs, 2005) and contains the tran- provements of the translation system. scriptions of the European Parliament Plenary Ses- sions (EPPS) in Spanish and English. The translation 1 Introduction system used is the phrase-based statistical machine Evaluation of machine translation output is a very translation system described in (Vilar et al., 2005; important but difficult task. Human evaluation is Matusov et al., 2006). expensive and time consuming. Therefore a variety 2 Related Work of automatic evaluation measures have been studied over the last years. The most widely used are Word Automatic evaluation measures for machine trans- Error Rate (WER), Position independent word Error lation output are receiving more and more atten- Rate (PER), the BLEU score (Papineni et al., 2002) tion in the last years. The BLEU metric (Pap- and the NIST score (Doddington, 2002). These mea- ineni et al., 2002) and the closely related NIST met- sures have shown to be valuable tools for comparing ric (Doddington, 2002) along with WER and PER 48 Proceedings of the Second Workshop on Statistical Machine Translation, pages 48–55, Prague, June 2007. c 2007 Association for Computational Linguistics have been widely used by many machine translation taken into account. In this work, the standard WER researchers. An extended version of BLEU which and PER are decomposed and analysed. uses n-grams weighted according to their frequency estimated from a monolingual corpus is proposed 3 Decomposition of WER and PER over in (Babych and Hartley, 2004). (Leusch et al., 2005) POS classes investigate preprocessing and normalisation meth- The standard procedure for evaluating machine ods for improving the evaluation using the standard translation output is done by comparing the hypoth- measures WER,PER, BLEU and NIST. The same set esis document hyp with given reference translations of measures is examined in (Matusov et al., 2005) ref , each one consisting of K sentences (or seg- in combination with automatic sentence segmenta- ments). The reference document ref consists of tion in order to enable evaluation of translation out- R reference translations for each sentence. Let the put without sentence boundaries (e.g. translation of length of the hypothesis sentence hypk be denoted speech recognition output). A new automatic met- as Nhypk , and the reference lengths of each sentence ric METEOR (Banerjee and Lavie, 2005) uses stems Nref . Then, the total hypothesis length of the doc- and synonyms of the words. This measure counts k,r ument is Nhyp = k Nhyp , and the total reference the number of exact word matches between the out- k length is Nref = Pk N ∗ where N ∗ is defined put and the reference. In a second step, unmatched ref k ref k as the length of theP reference sentence with the low- words are converted into stems or synonyms and est sentence-level error rate as shown to be optimal then matched. The TER metric (Snover et al., 2006) in (Leusch et al., 2005). measures the amount of editing that a human would have to perform to change the system output so that 3.1 Standard word error rates (overview) it exactly matches the reference. The CDER mea- The word error rate (WER) is based on the Lev- sure (Leusch et al., 2006) is based on edit distance, enshtein distance (Levenshtein, 1966) - the mini- such as the well-known WER, but allows reordering mum number of substitutions, deletions and inser- of blocks. Nevertheless, none of these measures or tions that have to be performed to convert the gen- extensions takes into account linguistic knowledge erated text hyp into the reference text ref . A short- about actual translation errors, for example what is coming of the WER is the fact that it does not allow the contribution of verbs in the overall error rate, reorderings of words, whereas the word order of the how many full forms are wrong whereas their base hypothesis can be different from word order of the forms are correct, etc. A framework for human error reference even though it is correct translation. In analysis has been proposed in (Vilar et al., 2006) order to overcome this problem, the position inde- and a detailed analysis of the obtained results has pendent word error rate (PER) compares the words been carried out. However, human error analysis, in the two sentences without taking the word order like any human evaluation, is a time consuming task. into account. The PER is always lower than or equal Whereas the use of linguistic knowledge for im- to the WER. On the other hand, shortcoming of the proving the performance of a statistical machine PER is the fact that the word order can be impor- translation system is investigated in many publi- tant in some cases. Therefore the best solution is to cations for various language pairs (like for exam- calculate both word error rates. ple (Nießen and Ney, 2000), (Goldwater and Mc- Calculation of WER: The WER of the hypothe- Closky, 2005)), its use for the analysis of translation sis hyp with respect to the reference ref is calculated errors is still a rather unexplored area. Some auto- as: matic methods for error analysis using base forms K and POS tags are proposed in (Popovic´ et al., 2006; 1 WER = min dL(ref k,r, hypk) Popovic´ and Ney, 2006). These measures are based N ∗ r ref Xk=1 on differences between WER and PER which are cal- culated separately for each POS class using subsets where dL(ref k,r, hypk) is the Levenshtein dis- extracted from the original texts. Standard overall tance between the reference sentence ref k,r and the WER and PER of the original texts are not at all hypothesis sentence hypk. The calculation of WER 49 is performed using a dynamic programming algo- reference: rithm. Mister#N Commissioner#N ,#PUN Calculation of PER: The PER can be calcu- twenty-four#NUM hours#N lated using the counts n(e, hypk) and n(e, ref k,r) sometimes#ADV can#V be#V too#ADV of a word e in the hypothesis sentence hypk and the much#PRON time#N .#PUN reference sentence ref k,r respectively: hypothesis: Mrs#N Commissioner#N ,#PUN K 1 twenty-four#NUM hours#N is#V PER = min dPER(ref , hyp ) r k,r k sometimes#ADV too#ADV Nref∗ Xk=1 much#PRON time#N .#PUN where Table 1: Example for illustration of actual errors: a 1 POS tagged reference sentence and a corresponding dPER(ref , hyp )= |Nref − Nhyp |+ k,r k 2 k,r k hypothesis sentence |n(e, ref ) − n(e, hyp )| k,r k reference errors hypothesis errors error type Xe Mister#N Mrs#N substitution 3.2 WER decomposition over POS classes sometimes#ADV is#V substitution The dynamic programming algorithm for WER en- can#V deletion ables a simple and straightforward identification of be#V sometimes#ADV substitution each erroneous word which actually contributes to Table 2: WER errors: actual words which are partici- WER. Let err denote the set of erroneous words k pating in the word error rate and their corresponding in sentence k with respect to the best reference and POS classes p bea POS class. Then n(p, err k) is the number of errors in err k produced by words with POS class p. It should be noted that for the substitution errors, the WER(N) = 1/12 = 8.3%, of verbs WER(V) = POS class of the involved reference word is taken 2/12 = 16.7% and of adverbs WER(ADV) = into account.
Recommended publications
  • BLEU Might Be Guilty but References Are Not Innocent
    BLEU might be Guilty but References are not Innocent Markus Freitag, David Grangier, Isaac Caswell Google Research ffreitag,grangier,[email protected] Abstract past, especially when comparing rule-based and statistical systems (Bojar et al., 2016b; Koehn and The quality of automatic metrics for machine translation has been increasingly called into Monz, 2006; Callison-Burch et al., 2006). question, especially for high-quality systems. Automated evaluations are however of crucial This paper demonstrates that, while choice importance, especially for system development. of metric is important, the nature of the ref- Most decisions for architecture selection, hyper- erences is also critical. We study differ- parameter search and data filtering rely on auto- ent methods to collect references and com- mated evaluation at a pace and scale that would pare their value in automated evaluation by not be sustainable with human evaluations. Au- reporting correlation with human evaluation for a variety of systems and metrics. Mo- tomated evaluation (Koehn, 2010; Papineni et al., tivated by the finding that typical references 2002) typically relies on two crucial ingredients: exhibit poor diversity, concentrating around a metric and a reference translation. Metrics gen- translationese language, we develop a para- erally measure the quality of a translation by as- phrasing task for linguists to perform on exist- sessing the overlap between the system output and ing reference translations, which counteracts the reference translation. Different overlap metrics this bias. Our method yields higher correla- have been proposed, aiming to improve correla- tion with human judgment not only for the submissions of WMT 2019 English!German, tion between human and automated evaluations.
    [Show full text]
  • A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output
    Vis-Eval Metric Viewer: A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output David Steele and Lucia Specia Department of Computer Science University of Sheffield Sheffield, UK dbsteele1,[email protected] Abstract cised for providing scores that can be non-intuitive and uninformative, especially at the sentence level Machine Translation systems are usually eval- (Zhang et al., 2004; Song et al., 2013; Babych, uated and compared using automated evalua- 2014). Additionally, scores across different met- tion metrics such as BLEU and METEOR to score the generated translations against human rics can be inconsistent with each other. This in- translations. However, the interaction with the consistency can be an indicator of linguistic prop- output from the metrics is relatively limited erties of the translations which should be further and results are commonly a single score along analysed. However, multiple metrics are not al- with a few additional statistics. Whilst this ways used and any discrepancies among them tend may be enough for system comparison it does to be ignored. not provide much useful feedback or a means Vis-Eval Metric Viewer (VEMV) was devel- for inspecting translations and their respective scores. Vis-Eval Metric Viewer (VEMV) is a oped as a tool bearing in mind the aforementioned tool designed to provide visualisation of mul- issues. It enables rapid evaluation of MT output, tiple evaluation scores so they can be easily in- currently employing up to eight popular metrics. terpreted by a user. VEMV takes in the source, Results can be easily inspected (using a typical reference, and hypothesis files as parameters, web browser) especially at the segment level, with and scores the hypotheses using several popu- each sentence (source, reference, and hypothesis) lar evaluation metrics simultaneously.
    [Show full text]
  • Why Word Error Rate Is Not a Good Metric for Speech Recognizer Training for the Speech Translation Task?
    WHY WORD ERROR RATE IS NOT A GOOD METRIC FOR SPEECH RECOGNIZER TRAINING FOR THE SPEECH TRANSLATION TASK? Xiaodong He, Li Deng, and Alex Acero Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA ABSTRACT ASR, where the widely used metric is word error rate (WER), the translation accuracy is usually measured by the quantities Speech translation (ST) is an enabling technology for cross-lingual including BLEU (Bi-Lingual Evaluation Understudy), NIST-score, oral communication. A ST system consists of two major and Translation Edit Rate (TER) [12][14]. BLEU measures the n- components: an automatic speech recognizer (ASR) and a machine gram matches between the translation hypothesis and the translator (MT). Nowadays, most ASR systems are trained and reference(s). In [1][2], it was reported that translation accuracy tuned by minimizing word error rate (WER). However, WER degrades 8 to 10 BLEU points when the ASR output was used to counts word errors at the surface level. It does not consider the replace the verbatim ASR transcript (i.e., assuming no ASR error). contextual and syntactic roles of a word, which are often critical On the other hand, although WER is widely accepted as the de for MT. In the end-to-end ST scenarios, whether WER is a good facto metric for ASR, it only measures word errors at the surface metric for the ASR component of the full ST system is an open level. It takes no consideration of the contextual and syntactic roles issue and lacks systematic studies. In this paper, we report our of a word.
    [Show full text]
  • Machine Translation Evaluation
    Machine Translation Evaluation Sara Stymne Partly based on Philipp Koehn's slides for chapter 8 Why Evaluation? How good is a given machine translation system? Which one is the best system for our purpose? How much did we improve our system? How can we tune our system to become better? Hard problem, since many different translations acceptable ! semantic equivalence / similarity Ten Translations of a Chinese Sentence Israeli officials are responsible for airport security. Israel is in charge of the security at this airport. The security work for this airport is the responsibility of the Israel government. Israeli side was in charge of the security of this airport. Israel is responsible for the airport's security. Israel is responsible for safety work at this airport. Israel presides over the security of the airport. Israel took charge of the airport security. The safety of this airport is taken charge of by Israel. This airport's security is the responsibility of the Israeli security officials. (a typical example from the 2001 NIST evaluation set) Which translation is best? Source F¨arjetransporterna har minskat med 20,3 procent i ˚ar. Gloss The-ferry-transports have decreased by 20.3 percent in year. Ref Ferry transports are down by 20.3% in 2008. Sys1 The ferry transports has reduced by 20,3% this year. Sys2 This year, there has been a reduction of transports by ferry of 20.3 procent. Sys3 F¨arjetransporterna are down by 20.3% in 2003. Sys4 Ferry transports have a reduction of 20.3 percent in year. Sys5 Transports are down by 20.3% in year.
    [Show full text]
  • Arxiv:2006.14799V2 [Cs.CL] 18 May 2021 the Same User Input
    Evaluation of Text Generation: A Survey Evaluation of Text Generation: A Survey Asli Celikyilmaz [email protected] Facebook AI Research Elizabeth Clark [email protected] University of Washington Jianfeng Gao [email protected] Microsoft Research Abstract The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples for task-specific NLG evaluations for automatic text summarization and long text generation, and conclude the paper by proposing future research directions.1 1. Introduction Natural language generation (NLG), a sub-field of natural language processing (NLP), deals with building software systems that can produce coherent and readable text (Reiter & Dale, 2000a) NLG is commonly considered a general term which encompasses a wide range of tasks that take a form of input (e.g., a structured input like a dataset or a table, a natural language prompt or even an image) and output a sequence of text that is coherent and understandable by humans. Hence, the field of NLG can be applied to a broad range of NLP tasks, such as generating responses to user questions in a chatbot, translating a sentence or a document from one language into another, offering suggestions to help write a story, or generating summaries of time-intensive data analysis.
    [Show full text]
  • A Survey of Evaluation Metrics Used for NLG Systems
    A Survey of Evaluation Metrics Used for NLG Systems ANANYA B. SAI, Robert-Bosch Centre for Data Science and AI, Indian Institute of Technology, Madras, India AKASH KUMAR MOHANKUMAR, Indian Institute of Technology, Madras, India MITESH M. KHAPRA, Robert-Bosch Centre for Data Science and AI, Indian Institute of Technology, Madras, India The success of Deep Learning has created a surge in interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also facilitated researchers to explore various newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that would allow us to track the progress in the field of NLG. However, unlike classification tasks, automatically evaluating NLG systems in itself is a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU, ROUGE are inadequate for capturing the nuances in the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics has led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover, various evaluation metrics have shifted from using pre-determined heuristic-based formulae to trained transformer models. This rapid change in a relatively short time has led to the need for a survey of the existing NLG metrics to help existing and new researchers to quickly come up to speed with the developments that have happened in NLG evaluation in the last few years.
    [Show full text]
  • English and Arabic Speech Translation
    Normalization for Automated Metrics: English and Arabic Speech Translation Sherri Condon*, Gregory A. Sanders†, Dan Parvaz*, Alan Rubenstein*, Christy Doran*, John Aberdeen*, and Beatrice Oshika* *The MITRE Corporation †National Institute of Standards and Technology 7525 Colshire Drive 100 Bureau Drive, Stop 8940 McLean, Virginia 22102 Gaithersburg, Maryland 20899–8940 {scondon, dparvaz, Rubenstein, cdoran, aberdeen, bea}@mitre.org / [email protected] eral different protocols and offline evaluations in Abstract which the systems process audio recordings and transcripts of interactions. Details of the The Defense Advanced Research Projects TRANSTAC evaluation methods are described in Agency (DARPA) Spoken Language Com- Weiss et al. (2008), Sanders et al. (2008) and Con- munication and Translation System for Tac- don et al. (2008). tical Use (TRANSTAC) program has Because the inputs in the offline evaluation are experimented with applying automated me- the same for each system, we can analyze transla- trics to speech translation dialogues. For trans- lations into English, BLEU, TER, and tions using automated metrics. Measures such as METEOR scores correlate well with human BiLingual Evaluation Understudy (BLEU) (Papi- judgments, but scores for translation into neni et al., 2002), Translation Edit Rate (TER) Arabic correlate with human judgments less (Snover et al., 2006), and Metric for Evaluation of strongly. This paper provides evidence to sup- Translation with Explicit word Ordering port the hypothesis that automated measures (METEOR) (Banerjee and Lavie, 2005) have been of Arabic are lower due to variation and in- developed and widely used for translations of text flection in Arabic by demonstrating that nor- and broadcast material, which have very different malization operations improve correlation properties than dialog.
    [Show full text]
  • Automatic Evaluation Measures for Statistical Machine Translation System Optimization
    Automatic Evaluation Measures for Statistical Machine Translation System Optimization Arne Mauser, Sasaˇ Hasan and Hermann Ney Human Language Technology and Pattern Recognition Lehrstuhl f¨ur Informatik 6, Computer Science Department RWTH Aachen University, D-52056 Aachen, Germany {mauser,hasan,ney}@cs.rwth-aachen.de Abstract Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no single correct translation. In the extreme case, two translations of the same input can have completely different words and sentence structure while still both being perfectly valid. Large projects and competitions for MT research raised the need for reliable and efficient evaluation of MT systems. For the funding side, the obvious motivation is to measure performance and progress of research. This often results in a specific measure or metric taken as primarily evaluation criterion. Do improvements in one measure really lead to improved MT performance? How does a gain in one evaluation metric affect other measures? This paper is going to answer these questions by a number of experiments. 1. Introduction Using a log-linear model (Och and Ney, 2002), we obtain: Evaluation of machine translation (MT) output is a chal- M I J exp =1 λmhm(e1, f1 ) lenging task. In most cases, there is no single correct trans- I J m P r(e1|f1 ) = (3) lation. In the extreme case, two translations of the same in- M ′I′ J exp P m=1 λmhm(e 1 , f1 ) ′I′ put can have completely different words and sentence struc- e 1 ture while still both being perfectly valid.
    [Show full text]
  • Re-Evaluating the Role of BLEU in Machine Translation Research
    Re-evaluating the Role of BLEU in Machine Translation Research Chris Callison-Burch Miles Osborne Philipp Koehn School on Informatics University of Edinburgh 2 Buccleuch Place Edinburgh, EH8 9LW [email protected] Abstract Bleu to the extent that it currently is. In this We argue that the machine translation paper we give a number of counterexamples for community is overly reliant on the Bleu Bleu’s correlation with human judgments. We machine translation evaluation metric. We show that under some circumstances an improve- show that an improved Bleu score is nei- ment in Bleu is not sufficient to reflect a genuine ther necessary nor sufficient for achieving improvement in translation quality, and in other an actual improvement in translation qual- circumstances that it is not necessary to improve ity, and give two significant counterex- Bleu in order to achieve a noticeable improvement amples to Bleu’s correlation with human in translation quality. judgments of quality. This offers new po- We argue that Bleu is insufficient by showing tential for research which was previously that Bleu admits a huge amount of variation for deemed unpromising by an inability to im- identically scored hypotheses. Typically there are prove upon Bleu scores. millions of variations on a hypothesis translation that receive the same Bleu score. Because not all 1 Introduction these variations are equally grammatically or se- Over the past five years progress in machine trans- mantically plausible there are translations which lation, and to a lesser extent progress in natural have the same Bleu score but a worse human eval- language generation tasks such as summarization, uation.
    [Show full text]
  • Introduction to Natural Language Processing Computer Science 585—Fall 2009 University of Massachusetts Amherst
    Machine Translation: Training Introduction to Natural Language Processing Computer Science 585—Fall 2009 University of Massachusetts Amherst David Smith 1 Training Which features of data predict good translations? 2 Training: Generative/Discriminative • Generative –Maximum likelihood training: max p(data) –“Count and normalize” –Maximum likelihood with hidden structure • Expectation Maximization (EM) • Discriminative training –Maximum conditional likelihood –Minimum error/risk training –Other criteria: perceptron and max. margin 3 “Count and Normalize” ... into the programme ... ... into the disease ... • Language modeling example: ... into the disease ... ... into the correct ... assume the probability of a word ... into the next ... depends only on the previous 2 ... into the national ... ... into the integration ... words. ... into the Union ... ... into the Union ... ... into the Union ... ... into the sort ... ... into the internal ... • p(disease|into the) = 3/20 = 0.15 ... into the general ... ... into the budget ... • “Smoothing” reflects a prior belief ... into the disease ... ... into the legal … that p(breech|into the) > 0 despite ... into the various ... these 20 examples. ... into the nuclear ... ... into the bargain ... ... into the situation ... 4 Phrase Models I did not unfortunately receive an answer to this question Auf diese Frage habe ich leider keine Antwort bekom men Assume word alignments are given. 5 Phrase Models I did not unfortunately receive an answer to this question Auf diese Frage habe ich leider keine Antwort bekom men Some good phrase pairs. 6 Phrase Models I did not unfortunately receive an answer to this question Auf diese Frage habe ich leider keine Antwort bekom men Some bad phrase pairs. 7 “Count and Normalize” • Usual approach: treat relative frequencies of source phrase s and target phrase t as probabilities • This leads to overcounting when not all segmentations are legal due to unaligned words.
    [Show full text]
  • Minimum Error Rate Training in Statistical Machine Translation
    Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California 4676 Admiralty Way, Suite 1001 Marina del Rey, CA 90292 [email protected] Abstract statistical approach and the final evaluation criterion used to measure success in a task. Often, the training procedure for statisti- Ideally, we would like to train our model param- cal machine translation models is based on eters such that the end-to-end performance in some maximum likelihood or related criteria. A application is optimal. In this paper, we investigate general problem of this approach is that methods to efficiently optimize model parameters there is only a loose relation to the final with respect to machine translation quality as mea- translation quality on unseen text. In this sured by automatic evaluation criteria such as word paper, we analyze various training criteria error rate and BLEU. which directly optimize translation qual- ity. These training criteria make use of re- 2 Statistical Machine Translation with cently proposed automatic evaluation met- Log-linear Models rics. We describe a new algorithm for effi- cient training an unsmoothed error count. Let us assume that we are given a source (‘French’) ¢¡¤£¦¥ ¡¨£ £ £ § © © © © § We show that significantly better results sentence ¥ , which is can often be obtained if the final evalua- to be translated into a target (‘English’) sentence ¡ ¡ © © © © § § tion criterion is taken directly into account Among all possible as part of the training procedure. target sentences, we will choose the sentence with the highest probability:1 ¡ "!#$ +* ')( 1 Introduction Pr (1) % & Many tasks in natural language processing have evaluation criteria that go beyond simply count- The argmax operation denotes the search problem, ing the number of wrong decisions the system i.e.
    [Show full text]
  • LEPOR: an Augmented Machine Translation Evaluation Metric Lifeng
    LEPOR: An Augmented Machine Translation Evaluation Metric by Lifeng Han, Aaron Master of Science in Software Engineering 2014 Faculty of Science and Technology University of Macau LEPOR: AN AUGMENTED MACHINE TRANSLATION EVALUATION METRIC by LiFeng Han, Aaron A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Software Engineering Faculty of Science and Technology University of Macau 2014 Approved by ___________________________________________________ Supervisor __________________________________________________ __________________________________________________ __________________________________________________ Date __________________________________________________________ In presenting this thesis in partial fulfillment of the requirements for a Master's degree at the University of Macau, I agree that the Library and the Faculty of Science and Technology shall make its copies freely available for inspection. However, reproduction of this thesis for any purposes or by any means shall not be allowed without my written permission. Authorization is sought by contacting the author at Address:FST Building 2023, University of Macau, Macau S.A.R. Telephone: +(00)853-63567608 E-mail: [email protected] Signature ______________________ 2014.07.10th Date __________________________ University of Macau Abstract LEPOR: AN AUGMENTED MACHINE TRANSLATION EVALUATION METRIC by LiFeng Han, Aaron Thesis Supervisors: Dr. Lidia S. Chao and Dr. Derek F. Wong Master of Science in Software Engineering Machine translation (MT) was developed as one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is that how to evaluate the MT system reasonably and tell us whether the translation system makes an improvement or not. The traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes with low agreement.
    [Show full text]