Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods

Lifeng Han (1), Gareth J. F. Jones (1), and Alan F. Smeaton (2)
(1) ADAPT Research Centre, (2) Insight Centre for Data Analytics
School of Computing, Dublin City University, Dublin, Ireland
[email protected]
(Authors GJ and AS are listed in alphabetical order.)

Abstract

To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) is itself a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find closely relevant evaluation solutions for their own needs. This work may also serve to inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG).

1 Introduction

Machine translation (MT) research, starting from the 1950s (Weaver, 1955), has been one of the main research topics in computational linguistics (CL) and natural language processing (NLP), and has influenced and been influenced by several other language processing tasks such as parsing and language modeling. From rule-based methods to example-based and then statistical methods (Brown et al., 1993; Och and Ney, 2003; Chiang, 2005; Koehn, 2010), and on to the current paradigm of neural network structures (Cho et al., 2014; Johnson et al., 2016; Vaswani et al., 2017; Lample and Conneau, 2019), MT quality continues to improve. However, as MT and translation quality assessment (TQA) researchers report, MT outputs are still far from reaching human parity (Läubli et al., 2018; Läubli et al., 2020; Han et al., 2020a). MT quality assessment thus remains an important task, both for facilitating MT research itself and for downstream applications. TQA is challenging because of the richness, variety, and ambiguity of natural language itself; for example, the same concept can be expressed in different word structures and patterns across different languages, and even within one language (Arnold, 2003).

In this work, we introduce human judgement and evaluation (HJE) criteria that have been used in standard international shared tasks and more broadly, such as NIST (LI, 2005), WMT (Koehn and Monz, 2006a; Callison-Burch et al., 2007a, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014, 2015, 2016, 2017, 2018; Barrault et al., 2019, 2020), and IWSLT (Eck and Hori, 2005; Paul, 2009; Paul et al., 2010; Federico et al., 2011). We then introduce automated TQA methods, including the automatic evaluation metrics that were proposed within these shared tasks and beyond.

Regarding Human Assessment (HA) methods, we categorise them into traditional and advanced sets, with the first set including intelligibility, fidelity, fluency, adequacy, and comprehension, and the second set including task-oriented evaluation, extended criteria, utilizing post-editing, segment ranking, crowd source intelligence (direct assessment), and revisiting traditional criteria.

Regarding automated TQA methods, we classify these into three categories: simple n-gram based word surface matching, deeper linguistic feature integration such as syntax and semantics, and deep learning (DL) models, with the first two regarded as traditional and the last one regarded as advanced due to the recent emergence of DL models for NLP. We further divide each of these three categories into sub-branches, each with a different focus. Of course, this classification does not have clear boundaries; for instance, some automated metrics involve both n-gram word surface similarity and linguistic features.
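To make the first category more concrete, the following is a minimal, illustrative sketch of n-gram word surface matching between an MT hypothesis and a reference translation. It simply computes the proportion of hypothesis n-grams that also occur in the reference; it is not the definition of any specific published metric (BLEU, for instance, combines several n-gram orders and adds a brevity penalty), and the function names and toy sentences are our own assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also occur in the reference.

    A toy illustration of 'word surface matching': only exact token
    overlap is rewarded, with no syntactic or semantic analysis.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    if not hyp_counts:
        return 0.0
    # Clip each hypothesis n-gram count by its count in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / sum(hyp_counts.values())

if __name__ == "__main__":
    hyp = "the cat sat on the mat"
    ref = "there is a cat on the mat"
    print(round(ngram_precision(hyp, ref, n=2), 3))  # bigram overlap: 0.4
```

Linguistic-feature and DL-based metrics, by contrast, go beyond this kind of exact surface overlap, which is what motivates the three-way classification above.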
This paper differs from existing works (Dorr et al., 2009; EuroMatrix, 2007) by introducing recent developments in MT evaluation measures, offering a different classification from manual to automatic evaluation methodologies, covering the more recently developed quality estimation (QE) tasks, and presenting these concepts concisely.

We hope that our work will shed light on and offer a useful guide for both MT researchers and researchers in other relevant NLP disciplines, from the similarity and evaluation point of view, in finding useful quality assessment methods, from either the manual or the automated perspective. This might include, for instance, natural language generation (Gehrmann et al., 2021), natural language understanding (Ruder et al., 2021) and automatic summarization (Mani, 2001; Bhandari et al., 2020).

The rest of the paper is organized as follows: Sections 2 and 3 present human assessment and automated assessment methods respectively; Section 4 presents some discussion and perspectives; Section 5 summarizes our conclusions and future work. We also list some further relevant readings in the appendices, such as methods for evaluating TQA itself, MT QE, and mathematical formulas. (This work is based on an earlier preprint edition (Han, 2016).)

2 Human Assessment Methods

In this section we introduce human judgement methods, as reflected in Fig. 1, which categorises these human methods as Traditional and Advanced.

[Figure 1: Human Assessment Methods. Traditional: intelligibility and fidelity; fluency, adequacy, comprehension; further development. Advanced: task oriented; extended criteria; utilizing post-editing; segment ranking; crowd source intelligence; revisiting traditional criteria.]

2.1 Traditional Human Assessment

2.1.1 Intelligibility and Fidelity

The earliest human assessment methods for MT can be traced back to around 1966. They include the intelligibility and fidelity criteria used by the Automatic Language Processing Advisory Committee (ALPAC) (Carroll, 1966). The requirement that a translation be intelligible means that, as far as possible, the translation should read like normal, well-edited prose and be readily understandable in the same way that such a sentence would be understandable if originally composed in the translation language. The requirement that a translation be of high fidelity or accuracy includes the requirement that the translation should, as little as possible, twist, distort, or controvert the meaning intended by the original.

2.1.2 Fluency, Adequacy and Comprehension

In the 1990s, the Advanced Research Projects Agency (ARPA) created a methodology to evaluate machine translation systems using the adequacy, fluency and comprehension of the MT output (Church and Hovy, 1991), which was adopted in subsequent MT evaluation campaigns (White et al., 1994).

To set up this methodology, the human assessor is asked to look at each fragment, delimited by syntactic constituents and containing sufficient information, and to judge its adequacy on a scale of 1 to 5. Results are computed by averaging the judgments over all of the decisions in the translation set.

Fluency evaluation is carried out in the same manner as adequacy evaluation, except that the assessor makes intuitive judgments on a sentence-by-sentence basis for each translation. Human assessors are asked to determine whether the translation is good English without reference to the correct translation. Fluency evaluation determines whether a sentence is well-formed and fluent in context.
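To make the aggregation step concrete, here is a minimal sketch of how per-segment adequacy and fluency judgments on the 1-to-5 scale described above might be averaged into system-level scores. The data layout, the single score per segment, and the optional normalisation to [0, 1] are our own illustrative assumptions rather than the exact procedure of the ARPA or later campaigns.

```python
from statistics import mean

# Hypothetical judgments: one (adequacy, fluency) pair per segment, each on
# the 1-to-5 scale described above. Real campaigns typically collect several
# assessors per segment; here one score per segment stands in for an
# already-averaged assessor panel.
judgments = [
    {"adequacy": 4, "fluency": 5},
    {"adequacy": 3, "fluency": 4},
    {"adequacy": 5, "fluency": 4},
    {"adequacy": 2, "fluency": 3},
]

def system_score(judgments, criterion):
    """Average a 1-5 criterion over all segments of the translation set."""
    return mean(j[criterion] for j in judgments)

def normalised(score, low=1, high=5):
    """Optionally map a 1-5 average onto [0, 1] for easier comparison."""
    return (score - low) / (high - low)

adequacy = system_score(judgments, "adequacy")
fluency = system_score(judgments, "fluency")
print(f"adequacy: {adequacy:.2f} ({normalised(adequacy):.2f} normalised)")
print(f"fluency:  {fluency:.2f} ({normalised(fluency):.2f} normalised)")
```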
Comprehension relates to "informativeness", whose objective is to measure a system's ability to produce a translation that conveys sufficient information, such that people can gain the necessary information from it. The reference set of expert translations is used to create six questions, each with six possible answers, including "none of the above" and "cannot be determined".

2.1.3 Further Development

Bangalore et al. (2000) classified accuracy into several categories, including simple string accuracy, generation string accuracy, and two corresponding tree-based accuracy measures. Reeder (2004) found a correlation between fluency and the number of words it takes to distinguish between human translation and MT output.

The "Linguistics Data Consortium (LDC)" designed ...

... signed to the output used in the DARPA (Defense Advanced Research Projects Agency) evaluation with a scale of language-dependent tasks, such as scanning, sorting, and topic identification. They develop an MT proficiency metric with a corpus of multiple variants which are usable as a set of controlled samples for user judgments. The principal steps include identifying the user-performed text-handling tasks, discovering the order of text-handling task tolerance, analyzing the linguistic and non-linguistic translation problems in the corpus used in determining task tolerance, and developing a set of source language patterns which correspond to diagnostic target phenomena. A brief introduction to task-based MT evaluation work was shown in their later work (Doyon et al., 1999).

Voss and Tate (2006) introduced task-based MT output evaluation by the extraction of who,