Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods

Lifeng Han (ADAPT Research Centre), Gareth J. F. Jones (ADAPT Research Centre), and Alan F. Smeaton (Insight Centre for Data Analytics), School of Computing, Dublin City University, Dublin, Ireland. [email protected]. Authors GJ and AS are listed in alphabetical order.

Abstract

To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find closely relevant evaluation solutions for their own needs. This work may also serve to inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG).

1 Introduction

Machine translation (MT) research, starting from the 1950s (Weaver, 1955), has been one of the main research topics in computational linguistics (CL) and natural language processing (NLP), and has influenced and been influenced by several other language processing tasks such as parsing and language modeling. Starting from rule-based methods to example-based, and then statistical methods (Brown et al., 1993; Och and Ney, 2003; Chiang, 2005; Koehn, 2010), to the current paradigm of neural network structures (Cho et al., 2014; Johnson et al., 2016; Vaswani et al., 2017; Lample and Conneau, 2019), MT quality has continued to improve. However, as MT and translation quality assessment (TQA) researchers report, MT outputs are still far from reaching human parity (Läubli et al., 2018; Läubli et al., 2020; Han et al., 2020a). MT quality assessment is thus still an important task to facilitate MT research itself, as well as downstream applications. TQA remains a challenging and difficult task because of the richness, variety, and ambiguity of natural language itself; for example, the same concept can be expressed in different word structures and patterns in different languages, and even inside one language (Arnold, 2003).

In this work, we introduce human judgement and evaluation (HJE) criteria that have been used in standard international shared tasks and more broadly, such as NIST (LI, 2005), WMT (Koehn and Monz, 2006a; Callison-Burch et al., 2007a, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014, 2015, 2016, 2017, 2018; Barrault et al., 2019, 2020), and IWSLT (Eck and Hori, 2005; Paul, 2009; Paul et al., 2010; Federico et al., 2011). We then introduce automated TQA methods, including the automatic evaluation metrics that were proposed inside these shared tasks and beyond. Regarding human assessment (HA) methods, we categorise them into traditional and advanced sets, with the first set including intelligibility, fidelity, fluency, adequacy, and comprehension, and the second set including task-oriented evaluation, extended criteria, utilizing post-editing, segment ranking, crowd source intelligence (direct assessment), and revisiting traditional criteria.

Regarding automated TQA methods, we classify these into three categories: simple n-gram based word surface matching, deeper linguistic feature integration such as syntax and semantics, and deep learning (DL) models, with the first two regarded as traditional and the last one regarded as advanced due to the recent appearance of DL models for NLP. We further divide each of these three categories into sub-branches, each with a different focus. Of course, this classification does not have clear boundaries. For instance, some automated metrics involve both n-gram word surface similarity and linguistic features. This paper differs from existing works (Dorr et al., 2009; EuroMatrix, 2007) by introducing recent developments in MT evaluation measures, the different classifications from manual to automatic evaluation methodologies, the introduction of more recently developed quality estimation (QE) tasks, and its concise presentation of these concepts.

We hope that our work will shed light and offer a useful guide for both MT researchers and researchers in other relevant NLP disciplines, from the similarity and evaluation point of view, to find useful quality assessment methods, either from the manual or the automated perspective. This might include, for instance, natural language generation (Gehrmann et al., 2021), natural language understanding (Ruder et al., 2021) and automatic summarization (Mani, 2001; Bhandari et al., 2020).

The rest of the paper is organized as follows: Sections 2 and 3 present human assessment and automated assessment methods respectively; Section 4 presents some discussion and perspectives; Section 5 summarizes our conclusions and future work. We also list some further relevant readings in the appendices, such as methods for evaluating TQA itself, MT QE, and mathematical formulas. (This work is based on an earlier preprint edition (Han, 2016).)
2 Human Assessment Methods

In this section we introduce human judgement methods, as reflected in Fig. 1, which categorises these human methods as Traditional and Advanced.

[Figure 1: Human Assessment Methods. Traditional: intelligibility and fidelity; fluency, adequacy, comprehension; further development. Advanced: task oriented; extended criteria; utilizing post-editing; segment ranking; crowd source intelligence; revisiting traditional criteria.]

2.1 Traditional Human Assessment

2.1.1 Intelligibility and Fidelity

The earliest human assessment methods for MT can be traced back to around 1966. They include the intelligibility and fidelity criteria used by the Automatic Language Processing Advisory Committee (ALPAC) (Carroll, 1966). The requirement that a translation be intelligible means that, as far as possible, the translation should read like normal, well-edited prose and be readily understandable in the same way that such a sentence would be understandable if originally composed in the translation language. The requirement that a translation be of high fidelity or accuracy includes the requirement that the translation should, as little as possible, twist, distort, or controvert the meaning intended by the original.

2.1.2 Fluency, Adequacy and Comprehension

In the 1990s, the Advanced Research Projects Agency (ARPA) created a methodology to evaluate machine translation systems using the adequacy, fluency and comprehension of the MT output (Church and Hovy, 1991), which was adapted in MT evaluation campaigns including (White et al., 1994).

To set up this methodology, the human assessor is asked to look at each fragment, delimited by syntactic constituents and containing sufficient information, and judge its adequacy on a scale of 1 to 5. Results are computed by averaging the judgments over all of the decisions in the translation set.

Fluency evaluation is compiled in the same manner as for adequacy, except that the assessor is asked to make intuitive judgments on a sentence-by-sentence basis for each translation. Human assessors are asked to determine whether the translation is good English without reference to the correct translation. Fluency evaluation determines whether a sentence is well-formed and fluent in context.

Comprehension relates to "informativeness", whose objective is to measure a system's ability to produce a translation that conveys sufficient information, such that people can gain the necessary information from it. The reference set of expert translations is used to create six questions, each with six possible answers, including "none of the above" and "cannot be determined".
2.1.3 Further Development

Bangalore et al. (2000) classified accuracy into several categories including simple string accuracy, generation string accuracy, and two corresponding tree-based accuracy measures. Reeder (2004) found a correlation between fluency and the number of words it takes to distinguish between human translation and MT output.

The Linguistic Data Consortium (LDC, https://www.ldc.upenn.edu) designed two five-point scales representing fluency and adequacy for the annual NIST MT evaluation workshop. The developed scales became a widely used methodology when manually evaluating MT by assigning values. The five-point scale for adequacy indicates how much of the meaning expressed in the reference translation is also expressed in a translation hypothesis; the second five-point scale indicates how fluent the translation is, involving both grammatical correctness and idiomatic word choices.

Specia et al. (2011) conducted a study of MT adequacy and broke it into four levels, from score 4 to 1: highly adequate, where the translation faithfully conveys the content of the input sentence; fairly adequate, where the translation generally conveys the meaning of the input sentence but there are some problems with word order or tense/voice/number, or there are repeated, added or non-translated words; poorly adequate, where the content of the input sentence is not adequately conveyed by the translation; and completely inadequate, where the content of the input sentence is not conveyed at all by the translation.

2.2 Advanced Human Assessment

2.2.1 Task-oriented

White and Taylor (1998) developed a task-oriented evaluation methodology for Japanese-to-English translation to measure MT systems in light of the tasks for which their output might be used. They seek to associate the diagnostic scores assigned to the output used in the DARPA (Defense Advanced Research Projects Agency, https://www.darpa.mil) evaluation with a scale of language-dependent tasks, such as scanning, sorting, and topic identification. They develop an MT proficiency metric with a corpus of multiple variants which are usable as a set of controlled samples for user judgments. The principal steps include identifying the user-performed text-handling tasks, discovering the order of text-handling task tolerance, analyzing the linguistic and non-linguistic translation problems in the corpus used in determining task tolerance, and developing a set of source language patterns which correspond to diagnostic target phenomena. A brief introduction to task-based MT evaluation work was given in their later work (Doyon et al., 1999).

Voss and Tate (2006) introduced task-based MT output evaluation through the extraction of three types of elements: who, when, and where. They later extended their work to event understanding (Laoudi et al., 2006).

2.2.2 Extended Criteria

King et al. (2003) extend a large range of manual evaluation methods for MT systems which, in addition to the earlier mentioned accuracy, include: suitability, whether even accurate results are suitable in the particular context in which the system is to be used; interoperability, whether the system works with other software or hardware platforms; reliability, i.e., whether the system avoids breaking down all the time or taking a long time to get running again after breaking down; usability, whether the interfaces are easy to access, easy to learn and operate, and pleasant to look at; efficiency, whether, when needed, the system keeps up with the flow of documents being dealt with; maintainability, being able to modify the system in order to adapt it to particular users; and portability, whether one version of a system can be replaced by a new version, because MT systems are rarely static and tend to improve over time as resources grow and bugs are fixed.
2.2.3 Utilizing Post-editing

One alternative method to assess MT quality is to compare the post-edited correct translation to the original MT output. This type of evaluation is, however, time consuming and depends on the skills of the human assessor and post-editing performer. One example of a metric that is designed in such a manner is the human translation error rate (HTER) (Snover et al., 2006). This is based on the number of editing steps between an automatic translation and a reference translation. Here, a human assessor has to find the minimum number of insertions, deletions, substitutions, and shifts needed to convert the system output into an acceptable translation. HTER is defined as the number of editing steps divided by the number of words in the acceptable translation.
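As a rough illustration of the HTER computation (not taken from the original HTER tooling), the sketch below counts only insertions, deletions and substitutions between the MT output and its human post-edit and divides by the length of the post-edited translation; real HTER implementations, such as the TERCOM tool, also count block shifts. The helper name and example sentences are our own.

    # Minimal HTER sketch: edit steps between MT output and its post-edit,
    # divided by the length of the post-edited (acceptable) translation.
    # Real HTER also counts block shifts, which this simplification omits.

    def word_edit_steps(hyp, ref):
        """Word-level Levenshtein distance between two token lists."""
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution / match
        return d[len(hyp)][len(ref)]

    def hter(mt_output, post_edit):
        hyp, ref = mt_output.split(), post_edit.split()
        return word_edit_steps(hyp, ref) / len(ref)

    print(hter("the cat sat mat", "the cat sat on the mat"))  # 2 edits / 6 words = 0.33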

2.2.4 Segment Ranking

In the WMT metrics task, human assessment based on segment ranking was often used. Human assessors were frequently asked to provide a complete ranking over all the candidate translations of the same source segment (Callison-Burch et al., 2011, 2012). In the WMT13 shared tasks (Bojar et al., 2013), five systems were randomly selected for the assessor to rank. Each time, the source segment and the reference translation were presented together with the candidate translations from the five systems. The assessors ranked the systems from 1 to 5, allowing tied scores. Each ranking could therefore yield as many as 10 pairwise results if there were no ties. The collected pairwise rankings were then used to assign a score to each participating system, reflecting the quality of its automatic translations. The assigned scores could also be used to reflect how frequently a system was judged to be better or worse than other systems when they were compared on the same source segment, according to the following formula:

    score = #better pairwise rankings / (#total pairwise comparisons − #tie comparisons)    (1)
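To make Equation (1) concrete, the following sketch converts a handful of made-up five-point rankings into per-system scores; the data, system names and tallying code are purely hypothetical and are not taken from the WMT tooling.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical rankings: {system: rank} for three annotated segments (1 = best).
    rankings = [
        {"sysA": 1, "sysB": 2, "sysC": 2},
        {"sysA": 2, "sysB": 1, "sysC": 3},
        {"sysA": 1, "sysB": 3, "sysC": 2},
    ]

    wins, ties, total = defaultdict(int), defaultdict(int), defaultdict(int)

    for ranking in rankings:
        for s1, s2 in combinations(ranking, 2):   # every pair of systems on this segment
            total[s1] += 1
            total[s2] += 1
            if ranking[s1] == ranking[s2]:
                ties[s1] += 1
                ties[s2] += 1
            else:
                winner = s1 if ranking[s1] < ranking[s2] else s2
                wins[winner] += 1

    for sys in sorted(total):
        # Equation (1): wins / (total comparisons - tie comparisons)
        print(sys, round(wins[sys] / (total[sys] - ties[sys]), 3))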
2.2.5 Crowd Source Intelligence

Prompted by the reported very low human inter-annotator agreement scores from the WMT segment ranking task, researchers started to address this issue by exploring new human assessment methods, as well as seeking reliable automatic metrics for segment level ranking (Graham et al., 2015).

Graham et al. (2013) noted that the lower agreement in WMT human assessment might be caused partially by the interval-level scales the human assessors had to use when judging the quality of each segment. For instance, a human assessor may face a situation where neither of the two categories they are forced to choose between is appropriate. In light of this rationale, they proposed continuous measurement scales (CMS) for human TQA using fluency criteria. This was implemented on the crowdsourcing platform Amazon MTurk, with quality control methods such as the insertion of bad references, ask-again checks, and statistical significance testing. This methodology was reported to improve both intra-annotator and inter-annotator consistency. Detailed quality control methodologies, including statistical significance testing, were documented for direct assessment (DA) (Graham et al., 2016, 2020).

2.2.6 Revisiting Traditional Criteria

Popović (2020a) criticized the traditional human TQA methods because they fail to reflect real problems in translation when assigning scores and ranking several candidates from the same source. Instead, Popović (2020a) designed a new methodology which asks human assessors to mark all problematic parts of candidate translations, whether words, phrases, or sentences. Two questions were typically asked of the assessors, relating to comprehensibility and adequacy. The first criterion considers whether the translation is understandable, or understandable but with errors; the second criterion measures whether the candidate translation has a different meaning to the original text, or maintains the meaning but with errors. Both criteria take into account whether parts of the original text are missing in the translation. Under a similar experimental setup, Popović (2020b) also summarized the most frequent error types that the annotators recognized as misleading translations.

3 Automated Assessment Methods

Manual evaluation suffers from some disadvantages: it is time-consuming, expensive, not tune-able, and not reproducible. Because of these aspects, automatic evaluation metrics have been widely used for MT. Typically, these compare the output of MT systems against human reference translations, but there are also some metrics that do not use reference translations. There are usually two ways to offer the human reference translation: either offering one single reference or offering multiple references for a single source sentence (Lin and Och, 2004; Han et al., 2012).

Automated metrics often measure the overlap in words and word sequences, as well as word order and edit distance. We classify these kinds of metrics as "simple n-gram word surface matching". Further developed metrics also take linguistic features into account, such as syntax and semantics, including POS, sentence structure, textual entailment, paraphrase, synonyms, named entities, multi-word expressions (MWEs), semantic roles and language models. We classify the metrics that utilize these linguistic features as "deeper linguistic features (aware)". This classification is only for easier understanding and better organization of the content. It is not easy to separate these two categories clearly since they sometimes merge with each other; for instance, some metrics from the first category might also use certain linguistic features. Furthermore, we will introduce some recent models that apply deep learning to the TQA framework, as in Fig. 2. Due to space limitations, we present the MT quality estimation (QE) task, which does not rely on reference translations during the automated computing procedure, in the appendices.

[Figure 2: Automatic Quality Assessment Methods. Traditional: n-gram surface matching (edit distance; precision and recall; word order) and deeper linguistic features (syntax: POS, phrase, sentence structure; semantics: named entity, MWEs, synonym, textual entailment, paraphrase, semantic role, language model). Advanced: deep learning models.]

3.1 N-gram Word Surface Matching

3.1.1 Levenshtein Distance

By calculating the minimum number of editing steps needed to transform MT output into the reference, Su et al. (1992) introduced the word error rate (WER) metric into MT evaluation. This metric, inspired by the Levenshtein distance (or edit distance), takes word order into account; the operations include insertion (adding a word), deletion (dropping a word) and replacement (or substitution, replacing one word with another), and the metric is the minimum number of editing steps needed to match the two sequences.

One of the weak points of the WER metric is the fact that word ordering is not taken into account appropriately. WER penalises the hypothesis heavily when the word order of the system output is "wrong" according to the reference: in the Levenshtein distance, mismatches in word order require the deletion and re-insertion of the misplaced words. However, due to the diversity of language expressions, some sentences with so-called "wrong" order according to WER nevertheless prove to be good translations. To address this problem, the position-independent word error rate (PER) introduced by Tillmann et al. (1997) is designed to ignore word order when matching output and reference. Without taking word order into account, PER counts the number of times that identical words appear in both sentences. Depending on whether the translated sentence is longer or shorter than the reference translation, the remaining words are counted as either insertions or deletions.

Another way to overcome the excessive penalty on word order in the Levenshtein distance is to add a novel editing step that allows the movement of word sequences from one part of the output to another. This is something a human post-editor would do with the cut-and-paste function of a word processor. In this light, Snover et al. (2006) designed the translation edit rate (TER) metric, which adds block movement (a jumping action) as an editing step. The shift option is performed on a contiguous sequence of words within the output sentence. The cost of a block movement, over any number of continuous words and any distance, is equal to that of a single word operation, such as insertion, deletion or substitution.
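The sketch below is a simplified, sentence-level illustration of WER and PER (not the original implementations); it assumes whitespace tokenisation, and the PER formulation used here is one common simplification of the bag-of-words idea described above.

    from collections import Counter

    def levenshtein(hyp, ref):
        """Word-level edit distance (insertions, deletions, substitutions)."""
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            curr = [i]
            for j, r in enumerate(ref, 1):
                curr.append(min(prev[j] + 1,             # deletion
                                curr[j - 1] + 1,         # insertion
                                prev[j - 1] + (h != r))) # substitution / match
            prev = curr
        return prev[len(ref)]

    def wer(hyp, ref):
        """Word error rate: edit distance normalised by reference length."""
        return levenshtein(hyp, ref) / len(ref)

    def per(hyp, ref):
        """Position-independent error rate: word order is ignored entirely."""
        matches = sum((Counter(hyp) & Counter(ref)).values())
        # Unmatched reference words plus any length surplus of the hypothesis.
        return (max(len(ref), len(hyp)) - matches) / len(ref)

    hyp = "on the mat the cat sat".split()
    ref = "the cat sat on the mat".split()
    print(wer(hyp, ref), per(hyp, ref))   # WER penalises the reordering, PER does not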
3.1.2 Precision and Recall

The widely used BLEU metric (Papineni et al., 2002) is based on the degree of n-gram overlap between the strings of words produced by the MT output and the human translation references at the corpus level. BLEU calculates precision scores for n-grams of size 1 to 4, combined with a brevity penalty (BP) coefficient. If there are multiple references for a candidate sentence, then the reference whose length is nearest to that of the candidate sentence is selected as the effective one. In the BLEU metric, the n-gram precision weight λn is usually chosen to be uniform. However, the 4-gram precision value can be very low or even zero when the test corpus is small. To weight more heavily those n-grams that are more informative, Doddington (2002) proposes the NIST metric with an information weight added. Furthermore, Doddington (2002) replaces the geometric mean of co-occurrences with the arithmetic average of n-gram counts, extends the n-grams up to 5-grams (N = 5), and selects the average length of the reference translations instead of the nearest length.

ROUGE (Lin and Hovy, 2003) is a recall-oriented evaluation metric, which was initially developed for summaries and was inspired by BLEU and NIST. ROUGE has also been applied to automated TQA in later work (Lin and Och, 2004).

The F-measure is the combination of precision (P) and recall (R); it was first employed in information retrieval (IR) and later adopted by the information extraction (IE) community, MT evaluation, and others. Turian et al. (2006) carried out experiments to examine how standard measures such as precision, recall and F-measure can be applied to TQA, and compared these standard measures with some alternative evaluation methodologies.

Banerjee and Lavie (2005) designed METEOR as a novel evaluation metric. METEOR is based on the general concept of flexible unigram matching, precision and recall, including matches between words that are simple morphological variants of each other with identical word stems, and between words that are synonyms of each other. To measure how well-ordered the matched words in the candidate translation are in relation to the human reference, METEOR introduces a penalty coefficient, unlike what is done in BLEU, based on the number of matched chunks.
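The toy sketch below illustrates BLEU's two ingredients, clipped n-gram precision and the brevity penalty, at the sentence level under our own simplifications: real BLEU aggregates clipped counts over a whole corpus and applies smoothing for zero counts, so production use should rely on an established implementation such as sacreBLEU rather than this sketch.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_precision(hyp, ref, n):
        """Clipped n-gram precision: hypothesis counts are capped by reference counts."""
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum((hyp_counts & ref_counts).values())
        return overlap / max(sum(hyp_counts.values()), 1)

    def sentence_bleu(hyp, ref, max_n=4):
        precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
        if min(precisions) == 0:                 # no smoothing in this toy version
            return 0.0
        log_avg = sum(math.log(p) for p in precisions) / max_n   # geometric mean
        bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
        return bp * math.exp(log_avg)

    hyp = "the cat is on the mat".split()
    ref = "there is a cat on the mat".split()
    print(sentence_bleu(hyp, ref))                  # 0.0: no 4-gram match (the zero-count issue above)
    print(round(sentence_bleu(hyp, ref, max_n=2), 3))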
3.1.3 Revisiting Word Order

The right word order plays an important role in ensuring a high quality translation output. However, language diversity also allows different appearances or structures of a sentence. How to successfully penalise genuinely wrong word order, i.e. wrongly structured sentences, rather than "correct" different order, i.e. candidate sentences that have a different word order to the reference but are well structured, has attracted a lot of interest from researchers. In fact, the Levenshtein distance (Section 3.1.1) and n-gram based measures also contain word order information.

Featuring the explicit assessment of word order and word choice, Wong and Kit (2009) developed the evaluation metric ATEC (assessment of text essential characteristics). It is also based on precision and recall criteria, but with a position difference penalty coefficient attached. Word choice is assessed by matching word forms at various linguistic levels, including surface form, stem, sound and sense, and further by weighing the informativeness of each word.

Partially inspired by this, our work LEPOR (Han et al., 2012) is designed as a combination of augmented evaluation factors including an n-gram based word order penalty in addition to precision, recall, and an enhanced sentence-length penalty. The LEPOR metric (including hLEPOR) was reported to have top performance on the English-to-other (Spanish, German, French, Czech and Russian) language pairs in the ACL-WMT13 metrics shared tasks for system level evaluation (Han et al., 2013d). The n-gram based variant nLEPOR (Han et al., 2014) was also analysed by MT researchers as one of the three best performing segment level automated metrics (together with METEOR and sentBLEU-MOSES) that correlated with human judgement at a level that was not significantly outperformed by any other metrics, on Spanish-to-English, in addition to an aggregated set of all tested language pairs (Graham et al., 2015).
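To give a feel for what a position-difference penalty does, the toy function below is our own illustration of the general idea only; it is not the LEPOR or ATEC definition. Matched words whose relative positions diverge between hypothesis and reference are penalised exponentially.

    import math

    def position_penalty(hyp, ref):
        """Toy position-difference penalty: exp(-average normalised position gap)
        over hypothesis words that also occur in the reference (first occurrence only).
        Illustrative only; the actual LEPOR factor definitions differ."""
        diffs = []
        for i, word in enumerate(hyp):
            if word in ref:
                j = ref.index(word)
                diffs.append(abs(i / len(hyp) - j / len(ref)))
        if not diffs:
            return 0.0
        return math.exp(-sum(diffs) / len(diffs))

    ref = "the cat sat on the mat".split()
    print(position_penalty("the cat sat on the mat".split(), ref))  # 1.0, identical order
    print(position_penalty("on the mat the cat sat".split(), ref))  # < 1.0, reordered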
tences or text fragments, named entity knowl- In linguistics, a phrase may refer to any group edge is taken from the literature on named-entity of words that form a constituent, and so functions recognition, which aims to identify and classify as a single unit in the syntax of a sentence. To atomic elements in a text into different entity cate- measure an MT system’s performance in trans- gories (Marsh and Perzanowski, 1998; Guo et al., lating new text types, such as in what ways the 2009). The most commonly used entity cate- system itself could be extended to deal with new gories include the names of persons, locations, or- text types, Povlsen et al. (1998) carried out work ganizations and time (Han et al., 2013a). In the focusing on the study of an English-to-Danish MEDAR2011 evaluation campaign, one baseline MT system. The syntactic constructions are ex- system based on Moses (Koehn et al., 2007) uti- plored with more complex linguistic knowledge, lized an Open NLP toolkit to perform named en- such as the identification of fronted adverbial sub- tity detection, in addition to other packages. The ordinate clauses and prepositional phrases. As- low performances from the perspective of named suming that similar grammatical structures should entities causes a drop in fluency and adequacy. In occur in both source and translations, Avramidis the quality estimation of the MT task in WMT et al. (2011) perform evaluation on source (Ger- 2012, (Buck, 2012) introduced features including man) and target (English) sentences employing named entity, in addition to discriminative word the features of sentence length ratio, unknown lexicon, neural networks, back off behavior (Ray- words, phrase numbers including noun phrase, baud et al., 2011) and edit distance. Experiments verb phrase and prepositional phrase. Other simi- on individual features showed that, from the per- lar work using phrase similarity includes (Li et al., spective of the increasing the correlation score 2012) that uses noun phrases and verb phrases with human judgments, the named entity feature from chunking, (Echizen-ya and Araki, 2010) that contributed the most to the overall performance, only uses the noun phrase chunking in automatic in comparisons to the impacts of other features. evaluation, and (Han et al., 2013c) that designs a Multi-word Expressions (MWEs) set obsta- universal phrase tagset for French to English MT cles for MT models due to their complexity in pre- evaluation. sentation as well as idiomaticity (Sag et al., 2002; Syntax is the study of the principles and pro- Han et al., 2020b,a; Han et al., 2021). To investi- cesses by which sentences are constructed in par- gate the effect of MWEs in MT evaluation (MTE), Salehi et al. (2015) focused on the compositional- Shaked, 1988; Barzilay and Lee, 2003). Snover ity of noun compounds. They identify the noun et al. (2006) describe a new evaluation metric compounds first from the system outputs and ref- TER-Plus (TERp). Sequences of words in the ref- erence with Stanford parser. The matching scores erence are considered to be paraphrases of a se- of the system outputs and reference sentences are quence of words in the hypothesis if that phrase then recalculated, adding up to the Tesla metric, pair occurs in the TERp phrase table. by considering the predicated compositionality of identified noun compound phrases. Our own re- Semantic roles are employed by researchers as cent work in this area (Han et al., 2020a) provides linguistic features in MT evaluation. 
3.2.2 Semantic Similarity

In contrast to syntactic information, which captures overall grammaticality or sentence structure similarity, the semantic similarity between automatic translations and the source sentences (or references) can be measured by employing semantic features.

To capture the semantic equivalence of sentences or text fragments, named entity knowledge is taken from the literature on named-entity recognition, which aims to identify and classify atomic elements in a text into different entity categories (Marsh and Perzanowski, 1998; Guo et al., 2009). The most commonly used entity categories include the names of persons, locations, organizations and times (Han et al., 2013a). In the MEDAR 2011 evaluation campaign, one baseline system based on Moses (Koehn et al., 2007) utilized the OpenNLP toolkit to perform named entity detection, in addition to other packages. The low performance on named entities caused a drop in fluency and adequacy. In the quality estimation task of WMT 2012, Buck (2012) introduced features including named entities, in addition to a discriminative word lexicon, neural networks, back-off behavior (Raybaud et al., 2011) and edit distance. Experiments on individual features showed that, in terms of increasing the correlation with human judgments, the named entity feature contributed the most to the overall performance, in comparison to the impact of the other features.

Multi-word expressions (MWEs) set obstacles for MT models due to their complexity of presentation as well as their idiomaticity (Sag et al., 2002; Han et al., 2020b,a; Han et al., 2021). To investigate the effect of MWEs in MT evaluation (MTE), Salehi et al. (2015) focused on the compositionality of noun compounds. They first identify the noun compounds in the system outputs and references with the Stanford parser. The matching scores of the system outputs and reference sentences are then recalculated, as an addition to the TESLA metric, by considering the predicted compositionality of the identified noun compound phrases. Our own recent work in this area (Han et al., 2020a) provides an extensive investigation into various MT errors caused by MWEs.

Synonyms are words with the same or close meanings. One of the most widely used synonym databases in the NLP literature is WordNet (Miller et al., 1990), an English lexical database that groups English words into sets of synonyms. WordNet classifies words mainly into four POS categories, noun, verb, adjective, and adverb, without prepositions, determiners, etc. Synonymous words or phrases are organized into units called synsets. Each synset forms a hierarchical structure with words at different levels according to their semantic relations.
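As a small illustration of synonym-aware matching in the spirit of METEOR's synonymy module (but not METEOR itself), the sketch below uses NLTK's WordNet interface; it assumes the WordNet corpus has been downloaded via nltk.download('wordnet'), and the greedy matching strategy is a simplification of our own.

    from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

    def synonyms(word):
        """All WordNet lemma names sharing a synset with the word (plus the word itself)."""
        lemmas = {word}
        for synset in wn.synsets(word):
            lemmas.update(l.lower().replace("_", " ") for l in synset.lemma_names())
        return lemmas

    def synonym_matches(hyp_tokens, ref_tokens):
        """Greedy unigram matching counting exact or WordNet-synonym matches."""
        remaining = list(ref_tokens)
        matched = 0
        for token in hyp_tokens:
            for ref_token in remaining:
                if token == ref_token or ref_token in synonyms(token):
                    matched += 1
                    remaining.remove(ref_token)
                    break
        return matched

    hyp = "the movie was great".split()
    ref = "the film was great".split()
    print(synonym_matches(hyp, ref))   # 4: "movie" matches "film" via WordNet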
Textual entailment is usually used as a directional relation between text fragments. If the truth of one text fragment TA follows from another text fragment TB, then there is a directional relation between TA and TB (TB ⇒ TA). Instead of pure logical or mathematical entailment, textual entailment in natural language processing (NLP) is usually performed with a relaxed or loose definition (Dagan et al., 2006). For instance, if, according to text fragment TB, it can be inferred that text fragment TA is most likely to be true, then the relationship TB ⇒ TA is also established. Since the relation is directional, the inverse inference (TA ⇒ TB) is not ensured to be true (Dagan and Glickman, 2004). Castillo and Estrella (2012) present a new approach for MT evaluation based on the task of "semantic textual similarity". This problem is addressed using a textual entailment engine based on WordNet semantic features.

Paraphrase is the restatement of the meaning of a passage of text using other words, and can be seen as bidirectional textual entailment (Androutsopoulos and Malakasiotis, 2010). Instead of the literal, word-by-word and line-by-line translation used by metaphrase, a paraphrase represents a dynamic equivalent. Further knowledge of paraphrase from the linguistic point of view is introduced in the works of (McKeown, 1979; Meteer and Shaked, 1988; Barzilay and Lee, 2003). Snover et al. (2006) describe a new evaluation metric, TER-Plus (TERp). Sequences of words in the reference are considered to be paraphrases of a sequence of words in the hypothesis if that phrase pair occurs in the TERp phrase table.

Semantic roles are employed by researchers as linguistic features in MT evaluation. To utilize semantic roles, sentences are usually first shallow parsed and entity tagged. Then the semantic roles are used to specify the arguments and adjuncts that occur in both the candidate translation and the reference translation. For instance, the semantic roles introduced by Giménez and Márquez (2007); Giméne and Márquez (2008) include causative agent, adverbial adjunct, directional adjunct, negation marker, and predication adjunct, etc. In a further development, Lo and Wu (2011a,b) presented the MEANT metric, designed to capture the predicate-argument relations as structural relations in semantic frames, which are not reflected in the flat semantic role label features in the work of Giménez and Márquez (2007). Furthermore, instead of using uniform weights, Lo et al. (2012) weight the different types of semantic roles according to their relative importance, determined empirically, to the adequate preservation of meaning. Generally, semantic roles account for the semantic structure of a segment and have proved effective in assessing the adequacy of translation.

Language models are also utilized by MT evaluation researchers. A statistical language model usually assigns a probability to a sequence of words by means of a probability distribution. Gamon et al. (2005) propose LM-SVM, a language-model and support vector machine method, investigating the possibility of evaluating MT quality and fluency in the absence of reference translations. They evaluate the performance of the system when used as a classifier for identifying highly disfluent and ill-formed sentences.

Generally, the linguistic features mentioned above, including both syntactic and semantic features, are combined in two ways: either by following a machine learning approach (Albrecht and Hwa, 2007; Leusch and Ney, 2009), or by trying to combine a wide variety of metrics in a more simple and straightforward way, as in (Giméne and Márquez, 2008; Specia and Giménez, 2010; Comelles et al., 2012).

3.3 Neural Networks for TQA

We briefly list some works that have applied deep learning and neural networks to TQA and which are promising for further exploration. For instance, Guzmán et al. (2015); Guzmn et al. (2017) use neural networks (NNs) for TQA through pairwise modeling, choosing the best hypothesis translation by comparing candidate translations with a reference and integrating syntactic and semantic information into the NNs. Gupta et al. (2015b) proposed LSTM networks based on dense vectors to conduct TQA, while Ma et al. (2016) designed a new metric based on bi-directional LSTMs, which is similar to the work of Guzmán et al. (2015) but with less complexity, as it allows the evaluation of a single hypothesis against a reference instead of a pairwise setting.
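The toy sketch below is not any of the published metrics above; it only illustrates the general shape of this family of approaches: both hypothesis and reference are encoded with a small bidirectional LSTM and compared by cosine similarity. The vocabulary, dimensions and weights here are placeholders (the encoder is untrained), whereas real neural metrics are trained on human judgement data.

    import torch
    import torch.nn as nn

    class SentenceEncoder(nn.Module):
        """Toy bidirectional-LSTM sentence encoder with random (untrained) weights."""
        def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

        def forward(self, token_ids):
            emb = self.embed(token_ids)        # (1, seq_len, emb_dim)
            output, _ = self.lstm(emb)         # (1, seq_len, 2 * hidden_dim)
            return output.mean(dim=1)          # mean-pooled sentence vector

    vocab = {w: i for i, w in enumerate("<unk> the cat sat on mat a feline was sitting".split())}
    encode_ids = lambda s: torch.tensor([[vocab.get(w, 0) for w in s.split()]])

    encoder = SentenceEncoder(len(vocab))
    with torch.no_grad():
        hyp_vec = encoder(encode_ids("a feline sat on the mat"))
        ref_vec = encoder(encode_ids("the cat sat on the mat"))
        score = torch.nn.functional.cosine_similarity(hyp_vec, ref_vec).item()
    print(round(score, 3))   # demonstration of the computation only, not a usable metric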
4 Discussion and Perspectives

In this section, we examine several topics that can be considered for the further development of the MT evaluation field.

The first aspect is that development should involve both n-gram word surface matching and deeper linguistic features. Because natural languages are expressive and ambiguous at different levels (Giménez and Márquez, 2007), simple n-gram word surface similarity based metrics limit their scope to the lexical dimension and are not sufficient to ensure that two sentences convey the same meaning or not. For instance, (Callison-Burch et al., 2006a) and (Koehn and Monz, 2006b) report that simple n-gram matching metrics tend to favor automatic statistical MT systems. If the evaluated systems belong to different types, including rule-based, human-aided, and statistical systems, then the simple n-gram matching metrics, such as BLEU, show strong disagreement between their rankings and those of the human assessors. So deeper linguistic features are very important in the MT evaluation procedure.

However, inappropriate, excessive or abusive utilization of linguistic features will result in limited popularity of measures incorporating those features. In the future, how to utilize linguistic features in a more accurate, flexible and simplified way will be one challenge in MT evaluation. Furthermore, MT evaluation from the perspective of semantic similarity is more reasonable and comes closer to human judgement, so it should receive more attention.

The second major aspect is that MT quality estimation (QE) tasks differ from traditional MT evaluation in several ways, such as extracting reference-independent features from input sentences and their translations, obtaining quality scores based on models produced from training data, predicting the quality of an unseen translated text at system run-time, filtering out sentences which are not good enough for post-processing, and selecting the best translation among multiple systems. With so many challenges, this topic will continue to attract many researchers.

Thirdly, some advanced or challenging technologies that can be further applied to MT evaluation include deep learning models (Gupta et al., 2015a; Zhang and Zong, 2015), semantic logic forms, and decipherment models.

5 Conclusions and Future Work

In this paper we have presented a survey of the state of the art in translation quality assessment methodologies from the viewpoints of both manual judgements and automated methods. This work differs from conventional MT evaluation review work in its concise structure and its inclusion of some recently published work and references. Due to space limitations, in the main content we focused on conventional human assessment methods and automated evaluation metrics that rely on reference translations. However, we also list some interesting and related work in the appendices, such as quality estimation in MT, where the reference translation is not presented during the estimation, and the methodology for evaluating TQA methods themselves. This arrangement does not affect the overall understanding of this paper as a self-contained overview. We believe this work can help both MT and NLP researchers and practitioners in identifying appropriate quality assessment methods for their work. We also expect this work might shed some light on evaluation methodologies in other NLP tasks, due to the similarities they share, such as text summarization (Mani, 2001; Bhandari et al., 2020), natural language understanding (Ruder et al., 2021), natural language generation (Gehrmann et al., 2021), as well as programming language (code) generation (Liguori et al., 2021).

Acknowledgments

We appreciate the comments from Derek F. Wong, editing help from Ying Shi (Angela), and the anonymous reviewers for their valuable reviews and feedback. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. The input of Alan Smeaton is part-funded by Science Foundation Ireland under grant number SFI/12/RC/2289 (Insight Centre).

References

J. Albrecht and R. Hwa. 2007. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the ACL, Prague, Czech Republic.
Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135–187.
D. Arnold. 2003. Computers and Translation: A Translator's Guide, Chapter 8: Why translation is difficult for computers. Benjamins Translation Library.
Eleftherios Avramidis, Maja Popovic, David Vilar, and Aljoscha Burchardt. 2011. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of WMT 2011.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL 2005.
Srinivas Bangalore, Owen Rambow, and Steven Whittaker. 2000. Evaluation metrics for generation. In Proceedings of INLG 2000.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of NAACL 2003.
Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics.
Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 workshop on statistical machine translation. In Proceedings of WMT 2013.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 199–231, Berlin, Germany. Association for Computational Linguistics.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Belgium, Brussels. Association for Computational Linguistics.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Christian Buck. 2012. Black box features for the WMT 2012 quality estimation shared task. In Proceedings of WMT 2012.
Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006a. Improved statistical machine translation using paraphrases. In Proceedings of HLT-NAACL 2006.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006b. Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL 2006, volume 2006, pages 249–256.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007a. (Meta-) evaluation of machine translation. In Proceedings of WMT 2007.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007b. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 64–71. Association for Computational Linguistics.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of WMT 2008.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 workshop on statistical machine translation. In Proceedings of the 4th WMT 2009.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaridan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of WMT 2010.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaridan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of WMT 2011.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of WMT 2012.
John B. Carroll. 1966. An experiment in evaluating the quality of translation. Mechanical Translation and Computational Linguistics, 9(3-4):67–75.
Julio Castillo and Paula Estrella. 2012. Semantic textual similarity for MT evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 52–58, Montréal, Canada. Association for Computational Linguistics.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan. Association for Computational Linguistics.
KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.
Kenneth Church and Eduard Hovy. 1991. Good applications for crummy machine translation. In Proceedings of the Natural Language Processing Systems Evaluation Workshop.
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
Elisabet Comelles, Jordi Atserias, Victoria Arranz, and Irene Castellón. 2012. VERTa: Linguistic features in MT evaluation. In LREC, pages 3944–3950.
Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. In Learning Methods for Text Understanding and Mining workshop.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. Machine Learning Challenges, LNCS, 3944:177–190.
Daniel Dahlmeier, Chang Liu, and Hwee Tou Ng. 2011. TESLA at WMT2011: Translation evaluation and tunable metric. In Proceedings of WMT 2011.
George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In HLT Proceedings.
Bonnie Dorr, Matt Snover, Nitin Madnani, et al. 2009. Part 5: Machine translation evaluation. In Bonnie Dorr, editor, DARPA GALE program report.
Jennifer B. Doyon, John S. White, and Kathryn B. Taylor. 1999. Task-based evaluation for machine translation. In Proceedings of MT Summit 7.
H. Echizen-ya and K. Araki. 2010. Automatic evaluation method for machine translation using noun-phrase chunking. In Proceedings of the ACL 2010.
Matthias Eck and Chiori Hori. 2005. Overview of the IWSLT 2005 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
Project EuroMatrix. 2007. 1.3: Survey of machine translation evaluation. In EuroMatrix Project Report, Statistical and Hybrid MT between All European Languages, co-ordinator: Prof. Hans Uszkoreit.
Marcello Federico, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2011. Overview of the IWSLT 2011 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555.
Erick Fonseca, Lisa Yankovskaya, André F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.

Michael Gamon, Anthony Aue, and Martine Smets. 2005. Sentence-level MT evaluation without reference translations: beyond language modelling. In Proceedings of EAMT, pages 103–112.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv e-prints, page arXiv:2102.01672.
Jesús Giméne and Lluís Márquez. 2008. A smorgasbord of features for automatic MT evaluation. In Proceedings of WMT 2008, pages 195–198.
Jesús Giménez and Lluís Márquez. 2007. Linguistic features for automatic evaluation of heterogenous MT systems. In Proceedings of WMT 2007.
Yvette Graham, Timothy Baldwin, and Nitika Mathur. 2015. Accurate evaluation of segment-level machine translation metrics. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1183–1191.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, FirstView:1–28.
Yvette Graham, Barry Haddow, and Philipp Koehn. 2020. Statistical power and translationese in machine translation evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 72–81, Online. Association for Computational Linguistics.
Yvette Graham and Qun Liu. 2016. Achieving accurate conclusions in evaluation of automatic machine translation metrics. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1–10.
Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of SIGIR.
Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015a. Machine translation evaluation using recurrent neural networks. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 380–384, Lisbon, Portugal. Association for Computational Linguistics.
Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015b. ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066–1072. Association for Computational Linguistics.
Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. 2015. Pairwise neural machine translation evaluation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL'15), pages 805–814, Beijing, China. Association for Computational Linguistics.
Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. 2017. Machine translation evaluation with neural networks. Comput. Speech Lang., 45(C):180–200.
Anders Hald. 1998. A History of Mathematical Statistics from 1750 to 1930. Wiley-Interscience, 1st edition. ISBN-10: 0471179124.
Aaron L.-F. Han, Derek F. Wong, and Lidia S. Chao. 2013a. Chinese named entity recognition with conditional random fields in the light of Chinese characteristics. In Language Processing and Intelligent Information Systems, pages 57–68. Springer.
Lifeng Han. 2014. LEPOR: An Augmented Machine Translation Evaluation Metric. University of Macau, Macao.
Lifeng Han. 2016. Machine translation evaluation resources and methods: A survey. arXiv e-prints, page arXiv:1605.04515.
Lifeng Han, Gareth Jones, and Alan Smeaton. 2020a. AlphaMWE: Construction of multilingual parallel corpora with MWE annotations. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 44–57, online. Association for Computational Linguistics.
Lifeng Han, Gareth Jones, and Alan Smeaton. 2020b. MultiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France. European Language Resources Association.
Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton, and Paolo Bolzoni. 2021. Chinese character decomposition for neural MT with multi-word expressions. arXiv e-prints, page arXiv:2104.04497.
Lifeng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, and Xiaodong Zeng. 2013b. Language-independent model for machine translation evaluation with reinforced factors. In Machine Translation Summit XIV, pages 215–222. International Association for Machine Translation.
Lifeng Han, Derek Fai Wong, and Lidia Sam Chao. 2012. A robust evaluation metric for machine translation with augmented factors. In Proceedings of COLING.
Lifeng Han, Derek Fai Wong, Lidia Sam Chao, Liangye He, Shuo Li, and Ling Zhu. 2013c. Phrase tagset mapping for French and English treebanks and its application in machine translation evaluation. In International Conference of the German Society for Computational Linguistics and Language Technology, LNAI Vol. 8105, pages 119–131.
Lifeng Han, Derek Fai Wong, Lidia Sam Chao, Liangye He, and Yi Lu. 2014. Unsupervised quality estimation model for English to German translation and its application in extensive supervised evaluation. The Scientific World Journal, Issue: Recent Advances in Information Technology, pages 1–12.
Lifeng Han, Derek Fai Wong, Lidia Sam Chao, Yi Lu, Liangye He, Yiming Wang, and Jiaji Zhou. 2013d. A description of tunable machine translation evaluation systems in WMT13 metrics task. In Proceedings of WMT 2013, pages 414–421.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.
Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30:81–93.
Maurice G. Kendall and Jean Dickinson Gibbons. 1990. Rank Correlation Methods. Oxford University Press, New York.
Margaret King, Andrei Popescu-Belis, and Eduard Hovy. 2003. FEMTI: Creating and using a framework for MT evaluation. In Proceedings of the Machine Translation Summit IX.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics.
Philipp Koehn and Christof Monz. 2006a. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation, pages 102–121, New York City. Association for Computational Linguistics.
Philipp Koehn and Christof Monz. 2006b. Manual and automatic evaluation of machine translation between European languages. In Proceedings of WMT 2006.
121, New York City. Association for Computational Lifeng Han, Derek Fai Wong, and Lidia Sam Chao. Linguistics. 2012. A robust evaluation metric for machine trans- Philipp Koehn and Christof Monz. 2006b. Manual and lation with augmented factors. In Proceedings of automatic evaluation of machine translation between COLING. european languages. In Proceedings of WMT 2006. Lifeng Han, Derek Fai Wong, Lidia Sam Chao, Liang- Guillaume Lample and Alexis Conneau. 2019. Cross- eye He, Shuo Li, and Ling Zhu. 2013c. Phrase tagset lingual language model pretraining. CoRR, mapping for french and english and its ap- abs/1901.07291. plication in machine translation evaluation. In In- ternational Conference of the German Society for J. Richard Landis and Gary G. Koch. 1977. The mea- Computational Linguistics and Language Technol- surement of observer agreement for categorical data. ogy, LNAI Vol. 8105, pages 119–131. Biometrics, 33(1):159–174. Jamal Laoudi, Ra R. Tate, and Clare R. Voss. 2006. Chi Kiu Lo and Dekai Wu. 2011b. Structured vs. flat Task-based mt evaluation: From who/when/where semantic role representations for machine transla- extraction to event understanding. In in Proceedings tion evaluation. In Proceedings of the 5th Work- of LREC-06, pages 2048–2053. shop on Syntax and Structure in StatisticalTransla- tion (SSST-5). Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceed- Samuel Läubli, Sheila Castilho, Graham Neubig, ings of ACL 2003. Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing hu- Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. man–machine parity in language translation. Jour- Has machine translation achieved human parity? a nal of Artificial Intelligence Research, 67. case for document-level evaluation. In Proceed- ings of the 2018 Conference on Empirical Methods Qingsong Ma, Fandong Meng, Daqi Zheng, Mingxuan in Natural Language Processing, pages 4791–4796, Wang, Yvette Graham, Wenbin Jiang, and Qun Liu. Brussels, Belgium. Association for Computational 2016. Maxsd: A neural machine translation evalua- Linguistics. tion metric optimized by maximizing similarity dis- tance. In Natural Language Understanding and In- Alon Lavie. 2013. Automated metrics for mt evalua- telligent Applications - 5th CCF Conference on Nat- tion. Machine Translation, 11:731. ural Language Processing and Chinese Computing, NLPCC 2016, and 24th International Conference Guy Lebanon and John Lafferty. 2002. Combining on Computer Processing of Oriental Languages, IC- rankings using conditional probability models on CPOL 2016, Kunming, China, December 2-6, 2016, permutations. In Proceeding of the ICML. Proceedings, pages 153–161.

Gregor Leusch and Hermann Ney. 2009. Edit distances with block movements and error rate confidence estimates. Machine Translation, 23(2-3).
A. LI. 2005. Results of the 2005 NIST machine translation evaluation. In Proceedings of WMT 2005.
Liang You Li, Zheng Xian Gong, and Guo Dong Zhou. 2012. Phrase-based evaluation for machine translation. In Proceedings of COLING, pages 663–672.
Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, and Samira Shaikh. 2021. Shellcode_IA32: A Dataset for Automatic Shellcode Generation. arXiv e-prints, page arXiv:2104.13100.
Chin-Yew Lin and E. H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of NAACL 2003.
Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL 2004.
Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
Chi Kiu Lo, Anand Karthik Turmuluru, and Dekai Wu. 2012. Fully automatic semantic MT evaluation. In Proceedings of WMT 2012.
Chi Kiu Lo and Dekai Wu. 2011a. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of ACL 2011.
Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.
I. Mani. 2001. Summarization evaluation: An overview. In NTCIR.
Elaine Marsh and Dennis Perzanowski. 1998. MUC-7 evaluation of IE technology: Overview of results. In Proceedings of the Message Understanding Conference (MUC-7).
Kathleen R. McKeown. 1979. Paraphrasing using given and new information in a question-answer system. In Proceedings of ACL 1979.
Marie Meteer and Varda Shaked. 1988. Microsoft research treelet translation system: NAACL 2006 Europarl evaluation. In Proceedings of COLING.
G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244.
Douglas C. Montgomery and George C. Runger. 2003. Applied Statistics and Probability for Engineers, third edition. John Wiley and Sons, New York.
Erwan Moreau and Carl Vogel. 2013. Weakly supervised approaches for quality estimation. Machine Translation, 27(3–4):257–280.
Erwan Moreau and Carl Vogel. 2014. Limitations of MT quality estimation supervised systems: The tails prediction problem. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2205–2216, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002.
Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Graham Neubig, and Melvin Johnson. 2021. XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation. arXiv e-prints, page arXiv:2104.07412.
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing, pages 1–15, Berlin, Heidelberg. Springer Berlin Heidelberg.

M. Paul. 2009. Overview of the IWSLT 2009 evaluation campaign. In Proceedings of IWSLT.
Michael Paul, Marcello Federico, and Sebastian Stüker. 2010. Overview of the IWSLT 2010 evaluation campaign. In Proceedings of IWSLT.
Bahar Salehi, Nitika Mathur, Paul Cook, and Timothy Baldwin. 2015. The impact of multiword expression compositionality on machine translation evaluation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 54–59, Denver, Colorado. Association for Computational Linguistics.

Karl Pearson. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50(5):157–175.
Maja Popović, David Vilar, Eleftherios Avramidis, and Aljoscha Burchardt. 2011. Evaluation without references: IBM1 scores as evaluation metrics. In Proceedings of WMT 2011.
M. Popović and Hermann Ney. 2007. Word error rates: Decomposition over POS classes and applications for error analysis. In Proceedings of WMT 2007.
Maja Popović. 2020a. Informative manual evaluation of machine translation output. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5059–5069, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Maja Popović. 2020b. Relations between comprehensibility and adequacy errors in machine translation output. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 256–264, Online. Association for Computational Linguistics.
Claus Povlsen, Nancy Underwood, Bradley Music, and Anne Neville. 1998. Evaluating text-type suitability for machine translation: a case study on an English-Danish system. In Proceedings of LREC.
Sylvain Raybaud, David Langlois, and Kamel Smaïli. 2011. "This sentence is wrong." Detecting errors in machine-translated sentences. Machine Translation, 25(1):1–34.
Florence Reeder. 2004. Investigation of intelligibility judgments. In Machine Translation: From Real Users to Research, pages 227–235, Berlin, Heidelberg. Springer Berlin Heidelberg.
Matthew Snover, Bonnie J. Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.
L. Specia and J. Giménez. 2010. Combining confidence estimation and reference-based metrics for segment-level MT evaluation. In The Ninth Conference of the Association for Machine Translation in the Americas (AMTA).
Lucia Specia, Frédéric Blain, Marina Fomicheva, Erick Fonseca, Vishrav Chaudhary, Francisco Guzmán, and André F. T. Martins. 2020. Findings of the WMT 2020 shared task on quality estimation. In Proceedings of the Fifth Conference on Machine Translation, pages 743–764, Online. Association for Computational Linguistics.
Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón F. Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Brussels, Belgium. Association for Computational Linguistics.
Lucia Specia, Najeh Hajlaoui, Catalina Hallett, and Wilker Aziz. 2011. Predicting machine translation adequacy. In Machine Translation Summit XIII.
Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation.
Lucia Specia, Kashif Shah, Jose G. C. de Souza, and Trevor Cohn. 2013. QuEst - a translation quality estimation framework. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 79–84, Sofia, Bulgaria. Association for Computational Linguistics.
Keh-Yih Su, Wu Ming-Wen, and Chang Jing-Shin. 1992. A new quantitative quality measure for machine translation systems. In Proceedings of COLING.
Christoph Tillmann, Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf. 1997. Accelerated DP-based search for statistical translation. In Proceedings of EUROSPEECH.
Joseph P. Turian, Luke Shea, and I. Dan Melamed. 2006. Evaluation of machine translation and its evaluation. Technical report, DTIC Document.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems, pages 6000–6010.
Clare R. Voss and Ra R. Tate. 2006. Task-based evaluation of machine translation (MT) engines: Measuring how well people extract who, when, where-type elements in MT output. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT-2006), pages 203–212.
Warren Weaver. 1955. Translation. Machine Translation of Languages: Fourteen Essays.
John S. White, Theresa O'Connell, and Francis O'Mara. 1994. The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of AMTA.
John S. White and Kathryn B. Taylor. 1998. A task-oriented evaluation metric for machine translation. In Proceedings of LREC.
Billy Wong and Chunyu Kit. 2009. ATEC: automatic evaluation of machine translation via word choice and word order. Machine Translation, 23(2-3):141–155.
Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu, and Shouxun Lin. 2014. RED: A reference dependency based MT evaluation metric. In COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 2042–2051.
Jiajun Zhang and Chengqing Zong. 2015. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, (5):16–25.

Appendices

Appendix A: Evaluating TQA

A.1: Statistical Significance

If different MT systems produce translations of different quality on a dataset, how can we ensure that they genuinely differ in system quality? To explore this problem, Koehn (2004) presents an investigation of statistical significance testing for MT evaluation. The bootstrap re-sampling method is used to compute statistical significance intervals for evaluation metrics on small test sets. Statistical significance usually refers to two separate notions: one is the p-value, the probability that the observed data would occur by chance under a given null hypothesis; the other is the "Type I" error rate of a statistical hypothesis test, also called the "false positive" rate, measured by the probability of incorrectly rejecting a given null hypothesis in favour of a second, alternative hypothesis (Hald, 1998).

A.2: Evaluating Human Judgment

Since human judgments are usually trusted as the gold standard that automatic MT evaluation metrics should try to approach, the reliability and coherence of human judgments is very important. Cohen's kappa agreement coefficient is one of the most commonly used evaluation methods (Cohen, 1960). For the problem of nominal scale agreement between two judges, there are two relevant quantities, p_0 and p_c: p_0 is the proportion of units on which the judges agreed, and p_c is the proportion of units for which agreement is expected by chance. The coefficient k is simply the proportion of chance-expected disagreements which do not occur or, alternatively, the proportion of agreement remaining after chance agreement is removed from consideration:

k = \frac{p_0 - p_c}{1 - p_c}    (2)

where p_0 - p_c represents the proportion of cases in which beyond-chance agreement occurs and is the numerator of the coefficient (Landis and Koch, 1977).

A.3: Correlating Manual and Automatic Scores

In this section, we introduce three correlation coefficient algorithms that have been widely used at recent WMT workshops to measure the closeness of automatic evaluations and manual judgments. The choice of correlation algorithm depends on whether score or rank schemes are used.

Pearson Correlation

Pearson's correlation coefficient (Pearson, 1900) is commonly represented by the Greek letter ρ. The correlation ρ_XY between random variables X and Y is measured as follows (Montgomery and Runger, 2003):

\rho_{XY} = \frac{cov(X, Y)}{\sqrt{V(X) V(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}    (3)

Because the standard deviations of X and Y are greater than zero (σ_X > 0 and σ_Y > 0), the correlation is positive, negative, or zero exactly when the covariance σ_XY is positive, negative, or zero, respectively. Based on a sample of paired data (x_i, y_i), i = 1, ..., n, the Pearson correlation coefficient is calculated as:

\rho_{XY} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2} \sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}}    (4)

where μ_x and μ_y are the means of the discrete random variables X and Y, respectively.

Spearman Rank Correlation

The Spearman rank correlation coefficient, a simplified version of the Pearson correlation coefficient, is another algorithm used to measure the correlation between automatic evaluations and manual judgments, e.g. in the WMT metrics tasks (Callison-Burch et al., 2008, 2009, 2010, 2011). When there are no ties, the Spearman rank correlation coefficient, sometimes written r_s, is calculated as:

r_{s_{\phi}(XY)} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}    (5)

where d_i = x_i - y_i is the difference between the two corresponding rank variables in \vec{X} = \{x_1, x_2, ..., x_n\} and \vec{Y} = \{y_1, y_2, ..., y_n\} describing the system ϕ.

Kendall's τ

Kendall's τ (Kendall, 1938) has been used in recent years for the correlation between the automatic order and the reference order (Callison-Burch et al., 2010, 2011, 2012). It is defined as:

\tau = \frac{\text{num concordant pairs} - \text{num discordant pairs}}{\text{total pairs}}    (6)

The latest version of Kendall's τ is introduced in Kendall and Gibbons (1990). Lebanon and Lafferty (2002) give an overview of Kendall's τ, showing its application in calculating how much system orders differ from the reference order. More concretely, Lapata (2003) proposed the use of Kendall's τ, a measure of rank correlation, to estimate the distance between a system-generated and a human-generated gold-standard order.
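As a worked illustration of Equations (2)–(6), the Python sketch below computes Cohen's kappa for two judges and the Pearson, Spearman (no ties, as in Equation (5)), and Kendall correlations between automatic metric scores and human judgments; the helper names and toy score lists are invented for illustration only.

```python
from math import sqrt

def cohen_kappa(judge1, judge2):
    """Cohen's kappa for two annotators over nominal labels (Eq. 2)."""
    n = len(judge1)
    p0 = sum(a == b for a, b in zip(judge1, judge2)) / n
    labels = set(judge1) | set(judge2)
    # Chance agreement: product of the two judges' label proportions.
    pc = sum((judge1.count(l) / n) * (judge2.count(l) / n) for l in labels)
    return (p0 - pc) / (1 - pc)

def pearson(x, y):
    """Sample Pearson correlation coefficient (Eq. 4)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x)) * sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

def ranks(x):
    """1-based ranks, assuming no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation without ties (Eq. 5)."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pairs (Eq. 6)."""
    concordant = discordant = total = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / total

# Toy example: metric scores vs. human adequacy scores for five systems.
metric = [0.31, 0.42, 0.38, 0.55, 0.47]
human = [3.1, 3.8, 3.5, 4.4, 4.0]
print(pearson(metric, human), spearman(metric, human), kendall_tau(metric, human))
print(cohen_kappa(["ok", "bad", "ok", "ok"], ["ok", "bad", "bad", "ok"]))
```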
A.4: Metrics Comparison

Several researchers have compared different types of metrics. For example, Callison-Burch et al. (2006b, 2007b) and Lavie (2013) showed, through qualitative analysis on standard data sets, that BLEU cannot reflect MT system performance well in many situations, i.e. a higher BLEU score does not guarantee better translation output. Some recently developed metrics perform much better than the traditional ones, especially on challenging sentence-level evaluation, although they are not yet widely adopted, such as nLEPOR and SentBLEU-Moses (Graham et al., 2015; Graham and Liu, 2016). Such comparisons help MT researchers select the appropriate metrics for specialist tasks.

Appendix B: MT QE

In recent years, MT evaluation methods that do not use manually created gold reference translations have been proposed; these are referred to as "Quality Estimation (QE)". Some of the related work has already been introduced in previous sections. The most recent quality estimation tasks can be found at WMT12 to WMT20 (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015; Specia et al., 2018; Fonseca et al., 2019; Specia et al., 2020). These shared tasks defined a novel evaluation metric, DeltaAvg, that provides some advantages over the traditional ranking metrics. DeltaAvg assumes that the reference test set has a number associated with each entry that represents its extrinsic value. Given these values, the metric does not need an explicit reference ranking, the way that the Spearman ranking correlation does. The goal of DeltaAvg is to measure how valuable a proposed ranking is according to the extrinsic values associated with the test entries:

DeltaAvg_V[n] = \frac{\sum_{k=1}^{n-1} V(S_{1,k})}{n - 1} - V(S)    (7)
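The sketch below shows one possible reading of Equation (7), assuming, as in the WMT QE shared-task descriptions, that the proposed ranking is split into n quantiles, that S_{1,k} is the union of the top k quantiles, and that V(·) averages the extrinsic values of a set; the function name delta_avg_n and the toy values are illustrative assumptions, not a reference implementation.

```python
def delta_avg_n(values_in_proposed_order, n):
    """DeltaAvg_V[n] for one quantile count n (a reading of Eq. 7).

    values_in_proposed_order: extrinsic values V(s_i) of the test entries,
    listed from best to worst according to the *proposed* ranking.
    """
    size = len(values_in_proposed_order)
    avg = lambda xs: sum(xs) / len(xs)
    v_all = avg(values_in_proposed_order)   # V(S)
    quantile = size / n                     # entries per quantile
    head_avgs = []
    for k in range(1, n):                   # k = 1 .. n-1
        cutoff = round(k * quantile)        # S_{1,k}: the top k quantiles
        head_avgs.append(avg(values_in_proposed_order[:cutoff]))
    return sum(head_avgs) / (n - 1) - v_all

# Toy ranking of eight entries: a good ranking puts high-value entries first.
good = [0.9, 0.8, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2]
poor = list(reversed(good))
print(delta_avg_n(good, 4))   # positive: top quantiles are above average
print(delta_avg_n(poor, 4))   # negative: top quantiles are below average
```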

For scoring, two evaluation metrics were used that have traditionally been used for measuring performance in regression tasks: Mean Absolute Error (MAE) as a primary metric, and Root Mean Squared Error (RMSE) as a secondary metric. For a given test set S with entries s_i, 1 ≤ i ≤ |S|, H(s_i) is the proposed score for entry s_i (hypothesis), and V(s_i) is the reference value for entry s_i (gold-standard value):

MAE = \frac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N}    (8)

RMSE = \sqrt{\frac{\sum_{i=1}^{N} (H(s_i) - V(s_i))^2}{N}}    (9)

where N = |S|. Both of these metrics are non-parametric, automatic and deterministic (and therefore consistent), and extrinsically interpretable.

Some further readings on MT QE are the comparison between MT evaluation and QE (Specia et al., 2010) and the QE framework QuEst (Specia et al., 2013); the weakly supervised approaches to quality estimation and the analysis of the limitations of supervised QE systems (Moreau and Vogel, 2013, 2014); unsupervised QE models (Fomicheva et al., 2020); and the recent shared tasks on QE (Fonseca et al., 2019; Specia et al., 2020).

In very recent years, the two strands of shared task, i.e. MT quality estimation and traditional MT evaluation metrics, have started to integrate with each other and to benefit from both bodies of knowledge. For instance, in the WMT2019 shared task, ten reference-less evaluation metrics were also used for the QE task "QE as a Metric" (Ma et al., 2019).

Appendix C: Mathematical Formulas

Some mathematical formulas related to the aforementioned metrics are listed below.

Section 2.1.2 - Fluency / Adequacy / Comprehension:

Comprehension = \frac{\#Correct}{6}    (10)

Fluency = \frac{(\text{Judgment point} - 1)/(S - 1)}{\#\text{Sentences in passage}}    (11)

Adequacy = \frac{(\text{Judgment point} - 1)/(S - 1)}{\#\text{Fragments in passage}}    (12)

Section 3.1.1 - Editing Distance:

WER = \frac{\text{substitutions} + \text{insertions} + \text{deletions}}{\text{reference length}}    (13)

PER = 1 - \frac{\text{correct} - \max(0, \text{output length} - \text{reference length})}{\text{reference length}}    (14)

TER = \frac{\#\text{ of edits}}{\#\text{ of average reference words}}    (15)

Section 3.1.2 - Precision and Recall:

BLEU = BP \times \exp\left(\sum_{n=1}^{N} \lambda_n \log Precision_n\right)    (16)

BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}    (17)

where c is the total length of the candidate translation, and r refers to the sum of the effective reference sentence lengths in the corpus.
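As an illustration of Equations (16)–(17), here is a minimal Python sketch of a corpus-level BLEU-style score with uniform weights λ_n = 1/N, a single reference per segment, and no smoothing; it is a simplified reading of the formula under those assumptions, not the official BLEU implementation.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU-style score (Eqs. 16-17), uniform weights, one reference."""
    c = sum(len(cand) for cand in candidates)     # total candidate length
    r = sum(len(ref) for ref in references)       # effective reference length
    bp = 1.0 if c > r else exp(1 - r / c)         # brevity penalty (Eq. 17)
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
            # Clipped n-gram matches against the reference.
            matched += sum(min(cnt, ref_ngrams[g]) for g, cnt in cand_ngrams.items())
            total += max(sum(cand_ngrams.values()), 1)
        precision_n = matched / total if matched else 1e-9   # avoid log(0)
        log_prec_sum += (1.0 / max_n) * log(precision_n)     # lambda_n = 1/N
    return bp * exp(log_prec_sum)

hyp = [["the", "cat", "sat", "on", "the", "mat"]]
ref = [["the", "cat", "is", "on", "the", "mat"]]
print(round(bleu(hyp, ref, max_n=2), 3))  # unigram/bigram precisions only
```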

Below are the information weight used by the NIST metric, then the F-measure, METEOR, and the LEPOR family:

Info = \log_2\left(\frac{\#\text{ occurrences of } w_1, \ldots, w_{n-1}}{\#\text{ occurrences of } w_1, \ldots, w_n}\right)    (18)

F_\beta = (1 + \beta^2) \frac{PR}{R + \beta^2 P}    (19)

Penalty = 0.5 \times \left(\frac{\#\text{chunks}}{\#\text{matched unigrams}}\right)^3    (20)

METEOR = \frac{10PR}{R + 9P} \times (1 - Penalty)    (21)

LEPOR = LP \times NPosPenal \times Harmonic(\alpha R, \beta P)    (22)

hLEPOR = Harmonic(w_{LP} LP, w_{NPosPenal} NPosPenal, w_{HPR} HPR)

nLEPOR = LP \times NPosPenal \times \exp\left(\sum_{n=1}^{N} w_n \log HPR\right)

where, in our own metric LEPOR and its variations, nLEPOR (n-gram precision and recall LEPOR) and hLEPOR (harmonic LEPOR), P and R stand for precision and recall, LP for the length penalty, NPosPenal for the n-gram position difference penalty, and HPR for the harmonic mean of precision and recall, respectively (Han et al., 2012, 2013b; Han, 2014; Han et al., 2014).
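To make Equations (19)–(22) concrete, the sketch below composes the scores from already-computed components (precision, recall, chunk counts, and the LEPOR sub-scores LP and NPosPenal); the component values are invented placeholders, and Harmonic(αR, βP) is implemented here as a weighted harmonic mean following the description above, not as any particular released LEPOR implementation.

```python
def f_beta(p, r, beta=1.0):
    """F-measure of precision p and recall r (Eq. 19)."""
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

def meteor_style(p, r, chunks, matched_unigrams):
    """METEOR-style score with fragmentation penalty (Eqs. 20-21)."""
    penalty = 0.5 * (chunks / matched_unigrams) ** 3
    return (10 * p * r) / (r + 9 * p) * (1 - penalty)

def weighted_harmonic(weighted):
    """Weighted harmonic mean of (weight, value) pairs."""
    return sum(w for w, _ in weighted) / sum(w / v for w, v in weighted)

def lepor(lp, npos_penal, p, r, alpha=1.0, beta=1.0):
    """LEPOR composition (Eq. 22): LP x NPosPenal x Harmonic(alpha*R, beta*P)."""
    return lp * npos_penal * weighted_harmonic([(alpha, r), (beta, p)])

# Placeholder component values for one hypothesis-reference pair.
print(f_beta(0.7, 0.6, beta=1.0))
print(meteor_style(0.7, 0.6, chunks=3, matched_unigrams=10))
print(lepor(lp=0.95, npos_penal=0.9, p=0.7, r=0.6))
```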