arXiv:2108.09484v1 [cs.CL] 21 Aug 2021

cushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements

Lifeng Han1, Irina Sorokina2, Gleb Erofeev2, and Serge Gladkoff2
1 ADAPT Research Centre, DCU, Ireland
2 Logrus Global, Translation & Localization
gleberof, irina.sorokina, [email protected]

arXiv:2108.09484v2 [cs.CL] 30 Aug 2021

Abstract

Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited human-labelled scores that are available. We first re-introduce the hLEPOR metric factors, followed by the Python portable version we developed, which achieved automatic tuning of the weighting parameters in the hLEPOR metric. Then we present the customised hLEPOR (cushLEPOR), which uses the LABSE distilled knowledge model to improve the metric's agreement with human judgements by automatically optimising the factor weights for the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards human evaluation data based on the MQM and pSQM frameworks on the English-German and Chinese-English language pairs. The experimental investigations show that cushLEPOR boosts hLEPOR performance towards better agreement with PLMs like LABSE at much lower cost, and better agreement with human evaluations including MQM and pSQM scores, and that it yields much better performance than BLEU (data available at https://github.com/poethan/cushLEPOR).

1 Introduction

Machine Translation (MT) is a rapidly developing research field that plays an important role in the NLP area. MT started in the 1950s as one of the earliest artificial intelligence (AI) research topics and gained a large improvement in output quality for large-resourced language pairs after the introduction of Neural MT (NMT) in recent years (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Bahdanau et al., 2014). However, achieving human-parity MT output remains a challenge (Han et al., 2021a). Thus MT evaluation (MTE) continues to play an important role in aiding MT development, both by providing timely and high-quality evaluations and by reflecting the translation errors that MT systems can take advantage of for further improvement (Han et al., 2021b). On the one hand, human evaluations have long been criticised as expensive and unrepeatable. Furthermore, the inter- and intra-rater agreement levels of human raters always struggle to reach a relatively high and reliable score. On the other hand, automatic evaluation metrics have reached high performance in system-level evaluations of MT systems (Freitag et al., 2021). However, segment-level performance still falls far short of human experts' expectations.

In the meantime, many pre-trained language models have been proposed and developed in very recent years, showing big advantages in different NLP tasks, for instance BERT (Devlin et al., 2019) and its further developed variants (Feng et al., 2020). In this work, we take advantage of both a high-performing automatic metric and a pre-trained language model, aiming to move one step further towards a higher-quality automatic MT evaluation metric from both the system-level and segment-level perspectives.

Among the evaluation metrics developed in recent years, hLEPOR (Han et al., 2013b,a) belongs to the comprehensive category, which includes many evaluation factors: precision, recall, word order (via a position difference factor), and sentence length. It has also been applied by many researchers from different NLP fields, including natural language generation (NLG) (Novikova et al., 2017; Gehrmann et al., 2021), natural language understanding (NLU) (Ruder et al., 2021), automatic text summarization (ATS) (Bhandari et al., 2020), and searching (Liu et al., 2021), in addition to MT evaluation (Marzouk, 2021).

However, hLEPOR has some disadvantages, including the manual tuning of its parameter weights, which costs a lot of human effort. We choose hLEPOR (Han et al., 2013b,a) as our baseline model, and use the very recent language model LABSE (Feng et al., 2020) to achieve automatic tuning of its parameters, thus aiming at reducing the evaluation cost and further boosting the performance. This system description paper is based on our earlier work, especially the training models (Erofeev et al., 2021).

The rest of the paper is organised as follows: Section 2 revisits the hLEPOR metric, its factors, advantages and disadvantages; Section 3 introduces our Python ported version of hLEPOR and the further customised hLEPOR (cushLEPOR) using language models; Section 4 presents the experimental development and evaluation that we carried out on the cushLEPOR metric using WMT historical data; Section 5 reserves space for our submission to this year's WMT21 metrics task; and Section 6 finishes this paper with discussions of our findings and possible future work.

2 Revisiting hLEPOR

hLEPOR is a further developed variant of the LEPOR metric (Han et al., 2012). It was first proposed in 2013 and includes all evaluation factors from LEPOR, but uses the harmonic mean to group the factors into the final score (Han et al., 2013b). Its submission to the WMT2013 metrics task achieved the highest system-level average correlation with human judgement on the English-to-other (French, Spanish, Russian, German, Czech) language pairs by Pearson correlation coefficient (0.854) (Han, 2014; Macháček and Bojar, 2013). Other MT researchers also found a LEPOR metric variant to be one of the best-performing segment-level metrics, not significantly outperformed by other metrics on WMT shared task data (Graham et al., 2015). hLEPOR is calculated by:

\[ \mathrm{hLEPOR} = \mathrm{Harmonic}(w_{LP}\,LP,\; w_{NPosPenal}\,NPosPenal,\; w_{HPR}\,HPR) \]

where LP is a sentence length penalty factor extended from the brevity penalty used in the BLEU metric, NPosPenal is the n-gram position difference penalty, which captures word order information, and HPR is the harmonic mean of the Precision and Recall values. We refer to (Han, 2014) for the detailed factor calculation with examples.

The basic version of hLEPOR carries out similarity calculation between MT system outputs and reference translations, in the same language setting, based on word surface level tokens. The hybrid hLEPOR metric also carries out similarity calculation based on POS sequences from the system output and the reference text. To do this, POS tagging is needed as the first step; the hLEPOR(POS) calculation then uses the same algorithms as the word level similarity score hLEPOR(word). Finally, hybrid hLEPOR is a combination of the word level and POS level scores. In this system submission, given the time limitations and to make the customised hLEPOR easier to use, we take the basic version of hLEPOR, i.e. the word level similarity calculation, and leave the hybrid hLEPOR to future work.

The weighting parameters for the three main factors in the original hLEPOR metric, i.e. the $(w_{LP}, w_{NPosPenal}, w_{HPR})$ set, in addition to the other parameters inside each factor, were tuned manually on development data. This is very time-consuming, tedious, and costly. In this work, we introduce an automated tuning model for hLEPOR that customises it for the deployed language pairs, which we name cushLEPOR.

3 Proposed Model

3.1 Python portable hLEPOR

The original hLEPOR was published as Perl code (https://github.com/poethan/LEPOR), in a non-portable format, which is not very suitable for modern AI/NLP applications, since they use Python almost exclusively. Python is a programming language of choice for AI/ML tasks, thanks to its ecosystem of open source or simply free libraries available to researchers and developers. However, hLEPOR was not available in NLTK (Bird et al., 2009) or any other public Python library. We therefore took the original published Perl code and ported it to Python, carefully comparing the logic of the original paper and the Perl implementation. During this work we ran both the Perl code, to reproduce the results of the original code, and the new Python implementation. This work helped us to spot at least three minor errors, which did not significantly affect the score; we nevertheless fixed these bugs in the Perl code.

While doing the porting we also noticed that the hLEPOR parameter values were chosen empirically and never explained in detail, except for the suggested parameter setting table in the paper (Han et al., 2013b,a) for the eight language pairs that were tested for the WMT2013 shared task, including EN-CZ/DE/FR/ES and the opposite direction. The parameters were:

• alpha: the tunable weight for recall

• beta: the tunable weight for precision

• n: words count before and after matched word in npd calculation

• weight_elp: tunable weight of enhanced length penalty

• weight_pos: tunable weight of n-gram position difference penalty

• weight_pr: tunable weight of harmonic mean of precision and recall

The parameter values for hLEPOR as published in the publicly available Perl code were (from CZ-

3.2 cushLEPOR: customised hLEPOR

With the recent development of pre-trained neural language models and their effective applications in different NLP tasks, including question answering, language inference and MT, it is natural to ask why we do not apply them to MT evaluation as well.

Very recent work from a Google team verified that MQM (Multidimensional Quality Metrics) (Lommel et al., 2014) and SQM (Scalar Quality Metrics) (Freitag et al., 2021) have good agreement with each other when both are carried out by professional translators. However, these do not correlate with Mechanical Turk based crowd-sourced human evaluation carried out by general researchers or untrained online workers. The work also showed that crowd-sourced evaluation tends to favour very literal translations over better translations with more diverse, meaning-equivalent lexical choices.
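
To make the role of the Section 3.1 parameters concrete, the following is a minimal sketch of how the three factor weights could enter the Section 2 formula. It assumes that Harmonic(...) denotes the usual weighted harmonic mean, i.e. the sum of the weights divided by the sum of weight/factor ratios; the text above does not spell this form out. The names HLeporWeights, weighted_harmonic_mean, hpr and hlepor are hypothetical and are not taken from the published Perl or Python code, and the default weights of 1.0 are placeholders, not the published settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLeporWeights:
    """Hypothetical container for the three factor weights of Section 3.1."""
    weight_elp: float = 1.0  # weight of the (enhanced) length penalty LP
    weight_pos: float = 1.0  # weight of the n-gram position difference penalty
    weight_pr: float = 1.0   # weight of the harmonic mean of precision and recall


def weighted_harmonic_mean(values, weights):
    """Return sum(w) / sum(w / v), assumed here to be what Harmonic(...) means.

    Returns 0.0 if any value is zero, since the harmonic mean then tends to zero.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    if any(v == 0 for v in values):
        return 0.0
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


def hpr(precision, recall, alpha, beta):
    """HPR factor: alpha weights recall and beta weights precision (Section 3.1)."""
    return weighted_harmonic_mean([recall, precision], [alpha, beta])


def hlepor(lp, npos_penal, hpr_value, w):
    """Combine the three factor scores (each assumed to lie in [0, 1])."""
    return weighted_harmonic_mean(
        [lp, npos_penal, hpr_value],
        [w.weight_elp, w.weight_pos, w.weight_pr],
    )
```

For example, `hlepor(0.9, 0.8, hpr(0.7, 0.6, alpha=1.0, beta=1.0), HLeporWeights())` combines illustrative factor values with equal weights. Under this reading, tuning the metric for a language pair amounts to searching over the fields of HLeporWeights together with alpha, beta and n.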