arXiv:2108.09484v1 [cs.CL] 21 Aug 2021
cushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements

Lifeng Han¹, Irina Sorokina², Gleb Erofeev², and Serge Gladkoff²
¹ ADAPT Research Centre, DCU, Ireland
² Logrus Global, Translation & Localization
gleberof, irina.sorokina, [email protected]

arXiv:2108.09484v2 [cs.CL] 30 Aug 2021

Abstract

Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited available human-labelled scores. We first re-introduce the hLEPOR metric factors, followed by the Python portable version we developed, which achieved automatic tuning of the weighting parameters in the hLEPOR metric. Then we present the customised hLEPOR (cushLEPOR), which uses the LABSE distilled knowledge model to improve the metric's agreement with human judgements by automatically optimising the factor weights for the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards human evaluation data based on the MQM and pSQM frameworks on English-German and Chinese-English language pairs. The experimental investigations show that cushLEPOR boosts hLEPOR performance towards better agreement with PLMs like LABSE at much lower cost, and better agreement with human evaluations including MQM and pSQM scores, and yields much better performance than BLEU (data available at https://github.com/poethan/cushLEPOR).

1 Introduction

Machine Translation (MT) is a rapidly developing research field that plays an important role in the NLP area. MT started in the 1950s as one of the earliest artificial intelligence (AI) research topics and gained a large improvement in output quality for well-resourced language pairs after the introduction of Neural MT (NMT) in recent years (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Bahdanau et al., 2014). However, achieving human-parity MT output remains a challenge (Han et al., 2021a). Thus MT evaluation (MTE) continues to play an important role in aiding MT development, both by providing timely and high-quality evaluations and by reflecting the translation errors that MT systems can take advantage of for further improvement (Han et al., 2021b). On the one hand, human evaluations have long been criticised as expensive and unrepeatable. Furthermore, the inter- and intra-agreement levels of human raters always struggle to reach a relatively high and reliable score. On the other hand, automatic evaluation metrics have reached high performance in system-level evaluation of MT systems (Freitag et al., 2021). However, their segment-level performance still falls well short of human experts' expectations.

In the meantime, many pre-trained language models have been proposed and developed in very recent years, showing big advantages in different NLP tasks, for instance BERT (Devlin et al., 2019) and its further developed variants (Feng et al., 2020). In this work, we take advantage of both a high-performing automatic metric and a pre-trained language model, aiming to move one step further towards a higher-quality automatic MT evaluation metric from both system-level and segment-level perspectives.

Among the evaluation metrics developed in recent years, hLEPOR (Han et al., 2013b,a) belongs to the comprehensive category that includes many evaluation factors: precision, recall, word order (via a position difference factor), and sentence length. It has also been applied by many researchers from different NLP fields, including natural language generation (NLG) (Novikova et al., 2017; Gehrmann et al., 2021), natural language understanding (NLU) (Ruder et al., 2021), automatic text summarization (ATS) (Bhandari et al., 2020), and searching (Liu et al., 2021), in addition to MT evaluation (Marzouk, 2021).

However, hLEPOR has some disadvantages, including the manual tuning of its parameter weights, which costs a lot of human effort. We choose hLEPOR (Han et al., 2013b,a) as our baseline model, and use the very recent language model LABSE (Feng et al., 2020) to achieve automatic tuning of its parameters, thus aiming at reducing the evaluation cost and further boosting the performance. This system description paper is based on our earlier work, especially the training models (Erofeev et al., 2021).

The rest of the paper is organised as follows: Section 2 revisits the hLEPOR metric, its factors, advantages and disadvantages; Section 3 introduces our Python ported version of hLEPOR and the further customised hLEPOR (cushLEPOR) using language models; Section 4 presents the experimental development and evaluation that we carried out on the cushLEPOR metric using WMT historical data; Section 5 reserves space for our submission to this year's WMT21 metrics task; and Section 6 finishes the paper with discussions of our findings and possible future work.

2 Revisiting hLEPOR

hLEPOR is a further developed variant of the LEPOR metric (Han et al., 2012). It was first proposed in 2013 and includes all evaluation factors from LEPOR, but uses the harmonic mean to group the factors into the final score (Han et al., 2013b). Its submission to the WMT2013 metrics task achieved the highest average system-level correlation with human judgement on English-to-other (French, Spanish, Russian, German, Czech) language pairs by Pearson correlation coefficient (0.854) (Han, 2014; Macháček and Bojar, 2013). Other MT researchers also analysed a LEPOR metric variant as one of the best performing segment-level metrics that was not significantly outperformed by other metrics using WMT shared task data (Graham et al., 2015). hLEPOR is calculated by:

$$hLEPOR = Harmonic(w_{LP} LP, \; w_{NPosPenal} NPosPenal, \; w_{HPR} HPR)$$

where LP is a sentence length penalty factor which was extended from the brevity penalty utilised in the BLEU metric, NPosPenal is the n-gram position difference penalty which captures word order information, and HPR is the harmonic mean of Precision and Recall values. We refer to the work of Han (2014) for detailed factor calculation with examples.

The basic version of hLEPOR carries out similarity calculation between MT system outputs and reference translations, in the same language setting, based on word surface level tokens. The hybrid hLEPOR metric also carries out similarity calculation based on POS sequences from the system output and the reference text. To do this, POS tagging is needed as the first step; the hLEPOR(POS) calculation then uses the same algorithms used for the word-level similarity score hLEPOR(word). Finally, hybrid hLEPOR is a combination of the word-level and POS-level scores. In this system submission work, given the time limitations and to make the customised hLEPOR easier to use, we take the basic version of hLEPOR, i.e. the word-level similarity calculation, and leave hybrid hLEPOR to future work.

The weighting parameters for the three main factors in the original hLEPOR metric, i.e. the $(w_{LP}, w_{NPosPenal}, w_{HPR})$ set, in addition to the other parameters inside each factor, were tuned manually based on development data. This is very time-consuming, tedious, and costly. In this work, we introduce an automated tuning model for hLEPOR to customise it for the deployed language pairs, which we name cushLEPOR.

3 Proposed Model

3.1 Python portable hLEPOR

The original hLEPOR was published as Perl code¹, in a non-portable format, which is not very suitable for modern AI/NLP applications, since they use Python almost exclusively. Python is a programming language of choice for AI/ML tasks, thanks to its ecosystem of open-source or simply free libraries available to researchers and developers. However, hLEPOR was not available in NLTK (Bird et al., 2009) or any other public Python library. We therefore took the original published Perl code and ported it to Python, carefully comparing the logic of the original paper and the Perl implementation. During this work we ran both the Perl code, to reproduce the results of the original code, and the new Python implementation. This work helped us to spot at least three minor errors in the Perl code, which did not significantly affect the score, but which we nevertheless fixed.

¹ https://github.com/poethan/LEPOR

While doing the porting we also noticed that the hLEPOR parameter values were chosen empirically and never explained in detail, except for the suggested parameter setting table in the paper (Han et al., 2013b,a) for the eight language pairs that were tested for the WMT2013 shared task, including EN-CZ/DE/FR/ES and the opposite direction. They were:

• alpha: the tunable weight for recall

• beta: the tunable weight for precision

• n: words count before and after matched word in npd calculation

• weight_elp: tunable weight of enhanced length penalty

• weight_pos: tunable weight of n-gram position difference penalty

• weight_pr: tunable weight of harmonic mean of precision and recall

The parameter values for hLEPOR as published in the publicly available Perl code were (from CZ-

3.2 cushLEPOR: customised hLEPOR

With the recent development of pre-trained neural language models and their effective application in different NLP tasks, including question answering, language inference and MT, it becomes natural to ask why we do not apply them in MT evaluation as well.

Very recent work from a Google team verified that MQM (Multidimensional Quality Metrics) (Lommel et al., 2014) and SQM (Scalar Quality Metrics) (Freitag et al., 2021) have good agreement with each other when both are carried out by professional translators. However, they do not correlate with Mechanical Turk based crowd-sourced human evaluation carried out by general researchers or untrained online workers. That work also showed that crowd-sourced evaluation tends to favour very literal translations over better translations with more diverse, meaning-equivalent lexical choices.
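
As a reference point for the hLEPOR formula in Section 2 and the weight parameters listed in Section 3.1, the combination step might be sketched in Python roughly as follows. This is a minimal sketch, not the released Perl or Python code. It assumes that Harmonic denotes the usual weighted harmonic mean (sum of weights divided by the sum of weight/value), that weight_elp, weight_pos and weight_pr play the roles of the weights on LP, NPosPenal and HPR, and that the factor values themselves are already computed. The function names weighted_harmonic_mean, hpr and hlepor are hypothetical.

```python
from typing import Sequence


def weighted_harmonic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i).

    Assumption: this is the standard definition; the paper only names
    the operator 'Harmonic' without spelling it out in this section.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or any(v < 0 for v in values):
        raise ValueError("values and weights must be non-negative")
    # Factors with zero weight do not contribute to the mean.
    pairs = [(v, w) for v, w in zip(values, weights) if w > 0]
    if not pairs:
        raise ValueError("at least one weight must be positive")
    # A zero-valued weighted factor drives the harmonic mean to zero.
    if any(v == 0 for v, _ in pairs):
        return 0.0
    return sum(w for _, w in pairs) / sum(w / v for v, w in pairs)


def hpr(precision: float, recall: float, alpha: float, beta: float) -> float:
    """Harmonic mean of precision and recall.

    alpha weights recall and beta weights precision, as in the
    parameter list of Section 3.1.
    """
    return weighted_harmonic_mean([recall, precision], [alpha, beta])


def hlepor(lp: float, npos_penal: float, hpr_value: float,
           weight_elp: float, weight_pos: float, weight_pr: float) -> float:
    """Combine the three precomputed factors into one segment score.

    Assumption: weight_elp, weight_pos and weight_pr correspond to the
    weights on LP, NPosPenal and HPR in the Section 2 formula.
    """
    return weighted_harmonic_mean(
        [lp, npos_penal, hpr_value],
        [weight_elp, weight_pos, weight_pr],
    )
```

With made-up factor values, a call such as `hlepor(0.9, 0.8, hpr(0.7, 0.6, alpha=1.0, beta=1.0), 1.0, 1.0, 1.0)` returns a score between the smallest and largest factor. Because the mean is harmonic, one low factor pulls the score down more strongly than it would under an arithmetic mean, which is why the choice of weights matters for agreement with human scores.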