A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output

Vis-Eval Metric Viewer: A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output David Steele and Lucia Specia Department of Computer Science University of Sheffield Sheffield, UK dbsteele1,[email protected] Abstract cised for providing scores that can be non-intuitive and uninformative, especially at the sentence level Machine Translation systems are usually eval- (Zhang et al., 2004; Song et al., 2013; Babych, uated and compared using automated evalua- 2014). Additionally, scores across different met- tion metrics such as BLEU and METEOR to score the generated translations against human rics can be inconsistent with each other. This in- translations. However, the interaction with the consistency can be an indicator of linguistic prop- output from the metrics is relatively limited erties of the translations which should be further and results are commonly a single score along analysed. However, multiple metrics are not al- with a few additional statistics. Whilst this ways used and any discrepancies among them tend may be enough for system comparison it does to be ignored. not provide much useful feedback or a means Vis-Eval Metric Viewer (VEMV) was devel- for inspecting translations and their respective scores. Vis-Eval Metric Viewer (VEMV) is a oped as a tool bearing in mind the aforementioned tool designed to provide visualisation of mul- issues. It enables rapid evaluation of MT output, tiple evaluation scores so they can be easily in- currently employing up to eight popular metrics. terpreted by a user. VEMV takes in the source, Results can be easily inspected (using a typical reference, and hypothesis files as parameters, web browser) especially at the segment level, with and scores the hypotheses using several popu- each sentence (source, reference, and hypothesis) lar evaluation metrics simultaneously. Scores clearly presented in interactive score tables, along are produced at both the sentence and dataset level and results are written locally to a se- with informative statistical graphs. No server or ries of HTML files that can be viewed on a internet connection is required. Only readily avail- web browser. The individual scored sentences able packages or libraries are used locally. can easily be inspected using powerful search Ultimately VEMV is an accessible utility that and selection functions and results can be vi- can be run quickly and easily on all the main plat- sualised with graphical representations of the forms. scores and distributions. Before describing the technical specification of 1 Introduction the VEMV tool and its features in Section3, we give an overview of existing metric visualisation Automatic evaluation of Machine Translation tools in Section2. (MT) hypotheses is key for system development and comparison. Even though human assessment 2 Related Tools ultimately provides more reliable and insight- ful information, automatic evaluation is faster, Several tools have been developed to visualise the cheaper, and often considered more consistent. output of MT evaluation metrics that go beyond Many metrics have been proposed for MT that displaying just single scores and/or a few statistics. compare system translations against human refer- Despite its criticisms and limitations, BLEU is ences, with the most popular being BLEU (Pa- still regarded as the de facto evaluation metric used pineni et al., 2002), METEOR (Denkowski and for rating and comparing MT systems. It was one Lavie, 2014), TER (Snover et al., 2006), and, more of the earliest metrics to assert a high enough cor- recently, BEER (Stanojevic and Sima’an, 2014). relation with human judgments. These and other automatic metrics are often criti- Interactive BLEU (iBleu) (Madnani, 2011) is 71 Proceedings of NAACL-HLT 2018: Demonstrations, pages 71–75 New Orleans, Louisiana, June 2 - 4, 2018. c 2018 Association for Computational Linguistics a visual and interactive scoring environment that tactic and semantic parsers. The online tool2 aims uses BLEU. Users select the source, reference, and to offer a more practical solution, where users can hypothesis files using a graphical user interface upload their translations. The tool offers a mod- (GUI) and these are scored. The dataset BLEU ule for sentence-level inspection through interac- score is shown alongside a bar chart of sentence tive tables. Some basic dataset-level graphs are scores. Users can select one of the sentences by also displayed and can be used to compare system clicking on the individual bars in the chart. When scores. a sentence is selected its source and hypothesis In comparison to the other software described translation is also shown, along with the standard here, VEMV is a light yet powerful utility, which BLEU statistics (e.g. score and n-gram informa- offers a wide enough range of metrics and can tion for the segment). Whilst iBLEU does provide be easily extended to add other metrics. It has some interactivity, using the graph itself to choose a very specific purpose in that it is designed for the sentences is not very intuitive. In addition the rapid and simple use locally, without the need for tool provides results for only one metric. servers, access to the internet, uploads, or large METEOR is another popular metric used to installs. Users can quickly get evaluation scores compute sentence and dataset-level scores based from a number of mainstream metrics and view on reference and hypothesis files. One of its them immediately in easily navigable interactive main components is to word-align the words in score tables. We contend that currently there is no the reference and hypothesis. The Meteor-X-Ray other similar tool that is lightweight and offers this tool generates graphical output with visualisation functionality and simplicity. of word alignments and scores. The alignments and score distributions are used to generate simple 3 Vis-Eval Metric Viewer Software & graphs (output to PDF). Whilst the graphs do pro- Features vide extra information there is little in the way of interactivity. This section provides an overview of the VEMV MT-ComparEval (Klejch et al., 2015) is a dif- software and outlines the required input param- ferent evaluation visualisation tool, available to eters, technical specifications, and highlights a be used online1 or downloaded locally. Its pri- number of the useful features. mary function is to enable users, via a GUI, to compare two (or more) MT system outputs, us- 3.1 The Software ing BLEU as the evaluation metric. It shows re- VEMV is essentially a multi-metric evaluation sults at both the sentence and dataset level high- tool that uses three tokenised text files (source, ref- lighting confirmed, improving, and worsening n- erence, and hypothesis) as input parameters and grams for each MT system with respect to the scores the hypothesis translation (MT system out- other. Sentence-level metrics (also n-gram) input) using up to eight popular metrics: BLEU, clude precision, recall, and F-Measure informa- MT-Eval3 (MT NIST & MT BLEU), METEOR, tion as well as score differences between MT sys- BEER, TER, Word Error Rate (WER), and Edit tems for a given sentence. Users can upload their Distance (E-Dist).4 All results are displayed via own datasets to view sentence-level and dataset easily navigable web pages that include details scores, albeit with a very limited choice of met- of all sentences and scores (shown in interactive rics. The GUI provides some interaction with the score tables - Figure1). A number of graphs show- evaluation results and users can make a number of ing various score distributions are also created. preference selections via check boxes. The key aims of VEMV are to make the eval- The Asiya Toolkit (Gimenez´ et al., 2010) is a uation of MT system translations easy to under- visualisation tool that can be used online or as a take and to provide a wide range of feedback that stand-alone tool. It offers a comprehensive suite of helps the user to inspect how well their system per- metrics, including many linguistically motivated formed, both at the sentence and dataset level. ones. Unless the goal is to run a large number of metrics, the download version is not very practi- 2At the time of writing the online version did not work. cal. It relies on many external tools such as syn- 3https://www.nist.gov/ 4A WER like metric that calculates the Levenshtein (edit) 1http://wmt.ufal.cz/ distance between two strings, but at the character level. 72 Figure 1: A screenshot of an interactive score table showing two example sentences and their respective scores. 3.2 Input and Technical Specification 3.3 Main Features VEMV is written in Python 3 (also compatible Here we outline some key features of the Vis-Eval with Python 2.7). To run the tool, the following Metric Viewer tool: software needs to be installed: Scoring with multiple evaluation metrics Python >= 2.7 (required) • Currently VEMV uses eight evaluation met- NLTK5 >= 3.2.4 (required) • rics to score individual sentences and the whole document. All results are shown side by side for Numpy (required) • comparison purposes and can be inspected at a Matplotlib / Seaborn (optional - for graphs) granular level (Figure1). • A glance at the two sentences in Figure1 al- Perl (optional - for MT BLEU, MT NIST) ready provides numerous points for analysis. For • example, the MT in sentence 2 is a long way Java (optional - for METEOR, BEER, TER) from the reference and receives low metric scores. • However, whilst not identical to the reference, the With the minimum required items installed the MT is correct and could be interchanged with the software will generate scores for standard BLEU, reference without losing meaning. For sentence 1 WER, and E-Dist. The optional items enable a the MT is only a single word away from the refer- user to run a wider range of metrics and produce ence and receives good scores, (much higher than nearly 200 graphs during evaluation.

A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output

BLEU Might Be Guilty but References Are Not Innocent

Why Word Error Rate Is Not a Good Metric for Speech Recognizer Training for the Speech Translation Task?

Machine Translation Evaluation

Arxiv:2006.14799V2 [Cs.CL] 18 May 2021 the Same User Input

A Survey of Evaluation Metrics Used for NLG Systems

English and Arabic Speech Translation

Automatic Evaluation Measures for Statistical Machine Translation System Optimization

Re-Evaluating the Role of BLEU in Machine Translation Research

Introduction to Natural Language Processing Computer Science 585—Fall 2009 University of Massachusetts Amherst

Minimum Error Rate Training in Statistical Machine Translation

LEPOR: an Augmented Machine Translation Evaluation Metric Lifeng

Bertscore: Evaluating Text Generation with Bert