A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

Craig Thomson, University of Aberdeen, Aberdeen, UK, [email protected]
Ehud Reiter, University of Aberdeen, Aberdeen, UK, [email protected]

arXiv:2011.03992v1 [cs.CL] 8 Nov 2020

Abstract

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer-generated basketball summaries. We then show how our gold-standard evaluation can be used to validate automated metrics.

1 Introduction

In most contexts, it is essential that texts produced by data-to-text Natural Language Generation (NLG) systems accurately communicate input data. Hallucination and other forms of inaccuracy are unacceptable in NLG application contexts such as journalism, financial reporting, and medical patient information. For example, it is not acceptable to give a doctor incorrect information about a patient. This means that it is essential that NLG developers be able to evaluate whether texts produced by their systems are accurate or not.

We propose here a methodology (protocol) for high-quality human evaluation of the accuracy of generated texts. The methodology focuses on identifying and categorising specific accuracy errors in a text; hence it is quite different from protocols which ask people to assess the overall accuracy of a text on a scale.

Existing work on detecting mistakes and hallucinations in NLG texts has largely focused on short texts which communicate relatively simple data. For example, Dušek et al. (2019) looked at slot error rate in the E2E challenge (Dušek et al., 2020), which involved generating short sentences (13 words on average) which communicated 8 attributes. Our goal is to develop techniques which can be used to evaluate accuracy in longer texts (hundreds of words) which communicate complex data and possibly insights (e.g., trends and best/worst) derived from the source data. This is a more challenging task; longer texts can have contextual errors which are rare in short texts, and checking the accuracy of insights derived from complex data is harder than checking whether a small number of attributes, for a single entity, are accurately communicated.

In this paper we specifically focus on finding accuracy mistakes in English-language sports stories. However, we believe the techniques could also be applied to other types of texts produced by data-to-text systems, including financial reports and business intelligence, which are very important in commercial data-to-text applications (Elliot et al., 2020).

Finding accuracy mistakes in a 300-word sports story using our methodology costs on the order of US$30 in Mechanical Turk worker payments and Amazon fees, plus 30 minutes of experimenter time. Workers were screened with a qualification task. We intend our methodology to be a high-quality gold standard for accuracy evaluation, and encourage other researchers to find alternative and cheaper evaluation techniques which correlate well with the gold standard presented here.

In other words, researchers developing metrics for measuring accuracy can compare the results of their metrics with our gold-standard accuracy evaluation, and use this comparison to validate their metrics and assess how effective their metrics are at measuring accuracy.

Figure 1 shows example sentences from a sports story annotated by our methodology. The text is an example, constructed from fragments of the output of different systems, with some manual adjustment to keep the example simple. The materials used to perform the evaluation described below, as well as the small corpus of 21 accuracy-annotated sports stories, have been released on GitHub (https://github.com/nlgcat/evaluating_accuracy).
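To make the intended comparison concrete, the minimal sketch below assumes that the gold-standard evaluation and a candidate metric each produce a per-story count of accuracy errors; the correlation between the two sets of counts then indicates how closely the metric tracks the gold standard. The function name, the example numbers, and the choice of Pearson correlation are illustrative assumptions, not part of the paper's released materials.

```python
# Illustrative sketch (not from the released code): validating an automatic
# accuracy metric against gold-standard human error counts by correlating
# per-story error counts.
from statistics import correlation  # Pearson correlation, Python 3.10+


def validate_metric(gold_error_counts, metric_error_counts):
    """Return the Pearson correlation between gold-standard and metric
    error counts, computed over the same set of stories.

    Each argument is a list of numbers; the i-th entry is the number of
    accuracy errors found in story i by that evaluation method.
    """
    if len(gold_error_counts) != len(metric_error_counts):
        raise ValueError("both lists must cover the same stories")
    return correlation(gold_error_counts, metric_error_counts)


# Hypothetical example: error counts for five annotated stories.
gold = [9, 4, 7, 2, 11]     # from the human gold-standard protocol
metric = [7, 5, 6, 1, 10]   # from some automatic accuracy metric
print(f"correlation = {validate_metric(gold, metric):.2f}")
```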
The Memphis Grizzlies (5-2) defeated the Phoenix Suns (3-2) Monday 102-91 at the Talking Stick Resort Arena in Phoenix. The Grizzlies had a strong first half where they out-scored the Suns 59-42. Marc Gasol scored 18 points, leading the Grizzlies. Isaiah Thomas added 15 points, he is averaging 19 points on the season so far. The Suns' next game will be on the road against the Boston Celtics on Friday.

List of errors:

• 2: incorrect number, should be 0.
• Monday: incorrect named entity, should be Wednesday.
• Talking Stick Resort Arena: incorrect named entity, should be US Airways Center.
• strong: incorrect word, the Grizzlies did not do well in the first half.
• out-scored: incorrect word, the Suns had a higher score in the first half.
• 59: incorrect number, should be 46.
• 42: incorrect number, should be 52.
• leading: incorrect word, Marc Gasol did not lead the Grizzlies, Mike Conley did with 24 points.
• Isaiah Thomas added: context error, Thomas played for the Suns, but the context here implies he played for the Grizzlies and added to their score.
• averaging 19 points in the season so far: not checkable. Data sources report performance per season and per game, not performance at a particular point in a season.
• on the road: incorrect word, the Suns will play at home.
• Boston Celtics: incorrect named entity, the Suns will play the Sacramento Kings.

Figure 1: Example text with error annotations. Each annotation includes an error type and a correction. Annotators can add explanations where useful. Box score data for this game is available at https://www.basketball-reference.com/boxscores/201411050PHO.html.
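Each annotation in Figure 1 pairs a marked span with an error category, a correction, and an optional explanation. The minimal sketch below shows one possible machine-readable encoding of such annotations, together with a small majority-vote helper that anticipates the gold-standard integration step described in Section 3. The class name, field names, and helper are illustrative assumptions, not the format of the released corpus.

```python
# Illustrative encoding of the annotations shown in Figure 1; the class and
# field names are assumptions, not the released corpus format.
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Error categories that appear in Figure 1.
ERROR_CATEGORIES = {
    "incorrect number",
    "incorrect named entity",
    "incorrect word",
    "context error",
    "not checkable",
}


@dataclass
class ErrorAnnotation:
    span: str                     # text marked by the annotator, e.g. "Monday"
    category: str                 # one of ERROR_CATEGORIES
    correction: Optional[str]     # e.g. "Wednesday"; None when no correction applies
    explanation: Optional[str] = None  # optional free-text justification


# Two of the Figure 1 annotations encoded in this form.
annotations = [
    ErrorAnnotation("Monday", "incorrect named entity", "Wednesday"),
    ErrorAnnotation("59", "incorrect number", "46"),
]


def majority_category(labels: list[str]) -> str:
    """Return the category chosen by most annotators for the same span,
    a simple stand-in for the majority-opinion integration in Section 3."""
    return Counter(labels).most_common(1)[0][0]


print(majority_category(["incorrect word", "incorrect word", "incorrect number"]))
```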
2 Related Work

NLG systems can be evaluated using either automatic metrics or human evaluation (Celikyilmaz et al., 2020). Automatic metrics such as BLEU are not very meaningful in NLG (Reiter, 2018), especially when assessing accuracy (Reiter and Belz, 2009). Even in machine translation, BLEU and related metrics are not meaningful unless the differences in metric scores are quite large, much larger than reported in most academic papers (Mathur et al., 2020).

Human evaluation of NLG systems is usually done using Likert scales or ratings (van der Lee et al., 2019). In the context of evaluating accuracy, human evaluators are usually asked to assess the overall accuracy of a generated text (Reiter and Belz, 2009), or to compare two texts and say which text is overall most accurate (Reiter et al., 2005; Novikova et al., 2018).

The Pyramid method (Nenkova and Passonneau, 2004) in text summarisation is a complex technique for evaluating the quality of a summary from a content perspective. It originally required substantial human input, but recently there have been attempts to automate Pyramid analysis (Yang et al., 2016). However, Pyramid focuses on checking whether expected content is present, not on finding mistakes in unexpected content.

In the context of evaluating computer-generated sports stories, Wiseman et al. (2017) showed sentences (not complete stories) to human subjects, and asked the subjects to count how many facts in each sentence were supported by the game data and how many contradicted it. These results were then compared to metrics based on information extraction techniques. This was repeated by Puduppully et al. (2019) and extended to other domains by Dhingra et al. (2019).

Another metric which semantically analyses generated text and compares it to the source data is SPICE (Anderson et al., 2016), which uses this approach to evaluate the quality of computer-generated image captions.

Accuracy checking is also an issue in fact checking and verification. The FEVER workshops and shared tasks (Thorne et al., 2018b, 2019) asked participants to develop techniques to identify factual errors in manually 'mutated' versions of Wikipedia articles (Thorne et al., 2018a).

3 Methodology

In summary, our approach is to ask multiple annotators to identify specific errors in a text and categorise the errors into one of a small number of types. We also ask annotators to provide corrections and, optionally, explanations of the errors. An example is shown in Figure 1. We then integrate the annotations into a single gold standard, based on the majority opinion of our annotators.

The methodology described below has been refined based on the results of pilot annotation exercises performed with a different group of participants.

3.1 Real-world error vs not in the data?

We ask annotators to mark up places where the text says things which are not true. An alternative approach is to annotate places where the texts say things which are not in the system's input data. These two approaches often agree but sometimes disagree. For example, a story may state where a game was played even though the venue is not part of the system's input data; such a statement would always be flagged under the "not in the data" approach, but would only be considered an error under the "real-world error" approach if the game was actually played somewhere else (which is rare).

We believe that the "real-world error" approach is a better fit to what users want, so we use it. But we realise that others have different views, and are happy to discuss this. It is also worth noting that, from a pragmatic perspective, it is probably easier for annotators who have domain expertise to detect real-world errors. They do not need to check whether things they already know to be true are present in the input data, and they can use existing resources (tools, websites, etc.) which they are familiar with to find out what actually happened, without worrying about whether all the information in the resource is present in the NLG system's input data.

It is possible that the systems being evaluated used differing input data. For example, Gong et al. (2019) include data from games beyond the one which is the focus of the summary. This means that from a practical perspective we would need to create multiple user interfaces to present data to the annotators if we want annotators to check