A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

Craig Thomson
University of Aberdeen
Aberdeen, UK
[email protected]

Ehud Reiter
University of Aberdeen
Aberdeen, UK
[email protected]

Abstract

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

1 Introduction

In most contexts, it is essential that texts produced by data-to-text Natural Language Generation (NLG) systems accurately communicate input data. Hallucination and other forms of inaccuracy are unacceptable in NLG application contexts such as journalism, financial reporting, and medical patient information. For example, it is not acceptable to give a doctor incorrect information about a patient. This means that it is essential that NLG developers be able to evaluate whether texts produced by their systems are accurate or not.

We propose here a methodology (protocol) for high-quality human evaluation of the accuracy of generated texts. The methodology focuses on identifying and categorising specific accuracy errors in a text; hence it is quite different from protocols which ask people to assess the overall accuracy of a text on a scale.

Existing work on detecting mistakes and hallucinations in NLG texts has largely focused on short texts which communicate relatively simple data. For example, Dušek et al. (2019) looked at slot-error-rate in the E2E challenge (Dušek et al., 2020), which involved generating short sentences (13 words on average) which communicated 8 attributes. Our goal is to develop techniques which can be used to evaluate accuracy in longer texts (hundreds of words) which communicate complex data and possibly insights (eg, trends and best/worst) derived from the source data. This is a more challenging task: longer texts can have contextual errors which are rare in short texts, and checking the accuracy of insights derived from complex data is harder than checking whether a small number of attributes, for a single entity, are accurately communicated.

In this paper we specifically focus on finding accuracy mistakes in English language sports stories. However, we believe the techniques could also be applied to other types of texts produced by data-to-text systems, including financial reports and business intelligence, which are very important in commercial data-to-text applications (Elliot et al., 2020).

Finding accuracy mistakes in a 300-word sports story using our methodology costs on the order of US$30 in Mechanical Turk worker payments and Amazon fees, plus 30 minutes of experimenter time. Workers were screened with a qualification task. We intend our methodology to be a high quality gold standard for accuracy evaluation, and encourage other researchers to find alternative and cheaper evaluation techniques which correlate well with the gold standard presented here.

In other words, researchers developing metrics for measuring accuracy can compare the results of their metrics with our gold-standard accuracy evaluation, and use this comparison to validate their metrics and assess how effective their metrics are at measuring accuracy.

Figure 1 shows example sentences from a sports story annotated by our methodology. The text is an example, constructed from fragments from the output of different systems, with some manual adjustment to keep the example simple. The materials used to perform the evaluation described below, as well as the small corpus of 21 accuracy-annotated sports stories, have been released on GitHub (https://github.com/nlgcat/evaluating_accuracy).

The Memphis Grizzlies (5-2) defeated the Phoenix Suns (3-2) Monday 102-91 at the Talking Stick Resort Arena in Phoenix. The Grizzlies had a strong first half where they out-scored the Suns 59-42. Marc Gasol scored 18 points, leading the Grizzlies. Isaiah Thomas added 15 points, he is averaging 19 points on the season so far. The Suns' next game will be on the road against the Boston Celtics on Friday.

List of errors:

• 2: incorrect number, should be 0.
• Monday: incorrect named entity, should be Wednesday.
• Talking Stick Resort Arena: incorrect named entity, should be US Airways Center.
• strong: incorrect word, the Grizzlies did not do well in the first half.
• out-scored: incorrect word, the Suns had a higher score in the first half.
• 59: incorrect number, should be 46.
• 42: incorrect number, should be 52.
• leading: incorrect word, Marc Gasol did not lead the Grizzlies, Mike Conley did with 24 points.
• Isaiah Thomas added: context error, Thomas played for the Suns, but context here implies he played for the Grizzlies and added to their score.
• averaging 19 points in the season so far: not checkable. Data sources report performance per season and per game, not performance at a particular point in a season.
• on the road: incorrect word, the Suns will play at home.
• Boston Celtics: incorrect named entity, the Suns will play the ...

Figure 1: Example text with error annotations. Each annotation includes an error type and a correction. Annotators can add explanations where useful. Box score data for this game is available at https://www.basketball-reference.com/boxscores/201411050PHO.html

2 Related Work

NLG systems can be evaluated using either automatic metrics or human evaluation (Celikyilmaz et al., 2020). Automatic metrics such as BLEU are not very meaningful in NLG (Reiter, 2018), especially when assessing accuracy (Reiter and Belz, 2009). Even in machine translation, BLEU and related metrics are not meaningful unless the differences in metric scores are quite large, much larger than reported in most academic papers (Mathur et al., 2020).

Human evaluation of NLG systems is usually done using Likert scales or ratings (van der Lee et al., 2019). In the context of evaluating accuracy, human evaluators are usually asked to assess the overall accuracy of a generated text (Reiter and Belz, 2009), or to compare two texts and say which text is overall most accurate (Reiter et al., 2005; Novikova et al., 2018).

The Pyramid method (Nenkova and Passonneau, 2004) in text summarisation is a complex technique for evaluating the quality of a summary from a content perspective. It originally required substantial human input, but recently there have been attempts to automate PYRAMID analysis (Yang et al., 2016). However, PYRAMID focuses on checking whether expected content is present, not finding mistakes in unexpected content.

In the context of evaluating computer-generated sports stories, Wiseman et al. (2017) showed sentences (not complete stories) to human subjects, and asked the subjects to count how many facts in the sentence were supported by game data and how many contradicted the game data. These results were then compared to metrics based on information extraction techniques. This was repeated by Puduppully et al. (2019) and extended to other domains by Dhingra et al. (2019). Another metric which semantically analyses generated text and compares this to the source data is SPICE (Anderson et al., 2016), which uses this approach to evaluate the quality of computer-generated image captions.

Accuracy-checking is also an issue in fact checking and verification. The FEVER workshops and shared tasks (Thorne et al., 2018b, 2019) asked participants to develop techniques to identify factual errors in manually 'mutated' versions of Wikipedia articles (Thorne et al., 2018a).

3 Methodology

In summary, our approach is to ask multiple annotators to identify specific errors in a text, and categorise the errors into one of a small number of types. We also ask annotators to provide corrections and optionally explanations of the error. An example is shown in Figure 1. We then integrate the annotations into a single gold standard, based on majority opinion of our annotators.

The methodology described below has been refined based on results of pilot annotation exercises performed with a different group of participants.

3.1 Real-world error vs not in the data?

We ask annotators to mark up places where the text says things which are not true. An alternative approach is to annotate places where the texts say things which are not in the system's input data. These two approaches often agree but sometimes disagree. For example, consider the below sentence, in a context where there is no information in the input data about the next game.

The Suns' next game will be on the road against the Boston Celtics.

Since this particular NLG system had no information available for the next game, the above sentence is pure hallucination. However, the information could still be correct, by almost sheer luck. If the sentence is factually accurate (ie, the Suns' next game really was an away game against the Boston Celtics), we would not consider this to be an error.

The difference between the "in-the-data" and "real-world" approaches is most noticeable when there are facts or insights which are not present in the data but can be inferred from other data with high but not perfect confidence. For example, suppose the data for a basketball game records whether the game is "home" or "away" for each team, but not the actual location of the game. We can make strong guesses about location from this data; eg, a home game played by the Memphis Grizzlies will probably be in Memphis. However, there are exceptions (eg, NBA Global Games). In this case, stating that a home game for the Grizzlies was played in Memphis would always be considered an error under the "in-the-data" approach, but would only be considered an error under the "real-world error" approach if the game was actually played somewhere else (which is rare).

We believe that the "real-world error" approach is a better fit to what users want, so we use it. But we realise that others have different views, and are happy to discuss this. It's also worth noting that from a pragmatic perspective it's probably easier for annotators who have domain expertise to detect real-world errors. They do not need to check whether things they already know to be true are present in the input data, and they can use existing resources (tools, websites, etc) which they are familiar with to find out what actually happened, without worrying about whether all the information in the resource is present in the NLG system's input data.

It is possible that the systems being evaluated used differing input data. For example, Gong et al. (2019) include data from games beyond the one which is the focus of the summary. This means that from a practical perspective we would need to create multiple user interfaces to present data to the annotators if we want annotators to check whether facts are in a system's input data set.

3.2 Categories

We ask our annotators to categorise errors into one of the following categories:

• Incorrect number: This includes numbers which are spelled out as well as digits.
• Incorrect named entity: This includes people, places, organisations, and days of the week.
• Incorrect word: A word which is not one of the above and is incorrect.
• Context error: A phrase which causes an incorrect inference because of context or discourse.
• Not checkable: A statement which cannot be checked; either the information is not available or it is too time-consuming to check.
• Other: Any other type of mistake. Annotators are asked to only use this category as a last resort.
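As an illustration, an error annotation under this scheme (a marked span, a category, a correction, and an optional explanation) could be stored as a small structured record. The sketch below is illustrative only; the field names are hypothetical and this is not the format of our released data.

```python
# Illustrative sketch only: a possible record structure for one error
# annotation. Field names are hypothetical, not the released corpus format.
from dataclasses import dataclass
from typing import Optional

CATEGORIES = {"number", "name", "word", "context", "not checkable", "other"}

@dataclass
class ErrorAnnotation:
    story_id: str                      # which generated story contains the error
    span: str                          # the marked text, e.g. "Monday"
    category: str                      # one of CATEGORIES
    correction: Optional[str] = None   # e.g. "Wednesday"
    explanation: Optional[str] = None  # optional free-text note

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

# One of the Figure 1 errors, expressed in this form:
example = ErrorAnnotation("example-story", "Monday", "name", correction="Wednesday")
```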

These categories were developed based on our pilot experiments, with the key aim being to keep them simple and intuitive. We hope that categorising errors will give NLG developers a better idea of where their system needs to be improved from an accuracy perspective. Also, from the perspective of using our methodology as a gold standard, categories will help developers of alternative evaluation metrics to understand their effectiveness (subsection 4.7).

We experimented with more linguistically meaningful categories such as Incorrect referring expression, but some of our annotators struggled to understand these categories. The above categories capture important linguistic distinctions, and seem to be meaningful and sensible to our annotators.

Not checkable: This category covers both

• Statements which are very time-consuming to check, such as how many points a player has scored half-way through the season. We do not want annotators to spend a large amount of time checking a single error.
• Statements which are impossible to check, such as claims about what players were worried about.

We could have used separate categories, but have grouped these together because such statements are relatively rare (Section 4.3). Note that whilst in principle Not checkable statements may not actually be errors, in practice they usually seem to be errors, at least with the neural network based systems we evaluated. Not checkable statements tended to require data from previous games which the systems did not have access to.

The Other category is intended to be used for accuracy errors that do not fit into any other category. In our pilot experiments, some annotators used Other extensively, including for errors which we believe could have been placed into a different category. We kept the category, but explicitly asked annotators to only use it if absolutely necessary.

3.3 Difficult Cases

Our pilot experiments highlighted several cases where annotation was difficult. One problem is that sometimes a text can be annotated in different ways. For example, assume that Lou Williams scored 14 points and had 1 rebound, and Solomon Hill scored 30 points and had 6 rebounds. In this case, the sentence 'Lou Williams scored 30 points and had 6 rebounds' could be annotated in two ways:

• Lou Williams scored [30] points and had [6] rebounds (the numbers marked as incorrect).
• [Lou Williams] scored 30 points and had 6 rebounds (the name marked as incorrect).

In other words, we can either annotate the numbers as incorrect (should be 14 and 1) or the name as incorrect (should be Solomon Hill). In such cases we ask annotators to choose the annotation with the fewest mistakes; for example, the second annotation (Lou Williams should be Solomon Hill) in the above example (a small sketch of this rule is given at the end of this subsection).

Our pilot experiments also showed that it was difficult for annotators to annotate specific errors if the text included sentences or phrases which were completely nonsensical, such as

Markieff Morris also had a nice game off the bench, as he scored 20 points and swatted away late in the fourth quarter to give the Suns a commanding Game 1 loss to give the Suns a 118-0 record in the Eastern Conference's first playoff series with at least the Eastern Conference win in Game 5.

There are so many errors in this sentence (especially since the game being described is not a playoff game) that annotators struggled to mark up specific errors. We considered adding a Nonsense category, but decided against this, because it was difficult to define and also because we wanted to limit the number of categories.

Finally, our pilot experiments revealed cases where different annotators gave different results because they interpreted words differently. For example, some people interpret frontcourt to mean 3 players (center, power forward, small forward [2]), while others interpret it to mean 2 players (just center and power forward [3]). Because of this difference, annotators disagreed on whether the below sentence was an error or not.

The Bucks' frontcourt did most of the damage.

[2] https://www.sportsrec.com/8338357
[3] https://en.wikipedia.org/wiki/Basketball_positions

We experimented with adding a glossary to resolve such issues. However the glossary was not always effective (eg, an annotator who already knew what frontcourt meant might not check the glossary) and complicated the annotation exercise, so we dropped it.

We would like to revisit the above issues in our future research.
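The "fewest mistakes" rule above can be made concrete with a short sketch. The box-score figures repeat the Lou Williams / Solomon Hill example; the function and variable names are hypothetical and the code is not part of the annotation protocol itself.

```python
# Sketch of the "fewest mistakes" tie-breaking rule from Section 3.3.
# Box-score numbers follow the Lou Williams / Solomon Hill example above;
# the code is illustrative only, not part of the annotation protocol.
box_score = {
    "Lou Williams": {"points": 14, "rebounds": 1},
    "Solomon Hill": {"points": 30, "rebounds": 6},
}

claim = {"player": "Lou Williams", "points": 30, "rebounds": 6}

def corrections_needed(claim, assumed_player):
    """How many parts of the claim must change if the sentence is really
    about `assumed_player`?"""
    edits = 0 if claim["player"] == assumed_player else 1            # name correction
    truth = box_score[assumed_player]
    edits += sum(claim[s] != truth[s] for s in ("points", "rebounds"))  # number corrections
    return edits

edit_counts = {p: corrections_needed(claim, p) for p in box_score}
print(edit_counts)                            # {'Lou Williams': 2, 'Solomon Hill': 1}
print(min(edit_counts, key=edit_counts.get))  # Solomon Hill: mark the name as the single error
```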

3.4 Participants

Annotating mistakes in the output of a data-to-text system requires time, attention to detail, and domain knowledge. Annotators must be able to understand the stories, as well as any domain specific terminology. It is not something which can be done quickly or trivially.

In large-scale annotation exercises, it may be easiest to simply hire annotators as research assistants. In smaller-scale exercises, Mechanical Turk or other crowdsourcing platforms can be used. We experimented with different MTurk strategies, and had the most success with the following approach:

• Turkers are first asked to do a validation exercise, where they re-annotate a text which has already been annotated. Turkers who annotate at least 70% of the known accuracy errors are validated, and invited to participate in the main annotation task. Those who failed to get 70% were still paid for the validation task.
• Turkers are paid roughly US$20 per hour. In addition to ethical considerations, reasonable pay is essential for recruiting high-quality annotators.
• Turkers are highly rated. We initially experienced problems with workers not spending time on our task and submitting poor (or no) annotations. This stopped when we filtered for 'Amazon masters' workers who also held US bachelors degrees. These workers were proactive in seeking clarification and it was clear they wanted to do a good job.

Each of our participants filled in a short survey where they provided gender, age and gave an indication of their domain knowledge, eg how frequently they watch or play basketball.

Note that Turkers probably could not be used to annotate texts in specialised technical domains such as medicine. In such cases, hiring annotators may be the best option.

We recommend asking three annotators to annotate each text; this makes the annotation process more robust. We experimented with 5 annotators, but this did not have much impact on results, while significantly increasing costs and complexity.

3.5 Combining Annotations

The final step of the process is to combine the three individual annotations of each text into a single gold-standard annotation. We do this ourselves, and it can be quite time-consuming, especially in cases where texts can be annotated in different ways. During this stage, we also remove disagreements due to size of span (subsubsection 3.5.1) and disagreements due to one annotator not following the guidelines (subsubsection 3.5.2).

3.5.1 Size of span

Since our protocol allowed for free annotation of text, annotators would sometimes differ in the number of words they highlighted to describe the same underlying error. For example, one annotator could highlight 'Boston Celtics' while a second may highlight 'the Boston Celtics'. In cases like this, where the entity which is being highlighted was clear, we manually adjusted for slight differences such as the inclusion of a determiner. Another example is where one annotator highlights 'on the road', while a second simply highlights 'the road', or even just 'road' for the location of an upcoming game.

3.5.2 Annotator did not follow rules

In some cases annotators disagreed because one annotator clearly had not followed our annotation instructions. In such cases, we used the annotation which had followed our instructions. For example, our annotation guidelines state that an incorrect day-of-week should be regarded as a Name error. Hence if annotators disagreed because some had annotated an incorrect day-of-week as a Name error but others had annotated it as a Word error, we recorded this as a Name error, since this is what our guidelines state.

Another common example was when annotators marked a construct such as '30-20' (eg, 'Boston won the rebounding battle 30-20') as a single error even if both numbers were incorrect. Our guidelines state this should be treated as two errors if both numbers are wrong. So again if some annotators marked the above as one error but others marked it as two errors, we recorded this as two errors, since this is what our guidelines state.

As discussed in section 5, these problems would be reduced or eliminated if we had a good tool and user interface for doing the annotation.

3.5.3 Text can be annotated in different ways

There were also cases where annotators disagreed with each other in ways which were not resolved by our annotation guidelines. In such cases, we recorded the majority opinion.
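The basic majority-vote step can be sketched as follows; this is a simplification, since the harder cases discussed next, and the span and guideline adjustments above, still require manual resolution.

```python
# Minimal sketch of majority-vote combination for one error span (Section 3.5).
# The real combination also involves the manual adjustments described above.
from collections import Counter

def majority_label(labels):
    """Return the majority category among three annotators, or None when
    all annotators disagree (a 'no majority' case)."""
    category, votes = Counter(labels).most_common(1)[0]
    return category if votes >= 2 else None

print(majority_label(["name", "name", "word"]))    # 'name'
print(majority_label(["name", "word", "number"]))  # None -> no majority annotation
```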

Some of these disagreements could have been resolved by more detailed annotation instructions (some examples are given in subsection 4.5). However others are more fundamental. For example, sometimes there are different ways of marking up a sentence (with the same number of errors). For instance 'Joel Embiid led the Heat with 30 points' could be changed to either of the true sentences 'Joel Embiid led the 76ers with 30 points' or 'Josh Richardson led the Heat with 30 points'. Both methods only involve one change. There are also more complex cases where multiple mistakes could be corrected in multiple ways.

Ideally the annotators would discuss such cases as a group and come to a consensus. However this is difficult to do with Mechanical Turk.

4 Experiment

We used our methodology to annotate a small corpus of sports stories.

4.1 Data

We worked with descriptions of basketball games which were produced by neural NLG systems from box score and other game data. We obtained 21 such stories, 7 each from the systems of Wiseman et al. (2017), Puduppully et al. (2019) and Rebuffel et al. (2020). The stories were 289 words long on average, with the shortest being 184 words long and the longest being 452 words long.

4.2 Annotations

We annotated all of the stories ourselves, to ensure that the process worked and that the Turkers were finding a similar number of errors to us; the annotation took us 20-30 minutes per story.

For the main exercise, we asked three Mechanical Turkers to annotate the 21 stories, using the procedure described above. Turkers were first asked to do a validation exercise. A total of 18 workers attempted this exercise. 10 made no effort, 4 tried but did not pass, one Turker passed but did not accept subsequent work, and three Turkers passed and agreed to do the main annotation exercise. We asked each of these three Turkers to annotate all 21 stories. We paid the workers US$8 per story (plus 20% Amazon fees); this is equivalent to approximately US$20 per hour, on the assumption that it takes 25 minutes on average to annotate a story. All of the participants who passed our qualifying task and agreed to do the main task for us were male, aged 30-60, and played or watched basketball regularly.

Annotation was done by marking up a Word document (see supplementary data), not through a graphical user interface. Annotation probably would be faster with a custom tool and user interface.

4.3 Results

We show here only the results from the Turkers. As mentioned above, we also annotated the stories ourselves; since our annotations were quite similar to the Turkers' and researcher annotations are not part of the methodology, we will not further discuss these annotations here but will release them with the worker annotations.

We found a total of 418 accuracy errors in the 21 stories (ie, on average approximately 20 accuracy errors per story). The breakdown by category was:

• 184 number errors
• 105 name errors
• 80 word errors
• 19 context errors
• 6 not-checkable errors
• 3 other errors
• 21 errors where there was no majority annotation

4.4 Disagreements between Annotators

In Table 1, we give a confusion matrix which shows cases where the majority of annotators thought there was an error, but the annotators did not agree on the type. The table shows disagreement after we have corrected annotations that did not follow our guidelines, as described in subsubsection 3.5.2. The biggest source of disagreements was when a minority annotator did not mark something as an error; inspection of these cases suggests that in most such instances there was an error, but the annotator missed it.

The Fleiss' kappa figure for inter-annotator agreement on error type classification was 0.79.
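For reference, Fleiss' kappa can be computed directly from a table of per-error category counts. The sketch below is a plain implementation of the standard formula; the toy table is made up for illustration and does not reproduce the 0.79 figure reported above.

```python
# Sketch of the standard Fleiss' kappa computation (Section 4.4).
# ratings[i][c] = number of annotators (out of 3) assigning category c to error i.
# The toy table below is made up for illustration only.
import numpy as np

def fleiss_kappa(ratings):
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[0]
    n_raters = ratings[0].sum()
    p_cat = ratings.sum(axis=0) / (n_items * n_raters)          # category proportions
    p_item = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()                                        # mean observed agreement
    p_exp = np.square(p_cat).sum()                               # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

toy = [[3, 0, 0],   # all three chose "number"
       [2, 1, 0],   # two "number", one "name"
       [0, 3, 0],   # all three chose "name"
       [0, 1, 2]]   # one "name", two "word"
print(round(fleiss_kappa(toy), 2))  # 0.47 on this toy table
```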

error type     total  all agree  number  name  word  context  not checkable  other  no type  no error
number         184    124        -       0     12    1        5              0      0        42
name           105    75         0       -     4     2        0              0      3        21
word           80     29         14      3     -     3        1              0      3        27
context        19     7          0       2     1     -        0              0      0        9
not checkable  6      1          3       0     0     0        -              0      0        2
other          3      1          0       0     0     0        0              -      0        2
split          21     0          12      5     16    9        4              3      0        14
no label       0      0          0       0     0     0        0              0      0        0
no error       0      0          0       0     0     0        0              0      0        0

Table 1: Confusion matrix for accuracy annotations. The table shows disagreements between annotators on error category, after annotations were corrected as described in subsubsection 3.5.2. Error type is the majority annotation, or 'split' if all annotators suggested a different annotation. All agree is the number of times, for this error type, that all annotators agreed. Number, name, word, context, not checkable and other are the number of times one annotator chose this category when it was not the majority annotation. No type means the minority annotator marked this as an error but did not give a type. No error means the minority annotator did not annotate this as an error (either because they did not regard it as an error or because they missed it).

system                    total  number  name  word  context  not checkable  other
Wiseman et al. (2017)     20.3   9.3     5.1   5.0   0.4      0.3            0.1
Puduppully et al. (2019)  20.9   10.9    5.3   4.0   0.7      0              0
Rebuffel et al. (2020)    15.0   6.0     4.0   2.6   1.6      0.6            0.3

Table 2: Average story error rate per system. Note that a system which was used in real contexts would probably need a total error rate of less than one error per story in order to be usable.

4.5 Improvements to Annotation Scheme

Looking at specific examples of disagreement suggests several ways in which the annotation scheme could be improved and clarified. For example:

• One error or two: In some cases annotators disagreed about whether two errors were present or just one. For example, if a player who was on the starting team but did poorly was described as 'led the bench', some annotators marked this as two errors (the player was (A) not a leader and (B) not on the bench since he was a starter), while other annotators marked this as one error.
• Pronouns: If a phrase contains a Name error and the following phrase contains a pronoun, is it an error if the pronoun refers to the corrected Name? For example, in 'the Bucks led for the majority of the game, as they led by double-digits', Bucks is a Name error and should be replaced by Lakers (the other team). The pronoun they should also refer to the Lakers; is this a second error?

Of course, we need to try to keep the annotation protocol as simple as possible when making the above changes.

4.6 Results by System

The goal of our experiment was not to compare systems, but nonetheless we show in Table 2 the average number of accuracy errors in each system. Please keep in mind that we obtained generated stories for different games for each of the three systems (in order to get a more varied set of texts in our experiments). If our goal was to compare the systems, we would have generated stories for the same game for each system.

With these caveats in mind, Table 2 suggests that Puduppully et al. (2019) and Wiseman et al. (2017) have similar profiles for accuracy errors. Rebuffel et al. (2020) has fewer errors overall and fewer number errors, but more context errors. Again we would need to redo the experiment with a different design (getting each system to generate stories for the same games) in order to properly compare the systems, but this does show how our methodology can at least in principle be used to evaluate and compare systems.

Incidentally, the number of accuracy errors made by all three systems is far higher than would be acceptable in a published sports story.

The acceptable error rate depends on venue and context, but it almost certainly would need to be less than one error per story, with 0.1 errors per story being a better target. Of course sports stories are primarily entertainment; data-to-text systems which generate medical reports which support clinical decision making would probably need an error rate of less than 0.001 per story in order to be useful.

4.7 Validating a Metric

We hope that our protocol can be used to validate metrics, that is to see how effective metrics are at measuring accuracy. Accordingly, we used our results to measure the effectiveness of the Relation Generation (RG) metric, which uses information extraction and was proposed by Wiseman et al. (2017) as a way of measuring the accuracy of generated sports summaries. In fact Wiseman et al. argued that 'post-hoc information extraction is significantly easier than generation itself' and that this should be leveraged to optimise systems.

RG uses an information extraction model trained to link facts in the text to tuples in the data. It operates on the same training corpus as the language generation model itself. During evaluation, RG extracts tuples from generated text, then compares these with known tuples in the data. Each tuple has a data type, which can be matched to the error types we define in subsection 3.2. RG only reports errors of the 'name' and 'number' categories; it cannot detect word, context, not-checkable, and other errors.

We used an extended version of the RG annotation algorithm (Thomson et al., 2020) which detects additional relations such as the day of the week each game was played on and subsequent opponents for each team. We trained IE models following the general procedure proposed by Wiseman et al. (2017). We trained with 3 random seeds and 5 learning rates, then manually chose the best 3 LSTM and best 3 convolutional models to ensemble.

We show recall and precision in Table 3. Note that recall in this context is not that of the IE models themselves, but rather the fraction of accuracy errors in our gold standard annotation which were detected by RG. Precision is the fraction of errors reported by RG which were in our gold standard.
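To make this computation explicit, the sketch below compares metric-reported errors against gold annotations for one category, treating each error as a (story, span, category) triple. This is a simplification of the actual matching, which also applies the manual adjustments discussed after Table 3; the example data are made up.

```python
# Sketch of per-category recall/precision against the gold standard (Table 3).
# Each error is reduced to a (story_id, span, category) triple; the real
# matching also applies the manual adjustments discussed in the text.
def recall_precision(gold, predicted, category):
    gold_set = {e for e in gold if e[2] == category}
    pred_set = {e for e in predicted if e[2] == category}
    matched = gold_set & pred_set
    recall = len(matched) / len(gold_set) if gold_set else 0.0
    precision = len(matched) / len(pred_set) if pred_set else 0.0
    return recall, precision

# Made-up example: the metric finds one real number error and one spurious one.
gold = [("story-01", "59", "number"), ("story-01", "Monday", "name")]
rg   = [("story-01", "59", "number"), ("story-01", "102", "number")]
print(recall_precision(gold, rg, "number"))  # (1.0, 0.5)
```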

measurement  number  name   word  context  not checkable  other
Recall       0.343   0.388  —     —        —              —
Precision    0.571   0.755  —     —        —              —

Table 3: Recall and precision of the RG metric per error type. — indicates that RG reported no errors of this type; in fact, RG is not capable of reporting such errors, as all of the data types from its tuples are either names or numbers. We accept as a correct recall any instance where RG identifies an error.

As with subsection 4.6, our goal is to illustrate how our evaluation protocol can be used to measure the efficacy of the metric itself, to enable its use as a proxy for the gold standard. Our results suggest that in addition to its inability to detect word, context, not-checkable, and other errors, RG also misses many name and number errors.

As described in subsection 3.3, when annotators were faced with multiple ways of marking up a text, we asked them to choose the one with the fewest errors. The RG metric in such cases usually reported many number errors, rather than a single name error. When comparing RG with our gold standard we manually adjusted for this, and considered such cases to be a match when computing recall and precision.

Systems evaluated in previous works using the RG metric have reported values in excess of 90% (Rebuffel et al., 2020). However, if the metric itself only recalls 35-40% of gold errors it may not be a reliable measure.

RG relies upon the data and the text presenting facts in the same form. Hence it can capture things like 'Solomon Hill scored 30 points', but would struggle with 'Solomon Hill posted a double-double', an aggregate statistic where the player must have double digit counts in exactly two of points, rebounds, assists, blocks or steals [4]. This is common basketball terminology which the systems we investigated almost always used incorrectly.

[4] https://en.wikipedia.org/wiki/Double-double

Another common mistake for RG involved sentences such as:

The Raptors got off to a hot start in this game, out-scoring the Heat 64-52 in the first half alone.

The input data for the systems we tested had only a per-quarter breakdown of points, as well as a total for the whole game. It does not provide a per-half score. Statements made in the generated texts about points-per-half are often hallucination because of this. The RG metric was unable to detect the above example.

There are many other cases where the facts in the generated text are not of the same form present in the data, such as:

He's now combined for 37 points in the last two games, as he continues to stay very consistent for the Heat.

This is an error; the actual score for this player (Hassan Whiteside) was 33 points over the last two games. RG does not have access to data for the previous game and therefore cannot detect such an error. The NLG system also does not have access to this data, yet often hallucinates such statements.

Semantic control techniques for bringing the data into closer alignment with the text (Dušek et al., 2019) have been shown to be useful with less complex datasets such as the E2E Challenge (Dušek et al., 2018). However, aligning data and text with the level of complexity shown in the sports summaries is a more difficult task. Wang (2019) aims to prevent generation of sentences not grounded in the data, which does bring the data and generated text into closer alignment, although at the cost of limiting the types of sentence the system is capable of generating.

We hope that comparison with gold standard annotations will lead to improved versions of RG and other metrics, as well as a better understanding of where they succeed and where they fail. If RG could reliably detect name and number errors, but not other categories of error, it would still be useful, provided that this limitation was clearly specified and understood.

5 Future Work

We would like to improve the annotation scheme, and ideally also create a proper annotation tool. In addition to speeding up annotation, such a tool could encourage annotators to follow guidelines such as annotating incorrect days-of-week as Name errors. We would also like to annotate human authored texts using our methodology in order to provide a topline for NLG systems.

We are also planning to run a shared task for accuracy evaluation, where researchers propose faster and cheaper ways of finding accuracy mistakes (either via a human protocol or with a metric), and these techniques are evaluated by seeing how closely they correlate with the gold-standard accuracy evaluation described in this paper. The shared task is described in a companion paper (Reiter and Thomson, 2020).

6 Conclusion

Texts generated by NLG systems need to be accurate, but current neural NLG systems often generate texts with many mistakes (Table 2). In order to fix this problem, we need to be able to measure how accurate texts are. The methodology we present here will allow developers of data-to-text NLG systems to measure the accuracy of texts produced by their systems. It will also make it easier for other researchers to develop cheaper and quicker techniques for measuring accuracy, by giving them a gold standard which they can use to validate their ideas.

7 Acknowledgements

Many thanks to the Mechanical Turk annotators who participated in our experiment, and also to David Reiter, Tim Daniels, Rodrigo de Oliveira, and Andrew Smith for serving as pilot annotators when we were developing the methodology described in this paper. We would also like to thank Moray Greig for being our basketball domain expert during development. We are also grateful for the very helpful comments on this paper from the anonymous reviewers, the Aberdeen CLAN group, David Howcroft, Clément Rebuffel, and Chris van der Lee.

We would also like to thank Sam Wiseman, Ratish Puduppully, and Clément Rebuffel for providing the generated texts from their respective systems.

The work presented here is partially funded by the Engineering and Physical Sciences Research Council (EPSRC), which funds Craig Thomson under a National Productivity Investment Fund Doctoral Studentship (EP/R512412/1).

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Computer Vision – ECCV 2016, pages 382–398, Cham. Springer International Publishing.

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey.

Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895, Florence, Italy. Association for Computational Linguistics.

Ondřej Dušek, David M. Howcroft, and Verena Rieser. 2019. Semantic noise matters for neural natural language generation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 421–426, Tokyo, Japan. Association for Computational Linguistics.

Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the E2E NLG challenge. In Proceedings of the 11th International Conference on Natural Language Generation, pages 322–328, Tilburg University, The Netherlands. Association for Computational Linguistics.

Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. Computer Speech and Language, 59:123–156.

Bern Elliot, Anthony Mullen, Adrian Lee, and Stephen Emmott. 2020. Hype cycle for natural language technologies 2020. Gartner.

Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-text generation with effective hierarchical encoder on three dimensions (row, column and time). In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3143–3152, Hong Kong, China. Association for Computational Linguistics.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.

Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana. Association for Computational Linguistics.

Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii.

Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In Advances in Information Retrieval, pages 65–80, Cham. Springer International Publishing.

Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Ehud Reiter, Somayajulu Sripada, Jim Hunter, Jin Yu, and Ian Davy. 2005. Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1):137–169. Connecting Language to the World.

Ehud Reiter and Craig Thomson. 2020. Shared task on evaluating accuracy. In Proceedings of INLG 2020.

Craig Thomson, Zhijie Zhou, and Somayajulu Gowri Sripada. 2020. Studying the impact of filling information gaps on the output quality of neural data-to-text. In Proceedings of INLG 2020.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9, Brussels, Belgium. Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2019. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 1–6, Hong Kong, China. Association for Computational Linguistics.

Hongmin Wang. 2019. Revisiting challenges in data-to-text generation with fact grounding. In Proceedings of the 12th International Conference on Natural Language Generation, pages 311–322, Tokyo, Japan. Association for Computational Linguistics.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics.

Qian Yang, Rebecca J. Passonneau, and Gerard de Melo. 2016. PEAK: Pyramid evaluation via automated knowledge extraction. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2673–2679. AAAI Press.
