A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems Craig Thomson Ehud Reiter University of Aberdeen University of Aberdeen Aberdeen Aberdeen UK UK
[email protected] [email protected] Abstract (hundreds of words) which communicate com- plex data and possibly insights (eg, trends and Most Natural Language Generation systems best/worst) derived from the source data. This is need to produce accurate texts. We propose more challenging task; longer texts can have con- a methodology for high-quality human evalua- tion of the accuracy of generated texts, which textual errors which are rare in short texts, and is intended to serve as a gold-standard for ac- checking accuracy of insights derived from com- curacy evaluations of data-to-text systems. We plex data is harder than checking whether a small use our methodology to evaluate the accuracy number of attributes, for a single entity, are accu- of computer generated basketball summaries. rately communicated. We then show how our gold standard evalua- In this paper we specifically focus on finding ac- tion can be used to validate automated metrics. curacy mistakes in English language sports stories. 1 Introduction However, we believe the techniques could also be applied to other types of texts produced by data- In most contexts, it is essential that texts pro- to-text systems, including financial reports and duced by data-to-text Natural Language Genera- business intelligence, which are very important in tion (NLG) systems accurately communicate in- commercial data-to-text applications (Elliot et al., put data. Hallucination and other forms of inac- 2020).