Evaluation Guidelines to Deal with Implicit Phenomena to Assess Factuality in Data-to-Text Generation

Roy Eisenstadt and Michael Elhadad
Ben Gurion University, Computer Science Department, Be'er Sheva, Israel
[email protected]  [email protected]

Abstract

Data-to-text generation systems are trained on large datasets, such as WebNLG, RotoWire, E2E or DART. Beyond traditional token-overlap evaluation metrics (BLEU or METEOR), a key concern faced by recent generators is to control the factuality of the generated text with respect to the input data specification. We report on our experience when developing an automatic factuality evaluation system for data-to-text generation that we are testing on WebNLG and E2E data. We aim to prepare gold data annotated manually to identify cases where the text communicates more information than is warranted based on the input.

Yet, recent work has shown that neural generation models risk creating text that is not faithful to the input data specification (Wang, 2019; Chen et al., 2019a), by either introducing content which is not warranted by the input or failing to express some of the content which is part of the input. In fact, studies indicate that the training datasets also suffer from problematic alignment between data and text (e.g., Rebuffel et al., 2020; Dušek et al., 2019), because large-scale data collection is complex.

In response, new evaluation metrics are being developed to assess the factuality of text with respect to the input data.
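To make the concern concrete, the sketch below contrasts a naive token-overlap score with the kind of unwarranted addition discussed above. It is our own illustration, not the evaluation system described in this paper: the scoring function and the example sentences are hypothetical, loosely modelled on a WebNLG-style triple (Aarhus_Airport, cityServed, Aarhus).

# Illustrative sketch only (not the authors' system): a pure token-overlap
# score cannot, by itself, flag content that the input does not warrant.

def token_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / max(len(cand), 1)

reference = "Aarhus Airport serves the city of Aarhus."
faithful = "Aarhus Airport serves the city of Aarhus."
extended = "Aarhus Airport serves the city of Aarhus in Denmark."  # "in Denmark" is not warranted by the triple

print(round(token_overlap(faithful, reference), 2))  # 1.0
print(round(token_overlap(extended, reference), 2))  # 0.78 -- still high despite the unsupported addition

A factuality-oriented metric, by contrast, must compare the generated text against the input data itself rather than against a reference string, which is what motivates the annotation effort described next.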