Revisiting Challenges in Data-to-Text Generation with Fact Grounding
Hongmin Wang
University of California Santa Barbara
hongmin [email protected]

Abstract

Data-to-text generation models face challenges in ensuring data fidelity by referring to the correct input source. To inspire studies in this area, Wiseman et al. (2017) introduced the RotoWire corpus on generating NBA game summaries from the box- and line-score tables. However, limited attempts have been made in this direction and the challenges remain. We observe a prominent bottleneck in the corpus: only about 60% of the summary contents can be grounded to the boxscore records. Such information deficiency tends to misguide a conditioned language model into producing unconditioned random facts and thus leads to factual hallucinations. In this work, we restore the information balance and revamp this task to focus on fact-grounded data-to-text generation. We introduce a purified and larger-scale dataset, RotoWire-FG (Fact-Grounding), with 50% more data from the years 2017-19 and enriched input tables, hoping to attract more research focus in this direction. Moreover, we achieve improved data fidelity over the state-of-the-art models by integrating a new form of table reconstruction as an auxiliary task that boosts the generation quality.

1 Introduction

Data-to-text generation aims at automatically producing descriptive natural language texts that convey the messages embodied in structured data formats, such as database records (Chisholm et al., 2017), knowledge graphs (Gardent et al., 2017a), and tables (Lebret et al., 2016; Wiseman et al., 2017). Table 1 shows an example from the RotoWire¹ (RW) corpus illustrating the task of generating document-level NBA basketball game summaries from the large box- and line-score tables². The task poses great challenges, requiring the capability to select what to say (content selection) on two levels, which entity and which attribute, and to determine how to say it on both the discourse (content planning) and token (surface realization) levels.

Although this excellent resource has received great research attention, very few works (Li and Wan, 2018; Puduppully et al., 2019a,b; Iso et al., 2019) have attempted to tackle the challenges of ensuring data fidelity. This prompted us to investigate the reason behind it, and we identify a major culprit undermining researchers' interest: the ungrounded contents in the human-written summaries impede a model from learning to generate accurate fact-grounded statements and lead to possibly misleading evaluation results when the models are compared against each other.

Specifically, we observe that about 40% of the game summary contents cannot be directly mapped to any input boxscore records, as exemplified by Table 1. Written by professional sports journalists, these statements incorporate domain expertise and background knowledge consolidated from heterogeneous sources that are often hard to trace. The resulting information imbalance hinders a model from producing texts fully conditioned on the inputs, and the uncontrolled randomness causes factual hallucinations, especially in the modern encoder-decoder framework (Sutskever et al., 2014; Cho et al., 2014). However, data fidelity is crucial for data-to-text generation besides fluency: in this real-world application, mistaken statements are detrimental to the document quality no matter how human-like they appear to be.

¹ https://github.com/harvardnlp/boxscore-data
² Box- and line-score tables contain player and team statistics respectively. For simplicity, we call the combined input the boxscore table unless otherwise specified.
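To make the input side of the task concrete, the sketch below shows how one game is typically laid out in the harvardnlp boxscore-data JSON release referenced above. This is a minimal sketch: the field names follow our reading of that public release and should be verified against the repository, and the file path is hypothetical.

import json

# Load one split of the RotoWire release (hypothetical local path).
with open("rotowire/train.json") as f:
    games = json.load(f)            # a list of dicts, one per game

game = games[0]

# Line scores: team-level records for each side, keyed by record type.
home_line = game["home_line"]       # e.g. {"TEAM-PTS": "108", "TEAM-WINS": "18", ...}
vis_line = game["vis_line"]

# Box score: column name -> {row index -> value}, one row per player.
box = game["box_score"]
names = box["PLAYER_NAME"]          # e.g. {"0": "James Harden", "1": "Dwight Howard", ...}
points = box["PTS"]                 # e.g. {"0": "24", "1": "26", ...}

# The human-written summary is stored as a list of tokens.
summary_text = " ".join(game["summary"])

A model thus conditions on a few hundred table records per game, while the reference summary freely mixes these with facts from outside the table; that mismatch is the information imbalance discussed above.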
TEAM           WIN  LOSS  PTS  FG_PCT  BLK  ...
Rockets         18     5  108      44    7  ...
Nuggets         10    13   96      38    7  ...

PLAYER          H/A  PTS  RB  AST  MIN  ...
James Harden      H   24  10   10   38  ...
Dwight Howard     H   26  13    2   30  ...
JJ Hickson        A   14  10    2   22  ...

Column names: H/A: home/away, PTS: points, RB: rebounds, AST: assists, MIN: minutes, BLK: blocks, FG_PCT: field goal percentage.

Summary:
The Houston Rockets (18-5) defeated the Denver Nuggets (10-13) 108-96 on Saturday. Houston has won 2 straight games and 6 of their last 7. Dwight Howard returned to action Saturday after missing the Rockets' last 11 games with a knee injury. He was supposed to be limited to 24 minutes in the game, but Dwight Howard persevered to play 30 minutes and put up a monstrous double-double of 26 points and 13 rebounds. Joining Dwight Howard in on the fun was James Harden with a triple-double of 24 points, 10 rebounds and 10 assists in 38 minutes. The Rockets' formidable defense held the Nuggets to just 38 percent shooting from the field. Houston will face the Nuggets again in their next game, going on the road to Denver for their game on Wednesday. Denver has lost 4 of their last 5 games as they struggle to find footing during a tough part of their schedule ... Denver will begin a 4-game homestead hosting the San Antonio Spurs on Sunday.

An example hallucinated statement:
After going into halftime down by eight, the Rockets came out firing in the third quarter and out-scored the Nuggets 59-42 to seal the victory on the road.

Table 1: An example from the RotoWire corpus. Partial box- and line-score tables are at the top. In the original figure, grounded entities and numerical facts are in bold, and highlighted sentences contain ungrounded numerical facts and team-game-schedule-related statements. A system-generated statement with multiple hallucinations is shown at the bottom.

Apart from the popular BLEU (Papineni et al., 2002) metric for text generation, Wiseman et al. (2017) also formalized a set of post-hoc information-extraction (IE) based evaluations to assess data fidelity. Using the boxscore table schema, the sequence of (entity, value, type) records mentioned in a system-generated summary is extracted as its content plan. These records are then validated for accuracy against the boxscore table and for similarity with the content plan extracted from the human-written summary. However, hallucinated facts may unrealistically boost the BLEU score while going unpenalized by the data fidelity metrics, since no records can be identified from the ungrounded contents. Thus, the possibly misleading evaluation results inhibit systems from demonstrating excellence on this task.
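As a minimal illustration of the validation step in this IE-based protocol, the toy sketch below checks extracted records against the gold table, in the spirit of the relation generation (RG) precision metric of Wiseman et al. (2017). The extractor itself is a learned classifier in their work and is abstracted away here; the record values and type names are purely illustrative.

from typing import List, Set, Tuple

# A record is an (entity, value, type) triple, e.g. ("Dwight Howard", "26", "PTS").
Record = Tuple[str, str, str]

def relation_accuracy(extracted: List[Record], table: Set[Record]) -> float:
    """Fraction of extracted records supported by the gold table (RG precision)."""
    if not extracted:
        return 0.0
    return sum(1 for r in extracted if r in table) / len(extracted)

# Gold records from the Table 1 game (a small illustrative subset).
table = {
    ("Dwight Howard", "26", "PTS"), ("Dwight Howard", "13", "RB"),
    ("James Harden", "24", "PTS"), ("James Harden", "10", "AST"),
}
# Records extracted from a generated summary.
extracted = [
    ("Dwight Howard", "26", "PTS"),
    ("Rockets", "59", "PTS-QTR3"),  # the hallucinated third-quarter score
]
print(relation_accuracy(extracted, table))  # 0.5

Note that the hallucinated third-quarter score is caught only because it still surfaces as an extractable record; an ungrounded sentence that yields no record at all leaves these metrics untouched, which is exactly the evaluation gap described above.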
These two aspects potentially undermine people's interest in this data-fidelity-oriented table-to-text generation task. Therefore, in this work, we revamp the task to emphasize this core aspect and further enable research in this direction. First, we restore the information balance by trimming the summaries of ungrounded contents and replenishing the boxscore table to compensate for missing inputs. This requires the non-trivial extraction of the latent gold-standard content plans with high quality. We therefore took the effort to design sophisticated heuristics and achieved an estimated 98% precision and 95% recall of the true content plans, retaining 74% of the numerical words in the summaries. This yields better content plans than the 94% precision and 80% recall of Puduppully et al. (2019b) and the 60% retainment of Wiseman et al. (2017). Guided by the high-quality content plans, only fact-grounded contents are identified and retained, as shown in Table 1. Furthermore, by expanding the data with 50% more games from the years 2017-19, we obtain the more focused RotoWire-FG (RW-FG) dataset. This leads to more accurate evaluations and collectively paves the way for future work by providing a more user-friendly alternative.

With this refurbished setup, the existing models are re-assessed on their abilities to ensure data fidelity. We discover that by only purifying the RW dataset, the models can generate more precise facts without sacrificing fluency. Furthermore, we propose a new form of table reconstruction as an auxiliary task to improve fact grounding. By incorporating it into the state-of-the-art Neural Content Planning (NCP) model (Puduppully et al., 2019a), we establish a benchmark on the RW-FG dataset with a 24.41 BLEU score and 95.7% factual accuracy. Finally, these insights lead us to summarize several fine-grained future challenges, based on concrete examples, regarding factual accuracy and intra- and inter-sentence coherence.

Our contributions include:
1. We introduce a purified, enlarged and enriched new dataset to support the more focused fact-grounded table-to-text generation task. We provide high-quality mappings from summary facts to table records (content plans) and a more user-friendly experimental setup. All code and data are freely available³.
2. We re-investigate existing methods with more insights, establish a new benchmark on this task, and uncover more fine-grained challenges to encourage future research.

³ https://github.com/wanghm92/rw_fg

Type      His   Sch  Agg  Game   Inf
Count      69    33    9    23    23
Percent  43.9  21.0  5.7  14.7  14.7

Table 2: Types of ungrounded contents about statistics. His: history (e.g., recent-game/career high/average); Sch: team schedule (e.g., what the next game is); Agg: aggregation of statistics from multiple players (e.g., the duo of two stars combined scoring ...); Game: during the game (e.g.

... thus defeat the purpose of developing and evaluating models for fact-grounded table-to-text generation. Thus, we emphasize this core aspect of the task by trimming contents not licensed by the boxscore table, which we show later still encompasses many fine-grained challenges awaiting resolution. While fully restoring all desired inputs is also an interesting research challenge, it