Arxiv:2003.08612V3 [Cs.CL] 4 Apr 2020 Used

Boosting Factual Correctness of Abstractive Summarization Chenguang Zhu1, William Hinthorn1, Ruochen Xu1, Qingkai Zeng2, Michael Zeng1, Xuedong Huang1, Meng Jiang2 1Microsoft Speech and Dialogue Research Group 2University of Notre Dame fchezhu,wihintho,ruox,nzeng,[email protected] fqzeng,[email protected] Abstract Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator A commonly observed problem with abstrac- Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its tive summarization is the distortion or fabrica- Article sequel, has died... Lewis, the father of tion of factual information in the article. This Oscar-nominated actress Juliette Lewis, died inconsistency between summary and original Tuesday. Lewis began his long association text has led to various concerns over its ap- with Eastwood in “High Plains Drifter”... plicability. In this paper, we firstly propose Geoffrey Lewis began his long association OTTOM P with Eastwood in “Every which Way but a Fact-Aware Summarization model, FASUM, B U Loose”... which extracts factual relations from the arti- It’s the father of Oscar-nominated actress cle and integrates this knowledge into the de- FASUM Juliette Lewis. Lewis began his long coding process via neural graph computation. (Ours) association with Eastwood in “High Plains Then, we propose a Factual Corrector model, Drifter”... FC, that can modify abstractive summaries Oscar-nominated actress Juliette Lewis SEQ2SEQ dies at 79. Lewis starred in “High Plains generated by any model to improve factual cor- Drifter”... rectness. Empirical results show that FASUM generates summaries with significantly higher Table 1: Example article and summary excerpts from factual correctness compared with state-of-the- CNN/DailyMail dataset with summaries generated by art abstractive summarization systems, both BOTTOMUP and our model FASUM.SEQ2SEQ indi- under an independently trained factual correct- cates our model without the knowledge graph compo- ness evaluator and human evaluation. And FC nent. Factual errors in summaries are marked in red improves the factual correctness of summaries and the corresponding facts in the article are marked in generated by various models via only modify- green. ing several entity tokens. 1 Introduction et al., 2019). This brings serious problems to the Text summarization models aim to produce an credibility and usability of abstractive summariza- abridged version of long text while preserving tion systems. Table 1 demonstrates an example salient information. Abstractive summarization is article and excerpts of generated summaries. As a type of such models that can freely generate sum- shown, the article mentions that the actor Geof- maries, with no constraint on the words or phrases fery Lewis began his association with Eastwood arXiv:2003.08612v3 [cs.CL] 4 Apr 2020 used. This format is closer to human-edited sum- in “High Plains Drifter”, while the summary from maries and is both flexible and informative. Thus, BOTTOMUP (Gehrmann et al., 2018) picks the there are numerous literature on producing abstrac- wrong movie title from another place in the article. tive summaries (See et al., 2017; Paulus et al., 2017; Also, the article says that the deceased Geoffery Dong et al., 2019; Gehrmann et al., 2018). Lewis is the father of Juliette Lewis. However, an However, one prominent issue with abstractive ablated version of our model without leveraging the summarization is factual inconsistency. It refers factual knowledge graph in the article, denoted by to the phenomenon that the summary sometimes SEQ2SEQ, wrongly expresses that Juliette passed distorts or fabricates the facts in the article. Re- away. Comparatively, our model FASUM generates cent studies show that up to 30% of the summaries a summary that correctly exhibits the relationship generated by abstractive models contain such fac- in the article. tual inconsistencies (Krysci´ nski´ et al., 2019b; Falke On the other hand, most existing abstractive sum- The flame of remembrance burns in Jerusalem, and a song of memory haunts Valerie Braham as it never has before. This year, Israel’s Memorial Day commemoration is for bereaved family members such as Braham. “Now I truly understand everyone who has lost a loved one,” Braham said. Her husband, Philippe Braham, Article was one of 17 people killed in January’s terror attacks in Paris... As Israel mourns on the nation’s remembrance day, French Prime Minister Manuel Valls announced after his weekly Cabinet meeting that French authorities had foiled a terror plot... Valerie Braham was one of 17 people killed in January ’s terror attacks in Paris. France’s memorial day BOTTOMUP commemoration is for bereaved family members as Braham. Israel’s Prime Minister says the terror plot has not been done. Philippe Braham was one of 17 people killed in January’s terror attacks in Paris. Israel’s memorial day Corrected by FC commemoration is for bereaved family members as Braham. France’s Prime Minister says the terror plot has not been done. Table 2: Example excerpts of an article from CNN/DailyMail and the summary from generated by BOTTOMUP. The summary contains a number of factual errors (marked in red). After modification by our model FC, all the errors have been corrected (marked in green). marization models apply a conditional language selectively produce contents either from the dic- model to focus on the token-level accuracy of tionary or from the graph entities. We denote this summaries, while neglecting semantic-level consis- model as a Fact-Aware Summarization model, FA- tency between the summary and article. Therefore, SUM. the generated summaries are often high in token- In addition, to utilize existing summarization level metrics like ROUGE (Lin, 2004) but lack systems, we propose a Factual Corrector model, factual correctness. In view of this, we argue that FC, to help improve the factual correctness of any a robust abstractive summarization system must given summary. We frame the correction process be equipped with factual knowledge to accurately as a seq2seq problem: the input is the original sum- summarize the article. mary and the article, and the output is the corrected In this paper, we represent facts in the form of summary. Thus, we leverage the large-scale pre- a knowledge graph. Although there are numerous trained language generation model UniLM (Dong efforts in building commonly applicable knowledge et al., 2019). We finetune the model on summariza- networks to facilitate knowledge extraction and tion datasets with correct/wrong summaries auto- integration, such as ConceptNet (Speer et al., 2017) matically generated by backtranslation/randomly and WikiData, we find that these tools are more swapped entities. Therefore, the training data does useful in conferring commonsense knowledge. In not require additional human labelling. As shown abstractive summarization for contents like news in Table 2, FC makes three corrections, replacing articles, many entities and relations are previously the original wrong entities which appear elsewhere unseen. Plus, our goal is to produce summaries that in the article with the right ones. do not conflict with the facts in the article. Thus, we propose to extract factual knowledge from within In the experiments on benchmark summariza- the article. tion datasets, the FASUM model shows great im- provements in the factual correctness of generated We employ the information extraction (IE) tool summaries. Using an independently trained BERT- OpenIE (Angeli et al., 2015) to extract facts from based factual correctness evaluator (Krysci´ nski´ the article in the form of relational tuples: (subject, et al., 2019b), we find that on CNN/DailyMail, FA- relation, object). We then use the Levi transforma- SUM obtains 1.2% higher fact correctness scores tion (Levi, 1942) to convert each tuple component than UNILM (Dong et al., 2019) and 4.5% higher into a node in the knowledge graph. This graph than BOTTOMUP (Gehrmann et al., 2018). On contains the facts in the article and is integrated in the other hand, FC can effectively improve the the summary generation process. factual correctness of given summaries: after cor- Then, we use a graph attention network rection, the factual score of summaries from BOT- (Velickoviˇ c´ et al., 2017) to obtain the representation TOMUP increases 1.7% on CNN/DailyMail and of each node, and fuse that into a transformer-based 1.2% on XSum, and the score of summaries from encoder-decoder architecture. The decoder attends TCONVS2S increases 3.9% on XSum. Human to each graph node in every transformer block. Fi- evaluation further corroborates the effectiveness of nally, it leverages the copy-generate mechanism to our models. We further propose an easy-to-compute metric, state-of-the-art results on various tasks including matched relation tuples, to evaluate the factual pre- abstractive summarization. cision of summaries without using ground-truth labels. This is an extension to the metric proposed 2.2 Fact-Aware Summarization in Goodrich et al. (2019). Under this metric, we Since the task of judging whether a summary is show that FASUM can obtain a higher score than other baselines and FC can help boost the score consistent with an article is similar to the text en- of summaries from other abstractive models. Fur- tailment problem, several recent work use text entailment models to evaluate and boost the factual thermore, we quantitatively show that FASUM is a highly abstractive summarization model. correctness of summarization. Li et al. (2018) co- In summary, our contribution in this paper is trains summarization and entailment for the en- three-fold: coder and employs an entailment-aware decoder via Reward Augmented Maximum Likelihood 1. We propose FASUM, which conducts neural (RAML). Falke et al. (2019) proposes using off- graph computation on the factual knowledge the-shelf entailment models to rerank candidate graph and integrate it into an end-to-end pro- summary sentences to boost factual correctness. cess of summary generation. Apart from using entailment models for factual correctness, Zhang et al.

Load more