On Faithfulness and Factuality in Abstractive Summarization

Joshua Maynez∗  Shashi Narayan∗  Bernd Bohnet  Ryan McDonald
Google Research
{joshuahm,shashinarayan,bohnetbd,ryanmcd}@google.com
Abstract

It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models lead to less human-like responses for open-ended tasks such as language modeling and story generation. In this paper we have analyzed limitations of these models for abstractive document summarization and found that these models are highly prone to hallucinate content that is unfaithful to the input document. We conducted a large scale human evaluation of several neural abstractive summarization systems to better understand the types of hallucinations they produce. Our human annotators found substantial amounts of hallucinated content in all model generated summaries. However, our analysis does show that pretrained models are better summarizers not only in terms of raw metrics, i.e., ROUGE, but also in generating

understanding how maximum likelihood training and approximate beam-search decoding in these models lead to less human-like text in open-ended text generation such as language modeling and story generation (Holtzman et al., 2020; Welleck et al., 2020; See et al., 2019). In this paper we investigate how these models are prone to generate hallucinated text in conditional text generation, specifically, extreme abstractive document summarization (Narayan et al., 2018a).

Document summarization — the task of producing a shorter version of a document while preserving its information content (Mani, 2001; Nenkova and McKeown, 2011) — requires models to generate text that is not only human-like but also faithful and/or factual given the document. The example in Figure 1 illustrates that the faithfulness and factuality are yet to be conquered by conditional text generators.
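As a concrete illustration of the generation setup discussed above (a conditional sequence-to-sequence model trained with maximum likelihood and decoded with approximate beam search), the following sketch assumes the Hugging Face transformers library and the publicly available XSum-finetuned checkpoint facebook/bart-large-xsum; it is illustrative only and is not one of the systems evaluated in this paper.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: any maximum-likelihood-trained, XSum-finetuned seq2seq checkpoint
# would serve equally well here; this one is used only because it is public.
MODEL_NAME = "facebook/bart-large-xsum"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

document = "..."  # full text of the news article to be summarized

# Encode the document, truncating to the model's maximum input length.
inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")

# Approximate beam-search decoding of a short, XSum-style one-sentence summary.
summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=60,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

Because decoding approximately maximizes the model's likelihood rather than faithfulness to the source, nothing in this pipeline prevents the returned summary from containing hallucinated content.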