Arxiv:2010.04529V1 [Cs.CL] 9 Oct 2020

What Have We Achieved on Text Summarization? Dandan Huang1,2∗, Leyang Cui1,2,3∗, Sen Yang1,2∗, Guangsheng Bao1,2, Kun Wang, Jun Xie4, Yue Zhang1,2y 1 School of Engineering, Westlake University 2 Institute of Advanced Technology, Westlake Institute for Advanced Study 3 Zhejiang University, 4 Tencent SPPD fhuangdandan, cuileyang, yangsen, [email protected], [email protected], [email protected], [email protected] Abstract been investigated for both extractive (Cheng and Lapata, 2016; Xu and Durrett, 2019) and abstrac- Deep learning has led to significant improve- ment in text summarization with various meth- tive (Nallapati et al., 2016; Lewis et al., 2019; Bal- ods investigated and improved ROUGE scores achandran et al., 2020) summarization systems. reported over the years. However, gaps still Although improved ROUGE scores have been exist between summaries produced by auto- reported on standard benchmarks such as Giga- matic summarizers and human professionals. word (Graff et al., 2003), NYT (Grusky et al., Aiming to gain more understanding of summa- 2018) and CNN/DM (Hermann et al., 2015) over rization systems with respect to their strengths the years, it is commonly accepted that the quality and limits on a fine-grained syntactic and se- mantic level, we consult the Multidimensional of machine-generated summaries still falls far be- Quality Metric1 (MQM) and quantify 8 ma- hind human written ones. As a part of the reason, jor sources of errors on 10 representative sum- ROUGE has been shown insufficient as a precise marization models manually. Primarily, we indicator on summarization quality evaluation (Liu find that 1) under similar settings, extractive and Liu, 2008;B ohm¨ et al., 2019). In the research summarizers are in general better than their literature, human evaluation has been conducted abstractive counterparts thanks to strength in as a complement (Narayan et al., 2018). However, faithfulness and factual-consistency; 2) milestone techniques such as copy, coverage and human evaluation reports that accompany ROUGE hybrid extractive/abstractive methods do bring scores are limited in scope and coverage. On a specific improvements but also demonstrate fine-grained level, it still remains uncertain what limitations; 3) pre-training techniques, and in we have achieved overall and what fundamental particular sequence-to-sequence pre-training, changes each milestone technique has brought. are highly effective for improving text summa- We aim to address the above issues by quantify- rization, with BART giving the best results. ing the primary sources of errors over representa- 1 Introduction tive models. In particular, following MQM (Mar- iana, 2014), we design 8 metrics on the Accuracy Automatic text summarization has received con- and Fluency aspects. Models are analyzed by the stant research attention due to its practical impor- overall error counts on a test set according to each tance. Existing methods can be categorized into metric, and therefore our evaluation can be more arXiv:2010.04529v1 [cs.CL] 9 Oct 2020 extractive (Dorr et al., 2003; Mihalcea and Tarau, informative and objective compared with exist- 2004; Nallapati et al., 2017) and abstractive (Jing ing manual evaluation reports. We call this set and McKeown, 2000; Rush et al., 2015; See et al., of metrics PolyTope. Using PolyTope, we manu- 2017) methods, with the former directly selecting ally evaluate 10 text summarizers including Lead-3, phrases and sentences from the original text as sum- TextRank (Mihalcea and Tarau, 2004), Sequence- maries, and the latter synthesizing an abridgment to-sequence with Attention (Rush et al., 2015), by using vocabulary words. Thanks to the resur- SummaRuNNer (Nallapati et al., 2017), Point- gence of deep learning, neural architectures have Generator (See et al., 2017), Point-Generator-with- ∗ Equal contribution. Coverage (Tu et al., 2016; See et al., 2017), Bottom- y Corresponding author. Up (Gehrmann et al., 2018), BertSumExt (Liu and 1 MQM is a framework for declaring and describing human writing quality which stipulates a hierarchical listing of error Lapata, 2019), BertSumExtAbs (Liu and Lapata, types restricted to human writing and translation. 2019) and BART (Lewis et al., 2019), through which we compare neural structures with tradi- et al., 2017; Xu and Durrett, 2019). Most recently, tional preneural ones, and abstractive models with Zhong et al.(2020) adopts document-level features their extractive counterparts, discussing the effec- to rerank extractive summaries. tiveness of frequently-used techniques in summarization systems. Empirically, we find that: Abstractive Summarization Jing and McKe- own (2000) presented a cut-paste based abstractive 1. Preneural vs Neural: Traditional rule-based summarizer, which edited and merged extracted methods are still strong baselines given pow- snippets into coherent sentences. Rush et al. (2015) erful neural architectures. proposed a sequence-to-sequence architecture for 2. Extractive vs Abstractive: Under similar set- abstractive summarization. Subsequently, Trans- tings, extractive approaches outperform ab- former was used and outperformed traditional abstractive models in general. The main short- stractive summarizer by ROUGE scores (Duan coming is unnecessity for extractive models, et al., 2019). Techniques such as AMR pars- and omission and intrinsic hallucination for ing (Liu et al., 2015), copy (Gu et al., 2016), cov- abstractive models. erage (Tu et al., 2016; See et al., 2017), smooth- ing (Muller¨ et al., 2019) and pre-training (Lewis 3. Milestone Techniques: Copy works effec- et al., 2019; Liu and Lapata, 2019) were also ex- tively in reproducing details. It also reduces amined to enhance summarization. Hybrid abstrac- duplication on the word level but tends to tive and extractive methods adopt a two-step ap- cause redundancy to a certain degree. Cover- proach including content selection and text gen- age solves repetition errors by a large margin, eration (Gehrmann et al., 2018; Hsu et al., 2018; but shows limits in faithful content generation. Celikyilmaz et al., 2018), achieving higher perfor- Hybrid extractive/abstractive models reflect mance than end-to-end models in ROUGE. the relative strengths and weaknesses of the two methods. Analysis of Summarization There has been much work analyzing summarization systems 4. Pre-training: Pre-training is highly effective based on ROUGE. Lapata and Barzilay (2005) ex- for summarization, and even achieves a bet- plored the fundamental aspect of “coherence” in ter content selection capability without copy machine generated summaries. Zhang et al. (2018) and coverage mechanisms. Particularly, joint analyzed abstractive systems, while Kedzie et pre-training combining text understanding and al. (2018) and Zhong et al. (2019) searched for generation gives the most salient advantage, effective architectures in extractive summarization. with the BART model achieving by far the Kryscinski et al. (2019) evaluated the overall qual- state-of-the-art results on both automatic and ity of summarization in terms of redundancy, rele- our human evaluations. vance and informativeness. All the above rely on automatic evaluation metrics. Our work is in line We release the test set, which includes 10 system with these efforts in that we conduct a fine-grained outputs and their manually-labeled errors based on evaluation on various aspects. Different from the PolyTope, and a user-friendly evaluation toolkit to above work, we use human evaluation instead of help future research both on evaluation methods automatic evaluation. In fact, while yielding rich and automatic summarization systems2. conclusions, the above analytical work has also 2 Related Work exposed deficiencies of automatic toolkits. The quality of automatic evaluation is often criticized Extractive Summarization Early efforts based by the research community (Novikova et al., 2017; on statistical methods (Neto et al., 2002; Mihalcea Zopf, 2018) for its insufficiency in neither perme- and Tarau, 2004) make use of expertise knowledge ating into the overall quality of generation-based to manually design features or rules. Recent work texts (Liu and Liu, 2008) nor correlating with hu- based on neural architectures considers summa- man judgements (Kryscinski et al., 2019). rization as a word or sentence level classification There has also been analysis work augmenting problem and addresses it by calculating sentence ROUGE with human evaluation (Narayan et al., representations (Cheng and Lapata, 2016; Nallapati 2018; Liu and Lapata, 2019). Such work reports 2 https://github.com/hddbang/PolyTope coarse-grained human evaluation scores which typ- Methods Extractive Methods Abstractive Methods ROUGE Lead-3 TextRank Summa BertExt S2S PG PG-Coverage Bottom-Up BertAbs BART ROUGE-1 39.20 40.20 39.60 43.25 31.33 36.44 39.53 41.22 42.13 44.16 ROUGE-2 15.70 17.56 16.20 20.24 11.81 15.66 17.28 18.68 19.60 21.28 ROUGE-L 35.50 36.44 35.30 39.63 28.80 33.42 36.38 38.34 39.18 40.90 Table 1: ROUGE scores of 10 summarizers on CNN/DM Dataset (non-anonymous version). We get the score of Lead-3 and TextRank from Nallapati et al.(2017) and Zhou et al.(2018), respectively. ically consist of 2 to 3 aspects such as informative- summary. It is used as a standard baseline by most ness, fluency and succinctness. Recently, Maynez recent work (Cheng and Lapata, 2016; Gehrmann et al.(2020) conducted a human evaluation of 5 et al., 2018). Intuitively, the first three sentences of neural abstractive models on 500 articles. Their an article in news domain can likely be its abstract, main goal is to verify the faithfulness and factual- so the results of Lead-3 can be a highly faithful ity in abstractive models. In contrast, we evaluate approximation of human-written summary. both rule-based baselines and extractive/abstractive TextRank TextRank (Mihalcea and Tarau, 2004) summarizers on 8 error metrics, among which faith- is an unsupervised key text units selection method fulness and factuality are included. based on graph-based ranking models (Page et al., Our work is also related to research on human 1998).

Arxiv:2010.04529V1 [Cs.CL] 9 Oct 2020

Sports 28-9-2017.Qxp Layout 1

United States Securities and Exchange Commission Form

Man City Missed Penalty

Topps - UEFA Champions League Match Attax 2015/16 (08) - Checklist

Market Segmentation and Selective Exposure in Online News

2016 Panini Flawless Soccer Cards Checklist

Hairy Bikers Best of British Episode Guide

Estate Planning Mistakes

Team-Building Works Wonders for PSG As They Make Maiden Champions League Final

Ligue 1 Guide 2021/22

KARAOKE CATALOGUE: K, L Artist Song Title K

Whitney Houston's Millions: What Bobbi Kristina Brown May Leave