What Have We Achieved on Text Summarization?

Dandan Huang1,2∗, Leyang Cui1,2,3∗, Sen Yang1,2∗, Guangsheng Bao1,2, Kun Wang, Jun Xie4, Yue Zhang1,2†
1 School of Engineering, Westlake University
2 Institute of Advanced Technology, Westlake Institute for Advanced Study
3 Zhejiang University, 4 Tencent SPPD
{huangdandan, cuileyang, yangsen, [email protected]
[email protected]
[email protected]
[email protected]

Abstract

Deep learning has led to significant improvement in text summarization, with various methods investigated and improved ROUGE scores reported over the years. However, gaps still exist between summaries produced by automatic summarizers and human professionals. Aiming to gain more understanding of summarization systems with respect to their strengths and limits on a fine-grained syntactic and semantic level, we consult the Multidimensional Quality Metric (MQM) and quantify 8 major sources of errors on 10 representative summarization models manually. Primarily, we find that 1) under similar settings, extractive

been investigated for both extractive (Cheng and Lapata, 2016; Xu and Durrett, 2019) and abstractive (Nallapati et al., 2016; Lewis et al., 2019; Balachandran et al., 2020) summarization systems. Although improved ROUGE scores have been reported on standard benchmarks such as Gigaword (Graff et al., 2003), NYT (Grusky et al., 2018) and CNN/DM (Hermann et al., 2015) over the years, it is commonly accepted that the quality of machine-generated summaries still falls far behind human written ones. As a part of the reason, ROUGE has been shown insufficient as a precise indicator on summarization quality evaluation (Liu and Liu, 2008; Böhm et al., 2019).