Question Answering as an Automatic Evaluation Metric for News Article Summarization
Matan Eyal (1,2), Tal Baumel (1,3), Michael Elhadad (1)
(1) Dept. Computer Science, Ben Gurion University; (2) IBM Research, Israel; (3) Microsoft
{mataney, elhadad}@cs.bgu.ac.il, [email protected]

Abstract

Recent work in the field of automatic summarization and headline generation focuses on maximizing ROUGE scores for various news datasets. We present an alternative, extrinsic, evaluation metric for this task, Answering Performance for Evaluation of Summaries. APES utilizes recent progress in the field of reading comprehension to quantify the ability of a summary to answer a set of manually created questions regarding central entities in the source article. We first analyze the strength of this metric by comparing it to known manual evaluation metrics. We then present an end-to-end neural abstractive model that maximizes APES, while increasing ROUGE scores to competitive results.

Figure 1: Example 3083 from the test set.
See et al. (2017)'s Summary: bolton will offer new contracts to emile heskey, 37, eidur gudjohnsen, 36, and adam bogdan, 27. heskey and gudjohnsen joined on short-term deals in december. eidur gudjohnsen has scored five times in the championship.
APES score: 0.33
Baseline Model Summary (Encoder / Decoder / Attention / Copy / Coverage): bolton will offer new contracts to emile heskey, 37, eidur gudjohnsen, 36, and goalkeeper adam bogdan, 27. heskey and gudjohnsen joined on short-term deals in december, and have helped neil lennon 's side steer clear of relegation. eidur gudjohnsen has scored five times in the championship, as well as once in the cup this season.
APES score: 0.33
Our Model (APES optimization): bolton will offer new contracts to emile heskey, 37, eidur gudjohnsen, 36, and goalkeeper adam bogdan, 27. heskey joined on short-term deals in december, and have helped neil lennon 's side steer clear of relegation. eidur gudjohnsen has scored five times in the championship, as well as once in the cup this season. lennon has also fined midfielders barry bannan and neil danns two weeks wages this week. both players have apologised to lennon.
APES score: 1.00
Questions from the CNN/Daily Mail Dataset:
Q: goalkeeper ___ also rewarded with new contract; A: adam bogdan
Q: ___ and neil danns both fined by club after drinking incident; A: barry bannan
Q: barry bannan and ___ both fined by club after drinking incident; A: neil danns

1 Introduction

The task of automatic text summarization aims to produce a concise version of a source document while preserving its central information. Current summarization models are divided into two approaches, extractive and abstractive. In extractive summarization, summaries are created by selecting a collection of key sentences from the source document (e.g., Nallapati et al. (2017); Narayan et al. (2018)). Abstractive summarization, on the other hand, aims to rephrase and compress the input text in order to create the summary. Progress in sequence-to-sequence models (Sutskever et al., 2014) has led to recent success in abstractive summarization models. Current models (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017; Celikyilmaz et al., 2018) made various adjustments to sequence-to-sequence models to gain improvements in ROUGE (Lin, 2004) scores.

ROUGE has achieved its status as the most common method for summaries evaluation by showing high correlation to manual evaluation methods, e.g., the Pyramid method (Nenkova et al., 2007). Tasks like TAC AESOP (Owczarzak and Dang, 2011) used ROUGE as a strong baseline and confirmed the correlation of ROUGE with manual evaluation.

While it has been shown that ROUGE is correlated to Pyramid, Louis and Nenkova (2013) show that this summary-level correlation decreases significantly when only a single reference is given. In contrast to the smaller manually curated DUC datasets used in the past, more recent large-scale summarization and headline generation datasets (CNN/Daily Mail (Hermann et al., 2015), Gigaword (Graff et al., 2003), New York Times (Sandhaus, 2008)) provide only a single reference summary for each source document. In this work, we introduce a new automatic evaluation metric more suitable for such single-reference news article datasets.

We define APES, Answering Performance for Evaluation of Summaries, a new metric for automatically evaluating summarization systems by querying summaries with a set of questions central to the input document (see Fig. 1).

Reducing the task of summaries evaluation to an extrinsic task such as question answering is intuitively appealing. This reduction, however, is effective only under specific settings: (1) availability of questions focusing on central information and (2) availability of a reliable question answering (QA) model.

Concerning issue 1, questions focusing on salient entities can be available as part of the dataset: the headline generation dataset most used in recent years, the CNN/Daily Mail dataset (Hermann et al., 2015), was constructed by creating questions about entities that appear in the reference summary. Since the target summary contains salient information from the source document, we consider all entities appearing in the target summary as salient entities. In other cases, salient questions can be generated in an automated manner, as we discuss below.
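To make this question construction concrete, the following is a minimal sketch of how fill-in-the-blank questions can be derived from a reference summary by blanking out each salient entity in turn. It assumes the salient entities are already available as plain strings (in the actual CNN/Daily Mail data they come from the dataset's entity annotations); the function name, the "___" blank marker, and the simple string matching are illustrative choices, not the dataset's original construction code.

```python
# Illustrative sketch: build CNN/Daily Mail-style cloze questions by masking,
# one at a time, each salient entity found in the reference summary.
# The example summary and entity list below are taken from Figure 1.

def make_cloze_questions(reference_summary, salient_entities, blank="___"):
    """Return (question, answer) pairs, one per salient entity found in the summary."""
    questions = []
    for entity in salient_entities:
        if entity in reference_summary:
            # Blank out only this entity; the other entities stay visible as context.
            question = reference_summary.replace(entity, blank, 1)
            questions.append((question, entity))
    return questions


if __name__ == "__main__":
    summary = "barry bannan and neil danns both fined by club after drinking incident"
    entities = ["barry bannan", "neil danns"]
    for q, a in make_cloze_questions(summary, entities):
        print(f"Q: {q}; A: {a}")
    # Q: ___ and neil danns both fined by club after drinking incident; A: barry bannan
    # Q: barry bannan and ___ both fined by club after drinking incident; A: neil danns
```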
Concerning issue 2, we focus on a relatively easy type of questions: given source documents and associated questions, a QA system can be trained over fill-in-the-blank type questions, as was shown in Hermann et al. (2015) and Chen et al. (2016). In their work, Chen et al. (2016) achieve 'ceiling performance' for the QA task on the CNN/Daily Mail dataset. We empirically assess in our work whether this performance level (accuracy of 72.4 and 75.8 over CNN and Daily Mail, respectively) makes our evaluation scheme feasible and well correlated with manual summary evaluation.

Given the availability of salient questions and automatic QA systems, we propose APES as an evaluation metric for news article datasets, the most popular summarization genre in recent years. To measure the APES metric of a candidate summary, we run a trained QA system with the summary as input, alongside a set of questions associated with the source document. The APES metric for a summarization model is the percentage of questions that were answered correctly over the whole dataset, as depicted in Fig. 2. We leave the task of extending this method to other genres for future work.

Figure 2: Evaluation flow of APES.
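The evaluation flow depicted in Fig. 2 amounts to a simple loop: answer every question using only the candidate summary as reading material, then report the percentage answered correctly. Below is a minimal sketch of that computation, assuming an answer_fn(context, question) callable for the trained QA model and exact-match scoring; both are illustrative assumptions rather than the interface of the released evaluation library.

```python
# Minimal sketch of computing APES over a dataset (illustrative only; the
# released evaluation library at www.github.com/mataney/APES is the
# reference implementation).
from typing import Callable, Iterable, List, Tuple


def apes_score(
    summaries: Iterable[str],
    question_sets: Iterable[List[Tuple[str, str]]],  # per-document (question, gold answer) pairs
    answer_fn: Callable[[str, str], str],            # assumed QA interface: (context, question) -> answer
) -> float:
    """Return the percentage of questions answered correctly from the summaries alone."""
    correct, total = 0, 0
    for summary, questions in zip(summaries, question_sets):
        for question, gold in questions:
            prediction = answer_fn(summary, question)
            correct += int(prediction.strip().lower() == gold.strip().lower())
            total += 1
    return 100.0 * correct / total if total else 0.0
```

Running the loop over the whole test set gives the system-level score reported for a summarization model; restricting it to a single document's questions yields the per-summary scores shown in Fig. 1.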
Our contributions in this work are: (1) we first present APES, a new extrinsic summarization evaluation metric; (2) we show APES's strength through an analysis of its correlation with the Pyramid and Responsiveness manual metrics; (3) we present a new abstractive model which maximizes APES by increasing attention scores of salient entities, while increasing ROUGE to a competitive level. We make two software packages available online: (a) an evaluation library which receives the same input as ROUGE and produces both APES and ROUGE scores (www.github.com/mataney/APES); (b) our PyTorch (Paszke et al., 2017) based summarizer that optimizes APES scores, together with trained models (www.github.com/mataney/APES-optimizer).

2 Related Work

2.1 Evaluation Methods

Automatic evaluation metrics of summarization methods can be categorized into either intrinsic or extrinsic metrics. Intrinsic metrics measure a summary's quality by measuring its similarity to a manually produced target gold summary or by inspecting properties of the summary. Examples of such metrics include ROUGE (Lin, 2004), Basic Elements (Hovy et al., 2006) and Pyramid (Nenkova et al., 2007). Alternatively, extrinsic metrics test the ability of a summary to support performing related tasks, and compare the performance of humans or systems when completing a task that requires understanding the source document (Steinberger and Ježek, 2012). Such extrinsic tasks may include text categorization, information retrieval, question answering (Jing et al., 1998) or assessing the relevance of a document to a query (Hobson et al., 2007).

ROUGE, or "Recall-Oriented Understudy for Gisting Evaluation" (Lin, 2004), refers to a set of automatic intrinsic metrics for evaluating automatic summaries. ROUGE-N scores a candidate summary by counting the number of N-gram overlaps between the automatic summary and the reference summaries. Other notable metrics from this family are ROUGE-L, where scores are given by the Longest Common Subsequence (LCS) between the suggested and reference documents, and ROUGE-SU4, which uses skip-bigrams, a more flexible method for computing the overlap of bigrams.

The Pyramid method (Nenkova et al., 2007) is a manual evaluation metric that analyzes multiple human-made summaries into "Summary Content Units" (SCUs) and assigns importance weights to each SCU. Other manual criteria assess linguistic quality: grammaticality, coherence and structure, focus, referential clarity, and non-redundancy. Although some automatic methods were suggested as summarization evaluation metrics (Vadlapudi and Katragadda, 2010; Tay et al., 2017), these metrics are commonly assessed manually and, therefore, rarely reported as part of experiments.

Our proposed evaluation method, APES, attempts to capture the capability of a summary to enable readers to answer questions, similar to the manual task initially discussed in Jing et al. (1998) and recently reported in Narayan et al. (2018). Our contribution consists of automating this method and assessing the feasibility of the resulting approximation.

2.2 Neural Methods for Abstractive and Extractive Summarization

The first paper to use an end-to-end neural network for the summarization task was Rush et al. (2015):