Better Summarization Evaluation with Embeddings for ROUGE

Jun-Ping Ng (Bloomberg L.P., New York, USA) [email protected]
Viktoria Abrecht (Bloomberg L.P., New York, USA) [email protected]

Abstract

ROUGE is a widely adopted, automatic evaluation measure for text summarization. While it has been shown to correlate well with human judgements, it is biased towards surface lexical similarities. This makes it unsuitable for the evaluation of abstractive summarization, or summaries with substantial paraphrasing. We study the effectiveness of word embeddings in overcoming this disadvantage of ROUGE. Specifically, instead of measuring lexical overlaps, word embeddings are used to compute the semantic similarity of the words used in summaries. Our experimental results show that our proposal is able to achieve better correlations with human judgements when measured with the Spearman and Kendall rank coefficients.

1 Introduction

Automatic text summarization is a rich field of research. For example, shared task evaluation workshops for summarization were held for more than a decade in the Document Understanding Conference (DUC), and subsequently the Text Analysis Conference (TAC). An important element of these shared tasks is the evaluation of participating systems. Initially, manual evaluation was carried out, where human judges were tasked to assess the quality of automatically generated summaries. However, in an effort to make evaluation more scalable, the automatic ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure (Lin, 2004b) was introduced in DUC-2004. ROUGE determines the quality of an automatic summary by comparing overlapping units such as n-grams, word sequences, and word pairs with human-written summaries.

ROUGE is not perfect, however. Two problems with ROUGE are that 1) it favors lexical similarities between generated summaries and model summaries, which makes it unsuitable for evaluating abstractive summarization, or summaries with a significant amount of paraphrasing, and 2) it makes no provision for the readability or fluency of the generated summaries.

There have been on-going efforts to improve automatic summarization evaluation measures, such as the Automatically Evaluating Summaries of Peers (AESOP) task in TAC (Dang and Owczarzak, 2009; Owczarzak, 2010; Owczarzak and Dang, 2011). However, ROUGE remains one of the most popular metrics of choice, as it has repeatedly been shown to correlate very well with human judgements (Lin, 2004a; Over and Yen, 2004; Owczarzak and Dang, 2011).

In this work, we describe our efforts to tackle the first problem of ROUGE identified above: its bias towards lexical similarities. We propose to do this by making use of word embeddings (Bengio et al., 2003). Word embeddings refer to the mapping of words into a multi-dimensional vector space. We can construct the mapping such that the distance between two word projections in the vector space corresponds to the semantic similarity between the two words. By incorporating these word embeddings into ROUGE, we can overcome its bias towards lexical similarities and instead make comparisons based on the semantics of word sequences. We believe that this will result in better correlations with human assessments, and avoid situations where two word sequences share similar meanings but get unfairly penalized by ROUGE due to differences in lexicographic representation.

As an example, consider these two phrases: 1) It is raining heavily, and 2) It is pouring. If we perform a lexical string match, as ROUGE does, there is nothing in common between the terms "raining", "heavily", and "pouring". However, these two phrases mean the same thing. If one of the phrases was part of a human-written summary, while the other was output by an automatic summarization system, we want to be able to reward the automatic system accordingly.
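To make the gap concrete, the short snippet below contrasts lexical matching with embedding similarity for this example. It is only an illustrative sketch: the pre-trained vector file, the gensim loading call, and the use of cosine similarity here are our assumptions for illustration, not part of ROUGE itself.

```python
from gensim.models import KeyedVectors

# Assumed local copy of pre-trained word2vec vectors (the mapping we adopt
# is described in Section 3).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Lexical matching, as ROUGE performs it: the content terms share nothing.
print({"raining", "heavily"} & {"pouring"})        # -> set()

# Embedding similarity: "raining" and "pouring" sit close together in vector space.
print(wv.similarity("raining", "pouring"))
```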

In our experiments, we show that word embeddings indeed give us better correlations with human judgements when measured with the Spearman and Kendall rank coefficients. This is a significant and exciting result. Beyond just improving the evaluation prowess of ROUGE, it has the potential to expand the applicability of ROUGE to abstractive summarization as well.

2 Related Work

While ROUGE is widely used, as we have noted earlier, there is a significant body of work studying the evaluation of automatic text summarization systems. A good survey of many of these measures has been written by Steinberger and Ježek (2012). We will thus not attempt to go through every measure here, but rather highlight the more significant efforts in this area.

Besides ROUGE, Basic Elements (BE) (Hovy et al., 2005) has also been used in the DUC/TAC shared task evaluations. It is an automatic method which evaluates the content completeness of a generated summary by breaking up sentences into smaller, more granular units of information (referred to as "Basic Elements").

The pyramid method originally proposed by Passonneau et al. (2005) is another staple in DUC/TAC. However, it is a semi-automated method, where significant human intervention is required to identify units of information, called Summary Content Units (SCUs), and then to map content within generated summaries to these SCUs. Recently, however, an automated variant of this method has been proposed (Passonneau et al., 2013). In this variant, word embeddings are used, as we are proposing in this paper, to map text content within generated summaries to SCUs. However, the SCUs still need to be manually identified, limiting this variant's scalability and applicability.

Many systems have also been proposed in the AESOP task in TAC from 2009 to 2011. For example, the top system reported in Owczarzak and Dang (2011), AutoSummENG (Giannakopoulos and Karkaletsis, 2009), is a graph-based system which scores summaries based on the similarity between the graph structures of the generated summaries and the model summaries.
3 Methodology

Let us now describe our proposal to integrate word embeddings into ROUGE in greater detail.

To start off, we will first describe the word embeddings that we intend to adopt. A word embedding is really a function W, where W : w → R^n, and w is a word or word sequence. For our purpose, we want W to map two words w1 and w2 such that their respective projections are closer to each other if the words are semantically similar, and further apart if they are not. Mikolov et al. (2013b) describe one such variant, called word2vec, which gives us this desired property. (The effectiveness of the learnt mapping is such that we can now compute analogies such as king − man + woman = queen.) We will thus be making use of word2vec.

We will now explain how word embeddings can be incorporated into ROUGE. There are several variants of ROUGE, of which ROUGE-1, ROUGE-2, and ROUGE-SU4 have often been used. This is because they have been found to correlate well with human judgements (Lin, 2004a; Over and Yen, 2004; Owczarzak and Dang, 2011). ROUGE-1 measures the amount of unigram overlap between model summaries and automatic summaries, and ROUGE-2 measures the amount of bigram overlap. ROUGE-SU4 measures the amount of overlap of skip-bigrams, which are pairs of words in the same order as they appear in a sentence. In each of these variants, overlap is computed by matching the lexical form of the words within the target pieces of text. Formally, we can define this as a similarity function f_R such that

    f_R(w_1, w_2) = \begin{cases} 1 & \text{if } w_1 = w_2 \\ 0 & \text{otherwise} \end{cases}    (1)

where w_1 and w_2 are the words (which could be unigrams or n-grams) being compared.

In our proposal, which we will refer to as ROUGE-WE (available at https://github.com/ng-j-p/rouge-we), we define a new similarity function f_WE such that

    f_{WE}(w_1, w_2) = \begin{cases} 0 & \text{if } v_1 \text{ or } v_2 \text{ is OOV} \\ v_1 \cdot v_2 & \text{otherwise} \end{cases}    (2)

where w_1 and w_2 are the words being compared, and v_x = W(w_x). OOV (out-of-vocabulary) here means a situation where we encounter a word w for which our word embedding function W returns no vector.
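A minimal sketch of the two similarity functions in Equations 1 and 2 is shown below. It is not the released ROUGE-WE implementation (https://github.com/ng-j-p/rouge-we); the embedding table is abstracted as a plain dictionary from tokens to NumPy vectors, and the function names are ours.

```python
import numpy as np
from typing import Dict, Optional

# Stand-in for the word2vec mapping W; in practice this is backed by
# pre-trained vectors (described below).
Embeddings = Dict[str, np.ndarray]


def f_r(w1: str, w2: str) -> float:
    """Equation 1: standard ROUGE matching; 1 on an exact lexical match, else 0."""
    return 1.0 if w1 == w2 else 0.0


def f_we(W: Embeddings, w1: str, w2: str) -> float:
    """Equation 2: dot product of the two word vectors, or 0 if either word is OOV."""
    v1: Optional[np.ndarray] = W.get(w1)
    v2: Optional[np.ndarray] = W.get(w2)
    if v1 is None or v2 is None:
        return 0.0
    return float(np.dot(v1, v2))
```

Conceptually, substituting f_WE for the exact-match test while leaving the rest of the ROUGE computation unchanged yields the ROUGE-WE variants evaluated in Section 4.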

For the purpose of this work, we make use of a set of 3 million pre-trained vector mappings for W, trained on part of Google's news dataset (Mikolov et al., 2013a) and available at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing.

Reducing OOV terms for n-grams. With our formulation for f_WE, we are able to compute variants of ROUGE-WE that correspond to those of ROUGE, including ROUGE-WE-1, ROUGE-WE-2, and ROUGE-WE-SU4. However, despite the large number of vector mappings that we have, there will still be a large number of OOV terms in the case of ROUGE-WE-2 and ROUGE-WE-SU4, where the basic units of comparison are bigrams. To solve this problem, we can compose individual word embeddings together. We follow the simple multiplicative approach described by Mitchell and Lapata (2008), where the individual vectors of the constituent tokens are multiplied together to produce the vector for an n-gram, i.e.,

    W(w) = W(w_1) \times \ldots \times W(w_n)    (3)

where w is an n-gram composed of individual word tokens, i.e., w = w_1 w_2 \ldots w_n. Multiplication between two vectors W(w_i) = {v_i1, ..., v_ik} and W(w_j) = {v_j1, ..., v_jk} is in this case defined element-wise as

    \{ v_{i1} \times v_{j1}, \ldots, v_{ik} \times v_{jk} \}    (4)
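The sketch below illustrates the element-wise composition of Equations 3 and 4, together with one way of loading the pre-trained Google News vectors through gensim. The gensim call and the local file name are assumptions made for illustration; any token-to-vector mapping can stand in for W.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed local copy of the 3-million-entry Google News word2vec binary.
W = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)


def embed_ngram(tokens):
    """Equations 3 and 4: compose an n-gram vector as the element-wise product
    of its constituent word vectors; return None if any token is OOV."""
    composed = None
    for token in tokens:
        if token not in W:
            return None  # treat the whole n-gram as OOV
        vector = np.asarray(W[token])
        composed = vector.copy() if composed is None else composed * vector
    return composed
```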
4 Experiments

4.1 Dataset and Metrics

For our experiments, we make use of the dataset used in AESOP (Owczarzak and Dang, 2011), and the corresponding correlation measures.

For clarity, let us first describe the dataset used in the main TAC summarization task. The main summarization dataset consists of 44 topics, each of which is associated with a set of 10 documents. There are also four human-curated model summaries for each of these topics. Each of the 51 participating systems generated a summary for each of these topics. These automatically generated summaries, together with the human-curated model summaries, then form the basis of the dataset for AESOP.

To assess how effective an automatic evaluation system is, the system is first tasked to assign a score to each of the summaries generated by all of the 51 participating systems. Each of these summaries would also have been assessed by human judges using three key metrics:

Pyramid. As reviewed in Section 2, this is a semi-automated measure described in Passonneau et al. (2005).

Responsiveness. Human judges are tasked to evaluate how well a summary adheres to the information requested, as well as the linguistic quality of the generated summary.

Readability. Human judges give their judgement on how fluent and readable a summary is.

The evaluation system's scores are then tested to see how well they correlate with the human assessments. The correlation is evaluated with a set of three metrics: 1) the Pearson correlation (P), 2) the Spearman rank coefficient (S), and 3) the Kendall rank coefficient (K).
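For reference, the three coefficients can be computed with SciPy as sketched below, given one aggregated score per participating system from the automatic measure and from the human assessment. The aggregation step and the variable names are our assumptions, not a reproduction of the AESOP tooling.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr


def correlate(metric_scores, human_scores):
    """Correlate per-system automatic scores with per-system human scores
    (e.g. pyramid), using the Pearson, Spearman, and Kendall coefficients."""
    return {
        "pearson": pearsonr(metric_scores, human_scores)[0],
        "spearman": spearmanr(metric_scores, human_scores)[0],
        "kendall": kendalltau(metric_scores, human_scores)[0],
    }
```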

4.2 Results

We evaluate three different variants of our proposal, ROUGE-WE-1, ROUGE-WE-2, and ROUGE-WE-SU4, against their corresponding variants of ROUGE (i.e., ROUGE-1, ROUGE-2, and ROUGE-SU4). It is worth noting here that in AESOP in 2011, ROUGE-SU4 was shown to correlate very well with human judgements, especially for pyramid and responsiveness, and out-performed most of the participating systems.

Tables 1, 2, and 3 show the correlation of the scores produced by each variant of ROUGE-WE with the human-assessed scores for pyramid, responsiveness, and readability respectively. The tables also show the correlations achieved by ROUGE-1, ROUGE-2, and ROUGE-SU4. The best result in each column is marked with an asterisk.

Measure        P         S         K
ROUGE-WE-1     0.9492    0.9138*   0.7534*
ROUGE-WE-2     0.9765    0.8984    0.7439
ROUGE-WE-SU4   0.9783    0.8808    0.7198
ROUGE-1        0.9661    0.9085    0.7466
ROUGE-2        0.9606    0.8943    0.7450
ROUGE-SU4      0.9806*   0.8935    0.7371

Table 1: Correlation with pyramid scores, measured with the Pearson r (P), Spearman ρ (S), and Kendall τ (K) coefficients.

Measure        P         S         K
ROUGE-WE-1     0.9155    0.8192*   0.6308
ROUGE-WE-2     0.9534    0.7974    0.6149
ROUGE-WE-SU4   0.9538    0.7872    0.5969
ROUGE-1        0.9349    0.8182    0.6334*
ROUGE-2        0.9416    0.7897    0.6096
ROUGE-SU4      0.9545*   0.7902    0.6017

Table 2: Correlation with responsiveness scores, measured with the Pearson r (P), Spearman ρ (S), and Kendall τ (K) coefficients.

Measure        P         S         K
ROUGE-WE-1     0.7846    0.4312*   0.3216*
ROUGE-WE-2     0.7819    0.4141    0.3042
ROUGE-WE-SU4   0.7931*   0.4068    0.3020
ROUGE-1        0.7900    0.3914    0.2846
ROUGE-2        0.7524    0.3975    0.2925
ROUGE-SU4      0.7840    0.3953    0.2925

Table 3: Correlation with readability scores, measured with the Pearson r (P), Spearman ρ (S), and Kendall τ (K) coefficients.

ROUGE-WE-1 is observed to correlate very well with the pyramid, responsiveness, and readability scores when measured with the Spearman and Kendall rank correlations. However, ROUGE-SU4 correlates better with human assessments under the Pearson correlation. The key difference between the Pearson correlation and the Spearman/Kendall rank correlations is that the former assumes that the variables being tested are normally distributed. It further assumes that the variables are linearly related to each other. The latter two measures are, however, non-parametric and make no assumptions about the distribution of the variables being tested. We argue that the assumptions made by the Pearson correlation may be too constraining, given that any two independent evaluation systems may not exhibit linearity.

Looking at the two bigram-based variants, ROUGE-WE-2 and ROUGE-WE-SU4, we observe that ROUGE-WE-2 improves on ROUGE-2 most of the time, regardless of the correlation metric used. This lends further support to our proposal to use word embeddings with ROUGE.
However, ROUGE-WE-SU4 is only better than ROUGE-SU4 when evaluating readability. It does consistently worse than ROUGE-SU4 for pyramid and responsiveness. The reason for this is likely due to how we have chosen to compose unigram word vectors into bigram equivalents. The multiplicative approach that we have taken worked better for ROUGE-WE-2, which looks at contiguous bigrams. These are easier to interpret semantically than skip-bigrams (the target of ROUGE-WE-SU4). The latter, by the nature of their construction, lose some of the semantic meaning attached to each word, and thus may not be as amenable to the linear composition of word vectors.

Owczarzak and Dang (2011) report only the results of the top systems in AESOP in terms of Pearson's correlation. To get a more complete picture of the usefulness of our proposal, it is instructive to also compare it against the other top systems in AESOP when measured with the Spearman/Kendall correlations. We show in Table 4 the top three systems which correlate best with the pyramid score when measured with the Spearman rank coefficient. C_S_IIITH3 (Kumar et al., 2011) is a graph-based system which assesses summaries based on differences in word co-locations between generated summaries and model summaries. BE-HM (a baseline provided by the organizers of the AESOP task) is the BE system (Hovy et al., 2005), where basic elements are identified using a head-modifier criterion on parse results from Minipar. Lastly, catolicasc1 (de Oliveira, 2011) is also a graph-based system, which frames summary evaluation as a maximum bipartite graph matching problem.

Measure        S         K
ROUGE-WE-1     0.9138    0.7534
C_S_IIITH3     0.9033    0.7582
BE-HM          0.9030    0.7456
catolicasc1    0.9017    0.7351

Table 4: Correlation with pyramid scores of the top systems in AESOP 2011, measured with the Spearman ρ (S) and Kendall τ (K) coefficients.

We see that ROUGE-WE-1 displays better correlations with pyramid scores than the top system in AESOP 2011 (i.e., C_S_IIITH3) when measured with the Spearman coefficient. The latter does slightly better, however, for the Kendall coefficient. This observation further validates that our proposal is an effective enhancement to ROUGE.

5 Conclusion

We proposed an enhancement to the popular ROUGE metric in this work, ROUGE-WE.

ROUGE is biased towards identifying lexical similarity when assessing the quality of a generated summary. We improve on this by incorporating the use of word embeddings. This enhancement allows us to go beyond surface lexicographic matches, and instead capture the semantic similarities between the words used in a generated summary and a human-written model summary. Experimenting on the TAC AESOP dataset, we show that this proposal exhibits very good correlations with human assessments, measured with the Spearman and Kendall rank coefficients. In particular, ROUGE-WE-1 outperforms leading state-of-the-art systems consistently.

Looking ahead, we want to continue building on this work. One area to improve on is the use of a more inclusive evaluation dataset. The AESOP summaries that we have used in our experiments are drawn from systems participating in the TAC summarization task, where there is a strong bias towards extractive summarizers. It will be helpful to enlarge this set of summaries to include output from summarizers which carry out substantial paraphrasing (Li et al., 2013; Ng et al., 2014; Liu et al., 2015).

Another immediate goal is to study the use of better compositional embedding models. The generalization of unigram word embeddings into bigrams (or phrases) is still an open problem (Yin and Schütze, 2014; Yu et al., 2014). A better compositional embedding model than the one adopted in this work should help us improve the results achieved by the bigram variants of ROUGE-WE, especially ROUGE-WE-SU4. This is important because earlier works have demonstrated the value of using skip-bigrams for summarization evaluation.

An effective and accurate automatic evaluation measure will be a big boon to our quest for better text summarization systems. Word embeddings add a promising dimension to summarization evaluation, and we hope to expand on the work we have shared to further realize its potential.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155.

Hoa Trang Dang and Karolina Owczarzak. 2009. Overview of the TAC 2009 Summarization Track. In Proceedings of the Text Analysis Conference (TAC).

Paulo C. F. de Oliveira. 2011. CatolicaSC at TAC 2011. In Proceedings of the Text Analysis Conference (TAC).

George Giannakopoulos and Vangelis Karkaletsis. 2009. AutoSummENG and MeMoG in Evaluating Guided Summaries. In Proceedings of the Text Analysis Conference (TAC).

Eduard Hovy, Chin-Yew Lin, and Liang Zhou. 2005. Evaluating DUC 2005 using Basic Elements. In Proceedings of the Document Understanding Conference (DUC).

Niraj Kumar, Kannan Srinathan, and Vasudeva Varma. 2011. Using Unsupervised System with Least Linguistic Features for TAC-AESOP Task. In Proceedings of the Text Analysis Conference (TAC).

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document Summarization via Guided Sentence Compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 490–500.

Chin-Yew Lin. 2004a. Looking for a Few Good Metrics: ROUGE and its Evaluation. In Working Notes of the 4th NTCIR Workshop Meeting.

Chin-Yew Lin. 2004b. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward Abstractive Summarization Using Semantic Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1077–1086.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013a. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 746–751.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based Models of Semantic Composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), pages 236–244.

Jun-Ping Ng, Yan Chen, Min-Yen Kan, and Zhoujun Li. 2014. Exploiting Timelines to Enhance Multi-document Summarization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 923–933.

Paul Over and James Yen. 2004. An Introduction to DUC 2004 Intrinsic Evaluation of Generic New Text Summarization Systems. In Proceedings of the Document Understanding Conference (DUC).

Karolina Owczarzak and Hoa Trang Dang. 2011. Overview of the TAC 2011 Summarization Track: Guided Task and AESOP Task. In Proceedings of the Text Analysis Conference (TAC).

Karolina Owczarzak. 2010. Overview of the TAC 2010 Summarization Track. In Proceedings of the Text Analysis Conference (TAC).

Rebecca J. Passonneau, Ani Nenkova, Kathleen McKeown, and Sergey Sigelman. 2005. Applying the Pyramid Method in DUC 2005. In Proceedings of the Document Understanding Conference (DUC).

Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin. 2013. Automated Pyramid Scoring of Summaries using Distributional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 143–147.

Josef Steinberger and Karel Ježek. 2012. Evaluation Measures for Text Summarization. Computing and Informatics, 28(2):251–275.

Wenpeng Yin and Hinrich Schütze. 2014. An Exploration of Embeddings for Generalized Phrases. In Proceedings of the ACL 2014 Student Research Workshop, pages 41–47.

Mo Yu, Matthew Gormley, and Mark Dredze. 2014. Factor-based Compositional Embedding Models. In Proceedings of the NIPS 2014 Deep Learning and Representation Learning Workshop.
