Automatic Evaluation of Summary Using Textual Entailment
Pinaki Bhaskar
Department of Computer Science & Engineering, Jadavpur University, Kolkata – 700032, India
[email protected]

Partha Pakray
Department of Computer Science & Engineering, Jadavpur University, Kolkata – 700032, India
[email protected]

Abstract

This paper describes an automatic technique for evaluating summaries. The standard and popular summary evaluation techniques and tools are not fully automatic; they all need some manual processing. Using textual entailment (TE), a generated summary can be evaluated automatically, without any manual evaluation. The TE system is a composition of a lexical entailment module, a lexical distance module, a Chunk module, a Named Entity module and a syntactic textual entailment module. The syntactic TE system is based on a Support Vector Machine (SVM) that uses twenty-five features for lexical similarity, the output tag from a rule-based syntactic two-way TE system as one feature, and the outputs from a rule-based Chunk module and Named Entity module as the other features. The documents are used as the text (T) and the summary of those documents is taken as the hypothesis (H); so, if the summary is entailed by the documents, it will be a very good summary. Compared against the ROUGE 1.5.5 evaluation scores, the proposed evaluation technique achieved a high accuracy of 98.25% w.r.t. ROUGE-2 and 95.65% w.r.t. ROUGE-SU4.

1 Introduction

Automatic summaries are usually evaluated using human-generated reference summaries or some other manual effort. A summary that has been generated automatically from documents is difficult to evaluate with a completely automatic process or tool. The most popular and standard summary evaluation tools are ROUGE and Pyramid. ROUGE evaluates an automated summary by comparing it with a set of human-generated reference summaries, whereas the Pyramid method needs the nuggets to be identified manually. Both processes are very laborious and time consuming. So automatic evaluation of summaries is very much needed when a large number of summaries have to be evaluated, especially for multi-document summaries. For summary evaluation we have developed an automated evaluation technique based on textual entailment.

Recognizing Textual Entailment (RTE) is one of the recent research areas of Natural Language Processing (NLP). Textual entailment is defined as a directional relationship between pairs of text expressions, denoted by the entailing "Text" (T) and the entailed "Hypothesis" (H). T entails H if the meaning of H can be inferred from the meaning of T. Textual entailment has many applications in NLP tasks, such as summarization, information extraction, question answering and information retrieval.

2 Related Work

Most of the approaches in the textual entailment domain take a bag-of-words representation as one option, at least as a baseline system. The system of Herrera et al. (2005) obtains lexical entailment relations from WordNet (http://wordnet.princeton.edu/). The lexical unit T entails the lexical unit H if they are synonyms, hyponyms, multiwords, negations or antonyms according to WordNet, or if there is a relation of similarity between them. The system accuracy was 55.8% on the RTE-1 test dataset.

Based on the idea that meaning is determined by context, Clarke (2006) proposed a formal definition of entailment between two sentences in the form of a conditional probability on a measure space. The system submitted in RTE-4 provided three practical implementations of this formalism: a bag-of-words comparison as a baseline and two methods based on analyzing subsequences of the sentences, possibly with intervening symbols.
The system accuracy was 53% on the RTE-2 test dataset.

Adams et al. (2007) used linguistic features as training data for a decision tree classifier. These features are derived from the text–hypothesis pairs under examination. The system mainly used ROUGE (Recall-Oriented Understudy for Gisting Evaluation), n-gram overlap metrics, a cosine similarity metric and a WordNet-based measure as features. The system accuracy was 52% on the RTE-2 test dataset.

Montalvo-Huhn et al. (2008) guessed at entailment based on word similarity between the hypotheses and the text. Three kinds of comparisons were attempted: original words (with normalized dates and numbers), synonyms and antonyms. Each of the three comparisons contributes a different weight to the entailment decision. The two-way accuracy of the system was 52.6% on the RTE-4 test dataset.

Litkowski's (2009) system consists solely of routines to examine the overlap of discourse entities between the texts and hypotheses. The two-way accuracy of the system was 53% on the RTE-5 Main Task test dataset.

Majumdar and Bhattacharyya (2010) describe a simple lexical system, which detects entailment based on word overlap between the text and the hypothesis. The system is mainly designed to incorporate the various kinds of co-referencing that occur within a document and take an active part in the event of textual entailment. The accuracy of the system was 47.56% on the RTE-6 Main Task test dataset.
The MENT (Microsoft ENTailment) system (Vanderwende et al., 2006) predicts entailment using syntactic features and a general-purpose thesaurus, in addition to an overall alignment score. MENT is based on the premise that it is easier for a syntactic system to predict false entailments. The system accuracy was 60.25% on the RTE-2 test set.

Wang and Neumann (2007) present a novel approach to RTE that exploits a structure-oriented sentence representation followed by a similarity function. The structural features are automatically acquired from tree skeletons that are extracted and generalized from dependency trees. The method makes use of a limited amount of training data, without any external knowledge bases (e.g., WordNet) or handcrafted inference rules. They achieved an accuracy of 66.9% on the RTE-3 test data.

The major idea of Varma et al. (2009) is to find linguistic structures, termed templates, that share the same anchors. Anchors are lexical elements describing the context of a sentence. Templates that are extracted from different sentences (text and hypothesis) and connect the same anchors in these sentences are assumed to entail each other. The system accuracy was 46.8% on the RTE-5 test set.

Tsuchida and Ishikawa (2011) combine the entailment score calculated by lexical-level matching with a machine-learning based filtering mechanism using various features obtained from lexical-level, chunk-level and predicate-argument-structure-level information. In the filtering mechanism, the false-positive T-H pairs that have a high entailment score but do not represent entailment are discarded. The system accuracy was 48% on the RTE-7 test set.

Lin and Hovy (2003) developed an automatic summary evaluation system using n-gram co-occurrence statistics. Following the adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, they conducted an in-depth study of a similar idea for evaluating summaries. They showed that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.

Harnly et al. (2005) also proposed an automatic summary evaluation technique based on the Pyramid method. They presented an experimental system for testing automated evaluation of summaries pre-annotated for shared information. They reduced the problem to a combination of similarity measure computation and clustering. They achieved their best results with a unigram overlap similarity measure and single-link clustering, which yields a high correlation with manual Pyramid scores (r=0.942, p=0.01) and shows better correlation than the n-gram overlap automatic approaches of the ROUGE system.

3 Textual Entailment System

A two-way hybrid textual entailment (TE) recognition system that uses lexical and syntactic features is described in this section. The system architecture is shown in Figure 1. The hybrid TE system, following (Pakray et al., 2011b), uses the Support Vector Machine learning technique with thirty-four features for training: five features from the lexical TE module, seventeen features from the lexical distance measure and eleven features from the rule-based syntactic two-way TE system have been selected.

Figure 1. Hybrid Textual Entailment System

3.1 Lexical Similarity

In this section the various lexical features (Pakray et al., 2011b) for textual entailment are described in detail.

i. WordNet based unigram match. WordNet is one of the most important resources for lexical analysis. WordNet 2.0 has been used for the WordNet based unigram match and for stemming.

iv. 1-skip-bigram match. A 1-skip-bigram is a pair of words, in sentence order, that allows one intervening word between the two words in a sentence. The measure 1_skip_bigram_Match is defined as

    1_skip_bigram_Match = skip_gram(T, H) / n        (1)

where skip_gram(T, H) refers to the number of common 1-skip-bigrams (pairs of words in sentence order with one word gap) found in T and H, and n is the number of 1-skip-bigrams in the hypothesis H.

v. Stemming. Stemming is the process of reducing terms to their root forms. For example, the plural form of a noun such as 'boxes' is stemmed to 'box', and inflectional endings such as 'ing', 'es', 's' and 'ed' are removed from verbs. Each word in the text–hypothesis pair is stemmed using the stemming function provided along with WordNet 2.0. If s1 is the number of common stemmed unigrams between the text and the hypothesis and s2 is the number of stemmed unigrams in the hypothesis, then the measure Stemming_Match is defined as

    Stemming_Match = s1 / s2
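As a minimal sketch, the 1-skip-bigram measure of Equation (1) can be computed as below; whitespace tokenization, lowercasing and counting distinct 1-skip-bigrams (a set rather than a multiset) are our assumptions, since the paper does not specify these details.

    def skip_bigrams(tokens):
        # 1-skip-bigrams: pairs of words in sentence order with exactly one
        # intervening word, per the definition under Equation (1).
        return {(tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2)}

    def one_skip_bigram_match(text, hypothesis):
        # Equation (1): number of 1-skip-bigrams common to T and H, divided
        # by the number of 1-skip-bigrams in the hypothesis H.
        t_grams = skip_bigrams(text.lower().split())
        h_grams = skip_bigrams(hypothesis.lower().split())
        if not h_grams:
            return 0.0
        return len(t_grams & h_grams) / len(h_grams)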
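The Stemming_Match measure admits a similar sketch. The paper stems with the function distributed along with WordNet 2.0; the version below substitutes NLTK's WordNet-based lemmatizer, which is an assumption rather than the authors' exact tooling.

    from nltk.stem import WordNetLemmatizer  # requires the NLTK WordNet data

    _lemmatizer = WordNetLemmatizer()

    def stemmed_unigrams(tokens):
        # Approximate the WordNet 2.0 stemming function with NLTK's
        # morphy-based lemmatizer (an assumption), e.g. 'boxes' -> 'box'.
        return {_lemmatizer.lemmatize(tok.lower()) for tok in tokens}

    def stemming_match(text, hypothesis):
        # Stemming_Match = s1 / s2: s1 = common stemmed unigrams between
        # T and H, s2 = stemmed unigrams in the hypothesis H.
        s_t = stemmed_unigrams(text.split())
        s_h = stemmed_unigrams(hypothesis.split())
        if not s_h:
            return 0.0
        return len(s_t & s_h) / len(s_h)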
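Finally, to make the training setup of Section 3 concrete, here is a minimal sketch of how such per-pair measures could be assembled into a feature vector for a two-way SVM classifier. The scikit-learn SVC, the linear kernel and the truncated feature list are illustrative assumptions; the paper does not name its SVM implementation or kernel.

    from sklearn.svm import SVC

    def feature_vector(text, hypothesis):
        # One training instance per T-H pair. Only the two measures defined
        # above are implemented here; the remaining lexical, lexical-distance,
        # syntactic, chunk and named-entity features of Section 3 would be
        # appended to this list.
        return [
            one_skip_bigram_match(text, hypothesis),
            stemming_match(text, hypothesis),
        ]

    def train_entailment_classifier(pairs, labels):
        # Two-way (entails / does not entail) classification over annotated
        # T-H pairs, as described in Section 3.
        X = [feature_vector(t, h) for t, h in pairs]
        clf = SVC(kernel="linear")  # kernel choice is an assumption
        clf.fit(X, labels)
        return clf

At evaluation time, each document set would serve as T and its candidate summary as H, so the classifier's two-way decision stands in for a manual judgment of summary quality.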