Automatic Evaluation of Summary Using Textual Entailment

Pinaki Bhaskar
Department of Computer Science & Engineering, Jadavpur University, Kolkata – 700032, India
[email protected]

Partha Pakray
Department of Computer Science & Engineering, Jadavpur University, Kolkata – 700032, India
[email protected]

Abstract

This paper describes an automatic technique for evaluating summaries. The standard and popular summary evaluation techniques or tools are not fully automatic; they all need some manual process. Using textual entailment (TE), the generated summary can be evaluated automatically, without any manual evaluation or process. The TE system is the composition of a lexical entailment module, a lexical distance module, a Chunk module, a Named Entity module and a syntactic text entailment (TE) module. The syntactic TE system is based on the Support Vector Machine (SVM) and uses twenty five features for lexical similarity, the output tag from a rule based syntactic two-way TE system as a feature, and the outputs from a rule based Chunk Module and Named Entity Module as the other features. The documents are used as the text (T) and the summary of these documents is taken as the hypothesis (H). So, if the information of the documents is entailed by the summary, then it is a very good summary. Compared with the ROUGE 1.5.5 evaluation scores, the proposed evaluation technique achieved a high accuracy of 98.25% w.r.t. ROUGE-2 and 95.65% w.r.t. ROUGE-SU4.

1 Introduction

Automatic summaries are usually evaluated using human generated reference summaries or some manual effort. A summary that has been generated automatically from documents is difficult to evaluate using a completely automatic evaluation process or tool. The most popular and standard summary evaluation tools are ROUGE and Pyramid. ROUGE evaluates an automated summary by comparing it with a set of human generated reference summaries, whereas the Pyramid method needs the nuggets to be identified manually. Both processes are very hectic and time consuming. So, automatic evaluation of summaries is very much needed when a large number of summaries have to be evaluated, especially for multi-document summaries. For summary evaluation we have developed an automated evaluation technique based on textual entailment.

Recognizing Textual Entailment (RTE) is one of the recent research areas of Natural Language Processing (NLP). Textual Entailment is defined as a directional relationship between pairs of text expressions, denoted by the entailing "Text" (T) and the entailed "Hypothesis" (H). T entails H if the meaning of H can be inferred from the meaning of T. Textual Entailment has many applications in NLP tasks, such as Summarization, Information Extraction, Question Answering and Information Retrieval.

2 Related Work

Most of the approaches in the textual entailment domain take the bag-of-words representation as one option, at least as a baseline system. The system of Herrera et al. (2005) obtains lexical entailment relations from WordNet¹. The lexical unit T entails the lexical unit H if they are synonyms, hyponyms, multiwords, negations or antonyms according to WordNet, or if there is a relation of similarity between them. The system accuracy was 55.8% on the RTE-1 test dataset.

Based on the idea that meaning is determined by context, Clarke (2006) proposed a formal definition of entailment between two sentences in the form of a conditional probability on a measure space. The system submitted in RTE-4 provided three practical implementations of this formalism: a bag of words comparison as a baseline and two methods based on analyzing subsequences of the sentences, possibly with intervening symbols. The system accuracy was 53% on the RTE-2 test dataset.

1 http://wordnet.princeton.edu/

Adams et al. (2007) used linguistic features as training data for a decision tree classifier. These features are derived from the text-hypothesis pairs under examination. The system mainly used ROUGE (Recall-Oriented Understudy for Gisting Evaluation), n-gram overlap metrics, a cosine similarity metric and WordNet based measures as features. The system accuracy was 52% on the RTE-2 test dataset.

Montalvo-Huhn and Taylor (2008) guessed at entailment based on word similarity between the hypotheses and the text. Three kinds of comparisons were attempted: original words (with normalized dates and numbers), synonyms and antonyms. Each of the three comparisons contributes a different weight to the entailment decision. The two-way accuracy of the system was 52.6% on the RTE-4 test dataset.

Litkowski's (2009) system consists solely of routines to examine the overlap of discourse entities between the texts and hypotheses. The two-way accuracy of the system was 53% on the RTE-5 Main task test dataset.

Majumdar and Bhattacharyya (2010) describe a simple lexical based system which detects entailment based on word overlap between the Text and the Hypothesis. The system is mainly designed to incorporate the various kinds of co-reference that occur within a document and take an active part in textual entailment. The accuracy of the system was 47.56% on the RTE-6 Main Task test dataset.

The MENT (Microsoft ENTailment) system (Vanderwende et al., 2006) predicts entailment using syntactic features and a general purpose thesaurus, in addition to an overall alignment score. MENT is based on the premise that it is easier for a syntactic system to predict false entailments. The system accuracy was 60.25% on the RTE-2 test set.

Wang and Neumann (2007) present a novel approach to RTE that exploits a structure-oriented sentence representation followed by a similarity function. The structural features are automatically acquired from tree skeletons that are extracted and generalized from dependency trees. The method makes use of a limited amount of training data without any external knowledge bases (e.g., WordNet) or handcrafted inference rules. They achieved an accuracy of 66.9% on the RTE-3 test data.

The major idea of Varma et al. (2009) is to find linguistic structures, termed templates, that share the same anchors. Anchors are lexical elements describing the context of a sentence. Templates that are extracted from different sentences (text and hypothesis) and connect the same anchors in these sentences are assumed to entail each other. The system accuracy was 46.8% on the RTE-5 test set.

Tsuchida and Ishikawa (2011) combine the entailment score calculated by lexical-level matching with a machine-learning based filtering mechanism that uses various features obtained from lexical-level, chunk-level and predicate-argument-structure-level information. In the filtering mechanism, the false positive T-H pairs that have a high entailment score but do not represent entailment are discarded. The system accuracy was 48% on the RTE-7 test set.

Lin and Hovy (2003) developed an automatic summary evaluation system using n-gram co-occurrence statistics. Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, they conducted an in-depth study of a similar idea for evaluating summaries. They showed that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.

Harnly et al. (2005) also proposed an automatic summary evaluation technique based on the Pyramid method. They presented an experimental system for testing automated evaluation of summaries pre-annotated for shared information. They reduced the problem to a combination of similarity measure computation and clustering, and achieved the best results with a unigram overlap similarity measure and single link clustering, which yields a high correlation to manual pyramid scores (r=0.942, p=0.01) and shows better correlation than the n-gram overlap automatic approaches of the ROUGE system.

3 Textual Entailment System

A two-way hybrid textual entailment (TE) recognition system that uses lexical and syntactic features is described in this section. The system architecture is shown in Figure 1. The hybrid TE system (Pakray et al., 2011b) uses the Support Vector Machine learning technique with thirty four features for training. Five features from Lexical TE, seventeen features from the Lexical distance measures and eleven features from the rule based syntactic two-way TE system have been selected.
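To make the overall architecture concrete, the following sketch (not the authors' code) shows the general shape of the approach: every module emits numeric scores for a text-hypothesis pair, the scores are concatenated into one feature vector, and a classifier (the SVM of Section 3.6) makes the two-way YES/NO decision. The two toy scorers below merely stand in for the real modules described in Sections 3.1 to 3.5.

```python
# Conceptual sketch of the hybrid TE feature pipeline (illustrative only).
def unigram_overlap(text: str, hyp: str) -> float:
    # toy stand-in for a lexical module score
    t, h = set(text.lower().split()), set(hyp.lower().split())
    return len(t & h) / len(h) if h else 0.0

def length_ratio(text: str, hyp: str) -> float:
    # toy stand-in for a lexical distance style score
    return len(hyp.split()) / len(text.split()) if text.split() else 0.0

def feature_vector(text: str, hyp: str) -> list:
    # The real system concatenates thirty four features (lexical, lexical
    # distance, syntactic, chunk); two toy features illustrate the shape.
    return [unigram_overlap(text, hyp), length_ratio(text, hyp)]

print(feature_vector("A plane crashed in southern Italy.",
                     "A plane crashed in Italy."))
```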

Figure 1. Hybrid Textual Entailment System

3.1 Lexical Similarity

In this section the various lexical features (Pakray et al., 2011b) for textual entailment are described in detail.

i. WordNet based Unigram Match. In this method, the various unigrams in the hypothesis of each text-hypothesis pair are checked for their presence in the text. WordNet synsets are identified for each of the unmatched unigrams in the hypothesis. If any synset of the hypothesis unigram matches any synset of a word in the text, then the hypothesis unigram is considered a WordNet based unigram match.

ii. Bigram Match. Each bigram in the hypothesis is searched for a match in the corresponding text part. The measure Bigram_Match is calculated as the fraction of the hypothesis bigrams that match in the corresponding text, i.e., Bigram_Match = (Total number of matched bigrams in a text-hypothesis pair) / (Number of hypothesis bigrams).

iii. Longest Common Subsequence (LCS). The Longest Common Subsequence of a text-hypothesis pair is the longest sequence of words which is common to both the text and the hypothesis. LCS(T,H) estimates the similarity between text T and hypothesis H as LCS_Match = LCS(T,H) / length of H.

iv. Skip-grams. A skip-gram is any combination of n words in the order in which they appear in a sentence, allowing arbitrary gaps. In the present work, only 1-skip-bigrams are considered, where 1-skip-bigrams are bigrams with a one word gap between two words in sentence order. The measure 1_skip_bigram_Match is defined as

1_skip_bigram_Match = skip_gram(T,H) / n    (1)

where skip_gram(T,H) refers to the number of common 1-skip-bigrams (pairs of words in sentence order with one word gap) found in T and H, and n is the number of 1-skip-bigrams in the hypothesis H.

v. Stemming. Stemming is the process of reducing terms to their root forms. For example, the plural form of a noun such as 'boxes' is stemmed into 'box', and inflectional endings such as 'ing', 'es', 's' and 'ed' are removed from verbs. Each word in the text and hypothesis pair is stemmed using the stemming function provided along with WordNet 2.0. If s1 = number of common stemmed unigrams between text and hypothesis and s2 = number of stemmed unigrams in the hypothesis, then the measure Stemming_Match is defined as

Stemming_Match = s1 / s2

WordNet is one of the most important lexical resources. WordNet 2.0 has been used for the WordNet based unigram match and the stemming step. The Java API for WordNet Searching (JAWS)² provides Java applications with the ability to retrieve data from the WordNet database.

2 http://wordnetweb.princeton.edu/perl/webwn
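A minimal sketch of these lexical measures is given below. It is illustrative only: it assumes lower-cased token lists, uses NLTK's WordNet interface and Porter stemmer as stand-ins for WordNet 2.0 and the JAWS API used in the paper, and treats the measures as simple set or sequence ratios.

```python
# Illustrative sketch of the lexical measures of Section 3.1 (not the authors'
# code). Inputs are lists of lower-cased tokens.
from nltk.corpus import wordnet as wn      # requires the NLTK WordNet data
from nltk.stem import PorterStemmer

def wordnet_unigram_match(text, hyp):
    """Fraction of hypothesis unigrams found in the text directly or via a shared synset."""
    text_set = set(text)
    matched = 0
    for w in hyp:
        if w in text_set:
            matched += 1
        elif any(set(wn.synsets(w)) & set(wn.synsets(t)) for t in text_set):
            matched += 1
    return matched / len(hyp) if hyp else 0.0

def bigram_match(text, hyp):
    """Matched hypothesis bigrams / total hypothesis bigrams."""
    bi = lambda s: list(zip(s, s[1:]))
    t_bi, h_bi = set(bi(text)), bi(hyp)
    return sum(b in t_bi for b in h_bi) / len(h_bi) if h_bi else 0.0

def lcs_match(text, hyp):
    """LCS(T, H) / length of H, over word sequences."""
    m, n = len(text), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if text[i] == hyp[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0

def skip_bigram_match(text, hyp):
    """Common 1-skip-bigrams of T and H divided by the number in H (equation 1)."""
    skip = lambda s: set(zip(s, s[2:]))    # word pairs with exactly one word gap
    t_sk, h_sk = skip(text), skip(hyp)
    return len(t_sk & h_sk) / len(h_sk) if h_sk else 0.0

def stemming_match(text, hyp):
    """Common stemmed unigrams / stemmed hypothesis unigrams."""
    st = PorterStemmer()
    t_st, h_st = {st.stem(w) for w in text}, {st.stem(w) for w in hyp}
    return len(t_st & h_st) / len(h_st) if h_st else 0.0
```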
3.2 Syntactic Similarity

In this section the various syntactic similarity features (Pakray et al., 2011b) for textual entailment are described in detail. This module is based on the Stanford Dependency Parser³, which normalizes data from the corpus of text and hypothesis pairs, accomplishes the dependency analysis and creates appropriate structures. Our entailment system uses the following features.

a. Subject. The dependency parser generates nsubj (nominal subject) and nsubjpass (passive nominal subject) tags for the subject feature. Our entailment system uses these tags.

b. Object. The dependency parser generates dobj (direct object) as the object tag.

c. Verb. Verbs are wrapped with either the subject or the object.

d. Noun. The dependency parser generates nn (noun modifier) as the noun tag.

e. Preposition. Different types of prepositional tags are prep_in, prep_to, prep_with, etc. For example, in the sentence "A plane crashes in Italy." the prepositional tag is identified as prep_in(in, Italy).

3 http://www-nlp.stanford.edu/software/lex-parser.shtml
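For illustration, the parser output used by these features can be viewed as typed (governor, dependent) pairs keyed by relation name. The hand-written entries below just restate the example above plus a typical direct-object relation; they are not actual Stanford parser output.

```python
# Illustrative representation of dependency relations (hand-written examples).
example_deps = {                         # "A plane crashes in Italy."
    "nsubj":   [("crashes", "plane")],   # feature a: nominal subject
    "det":     [("plane", "A")],         # determiner (feature f below)
    "prep_in": [("in", "Italy")],        # feature e, as quoted above
}
other_deps = {                           # "Nigeria seizes 80 tonnes of drugs."
    "nsubj": [("seizes", "Nigeria")],
    "dobj":  [("seizes", "tonnes")],     # feature b: direct object
    "num":   [("tonnes", "80")],         # numeric modifier (feature g below)
}
```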

f. Determiner. A determiner denotes a relation with a noun phrase. The dependency parser generates det as the determiner tag. For example, the parsing of the sentence "A journalist reports on his own murders." generates the determiner relation det(journalist, A).

g. Number. The numeric modifier of a noun phrase is any number phrase. The dependency parser generates num (numeric modifier). For example, the parsing of the sentence "Nigeria seizes 80 tonnes of drugs." generates the relation num(tonnes, 80).

Matching Module: After the dependency relations are identified for both the text and the hypothesis in each pair, the hypothesis relations are compared with the text relations. The different features that are compared are noted below. In all the comparisons, a matching score of 1 is considered when the complete dependency relation along with all of its arguments matches in both the text and the hypothesis. In case of a partial match for a dependency relation, a matching score of 0.5 is assumed.

i. Subject-Verb Comparison. The system compares the hypothesis subject and verb with the text subject and verb that are identified through the nsubj and nsubjpass dependency relations. A matching score of 1 is assigned in case of a complete match. Otherwise, the system considers the following matching process.

ii. WordNet Based Subject-Verb Comparison. If the corresponding hypothesis and text subjects do match in the subject-verb comparison but the verbs do not match, then the WordNet distance between the hypothesis verb and the text verb is computed. If the value of the WordNet distance is less than 0.5, indicating a closeness of the corresponding verbs, then a match is considered and a matching score of 0.5 is assigned. Otherwise, the subject-subject comparison process is applied.

iii. Subject-Subject Comparison. The system compares the hypothesis subject with the text subject. If a match is found, a score of 0.5 is assigned.

iv. Object-Verb Comparison. The system compares the hypothesis object and verb with the text object and verb that are identified through the dobj dependency relation. In case of a match, a matching score of 0.5 is assigned.

v. WordNet Based Object-Verb Comparison. The system compares the hypothesis object with the text object. If a match is found, then the verb associated with the hypothesis object is compared with the verb associated with the text object. If the two verbs do not match, then the WordNet distance between the two verbs is calculated. If the value of the WordNet distance is below 0.50, then a matching score of 0.5 is assigned.

vi. Cross Subject-Object Comparison. The system compares the hypothesis subject and verb with the text object and verb, or the hypothesis object and verb with the text subject and verb. In case of a match, a matching score of 0.5 is assigned.

vii. Number Comparison. The system compares numbers along with units in the hypothesis with similar numbers along with units in the text. Units are compared first and, if they match, the corresponding numbers are compared. In case of a match, a matching score of 1 is assigned.

viii. Noun Comparison. The system compares hypothesis noun words with text noun words that are identified through the nn dependency relation. In case of a match, a matching score of 1 is assigned.

ix. Prepositional Phrase Comparison. The system compares the prepositional dependency relations in the hypothesis with the corresponding relations in the text and then checks the noun words that are arguments of the relation. In case of a match, a matching score of 1 is assigned.

x. Determiner Comparison. The system compares the determiners in the hypothesis and in the text that are identified through the det relation. In case of a match, a matching score of 1 is assigned.

xi. Other Relation Comparison. Besides the above relations, all other remaining relations are compared verbatim in the hypothesis and in the text. In case of a match, a matching score of 1 is assigned.
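A simplified sketch of a few of these comparisons is given below. It is illustrative only: it works on pre-extracted relation dictionaries of the kind shown earlier, covers only the subject-verb, subject-subject, object-verb and determiner comparisons, and omits the WordNet verb-distance fall-backs.

```python
# Simplified sketch of part of the matching module (not the authors' code).
# Dependency relations are assumed to be dictionaries mapping relation name
# to a list of (governor, dependent) pairs.
def pairs(deps, *relations):
    """All (governor, dependent) pairs for the given relation names."""
    return [p for rel in relations for p in deps.get(rel, [])]

def subject_verb_score(text_deps, hyp_deps):
    """Comparisons i and iii: 1 for a full subject+verb match, 0.5 for subject only."""
    t_subj = pairs(text_deps, "nsubj", "nsubjpass")
    h_subj = pairs(hyp_deps, "nsubj", "nsubjpass")
    if any(h in t_subj for h in h_subj):                 # same verb and subject
        return 1.0
    t_subjects = {dep for _, dep in t_subj}
    if any(dep in t_subjects for _, dep in h_subj):      # subject matches only
        return 0.5
    return 0.0

def object_verb_score(text_deps, hyp_deps):
    """Comparison iv: 0.5 when a hypothesis dobj pair is found in the text."""
    t_obj = pairs(text_deps, "dobj")
    return 0.5 if any(h in t_obj for h in pairs(hyp_deps, "dobj")) else 0.0

def determiner_score(text_deps, hyp_deps):
    """Comparison x: 1 when a det relation of the hypothesis appears in the text."""
    t_det = pairs(text_deps, "det")
    return 1.0 if any(h in t_det for h in pairs(hyp_deps, "det")) else 0.0

# Example with hand-written triples for the sentences used in this section:
text_deps = {"nsubj": [("reports", "journalist")], "det": [("journalist", "A")]}
hyp_deps  = {"nsubj": [("writes", "journalist")], "det": [("journalist", "A")]}
print(subject_verb_score(text_deps, hyp_deps))   # 0.5 (subject matches, verb differs)
print(determiner_score(text_deps, hyp_deps))     # 1.0
```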

3.3 Part-of-Speech (POS) Matching

This module matches the common POS tags between the text and the hypothesis of each pair. The Stanford POS tagger⁴ is used to tag the parts of speech in both the text and the hypothesis. The system matches the verb and noun POS words in the hypothesis with those in the text. The score POS_Match is defined in equation 2:

POS_Match = (Number of verb and noun matches between the text and the hypothesis) / (Total number of verbs and nouns in the hypothesis)    (2)

4 http://nlp.stanford.edu/software/tagger.shtml

3.4 Lexical Distance

The important lexical distance measures used in the present system include Vector Space Measures (Euclidean distance, Manhattan distance, Minkowski distance, Cosine similarity, Matching coefficient), Set-based Similarities (Dice, Jaccard, Overlap, Cosine, Harmonic), Soft-Cardinality, Q-Grams Distance and Edit Distance Measures (Levenshtein distance, Smith-Waterman distance, Jaro). These lexical distance features have been used as described in detail by Pakray et al. (2011b).
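The POS match ratio of equation (2) and two of the simpler distance measures listed above can be sketched as follows. This is illustrative only: it assumes Penn Treebank style tags from a POS tagger and plain token lists, and the function names are not from the paper; the full system relies on the complete battery of measures described by Pakray et al. (2011b).

```python
# Illustrative sketch of equation (2) and two lexical distance measures.
def pos_match(text_tagged, hyp_tagged):
    """Equation (2): matched hypothesis verbs/nouns over all hypothesis verbs/nouns.
    Inputs are lists of (word, tag) pairs with Penn Treebank style tags."""
    content = lambda tagged: {w for w, tag in tagged if tag.startswith(("NN", "VB"))}
    t_words, h_words = content(text_tagged), content(hyp_tagged)
    return len(t_words & h_words) / len(h_words) if h_words else 0.0

def jaccard(text_tokens, hyp_tokens):
    """Set-based Jaccard similarity over word sets."""
    t, h = set(text_tokens), set(hyp_tokens)
    return len(t & h) / len(t | h) if t | h else 0.0

def levenshtein(a, b):
    """Classic edit distance between two token sequences (or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[len(b)]
```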
3.5 Chunk Similarity

The part of speech (POS) tags of the hypothesis and the text are identified using the Stanford POS tagger. After getting the POS information, the system extracts the chunk output using the CRF Chunker⁵. The chunk boundary detector detects each individual chunk, such as noun chunks and verb chunks. Thus, all the chunks of each sentence in the hypothesis are identified. Each chunk of the hypothesis is then searched for in the text side, and the sentences that contain the key chunk words are extracted. If chunks match, then the system assigns a score to each individual chunk corresponding to the hypothesis. The scoring values change according to the matching of the chunk and of the words contained in the chunk. The entire scoring calculation is given in equations 3 and 4 below:

Match score M[i] = Wm[i] / Wc[i]    (3)

where Wm[i] = number of words that match in the i-th chunk and Wc[i] = total number of words contained in the i-th chunk.

Overall score S = (Σ_{i=1..N} M[i]) / N    (4)

where N = total number of chunks in the hypothesis.

5 http://crfchunker.sourceforge.net/
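A sketch of the chunk-level scoring of equations (3) and (4) is given below. It is illustrative only: chunking itself (e.g. with the CRF Chunker) is assumed to have been done already, so each hypothesis chunk is simply a list of words whose overlap with the text is counted.

```python
# Illustrative sketch of equations (3) and (4): per-chunk match scores M[i]
# and the overall chunk score S.
def chunk_match_scores(text_tokens, hyp_chunks):
    text_set = set(text_tokens)
    scores = []
    for chunk in hyp_chunks:                                    # one M[i] per chunk
        matched = sum(1 for w in chunk if w in text_set)        # Wm[i]
        scores.append(matched / len(chunk) if chunk else 0.0)   # M[i] = Wm[i] / Wc[i]
    return scores

def overall_chunk_score(text_tokens, hyp_chunks):
    scores = chunk_match_scores(text_tokens, hyp_chunks)
    return sum(scores) / len(scores) if scores else 0.0         # S = sum(M[i]) / N

# Example:
text = "a plane crashed in southern italy on monday".split()
chunks = [["a", "plane"], ["crashes"], ["in", "italy"]]
print(overall_chunk_score(text, chunks))   # (1.0 + 0.0 + 1.0) / 3 ≈ 0.67
```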

3.6 Support Vector Machines (SVM)

In machine learning, support vector machines (SVMs)⁶ are supervised learning models used for classification and regression analysis. Associated learning algorithms analyze data and recognize patterns. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

The system has used LIBSVM⁷ for building the model file. To deal with the two-way classification task and to build the model file, the TE system has used the following data sets for training: the RTE-1 development and test sets, the RTE-2 development and annotated test sets, the RTE-3 development and annotated test sets, and the RTE-4 annotated test set. The LIBSVM tool is used by the SVM classifier to learn from this data. For training, 3967 text-hypothesis pairs have been used. The system has been tested on the RTE test datasets, on which we obtained 60% to 70% accuracy. We have applied this textual entailment system to the summarization data sets, and the system gives an entailment score along with an entailment decision (i.e., "YES" / "NO"). We have tested in both directions.

6 http://en.wikipedia.org/wiki/Support_vector_machine
7 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
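A minimal sketch of the training step is shown below. It is illustrative only: scikit-learn's SVC (built on libsvm) stands in for the LIBSVM tools used in the paper, a random matrix stands in for the real thirty four dimensional feature vectors extracted from the RTE pairs, and default hyper-parameters are used.

```python
# Minimal sketch of the SVM step: train a two-way YES/NO entailment classifier
# on feature vectors and apply it to a new text-hypothesis pair.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((3967, 34))             # placeholder: one row per RTE T-H pair
y_train = rng.integers(0, 2, size=3967)      # placeholder gold labels: 1 = YES, 0 = NO

model = SVC(kernel="rbf", probability=True)  # probability=True to get an entailment score
model.fit(X_train, y_train)

x_new = rng.random((1, 34))                  # feature vector for a new T-H pair
decision = "YES" if model.predict(x_new)[0] == 1 else "NO"
score = model.predict_proba(x_new)[0, 1]     # probability of the YES class
print(decision, round(float(score), 3))
```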

4 Automatic Evaluation of Summary

Ideally, the summary of a set of documents should contain all the necessary information contained in those documents. So the quality of a summary should be judged on how much of the information of the documents it contains. If the summary contains all the necessary information from the documents, then it is a perfect summary. Manual comparison is the best way to judge how much information a summary carries over from the documents, but manual evaluation is a very hectic process, especially when the summary is generated from multiple documents. When a large number of multi-document summaries have to be evaluated, an automatic evaluation method is needed. Here we propose a textual entailment (TE) based automatic evaluation technique for summaries.

4.1 Textual Entailment (TE) Based Summary Evaluation

Textual Entailment is defined as a directional relationship between pairs of text expressions, denoted by the entailing "Text" (T) and the entailed "Hypothesis" (H). Text (T) entails hypothesis (H) if the information of the text (T) is carried into the hypothesis (H). Here the documents are used as the text (T) and the summary of these documents is taken as the hypothesis (H). So, if the information of the documents is entailed by the summary, then it is a very good summary, which should get a good evaluation score.

As our textual entailment system works at the sentence level, each sentence of the documents is taken as a text (T) and the entailment score is calculated against each sentence of the summary, treated as a hypothesis (H). For example, if T_i is the i-th sentence of the documents, then it is compared with each sentence H_j of the summary, where j = 1 to n and n is the total number of sentences in the summary. If T_i is validated against any one of the summary sentences by our textual entailment system, then it is marked as validated. After the entailment results of all the document sentences have been obtained, the percentage or ratio of marked/validated sentences with respect to unmarked/rejected sentences is taken as the evaluation score of the summary.
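A sketch of this evaluation loop is shown below. It is illustrative only: entails() is a placeholder for the hybrid TE system's two-way decision, sentence splitting is assumed to have been done already, and the score is computed here as the fraction of document sentences validated by at least one summary sentence, which is one reading of the ratio described above.

```python
# Illustrative sketch of the TE-based summary evaluation of Section 4.1.
from typing import Callable, List

def evaluate_summary(doc_sentences: List[str],
                     summary_sentences: List[str],
                     entails: Callable[[str, str], bool]) -> float:
    validated = 0
    for t in doc_sentences:                                 # T = document sentence
        if any(entails(t, h) for h in summary_sentences):   # H = summary sentence
            validated += 1
    return validated / len(doc_sentences) if doc_sentences else 0.0

# Toy stand-in for the hybrid TE system: plain word overlap above a threshold.
def toy_entails(text: str, hyp: str, threshold: float = 0.5) -> bool:
    t, h = set(text.lower().split()), set(hyp.lower().split())
    return bool(h) and len(t & h) / len(h) >= threshold

docs = ["A plane crashed in southern Italy.", "Twenty people were on board."]
summary = ["A plane crashed in Italy."]
print(evaluate_summary(docs, summary, toy_entails))   # 0.5
```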
5 Data Collection

We have collected the Text Analysis Conference (TAC, formerly DUC, conducted by NIST) 2008 Update Summarization track's data sets⁸ for this experiment. This data set contains 48 topics and each topic has two sets of 10 documents, i.e. there are 960 documents. The evaluation data set has 4 model summaries for each document set, i.e. 8 model summaries for each topic. In 2008 there were 72 participants, and we have taken the summaries of all the participants of that year.

8 http://www.nist.gov/tac/data/index.html

6 Comparison of Automatic v/s Manual Evaluation

We have the evaluation scores of all the 72 participants of TAC 2008 using ROUGE 1.5.5. We have calculated the evaluation scores of the same summaries of the 72 participants using the proposed automated evaluation technique and compared them with the ROUGE scores. The comparison of both evaluation scores for the top five participants is shown in Table 1.

Summaries                         ROUGE-2      ROUGE-SU4    Proposed
                                  Average_R    Average_R    method
Top ranked participant (id:43)    0.111        0.143        0.7063
2nd ranked participant (id:13)    0.110        0.140        0.7015
3rd ranked participant (id:60)    0.104        0.142        0.6750
4th ranked participant (id:37)    0.103        0.143        0.6810
5th ranked participant (id:6)     0.101        0.140        0.6325

Table 1. Comparison of Summary Evaluation Scores

For measuring the accuracy of the proposed method, we take the ROUGE 1.5.5 evaluation score as the gold standard and then calculate the accuracy using equation 5:

Accuracy = 1 − (Σ_{i=1..n} |r_i − r'_i|) / n²    (5)

where r_i = the rank of the i-th summary as evaluated by the proposed method, r'_i = the rank of the i-th summary as evaluated by ROUGE 1.5.5, and n = the total number of multi-document summaries.

After evaluating the 48 (only set A) multi-document summaries of the 72 participants, i.e. a total of 3456 multi-document summaries, with both ROUGE 1.5.5 and the proposed method, the accuracy of the proposed method was calculated using equation 5 by comparison with ROUGE's evaluation scores. The accuracy figures are 0.9825 w.r.t. ROUGE-2 and 0.9565 w.r.t. ROUGE-SU4.

7 Conclusion

From the comparison of the evaluation scores of the proposed method and ROUGE 1.5.5, it is clear that the proposed method can judge which summary is better in much the same way as the evaluation done by ROUGE. But if the evaluation is done using ROUGE, then the evaluator has to create the reference summaries manually, which is a very hectic and time consuming task that cannot be carried out by any automated process. Hence, if multiple summaries of the same set of documents have to be evaluated, the proposed automated evaluation process can be a very useful method.

Acknowledgments

We acknowledge the support of the DST India-CONACYT Mexico project "Answer Validation through Textual Entailment", funded by DST, Govt. of India, and of the DeitY, MCIT, Govt. of India funded project "Development of Cross Lingual Information Access (CLIA) System Phase II".

References

Aaron Harnly, Ani Nenkova, Rebecca Passonneau, and Owen Rambow. 2005. Automation of summary evaluation by the pyramid method. In: Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.

Chin-Yew Lin, and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In: 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pp. 71-78, Association for Computational Linguistics.

Daoud Clarke. 2006. Meaning as Context and Subsequence Analysis for Entailment. In: the Second PASCAL Recognising Textual Entailment Challenge, Venice, Italy.

Debarghya Majumdar, and Pushpak Bhattacharyya. 2010. Lexical Based Text Entailment System for Summarization Settings of RTE6. In: the Text Analysis Conference (TAC 2010), National Institute of Standards and Technology, Gaithersburg, Maryland, USA.

Eamonn Newman, John Dunnion, and Joe Carthy. 2006. Constructing a Decision Tree Classifier using Lexical and Syntactic Features. In: the Second PASCAL Recognising Textual Entailment Challenge.

Jesús Herrera, Anselmo Peñas, and Felisa Verdejo. 2005. Textual Entailment Recognition Based on Dependency Analysis and WordNet. In: the First Challenge Workshop on Recognising Textual Entailment, pp. 21-24, 33-36, Southampton, U.K.

Ken Litkowski. 2009. Overlap Analysis in Textual Entailment Recognition. In: TAC 2009 Workshop, National Institute of Standards and Technology, Gaithersburg, Maryland, USA.

Lucy Vanderwende, Arul Menezes, and Rion Snow. 2006. Microsoft Research at RTE-2: Syntactic Contributions in the Entailment Task: an implementation. In: the Second PASCAL Challenges Workshop.

Masaaki Tsuchida, and Kai Ishikawa. 2011. IKOMA at TAC2011: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features. In: TAC 2011 Notebook Proceedings.

Milen Kouylekov, and Bernardo Magnini. 2005. Recognizing Textual Entailment with Tree Edit Distance Algorithms. In: the First PASCAL Recognizing Textual Entailment Workshop.

Orlando Montalvo-Huhn, and Stephen Taylor. 2008. Textual Entailment – Fitchburg State College. In: TAC08, Fourth PASCAL Challenges Workshop on Recognising Textual Entailment.

Partha Pakray, Pinaki Bhaskar, Santanu Pal, Dipankar Das, Sivaji Bandyopadhyay, and Alexander Gelbukh. 2010. JU_CSE_TE: System Description QA@CLEF 2010 – ResPubliQA. In: CLEF 2010 Workshop on Multiple Language Question Answering (MLQA 2010), Padua, Italy.

Partha Pakray, Pinaki Bhaskar, Somnath Banerjee, Bidhan Chandra Pal, Alexander Gelbukh, and Sivaji Bandyopadhyay. 2011a. A Hybrid Question Answering System based on Information Retrieval and Answer Validation. In: Question Answering for Machine Reading Evaluation (QA4MRE) at CLEF 2011, Amsterdam.

Partha Pakray, Snehasis Neogi, Pinaki Bhaskar, Soujanya Poria, Sivaji Bandyopadhyay, and Alexander Gelbukh. 2011b. A Textual Entailment System using Anaphora Resolution. In: the Text Analysis Conference Recognizing Textual Entailment Track (TAC RTE) Notebook.

Pinaki Bhaskar, and Sivaji Bandyopadhyay. 2010a. A Query Focused Automatic Multi Document Summarizer. In: the proceedings of the 8th International Conference on Natural Language Processing (ICON 2010), pp. 241-250, India.

Pinaki Bhaskar, and Sivaji Bandyopadhyay. 2010b. A Query Focused Multi Document Automatic Summarization. In: the proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC 24), pp. 545-554, Japan.

Pinaki Bhaskar, Amitava Das, Partha Pakray, and Sivaji Bandyopadhyay. 2010c. Theme Based English and Bengali Ad-hoc Monolingual Information Retrieval in FIRE 2010. In: the Forum for Information Retrieval Evaluation (FIRE) 2010, Gandhinagar, India.

Pinaki Bhaskar, Somnath Banerjee, Snehasis Neogi, and Sivaji Bandyopadhyay. 2012a. A Hybrid QA System with Focused IR and Automatic Summarization for INEX 2011. In: Geva, S., Kamps, J., Schenkel, R. (eds.): Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2011. Lecture Notes in Computer Science, vol. 7424, Springer Verlag, Berlin, Heidelberg.

Pinaki Bhaskar, and Sivaji Bandyopadhyay. 2012b. Cross Lingual Query Dependent Snippet Generation. In: International Journal of Computer Science and Information Technologies (IJCSIT), ISSN: 0975-9646, Vol. 3, Issue 4, pp. 4603-4609.

Pinaki Bhaskar, and Sivaji Bandyopadhyay. 2012c. Language Independent Query Focused Snippet Generation. In: T. Catarci et al. (Eds.): Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics: Third International Conference of the CLEF Initiative, CLEF 2012, Rome, Italy, Proceedings, Lecture Notes in Computer Science, Volume 7488, pp. 138-140, DOI 10.1007/978-3-642-33247-0_16, ISBN 978-3-642-33246-3, ISSN 0302-9743, Springer Verlag, Berlin, Heidelberg, Germany.

Pinaki Bhaskar, Somnath Banerjee, and Sivaji Bandyopadhyay. 2012d. A Hybrid Tweet Contextualization System using IR and Summarization. In: the Initiative for the Evaluation of XML Retrieval, INEX 2012 at Conference and Labs of the Evaluation Forum (CLEF) 2012, Pamela Forner, Jussi Karlgren, Christa Womser-Hacker (Eds.): CLEF 2012 Evaluation Labs and Workshop, pp. 164-175, ISBN 978-88-904810-3-1, ISSN 2038-4963, Rome, Italy.

Pinaki Bhaskar, Partha Pakray, Somnath Banerjee, Samadrita Banerjee, Sivaji Bandyopadhyay, and Alexander Gelbukh. 2012e. Question Answering System for QA4MRE@CLEF 2012. In: Question Answering for Machine Reading Evaluation (QA4MRE) at Conference and Labs of the Evaluation Forum (CLEF) 2012, Rome, Italy.

Pinaki Bhaskar, Kishorjit Nongmeikapam, and Sivaji Bandyopadhyay. 2012f. Keyphrase Extraction in Scientific Articles: A Supervised Approach. In: the proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 17-24, India.

Pinaki Bhaskar. 2013a. A Query Focused Language Independent Multi-document Summarization. Jian, A. (Eds.), ISBN 978-3-8484-0089-8, LAMBERT Academic Publishing, Saarbrücken, Germany.

Pinaki Bhaskar. 2013b. Answering Questions from Multiple Documents – the Role of Multi-Document Summarization. In: Student Research Workshop at the Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Pinaki Bhaskar. 2013c. Multi-Document Summarization using Automatic Key-Phrase Extraction. In: Student Research Workshop at the Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Pinaki Bhaskar, Somnath Banerjee, and Sivaji Bandyopadhyay. 2013d. Tweet Contextualization (Answering Tweet Question) – the Role of Multi-document Summarization. In: the Initiative for the Evaluation of XML Retrieval, INEX 2013 at CLEF 2013 Conference and Labs of the Evaluation Forum, Valencia, Spain.

Pinaki Bhaskar, Somnath Banerjee, Partha Pakray, Samadrita Banerjee, Sivaji Bandyopadhyay, and Alexander Gelbukh. 2013e. A Hybrid Question Answering System for Multiple Choice Question (MCQ). In: Question Answering for Machine Reading Evaluation (QA4MRE) at CLEF 2013 Conference and Labs of the Evaluation Forum, Valencia, Spain.

Sivaji Bandyopadhyay, Amitava Das, and Pinaki Bhaskar. 2008. English Bengali Ad-hoc Monolingual Information Retrieval Task Result at FIRE 2008. In: the Forum for Information Retrieval Evaluation (FIRE) 2008, Kolkata, India.

Somnath Banerjee, Partha Pakray, Pinaki Bhaskar, Sivaji Bandyopadhyay, and Alexander Gelbukh. 2013. Multiple Choice Question (MCQ) Answering System for Entrance Examination. In: Question Answering for Machine Reading Evaluation (QA4MRE) at CLEF 2013 Conference and Labs of the Evaluation Forum, Valencia, Spain.

Rod Adams, Gabriel Nicolae, Cristina Nicolae, and Sanda Harabagiu. 2007. Textual Entailment Through Extended Lexical Overlap and Lexico-Semantic Matching. In: ACL PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 119-124, 28-29, Prague, Czech Republic.

Rui Wang, and Günter Neumann. 2007. Recognizing Textual Entailment Using Sentence Similarity based on Dependency Tree Skeletons. In: the Third PASCAL Recognising Textual Entailment Challenge.

Vasudeva Varma, Vijay Bharat, Sudheer Kovelamudi, Praveen Bysani, Santosh GSK, Kiran Kumar N, Kranthi Reddy, Karuna Kumar, and Nitin Maganti. 2009. IIIT Hyderabad at TAC 2009. In: TAC 2009 Workshop, National Institute of Standards and Technology, Gaithersburg, Maryland, USA.
