Automated Text Summarization Base on Lexicales Chain and Graph Using of Wordnet and Wikipedia Knowledge Base

Total Page:16

File Type:pdf, Size:1020Kb

Automated Text Summarization Base on Lexicales Chain and Graph Using of Wordnet and Wikipedia Knowledge Base IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 3, January 2012 ISSN (Online): 1694-0814 www.IJCSI.org 343 Automated Text Summarization Base on Lexicales Chain and graph Using of WordNet and Wikipedia Knowledge Base Mohsen Pourvali and Mohammad Saniee Abadeh Department of Electrical & Computer Qazvin Branch Islamic Azad University Qazvin, Iran Department of Electrical and Computer Engineering at Tarbiat Modares University Tehran, Iran Abstract The technology of automatic document summarization is delivering the majority of information content from a set of maturing and may provide a solution to the information overload documents about an explicit or implicit main topic [14]. problem. Nowadays, document summarization plays an important Authors of the paper [10] provide the following definition role in information retrieval. With a large volume of documents, for a summary: “A summary can be loosely defined as a presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document text that is produced from one or more texts that conveys summarization is a process of automatically creating a important information in the original text(s), and that is no compressed version of a given document that provides useful longer than half of the original text(s) and usually information to users, and multi-document summarization is to significantly less than that. Text here is used rather loosely produce a summary delivering the majority of information content and can refer to speech, multimedia documents, hypertext, from a set of documents about an explicit or implicit main topic. etc. The main goal of a summary is to present the main The lexical cohesion structure of the text can be exploited to ideas in a document in less space. If all sentences in a text determine the importance of a sentence/phrase. Lexical chains are document were of equal importance, producing a summary useful tools to analyze the lexical cohesion structure in a text .In would not be very effective, as any reduction in the size of this paper we consider the effect of the use of lexical cohesion features in Summarization, And presenting a algorithm base on a document would carry a proportional decrease in its in the knowledge base. Ours algorithm at first find the correct sense formativeness. Luckily, information content in a document of any word, Then constructs the lexical chains, remove Lexical appears in bursts, and one can therefore distinguish chains that less score than other ,detects topics roughly from between more and less informative segments. Identifying lexical chains, segments the text with respect to the topics and the informative segments at the expense of the rest is the selects the most important sentences. The experimental results on main challenge in summarization”. assumes a tripartite an open benchmark datasets from DUC01 and DUC02 show that processing model distinguishing three stages: source text our proposed approach can improve the performance compared to interpretation to obtain a source representation, source sate-of-the-art summarization approaches. representation transformation to summary representation, Keywords: text Summarization, Data Mining, Text mining, Word and summary text generation from the summary Sense Disambiguation representation. A variety of document summarization methods have been developed recently. The paper [4] 1. Introduction reviews research on automatic summarizing over the last decade. This paper reviews salient notions and developments, and seeks to assess the state-of-the-art for The technology of automatic document summarization is this challenging natural language processing (NLP) task. maturing and may provide a solution to the information The review shows that some useful summarizing for overload problem. Nowadays, document summarization various purposes can already be done but also, not plays an important role in information retrieval (IR). With a surprisingly, that there is a huge amount more to do. large volume of documents, presenting the user with a Sentence based extractive summarization techniques are summary of each document greatly facilitates the task of commonly used in automatic summarization to produce finding the desired documents. Text summarization is the extractive summaries. Systems for extractive process of automatically creating a compressed version of a summarization are typically based on technique for given text that provides useful information to users, and sentence extraction, and attempt to identify the set of multi-document summarization is to produce a summary sentences that are most important for the overall understanding of a given document. In paper [11] proposed Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved. IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 3, January 2012 ISSN (Online): 1694-0814 www.IJCSI.org 344 paragraph extraction from a document based on intra- centroid score, position, and overlap with first sentence document links between paragraphs. It yields a text (which may happen to be the title of a document). For relationship map (TRM) from intra-links, which indicate single-documents or (given) clusters it computes centroid that the linked texts are semantically related. It proposes topic characterizations using tf–idf-type data. It ranks four strategies from the TRM: bushy path, depth-first path, candidate summary sentences by combining sentence segmented bushy path, augmented segmented bushy path. scores against centroid, text position value, and tf–idf In our study we focus on sentence based extractive title/lead overlap. Sentence selection is constrained by a summarization. In this way we to express that The lexical summary length threshold, and redundant new sentences cohesion structure of the text can be exploited to avoided by checking cosine similarity against prior ones. In determine the importance of a sentence. Eliminate the the past, extractive summarizers have been mostly based on ambiguity of the word has a significant impact on the scoring sentences in the source document. In paper [12] inference sentence. In this article we will show that the each document is considered as a sequence of sentences separation text into the inside issues by using the correct and the objective of extractive summarization is to label the concept Noticeable effect on the summary text is created. sentences in the sequence with 1 and 0, where a label of 1 The experimental results on an open benchmark datasets indicates that a sentence is a summary sentence while 0 from DUC01 and DUC02 show that our proposed approach denotes a non-summary sentence. To accomplish this task, can improve the performance compared to state-of-the-art is applied conditional random field, which is a state-of-the- summarization approaches. art sequence labeling method .In paper [15] proposed a The rest of this paper is organized as follows: Section 2 novel extractive approach based on manifold–ranking of introduces related works, Word sense disambiguation is sentences to query-based multi-document summarization. presented in Section 3, clustering of the lexical chains is The proposed approach first employs the manifold–ranking presented in Section 4, text segmentation base on the inner process to compute the manifold–ranking score for each topics is presented in Section 5, The experiments and sentence that denotes the biased information-richness of the results are given in Section 6. Finally conclusion presents sentence, and then uses greedy algorithm to penalize the in section 7. sentences with highest overall scores, which are deemed both informative and novel, and highly biased to the given query. The summarization techniques can be classified into 2. Related work two groups: supervised techniques that rely on pre-existing document-summary pairs, and unsupervised techniques, based on properties and heuristics derived from the text. Generally speaking, the methods can be either extractive Supervised extractive summarization techniques treat the summarization or abstractive summarization. Extractive summarization task as a two-class classification problem at summarization involves assigning salience scores to some the sentence level, where the summary sentences are units (e.g.sentences, paragraphs) of the document and positive samples while the non-summary sentences are extracting the sentences with highest scores, while negative samples. After representing each sentence by a abstraction summarization vector of features, the classification function can be trained (e.g.http://www1.cs.columbia.edu/nlp/newsblaster/) usually in two different manners [7]. One is in a discriminative way needs information fusion, sentence compression and with well-known algorithms such as support vector reformulation [14]. machine (SVM) [16]. Many unsupervised methods have Sentence extraction summarization systems take as input been developed for document summarization by exploiting a collection of sentences (one or more documents) and different features and relationships of the sentences – see, select some subset for output into a summary. This is best for example [3] and the references therein. On the other treated as a sentence ranking problem, which allows for hand, summarization task can also be categorized as either varying thresholds to meet varying summary length generic or query-based. A query-based summary presents requirements. Most commonly, such ranking
Recommended publications
  • Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification
    University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 5-2020 Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification Marion Pauline Chiariglione University of Arkansas, Fayetteville Follow this and additional works at: https://scholarworks.uark.edu/etd Part of the Numerical Analysis and Scientific Computing Commons, and the Theory and Algorithms Commons Citation Chiariglione, M. P. (2020). Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification. Theses and Dissertations Retrieved from https://scholarworks.uark.edu/etd/3580 This Thesis is brought to you for free and open access by ScholarWorks@UARK. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of ScholarWorks@UARK. For more information, please contact [email protected]. Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science by Marion Pauline Chiariglione IUT Dijon, University of Burgundy Bachelor of Science in Computer Science, 2017 May 2020 University of Arkansas This thesis is approved for recommendation to the Graduate Council Susan Gauch, Ph.D. Thesis Director Qinghua Li, Ph.D. Committee member Khoa Luu, Ph.D. Committee member Abstract Quoting a borrowed excerpt of text within another literary work was infrequently done prior to the beginning of the eighteenth century. However, quoting other texts, particularly Shakespeare, became quite common after that. Our work develops automatic approaches to identify that trend. Initial work focuses on identifying exact and modified sections of texts taken from works of Shakespeare in novels spanning the eighteenth century. We then introduce a novel approach to identifying modified quotes by adapting the Edit Distance metric, which is character based, to a word based approach.
    [Show full text]
  • Automatic Summarization of Student Course Feedback
    Automatic Summarization of Student Course Feedback Wencan Luo† Fei Liu‡ Zitao Liu† Diane Litman† †University of Pittsburgh, Pittsburgh, PA 15260 ‡University of Central Florida, Orlando, FL 32716 wencan, ztliu, litman @cs.pitt.edu [email protected] { } Abstract Prompt Describe what you found most interesting in today’s class Student course feedback is generated daily in Student Responses both classrooms and online course discussion S1: The main topics of this course seem interesting and forums. Traditionally, instructors manually correspond with my major (Chemical engineering) analyze these responses in a costly manner. In S2: I found the group activity most interesting this work, we propose a new approach to sum- S3: Process that make materials marizing student course feedback based on S4: I found the properties of bike elements to be most the integer linear programming (ILP) frame- interesting work. Our approach allows different student S5: How materials are manufactured S6: Finding out what we will learn in this class was responses to share co-occurrence statistics and interesting to me alleviates sparsity issues. Experimental results S7: The activity with the bicycle parts on a student feedback corpus show that our S8: “part of a bike” activity approach outperforms a range of baselines in ... (rest omitted, 53 responses in total.) terms of both ROUGE scores and human eval- uation. Reference Summary - group activity of analyzing bicycle’s parts - materials processing - the main topic of this course 1 Introduction Table 1: Example student responses and a reference summary Instructors love to solicit feedback from students. created by the teaching assistant. ‘S1’–‘S8’ are student IDs.
    [Show full text]
  • Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann
    Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann To cite this version: Matthias Zimmermann. Sentence Boundary Detection for Handwritten Text Recognition. Tenth International Workshop on Frontiers in Handwriting Recognition, Université de Rennes 1, Oct 2006, La Baule (France). inria-00103835 HAL Id: inria-00103835 https://hal.inria.fr/inria-00103835 Submitted on 5 Oct 2006 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann International Computer Science Institute Berkeley, CA 94704, USA [email protected] Abstract 1) The summonses say they are ” likely to persevere in such In the larger context of handwritten text recognition sys- unlawful conduct . ” <s> They ... tems many natural language processing techniques can 2) ” It comes at a bad time , ” said Ormston . <s> ”A singularly bad time ... potentially be applied to the output of such systems. How- ever, these techniques often assume that the input is seg- mented into meaningful units, such as sentences. This pa- Figure 1. Typical ambiguity for the position of a sen- per investigates the use of hidden-event language mod- tence boundary token <s> in the context of a period els and a maximum entropy based method for sentence followed by quotes.
    [Show full text]
  • Multiple Segmentations of Thai Sentences for Neural Machine Translation
    Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), pages 240–244 Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC Multiple Segmentations of Thai Sentences for Neural Machine Translation Alberto Poncelas1, Wichaya Pidchamook2, Chao-Hong Liu3, James Hadley4, Andy Way1 1ADAPT Centre, School of Computing, Dublin City University, Ireland 2SALIS, Dublin City University, Ireland 3Iconic Translation Machines 4Trinity Centre for Literary and Cultural Translation, Trinity College Dublin, Ireland {alberto.poncelas, andy.way}@adaptcentre.ie [email protected], [email protected], [email protected] Abstract Thai is a low-resource language, so it is often the case that data is not available in sufficient quantities to train an Neural Machine Translation (NMT) model which perform to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds more complexity when building sequence to sequence models. In this work, we explore how to augment a set of English–Thai parallel data by replicating sentence-pairs with different word segmentation methods on Thai, as training data for NMT model training. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that combining these datasets, performance is improved for NMT models trained with a dataset that has been split using a supervised splitting tool. Keywords: Machine Translation, Word Segmentation, Thai Language In Machine Translation (MT), low-resource languages are 1. Combination of Segmented Texts especially challenging as the amount of parallel data avail- As the encoder-decoder framework deals with a sequence able to train models may not be enough to achieve high of tokens, a way to address the Thai language is to split the translation quality.
    [Show full text]
  • Using N-Grams to Understand the Nature of Summaries
    Using N-Grams to Understand the Nature of Summaries Michele Banko and Lucy Vanderwende One Microsoft Way Redmond, WA 98052 {mbanko, lucyv}@microsoft.com views of the event being described over different Abstract documents, or present a high-level view of an event that is not explicitly reflected in any single document. A Although single-document summarization is a useful multi-document summary may also indicate the well-studied task, the nature of multi- presence of new or distinct information contained within document summarization is only beginning to a set of documents describing the same topic (McKeown be studied in detail. While close attention has et. al., 1999, Mani and Bloedorn, 1999). To meet these been paid to what technologies are necessary expectations, a multi-document summary is required to when moving from single to multi-document generalize, condense and merge information coming summarization, the properties of human- from multiple sources. written multi-document summaries have not Although single-document summarization is a well- been quantified. In this paper, we empirically studied task (see Mani and Maybury, 1999 for an characterize human-written summaries overview), multi-document summarization is only provided in a widely used summarization recently being studied closely (Marcu & Gerber 2001). corpus by attempting to answer the questions: While close attention has been paid to multi-document Can multi-document summaries that are summarization technologies (Barzilay et al. 2002, written by humans be characterized as Goldstein et al 2000), the inherent properties of human- extractive or generative? Are multi-document written multi-document summaries have not yet been summaries less extractive than single- quantified.
    [Show full text]
  • Automatic Summarization of Medical Conversations, a Review Jessica Lopez
    Automatic summarization of medical conversations, a review Jessica Lopez To cite this version: Jessica Lopez. Automatic summarization of medical conversations, a review. TALN-RECITAL 2019- PFIA 2019, Jul 2019, Toulouse, France. pp.487-498. hal-02611210 HAL Id: hal-02611210 https://hal.archives-ouvertes.fr/hal-02611210 Submitted on 30 May 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Jessica López Espejel Automatic summarization of medical conversations, a review Jessica López Espejel 1, 2 (1) CEA, LIST, DIASI, F-91191 Gif-sur-Yvette, France. (2) Paris 13 University, LIPN, 93430 Villateneuse, France. [email protected] RÉSUMÉ L’analyse de la conversation joue un rôle important dans le développement d’appareils de simulation pour la formation des professionnels de la santé (médecins, infirmières). Notre objectif est de développer une méthode de synthèse automatique originale pour les conversations médicales entre un patient et un professionnel de la santé, basée sur les avancées récentes en matière de synthèse à l’aide de réseaux de neurones convolutionnels et récurrents. La méthode proposée doit être adaptée aux problèmes spécifiques liés à la synthèse des dialogues. Cet article présente une revue des différentes méthodes pour les résumés par extraction et par abstraction et pour l’analyse du dialogue.
    [Show full text]
  • Exploring Sentence Vector Spaces Through Automatic Summarization
    Under review as a conference paper at ICLR 2018 EXPLORING SENTENCE VECTOR SPACES THROUGH AUTOMATIC SUMMARIZATION Anonymous authors Paper under double-blind review ABSTRACT Vector semantics, especially sentence vectors, have recently been used success- fully in many areas of natural language processing. However, relatively little work has explored the internal structure and properties of spaces of sentence vectors. In this paper, we will explore the properties of sentence vectors by studying a par- ticular real-world application: Automatic Summarization. In particular, we show that cosine similarity between sentence vectors and document vectors is strongly correlated with sentence importance and that vector semantics can identify and correct gaps between the sentences chosen so far and the document. In addition, we identify specific dimensions which are linked to effective summaries. To our knowledge, this is the first time specific dimensions of sentence embeddings have been connected to sentence properties. We also compare the features of differ- ent methods of sentence embeddings. Many of these insights have applications in uses of sentence embeddings far beyond summarization. 1 INTRODUCTION Vector semantics have been growing in popularity for many other natural language processing appli- cations. Vector semantics attempt to represent words as vectors in a high-dimensional space, where vectors which are close to each other have similar meanings. Various models of vector semantics have been proposed, such as LSA (Landauer & Dumais, 1997), word2vec (Mikolov et al., 2013), and GLOVE(Pennington et al., 2014), and these models have proved to be successful in other natural language processing applications. While these models work well for individual words, producing equivalent vectors for sentences or documents has proven to be more difficult.
    [Show full text]
  • A Clustering-Based Algorithm for Automatic Document Separation
    A Clustering-Based Algorithm for Automatic Document Separation Kevyn Collins-Thompson Radoslav Nickolov School of Computer Science Microsoft Corporation Carnegie Mellon University 1 Microsoft Way 5000 Forbes Avenue Redmond, WA USA Pittsburgh, PA USA [email protected] [email protected] ABSTRACT For text, audio, video, and still images, a number of projects have addressed the problem of estimating inter-object similarity and the related problem of finding transition, or ‘segmentation’ points in a stream of objects of the same media type. There has been relatively little work in this area for document images, which are typically text-intensive and contain a mixture of layout, text-based, and image features. Beyond simple partitioning, the problem of clustering related page images is also important, especially for information retrieval problems such as document image searching and browsing. Motivated by this, we describe a model for estimating inter-page similarity in ordered collections of document images, based on a combination of text and layout features. The features are used as input to a discriminative classifier, whose output is used in a constrained clustering criterion. We do a task-based evaluation of our method by applying it the problem of automatic document separation during batch scanning. Using layout and page numbering features, our algorithm achieved a separation accuracy of 95.6% on the test collection. Keywords Document Separation, Image Similarity, Image Classification, Optical Character Recognition contiguous set of pages, but be scattered into several 1. INTRODUCTION disconnected, ordered subsets that we would like to recombine. Such scenarios are not uncommon when The problem of accurately determining similarity between scanning large volumes of paper: for example, one pages or documents arises in a number of settings when document may be accidentally inserted in the middle of building systems for managing document image another in the queue.
    [Show full text]
  • An Incremental Text Segmentation by Clustering Cohesion
    An Incremental Text Segmentation by Clustering Cohesion Raúl Abella Pérez and José Eladio Medina Pagola Advanced Technologies Application Centre (CENATAV), 7a #21812 e/ 218 y 222, Rpto. Siboney, Playa, C.P. 12200, Ciudad de la Habana, Cuba {rabella, jmedina} @cenatav.co.cu Abstract. This paper describes a new method, called IClustSeg, for linear text segmentation by topic using an incremental overlapped clustering algorithm. Incremental algorithms are able to process new objects as they are added to the collection and, according to the changes, to update the results using previous information. In our approach, we maintain a structure to get an incremental overlapped clustering. The results of the clustering algorithm, when processing a stream, are used any time text segmentation is required, using the clustering cohesion as the criteria for segmenting by topic. We compare our proposal against the best known methods, outperforming significantly these algorithms. 1 Introduction Topic segmentation intends to identify the boundaries in a document with goal of capturing the latent topical structure. The automatic detection of appropriate subtopic boundaries in a document is a very useful task in text processing. For example, in information retrieval and in passages retrieval, to return documents, segments or passages closer to the user’s queries. Another application of topic segmentation is in summarization, where it can be used to select segments of texts containing the main ideas for the summary requested [6]. Many text segmentation methods by topics have been proposed recently. Usually, they obtain linear segmentations, where the output is a document divided into sequences of adjacent segments [7], [9].
    [Show full text]
  • A Text Denormalization Algorithm Producing Training Data for Text Segmentation
    A Text Denormalization Algorithm Producing Training Data for Text Segmentation Kilian Evang Valerio Basile University of Groningen University of Groningen [email protected] [email protected] Johan Bos University of Groningen [email protected] As a first step of processing, text often has to be split into sentences and tokens. We call this process segmentation. It is often desirable to replace rule-based segmentation tools with statistical ones that can learn from examples provided by human annotators who fix the machine's mistakes. Such a statistical segmentation system is presented in Evang et al. (2013). As training data, the system requires the original raw text as well as information about the boundaries between tokens and sentences within this raw text. Although raw as well as segmented versions of text corpora are available for many languages, this required information is often not trivial to obtain because the segmented version differs from the raw one also in other respects. For example, punctuation marks and diacritics have been normalized to canonical forms by human annotators or by rule-based segmentation and normalization tools. This is the case with e.g. the Penn Treebank, the Dutch Twente News Corpus and the Italian PAISA` corpus. This problem of missing alignment between raw and segmented text is also noted by Dridan and Oepen (2012). We present a heuristic algorithm that recovers the alignment and thereby produces standoff annotations marking token and sentence boundaries in the raw test. The algorithm is based on the Levenshtein algorithm and is general in that it does not assume any language-specific normalization conventions.
    [Show full text]
  • Multi-Document Biography Summarization
    Multi-document Biography Summarization Liang Zhou, Miruna Ticrea, Eduard Hovy University of Southern California Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292-6695 {liangz, miruna, hovy} @isi.edu Abstract In this paper we describe a biography summarization system using sentence classification and ideas from information retrieval. Although the individual techniques are not new, assembling and applying them to generate multi-document biographies is new. Our system was evaluated in DUC2004. It is among the top performers in task 5–short summaries focused by person questions. 1 Introduction Automatic text summarization is one form of information management. It is described as selecting a subset of sentences from a document that is in size a small percentage of the original and Figure 1. Overall design of the biography yet is just as informative. Summaries can serve as summarization system. surrogates of the full texts in the context of To determine what and how sentences are Information Retrieval (IR). Summaries are created selected and ranked, a simple IR method and from two types of text sources, a single document experimental classification methods both or a set of documents. Multi-document contributed. The set of top-scoring sentences, after summarization (MDS) is a natural and more redundancy removal, is the resulting biography. elaborative extension of the single-document As yet, the system contains no inter-sentence summarization, and poses additional difficulties on ‘smoothing’ stage. algorithm design. Various kinds of summaries fall In this paper, work in related areas is discussed into two broad categories: generic summaries are in Section 2; a description of our biography corpus the direct derivatives of the source texts; special- used for training and testing the classification interest summaries are generated in response to component is in Section 3; Section 4 explains the queries or topic-oriented questions.
    [Show full text]
  • Keyphrase Based Evaluation of Automatic Text Summarization
    International Journal of Computer Applications (0975 – 8887) Volume 117 – No. 7, May 2015 Keyphrase based Evaluation of Automatic Text Summarization Fatma Elghannam Tarek El-Shishtawy Electronics Research Institute Faculty of Computers and Information Cairo, Egypt Benha University, Benha, Egypt ABSTRACT KpEval idea is to count the matches between the peer The development of methods to deal with the informative summary and reference summaries for the essential parts of contents of the text units in the matching process is a major the summary text. KpEval have three main modules, i) challenge in automatic summary evaluation systems that use lemma extractor module that breaks the text into words and fixed n-gram matching. The limitation causes inaccurate extracts their lemma forms and the associated lexical and matching between units in a peer and reference summaries. syntactic features, ii) keyphrase extractor that extracts The present study introduces a new Keyphrase based important keyphrases in their lemma forms, and iii) the Summary Evaluator (KpEval) for evaluating automatic evaluator that scoring the summary based on counting the summaries. The KpEval relies on the keyphrases since they matched keyphrases occur between the peer summary and one convey the most important concepts of a text. In the or more reference summaries. The remaining of this paper is evaluation process, the keyphrases are used in their lemma organized as follows: Section 2 reviews the previous works; form as the matching text unit. The system was applied to Section 3 the proposed keyphrase based summary evaluator; evaluate different summaries of Arabic multi-document data Section 4 discusses the performance evaluation; and section 5 set presented at TAC2011.
    [Show full text]