Topic-Based Multi-Document Summarization with Probabilistic Latent Semantic Analysis


Leonhard Hennig
DAI Labor, TU Berlin
Berlin, Germany
[email protected]

International Conference RANLP 2009 - Borovets, Bulgaria, pages 144–149

Abstract

We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user's information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary-relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.

Keywords

text summarization, probabilistic latent semantic analysis, plsa

1 Introduction

Automatically producing summaries from large textual sources is an extensively studied problem in IR and NLP [17, 12]. In this paper, we investigate the problem of multi-document summarization, where a summary is created from a set of related documents and optionally fulfills a specific information need of a user. In particular, we focus on generating an extractive summary by selecting sentences from a document cluster [8]. Multi-document summarization is an increasingly important task: with the rapid growth of online information, and many documents covering the same topic, the condensation of information from different sources into an informative summary helps to reduce information overload. Automatically created summaries can either consist of the most important information overall (generic summarization) or of the information most relevant with respect to a user's information need (query-focused summarization).

A major aspect of identifying relevant information is to find out what a text is about. A document will generally contain a variety of information centered around a main theme, and covering different aspects of the main topic. Similarly, human summaries tend to cover different topics of the original source text to increase the informative content of the summary. Various approaches have exploited features based on the identification of topics (or thematic foci) to construct generic or query-focused summaries. Often, thematic features rely on identifying and weighting important keywords [21], or creating topic signatures [14, 10]. Sentences are scored by combinations of keyword scores, or by computing similarities between sentences and queries. Yet it is well known that term matching has severe drawbacks due to the ambiguity of words and to differences in word usage and personal style across authors. This is especially important for automatic summarization, as summaries produced by humans may differ significantly, potentially sharing very few terms [16].

Latent Semantic Indexing (LSI) is an approach to overcome these problems by mapping documents to a latent semantic space, and has been shown to work well for text summarization [9, 23]. However, LSI has a number of drawbacks, namely its unsatisfactory statistical foundations. The technique of probabilistic latent semantic analysis (PLSA) assumes a latent lower-dimensional topic model as the origin of observed term co-occurrence distributions, and can be seen as a probabilistic analogue to LSI [11]. It has a solid statistical foundation, is based on the likelihood principle and defines a proper generative model for data. PLSA models documents as a list of mixing proportions for mixture components that can be viewed as representations of "topics" [4].

In this paper, we are primarily interested in the capability of the PLSA approach to model documents as mixtures of topics. Unlike previous approaches in PLSA-based extractive summarization, we represent sentences, queries, and documents as probability distributions over topics. We train the probabilistic model on the term-sentence matrix of all sentences in a document cluster, and proceed by folding queries, document titles and cluster centroid vectors into the trained model. This allows us to compute various thematic and query-focused similarity measures, as well as redundancy measures, in the space of latent topics, in order to estimate the summary-worthiness of sentences.

Our system improves on previous approaches in three ways: First, we investigate PLSA in the context of multi-document summarization, modeling topic distributions across documents and taking into account information redundancy. Second, we do not only pick sentences from topics with the highest likelihood in the training data as in [3], but compute a sentence's score based on a linear function of query-focused and thematic features. Third, we examine how a PLSA model can be used to represent documents, sentences and queries in the context of multi-document summarization, and investigate which measures are most useful for computing similarities in the latent topic space. We evaluate our approach on the data sets of the DUC 2006 and DUC 2007 text summarization challenges, and show that the resulting summaries compare favorably on ROUGE metrics with those produced by existing state-of-the-art summarization systems.

The rest of this paper is organized as follows: In Section 2 we describe the probabilistic latent semantic analysis algorithm. Next, in Section 3, we give details of our summarization system, the sentence-level features we use, and the similarity measures we evaluate. In Section 4, we give experimental results showing that our approach leads to improvements over an LSI baseline, and that overall scores compare well with those of existing systems on ROUGE metrics. We then compare our system to related work in Section 5, and finally Section 6 concludes the paper.

2 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis is a latent variable model for co-occurrence data which has been found to provide better results than LSI for term matching in retrieval applications [11]. It associates an unobserved class variable $z \in \mathcal{Z} = \{z_1, \ldots, z_k\}$ with each observation $(d, w)$, where word $w \in \mathcal{W} = \{w_1, \ldots, w_i\}$ occurs in document $d \in \mathcal{D} = \{d_1, \ldots, d_j\}$. Each word in a document is considered as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of latent topics. A document is represented as a list of mixing proportions for the mixing components, i.e. it is reduced to a probability distribution over a fixed set of latent classes.

In terms of a generative model, PLSA can be defined as follows:

• select a document $d$ with probability $P(d)$,
• pick a latent class $z$ with probability $P(z \mid d)$,
• generate a word $w$ with probability $P(w \mid z)$.

For each observation pair $(d, w)$ the resulting likelihood expression is:

$$P(d, w) = P(d)\,P(w \mid d), \quad \text{where} \tag{1}$$

$$P(w \mid d) = \sum_{z \in \mathcal{Z}} P(w \mid z)\,P(z \mid d). \tag{2}$$

A document $d$ and a word $w$ are assumed to be conditionally independent given the unobserved topic $z$. Following the maximum likelihood principle, the mixing components and the mixing proportions are determined [...]

The standard procedure for maximizing the likelihood function in the presence of latent variables is the Expectation Maximization (EM) algorithm. EM is an iterative algorithm where each iteration consists of two steps: an expectation step, where the posterior probabilities for the latent classes $z$ are computed, and a maximization step, where the conditional probabilities of the parameters given the posterior probabilities of the latent classes are updated. Alternating the expectation and maximization steps, one arrives at a convergence point which describes a local maximum of the log likelihood. The outputs of the algorithm are the mixture components, as well as the mixing proportions over the components for each training document, i.e. the conditional probabilities $P(w \mid z)$ and $P(z \mid d)$. For details of the EM algorithm and its application to PLSA, see [11].

3 Topic-based summarization

Our approach for producing a summary consists of three steps: First, we associate sentences and queries with a representation in the latent topic space of a PLSA model by estimating their mixing proportions $P(z \mid d)$¹. We then compute several sentence-level features based on the similarity of sentence and query distributions over latent topics. Finally, we combine individual feature scores linearly into an overall sentence score to create a ranking, which we use to select sentences for the summary. We follow a greedy approach for selecting sentences, and penalize candidate sentences based on their similarity to the partial summary.

3.1 Sentence representation in the latent topic space

Given a corpus $\mathcal{D}$ of topic-related documents, we perform sentence splitting on each document using the NLTK toolkit². Each sentence is represented as a bag-of-words $w = (w_1, \ldots, w_m)$. During preprocessing, we remove stop words, and apply stemming using Porter's stemmer [22]. We discard all sentences which contain fewer than $l_{min} = 5$ or more than $l_{max} = 20$ content words, as these sentences are unlikely to be useful for a summary [24]. We create a term-sentence matrix $TS$ containing all sentences of the corpus, where each entry $TS(i, j)$ is given by the frequency of term $i$ in sentence $j$. We then train the PLSA model on the term-sentence matrix $TS$.

After the model has been trained, it provides a representation of the sentences as probability distributions $P(z \mid s)$ over the latent topics $z$. This representation can be interpreted as follows: Since the source documents cover multiple topics related to a central theme, each sentence can be viewed as representing one or more of these topics.
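To make the training step concrete, the following is a minimal sketch of how a term-sentence matrix TS could be built and how plain EM could fit P(w|z) and P(z|s) on it. It is an illustration under stated assumptions, not the system described in this paper: sentences are assumed to arrive already split, stop-word filtered and stemmed (the paper uses NLTK and Porter's stemmer for this), the function names `term_sentence_matrix` and `plsa`, the number of topics `k`, the iteration count and the random initialization are all hypothetical choices, and the folding-in of queries and titles is not shown.

```python
import random
from collections import Counter


def term_sentence_matrix(sentences, l_min=5, l_max=20):
    """Build TS where ts[i][j] is the frequency of term i in sentence j.

    `sentences` is a list of lists of content words (assumed preprocessed).
    Sentences with fewer than l_min or more than l_max words are discarded.
    """
    kept = [s for s in sentences if l_min <= len(s) <= l_max]
    vocab = sorted({w for s in kept for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    ts = [[0] * len(kept) for _ in vocab]
    for j, s in enumerate(kept):
        for w, n in Counter(s).items():
            ts[index[w]][j] = n
    return ts, vocab, kept


def _normalize(values):
    total = sum(values)
    return [v / total for v in values]


def plsa(ts, k, iterations=50, seed=0):
    """Fit PLSA with plain EM on a term-sentence matrix.

    Returns (p_w_z, p_z_s) with p_w_z[z][i] = P(w_i | z) and
    p_z_s[j][z] = P(z | s_j).
    """
    rng = random.Random(seed)
    n_terms, n_sents = len(ts), len(ts[0])
    # Random positive initialization of both parameter sets.
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_terms)])
             for _ in range(k)]
    p_z_s = [_normalize([rng.random() + 1e-3 for _ in range(k)])
             for _ in range(n_sents)]
    for _ in range(iterations):
        # Small constants keep every probability strictly positive.
        new_w_z = [[1e-12] * n_terms for _ in range(k)]
        new_z_s = [[1e-12] * k for _ in range(n_sents)]
        for i in range(n_terms):
            for j in range(n_sents):
                n = ts[i][j]
                if n == 0:
                    continue
                # E-step: posterior P(z | s_j, w_i), proportional to
                # P(w_i | z) * P(z | s_j).
                post = _normalize([p_w_z[z][i] * p_z_s[j][z]
                                   for z in range(k)])
                # Accumulate expected counts for the M-step.
                for z in range(k):
                    new_w_z[z][i] += n * post[z]
                    new_z_s[j][z] += n * post[z]
        # M-step: renormalize expected counts into probabilities.
        p_w_z = [_normalize(row) for row in new_w_z]
        p_z_s = [_normalize(row) for row in new_z_s]
    return p_w_z, p_z_s


# Toy input: four usable sentences and one that is too short.
sentences = [
    "stock market price fall investor sell".split(),
    "investor fear market crash stock".split(),
    "team win match goal player score".split(),
    "player score goal final match".split(),
    "short one".split(),
]
ts, vocab, kept = term_sentence_matrix(sentences)
p_w_z, p_z_s = plsa(ts, k=2)
```

Each row of `p_z_s` is then one sentence's distribution over the `k` latent topics, which is the representation the sentence-level similarity features operate on. Because EM only reaches a local maximum of the log likelihood, different seeds can yield different topic assignments.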