International Journal of Scientific Research in Information Systems and Engineering (IJSRISE) Volume 1, Issue 2, December 2015. ISSN 2380-8128

Single-Document Summarization Using Latent Semantic Analysis

Oluwajana Dokun, Institute of Graduate Studies and Research, Cyprus International University, North Cyprus. E-mail: [email protected]
Erbug Celebi, Institute of Graduate Studies and Research, Cyprus International University, North Cyprus. E-mail: [email protected]

Abstract - In this study we evaluate existing methods of automatic document summarization and propose two approaches for English documents based on latent semantic analysis. Four existing and two proposed sentence-selection methods for automatic summarization are used: the evaluated existing methods are those of Gong and Liu, Steinberger and Jezek, Murray, Renals and Carletta, and the Cross approach, and the proposed methods are avesvd and ravesvd. Latent semantic analysis (LSA) is a technique that uses vectorial semantics to analyze relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA brings out latent relationships within a collection of documents rather than looking at each document in isolation from the others: it looks at all the documents as a whole, together with the terms they contain, to identify relationships between them. We have compared the performance of our systems with existing systems in the literature developed for document summarization. The document sets used for the evaluation are the Document Understanding Conferences (DUC) corpora DUC-2002 and DUC-2004. The evaluation and comparison of the summaries are performed with ROUGE-L.

Keywords – Abstractive Summarization, Extractive Summarization, Latent Semantic Analysis.

1. Introduction

In the century we live in today, everything is evolving faster than ever. The use of the internet and of online documents has increased exponentially, making it hard for users to find precise documents among the millions available; even the effort of reading the documents and identifying the exact document needed is a serious problem. This problem can be addressed using automatic text summarization, a branch of information retrieval (IR). IR is widely used in search engines, online-book websites, news portals and so on, because it enriches them with the semantic relationships and the context of the documents retrieved. IR is subdivided into many branches, one of which is automatic summarization. Automatic text summarization is the creation of a reduced form of a text by a computer program such that the output still contains the most relevant part of the original text; more specifically, it aims at extracting the important sentences from a large amount of text in a document while retaining their quality. The goal of this study is to examine diverse ways of automatic text summarization, using singular value decomposition as the underlying algorithm, and to find an efficient and effective selection method for the summary.

The following sections are organized as follows. Section 2 reviews previous work on text summarization approaches and evaluation measures. Section 3 explains the main approach used, the LSA approach, in detail; the preprocessing method and the step-by-step path from document to summary are explained. Section 4 describes the implementation of latent semantic analysis in our proposed summarization system. Section 5 presents the evaluation results of the LSA-based single-document summarization algorithms on English document sets. Section 6 gives a brief discussion, concluding remarks and future work.

Corresponding Author: Oluwajana Dokun, Institute of Graduate Studies and Research, Cyprus International University, North Cyprus. E-mail: [email protected]

2. RELATED WORKS

Researchers have been working actively on text summarization within Natural Language Processing (NLP) to create better and more efficient summaries. This work started in the late fifties, and since then many methods have developed, from single-document to multi-document and even multi-lingual text summarization, and from extractive to abstractive approaches; above all, the quality of the produced summaries has been increasing over the years. Extractive summarization works by finding the salient topics in a text. Luhn [1] at the IBM laboratory worked on word frequency in the text. H. P. Edmundson [2] used title words, cue phrases, the key method and the position method, a surface-level approach. Daniel Jacob Gillick [3] used a classification function to categorize each sentence (sentence extraction) with a naive-Bayes classifier, a machine-learning-based approach. Eduard Hovy and Chin-Yew Lin [4] also studied sentence position and later tried to restructure sentence extraction using decision trees, a statistical approach. Gerard Salton [5] worked on automatic indexing, which later turned into a statistical process based on the term frequency - inverse document frequency algorithm, a graph-based approach. The abstractive or non-extractive approach is different from the extractive approach, although it may use extraction internally: it observes and understands the document and then generates a new summary, and this summary need not reuse the wording of the original document. For example, Knight and Marcu [6] used statistics-based summarization to train a system to compress the syntactic parse tree of a sentence in order to produce a shorter but still maximally grammatical version, a reduction approach. Daume and Marcu [7] contributed a compression approach using a Rhetorical Structure Tree, in which a decision tree picks the relevant compressed parts and leaves out the irrelevant ones, a compressive summarization. There are many approaches to text summarization, and the majority of them are extractive, because extraction only selects the important sentences from the input text; text abstraction, the non-extractive approach, proves to be the more challenging task: to parse the original text in a deep linguistic way, interpret the text semantically into a formal representation, find new, more concise concepts to describe the text, and then generate a new, shorter text with the same information content. The evaluation of the summaries is another challenging part of document summarization, because there is still no single ideal summary for a document; instead, different evaluation approaches have been used for text summarization in general.

3. LATENT SEMANTIC ANALYSIS

Latent Semantic Analysis (LSA) is a combination of algebraic and statistical methods, and the technique brings out the hidden structure of relationships between words, sentences or documents. The main idea of LSA is that it takes the input document, converts it to a sentence-term matrix and processes the matrix through an algorithm called singular value decomposition (SVD). The purpose of the SVD is to find relationships between words and sentences, to reduce noise and to model the relationships among sentences and words; the summary is finally obtained from the output of the SVD algorithm. The main LSA algorithm for text summarization is divided into three steps: creation of the sentence-term matrix, application of SVD to the matrix, and selection of the sentences for the summary. In this section we first describe the creation of the sentence-term matrix, then the application of SVD to the matrix, and finally the different algorithms for selecting sentences in latent semantic analysis.
3.1 Creation of the sentence-term matrix

In LSA, the creation of the sentence-by-term matrix is based on the vector-space model (VSM), that is, the arrangement of bags of words into a sentence-by-term matrix. A matrix is the representation of data in rows and columns; here the rows represent the words, the columns represent the sentences, and each cell is filled with the corresponding value. The creation of the matrix is a demanding task in LSA, because the text must pass through a preprocessing stage before it becomes a full sentence-term matrix.
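Listing 1 sketches this construction in Java, the language of our implementation (Section 4). It is an illustration only: it assumes the sentences have already been tokenized, lower-cased and stemmed as in Section 4.2, and it uses raw term frequency as the cell weight; binary or tf-idf weights can be substituted.

Listing 1 - Building a term-by-sentence matrix (illustrative sketch)

import java.util.*;

// Minimal sketch: build a term-by-sentence frequency matrix
// (rows = terms, columns = sentences). Illustrative names, not
// the actual system code; tokens are assumed preprocessed.
public class TermSentenceMatrix {

    public static double[][] build(List<List<String>> sentences, List<String> vocabulary) {
        Map<String, Integer> termIndex = new HashMap<>();
        for (String term : vocabulary) {
            termIndex.put(term, termIndex.size());
        }
        double[][] a = new double[vocabulary.size()][sentences.size()];
        for (int j = 0; j < sentences.size(); j++) {
            for (String token : sentences.get(j)) {
                Integer i = termIndex.get(token);
                if (i != null) {
                    a[i][j] += 1.0; // raw term frequency; tf-idf or binary weights also work
                }
            }
        }
        return a;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("cat", "sat", "mat"),
                Arrays.asList("dog", "sat", "log"));
        List<String> vocab = Arrays.asList("cat", "dog", "sat", "mat", "log");
        System.out.println(Arrays.deepToString(build(sentences, vocab)));
    }
}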
3.2 Applying SVD to the matrix

SVD is based on a theorem from linear algebra in which a rectangular matrix A is decomposed into the product of three matrices: an orthogonal matrix U, a diagonal matrix Σ, and the transpose of an orthogonal matrix V. The purpose of SVD is to take a high-dimensional set of data points and reduce it to a lower-dimensional space; here SVD is used to reduce the dimension of the term-by-sentence matrix. The technique also reveals the latent structure of the data while removing noise. The computation of the SVD is as follows: SVD decomposes the matrix A into three matrices,

A = U Σ V^T,

where U is the matrix whose columns are the eigenvectors of the AA^T matrix; it holds the left singular vectors and represents the relation between terms and concepts. Σ is the matrix whose diagonal elements are the singular values of A and whose non-diagonal elements are 0; it represents the concept-by-concept relation. V is the matrix whose columns are the eigenvectors of the A^T A matrix; it holds the right singular vectors and represents the relation between sentences and concepts, and V^T is the transpose of V.

For the reduced dimension, a suitable value k (the rank approximation value) should be chosen to reduce the dimension of the LSA space. The amount of dimension reduction is determined by the value of k; what matters is not simply whether k is large or small, but that the reduction removes unimportant details while still yielding good retrieval results.
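Since the implementation (Section 4) uses the JAMA library [12], this step reduces to a single library call. Listing 2 is a minimal, illustrative example on a toy 3x2 matrix; note that JAMA's SVD expects at least as many rows as columns, which a term-by-sentence matrix (many terms, few sentences) normally satisfies.

Listing 2 - Computing the SVD with JAMA (illustrative sketch)

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Minimal sketch of A = U * Sigma * V^T using JAMA [12] on a toy matrix.
public class SvdExample {
    public static void main(String[] args) {
        double[][] data = {
                {1, 0}, // term 1 occurs once in sentence 1
                {0, 1}, // term 2 occurs once in sentence 2
                {1, 1}  // term 3 occurs in both sentences
        };
        Matrix a = new Matrix(data);
        SingularValueDecomposition svd = a.svd();

        Matrix u = svd.getU();                    // left singular vectors (terms vs. concepts)
        double[] sigma = svd.getSingularValues(); // diagonal of Sigma
        Matrix vt = svd.getV().transpose();       // each column of V^T describes one sentence

        u.print(8, 3);
        vt.print(8, 3);
        System.out.println(java.util.Arrays.toString(sigma));
    }
}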

Sentence selection is done after creating the input matrix and computing its singular value decomposition. The next step is ranking the sentences based on their scores and selecting the ranked sentences according to the particular algorithm used for document summarization. In this study four such algorithms that use LSA for document summarization are explained in detail.

3.3.1 Different algorithms for selecting sentences in latent semantic analysis

Gong and Liu [8] pioneered the use of LSA for text summarization. They started by creating the term-by-sentence matrix, and their reasons for applying SVD to the matrix come from two views. From the transformation aspect, SVD gives a mapping between the m-dimensional space spanned by the weighted term-frequency vectors and the r-dimensional singular vector space with all of its axes linearly independent. From the semantic point of view, the SVD obtains the latent semantic structure of the document represented by the matrix, that is, a breakdown of the original document into r linearly independent base vectors or concepts. After performing the SVD on the term-sentence matrix, one obtains the singular value matrix and the right singular vector matrix V^T. In the singular vector space each sentence is represented by a column vector of V^T. The method picks the p-th right singular vector from matrix V^T, selects the sentence which has the largest value in that p-th right singular vector (the largest entry in the p-th row of V^T), and includes it in the summary. When p reaches the predefined number of sentences set by the user, the operation terminates; otherwise p is incremented by one and the procedure repeats.

The main disadvantages of Gong and Liu's method are that, when sentences are extracted, the top topics are all treated as equally important concepts, and that only one sentence is selected from each concept, so the number of sentences collected equals the number of dimensions used; the longer the required summary, the less important the concepts that end up being picked.

Steinberger and Jezek [9] select sentences through their vectorial representation in the matrix, choosing the sentences whose vectors have the highest length:

L_j = sqrt( Σ_i (σ_i · v_ij)^2 ),

where L_j is the length score of sentence j, v_ij is the entry of V for sentence j and concept i (V being the matrix whose columns are the eigenvectors of the A^T A matrix), and σ_i are the singular values of A, the diagonal elements of Σ.

In the approach of Murray et al. [10], more than one sentence can be collected from the topmost important concepts, placed in the first rows of the matrix, rather than extracting the single best sentence for each topic. The decision of how many sentences to collect from each concept is made using the Σ matrix: the number is determined, for each concept, by the percentage of the related singular value over the sum of all singular values. The Murray et al. approach thus relaxes Gong and Liu's restriction of selecting a single sentence from each concept: more than one sentence can be chosen, even ones that do not have the highest cell value in the row of the related concept, and the way the reduced dimension is used makes it different from the other approaches.
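Listing 3 sketches these selection rules over V^T (rows are concepts, columns are sentences). The code is illustrative only, not the original authors' implementations; in particular, the Gong and Liu sketch does not handle the case of a sentence winning more than one concept.

Listing 3 - Sentence-selection rules over V^T (illustrative sketches)

// Illustrative sketches of three selection rules over V^T.
public class SentenceSelection {

    // Gong & Liu [8]: for concept p = 0..count-1, pick the sentence with
    // the largest value in row p of V^T (duplicates not handled here).
    public static int[] gongLiu(double[][] vt, int count) {
        int[] chosen = new int[count];
        for (int p = 0; p < count; p++) {
            int best = 0;
            for (int j = 1; j < vt[p].length; j++) {
                if (vt[p][j] > vt[p][best]) best = j;
            }
            chosen[p] = best;
        }
        return chosen;
    }

    // Steinberger & Jezek [9]: score sentence j by the vector length
    // L_j = sqrt(sum_i (sigma_i * v_ij)^2) and rank by that score.
    public static double[] steinbergerJezek(double[][] vt, double[] sigma) {
        double[] score = new double[vt[0].length];
        for (int j = 0; j < score.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < vt.length; i++) {
                sum += vt[i][j] * vt[i][j] * sigma[i] * sigma[i];
            }
            score[j] = Math.sqrt(sum);
        }
        return score;
    }

    // Murray et al. [10]: allocate a per-concept sentence budget in
    // proportion to sigma_i over the sum of all singular values.
    public static int[] murrayCounts(double[] sigma, int summaryLength) {
        double total = 0.0;
        for (double s : sigma) total += s;
        int[] counts = new int[sigma.length];
        for (int i = 0; i < sigma.length; i++) {
            counts[i] = (int) Math.round(summaryLength * sigma[i] / total);
        }
        return counts;
    }
}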

Makbule Gulcin Ozsoy [11] used the Cross method to improve on the Steinberger and Jezek approach. In this approach the input matrix creation and SVD calculation steps are executed as in the other approaches, and the V^T matrix is then used for sentence selection; however, between the SVD calculation step and the sentence selection step a preprocessing step is placed, whose purpose is to remove the overall effect of sentences that are not strongly related to a concept, leaving only the most relevant sentences for that concept. Mathematically, for each concept, represented by a row of the V^T matrix, the average sentence score is calculated; then the cell values which are less than or equal to the average score are set to zero. After that the Steinberger and Jezek approach is followed with a small modification: the concept scores are added up using the values that remain after the preprocessing step, as sketched below.
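Listing 4 sketches the extra filtering step and the modified scoring; again this is an illustration, not the code of [11].

Listing 4 - Cross method filtering and scoring (illustrative sketch)

// Illustrative sketch of the Cross method's preprocessing step [11]:
// zero out cells at or below the row (concept) average, then sum the
// remaining concept scores per sentence.
public class CrossMethod {

    public static double[][] removeBelowAverage(double[][] vt) {
        double[][] filtered = new double[vt.length][];
        for (int i = 0; i < vt.length; i++) {
            double avg = 0.0;
            for (double v : vt[i]) avg += v;
            avg /= vt[i].length;
            filtered[i] = new double[vt[i].length];
            for (int j = 0; j < vt[i].length; j++) {
                // keep only cells strictly above the concept's average score
                filtered[i][j] = vt[i][j] > avg ? vt[i][j] : 0.0;
            }
        }
        return filtered;
    }

    // Modified scoring: each sentence's score is the sum of its
    // remaining concept scores after filtering.
    public static double[] score(double[][] filtered) {
        double[] score = new double[filtered[0].length];
        for (double[] row : filtered) {
            for (int j = 0; j < row.length; j++) score[j] += row[j];
        }
        return score;
    }
}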
4. IMPLEMENTATION

4.1 Introduction

We developed an application to provide a strong single-document automatic summarization system. The system has been implemented in the Java programming language, and the JAMA library [12] was used as part of the application. JAMA is a basic linear algebra package for Java. It provides user-level classes for constructing and manipulating real, dense matrices, together with decompositions producing the pairs or triples of matrices, permutation vectors and the like needed to obtain relevant and accurate results.

4.2 Implementation Processes

Step 1: Loading is the primary step of inputting the text into the buffer and performing some processing for summarization. Our system loads the folder of files into the buffer, checks whether the input document is available or not, changes the text into lower case and also differentiates between similar words. All of this is done to improve accuracy and eliminate redundancy in the system.

Step 2: Tokenization converts a stream of characters into a stream of processing units called tokens. After the system has placed the text inside the buffer, it splits the sentences into their word units and then removes punctuation marks, parentheses, quotes, whitespace and the like, so that a sequence of tokens is obtained.

Step 3: The next step is stemming, the process of reducing different forms of a word to its root form; the purpose of stemming is to reduce the memory used for storing the words. In our system Porter's stemming algorithm, the best known stemmer, is used; it removes prefixes and suffixes and applies some transformation rules.

Step 4: After tokenization and stemming we still find that many tokens end with a period without ending a sentence, as in "Mrs." or "Prof.", so we apply sentence-discrimination rules to solve this problem, such as the rule not to break a sentence when it contains numeric expressions with internal periods.

Step 5: Term frequency is a numerical matrix whose cells are filled with the frequency of each term occurring in the collection of documents. In our system each row represents a document in the collection and each column represents a term in the document; the higher the value in a cell, the greater the number of times the word occurs in the document.

Step 6: Sentence selection using LSI. We developed two systems that use the above steps up to the creation of the SVD matrices. After creating the matrices, sentence selection is the next step, which was also performed by the approaches described in Section 3.3; our systems select their summaries from the V^T matrix of the SVD. The two proposed selection methods are described in Sections 4.3 and 4.4.
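To make Steps 1-3 concrete, Listing 5 shows a deliberately simplified sketch. The stemmer below is a crude stand-in for the Porter stemmer actually used, and the real system additionally applies the sentence-discrimination rules of Step 4.

Listing 5 - Simplified preprocessing for Steps 1-3 (illustrative sketch)

import java.util.*;

// Illustrative preprocessing sketch: lower-casing, punctuation removal,
// tokenization and (toy) stemming. Not the actual system code.
public class Preprocess {

    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        // Steps 1-2: lower-case, strip punctuation/quotes/parentheses, split on whitespace
        for (String t : sentence.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").split("\\s+")) {
            if (!t.isEmpty()) tokens.add(stem(t));
        }
        return tokens;
    }

    // Step 3: toy suffix-stripping stemmer, a stand-in for Porter's algorithm.
    private static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) return token.substring(0, token.length() - 3);
        if (token.endsWith("s") && token.length() > 3) return token.substring(0, token.length() - 1);
        return token;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The judges were summarizing documents, (quickly)."));
    }
}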

4.3 The Proposed Method avesvd

The avesvd method is an extension of the Gong and Liu approach. avesvd uses the above steps up to the creation of the SVD matrices; after these steps, sentence selection is the next step, which was also performed by the other approaches described in Section 3.3, and avesvd selects its summary from the V^T matrix of the SVD. Gong and Liu proposed that row order indicates the importance of the concepts, such that the first row represents the most important concept extracted. Instead of following the Gong and Liu approach of selecting the highest-valued sentence from each concept row, we compute, for each sentence column, the average over the concepts, using equation (1) below:

avg(a_j) = (1/n) Σ_{i=1..n} a_ij,   (1)

where a_j denotes the j-th column of V^T and n is the total number of concepts (rows) entering the average.

The system calculates this average over the entire set of concepts for each sentence column of the V^T matrix; the first sentence is chosen as the one with the highest average score, the second sentence as the one with the second highest average score, and this process continues until the predefined number of sentences is collected.
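Listing 6 sketches the avesvd selection rule of equation (1); the class and method names are illustrative.

Listing 6 - avesvd selection by column averages of V^T (illustrative sketch)

import java.util.Arrays;

// Illustrative sketch of avesvd: score each sentence by the average of
// its column in V^T (equation 1), then rank sentences by that average.
public class AveSvd {

    public static Integer[] rankByColumnAverage(double[][] vt) {
        int nSentences = vt[0].length;
        double[] avg = new double[nSentences];
        for (int j = 0; j < nSentences; j++) {
            for (double[] row : vt) avg[j] += row[j];
            avg[j] /= vt.length; // equation (1): mean over all concepts
        }
        Integer[] order = new Integer[nSentences];
        for (int j = 0; j < nSentences; j++) order[j] = j;
        Arrays.sort(order, (a, b) -> Double.compare(avg[b], avg[a])); // best first
        return order; // the first k entries form the summary
    }
}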


Finally, one sentence is selected as the sentence with the highest average importance over all the concepts; then a second sentence is chosen as the one with the second highest importance, and this process continues until the predefined number of sentences is collected, as shown in Figure 1.

Figure 1 - The architectural design of avesvd
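Putting the pieces together, Listing 7 shows a toy end-to-end run of avesvd; it reuses the AveSvd class of Listing 6 and the JAMA calls of Listing 2, and is illustrative only.

Listing 7 - Toy end-to-end avesvd run (illustrative sketch)

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Toy avesvd run on a 4-term x 3-sentence matrix (illustrative).
public class AveSvdDemo {
    public static void main(String[] args) {
        double[][] a = {
                {1, 0, 1},
                {0, 1, 0},
                {1, 1, 0},
                {0, 0, 1}
        };
        SingularValueDecomposition svd = new Matrix(a).svd();
        double[][] vt = svd.getV().transpose().getArray();
        Integer[] ranked = AveSvd.rankByColumnAverage(vt); // class from Listing 6
        System.out.println("Sentence ranking: " + java.util.Arrays.toString(ranked));
    }
}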

4.4 The Proposed Method ravesvd

The ravesvd method is similar to the avesvd approach, following the same steps and likewise based on latent semantic analysis, as shown in Figure 2. First the input documents are represented in matrix form and the SVD calculation is done. After these steps, the system selects its summary from the averages of the columns of the V^T matrix, selecting the sentences for the summary using equation (2) below:

avg(a_j) = (1/k) Σ_{i=1..k} a_ij,   (2)

where a_j again denotes the j-th column of V^T and only the first k rows (the retained concepts) enter the average.

Figure 2 - The architectural design of ravesvd

The idea behind ravesvd is to reduce the dimension of the concept space (here k = 2) without losing any concept that remains after the reduction. Reducing in this way can cause the loss of many topics, but without it we might include less important topics and noise in the summarization.
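Listing 8 sketches the corresponding selection rule; it differs from Listing 6 only in restricting the average to the first k rows of V^T.

Listing 8 - ravesvd selection over k retained concepts (illustrative sketch)

import java.util.Arrays;

// Illustrative sketch of ravesvd: identical to avesvd, but only the
// first k rows of V^T (the retained concepts, here typically k = 2)
// enter the average (equation 2).
public class RaveSvd {

    public static Integer[] rankByReducedAverage(double[][] vt, int k) {
        int rows = Math.min(k, vt.length);
        int nSentences = vt[0].length;
        double[] avg = new double[nSentences];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < nSentences; j++) avg[j] += vt[i][j];
        }
        for (int j = 0; j < nSentences; j++) avg[j] /= rows; // equation (2)
        Integer[] order = new Integer[nSentences];
        for (int j = 0; j < nSentences; j++) order[j] = j;
        Arrays.sort(order, (a, b) -> Double.compare(avg[b], avg[a]));
        return order;
    }
}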

5. Experimental results

Four experiments test our systems on document summaries over the DUC-2002 and DUC-2004 corpora. The DUC-2002 and DUC-2004 document sets consist of 567 and 482 documents respectively. Out of the 567 documents present in the DUC-2002 document set we were left with 556 documents, and out of the 482 documents in DUC-2004, 475 were used after the clean-up processes. These English datasets are used to compare the results of the different text summarization approaches, with ROUGE as the standard evaluator. In this section we work with the two datasets: the preparation of the ROUGE setup and the results from the proposed methods presented in Section 4 are examined. For each experiment the corresponding proposed method was applied and the composed summaries were evaluated; the obtained results are presented and discussed.

5.1 Evaluation Setup

Our evaluation is based on three parts: the Latent Semantic Analysis approach, the output part and the ROUGE evaluation part.

5.2 Latent Semantic Analysis approach

The system generates latent-semantic-analysis summaries for the four existing methods and the two proposed methods, taking as input the documents given by DUC; these generated summaries serve as the peer summaries. The input processing uses Java with JAMA as part of the library, and an XML file is produced as output. XML files for all the documents in the DUC document set were generated and stored for use by the whole summarization system.

5.3 Output part

The second part produces the model summaries, that is, the reference summaries, which are also in XML file format. In essence, the output comprises both model and peer summaries as XML files. Finally, both the model and the peer summaries extracted must have a corresponding summarizer ID number created for the ROUGE evaluation step.

5.4 ROUGE Evaluation component

The ROUGE system requires as input an XML file that specifies the peer (system) summaries and the model summaries to be evaluated together, and our system was written to produce an XML file in the format required for ROUGE evaluation. While running ROUGE, several options can be chosen, some of which specify preprocessing tasks; we produced the input for ROUGE evaluation as described above. The performance of the text summarization algorithms described in this study was measured by obtaining ROUGE-L results with the Perl script, invoked as ROUGE-1.5.5.pl -a ./ROUGE_EVAL_HOME/thesis.xml, using ROUGE-L as our results metric; the discussions are made based on these outputs.
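Where convenient, this evaluation can be launched directly from the Java application. Listing 9 shows one illustrative way to shell out to the same Perl command using the standard ProcessBuilder API; the paths are placeholders taken from our setup, and error handling is omitted.

Listing 9 - Invoking the ROUGE script from Java (illustrative sketch)

import java.io.IOException;

// Illustrative: run the ROUGE-1.5.5 Perl script from Java with the
// arguments used in our evaluation (paths are placeholders).
public class RunRouge {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "perl", "ROUGE-1.5.5.pl", "-a", "./ROUGE_EVAL_HOME/thesis.xml");
        pb.inheritIO(); // print ROUGE's recall/precision/F-measure output directly
        int exit = pb.start().waitFor();
        System.out.println("ROUGE finished with exit code " + exit);
    }
}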

5.5 Results

The ROUGE toolkit was run to evaluate the summaries with the metrics recall, precision and F-measure. The 2002 and 2004 datasets in English are the datasets used for the evaluation of the summarization systems. In order to compare the LSA-based approaches with other approaches we used the DUC-2002 and DUC-2004 datasets; different resources were used and their evaluation results for summarization were collected. The tables below show the ROUGE-L scores obtained from running the ROUGE evaluation toolkit, comparing the most important sentences chosen by each system against the three model summaries.

Table 1 - ROUGE-L scores for the 2002 dataset

Figure 3 - ROUGE-L scores for the 2002 dataset

Table 2 - ROUGE-L scores for the 2004 dataset

Figure 4 - ROUGE-L scores for the 2004 dataset

From Table 1 and Table 2 it can be observed that the Murray method has the highest recall score on both the DUC-2002 and DUC-2004 datasets; Steinberger and Jezek has the highest precision score on both the DUC-2002 and DUC-2004 datasets; Gong and Liu has the highest F-measure on the DUC-2002 dataset, while Steinberger and Jezek has the highest F-measure on the DUC-2004 dataset; ravesvd came third in both recall and F-measure on DUC-2002 and fourth in both recall and F-measure on DUC-2004. Overall, we believe that the results are encouraging, but more improvement is still needed to achieve better results.

6. Conclusion

The system we have built is an LSA-based summarization system for single-document summarization. LSA presumes that words that are close in meaning will occur in similar pieces of text, and it analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. In this study we have explored the relevant linear algebra concepts, automatically summarized text using the LSA approach, and evaluated the generated summaries using the ROUGE evaluation toolkit. To speed up the computation of the SVD we took advantage of a predefined library, the JAMA library. The results show that both proposed systems extract sentences reasonably well from the original text. We chose to select the sentence with the largest value after averaging each column of V^T over the concepts, without assuming that any concept is less important than another. A disadvantage is that sentences might have equal average scores, and how to prioritize them is then a problem; therefore sentence selection should be further refined. From the results we also observed that the reduction in dimension has only a small effect on the results.

7. REFERENCES

[1] H. P. Luhn, "The Automatic Creation of Literature Abstracts", IBM Journal, April 1958.
[2] H. P. Edmundson, "New Methods in Automatic Extracting", Journal of the Association for Computing Machinery, Vol. 16, No. 2, April 1969.
[3] Daniel Jacob Gillick, "The Elements of Automatic Summarization", Electrical Engineering and Computer Sciences, University of California, May 2011.
[4] Eduard Hovy and Chin-Yew Lin, "Automated Text Summarization and the SUMMARIST System", Information Sciences Institute of the University of Southern California, 1998.
[5] Gerard Salton and Chris Buckley, "Term-Weighting Approaches in Automatic Text Retrieval", Department of Computer Science, Cornell University, New York, November 1987.
[6] Kevin Knight and Daniel Marcu, "Statistics-Based Summarization - Step One: Sentence Compression", Information Sciences Institute and Department of Computer Science, University of Southern California, 2002.
[7] Hal Daume III and Daniel Marcu, "A Tree-Position Kernel for Document Compression", Proceedings of the Document Understanding Conference, Boston, MA, May 6-7, 2004.
[8] Yihong Gong and Xin Liu, "Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis", SIGIR'01, New Orleans, Louisiana, C & C Research Laboratories, USA, September 9-12, 2001.
[9] Karel Jezek and Josef Steinberger, "Automatic Text Summarization (The State of the Art and New Challenges)", Bratislava, 2008.
[10] Gabriel Murray, Steve Renals and Jean Carletta, "Extractive Summarization of Meeting Recordings", Centre for Speech Technology Research, University of Edinburgh, Scotland, 2005.

[11] Makbule Gülçin Özsoy, "Text Summarization Using Latent Semantic Analysis", Graduate School of Natural and Applied Sciences, Middle East Technical University, Ankara, 2011.
[12] JAMA: A Java Matrix Package, http://math.nist.gov/javanumerics/jama/ [last reviewed 9 Feb 2011].
