
Sådhanå (2018) 43:27 Ó Indian Academy of Sciences https://doi.org/10.1007/s12046-018-0789-ySadhana(0123456789().,-volV)FT3](0123456789().,-volV) Enhancing multi-document summarization using concepts PATTABHI R K RAO* and S LALITHA DEVI* AU-KBC Research Centre, MIT Campus of Anna University, Chennai 600044, India e-mail: [email protected]; [email protected] MS received 21 November 2016; revised 12 May 2017; accepted 19 June 2017; published online 10 March 2018 Abstract. In this paper we propose a methodology to mine concepts from documents and use these concepts to generate an objective summary of all relevant documents. We use the conceptual graph (CG) formalism as proposed by Sowa to represent the concepts and their relationships in the documents. In the present work we have modified and extended the definition of the concept given by Sowa. The modified and extended definition is discussed in detail in section 2 of this paper. A CG of a set of relevant documents can be considered as a semantic network. The semantic network is generated by automatically extracting CG for each document and merging them into one. We discuss (i) generation of semantic network using CGs and (ii) generation of multi- document summary. Here we use restricted Boltzmann machines, a deep learning technique, for automatically extracting CGs. We have tested our methodology using MultiLing 2015 corpus. We have obtained encouraging results, which are comparable to those from the state of the art systems. Keywords. Concept mining; text mining; multi-document summarization; machine learning; restricted Boltzmann machines; MultiLing 2015 dataset. 1. Introduction concept is also very important. Identification of concepts provides us a proper understanding of texts, helps in Today, with the advancement of technology, there is an understanding relationships and gives a semantic repre- explosion of data available on web. Earlier the content had sentation of the text. We observe in the literature that the to be generated by the publishing houses. However, now traditional methods to identify concepts have been through with easy access to internet, the users themselves are able to the use of thesaurus such as WordNet, dictionaries or generate content using blogs, micro-blogs such as Facebook lexicons. and Twitter. Thus, there is a great need for mining the web There are various methods for summarizing a text and and extracting relevant information. To automatically we find that concept-based summarization will be more identify the relevant information, semantically driven semantically driven and gives cohesion to the summary mining methods are necessary. One such method where automatically generated. ‘‘A summarizer is a system semantic relation can be utilized in mining is the concept- whose goal is to produce a condensed representation of based mining where conceptual graph (CG) formalism is the content of its input for human consumption’’ [1]. In used to determine the concept and its relations. Here the most of the methods used for automated summary gen- underlying concepts have to be identified and they in turn eration, the end result is a collection of sentences that do are used for mining the relevant data. The concept thus not have connectivity of topic, or we can say the cohesion obtained could be used for various applications such as of the text is not present. We are trying to bring in this Information retrieval, extraction, database creation, to cohesion to the summary through the CG-based generate summaries, etc. summarization. Concept mining is the task of extracting the concepts Automated summarization is an important area of embedded in the text document and concept is a repre- research in NLP, which uses data mining technology. One sentation of an idea or entity. A concept can be either a of the popularly known earliest works on text summa- word or phrase and is totally dependent on the semantics of rization is by Luhn [2]. He proposed that frequency of a the sentence. Thus, identification of concepts from text word in articles provides a useful measure of its signifi- documents involves aspects of artificial intelligence such as cance. Significance factor was derived at sentence level natural language processing (NLP) and machine learning. and top ranking sentences were selected to form the auto- Concept mining is a non-trivial task and what constitutes a abstract. A variety of automated summarization schemes have been proposed in the last decade. NeATS [3]isan *For correspondence 1 27 Page 2 of 11 Sådhanå (2018) 43:27 approach based on sentence position, term frequency, v. CGs are scalable and can be adopted for any language. topic signature and term clustering, and MEAD [4]isa Though here in this work we have demonstrated using centroid-based approach. Iterative graph-based ranking English, they can be used for any language from any algorithms, such as Kleinberg’ s HITS algorithm [5] and language family. Google’s Page- Rank [6], have been traditionally and vi. This is one of the first works to fully, automatically successfully used in web-link analysis, social networks extract CGs using one of the deep learning and more recently in text processing applications. Erkan algorithms. and Radav [7], Mihalcea [8], Mihalcea and Tarau [9] and This paper is organized as follows. In the next sub-section, Mihalcea et al.[10] have been proposed for single-docu- we describe briefly the background of CGs. Section 2 ment summary generation. Multi-document summarization is the process of filtering describes our modification and extension of concept defi- important information from a set of documents to produce a nitions used to facilitate our work. The methodology and condensed version for particular users and applications. It our approach are described in detail in section 3. In sec- can be viewed as an extension of single-document sum- tion 4, we describe the experiments and their results. Sec- marization. Issues like redundancy, novelty, coverage, tion 5 concludes the paper. temporal relatedness, compression ratio, etc., are more prominent in multi-document summarization [4]. MEAD is a multi-document summarization system. MEAD is a large 1.1 Background of CG scale extractive system that works in a general domain. A CG is a graph representation of logic based on the SUMMONS [11] is an abstractive system that works in a semantic networks of artificial intelligence and existential strict domain, and relies on template-driven Information graphs of Charles Sanders Peirce. John Sowa states the Extraction (IE) technology and Natural Language Genera- purpose of CGs as follws: ‘‘to express meaning in a form tion (NLG) tools. Virendra and Tanveer [12] propose a that is logically precise, human readable and computa- multi-document summarization system that uses sentence tionally tractable’’ [13]. Mathematically, a CG is a bipartite, clustering. It initially identifies summaries on single docu- directed, finite graph; each node in the graph is either a ments and then combines single-document summaries concept node or relation node. Concept node represents using sentence clustering. entities, attributes, states and events, and relation node Earlier works demonstrate that in the multi-document shows how the concepts are interconnected. A node (con- summarization, methods are used to combine single-docu- cept or relation) has two associated values: a type and a ment summaries to form multi-document summary. How- referent or marker; a referent can be either a single generic ever, a more intuitive methodology would be to process the referent or an individual referent. Thus a CG consists of a multiple documents as one set collectively and develop a set of concept types and a set of relation types. coherent and semantic summary. In this work we propose A CG is represented mainly in two forms, viz., (i) display such a methodology, where multiple documents are col- form and ii) linear form. The display form uses the tradi- lectively considered for generating a summary. And for this tional graph form, where concept nodes are represented by we use the CG formalism. rectangular boxes and relation nodes are represented by We propose an algorithm that forms a semantic network ovals. In the linear form, concepts are represented by square of all the documents in the set. From this semantic network, brackets and relation nodes are represented using paren- we form a summary for the set of documents. The semantic thesis. To represent these graphs internally in the computer network is generated using CG. system we use a list data structure consisting of triplet value The major contributions of this work are as follows: (c1, c2, r), where c1 is concept one, c2 is concept two and r i) We have used CGs, which is semantic knowledge is the relationship between the concepts c1 and c2. This representation formalism. triplet structure can be again represented using traditional ii) We have modified and extended the definition given by matrix representation, which is currently followed by Sowa [13]. We discuss in detail what constitutes a information systems. The following example gives more concept and how concepts are formed in section 1.1 of insight into CGs. this paper. Example 1: English sentence: ‘‘Marie hit the piggy bank iii) We mine the concepts and their relationships and with a hammer.’’ develop a CG completely by automated means. All the Figure 1 shows the CG for the example 1 sentence. The earlier works in literature have used partial automation concepts are ‘‘Marie’’, ‘‘Hit’’, ‘‘Hammer’’ and ‘‘the piggy for the development of CGs. bank’’; these concepts are connected by the relationships iv) The formalism of CGs helps in generating an abstractive ‘‘agent’’, ‘‘instrument’’ and ‘‘patient’’, respectively. From summary. Most of the earlier works are extractive the graph we can infer the following: the subject ‘‘Marie’’ summaries. Those that have generated abstractive sum- ‘‘hit’’ the object ‘‘piggy bank’’ using the instrument mary are not scalable, as they use a rule-based approach.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages11 Page
-
File Size-