Query-Based Multi-Document Summarization by Clustering of Documents
Total Page:16
File Type:pdf, Size:1020Kb
Query-based Multi-Document Summarization by Clustering of Documents Naveen Gopal K R Prema Nedungadi Dept. of Computer Science and Engineering Amrita CREATE Amrita Vishwa Vidyapeetham Dept. of Computer Science and Engineering Amrita School of Engineering Amrita Vishwa Vidyapeetham Amritapuri, Kollam -690525 Amrita School of Engineering [email protected] Amritapuri, Kollam -690525 [email protected] ABSTRACT based Hierarchical Agglomerative clustering, k-means, Clus- Information Retrieval (IR) systems such as search engines tering Gain retrieve a large set of documents, images and videos in re- sponse to a user query. Computational methods such as Au- 1. INTRODUCTION tomatic Text Summarization (ATS) reduce this information Automatic Text Summarization [12] reduces the volume load enabling users to find information quickly without read- of information by creating a summary from one or more ing the original text. The challenges to ATS include both the text documents. The focus of the summary may be ei- time complexity and the accuracy of summarization. Our ther generic, which captures the important semantic con- proposed Information Retrieval system consists of three dif- cepts of the documents, or query-based, which captures the ferent phases: Retrieval phase, Clustering phase and Sum- sub-concepts of the user query and provides personalized marization phase. In the Clustering phase, we extend the and relevant abstracts based on the matching between input Potential-based Hierarchical Agglomerative (PHA) cluster- query and document collection. Currently, summarization ing method to a hybrid PHA-ClusteringGain-K-Means clus- has gained research interest due to the enormous information tering approach. Our studies using the DUC 2002 dataset load generated particularly on the web including large text, show an increase in both the efficiency and accuracy of clus- audio, and video files etc. Text Summarization provides an ters when compared to both the conventional Hierarchical overview of a large text allowing the user to understand, and Agglomerative Clustering (HAC) algorithm and PHA. reject or include the text without reading the whole text. Summarization can be useful in text classification, question Categories and Subject Descriptors answering, information retrieval etc. Search engines such as Google use summarization techniques for improving their H.2.8 [Database Management]: Database Applications| search quality. Data mining; H.3.3 [Information Storage and Retrieval]: Generally, Automatic Text summarization methods can Information Search and Retrieval|Clustering, Query for- be divided into two categories: supervised and unsupervised mulation; H.3.4 [Information Storage and Retrieval]: methods. Summarization tasks may be extractive or ab- Systems and Software|Performance evaluation (efficiency stractive. Extractive methods extract significant sentences and effectiveness); I.2.7 [Artificial Intelligence]: Natural from the given documents and generate a summary while Language Processing|Text analysis; I.5.3 [Pattern Recog- abstractive methods may further modify the sentence struc- nition]: Clustering|Algorithms, Similarity measures ture. In this work, we implement a more efficient and inte- General Terms grated Information Retrieval system [1] [2] [3] with three Algorithms, Performance, Experimentation different phases by an extractive based query oriented multi- document summarization. The three phases include Re- trieval phase, Clustering phase and Summarization phase. Keywords The methods used in these phases are all unsupervised meth- Information Retrieval, Automatic Text Summarization, Hi- ods and do not require any training data. In the Retrieval erarchical Agglomerative Clustering algorithms, Potential- phase, a keyword matching links the user query and with the document collection using cosine similarity as the similarity measure. This means that the top-scored relevant docu- ments are retrieved. In the Clustering phase, the retrieved Permission to make digital or hard copies of all or part of this work for documents are clustered into different topic groups based on personal or classroom use is granted without fee provided that copies are the score obtained in the first phase. The actual summary not made or distributed for profit or commercial advantage and that copies is formed in the Summarization phase. The sentences in bear this notice and the full citation on the first page. To copy otherwise, to each of these clusters are ranked using the ranking method, republish, to post on servers or to redistribute to lists, requires prior specific TextRank. We use sentence level extraction approach for permission and/or a fee. ICONIAAC ’14, October 10 - 11 2014, Amritapuri, India the summarization that extracts top ranked sentences from Copyright 2014 ACM 978-1-4503-2908-8/14/08 ...$15.00. each of the clusters and form the summary. The efficiency of the conventional Hierarchical Agglom- 3. DISADVANTAGES CLUSTERING erative Clustering (HAC) [13] method, Single Linkage clus- METHODS tering and the Potential-based Hierarchical Agglomerative (PHA) clustering is frequently reduced by the chaining prob- Hierarchical Agglomerative Clustering (HAC) methods suc lem and form heterogeneous clusters that may affect the per- h as PHA clustering method, Single Linkage clustering meth formance of the retrieval system. Another drawback of HAC od and Partitioning-based clustering method such as k-means concerns the Convergence Criterion. Alternative methods method have many disadvantages as listed as follows. also face problems. Choosing the optimum number of Clus- 3.1 Chaining effect ters k restricts the efficiency of Partitioning-based approach Chaining is a common problem in Single Linkage cluster- such as k-means method. However, HAC methods and Parti ing and PHA method and it can be defined as the gradual tioning-based clustering methods can be combined for effi- growth of a single cluster as one data object with the el- cient and accurate clustering. In this work, we extend PHA ements added to that cluster at each iterative step of the clustering method to a PHA-ClusteringGain-K-Means hy- algorithm. This leads to the formation of impractical het- brid clustering approach, in which the PHA method and erogeneous clusters and may result in the unequal partition- k-means method are combined by incorporating a cluster- ing of data objects. In a clustering process many singleton ing evaluation criterion concept called clustering gain into clusters are formed as a result of chaining. Thus the out- PHA for finding the optimum number of clusters automat- put clusters cannot be properly define from the input data ically. This number is assigned as the value of k in k- objects. Example for chaining is given below. means algorithm and forms homogeneous clusters that in turn eradicates the above disadvantages and improves the Let the data objects be p1, p2, p3, p4, p5 with random performance. The proposed hybrid clustering approach is coordinate values as [4, 5, 6, 5], [3, 4, 5, 8], [1, 2, 4, 5], more accurate and efficient than the conventional HAC meth [2, 3, 6, 7], [4, 3, 5, 6] respectively. Let the Hierarchi- od and the PHA method. The Clustering module in the pro- cal Agglomerative clustering be performed on these objects posed system is tested using Silhouette Coefficient method and the clusters at each iteration of the algorithm is shown and the Summarization part is tested using ROUGE evalu- below. ation method. All sets of experiments are conducted using First Iteration - [[p1], [p3], [p5], [p4,p2]] DUC 2002 dataset. Second Iteration - [[p1], [p3], [p4, p2, p5]] Third Iteration - [[p3], [p4, p2, p5, p1]] Fourth Iteration - [[p4, p2, p5, p1, p3]] 2. RELATED WORKS In each iteration, the single cluster itself is growing grad- There are many existing Information Retrieval systems ually with the new data objects added to it in the successive with different phases such as Retrieval phase, Clustering iteration. We obtain heterogeneous partitioning of data ob- phase and Summarization phase. The Clustering module in jects at each iteration. In a document clustering process if a such systems clusters the retrieved documents into different cluster is formed with large number of less scored documents, topic groups and improves the retrieval system by provid- then the output summary may contain more information for ing organized and focused information. The Summarization a less significant topic. This affects the performance of the module provides an abstract of large documents and allows Information Retrieval system in a way that the system gives users to find relevant information quickly without reading more significance for particular information that may have the whole text. less significance in reality. QCS [4], a system for querying, clustering and summa- rizing documents, is an Information Retrieval system that 3.2 Convergence criterion for HAC employs three phases Querying phase, Clustering phase and The HAC methods merge two most similar data objects Summarization phase. In Querying phase, QCS retrieves a into a single cluster hierarchically in each iteration as bottom- set of relevant document for a given input query using Latent up manner in a hierarchical tree. It starts from the bottom Semantic Indexing (LSI). In Clustering phase, the retrieved level with N clusters and go each level up and ends in the documents are clustered into different topic clusters using top level with a single cluster. The stopping criterion is a generalized spherical k-means algorithm. In Summarization