Query-based Multi-Document Summarization by Clustering of Documents

Naveen Gopal K R
Dept. of Computer Science and Engineering
Amrita School of Engineering
Amrita Vishwa Vidyapeetham
Amritapuri, Kollam - 690525
[email protected]

Prema Nedungadi
Amrita CREATE
Dept. of Computer Science and Engineering
Amrita School of Engineering
Amrita Vishwa Vidyapeetham
Amritapuri, Kollam - 690525
[email protected]

ABSTRACT

Information Retrieval (IR) systems such as search engines retrieve a large set of documents, images and videos in response to a user query. Computational methods such as Automatic Text Summarization (ATS) reduce this information load, enabling users to find information quickly without reading the original text. The challenges to ATS include both the time complexity and the accuracy of summarization. Our proposed Information Retrieval system consists of three different phases: a Retrieval phase, a Clustering phase and a Summarization phase. In the Clustering phase, we extend the Potential-based Hierarchical Agglomerative (PHA) clustering method to a hybrid PHA-ClusteringGain-K-Means clustering approach. Our studies using the DUC 2002 dataset show an increase in both the efficiency and accuracy of clusters when compared to both the conventional Hierarchical Agglomerative Clustering (HAC) algorithm and PHA.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering, Query formulation; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness); I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis; I.5.3 [Pattern Recognition]: Clustering—Algorithms, Similarity measures

General Terms

Algorithms, Performance, Experimentation

Keywords

Information Retrieval, Automatic Text Summarization, Hierarchical Agglomerative Clustering algorithms, Potential-based Hierarchical Agglomerative clustering, k-means, Clustering Gain

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICONIAAC '14, October 10-11, 2014, Amritapuri, India
Copyright 2014 ACM 978-1-4503-2908-8/14/08 ...$15.00.

1. INTRODUCTION

Automatic Text Summarization [12] reduces the volume of information by creating a summary from one or more text documents. The focus of the summary may be either generic, capturing the important semantic concepts of the documents, or query-based, capturing the sub-concepts of the user query and providing personalized, relevant abstracts based on the match between the input query and the document collection. Summarization has gained research interest due to the enormous information load generated, particularly on the web, including large text, audio and video files. Text Summarization provides an overview of a large text, allowing the user to understand, reject or include the text without reading it in full. Summarization is useful in text classification, question answering, information retrieval and related tasks. Search engines such as Google use summarization techniques to improve their search quality.

Generally, Automatic Text Summarization methods can be divided into two categories: supervised and unsupervised. Summarization tasks may be extractive or abstractive: extractive methods select significant sentences from the given documents to generate a summary, while abstractive methods may further modify the sentence structure. In this work, we implement a more efficient and integrated Information Retrieval system [1] [2] [3] with three different phases, performing extractive, query-oriented multi-document summarization. The three phases are the Retrieval phase, the Clustering phase and the Summarization phase. The methods used in these phases are all unsupervised and do not require any training data. In the Retrieval phase, keyword matching links the user query with the document collection using cosine similarity as the similarity measure, and the top-scored relevant documents are retrieved. In the Clustering phase, the retrieved documents are clustered into different topic groups based on the scores obtained in the first phase. The actual summary is formed in the Summarization phase: the sentences in each of the clusters are ranked using the ranking method TextRank, and a sentence-level extraction approach extracts the top-ranked sentences from each cluster to form the summary.

The efficiency of conventional Hierarchical Agglomerative Clustering (HAC) [13] methods such as Single Linkage clustering, and of Potential-based Hierarchical Agglomerative (PHA) clustering, is frequently reduced by the chaining problem, which forms heterogeneous clusters that may affect the performance of the retrieval system. Another drawback of HAC concerns the convergence criterion. Alternative methods also face problems: choosing the optimum number of clusters k restricts the efficiency of partitioning-based approaches such as the k-means method. However, HAC methods and partitioning-based clustering methods can be combined for efficient and accurate clustering. In this work, we extend the PHA clustering method to a PHA-ClusteringGain-K-Means hybrid clustering approach, in which the PHA method and the k-means method are combined by incorporating a clustering evaluation criterion called clustering gain into PHA to find the optimum number of clusters automatically. This number is assigned as the value of k in the k-means algorithm, which forms homogeneous clusters, thereby removing the above disadvantages and improving performance. The proposed hybrid clustering approach is more accurate and efficient than the conventional HAC method and the PHA method. The Clustering module of the proposed system is evaluated using the Silhouette Coefficient method and the Summarization module using the ROUGE evaluation method. All experiments are conducted on the DUC 2002 dataset.

2. RELATED WORKS

There are many existing Information Retrieval systems with phases such as a Retrieval phase, a Clustering phase and a Summarization phase. The Clustering module in such systems clusters the retrieved documents into different topic groups and improves the retrieval system by providing organized and focused information. The Summarization module provides an abstract of large documents and allows users to find relevant information quickly without reading the whole text.

QCS [4], a system for querying, clustering and summarizing documents, is an Information Retrieval system that employs three phases: a Querying phase, a Clustering phase and a Summarization phase. In the Querying phase, QCS retrieves a set of relevant documents for a given input query using Latent Semantic Indexing (LSI). In the Clustering phase, the retrieved documents are clustered into different topic clusters using a generalized spherical k-means algorithm. In the Summarization phase, a summary is created from each cluster using Hidden Markov Models (HMM) and pivoted QR decomposition. However, the k value used in the spherical k-means algorithm is given as user input.

Hierarchical Agglomerative Clustering (HAC) methods have been extensively applied in the field of Information Retrieval, as they can improve the efficiency of Information Retrieval systems. Anastasios Tombros, Robert Villa, and C.J. Van Rijsbergen [8] proposed a method for improving the effectiveness of retrieval systems that use HAC methods for clustering the search results (query-specific clustering).

However, the clustering methods used in these systems suffer from several disadvantages, listed below. The proposed hybrid clustering method solves these disadvantages.

3. DISADVANTAGES OF CLUSTERING METHODS

Hierarchical Agglomerative Clustering (HAC) methods such as the PHA clustering method and Single Linkage clustering, as well as partitioning-based clustering methods such as k-means, have the following disadvantages.

3.1 Chaining effect

Chaining is a common problem in Single Linkage clustering and the PHA method. It can be defined as the gradual growth of a single cluster, with one data object added to that cluster at each iterative step of the algorithm. This leads to the formation of impractical heterogeneous clusters and may result in an unequal partitioning of the data objects. Many singleton clusters are formed as a result of chaining, so the output clusters cannot be properly defined from the input data objects. An example of chaining is given below.

Let the data objects be p1, p2, p3, p4, p5 with random coordinate values [4, 5, 6, 5], [3, 4, 5, 8], [1, 2, 4, 5], [2, 3, 6, 7], [4, 3, 5, 6] respectively. Let Hierarchical Agglomerative clustering be performed on these objects; the clusters at each iteration of the algorithm are shown below.

First Iteration - [[p1], [p3], [p5], [p4, p2]]
Second Iteration - [[p1], [p3], [p4, p2, p5]]
Third Iteration - [[p3], [p4, p2, p5, p1]]
Fourth Iteration - [[p4, p2, p5, p1, p3]]

In each iteration the same cluster grows gradually, with a new data object added to it at each successive step, and we obtain a heterogeneous partitioning of the data objects at every iteration.
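This merge sequence can be reproduced with a small single-linkage implementation. The sketch below is illustrative, not the paper's code; ties between equal inter-cluster distances are broken by scan order, chosen here so that it reproduces the sequence above:

```python
from itertools import combinations

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(points):
    """Naive single-linkage HAC: repeatedly merge the two clusters whose
    closest members are nearest, recording the partition after each merge."""
    clusters = {name: [name] for name in points}
    history = []
    while len(clusters) > 1:
        # cluster-to-cluster distance = smallest member-to-member distance
        a, b = min(
            combinations(sorted(clusters, reverse=True), 2),
            key=lambda p: min(dist(points[u], points[v])
                              for u in clusters[p[0]] for v in clusters[p[1]]),
        )
        clusters[a] += clusters.pop(b)
        history.append([sorted(c) for c in clusters.values()])
    return history

points = {"p1": [4, 5, 6, 5], "p2": [3, 4, 5, 8], "p3": [1, 2, 4, 5],
          "p4": [2, 3, 6, 7], "p5": [4, 3, 5, 6]}
for step in single_linkage(points):
    print(step)  # one cluster swallows the new objects step after step
```

After the first merge, every later merge only grows the same cluster, which is exactly the chaining behaviour described above.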
In a document clustering process, if a cluster is formed from a large number of low-scored documents, then the output summary may contain more information about a less significant topic. This affects the performance of the Information Retrieval system: the system gives more significance to information that has less significance in reality.

3.2 Convergence criterion for HAC

HAC methods merge the two most similar data objects into a single cluster at each iteration, building a hierarchical tree in a bottom-up manner. The process starts at the bottom level with N clusters, moves up one level at a time and ends at the top level with a single cluster. The stopping criterion is a problem in HAC: at which level should the iteration be stopped?

3.3 Deciding the number of clusters

Deciding the optimum number of clusters k is the major problem with partitioning-based clustering approaches such as the k-means method.

4. PROPOSED SYSTEM

The proposed system solves the above disadvantages; its procedure is given in Algorithm 1. Each of the phases in the system is explained in detail below.

Algorithm 1: PHA-ClusteringGain-K-Means Hybrid Clustering Method

Input: Given document collection and input query
Output: Summary from the set of clusters

1. Retrieval phase
   (a) Pre-process the given document collection.
   (b) Pre-process the input query.
   (c) Find the set of documents matching the input query using cosine similarity.
2. Clustering phase
   (a) Perform the Potential-based Hierarchical Agglomerative clustering (PHA) method on the obtained set of matched documents with their corresponding cosine similarity scores.
   (b) Compute the clustering gain at each iterative step of the algorithm.
   (c) Fix the number of clusters when the clustering gain reaches its maximum value.
   (d) Take this number of clusters as the value for k in the k-means algorithm.
   (e) Perform the k-means method with the computed k value.
3. Summarization phase
   (a) Rank the sentences from each of the obtained document clusters using TextRank.
   (b) Form the summary with the extracted top-ranked sentences.

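The control flow of Algorithm 1 can be sketched as follows. Every helper below is a hypothetical stand-in (keyword-overlap retrieval, round-robin grouping, longest-sentence extraction) for the tf-idf, PHA-ClusteringGain-K-Means and TextRank components described later, so only the three-phase wiring is meaningful:

```python
def retrieve(docs, query):
    # Phase 1 stand-in: keep documents sharing at least one term with the
    # query (placeholder for tf-idf + cosine-similarity scoring).
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in docs]
    return [(s, d) for s, d in scored if s > 0]

def cluster(scored_docs, k):
    # Phase 2 stand-in: placeholder for PHA + clustering gain + k-means;
    # here the documents are simply partitioned round-robin into k groups.
    groups = [[] for _ in range(k)]
    for i, (_, d) in enumerate(scored_docs):
        groups[i % k].append(d)
    return groups

def summarize(group, n=1):
    # Phase 3 stand-in: placeholder for TextRank; keep the n longest sentences.
    sentences = [s.strip() for d in group for s in d.split(".") if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:n]

docs = ["The hurricane hit the coast. Damage was severe.",
        "An earthquake shook the city. Buildings collapsed.",
        "The festival drew large crowds."]
relevant = retrieve(docs, "hurricane earthquake")
summaries = [summarize(g) for g in cluster(relevant, k=2)]
print(summaries)
```

The off-topic document is filtered out in phase 1, and each cluster yields one extractive summary, mirroring the set-of-summaries output of the system.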
4.1 Retrieval phase

The Retrieval phase is the first phase of our proposed system. In this phase, an input query is matched against the documents in the collection and a set of documents relevant to the query is retrieved. Cosine similarity is used as the similarity measure between the query and the documents. Before computing the cosine similarity, the given document collection and the input query undergo pre-processing so that they can be represented as vectors in the vector space model, as explained below.

4.1.1 Document pre-processing

Document pre-processing consists of the following steps.

• Segmentation: The given document collection is represented as a set of individual documents. Each individual document is converted to a standardized format, and the set of terms is extracted from the document set.

• Stopword Filtering: Stopwords are removed using the stopword list built by Gerard Salton and Chris Buckley at Cornell University. This word list is 571 words in length. The set of unique terms is extracted.

• Word Stemming: Porter's stemming algorithm is used to perform word stemming. Stemming reduces the memory space required to store the words and makes computation easier.

• Input Matrix Creation: The given input document collection is represented by an m * n term-document matrix A with m rows and n columns. Each row represents a unique term (word) and each column represents a document; column A_j is the weighted term vector of document j. Each element A_ij of the term-document matrix is a weighted tf-idf value of term t in document d:

A_ij = TF-IDF(t, d)    (1)

TF-IDF is defined as

TF-IDF(t, d) = TF(t, d) * IDF(t)    (2)

TF(t, d), the term frequency, is the ratio between the number of times term t appears in document d and the total number of unique terms:

TF(t, d) = n(t, d) / |T|    (3)

where |T| is the total number of unique terms. IDF(t), the inverse document frequency, is defined as

IDF(t) = log(|D| / d_t)    (4)

where |D| is the total number of documents in the given input document set and d_t is the number of documents containing the term t.

4.1.2 Query pre-processing

Query pre-processing consists of the following steps.

• Tokenizing: The input query is tokenized into an individual set of words.

• Stopword Filtering: Stopwords are removed using the stopword list built by Gerard Salton and Chris Buckley at Cornell University.

• Spell checking: Spell checking of the individual query words is done using the enchant package in Python.

• Word Stemming: Porter's stemming algorithm is used to perform word stemming.

• Query Vector Creation: A tf-idf vector is created for the query, analogous to the document vectors.
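The pre-processing above ends with tf-idf vectors. A minimal sketch of equations (1)-(4), together with the cosine scoring used in the Retrieval phase, might look like this (illustrative only: |T| is read here as the number of unique terms in the document, and the toy documents and query are invented):

```python
import math
from collections import Counter

def build_tfidf(docs):
    """Term-document tf-idf weights following eqs. (1)-(4). Note the paper
    normalizes TF by the number of unique terms |T|, not by document length."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}  # d_t
    n = len(docs)

    def vector(toks):
        counts = Counter(toks)
        # eq. (2): TF (eq. 3) times IDF (eq. 4), one weight per vocab term
        return [counts[t] / len(counts) * math.log(n / df[t]) for t in vocab]

    return vocab, df, [vector(toks) for toks in tokenized]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["hurricane damage on the coast",
        "earthquake damage in the city",
        "music festival in the city"]
vocab, df, cols = build_tfidf(docs)

# Query vector over the same vocabulary, weighted with the document IDFs.
q = Counter("hurricane earthquake damage".split())
qvec = [q[t] / len(q) * math.log(len(docs) / df[t]) for t in vocab]
scores = [cosine(qvec, col) for col in cols]
# the two disaster documents score above zero; the unrelated one scores 0
```

Documents are then kept or discarded by comparing these scores against the cutoff described in Section 4.1.3.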

4.1.3 Querying the documents

The jth column of the term-document matrix A represents the jth document. The jth document has a cosine similarity score s_j that determines the relevance of the document to an input query q and is computed as

s_j = (q · A_j) / (||q|| ||A_j||)    (5)

where 0 <= s_j <= 1. Documents with a similarity score greater than a cutoff value are considered relevant to the query.

4.2 Clustering phase

The Clustering phase is the second phase of our system, in which we extend the PHA clustering method to the PHA-ClusteringGain-K-Means hybrid clustering approach for efficient and accurate clustering. Clustering organizes the retrieved documents into different topic groups, where a topic is defined by the terms given in the input query, and improves the retrieval system by providing organized and focused information. In the Clustering phase, the retrieved documents are clustered based on their cosine similarity scores computed in the first phase. The proposed method is described as follows.

4.2.1 PHA-ClusteringGain-K-Means hybrid clustering approach

In the PHA-ClusteringGain-K-Means hybrid clustering approach, Potential-based Hierarchical Agglomerative clustering (PHA) and k-means are combined. A clustering evaluation criterion called clustering gain is incorporated into PHA to find the optimum number of clusters automatically; the k-means method takes this optimum number as the value of k instead of receiving it as user input. The clustering gain is computed at each iterative step of the PHA clustering method. The iteration is stopped when the clustering gain reaches its maximum value, and the number of clusters at that stage is taken as the optimal number of clusters. The PHA method is affected by the chaining problem, whereas the k-means algorithm partitions the data objects homogeneously and usually produces spherical or symmetrical clusters. The final clusters obtained from the proposed hybrid approach are therefore free from the chaining effect, and the disadvantages listed above are solved. The PHA method and the clustering gain computation are explained in detail below.

4.2.1.1 Potential-based Hierarchical Agglomerative Clustering (PHA).

Potential-based Hierarchical Agglomerative clustering [5] is a Hierarchical Agglomerative clustering method in which data objects are clustered based on a hypothetical potential field existing between all the data objects in Euclidean space. PHA considers both global data distribution information, by incorporating the potential field, and local data distribution information, by including the distance matrix in the algorithm, while conventional Hierarchical Agglomerative clustering algorithms such as Single Linkage and Complete Linkage clustering consider only local data distribution information. In this work, the cosine similarity scores of the retrieved documents computed in the first phase are taken as the data objects, and Euclidean distance is taken as the distance measure. In the PHA method, the potential at a data object i due to another data object j is defined as

φ_ij(r_ij) = −1/r_ij    if r_ij >= δ
φ_ij(r_ij) = −1/δ       if r_ij < δ    (6)

where r_ij is the Euclidean distance between i and j, and the parameter δ avoids the singularity when r_ij becomes zero. The δ value is computed as

δ = mean(MinD_i) / S    (7)

MinD_i = min_{r_ij ≠ 0, j = 1...N} (r_ij)    (8)

where MinD_i is the minimum distance from data object i to all the other objects, and S is a scale factor taken as 10. The total potential at a data object i is the sum of the potentials from all other objects and is computed as

φ_i = Σ_{j = 1...N} φ_ij(r_ij)    (9)

where N is the number of data objects.

The procedure of the PHA method is as follows.

Step 1: Let the data objects be p1, p2, p3, p4, p5 and p6, representing cosine similarity scores of the retrieved documents, in Euclidean space as shown in figure 1.

Figure 1: Data objects in the Euclidean space

Step 2: Compute the total potential at each object using equation (9) and sort these values in ascending order.

Step 3: Let the ascending order of potential be [p4, p6, p5, p2, p3, p1], which means that the data object p4 has the lowest potential and becomes the root of the Weighted Edge Tree, while the data object p1 has the highest potential. The method then finds the parent of each data object, where the parent of a data object is defined as the nearest data object visited so far.

Step 4: The parent of the second object p6 is set to p4, because p4 is the nearest object visited so far.

Step 5: The parent of the third object p5 is taken as p6, because p5 is nearer to p6 than to p4.

Step 6: Similarly, find the parent of p2, and so on up to p1.

Step 7: The Weighted Edge Tree is drawn as shown in figure 2. The weight of the edge between a data object and its parent is defined as the Euclidean distance between them.

Figure 2: Weighted Edge Tree

Step 8: The data objects are sorted based on the edge weight between each object and its parent as shown in the tree; let the order be [p6, p5, p3, p2, p1, p4]. This means that the distance between p6 and its parent p4 is the smallest.

Step 9: The first data object p6 from the above queue is merged with its parent p4 to form the cluster (p6, p4). The second data object p5 is merged with the cluster (p6, p4) because the parent of p5 is p6. The rest of the objects in the queue, up to p4, are merged with their respective parents.

Step 10: The given data objects are thus clustered hierarchically in each iteration, and the clustering process is shown by a dendrogram in figure 3.

Figure 3: Dendrogram
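The potential computation and parent assignment of Steps 1-8 and equations (6)-(9) can be sketched for arbitrary points as follows (an illustration under the stated scale factor S = 10, not the authors' implementation):

```python
import math

def pha_parents(points, scale=10.0):
    """Total potentials (eqs. 6-9) and nearest-visited-object parents."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # eqs. (7)-(8): delta = mean nearest-neighbour distance / scale factor S
    min_d = [min(x for x in row if x > 0) for row in d]
    delta = sum(min_d) / n / scale

    def phi(r):  # eq. (6): attractive potential, clipped near r = 0
        return -1.0 / r if r >= delta else -1.0 / delta

    # eq. (9): total potential at each object
    total = [sum(phi(d[i][j]) for j in range(n) if j != i) for i in range(n)]
    order = sorted(range(n), key=total.__getitem__)  # Step 2: ascending
    parent = {order[0]: None}  # Step 3: lowest potential becomes the root
    for k in range(1, n):
        i = order[k]
        # Steps 3-6: parent = nearest already-visited object
        parent[i] = min(order[:k], key=lambda v: d[i][v])
    return order, parent

# Toy 1-D scores: the two close points attract strongly, the far one weakly.
order, parent = pha_parents([(0.0,), (1.0,), (10.0,)])
```

The resulting parent links, weighted by d[i][parent[i]], are exactly the edges of the Weighted Edge Tree that the merge queue of Steps 8-9 is built from.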

The set of retrieved documents is clustered into different topic groups using the PHA method. PHA has a time complexity of O(n^2) and is more efficient than conventional HAC methods such as Single Linkage clustering, whose time complexity is O(n^3). The clustering gain computation is explained as follows.

4.2.1.2 Clustering Gain.

Clustering gain [7] can be regarded as a computationally efficient evaluation criterion for finding an optimal cluster configuration, and thereby the optimal number of clusters required in the clustering process. The value of the clustering gain is always greater than or equal to zero, and the optimal number of clusters is obtained at its maximum value. A complete clustering run is needed to determine this maximum, since the clustering gain varies over the iterations of the clustering process: it is zero at the initial and final stages and maximal at about the middle stage. The clustering gain is computed at each iterative step of the PHA clustering method; the PHA method is stopped at the maximum value of the clustering gain, and the number of clusters obtained at that stage is taken as the value of k in the k-means algorithm.

The clustering gain ∆_j computed for the cluster C_j at a particular stage of a clustering process is defined as the difference between the decreased inter-cluster error sum γ_j compared to the previous stage and the increased intra-cluster error sum λ_j compared to the previous stage:

∆_j = γ_j − λ_j    (10)

where γ_j is the decreased inter-cluster error sum and λ_j is the increased intra-cluster error sum for the cluster C_j.

The intra-cluster error sum (within-group error sum) Λ is the sum over all clusters of the squared Euclidean distances from the centroid of a cluster to every data object in that cluster:

Λ = Σ_{j=1}^{k} Σ_{i=1}^{n_j} ||p_i^(j) − p_0^(j)||^2    (11)

where k is the number of clusters, n_j is the number of instances in the jth cluster, p_i^(j) is the ith instance in the jth cluster and p_0^(j) is the centroid of the jth cluster. The intra-cluster error sum increases during a clustering process, from zero at the initial stage to its maximum at the final stage.

The inter-cluster error sum Γ is the sum of the squared Euclidean distances from the centroid of each cluster to the global centroid:

Γ = Σ_{j=1}^{k} ||p_0^(j) − p_0||^2    (12)

where p_0 (the global centroid) is the centroid of all data objects:

p_0 = (1/n) Σ_{i=1}^{n} p_i    (13)

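The error sums of equations (11)-(13) are straightforward to compute for any partition. The sketch below (illustrative, with invented toy points) shows both quantities; tracking their stage-to-stage changes gives the per-cluster gain of equation (10):

```python
def centroid(pts):
    # eq. (13) applied to any set of points
    return [sum(c) / len(pts) for c in zip(*pts)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def intra_error(partition):
    """Eq. (11): squared distances of members to their cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in partition for p in c)

def inter_error(partition, all_points):
    """Eq. (12): squared distances of cluster centroids to the global centroid."""
    g = centroid(all_points)
    return sum(sq_dist(centroid(c), g) for c in partition)

# Two tight groups of toy points: merging within a group barely raises the
# intra-cluster error, while the inter-cluster error stays large until the
# final all-in-one merge collapses it to zero.
pts = [[0, 0], [0, 1], [10, 0], [10, 1]]
split = [[pts[0], pts[1]], [pts[2], pts[3]]]
print(intra_error(split), inter_error(split, pts))
```

At the all-singletons stage the intra-cluster error is zero, and at the single-cluster stage the inter-cluster error is zero, matching the zero-gain endpoints described above.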
4.2.1.3 K-Means clustering method.

The k-means [9] method is performed with the computed k value as input and produces k homogeneous clusters. Thus the output clusters obtained from the proposed hybrid clustering approach are free from the chaining effect.

The time complexity of the proposed hybrid clustering approach, the sum of the time taken by the PHA and k-means methods, is O(n^2) + O(nidk), which is less than the O(n^3) time complexity of Single Linkage clustering, where n is the number of instances in d dimensions, i is the number of iterations and k is the number of clusters to be obtained.

4.3 Summarization phase

The Summarization phase is the last phase of our proposed system, in which the actual summarization is performed by ranking the individual sentences of the obtained document clusters. A sentence-level extraction approach is used for creating the summary: a single summary is formed from each cluster of documents by extracting the top-ranked sentences, and the set of summaries is presented as the output of the system. The ranking method TextRank is used for ranking the sentences.

4.3.1 TextRank

TextRank [6] is a graph-based method for ranking sentences. The graphs are built from natural language texts, with the sentences of the texts forming the vertices of the graph. The edge e_ij between two vertices v_i and v_j represents the similarity between the ith and jth sentences, and the edge weight w_ij represents the value of that similarity. In this work, the edit distance between two sentences is taken as their similarity value. TextRank employs the PageRank method for ranking: the weighted PageRank method is called with the constructed graph as input. The weighted PageRank PR^W of a vertex v_i is computed as

PR^W(v_i) = (1 − d) + d * Σ_{v_j ∈ In(v_i)} [ w_ji / Σ_{v_k ∈ Out(v_j)} w_jk ] * PR^W(v_j)    (14)

where PR^W(v_j) is the weighted PageRank of v_j and d is a damping parameter with a value between 0 and 1, usually set to 0.85. In(v_i) is the set of vertices that point to v_i (its predecessors) and Out(v_j) is the set of vertices that v_j points to (its successors); w_ji is the edge weight between vertices j and i, and w_jk the edge weight between vertices j and k.

The PageRank algorithm starts with arbitrary values assigned to each vertex in the graph, and the computation proceeds until the change falls below a given threshold. The importance of each vertex in the graph is indicated by its final value, and the corresponding sentences are ranked based on these scores. The top-ranked sentences are obtained by sorting the sentences in decreasing order of their score.
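Equation (14) can be iterated directly on a weight matrix. The sketch below assumes an undirected similarity graph (symmetric weights, zero diagonal), which is the usual TextRank setting; it is an illustration, not the authors' implementation:

```python
def weighted_pagerank(w, d=0.85, iters=60):
    """Iterate eq. (14) on a weighted sentence-similarity graph.
    w: symmetric matrix of similarities with zero diagonal."""
    n = len(w)
    out = [sum(row) for row in w]  # denominator: sum of weights over Out(v_j)
    pr = [1.0] * n                 # arbitrary starting values
    for _ in range(iters):
        pr = [(1 - d) + d * sum(w[j][i] / out[j] * pr[j]
                                for j in range(n) if j != i and out[j] > 0)
              for i in range(n)]
    return pr

# Three sentences where the middle one is similar to both neighbours:
ranks = weighted_pagerank([[0, 1, 0],
                           [1, 0, 1],
                           [0, 1, 0]])
# the best-connected sentence receives the highest score
```

Sorting sentences by these scores and keeping the top ones yields the extractive summary for each cluster.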

5. EXPERIMENTATION AND RESULT

5.1 Dataset

In this work, the DUC 2002 [10] dataset is used for the experimentation. The test data contains 59 document sets, with each set containing 5 to 15 documents. For the experiments, document sets describing natural disaster events such as hurricanes, earthquakes, floods, droughts and volcanic eruptions were collected from the entire collection.

The performance of our proposed hybrid clustering method is compared with the PHA clustering method and the Single Linkage clustering algorithm using the silhouette coefficient. The silhouette coefficients obtained for PHA clustering, Single Linkage HAC and our proposed hybrid clustering method for different queries are shown in tables 1, 2 and 3 respectively. The first column, no. of topics, denotes the number of terms contained in a particular input query, and the second column, no. of clusters, denotes the optimal number of clusters computed from the clustering gain for that query.

Table 1: Silhouette Coefficient values for PHA clustering

No. of Topics   No. of Clusters   Silhouette Coefficient
2               2                 0.5691
3               3                 0.6652
4               5                 0.3505
5               6                 0.5714

Table 2: Silhouette Coefficient values for Single Linkage HAC

No. of Topics   No. of Clusters   Silhouette Coefficient
2               3                 0.2555
3               3                 0.6652
4               5                 0.3505
5               6                 0.5714

Table 3: Silhouette Coefficient values for the PHA-ClusteringGain-K-Means hybrid clustering method

No. of Topics   No. of Clusters   Silhouette Coefficient
2               2                 0.5864
3               3                 0.6652
4               5                 0.5677
5               6                 0.6049

It is clear from the above tables that our proposed hybrid clustering method obtains better silhouette coefficient values than PHA clustering and Single Linkage HAC. We can therefore conclude that our method produces better clusters than those produced by PHA and Single Linkage clustering.

The execution time of our proposed hybrid clustering method is compared with that of the Single Linkage HAC method for the input query "hurricane earthquake" in the graph shown in figure 4. The X-axis denotes different dimensions of the input document-term matrix and the Y-axis denotes the execution time in milliseconds. It is clear from the graph that the time taken by our proposed method is significantly less than that of Single Linkage clustering for all dimensions of the input document-term matrix. The obtained results agree with the theoretical time complexities.

Figure 4: Time taken by the PHA-ClusteringGain-K-Means hybrid clustering method and Single Linkage HAC

The clustering gain graph for the input query "hurricane earthquake" is shown in figure 5. The X-axis denotes each iterative step of the proposed hybrid clustering method and the Y-axis denotes the clustering gain obtained at each iteration.

Figure 5: Clustering Gain

The chaining effect shown by the PHA clustering method is depicted in figure 6. Each point on the X-axis corresponds to the Silhouette Coefficient value of one data object (document), the Y-axis corresponds to the cluster number, and each cluster is drawn in a unique color. The PHA clustering is affected by the chaining problem: it produces two clusters, shown in blue and red respectively, with the zeroth cluster containing only one data object (document) and the first cluster containing the rest of the data objects. This problem is solved by the proposed hybrid clustering approach, which produces homogeneous clusters, as shown in figure 7.

Figure 6: Chaining effect shown by PHA clustering

Figure 7: Homogeneous clusters obtained by the PHA-ClusteringGain-K-Means hybrid clustering method

5.2 Summary evaluation measure

ROUGE [11] [14], a software package for the automated evaluation of summaries, is used for evaluating the summaries produced with our proposed hybrid clustering approach. The ROUGE evaluation is conducted for the input query "hurricane earthquake". The query produces two clusters, the first corresponding to the topic hurricane and the second to the topic earthquake. The Precision, Recall and F-Score for ROUGE-1, ROUGE-2 and ROUGE-L are computed and given in table 4.

Table 4: ROUGE Evaluation

Rouge/Parameter   ROUGE-1   ROUGE-2   ROUGE-L
Precision         0.5000    0.3092    0.5000
Recall            0.4604    0.2846    0.4604
F-Score           0.4794    0.2964    0.4794

6. CONCLUSION

We implemented a more efficient and integrated Information Retrieval system with three different phases: a Retrieval phase, a Clustering phase and a Summarization phase. In the Clustering phase, we extended the PHA clustering method to the PHA-ClusteringGain-K-Means hybrid clustering approach by combining the PHA method and the k-means method, using the clustering gain evaluation criterion to determine the ideal k value for k-means. The DUC 2002 dataset is used for the experimentation. Our results show that the proposed PHA-ClusteringGain-K-Means hybrid clustering approach is more efficient and accurate than the conventional Hierarchical Agglomerative Clustering (HAC) algorithm and PHA.

7. ACKNOWLEDGMENTS

This work derives direction and inspiration from the Chancellor of Amrita University, Sri Mata Amritanandamayi Devi. We thank Dr. M. Ramachandra Kaimal, Head of the Computer Science Department, Amrita University, for his valuable feedback.

8. REFERENCES

[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008, Introduction to Information Retrieval, Cambridge University Press, New York, NY, USA.
[2] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. 1999, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[3] David A. Grossman and Ophir Frieder. 1998, Information Retrieval: Algorithms and Heuristics, Kluwer Academic Publishers, Norwell, MA, USA.
[4] Daniel M. Dunlavy, Dianne P. O'Leary, John M. Conroy, and Judith D. Schlesinger. QCS: A system for querying, clustering and summarizing documents, Information Processing and Management, Volume 43, Issue 6, November 2007, Pages 1588–1605.
[5] Yonggang Lu and Yi Wan. PHA: A fast potential-based hierarchical agglomerative clustering method, Pattern Recognition, Volume 46, Issue 5, May 2013, Pages 1227–1239.
[6] Rada Mihalcea. 2004, Graph-based ranking algorithms for sentence extraction, applied to text summarization, In Proceedings of the ACL 2004 Interactive poster and demonstration sessions (ACLdemo '04), Association for Computational Linguistics, Stroudsburg, PA, USA, Article 20.
[7] Yunjae Jung and Haesun Park. A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering, 2002, Kluwer Academic Publishers.
[8] Anastasios Tombros, Robert Villa, and C. J. Van Rijsbergen. The effectiveness of query-specific hierarchic clustering in information retrieval, Information Processing and Management, Volume 38, Issue 4, July 2002, Pages 559–582.
[9] Velmurugan T. Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data, Applied Soft Computing, Volume 19, June 2014, Pages 134–146.
[10] Document Understanding Conference (DUC), 2002, http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html.
[11] Lin, Chin-Yew. 2004, ROUGE: a Package for Automatic Evaluation of Summaries, In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25–26, 2004.
[12] Ju-Hong Lee, Sun Park, Chan-Min Ahn, and Daeho Kim. Automatic generic document summarization based on non-negative matrix factorization, Information Processing and Management, Volume 45, Issue 1, January 2009, Pages 20–34.
[13] A. K. Jain, M. N. Murty, and P. J. Flynn. 1999, Data clustering: a review, ACM Computing Surveys 31, 3 (September 1999), 264–323.
[14] Lin, Chin-Yew and E. H. Hovy. 2003, Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics, In Proceedings of the 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 – June 1, 2003.