Investigations in Document Clustering and Summarization

Submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

by

Naveen Saini
(Roll No. 1621CS12)

Under the supervision of
Dr. Sriparna Saha
Prof. Pushpak Bhattacharyya

Department of Computer Science and Engineering
Indian Institute of Technology Patna
Patna - 801106, India
March 2020

© 2020 by Naveen Saini. All rights reserved.

APPROVAL OF THE DOCTORAL COMMITTEE

Certified that the thesis entitled “Investigations in Document Clustering and Summarization”, submitted by Naveen Saini to Indian Institute of Technology Patna for the award of the degree of Doctor of Philosophy, has been accepted by the doctoral committee members after the successful completion of the synopsis seminar held on 07 January 2020.

Dr. Sriparna Saha
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Prof. Pushpak Bhattacharyya
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Dr. Asif Ekbal
Chairperson, Doctoral Committee
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Dr. Jimson Mathew
Member, Doctoral Committee
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Dr. Yogesh Mani Tripathi
Member, Doctoral Committee
Department of Mathematics
Indian Institute of Technology Patna


DECLARATION BY THE SCHOLAR

I certify that:

• The work contained in this thesis is original and has been done by me under the guidance of my supervisors.

• The work has not been submitted to any other Institute for any degree or diploma.

• I have followed the guidelines provided by the Institute in preparing the thesis.

• I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.

• Whenever I have used materials (data, theory and text) from other sources, I have given due credit to them by citing them in the text of the thesis and giving their details in the reference section.

• The thesis has been checked by anti-plagiarism software.

Naveen Saini


CERTIFICATE

This is to certify that the thesis entitled “Investigations in Document Clustering and Summarization”, submitted by Naveen Saini to Indian Institute of Technology Patna, is a record of bonafide research work under our supervision and we consider it worthy of consideration for the degree of Doctor of Philosophy of the Institute.

Dr. Sriparna Saha
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Prof. Pushpak Bhattacharyya
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Place: Indian Institute of Technology Patna Date:


Acknowledgement

First and foremost, I would like to express my deep and heartfelt gratitude to my supervisors, Dr Sriparna Saha and Professor Pushpak Bhattacharyya, for their blessings and valuable advice throughout my research journey at IIT Patna. Their continuous support and encouragement have always inspired and motivated me to give my best. They assisted me in shaping my research ideas and helped me get through all the obstacles in my four-year PhD journey. I could not have imagined having better advisors for mentoring my PhD. I am indebted to them for their constant support, time, suggestions, and positive attitude towards my research. Besides my supervisors, I am also grateful to the members of my Doctoral Committee (Dr Asif Ekbal, Dr Jimson Mathew, and Dr Yogesh Mani Tripathi) for examining my work and providing their valuable comments and suggestions. Moreover, I am thankful to all members of the AI-NLP-ML Group of IIT Patna for supporting me and for all the fun we have had in the last four years. I would also like to extend my appreciation to my seniors, batch-mates and juniors for their support. Furthermore, I would like to express my sincere gratitude to my parents, brother and sisters for their never-ending love, care, and affection in each and every step of my life. They always encouraged me to achieve my goals. I am grateful to my lovely wife Nisha and my in-laws for their perpetual understanding, patience, and the support they provided endlessly throughout my research. Last but not least, I thank the Department of Computer Science and Engineering and IIT Patna itself for giving me an opportunity to do my research while providing all the research facilities and travel grants.

Place: Indian Institute of Technology Patna Date: Naveen Saini


Abstract

A tremendous amount of text content is available in the form of documents, microblogs, scientific articles, and other sources, and it keeps growing exponentially over time with the arrival of new data from multiple sources. In order to scan through such a large volume of data, there is a need to develop efficient text-mining techniques. In this direction, several supervised methods have been developed to prevent decision makers from being overwhelmed by too much information. However, these supervised methods require a massive amount of labelled data, and data annotation is a very time-consuming and costly process. These challenges therefore demand the development of unsupervised methods. In the current thesis, two areas of text mining have been deeply investigated, namely document clustering and summarization, by developing unsupervised techniques to solve them. In document clustering, the task is to find the optimal partitioning of a given set of documents in an automatic way. In summarization, on the other hand, the aim is to compress the available data into a concise form that retains the relevant information. Different facets of summarization, like document summarization, figure-summarization, microblog summarization, and multi-modal microblog summarization, were explored in this thesis. The task of summarization is posed as a multi-objective optimization problem where multiple quality measures, like cohesion, readability and anti-redundancy, among others, are optimized simultaneously. A meta-heuristic optimization technique, namely differential evolution, is used as the underlying optimization strategy. Several new genetic operators inspired by the concepts of a self-organizing map are also incorporated in the optimization process. We employed the ROUGE-N measure to assess the quality of the extracted summaries. Extensive experimentation has verified that all our proposed methods outperform the existing methods when tested on task-related datasets.

Keywords: Unsupervised Learning, Clustering, Document Summarization, Figure-summarization, Microblog Summarization, Multi-modal Microblog Summarization, Multi-objective Optimization, Binary Optimization, Evolutionary Algorithm, Image Dense-Captioning, Word Mover Distance, Cosine Distance, Cluster Validity Indices, Self-organizing Map, Syntactic and Semantic Similarity.


List of Tables

2.1 Definitions of Cluster validity measures/indices. Here, K: number of clusters; N: number of data points; dist: distance function; Opt. in the last column refers to optimization...... 22

3.1 Parameter setting for our proposed approach ... 55
3.2 Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; N: Number of scientific articles; F: Vocabulary size; OC: Obtained number of clusters; DI: Dunn Index; xx: all data points assigned to a single cluster ... 58
3.3 Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; N: Number of scientific articles; F: Vocabulary size; OC: Obtained number of clusters; DB: Davies-Bouldin Index; xx: all data points assigned to a single cluster ... 59
3.4 Values of different components of the Dunn Index for tf, tf-idf and Glove representation with 100 dimensions on the WebKB dataset. Here, Rep. denotes representation; OC: obtained cluster; DI: Dunn Index; a: minimum distance between two points belonging to different clusters; b: maximum diameter amongst different clusters ... 61
3.5 Results reporting DB index values obtained after application of the proposed clustering algorithm on WebKB documents using Doc2vec representation in comparison to other clustering algorithms. Here, Rep. denotes representation; N: Number of scientific articles; F: Vocabulary size; OC: Obtained number of clusters; DB: Davies-Bouldin Index ... 62
3.6 p-values obtained after conducting t-test comparing the performance of the proposed SMODoc clust algorithm with other existing clustering techniques with respect to Dunn index values reported in Table 3.2. Here, xx: values are absent in Table 3.2 ... 63
3.7 Comparative complexity analysis of existing clustering algorithms. Here, R is the number of reference distributions [1]; K is the maximum number of clusters present in a data set, which is √N; N is the number of data points; TotalIter is the number of iterations used, chosen in such a way that the number of fitness evaluations of all the algorithms becomes equal ... 64

4.1 Brief description of datasets used for single document summarization ...... 75 4.2 Experiment results on ESDS SMODE on different parameter combinations. The values of CR, F and eta correspond to levels (1, 2, 3) are (0.4, 0.6, 0.8), (0.3, 0.8, 1.5) and (19, 20, 21), respectively. Here, SNRA is the Signal to Noise Ratio, MEAN is mean of uncontrolled factor values (ROUGE-1 score values) of different documents. 77


4.3 ROUGE Scores of different methods on DUC2001 and DUC2002 data sets . . . . 77 4.4 Improvements obtained by our proposed approach over other methods based on ROUGE−2 score ...... 79 4.5 Improvements obtained by our proposed approach over other methods using ROUGE−1 score on DUC2002 dataset ...... 81 4.6 Improvements obtained by DE over other methods using ROUGE−1 score on DUC2001 dataset ...... 81

5.1 ROUGE Scores attained by different methods for DUC2001 and DUC2002 data sets...... 101 5.2 ROUGE Scores attained by proposed Approach-1 and Approach-2 utilizing word mover distance (WMD) on CNN dataset. Here, SMaxRouge strategy is used for selecting a single best solution from the final Pareto front...... 102 5.3 ROUGE Scores obtained using Approach-1 (WMD) when the best solution is selected using any of the strategies under UMaxRouge strategy. All the strategies explored here for selecting a single best solution from the final Pareto front are unsupervised in nature. Bold entries indicate they are able to beat the state-of- the-art algorithms...... 104 5.4 Improvements attained by the proposed approach, Approach-1 (WMD) with SOM based operators over other methods considering ROUGE scores. Here, xx indi- cates non-availability of results on the DUC2001 dataset...... 107 5.5 The p-values obtained by Approach-1 (WMD) with SOM and without SOM based operators (under SMaxRouge scheme) considering ROUGE-1 and ROUGE-2 score values...... 110 5.6 The p-values obtained by Approach-1 (WMD) with SOM based operators (under SMaxRouge scheme) considering respect to existing methods...... 114

6.1 Description of symbols used in describing objective functions (mathematical for- mulation)...... 121 6.2 Statistics of the used datasets. Here, U1 and U2 are the average number of unique sentences per figure in FigSumGS1 and FigSumGS2 dataset, respectively; #SentInGS is the number of sentences present in the gold summary; ‘-’ implies 18th and 19th articles are not used in FigSumGS2 dataset...... 129 6.3 Parameter setting for our proposed approach. Here, Q is the number of sentences in the actual summary specific to a figure...... 130 6.4 Average precision (P), recall (R) and F-measure (F1) values obtained for both datasets using reduced set of sentences. Here, the decimal number in the left of ‘’is the standard deviation...... 131 6.5 Average precision (P), recall (R) and F-measure (F1) values obtained by the proposed approach for both datasets namely, FigSumGS1 and FigSumGS2, by varying the objective function combinations. Here, the decimal number in the left of ‘’is the standard deviation. Note that here all sentences in the article are used for the experiment...... 132 6.6 Comparison of the best results obtained by our proposed approach with (a) un- supervised methods; (b) supervised methods, in terms of average precision (P), recall (R) and F-measure (F1) for both datasets namely, FigSumGS1 and Fig- SumGS2. Here, the decimal number in the left of ‘’is the standard deviation. Note that here all sentences in the article are used for the experiment...... 133


7.1 Dataset descriptions for Microblog Summarization ...... 153 7.2 Average ROUGE Scores over all datasets attained by the proposed method using supervised information. Here, † denotes the best results; it also indicates that results are statistically significant at 5% significance level...... 154 7.3 ROUGE Scores obtained by the proposed approach for different datasets using SBest selection method. Bold entries indicate the best results considering ‘with SOM’ and ‘without SOM’ based operators...... 155 7.4 ROUGE Scores obtained by the proposed approach for different datasets using UBest selection method...... 158 7.5 Average ROUGE Scores over all datasets attained by existing methods in compar- ison with the best results obtained by the proposed approach using SBest (Table 7.2) and UBest (Table 7.4) selection methods . Here, WOSOM refers to without SOM, SBest and UBest are the supervised and unsupervised selection methods. 158 7.6 Sensitivity analysis on the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in predicted summary, respectively...... 166 7.7 Sensitivity analysis of the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in the pre- dicted summary, respectively. Note that this table is a continuation of Table 7.6...... 167 7.8 Range of possible values for each of 5 parameters ...... 167 7.9 Average ROUGE scores over all datasets corresponding to different parameter settings shown in Table 7.6 and Table 7.7. Here PS stands for Parameter Setting. 168 7.10 Statistics about the DUC2002 topics and corresponding average ROUGE-2 scores. 168

8.1 Notations used with their descriptions. Here, tf-idf refers to Term frequency- inverse document frequency...... 173 8.2 Dataset Statistics. Here, #ITWIM: Informative tweets with informative images; #PITWIM: Pre-processed Informative tweets with informative images; #GT: Number of tweets in Gold Summary...... 176 8.3 Comparison of ROUGE scores obtained using proposed approach MMTweetSumm and MOOTweetSumm. Here, R and Obj in second row refer to objective functions used and ROUGE; A1 and A represent MaxAntiRedundancy calculation using text+image and text, respectively; T, L and BM25 represent MaxSumTFIDF, MaxLength, and MaxSumBM25 objective functions, respectively...... 183 8.4 Comparison of ROUGE scores attained by our method with the existing methods. 183


List of Figures

1.1 A set of documents clustered into three categories...... 2 1.2 Similarity finding between two living objects: (a) Tiger; (b) Cat...... 3 1.3 An example of extractive and abstractive summarization. In extractive, coloured lines form the summary. In abstractive, using the coloured words in the document, a new sentence is constructed...... 5 1.4 Layout of the thesis...... 10

2.1 Working model of K-means clustering algorithm ... 14
2.2 Working model of K-medoid clustering algorithm ... 15
2.3 Working model of single-linkage clustering algorithm ... 16
2.4 Most similar words for ‘sweden’ obtained using word2vec model ... 18
2.5 Word-pair relationship obtained using word2vec model [Source: Internet] ... 19
2.6 An illustration for WMD calculation between two texts. Here, a, b, c and d denote the distances between words. The bold words are the non-stop words embedded into a word2vec space ... 20
2.7 SOM Architecture. Here x^p = (x1^p, x2^p, ..., xn^p) is the input vector, Z1 and Z2 denote the axes of the 2-D map, wu is the weight vector of the uth neuron ... 22
2.8 A real-life example of SOM where wealthy nations like USA, Canada, etc., come close to each other (left side of the SOM grid), while poorer nations like Nepal, Bangladesh, etc., are on the opposite side of the SOM grid ... 23
2.9 An example of Textual Entailment ... 24
2.10 Comparison among SOO and MOO ... 25
2.11 Dominance and non-dominance between solutions obtained using MOO ... 26
2.12 The steps of evolutionary procedures ... 27
2.13 Mating pool construction for current solution using SOM ... 32

3.1 Flow chart of proposed algorithm for automatic multi-objective document clus- tering. Here, P: population containing solutions, |P |: size of the population, wi: weight vector of ith neuron, gmax: maximum number of generations, A: archive (copy of population P), Q: Mating pool; S: training data for SOM...... 44 3.2 Steps of population initialization ...... 45 3.3 Generation of trial solution ...... 47 3.4 Generation of new solution. Here rand() is a function which generates some random number between 0 to 1 ...... 48 3.5 Ranking of solutions...... 50 3.6 Word Cloud of (a) NIPS 2015 ; (b) AAAI 2013 ; (c) WebKB datasets ...... 52


3.7 Relevant cluster-keywords for (a) NIPS 2015; (b) AAAI 2013 data set correspond- ing to the best partitioning result obtained by the proposed approach ...... 57 3.8 Pareto optimal fronts obtained after application of the proposed clustering algo- rithm on scientific articles (a) NIPS 2015 ; (b) AAAI 2013 ; (c) WebKB datasets 60

4.1 Flow chart of the proposed architecture, ESDS SMODE, where, gmax is the user- defined maximum number of generations, g is the current generation number. . . 70 4.2 Results generated by Taguchi method. Here, SN is the Signal to Noise Ratio which we have to maximize. SN is maximum for CR, F and η (eta m) at levels 3, 2 and 2, respectively...... 78 4.3 Pareto Fronts obtained by ESDS SMODE over three documents of DUC2001 dataset. In (b) and (c), all solutions are of rank-1...... 80 4.4 An example of good quality-generated summary with respect to reference sum- mary for the document, AP 880316 − 0061, of topic d21d under DUC2001 dataset. 81 4.5 An example of low-quality summary. (a) Some sentences of the document, AP 891101− 0150, of topic d16c under DUC2001 dataset. (b) reference summary and predicted summary of the same document...... 82

5.1 Proposed architecture. Where g is the current generation number initialized with 0; gmax is the maximum number of generations which is defined by the user; |P | is the number of solutions in the population. After step-8, g is incremented by 1 and the process continues until maximum number of generations is reached. . . . 93 5.2 Pareto optimal fronts obtained after application of the proposed approach. Here, Proposed approach refers to Approach-1 (WMD) with SOM-based operators. Sub-figures (a), (b), (c) and (d) are the Pareto optimal fronts obtained after first, fourteen, nineteen and twenty-fifth generation, respectively. Red color dots represent Pareto optimal solutions; three axes represent three objective functional values, namely, sentence position, readability, coverage...... 102 5.3 Convergence plots. Sub-figures (a), (b), (c) and (d) show the convergence plots for four random documents. At each generation/iteration, maximum Rouge-1 and Rouge-2 scores are plotted...... 106 5.4 An example of reference summary and predicted summary for document AP 881109− 0149 of topic d21d under DUC2001 dataset...... 108 5.5 An example of reference summary and predicted summary for document SJMN91− 06106024 of topic d60k under DUC2001 dataset...... 109 5.6 Box plots. Sub-figures (a) and (b) for DUC2001 and DUC2002 dataset, respec- tively, show the variations of average Rouge-1/Rouge-2 values of highest ranked (rank-1) solutions in each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions...... 111 5.7 Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2001 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document...... 112 5.8 Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2002 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document...... 113


6.1 ith solution representation in the population. Here, 12 is the number of sentences in the article, ‘0’ denotes that the sentence will not be a part of extractive summary and vice-versa...... 121 6.2 Flow chart of the proposed architecture where, g is the current generation number initialized with 0 value, tmax is the user-defined maximum number of generations, |P | is the size of the population...... 124 6.3 Flow chart of generation of solutions from the current solution, ~xc,t at generation ‘t’ using two DE variants. Here, F and CR are the pool of some values; y1 and y2 are the trial vectors generated using current-to-rand/1/bin and current-to- best/1/bin scheme, respectively...... 127 6.4 Pareto optimal solutions obtained after applying our proposed approach at the end of 24th generation. (a) Figure illustrating objective functional values of SAR TE (denoted as SAR v2 in the figure), STE, and, SRF; (b) Figure illustrating the objective functional values of SRF, SOC1, and, SOC2. ‘fr-0’ in legend denotes solutions are of rank-1...... 135 6.5 An example of Summary obtained by our proposed approach. (a) Figure-4 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1159166; (b) Caption of the figure; (c) Actual and predicted summaries. Coloured lines (ex- cluding black colour lines) in actual and predicted summary indicate the matched lines...... 136 6.6 Box plots showing variations of the best F-measure values obtained for (a) Fig- SumGS1; (b) FigSumGS2 datasets. The symbols namely, A, B, C, D, E, F, and, G represent objective functions namely, SAR CS, SAR TE, STE, SRF, SFC, SOC1, and, SOC2, respectively...... 138

7.1 Figure showing (a) classification of tweets into situational and non-situational categories; (b) summarization of situational tweets...... 144 7.2 Example of Microblog Summarization...... 145 7.3 Flow chart of the proposed architecture where, g is the current generation number initialized to 0 and gmax is the user-defined maximum number of generations (termination condition), |P | is the size of population...... 149 7.4 Word clouds of disaster events, namely (a) Sandyhook (SH); (b) Uttarakhand flood (UK); (c) Typhoon Hagupit in Philippines (TH); and (d) Bomb blasts in Hyderabad (HB)...... 152 7.5 Figures showing the number of new solutions generated over the generations by our proposed approach using two objectives, Ob1+ Ob2; a comparative study between ‘with SOM’ and ‘without SOM’ based operators. Here, (a), (b), (c), and (d) correspond to SH, UK, TH and HB datasets, respectively...... 156 7.6 Generation objective function values using MOOTweetSumm (Without SOM, Ob1+ Ob2). Here, (a), (b), (c) and (d) correspond to SH, UK, TH and HB datasets, respectively...... 157 7.7 Box plots in sub-figures (a), (b), (c) and (d) for SH, UK, TH and HB datasets, respectively, show the variations of average Rouge-2/Rouge-L values of highest ranked (rank-1) solutions of each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions...... 160 7.8 Pareto optimal fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘With SOM’ version...... 161 7.9 Pareto fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘Without SOM’ version...... 162


8.1 Image available with the tweet-text during earthquake in Mexico...... 170 8.2 An example of dense captioning model taken from https://cs.stanford.edu/ people/karpathy/densecap ...... 172 8.3 Representation of a solution...... 176 8.4 Box plots in sub-figures (a), (b) and (c) for Harvey, Srilanka and Irma disaster events, respectively. These figures illustrate the range of Rouge-L values using different sets of objective functions...... 182 8.5 Maximum ROUGE scores per generation attained by MMTweetSumm over Har- vey dataset...... 184 8.6 Maximum ROUGE scores per generation attained by MMTweetSumm over Irma dataset...... 184 8.7 Number of new good solutions per generation by MMTweetSumm using Harvey dataset...... 184 8.8 Number of new good solutions per generation by MMTweetSumm using Irma dataset...... 185 8.9 Informative tweets with informative images provided by annotators of CrisisMMD dataset [2]...... 185 8.10 Four informative images with same tweet...... 185 8.11 Informative tweet with its image in the form of newspaper cutting...... 186 8.12 An example of caption generation by dense-caption model...... 187 8.13 Another example of caption generation by dense-caption model...... 188

Abbreviations

MOO     Multi-objective Optimization
EA      Evolutionary Algorithm
MOEA    Multi-objective Evolutionary Algorithm
DE      Differential Evolution
MODE    Multi-objective Differential Evolution
MOBDE   Multi-objective Binary Differential Evolution
PSO     Particle Swarm Optimization
TE      Textual Entailment
ESDS    Extractive Single Document Summarization
CVI     Cluster Validity Indices
PBM     Pakhira-Bandyopadhyay-Maulik
DI      Dunn Index
SI      Silhouette Index
DB      Davies-Bouldin
XB      Xie-Beni
SOM     Self-organizing Map
NDS     Non-dominated Sorting
CDO     Crowding Distance Operator


Contents

Certificate of Approval iii

Declaration v

Certificate vii

Acknowledgement ix

Abstract xi

List of Tables xii

List of Figures xvii

List of Abbreviations xxi

1 Introduction 1 1.1 Document Clustering ...... 2 1.1.1 Objectives ...... 2 1.1.2 Problem Statement ...... 3 1.1.3 Applications ...... 3 1.2 Summarization ...... 4 1.2.1 Objectives ...... 5 1.2.2 Approaches for Summarization ...... 5 1.2.3 Problem Statement ...... 6 1.2.4 Applications ...... 6 1.3 Motivation and Objective of the Thesis ...... 7 1.4 Contributions Outline ...... 9

2 Preliminaries and Literature Review 13 2.1 Preliminaries ...... 14 2.1.1 Clustering algorithms ...... 14 2.1.2 Text Representations ...... 17 2.1.3 Distance/Similarity Measures ...... 19 2.1.4 Cluster Validity indices ...... 21 2.1.5 Self-organizing Map ...... 21 2.1.6 Textual Entailment ...... 23 2.1.7 Multi-objective Optimization (MOO) ...... 24


2.1.8 Evolutionary Algorithms (EAs) ...... 26 2.1.9 Multi-objective Evolutionary Algorithms (MOEAs) ...... 27 2.1.10 Mathematics of Genetic operators in MODE framework ...... 29 2.1.11 Number of Fitness Function Evaluations ...... 30 2.1.12 SOM as a Mating Pool Construction Tool ...... 30 2.2 Literature Survey ...... 32 2.2.1 Document Clustering ...... 33 2.2.2 Extractive Single Document Summarization (ESDS) ...... 34 2.2.3 Figure-associated Text Summarization ...... 36 2.2.4 Microblog Summarization ...... 37 2.2.5 Multi-modal Microblog Summarization ...... 38 2.3 Evaluation Measures ...... 38 2.3.1 Document Clustering ...... 39 2.3.2 Summarization ...... 39 2.4 Chapter Summary ...... 40

3 Automatic Document Clustering: Fusion of MODE and SOM 41 3.1 Introduction ...... 42 3.1.1 Overview ...... 42 3.1.2 Key-contributions ...... 43 3.2 Proposed Methodology ...... 44 3.2.1 Solution Representation and Population Initialization: ...... 44 3.2.2 SOM Training ...... 45 3.2.3 Objective Functions Used ...... 45 3.2.4 Extracting Closer Solutions using Neighborhood Relationship of SOM . . 47 3.2.5 Offspring Reproduction (New Solution Generation) ...... 47 3.2.6 Selection Operation ...... 49 3.2.7 Termination Condition ...... 50 3.2.8 Selection of a Single Solution based on User Requirement ...... 50 3.3 Experimental Setup ...... 51 3.3.1 Datasets ...... 51 3.3.2 Evaluation Measures ...... 53 3.3.3 Comparative Approaches ...... 53 3.3.4 Preprocessing ...... 54 3.3.5 Representation Schemas Used ...... 55 3.3.6 Parameter settings ...... 55 3.4 Analysis of results obtained ...... 56 3.4.1 Results on NIPS 2015 Articles ...... 56 3.4.2 Results on AAAI 2013 Articles ...... 58 3.4.3 Results on WebKB dataset ...... 60 3.4.4 Results using XLNET Language Model ...... 62 3.4.5 Statistical Significance ...... 63 3.4.6 Complexity of proposed framework ...... 63 3.5 Chapter Summary ...... 65


4 Multi-objective Clustering based Framework for Extractive Single Document Summarization 67 4.1 Introduction ...... 68 4.1.1 Overview ...... 68 4.1.2 Contributions ...... 69 4.2 Problem definition ...... 69 4.3 Proposed Method ...... 70 4.3.1 Representation of Solution and Population Initialization ...... 70 4.3.2 Assignment of Sentences to Sentence Clusters ...... 71 4.3.3 Objective Functions Used ...... 71 4.3.4 SOM Training ...... 71 4.3.5 Genetic Operators ...... 72 4.3.6 Selection of Best Solutions for Next Generation and Termination Criteria 72 4.3.7 Summary Generation ( Module) ...... 73 4.4 Experimental Setup ...... 74 4.4.1 Datasets ...... 74 4.4.2 Evaluation Measure ...... 75 4.4.3 Comparing methods ...... 75 4.4.4 Parameter setting ...... 75 4.5 Experimental Results and their Discussion ...... 78 4.5.1 Comparison with Existing algorithms ...... 78 4.5.2 Improvements obtained ...... 79 4.5.3 Analysis of Results ...... 81 4.5.4 Statistical significance t-test ...... 83 4.5.5 Complexity of the proposed framework ...... 83 4.6 Chapter Summary ...... 84

5 Extractive Single Document Summarization using Multi-objective Binary Dif- ferential Evolution 87 5.1 Introduction ...... 88 5.1.1 Overview ...... 88 5.1.2 Contributions ...... 88 5.2 Statistical Features or Objective Functions ...... 89 5.2.1 Sentence Position ...... 90 5.2.2 Similarity with Title ...... 90 5.2.3 Sentence Length ...... 90 5.2.4 Cohesion ...... 91 5.2.5 Coverage ...... 91 5.2.6 Readability Factor ...... 91 5.3 Problem Definition ...... 92 5.4 Proposed Methodology ...... 92 5.4.1 Preprocessing ...... 93 5.4.2 Representation of Solution and Population Initialization ...... 93 5.4.3 Objective Functions Used ...... 94 5.4.4 SOM Training ...... 94 5.4.5 Genetic Operators ...... 94 5.4.6 Selection of the Best |P | Solutions for Next Generation ...... 95 5.4.7 Updation of SOM Training Data ...... 95 5.4.8 Termination Condition ...... 95


5.4.9 Selection of Single Best Solution and Generation of Summary ...... 95 5.5 Experimental Setup ...... 97 5.5.1 Datasets ...... 98 5.5.2 Evaluation Measure ...... 98 5.5.3 Comparing Methods ...... 98 5.5.4 Parameter Settings ...... 99 5.6 Experimental Results ...... 99 5.6.1 Discussion of Results Obtained using Normalized Google Distance (NGD) 100 5.6.2 Discussion of Results Obtained using Cosine Similarity (CS) ...... 100 5.6.3 Discussion of Results Obtained using Word Mover Distance (WMD) . . . 103 5.6.4 Study on Different Methods of Selecting a Single Best Solution from Final Pareto Front ...... 103 5.6.5 Convergence Plots ...... 105 5.6.6 Improvements Obtained ...... 107 5.6.7 Error-analysis ...... 108 5.6.8 Study on Effectiveness of SOM based Operators on DUC2001 and DUC2002 datasets ...... 109 5.6.9 Statistical Significance t-test ...... 112 5.6.10 Complexity Analysis of the Proposed Approach ...... 114 5.7 Conclusive Remarks ...... 115

6 Textual Entailment based Figure Summarization for Biomedical Articles 117 6.1 Introduction ...... 118 6.1.1 Overview ...... 118 6.1.2 Contributions ...... 120 6.2 Problem Definition ...... 121 6.3 Proposed Approach ...... 123 6.3.1 Pre-processing ...... 124 6.3.2 Population Initialization and Solution Representation ...... 124 6.3.3 Calculation of Objectives Functions ...... 125 6.3.4 Genetic Operators ...... 125 6.3.5 Selection of Best |P | Solutions for Next Generation ...... 128 6.3.6 Termination Condition ...... 128 6.3.7 Selection of Single Best Solution and Generation of Summary ...... 128 6.4 Experimental Setup ...... 128 6.4.1 Datasets ...... 129 6.4.2 Evaluation Measures ...... 129 6.4.3 Experimental Settings ...... 130 6.4.4 Comparative Methods ...... 130 6.5 Results and Discussion ...... 131 6.5.1 Comparison with Existing Unsupervised Methods ...... 133 6.5.2 Pareto fronts obtained ...... 134 6.5.3 An Example of Summary Obtained ...... 135 6.5.4 Error Analysis ...... 137 6.5.5 Box-plots ...... 137 6.5.6 Statistical Significance of Results ...... 139 6.5.7 Complexity Analysis of the Proposed Approach ...... 139 6.6 Conclusive Remarks ...... 140


7 Multi-objective Based Approach for Microblog Summarization 143 7.1 Introduction ...... 144 7.1.1 Overview ...... 144 7.1.2 Contribution ...... 146 7.2 Problem Definition ...... 147 7.3 Proposed Methodology ...... 148 7.3.1 Representation of Solution and Population Initialization ...... 149 7.3.2 Objective Functions Used ...... 149 7.3.3 SOM Training ...... 149 7.3.4 Genetic Operators ...... 149 7.3.5 Selection of Best |P | Solutions for Next Generation ...... 150 7.3.6 Updating SOM Training Data and Termination Condition ...... 150 7.3.7 Selection of Single Best Solution and Generation of Summary ...... 150 7.4 Experimental Setup ...... 151 7.4.1 Datasets ...... 152 7.4.2 Comparative Methods ...... 153 7.4.3 Evaluation Measure ...... 153 7.4.4 Parameters Used ...... 153 7.5 Discussion of Results ...... 154 7.5.1 Discussion of results obtained using SBest selection method ...... 154 7.5.2 Discussion of results obtained using UBest selection method ...... 156 7.5.3 Comparative Analysis ...... 159 7.5.4 Quality of Summaries for Different Solutions ...... 159 7.5.5 Pareto Fronts Obtained ...... 160 7.5.6 Sensitivity Analysis on the Parameters Used ...... 161 7.5.7 Statistical significance test ...... 163 7.6 An Application to Multi-document Summarization ...... 163 7.6.1 Comparative Approaches and Differences with Our Approach ...... 163 7.6.2 Results Obtained ...... 164 7.7 Conclusive Remarks ...... 164

8 Multi-modal Microblog Summarization 169 8.1 Introduction ...... 170 8.1.1 Overview ...... 170 8.1.2 Major Contributions ...... 172 8.2 Tweet-scoring Functions ...... 173 8.3 Dataset Creation ...... 174 8.4 Problem Statement ...... 176 8.5 Proposed Methodology ...... 177 8.5.1 Population and Parameter Initialization ...... 177 8.5.2 Objective Functions Calculation ...... 177 8.5.3 Grouping of Similar Solutions ...... 179 8.5.4 New Solution Generation ...... 179 8.5.5 Selection of Top Best Solutions ...... 179 8.5.6 Update Survival Length and Mating Restriction Probability ...... 179 8.5.7 Selection of Single best Solution ...... 180 8.6 Experimental Setup ...... 180 8.6.1 Evaluation Measure ...... 181 8.6.2 Parameters Used ...... 181


8.6.3 Comparative Approaches ...... 181 8.7 Discussion of Results ...... 181 8.7.1 Box-plots showing qualities of summaries corresponding to different solutions181 8.7.2 Comparison among MMTweetSumm and MOOTweetSumm ...... 183 8.7.3 Comparison of MOOTweetSumm with Existing Methods ...... 186 8.7.4 Error-analysis ...... 188 8.7.5 Statistical t-test ...... 189 8.8 Conclusive Remarks ...... 189

9 Conclusions and Future Works 191 9.1 Conclusions ...... 192 9.2 Suggestions for Further Work ...... 194

References 197

List of Publications 207

CHAPTER 1

Introduction

This chapter provides a brief introduction to document clustering and summarization. It then presents the scope of the thesis and concludes with the contributions of the thesis.


Over the past two decades, the fast growth of computer and information technology has fundamentally altered every discipline in science and engineering, transforming many areas from data-poor to increasingly data-rich. A vast amount of new information and data is generated every day through academic, social-network, and web-based interactions, and it has significant potential economic and societal value. Furthermore, these data keep growing exponentially over time with the arrival of new data from multiple sources. This has led to a surge of interest in the text-mining community to extract relevant information from the available data. This dissertation presents investigations in two fields of text mining: document clustering and summarization.

1.1 Document Clustering

Document clustering [1, 3] is defined as the partitioning of a given collection of documents into various K groups/clusters. For example, in Figure 1.1, a set of text documents is partitioned into three categories: sports, technology and education. For clustering, the value of K may or may not be known a priori. Clustering can also be referred to as an unsupervised classification technique because it does not utilize any labelled data. This is different from other classification models, like supervised and semi-supervised ones.


Figure 1.1: A set of documents clustered into three categories.

1.1.1 Objectives

There exist many clustering algorithms in the literature, two popular ones being K-means [4] and K-medoid [4], among others. For any clustering algorithm, two primary goals/objectives must be satisfied:

• high compactness within clusters (low intra-cluster distance)



Figure 1.2: Similarity finding between two living objects: (a) Tiger; (b) Cat.

• maximum separation between clusters (high inter-cluster distance)

Here, high compactness means that the distance between points belonging to the same cluster (or the distance of points from their cluster centres) should be small. Maximum separation, on the other hand, means that the distance between clusters should be high. These notions are illustrated in Figure 1.1.
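
To make these two notions concrete, the following minimal sketch (an illustration with assumed toy data and variable names, not part of the proposed method) computes a simple compactness value, the mean distance of points from their own cluster centre, and a separation value, the minimum distance between cluster centres, for a small two-dimensional partitioning:

    import numpy as np

    # Toy 2-D data already partitioned into two clusters (illustrative values only).
    clusters = {
        0: np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9]]),
        1: np.array([[5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]),
    }
    centres = {k: pts.mean(axis=0) for k, pts in clusters.items()}

    # Compactness: average distance of points from their own cluster centre (lower is better).
    compactness = np.mean([np.linalg.norm(pts - centres[k], axis=1).mean()
                           for k, pts in clusters.items()])

    # Separation: minimum distance between any pair of cluster centres (higher is better).
    keys = list(centres)
    separation = min(np.linalg.norm(centres[a] - centres[b])
                     for i, a in enumerate(keys) for b in keys[i + 1:])

    print(f"compactness = {compactness:.3f}, separation = {separation:.3f}")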

1.1.2 Problem Statement

If {D1, D2, ..., DN} is a set of N documents, then the task is to find K document clusters, {C1, C2, ..., CK}, which satisfy the following conditions:

1. Ci = {D1^i, D2^i, ..., D_{npi}^i}, where npi is the number of documents in cluster i and Dj^i is the jth document of cluster i.

2. |C1 ∪ C2 ∪ ... ∪ CK| = N and Ci ∩ Cj = ∅ for all i ≠ j.

In our daily life, we often try to determine the similarity between various objects. For example, in Figure 1.2, two living objects are shown along with some similarity values. Here, ‘similarity’ means likeness in terms of features. Similarly, the similarity between two documents can be defined in terms of (a) the number of overlapping words, and (b) semantic similarity, among others. Based on this similarity, the documents are partitioned into various groups.
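
As a small illustration of the word-overlap view of document similarity (the example below assumes the scikit-learn library and toy documents that are not part of the thesis), documents can be mapped to tf-idf vectors and compared with cosine similarity; representations and measures of this kind are discussed in detail in Chapter 2:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the team won the cricket match",                # sports-like document
        "the team celebrated after winning the match",   # sports-like document
        "new processors speed up deep learning models",  # technology-like document
    ]

    # Map each document to a tf-idf vector and compute pairwise cosine similarities.
    tfidf = TfidfVectorizer().fit_transform(docs)
    similarity = cosine_similarity(tfidf)

    # Documents 0 and 1 share several words, so their similarity is much higher
    # than their similarity with document 2.
    print(similarity.round(2))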

1.1.3 Applications

In real life, there are many applications in which document clustering can be utilized. Below we mention some of them:

• Scope detection of journals/conferences: In any academic peer-review system, the process starts at the editor’s desk, whose job is to identify whether the submitted paper is within the scope of the journal/conference or not [5]. Scientific document clustering can help in identifying different sub-topics/sub-themes covered by the journal. The similarity between a newly submitted document and the scientific document clusters already extracted can help the editor in making an appropriate decision.

• Document classification: Using the document clusters obtained after scientific document clustering, a new set of documents can be classified. For example, in Figure 1.1, clusters of three categories are shown. If a new document has a similarity greater than a threshold with respect to the sports cluster, then the new document will be assigned to the sports category.

• Search-results Optimization: While feeding a query to a search engine, a user gets an enormous number of web pages, but not all of the pages are relevant, and the user may tire of searching for the relevant ones. In this scenario, providing a set of clusters of web documents/snippets may help the user in judging the appropriateness of each cluster with respect to the given query.

• Text Document Summarization: Document clustering plays a vital role in document summarization. After obtaining the clusters from scientific documents, top-scoring sentences, ranked using various measures such as the position of the sentence in the document, among other features, can be extracted as a part of the summary.

Other applications of document clustering can be in novelty detection, recommendation systems, topic modelling, and organizing an extensive collection of documents in a library, among others.

1.2 Summarization

Summarization [6, 7] focuses on shortening a given text while maintaining the essential meaning or content of the information. As per Eduard Hovy [8], a summary can be defined as: ‘a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s).’ The rapid increase in text-based data and the need to condense information have brought attention to developing automatic summarization techniques. Over the last decade, automatic text summarization has been one of the principal and most challenging problems in Natural Language Processing [9].


1.2.1 Objectives

The overall goal of any summarization system is to (a) compress the available data; (b) cover the central theme of the data; (c) save the time of the user; (d) increase active reading; and (e) help in decision making, among other goals. There exist many applications of summarization that cut down the time and cognitive effort a user would otherwise spend going through the entire available data.

1.2.2 Approaches for Summarization

Summarization includes two types of approaches: extractive and abstractive. Extractive summarization [7, 10] involves extracting relevant passages/sentences/paragraphs from a given corpus and then merging them to generate a summary. In abstractive summarization [11], on the other hand, the task is not only to identify the salient passages/sentences/paragraphs but also to reconstruct the text; thus, knowledge of natural language understanding and generation is required. In Figure 1.3, an example of extractive vs. abstractive summarization is shown.


Figure 1.3: An example of extractive and abstractive summarization. In extractive, coloured lines form the summary. In abstractive, using the coloured words in the document, a new sentence is constructed.


1.2.3 Problem Statement

Consider a document/event D consisting of N sentences, {s1, s2, ..., sN}. Then, our main task is to find a subset of sentences, S ⊆ D, such that

∑_{si ∈ S} li ≤ Smax                                    (1.1)

where S represents the subset of sentences that captures the central theme/topic of the document, i.e., it covers the relevant information from the document while reducing redundancy in the summary; si is a sentence belonging to S; li measures the length of the ith sentence in terms of the number of words; and Smax is the maximum number of words allowed in the generated summary.
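
As a deliberately simplified illustration of the length constraint in Eq. (1.1), the sketch below greedily adds the highest-scoring sentences to the summary as long as the word budget Smax is not exceeded; the example sentences and scores are assumptions made only for this illustration, not the multi-objective formulation developed later in the thesis:

    def greedy_extractive_summary(sentences, scores, s_max):
        """Pick sentences in decreasing score order while the total number
        of words stays within the budget s_max (cf. Eq. (1.1))."""
        selected, used_words = [], 0
        for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
            length = len(sentences[idx].split())
            if used_words + length <= s_max:
                selected.append(idx)
                used_words += length
        # Restore the original document order so the summary reads naturally.
        return [sentences[i] for i in sorted(selected)]

    sentences = [
        "The flood damaged several bridges in the northern district.",
        "Rescue teams evacuated more than two thousand residents overnight.",
        "Local officials will hold a press briefing tomorrow.",
    ]
    scores = [0.9, 0.8, 0.3]  # placeholder relevance scores
    print(greedy_extractive_summary(sentences, scores, s_max=20))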

1.2.4 Applications

Some examples are web-page summarization [12], bug-report summarization [13], single/multiple document summarization [14, 15], entity timelines summarization (given a timeline and an entity, the task is to generate a summary of memorable events involving this entity) [16], scientific document summarization [17], email-summarization [18], personal assistance summarization [19], figure-associated text summarization [20], microblog summarization [21], financial documents summarization (like earning reports and financial news which can help analysts quickly derive market signals from content) [22], among others. Below, some of these paradigms with the corresponding advantages are described as they are explored in this thesis:

Single/multi-document Summarization: Given a single document or a set of multiple documents, the task is to create a compressed version of the text document(s) that is concise, relevant, non-redundant and representative of the main idea of the text.

Microblog Summarization: Nowadays, social networking sites such as Twitter have become a leading source for gathering real-time information on ongoing events such as political issues, human-made and natural disasters, and other kinds of important events. The literature [23, 24] has shown the importance of accessing microblogging sites for gathering information. A vast number of tweets is posted every day, which makes extracting relevant information from such data a cumbersome process. Moreover, it has been observed that ordinary people stay connected through microblogging sites when natural disasters occur, and much useful information can be extracted from such tweets, which can further help the government in managing the situation. Therefore, this task aims to select relevant tweets automatically based on various tweet-scoring features.

Figure-associated Text Summarization: Biomedical literature incorporates millions of figures that are especially useful for researchers to validate their research findings. These figures are often difficult to interpret for humans as well as machines. According to Futrelle [25], 50% of the text in biomedical articles is related to figures. Moreover, as per [26], the caption of a figure together with the title and abstract of the article conveys only 30% of the information related to the figure. Therefore, this task aims at summarizing figures in biomedical articles using the associated texts.

1.3 Motivation and Objective of the Thesis

In this thesis, we investigate two areas of text mining: document clustering and summarization. These tasks have a wide variety of real-life applications, as discussed in Sections 1.1.3 and 1.2.4. Within summarization, multiple facets of extractive summarization, like single document summarization, figure-summarization, microblog summarization, multi-document summarization, and multi-modal microblog summarization, were explored. For each task, different benchmark datasets were utilized. In the literature, a significant amount of work has been carried out to solve these tasks, but the performance achieved was not up to the mark. This motivated us to develop more sophisticated algorithms to improve the performance of such systems.

In this direction, the concept of multi-objective optimization (MOO) [27], which is an important paradigm used in daily-life scenarios, seems to be useful. MOO can optimize more than one objective function simultaneously, depending on the task. For example, while purchasing a car, a customer may consider minimizing the cost and maximizing the comfort as his/her optimization criteria, and to satisfy these objectives, MOO provides many options. Similarly, for a document clustering task, it is quite natural that a given dataset may contain clusters of different shapes (e.g. hyper-spherical [1], convex [28]), which are difficult to determine using conventional clustering algorithms like K-means [4]. Moreover, the number of clusters may not be known beforehand. To detect the appropriate number of clusters and to discover clusters of different shapes automatically, MOO is well suited. After conducting a thorough literature survey on existing document clustering techniques, we arrived at the following conclusions:

• Existing approaches consider a fixed number of clusters, even though the appropriate number may not be known beforehand.

• In the existing multi-objective clustering algorithms, reproduction operators like roulette wheel selection and tournament selection [29], popularly used in single-objective optimization frameworks, are usually employed to generate new solutions. However, newly designed self-organizing map [30] based genetic operators have never been explored in fusion with multi-objective clustering techniques.

Similarly, after conducting a thorough literature survey on single document summarization, figure summarization, microblog summarization, and multi-modal microblog summarization, we identified the following:

• Existing works on single-document summarization considered the weighted sum of objective functions as their optimization criterion and demonstrated that their results are better than state-of-the-art results. However, combining the values of different objective functions using weighted criteria into a single value may not be meaningful [31].

• A recent technique for microblog summarization uses an ensemble approach, which generates a summary after considering the summaries produced by various algorithms like LexRank [32] and TextRank [33], among others, as discussed in [34]. Nevertheless, in real time, the application of the ensemble approach for summarizing tweets is time-consuming because we must first generate the summaries using the different algorithms and only then produce the final summary from these individual summaries.

• For the figure-summarization task, only one or two sentence-scoring measures (or objective functions) have been used: the similarity of sentences in the article with the figure’s caption, and the sentences referring to that figure, are considered to generate the summary. However, other objective functions, like whether or not a sentence in the article entails the figure’s caption and the number of overlapping N-grams, among others, could improve the quality of the summary. Existing works consider only syntactic similarity, rather than semantic similarity, to measure the similarity between sentences. Moreover, textual entailment, which is a challenging problem in the field of natural language processing, has never been utilized in the literature for calculating the anti-redundancy objective function.

• Currently, users post tweets containing multi-media content like images along with the tweet-text. These images may convey important information for the microblog summarization task, as everything cannot be described by the tweet-text alone due to length limitations. Only a few works exist in this area. Moreover, no work has explored the image dense-captioning model [35] to extract textual features from images for this task.


Beyond this, in the literature there exists no MOO-based framework for summarizing figures, microblogs and multi-modal microblogs that simultaneously optimizes different aspects of the summary. All the above limitations of existing works are the primary motivations for writing this thesis. In the current thesis, we have developed different optimization-based frameworks for solving the above-mentioned problems: all of these tasks can be posed as optimization problems and then solved using MOO-based frameworks. Note that in all works, multi-objective differential evolution (MODE) [36] is utilized as the underlying optimization strategy, and a SOM-based genetic operator is also explored in fusion with MODE for solving all the tasks (excluding figure-summarization). MODE is a population-based multi-objective evolutionary algorithm (MOEA) [37] inspired by the principles of biological evolution. An MOEA starts from a set of candidate solutions and explores the search space by optimizing various objective functions following an evolutionary procedure. Note that in the literature there exist other optimization strategies, like AMOSA [38], particle swarm optimization (PSO) [39, 40], and NSGA-II [29], but MODE has a faster convergence rate and an efficient global search capability for solving different real-life application problems [36].
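
Since a MOO framework compares candidate solutions through Pareto dominance rather than through a single scalar score, the following minimal sketch (an illustrative assumption, not the MODE machinery used in the later chapters) shows how the non-dominated solutions of a population can be identified when all objectives are to be maximized:

    def dominates(a, b):
        """True if solution a dominates b: a is at least as good in every
        objective and strictly better in at least one (all maximized)."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(solutions):
        """Return the solutions that are not dominated by any other solution."""
        return [s for s in solutions
                if not any(dominates(o, s) for o in solutions if o is not s)]

    # Each tuple holds two objective values, e.g. (cohesion, readability).
    candidates = [(0.6, 0.4), (0.5, 0.9), (0.7, 0.3), (0.4, 0.4)]
    # Keeps (0.6, 0.4), (0.5, 0.9) and (0.7, 0.3); drops the dominated (0.4, 0.4).
    print(pareto_front(candidates))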

1.4 Contributions Outline

• In Chapter 3, a bio-inspired multi-objective automatic document clustering technique is proposed, utilizing multi-objective differential evolution as the underlying optimization strategy. A variable number of cluster centres is encoded in different solutions of the population to determine the number of clusters in a data set in an automated way. These solutions undergo various genetic operations during evolution. Some neural network (self-organizing map) based operators are also incorporated in the proposed framework to determine the optimal clustering solution. To measure the goodness of a clustering solution, two cluster validity indices, namely the PBM and Silhouette indices, are optimized simultaneously (an illustrative sketch of computing the Silhouette index is given after this contributions list). Different representation schemes, including tf [41], tf-idf [42] and word embeddings [43, 44], are employed to convert articles into vector form. The effectiveness of the proposed approach is shown for the automatic clustering of three text datasets related to scientific articles and web documents.

• In Chapter 4, we propose a multi-objective clustering based extractive single document summarization technique. Firstly, the clustering of sentences is performed, and the qualities of clustering solutions are optimized using the framework developed in Chapter 3. In the second phase, sentences present in the optimized clusters are ranked using the weighted sum of various sentence-scoring features. After that, higher-ranked sentences are selected from each cluster to form a summary, and this process continues until the summary length constraint is satisfied. For evaluation, two standard summarization datasets, namely DUC2001 and DUC2002, are utilized. The obtained results show that our model is better than various existing supervised and unsupervised systems.

• In Chapter 5, a MOO-based system for single document summarization is proposed that aims to select a subset of sentences by simultaneously optimizing different quality/objective functions. These are the position of the sentence in the document, the similarity of a sentence with the title, the length of the sentence, cohesion, coverage, and readability. In any text-based summarization system, readability is an essential factor, as the generated summary should be readable to end-users; therefore, in our approach, the readability feature is considered as the sixth objective function. All these objective functions are maximized simultaneously using a multi-objective optimization framework. To measure the similarity or dissimilarity between sentences, different existing distance measures like word mover distance, cosine distance and normalized Google distance are explored. It is also shown that the best performance depends not only on the objective functions used but also on the correct choice of the similarity/dissimilarity measure between sentences. After application of any MOEA, a set of optimized solutions is generated, and our system is also based on an MOEA; therefore, to select the best solution (representing the best summary), various unsupervised methods are explored. The efficacy of our proposed approach is shown on three datasets, DUC2001, DUC2002, and CNN. The results clearly illustrate that our approach performs better than state-of-the-art techniques.

[Figure: thesis organization — Chapter 1: Introduction; Chapter 2: Preliminaries and Literature Review; Chapter 3: Automatic Multi-objective Document Clustering; Chapter 4: Multi-objective Clustering based Single Document Summarization; Chapter 5: Single Document Summarization as a Binary Optimization Problem; Chapter 6: Multi-objective based Figure-associated Text Summarization; Chapter 7: Multi-objective based Microblog Summarization; Chapter 8: Multi-modal Microblog Summarization; Chapter 9: Conclusion and Future Scope.]

Figure 1.4: Layout of the thesis.


• In Chapter 6, a novel unsupervised approach (FigSum++) for automatic figure summarization in biomedical scientific articles is proposed using a multi-objective evolutionary algorithm. It simultaneously optimizes various objective functions based on syntactic and semantic similarity with the figure’s caption, like the similarity between sentences and the figure’s caption, the number of overlapping words between sentences and the figure’s caption, etc. For conducting an efficient search, i.e., to reach the globally optimal solution, an ensemble of two different differential evolution variants is used in the proposed framework. To represent the sentences of the article in the form of numeric vectors, the recently proposed BioBERT [45], a language model pre-trained on biomedical text, is utilized and is further used in calculating the anti-redundancy measure. A textual entailment-based measure is also proposed to avoid redundancy in the summary. An ablation study has also been presented to determine the importance of the different objective functions. For the evaluation of the proposed technique, two benchmark biomedical datasets, namely, FigSumGS1 and FigSumGS2, are considered, and the performance is compared with several supervised and unsupervised systems.

• In Chapter 7, a microblog/tweet summarization technique (MOOTweetSumm) is proposed using the concepts of multi-objective optimization (MOO). Several tweet scoring features/objective functions, like the length of the tweet [21] and the tf-idf score of the tweet [21], are simultaneously optimized using the multi-objective binary differential evolution algorithm (MOBDE) [46], which is an evolutionary algorithm (EA). For evaluation, four benchmark datasets related to disaster events are used, and the results obtained are compared with various state-of-the-art techniques. Finally, an extension of the proposed approach to solve the multi-document summarization task is also illustrated.

• In Chapter 8, a MOO-based framework for microblog summarization is developed by considering the multi-media content (images) along with the tweet text, as opposed to the work done in Chapter 7. Two more objective functions, the BM25 score (a ranking function designed for short texts) [46] and the re-tweet score, are explored in addition to the objective functions discussed in the previous chapter.

The layout of the thesis is shown in Figure 1.4. In the next chapter, we provide brief explanations of background knowledge like clustering algorithms, cluster validity indices, the self-organizing map (SOM), and multi-objective evolutionary algorithms (MOEAs), among others. These are followed by a literature review of existing approaches developed for document clustering and various summarization tasks.


CHAPTER 2

Preliminaries and Literature Review

In this chapter, we discuss the preliminaries which form the primary basis of the other chapters of the thesis. These include clustering algorithms, the self-organizing map, cluster validity indices, the word2vec model, textual entailment, word mover distance, normalized Google distance, cosine similarity, differential evolution and evaluation measures. In addition, we also provide a brief literature survey on document clustering, single document summarization, figure-associated text summarization, microblog summarization and multi-modal microblog summarization.


2.1 Preliminaries

2.1.1 Clustering algorithms

K-means

K-means [47] is a well known unsupervised clustering algorithm in which the given dataset is partitioned into K clusters using a minimum center-distance criterion. It assumes that the number of clusters (K) is known a priori. The principle behind the K-means clustering technique is to minimize the following squared error function:

\sum_{j=1}^{K} \sum_{i=1}^{n} \| x_i^{(j)} - \mu_j \|^2

where K is the number of clusters and \| x_i^{(j)} - \mu_j \|^2 is the squared Euclidean distance between a data point x_i^{(j)} and the cluster center \mu_j.

[Figure: six panels illustrating K-means with K = 2 — randomly select 2 cluster centers, assign each point to the closest center, update the cluster centers, re-assign the data points based on minimum distance, and loop until the centers stop changing.]

Figure 2.1: Working model of the K-means clustering algorithm.

The basic steps of K-means are as follows:

1. Select the number of clusters (K).

2. Initialize the cluster centers by randomly picking K sample points from the data set: \mu_i = \text{random sample from the data set}, \; i = 1, 2, \ldots, K.

3. Generate partitioning of the data using nearest center based criterion as given below:

C_i = \{ x_j : d(x_j, \mu_i) \leq d(x_j, \mu_l), \; j = 1, 2, \ldots, n, \; l = 1, \ldots, K, \; l \neq i \}, \quad \forall i \in \{1, \ldots, K\}


where C_i is the ith cluster whose center is \mu_i, and d(x_j, \mu_i) denotes the Euclidean distance between the data sample x_j and the cluster center \mu_i.

4. Update the cluster centers using the following equation: \mu_i' = \left( \sum_{x_j \in C_i} x_j \right) / |C_i|, \quad \forall i \in \{1, \ldots, K\}

where, |Ci| is the number of data points in cluster Ci.

5. If the algorithm has converged, i.e., \mu_i' = \mu_i \; \forall i \in \{1, \ldots, K\}, then stop; otherwise go to step 3.

The working model of K-means clustering algorithm is shown in Figure 2.1.
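To make the above steps concrete, a minimal NumPy sketch is given below. It is an illustrative implementation of the listed steps (the data matrix X, the value of K and the iteration cap are assumptions), not the exact code used in this thesis.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means following steps 1-5 above; X is an (n, d) data matrix."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]          # step 2: random initial centers
    for _ in range(max_iter):
        # step 3: assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(K)])
        if np.allclose(new_centers, centers):                   # step 5: convergence check
            break
        centers = new_centers
    return labels, centers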

K-medoid

The K-medoid clustering algorithm [48] is closely related to K-means. Both algorithms partition the dataset into K groups, with the following differences:

• The main objective of K−means is to minimize the sum of squared errors between points in a cluster, while K-medoid minimizes the sum of dissimilarities between points in a cluster with respect to the medoid (center) in that cluster.

• In K-means, cluster center is obtained by averaging the data points belonging to the same cluster, while, in K-medoid, cluster center is chosen among the data points in that cluster, i.e., cluster center will be that data point whose average dissimilarity to all the data points in the same cluster is minimal.

[Figure: six panels illustrating K-medoid with K = 2 — randomly select 2 initial medoids, assign each point to the closest medoid, compute the cost of swapping a medoid with a non-medoid, and keep the swap whenever it reduces the total cost (e.g., 20 → 18 → 16 → 10), looping until no improving swap remains.]

Figure 2.2: Working model of the K-medoid clustering algorithm.


The most common realization of K-medoid clustering is the Partitioning Around Medoids (PAM) [49] algorithm and it is described as follows:

1. Initialize the value of K and select K data points randomly from the dataset as the initial medoids.

2. Assign each data point to the closest medoid.

3. For each medoid m and each data point p associated with m: (a) swap m and p; (b) compute the average dissimilarity of p to all the data points associated with m. The point yielding the lowest cost is kept as the new medoid.

4. Repeat steps 2 and 3 until there is no change in the medoids.

The working model of the K-medoid clustering algorithm is shown in Figure 2.2.
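A rough sketch of the PAM procedure described above is shown below, assuming a precomputed pairwise distance matrix D; the variable names and the simple swap strategy are illustrative, not the thesis implementation.

import numpy as np

def pam(D, K, max_iter=50, seed=0):
    """Partitioning Around Medoids on a precomputed (n, n) distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(D.shape[0], K, replace=False))   # step 1: random initial medoids
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)              # step 2: assign to the closest medoid
        changed = False
        for i in range(K):                                     # step 3: try to improve each medoid
            members = np.where(labels == i)[0]
            costs = D[np.ix_(members, members)].sum(axis=0)    # total dissimilarity of each candidate
            best = members[costs.argmin()]
            if best != medoids[i]:
                medoids[i], changed = best, True
        if not changed:                                        # step 4: stop when the medoids are stable
            break
    return np.argmin(D[:, medoids], axis=1), medoids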

Single-linkage

Single-linkage clustering [50] is a type of hierarchical clustering technique proposed in 1967, whose objective is to build a hierarchy of clusters. It usually has the following steps:

1. Initialize each data point as an atomic cluster.

2. Calculate the distances (similarities) between all clusters.

3. Merge two clusters that are closest to each other based on the shortest distance (most similar).

4. Return to step 2 until there is only a single cluster or a predetermined number of clusters.

In Figure 2.3, the working model of single-linkage clustering is shown. Here, the bullets are the data points, and the letters a to f show the cluster formations in increasing order based on the shortest distance between two points belonging to different clusters.


Figure 2.3: Working model of single-linkage clustering algorithm.
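For completeness, the same bottom-up merging can be reproduced with SciPy's hierarchical clustering using the "single" linkage criterion; the toy points below are only illustrative, not the data of Figure 2.3.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.4, 1.1], [3.0, 2.0], [3.3, 2.2], [6.0, 5.0], [6.4, 5.2]])
Z = linkage(points, method="single")             # repeatedly merge the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 clusters
print(labels)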


2.1.2 Text Representations

Before applying any machine learning tool, there is a need to represent the text in the form of a numeric vector. There exist various syntactic and semantic schemes in the literature for the representation of text. Syntactic representations include tf (a bag-of-words model using 1-grams) [41] and tf-idf [42], while semantic representations include word2vec [43, 51, 52] and GloVe [44]. These representations are briefly described below.

Term-frequency or Term-document Count (tf): Term-document count [41] is a representation scheme in which text documents (or any objects) are expressed as real vectors, where each component denotes the number of times a particular word appears in the document (called the weight of the word). It is denoted as tf_{t,d}, the number of times term “t” appears in document “d”. Example: Let two documents contain the following texts: Doc1: John likes to watch movies. Mary likes movies too. Doc2: John likes to watch football games. Here the vocabulary comprises the list of words (excluding stop words and “.”): <John, likes, watch, movies, Mary, football, games>. Now the document vectors are represented as: Doc1: <1, 2, 1, 2, 1, 0, 0> Doc2: <1, 1, 1, 0, 0, 1, 1>
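A tiny sketch of this term-count representation for the two example documents is shown below; the tokenization is deliberately naive, and the vocabulary order is the one listed above.

from collections import Counter

vocab = ["John", "likes", "watch", "movies", "Mary", "football", "games"]
doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John likes to watch football games."

def tf_vector(text, vocab):
    counts = Counter(text.replace(".", "").split())   # naive tokenization
    return [counts[w] for w in vocab]

print(tf_vector(doc1, vocab))   # [1, 2, 1, 2, 1, 0, 0]
print(tf_vector(doc2, vocab))   # [1, 1, 1, 0, 0, 1, 1]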

Term-frequency Inverse document frequency (tf-idf): tf-idf [42] is another well known scheme for weighting the terms in a document, utilizing the concept of the vector space model [41]. After assigning a tf-idf weight to each term, the document vector “v” of a document “d” can be represented as

v_d = [w_{1d}, w_{2d}, w_{3d}, \ldots, w_{nd}]    (2.1)

where

w_{t,d} = tf_{t,d} \cdot \left( 1 + \log \frac{1 + |D|}{1 + |\{d' \in D : t \in d'\}|} \right)    (2.2)

and

• tft,d is the term frequency of term t in document d in normalized form;

• 1 + \log \frac{1 + |D|}{1 + |\{d' \in D : t \in d'\}|} is the inverse document frequency, where |D| is the total number of documents in the collection and |\{d' \in D : t \in d'\}| is the number of documents containing the term t. Here, 1 is added in the numerator and in the denominator to avoid a division-by-zero error.


Example: Consider a document consisting of 300 words in which the word cat appears 5 times. The term frequency (i.e., ’tf’) for cat is 5/300 ≈ 0.017 (using ’l1’ normalization). Now, assume that we have 20 million documents (D) and the word cat appears in two thousand (df) of them. Then, using a base-10 logarithm, idf = 1 + log(20,000,001/2,001) ≈ 1 + 4.00 = 5.00. Thus, the tf-idf weight of the term cat is 0.017 × 5.00 ≈ 0.083. Similarly, tf-idf document vectors can be generated for the example vocabulary given earlier: Doc1: <0.12, 0.24, 0.12, 0.34, 0.17, 0, 0> Doc2: <0.17, 0.17, 0.17, 0, 0, 0.24, 0.24>
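A small sketch of the weighting in Eq. (2.2) is given below, reproducing the running “cat” example; the base-10 logarithm is an assumption made to match the numbers above.

import math

def tfidf_weight(term_count, doc_len, n_docs, doc_freq):
    tf = term_count / doc_len                              # l1-normalized term frequency
    idf = 1 + math.log10((1 + n_docs) / (1 + doc_freq))    # smoothed inverse document frequency
    return tf * idf

# 5 occurrences of "cat" in a 300-word document; 20 million documents, 2,000 of which contain "cat"
print(tfidf_weight(5, 300, 20_000_000, 2_000))             # ~0.083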

Word2vec: Word2vec [43] is a model used to generate word embeddings (vector representations of words). The model was developed by Mikolov et al. and uses a two-layer neural network which takes a large corpus of text as input and generates a unique vector of several hundred dimensions for each word in the corpus. The main principle behind it is that words sharing a common context are placed near to each other in the vector space. It can easily capture syntactic relations (for example, present vs. past tense) and semantic relations between two words (for example, country/capital relationships, male/female designation) using the context words. An example is shown in Figure 2.4, where the most similar words to the word ‘sweden’, obtained using the word2vec1 model, are shown. Another example is illustrated in Figure 2.5, where word-pair relationships are shown. To get a sentence/tweet (or document) vector, we can average the word vectors present in the document/sentence. However, there exist other schemes like concatenation of word vectors, etc.

[Figure: the words with the highest cosine similarity to ‘sweden’ in the word2vec vector space — norway 0.7601, denmark 0.7155, finland 0.6200, switzerland 0.5881, belgium 0.5858, netherlands 0.5746, iceland 0.5624, estonia 0.5476.]

Figure 2.4: Most similar words for ‘sweden’ obtained using word2vec model.
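The sketch below, which assumes the publicly released GoogleNews word2vec vectors and the gensim library, illustrates the nearest-word queries of Figure 2.4 and the sentence-vector averaging mentioned above; it is an illustration rather than the thesis code.

import numpy as np
from gensim.models import KeyedVectors

# path to the pre-trained GoogleNews vectors (an assumption; adjust as needed)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# nearest words in the embedding space (the exact casing depends on the model's vocabulary)
print(wv.most_similar("sweden", topn=5))

def sentence_vector(sentence, wv):
    """Average the vectors of in-vocabulary words to get a sentence/document vector."""
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

vec = sentence_vector("John likes to watch movies", wv)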

Glove: Glove [44] provides vector representations of words, similar to word2vec. Glove learns by constructing a co-occurrence matrix (words × contexts) that counts how frequently a word appears in a context; this matrix is then reduced to a lower dimension, where each row represents a word vector.

Figure 2.5: Word-pair relationships obtained using the word2vec model [Source: Internet].

1 https://github.com/mmihaltz/word2vec-GoogleNews-vectors
2 https://www.tensorflow.org/tutorials/representation/word2vec

2.1.3 Distance/Similarity Measures

Here, we discuss three distance/similarity measures. All of them can be evaluated either between sentences (in the case of single/multi-document and figure summarization) or between tweets (in the case of microblog summarization).

Word Mover Distance: Word Mover Distance (WMD) [53, 54] calculates the dissimilarity between two texts as the amount of distance that the embedded words [43] of one text need to travel to reach the embedded words of the other text [53]. Here, a text means a sentence. To obtain the word embeddings of the different words, the word2vec [43] model is used. If two sentences contain the same embedded words, their WMD will be 0. An example of WMD calculation between two texts is illustrated in Figure 2.6.
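As an illustration (not the thesis code), gensim exposes a Word Mover's Distance helper on top of loaded word2vec vectors (it requires an optimal-transport backend to be installed); `wv` below is assumed to be the model loaded in the earlier sketch.

# Word Mover's Distance between the two example texts of Figure 2.6
s1 = "Obama speaks to the media in Illinois".lower().split()
s2 = "The president greets the press in Chicago".lower().split()
print(wv.wmdistance(s1, s2))   # smaller means more similar; 0 when the embedded word sets coincide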

Cosine Similarity: Cosine similarity [55] is a measure of similarity between two non-zero vectors, computed as the cosine of the angle between them. It can be defined as:

\cos(\theta) = \frac{\vec{V}_1 \cdot \vec{V}_2}{\| \vec{V}_1 \| \, \| \vec{V}_2 \|} = \frac{\sum_{i=1}^{n} V_{1i} V_{2i}}{\sqrt{\sum_{i=1}^{n} V_{1i}^2} \, \sqrt{\sum_{i=1}^{n} V_{2i}^2}}    (2.3)

where \vec{V}_1 and \vec{V}_2 are vectors of length n, and V_{ji} is the ith component of the jth vector, j = 1, 2. The value of this similarity lies between -1 and 1: 1 means the two vectors overlap (are exactly similar), -1 means they are opposite to each other, and 0 indicates that they are orthogonal. Note that cosine similarity requires text vectors, which can be obtained using tf [41], tf-idf [42] or word2vec/GloVe [43, 44] representations.
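A one-function NumPy sketch of Eq. (2.3), applied to the tf vectors of the earlier example, is shown below for illustration.

import numpy as np

def cosine_similarity(v1, v2):
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_similarity([1, 2, 1, 2, 1, 0, 0], [1, 1, 1, 0, 0, 1, 1]))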


[Figure: Text1 “Obama speaks to the media in Illinois” and Text2 “The president greets the press in Chicago”; the non-stop words of one text are matched to the embedded non-stop words of the other, and WMD(Text1, Text2) = a + b + c + d.]

Figure 2.6: An illustration of WMD calculation between two texts. Here, a, b, c and d denote the distances between words. The bold words are the non-stop words embedded into a word2vec space.

Normalized Google Distance: Normalized Google Distance (NGD) measures the semantic relationship between two sentences using the terms present in the sentences. It was first proposed in [55]. Two terms tend to be close to each other if they have a similar sense. It is important to note that it is a dissimilarity measure, not a distance function. The NGD between two sentences, s_i and s_j, can be defined as:

d_{NGD}(s_i, s_j) = \frac{\sum_{t_1 \in s_i} \sum_{t_2 \in s_j} NGD(t_1, t_2)}{n_{t_i} \times n_{t_j}}    (2.4)

where t_1 and t_2 are the terms belonging to sentences s_i and s_j, respectively; n_{t_i} and n_{t_j} are the numbers of terms in sentences s_i and s_j, respectively. NGD can be expressed as:

NGD(t_1, t_2) = \frac{\max\{\log(f_{t_1}), \log(f_{t_2})\} - \log(f_{t_1, t_2})}{\log N - \min\{\log(f_{t_1}), \log(f_{t_2})\}}    (2.5)

where f_{t_1} denotes the number of sentences in the document (D) containing term t_1, f_{t_2} denotes the number of sentences in the document (D) containing term t_2, f_{t_1,t_2} indicates the number of sentences in the document (D) containing both terms t_1 and t_2, and N is the number of sentences in the document. Three important properties of NGD are listed below:

1. The range of dNGD(si, sj) lies in the scale of 0 to ∞.

2. If t_1 = t_2, or if t_1 ≠ t_2 but f_{t_1} = f_{t_2} = f_{t_1,t_2} > 0, then NGD(t_1, t_2) = 0.

3. For every sentence si, dNGD(si, si)=0 .


Note that if N = 1, then we have f_{t_1} = f_{t_2} = f_{t_1,t_2}. In this case, NGD(t_1, t_2) = 0/0, which is treated as 0 by the second property of NGD.
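A sketch of Eqs. (2.4)–(2.5) is given below, treating a document as a list of tokenized sentences; the helper names and the handling of terms that never co-occur are assumptions made for illustration.

import math

def ngd(t1, t2, sentences):
    """Eq. (2.5): sentences is a list of term collections from one document."""
    N = len(sentences)
    f1 = sum(t1 in s for s in sentences)
    f2 = sum(t2 in s for s in sentences)
    f12 = sum((t1 in s) and (t2 in s) for s in sentences)
    if f12 == 0:
        return math.inf                     # terms never co-occur; NGD ranges over [0, inf)
    num = max(math.log(f1), math.log(f2)) - math.log(f12)
    if num == 0:
        return 0.0                          # covers the 0/0 case (property 2)
    den = math.log(N) - min(math.log(f1), math.log(f2))
    return num / den if den > 0 else math.inf

def d_ngd(si, sj, sentences):
    """Eq. (2.4): average NGD over all term pairs of the two sentences."""
    total = sum(ngd(t1, t2, sentences) for t1 in si for t2 in sj)
    return total / (len(si) * len(sj))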

2.1.4 Cluster Validity indices

Cluster validity indices [56] measure the quality of a partitioning obtained using a given clustering technique. These indices also help in determining the correct number of clusters from a dataset in an iterative way. Generally, there are two types of cluster validity indices:

1. External Cluster Validation Indices: These indices require external knowledge provided by the user (ground truth/original labels) to measure the goodness of the obtained partition- ing. Minkowski Scores [28], Adjusted Rand Index [57] etc. are some examples of external cluster validity indices.

2. Internal Cluster Validation Indices: These indices generally rely on the intrinsic structure of the data and do not require ground truth labels. Most of the internal validity indices measure the intra-cluster distance (compactness within clusters) and inter-cluster sepa- ration (separation between clusters). Silhouette index (SI) [58], Dunn index (DI) [59], Davies-Bouldin index (DB)[60], Xie-Beni (XB) index [28], PBM index [61] etc. are some popular internal cluster validity indices.

Out of these indices, PBM index [61], SI [58], DI [59], XB [28] and DB [60] index are used in this thesis. The formal definitions of these indices are presented in Table-2.1.
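As a usage illustration (not part of the thesis), scikit-learn already provides the Silhouette and Davies-Bouldin indices, which can be used to compare partitionings with different numbers of clusters; the data below are synthetic.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.random.rand(200, 10)                      # toy data: 200 points in 10 dimensions
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    si = silhouette_score(X, labels)             # to be maximized
    db = davies_bouldin_score(X, labels)         # to be minimized
    print(k, round(si, 3), round(db, 3))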

Algorithm 1 SOM_Framework(η_0, σ_0, S, T)
1: Initialize the learning constant η_0 and the neighborhood size σ_0; set the maximum iteration count T for SOM training; initialize each map unit by assigning it a weight vector randomly chosen from the training data S.
2: while t ≠ T do   (t is the current iteration number)
3:    Adjust the learning rate (η) and neighborhood size (σ): η = η_0 × (1 − t/T), σ = σ_0 × (1 − t/T).
4:    Randomly select a training input pattern x ∈ S.
5:    Find the winning map unit: u' = arg min_{1 ≤ u ≤ D} ‖x − w^u‖_2.
6:    Find the neighboring neurons: U = {u | 1 ≤ u ≤ D, ‖z^u − z^{u'}‖_2 < σ}.
7:    Update all neighboring neurons u ∈ U: w^u = w^u + η × exp(−‖z^u − z^{u'}‖_2) × (x − w^u).
8: return the weight vectors corresponding to the map units, w^u, u = 1, 2, ..., D.

2.1.5 Self-organizing Map

The Self Organizing Map [62, 30], or SOM, developed by Kohonen, is a type of artificial neural network which learns the data presented to it in an unsupervised way. It generates a low-dimensional output space for the given input space, which consists of high-dimensional training data.


Table 2.1: Definitions of cluster validity measures/indices. Here, K: number of clusters; N: number of data points; dist: distance function; Opt. in the last column refers to the direction of optimization.

PBM [61]
  Definition: PBM = \left( \frac{1}{K} \cdot \frac{E_1}{E_K} \cdot D_K \right)^2, where E_K = \sum_{s=1}^{K} E_s, E_s = \sum_{j=1}^{N} \mu_{sj} \, dist(x_j, c_s), E_1 = \sum_{x \in X} dist(x, c), D_K = \max_{i,j = 1, \ldots, K, \, i \neq j} \| c_i - c_j \|
  Description: E_K: total within-cluster scatter; [\mu_{sj}]_{K \times N}: membership matrix of the data; c_s: sth cluster center; c: cluster center of the whole data set; D_K: maximum separation between clusters.
  Opt.: Maximum

SI [58]
  Definition: SI = \frac{1}{N} \sum_{i=1}^{N} \frac{z_{i2} - z_{i1}}{\max(z_{i2}, z_{i1})}
  Description: z_{m1}: average distance of a point x_m in the kth cluster to the remaining points of the same cluster; z_{m2}: minimum of the average distances of the same point x_m from points belonging to other clusters.
  Opt.: Maximum

DI [59]
  Definition: DI = \frac{\min_{C_k, C_l \in \Im, \, C_k \neq C_l} \left( \min_{i \in C_k, j \in C_l} dist(i, j) \right)}{\max_{C_m \in \Im} diam(C_m)}
  Description: i and j denote data points; \Im: the partitioning produced by any clustering algorithm; C_k, C_l, C_m: different clusters; diam(C_m): the diameter of the mth cluster, calculated using the distance between two points of the same cluster.
  Opt.: Maximum

DB [60]
  Definition: DB = \frac{1}{K} \sum_{i=1}^{K} D_i, where D_i = \max_{i \neq j} R_{i,j} and R_{i,j} = \frac{S_i + S_j}{M_{i,j}}
  Description: M_{i,j}: separation between the ith and jth clusters; S_i: within-cluster scatter for cluster i.
  Opt.: Minimum

XB [58]
  Definition: XB = \frac{\sum_{k=1}^{K} \sum_{s \in c_k} dist(s, c_k)}{N \times \min_{i, i \neq j} dist(c_i, c_j)}
  Description: c_i and c_j: the ith and jth cluster centers.
  Opt.: Minimum

[Figure: an n-dimensional input sample x^p = (x_1^p, x_2^p, ..., x_n^p) from the input space is mapped to a 2-D grid of map units in the output space; map unit u has grid position z^u = (z_1^u, z_2^u) and weight vector w^u = (w_1^u, w_2^u, ..., w_n^u).]

Figure 2.7: SOM architecture. Here x^p = (x_1^p, x_2^p, ..., x_n^p) is the input vector, z_1 and z_2 denote the axes of the 2-D map, and w^u is the weight vector of the uth neuron.

Usually, the low-dimensional space (also called the output space) consists of a 2-D regular grid of neurons (but it can also be 1-D or 3-D, depending on the user). These neurons are called map units. Let S be a set of training data in n-dimensional space; then each map unit u ∈ {1, ..., D} (where D is the number of map units) has:

1. a predefined position in the output space: z^u = (z_1^u, z_2^u);

2. a weight vector w^u = [w_1^u, w_2^u, ..., w_n^u], where n is the input vector dimension and u is the index of the map unit in the 2-dimensional map.

Figure 2.7 shows the typical architecture of a SOM. In this example, the input space and output

space are n-dimensional and 2-dimensional, respectively. The main principle of SOM is to create a topographical map such that input patterns which are similar in the input space are mapped to neurons next to each other. In our work, the sequential learning algorithm [62] is utilized for the training of SOM, as shown in Algorithm 1. This algorithm returns the updated weight vectors of the different map units at the output. Before training the SOM, each neuron needs to be assigned a weight vector, randomly chosen from the available training data. At each iteration, when an input pattern is presented to the grid, all neurons compete to become the winning neuron; this process is known as winner-take-all learning. After finding the winning neuron, the weight vector of that neuron (closest to the presented input pattern) and those of the neighboring neurons are updated to move them closer to the input pattern. SOM can be broadly used in clustering, data compression, visualization, etc. A real-life example taken from the web2 is shown in Figure 2.8, where wealthy and poorer nations end up on opposite sides after being mapped to the SOM grid.

[Figure: (a) wealthy and poorer nations; (b) mapping of nations onto the SOM grid, where yellows and oranges indicate wealthy nations and purples and blues the poorer nations.]

Figure 2.8: A real-life example of SOM where wealthy nations like the USA, Canada, etc., come close to each other (left side of the SOM grid), while poorer nations like Nepal, Bangladesh, etc., lie on the opposite side of the SOM grid.
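A compact NumPy sketch of the sequential training in Algorithm 1 is given below; the grid size, learning constant and iteration count are illustrative defaults rather than the settings used in the thesis.

import numpy as np

def train_som(S, grid=(3, 3), eta0=0.5, sigma0=1.5, T=1000, seed=0):
    """Sequential SOM training (Algorithm 1). S is an (m, n) matrix of training vectors."""
    rng = np.random.default_rng(seed)
    D = grid[0] * grid[1]
    z = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], dtype=float)  # map positions
    w = S[rng.choice(len(S), D, replace=True)].astype(float)    # line 1: weights drawn from the data
    for t in range(T):
        eta = eta0 * (1 - t / T)                                 # line 3: decaying learning rate
        sigma = sigma0 * (1 - t / T)                             # line 3: shrinking neighborhood size
        x = S[rng.integers(len(S))]                              # line 4: random training pattern
        win = int(np.argmin(np.linalg.norm(w - x, axis=1)))      # line 5: winning map unit
        d = np.linalg.norm(z - z[win], axis=1)                   # grid distances to the winner
        nbr = d < max(sigma, 1e-9)                               # line 6: neighboring neurons
        w[nbr] += eta * np.exp(-d[nbr])[:, None] * (x - w[nbr])  # line 7: update rule
    return w, z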

2.1.6 Textual Entailment

Textual entailment (TE) [63] is a task in the natural language processing (NLP) domain and is an active research area [64, 65]. The definition of TE states that a sentence ‘p’ (called the hypothesis) is said to be entailed by a sentence ‘q’ (called the premise) if ‘p’ can be inferred from ‘q’ [63]. TE also describes whether the relationship between ‘p’ and ‘q’ is contradictory or neutral. An example

2 http://home.cc.umanitoba.ca/~umsidh52/PLNT7690/presentation/SOM.html

of entailment taken from the medical-domain MedNLI3 dataset is shown below: p: Patient had aphasia. q: Patient was not able to speak, but appeared to comprehend well.

where ‘p’ is entailed by ‘q’, represented as q → p. Another example, taken from Wikipedia4, is shown below, where the three relations — entailment, contradiction and neutral — between a text and a hypothesis are represented. Note that the TE relationship is unidirectional, i.e., if q → p, then it is not necessary that p → q. TE has many applications in the NLP domain, like information extraction, summarization, etc.

[Figure: premise q: “If you help the needy, God will reward you.” Three hypotheses p — Entailment: “Giving money to a poor man has good consequences.” Contradiction: “Giving money to a poor man has no consequences.” Neutral: “Giving money to a poor man will make you a better person.”]

Figure 2.9: An example of Textual Entailment.
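For illustration only, the entailment decision can be reproduced with a publicly available NLI model; the sketch below uses the Hugging Face transformers library and the `roberta-large-mnli` checkpoint, both of which are assumptions rather than the resources used in this thesis.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"                      # an assumed, publicly available NLI model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "If you help the needy, God will reward you."
hypothesis = "Giving money to a poor man has good consequences."

inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax())])   # expected label: entailment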

2.1.7 Multi-objective Optimization (MOO)

Multi-objective optimization (MOO) [27] is an important paradigm used in daily-life scenarios to take decisions. Consider an example: while purchasing a new car, a customer wants to achieve two main objectives, i.e., high comfort and low cost. But an increase in comfort may increase the cost. Hence, these are contradictory objectives and thus need to be optimized properly. Similarly, while booking a flight ticket, a customer wants to minimize two objectives, price and duration. These examples illustrate that there exist several real-life problems in which the simultaneous optimization of more than one objective is required. Such problems are referred to as Multi-objective Optimization Problems (MOOPs). Note that MOO is able to provide a set of alternative solutions to the decision maker satisfying the objectives and thus provides flexibility to the end-user to choose any solution, while in the case of single objective optimization (SOO), a single objective is optimized to get a single best solution. This phenomenon (SOO vs. MOO) is illustrated in Figure 2.10 using two mathematical functions. In this thesis, the problems of document clustering and summarization are posed as MOOPs.

3 https://physionet.org/physiotools/mimic-code/mednli/
4 https://en.wikipedia.org/wiki/Textual_entailment


[Figure: left panel (SOO) — the curve of f1(x) = x²; minimizing f1 alone gives the single best solution x = 0. Right panel (MOO) — the curves of f1(x) = x² and f2(x) = (x−2)²; f1 is minimum at x = 0 and f2 is minimum at x = 2, so both functions cannot attain their minima at a single point, and simultaneous minimization provides a set of non-dominating solutions lying between 0 and 2.]

Figure 2.10: Comparison between SOO and MOO.


In the case of document clustering, various cluster validity indices [58], measuring the goodness of a partitioning in terms of low compactness and high separability, can be simultaneously optimized. In the case of summarization, we can optimize different quality measures to obtain a good quality summary. For example, in the case of single document summarization, these quality measures can be anti-redundancy (to avoid redundancy in the summary), readability

(provide readable summary), etc.

Formal Definition:

MOO can be formulated as follows: find the vector X^* = \{\vec{x}_1^*, \vec{x}_2^*, \ldots, \vec{x}_n^*\} by simultaneously optimizing M (≥ 2) objective function values:

\{ f_1(\vec{x}), f_2(\vec{x}), \ldots, f_M(\vec{x}) \} \quad \text{such that } \vec{x} \in X^*    (2.6)

which satisfy m inequality constraints:

gi(~x) ≥ 0, i = 1, 2, . . . , m, (2.7)


and p equality constraints:

hi(~x) = 0, i = 1, 2, . . . , p (2.8)

where X^* is the set of optimal solutions, and the optimization can be maximization, minimization, or a mixture of both. These constraints define the feasible region in which the optimal solutions can lie.

Dominance and Non-dominance Criteria between Solutions

Dominance is one of the important concepts in MOO. It helps in deciding the optimality of solutions amongst the set of solutions obtained using MOO. Consider the scenario of a flight booking system where the objectives are to minimize both cost and duration. MOO will generate many solutions after optimizing these objectives, and among those solutions, a solution \vec{x}_j is said to be dominated by a solution \vec{x}_i if \forall k \in \{1, 2, \ldots, M\}, f_k(x_i) \leq f_k(x_j) and \exists k \in \{1, 2, \ldots, M\} such that f_k(x_i) < f_k(x_j). The set of solutions which are not dominated by any other solution is called the set of Pareto optimal solutions, and the surface on which they lie is called the Pareto optimal front. Note that in this set, all solutions are non-dominating to each other. For example, let MOO provide 12 solutions, numbered from 1 to 12 in Figure 2.11. Using the dominance rule, the solutions numbered from 1 to 6 are non-dominating to each other, and the remaining solutions are dominated by at least one solution among these (1 to 6). Therefore, we can call solutions 1 to 6 Pareto optimal solutions.

[Figure: 12 solutions plotted with cost (to be minimized) on the x-axis and duration (to be minimized) on the y-axis; solutions 1 to 6 lie on the Pareto optimal front, solutions 2 and 3 are non-dominating with respect to each other, and solution 3 dominates solution 8.]

Figure 2.11: Dominance and non-dominance between solutions obtained using MOO.
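A small sketch of the dominance rule for two minimization objectives is given below; the objective vectors are illustrative (cost, duration) pairs, not data from the thesis.

def dominates(fi, fj):
    """True if solution i dominates solution j: no worse in all objectives, better in at least one."""
    return all(a <= b for a, b in zip(fi, fj)) and any(a < b for a, b in zip(fi, fj))

def pareto_front(objs):
    """Indices of solutions not dominated by any other solution."""
    return [i for i, fi in enumerate(objs)
            if not any(dominates(fj, fi) for j, fj in enumerate(objs) if j != i)]

objs = [(2, 9), (3, 7), (4, 5), (6, 4), (8, 3), (9, 2), (5, 8), (7, 6)]   # (cost, duration)
print(pareto_front(objs))   # the first six points are mutually non-dominating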

2.1.8 Evolutionary Algorithms (EAs)

EAs [66] are population-based meta-heuristic optimization algorithms inspired by biological phenomena like parent selection, crossover (exchange of genes), mutation


(change in a gene value), environmental selection, etc. These operators are called genetic operators. An EA starts from a set of candidate solutions called a population. These solutions play the role of individuals in the population, and each solution is associated with a fitness value determining its quality. Over a number of generations, or until a specified time, or until good quality solutions in terms of fitness values are obtained, these solutions are evolved using the above genetic operators. Due to their population-based nature, EAs are able to provide a set of (near-)optimal global solutions in a single run at the final stage of the algorithm. The steps of the evolutionary procedure are illustrated in Figure 2.12. Some examples of EA techniques are differential evolution [67], genetic algorithms [29], particle swarm optimization [68], ant colony optimization [69], etc. All these algorithms follow the same evolutionary steps but differ in how they perform computational steps like crossover and mutation.

[Figure: a population (set of individuals/solutions with fitness values) → mating pool construction (parent selection) → crossover and mutation (crossover: exchange of genes; mutation: change in a gene value) → child population (new solutions) → the child population is merged with the parent population and compared using the fitness value(s) → the top best solutions are selected to update the population, with new solutions replacing old ones.]

Figure 2.12: The steps of evolutionary procedures.

2.1.9 Multi-objective Evolutionary Algorithms (MOEAs)

Nowadays, EAs are getting popular because (a) they can be adapted to solve any real-life problem, including clustering [3], [70], bioinformatics [71], social networks [72], etc.; and (b) they are able to provide multiple solutions after a single run. Moreover, to improve the quality of solutions, multiple fitness/objective functions can be optimized simultaneously, which is nothing but the incorporation of the MOO concept. Thus, it is appropriate to call EAs utilizing the MOO+EA concept multi-objective evolutionary algorithms (MOEAs) [73, 74]. Note that the nature of an MOEA is different from that of a single objective evolutionary algorithm (SOEA). In an SOEA [75], the task is to optimize (either maximize or minimize) a single objective (referred to as SOO) using the EA, to obtain a single optimal solution. In this thesis, the document clustering and summarization tasks are tackled using the concept of MOEA. In the literature, a

number of different MOEAs have been suggested to solve MOOPs. Some of them are NSGA-II [29], MODE [67], PSO (particle swarm optimization) [76], ant colony optimization [77], etc. Out of these, NSGA-II and MODE are briefly described here.

NSGA-II: It is a non-dominated sorting genetic algorithm proposed in [29]. It was developed to remove three drawbacks of NSGA [78]: (a) high complexity; (b) lack of elitism (preserving the good solutions found so far); (c) the need for a sharing parameter (to maintain diversity in the population). To preserve elitism and maintain diversity, it combines the old population and the new population (containing new solutions generated using genetic operators), and then the best solutions in the objective function space are identified using non-dominated sorting (NDS) and crowding distance based operators. The NDS algorithm assigns ranks to the solutions using their objective function values and puts them in different fronts based on their rankings. The crowding distance operator determines which solution in a front lies in the more crowded region. For the selection of the best solutions, solutions are selected in a rank-wise manner until the number of solutions equals the size of the population. In the case of a tie, the solution having the higher crowding distance [29] is selected. For more information, one can refer to [29].

Multi-objective Differential Evolution (MODE): Another MOEA is based on Differential Evolution (DE) [67], proposed by Storn and Price in 1995 to solve real-parameter optimization problems. In the MODE framework, for each solution (called the target vector), a new solution called the trial vector is generated using mutation, crossover and environmental selection (selection of the best solutions) operations, in sequence. In NSGA-II, mutation is performed by changing a component value of the solution; in MODE, the mutation operator instead adds the weighted difference between a pair of solutions (called vectors) to a third solution. The aim of the mutation operation is to find a search direction based on the distribution of solutions in the current population, and the solution so obtained is called the mutant vector. This mutant vector is mixed with the target vector using some crossover probability to obtain the trial vector. The obtained set of trial vectors is merged with the parent population, and then the environmental selection operator is applied to select the top solutions, which are then used for the next generation. For effective solution selection and to promote diversity and preserve elitism, the concepts of non-dominated sorting and the crowding distance operator [29] can be utilized. Thus, the process of generating new solutions using mutation and crossover, and then selecting the top solutions for the next generation, continues until a maximum number of generations is reached. It was shown in the literature that multi-objective DE (MODE) performs better than other MOEAs like PSO [39, 40], NSGA-II [29], etc., due to its faster convergence rate and efficient

global search capability for solving different real-life application problems [3]. Moreover, in recent years, a lot of work has been devoted to improving DE [79], which clearly indicates that it is more promising than other techniques. This motivates us to explore MODE for our document clustering and summarization tasks. Below we describe the mathematical equations used to perform the genetic operators.

2.1.10 Mathematics of Genetic operators in MODE framework

There exist many variants of MODE; each differs in the representation (real-coded or binary-coded) of the solution, the new solution generation strategy and the use of parameters. Let [\vec{x}^{1,t}, \vec{x}^{2,t}, \ldots, \vec{x}^{|P|,t}] be the population at generation ‘t’, where |P| is the size of the population. For each current solution, a trial vector v^{c,t+1} is generated using mutation and crossover operations. Let x^{c,t} be the current solution (target vector) at generation ‘t’ for which we want to generate a new solution.

Mating Pool Construction: Three solutions, x^{r1,t}, x^{r2,t}, and x^{r3,t}, are selected randomly from the population such that x^{r1,t} ≠ x^{r2,t} ≠ x^{r3,t} ≠ x^{c,t}.

Mutation: There exist various mutant schemes proposed in the literature [80], but, generally, DE/rand/1 is utilized. It generates mutant vector, uc,t+1, for the current solution. If solutions are real-coded then Eq. 2.9 is used to generate the mutant vector as

u_j^{c,t+1} = x_j^{r1,t} + F \times (x_j^{r2,t} - x_j^{r3,t})    (2.9)

where F is the scaling/weighting factor, generally lying in [0, 2], and x_j^{r1,t}, x_j^{r2,t} and x_j^{r3,t} are the jth components of the randomly chosen solutions at generation ‘t’. If the solutions are binary-encoded, then firstly a probability estimation operator P(x^{c,t+1}) is computed using the solutions in the mating pool and the current solution as per Eq. (2.10), and then the obtained probability vector is converted into binary space using a randomized procedure as per Eq. (2.11) to give rise to u^{c,t+1}.

P(x_j^{c,t+1}) = \frac{1}{1 + e^{- \frac{2b \times [x_j^{r1,t} + F \times (x_j^{r2,t} - x_j^{r3,t}) - 0.5]}{1 + 2F}}}    (2.10)

where P(x_j^{c,t+1}) is the probability estimation operator, (x_j^{r1,t} + F \times (x_j^{r2,t} - x_j^{r3,t}) - 0.5) is the mutation operation, and b is a real positive constant. Then the mutant vector u^{c,t+1} for the current solution x^{c,t} is generated as

u_j^{c,t+1} = \begin{cases} 1, & \text{if } rand() \leq P(x_j^{c,t+1}) \\ 0, & \text{otherwise} \end{cases}    (2.11)

where rand() is a random number between 0 and 1.

Crossover: The trial vector v^{c,t+1} is generated by performing crossover between the current solution, x^{c,t}, and the mutated solution, u^{c,t+1}, obtained in Eq. (2.9) (or Eq. (2.11) for binary-encoded solutions).

v_j^{c,t+1} = \begin{cases} u_j^{c,t+1}, & \text{if } rand() \leq CR \\ x_j^{c,t}, & \text{otherwise} \end{cases}    (2.12)

where rand() is a random number between 0 and 1, j = 1, 2, \ldots, N, N is the length of the solution, CR is the crossover probability, and x_j^{c,t} and u_j^{c,t+1} are the jth components of x^{c,t} and u^{c,t+1}, respectively. In the case of solutions having a binary representation, the multi-objective binary variant of DE is referred to as MOBDE.
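A NumPy sketch of the real-coded case — DE/rand/1 mutation (Eq. (2.9)) followed by the crossover of Eq. (2.12) — is shown below; F, CR and the toy population are illustrative choices, not the thesis settings.

import numpy as np

def de_trial_vector(pop, c, F=0.5, CR=0.9, seed=0):
    """Generate a trial vector for the current (target) solution pop[c]."""
    rng = np.random.default_rng(seed)
    candidates = [i for i in range(len(pop)) if i != c]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)   # three distinct mating-pool solutions
    mutant = pop[r1] + F * (pop[r2] - pop[r3])                   # Eq. (2.9)
    mask = rng.random(pop.shape[1]) <= CR                        # Eq. (2.12): component-wise choice
    return np.where(mask, mutant, pop[c])

pop = np.random.default_rng(1).random((10, 5))                   # toy population: 10 solutions of length 5
trial = de_trial_vector(pop, c=0)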

2.1.11 Number of Fitness Function Evaluations

Generally, for any evolutionary optimization strategy, the number of fitness function evaluations (NFE) [81] is reported, which is generally expressed as

NFE = a + b    (2.13)

where

a = |P| \times \#\text{Objective Functions Used}    (2.14)

b = t_{max} \times a    (2.15)

where t_{max}, |P| and \#\text{Objective Functions Used} are the maximum number of generations, the number of solutions in the population and the number of objective functions used, respectively. Note that while evaluating NFE, the term a appears separately because the population is initialized with an initial set of solutions before the generations begin.
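As a quick worked example of Eqs. (2.13)–(2.15), with an assumed population of 50 solutions, 2 objective functions and 25 generations:

P, n_obj, t_max = 50, 2, 25
a = P * n_obj            # Eq. (2.14): evaluations for the initial population
b = t_max * a            # Eq. (2.15): evaluations over all generations
print(a + b)             # Eq. (2.13): NFE = 2600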

2.1.12 SOM as a Mating Pool Construction Tool

In any evolutionary algorithm, the quality of the new solutions generated from the old solutions plays a vital role in reaching the globally optimal solutions. The optimal solution for any problem can lie in a local region (nearby to the existing solutions) or in the global optimal region. Therefore,

to capture the first one, a newly designed reproduction operator based on the self-organizing map (SOM) [30] is introduced in the MODE (MOBDE) framework as the mating pool construction tool. SOM is first trained using the current population to discover the localities of solutions (chromosomes), and then a mating pool of fixed size is constructed for each solution using the neighborhood relationships (nearby solutions) extracted by SOM. This set of nearby solutions forms the mating pool, Q, for the current solution; only these solutions can take part in mating to generate a new solution from the current solution. The series of steps to construct the mating pool, Q, for the current solution x_c ∈ P is described in Algorithm 2. Firstly, the winning neuron “b” for the current solution needs to be selected (Line 1). Thereafter, neighboring neurons near “b” and the corresponding mapped solutions ∈ P are extracted one by one to form the mating pool (Line 2), and this continues until the desired size of the mating pool is reached (in Figure 2.13, it is 6). The neighboring (closer) solutions present in the mating pool for the current solution can take part in the reproduction (mutation and crossover) operation to generate a new solution. This phenomenon is illustrated in Figure 2.13. The different parameters used in the algorithm are: P, the population containing solutions (x_1, x_2, ..., x_{|P|}); γ, the threshold probability for selecting a neighboring solution; D, the distance matrix formed using the position vectors of the neurons in the grid; H, the mating pool size; and x_c, the current solution for which the mating pool is generated.

Algorithm 2 Construct_MatingPool(x_c, γ, P, H, D)
1: Find the winning neuron “b” in the SOM architecture corresponding to solution x_c, based on minimum Euclidean distance.
2: Sort the bth row of D in ascending order and store the sorted indices in J. Then build

Q = \begin{cases} \cup_{m=1}^{H} \{x_k\}, & \text{if } rand() < γ \text{ and } m < H \\ P, & \text{otherwise} \end{cases}

where rand() gives a random number lying between 0 and 1, and x_k is the solution ∈ P mapped to neuron k, k ∈ J.
3: return the mating pool Q for solution x_c

Example: Let us assume that we have to generate a new solution for the current solution, x_c. Firstly, a mating pool is required to be constructed. Let the number of neurons in the SOM grid be 9, with index values {0, 1, 2, 3, 4, 5, 6, 7, 8} and position vectors {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)}, respectively. To build the mating pool, firstly the winning neuron corresponding to x_c is determined using the shortest Euclidean distance criterion; let it be the 4th neuron. Secondly, the Euclidean distances between the 4th neuron and the other neurons are calculated using the position vectors of the neurons, which are [1.41, 1, 1.41, 1, 0, 1, 1.41, 1, 1.41]


(with respect to neuron indices {0, 1, 2, 3, 4, 5, 6, 7, 8}). After that, the calculated distances are sorted in ascending order and the corresponding neuron indices are recorded, i.e., after sorting we obtain the list of distances [0, 1, 1, 1, 1, 1.41, 1.41, 1.41, 1.41] with the corresponding neuron index values J = [4, 1, 3, 5, 7, 0, 2, 6, 8]. Consider the mating pool size (H) as 4. Now a random probability “r” is generated. If “r” is less than some threshold probability, γ, then the solutions mapped to the H neurons having indices [1, 3, 5, 0] will form the mating pool. This further helps in exploitation. Note that the first neuron index in the sorted list is excluded, as it represents the winning neuron and the distance of the winning neuron to itself is always zero. If “r” is greater than the threshold probability, γ, then any solution from the population can participate in the mating pool construction. This step helps in the exploration of the search space to find the optimal solution in the global optimal region.

[Figure: the current solution maps to the winning neuron “h” of the SOM grid; solutions mapped to neurons within the neighborhood radius of “h” are the neighboring solutions that form the mating pool, while solutions mapped to neurons outside the radius are excluded.]

Figure 2.13: Mating pool construction for the current solution using SOM.
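A NumPy sketch of Algorithm 2 is given below; for simplicity it assumes one solution mapped to each SOM neuron, and the grid, γ and H values mirror the worked example above without being the thesis implementation.

import numpy as np

def construct_mating_pool(xc, population, weights, positions, H, gamma, rng):
    """Build the mating pool Q for the current solution xc (Algorithm 2)."""
    b = int(np.argmin(np.linalg.norm(weights - xc, axis=1)))        # line 1: winning neuron
    grid_dist = np.linalg.norm(positions - positions[b], axis=1)    # bth row of the distance matrix D
    J = np.argsort(grid_dist)[1:]                                   # line 2: sorted neighbors, winner excluded
    if rng.random() < gamma:
        return [population[k] for k in J[:H]]                       # exploitation: H nearest neighbors
    return list(population)                                         # exploration: whole population

rng = np.random.default_rng(0)
population = rng.random((9, 4))                                     # 9 solutions of length 4 (toy data)
positions = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)  # 3x3 SOM grid
weights = population.copy()                                         # assume one solution per neuron
Q = construct_mating_pool(population[4], population, weights, positions, H=4, gamma=0.9, rng=rng)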
2.2 Literature Survey

In this section, we provide a survey and analysis of the existing approaches for document clustering and for the summarization of single documents, figures, microblogs and multi-modal microblogs. Subsection 2.2.1 gives a brief survey of the works related to document clustering. In subsection 2.2.2, we provide the literature survey on extractive single document summarization. Subsection 2.2.3 describes existing approaches related to figure summarization. In subsection 2.2.4, some existing works on microblog summarization are described. Finally, subsection 2.2.5 presents existing works in the field of multi-modal microblog summarization.

2.2.1 Document Clustering

In [82], an approach for clustering text documents was proposed in which the Particle Swarm Optimization (PSO) [39] technique is used as the optimization technique to improve the quality of the clusters. It makes use of only one objective as the fitness/objective function. The authors of [83] explored noun phrases and semantic relationships based on WordNet (a lexical database for English) [84] for document clustering. Cai et al. [85] proposed an improved version of DBSCAN [47] clustering. Karaa et al. [86] presented a framework for the clustering of biomedical abstracts and utilized the search capability of the genetic algorithm [29] to search for the optimal clustering solution. In [87], a statistical method based on volume-minimization (VolMin)-structured matrix factorization is proposed for text data clustering. Abualigah et al. [88] utilized three meta-heuristic algorithms (genetic algorithm, harmony search and particle swarm optimization) to select informative features for document representation and finally used the K-means [47] clustering algorithm on the obtained features. Note that all these works consider text-document datasets; no work explores multi-objective optimization for document clustering. But there exist some works [1, 28, 89] which show their potential on some numeric datasets. In [28], some symmetry-based automatic multi-objective clustering techniques utilizing the archived multi-objective simulated annealing [38] process as the underlying optimization technique are proposed. In [89, 90], Bandyopadhyay et al. proposed a multi-objective genetic clustering algorithm utilizing NSGA-II [29] as the underlying optimization strategy. Here, the authors simultaneously optimized two internal cluster validity measures, the Fuzzy C-Means objective [91] and the Xie-Beni index [59]. The efficacy of these approaches is shown on gene expression data [92]. In [93], an extension of the same work [89, 90] was shown for partitioning categorical data by simultaneously optimizing fuzzy separation and fuzzy compactness. Handl et al. [1] proposed an automatic multi-objective clustering technique, MOCK, optimizing two objective functions simultaneously. The main limitation of MOCK is that it can determine only hyper-spherically shaped or well-separated clusters and cannot detect overlapping clusters. In addition, the complexity of MOCK increases linearly as the number of data points increases.

Limitations of Existing Approaches: After an extensive literature survey on document clustering, we arrived at the conclusion that:

• All existing document clustering approaches consider fixed number of clusters which may not be known beforehand.

• Only PSO and genetic algorithm have been explored for solving the document clustering


task. But they considered only a single objective function while performing the clustering. In the literature, differential evolution [39, 94] is shown to perform better than PSO. Therefore, the incorporation of DE can help in improving the document clustering task.

• A self-organizing map based genetic operator was never explored in fusion with multi-objective clustering techniques. It helps in generating good-quality solutions by utilizing the neighborhood property and searching near the existing clustering solutions to determine the optimal solution.

2.2.2 Extractive Single Document Summarization (ESDS)

We have divided the related works on single document summarization into four categories: (a) supervised; (b) unsupervised; (c) neural-network based; and (d) meta-heuristic based. Brief descriptions of these methods, along with their drawbacks, are given below:

Supervised methods: The SVM-based method of [95] considers pre-existing document-summary pairs for learning. In [96], the summarization problem is treated as a sequence labeling problem and is solved using Conditional Random Fields (CRF) [97]. In [98], a method named Manifold Ranking was proposed in which a ranking score is assigned to each sentence in the document based on its information richness and diversity; then, only sentences having high ranking scores are selected to generate the final summary. In [99], a regression-based model was proposed using Integer Linear Programming [100], which uses three features to select the candidate summary from the set of available summaries. The main limitation of the methods proposed in these papers is that they make use of labeled data for training (i.e., whether a sentence belongs to the summary or not), which requires manual effort and is also a time-consuming step.

Unsupervised methods: In [101], QCS, a query-based method, was proposed by Dunlavy et al. to generate the summary. It uses a Hidden Markov Model (HMM), which predicts the probability of a sentence being included in the summary. Note that the method developed was a graph-based method which was adopted for the simultaneous summarization of single as well as multiple documents. The main drawback of this approach is that it considers only three features: the sentence position, local salience (for single-document summarization) and global salience (for multi-document summarization) scores of the sentences. Ferreira et al. [102] developed a context-based summarization system and showed that the quality of the generated summary, obtained using different combinations (sums) of sentence scoring functions/features, depends on the type of text


(news, article, blog). Their sentence scoring features include word-based scoring (like term frequency, etc.), graph-based scoring (obtained using the TextRank algorithm [33]) and sentence-based scoring (sentence position, sentence similarity with the title, etc.). The main limitation of the discussed unsupervised methods [101, 102] is that they have not explored features like readability, which is important for the end-user's understanding of the generated summary.

Meta-heuristics based methods: Aliguliyev et al. [55] proposed an optimization-based automatic text summarization method. Here, the sentences in the document are assigned to different clusters, and cluster quality functions are optimized using the differential evolution algorithm; then, in every cluster, sentences are sorted based on some sentence scoring features, and finally, high-ranked sentences are selected as a part of the summary. The author of the published thesis [103] discussed the principal approaches for solving the task of automatic text summarization and showed that meta-heuristic approaches like genetic algorithms, etc., are practical for summarization as they are capable of yielding high-scoring summaries. But our work on summarization is completely different in terms of the meta-heuristic approach used and other concepts. In [104], a fuzzy evolutionary optimization model (FEOM) was developed and applied to extractive summarization. In [105], the method MA-SingleDocSum was proposed by Mendoza et al. using an optimization algorithm called the Memetic algorithm [106]; it makes use of guided local search to solve the summarization problem. In [107], a method named ESDS-GHS-GLO is proposed based on the Global-best Harmony Search meta-heuristic and a greedy local search procedure; it considers extractive single document summarization as a binary optimization problem. Rasim et al. [108] proposed the COSUM method, utilizing clustering and an optimization technique that optimizes the coverage and diversity of the summary simultaneously. The main drawbacks of these meta-heuristic algorithms are their low convergence rates and low ROUGE scores. Moreover, they optimized the sum (in some cases, the weighted sum) of different objective functions, thus converting multiple objective values into a single value.

Neural-network based methods: In [109], a neural network based method, NetSum, was developed, which uses the RankNet [110] algorithm to assign ranks to the sentences in the document and then identify informative sentences. In recent years [6, 7], some deep learning models like recurrent neural networks have been used for solving the single document extractive summarization task. Note that these methods make use of supervised information while training.


2.2.3 Figure-associated Text Summarization

In the literature, only a few works have addressed this task. Passonneau et al. [111] proposed a system that generates summaries of workflow diagrams. The main drawback of this approach is that it requires a list of attribute values describing the diagrams. Futrelle [112] discusses various challenges and related issues. In Futrelle [25], the authors used two sources of information to summarize figures: the structure of the diagram, and the text of the figure's caption together with the text of the article. Agarwal et al. [113] proposed a system, FigSum, to generate summaries of images from the biomedical domain. It assumes that information related to the figures is scattered throughout the various sections of the scientific article, such as the introduction, proposed method, results, and so on. The top-scoring sentences having high tf-idf cosine similarity with the figure's caption and the article's main theme are considered part of the summary. Peng et al. [114] proposed generating summaries of information graphics using the paragraphs of a multi-modal document from the news domain. Bhatia et al. [115] used a supervised approach to generate figure summaries; the authors identified relevant sentences from the article based on their similarity with the figure's caption and with sentences referring to that figure. The FigSum+ method was proposed by Ramesh et al. [20] as an extended version of FigSum [113]. The authors of this paper explored various supervised and unsupervised approaches to generate summaries of biomedical images present in scientific articles; some of the approaches rely on surface-cue words, for example, to identify the paragraphs and sentences referring to the figure. To the best of our knowledge, no further work on the figure-summarization task is reported in the literature.

Limitations of Existing Approaches: After an extensive literature survey on figure-associated text summarization methods, we conclude that the existing approaches have the following drawbacks:

• No algorithm has utilized a language representation model at the semantic level; only syntactic representations are used, which may cause the data sparseness problem.

• Only one or two quality measures or their weighted sum are considered for computing the sentence score.

• An anti-redundancy function is an important measure while generating a summary, but none of the existing techniques has explored textual entailment as an anti-redundancy function.

• None of the methods utilizes the concept of a multi-objective evolutionary algorithm considering simultaneous optimization of various syntactic and semantics-based objective functions to generate a near-optimal summary.

2.2.4 Microblog Summarization

In the literature, a considerable amount of work has been done on tweet summarization. In [116], the problem of summarizing tweets related to sports events was addressed. However, summarizing disaster-related tweets is more important, as such summaries may convey relevant information to the authorities and help them take the desired action. In [117], clustering of tweets is first performed and some representative tweets are selected from each cluster; these tweets are then arranged using the graph-based LexRank [32] algorithm. In [118], abstractive summarization is proposed for online summarization of tweets. Some other techniques for online summarization of tweets are discussed in [117, 119]. Dutta et al. [34] compared various extractive summarization techniques for summarizing disaster-related tweets. These techniques include Cluster-rank [120], LexRank [32], LSA [121], Luhn [122], MEAD [123], SumBasic [124], SumDSDR [125] and COWTS [126]. The COWTS technique uses the content words of the tweets to generate a summary of situational tweets. Situational tweets are those tweets which provide information such as status updates, i.e., the current situation in the region affected by the disaster event. The COWTS work was extended in [127]. In [128], time-aware knowledge is extracted from the tweets for the microblog summarization task. Recently, [21] proposed an ensemble approach for microblog summarization which generates the summary after considering the summaries of various algorithms discussed in [34].

Limitations of Existing Approaches: After an extensive literature survey on microblog summarization methods, we conclude that the existing approaches have the following drawbacks:

• The best recent methods, proposed in 2018, use an ensemble approach, which generates the summary after considering the summaries generated by various algorithms like LexRank, MEAD, Luhn, etc., discussed in [34]. But in real time, applying the ensemble approach to tweet summarization is time-consuming, as the summaries must first be generated by the different algorithms and the final summary then produced from these individual summaries.

• Existing algorithms consider only a single objective function to generate the summary. There exists no MOO-based framework which considers different goodness measures of a summary for simultaneous optimization.

2.2.5 Multi-modal Microblog Summarization

This task is an extension of microblog summarization, in which only the tweet text is considered for summarization. Only a few works consider multimedia tweets (tweet text with images) for summarization [129, 130, 131]. Amato et al. [132] proposed an approach for multimedia summarization of social media content, but the dataset used was from the general domain, containing images of animals, nature and landscapes. In [133], a multi-modal approach was proposed for classifying disaster-related tweet images. The papers by Bian et al. [130, 131] generate visualized summaries from the microblog multimedia content of trending topics; their works were on Chinese tweets and utilized a probabilistic model (LDA) [129].

Limitations of Existing Approaches: After going through the literature on multi-modal microblog summarization methods, we conclude that:

• The existing methods work on multimedia microblog datasets of trending topics, including social trends and product events. But the summarization of disaster-related multimedia tweets is more important, as it may convey relevant information to the authorities and help them take the desired action. Only one method exists, and it develops a classification-based framework to classify disaster-related images. Thus, no dataset is available to handle the multi-modal microblog summarization task.

• For image feature extraction, existing algorithms use different scene-understanding features like SIFT, RGB histogram, GLCM, GIST, Gabor, etc., which provide vectors of varying size whose meaning is difficult to interpret. An image-captioning model can provide the visual features in text format, but none of the existing works utilizes this.

• No evolutionary-based framework has been developed for this task.

2.3 Evaluation Measures

To test the performance of any approach, evaluation measures/metrics are needed. The following are the descriptions of the evaluation metrics used for document clustering and summarization.


2.3.1 Document Clustering

In order to measure the goodness of the obtained partitioning, two internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin (DB) Index [60], are used. Detailed descriptions of the Dunn and DB indices are given in Table 2.1. The maximum value of the Dunn index and the minimum value of the DB index, respectively, imply better clustering results. External cluster validity indices are not utilized, as the datasets used are not available with ground-truth cluster labels.

2.3.2 Summarization

Single/multi-document, microblog, and multi-modal microblog summarization: To evaluate the performance of these systems, we have utilized the ROUGE measure [134]. It measures the overlapping units between the actual/gold summary and our predicted summary. A high ROUGE score indicates that the obtained summary is very close to the actual summary. The mathematical definition of the ROUGE score is given below:

\text{ROUGE-}N = \frac{\sum_{S \in \text{Summary}_{\text{actual}}} \sum_{N\text{-gram} \in S} \text{Count}_{\text{match}}(N\text{-gram})}{\sum_{S \in \text{Summary}_{\text{actual}}} \sum_{N\text{-gram} \in S} \text{Count}(N\text{-gram})} \tag{2.16}

where N represents the length of the n-gram, Count_match(N-gram) is the maximum number of overlapping N-grams between the actual summary and the generated summary, and Count(N-gram) is the total number of N-grams present in the actual summary. In our thesis, we report ROUGE-1 and ROUGE-2 (N = 1 and 2, respectively), along with ROUGE-L, which is based on the longest common subsequence between the two summaries rather than a fixed N.
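As an illustration, the ROUGE-N recall of Eq. (2.16) can be computed from clipped n-gram counts as in the minimal sketch below. This is a simplified, hypothetical implementation for a single reference summary; the scores reported in this thesis are produced with the standard ROUGE toolkit [134].

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams of a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference_tokens, candidate_tokens, n):
    # Eq. (2.16): overlapping n-grams / total n-grams in the actual (reference) summary.
    ref, cand = ngrams(reference_tokens, n), ngrams(candidate_tokens, n)
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())  # clipped match counts
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Toy usage: ROUGE-1 between a gold and a predicted summary
print(rouge_n("the flood hit the city".split(), "flood hit city".split(), 1))  # 0.6
```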

Figure-summarization: For the figure-summarization task, we have reported the precision, recall and F-measure (or F1-score) [20] values, which are well-known measures in information retrieval. Mathematically, the F-measure can be defined as

\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.17}

where Precision is the ratio of the number of correctly identified sentences in the predicted summary to the total number of sentences in the predicted summary, while Recall is the number of sentences correctly identified out of the total number of sentences present in the actual summary. The mathematical definitions of Precision and Recall are given below:

\text{Precision} = \frac{\#\text{Sentences correctly identified}}{\text{Total } \#\text{Sentences identified by the system}} \tag{2.18}


\text{Recall} = \frac{\#\text{Sentences correctly identified}}{\text{Total } \#\text{Sentences in the actual summary}} \tag{2.19}

where #Sentences denotes the number of sentences.
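A minimal sketch of Eqs. (2.17)-(2.19) is given below, assuming (for illustration only) that the predicted and actual summaries are represented as sets of sentence identifiers.

```python
def precision_recall_f1(predicted, actual):
    # predicted, actual: sets of sentence identifiers in the system and gold summaries.
    correct = len(predicted & actual)                                   # correctly identified sentences
    precision = correct / len(predicted) if predicted else 0.0          # Eq. (2.18)
    recall = correct / len(actual) if actual else 0.0                   # Eq. (2.19)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0  # Eq. (2.17)
    return precision, recall, f1

print(precision_recall_f1({1, 4, 7}, {1, 2, 4, 5}))  # (0.666..., 0.5, 0.571...)
```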

2.4 Chapter Summary

In the first part of this chapter, we discussed the preliminaries of some topics that form the primary basis of the remaining chapters of the thesis. In the second part, existing approaches to document clustering, single-document summarization, figure-associated text summarization, microblog summarization and multi-modal microblog summarization were described. The third part discussed the evaluation measures for the different tasks. The next chapter is the first contributory chapter, in which we propose a novel multi-objective clustering technique. As an application, we have chosen the task of scientific/web document clustering from the domain of text mining.

CHAPTER 3

Automatic Document Clustering: Fusion of MODE and SOM

In this chapter, we propose a bio-inspired multi-objective automatic document clustering technique that is a fusion of the self-organizing map (SOM) and a multi-objective differential evolution approach. A variable number of cluster centres is encoded in the different solutions of the population to determine the number of clusters in a data set in an automated way. The concept of SOM is utilized in designing a new genetic operator for the proposed clustering technique. To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The results obtained clearly show that our approach is better than existing approaches. The validity of the obtained results is also established using statistical significance t-tests.


3.1 Introduction

3.1.1 Overview

Document clustering [4] refers to the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion, so that each document in a group is similar to the other documents in the same group. For clustering, the value of K may or may not be known a priori. To determine the value of K for a collection of documents, traditional clustering approaches [59] like K-means [47], K-medoid, bisecting K-means [135], and hierarchical clustering techniques [47] have to be executed multiple times with various values of K. Then, the qualities of the different partitionings are measured with respect to several cluster validity indices, which quantify the goodness of a partitioning. Finally, the partitioning which corresponds to the optimal value of a cluster validity index is selected as the final partitioning. The Davies-Bouldin (DB) index [60], Silhouette index (SI) [58, 136], Xie-Beni (XB) index [28], and Pakhira-Bandyopadhyay-Maulik (PBM) index [61], among others, are some popularly used cluster validity indices. The existing traditional clustering techniques implicitly optimize an internal evaluation function or objective function. These objective functions in general measure the compactness of clusters [137], the spatial separation between clusters [137], the connectivity between clusters [138], and the density or cluster symmetry [28]. However, in real life, all these properties cannot be captured using a single objective function. Also, for a given data set possessing clusters of different geometrical shapes (hyper-spherical, convex, etc.), the use of a single objective function measuring cluster quality may not be suitable for determining all types of clusters. The application of a multi-objective optimization (MOO) technique [38, 37] optimizing different cluster validity indices appears to be an alternative and promising direction in clustering research in recent years. This has motivated researchers to develop multi-objective clustering algorithms [57, 58, 139]. Also, determining the appropriate number of clusters in a given data set in an unsupervised way is another important consideration; simultaneous optimization of multiple cluster validity indices can also address this issue. The current work proposes a bio-inspired multi-objective clustering framework for automatically partitioning a given collection of scientific documents, exploiting syntactic and semantic information to identify possible subtopics. To measure the quality of a partitioning, different internal cluster validity measures are used. The values of these cluster validity indices are simultaneously optimized using the search capability of MODE. The proposed clustering approach is automatic in nature, as it can determine the number of clusters present in a dataset automatically. Centre-based encoding is used in the

current approach, where a set of cluster centres is coded in the form of a chromosome/solution. The number of cluster centres present in different chromosomes varies over a range. A new genetic operator utilizing the neighbourhood information extracted using SOM is incorporated in the proposed approach. Two data sets containing scientific articles of varying complexities and a data set containing a variety of web documents are chosen for the evaluation of the proposed clustering technique. In order to represent the articles/documents in the form of vectors, different representation schemes like tf [41], tf-idf [41] and word embeddings (word2vec and GloVe) [43, 44, 51] are exploited. Like any MOO-based approach, our proposed clustering approach also generates a set of solutions on the final Pareto optimal front. A single solution can be selected by the user, depending on the requirement. In the current study, a single best solution is selected using internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin index [60]. The obtained partitioning results are compared with those obtained by some existing state-of-the-art clustering techniques with respect to different performance measures.

3.1.2 Key-contributions

The key contributions of this chapter are summarized below:

1. A document clustering approach, namely SMODoc clust, is proposed, which is a fusion of a self-organizing map and multi-objective differential evolution.

2. The proposed approach using variable length chromosomes is capable of automatically detecting the number of clusters from any given data set.

3. In the proposed framework, two cluster validity indices, the PBM index [61] and the Silhouette index [58, 136] are simultaneously optimized for the automatic determination of the appropriate number of clusters and also to improve the quality of clusters.

4. A new type of genetic operator is proposed in the framework of MODE. The mating pool constructed for the crossover and mutation operation given a solution only contains the neighbouring solutions identified by SOM. For the training of SOM, the solutions of the current population are utilized. The constructed mating pool takes part in generating some new solutions.

5. The results of the proposed technique are shown for clustering two document data sets containing scientific articles of varying complexities and a document data set containing some web documents. The experimental results prove that the proposed clustering technique performs well for document clustering.


Figure 3.1: Flow chart of proposed algorithm for automatic multi-objective document clustering.

Here, P: population containing solutions; |P|: size of the population; w_i: weight vector of the ith neuron; g_max: maximum number of generations; A: archive (copy of population P); Q: mating pool; S: training data for SOM.

3.2 Proposed Methodology

The flow-chart of the proposed multi-objective document clustering (SMODoc clust) technique is shown in Fig. 3.1. Several new concepts are incorporated in the framework of the proposed clustering technique. The basic operations of SMODoc clust are described below.

3.2.1 Solution Representation and Population Initialization:

In SMODoc clust, solutions encode a set of different cluster centers. As the proposed algorithm attempts to determine the optimal set of cluster centers that can partition the document dataset appropriately, the number of cluster centers encoded in different solutions varies over a range. The number of clusters is varied between 2 and √N, where N is the total number of points (documents). To generate the ith solution, a random number K_i is selected between two values, i.e., K_min = 2 and K_max = √N, and these K_i initial cluster centers are chosen randomly from the dataset. As these solutions take part in SOM training to learn the distribution pattern of the population, the lengths of the input vectors (solutions) and the weight vectors of the neurons are kept equal. Therefore, variable-length solutions are converted to fixed-length vectors by appending zeros at the end. If F indicates the number of features in the dataset, then


the length of a solution can be written as (K × F + l), where K is the number of clusters encoded in the solution and l is the number of appended zeros, which lies between 0 and (K_max × F − 2 × F). Here we have subtracted 2 × F because there must exist at least two clusters in the dataset. In terms of data points, the maximum length of a solution can be √N × F. This set of solutions with varying numbers of clusters forms the initial population. In order to obtain a partitioning corresponding to a solution in the population, the steps of the K-means clustering technique [47] (discussed in Section 2.1.1 of Chapter 2) are executed on the whole data set, considering the cluster centers encoded in the solution as the initial cluster centers. The population (P) initialization step is shown in Figure 3.2, and an example of solution encoding is given below.

Figure 3.2: Steps of population initialization.

Example: Let K = 3, F = 2, N = 16, and let the three centers be C1 = (2.3, 1.4), C2 = (7.6, 12.9) and C3 = (2.1, 3.4). Here the maximum length of a solution is √N × F = 4 × 2 = 8. The solution will then be represented as {(2.3, 1.4, 7.6, 12.9, 2.1, 3.4, 0.0, 0.0)}, which encodes three cluster centers, with l = 2.
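The population initialization described above can be pictured with the following minimal sketch. It is an illustrative reconstruction only, assuming documents are already available as an (N, F) feature matrix; the function name and defaults are hypothetical.

```python
import numpy as np

def initialize_solution(data, k_max):
    # data: (N, F) document vectors. A solution encodes K random cluster centers,
    # zero-padded to the fixed maximum length K_max * F.
    n, f = data.shape
    k = np.random.randint(2, k_max + 1)                      # K_i chosen between K_min = 2 and K_max
    centers = data[np.random.choice(n, k, replace=False)]    # K_i centers picked from the data points
    solution = np.zeros(k_max * f)
    solution[:k * f] = centers.ravel()                       # zeros appended after the encoded centers
    return solution, k

data = np.random.rand(16, 2)               # toy data: N = 16 documents, F = 2 features
k_max = int(np.sqrt(len(data)))            # K_max = sqrt(N) = 4, so maximum length = 8
sol, k = initialize_solution(data, k_max)
print(k, sol)                              # e.g. 3 encoded centers followed by 2 padded zeros
```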

3.2.2 SOM Training

To learn the distribution pattern of the population and to find the neighborhood relationship among the solutions, SOM is utilized in our approach. It is trained using the solutions in the population. Although the lengths of the different solutions are made equal by padding zeros (the number of padded zeros lies between 0 and (K_max × F − 2 × F)), during the Euclidean distance calculation between an input solution and a neuron's weight vector, only the minimum number of features available in both vectors is considered. Example: Let F = 2 and the maximum length of a solution be 8 for N = 16. Consider a first vector {(m, n, q, p, 0, 0, 0, 0)} having K1 = 2 and a second vector {(w, x, y, z, a, b, 0, 0)} having K2 = 3. Then, during distance calculation or weight updating, only {min(K1, K2) × F} features are considered and the other features are ignored.
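The truncated distance used during SOM training can be sketched as below (an illustrative helper, assuming F and the cluster counts K1, K2 of both vectors are known; only the first min(K1, K2) × F positions enter the Euclidean distance).

```python
import numpy as np

def truncated_distance(sol_a, k_a, sol_b, k_b, f):
    # Only the features shared by both variable-length solutions
    # (min(K1, K2) * F positions) contribute to the distance.
    nz = min(k_a, k_b) * f
    return np.linalg.norm(sol_a[:nz] - sol_b[:nz])

a = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0])   # K1 = 2, F = 2
b = np.array([1.5, 2.5, 2.0, 5.0, 0.5, 0.5, 0.0, 0.0])   # K2 = 3, F = 2
print(truncated_distance(a, 2, b, 3, 2))                  # uses only the first 4 positions
```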

3.2.3 Objective Functions Used

In order to measure the goodness of the partitioning encoded in a solution, two internal cluster validity indices, Pakhira-Bandyopadhyay-Maulik (PBM) index [61] and Silhouette index (SI) [58, 136] are calculated and those are used as the objective functions of the current solution.


The mathematical definitions of these indices are given in Table 2.1. Note that these two objective functions measure the separation and compactness of the partitionings in two different ways. The superiority of the PBM index over other cluster validity indices, namely the Dunn index [59], Davies-Bouldin index [60] and Xie-Beni index [28], in determining the appropriateness of clusters is established in [61]. In [140], the Silhouette index was compared with 29 other cluster validity measures (excluding the PBM index), namely the Davies-Bouldin index [60], Gamma index, C index, Dunn index [59], Xie-Beni index [28], etc., and it was found that the Silhouette index achieved the highest success rate. Inspired by this existing literature, the PBM index and Silhouette index are incorporated in our proposed framework as the objective functions.
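For a given partitioning, the two objectives can be evaluated roughly as follows. This is only a sketch: the Silhouette index is taken from scikit-learn, and the PBM index is written out from its usual definition; both should be checked against the exact formulations in Table 2.1.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def pbm_index(data, labels, centers):
    # PBM = ((1/K) * (E1 / EK) * DK)^2, where E1 is the scatter w.r.t. the global centroid,
    # EK the within-cluster scatter, and DK the largest distance between cluster centers.
    k = len(centers)
    e1 = np.linalg.norm(data - data.mean(axis=0), axis=1).sum()
    ek = sum(np.linalg.norm(data[labels == c] - centers[c], axis=1).sum() for c in range(k))
    dk = max(np.linalg.norm(ci - cj) for i, ci in enumerate(centers) for cj in centers[i + 1:])
    return ((e1 / (k * ek)) * dk) ** 2

def objectives(data, labels, centers):
    # Both objective values are maximized simultaneously by the MOO search.
    return pbm_index(data, labels, centers), silhouette_score(data, labels)
```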

Algorithm 3: y = Generate(Q, CR, MP, x_current, K_xcurrent)

1: Randomly select two solutions x1 and x2 as parent solutions from the mating pool Q of the current solution x_current such that x1 ≠ x2 ≠ x_current.

2: Generate a trial solution y′, for i = 1, ..., nz:

if rand() ≤ CR, then y′_i = x_current_i + F1 × (x1_i − x2_i); otherwise y′_i = x_current_i.

Here nz = K_xcurrent × F, and only the nz feature values of x_current, together with the corresponding nz values of x1 and x2, are considered during computation, while the remaining values from (nz + 1) to n (the length of the solution) are kept unchanged.

3: Repair the trial solution using the lower (x_L) and upper (x_U) boundaries of the population to generate y″, for i = 1, 2, ..., nz:

if y′_i < x_{L_i}, then y″_i = x_{L_i}; else if y′_i > x_{U_i}, then y″_i = x_{U_i}; otherwise y″_i = y′_i.

4: The repaired trial solution is now mutated to generate y, for i = 1, 2, ..., nz:

(i) if 0 ≤ MP < 0.6 (normal mutation):

if rand() ≤ p_m, then y_i = y″_i + Δ_i × (x_{U_i} − x_{L_i}); otherwise y_i = y″_i,

where r_1 = rand() is a random number generated between 0 and 1, and

\Delta_i = \begin{cases} \left[2r_1 + (1-2r_1)\left(\dfrac{x_{U_i}-y''_i}{x_{U_i}-x_{L_i}}\right)^{\eta_m+1}\right]^{\frac{1}{\eta_m+1}} - 1, & \text{if } r_1 < 0.5,\\[6pt] 1 - \left[2(1-r_1) + (2r_1-1)\left(\dfrac{y''_i-x_{L_i}}{x_{U_i}-x_{L_i}}\right)^{\eta_m+1}\right]^{\frac{1}{\eta_m+1}}, & \text{otherwise.} \end{cases}

(ii) if 0.6 ≤ MP < 0.8 (insert mutation): a random input pattern is picked from the dataset and appended to solution y starting from position (nz + 1).

(iii) if 0.8 ≤ MP ≤ 1.0 (delete mutation): the last cluster center is selected and deleted from solution y.

5: return solution y
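A condensed sketch of steps 2-4 of Algorithm 3 is given below. It is illustrative only, implements the DE crossover, boundary repair and the normal-mutation branch, and assumes the bookkeeping of K_xcurrent and F described above; parameter defaults are hypothetical.

```python
import numpy as np

def generate_offspring(x_cur, k_cur, x1, x2, x_low, x_up, f, cr=0.8, f1=0.8, pm=0.1, eta_m=20):
    nz = k_cur * f                                   # only the encoded (non-padded) positions are perturbed
    y = x_cur.copy()
    # Step 2: DE crossover producing the trial solution y'
    mask = np.random.rand(nz) <= cr
    y[:nz][mask] = x_cur[:nz][mask] + f1 * (x1[:nz][mask] - x2[:nz][mask])
    # Step 3: repair against the population's lower/upper boundaries
    y[:nz] = np.clip(y[:nz], x_low[:nz], x_up[:nz])
    # Step 4(i): polynomial ("normal") mutation on each encoded position
    for i in range(nz):
        if np.random.rand() <= pm:
            r1 = np.random.rand()
            span = max(x_up[i] - x_low[i], 1e-12)    # guard against a degenerate boundary
            if r1 < 0.5:
                d = (2 * r1 + (1 - 2 * r1) * ((x_up[i] - y[i]) / span) ** (eta_m + 1)) ** (1 / (eta_m + 1)) - 1
            else:
                d = 1 - (2 * (1 - r1) + (2 * r1 - 1) * ((y[i] - x_low[i]) / span) ** (eta_m + 1)) ** (1 / (eta_m + 1))
            y[i] = np.clip(y[i] + d * span, x_low[i], x_up[i])
    return y
```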


Figure 3.3: Generation of trial solution.

3.2.4 Extracting Closer Solutions using Neighborhood Relationship of SOM

This step is equivalent to the mating pool construction of the MODE framework. The nearby solutions for the current solution are identified using the neighborhood relationship (NR) of SOM, which is trained using the solutions in the population. This set of nearby solutions forms the mating pool, Q, for the current solution. Only these solutions can take part in mating to generate a new solution from the current solution. The series of steps to construct the mating pool, Q, for x_current ∈ P is described in Algorithm 2 of Chapter 2.
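The mating-pool idea can be pictured with the schematic sketch below. This is not the exact procedure of Algorithm 2 of Chapter 2; the hypothetical helper simply maps each solution to its best-matching SOM neuron and ranks the other solutions by how close their neurons sit on the map grid.

```python
import numpy as np

def mating_pool(population, neuron_weights, grid_pos, cur_idx, h=5):
    # population: (P, L) solutions; neuron_weights: (M, L) trained SOM weights;
    # grid_pos: (M, 2) coordinates of the neurons on the SOM grid.
    bmu = lambda x: np.argmin(np.linalg.norm(neuron_weights - x, axis=1))   # best-matching unit
    bmus = np.array([bmu(x) for x in population])
    # Rank the other solutions by grid distance between their BMU and the current solution's BMU.
    grid_dist = np.linalg.norm(grid_pos[bmus] - grid_pos[bmus[cur_idx]], axis=1)
    order = [i for i in np.argsort(grid_dist) if i != cur_idx]
    return order[:h]        # indices of the H nearest solutions: the mating pool Q
```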

3.2.5 Offspring Reproduction (New Solution Generation)

In the previous step, the mating pool was constructed; its members take part in crossover and mutation operations to generate a new solution. The detailed procedure for generating the new solution is given in Algorithm 3. First, the crossover operator of differential evolution (DE) [36, 141] is used to generate the trial solution (Line 2), and then a repair mechanism is adopted to ensure the feasibility of the generated solution (Line 3). The lower and upper boundaries of the solutions present in the population are utilized in converting a solution into a feasible one. Finally, the mutation operation is applied to that solution (Line 4). Some modifications are incorporated in the MODE algorithm. Firstly, during trial solution generation y′, only {K_xcurrent × F} feature values of the current solution are considered for the computation while the others are treated as zero, where K_xcurrent is the number of clusters of the current solution and F is the number of features in the data set. The trial solution generation process is shown in Figure 3.3. Secondly, instead of a single mutation operator, three types of mutation operations are used: normal mutation (here polynomial mutation [142] is used as normal mutation), insert mutation


and delete mutation. The polynomial mutation operator is used to generate a highly disruptive mutated vector in order to explore the search space in any direction. This further assists in converging towards an optimal set of cluster centers. The use of different types of mutation operators aids in locating the appropriate number of clusters and the appropriate partitioning efficiently. Any of these mutation operations can be selected based on the probability MP, which is generated with a uniform distribution over the range [0, 1], similar to Ref. [28]. If MP < 0.6 then normal mutation is selected, else if

0.6 ≤ MP < 0.8 then insert mutation is adopted, else delete mutation is applied. Details about these mutation operations are given in Line 4 of Algorithm 3, and examples of these different types of mutation operations are shown in Figure 3.4.


Figure 3.4: Generation of new solution. Here, rand() is a function which generates a random number between 0 and 1.

It should be noted that in the case of (a) normal mutation, the number of clusters for the new solution y remains the same as K_xcurrent, i.e., K_y = K_xcurrent; (b) insert mutation: the number of clusters for the new solution increases by 1, i.e., K_y = K_xcurrent + 1; (c) delete mutation: the number of clusters for the new solution decreases by 1, i.e., K_y = K_xcurrent − 1. After generating the new solution, the following additional steps are applied to obtain the final solution.

1. The steps of the K-means clustering algorithm are applied to the new solution generated using Algorithm 3. The centers present in the new solution are considered as the initial set of cluster centers before the application of K-means.

2. Cluster centers obtained after execution of the K-means algorithm are encoded into the new solution. Next, PBM and SI index values are calculated as the objective functions.

The following symbols are used in the algorithm: (a) F1 and CR (crossover probability), which are control parameters of DE; the ranges of F1 and CR are [0, 2] and [0, 1], respectively. (b) p_m is the normal mutation probability for each component of a solution; MP is the current solution (x_current)'s mutation probability and decides the type of mutation to be performed; η_m denotes the distribution index of polynomial mutation. Note that the higher the distribution index, the more diverse is the generated solution.

Example: Let F = 2, x_current = {x11, x12, x13, x14, x15, x16, 0, 0}, K_xcurrent = 3, and let Q (the mating pool) consist of three solutions: {x21, x22, x23, x24, x25, x26, 0, 0}, {x31, x32, x33, x34, x35, x36, x37, x38} and {x41, x42, x43, x44, 0, 0, 0, 0}. Then, at the time of generating a trial solution y′ (Step 2), only K_xcurrent × F = 3 × 2 = 6 features of all the solutions are considered, as the current solution has only 6 non-zero features; the remaining features are treated as zero, as shown in Figure 3.3. To make the solution feasible, the trial solution undergoes repairing using the lower and upper boundaries of the population, and then mutation is applied based on a random probability MP, as shown in Figure 3.4.

3.2.6 Selection Operation

In Section 3.2.5, after generating an offspring (new solution) for each solution in the population P, a new population P′ is formed. This is then merged with the old population P. As |P| = |P′|, the size of the merged population is 2 × |P|. For the next generation, only the best |P| solutions (in terms of diversity and convergence [29]) of the merged population are retained, while the rest of the solutions are discarded. This operation is performed using the non-dominated sorting and crowding distance procedures of the Non-dominated Sorting Genetic Algorithm (NSGA-II) [29].

1. Non-dominated sorting algorithm: It sorts the solutions based on the concepts of domination and non-domination in the objective function space and ranks the solutions. It divides the solutions into k fronts, F = {Front_1, Front_2, ..., Front_k}, such that Front_1 contains the highest-ranked solutions and Front_k contains the lowest-ranked solutions. Each front contains a set of non-dominated solutions. For example, in Fig. 3.5, solutions are ranked as shown in the Pareto-optimal front (or surface). After this step, the top-ranked solutions are selected and added to the population for the next generation. This process continues until the number of solutions added equals |P|. If the number of solutions to be added exceeds |P|, then the crowding distance algorithm is applied to select the required number of solutions.

2. Crowding distance algorithm: The crowding distance CD_i of the ith solution in Front_k is computed as follows:

(a) For i = 1, 2, ..., |Front_k|, initialize CD_i = 0.

(b) For each objective function f_m, m = 1, 2, ..., M, do the following:

i. Sort the set Front_k according to f_m in ascending order.

ii. Set CD_1 = CD_{|Front_k|} = ∞.

iii. For j = 2 to (|Front_k| − 1), set

CD_j = CD_j + (f_m(j+1) − f_m(j−1)) / (f_m^max − f_m^min),

where f_m^max and f_m^min are the maximum and minimum values of the mth objective function, respectively, and M is the total number of objective functions.
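The crowding-distance computation can be written compactly as follows (a sketch for a single front, assuming the objective values of its solutions are stored row-wise in a matrix).

```python
import numpy as np

def crowding_distance(front_objs):
    # front_objs: (n, M) objective values of the solutions in one front.
    n, m = front_objs.shape
    cd = np.zeros(n)
    for j in range(m):
        order = np.argsort(front_objs[:, j])          # sort the front by objective f_m
        cd[order[0]] = cd[order[-1]] = np.inf         # boundary solutions get infinite distance
        f_min, f_max = front_objs[order[0], j], front_objs[order[-1], j]
        if f_max > f_min:
            for pos in range(1, n - 1):
                cd[order[pos]] += (front_objs[order[pos + 1], j] -
                                   front_objs[order[pos - 1], j]) / (f_max - f_min)
    return cd

print(crowding_distance(np.array([[3.0, 4.5], [4.0, 2.5]])))   # a two-solution front: both boundary (inf)
```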

Example: Let |P| = 3 and let the two objective function values be (1, 2), (4, 2.5) and (3, 4.5) for solutions e, d and c, respectively. After generating 3 new solutions f, a and b, let their objective function values be (2, 1), (5, 5) and (6, 4), respectively. Suppose both objective functions are to be maximized. After merging, the total number of solutions becomes 6, and 3 solutions have to be selected for the next generation. First, these solutions are ranked based on the dominance and non-dominance concept. Thus, the ranked solutions are {(5, 5), (6, 4)} for rank-1, {(3, 4.5), (4, 2.5)} for rank-2, and {(1, 2), (2, 1)} for rank-3. As rank-1 includes two solutions, they are propagated to the next generation. Out of the rank-2 solutions, (3 − 2) = 1 solution needs to be included in the next generation; therefore, the crowding distance operator is applied to the rank-2 solutions and the solution with the highest crowding distance is selected.


Figure 3.5: Ranking of solutions.

Rank 1: solutions a and b are non-dominating with respect to each other because, in terms of objective f1, solution a is better, while in terms of f2, solution b is better. Rank 2: solutions c and d are non-dominating but are dominated by at least one Rank-1 solution; for example, solution c is dominated by solution a because a is better in terms of both f1 and f2. Rank 3: solutions e and f are non-dominating but are dominated by at least one Rank-1 or Rank-2 solution; for example, solution e is dominated by solutions c and a.

3.2.7 Termination Condition

The process of generating new solutions and then selecting the best |P| solutions for the next generation continues until a maximum number of generations, g_max, is reached. The final Pareto optimal set contains a set of optimal solutions.

3.2.8 Selection of a Single Solution based on User Requirement

Any multi-objective algorithm produces a large number of equally important (non-dominated) solutions on the final Pareto optimal front. All these solutions represent different ways of clustering the given data set. But sometimes the decision-maker wants to select only a single solution based on a requirement, or to report the performance of the algorithm. Therefore, in this work, to select a single solution from the Pareto optimal front, we have used some internal cluster validity indices. Two experiments are conducted. In the first experiment, the Dunn Index (DI) [59] is used to select the single solution from the final Pareto front. The definition of the Dunn Index implies that a higher value indicates better partitioning; thus, we calculate the DI values for all the partitioning solutions present on the final Pareto front and report the solution having the highest DI value. In the second experiment, the Davies-Bouldin index (DB) [60] is utilized for selecting a single solution. The DB-index value should be minimized to obtain the optimal partitioning; thus, we report the solution which corresponds to the minimum value of the DB-index. The selection of the best solution is shown in step 13 of Figure 3.1. This step is different from step 10, in which, after merging the old population P and the new population P′, only those solutions are selected for the next generation which are non-dominated with respect to each other and are well distributed over the different fronts.

3.3 Experimental Setup

This section presents the datasets, evaluation measures and comparative state-of-the-art techniques. In addition, it discusses the various preprocessing steps applied and the different representation schemes used to convert a document into vector form, followed by the parameter settings. The results reported in this section are average values over 20 runs. All the approaches were implemented on an Intel Core i7 CPU at 3.60 GHz with 4 GB of RAM, running Ubuntu.

3.3.1 Datasets

In order to show the efficacy of the proposed clustering technique over the existing algorithms, we have used two types of datasets: scientific articles and web documents. Detailed descriptions of the data sets used in the current study are given below:

NIPS 2015: This data set is taken from the Kaggle site1. It contains 403 articles published at the Neural Information Processing Systems (NIPS) conference, which is an important core-ranked conference in the machine learning domain. Its topics range from deep learning and computer vision to cognitive science and reinforcement learning. The dataset includes the paper id, title of the paper, event type (poster/oral/spotlight presentation), name of the pdf file, abstract and paper text, of which only the title, abstract and paper text are used in our experimentation.

1https://www.kaggle.com/benhamner/exploring-the-nips-2015-papers/data



Figure 3.6: Word clouds of the (a) NIPS 2015, (b) AAAI 2013 and (c) WebKB datasets.

Here, most of the articles are related to machine learning and natural language processing. The corresponding word cloud is shown in Figure 3.6(a).

AAAI 2013: This data set is taken from the UCI repository [143] and contains 150 accepted articles from another core-ranked conference in the AI domain, namely AAAI 2013. Each paper has the following information: title of the paper, topics (author-selected low-level keywords from a conference-provided list), keywords (author-generated keywords), abstract and high-level keywords (author-selected high-level keywords from a conference-provided list). Most of the articles are related to artificial intelligence (e.g., multi-agent systems, reasoning) and machine learning (e.g., data mining, knowledge discovery). The corresponding word cloud is shown in Figure 3.6(b).

WebKB: In order to show the potential of our approach, we have also used an out-of-domain dataset, WebKB, in which the documents are web pages rather than scientific articles. The WebKB [144] data set consists of web pages collected from the computer science departments of four universities: Texas, Cornell, Wisconsin and Washington. In our thesis, we have used a total of 2803 documents out of the 4199 documents. The corresponding word cloud is shown in Figure 3.6(c).

3.3.2 Evaluation Measures

In order to measure the goodness of the obtained partitioning, two internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin (DB) Index [60], are used, as discussed in Section 3.2.8. Detailed descriptions of the Dunn and DB indices are given in Table 2.1.

3.3.3 Comparative Approaches

In order to illustrate the efficacy of the proposed clustering technique, SMODoc clust, the results are compared with several existing clustering techniques of different complexity levels. The approaches selected for comparison are traditional clustering techniques like K-means [47] and single-linkage [47], SOGA (Single Objective Genetic Algorithm) based clustering [145], and MOO-based clustering approaches, namely MODoc clust without the SOM-based reproduction operator, MOCK [1], the AMOSA-based multi-objective clustering technique VAMOSA [28], and the NSGA-II based multi-objective clustering technique [89]. K-means and single-linkage are simple and well-known clustering algorithms with limited computational complexity, and they assume that the number of clusters present in a data set is known beforehand. Note that our proposed clustering technique is automatic in nature; it determines the number of clusters automatically from a given data set. For the K-means and single-linkage clustering algorithms, the number of clusters is fixed to K, where K is the optimal number of clusters determined by the proposed approach, SMODoc clust. Detailed descriptions of K-means and single-linkage are provided in Section 2.1.1 of Chapter 2. The rest of the algorithms are described below:

MODoc clust: MODoc clust, a multi-objective evolutionary algorithm for document clustering, is developed similarly to our proposed clustering approach but without utilizing the SOM-based genetic operators. It is also able to detect the appropriate number of clusters automatically from a given data set and optimizes the PBM [61] and Silhouette [58] indices simultaneously. Normal DE-based genetic operators are used during the clustering process. It is developed to show the effectiveness of our newly designed genetic operators utilizing SOM-based neighborhood information.

MOCK: MOCK [1] is a multi-objective clustering algorithm with automatic K-determination; it optimizes two objective functions (compactness and connectedness) simultaneously, where K is the number of clusters. Note that here we have executed MOCK with those document representations for which our proposed approach attains good results.

VAMOSA: VAMOSA [28] is a multi-objective clustering technique which optimizes cluster quality by utilizing two cluster validity indices as the objective functions, namely the PBM index and the Xie-Beni index. It is also able to determine the number of clusters, K, in an automated manner; here, K lies between [2, √N], where N is the number of data points. It uses AMOSA [38] as the underlying optimization technique, which was inspired by the annealing behavior of metals. In the original VAMOSA, a point-symmetry-based distance was utilized for assigning data samples to different clusters. As the computation of the point-symmetry-based distance is time-consuming, and also to make a fair comparison with the other approaches used in the current study, we have used the Euclidean distance in VAMOSA for distance computation.

NSGA-II-Clust: NSGA-II-Clust [89, 146] is a multi-objective clustering technique, similar to VAMOSA [28], which optimizes the PBM index and Silhouette index simultaneously to determine clusters of good quality in an automated way. It is also capable of determining the number of clusters, K, without human participation; the value of K varies between [2, √N], where N is the number of data points. It uses NSGA-II [29] as the underlying optimization strategy. In [89], this algorithm was successfully applied to solve image segmentation problems.

SOGA: SOGA [145] is a single-objective clustering technique utilizing the search capabilities of a genetic algorithm (GA). The GA is utilized to optimize a single cluster validity index. In our experiments, SOGA-based clustering was executed multiple times with the number of clusters varying between 2 and √N, where N is the number of articles/documents. The final partitioning is selected based on the maximum value of the Dunn index as well as the minimum value of the Davies-Bouldin index.

3.3.4 Preprocessing

In order to clean the text data corresponding to these scientific articles and web documents, we have executed several preprocessing steps, including stop-word removal2 (e.g., is, am, are, etc.), removal of special characters (like @, !, etc.), punctuation symbols, numbers and white spaces, removal of words having length less than three, lower-case conversion (like Computer → computer) and stemming3 [148]. Stemming [148] is the process of converting inflected words into their morphological base forms, called word stems, base or root forms. The reason for performing stemming is to group together the inflected forms of a word so that they can be analyzed as a

2We have used the Python nltk toolkit [147] to remove the stop words, which are 153 in number. 3Here the SnowballStemmer [147] of nltk is used.

single item, which can help in the clustering of documents. In addition to these preprocessing steps, words which appear in fewer than 5% or in more than 95% of the articles are removed. Moreover, for the NIPS dataset, we have considered the title, abstract and paper text as the attributes of the given papers; for that purpose, the topmost 5, 30 and 150 words are selected from the title, abstract and paper text, respectively, which makes the vocabulary size 183. In the case of the AAAI 2013 data set, all the attributes are used, which makes the vocabulary size 673. For the WebKB dataset, preprocessed text documents are already available in [144], with a total vocabulary of size 7229.
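The preprocessing pipeline described above can be sketched with nltk roughly as follows (a simplified, illustrative version; it assumes the nltk stopwords corpus has been downloaded, and the frequency-based vocabulary filtering is omitted).

```python
import re
from nltk.corpus import stopwords      # requires the nltk 'stopwords' corpus
from nltk.stem import SnowballStemmer

STOP = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def preprocess(text):
    text = text.lower()                                   # lower-case conversion (Computer -> computer)
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop special characters, punctuation and numbers
    tokens = [t for t in text.split()
              if len(t) >= 3 and t not in STOP]           # remove short words and stop words
    return [STEMMER.stem(t) for t in tokens]              # stemming to the root form

print(preprocess("Clustering 403 NIPS papers is interesting!"))
# e.g. ['cluster', 'nip', 'paper', 'interest']
```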

Maximum number of generations (g_max): 50
Population size (|P|): 50
Initial learning rate (η0): 0.1
Initial neighborhood size (σ0): 2
Number of training iterations in SOM: |P|
Mating pool size (H): 5
DE control parameters (F1 and CR): 0.8, 0.8
Normal mutation probability (MP): [0, 0.6)
Insertion mutation probability (MP): [0.6, 0.8)
Deletion mutation probability (MP): [0.8, 1.0]

Table 3.1: Parameter setting for our proposed approach

3.3.5 Representation Schemas Used

To represent the scientific/web articles in vector form, tf (bag-of-words model using 1-grams) [41], tf-idf [41] and the popular word2vec [43, 51, 52] and GloVe [44] representation schemes, both with varying dimensions of 50, 100, 200 and 300, are used in the current study. More details about these representations are provided in Section 2.1.2 of Chapter 2. Note that in the case of word2vec/GloVe, the article/document vector is obtained by averaging the vector representations of all the vocabulary words present in the article. For the GloVe representation, we have utilized the pre-trained model available at https://github.com/stanfordnlp/GloVe, while for word2vec, we have used the gensim4 tool in Python to generate word vectors of varying dimensions.
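The averaging of word vectors into document vectors can be sketched with gensim roughly as below. This is illustrative only; the parameter name vector_size follows gensim 4 (older versions use size), and the toy corpus stands in for the preprocessed articles.

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_docs: list of token lists produced by the preprocessing step (toy example here)
tokenized_docs = [["cluster", "document", "topic"], ["neural", "network", "image"]]
model = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1, seed=1)

def doc_vector(tokens, model):
    # Average the word vectors of the vocabulary words present in the article.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(doc, model) for doc in tokenized_docs])   # (num_docs, 100) document matrix
```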

3.3.6 Parameter settings

MOCK [1] and SOGA [145] are executed with default parameters (codes provided by authors). Parameter settings of other algorithms are explained below.

1. SMODoc clust and MODoc clust: The different parameter values used in our proposed clustering technique are shown in Table 3.1. These parameters are selected after conducting a

4https://radimrehurek.com/gensim/models/word2vec.html


thorough sensitivity study. It is important to note that the mutation (normal, deletion and insertion) probabilities used here are the same as reported in the existing literature [38, 145, 149]. The same parameters are used in the MODoc clust approach (excluding the SOM parameters).

2. VAMOSA: This algorithm is executed with Tmax = 10, Tmin = 0.01, SL = 20 and HL = 10. Here, Tmax and Tmin denote the maximum and minimum values of the temperature, respectively. SL and HL are two parameters associated with the size of the archive; they denote the soft limit and hard limit on the archive size, respectively. Initially, the archive of AMOSA is initialized with SL solutions. During the process, the number of solutions in the archive can grow up to SL; once the number of solutions crosses the threshold SL, a clustering procedure is applied to reduce it to HL. At the end of the execution, an archive containing HL solutions is provided to the user. The rest of the parameter values are kept the same as reported in [28].

3. NSGA-II-Clust: The different parameters used in the NSGA-II based multi-objective clustering are: number of generations = 50, population size = 50, crossover probability = 0.8, mutation strength = 0.2; the normal (µn), insertion (µi) and deletion (µd) mutation probabilities are taken as µn < 0.7, 0.7 < µi ≤ 0.85 and µd ≥ 0.85, respectively.

3.4 Analysis of results obtained

In order to measure the goodness of the partitionings obtained by the proposed MOO-based approach, two internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin (DB) Index [60], are used. The numbers of clusters detected by the proposed algorithm for the different datasets are reported in Table 3.2 and Table 3.3. Detailed descriptions of the Dunn and DB indices are given in Table 2.1. The most relevant words of the different clusters (obtained using the Dunn index) corresponding to the optimal partitionings identified by the proposed approach for the NIPS 2015 and AAAI 2013 data sets are shown in Figure 3.7(a) and Figure 3.7(b), respectively. These keywords are extracted using the topic modeling tool Latent Dirichlet Allocation (LDA) [129].

3.4.1 Results on NIPS 2015 Articles

On the NIPS 2015 data set, our proposed approach performs better than all other existing approaches with the different representation schemes used. The results obtained are shown in Table 3.2 and Table 3.3. The best result, having DI = 0.64, was obtained using the word2vec model with an obtained number of clusters (OC) = 2, where each word vector is of 100 dimensions. On the other hand, the best value of the DB index, 0.1323, was obtained using the word2vec representation with the same number of clusters, i.e., 2,


Cluster 1: feedforward, stochastic, feature, exploring, exponentially, extracted, experimentally, expression, fed, accurate, feasible, extremely, model, falls, maximum

Cluster 2: deep, images, convolutional, training, Bayesian, network, bound, distribution, convolutional, algorithm, neural, optimization, matrix, graph

(a) NIPS 2015

Cluster 1: multi agent, network, image, approach, rank, constraint, classification, game, learning, clustering, heuristic, model, method, game, learning, dynamic, data

Cluster 2: constraint, hidden, markov, sentiment, algorithm, transportability, similarity, kernel, solver, agent, temporal, causal, data, selection, learning, random, environment, complexity, preference, application

Cluster 3: grammar, semantic, problem, minimax, structural, consistency, path, cluster, distance, euclidean, k-nn, measure, search, synchronous, property, dissimilarity, sentence, logical, uncover, heuristic, time

(b) AAAI 2013

Figure 3.7: Relevant cluster keywords for the (a) NIPS 2015 and (b) AAAI 2013 data sets, corresponding to the best partitioning result obtained by the proposed approach.

where each word vector is of 50 dimensions. Thus, it can be inferred that the optimal number of clusters for the NIPS dataset is 2. The relevant words extracted for the different clusters corresponding to the best result obtained by our approach are shown in Figure 3.7(a). This clearly indicates that the two clusters correspond to the topics of deep learning and computer vision, respectively. The major observations related to the obtained clusters at a fine-grained level are as follows: articles in cluster-2 correspond to deep convolutional neural networks applied to image data, while articles in cluster-1 correspond to simple feed-forward networks with stochastic optimization, in which features are extracted by the user and fed to the network. The Pareto optimal solutions obtained after application of our proposed framework are shown in Figure 3.8(a). Here, we can see that after completion of the maximum number of generations, the Pareto optimal front converges to only three to four non-dominated solutions. Each point in the Pareto optimal front of Figure 3.8(a) represents a non-dominated solution. Note that our proposed approach, SMODoc clust, attains the best results with the word2vec-based representation with dimension 100. MOCK is also executed with this configuration. The best result by MOCK corresponds to DI = 0.0151 and DB = 0.6401 with OC = 4. In most of the cases, the MODoc clust, VAMOSA, NSGA-II-Clust, SOGA, K-means and single-linkage algorithms fail to achieve good scores for this data set. K-means and single-linkage are well-known classical clustering techniques. Our proposed algorithm is based on the MOO concept, and in the literature [150, 151, 152],


it has already been proved that MOO is more effective than SOO; therefore, SOGA does not perform well. The rest of the approaches, namely MODoc clust, VAMOSA and NSGA-II-Clust, are based on the multi-objective optimization concept and utilize different optimization strategies: MODE [153], AMOSA [38] and NSGA-II [29]. However, none of these approaches explores the power of the self-organizing map as a tool for mating pool construction, as done in our proposed approach, SMODoc clust. Moreover, it is interesting to note that MODE has been ranked at the front5 in various competitions organized under the IEEE Congress on Evolutionary Computation (CEC) conference series. Our proposed algorithm is also based on the MODE concept. Note that for the NIPS 2015 articles, SOGA-based clustering does not converge after the fifth generation while using the tf and tf-idf based representation schemes. Therefore, for SOGA, the results obtained after the fifth generation are reported in Table 3.2 and Table 3.3.

Table 3.2: Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; #N: number of articles/documents; #F: vocabulary size; OC: obtained number of clusters; DI: Dunn Index; xx: all data points assigned to a single cluster. Each method column gives "OC, DI".

Rep. | #F | SMODoc clust | MODoc clust | VAMOSA | NSGA-II-Clust | SOGA | K-means | single-linkage

NIPS 2015 (#N = 403)
tf | 183 | 4, 0.2247 | 4, 0.1082 | 5, 0.1058 | 2, 0.0714 | 5, 0.0471 | 4, 0.0811 | 4, 0.0698
tf-idf | 183 | 5, 0.1844 | 4, 0.1623 | 7, 0.1081 | 2, 0.0738 | 2, 0.0832 | 5, 0.1388 | 5, 0.1494
word2vec | 50 | 4, 0.0732 | 5, 0.0397 | 2, 0.0366 | 2, 0.0121 | 4, 0.0258 | 4, 0.0268 | 4, 0.0401
word2vec | 100 | 2, 0.6414 | 6, 0.0282 | 2, 0.6121 | 2, 0.0111 | 2, 0.0069 | 2, 0.0059 | 2, 0.0116
word2vec | 200 | 2, 0.5657 | 8, 0.0445 | 9, 0.0292 | 2, 0.0123 | 2, 0.0039 | 2, 0.0090 | 2, 0.0106
word2vec | 300 | 2, 0.5723 | 8, 0.0445 | 11, 0.0252 | 2, 0.1676 | 3, 0.0048 | 2, 0.0058 | 2, 0.0085
glove | 50 | 5, 0.3096 | 5, 0.2953 | 7, 0.2674 | 2, 0.2660 | 10, 0.2900 | 5, 0.2601 | 5, 0.3124
glove | 100 | 5, 0.3884 | 4, 0.3714 | 4, 0.3533 | 2, 0.3187 | 8, 0.3833 | 5, 0.3103 | 5, 0.3593
glove | 200 | 4, 0.4104 | 2, 0.4099 | 3, 0.4097 | 2, 0.3829 | 8, 0.4068 | 4, 0.3753 | 4, 0.3443
glove | 300 | 4, 0.3778 | 4, 0.3598 | 7, 0.3669 | 2, 0.3539 | 4, 0.3111 | 4, 0.3647 | 4, 0.3509

AAAI 2013 (#N = 150)
tf | 673 | 4, 0.2948 | 4, 0.2948 | 4, 0.2948 | 2, 0.1860 | 4, 0.1328 | 4, 0.1961 | 4, 0.2635
tf-idf | 673 | 3, 0.5352 | 3, 0.5286 | 2, 0.5218 | 2, 0.5218 | 3, 0.1431 | 3, 0.4204 | 3, 0.3339
word2vec | 50 | 9, 0.1805 | 11, 0.1751 | 5, 0.1665 | 2, 0.1726 | 10, 0.0521 | 9, 0.0692 | 9, 0.0738
word2vec | 100 | 5, 0.1238 | 4, 0.0871 | 2, 0.1290 | 2, 0.0504 | 7, 0.0612 | 5, 0.1110 | 5, 0.0940
word2vec | 200 | 5, 0.1168 | 4, 0.0827 | 3, 0.0401 | 2, 0.0333 | 2, 0.0457 | 5, 0.1094 | 5, 0.1094
word2vec | 300 | 9, 0.1513 | 11, 0.1292 | xx, xx | 2, 0.0334 | 3, 0.0401 | 9, 0.0638 | 9, 0.0763
glove | 50 | 2, 0.3213 | 4, 0.3213 | 5, 0.2330 | 2, 0.2513 | 2, 0.3213 | 2, 0.3213 | 2, 0.3213
glove | 100 | 3, 0.4005 | 3, 0.4005 | 5, 0.2329 | 2, 0.2753 | 3, 0.0 | 3, 0.2433 | 3, 0.2470
glove | 200 | 3, 0.3323 | 3, 0.3640 | 2, 0.2461 | 2, 0.2848 | 2, 0.3135 | 3, 0.2588 | 3, 0.2588
glove | 300 | 4, 0.2346 | 3, 0.2233 | 4, 0.1338 | 2, 0.1429 | 2, 0.2080 | 4, 0.1578 | 4, 0.2319

WebKB (#N = 2803)
tf | 7229 | 2, 3.6423 | 3, 3.1248 | 3, 0.6710 | 2, 0.0069 | 4, 0.0038 | 2, 3.6423 | 2, 3.6423
tf-idf | 7229 | 3, 0.9174 | 10, 0.7450 | 3, 0.5610 | 2, 0.0059 | 4, 0.0012 | 3, 0.9174 | 3, 0.9174
word2vec | 50 | 4, 0.0452 | 4, 0.0452 | 3, 0.0424 | 2, 0.0493 | 4, 0.0308 | 4, 0.0452 | 4, 0.0480
word2vec | 100 | 4, 0.0474 | 4, 0.0474 | 5, 0.0469 | 2, 0.0463 | 2, 0.0424 | 4, 0.0474 | 4, 0.0426
word2vec | 200 | 5, 0.0464 | 5, 0.0449 | 2, 0.0985 | 2, 0.0454 | 3, 0.0 | 5, 0.0461 | 5, 0.0460
word2vec | 300 | 2, 0.0646 | 5, 0.0421 | 6, 0.0461 | 3, 0.0419 | 3, 0.0 | 2, 0.0445 | 2, 0.0607
glove | 50 | 4, 0.5871 | 2, 0.5637 | 3, 0.0597 | 2, 0.0601 | 2, 0.5129 | 4, 0.0430 | 4, 0.0643
glove | 100 | 4, 0.6909 | 4, 0.6189 | 6, 0.0400 | 2, 0.0462 | 2, 0.5780 | 4, 0.0468 | 4, 0.0541
glove | 200 | 3, 0.6107 | 3, 0.6391 | 3, 0.1613 | 2, 0.0530 | 2, 0.0727 | 3, 0.0640 | 3, 0.0698
glove | 300 | 4, 0.6325 | 4, 0.6325 | 6, 0.0461 | 2, 0.0621 | 2, 0.0 | 4, 0.0672 | 4, 0.0764

3.4.2 Results on AAAI 2013 Articles

On the AAAI 2013 data set, our proposed approach mostly performs better than all other existing approaches utilizing different representation schemes. The best result was obtained using the tf-idf representation, and the corresponding value of the Dunn index is 0.53 with OC=3. Only with the “tf” based representation scheme does MODoc clust perform similarly to the proposed algorithm. MOCK is also executed with the tf-idf based representation. The best solution obtained by MOCK

5http://www.ntu.edu.sg/home/epnsugan/index files/cec-benchmarking.htm


Table 3.3: Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; N: number of scientific articles; F: vocabulary size; OC: obtained number of clusters; DB: Davies-Bouldin Index; xx: all data points assigned to a single cluster.
Data set (#N) | Rep. | #F | SMODoc clust (OC, DB) | MODoc clust (OC, DB) | VAMOSA (OC, DB) | NSGA-II-Clust (OC, DB) | SOGA (OC, DB) | K-means (OC, DB) | single-linkage (OC, DB)
NIPS 2015 (403) | tf | 183 | 3, 0.8171 | 3, 0.8192 | 8, 1.3949 | 2, 1.8226 | 3, 3.8074 | 3, 1.3051 | 3, 1.5270
NIPS 2015 (403) | tf-idf | 183 | 4, 0.8909 | 4, 1.9023 | 7, 1.5161 | 2, 1.6235 | 2, 2.8180 | 4, 1.3454 | 4, 1.4449
NIPS 2015 (403) | word2vec | 50 | 2, 0.1323 | 3, 0.1346 | 4, 0.3336 | 2, 1.6002 | 4, 0.5123 | 2, 0.6897 | 2, 0.6898
NIPS 2015 (403) | word2vec | 100 | 4, 0.4830 | 4, 0.4833 | 5, 0.4965 | 2, 1.9047 | 5, 0.4406 | 4, 0.6415 | 4, 0.6400
NIPS 2015 (403) | word2vec | 200 | 3, 0.4420 | 3, 0.4433 | 6, 0.4937 | 2, 2.0387 | 2, 0.7869 | 3, 0.6073 | 3, 0.5974
NIPS 2015 (403) | word2vec | 300 | 3, 0.4424 | 3, 0.4448 | 7, 0.4625 | 2, 1.8985 | 3, 0.6533 | 3, 0.5950 | 3, 0.5914
NIPS 2015 (403) | glove | 50 | 3, 1.7339 | 4, 1.8308 | 11, 2.1762 | 2, 1.4428 | 3, 2.4221 | 3, 2.3080 | 3, 2.6423
NIPS 2015 (403) | glove | 100 | 4, 1.5774 | 3, 1.6388 | 2, 2.1357 | 2, 1.6063 | 3, 2.4676 | 4, 2.7221 | 4, 2.5282
NIPS 2015 (403) | glove | 200 | 4, 1.6561 | 4, 1.6561 | 3, 2.7614 | 2, 2.0814 | 3, 2.1848 | 4, 2.9711 | 4, 2.6400
NIPS 2015 (403) | glove | 300 | 4, 1.8533 | 3, 1.8692 | 2, 2.5201 | 2, 1.9119 | 4, 5.6560 | 4, 2.9511 | 4, 2.8510
AAAI 2013 (150) | tf | 673 | 4, 1.4330 | 3, 1.4385 | 4, 1.1605 | 2, 1.8727 | 4, 1.8695 | 4, 1.8786 | 4, 1.9064
AAAI 2013 (150) | tf-idf | 673 | 4, 1.7145 | 3, 1.7788 | 2, 1.8407 | 2, 1.8929 | 4, 1.8486 | 4, 2.0155 | 4, 1.8986
AAAI 2013 (150) | word2vec | 50 | 3, 0.7356 | 3, 0.9981 | 5, 0.6382 | 2, 1.7318 | 5, 1.0032 | 3, 1.0308 | 3, 1.0242
AAAI 2013 (150) | word2vec | 100 | 3, 0.7170 | 2, 0.8773 | 2, 0.8161 | 1, 1.9175 | 5, 1.0271 | 3, 1.0259 | 3, 1.0353
AAAI 2013 (150) | word2vec | 200 | 3, 0.7276 | 3, 0.7452 | 3, 1.0674 | 2, 1.7372 | 2, 1.2772 | 3, 1.0142 | 3, 1.0294
AAAI 2013 (150) | word2vec | 300 | 3, 0.6879 | 3, 0.7054 | xx, xx | 2, 1.7372 | 3, 0.9644 | 3, 0.9885 | 3, 1.0076
AAAI 2013 (150) | glove | 50 | 3, 1.2799 | 4, 1.3200 | 5, 1.7573 | 2, 1.3644 | 3, 1.4252 | 3, 1.3475 | 3, 1.4138
AAAI 2013 (150) | glove | 100 | 4, 1.1374 | 3, 1.1822 | 5, 1.5257 | 2, 1.3644 | 3, 1.2513 | 4, 1.7296 | 4, 1.6525
AAAI 2013 (150) | glove | 200 | 4, 1.1970 | 4, 1.1970 | 2, 1.6171 | 2, 2.0304 | 3, 2.2181 | 4, 1.5871 | 4, 1.6124
AAAI 2013 (150) | glove | 300 | 4, 1.2884 | 4, 1.4062 | 4, 1.7796 | 2, 1.7294 | 3, 1.6864 | 4, 1.6865 | 4, 1.6291
WebKB (2803) | tf | 7229 | 3, 0.0206 | 3, 0.0206 | 3, 0.0678 | 2, 6.9621 | 3, 2.6846 | 3, 0.0646 | 3, 0.0646
WebKB (2803) | tf-idf | 7229 | 3, 0.0834 | 4, 0.0497 | 3, 0.0623 | 2, 23.757 | 4, 2.0806 | 3, 2.5467 | 3, 0.0522
WebKB (2803) | word2vec | 50 | 5, 1.1400 | 5, 1.1502 | 3, 1.5417 | 2, 2.4978 | 3, 1.8074 | 5, 1.3936 | 5, 1.5454
WebKB (2803) | word2vec | 100 | 5, 1.1457 | 5, 1.1448 | 4, 1.7018 | 4, 2.5136 | 2, 1.6088 | 5, 1.3867 | 5, 1.1367
WebKB (2803) | word2vec | 200 | 5, 1.1352 | 3, 1.1913 | 2, 0.6134 | 2, 2.5136 | 3, 2.5172 | 5, 1.3574 | 5, 1.5183
WebKB (2803) | word2vec | 300 | 5, 1.2220 | 5, 1.2203 | 6, 3.4282 | 3, 2.7561 | 2, 2.5237 | 5, 1.3442 | 5, 1.4609
WebKB (2803) | glove | 50 | 3, 0.5523 | 2, 0.8155 | 3, 2.6150 | 2, 2.2373 | 3, 1.9142 | 3, 1.9468 | 3, 2.2323
WebKB (2803) | glove | 100 | 3, 1.4299 | 2, 0.8687 | 6, 3.3422 | 2, 1.9867 | 2, 1.1582 | 2, 2.9522 | 2, 0.2911
WebKB (2803) | glove | 200 | 2, 0.1932 | 3, 1.3411 | 6, 1.2107 | 2, 2.6978 | 2, 1.4694 | 2, 0.3008 | 2, 0.3008
WebKB (2803) | glove | 300 | 3, 1.6632 | 3, 2.9034 | 6, 3.4282 | 2, 2.2490 | 2, 2.0660 | 3, 1.8072 | 3, 2.1201

corresponds to DI = 0.2684 and DB = 12.1723 with OC = 3. On the other hand, the minimum DB value obtained by our proposed approach is 0.6879 with the word2vec based representation scheme having 300 dimensions, and the corresponding number of clusters is 3. Thus, we can say that the optimal number of clusters for the AAAI dataset is 3. Similar to the NIPS 2015 data set, here also the SOGA based clustering does not converge within the fifth to eighth generations. Figure 3.7(b) clearly indicates the topics of the different clusters. All clusters are related to machine learning, but at a lower level of abstraction we can conclude that cluster-1 contains articles related to artificial intelligence, as words like multi-agent, game, heuristic method, etc. are predominant in this cluster. Cluster-2 corresponds to papers discussing different applications of machine learning approaches, for example applying Hidden Markov Models to sentiment analysis and other domains. Cluster-3 precisely corresponds to papers reporting applications of machine learning approaches, like K-nearest neighbor classifiers, for solving different natural language processing tasks; these articles discuss grammar, syntax and semantics, parsing, etc. The Pareto optimal solutions obtained by the proposed clustering approach are shown in Figure 3.8(b). Each point in the Pareto optimal front of Figure 3.8(b) represents a non-dominated solution. Again, the MODoc clust, VAMOSA, NSGA-II-clust, SOGA, K-means and single-linkage algorithms fail to achieve good scores for this data set in most of the cases (the reasons for failure are stated in Section 3.4.1).



Figure 3.8: Pareto optimal fronts obtained after application of the proposed clustering algorithm on scientific articles (a) NIPS 2015 ; (b) AAAI 2013 ; (c) WebKB datasets

3.4.3 Results on WebKB dataset

On the WebKB data set, our proposed approach performs better than all other existing approaches in most of the cases, utilizing different representation schemes. Out of the different dimensions used in the Word2vec based representation, a maximum DI value of 0.0474 and a minimum DB value of 1.1351 were obtained by our proposed approach using 200 dimensions with OC=5. On the other hand, using the Glove representation with varying dimensions, a maximum DI value of 0.6909 was obtained with OC=4 and 100 dimensions, while a minimum DB value of 0.1932 was obtained using the glove representation with 200 dimensions and OC=2. In Table 3.2, a maximum DI value of 3.6423 was obtained with the tf representation. After thorough investigation of this result, we found that this solution corresponds to a partitioning where more than 80% of the total documents are assigned to a single cluster, which in turn inflates the measured compactness and separation of the clusters; this results in a high value of the Dunn index. This partitioning was generated because of the sparsity of the document matrix (most components of each document vector are zero), which is of size 2803 × 7229.


Table 3.4: Values of the different components of the Dunn Index for the tf, tf-idf and Glove representation with 100 dimensions on the WebKB dataset. Here, Rep. denotes representation; OC: obtained number of clusters; DI: Dunn Index; a: minimum distance between two points belonging to different clusters; b: maximum diameter amongst the different clusters.
Rep. | OC | DI = a/b | a | b
tf | 2 | 3.6423 | 1010.2593 | 277.3699
tf-idf | 3 | 0.9174 | 806.7541 | 879.386
glove (100) | 4 | 0.6909 | 4.6481 | 6.727

A similar situation occurred with the tf-idf based representation. The best value of the Dunn index obtained is 0.6909, which corresponds to OC=4 with the Glove representation having 100 dimensions, whereas the best value of the obtained DB index is 0.1932 with OC=2. MOCK attains a best DB index value of 7.2509, which is greater than the minimum DB value obtained by our approach. In Table 3.4, the values of the numerator and denominator of the Dunn index corresponding to the tf, tf-idf and glove (100 dimensions) representations for this dataset are shown. The numerator measures the minimum distance between two points belonging to different clusters, while the denominator measures the maximum diameter amongst the diameters of the different clusters. It is clearly evident from Table 3.4 that for the tf and tf-idf representations, both the numerator and denominator values are very high compared to the Glove (100) representation. This is because the generated clusters are not proper/compact: there is one big cluster (containing 80% of the data points) and one or two small clusters. Because of the presence of the large cluster, the denominator value is high, and the cluster separation (numerator) is also high; thus the Dunn index value is high as well. This in turn shows that DI is not always a good measure of cluster quality, as it favours non-uniform sized clusters. Except for the cases of the Glove and Word2vec based representations with 100 dimensions, the proposed algorithm always beats the other algorithms and attains the best result.

Generally, with an increase in the dimension/size of the Word2vec/glove vector representation, the precision of capturing semantic information increases. With the increase in this size parameter, more data is required to train the models and to represent the concepts. However, in our work, due to the use of word2vec/glove averaging to represent the articles/documents, there is a loss of semantic information. Therefore, in Table 3.3 it can be seen that, with the increase in the vector length using word2vec/glove, instead of a decrease in the DB index values there are fluctuations in the results. A more robust representation is required to avoid the loss of semantic information, as the representation of a document plays a key role in defining the similarity/dissimilarity metric between documents, which in turn can help in clustering documents in an automated way. Therefore, we have also tried the Doc2vec6 representation.

6https://github.com/jhlau/doc2vec
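To make the decomposition in Table 3.4 concrete, the following minimal Python sketch (an illustrative helper, not the thesis code) computes the two Dunn index components — the minimum inter-cluster point distance a and the maximum cluster diameter b — so that the effect of one oversized cluster on DI = a/b can be reproduced.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_components(X, labels):
    """Return (a, b): minimum inter-cluster distance and maximum cluster diameter.

    One large, loose cluster inflates b, but if the remaining tiny clusters are
    far away, a also becomes large, which can make DI = a / b misleadingly high."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # a: smallest distance between points belonging to different clusters
    a = min(cdist(ci, cj).min()
            for i, ci in enumerate(clusters)
            for j, cj in enumerate(clusters) if i < j)
    # b: largest within-cluster diameter (maximum pairwise distance inside a cluster)
    b = max(cdist(c, c).max() for c in clusters)
    return a, b

# toy usage: one big spread-out cluster and one small, distant cluster
X = np.vstack([np.random.randn(80, 5) * 5.0, np.random.randn(4, 5) + 50.0])
labels = np.array([0] * 80 + [1] * 4)
a, b = dunn_components(X, labels)
print("DI =", a / b)
```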


Table 3.5: Results reporting the DB index value obtained after application of the proposed clustering algorithm on WebKB documents using the Doc2vec representation in comparison to other clustering algorithms. Here, Rep. denotes representation; N: number of scientific articles; F: vocabulary size; OC: obtained number of clusters; DB: Davies-Bouldin Index.
Data set (#N) | Rep. | #F | SMODoc clust (OC, DB) | MODoc clust (OC, DB) | VAMOSA (OC, DB) | NSGA-II-Clust (OC, DB) | SOGA (OC, DB) | K-means (OC, DB) | single-linkage (OC, DB)
WebKB (2803) | Doc2vec | 50 | 3, 2.3204 | 3, 3.0317 | 3, 3.6981 | 2, 3.9696 | 4, 3.3678 | 3, 3.6687 | 3, 4.2620
WebKB (2803) | Doc2vec | 100 | 2, 0.9723 | 2, 0.9729 | 4, 5.0457 | 2, 3.8375 | 2, 3.6676 | 2, 3.7273 | 3, 3.9529
WebKB (2803) | Doc2vec | 200 | 2, 0.9549 | 2, 1.0054 | 2, 2.6654 | 2, 3.1647 | 4, 3.9685 | 2, 4.0644 | 2, 3.8797
WebKB (2803) | Doc2vec | 300 | 2, 0.3217 | 3, 0.8023 | 5, 4.8537 | 2, 2.9372 | 2, 3.2979 | 2, 4.3355 | 2, 3.9873

Note that we have trained the Doc2vec model on the available WebKB documents, i.e., 4199 preprocessed documents, making use of pre-trained glove [44] word embeddings with a vocabulary size of 2.2M and 300-dimensional word vectors. The results are reported in Table 3.5. It can be inferred from the results obtained by the SMODoc clust, MODoc clust and NSGA-II-clust techniques (shown in Table 3.5) that, with the increase in the dimensionality of the vector representation, the quality of the clusters improves in terms of the DB index value (the lower the value, the better the cluster quality). However, for VAMOSA this is not the case. From these observations, it can be inferred that the quality of clusters depends not only on the algorithm but also on the type of objective functions (cluster validity indices in our case). In SMODoc clust, MODoc clust and NSGA-II-clust, two objective functions, namely the PBM and Silhouette indices, are used, while in VAMOSA, the PBM and Xie-Beni indices are used. Note that for the Doc2vec representation we have not reported the Dunn index, as it is biased towards non-uniform sized clusters, as discussed at the end of the first paragraph of this section.
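A minimal gensim-based sketch of training such a Doc2vec model is shown below; the toy corpus and parameter values are illustrative assumptions (the thesis relies on the jhlau/doc2vec implementation initialized with pre-trained GloVe vectors, which is not reproduced here).

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# preprocessed_docs: one token list per WebKB document (toy placeholder shown)
preprocessed_docs = [["machine", "learning", "course"], ["faculty", "research", "page"]]
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(preprocessed_docs)]

# 300-dimensional document vectors, mirroring the largest setting in Table 3.5
model = Doc2Vec(corpus, vector_size=300, window=5, min_count=1, epochs=40, workers=4)

# document embeddings that can be fed to the clustering algorithms
doc_vectors = [model.dv[i] for i in range(len(corpus))]  # model.docvecs in older gensim versions
```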

The non-dominated solutions present in the final Pareto optimal set obtained by the proposed clustering approach are shown in Figure 3.8(c). Note that for all the datasets, the number of fitness function evaluations equals 5100.

3.4.4 Results using XLNET Language Model

We have also tried the recently developed XLNET language model [154] to obtain the document embeddings. The DB index values and the numbers of clusters obtained for the three datasets, namely NIPS 2015, AAAI 2013 and WebKB, are (a) 1.13, 3; (b) 0.9065, 4; (c) 0.3724, 11, respectively. It is clear that these results are inferior to the best results for the corresponding datasets reported in Table 3.3. It is important to note that XLNET has become a new state-of-the-art language model, beating BERT [155]. However, it does not perform well here because XLNET is a contextualized model, whereas we have simply considered the average of the word embeddings obtained using the XLNET pre-trained model7.

7https://github.com/zihangdai/xlnet
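The averaged XLNet document embedding described above can be obtained roughly as in the following sketch using the Hugging Face transformers library; this is an assumed re-implementation for illustration, not the exact pipeline used in the thesis.

```python
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")

def document_embedding(text: str) -> torch.Tensor:
    """Average the contextual token embeddings of the last layer into one document vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # averaging discards most of the contextual information, which is the suspected
    # reason for the weaker clustering results reported above
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = document_embedding("Deep convolutional networks for image classification.")
print(vec.shape)  # torch.Size([768]) for the base model
```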


3.4.5 Statistical Significance

To further check the statistical significance of our approach, we have conducted a statistical hypothesis test, namely Welch's t-test, guided by [156], at the 5% (0.05) significance level. It checks whether the improvements obtained by the proposed SMODoc clust are statistically significant or happened by chance. The statistical t-test provides a p-value; a smaller p-value implies that the proposed multi-objective clustering approach is better than the others. In our experiment, p-values are calculated considering two groups: one group corresponds to the list of Dunn index values produced by our algorithm, and the other corresponds to the list of Dunn index values produced by some other algorithm. In this t-test, two hypotheses are considered: the null hypothesis, which states that there is no significant difference between the mean values of the two groups, and the alternative hypothesis, which states that there is a significant difference between the mean values of the two groups. The obtained p-values are shown in Table 3.6, which evidently supports the results of Table 3.2.
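Such a Welch's t-test can be run with SciPy as sketched below; group_a and group_b stand for the two lists of Dunn index values (proposed versus a competing algorithm), and the numbers shown are illustrative placeholders rather than values from Table 3.2.

```python
from scipy import stats

# Dunn index values over repeated runs (illustrative numbers only)
group_a = [0.64, 0.63, 0.65, 0.62, 0.66]   # proposed SMODoc_clust
group_b = [0.03, 0.04, 0.02, 0.05, 0.03]   # a competing algorithm

# equal_var=False selects Welch's t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3e}")
# p < 0.05 -> reject the null hypothesis of equal means; the improvement is significant
```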

Table 3.6: p-values obtained after conducting the t-test comparing the performance of the proposed SMODoc clust algorithm with other existing clustering techniques with respect to the Dunn index values reported in Table 3.2. Here, xx: values are absent in Table 3.2.
Data Set | Representation | #F | MODoc clust | VAMOSA | NSGA-II-Clust | SOGA | K-means | single-linkage
NIPS 2015 | tf | 183 | 3.01E-192 | 6.59E-190 | 7.89E-261 | 1.96E-307 | 2.28E-241 | 5.41E-264
NIPS 2015 | tf-idf | 183 | 4.13E-011 | 7.44E-099 | 3.77E-172 | 1.09E-104 | 4.47E-041 | 1.77E-25
NIPS 2015 | word2vec | 50 | 1.58E-023 | 1.24E-027 | 1.73E-68 | 5.21E-44 | 2.26E-042 | 2.99E-019
NIPS 2015 | word2vec | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NIPS 2015 | word2vec | 200 | 2.80E-021 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NIPS 2015 | word2vec | 300 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NIPS 2015 | glove | 50 | 2.62E-005 | 9.51E-036 | 6.59E-038 | 5.33E-009 | 1.59E-047 | 0.2513
NIPS 2015 | glove | 100 | 4.70E-007 | 1.31E-025 | 2.31E-085 | 0.182621 | 1.35E-102 | 3.25E-018
NIPS 2015 | glove | 200 | 0.911417 | 0.961362 | 1.93E-016 | 0.38863 | 1.31E-025 | 3.47E-078
NIPS 2015 | glove | 300 | 8.99E-008 | 0.001650 | 9.009E-13 | 2.26E-079 | 0.000127372 | 8.49E-016
AAAI 2013 | tf | 673 | 0.7885 | 0.788494 | 2.79E-168 | 2.82E-283 | 1.65E-146 | 1.13E-18
AAAI 2013 | tf-idf | 673 | 0.0714026 | 8.69E-005 | 8.69E-05 | 0.0 | 3.72E-181 | 0.0
AAAI 2013 | word2vec | 50 | 0.049742 | 3.49E-06 | 0.006069 | 1.46E-213 | 1.64E-196 | 1.95E-167
AAAI 2013 | word2vec | 100 | 3.69E-30 | 3.49E-06 | 0.00606986 | 2.17E-194 | 1.97E-05 | 4.79E-21
AAAI 2013 | word2vec | 200 | 1.49E-26 | 3.06E-103 | 1.43E-117 | 1.14E-91 | 0.009659 | 0.009659
AAAI 2013 | word2vec | 300 | 1.10E-012 | xx | 2.05E-191 | 4.19E-177 | 4.19E-126 | 1.05E-99
AAAI 2013 | glove | 50 | 0.788494 | 0 | 1.99E-089 | 0.788494 | 0.788494 | 0.788494
AAAI 2013 | glove | 100 | 0.788494 | 6.93E-292 | 7.43E-207 | 0 | 7.30E-272 | 1.33E-264
AAAI 2013 | glove | 200 | 2.80E-021 | 2.52E-123 | 4.96E-047 | 9.69E-010 | 1.35E-096 | 1.35E-096
AAAI 2013 | glove | 300 | 0.000143 | 1.01E-154 | 4.10E-135 | 2.51E-17 | 1.89E-103 | 0.264497
WebKB | tf | 7229 | 0 | 0 | 0 | 0.788494 | 0.788494 | 0.788494
WebKB | tf-idf | 7229 | 0 | 0 | 0 | 0 | 0 | 0
WebKB | word2vec | 50 | 0.788494 | 0.2513 | 0.308194 | 1.91E-006 | 0.788494 | 0.541214
WebKB | word2vec | 100 | 0.788494 | 0.670639 | 0.539444 | 0.0662238 | 0.78849 | 0.076022
WebKB | word2vec | 200 | 0.45977 | 5.26E-052 | 0.560392 | 3.48E-045 | 0.717001 | 0.693676
WebKB | word2vec | 300 | 4.55E-013 | 1.71E-009 | 2.91E-13 | 1.34E-078 | 7.48E-011 | 0.135651
WebKB | glove | 50 | 5.94E-014 | 0.0 | 0.0 | 4.81E-098 | 0.0 | 0.0
WebKB | glove | 100 | 1.64E-093 | 0.0 | 0.0 | 9.56E-181 | 0.0 | 0.0
WebKB | glove | 200 | 1.99E-017 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
WebKB | glove | 300 | 0.788494 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0

3.4.6 Complexity of proposed framework

Let N be the number of F-dimensional feature vectors and g be the maximum number of generations.
1) The population is initialized using the K-means algorithm, which takes O(tNFK) time [41].


Table 3.7: Comparative complexity analysis of existing clustering algorithms. Here, R is the number of reference distributions [1]; K is the maximum number of clusters present in a data set, which is √N; N is the number of data points; TotalIter is the number of iterations used, chosen in such a way that the number of fitness evaluations of all the algorithms becomes equal.
Algorithm | Time complexity
SMODoc clust | O(gP(tNFK + MP))
MODoc clust | O(gP(tNFK + MP))
MOCK | O(N² log(N) F³ K² P² M R)
VAMOSA | O(KN log(N) · TotalIter)
NSGA-II-clust | O(gP(tNFK + MP))
SOGA | O(gtPNKF)
K-means | O(tNKF)
single-linkage | O(N² log(N))

Here, t is the number of iterations and K is the number of clusters. If there are P solutions, then for each solution we have to calculate M objective functions; thus the total complexity of initializing the population (including objective function calculation) is O(P(tNFK + M)).
2) The training complexity of the SOM is O(P²), as mentioned in [157].
3) Extraction of the neighborhood relationship for each solution takes O(P²) time because of the calculation of the Euclidean distance of each neuron with respect to the other neurons using the associated weight vectors, which form a P × P matrix.
4) The crossover and mutation operations of the differential evolution algorithm take constant time; they involve only some addition, subtraction or multiplication operations. This implies that new solution generation using crossover and mutation takes O(P) time, as a new solution is required to be generated for each solution in the population.
5) The K-means clustering steps are applied on each new solution and the objective functional values are calculated. This takes O(P(tNFK + M)) time.
6) Non-dominated sorting takes O(MP²) time, as for each objective a comparison is required between each solution and every other solution.
Thus the total run-time complexity is O(P(tNFK + M) + g(P² + P² + P + P(tNFK + M) + MP²)). Here, steps 2 to 6 are repeated for g generations.

⇒ O(P(tNFK + M) + g(2P² + P + P(tNFK + M) + MP²))
⇒ O(P(tNFK + M) + g(2P² + PtNFK + MP²))
⇒ O(P(tNFK + M) + g(MP² + PtNFK))
⇒ O((1 + g)PtNFK + PM(1 + gP))
⇒ O(gPtNFK + gMP²)
⇒ O(gP(tNFK + MP))


Thus, the total complexity of our proposed system is O(gP(tNFK + MP)). Similarly, the complexity of NSGA-II-clust can be analyzed. The total run-time complexity of NSGA-II-clust is O(P(tNFK + M) + g(P(tNFK + M) + MP²)). Here, the first term is for population initialization and calculation of the objective functional values, and in the second term, P(tNFK + M) + MP² accounts for the application of K-means clustering on each newly generated solution followed by the non-dominated sorting and crowding distance mechanisms [29]. On simplification, this also boils down to O(gP(tNFK + MP)).

Comparison of complexity analysis with other algorithms: We have compared the time complexities of the existing clustering algorithms; these are reported in Table 3.7. It is important to note that the reported complexities of the existing algorithms are taken directly from the reference papers. It can be seen from Table 3.7 that the time complexities of our proposed multi-objective automatic document clustering algorithm with SOM (SMODoc clust) and without SOM (MODoc clust) based operators are almost the same. The MOCK algorithm is more expensive than ours. NSGA-II-clust runs with the same complexity as our proposed system. On comparing SOGA and K-means, it was found that SOGA takes slightly more time, as it is based on the search capability of a genetic algorithm.

3.5 Chapter Summary

In this chapter, we have proposed a new automatic multi-objective document clustering approach utilizing the search capability of differential evolution. The current algorithm is a fusion of DE and SOM, where the neighbourhood information identified by a SOM trained on the current population of solutions is utilized for generating the mating pool, which can further take part in generating new solutions. To generate more diverse solutions, the concept of polynomial mutation is incorporated in the DE framework, which helps in convergence towards the global optimal solution. Two objective functions, both measuring the compactness and separation of clusters, are considered here and are optimized simultaneously to improve the cluster quality. The efficacy of the proposed multi-objective document clustering technique is shown in automatically partitioning two text document data sets containing scientific articles and one web-document data set. Results are compared with various state-of-the-art techniques, including single as well as multi-objective clustering algorithms. It was found that the proposed approach can reach the global optimal solution for all the data sets, whereas other algorithms got stuck at local optima. The results clearly show that the proposed framework is well-suited for partitioning the data sets in an automated manner.


Because document clustering can play a significant role in text summarization, the next chapter discusses a single document summarization technique. Each sentence is considered as a document and the document clustering technique is applied as a pre-processing step of summarization.

CHAPTER 4

Multi-objective Clustering based Framework for Extractive Single Document Summarization

In this second contributory chapter, we report how we developed an extractive single document text summarization (ESDS) system (ESDS SMODE) using the integration of multi-objective differential evolution (MODE) and a self-organizing map (SOM). The sentences present in the document are first clustered utilizing the concept of multi-objective clustering. Then, representative sentences are selected from the different clusters using several sentence scoring features to generate the summary. The proposed approach can automatically detect the number of sentence clusters present in a document.


4.1 Introduction

4.1.1 Overview

The rapid growth in the volume of text information available on the World Wide Web motivates the development of automatic text document summarization systems. In this direction, sentence based extractive summarization techniques [55, 107, 158] are useful because they are popularly used in producing a summary where informative sentences are selected from the document using sentence scoring features. These features include the position of the sentence in the document [105], the length of the sentence [105], the similarity of the sentence with respect to the title of the document [105], and so on. In this chapter, an extractive single document summarization (ESDS) technique is developed to summarize a single document, utilizing the concept of sentence clustering. To begin with, the sentences of the document are clustered into K clusters, 1 ≤ K ≤ N, where K is the number of clusters and N is the number of sentences present in the document. Thereafter, the sentences present in each cluster are ranked according to different aspects like the similarity of the sentences with the title [105], the length of the sentences [105], and other features. Based on this ranking, some sentences are selected from each cluster to generate the summary. It has been shown in the literature that the results obtained by multi-objective optimization (MOO) based clustering techniques are better than those of single objective optimization (SOO) based versions [152, 159]. Motivated by this, the problem of sentence clustering is also framed as a MOO-based clustering problem where sentence clusters are identified in an automatic way. To detect sentence clusters of good quality, multiple cluster quality measures are optimized simultaneously utilizing the multi-objective clustering framework developed for partitioning documents (here, each sentence is treated as a document) in Chapter 3. Note that for optimization, multi-objective differential evolution (MODE) is utilized, and a self-organizing map (SOM) based genetic operator is incorporated in the optimization process, as done in the previous chapter. Moreover, with the advent of deep learning, it is possible to measure the semantic similarity of sentences, and our approach builds on this concept. Here, sentence-to-sentence similarity is computed using the word-mover-distance [53] (for more on this, refer to Chapter 2, Section 2.1.3), which utilizes the deep-learning-based tool word2vec (as discussed in Section 2.1.2 of Chapter 2) to capture semantic similarity. The proposed approach is tested on two benchmark datasets, DUC2001 and DUC2002, in the domain of news articles. The results obtained are in terms of ROUGE scores (refer to Section


2.3 of Chapter 2) and are compared with various state-of-the-art techniques: MA-SingleDocSum [105], Unified Rank [160], DE [55], NetSum [109], CRF [96], QCS [101], Manifold Ranking [98] and SVM [95]. The results clearly show the superiority of the proposed approach.

4.1.2 Contributions

The key-contributions of the developed ESDS SMODE framework are listed below:

1. A multi-objective clustering technique is developed to cluster the sentences present in a document. Finally, several sentence scoring features are utilized to select some informative sentences from each cluster.

2. A semantic-based scheme is used to represent a sentence in the form of a vector. With the advent of the word2vec [43] model, semantic representations of the words are possible. We utilized this tool to further represent a sentence in the form of a vector.

3. To accurately calculate the similarity/dissimilarity between two sentences, Word Mover Distance (WMD) [53] is used, which also utilizes the word2vec [43] model to capture the semantic similarity between two sentences.

4. To detect clusters having different shapes/sizes, two well-known cluster validity indices are deployed as the optimization criteria.

5. The proposed technique is automatic in nature. It is capable of automatic determination of the number of sentence clusters from a given document.

6. Experiments are conducted on the gold standard DUC2001 and DUC2002 datasets for two evaluation metrics (ROUGE-1 and ROUGE-2). The results are compared with various state-of-the-art techniques and illustrate the potential of our proposed approach over the existing techniques.

4.2 Problem definition

In the proposed approach, we have formulated the ESDS problem as a sentence clustering problem using multi-objective optimization, in which the qualities of the sentence clusters are measured using two validity indices, PBM [61] and the Xie-Beni index [28]. Thus the ESDS problem is as follows:

Find a set of optimal sentence-clusters, {S1,S2,...,SK } in an automatic way which satisfies the following:

1. S_i = {s_1^i, s_2^i, ..., s_{np_i}^i}, where np_i is the number of sentences in cluster i, s_j^i is the jth sentence of cluster i, and N is the number of sentences in the document.


2. |∪_{i=1}^{K} S_i| = N and S_i ∩ S_j = ∅ for all i ≠ j.

3. Several cluster validity indices, Val_1, Val_2, ..., Val_M, computed on this partitioning have attained their optimum values.

In general, cluster validity indices measure some intrinsic cluster properties like compactness and separation in different ways. After the generation of good quality sentence clusters, some sentences from each cluster are extracted using various sentence scoring features to generate the summary.

4.3 Proposed Method

This section discusses in detail the proposed self-organized multi-objective differential evolution based ESDS approach. From now onwards, we will refer to it as ESDS SMODE. A flow chart of ESDS SMODE is shown in Fig. 4.1.

[Figure 4.1 flow chart: Single document → 1. Preprocessing → 2. Population initialization (P) → 3. SOM training → 4. Apply genetic operators to form new population P′ → 5. Merge P and P′ → 6. Select |P| solutions → 7. Update SOM training data → repeat steps 3-7 while g < gmax → 8. Obtain set of Pareto optimal solutions → 9. Generate summary and choose the best solution → END]

Figure 4.1: Flow chart of the proposed architecture, ESDS SMODE, where gmax is the user-defined maximum number of generations and g is the current generation number.

4.3.1 Representation of Solution and Population Initialization

In this step, the population P, consisting of solutions <x⃗_1, x⃗_2, ..., x⃗_{|P|}>, is initialized. Here, we assume that a solution can be encoded using the K representative sentences (cluster centers) of the document, where K is the number of sentence clusters, selected randomly; each solution may thus have a different number of cluster centers. It is important to note that in the previous chapter we performed document clustering, whereas here sentence clustering is performed, as we are working on a single document. As the solutions in the population further take part in SOM training, variable-length solutions1 are converted into fixed-length ones by appending zeros. For example, if the ith solution, x⃗_i, has K_i cluster centers, then (N − K_i) zeros are appended, where N is the number of sentences in the document as well as the maximum length of a solution.

1If a solution has three clusters and the maximum number of sentences in the document is 8, then the solution is represented as <c1, c2, c3, 0, 0, 0, 0, 0>, where ck is the cluster center (representative sentence) of the kth cluster Sk.
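A minimal sketch of this variable-length encoding with zero padding is given below; the function and parameter names are illustrative assumptions, not the thesis implementation.

```python
import random

def encode_solution(num_sentences: int, k_min: int = 2, k_max: int = 8) -> list:
    """Pick K random sentence indices as cluster centers and pad with zeros to length N."""
    k = random.randint(k_min, min(k_max, num_sentences))
    centers = random.sample(range(1, num_sentences + 1), k)  # 1-based sentence indices
    return centers + [0] * (num_sentences - k)               # zero padding keeps the length fixed

# initial population of 10 fixed-length solutions for a document with 8 sentences
population = [encode_solution(num_sentences=8) for _ in range(10)]
print(population[0])  # e.g. [3, 7, 1, 0, 0, 0, 0, 0]
```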


4.3.2 Assignment of Sentences to Sentence Clusters

In order to extract the sentence clustering corresponding to the ith solution, the K-medoid [47] algorithm (discussed in Section 2.1.1 of Chapter 2) is executed for some iterations with the present set of cluster centers initialized in the previous step. After each iteration of the K-medoid algorithm, the cluster representatives/centroids are updated, and this process continues until the cluster centroids converge. Note that the remaining sentences of the document are assigned to one of these K clusters based on the minimum distance criterion. In order to compute the dissimilarity between two sentences, the recently developed word-mover-distance (WMD) [53, 54] is utilized. If two sentences have a WMD of 0, then both sentences are identical. If sentence a has the minimum WMD [53] to cluster center b in comparison to the other cluster centers, then sentence a is assigned to the bth cluster. For more details about WMD, refer to Section 2.1.3 of Chapter 2.
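This assignment step can be sketched with gensim's Word Mover's Distance as follows; the model path, the tokenized sentences and the helper function are assumptions made for illustration only.

```python
from gensim.models import KeyedVectors

# pre-trained word2vec vectors (the file path is an assumption)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def assign_to_clusters(sentences, center_ids):
    """Assign every sentence (a list of tokens) to the cluster center with minimum WMD."""
    assignment = {}
    for i, tokens in enumerate(sentences):
        distances = [wv.wmdistance(tokens, sentences[c]) for c in center_ids]
        assignment[i] = center_ids[distances.index(min(distances))]
    return assignment

sentences = [["stock", "prices", "fell"], ["markets", "declined", "sharply"], ["the", "team", "won"]]
print(assign_to_clusters(sentences, center_ids=[0, 2]))
```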

4.3.3 Objective Functions Used

In order to measure the quality of the clustering encoded in a solution, two objective functions are utilized. In our framework, two cluster validity indices, the Xie-Beni (XB) index [28] and the PBM index [61], are utilized; both measure the compactness (intra-cluster distance) and separation (inter-cluster distance) of the clusters, but in different ways. Note that the proposed approach is generic in nature; any other combination of cluster validity indices could have been utilized as the objective functions. It was found in the literature that the combination of the PBM index and the XB index performs well in determining the optimal number of clusters for different data sets [161]. Inspired by this, the current set of objective functions is selected. The mathematical descriptions of these indices are provided in Table 2.1. Note that while evaluating these functions, WMD is utilized as the distance measure.

In order to determine the optimal number of clusters automatically from a given document, the pair {PBM index, 1/XB index} should be maximized using the search capability of the multi-objective differential evolution algorithm.
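For reference, the two validity indices can be computed roughly as in the sketch below, following their standard definitions (Euclidean distance is used here for brevity, whereas the proposed framework uses WMD); the function names are illustrative.

```python
import numpy as np

def pbm_index(X, labels, centers):
    """PBM = ((1/K) * (E1/EK) * DK)^2; larger values indicate better clustering."""
    K = len(centers)
    e1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()                    # scatter w.r.t. global centroid
    ek = sum(np.linalg.norm(X[labels == k] - centers[k], axis=1).sum() for k in range(K))
    dk = max(np.linalg.norm(centers[i] - centers[j])
             for i in range(K) for j in range(i + 1, K))                      # max center separation
    return ((e1 / ek) * dk / K) ** 2

def xie_beni_index(X, labels, centers):
    """XB = within-cluster squared scatter / (N * min squared center separation); smaller is better."""
    K, N = len(centers), len(X)
    within = sum((np.linalg.norm(X[labels == k] - centers[k], axis=1) ** 2).sum() for k in range(K))
    min_sep = min(np.linalg.norm(centers[i] - centers[j]) ** 2
                  for i in range(K) for j in range(i + 1, K))
    return within / (N * min_sep)

# the multi-objective framework then maximizes the pair {PBM, 1/XB}
```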

4.3.4 SOM Training

After forming the population, the solutions (in the solution space, not the objective space) take part in SOM training. This is necessary to capture the topological mapping of the solutions in the 2-dimensional space. For more details about this step, refer to Section 3.2.2 of Chapter 3.


4.3.5 Genetic Operators

In our framework, three types of genetic operators, namely mating pool selection, crossover/repairing, and mutation, are applied to generate new solutions from the present set of solutions in the population. Brief descriptions of these operators are given below:

Mating Pool Selection: In order to generate a new solution, the mating pool is constructed by considering the neighborhood solutions of the current solution retrieved using SOM. For the construction of the mating pool, given the current solution (x⃗_current), similar steps are followed as described in Section 2.1.12 of Chapter 2.

Crossover and Mutation: These steps are the same as those followed in the previous chapter. For more details, refer to Algorithm 5 of Chapter 2. It is important to note that a solution encodes cluster centers, which are representative sentences of the document. These cluster representatives are first converted to their corresponding vectors before performing the crossover operation. After applying polynomial mutation, the vector component values of a cluster center change, and the mutated cluster center (representative) may no longer correspond to a sentence of the document. To convert the mutated vector into a feasible sentence present in the document, the following steps are performed:

• Assign the sentences to these updated cluster centers based on minimum cosine distance2.

• Now, to find the representative of the pth cluster, the average cosine distance of each sentence assigned to that cluster to the remaining sentences of the same cluster is calculated. The sentence having the minimum average cosine distance to all other sentences in the same cluster becomes the representative.

Finally, the objective functional values (PBM and 1/Xie-Beni) of the new solution are calculated.
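A rough sketch of this repair step (mapping mutated, possibly infeasible centers back to actual sentences via cosine distance) is given below; sentence_vectors is an assumed array of averaged word vectors, and the helper names are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def repair_centers(mutated_centers, sentence_vectors):
    """Map mutated center vectors back to actual sentences of the document."""
    # step 1: assign every sentence to its nearest mutated center (minimum cosine distance)
    assignment = [int(np.argmin([cosine_distance(s, c) for c in mutated_centers]))
                  for s in sentence_vectors]
    # step 2: within each cluster, the sentence with minimum average cosine distance
    # to the other members becomes the new (feasible) representative
    repaired = []
    for k in range(len(mutated_centers)):
        members = [i for i, a in enumerate(assignment) if a == k]
        if not members:  # empty cluster: fall back to the single nearest sentence
            members = [int(np.argmin([cosine_distance(s, mutated_centers[k]) for s in sentence_vectors]))]
        avg = [np.mean([cosine_distance(sentence_vectors[i], sentence_vectors[j])
                        for j in members if j != i]) if len(members) > 1 else 0.0
               for i in members]
        repaired.append(members[int(np.argmin(avg))])
    return repaired  # indices of the representative sentences
```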

4.3.6 Selection of Best Solutions for Next Generation and Termination Cri- teria

These steps follow the same procedure as described in Section 3.2.6 and 3.2.7 of Chapter 3.

2Here, cosine distance(s⃗_i, v⃗) = 1 − cosine similarity(s⃗_i, v⃗), and cosine similarity(s⃗_i, v⃗) = (s⃗_i · v⃗)/(‖s⃗_i‖ ‖v⃗‖) measures the cosine of the angle between the two vectors s⃗_i and v⃗. Here, v⃗ is the updated cluster center in the form of a vector. If the angle is 0°, then s⃗_i and v⃗ overlap each other; if the angle is 180°, then the two sentences are opposite to each other.


4.3.7 Summary Generation (Sentence Extraction Module)

As any multi-objective optimization algorithm provides a set of equally important solutions on the final Pareto optimal front, called Pareto optimal solutions, there is a need to select the best solution based on the user's requirement. In our framework, first, the summary corresponding to each solution is generated, and then the best solution is selected using the best ROUGE-1 score. The summary generation steps for the ith solution are discussed below:

1. First, the document center is identified: the average WMD of each sentence with respect to the other sentences in the given document is calculated, and the sentence having the minimum average distance is selected as the document centroid (as shown below):

m = arg min_{1 ≤ i ≤ N} [ (1/O) Σ_{j=1, j≠i}^{N} dist_wmd(s_i, s_j) ]        (4.1)

where N is the number of sentences in the document, O is the total number of sentence pairs, given as N×(N−1)/2, s_i is the ith sentence, and m is the index of the document center (the mth sentence in the document).

2. Clusters present in the ith solution are ranked: the WMD of each cluster center present in the ith solution to the document center is calculated as z_k = dist_wmd(c_k, s_m), where c_k, 1 ≤ k ≤ K_i, is the kth cluster center of the ith solution. Finally, the clusters are ranked in descending order based on these z_k scores.

3. Calculate the sentence score in each cluster: To assign a score to each sentence of a cluster, four features are used here: the length of the sentence, the position of the sentence in the document, the similarity of the sentence with the title, and anti-redundancy. Formal descriptions of these features are given below:

• Length of the sentence (F1): The existing literature suggests that the shortest sentences are less likely to appear in the summary [162]. In this work, a normalization based sigmoid function is used which favours the longest sentences but does not completely rule out medium-length sentences.

L(s_i^k) = [1 − exp(−(l(s_i^k) − µ(l)) / std(l))] / [1 + exp(−(l(s_i^k) − µ(l)) / std(l))]

where l(s_i^k) is the length of sentence s_i, k is the cluster to which sentence s_i belongs, µ(l) is the mean length of the sentences in the kth cluster, and std(l) is the standard deviation of the sentence lengths in the kth cluster.

73 Multi-objective Clustering based Framework for Extractive Single Document Summarization

• Position of the sentence in the document (F2): In most documents, relevant sentences tend to be found in the title, the leading sentences of paragraphs, etc. It is expressed as p_i = 1/q_i, where q_i is the position of the ith sentence.

• Similarity with title (F3): It is calculated as sim_title(s_i^k) = dist_wmd(s_i^k, title), where title is the headline/title of the document to which sentence s_i^k belongs.

• Anti-redundancy (F4): In a summary, all sentences should be different from each other to reduce redundancy. Therefore, in each cluster, an anti-redundancy value is calculated for each sentence. It is expressed as antred(s_i^k) = Σ_{j=1, j≠i}^{|c_k|} dist_wmd(s_i^k, s_j^k), where s_i^k is the ith sentence in the kth cluster and |c_k| is the number of sentences in the kth cluster.

Finally, sentence score is calculated by assigning different weights to various factors (defined above) as:

sentence score(s_i) = α × F1 + β × F2 + γ × F3 + δ × F4        (4.2)

where α, β, γ and δ are the weights assigned to the different factors such that α + β + γ + δ = 1 (an illustrative sketch of these summary-generation steps is given after this list).

4. Arrange the sentences present in each cluster in descending order of their sentence scores.

5. Now, to generate the summary, clusters are considered rank-wise. Given a cluster, the top-ranked sentences are extracted sequentially until the summary length reaches a threshold (in terms of number of words).
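As referenced after Eq. (4.2), the following is a minimal sketch of the summary-generation module described in steps 1-5 above; dist_wmd (a precomputed N × N matrix of word-mover distances), feats (the per-sentence F1-F4 values), clusters and centers are assumed inputs, and the function mirrors Eqs. (4.1)-(4.2) rather than reproducing the thesis implementation.

```python
import numpy as np

def generate_summary(sentences, clusters, centers, dist_wmd, feats,
                     weights=(0.28, 0.31, 0.31, 0.10), max_words=100):
    """sentences: token lists; clusters: {cluster_id: [sentence indices]};
    centers: {cluster_id: center sentence index}; dist_wmd: N x N WMD matrix;
    feats: N x 4 matrix holding the F1..F4 values of every sentence."""
    n = len(sentences)
    # Eq. (4.1): document center = sentence with minimum average WMD to the others
    m = int(np.argmin([dist_wmd[i].sum() / (n * (n - 1) / 2) for i in range(n)]))
    # step 2: rank clusters by the WMD of their center to the document center (descending)
    ranked = sorted(clusters, key=lambda k: dist_wmd[centers[k], m], reverse=True)
    # steps 3-5: weighted sentence scores (Eq. 4.2), then extract top sentences cluster by cluster
    scores = feats @ np.asarray(weights)
    summary, length = [], 0
    for k in ranked:
        for i in sorted(clusters[k], key=lambda i: scores[i], reverse=True):
            if length + len(sentences[i]) > max_words:
                return summary
            summary.append(sentences[i])
            length += len(sentences[i])
    return summary
```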

4.4 Experimental Setup

This section covers the datasets used for our experiments, the evaluation measures, the methods considered for comparison with our proposed approach, and the parameter settings.

4.4.1 Datasets

For the evaluation of our proposed algorithm, gold standard data from the Document Understanding Conference3 for the years 2001 and 2002 are used. DUC2001 is a collection of 30 topics (such as natural disasters, biographical information, etc.) and DUC2002 is a collection of 59 topics. These contain 309 and 567 news reports (in the form of documents), respectively, written in English. These reports are taken from newspapers and news agencies such as the Associated Press and the Wall Street Journal. Each topic is accompanied by a reference/gold/actual summary for

3https://www-nlpir.nist.gov/projects/duc/data.html

the single document as well as the multi-document task. For single document summarization, the reference summary is approximately 100 words long. A brief description of the datasets used is given in Table 4.1.

Table 4.1: Brief description of the datasets used for single document summarization
 | DUC2001 | DUC2002
#Topics | 30 | 59
#Documents | 309 | 567
Source | TREC | TREC
Length of summary (in words) | 100 | 100

4.4.2 Evaluation Measure

For the evaluation of the results, we have used the ROUGE toolkit [134] in its version 1.5.5, as adopted by DUC for automatic summarization evaluation. It measures the number of overlapping N-gram units between the reference summary and the system (predicted) summary. For the mathematical definition of the ROUGE score calculation, refer to Section 2.3.2 of Chapter 2. In this work, the values of N considered are 1 and 2, providing ROUGE-1 and ROUGE-2, respectively.
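For illustration, ROUGE-N recall (the n-gram overlap between system and reference summaries) can be approximated with the short sketch below; the official evaluation in this work uses the ROUGE 1.5.5 toolkit, so this is only a simplified stand-in.

```python
from collections import Counter

def rouge_n_recall(system: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram overlap count divided by the number of reference n-grams."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sys_ngrams, ref_ngrams = ngrams(system), ngrams(reference)
    overlap = sum(min(count, sys_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / max(sum(ref_ngrams.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))
```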

4.4.3 Comparing methods

The results obtained using the proposed approach, ESDS SMODE, are compared with various state-of-the-art techniques: MA-SingleDocSum [105], FEOM [104], Unified Rank [160], DE [55], NetSum [109], CRF [96], QCS [101], Manifold Ranking [98] and SVM [95]. These techniques are discussed in the literature survey described in Section 2.2.2 of Chapter 2. Two other baselines, namely ESDS MGWO and ESDS MWCA, are also developed in this chapter, in which the multi-objective grey wolf optimizer (MGWO) [163] and the multi-objective water cycle algorithm (MWCA) [164], respectively, are used as the underlying optimization strategy. The Grey Wolf Optimizer (GWO) is a meta-heuristic algorithm proposed by Mirjalili et al. [165], based on the leadership hierarchy and hunting procedure of grey wolves in nature, while the Water Cycle Algorithm [166, 167] is a meta-heuristic algorithm that mimics the water cycle process in nature, i.e., the flow of rivers and streams to the sea and of streams to rivers. For ESDS MGWO and ESDS MWCA, the same steps as in our proposed algorithm ESDS SMODE are used, excluding the utilization of SOM as a mating pool construction tool.

4.4.4 Parameter setting

This section discusses the parameters used by our ESDS SMODE algorithm. The different parameter values are as follows:


• MODE parameters: population size (|P|) = 10, mating pool size = 5, maximum number of generations (gmax) = 10, crossover probability (CR) = 0.8, F = 0.8, distribution index (ηm) of polynomial mutation = 20, pm = 0.6, insertion mutation probability [0.6, 0.85], deletion mutation probability [0.86, 1].

• SOM parameters: initial neighborhood size (σ0) = 2, initial learning rate = 0.1, number of training iterations in SOM = |P|. A sensitivity analysis on these parameters can be found in ref. [168].

• Some other parameters used by our algorithm are: maximum number of fitness evaluations (NFEs) = 200; weights of the sentence scoring features: α = 0.28, β = 0.31, γ = 0.31, δ = 0.10; system summary length (in words) = 100. The values of these parameters are selected after consulting the existing literature; most existing works [105, 169] consider similar values for these parameters.

• After analyzing the reference summaries, it has been observed that a particular summary contains at least 4 sentences. Therefore, the minimum number of clusters is kept as 4 so that one sentence from each cluster can take part in the system summary. The maximum number of clusters is kept as 14.

• To represent the sentences in vector form, the word vectors of the constituent words are averaged. These word vectors are obtained from the word2vec4 model pre-trained on the GoogleNews corpus, and each word vector has 300 dimensions. Word Mover Distance also makes use of the same4 model to calculate the distance between two sentences.

The results obtained are averaged over 10 runs of the algorithm. We have also performed a sensitivity analysis on the MODE parameters (CR, F and ηm) using the Taguchi [170, 171, 172] method. Note that we have used the default parameters to run MGWO5 and MWCA6 for our summarization task. The codes of these algorithms are available online.

Tuning of MODE parameters used in ESDS SMODE: To tune the MODE parameters using the Taguchi method, three parameters, CR, F and ηm, are used. We have excluded the population size and the number of iterations (fitness evaluations) as parameters because our main objective is to show convergence towards the true Pareto optimal solutions in a small number of iterations. Further, the SOM parameters (like the number of training iterations and the number of neurons) depend on the population size in our task; therefore, these are also excluded from parameter tuning. The first parameter (to be tuned) is the crossover probability, lying in [0, 1]; the second parameter is a

4https://github.com/mmihaltz/word2vec-GoogleNews-vectors
5http://www.alimirjalili.com/GWO.html
6http://www.ali-sadollah.com/water-cycle-algorithm-wca

control factor which lies in [0, 2]; the third one is the distribution index used in polynomial mutation. In the literature [168, 173], the typical value of ηm lies in [19, 21]; therefore, we have kept the same range.

In this method, two kinds of factors are used: control factors, i.e., those whose optimal values are to be determined, and uncontrolled factors (also called noise factors). The main aim of this method is to determine the control factors by maximizing the response, which for our summarization task is the ROUGE-1 score (as we select the best solution in the population based on the best ROUGE-1 recall value). Due to the large number of documents available for the summarization task, we have considered 5 random documents from the DUC2001 dataset and 5 random documents from the DUC2002 dataset to tune the parameters.

Three levels are used for the possible values of these control parameters with the L9 orthogonal array of the Taguchi method. Each level corresponds to a set of different combinations of parameter values. These levels are shown in Table 4.2. Figure 4.2 shows the results generated by the Taguchi method. As can be seen from this figure, the optimized values of CR, F and ηm are 0.8, 0.8 and 20, respectively, which are used in our experiments to summarize the single documents.

Table 4.2: Experiment results of ESDS SMODE on different parameter combinations. The values of CR, F and eta corresponding to levels (1, 2, 3) are (0.4, 0.6, 0.8), (0.3, 0.8, 1.5) and (19, 20, 21), respectively. Here, SNRA is the Signal-to-Noise Ratio, and MEAN is the mean of the uncontrolled factor values (ROUGE-1 scores) over the different documents.
Run | CR | F | eta | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | doc7 | doc8 | doc9 | doc10 | SNRA | MEAN
1 | 1 | 1 | 1 | 0.40299 | 0.32759 | 0.44805 | 0.32000 | 0.72656 | 0.30357 | 0.49275 | 0.40000 | 0.38462 | 0.43038 | -8.16215 | 0.42365
2 | 1 | 2 | 2 | 0.42405 | 0.32759 | 0.46667 | 0.33125 | 0.68382 | 0.30822 | 0.49275 | 0.39080 | 0.41333 | 0.45205 | -7.96725 | 0.42911
3 | 1 | 3 | 3 | 0.40299 | 0.32759 | 0.44156 | 0.36567 | 0.68382 | 0.30822 | 0.51429 | 0.46341 | 0.39063 | 0.41489 | -7.88930 | 0.43137
4 | 2 | 1 | 2 | 0.40299 | 0.32759 | 0.48592 | 0.33333 | 0.61594 | 0.30822 | 0.52703 | 0.45161 | 0.36000 | 0.42308 | -8.03722 | 0.42363
5 | 2 | 2 | 3 | 0.43846 | 0.32759 | 0.46000 | 0.30328 | 0.72656 | 0.30822 | 0.49275 | 0.44444 | 0.37681 | 0.40000 | -8.15299 | 0.42787
6 | 2 | 3 | 1 | 0.40299 | 0.32759 | 0.42157 | 0.35000 | 0.71928 | 0.30822 | 0.46333 | 0.42141 | 0.36333 | 0.41489 | -8.12724 | 0.41932
7 | 3 | 1 | 3 | 0.40299 | 0.33077 | 0.45570 | 0.33333 | 0.61475 | 0.30822 | 0.48485 | 0.42500 | 0.36709 | 0.42308 | -8.13198 | 0.41464
8 | 3 | 2 | 1 | 0.42258 | 0.33861 | 0.46805 | 0.38235 | 0.64498 | 0.30822 | 0.48858 | 0.46154 | 0.37681 | 0.44078 | -7.72244 | 0.43407
9 | 3 | 3 | 2 | 0.42543 | 0.33726 | 0.46805 | 0.33333 | 0.72656 | 0.30822 | 0.51429 | 0.47532 | 0.37681 | 0.43740 | -7.79790 | 0.44115

Table 4.3: ROUGE scores of different methods on the DUC2001 and DUC2002 data sets
Method | DUC2001 Avg. ROUGE-2 | DUC2001 Avg. ROUGE-1 | DUC2002 Avg. ROUGE-2 | DUC2002 Avg. ROUGE-1
ESDS SMODE | 0.21450 | 0.45214 | 0.34132 | 0.49117
ESDS MGWO | 0.15228 | 0.37108 | 0.18838 | 0.41849
ESDS MWCA | 0.14997 | 0.36702 | 0.18812 | 0.41800
MA-SingleDocSum [105] | 0.20142 | 0.44862 | 0.22840 | 0.48280
DE [55] | 0.18523 | 0.47856 | 0.12368 | 0.46694
UnifiedRank [160] | 0.17646 | 0.45377 | 0.21462 | 0.48487
FEOM [104] | 0.18549 | 0.47728 | 0.12490 | 0.46575
NetSum [109] | 0.17697 | 0.46427 | 0.11167 | 0.44963
CRF [96] | 0.17327 | 0.45512 | 0.10924 | 0.44006
QSC [101] | 0.18523 | 0.44852 | 0.18766 | 0.44865
SVM [95] | 0.17018 | 0.44628 | 0.10867 | 0.43235
Manifold Ranking [98] | 0.16635 | 0.43359 | 0.10677 | 0.42325


Figure 4.2: Results generated by the Taguchi method. Here, SN is the Signal-to-Noise Ratio, which we have to maximize. SN is maximum for CR, F and ηm at levels 3, 2 and 2, respectively.

4.5 Experimental Results and their Discussion

4.5.1 Comparison with Existing algorithms

Table 4.3 presents the ROUGE scores obtained by our proposed approach, ESDS SMODE, and different state-of-the-art methods on the DUC2001 and DUC2002 data sets. It can be seen that our approach outperforms all other approaches on both data sets with respect to the ROUGE-2 measure. On comparing ROUGE-1 for DUC2001, our system is only 0.0457 points lower than the best system (DE), while it differs from the other systems by 0.027 on average. For DUC2002, the ROUGE-1 obtained by our approach is better than the state-of-the-art methods. Our method is fully unsupervised in nature, as compared to supervised systems like CRF [96], QCS [101], Manifold Ranking [98] and SVM [95] developed for text summarization. FEOM [104], DE [55] and MA-SingleDocSum [105] make use of evolutionary algorithms to optimize a single objective or a weighted sum of objectives to achieve a better ROUGE score, whereas our algorithm is based on the simultaneous optimization of two objectives. These algorithms (FEOM, DE and MA-SingleDocSum) make use of reproduction operators like random selection, tournament selection, etc., similar to single-objective optimization problems, to generate solutions using crossover/mutation operations. The use of a self-organizing map based reproduction operator in our proposed algorithm, ESDS SMODE, helps in generating good-quality solutions [168]. Note that, to the best of our knowledge, SOM was never utilized before in text summarization frameworks based on evolutionary algorithms.

Also, none of the compared approaches uses word mover distance to capture the semantics between sentences. After comparing ESDS SMODE with ESDS MGWO and ESDS MWCA, it was found that our approach performs better than these techniques. Thus it can be concluded from the obtained results that the use of WMD as the sentence dissimilarity measure and of self-organized multi-objective differential evolution for sentence clustering indeed helps in achieving improved performance. A higher value of ROUGE-2 implies that the word orderings present in the summary produced by our system follow the reference summary more closely, which also indicates the increased fluency of the produced summary. ROUGE-1, on the other hand, does not take fluency into account because it simply counts the number of overlapping 1-gram units between the reference and candidate summaries and cannot determine whether the result is coherent or the sentences flow together in a sensible manner. As our proposed algorithm is based on an evolutionary concept, it provides a set of Pareto optimal solutions at the end, lying on the Pareto optimal front. The Pareto fronts obtained over three documents of the DUC2001 dataset are shown in Figure 4.3. On comparing CPU time, it was found that ESDS SMODE takes more time, i.e., on average 78.99 seconds/document for the DUC2001 dataset and 50.34 seconds/document for the DUC2002 dataset.

Table 4.4: Improvements obtained by our proposed approach over other methods based on the ROUGE-2 score
Methods | Improvement on DUC2001 (%) | Improvement on DUC2002 (%)
ESDS MGWO | 40.86 | 81.19
ESDS MWCA | 43.03 | 81.44
MA-SingleDocSum | 6.49 | 49.44
DE | 15.80 | 175.97
UnifiedRank | 21.56 | 59.03
FEOM | 15.64 | 173.27
NetSum | 21.21 | 205.65
CRF | 23.80 | 212.45
QSC | 15.80 | 81.88
SVM | 26.04 | 214.09
Manifold Ranking | 28.94 | 219.68

4.5.2 Improvements obtained

Improvements obtained (IO) by the proposed method ESDS SMODE (as it is better than the others) based on the ROUGE-2 score for both datasets are shown in Table 4.4. Mathematically, IO is calculated as:

IO = ((ProposedMethod − OtherMethod) / OtherMethod) × 100        (4.3)

In Table 4.4, a comparison with the MA-SingleDocSum, ESDS MGWO and ESDS MWCA



Figure 4.3: Pareto Fronts obtained by ESDS SMODE over three documents of DUC2001 dataset. In (b) and (c), all solutions are of rank-1.

approaches on the DUC2001 dataset shows that our approach improves over them by 6.49%, 40.86% and 43.03%, respectively, in terms of the ROUGE-2 score, while for the DUC2002 dataset, our approach improves by 49.44%, 81.19% and 81.44%, respectively, with respect to the ROUGE-2 score.

Table 4.5 shows the improvements obtained by our approach ESDS SMODE in comparison with other methods using the ROUGE-1 score for the DUC2002 dataset. Here, the improvements obtained by our approach with respect to the UnifiedRank, ESDS MGWO and ESDS MWCA approaches are 1.30%, 17.37% and 17.50%, respectively. In Table 4.6, a comparison is made between DE and other methods using the ROUGE-1 score on the DUC2001 dataset, as DE achieves the highest ROUGE-1 score on DUC2001. It can be seen that DE improves over the FEOM, ESDS MGWO and ESDS MWCA methods by 0.27%, 28.96% and 30.29%, respectively.


Table 4.5: Improvements obtained by our proposed approach over other methods using the ROUGE-1 score on the DUC2002 dataset
Methods | DUC2002 (%)
ESDS MGWO | 17.37
ESDS MWCA | 17.50
MA-SingleDocSum | 1.73
DE | 5.19
UnifiedRank | 1.30
FEOM | 5.46
NetSum | 9.24
CRF | 11.61
QSC | 9.48
SVM | 13.60
Manifold Ranking | 16.05

Table 4.6: Improvements obtained by DE over other methods using the ROUGE-1 score on the DUC2001 dataset
Methods | DUC2001 (%)
Proposed approach | 5.84
ESDS MGWO | 28.96
ESDS MWCA | 30.29
MA-SingleDocSum | 6.67
UnifiedRank | 5.46
FEOM | 0.27
NetSum | 3.08
CRF | 5.15
QSC | 6.70
SVM | 7.23
Manifold Ranking | 10.37

Figure 4.4: An example of a good-quality generated summary with respect to the reference summary for the document AP880316-0061 of topic d21d under the DUC2001 dataset.

4.5.3 Analysis of Results

For error analysis, we have considered some random documents from the datasets. Some examples of good and bad quality summaries obtained by the ESDS SMODE approach are also illustrated in this work. In Figure 4.4, an example of a predicted summary is shown with respect to the actual summary of document AP880316-0061 of topic d21d under the DUC2001 dataset. Matched sentences are shown in the same color. Here, the predicted summary covers most of the information in the reference summary, having 0.8918 and 0.7500 as ROUGE-1 and ROUGE-2 recall scores, respectively, and is thus considered a good summary. Figure 4.5, in contrast, shows an example of a predicted summary which does not seem to be good; it obtains 0.3823 and 0.1025 as ROUGE-1 and ROUGE-2 recall scores, respectively.


Figure 4.5: An example of a low-quality summary. (a) Some sentences of the document AP891101-0150 of topic d16c under the DUC2001 dataset. (b) Reference summary and predicted summary of the same document.

In part (a) of Figure 4.5, some sentences of the document are shown, and the underlined sentences/words are part of the actual summary. In part (b) of Figure 4.5, the actual and predicted summaries are shown. Here, the underlined words of the document form sentences in the actual summary, as it was generated by human annotators. Our proposed system, however, performs sentence-based extractive summarization; it is able to select original sentences from the document but cannot modify them. For this reason, the ROUGE score is very low. Some of the documents

in the dataset have this type of property. For the DUC2001 data set, our error analysis reveals the following possible reason behind the low ROUGE-1 value of our proposed approach: in general, a summary rarely contains questions, but in a few of the generated summaries, questions are selected by our proposed approach, which decreases the ROUGE score. To rectify this, another sentence scoring feature should be used that considers the type of sentence. Another observation for both datasets is that the sentence ordering in the generated summary usually plays a significant role for better readability and fluency. But in our approach, the sentences in the generated summary are arranged based on the rankings of the corresponding clusters. Thus, we may not get the sentences in the order in which they appear in the original document, which may decrease the readability of the summary. Therefore, the sentence arrangement must be taken into account to improve readability; some post-processing steps can be applied to rearrange the sentences selected for the summary to increase fluency.

4.5.4 Statistical significance t-test

To prove that the results obtained by the proposed approach ESDS SMODE are statistically significant, we have also conducted a statistical t-test [156] at the 5% significance level. This t-test provides p-values; the smaller the p-value, the more significant our result is. A more detailed description of the t-test can be found in Section 3.4.5 of Chapter 3. To conduct this test, two groups are considered: the first group contains the list of ROUGE-1 (ROUGE-2) scores produced by our SOM based approach, and the other group contains the list of ROUGE-1 (ROUGE-2) scores of an existing method. The p-values obtained are: a) using ROUGE-2 for DUC2001, less than 0.00001; b) using ROUGE-1 for DUC2001, 0.000043; c) using ROUGE-2 for DUC2002, less than 0.00001; d) using ROUGE-1 for DUC2002, 0.004183. The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.

4.5.5 Complexity of the proposed framework

Let N be the number of input sentence vectors and g be the maximum number of fitness evaluations.
1) The first step of our algorithm is population initialization, performed using the K-medoid clustering algorithm run multiple times with varied numbers of clusters to generate the solutions. The K-medoid algorithm takes O(tK(N − K)²) time [174]. Here, K is the number of clusters and t is the number of iterations needed to converge. If there are P solutions, then for each solution we need to calculate M objective functions; therefore, the total complexity of initializing the population (including objective function calculation) is O(P(tK(N − K)² + M)).


2) The solutions in the population undergo SOM training, which takes O(P^2) time [157]. 3) Mating pool construction takes O(P^2) time, as the neighborhood relationship (NR) must be extracted for each solution using the trained SOM. 4) Other genetic operators like crossover and mutation take constant time per solution, as they involve only a few arithmetic computations; generating a new solution for each solution in the population therefore takes O(P) time. 5) The K-Medoid clustering steps are applied to each new solution and the objective function values are calculated, which takes O(tK(N − K)^2 + M) time. 6) Non-dominated sorting and crowding distance calculation take O(MP^2) time [29] in the worst case, as for each objective every solution must be compared with every other solution. Thus, the total run-time complexity is

O(P(tK(N − K)^2 + M) + g(P^2 + P^2 + P + P(tK(N − K)^2 + M) + MP^2))

Here, steps 2 to 6 are repeated for up to g fitness evaluations.

⟹ O(P(tK(N − K)^2 + M) + g(2P^2 + P + P(tK(N − K)^2 + M) + MP^2))
⟹ O(P(tK(N − K)^2 + M) + gP + gP^2(2 + M) + gP(tK(N − K)^2 + M))
⟹ O(PtK(N − K)^2(1 + g) + gP + gMP^2)
⟹ O(gPtK(N − K)^2 + gP(1 + MP))
⟹ O(gPtK(N − K)^2 + gMP^2)
⟹ O(gP(tK(N − K)^2 + MP))

Thus, the total complexity of our proposed system is O(gP(tK(N − K)^2 + MP)).

4.6 Chapter Summary

In this chapter, a clustering-based framework for extractive single document text summarization (ESDS) is proposed utilizing MODE and SOM. The developed approach first uses a sentence clustering technique to partition the available sentences in semantic space in an automated way. Two sentence-cluster quality measures are optimized simultaneously using different multi-objective search techniques. Finally, representative sentences are selected from the different sentence clusters using some sentence scoring features. Results on standard datasets prove the efficacy of the proposed technique compared to state-of-the-art techniques in terms of the ROUGE-2 score. Our best approach improves by 6.49% for DUC2001, while for the DUC2002 dataset our best approach improves by 49.44% over the best existing approach, namely,


MA-SingleDocSum. In terms of the ROUGE-1 measure for the DUC2002 dataset, our best approach improves by 1.30% over the UnifiedRank approach. In future work, we plan to (a) extend the current approach for automatic adaptation of various parameters; (b) explore various sentence representation schemes; (c) explore different sentence similarity/dissimilarity measures that will help in forming sentence clusters; and (d) solve the same task using a binary optimization framework. Accordingly, in the next chapter, the problem of single document summarization is solved considering items (c) and (d).


CHAPTER 5

Extractive Single Document Summarization using Multi-objective Binary Differential Evolution

In this chapter, we formulate the problem of extractive single document text summarization as a binary optimization problem. Different statistical features (objective functions) like cohesion, readability, coverage, and the similarity of a sentence with the title, among others, are considered while generating the summary. Self-organizing map (SOM) based genetic operators are incorporated in the optimization process to assess the resulting performance improvements. As the choice of similarity or dissimilarity measure between sentences plays an influential role in any summarization process, different existing measures like normalized Google distance, word mover distance, and cosine similarity are explored in this work.


5.1 Introduction

5.1.1 Overview

In the last chapter, the task of extractive single-document summarization was solved using a multi-objective clustering framework. In this chapter, the same task is solved differently: the summarization problem is treated as a binary optimization problem where different quality measures (or objective functions) of the summary are optimized simultaneously. These objective functions include the position of the sentence in the document, the similarity of a sentence with the title, the length of the sentence, cohesion, readability, and coverage. Multi-objective binary differential evolution (MOBDE) [46] is used as the underlying optimization strategy to optimize all objective functions simultaneously, where each chromosome (or solution) is a binary string representing a set of possible sentences to be selected in the generated summary. Optimizing multiple objective functions helps in generating a good-quality summary for a given document and in attaining a better ROUGE score. As in previous chapters, SOM is utilized for constructing the mating pool and for exploring the search space efficiently. To show that the performance of the proposed summarization technique depends not only on the objective functions considered but also on the type of sentence similarity/dissimilarity function used, experiments are conducted by varying the similarity/dissimilarity measures: normalized Google distance (NGD) [175], word mover distance (WMD) [54], and cosine similarity [55]. The proposed approach is tested on two standard text summarization datasets, namely DUC2001 and DUC2002 (https://www-nlpir.nist.gov/projects/duc/data.html). The results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques.

5.1.2 Contributions

The major contributions of the current work are enumerated below:

• In the literature, the ESDS problem is often formulated as a single objective optimization problem with the weighted sum of different objectives [105, 55], and this is popularly solved using different EA techniques. However, in this work, the summarization problem is treated as a multi-objective optimization problem where various aspects of a summary like readability, the similarity of the sentences in the summary with the title, and others are optimized simultaneously.

• In the existing multi-objective evolutionary algorithms for summarization task, usually,


reproduction operators like roulette wheel selection and tournament selection [29], which are popularly used in single-objective optimization frameworks, are used to generate new solutions. However, in the current study, some newly designed self-organizing map based genetic operators are used to generate high-quality solutions, which further help in reaching globally optimal solutions more quickly.

• In order to show that the performance of the summarization system depends not only on the objective functions used but also on the type of similarity/dissimilarity measure used between sentences, three types of similarity/dissimilarity measures (normalized Google distance [175], word mover distance [54], and cosine similarity [55]) are explored in this work; a small illustrative sketch of these measures is given after this list.

• Most of the papers on summarization using several optimization strategies make use of an actual summary to report the results. Nevertheless, in real-time situations, the actual summary may not be available. Therefore, in this work, we explored various unsupervised strategies for selecting a single solution from the final Pareto optimal front produced by any multi-objective optimization-based technique.
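The sketch below illustrates the three measures for tokenized sentences. It is a hedged sketch only: the function names, the word-pair statistics used for NGD, and the word2vec file path are illustrative assumptions, not the thesis implementation.

# Hedged sketch of the three sentence similarity / dissimilarity measures
# explored in this chapter (names and aggregation choices are assumptions).
import math
from collections import Counter

import numpy as np


def cosine_similarity(tokens_a, tokens_b):
    # Cosine similarity between the term-frequency vectors of two tokenised sentences.
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    va = np.array([ca[w] for w in vocab], dtype=float)
    vb = np.array([cb[w] for w in vocab], dtype=float)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0


def ngd(x, y, doc_freq, co_freq, n_docs):
    # Standard normalized Google distance between two words, computed here from
    # document frequencies, co-occurrence counts, and the total number of documents.
    fx, fy, fxy = doc_freq.get(x, 0), doc_freq.get(y, 0), co_freq.get((x, y), 0)
    if fx == 0 or fy == 0 or fxy == 0:
        return 1.0  # treat missing counts as highly dissimilar
    num = max(math.log(fx), math.log(fy)) - math.log(fxy)
    den = math.log(n_docs) - min(math.log(fx), math.log(fy))
    return num / den


# Word mover distance via gensim, using the pre-trained GoogleNews vectors
# mentioned in Section 5.5.4 (smaller distance = more similar sentences):
#   from gensim.models import KeyedVectors
#   wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
#   distance = wv.wmdistance(tokens_a, tokens_b)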

5.2 Statistical Features or Objective Functions

To obtain a good summary, the selection of objective functions (quality functions on sentences) is crucial. These objective functions assign fitness values to the sentences and thus help in improving the quality of the generated summary. The set of objective functions used in our approach is: the position of the sentence in the document, the similarity of a sentence with the title, the length of the sentence, cohesion, coverage, and readability. The first five objective functions are selected motivated by [105], whose authors optimized a weighted sum of these five objectives and showed that their results are better than state-of-the-art results. However, combining the values of different objective functions into a single value using weighted criteria may not be meaningful [31]. Moreover, in any text summarization system readability is an important factor, as the generated summary should be readable to end-users; therefore, in our approach the readability feature is considered as a sixth objective function. All of these objective functions have to be maximized simultaneously using a multi-objective optimization framework instead of a weighted-sum approach. Brief descriptions of these objective functions are provided below:


5.2.1 Sentence Position

In any document, regardless of domain, relevant/informative sentences tend to be found in particular sections of the document, such as the leading paragraph. To take this information into account, sentence position [176, 177] is expressed as:

p = \sum_{\forall s_i \in Summary} \sqrt{1 / q_i} \qquad (5.1)

where q_i is the position of the i-th sentence. This measure assigns higher scores to the initial sentences of the document; as the position of a sentence in the document increases, its contribution to p decreases.
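A minimal sketch of this objective, assuming the (1-based) positions of the selected sentences are known; the function name is illustrative.

import math

def sentence_position_score(positions):
    # Eq. 5.1: sum of sqrt(1 / q_i) over the positions q_i of the summary sentences.
    return sum(math.sqrt(1.0 / q) for q in positions)

# A summary built from early sentences scores higher:
#   sentence_position_score([1, 2, 4])    -> about 2.21
#   sentence_position_score([10, 15, 20]) -> about 0.80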

5.2.2 Similarity with Title

Sentences in the summary should be similar to the title [178] to obtain a good summary because the title describes the theme of the document. This objective function is defined as given below:

SWT_{avg} = \frac{\sum_{\forall s_i \in Summary} sim(s_i, title)}{O}, \qquad (5.2)

SWT = \frac{SWT_{avg}}{\max_{\forall Summary} SWT} \qquad (5.3)

where title is the headline/title of the document to which sentence s_i belongs, sim(s_i, s_j) is the similarity between sentences s_i and s_j, O is the number of sentences in the generated summary,

SWT_{avg} is the average similarity of the sentences in the summary with the title, \max_{\forall Summary} SWT is the maximum average similarity of the sentences with the title over all summaries, and SWT is the similarity factor of the summary S with the title. SWT is close to 1 if the sentences in the summary are closely related to the title of the document.

5.2.3 Sentence Length

The literature suggests that shorter sentences are less likely to appear in the summary [162]. In the current work, a normalization-based sigmoid function [179] is used that favors the longest sentences without entirely ruling out medium-length sentences. Mathematically, it is expressed as:

\sum_{\forall s_i \in Summary} \frac{1 - \exp\left(-\frac{l(s_i) - \mu(l)}{std(l)}\right)}{1 + \exp\left(-\frac{l(s_i) - \mu(l)}{std(l)}\right)} \qquad (5.4)


where \mu(l) is the mean length of the sentences in the summary, l(s_i) is the length of sentence s_i, and std(l) is the standard deviation of the lengths of the sentences in the summary S.
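A small sketch of Eq. 5.4, with the mean and standard deviation computed over the lengths of the summary sentences as defined above; the names are illustrative.

import math

def sentence_length_score(summary_lengths):
    # Eq. 5.4: sigmoid-normalised sentence-length score for the selected sentences.
    mu = sum(summary_lengths) / len(summary_lengths)
    var = sum((l - mu) ** 2 for l in summary_lengths) / len(summary_lengths)
    std = math.sqrt(var)
    if std == 0:
        return 0.0  # all selected sentences have the same length
    score = 0.0
    for l in summary_lengths:
        z = (l - mu) / std
        score += (1 - math.exp(-z)) / (1 + math.exp(-z))
    return score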

5.2.4 Cohesion

Cohesion [169] measures the relatedness of the sentences in the summary; for a good summary, the selected sentences should be tightly related to each other. It is expressed as

COH = \frac{\log(C_s \times 9 + 1)}{\log(M \times 9 + 1)} \qquad (5.5)

where

C_s = \frac{\sum_{\forall s_i, s_j \in Summary} sim(s_i, s_j)}{O_s} \quad \text{and} \quad O_s = \frac{N \times (N - 1)}{2} \qquad (5.6)

M = \max sim(s_i, s_j), i, j \leq N. Here, C_s measures the average similarity of the sentences in the summary, sim(s_i, s_j) is the similarity between sentences s_i and s_j, N is the total number of sentences in the document, and M is the maximum similarity between two sentences. COH ranges within [0, 1]; a value of 1 indicates that the sentences in the summary are highly related to each other.

5.2.5 Coverage

Coverage (CoV) [105] measures the extent to which sentences in the summary provide useful information about the document and should be maximized. Coverage is defined as

CoV = \sum_{\forall s_i \in Summary}\; \sum_{\forall s_j \in Doc,\, s_i \neq s_j} \frac{sim(s_i, s_j)}{N - 1} \qquad (5.7)

where s_i and s_j are sentences belonging to the generated summary and the document, respectively,

Doc is the document, N is the number of sentences in the document, and sim(s_i, s_j) is the similarity between sentences s_i and s_j.

5.2.6 Readability Factor

The readability factor [169] is the last objective function and is the most important factor for summary formation: in a readable summary, each sentence should be related to the previous one. It is expressed as:

R = \sum_{i=2}^{N_p} sim(s_i, s_{i-1}) \qquad (5.8)

where N_p is the number of sentences in the predicted summary, s_i and s_{i-1} are two consecutive sentences in the predicted summary, and sim(s_i, s_{i-1}) is the similarity between them.
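The sketch below implements Eqs. 5.5-5.8 literally for a candidate summary, given any of the sentence similarity functions discussed in this chapter; the function names and the way sentences are represented are illustrative assumptions.

import math

def cohesion(summary_sents, doc_sents, sim):
    # Eqs. 5.5-5.6: COH = log(9*Cs + 1) / log(9*M + 1), with Os = N(N-1)/2.
    n = len(doc_sents)
    o_s = n * (n - 1) / 2.0
    pair_sum = sum(sim(a, b) for i, a in enumerate(summary_sents)
                   for b in summary_sents[i + 1:])
    c_s = pair_sum / o_s
    m = max(sim(a, b) for i, a in enumerate(doc_sents) for b in doc_sents[i + 1:])
    return math.log(9 * c_s + 1) / math.log(9 * m + 1)

def coverage(summary_sents, doc_sents, sim):
    # Eq. 5.7: similarity of every summary sentence to the rest of the document,
    # normalised by N - 1.
    n = len(doc_sents)
    return sum(sim(si, sj) for si in summary_sents
               for sj in doc_sents if sj is not si) / (n - 1)

def readability(summary_sents, sim):
    # Eq. 5.8: sum of similarities between consecutive sentences of the summary.
    return sum(sim(summary_sents[i], summary_sents[i - 1])
               for i in range(1, len(summary_sents)))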

5.3 Problem Definition

Consider a document D consisting of N sentences, {s_1, s_2, ..., s_N}. Our main task is to find a subset of sentences S ⊆ D such that

\sum_{s_i \in S} l_i \leq S_{max} \qquad (5.9)

where S represents the main theme/topic of the document, i.e., the subset of sentences that covers the relevant information of the document, s_i is a sentence belonging to S, l_i measures the length of the i-th sentence in terms of the number of words, and S_{max} is the maximum number of words allowed in the generated summary. This summary should be optimal with respect to the various quality measures discussed in the previous section.

5.4 Proposed Methodology

In this chapter, two approaches are developed for sentence-based extractive single document summarization. Both approaches utilize a binary version of the multi-objective differential evolution (MOBDE) technique (discussed in Section 2.1.9 of Chapter 2) as the underlying optimization strategy. SOM-based genetic operators are introduced in the process to improve convergence. The flowchart of the proposed approach is shown in Fig 5.1 and the underlying steps are discussed in the subsequent sections.

1. Approach 1: In this approach, all objective functions are assigned importance factors/weights. For example, if the fitness values of the six objective functions are < ob1, ob2, ob3, ob4, ob5, ob6 > and the assigned weights are < α, β, γ, δ, λ, φ >, then < ob1 × α, ob2 × β, ob3 × γ, ob4 × δ, ob5 × λ, ob6 × φ > are optimized simultaneously. The values of these weights are selected after conducting a thorough literature survey [46, 105, 180].

2. Approach 2: In this approach, all objective functions are simultaneously optimized without assigning any weight values.

In the literature [105, 180], it was shown that some of the objective functions used in our approach have more importance than others. Therefore, Approach 1 is developed to see the effect of the


Figure 5.1: Proposed architecture, where g is the current generation number initialized to 0, g_max is the maximum number of generations defined by the user, and |P| is the number of solutions in the population. After step 8, g is incremented by 1 and the process continues until the maximum number of generations is reached.

varying importance of the different objective functions.

5.4.1 Preprocessing

Before generating the summary, a series of pre-processing steps is executed on the document: segmentation of the document into sentences, stop-word removal (frequent words like is, am, are, etc. are removed), case folding (conversion to lower case), and removal of punctuation marks. The nltk toolkit [181] is used for document segmentation and stop-word removal.
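A minimal pre-processing sketch with nltk; the resource names and the exact cleaning order are assumptions, and the thesis pipeline may differ in detail.

import string

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(document_text):
    # Segment the document into sentences and return, for each sentence, its
    # cleaned token list (lower-cased, stop words and punctuation removed).
    sentences = sent_tokenize(document_text)
    cleaned = []
    for sentence in sentences:
        tokens = [t.lower() for t in word_tokenize(sentence)]
        tokens = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
        cleaned.append(tokens)
    return sentences, cleaned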

5.4.2 Representation of Solution and Population Initialization

Any evolutionary algorithm starts with a set of solutions (or chromosomes), < x_1, x_2, ..., x_{|P|} >, called a population, where |P| is the number of solutions. As our approach is based on binary optimization, each solution is represented as a binary vector whose size equals the number of sentences in the document. For example, if a document consists of 10 sentences, a valid solution can be represented as [1, 0, 0, 1, 1, 0, 1, 0, 0, 0], indicating that the first, fourth, fifth, and seventh sentences of the original document should be in the summary. The initial population is generated randomly. While generating a solution, the constraint on the summary length is taken into account as \sum_{s_i \in Summary} l_i \leq S_{max}, where l_i measures the length of a sentence in terms of the number of words and S_{max} is the maximum number of words allowed in the generated summary.
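A small sketch of the binary encoding and a constrained random initialization; the greedy word-count check and function names are assumptions, not the exact thesis routine.

import random

def random_solution(sentence_lengths, s_max):
    # Build one binary chromosome whose selected sentences contain at most s_max words.
    chromosome = [0] * len(sentence_lengths)
    words_used = 0
    for idx in random.sample(range(len(sentence_lengths)), len(sentence_lengths)):
        if words_used + sentence_lengths[idx] <= s_max:
            chromosome[idx] = 1
            words_used += sentence_lengths[idx]
    return chromosome

def initialize_population(sentence_lengths, s_max, pop_size=40):
    # |P| = 40 in the parameter settings of Section 5.5.4.
    return [random_solution(sentence_lengths, s_max) for _ in range(pop_size)]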


5.4.3 Objective Functions Used

To measure the quality of each solution in the population, a set of objective/fitness functions are evaluated. These functions are discussed in the Section 5.2, and all are of maximization type. Note that optimization of these functions helps in getting a good quality summary.

5.4.4 SOM Training

In this step, the SOM [62, 30] is trained using the solutions in the population; the sequential learning algorithm (described in Algorithm 1 of Chapter 2) is used for training. The SOM helps in understanding the distribution structure of the solutions in the population: solutions that are close in the input space come close to each other in the output space (the neuron grid of the SOM).

5.4.5 Genetic Operators

In any evolutionary algorithm, genetic operators help in generating new solutions; this set of new solutions forms a new population, P′. In our framework, a new solution is generated from each solution using three genetic operators: mating pool generation, mutation, and crossover. Assume that at generation t we want to generate a new solution for the current solution, denoted x_{c,t}. The genetic operators are then applied as follows:

Mating Pool Generation

The mating pool consists of a set of solutions that can mate to generate new solutions. For its construction for the current solution, neighboring solutions are identified using the trained SOM, as discussed in Section 2.1.12 of Chapter 2. Note that the mating pool size is kept fixed.

Mutation and Crossover

To perform these genetic operations, three solutions, x_{r1,t}, x_{r2,t}, and x_{r3,t}, are randomly selected from the constructed mating pool. Thereafter, Eqs. 2.10, 2.11 and 2.12 of Section 2.1.10 in Chapter 2 are applied in sequence, giving rise to a new solution v_{c,t+1}. It is important to note that during the generation of the new solution v_{c,t+1}, all possible combinations of the mating pool (since the mating pool contains more than three solutions) are tried; mutation and crossover are performed for each combination, and then the summary-length constraint is checked. If more than one combination satisfies the constraint, the combination closest to the length constraint (considering the maximum number of words in the summary) is selected.
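The exact operators are given by Eqs. 2.10-2.12 of Chapter 2 and are not reproduced here. The sketch below shows a generic binary-DE trial-vector construction (DE/rand/1 mutation with scale factor F, binomial crossover with rate CR, and a sigmoid-based binarisation controlled by b); it only conveys the flavour of this step and is not claimed to be the thesis formulation, although the parameter values follow Section 5.5.4.

import math
import random

def de_trial_vector(x_current, x_r1, x_r2, x_r3, F=0.8, CR=0.2, b=6.0):
    # Generic binary-DE sketch: mutate three mating-pool members, recombine with the
    # current solution, and binarise the mutant genes with a sigmoid.
    n = len(x_current)
    j_rand = random.randrange(n)          # at least one gene is taken from the mutant
    trial = []
    for j in range(n):
        mutant = x_r1[j] + F * (x_r2[j] - x_r3[j])
        if random.random() < CR or j == j_rand:
            prob = 1.0 / (1.0 + math.exp(-b * (mutant - 0.5)))   # map mutant value to [0, 1]
            trial.append(1 if random.random() < prob else 0)
        else:
            trial.append(x_current[j])
    return trial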


5.4.6 Selection of the Best |P | Solutions for Next Generation

This step selects the best |P| solutions out of the old population (P) and the new population (P′); note that the size of P′ is equal to that of P. To perform this operation, the non-dominated sorting (NDS) and crowding distance operator (CDO) of NSGA-II are utilized [29]. For more details, refer to Section 3.2.6 of Chapter 2.
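A compact sketch of this NSGA-II style environmental selection (non-dominated sorting followed by crowding distance); all objectives are assumed to be maximised and the helper names are illustrative.

def dominates(a, b):
    # True if objective vector a dominates b (all objectives maximised).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(objs):
    # Return fronts as lists of indices into objs (front 0 = rank-1 solutions).
    fronts, remaining = [], set(range(len(objs)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][k])
        dist[order[0]] = dist[order[-1]] = float("inf")   # keep boundary solutions
        span = objs[order[-1]][k] - objs[order[0]][k] or 1.0
        for pos in range(1, len(order) - 1):
            dist[order[pos]] += (objs[order[pos + 1]][k] - objs[order[pos - 1]][k]) / span
    return dist

def select_next_population(objs, pop_size):
    # Pick pop_size indices from the merged old + new population.
    selected = []
    for front in non_dominated_sort(objs):
        if len(selected) + len(front) <= pop_size:
            selected.extend(front)
        else:
            dist = crowding_distance(objs, front)
            front.sort(key=lambda i: dist[i], reverse=True)
            selected.extend(front[:pop_size - len(selected)])
            break
    return selected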

5.4.7 Updation of SOM Training Data

In this step, the training data for the SOM is updated. In the next generation, the SOM will be trained using those selected solutions (out of the best solutions chosen in the previous step) that have not been seen before. It is important to note that the updated weight vectors of the neurons in the current generation are treated as the initial weight vectors of the neurons in the next generation.

5.4.8 Termination Condition

For any iterative procedure, a termination condition is required. In our work, the proposed algorithm is repeated until a maximum number of generations (iterations), g_max, is reached. This step is shown by the diamond box in Fig 5.1.

5.4.9 Selection of Single Best Solution and Generation of Summary

At the end of the final generation, a set of non-dominated solutions on the Pareto optimal front is generated by our MOO-based algorithm. Owing to the binary representation, every solution directly represents a summary (whereas in the previous chapter we had to extract sentences from the optimized clusters to generate the summary). Therefore, the best solution is selected as the one having the highest ROUGE-2 score. Note that calculating the ROUGE score requires a gold/reference summary, which may not be available in real-world situations; therefore, a single solution from the final Pareto optimal front should also be selectable using criteria that do not use any supervised information. To address this issue, we have explored various methods to select the best solution. Let us name the strategies that use supervised information (the available gold summary) and unsupervised information for selecting the single best solution from the final Pareto optimal front as SMaxRouge and UMaxRouge, respectively. The methods explored under the UMaxRouge policy are explained below:


1. Maximum values of six different objective functions and their combinations: coverage (MaxCov), readability (MaxRead), sentence length (MaxSenLen), sentence position (MaxSenPos), similarity with title (MaxSimTitle), and cohesion (MaxCoh). For each of these, the chosen single objective function (for example, the readability score) is evaluated for all solutions of the final generation, and the solution having the highest value of that objective is considered the best solution. Some combinations of these objective functions are also explored; in this case too, the solution with the highest combined value is considered the best solution. For example:

• MaxWeightSumAllObj: In this approach, summation of all objective functional values optimized in our approach is considered.

• MaxWeightSum2Obj: In MaxWeightSum2Obj, the summation of two objective functions, namely sentence position and sentence similarity with the title, is considered.

• MaxWeightSum3Obj: This is similar to MaxWeightSum2Obj; the only difference is that one more objective function, namely cohesion, is added.

2. Ensemble approach (EnSem): In this approach, we first consider all the sentences present in the summaries corresponding to all generated rank-1 solutions of the final Pareto optimal front. Then the frequency of occurrence of each of these sentences over the different summaries corresponding to the different rank-1 solutions is calculated as per Eq. 5.10. The sentences are then sorted based on their frequencies of occurrence and added one by one, in sorted order, to the final summary until the desired length is reached.

Let |PS| be the number of rank-1 solutions and PSS be the set of all unique sentences present in the summaries corresponding to these |PS| solutions. Assume that we want

to count the frequency of occurrence of the i-th sentence, sent_i, belonging to PSS. Then the following equation is used:

count_{sent_i} = \sum_{k=1}^{|PS|} B \quad \text{and} \quad B = \begin{cases} 1, & \text{if } sent_i \in PS_k \\ 0, & \text{otherwise} \end{cases} \qquad (5.10)

where PS_k is the k-th summary, corresponding to the k-th solution of a document. The same equation is used to calculate the count of the remaining sentences belonging to PSS.
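A small sketch of the counting in Eq. 5.10 and the frequency-ordered assembly of the ensemble summary; the index-based representation and names are assumptions.

from collections import Counter

def ensemble_summary(rank1_summaries, sentence_lengths, s_max):
    # rank1_summaries: one list of sentence indices per rank-1 solution (PS_k).
    # Count in how many rank-1 summaries each sentence occurs (Eq. 5.10), then add
    # sentences in decreasing order of that count until the word limit is reached.
    counts = Counter()
    for summary in rank1_summaries:
        counts.update(set(summary))          # B = 1 if sent_i belongs to PS_k
    selected, words_used = [], 0
    for sent_idx, _ in counts.most_common():
        if words_used + sentence_lengths[sent_idx] <= s_max:
            selected.append(sent_idx)
            words_used += sentence_lengths[sent_idx]
    return sorted(selected)                  # report in original document order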

Two other variations of the ensemble approach are also tried. After collecting the sentences of the rank-1 solutions (merged pool), they are sorted based on (a) maximum length;


(b) maximum sentence to title similarity. For both cases (a) and (b), final summary is generated by adding the sentences from the merged pool one by one following their sorted order until the desired length is reached. In this work, the approaches corresponding to (a) and (b) are named as EnSemMaxLen and EnSemMaxSentTitle, respectively.

Another strategy selects, among the rank-1 solutions, the summary that minimizes the reconstruction error between the document vector and the sentence vectors of that summary, where sentence vectors are obtained by averaging word2vec word vectors and the document vector by averaging the sentence vectors:

ReconsError_j = \sum_{i=1}^{K} \| DocVec - SentVec_i \|_2 \qquad (5.11)

where DocVec is the vector representing the document's theme, SentVec_i is the i-th sentence vector of the j-th summary (i.e., the summary corresponding to the j-th solution), K is the number of sentences in the j-th summary, and \| DocVec - SentVec_i \|_2 is the Euclidean distance between the document vector and the i-th sentence vector. In the current work, we name this approach MinReconsErrorWord2vec.

Averaging word vectors to obtain a sentence vector and then averaging sentence vectors to obtain the document vector somewhat reduces the semantics of the sentence and document vectors [51]. Therefore, we have also tried another approach based on Doc2vec [182], whose performance has been shown to be good when trained on large corpora with pre-trained word embeddings [182]. From the trained model we can directly obtain the document vector and the sentence vectors [183]. Here too we want to minimize the reconstruction error between the document vector and the generated summary, as in Eq. 5.11. We name this approach MinReconsErrorDoc2vec.
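A sketch of Eq. 5.11 with Doc2vec-inferred vectors, assuming a model trained as described later in Section 5.6.4; gensim's Doc2Vec and its infer_vector method are used, and the variable and file names are illustrative.

import numpy as np
from gensim.models.doc2vec import Doc2Vec

def reconstruction_error(model, document_tokens, summary_sentences):
    # Eq. 5.11: sum of Euclidean distances between the document vector and every
    # sentence vector of one candidate summary (summary_sentences is a list of
    # token lists).
    doc_vec = model.infer_vector(document_tokens)
    error = 0.0
    for sentence_tokens in summary_sentences:
        sent_vec = model.infer_vector(sentence_tokens)
        error += float(np.linalg.norm(doc_vec - sent_vec))
    return error

# MinReconsErrorDoc2vec: keep the rank-1 summary with the smallest error, e.g.
#   model = Doc2Vec.load("doc2vec_duc.model")   # hypothetical model file
#   best = min(candidates, key=lambda s: reconstruction_error(model, document_tokens, s))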

3. Maximum distance from the origin (MaxObjDistOrigin): As the six objective functions used in our proposed approach are of the maximization type, we calculate the Euclidean distance between the origin (0, 0, 0, 0, 0, 0) and the objective function values of each solution. The solution having the largest distance is selected as the best solution.

Note that the sentences present in the final summary are ordered according to their positions in the original document. For example, the sentence that appears first in the document will be the first sentence in the summary.

5.5 Experimental Setup

This section presents the datasets used for the experimentation, the evaluation metrics used to measure performance, and the comparing methods, followed by the parameter settings. All the proposed approaches were implemented on an Ubuntu server with an Intel Xeon CPU (2.20 GHz) and 256 GB of


RAM.

5.5.1 Datasets

To show the effectiveness of the proposed approach and to show that performance depends not only on the chosen objective functions but also on the type of similarity/dissimilarity measure used, two benchmark datasets, DUC2001 and DUC2002, from the Document Understanding Conference1 are used. These contain 309 and 567 news reports (documents), respectively, written in English. For each document, an original/actual summary of approximately 100 words is available for single document summarization. A brief description of the used datasets is provided in Table 4.1 of Chapter 2. In addition to these datasets, we have also used the CNN dataset [184], which contains news articles collected from the CNN news site https://edition.cnn.com/. It consists of 3000 news articles/documents, out of which only 1000 articles are made available by the authors of the dataset [184]. Note that this dataset was released as part of a competition on extractive summarization held with the ACM Symposium on Document Engineering2 in 2019. The actual summaries include 3-4 sentences on average.

5.5.2 Evaluation Measure

To evaluate the performance of the proposed architecture, we utilize the ROUGE-N measure [134]. For the mathematical definition of the ROUGE score, refer to Section 2.3.2 of Chapter 2. In our experiments, N takes the values 1 and 2 for ROUGE-1 and ROUGE-2, respectively.

5.5.3 Comparing Methods

For DUC2001 and DUC2002, we compare our proposed system with 13 existing systems. Some of these methods use supervised approaches, while others use neural networks; some are also based on optimization techniques aimed at improving the ROUGE score. The existing systems used for comparison are UnifiedRank [160], MA-SingleDocSum [105], Manifold Ranking [98], QCS [101], CRF [96], NetSum [109], SVM [95], DE [55], FEOM [104], SummaRuNNer [7], NN-SE [6], COSUM [108], and ESDS-GHS-GLO [107]. These works, except [6, 7], use both the DUC2001 and DUC2002 datasets for reporting the performance of their summarization systems. In addition, in [99], five regression-based methods are proposed, namely LeastMedSq, Linear Regression, MLP Regressor,

1 https://www-nlpir.nist.gov/projects/duc/data.html
2 https://doceng.org/doceng2019

RBF Regressor, and SMOreg, which differ in terms of the machine learning regressor used. Out of these regression-based models, Linear Regression and LeastMedSq performed the best for the DUC2001 and DUC2002 datasets, respectively; therefore, these best methods are also considered for comparison. Note that [6, 7] use only the DUC2002 dataset; for a fair comparison, their results are taken directly from the reference papers. The techniques discussed above are already described in Section 2.2.2 of Chapter 2. As the CNN dataset was released as part of a competition on single-document extractive summarization, we also compare with the two top systems submitted in the competition. The first system was developed by Oliveira et al. [185] from the Federal Institute of Espirito Santo, Brazil; they considered the ESDS problem as a maximum coverage problem of selecting the optimal subset of sentences from the document and utilized Integer Linear Programming (ILP) [186] to maximize it. The second system was developed by Brito et al. from the Fraunhofer Center for Machine Learning, Germany, who used the SummaRuNNer model [7] with some modifications. For more details, refer to [184].

5.5.4 Parameter Settings

The parameter values used in our proposed framework are as follows. DE parameters: |P| = 40, mating pool size = 4, threshold probability in mating pool construction (β) = 0.7, maximum number of generations (g_max) = 25, crossover probability (CR) = 0.2, b = 6, F = 0.8. SOM parameters: initial neighborhood size (σ0) = 2, initial learning rate = 0.6, training iterations in SOM = |P|, topology = rectangular 2D grid, grid size = 5 × 8. Sensitivity analyses of the DE and SOM parameters can be found in [46] and [168], respectively; inspired by these works, similar parameter values are used in the current work. Importance factors/weight values assigned to the different objective functions: α = 0.25, β = 0.25, γ = 0.10, δ = 0.11, λ = 0.19, φ = 0.10. System summary length: 100 words. In most of the existing literature [46, 180], similar weight values for the importance factors are considered. As there are six objective functions, the maximum number of fitness function evaluations is 6240. The reported results are averaged over 10 runs of the algorithm. Word mover distance makes use of a word2vec model pre-trained on the GoogleNews corpus (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) to calculate the distance between two sentences.

5.6 Experimental Results

Table 5.1 reports the ROUGE scores obtained by our proposed approaches using different similarity/dissimilarity measures (NGD, CS, WMD) and different state-of-the-art methods on DUC2001

and DUC2002 datasets. Note that these results are generated by our proposed approach with the SMaxRouge strategy for selecting a single best solution from the final Pareto optimal front, as mentioned in Section 5.4.9. To illustrate the utility of incorporating SOM-based genetic operators in the DE process, results are also reported for the multi-objective binary DE-based summarization approach with the standard genetic operators of DE (without SOM). It can be observed that our approaches using the discussed similarity/dissimilarity measures outperform all other approaches for both datasets in terms of ROUGE-1 and ROUGE-2 scores. The best ROUGE scores reported in Table 5.1 for both datasets were obtained using Approach-1 with SOM-based genetic operators and WMD as the similarity measure. Thus, it can be concluded from the obtained results that the use of different sentence similarity/dissimilarity measures and of self-organized multi-objective differential evolution for optimization indeed helps in achieving improved performance.

As any evolutionary algorithm generates Pareto optimal solutions in the final generation, we show in Fig 5.2 the Pareto optimal fronts obtained (over one random document of DUC2001/DUC2002) after applying the proposed Approach-1 (WMD) with SOM-based operators. These fronts correspond to the first, fourteenth, nineteenth, and twenty-fifth generations. As it is difficult to plot Pareto optimal fronts for six objective functions, we show the fronts projected onto a three-objective space (Fig 5.2). The following three subsections discuss the results obtained using the different distance/similarity measures on the DUC2001 and DUC2002 datasets.

5.6.1 Discussion of Results Obtained using Normalized Google Distance (NGD)

In Table 5.1, considering all cases (both approaches, with and without SOM-based genetic operators), our results beat the other existing methods. The best ROUGE scores for both datasets were obtained using Approach-1 with SOM-based genetic operators. Comparing the results of Approach-2 with and without SOM-based operators on the DUC2002 dataset, the ROUGE-2 and ROUGE-1 scores are higher without SOM-based operators, but the difference from the SOM-based variant is not substantial.

5.6.2 Discussion of Results Obtained using Cosine Similarity (CS)

In Table 5.1, considering all cases (both approaches, ‘with SOM’ and ‘without SOM’ based genetic operators), it can be concluded that our proposed approaches outperform other existing


Table 5.1: ROUGE scores attained by different methods on the DUC2001 and DUC2002 datasets. Our proposed methods are executed using normalized Google distance (NGD), cosine similarity (CS), and word mover distance (WMD), and the SMaxRouge strategy is used for selecting a single best solution from the final Pareto front. Here, † denotes the best results and also indicates that the results are statistically significant at the 5% significance level; xx indicates that results are not available in the reference paper. For the LeastMedSq and Linear Regression methods, results in the reference paper are presented up to 4 decimal places; to make a fair comparison up to 5 decimal places, we have appended a 0 as the last decimal digit so that their results remain unchanged. The same applies to the NN-SE and SummaRuNNer methods.

                                      DUC2001               DUC2002
                                      ROUGE-2    ROUGE-1    ROUGE-2    ROUGE-1
Approach-1 (NGD)      With SOM        0.26949    0.47699    0.27846    0.50225
                      Without SOM     0.26742    0.47521    0.27705    0.50191
Approach-2 (NGD)      With SOM        0.26774    0.47291    0.27519    0.49899
                      Without SOM     0.26265    0.46762    0.27654    0.50162
Approach-1 (CS)       With SOM        0.26459    0.47554    0.27649    0.50624
                      Without SOM     0.25282    0.46289    0.27292    0.50050
Approach-2 (CS)       With SOM        0.26209    0.47398    0.25961    0.49159
                      Without SOM     0.26629    0.47862    0.27319    0.50147
Approach-1 (WMD)      With SOM        0.29238†   0.50236†   0.28846†   0.51662†
                      Without SOM     0.28930    0.49486    0.28556    0.51441
Approach-2 (WMD)      With SOM        0.28462    0.49863    0.28520    0.51538
                      Without SOM     0.28190    0.48877    0.28656    0.51406
COSUM [108]                           0.20123    0.47274    0.23092    0.49083
ESDS-GHS-GLO [107]                    0.19574    0.45403    0.22142    0.47903
MA-SingleDocSum [105]                 0.20142    0.44862    0.22840    0.48280
DE [55]                               0.18523    0.47856    0.12368    0.46694
UnifiedRank [160]                     0.17646    0.45377    0.21462    0.48487
FEOM [104]                            0.18549    0.47728    0.12490    0.46575
NetSum [109]                          0.17697    0.46427    0.11167    0.44963
CRF [96]                              0.17327    0.45512    0.10924    0.44006
QSC [101]                             0.18523    0.44852    0.18766    0.44865
SVM [95]                              0.17018    0.44628    0.10867    0.43235
Manifold Ranking [98]                 0.16635    0.43359    0.10677    0.42325
Linear Regression [99]                0.21104    0.46374    0.23924    0.49784
LeastMedSq [99]                       0.20794    0.46204    0.23964    0.49824
NN-SE [6]                             xx         xx         0.23200    0.47400
SummaRuNNer [7]                       xx         xx         0.23900    0.45400

methods. Comparing the two operator variants of Approach-1 utilizing CS, the 'with SOM' variant performs better. On the other hand, for the DUC2002 dataset, comparing the results of Approach-2 with both operator variants, the ROUGE-2 and ROUGE-1 scores are higher for the 'without SOM' variant.


Table 5.2: ROUGE scores attained by the proposed Approach-1 and Approach-2 utilizing word mover distance (WMD) on the CNN dataset. Here, the SMaxRouge strategy is used for selecting a single best solution from the final Pareto front.

                                      ROUGE-2    ROUGE-1
Approach-1 (WMD)      With SOM        0.6590     0.7212
                      Without SOM     0.5782     0.6527
Approach-2 (WMD)      With SOM        0.6337     0.7035
                      Without SOM     0.5689     0.6461
Brito et al.                          0.3400     0.4600
Oliveira et al.                       0.4500     0.5700

Figure 5.2: Pareto optimal fronts obtained after application of the proposed approach (Approach-1 (WMD) with SOM-based operators). Sub-figures (a), (b), (c) and (d) are the Pareto optimal fronts obtained after the first, fourteenth, nineteenth and twenty-fifth generations, respectively. Red dots represent Pareto optimal solutions; the three axes represent three objective function values, namely sentence position, readability, and coverage.


5.6.3 Discussion of Results Obtained using Word Mover Distance (WMD)

In Table 5.1, considering all cases (both approaches), Approach-1 obtains the best ROUGE scores with SOM-based genetic operators for both datasets. This result is also the best when compared with the other similarity/dissimilarity measures. One reason behind this improved performance is the ability of WMD to capture semantic relationships between sentences; another possible reason is the use of SOM-based operators, which help the algorithm reach optimal solutions with good ROUGE scores. The time taken to generate a summary using Approach-1 with SOM-based operators on DUC2001 is 32 seconds per document, while the same approach without SOM-based operators takes 29 seconds per document. For DUC2002, Approach-1 with and without SOM-based operators takes almost the same time, i.e., 20 seconds per document. Note that these reported times exclude the time taken to calculate the similarity/dissimilarity between two sentences, which is approximately 10-20 seconds in the case of WMD. As Approach-1, utilizing word mover distance and SOM-based genetic operators, performs the best (as per the results of Table 5.1), we have also evaluated it on the third dataset, CNN. The corresponding results are reported in Table 5.2; the results of Approach-1 and Approach-2 for the CNN dataset are shown with and without SOM-based genetic operators. From Table 5.2, it can be observed that Approach-1 using WMD as the dissimilarity measure and SOM-based genetic operators performs the best, as was also the case for the DUC2001 and DUC2002 datasets.

5.6.4 Study on Different Methods of Selecting a Single Best Solution from Final Pareto Front

In Table 5.1, we have shown the best results produced by our proposed approaches utilizing the SMaxRouge strategy for selecting a single best solution from the final Pareto front. In real-world situations, however, the actual summary may not be available. Therefore, we have explored various unsupervised methods under the UMaxRouge strategy to generate a single summary out of the multiple solutions on the final Pareto optimal front, as discussed in Section 5.4.9. The corresponding results are reported in Table 5.3. Note that, amongst the different proposed approaches, Approach-1 (WMD) performs the best with the SMaxRouge strategy; therefore, the unsupervised methods are explored for this approach only. It can be observed from Table 5.3 that the MaxWeightSum2Obj method is able to beat the remaining approaches for the DUC2002 dataset, having Rouge-1 and Rouge-2 scores of 0.51191 and 0.24871 (using SOM-based operators), respectively; however, these scores are lower than the Rouge-1 and Rouge-2

scores of 0.51662 and 0.28846, respectively, which were the best results attained by the SMaxRouge strategy. For the DUC2001 dataset, MaxWeightSum2Obj obtains a better result in terms of Rouge-2 score, with a value of 0.20839, but this is only close to the best result of the existing approaches. Most of the approaches under the UMaxRouge strategy are not able to select the best solution as selected by the SMaxRouge strategy; hence their performances are poorer compared to the SMaxRouge strategy reported in Table 5.1.

Table 5.3: ROUGE Scores obtained using Approach-1 (WMD) when the best solution is selected using any of the strategies under UMaxRouge strategy. All the strategies explored here for selecting a single best solution from the final Pareto front are unsupervised in nature. Bold entries indicate they are able to beat the state-of-the-art algorithms.

                                                 DUC2001               DUC2002
                                                 ROUGE-2    ROUGE-1    ROUGE-2    ROUGE-1
MaxCoh                        With SOM           0.09268    0.31442    0.11924    0.34899
                              Without SOM        0.08949    0.30803    0.16460    0.27372
MaxCov                        With SOM           0.13969    0.41237    0.17064    0.45934
                              Without SOM        0.16107    0.42382    0.11007    0.24130
MaxRead                       With SOM           0.13388    0.38343    0.15633    0.423459
                              Without SOM        0.13353    0.38081    0.16276    0.28974
MaxSenLen                     With SOM           0.11518    0.37225    0.14217    0.42641
                              Without SOM        0.11659    0.37350    0.11830    0.23563
MaxSenPos                     With SOM           0.20163    0.43891    0.24859    0.50957
                              Without SOM        0.19796    0.43700    0.18503    0.33797
MaxSimTitle                   With SOM           0.17096    0.42528    0.20021    0.46747
                              Without SOM        0.07824    0.21931    0.16498    0.30265
MaxWeightSumAllObj            With SOM           0.17484    0.42412    0.20669    0.47523
                              Without SOM        0.20450    0.45214    0.18319    0.32418
MaxWeightSum2Obj              With SOM           0.20839    0.47140    0.24871    0.51191
                              Without SOM        0.20431    0.44477    0.18402    0.33673
MaxWeightSum3Obj              With SOM           0.19723    0.43780    0.24787    0.50997
                              Without SOM        0.20518    0.44514    0.33872    0.18752
Ensemble                      With SOM           0.12717    0.32238    0.15327    0.37152
                              Without SOM        0.12065    0.31312    0.14944    0.36796
EnSemMaxLen                   With SOM           0.06512    0.25632    0.09802    0.30849
                              Without SOM        0.08931    0.26963    0.09714    0.30733
EnSemMaxSentTitle             With SOM           0.11611    0.30302    0.14499    0.35167
                              Without SOM        0.05194    0.22642    0.14267    0.35113
MaxObjDistOrigin              With SOM           0.18474    0.43984    0.21136    0.48370
                              Without SOM        0.186083   0.43571    0.162728   0.30669
MinReconsErrorWord2vec        With SOM           0.15695    0.39800    0.19048    0.44749
                              Without SOM        0.14409    0.38408    0.18777    0.32736
MinReconsErrorDoc2vec         With SOM           0.29221    0.49990    0.28620    0.51623
                              Without SOM        0.28930    0.48486    0.27142    0.50101


The ensemble-based approach generally performs well. However, the final front contains a large number of non-dominated solutions of different character: a solution sol_i may be good in terms of the 'sentence-to-title similarity' objective compared to sol_j, while sol_j may be good in terms of the cohesion objective, which has low priority in our approach. As we consider the sentences belonging to all of these solutions to generate the final summary, the ensemble approach does not perform better than the SMaxRouge strategy.

After observing the results obtained by the MaxCoh, MaxCov, MaxRead, MaxSenLen, MaxSenPos, and MaxSimTitle strategies for selecting a single best solution (based on the maximum value of a single objective function), we conclude that these strategies are also not able to extract the best solution from the final Pareto optimal front. Only the MinReconsErrorDoc2vec approach performs well and beats the existing algorithms, with only slight variations from the results reported in Table 5.1. In summary, the solutions selected by MinReconsErrorDoc2vec under the UMaxRouge scheme are very similar to those selected by the SMaxRouge scheme (refer to Table 5.1), where the available reference/gold summary is utilized for selecting the single best solution. Thus, the performances of the proposed approaches under the MinReconsErrorDoc2vec and SMaxRouge strategies are similar, but the MinReconsErrorDoc2vec scheme does not utilize any supervised information. The use of the MinReconsErrorDoc2vec scheme is therefore recommended with the proposed approaches for selecting the single best solution from the final Pareto front. Note that the doc2vec model used in this approach was trained on the DUC2001, DUC2002, DUC2006 and DUC2007 datasets using the implementation available at https://github.com/jhlau/doc2vec with the default parameters mentioned at that link, and makes use of a model pre-trained on the GoogleNews corpus. DUC2006 and DUC2007 are standard summarization datasets consisting of 50 and 45 document sets, respectively.

5.6.5 Convergence Plots

Fig 5.3 shows the convergence plots obtained by our proposed approach for some random documents; the maximum Rouge-1 and Rouge-2 scores attained by our approach over the generations are plotted. These figures show that Approach-1 (WMD) with SOM converges to a Rouge-1/Rouge-2 value after a particular iteration (there is no change in the Rouge-1/Rouge-2 scores after that iteration). This also indicates the faster convergence of our approach towards near-optimal Rouge scores in comparison to the other approaches.


Figure 5.3: Convergence plots. Sub-figures (a), (b), (c) and (d) show the convergence plots for four random documents. At each generation/iteration, maximum Rouge-1 and Rouge-2 scores are plotted.


5.6.6 Improvements Obtained

We have also calculated the performance improvement obtained (PIO) by our best approach (under the SMaxRouge strategy for selecting a single best solution from the final Pareto front) in comparison to the existing methods, using the ROUGE-2 and ROUGE-1 scores; these values are shown in Table 5.4. The improvements correspond to the best results, obtained using Approach-1 (WMD) with SOM-based operators. Mathematically, PIO is defined as:

PIO = \frac{ProposedMethod - OtherMethod}{OtherMethod} \times 100 \qquad (5.12)
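For example, taking the DUC2001 ROUGE-2 score of Approach-1 (WMD) with SOM from Table 5.1 (0.29238) and the corresponding MA-SingleDocSum score (0.20142), the entry of Table 5.4 is recovered as:

PIO = \frac{0.29238 - 0.20142}{0.20142} \times 100 \approx 45.16\%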

Table 5.4: Improvements attained by the proposed approach, Approach-1 (WMD) with SOM-based operators, over other methods considering ROUGE scores. Here, xx indicates non-availability of results on the DUC2001 dataset.

                          Improvements obtained by the proposed approach (%)
                              DUC2001               DUC2002
Methods                   ROUGE-2    ROUGE-1    ROUGE-2    ROUGE-1
COSUM                     45.30      6.27       24.92      5.25
ESDS-GHS-GLO              49.37      10.64      30.28      7.85
MA-SingleDocSum           45.16      11.98      26.30      7.01
DE                        57.85      4.98       133.24     10.64
UnifiedRank               65.69      10.71      34.41      6.55
FEOM                      57.63      5.26       130.96     10.92
NetSum                    65.21      8.21       158.32     14.90
CRF                       68.74      10.38      164.07     17.4
QSC                       57.85      12.01      53.72      15.15
SVM                       71.81      12.57      165.45     19.49
Manifold Ranking          75.76      15.86      170.18     22.06
Linear Regression         38.57      8.34       20.60      3.78
LeastMedSq                40.63      8.74       20.40      3.70
NN-SE                     xx         xx         24.88      8.99
SummaRuNNer               xx         xx         20.70      13.21

Here, the improvements obtained by our proposed approach compared to MA-SingleDocSum and DE are 45.16% and 4.98% (about 5%) considering the ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2001 dataset, while for the DUC2002 dataset the improvements compared to MA-SingleDocSum and COSUM are 26.3% and 5.25%, respectively. Comparing with the latest neural-network-based work on summarization [7], we obtain 20.70% and 8.99% (about 9%) improvements in ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2002 dataset. In summary, for the DUC2001 dataset, minimum improvements of 38.57% and 5.24% are obtained over the existing techniques in terms of the ROUGE-2 and ROUGE-1 scores, respectively, while for the DUC2002 dataset, minimum improvements of 20.60% and 3.70% are obtained over the

existing techniques in terms of the ROUGE-2 and ROUGE-1 scores, respectively.

5.6.7 Error Analysis

In this section, we thoroughly analyze the errors made by our proposed approach, Approach-1 with SOM-based operators using WMD as the similarity/dissimilarity measure between sentences and the SMaxRouge strategy for selecting a single best solution from the final Pareto optimal front (this configuration gives the best results). Some random documents from each of the DUC2001/DUC2002 datasets are selected to perform the error analysis. Some parts of the lines in the predicted and reference/actual summaries do not match because some sentences in the actual summary were rewritten by human annotators. In Fig 5.4, an example of a summary generated by our proposed algorithm is shown for document AP881109-0149 of topic d21d in the DUC2001 dataset. The same color shows matching lines, and the beginning of a line is indicated by [Line-number]. Here, the generated summary covers most of the sentences in the actual summary, with ROUGE-1 and ROUGE-2 scores of 0.8115 and 0.6383, respectively; therefore, it is considered a good summary.

Reference summary: DUC2001 -> d21d -> AP881109-0149

[Line-1] The cruise ship Song of America was forced to return to port after an engine seized up and started a small fire, but no one was hurt . [Line 2] The ship left Miami on Sunday with about 1,300 passengers on a Caribbean cruise. Rick Steck, a spokesman for Royal Caribbean Cruise Line said the fire was quickly doused by crewmembers . [Line 3] The passengers, who had been brought on deck , were allowed to resume the evening's activities. [Line 4] The 705-foot ship turned around and returned to Cozumel on its remaining three engines to replenish firefighting supplies. The passengers stayed aboard , and the ship will return to Miami on Thursday or Friday .

Predicted summary:

[Line-1] The cruise ship Song of America was forced to return to port after an engine seized up and started a small fire, but no one was hurt, the ship 's owner said today. [Line 2] The ship left Miami on Sunday with about 1,300 passengers on a Caribbean cruise. [Line 3] The passengers were mustered on deck while crew members doused the blaze, but then allowed to resume the evening's activities, he said. [Line 4] The 705-foot ship turned around and returned to Cozumel on its remaining three engines to replenish firefighting supplies, Steck said.

Figure 5.4: An example of reference summary and predicted summary for document AP881109-0149 of topic d21d under the DUC2001 dataset.

Fig 5.5 shows an example of a predicted summary that does not seem to be good; the corresponding ROUGE-1 and ROUGE-2 scores are 0.44 and 0.1276, respectively. The likely reason is that the reference summary was rewritten by human annotators. Our developed approach is based on extractive summarization: it selects sentences directly from the document for the generated summary but is not capable of restructuring

the sentences. For example, consider Line-1 of the predicted summary in Fig 5.5, which is very long in the original document but is shortened by the annotators in Line-1 of the reference summary, allowing the reference summary to cover other themes of the main document (since more words can then be added within the desired summary length). Our predicted summary, however, is not able to cover the whole idea of the document, as the selection of Line-1 consumes many of the allowed words and not many further sentences can be added because of the restriction on the number of words in the summary.

Reference summary: DUC2001 -> d60k -> SJMN91-06106024

[Line-1] Rodney King spends his time seeing doctors and thinking about his injuries he fears may become permanent. [Line-2] He is staying with relatives and fears retribution by the police. [Line- 3] His ex-wife says he's depressed and frightened; his attorney has hired guards to protect him. [Line-4] King suffers headaches and numbness of the face after five hours of plastic surgery to repair fractures of his cheek and eye bones, and has instituted an $83 million law suit against the city for excessive force. [Line-5] In another development, he 's now a suspect in a February 21 robbery and shooting, a result of the wide publicity.

Predicted summary:

[Line-1] Six weeks after his beating by Los Angeles police and seemingly forgotten in the political turmoil that has followed -- Rodney G. King fears retribution, spends most of his time seeing doctors, and thinks a lot about the headaches, scars and facial numbness he worries might become permanent. [Line-2] Lerman has filed an $83 million claim against the city on King's behalf. King's neat, blue home in Altadena has the curtains drawn, its phone number and those of other family members long changed .

Figure 5.5: An example of reference summary and predicted summary for document SJMN91-06106024 of topic d60k under the DUC2001 dataset.

5.6.8 Study on Effectiveness of SOM based Operators on DUC2001 and DUC2002 datasets

Note that the difference in the Rouge-1/Rouge-2 scores attained by the 'with SOM' and 'without SOM' versions of Approach-1 (WMD) (shown in Table 5.1) appears to be very small. In order to investigate this issue further, we have carried out the following analyses: (a) box plots; (b) a t-test. Detailed information about these is given below:

1. Box plots: We have plotted box plots showing the variations of the average Rouge-1/Rouge-2 values of the highest-ranked (rank-1) solutions produced in the final generation for each document. For example, let d be a particular document belonging to the DUC2001/DUC2002 dataset and Q be the number of rank-1 solutions obtained on the final Pareto optimal front of the final generation for that document; then the average Rouge-1


for the document ‘d’ denoted as Average R1d is calculated as:

Average\_R1_d = \frac{1}{Q} \sum_{j=1}^{Q} R1_j \qquad (5.13)

where R1_j indicates the Rouge-1 score of the j-th rank-1 solution. Similar steps are followed to calculate the average Rouge-2 value. Following the above process, average Rouge scores are calculated for all the documents. This is done because in Table 5.1 we report the average Rouge-1/Rouge-2 scores of the best solutions of all documents, and the best solution is one of the highest-ranked solutions. Note that the best results are obtained using Approach-1 (WMD); therefore, the box plots are drawn for this method. From Fig 5.6(a) and 5.6(b), it is evident that Approach-1 with SOM-based operators attains better median values of the average Rouge-1/2 values of the rank-1 solutions of all documents for the DUC2001 and DUC2002 datasets, respectively, in comparison to those obtained without SOM-based operators. Also, for both datasets, Approach-1 (WMD) using SOM-based operators covers solutions having a higher range of Rouge-1/Rouge-2 values, as can be seen from the green points in these figures.

We have also drawn box plots for three random documents showing the Rouge-1/Rouge-2 variations (with SOM and without SOM based operators) across the different rank-1 solutions. These box plots are shown in Fig 5.7 and Fig 5.8 for the DUC2001 and DUC2002 datasets, respectively. These per-document box plots also show the superiority of the SOM-based operators in covering a high range of Rouge-1 and Rouge-2 score values. At the top of each sub-figure of Fig 5.7 and Fig 5.8, a super-title describes the dataset name, topic name and document number under that topic. For example, at the top of Fig 5.7(a), 'DUC2001/d03a/WSJ911204-0162' indicates dataset DUC2001, topic d03a and document WSJ911204-0162.

2. t-test: We have also conducted a t-test to assess the significance of the difference between the Rouge recall values obtained by the two versions (with SOM and without SOM based operators) of Approach-1 (WMD) under the SMaxRouge scheme. The p-values (at the 5% significance level) attained by these approaches are reported in Table 5.5.

Table 5.5: The p-values obtained by Approach-1 (WMD) with SOM and without SOM based operators (under the SMaxRouge scheme) considering the ROUGE-1 and ROUGE-2 scores.

Dataset     ROUGE-1    ROUGE-2
DUC2001     0.024134   0.032038
DUC2002     0.218967   0.238569



Figure 5.6: Box plots. Sub-figures (a) and (b) for DUC2001 and DUC2002 dataset, respectively, show the variations of average Rouge-1/Rouge-2 values of highest ranked (rank-1) solutions in each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions.

The p-values obtained on the DUC2001 dataset clearly show that Approach-1 (WMD), when used with SOM based operators, significantly improves the results. On the DUC2002 dataset, however, the results are not significant, as the Rouge scores attained by Approach-1 (WMD) with SOM based operators are close to those attained by Approach-1 (WMD) without SOM based operators. Nevertheless, from Figures 5.6-5.8, it is fair to say that there exists a set of documents for which our approach is able to determine good quality solutions with high Rouge scores in fewer iterations when used with SOM based operators.
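For concreteness, the following is a minimal sketch (not the actual analysis script) of how the per-document averages of Eq. (5.13) could be aggregated and visualised as box plots similar to Fig. 5.6; the rouge_scores dictionary and all its values are hypothetical placeholders.

```python
import matplotlib.pyplot as plt

# Hypothetical per-document Rouge-1 scores of the rank-1 solutions
# in the final generation, for both variants of Approach-1 (WMD).
rouge_scores = {
    "with_SOM":    {"doc1": [0.42, 0.45, 0.40], "doc2": [0.38, 0.36]},
    "without_SOM": {"doc1": [0.40, 0.41, 0.39], "doc2": [0.35, 0.33]},
}

def average_r1(scores):
    """Average Rouge-1 over the rank-1 solutions of one document (cf. Eq. 5.13)."""
    return sum(scores) / len(scores)

# One average value per document, for each variant.
data = [
    [average_r1(v) for v in rouge_scores["with_SOM"].values()],
    [average_r1(v) for v in rouge_scores["without_SOM"].values()],
]

plt.boxplot(data, labels=["with SOM", "without SOM"])
plt.ylabel("Average Rouge-1 of rank-1 solutions")
plt.show()
```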



Figure 5.7: Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2001 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document.

5.6.9 Statistical Significance t-test

To validate the results obtained by the proposed approach, a statistical significance test, namely Welch's t-test [187], is conducted at the 5% significance level. It is carried out to check whether the best ROUGE scores obtained by Approach-1 (WMD) with SOM based operators (under the SMaxRouge scheme) are statistically significant or occurred by chance. To establish this, we have calculated the p-value using Welch's t-test between two groups. The first group includes a



Figure 5.8: Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2002 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document.

list of ROUGE-1 (ROUGE-2) values produced by our method after executing it Q times (Q being equal to the number of comparing methods), while the second group contains a list of ROUGE-1 (ROUGE-2) values produced by the remaining methods. Two hypotheses are considered by this t-test, namely the null hypothesis and the alternative hypothesis. The null hypothesis states that there is no significant difference between the median ROUGE-1 (ROUGE-2) values of the two groups. On the contrary, the alternative hypothesis states that there is a significant difference between the median ROUGE-1 (ROUGE-2) values of the two groups. This t-test provides a p-value; a small p-value

signifies that our results are significant. The p-values obtained are shown in Table 5.6. The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.

Table 5.6: The p-values obtained by Approach-1 (WMD) with SOM based operators (under SMaxRouge scheme) with respect to existing methods.

Dataset    ROUGE-1     ROUGE-2
DUC2001    0.000152    < 0.00001
DUC2002    0.004183    < 0.00001
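As an illustration, Welch's t-test described above can be reproduced with scipy by disabling the equal-variance assumption; the two score lists below are hypothetical placeholders, not the actual experimental values.

```python
from scipy import stats

# Hypothetical ROUGE-1 scores: group 1 from the proposed approach (Q runs),
# group 2 from the comparing methods.
group1 = [0.48, 0.47, 0.49, 0.46, 0.48]
group2 = [0.42, 0.40, 0.44, 0.41, 0.43]

# equal_var=False gives Welch's t-test (unequal variances).
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")

# Reject the null hypothesis at the 5% significance level if p < 0.05.
if p_value < 0.05:
    print("Difference is statistically significant.")
```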

5.6.10 Complexity Analysis of the Proposed Approach

In this section, the complexities of the proposed approach with and without SOM based genetic operators are analyzed. Let N be the number of solutions, M be the number of objectives to be optimized, and T be the maximum number of generations.

With SOM:

1) The population initialization step takes O(N) time as there are N solutions, which are randomly initialized as binary vectors obeying the length constraint. Each solution undergoes the objective function calculation step, which takes O(NM) time in total. Thus, the total time complexity of population initialization is O(N + NM), which is equivalent to O(NM). 2) The solutions in the population undergo SOM training, which takes O(N^2) time [157]. 3) Mating pool generation takes O(N^2) time, as for each solution we have to find its neighbors. 4) The time taken for new solution generation using the genetic operators (crossover and mutation) is O(N + NM); the term M appears because of the objective function calculation for each new solution. 5) Evaluation of the dominance and non-dominance relationships between the 2N solutions (after merging the old and new populations) and the subsequent selection of the best N solutions take O(MN^2) time [29]. Steps 2 to 5 are repeated for T generations. Note that the update of the SOM training data takes constant time and can therefore be ignored. Thus, the total time complexity of the proposed architecture with SOM based operators is

O(MN + T(N^2 + N^2 + N + NM + MN^2)).

On solving further, it gives rise to

=> O(MN + T(2N^2 + NM + MN^2))

=> O(MN + T(MN^2)) => O(MN(1 + TN)) ≈ O(TMN^2), which is the worst-case time complexity of our approach when using SOM based genetic operators.

Without SOM-based Genetic Operators:

In the proposed architecture without SOM based genetic operators, steps 2 and 3 are not present. Here, the mating pool for each solution is the entire population; the other steps remain the same. Thus, the total time complexity without SOM based genetic operators is

O(MN + T(N + NM + MN^2)) ≈ O(MN + TMN^2) => O(MN(1 + TN)) ≈ O(TMN^2), which is the same as the time complexity of the proposed architecture developed with SOM based genetic operators.

5.7 Conclusive Remarks

In this chapter, an extractive single document text summarization system is developed. Six objective functions are utilized for selecting a good subset of the sentences present in a document. The similarity/dissimilarity between two sentences is calculated using three measures: normalized Google distance, word mover distance, and cosine similarity, to show that the summarization result depends not only on the proposed framework but also on the type of similarity/dissimilarity measure used. Various unsupervised methods are explored to select a single best summary from the available set of summaries on the final Pareto optimal front. Experimental results on several benchmark datasets show that our SOM-based approach with WMD as a distance measure outperforms other existing methods: it obtains 45% and 5% improvements over the best existing method in terms of ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2001 dataset, while for the DUC2002 dataset, the improvements are 20% and 5% in terms of ROUGE-2 and ROUGE-1 scores, respectively. Currently, there is great demand for figure summarization in the biological domain because figures contribute significantly to understanding the core concepts, yet they are difficult to interpret for both humans and machines. Therefore, in the next chapter, we will propose a system for the summarization of figures in biological articles.


CHAPTER 6

Textual Entailment based Figure Summarization for Biomedical Articles

In biomedical scientific articles, figures play a significant role in understanding the core concept of the research presented. However, due to their high level of complexity, they cannot be easily interpreted by machines or humans. Therefore, in this chapter, we propose a novel unsupervised approach (FigSum++) for automatic figure summarization in biomedical scientific articles. Different quality measures capturing the relevance of the sentences to the figure are simultaneously optimized using the search capability of a multi-objective optimization technique to obtain a good set of sentences in the summary. A new way of measuring diversity among sentences in terms of textual entailment is also proposed.


6.1 Introduction

6.1.1 Overview

In the current chapter, we introduce a novel extractive summarization technique to deal with the problem of summarizing the figures in biomedical articles in an unsupervised way. According to Futrelle [25], 50% of the text in biomedical articles is related to figures. Moreover, as per [26], the caption of the figure together with the title and abstract of the article conveys only 30% of the information related to the figure. These figures are difficult to interpret for both humans and machines; therefore, the associated text in the article can be used to describe them. For example, [113] proposed a system, FigSum, to generate a summary of images in the biomedical domain using text scattered throughout the various sections of scientific articles, such as the introduction, proposed method, and results sections. The top-scoring sentences having high tf-idf cosine similarity [188] with the figure's caption and the article's central theme were considered as part of the summary. However, a biomedical article contains many sentences, and it is difficult to decide which are more relevant to the figure. Therefore, there is a need to develop a more sophisticated system that summarizes figures by extracting the relevant sentences while optimizing different criteria in an unsupervised way. To measure the similarity between sentences, a well-known measure, cosine similarity [189], is used: the higher the similarity, the closer the sentences are. This measure requires a vector representation of the sentences, for which a recently developed pre-trained language model trained on large biomedical corpora, namely BioBERT [45], is utilized. BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model. It has been applied to different NLP tasks, and improved performance has been reported on many BioNLP tasks [45], such as biomedical relation extraction and biomedical named entity recognition, among others. Therefore, in our task, we have made use of this representation to characterize the sentences in semantic space; note that it can capture the semantic similarity between sentences. Our work is motivated by the fact that a biomedical article contains many sentences, and these may be relevant to the figure with respect to different perspectives (also called scoring features, fitness functions or objective functions), such as whether a sentence refers to that figure (SRF), the amount of similarity a sentence has with the figure's caption (SFC), the number of 1-gram overlapping words between a sentence and the figure's caption (SOC1), and the number of 2-gram overlapping words between a sentence and the figure's caption (SOC2). Moreover, whether a sentence entails the figure's caption (STE) or not can be considered as another scoring

function. Therefore, in our proposed system (FigSum++), these sentence-scoring functions are optimized simultaneously in an unsupervised way using the multi-objective binary differential evolution [46] algorithm (MOBDE), which is an evolutionary algorithm (EA). For more details about MOBDE, refer to Section 2.1.10 of Chapter 2. To avoid redundancy in the generated summary, another goodness measure named anti-redundancy (SAR) is also considered in our optimization process. Note that SAR employs cosine similarity while computing the similarity/dissimilarity between sentences of the summary in semantic space; it is included to maintain diversity amongst sentences. Generally, in the MOBDE framework, the rand/1/bin scheme/variant is used (as described in Section 2.1.10 of Chapter 2) to generate a new solution at each iteration using fixed values of two parameters: the mutation factor (F) and the crossover rate (CR) [80]. As a result, the search ability of such algorithms can be limited. Note that in the MOBDE framework, CR and F are the two crucial parameters that help in reaching the global optimal solution. Moreover, the rand/1/bin scheme may not be efficient as it has an exploratory nature, whereas the best solution (or the best summary for a given figure) may lie in a local or global region. Therefore, instead of rand/1/bin, an ensemble of two other DE schemes (current-to-rand/1/bin and current-to-best/1/bin) is used in the new solution generation process. The motivation behind using these variants is that in any evolutionary algorithm, diversity among solutions and convergence towards the true/global optimal solutions are important phenomena, which can be achieved using current-to-rand/1/bin and current-to-best/1/bin, respectively. More information about these variants can be found in [80]. Also, to avoid fixing the values of the F and CR parameters, pools of values for these parameters are considered, based on the literature [80, 190]; the DE variants can randomly select F and CR values from the given pools. This mechanism is shown in Figure 6.3 (a more detailed description is provided in Section 6.3.4). Because it is difficult to decide which set of objective functions is the most suited for our task using the MOO-based algorithm, an ablation study was also carried out on the selected objective functions. Here, an ablation study means that various combinations of the objective functions, for example, (a) SAR TE and STE; (b) SAR TE and SRF; (c) SAR TE, SRF, and SFC, and others, are optimized simultaneously using the MOBDE framework in different runs of our proposed algorithm. Textual entailment (TE) [63] is itself a challenging problem in the NLP domain. The importance of TE can be understood from the BioNLP1 2019 shared task on textual inference and question entailment on biomedical text. For a description of TE with an example, the reader can refer to Section 2.1.6 of Chapter 2. Due to the popularity of TE, we have proposed a different

1https://aclweb.org/aclwiki/BioNLP Workshop

way of measuring anti-redundancy in a summary: the sentences in a summary should not entail each other, in order to maintain diversity amongst them. Thus, in total, two ways of measuring anti-redundancy in the summary are explored: one makes use of cosine similarity, while the other makes use of the textual entailment relationship between sentences.

6.1.2 Contributions

Following are the major contributions of this chapter:

1. To the best of our knowledge, the proposed work is the first attempt at developing a multi-objective based framework for solving the figure-summarization task, in which various sentence scoring features, such as whether a sentence refers to the figure, the semantic similarity between sentences and the figure's caption, the number of overlapping words between sentences and the figure's caption, and others, are optimized simultaneously to generate a good quality summary. Moreover, whether the sentences in a summary entail the figure's caption is also considered as another objective function in the optimization process.

2. Any multi-objective evolutionary algorithm should satisfy two properties: diversity among solutions and convergence towards the true Pareto optimal front. To achieve this, two different DE variants (current-to-rand/1/bin and current-to-best/1/bin) are utilized in the current framework; the first scheme promotes diversity and the second promotes convergence.

3. To minimize redundancy amongst sentences in the generated summary, a new method utilizing textual entailment relationships between sentences is proposed.

4. To measure the similarity amongst sentences in the semantic space, a recently proposed deep learning-based pre-trained language model, namely BioBERT [45], developed for biomedical text mining, is utilized.

5. To determine the set of most contributing objective functions in our optimization process, an ablation study is presented.

6. All the existing approaches provide a single fixed-length summary at the end of the execution. As our approach is population-based in nature, multiple summaries of different lengths are provided to the end-user, who can select any summary based on his/her choice.

We tested our system on two gold-standard datasets, FigSumGS1 and FigSumGS2, containing 91 and 84 figures, respectively. The results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques.


[Figure content: binary solution vector 1 0 0 1 1 0 0 1 0 0 1 0 over the sequence of sentences in a biomedical article; the 1st, 4th, 5th, 8th and 11th sentences are selected for the summary.]

Figure 6.1: ith solution representation in the population. Here, 12 is the number of sentences in the article; '0' denotes that the corresponding sentence will not be part of the extractive summary, and '1' denotes that it will.

Table 6.1: Description of symbols used in describing the objective functions (mathematical formulation).

Symbol       Description
xi           ith solution (our system generates a set of solutions, and each solution corresponds to a subset of sentences forming a summary for the mth figure)
N            maximum length of the solution, i.e., the number of sentences in the article
xij          jth component (1/0) of the ith solution; 0 indicates that the jth sentence is not selected for the summary and 1 indicates that it is selected
sij          jth sentence of the kth article, belonging to the ith solution
| . |        measures the count
M            cosine similarity between two sentences
Ckm          caption of the mth figure in the kth article
S1           the set of sentences in the article entailed to Ckm
S2           the set of sentences in the ith solution entailed to Ckm
sia → sib    bth sentence of the kth article, belonging to the ith solution, entailed by the ath sentence of the same article and solution
↑ and ↓      indicate that the fitness function is of maximization and minimization type, respectively

6.2 Problem Definition

Consider a biomedical article A consisting of N sentences, A = {s1, s2, . . . , sN}, and a set of M figures {Fig-1, Fig-2, . . . , Fig-M}. We aim to summarize the mth figure (Fig-m) using these sentences. Our main objective is to select a subset of sentences, S ⊆ A, related to the mth figure, defined as follows:

$$ S_{min} \leq \sum_{i=1}^{N} B_i \leq S_{max}, \qquad B_i = \begin{cases} 1, & \text{if } s_i \in S \\ 0, & \text{otherwise} \end{cases} \qquad (6.1) $$

such that {SAR TE(S), SAR CS(S), STE(S), SFC(S), SRF(S), SOC1(S), SOC2(S)} are optimized simultaneously; where Smin and Smax are the minimum and the maximum number of sentences to be present in the summary, respectively; SAR TE(S), SAR CS(S), STE(S),


SFC(S), SRF(S), SOC1(S), and SOC2(S) are the objective functions measuring different aspects/qualities of the summary at the syntactic and semantic level, as discussed below. Note that (a) two or more objective functions can be used instead of all seven; (b) in STE, SFC, SOC1, and SOC2, the mth figure's caption is utilized. Let us assume that we want to generate the summary of the mth figure in the kth article whose caption is Ckm. The steps for computing the objective functions for the ith solution are enumerated below (illustrative sketches of the solution encoding and of these computations are given directly below and after the list, respectively), and the notations used while calculating these objectives are provided in Table 6.1. The representation of the ith solution is shown in Figure 6.1.
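The following minimal sketch illustrates the binary solution encoding of Figure 6.1 and the length constraint of Eq. (6.1); the sentence list and the bounds Smin and Smax are hypothetical values chosen only for illustration.

```python
# A solution is a binary vector over the article's sentences (cf. Figure 6.1):
# 1 means the sentence is part of the figure's summary, 0 means it is not.
solution = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0]     # selects sentences 1, 4, 5, 8, 11

S_MIN, S_MAX = 2, 6                                  # hypothetical summary-length bounds

def satisfies_constraint(sol, s_min, s_max):
    """Check the constraint of Eq. (6.1): s_min <= sum_i B_i <= s_max."""
    return s_min <= sum(sol) <= s_max

def selected_sentences(sol, sentences):
    """Return the subset S of sentences encoded by the binary vector."""
    return [s for bit, s in zip(sol, sentences) if bit == 1]

sentences = [f"sentence-{i}" for i in range(1, 13)]  # stand-in for the article text
print(satisfies_constraint(solution, S_MIN, S_MAX))  # True (5 sentences selected)
print(selected_sentences(solution, sentences))
```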

1. SAR: There can be redundant sentences in the article. Therefore, to reduce redundancy in the summary, two versions of SAR are considered (only one is used at a time):

(a) SAR CS (↓): It measures the cosine similarity (CS) between sentences in the summary. Its score for the ith solution is calculated as

$$ \text{SAR CS} = \frac{\sum_{a,b=1,\, a \neq b}^{N} M(s_{ia}, s_{ib})}{O}, \quad \text{if } x_{ia} = x_{ib} = 1 \qquad (6.2) $$

where O is the total number of sentence pairs considered during the calculation; the rest of the notations are described in Table 6.1.

(b) SAR TE (↓): The second version measures the anti-redundancy between sentences of the summary in terms of textual entailment relationships. It is defined as

$$ \text{SAR TE} = \frac{\sum_{a=1}^{N}\sum_{b=1}^{N} Q(s_{ia}, s_{ib})}{O}, \quad \text{if } x_{ia} = x_{ib} = 1, \qquad Q(s_{ia}, s_{ib}) = \begin{cases} 1 & \text{if } s_{ia} \rightarrow s_{ib} \\ 0 & \text{otherwise} \end{cases} \qquad (6.3) $$

Here O is the total number of sentence pairs considered during the calculation.

2. STE (↑): This function calculates the entailment relationships between the sentences of the summary and the figure's caption. To calculate its score, we first need to identify the sentences in the article which are entailed to the mth figure's caption, i.e., Ckm; let us denote this set as S1 (see Table 6.1). Then, the number of overlapping sentences between the ith solution and S1 is calculated, which is considered as the STE score. Mathematically, it can be expressed as

$$ \text{STE} = | S_1 \cap S_2 | \qquad (6.4) $$

Note that to identify the sentences in the article entailed to Ckm, we have used the pre-trained model available at https://github.com/jgc128/mednli_baseline. In this model,


GloVe2 embeddings (840B tokens, 2.2M vocabulary size, and 300-dimensional vectors) are used for initialization, followed by fine-tuning using fastText3 word embeddings on BioASQ4 and MIMIC-III5 data. Note that BioASQ is a collection of 12,834,585 abstracts of scientific articles related to the biomedical domain, and the MIMIC-III data consists of 2,078,705 clinical notes with 320 tokens.

3. SFC (↑): In this objective, the average cosine similarity between the sentences in the ith solution and the figure's caption (Ckm) belonging to the kth article is calculated. Mathematically, its score is calculated as:

$$ \text{SFC} = \frac{\sum_{j=1}^{N} M(s_{ij}, C_{km})}{L}, \quad \text{if } x_{ij} = 1 \qquad (6.5) $$

where L is the number of components xij having the value 1.

4. SRF (↑): It counts the number of sentences present in the ith solution that refer to the mth figure using the keyword 'Figure-m'. It is computed as

$$ \text{SRF} = \sum_{j=1}^{N} I_j, \quad \text{where } I_j = 1 \text{ if sentence } s_{ij} \text{ refers to the } m\text{th figure and } x_{ij} = 1 \qquad (6.6) $$

5. SOC1 (↑): It counts the number of 1-gram overlapping words between sentences present in the ith solution and mth figure’s caption; it is defined as follows:

$$ \text{SOC1} = \sum_{j=1}^{N} |\, \text{Words}(s_{ij}) \cap \text{Words}(C_{km}) \,|, \quad \text{if } x_{ij} = 1 \text{ and words are 1-gram units} \qquad (6.7) $$

6. SOC2 (↑): It is similar to SOC1; the only difference is that 2-gram overlapping words are counted instead of 1-grams. It is calculated as:

$$ \text{SOC2} = \sum_{j=1}^{N} |\, \text{Words}(s_{ij}) \cap \text{Words}(C_{km}) \,|, \quad \text{if } x_{ij} = 1 \text{ and words are 2-gram units} \qquad (6.8) $$
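To make the above objective definitions concrete, the following is a minimal sketch of how SFC, SRF, SOC1, and SAR CS could be computed for one solution. The similarity function is a crude token-overlap stand-in for the BioBERT-based cosine similarity M(., .) used in the thesis, the figure-reference check is a loose keyword match, and all sentences, captions and the solution vector are hypothetical.

```python
import re

def cosine_sim(a, b):
    """Stand-in for M(., .): token-overlap similarity instead of BioBERT cosine."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1.0, (len(ta) * len(tb)) ** 0.5)

def ngrams(text, n):
    tokens = re.findall(r"\w+", text.lower())
    return set(zip(*[tokens[i:] for i in range(n)]))

def sfc(selected, caption):
    """Eq. (6.5): average similarity between selected sentences and the caption."""
    return sum(cosine_sim(s, caption) for s in selected) / max(1, len(selected))

def srf(selected, m):
    """Eq. (6.6): number of selected sentences referring to the m-th figure."""
    return sum(1 for s in selected
               if f"figure-{m}" in s.lower() or f"figure {m}" in s.lower())

def soc(selected, caption, n):
    """Eqs. (6.7)/(6.8): n-gram overlap between selected sentences and the caption."""
    return sum(len(ngrams(s, n) & ngrams(caption, n)) for s in selected)

def sar_cs(selected):
    """Eq. (6.2): average pairwise similarity among selected sentences (minimised)."""
    pairs = [(a, b) for i, a in enumerate(selected) for b in selected[i + 1:]]
    return sum(cosine_sim(a, b) for a, b in pairs) / max(1, len(pairs))

# Hypothetical example: a 4-sentence article and a solution selecting 2 sentences.
sentences = ["Figure 2 shows the binding affinity.",
             "The assay was repeated three times.",
             "Binding affinity increases with dose, see Figure 2.",
             "We thank the funding agency."]
caption = "Binding affinity of the compound at increasing doses"
solution = [1, 0, 1, 0]
selected = [s for bit, s in zip(solution, sentences) if bit]

print(sfc(selected, caption), srf(selected, 2),
      soc(selected, caption, 1), sar_cs(selected))
```

The entailment-based objectives STE and SAR TE follow the same pattern, with the predicate Q(., .) supplied by a pre-trained entailment model such as the MedNLI baseline mentioned above.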

6.3 Proposed Approach

This section discusses the various steps followed in our proposed approach (FigSum++). The corresponding flowchart is also shown in Figure 6.2.

2 https://nlp.stanford.edu/projects/glove/
3 https://fasttext.cc/docs/en/english-vectors.html
4 http://participants-area.bioasq.org/general information/Task6a/
5 https://mimic.physionet.org/


1. Input: Figure with it’s 2. Pre-processing of sentences 3. Population initialization (P) caption and sentences in the article

No 4. Calculation of objective t < tmax functions

Yes

9. Selection of the best |P| 5. Generate new solutions to form a new Solutions population P`

10. Select the best solution

and report the 8. Apply non-dominating sorting 6. Calculate objectives functional values corresponding corresponding to each new solution summary

7. Merge old population (P) and new population (P`)

Figure 6.2: Flow chart of the proposed architecture, where t is the current generation number (initialized to 0), tmax is the user-defined maximum number of generations, and |P| is the size of the population.

6.3.1 Pre-processing

Before applying our proposed approach, pre-processing of the biomedical article is required. The steps followed are described below:

1. Biomedical articles are available in PDF format; therefore, sentences are first extracted using the Grobid tool6. Note that while extracting the sentences, the abstract and the appendix (if available) are excluded; only the remaining sections, such as the introduction and methodology, are used.

2. Removal of stop-words.

Moreover, the cosine similarity between sentences is pre-computed, as it is required repeatedly while running the experiments. To calculate it, sentences are first represented as fixed-length numeric vectors using the BioBERT [45] language model.
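A minimal sketch of this pre-computation is shown below, assuming the HuggingFace transformers library and a publicly released BioBERT checkpoint; the exact checkpoint name and the mean-pooling strategy are assumptions for illustration and are not necessarily those used in the thesis.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed BioBERT checkpoint (the thesis used the naver/biobert-pretrained release).
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences):
    """Fixed-length sentence vectors via mean pooling of BioBERT token embeddings."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

sentences = ["Figure 2 shows the binding affinity.",
             "The assay was repeated three times."]
vectors = embed(sentences)
sim_matrix = cosine_similarity(vectors)     # pre-computed pairwise similarities
print(sim_matrix)
```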

6.3.2 Population Initialization and Solution Representation

This step includes the initialization of the population. The population P consists of a set of solutions < ~x1, ~x2, . . . , ~x|P| >, where |P| is the size of the population. For our task, a binary representation of the solution is used, with length equal to the number of sentences present in the article. Each solution may have a varied number of selected sentences, generated randomly within the range

[Smin, Smax]. If the jth component of the solution is 1, then the jth sentence is part of the

6https://grobid.readthedocs.io/en/latest/Grobid-service/

summary, and vice-versa. The solution representation is shown in Fig. 6.1, assuming that the article has 12 sentences.
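A minimal sketch of this initialization step, under the solution encoding just described, is given below; the population size, number of sentences, and length bounds are hypothetical values.

```python
import random

def init_population(pop_size, n_sentences, s_min, s_max, seed=0):
    """Randomly create binary solutions, each selecting between s_min and s_max sentences."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        k = rng.randint(s_min, s_max)                  # number of selected sentences
        chosen = rng.sample(range(n_sentences), k)     # which sentences to select
        solution = [1 if i in chosen else 0 for i in range(n_sentences)]
        population.append(solution)
    return population

P = init_population(pop_size=40, n_sentences=12, s_min=2, s_max=6)
print(P[0], sum(P[0]))   # first solution and its number of selected sentences
```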

6.3.3 Calculation of Objectives Functions

After initializing the population, the objective functional values (discussed in Section 6.2) are computed for each solution; they help in evaluating the quality of the solution (or summary, as each solution represents a summary). Note that the proposed framework is generic, and the user can select any combination of objective functions.

6.3.4 Genetic Operators

This step is shown as step 5 of Figure 6.2. The process followed for new solution generation is described below. Let ~xc,t be the current solution in the population at generation t for which a new solution is to be generated.

Mating Pool Generation

The mating pool includes a set of solutions which can mate to generate new solutions. To construct the mating pool for the current solution ~xc,t, a fixed number of random solutions is picked from the population.

Mutation and Crossover

In our work, we have used two trial vector generation schemes/variants, namely current-to-rand/1/bin and current-to-best/1/bin, which differ from the rand/1/bin variant used in most of the other chapters. These schemes have distinct properties: the first helps in creating diverse solutions from the current solution (thereby introducing diversity among solutions), while the second helps in speeding up the convergence rate (providing the right direction for reaching the global optimal solution). Moreover, F and CR are two crucial parameters of the MOBDE framework which help in generating good quality solutions and achieving faster convergence. In the literature [67, 191], the value suggested for F usually lies between 0.4 and 1, while for CR a value of 0.9 or 1 is suggested. However, fixing the values of these parameters can limit the search. Therefore, instead of fixing them, pools of F and CR values are provided, motivated by [190], and the schemes select these parameter values randomly from the given pools. Descriptions of these variants are provided in [190] for continuous space; as our approach is based on binary encoding, they are adapted to binary space following [46]. To generate new trial vectors corresponding to ~xc,t,

both schemes first apply mutation and then crossover, as discussed below:

1. current-to-rand/1/bin: a) Mutation: To perform this operation for the current solution ~xc,t, three random solutions ~xr1,t, ~xr2,t, and ~xr3,t are first selected from its mating pool, and then a probability vector P(xc,t+1) is generated by the following operation:

$$ P(x_j^{c,t+1}) = \frac{1}{1 + e^{-\frac{2b\left[x_j^{c,t} + r \times (x_j^{r1,t} - x_j^{c,t}) + F \times (x_j^{r2,t} - x_j^{r3,t}) - 0.5\right]}{1 + 2F}}} \qquad (6.9) $$

where ~xc,t is the current solution at generation t for which a new solution is generated, P(x_j^{c,t+1}) is the probability estimation operator, x_j^{c,t} + r × (x_j^{r1,t} − x_j^{c,t}) + F × (x_j^{r2,t} − x_j^{r3,t}) − 0.5 is the mutation operation, b is a real positive constant, r is a random number between 0 and 1, F is the DE control parameter, and x_j^{k,t} is the jth component of the kth solution for k ∈ {r1, r2, r3, c} at generation t. This operator generates a probability value for each component of the current solution. Then Eqs. 2.11 and 2.12 are followed to generate the corresponding offspring, ~vc,t+1, for the current solution, ~xc,t.

2. current-to-best/1/bin: This variant makes use of two random solutions selected from the mating pool, the current solution (~xc,t), and the best solution ~xbest,t to generate a trial vector. Similar to current-to-rand/1/bin, it also first performs mutation and then crossover. To select the best solution in the current generation, a mechanism like non-dominated sorting [29] could be used, but it would increase the computation time. Therefore, in our approach, the best solution is selected by considering the average of the used objective functions (shown mathematically in Eq. 6.10).

$$ \vec{x}^{\,best,t} = \arg\max_{i = 1, 2, \ldots, |P|} \left( \sum_{j=1}^{m} Ob_{ij} \right) / m \qquad (6.10) $$

where |P| and m are the size of the population (i.e., the number of solutions) and the number of used objective functions, respectively, and Obij is the jth objective function value corresponding to the ith solution. Note that the values of SAR CS/SAR TE are reversed while computing ~xbest,t because these two functions are of minimization type while the rest are of maximization type. Then the following operation is performed to generate the probability vector, which is further converted into binary space.

$$ P(x_j^{c,t+1}) = \frac{1}{1 + e^{-\frac{2b\left[x_j^{c,t} + r \times (x_j^{best,t} - x_j^{c,t}) + F \times (x_j^{r1,t} - x_j^{r2,t}) - 0.5\right]}{1 + 2F}}} \qquad (6.11) $$


[Figure content: the current solution ~xc,t produces two trial vectors using F and CR values drawn from the pools; constraint violations are repaired to make the vectors feasible, and the best trial vector is selected.]

Figure 6.3: Flow chart of the generation of new solutions from the current solution ~xc,t at generation t using two DE variants. Here, F and CR are pools of parameter values; the two trial vectors are generated using the current-to-rand/1/bin and current-to-best/1/bin schemes, respectively.

where ~xbest,t is the best solution at generation t and ~xc,t is the current solution at generation t for which the new solution is generated; the rest of the notations are the same as in current-to-rand/1/bin. Then Eqs. 2.11 and 2.12 are followed to generate the trial vector.

Out of the two trial vectors, the one having the better objective function values is considered as the best trial vector for the current solution [192]. To find the best trial vector, we again use the concept of the maximum average objective functional value.
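The following minimal sketch illustrates the probability-estimation mutation of Eqs. (6.9) and (6.11), together with a simple stochastic binarization standing in for Eqs. (2.11)-(2.12) of Chapter 2 (which are not reproduced here); the parameter values, binarization rule, and all vectors are assumptions for illustration.

```python
import math
import random

def prob_estimate(x_c, x_a, x_b1, x_b2, F, r, b=6.0):
    """Probability vector of Eqs. (6.9)/(6.11): mutant value squashed by a sigmoid.

    x_a is the random (current-to-rand) or best (current-to-best) solution, and
    (x_b1 - x_b2) is the difference vector scaled by F.
    """
    probs = []
    for xc, xa, xb1, xb2 in zip(x_c, x_a, x_b1, x_b2):
        mutant = xc + r * (xa - xc) + F * (xb1 - xb2)
        probs.append(1.0 / (1.0 + math.exp(-2.0 * b * (mutant - 0.5) / (1.0 + 2.0 * F))))
    return probs

def binarize(probs, rng):
    """Stand-in for Eqs. (2.11)-(2.12): sample each bit against its probability."""
    return [1 if rng.random() < p else 0 for p in probs]

rng = random.Random(0)
F = rng.choice([0.6, 0.8, 1.0])          # F drawn from its pool (cf. Table 6.3)
r = rng.random()
x_c  = [1, 0, 0, 1, 1, 0]                # current solution (hypothetical)
x_r1 = [0, 1, 0, 1, 0, 1]                # solutions from the mating pool
x_r2 = [1, 1, 0, 0, 1, 0]
x_r3 = [0, 0, 1, 1, 0, 0]

trial = binarize(prob_estimate(x_c, x_r1, x_r2, x_r3, F, r), rng)
print(trial)
```

The current-to-best variant is obtained by passing the best solution of Eq. (6.10) in place of x_r1; the better of the two resulting trial vectors, judged by the average objective value, is kept for the current solution.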

Checking of Constraints: After application of the mutation and crossover operations, the constraint on the number of 1s in the new solutions/trial vectors is checked. It is possible that the generated new solutions do not satisfy the constraint; therefore, to make them feasible (within the constraint), some heuristics are applied. The following steps are executed to make the new solutions feasible, i.e., within the range [Smin, Smax] (a minimal sketch of this repair procedure is given after the list):

• Let us denote the new solution (vc,t+1) as ith solution

• Initialize ModifiedSolution with zeros equal to the maximum length of the solution

• Sort the sentences present in the ith solution based on the maximum number of uni-grams, the maximum number of bi-grams, or the similarity with the figure's caption. To select a single sorting criterion, a random probability 'p' is generated: if p < 0.33, the sentences are sorted based on the maximum number of uni-grams; if 0.33 < p < 0.67, they are sorted based on the maximum number of bi-grams; otherwise, they are sorted based on the maximum similarity with the figure's caption.

• Generate a random number ‘r’ between Smin and Smax.

127 Textual Entailment based Figure Summarization for Biomedical Articles

• Fill the indices of ModifiedSolution with 1s until we cover ‘r’ indices. Note that indices are considered in the sorted order as done in step-3.

• Return the ModifiedSolution.
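A minimal sketch of this repair heuristic is given below; the uni-gram/bi-gram counting and the caption-similarity function are simplified stand-ins, and, for brevity, all sentences of the article are ranked rather than only those present in the trial vector.

```python
import random
import re

def repair(solution, sentences, caption, s_min, s_max, rng):
    """Force the number of selected sentences into [s_min, s_max] (constraint check)."""
    def unigrams(t): return len(re.findall(r"\w+", t))
    def bigrams(t):  return max(0, unigrams(t) - 1)
    def cap_sim(t):  # crude caption-similarity stand-in
        return len(set(t.lower().split()) & set(caption.lower().split()))

    p = rng.random()                        # pick one of the three sorting criteria
    key = unigrams if p < 0.33 else (bigrams if p < 0.67 else cap_sim)

    # Rank sentence indices by the chosen criterion (descending).
    order = sorted(range(len(sentences)), key=lambda i: key(sentences[i]), reverse=True)
    r = rng.randint(s_min, s_max)           # target number of selected sentences
    repaired = [0] * len(solution)          # ModifiedSolution initialized with zeros
    for idx in order[:r]:
        repaired[idx] = 1                   # fill the top-r indices in sorted order
    return repaired

# Hypothetical infeasible trial vector selecting too many sentences:
trial = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sentences = [f"sentence {i} about Figure 2" for i in range(12)]
print(repair(trial, sentences, "Figure 2 caption text", s_min=2, s_max=6,
             rng=random.Random(0)))
```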

The objective functional values of generated new solutions are also evaluated. The flow-chart of this entire process of solution generation is shown in Figure 6.3.

6.3.5 Selection of Best |P | Solutions for Next Generation

After forming a new population P', it is merged with the old population P, and then the top |P| solutions are selected using the dominance and non-dominance relationships between the solutions in the objective space. For more details about this step, refer to Section 3.2.6 of Chapter 3.
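A minimal sketch of this dominance-based selection is given below; it is a simplified stand-in for the full non-dominated sorting procedure of Chapter 3 (it only ranks solutions by how many others dominate them and omits crowding-distance computations), and it assumes all objectives have been converted to maximization.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def select_best(merged, objectives, pop_size):
    """Keep pop_size solutions, preferring those dominated by fewer others."""
    dom_counts = []
    for i, obj_i in enumerate(objectives):
        count = sum(1 for j, obj_j in enumerate(objectives)
                    if i != j and dominates(obj_j, obj_i))
        dom_counts.append(count)
    ranked = sorted(range(len(merged)), key=lambda i: dom_counts[i])
    return [merged[i] for i in ranked[:pop_size]]

# Hypothetical merged population of 4 solutions with 2 objective values each.
merged = ["s1", "s2", "s3", "s4"]
objectives = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.6), (0.1, 0.1)]
print(select_best(merged, objectives, pop_size=2))
```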

6.3.6 Termination Condition

The process of mating pool generation, crossover, and mutation, followed by selection and the update of the population, is repeated until the maximum number of generations, tmax, is reached.

In other words, the loop will continue until t < tmax. Here, t is the current generation number initialized to 0 and is incremented by 1 after each iteration. This step is shown by the diamond box in Figure 6.2.

6.3.7 Selection of Single Best Solution and Generation of Summary

After the final generation, we obtain a set of non-dominated solutions on the final Pareto optimal front. All these solutions are non-dominated with respect to each other and thus of equal importance; therefore, the decision-maker has to select a solution based on his/her requirement. In this work, for the purpose of reporting and comparative study, the summary corresponding to each Pareto optimal solution is generated, and the solution having the highest F-measure value is selected. The F-measure is calculated with respect to the gold/reference summary. The sentences in the summary are ordered according to their occurrence in the scientific article; for example, the sentence which appears first in the article is the first sentence in the summary.
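A minimal sketch of this final selection step is given below; summaries are treated as sets of sentence indices, and the candidate and gold summaries shown are hypothetical.

```python
def f_measure(predicted, gold):
    """Sentence-level F1 between a predicted summary and the gold summary."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def pick_best(pareto_front, gold):
    """Select the Pareto-optimal solution whose summary has the highest F-measure."""
    return max(pareto_front, key=lambda summary: f_measure(summary, gold))

# Hypothetical rank-1 summaries (sets of selected sentence indices) and gold summary.
front = [{1, 4, 5}, {1, 2, 8, 11}, {4, 8, 11}]
gold = {1, 4, 8, 11}
print(pick_best(front, gold))   # -> {4, 8, 11}
```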

6.4 Experimental Setup

In the subsequent sections, we discuss the datasets, evaluation measures, and parameters used.


6.4.1 Datasets

For our figure-summarization task, we have used two publicly available7 datasets. The first dataset, FigSumGS1, has 91 figures, while the second dataset, FigSumGS2, has 84 figures. The actual/gold summaries were made available by the annotators. These figures belong to 19 biomedical full-text articles. A brief description of the datasets is provided in Table 6.2, including the number of figures in each article, the number of sentences in each article, and the number of sentences in the gold summary of each figure. More information can be found in [20].

Table 6.2: Statistics of the used datasets. Here, U1 and U2 are the average number of unique sentences per figure in the FigSumGS1 and FigSumGS2 datasets, respectively; #SentInGS is the number of sentences present in the gold summary; '-' implies that the 18th and 19th articles are not used in the FigSumGS2 dataset.

                              FigSumGS1            FigSumGS2
Article  #Sent.  #Figures   U1     #SentInGS     U2     #SentInGS
1        190     3          5.0    15            11.7   35
2        144     3          18.0   54            11.7   35
3        173     7          5.0    35            8.0    56
4        160     5          8.6    43            10.2   51
5        172     4          12.8   51            10.5   42
6        140     5          8.4    42            10.8   54
7        281     9          7.8    70            11.8   106
8        137     9          4.7    42            6.3    57
9        142     5          6.2    31            11.2   56
10       87      5          6.4    32            8.4    42
11       162     6          6.0    36            9.7    58
12       34      2          7.5    15            6.0    12
13       50      3          8.0    24            11.0   33
14       138     3          5.0    15            12.7   38
15       119     3          12.3   37            11.0   33
16       120     5          9.2    46            12.4   62
17       152     7          5.1    36            14.1   99
18       157     4          6.2    25            -      -
19       184     6          4.8    29            -      -

6.4.2 Evaluation Measures

To evaluate the performance of our system against the available gold summaries, we report the F-measure (or F1-score) [20], which is a well-known measure in information retrieval. The formal definition of the F-measure is provided in Section 2.3.2 of Chapter 2.

7http://figshare.com/articles/Figure Associated Text Summarization and Evaluation/858903

129 Textual Entailment based Figure Summarization for Biomedical Articles

Table 6.3: Parameter settings for our proposed approach. Here, Q is the number of sentences in the actual summary specific to a figure.

Parameters                                Values
Population size (|P|)                     40
Maximum number of generations (tmax)      25
Fpool                                     [0.6, 0.8, 1.0]
CRpool                                    [0.1, 0.2, 1.0]
Smin and Smax                             Q - 2 and Q + 2

6.4.3 Experimental Settings

The different parameter values used in our proposed framework are reported in Table 6.3. The population size (|P|) and the maximum number of generations (tmax) are kept fixed because the larger these values, the higher the computation time. In our approach, the values of |P|, tmax and the maximum number of objective functions are 40, 25, and 6, respectively; therefore, the maximum number of fitness evaluations is 6240 (i.e., (tmax + 1) × |P| × 6 = 26 × 40 × 6, counting the evaluation of the initial population). The results reported are averaged over 5 runs of the algorithm. For the representation of sentences, BioBERT, a model8 pre-trained on biomedical text articles and a book corpus, was used, which provides fixed-length vectors of the sentences.

6.4.4 Comparative Methods

As our proposed approach is unsupervised in nature, we have compared it with other existing unsupervised methods. Although supervised techniques exist in the literature, it would be unfair to compare supervised and unsupervised methods directly. The unsupervised methods include three methods, namely Randomsent, FigSum [113], and FigSum+ [20]. Further, three variants of FigSum+, which are the similarity, tfidf, and SurfaceCue based versions, are considered (shown in Table 6.6(a)). These variants select the top-n sentences based on the maximum caption similarity function, a TF-IDF [188, 193] based similarity function, and the sentence-refers-to-figure function, respectively. Here, TF-IDF is a well-known bag-of-words model in vector space. Brief descriptions of these methods are already provided in Section 2.2.3 of Chapter 2. Note that our developed method is unsupervised in nature; gold summaries were used only to evaluate our system at the end. Moreover, the proposed system is based on the extraction of relevant sentences from the article related to a given figure; therefore, only sentence-extraction based methods are used for the comparative study.

8https://github.com/naver/biobert-pretrained/releases/tag/v1.0-pubmed-pmc


Table 6.4: Average precision (P), recall (R) and F-measure (F1) values obtained for both datasets using the reduced set of sentences. Here, the number following '±' is the standard deviation.

                                          FigSumGS1                              FigSumGS2
S.No.  Objective functions   SAR version  P          R          F1               P          R          F1
1      SRF                   SAR CS       0.18±0.22  0.15±0.20  0.17±0.21        0.22±0.15  0.18±0.13  0.20±0.14
                             SAR TE       0.22±0.27  0.15±0.19  0.18±0.22        0.25±0.13  0.20±0.11  0.22±0.12
2      STE+SRF               SAR CS       0.20±0.22  0.18±0.19  0.19±0.20        0.22±0.14  0.19±0.12  0.20±0.13
                             SAR TE       0.22±0.24  0.18±0.19  0.20±0.20        0.21±0.14  0.18±0.12  0.19±0.13
3      STE+SOC1+SOC2         SAR CS       0.20±0.21  0.18±0.19  0.19±0.20        0.22±0.14  0.20±0.13  0.21±0.13
                             SAR TE       0.19±0.21  0.16±0.17  0.17±0.18        0.22±0.14  0.19±0.11  0.20±0.12
4      SRF+SOC1+SOC2         SAR CS       0.19±0.21  0.18±0.20  0.18±0.20        0.22±0.13  0.20±0.13  0.21±0.13
                             SAR TE       0.21±0.25  0.17±0.21  0.18±0.22        0.21±0.13  0.18±0.12  0.20±0.12
5      STE+SRF+SOC1+SOC2     SAR CS       0.21±0.22  0.19±0.21  0.20±0.21        0.21±0.22  0.17±0.21  0.18±0.21
                             SAR TE       0.23±0.25  0.19±0.21  0.20±0.22        0.24±0.14  0.20±0.12  0.22±0.13

6.5 Results and Discussion

We have conducted two sets of experiments, ExpSet1 and ExpSet2, by varying the number of input sentences. We discuss them one by one along with the corresponding results. We then present the comparative analysis with the existing methods, together with an ablation study on different combinations of objective functions. Finally, we provide an error analysis of the results obtained, followed by a statistical significance test of our results.

1. ExpSet1: In this set, for each figure to be summarized (say Fig-m), we consider only those sentences in the article whose entailment probability values with respect to the figure's caption are greater than 0.5. The proposed approach is then applied to this reduced set of sentences. Note that the number of input sentences is reduced to minimize the computation time. This was done to see whether the reduced set of sentences extracted from the article using entailment probability values is sufficient to obtain a good quality summary.

Results and Discussion: The results obtained under ExpSet1 are shown in Table 6.4. We tried only 5 combinations of objective functions, using the two versions (SAR CS and SAR TE) of the anti-redundancy objective function (SAR). From this table, it can be observed that the highest F1-measure values for the FigSumGS1 and FigSumGS2 datasets are 0.20 and 0.22, respectively. These highest values are obtained using SAR TE in combination with the objective functions STE, SRF, SOC1, and SOC2. In most of the rows of this table, the F-measure values corresponding to SAR TE are higher. Thus, we can infer that the anti-redundancy objective function measured in terms of the textual entailment relationship contributes to better results.


Table 6.5: Average precision (P), recall (R) and F-measure (F1) values obtained by the proposed approach for both datasets, namely FigSumGS1 and FigSumGS2, by varying the objective function combinations. Here, the number following '±' is the standard deviation. Note that here all sentences in the article are used for the experiment.

                                          FigSumGS1                              FigSumGS2
S.No.  Objective functions   SAR version  P          R          F1               P          R          F1
1      STE                   SAR CS       0.24±0.18  0.20±0.15  0.22±0.16        0.22±0.12  0.19±0.11  0.20±0.22
                             SAR TE       0.28±0.18  0.22±0.14  0.24±0.15        0.26±0.13  0.22±0.11  0.24±0.12
2      SRF                   SAR CS       0.53±0.17  0.46±0.20  0.49±0.17        0.31±0.13  0.27±0.12  0.29±0.12
                             SAR TE       0.64±0.24  0.47±0.18  0.54±0.19        0.39±0.14  0.30±0.11  0.34±0.12
3      SFC                   SAR CS       0.36±0.18  0.27±0.14  0.30±0.15        0.30±0.12  0.24±0.10  0.27±0.11
                             SAR TE       0.29±0.21  0.21±0.16  0.24±0.18        0.31±0.13  0.25±0.11  0.28±0.12
4      STE+SRF               SAR CS       0.51±0.17  0.46±0.18  0.48±0.16        0.32±0.12  0.30±0.12  0.31±0.12
                             SAR TE       0.62±0.23  0.48±0.19  0.53±0.19        0.37±0.14  0.30±0.11  0.30±0.12
5      STE+SFC               SAR CS       0.36±0.18  0.30±0.16  0.32±0.16        0.25±0.14  0.23±0.11  0.24±0.12
                             SAR TE       0.28±0.21  0.23±0.18  0.25±0.19        0.30±0.12  0.25±0.10  0.27±0.11
6      SRF+SFC               SAR CS       0.54±0.17  0.47±0.18  0.50±0.16        0.34±0.13  0.28±0.11  0.31±0.12
                             SAR TE       0.63±0.21  0.47±0.16  0.53±0.17        0.37±0.14  0.30±0.12  0.33±0.12
7      STE+SOC1+SOC2         SAR CS       0.43±0.17  0.42±0.18  0.42±0.17        0.32±0.13  0.30±0.13  0.31±0.12
                             SAR TE       0.50±0.20  0.44±0.20  0.46±0.19        0.37±0.13  0.31±0.11  0.34±0.37
8      SRF+SOC1+SOC2         SAR CS       0.55±0.14  0.54±0.18  0.54±0.5         0.37±0.13  0.33±0.12  0.34±0.12
                             SAR TE       0.65±0.20  0.52±0.19  0.57±0.18        0.38±0.13  0.32±0.11  0.35±0.12
9      STE+SRF+SOC1+SOC2     SAR CS       0.54±0.15  0.52±0.18  0.52±0.15        0.36±0.12  0.32±0.12  0.34±0.12
                             SAR TE       0.65±0.20  0.54±0.18  0.59±0.18        0.42±0.12  0.38±0.11  0.40±0.11

2. ExpSet2: In this set, all the available sentences in the article are considered for our experiments. The proposed approach is applied to this full set of sentences.

Results and Discussion: The results obtained using all the sentences of the articles are reported in Table 6.5. In the same table, results are shown using the two versions of the anti-redundancy objective function (SAR CS and SAR TE) in combination with other objective functions. From Table 6.5, it is found that the highest F1-measure values for the FigSumGS1 and FigSumGS2 datasets are 0.59 and 0.40, respectively, which are higher than the values obtained after experimenting with the reduced set of input sentences. Moreover, the maximum F-measure value obtained among the objective function combinations including the SAR CS function is 0.54 (S.No. 8), which is 4% less than the highest F-score. The other observations made from Table 6.5 are enumerated below:

(a) Among most of the objective function combinations, SAR TE performs better than SAR CS. Thus, we can say that SAR TE contributes more to the figure summarization process than SAR CS.

(b) When we remove STE from the best combination (S.No. 9), the F-score decreases by 2% (S.No. 8). However, comparing SAR TE+STE+SOC1+SOC2 (S.No. 7) and SAR TE+SRF+SOC1+SOC2 (S.No. 8), the second one is better. This indicates that although STE contributes to the best F-score value, SRF contributes more


than STE when used with SOC1 and SOC2. The same can also be observed by comparing the F-scores of SAR TE+STE (S.No. 1) and SAR TE+SRF (S.No. 2); there is a big jump in the F-score value.

(c) On comparing STE, SRF, and SFC, each combined with either version of SAR, again SRF contributes the most. For a scientific article this is quite logical, because if a sentence refers to a particular figure using a keyword like 'Figure-', it indicates that the sentence is associated with that figure.

Table 6.6: Comparison of the best results obtained by our proposed approach with (a) unsupervised methods; (b) supervised methods, in terms of average precision (P), recall (R) and F-measure (F1) for both datasets, namely FigSumGS1 and FigSumGS2. Here, the number following '±' is the standard deviation. Note that here all sentences in the article are used for the experiment.

                                                 FigSumGS1                              FigSumGS2
Type of Methods  Method                 P          R          F1               P          R          F1
Unsupervised     Proposed (FigSum++)    0.65±0.20  0.54±0.18  0.59±0.18        0.42±0.12  0.38±0.11  0.40±0.11
                 RandomSent             0.06±0.09  0.06±0.12  0.06±0.09        0.08±0.08  0.09±0.11  0.08±0.09
                 FigSum                 0.28±0.24  0.19±0.19  0.22±0.19        0.31±0.20  0.13±0.10  0.18±0.13
                 FigSum+ (SurfaceCue)   0.96±0.13  0.41±0.22  0.54±0.21        0.63±0.36  0.16±0.13  0.24±0.17
                 FigSum+ (tfidf)        0.30±0.25  0.34±0.24  0.30±0.20        0.27±0.22  0.20±0.14  0.29±0.15
                 FigSum+ (Similarity)   0.28±0.20  0.38±0.28  0.30±0.22        0.31±0.16  0.28±0.16  0.22±0.16

(a)

                                                 FigSumGS1                              FigSumGS2
Type of Methods  Method                 P          R          F1               P          R          F1
Unsupervised     Proposed (FigSum++)    0.65±0.20  0.54±0.18  0.59±0.18        0.42±0.12  0.38±0.11  0.40±0.11
Supervised       NBSurfaceCues          0.44±0.11  0.17±0.20  0.18±0.15        0.49±0.06  0.05±0.04  0.08±0.05
                 NBSOTA                 0.44±0.15  0.74±0.17  0.53±0.12        0.37±0.14  0.43±0.19  0.38±0.13
                 SVMSOTA                0.58±0.15  0.17±0.20  0.23±0.22        0.54±0.12  0.10±0.11  0.15±0.15
                 NBSimilarity           0.48±0.18  0.15±0.12  0.20±0.12        0.42±0.14  0.10±0.08  0.14±0.08

(b)

6.5.1 Comparison with Existing Unsupervised Methods

In Table 6.6(a), the best results obtained by our proposed approach are compared with some existing unsupervised state-of-the-art techniques. From this table, it can be observed that our proposed unsupervised method (FigSum++) attains the maximum F-measure values of 0.59 and 0.40 for the FigSumGS1 and FigSumGS2 datasets, respectively, using the combination of the SAR TE, STE, SRF, SOC1, and SOC2 objective functions (this corresponds to the best result reported in Table 6.5). Although the FigSum+ (SurfaceCue) method attains high precision values (0.96 and 0.63 for the two datasets), its recall values (0.41 and 0.16) are lower than those of our proposed method. This indicates that the summaries produced by this method contain fewer sentences, which exactly match sentences of the gold

summaries. The Randomsent technique does not consider any feature-specific objective function while generating the summary; it randomly selects the top-n sentences as the figure's summary and thus gives very poor F-measure values of 0.06 and 0.08 on the two datasets, respectively. Note that our technique is based on sentence selection for the figure summary; therefore, we compare only with those techniques which also extract sentences for generating the summary. Out of the three variants of FigSum+, the SurfaceCue method gives F-measure values of 0.54 and 0.24 on the two datasets, which are 5% and 16% less than the best values attained by our proposed unsupervised method. Note that we have not reported the number of sentences in the predicted summary corresponding to each figure, as the average F-measure values over all figures are reported in Table 6.6. We have also compared our results with some supervised methods in Table 6.6(b). For comparison, we have considered different methods, namely NBSurfaceCue [20], NBSOTA [115], NBSimilarity [20], and SVMSOTA [115]. Here, the first three methods (NBSurfaceCue, NBSOTA and NBSimilarity) make use of a naive Bayes classifier [194], while the fourth (SVMSOTA) makes use of a support vector machine [195]. The features used by SVMSOTA and NBSOTA to train the supervised models are whether the sentence refers to the figure, whether the paragraph refers to the figure, reference sentence similarity, caption similarity, etc. Although it is somewhat unfair to compare two different types of techniques (supervised and unsupervised), because in most cases supervised methods perform better, after observing the results it can be concluded that our F-measure values are better than those of the existing supervised methods: there are 6% and 2% improvements obtained by our method for the FigSumGS1 and FigSumGS2 datasets, respectively. However, the recall value of NBSOTA is better than ours. This is because it uses the feature 'figure reference paragraph' during training, whereas our system does not make use of any such paragraph-based feature.

6.5.2 Pareto fronts obtained

At the end of any evolutionary algorithm, a set of Pareto optimal solutions is obtained. Each solution may vary in terms of the number of sentences (as our approach generates solutions with a minimum and maximum number of 1s) and may yield a different summary. The Pareto optimal solutions obtained at the end of the generations (24th generation) of our proposed algorithm, optimizing the objective functions SAR TE, STE, SRF, SFC, SOC1, and SOC2, are shown in Figure 6.4. This Pareto optimal front is obtained while generating the summary for Figure-2 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2941798. It is difficult to plot all the objective functional values in a single


Figure 6.4: Pareto optimal solutions obtained after applying our proposed approach at the end of 24th generation. (a) Figure illustrating objective functional values of SAR TE (denoted as SAR v2 in the figure), STE, and, SRF; (b) Figure illustrating the objective functional values of SRF, SOC1, and, SOC2. ‘fr-0’ in legend denotes solutions are of rank-1.

figure; therefore, we have plotted three objective functions in each of two 3-D plots. The objective function names are written on the axes.

6.5.3 An Example of Summary Obtained

Here, we show an example of a summary obtained by our proposed approach. The summary shown corresponds to Figure-4 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1159166 under the FigSumGS1 dataset and is shown in Figure 6.5 of the current chapter. The actual summary and the figure's caption are also shown. The matching lines between the actual and predicted summaries are highlighted with the same colour. Note that the summary shown in Figure 6.5 is obtained after optimizing the SAR TE, STE, SRF, SOC1, and SOC2 objective functions. The F-measure value obtained for the summary shown is 0.82, and the numbers of sentences in the actual and predicted summaries are 9 and 8, respectively. This can be considered an example of a good summary, as the F-score is more than 80%.


Figure 6.5: An example of Summary obtained by our proposed approach. (a) Figure-4 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1159166; (b) Caption of the figure; (c) Actual and predicted summaries. Coloured lines (excluding black colour lines) in actual and predicted summary indicate the matched lines.


6.5.4 Error Analysis

We have carried out a thorough error analysis of the summaries generated for the figures in the articles of both datasets. This analysis corresponds to the average best F-measure values reported in Table 6.5 for our proposed approach.

For FigSumGS1 dataset:

After observing the F-measure values for all figures in the FigSumGS1 dataset, it was found that only one figure has an F-measure value of less than 20% (Figure-3 of the article available at http://www.ncbi.nlm.nih.gov/pubmed/?term=22473769), and 3 figures have F-measure values between 30% and 35%. For the rest of the figures, the F-measure values are above 40%. The value below 20% occurs for the following reason: the figure discusses the ratio of two biomedical terms, and thus the caption consists almost entirely of numbers, while the sentences in the actual summary do not contain many numbers. Our designed objective functions mainly deal with the figure's caption at the syntactic and semantic level and try to make our summary as close to the caption as possible; thus, there is little overlap between our summary and the actual summary, which decreases the F1-score value.

For FigSumGS2 dataset:

In this dataset, there are mainly three figures (Figure-3, 5, and 6) of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1134656/ which have F1-scores of less than 20%, causing a decrease in the overall average F1-score. Out of these, Figure-6 has an F1 value of 0.09, which can be considered an example of the worst summary generated. This is due to the following reasons: (a) the captions of these figures refer to the caption of another figure (Figure-2 of the same article); (b) the captions of Figure-3 and Figure-5 have only 3 and 2 words, respectively, which are quite insufficient to explain a figure; for the rest of the explanation, they refer to the caption of Figure-2.

6.5.5 Box-plots

To illustrate the effectiveness of using SAR TE over SAR CS, or in other words, to show the variations of the F-measure values (in ExpSet2) corresponding to the two versions of the anti-redundancy objective function (SAR CS and SAR TE) in combination with other objective functions, we have drawn box plots for both datasets. The box plots shown in Figures 6.6(a) and 6.6(b) correspond to the FigSumGS1 and FigSumGS2 datasets, respectively. The results of five sets of objective functions, i.e., SRF, STE+SRF, SRF+SFC, SRF+SOC1+SOC2, and



Figure 6.6: Box plots showing variations of the best F-measure values obtained for (a) Fig- SumGS1; (b) FigSumGS2 datasets. The symbols namely, A, B, C, D, E, F, and, G represent objective functions namely, SAR CS, SAR TE, STE, SRF, SFC, SOC1, and, SOC2, respectively.


STE+SRF+SOC1+SOC2, each combined with SAR CS and SAR TE, are chosen for comparison because these combinations attain F-measure values of at least 50% and 30% for the FigSumGS1 and FigSumGS2 datasets, respectively. Thus, there are a total of 10 boxes in each figure. In each colored box, the horizontal colored line indicates the median F-measure value shown on the y-axis. In these box plots, the symbols A, B, C, D, E, F, and G represent the SAR CS, SAR TE, STE, SRF, SFC, SOC1, and SOC2 objective functions, respectively. From these plots, it can be observed that the objective functions, when integrated with SAR TE, have higher median values than when used with SAR CS. For example, the box corresponding to B+D (i.e., SAR TE+SRF) has a higher median value than A+D (i.e., SAR CS+SRF). Thus, it can be inferred that the anti-redundancy objective function measured in terms of the textual entailment relationship is more effective than the one based on cosine similarity amongst the sentences of the summary.

6.5.6 Statistical Significance of Results

To check the significance of our best result against the existing state-of-the-art results (reported in Table 6.6), we have conducted a statistical significance t-test9 at the 5% significance level. This validates whether the best result obtained is statistically significant or occurred by chance. The test provides a p-value; the smaller the p-value, the more significant the result. The p-values obtained using the F-measure values reported in Table 6.6(a) are:

1. .002695 for FigSumGS1 dataset

2. .000307 for FigSumGS2 dataset

The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.

6.5.7 Complexity Analysis of the Proposed Approach

In this section, we analyze the complexity of our proposed approach. Let the number of solutions, the number of objectives to be optimized, and the maximum number of generations be N, M, and tmax, respectively.

1. Initialization of the population takes O(N) time as there are N solutions. For each solution, its objective functional values are calculated, which takes O(NM) time in total. Thus, the total time complexity of population initialization is O(N + NM), which is equivalent to O(NM).

9https://www.socscistatistics.com/tests/studentttest/default2.aspx

139 Textual Entailment based Figure Summarization for Biomedical Articles

2. Construction of mating pool takes O(1) time as solutions are randomly selected from the population.

3. New solution generation using genetic operators (mutation and crossover) takes O(2 × NM) time. The constant 2 appears because two trial vectors are generated for each solution, i.e., a total of 2N new trial vectors are generated and their associated objective function values are computed.

4. Selection of the best trial vector takes O(1) time.

5. Merging of the old population (P) and the new population (P′) takes O(1) time.

6. Selection of the best solutions based on dominance and non-dominance criteria from the merged population takes O(M(2N)²) time [29].

Steps 2 to 6 are repeated for up to tmax generations. Note that steps 2, 4 and 5 take constant time; therefore, they can be omitted from the total time complexity calculation. Thus, the total time complexity of the proposed architecture is

O(MN + tmax(2(NM) + M(2N)²))

On simplifying further, this gives

O(MN + tmax(2NM + 4MN²)) ≡ O(MN + tmax(4MN²))
⟹ O(MN(1 + 4·tmax·N)) ≡ O(4·tmax·M·N²)
⟹ O(tmax·M·N²)

which is the worst-case time complexity of our approach. From this complexity, it can be inferred that if we increase the number of generations or the number of solutions in the population, the computation time will increase.

6.6 Conclusive Remarks

In this chapter, we have proposed a sentence-based figure summarization system (FigSum++) for biomedical articles. Sentences relevant to a figure are extracted by optimizing different sentence scoring functions. For an efficient search, or to move towards the global optimal solution, an ensemble of two different DE variants is used in the proposed framework. Moreover, another objective function, which measures anti-redundancy in the summary in terms of textual entailment, is also proposed. To measure the semantic similarity amongst sentences, the recently proposed BioBERT language model for biomedical text mining is utilized. From the obtained results, it is

evident that the newly proposed anti-redundancy objective function, when measured in terms of textual entailment (TE) and optimized together with the other objective functions, provides improvements of 5% and 11% in terms of F1-score over the state-of-the-art methods for the two datasets, respectively. Moreover, the TE-based anti-redundancy objective function performs better than the cosine-similarity-based anti-redundancy objective function. Thus, it can be inferred that textual entailment plays a major role in the summarization task. Our future work will concentrate on parallelizing our summarization system by simultaneously generating summaries of all the figures of a given article. The current work focuses only on the figure-summarization task. Nowadays, however, micro-blogging sites are gaining popularity due to the involvement of a large number of users. A lot of tweets are posted per minute by end-users, giving real-time information about ongoing events such as disasters, politics, education, etc. In the case of natural disasters, a significant amount of relevant (and crucial) information is immersed among these tweets. Therefore, there is a need to develop a system that summarizes relevant tweets by extracting informative ones. Thus, in the next chapter, we propose an unsupervised approach for summarizing relevant tweets which automatically selects the informative tweets.


CHAPTER 7

Multi-objective Based Approach for Microblog Summarization

This chapter proposes a novel multi-objective optimization-based framework for microblog/tweet summarization. A subset of relevant tweets is automatically selected from an available set of tweets. Different statistical quality functions measuring various aspects of summary, namely: length, tf-idf score of the tweets, and anti-redundancy, are optimized simultaneously using the search capability of a multi-objective binary differential evolution technique. A newly designed self-organizing map based genetic operator is incorporated in the optimization process. An ablation study is also performed to determine which set of measures is best suited for different datasets. At the end of the chapter, the extension of the proposed approach to solve the multi-document summarization task is also illustrated.


7.1 Introduction

7.1.1 Overview

Due to the continuous growth of social media platforms like Twitter, Tumblr1, and others, a lot of short-text messages called tweets are posted, related to various event or topic categories like education, political issues, disaster events, among others. Thus, they have become an invaluable source for getting updated information regarding ongoing events [196, 23]. According to a Twitter blog2 posted in 2013, 400 million tweets were created each day by 200 million active users. In 2016 and 2019, this number increased to 303 million and 500 million tweets per day3, respectively. In the literature [23, 24, 197, 198], the significance of accessing microblogging sites for gathering information has been illustrated. Thus, a vast amount of information is being generated on a day-to-day basis. These tweets are posted with varying characteristics in terms of relevancy (providing useful information) or non-relevancy, which makes extracting relevant tweets or information from such data a crucial task. If such relevant information is extracted successfully, it may help in the decision-making process. Another challenge is dealing with the extracted relevant tweets: going through all such tweets is time-consuming, which demands summarization/extraction of the relevant tweets [126, 199].

Figure 7.1: Figure showing (a) classification of tweets into situational and non-situational categories; (b) summarization of situational tweets.

In this work, we have considered disaster-related tweets because a summary/extraction of relevant tweets may provide valuable information which may, in effect, enable a management authority to handle the situation in the concerned area. Here, let us call relevant and non-relevant tweets situational and non-situational tweets, respectively. Situational tweets [126] include those tweets that provide information on the current situation of the affected area, the

1https://www.tumblr.com/tagged/social-networking 2https://blog.twitter.com/official/en us/a/2013/celebrating-twitter7.html 3https://www.dsayce.com/social-media/tweets-day/

number of casualties, or some other crucial information, whereas non-situational tweets are related to sympathy, emotions, and post-disaster-event analysis. In Figure 7.1, the general flow of classification vs. summarization is shown; the focus of this work is on part (b) of Figure 7.1. The input/output scenario of the developed system is demonstrated in Figure 7.2, where a set of situational tweets is the input and the extracted useful tweets are the output (matched tweets are shown in colour). These extracted tweets include helpline numbers, blood bank numbers, and the number of persons killed in the Hyderabad blast.

Figure 7.2: Example of Microblog Summarization.

Most of the existing works [126, 21, ?, 118, 127, 199] consider a specific trait/objective while summarizing the tweets. For example, the approach for real-time tweet summarization in [126] focuses on maximizing the number of content words (numerals, nouns and verbs) using integer linear programming. But there may be different traits, like the maximum length of the tweets [21], the tf-idf score of the tweets [?], etc., which can be considered together to obtain a good-quality summary. Thus, in this chapter, a novel microblog/tweet summarization technique (MOOTweetSumm) is proposed using the concepts of multi-objective optimization (MOO). Several tweet scoring features/objective functions, like the length of the tweet [21] and the summation of tf-idf scores [21], are simultaneously optimized using the multi-objective binary differential evolution algorithm (MOBDE) [46], which is an evolutionary algorithm (EA) (described in Sections


2.1.9 and 2.1.10 of Chapter 2). Because there can be many re-tweets, another objective function, anti-redundancy, is also optimized simultaneously to avoid having redundant information as part of the summary.

Like Chapters 3 and 4, here also the SOM-based operator is incorporated in the MOBDE framework to check its effectiveness for the microblog summarization task. To measure the similarity/dissimilarity between tweets, the recently proposed word mover distance (WMD) (see the definition provided in Section 2.1.3 of Chapter 2) is utilized. The proposed approach is evaluated on four disaster-event-related datasets. The results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques. As a part of this work, the potential of the proposed approach is also tested for multi-document summarization, where a given set of documents must be summarized.

7.1.2 Contribution

The major contributions of the current chapter are enumerated below:

• A multi-objective optimization-based approach is proposed for the microblog summarization task in which different goodness measures of a summary are optimized simultaneously. As per our literature survey, this is the first attempt at using the MOO framework for solving the microblog summarization task.

• An ablation study is presented to illustrate which combination of objective functions is best suited for summarizing each dataset.

• The self-organizing map (SOM) based genetic operator is also explored in the MOBDE framework to illustrate the performance improvements.

• Existing algorithms provide a single summary after execution. The proposed approach, in contrast, presents the user with different possible summaries, having a variable number of tweets, corresponding to the different non-dominated solutions of the final Pareto optimal front. Therefore, the user has more alternatives in selecting a single summary from the final pool; depending on the user/domain requirement, a single summary can be selected.


7.2 Problem Definition

Consider an event D consisting of N tweets, D={t1, t2, . . . , tN }. Our main task is to find a subset of tweets, T ⊆ D, such that

Smin ≤ Σ_{i=1}^{N} Bi ≤ Smax,   where Bi = 1 if ti ∈ T, and 0 otherwise        (7.1)

and such that we

maximize {Ob1(T), Ob2(T), Ob3(T)}        (7.2)

where Smin and Smax are the minimum and the maximum number of tweets in the summary, respectively, and Ob1, Ob2 and Ob3 are the objective functions discussed in subsequent sections. Note that in Eq. 7.2, there can also be two objective functions ((Ob1 and Ob2) or (Ob1 and Ob3)) instead of three. These objective functions quantify the goodness of different tweets and further help in improving the quality of the generated summary. All these objective functions have to be maximized simultaneously by the use of some multi-objective optimization framework. These objectives are calculated for each solution in the population, as each solution denotes a subset of tweets representing a summary.

Anti-redundancy (Ob1)

A set of tweets can contain many re-tweets; therefore, to reduce redundancy in the summary, this objective function is considered. It is expressed as:

Ob1 = ( Σ_{i,j=1, i≠j}^{|T|} distwmd(ti, tj) ) / |T|        (7.3)

where ti and tj are the ith and jth tweets belonging to T, |T| is the total number of tweets to be in the summary, and distwmd(ti, tj) is the Word Mover Distance (for the definition refer to Section 2.1.3 of Chapter 2) between the ith and jth tweets.
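As a minimal sketch only, Ob1 could be computed with gensim's Word Mover Distance implementation; the embedding file path, tokenization, and function names below are illustrative assumptions, not the exact code used in this thesis.

```python
from itertools import combinations
from gensim.models import KeyedVectors

# Hypothetical path to a pre-trained (e.g. crisis-domain) word2vec model.
embeddings = KeyedVectors.load_word2vec_format("crisis_word2vec.bin", binary=True)

def anti_redundancy(summary_tweets):
    """Ob1 (Eq. 7.3): average pairwise Word Mover Distance between summary tweets."""
    docs = [t.lower().split() for t in summary_tweets]
    # Each unordered pair appears twice in the i != j double sum of Eq. 7.3,
    # hence the factor of 2.
    total = sum(embeddings.wmdistance(a, b) for a, b in combinations(docs, 2))
    return 2.0 * total / len(summary_tweets)
```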

Maximum tf-idf Score of the Tweets (Ob2)

The tf-idf [188] is a well-known measure in information retrieval for assigning weights to words. Here, 'tf' means term frequency and 'idf' means inverse document frequency in a set of tweets (considered as a document). Each tweet is considered a bag of words, and each word has its own tf-idf score. Thus, a tweet t can be represented as a vector

vt = [w1t, w2t, w3t, ..., wnt]        (7.4)

where

wk,t = tfk,t · ( 1 + log( (1 + N) / (1 + |{t′ ∈ D | k ∈ t′}|) ) )        (7.5)

and tfk,t is calculated by counting the number of occurrences of the kth word in the tweet t, t′ ∈ D, and N is the total number of tweets available. The summation of the tf-idf scores of the tweets belonging to T is then considered; the subset of tweets having the maximum average tf-idf score is considered a good summary. Mathematically, it can be expressed as

Ob2 = ( Σ_{i=1}^{|T|} Σ_{wordk ∈ ti, ti ∈ T} wk,ti ) / |T|        (7.6)

where wk,ti is the tf-idf score of the kth word (wordk) present in tweet ti, and ti is the ith tweet belonging to T.

Maximum length of the tweets (Ob3)

Based on the assumption that a longer tweet conveys more important information, this objective function is taken into consideration. Mathematically, it can be expressed as

Ob3 = Σ_{i=1}^{|T|} length(ti)        (7.7)

where ti is the ith tweet in the summary and length(·) counts the number of words in the tweet after removing stop words (e.g., is, am, are). However, some longer tweets may not be relevant, as they contain irrelevant words; therefore, the other objective function discussed above

(Ob2) is considered, which pays attention to the importance of different words in the tweet. Note that Ob3 is not averaged over the number of tweets in the summary. The reason is explained with an example: suppose summary A has 20 tweets and summary B has 21 tweets, including the 20 tweets that are also in A. If the additional tweet has a length of 1 (one word), then the average length of summary B will be smaller than that of A, which contradicts our intuition.
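The two remaining objectives can be sketched as follows; this is only an illustrative implementation of Eqs. 7.5-7.7, in which the tokenization, stop-word list, and function names are assumptions rather than the thesis code.

```python
import math
from collections import Counter

def tfidf_weights(all_tweets):
    """Per-tweet word weights w_{k,t} following the smoothed tf-idf of Eq. 7.5."""
    n = len(all_tweets)
    docs = [t.lower().split() for t in all_tweets]
    df = Counter(w for doc in docs for w in set(doc))  # number of tweets containing each word
    return [{w: tf * (1 + math.log((1 + n) / (1 + df[w])))
             for w, tf in Counter(doc).items()} for doc in docs]

def ob2_tfidf(summary_idx, weights):
    """Ob2 (Eq. 7.6): average summed tf-idf score of the selected tweets."""
    return sum(sum(weights[i].values()) for i in summary_idx) / len(summary_idx)

def ob3_length(summary_idx, all_tweets, stopwords=frozenset({"is", "am", "are"})):
    """Ob3 (Eq. 7.7): total number of non-stop-word tokens in the selected tweets."""
    return sum(sum(1 for w in all_tweets[i].lower().split() if w not in stopwords)
               for i in summary_idx)
```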

7.3 Proposed Methodology

In this work, we have developed an extractive tweet summarization system. It utilizes a multi-objective differential evolution technique as the underlying optimization strategy. SOM-based genetic operators are incorporated in the process to examine their effectiveness. The flowchart of the proposed approach is shown in Figure 7.3.

[Figure 7.3 flowchart: (1) input a set of tweets; (2) pre-processing; (3) population initialization (P), objective function calculation, g = 0; (4) SOM training; (5) apply genetic operators to form a new population P′ and calculate its objective function values; (6) merge P and P′; (7) select the best |P| solutions for the next generation; (8) update the SOM training data and increment g; if g < gmax, return to step 4; (9) obtain the set of Pareto optimal solutions; (10) select the best solution and the corresponding summary.]

Figure 7.3: Flow chart of the proposed architecture where, g is the current generation number initialized to 0 and gmax is the user-defined maximum number of generations (termination condition), |P | is the size of population.

7.3.1 Representation of Solution and Population Initialization

Here, the population is initialized in the same manner as done in Chapter 5 (refer to Section 5.4.2). Note that the initial population may have a varied number of tweets between [Smin, Smax]. This provides the end-user the flexibility to choose the best summary as per his/her requirement or expert knowledge in terms of the number of tweets. A minimal sketch of this initialization is given below.
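The sketch assumes the same binary encoding (one bit per tweet) as in Chapter 5; the variable names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def initialize_population(pop_size, n_tweets, s_min, s_max, seed=0):
    """Each solution is a binary vector over the N tweets; the number of ones
    (selected tweets) is drawn uniformly from [s_min, s_max]."""
    rng = np.random.default_rng(seed)
    population = np.zeros((pop_size, n_tweets), dtype=int)
    for solution in population:
        k = rng.integers(s_min, s_max + 1)  # summary size of this solution
        solution[rng.choice(n_tweets, size=k, replace=False)] = 1
    return population
```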

7.3.2 Objective Functions Used

To obtain a good summary, the use of a good set of objective functions/quality measures is essential. These objective functions quantify the quality of the subset of tweets present in a solution, and thus optimizing all of them helps in achieving a good-quality summary. All these objective functions have already been discussed in Section 7.2, and all are of maximization type.

7.3.3 SOM Training

In this step, SOM training is performed using the solutions in the population, as described in Algorithm 1 of Chapter 2. Thus, SOM helps in understanding the distribution structure of the solutions in the population; in other words, SOM provides a topology-preserving map of the solutions in a low-dimensional space. An illustrative sketch of this step is given below.
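As an illustration only, the same step could be realized with an off-the-shelf SOM implementation such as MiniSom (the thesis follows Algorithm 1 of Chapter 2); the grid size and learning rate mirror Section 7.4.4, while the default sigma value is an assumption.

```python
import numpy as np
from minisom import MiniSom  # third-party SOM library, used here purely for illustration

def train_som(population, grid=(5, 5), sigma0=2.5, eta0=0.6):
    """Fit a rectangular 2-D SOM on the current population so that nearby neurons
    correspond to similar solutions (a topology-preserving low-dimensional map)."""
    data = np.asarray(population, dtype=float)
    som = MiniSom(grid[0], grid[1], data.shape[1], sigma=sigma0, learning_rate=eta0)
    som.random_weights_init(data)
    som.train_random(data, len(data))  # |P| training iterations, as in Section 7.4.4
    return som
```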

7.3.4 Genetic Operators

In our framework, from each solution, a new solution is generated using three steps: mating pool generation, mutation, and crossover; the set of these new solutions forms a new population (P′). For the construction of the mating pool for the current solution, SOM is utilized as discussed in Section 2.1.12 of Chapter 2. The remaining genetic operators are the same as used for the single-document summarization task in Chapter 5. After generating a new solution using the genetic operators, it undergoes constraint checking


to verify whether the number of ones (or tweets) in the solution lies between [Smin, Smax]. If it violates this constraint, it is made feasible using the following steps:

1. Pick up the new solution. Let us call it the ith solution.

2. Initialize ModifiedSolution as a vector of zeros whose length equals the maximum solution length.

3. Find the indices of the tweets of the ith solution sorted by maximum tweet length or maximum tf-idf score. To do this, a random probability 'p' is generated: if p < 0.5, the tweets are sorted based on maximum tweet length; otherwise, they are sorted based on maximum tf-idf score.

4. Generate a random number 'r' between Smin and Smax.

5. Fill the indices of ModifiedSolution with 1s until 'r' indices are covered. Note that the filled indices are the sorted indices obtained in step 3.

6. Return the ModifiedSolution.

Here, it is important to note that while optimizing two objectives, Ob1 and Ob2, we also give importance to new solutions generated using the maximum tweet length score (based on the probability in step 3), because such solutions may contain long tweets that convey important information. A sketch of this repair procedure is given below.
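The following is one possible reading of steps 1-6; the scoring arrays, tie handling, and function names are assumptions made for illustration.

```python
import numpy as np

def repair(solution, tweet_lengths, tfidf_scores, s_min, s_max, rng=np.random.default_rng()):
    """Make an infeasible binary solution feasible: keep the r top-ranked tweets of
    the solution, ranked by tweet length or tf-idf score (chosen at random, step 3)."""
    selected = np.flatnonzero(solution)                          # indices of ones (step 1)
    key = tweet_lengths if rng.random() < 0.5 else tfidf_scores  # step 3
    ranked = selected[np.argsort(-np.asarray(key)[selected])]    # best tweets first
    r = rng.integers(s_min, s_max + 1)                           # step 4
    repaired = np.zeros_like(solution)                           # step 2
    repaired[ranked[:r]] = 1                                     # step 5
    return repaired                                              # step 6
```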

7.3.5 Selection of Best |P | Solutions for Next Generation

After generating the new population P′, it is merged with the old population P. Out of these solutions, the best |P| solutions are selected using the well-known non-dominated sorting (NDS) and crowding-distance-based operators of the NSGA-II algorithm [29]. For more details about this step, one can refer to Section 3.2.6 of Chapter 3.

7.3.6 Updating SOM Training Data and Termination Condition

These steps are similar to those described in Sections 5.4.7 and 5.4.8 of Chapter 5.

7.3.7 Selection of Single Best Solution and Generation of Summary

This step is similar to the one discussed in Section 5.4.9 of Chapter 5. In brief, two methods, supervised and unsupervised, are explored to select the best solution. Let us call these methods SBest and UBest, respectively; they are described below:


• SBest: It selects the solution having the highest ROUGE-1 score, obtained by utilizing the gold/reference summary.

• UBest: In this method, an adaptive weighting scheme (AWS) [200] is utilized in which objective functional values are summed after being multiplied by their respective weights. The solution having the best value of the weighted sum is considered the best solution. Let K × #Ob be the matrix of objective functional values, where K and #Ob are the number of Pareto optimal solutions and the number of objective functions used in our optimization strategy, respectively. The steps used to select the best solution are explained below:

1. Normalize the values of objective functions by applying

Fkl = Obkl / Obl+,   where Obl+ = max_{k ∈ K} Obkl        (7.8)

where Obkl is the lth objective function value corresponding to the kth solution.

2. Construct the normalized weighted matrix by multiplying each normalized objective function value by its respective weight as

Fwtdkl = Fkl × wl        (7.9)

where wl is the weight factor assigned to the lth objective.

3. For each kth solution, evaluate the sum of the weighted normalized objective functional values as defined below:

Scorek = Σ_{l=1}^{#Ob} Fwtdkl        (7.10)

4. Find the solution having the largest Score.

Note that the weight factors can be determined after conducting a sensitivity analysis.
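A compact sketch of this weighting scheme is given below; the objective values in the usage example are made up for illustration, and the weights mirror those reported later in Section 7.5.2.

```python
import numpy as np

def select_best_unsupervised(objective_matrix, weights):
    """UBest: normalize each objective by its column maximum (Eq. 7.8), weight it
    (Eq. 7.9), sum per solution (Eq. 7.10) and return the index of the best one."""
    F = objective_matrix / objective_matrix.max(axis=0)  # K x #Ob normalized matrix
    scores = (F * np.asarray(weights)).sum(axis=1)       # weighted sum per solution
    return int(np.argmax(scores))

# Illustrative call with three Pareto-optimal solutions (made-up objective values)
# and the weights (0.4, 0.3, 0.7) used for Ob1, Ob2 and Ob3 in Section 7.5.2.
objectives = np.array([[0.61, 3.2, 240.0],
                       [0.55, 3.6, 255.0],
                       [0.58, 3.4, 230.0]])
best_index = select_best_unsupervised(objectives, weights=[0.4, 0.3, 0.7])
```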

The tweets in the summary are reported based on their occurrence in the original dataset. For example, the tweet which appears first in the dataset will be the first tweet in the summary.

7.4 Experimental Setup

In this section, we discuss the datasets used, the evaluation measure, the parameter settings, and the comparative methods.



Figure 7.4: Word clouds of disaster events, namely (a) Sandyhook (SH); (b) Uttarakhand flood (UK); (c) Typhoon Hagupit in Philippines (TH); and (d) Bomb blasts in Hyderabad (HB).

7.4.1 Datasets

For the microblog summarization work, we have used datasets related to four disaster events, namely (a) the Sandy Hook elementary school shooting in the USA (SH); (b) the Uttarakhand floods (UK); (c) Typhoon Hagupit in the Philippines (TH); and (d) the bomb blasts in Hyderabad (HB). The numbers of tweets in these datasets are 2080, 2069, 1461, and 1413, respectively. The same datasets are used in the paper [21] and are briefly described in Table 7.1. Tweets in these datasets provide different relevant information, like the current situation in various regions affected by the disaster, the number of casualties, and the contact numbers of helping authorities and hospitals. A reference/gold summary is also available with each dataset; it is utilized only for evaluation at the end of the execution of our proposed approach, as our approach is fully unsupervised in nature. The calculation of the objective functions is also fully unsupervised, and the other steps of the proposed approach do not consult any supervised information. The numbers of tweets in the gold summaries are 37, 34, 41 and 33 for the SH, UK, TH, and HB datasets, respectively. Before passing any dataset as input to our algorithm, some pre-processing steps are executed: removal of special characters, hashtags, stop words, user mentions and URLs, and conversion of all words to lower case.


Table 7.1: Dataset descriptions for Microblog Summarization

S.No.  Event                                          Year  Abbreviation  #Tweets  Popular Hashtags
1      Sandy Hook elementary school shooting in USA   2012  SH            2080     #schoolshooting, #stoptheviolence, #SandyHook
2      Floods in Uttaranchal state of India           2013  UK            2069     #UttarakhandFloods, #Kedarnath, #prayers4all
3      Typhoon Hagupit in Philippines                 2014  TH            1461     #TyphoonHagupit, #Phillippines, #RescuePH
4      Bomb blasts in Hyderabad, India                2007  HB            1413     #Hyderabadblast, #india, #killed, #Serialblast

7.4.2 Comparative Methods

For comparison, we have considered one recent approach developed in 2018, namely EnGraphSumm [21]. Many versions of EnGraphSumm have been developed, out of which we consider only the top 4, namely VecSim-ConComp-maxSumTFIDF, VecSim-ConComp-MaxDeg, VecSim-Community-maxSumTFIDF and VecSim-ConComp-MaxLen. Each one first generates summaries using different existing algorithms and then uses an ensembling strategy to select the tweets. These tweets are grouped by some graph-based method [21], and then from each group one tweet is selected as part of the summary based on various features like the maximum length of the tweets, the maximum degree of a node, etc. Along with these approaches, some other approaches, like COWTS [126], Lex-Rank [32], LSA [129], LUHN [122], SumBasic [124], MEAD [177] and SumDSR [125], are also taken into account for comparison.

7.4.3 Evaluation Measure

To check the performance/closeness of the generated summary with respect to the actual summary, we have used ROUGE-N, discussed in Section 2.3.2 of Chapter 2. It counts the number of overlapping units between the generated summary and the actual summary; a summary having a higher ROUGE score is considered closer to the actual summary. In our experiments, we compute ROUGE-1, ROUGE-2, and ROUGE-L. But, for comparison with the existing algorithms, we make use of only the ROUGE-2 and ROUGE-L scores, as the reference papers report only these scores.
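As an illustration only, the ROUGE-1/2/L F-scores could be computed with the rouge-score package; this is not necessarily the evaluation toolkit used in the thesis (ROUGE itself is defined in Section 2.3.2 of Chapter 2).

```python
from rouge_score import rouge_scorer

def evaluate_summary(generated, reference):
    """Return ROUGE-1, ROUGE-2 and ROUGE-L F-scores of a generated summary
    against the gold/reference summary."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, generated)  # score(target, prediction)
    return {name: round(s.fmeasure, 4) for name, s in scores.items()}
```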

7.4.4 Parameters Used

Different parameter values used in our proposed framework are as follows. DE parameters: |P| = 25, mating pool size = 5, threshold probability in mating pool construction (β) = 0.8, maximum number of generations (gmax) = 25, crossover probability (CR) = 0.8, b = 6, F = 0.8. SOM parameters: initial learning rate (η0) = 0.6, number of training iterations = |P|, topology = rectangular 2D grid of size N1 × N2 = 5 × 5, initial neighborhood size σ0 = (1/2)√( Σ_{i=1}^{m−1} Ni² / (m − 1) ). A sensitivity analysis on the DE and SOM parameters is discussed in the next section. The minimum (Smin)


and maximum (Smax) numbers of tweets in the summary for the SH, UK, TH and HB datasets are considered as [34, 40], [31, 37], [39, 44], and [31, 36], respectively. Word Mover Distance makes use of a pre-trained word2vec model4 to calculate the distance between two tweets; this model was trained on 53 million tweets related to various disaster events [196]. The number of fitness function evaluations (NFE) for our proposed architecture is kept at 1950. The results obtained are averaged over 5 runs of the algorithm.

7.5 Discussion of Results

In this section, we discuss the results obtained using the supervised (SBest) and unsupervised (UBest) selection methods, the comparison with existing approaches, and an analysis of the results obtained.

7.5.1 Discussion of results obtained using SBest selection method

In Table 7.2, we have shown the average results over all datasets obtained by the proposed approach, MOOTweetSumm, using both the 'with SOM' and 'without SOM' versions of the genetic operators. Various combinations of the objective functions (discussed in Section 7.2) are also explored to identify which set of objective functions is best suited for our task; the corresponding results are reported in Table 7.2. The best result was obtained by our approach when using the 'without SOM' version with the objective functions maximum anti-redundancy (Ob1) and tf-idf score (Ob2). We have also reported the different evaluation measures for each dataset in Table 7.3. From this table, it can be observed that objectives Ob1, Ob2, and Ob3 are best suited for TH, while for the SH and HB datasets our approach attains good results when objectives Ob1 and Ob3 are used. As the comparative approaches report average results over the four datasets, to make a fair comparison we have reported the average results in Table 7.5 in comparison with the state-of-the-art techniques.

Table 7.2: Average ROUGE Scores over all datasets attained by the proposed method using supervised information. Here, † denotes the best results; it also indicates that results are statistically significant at 5% significance level.

Approach       Operator      Objective functions  Rouge-1  Rouge-2  Rouge-L
MOOTweetSumm   With SOM      Ob1+Ob2+Ob3          0.4912   0.2999   0.4850
                             Ob1+Ob2              0.4738   0.3033   0.4678
                             Ob1+Ob3              0.4843   0.3095   0.4790
               Without SOM   Ob1+Ob2+Ob3          0.4789   0.2984   0.4745
                             Ob1+Ob2              0.4900   0.3150   0.4860
                             Ob1+Ob3              0.4903   0.3192   0.4848

4http://crisisnlp.qcri.org/lrec2016/lrec2016.html


Table 7.3: ROUGE Scores obtained by the proposed approach for different datasets using SBest selection method. Bold entries indicate the best results considering 'with SOM' and 'without SOM' based operators.

Event  Objective functions   With SOM (R-1 / R-2 / R-L)   Without SOM (R-1 / R-2 / R-L)
SH     Ob1+Ob2+Ob3           0.5842 / 0.3612 / 0.5776     0.5940 / 0.3612 / 0.5874
SH     Ob1+Ob2               0.5346 / 0.3303 / 0.5248     0.5842 / 0.3721 / 0.5842
SH     Ob1+Ob3               0.5842 / 0.3775 / 0.5743     0.6139 / 0.3975 / 0.6073
UK     Ob1+Ob2+Ob3           0.4400 / 0.2469 / 0.4329     0.4494 / 0.2577 / 0.4424
UK     Ob1+Ob2               0.4423 / 0.2714 / 0.4376     0.4541 / 0.2822 / 0.4447
UK     Ob1+Ob3               0.4565 / 0.2791 / 0.4518     0.4471 / 0.2623 / 0.4400
TH     Ob1+Ob2+Ob3           0.4181 / 0.2365 / 0.4097     0.3697 / 0.2213 / 0.3655
TH     Ob1+Ob2               0.3634 / 0.2158 / 0.3634     0.3845 / 0.2184 / 0.3782
TH     Ob1+Ob3               0.3866 / 0.2296 / 0.3802     0.3634 / 0.2241 / 0.3550
HB     Ob1+Ob2+Ob3           0.5223 / 0.3552 / 0.5198     0.5025 / 0.3534 / 0.5025
HB     Ob1+Ob2               0.5198 / 0.3776 / 0.5173     0.5371 / 0.3914 / 0.5371
HB     Ob1+Ob3               0.5099 / 0.3517 / 0.5099     0.5371 / 0.3931 / 0.5371

In the literature, the efficacy of SOM-based reproduction operators (for constructing the mating pool) has already been shown in solving various problems, like document clustering in Chapter 3, document summarization in Chapters 4 and 5, and the development of an evolutionary algorithm [168]. But, from the obtained experimental results, it is evident that the effectiveness of SOM-based operators also depends on the datasets and the problem statement chosen. SOM-based operators are developed based on the assumption that the mating pool should be restricted to the neighboring solutions of the current solution. This restricts the genetic operations to be performed between neighboring solutions only; thus exploitation is preferred over exploration. But in the case of tweet summarization, the neighborhood of a neuron mostly consists of solutions containing re-tweets. Thus, if the genetic operators are applied to re-tweets, good-quality solutions may not be generated. Hence, in this case, SOM-based genetic operators only help in exploitation, while our summarization task demands more exploration than exploitation. Therefore, our approach using 'without SOM' genetic operators performs better than the 'with SOM' version.

Exploration vs. Exploitation Behaviour: In Figs. 7.5(a)-(d), the exploration vs. exploitation behaviour of our proposed algorithm is shown for all datasets with respect to the number of generations, using the two objectives Ob1+Ob2 (as this combination gives the best average result), for both the 'with SOM' and 'without SOM' versions of the operators. As can be seen from the red line corresponding to the 'without SOM' version, the number of new good solutions generated per generation is higher than for the 'with SOM' version in most of the generations. That means the 'without SOM' version explores the search space more efficiently. This is due to the random selection of three solutions out of the whole population to generate a new solution for the current solution,

as is usually done in the MOBDE algorithm, and thus it can provide the best average ROUGE score. However, both versions move towards exploitation, as the number of new solutions generated decreases over the generations.


Figure 7.5: Figures showing the number of new solutions generated over the generations by our proposed approach using two objectives, Ob1+ Ob2; a comparative study between ‘with SOM’ and ‘without SOM’ based operators. Here, (a), (b), (c), and (d) correspond to SH, UK, TH and HB datasets, respectively.

To check whether the used objective functions are optimized over the generations, we have plotted graphs showing the generation-wise maximum objective functional (Ob1 and Ob2) values for all datasets. These graphs are shown in Fig. 7.6, which shows that the objective functional values increase over the iterations and become constant after a particular iteration due to the limited length of the tweets and the vocabulary size.

7.5.2 Discussion of results obtained using UBest selection method

From the average results shown in Table 7.2, obtained using the SBest selection method, it can be observed that (i) in the case of the SOM-based operator, our approach performs better when all



Figure 7.6: Generation-wise objective function values using MOOTweetSumm (Without SOM, Ob1+Ob2). Here, (a), (b), (c) and (d) correspond to the SH, UK, TH and HB datasets, respectively.

objective functions are optimized simultaneously; (ii) in the case of the 'without SOM' operator, our approach performs well when the two objective functions Ob1 and Ob2 are optimized simultaneously. Therefore, using the same sets of objective functions for the 'with SOM' and 'without SOM' operators, we have explored the unsupervised method for selecting the best solution as discussed in Section 7.3.7. The corresponding results are reported in Table 7.4. The weight factors assigned to the objective functions Ob1, Ob2 and Ob3 when using the SOM-based operator are 0.4, 0.3, and 0.7, respectively, while when not using the SOM-based operator, the weight factors assigned to Ob1 and Ob2 are 0.3 and 0.7, respectively. Note that in the case of the SOM-based operator, weight values of 0.2, 0.3, and 0.5 assigned to Ob1, Ob2 and Ob3, respectively, generate the same results. These weight factors are determined after conducting a thorough sensitivity analysis. On comparing the results of the 'with SOM' and 'without SOM' operators under UBest, both give a ROUGE-2 score of 0.3033, but, in terms of ROUGE-L, the proposed approach using the 'without SOM' operator achieves 0.4769, which is higher

than the ROUGE-L score of 0.4681 obtained using the 'with SOM' operator. Note that ROUGE-L measures the matching of the longest common subsequence between the obtained summary and the reference summary; thus, ROUGE-L may be preferred over ROUGE-2. A similar discussion applies to the ROUGE-1 score.

Table 7.4: ROUGE Scores obtained by the proposed approach for different datasets using UBest selection method.

Event     Ob1+Ob2+Ob3, With SOM (R-1 / R-2 / R-L)   Ob1+Ob2, Without SOM (R-1 / R-2 / R-L)
SH        0.5743 / 0.3848 / 0.5710                  0.5842 / 0.3721 / 0.5842
UK        0.4376 / 0.2715 / 0.4329                  0.4376 / 0.2577 / 0.4376
TH        0.3592 / 0.2019 / 0.3487                  0.3592 / 0.1923 / 0.3487
HB        0.5223 / 0.3552 / 0.5198                  0.3857 / 0.3914 / 0.5371
Average   0.4734 / 0.3033 / 0.4681                  0.4417 / 0.3033 / 0.4769

Table 7.5: Average ROUGE Scores over all datasets attained by existing methods in comparison with the best results obtained by the proposed approach using SBest (Table 7.2) and UBest (Table 7.4) selection methods. Here, WOSOM refers to without SOM; SBest and UBest are the supervised and unsupervised selection methods.

Approach                                   Rouge-2   Rouge-L
MOOTweetSumm (SBest, WOSOM, Ob1+Ob2)       0.3150†   0.4860†
MOOTweetSumm (SBest, SOM, Ob1+Ob2+Ob3)     0.2999    0.4850
MOOTweetSumm (UBest, WOSOM, Ob1+Ob2)       0.3033    0.4769
MOOTweetSumm (UBest, SOM, Ob1+Ob2+Ob3)     0.3033    0.4681
VecSim–ConComp–MaxDeg                      0.1919    0.4457
VecSim–ConComp–MaxLen                      0.1940    0.4506
VecSim–ConComp–maxSumTFIDF                 0.1886    0.4600
VecSim–Community–maxSumTFIDF               0.1898    0.4591
ClusterRank (CR)                           0.0859    0.2684
COWTS (CW)                                 0.1790    0.4454
FreqSum (FS)                               0.1473    0.3602
Lex-Rank (LR)                              0.0489    0.1525
LSA (LS)                                   0.1599    0.4234
LUHN (LH)                                  0.1650    0.4015
Mead (MD)                                  0.1172    0.3709
SumBasic (SB)                              0.1012    0.3289
SumDSDR (SM)                               0.0985    0.2602

On comparing the best average ROUGE scores of the SBest and UBest selection methods, SBest performs better than UBest, which is expected because of its use of supervised information. The best ROUGE-2 and ROUGE-L scores attained by SBest are 0.3150 and 0.4860, respectively, while, using UBest, the ROUGE-2 and ROUGE-L scores are 0.3033 and 0.4769, respectively. Thus, the UBest method is not able to reach the exact results (average ROUGE scores) obtained by SBest. But it can be inferred that the results of UBest are able to beat the results of the existing approaches. Researchers are still exploring different techniques in this context.


7.5.3 Comparative Analysis

The comparative methods VecSim–ConComp–MaxDeg, VecSim–ConComp–MaxLen, VecSim–ConComp–maxSumTFIDF, and VecSim–Community–maxSumTFIDF are based on an ensembling technique, i.e., they consider the summaries generated by different existing algorithms and then generate the final summary in an unsupervised/supervised way. Although this is a promising technique, it is very time-consuming in a real-time scenario. Also, these approaches remove redundant tweets before applying the ensembling algorithm. The remaining algorithms, like LUHN, Lex-Rank, MEAD etc., are very basic algorithms suggested in the literature [34]. The technique COWTS generates the summary based on the content words in the dataset. Our proposed approach is unique compared to all the existing approaches in the following ways:

• None of the comparative methods provides the user with a set of alternative solutions on the final Pareto front; thus, they do not give the end-user an opportunity to select the single best summary out of many choices as per his/her requirement. In our approach, the end-user has the flexibility to select a single summary based on some objective functional value or his/her expert knowledge.

• Moreover, unlike the other compared approaches, redundant tweets are automatically removed from the resultant summary by utilizing the anti-redundancy objective function in our approach.

Experimental results suggest that our algorithm is able to beat all these algorithms, as it attains ROUGE-2 and ROUGE-L values of 0.3150 and 0.4860, respectively, using the SBest selection method. In other words, our algorithm improves by 62.37% and 5.65% in terms of ROUGE-2 and ROUGE-L scores, respectively, over the state-of-the-art techniques. Lex-Rank performs the worst among all techniques. Note that the 'improvement obtained' is calculated using the formula ((ProposedMethod − OtherMethod) / OtherMethod) × 100.

7.5.4 Quality of Summaries for Different Solutions

To illustrate the quality of the summaries corresponding to different solutions on the final Pareto front obtained at the final generation using the proposed approach with the 'with SOM' and 'without SOM' genetic operators, we have also plotted the ranges of Rouge-2/L score values attained by the rank-1 solutions in Fig. 7.7. We have chosen rank-1 solutions because the best solution belongs to this set. From Fig. 7.7(a), (b), and (d) for the SH, UK, and HB datasets, respectively, it can be observed that some solutions in the 'without SOM' version have low ROUGE-2/L values, but the best solution is identified by this version (as can be seen from the Rouge values



Figure 7.7: Box plots in sub-figures (a), (b), (c) and (d) for SH, UK, TH and HB datasets, respectively, show the variations of average Rouge-2/Rouge-L values of highest ranked (rank-1) solutions of each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions.

corresponding to the green bullets). But, for the UK dataset, the median value of the rank-1 solutions is higher when using the 'with SOM' version. Thus, it can be inferred that the efficacy of SOM as a reproduction operator in the summarization framework depends on the dataset used; SOM-based operators will not be effective in solving the summarization task in all cases.

7.5.5 Pareto Fronts Obtained

The Pareto fronts obtained by our proposed approach corresponding to the best results obtained using the SBest selection method are shown in Figs. 7.8 and 7.9, generated at the end of the {0, 10, 20}th generations. The Pareto fronts shown in Figs. 7.8 and 7.9 correspond to the 'with SOM' and 'without SOM' versions, respectively, for the TH dataset. Note that we have not shown the Pareto fronts for the other datasets due to length constraints. In the 0th generation, solutions are initialized randomly and are thus randomly distributed over the objective space. On comparing the 'with SOM' and 'without SOM' versions (Figs. 7.8 and 7.9), it can be observed that using the 'without SOM' version we obtain a more optimized and diverse set of solutions, which also supports our results reported in Table 7.2. In these figures, a '.' indicates a solution's objective functional values. Various colors represent solutions of different ranks or fronts; the highest ranked solutions are indicated by the color (blue) assigned to 'fr-0', as shown in the legend of Fig. 7.8(a), and so on.



Figure 7.8: Pareto optimal fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘With SOM’ version.

7.5.6 Sensitivity Analysis on the Parameters Used

The performance of any algorithm depends on the proper choice of the parameters used in that algorithm. Therefore, in this section, we have performed a sensitivity analysis on the parameters used. We have considered 5 parameters (η0, CR, b, F and Q) for optimization, each having a possible range of values reported in Table 7.8. Note that we have excluded the population size

(|P|) and the maximum number of generations (gmax) from the optimization process because the larger gmax and |P| are, the higher the execution time of the algorithm will be. Therefore, gmax and |P| are both fixed at 25. The number of neurons in the SOM grid is set equal to the population size; these neurons are arranged in a 2-D grid of size N1 × N2 = 5 × 5. The initial neighborhood radius is taken as σ0 = (1/2)√( Σ_{i=1}^{m−1} Ni² / (m − 1) ), inspired by [168], where m is the number of objective functions used in our approach. For the parameters listed in Table 7.8, different combinations are tried. The corresponding results after executing the proposed approach using the SOM-based operator and optimizing the two



Figure 7.9: Pareto fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘Without SOM’ version.

objectives Ob1 and Ob2 are shown in Table 7.6 and Table 7.7. These results are obtained after a single execution of the proposed algorithm. In these tables, the first column indicates the parameter setting (PS) number; for example, PS01 is the first parameter setting. Table 7.9 reports the average ROUGE scores over all datasets corresponding to the different parameter settings mentioned in Table 7.6 and Table 7.7. It is evident from Table 7.9 that the best average ROUGE scores were obtained using parameter setting number 3, i.e., PS03 (highlighted in bold). The results in Tables 7.2, 7.3 and 7.4 are generated using the best parameter setting (PS03). We assume that the best parameter setting obtained will also work when using our approach with the 'without SOM' operator and optimizing any number of objective functions. Regarding the best parameter values for individual datasets, parameter settings PS03, PS20, PS01 and PS15 are best suited for the SH, UK, TH and HB datasets, respectively.


7.5.7 Statistical significance test

To validate the results obtained by the proposed approach, a statistical significance test, Welch's t-test [187], is conducted at a 5% significance level, as was done in the previous chapter. It is carried out to check whether the best average ROUGE scores (in Table 7.5) obtained by the proposed approach are statistically significant or occurred by chance. This t-test provides a p-value; a smaller p-value signifies greater significance. The p-values obtained using Table 7.5 are (a) < .00001 using the ROUGE-2 scores and (b) .000368 using the ROUGE-L scores. The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.
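Welch's t-test can be reproduced with scipy as sketched below; the per-run ROUGE-2 scores shown are placeholders for illustration, not the actual experimental values.

```python
from scipy import stats

# Placeholder per-run ROUGE-2 scores for the proposed approach and one baseline.
proposed = [0.3150, 0.3102, 0.3187, 0.3121, 0.3165]
baseline = [0.1940, 0.1902, 0.1968, 0.1911, 0.1955]

# equal_var=False gives Welch's t-test (unequal variances).
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
is_significant = p_value < 0.05  # 5% significance level
```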

7.6 An Application to Multi-document Summarization

To show the effectiveness of our proposed approach on data from another domain, we have also performed multi-document summarization. The task is to generate a fixed-length summary (in terms of the number of words) given a collection of documents. For this task, we have used the standard DUC 2002 datasets provided by the Document Understanding Conference. The collection contains 59 topics, each having approximately 10 documents. The corresponding multi-document summaries (two in number), each of 200 words, are also available for each topic. Out of the 59 topics, the ten topics ranging from d061j to d070f are considered in our experiments; the same set of topics is also considered in the comparative approaches (discussed below). Statistics about these topics, like the number of words and the number of sentences, are provided in Table 7.10.

7.6.1 Comparative Approaches and Differences with Our Approach

For comparison, two existing evolutionary-based approaches are considered. The first approach utilized adaptive differential evolution [201] for optimization, in which the DE parameters are adaptive. In this approach, a weighted combination of two objectives, namely anti-redundancy (AR) and coverage (COV), is optimized. The mathematical definition of anti-redundancy is given in Eq. 7.3, while coverage refers to the central theme of the document collection, which should be covered in the summary. For a solution in the population, it is evaluated as Σ_{i=1}^{N} sim(svi, O), where N is the total number of sentences, svi is the vector representation (numeric vector) of the ith sentence belonging to the solution, and O is the document vector calculated by averaging the sentence vectors. To represent the sentences in vector form, the well-known tf-idf representation of the vector space model in information retrieval [21] is utilized. To measure the similarity among

sentences, and between sentences and the document vector, cosine similarity is utilized. In the second approach [202], these objectives (AR and COV) are optimized simultaneously (instead of using a weighted combination) using the well-known genetic algorithm in the field of multi-objective optimization, i.e., the non-dominated sorting genetic algorithm (NSGA-II) [29]. It also makes use of the same sentence vector representation strategy and similarity measure as used in adaptive DE. But, in our approach, a semantic similarity measure (WMD) is utilized. We do not make use of any vector representation scheme; therefore, in place of O in the coverage function definition, we have considered the representative sentence (sR), whose index in the document collection is evaluated by calculating the minimum average dissimilarity of each sentence with the other sentences in the topic, i.e.,

arg min_R ( Σ_{j=1, j≠R}^{N} distwmd(sR, sj) / (N − 1) )        (7.11)

where R = 1, 2, ..., N, N is the total number of sentences, and distwmd is the word mover distance.
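A sketch of locating the representative sentence of Eq. 7.11 from a precomputed pairwise WMD matrix is given below; the conversion of WMD distance to a similarity for the coverage score is an assumption, since the exact form is not spelled out here.

```python
import numpy as np

def representative_index(wmd_matrix):
    """Index of s_R (Eq. 7.11): the sentence with the minimum average WMD
    dissimilarity to all other sentences of the topic."""
    D = np.asarray(wmd_matrix, dtype=float)  # pairwise dist_wmd, D[i, i] == 0
    n = D.shape[0]
    return int(np.argmin(D.sum(axis=1) / (n - 1)))

def coverage(summary_idx, wmd_matrix, r_idx):
    """COV analogue: aggregate similarity of the selected sentences to s_R,
    here taken as 1 / (1 + distance) purely for illustration."""
    D = np.asarray(wmd_matrix, dtype=float)
    return float(sum(1.0 / (1.0 + D[i, r_idx]) for i in summary_idx))
```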

7.6.2 Results Obtained

In Table 7.10, we have shown the results obtained for different topics in terms of ROUGE-2 measure. It can be observed that our proposed approach improves by 14.28% and 3.42% over adaptive DE (in short, ADE) and NSGA-II, respectively. ADE and our proposed approach both are based on differential evolution. ADE and NSGA-II use syntactic similarity, while, our approach uses semantic similarity. Note that WMD makes use of a pre-trained word2vec model [43] on the Googlenews5 corpus which contains 3 billion words and each word vector is of 300 dimensions. In the future, we want to see the effect of using vector representation of sentences in the semantic space.

7.7 Conclusive Remarks

In this chapter, we presented a multi-objective based extractive summarization technique for solving the microblog summarization task. A multi-objective binary differential evolution (MOBDE) technique is used as the underlying optimization strategy in the proposed summarization system. The SOM-based operator is also explored in fusion with MOBDE. The similarity/dissimilarity between two tweets is calculated utilizing the word mover distance to capture the semantic information. Three objective functions are optimized simultaneously for selecting a useful subset

5https://github.com/mmihaltz/word2vec-GoogleNews-vectors

of tweets present in the dataset/event. Experimental data and an in-depth analysis of the results are provided in the chapter. It is clear from the obtained results that our approach outperforms the existing techniques. The results demonstrate that our proposed approach, MOOTweetSumm, obtains 62.37% and 5.65% improvements over the existing techniques in terms of the ROUGE-2 and ROUGE-L evaluation measures, respectively. The results are also validated using a statistical significance test. The application of the proposed approach is also shown for the multi-document summarization task, in which we have obtained 14.28% and 3.42% improvements over the two existing evolutionary-based techniques, ADE and NSGA-II, respectively. Our future work will extend the current approach to online summarization of microblog tweets. In the current work, we have focused only on textual tweets. But, due to the length limitation of 140 characters, users can share multimedia content such as images, video and audio links along with the tweet-text. In the literature, it has been shown that images play a supplementary role alongside the textual content, especially on microblogging sites where a tweet-text lacks the expressive power to present something. Therefore, the next chapter is dedicated to multi-modal microblog summarization.


Table 7.6: Sensitivity analysis on the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in the predicted summary, respectively.

PS01 (η0=0.2, CR=0.8, b=6, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.4983, Rouge-2 0.3103, Rouge-L 0.4884
  UK: #TO 32, Rouge-1 0.4306, Rouge-2 0.2377, Rouge-L 0.4235
  TH: #TO 42, Rouge-1 0.3929, Rouge-2 0.2379, Rouge-L 0.3866
  HB: #TO 34, Rouge-1 0.5198, Rouge-2 0.3759, Rouge-L 0.5198
PS02 (η0=0.4, CR=0.8, b=6, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.5049, Rouge-2 0.3122, Rouge-L 0.4950
  UK: #TO 35, Rouge-1 0.4400, Rouge-2 0.2485, Rouge-L 0.4329
  TH: #TO 40, Rouge-1 0.3340, Rouge-2 0.1923, Rouge-L 0.3277
  HB: #TO 31, Rouge-1 0.2782, Rouge-2 0.3810, Rouge-L 0.5272
PS03 (η0=0.6, CR=0.8, b=6, F=0.8, Q=5):
  SH: #TO 37, Rouge-1 0.5545, Rouge-2 0.3684, Rouge-L 0.5512
  UK: #TO 33, Rouge-1 0.4706, Rouge-2 0.2914, Rouge-L 0.4565
  TH: #TO 39, Rouge-1 0.3697, Rouge-2 0.2172, Rouge-L 0.3634
  HB: #TO 31, Rouge-1 0.5050, Rouge-2 0.3448, Rouge-L 0.5000
PS04 (η0=0.2, CR=0.2, b=6, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.5215, Rouge-2 0.3321, Rouge-L 0.4116
  UK: #TO 32, Rouge-1 0.4118, Rouge-2 0.2193, Rouge-L 0.3976
  TH: #TO 42, Rouge-1 0.3235, Rouge-2 0.1715, Rouge-L 0.3172
  HB: #TO 35, Rouge-1 0.5322, Rouge-2 0.3569, Rouge-L 0.5322
PS05 (η0=0.2, CR=0.6, b=6, F=0.8, Q=5):
  SH: #TO 35, Rouge-1 0.5281, Rouge-2 0.3267, Rouge-L 0.5248
  UK: #TO 34, Rouge-1 0.4471, Rouge-2 0.2791, Rouge-L 0.3638
  TH: #TO 39, Rouge-1 0.3445, Rouge-2 0.1950, Rouge-L 0.3403
  HB: #TO 31, Rouge-1 0.4901, Rouge-2 0.3414, Rouge-L 0.4827
PS06 (η0=0.2, CR=0.8, b=5, F=0.8, Q=5):
  SH: #TO 38, Rouge-1 0.5512, Rouge-2 0.3358, Rouge-L 0.5413
  UK: #TO 34, Rouge-1 0.4588, Rouge-2 0.2791, Rouge-L 0.4494
  TH: #TO 41, Rouge-1 0.3193, Rouge-2 0.1729, Rouge-L 0.3130
  HB: #TO 32, Rouge-1 0.4901, Rouge-2 0.3500, Rouge-L 0.4901
PS07 (η0=0.2, CR=0.8, b=7, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.5050, Rouge-2 0.3122, Rouge-L 0.4983
  UK: #TO 33, Rouge-1 0.4353, Rouge-2 0.2485, Rouge-L 0.4235
  TH: #TO 40, Rouge-1 0.3151, Rouge-2 0.1632, Rouge-L 0.3109
  HB: #TO 32, Rouge-1 0.5050, Rouge-2 0.3500, Rouge-L 0.5025
PS08 (η0=0.2, CR=0.8, b=6, F=0.2, Q=5):
  SH: #TO 35, Rouge-1 0.5347, Rouge-2 0.3303, Rouge-L 0.5281
  UK: #TO 35, Rouge-1 0.4424, Rouge-2 0.2423, Rouge-L 0.4282
  TH: #TO 41, Rouge-1 0.3298, Rouge-2 0.1646, Rouge-L 0.3277
  HB: #TO 32, Rouge-1 0.4851, Rouge-2 0.3414, Rouge-L 0.4827
PS09 (η0=0.2, CR=0.8, b=6, F=0.5, Q=5):
  SH: #TO 36, Rouge-1 0.5281, Rouge-2 0.3339, Rouge-L 0.5182
  UK: #TO 33, Rouge-1 0.4118, Rouge-2 0.2224, Rouge-L 0.4047
  TH: #TO 42, Rouge-1 0.3508, Rouge-2 0.2006, Rouge-L 0.3445
  HB: #TO 32, Rouge-1 0.5074, Rouge-2 0.3690, Rouge-L 0.5074
PS10 (η0=0.2, CR=0.8, b=6, F=0.8, Q=4):
  SH: #TO 35, Rouge-1 0.5677, Rouge-2 0.3448, Rouge-L 0.5611
  UK: #TO 33, Rouge-1 0.4329, Rouge-2 0.2515, Rouge-L 0.4259
  TH: #TO 40, Rouge-1 0.3697, Rouge-2 0.2241, Rouge-L 0.3613
  HB: #TO 33, Rouge-1 0.5025, Rouge-2 0.3517, Rouge-L 0.5025
PS11 (η0=0.2, CR=0.8, b=6, F=0.8, Q=6):
  SH: #TO 35, Rouge-1 0.5479, Rouge-2 0.3339, Rouge-L 0.5347
  UK: #TO 34, Rouge-1 0.4282, Rouge-2 0.2423, Rouge-L 0.4188
  TH: #TO 40, Rouge-1 0.3676, Rouge-2 0.2130, Rouge-L 0.3613
  HB: #TO 32, Rouge-1 0.4876, Rouge-2 0.3414, Rouge-L 0.4851
PS12 (η0=0.6, CR=0.8, b=6, F=0.8, Q=4):
  SH: #TO 35, Rouge-1 0.5281, Rouge-2 0.3176, Rouge-L 0.5182
  UK: #TO 34, Rouge-1 0.4541, Rouge-2 0.2807, Rouge-L 0.4447
  TH: #TO 41, Rouge-1 0.3361, Rouge-2 0.1674, Rouge-L 0.3319
  HB: #TO 34, Rouge-1 0.4950, Rouge-2 0.3431, Rouge-L 0.4950
PS13 (η0=0.6, CR=0.8, b=6, F=0.8, Q=6):
  SH: #TO 37, Rouge-1 0.5512, Rouge-2 0.3230, Rouge-L 0.5513
  UK: #TO 34, Rouge-1 0.4588, Rouge-2 0.2776, Rouge-L 0.4494
  TH: #TO 42, Rouge-1 0.3655, Rouge-2 0.2019, Rouge-L 0.3613
  HB: #TO 33, Rouge-1 0.4728, Rouge-2 0.3207, Rouge-L 0.4678
PS14 (η0=0.6, CR=0.8, b=6, F=0.2, Q=5):
  SH: #TO 36, Rouge-1 0.5116, Rouge-2 0.3158, Rouge-L 0.5050
  UK: #TO 34, Rouge-1 0.4329, Rouge-2 0.2423, Rouge-L 0.4212
  TH: #TO 42, Rouge-1 0.3655, Rouge-2 0.2172, Rouge-L 0.3634
  HB: #TO 33, Rouge-1 0.4827, Rouge-2 0.3414, Rouge-L 0.4802


Table 7.7: Sensitivity analysis of the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in the predicted summary, respectively. Note that this table is a continuation of Table 7.6.

PS15 (η0=0.6, CR=0.8, b=6, F=0.5, Q=5):
  SH: #TO 37, Rouge-1 0.5545, Rouge-2 0.3593, Rouge-L 0.5479
  UK: #TO 34, Rouge-1 0.4471, Rouge-2 0.2347, Rouge-L 0.4400
  TH: #TO 41, Rouge-1 0.3319, Rouge-2 0.1632, Rouge-L 0.3256
  HB: #TO 33, Rouge-1 0.5248, Rouge-2 0.3793, Rouge-L 0.5248
PS16 (η0=0.6, CR=0.8, b=7, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5050, Rouge-2 0.3140, Rouge-L 0.4983
  UK: #TO 34, Rouge-1 0.4659, Rouge-2 0.3021, Rouge-L 0.4588
  TH: #TO 41, Rouge-1 0.3866, Rouge-2 0.2407, Rouge-L 0.3845
  HB: #TO 33, Rouge-1 0.4901, Rouge-2 0.3414, Rouge-L 0.4876
PS17 (η0=0.6, CR=0.8, b=5, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5248, Rouge-2 0.3412, Rouge-L 0.5149
  UK: #TO 34, Rouge-1 0.4306, Rouge-2 0.2500, Rouge-L 0.4212
  TH: #TO 41, Rouge-1 0.3151, Rouge-2 0.1701, Rouge-L 0.3109
  HB: #TO 33, Rouge-1 0.5099, Rouge-2 0.3793, Rouge-L 0.5099
PS18 (η0=0.6, CR=0.6, b=6, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5413, Rouge-2 0.3412, Rouge-L 0.5347
  UK: #TO 35, Rouge-1 0.4706, Rouge-2 0.2761, Rouge-L 0.4541
  TH: #TO 41, Rouge-1 0.3298, Rouge-2 0.1660, Rouge-L 0.3256
  HB: #TO 33, Rouge-1 0.4950, Rouge-2 0.3448, Rouge-L 0.4950
PS19 (η0=0.6, CR=0.2, b=6, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5446, Rouge-2 0.3339, Rouge-L 0.5380
  UK: #TO 34, Rouge-1 0.4165, Rouge-2 0.2377, Rouge-L 0.4047
  TH: #TO 41, Rouge-1 0.3277, Rouge-2 0.1715, Rouge-L 0.3235
  HB: #TO 33, Rouge-1 0.5000, Rouge-2 0.3500, Rouge-L 0.4975
PS20 (η0=0.6, CR=0.8, b=5, F=0.8, Q=4):
  SH: #TO 36, Rouge-1 0.5083, Rouge-2 0.3158, Rouge-L 0.5017
  UK: #TO 34, Rouge-1 0.4494, Rouge-2 0.2761, Rouge-L 0.4353
  TH: #TO 41, Rouge-1 0.3718, Rouge-2 0.2102, Rouge-L 0.3697
  HB: #TO 33, Rouge-1 0.4975, Rouge-2 0.3500, Rouge-L 0.4975
PS21 (η0=0.4, CR=0.8, b=5, F=0.8, Q=4):
  SH: #TO 36, Rouge-1 0.5611, Rouge-2 0.3376, Rouge-L 0.5479
  UK: #TO 33, Rouge-1 0.4235, Rouge-2 0.2485, Rouge-L 0.4141
  TH: #TO 41, Rouge-1 0.3382, Rouge-2 0.1909, Rouge-L 0.3361
  HB: #TO 33, Rouge-1 0.4926, Rouge-2 0.3466, Rouge-L 0.4926
PS22 (η0=0.4, CR=0.2, b=5, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5578, Rouge-2 0.3557, Rouge-L 0.5578
  UK: #TO 33, Rouge-1 0.4259, Rouge-2 0.2439, Rouge-L 0.4165
  TH: #TO 41, Rouge-1 0.3172, Rouge-2 0.1660, Rouge-L 0.3151
  HB: #TO 33, Rouge-1 0.5000, Rouge-2 0.3448, Rouge-L 0.4975
PS23 (η0=0.4, CR=0.6, b=6, F=0.8, Q=4):
  SH: #TO 36, Rouge-1 0.5413, Rouge-2 0.3358, Rouge-L 0.5380
  UK: #TO 33, Rouge-1 0.4635, Rouge-2 0.2577, Rouge-L 0.4541
  TH: #TO 41, Rouge-1 0.3403, Rouge-2 0.1923, Rouge-L 0.3340
  HB: #TO 33, Rouge-1 0.5074, Rouge-2 0.3517, Rouge-L 0.5050

Table 7.8: Range of possible values for each of the 5 parameters

Parameter                       Possible Values
Initial learning rate (η0)      [0.2, 0.4, 0.6]
Crossover probability (CR)      [0.2, 0.8, 0.6]
Positive constant (b)           [5, 6, 7]
Scaling factor (F)              [0.2, 0.5, 0.8]
Mating pool size (Q)            [4, 5, 6]


Table 7.9: Average ROUGE scores over all datasets corresponding to different parameter settings shown in Table 7.6 and Table 7.7. Here PS stands for Parameter Setting.

PS No.  η0   CR   b  F    Q  Rouge-1  Rouge-2  Rouge-L
PS01    0.2  0.8  6  0.8  5  0.4604   0.2905   0.4546
PS02    0.4  0.8  6  0.8  5  0.3893   0.2835   0.4457
PS03    0.6  0.8  6  0.8  5  0.4749   0.3055   0.4678
PS04    0.2  0.2  6  0.8  5  0.4472   0.2700   0.4147
PS05    0.2  0.6  6  0.8  5  0.4524   0.2856   0.4279
PS06    0.2  0.8  5  0.8  5  0.4549   0.2844   0.4484
PS07    0.2  0.8  7  0.8  5  0.4401   0.2685   0.4338
PS08    0.2  0.8  6  0.2  5  0.4480   0.2697   0.4417
PS09    0.2  0.8  6  0.5  5  0.4495   0.2815   0.4437
PS10    0.2  0.8  6  0.8  4  0.4682   0.2930   0.4627
PS11    0.2  0.8  6  0.8  6  0.4578   0.2827   0.4500
PS12    0.6  0.8  6  0.8  4  0.4533   0.2772   0.4475
PS13    0.6  0.8  6  0.8  6  0.4621   0.2808   0.4550
PS14    0.6  0.8  6  0.2  5  0.4482   0.2792   0.4424
PS15    0.6  0.8  6  0.5  5  0.4645   0.2841   0.4596
PS16    0.6  0.8  7  0.8  5  0.4619   0.2995   0.4573
PS17    0.6  0.8  5  0.8  5  0.4451   0.2852   0.4392
PS18    0.6  0.6  6  0.8  5  0.4592   0.2820   0.4524
PS19    0.6  0.2  6  0.8  5  0.4472   0.2733   0.4409
PS20    0.6  0.8  5  0.8  4  0.4568   0.2880   0.4511
PS21    0.4  0.8  5  0.8  4  0.4538   0.2809   0.4477
PS22    0.4  0.2  5  0.8  5  0.4502   0.2776   0.4467
PS23    0.4  0.6  6  0.8  4  0.4631   0.2844   0.4578

Table 7.10: Statistics about the DUC2002 topics and corresponding average ROUGE-2 scores.

Topic No.  #Sentences  #Words  Adaptive DE  NSGA-II  Proposed
d061j      184         3679    0.266        0.306    0.337
d062j      118         2691    0.188        0.200    0.200
d063j      249         4793    0.245        0.275    0.220
d064j      183         4080    0.194        0.233    0.392
d065j      284         5500    0.144        0.182    0.183
d066j      190         3894    0.201        0.181    0.258
d067f      122         2805    0.239        0.260    0.286
d068f      131         2565    0.491        0.496    0.294
d069f      327         1306    0.184        0.232    0.220
d070f      148         3116    0.224        0.262    0.332

CHAPTER 8

Multi-modal Microblog Summarization

This chapter proposes a multi-objective evolutionary (MOEA) framework for the multi-modal microblog summarization task, where tweet-texts associated with images are considered for summarization. Due to the utilization of the population-based behaviour of MOEA, each solution in the population is allowed to perform either exploitation (local search) or exploration (global search) based on mating restriction probabilities which are continuously updated based on the survival length of the solutions at each generation.


8.1 Introduction

8.1.1 Overview

On microblogging sites, it has been observed that the 140-character length limit encourages users to post multimedia content such as images and video or audio links along with the text. For example, Figure 8.1 shows an image attached to a tweet-text that depicts infrastructure damage, while the tweet-text itself provides information about the earthquake magnitude. When the information from the tweet-text and the image is combined, the result may convey crucial or relevant information to a disaster management authority and thus help it decide on the best action. Moreover, it has been shown in the literature that images play a supplementary role to the textual content, especially on microblogging sites where tweet-texts often lack the expressive power to describe an event fully. Therefore, in this chapter, we develop an approach, MMTweetSumm, that summarizes the relevant multimedia tweets by extracting an optimal subset of tweets while taking the multimedia content into account. It is important to note that the current chapter extends the work of the previous chapter, in which the approach MOOTweetSumm was developed for microblog summarization using only textual tweets.

Figure 8.1: Image available with the tweet-text during an earthquake in Mexico.

MOOTweetSumm uses the concept of multi-objective optimization (MOO): various statistical measures are simultaneously optimized to improve the quality of the summary. These measures are (a) anti-redundancy, which measures the dissimilarity between the tweets; (b) the tf-idf score of the tweet (the sum of the tf-idf scores of the words in the tweet); and (c) the length of the tweet; all of these objective functions are of the maximization type. Our algorithm also makes use of the MOO concept. The following points distinguish our approach from the baseline approach MOOTweetSumm:

1. Anti-redundancy measure: In MOOTweetSumm, only the textual-tweet information is considered while measuring the anti-redundancy objective function. In MMTweetSumm, however, the textual tweet is combined with the textual description (or, alternatively, the image feature) of the corresponding image available with the tweet. The benefit of doing this is to add supplementary information to the textual tweet and make it more informative.

2. Exploration vs. exploitation behaviour: In any evolutionary algorithm, these two behaviours decide whether the search for the optimal solution should be global or local. Both algorithms utilize MOBDE as the underlying optimization strategy, which is a type of MOEA (discussed in Section 2.1.9 of Chapter 2) and starts from a candidate set of solutions, each of which may cover a different region of the search space. Therefore, in our approach, at each generation, each solution can perform either exploitation or exploration based on a mating restriction probability [203]. This probability is self-adaptive and based on the survival length (SL) of the solution, where SL is the number of generations the solution has survived over the last few generations (a user-defined window). If the SL value of a solution is high, exploitation takes place; otherwise, exploration is conducted. In MOOTweetSumm, the concept of a self-adaptive mating restriction probability was not present, and the best results were reported using only exploration behaviour.

3. Tweet-scoring functions: In this work, in addition to (a) anti-redundancy, (b) the tf-idf score of the tweet, and (c) the length of the tweet, two more tweet-scoring features are considered in our optimization framework: (d) BM25 [21], a bag-of-words retrieval function designed to rank short texts; and (e) RT (re-tweet) [131], which counts how many times a tweet is re-posted. A high re-post/re-tweet count indicates that the tweet has attracted a lot of attention and interest from users and is therefore more important; the same holds for a high BM25 score. Note that these functions have never been explored in combination with MOEA for the multi-modal microblog summarization task.

Currently, the dense captioning task [204, 205] is gaining popularity due to its ability to predict a set of natural language descriptions for various regions of an image. Therefore, to extract the image features for our task, we used the pre-trained model available at https://github.com/jcjohnson/densecap, which was trained on 94,000 images. An example1 is illustrated in Figure 8.2, in which the different regions predicted by the model are shown with coloured boundaries, and the caption of each region is written below the figure in the same colour. Other models, such as VGGNet16/19 [206], ResNet50 [206], and InceptionV3 [207], are also available for image feature extraction, but they do not provide a textual description of the

1https://cs.stanford.edu/people/karpathy/densecap/

image. Also, 'a picture is worth a thousand words'. Therefore, to obtain textual descriptions of the images, we utilized the dense captioning model for our task.
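To make the use of the dense captions concrete, the following is a minimal sketch (not the thesis implementation) of how a tweet-text could be augmented with the region captions returned by a dense-captioning model before computing the text+image variant of anti-redundancy; the caption list, helper name, and cut-off are illustrative assumptions.

```python
# Minimal sketch: augment a tweet-text with the natural-language region captions
# produced by a dense-captioning model, so that the combined string can be used by
# the text+image variant (J1) of the anti-redundancy objective. The captions below
# are hypothetical; densecap typically returns region captions ranked by confidence.

def augment_tweet_with_captions(tweet_text, region_captions, max_captions=5):
    """Concatenate a tweet with the top-ranked dense captions of its attached image."""
    selected = region_captions[:max_captions]
    return tweet_text + " " + " ".join(selected)

captions = ["a collapsed building", "debris on the street", "people standing near rubble"]
print(augment_tweet_with_captions("7.1 magnitude earthquake hits Mexico City", captions))
```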

Figure 8.2: An example of the dense captioning model, taken from https://cs.stanford.edu/people/karpathy/densecap

8.1.2 Major Contributions

The major contributions of the current work are enumerated below:

1. The current work is the first of its kind in which both the image and the tweet-text are utilized simultaneously to generate a summary from microblog data produced during a disaster event.

2. Only limited works exist for multi-modal microblog summarization [130, 132]. Moreover, no existing work has explored the image dense captioning model [204] for image feature extraction.

3. We created a new gold-standard dataset for multi-modal microblog summarization covering three different disaster events. Human annotators were employed to manually generate the gold summaries from these data.

4. None of the existing works explored the concept of MOEA for multi-modal microblog summarization.

5. To find the optimal summary, a self-adaptive mating restriction probability is incorporated in the proposed MOEA-based framework to explore the search space efficiently.


Table 8.1: Notations used, with their descriptions. Here, tf-idf refers to term frequency-inverse document frequency.

Symbol | Description
E | Disaster event containing tweets
NE | Total number of situational tweets in E
M | Number of tweets to be included in the summary
S | Obtained summary
tk | kth tweet
tavg | Average number of words per tweet
|tk| | Number of words in the kth tweet
T(w, tk) | Term frequency of a word w in the kth tweet
F(w, tk) | Inverse document frequency of a word w in the kth tweet
L(tk) | Length of the kth tweet
D(tk, tm) | Word mover's distance between the kth and mth tweets
P | Population
|P| | Number of solutions in the population
MaxGen | Maximum number of generations
CR | Crossover probability
F | Control (scaling) factor
b | Real positive constant

8.2 Tweet-scoring Functions

For any summarization system, the selection of statistical measures that help in choosing informative sentences/tweets is a crucial task. Therefore, in this work, we have explored five measures (also called objective functions), all of which should be maximized to obtain a good-quality summary. The mathematical formulations of these functions are discussed below. The notations/symbols used throughout the current chapter are described in Table 8.1.

1. MaxAntiRedundancy (J1/J2): This measure is designed to avoid redundancy in the summary; for its mathematical definition, refer to Eq. 7.3 of Chapter 7. Two scenarios are considered here: (a) if tweet tk (tl) consists of the tweet-text concatenated with the natural-language feature of its image, then MaxAntiRedundancy is denoted as J1; (b) if tk (tl) consists of only the tweet-text, then it is denoted as J2.

2. MaxSumTFIDF (J3)/MaxLength (J4): These measures are the same as Ob2 and Ob3 of Chapter 7. Their mathematical definitions are given in Eqs. 7.6 and 7.7, respectively.

3. MaxSumBM25 (J5): BM25 [208] is a ranking function used in information retrieval to rank documents (tweets in our case) based on their relevance to a query. It was designed with short texts such as tweets in mind and, as reported in the literature [21], it performs better than the tf-idf


[188] model when the text is short, as is the case for tweets. Therefore, it is adopted as one of the objective functions in our framework. Mathematically, it is described as

J_5 = \frac{1}{M} \sum_{k=1,\; t_k \in S}^{M} \mathrm{BM25}(t_k, Q) \qquad (8.1)

where Q is a query with terms q1, q2, ..., qn, and the BM25 score of a tweet tk ∈ S, denoted W (= BM25(tk, Q)), is defined as

W = \sum_{i=1}^{n} F(q_i, t_k)\, \frac{T(q_i, t_k)\,(k_1 + 1)}{T(q_i, t_k) + k_1\,\bigl(1 - b + b\,|t_k|/t_{avg}\bigr)} \qquad (8.2)

where b and k1 are the BM25 hyper-parameters, generally chosen as k1 ∈ [1.2, 2.0] and b = 0.75. Note that here Q refers to the entire set of tweets in the disaster event E.

4. MaxRTScore (J6): On any social network, the importance of a tweet can be inferred from its re-post count [131]: a high re-post count indicates that the tweet has attracted a lot of attention and interest from users. Therefore, the quality of a summary S is also evaluated as

J_6 = \sum_{k=1}^{M} \log\bigl(\mathrm{RepostNumber}(t_k) + 1\bigr) \qquad (8.3)

where RepostNumber counts how many times a tweet is re-posted.

Note that the first three objective functions (J2, J3, and J4) are the same as those used in our preliminary model MOOTweetSumm. A small computational sketch of J5 and J6 is given below.
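The sketch below illustrates how J5 (Eqs. 8.1 and 8.2) and J6 (Eq. 8.3) can be computed for a candidate summary. It assumes tokenized tweets, a precomputed IDF dictionary, and the commonly used hyper-parameter values; it is an illustration under those assumptions, not the exact thesis implementation.

```python
import math
from collections import Counter

# Sketch of the BM25-based objective J5 (Eqs. 8.1-8.2) and the re-tweet objective J6
# (Eq. 8.3). `idf` maps a term to its inverse document frequency F(w, t_k) computed
# over the event E; k1 and b take commonly used values (assumptions, see the text).

def bm25(tweet_tokens, query_terms, idf, t_avg, k1=1.2, b=0.75):
    tf = Counter(tweet_tokens)
    score = 0.0
    for q in query_terms:
        num = tf[q] * (k1 + 1)
        den = tf[q] + k1 * (1 - b + b * len(tweet_tokens) / t_avg)
        score += idf.get(q, 0.0) * num / den
    return score

def j5(summary, query_terms, idf, t_avg):
    # Eq. 8.1: average BM25 score of the M tweets selected in the summary S
    return sum(bm25(t, query_terms, idf, t_avg) for t in summary) / len(summary)

def j6(repost_counts):
    # Eq. 8.3: log-scaled re-post counts of the selected tweets
    return sum(math.log(r + 1) for r in repost_counts)
```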

8.3 Dataset Creation

In this work, due to the unavailability of a suitable dataset for the disaster-event scenario, we have constructed three datasets, namely Hurricane Harvey2, Hurricane Irma3 and Srilanka Flood4, for our summarization task. For this purpose, the CrisisMMD5 resource is utilized, which includes several disaster events with tweets and their images. These tweets are a mixture of informative and non-informative tweets (the latter being tweets that express anger, despair, and other moods of people during the event), and the images associated with them may likewise be informative or non-informative. Note that the annotation of the informative or non-informative category for the tweets and

2https://en.wikipedia.org/wiki/Hurricane_Harvey
3https://en.wikipedia.org/wiki/Hurricane_Irma
4https://en.wikipedia.org/wiki/2017_Sri_Lanka_floods
5https://crisisnlp.qcri.org/

images is already provided by the annotators of CrisisMMD. As our task considers multiple modalities (text and images) of the tweets, the benefit of incorporating the image with the tweet-text is also taken into account while generating the datasets. Below is a description of the developed datasets:

1. In the first step, we extract only the informative multimedia tweets, i.e., those whose associated images are also informative. The rest of the tweets are excluded because they do not provide any assistance for human aid during the disaster event.

2. The extracted informative multimedia tweets are pre-processed using the tweet pre-processor6 available in Python. Using this, we remove URLs, hashtags, and mentions. Moreover, all tweets are converted into lower case, and tweets containing fewer than 3 words are removed because, on inspection, they do not convey any useful information.

3. In the third step, the K-medoid [4] clustering algorithm is applied to the multimedia tweets. Note that K-medoid starts from K (user-defined) random cluster centers, i.e., K random multimedia tweets. For tweet-to-tweet dissimilarity, WMD is utilized, while for image-to-image dissimilarity, we first extract a 4096-dimensional image feature vector using VGG19Net and then compute the cosine distance (CD), defined as 1 − cosine similarity (for the definition of cosine similarity, refer to Section 2.1.3 of Chapter 2). A multimedia tweet is assigned to the cluster with the minimum average distance (WMD_{i,k} + CD_{i,k})/2, where i and k denote the multimedia tweet and the cluster center, respectively. The detailed procedure of K-medoid is provided in Section 2.1.1 of Chapter 2. We vary the number of clusters in the range [35, 100] and evaluate each partitioning using a cluster validity index, namely the silhouette index (SI) (for its definition, refer to Table 2.1), which measures cluster quality in terms of compactness (distance between multimedia tweets within a cluster) and separation (distance between clusters). SI ranges over [−1, 1], and a higher value indicates better partitioning. For our purpose, we choose the value of K at which the highest SI is obtained.

4. From the obtained clusters, four different summaries are created by extracting the tweets having the maximum value of each of four different features:

• tf-idf score of the tweet

• BM25 score

• number of content words

• length of the tweet

6https://pypi.org/project/tweet-preprocessor

Table 8.2: Dataset statistics. Here, #ITWIM: informative tweets with informative images; #PITWIM: pre-processed informative tweets with informative images; #GT: number of tweets in the gold summary.

Event Name | Total Tweets | #ITWIM | #PITWIM | #GT
Hurricane Harvey | 4,443 | 2,261 | 2,224 | 22
Hurricane Irma | 4,525 | 2,031 | 2,001 | 37
Srilanka Flood | 599 | 398 | 389 | 37

Here, content words include numerals (contact numbers, blood bank numbers, number of casualties, etc.) and verbs like dead/injured/stranded. The importance of content words has already been shown in [126]. For the remaining features, refer to [21].

5. Finally, an ensemble approach is used, as sketched below: if a tweet occurs in at least three of the four summaries, it is added to the final summary. This final summary is treated as the gold summary.
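A compact sketch of steps 2, 3 and 5 above is given below. It assumes the tweet-preprocessor and scikit-learn-extra packages and precomputed WMD and cosine-distance matrices; the option names and helpers are illustrative, not the exact thesis code.

```python
import preprocessor as p                      # the tweet-preprocessor package from the footnote
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids    # assumption: scikit-learn-extra provides K-medoids

def preprocess(tweets, min_words=3):
    # Step 2: strip URLs, hashtags and mentions, lower-case, drop very short tweets
    p.set_options(p.OPT.URL, p.OPT.HASHTAG, p.OPT.MENTION)
    cleaned = [p.clean(t).lower() for t in tweets]
    return [t for t in cleaned if len(t.split()) >= min_words]

def best_partition(wmd, cd, k_range=range(35, 101)):
    # Step 3: cluster on the averaged text (WMD) and image (cosine) dissimilarities,
    # keeping the K that maximizes the silhouette index
    dist = (wmd + cd) / 2.0
    best_si, best_labels = -1.0, None
    for k in k_range:
        labels = KMedoids(n_clusters=k, metric="precomputed", random_state=0).fit_predict(dist)
        si = silhouette_score(dist, labels, metric="precomputed")
        if si > best_si:
            best_si, best_labels = si, labels
    return best_labels

def ensemble_gold_summary(cluster_ids, tweets, feature_scores, min_votes=3):
    # Steps 4-5: per feature, pick the best tweet of each cluster, then keep tweets
    # that appear in at least `min_votes` of the four feature-specific summaries
    votes = {}
    for scores in feature_scores:              # e.g., tf-idf, BM25, content words, length
        for c in set(cluster_ids):
            members = [i for i, cid in enumerate(cluster_ids) if cid == c]
            winner = max(members, key=lambda i: scores[i])
            votes[winner] = votes.get(winner, 0) + 1
    return [tweets[i] for i, v in votes.items() if v >= min_votes]
```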

After forming the summary, a human evaluation was carried out by two persons: one undergraduate student and one doctoral student. The overlap in tweets was found to be more than 85% for all datasets. The gold summaries obtained for Harvey, Irma, and SrilankaFlood are provided in a GitHub repository7, along with other information such as the number of clusters (K).

Figure 8.3: Representation of a solution: a binary vector over the tweet indices 1-8, e.g., [0 1 0 1 0 1 1 0], indicating that the 2nd, 4th, 6th and 7th tweets should be present in the summary.

8.4 Problem Statement

If E is a disaster event containing NE tweets, then our task is to obtain a summary S, consisting of M tweets belonging to E, that maximizes

max{J_1/J_2(S), J_3(S), J_4(S), J_5(S), J_6(S)} \qquad (8.4)

7https://github.com/nsaini1988/Multi-modal-Microblog-Summarization.git


Here, J1, J2, ..., J6 are the objective (tweet-scoring) functions discussed in Section 8.2; they are simultaneously optimized using the multi-objective binary differential evolution (MOBDE) [46] algorithm, which is a population-based meta-heuristic. The population consists of a set of solutions represented as binary vectors, and each solution is associated with fitness/objective values. An example of the solution representation is shown in Figure 8.3.

Note that (a) we performed an ablation study by varying the objective-function combinations; for example, {J1, J3}, {J1, J3, J4}, {J1, J4, J5}, and {J2, J4, J5} are some of the possible sets of objective functions that are simultaneously optimized in different runs of the proposed algorithm (we tried at most 3 objective functions at a time); and (b) J1/J2 is kept common in every combination to cover a diverse set of tweets.
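To make the encoding concrete, the sketch below shows, under illustrative names rather than the thesis code, how a binary solution such as the one in Figure 8.3 is decoded into a candidate summary and mapped to an objective vector.

```python
import numpy as np

# A solution is a binary vector over the N_E tweets of the event (Figure 8.3); the
# objective vector is computed on the subset of tweets selected by the 1-bits.
def decode(solution, tweets):
    return [tweets[i] for i, bit in enumerate(solution) if bit == 1]

def evaluate(solution, tweets, objectives):
    summary = decode(solution, tweets)
    return np.array([obj(summary) for obj in objectives])

solution = np.array([0, 1, 0, 1, 0, 1, 1, 0])          # the example of Figure 8.3
tweets = [f"tweet {i}" for i in range(1, 9)]
print(decode(solution, tweets))                        # tweets 2, 4, 6 and 7
```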

8.5 Proposed Methodology

The steps of our approach (MMTweetSumm) are shown in Algorithm 4. A detailed description of these steps is given below.

8.5.1 Population and Parameter Initialization

As our algorithm is based on the MOBDE framework, it starts with a set of random binary solutions, called the population (step 1). Note that the number of 1's in each solution should not exceed M. An example of the solution representation is shown in Figure 8.3. The control parameters, namely the mating restriction probabilities B_g = {β_g^1, β_g^2, ..., β_g^{|P|}} and the survival signs S_g = {s_g^1, s_g^2, ..., s_g^{|P|}}, are also initialized. Note that B keeps updating at each generation: a high value of β_g^i for solution X^i implies exploitation, and a low value implies exploration. B is initialized with zeros, as the initial generations demand more exploration. The variable s_g^i for solution X^i records the generation at which the solution first appears and is used to compute its survival length.
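A minimal sketch of this initialization step is given below (illustrative and assuming numpy): it creates random binary solutions with at most M ones each and zero-initialized B and S.

```python
import numpy as np

# Sketch of steps 1-2 of Algorithm 4: random binary population with at most M ones
# per solution, plus zero-initialized mating restriction probabilities and survival signs.
def init_population(pop_size, num_tweets, M, seed=0):
    rng = np.random.default_rng(seed)
    population = np.zeros((pop_size, num_tweets), dtype=int)
    for sol in population:
        ones = rng.choice(num_tweets, size=rng.integers(1, M + 1), replace=False)
        sol[ones] = 1
    beta = np.zeros(pop_size)                 # B_1 = [0, ..., 0]: start with pure exploration
    survival = np.zeros(pop_size, dtype=int)  # S_1 = [0, ..., 0]
    return population, beta, survival
```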

8.5.2 Objective Functions Calculation

The objective functions that need to be simultaneously optimized are evaluated for each solution (step 3). These objective (tweet-scoring) functions are discussed in Section 8.2. Afterwards, the iterative procedure begins (steps 5 to 16), starting from the first generation and continuing until the maximum number of generations (MaxGen) is reached.


Algorithm 4 Procedure of MMTweetSumm(K, MaxGen)
1: P ← Initialize population <X^1, X^2, X^3, ..., X^{|P|}>
2: Initialize control parameters: mating restriction probability B = [0, 0, ..., 0] and survival sign S = [0, 0, ..., 0]
3: For each solution X^i, evaluate the objective functional values
4: g = 1                                        ▷ current generation number
5: Repeat steps 6 to 16 while g < MaxGen
6: Obtain K clusters {CP1, CP2, ..., CPK} using P
7: for l = 1 to K do
8:    if |CPl| = 1 then
9:       Assign the two nearest solutions X^m and X^n to CPl such that CPl = {CPl ∪ X^m ∪ X^n}
10: P′ = [ ]                                    ▷ population to store new solutions
11: For each solution X^i ∈ P, generate a new solution:
    (a) Find CPk, k ∈ {1, 2, ..., K}, such that X^i ∈ CPk
    (b) Generate a random probability rand()
    (c) If rand() < β_g^i, choose three random solutions r1, r2 and r3 from CPk; otherwise, choose them from P to form the mating pool
    (d) Prob(X) ← perform the probability estimation operator using the selected random solutions and X^i
    (e) Y′ ← convert Prob(X) into a binary solution
    (f) Y′′ ← perform crossover between Y′ and X^i
    (g) Evaluate the objective functions for Y′′
    (h) Add Y′′ to P′
12: Merge the old population (P) and the new population (P′)
13: P ← select the best |P| solutions based on their objective functional values using non-dominated sorting and the crowding-distance operator, and use them for the next generation
14: Update the survival sign S using the survival length
15: Update B for the (g+1)th generation using B ← Update(g, S)
16: g ← g + 1
17: return the best summary


8.5.3 Grouping of Similar Solutions

In step 6, the K-means clustering algorithm (refer to Chapter 2, Section 2.1.1) is applied to the population, where K is the number of clusters. The main motive of K-means here is to identify groups/clusters of similar solutions; these groupings are utilized while performing exploitation and exploration. For exploitation, solutions belonging to the same cluster are preferred, while for exploration, solutions belonging to different clusters are used. Note that our algorithm utilizes the rand/1/bin variant of differential evolution, so if exploitation is to be performed, at least three solutions must exist in the cluster. Therefore, if a cluster has fewer than 3 solutions, nearby solutions are assigned to it so that each cluster has a minimum of three solutions. Steps 8 to 9 of Algorithm 4 describe this repair, and a small sketch of the grouping step is given below.
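The sketch below illustrates this grouping and repair step under stated assumptions (scikit-learn's K-means on the binary vectors and Euclidean nearness for the repair); it is not the exact thesis code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of steps 6-9 of Algorithm 4: cluster the binary solutions with K-means and
# ensure every cluster has at least three members (required by rand/1/bin) by
# borrowing the solutions nearest to the cluster centroid from outside the cluster.
def group_solutions(population, K, min_size=3):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(population)
    clusters = {c: list(np.where(labels == c)[0]) for c in range(K)}
    for c, members in clusters.items():
        if len(members) < min_size:
            centroid = population[members].mean(axis=0)
            outside = [i for i in range(len(population)) if i not in members]
            outside.sort(key=lambda i: np.linalg.norm(population[i] - centroid))
            clusters[c] = members + outside[: min_size - len(members)]
    return clusters
```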

8.5.4 New Solution Generation

In step 11, a new solution (also called a trial solution in MOBDE) is generated for each solution X^i ∈ P. For this, we first identify the cluster CPk, k ∈ {1, 2, ..., K}, to which X^i belongs. Then, various genetic operators, namely mating pool construction (step 11(c)), mutation (steps 11(d) and 11(e)) and crossover (step 11(f)), are applied to form the new solution. Note that during mating pool construction for solution X^i, if the random probability is less than β_g^i, then solutions are selected from the cluster CPk to which X^i belongs, so that exploitation is performed; otherwise, solutions are selected from the whole population. For mutation, the probability estimation operator is first applied between the random solutions chosen in step 11(c) and the current solution X^i, as described in Eq. 2.10 of Chapter 2. Then, mutation and crossover are performed as per Eqs. 2.11 and 2.12. The set of new solutions is called the new population or child population.
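The sketch below covers only the adaptive mating-pool choice of steps 11(a)-(c); the probability estimation, mutation and crossover of Eqs. 2.10-2.12 are abstracted away, and all names and the fallback rule are illustrative assumptions.

```python
import random

# Sketch of steps 11(a)-(c): with probability beta[i], pick the three DE parents from
# the cluster of solution i (exploitation); otherwise pick them from the whole
# population (exploration). `clusters` maps a cluster id to a list of solution indices.
def cluster_of(i, clusters):
    for cid, members in clusters.items():
        if i in members:
            return cid
    raise ValueError("solution %d is not assigned to any cluster" % i)

def select_mating_pool(i, pop_size, clusters, beta):
    if random.random() < beta[i]:
        pool = list(clusters[cluster_of(i, clusters)])
    else:
        pool = list(range(pop_size))
    candidates = [j for j in pool if j != i]
    if len(candidates) < 3:                    # fall back to the whole population if the cluster is too small
        candidates = [j for j in range(pop_size) if j != i]
    return random.sample(candidates, 3)        # r1, r2, r3 for the rand/1/bin strategy
```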

8.5.5 Selection of Top Best Solutions

In this step, |P| solutions are selected after merging the old population and the child population. For more details about this step, the reader can refer to Section 3.2.6 of Chapter 3.

8.5.6 Update Survival Length and Mating Restriction Probability

After finding the best solutions in step 13 of Algorithm 4, the survival sign S is updated; it records the generation at which a solution begins to survive, i.e., if a solution X^i begins to exist in generation t, then s_g^i = t. A solution that has survived for more generations (i.e., whose s^i is small relative to the current generation) is considered a better solution. This survival sign S is used to update the mating restriction probability. The procedure for


Algorithm 5 Procedure of Update(g, S, H)
1: for each solution X^i ∈ P do
2:    Evaluate the survival length (SL): ρ_g^i = min{g − s^i + 1, H}
3: for each solution X^i ∈ P do
4:    Evaluate the mating restriction probability as
         β_{g+1}^i = (ρ_g^i − ρ_g^{min}) / (ρ_g^{max} − ρ_g^{min})
5:    if β_{g+1}^i > 0.95 then β_{g+1}^i = 0.95
6:    else
7:       if β_{g+1}^i < 0.05 then β_{g+1}^i = 0.05
8: return the updated B

updating B is described in Algorithm 5. To calculate β_g^i for solution X^i at generation g, the survival length ρ_g^i is first calculated (step 2) as

ρ_g^i = \min\{g − s^i + 1, H\} \qquad (8.5)

where H denotes the number of recent generations prior to g that are taken into account (defined by the user), and ρ_g^i represents the number of generations solution X^i has survived within this window. The reason for restricting the survival count is to give newly joined good solutions the opportunity to exploit the search space as well. Note that the maximum value of ρ_g^i is H; a high value indicates that X^i is a good-quality solution and exploitation is required, while a low value indicates that the solution is newly generated. Then, for every solution X^i, ρ_g^i is normalized (step 4) to give β_{g+1}^i. If β_{g+1}^i is greater than 0.95, it is clipped to 0.95, and if it is smaller than 0.05, it is clipped to 0.05. The concept of an adaptive mating restriction probability is motivated by [203], and more details about restricting β_{g+1}^i can be found in the same paper.
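A direct transcription of Algorithm 5 into Python is sketched below, with one extra guard for the degenerate case in which all survival lengths are equal (a detail the pseudocode does not spell out).

```python
# Sketch of Algorithm 5: compute each solution's survival length over the last H
# generations (Eq. 8.5) and normalise it into the mating restriction probability
# for generation g+1, clipped to the range [0.05, 0.95].
def update_mating_probabilities(g, survival_sign, H=3, low=0.05, high=0.95):
    rho = [min(g - s + 1, H) for s in survival_sign]
    rho_min, rho_max = min(rho), max(rho)
    if rho_max == rho_min:                    # guard added here; not part of the original pseudocode
        return [low] * len(rho)
    beta = [(r - rho_min) / (rho_max - rho_min) for r in rho]
    return [min(max(b, low), high) for b in beta]
```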

8.5.7 Selection of Single best Solution

In the final generation, we obtain a set of Pareto-optimal solutions, all having equal importance. Therefore, a single solution is selected from this pool, and it provides the best summary.

8.6 Experimental Setup

This section describes evaluation measures, parameters used and comparative approaches.


8.6.1 Evaluation Measure

To evaluate the performance of our generated/predicted summary with respect to the gold summary, the ROUGE-N score is used, where N takes the values 1, 2, and L to give ROUGE-1, ROUGE-2, and ROUGE-L, respectively. A more detailed description of ROUGE-N is provided in Section 2.3.2 of Chapter 2.
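For illustration, the snippet below computes these scores with the open-source rouge package; the use of this particular package is an assumption, as the thesis may rely on a different ROUGE implementation.

```python
from rouge import Rouge  # assumption: the PyPI `rouge` package is available

# Compare a predicted summary against the gold summary and report the F-scores of
# ROUGE-1, ROUGE-2 and ROUGE-L. Both summaries are lists of tweet strings.
def rouge_scores(predicted_summary, gold_summary):
    scores = Rouge().get_scores(" ".join(predicted_summary), " ".join(gold_summary))[0]
    return {name: scores[name]["f"] for name in ("rouge-1", "rouge-2", "rouge-l")}
```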

8.6.2 Parameters Used

The following parameter values are used in our approach: |P| = 25, MaxGen = 25, crossover probability (CR) = 0.8, b = 6, and F = 0.8. A sensitivity analysis of the DE parameters is provided for the preliminary version developed for the microblog summarization task. The number of fitness function evaluations (NFE) is kept at 2000. To calculate the word mover's distance between two tweets, we utilize a word2vec model8 pre-trained on 53 million tweets related to disaster events [196]. While updating the mating restriction probability, the value of H is set to 3. The number of clusters (K) is set to 5, assuming that each cluster should have at least 5 solutions.

8.6.3 Comparative Approaches

For comparison, we have used various existing algorithms, namely LexRank, Luhn, LSA, TextRank, and FrequencySum. Moreover, we have also executed our preliminary version, MOOTweetSumm, which is based on the multi-objective optimization concept, on our task. An ablation study is also presented to analyze the effect of simultaneously optimizing different sets of objectives.

8.7 Discussion of Results

In this section, we discuss the results obtained by the different techniques.

8.7.1 Box-plots showing qualities of summaries corresponding to different solutions

In our proposed model MMTweetSumm, different sets of objective functions are simultaneously optimized in multiple runs. The objective function MaxAntiRedundancy is kept common to avoid redundancy in the summary; note that this function differs only in the information it uses (refer to Section 8.2). As our algorithm is based on the evolutionary concept, it provides a set of non-dominated solutions in the final generation, each representing a summary.

8http://crisisnlp.qcri.org/lrec2016/lrec2016.html



Figure 8.4: Box plots in sub-figures (a), (b) and (c) for Harvey, Srilanka and Irma disaster events, respectively. These figures illustrate the range of Rouge-L values using different sets of objective functions.

The variation of the ROUGE-L scores calculated for the solutions obtained in the different runs on the three datasets, namely the Harvey, Srilanka and Irma events, is shown in the box plots of Figure 8.4. These box plots show the effect of computing MaxAntiRedundancy with text+image features (denoted A1) versus only textual features (denoted A) in combination with the other objective functions. From these box plots, the following observations can be made: for the (a) Harvey, (b) Srilanka


Table 8.3: Comparison of ROUGE scores obtained using the proposed approach MMTweetSumm and MOOTweetSumm. Here, Obj and R refer to the objective functions used and ROUGE; A1 and A represent MaxAntiRedundancy calculated using text+image and text only, respectively; T, L and BM25 represent the MaxSumTFIDF, MaxLength and MaxSumBM25 objective functions, respectively.

Method | Dataset | Obj. | R-1 | R-2 | R-L
MOOTweetSumm | Harvey | A+T | 0.38 | 0.27 | 0.39
MOOTweetSumm | Harvey | A1+T | 0.38 | 0.27 | 0.39
MOOTweetSumm | Srilanka | A+T | 0.53 | 0.46 | 0.53
MOOTweetSumm | Srilanka | A1+T | 0.46 | 0.38 | 0.46
MOOTweetSumm | Irma | A+T | 0.52 | 0.39 | 0.52
MOOTweetSumm | Irma | A1+T | 0.47 | 0.33 | 0.46
MMTweetSumm | Harvey | A+L+T+BM25 | 0.53 | 0.44 | 0.53
MMTweetSumm | Harvey | A1+T+L | 0.52 | 0.40 | 0.52
MMTweetSumm | Srilanka | A+T | 0.68 | 0.59 | 0.68
MMTweetSumm | Srilanka | A1+L | 0.66 | 0.60 | 0.66
MMTweetSumm | Irma | A+T+BM25 | 0.59 | 0.49 | 0.59
MMTweetSumm | Irma | A1+T | 0.56 | 0.47 | 0.56

Table 8.4: Comparison of ROUGE scores attained by our method with the existing methods (reported as ROUGE-1/ROUGE-2/ROUGE-L).

Method | Harvey | Srilanka | Irma
LexRank | 0.20/0.01/0.20 | 0.39/0.17/0.37 | 0.27/0.06/0.27
LSA | 0.25/0.10/0.26 | 0.56/0.38/0.55 | 0.28/0.12/0.29
Luhn | 0.21/0.09/0.22 | 0.58/0.42/0.58 | 0.23/0.07/0.23
TextRank | 0.20/0.04/0.20 | 0.57/0.35/0.57 | 0.16/0.03/0.16
FrequencySum | 0.31/0.03/0.31 | 0.50/0.37/0.50 | 0.33/0.02/0.35
MOOTweetSumm | 0.38/0.27/0.39 | 0.53/0.46/0.53 | 0.52/0.39/0.52
MMTweetSumm | 0.53/0.44/0.53 | 0.68/0.59/0.68 | 0.59/0.49/0.59

and (c) Irma datasets, the objective-function combinations (a) A1+T+L, A1+RT, A1+L+RT, A1+T+RT, A1+BM25, A1+T+BM25; (b) A1+RT, A1+L+RT, A1+L+T+RT, A1+BM25, A1+RT+BM25; and (c) A1+L+RT, A1+T+RT, A1+T+RT+BM25, respectively, perform better with text+image features than with only textual features. For the remaining objective-function combinations, the results using only the textual tweet are better than those using text+image features. An error analysis is provided in Section 8.7.4.

8.7.2 Comparison between MMTweetSumm and MOOTweetSumm

In Table 8.3, the results of MOOTweetSumm and MMTweetSumm attained using their best sets of objective functions are compared. Note that MOOTweetSumm is executed using only the best set of objective functions reported in the previous chapter, i.e., MaxAntiRedundancy (J1) and MaxSumTFIDF (J3). From Table 8.3, it can be observed that MMTweetSumm beats the results obtained by MOOTweetSumm. For the Harvey, Srilanka, and Irma datasets, the best results obtained by MOOTweetSumm in terms of ROUGE-1, ROUGE-2 and ROUGE-L are (a) 0.38, 0.27 and 0.39; (b) 0.53, 0.46 and 0.53; and (c) 0.52, 0.39 and 0.52, respectively. Using MMTweetSumm, the ROUGE-1, ROUGE-2 and ROUGE-L scores for these three datasets are (a) 0.53, 0.44 and 0.53; (b) 0.68, 0.59 and 0.68; and (c) 0.59, 0.49 and 0.59, respectively. These results clearly indicate that incorporating the adaptive mating restriction probability in MMTweetSumm helps in obtaining the optimal summary.


Figure 8.5: Maximum ROUGE scores per generation attained by MMTweetSumm on the Harvey dataset: (a) Text, (b) Text+Image.

Figure 8.6: Maximum ROUGE scores per generation attained by MMTweetSumm on the Irma dataset: (a) Text, (b) Text+Image.

Figure 8.7: Number of new good solutions per generation by MMTweetSumm using the Harvey dataset: (a) Text, (b) Text+Image.


Figure 8.8: Number of new good solutions per generation by MMTweetSumm using the Irma dataset: (a) Text, (b) Text+Image.

Figure 8.9: Informative tweets with informative images, as provided by the annotators of the CrisisMMD dataset [2].

Figure 8.10: Four informative images posted with the same tweet ('RT @rootmess: Hurricane Irma: a useful checklist. PLEASE SPREAD & KEEP SAFE https://t.co/BS1loU3HmU').


Figure 8.11: Informative tweet ('IRMA. Great incident/disaster management. Thank you Digital Reality Skybox cyrusone data foundry https://t.co/uk8cPq4x07') whose image is in the form of a newspaper cutting.

8.7.3 Comparison with Existing Methods

In Table 8.4, the best results of our proposed model and of MOOTweetSumm are compared with the other existing methods. It is evident from Table 8.4 that our method outperforms the other methods. For the Harvey and Srilanka datasets, LexRank performs the worst, while for the Irma dataset, TextRank attains the lowest ROUGE scores. Although our proposed system is better than the existing methods for all datasets, overall the best result is obtained when only tweet-texts are considered in the MaxAntiRedundancy calculation, as can be seen from Table 8.3. This behaviour contradicts the expectation that 'an image plays a supplementary role when combined with the tweet-text'. To investigate this, we have plotted

• the maximum ROUGE score value attained per generation

• the number of new good solutions proceeding to the next generation (obtained in step 13 of Algorithm 4)

in the context of our approach with its best set of objective combinations (shown in Table 8.3). In Figures 8.5 and 8.6, the maximum ROUGE score per generation is shown for the Harvey and Irma datasets, respectively. The left side of these figures corresponds to the objective function MaxAntiRedundancy calculated using only the tweet-text, while on the right-hand side the calculation of MaxAntiRedundancy makes use of text+image features. It is clearly visible that for both datasets the curves fluctuate due to the simultaneous optimization of multiple, possibly conflicting, objectives.

At the last generation, however, MaxAntiRedundancy utilizing only the tweet-text gives the maximum ROUGE scores. It is also true that, when using text+image, those maximum ROUGE scores are already attained at specific generations (for example, at the 16th generation for Harvey and between the 3rd and 5th generations for Irma), after which the scores start fluctuating and finally decrease. This behaviour suggests that the MaxAntiRedundancy function utilizing text+image features requires specific values of the objective functions, which is itself another task, different from optimal parameter selection.

After observing the number of new good-quality solutions generated per generation for the Harvey and Irma datasets in Figures 8.7 and 8.8, respectively, it can be inferred that the MaxAntiRedundancy objective function utilizing only the tweet-text is able to generate more good solutions than when the text and image features are used together.

After investigating the maximum ROUGE score per generation and the number of new good solutions, it has been observed that our proposed approach, MMTweetSumm, performs best when only the tweet-text is considered for the calculation of anti-redundancy; the performance of the system degrades when the tweet-text is enhanced with the image description. To investigate the reason for this, we examined the image dense captioning model used in our framework to generate the image captions. Detailed analyses of the datasets and the image captioning model are provided in Section 8.7.4.

Figure 8.12: An example of caption generation by dense-caption model.


Figure 8.13: Another example of caption generation by dense-caption model.

8.7.4 Error-analysis

As our results using only the textual tweet are better than those using text+image, we performed a thorough error analysis. The observations are listed below:

• The same tweet posted multiple times: Although this situation indicates that the tweet is genuine, popular, and attracts many people, some users post multiple segments of the same image across multiple tweets, i.e., they post the same tweet-text again with a different image. For example, Figure 8.10 shows the same tweet, posted four times, associated with different images; the images describe a useful checklist during a hurricane, split across different pictures. From the perspective of the gold summary, we cannot keep all of these tweets in the summary, as that would increase redundancy. Handling such tweets is therefore another challenging issue, as we cannot restrict users from posting multimedia information across multiple tweets.

• While creating the dataset, we considered only those tweets for which both the tweet-text and the image are annotated as informative. Note that the original dataset was a mixture of informative and non-informative tweets, where informative tweets are those that provide assistance to people in need during a disaster, such as advice, cautions, rescue information, and warnings. Some informative tweets with their images are shown in Figure 8.9. However, some other types of images are also present, such as newspaper cuttings, notices, lists of useful items, etc., and some are very low-quality pictures. An example of such an image with its corresponding tweet-text is shown in Figure 8.11. In this case, the utilized dense captioning model is not able to generate meaningful captions; it just outputs captions such as 'in the photo'.

• The dense captioning model utilized in our framework was trained on the Visual Genome dataset9 of 94,313 images, which is not related to disaster events, and we did not find any dense captioning model applicable to disaster-related tasks. For some images the model identifies the correct captions, but for others, such as blurred or low-quality images, it does not perform well. Two examples from the Irma dataset are shown in Figures 8.12 and 8.13. In Figure 8.12, although the dense captioning model is not fully attuned to the disaster-related event, it generates the right captions. In Figure 8.13, however, due to the presence of a map in the image, the model is not able to generate correct captions. These examples make it clear that a more robust dense-captioning model is required.

8.7.5 Statistical t-test

To verify the reliability of the best results obtained by our proposed approach utilizing only text information, we conducted a statistical significance t-test [187] at the 5% significance level. This test provides a p-value, and a smaller p-value indicates that the superiority of the results over the other methods is statistically significant. To conduct this test, two groups are required: in one group, we keep the ROUGE-L values (as ROUGE-L is more robust than ROUGE-1 and ROUGE-2) obtained in 6 different runs of our approach, while in the other group, the ROUGE-L values of the six existing methods (see Table 8.4) are kept. The obtained p-values for the three datasets, namely Harvey, Irma, and Srilanka, are .000368, .000121, and .000001, respectively. These p-values signify that the best results obtained are statistically significant.
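The test can be reproduced with scipy as sketched below; the six ROUGE-L values for our runs are placeholders rather than reported numbers, while the baseline values are the Harvey column of Table 8.4.

```python
from scipy import stats

# Two-sample t-test at the 5% significance level. `ours` holds ROUGE-L values from six
# runs of our approach (placeholder numbers, not reported results); `baselines` holds
# the ROUGE-L values of the six existing methods on the Harvey dataset (Table 8.4).
ours = [0.53, 0.52, 0.53, 0.51, 0.52, 0.53]
baselines = [0.20, 0.26, 0.22, 0.20, 0.31, 0.39]
t_stat, p_value = stats.ttest_ind(ours, baselines)
print(p_value < 0.05)   # True -> the improvement is statistically significant
```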

8.8 Conclusive Remarks

In this work, we explored the effectiveness of the image dense-captioning model for the multi-modal microblog summarization task, in which a textual summary is generated using tweet-texts and images. Various objective functions that help in generating a good-quality summary are optimized simultaneously using population-based multi-objective binary differential evolution. Each solution in the population can search the space using exploration and exploitation behaviour. The concept of an adaptive mating restriction probability (MRP) is incorporated in our framework, which decides whether to exploit or explore the search space; this MRP keeps updating at each generation based on the survival length of the solutions over the last few generations. Extensive experiments verified the effectiveness of our approach (using only textual tweets), and it was shown to deliver better performance than all other existing approaches. In the future, we may extend our proposed model to improve image-caption generation. In the next chapter, we conclude the thesis and highlight several future works.

9http://visualgenome.org/static/paper/Visual Genome.pdf

CHAPTER 9

Conclusions and Future Works

This final chapter concludes the thesis, presents potential areas for extending the approaches described in the previous chapters, and suggests directions for future research.


9.1 Conclusions

In this thesis, novel unsupervised approaches are proposed for solving two important problems of text mining: document clustering and extractive summarization. Under the umbrella of extractive summarization, the areas of single- and multi-document summarization, figure summarization, microblog summarization, and multi-modal microblog summarization are explored. Techniques such as multi-objective optimization (MOO) and evolutionary algorithms (EA) are utilized to solve these tasks; they were chosen primarily for their success in many real-life applications. The presented algorithms show that the MOO+EA (i.e., MOEA) concept makes our algorithms more robust and capable of outperforming state-of-the-art techniques. Owing to the scarcity of labelled data, all algorithms are developed in an unsupervised way. In any MOEA-based algorithm, the performance usually depends on the quality of the new solutions generated at each generation, and for some problems the optimal solution may lie in a local region. To capture this, a self-organizing map (SOM) based operator is explored as a mating pool construction tool in the MOEA framework; here, SOM is a type of two-layer neural network. The use of SOM helps in identifying neighbouring solutions, which are then utilized to generate a new solution using genetic operators. For all the works, a multi-objective differential evolution (MODE) algorithm (a type of MOEA) is utilized as the underlying optimization strategy due to its faster convergence rate and better performance compared to other MOEAs, as reported in the literature. This dissertation thus makes six key contributions; the methods proposed in this thesis are listed below:

• First (Chapter 3), a novel multi-objective clustering framework, SMODoc_clust, was developed using the fusion of SOM in MODE. To improve the quality of the clusters, two internal cluster validity indices, namely the PBM index and the silhouette index, are optimized simultaneously. The proposed approach can find the optimal partitioning and the appropriate number of clusters automatically. For evaluation, a set of scientific articles and web documents are considered, and various syntactic (tf, tf-idf) and semantic (word2vec/GloVe) schemes are utilized to represent the documents. The efficacy of SMODoc_clust is compared with various clustering algorithms, such as K-means, a single-objective genetic algorithm, and three MOO-based clustering algorithms. The results show that the proposed multi-objective clustering is more effective than the existing techniques. It is important to note that in the word2vec/GloVe representation, we averaged the word vectors to obtain the document vectors, which may lose some semantic meaning. Therefore, in the future,


our method can be extended by using more robust representations, such as BERT and XLNet, among others, to represent the documents.

• Second (as reported in Chapter 4), a multi-objective clustering-based approach, ESDS_SMODE, was developed for extractive single-document summarization (ESDS). This work was done to demonstrate the potential of the approach developed in the previous contribution. After obtaining the optimal partitioning, a weighted sum of various sentence-scoring features, such as sentence length and sentence position, is utilized to rank the sentences in each cluster, and the top-scoring sentences are selected as part of the summary. It should be noted that in this approach, multi-objective clustering is applied to sentences instead of documents. The performance of ESDS_SMODE is compared with various supervised and unsupervised methods, including single- and multi-objective evolutionary frameworks, and the experimental results illustrate the effectiveness of the proposed technique for summary generation. In the future, this approach can be extended to perform automatic adaptation of the various parameters used in the framework and applied to query-based single-document summarization.

• Third (Chapter 5), ESDS is posed as a binary optimization problem in which different sentences of the document are selected based on various statistical measures such as readability, cohesion, and coverage. These measures are simultaneously optimized using a binary version of MODE called MOBDE, and the SOM-based operator is also explored in combination with MOBDE. At the end of the algorithm, a set of non-dominated solutions is obtained (each solution representing a summary), and the user can select the best solution by comparing the candidates with a gold/reference summary. However, in practice, the gold summary may not be available; in this case, various unsupervised approaches are explored to select the best solution. It was also shown that the performance of the developed summarization system depends not only on the chosen statistical measures but also on the type of similarity/dissimilarity function used. The obtained results show that the proposed approach is better than state-of-the-art techniques.

• Fourth (Chapter 6), we present an unsupervised approach, FigSum++, for the summarization of figures in biomedical articles. The approach generates a figure summary using the text associated with the figure in the article. Various syntactic and semantic measures quantifying the relevance of sentences to the figure to be summarized are optimized simultaneously using MOBDE, and a new way of measuring redundancy in the summary, in terms of textual entailment, is also proposed. Generally, MOBDE uses a single DE variant (rand/1/bin) for new solution generation; in this work, however, an ensemble of two different DE variants is used to maintain diversity among the solutions and convergence towards the optimal solution (or optimal summary). The results clearly show that our system performs better than the existing systems. In the future, we would like to parallelize our summarization system by simultaneously generating summaries of all the figures of a given article.

• Fifth (Chapter 7), we proposed a MOO-based method for microblog summarization, MOOTweetSumm, which extracts a set of situational tweets (providing human assistance in case of a disaster event) as a summary. The results obtained were compared with various state-of-the-art techniques and showed significant improvements. An extension of the proposed approach was also shown to solve the multi-document summarization task.

• In the last work of the dissertation (Chapter 8), an unsupervised approach, MMTweetSumm, is developed to handle multiple modalities of the tweets, i.e., tweet-texts and images. Note that MMTweetSumm is the extension of MOOTweetSumm developed in the previous chapter. To extract natural language features from the images, an image dense-captioning model is utilized. At each generation of the MOBDE framework, each solution can perform either exploitation (local search) or exploration (global search) based on a mating restriction probability. This probability is self-adaptive and based on the survival length (SL) of the solution: a high SL value indicates that the solution is of good quality, and thus exploitation (local search) takes place; otherwise, exploration (global search) is performed.

9.2 Suggestions for Further Work

Although in this thesis we have made every possible attempt to address the research problems in this area, there is still room for improvement, and considerable potential work remains to be carried out in the future. Some potential directions for future research are highlighted below:

• Image-captioning for disaster-related images: In the multi-modal microblog summarization task, we observed that users posted different types of images, such as newspaper cuttings, word clouds, and handwritten notes. The existing image dense captioning model is unable to generate captions for these types of images; for them, it just provides captions like 'on the wall' and 'in the photo'. Moreover, for many of the remaining images, the model does not perform well. Thus, there is a need to develop an image-captioning model specialized for disaster-related images. To meet that objective, a


large corpus must be created consisting of disaster-related images along with their corresponding captions. This corpus can then be utilized to train image dense captioning models.

• Online Microblog Summarization: This task aims to summarize continuously arriving tweet streams during a disaster event. In other words, as many tweets arrive every minute or hour, it becomes imperative to update the initially generated summary based on the relevance of the new tweets and the changing information. A few works have attempted this task; however, in terms of techniques and improvement, the problem demands more exploration. It is also planned to use sophisticated word embedding techniques, such as BERT and tweet-specific embeddings, in association with emotion-aware embeddings to better capture the semantic dissimilarities between tweets.

• Automatic literature survey writing: Due to the increasing rate of new articles in various scientific fields, it is a tedious task for researchers to keep up with new advancements. Therefore, there is a dire need for automatically crawling the scientific documents for a given topic and then summarizing each document using citation-based summarization (in which the citations to a reference article are used to generate the summary of the reference paper). After generating the summaries, we can arrange the year-wise document summaries to obtain a literature survey. This would help generate the literature survey for a given topic automatically and, in turn, keep researchers up to date.

• Legal Text/Case Summarization: In the judicial domain, judges, attorneys, and caseworkers are always surrounded by a large volume of legal text (from district, state, high, supreme, and federal courts), and it is not easy to manage such cases. Moreover, legal text is more challenging than the text of scientific documents in terms of structure, hierarchy, ambiguity, size, and vocabulary. This calls for an automatic or simplified framework that could help legal workers manage the workload. In this direction, only a few limited works have been carried out, and they lack significant improvements. These challenges therefore motivate the study of legal text summarization with multi-objective optimization, which can help in this area.

• Multi-lingual Multi-document Summarization: The works addressed in this thesis covered only English text for summarization. The task becomes much trickier when the documents are written in multiple languages and a single summary must be generated. In a multicultural region like India, document content may be produced in various languages such as


Hindi, Malayalam, and Telugu. Therefore, there is a scope for future work in this direction.

References

[1] J. Handl and J. Knowles, “An evolutionary approach to multiobjective clustering,” IEEE transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007. [2] F. Alam, F. Ofli, and M. Imran, “Crisismmd: Multimodal twitter datasets from natural disasters,” in Twelfth International AAAI Conference on Web and Social Media, 2018. [3] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, “Link-based multi-verse optimizer for text documents clustering,” Applied Soft Computing, vol. 87, p. 106002, 2020. [4] C. C. Aggarwal and C. Zhai, Mining text data. Springer Science & Business Media, 2012. [5] T. Ghosal, D. Dey, A. Dutta, A. Ekbal, S. Saha, and P. Bhattacharyya, “A multiview clustering approach to identify out-of-scope submissions in peer review,” in 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 2019, pp. 392–393. [6] J. Cheng and M. Lapata, “Neural summarization by extracting sentences and words,” arXiv preprint arXiv:1603.07252, 2016. [7] R. Nallapati, F. Zhai, and B. Zhou, “Summarunner: A recurrent neural network based sequence model for extractive summarization of documents.” in AAAI, 2017, pp. 3075–3081. [8] E. Hovy, C.-Y. Lin et al., “Automated text summarization in summarist,” Advances in automatic text summarization, vol. 14, 1999. [9] S. Narayan, N. Papasarantopoulos, S. B. Cohen, and M. Lapata, “Neural extractive summarization with side information,” arXiv preprint arXiv:1704.04530, 2017. [10] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summariza- tion,” in Proceedings of international Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2015, pp. 379–389. [11] F. Liu, J. Flanigan, S. Thomson, N. Sadeh, and N. A. Smith, “Toward abstractive summarization using semantic representations,” in HLT-NAACL, 01 2015, pp. 1077–1086. [12] A. Jatowt, “Web page summarization using dynamic content,” in Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, 2004, pp. 344–345. [13] S. Rastkar, G. C. Murphy, and G. Murray, “Automatic summarization of bug reports,” IEEE Transactions on Software Engineering, vol. 40, no. 4, pp. 366–380, 2014. [14] D. Patel, S. Shah, and H. Chhinkaniwala, “Fuzzy logic based multi document summarization with improved sentence scoring and redundancy removal technique,” Expert Systems with Applications, vol. 134, pp. 167– 177, 2019. [15] A. Kanapala, S. Pal, and R. Pamula, “Text summarization from legal documents: a survey,” Artificial Intelligence Review, vol. 51, no. 3, pp. 371–402, 2019. [16] C. Barros, E. Lloret, E. Saquete, and B. Navarro-Colorado, “Natsum: Narrative abstractive summarization through cross-document timeline generation,” Information Processing & Management, vol. 56, no. 5, pp. 1775–1793, 2019. [17] A. Cohan and N. Goharian, “Scientific document summarization via citation contextualization and scientific discourse,” International Journal on Digital Libraries, vol. 19, no. 2-3, pp. 287–303, 2018. [18] M. Upadhyay, D. Radhakrishnan, and M. Natarajan, “Summarization and processing of email on a client computing device based on content contribution to an email thread using weighting techniques,” Oct. 16 2018, uS Patent 10,102,192.


[19] S. A. Bahrainian, “Just-in-time information retrieval and summarization for personal assistance,” Ph.D. dissertation, Università della Svizzera italiana, 2019.
[20] B. P. Ramesh, R. J. Sethi, and H. Yu, “Figure-associated text summarization and evaluation,” PLoS ONE, vol. 10, no. 2, 2015.
[21] S. Dutta, V. Chandra, K. Mehra, A. K. Das, T. Chakraborty, and S. Ghosh, “Ensemble algorithms for microblog summarization,” IEEE Intelligent Systems, vol. 33, no. 3, pp. 4–14, 2018.
[22] M. El-Haj, “MultiLing 2019: Financial narrative summarisation,” in Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources, 2019, pp. 6–10.
[23] I. Varga, M. Sano, K. Torisawa, C. Hashimoto, K. Ohtake, T. Kawai, J.-H. Oh, and S. De Saeger, “Aid is out there: Looking for help from tweets during a large scale disaster,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2013, pp. 1619–1629.
[24] G. Neubig, Y. Matsubayashi, M. Hagiwara, and K. Murakami, “Safety information mining—what can NLP do in a disaster—,” in Proceedings of the 5th International Joint Conference on Natural Language Processing, 2011, pp. 965–973.
[25] R. P. Futrelle, “Handling figures in document summarization,” in Proceedings of the Workshop at the Annual Meeting of the Association for Computational Linguistics, 2004, pp. 61–65.
[26] H. Yu, S. Agarwal, M. Johnston, and A. Cohen, “Are figure legends sufficient? Evaluating the contribution of associated text to biomedical figure comprehension,” Journal of Biomedical Discovery and Collaboration, vol. 4, no. 1, p. 1, 2009.
[27] K. Deb, “Multi-objective optimization,” in Search Methodologies. Springer, 2014, pp. 403–449.
[28] S. Saha and S. Bandyopadhyay, “A symmetry based multiobjective clustering technique for automatic evolution of clusters,” Pattern Recognition, vol. 43, no. 3, pp. 738–751, 2010.
[29] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[30] T. Kohonen, “The self-organizing map,” Neurocomputing, vol. 21, no. 1, pp. 1–6, 1998.
[31] S. Jungjit and A. Freitas, “A lexicographic multi-objective genetic algorithm for multi-label correlation based feature selection,” in Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation. ACM, 2015, pp. 989–996.
[32] G. Erkan and D. R. Radev, “LexRank: Graph-based lexical centrality as salience in text summarization,” Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.
[33] R. Mihalcea, “Graph-based ranking algorithms for sentence extraction, applied to text summarization,” in Proceedings of the ACL Interactive Poster and Demonstration Sessions, 2004, pp. 170–173.
[34] S. Dutta, V. Chandra, K. Mehra, S. Ghatak, A. K. Das, and S. Ghosh, “Summarizing microblogs during emergency events: A comparison of extractive summarization algorithms,” in Emerging Technologies in Data Mining and Information Security. Springer, 2019, pp. 859–872.
[35] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolutional localization networks for dense captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4565–4574.
[36] K. Price, R. M. Storn, and J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization. Springer Science & Business Media, 2006.
[37] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, 2001, vol. 16.
[38] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb, “A simulated annealing-based multiobjective optimization algorithm: AMOSA,” IEEE Transactions on Evolutionary Computation, vol. 12, no. 3, pp. 269–283, 2008.
[39] D. Zhang and B. Wei, “Comparison between differential evolution and particle swarm optimization algorithms,” in 2014 IEEE International Conference on Mechatronics and Automation (ICMA). IEEE, 2014, pp. 239–244.
[40] J. Vesterstrom and R. Thomsen, “A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems,” in IEEE Congress on Evolutionary Computation, vol. 2, 2004, pp. 1980–1987.
[41] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009.

[42] B. Fortuna, M. Grobelnik, and D. Mladenic, “Visualization of text document corpus,” Informatica, vol. 29, no. 4, 2005.
[43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[44] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[45] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: Pre-trained biomedical language representation model for biomedical text mining,” arXiv preprint arXiv:1901.08746, 2019.
[46] L. Wang, X. Fu, M. I. Menhas, and M. Fei, “A modified binary differential evolution algorithm,” in Life System Modeling and Intelligent Computing. Springer, 2010, pp. 49–57.
[47] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[48] L. Kaufman and P. J. Rousseeuw, “Clustering by means of medoids,” 1987.
[49] M. Van der Laan, K. Pollard, and J. Bryan, “A new partitioning around medoids algorithm,” Journal of Statistical Computation and Simulation, vol. 73, no. 8, pp. 575–584, 2003.
[50] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
[51] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196.
[52] H. Wang, “Introduction to word2vec and its application to find predominant word senses,” URL: http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec and its application to wsd.pdf, 2014.
[53] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in International Conference on Machine Learning, 2015, pp. 957–966.
[54] S.-H. Liu, K.-Y. Chen, Y.-L. Hsieh, B. Chen, H.-M. Wang, H.-C. Yen, and W.-L. Hsu, “Exploring word mover’s distance and semantic-aware embedding techniques for extractive broadcast news summarization,” in INTERSPEECH, 2016, pp. 670–674.
[55] R. M. Aliguliyev, “A new sentence similarity measure and sentence based extractive technique for automatic text summarization,” Expert Systems with Applications, vol. 36, no. 4, pp. 7764–7772, 2009.
[56] B. Desgraupes, “Clustering indices,” University of Paris Ouest-Lab Modal’X, vol. 1, p. 34, 2013.
[57] K. Suresh, D. Kundu, S. Ghosh, S. Das, and A. Abraham, “Data clustering using multi-objective differential evolution algorithms,” Fundamenta Informaticae, vol. 97, no. 4, pp. 381–403, 2009.
[58] S. Saha and S. Bandyopadhyay, “A generalized automatic clustering algorithm in a multiobjective framework,” Applied Soft Computing, vol. 13, no. 1, pp. 89–108, 2013.
[59] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[60] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, April 1979.
[61] M. K. Pakhira, S. Bandyopadhyay, and U. Maulik, “Validity index for crisp and fuzzy clusters,” Pattern Recognition, vol. 37, no. 3, pp. 487–501, 2004.
[62] S. S. Haykin, Neural Networks and Learning Machines. Pearson, Upper Saddle River, NJ, USA, 2009, vol. 3.
[63] A. Romanov and C. Shivade, “Lessons from natural language inference in the clinical domain,” arXiv preprint arXiv:1808.06752, 2018.
[64] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “SemEval-2017 Task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation,” arXiv preprint arXiv:1708.00055, 2017.
[65] A. B. Abacha, C. Shivade, and D. Demner-Fushman, “Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering,” in Proceedings of the 18th BioNLP Workshop and Shared Task, 2019, pp. 370–379.
[66] D. Dasgupta and Z. Michalewicz, Evolutionary Algorithms in Engineering Applications. Springer Science & Business Media, 2013.
[67] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,” Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.
[68] K.-L. Du and M. Swamy, “Particle swarm optimization,” in Search and Optimization by Metaheuristics. Springer, 2016, pp. 153–173.


[69] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas, “Data mining with an ant colony optimization algorithm,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 4, pp. 321–332, 2002.
[70] J. Carvalho, A. Prado, and A. Plastino, “A statistical and evolutionary approach to sentiment analysis,” in 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 2. IEEE, 2014, pp. 110–117.
[71] H. Lu, J. Chen, K. Yan, Q. Jin, Y. Xue, and Z. Gao, “A hybrid feature selection algorithm for gene expression data classification,” Neurocomputing, vol. 256, pp. 56–62, 2017.
[72] K. Zhang, H. Du, and M. W. Feldman, “Maximizing influence in a social network: Improved results using a genetic algorithm,” Physica A: Statistical Mechanics and its Applications, vol. 478, pp. 20–30, 2017.
[73] C. A. C. Coello and G. B. Lamont, Applications of Multi-objective Evolutionary Algorithms. World Scientific, 2004, vol. 1.
[74] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evolutionary Computation, vol. 8, no. 2, pp. 173–195, 2000.
[75] M. R. Bonyadi and Z. Michalewicz, “Particle swarm optimization for single objective continuous space problems: a review,” 2017.
[76] J. Kennedy, “Particle swarm optimization,” in Encyclopedia of Machine Learning. Springer, 2011, pp. 760–766.
[77] M. Dorigo and G. Di Caro, “Ant colony optimization: a new meta-heuristic,” in Proceedings of the 1999 Congress on Evolutionary Computation (CEC99), vol. 2. IEEE, 1999, pp. 1470–1477.
[78] N. Srinivas and K. Deb, “Muiltiobjective optimization using nondominated sorting in genetic algorithms,” Evolutionary Computation, vol. 2, no. 3, pp. 221–248, 1994.
[79] H. Zhang, A. Zhou, S. Song, Q. Zhang, X.-Z. Gao, and J. Zhang, “A self-organizing multiobjective evolutionary algorithm,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 5, pp. 792–806, 2016.
[80] B.-C. Wang, H.-X. Li, J.-P. Li, and Y. Wang, “Composite differential evolution for constrained evolutionary optimization,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–14, 2018.
[81] R. M. Alguliev, R. M. Aliguliyev, and N. R. Isazade, “DESAMC+DocSum: Differential evolution with self-adaptive mutation and crossover parameters for multi-document summarization,” Knowledge-Based Systems, vol. 36, pp. 21–38, 2012.
[82] X. Cui, T. E. Potok, and P. Palathingal, “Document clustering using particle swarm optimization,” in Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005). IEEE, 2005, pp. 185–191.
[83] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim, “Exploiting noun phrases and semantic relationships for text document clustering,” Information Sciences, vol. 179, no. 13, pp. 2249–2262, 2009.
[84] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[85] Y. Cai and J.-S. Yuan, “Text clustering based on improved DBSCAN algorithm,” Computer Engineering, vol. 12, p. 018, 2011.
[86] W. B. A. Karaa, A. S. Ashour, D. B. Sassi, P. Roy, N. Kausar, and N. Dey, “MEDLINE text mining: an enhancement genetic algorithm based approach for document clustering,” in Applications of Intelligent Optimization in Biology and Medicine. Springer, 2016, pp. 267–287.
[87] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. D. Sidiropoulos, “Robust volume minimization-based matrix factorization for remote sensing and document clustering,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.
[88] L. M. Abualigah, A. T. Khader, M. A. Al-Betar, and O. A. Alomari, “Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering,” Expert Systems with Applications, vol. 84, pp. 24–36, 2017.
[89] S. Bandyopadhyay, U. Maulik, and A. Mukhopadhyay, “Multiobjective genetic clustering for pixel classification in remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 5, pp. 1506–1511, 2007.
[90] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, “An improved algorithm for clustering gene expression data,” Bioinformatics, vol. 23, no. 21, pp. 2859–2865, 2007.
[91] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013.

[92] N. Saini, S. Saha, C. Soni, and P. Bhattacharyya, “Automatic evolution of bi-clusters from microarray data using self-organized multi-objective evolutionary algorithm,” Applied Intelligence, pp. 1–18, 2019.
[93] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, “Multiobjective genetic algorithm-based fuzzy clustering of categorical attributes,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 991–1005, 2009.
[94] R. Dong, “Differential evolution versus particle swarm optimization for PID controller design,” in 2009 Fifth International Conference on Natural Computation (ICNC ’09), vol. 3. IEEE, 2009, pp. 236–240.
[95] J.-Y. Yeh, H.-R. Ke, W.-P. Yang, and I.-H. Meng, “Text summarization using a trainable summarizer and latent semantic analysis,” Information Processing & Management, vol. 41, no. 1, pp. 75–95, 2005.
[96] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, “Document summarization using conditional random fields,” in IJCAI, vol. 7, 2007, pp. 2862–2867.
[97] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
[98] X. Wan, J. Yang, and J. Xiao, “Manifold-ranking based topic-focused multi-document summarization,” in IJCAI, vol. 7, 2007, pp. 2903–2908.
[99] H. Oliveira, R. D. Lins, R. Lima, and F. Freitas, “A regression-based approach using integer linear programming for single-document summarization,” in 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2017, pp. 270–277.
[100] A. Schrijver, Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[101] D. M. Dunlavy, D. P. O’Leary, J. M. Conroy, and J. D. Schlesinger, “QCS: A system for querying, clustering and summarizing documents,” Information Processing & Management, vol. 43, no. 6, pp. 1588–1605, 2007.
[102] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, “Assessing sentence scoring techniques for extractive text summarization,” Expert Systems with Applications, vol. 40, no. 14, pp. 5755–5764, 2013.
[103] M. Peyrard, “Principled approaches to automatic text summarization,” Ph.D. dissertation, Technische Universität Darmstadt, 2019. [Online]. Available: https://tuprints.ulb.tu-darmstadt.de/9012/8/Peyrard Maxime PhD Thesis.pdf
[104] W. Song, L. C. Choi, S. C. Park, and X. F. Ding, “Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization,” Expert Systems with Applications, vol. 38, no. 8, pp. 9112–9121, 2011.
[105] M. Mendoza, S. Bonilla, C. Noguera, C. Cobos, and E. León, “Extractive single-document summarization based on genetic operators and guided local search,” Expert Systems with Applications, vol. 41, no. 9, pp. 4158–4169, 2014.
[106] J. D. Knowles and D. W. Corne, “M-PAES: A memetic algorithm for multiobjective optimization,” in Proceedings of the 2000 Congress on Evolutionary Computation (CEC00), vol. 1. IEEE, 2000, pp. 325–332.
[107] M. Mendoza, C. Cobos, and E. León, “Extractive single-document summarization based on global-best harmony search and a greedy local optimizer,” in Mexican International Conference on Artificial Intelligence. Springer, 2015, pp. 52–66.
[108] R. M. Alguliyev, R. M. Aliguliyev, N. R. Isazade, A. Abdi, and N. Idris, “COSUM: Text summarization based on clustering and optimization,” Expert Systems, p. e12340, 2018.
[109] K. Svore, L. Vanderwende, and C. Burges, “Enhancing single-document summarization by combining RankNet and third-party sources,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
[110] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 89–96.
[111] R. Passonneau, K. Kukich, J. Robin, V. Hatzivassiloglou, L. Lefkowitz, and H. Jing, “Generating summaries of workflow diagrams,” in Proceedings of the International Conference on Natural Language Processing and Industrial Applications, 1996, pp. 204–210.
[112] R. P. Futrelle, “Summarization of diagrams in documents,” Advances in Automated Text Summarization, pp. 403–421, 1999.


[113] S. Agarwal and H. Yu, “FigSum: automatically generating structured text summaries for figures in biomedical literature,” in AMIA Annual Symposium Proceedings, vol. 2009. American Medical Informatics Association, 2009, p. 6.
[114] P. Wu and S. Carberry, “Toward extractive summarization of multimodal documents,” in Proceedings of the Workshop on Text Summarization at the Canadian Conference on Artificial Intelligence, 2011, pp. 53–61.
[115] S. Bhatia and P. Mitra, “Summarizing figures, tables, and algorithms in scientific publications to augment search results,” ACM Transactions on Information Systems (TOIS), vol. 30, no. 1, p. 3, 2012.
[116] M. A. H. Khan, D. Bollegala, G. Liu, and K. Sezaki, “Multi-tweet summarization of real-time events,” in 2013 International Conference on Social Computing (SocialCom). IEEE, 2013, pp. 128–133.
[117] L. Shou, Z. Wang, K. Chen, and G. Chen, “Sumblr: continuous summarization of evolving tweet streams,” in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 533–542.
[118] A. Olariu, “Efficient online summarization of microblogging streams,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, 2014, pp. 236–240.
[119] A. Zubiaga, D. Spina, E. Amigó, and J. Gonzalo, “Towards real-time summarization of scheduled events from twitter streams,” in Proceedings of the 23rd ACM Conference on Hypertext and Social Media. ACM, 2012, pp. 319–320.
[120] N. Garg, B. Favre, K. Reidhammer, and D. Hakkani-Tür, “ClusterRank: a graph based method for meeting summarization,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
[121] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001, pp. 19–25.
[122] H. P. Luhn, “The automatic creation of literature abstracts,” IBM Journal of Research and Development, vol. 2, no. 2, pp. 159–165, 1958.
[123] D. R. Radev, E. Hovy, and K. McKeown, “Introduction to the special issue on summarization,” Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[124] A. Nenkova and L. Vanderwende, “The impact of frequency on summarization,” Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005, vol. 101, 2005.
[125] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He, “Document summarization based on data reconstruction,” in AAAI, 2012.
[126] K. Rudra, S. Ghosh, N. Ganguly, P. Goyal, and S. Ghosh, “Extracting situational information from microblogs during disaster events: a classification-summarization approach,” in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 583–592.
[127] K. Rudra, N. Ganguly, P. Goyal, and S. Ghosh, “Extracting and summarizing situational information from the twitter social media during disasters,” ACM Transactions on the Web (TWEB), vol. 12, no. 3, p. 17, 2018.
[128] C. De Maio, G. Fenza, V. Loia, and M. Parente, “Time aware knowledge extraction for microblog summarization on twitter,” Information Fusion, vol. 28, pp. 60–74, 2016.
[129] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[130] J. Bian, Y. Yang, and T.-S. Chua, “Multimedia summarization for trending topics in microblogs,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 1807–1812.
[131] J. Bian, Y. Yang, H. Zhang, and T.-S. Chua, “Multimedia summarization for social events in microblog stream,” IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 216–228, 2014.
[132] F. Amato, A. Castiglione, V. Moscato, A. Picariello, and G. Sperlì, “Multimedia summarization using social media content,” Multimedia Tools and Applications, vol. 77, no. 14, pp. 17803–17827, 2018.
[133] Y. Rizk, H. S. Jomaa, M. Awad, and C. Castillo, “A computationally efficient multi-modal classification approach of disaster-related twitter images,” in Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 2019, pp. 2050–2059.
[134] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” Text Summarization Branches Out, 2004.

[135] M. Steinbach, G. Karypis, V. Kumar et al., “A comparison of document clustering techniques,” in KDD Workshop on Text Mining, vol. 400, no. 1. Boston, 2000, pp. 525–526.
[136] A. Starczewski, “A new validity index for crisp clusters,” Pattern Analysis and Applications, vol. 20, no. 3, pp. 687–700, 2017.
[137] F. Kovács, C. Legány, and A. Babos, “Cluster validity measurement techniques,” in 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005.
[138] S. Saha and S. Bandyopadhyay, “Some connectivity based cluster validity indices,” Applied Soft Computing, vol. 12, no. 5, pp. 1555–1565, 2012.
[139] S. Bandyopadhyay and U. Maulik, “Nonparametric genetic clustering: comparison of validity indices,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 1, pp. 120–125, 2001.
[140] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, “An extensive comparative study of cluster validity indices,” Pattern Recognition, vol. 46, no. 1, pp. 243–256, 2013.
[141] N. Saini, S. Chourasia, S. Saha, and P. Bhattacharyya, “A self organizing map based multi-objective framework for automatic evolution of clusters,” in International Conference on Neural Information Processing. Springer, 2017, pp. 672–682.
[142] K. Deb and S. Tiwari, “Omni-optimizer: A generic evolutionary algorithm for single and multi-objective optimization,” European Journal of Operational Research, vol. 185, no. 3, pp. 1062–1087, 2008.
[143] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[144] A. Cardoso-Cachopo, “Improving methods for single-label text categorization,” Ph.D. thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
[145] S. Bandyopadhyay and S. Saha, “A new principal axis based line symmetry measurement and its application to clustering,” in International Conference on Neural Information Processing. Springer, 2008, pp. 543–550.
[146] P. Dutta and S. Saha, “Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering,” Computers in Biology and Medicine, vol. 89, pp. 31–43, 2017.
[147] E. Loper and S. Bird, “NLTK: The Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ser. ETMTNLP ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 63–70. [Online]. Available: https://doi.org/10.3115/1118108.1118117
[148] T. Korenius, J. Laurikkala, K. Järvelin, and M. Juhola, “Stemming and lemmatization in the clustering of finnish text documents,” in Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management. ACM, 2004, pp. 625–633.
[149] S. Bandyopadhyay and S. Saha, “A point symmetry-based clustering technique for automatic evolution of clusters,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1441–1457, 2008.
[150] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms. John Wiley and Sons, 2001.
[151] C. M. Fonseca and P. J. Fleming, “An overview of evolutionary algorithms in multiobjective optimization,” Evolutionary Computation, vol. 3, no. 1, pp. 1–16, 1995.
[152] S. Acharya, S. Saha, J. G. Moreno, and G. Dias, “Multi-objective search results clustering,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 99–108.
[153] E. Mezura-Montes, M. Reyes-Sierra, and C. A. C. Coello, “Multi-objective optimization using differential evolution: a survey of the state-of-the-art,” in Advances in Differential Evolution. Springer, 2008, pp. 173–196.
[154] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[155] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[156] B. L. Welch, “The generalization of ‘Student’s’ problem when several different population variances are involved,” Biometrika, vol. 34, no. 1/2, pp. 28–35, 1947. [Online]. Available: http://www.jstor.org/stable/2332510


[157] D. G. Roussinov and H. Chen, “A scalable self-organizing map algorithm for textual classification: A neural network approach to thesaurus generation,” 1998.
[158] R. M. Aliguliyev, “Clustering techniques and discrete particle swarm optimization algorithm for multi-document summarization,” Computational Intelligence, vol. 26, no. 4, pp. 420–448, 2010.
[159] S. Saha, S. Mitra, and S. Kramer, “Exploring multiobjective optimization for multiview clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12, no. 4, pp. 1–30, 2018.
[160] X. Wan, “Towards a unified approach to simultaneous single-document and multi-document summarizations,” in Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1137–1145.
[161] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, “Multiobjective genetic clustering with ensemble among pareto front solutions: Application to MRI brain image segmentation,” in 2009 Seventh International Conference on Advances in Pattern Recognition (ICAPR ’09). IEEE, 2009, pp. 236–239.
[162] J. Kupiec, J. Pedersen, and F. Chen, “A trainable document summarizer,” in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1995, pp. 68–73.
[163] S. Mirjalili, S. Saremi, S. M. Mirjalili, and L. d. S. Coelho, “Multi-objective grey wolf optimizer: a novel algorithm for multi-criterion optimization,” Expert Systems with Applications, vol. 47, pp. 106–119, 2016.
[164] A. Sadollah, H. Eskandar, A. Bahreininejad, and J. H. Kim, “Water cycle algorithm for solving multi-objective optimization problems,” Soft Computing, vol. 19, no. 9, pp. 2587–2603, 2015.
[165] S. Mirjalili, S. M. Mirjalili, and A. Lewis, “Grey wolf optimizer,” Advances in Engineering Software, vol. 69, pp. 46–61, 2014.
[166] H. Eskandar, A. Sadollah, A. Bahreininejad, and M. Hamdi, “Water cycle algorithm – a novel metaheuristic optimization method for solving constrained engineering optimization problems,” Computers & Structures, vol. 110, pp. 151–166, 2012.
[167] A. Sadollah, H. Eskandar, A. Bahreininejad, and J. H. Kim, “Water cycle algorithm with evaporation rate for solving constrained and unconstrained optimization problems,” Applied Soft Computing, vol. 30, pp. 58–71, 2015.
[168] H. Zhang, A. Zhou, S. Song, Q. Zhang, X. Z. Gao, and J. Zhang, “A self-organizing multiobjective evolutionary algorithm,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 5, pp. 792–806, Oct 2016.
[169] E. Shareghi and L. S. Hassanabadi, “Text summarization with harmony search algorithm-based sentence extraction,” in Proceedings of the 5th International Conference on Soft Computing as Transdisciplinary Science and Technology. ACM, 2008, pp. 226–231.
[170] J. Sadeghi, S. Sadeghi, and S. T. A. Niaki, “A hybrid vendor managed inventory and redundancy allocation optimization problem in supply chain management: An NSGA-II with tuned parameters,” Computers & Operations Research, vol. 41, pp. 53–64, 2014.
[171] S. Khalilpourazari and S. Khalilpourazary, “Optimization of production time in the multi-pass milling process via a robust grey wolf optimizer,” Neural Computing and Applications, vol. 29, no. 12, pp. 1321–1336, 2018.
[172] J. Sadeghi and S. T. A. Niaki, “Two parameter tuned multi-objective evolutionary algorithms for a bi-objective vendor managed inventory model with trapezoidal fuzzy demand,” Applied Soft Computing, vol. 30, pp. 567–576, 2015.
[173] H. Li and Q. Zhang, “Multiobjective optimization problems with complicated pareto sets, MOEA/D and NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 2, pp. 284–302, 2009.
[174] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for k-medoids clustering,” Expert Systems with Applications, vol. 36, no. 2, pp. 3336–3341, 2009.
[175] R. L. Cilibrasi and P. M. Vitanyi, “The google similarity distance,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, 2007.
[176] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for automatic text summarization,” Computer Speech & Language, vol. 23, no. 1, pp. 126–144, 2009.
[177] D. R. Radev, H. Jing, M. Styś, and D. Tam, “Centroid-based summarization of multiple documents,” Information Processing & Management, vol. 40, no. 6, pp. 919–938, 2004.

[178] C. N. Silla, G. L. Pappa, A. A. Freitas, and C. A. Kaestner, “Automatic text summarization with genetic algorithm-based attribute selection,” in Ibero-American Conference on Artificial Intelligence. Springer, 2004, pp. 305–314.
[179] V. Gupta, P. Chauhan, S. Garg, A. Borude, and S. Krishnan, “An statistical tool for multi-document summarization,” International Journal of Scientific and Research Publications, vol. 2, no. 5, 2012.
[180] D. Liu, Y. He, D. Ji, and H. Yang, “Genetic algorithm based multi-document summarization,” in Pacific Rim International Conference on Artificial Intelligence. Springer, 2006, pp. 1140–1144.
[181] S. Bird and E. Loper, “NLTK: the natural language toolkit,” in Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2004, p. 31.
[182] J. H. Lau and T. Baldwin, “An empirical evaluation of doc2vec with practical insights into document embedding generation,” arXiv preprint arXiv:1607.05368, 2016.
[183] K. Mani, I. Verma, H. Meisheri, and L. Dey, “Multi-document summarization using distributed bag-of-words model,” in IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, 2018, pp. 672–675.
[184] R. D. Lins, R. F. Mello, and S. Simske, “DocEng’19 competition on extractive text summarization,” in Proceedings of the ACM Symposium on Document Engineering 2019, 2019, pp. 1–2.
[185] H. Oliveira, R. Lima, R. D. Lins, F. Freitas, M. Riss, and S. J. Simske, “A concept-based integer linear programming approach for single-document summarization,” in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2016, pp. 403–408.
[186] A. Jangra, A. Jatowt, M. Hasanuzzaman, and S. Saha, “Text-image-video summary generation using joint integer linear programming,” in European Conference on Information Retrieval. Springer, 2020, pp. 190–198.
[187] B. L. Welch, “The generalization of ‘Student’s’ problem when several different population variances are involved,” Biometrika, vol. 34, no. 1/2, pp. 28–35, 1947.
[188] J. Ramos et al., “Using TF-IDF to determine word relevance in document queries,” in Proceedings of the First Instructional Conference on Machine Learning, vol. 242, 2003, pp. 133–142.
[189] A. Huang, “Similarity measures for text document clustering,” in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, vol. 4, 2008, pp. 9–56.
[190] Y. Wang, Z. Cai, and Q. Zhang, “Differential evolution with composite trial vector generation strategies and control parameters,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 55–66, 2011.
[191] A. W. Mohamed, H. Z. Sabry, and T. Abd-Elaziz, “Real parameter optimization by an effective differential evolution algorithm,” Egyptian Informatics Journal, vol. 14, no. 1, pp. 37–53, 2013.
[192] S. Das and P. N. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, 2011.
[193] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[194] I. Rish et al., “An empirical study of the naive bayes classifier,” in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, 2001, pp. 41–46.
[195] L. Wang, Support Vector Machines: Theory and Applications. Springer Science & Business Media, 2005, vol. 177.
[196] M. Imran, P. Mitra, and C. Castillo, “Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris, France: European Language Resources Association (ELRA), May 2016.
[197] Y. Qu, C. Huang, P. Zhang, and J. Zhang, “Microblogging after a major disaster in china: a case study of the 2010 yushu earthquake,” in Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work. ACM, 2011, pp. 25–34.
[198] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 851–860.
[199] K. Rudra, P. Goyal, N. Ganguly, P. Mitra, and M. Imran, “Identifying sub-events and summarizing disaster-related information from microblogs,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018, pp. 265–274.


[200] Z. Wang and G. P. Rangaiah, “Application and analysis of methods for selecting an optimal solution from the pareto-optimal front obtained by multiobjective optimization,” Industrial & Engineering Chemistry Research, vol. 56, no. 2, pp. 560–574, 2017.
[201] R. M. Alguliev, R. M. Aliguliyev, and C. A. Mehdiyev, “Sentence selection for generic document summarization using an adaptive differential evolution algorithm,” Swarm and Evolutionary Computation, vol. 1, no. 4, pp. 213–222, 2011.
[202] H. H. Saleh, N. J. Kadhim, and A. A. Bara’a, “A genetic based optimization model for extractive multi-document text summarization,” Iraqi Journal of Science, vol. 56, no. 2B, pp. 1489–1498, 2015.
[203] X. Li, H. Zhang, and S. Song, “A self-adaptive mating restriction strategy based on survival length for evolutionary multiobjective optimization,” Swarm and Evolutionary Computation, vol. 43, pp. 31–49, 2018.
[204] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolutional localization networks for dense captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[205] L. Yang, K. Tang, J. Yang, and L.-J. Li, “Dense captioning with joint inference and visual context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2193–2202.
[206] K. Ahmad, M. L. Mekhalfi, N. Conci, F. Melgani, and F. D. Natale, “Ensemble of deep models for event recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 2, pp. 1–20, 2018.
[207] S. Vesal, N. Ravikumar, A. Davari, S. Ellmann, and A. Maier, “Classification of breast cancer histology images using transfer learning,” in International Conference on Image Analysis and Recognition. Springer, 2018, pp. 812–819.
[208] S. Robertson, H. Zaragoza et al., “The probabilistic relevance framework: BM25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.

Publications

Journals

1. Saini, N., Saha, S., Harsh, A., & Bhattacharyya, P. (2018): Sophisticated SOM based genetic operators in multi-objective clustering framework. Applied Intelligence, 49(5), 1803-1822. (Impact factor: 2.88)

2. Saini, N., Saha, S., & Bhattacharyya, P. (2018): Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution. Cognitive Computation, 11(2), 271-293. (Impact factor: 4.87)

3. Saini, N., Saha, S., Jangra, A., & Bhattacharyya, P. (2018): Extractive single document summarization using multi-objective optimization: Exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowledge-Based Systems, 164, 45-67. (Impact factor: 5.10)

4. Saini, N., Saha, S., Chakraborty, D., & Bhattacharyya, P. (2019): Extractive single document summarization using binary differential evolution: Optimization of different sentence quality measures. PLoS ONE, 14(11): e0223477. (Impact factor: 2.76, h5 index: 176)

5. Saini, N., Saha, S., Potnuru, V., Grover, R., & Bhattacharyya, P. (2019): Figure-Summarization: A Multi-objective Optimization based Approach. IEEE Intelligent Systems. (Impact factor: 4.64)

6. Saini, N., Saha, S., Bhattacharyya, P., & Tuteja, H. (August 2019): Textual Entailment based Figure Summarization for Biomedical Articles. ACM Transactions on Multimedia Computing, Communications, and Applications. (accepted) (Impact Factor: 2.25)

7. Saini, N., Saha, S., & Bhattacharyya, P. (2019): A Multi-objective Based Approach for Microblog Summarization. IEEE Transactions on Computational Social Systems.

8. Saini, N., Saha, S., Mansoori, S., & Bhattacharyya, P. (2020): Fusion of self-organizing map and granular self-organizing map for microblog summarization. Soft Computing, pp. 1-13. (Impact Factor: 3.05)

Conferences

1. Saini, N., Chourasia, S., Saha, S., & Bhattacharyya, P. (2017): A self-organizing map based multi-objective framework for automatic evolution of clusters. In International Conference on Neural Information Processing (ICONIP 2017) (pp. 672-682). Springer, Cham. (Core ranking: A, h5 index: 21)


2. Saini, N., Saha, S., Kumar, A., & Bhattacharyya, P. (September 2019): Multi-document Summarization using Adaptive Composite Differential Evolution. In International Conference on Neural Information Processing (ICONIP 2019). Springer. (Core ranking: A, h5 index: 21)

3. Saini, N., Kumar, S., Saha, S., & Bhattacharyya, P. (2020): Mining Graph-based Features in Multi-objective Framework for Microblog Summarization. In IEEE Congress on Evolutionary Computation (IEEE CEC 2020). (Core ranking: A, h5 index: 68)

Other related Accepted Journal Publication

1. Saini, N., Saha, S., Soni, C., & Bhattacharyya, P. (September 2019): Automatic Evolution of Bi-clusters from Microarray Data using Self-Organized Multi-objective Evolutionary Algorithm. Applied Intelligence. (Impact factor: 2.88)

Other related Accepted Conference Publications

1. Saini, N., Saha, S., & Bhattacharyya, P. (2018): Cascaded SOM: An improved tech- nique for automatic email classification. In 2018 International Joint Conference on Neural Networks (IJCNN 2018) (pp. 1-8). IEEE. (Core ranking: A, h5 index: 36)

2. Saini, N., Grover, R., Saha, S., & Bhattacharyya, P. (September 2019): Scientific Document Clustering using Granular Self-organizing Map. In International Conference on Neural Information Processing (ICONIP 2019). Springer. (Core ranking: A, h5 index: 21)

3. Saini, N., Saha, S., Bhattacharyya, P. (September 2019): Incorporation of Neighborhood Concept in Enhancing SOM based Multi-label Classification. In International Conference on Pattern Recognition and Machine Intelligence (PReMI 2019). Springer.

4. Saini, N., Reddy, S., Saha, S., & Bhattacharyya, P.: A Multi-view Clustering Approach for Scientific Document Summarization Using Citation Context, In IEEE International Conference on Pattern Recognition (ICPR 2020). IEEE. (accepted) (Core ranking: B, h5 index: 38)

Under Review Journals

1. Saini, N., Saha, S., Bhattacharyya, P., Mrinal, S., & Mishra, S.: On Multi-modal Microblog Summarization, In IEEE Transactions on Multimedia. (Impact factor: 5.45)

2. Saini, N., Bansal, D., Saha, S., & Bhattacharyya, P.: Textual Entailment based Multi-objective Multi-view Search Results Clustering, In Expert Systems with Applications. (Impact factor: 5.85)

3. Saini, N., Saha, S., & Bhattacharyya, P.: Microblog Summarization using Self-adaptive Multi-objective Binary Differential Evolution, In Neural Computing and Applications. (Impact factor: 4.66)
