Investigations in Document Clustering and Summarization

Submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

by

Naveen Saini
(Roll No. 1621CS12)

Under the supervision of
Dr. Sriparna Saha
Prof. Pushpak Bhattacharyya

Department of Computer Science and Engineering
Indian Institute of Technology Patna
Patna - 801106, India
March 2020

© 2020 by Naveen Saini. All rights reserved.

APPROVAL OF THE DOCTORAL COMMITTEE

Certified that the thesis entitled “Investigations in Document Clustering and Summarization”, submitted by Naveen Saini to Indian Institute of Technology Patna for the award of the degree of Doctor of Philosophy, has been accepted by the doctoral committee members after the successful completion of the synopsis seminar held on 07 January 2020.

Dr. Sriparna Saha
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Prof. Pushpak Bhattacharyya
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Dr. Asif Ekbal
Chairperson, Doctoral Committee
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Dr. Jimson Mathew
Member, Doctoral Committee
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Dr. Yogesh Mani Tripathi
Member, Doctoral Committee
Department of Mathematics
Indian Institute of Technology Patna


DECLARATION BY THE SCHOLAR

I certify that:

• The work contained in this thesis is original and has been done by me under the guidance of my supervisors.

• The work has not been submitted to any other Institute for any degree or diploma.

• I have followed the guidelines provided by the Institute in preparing the thesis.

• I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.

• Whenever I have used materials (data, theory and text) from other sources, I have given due credit to them by citing them in the text of the thesis and giving their details in the reference section.

• The thesis has been checked by anti-plagiarism software.

Naveen Saini


CERTIFICATE

This is to certify that the thesis entitled “Investigations in Document Clustering and Summarization”, submitted by Naveen Saini to Indian Institute of Technology Patna, is a record of bonafide research work under our supervision and we consider it worthy of consideration for the degree of Doctor of Philosophy of the Institute.

Dr. Sriparna Saha
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Prof. Pushpak Bhattacharyya
Supervisor
Department of Computer Sc. and Engg.
Indian Institute of Technology Patna

Place: Indian Institute of Technology Patna Date:


Acknowledgement

First and foremost, I would like to express my deep and heartfelt gratitude to my supervisors, Dr Sriparna Saha and Professor Pushpak Bhattacharyya, for their blessings and valuable advice throughout my research journey at IIT Patna. Their continuous support and encouragement have always inspired and motivated me to give my best. They assisted me in shaping my research ideas and helped me get through all the obstacles in my four-year PhD journey. I could not have imagined having better advisors for mentoring my PhD. I am indebted to them for their constant support, time, suggestions, and positive attitude towards my research. Besides my supervisors, I am also grateful to the members of my Doctoral Committee (Dr Asif Ekbal, Dr Jimson Mathew, and Dr Yogesh Mani Tripathi) for examining my work and providing their valuable comments and suggestions. Moreover, I am thankful to all members of the AI-NLP-ML Group of IIT Patna for supporting me and for all the fun we have had in the last four years. I would also like to extend my appreciation to my seniors, batch-mates and juniors for their support. Furthermore, I would like to express my sincere gratitude to my parents, brother and sisters for their never-ending love, care, and affection in each and every step of my life. They always encouraged me to achieve my goals. I am grateful to my lovely wife Nisha and my in-laws for their perpetual understanding, patience, and the support they provided endlessly throughout my research. Last but not least, I thank the Department of Computer Science and Engineering and IIT Patna itself for giving me an opportunity to do my research while providing all the research facilities and travel grants.

Place: Indian Institute of Technology Patna Date: Naveen Saini


Abstract

A tremendous amount of text content is available in the form of documents, microblogs, scientific articles, and other sources, and it keeps growing exponentially over time with the arrival of new data from multiple sources. In order to scan through such a large volume of data, there is a need to develop efficient text-mining techniques. In this direction, several supervised methods have been developed to prevent decision makers from being overwhelmed by too much information. However, these supervised methods require a massive amount of labelled data, and data annotation is a very time-consuming and costly process. These challenges therefore demand the development of unsupervised methods. In the current thesis, two areas of text mining have been deeply investigated, namely document clustering and summarization, by developing unsupervised techniques to solve them. In document clustering, the task is to find the optimal partitioning of a given set of documents in an automatic way. In summarization, on the other hand, the aim is to compress the available data into a concise form that retains the relevant information. Different facets of summarization, like document summarization, figure-summarization, microblog summarization, and multi-modal microblog summarization, were explored in this thesis. The task of summarization is posed as a multi-objective optimization problem where multiple quality measures, like cohesion, readability and anti-redundancy, among others, are optimized simultaneously. A meta-heuristic optimization technique, namely differential evolution, is used as the underlying optimization strategy. Several new genetic operators inspired by the concepts of a self-organizing map are also incorporated in the optimization process. We employed the ROUGE-N measure to assess the quality of the extracted summaries. Extensive experimentation has verified that all our proposed methods outperform the existing methods when tested on task-related datasets.

Keywords: Unsupervised Learning, Clustering, Document Summarization, Figure-summarization, Microblog Summarization, Multi-modal Microblog Summarization, Multi-objective Optimization, Binary Optimization, Evolutionary Algorithm, Image Dense-Captioning, Word Mover Distance, Cosine Distance, Cluster Validity Indices, Self-organizing Map, Syntactic and Semantic Similarity.


List of Tables

2.1 Definitions of Cluster validity measures/indices. Here, K: number of clusters; N: number of data points; dist: distance function; Opt. in the last column refers to optimization...... 22

3.1 Parameter setting for our proposed approach ... 55
3.2 Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; N: Number of scientific articles; F: Vocabulary size; OC: Obtained number of clusters; DI: Dunn Index; xx: all data points assigned to a single cluster ... 58
3.3 Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; N: Number of scientific articles; F: Vocabulary size; OC: Obtained number of clusters; DB: Davies-Bouldin Index; xx: all data points assigned to a single cluster ... 59
3.4 Values of different components of the Dunn Index for tf, tf-idf and Glove representation with 100 dimensions on the WebKB dataset. Here, Rep. denotes representation; OC: obtained cluster; DI: Dunn Index; a: minimum distance between two points belonging to different clusters; b: maximum diameter amongst different clusters ... 61
3.5 Results reporting DB index values obtained after application of the proposed clustering algorithm on WebKB documents using Doc2vec representation in comparison to other clustering algorithms. Here, Rep. denotes representation; N: Number of scientific articles; F: Vocabulary size; OC: Obtained number of clusters; DB: Davies-Bouldin Index ... 62
3.6 p-values obtained after conducting t-test comparing the performance of the proposed SMODoc clust algorithm with other existing clustering techniques with respect to Dunn index values reported in Table 3.2. Here, xx: values are absent in Table 3.2 ... 63
3.7 Comparative complexity analysis of existing clustering algorithms. Here, R is the number of reference distributions [1]; K is the maximum number of clusters present in a data set, which is √N; N is the number of data points; TotalIter is the number of iterations used, chosen in such a way that the number of fitness evaluations of all the algorithms becomes equal ... 64

4.1 Brief description of datasets used for single document summarization ...... 75 4.2 Experiment results on ESDS SMODE on different parameter combinations. The values of CR, F and eta correspond to levels (1, 2, 3) are (0.4, 0.6, 0.8), (0.3, 0.8, 1.5) and (19, 20, 21), respectively. Here, SNRA is the Signal to Noise Ratio, MEAN is mean of uncontrolled factor values (ROUGE-1 score values) of different documents. 77


4.3 ROUGE Scores of different methods on DUC2001 and DUC2002 data sets . . . . 77 4.4 Improvements obtained by our proposed approach over other methods based on ROUGE−2 score ...... 79 4.5 Improvements obtained by our proposed approach over other methods using ROUGE−1 score on DUC2002 dataset ...... 81 4.6 Improvements obtained by DE over other methods using ROUGE−1 score on DUC2001 dataset ...... 81

5.1 ROUGE Scores attained by different methods for DUC2001 and DUC2002 data sets...... 101 5.2 ROUGE Scores attained by proposed Approach-1 and Approach-2 utilizing word mover distance (WMD) on CNN dataset. Here, SMaxRouge strategy is used for selecting a single best solution from the final Pareto front...... 102 5.3 ROUGE Scores obtained using Approach-1 (WMD) when the best solution is selected using any of the strategies under UMaxRouge strategy. All the strategies explored here for selecting a single best solution from the final Pareto front are unsupervised in nature. Bold entries indicate they are able to beat the state-of- the-art algorithms...... 104 5.4 Improvements attained by the proposed approach, Approach-1 (WMD) with SOM based operators over other methods considering ROUGE scores. Here, xx indi- cates non-availability of results on the DUC2001 dataset...... 107 5.5 The p-values obtained by Approach-1 (WMD) with SOM and without SOM based operators (under SMaxRouge scheme) considering ROUGE-1 and ROUGE-2 score values...... 110 5.6 The p-values obtained by Approach-1 (WMD) with SOM based operators (under SMaxRouge scheme) considering respect to existing methods...... 114

6.1 Description of symbols used in describing objective functions (mathematical for- mulation)...... 121 6.2 Statistics of the used datasets. Here, U1 and U2 are the average number of unique sentences per figure in FigSumGS1 and FigSumGS2 dataset, respectively; #SentInGS is the number of sentences present in the gold summary; ‘-’ implies 18th and 19th articles are not used in FigSumGS2 dataset...... 129 6.3 Parameter setting for our proposed approach. Here, Q is the number of sentences in the actual summary specific to a figure...... 130 6.4 Average precision (P), recall (R) and F-measure (F1) values obtained for both datasets using reduced set of sentences. Here, the decimal number in the left of ‘’is the standard deviation...... 131 6.5 Average precision (P), recall (R) and F-measure (F1) values obtained by the proposed approach for both datasets namely, FigSumGS1 and FigSumGS2, by varying the objective function combinations. Here, the decimal number in the left of ‘’is the standard deviation. Note that here all sentences in the article are used for the experiment...... 132 6.6 Comparison of the best results obtained by our proposed approach with (a) un- supervised methods; (b) supervised methods, in terms of average precision (P), recall (R) and F-measure (F1) for both datasets namely, FigSumGS1 and Fig- SumGS2. Here, the decimal number in the left of ‘’is the standard deviation. Note that here all sentences in the article are used for the experiment...... 133


7.1 Dataset descriptions for Microblog Summarization ...... 153 7.2 Average ROUGE Scores over all datasets attained by the proposed method using supervised information. Here, † denotes the best results; it also indicates that results are statistically significant at 5% significance level...... 154 7.3 ROUGE Scores obtained by the proposed approach for different datasets using SBest selection method. Bold entries indicate the best results considering ‘with SOM’ and ‘without SOM’ based operators...... 155 7.4 ROUGE Scores obtained by the proposed approach for different datasets using UBest selection method...... 158 7.5 Average ROUGE Scores over all datasets attained by existing methods in compar- ison with the best results obtained by the proposed approach using SBest (Table 7.2) and UBest (Table 7.4) selection methods . Here, WOSOM refers to without SOM, SBest and UBest are the supervised and unsupervised selection methods. 158 7.6 Sensitivity analysis on the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in predicted summary, respectively...... 166 7.7 Sensitivity analysis of the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in the pre- dicted summary, respectively. Note that this table is a continuation of Table 7.6...... 167 7.8 Range of possible values for each of 5 parameters ...... 167 7.9 Average ROUGE scores over all datasets corresponding to different parameter settings shown in Table 7.6 and Table 7.7. Here PS stands for Parameter Setting. 168 7.10 Statistics about the DUC2002 topics and corresponding average ROUGE-2 scores. 168

8.1 Notations used with their descriptions. Here, tf-idf refers to Term frequency- inverse document frequency...... 173 8.2 Dataset Statistics. Here, #ITWIM: Informative tweets with informative images; #PITWIM: Pre-processed Informative tweets with informative images; #GT: Number of tweets in Gold Summary...... 176 8.3 Comparison of ROUGE scores obtained using proposed approach MMTweetSumm and MOOTweetSumm. Here, R and Obj in second row refer to objective functions used and ROUGE; A1 and A represent MaxAntiRedundancy calculation using text+image and text, respectively; T, L and BM25 represent MaxSumTFIDF, MaxLength, and MaxSumBM25 objective functions, respectively...... 183 8.4 Comparison of ROUGE scores attained by our method with the existing methods. 183


List of Figures

1.1 A set of documents clustered into three categories...... 2 1.2 Similarity finding between two living objects: (a) Tiger; (b) Cat...... 3 1.3 An example of extractive and abstractive summarization. In extractive, coloured lines form the summary. In abstractive, using the coloured words in the document, a new sentence is constructed...... 5 1.4 Layout of the thesis...... 10

2.1 Working model of K-means clustering algorithm ... 14
2.2 Working model of K-medoid clustering algorithm ... 15
2.3 Working model of single-linkage clustering algorithm ... 16
2.4 Most similar words for ‘sweden’ obtained using word2vec model ... 18
2.5 Word-pair relationship obtained using word2vec model [Source: Internet] ... 19
2.6 An illustration for WMD calculation between two texts. Here, a, b, c and d denote the distances between words. The bold words are the non-stop words embedded into a word2vec space ... 20
2.7 SOM Architecture. Here x^p = (x1^p, x2^p, ..., xn^p) is the input vector, Z1 and Z2 denote the axes of the 2-D map, wu is the weight vector of the uth neuron ... 22
2.8 A real-life example of SOM where wealthy nations like USA, Canada, etc., come close to each other (left side of the SOM grid), while poorer nations like Nepal, Bangladesh, etc., are on the opposite side of the SOM grid ... 23
2.9 An example of Textual Entailment ... 24
2.10 Comparison among SOO and MOO ... 25
2.11 Dominance and non-dominance between solutions obtained using MOO ... 26
2.12 The steps of evolutionary procedures ... 27
2.13 Mating pool construction for current solution using SOM ... 32

3.1 Flow chart of proposed algorithm for automatic multi-objective document clus- tering. Here, P: population containing solutions, |P |: size of the population, wi: weight vector of ith neuron, gmax: maximum number of generations, A: archive (copy of population P), Q: Mating pool; S: training data for SOM...... 44 3.2 Steps of population initialization ...... 45 3.3 Generation of trial solution ...... 47 3.4 Generation of new solution. Here rand() is a function which generates some random number between 0 to 1 ...... 48 3.5 Ranking of solutions...... 50 3.6 Word Cloud of (a) NIPS 2015 ; (b) AAAI 2013 ; (c) WebKB datasets ...... 52


3.7 Relevant cluster-keywords for (a) NIPS 2015; (b) AAAI 2013 data set correspond- ing to the best partitioning result obtained by the proposed approach ...... 57 3.8 Pareto optimal fronts obtained after application of the proposed clustering algo- rithm on scientific articles (a) NIPS 2015 ; (b) AAAI 2013 ; (c) WebKB datasets 60

4.1 Flow chart of the proposed architecture, ESDS SMODE, where, gmax is the user- defined maximum number of generations, g is the current generation number. . . 70 4.2 Results generated by Taguchi method. Here, SN is the Signal to Noise Ratio which we have to maximize. SN is maximum for CR, F and η (eta m) at levels 3, 2 and 2, respectively...... 78 4.3 Pareto Fronts obtained by ESDS SMODE over three documents of DUC2001 dataset. In (b) and (c), all solutions are of rank-1...... 80 4.4 An example of good quality-generated summary with respect to reference sum- mary for the document, AP 880316 − 0061, of topic d21d under DUC2001 dataset. 81 4.5 An example of low-quality summary. (a) Some sentences of the document, AP 891101− 0150, of topic d16c under DUC2001 dataset. (b) reference summary and predicted summary of the same document...... 82

5.1 Proposed architecture. Where g is the current generation number initialized with 0; gmax is the maximum number of generations which is defined by the user; |P | is the number of solutions in the population. After step-8, g is incremented by 1 and the process continues until maximum number of generations is reached. . . . 93 5.2 Pareto optimal fronts obtained after application of the proposed approach. Here, Proposed approach refers to Approach-1 (WMD) with SOM-based operators. Sub-figures (a), (b), (c) and (d) are the Pareto optimal fronts obtained after first, fourteen, nineteen and twenty-fifth generation, respectively. Red color dots represent Pareto optimal solutions; three axes represent three objective functional values, namely, sentence position, readability, coverage...... 102 5.3 Convergence plots. Sub-figures (a), (b), (c) and (d) show the convergence plots for four random documents. At each generation/iteration, maximum Rouge-1 and Rouge-2 scores are plotted...... 106 5.4 An example of reference summary and predicted summary for document AP 881109− 0149 of topic d21d under DUC2001 dataset...... 108 5.5 An example of reference summary and predicted summary for document SJMN91− 06106024 of topic d60k under DUC2001 dataset...... 109 5.6 Box plots. Sub-figures (a) and (b) for DUC2001 and DUC2002 dataset, respec- tively, show the variations of average Rouge-1/Rouge-2 values of highest ranked (rank-1) solutions in each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions...... 111 5.7 Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2001 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document...... 112 5.8 Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2002 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document...... 113


6.1 ith solution representation in the population. Here, 12 is the number of sentences in the article, ‘0’ denotes that the sentence will not be a part of extractive summary and vice-versa...... 121 6.2 Flow chart of the proposed architecture where, g is the current generation number initialized with 0 value, tmax is the user-defined maximum number of generations, |P | is the size of the population...... 124 6.3 Flow chart of generation of solutions from the current solution, ~xc,t at generation ‘t’ using two DE variants. Here, F and CR are the pool of some values; y1 and y2 are the trial vectors generated using current-to-rand/1/bin and current-to- best/1/bin scheme, respectively...... 127 6.4 Pareto optimal solutions obtained after applying our proposed approach at the end of 24th generation. (a) Figure illustrating objective functional values of SAR TE (denoted as SAR v2 in the figure), STE, and, SRF; (b) Figure illustrating the objective functional values of SRF, SOC1, and, SOC2. ‘fr-0’ in legend denotes solutions are of rank-1...... 135 6.5 An example of Summary obtained by our proposed approach. (a) Figure-4 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1159166; (b) Caption of the figure; (c) Actual and predicted summaries. Coloured lines (ex- cluding black colour lines) in actual and predicted summary indicate the matched lines...... 136 6.6 Box plots showing variations of the best F-measure values obtained for (a) Fig- SumGS1; (b) FigSumGS2 datasets. The symbols namely, A, B, C, D, E, F, and, G represent objective functions namely, SAR CS, SAR TE, STE, SRF, SFC, SOC1, and, SOC2, respectively...... 138

7.1 Figure showing (a) classification of tweets into situational and non-situational categories; (b) summarization of situational tweets...... 144 7.2 Example of Microblog Summarization...... 145 7.3 Flow chart of the proposed architecture where, g is the current generation number initialized to 0 and gmax is the user-defined maximum number of generations (termination condition), |P | is the size of population...... 149 7.4 Word clouds of disaster events, namely (a) Sandyhook (SH); (b) Uttarakhand flood (UK); (c) Typhoon Hagupit in Philippines (TH); and (d) Bomb blasts in Hyderabad (HB)...... 152 7.5 Figures showing the number of new solutions generated over the generations by our proposed approach using two objectives, Ob1+ Ob2; a comparative study between ‘with SOM’ and ‘without SOM’ based operators. Here, (a), (b), (c), and (d) correspond to SH, UK, TH and HB datasets, respectively...... 156 7.6 Generation objective function values using MOOTweetSumm (Without SOM, Ob1+ Ob2). Here, (a), (b), (c) and (d) correspond to SH, UK, TH and HB datasets, respectively...... 157 7.7 Box plots in sub-figures (a), (b), (c) and (d) for SH, UK, TH and HB datasets, respectively, show the variations of average Rouge-2/Rouge-L values of highest ranked (rank-1) solutions of each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions...... 160 7.8 Pareto optimal fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘With SOM’ version...... 161 7.9 Pareto fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘Without SOM’ version...... 162


8.1 Image available with the tweet-text during earthquake in Mexico...... 170 8.2 An example of dense captioning model taken from https://cs.stanford.edu/ people/karpathy/densecap ...... 172 8.3 Representation of a solution...... 176 8.4 Box plots in sub-figures (a), (b) and (c) for Harvey, Srilanka and Irma disaster events, respectively. These figures illustrate the range of Rouge-L values using different sets of objective functions...... 182 8.5 Maximum ROUGE scores per generation attained by MMTweetSumm over Har- vey dataset...... 184 8.6 Maximum ROUGE scores per generation attained by MMTweetSumm over Irma dataset...... 184 8.7 Number of new good solutions per generation by MMTweetSumm using Harvey dataset...... 184 8.8 Number of new good solutions per generation by MMTweetSumm using Irma dataset...... 185 8.9 Informative tweets with informative images provided by annotators of CrisisMMD dataset [2]...... 185 8.10 Four informative images with same tweet...... 185 8.11 Informative tweet with its image in the form of newspaper cutting...... 186 8.12 An example of caption generation by dense-caption model...... 187 8.13 Another example of caption generation by dense-caption model...... 188

Abbreviations

MOO     Multi-objective Optimization
EA      Evolutionary Algorithm
MOEA    Multi-objective Evolutionary Algorithm
DE      Differential Evolution
MODE    Multi-objective Differential Evolution
MOBDE   Multi-objective Binary Differential Evolution
PSO     Particle Swarm Optimization
TE      Textual Entailment
ESDS    Extractive Single Document Summarization
CVI     Cluster Validity Indices
PBM     Pakhira-Bandyopadhyay-Maulik
DI      Dunn Index
SI      Silhouette Index
DB      Davies-Bouldin
XB      Xie-Beni
SOM     Self-organizing Map
NDS     Non-dominated Sorting
CDO     Crowding Distance Operator


Contents

Certificate of Approval iii

Declaration v

Certificate vii

Acknowledgement ix

Abstract xi

List of Tables xii

List of Figures xvii

List of Abbreviations xxi

1 Introduction 1 1.1 Document Clustering ...... 2 1.1.1 Objectives ...... 2 1.1.2 Problem Statement ...... 3 1.1.3 Applications ...... 3 1.2 Summarization ...... 4 1.2.1 Objectives ...... 5 1.2.2 Approaches for Summarization ...... 5 1.2.3 Problem Statement ...... 6 1.2.4 Applications ...... 6 1.3 Motivation and Objective of the Thesis ...... 7 1.4 Contributions Outline ...... 9

2 Preliminaries and Literature Review 13 2.1 Preliminaries ...... 14 2.1.1 Clustering algorithms ...... 14 2.1.2 Text Representations ...... 17 2.1.3 Distance/Similarity Measures ...... 19 2.1.4 Cluster Validity indices ...... 21 2.1.5 Self-organizing Map ...... 21 2.1.6 Textual Entailment ...... 23 2.1.7 Multi-objective Optimization (MOO) ...... 24


2.1.8 Evolutionary Algorithms (EAs) ...... 26 2.1.9 Multi-objective Evolutionary Algorithms (MOEAs) ...... 27 2.1.10 Mathematics of Genetic operators in MODE framework ...... 29 2.1.11 Number of Fitness Function Evaluations ...... 30 2.1.12 SOM as a Mating Pool Construction Tool ...... 30 2.2 Literature Survey ...... 32 2.2.1 Document Clustering ...... 33 2.2.2 Extractive Single Document Summarization (ESDS) ...... 34 2.2.3 Figure-associated Text Summarization ...... 36 2.2.4 Microblog Summarization ...... 37 2.2.5 Multi-modal Microblog Summarization ...... 38 2.3 Evaluation Measures ...... 38 2.3.1 Document Clustering ...... 39 2.3.2 Summarization ...... 39 2.4 Chapter Summary ...... 40

3 Automatic Document Clustering: Fusion of MODE and SOM 41 3.1 Introduction ...... 42 3.1.1 Overview ...... 42 3.1.2 Key-contributions ...... 43 3.2 Proposed Methodology ...... 44 3.2.1 Solution Representation and Population Initialization: ...... 44 3.2.2 SOM Training ...... 45 3.2.3 Objective Functions Used ...... 45 3.2.4 Extracting Closer Solutions using Neighborhood Relationship of SOM . . 47 3.2.5 Offspring Reproduction (New Solution Generation) ...... 47 3.2.6 Selection Operation ...... 49 3.2.7 Termination Condition ...... 50 3.2.8 Selection of a Single Solution based on User Requirement ...... 50 3.3 Experimental Setup ...... 51 3.3.1 Datasets ...... 51 3.3.2 Evaluation Measures ...... 53 3.3.3 Comparative Approaches ...... 53 3.3.4 Preprocessing ...... 54 3.3.5 Representation Schemas Used ...... 55 3.3.6 Parameter settings ...... 55 3.4 Analysis of results obtained ...... 56 3.4.1 Results on NIPS 2015 Articles ...... 56 3.4.2 Results on AAAI 2013 Articles ...... 58 3.4.3 Results on WebKB dataset ...... 60 3.4.4 Results using XLNET Language Model ...... 62 3.4.5 Statistical Significance ...... 63 3.4.6 Complexity of proposed framework ...... 63 3.5 Chapter Summary ...... 65


4 Multi-objective Clustering based Framework for Extractive Single Document Summarization 67 4.1 Introduction ...... 68 4.1.1 Overview ...... 68 4.1.2 Contributions ...... 69 4.2 Problem definition ...... 69 4.3 Proposed Method ...... 70 4.3.1 Representation of Solution and Population Initialization ...... 70 4.3.2 Assignment of Sentences to Sentence Clusters ...... 71 4.3.3 Objective Functions Used ...... 71 4.3.4 SOM Training ...... 71 4.3.5 Genetic Operators ...... 72 4.3.6 Selection of Best Solutions for Next Generation and Termination Criteria 72 4.3.7 Summary Generation ( Module) ...... 73 4.4 Experimental Setup ...... 74 4.4.1 Datasets ...... 74 4.4.2 Evaluation Measure ...... 75 4.4.3 Comparing methods ...... 75 4.4.4 Parameter setting ...... 75 4.5 Experimental Results and their Discussion ...... 78 4.5.1 Comparison with Existing algorithms ...... 78 4.5.2 Improvements obtained ...... 79 4.5.3 Analysis of Results ...... 81 4.5.4 Statistical significance t-test ...... 83 4.5.5 Complexity of the proposed framework ...... 83 4.6 Chapter Summary ...... 84

5 Extractive Single Document Summarization using Multi-objective Binary Dif- ferential Evolution 87 5.1 Introduction ...... 88 5.1.1 Overview ...... 88 5.1.2 Contributions ...... 88 5.2 Statistical Features or Objective Functions ...... 89 5.2.1 Sentence Position ...... 90 5.2.2 Similarity with Title ...... 90 5.2.3 Sentence Length ...... 90 5.2.4 Cohesion ...... 91 5.2.5 Coverage ...... 91 5.2.6 Readability Factor ...... 91 5.3 Problem Definition ...... 92 5.4 Proposed Methodology ...... 92 5.4.1 Preprocessing ...... 93 5.4.2 Representation of Solution and Population Initialization ...... 93 5.4.3 Objective Functions Used ...... 94 5.4.4 SOM Training ...... 94 5.4.5 Genetic Operators ...... 94 5.4.6 Selection of the Best |P | Solutions for Next Generation ...... 95 5.4.7 Updation of SOM Training Data ...... 95 5.4.8 Termination Condition ...... 95


5.4.9 Selection of Single Best Solution and Generation of Summary ...... 95 5.5 Experimental Setup ...... 97 5.5.1 Datasets ...... 98 5.5.2 Evaluation Measure ...... 98 5.5.3 Comparing Methods ...... 98 5.5.4 Parameter Settings ...... 99 5.6 Experimental Results ...... 99 5.6.1 Discussion of Results Obtained using Normalized Google Distance (NGD) 100 5.6.2 Discussion of Results Obtained using Cosine Similarity (CS) ...... 100 5.6.3 Discussion of Results Obtained using Word Mover Distance (WMD) . . . 103 5.6.4 Study on Different Methods of Selecting a Single Best Solution from Final Pareto Front ...... 103 5.6.5 Convergence Plots ...... 105 5.6.6 Improvements Obtained ...... 107 5.6.7 Error-analysis ...... 108 5.6.8 Study on Effectiveness of SOM based Operators on DUC2001 and DUC2002 datasets ...... 109 5.6.9 Statistical Significance t-test ...... 112 5.6.10 Complexity Analysis of the Proposed Approach ...... 114 5.7 Conclusive Remarks ...... 115

6 Textual Entailment based Figure Summarization for Biomedical Articles 117 6.1 Introduction ...... 118 6.1.1 Overview ...... 118 6.1.2 Contributions ...... 120 6.2 Problem Definition ...... 121 6.3 Proposed Approach ...... 123 6.3.1 Pre-processing ...... 124 6.3.2 Population Initialization and Solution Representation ...... 124 6.3.3 Calculation of Objectives Functions ...... 125 6.3.4 Genetic Operators ...... 125 6.3.5 Selection of Best |P | Solutions for Next Generation ...... 128 6.3.6 Termination Condition ...... 128 6.3.7 Selection of Single Best Solution and Generation of Summary ...... 128 6.4 Experimental Setup ...... 128 6.4.1 Datasets ...... 129 6.4.2 Evaluation Measures ...... 129 6.4.3 Experimental Settings ...... 130 6.4.4 Comparative Methods ...... 130 6.5 Results and Discussion ...... 131 6.5.1 Comparison with Existing Unsupervised Methods ...... 133 6.5.2 Pareto fronts obtained ...... 134 6.5.3 An Example of Summary Obtained ...... 135 6.5.4 Error Analysis ...... 137 6.5.5 Box-plots ...... 137 6.5.6 Statistical Significance of Results ...... 139 6.5.7 Complexity Analysis of the Proposed Approach ...... 139 6.6 Conclusive Remarks ...... 140


7 Multi-objective Based Approach for Microblog Summarization 143 7.1 Introduction ...... 144 7.1.1 Overview ...... 144 7.1.2 Contribution ...... 146 7.2 Problem Definition ...... 147 7.3 Proposed Methodology ...... 148 7.3.1 Representation of Solution and Population Initialization ...... 149 7.3.2 Objective Functions Used ...... 149 7.3.3 SOM Training ...... 149 7.3.4 Genetic Operators ...... 149 7.3.5 Selection of Best |P | Solutions for Next Generation ...... 150 7.3.6 Updating SOM Training Data and Termination Condition ...... 150 7.3.7 Selection of Single Best Solution and Generation of Summary ...... 150 7.4 Experimental Setup ...... 151 7.4.1 Datasets ...... 152 7.4.2 Comparative Methods ...... 153 7.4.3 Evaluation Measure ...... 153 7.4.4 Parameters Used ...... 153 7.5 Discussion of Results ...... 154 7.5.1 Discussion of results obtained using SBest selection method ...... 154 7.5.2 Discussion of results obtained using UBest selection method ...... 156 7.5.3 Comparative Analysis ...... 159 7.5.4 Quality of Summaries for Different Solutions ...... 159 7.5.5 Pareto Fronts Obtained ...... 160 7.5.6 Sensitivity Analysis on the Parameters Used ...... 161 7.5.7 Statistical significance test ...... 163 7.6 An Application to Multi-document Summarization ...... 163 7.6.1 Comparative Approaches and Differences with Our Approach ...... 163 7.6.2 Results Obtained ...... 164 7.7 Conclusive Remarks ...... 164

8 Multi-modal Microblog Summarization 169 8.1 Introduction ...... 170 8.1.1 Overview ...... 170 8.1.2 Major Contributions ...... 172 8.2 Tweet-scoring Functions ...... 173 8.3 Dataset Creation ...... 174 8.4 Problem Statement ...... 176 8.5 Proposed Methodology ...... 177 8.5.1 Population and Parameter Initialization ...... 177 8.5.2 Objective Functions Calculation ...... 177 8.5.3 Grouping of Similar Solutions ...... 179 8.5.4 New Solution Generation ...... 179 8.5.5 Selection of Top Best Solutions ...... 179 8.5.6 Update Survival Length and Mating Restriction Probability ...... 179 8.5.7 Selection of Single best Solution ...... 180 8.6 Experimental Setup ...... 180 8.6.1 Evaluation Measure ...... 181 8.6.2 Parameters Used ...... 181


8.6.3 Comparative Approaches ...... 181 8.7 Discussion of Results ...... 181 8.7.1 Box-plots showing qualities of summaries corresponding to different solutions181 8.7.2 Comparison among MMTweetSumm and MOOTweetSumm ...... 183 8.7.3 Comparison of MOOTweetSumm with Existing Methods ...... 186 8.7.4 Error-analysis ...... 188 8.7.5 Statistical t-test ...... 189 8.8 Conclusive Remarks ...... 189

9 Conclusions and Future Works 191 9.1 Conclusions ...... 192 9.2 Suggestions for Further Work ...... 194

References 197

List of Publications 207

CHAPTER 1

Introduction

This chapter provides a brief introduction to document clustering and summarization. It then presents the scope of the thesis and concludes with the contributions of the thesis.


Over the past two decades, the fast growth of computer and information technology has fundamentally altered every discipline in science and engineering, transforming many areas from data-poor to increasingly data-rich. A vast amount of new information and data is generated every day through academic, social-network, and web-based interactions, and it has significant potential economic and societal value. Furthermore, these data keep growing exponentially over time with the arrival of new data from multiple sources. This has led to a surge of interest in the text-mining community to extract relevant information from the available data. This dissertation presents investigations in two fields of text mining: document clustering and summarization.

1.1 Document Clustering

Document clustering [1, 3] is defined as the partitioning of a given collection of documents into various K groups/clusters. For example, in Figure 1.1, a set of text documents is partitioned into three categories: sports, technology and education. For clustering, the value of K may or may not be known a priori. Clustering can also be referred to as an unsupervised classification technique because it does not utilize any labelled data. This is different from other classification models, like supervised and semi-supervised ones.


Figure 1.1: A set of documents clustered into three categories.

1.1.1 Objectives

There exist many clustering algorithms in the literature, two popular ones being K-means [4] and K-medoid [4], among others. For any clustering algorithm, two primary goals/objectives must be satisfied:

• high compactness within clusters (low intra-cluster distance)



Figure 1.2: Similarity finding between two living objects: (a) Tiger; (b) Cat.

• maximum separation between clusters (high inter-cluster distance)

Here, high compactness means that the distance between points belonging to the same cluster (or the distance of points from their cluster centres) should be small. Maximum separation, on the other hand, means that the distance between clusters should be high. These notions are illustrated in Figure 1.1.
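
To make these two notions concrete, the following minimal sketch (an illustration with assumed toy data and variable names, not part of the proposed method) computes a simple compactness value, the mean distance of points from their own cluster centre, and a separation value, the minimum distance between cluster centres, for a small two-dimensional partitioning:

    import numpy as np

    # Toy 2-D data already partitioned into two clusters (illustrative values only).
    clusters = {
        0: np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9]]),
        1: np.array([[5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]),
    }
    centres = {k: pts.mean(axis=0) for k, pts in clusters.items()}

    # Compactness: average distance of points from their own cluster centre (lower is better).
    compactness = np.mean([np.linalg.norm(pts - centres[k], axis=1).mean()
                           for k, pts in clusters.items()])

    # Separation: minimum distance between any pair of cluster centres (higher is better).
    keys = list(centres)
    separation = min(np.linalg.norm(centres[a] - centres[b])
                     for i, a in enumerate(keys) for b in keys[i + 1:])

    print(f"compactness = {compactness:.3f}, separation = {separation:.3f}")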

1.1.2 Problem Statement

If {D1, D2, ..., DN} is a set of N documents, then the task is to find K document clusters, {C1, C2, ..., CK}, which satisfy the following conditions:

1. Ci = {D1^i, D2^i, ..., D_{npi}^i}, where npi is the number of documents in cluster i and Dj^i is the jth document of cluster i.

2. |C1 ∪ C2 ∪ ... ∪ CK| = N and Ci ∩ Cj = ∅ for all i ≠ j.

In our daily life, we often try to determine the similarity between various objects. For example, in Figure 1.2, two living objects are shown along with some similarity values. Here, ‘similarity’ means likeness in terms of features. Similarly, the similarity between two documents can be defined in terms of (a) the number of overlapping words, and (b) semantic similarity, among others. Based on this similarity, the documents are partitioned into various groups.
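
As a small illustration of the word-overlap view of document similarity (the example below assumes the scikit-learn library and toy documents that are not part of the thesis), documents can be mapped to tf-idf vectors and compared with cosine similarity; representations and measures of this kind are discussed in detail in Chapter 2:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the team won the cricket match",                # sports-like document
        "the team celebrated after winning the match",   # sports-like document
        "new processors speed up deep learning models",  # technology-like document
    ]

    # Map each document to a tf-idf vector and compute pairwise cosine similarities.
    tfidf = TfidfVectorizer().fit_transform(docs)
    similarity = cosine_similarity(tfidf)

    # Documents 0 and 1 share several words, so their similarity is much higher
    # than their similarity with document 2.
    print(similarity.round(2))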

1.1.3 Applications

In real life, there are many applications in which document clustering can be utilized. Below we mention some of them:

• Scope detection of journals/conferences: In any academic peer-review system, the process starts at the editor’s desk, whose job is to identify whether the submitted paper is within the scope of the journal/conference or not [5]. Scientific document clustering can help in identifying different sub-topics/sub-themes covered by the journal. The similarity between a newly submitted document and the scientific document clusters already extracted can help the editor in making an appropriate decision.

• Document classification: Using the document clusters obtained after scientific document clustering, a new set of documents can be classified. For example, in Figure 1.1, clusters of three categories are shown. If a new document has a similarity greater than a threshold with respect to the sports cluster, then the new document will be assigned to the sports category.

• Search-results Optimization: While feeding a query to a search engine, a user gets an enormous number of web pages, but not all of the pages are relevant, and the user may tire of searching for the relevant ones. In this scenario, providing a set of clusters of web documents/snippets may help the user in judging the appropriateness of each cluster with respect to the given query.

• Text Document Summarization: Document clustering plays a vital role in document summarization. After obtaining the clusters from scientific documents, top-scoring sentences, ranked using various measures such as the position of the sentence in the document, among other features, can be extracted as a part of the summary.

Other applications of document clustering can be in novelty detection, recommendation systems, topic modelling, and organizing an extensive collection of documents in a library, among others.

1.2 Summarization

Summarization [6, 7] focuses on shortening a given text while maintaining the essential meaning or content of the information. As per Eduard Hovy [8], a summary can be defined as: ‘a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s).’ The rapid increase in text-based data and the need to condense information have brought attention to developing automatic summarization techniques. Over the last decade, automatic text summarization has been one of the principal and most challenging problems in Natural Language Processing [9].


1.2.1 Objectives

The overall goal of any summarization system is to (a) compress the available data; (b) cover the central theme of the data; (c) save the time of the user; (d) increase active reading; and (e) help in decision making, among other goals. There exist many applications of summarization that cut down the time and cognitive effort a user would otherwise spend going through the entire available data.

1.2.2 Approaches for Summarization

Summarization includes two types of approaches: extractive and abstractive. Extractive summarization [7, 10] involves extracting relevant passages/sentences/paragraphs from a given corpus and then merging them to generate a summary. In abstractive summarization [11], on the other hand, the task is not only to identify the salient passages/sentences/paragraphs but also to reconstruct the text; thus, knowledge of natural language understanding and generation is required. In Figure 1.3, an example of extractive vs. abstractive summarization is shown.


Figure 1.3: An example of extractive and abstractive summarization. In extractive, coloured lines form the summary. In abstractive, using the coloured words in the document, a new sentence is constructed.


1.2.3 Problem Statement

Consider a document/event D consisting of N sentences, {s1, s2, ..., sN}. Then, our main task is to find a subset of sentences, S ⊆ D, such that

∑_{si ∈ S} li ≤ Smax                                    (1.1)

where S represents the subset of sentences that captures the central theme/topic of the document, i.e., it covers the relevant information from the document while reducing redundancy in the summary; si is a sentence belonging to S; li measures the length of the ith sentence in terms of the number of words; and Smax is the maximum number of words allowed in the generated summary.
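
As a deliberately simplified illustration of the length constraint in Eq. (1.1), the sketch below greedily adds the highest-scoring sentences to the summary as long as the word budget Smax is not exceeded; the example sentences and scores are assumptions made only for this illustration, not the multi-objective formulation developed later in the thesis:

    def greedy_extractive_summary(sentences, scores, s_max):
        """Pick sentences in decreasing score order while the total number
        of words stays within the budget s_max (cf. Eq. (1.1))."""
        selected, used_words = [], 0
        for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
            length = len(sentences[idx].split())
            if used_words + length <= s_max:
                selected.append(idx)
                used_words += length
        # Restore the original document order so the summary reads naturally.
        return [sentences[i] for i in sorted(selected)]

    sentences = [
        "The flood damaged several bridges in the northern district.",
        "Rescue teams evacuated more than two thousand residents overnight.",
        "Local officials will hold a press briefing tomorrow.",
    ]
    scores = [0.9, 0.8, 0.3]  # placeholder relevance scores
    print(greedy_extractive_summary(sentences, scores, s_max=20))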

1.2.4 Applications

Some examples are web-page summarization [12], bug-report summarization [13], single/multiple document summarization [14, 15], entity timelines summarization (given a timeline and an entity, the task is to generate a summary of memorable events involving this entity) [16], scientific document summarization [17], email-summarization [18], personal assistance summarization [19], figure-associated text summarization [20], microblog summarization [21], financial documents summarization (like earning reports and financial news which can help analysts quickly derive market signals from content) [22], among others. Below, some of these paradigms with the corresponding advantages are described as they are explored in this thesis:

Single/multi-document Summarization: Given a single document or a set of multiple documents, the task is to create a compressed version of the text document(s) that is concise, relevant, non-redundant and representative of the main idea of the text.

Microblog Summarization: Nowadays, social networking sites such as Twitter have become a leading source for gathering real-time information on ongoing events such as political issues, human-made and natural disasters, and other kinds of important events. The literature [23, 24] has shown the importance of accessing microblogging sites for gathering information. A vast number of tweets is posted every day, which makes extracting relevant information from such data a cumbersome process. Moreover, it has been observed that ordinary people stay connected through microblogging sites when natural disasters occur, and much useful information can be extracted from such tweets, which can further help the government in managing the situation. Therefore, this task aims to select relevant tweets automatically based on various tweet-scoring features.

Figure-associated Text Summarization: Biomedical literature incorporates millions of figures that are especially useful for researchers to validate their research findings. These figures are often difficult to interpret for humans as well as machines. According to Futrelle [25], 50% of the text in biomedical articles is related to figures. Moreover, as per [26], the caption of a figure together with the title and abstract of the article conveys only 30% of the information related to the figure. Therefore, this task aims at summarizing figures in biomedical articles using the associated texts.

1.3 Motivation and Objective of the Thesis

In this thesis, we investigate two areas of text mining: document clustering and summarization. These tasks have a wide variety of real-life applications, as discussed in Sections 1.1.3 and 1.2.4. Within summarization, multiple facets of extractive summarization, like single document summarization, figure-summarization, microblog summarization, multi-document summarization, and multi-modal microblog summarization, were explored. For each task, different benchmark datasets were utilized. In the literature, a significant amount of work has been carried out to solve these tasks, but the performance achieved was not up to the mark. This motivated us to develop more sophisticated algorithms to improve the performance of such systems.

In this direction, the concept of multi-objective optimization (MOO) [27], which is an important paradigm used in daily-life scenarios, seems to be useful. MOO can optimize more than one objective function simultaneously, depending on the task. For example, while purchasing a car, a customer may consider minimizing the cost and maximizing the comfort as his/her optimization criteria, and to satisfy these objectives, MOO provides many options. Similarly, for a document clustering task, it is quite natural that a given dataset may contain clusters of different shapes (e.g. hyper-spherical [1], convex [28]), which are difficult to determine using conventional clustering algorithms like K-means [4]. Moreover, the number of clusters may not be known beforehand. To detect the appropriate number of clusters and to discover clusters of different shapes automatically, MOO is well suited. After conducting a thorough literature survey on existing document clustering techniques, we arrived at the following conclusions:

• Existing approaches consider a fixed number of clusters, even though the appropriate number may not be known beforehand.

• In the existing multi-objective clustering algorithms, reproduction operators like roulette wheel selection and tournament selection [29], popularly used in single-objective optimization frameworks, are usually employed to generate new solutions. However, newly designed self-organizing map [30] based genetic operators have never been explored in fusion with multi-objective clustering techniques.

Similarly, after conducting a thorough literature survey on single document summarization, figure summarization, microblog summarization, and multi-modal microblog summarization, we identified the following:

• Existing works on single-document summarization considered the weighted sum of objective functions as their optimization criterion and demonstrated that their results are better than state-of-the-art results. However, combining the values of different objective functions using weighted criteria into a single value may not be meaningful [31].

• A recent technique for microblog summarization uses an ensemble approach, which generates a summary after considering the summaries produced by various algorithms like LexRank [32] and TextRank [33], among others, as discussed in [34]. Nevertheless, in real time, the application of the ensemble approach for summarizing tweets is time-consuming because we must first generate the summaries using the different algorithms and only then produce the final summary from these individual summaries.

• For the figure-summarization task, only one or two sentence-scoring measures (or objective functions) have been used: the similarity of sentences in the article with the figure’s caption, and the sentences referring to that figure, are considered to generate the summary. However, other objective functions, like whether or not a sentence in the article entails the figure’s caption and the number of overlapping N-grams, among others, could improve the quality of the summary. Existing works consider only syntactic similarity, rather than semantic similarity, to measure the similarity between sentences. Moreover, textual entailment, which is a challenging problem in the field of natural language processing, has never been utilized in the literature for calculating the anti-redundancy objective function.

• Currently, users post tweets containing multi-media content like images along with the tweet-text. These images may convey important information for the microblog summarization task, as everything cannot be described by the tweet-text alone due to length limitations. Only a few works exist in this area. Moreover, no work has explored the image dense-captioning model [35] to extract textual features from images for this task.


Beyond this, in the literature there exists no MOO-based framework for summarizing figures, microblogs and multi-modal microblogs that simultaneously optimizes different aspects of the summary. All the above limitations of existing works are the primary motivations for writing this thesis. In the current thesis, we have developed different optimization-based frameworks for solving the above-mentioned problems: all of these tasks can be posed as optimization problems and then solved using MOO-based frameworks. Note that in all works, multi-objective differential evolution (MODE) [36] is utilized as the underlying optimization strategy, and a SOM-based genetic operator is also explored in fusion with MODE for solving all the tasks (excluding figure-summarization). MODE is a population-based multi-objective evolutionary algorithm (MOEA) [37] inspired by the principles of biological evolution. An MOEA starts from a set of candidate solutions and explores the search space by optimizing various objective functions following an evolutionary procedure. Note that in the literature there exist other optimization strategies, like AMOSA [38], particle swarm optimization (PSO) [39, 40], and NSGA-II [29], but MODE has a faster convergence rate and an efficient global search capability for solving different real-life application problems [36].
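
Since a MOO framework compares candidate solutions through Pareto dominance rather than through a single scalar score, the following minimal sketch (an illustrative assumption, not the MODE machinery used in the later chapters) shows how the non-dominated solutions of a population can be identified when all objectives are to be maximized:

    def dominates(a, b):
        """True if solution a dominates b: a is at least as good in every
        objective and strictly better in at least one (all maximized)."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(solutions):
        """Return the solutions that are not dominated by any other solution."""
        return [s for s in solutions
                if not any(dominates(o, s) for o in solutions if o is not s)]

    # Each tuple holds two objective values, e.g. (cohesion, readability).
    candidates = [(0.6, 0.4), (0.5, 0.9), (0.7, 0.3), (0.4, 0.4)]
    # Keeps (0.6, 0.4), (0.5, 0.9) and (0.7, 0.3); drops the dominated (0.4, 0.4).
    print(pareto_front(candidates))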

1.4 Contributions Outline

• In Chapter 3, a bio-inspired multi-objective automatic document clustering technique is proposed, utilizing multi-objective differential evolution as the underlying optimization strategy. A variable number of cluster centres is encoded in different solutions of the population to determine the number of clusters in a data set in an automated way. These solutions undergo various genetic operations during evolution. Some neural network (self-organizing map) based operators are also incorporated in the proposed framework to determine the optimal clustering solution. To measure the goodness of a clustering solution, two cluster validity indices, namely the PBM and Silhouette indices, are optimized simultaneously (an illustrative sketch of computing the Silhouette index is given after this contributions list). Different representation schemes, including tf [41], tf-idf [42] and word embeddings [43, 44], are employed to convert articles into vector form. The effectiveness of the proposed approach is shown for the automatic clustering of three text datasets related to scientific articles and web documents.

• In Chapter 4, we propose a multi-objective clustering based extractive single document summarization technique. Firstly, the clustering of sentences is performed, and the qualities of clustering solutions are optimized using the framework developed in Chapter 3. In the second phase, sentences present in the optimized clusters are ranked using the weighted sum of various sentence-scoring features. After that, higher-ranked sentences are selected from each cluster to form a summary, and this process continues until the summary length constraint is satisfied. For evaluation, two standard summarization datasets, namely DUC2001 and DUC2002, are utilized. The obtained results show that our model is better than various existing supervised and unsupervised systems.

• In Chapter 5, a MOO-based system for single document summarization is proposed that aims to select a subset of sentences by simultaneously optimizing different quality/objective functions. These are the position of the sentence in the document, the similarity of a sentence with the title, the length of the sentence, cohesion, coverage, and readability. In any text-based summarization system, readability is an essential factor, as the generated summary should be readable to end-users; therefore, in our approach, the readability feature is considered as the sixth objective function. All these objective functions are maximized simultaneously using a multi-objective optimization framework. To measure the similarity or dissimilarity between sentences, different existing distance measures like word mover distance, cosine distance and normalized Google distance are explored. It is also shown that the best performance depends not only on the objective functions used but also on the correct choice of the similarity/dissimilarity measure between sentences. After application of any MOEA, a set of optimized solutions is generated, and our system is also based on an MOEA; therefore, to select the best solution (representing the best summary), various unsupervised methods are explored. The efficacy of our proposed approach is shown on three datasets, DUC2001, DUC2002, and CNN. The results clearly illustrate that our approach performs better than state-of-the-art techniques.

[Figure: thesis organization — Chapter 1: Introduction; Chapter 2: Preliminaries and Literature Review; Chapter 3: Automatic Multi-objective Document Clustering; Chapter 4: Multi-objective Clustering based Single Document Summarization; Chapter 5: Single Document Summarization as a Binary Optimization Problem; Chapter 6: Multi-objective based Figure-associated Text Summarization; Chapter 7: Multi-objective based Microblog Summarization; Chapter 8: Multi-modal Microblog Summarization; Chapter 9: Conclusion and Future Scope.]

Figure 1.4: Layout of the thesis.


• In Chapter 6, a novel unsupervised approach (FigSum++) for automatic figure summarization in biomedical scientific articles is proposed using a multi-objective evolutionary algorithm. It simultaneously optimizes various objective functions based on syntactic and semantic similarity with the figure’s caption, like the similarity between sentences and the figure’s caption, the number of overlapping words between sentences and the figure’s caption, etc. For conducting an efficient search, i.e., to reach the globally optimal solution, an ensemble of two different differential evolution variants is used in the proposed framework. To represent the sentences of the article in the form of numeric vectors, the recently proposed BioBERT [45], a language model pre-trained on biomedical text, is utilized and is further used in calculating the anti-redundancy measure. A textual entailment-based measure is also proposed to avoid redundancy in the summary. An ablation study has also been presented to determine the importance of the different objective functions. For the evaluation of the proposed technique, two benchmark biomedical datasets, namely, FigSumGS1 and FigSumGS2, are considered, and the performance is compared with several supervised and unsupervised systems.

• In Chapter 7, a microblog/tweet summarization technique (MOOTweetSumm) is proposed using the concepts of multi-objective optimization (MOO). Several tweet scoring features/objective functions, like the length of the tweet [21] and the tf-idf score of the tweet [21], are simultaneously optimized using the multi-objective binary differential evolution algorithm (MOBDE) [46], which is an evolutionary algorithm (EA). For evaluation, four benchmark datasets related to disaster events are used, and the results obtained are compared with various state-of-the-art techniques. Finally, an extension of the proposed approach to solve the multi-document summarization task is also illustrated.

• In Chapter 8, a MOO-based framework for microblog summarization is developed by considering the multi-media content (images) along with the tweet text, as opposed to the work done in Chapter 7. Two more objective functions, the BM25 score (a ranking function designed for short texts) [46] and the re-tweet score, are explored in addition to the objective functions discussed in the previous chapter.

The layout of the thesis is shown in Figure 1.4. In the next chapter, we provide brief explanations of background knowledge like clustering algorithms, cluster validity indices, the self-organizing map (SOM), and multi-objective evolutionary algorithms (MOEAs), among others. These are followed by a literature review of existing approaches developed for document clustering and various summarization tasks.


CHAPTER 2

Preliminaries and Literature Review

In this chapter, we discuss the preliminaries which form the primary basis of the other chapters of the thesis. These include clustering algorithms, the self-organizing map, cluster validity indices, the word2vec model, textual entailment, word mover distance, normalized Google distance, cosine similarity, differential evolution and evaluation measures. In addition, we also provide a brief literature survey on document clustering, single document summarization, figure-associated text summarization, microblog summarization and multi-modal microblog summarization.


2.1 Preliminaries

2.1.1 Clustering algorithms

K-means

K-means [47] is a well known unsupervised clustering algorithm in which the given dataset is partitioned into K clusters using a minimum center-distance criterion. It assumes that the number of clusters (K) is known a priori. The principle behind the K-means clustering technique is to minimize the following squared error function:

\sum_{j=1}^{K} \sum_{i=1}^{n} \| x_i^{(j)} - \mu_j \|^2

where K is the number of clusters and \| x_i^{(j)} - \mu_j \|^2 is the squared Euclidean distance between a data point x_i^{(j)} and the cluster center \mu_j.

[Figure: six panels illustrating K-means with K = 2 — randomly select 2 cluster centers, assign each point to the closest center, update the cluster centers, re-assign the data points based on minimum distance, and loop until the centers stop changing.]

Figure 2.1: Working model of the K-means clustering algorithm.

The basic steps of K-means are as follows:

1. Select the number of clusters (K).

2. Initialize the cluster centers by randomly picking K sample points from the data set: \mu_i = \text{random sample from the data set}, \; i = 1, 2, \ldots, K.

3. Generate partitioning of the data using nearest center based criterion as given below:

C_i = \{ x_j : d(x_j, \mu_i) \leq d(x_j, \mu_l), \; j = 1, 2, \ldots, n, \; l = 1, \ldots, K, \; l \neq i \}, \quad \forall i \in \{1, \ldots, K\}


where C_i is the ith cluster whose center is \mu_i, and d(x_j, \mu_i) denotes the Euclidean distance between the data sample x_j and the cluster center \mu_i.

4. Update the cluster centers using the following equation: \mu_i' = \left( \sum_{x_j \in C_i} x_j \right) / |C_i|, \quad \forall i \in \{1, \ldots, K\}

where, |Ci| is the number of data points in cluster Ci.

5. If the algorithm has converged, i.e., \mu_i' = \mu_i \; \forall i \in \{1, \ldots, K\}, then stop; otherwise go to step 3.

The working model of K-means clustering algorithm is shown in Figure 2.1.
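To make the above steps concrete, a minimal NumPy sketch is given below. It is an illustrative implementation of the listed steps (the data matrix X, the value of K and the iteration cap are assumptions), not the exact code used in this thesis.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means following steps 1-5 above; X is an (n, d) data matrix."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]          # step 2: random initial centers
    for _ in range(max_iter):
        # step 3: assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(K)])
        if np.allclose(new_centers, centers):                   # step 5: convergence check
            break
        centers = new_centers
    return labels, centers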

K-medoid

The K-medoid clustering algorithm [48] is closely related to K-means. Both algorithms partition the dataset into K groups, with the following differences:

• The main objective of K−means is to minimize the sum of squared errors between points in a cluster, while K-medoid minimizes the sum of dissimilarities between points in a cluster with respect to the medoid (center) in that cluster.

• In K-means, cluster center is obtained by averaging the data points belonging to the same cluster, while, in K-medoid, cluster center is chosen among the data points in that cluster, i.e., cluster center will be that data point whose average dissimilarity to all the data points in the same cluster is minimal.

[Figure: six panels illustrating K-medoid with K = 2 — randomly select 2 initial medoids, assign each point to the closest medoid, compute the cost of swapping a medoid with a non-medoid, and keep the swap whenever it reduces the total cost (e.g., 20 → 18 → 16 → 10), looping until no improving swap remains.]

Figure 2.2: Working model of the K-medoid clustering algorithm.


The most common realization of K-medoid clustering is the Partitioning Around Medoids (PAM) [49] algorithm and it is described as follows:

1. Initialize the value of K and select K data points randomly from the dataset as the initial medoids.

2. Assign each data point to the closest medoid.

3. For each medoid m and each data point p associated with m: (a) swap m and p; (b) compute the average dissimilarity of p to all the data points associated with m. The point yielding the lowest cost is kept as the new medoid.

4. Repeat steps 2 and 3 until there is no change in the medoids.

The working model of the K-medoid clustering algorithm is shown in Figure 2.2.
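A rough sketch of the PAM procedure described above is shown below, assuming a precomputed pairwise distance matrix D; the variable names and the simple swap strategy are illustrative, not the thesis implementation.

import numpy as np

def pam(D, K, max_iter=50, seed=0):
    """Partitioning Around Medoids on a precomputed (n, n) distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(D.shape[0], K, replace=False))   # step 1: random initial medoids
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)              # step 2: assign to the closest medoid
        changed = False
        for i in range(K):                                     # step 3: try to improve each medoid
            members = np.where(labels == i)[0]
            costs = D[np.ix_(members, members)].sum(axis=0)    # total dissimilarity of each candidate
            best = members[costs.argmin()]
            if best != medoids[i]:
                medoids[i], changed = best, True
        if not changed:                                        # step 4: stop when the medoids are stable
            break
    return np.argmin(D[:, medoids], axis=1), medoids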

Single-linkage

Single-linkage clustering [50] is a type of hierarchical clustering technique proposed in 1967, whose objective is to build a hierarchy of clusters. It usually has the following steps:

1. Initialize each data point as an atomic cluster.

2. Calculate the distances (similarities) between all clusters.

3. Merge two clusters that are closest to each other based on the shortest distance (most similar).

4. Return to step 2 until there is only a single cluster or a predetermined number of clusters.

In Figure 2.3, the working model of single-linkage clustering is shown. Here, the bullets are the data points, and the letters a to f show the cluster formations in increasing order based on the shortest distance between two points belonging to different clusters.


Figure 2.3: Working model of single-linkage clustering algorithm.
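For completeness, the same bottom-up merging can be reproduced with SciPy's hierarchical clustering using the "single" linkage criterion; the toy points below are only illustrative, not the data of Figure 2.3.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.4, 1.1], [3.0, 2.0], [3.3, 2.2], [6.0, 5.0], [6.4, 5.2]])
Z = linkage(points, method="single")             # repeatedly merge the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 clusters
print(labels)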


2.1.2 Text Representations

Before applying any machine learning tool, there is a need to represent the text in the form of a numeric vector. There exist various syntactic and semantic schemes in the literature for the representation of text. Syntactic representations include tf (a bag-of-words model using 1-grams) [41] and tf-idf [42], while semantic representations include word2vec [43, 51, 52] and GloVe [44]. These representations are briefly described below.

Term-frequency or Term-document Count (tf): Term-document count [41] is a representation scheme in which text documents (or any objects) are expressed as real vectors, where each component denotes the number of times a particular word appears in the document (called the weight of the word). It is denoted as tf_{t,d}, the number of times term “t” appears in document “d”. Example: Let two documents contain the following texts: Doc1: John likes to watch movies. Mary likes movies too. Doc2: John likes to watch football games. Here the vocabulary comprises the list of words (excluding stop words and “.”): <John, likes, watch, movies, Mary, football, games>. Now the document vectors are represented as: Doc1: <1, 2, 1, 2, 1, 0, 0> Doc2: <1, 1, 1, 0, 0, 1, 1>
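A tiny sketch of this term-count representation for the two example documents is shown below; the tokenization is deliberately naive, and the vocabulary order is the one listed above.

from collections import Counter

vocab = ["John", "likes", "watch", "movies", "Mary", "football", "games"]
doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John likes to watch football games."

def tf_vector(text, vocab):
    counts = Counter(text.replace(".", "").split())   # naive tokenization
    return [counts[w] for w in vocab]

print(tf_vector(doc1, vocab))   # [1, 2, 1, 2, 1, 0, 0]
print(tf_vector(doc2, vocab))   # [1, 1, 1, 0, 0, 1, 1]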

Term-frequency Inverse document frequency (tf-idf): tf-idf [42] is another well known scheme for weighting the terms in a document, utilizing the concept of the vector space model [41]. After assigning a tf-idf weight to each term, the document vector “v” of a document “d” can be represented as

v_d = [w_{1d}, w_{2d}, w_{3d}, \ldots, w_{nd}]    (2.1)

where

w_{t,d} = tf_{t,d} \cdot \left( 1 + \log \frac{1 + |D|}{1 + |\{d' \in D : t \in d'\}|} \right)    (2.2)

and

• tft,d is the term frequency of term t in document d in normalized form;

• 1 + \log \frac{1 + |D|}{1 + |\{d' \in D : t \in d'\}|} is the inverse document frequency, where |D| is the total number of documents in the collection and |\{d' \in D : t \in d'\}| is the number of documents containing the term t. Here, 1 is added in the numerator and in the denominator to avoid a division-by-zero error.


Example: Consider a document consisting of 300 words in which the word cat appears 5 times. The term frequency (i.e., ’tf’) for cat is 5/300 ≈ 0.017 (using ’l1’ normalization). Now, assume that we have 20 million documents (D) and the word cat appears in two thousand (df) of them. Then, using a base-10 logarithm, idf = 1 + log(20,000,001/2,001) ≈ 1 + 4.00 = 5.00. Thus, the tf-idf weight of the term cat is 0.017 × 5.00 ≈ 0.083. Similarly, tf-idf document vectors can be generated for the example vocabulary given earlier: Doc1: <0.12, 0.24, 0.12, 0.34, 0.17, 0, 0> Doc2: <0.17, 0.17, 0.17, 0, 0, 0.24, 0.24>
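A small sketch of the weighting in Eq. (2.2) is given below, reproducing the running “cat” example; the base-10 logarithm is an assumption made to match the numbers above.

import math

def tfidf_weight(term_count, doc_len, n_docs, doc_freq):
    tf = term_count / doc_len                              # l1-normalized term frequency
    idf = 1 + math.log10((1 + n_docs) / (1 + doc_freq))    # smoothed inverse document frequency
    return tf * idf

# 5 occurrences of "cat" in a 300-word document; 20 million documents, 2,000 of which contain "cat"
print(tfidf_weight(5, 300, 20_000_000, 2_000))             # ~0.083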

Word2vec: Word2vec [43] is a model used to generate word embeddings (vector representations of words). The model was developed by Mikolov et al. and uses a two-layer neural network which takes a large corpus of text as input and generates a unique vector of several hundred dimensions for each word in the corpus. The main principle behind it is that words sharing a common context are placed near to each other in the vector space. It can easily capture syntactic relations (for example, present vs. past tense) and semantic relations between two words (for example, country/capital relationships, male/female designation) using the context words. An example is shown in Figure 2.4, where the most similar words to the word ‘sweden’, obtained using the word2vec1 model, are shown. Another example is illustrated in Figure 2.5, where word-pair relationships are shown. To get a sentence/tweet (or document) vector, we can average the word vectors present in the document/sentence. However, there exist other schemes like concatenation of word vectors, etc.

[Figure: the words with the highest cosine similarity to ‘sweden’ in the word2vec vector space — norway 0.7601, denmark 0.7155, finland 0.6200, switzerland 0.5881, belgium 0.5858, netherlands 0.5746, iceland 0.5624, estonia 0.5476.]

Figure 2.4: Most similar words for ‘sweden’ obtained using word2vec model.
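The sketch below, which assumes the publicly released GoogleNews word2vec vectors and the gensim library, illustrates the nearest-word queries of Figure 2.4 and the sentence-vector averaging mentioned above; it is an illustration rather than the thesis code.

import numpy as np
from gensim.models import KeyedVectors

# path to the pre-trained GoogleNews vectors (an assumption; adjust as needed)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# nearest words in the embedding space (the exact casing depends on the model's vocabulary)
print(wv.most_similar("sweden", topn=5))

def sentence_vector(sentence, wv):
    """Average the vectors of in-vocabulary words to get a sentence/document vector."""
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

vec = sentence_vector("John likes to watch movies", wv)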

Glove: Glove [44] provides vector representations of words, similar to word2vec. Glove learns by constructing a co-occurrence matrix (words × contexts) that counts how frequently a word appears in a context; this matrix is then reduced to a lower dimension, where each row represents a word vector.

Figure 2.5: Word-pair relationships obtained using the word2vec model [Source: Internet].

1 https://github.com/mmihaltz/word2vec-GoogleNews-vectors
2 https://www.tensorflow.org/tutorials/representation/word2vec

2.1.3 Distance/Similarity Measures

Here, we discuss three distance/similarity measures. All of them can be evaluated either between sentences (in the case of single/multi-document and figure summarization) or between tweets (in the case of microblog summarization).

Word Mover Distance: Word Mover Distance (WMD) [53, 54] calculates the dissimilarity between two texts as the amount of distance that the embedded words [43] of one text need to travel to reach the embedded words of the other text [53]. Here, a text means a sentence. To obtain the word embeddings of the different words, the word2vec [43] model is used. If two sentences contain the same embedded words, their WMD will be 0. An example of WMD calculation between two texts is illustrated in Figure 2.6.
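As an illustration (not the thesis code), gensim exposes a Word Mover's Distance helper on top of loaded word2vec vectors (it requires an optimal-transport backend to be installed); `wv` below is assumed to be the model loaded in the earlier sketch.

# Word Mover's Distance between the two example texts of Figure 2.6
s1 = "Obama speaks to the media in Illinois".lower().split()
s2 = "The president greets the press in Chicago".lower().split()
print(wv.wmdistance(s1, s2))   # smaller means more similar; 0 when the embedded word sets coincide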

Cosine Similarity: Cosine similarity [55] is a measure of similarity between two non-zero vectors, computed as the cosine of the angle between them. It can be defined as:

\cos(\theta) = \frac{\vec{V}_1 \cdot \vec{V}_2}{\| \vec{V}_1 \| \, \| \vec{V}_2 \|} = \frac{\sum_{i=1}^{n} V_{1i} V_{2i}}{\sqrt{\sum_{i=1}^{n} V_{1i}^2} \, \sqrt{\sum_{i=1}^{n} V_{2i}^2}}    (2.3)

where \vec{V}_1 and \vec{V}_2 are vectors of length n, and V_{ji} is the ith component of the jth vector, j = 1, 2. The value of this similarity lies between -1 and 1: 1 means the two vectors overlap (are exactly similar), -1 means they are opposite to each other, and 0 indicates that they are orthogonal. Note that cosine similarity requires text vectors, which can be obtained using tf [41], tf-idf [42] or word2vec/GloVe [43, 44] representations.
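A one-function NumPy sketch of Eq. (2.3), applied to the tf vectors of the earlier example, is shown below for illustration.

import numpy as np

def cosine_similarity(v1, v2):
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_similarity([1, 2, 1, 2, 1, 0, 0], [1, 1, 1, 0, 0, 1, 1]))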


[Figure: Text1 “Obama speaks to the media in Illinois” and Text2 “The president greets the press in Chicago”; the non-stop words of one text are matched to the embedded non-stop words of the other, and WMD(Text1, Text2) = a + b + c + d.]

Figure 2.6: An illustration of WMD calculation between two texts. Here, a, b, c and d denote the distances between words. The bold words are the non-stop words embedded into a word2vec space.

Normalized Google Distance: Normalized Google Distance (NGD) measures the semantic relationship between two sentences using the terms present in the sentences. It was first proposed in [55]. Two terms tend to be close to each other if they have a similar sense. It is important to note that it is a dissimilarity measure, not a distance function. The NGD between two sentences, s_i and s_j, can be defined as:

d_{NGD}(s_i, s_j) = \frac{\sum_{t_1 \in s_i} \sum_{t_2 \in s_j} NGD(t_1, t_2)}{n_{t_i} \times n_{t_j}}    (2.4)

where t_1 and t_2 are the terms belonging to sentences s_i and s_j, respectively; n_{t_i} and n_{t_j} are the numbers of terms in sentences s_i and s_j, respectively. NGD can be expressed as:

NGD(t_1, t_2) = \frac{\max\{\log(f_{t_1}), \log(f_{t_2})\} - \log(f_{t_1, t_2})}{\log N - \min\{\log(f_{t_1}), \log(f_{t_2})\}}    (2.5)

where f_{t_1} denotes the number of sentences in the document (D) containing term t_1, f_{t_2} denotes the number of sentences in the document (D) containing term t_2, f_{t_1,t_2} indicates the number of sentences in the document (D) containing both terms t_1 and t_2, and N is the number of sentences in the document. Three important properties of NGD are listed below:

1. The range of dNGD(si, sj) lies in the scale of 0 to ∞.

2. If t_1 = t_2, or if t_1 ≠ t_2 but f_{t_1} = f_{t_2} = f_{t_1,t_2} > 0, then NGD(t_1, t_2) = 0.

3. For every sentence si, dNGD(si, si)=0 .


Note that if N = 1, then we have f_{t_1} = f_{t_2} = f_{t_1,t_2}. In this case, NGD(t_1, t_2) = 0/0, which is treated as 0 by the second property of NGD.
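A sketch of Eqs. (2.4)–(2.5) is given below, treating a document as a list of tokenized sentences; the helper names and the handling of terms that never co-occur are assumptions made for illustration.

import math

def ngd(t1, t2, sentences):
    """Eq. (2.5): sentences is a list of term collections from one document."""
    N = len(sentences)
    f1 = sum(t1 in s for s in sentences)
    f2 = sum(t2 in s for s in sentences)
    f12 = sum((t1 in s) and (t2 in s) for s in sentences)
    if f12 == 0:
        return math.inf                     # terms never co-occur; NGD ranges over [0, inf)
    num = max(math.log(f1), math.log(f2)) - math.log(f12)
    if num == 0:
        return 0.0                          # covers the 0/0 case (property 2)
    den = math.log(N) - min(math.log(f1), math.log(f2))
    return num / den if den > 0 else math.inf

def d_ngd(si, sj, sentences):
    """Eq. (2.4): average NGD over all term pairs of the two sentences."""
    total = sum(ngd(t1, t2, sentences) for t1 in si for t2 in sj)
    return total / (len(si) * len(sj))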

2.1.4 Cluster Validity indices

Cluster validity indices [56] measure the quality of a partitioning obtained using a given clustering technique. These indices also help in determining the correct number of clusters from a dataset in an iterative way. Generally, there are two types of cluster validity indices:

1. External Cluster Validation Indices: These indices require external knowledge provided by the user (ground truth/original labels) to measure the goodness of the obtained partition- ing. Minkowski Scores [28], Adjusted Rand Index [57] etc. are some examples of external cluster validity indices.

2. Internal Cluster Validation Indices: These indices generally rely on the intrinsic structure of the data and do not require ground truth labels. Most of the internal validity indices measure the intra-cluster distance (compactness within clusters) and inter-cluster sepa- ration (separation between clusters). Silhouette index (SI) [58], Dunn index (DI) [59], Davies-Bouldin index (DB)[60], Xie-Beni (XB) index [28], PBM index [61] etc. are some popular internal cluster validity indices.

Out of these indices, PBM index [61], SI [58], DI [59], XB [28] and DB [60] index are used in this thesis. The formal definitions of these indices are presented in Table-2.1.
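As a usage illustration (not part of the thesis), scikit-learn already provides the Silhouette and Davies-Bouldin indices, which can be used to compare partitionings with different numbers of clusters; the data below are synthetic.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.random.rand(200, 10)                      # toy data: 200 points in 10 dimensions
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    si = silhouette_score(X, labels)             # to be maximized
    db = davies_bouldin_score(X, labels)         # to be minimized
    print(k, round(si, 3), round(db, 3))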

Algorithm 1 SOM_Framework(η_0, σ_0, S, T)
1: Initialize the learning constant η_0 and the neighborhood size σ_0; set the maximum iteration count T for SOM training; initialize each map unit by assigning it a weight vector randomly chosen from the training data S.
2: while t ≠ T do   (t is the current iteration number)
3:    Adjust the learning rate (η) and neighborhood size (σ): η = η_0 × (1 − t/T), σ = σ_0 × (1 − t/T).
4:    Randomly select a training input pattern x ∈ S.
5:    Find the winning map unit: u' = arg min_{1 ≤ u ≤ D} ‖x − w^u‖_2.
6:    Find the neighboring neurons: U = {u | 1 ≤ u ≤ D, ‖z^u − z^{u'}‖_2 < σ}.
7:    Update all neighboring neurons u ∈ U: w^u = w^u + η × exp(−‖z^u − z^{u'}‖_2) × (x − w^u).
8: return the weight vectors corresponding to the map units, w^u, u = 1, 2, ..., D.

2.1.5 Self-organizing Map

The Self Organizing Map [62, 30], or SOM, developed by Kohonen, is a type of artificial neural network which learns the data presented to it in an unsupervised way. It generates a low-dimensional output space for the given input space, which consists of high-dimensional training data.


Table 2.1: Definitions of cluster validity measures/indices. Here, K: number of clusters; N: number of data points; dist: distance function; Opt. in the last column refers to the direction of optimization.

PBM [61]
  Definition: PBM = \left( \frac{1}{K} \cdot \frac{E_1}{E_K} \cdot D_K \right)^2, where E_K = \sum_{s=1}^{K} E_s, E_s = \sum_{j=1}^{N} \mu_{sj} \, dist(x_j, c_s), E_1 = \sum_{x \in X} dist(x, c), D_K = \max_{i,j = 1, \ldots, K, \, i \neq j} \| c_i - c_j \|
  Description: E_K: total within-cluster scatter; [\mu_{sj}]_{K \times N}: membership matrix of the data; c_s: sth cluster center; c: cluster center of the whole data set; D_K: maximum separation between clusters.
  Opt.: Maximum

SI [58]
  Definition: SI = \frac{1}{N} \sum_{i=1}^{N} \frac{z_{i2} - z_{i1}}{\max(z_{i2}, z_{i1})}
  Description: z_{m1}: average distance of a point x_m in the kth cluster to the remaining points of the same cluster; z_{m2}: minimum of the average distances of the same point x_m from points belonging to other clusters.
  Opt.: Maximum

DI [59]
  Definition: DI = \frac{\min_{C_k, C_l \in \Im, \, C_k \neq C_l} \left( \min_{i \in C_k, j \in C_l} dist(i, j) \right)}{\max_{C_m \in \Im} diam(C_m)}
  Description: i and j denote data points; \Im: the partitioning produced by any clustering algorithm; C_k, C_l, C_m: different clusters; diam(C_m): the diameter of the mth cluster, calculated using the distance between two points of the same cluster.
  Opt.: Maximum

DB [60]
  Definition: DB = \frac{1}{K} \sum_{i=1}^{K} D_i, where D_i = \max_{i \neq j} R_{i,j} and R_{i,j} = \frac{S_i + S_j}{M_{i,j}}
  Description: M_{i,j}: separation between the ith and jth clusters; S_i: within-cluster scatter for cluster i.
  Opt.: Minimum

XB [58]
  Definition: XB = \frac{\sum_{k=1}^{K} \sum_{s \in c_k} dist(s, c_k)}{N \times \min_{i, i \neq j} dist(c_i, c_j)}
  Description: c_i and c_j: the ith and jth cluster centers.
  Opt.: Minimum

[Figure: an n-dimensional input sample x^p = (x_1^p, x_2^p, ..., x_n^p) from the input space is mapped to a 2-D grid of map units in the output space; map unit u has grid position z^u = (z_1^u, z_2^u) and weight vector w^u = (w_1^u, w_2^u, ..., w_n^u).]

Figure 2.7: SOM architecture. Here x^p = (x_1^p, x_2^p, ..., x_n^p) is the input vector, z_1 and z_2 denote the axes of the 2-D map, and w^u is the weight vector of the uth neuron.

Usually, the low-dimensional space (also called the output space) consists of a 2-D regular grid of neurons (but it can also be 1-D or 3-D, depending on the user). These neurons are called map units. Let S be a set of training data in n-dimensional space; then each map unit u ∈ {1, ..., D} (where D is the number of map units) has:

1. a predefined position in the output space: z^u = (z_1^u, z_2^u);

2. a weight vector w^u = [w_1^u, w_2^u, ..., w_n^u], where n is the input vector dimension and u is the index of the map unit in the 2-dimensional map.

Figure 2.7 shows the typical architecture of a SOM. In this example, the input space and output

space are n-dimensional and 2-dimensional, respectively. The main principle of SOM is to create a topographical map such that input patterns which are similar in the input space are mapped to neurons next to each other. In our work, the sequential learning algorithm [62] is utilized for the training of SOM, as shown in Algorithm 1. This algorithm returns the updated weight vectors of the different map units at the output. Before training the SOM, each neuron needs to be assigned a weight vector, randomly chosen from the available training data. At each iteration, when an input pattern is presented to the grid, all neurons compete to become the winning neuron; this process is known as winner-take-all learning. After finding the winning neuron, the weight vector of that neuron (closest to the presented input pattern) and those of the neighboring neurons are updated to move them closer to the input pattern. SOM can be broadly used in clustering, data compression, visualization, etc. A real-life example taken from the web2 is shown in Figure 2.8, where wealthy and poorer nations end up on opposite sides after being mapped to the SOM grid.

[Figure: (a) wealthy and poorer nations; (b) mapping of nations onto the SOM grid, where yellows and oranges indicate wealthy nations and purples and blues the poorer nations.]

Figure 2.8: A real-life example of SOM where wealthy nations like the USA, Canada, etc., come close to each other (left side of the SOM grid), while poorer nations like Nepal, Bangladesh, etc., lie on the opposite side of the SOM grid.
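A compact NumPy sketch of the sequential training in Algorithm 1 is given below; the grid size, learning constant and iteration count are illustrative defaults rather than the settings used in the thesis.

import numpy as np

def train_som(S, grid=(3, 3), eta0=0.5, sigma0=1.5, T=1000, seed=0):
    """Sequential SOM training (Algorithm 1). S is an (m, n) matrix of training vectors."""
    rng = np.random.default_rng(seed)
    D = grid[0] * grid[1]
    z = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], dtype=float)  # map positions
    w = S[rng.choice(len(S), D, replace=True)].astype(float)    # line 1: weights drawn from the data
    for t in range(T):
        eta = eta0 * (1 - t / T)                                 # line 3: decaying learning rate
        sigma = sigma0 * (1 - t / T)                             # line 3: shrinking neighborhood size
        x = S[rng.integers(len(S))]                              # line 4: random training pattern
        win = int(np.argmin(np.linalg.norm(w - x, axis=1)))      # line 5: winning map unit
        d = np.linalg.norm(z - z[win], axis=1)                   # grid distances to the winner
        nbr = d < max(sigma, 1e-9)                               # line 6: neighboring neurons
        w[nbr] += eta * np.exp(-d[nbr])[:, None] * (x - w[nbr])  # line 7: update rule
    return w, z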

2.1.6 Textual Entailment

Textual entailment (TE) [63] is a task in the natural language processing (NLP) domain and is an active research area [64, 65]. The definition of TE states that a sentence ‘p’ (called the hypothesis) is said to be entailed by a sentence ‘q’ (called the premise) if ‘p’ can be inferred from ‘q’ [63]. TE also describes whether the relationship between ‘p’ and ‘q’ is contradictory or neutral. An example

2 http://home.cc.umanitoba.ca/~umsidh52/PLNT7690/presentation/SOM.html

of entailment taken from the medical-domain MedNLI3 dataset is shown below: p: Patient had aphasia. q: Patient was not able to speak, but appeared to comprehend well.

where ‘p’ is entailed by ‘q’, represented as q → p. Another example, taken from Wikipedia4, is shown below, where the three relations — entailment, contradiction and neutral — between a text and a hypothesis are represented. Note that the TE relationship is unidirectional, i.e., if q → p, then it is not necessary that p → q. TE has many applications in the NLP domain, like information extraction, summarization, etc.

[Figure: premise q: “If you help the needy, God will reward you.” Three hypotheses p — Entailment: “Giving money to a poor man has good consequences.” Contradiction: “Giving money to a poor man has no consequences.” Neutral: “Giving money to a poor man will make you a better person.”]

Figure 2.9: An example of Textual Entailment.
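For illustration only, the entailment decision can be reproduced with a publicly available NLI model; the sketch below uses the Hugging Face transformers library and the `roberta-large-mnli` checkpoint, both of which are assumptions rather than the resources used in this thesis.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"                      # an assumed, publicly available NLI model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "If you help the needy, God will reward you."
hypothesis = "Giving money to a poor man has good consequences."

inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax())])   # expected label: entailment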

2.1.7 Multi-objective Optimization (MOO)

Multi-objective optimization (MOO) [27] is an important paradigm used in daily-life scenarios to take decisions. Consider an example: while purchasing a new car, a customer wants to achieve two main objectives, i.e., high comfort and low cost. But an increase in comfort may increase the cost. Hence, these are contradictory objectives and thus need to be optimized properly. Similarly, while booking a flight ticket, a customer wants to minimize two objectives, price and duration. These examples illustrate that there exist several real-life problems in which the simultaneous optimization of more than one objective is required. Such problems are referred to as Multi-objective Optimization Problems (MOOPs). Note that MOO is able to provide a set of alternative solutions to the decision maker satisfying the objectives and thus provides flexibility to the end-user to choose any solution, while in the case of single objective optimization (SOO), a single objective is optimized to get a single best solution. This phenomenon (SOO vs. MOO) is illustrated in Figure 2.10 using two mathematical functions. In this thesis, the problems of document clustering and summarization are posed as MOOPs.

3 https://physionet.org/physiotools/mimic-code/mednli/
4 https://en.wikipedia.org/wiki/Textual_entailment


[Figure: left panel (SOO) — the curve of f1(x) = x²; minimizing f1 alone gives the single best solution x = 0. Right panel (MOO) — the curves of f1(x) = x² and f2(x) = (x−2)²; f1 is minimum at x = 0 and f2 is minimum at x = 2, so both functions cannot attain their minima at a single point, and simultaneous minimization provides a set of non-dominating solutions lying between 0 and 2.]

Figure 2.10: Comparison between SOO and MOO.


In the case of document clustering, various cluster validity indices [58], measuring the goodness of a partitioning in terms of low compactness and high separability, can be simultaneously optimized. In the case of summarization, we can optimize different quality measures to obtain a good quality summary. For example, in the case of single document summarization, these quality measures can be anti-redundancy (to avoid redundancy in the summary), readability

(provide readable summary), etc.

Formal Definition:

MOO can be formulated as follows: find the vector X^* = \{\vec{x}_1^*, \vec{x}_2^*, \ldots, \vec{x}_n^*\} by simultaneously optimizing M (≥ 2) objective function values:

\{ f_1(\vec{x}), f_2(\vec{x}), \ldots, f_M(\vec{x}) \} \quad \text{such that } \vec{x} \in X^*    (2.6)

which satisfy m inequality constraints:

gi(~x) ≥ 0, i = 1, 2, . . . , m, (2.7)


and p equality constraints:

hi(~x) = 0, i = 1, 2, . . . , p (2.8)

where X^* is the set of optimal solutions, and the optimization can be maximization, minimization, or a mixture of both. These constraints define the feasible region in which the optimal solutions can lie.

Dominance and Non-dominance Criteria between Solutions

Dominance is one of the important concepts in MOO. It helps in deciding the optimality of solutions amongst the set of solutions obtained using MOO. Consider the scenario of a flight booking system where the objectives are to minimize both cost and duration. MOO will generate many solutions after optimizing these objectives, and among those solutions, a solution \vec{x}_j is said to be dominated by a solution \vec{x}_i if \forall k \in \{1, 2, \ldots, M\}, f_k(x_i) \leq f_k(x_j) and \exists k \in \{1, 2, \ldots, M\} such that f_k(x_i) < f_k(x_j). The set of solutions which are not dominated by any other solution is called the set of Pareto optimal solutions, and the surface on which they lie is called the Pareto optimal front. Note that in this set, all solutions are non-dominating to each other. For example, let MOO provide 12 solutions, numbered from 1 to 12 in Figure 2.11. Using the dominance rule, the solutions numbered from 1 to 6 are non-dominating to each other, and the remaining solutions are dominated by at least one solution among these (1 to 6). Therefore, we can call solutions 1 to 6 Pareto optimal solutions.

[Figure: 12 solutions plotted with cost (to be minimized) on the x-axis and duration (to be minimized) on the y-axis; solutions 1 to 6 lie on the Pareto optimal front, solutions 2 and 3 are non-dominating with respect to each other, and solution 3 dominates solution 8.]

Figure 2.11: Dominance and non-dominance between solutions obtained using MOO.
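A small sketch of the dominance rule for two minimization objectives is given below; the objective vectors are illustrative (cost, duration) pairs, not data from the thesis.

def dominates(fi, fj):
    """True if solution i dominates solution j: no worse in all objectives, better in at least one."""
    return all(a <= b for a, b in zip(fi, fj)) and any(a < b for a, b in zip(fi, fj))

def pareto_front(objs):
    """Indices of solutions not dominated by any other solution."""
    return [i for i, fi in enumerate(objs)
            if not any(dominates(fj, fi) for j, fj in enumerate(objs) if j != i)]

objs = [(2, 9), (3, 7), (4, 5), (6, 4), (8, 3), (9, 2), (5, 8), (7, 6)]   # (cost, duration)
print(pareto_front(objs))   # the first six points are mutually non-dominating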

2.1.8 Evolutionary Algorithms (EAs)

EAs [66] are population-based meta-heuristic optimization algorithms inspired by biological phenomena like parent selection, crossover (exchange of genes), mutation


(change in a gene value), environmental selection, etc. These operators are called genetic operators. An EA starts from a set of candidate solutions called a population. These solutions play the role of individuals in the population, and each solution is associated with a fitness value determining its quality. Over a number of generations, or until a specified time, or until good quality solutions in terms of fitness values are obtained, these solutions are evolved using the above genetic operators. Due to their population-based nature, EAs are able to provide a set of (near-)optimal global solutions in a single run at the final stage of the algorithm. The steps of the evolutionary procedure are illustrated in Figure 2.12. Some examples of EA techniques are differential evolution [67], genetic algorithms [29], particle swarm optimization [68], ant colony optimization [69], etc. All these algorithms follow the same evolutionary steps but differ in how they perform computational steps like crossover and mutation.

[Figure: a population (set of individuals/solutions with fitness values) → mating pool construction (parent selection) → crossover and mutation (crossover: exchange of genes; mutation: change in a gene value) → child population (new solutions) → the child population is merged with the parent population and compared using the fitness value(s) → the top best solutions are selected to update the population, with new solutions replacing old ones.]

Figure 2.12: The steps of evolutionary procedures.

2.1.9 Multi-objective Evolutionary Algorithms (MOEAs)

Nowadays, EAs are getting popular because (a) they can be adapted to solve any real-life problem, including clustering [3], [70], bioinformatics [71], social networks [72], etc.; and (b) they are able to provide multiple solutions after a single run. Moreover, to improve the quality of solutions, multiple fitness/objective functions can be optimized simultaneously, which is nothing but the incorporation of the MOO concept. Thus, it is appropriate to call EAs utilizing the MOO+EA concept multi-objective evolutionary algorithms (MOEAs) [73, 74]. Note that the nature of an MOEA is different from that of a single objective evolutionary algorithm (SOEA). In an SOEA [75], the task is to optimize (either maximize or minimize) a single objective (referred to as SOO) using the EA, to obtain a single optimal solution. In this thesis, the document clustering and summarization tasks are tackled using the concept of MOEA. In the literature, a

number of different MOEAs have been suggested to solve MOOPs. Some of them are NSGA-II [29], MODE [67], PSO (particle swarm optimization) [76], ant colony optimization [77], etc. Out of these, NSGA-II and MODE are briefly described here.

NSGA-II: It is a non-dominated sorting genetic algorithm proposed in [29]. It was developed to remove three drawbacks of NSGA [78]: (a) high complexity; (b) lack of elitism (preserving the good solutions found so far); (c) the need for a sharing parameter (to maintain diversity in the population). To preserve elitism and maintain diversity, it combines the old population and the new population (containing new solutions generated using genetic operators), and then the best solutions in the objective function space are identified using non-dominated sorting (NDS) and crowding distance based operators. The NDS algorithm assigns ranks to the solutions using their objective function values and puts them in different fronts based on their rankings. The crowding distance operator determines which solution in a front lies in the more crowded region. For the selection of the best solutions, solutions are selected in a rank-wise manner until the number of solutions equals the size of the population. In the case of a tie, the solution having the higher crowding distance [29] is selected. For more information, one can refer to [29].

Multi-objective Differential Evolution (MODE): Another MOEA is based on Differential Evolution (DE) [67], proposed by Storn and Price in 1995 to solve real-parameter optimization problems. In the MODE framework, for each solution (called the target vector), a new solution called the trial vector is generated using mutation, crossover and environmental selection (selection of the best solutions) operations, in sequence. In NSGA-II, mutation is performed by changing a component value of the solution; in MODE, the mutation operator instead adds the weighted difference between a pair of solutions (called vectors) to a third solution. The aim of the mutation operation is to find a search direction based on the distribution of solutions in the current population, and the solution so obtained is called the mutant vector. This mutant vector is mixed with the target vector using some crossover probability to obtain the trial vector. The obtained set of trial vectors is merged with the parent population, and then the environmental selection operator is applied to select the top solutions, which are then used for the next generation. For effective solution selection and to promote diversity and preserve elitism, the concepts of non-dominated sorting and the crowding distance operator [29] can be utilized. Thus, the process of generating new solutions using mutation and crossover, and then selecting the top solutions for the next generation, continues until a maximum number of generations is reached. It was shown in the literature that multi-objective DE (MODE) performs better than other MOEAs like PSO [39, 40], NSGA-II [29], etc., due to its faster convergence rate and efficient

global search capability for solving different real-life application problems [3]. Moreover, in recent years, a lot of work has been devoted to improving DE [79], which clearly indicates that it is more promising than other techniques. This motivates us to explore MODE for our document clustering and summarization tasks. Below we describe the mathematical equations used to perform the genetic operators.

2.1.10 Mathematics of Genetic operators in MODE framework

There exist many variants of MODE; each differs in the representation (real-coded or binary-coded) of the solution, the new solution generation strategy and the use of parameters. Let [\vec{x}^{1,t}, \vec{x}^{2,t}, \ldots, \vec{x}^{|P|,t}] be the population at generation ‘t’, where |P| is the size of the population. For each current solution, a trial vector v^{c,t+1} is generated using mutation and crossover operations. Let x^{c,t} be the current solution (target vector) at generation ‘t’ for which we want to generate a new solution.

Mating Pool Construction: Three solutions, x^{r1,t}, x^{r2,t}, and x^{r3,t}, are selected randomly from the population such that x^{r1,t} ≠ x^{r2,t} ≠ x^{r3,t} ≠ x^{c,t}.

Mutation: There exist various mutant schemes proposed in the literature [80], but, generally, DE/rand/1 is utilized. It generates mutant vector, uc,t+1, for the current solution. If solutions are real-coded then Eq. 2.9 is used to generate the mutant vector as

u_j^{c,t+1} = x_j^{r1,t} + F \times (x_j^{r2,t} - x_j^{r3,t})    (2.9)

where F is the scaling/weighting factor, generally lying in [0, 2], and x_j^{r1,t}, x_j^{r2,t} and x_j^{r3,t} are the jth components of the randomly chosen solutions at generation ‘t’. If the solutions are binary-encoded, then firstly a probability estimation operator P(x^{c,t+1}) is computed using the solutions in the mating pool and the current solution as per Eq. (2.10), and then the obtained probability vector is converted into binary space using a randomized procedure as per Eq. (2.11) to give rise to u^{c,t+1}.

P(x_j^{c,t+1}) = \frac{1}{1 + e^{- \frac{2b \times [x_j^{r1,t} + F \times (x_j^{r2,t} - x_j^{r3,t}) - 0.5]}{1 + 2F}}}    (2.10)

where P(x_j^{c,t+1}) is the probability estimation operator, (x_j^{r1,t} + F \times (x_j^{r2,t} - x_j^{r3,t}) - 0.5) is the mutation operation, and b is a real positive constant. Then the mutant vector u^{c,t+1} for the current solution x^{c,t} is generated as

u_j^{c,t+1} = \begin{cases} 1, & \text{if } rand() \leq P(x_j^{c,t+1}) \\ 0, & \text{otherwise} \end{cases}    (2.11)

where rand() is a random number between 0 and 1.

Crossover: The trial vector v^{c,t+1} is generated by performing crossover between the current solution, x^{c,t}, and the mutated solution, u^{c,t+1}, obtained in Eq. (2.9) (or Eq. (2.11) for binary-encoded solutions).

v_j^{c,t+1} = \begin{cases} u_j^{c,t+1}, & \text{if } rand() \leq CR \\ x_j^{c,t}, & \text{otherwise} \end{cases}    (2.12)

where rand() is a random number between 0 and 1, j = 1, 2, \ldots, N, N is the length of the solution, CR is the crossover probability, and x_j^{c,t} and u_j^{c,t+1} are the jth components of x^{c,t} and u^{c,t+1}, respectively. In the case of solutions having a binary representation, the multi-objective binary variant of DE is referred to as MOBDE.
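A NumPy sketch of the real-coded case — DE/rand/1 mutation (Eq. (2.9)) followed by the crossover of Eq. (2.12) — is shown below; F, CR and the toy population are illustrative choices, not the thesis settings.

import numpy as np

def de_trial_vector(pop, c, F=0.5, CR=0.9, seed=0):
    """Generate a trial vector for the current (target) solution pop[c]."""
    rng = np.random.default_rng(seed)
    candidates = [i for i in range(len(pop)) if i != c]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)   # three distinct mating-pool solutions
    mutant = pop[r1] + F * (pop[r2] - pop[r3])                   # Eq. (2.9)
    mask = rng.random(pop.shape[1]) <= CR                        # Eq. (2.12): component-wise choice
    return np.where(mask, mutant, pop[c])

pop = np.random.default_rng(1).random((10, 5))                   # toy population: 10 solutions of length 5
trial = de_trial_vector(pop, c=0)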

2.1.11 Number of Fitness Function Evaluations

Generally, for any evolutionary optimization strategy, the number of fitness function evaluations (NFE) [81] is reported, which is generally expressed as

NFE = a + b    (2.13)

where

a = |P| \times \#\text{Objective Functions Used}    (2.14)

b = t_{max} \times a    (2.15)

where t_{max}, |P| and \#\text{Objective Functions Used} are the maximum number of generations, the number of solutions in the population and the number of objective functions used, respectively. Note that while evaluating NFE, the term a appears separately because the population is initialized with an initial set of solutions before the generations begin.
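As a quick worked example of Eqs. (2.13)–(2.15), with an assumed population of 50 solutions, 2 objective functions and 25 generations:

P, n_obj, t_max = 50, 2, 25
a = P * n_obj            # Eq. (2.14): evaluations for the initial population
b = t_max * a            # Eq. (2.15): evaluations over all generations
print(a + b)             # Eq. (2.13): NFE = 2600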

2.1.12 SOM as a Mating Pool Construction Tool

In any evolutionary algorithm, the quality of the new solutions generated from the old solutions plays a vital role in reaching the globally optimal solutions. The optimal solution for any problem can lie in a local region (nearby to the existing solutions) or in the global optimal region. Therefore,

to capture the first one, a newly designed reproduction operator based on the self-organizing map (SOM) [30] is introduced in the MODE (MOBDE) framework as the mating pool construction tool. SOM is first trained using the current population to discover the localities of solutions (chromosomes), and then a mating pool of fixed size is constructed for each solution using the neighborhood relationships (nearby solutions) extracted by SOM. This set of nearby solutions forms the mating pool, Q, for the current solution; only these solutions can take part in mating to generate a new solution from the current solution. The series of steps to construct the mating pool, Q, for the current solution x_c ∈ P is described in Algorithm 2. Firstly, the winning neuron “b” for the current solution needs to be selected (Line 1). Thereafter, neighboring neurons near “b” and the corresponding mapped solutions ∈ P are extracted one by one to form the mating pool (Line 2), and this continues until the desired size of the mating pool is reached (in Figure 2.13, it is 6). The neighboring (closer) solutions present in the mating pool for the current solution can take part in the reproduction (mutation and crossover) operation to generate a new solution. This phenomenon is illustrated in Figure 2.13. The different parameters used in the algorithm are: P, the population containing solutions (x_1, x_2, ..., x_{|P|}); γ, the threshold probability for selecting a neighboring solution; D, the distance matrix formed using the position vectors of the neurons in the grid; H, the mating pool size; and x_c, the current solution for which the mating pool is generated.

Algorithm 2 Construct_MatingPool(x_c, γ, P, H, D)
1: Find the winning neuron “b” in the SOM architecture corresponding to solution x_c, based on minimum Euclidean distance.
2: Sort the bth row of D in ascending order and store the sorted indices in J. Then build

Q = \begin{cases} \cup_{m=1}^{H} \{x_k\}, & \text{if } rand() < γ \text{ and } m < H \\ P, & \text{otherwise} \end{cases}

where rand() gives a random number lying between 0 and 1, and x_k is the solution ∈ P mapped to neuron k, k ∈ J.
3: return the mating pool Q for solution x_c

Example: Let us assume that we have to generate a new solution for the current solution, x_c. Firstly, a mating pool is required to be constructed. Let the number of neurons in the SOM grid be 9, with index values {0, 1, 2, 3, 4, 5, 6, 7, 8} and position vectors {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)}, respectively. To build the mating pool, firstly the winning neuron corresponding to x_c is determined using the shortest Euclidean distance criterion; let it be the 4th neuron. Secondly, the Euclidean distances between the 4th neuron and the other neurons are calculated using the position vectors of the neurons, which are [1.41, 1, 1.41, 1, 0, 1, 1.41, 1, 1.41]


(with respect to neuron indices {0, 1, 2, 3, 4, 5, 6, 7, 8}). After that, the calculated distances are sorted in ascending order and the corresponding neuron indices are recorded, i.e., after sorting we obtain the list of distances [0, 1, 1, 1, 1, 1.41, 1.41, 1.41, 1.41] with the corresponding neuron index values J = [4, 1, 3, 5, 7, 0, 2, 6, 8]. Consider the mating pool size (H) as 4. Now a random probability “r” is generated. If “r” is less than some threshold probability, γ, then the solutions mapped to the H neurons having indices [1, 3, 5, 0] will form the mating pool. This further helps in exploitation. Note that the first neuron index in the sorted list is excluded, as it represents the winning neuron and the distance of the winning neuron to itself is always zero. If “r” is greater than the threshold probability, γ, then any solution from the population can participate in the mating pool construction. This step helps in the exploration of the search space to find the optimal solution in the global optimal region.

[Figure: the current solution maps to the winning neuron “h” of the SOM grid; solutions mapped to neurons within the neighborhood radius of “h” are the neighboring solutions that form the mating pool, while solutions mapped to neurons outside the radius are excluded.]

Figure 2.13: Mating pool construction for the current solution using SOM.
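A NumPy sketch of Algorithm 2 is given below; for simplicity it assumes one solution mapped to each SOM neuron, and the grid, γ and H values mirror the worked example above without being the thesis implementation.

import numpy as np

def construct_mating_pool(xc, population, weights, positions, H, gamma, rng):
    """Build the mating pool Q for the current solution xc (Algorithm 2)."""
    b = int(np.argmin(np.linalg.norm(weights - xc, axis=1)))        # line 1: winning neuron
    grid_dist = np.linalg.norm(positions - positions[b], axis=1)    # bth row of the distance matrix D
    J = np.argsort(grid_dist)[1:]                                   # line 2: sorted neighbors, winner excluded
    if rng.random() < gamma:
        return [population[k] for k in J[:H]]                       # exploitation: H nearest neighbors
    return list(population)                                         # exploration: whole population

rng = np.random.default_rng(0)
population = rng.random((9, 4))                                     # 9 solutions of length 4 (toy data)
positions = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)  # 3x3 SOM grid
weights = population.copy()                                         # assume one solution per neuron
Q = construct_mating_pool(population[4], population, weights, positions, H=4, gamma=0.9, rng=rng)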
2.2 Literature Survey

In this section, we provide a survey and analysis of the existing approaches for document clustering and for the summarization of single documents, figures, microblogs and multi-modal microblogs. Subsection 2.2.1 gives a brief survey of the works related to document clustering. In subsection 2.2.2, we provide the literature survey on extractive single document summarization. Subsection 2.2.3 describes existing approaches related to figure summarization. In subsection 2.2.4, some existing works on microblog summarization are described. Finally, subsection 2.2.5 presents existing works in the field of multi-modal microblog summarization.

2.2.1 Document Clustering

In [82], an approach for clustering text documents was proposed in which the Particle Swarm Optimization (PSO) [39] technique is used as the optimization technique to improve the quality of the clusters. It makes use of only one objective as the fitness/objective function. The authors of [83] explored noun phrases and semantic relationships based on WordNet (a lexical database for English) [84] for document clustering. Cai et al. [85] proposed an improved version of DBSCAN [47] clustering. Karaa et al. [86] presented a framework for the clustering of biomedical abstracts and utilized the search capability of the genetic algorithm [29] to search for the optimal clustering solution. In [87], a statistical method based on volume-minimization (VolMin)-structured matrix factorization is proposed for text data clustering. Abualigah et al. [88] utilized three meta-heuristic algorithms (genetic algorithm, harmony search and particle swarm optimization) to select informative features for document representation and finally used the K-means [47] clustering algorithm on the obtained features. Note that all these works consider text-document datasets; no work explores multi-objective optimization for document clustering. But there exist some works [1, 28, 89] which show their potential on some numeric datasets. In [28], some symmetry-based automatic multi-objective clustering techniques utilizing the archived multi-objective simulated annealing [38] process as the underlying optimization technique are proposed. In [89, 90], Bandyopadhyay et al. proposed a multi-objective genetic clustering algorithm utilizing NSGA-II [29] as the underlying optimization strategy. Here, the authors simultaneously optimized two internal cluster validity measures, the Fuzzy C-Means objective [91] and the Xie-Beni index [59]. The efficacy of these approaches is shown on gene expression data [92]. In [93], an extension of the same work [89, 90] was shown for partitioning categorical data by simultaneously optimizing fuzzy separation and fuzzy compactness. Handl et al. [1] proposed an automatic multi-objective clustering technique, MOCK, optimizing two objective functions simultaneously. The main limitation of MOCK is that it can determine only hyper-spherically shaped or well-separated clusters and cannot detect overlapping clusters. In addition, the complexity of MOCK increases linearly as the number of data points increases.

Limitations of Existing Approaches: After an extensive literature survey on document clustering, we arrived at the conclusion that:

• All existing document clustering approaches consider fixed number of clusters which may not be known beforehand.

• Only PSO and genetic algorithm have been explored for solving the document clustering


task. But they considered only a single objective function while performing the clustering. In the literature, differential evolution [39, 94] is shown to perform better than PSO. Therefore, the incorporation of DE can help in improving the document clustering task.

• A self-organizing map based genetic operator was never explored in fusion with multi-objective clustering techniques. It helps in generating good-quality solutions by utilizing the neighborhood property and searching near the existing clustering solutions to determine the optimal solution.

2.2.2 Extractive Single Document Summarization (ESDS)

We have divided the related works on single document summarization into four categories: (a) supervised; (b) unsupervised; (c) neural-network based; and (d) meta-heuristic based. Brief descriptions of these methods, along with their drawbacks, are given below:

Supervised methods: The SVM-based method of [95] considers pre-existing document-summary pairs for learning. In [96], the summarization problem is treated as a sequence labeling problem and is solved using Conditional Random Fields (CRF) [97]. In [98], a method named Manifold Ranking was proposed in which a ranking score is assigned to each sentence in the document based on its information richness and diversity; then, only sentences having high ranking scores are selected to generate the final summary. In [99], a regression-based model was proposed using Integer Linear Programming [100], which uses three features to select the candidate summary from the set of available summaries. The main limitation of the methods proposed in these papers is that they make use of labeled data for training (i.e., whether a sentence belongs to the summary or not), which requires manual effort and is also a time-consuming step.

Unsupervised methods: In [101], QCS, a query-based method, was proposed by Dunlavy et al. to generate the summary. It uses a Hidden Markov Model (HMM), which predicts the probability of a sentence being included in the summary. Note that the method developed was a graph-based method which was adopted for the simultaneous summarization of single as well as multiple documents. The main drawback of this approach is that it considers only three features: the sentence position, local salience (for single-document summarization) and global salience (for multi-document summarization) scores of the sentences. Ferreira et al. [102] developed a context-based summarization system and showed that the quality of the generated summary, obtained using different combinations (sums) of sentence scoring functions/features, depends on the type of text


(news, article, blog). Their sentence scoring features include word-based scoring (like term frequency, etc.), graph-based scoring (obtained using the TextRank algorithm [33]) and sentence-based scoring (sentence position, sentence similarity with the title, etc.). The main limitation of the discussed unsupervised methods [101, 102] is that they have not explored features like readability, which is important for the end-user's understanding of the generated summary.

Meta-heuristics based methods: Aliguliyev et al. [55] proposed an optimization-based automatic text summarization method. Here, the sentences in the document are assigned to different clusters, and cluster quality functions are optimized using the differential evolution algorithm; then, in every cluster, sentences are sorted based on some sentence scoring features, and finally, high-ranked sentences are selected as a part of the summary. The author of the published thesis [103] discussed the principal approaches for solving the task of automatic text summarization and showed that meta-heuristic approaches like genetic algorithms, etc., are practical for summarization as they are capable of yielding high-scoring summaries. But our work on summarization is completely different in terms of the meta-heuristic approach used and other concepts. In [104], a fuzzy evolutionary optimization model (FEOM) was developed and applied to extractive summarization. In [105], the method MA-SingleDocSum was proposed by Mendoza et al. using an optimization algorithm called the Memetic algorithm [106]; it makes use of guided local search to solve the summarization problem. In [107], a method named ESDS-GHS-GLO is proposed based on the Global-best Harmony Search meta-heuristic and a greedy local search procedure; it considers extractive single document summarization as a binary optimization problem. Rasim et al. [108] proposed the COSUM method, utilizing clustering and an optimization technique that optimizes the coverage and diversity of the summary simultaneously. The main drawbacks of these meta-heuristic algorithms are their low convergence rates and low ROUGE scores. Moreover, they optimized the sum (in some cases, the weighted sum) of different objective functions, thus converting multiple objective values into a single value.

Neural-network based methods: In [109], a neural network based method, NetSum, was developed, which uses the RankNet [110] algorithm to assign ranks to the sentences in the document and then identify informative sentences. In recent years [6, 7], some deep learning models like recurrent neural networks have been used for solving the single document extractive summarization task. Note that these methods make use of supervised information while training.


2.2.3 Figure-associated Text Summarization

In the literature, only a few works have addressed this task. Passonneau et al. [111] proposed a system that generates summaries of workflow diagrams. The main drawback of this approach is that it requires a list of attribute values describing the diagrams. Futrelle [112] discusses various challenges and related issues. In Futrelle [25], the authors used two sources of information to summarize figures: the structure of the diagram, and the text of the figure's caption together with the text of the article. Agarwal et al. [113] proposed a system, FigSum, to generate summaries of images from the biomedical domain. It assumes that information related to the figures is scattered throughout the various sections of the scientific article, such as the introduction, proposed method, results, and so on. The top-scoring sentences having high tf-idf cosine similarity with the figure's caption and the article's main theme are considered part of the summary. Peng et al. [114] proposed generating summaries of information graphics using the paragraphs of a multi-modal document from the news domain. Bhatia et al. [115] used a supervised approach to generate figure summaries; the authors identified relevant sentences from the article based on their similarity with the figure's caption and with sentences referring to that figure. The FigSum+ method was proposed by Ramesh et al. [20] as an extended version of FigSum [113]. The authors of this paper explored various supervised and unsupervised approaches to generate summaries of biomedical images present in scientific articles; some of the approaches rely on surface-cue words, for example, to identify the paragraphs and sentences referring to the figure. To the best of our knowledge, no further work on the figure-summarization task is reported in the literature.

Limitations of Existing Approaches: After an extensive literature survey on figure-associated text summarization methods, we conclude that the existing approaches have the following drawbacks:

• No algorithm has utilized a language representation model at the semantic level; only syntactic representations are used, which may cause the data sparseness problem.

• Only one or two quality measures or their weighted sum are considered for computing the sentence score.

• An anti-redundancy function is an important measure while generating a summary, but none of the existing techniques has explored textual entailment as an anti-redundancy function.

• None of the methods utilizes the concept of a multi-objective evolutionary algorithm considering simultaneous optimization of various syntactic and semantics-based objective functions to generate a near-optimal summary.

2.2.4 Microblog Summarization

In the literature, a considerable amount of work has been done on tweet summarization. In [116], the problem of summarizing tweets related to sports events was addressed. However, summarizing disaster-related tweets is more important, as such summaries may convey relevant information to the authorities and help them take the desired action. In [117], clustering of tweets is first performed and some representative tweets are selected from each cluster; these tweets are then arranged using the graph-based LexRank [32] algorithm. In [118], abstractive summarization is proposed for online summarization of tweets. Some other techniques for online summarization of tweets are discussed in [117, 119]. Dutta et al. [34] compared various extractive summarization techniques for summarizing disaster-related tweets. These techniques include Cluster-rank [120], LexRank [32], LSA [121], Luhn [122], MEAD [123], SumBasic [124], SumDSDR [125] and COWTS [126]. The COWTS technique uses the content words of the tweets to generate a summary of situational tweets. Situational tweets are those tweets which provide information such as status updates, i.e., the current situation in the region affected by the disaster event. The COWTS work was extended in [127]. In [128], time-aware knowledge is extracted from the tweets for the microblog summarization task. Recently, [21] proposed an ensemble approach for microblog summarization which generates the summary after considering the summaries of various algorithms discussed in [34].

Limitations of Existing Approaches: After an extensive literature survey on microblog summarization methods, we conclude that the existing approaches have the following drawbacks:

• The best recent methods, proposed in 2018, use an ensemble approach, which generates the summary after considering the summaries generated by various algorithms like LexRank, MEAD, Luhn, etc., discussed in [34]. But in real time, applying the ensemble approach to tweet summarization is time-consuming, as the summaries must first be generated by the different algorithms and the final summary then produced from these individual summaries.

• Existing algorithms consider only a single objective function to generate the summary. There exists no MOO-based framework which considers different goodness measures of a summary for simultaneous optimization.

2.2.5 Multi-modal Microblog Summarization

This task is an extension of microblog summarization, in which only the tweet text is considered for summarization. Only a few works consider multimedia tweets (tweet text with images) for summarization [129, 130, 131]. Amato et al. [132] proposed an approach for multimedia summarization of social media content, but the dataset used was from the general domain, containing images of animals, nature and landscapes. In [133], a multi-modal approach was proposed for classifying disaster-related tweet images. The papers by Bian et al. [130, 131] generate visualized summaries from the microblog multimedia content of trending topics; their works were on Chinese tweets and utilized a probabilistic model (LDA) [129].

Limitations of Existing Approaches: After going through the literature on multi-modal microblog summarization methods, we conclude that:

• The existing methods work on multimedia microblog datasets of trending topics, including social trends and product events. But the summarization of disaster-related multimedia tweets is more important, as it may convey relevant information to the authorities and help them take the desired action. Only one method exists, and it develops a classification-based framework to classify disaster-related images. Thus, no dataset is available to handle the multi-modal microblog summarization task.

• For image feature extraction, existing algorithms use different scene-understanding features like SIFT, RGB histogram, GLCM, GIST, Gabor, etc., which provide vectors of varying size whose meaning is difficult to interpret. An image-captioning model can provide the visual features in text format, but none of the existing works utilizes this.

• No evolutionary-based framework has been developed for this task.

2.3 Evaluation Measures

To test the performance of any approach, evaluation measures/metrics are needed. The following are the descriptions of the evaluation metrics used for document clustering and summarization.


2.3.1 Document Clustering

In order to measure the goodness of the obtained partitioning, two internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin (DB) Index [60], are used. Detailed descriptions of the Dunn and DB indices are given in Table 2.1. The maximum value of the Dunn index and the minimum value of the DB index, respectively, imply better clustering results. External cluster validity indices are not utilized, as the datasets used are not available with ground-truth cluster labels.

2.3.2 Summarization

Single/multi-document, microblog, and multi-modal microblog summarization: To evaluate the performance of these systems, we have utilized the ROUGE measure [134]. It measures the overlapping units between the actual/gold summary and our predicted summary. A high ROUGE score indicates that the obtained summary is very close to the actual summary. The mathematical definition of the ROUGE score is given below:

\text{ROUGE-}N = \frac{\sum_{S \in \text{Summary}_{\text{actual}}} \sum_{N\text{-gram} \in S} \text{Count}_{\text{match}}(N\text{-gram})}{\sum_{S \in \text{Summary}_{\text{actual}}} \sum_{N\text{-gram} \in S} \text{Count}(N\text{-gram})} \tag{2.16}

where N represents the length of the n-gram, Count_match(N-gram) is the maximum number of overlapping N-grams between the actual summary and the generated summary, and Count(N-gram) is the total number of N-grams present in the actual summary. In our thesis, we report ROUGE-1 and ROUGE-2 (N = 1 and 2, respectively), along with ROUGE-L, which is based on the longest common subsequence between the two summaries rather than a fixed N.
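As an illustration, the ROUGE-N recall of Eq. (2.16) can be computed from clipped n-gram counts as in the minimal sketch below. This is a simplified, hypothetical implementation for a single reference summary; the scores reported in this thesis are produced with the standard ROUGE toolkit [134].

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams of a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference_tokens, candidate_tokens, n):
    # Eq. (2.16): overlapping n-grams / total n-grams in the actual (reference) summary.
    ref, cand = ngrams(reference_tokens, n), ngrams(candidate_tokens, n)
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())  # clipped match counts
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Toy usage: ROUGE-1 between a gold and a predicted summary
print(rouge_n("the flood hit the city".split(), "flood hit city".split(), 1))  # 0.6
```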

Figure-summarization: For the figure-summarization task, we have reported the precision, recall and F-measure (or F1-score) [20] values, which are well-known measures in information retrieval. Mathematically, the F-measure can be defined as

\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.17}

where Precision is the ratio of the number of correctly identified sentences in the predicted summary to the total number of sentences in the predicted summary, while Recall is the number of sentences correctly identified out of the total number of sentences present in the actual summary. The mathematical definitions of Precision and Recall are given below:

\text{Precision} = \frac{\#\text{Sentences correctly identified}}{\text{Total } \#\text{Sentences identified by the system}} \tag{2.18}


\text{Recall} = \frac{\#\text{Sentences correctly identified}}{\text{Total } \#\text{Sentences in the actual summary}} \tag{2.19}

where #Sentences denotes the number of sentences.
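A minimal sketch of Eqs. (2.17)-(2.19) is given below, assuming (for illustration only) that the predicted and actual summaries are represented as sets of sentence identifiers.

```python
def precision_recall_f1(predicted, actual):
    # predicted, actual: sets of sentence identifiers in the system and gold summaries.
    correct = len(predicted & actual)                                   # correctly identified sentences
    precision = correct / len(predicted) if predicted else 0.0          # Eq. (2.18)
    recall = correct / len(actual) if actual else 0.0                   # Eq. (2.19)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0  # Eq. (2.17)
    return precision, recall, f1

print(precision_recall_f1({1, 4, 7}, {1, 2, 4, 5}))  # (0.666..., 0.5, 0.571...)
```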

2.4 Chapter Summary

In the first part of this chapter, we discussed the preliminaries of some topics that form the primary basis of the remaining chapters of the thesis. In the second part, existing approaches to document clustering, single-document summarization, figure-associated text summarization, microblog summarization and multi-modal microblog summarization were described. The third part discussed the evaluation measures for the different tasks. The next chapter is the first contributory chapter, in which we propose a novel multi-objective clustering technique. As an application, we have chosen the task of scientific/web document clustering from the domain of text mining.

CHAPTER 3

Automatic Document Clustering: Fusion of MODE and SOM

In this chapter, we propose a bio-inspired multi-objective automatic document clustering technique that is a fusion of the self-organizing map (SOM) and a multi-objective differential evolution approach. A variable number of cluster centres is encoded in the different solutions of the population to determine the number of clusters in a data set in an automated way. The concept of SOM is utilized in designing a new genetic operator for the proposed clustering technique. To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The results obtained clearly show that our approach is better than existing approaches. The validity of the obtained results is also established using statistical significance t-tests.


3.1 Introduction

3.1.1 Overview

Document clustering [4] refers to the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion, so that each document in a group is similar to the other documents in the same group. For clustering, the value of K may or may not be known a priori. To determine the value of K for a collection of documents, traditional clustering approaches [59] like K-means [47], K-medoid, bisecting K-means [135], and hierarchical clustering techniques [47] have to be executed multiple times with various values of K. Then, the qualities of the different partitionings are measured with respect to several cluster validity indices, which quantify the goodness of a partitioning. Finally, the partitioning which corresponds to the optimal value of a cluster validity index is selected as the final partitioning. The Davies-Bouldin (DB) index [60], Silhouette index (SI) [58, 136], Xie-Beni (XB) index [28], and Pakhira-Bandyopadhyay-Maulik (PBM) index [61], among others, are some popularly used cluster validity indices. The existing traditional clustering techniques implicitly optimize an internal evaluation function or objective function. These objective functions in general measure the compactness of clusters [137], the spatial separation between clusters [137], the connectivity between clusters [138], and the density or cluster symmetry [28]. However, in real life, all these properties cannot be captured using a single objective function. Also, for a given data set possessing clusters of different geometrical shapes (hyper-spherical, convex, etc.), the use of a single objective function measuring cluster quality may not be suitable for determining all types of clusters. The application of a multi-objective optimization (MOO) technique [38, 37] optimizing different cluster validity indices appears to be an alternative and promising direction in clustering research in recent years. This has motivated researchers to develop multi-objective clustering algorithms [57, 58, 139]. Also, determining the appropriate number of clusters in a given data set in an unsupervised way is another important consideration; simultaneous optimization of multiple cluster validity indices can also address this issue. The current work proposes a bio-inspired multi-objective clustering framework for automatically partitioning a given collection of scientific documents, exploiting syntactic and semantic information to identify possible subtopics. To measure the quality of a partitioning, different internal cluster validity measures are used. The values of these cluster validity indices are simultaneously optimized using the search capability of MODE. The proposed clustering approach is automatic in nature, as it can determine the number of clusters present in a dataset automatically. Centre-based encoding is used in the

current approach, where a set of cluster centres is coded in the form of a chromosome/solution. The number of cluster centres present in different chromosomes varies over a range. A new genetic operator utilizing the neighbourhood information extracted using SOM is incorporated in the proposed approach. Two data sets containing scientific articles of varying complexities and a data set containing a variety of web documents are chosen for the evaluation of the proposed clustering technique. In order to represent the articles/documents in the form of vectors, different representation schemes like tf [41], tf-idf [41] and word embeddings (word2vec and GloVe) [43, 44, 51] are exploited. Like any MOO-based approach, our proposed clustering approach also generates a set of solutions on the final Pareto optimal front. A single solution can be selected by the user, depending on the requirement. In the current study, a single best solution is selected using internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin index [60]. The obtained partitioning results are compared with those obtained by some existing state-of-the-art clustering techniques with respect to different performance measures.

3.1.2 Key-contributions

The key contributions of this chapter are summarized below:

1. A document clustering approach, namely SMODoc clust, is proposed, which is a fusion of a self-organizing map and multi-objective differential evolution.

2. The proposed approach using variable length chromosomes is capable of automatically detecting the number of clusters from any given data set.

3. In the proposed framework, two cluster validity indices, the PBM index [61] and the Silhouette index [58, 136] are simultaneously optimized for the automatic determination of the appropriate number of clusters and also to improve the quality of clusters.

4. A new type of genetic operator is proposed in the framework of MODE. The mating pool constructed for the crossover and mutation operation given a solution only contains the neighbouring solutions identified by SOM. For the training of SOM, the solutions of the current population are utilized. The constructed mating pool takes part in generating some new solutions.

5. The results of the proposed technique are shown for clustering two document data sets containing scientific articles of varying complexities and a document data set containing some web documents. The experimental results prove that the proposed clustering technique performs well for document clustering.


Figure 3.1: Flow chart of proposed algorithm for automatic multi-objective document clustering.

Here, P: population containing solutions; |P|: size of the population; w_i: weight vector of the ith neuron; g_max: maximum number of generations; A: archive (copy of population P); Q: mating pool; S: training data for SOM.

3.2 Proposed Methodology

The flow-chart of the proposed multi-objective document clustering (SMODoc clust) technique is shown in Fig. 3.1. Several new concepts are incorporated in the framework of the proposed clustering technique. The basic operations of SMODoc clust are described below.

3.2.1 Solution Representation and Population Initialization:

In SMODoc clust, solutions encode a set of different cluster centers. As the proposed algorithm attempts to determine the optimal set of cluster centers that can partition the document dataset appropriately, the number of cluster centers encoded in different solutions varies over a range. The number of clusters is varied between 2 and √N, where N is the total number of points (documents). To generate the ith solution, a random number K_i is selected between two values, i.e., K_min = 2 and K_max = √N, and these K_i initial cluster centers are chosen randomly from the dataset. As these solutions take part in SOM training to learn the distribution pattern of the population, the lengths of the input vectors (solutions) and the weight vectors of the neurons are kept equal. Therefore, variable-length solutions are converted to fixed-length vectors by appending zeros at the end. If F indicates the number of features in the dataset, then


the length of a solution can be written as (K × F + l), where K is the number of clusters encoded in the solution and l is the number of appended zeros, which lies between 0 and (K_max × F − 2 × F). Here we have subtracted 2 × F because there must exist at least two clusters in the dataset. In terms of data points, the maximum length of a solution can be √N × F. This set of solutions with varying numbers of clusters forms the initial population. In order to obtain a partitioning corresponding to a solution in the population, the steps of the K-means clustering technique [47] (discussed in Section 2.1.1 of Chapter 2) are executed on the whole data set, considering the cluster centers encoded in the solution as the initial cluster centers. The population (P) initialization step is shown in Figure 3.2, and an example of solution encoding is given below.

Figure 3.2: Steps of population initialization.

Example: Let K = 3, F = 2, N = 16, and let the three centers be C1 = (2.3, 1.4), C2 = (7.6, 12.9) and C3 = (2.1, 3.4). Here the maximum length of a solution is √N × F = 4 × 2 = 8. The solution will then be represented as {(2.3, 1.4, 7.6, 12.9, 2.1, 3.4, 0.0, 0.0)}, which encodes three cluster centers, with l = 2.
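The population initialization described above can be pictured with the following minimal sketch. It is an illustrative reconstruction only, assuming documents are already available as an (N, F) feature matrix; the function name and defaults are hypothetical.

```python
import numpy as np

def initialize_solution(data, k_max):
    # data: (N, F) document vectors. A solution encodes K random cluster centers,
    # zero-padded to the fixed maximum length K_max * F.
    n, f = data.shape
    k = np.random.randint(2, k_max + 1)                      # K_i chosen between K_min = 2 and K_max
    centers = data[np.random.choice(n, k, replace=False)]    # K_i centers picked from the data points
    solution = np.zeros(k_max * f)
    solution[:k * f] = centers.ravel()                       # zeros appended after the encoded centers
    return solution, k

data = np.random.rand(16, 2)               # toy data: N = 16 documents, F = 2 features
k_max = int(np.sqrt(len(data)))            # K_max = sqrt(N) = 4, so maximum length = 8
sol, k = initialize_solution(data, k_max)
print(k, sol)                              # e.g. 3 encoded centers followed by 2 padded zeros
```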

3.2.2 SOM Training

To learn the distribution pattern of the population and to find the neighborhood relationship among the solutions, SOM is utilized in our approach. It is trained using the solutions in the population. Although the lengths of the different solutions are made equal by padding zeros (the number of padded zeros lies between 0 and (K_max × F − 2 × F)), during the Euclidean distance calculation between an input solution and a neuron's weight vector, only the minimum number of features available in both vectors is considered. Example: Let F = 2 and the maximum length of a solution be 8 for N = 16. Consider a first vector {(m, n, q, p, 0, 0, 0, 0)} having K1 = 2 and a second vector {(w, x, y, z, a, b, 0, 0)} having K2 = 3. Then, during distance calculation or weight updating, only {min(K1, K2) × F} features are considered and the other features are ignored.
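The truncated distance used during SOM training can be sketched as below (an illustrative helper, assuming F and the cluster counts K1, K2 of both vectors are known; only the first min(K1, K2) × F positions enter the Euclidean distance).

```python
import numpy as np

def truncated_distance(sol_a, k_a, sol_b, k_b, f):
    # Only the features shared by both variable-length solutions
    # (min(K1, K2) * F positions) contribute to the distance.
    nz = min(k_a, k_b) * f
    return np.linalg.norm(sol_a[:nz] - sol_b[:nz])

a = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0])   # K1 = 2, F = 2
b = np.array([1.5, 2.5, 2.0, 5.0, 0.5, 0.5, 0.0, 0.0])   # K2 = 3, F = 2
print(truncated_distance(a, 2, b, 3, 2))                  # uses only the first 4 positions
```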

3.2.3 Objective Functions Used

In order to measure the goodness of the partitioning encoded in a solution, two internal cluster validity indices, Pakhira-Bandyopadhyay-Maulik (PBM) index [61] and Silhouette index (SI) [58, 136] are calculated and those are used as the objective functions of the current solution.


The mathematical definitions of these indices are given in Table 2.1. Note that these two objective functions measure the separation and compactness of the partitionings in two different ways. The superiority of the PBM index over other cluster validity indices, namely the Dunn index [59], Davies-Bouldin index [60] and Xie-Beni index [28], in determining the appropriateness of clusters is established in [61]. In [140], the Silhouette index was compared with 29 other cluster validity measures (excluding the PBM index), namely the Davies-Bouldin index [60], Gamma index, C index, Dunn index [59], Xie-Beni index [28], etc., and it was found that the Silhouette index achieved the highest success rate. Inspired by this existing literature, the PBM index and Silhouette index are incorporated in our proposed framework as the objective functions.
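For a given partitioning, the two objectives can be evaluated roughly as follows. This is only a sketch: the Silhouette index is taken from scikit-learn, and the PBM index is written out from its usual definition; both should be checked against the exact formulations in Table 2.1.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def pbm_index(data, labels, centers):
    # PBM = ((1/K) * (E1 / EK) * DK)^2, where E1 is the scatter w.r.t. the global centroid,
    # EK the within-cluster scatter, and DK the largest distance between cluster centers.
    k = len(centers)
    e1 = np.linalg.norm(data - data.mean(axis=0), axis=1).sum()
    ek = sum(np.linalg.norm(data[labels == c] - centers[c], axis=1).sum() for c in range(k))
    dk = max(np.linalg.norm(ci - cj) for i, ci in enumerate(centers) for cj in centers[i + 1:])
    return ((e1 / (k * ek)) * dk) ** 2

def objectives(data, labels, centers):
    # Both objective values are maximized simultaneously by the MOO search.
    return pbm_index(data, labels, centers), silhouette_score(data, labels)
```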

Algorithm 3: y = Generate(Q, CR, MP, x_current, K_xcurrent)

1: Randomly select two solutions x1 and x2 as parent solutions from the mating pool Q of the current solution x_current such that x1 ≠ x2 ≠ x_current.

2: Generate a trial solution y′, for i = 1, ..., nz:

if rand() ≤ CR, then y′_i = x_current_i + F1 × (x1_i − x2_i); otherwise y′_i = x_current_i.

Here nz = K_xcurrent × F, and only the nz feature values of x_current, together with the corresponding nz values of x1 and x2, are considered during computation, while the remaining values from (nz + 1) to n (the length of the solution) are kept unchanged.

3: Repair the trial solution using the lower (x_L) and upper (x_U) boundaries of the population to generate y″, for i = 1, 2, ..., nz:

if y′_i < x_{L_i}, then y″_i = x_{L_i}; else if y′_i > x_{U_i}, then y″_i = x_{U_i}; otherwise y″_i = y′_i.

4: The repaired trial solution is now mutated to generate y, for i = 1, 2, ..., nz:

(i) if 0 ≤ MP < 0.6 (normal mutation):

if rand() ≤ p_m, then y_i = y″_i + Δ_i × (x_{U_i} − x_{L_i}); otherwise y_i = y″_i,

where r_1 = rand() is a random number generated between 0 and 1, and

\Delta_i = \begin{cases} \left[2r_1 + (1-2r_1)\left(\dfrac{x_{U_i}-y''_i}{x_{U_i}-x_{L_i}}\right)^{\eta_m+1}\right]^{\frac{1}{\eta_m+1}} - 1, & \text{if } r_1 < 0.5,\\[6pt] 1 - \left[2(1-r_1) + (2r_1-1)\left(\dfrac{y''_i-x_{L_i}}{x_{U_i}-x_{L_i}}\right)^{\eta_m+1}\right]^{\frac{1}{\eta_m+1}}, & \text{otherwise.} \end{cases}

(ii) if 0.6 ≤ MP < 0.8 (insert mutation): a random input pattern is picked from the dataset and appended to solution y starting from position (nz + 1).

(iii) if 0.8 ≤ MP ≤ 1.0 (delete mutation): the last cluster center is selected and deleted from solution y.

5: return solution y
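A condensed sketch of steps 2-4 of Algorithm 3 is given below. It is illustrative only, implements the DE crossover, boundary repair and the normal-mutation branch, and assumes the bookkeeping of K_xcurrent and F described above; parameter defaults are hypothetical.

```python
import numpy as np

def generate_offspring(x_cur, k_cur, x1, x2, x_low, x_up, f, cr=0.8, f1=0.8, pm=0.1, eta_m=20):
    nz = k_cur * f                                   # only the encoded (non-padded) positions are perturbed
    y = x_cur.copy()
    # Step 2: DE crossover producing the trial solution y'
    mask = np.random.rand(nz) <= cr
    y[:nz][mask] = x_cur[:nz][mask] + f1 * (x1[:nz][mask] - x2[:nz][mask])
    # Step 3: repair against the population's lower/upper boundaries
    y[:nz] = np.clip(y[:nz], x_low[:nz], x_up[:nz])
    # Step 4(i): polynomial ("normal") mutation on each encoded position
    for i in range(nz):
        if np.random.rand() <= pm:
            r1 = np.random.rand()
            span = max(x_up[i] - x_low[i], 1e-12)    # guard against a degenerate boundary
            if r1 < 0.5:
                d = (2 * r1 + (1 - 2 * r1) * ((x_up[i] - y[i]) / span) ** (eta_m + 1)) ** (1 / (eta_m + 1)) - 1
            else:
                d = 1 - (2 * (1 - r1) + (2 * r1 - 1) * ((y[i] - x_low[i]) / span) ** (eta_m + 1)) ** (1 / (eta_m + 1))
            y[i] = np.clip(y[i] + d * span, x_low[i], x_up[i])
    return y
```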


Figure 3.3: Generation of trial solution.

3.2.4 Extracting Closer Solutions using Neighborhood Relationship of SOM

This step is equivalent to the mating pool construction of the MODE framework. The nearby solutions for the current solution are identified using the neighborhood relationship (NR) of SOM, which is trained using the solutions in the population. This set of nearby solutions forms the mating pool, Q, for the current solution. Only these solutions can take part in mating to generate a new solution from the current solution. The series of steps to construct the mating pool, Q, for x_current ∈ P is described in Algorithm 2 of Chapter 2.
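The mating-pool idea can be pictured with the schematic sketch below. This is not the exact procedure of Algorithm 2 of Chapter 2; the hypothetical helper simply maps each solution to its best-matching SOM neuron and ranks the other solutions by how close their neurons sit on the map grid.

```python
import numpy as np

def mating_pool(population, neuron_weights, grid_pos, cur_idx, h=5):
    # population: (P, L) solutions; neuron_weights: (M, L) trained SOM weights;
    # grid_pos: (M, 2) coordinates of the neurons on the SOM grid.
    bmu = lambda x: np.argmin(np.linalg.norm(neuron_weights - x, axis=1))   # best-matching unit
    bmus = np.array([bmu(x) for x in population])
    # Rank the other solutions by grid distance between their BMU and the current solution's BMU.
    grid_dist = np.linalg.norm(grid_pos[bmus] - grid_pos[bmus[cur_idx]], axis=1)
    order = [i for i in np.argsort(grid_dist) if i != cur_idx]
    return order[:h]        # indices of the H nearest solutions: the mating pool Q
```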

3.2.5 Offspring Reproduction (New Solution Generation)

In the previous step, the mating pool was constructed; its members take part in crossover and mutation operations to generate a new solution. The detailed procedure for generating the new solution is given in Algorithm 3. First, the crossover operator of differential evolution (DE) [36, 141] is used to generate the trial solution (Line 2), and then a repair mechanism is adopted to ensure the feasibility of the generated solution (Line 3). The lower and upper boundaries of the solutions present in the population are utilized in converting a solution into a feasible one. Finally, the mutation operation is applied to that solution (Line 4). Some modifications are incorporated in the MODE algorithm. Firstly, during trial solution generation y′, only {K_xcurrent × F} feature values of the current solution are considered for the computation while the others are treated as zero, where K_xcurrent is the number of clusters of the current solution and F is the number of features in the data set. The trial solution generation process is shown in Figure 3.3. Secondly, instead of a single mutation operator, three types of mutation operations are used: normal mutation (here polynomial mutation [142] is used as normal mutation), insert mutation


and delete mutation. The polynomial mutation operator is used to generate a highly disruptive mutated vector in order to explore the search space in any direction. This further assists in converging towards an optimal set of cluster centers. The use of different types of mutation operators aids in locating the appropriate number of clusters and the appropriate partitioning efficiently. Any of these mutation operations can be selected based on the probability MP, which is generated with a uniform distribution over the range [0, 1], similar to Ref. [28]. If MP < 0.6 then normal mutation is selected, else if

0.6 ≤ MP < 0.8 then insert mutation is adopted, else delete mutation is applied. Details about these mutation operations are given in Line 4 of Algorithm 3, and examples of these different types of mutation operations are shown in Figure 3.4.


Figure 3.4: Generation of new solution. Here, rand() is a function which generates a random number between 0 and 1.

It should be noted that in the case of (a) normal mutation, the number of clusters for the new solution y remains the same as K_xcurrent, i.e., K_y = K_xcurrent; (b) insert mutation: the number of clusters for the new solution increases by 1, i.e., K_y = K_xcurrent + 1; (c) delete mutation: the number of clusters for the new solution decreases by 1, i.e., K_y = K_xcurrent − 1. After generating the new solution, the following additional steps are applied to obtain the final solution.

1. The steps of the K-means clustering algorithm are applied to the new solution generated using Algorithm 3. The centers present in the new solution are considered as the initial set of cluster centers before the application of K-means.

2. Cluster centers obtained after execution of the K-means algorithm are encoded into the new solution. Next, PBM and SI index values are calculated as the objective functions.

The following symbols are used in the algorithm: (a) F1 and CR (crossover probability), which are control parameters of DE; the ranges of F1 and CR are [0, 2] and [0, 1], respectively. (b) p_m is the normal mutation probability for each component of a solution; MP is the current solution (x_current)'s mutation probability and decides the type of mutation to be performed; η_m denotes the distribution index of polynomial mutation. Note that the higher the distribution index, the more diverse is the generated solution.

Example: Let F = 2, x_current = {x11, x12, x13, x14, x15, x16, 0, 0}, K_xcurrent = 3, and let Q (the mating pool) consist of three solutions: {x21, x22, x23, x24, x25, x26, 0, 0}, {x31, x32, x33, x34, x35, x36, x37, x38} and {x41, x42, x43, x44, 0, 0, 0, 0}. Then, at the time of generating a trial solution y′ (Step 2), only K_xcurrent × F = 3 × 2 = 6 features of all the solutions are considered, as the current solution has only 6 non-zero features; the remaining features are treated as zero, as shown in Figure 3.3. To make the solution feasible, the trial solution undergoes repairing using the lower and upper boundaries of the population, and then mutation is applied based on a random probability MP, as shown in Figure 3.4.

3.2.6 Selection Operation

In Section 3.2.5, after generating an offspring (new solution) for each solution in the population P, a new population P′ is formed. This is then merged with the old population P. As |P| = |P′|, the size of the merged population is 2 × |P|. For the next generation, only the best |P| solutions (in terms of diversity and convergence [29]) of the merged population are retained, while the rest of the solutions are discarded. This operation is performed using the non-dominated sorting and crowding distance procedures of the Non-dominated Sorting Genetic Algorithm (NSGA-II) [29].

1. Non-dominated sorting algorithm: It sorts the solutions based on the concepts of domination and non-domination in the objective function space and ranks the solutions. It divides the solutions into k fronts, F = {Front_1, Front_2, ..., Front_k}, such that Front_1 contains the highest-ranked solutions and Front_k contains the lowest-ranked solutions. Each front contains a set of non-dominated solutions. For example, in Fig. 3.5, solutions are ranked as shown in the Pareto-optimal front (or surface). After this step, the top-ranked solutions are selected and added to the population for the next generation. This process continues until the number of solutions added equals |P|. If the number of solutions to be added exceeds |P|, then the crowding distance algorithm is applied to select the required number of solutions.

2. Crowding distance algorithm: The crowding distance CD_i of the ith solution in Front_k is computed as follows:

(a) For i = 1, 2, ..., |Front_k|, initialize CD_i = 0.

(b) For each objective function f_m, m = 1, 2, ..., M, do the following:

i. Sort the set Front_k according to f_m in ascending order.

ii. Set CD_1 = CD_{|Front_k|} = ∞.

iii. For j = 2 to (|Front_k| − 1), set

CD_j = CD_j + (f_m(j+1) − f_m(j−1)) / (f_m^max − f_m^min),

where f_m^max and f_m^min are the maximum and minimum values of the mth objective function, respectively, and M is the total number of objective functions.
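The crowding-distance computation can be written compactly as follows (a sketch for a single front, assuming the objective values of its solutions are stored row-wise in a matrix).

```python
import numpy as np

def crowding_distance(front_objs):
    # front_objs: (n, M) objective values of the solutions in one front.
    n, m = front_objs.shape
    cd = np.zeros(n)
    for j in range(m):
        order = np.argsort(front_objs[:, j])          # sort the front by objective f_m
        cd[order[0]] = cd[order[-1]] = np.inf         # boundary solutions get infinite distance
        f_min, f_max = front_objs[order[0], j], front_objs[order[-1], j]
        if f_max > f_min:
            for pos in range(1, n - 1):
                cd[order[pos]] += (front_objs[order[pos + 1], j] -
                                   front_objs[order[pos - 1], j]) / (f_max - f_min)
    return cd

print(crowding_distance(np.array([[3.0, 4.5], [4.0, 2.5]])))   # a two-solution front: both boundary (inf)
```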

Example: Let |P| = 3 and let the two objective function values be (1, 2), (4, 2.5) and (3, 4.5) for solutions e, d and c, respectively. After generating 3 new solutions f, a and b, let their objective function values be (2, 1), (5, 5) and (6, 4), respectively. Suppose both objective functions are to be maximized. After merging, the total number of solutions becomes 6, and 3 solutions have to be selected for the next generation. First, these solutions are ranked based on the dominance and non-dominance concept. Thus, the ranked solutions are {(5, 5), (6, 4)} for rank-1, {(3, 4.5), (4, 2.5)} for rank-2, and {(1, 2), (2, 1)} for rank-3. As rank-1 includes two solutions, they are propagated to the next generation. Out of the rank-2 solutions, (3 − 2) = 1 solution needs to be included in the next generation; therefore, the crowding distance operator is applied to the rank-2 solutions and the solution with the highest crowding distance is selected.


Figure 3.5: Ranking of solutions.

Rank 1: solutions a and b are non-dominating with respect to each other because, in terms of objective f1, solution a is better, while in terms of f2, solution b is better. Rank 2: solutions c and d are non-dominating but are dominated by at least one Rank-1 solution; for example, solution c is dominated by solution a because a is better in terms of both f1 and f2. Rank 3: solutions e and f are non-dominating but are dominated by at least one Rank-1 or Rank-2 solution; for example, solution e is dominated by solutions c and a.

3.2.7 Termination Condition

The process of generating new solutions and then selecting the best |P| solutions for the next generation continues until a maximum number of generations, g_max, is reached. The final Pareto optimal set contains a set of optimal solutions.

3.2.8 Selection of a Single Solution based on User Requirement

Any multi-objective algorithm produces a large number of equally important (non-dominated) solutions on the final Pareto optimal front. All these solutions represent different ways of clustering the given data set. But sometimes the decision-maker wants to select only a single solution based on a requirement, or to report the performance of the algorithm. Therefore, in this work, to select a single solution from the Pareto optimal front, we have used some internal cluster validity indices. Two experiments are conducted. In the first experiment, the Dunn Index (DI) [59] is used to select the single solution from the final Pareto front. The definition of the Dunn Index implies that a higher value indicates better partitioning; thus, we calculate the DI values for all the partitioning solutions present on the final Pareto front and report the solution having the highest DI value. In the second experiment, the Davies-Bouldin index (DB) [60] is utilized for selecting a single solution. The DB-index value should be minimized to obtain the optimal partitioning; thus, we report the solution which corresponds to the minimum value of the DB-index. The selection of the best solution is shown in step 13 of Figure 3.1. This step is different from step 10, in which, after merging the old population P and the new population P′, only those solutions are selected for the next generation which are non-dominated with respect to each other and are well distributed over the different fronts.

3.3 Experimental Setup

This section presents the datasets, evaluation measures and comparative state-of-the-art techniques. In addition, it discusses the various preprocessing steps applied and the different representation schemes used to convert a document into vector form, followed by the parameter settings. The results reported in this section are average values over 20 runs. All the approaches were implemented on an Intel Core i7 CPU at 3.60 GHz with 4 GB of RAM, running Ubuntu.

3.3.1 Datasets

In order to show the efficacy of the proposed clustering technique over the existing algorithms, we have used two types of datasets: scientific articles and web documents. Detailed descriptions of the data sets used in the current study are given below:

NIPS 2015: This data set is taken from the Kaggle site1. It contains 403 articles published at the Neural Information Processing Systems (NIPS) conference, which is an important core-ranked conference in the machine learning domain. Its topics range from deep learning and computer vision to cognitive science and reinforcement learning. The dataset includes the paper id, title of the paper, event type (poster/oral/spotlight presentation), name of the pdf file, abstract and paper text, of which only the title, abstract and paper text are used in our experimentation.

1https://www.kaggle.com/benhamner/exploring-the-nips-2015-papers/data



Figure 3.6: Word clouds of the (a) NIPS 2015, (b) AAAI 2013 and (c) WebKB datasets.

Here, most of the articles are related to machine learning and natural language processing. The corresponding word cloud is shown in Figure 3.6(a).

AAAI 2013: This data set is taken from the UCI repository [143] and contains 150 accepted articles from another core-ranked conference in the AI domain, namely AAAI 2013. Each paper has the following information: title of the paper, topics (author-selected low-level keywords from a conference-provided list), keywords (author-generated keywords), abstract and high-level keywords (author-selected high-level keywords from a conference-provided list). Most of the articles are related to artificial intelligence (e.g., multi-agent systems, reasoning) and machine learning (e.g., data mining, knowledge discovery). The corresponding word cloud is shown in Figure 3.6(b).

WebKB: In order to show the potential of our approach, we have also used an out-of-domain dataset, WebKB, in which the documents are web pages rather than scientific articles. The WebKB [144] data set consists of web pages collected from the computer science departments of four universities: Texas, Cornell, Wisconsin and Washington. In our thesis, we have used a total of 2803 documents out of the 4199 documents. The corresponding word cloud is shown in Figure 3.6(c).

3.3.2 Evaluation Measures

In order to measure the goodness of the obtained partitioning, two internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin (DB) Index [60], are used, as discussed in Section 3.2.8. Detailed descriptions of the Dunn and DB indices are given in Table 2.1.

3.3.3 Comparative Approaches

In order to illustrate the efficacy of the proposed clustering technique, SMODoc clust, the results are compared with several existing clustering techniques of different complexity levels. The approaches selected for comparison are traditional clustering techniques like K-means [47] and single-linkage [47], SOGA (Single Objective Genetic Algorithm) based clustering [145], and MOO-based clustering approaches, namely MODoc clust without the SOM-based reproduction operator, MOCK [1], the AMOSA-based multi-objective clustering technique VAMOSA [28], and the NSGA-II based multi-objective clustering technique [89]. K-means and single-linkage are simple and well-known clustering algorithms with limited computational complexity, and they assume that the number of clusters present in a data set is known beforehand. Note that our proposed clustering technique is automatic in nature; it determines the number of clusters automatically from a given data set. For the K-means and single-linkage clustering algorithms, the number of clusters is fixed to K, where K is the optimal number of clusters determined by the proposed approach, SMODoc clust. Detailed descriptions of K-means and single-linkage are provided in Section 2.1.1 of Chapter 2. The rest of the algorithms are described below:

MODoc clust: MODoc clust, a multi-objective evolutionary algorithm for document clustering, is developed similarly to our proposed clustering approach but without utilizing the SOM-based genetic operators. It is also able to detect the appropriate number of clusters automatically from a given data set and optimizes the PBM [61] and Silhouette [58] indices simultaneously. Normal DE-based genetic operators are used during the clustering process. It is developed to show the effectiveness of our newly designed genetic operators utilizing SOM-based neighborhood information.

MOCK: MOCK [1] is a multi-objective clustering algorithm with automatic K-determination; it optimizes two objective functions (compactness and connectedness) simultaneously, where K is the number of clusters. Note that here we have executed MOCK with those document representations for which our proposed approach attains good results.

VAMOSA: VAMOSA [28] is a multi-objective clustering technique which optimizes cluster quality by utilizing two cluster validity indices as the objective functions, namely the PBM index and the Xie-Beni index. It is also able to determine the number of clusters, K, in an automated manner; here, K lies between [2, √N], where N is the number of data points. It uses AMOSA [38] as the underlying optimization technique, which was inspired by the annealing behavior of metals. In the original VAMOSA, a point-symmetry-based distance was utilized for assigning data samples to different clusters. As the computation of the point-symmetry-based distance is time-consuming, and also to make a fair comparison with the other approaches used in the current study, we have used the Euclidean distance in VAMOSA for distance computation.

NSGA-II-Clust: NSGA-II-Clust [89, 146] is a multi-objective clustering technique, similar to VAMOSA [28], which optimizes the PBM index and Silhouette index simultaneously to determine clusters of good quality in an automated way. It is also capable of determining the number of clusters, K, without human participation; the value of K varies between [2, √N], where N is the number of data points. It uses NSGA-II [29] as the underlying optimization strategy. In [89], this algorithm was successfully applied to solve image segmentation problems.

SOGA: SOGA [145] is a single-objective clustering technique utilizing the search capabilities of a genetic algorithm (GA). The GA is utilized to optimize a single cluster validity index. In our experiments, SOGA-based clustering was executed multiple times with the number of clusters varying between 2 and √N, where N is the number of articles/documents. The final partitioning is selected based on the maximum value of the Dunn index as well as the minimum value of the Davies-Bouldin index.

3.3.4 Preprocessing

In order to clean the text data corresponding to these scientific articles and web documents, we have executed several preprocessing steps, including stop-word removal2 (e.g., is, am, are, etc.), removal of special characters (like @, !, etc.), punctuation symbols, numbers and white spaces, removal of words having length less than three, lower-case conversion (like Computer → computer) and stemming3 [148]. Stemming [148] is the process of converting inflected words into their morphological base forms, called word stems, base or root forms. The reason for performing stemming is to group together the inflected forms of a word so that they can be analyzed as a

2We have used the Python nltk toolkit [147] to remove the stop words, which are 153 in number. 3Here the SnowballStemmer [147] of nltk is used.

single item, which can help in the clustering of documents. In addition to these preprocessing steps, words which appear in fewer than 5% or in more than 95% of the articles are removed. Moreover, for the NIPS dataset, we have considered the title, abstract and paper text as the attributes of the given papers; for that purpose, the topmost 5, 30 and 150 words are selected from the title, abstract and paper text, respectively, which makes the vocabulary size 183. In the case of the AAAI 2013 data set, all the attributes are used, which makes the vocabulary size 673. For the WebKB dataset, preprocessed text documents are already available in [144], with a total vocabulary of size 7229.
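The preprocessing pipeline described above can be sketched with nltk roughly as follows (a simplified, illustrative version; it assumes the nltk stopwords corpus has been downloaded, and the frequency-based vocabulary filtering is omitted).

```python
import re
from nltk.corpus import stopwords      # requires the nltk 'stopwords' corpus
from nltk.stem import SnowballStemmer

STOP = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def preprocess(text):
    text = text.lower()                                   # lower-case conversion (Computer -> computer)
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop special characters, punctuation and numbers
    tokens = [t for t in text.split()
              if len(t) >= 3 and t not in STOP]           # remove short words and stop words
    return [STEMMER.stem(t) for t in tokens]              # stemming to the root form

print(preprocess("Clustering 403 NIPS papers is interesting!"))
# e.g. ['cluster', 'nip', 'paper', 'interest']
```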

Maximum number of generations (g_max): 50
Population size (|P|): 50
Initial learning rate (η0): 0.1
Initial neighborhood size (σ0): 2
Number of training iterations in SOM: |P|
Mating pool size (H): 5
DE control parameters (F1 and CR): 0.8, 0.8
Normal mutation probability (MP): [0, 0.6)
Insertion mutation probability (MP): [0.6, 0.8)
Deletion mutation probability (MP): [0.8, 1.0]

Table 3.1: Parameter setting for our proposed approach

3.3.5 Representation Schemas Used

To represent the scientific/web articles in vector form, tf (bag-of-words model using 1-grams) [41], tf-idf [41] and the popular word2vec [43, 51, 52] and GloVe [44] representation schemes, both with varying dimensions of 50, 100, 200 and 300, are used in the current study. More details about these representations are provided in Section 2.1.2 of Chapter 2. Note that in the case of word2vec/GloVe, the article/document vector is obtained by averaging the vector representations of all the vocabulary words present in the article. For the GloVe representation, we have utilized the pre-trained model available at https://github.com/stanfordnlp/GloVe, while for word2vec, we have used the gensim4 tool in Python to generate word vectors of varying dimensions.
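The averaging of word vectors into document vectors can be sketched with gensim roughly as below. This is illustrative only; the parameter name vector_size follows gensim 4 (older versions use size), and the toy corpus stands in for the preprocessed articles.

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_docs: list of token lists produced by the preprocessing step (toy example here)
tokenized_docs = [["cluster", "document", "topic"], ["neural", "network", "image"]]
model = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1, seed=1)

def doc_vector(tokens, model):
    # Average the word vectors of the vocabulary words present in the article.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(doc, model) for doc in tokenized_docs])   # (num_docs, 100) document matrix
```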

3.3.6 Parameter settings

MOCK [1] and SOGA [145] are executed with default parameters (codes provided by authors). Parameter settings of other algorithms are explained below.

1. SMODoc clust and MODoc clust: The different parameter values used in our proposed clustering technique are shown in Table 3.1. These parameters are selected after conducting a

4https://radimrehurek.com/gensim/models/word2vec.html


thorough sensitivity study. It is important to note that the mutation (normal, deletion and insertion) probabilities used here are the same as reported in the existing literature [38, 145, 149]. The same parameters are used in the MODoc clust approach (excluding the SOM parameters).

2. VAMOSA: This algorithm is executed with Tmax = 10, Tmin = 0.01, SL = 20 and HL = 10. Here, Tmax and Tmin denote the maximum and minimum values of the temperature, respectively. SL and HL are two parameters associated with the size of the archive; they denote the soft limit and hard limit on the archive size, respectively. Initially, the archive of AMOSA is initialized with SL solutions. During the process, the number of solutions in the archive can grow up to SL; once the number of solutions crosses the threshold SL, a clustering procedure is applied to reduce it to HL. At the end of the execution, an archive containing HL solutions is provided to the user. The rest of the parameter values are kept the same as reported in [28].

3. NSGA-II-Clust: The different parameters used in the NSGA-II based multi-objective clustering are: number of generations = 50, population size = 50, crossover probability = 0.8, mutation strength = 0.2; the normal (µn), insertion (µi) and deletion (µd) mutation probabilities are taken as µn < 0.7, 0.7 < µi ≤ 0.85 and µd ≥ 0.85, respectively.

3.4 Analysis of results obtained

In order to measure the goodness of the partitionings obtained by the proposed MOO-based approach, two internal cluster validity indices, namely the Dunn Index [59] and the Davies-Bouldin (DB) Index [60], are used. The numbers of clusters detected by the proposed algorithm for the different datasets are reported in Table 3.2 and Table 3.3. Detailed descriptions of the Dunn and DB indices are given in Table 2.1. The most relevant words of the different clusters (obtained using the Dunn index) corresponding to the optimal partitionings identified by the proposed approach for the NIPS 2015 and AAAI 2013 data sets are shown in Figure 3.7(a) and Figure 3.7(b), respectively. These keywords are extracted using the topic modeling tool Latent Dirichlet Allocation (LDA) [129].

3.4.1 Results on NIPS 2015 Articles

On the NIPS 2015 data set, our proposed approach performs better than all other existing approaches with the different representation schemes used. The results obtained are shown in Table 3.2 and Table 3.3. The best result, having DI = 0.64, was obtained using the word2vec model with an obtained number of clusters (OC) = 2, where each word vector is of 100 dimensions. On the other hand, the best value of the DB index, 0.1323, was obtained using the word2vec representation with the same number of clusters, i.e., 2,


Cluster 1: feedforward, stochastic, feature, exploring, exponentially, extracted, experimentally, expression, fed, accurate, feasible, extremely, model, falls, maximum

Cluster 2: deep, images, convolutional, training, Bayesian, network, bound, distribution, convolutional, algorithm, neural, optimization, matrix, graph

(a) NIPS 2015

Cluster 1: multi agent, network, image, approach, rank, constraint, classification, game, learning, clustering, heuristic, model, method, game, learning, dynamic, data

Cluster 2: constraint, hidden, markov, sentiment, algorithm, transportability, similarity, kernel, solver, agent, temporal, causal, data, selection, learning, random, environment, complexity, preference, application

Cluster 3: grammar, semantic, problem, minimax, structural, consistency, path, cluster, distance, euclidean, k-nn, measure, search, synchronous, property, dissimilarity, sentence, logical, uncover, heuristic, time

(b) AAAI 2013

Figure 3.7: Relevant cluster keywords for the (a) NIPS 2015 and (b) AAAI 2013 data sets, corresponding to the best partitioning result obtained by the proposed approach.

where each word vector is of 50 dimensions. Thus, it can be inferred that the optimal number of clusters for the NIPS dataset is 2. The relevant words extracted for the different clusters corresponding to the best result obtained by our approach are shown in Figure 3.7(a). This clearly indicates that the two clusters correspond to the topics of deep learning and computer vision, respectively. The major observations related to the obtained clusters at a fine-grained level are as follows: articles in cluster-2 correspond to deep convolutional neural networks applied to image data, while articles in cluster-1 correspond to simple feed-forward networks with stochastic optimization, in which features are extracted by the user and fed to the network. The Pareto optimal solutions obtained after application of our proposed framework are shown in Figure 3.8(a). Here, we can see that after completion of the maximum number of generations, the Pareto optimal front converges to only three to four non-dominated solutions. Each point in the Pareto optimal front of Figure 3.8(a) represents a non-dominated solution. Note that our proposed approach, SMODoc clust, attains the best results with the word2vec-based representation with dimension 100. MOCK is also executed with this configuration. The best result by MOCK corresponds to DI = 0.0151 and DB = 0.6401 with OC = 4. In most of the cases, the MODoc clust, VAMOSA, NSGA-II-Clust, SOGA, K-means and single-linkage algorithms fail to achieve good scores for this data set. K-means and single-linkage are well-known classical clustering techniques. Our proposed algorithm is based on the MOO concept, and in the literature [150, 151, 152],


it has already been proved that MOO is more effective than SOO; therefore, SOGA does not perform well. The rest of the approaches, namely MODoc clust, VAMOSA and NSGA-II-Clust, are based on the multi-objective optimization concept and utilize different optimization strategies: MODE [153], AMOSA [38] and NSGA-II [29]. However, none of these approaches explores the power of the self-organizing map as a tool for mating pool construction, as done in our proposed approach, SMODoc clust. Moreover, it is interesting to note that MODE has been ranked at the front5 in various competitions organized under the IEEE Congress on Evolutionary Computation (CEC) conference series. Our proposed algorithm is also based on the MODE concept. Note that for the NIPS 2015 articles, SOGA-based clustering does not converge after the fifth generation while using the tf and tf-idf based representation schemes. Therefore, for SOGA, the results obtained after the fifth generation are reported in Table 3.2 and Table 3.3.

Table 3.2: Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; #N: number of articles/documents; #F: vocabulary size; OC: obtained number of clusters; DI: Dunn Index; xx: all data points assigned to a single cluster. Each method column gives "OC, DI".

Rep. | #F | SMODoc clust | MODoc clust | VAMOSA | NSGA-II-Clust | SOGA | K-means | single-linkage

NIPS 2015 (#N = 403)
tf | 183 | 4, 0.2247 | 4, 0.1082 | 5, 0.1058 | 2, 0.0714 | 5, 0.0471 | 4, 0.0811 | 4, 0.0698
tf-idf | 183 | 5, 0.1844 | 4, 0.1623 | 7, 0.1081 | 2, 0.0738 | 2, 0.0832 | 5, 0.1388 | 5, 0.1494
word2vec | 50 | 4, 0.0732 | 5, 0.0397 | 2, 0.0366 | 2, 0.0121 | 4, 0.0258 | 4, 0.0268 | 4, 0.0401
word2vec | 100 | 2, 0.6414 | 6, 0.0282 | 2, 0.6121 | 2, 0.0111 | 2, 0.0069 | 2, 0.0059 | 2, 0.0116
word2vec | 200 | 2, 0.5657 | 8, 0.0445 | 9, 0.0292 | 2, 0.0123 | 2, 0.0039 | 2, 0.0090 | 2, 0.0106
word2vec | 300 | 2, 0.5723 | 8, 0.0445 | 11, 0.0252 | 2, 0.1676 | 3, 0.0048 | 2, 0.0058 | 2, 0.0085
glove | 50 | 5, 0.3096 | 5, 0.2953 | 7, 0.2674 | 2, 0.2660 | 10, 0.2900 | 5, 0.2601 | 5, 0.3124
glove | 100 | 5, 0.3884 | 4, 0.3714 | 4, 0.3533 | 2, 0.3187 | 8, 0.3833 | 5, 0.3103 | 5, 0.3593
glove | 200 | 4, 0.4104 | 2, 0.4099 | 3, 0.4097 | 2, 0.3829 | 8, 0.4068 | 4, 0.3753 | 4, 0.3443
glove | 300 | 4, 0.3778 | 4, 0.3598 | 7, 0.3669 | 2, 0.3539 | 4, 0.3111 | 4, 0.3647 | 4, 0.3509

AAAI 2013 (#N = 150)
tf | 673 | 4, 0.2948 | 4, 0.2948 | 4, 0.2948 | 2, 0.1860 | 4, 0.1328 | 4, 0.1961 | 4, 0.2635
tf-idf | 673 | 3, 0.5352 | 3, 0.5286 | 2, 0.5218 | 2, 0.5218 | 3, 0.1431 | 3, 0.4204 | 3, 0.3339
word2vec | 50 | 9, 0.1805 | 11, 0.1751 | 5, 0.1665 | 2, 0.1726 | 10, 0.0521 | 9, 0.0692 | 9, 0.0738
word2vec | 100 | 5, 0.1238 | 4, 0.0871 | 2, 0.1290 | 2, 0.0504 | 7, 0.0612 | 5, 0.1110 | 5, 0.0940
word2vec | 200 | 5, 0.1168 | 4, 0.0827 | 3, 0.0401 | 2, 0.0333 | 2, 0.0457 | 5, 0.1094 | 5, 0.1094
word2vec | 300 | 9, 0.1513 | 11, 0.1292 | xx, xx | 2, 0.0334 | 3, 0.0401 | 9, 0.0638 | 9, 0.0763
glove | 50 | 2, 0.3213 | 4, 0.3213 | 5, 0.2330 | 2, 0.2513 | 2, 0.3213 | 2, 0.3213 | 2, 0.3213
glove | 100 | 3, 0.4005 | 3, 0.4005 | 5, 0.2329 | 2, 0.2753 | 3, 0.0 | 3, 0.2433 | 3, 0.2470
glove | 200 | 3, 0.3323 | 3, 0.3640 | 2, 0.2461 | 2, 0.2848 | 2, 0.3135 | 3, 0.2588 | 3, 0.2588
glove | 300 | 4, 0.2346 | 3, 0.2233 | 4, 0.1338 | 2, 0.1429 | 2, 0.2080 | 4, 0.1578 | 4, 0.2319

WebKB (#N = 2803)
tf | 7229 | 2, 3.6423 | 3, 3.1248 | 3, 0.6710 | 2, 0.0069 | 4, 0.0038 | 2, 3.6423 | 2, 3.6423
tf-idf | 7229 | 3, 0.9174 | 10, 0.7450 | 3, 0.5610 | 2, 0.0059 | 4, 0.0012 | 3, 0.9174 | 3, 0.9174
word2vec | 50 | 4, 0.0452 | 4, 0.0452 | 3, 0.0424 | 2, 0.0493 | 4, 0.0308 | 4, 0.0452 | 4, 0.0480
word2vec | 100 | 4, 0.0474 | 4, 0.0474 | 5, 0.0469 | 2, 0.0463 | 2, 0.0424 | 4, 0.0474 | 4, 0.0426
word2vec | 200 | 5, 0.0464 | 5, 0.0449 | 2, 0.0985 | 2, 0.0454 | 3, 0.0 | 5, 0.0461 | 5, 0.0460
word2vec | 300 | 2, 0.0646 | 5, 0.0421 | 6, 0.0461 | 3, 0.0419 | 3, 0.0 | 2, 0.0445 | 2, 0.0607
glove | 50 | 4, 0.5871 | 2, 0.5637 | 3, 0.0597 | 2, 0.0601 | 2, 0.5129 | 4, 0.0430 | 4, 0.0643
glove | 100 | 4, 0.6909 | 4, 0.6189 | 6, 0.0400 | 2, 0.0462 | 2, 0.5780 | 4, 0.0468 | 4, 0.0541
glove | 200 | 3, 0.6107 | 3, 0.6391 | 3, 0.1613 | 2, 0.0530 | 2, 0.0727 | 3, 0.0640 | 3, 0.0698
glove | 300 | 4, 0.6325 | 4, 0.6325 | 6, 0.0461 | 2, 0.0621 | 2, 0.0 | 4, 0.0672 | 4, 0.0764

3.4.2 Results on AAAI 2013 Articles

On the AAAI 2013 data set, our proposed approach mostly performs better than all other existing approaches utilizing different representation schemes. The best result was obtained using the tf-idf representation, and the corresponding value of the Dunn index is 0.53 with OC=3. Only with the “tf” based representation scheme does MODoc clust perform similarly to the proposed algorithm. MOCK is also executed with the tf-idf based representation. The best solution obtained by MOCK

5http://www.ntu.edu.sg/home/epnsugan/index files/cec-benchmarking.htm


Table 3.3: Results obtained after application of the proposed clustering algorithm on text documents in comparison to other clustering algorithms. Here, Rep. denotes representation; N: number of scientific articles; F: vocabulary size; OC: obtained number of clusters; DB: Davies-Bouldin Index; xx: all data points assigned to a single cluster.
Data set (#N) | Rep. | #F | SMODoc clust (OC, DB) | MODoc clust (OC, DB) | VAMOSA (OC, DB) | NSGA-II-Clust (OC, DB) | SOGA (OC, DB) | K-means (OC, DB) | single-linkage (OC, DB)
NIPS 2015 (403) | tf | 183 | 3, 0.8171 | 3, 0.8192 | 8, 1.3949 | 2, 1.8226 | 3, 3.8074 | 3, 1.3051 | 3, 1.5270
NIPS 2015 (403) | tf-idf | 183 | 4, 0.8909 | 4, 1.9023 | 7, 1.5161 | 2, 1.6235 | 2, 2.8180 | 4, 1.3454 | 4, 1.4449
NIPS 2015 (403) | word2vec | 50 | 2, 0.1323 | 3, 0.1346 | 4, 0.3336 | 2, 1.6002 | 4, 0.5123 | 2, 0.6897 | 2, 0.6898
NIPS 2015 (403) | word2vec | 100 | 4, 0.4830 | 4, 0.4833 | 5, 0.4965 | 2, 1.9047 | 5, 0.4406 | 4, 0.6415 | 4, 0.6400
NIPS 2015 (403) | word2vec | 200 | 3, 0.4420 | 3, 0.4433 | 6, 0.4937 | 2, 2.0387 | 2, 0.7869 | 3, 0.6073 | 3, 0.5974
NIPS 2015 (403) | word2vec | 300 | 3, 0.4424 | 3, 0.4448 | 7, 0.4625 | 2, 1.8985 | 3, 0.6533 | 3, 0.5950 | 3, 0.5914
NIPS 2015 (403) | glove | 50 | 3, 1.7339 | 4, 1.8308 | 11, 2.1762 | 2, 1.4428 | 3, 2.4221 | 3, 2.3080 | 3, 2.6423
NIPS 2015 (403) | glove | 100 | 4, 1.5774 | 3, 1.6388 | 2, 2.1357 | 2, 1.6063 | 3, 2.4676 | 4, 2.7221 | 4, 2.5282
NIPS 2015 (403) | glove | 200 | 4, 1.6561 | 4, 1.6561 | 3, 2.7614 | 2, 2.0814 | 3, 2.1848 | 4, 2.9711 | 4, 2.6400
NIPS 2015 (403) | glove | 300 | 4, 1.8533 | 3, 1.8692 | 2, 2.5201 | 2, 1.9119 | 4, 5.6560 | 4, 2.9511 | 4, 2.8510
AAAI 2013 (150) | tf | 673 | 4, 1.4330 | 3, 1.4385 | 4, 1.1605 | 2, 1.8727 | 4, 1.8695 | 4, 1.8786 | 4, 1.9064
AAAI 2013 (150) | tf-idf | 673 | 4, 1.7145 | 3, 1.7788 | 2, 1.8407 | 2, 1.8929 | 4, 1.8486 | 4, 2.0155 | 4, 1.8986
AAAI 2013 (150) | word2vec | 50 | 3, 0.7356 | 3, 0.9981 | 5, 0.6382 | 2, 1.7318 | 5, 1.0032 | 3, 1.0308 | 3, 1.0242
AAAI 2013 (150) | word2vec | 100 | 3, 0.7170 | 2, 0.8773 | 2, 0.8161 | 1, 1.9175 | 5, 1.0271 | 3, 1.0259 | 3, 1.0353
AAAI 2013 (150) | word2vec | 200 | 3, 0.7276 | 3, 0.7452 | 3, 1.0674 | 2, 1.7372 | 2, 1.2772 | 3, 1.0142 | 3, 1.0294
AAAI 2013 (150) | word2vec | 300 | 3, 0.6879 | 3, 0.7054 | xx, xx | 2, 1.7372 | 3, 0.9644 | 3, 0.9885 | 3, 1.0076
AAAI 2013 (150) | glove | 50 | 3, 1.2799 | 4, 1.3200 | 5, 1.7573 | 2, 1.3644 | 3, 1.4252 | 3, 1.3475 | 3, 1.4138
AAAI 2013 (150) | glove | 100 | 4, 1.1374 | 3, 1.1822 | 5, 1.5257 | 2, 1.3644 | 3, 1.2513 | 4, 1.7296 | 4, 1.6525
AAAI 2013 (150) | glove | 200 | 4, 1.1970 | 4, 1.1970 | 2, 1.6171 | 2, 2.0304 | 3, 2.2181 | 4, 1.5871 | 4, 1.6124
AAAI 2013 (150) | glove | 300 | 4, 1.2884 | 4, 1.4062 | 4, 1.7796 | 2, 1.7294 | 3, 1.6864 | 4, 1.6865 | 4, 1.6291
WebKB (2803) | tf | 7229 | 3, 0.0206 | 3, 0.0206 | 3, 0.0678 | 2, 6.9621 | 3, 2.6846 | 3, 0.0646 | 3, 0.0646
WebKB (2803) | tf-idf | 7229 | 3, 0.0834 | 4, 0.0497 | 3, 0.0623 | 2, 23.757 | 4, 2.0806 | 3, 2.5467 | 3, 0.0522
WebKB (2803) | word2vec | 50 | 5, 1.1400 | 5, 1.1502 | 3, 1.5417 | 2, 2.4978 | 3, 1.8074 | 5, 1.3936 | 5, 1.5454
WebKB (2803) | word2vec | 100 | 5, 1.1457 | 5, 1.1448 | 4, 1.7018 | 4, 2.5136 | 2, 1.6088 | 5, 1.3867 | 5, 1.1367
WebKB (2803) | word2vec | 200 | 5, 1.1352 | 3, 1.1913 | 2, 0.6134 | 2, 2.5136 | 3, 2.5172 | 5, 1.3574 | 5, 1.5183
WebKB (2803) | word2vec | 300 | 5, 1.2220 | 5, 1.2203 | 6, 3.4282 | 3, 2.7561 | 2, 2.5237 | 5, 1.3442 | 5, 1.4609
WebKB (2803) | glove | 50 | 3, 0.5523 | 2, 0.8155 | 3, 2.6150 | 2, 2.2373 | 3, 1.9142 | 3, 1.9468 | 3, 2.2323
WebKB (2803) | glove | 100 | 3, 1.4299 | 2, 0.8687 | 6, 3.3422 | 2, 1.9867 | 2, 1.1582 | 2, 2.9522 | 2, 0.2911
WebKB (2803) | glove | 200 | 2, 0.1932 | 3, 1.3411 | 6, 1.2107 | 2, 2.6978 | 2, 1.4694 | 2, 0.3008 | 2, 0.3008
WebKB (2803) | glove | 300 | 3, 1.6632 | 3, 2.9034 | 6, 3.4282 | 2, 2.2490 | 2, 2.0660 | 3, 1.8072 | 3, 2.1201

corresponds to DI = 0.2684 and DB = 12.1723 with OC = 3. On the other hand, the minimum DB value obtained by our proposed approach is 0.6879 with the word2vec based representation scheme having 300 dimensions, and the corresponding number of clusters is 3. Thus, we can say that the optimal number of clusters for the AAAI dataset is 3. Similar to the NIPS 2015 data set, here also the SOGA based clustering does not converge within the fifth to eighth generations. Figure 3.7(b) clearly indicates the topics of the different clusters. All clusters are related to machine learning, but at a lower level of abstraction we can conclude that cluster-1 contains articles related to artificial intelligence, as words like multi-agent, game, heuristic method, etc. are predominant in this cluster. Cluster-2 corresponds to papers discussing different applications of machine learning approaches, for example applying Hidden Markov Models to sentiment analysis and other domains. Cluster-3 precisely corresponds to papers reporting applications of machine learning approaches, like K-nearest neighbor classifiers, for solving different natural language processing tasks; these articles discuss grammar, syntax and semantics, parsing, etc. The Pareto optimal solutions obtained by the proposed clustering approach are shown in Figure 3.8(b). Each point in the Pareto optimal front of Figure 3.8(b) represents a non-dominated solution. Again, the MODoc clust, VAMOSA, NSGA-II-clust, SOGA, K-means and single-linkage algorithms fail to achieve good scores for this data set in most of the cases (the reasons for failure are stated in Section 3.4.1).



Figure 3.8: Pareto optimal fronts obtained after application of the proposed clustering algorithm on scientific articles (a) NIPS 2015 ; (b) AAAI 2013 ; (c) WebKB datasets

3.4.3 Results on WebKB dataset

On the WebKB data set, our proposed approach performs better than all other existing approaches in most of the cases, utilizing different representation schemes. Out of the different dimensions used in the Word2vec based representation, a maximum DI value of 0.0474 and a minimum DB value of 1.1351 were obtained by our proposed approach using 200 dimensions with OC=5. On the other hand, using the Glove representation with varying dimensions, a maximum DI value of 0.6909 was obtained with OC=4 and 100 dimensions, while a minimum DB value of 0.1932 was obtained using the glove representation with 200 dimensions and OC=2. In Table 3.2, a maximum DI value of 3.6423 was obtained with the tf representation. After thorough investigation of this result, we found that this solution corresponds to a partitioning where more than 80% of the total documents are assigned to a single cluster, which in turn inflates the measured compactness and separation of the clusters; this results in a high value of the Dunn index. This partitioning was generated because of the sparsity of the document matrix (most components of each document vector are zero), which is of size 2803 × 7229.


Table 3.4: Values of the different components of the Dunn Index for the tf, tf-idf and Glove representation with 100 dimensions on the WebKB dataset. Here, Rep. denotes representation; OC: obtained number of clusters; DI: Dunn Index; a: minimum distance between two points belonging to different clusters; b: maximum diameter amongst the different clusters.
Rep. | OC | DI = a/b | a | b
tf | 2 | 3.6423 | 1010.2593 | 277.3699
tf-idf | 3 | 0.9174 | 806.7541 | 879.386
glove (100) | 4 | 0.6909 | 4.6481 | 6.727

A similar situation occurred with the tf-idf based representation. The best value of the Dunn index obtained is 0.6909, which corresponds to OC=4 with the Glove representation having 100 dimensions, whereas the best value of the obtained DB index is 0.1932 with OC=2. MOCK attains a best DB index value of 7.2509, which is greater than the minimum DB value obtained by our approach. In Table 3.4, the values of the numerator and denominator of the Dunn index corresponding to the tf, tf-idf and glove (100 dimensions) representations for this dataset are shown. The numerator measures the minimum distance between two points belonging to different clusters, while the denominator measures the maximum diameter amongst the diameters of the different clusters. It is clearly evident from Table 3.4 that for the tf and tf-idf representations, both the numerator and denominator values are very high compared to the Glove (100) representation. This is because the generated clusters are not proper/compact: there is one big cluster (containing 80% of the data points) and one or two small clusters. Because of the presence of the large cluster, the denominator value is high, and the cluster separation (numerator) is also high; thus the Dunn index value is high as well. This in turn shows that DI is not always a good measure of cluster quality, as it favours non-uniform sized clusters. Except for the cases of the Glove and Word2vec based representations with 100 dimensions, the proposed algorithm always beats the other algorithms and attains the best result.

Generally, with an increase in the dimension/size of the Word2vec/glove vector representation, the precision of capturing semantic information increases. With the increase in this size parameter, more data is required to train the models and to represent the concepts. However, in our work, due to the use of word2vec/glove averaging to represent the articles/documents, there is a loss of semantic information. Therefore, in Table 3.3 it can be seen that, with the increase in the vector length using word2vec/glove, instead of a decrease in the DB index values there are fluctuations in the results. A more robust representation is required to avoid the loss of semantic information, as the representation of a document plays a key role in defining the similarity/dissimilarity metric between documents, which in turn can help in clustering documents in an automated way. Therefore, we have also tried the Doc2vec6 representation.

6https://github.com/jhlau/doc2vec
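To make the decomposition in Table 3.4 concrete, the following minimal Python sketch (an illustrative helper, not the thesis code) computes the two Dunn index components — the minimum inter-cluster point distance a and the maximum cluster diameter b — so that the effect of one oversized cluster on DI = a/b can be reproduced.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_components(X, labels):
    """Return (a, b): minimum inter-cluster distance and maximum cluster diameter.

    One large, loose cluster inflates b, but if the remaining tiny clusters are
    far away, a also becomes large, which can make DI = a / b misleadingly high."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # a: smallest distance between points belonging to different clusters
    a = min(cdist(ci, cj).min()
            for i, ci in enumerate(clusters)
            for j, cj in enumerate(clusters) if i < j)
    # b: largest within-cluster diameter (maximum pairwise distance inside a cluster)
    b = max(cdist(c, c).max() for c in clusters)
    return a, b

# toy usage: one big spread-out cluster and one small, distant cluster
X = np.vstack([np.random.randn(80, 5) * 5.0, np.random.randn(4, 5) + 50.0])
labels = np.array([0] * 80 + [1] * 4)
a, b = dunn_components(X, labels)
print("DI =", a / b)
```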


Table 3.5: Results reporting the DB index value obtained after application of the proposed clustering algorithm on WebKB documents using the Doc2vec representation in comparison to other clustering algorithms. Here, Rep. denotes representation; N: number of scientific articles; F: vocabulary size; OC: obtained number of clusters; DB: Davies-Bouldin Index.
Data set (#N) | Rep. | #F | SMODoc clust (OC, DB) | MODoc clust (OC, DB) | VAMOSA (OC, DB) | NSGA-II-Clust (OC, DB) | SOGA (OC, DB) | K-means (OC, DB) | single-linkage (OC, DB)
WebKB (2803) | Doc2vec | 50 | 3, 2.3204 | 3, 3.0317 | 3, 3.6981 | 2, 3.9696 | 4, 3.3678 | 3, 3.6687 | 3, 4.2620
WebKB (2803) | Doc2vec | 100 | 2, 0.9723 | 2, 0.9729 | 4, 5.0457 | 2, 3.8375 | 2, 3.6676 | 2, 3.7273 | 3, 3.9529
WebKB (2803) | Doc2vec | 200 | 2, 0.9549 | 2, 1.0054 | 2, 2.6654 | 2, 3.1647 | 4, 3.9685 | 2, 4.0644 | 2, 3.8797
WebKB (2803) | Doc2vec | 300 | 2, 0.3217 | 3, 0.8023 | 5, 4.8537 | 2, 2.9372 | 2, 3.2979 | 2, 4.3355 | 2, 3.9873

Note that we have trained the Doc2vec model on the available WebKB documents, i.e., 4199 preprocessed documents, making use of pre-trained glove [44] word embeddings with a vocabulary size of 2.2M and 300-dimensional word vectors. The results are reported in Table 3.5. It can be inferred from the results obtained by the SMODoc clust, MODoc clust and NSGA-II-clust techniques (shown in Table 3.5) that, with the increase in the dimensionality of the vector representation, the quality of the clusters improves in terms of the DB index value (the lower the value, the better the cluster quality). However, for VAMOSA this is not the case. From these observations, it can be inferred that the quality of clusters depends not only on the algorithm but also on the type of objective functions (cluster validity indices in our case). In SMODoc clust, MODoc clust and NSGA-II-clust, two objective functions, namely the PBM and Silhouette indices, are used, while in VAMOSA, the PBM and Xie-Beni indices are used. Note that for the Doc2vec representation we have not reported the Dunn index, as it is biased towards non-uniform sized clusters, as discussed at the end of the first paragraph of this section.
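A minimal gensim-based sketch of training such a Doc2vec model is shown below; the toy corpus and parameter values are illustrative assumptions (the thesis relies on the jhlau/doc2vec implementation initialized with pre-trained GloVe vectors, which is not reproduced here).

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# preprocessed_docs: one token list per WebKB document (toy placeholder shown)
preprocessed_docs = [["machine", "learning", "course"], ["faculty", "research", "page"]]
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(preprocessed_docs)]

# 300-dimensional document vectors, mirroring the largest setting in Table 3.5
model = Doc2Vec(corpus, vector_size=300, window=5, min_count=1, epochs=40, workers=4)

# document embeddings that can be fed to the clustering algorithms
doc_vectors = [model.dv[i] for i in range(len(corpus))]  # model.docvecs in older gensim versions
```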

The non-dominated solutions present in the final Pareto optimal set obtained by the proposed clustering approach are shown in Figure 3.8(c). Note that for all the datasets, the number of fitness function evaluations equals 5100.

3.4.4 Results using XLNET Language Model

We have also tried the recently developed XLNET language model [154] to obtain the document embeddings. The DB index values and the numbers of clusters obtained for the three datasets, namely NIPS 2015, AAAI 2013 and WebKB, are (a) 1.13, 3; (b) 0.9065, 4; (c) 0.3724, 11, respectively. It is clear that these results are inferior to the best results for the corresponding datasets reported in Table 3.3. It is important to note that XLNET has become a new state-of-the-art language model, beating BERT [155]. However, it does not perform well here because XLNET is a contextualized model, whereas we have simply considered the average of the word embeddings obtained using the XLNET pre-trained model7.

7https://github.com/zihangdai/xlnet
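The averaged XLNet document embedding described above can be obtained roughly as in the following sketch using the Hugging Face transformers library; this is an assumed re-implementation for illustration, not the exact pipeline used in the thesis.

```python
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")

def document_embedding(text: str) -> torch.Tensor:
    """Average the contextual token embeddings of the last layer into one document vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # averaging discards most of the contextual information, which is the suspected
    # reason for the weaker clustering results reported above
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = document_embedding("Deep convolutional networks for image classification.")
print(vec.shape)  # torch.Size([768]) for the base model
```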


3.4.5 Statistical Significance

To further check the statistical significance of our approach, we have conducted a statistical hypothesis test, namely Welch's t-test, guided by [156], at the 5% (0.05) significance level. It checks whether the improvements obtained by the proposed SMODoc clust are statistically significant or happened by chance. The statistical t-test provides a p-value; a smaller p-value implies that the proposed multi-objective clustering approach is better than the others. In our experiment, p-values are calculated considering two groups: one group corresponds to the list of Dunn index values produced by our algorithm, and the other corresponds to the list of Dunn index values produced by some other algorithm. In this t-test, two hypotheses are considered: the null hypothesis, which states that there is no significant difference between the mean values of the two groups, and the alternative hypothesis, which states that there is a significant difference between the mean values of the two groups. The obtained p-values are shown in Table 3.6, which evidently supports the results of Table 3.2.
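Such a Welch's t-test can be run with SciPy as sketched below; group_a and group_b stand for the two lists of Dunn index values (proposed versus a competing algorithm), and the numbers shown are illustrative placeholders rather than values from Table 3.2.

```python
from scipy import stats

# Dunn index values over repeated runs (illustrative numbers only)
group_a = [0.64, 0.63, 0.65, 0.62, 0.66]   # proposed SMODoc_clust
group_b = [0.03, 0.04, 0.02, 0.05, 0.03]   # a competing algorithm

# equal_var=False selects Welch's t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3e}")
# p < 0.05 -> reject the null hypothesis of equal means; the improvement is significant
```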

Table 3.6: p-values obtained after conducting the t-test comparing the performance of the proposed SMODoc clust algorithm with other existing clustering techniques with respect to the Dunn index values reported in Table 3.2. Here, xx: values are absent in Table 3.2.
Data Set | Representation | #F | MODoc clust | VAMOSA | NSGA-II-Clust | SOGA | K-means | single-linkage
NIPS 2015 | tf | 183 | 3.01E-192 | 6.59E-190 | 7.89E-261 | 1.96E-307 | 2.28E-241 | 5.41E-264
NIPS 2015 | tf-idf | 183 | 4.13E-011 | 7.44E-099 | 3.77E-172 | 1.09E-104 | 4.47E-041 | 1.77E-25
NIPS 2015 | word2vec | 50 | 1.58E-023 | 1.24E-027 | 1.73E-68 | 5.21E-44 | 2.26E-042 | 2.99E-019
NIPS 2015 | word2vec | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NIPS 2015 | word2vec | 200 | 2.80E-021 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NIPS 2015 | word2vec | 300 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
NIPS 2015 | glove | 50 | 2.62E-005 | 9.51E-036 | 6.59E-038 | 5.33E-009 | 1.59E-047 | 0.2513
NIPS 2015 | glove | 100 | 4.70E-007 | 1.31E-025 | 2.31E-085 | 0.182621 | 1.35E-102 | 3.25E-018
NIPS 2015 | glove | 200 | 0.911417 | 0.961362 | 1.93E-016 | 0.38863 | 1.31E-025 | 3.47E-078
NIPS 2015 | glove | 300 | 8.99E-008 | 0.001650 | 9.009E-13 | 2.26E-079 | 0.000127372 | 8.49E-016
AAAI 2013 | tf | 673 | 0.7885 | 0.788494 | 2.79E-168 | 2.82E-283 | 1.65E-146 | 1.13E-18
AAAI 2013 | tf-idf | 673 | 0.0714026 | 8.69E-005 | 8.69E-05 | 0.0 | 3.72E-181 | 0.0
AAAI 2013 | word2vec | 50 | 0.049742 | 3.49E-06 | 0.006069 | 1.46E-213 | 1.64E-196 | 1.95E-167
AAAI 2013 | word2vec | 100 | 3.69E-30 | 3.49E-06 | 0.00606986 | 2.17E-194 | 1.97E-05 | 4.79E-21
AAAI 2013 | word2vec | 200 | 1.49E-26 | 3.06E-103 | 1.43E-117 | 1.14E-91 | 0.009659 | 0.009659
AAAI 2013 | word2vec | 300 | 1.10E-012 | xx | 2.05E-191 | 4.19E-177 | 4.19E-126 | 1.05E-99
AAAI 2013 | glove | 50 | 0.788494 | 0 | 1.99E-089 | 0.788494 | 0.788494 | 0.788494
AAAI 2013 | glove | 100 | 0.788494 | 6.93E-292 | 7.43E-207 | 0 | 7.30E-272 | 1.33E-264
AAAI 2013 | glove | 200 | 2.80E-021 | 2.52E-123 | 4.96E-047 | 9.69E-010 | 1.35E-096 | 1.35E-096
AAAI 2013 | glove | 300 | 0.000143 | 1.01E-154 | 4.10E-135 | 2.51E-17 | 1.89E-103 | 0.264497
WebKB | tf | 7229 | 0 | 0 | 0 | 0.788494 | 0.788494 | 0.788494
WebKB | tf-idf | 7229 | 0 | 0 | 0 | 0 | 0 | 0
WebKB | word2vec | 50 | 0.788494 | 0.2513 | 0.308194 | 1.91E-006 | 0.788494 | 0.541214
WebKB | word2vec | 100 | 0.788494 | 0.670639 | 0.539444 | 0.0662238 | 0.78849 | 0.076022
WebKB | word2vec | 200 | 0.45977 | 5.26E-052 | 0.560392 | 3.48E-045 | 0.717001 | 0.693676
WebKB | word2vec | 300 | 4.55E-013 | 1.71E-009 | 2.91E-13 | 1.34E-078 | 7.48E-011 | 0.135651
WebKB | glove | 50 | 5.94E-014 | 0.0 | 0.0 | 4.81E-098 | 0.0 | 0.0
WebKB | glove | 100 | 1.64E-093 | 0.0 | 0.0 | 9.56E-181 | 0.0 | 0.0
WebKB | glove | 200 | 1.99E-017 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
WebKB | glove | 300 | 0.788494 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0

3.4.6 Complexity of proposed framework

Let N be the number of F-dimensional feature vectors and g be the maximum number of generations.
1) The population is initialized using the K-means algorithm, which takes O(tNFK) time [41].


Table 3.7: Comparative complexity analysis of existing clustering algorithms. Here, R is the number of reference distributions [1]; K is the maximum number of clusters present in a data set, which is √N; N is the number of data points; TotalIter is the number of iterations used, chosen in such a way that the number of fitness evaluations of all the algorithms becomes equal.
Algorithm | Time complexity
SMODoc clust | O(gP(tNFK + MP))
MODoc clust | O(gP(tNFK + MP))
MOCK | O(N² log(N) F³ K² P² M R)
VAMOSA | O(KN log(N) · TotalIter)
NSGA-II-clust | O(gP(tNFK + MP))
SOGA | O(gtPNKF)
K-means | O(tNKF)
single-linkage | O(N² log(N))

Here, t is the number of iterations and K is the number of clusters. If there are P solutions, then for each solution we have to calculate M objective functions; thus the total complexity of initializing the population (including objective function calculation) is O(P(tNFK + M)).
2) The training complexity of the SOM is O(P²), as mentioned in [157].
3) Extraction of the neighborhood relationship for each solution takes O(P²) time because of the calculation of the Euclidean distance of each neuron with respect to the other neurons using the associated weight vectors, which form a P × P matrix.
4) The crossover and mutation operations of the differential evolution algorithm take constant time; they involve only some addition, subtraction or multiplication operations. This implies that new solution generation using crossover and mutation takes O(P) time, as a new solution is required to be generated for each solution in the population.
5) The K-means clustering steps are applied on each new solution and the objective functional values are calculated. This takes O(P(tNFK + M)) time.
6) Non-dominated sorting takes O(MP²) time, as for each objective a comparison is required between each solution and every other solution.
Thus the total run-time complexity is O(P(tNFK + M) + g(P² + P² + P + P(tNFK + M) + MP²)). Here, steps 2 to 6 are repeated for g generations.

⇒ O(P(tNFK + M) + g(2P² + P + P(tNFK + M) + MP²))
⇒ O(P(tNFK + M) + g(2P² + PtNFK + MP²))
⇒ O(P(tNFK + M) + g(MP² + PtNFK))
⇒ O((1 + g)PtNFK + PM(1 + gP))
⇒ O(gPtNFK + gMP²)
⇒ O(gP(tNFK + MP))


Thus, the total complexity of our proposed system is O(gP(tNFK + MP)). Similarly, the complexity of NSGA-II-clust can be analyzed. The total run-time complexity of NSGA-II-clust is O(P(tNFK + M) + g(P(tNFK + M) + MP²)). Here, the first term is for population initialization and calculation of the objective functional values, and in the second term, P(tNFK + M) + MP² accounts for the application of K-means clustering on each newly generated solution followed by the non-dominated sorting and crowding distance mechanisms [29]. On simplification, this also boils down to O(gP(tNFK + MP)).

Comparison of complexity analysis with other algorithms: We have compared the time complexities of the existing clustering algorithms; these are reported in Table 3.7. It is important to note that the reported complexities of the existing algorithms are taken directly from the reference papers. It can be seen from Table 3.7 that the time complexities of our proposed multi-objective automatic document clustering algorithm with SOM (SMODoc clust) and without SOM (MODoc clust) based operators are almost the same. The MOCK algorithm is more expensive than ours. NSGA-II-clust runs with the same complexity as our proposed system. On comparing SOGA and K-means, it was found that SOGA takes slightly more time, as it is based on the search capability of a genetic algorithm.

3.5 Chapter Summary

In this chapter, we have proposed a new automatic multi-objective document clustering approach utilizing the search capability of differential evolution. The current algorithm is a fusion of DE and SOM, where the neighbourhood information identified by a SOM trained on the current population of solutions is utilized for generating the mating pool, which can further take part in generating new solutions. To generate more diverse solutions, the concept of polynomial mutation is incorporated in the DE framework, which helps in convergence towards the global optimal solution. Two objective functions, both measuring the compactness and separation of clusters, are considered here and are optimized simultaneously to improve the cluster quality. The efficacy of the proposed multi-objective document clustering technique is shown in automatically partitioning two text document data sets containing scientific articles and one web-document data set. Results are compared with various state-of-the-art techniques, including single as well as multi-objective clustering algorithms. It was found that the proposed approach can reach the global optimal solution for all the data sets, whereas other algorithms got stuck at local optima. The results clearly show that the proposed framework is well-suited for partitioning the data sets in an automated manner.


Because document clustering can play a significant role in text summarization, the next chapter discusses a single document summarization technique. Each sentence is considered as a document and the document clustering technique is applied as a pre-processing step of summarization.

CHAPTER 4

Multi-objective Clustering based Framework for Extractive Single Document Summarization

In this second contributory chapter, we report how we developed an extractive single document text summarization (ESDS) system (ESDS SMODE) using the integration of multi-objective differential evolution (MODE) and a self-organizing map (SOM). The sentences present in the document are first clustered utilizing the concept of multi-objective clustering. Then, representative sentences are selected from the different clusters using several sentence scoring features to generate the summary. The proposed approach can automatically detect the number of sentence clusters present in a document.


4.1 Introduction

4.1.1 Overview

The rapid growth in the volume of text information available on the World Wide Web motivates the development of automatic text document summarization systems. In this direction, sentence based extractive summarization techniques [55, 107, 158] are useful because they are popularly used in producing a summary where informative sentences are selected from the document using sentence scoring features. These features include the position of the sentence in the document [105], the length of the sentence [105], the similarity of the sentence with respect to the title of the document [105], and so on. In this chapter, an extractive single document summarization (ESDS) technique is developed to summarize a single document, utilizing the concept of sentence clustering. To begin with, the sentences of the document are clustered into K clusters, 1 ≤ K ≤ N, where K is the number of clusters and N is the number of sentences present in the document. Thereafter, the sentences present in each cluster are ranked according to different aspects like the similarity of the sentences with the title [105], the length of the sentences [105], and other features. Based on this ranking, some sentences are selected from each cluster to generate the summary. It has been shown in the literature that the results obtained by multi-objective optimization (MOO) based clustering techniques are better than those of single objective optimization (SOO) based versions [152, 159]. Motivated by this, the problem of sentence clustering is also framed as a MOO-based clustering problem where sentence clusters are identified in an automatic way. To detect sentence clusters of good quality, multiple cluster quality measures are optimized simultaneously utilizing the multi-objective clustering framework developed for partitioning documents (here, each sentence is treated as a document) in Chapter 3. Note that for optimization, multi-objective differential evolution (MODE) is utilized, and a self-organizing map (SOM) based genetic operator is incorporated in the optimization process, as done in the previous chapter. Moreover, with the advent of deep learning, it is possible to measure the semantic similarity of sentences, and our approach builds on this concept. Here, sentence-to-sentence similarity is computed using the word-mover-distance [53] (for more on this, refer to Chapter 2, Section 2.1.3), which utilizes the deep-learning-based tool word2vec (as discussed in Section 2.1.2 of Chapter 2) to capture semantic similarity. The proposed approach is tested on two benchmark datasets, DUC2001 and DUC2002, in the domain of news articles. The results obtained are in terms of ROUGE scores (refer to Section


2.3 of Chapter 2) and are compared with various state-of-the-art techniques: MA-SingleDocSum [105], Unified Rank [160], DE [55], NetSum [109], CRF [96], QCS [101], Manifold Ranking [98] and SVM [95]. The results clearly show the superiority of the proposed approach.

4.1.2 Contributions

The key-contributions of the developed ESDS SMODE framework are listed below:

1. A multi-objective clustering technique is developed to cluster the sentences present in a document. Finally, several sentence scoring features are utilized to select some informative sentences from each cluster.

2. A semantic-based scheme is used to represent a sentence in the form of a vector. With the advent of the word2vec [43] model, semantic representations of the words are possible. We utilized this tool to further represent a sentence in the form of a vector.

3. To accurately calculate the similarity/dissimilarity between two sentences, Word Mover Distance (WMD) [53] is used, which also utilizes the word2vec [43] model to capture the semantic similarity between two sentences.

4. To detect clusters having different shapes/sizes, two well-known cluster validity indices are deployed as the optimization criteria.

5. The proposed technique is automatic in nature. It is capable of automatic determination of the number of sentence clusters from a given document.

6. Experiments are conducted on the gold standard DUC2001 and DUC2002 datasets for two evaluation metrics (ROUGE-1 and ROUGE-2). The results are compared with various state-of-the-art techniques and illustrate the potential of our proposed approach over the existing techniques.

4.2 Problem definition

In the proposed approach, we have formulated the ESDS problem as a sentence clustering problem using multi-objective optimization, in which the qualities of the sentence clusters are measured using two validity indices, PBM [61] and the Xie-Beni index [28]. Thus the ESDS problem is as follows:

Find a set of optimal sentence-clusters, {S1,S2,...,SK } in an automatic way which satisfies the following:

1. S_i = {s_1^i, s_2^i, ..., s_{np_i}^i}, where np_i is the number of sentences in cluster i, s_j^i is the jth sentence of cluster i, and N is the number of sentences in the document.


2. |∪_{i=1}^{K} S_i| = N and S_i ∩ S_j = ∅ for all i ≠ j.

3. Several cluster validity indices, Val_1, Val_2, ..., Val_M, computed on this partitioning have attained their optimum values.

In general, cluster validity indices measure some intrinsic cluster properties like compactness and separation in different ways. After the generation of good quality sentence clusters, some sentences from each cluster are extracted using various sentence scoring features to generate the summary.

4.3 Proposed Method

This section discusses in detail the proposed self-organized multi-objective differential evolution based ESDS approach. From now onwards, we will refer to it as ESDS SMODE. A flow chart of ESDS SMODE is shown in Fig. 4.1.

[Figure 4.1 flow chart: Single document → 1. Preprocessing → 2. Population initialization (P) → 3. SOM training → 4. Apply genetic operators to form new population P′ → 5. Merge P and P′ → 6. Select |P| solutions → 7. Update SOM training data → repeat steps 3-7 while g < gmax → 8. Obtain set of Pareto optimal solutions → 9. Generate summary and choose the best solution → END]

Figure 4.1: Flow chart of the proposed architecture, ESDS SMODE, where gmax is the user-defined maximum number of generations and g is the current generation number.

4.3.1 Representation of Solution and Population Initialization

In this step, the population P, consisting of solutions <x⃗_1, x⃗_2, ..., x⃗_{|P|}>, is initialized. Here, we assume that a solution can be encoded using the K representative sentences (cluster centers) of the document, where K is the number of sentence clusters, selected randomly; each solution may thus have a different number of cluster centers. It is important to note that in the previous chapter we performed document clustering, whereas here sentence clustering is performed, as we are working on a single document. As the solutions in the population further take part in SOM training, variable-length solutions1 are converted into fixed-length ones by appending zeros. For example, if the ith solution, x⃗_i, has K_i cluster centers, then (N − K_i) zeros are appended, where N is the number of sentences in the document as well as the maximum length of a solution.

1If a solution has three clusters and the maximum number of sentences in the document is 8, then the solution is represented as <c1, c2, c3, 0, 0, 0, 0, 0>, where ck is the cluster center (representative sentence) of the kth cluster Sk.
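A minimal sketch of this variable-length encoding with zero padding is given below; the function and parameter names are illustrative assumptions, not the thesis implementation.

```python
import random

def encode_solution(num_sentences: int, k_min: int = 2, k_max: int = 8) -> list:
    """Pick K random sentence indices as cluster centers and pad with zeros to length N."""
    k = random.randint(k_min, min(k_max, num_sentences))
    centers = random.sample(range(1, num_sentences + 1), k)  # 1-based sentence indices
    return centers + [0] * (num_sentences - k)               # zero padding keeps the length fixed

# initial population of 10 fixed-length solutions for a document with 8 sentences
population = [encode_solution(num_sentences=8) for _ in range(10)]
print(population[0])  # e.g. [3, 7, 1, 0, 0, 0, 0, 0]
```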


4.3.2 Assignment of Sentences to Sentence Clusters

In order to extract the sentence clustering corresponding to the ith solution, the K-medoid [47] algorithm (discussed in Section 2.1.1 of Chapter 2) is executed for some iterations with the present set of cluster centers initialized in the previous step. After each iteration of the K-medoid algorithm, the cluster representatives/centroids are updated, and this process continues until the cluster centroids converge. Note that the remaining sentences of the document are assigned to one of these K clusters based on the minimum distance criterion. In order to compute the dissimilarity between two sentences, the recently developed word-mover-distance (WMD) [53, 54] is utilized. If two sentences have a WMD of 0, then both sentences are identical. If sentence a has the minimum WMD [53] to cluster center b in comparison to the other cluster centers, then sentence a is assigned to the bth cluster. For more details about WMD, refer to Section 2.1.3 of Chapter 2.
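This assignment step can be sketched with gensim's Word Mover's Distance as follows; the model path, the tokenized sentences and the helper function are assumptions made for illustration only.

```python
from gensim.models import KeyedVectors

# pre-trained word2vec vectors (the file path is an assumption)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def assign_to_clusters(sentences, center_ids):
    """Assign every sentence (a list of tokens) to the cluster center with minimum WMD."""
    assignment = {}
    for i, tokens in enumerate(sentences):
        distances = [wv.wmdistance(tokens, sentences[c]) for c in center_ids]
        assignment[i] = center_ids[distances.index(min(distances))]
    return assignment

sentences = [["stock", "prices", "fell"], ["markets", "declined", "sharply"], ["the", "team", "won"]]
print(assign_to_clusters(sentences, center_ids=[0, 2]))
```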

4.3.3 Objective Functions Used

In order to measure the quality of the clustering encoded in a solution, two objective functions are utilized. In our framework, two cluster validity indices, the Xie-Beni (XB) index [28] and the PBM index [61], are utilized; both measure the compactness (intra-cluster distance) and separation (inter-cluster distance) of the clusters, but in different ways. Note that the proposed approach is generic in nature; any other combination of cluster validity indices could have been utilized as the objective functions. It was found in the literature that the combination of the PBM index and the XB index performs well in determining the optimal number of clusters for different data sets [161]. Inspired by this, the current set of objective functions is selected. The mathematical descriptions of these indices are provided in Table 2.1. Note that while evaluating these functions, WMD is utilized as the distance measure.

In order to determine the optimal number of clusters automatically from a given document, the pair {PBM index, 1/XB index} should be maximized using the search capability of the multi-objective differential evolution algorithm.
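For reference, the two validity indices can be computed roughly as in the sketch below, following their standard definitions (Euclidean distance is used here for brevity, whereas the proposed framework uses WMD); the function names are illustrative.

```python
import numpy as np

def pbm_index(X, labels, centers):
    """PBM = ((1/K) * (E1/EK) * DK)^2; larger values indicate better clustering."""
    K = len(centers)
    e1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()                    # scatter w.r.t. global centroid
    ek = sum(np.linalg.norm(X[labels == k] - centers[k], axis=1).sum() for k in range(K))
    dk = max(np.linalg.norm(centers[i] - centers[j])
             for i in range(K) for j in range(i + 1, K))                      # max center separation
    return ((e1 / ek) * dk / K) ** 2

def xie_beni_index(X, labels, centers):
    """XB = within-cluster squared scatter / (N * min squared center separation); smaller is better."""
    K, N = len(centers), len(X)
    within = sum((np.linalg.norm(X[labels == k] - centers[k], axis=1) ** 2).sum() for k in range(K))
    min_sep = min(np.linalg.norm(centers[i] - centers[j]) ** 2
                  for i in range(K) for j in range(i + 1, K))
    return within / (N * min_sep)

# the multi-objective framework then maximizes the pair {PBM, 1/XB}
```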

4.3.4 SOM Training

After forming the population, the solutions (in the solution space, not the objective space) take part in SOM training. This is necessary to capture the topological mapping of the solutions in the 2-dimensional space. For more details about this step, refer to Section 3.2.2 of Chapter 3.


4.3.5 Genetic Operators

In our framework, three types of genetic operators, namely mating pool selection, crossover/repairing, and mutation, are applied to generate new solutions from the present set of solutions in the population. Brief descriptions of these operators are given below:

Mating Pool Selection: In order to generate a new solution, the mating pool is constructed by considering the neighborhood solutions of the current solution retrieved using SOM. For the construction of the mating pool, given the current solution (x⃗_current), similar steps are followed as described in Section 2.1.12 of Chapter 2.

Crossover and Mutation: These steps are the same as those followed in the previous chapter. For more details, refer to Algorithm 5 of Chapter 2. It is important to note that a solution encodes cluster centers, which are representative sentences of the document. These cluster representatives are first converted to their corresponding vectors before performing the crossover operation. After applying polynomial mutation, the vector component values of a cluster center change, and the mutated cluster center (representative) may no longer correspond to a sentence of the document. To convert the mutated vector into a feasible sentence present in the document, the following steps are performed:

• Assign the sentences to these updated cluster centers based on minimum cosine distance2.

• Now, to find the representative of the pth cluster, the average cosine distance of each sentence assigned to that cluster to the remaining sentences of the same cluster is calculated. The sentence having the minimum average cosine distance to all other sentences in the same cluster becomes the representative.

Finally, the objective functional values (PBM and 1/Xie-Beni) of the new solution are calculated.
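A rough sketch of this repair step (mapping mutated, possibly infeasible centers back to actual sentences via cosine distance) is given below; sentence_vectors is an assumed array of averaged word vectors, and the helper names are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def repair_centers(mutated_centers, sentence_vectors):
    """Map mutated center vectors back to actual sentences of the document."""
    # step 1: assign every sentence to its nearest mutated center (minimum cosine distance)
    assignment = [int(np.argmin([cosine_distance(s, c) for c in mutated_centers]))
                  for s in sentence_vectors]
    # step 2: within each cluster, the sentence with minimum average cosine distance
    # to the other members becomes the new (feasible) representative
    repaired = []
    for k in range(len(mutated_centers)):
        members = [i for i, a in enumerate(assignment) if a == k]
        if not members:  # empty cluster: fall back to the single nearest sentence
            members = [int(np.argmin([cosine_distance(s, mutated_centers[k]) for s in sentence_vectors]))]
        avg = [np.mean([cosine_distance(sentence_vectors[i], sentence_vectors[j])
                        for j in members if j != i]) if len(members) > 1 else 0.0
               for i in members]
        repaired.append(members[int(np.argmin(avg))])
    return repaired  # indices of the representative sentences
```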

4.3.6 Selection of Best Solutions for Next Generation and Termination Cri- teria

These steps follow the same procedure as described in Section 3.2.6 and 3.2.7 of Chapter 3.

2Here, cosine distance(s⃗_i, v⃗) = 1 − cosine similarity(s⃗_i, v⃗), and cosine similarity(s⃗_i, v⃗) = (s⃗_i · v⃗)/(‖s⃗_i‖ ‖v⃗‖) measures the cosine of the angle between the two vectors s⃗_i and v⃗. Here, v⃗ is the updated cluster center in the form of a vector. If the angle is 0°, then s⃗_i and v⃗ overlap each other; if the angle is 180°, then the two sentences are opposite to each other.


4.3.7 Summary Generation (Sentence Extraction Module)

As any multi-objective optimization algorithm provides a set of equally important solutions on the final Pareto optimal front, called Pareto optimal solutions, there is a need to select the best solution based on the user's requirement. In our framework, first, the summary corresponding to each solution is generated, and then the best solution is selected using the best ROUGE-1 score. The summary generation steps for the ith solution are discussed below:

1. First, the document center is identified: the average WMD of each sentence with respect to the other sentences in the given document is calculated, and the sentence having the minimum average distance is selected as the document centroid (as shown below):

m = arg min_{1 ≤ i ≤ N} [ (1/O) Σ_{j=1, j≠i}^{N} dist_wmd(s_i, s_j) ]        (4.1)

where N is the number of sentences in the document, O is the total number of sentence pairs, given as N×(N−1)/2, s_i is the ith sentence, and m is the index of the document center (the mth sentence in the document).

2. Clusters present in the ith solution are ranked: the WMD of each cluster center present in the ith solution to the document center is calculated as z_k = dist_wmd(c_k, s_m), where c_k, 1 ≤ k ≤ K_i, is the kth cluster center of the ith solution. Finally, the clusters are ranked in descending order based on these z_k scores.

3. Calculate the sentence score in each cluster: To assign a score to each sentence of a cluster, four features are used here: the length of the sentence, the position of the sentence in the document, the similarity of the sentence with the title, and anti-redundancy. Formal descriptions of these features are given below:

• Length of the sentence (F1): The existing literature suggests that the shortest sentences are less likely to appear in the summary [162]. In this work, a normalization based sigmoid function is used which favours the longest sentences but does not completely rule out medium-length sentences.

L(s_i^k) = [1 − exp(−(l(s_i^k) − µ(l)) / std(l))] / [1 + exp(−(l(s_i^k) − µ(l)) / std(l))]

where l(s_i^k) is the length of sentence s_i, k is the cluster to which sentence s_i belongs, µ(l) is the mean length of the sentences in the kth cluster, and std(l) is the standard deviation of the sentence lengths in the kth cluster.

73 Multi-objective Clustering based Framework for Extractive Single Document Summarization

• Position of the sentence in the document (F2): In most documents, relevant sentences tend to be found in the title, the leading sentences of paragraphs, etc. It is expressed as p_i = 1/q_i, where q_i is the position of the ith sentence.

• Similarity with title (F3): It is calculated as sim_title(s_i^k) = dist_wmd(s_i^k, title), where title is the headline/title of the document to which sentence s_i^k belongs.

• Anti-redundancy (F4): In a summary, all sentences should be different from each other to reduce redundancy. Therefore, in each cluster, an anti-redundancy value is calculated for each sentence. It is expressed as antred(s_i^k) = Σ_{j=1, j≠i}^{|c_k|} dist_wmd(s_i^k, s_j^k), where s_i^k is the ith sentence in the kth cluster and |c_k| is the number of sentences in the kth cluster.

Finally, sentence score is calculated by assigning different weights to various factors (defined above) as:

sentence score(s_i) = α × F1 + β × F2 + γ × F3 + δ × F4        (4.2)

where α, β, γ and δ are the weights assigned to the different factors such that α + β + γ + δ = 1 (an illustrative sketch of these summary-generation steps is given after this list).

4. Arrange the sentences present in each cluster in descending order of their sentence scores.

5. Now, to generate the summary, clusters are considered rank-wise. Given a cluster, the top-ranked sentences are extracted sequentially until the summary length reaches a threshold (in terms of number of words).
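As referenced after Eq. (4.2), the following is a minimal sketch of the summary-generation module described in steps 1-5 above; dist_wmd (a precomputed N × N matrix of word-mover distances), feats (the per-sentence F1-F4 values), clusters and centers are assumed inputs, and the function mirrors Eqs. (4.1)-(4.2) rather than reproducing the thesis implementation.

```python
import numpy as np

def generate_summary(sentences, clusters, centers, dist_wmd, feats,
                     weights=(0.28, 0.31, 0.31, 0.10), max_words=100):
    """sentences: token lists; clusters: {cluster_id: [sentence indices]};
    centers: {cluster_id: center sentence index}; dist_wmd: N x N WMD matrix;
    feats: N x 4 matrix holding the F1..F4 values of every sentence."""
    n = len(sentences)
    # Eq. (4.1): document center = sentence with minimum average WMD to the others
    m = int(np.argmin([dist_wmd[i].sum() / (n * (n - 1) / 2) for i in range(n)]))
    # step 2: rank clusters by the WMD of their center to the document center (descending)
    ranked = sorted(clusters, key=lambda k: dist_wmd[centers[k], m], reverse=True)
    # steps 3-5: weighted sentence scores (Eq. 4.2), then extract top sentences cluster by cluster
    scores = feats @ np.asarray(weights)
    summary, length = [], 0
    for k in ranked:
        for i in sorted(clusters[k], key=lambda i: scores[i], reverse=True):
            if length + len(sentences[i]) > max_words:
                return summary
            summary.append(sentences[i])
            length += len(sentences[i])
    return summary
```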

4.4 Experimental Setup

This section covers the datasets used for our experiments, the evaluation measures, the methods considered for comparison with our proposed approach, and the parameter settings.

4.4.1 Datasets

For the evaluation of our proposed algorithm, gold standard data from the Document Understanding Conference3 for the years 2001 and 2002 are used. DUC2001 is a collection of 30 topics (such as natural disasters, biographical information, etc.) and DUC2002 is a collection of 59 topics. These contain 309 and 567 news reports (in the form of documents), respectively, written in English. These reports are taken from newspapers and news agencies such as the Associated Press and the Wall Street Journal. Each topic is accompanied by a reference/gold/actual summary for

3https://www-nlpir.nist.gov/projects/duc/data.html

the single document as well as the multi-document task. For single document summarization, the reference summary is approximately 100 words long. A brief description of the datasets used is given in Table 4.1.

Table 4.1: Brief description of the datasets used for single document summarization
 | DUC2001 | DUC2002
#Topics | 30 | 59
#Documents | 309 | 567
Source | TREC | TREC
Length of summary (in words) | 100 | 100

4.4.2 Evaluation Measure

For the evaluation of the results, we have used the ROUGE toolkit [134] in its version 1.5.5, as adopted by DUC for automatic summarization evaluation. It measures the number of overlapping N-gram units between the reference summary and the system (predicted) summary. For the mathematical definition of the ROUGE score calculation, refer to Section 2.3.2 of Chapter 2. In this work, the values of N considered are 1 and 2, providing ROUGE-1 and ROUGE-2, respectively.
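For illustration, ROUGE-N recall (the n-gram overlap between system and reference summaries) can be approximated with the short sketch below; the official evaluation in this work uses the ROUGE 1.5.5 toolkit, so this is only a simplified stand-in.

```python
from collections import Counter

def rouge_n_recall(system: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram overlap count divided by the number of reference n-grams."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sys_ngrams, ref_ngrams = ngrams(system), ngrams(reference)
    overlap = sum(min(count, sys_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / max(sum(ref_ngrams.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))
```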

4.4.3 Comparing methods

The results obtained using the proposed approach, ESDS SMODE, are compared with various state-of-the-art techniques: MA-SingleDocSum [105], FEOM [104], Unified Rank [160], DE [55], NetSum [109], CRF [96], QCS [101], Manifold Ranking [98] and SVM [95]. These techniques are discussed in the literature survey described in Section 2.2.2 of Chapter 2. Two other baselines, namely ESDS MGWO and ESDS MWCA, are also developed in this chapter, in which the multi-objective grey wolf optimizer (MGWO) [163] and the multi-objective water cycle algorithm (MWCA) [164], respectively, are used as the underlying optimization strategy. The Grey Wolf Optimizer (GWO) is a meta-heuristic algorithm proposed by Mirjalili et al. [165], based on the leadership hierarchy and hunting procedure of grey wolves in nature, while the Water Cycle Algorithm [166, 167] is a meta-heuristic algorithm that mimics the water cycle process in nature, i.e., the flow of rivers and streams to the sea and of streams to rivers. For ESDS MGWO and ESDS MWCA, the same steps as in our proposed algorithm ESDS SMODE are used, excluding the utilization of SOM as a mating pool construction tool.

4.4.4 Parameter setting

This section discusses the parameters used by our ESDS SMODE algorithm. The different parameter values are as follows:


• MODE parameters: population size (|P|) = 10, mating pool size = 5, maximum number of generations (gmax) = 10, crossover probability (CR) = 0.8, F = 0.8, distribution index (ηm) of polynomial mutation = 20, pm = 0.6, insertion mutation probability [0.6, 0.85], deletion mutation probability [0.86, 1].

• SOM parameters: initial neighborhood size (σ0) = 2, initial learning rate = 0.1, number of training iterations in SOM = |P|. A sensitivity analysis on these parameters can be found in ref. [168].

• Some other parameters used by our algorithm are: maximum number of fitness evaluations (NFEs) = 200; weights of the sentence scoring features: α = 0.28, β = 0.31, γ = 0.31, δ = 0.10; system summary length (in words) = 100. The values of these parameters are selected after consulting the existing literature; most existing works [105, 169] consider similar values for these parameters.

• After analyzing the reference summaries, it has been observed that a particular summary contains at least 4 sentences. Therefore, the minimum number of clusters is kept as 4 so that one sentence from each cluster can take part in the system summary. The maximum number of clusters is kept as 14.

• To represent the sentences in vector form, the word vectors of the constituent words are averaged. These word vectors are obtained from the word2vec4 model pre-trained on the GoogleNews corpus, and each word vector has 300 dimensions. Word Mover Distance also makes use of the same4 model to calculate the distance between two sentences.

The results obtained are averaged over 10 runs of the algorithm. We have also performed a sensitivity analysis on the MODE parameters (CR, F and ηm) using the Taguchi [170, 171, 172] method. Note that we have used the default parameters to run MGWO5 and MWCA6 for our summarization task. The codes of these algorithms are available online.

Tuning of MODE parameters used in ESDS SMODE: To tune the MODE parameters using the Taguchi method, three parameters, CR, F and ηm, are used. We have excluded the population size and the number of iterations (fitness evaluations) as parameters because our main objective is to show convergence towards the true Pareto optimal solutions in a small number of iterations. Further, the SOM parameters (like the number of training iterations and the number of neurons) depend on the population size in our task; therefore, these are also excluded from parameter tuning. The first parameter (to be tuned) is the crossover probability, lying in [0, 1]; the second parameter is a

4https://github.com/mmihaltz/word2vec-GoogleNews-vectors
5http://www.alimirjalili.com/GWO.html
6http://www.ali-sadollah.com/water-cycle-algorithm-wca

control factor which lies in [0, 2]; the third one is the distribution index used in polynomial mutation. In the literature [168, 173], the typical value of ηm lies in [19, 21]; therefore, we have kept the same range.

In this method, two kinds of factors are used: control factors, i.e., those whose optimal values are to be determined, and uncontrolled factors (also called noise factors). The main aim of this method is to determine the control factors by maximizing the response, which for our summarization task is the ROUGE-1 score (as we select the best solution in the population based on the best ROUGE-1 recall value). Due to the large number of documents available for the summarization task, we have considered 5 random documents from the DUC2001 dataset and 5 random documents from the DUC2002 dataset to tune the parameters.

Three levels are used for the possible values of these control parameters with the L9 orthogonal array of the Taguchi method. Each level corresponds to a set of different combinations of parameter values. These levels are shown in Table 4.2. Figure 4.2 shows the results generated by the Taguchi method. As can be seen from this figure, the optimized values of CR, F and ηm are 0.8, 0.8 and 20, respectively, which are used in our experiments to summarize the single documents.

Table 4.2: Experiment results of ESDS SMODE on different parameter combinations. The values of CR, F and eta corresponding to levels (1, 2, 3) are (0.4, 0.6, 0.8), (0.3, 0.8, 1.5) and (19, 20, 21), respectively. Here, SNRA is the Signal-to-Noise Ratio, and MEAN is the mean of the uncontrolled factor values (ROUGE-1 scores) over the different documents.
Run | CR | F | eta | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | doc7 | doc8 | doc9 | doc10 | SNRA | MEAN
1 | 1 | 1 | 1 | 0.40299 | 0.32759 | 0.44805 | 0.32000 | 0.72656 | 0.30357 | 0.49275 | 0.40000 | 0.38462 | 0.43038 | -8.16215 | 0.42365
2 | 1 | 2 | 2 | 0.42405 | 0.32759 | 0.46667 | 0.33125 | 0.68382 | 0.30822 | 0.49275 | 0.39080 | 0.41333 | 0.45205 | -7.96725 | 0.42911
3 | 1 | 3 | 3 | 0.40299 | 0.32759 | 0.44156 | 0.36567 | 0.68382 | 0.30822 | 0.51429 | 0.46341 | 0.39063 | 0.41489 | -7.88930 | 0.43137
4 | 2 | 1 | 2 | 0.40299 | 0.32759 | 0.48592 | 0.33333 | 0.61594 | 0.30822 | 0.52703 | 0.45161 | 0.36000 | 0.42308 | -8.03722 | 0.42363
5 | 2 | 2 | 3 | 0.43846 | 0.32759 | 0.46000 | 0.30328 | 0.72656 | 0.30822 | 0.49275 | 0.44444 | 0.37681 | 0.40000 | -8.15299 | 0.42787
6 | 2 | 3 | 1 | 0.40299 | 0.32759 | 0.42157 | 0.35000 | 0.71928 | 0.30822 | 0.46333 | 0.42141 | 0.36333 | 0.41489 | -8.12724 | 0.41932
7 | 3 | 1 | 3 | 0.40299 | 0.33077 | 0.45570 | 0.33333 | 0.61475 | 0.30822 | 0.48485 | 0.42500 | 0.36709 | 0.42308 | -8.13198 | 0.41464
8 | 3 | 2 | 1 | 0.42258 | 0.33861 | 0.46805 | 0.38235 | 0.64498 | 0.30822 | 0.48858 | 0.46154 | 0.37681 | 0.44078 | -7.72244 | 0.43407
9 | 3 | 3 | 2 | 0.42543 | 0.33726 | 0.46805 | 0.33333 | 0.72656 | 0.30822 | 0.51429 | 0.47532 | 0.37681 | 0.43740 | -7.79790 | 0.44115

Table 4.3: ROUGE scores of different methods on the DUC2001 and DUC2002 data sets
Method | DUC2001 Avg. ROUGE-2 | DUC2001 Avg. ROUGE-1 | DUC2002 Avg. ROUGE-2 | DUC2002 Avg. ROUGE-1
ESDS SMODE | 0.21450 | 0.45214 | 0.34132 | 0.49117
ESDS MGWO | 0.15228 | 0.37108 | 0.18838 | 0.41849
ESDS MWCA | 0.14997 | 0.36702 | 0.18812 | 0.41800
MA-SingleDocSum [105] | 0.20142 | 0.44862 | 0.22840 | 0.48280
DE [55] | 0.18523 | 0.47856 | 0.12368 | 0.46694
UnifiedRank [160] | 0.17646 | 0.45377 | 0.21462 | 0.48487
FEOM [104] | 0.18549 | 0.47728 | 0.12490 | 0.46575
NetSum [109] | 0.17697 | 0.46427 | 0.11167 | 0.44963
CRF [96] | 0.17327 | 0.45512 | 0.10924 | 0.44006
QSC [101] | 0.18523 | 0.44852 | 0.18766 | 0.44865
SVM [95] | 0.17018 | 0.44628 | 0.10867 | 0.43235
Manifold Ranking [98] | 0.16635 | 0.43359 | 0.10677 | 0.42325


Figure 4.2: Results generated by the Taguchi method. Here, SN is the Signal-to-Noise Ratio, which we have to maximize. SN is maximum for CR, F and ηm at levels 3, 2 and 2, respectively.

4.5 Experimental Results and their Discussion

4.5.1 Comparison with Existing algorithms

Table 4.3 presents the ROUGE scores obtained by our proposed approach, ESDS SMODE, and different state-of-the-art methods on the DUC2001 and DUC2002 data sets. It can be seen that our approach outperforms all other approaches on both data sets with respect to the ROUGE-2 measure. On comparing ROUGE-1 for DUC2001, our system is only 0.0457 points lower than the best system (DE), while it differs from the other systems by 0.027 on average. For DUC2002, the ROUGE-1 obtained by our approach is better than the state-of-the-art methods. Our method is fully unsupervised in nature, as compared to supervised systems like CRF [96], QCS [101], Manifold Ranking [98] and SVM [95] developed for text summarization. FEOM [104], DE [55] and MA-SingleDocSum [105] make use of evolutionary algorithms to optimize a single objective or a weighted sum of objectives to achieve a better ROUGE score, whereas our algorithm is based on the simultaneous optimization of two objectives. These algorithms (FEOM, DE and MA-SingleDocSum) make use of reproduction operators like random selection, tournament selection, etc., similar to single-objective optimization problems, to generate solutions using crossover/mutation operations. The use of a self-organizing map based reproduction operator in our proposed algorithm, ESDS SMODE, helps in generating good-quality solutions [168]. Note that, to the best of our knowledge, SOM was never utilized before in text summarization frameworks based on evolutionary algorithms.

Also, none of the compared approaches uses word mover distance to capture the semantics between sentences. After comparing ESDS SMODE with ESDS MGWO and ESDS MWCA, it was found that our approach performs better than these techniques. Thus it can be concluded from the obtained results that the use of WMD as the sentence dissimilarity measure and of self-organized multi-objective differential evolution for sentence clustering indeed helps in achieving improved performance. A higher value of ROUGE-2 implies that the word orderings present in the summary produced by our system follow the reference summary more closely, which also indicates the increased fluency of the produced summary. ROUGE-1, on the other hand, does not take fluency into account because it simply counts the number of overlapping 1-gram units between the reference and candidate summaries and cannot determine whether the result is coherent or the sentences flow together in a sensible manner. As our proposed algorithm is based on an evolutionary concept, it provides a set of Pareto optimal solutions at the end, lying on the Pareto optimal front. The Pareto fronts obtained over three documents of the DUC2001 dataset are shown in Figure 4.3. On comparing CPU time, it was found that ESDS SMODE takes more time, i.e., on average 78.99 seconds/document for the DUC2001 dataset and 50.34 seconds/document for the DUC2002 dataset.

Table 4.4: Improvements obtained by our proposed approach over other methods based on the ROUGE-2 score
Methods | Improvement on DUC2001 (%) | Improvement on DUC2002 (%)
ESDS MGWO | 40.86 | 81.19
ESDS MWCA | 43.03 | 81.44
MA-SingleDocSum | 6.49 | 49.44
DE | 15.80 | 175.97
UnifiedRank | 21.56 | 59.03
FEOM | 15.64 | 173.27
NetSum | 21.21 | 205.65
CRF | 23.80 | 212.45
QSC | 15.80 | 81.88
SVM | 26.04 | 214.09
Manifold Ranking | 28.94 | 219.68

4.5.2 Improvements obtained

Improvements obtained (IO) by the proposed method ESDS SMODE (as it is better than the others) based on the ROUGE-2 score for both datasets are shown in Table 4.4. Mathematically, IO is calculated as:

IO = ((ProposedMethod − OtherMethod) / OtherMethod) × 100        (4.3)

In Table 4.4, a comparison with the MA-SingleDocSum, ESDS MGWO and ESDS MWCA



Figure 4.3: Pareto Fronts obtained by ESDS SMODE over three documents of DUC2001 dataset. In (b) and (c), all solutions are of rank-1.

approaches on the DUC2001 dataset shows that our approach improves over them by 6.49%, 40.86% and 43.03%, respectively, in terms of the ROUGE-2 score, while for the DUC2002 dataset, our approach improves by 49.44%, 81.19% and 81.44%, respectively, with respect to the ROUGE-2 score.

Table 4.5 shows the improvements obtained by our approach ESDS SMODE in comparison with other methods using the ROUGE-1 score for the DUC2002 dataset. Here, the improvements obtained by our approach with respect to the UnifiedRank, ESDS MGWO and ESDS MWCA approaches are 1.30%, 17.37% and 17.50%, respectively. In Table 4.6, a comparison is made between DE and other methods using the ROUGE-1 score on the DUC2001 dataset, as DE achieves the highest ROUGE-1 score on DUC2001. It can be seen that DE improves over the FEOM, ESDS MGWO and ESDS MWCA methods by 0.27%, 28.96% and 30.29%, respectively.


Table 4.5: Improvements obtained by our proposed approach over other methods using the ROUGE-1 score on the DUC2002 dataset
Methods | DUC2002 (%)
ESDS MGWO | 17.37
ESDS MWCA | 17.50
MA-SingleDocSum | 1.73
DE | 5.19
UnifiedRank | 1.30
FEOM | 5.46
NetSum | 9.24
CRF | 11.61
QSC | 9.48
SVM | 13.60
Manifold Ranking | 16.05

Table 4.6: Improvements obtained by DE over other methods using the ROUGE-1 score on the DUC2001 dataset
Methods | DUC2001 (%)
Proposed approach | 5.84
ESDS MGWO | 28.96
ESDS MWCA | 30.29
MA-SingleDocSum | 6.67
UnifiedRank | 5.46
FEOM | 0.27
NetSum | 3.08
CRF | 5.15
QSC | 6.70
SVM | 7.23
Manifold Ranking | 10.37

Figure 4.4: An example of a good-quality generated summary with respect to the reference summary for the document AP880316-0061 of topic d21d under the DUC2001 dataset.

4.5.3 Analysis of Results

For error analysis, we have considered some random documents from the datasets. Some examples of good and bad quality summaries obtained by the ESDS SMODE approach are also illustrated in this work. In Figure 4.4, an example of a predicted summary is shown with respect to the actual summary of document AP880316-0061 of topic d21d under the DUC2001 dataset. Matched sentences are shown in the same color. Here, the predicted summary covers most of the information in the reference summary, having 0.8918 and 0.7500 as ROUGE-1 and ROUGE-2 recall scores, respectively, and is thus considered a good summary. Figure 4.5, in contrast, shows an example of a predicted summary which does not seem to be good; it obtains 0.3823 and 0.1025 as ROUGE-1 and ROUGE-2 recall scores, respectively.


Figure 4.5: An example of a low-quality summary. (a) Some sentences of the document AP891101-0150 of topic d16c under the DUC2001 dataset. (b) Reference summary and predicted summary of the same document.

In part (a) of Figure 4.5, some sentences of the document are shown, and the underlined sentences/words are part of the actual summary. In part (b) of Figure 4.5, the actual and predicted summaries are shown. Here, the underlined words of the document form sentences in the actual summary, as it was generated by human annotators. Our proposed system, however, performs sentence-based extractive summarization; it is able to select original sentences from the document but cannot modify them. For this reason, the ROUGE score is very low. Some of the documents

in the dataset have this type of property. For the DUC2001 data set, our error analysis reveals the following possible reason behind the low ROUGE-1 value of our proposed approach: in general, a summary rarely contains questions, but in a few of the generated summaries, questions are selected by our proposed approach, which decreases the ROUGE score. To rectify this, another sentence scoring feature should be used that considers the type of sentence. Another observation for both datasets is that the sentence ordering in the generated summary usually plays a significant role for better readability and fluency. But in our approach, the sentences in the generated summary are arranged based on the rankings of the corresponding clusters. Thus, we may not get the sentences in the order in which they appear in the original document, which may decrease the readability of the summary. Therefore, the sentence arrangement must be taken into account to improve readability; some post-processing steps can be applied to rearrange the sentences selected for the summary to increase fluency.

4.5.4 Statistical significance t-test

To prove that the results obtained by the proposed approach ESDS SMODE are statistically significant, we have also conducted a statistical t-test [156] at the 5% significance level. This t-test provides p-values; the smaller the p-value, the more significant our result is. A more detailed description of the t-test can be found in Section 3.4.5 of Chapter 3. To conduct this test, two groups are considered: the first group contains the list of ROUGE-1 (ROUGE-2) scores produced by our SOM based approach, and the other group contains the list of ROUGE-1 (ROUGE-2) scores of an existing method. The p-values obtained are: a) using ROUGE-2 for DUC2001, less than 0.00001; b) using ROUGE-1 for DUC2001, 0.000043; c) using ROUGE-2 for DUC2002, less than 0.00001; d) using ROUGE-1 for DUC2002, 0.004183. The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.

4.5.5 Complexity of the proposed framework

Let N be the number of input sentence vectors and g be the maximum number of fitness evaluations.
1) The first step of our algorithm is population initialization, performed using the K-medoid clustering algorithm run multiple times with varied numbers of clusters to generate the solutions. The K-medoid algorithm takes O(tK(N − K)²) time [174]. Here, K is the number of clusters and t is the number of iterations needed to converge. If there are P solutions, then for each solution we need to calculate M objective functions; therefore, the total complexity of initializing the population (including objective function calculation) is O(P(tK(N − K)² + M)).


2) The solutions in the population undergo SOM training, which takes O(P^2) time [157]. 3) Mating pool construction takes O(P^2) time, as the neighborhood relationship (NR) must be extracted for each solution using the trained SOM. 4) Other genetic operators like crossover and mutation take constant time per solution, as they involve only a few arithmetic computations; generating a new solution for each solution in the population therefore takes O(P) time. 5) The K-Medoid clustering steps are applied to each new solution and the objective function values are calculated, which takes O(tK(N − K)^2 + M) time. 6) Non-dominated sorting and crowding distance calculation take O(MP^2) time [29] in the worst case, as for each objective every solution must be compared with every other solution. Thus, the total run-time complexity is

O(P(tK(N − K)^2 + M) + g(P^2 + P^2 + P + P(tK(N − K)^2 + M) + MP^2))

Here, steps 2 to 6 are repeated for up to g fitness evaluations.

⟹ O(P(tK(N − K)^2 + M) + g(2P^2 + P + P(tK(N − K)^2 + M) + MP^2))
⟹ O(P(tK(N − K)^2 + M) + gP + gP^2(2 + M) + gP(tK(N − K)^2 + M))
⟹ O(PtK(N − K)^2(1 + g) + gP + gMP^2)
⟹ O(gPtK(N − K)^2 + gP(1 + MP))
⟹ O(gPtK(N − K)^2 + gMP^2)
⟹ O(gP(tK(N − K)^2 + MP))

Thus, the total complexity of our proposed system is O(gP(tK(N − K)^2 + MP)).

4.6 Chapter Summary

In this chapter, a clustering-based framework for extractive single document text summarization (ESDS) is proposed utilizing MODE and SOM. The developed approach first uses a sentence clustering technique to partition the available sentences in semantic space in an automated way. Two sentence-cluster quality measures are optimized simultaneously using different multi-objective search techniques. Finally, representative sentences are selected from the different sentence clusters using some sentence scoring features. Results on standard datasets prove the efficacy of the proposed technique compared to state-of-the-art techniques in terms of the ROUGE-2 score. Our best approach improves by 6.49% for DUC2001, while for the DUC2002 dataset our best approach improves by 49.44% over the best existing approach, namely,


MA-SingleDocSum. In terms of the ROUGE-1 measure for the DUC2002 dataset, our best approach improves by 1.30% over the UnifiedRank approach. In future work, we plan to (a) extend the current approach for automatic adaptation of various parameters; (b) explore various sentence representation schemes; (c) explore different sentence similarity/dissimilarity measures that will help in forming sentence clusters; and (d) solve the same task using a binary optimization framework. Accordingly, in the next chapter, the problem of single document summarization is solved considering items (c) and (d).


CHAPTER 5

Extractive Single Document Summarization using Multi-objective Binary Differential Evolution

In this chapter, we formulate the problem of extractive single document text summarization as a binary optimization problem. Different statistical features (objective functions) like cohesion, readability, coverage, and the similarity of a sentence with the title, among others, are considered while generating the summary. Self-organizing map (SOM) based genetic operators are incorporated in the optimization process to assess the resulting performance improvements. As the choice of similarity or dissimilarity measure between sentences plays an influential role in any summarization process, different existing measures like normalized Google distance, word mover distance, and cosine similarity are explored in this work.


5.1 Introduction

5.1.1 Overview

In the last chapter, the task of extractive single-document summarization was solved using a multi-objective clustering framework. In this chapter, the same task is solved differently: the summarization problem is treated as a binary optimization problem where different quality measures (or objective functions) of the summary are optimized simultaneously. These objective functions include the position of the sentence in the document, the similarity of a sentence with the title, the length of the sentence, cohesion, readability, and coverage. Multi-objective binary differential evolution (MOBDE) [46] is used as the underlying optimization strategy to optimize all objective functions simultaneously, where each chromosome (or solution) is a binary string representing a set of possible sentences to be selected in the generated summary. Optimizing multiple objective functions helps in generating a good-quality summary for a given document and in attaining a better ROUGE score. As in previous chapters, SOM is utilized for constructing the mating pool and for exploring the search space efficiently. To show that the performance of the proposed summarization technique depends not only on the objective functions considered but also on the type of sentence similarity/dissimilarity function used, experiments are conducted by varying the similarity/dissimilarity measures: normalized Google distance (NGD) [175], word mover distance (WMD) [54], and cosine similarity [55]. The proposed approach is tested on two standard text summarization datasets, namely DUC2001 and DUC2002 (https://www-nlpir.nist.gov/projects/duc/data.html). The results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques.

5.1.2 Contributions

The major contributions of the current work are enumerated below:

• In the literature, the ESDS problem is often formulated as a single objective optimization problem with the weighted sum of different objectives [105, 55], and this is popularly solved using different EA techniques. However, in this work, the summarization problem is treated as a multi-objective optimization problem where various aspects of a summary like readability, the similarity of the sentences in the summary with the title, and others are optimized simultaneously.

• In the existing multi-objective evolutionary algorithms for summarization task, usually,


reproduction operators like roulette wheel selection and tournament selection [29], which are popularly used in single-objective optimization frameworks, are used to generate new solutions. However, in the current study, some newly designed self-organizing map based genetic operators are used to generate high-quality solutions, which further help in reaching globally optimal solutions more quickly.

• In order to show that the performance of the summarization system depends not only on the objective functions used but also on the type of similarity/dissimilarity measure used between sentences, three types of similarity/dissimilarity measures (normalized Google distance [175], word mover distance [54], and cosine similarity [55]) are explored in this work; a small illustrative sketch of these measures is given after this list.

• Most of the papers on summarization using several optimization strategies make use of an actual summary to report the results. Nevertheless, in real-time situations, the actual summary may not be available. Therefore, in this work, we explored various unsupervised strategies for selecting a single solution from the final Pareto optimal front produced by any multi-objective optimization-based technique.
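The sketch below illustrates the three measures for tokenized sentences. It is a hedged sketch only: the function names, the word-pair statistics used for NGD, and the word2vec file path are illustrative assumptions, not the thesis implementation.

# Hedged sketch of the three sentence similarity / dissimilarity measures
# explored in this chapter (names and aggregation choices are assumptions).
import math
from collections import Counter

import numpy as np


def cosine_similarity(tokens_a, tokens_b):
    # Cosine similarity between the term-frequency vectors of two tokenised sentences.
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    va = np.array([ca[w] for w in vocab], dtype=float)
    vb = np.array([cb[w] for w in vocab], dtype=float)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0


def ngd(x, y, doc_freq, co_freq, n_docs):
    # Standard normalized Google distance between two words, computed here from
    # document frequencies, co-occurrence counts, and the total number of documents.
    fx, fy, fxy = doc_freq.get(x, 0), doc_freq.get(y, 0), co_freq.get((x, y), 0)
    if fx == 0 or fy == 0 or fxy == 0:
        return 1.0  # treat missing counts as highly dissimilar
    num = max(math.log(fx), math.log(fy)) - math.log(fxy)
    den = math.log(n_docs) - min(math.log(fx), math.log(fy))
    return num / den


# Word mover distance via gensim, using the pre-trained GoogleNews vectors
# mentioned in Section 5.5.4 (smaller distance = more similar sentences):
#   from gensim.models import KeyedVectors
#   wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
#   distance = wv.wmdistance(tokens_a, tokens_b)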

5.2 Statistical Features or Objective Functions

To obtain a good summary, the selection of objective functions (quality functions on sentences) is crucial. These objective functions assign fitness values to the sentences and thus help in improving the quality of the generated summary. The set of objective functions used in our approach is: the position of the sentence in the document, the similarity of a sentence with the title, the length of the sentence, cohesion, coverage, and readability. The first five objective functions are selected motivated by [105], whose authors optimized a weighted sum of these five objectives and showed that their results are better than state-of-the-art results. However, combining the values of different objective functions into a single value using weighted criteria may not be meaningful [31]. Moreover, in any text summarization system readability is an important factor, as the generated summary should be readable to end-users; therefore, in our approach the readability feature is considered as a sixth objective function. All of these objective functions have to be maximized simultaneously using a multi-objective optimization framework instead of a weighted-sum approach. Brief descriptions of these objective functions are provided below:


5.2.1 Sentence Position

In any document, regardless of domain, relevant/informative sentences tend to be found in particular sections of the document, such as the leading paragraph. To take this information into account, sentence position [176, 177] is expressed as:

p = \sum_{\forall s_i \in Summary} \sqrt{1 / q_i} \qquad (5.1)

where q_i is the position of the i-th sentence. This measure assigns higher scores to the initial sentences of the document; as the position of a sentence in the document increases, its contribution to p decreases.
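A minimal sketch of this objective, assuming the (1-based) positions of the selected sentences are known; the function name is illustrative.

import math

def sentence_position_score(positions):
    # Eq. 5.1: sum of sqrt(1 / q_i) over the positions q_i of the summary sentences.
    return sum(math.sqrt(1.0 / q) for q in positions)

# A summary built from early sentences scores higher:
#   sentence_position_score([1, 2, 4])    -> about 2.21
#   sentence_position_score([10, 15, 20]) -> about 0.80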

5.2.2 Similarity with Title

Sentences in the summary should be similar to the title [178] to obtain a good summary because the title describes the theme of the document. This objective function is defined as given below:

SWT_{avg} = \frac{\sum_{\forall s_i \in Summary} sim(s_i, title)}{O}, \qquad (5.2)

SWT = \frac{SWT_{avg}}{\max_{\forall Summary} SWT} \qquad (5.3)

where title is the headline/title of the document to which sentence s_i belongs, sim(s_i, s_j) is the similarity between sentences s_i and s_j, O is the number of sentences in the generated summary,

SWT_{avg} is the average similarity of the sentences in the summary with the title, \max_{\forall Summary} SWT is the maximum average similarity of the sentences with the title over all summaries, and SWT is the similarity factor of the summary S with the title. SWT is close to 1 if the sentences in the summary are closely related to the title of the document.

5.2.3 Sentence Length

The literature suggests that shorter sentences are less likely to appear in the summary [162]. In the current work, a normalization-based sigmoid function [179] is used that favors the longest sentences without entirely ruling out medium-length sentences. Mathematically, it is expressed as:

\sum_{\forall s_i \in Summary} \frac{1 - \exp\left(-\frac{l(s_i) - \mu(l)}{std(l)}\right)}{1 + \exp\left(-\frac{l(s_i) - \mu(l)}{std(l)}\right)} \qquad (5.4)


where \mu(l) is the mean length of the sentences in the summary, l(s_i) is the length of sentence s_i, and std(l) is the standard deviation of the lengths of the sentences in the summary S.
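A small sketch of Eq. 5.4, with the mean and standard deviation computed over the lengths of the summary sentences as defined above; the names are illustrative.

import math

def sentence_length_score(summary_lengths):
    # Eq. 5.4: sigmoid-normalised sentence-length score for the selected sentences.
    mu = sum(summary_lengths) / len(summary_lengths)
    var = sum((l - mu) ** 2 for l in summary_lengths) / len(summary_lengths)
    std = math.sqrt(var)
    if std == 0:
        return 0.0  # all selected sentences have the same length
    score = 0.0
    for l in summary_lengths:
        z = (l - mu) / std
        score += (1 - math.exp(-z)) / (1 + math.exp(-z))
    return score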

5.2.4 Cohesion

Cohesion [169] measures the relatedness of the sentences in the summary; for a good summary, the selected sentences should be tightly related to each other. It is expressed as

COH = \frac{\log(C_s \times 9 + 1)}{\log(M \times 9 + 1)} \qquad (5.5)

where

C_s = \frac{\sum_{\forall s_i, s_j \in Summary} sim(s_i, s_j)}{O_s} \quad \text{and} \quad O_s = \frac{N \times (N - 1)}{2} \qquad (5.6)

M = \max sim(s_i, s_j), i, j \leq N. Here, C_s measures the average similarity of the sentences in the summary, sim(s_i, s_j) is the similarity between sentences s_i and s_j, N is the total number of sentences in the document, and M is the maximum similarity between two sentences. COH ranges within [0, 1]; a value of 1 indicates that the sentences in the summary are highly related to each other.

5.2.5 Coverage

Coverage (CoV) [105] measures the extent to which sentences in the summary provide useful information about the document and should be maximized. Coverage is defined as

CoV = \sum_{\forall s_i \in Summary}\; \sum_{\forall s_j \in Doc,\, s_i \neq s_j} \frac{sim(s_i, s_j)}{N - 1} \qquad (5.7)

where s_i and s_j are sentences belonging to the generated summary and the document, respectively,

Doc is the document, N is the number of sentences in the document, and sim(s_i, s_j) is the similarity between sentences s_i and s_j.

5.2.6 Readability Factor

The readability factor [169] is the last objective function and is the most important factor for summary formation: in a readable summary, each sentence should be related to the previous one. It is expressed as:

R = \sum_{i=2}^{N_p} sim(s_i, s_{i-1}) \qquad (5.8)

where N_p is the number of sentences in the predicted summary, s_i and s_{i-1} are two consecutive sentences in the predicted summary, and sim(s_i, s_{i-1}) is the similarity between them.
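The sketch below implements Eqs. 5.5-5.8 literally for a candidate summary, given any of the sentence similarity functions discussed in this chapter; the function names and the way sentences are represented are illustrative assumptions.

import math

def cohesion(summary_sents, doc_sents, sim):
    # Eqs. 5.5-5.6: COH = log(9*Cs + 1) / log(9*M + 1), with Os = N(N-1)/2.
    n = len(doc_sents)
    o_s = n * (n - 1) / 2.0
    pair_sum = sum(sim(a, b) for i, a in enumerate(summary_sents)
                   for b in summary_sents[i + 1:])
    c_s = pair_sum / o_s
    m = max(sim(a, b) for i, a in enumerate(doc_sents) for b in doc_sents[i + 1:])
    return math.log(9 * c_s + 1) / math.log(9 * m + 1)

def coverage(summary_sents, doc_sents, sim):
    # Eq. 5.7: similarity of every summary sentence to the rest of the document,
    # normalised by N - 1.
    n = len(doc_sents)
    return sum(sim(si, sj) for si in summary_sents
               for sj in doc_sents if sj is not si) / (n - 1)

def readability(summary_sents, sim):
    # Eq. 5.8: sum of similarities between consecutive sentences of the summary.
    return sum(sim(summary_sents[i], summary_sents[i - 1])
               for i in range(1, len(summary_sents)))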

5.3 Problem Definition

Consider a document D consisting of N sentences, {s_1, s_2, ..., s_N}. Our main task is to find a subset of sentences S ⊆ D such that

\sum_{s_i \in S} l_i \leq S_{max} \qquad (5.9)

where S represents the main theme/topic of the document, i.e., the subset of sentences that covers the relevant information of the document, s_i is a sentence belonging to S, l_i measures the length of the i-th sentence in terms of the number of words, and S_{max} is the maximum number of words allowed in the generated summary. This summary should be optimal with respect to the various quality measures discussed in the previous section.

5.4 Proposed Methodology

In this chapter, two approaches are developed for sentence-based extractive single document summarization. Both approaches utilize a binary version of the multi-objective differential evolution (MOBDE) technique (discussed in Section 2.1.9 of Chapter 2) as the underlying optimization strategy. SOM-based genetic operators are introduced in the process to improve convergence. The flowchart of the proposed approach is shown in Fig 5.1 and the underlying steps are discussed in the subsequent sections.

1. Approach 1: In this approach, all objective functions are assigned importance factors/weights. For example, if the fitness values of the six objective functions are < ob1, ob2, ob3, ob4, ob5, ob6 > and the assigned weights are < α, β, γ, δ, λ, φ >, then < ob1 × α, ob2 × β, ob3 × γ, ob4 × δ, ob5 × λ, ob6 × φ > are optimized simultaneously. The values of these weights are selected after conducting a thorough literature survey [46, 105, 180].

2. Approach 2: In this approach, all objective functions are simultaneously optimized without assigning any weight values.

In the literature [105, 180], it was shown that some of the objective functions used in our approach have more importance than others. Therefore, Approach 1 is developed to see the effect of the


Figure 5.1: Proposed architecture, where g is the current generation number initialized to 0, g_max is the maximum number of generations defined by the user, and |P| is the number of solutions in the population. After step 8, g is incremented by 1 and the process continues until the maximum number of generations is reached.

varying importance of the different objective functions.

5.4.1 Preprocessing

Before generating the summary, a series of pre-processing steps is executed on the document: segmentation of the document into sentences, stop-word removal (frequent words like is, am, are, etc. are removed), case folding (conversion to lower case), and removal of punctuation marks. The nltk toolkit [181] is used for document segmentation and stop-word removal.
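A minimal pre-processing sketch with nltk; the resource names and the exact cleaning order are assumptions, and the thesis pipeline may differ in detail.

import string

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(document_text):
    # Segment the document into sentences and return, for each sentence, its
    # cleaned token list (lower-cased, stop words and punctuation removed).
    sentences = sent_tokenize(document_text)
    cleaned = []
    for sentence in sentences:
        tokens = [t.lower() for t in word_tokenize(sentence)]
        tokens = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
        cleaned.append(tokens)
    return sentences, cleaned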

5.4.2 Representation of Solution and Population Initialization

Any evolutionary algorithm starts with a set of solutions (or chromosomes), < x_1, x_2, ..., x_{|P|} >, called a population, where |P| is the number of solutions. As our approach is based on binary optimization, each solution is represented as a binary vector whose size equals the number of sentences in the document. For example, if a document consists of 10 sentences, a valid solution can be represented as [1, 0, 0, 1, 1, 0, 1, 0, 0, 0], indicating that the first, fourth, fifth, and seventh sentences of the original document should be in the summary. The initial population is generated randomly. While generating a solution, the constraint on the summary length is taken into account as \sum_{s_i \in Summary} l_i \leq S_{max}, where l_i measures the length of a sentence in terms of the number of words and S_{max} is the maximum number of words allowed in the generated summary.
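A small sketch of the binary encoding and a constrained random initialization; the greedy word-count check and function names are assumptions, not the exact thesis routine.

import random

def random_solution(sentence_lengths, s_max):
    # Build one binary chromosome whose selected sentences contain at most s_max words.
    chromosome = [0] * len(sentence_lengths)
    words_used = 0
    for idx in random.sample(range(len(sentence_lengths)), len(sentence_lengths)):
        if words_used + sentence_lengths[idx] <= s_max:
            chromosome[idx] = 1
            words_used += sentence_lengths[idx]
    return chromosome

def initialize_population(sentence_lengths, s_max, pop_size=40):
    # |P| = 40 in the parameter settings of Section 5.5.4.
    return [random_solution(sentence_lengths, s_max) for _ in range(pop_size)]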


5.4.3 Objective Functions Used

To measure the quality of each solution in the population, a set of objective/fitness functions are evaluated. These functions are discussed in the Section 5.2, and all are of maximization type. Note that optimization of these functions helps in getting a good quality summary.

5.4.4 SOM Training

In this step, the SOM [62, 30] is trained using the solutions in the population; the sequential learning algorithm (described in Algorithm 1 of Chapter 2) is used for training. The SOM helps in understanding the distribution structure of the solutions in the population: solutions that are close in the input space come close to each other in the output space (the neuron grid of the SOM).

5.4.5 Genetic Operators

In any evolutionary algorithm, genetic operators help in generating new solutions; this set of new solutions forms a new population, P′. In our framework, a new solution is generated from each solution using three genetic operators: mating pool generation, mutation, and crossover. Assume that at generation t we want to generate a new solution for the current solution, denoted x_{c,t}. The genetic operators are then applied as follows:

Mating Pool Generation

The mating pool consists of a set of solutions that can mate to generate new solutions. For its construction for the current solution, neighboring solutions are identified using the trained SOM, as discussed in Section 2.1.12 of Chapter 2. Note that the mating pool size is kept fixed.

Mutation and Crossover

To perform these genetic operations, three solutions, x_{r1,t}, x_{r2,t}, and x_{r3,t}, are randomly selected from the constructed mating pool. Thereafter, Eqs. 2.10, 2.11 and 2.12 of Section 2.1.10 in Chapter 2 are applied in sequence, giving rise to a new solution v_{c,t+1}. It is important to note that during the generation of the new solution v_{c,t+1}, all possible combinations of the mating pool (since the mating pool contains more than three solutions) are tried; mutation and crossover are performed for each combination, and then the summary-length constraint is checked. If more than one combination satisfies the constraint, the combination closest to the length constraint (considering the maximum number of words in the summary) is selected.
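The exact operators are given by Eqs. 2.10-2.12 of Chapter 2 and are not reproduced here. The sketch below shows a generic binary-DE trial-vector construction (DE/rand/1 mutation with scale factor F, binomial crossover with rate CR, and a sigmoid-based binarisation controlled by b); it only conveys the flavour of this step and is not claimed to be the thesis formulation, although the parameter values follow Section 5.5.4.

import math
import random

def de_trial_vector(x_current, x_r1, x_r2, x_r3, F=0.8, CR=0.2, b=6.0):
    # Generic binary-DE sketch: mutate three mating-pool members, recombine with the
    # current solution, and binarise the mutant genes with a sigmoid.
    n = len(x_current)
    j_rand = random.randrange(n)          # at least one gene is taken from the mutant
    trial = []
    for j in range(n):
        mutant = x_r1[j] + F * (x_r2[j] - x_r3[j])
        if random.random() < CR or j == j_rand:
            prob = 1.0 / (1.0 + math.exp(-b * (mutant - 0.5)))   # map mutant value to [0, 1]
            trial.append(1 if random.random() < prob else 0)
        else:
            trial.append(x_current[j])
    return trial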


5.4.6 Selection of the Best |P | Solutions for Next Generation

This step selects the best |P| solutions out of the old population (P) and the new population (P′); note that the size of P′ is equal to that of P. To perform this operation, the non-dominated sorting (NDS) and crowding distance operator (CDO) of NSGA-II are utilized [29]. For more details, refer to Section 3.2.6 of Chapter 2.
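A compact sketch of this NSGA-II style environmental selection (non-dominated sorting followed by crowding distance); all objectives are assumed to be maximised and the helper names are illustrative.

def dominates(a, b):
    # True if objective vector a dominates b (all objectives maximised).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(objs):
    # Return fronts as lists of indices into objs (front 0 = rank-1 solutions).
    fronts, remaining = [], set(range(len(objs)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][k])
        dist[order[0]] = dist[order[-1]] = float("inf")   # keep boundary solutions
        span = objs[order[-1]][k] - objs[order[0]][k] or 1.0
        for pos in range(1, len(order) - 1):
            dist[order[pos]] += (objs[order[pos + 1]][k] - objs[order[pos - 1]][k]) / span
    return dist

def select_next_population(objs, pop_size):
    # Pick pop_size indices from the merged old + new population.
    selected = []
    for front in non_dominated_sort(objs):
        if len(selected) + len(front) <= pop_size:
            selected.extend(front)
        else:
            dist = crowding_distance(objs, front)
            front.sort(key=lambda i: dist[i], reverse=True)
            selected.extend(front[:pop_size - len(selected)])
            break
    return selected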

5.4.7 Updation of SOM Training Data

In this step, the training data for the SOM is updated. In the next generation, the SOM will be trained using those selected solutions (out of the best solutions chosen in the previous step) that have not been seen before. It is important to note that the updated weight vectors of the neurons in the current generation are treated as the initial weight vectors of the neurons in the next generation.

5.4.8 Termination Condition

For any iterative procedure, a termination condition is required. In our work, the proposed algorithm is repeated until a maximum number of generations (iterations), g_max, is reached. This step is shown by the diamond box in Fig 5.1.

5.4.9 Selection of Single Best Solution and Generation of Summary

At the end of the final generation, a set of non-dominated solutions on the Pareto optimal front is generated by our MOO-based algorithm. Owing to the binary representation, every solution directly represents a summary (whereas in the previous chapter we had to extract sentences from the optimized clusters to generate the summary). Therefore, the best solution is selected as the one having the highest ROUGE-2 score. Note that calculating the ROUGE score requires a gold/reference summary, which may not be available in real-world situations; therefore, a single solution from the final Pareto optimal front should also be selectable using criteria that do not use any supervised information. To address this issue, we have explored various methods to select the best solution. Let us name the strategies that use supervised information (the available gold summary) and unsupervised information for selecting the single best solution from the final Pareto optimal front as SMaxRouge and UMaxRouge, respectively. The methods explored under the UMaxRouge policy are explained below:


1. Maximum values of six different objective functions and their combinations: coverage (MaxCov), readability (MaxRead), sentence length (MaxSenLen), sentence position (MaxSenPos), similarity with title (MaxSimTitle), and cohesion (MaxCoh). For each of these, the chosen single objective function (for example, the readability score) is evaluated for all solutions of the final generation, and the solution having the highest value of that objective is considered the best solution. Some combinations of these objective functions are also explored; in this case too, the solution with the highest combined value is considered the best solution. For example:

• MaxWeightSumAllObj: In this approach, summation of all objective functional values optimized in our approach is considered.

• MaxWeightSum2Obj: In MaxWeightSum2Obj, the summation of two objective functions, namely sentence position and sentence similarity with the title, is considered.

• MaxWeightSum3Obj: This is similar to MaxWeightSum2Obj; the only difference is that one more objective function, namely cohesion, is added.

2. Ensemble approach (EnSem): In this approach, we first consider all the sentences present in the summaries corresponding to all generated rank-1 solutions of the final Pareto optimal front. Then the frequency of occurrence of each of these sentences over the different summaries corresponding to the different rank-1 solutions is calculated as per Eq. 5.10. The sentences are then sorted based on their frequencies of occurrence and added one by one, in sorted order, to the final summary until the desired length is reached.

Let |PS| be the number of rank-1 solutions and PSS be the set of all unique sentences present in the summaries corresponding to these |PS| solutions. Assume that we want

to count the frequency of occurrence of the i-th sentence, sent_i, belonging to PSS. Then the following equation is used:

count_{sent_i} = \sum_{k=1}^{|PS|} B \quad \text{and} \quad B = \begin{cases} 1, & \text{if } sent_i \in PS_k \\ 0, & \text{otherwise} \end{cases} \qquad (5.10)

where PS_k is the k-th summary, corresponding to the k-th solution of a document. The same equation is used to calculate the count of the remaining sentences belonging to PSS.
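A small sketch of the counting in Eq. 5.10 and the frequency-ordered assembly of the ensemble summary; the index-based representation and names are assumptions.

from collections import Counter

def ensemble_summary(rank1_summaries, sentence_lengths, s_max):
    # rank1_summaries: one list of sentence indices per rank-1 solution (PS_k).
    # Count in how many rank-1 summaries each sentence occurs (Eq. 5.10), then add
    # sentences in decreasing order of that count until the word limit is reached.
    counts = Counter()
    for summary in rank1_summaries:
        counts.update(set(summary))          # B = 1 if sent_i belongs to PS_k
    selected, words_used = [], 0
    for sent_idx, _ in counts.most_common():
        if words_used + sentence_lengths[sent_idx] <= s_max:
            selected.append(sent_idx)
            words_used += sentence_lengths[sent_idx]
    return sorted(selected)                  # report in original document order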

Two other variations of the ensemble approach are also tried. After collecting the sentences of the rank-1 solutions (merged pool), they are sorted based on (a) maximum length;


(b) maximum sentence to title similarity. For both cases (a) and (b), final summary is generated by adding the sentences from the merged pool one by one following their sorted order until the desired length is reached. In this work, the approaches corresponding to (a) and (b) are named as EnSemMaxLen and EnSemMaxSentTitle, respectively.

Another strategy selects, among the rank-1 solutions, the summary that minimizes the reconstruction error between the document vector and the sentence vectors of that summary, where sentence vectors are obtained by averaging word2vec word vectors and the document vector by averaging the sentence vectors:

ReconsError_j = \sum_{i=1}^{K} \| DocVec - SentVec_i \|_2 \qquad (5.11)

where DocVec is the vector representing the document's theme, SentVec_i is the i-th sentence vector of the j-th summary (i.e., the summary corresponding to the j-th solution), K is the number of sentences in the j-th summary, and \| DocVec - SentVec_i \|_2 is the Euclidean distance between the document vector and the i-th sentence vector. In the current work, we name this approach MinReconsErrorWord2vec.

Averaging word vectors to obtain a sentence vector and then averaging sentence vectors to obtain the document vector somewhat reduces the semantics of the sentence and document vectors [51]. Therefore, we have also tried another approach based on Doc2vec [182], whose performance has been shown to be good when trained on large corpora with pre-trained word embeddings [182]. From the trained model we can directly obtain the document vector and the sentence vectors [183]. Here too we want to minimize the reconstruction error between the document vector and the generated summary, as in Eq. 5.11. We name this approach MinReconsErrorDoc2vec.
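A sketch of Eq. 5.11 with Doc2vec-inferred vectors, assuming a model trained as described later in Section 5.6.4; gensim's Doc2Vec and its infer_vector method are used, and the variable and file names are illustrative.

import numpy as np
from gensim.models.doc2vec import Doc2Vec

def reconstruction_error(model, document_tokens, summary_sentences):
    # Eq. 5.11: sum of Euclidean distances between the document vector and every
    # sentence vector of one candidate summary (summary_sentences is a list of
    # token lists).
    doc_vec = model.infer_vector(document_tokens)
    error = 0.0
    for sentence_tokens in summary_sentences:
        sent_vec = model.infer_vector(sentence_tokens)
        error += float(np.linalg.norm(doc_vec - sent_vec))
    return error

# MinReconsErrorDoc2vec: keep the rank-1 summary with the smallest error, e.g.
#   model = Doc2Vec.load("doc2vec_duc.model")   # hypothetical model file
#   best = min(candidates, key=lambda s: reconstruction_error(model, document_tokens, s))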

3. Maximum distance from the origin (MaxObjDistOrigin): As the six objective functions used in our proposed approach are of the maximization type, we calculate the Euclidean distance between the origin (0, 0, 0, 0, 0, 0) and the objective function values of each solution. The solution having the largest distance is selected as the best solution.

Note that the sentences present in the final summary are ordered according to their positions in the original document. For example, the sentence that appears first in the document will be the first sentence in the summary.

5.5 Experimental Setup

This section presents the datasets used for the experimentation, the evaluation metrics used to measure performance, and the comparing methods, followed by the parameter settings. All the proposed approaches were implemented on an Ubuntu server with an Intel Xeon CPU (2.20 GHz) and 256 GB of


RAM.

5.5.1 Datasets

To show the effectiveness of the proposed approach and to show that performance depends not only on the chosen objective functions but also on the type of similarity/dissimilarity measure used, two benchmark datasets, DUC2001 and DUC2002, from the Document Understanding Conference1 are used. These contain 309 and 567 news reports (documents), respectively, written in English. For each document, an original/actual summary of approximately 100 words is available for single document summarization. A brief description of the used datasets is provided in Table 4.1 of Chapter 2. In addition to these datasets, we have also used the CNN dataset [184], which contains news articles collected from the CNN news site https://edition.cnn.com/. It consists of 3000 news articles/documents, out of which only 1000 articles are made available by the authors of the dataset [184]. Note that this dataset was released as part of a competition on extractive summarization held with the ACM Symposium on Document Engineering2 in 2019. The actual summaries include 3-4 sentences on average.

5.5.2 Evaluation Measure

To evaluate the performance of the proposed architecture, we utilize the ROUGE-N measure [134]. For the mathematical definition of the ROUGE score, refer to Section 2.3.2 of Chapter 2. In our experiments, N takes the values 1 and 2 for ROUGE-1 and ROUGE-2, respectively.

5.5.3 Comparing Methods

For DUC2001 and DUC2002, we compare our proposed system with 13 existing systems. Some of these methods use supervised approaches, while others use neural networks; some are also based on optimization techniques aimed at improving the ROUGE score. The existing systems used for comparison are UnifiedRank [160], MA-SingleDocSum [105], Manifold Ranking [98], QCS [101], CRF [96], NetSum [109], SVM [95], DE [55], FEOM [104], SummaRuNNer [7], NN-SE [6], COSUM [108], and ESDS-GHS-GLO [107]. These works, except [6, 7], use both the DUC2001 and DUC2002 datasets for reporting the performance of their summarization systems. In addition, in [99], five regression-based methods are proposed, namely LeastMedSq, Linear Regression, MLP Regressor,

1 https://www-nlpir.nist.gov/projects/duc/data.html
2 https://doceng.org/doceng2019

RBF Regressor, and SMOreg, which differ in terms of the machine learning regressor used. Out of these regression-based models, Linear Regression and LeastMedSq performed the best for the DUC2001 and DUC2002 datasets, respectively; therefore, these best methods are also considered for comparison. Note that [6, 7] use only the DUC2002 dataset; for a fair comparison, their results are taken directly from the reference papers. The techniques discussed above are already described in Section 2.2.2 of Chapter 2. As the CNN dataset was released as part of a competition on single-document extractive summarization, we also compare with the two top systems submitted in the competition. The first system was developed by Oliveira et al. [185] from the Federal Institute of Espirito Santo, Brazil; they considered the ESDS problem as a maximum coverage problem of selecting the optimal subset of sentences from the document and utilized Integer Linear Programming (ILP) [186] to maximize it. The second system was developed by Brito et al. from the Fraunhofer Center for Machine Learning, Germany, who used the SummaRuNNer model [7] with some modifications. For more details, refer to [184].

5.5.4 Parameter Settings

The parameter values used in our proposed framework are as follows. DE parameters: |P| = 40, mating pool size = 4, threshold probability in mating pool construction (β) = 0.7, maximum number of generations (g_max) = 25, crossover probability (CR) = 0.2, b = 6, F = 0.8. SOM parameters: initial neighborhood size (σ0) = 2, initial learning rate = 0.6, training iterations in SOM = |P|, topology = rectangular 2D grid, grid size = 5 × 8. Sensitivity analyses of the DE and SOM parameters can be found in [46] and [168], respectively; inspired by these works, similar parameter values are used in the current work. Importance factors/weight values assigned to the different objective functions: α = 0.25, β = 0.25, γ = 0.10, δ = 0.11, λ = 0.19, φ = 0.10. System summary length: 100 words. In most of the existing literature [46, 180], similar weight values for the importance factors are considered. As there are six objective functions, the maximum number of fitness function evaluations is 6240. The reported results are averaged over 10 runs of the algorithm. Word mover distance makes use of a word2vec model pre-trained on the GoogleNews corpus (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) to calculate the distance between two sentences.

5.6 Experimental Results

Table 5.1 reports the ROUGE scores obtained by our proposed approaches using different similarity/dissimilarity measures (NGD, CS, WMD) and different state-of-the-art methods on DUC2001

and DUC2002 datasets. Note that these results are generated by our proposed approach with the SMaxRouge strategy for selecting a single best solution from the final Pareto optimal front, as mentioned in Section 5.4.9. To illustrate the utility of incorporating SOM-based genetic operators in the DE process, results are also reported for the multi-objective binary DE-based summarization approach with the standard genetic operators of DE (without SOM). It can be observed that our approaches using the discussed similarity/dissimilarity measures outperform all other approaches for both datasets in terms of ROUGE-1 and ROUGE-2 scores. The best ROUGE scores reported in Table 5.1 for both datasets were obtained using Approach-1 with SOM-based genetic operators and WMD as the similarity measure. Thus, it can be concluded from the obtained results that the use of different sentence similarity/dissimilarity measures and of self-organized multi-objective differential evolution for optimization indeed helps in achieving improved performance.

As any evolutionary algorithm generates Pareto optimal solutions in the final generation, we show in Fig 5.2 the Pareto optimal fronts obtained (over one random document of DUC2001/DUC2002) after applying the proposed Approach-1 (WMD) with SOM-based operators. These fronts correspond to the first, fourteenth, nineteenth, and twenty-fifth generations. As it is difficult to plot Pareto optimal fronts for six objective functions, we show the fronts projected onto a three-objective space (Fig 5.2). The following three subsections discuss the results obtained using the different distance/similarity measures on the DUC2001 and DUC2002 datasets.

5.6.1 Discussion of Results Obtained using Normalized Google Distance (NGD)

In Table 5.1, considering all cases (both approaches, with and without SOM-based genetic operators), our results beat the other existing methods. The best ROUGE scores for both datasets were obtained using Approach-1 with SOM-based genetic operators. Comparing the results of Approach-2 with and without SOM-based operators on the DUC2002 dataset, the ROUGE-2 and ROUGE-1 scores are higher without SOM-based operators, but the difference from the SOM-based variant is not substantial.

5.6.2 Discussion of Results Obtained using Cosine Similarity (CS)

In Table 5.1, considering all cases (both approaches, ‘with SOM’ and ‘without SOM’ based genetic operators), it can be concluded that our proposed approaches outperform other existing


Table 5.1: ROUGE scores attained by different methods on the DUC2001 and DUC2002 datasets. Our proposed methods are executed using normalized Google distance (NGD), cosine similarity (CS), and word mover distance (WMD), and the SMaxRouge strategy is used for selecting a single best solution from the final Pareto front. Here, † denotes the best results and also indicates that the results are statistically significant at the 5% significance level; xx indicates that results are not available in the reference paper. For the LeastMedSq and Linear Regression methods, results in the reference paper are presented up to 4 decimal places; to make a fair comparison up to 5 decimal places, we have appended a 0 as the last decimal digit so that their results remain unchanged. The same applies to the NN-SE and SummaRuNNer methods.

                                      DUC2001               DUC2002
                                      ROUGE-2    ROUGE-1    ROUGE-2    ROUGE-1
Approach-1 (NGD)      With SOM        0.26949    0.47699    0.27846    0.50225
                      Without SOM     0.26742    0.47521    0.27705    0.50191
Approach-2 (NGD)      With SOM        0.26774    0.47291    0.27519    0.49899
                      Without SOM     0.26265    0.46762    0.27654    0.50162
Approach-1 (CS)       With SOM        0.26459    0.47554    0.27649    0.50624
                      Without SOM     0.25282    0.46289    0.27292    0.50050
Approach-2 (CS)       With SOM        0.26209    0.47398    0.25961    0.49159
                      Without SOM     0.26629    0.47862    0.27319    0.50147
Approach-1 (WMD)      With SOM        0.29238†   0.50236†   0.28846†   0.51662†
                      Without SOM     0.28930    0.49486    0.28556    0.51441
Approach-2 (WMD)      With SOM        0.28462    0.49863    0.28520    0.51538
                      Without SOM     0.28190    0.48877    0.28656    0.51406
COSUM [108]                           0.20123    0.47274    0.23092    0.49083
ESDS-GHS-GLO [107]                    0.19574    0.45403    0.22142    0.47903
MA-SingleDocSum [105]                 0.20142    0.44862    0.22840    0.48280
DE [55]                               0.18523    0.47856    0.12368    0.46694
UnifiedRank [160]                     0.17646    0.45377    0.21462    0.48487
FEOM [104]                            0.18549    0.47728    0.12490    0.46575
NetSum [109]                          0.17697    0.46427    0.11167    0.44963
CRF [96]                              0.17327    0.45512    0.10924    0.44006
QSC [101]                             0.18523    0.44852    0.18766    0.44865
SVM [95]                              0.17018    0.44628    0.10867    0.43235
Manifold Ranking [98]                 0.16635    0.43359    0.10677    0.42325
Linear Regression [99]                0.21104    0.46374    0.23924    0.49784
LeastMedSq [99]                       0.20794    0.46204    0.23964    0.49824
NN-SE [6]                             xx         xx         0.23200    0.47400
SummaRuNNer [7]                       xx         xx         0.23900    0.45400

methods. Comparing the two operator variants of Approach-1 utilizing CS, the 'with SOM' variant performs better. On the other hand, for the DUC2002 dataset, comparing the results of Approach-2 with both operator variants, the ROUGE-2 and ROUGE-1 scores are higher for the 'without SOM' variant.


Table 5.2: ROUGE scores attained by the proposed Approach-1 and Approach-2 utilizing word mover distance (WMD) on the CNN dataset. Here, the SMaxRouge strategy is used for selecting a single best solution from the final Pareto front.

                                      ROUGE-2    ROUGE-1
Approach-1 (WMD)      With SOM        0.6590     0.7212
                      Without SOM     0.5782     0.6527
Approach-2 (WMD)      With SOM        0.6337     0.7035
                      Without SOM     0.5689     0.6461
Brito et al.                          0.3400     0.4600
Oliveira et al.                       0.4500     0.5700

Figure 5.2: Pareto optimal fronts obtained after application of the proposed approach (Approach-1 (WMD) with SOM-based operators). Sub-figures (a), (b), (c) and (d) are the Pareto optimal fronts obtained after the first, fourteenth, nineteenth and twenty-fifth generations, respectively. Red dots represent Pareto optimal solutions; the three axes represent three objective function values, namely sentence position, readability, and coverage.


5.6.3 Discussion of Results Obtained using Word Mover Distance (WMD)

In Table 5.1, considering all cases (both approaches), Approach-1 obtains the best ROUGE scores with SOM-based genetic operators for both datasets. This result is also the best when compared with the other similarity/dissimilarity measures. One reason behind this improved performance is the ability of WMD to capture semantic relationships between sentences; another possible reason is the use of SOM-based operators, which help the algorithm reach optimal solutions with good ROUGE scores. The time taken to generate a summary using Approach-1 with SOM-based operators on DUC2001 is 32 seconds per document, while the same approach without SOM-based operators takes 29 seconds per document. For DUC2002, Approach-1 with and without SOM-based operators takes almost the same time, i.e., 20 seconds per document. Note that these reported times exclude the time taken to calculate the similarity/dissimilarity between two sentences, which is approximately 10-20 seconds in the case of WMD. As Approach-1, utilizing word mover distance and SOM-based genetic operators, performs the best (as per the results of Table 5.1), we have also evaluated it on the third dataset, CNN. The corresponding results are reported in Table 5.2; the results of Approach-1 and Approach-2 for the CNN dataset are shown with and without SOM-based genetic operators. From Table 5.2, it can be observed that Approach-1 using WMD as the dissimilarity measure and SOM-based genetic operators performs the best, as was also the case for the DUC2001 and DUC2002 datasets.

5.6.4 Study on Different Methods of Selecting a Single Best Solution from Final Pareto Front

In Table 5.1, we have shown the best results produced by our proposed approaches utilizing the SMaxRouge strategy for selecting a single best solution from the final Pareto front. In real-world situations, however, the actual summary may not be available. Therefore, we have explored various unsupervised methods under the UMaxRouge strategy to generate a single summary out of the multiple solutions on the final Pareto optimal front, as discussed in Section 5.4.9. The corresponding results are reported in Table 5.3. Note that, amongst the different proposed approaches, Approach-1 (WMD) performs the best with the SMaxRouge strategy; therefore, the unsupervised methods are explored for this approach only. It can be observed from Table 5.3 that the MaxWeightSum2Obj method is able to beat the remaining approaches for the DUC2002 dataset, having Rouge-1 and Rouge-2 scores of 0.51191 and 0.24871 (using SOM-based operators), respectively; however, these scores are lower than the Rouge-1 and Rouge-2

scores of 0.51662 and 0.28846, respectively, which were the best results attained by the SMaxRouge strategy. For the DUC2001 dataset, MaxWeightSum2Obj obtains a better result in terms of Rouge-2 score, with a value of 0.20839, but this is only close to the best result of the existing approaches. Most of the approaches under the UMaxRouge strategy are not able to select the best solution as selected by the SMaxRouge strategy; hence their performances are poorer compared to the SMaxRouge strategy reported in Table 5.1.

Table 5.3: ROUGE Scores obtained using Approach-1 (WMD) when the best solution is selected using any of the strategies under UMaxRouge strategy. All the strategies explored here for selecting a single best solution from the final Pareto front are unsupervised in nature. Bold entries indicate they are able to beat the state-of-the-art algorithms.

                                                 DUC2001               DUC2002
                                                 ROUGE-2    ROUGE-1    ROUGE-2    ROUGE-1
MaxCoh                        With SOM           0.09268    0.31442    0.11924    0.34899
                              Without SOM        0.08949    0.30803    0.16460    0.27372
MaxCov                        With SOM           0.13969    0.41237    0.17064    0.45934
                              Without SOM        0.16107    0.42382    0.11007    0.24130
MaxRead                       With SOM           0.13388    0.38343    0.15633    0.423459
                              Without SOM        0.13353    0.38081    0.16276    0.28974
MaxSenLen                     With SOM           0.11518    0.37225    0.14217    0.42641
                              Without SOM        0.11659    0.37350    0.11830    0.23563
MaxSenPos                     With SOM           0.20163    0.43891    0.24859    0.50957
                              Without SOM        0.19796    0.43700    0.18503    0.33797
MaxSimTitle                   With SOM           0.17096    0.42528    0.20021    0.46747
                              Without SOM        0.07824    0.21931    0.16498    0.30265
MaxWeightSumAllObj            With SOM           0.17484    0.42412    0.20669    0.47523
                              Without SOM        0.20450    0.45214    0.18319    0.32418
MaxWeightSum2Obj              With SOM           0.20839    0.47140    0.24871    0.51191
                              Without SOM        0.20431    0.44477    0.18402    0.33673
MaxWeightSum3Obj              With SOM           0.19723    0.43780    0.24787    0.50997
                              Without SOM        0.20518    0.44514    0.33872    0.18752
Ensemble                      With SOM           0.12717    0.32238    0.15327    0.37152
                              Without SOM        0.12065    0.31312    0.14944    0.36796
EnSemMaxLen                   With SOM           0.06512    0.25632    0.09802    0.30849
                              Without SOM        0.08931    0.26963    0.09714    0.30733
EnSemMaxSentTitle             With SOM           0.11611    0.30302    0.14499    0.35167
                              Without SOM        0.05194    0.22642    0.14267    0.35113
MaxObjDistOrigin              With SOM           0.18474    0.43984    0.21136    0.48370
                              Without SOM        0.186083   0.43571    0.162728   0.30669
MinReconsErrorWord2vec        With SOM           0.15695    0.39800    0.19048    0.44749
                              Without SOM        0.14409    0.38408    0.18777    0.32736
MinReconsErrorDoc2vec         With SOM           0.29221    0.49990    0.28620    0.51623
                              Without SOM        0.28930    0.48486    0.27142    0.50101


The ensemble-based approach generally performs well. However, the final front contains a large number of non-dominated solutions of different character: a solution sol_i may be good in terms of the 'sentence-to-title similarity' objective compared to sol_j, while sol_j may be good in terms of the cohesion objective, which has low priority in our approach. As we consider the sentences belonging to all of these solutions to generate the final summary, the ensemble approach does not perform better than the SMaxRouge strategy.

After observing the results obtained by the MaxCoh, MaxCov, MaxRead, MaxSenLen, MaxSenPos, and MaxSimTitle strategies for selecting a single best solution (based on the maximum value of a single objective function), we conclude that these strategies are also not able to extract the best solution from the final Pareto optimal front. Only the MinReconsErrorDoc2vec approach performs well and beats the existing algorithms, with only slight variations from the results reported in Table 5.1. In summary, the solutions selected by MinReconsErrorDoc2vec under the UMaxRouge scheme are very similar to those selected by the SMaxRouge scheme (refer to Table 5.1), where the available reference/gold summary is utilized for selecting the single best solution. Thus, the performances of the proposed approaches under the MinReconsErrorDoc2vec and SMaxRouge strategies are similar, but the MinReconsErrorDoc2vec scheme does not utilize any supervised information. The use of the MinReconsErrorDoc2vec scheme is therefore recommended with the proposed approaches for selecting the single best solution from the final Pareto front. Note that the doc2vec model used in this approach was trained on the DUC2001, DUC2002, DUC2006 and DUC2007 datasets using the implementation available at https://github.com/jhlau/doc2vec with the default parameters mentioned at that link, and makes use of a model pre-trained on the GoogleNews corpus. DUC2006 and DUC2007 are standard summarization datasets consisting of 50 and 45 document sets, respectively.

5.6.5 Convergence Plots

Fig 5.3 shows the convergence plots obtained by our proposed approach for some random documents; the maximum Rouge-1 and Rouge-2 scores attained by our approach over the generations are plotted. These figures show that Approach-1 (WMD) with SOM converges to a Rouge-1/Rouge-2 value after a particular iteration (there is no change in the Rouge-1/Rouge-2 scores after that iteration). This also indicates the faster convergence of our approach towards near-optimal Rouge scores in comparison to the other approaches.


Figure 5.3: Convergence plots. Sub-figures (a), (b), (c) and (d) show the convergence plots for four random documents. At each generation/iteration, maximum Rouge-1 and Rouge-2 scores are plotted.


5.6.6 Improvements Obtained

We have also calculated the performance improvement obtained (PIO) by our best approach (under the SMaxRouge strategy for selecting a single best solution from the final Pareto front) in comparison to the existing methods, using the ROUGE-2 and ROUGE-1 scores; these values are shown in Table 5.4. The improvements correspond to the best results, obtained using Approach-1 (WMD) with SOM-based operators. Mathematically, PIO is defined as:

PIO = \frac{ProposedMethod - OtherMethod}{OtherMethod} \times 100 \qquad (5.12)
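For example, taking the DUC2001 ROUGE-2 score of Approach-1 (WMD) with SOM from Table 5.1 (0.29238) and the corresponding MA-SingleDocSum score (0.20142), the entry of Table 5.4 is recovered as:

PIO = \frac{0.29238 - 0.20142}{0.20142} \times 100 \approx 45.16\%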

Table 5.4: Improvements attained by the proposed approach, Approach-1 (WMD) with SOM-based operators, over other methods considering ROUGE scores. Here, xx indicates non-availability of results on the DUC2001 dataset.

                          Improvements obtained by the proposed approach (%)
                              DUC2001               DUC2002
Methods                   ROUGE-2    ROUGE-1    ROUGE-2    ROUGE-1
COSUM                     45.30      6.27       24.92      5.25
ESDS-GHS-GLO              49.37      10.64      30.28      7.85
MA-SingleDocSum           45.16      11.98      26.30      7.01
DE                        57.85      4.98       133.24     10.64
UnifiedRank               65.69      10.71      34.41      6.55
FEOM                      57.63      5.26       130.96     10.92
NetSum                    65.21      8.21       158.32     14.90
CRF                       68.74      10.38      164.07     17.4
QSC                       57.85      12.01      53.72      15.15
SVM                       71.81      12.57      165.45     19.49
Manifold Ranking          75.76      15.86      170.18     22.06
Linear Regression         38.57      8.34       20.60      3.78
LeastMedSq                40.63      8.74       20.40      3.70
NN-SE                     xx         xx         24.88      8.99
SummaRuNNer               xx         xx         20.70      13.21

Here, the improvements obtained by our proposed approach compared to MA-SingleDocSum and DE are 45.16% and 4.98% (about 5%) considering the ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2001 dataset, while for the DUC2002 dataset the improvements compared to MA-SingleDocSum and COSUM are 26.3% and 5.25%, respectively. Comparing with the latest neural-network-based work on summarization [7], we obtain 20.70% and 8.99% (about 9%) improvements in ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2002 dataset. In summary, for the DUC2001 dataset, minimum improvements of 38.57% and 5.24% are obtained over the existing techniques in terms of the ROUGE-2 and ROUGE-1 scores, respectively, while for the DUC2002 dataset, minimum improvements of 20.60% and 3.70% are obtained over the

existing techniques in terms of the ROUGE-2 and ROUGE-1 scores, respectively.

5.6.7 Error Analysis

In this section, we thoroughly analyze the errors made by our proposed approach, Approach-1 with SOM-based operators using WMD as the similarity/dissimilarity measure between sentences and the SMaxRouge strategy for selecting a single best solution from the final Pareto optimal front (this configuration gives the best results). Some random documents from each of the DUC2001/DUC2002 datasets are selected to perform the error analysis. Some parts of the lines in the predicted and reference/actual summaries do not match because some sentences in the actual summary were rewritten by human annotators. In Fig 5.4, an example of a summary generated by our proposed algorithm is shown for document AP881109-0149 of topic d21d in the DUC2001 dataset. The same color shows matching lines, and the beginning of a line is indicated by [Line-number]. Here, the generated summary covers most of the sentences in the actual summary, with ROUGE-1 and ROUGE-2 scores of 0.8115 and 0.6383, respectively; therefore, it is considered a good summary.

Reference summary: DUC2001 -> d21d -> AP881109-0149

[Line-1] The cruise ship Song of America was forced to return to port after an engine seized up and started a small fire, but no one was hurt . [Line 2] The ship left Miami on Sunday with about 1,300 passengers on a Caribbean cruise. Rick Steck, a spokesman for Royal Caribbean Cruise Line said the fire was quickly doused by crewmembers . [Line 3] The passengers, who had been brought on deck , were allowed to resume the evening's activities. [Line 4] The 705-foot ship turned around and returned to Cozumel on its remaining three engines to replenish firefighting supplies. The passengers stayed aboard , and the ship will return to Miami on Thursday or Friday .

Predicted summary:

[Line-1] The cruise ship Song of America was forced to return to port after an engine seized up and started a small fire, but no one was hurt, the ship 's owner said today. [Line 2] The ship left Miami on Sunday with about 1,300 passengers on a Caribbean cruise. [Line 3] The passengers were mustered on deck while crew members doused the blaze, but then allowed to resume the evening's activities, he said. [Line 4] The 705-foot ship turned around and returned to Cozumel on its remaining three engines to replenish firefighting supplies, Steck said.

Figure 5.4: An example of reference summary and predicted summary for document AP881109-0149 of topic d21d under the DUC2001 dataset.

Fig 5.5 shows an example of a predicted summary that does not seem to be good; the corresponding ROUGE-1 and ROUGE-2 scores are 0.44 and 0.1276, respectively. The likely reason is that the reference summary was rewritten by human annotators. Our developed approach is based on extractive summarization: it selects sentences directly from the document for the generated summary but is not capable of restructuring

the sentences. For example, consider Line-1 of the predicted summary in Fig 5.5, which is very long in the original document but is shortened by the annotators in Line-1 of the reference summary, allowing the reference summary to cover other themes of the main document (since more words can then be added within the desired summary length). Our predicted summary, however, is not able to cover the whole idea of the document, as the selection of Line-1 consumes many of the allowed words and not many further sentences can be added because of the restriction on the number of words in the summary.

Reference summary: DUC2001 -> d60k -> SJMN91-06106024

[Line-1] Rodney King spends his time seeing doctors and thinking about his injuries he fears may become permanent. [Line-2] He is staying with relatives and fears retribution by the police. [Line- 3] His ex-wife says he's depressed and frightened; his attorney has hired guards to protect him. [Line-4] King suffers headaches and numbness of the face after five hours of plastic surgery to repair fractures of his cheek and eye bones, and has instituted an $83 million law suit against the city for excessive force. [Line-5] In another development, he 's now a suspect in a February 21 robbery and shooting, a result of the wide publicity.

Predicted summary:

[Line-1] Six weeks after his beating by Los Angeles police and seemingly forgotten in the political turmoil that has followed -- Rodney G. King fears retribution, spends most of his time seeing doctors, and thinks a lot about the headaches, scars and facial numbness he worries might become permanent. [Line-2] Lerman has filed an $83 million claim against the city on King's behalf. King's neat, blue home in Altadena has the curtains drawn, its phone number and those of other family members long changed .

Figure 5.5: An example of reference summary and predicted summary for document SJMN91-06106024 of topic d60k under the DUC2001 dataset.

5.6.8 Study on Effectiveness of SOM based Operators on DUC2001 and DUC2002 datasets

Note that the difference in the Rouge-1/Rouge-2 scores attained by the 'with SOM' and 'without SOM' versions of Approach-1 (WMD) (shown in Table 5.1) appears to be very small. In order to investigate this issue further, we have carried out the following analyses: (a) box plots; (b) a t-test. Detailed information about these is given below:

1. Box plots: We have plotted box plots showing the variations of the average Rouge-1/Rouge-2 values of the highest-ranked (rank-1) solutions produced in the final generation for each document. For example, let d be a particular document belonging to the DUC2001/DUC2002 dataset and Q be the number of rank-1 solutions obtained on the final Pareto optimal front of the final generation for that document; then the average Rouge-1


for the document ‘d’ denoted as Average R1d is calculated as:

Average\_R1_d = \frac{1}{Q} \sum_{j=1}^{Q} R1_j \qquad (5.13)

where R1_j indicates the Rouge-1 score of the j-th rank-1 solution. Similar steps are followed to calculate the average Rouge-2 value. Following the above process, average Rouge scores are calculated for all the documents. This is done because in Table 5.1 we report the average Rouge-1/Rouge-2 scores of the best solutions of all documents, and the best solution is one of the highest-ranked solutions. Note that the best results are obtained using Approach-1 (WMD); therefore, the box plots are drawn for this method. From Fig 5.6(a) and 5.6(b), it is evident that Approach-1 with SOM-based operators attains better median values of the average Rouge-1/2 values of the rank-1 solutions of all documents for the DUC2001 and DUC2002 datasets, respectively, in comparison to those obtained without SOM-based operators. Also, for both datasets, Approach-1 (WMD) using SOM-based operators covers solutions having a higher range of Rouge-1/Rouge-2 values, as can be seen from the green points in these figures.

We have also drawn box plots for three random documents showing the Rouge-1/Rouge-2 variations (with SOM and without SOM based operators) across the different rank-1 solutions. These box plots are shown in Fig 5.7 and Fig 5.8 for the DUC2001 and DUC2002 datasets, respectively. These per-document box plots also show the superiority of the SOM-based operators in covering a high range of Rouge-1 and Rouge-2 score values. At the top of each sub-figure of Fig 5.7 and Fig 5.8, a super-title describes the dataset name, topic name and document number under that topic. For example, at the top of Fig 5.7(a), 'DUC2001/d03a/WSJ911204-0162' indicates dataset DUC2001, topic d03a and document WSJ911204-0162.

2. t-test: We have also conducted a t-test to assess the significance of the difference between the Rouge recall values obtained by the two versions (with SOM and without SOM based operators) of Approach-1 (WMD) under the SMaxRouge scheme. The p-values (at the 5% significance level) attained by these approaches are reported in Table 5.5.

Table 5.5: The p-values obtained by Approach-1 (WMD) with SOM and without SOM based operators (under the SMaxRouge scheme) considering the ROUGE-1 and ROUGE-2 scores.

Dataset     ROUGE-1    ROUGE-2
DUC2001     0.024134   0.032038
DUC2002     0.218967   0.238569



Figure 5.6: Box plots. Sub-figures (a) and (b) for DUC2001 and DUC2002 dataset, respectively, show the variations of average Rouge-1/Rouge-2 values of highest ranked (rank-1) solutions in each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions.

The p-values obtained on the DUC2001 dataset clearly show that Approach-1 (WMD), when used with SOM based operators, significantly improves the results. On the DUC2002 dataset, however, the results are not significant, as the Rouge scores attained by Approach-1 (WMD) with SOM based operators are close to those attained by Approach-1 (WMD) without SOM based operators. Nevertheless, from Figures 5.6-5.8, it is fair to say that there exists a set of documents for which our approach is able to determine good quality solutions with high Rouge scores in fewer iterations when used with SOM based operators.
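For concreteness, the following is a minimal sketch (not the actual analysis script) of how the per-document averages of Eq. (5.13) could be aggregated and visualised as box plots similar to Fig. 5.6; the rouge_scores dictionary and all its values are hypothetical placeholders.

```python
import matplotlib.pyplot as plt

# Hypothetical per-document Rouge-1 scores of the rank-1 solutions
# in the final generation, for both variants of Approach-1 (WMD).
rouge_scores = {
    "with_SOM":    {"doc1": [0.42, 0.45, 0.40], "doc2": [0.38, 0.36]},
    "without_SOM": {"doc1": [0.40, 0.41, 0.39], "doc2": [0.35, 0.33]},
}

def average_r1(scores):
    """Average Rouge-1 over the rank-1 solutions of one document (cf. Eq. 5.13)."""
    return sum(scores) / len(scores)

# One average value per document, for each variant.
data = [
    [average_r1(v) for v in rouge_scores["with_SOM"].values()],
    [average_r1(v) for v in rouge_scores["without_SOM"].values()],
]

plt.boxplot(data, labels=["with SOM", "without SOM"])
plt.ylabel("Average Rouge-1 of rank-1 solutions")
plt.show()
```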



Figure 5.7: Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2001 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document.

5.6.9 Statistical Significance t-test

To validate the results obtained by the proposed approach, a statistical significance test, namely Welch's t-test [187], is conducted at the 5% significance level. It is carried out to check whether the best ROUGE scores obtained by Approach-1 (WMD) with SOM based operators (under the SMaxRouge scheme) are statistically significant or occurred by chance. To establish this, we have calculated the p-value using Welch's t-test between two groups. The first group includes a



Figure 5.8: Box plots. Sub-figures (a), (b) and (c) show the Rouge-1/Rouge-2 score variations per document over DUC2002 dataset. In each colored box, the horizontal colored line indicates the median value of Rouge-1/Rouge-2 score using rank-1 solutions of a document.

list of ROUGE-1 (ROUGE-2) values produced by our method after executing it Q times (Q being equal to the number of comparing methods), while the second group contains a list of ROUGE-1 (ROUGE-2) values produced by the remaining methods. Two hypotheses are considered by this t-test, namely the null hypothesis and the alternative hypothesis. The null hypothesis states that there is no significant difference between the median ROUGE-1 (ROUGE-2) values of the two groups. On the contrary, the alternative hypothesis states that there is a significant difference between the median ROUGE-1 (ROUGE-2) values of the two groups. This t-test provides a p-value; a small p-value

signifies that our results are significant. The p-values obtained are shown in Table 5.6. The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.

Table 5.6: The p-values obtained by Approach-1 (WMD) with SOM based operators (under SMaxRouge scheme) with respect to existing methods.

Dataset    ROUGE-1     ROUGE-2
DUC2001    0.000152    < 0.00001
DUC2002    0.004183    < 0.00001
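As an illustration, Welch's t-test described above can be reproduced with scipy by disabling the equal-variance assumption; the two score lists below are hypothetical placeholders, not the actual experimental values.

```python
from scipy import stats

# Hypothetical ROUGE-1 scores: group 1 from the proposed approach (Q runs),
# group 2 from the comparing methods.
group1 = [0.48, 0.47, 0.49, 0.46, 0.48]
group2 = [0.42, 0.40, 0.44, 0.41, 0.43]

# equal_var=False gives Welch's t-test (unequal variances).
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")

# Reject the null hypothesis at the 5% significance level if p < 0.05.
if p_value < 0.05:
    print("Difference is statistically significant.")
```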

5.6.10 Complexity Analysis of the Proposed Approach

In this section, the complexities of the proposed approach with and without SOM based genetic operators are analyzed. Let N be the number of solutions, M be the number of objectives to be optimized, and T be the maximum number of generations.

With SOM:

1) The population initialization step takes O(N) time as there are N solutions, which are randomly initialized as binary vectors obeying the length constraint. Each solution undergoes the objective function calculation step, which takes O(NM) time in total. Thus, the total time complexity of population initialization is O(N + NM), which is equivalent to O(NM). 2) The solutions in the population undergo SOM training, which takes O(N^2) time [157]. 3) Mating pool generation takes O(N^2) time, as for each solution we have to find its neighbors. 4) The time taken for new solution generation using the genetic operators (crossover and mutation) is O(N + NM); the term M appears because of the objective function calculation for each new solution. 5) Evaluation of the dominance and non-dominance relationships between the 2N solutions (after merging the old and new populations) and the subsequent selection of the best N solutions take O(MN^2) time [29]. Steps 2 to 5 are repeated for T generations. Note that the update of the SOM training data takes constant time and can therefore be ignored. Thus, the total time complexity of the proposed architecture with SOM based operators is

O(MN + T(N^2 + N^2 + N + NM + MN^2)).

On solving further, it gives rise to

=> O(MN + T(2N^2 + NM + MN^2))

=> O(MN + T(MN^2)) => O(MN(1 + TN)) ≈ O(TMN^2), which is the worst-case time complexity of our approach when using SOM based genetic operators.

Without SOM-based Genetic Operators:

In the proposed architecture without SOM based genetic operators, steps 2 and 3 are not present. Here, the mating pool for each solution is the entire population; the other steps remain the same. Thus, the total time complexity without SOM based genetic operators is

O(MN + T(N + NM + MN^2)) ≈ O(MN + TMN^2) => O(MN(1 + TN)) ≈ O(TMN^2), which is the same as the time complexity of the proposed architecture developed with SOM based genetic operators.

5.7 Conclusive Remarks

In this chapter, an extractive single document text summarization system is developed. Six objective functions are utilized for selecting a good subset of the sentences present in a document. The similarity/dissimilarity between two sentences is calculated using three measures: normalized Google distance, word mover distance, and cosine similarity, to show that the summarization result depends not only on the proposed framework but also on the type of similarity/dissimilarity measure used. Various unsupervised methods are explored to select a single best summary from the available set of summaries on the final Pareto optimal front. Experimental results on several benchmark datasets show that our SOM-based approach with WMD as a distance measure outperforms other existing methods: it obtains 45% and 5% improvements over the best existing method in terms of ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2001 dataset, while for the DUC2002 dataset, the improvements are 20% and 5% in terms of ROUGE-2 and ROUGE-1 scores, respectively. Currently, there is great demand for figure summarization in the biological domain because figures contribute significantly to understanding the core concepts, yet they are difficult to interpret for both humans and machines. Therefore, in the next chapter, we will propose a system for the summarization of figures in biological articles.


CHAPTER 6

Textual Entailment based Figure Summarization for Biomedical Articles

In biomedical scientific articles, figures play a significant role in understanding the core concept of the research presented. However, due to their high level of complexity, they cannot be easily interpreted by machines or humans. Therefore, in this chapter, we propose a novel unsupervised approach (FigSum++) for automatic figure summarization in biomedical scientific articles. Different quality measures capturing the relevance of the sentences to the figure are simultaneously optimized using the search capability of a multi-objective optimization technique to obtain a good set of sentences in the summary. A new way of measuring diversity among sentences in terms of textual entailment is also proposed.


6.1 Introduction

6.1.1 Overview

In the current chapter, we introduce a novel extractive summarization technique to deal with the problem of summarizing the figures in biomedical articles in an unsupervised way. According to Futrelle [25], 50% of the text in biomedical articles is related to figures. Moreover, as per [26], the caption of the figure together with the title and abstract of the article conveys only 30% of the information related to the figure. These figures are difficult to interpret for both humans and machines; therefore, the associated text in the article can be used to describe them. For example, [113] proposed a system, FigSum, to generate a summary of images in the biomedical domain using text scattered throughout the various sections of scientific articles, such as the introduction, proposed method, and results sections. The top-scoring sentences having high tf-idf cosine similarity [188] with the figure's caption and the article's central theme were considered as part of the summary. However, a biomedical article contains many sentences, and it is difficult to decide which are more relevant to the figure. Therefore, there is a need to develop a more sophisticated system that summarizes figures by extracting the relevant sentences while optimizing different criteria in an unsupervised way. To measure the similarity between sentences, a well-known measure, cosine similarity [189], is used: the higher the similarity, the closer the sentences are. This measure requires a vector representation of the sentences, for which a recently developed pre-trained language model trained on large biomedical corpora, namely BioBERT [45], is utilized. BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model. It has been applied to different NLP tasks, and improved performance has been reported on many BioNLP tasks [45], such as biomedical relation extraction and biomedical named entity recognition, among others. Therefore, in our task, we have made use of this representation to characterize the sentences in semantic space; note that it can capture the semantic similarity between sentences. Our work is motivated by the fact that a biomedical article contains many sentences, and these may be relevant to the figure with respect to different perspectives (also called scoring features, fitness functions or objective functions), such as whether a sentence refers to that figure (SRF), the amount of similarity a sentence has with the figure's caption (SFC), the number of 1-gram overlapping words between a sentence and the figure's caption (SOC1), and the number of 2-gram overlapping words between a sentence and the figure's caption (SOC2). Moreover, whether a sentence entails the figure's caption (STE) or not can be considered as another scoring

function. Therefore, in our proposed system (FigSum++), these sentence-scoring functions are optimized simultaneously in an unsupervised way using the multi-objective binary differential evolution [46] algorithm (MOBDE), which is an evolutionary algorithm (EA). For more details about MOBDE, refer to Section 2.1.10 of Chapter 2. To avoid redundancy in the generated summary, another goodness measure named anti-redundancy (SAR) is also considered in our optimization process. Note that SAR employs cosine similarity while computing the similarity/dissimilarity between sentences of the summary in semantic space; it is included to maintain diversity amongst sentences. Generally, in the MOBDE framework, the rand/1/bin scheme/variant is used (as described in Section 2.1.10 of Chapter 2) to generate a new solution at each iteration using fixed values of two parameters: the mutation factor (F) and the crossover rate (CR) [80]. As a result, the search ability of such algorithms can be limited. Note that in the MOBDE framework, CR and F are the two crucial parameters that help in reaching the global optimal solution. Moreover, the rand/1/bin scheme may not be efficient as it has an exploratory nature, whereas the best solution (or the best summary for a given figure) may lie in a local or global region. Therefore, instead of rand/1/bin, an ensemble of two other DE schemes (current-to-rand/1/bin and current-to-best/1/bin) is used in the new solution generation process. The motivation behind using these variants is that in any evolutionary algorithm, diversity among solutions and convergence towards the true/global optimal solutions are important phenomena, which can be achieved using current-to-rand/1/bin and current-to-best/1/bin, respectively. More information about these variants can be found in [80]. Also, to avoid fixing the values of the F and CR parameters, pools of values for these parameters are considered, based on the literature [80, 190]; the DE variants can randomly select F and CR values from the given pools. This mechanism is shown in Figure 6.3 (a more detailed description is provided in Section 6.3.4). Because it is difficult to decide which set of objective functions is the most suited for our task using the MOO-based algorithm, an ablation study was also carried out on the selected objective functions. Here, an ablation study means that various combinations of the objective functions, for example, (a) SAR TE and STE; (b) SAR TE and SRF; (c) SAR TE, SRF, and SFC, and others, are optimized simultaneously using the MOBDE framework in different runs of our proposed algorithm. Textual entailment (TE) [63] is itself a challenging problem in the NLP domain. The importance of TE can be understood from the BioNLP1 2019 shared task on textual inference and question entailment on biomedical text. For a description of TE with an example, the reader can refer to Section 2.1.6 of Chapter 2. Due to the popularity of TE, we have proposed a different

1https://aclweb.org/aclwiki/BioNLP Workshop

way of measuring anti-redundancy in a summary: the sentences in a summary should not entail each other, in order to maintain diversity amongst them. Thus, in total, two ways of measuring anti-redundancy in the summary are explored: one makes use of cosine similarity, while the other makes use of the textual entailment relationship between sentences.

6.1.2 Contributions

Following are the major contributions of this chapter:

1. To the best of our knowledge, the proposed work is the first attempt at developing a multi-objective based framework for solving the figure-summarization task, in which various sentence scoring features, such as whether a sentence refers to the figure, the semantic similarity between sentences and the figure's caption, the number of overlapping words between sentences and the figure's caption, and others, are optimized simultaneously to generate a good quality summary. Moreover, whether the sentences in a summary entail the figure's caption is also considered as another objective function in the optimization process.

2. Any multi-objective evolutionary algorithm should satisfy two properties: diversity among solutions and convergence towards the true Pareto optimal front. To achieve this, two different DE variants (current-to-rand/1/bin and current-to-best/1/bin) are utilized in the current framework; the first scheme promotes diversity and the second promotes convergence.

3. To minimize redundancy amongst sentences in the generated summary, a new method utilizing textual entailment relationships between sentences is proposed.

4. To measure the similarity amongst sentences in the semantic space, a recently proposed deep learning-based pre-trained language model, namely BioBERT [45], developed for biomedical text mining, is utilized.

5. To determine the set of most contributing objective functions in our optimization process, an ablation study is presented.

6. All the existing approaches provide a single fixed-length summary at the end of the execution. As our approach is population-based in nature, multiple summaries of different lengths are provided to the end-user, who can select any summary based on his/her choice.

We tested our system on two gold-standard datasets, FigSumGS1 and FigSumGS2, containing 91 and 84 figures, respectively. The results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques.


[Figure content: binary solution vector 1 0 0 1 1 0 0 1 0 0 1 0 over the sequence of sentences in a biomedical article; the 1st, 4th, 5th, 8th and 11th sentences are selected for the summary.]

Figure 6.1: ith solution representation in the population. Here, 12 is the number of sentences in the article; '0' denotes that the corresponding sentence will not be part of the extractive summary, and '1' denotes that it will.

Table 6.1: Description of symbols used in describing the objective functions (mathematical formulation).

Symbol       Description
xi           ith solution (our system generates a set of solutions, and each solution corresponds to a subset of sentences forming a summary for the mth figure)
N            maximum length of the solution, i.e., the number of sentences in the article
xij          jth component (1/0) of the ith solution; 0 indicates that the jth sentence is not selected for the summary and 1 indicates that it is selected
sij          jth sentence of the kth article, belonging to the ith solution
| . |        measures the count
M            cosine similarity between two sentences
Ckm          caption of the mth figure in the kth article
S1           the set of sentences in the article entailed to Ckm
S2           the set of sentences in the ith solution entailed to Ckm
sia → sib    bth sentence of the kth article, belonging to the ith solution, entailed by the ath sentence of the same article and solution
↑ and ↓      indicate that the fitness function is of maximization and minimization type, respectively

6.2 Problem Definition

Consider a biomedical article A consisting of N sentences, A = {s1, s2, . . . , sN}, and a set of M figures {Fig-1, Fig-2, . . . , Fig-M}. We aim to summarize the mth figure (Fig-m) using these sentences. Our main objective is to select a subset of sentences, S ⊆ A, related to the mth figure, defined as follows:

$$ S_{min} \leq \sum_{i=1}^{N} B_i \leq S_{max}, \qquad B_i = \begin{cases} 1, & \text{if } s_i \in S \\ 0, & \text{otherwise} \end{cases} \qquad (6.1) $$

such that {SAR TE(S), SAR CS(S), STE(S), SFC(S), SRF(S), SOC1(S), SOC2(S)} are optimized simultaneously; where Smin and Smax are the minimum and the maximum number of sentences to be present in the summary, respectively; SAR TE(S), SAR CS(S), STE(S),


SFC(S), SRF(S), SOC1(S), and SOC2(S) are the objective functions measuring different aspects/qualities of the summary at the syntactic and semantic level, as discussed below. Note that (a) two or more objective functions can be used instead of all seven; (b) in STE, SFC, SOC1, and SOC2, the mth figure's caption is utilized. Let us assume that we want to generate the summary of the mth figure in the kth article whose caption is Ckm. The steps for computing the objective functions for the ith solution are enumerated below (illustrative sketches of the solution encoding and of these computations are given directly below and after the list, respectively), and the notations used while calculating these objectives are provided in Table 6.1. The representation of the ith solution is shown in Figure 6.1.
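The following minimal sketch illustrates the binary solution encoding of Figure 6.1 and the length constraint of Eq. (6.1); the sentence list and the bounds Smin and Smax are hypothetical values chosen only for illustration.

```python
# A solution is a binary vector over the article's sentences (cf. Figure 6.1):
# 1 means the sentence is part of the figure's summary, 0 means it is not.
solution = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0]     # selects sentences 1, 4, 5, 8, 11

S_MIN, S_MAX = 2, 6                                  # hypothetical summary-length bounds

def satisfies_constraint(sol, s_min, s_max):
    """Check the constraint of Eq. (6.1): s_min <= sum_i B_i <= s_max."""
    return s_min <= sum(sol) <= s_max

def selected_sentences(sol, sentences):
    """Return the subset S of sentences encoded by the binary vector."""
    return [s for bit, s in zip(sol, sentences) if bit == 1]

sentences = [f"sentence-{i}" for i in range(1, 13)]  # stand-in for the article text
print(satisfies_constraint(solution, S_MIN, S_MAX))  # True (5 sentences selected)
print(selected_sentences(solution, sentences))
```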

1. SAR: There can be redundant sentences in the article. Therefore, to reduce redundancy in the summary, two versions of SAR are considered (only one is used at a time):

(a) SAR CS (↓): It measures the cosine similarity (CS) between sentences in the summary. Its score for the ith solution is calculated as

$$ \text{SAR CS} = \frac{\sum_{a,b=1,\, a \neq b}^{N} M(s_{ia}, s_{ib})}{O}, \quad \text{if } x_{ia} = x_{ib} = 1 \qquad (6.2) $$

where O is the total number of sentence pairs considered during the calculation; the rest of the notations are described in Table 6.1.

(b) SAR TE (↓): The second version measures the anti-redundancy between sentences of the summary in terms of textual entailment relationships. It is defined as

$$ \text{SAR TE} = \frac{\sum_{a=1}^{N}\sum_{b=1}^{N} Q(s_{ia}, s_{ib})}{O}, \quad \text{if } x_{ia} = x_{ib} = 1, \qquad Q(s_{ia}, s_{ib}) = \begin{cases} 1 & \text{if } s_{ia} \rightarrow s_{ib} \\ 0 & \text{otherwise} \end{cases} \qquad (6.3) $$

Here O is the total number of sentence pairs considered during the calculation.

2. STE (↑): This function calculates the entailment relationships between the sentences of the summary and the figure's caption. To calculate its score, we first need to identify the sentences in the article which are entailed to the mth figure's caption, i.e., Ckm; let us denote this set as S1 (see Table 6.1). Then, the number of overlapping sentences between the ith solution and S1 is calculated, which is considered as the STE score. Mathematically, it can be expressed as

$$ \text{STE} = | S_1 \cap S_2 | \qquad (6.4) $$

Note that to identify the sentences in the article entailed to Ckm, we have used the pre-trained model available at https://github.com/jgc128/mednli_baseline. In this model,


GloVe2 embeddings (840B tokens, 2.2M vocabulary size, and 300-dimensional vectors) are used for initialization, followed by fine-tuning using fastText3 word embeddings on BioASQ4 and MIMIC-III5 data. Note that BioASQ is a collection of 12,834,585 abstracts of scientific articles related to the biomedical domain, and the MIMIC-III data consists of 2,078,705 clinical notes with 320 tokens.

3. SFC (↑): In this objective, the average cosine similarity between the sentences in the ith solution and the figure's caption (Ckm) belonging to the kth article is calculated. Mathematically, its score is calculated as:

$$ \text{SFC} = \frac{\sum_{j=1}^{N} M(s_{ij}, C_{km})}{L}, \quad \text{if } x_{ij} = 1 \qquad (6.5) $$

where L is the number of components xij having the value 1.

4. SRF (↑): It counts the number of sentences present in the ith solution that refer to the mth figure using the keyword 'Figure-m'. It is computed as

$$ \text{SRF} = \sum_{j=1}^{N} I_j, \quad \text{where } I_j = 1 \text{ if sentence } s_{ij} \text{ refers to the } m\text{th figure and } x_{ij} = 1 \qquad (6.6) $$

5. SOC1 (↑): It counts the number of 1-gram overlapping words between sentences present in the ith solution and mth figure’s caption; it is defined as follows:

$$ \text{SOC1} = \sum_{j=1}^{N} |\, \text{Words}(s_{ij}) \cap \text{Words}(C_{km}) \,|, \quad \text{if } x_{ij} = 1 \text{ and words are 1-gram units} \qquad (6.7) $$

6. SOC2 (↑): It is similar to SOC1; the only difference is that 2-gram overlapping words are counted instead of 1-grams. It is calculated as:

$$ \text{SOC2} = \sum_{j=1}^{N} |\, \text{Words}(s_{ij}) \cap \text{Words}(C_{km}) \,|, \quad \text{if } x_{ij} = 1 \text{ and words are 2-gram units} \qquad (6.8) $$
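To make the above objective definitions concrete, the following is a minimal sketch of how SFC, SRF, SOC1, and SAR CS could be computed for one solution. The similarity function is a crude token-overlap stand-in for the BioBERT-based cosine similarity M(., .) used in the thesis, the figure-reference check is a loose keyword match, and all sentences, captions and the solution vector are hypothetical.

```python
import re

def cosine_sim(a, b):
    """Stand-in for M(., .): token-overlap similarity instead of BioBERT cosine."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1.0, (len(ta) * len(tb)) ** 0.5)

def ngrams(text, n):
    tokens = re.findall(r"\w+", text.lower())
    return set(zip(*[tokens[i:] for i in range(n)]))

def sfc(selected, caption):
    """Eq. (6.5): average similarity between selected sentences and the caption."""
    return sum(cosine_sim(s, caption) for s in selected) / max(1, len(selected))

def srf(selected, m):
    """Eq. (6.6): number of selected sentences referring to the m-th figure."""
    return sum(1 for s in selected
               if f"figure-{m}" in s.lower() or f"figure {m}" in s.lower())

def soc(selected, caption, n):
    """Eqs. (6.7)/(6.8): n-gram overlap between selected sentences and the caption."""
    return sum(len(ngrams(s, n) & ngrams(caption, n)) for s in selected)

def sar_cs(selected):
    """Eq. (6.2): average pairwise similarity among selected sentences (minimised)."""
    pairs = [(a, b) for i, a in enumerate(selected) for b in selected[i + 1:]]
    return sum(cosine_sim(a, b) for a, b in pairs) / max(1, len(pairs))

# Hypothetical example: a 4-sentence article and a solution selecting 2 sentences.
sentences = ["Figure 2 shows the binding affinity.",
             "The assay was repeated three times.",
             "Binding affinity increases with dose, see Figure 2.",
             "We thank the funding agency."]
caption = "Binding affinity of the compound at increasing doses"
solution = [1, 0, 1, 0]
selected = [s for bit, s in zip(solution, sentences) if bit]

print(sfc(selected, caption), srf(selected, 2),
      soc(selected, caption, 1), sar_cs(selected))
```

The entailment-based objectives STE and SAR TE follow the same pattern, with the predicate Q(., .) supplied by a pre-trained entailment model such as the MedNLI baseline mentioned above.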

6.3 Proposed Approach

This section discusses the various steps followed in our proposed approach (FigSum++). The corresponding flowchart is also shown in Figure 6.2.

2 https://nlp.stanford.edu/projects/glove/
3 https://fasttext.cc/docs/en/english-vectors.html
4 http://participants-area.bioasq.org/general information/Task6a/
5 https://mimic.physionet.org/


1. Input: Figure with it’s 2. Pre-processing of sentences 3. Population initialization (P) caption and sentences in the article

No 4. Calculation of objective t < tmax functions

Yes

9. Selection of the best |P| 5. Generate new solutions to form a new Solutions population P`

10. Select the best solution

and report the 8. Apply non-dominating sorting 6. Calculate objectives functional values corresponding corresponding to each new solution summary

7. Merge old population (P) and new population (P`)

Figure 6.2: Flow chart of the proposed architecture, where t is the current generation number (initialized to 0), tmax is the user-defined maximum number of generations, and |P| is the size of the population.

6.3.1 Pre-processing

Before applying our proposed approach, pre-processing of the biomedical article is required. The steps followed are described below:

1. Biomedical articles are available in PDF format; therefore, sentences are first extracted using the Grobid tool6. Note that while extracting the sentences, the abstract and the appendix (if available) are excluded; only the remaining sections, such as the introduction and methodology, are used.

2. Removal of stop-words.

Moreover, the cosine similarity between sentences is pre-computed, as it is required repeatedly while running the experiments. To calculate it, sentences are first represented as fixed-length numeric vectors using the BioBERT [45] language model.
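A minimal sketch of this pre-computation is shown below, assuming the HuggingFace transformers library and a publicly released BioBERT checkpoint; the exact checkpoint name and the mean-pooling strategy are assumptions for illustration and are not necessarily those used in the thesis.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed BioBERT checkpoint (the thesis used the naver/biobert-pretrained release).
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences):
    """Fixed-length sentence vectors via mean pooling of BioBERT token embeddings."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

sentences = ["Figure 2 shows the binding affinity.",
             "The assay was repeated three times."]
vectors = embed(sentences)
sim_matrix = cosine_similarity(vectors)     # pre-computed pairwise similarities
print(sim_matrix)
```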

6.3.2 Population Initialization and Solution Representation

This step includes the initialization of the population. The population P consists of a set of solutions < ~x1, ~x2, . . . , ~x|P| >, where |P| is the size of the population. For our task, a binary representation of the solution is used, with length equal to the number of sentences present in the article. Each solution may have a varied number of selected sentences, generated randomly within the range

[Smin, Smax]. If the jth component of the solution is 1, then the jth sentence is part of the

6https://grobid.readthedocs.io/en/latest/Grobid-service/

summary, and vice-versa. The solution representation is shown in Fig. 6.1, assuming that the article has 12 sentences.
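A minimal sketch of this initialization step, under the solution encoding just described, is given below; the population size, number of sentences, and length bounds are hypothetical values.

```python
import random

def init_population(pop_size, n_sentences, s_min, s_max, seed=0):
    """Randomly create binary solutions, each selecting between s_min and s_max sentences."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        k = rng.randint(s_min, s_max)                  # number of selected sentences
        chosen = rng.sample(range(n_sentences), k)     # which sentences to select
        solution = [1 if i in chosen else 0 for i in range(n_sentences)]
        population.append(solution)
    return population

P = init_population(pop_size=40, n_sentences=12, s_min=2, s_max=6)
print(P[0], sum(P[0]))   # first solution and its number of selected sentences
```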

6.3.3 Calculation of Objectives Functions

After initializing the population, the objective functional values (discussed in Section 6.2) are computed for each solution; they help in evaluating the quality of the solution (or summary, as each solution represents a summary). Note that the proposed framework is generic, and the user can select any combination of objective functions.

6.3.4 Genetic Operators

This step is shown as step 5 of Figure 6.2. The process followed for new solution generation is described below. Let ~xc,t be the current solution in the population at generation t for which a new solution is to be generated.

Mating Pool Generation

The mating pool includes a set of solutions which can mate to generate new solutions. To construct the mating pool for the current solution ~xc,t, a fixed number of random solutions is picked from the population.

Mutation and Crossover

In our work, we have used two trial vector generation schemes/variants, namely current-to-rand/1/bin and current-to-best/1/bin, which differ from the rand/1/bin variant used in most of the other chapters. These schemes have distinct properties: the first helps in creating diverse solutions from the current solution (thereby introducing diversity among solutions), while the second helps in speeding up the convergence rate (providing the right direction for reaching the global optimal solution). Moreover, F and CR are two crucial parameters of the MOBDE framework which help in generating good quality solutions and achieving faster convergence. In the literature [67, 191], the value suggested for F usually lies between 0.4 and 1, while for CR a value of 0.9 or 1 is suggested. However, fixing the values of these parameters can limit the search. Therefore, instead of fixing them, pools of F and CR values are provided, motivated by [190], and the schemes select these parameter values randomly from the given pools. Descriptions of these variants are provided in [190] for continuous space; as our approach is based on binary encoding, they are adapted to binary space following [46]. To generate new trial vectors corresponding to ~xc,t,

both schemes first apply mutation and then crossover, as discussed below:

1. current-to-rand/1/bin: a) Mutation: To perform this operation for the current solution ~xc,t, three random solutions ~xr1,t, ~xr2,t, and ~xr3,t are first selected from its mating pool, and then a probability vector P(xc,t+1) is generated by the following operation:

$$ P(x_j^{c,t+1}) = \frac{1}{1 + e^{-\frac{2b\left[x_j^{c,t} + r \times (x_j^{r1,t} - x_j^{c,t}) + F \times (x_j^{r2,t} - x_j^{r3,t}) - 0.5\right]}{1 + 2F}}} \qquad (6.9) $$

where ~xc,t is the current solution at generation t for which a new solution is generated, P(x_j^{c,t+1}) is the probability estimation operator, x_j^{c,t} + r × (x_j^{r1,t} − x_j^{c,t}) + F × (x_j^{r2,t} − x_j^{r3,t}) − 0.5 is the mutation operation, b is a real positive constant, r is a random number between 0 and 1, F is the DE control parameter, and x_j^{k,t} is the jth component of the kth solution for k ∈ {r1, r2, r3, c} at generation t. This operator generates a probability value for each component of the current solution. Then Eqs. 2.11 and 2.12 are followed to generate the corresponding offspring, ~vc,t+1, for the current solution, ~xc,t.

2. current-to-best/1/bin: This variant makes use of two random solutions selected from the mating pool, the current solution (~xc,t), and the best solution ~xbest,t to generate a trial vector. Similar to current-to-rand/1/bin, it also first performs mutation and then crossover. To select the best solution in the current generation, a mechanism like non-dominated sorting [29] could be used, but it would increase the computation time. Therefore, in our approach, the best solution is selected by considering the average of the used objective functions (shown mathematically in Eq. 6.10).

$$ \vec{x}^{\,best,t} = \arg\max_{i = 1, 2, \ldots, |P|} \left( \sum_{j=1}^{m} Ob_{ij} \right) / m \qquad (6.10) $$

where |P| and m are the size of the population (i.e., the number of solutions) and the number of used objective functions, respectively, and Obij is the jth objective function value corresponding to the ith solution. Note that the values of SAR CS/SAR TE are reversed while computing ~xbest,t because these two functions are of minimization type while the rest are of maximization type. Then the following operation is performed to generate the probability vector, which is further converted into binary space.

$$ P(x_j^{c,t+1}) = \frac{1}{1 + e^{-\frac{2b\left[x_j^{c,t} + r \times (x_j^{best,t} - x_j^{c,t}) + F \times (x_j^{r1,t} - x_j^{r2,t}) - 0.5\right]}{1 + 2F}}} \qquad (6.11) $$


[Figure content: the current solution ~xc,t produces two trial vectors using F and CR values drawn from the pools; constraint violations are repaired to make the vectors feasible, and the best trial vector is selected.]

Figure 6.3: Flow chart of the generation of new solutions from the current solution ~xc,t at generation t using two DE variants. Here, F and CR are pools of parameter values; the two trial vectors are generated using the current-to-rand/1/bin and current-to-best/1/bin schemes, respectively.

where ~xbest,t is the best solution at generation t and ~xc,t is the current solution at generation t for which the new solution is generated; the rest of the notations are the same as in current-to-rand/1/bin. Then Eqs. 2.11 and 2.12 are followed to generate the trial vector.

Out of the two trial vectors, the one having the better objective function values is considered as the best trial vector for the current solution [192]. To find the best trial vector, we again use the concept of the maximum average objective functional value.
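The following minimal sketch illustrates the probability-estimation mutation of Eqs. (6.9) and (6.11), together with a simple stochastic binarization standing in for Eqs. (2.11)-(2.12) of Chapter 2 (which are not reproduced here); the parameter values, binarization rule, and all vectors are assumptions for illustration.

```python
import math
import random

def prob_estimate(x_c, x_a, x_b1, x_b2, F, r, b=6.0):
    """Probability vector of Eqs. (6.9)/(6.11): mutant value squashed by a sigmoid.

    x_a is the random (current-to-rand) or best (current-to-best) solution, and
    (x_b1 - x_b2) is the difference vector scaled by F.
    """
    probs = []
    for xc, xa, xb1, xb2 in zip(x_c, x_a, x_b1, x_b2):
        mutant = xc + r * (xa - xc) + F * (xb1 - xb2)
        probs.append(1.0 / (1.0 + math.exp(-2.0 * b * (mutant - 0.5) / (1.0 + 2.0 * F))))
    return probs

def binarize(probs, rng):
    """Stand-in for Eqs. (2.11)-(2.12): sample each bit against its probability."""
    return [1 if rng.random() < p else 0 for p in probs]

rng = random.Random(0)
F = rng.choice([0.6, 0.8, 1.0])          # F drawn from its pool (cf. Table 6.3)
r = rng.random()
x_c  = [1, 0, 0, 1, 1, 0]                # current solution (hypothetical)
x_r1 = [0, 1, 0, 1, 0, 1]                # solutions from the mating pool
x_r2 = [1, 1, 0, 0, 1, 0]
x_r3 = [0, 0, 1, 1, 0, 0]

trial = binarize(prob_estimate(x_c, x_r1, x_r2, x_r3, F, r), rng)
print(trial)
```

The current-to-best variant is obtained by passing the best solution of Eq. (6.10) in place of x_r1; the better of the two resulting trial vectors, judged by the average objective value, is kept for the current solution.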

Checking of Constraints: After application of the mutation and crossover operations, the constraint on the number of 1s in the new solutions/trial vectors is checked. It is possible that the generated new solutions do not satisfy the constraint; therefore, to make them feasible (within the constraint), some heuristics are applied. The following steps are executed to make the new solutions feasible, i.e., within the range [Smin, Smax] (a minimal sketch of this repair procedure is given after the list):

• Let us denote the new solution (vc,t+1) as ith solution

• Initialize ModifiedSolution with zeros equal to the maximum length of the solution

• Sort the sentences present in the ith solution based on the maximum number of uni-grams, the maximum number of bi-grams, or the similarity with the figure's caption. To select a single sorting criterion, a random probability 'p' is generated: if p < 0.33, the sentences are sorted based on the maximum number of uni-grams; if 0.33 < p < 0.67, they are sorted based on the maximum number of bi-grams; otherwise, they are sorted based on the maximum similarity with the figure's caption.

• Generate a random number ‘r’ between Smin and Smax.

127 Textual Entailment based Figure Summarization for Biomedical Articles

• Fill the indices of ModifiedSolution with 1s until we cover ‘r’ indices. Note that indices are considered in the sorted order as done in step-3.

• Return the ModifiedSolution.
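A minimal sketch of this repair heuristic is given below; the uni-gram/bi-gram counting and the caption-similarity function are simplified stand-ins, and, for brevity, all sentences of the article are ranked rather than only those present in the trial vector.

```python
import random
import re

def repair(solution, sentences, caption, s_min, s_max, rng):
    """Force the number of selected sentences into [s_min, s_max] (constraint check)."""
    def unigrams(t): return len(re.findall(r"\w+", t))
    def bigrams(t):  return max(0, unigrams(t) - 1)
    def cap_sim(t):  # crude caption-similarity stand-in
        return len(set(t.lower().split()) & set(caption.lower().split()))

    p = rng.random()                        # pick one of the three sorting criteria
    key = unigrams if p < 0.33 else (bigrams if p < 0.67 else cap_sim)

    # Rank sentence indices by the chosen criterion (descending).
    order = sorted(range(len(sentences)), key=lambda i: key(sentences[i]), reverse=True)
    r = rng.randint(s_min, s_max)           # target number of selected sentences
    repaired = [0] * len(solution)          # ModifiedSolution initialized with zeros
    for idx in order[:r]:
        repaired[idx] = 1                   # fill the top-r indices in sorted order
    return repaired

# Hypothetical infeasible trial vector selecting too many sentences:
trial = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sentences = [f"sentence {i} about Figure 2" for i in range(12)]
print(repair(trial, sentences, "Figure 2 caption text", s_min=2, s_max=6,
             rng=random.Random(0)))
```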

The objective functional values of generated new solutions are also evaluated. The flow-chart of this entire process of solution generation is shown in Figure 6.3.

6.3.5 Selection of Best |P | Solutions for Next Generation

After forming a new population P', it is merged with the old population P, and then the top |P| solutions are selected using the dominance and non-dominance relationships between the solutions in the objective space. For more details about this step, refer to Section 3.2.6 of Chapter 3.
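A minimal sketch of this dominance-based selection is given below; it is a simplified stand-in for the full non-dominated sorting procedure of Chapter 3 (it only ranks solutions by how many others dominate them and omits crowding-distance computations), and it assumes all objectives have been converted to maximization.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def select_best(merged, objectives, pop_size):
    """Keep pop_size solutions, preferring those dominated by fewer others."""
    dom_counts = []
    for i, obj_i in enumerate(objectives):
        count = sum(1 for j, obj_j in enumerate(objectives)
                    if i != j and dominates(obj_j, obj_i))
        dom_counts.append(count)
    ranked = sorted(range(len(merged)), key=lambda i: dom_counts[i])
    return [merged[i] for i in ranked[:pop_size]]

# Hypothetical merged population of 4 solutions with 2 objective values each.
merged = ["s1", "s2", "s3", "s4"]
objectives = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.6), (0.1, 0.1)]
print(select_best(merged, objectives, pop_size=2))
```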

6.3.6 Termination Condition

The process of mating pool generation, crossover, and mutation, followed by selection and the update of the population, is repeated until the maximum number of generations, tmax, is reached.

In other words, the loop will continue until t < tmax. Here, t is the current generation number initialized to 0 and is incremented by 1 after each iteration. This step is shown by the diamond box in Figure 6.2.

6.3.7 Selection of Single Best Solution and Generation of Summary

After the final generation, we obtain a set of non-dominated solutions on the final Pareto optimal front. All these solutions are non-dominated with respect to each other and thus of equal importance; therefore, the decision-maker has to select a solution based on his/her requirement. In this work, for the purpose of reporting and comparative study, the summary corresponding to each Pareto optimal solution is generated, and the solution having the highest F-measure value is selected. The F-measure is calculated with respect to the gold/reference summary. The sentences in the summary are ordered according to their occurrence in the scientific article; for example, the sentence which appears first in the article is the first sentence in the summary.
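A minimal sketch of this final selection step is given below; summaries are treated as sets of sentence indices, and the candidate and gold summaries shown are hypothetical.

```python
def f_measure(predicted, gold):
    """Sentence-level F1 between a predicted summary and the gold summary."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def pick_best(pareto_front, gold):
    """Select the Pareto-optimal solution whose summary has the highest F-measure."""
    return max(pareto_front, key=lambda summary: f_measure(summary, gold))

# Hypothetical rank-1 summaries (sets of selected sentence indices) and gold summary.
front = [{1, 4, 5}, {1, 2, 8, 11}, {4, 8, 11}]
gold = {1, 4, 8, 11}
print(pick_best(front, gold))   # -> {4, 8, 11}
```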

6.4 Experimental Setup

In the subsequent sections, we discuss the datasets, evaluation measures, and parameters used.


6.4.1 Datasets

For our figure-summarization task, we have used two publicly available7 datasets. The first dataset, FigSumGS1, has 91 figures, while the second dataset, FigSumGS2, has 84 figures. The actual/gold summaries were made available by the annotators. These figures belong to 19 biomedical full-text articles. A brief description of the datasets is provided in Table 6.2, including the number of figures in each article, the number of sentences in each article, and the number of sentences in the gold summary of each figure. More information can be found in [20].

Table 6.2: Statistics of the used datasets. Here, U1 and U2 are the average number of unique sentences per figure in the FigSumGS1 and FigSumGS2 datasets, respectively; #SentInGS is the number of sentences present in the gold summary; '-' implies that the 18th and 19th articles are not used in the FigSumGS2 dataset.

                              FigSumGS1            FigSumGS2
Article  #Sent.  #Figures   U1     #SentInGS     U2     #SentInGS
1        190     3          5.0    15            11.7   35
2        144     3          18.0   54            11.7   35
3        173     7          5.0    35            8.0    56
4        160     5          8.6    43            10.2   51
5        172     4          12.8   51            10.5   42
6        140     5          8.4    42            10.8   54
7        281     9          7.8    70            11.8   106
8        137     9          4.7    42            6.3    57
9        142     5          6.2    31            11.2   56
10       87      5          6.4    32            8.4    42
11       162     6          6.0    36            9.7    58
12       34      2          7.5    15            6.0    12
13       50      3          8.0    24            11.0   33
14       138     3          5.0    15            12.7   38
15       119     3          12.3   37            11.0   33
16       120     5          9.2    46            12.4   62
17       152     7          5.1    36            14.1   99
18       157     4          6.2    25            -      -
19       184     6          4.8    29            -      -

6.4.2 Evaluation Measures

To evaluate the performance of our system against the available gold summaries, we report the F-measure (or F1-score) [20], which is a well-known measure in information retrieval. The formal definition of the F-measure is provided in Section 2.3.2 of Chapter 2.

7http://figshare.com/articles/Figure Associated Text Summarization and Evaluation/858903

129 Textual Entailment based Figure Summarization for Biomedical Articles

Table 6.3: Parameter settings for our proposed approach. Here, Q is the number of sentences in the actual summary specific to a figure.

Parameters                                Values
Population size (|P|)                     40
Maximum number of generations (tmax)      25
Fpool                                     [0.6, 0.8, 1.0]
CRpool                                    [0.1, 0.2, 1.0]
Smin and Smax                             Q - 2 and Q + 2

6.4.3 Experimental Settings

The different parameter values used in our proposed framework are reported in Table 6.3. The population size (|P|) and the maximum number of generations (tmax) are kept fixed because the larger these values, the higher the computation time. In our approach, the values of |P|, tmax and the maximum number of objective functions are 40, 25, and 6, respectively; therefore, the maximum number of fitness evaluations is 6240 (i.e., (tmax + 1) × |P| × 6 = 26 × 40 × 6, counting the evaluation of the initial population). The results reported are averaged over 5 runs of the algorithm. For the representation of sentences, BioBERT, a model8 pre-trained on biomedical text articles and a book corpus, was used, which provides fixed-length vectors of the sentences.

6.4.4 Comparative Methods

As our proposed approach is unsupervised in nature, we have compared it with other existing unsupervised methods. Although supervised techniques exist in the literature, it would be unfair to compare supervised and unsupervised methods directly. The unsupervised methods include three methods, namely Randomsent, FigSum [113], and FigSum+ [20]. Further, three variants of FigSum+, which are the similarity, tfidf, and SurfaceCue based versions, are considered (shown in Table 6.6(a)). These variants select the top-n sentences based on the maximum caption similarity function, a TF-IDF [188, 193] based similarity function, and the sentence-refers-to-figure function, respectively. Here, TF-IDF is a well-known bag-of-words model in vector space. Brief descriptions of these methods are already provided in Section 2.2.3 of Chapter 2. Note that our developed method is unsupervised in nature; gold summaries were used only to evaluate our system at the end. Moreover, the proposed system is based on the extraction of relevant sentences from the article related to a given figure; therefore, only sentence-extraction based methods are used for the comparative study.

8https://github.com/naver/biobert-pretrained/releases/tag/v1.0-pubmed-pmc


Table 6.4: Average precision (P), recall (R) and F-measure (F1) values obtained for both datasets using the reduced set of sentences. Here, the number following '±' is the standard deviation.

                                          FigSumGS1                              FigSumGS2
S.No.  Objective functions   SAR version  P          R          F1               P          R          F1
1      SRF                   SAR CS       0.18±0.22  0.15±0.20  0.17±0.21        0.22±0.15  0.18±0.13  0.20±0.14
                             SAR TE       0.22±0.27  0.15±0.19  0.18±0.22        0.25±0.13  0.20±0.11  0.22±0.12
2      STE+SRF               SAR CS       0.20±0.22  0.18±0.19  0.19±0.20        0.22±0.14  0.19±0.12  0.20±0.13
                             SAR TE       0.22±0.24  0.18±0.19  0.20±0.20        0.21±0.14  0.18±0.12  0.19±0.13
3      STE+SOC1+SOC2         SAR CS       0.20±0.21  0.18±0.19  0.19±0.20        0.22±0.14  0.20±0.13  0.21±0.13
                             SAR TE       0.19±0.21  0.16±0.17  0.17±0.18        0.22±0.14  0.19±0.11  0.20±0.12
4      SRF+SOC1+SOC2         SAR CS       0.19±0.21  0.18±0.20  0.18±0.20        0.22±0.13  0.20±0.13  0.21±0.13
                             SAR TE       0.21±0.25  0.17±0.21  0.18±0.22        0.21±0.13  0.18±0.12  0.20±0.12
5      STE+SRF+SOC1+SOC2     SAR CS       0.21±0.22  0.19±0.21  0.20±0.21        0.21±0.22  0.17±0.21  0.18±0.21
                             SAR TE       0.23±0.25  0.19±0.21  0.20±0.22        0.24±0.14  0.20±0.12  0.22±0.13

6.5 Results and Discussion

We have conducted two sets of experiments, ExpSet1 and ExpSet2, by varying the number of input sentences. We discuss them one by one along with the corresponding results. We then present the comparative analysis with the existing methods, together with an ablation study on different combinations of objective functions. Finally, we provide an error analysis of the results obtained, followed by a statistical significance test of our results.

1. ExpSet1: In this set, for each figure to be summarized (say Fig-m), we consider only those sentences in the article whose entailment probability values with respect to the figure's caption are greater than 0.5. The proposed approach is then applied to this reduced set of sentences. Note that the number of input sentences is reduced to minimize the computation time. This was done to see whether the reduced set of sentences extracted from the article using entailment probability values is sufficient to obtain a good quality summary.

Results and Discussion: The results obtained under ExpSet1 are shown in Table 6.4. We tried only 5 combinations of objective functions, using the two versions (SAR CS and SAR TE) of the anti-redundancy objective function (SAR). From this table, it can be observed that the highest F1-measure values for the FigSumGS1 and FigSumGS2 datasets are 0.20 and 0.22, respectively. These highest values are obtained using SAR TE in combination with the objective functions STE, SRF, SOC1, and SOC2. In most of the rows of this table, the F-measure values corresponding to SAR TE are higher. Thus, we can infer that the anti-redundancy objective function measured in terms of the textual entailment relationship contributes to better results.


Table 6.5: Average precision (P), recall (R) and F-measure (F1) values obtained by the proposed approach for both datasets, namely FigSumGS1 and FigSumGS2, by varying the objective function combinations. Here, the number following '±' is the standard deviation. Note that here all sentences in the article are used for the experiment.

                                          FigSumGS1                              FigSumGS2
S.No.  Objective functions   SAR version  P          R          F1               P          R          F1
1      STE                   SAR CS       0.24±0.18  0.20±0.15  0.22±0.16        0.22±0.12  0.19±0.11  0.20±0.22
                             SAR TE       0.28±0.18  0.22±0.14  0.24±0.15        0.26±0.13  0.22±0.11  0.24±0.12
2      SRF                   SAR CS       0.53±0.17  0.46±0.20  0.49±0.17        0.31±0.13  0.27±0.12  0.29±0.12
                             SAR TE       0.64±0.24  0.47±0.18  0.54±0.19        0.39±0.14  0.30±0.11  0.34±0.12
3      SFC                   SAR CS       0.36±0.18  0.27±0.14  0.30±0.15        0.30±0.12  0.24±0.10  0.27±0.11
                             SAR TE       0.29±0.21  0.21±0.16  0.24±0.18        0.31±0.13  0.25±0.11  0.28±0.12
4      STE+SRF               SAR CS       0.51±0.17  0.46±0.18  0.48±0.16        0.32±0.12  0.30±0.12  0.31±0.12
                             SAR TE       0.62±0.23  0.48±0.19  0.53±0.19        0.37±0.14  0.30±0.11  0.30±0.12
5      STE+SFC               SAR CS       0.36±0.18  0.30±0.16  0.32±0.16        0.25±0.14  0.23±0.11  0.24±0.12
                             SAR TE       0.28±0.21  0.23±0.18  0.25±0.19        0.30±0.12  0.25±0.10  0.27±0.11
6      SRF+SFC               SAR CS       0.54±0.17  0.47±0.18  0.50±0.16        0.34±0.13  0.28±0.11  0.31±0.12
                             SAR TE       0.63±0.21  0.47±0.16  0.53±0.17        0.37±0.14  0.30±0.12  0.33±0.12
7      STE+SOC1+SOC2         SAR CS       0.43±0.17  0.42±0.18  0.42±0.17        0.32±0.13  0.30±0.13  0.31±0.12
                             SAR TE       0.50±0.20  0.44±0.20  0.46±0.19        0.37±0.13  0.31±0.11  0.34±0.37
8      SRF+SOC1+SOC2         SAR CS       0.55±0.14  0.54±0.18  0.54±0.5         0.37±0.13  0.33±0.12  0.34±0.12
                             SAR TE       0.65±0.20  0.52±0.19  0.57±0.18        0.38±0.13  0.32±0.11  0.35±0.12
9      STE+SRF+SOC1+SOC2     SAR CS       0.54±0.15  0.52±0.18  0.52±0.15        0.36±0.12  0.32±0.12  0.34±0.12
                             SAR TE       0.65±0.20  0.54±0.18  0.59±0.18        0.42±0.12  0.38±0.11  0.40±0.11

2. ExpSet2: In this set, all the available sentences in the article are considered for our experiments. The proposed approach is applied to this full set of sentences.

Results and Discussion: The results obtained using all the sentences of the articles are reported in Table 6.5. In the same table, results are shown using the two versions of the anti-redundancy objective function (SAR CS and SAR TE) in combination with other objective functions. From Table 6.5, it is found that the highest F1-measure values for the FigSumGS1 and FigSumGS2 datasets are 0.59 and 0.40, respectively, which are higher than the values obtained after experimenting with the reduced set of input sentences. Moreover, the maximum F-measure value obtained among the objective function combinations including the SAR CS function is 0.54 (S.No. 8), which is 4% less than the highest F-score. The other observations made from Table 6.5 are enumerated below:

(a) Among most of the objective function combinations, SAR TE performs better than SAR CS. Thus, we can say that SAR TE contributes more to the figure summarization process than SAR CS.

(b) When we remove STE from the best combination (S.No. 9), the F-score decreases by 2% (S.No. 8). However, comparing SAR TE+STE+SOC1+SOC2 (S.No. 7) and SAR TE+SRF+SOC1+SOC2 (S.No. 8), the second one is better. This indicates that although STE contributes to the best F-score value, SRF contributes more


than STE when used with SOC1 and SOC2. The same can also be observed by comparing the F-scores of SAR TE+STE (S.No. 1) and SAR TE+SRF (S.No. 2); there is a big jump in the F-score value.

(c) On comparing STE, SRF, and SFC, each combined with either version of SAR, again SRF contributes the most. For a scientific article this is quite logical, because if a sentence refers to a particular figure using a keyword like 'Figure-', it indicates that the sentence is associated with that figure.

Table 6.6: Comparison of the best results obtained by our proposed approach with (a) unsupervised methods; (b) supervised methods, in terms of average precision (P), recall (R) and F-measure (F1) for both datasets, namely FigSumGS1 and FigSumGS2. Here, the number following '±' is the standard deviation. Note that here all sentences in the article are used for the experiment.

                                                 FigSumGS1                              FigSumGS2
Type of Methods  Method                 P          R          F1               P          R          F1
Unsupervised     Proposed (FigSum++)    0.65±0.20  0.54±0.18  0.59±0.18        0.42±0.12  0.38±0.11  0.40±0.11
                 RandomSent             0.06±0.09  0.06±0.12  0.06±0.09        0.08±0.08  0.09±0.11  0.08±0.09
                 FigSum                 0.28±0.24  0.19±0.19  0.22±0.19        0.31±0.20  0.13±0.10  0.18±0.13
                 FigSum+ (SurfaceCue)   0.96±0.13  0.41±0.22  0.54±0.21        0.63±0.36  0.16±0.13  0.24±0.17
                 FigSum+ (tfidf)        0.30±0.25  0.34±0.24  0.30±0.20        0.27±0.22  0.20±0.14  0.29±0.15
                 FigSum+ (Similarity)   0.28±0.20  0.38±0.28  0.30±0.22        0.31±0.16  0.28±0.16  0.22±0.16

(a)

                                                 FigSumGS1                              FigSumGS2
Type of Methods  Method                 P          R          F1               P          R          F1
Unsupervised     Proposed (FigSum++)    0.65±0.20  0.54±0.18  0.59±0.18        0.42±0.12  0.38±0.11  0.40±0.11
Supervised       NBSurfaceCues          0.44±0.11  0.17±0.20  0.18±0.15        0.49±0.06  0.05±0.04  0.08±0.05
                 NBSOTA                 0.44±0.15  0.74±0.17  0.53±0.12        0.37±0.14  0.43±0.19  0.38±0.13
                 SVMSOTA                0.58±0.15  0.17±0.20  0.23±0.22        0.54±0.12  0.10±0.11  0.15±0.15
                 NBSimilarity           0.48±0.18  0.15±0.12  0.20±0.12        0.42±0.14  0.10±0.08  0.14±0.08

(b)

6.5.1 Comparison with Existing Unsupervised Methods

In Table 6.6(a), the best results obtained by our proposed approach are compared with some existing unsupervised state-of-the-art techniques. From this table, it can be observed that our proposed unsupervised method (FigSum++) attains the maximum F-measure values of 0.59 and 0.40 for the FigSumGS1 and FigSumGS2 datasets, respectively, using the combination of the SAR TE, STE, SRF, SOC1, and SOC2 objective functions (this corresponds to the best result reported in Table 6.5). Although the FigSum+ (SurfaceCue) method attains high precision values (0.96 and 0.63 for the two datasets), its recall values (0.41 and 0.16) are lower than those of our proposed method. This indicates that the summaries produced by this method contain fewer sentences, which exactly match sentences of the gold

summaries. The Randomsent technique does not consider any feature-specific objective function while generating the summary; it randomly selects the top-n sentences as the figure's summary and thus gives very poor F-measure values of 0.06 and 0.08 on the two datasets, respectively. Note that our technique is based on sentence selection for the figure summary; therefore, we compare only with those techniques which also extract sentences for generating the summary. Out of the three variants of FigSum+, the SurfaceCue method gives F-measure values of 0.54 and 0.24 on the two datasets, which are 5% and 16% less than the best values attained by our proposed unsupervised method. Note that we have not reported the number of sentences in the predicted summary corresponding to each figure, as the average F-measure values over all figures are reported in Table 6.6. We have also compared our results with some supervised methods in Table 6.6(b). For comparison, we have considered different methods, namely NBSurfaceCue [20], NBSOTA [115], NBSimilarity [20], and SVMSOTA [115]. Here, the first three methods (NBSurfaceCue, NBSOTA and NBSimilarity) make use of a naive Bayes classifier [194], while the fourth (SVMSOTA) makes use of a support vector machine [195]. The features used by SVMSOTA and NBSOTA to train the supervised models are whether the sentence refers to the figure, whether the paragraph refers to the figure, reference sentence similarity, caption similarity, etc. Although it is somewhat unfair to compare two different types of techniques (supervised and unsupervised), because in most cases supervised methods perform better, after observing the results it can be concluded that our F-measure values are better than those of the existing supervised methods: there are 6% and 2% improvements obtained by our method for the FigSumGS1 and FigSumGS2 datasets, respectively. However, the recall value of NBSOTA is better than ours. This is because it uses the feature 'figure reference paragraph' during training, whereas our system does not make use of any such paragraph-based feature.

6.5.2 Pareto fronts obtained

At the end of any evolutionary algorithm, a set of Pareto optimal solutions is obtained. Each solution may vary in terms of the number of sentences (as our approach generates solutions with a minimum and maximum number of 1s) and may yield a different summary. The Pareto optimal solutions obtained at the end of the generations (24th generation) of our proposed algorithm, optimizing the objective functions SAR TE, STE, SRF, SFC, SOC1, and SOC2, are shown in Figure 6.4. This Pareto optimal front is obtained while generating the summary for Figure-2 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2941798. It is difficult to plot all the objective functional values in a single


Figure 6.4: Pareto optimal solutions obtained after applying our proposed approach at the end of 24th generation. (a) Figure illustrating objective functional values of SAR TE (denoted as SAR v2 in the figure), STE, and, SRF; (b) Figure illustrating the objective functional values of SRF, SOC1, and, SOC2. ‘fr-0’ in legend denotes solutions are of rank-1.

figure; therefore, we have plotted three objective functions in each of two 3-D plots. The objective function names are written on the axes.

6.5.3 An Example of Summary Obtained

Here, we show an example of a summary obtained by our proposed approach. The summary shown corresponds to Figure-4 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1159166 under the FigSumGS1 dataset and is shown in Figure 6.5 of the current chapter. The actual summary and the figure's caption are also shown. The matching lines between the actual and predicted summaries are highlighted with the same colour. Note that the summary shown in Figure 6.5 is obtained after optimizing the SAR TE, STE, SRF, SOC1, and SOC2 objective functions. The F-measure value obtained for the summary shown is 0.82, and the numbers of sentences in the actual and predicted summaries are 9 and 8, respectively. This can be considered an example of a good summary, as the F-score is more than 80%.


Figure 6.5: An example of Summary obtained by our proposed approach. (a) Figure-4 of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1159166; (b) Caption of the figure; (c) Actual and predicted summaries. Coloured lines (excluding black colour lines) in actual and predicted summary indicate the matched lines.


6.5.4 Error Analysis

We have carried out a thorough error analysis of the summaries generated for the figures in the articles of both datasets. This analysis corresponds to the average best F-measure values reported in Table 6.5 for our proposed approach.

For FigSumGS1 dataset:

After observing the F-measure values for all figures in the FigSumGS1 dataset, it was found that only one figure has an F-measure value of less than 20% (Figure-3 of the article available at http://www.ncbi.nlm.nih.gov/pubmed/?term=22473769), and 3 figures have F-measure values between 30% and 35%. For the rest of the figures, the F-measure values are above 40%. The value below 20% occurs for the following reason: the figure discusses the ratio of two biomedical terms, and thus the caption consists almost entirely of numbers, while the sentences in the actual summary do not contain many numbers. Our designed objective functions mainly deal with the figure's caption at the syntactic and semantic level and try to make our summary as close to the caption as possible; thus, there is little overlap between our summary and the actual summary, which decreases the F1-score value.

For FigSumGS2 dataset:

In this dataset, there are mainly three figures (Figure-3, 5, and 6) of the article available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1134656/ which have F1-scores of less than 20%, causing a decrease in the overall average F1-score. Out of these, Figure-6 has an F1 value of 0.09, which can be considered an example of the worst summary generated. This is due to the following reasons: (a) the captions of these figures refer to the caption of another figure (Figure-2 of the same article); (b) the captions of Figure-3 and Figure-5 have only 3 and 2 words, respectively, which are quite insufficient to explain a figure; for the rest of the explanation, they refer to the caption of Figure-2.

6.5.5 Box-plots

To illustrate the effectiveness of using SAR TE over SAR CS, or in other words, to show the variations of the F-measure values (in ExpSet2) corresponding to the two versions of the anti-redundancy objective function (SAR CS and SAR TE) in combination with other objective functions, we have drawn box plots for both datasets. The box plots shown in Figures 6.6(a) and 6.6(b) correspond to the FigSumGS1 and FigSumGS2 datasets, respectively. The results of five sets of objective functions, i.e., SRF, STE+SRF, SRF+SFC, SRF+SOC1+SOC2, and



Figure 6.6: Box plots showing variations of the best F-measure values obtained for (a) Fig- SumGS1; (b) FigSumGS2 datasets. The symbols namely, A, B, C, D, E, F, and, G represent objective functions namely, SAR CS, SAR TE, STE, SRF, SFC, SOC1, and, SOC2, respectively.


STE+SRF+SOC1+SOC2, each combined with SAR CS and SAR TE, are chosen for comparison because these combinations attain F-measure values of at least 50% and 30% for the FigSumGS1 and FigSumGS2 datasets, respectively. Thus, there are a total of 10 boxes in each figure. In each colored box, the horizontal colored line indicates the median F-measure value shown on the y-axis. In these box plots, the symbols A, B, C, D, E, F, and G represent the SAR CS, SAR TE, STE, SRF, SFC, SOC1, and SOC2 objective functions, respectively. From these plots, it can be observed that the objective functions, when integrated with SAR TE, have higher median values than when used with SAR CS. For example, the box corresponding to B+D (i.e., SAR TE+SRF) has a higher median value than A+D (i.e., SAR CS+SRF). Thus, it can be inferred that the anti-redundancy objective function measured in terms of the textual entailment relationship is more effective than the one based on cosine similarity amongst the sentences of the summary.

6.5.6 Statistical Significance of Results

To check the significance of our best result against the existing state-of-the-art results (reported in Table 6.6), we have conducted a statistical significance t-test9 at the 5% significance level. This validates whether the best result obtained is statistically significant or occurred by chance. The test provides a p-value; the smaller the p-value, the more significant the result. The p-values obtained using the F-measure values reported in Table 6.6(a) are:

1. .002695 for FigSumGS1 dataset

2. .000307 for FigSumGS2 dataset

The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.

6.5.7 Complexity Analysis of the Proposed Approach

In this section, we analyze the complexity of our proposed approach. Let the number of solutions, the number of objectives to be optimized, and the maximum number of generations be N, M, and tmax, respectively.

1. Initialization of the population takes O(N) time as there are N solutions. For each solution, its objective functional values are calculated, which takes O(NM) time in total. Thus, the total time complexity of population initialization is O(N + NM), which is equivalent to O(NM).

9https://www.socscistatistics.com/tests/studentttest/default2.aspx

139 Textual Entailment based Figure Summarization for Biomedical Articles

2. Construction of mating pool takes O(1) time as solutions are randomly selected from the population.

3. New solution generation using genetic operators (mutation and crossover) takes O(2 × NM) time. The constant 2 appears because two trial vectors are generated for each solution, i.e., a total of 2N new trial vectors are generated and their associated objective function values are computed.

4. Selection of the best trial vector takes O(1) time.

5. Merging of the old population (P) and the new population (P′) takes O(1) time.

6. Selection of the best solutions based on dominance and non-dominance criteria from the merged population takes O(M(2N)²) time [29].

Steps 2 to 6 are repeated for up to tmax generations. Note that steps 2, 4 and 5 take constant time; therefore, they can be omitted from the total time complexity calculation. Thus, the total time complexity of the proposed architecture is

O(MN + tmax(2(NM) + M(2N)²))

On simplifying further, this gives

O(MN + tmax(2NM + 4MN²)) ≡ O(MN + tmax(4MN²))
⟹ O(MN(1 + 4·tmax·N)) ≡ O(4·tmax·M·N²)
⟹ O(tmax·M·N²)

which is the worst-case time complexity of our approach. From this complexity, it can be inferred that if we increase the number of generations or the number of solutions in the population, the computation time will increase.

6.6 Conclusive Remarks

In this chapter, we have proposed a sentence-based figure summarization system (FigSum++) for biomedical articles. Sentences relevant to a figure are extracted by optimizing different sentence scoring functions. For an efficient search, or to move towards the global optimal solution, an ensemble of two different DE variants is used in the proposed framework. Moreover, another objective function, which measures anti-redundancy in the summary in terms of textual entailment, is also proposed. To measure the semantic similarity amongst sentences, the recently proposed BioBERT language model for biomedical text mining is utilized. From the obtained results, it is

evident that the newly proposed anti-redundancy objective function, when measured in terms of textual entailment (TE) and optimized together with the other objective functions, provides improvements of 5% and 11% in terms of F1-score over the state-of-the-art methods for the two datasets, respectively. Moreover, the TE-based anti-redundancy objective function performs better than the cosine-similarity-based anti-redundancy objective function. Thus, it can be inferred that textual entailment plays a major role in the summarization task. Our future work will concentrate on parallelizing our summarization system by simultaneously generating summaries of all the figures of a given article. The current work focuses only on the figure-summarization task. Nowadays, however, micro-blogging sites are gaining popularity due to the involvement of a large number of users. A lot of tweets are posted per minute by end-users, giving real-time information about ongoing events such as disasters, politics, education, etc. In the case of natural disasters, a significant amount of relevant (and crucial) information is immersed among these tweets. Therefore, there is a need to develop a system that summarizes relevant tweets by extracting informative ones. Thus, in the next chapter, we propose an unsupervised approach for summarizing relevant tweets which automatically selects the informative tweets.


CHAPTER 7

Multi-objective Based Approach for Microblog Summarization

This chapter proposes a novel multi-objective optimization-based framework for microblog/tweet summarization. A subset of relevant tweets is automatically selected from an available set of tweets. Different statistical quality functions measuring various aspects of summary, namely: length, tf-idf score of the tweets, and anti-redundancy, are optimized simultaneously using the search capability of a multi-objective binary differential evolution technique. A newly designed self-organizing map based genetic operator is incorporated in the optimization process. An ablation study is also performed to determine which set of measures is best suited for different datasets. At the end of the chapter, the extension of the proposed approach to solve the multi-document summarization task is also illustrated.


7.1 Introduction

7.1.1 Overview

Due to the continuous growth of social media platforms like Twitter, Tumblr1, and others, a lot of short-text messages called tweets are posted, related to various event or topic categories like education, political issues, disaster events, among others. Thus, they have become an invaluable source for getting updated information regarding ongoing events [196, 23]. According to a Twitter blog2 posted in 2013, 400 million tweets were created each day by 200 million active users. In 2016 and 2019, this number increased to 303 million and 500 million tweets per day3, respectively. In the literature [23, 24, 197, 198], the significance of accessing microblogging sites for gathering information has been illustrated. Thus, a vast amount of information is being generated on a day-to-day basis. These tweets are posted with varying characteristics in terms of relevancy (providing useful information) or non-relevancy, which makes extracting relevant tweets or information from such data a crucial task. If such relevant information is extracted successfully, it may help in the decision-making process. Another challenge is dealing with the extracted relevant tweets: going through all such tweets is time-consuming, which demands summarization/extraction of the relevant tweets [126, 199].

Figure 7.1: Figure showing (a) classification of tweets into situational and non-situational categories; (b) summarization of situational tweets.

In this work, we have considered disaster-related tweets because a summary/extraction of relevant tweets may provide valuable information which may, in effect, enable a management authority to handle the situation in the concerned area. Here, let us call relevant and non-relevant tweets situational and non-situational tweets, respectively. Situational tweets [126] include those tweets that provide information on the current situation of the affected area, the

1https://www.tumblr.com/tagged/social-networking 2https://blog.twitter.com/official/en us/a/2013/celebrating-twitter7.html 3https://www.dsayce.com/social-media/tweets-day/

number of casualties, or some other crucial information, whereas non-situational tweets are related to sympathy, emotions, and post-disaster-event analysis. In Figure 7.1, the general flow of classification vs. summarization is shown; the focus of this work is on part (b) of Figure 7.1. The input/output scenario of the developed system is demonstrated in Figure 7.2, where a set of situational tweets is the input and the extracted useful tweets are the output (matched tweets are shown in colour). These extracted tweets include helpline numbers, blood bank numbers, and the number of persons killed in the Hyderabad blast.

Figure 7.2: Example of Microblog Summarization.

Most of the existing works [126, 21, ?, 118, 127, 199] consider a specific trait/objective while summarizing the tweets. For example, the approach for real-time tweet summarization in [126] focuses on maximizing the number of content words (numerals, nouns and verbs) using integer linear programming. But there may be different traits, like the maximum length of the tweets [21], the tf-idf score of the tweets [?], etc., which can be considered together to obtain a good-quality summary. Thus, in this chapter, a novel microblog/tweet summarization technique (MOOTweetSumm) is proposed using the concepts of multi-objective optimization (MOO). Several tweet scoring features/objective functions, like the length of the tweet [21] and the summation of tf-idf scores [21], are simultaneously optimized using the multi-objective binary differential evolution algorithm (MOBDE) [46], which is an evolutionary algorithm (EA) (described in Sections


2.1.9 and 2.1.10 of Chapter 2). Because there can be many re-tweets, another objective function, anti-redundancy, is also optimized simultaneously to avoid having redundant information as part of the summary.

Like Chapters 3 and 4, here also the SOM-based operator is incorporated in the MOBDE framework to check its effectiveness for the microblog summarization task. To measure the similarity/dissimilarity between tweets, the recently proposed word mover distance (WMD) (see the definition provided in Section 2.1.3 of Chapter 2) is utilized. The proposed approach is evaluated on four disaster-event-related datasets. The results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques. As a part of this work, the potential of the proposed approach is also tested for multi-document summarization, where a given set of documents must be summarized.

7.1.2 Contribution

The major contributions of the current chapter are enumerated below:

• A multi-objective optimization-based approach is proposed for the microblog summarization task in which different goodness measures of a summary are optimized simultaneously. As per our literature survey, this is the first attempt at using the MOO framework for solving the microblog summarization task.

• An ablation study is presented to illustrate which combination of objective functions is best suited for summarizing each dataset.

• The self-organizing map (SOM) based genetic operator is also explored in the MOBDE framework to illustrate the performance improvements.

• Existing algorithms provide a single summary after execution. The proposed approach, in contrast, presents the user with different possible summaries, having a variable number of tweets, corresponding to the different non-dominated solutions of the final Pareto optimal front. Therefore, the user has more alternatives in selecting a single summary from the final pool; depending on the user/domain requirement, a single summary can be selected.


7.2 Problem Definition

Consider an event D consisting of N tweets, D={t1, t2, . . . , tN }. Our main task is to find a subset of tweets, T ⊆ D, such that

Smin ≤ Σ_{i=1}^{N} Bi ≤ Smax,   where Bi = 1 if ti ∈ T, and 0 otherwise        (7.1)

and such that we

maximize {Ob1(T), Ob2(T), Ob3(T)}        (7.2)

where Smin and Smax are the minimum and the maximum number of tweets in the summary, respectively, and Ob1, Ob2 and Ob3 are the objective functions discussed in subsequent sections. Note that in Eq. 7.2, there can also be two objective functions ((Ob1 and Ob2) or (Ob1 and Ob3)) instead of three. These objective functions quantify the goodness of different tweets and further help in improving the quality of the generated summary. All these objective functions have to be maximized simultaneously by the use of some multi-objective optimization framework. These objectives are calculated for each solution in the population, as each solution denotes a subset of tweets representing a summary.

Anti-redundancy (Ob1)

A set of tweets can contain many re-tweets; therefore, to reduce redundancy in the summary, this objective function is considered. It is expressed as:

Ob1 = ( Σ_{i,j=1, i≠j}^{|T|} distwmd(ti, tj) ) / |T|        (7.3)

where ti and tj are the ith and jth tweets belonging to T, |T| is the total number of tweets to be in the summary, and distwmd(ti, tj) is the Word Mover Distance (for the definition refer to Section 2.1.3 of Chapter 2) between the ith and jth tweets.
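As a minimal sketch only, Ob1 could be computed with gensim's Word Mover Distance implementation; the embedding file path, tokenization, and function names below are illustrative assumptions, not the exact code used in this thesis.

```python
from itertools import combinations
from gensim.models import KeyedVectors

# Hypothetical path to a pre-trained (e.g. crisis-domain) word2vec model.
embeddings = KeyedVectors.load_word2vec_format("crisis_word2vec.bin", binary=True)

def anti_redundancy(summary_tweets):
    """Ob1 (Eq. 7.3): average pairwise Word Mover Distance between summary tweets."""
    docs = [t.lower().split() for t in summary_tweets]
    # Each unordered pair appears twice in the i != j double sum of Eq. 7.3,
    # hence the factor of 2.
    total = sum(embeddings.wmdistance(a, b) for a, b in combinations(docs, 2))
    return 2.0 * total / len(summary_tweets)
```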

Maximum tf-idf Score of the Tweets (Ob2)

The tf-idf [188] is a well-known measure in information retrieval for assigning weights to words. Here, 'tf' means term frequency and 'idf' means inverse document frequency in a set of tweets (considered as a document). Each tweet is considered a bag of words, and each word has its own tf-idf score. Thus, a tweet t can be represented as a vector

vt = [w1t, w2t, w3t, ..., wnt]        (7.4)

where

wk,t = tfk,t · ( 1 + log( (1 + N) / (1 + |{t′ ∈ D | k ∈ t′}|) ) )        (7.5)

and tfk,t is calculated by counting the number of occurrences of the kth word in the tweet t, t′ ∈ D, and N is the total number of tweets available. The summation of the tf-idf scores of the tweets belonging to T is then considered; the subset of tweets having the maximum average tf-idf score is considered a good summary. Mathematically, it can be expressed as

Ob2 = ( Σ_{i=1}^{|T|} Σ_{wordk ∈ ti, ti ∈ T} wk,ti ) / |T|        (7.6)

where wk,ti is the tf-idf score of the kth word (wordk) present in tweet ti, and ti is the ith tweet belonging to T.

Maximum length of the tweets (Ob3)

Based on the assumption that a longer tweet conveys more important information, this objective function is taken into consideration. Mathematically, it can be expressed as

Ob3 = Σ_{i=1}^{|T|} length(ti)        (7.7)

where ti is the ith tweet in the summary and length(·) counts the number of words in the tweet after removing stop words (e.g., is, am, are). However, some longer tweets may not be relevant, as they contain irrelevant words; therefore, the other objective function discussed above

(Ob2) is considered, which pays attention to the importance of different words in the tweet. Note that Ob3 is not averaged over the number of tweets in the summary. The reason is explained with an example: suppose summary A has 20 tweets and summary B has 21 tweets, including the 20 tweets that are also in A. If the additional tweet has a length of 1 (one word), then the average length of summary B will be smaller than that of A, which contradicts our intuition.
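The two remaining objectives can be sketched as follows; this is only an illustrative implementation of Eqs. 7.5-7.7, in which the tokenization, stop-word list, and function names are assumptions rather than the thesis code.

```python
import math
from collections import Counter

def tfidf_weights(all_tweets):
    """Per-tweet word weights w_{k,t} following the smoothed tf-idf of Eq. 7.5."""
    n = len(all_tweets)
    docs = [t.lower().split() for t in all_tweets]
    df = Counter(w for doc in docs for w in set(doc))  # number of tweets containing each word
    return [{w: tf * (1 + math.log((1 + n) / (1 + df[w])))
             for w, tf in Counter(doc).items()} for doc in docs]

def ob2_tfidf(summary_idx, weights):
    """Ob2 (Eq. 7.6): average summed tf-idf score of the selected tweets."""
    return sum(sum(weights[i].values()) for i in summary_idx) / len(summary_idx)

def ob3_length(summary_idx, all_tweets, stopwords=frozenset({"is", "am", "are"})):
    """Ob3 (Eq. 7.7): total number of non-stop-word tokens in the selected tweets."""
    return sum(sum(1 for w in all_tweets[i].lower().split() if w not in stopwords)
               for i in summary_idx)
```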

7.3 Proposed Methodology

In this work, we have developed an extractive tweet summarization system. It utilizes a multi-objective differential evolution technique as the underlying optimization strategy. SOM-based genetic operators are incorporated in the process to examine their effectiveness. The flowchart of the proposed approach is shown in Figure 7.3.

[Figure 7.3 flowchart: (1) input a set of tweets; (2) pre-processing; (3) population initialization (P), objective function calculation, g = 0; (4) SOM training; (5) apply genetic operators to form a new population P′ and calculate its objective function values; (6) merge P and P′; (7) select the best |P| solutions for the next generation; (8) update the SOM training data and increment g; if g < gmax, return to step 4; (9) obtain the set of Pareto optimal solutions; (10) select the best solution and the corresponding summary.]

Figure 7.3: Flow chart of the proposed architecture where, g is the current generation number initialized to 0 and gmax is the user-defined maximum number of generations (termination condition), |P | is the size of population.

7.3.1 Representation of Solution and Population Initialization

Here, the population is initialized in the same manner as done in Chapter 5 (refer to Section 5.4.2). Note that the initial population may have a varied number of tweets between [Smin, Smax]. This provides the end-user the flexibility to choose the best summary as per his/her requirement or expert knowledge in terms of the number of tweets. A minimal sketch of this initialization is given below.
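The sketch assumes the same binary encoding (one bit per tweet) as in Chapter 5; the variable names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def initialize_population(pop_size, n_tweets, s_min, s_max, seed=0):
    """Each solution is a binary vector over the N tweets; the number of ones
    (selected tweets) is drawn uniformly from [s_min, s_max]."""
    rng = np.random.default_rng(seed)
    population = np.zeros((pop_size, n_tweets), dtype=int)
    for solution in population:
        k = rng.integers(s_min, s_max + 1)  # summary size of this solution
        solution[rng.choice(n_tweets, size=k, replace=False)] = 1
    return population
```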

7.3.2 Objective Functions Used

To obtain a good summary, the use of a good set of objective functions/quality measures is essential. These objective functions quantify the quality of the subset of tweets present in a solution, and thus optimizing all of them helps in achieving a good-quality summary. All these objective functions have already been discussed in Section 7.2, and all are of maximization type.

7.3.3 SOM Training

In this step, SOM training is performed using the solutions in the population, as described in Algorithm 1 of Chapter 2. Thus, SOM helps in understanding the distribution structure of the solutions in the population; in other words, SOM provides a topology-preserving map of the solutions in a low-dimensional space. An illustrative sketch of this step is given below.
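As an illustration only, the same step could be realized with an off-the-shelf SOM implementation such as MiniSom (the thesis follows Algorithm 1 of Chapter 2); the grid size and learning rate mirror Section 7.4.4, while the default sigma value is an assumption.

```python
import numpy as np
from minisom import MiniSom  # third-party SOM library, used here purely for illustration

def train_som(population, grid=(5, 5), sigma0=2.5, eta0=0.6):
    """Fit a rectangular 2-D SOM on the current population so that nearby neurons
    correspond to similar solutions (a topology-preserving low-dimensional map)."""
    data = np.asarray(population, dtype=float)
    som = MiniSom(grid[0], grid[1], data.shape[1], sigma=sigma0, learning_rate=eta0)
    som.random_weights_init(data)
    som.train_random(data, len(data))  # |P| training iterations, as in Section 7.4.4
    return som
```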

7.3.4 Genetic Operators

In our framework, from each solution, a new solution is generated using three steps: mating pool generation, mutation, and crossover; the set of these new solutions forms a new population (P′). For the construction of the mating pool for the current solution, SOM is utilized as discussed in Section 2.1.12 of Chapter 2. The remaining genetic operators are the same as used for the single-document summarization task in Chapter 5. After generating a new solution using the genetic operators, it undergoes constraint checking


to verify whether the number of ones (or tweets) in the solution lies between [Smin, Smax]. If it violates this constraint, it is made feasible using the following steps:

1. Pick up the new solution. Let us call it the ith solution.

2. Initialize ModifiedSolution as a vector of zeros whose length equals the maximum solution length.

3. Find the indices of the tweets of the ith solution sorted by maximum tweet length or maximum tf-idf score. To do this, a random probability 'p' is generated: if p < 0.5, the tweets are sorted based on maximum tweet length; otherwise, they are sorted based on maximum tf-idf score.

4. Generate a random number 'r' between Smin and Smax.

5. Fill the indices of ModifiedSolution with 1s until 'r' indices are covered. Note that the filled indices are the sorted indices obtained in step 3.

6. Return the ModifiedSolution.

Here, it is important to note that while optimizing two objectives, Ob1 and Ob2, we also give importance to new solutions generated using the maximum tweet length score (based on the probability in step 3), because such solutions may contain long tweets that convey important information. A sketch of this repair procedure is given below.
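The following is one possible reading of steps 1-6; the scoring arrays, tie handling, and function names are assumptions made for illustration.

```python
import numpy as np

def repair(solution, tweet_lengths, tfidf_scores, s_min, s_max, rng=np.random.default_rng()):
    """Make an infeasible binary solution feasible: keep the r top-ranked tweets of
    the solution, ranked by tweet length or tf-idf score (chosen at random, step 3)."""
    selected = np.flatnonzero(solution)                          # indices of ones (step 1)
    key = tweet_lengths if rng.random() < 0.5 else tfidf_scores  # step 3
    ranked = selected[np.argsort(-np.asarray(key)[selected])]    # best tweets first
    r = rng.integers(s_min, s_max + 1)                           # step 4
    repaired = np.zeros_like(solution)                           # step 2
    repaired[ranked[:r]] = 1                                     # step 5
    return repaired                                              # step 6
```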

7.3.5 Selection of Best |P | Solutions for Next Generation

After generating the new population P′, it is merged with the old population P. Out of these solutions, the best |P| solutions are selected using the well-known non-dominated sorting (NDS) and crowding-distance-based operators of the NSGA-II algorithm [29]. For more details about this step, one can refer to Section 3.2.6 of Chapter 3.

7.3.6 Updating SOM Training Data and Termination Condition

These steps are similar to those described in Sections 5.4.7 and 5.4.8 of Chapter 5.

7.3.7 Selection of Single Best Solution and Generation of Summary

This step is similar to the one discussed in Section 5.4.9 of Chapter 5. In brief, two methods, supervised and unsupervised, are explored to select the best solution. Let us call these methods SBest and UBest, respectively; they are described below:


• SBest: It selects the solution having the highest ROUGE-1 score, obtained by utilizing the gold/reference summary.

• UBest: In this method, an adaptive weighting scheme (AWS) [200] is utilized in which objective functional values are summed after being multiplied by their respective weights. The solution having the best value of the weighted sum is considered the best solution. Let K × #Ob be the matrix of objective functional values, where K and #Ob are the number of Pareto optimal solutions and the number of objective functions used in our optimization strategy, respectively. The steps used to select the best solution are explained below:

1. Normalize the values of objective functions by applying

Fkl = Obkl / Obl+,   where Obl+ = max_{k ∈ K} Obkl        (7.8)

where Obkl is the lth objective function value corresponding to the kth solution.

2. Construct the normalized weighted matrix by multiplying each normalized objective function value by its respective weight as

Fwtdkl = Fkl × wl        (7.9)

where wl is the weight factor assigned to the lth objective.

3. For each kth solution, evaluate the sum of the weighted normalized objective functional values as defined below:

Scorek = Σ_{l=1}^{#Ob} Fwtdkl        (7.10)

4. Find the solution having the largest Score.

Note that the weight factors can be determined after conducting a sensitivity analysis.
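A compact sketch of this weighting scheme is given below; the objective values in the usage example are made up for illustration, and the weights mirror those reported later in Section 7.5.2.

```python
import numpy as np

def select_best_unsupervised(objective_matrix, weights):
    """UBest: normalize each objective by its column maximum (Eq. 7.8), weight it
    (Eq. 7.9), sum per solution (Eq. 7.10) and return the index of the best one."""
    F = objective_matrix / objective_matrix.max(axis=0)  # K x #Ob normalized matrix
    scores = (F * np.asarray(weights)).sum(axis=1)       # weighted sum per solution
    return int(np.argmax(scores))

# Illustrative call with three Pareto-optimal solutions (made-up objective values)
# and the weights (0.4, 0.3, 0.7) used for Ob1, Ob2 and Ob3 in Section 7.5.2.
objectives = np.array([[0.61, 3.2, 240.0],
                       [0.55, 3.6, 255.0],
                       [0.58, 3.4, 230.0]])
best_index = select_best_unsupervised(objectives, weights=[0.4, 0.3, 0.7])
```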

The tweets in the summary are reported based on their occurrence in the original dataset. For example, the tweet which appears first in the dataset will be the first tweet in the summary.

7.4 Experimental Setup

In this section, we discuss the datasets used, the evaluation measure, the parameter settings, and the comparative methods.



Figure 7.4: Word clouds of disaster events, namely (a) Sandyhook (SH); (b) Uttarakhand flood (UK); (c) Typhoon Hagupit in Philippines (TH); and (d) Bomb blasts in Hyderabad (HB).

7.4.1 Datasets

For the microblog summarization work, we have used datasets related to four disaster events, namely (a) the Sandy Hook elementary school shooting in the USA (SH); (b) the Uttarakhand floods (UK); (c) Typhoon Hagupit in the Philippines (TH); and (d) the bomb blasts in Hyderabad (HB). The numbers of tweets in these datasets are 2080, 2069, 1461, and 1413, respectively. The same datasets are used in the paper [21] and are briefly described in Table 7.1. Tweets in these datasets provide different relevant information, like the current situation in various regions affected by the disaster, the number of casualties, and the contact numbers of helping authorities and hospitals. A reference/gold summary is also available with each dataset; it is utilized only for evaluation at the end of the execution of our proposed approach, as our approach is fully unsupervised in nature. The calculation of the objective functions is also fully unsupervised, and the other steps of the proposed approach do not consult any supervised information. The numbers of tweets in the gold summaries are 37, 34, 41 and 33 for the SH, UK, TH, and HB datasets, respectively. Before passing any dataset as input to our algorithm, some pre-processing steps are executed: removal of special characters, hashtags, stop words, user mentions and URLs, and conversion of all words to lower case.


Table 7.1: Dataset descriptions for Microblog Summarization

S.No.  Event                                          Year  Abbreviation  #Tweets  Popular Hashtags
1      Sandy Hook elementary school shooting in USA   2012  SH            2080     #schoolshooting, #stoptheviolence, #SandyHook
2      Floods in Uttaranchal state of India           2013  UK            2069     #UttarakhandFloods, #Kedarnath, #prayers4all
3      Typhoon Hagupit in Philippines                 2014  TH            1461     #TyphoonHagupit, #Phillippines, #RescuePH
4      Bomb blasts in Hyderabad, India                2007  HB            1413     #Hyderabadblast, #india, #killed, #Serialblast

7.4.2 Comparative Methods

For comparison, we have considered one recent approach developed in 2018, namely EnGraphSumm [21]. Many versions of EnGraphSumm have been developed, out of which we consider only the top 4, namely VecSim-ConComp-maxSumTFIDF, VecSim-ConComp-MaxDeg, VecSim-Community-maxSumTFIDF and VecSim-ConComp-MaxLen. Each one first generates summaries using different existing algorithms and then uses an ensembling strategy to select the tweets. These tweets are grouped by some graph-based method [21], and then from each group one tweet is selected as part of the summary based on various features like the maximum length of the tweets, the maximum degree of a node, etc. Along with these approaches, some other approaches, like COWTS [126], Lex-Rank [32], LSA [129], LUHN [122], SumBasic [124], MEAD [177] and SumDSR [125], are also taken into account for comparison.

7.4.3 Evaluation Measure

To check the performance/closeness of the generated summary with respect to the actual summary, we have used ROUGE-N, discussed in Section 2.3.2 of Chapter 2. It counts the number of overlapping units between the generated summary and the actual summary; a summary having a higher ROUGE score is considered closer to the actual summary. In our experiments, we compute ROUGE-1, ROUGE-2, and ROUGE-L. But, for comparison with the existing algorithms, we make use of only the ROUGE-2 and ROUGE-L scores, as the reference papers report only these scores.
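As an illustration only, the ROUGE-1/2/L F-scores could be computed with the rouge-score package; this is not necessarily the evaluation toolkit used in the thesis (ROUGE itself is defined in Section 2.3.2 of Chapter 2).

```python
from rouge_score import rouge_scorer

def evaluate_summary(generated, reference):
    """Return ROUGE-1, ROUGE-2 and ROUGE-L F-scores of a generated summary
    against the gold/reference summary."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, generated)  # score(target, prediction)
    return {name: round(s.fmeasure, 4) for name, s in scores.items()}
```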

7.4.4 Parameters Used

Different parameter values used in our proposed framework are as follows. DE parameters: |P| = 25, mating pool size = 5, threshold probability in mating pool construction (β) = 0.8, maximum number of generations (gmax) = 25, crossover probability (CR) = 0.8, b = 6, F = 0.8. SOM parameters: initial learning rate (η0) = 0.6, number of training iterations = |P|, topology = rectangular 2D grid of size N1 × N2 = 5 × 5, initial neighborhood size σ0 = (1/2)√( Σ_{i=1}^{m−1} Ni² / (m − 1) ). A sensitivity analysis on the DE and SOM parameters is discussed in the next section. The minimum (Smin)


and maximum (Smax) numbers of tweets in the summary for the SH, UK, TH and HB datasets are considered as [34, 40], [31, 37], [39, 44], and [31, 36], respectively. Word Mover Distance makes use of a pre-trained word2vec model4 to calculate the distance between two tweets; this model was trained on 53 million tweets related to various disaster events [196]. The number of fitness function evaluations (NFE) for our proposed architecture is kept at 1950. The results obtained are averaged over 5 runs of the algorithm.

7.5 Discussion of Results

In this section, we discuss the results obtained using the supervised (SBest) and unsupervised (UBest) selection methods, the comparison with existing approaches, and an analysis of the results obtained.

7.5.1 Discussion of results obtained using SBest selection method

In Table 7.2, we have shown the average results over all datasets obtained by the proposed approach, MOOTweetSumm, using both the 'with SOM' and 'without SOM' versions of the genetic operators. Various combinations of the objective functions (discussed in Section 7.2) are also explored to identify which set of objective functions is best suited for our task; the corresponding results are reported in Table 7.2. The best result was obtained by our approach when using the 'without SOM' version with the objective functions maximum anti-redundancy (Ob1) and tf-idf score (Ob2). We have also reported the different evaluation measures for each dataset in Table 7.3. From this table, it can be observed that objectives Ob1, Ob2, and Ob3 are best suited for TH, while for the SH and HB datasets our approach attains good results when objectives Ob1 and Ob3 are used. As the comparative approaches report average results over the four datasets, to make a fair comparison we have reported the average results in Table 7.5 in comparison with the state-of-the-art techniques.

Table 7.2: Average ROUGE Scores over all datasets attained by the proposed method using supervised information. Here, † denotes the best results; it also indicates that results are statistically significant at 5% significance level.

Approach       Operator      Objective functions  Rouge-1  Rouge-2  Rouge-L
MOOTweetSumm   With SOM      Ob1+Ob2+Ob3          0.4912   0.2999   0.4850
                             Ob1+Ob2              0.4738   0.3033   0.4678
                             Ob1+Ob3              0.4843   0.3095   0.4790
               Without SOM   Ob1+Ob2+Ob3          0.4789   0.2984   0.4745
                             Ob1+Ob2              0.4900   0.3150   0.4860
                             Ob1+Ob3              0.4903   0.3192   0.4848

4http://crisisnlp.qcri.org/lrec2016/lrec2016.html


Table 7.3: ROUGE Scores obtained by the proposed approach for different datasets using SBest selection method. Bold entries indicate the best results considering 'with SOM' and 'without SOM' based operators.

Event  Objective functions   With SOM (R-1 / R-2 / R-L)   Without SOM (R-1 / R-2 / R-L)
SH     Ob1+Ob2+Ob3           0.5842 / 0.3612 / 0.5776     0.5940 / 0.3612 / 0.5874
SH     Ob1+Ob2               0.5346 / 0.3303 / 0.5248     0.5842 / 0.3721 / 0.5842
SH     Ob1+Ob3               0.5842 / 0.3775 / 0.5743     0.6139 / 0.3975 / 0.6073
UK     Ob1+Ob2+Ob3           0.4400 / 0.2469 / 0.4329     0.4494 / 0.2577 / 0.4424
UK     Ob1+Ob2               0.4423 / 0.2714 / 0.4376     0.4541 / 0.2822 / 0.4447
UK     Ob1+Ob3               0.4565 / 0.2791 / 0.4518     0.4471 / 0.2623 / 0.4400
TH     Ob1+Ob2+Ob3           0.4181 / 0.2365 / 0.4097     0.3697 / 0.2213 / 0.3655
TH     Ob1+Ob2               0.3634 / 0.2158 / 0.3634     0.3845 / 0.2184 / 0.3782
TH     Ob1+Ob3               0.3866 / 0.2296 / 0.3802     0.3634 / 0.2241 / 0.3550
HB     Ob1+Ob2+Ob3           0.5223 / 0.3552 / 0.5198     0.5025 / 0.3534 / 0.5025
HB     Ob1+Ob2               0.5198 / 0.3776 / 0.5173     0.5371 / 0.3914 / 0.5371
HB     Ob1+Ob3               0.5099 / 0.3517 / 0.5099     0.5371 / 0.3931 / 0.5371

In the literature, the efficacy of SOM-based reproduction operators (for constructing the mating pool) has already been shown in solving various problems, like document clustering in Chapter 3, document summarization in Chapters 4 and 5, and the development of an evolutionary algorithm [168]. But, from the obtained experimental results, it is evident that the effectiveness of SOM-based operators also depends on the datasets and the problem statement chosen. SOM-based operators are developed based on the assumption that the mating pool should be restricted to the neighboring solutions of the current solution. This restricts the genetic operations to be performed between neighboring solutions only; thus exploitation is preferred over exploration. But in the case of tweet summarization, the neighborhood of a neuron mostly consists of solutions containing re-tweets. Thus, if the genetic operators are applied to re-tweets, good-quality solutions may not be generated. Hence, in this case, SOM-based genetic operators only help in exploitation, while our summarization task demands more exploration than exploitation. Therefore, our approach using 'without SOM' genetic operators performs better than the 'with SOM' version.

Exploration vs. Exploitation Behaviour: In Figs. 7.5(a)-(d), the exploration vs. exploitation behaviour of our proposed algorithm is shown for all datasets with respect to the number of generations, using the two objectives Ob1+Ob2 (as this combination gives the best average result), for both the 'with SOM' and 'without SOM' versions of the operators. As can be seen from the red line corresponding to the 'without SOM' version, the number of new good solutions generated per generation is higher than for the 'with SOM' version in most of the generations. That means the 'without SOM' version explores the search space more efficiently. This is due to the random selection of three solutions out of the whole population to generate a new solution for the current solution,

as is usually done in the MOBDE algorithm, and thus it can provide the best average ROUGE score. However, both versions move towards exploitation, as the number of new solutions generated decreases over the generations.


Figure 7.5: Figures showing the number of new solutions generated over the generations by our proposed approach using two objectives, Ob1+ Ob2; a comparative study between ‘with SOM’ and ‘without SOM’ based operators. Here, (a), (b), (c), and (d) correspond to SH, UK, TH and HB datasets, respectively.

To check whether the used objective functions are optimized over the generations, we have plotted graphs showing the generation-wise maximum objective functional (Ob1 and Ob2) values for all datasets. These graphs are shown in Fig. 7.6, which shows that the objective functional values increase over the iterations and become constant after a particular iteration due to the limited length of the tweets and the vocabulary size.

7.5.2 Discussion of results obtained using UBest selection method

From the average results shown in Table 7.2, obtained using the SBest selection method, it can be observed that (i) in the case of the SOM-based operator, our approach performs better when all



Figure 7.6: Generation-wise objective function values using MOOTweetSumm (Without SOM, Ob1+Ob2). Here, (a), (b), (c) and (d) correspond to the SH, UK, TH and HB datasets, respectively.

objective functions are optimized simultaneously; (ii) in the case of the 'without SOM' operator, our approach performs well when the two objective functions Ob1 and Ob2 are optimized simultaneously. Therefore, using the same sets of objective functions for the 'with SOM' and 'without SOM' operators, we have explored the unsupervised method for selecting the best solution as discussed in Section 7.3.7. The corresponding results are reported in Table 7.4. The weight factors assigned to the objective functions Ob1, Ob2 and Ob3 when using the SOM-based operator are 0.4, 0.3, and 0.7, respectively, while when not using the SOM-based operator, the weight factors assigned to Ob1 and Ob2 are 0.3 and 0.7, respectively. Note that in the case of the SOM-based operator, weight values of 0.2, 0.3, and 0.5 assigned to Ob1, Ob2 and Ob3, respectively, generate the same results. These weight factors are determined after conducting a thorough sensitivity analysis. On comparing the results of the 'with SOM' and 'without SOM' operators under UBest, both give a ROUGE-2 score of 0.3033, but, in terms of ROUGE-L, the proposed approach using the 'without SOM' operator achieves 0.4769, which is higher

than the ROUGE-L score of 0.4681 obtained using the 'with SOM' operator. Note that ROUGE-L measures the matching of the longest common subsequence between the obtained summary and the reference summary; thus, ROUGE-L may be preferred over ROUGE-2. A similar discussion applies to the ROUGE-1 score.

Table 7.4: ROUGE Scores obtained by the proposed approach for different datasets using UBest selection method.

Event     Ob1+Ob2+Ob3, With SOM (R-1 / R-2 / R-L)   Ob1+Ob2, Without SOM (R-1 / R-2 / R-L)
SH        0.5743 / 0.3848 / 0.5710                  0.5842 / 0.3721 / 0.5842
UK        0.4376 / 0.2715 / 0.4329                  0.4376 / 0.2577 / 0.4376
TH        0.3592 / 0.2019 / 0.3487                  0.3592 / 0.1923 / 0.3487
HB        0.5223 / 0.3552 / 0.5198                  0.3857 / 0.3914 / 0.5371
Average   0.4734 / 0.3033 / 0.4681                  0.4417 / 0.3033 / 0.4769

Table 7.5: Average ROUGE Scores over all datasets attained by existing methods in comparison with the best results obtained by the proposed approach using SBest (Table 7.2) and UBest (Table 7.4) selection methods. Here, WOSOM refers to without SOM; SBest and UBest are the supervised and unsupervised selection methods.

Approach                                   Rouge-2   Rouge-L
MOOTweetSumm (SBest, WOSOM, Ob1+Ob2)       0.3150†   0.4860†
MOOTweetSumm (SBest, SOM, Ob1+Ob2+Ob3)     0.2999    0.4850
MOOTweetSumm (UBest, WOSOM, Ob1+Ob2)       0.3033    0.4769
MOOTweetSumm (UBest, SOM, Ob1+Ob2+Ob3)     0.3033    0.4681
VecSim–ConComp–MaxDeg                      0.1919    0.4457
VecSim–ConComp–MaxLen                      0.1940    0.4506
VecSim–ConComp–maxSumTFIDF                 0.1886    0.4600
VecSim–Community–maxSumTFIDF               0.1898    0.4591
ClusterRank (CR)                           0.0859    0.2684
COWTS (CW)                                 0.1790    0.4454
FreqSum (FS)                               0.1473    0.3602
Lex-Rank (LR)                              0.0489    0.1525
LSA (LS)                                   0.1599    0.4234
LUHN (LH)                                  0.1650    0.4015
Mead (MD)                                  0.1172    0.3709
SumBasic (SB)                              0.1012    0.3289
SumDSDR (SM)                               0.0985    0.2602

On comparing the best average ROUGE scores of the SBest and UBest selection methods, SBest performs better than UBest, which is expected because of its use of supervised information. The best ROUGE-2 and ROUGE-L scores attained by SBest are 0.3150 and 0.4860, respectively, while, using UBest, the ROUGE-2 and ROUGE-L scores are 0.3033 and 0.4769, respectively. Thus, the UBest method is not able to reach the exact results (average ROUGE scores) obtained by SBest. But it can be inferred that the results of UBest are able to beat the results of the existing approaches. Researchers are still exploring different techniques in this context.


7.5.3 Comparative Analysis

The comparative methods VecSim–ConComp–MaxDeg, VecSim–ConComp–MaxLen, VecSim–ConComp–maxSumTFIDF, and VecSim–Community–maxSumTFIDF are based on an ensembling technique, i.e., they consider the summaries generated by different existing algorithms and then generate the final summary in an unsupervised/supervised way. Although this is a promising technique, it is very time-consuming in a real-time scenario. Also, these approaches remove redundant tweets before applying the ensembling algorithm. The remaining algorithms, like LUHN, Lex-Rank, MEAD etc., are very basic algorithms suggested in the literature [34]. The technique COWTS generates the summary based on the content words in the dataset. Our proposed approach is unique compared to all the existing approaches in the following ways:

• None of the comparative methods provides the user with a set of alternative solutions on the final Pareto front; thus, they do not give the end-user an opportunity to select the single best summary out of many choices as per his/her requirement. In our approach, the end-user has the flexibility to select a single summary based on some objective functional value or his/her expert knowledge.

• Moreover, unlike the other compared approaches, redundant tweets are automatically removed from the resultant summary by utilizing the anti-redundancy objective function in our approach.

Experimental results suggest that our algorithm is able to beat all these algorithms, as it attains ROUGE-2 and ROUGE-L values of 0.3150 and 0.4860, respectively, using the SBest selection method. In other words, our algorithm improves by 62.37% and 5.65% in terms of ROUGE-2 and ROUGE-L scores, respectively, over the state-of-the-art techniques. Lex-Rank performs the worst among all techniques. Note that the 'improvement obtained' is calculated using the formula ((ProposedMethod − OtherMethod) / OtherMethod) × 100.

7.5.4 Quality of Summaries for Different Solutions

To illustrate the quality of the summaries corresponding to different solutions on the final Pareto front obtained at the final generation using the proposed approach with the 'with SOM' and 'without SOM' genetic operators, we have also plotted the ranges of Rouge-2/L score values attained by the rank-1 solutions in Fig. 7.7. We have chosen rank-1 solutions because the best solution belongs to this set. From Fig. 7.7(a), (b), and (d) for the SH, UK, and HB datasets, respectively, it can be observed that some solutions in the 'without SOM' version have low ROUGE-2/L values, but the best solution is identified by this version (as can be seen from the Rouge values



Figure 7.7: Box plots in sub-figures (a), (b), (c) and (d) for SH, UK, TH and HB datasets, respectively, show the variations of average Rouge-2/Rouge-L values of highest ranked (rank-1) solutions of each document. In each colored box, the horizontal colored line indicates the median value of rank-1 solutions.

corresponding to the green bullets). But, for the UK dataset, the median value of the rank-1 solutions is higher when using the 'with SOM' version. Thus, it can be inferred that the efficacy of SOM as a reproduction operator in the summarization framework depends on the dataset used; SOM-based operators will not be effective in solving the summarization task in all cases.

7.5.5 Pareto Fronts Obtained

The Pareto fronts obtained by our proposed approach corresponding to the best results obtained using the SBest selection method are shown in Figs. 7.8 and 7.9, generated at the end of the {0, 10, 20}th generations. The Pareto fronts shown in Figs. 7.8 and 7.9 correspond to the 'with SOM' and 'without SOM' versions, respectively, for the TH dataset. Note that we have not shown the Pareto fronts for the other datasets due to length constraints. In the 0th generation, solutions are initialized randomly and are thus randomly distributed over the objective space. On comparing the 'with SOM' and 'without SOM' versions (Figs. 7.8 and 7.9), it can be observed that using the 'without SOM' version we obtain a more optimized and diverse set of solutions, which also supports our results reported in Table 7.2. In these figures, a '.' indicates a solution's objective functional values. Various colors represent solutions of different ranks or fronts; the highest ranked solutions are indicated by the color (blue) assigned to 'fr-0', as shown in the legend of Fig. 7.8(a), and so on.



Figure 7.8: Pareto optimal fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘With SOM’ version.

7.5.6 Sensitivity Analysis on the Parameters Used

The performance of any algorithm depends on the proper choice of the parameters used in that algorithm. Therefore, in this section, we have performed a sensitivity analysis on the parameters used. We have considered 5 parameters (η0, CR, b, F and Q) for optimization, each having a possible range of values reported in Table 7.8. Note that we have excluded the population size

(|P|) and the maximum number of generations (gmax) from the optimization process because the larger gmax and |P| are, the higher the execution time of the algorithm will be. Therefore, gmax and |P| are both fixed at 25. The number of neurons in the SOM grid is set equal to the population size; these neurons are arranged in a 2-D grid of size N1 × N2 = 5 × 5. The initial neighborhood radius is taken as σ0 = (1/2)√( Σ_{i=1}^{m−1} Ni² / (m − 1) ), inspired by [168], where m is the number of objective functions used in our approach. For the parameters listed in Table 7.8, different combinations are tried. The corresponding results after executing the proposed approach using the SOM-based operator and optimizing the two



Figure 7.9: Pareto fronts obtained at the end of {0, 10, 20}th generation corresponding to TH dataset using ‘Without SOM’ version.

objectives Ob1 and Ob2 are shown in Table 7.6 and Table 7.7. These results are obtained after a single execution of the proposed algorithm. In these tables, the first column indicates the parameter setting (PS) number; for example, PS01 is the first parameter setting. Table 7.9 reports the average ROUGE scores over all datasets corresponding to the different parameter settings mentioned in Table 7.6 and Table 7.7. It is evident from Table 7.9 that the best average ROUGE scores were obtained using parameter setting number 3, i.e., PS03 (highlighted in bold). The results in Tables 7.2, 7.3 and 7.4 are generated using the best parameter setting (PS03). We assume that the best parameter setting obtained will also work when using our approach with the 'without SOM' operator and optimizing any number of objective functions. Regarding the best parameter values for individual datasets, parameter settings PS03, PS20, PS01 and PS15 are best suited for the SH, UK, TH and HB datasets, respectively.


7.5.7 Statistical significance test

To validate the results obtained by the proposed approach, a statistical significance test, Welch's t-test [187], is conducted at a 5% significance level, as was done in the previous chapter. It is carried out to check whether the best average ROUGE scores (in Table 7.5) obtained by the proposed approach are statistically significant or occurred by chance. This t-test provides a p-value; a smaller p-value signifies greater significance. The p-values obtained using Table 7.5 are (a) < .00001 using the ROUGE-2 scores and (b) .000368 using the ROUGE-L scores. The test results support the hypothesis that the improvements obtained by the proposed approach did not occur by chance, i.e., the improvements are statistically significant.
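Welch's t-test can be reproduced with scipy as sketched below; the per-run ROUGE-2 scores shown are placeholders for illustration, not the actual experimental values.

```python
from scipy import stats

# Placeholder per-run ROUGE-2 scores for the proposed approach and one baseline.
proposed = [0.3150, 0.3102, 0.3187, 0.3121, 0.3165]
baseline = [0.1940, 0.1902, 0.1968, 0.1911, 0.1955]

# equal_var=False gives Welch's t-test (unequal variances).
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
is_significant = p_value < 0.05  # 5% significance level
```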

7.6 An Application to Multi-document Summarization

To show the effectiveness of our proposed approach on data from another domain, we have also performed multi-document summarization. The task is to generate a fixed-length summary (in terms of the number of words) given a collection of documents. For this task, we have used the standard DUC 2002 datasets provided by the Document Understanding Conference. The collection contains 59 topics, each having approximately 10 documents. The corresponding multi-document summaries (two in number), each of 200 words, are also available for each topic. Out of the 59 topics, the ten topics ranging from d061j to d070f are considered in our experiments; the same set of topics is also considered in the comparative approaches (discussed below). Statistics about these topics, like the number of words and the number of sentences, are provided in Table 7.10.

7.6.1 Comparative Approaches and Differences with Our Approach

For comparison, two existing evolutionary-based approaches are considered. The first approach utilized adaptive differential evolution [201] for optimization, in which the DE parameters are adaptive. In this approach, a weighted combination of two objectives, namely anti-redundancy (AR) and coverage (COV), is optimized. The mathematical definition of anti-redundancy is given in Eq. 7.3, while coverage refers to the central theme of the document collection, which should be covered in the summary. For a solution in the population, it is evaluated as Σ_{i=1}^{N} sim(svi, O), where N is the total number of sentences, svi is the vector representation (numeric vector) of the ith sentence belonging to the solution, and O is the document vector calculated by averaging the sentence vectors. To represent the sentences in vector form, the well-known tf-idf representation of the vector space model in information retrieval [21] is utilized. To measure the similarity among

sentences, and between sentences and the document vector, cosine similarity is utilized. In the second approach [202], these objectives (AR and COV) are optimized simultaneously (instead of using a weighted combination) using the well-known genetic algorithm in the field of multi-objective optimization, i.e., the non-dominated sorting genetic algorithm (NSGA-II) [29]. It also makes use of the same sentence vector representation strategy and similarity measure as used in adaptive DE. But, in our approach, a semantic similarity measure (WMD) is utilized. We do not make use of any vector representation scheme; therefore, in place of O in the coverage function definition, we have considered the representative sentence (sR), whose index in the document collection is evaluated by calculating the minimum average dissimilarity of each sentence with the other sentences in the topic, i.e.,

arg min_R ( Σ_{j=1, j≠R}^{N} distwmd(sR, sj) / (N − 1) )        (7.11)

where R = 1, 2, ..., N, N is the total number of sentences, and distwmd is the word mover distance.
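A sketch of locating the representative sentence of Eq. 7.11 from a precomputed pairwise WMD matrix is given below; the conversion of WMD distance to a similarity for the coverage score is an assumption, since the exact form is not spelled out here.

```python
import numpy as np

def representative_index(wmd_matrix):
    """Index of s_R (Eq. 7.11): the sentence with the minimum average WMD
    dissimilarity to all other sentences of the topic."""
    D = np.asarray(wmd_matrix, dtype=float)  # pairwise dist_wmd, D[i, i] == 0
    n = D.shape[0]
    return int(np.argmin(D.sum(axis=1) / (n - 1)))

def coverage(summary_idx, wmd_matrix, r_idx):
    """COV analogue: aggregate similarity of the selected sentences to s_R,
    here taken as 1 / (1 + distance) purely for illustration."""
    D = np.asarray(wmd_matrix, dtype=float)
    return float(sum(1.0 / (1.0 + D[i, r_idx]) for i in summary_idx))
```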

7.6.2 Results Obtained

In Table 7.10, we have shown the results obtained for different topics in terms of ROUGE-2 measure. It can be observed that our proposed approach improves by 14.28% and 3.42% over adaptive DE (in short, ADE) and NSGA-II, respectively. ADE and our proposed approach both are based on differential evolution. ADE and NSGA-II use syntactic similarity, while, our approach uses semantic similarity. Note that WMD makes use of a pre-trained word2vec model [43] on the Googlenews5 corpus which contains 3 billion words and each word vector is of 300 dimensions. In the future, we want to see the effect of using vector representation of sentences in the semantic space.

7.7 Conclusive Remarks

In this chapter, we presented a multi-objective based extractive summarization technique for solving the microblog summarization task. A multi-objective binary differential evolution (MOBDE) technique is used as the underlying optimization strategy in the proposed summarization system. The SOM-based operator is also explored in fusion with MOBDE. The similarity/dissimilarity between two tweets is calculated utilizing the word mover distance to capture the semantic information. Three objective functions are optimized simultaneously for selecting a useful subset

5https://github.com/mmihaltz/word2vec-GoogleNews-vectors

of tweets present in the dataset/event. Experimental data and an in-depth analysis of the results are provided in the chapter. It is clear from the obtained results that our approach outperforms the existing techniques. The results demonstrate that our proposed approach, MOOTweetSumm, obtains 62.37% and 5.65% improvements over the existing techniques in terms of the ROUGE-2 and ROUGE-L evaluation measures, respectively. The results are also validated using a statistical significance test. The application of the proposed approach is also shown for the multi-document summarization task, in which we have obtained 14.28% and 3.42% improvements over the two existing evolutionary-based techniques, ADE and NSGA-II, respectively. Our future work will extend the current approach to online summarization of microblog tweets. In the current work, we have focused only on textual tweets. But, due to the length limitation of 140 characters, users can share multimedia content such as images, video and audio links along with the tweet-text. In the literature, it has been shown that images play a supplementary role alongside the textual content, especially on microblogging sites where a tweet-text lacks the expressive power to present something. Therefore, the next chapter is dedicated to multi-modal microblog summarization.


Table 7.6: Sensitivity analysis on the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in the predicted summary, respectively.

PS01 (η0=0.2, CR=0.8, b=6, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.4983, Rouge-2 0.3103, Rouge-L 0.4884
  UK: #TO 32, Rouge-1 0.4306, Rouge-2 0.2377, Rouge-L 0.4235
  TH: #TO 42, Rouge-1 0.3929, Rouge-2 0.2379, Rouge-L 0.3866
  HB: #TO 34, Rouge-1 0.5198, Rouge-2 0.3759, Rouge-L 0.5198
PS02 (η0=0.4, CR=0.8, b=6, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.5049, Rouge-2 0.3122, Rouge-L 0.4950
  UK: #TO 35, Rouge-1 0.4400, Rouge-2 0.2485, Rouge-L 0.4329
  TH: #TO 40, Rouge-1 0.3340, Rouge-2 0.1923, Rouge-L 0.3277
  HB: #TO 31, Rouge-1 0.2782, Rouge-2 0.3810, Rouge-L 0.5272
PS03 (η0=0.6, CR=0.8, b=6, F=0.8, Q=5):
  SH: #TO 37, Rouge-1 0.5545, Rouge-2 0.3684, Rouge-L 0.5512
  UK: #TO 33, Rouge-1 0.4706, Rouge-2 0.2914, Rouge-L 0.4565
  TH: #TO 39, Rouge-1 0.3697, Rouge-2 0.2172, Rouge-L 0.3634
  HB: #TO 31, Rouge-1 0.5050, Rouge-2 0.3448, Rouge-L 0.5000
PS04 (η0=0.2, CR=0.2, b=6, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.5215, Rouge-2 0.3321, Rouge-L 0.4116
  UK: #TO 32, Rouge-1 0.4118, Rouge-2 0.2193, Rouge-L 0.3976
  TH: #TO 42, Rouge-1 0.3235, Rouge-2 0.1715, Rouge-L 0.3172
  HB: #TO 35, Rouge-1 0.5322, Rouge-2 0.3569, Rouge-L 0.5322
PS05 (η0=0.2, CR=0.6, b=6, F=0.8, Q=5):
  SH: #TO 35, Rouge-1 0.5281, Rouge-2 0.3267, Rouge-L 0.5248
  UK: #TO 34, Rouge-1 0.4471, Rouge-2 0.2791, Rouge-L 0.3638
  TH: #TO 39, Rouge-1 0.3445, Rouge-2 0.1950, Rouge-L 0.3403
  HB: #TO 31, Rouge-1 0.4901, Rouge-2 0.3414, Rouge-L 0.4827
PS06 (η0=0.2, CR=0.8, b=5, F=0.8, Q=5):
  SH: #TO 38, Rouge-1 0.5512, Rouge-2 0.3358, Rouge-L 0.5413
  UK: #TO 34, Rouge-1 0.4588, Rouge-2 0.2791, Rouge-L 0.4494
  TH: #TO 41, Rouge-1 0.3193, Rouge-2 0.1729, Rouge-L 0.3130
  HB: #TO 32, Rouge-1 0.4901, Rouge-2 0.3500, Rouge-L 0.4901
PS07 (η0=0.2, CR=0.8, b=7, F=0.8, Q=5):
  SH: #TO 34, Rouge-1 0.5050, Rouge-2 0.3122, Rouge-L 0.4983
  UK: #TO 33, Rouge-1 0.4353, Rouge-2 0.2485, Rouge-L 0.4235
  TH: #TO 40, Rouge-1 0.3151, Rouge-2 0.1632, Rouge-L 0.3109
  HB: #TO 32, Rouge-1 0.5050, Rouge-2 0.3500, Rouge-L 0.5025
PS08 (η0=0.2, CR=0.8, b=6, F=0.2, Q=5):
  SH: #TO 35, Rouge-1 0.5347, Rouge-2 0.3303, Rouge-L 0.5281
  UK: #TO 35, Rouge-1 0.4424, Rouge-2 0.2423, Rouge-L 0.4282
  TH: #TO 41, Rouge-1 0.3298, Rouge-2 0.1646, Rouge-L 0.3277
  HB: #TO 32, Rouge-1 0.4851, Rouge-2 0.3414, Rouge-L 0.4827
PS09 (η0=0.2, CR=0.8, b=6, F=0.5, Q=5):
  SH: #TO 36, Rouge-1 0.5281, Rouge-2 0.3339, Rouge-L 0.5182
  UK: #TO 33, Rouge-1 0.4118, Rouge-2 0.2224, Rouge-L 0.4047
  TH: #TO 42, Rouge-1 0.3508, Rouge-2 0.2006, Rouge-L 0.3445
  HB: #TO 32, Rouge-1 0.5074, Rouge-2 0.3690, Rouge-L 0.5074
PS10 (η0=0.2, CR=0.8, b=6, F=0.8, Q=4):
  SH: #TO 35, Rouge-1 0.5677, Rouge-2 0.3448, Rouge-L 0.5611
  UK: #TO 33, Rouge-1 0.4329, Rouge-2 0.2515, Rouge-L 0.4259
  TH: #TO 40, Rouge-1 0.3697, Rouge-2 0.2241, Rouge-L 0.3613
  HB: #TO 33, Rouge-1 0.5025, Rouge-2 0.3517, Rouge-L 0.5025
PS11 (η0=0.2, CR=0.8, b=6, F=0.8, Q=6):
  SH: #TO 35, Rouge-1 0.5479, Rouge-2 0.3339, Rouge-L 0.5347
  UK: #TO 34, Rouge-1 0.4282, Rouge-2 0.2423, Rouge-L 0.4188
  TH: #TO 40, Rouge-1 0.3676, Rouge-2 0.2130, Rouge-L 0.3613
  HB: #TO 32, Rouge-1 0.4876, Rouge-2 0.3414, Rouge-L 0.4851
PS12 (η0=0.6, CR=0.8, b=6, F=0.8, Q=4):
  SH: #TO 35, Rouge-1 0.5281, Rouge-2 0.3176, Rouge-L 0.5182
  UK: #TO 34, Rouge-1 0.4541, Rouge-2 0.2807, Rouge-L 0.4447
  TH: #TO 41, Rouge-1 0.3361, Rouge-2 0.1674, Rouge-L 0.3319
  HB: #TO 34, Rouge-1 0.4950, Rouge-2 0.3431, Rouge-L 0.4950
PS13 (η0=0.6, CR=0.8, b=6, F=0.8, Q=6):
  SH: #TO 37, Rouge-1 0.5512, Rouge-2 0.3230, Rouge-L 0.5513
  UK: #TO 34, Rouge-1 0.4588, Rouge-2 0.2776, Rouge-L 0.4494
  TH: #TO 42, Rouge-1 0.3655, Rouge-2 0.2019, Rouge-L 0.3613
  HB: #TO 33, Rouge-1 0.4728, Rouge-2 0.3207, Rouge-L 0.4678
PS14 (η0=0.6, CR=0.8, b=6, F=0.2, Q=5):
  SH: #TO 36, Rouge-1 0.5116, Rouge-2 0.3158, Rouge-L 0.5050
  UK: #TO 34, Rouge-1 0.4329, Rouge-2 0.2423, Rouge-L 0.4212
  TH: #TO 42, Rouge-1 0.3655, Rouge-2 0.2172, Rouge-L 0.3634
  HB: #TO 33, Rouge-1 0.4827, Rouge-2 0.3414, Rouge-L 0.4802


Table 7.7: Sensitivity analysis of the parameters used in the proposed algorithm utilizing SOM-based operator and optimizing two objectives, Ob1 and Ob2. Here PS and #TO stand for Parameter setting and the number of tweets obtained in the predicted summary, respectively. Note that this table is a continuation of Table 7.6.

PS15 (η0=0.6, CR=0.8, b=6, F=0.5, Q=5):
  SH: #TO 37, Rouge-1 0.5545, Rouge-2 0.3593, Rouge-L 0.5479
  UK: #TO 34, Rouge-1 0.4471, Rouge-2 0.2347, Rouge-L 0.4400
  TH: #TO 41, Rouge-1 0.3319, Rouge-2 0.1632, Rouge-L 0.3256
  HB: #TO 33, Rouge-1 0.5248, Rouge-2 0.3793, Rouge-L 0.5248
PS16 (η0=0.6, CR=0.8, b=7, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5050, Rouge-2 0.3140, Rouge-L 0.4983
  UK: #TO 34, Rouge-1 0.4659, Rouge-2 0.3021, Rouge-L 0.4588
  TH: #TO 41, Rouge-1 0.3866, Rouge-2 0.2407, Rouge-L 0.3845
  HB: #TO 33, Rouge-1 0.4901, Rouge-2 0.3414, Rouge-L 0.4876
PS17 (η0=0.6, CR=0.8, b=5, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5248, Rouge-2 0.3412, Rouge-L 0.5149
  UK: #TO 34, Rouge-1 0.4306, Rouge-2 0.2500, Rouge-L 0.4212
  TH: #TO 41, Rouge-1 0.3151, Rouge-2 0.1701, Rouge-L 0.3109
  HB: #TO 33, Rouge-1 0.5099, Rouge-2 0.3793, Rouge-L 0.5099
PS18 (η0=0.6, CR=0.6, b=6, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5413, Rouge-2 0.3412, Rouge-L 0.5347
  UK: #TO 35, Rouge-1 0.4706, Rouge-2 0.2761, Rouge-L 0.4541
  TH: #TO 41, Rouge-1 0.3298, Rouge-2 0.1660, Rouge-L 0.3256
  HB: #TO 33, Rouge-1 0.4950, Rouge-2 0.3448, Rouge-L 0.4950
PS19 (η0=0.6, CR=0.2, b=6, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5446, Rouge-2 0.3339, Rouge-L 0.5380
  UK: #TO 34, Rouge-1 0.4165, Rouge-2 0.2377, Rouge-L 0.4047
  TH: #TO 41, Rouge-1 0.3277, Rouge-2 0.1715, Rouge-L 0.3235
  HB: #TO 33, Rouge-1 0.5000, Rouge-2 0.3500, Rouge-L 0.4975
PS20 (η0=0.6, CR=0.8, b=5, F=0.8, Q=4):
  SH: #TO 36, Rouge-1 0.5083, Rouge-2 0.3158, Rouge-L 0.5017
  UK: #TO 34, Rouge-1 0.4494, Rouge-2 0.2761, Rouge-L 0.4353
  TH: #TO 41, Rouge-1 0.3718, Rouge-2 0.2102, Rouge-L 0.3697
  HB: #TO 33, Rouge-1 0.4975, Rouge-2 0.3500, Rouge-L 0.4975
PS21 (η0=0.4, CR=0.8, b=5, F=0.8, Q=4):
  SH: #TO 36, Rouge-1 0.5611, Rouge-2 0.3376, Rouge-L 0.5479
  UK: #TO 33, Rouge-1 0.4235, Rouge-2 0.2485, Rouge-L 0.4141
  TH: #TO 41, Rouge-1 0.3382, Rouge-2 0.1909, Rouge-L 0.3361
  HB: #TO 33, Rouge-1 0.4926, Rouge-2 0.3466, Rouge-L 0.4926
PS22 (η0=0.4, CR=0.2, b=5, F=0.8, Q=5):
  SH: #TO 36, Rouge-1 0.5578, Rouge-2 0.3557, Rouge-L 0.5578
  UK: #TO 33, Rouge-1 0.4259, Rouge-2 0.2439, Rouge-L 0.4165
  TH: #TO 41, Rouge-1 0.3172, Rouge-2 0.1660, Rouge-L 0.3151
  HB: #TO 33, Rouge-1 0.5000, Rouge-2 0.3448, Rouge-L 0.4975
PS23 (η0=0.4, CR=0.6, b=6, F=0.8, Q=4):
  SH: #TO 36, Rouge-1 0.5413, Rouge-2 0.3358, Rouge-L 0.5380
  UK: #TO 33, Rouge-1 0.4635, Rouge-2 0.2577, Rouge-L 0.4541
  TH: #TO 41, Rouge-1 0.3403, Rouge-2 0.1923, Rouge-L 0.3340
  HB: #TO 33, Rouge-1 0.5074, Rouge-2 0.3517, Rouge-L 0.5050

Table 7.8: Range of possible values for each of the 5 parameters

Parameter                       Possible Values
Initial learning rate (η0)      [0.2, 0.4, 0.6]
Crossover probability (CR)      [0.2, 0.8, 0.6]
Positive constant (b)           [5, 6, 7]
Scaling factor (F)              [0.2, 0.5, 0.8]
Mating pool size (Q)            [4, 5, 6]


Table 7.9: Average ROUGE scores over all datasets corresponding to different parameter settings shown in Table 7.6 and Table 7.7. Here PS stands for Parameter Setting.

PS No.  η0   CR   b  F    Q  Rouge-1  Rouge-2  Rouge-L
PS01    0.2  0.8  6  0.8  5  0.4604   0.2905   0.4546
PS02    0.4  0.8  6  0.8  5  0.3893   0.2835   0.4457
PS03    0.6  0.8  6  0.8  5  0.4749   0.3055   0.4678
PS04    0.2  0.2  6  0.8  5  0.4472   0.2700   0.4147
PS05    0.2  0.6  6  0.8  5  0.4524   0.2856   0.4279
PS06    0.2  0.8  5  0.8  5  0.4549   0.2844   0.4484
PS07    0.2  0.8  7  0.8  5  0.4401   0.2685   0.4338
PS08    0.2  0.8  6  0.2  5  0.4480   0.2697   0.4417
PS09    0.2  0.8  6  0.5  5  0.4495   0.2815   0.4437
PS10    0.2  0.8  6  0.8  4  0.4682   0.2930   0.4627
PS11    0.2  0.8  6  0.8  6  0.4578   0.2827   0.4500
PS12    0.6  0.8  6  0.8  4  0.4533   0.2772   0.4475
PS13    0.6  0.8  6  0.8  6  0.4621   0.2808   0.4550
PS14    0.6  0.8  6  0.2  5  0.4482   0.2792   0.4424
PS15    0.6  0.8  6  0.5  5  0.4645   0.2841   0.4596
PS16    0.6  0.8  7  0.8  5  0.4619   0.2995   0.4573
PS17    0.6  0.8  5  0.8  5  0.4451   0.2852   0.4392
PS18    0.6  0.6  6  0.8  5  0.4592   0.2820   0.4524
PS19    0.6  0.2  6  0.8  5  0.4472   0.2733   0.4409
PS20    0.6  0.8  5  0.8  4  0.4568   0.2880   0.4511
PS21    0.4  0.8  5  0.8  4  0.4538   0.2809   0.4477
PS22    0.4  0.2  5  0.8  5  0.4502   0.2776   0.4467
PS23    0.4  0.6  6  0.8  4  0.4631   0.2844   0.4578

Table 7.10: Statistics about the DUC2002 topics and corresponding average ROUGE-2 scores.

Topic No.  #Sentences  #Words  Adaptive DE  NSGA-II  Proposed
d061j      184         3679    0.266        0.306    0.337
d062j      118         2691    0.188        0.200    0.200
d063j      249         4793    0.245        0.275    0.220
d064j      183         4080    0.194        0.233    0.392
d065j      284         5500    0.144        0.182    0.183
d066j      190         3894    0.201        0.181    0.258
d067f      122         2805    0.239        0.260    0.286
d068f      131         2565    0.491        0.496    0.294
d069f      327         1306    0.184        0.232    0.220
d070f      148         3116    0.224        0.262    0.332

CHAPTER 8

Multi-modal Microblog Summarization

This chapter proposes a multi-objective evolutionary (MOEA) framework for the multi-modal microblog summarization task, where tweet-texts associated with images are considered for summarization. Due to the utilization of the population-based behaviour of MOEA, each solution in the population is allowed to perform either exploitation (local search) or exploration (global search) based on mating restriction probabilities which are continuously updated based on the survival length of the solutions at each generation.


8.1 Introduction

8.1.1 Overview

On microblogging sites, it has been observed that the 140-character length limit encourages users to post multimedia content such as images and video or audio links along with the text. For example, Figure 8.1 shows an image attached to a tweet-text that depicts infrastructure damage, while the tweet-text itself provides information about the earthquake magnitude. When the information from the tweet-text and the image is combined, the result may convey crucial or relevant information to a disaster management authority and thus help it decide on the best action. Moreover, it has been shown in the literature that images play a supplementary role to the textual content, especially on microblogging sites where tweet-texts often lack the expressive power to describe an event fully. Therefore, in this chapter, we develop an approach, MMTweetSumm, that summarizes the relevant multimedia tweets by extracting an optimal subset of tweets while taking the multimedia content into account. It is important to note that the current chapter extends the work of the previous chapter, in which the approach MOOTweetSumm was developed for microblog summarization using only textual tweets.

Figure 8.1: Image available with the tweet-text during an earthquake in Mexico.

MOOTweetSumm uses the concept of multi-objective optimization (MOO): various statistical measures are simultaneously optimized to improve the quality of the summary. These measures are (a) anti-redundancy, which measures the dissimilarity between the tweets; (b) the tf-idf score of the tweet (the sum of the tf-idf scores of the words in the tweet); and (c) the length of the tweet; all of these objective functions are of the maximization type. Our algorithm also makes use of the MOO concept. The following points distinguish our approach from the baseline approach MOOTweetSumm:

1. Anti-redundancy measure: In MOOTweetSumm, only the textual-tweet information is considered while measuring the anti-redundancy objective function. In MMTweetSumm, however, the textual tweet is combined with the textual description (or, alternatively, the image feature) of the corresponding image available with the tweet. The benefit of doing this is to add supplementary information to the textual tweet and make it more informative.

2. Exploration vs. exploitation behaviour: In any evolutionary algorithm, these two behaviours decide whether the search for the optimal solution should be global or local. Both algorithms utilize MOBDE as the underlying optimization strategy, which is a type of MOEA (discussed in Section 2.1.9 of Chapter 2) and starts from a candidate set of solutions, each of which may cover a different region of the search space. Therefore, in our approach, at each generation, each solution can perform either exploitation or exploration based on a mating restriction probability [203]. This probability is self-adaptive and based on the survival length (SL) of the solution, where SL is the number of generations the solution has survived over the last few generations (a user-defined window). If the SL value of a solution is high, exploitation takes place; otherwise, exploration is conducted. In MOOTweetSumm, the concept of a self-adaptive mating restriction probability was not present, and the best results were reported using only exploration behaviour.

3. Tweet-scoring functions: In this work, in addition to (a) anti-redundancy, (b) the tf-idf score of the tweet, and (c) the length of the tweet, two more tweet-scoring features are considered in our optimization framework: (d) BM25 [21], a bag-of-words retrieval function designed to rank short texts; and (e) RT (re-tweet) [131], which counts how many times a tweet is re-posted. A high re-post/re-tweet count indicates that the tweet has attracted a lot of attention and interest from users and is therefore more important; the same holds for a high BM25 score. Note that these functions have never been explored in combination with MOEA for the multi-modal microblog summarization task.

Currently, the dense captioning task [204, 205] is gaining popularity due to its ability to predict a set of natural language descriptions for various regions of an image. Therefore, to extract the image features for our task, we used the pre-trained model available at https://github.com/jcjohnson/densecap, which was trained on 94,000 images. An example1 is illustrated in Figure 8.2, in which the different regions predicted by the model are shown with coloured boundaries, and the caption of each region is written below the figure in the same colour. Other models, such as VGGNet16/19 [206], ResNet50 [206], and InceptionV3 [207], are also available for image feature extraction, but they do not provide a textual description of the

1https://cs.stanford.edu/people/karpathy/densecap/

image. Also, 'a picture is worth a thousand words'. Therefore, to obtain textual descriptions of the images, we utilized the dense captioning model for our task.
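To make the use of the dense captions concrete, the following is a minimal sketch (not the thesis implementation) of how a tweet-text could be augmented with the region captions returned by a dense-captioning model before computing the text+image variant of anti-redundancy; the caption list, helper name, and cut-off are illustrative assumptions.

```python
# Minimal sketch: augment a tweet-text with the natural-language region captions
# produced by a dense-captioning model, so that the combined string can be used by
# the text+image variant (J1) of the anti-redundancy objective. The captions below
# are hypothetical; densecap typically returns region captions ranked by confidence.

def augment_tweet_with_captions(tweet_text, region_captions, max_captions=5):
    """Concatenate a tweet with the top-ranked dense captions of its attached image."""
    selected = region_captions[:max_captions]
    return tweet_text + " " + " ".join(selected)

captions = ["a collapsed building", "debris on the street", "people standing near rubble"]
print(augment_tweet_with_captions("7.1 magnitude earthquake hits Mexico City", captions))
```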

Figure 8.2: An example of the dense captioning model, taken from https://cs.stanford.edu/people/karpathy/densecap

8.1.2 Major Contributions

The major contributions of the current work are enumerated below:

1. The current work is the first of its kind in which both the image and the tweet-text are utilized simultaneously to generate a summary from microblog data produced during a disaster event.

2. Only limited works exist for multi-modal microblog summarization [130, 132]. Moreover, no existing work has explored the image dense captioning model [204] for image feature extraction.

3. We created a new gold-standard dataset for multi-modal microblog summarization covering three different disaster events. Human annotators were employed to manually generate the gold summaries from these data.

4. None of the existing works explored the concept of MOEA for multi-modal microblog summarization.

5. To find the optimal summary, a self-adaptive mating restriction probability is incorporated in the proposed MOEA-based framework to explore the search space efficiently.


Table 8.1: Notations used, with their descriptions. Here, tf-idf refers to term frequency-inverse document frequency.

Symbol | Description
E | Disaster event containing tweets
NE | Total number of situational tweets in E
M | Number of tweets to be included in the summary
S | Obtained summary
tk | kth tweet
tavg | Average number of words per tweet
|tk| | Number of words in the kth tweet
T(w, tk) | Term frequency of a word w in the kth tweet
F(w, tk) | Inverse document frequency of a word w in the kth tweet
L(tk) | Length of the kth tweet
D(tk, tm) | Word mover's distance between the kth and mth tweets
P | Population
|P| | Number of solutions in the population
MaxGen | Maximum number of generations
CR | Crossover probability
F | Control (scaling) factor
b | Real positive constant

8.2 Tweet-scoring Functions

For any summarization system, the selection of statistical measures that help in choosing informative sentences/tweets is a crucial task. Therefore, in this work, we have explored five measures (also called objective functions), all of which should be maximized to obtain a good-quality summary. The mathematical formulations of these functions are discussed below. The notations/symbols used throughout the current chapter are described in Table 8.1.

1. MaxAntiRedundancy (J1/J2): This measure is designed to avoid redundancy in the summary; for its mathematical definition, refer to Eq. 7.3 of Chapter 7. Two scenarios are considered here: (a) if tweet tk (tl) consists of the tweet-text concatenated with the natural-language feature of its image, then MaxAntiRedundancy is denoted as J1; (b) if tk (tl) consists of only the tweet-text, then it is denoted as J2.

2. MaxSumTFIDF (J3)/MaxLength (J4): These measures are the same as Ob2 and Ob3 of Chapter 7. Their mathematical definitions are given in Eqs. 7.6 and 7.7, respectively.

3. MaxSumBM25 (J5): BM25 [208] is a ranking function used in information retrieval to rank documents (tweets in our case) based on their relevance to a query. It was designed with short texts such as tweets in mind and, as reported in the literature [21], it performs better than the tf-idf


[188] model when the text is short, as is the case for tweets. Therefore, it is adopted as one of the objective functions in our framework. Mathematically, it is described as

J_5 = \frac{1}{M} \sum_{k=1,\; t_k \in S}^{M} \mathrm{BM25}(t_k, Q) \qquad (8.1)

where Q is a query with terms q1, q2, ..., qn, and the BM25 score of a tweet tk ∈ S, denoted W (= BM25(tk, Q)), is defined as

W = \sum_{i=1}^{n} F(q_i, t_k)\, \frac{T(q_i, t_k)\,(k_1 + 1)}{T(q_i, t_k) + k_1\,\bigl(1 - b + b\,|t_k|/t_{avg}\bigr)} \qquad (8.2)

where b and k1 are the BM25 hyper-parameters, generally chosen as k1 ∈ [1.2, 2.0] and b = 0.75. Note that here Q refers to the entire set of tweets in the disaster event E.

4. MaxRTScore (J6): On any social network, the importance of a tweet can be inferred from its re-post count [131]: a high re-post count indicates that the tweet has attracted a lot of attention and interest from users. Therefore, the quality of a summary S is also evaluated as

J_6 = \sum_{k=1}^{M} \log\bigl(\mathrm{RepostNumber}(t_k) + 1\bigr) \qquad (8.3)

where RepostNumber counts how many times a tweet is re-posted.

Note that the first three objective functions (J2, J3, and J4) are the same as those used in our preliminary model MOOTweetSumm. A small computational sketch of J5 and J6 is given below.
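The sketch below illustrates how J5 (Eqs. 8.1 and 8.2) and J6 (Eq. 8.3) can be computed for a candidate summary. It assumes tokenized tweets, a precomputed IDF dictionary, and the commonly used hyper-parameter values; it is an illustration under those assumptions, not the exact thesis implementation.

```python
import math
from collections import Counter

# Sketch of the BM25-based objective J5 (Eqs. 8.1-8.2) and the re-tweet objective J6
# (Eq. 8.3). `idf` maps a term to its inverse document frequency F(w, t_k) computed
# over the event E; k1 and b take commonly used values (assumptions, see the text).

def bm25(tweet_tokens, query_terms, idf, t_avg, k1=1.2, b=0.75):
    tf = Counter(tweet_tokens)
    score = 0.0
    for q in query_terms:
        num = tf[q] * (k1 + 1)
        den = tf[q] + k1 * (1 - b + b * len(tweet_tokens) / t_avg)
        score += idf.get(q, 0.0) * num / den
    return score

def j5(summary, query_terms, idf, t_avg):
    # Eq. 8.1: average BM25 score of the M tweets selected in the summary S
    return sum(bm25(t, query_terms, idf, t_avg) for t in summary) / len(summary)

def j6(repost_counts):
    # Eq. 8.3: log-scaled re-post counts of the selected tweets
    return sum(math.log(r + 1) for r in repost_counts)
```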

8.3 Dataset Creation

In this work, due to the unavailability of a suitable dataset for the disaster-event scenario, we have constructed three datasets, namely Hurricane Harvey2, Hurricane Irma3 and Srilanka Flood4, for our summarization task. For this purpose, the CrisisMMD5 resource is utilized, which includes several disaster events with tweets and their images. These tweets are a mixture of informative and non-informative tweets (the latter being tweets that express anger, despair, and other moods of people during the event), and the images associated with them may likewise be informative or non-informative. Note that the annotation of the informative or non-informative category for the tweets and

2https://en.wikipedia.org/wiki/Hurricane_Harvey
3https://en.wikipedia.org/wiki/Hurricane_Irma
4https://en.wikipedia.org/wiki/2017_Sri_Lanka_floods
5https://crisisnlp.qcri.org/

images is already provided by the annotators of CrisisMMD. As our task considers multiple modalities (text and images) of the tweets, the benefit of incorporating the image with the tweet-text is also taken into account while generating the datasets. Below is a description of the developed datasets:

1. In the first step, we extract only the informative multimedia tweets, i.e., those whose associated images are also informative. The rest of the tweets are excluded because they do not provide any assistance for human aid during the disaster event.

2. The extracted informative multimedia tweets are pre-processed using the tweet pre-processor6 available in Python. Using this, we remove URLs, hashtags, and mentions. Moreover, all tweets are converted into lower case, and tweets containing fewer than 3 words are removed because, on inspection, they do not convey any useful information.

3. In the third step, the K-medoid [4] clustering algorithm is applied to the multimedia tweets. Note that K-medoid starts from K (user-defined) random cluster centers, i.e., K random multimedia tweets. For tweet-to-tweet dissimilarity, WMD is utilized, while for image-to-image dissimilarity, we first extract a 4096-dimensional image feature vector using VGG19Net and then compute the cosine distance (CD), defined as 1 − cosine similarity (for the definition of cosine similarity, refer to Section 2.1.3 of Chapter 2). A multimedia tweet is assigned to the cluster with the minimum average distance (WMD_{i,k} + CD_{i,k})/2, where i and k denote the multimedia tweet and the cluster center, respectively. The detailed procedure of K-medoid is provided in Section 2.1.1 of Chapter 2. We vary the number of clusters in the range [35, 100] and evaluate each partitioning using a cluster validity index, namely the silhouette index (SI) (for its definition, refer to Table 2.1), which measures cluster quality in terms of compactness (distance between multimedia tweets within a cluster) and separation (distance between clusters). SI ranges over [−1, 1], and a higher value indicates better partitioning. For our purpose, we choose the value of K at which the highest SI is obtained.

4. From the obtained clusters, four different summaries are created by extracting the tweets having the maximum value of each of four different features:

• tf-idf score of the tweet

• BM25 score

• number of content words

• length of the tweet

6https://pypi.org/project/tweet-preprocessor

Table 8.2: Dataset statistics. Here, #ITWIM: informative tweets with informative images; #PITWIM: pre-processed informative tweets with informative images; #GT: number of tweets in the gold summary.

Event Name | Total Tweets | #ITWIM | #PITWIM | #GT
Hurricane Harvey | 4,443 | 2,261 | 2,224 | 22
Hurricane Irma | 4,525 | 2,031 | 2,001 | 37
Srilanka Flood | 599 | 398 | 389 | 37

Here, content words include numerals (contact numbers, blood bank numbers, number of casualties, etc.) and verbs like dead/injured/stranded. The importance of content words has already been shown in [126]. For the remaining features, refer to [21].

5. Finally, an ensemble approach is used, as sketched below: if a tweet occurs in at least three of the four summaries, it is added to the final summary. This final summary is treated as the gold summary.
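A compact sketch of steps 2, 3 and 5 above is given below. It assumes the tweet-preprocessor and scikit-learn-extra packages and precomputed WMD and cosine-distance matrices; the option names and helpers are illustrative, not the exact thesis code.

```python
import preprocessor as p                      # the tweet-preprocessor package from the footnote
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids    # assumption: scikit-learn-extra provides K-medoids

def preprocess(tweets, min_words=3):
    # Step 2: strip URLs, hashtags and mentions, lower-case, drop very short tweets
    p.set_options(p.OPT.URL, p.OPT.HASHTAG, p.OPT.MENTION)
    cleaned = [p.clean(t).lower() for t in tweets]
    return [t for t in cleaned if len(t.split()) >= min_words]

def best_partition(wmd, cd, k_range=range(35, 101)):
    # Step 3: cluster on the averaged text (WMD) and image (cosine) dissimilarities,
    # keeping the K that maximizes the silhouette index
    dist = (wmd + cd) / 2.0
    best_si, best_labels = -1.0, None
    for k in k_range:
        labels = KMedoids(n_clusters=k, metric="precomputed", random_state=0).fit_predict(dist)
        si = silhouette_score(dist, labels, metric="precomputed")
        if si > best_si:
            best_si, best_labels = si, labels
    return best_labels

def ensemble_gold_summary(cluster_ids, tweets, feature_scores, min_votes=3):
    # Steps 4-5: per feature, pick the best tweet of each cluster, then keep tweets
    # that appear in at least `min_votes` of the four feature-specific summaries
    votes = {}
    for scores in feature_scores:              # e.g., tf-idf, BM25, content words, length
        for c in set(cluster_ids):
            members = [i for i, cid in enumerate(cluster_ids) if cid == c]
            winner = max(members, key=lambda i: scores[i])
            votes[winner] = votes.get(winner, 0) + 1
    return [tweets[i] for i, v in votes.items() if v >= min_votes]
```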

After forming the summary, a human evaluation was carried out by two persons: one undergraduate student and one doctoral student. The overlap in tweets was found to be more than 85% for all datasets. The gold summaries obtained for Harvey, Irma, and SrilankaFlood are provided in a GitHub repository7, along with other information such as the number of clusters (K).

Figure 8.3: Representation of a solution: a binary vector over the tweet indices 1-8, e.g., [0 1 0 1 0 1 1 0], indicating that the 2nd, 4th, 6th and 7th tweets should be present in the summary.

8.4 Problem Statement

If E is a disaster event containing NE tweets, then our task is to obtain a summary S, consisting of M tweets belonging to E, that maximizes

max{J_1/J_2(S), J_3(S), J_4(S), J_5(S), J_6(S)} \qquad (8.4)

7https://github.com/nsaini1988/Multi-modal-Microblog-Summarization.git


Here, J1, J2, ..., J6 are the objective (tweet-scoring) functions discussed in Section 8.2; they are simultaneously optimized using the multi-objective binary differential evolution (MOBDE) [46] algorithm, which is a population-based meta-heuristic. The population consists of a set of solutions represented as binary vectors, and each solution is associated with fitness/objective values. An example of the solution representation is shown in Figure 8.3.

Note that (a) we performed an ablation study by varying the objective-function combinations; for example, {J1, J3}, {J1, J3, J4}, {J1, J4, J5}, and {J2, J4, J5} are some of the possible sets of objective functions that are simultaneously optimized in different runs of the proposed algorithm (we tried at most 3 objective functions at a time); and (b) J1/J2 is kept common in every combination to cover a diverse set of tweets.
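To make the encoding concrete, the sketch below shows, under illustrative names rather than the thesis code, how a binary solution such as the one in Figure 8.3 is decoded into a candidate summary and mapped to an objective vector.

```python
import numpy as np

# A solution is a binary vector over the N_E tweets of the event (Figure 8.3); the
# objective vector is computed on the subset of tweets selected by the 1-bits.
def decode(solution, tweets):
    return [tweets[i] for i, bit in enumerate(solution) if bit == 1]

def evaluate(solution, tweets, objectives):
    summary = decode(solution, tweets)
    return np.array([obj(summary) for obj in objectives])

solution = np.array([0, 1, 0, 1, 0, 1, 1, 0])          # the example of Figure 8.3
tweets = [f"tweet {i}" for i in range(1, 9)]
print(decode(solution, tweets))                        # tweets 2, 4, 6 and 7
```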

8.5 Proposed Methodology

The steps of our approach (MMTweetSumm) are shown in Algorithm 4. A detailed description of these steps is given below.

8.5.1 Population and Parameter Initialization

As our algorithm is based on the MOBDE framework, it starts with a set of random binary solutions, called the population (step 1). Note that the number of 1's in each solution should not exceed M. An example of the solution representation is shown in Figure 8.3. The control parameters, namely the mating restriction probabilities B_g = {β_g^1, β_g^2, ..., β_g^{|P|}} and the survival signs S_g = {s_g^1, s_g^2, ..., s_g^{|P|}}, are also initialized. Note that B keeps updating at each generation: a high value of β_g^i for solution X^i implies exploitation, and a low value implies exploration. B is initialized with zeros, as the initial generations demand more exploration. The variable s_g^i for solution X^i records the generation at which the solution first appears and is used to compute its survival length.
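A minimal sketch of this initialization step is given below (illustrative and assuming numpy): it creates random binary solutions with at most M ones each and zero-initialized B and S.

```python
import numpy as np

# Sketch of steps 1-2 of Algorithm 4: random binary population with at most M ones
# per solution, plus zero-initialized mating restriction probabilities and survival signs.
def init_population(pop_size, num_tweets, M, seed=0):
    rng = np.random.default_rng(seed)
    population = np.zeros((pop_size, num_tweets), dtype=int)
    for sol in population:
        ones = rng.choice(num_tweets, size=rng.integers(1, M + 1), replace=False)
        sol[ones] = 1
    beta = np.zeros(pop_size)                 # B_1 = [0, ..., 0]: start with pure exploration
    survival = np.zeros(pop_size, dtype=int)  # S_1 = [0, ..., 0]
    return population, beta, survival
```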

8.5.2 Objective Functions Calculation

The objective functions that need to be simultaneously optimized are evaluated for each solution (step 3). These objective (tweet-scoring) functions are discussed in Section 8.2. Afterwards, the iterative procedure begins (steps 5 to 16), starting from the first generation and continuing until the maximum number of generations (MaxGen) is reached.


Algorithm 4 Procedure of MMTweetSumm(K, MaxGen)
1: P ← Initialize population <X^1, X^2, X^3, ..., X^{|P|}>
2: Initialize control parameters: mating restriction probability B = [0, 0, ..., 0] and survival sign S = [0, 0, ..., 0]
3: For each solution X^i, evaluate the objective functional values
4: g = 1                                        ▷ current generation number
5: Repeat steps 6 to 16 while g < MaxGen
6: Obtain K clusters {CP1, CP2, ..., CPK} using P
7: for l = 1 to K do
8:    if |CPl| = 1 then
9:       Assign the two nearest solutions X^m and X^n to CPl such that CPl = {CPl ∪ X^m ∪ X^n}
10: P′ = [ ]                                    ▷ population to store new solutions
11: For each solution X^i ∈ P, generate a new solution:
    (a) Find CPk, k ∈ {1, 2, ..., K}, such that X^i ∈ CPk
    (b) Generate a random probability rand()
    (c) If rand() < β_g^i, choose three random solutions r1, r2 and r3 from CPk; otherwise, choose them from P to form the mating pool
    (d) Prob(X) ← perform the probability estimation operator using the selected random solutions and X^i
    (e) Y′ ← convert Prob(X) into a binary solution
    (f) Y′′ ← perform crossover between Y′ and X^i
    (g) Evaluate the objective functions for Y′′
    (h) Add Y′′ to P′
12: Merge the old population (P) and the new population (P′)
13: P ← select the best |P| solutions based on their objective functional values using non-dominated sorting and the crowding-distance operator, and use them for the next generation
14: Update the survival sign S using the survival length
15: Update B for the (g+1)th generation using B ← Update(g, S)
16: g ← g + 1
17: return the best summary


8.5.3 Grouping of Similar Solutions

In step 6, the K-means clustering algorithm (refer to Chapter 2, Section 2.1.1) is applied to the population, where K is the number of clusters. The main motive of K-means here is to identify groups/clusters of similar solutions; these groupings are utilized while performing exploitation and exploration. For exploitation, solutions belonging to the same cluster are preferred, while for exploration, solutions belonging to different clusters are used. Note that our algorithm utilizes the rand/1/bin variant of differential evolution, so if exploitation is to be performed, at least three solutions must exist in the cluster. Therefore, if a cluster has fewer than 3 solutions, nearby solutions are assigned to it so that each cluster has a minimum of three solutions. Steps 8 to 9 of Algorithm 4 describe this repair, and a small sketch of the grouping step is given below.
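The sketch below illustrates this grouping and repair step under stated assumptions (scikit-learn's K-means on the binary vectors and Euclidean nearness for the repair); it is not the exact thesis code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of steps 6-9 of Algorithm 4: cluster the binary solutions with K-means and
# ensure every cluster has at least three members (required by rand/1/bin) by
# borrowing the solutions nearest to the cluster centroid from outside the cluster.
def group_solutions(population, K, min_size=3):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(population)
    clusters = {c: list(np.where(labels == c)[0]) for c in range(K)}
    for c, members in clusters.items():
        if len(members) < min_size:
            centroid = population[members].mean(axis=0)
            outside = [i for i in range(len(population)) if i not in members]
            outside.sort(key=lambda i: np.linalg.norm(population[i] - centroid))
            clusters[c] = members + outside[: min_size - len(members)]
    return clusters
```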

8.5.4 New Solution Generation

In step 11, a new solution (also called a trial solution in MOBDE) is generated for each solution X^i ∈ P. For this, we first identify the cluster CPk, k ∈ {1, 2, ..., K}, to which X^i belongs. Then, various genetic operators, namely mating pool construction (step 11(c)), mutation (steps 11(d) and 11(e)) and crossover (step 11(f)), are applied to form the new solution. Note that during mating pool construction for solution X^i, if the random probability is less than β_g^i, then solutions are selected from the cluster CPk to which X^i belongs, so that exploitation is performed; otherwise, solutions are selected from the whole population. For mutation, the probability estimation operator is first applied between the random solutions chosen in step 11(c) and the current solution X^i, as described in Eq. 2.10 of Chapter 2. Then, mutation and crossover are performed as per Eqs. 2.11 and 2.12. The set of new solutions is called the new population or child population.
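The sketch below covers only the adaptive mating-pool choice of steps 11(a)-(c); the probability estimation, mutation and crossover of Eqs. 2.10-2.12 are abstracted away, and all names and the fallback rule are illustrative assumptions.

```python
import random

# Sketch of steps 11(a)-(c): with probability beta[i], pick the three DE parents from
# the cluster of solution i (exploitation); otherwise pick them from the whole
# population (exploration). `clusters` maps a cluster id to a list of solution indices.
def cluster_of(i, clusters):
    for cid, members in clusters.items():
        if i in members:
            return cid
    raise ValueError("solution %d is not assigned to any cluster" % i)

def select_mating_pool(i, pop_size, clusters, beta):
    if random.random() < beta[i]:
        pool = list(clusters[cluster_of(i, clusters)])
    else:
        pool = list(range(pop_size))
    candidates = [j for j in pool if j != i]
    if len(candidates) < 3:                    # fall back to the whole population if the cluster is too small
        candidates = [j for j in range(pop_size) if j != i]
    return random.sample(candidates, 3)        # r1, r2, r3 for the rand/1/bin strategy
```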

8.5.5 Selection of Top Best Solutions

In this step, |P| solutions are selected after merging the old population and the child population. For more details about this step, the reader can refer to Section 3.2.6 of Chapter 3.

8.5.6 Update Survival Length and Mating Restriction Probability

After finding the best solutions in step 13 of Algorithm 4, the survival sign S is updated; it records the generation at which a solution begins to survive, i.e., if a solution X^i begins to exist in generation t, then s_g^i = t. A solution that has survived for more generations (i.e., whose s^i is small relative to the current generation) is considered a better solution. This survival sign S is used to update the mating restriction probability. The procedure for


Algorithm 5 Procedure of Update(g, S, H)
1: for each solution X^i ∈ P do
2:    Evaluate the survival length (SL): ρ_g^i = min{g − s^i + 1, H}
3: for each solution X^i ∈ P do
4:    Evaluate the mating restriction probability as
         β_{g+1}^i = (ρ_g^i − ρ_g^{min}) / (ρ_g^{max} − ρ_g^{min})
5:    if β_{g+1}^i > 0.95 then β_{g+1}^i = 0.95
6:    else
7:       if β_{g+1}^i < 0.05 then β_{g+1}^i = 0.05
8: return the updated B

updating B is described in Algorithm 5. To calculate β_g^i for solution X^i at generation g, the survival length ρ_g^i is first calculated (step 2) as

ρ_g^i = \min\{g − s^i + 1, H\} \qquad (8.5)

where H denotes the number of recent generations prior to g that are taken into account (defined by the user), and ρ_g^i represents the number of generations solution X^i has survived within this window. The reason for restricting the survival count is to give newly joined good solutions the opportunity to exploit the search space as well. Note that the maximum value of ρ_g^i is H; a high value indicates that X^i is a good-quality solution and exploitation is required, while a low value indicates that the solution is newly generated. Then, for every solution X^i, ρ_g^i is normalized (step 4) to give β_{g+1}^i. If β_{g+1}^i is greater than 0.95, it is clipped to 0.95, and if it is smaller than 0.05, it is clipped to 0.05. The concept of an adaptive mating restriction probability is motivated by [203], and more details about restricting β_{g+1}^i can be found in the same paper.
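A direct transcription of Algorithm 5 into Python is sketched below, with one extra guard for the degenerate case in which all survival lengths are equal (a detail the pseudocode does not spell out).

```python
# Sketch of Algorithm 5: compute each solution's survival length over the last H
# generations (Eq. 8.5) and normalise it into the mating restriction probability
# for generation g+1, clipped to the range [0.05, 0.95].
def update_mating_probabilities(g, survival_sign, H=3, low=0.05, high=0.95):
    rho = [min(g - s + 1, H) for s in survival_sign]
    rho_min, rho_max = min(rho), max(rho)
    if rho_max == rho_min:                    # guard added here; not part of the original pseudocode
        return [low] * len(rho)
    beta = [(r - rho_min) / (rho_max - rho_min) for r in rho]
    return [min(max(b, low), high) for b in beta]
```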

8.5.7 Selection of Single best Solution

In the final generation, we obtain a set of Pareto-optimal solutions, all having equal importance. Therefore, a single solution is selected from this pool, and it provides the best summary.

8.6 Experimental Setup

This section describes evaluation measures, parameters used and comparative approaches.


8.6.1 Evaluation Measure

To evaluate the performance of our generated/predicted summary with respect to the gold summary, the ROUGE-N score is used, where N takes the values 1, 2, and L to give ROUGE-1, ROUGE-2, and ROUGE-L, respectively. A more detailed description of ROUGE-N is provided in Section 2.3.2 of Chapter 2.
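For illustration, the snippet below computes these scores with the open-source rouge package; the use of this particular package is an assumption, as the thesis may rely on a different ROUGE implementation.

```python
from rouge import Rouge  # assumption: the PyPI `rouge` package is available

# Compare a predicted summary against the gold summary and report the F-scores of
# ROUGE-1, ROUGE-2 and ROUGE-L. Both summaries are lists of tweet strings.
def rouge_scores(predicted_summary, gold_summary):
    scores = Rouge().get_scores(" ".join(predicted_summary), " ".join(gold_summary))[0]
    return {name: scores[name]["f"] for name in ("rouge-1", "rouge-2", "rouge-l")}
```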

8.6.2 Parameters Used

The following parameter values are used in our approach: |P| = 25, MaxGen = 25, crossover probability (CR) = 0.8, b = 6, and F = 0.8. A sensitivity analysis of the DE parameters is provided for the preliminary version developed for the microblog summarization task. The number of fitness function evaluations (NFE) is kept at 2000. To calculate the word mover's distance between two tweets, we utilize a word2vec model8 pre-trained on 53 million tweets related to disaster events [196]. While updating the mating restriction probability, the value of H is set to 3. The number of clusters (K) is set to 5, assuming that each cluster should have at least 5 solutions.

8.6.3 Comparative Approaches

For comparison, we have used various existing algorithms, namely LexRank, Luhn, LSA, TextRank, and FrequencySum. Moreover, we have also executed our preliminary version, MOOTweetSumm, which is based on the multi-objective optimization concept, on our task. An ablation study is also presented to analyze the effect of simultaneously optimizing different sets of objectives.

8.7 Discussion of Results

In this section, we discuss the results obtained by the different techniques.

8.7.1 Box-plots showing qualities of summaries corresponding to different solutions

In our proposed model MMTweetSumm, different sets of objective functions are simultaneously optimized in multiple runs. The objective function MaxAntiRedundancy is kept common to avoid redundancy in the summary; note that this function differs only in the information it uses (refer to Section 8.2). As our algorithm is based on the evolutionary concept, it provides a set of non-dominated solutions in the final generation, each representing a summary.

8http://crisisnlp.qcri.org/lrec2016/lrec2016.html



Figure 8.4: Box plots in sub-figures (a), (b) and (c) for Harvey, Srilanka and Irma disaster events, respectively. These figures illustrate the range of Rouge-L values using different sets of objective functions.

The variation of the ROUGE-L scores calculated for the solutions obtained in the different runs on the three datasets, namely the Harvey, Srilanka and Irma events, is shown in the box plots of Figure 8.4. These box plots show the effect of computing MaxAntiRedundancy with text+image features (denoted A1) versus only textual features (denoted A) in combination with the other objective functions. From these box plots, the following observations can be made: for the (a) Harvey, (b) Srilanka


Table 8.3: Comparison of ROUGE scores obtained using the proposed approach MMTweetSumm and MOOTweetSumm. Here, Obj and R refer to the objective functions used and ROUGE; A1 and A represent MaxAntiRedundancy calculated using text+image and text only, respectively; T, L and BM25 represent the MaxSumTFIDF, MaxLength and MaxSumBM25 objective functions, respectively.

Method | Dataset | Obj. | R-1 | R-2 | R-L
MOOTweetSumm | Harvey | A+T | 0.38 | 0.27 | 0.39
MOOTweetSumm | Harvey | A1+T | 0.38 | 0.27 | 0.39
MOOTweetSumm | Srilanka | A+T | 0.53 | 0.46 | 0.53
MOOTweetSumm | Srilanka | A1+T | 0.46 | 0.38 | 0.46
MOOTweetSumm | Irma | A+T | 0.52 | 0.39 | 0.52
MOOTweetSumm | Irma | A1+T | 0.47 | 0.33 | 0.46
MMTweetSumm | Harvey | A+L+T+BM25 | 0.53 | 0.44 | 0.53
MMTweetSumm | Harvey | A1+T+L | 0.52 | 0.40 | 0.52
MMTweetSumm | Srilanka | A+T | 0.68 | 0.59 | 0.68
MMTweetSumm | Srilanka | A1+L | 0.66 | 0.60 | 0.66
MMTweetSumm | Irma | A+T+BM25 | 0.59 | 0.49 | 0.59
MMTweetSumm | Irma | A1+T | 0.56 | 0.47 | 0.56

Table 8.4: Comparison of ROUGE scores attained by our method with the existing methods (reported as ROUGE-1/ROUGE-2/ROUGE-L).

Method | Harvey | Srilanka | Irma
LexRank | 0.20/0.01/0.20 | 0.39/0.17/0.37 | 0.27/0.06/0.27
LSA | 0.25/0.10/0.26 | 0.56/0.38/0.55 | 0.28/0.12/0.29
Luhn | 0.21/0.09/0.22 | 0.58/0.42/0.58 | 0.23/0.07/0.23
TextRank | 0.20/0.04/0.20 | 0.57/0.35/0.57 | 0.16/0.03/0.16
FrequencySum | 0.31/0.03/0.31 | 0.50/0.37/0.50 | 0.33/0.02/0.35
MOOTweetSumm | 0.38/0.27/0.39 | 0.53/0.46/0.53 | 0.52/0.39/0.52
MMTweetSumm | 0.53/0.44/0.53 | 0.68/0.59/0.68 | 0.59/0.49/0.59

and (c) Irma datasets, the objective-function combinations (a) A1+T+L, A1+RT, A1+L+RT, A1+T+RT, A1+BM25, A1+T+BM25; (b) A1+RT, A1+L+RT, A1+L+T+RT, A1+BM25, A1+RT+BM25; and (c) A1+L+RT, A1+T+RT, A1+T+RT+BM25, respectively, perform better with text+image features than with only textual features. For the remaining objective-function combinations, the results using only the textual tweet are better than those using text+image features. An error analysis is provided in Section 8.7.4.

8.7.2 Comparison between MMTweetSumm and MOOTweetSumm

In Table 8.3, the results of MOOTweetSumm and MMTweetSumm attained using their best sets of objective functions are compared. Note that MOOTweetSumm is executed using only the best set of objective functions reported in the previous chapter, i.e., MaxAntiRedundancy (J1) and MaxSumTFIDF (J3). From Table 8.3, it can be observed that MMTweetSumm beats the results obtained by MOOTweetSumm. For the Harvey, Srilanka, and Irma datasets, the best results obtained by MOOTweetSumm in terms of ROUGE-1, ROUGE-2 and ROUGE-L are (a) 0.38, 0.27 and 0.39; (b) 0.53, 0.46 and 0.53; and (c) 0.52, 0.39 and 0.52, respectively. Using MMTweetSumm, the ROUGE-1, ROUGE-2 and ROUGE-L scores for these three datasets are (a) 0.53, 0.44 and 0.53; (b) 0.68, 0.59 and 0.68; and (c) 0.59, 0.49 and 0.59, respectively. These results clearly indicate that incorporating the adaptive mating restriction probability in MMTweetSumm helps in obtaining the optimal summary.


Figure 8.5: Maximum ROUGE scores per generation attained by MMTweetSumm on the Harvey dataset: (a) Text, (b) Text+Image.

Figure 8.6: Maximum ROUGE scores per generation attained by MMTweetSumm on the Irma dataset: (a) Text, (b) Text+Image.

Figure 8.7: Number of new good solutions per generation by MMTweetSumm using the Harvey dataset: (a) Text, (b) Text+Image.


Figure 8.8: Number of new good solutions per generation by MMTweetSumm using the Irma dataset: (a) Text, (b) Text+Image.

Figure 8.9: Informative tweets with informative images, as provided by the annotators of the CrisisMMD dataset [2].

Figure 8.10: Four informative images posted with the same tweet ('RT @rootmess: Hurricane Irma: a useful checklist. PLEASE SPREAD & KEEP SAFE https://t.co/BS1loU3HmU').


Figure 8.11: Informative tweet ('IRMA. Great incident/disaster management. Thank you Digital Reality Skybox cyrusone data foundry https://t.co/uk8cPq4x07') whose image is in the form of a newspaper cutting.

8.7.3 Comparison with Existing Methods

In Table 8.4, the best results of our proposed model and of MOOTweetSumm are compared with the other existing methods. It is evident from Table 8.4 that our method outperforms the other methods. For the Harvey and Srilanka datasets, LexRank performs the worst, while for the Irma dataset, TextRank attains the lowest ROUGE scores. Although our proposed system is better than the existing methods for all datasets, overall the best result is obtained when only tweet-texts are considered in the MaxAntiRedundancy calculation, as can be seen from Table 8.3. This behaviour contradicts the expectation that 'an image plays a supplementary role when combined with the tweet-text'. To investigate this, we have plotted

• the maximum ROUGE score value attained per generation

• the number of new good solutions proceeding to the next generation (obtained in step 13 of Algorithm 4)

in the context of our approach with its best set of objective combinations (shown in Table 8.3). In Figures 8.5 and 8.6, the maximum ROUGE score per generation is shown for the Harvey and Irma datasets, respectively. The left side of these figures corresponds to the objective function MaxAntiRedundancy calculated using only the tweet-text, while on the right-hand side the calculation of MaxAntiRedundancy makes use of text+image features. It is clearly visible that for both datasets the curves fluctuate due to the simultaneous optimization of multiple, possibly conflicting, objectives.

At the last generation, however, MaxAntiRedundancy utilizing only the tweet-text gives the maximum ROUGE scores. It is also true that, when using text+image, those maximum ROUGE scores are already attained at specific generations (for example, at the 16th generation for Harvey and between the 3rd and 5th generations for Irma), after which the scores start fluctuating and finally decrease. This behaviour suggests that the MaxAntiRedundancy function utilizing text+image features requires specific values of the objective functions, which is itself another task, different from optimal parameter selection.

After observing the number of new good-quality solutions generated per generation for the Harvey and Irma datasets in Figures 8.7 and 8.8, respectively, it can be inferred that the MaxAntiRedundancy objective function utilizing only the tweet-text is able to generate more good solutions than when the text and image features are used together.

After investigating the maximum ROUGE score per generation and the number of new good solutions, it has been observed that our proposed approach, MMTweetSumm, performs best when only the tweet-text is considered for the calculation of anti-redundancy; the performance of the system degrades when the tweet-text is enhanced with the image description. To investigate the reason for this, we examined the image dense captioning model used in our framework to generate the image captions. Detailed analyses of the datasets and the image captioning model are provided in Section 8.7.4.

Figure 8.12: An example of caption generation by dense-caption model.


Figure 8.13: Another example of caption generation by dense-caption model.

8.7.4 Error-analysis

As our results using only the textual tweet are better than those using text+image, we performed a thorough error analysis. The observations are listed below:

• The same tweet posted multiple times: Although this situation indicates that the tweet is genuine, popular, and attracts many people, some users post multiple segments of the same image across multiple tweets, i.e., they post the same tweet-text again with a different image. For example, Figure 8.10 shows the same tweet, posted four times, associated with different images; the images describe a useful checklist during a hurricane, split across different pictures. From the perspective of the gold summary, we cannot keep all of these tweets in the summary, as that would increase redundancy. Handling such tweets is therefore another challenging issue, as we cannot restrict users from posting multimedia information across multiple tweets.

• While creating the dataset, we considered only those tweets for which both the tweet-text and the image are annotated as informative. Note that the original dataset was a mixture of informative and non-informative tweets, where informative tweets are those that provide assistance to people in need during a disaster, such as advice, cautions, rescue information, and warnings. Some informative tweets with their images are shown in Figure 8.9. However, some other types of images are also present, such as newspaper cuttings, notices, lists of useful items, etc., and some are very low-quality pictures. An example of such an image with its corresponding tweet-text is shown in Figure 8.11. In this case, the utilized dense captioning model is not able to generate meaningful captions; it just outputs captions such as 'in the photo'.

• The dense captioning model utilized in our framework was trained on the Visual Genome dataset9 of 94,313 images, which is not related to disaster events, and we did not find any dense captioning model applicable to disaster-related tasks. For some images the model identifies the correct captions, but for others, such as blurred or low-quality images, it does not perform well. Two examples from the Irma dataset are shown in Figures 8.12 and 8.13. In Figure 8.12, although the dense captioning model is not fully attuned to the disaster-related event, it generates the right captions. In Figure 8.13, however, due to the presence of a map in the image, the model is not able to generate correct captions. These examples make it clear that a more robust dense-captioning model is required.

8.7.5 Statistical t-test

To verify the reliability of the best results obtained by our proposed approach utilizing only text information, we conducted a statistical significance t-test [187] at the 5% significance level. This test provides a p-value, and a smaller p-value indicates that the superiority of the results over the other methods is statistically significant. To conduct this test, two groups are required: in one group, we keep the ROUGE-L values (as ROUGE-L is more robust than ROUGE-1 and ROUGE-2) obtained in 6 different runs of our approach, while in the other group, the ROUGE-L values of the six existing methods (see Table 8.4) are kept. The obtained p-values for the three datasets, namely Harvey, Irma, and Srilanka, are .000368, .000121, and .000001, respectively. These p-values signify that the best results obtained are statistically significant.
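The test can be reproduced with scipy as sketched below; the six ROUGE-L values for our runs are placeholders rather than reported numbers, while the baseline values are the Harvey column of Table 8.4.

```python
from scipy import stats

# Two-sample t-test at the 5% significance level. `ours` holds ROUGE-L values from six
# runs of our approach (placeholder numbers, not reported results); `baselines` holds
# the ROUGE-L values of the six existing methods on the Harvey dataset (Table 8.4).
ours = [0.53, 0.52, 0.53, 0.51, 0.52, 0.53]
baselines = [0.20, 0.26, 0.22, 0.20, 0.31, 0.39]
t_stat, p_value = stats.ttest_ind(ours, baselines)
print(p_value < 0.05)   # True -> the improvement is statistically significant
```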

8.8 Conclusive Remarks

In this work, we explored the effectiveness of the image dense-captioning model for the multi-modal microblog summarization task, in which a textual summary is generated using tweet-texts and images. Various objective functions that help in generating a good-quality summary are optimized simultaneously using population-based multi-objective binary differential evolution. Each solution in the population can search the space using exploration and exploitation behaviour. The concept of an adaptive mating restriction probability (MRP) is incorporated in our framework, which decides whether to exploit or explore the search space; this MRP keeps updating at each generation based on the survival length of the solutions over the last few generations. Extensive experiments verified the effectiveness of our approach (using only textual tweets), and it was shown to deliver better performance than all other existing approaches. In the future, we may extend our proposed model to improve image-caption generation. In the next chapter, we conclude the thesis and highlight several future works.

9http://visualgenome.org/static/paper/Visual Genome.pdf

CHAPTER 9

Conclusions and Future Works

This final chapter concludes the thesis, presents potential areas for extending the approaches described in the previous chapters, and suggests directions for future research.


9.1 Conclusions

In this thesis, novel unsupervised approaches are proposed for solving two important problems of text mining: document clustering and extractive summarization. Under the umbrella of extractive summarization, the areas of single- and multi-document summarization, figure summarization, microblog summarization, and multi-modal microblog summarization are explored. Techniques such as multi-objective optimization (MOO) and evolutionary algorithms (EA) are utilized to solve these tasks; they were chosen primarily for their success in many real-life applications. The presented algorithms show that the MOO+EA (i.e., MOEA) concept makes our algorithms more robust and capable of outperforming state-of-the-art techniques. Owing to the scarcity of labelled data, all algorithms are developed in an unsupervised way. In any MOEA-based algorithm, the performance usually depends on the quality of the new solutions generated at each generation, and for some problems the optimal solution may lie in a local region. To capture this, a self-organizing map (SOM) based operator is explored as a mating pool construction tool in the MOEA framework; here, SOM is a type of two-layer neural network. The use of SOM helps in identifying neighbouring solutions, which are then utilized to generate a new solution using genetic operators. For all the works, a multi-objective differential evolution (MODE) algorithm (a type of MOEA) is utilized as the underlying optimization strategy due to its faster convergence rate and better performance compared to other MOEAs, as reported in the literature. This dissertation thus makes six key contributions; the methods proposed in this thesis are listed below:

• First (Chapter 3), a novel multi-objective clustering framework, SMODoc_clust, was developed using the fusion of SOM in MODE. To improve the quality of the clusters, two internal cluster validity indices, namely the PBM index and the silhouette index, are optimized simultaneously. The proposed approach can find the optimal partitioning and the appropriate number of clusters automatically. For evaluation, a set of scientific articles and web documents are considered, and various syntactic (tf, tf-idf) and semantic (word2vec/GloVe) schemes are utilized to represent the documents. The efficacy of SMODoc_clust is compared with various clustering algorithms, such as K-means, a single-objective genetic algorithm, and three MOO-based clustering algorithms. The results show that the proposed multi-objective clustering is more effective than the existing techniques. It is important to note that in the word2vec/GloVe representation, we averaged the word vectors to obtain the document vectors, which may lose some semantic meaning. Therefore, in the future,


our method can be extended by using more robust representations, such as BERT and XLNet, among others, to represent the documents.

• Second (as reported in Chapter 4), a multi-objective clustering-based approach, ESDS_SMODE, was developed for extractive single-document summarization (ESDS). This work was done to demonstrate the potential of the approach developed in the previous contribution. After obtaining the optimal partitioning, a weighted sum of various sentence-scoring features, such as sentence length and sentence position, is utilized to rank the sentences in each cluster, and the top-scoring sentences are selected as part of the summary. It should be noted that in this approach, multi-objective clustering is applied to sentences instead of documents. The performance of ESDS_SMODE is compared with various supervised and unsupervised methods, including single- and multi-objective evolutionary frameworks, and the experimental results illustrate the effectiveness of the proposed technique for summary generation. In the future, this approach can be extended to perform automatic adaptation of the various parameters used in the framework and applied to query-based single-document summarization.

• Third (Chapter 5), ESDS is posed as a binary optimization problem in which different sentences of the document are selected based on various statistical measures such as readability, cohesion, and coverage. These measures are simultaneously optimized using a binary version of MODE called MOBDE, and the SOM-based operator is also explored in combination with MOBDE. At the end of the algorithm, a set of non-dominated solutions is obtained (each solution representing a summary), and the user can select the best solution by comparing the candidates with a gold/reference summary. However, in practice, the gold summary may not be available; in this case, various unsupervised approaches are explored to select the best solution. It was also shown that the performance of the developed summarization system depends not only on the chosen statistical measures but also on the type of similarity/dissimilarity function used. The obtained results show that the proposed approach is better than state-of-the-art techniques.

• Fourth (Chapter 6), we present an unsupervised approach, FigSum++, for the summarization of figures in biomedical articles. The approach generates a figure summary using the text associated with the figure in the article. Various syntactic and semantic measures quantifying the relevance of sentences to the figure to be summarized are optimized simultaneously using MOBDE, and a new way of measuring redundancy in the summary, in terms of textual entailment, is also proposed. Generally, MOBDE uses a single DE variant (rand/1/bin) for new solution generation; in this work, however, an ensemble of two different DE variants is used to maintain diversity among the solutions and convergence towards the optimal solution (or optimal summary). The results clearly show that our system performs better than the existing systems. In the future, we would like to parallelize our summarization system by simultaneously generating summaries of all the figures of a given article.

• Fifth (Chapter 7), we proposed a MOO-based method for microblog summarization, MOOTweetSumm, which extracts a set of situational tweets (providing human assistance in case of a disaster event) as a summary. The results obtained were compared with various state-of-the-art techniques and showed significant improvements. An extension of the proposed approach was also shown to solve the multi-document summarization task.

• In the last work of the dissertation (Chapter 8), an unsupervised approach, MMTweetSumm, is developed to handle multiple modalities of the tweets, i.e., tweet-texts and images. Note that MMTweetSumm is the extension of MOOTweetSumm developed in the previous chapter. To extract natural language features from the images, an image dense-captioning model is utilized. At each generation of the MOBDE framework, each solution can perform either exploitation (local search) or exploration (global search) based on a mating restriction probability. This probability is self-adaptive and based on the survival length (SL) of the solution: a high SL value indicates that the solution is of good quality, and thus exploitation (local search) takes place; otherwise, exploration (global search) is performed.

9.2 Suggestions for Further Work

Although in this thesis we have made every possible attempt to address the research problems in this area, there is still room for improvement, and considerable potential work remains to be carried out in the future. Some potential directions for future research are highlighted below:

• Image-captioning for disaster-related images: In the multi-modal microblog summarization task, we observed that users posted different types of images, such as newspaper cuttings, word clouds, and handwritten notes. The existing image dense captioning model is unable to generate captions for these types of images; for them, it just provides captions like 'on the wall' and 'in the photo'. Moreover, for many of the remaining images, the model does not perform well. Thus, there is a need to develop an image-captioning model specialized for disaster-related images. To meet that objective, a


large corpus must be created consisting of disaster-related images along with their corresponding captions. This corpus can then be utilized to train image dense captioning models.

• Online Microblog Summarization: This task aims to summarize continuously arriving tweet streams during a disaster event. In other words, as many tweets arrive every minute or hour, it becomes imperative to update the initially generated summary based on the relevance of the new tweets and the changing information. A few works have attempted this task; however, in terms of techniques and improvement, the problem demands more exploration. It is also planned to use sophisticated word embedding techniques, such as BERT and tweet-specific embeddings, in association with emotion-aware embeddings to better capture the semantic dissimilarities between tweets.

• Automatic literature survey writing: Due to the increasing rate of new articles in various scientific fields, it is a tedious task for researchers to keep up with new advancements. Therefore, there is a dire need for automatically crawling the scientific documents for a given topic and then summarizing each document using citation-based summarization (in which the citations to a reference article are used to generate the summary of the reference paper). After generating the summaries, we can arrange the year-wise document summaries to obtain a literature survey. This would help generate the literature survey for a given topic automatically and, in turn, keep researchers up to date.

• Legal Text/Case Summarization: In the judicial domain, judges, attorneys, and caseworkers are always surrounded by a large volume of legal text (from district, state, high, supreme, and federal courts), and it is not easy to manage such cases. Moreover, legal text is more challenging than the text of scientific documents in terms of structure, hierarchy, ambiguity, size, and vocabulary. This calls for an automatic or simplified framework that could help legal workers manage the workload. In this direction, only a few limited works have been carried out, and they lack significant improvements. These challenges therefore motivate the study of legal text summarization with multi-objective optimization, which can help in this area.

• Multi-lingual Multi-document Summarization: The works addressed in this thesis covered only English text for summarization. The task becomes much trickier when the documents are written in multiple languages and a single summary must be generated. In a multicultural region like India, document content may be produced in various languages such as


Hindi, Malayalam, and Telugu. Therefore, there is a scope for future work in this direction.

References

[1] J. Handl and J. Knowles, “An evolutionary approach to multiobjective clustering,” IEEE transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007. [2] F. Alam, F. Ofli, and M. Imran, “Crisismmd: Multimodal twitter datasets from natural disasters,” in Twelfth International AAAI Conference on Web and Social Media, 2018. [3] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, “Link-based multi-verse optimizer for text documents clustering,” Applied Soft Computing, vol. 87, p. 106002, 2020. [4] C. C. Aggarwal and C. Zhai, Mining text data. Springer Science & Business Media, 2012. [5] T. Ghosal, D. Dey, A. Dutta, A. Ekbal, S. Saha, and P. Bhattacharyya, “A multiview clustering approach to identify out-of-scope submissions in peer review,” in 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 2019, pp. 392–393. [6] J. Cheng and M. Lapata, “Neural summarization by extracting sentences and words,” arXiv preprint arXiv:1603.07252, 2016. [7] R. Nallapati, F. Zhai, and B. Zhou, “Summarunner: A recurrent neural network based sequence model for extractive summarization of documents.” in AAAI, 2017, pp. 3075–3081. [8] E. Hovy, C.-Y. Lin et al., “Automated text summarization in summarist,” Advances in automatic text summarization, vol. 14, 1999. [9] S. Narayan, N. Papasarantopoulos, S. B. Cohen, and M. Lapata, “Neural extractive summarization with side information,” arXiv preprint arXiv:1704.04530, 2017. [10] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summariza- tion,” in Proceedings of international Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2015, pp. 379–389. [11] F. Liu, J. Flanigan, S. Thomson, N. Sadeh, and N. A. Smith, “Toward abstractive summarization using semantic representations,” in HLT-NAACL, 01 2015, pp. 1077–1086. [12] A. Jatowt, “Web page summarization using dynamic content,” in Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, 2004, pp. 344–345. [13] S. Rastkar, G. C. Murphy, and G. Murray, “Automatic summarization of bug reports,” IEEE Transactions on Software Engineering, vol. 40, no. 4, pp. 366–380, 2014. [14] D. Patel, S. Shah, and H. Chhinkaniwala, “Fuzzy logic based multi document summarization with improved sentence scoring and redundancy removal technique,” Expert Systems with Applications, vol. 134, pp. 167– 177, 2019. [15] A. Kanapala, S. Pal, and R. Pamula, “Text summarization from legal documents: a survey,” Artificial Intelligence Review, vol. 51, no. 3, pp. 371–402, 2019. [16] C. Barros, E. Lloret, E. Saquete, and B. Navarro-Colorado, “Natsum: Narrative abstractive summarization through cross-document timeline generation,” Information Processing & Management, vol. 56, no. 5, pp. 1775–1793, 2019. [17] A. Cohan and N. Goharian, “Scientific document summarization via citation contextualization and scientific discourse,” International Journal on Digital Libraries, vol. 19, no. 2-3, pp. 287–303, 2018. [18] M. Upadhyay, D. Radhakrishnan, and M. Natarajan, “Summarization and processing of email on a client computing device based on content contribution to an email thread using weighting techniques,” Oct. 16 2018, uS Patent 10,102,192.


[19] S. A. Bahrainian, “Just-in-time information retrieval and summarization for personal assistance,” Ph.D. dissertation, Università della Svizzera italiana, 2019.
[20] B. P. Ramesh, R. J. Sethi, and H. Yu, “Figure-associated text summarization and evaluation,” PLoS ONE, vol. 10, no. 2, 2015.
[21] S. Dutta, V. Chandra, K. Mehra, A. K. Das, T. Chakraborty, and S. Ghosh, “Ensemble algorithms for microblog summarization,” IEEE Intelligent Systems, vol. 33, no. 3, pp. 4–14, 2018.
[22] M. El-Haj, “MultiLing 2019: Financial narrative summarisation,” in Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources, 2019, pp. 6–10.
[23] I. Varga, M. Sano, K. Torisawa, C. Hashimoto, K. Ohtake, T. Kawai, J.-H. Oh, and S. De Saeger, “Aid is out there: Looking for help from tweets during a large scale disaster,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2013, pp. 1619–1629.
[24] G. Neubig, Y. Matsubayashi, M. Hagiwara, and K. Murakami, “Safety information mining—what can NLP do in a disaster—,” in Proceedings of the 5th International Joint Conference on Natural Language Processing, 2011, pp. 965–973.
[25] R. P. Futrelle, “Handling figures in document summarization,” in Proceedings of the Workshop at the Annual Meeting of the Association for Computational Linguistics, 2004, pp. 61–65.
[26] H. Yu, S. Agarwal, M. Johnston, and A. Cohen, “Are figure legends sufficient? Evaluating the contribution of associated text to biomedical figure comprehension,” Journal of Biomedical Discovery and Collaboration, vol. 4, no. 1, p. 1, 2009.
[27] K. Deb, “Multi-objective optimization,” in Search Methodologies. Springer, 2014, pp. 403–449.
[28] S. Saha and S. Bandyopadhyay, “A symmetry based multiobjective clustering technique for automatic evolution of clusters,” Pattern Recognition, vol. 43, no. 3, pp. 738–751, 2010.
[29] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[30] T. Kohonen, “The self-organizing map,” Neurocomputing, vol. 21, no. 1, pp. 1–6, 1998.
[31] S. Jungjit and A. Freitas, “A lexicographic multi-objective genetic algorithm for multi-label correlation based feature selection,” in Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation. ACM, 2015, pp. 989–996.
[32] G. Erkan and D. R. Radev, “LexRank: Graph-based lexical centrality as salience in text summarization,” Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.
[33] R. Mihalcea, “Graph-based ranking algorithms for sentence extraction, applied to text summarization,” in Proceedings of the ACL Interactive Poster and Demonstration Sessions, 2004, pp. 170–173.
[34] S. Dutta, V. Chandra, K. Mehra, S. Ghatak, A. K. Das, and S. Ghosh, “Summarizing microblogs during emergency events: A comparison of extractive summarization algorithms,” in Emerging Technologies in Data Mining and Information Security. Springer, 2019, pp. 859–872.
[35] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolutional localization networks for dense captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4565–4574.
[36] K. Price, R. M. Storn, and J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization. Springer Science & Business Media, 2006.
[37] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, 2001, vol. 16.
[38] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb, “A simulated annealing-based multiobjective optimization algorithm: AMOSA,” IEEE Transactions on Evolutionary Computation, vol. 12, no. 3, pp. 269–283, 2008.
[39] D. Zhang and B. Wei, “Comparison between differential evolution and particle swarm optimization algorithms,” in 2014 IEEE International Conference on Mechatronics and Automation (ICMA). IEEE, 2014, pp. 239–244.
[40] J. Vesterstrom and R. Thomsen, “A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems,” in IEEE Congress on Evolutionary Computation, vol. 2, 2004, pp. 1980–1987.
[41] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009.

[42] B. Fortuna, M. Grobelnik, and D. Mladenic, “Visualization of text document corpus,” Informatica, vol. 29, no. 4, 2005.
[43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[44] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[45] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: Pre-trained biomedical language representation model for biomedical text mining,” arXiv preprint arXiv:1901.08746, 2019.
[46] L. Wang, X. Fu, M. I. Menhas, and M. Fei, “A modified binary differential evolution algorithm,” in Life System Modeling and Intelligent Computing. Springer, 2010, pp. 49–57.
[47] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[48] L. Kaufman and P. J. Rousseeuw, “Clustering by means of medoids,” 1987.
[49] M. Van der Laan, K. Pollard, and J. Bryan, “A new partitioning around medoids algorithm,” Journal of Statistical Computation and Simulation, vol. 73, no. 8, pp. 575–584, 2003.
[50] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
[51] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196.
[52] H. Wang, “Introduction to word2vec and its application to find predominant word senses,” URL: http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec and its application to wsd.pdf, 2014.
[53] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in International Conference on Machine Learning, 2015, pp. 957–966.
[54] S.-H. Liu, K.-Y. Chen, Y.-L. Hsieh, B. Chen, H.-M. Wang, H.-C. Yen, and W.-L. Hsu, “Exploring word mover’s distance and semantic-aware embedding techniques for extractive broadcast news summarization,” in INTERSPEECH, 2016, pp. 670–674.
[55] R. M. Aliguliyev, “A new sentence similarity measure and sentence based extractive technique for automatic text summarization,” Expert Systems with Applications, vol. 36, no. 4, pp. 7764–7772, 2009.
[56] B. Desgraupes, “Clustering indices,” University of Paris Ouest-Lab Modal’X, vol. 1, p. 34, 2013.
[57] K. Suresh, D. Kundu, S. Ghosh, S. Das, and A. Abraham, “Data clustering using multi-objective differential evolution algorithms,” Fundamenta Informaticae, vol. 97, no. 4, pp. 381–403, 2009.
[58] S. Saha and S. Bandyopadhyay, “A generalized automatic clustering algorithm in a multiobjective framework,” Applied Soft Computing, vol. 13, no. 1, pp. 89–108, 2013.
[59] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[60] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, April 1979.
[61] M. K. Pakhira, S. Bandyopadhyay, and U. Maulik, “Validity index for crisp and fuzzy clusters,” Pattern Recognition, vol. 37, no. 3, pp. 487–501, 2004.
[62] S. S. Haykin, Neural Networks and Learning Machines. Pearson, Upper Saddle River, NJ, USA, 2009, vol. 3.
[63] A. Romanov and C. Shivade, “Lessons from natural language inference in the clinical domain,” arXiv preprint arXiv:1808.06752, 2018.
[64] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “SemEval-2017 Task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation,” arXiv preprint arXiv:1708.00055, 2017.
[65] A. B. Abacha, C. Shivade, and D. Demner-Fushman, “Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering,” in Proceedings of the 18th BioNLP Workshop and Shared Task, 2019, pp. 370–379.
[66] D. Dasgupta and Z. Michalewicz, Evolutionary Algorithms in Engineering Applications. Springer Science & Business Media, 2013.
[67] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,” Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.
[68] K.-L. Du and M. Swamy, “Particle swarm optimization,” in Search and Optimization by Metaheuristics. Springer, 2016, pp. 153–173.


[69] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas, “Data mining with an ant colony optimization algorithm,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 4, pp. 321–332, 2002.
[70] J. Carvalho, A. Prado, and A. Plastino, “A statistical and evolutionary approach to sentiment analysis,” in 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 2. IEEE, 2014, pp. 110–117.
[71] H. Lu, J. Chen, K. Yan, Q. Jin, Y. Xue, and Z. Gao, “A hybrid feature selection algorithm for gene expression data classification,” Neurocomputing, vol. 256, pp. 56–62, 2017.
[72] K. Zhang, H. Du, and M. W. Feldman, “Maximizing influence in a social network: Improved results using a genetic algorithm,” Physica A: Statistical Mechanics and its Applications, vol. 478, pp. 20–30, 2017.
[73] C. A. C. Coello and G. B. Lamont, Applications of Multi-objective Evolutionary Algorithms. World Scientific, 2004, vol. 1.
[74] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evolutionary Computation, vol. 8, no. 2, pp. 173–195, 2000.
[75] M. R. Bonyadi and Z. Michalewicz, “Particle swarm optimization for single objective continuous space problems: a review,” 2017.
[76] J. Kennedy, “Particle swarm optimization,” in Encyclopedia of Machine Learning. Springer, 2011, pp. 760–766.
[77] M. Dorigo and G. Di Caro, “Ant colony optimization: a new meta-heuristic,” in Proceedings of the 1999 Congress on Evolutionary Computation (CEC99), vol. 2. IEEE, 1999, pp. 1470–1477.
[78] N. Srinivas and K. Deb, “Muiltiobjective optimization using nondominated sorting in genetic algorithms,” Evolutionary Computation, vol. 2, no. 3, pp. 221–248, 1994.
[79] H. Zhang, A. Zhou, S. Song, Q. Zhang, X.-Z. Gao, and J. Zhang, “A self-organizing multiobjective evolutionary algorithm,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 5, pp. 792–806, 2016.
[80] B.-C. Wang, H.-X. Li, J.-P. Li, and Y. Wang, “Composite differential evolution for constrained evolutionary optimization,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–14, 2018.
[81] R. M. Alguliev, R. M. Aliguliyev, and N. R. Isazade, “DESAMC+DocSum: Differential evolution with self-adaptive mutation and crossover parameters for multi-document summarization,” Knowledge-Based Systems, vol. 36, pp. 21–38, 2012.
[82] X. Cui, T. E. Potok, and P. Palathingal, “Document clustering using particle swarm optimization,” in Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005). IEEE, 2005, pp. 185–191.
[83] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim, “Exploiting noun phrases and semantic relationships for text document clustering,” Information Sciences, vol. 179, no. 13, pp. 2249–2262, 2009.
[84] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[85] Y. Cai and J.-S. Yuan, “Text clustering based on improved DBSCAN algorithm,” Computer Engineering, vol. 12, p. 018, 2011.
[86] W. B. A. Karaa, A. S. Ashour, D. B. Sassi, P. Roy, N. Kausar, and N. Dey, “MEDLINE text mining: an enhancement genetic algorithm based approach for document clustering,” in Applications of Intelligent Optimization in Biology and Medicine. Springer, 2016, pp. 267–287.
[87] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. D. Sidiropoulos, “Robust volume minimization-based matrix factorization for remote sensing and document clustering,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.
[88] L. M. Abualigah, A. T. Khader, M. A. Al-Betar, and O. A. Alomari, “Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering,” Expert Systems with Applications, vol. 84, pp. 24–36, 2017.
[89] S. Bandyopadhyay, U. Maulik, and A. Mukhopadhyay, “Multiobjective genetic clustering for pixel classification in remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 5, pp. 1506–1511, 2007.
[90] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, “An improved algorithm for clustering gene expression data,” Bioinformatics, vol. 23, no. 21, pp. 2859–2865, 2007.
[91] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013.

[92] N. Saini, S. Saha, C. Soni, and P. Bhattacharyya, “Automatic evolution of bi-clusters from microarray data using self-organized multi-objective evolutionary algorithm,” Applied Intelligence, pp. 1–18, 2019.
[93] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, “Multiobjective genetic algorithm-based fuzzy clustering of categorical attributes,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 991–1005, 2009.
[94] R. Dong, “Differential evolution versus particle swarm optimization for PID controller design,” in 2009 Fifth International Conference on Natural Computation (ICNC ’09), vol. 3. IEEE, 2009, pp. 236–240.
[95] J.-Y. Yeh, H.-R. Ke, W.-P. Yang, and I.-H. Meng, “Text summarization using a trainable summarizer and latent semantic analysis,” Information Processing & Management, vol. 41, no. 1, pp. 75–95, 2005.
[96] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, “Document summarization using conditional random fields,” in IJCAI, vol. 7, 2007, pp. 2862–2867.
[97] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
[98] X. Wan, J. Yang, and J. Xiao, “Manifold-ranking based topic-focused multi-document summarization,” in IJCAI, vol. 7, 2007, pp. 2903–2908.
[99] H. Oliveira, R. D. Lins, R. Lima, and F. Freitas, “A regression-based approach using integer linear programming for single-document summarization,” in 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2017, pp. 270–277.
[100] A. Schrijver, Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[101] D. M. Dunlavy, D. P. O’Leary, J. M. Conroy, and J. D. Schlesinger, “QCS: A system for querying, clustering and summarizing documents,” Information Processing & Management, vol. 43, no. 6, pp. 1588–1605, 2007.
[102] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, “Assessing sentence scoring techniques for extractive text summarization,” Expert Systems with Applications, vol. 40, no. 14, pp. 5755–5764, 2013.
[103] M. Peyrard, “Principled approaches to automatic text summarization,” Ph.D. dissertation, Technische Universität Darmstadt, 2019. [Online]. Available: https://tuprints.ulb.tu-darmstadt.de/9012/8/Peyrard Maxime PhD Thesis.pdf
[104] W. Song, L. C. Choi, S. C. Park, and X. F. Ding, “Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization,” Expert Systems with Applications, vol. 38, no. 8, pp. 9112–9121, 2011.
[105] M. Mendoza, S. Bonilla, C. Noguera, C. Cobos, and E. León, “Extractive single-document summarization based on genetic operators and guided local search,” Expert Systems with Applications, vol. 41, no. 9, pp. 4158–4169, 2014.
[106] J. D. Knowles and D. W. Corne, “M-PAES: A memetic algorithm for multiobjective optimization,” in Proceedings of the 2000 Congress on Evolutionary Computation (CEC00), vol. 1. IEEE, 2000, pp. 325–332.
[107] M. Mendoza, C. Cobos, and E. León, “Extractive single-document summarization based on global-best harmony search and a greedy local optimizer,” in Mexican International Conference on Artificial Intelligence. Springer, 2015, pp. 52–66.
[108] R. M. Alguliyev, R. M. Aliguliyev, N. R. Isazade, A. Abdi, and N. Idris, “COSUM: Text summarization based on clustering and optimization,” Expert Systems, p. e12340, 2018.
[109] K. Svore, L. Vanderwende, and C. Burges, “Enhancing single-document summarization by combining RankNet and third-party sources,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
[110] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 89–96.
[111] R. Passonneau, K. Kukich, J. Robin, V. Hatzivassiloglou, L. Lefkowitz, and H. Jing, “Generating summaries of workflow diagrams,” in Proceedings of the International Conference on Natural Language Processing and Industrial Applications, 1996, pp. 204–210.
[112] R. P. Futrelle, “Summarization of diagrams in documents,” Advances in Automated Text Summarization, pp. 403–421, 1999.


[113] S. Agarwal and H. Yu, “FigSum: automatically generating structured text summaries for figures in biomedical literature,” in AMIA Annual Symposium Proceedings, vol. 2009. American Medical Informatics Association, 2009, p. 6.
[114] P. Wu and S. Carberry, “Toward extractive summarization of multimodal documents,” in Proceedings of the Workshop on Text Summarization at the Canadian Conference on Artificial Intelligence, 2011, pp. 53–61.
[115] S. Bhatia and P. Mitra, “Summarizing figures, tables, and algorithms in scientific publications to augment search results,” ACM Transactions on Information Systems (TOIS), vol. 30, no. 1, p. 3, 2012.
[116] M. A. H. Khan, D. Bollegala, G. Liu, and K. Sezaki, “Multi-tweet summarization of real-time events,” in 2013 International Conference on Social Computing (SocialCom). IEEE, 2013, pp. 128–133.
[117] L. Shou, Z. Wang, K. Chen, and G. Chen, “Sumblr: continuous summarization of evolving tweet streams,” in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 533–542.
[118] A. Olariu, “Efficient online summarization of microblogging streams,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, 2014, pp. 236–240.
[119] A. Zubiaga, D. Spina, E. Amigó, and J. Gonzalo, “Towards real-time summarization of scheduled events from twitter streams,” in Proceedings of the 23rd ACM Conference on Hypertext and Social Media. ACM, 2012, pp. 319–320.
[120] N. Garg, B. Favre, K. Reidhammer, and D. Hakkani-Tür, “ClusterRank: a graph based method for meeting summarization,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
[121] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001, pp. 19–25.
[122] H. P. Luhn, “The automatic creation of literature abstracts,” IBM Journal of Research and Development, vol. 2, no. 2, pp. 159–165, 1958.
[123] D. R. Radev, E. Hovy, and K. McKeown, “Introduction to the special issue on summarization,” Computational Linguistics, vol. 28, no. 4, pp. 399–408, 2002.
[124] A. Nenkova and L. Vanderwende, “The impact of frequency on summarization,” Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005, vol. 101, 2005.
[125] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He, “Document summarization based on data reconstruction,” in AAAI, 2012.
[126] K. Rudra, S. Ghosh, N. Ganguly, P. Goyal, and S. Ghosh, “Extracting situational information from microblogs during disaster events: a classification-summarization approach,” in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 583–592.
[127] K. Rudra, N. Ganguly, P. Goyal, and S. Ghosh, “Extracting and summarizing situational information from the twitter social media during disasters,” ACM Transactions on the Web (TWEB), vol. 12, no. 3, p. 17, 2018.
[128] C. De Maio, G. Fenza, V. Loia, and M. Parente, “Time aware knowledge extraction for microblog summarization on twitter,” Information Fusion, vol. 28, pp. 60–74, 2016.
[129] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[130] J. Bian, Y. Yang, and T.-S. Chua, “Multimedia summarization for trending topics in microblogs,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 1807–1812.
[131] J. Bian, Y. Yang, H. Zhang, and T.-S. Chua, “Multimedia summarization for social events in microblog stream,” IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 216–228, 2014.
[132] F. Amato, A. Castiglione, V. Moscato, A. Picariello, and G. Sperlì, “Multimedia summarization using social media content,” Multimedia Tools and Applications, vol. 77, no. 14, pp. 17803–17827, 2018.
[133] Y. Rizk, H. S. Jomaa, M. Awad, and C. Castillo, “A computationally efficient multi-modal classification approach of disaster-related twitter images,” in Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 2019, pp. 2050–2059.
[134] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” Text Summarization Branches Out, 2004.

[135] M. Steinbach, G. Karypis, V. Kumar et al., “A comparison of document clustering techniques,” in KDD Workshop on Text Mining, vol. 400, no. 1. Boston, 2000, pp. 525–526.
[136] A. Starczewski, “A new validity index for crisp clusters,” Pattern Analysis and Applications, vol. 20, no. 3, pp. 687–700, 2017.
[137] F. Kovács, C. Legány, and A. Babos, “Cluster validity measurement techniques,” in 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005.
[138] S. Saha and S. Bandyopadhyay, “Some connectivity based cluster validity indices,” Applied Soft Computing, vol. 12, no. 5, pp. 1555–1565, 2012.
[139] S. Bandyopadhyay and U. Maulik, “Nonparametric genetic clustering: comparison of validity indices,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 1, pp. 120–125, 2001.
[140] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, “An extensive comparative study of cluster validity indices,” Pattern Recognition, vol. 46, no. 1, pp. 243–256, 2013.
[141] N. Saini, S. Chourasia, S. Saha, and P. Bhattacharyya, “A self organizing map based multi-objective framework for automatic evolution of clusters,” in International Conference on Neural Information Processing. Springer, 2017, pp. 672–682.
[142] K. Deb and S. Tiwari, “Omni-optimizer: A generic evolutionary algorithm for single and multi-objective optimization,” European Journal of Operational Research, vol. 185, no. 3, pp. 1062–1087, 2008.
[143] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[144] A. Cardoso-Cachopo, “Improving methods for single-label text categorization,” Ph.D. thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
[145] S. Bandyopadhyay and S. Saha, “A new principal axis based line symmetry measurement and its application to clustering,” in International Conference on Neural Information Processing. Springer, 2008, pp. 543–550.
[146] P. Dutta and S. Saha, “Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering,” Computers in Biology and Medicine, vol. 89, pp. 31–43, 2017.
[147] E. Loper and S. Bird, “NLTK: The Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ser. ETMTNLP ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 63–70. [Online]. Available: https://doi.org/10.3115/1118108.1118117
[148] T. Korenius, J. Laurikkala, K. Järvelin, and M. Juhola, “Stemming and lemmatization in the clustering of finnish text documents,” in Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management. ACM, 2004, pp. 625–633.
[149] S. Bandyopadhyay and S. Saha, “A point symmetry-based clustering technique for automatic evolution of clusters,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1441–1457, 2008.
[150] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms. John Wiley and Sons, 2001.
[151] C. M. Fonseca and P. J. Fleming, “An overview of evolutionary algorithms in multiobjective optimization,” Evolutionary Computation, vol. 3, no. 1, pp. 1–16, 1995.
[152] S. Acharya, S. Saha, J. G. Moreno, and G. Dias, “Multi-objective search results clustering,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 99–108.
[153] E. Mezura-Montes, M. Reyes-Sierra, and C. A. C. Coello, “Multi-objective optimization using differential evolution: a survey of the state-of-the-art,” in Advances in Differential Evolution. Springer, 2008, pp. 173–196.
[154] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[155] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[156] B. L. Welch, “The generalization of ‘Student’s’ problem when several different population variances are involved,” Biometrika, vol. 34, no. 1/2, pp. 28–35, 1947. [Online]. Available: http://www.jstor.org/stable/2332510


[157] D. G. Roussinov and H. Chen, “A scalable self-organizing map algorithm for textual classification: A neural network approach to thesaurus generation,” 1998.
[158] R. M. Aliguliyev, “Clustering techniques and discrete particle swarm optimization algorithm for multi-document summarization,” Computational Intelligence, vol. 26, no. 4, pp. 420–448, 2010.
[159] S. Saha, S. Mitra, and S. Kramer, “Exploring multiobjective optimization for multiview clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12, no. 4, pp. 1–30, 2018.
[160] X. Wan, “Towards a unified approach to simultaneous single-document and multi-document summarizations,” in Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1137–1145.
[161] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, “Multiobjective genetic clustering with ensemble among pareto front solutions: Application to MRI brain image segmentation,” in 2009 Seventh International Conference on Advances in Pattern Recognition (ICAPR ’09). IEEE, 2009, pp. 236–239.
[162] J. Kupiec, J. Pedersen, and F. Chen, “A trainable document summarizer,” in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1995, pp. 68–73.
[163] S. Mirjalili, S. Saremi, S. M. Mirjalili, and L. d. S. Coelho, “Multi-objective grey wolf optimizer: a novel algorithm for multi-criterion optimization,” Expert Systems with Applications, vol. 47, pp. 106–119, 2016.
[164] A. Sadollah, H. Eskandar, A. Bahreininejad, and J. H. Kim, “Water cycle algorithm for solving multi-objective optimization problems,” Soft Computing, vol. 19, no. 9, pp. 2587–2603, 2015.
[165] S. Mirjalili, S. M. Mirjalili, and A. Lewis, “Grey wolf optimizer,” Advances in Engineering Software, vol. 69, pp. 46–61, 2014.
[166] H. Eskandar, A. Sadollah, A. Bahreininejad, and M. Hamdi, “Water cycle algorithm – a novel metaheuristic optimization method for solving constrained engineering optimization problems,” Computers & Structures, vol. 110, pp. 151–166, 2012.
[167] A. Sadollah, H. Eskandar, A. Bahreininejad, and J. H. Kim, “Water cycle algorithm with evaporation rate for solving constrained and unconstrained optimization problems,” Applied Soft Computing, vol. 30, pp. 58–71, 2015.
[168] H. Zhang, A. Zhou, S. Song, Q. Zhang, X. Z. Gao, and J. Zhang, “A self-organizing multiobjective evolutionary algorithm,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 5, pp. 792–806, Oct 2016.
[169] E. Shareghi and L. S. Hassanabadi, “Text summarization with harmony search algorithm-based sentence extraction,” in Proceedings of the 5th International Conference on Soft Computing as Transdisciplinary Science and Technology. ACM, 2008, pp. 226–231.
[170] J. Sadeghi, S. Sadeghi, and S. T. A. Niaki, “A hybrid vendor managed inventory and redundancy allocation optimization problem in supply chain management: An NSGA-II with tuned parameters,” Computers & Operations Research, vol. 41, pp. 53–64, 2014.
[171] S. Khalilpourazari and S. Khalilpourazary, “Optimization of production time in the multi-pass milling process via a robust grey wolf optimizer,” Neural Computing and Applications, vol. 29, no. 12, pp. 1321–1336, 2018.
[172] J. Sadeghi and S. T. A. Niaki, “Two parameter tuned multi-objective evolutionary algorithms for a bi-objective vendor managed inventory model with trapezoidal fuzzy demand,” Applied Soft Computing, vol. 30, pp. 567–576, 2015.
[173] H. Li and Q. Zhang, “Multiobjective optimization problems with complicated pareto sets, MOEA/D and NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 2, pp. 284–302, 2009.
[174] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for k-medoids clustering,” Expert Systems with Applications, vol. 36, no. 2, pp. 3336–3341, 2009.
[175] R. L. Cilibrasi and P. M. Vitanyi, “The google similarity distance,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, 2007.
[176] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for automatic text summarization,” Computer Speech & Language, vol. 23, no. 1, pp. 126–144, 2009.
[177] D. R. Radev, H. Jing, M. Styś, and D. Tam, “Centroid-based summarization of multiple documents,” Information Processing & Management, vol. 40, no. 6, pp. 919–938, 2004.

[178] C. N. Silla, G. L. Pappa, A. A. Freitas, and C. A. Kaestner, “Automatic text summarization with genetic algorithm-based attribute selection,” in Ibero-American Conference on Artificial Intelligence. Springer, 2004, pp. 305–314.
[179] V. Gupta, P. Chauhan, S. Garg, A. Borude, and S. Krishnan, “An statistical tool for multi-document summarization,” International Journal of Scientific and Research Publications, vol. 2, no. 5, 2012.
[180] D. Liu, Y. He, D. Ji, and H. Yang, “Genetic algorithm based multi-document summarization,” in Pacific Rim International Conference on Artificial Intelligence. Springer, 2006, pp. 1140–1144.
[181] S. Bird and E. Loper, “NLTK: the natural language toolkit,” in Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2004, p. 31.
[182] J. H. Lau and T. Baldwin, “An empirical evaluation of doc2vec with practical insights into document embedding generation,” arXiv preprint arXiv:1607.05368, 2016.
[183] K. Mani, I. Verma, H. Meisheri, and L. Dey, “Multi-document summarization using distributed bag-of-words model,” in IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, 2018, pp. 672–675.
[184] R. D. Lins, R. F. Mello, and S. Simske, “DocEng’19 competition on extractive text summarization,” in Proceedings of the ACM Symposium on Document Engineering 2019, 2019, pp. 1–2.
[185] H. Oliveira, R. Lima, R. D. Lins, F. Freitas, M. Riss, and S. J. Simske, “A concept-based integer linear programming approach for single-document summarization,” in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2016, pp. 403–408.
[186] A. Jangra, A. Jatowt, M. Hasanuzzaman, and S. Saha, “Text-image-video summary generation using joint integer linear programming,” in European Conference on Information Retrieval. Springer, 2020, pp. 190–198.
[187] B. L. Welch, “The generalization of ‘Student’s’ problem when several different population variances are involved,” Biometrika, vol. 34, no. 1/2, pp. 28–35, 1947.
[188] J. Ramos et al., “Using TF-IDF to determine word relevance in document queries,” in Proceedings of the First Instructional Conference on Machine Learning, vol. 242, 2003, pp. 133–142.
[189] A. Huang, “Similarity measures for text document clustering,” in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, vol. 4, 2008, pp. 9–56.
[190] Y. Wang, Z. Cai, and Q. Zhang, “Differential evolution with composite trial vector generation strategies and control parameters,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 55–66, 2011.
[191] A. W. Mohamed, H. Z. Sabry, and T. Abd-Elaziz, “Real parameter optimization by an effective differential evolution algorithm,” Egyptian Informatics Journal, vol. 14, no. 1, pp. 37–53, 2013.
[192] S. Das and P. N. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, 2011.
[193] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[194] I. Rish et al., “An empirical study of the naive bayes classifier,” in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, 2001, pp. 41–46.
[195] L. Wang, Support Vector Machines: Theory and Applications. Springer Science & Business Media, 2005, vol. 177.
[196] M. Imran, P. Mitra, and C. Castillo, “Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris, France: European Language Resources Association (ELRA), May 2016.
[197] Y. Qu, C. Huang, P. Zhang, and J. Zhang, “Microblogging after a major disaster in china: a case study of the 2010 yushu earthquake,” in Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work. ACM, 2011, pp. 25–34.
[198] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 851–860.
[199] K. Rudra, P. Goyal, N. Ganguly, P. Mitra, and M. Imran, “Identifying sub-events and summarizing disaster-related information from microblogs,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018, pp. 265–274.


[200] Z. Wang and G. P. Rangaiah, “Application and analysis of methods for selecting an optimal solution from the pareto-optimal front obtained by multiobjective optimization,” Industrial & Engineering Chemistry Research, vol. 56, no. 2, pp. 560–574, 2017.
[201] R. M. Alguliev, R. M. Aliguliyev, and C. A. Mehdiyev, “Sentence selection for generic document summarization using an adaptive differential evolution algorithm,” Swarm and Evolutionary Computation, vol. 1, no. 4, pp. 213–222, 2011.
[202] H. H. Saleh, N. J. Kadhim, and A. A. Bara’a, “A genetic based optimization model for extractive multi-document text summarization,” Iraqi Journal of Science, vol. 56, no. 2B, pp. 1489–1498, 2015.
[203] X. Li, H. Zhang, and S. Song, “A self-adaptive mating restriction strategy based on survival length for evolutionary multiobjective optimization,” Swarm and Evolutionary Computation, vol. 43, pp. 31–49, 2018.
[204] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolutional localization networks for dense captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[205] L. Yang, K. Tang, J. Yang, and L.-J. Li, “Dense captioning with joint inference and visual context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2193–2202.
[206] K. Ahmad, M. L. Mekhalfi, N. Conci, F. Melgani, and F. D. Natale, “Ensemble of deep models for event recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 2, pp. 1–20, 2018.
[207] S. Vesal, N. Ravikumar, A. Davari, S. Ellmann, and A. Maier, “Classification of breast cancer histology images using transfer learning,” in International Conference on Image Analysis and Recognition. Springer, 2018, pp. 812–819.
[208] S. Robertson, H. Zaragoza et al., “The probabilistic relevance framework: BM25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.

Publications

Journals

1. Saini, N., Saha, S., Harsh, A., & Bhattacharyya, P. (2018): Sophisticated SOM based genetic operators in multi-objective clustering framework. Applied Intelligence, 49(5), 1803-1822. (Impact factor: 2.88)

2. Saini, N., Saha, S., & Bhattacharyya, P. (2018): Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution. Cognitive Computation, 11(2), 271-293. (Impact factor: 4.87)

3. Saini, N., Saha, S., Jangra, A., & Bhattacharyya, P. (2018): Extractive single document summarization using multi-objective optimization: Exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowledge-Based Systems, 164, 45-67. (Impact factor: 5.10)

4. Saini, N., Saha, S., Chakraborty, D., & Bhattacharyya, P. (2019): Extractive single document summarization using binary differential evolution: Optimization of different sentence quality measures. PLoS ONE, 14(11): e0223477. (Impact factor: 2.76, h5 index: 176)

5. Saini, N., Saha, S., Potnuru, V., Grover, R., & Bhattacharyya, P. (2019): Figure-Summarization: A Multi-objective Optimization based Approach. IEEE Intelligent Systems. (Impact factor: 4.64)

6. Saini, N., Saha, S., Bhattacharyya, P., & Tuteja, H. (August 2019): Textual Entailment based Figure Summarization for Biomedical Articles. ACM Transactions on Multimedia Computing, Communications, and Applications. (accepted) (Impact Factor: 2.25)

7. Saini, N., Saha, S., & Bhattacharyya, P. (2019): A Multi-objective Based Approach for Microblog Summarization. IEEE Transactions on Computational Social Systems.

8. Saini, N., Saha, S., Mansoori, S., & Bhattacharyya, P. (2020): Fusion of self-organizing map and granular self-organizing map for microblog summarization. Soft Computing, pp. 1-13. (Impact Factor: 3.05)

Conferences

1. Saini, N., Chourasia, S., Saha, S., & Bhattacharyya, P. (2017): A self-organizing map based multi-objective framework for automatic evolution of clusters. In International Conference on Neural Information Processing (ICONIP 2017) (pp. 672-682). Springer, Cham. (Core ranking: A, h5 index: 21)


2. Saini, N., Saha, S., Kumar, A., & Bhattacharyya, P. (September 2019): Multi-document Summarization using Adaptive Composite Differential Evolution. In International Conference on Neural Information Processing (ICONIP 2019). Springer. (Core ranking: A, h5 index: 21)

3. Saini, N., Kumar, S., Saha, S., & Bhattacharyya, P. (2020): Mining Graph-based Features in Multi-objective Framework for Microblog Summarization. In IEEE Congress on Evolutionary Computation (IEEE CEC 2020). (Core ranking: A, h5 index: 68)

Other related Accepted Journal Publication

1. Saini, N., Saha, S., Soni, C., & Bhattacharyya, P. (September 2019): Automatic Evolution of Bi-clusters from Microarray Data using Self-Organized Multi-objective Evolutionary Algorithm. Applied Intelligence. (Impact factor: 2.88)

Other related Accepted Conference Publications

1. Saini, N., Saha, S., & Bhattacharyya, P. (2018): Cascaded SOM: An improved tech- nique for automatic email classification. In 2018 International Joint Conference on Neural Networks (IJCNN 2018) (pp. 1-8). IEEE. (Core ranking: A, h5 index: 36)

2. Saini, N., Grover, R., Saha, S., & Bhattacharyya, P. (September 2019): Scientific Document Clustering using Granular Self-organizing Map. In International Conference on Neural Information Processing (ICONIP 2019). Springer. (Core ranking: A, h5 index: 21)

3. Saini, N., Saha, S., Bhattacharyya, P. (September 2019): Incorporation of Neighborhood Concept in Enhancing SOM based Multi-label Classification. In International Conference on Pattern Recognition and Machine Intelligence (PReMI 2019). Springer.

4. Saini, N., Reddy, S., Saha, S., & Bhattacharyya, P.: A Multi-view Clustering Approach for Scientific Document Summarization Using Citation Context, In IEEE International Conference on Pattern Recognition (ICPR 2020). IEEE. (accepted) (Core ranking: B, h5 index: 38)

Under Review Journals

1. Saini, N., Saha, S., Bhattacharyya, P., Mrinal, S., & Mishra, S.: On Multi-modal Microblog Summarization, In IEEE Transactions on Multimedia. (Impact factor: 5.45)

2. Saini, N., Bansal, D., Saha, S., & Bhattacharyya, P.: Textual Entailment based Multi-objective Multi-view Search Results Clustering, In Expert Systems with Applications. (Impact factor: 5.85)

3. Saini, N., Saha, S., & Bhattacharyya, P.: Microblog Summarization using Self-adaptive Multi-objective Binary Differential Evolution, In Neural Computing and Applications. (Impact factor: 4.66)
