DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2017

Using semantic folding with TextRank for automatic summarization

SIMON KARLSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Using semantic folding with TextRank for automatic summarization

SIMON KARLSSON [email protected]

Master in Computer Science
Date: June 27, 2017
Principal: Findwise AB
Supervisor at Findwise: Henrik Laurentz
Supervisor at KTH: Stefan Nilsson
Examiner at KTH: Olov Engwall
Swedish title: TextRank med semantisk vikning för automatisk sammanfattning

Abstract

This master thesis deals with automatic summarization of text and how semantic folding can be used as a similarity measure between sentences in the TextRank algorithm. The method was implemented and compared with two common similarity measures. These two similarity measures were cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences.

The three methods were implemented, and the linguistic features used in the construction were stop words, part-of-speech filtering and stemming. Five different part-of-speech filters were used, with different mixtures of nouns, verbs, and adjectives.

The three methods were evaluated by summarizing documents from the Document Understanding Conference and comparing them to gold-standard summaries created by human judges. Comparison between the system summaries and gold-standard summaries was made with the ROUGE-1 measure. The algorithm with semantic folding performed worst of the three methods, but only 0.0096 worse in F-score than cosine similarity of tf-idf vectors, which performed best. For semantic folding, the average precision was 46.2% and recall 45.7% for the best-performing part-of-speech filter.

Sammanfattning

Det här examensarbetet behandlar automatisk textsammanfattning och hur semantisk vikning kan användas som likhetsmått mellan meningar i algoritmen TextRank. Metoden implementerades och jämfördes med två vanliga likhetsmått. Dessa två likhetsmått var cosinus-likhet mellan tf-idf-vektorer samt antal överlappande termer i två meningar.

De tre metoderna implementerades och de lingvistiska särdragen som användes vid konstruktionen var stoppord, filtrering av ordklasser samt en avstämmare. Fem olika filter för ordklasser användes, med olika blandningar av substantiv, verb och adjektiv.

De tre metoderna utvärderades genom att sammanfatta dokument från DUC och jämföra dessa mot guldsammanfattningar skapade av mänskliga domare. Jämförelse mellan systemsammanfattningar och guldsammanfattningar gjordes med måttet ROUGE-1. Algoritmen med semantisk vikning presterade sämst av de tre jämförda metoderna, dock bara 0.0096 sämre i F-score än cosinus-likhet mellan tf-idf-vektorer som presterade bäst. För semantisk vikning var den genomsnittliga precisionen 46.2% och recall 45.7% för det ordklassfiltret som presterade bäst.

Acknowledgements

I would like to thank

Henrik Laurentz for invaluable guidance throughout the entire project. This thesis would simply not be what it is if it was not for you,

Simon Stenström for the initial discussions that made this thesis possible,

Stefan Nilsson for your feedback on the scientific approach of the project,

Olov Engwall for examining this thesis,

Findwise for making me feel welcome every single day,

Josefine for your never ending love and support throughout my years at KTH,

Family and friends for always being there for me.

Contents

1 Introduction
  1.1 Problem definition
  1.2 Objective
  1.3 Purpose
  1.4 Delimitations
  1.5 Outline of the report

2 Theory
  2.1 Automatic summarization
    2.1.1 Categories of automatic summarization
    2.1.2 TextRank: Automatic summarization using graph representations
    2.1.3 Sentence similarity
  2.2 Distributional semantics
    2.2.1 Semantic folding
  2.3 Linguistic features
    2.3.1 Tokenization
    2.3.2 Stop words
    2.3.3 PoS tagging
    2.3.4 Stemming
    2.3.5 Lemmatisation
  2.4 Evaluation
    2.4.1 Intrinsic evaluation
    2.4.2 Extrinsic evaluation
  2.5 Related work
    2.5.1 Distributional semantics
    2.5.2 Automatic summarization using context

3 Implementation
  3.1 Architecture
  3.2 Preprocessing
    3.2.1 Sentence tokenization


    3.2.2 Word tokenization
    3.2.3 Stop words
    3.2.4 PoS filtering
    3.2.5 Stemming
  3.3 Creation of the graph
    3.3.1 Tf-idf
    3.3.2 Term overlap
    3.3.3 Semantic folding
  3.4 Scoring of the graph
  3.5 Picking the sentences

4 Evaluation
  4.1 Data
  4.2 Gold-standard comparison
  4.3 Preprocessing of system and reference summaries
  4.4 Measures
    4.4.1 Precision
    4.4.2 Recall
    4.4.3 F1 score
    4.4.4 Upper bound of the ROUGE score

5 Results
  5.1 The average score

6 Discussion
  6.1 Methodology
  6.2 Results
  6.3 Criticism
  6.4 Ethics and sustainability

7 Conclusion
  7.1 Future work
    7.1.1 Train model on specific domain
    7.1.2 Preprocessing
    7.1.3 Extrinsic evaluation
    7.1.4 Evaluate readability

Bibliography

A Preprocessing
  A.1 Steps of the Porter stemming algorithm
  A.2 The stoplist used

B Results
  B.1 The median score
  B.2 The standard deviation score

Abbreviations

DUC Document Understanding Conference
idf inverse document frequency

JUNG Java Universal Network/Graph Framework

LSA Latent Semantic Analysis

PoS Part of Speech

ROUGE Recall-Oriented Understudy for Gisting Evaluation
tf term frequency

Glossary

co-occurrence When two terms occur alongside each other in a text.
corpus A large set of text.
n-gram A sequence of n words from a text.
snippet A small piece of text.
vocabulary A set of words used for a specific purpose.

Chapter 1

Introduction

This thesis presents a novel approach to automatic summarization, using the TextRank algorithm with semantic folding. This method was evaluated against two state-of-the-art methods.

Automatic summarization is the concept of creating a summary of a text in such a way as to maximize the relevant information from the original text in the summary. The summary can either be extractive, which means that a subset of the sentences from the original text is chosen for the summary, or abstractive, which means that entirely new sentences are constructed and placed in the summary. One approach to extractive summarization is to create a graph from the original text in which the nodes represent sentences and the edges represent some kind of similarity between the corresponding sentences. Sentences are added to the summary based on centrality in the graph, with the hypothesis that central sentences are important and hence should be in the summary. This algorithm is known as TextRank [1].¹

This thesis evaluates the performance of this algorithm using a novel method of assessing the similarity between sentences in TextRank, based on distributional semantics. Distributional semantics builds upon the hypothesis that words which are similar in meaning occur in similar contexts, which means that the similarity of text units can be estimated by examining the distributional similarity between text units [3]. Semantic folding is a distributional semantics method with the purpose of capturing the meaning of a text unit with the context in which it appears [4]. The novel similarity measure described in this thesis is based on semantic folding.

¹ Another graph-based summarization algorithm, LexRank [2], was developed at the same time as TextRank, using different similarity measures. However, TextRank will be used to denote this summarization method throughout this thesis.


1.1 Problem definition

The research question of this thesis is:

How will the TextRank algorithm perform with semantic folding as similarity measure?

To answer the research question, the method of using semantic folding as a similarity measure in TextRank was evaluated against two state-of-the-art methods of comparing the similarity between texts. These similarity measures are cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences. To answer the research question, the problem was divided into four parts:

1. Implementation of the TextRank algorithm with the overlap measure for similarity between sentences.

2. Implementation of the TextRank algorithm with the cosine similarity of tf-idf vectors measure.

3. Incorporation of semantic folding in the TextRank algorithm for similarity between sentences.

4. Evaluation of the three different similarity measures above.

1.2 Objective

The objective of this thesis was to implement the TextRank algorithm for automatic summarization with three different similarity measures: cosine similarity of tf-idf vectors, term overlap, and a novel approach based on semantic folding. The different methods were evaluated by comparing the constructed summaries with human-created gold-standard summaries from the Document Understanding Conference (DUC). The comparison was made with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric set, and specifically with the ROUGE-N measure, which is based on co-occurrences of n-grams [5].

1.3 Purpose

Information overload is a term describing the difficulty of making decisions caused by the existence of too much information. Information overload is becoming more of a problem as the amount of information online has been steadily increasing since the 90's, with digital information growing by a factor of ten every five years [6, 7]. As the problem of information overload grows, methods of assessing the relevance of textual information also become more important. The process of deciding whether a text is relevant or not

could be made simpler by presenting the text as a shorter version of itself, containing the same information. This is the purpose of an automatic summarization system. At the same time, this great amount of information could potentially be utilized with semantic folding or other distributional semantics methods to form a quantified measure of similarity between text units, which could possibly be used to improve automatic summarization.

This thesis was commissioned by Findwise AB. The main interest of Findwise is to evaluate the validity of using semantic folding in the context of different natural language problems.

1.4 Delimitations

This thesis was limited to extractive summarization. Furthermore, summarization can be done based either on one source document or on multiple source documents. This thesis only handled the former case, namely single-document summarization. When processing text, typically some preprocessing will be needed. Preprocessing of text could include removing stop words and reducing words to their base forms. This thesis only evaluated different forms of Part of Speech (PoS) filtering for the summarization system, with the rest of the preprocessing steps fixed. The semantic folding model needs to be trained on a corpus, and for this thesis, it was only trained on non-domain-specific data, namely English Wikipedia.

1.5 Outline of the report

The outline of the report will follow the structure of:

1. Introduction

2. Theory/background

3. Implementation

4. Evaluation

5. Results

6. Discussion

7. Conclusion

Chapter 2 presents the theory needed to understand this thesis and presents some related work. Chapter 3 describes the steps necessary to replicate the work of the thesis. Chapter 4 describes how the evaluation of the proposed system was done. Chapter 5

presents the results of the thesis, and they are discussed in Chapter 6. The thesis ends with conclusions and suggestions for future work in Chapter 7.

Chapter 2

Theory

This chapter provides the reader with all the background theory that is necessary to understand the rest of this thesis. Section 2.1 gives an introduction to automatic summarization and presents the key algorithm of this thesis: TextRank. Section 2.2 describes distributional semantics, including semantic folding, and how it is applied to calculate the similarity between sentences. Section 2.3 describes some linguistic features that must be taken into consideration when creating an automatic summarization system. Section 2.4 describes different ways of evaluating an automatic summarization system. The chapter ends with Section 2.5, which describes some related work relevant to this thesis.

2.1 Automatic summarization

The early work on automatic summarization goes back to 1958, when Luhn [8] automatically created summaries of technical literature. He defined the significance of words as a combination of the frequency of a given word in the document and the relative position of the word in the sentence in which it appears. This significance was then used to pick sentences from the source text to put in the summary, by combining the significance of all words in the sentence. Edmundson [9] extended the novel work of Luhn by taking more features into consideration, like cue words¹, title and heading words, and the location of the sentence. Kupiec et al. [10] trained a classifier determining if a sentence should be present in the summary or not by representing a sentence using seven different linguistic features. Introducing multi-feature representations into the field of automatic summarization in this way led the research into the era of machine learning based approaches.

According to Hassel [11], the original incentive of the automatic summarization research was different applications of digitalization of books and scientific papers, as

¹ A connective expression that signals semantic relations in a text.


Luhn's efforts of summarizing technical literature indicate. This incentive later grew as the internet became established and new digital information became available online at a rate never seen before. One way in which summarization has become an ordinary part of online activity is the way search engines use automatic summarization to summarize a term searched for or a page retrieved [12].

With the rise of the internet came the problem of summarizing multiple documents regarding the same topic [13]. This type of automatic summarization is called multi-document summarization, and it brings new challenges to automatic summarization. Early key research in multi-document summarization includes that of Lin et al. [14], who used the ideas of Luhn and Edmundson and applied them to multi-document summarization.

In 2004 came the influential work of Mihalcea et al. [1] and Günes et al. [2], who at the same time developed a new way of determining sentence importance in a text. This way of extracting important sentences has been used widely in automatic summarization, by selecting the most important sentences for the output summary. This approach, named TextRank (or LexRank), is still often used as a benchmark when new summarization methods are constructed [13, 15, 16, 17, 18]. As the key algorithm of this thesis, it will be presented in detail in Section 2.1.2.

2.1.1 Categories of automatic summarization

As already pointed out earlier in the thesis, there are multiple different types of automatic summarization, with different input source texts, different goals for the summary, or other parameters differentiating them. This section will go through the different categories and explain the differences between them.

Extractive vs. abstractive

In extractive summarization, a subset of the text from the original text is chosen for the summary, e.g. sentences, paragraphs or phrases [19]. Extractive summarization can be viewed as a process with two main steps: a representation of the source text and a mechanism for selecting sentences from that representation. In abstractive summarization, the content of the text is learned, and a new summary is constructed in new words [19]. A majority of the research done has been on extractive summarization rather than abstractive [13]. Extractive summarization will be explained more in depth in Section 2.1.2.

Single-document vs. multi-document

In single-document summarization, there is only one document as input to the summarization algorithm. In multi-document summarization, multiple documents possibly

regarding the same topic are to be summarized. This raises problems due to the fact that highly similar sentences describing the same things might be present in different documents, which means that a mechanism for selecting only one of these sentences must be constructed [20].

Indicative vs. informative

In informative summarization, the goal is to make the summary hold as much information from the original text as possible, while in indicative summarization, the goal is merely to indicate the subject of the original text [21].

Generic vs. query-oriented

Generic summarization focuses on creating a general summary of the input text, while query-oriented summarization creates a summary based on the information need given by a user [22].

2.1.2 TextRank: Automatic summarization using graph representations

This section will describe an algorithm for automatic summarization called TextRank. The algorithm can be described as a three-step process including sentence representation, sentence ranking, and sentence selection. The following subsections will describe each of these steps.

Sentence representation

The input text is represented as a graph, where each sentence is converted to a node, and an edge between two nodes represents the similarity between the two sentences [2, 1]. There are multiple different ways of measuring the similarity between sentences, and the two original measures will be explained in Section 2.1.3. Günes et al. [2] suggest an unweighted graph, with edges between sentences if the similarity is above a predefined threshold, whilst Mihalcea et al. [1] suggest a fully connected graph with edges between nodes stating the similarity between the corresponding sentences, no matter how similar they are.

Sentence ranking

When the sentences have been converted to a graph, the next step is to rank each node (sentence) in the graph, which is done with the PageRank algorithm. PageRank was developed by Brin et al. [23, 24] in 1998 and is the basis of the Google search engine. PageRank determines the relative importance of web pages using the link structure of the web. The result is a ranking where web pages with high centrality on the web are given preference and are considered more important. The web is modeled as a graph

where web pages are represented by nodes, with directed edges if there is a link from one page to another. Even if the model was initially developed for web pages, it can be used on all graphs with edges representing some kind of recommendation of nodes, no matter what the nodes represent. In the context of graph-based summarization, the nodes represent sentences, and the edges represent similarity between them, but the underlying mechanics work just the same as first described by Brin et al. Informally, the model can be thought of as a random surfer that is surfing the web completely at random. The PageRank score is then the fraction of time the random surfer is spending on each web page, assuming that the surfer stays the same amount of time on each page. At each transition from a page, the random surfer picks one of the outlinks at random, which means that if the set of outlinks from page j is o_j, the probability of going to each neighbour of j is 1/|o_j|. This gives a simplified definition of the PageRank score as in Equation 2.1, where PR(u) is the PageRank of node u, L(u) is the number of outlinks from u, and B_u is the set of pages having a link to u:

PR(u) = \sum_{v \in B_u} \frac{1}{L(v)} \cdot PR(v)    (2.1)

However, the model described so far has a problem. If a page does not have any outlinks, the surfer will be stuck and will have no way to continue the transitions. To account for this, a probability d of jumping to any other node at each transition is introduced. This gives the final definition of the PageRank score as in Equation 2.2, with the same notations as in Equation 2.1 and with N as the total number of nodes in the graph.

PR(u) = \frac{d}{N} + (1 - d) \cdot \sum_{v \in B_u} \frac{1}{L(v)} \cdot PR(v)    (2.2)

The PageRank scores can be found by starting with an even score for each web page and iterating through the pages, updating each page's PageRank according to the definition above, until the changes are smaller than a given convergence criterion.
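
To make the iteration concrete, the sketch below shows the PageRank update of Equation 2.2 in Java. It is only an illustration of the formula and the convergence loop under the assumption of a uniform random jump; the implementation used in this thesis is JUNG's PageRank (see Section 3.4), and all names here are hypothetical.

import java.util.Arrays;

/** Minimal PageRank iteration following Equation 2.2 (illustrative sketch, not the JUNG implementation). */
public class PageRank {

    /**
     * @param inLinks   inLinks[u] lists the nodes v that link to u (the set B_u)
     * @param outDegree outDegree[v] is the number of outlinks L(v) from v
     * @param d         probability of jumping to a random node
     * @param epsilon   convergence criterion on the largest score change
     */
    static double[] score(int[][] inLinks, int[] outDegree, double d, double epsilon) {
        int n = inLinks.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                    // start with an even score for each node

        double change = Double.MAX_VALUE;
        while (change > epsilon) {
            double[] next = new double[n];
            for (int u = 0; u < n; u++) {
                double sum = 0.0;
                for (int v : inLinks[u]) {
                    sum += pr[v] / outDegree[v];     // (1 / L(v)) * PR(v)
                }
                next[u] = d / n + (1 - d) * sum;     // Equation 2.2
            }
            change = 0.0;
            for (int u = 0; u < n; u++) {
                change = Math.max(change, Math.abs(next[u] - pr[u]));
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Tiny example graph: 0 <-> 1, 2 -> 0
        int[][] inLinks = { {1, 2}, {0}, {} };
        int[] outDegree = { 1, 1, 1 };
        System.out.println(Arrays.toString(score(inLinks, outDegree, 0.15, 1e-8)));
    }
}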

In the context of TextRank, the nodes represent sentences instead of pages, and the edges are the similarity between these sentences instead of links between web pages. Note that in the case of similarity between sentences, there will always be edges in both directions between nodes, due to the symmetric relationship between sentences, unlike with web pages, where links often go in only one direction.

Sentence selection

When each sentence has been scored, the last step is to select which sentences to output as the summary given the desired length of the summary. One usual approach is to simply pick the highest scoring sentences until the desired length is obtained [1].

This might yield a summary that is slightly longer than the desired length. Another approach is to consider the sentence selection as an optimization problem and solve it with the knapsack algorithm, where a set of items with weights and values is picked to maximize the value given a weight limit [13]. More formally, let v_i and w_i be the value and weight of item i, W the weight limit, and define x_i to be one if the item is picked or zero if it is not picked. The optimization problem can be stated as

\text{maximize } \sum_{i=1}^{n} v_i x_i \quad \text{subject to } \sum_{i=1}^{n} w_i x_i \le W \text{ and } x_i \in \{0, 1\}.

In the application of picking sentences, v_i and w_i will be the score and length of sentence i, and W will be the desired length of the summary.
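
A minimal sketch of this 0/1 knapsack formulation applied to sentences is shown below. It is only illustrative: the thesis itself uses the simpler greedy selection (Section 3.5), and the scores, lengths and names are hypothetical.

/** 0/1 knapsack over sentences: value = score, weight = length in words (illustrative sketch). */
public class KnapsackSelection {

    /** Returns, for each sentence, whether it is picked, maximising total score within the word limit W. */
    static boolean[] select(double[] score, int[] length, int W) {
        int n = score.length;
        double[][] best = new double[n + 1][W + 1];
        for (int i = 1; i <= n; i++) {
            for (int w = 0; w <= W; w++) {
                best[i][w] = best[i - 1][w];                        // skip sentence i-1
                if (length[i - 1] <= w) {                           // or take it if it fits
                    best[i][w] = Math.max(best[i][w],
                            best[i - 1][w - length[i - 1]] + score[i - 1]);
                }
            }
        }
        boolean[] picked = new boolean[n];
        int w = W;
        for (int i = n; i >= 1; i--) {                              // trace back the choices
            if (best[i][w] != best[i - 1][w]) {
                picked[i - 1] = true;
                w -= length[i - 1];
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        double[] score = { 0.4, 0.3, 0.3 };
        int[] length = { 60, 50, 45 };
        System.out.println(java.util.Arrays.toString(select(score, length, 100))); // [false, true, true]
    }
}

Note how this optimises the summary as a whole rather than guaranteeing that the single highest-scoring sentence is included, which is the trade-off discussed again in Section 6.1.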

2.1.3 Sentence similarity

Section 2.1.2 described the general characteristics of the TextRank algorithm, leaving out how to assess the similarity between sentences. That is described in this section.

Cosine similarity of tf-idf vectors

One approach is to calculate the similarity between sentences by representing them as vectors and calculating the distance between these vectors in a geometric space [2]. The dimensionality of each sentence vector is the total number of terms in the vocabulary, with each position in the vector corresponding to a term, and the value based on the occurrence of that term. One possible approach would be to only count the number of occurrences of the term in the sentence, but that would give common terms preference over uncommon terms, even though uncommon terms often define a text better than the common terms that most texts contain. To account for this, the frequency of a term is weighted with the inverse document frequency (idf). The purpose of idf is to boost the value of rare terms. This is done by taking the logarithm of the number of documents N in the given corpus divided by the number of documents n_t that contain a given term t, as Equation 2.3 shows.

\log \frac{N}{n_t}    (2.3)

The idf-score will be high for a term if it is only present in a small number of documents in the corpus. The idf-score is combined with the term frequency (tf), giving the so-called tf-idf score. The tf-idf for a given term t, document d and corpus D is defined in Equation 2.4.

\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)    (2.4)

This yields a vector for each sentence, with each element being the tf-idf score of the term corresponding to that index of the vector. The distance between the vectors is then

calculated with the cosine distance. Equation 2.5 displays the cosine similarity measure, where A_i and B_i are the components of vectors A and B.

\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}    (2.5)

The edges of the similarity graph described in Section 2.1.2 are weighted according to the cosine similarity measure. Another possibility is to create the graph with unweighted edges, where an edge exists between two nodes if the similarity is above some predefined threshold. According to Erkan et al. [2], a threshold of 0.1 has been shown to perform best.
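
As an illustration of this measure, the sketch below builds tf-idf maps for two tokenized sentences and applies Equation 2.5. It assumes precomputed document frequencies from a training corpus; the counts, tokens and names are hypothetical, and the preprocessing pipeline of Chapter 3 is omitted.

import java.util.*;

/** Cosine similarity of tf-idf vectors for two token lists (illustrative sketch). */
public class TfIdfCosine {

    /** idf(t, D) = log(N / n_t), from precomputed document frequencies. */
    static double idf(String term, Map<String, Integer> docFreq, int numDocs) {
        return Math.log((double) numDocs / docFreq.getOrDefault(term, 1));
    }

    static Map<String, Double> tfIdfVector(List<String> tokens, Map<String, Integer> docFreq, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) {
            vec.merge(t, 1.0, Double::sum);                       // raw term frequency
        }
        vec.replaceAll((t, tf) -> tf * idf(t, docFreq, numDocs)); // weight by idf
        return vec;
    }

    /** Equation 2.5: dot product divided by the product of the vector norms. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies from a small training corpus of 100 documents.
        Map<String, Integer> docFreq = Map.of("storm", 5, "damage", 10, "said", 90, "roof", 3);
        List<String> s1 = List.of("storm", "damage", "roof");
        List<String> s2 = List.of("storm", "said", "damage");
        System.out.println(cosine(tfIdfVector(s1, docFreq, 100), tfIdfVector(s2, docFreq, 100)));
    }
}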

Term overlap

Instead of calculating the distance between word vectors, one can simply calculate the number of terms overlapping between sentences. The sentence similarity defined by Mihalcea et al. [1] is the term overlap divided by the sum of the logarithms of the lengths of the two sentences, as shown in Equation 2.6.

\frac{|S_1 \cap S_2|}{\log(|S_1|) + \log(|S_2|)}    (2.6)

The original TextRank paper also suggests considering only specific word classes of each sentence, which gives the definition in Equation 2.7 below, where F is a function keeping only the terms from the given word classes.

\frac{|F(S_1) \cap F(S_2)|}{\log(|S_1|) + \log(|S_2|)}    (2.7)
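
A small sketch of the filtered overlap in Equation 2.7 follows. It assumes the word-class filter F has already been applied, so the inputs are the filtered token lists; tokens and names are hypothetical.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Term overlap similarity, Equation 2.7 (illustrative sketch; tokens are assumed already PoS-filtered). */
public class TermOverlap {

    static double similarity(List<String> s1, List<String> s2) {
        Set<String> shared = new HashSet<>(s1);
        shared.retainAll(new HashSet<>(s2));                 // |F(S1) ∩ F(S2)|
        double denom = Math.log(s1.size()) + Math.log(s2.size());
        return denom == 0 ? 0 : shared.size() / denom;       // normalise by the sentence lengths
    }

    public static void main(String[] args) {
        List<String> s1 = List.of("storm", "damage", "roof", "tree");
        List<String> s2 = List.of("storm", "tree", "power", "line");
        System.out.println(similarity(s1, s2));              // 2 / (log 4 + log 4)
    }
}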

2.2 Distributional semantics

Section 2.1.3 described how the similarity of two sentences can be calculated with word vectors and term overlap. This section explains a different way of assessing the similarity between text units, namely the concept of semantic folding, a distributional semantics method that creates a representation of a text unit based on the context in which it appears. Distributional semantics builds on the distributional hypothesis, which states that "words which are similar in meaning occur in similar contexts" [25]. The hypothesis states that there is a correlation between distributional similarity and meaning similarity, which means that the meaning similarity of text units can be estimated by examining the distributional similarity between text units [3]. This section deals with semantic folding; three other methods of distributional semantics are briefly explained in Section 2.5.

2.2.1 Semantic folding

Semantic folding theory was created by De Sousa Webber [4] with the purpose of capturing the context of text units in a way such that they can be compared. Semantic folding uses a set of reference documents to capture the context of words, which results in a word vector for each word that represents its meaning based on its context. The word vectors of individual words are then combined to form vectors of sentences, which in turn can be compared to find the similarity of two sentences. The creation of the word vectors works as follows:

1. Each document in the document set is split into snippets. Each of these snippets will capture the context for every word contained in it. The text snippets are normally in the form of n-grams where n ≤ 3.

2. Each snippet is mapped to a 2-dimensional vector so that snippets that share words are close to each other in the vector.

3. A vector is created for each word by setting the corresponding position in its vector to one if the word is present at that index of the 2-dimensional vector from step 2.

This results in a vector for each word that represents the semantic meaning of the word based on its context in the document set, a so-called semantic fingerprint. The semantic fingerprints for individual words are then combined to form fingerprints for sentences, which in turn can be used to calculate similarities between sentences, based on the similarity of their semantic fingerprints. Figure 2.1 illustrates the process of combining the fingerprints for all words in a sentence into a sentence fingerprint. To keep the sparsity of the vectors, the bit stacks of each entry in the combined sentence fingerprint are used to calculate which entries should be active, which is illustrated in Figure 2.2.
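
The aggregation step of Figures 2.1 and 2.2 can be sketched as below: the bits of the word fingerprints are stacked, and only the highest stacks are kept active to preserve sparsity. This is a rough illustration only; the actual fingerprints and thresholding come from Cortical.io's Retina (Section 3.3.3), and the cutoff rule and names here are hypothetical assumptions.

import java.util.Arrays;

/** Combining word fingerprints into a sentence fingerprint (rough sketch of the aggregation in Figures 2.1-2.2). */
public class SentenceFingerprint {

    /**
     * @param wordFingerprints one binary fingerprint per word, all of the same length
     * @param keep             roughly how many of the most frequently set positions to keep active
     */
    static boolean[] combine(boolean[][] wordFingerprints, int keep) {
        int size = wordFingerprints[0].length;
        int[] stack = new int[size];
        for (boolean[] fp : wordFingerprints) {              // stack the bits of every word fingerprint
            for (int i = 0; i < size; i++) {
                if (fp[i]) stack[i]++;
            }
        }
        // Find a stack-height cutoff so that roughly the `keep` highest positions stay active (ties may keep more).
        int[] sorted = stack.clone();
        Arrays.sort(sorted);
        int threshold = sorted[Math.max(0, size - keep)];
        boolean[] sentence = new boolean[size];
        for (int i = 0; i < size; i++) {
            sentence[i] = stack[i] >= threshold && stack[i] > 0;
        }
        return sentence;
    }

    public static void main(String[] args) {
        boolean[][] words = {
                { true, true, false, false, true, false },
                { true, false, false, true, true, false },
                { false, true, false, false, true, true }
        };
        System.out.println(Arrays.toString(combine(words, 3)));
    }
}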

Competitive learning

The mapping in step 2 is made with competitive learning, which is a family of artificial neural network methods. Rumelhart et al. [26] describe competitive learning as a three-step process:

1. Start with a set of units that are all the same except for some randomly distributed parameter, which makes each of them respond slightly differently to a set of input patterns.

2. Limit the "strength" of the response from each unit so that it only responds to one pattern.

3. Allow the units to compete in some way for the right to respond to a given subset of inputs.

The essence of a competitive learning network is to have a set of nodes where each node responds to a subset of similar inputs.

2.3 Linguistic features

When creating an automatic summarization system, some preprocessing is needed based on some linguistic features of the input text. This will be the subject of this section.

2.3.1 Tokenization

Tokenization is the part of the preprocessing where the input text is split into units called tokens [27]. The tokens can be either individual words or entire sentences.

2.3.2 Stop words

Stop words are words that are excluded from the vocabulary because they are too common to carry any meaning. A usual strategy to create a stop list of stop words is to sort the terms by the total number of times they occur in the corpus, manually go through the most frequent terms, and add them to the stop list if they do not have any meaningful semantics given the domain of the corpus. Figure 2.3 shows some usual stop words [28].
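
The counting part of this strategy is simple to automate; a sketch of producing the frequency-ranked candidate list (the manual domain filtering cannot be automated) might look as follows, with hypothetical tokens and names.

import java.util.*;
import java.util.stream.Collectors;

/** Produces a frequency-ranked list of stop-word candidates for manual inspection (illustrative sketch). */
public class StopListCandidates {

    static List<String> topTerms(List<String> corpusTokens, int howMany) {
        Map<String, Long> counts = new HashMap<>();
        for (String token : corpusTokens) {
            counts.merge(token.toLowerCase(), 1L, Long::sum);     // count occurrences over the whole corpus
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "storm", "hit", "the", "town", "and", "the", "roof");
        System.out.println(topTerms(tokens, 3)); // "the" comes first; the remaining ties appear in arbitrary order
    }
}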

2.3.3 PoS tagging

A PoS tagger classifies words into specific word classes [28]. An example of this would be to tag a word as a noun, adjective or verb, etc. Often more fine-grained PoS tags are used that classify words with tags like noun-plural, for example [29]. A PoS filter is a filter that filters a text based on the PoS tags of the tokens [27].

2.3.4 Stemming

A word will occur in a text in different forms, e.g. organize, organizes, organizing. Furthermore, there might be related words with similar meanings, like democracy, democratic and democratization. Even if the words are different on a syntactical level, they might have the same meaning. Stemming is a heuristic approach that chops off the ends of words so that these similar words look the same [28]. Below, four examples from

Figure 2.1: An illustration of the process of combining the fingerprints of individual words into a fingerprint of a sentence. Image credited to De Sousa Webber [4].

Figure 2.2: An illustration of the process of combining the bit stacks of each individual word to form the fingerprint of a sentence using a threshold. Image credited to De Sousa Webber [4].

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Figure 2.3: A list of some terms usually considered stop words.

Wikipedia [30] are shown. The first three successfully chop the end of the word to form the stem, while the fourth illustrates the weakness of the heuristic when trying to form stems of different forms of "argue" and "argus", which all get the same stem.

cats, catlike, catty → cat
stems, stemmer, stemming, stemmed → stem
fishing, fished, fisher → fish
argue, argued, argues, arguing, argus → argu

2.3.5 Lemmatisation

Lemmatisation is closely related to stemming in that it has the same goal. But where stemming uses a heuristic, lemmatisation uses a vocabulary and morphological analysis of words to return the base or dictionary form of a word [28]. For example, "better" is reduced to its lemma "good" and "walking" is reduced to its lemma "walk" [31]. Note that in the case of "walking", stemming and lemmatisation result in the same base.

2.4 Evaluation

This section will give an overview of how to evaluate an automatic summarization system. Automatic summarization evaluation can be split into two types: intrinsic and extrinsic evaluation [32]. Intrinsic evaluation focuses on the performance of the summarization system itself, while extrinsic evaluation measures the performance of the summarization system by performing another task with the summarizer [11].

2.4.1 Intrinsic evaluation

When using an intrinsic method, there are basically two criteria for evaluating a summary: readability and informativeness of the summary. The coherence of the summary is high if it lacks dangling anaphors and gaps in rhetorical structure, and if it preserves environment-dependent structures, such as lists etc. Informativeness, on the other hand, is high if the information from the source text is preserved in the summary [32]. Most automatic summarization evaluation is intrinsic and is often done with a gold-standard comparison, which means that the output from the summarizer is compared with a summary created by a human judge [11].

Intrinsic measures

Two often used measures in intrinsic evaluation are sentence precision and sentence recall. Sentence recall measures the ratio of sentences in the gold-standard summary that are present in the created summary, and precision measures the ratio of sentences in the generated summary that are in the gold-standard summary [11].

2.4.2 Extrinsic evaluation

Extrinsic evaluation, in contrast to intrinsic, does not focus on the summarization system itself but evaluates it by considering the end user, and how well that user can perform some given task with the system. An example of such a task is the question game, in which the user should answer a couple of questions about a text after reading the summary of it [11].

2.5 Related work

There has been a great amount of research in the field of automatic summarization. However, the amount of research on context-based methods in automatic summarization is not that great. This section will describe some related work on automatic summarization using context-based methods. Section 2.5.1 will describe some methods of capturing the context of text, comparable with semantic folding, and Section 2.5.2 will describe some related work using those methods.

2.5.1 Distributional semantics

Section 2.2 explained distributional semantics and how semantic folding can be used to utilize the distributional hypothesis. However, semantic folding is not the only method that utilizes the distributional hypothesis. This section will briefly explain Latent Semantic Analysis (LSA), random indexing and word2vec, which are three methods that are similar, but not equal to, semantic folding.

Latent semantic analysis

LSA is a distributional semantics method where the document set is represented by a matrix X in which each term is represented as a row and each text unit as a column, as shown in Figure 2.4. The element X_ij is the number of occurrences of term i in text unit j. Singular value decomposition is used to reduce the number of rows of the matrix while the structure of the columns stays the same. The similarity of two terms is then compared with the cosine similarity measure between the two term vectors [33].

X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{pmatrix}, \quad \text{rows } t_i \text{ (terms)}, \text{ columns } d_j \text{ (text units)}

Figure 2.4: The LSA matrix. The rows represent the terms and the columns represent each text unit. The values represent the number of occurrences of a term in the text unit.

Random indexing

Random indexing is another distributional semantics method, with the idea to accumulate context vectors based on the occurrence of words in contexts [34]. Random indexing captures the contexts with a two-step operation:

1. Each text unit is assigned a unique and randomly generated index vector.

2. A context vector is created for each text unit by going through the entire document set. Whenever a text unit occurs within a sliding context window, the index vectors of all the words in the sliding context window are added to the text unit's context vector.

The similarity between two text units can then be calculated using the context vectors.

Word2Vec

Word2vec was created by Tomas Mikolov et al. [35] and models natural language with neural networks that create a vector space of word vectors from a large text corpus.

2.5.2 Automatic summarization using context

Chatterjee et al. [15] used a similar approach to context-based similarity, using a graph-based similarity method with random indexing and scoring sentences with the PageRank algorithm. The approach was evaluated by summarizing fifteen documents containing 200 to 300 words each, and the resulting summaries were compared with manually created summaries. The method was compared with the commercially available summarization systems Copernic and Word. They used precision, recall and F-score when comparing the summaries with the reference summary, and the results showed that their proposed method produced summaries closer to the reference summaries. Chatterjee et al. [36] later made some improvements to this method.

Ning Jianfei et al. [37] used word2vec together with TextRank to extract keywords.

They did this by creating a graph with nodes representing the words of a document, with edges between nodes weighted by similarity according to the word vectors from word2vec. Using this method increased the accuracy of extracting keywords from the individual documents.

Yihong Gong et al. [38] first suggested using LSA in automatic summarization and evaluated it against a standard method. The two methods were compared against human-generated summaries and were found to perform comparably. Other reports regarding LSA in automatic summarization include that of Makbule Gulcin Ozsoy et al. [39], who evaluated different LSA-based summarization methods on Turkish and English texts. The LSA-based methods were among the best-performing methods.

Chapter 3

Implementation

The goal of this thesis was to evaluate if semantic folding could be incorporated in the TextRank algorithm and perform on par with the original sentence similarity methods, term overlap and cosine similarity of tf-idf vectors. This chapter, together with the next one, describes how this was done. This chapter deals with the overall architecture of the summarization system, and the next describes how the evaluation was done. Section 3.1 gives an overview of the architecture of the system. Section 3.2 explains the preprocessing pipeline used. Section 3.3 describes the creation of the graph of sentences with edges based on similarity, and Section 3.4 describes how this graph was scored with PageRank. Section 3.5 explains how the sentences were picked for the summary based on the PageRank scores.

3.1 Architecture

Figure 3.1 shows the pipeline of the summarization system built for this thesis. The SentenceExtractor performed sentence tokenization of the input text. The Preprocessor performed word tokenization of the input sentences, removed stop words, stemmed the sentences, and did PoS filtering of tokens. The GraphCreator created a graph, with each sentence as a node, with edges between similar sentences. The GraphScorer used PageRank to score the graph, which yielded a score for each sentence. The SentencePicker picked sentences for the summary until the preferred length of the summary was met. These sentences were put in the same order as in the source text and were presented as the summary. The following sections will describe each of these steps in more detail and motivate the implementation choices made.

3.2 Preprocessing

Figure 3.2 shows an overview of the preprocessing steps for each method. As can be seen in the figure, there are basically two different pipelines.


Figure 3.1: An overview of the system showing all the individual components.

The two methods using the tf-idf and overlap measures used the entire pipeline, while the semantic folding method only used the first four steps: sentence tokenization, word tokenization, stop word removal and PoS filtering. Stemming was left out for semantic folding since it is a method based on context and not syntax. If two words with different stems have the same meaning, they will occur in the same context and get similar fingerprints anyway. If they do not get the same fingerprints, they do not occur in the same context and should be considered different. The rest of this section will describe each step in the preprocessing pipeline in more detail.

3.2.1 Sentence tokenization

To represent a text as a graph, with sentences as nodes, the text needs to be tokenized into sentences. A naïve approach would be to always tokenize at punctuation. However, the sentence "I hold an M.Sc in Computer Science" would then be tokenized into the following two sentences: "I hold an M." and "Sc in Computer Science", which is clearly wrong. For this thesis, OpenNLP's SentenceDetector [40] was used, which is a sentence tokenizer that tries to determine whether a punctuation character marks the end of a sentence or not. The OpenNLP SentenceDetector is a method that needs a trained model. For this thesis, OpenNLP's default model for sentence detection [41] was deemed sufficient after manual inspection of the sentence tokenization of the training corpus. The model that was used was trained before any preprocessing was done, and hence the sentence tokenization has to be done before any other preprocessing for this thesis as well.

Figure 3.2: An overview of the preprocessing pipeline used for each similarity measure.

The model can handle multiple hard cases like abbreviations, dates, emails etc. Below, several hard cases that are correctly tokenized by the sentence tokenizer used are displayed. The sentences are from the evaluation corpus.

"The chairman, Sen. Claiborne Pell, D-R.I., and Sen. Alan Cranston, D- Calif., voted “present,” meaning they took no position."

"People in Alorton, Ill., a village of 2,700 about five miles southeast of East St. Louis, were injured when trees fell on their houses or mobile homes, said Bill Gamblin of the St. Clair County Emergency Services and Disaster Agency."

"The heaviest rainfall during the six hours up to 1 p.m. EST was 1.22 inches at Peru, Ind. Wind up to 65 mph damaged roofs, power lines and trees in parts of Kentucky and Louisiana, the weather service said." CHAPTER 3. IMPLEMENTATION 21

3.2.2 Word tokenization

Word tokenization is done by splitting a sentence into multiple tokens, each representing a word. This was done with the Learnable Tokenizer [42] created by Apache OpenNLP, which is a maximum entropy tokenizer that detects token boundaries based on a probability model. The model used for this thesis was the default model created by Apache OpenNLP [41].

3.2.3 Stop words

The stop words were chosen based on the most frequent words in the training corpus. A list of the 100 most frequent words was created, and this list was manually filtered to keep only those words with no relevance to the domain. The complete stoplist, containing 71 words, can be found in Appendix A.2.

3.2.4 PoS filtering

The PoS tagger used for this thesis was the OpenNLP Part-of-Speech Tagger [43]. It decides the tag of a token using both the token itself and the context of the token. The PoS tag to use is decided with a probability model. The model used for this thesis was 'en-pos-maxent' [41]. Since the PoS tagging can possibly influence the results heavily, each method was tested with five different PoS filters to test which performed best. Table 3.1 displays these PoS filters. The word classes in the table are those kept by the filter.

PoS filter
-
Nouns
Nouns, Verbs
Nouns, Adjectives
Nouns, Verbs, Adjectives

Table 3.1: The PoS filters used for the evaluation. The word classes shown in the table are those that are kept by the filter.

3.2.5 Stemming

For stemming, Apache OpenNLP's implementation of the Porter stemming algorithm [44] was used. The algorithm works by iteratively removing the suffix of a word, ending up with a stem.

Each step of the algorithm consists of a set of rules of the form (condition) S1 → S2, where the suffix S1 of a word is replaced with S2 if the stem before the suffix satisfies the condition. The longest matching rule is applied, which means that the rule with the longest matching S1 is used if there are multiple rules that match. Below, the first set of rules of the Porter stemming algorithm is displayed.

SSES→SS

IES→I

SS→SS

S→ ε

This will, for example, reduce "possesses" to "possess" and "policies" to "polici", according to the first and the second rule.

This reduction is done in four steps, which are shown in their entirety in Appendix A.1.
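
Taken literally, the rule group above can be sketched as follows. Only the four step-1a suffix rules from the excerpt are shown; the OpenNLP Porter stemmer used in this thesis also applies the conditions and remaining steps listed in Appendix A.1.

/** The step-1a suffix rules of the Porter stemmer, exactly as listed above (illustrative sketch only). */
public class PorterStep1a {

    static String apply(String word) {
        // The longest matching suffix wins, so the rules are tried in order of decreasing suffix length.
        if (word.endsWith("sses")) return word.substring(0, word.length() - 4) + "ss"; // SSES -> SS
        if (word.endsWith("ies"))  return word.substring(0, word.length() - 3) + "i";  // IES  -> I
        if (word.endsWith("ss"))   return word;                                        // SS   -> SS
        if (word.endsWith("s"))    return word.substring(0, word.length() - 1);        // S    -> (empty)
        return word;
    }

    public static void main(String[] args) {
        System.out.println(apply("possesses")); // possess
        System.out.println(apply("policies"));  // polici
        System.out.println(apply("cats"));      // cat
    }
}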

3.3 Creation of the graph

There were basically two choices that had to be made when choosing the type of graph: if the graph should be weighted or unweighted, and if the graph should be directed or undirected. As explained in Section 2.1.2, Günes et al. suggest an unweighted graph, with edges between sentences if the similarity is above a predefined threshold, whilst Mihalcea et al. suggest a fully connected graph with edges between nodes stating the similarity between the corresponding sentences, no matter how similar they are. In the case of the former, which used cosine similarity of tf-idf vectors, the threshold approach is possible since the similarity score is always in the range of 0 to 1. However, in the case of the latter, using the term overlap measure, there are no bounds on how similar two sentences can be, and hence it is impossible to find a reasonable threshold to use. Since this thesis used the term overlap measure, the graph needed to be weighted for that measure, and to make the results comparable between the three similarity measures, the same approach was used for all of them. The similarity of two sentences could be considered symmetric. However, using an undirected graph would not have worked with weighted edges when running the PageRank algorithm. Instead, a directed graph was used, and two edges were created between two nodes, one in each direction, with the same value. After that, the values of the outedges from each node were normalized to sum to 1, to preserve the underlying semantics of the PageRank algorithm, where the outedges from each node are considered transition probabilities. This was the reason for using directed edges.

The graph was created by iterating through each pair of sentences and adding a directed edge in both directions with the value of the similarity between them. This is shown in Algorithm 1.

Data: N number of sentences S
for i ← 0 to N do
    for j ← i to N do
        if i ≠ j then
            addEdge(i, j, similarity(S(i), S(j)));
            addEdge(j, i, similarity(S(i), S(j)));
        end
    end
end
Algorithm 1: Pseudocode for the creation of the graph.

The similarity assessment was different for each similarity measure, and the rest of this section will go through how the similarity assessment was done for each of them.
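
A concrete version of Algorithm 1, including the outedge normalization described above, could look like the sketch below. It uses a plain adjacency matrix instead of the JUNG graph classes, and the similarity function is a placeholder for any of the three measures; all names are hypothetical.

import java.util.*;
import java.util.function.BiFunction;

/** Builds the weighted, directed similarity graph of Algorithm 1 and normalises each node's outedges to sum to 1. */
public class SimilarityGraph {

    static double[][] build(String[] sentences, BiFunction<String, String, Double> similarity) {
        int n = sentences.length;
        double[][] weight = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sim = similarity.apply(sentences[i], sentences[j]);
                weight[i][j] = sim;                  // one edge in each direction, with the same value
                weight[j][i] = sim;
            }
        }
        // Normalise the outedges of each node so that they act as transition probabilities.
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (double w : weight[i]) sum += w;
            if (sum > 0) {
                for (int j = 0; j < n; j++) weight[i][j] /= sum;
            }
        }
        return weight;
    }

    public static void main(String[] args) {
        String[] sentences = { "a b c", "a b d", "e f g" };
        // Placeholder similarity: shared-token count; the thesis plugs in tf-idf, overlap or semantic folding here.
        BiFunction<String, String, Double> sharedTokens = (s1, s2) -> {
            Set<String> shared = new HashSet<>(Arrays.asList(s1.split(" ")));
            shared.retainAll(Arrays.asList(s2.split(" ")));
            return (double) shared.size();
        };
        System.out.println(Arrays.deepToString(build(sentences, sharedTokens)));
    }
}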

3.3.1 Tf-idf

As explained in Section 2.1.3, tf-idf is defined as below, where tf(t, d) is the term frequency of term t in sentence d, N is the number of documents in the training corpus, and n_t is the number of documents in the training corpus containing the term t.

\text{tf}(t, d) \cdot \log \frac{N}{n_t}    (3.1)

To create a training corpus to learn the tf-idf statistics from, about 60% of the evaluation set was used. How this was done is further explained in Section 4.1. The cosine similarity of the tf-idf vectors was calculated as explained in Section 2.1.3.

3.3.2 Term overlap

This thesis used the term overlap measure with PoS filtering, as recommended by Mihalcea et al., which was explained in Section 2.1.3. It is defined as below, where S1 and S2 are sentences, and F is a function keeping only the tokens of the given word classes.

\frac{|F(S_1) \cap F(S_2)|}{\log(|S_1|) + \log(|S_2|)}    (3.2)

3.3.3 Semantic folding

Semantic folding is implemented by Cortical.io, where the fingerprint of each term of the corpus is indexed and stored in a database called Retina [4]. The Retina database is available through a public API [45]. This thesis used the Java client SDK [46] for accessing the API. The model used was trained on 400,000 Wikipedia documents [4]. The similarity between fingerprints was calculated with the cosine similarity between them.

3.4 Scoring of the graph

After the input had been converted to a graph, it was scored with the PageRank algorithm. This thesis used the PageRank implementation provided by the Java Universal Network/Graph Framework (JUNG) [47]. The key parameter of PageRank is the alpha value, which states the probability for the random surfer to jump to a random node at each transition in the graph. Brin et al. [24] suggested an alpha value of 0.85 in the reference paper of PageRank, and it is also a value often used by others [24, 48], so it was used for this thesis as well. The algorithm was executed until convergence for each run.

3.5 Picking the sentences

After the sentences were scored with PageRank, they were picked from highest score to lowest until a given threshold was met. This method was chosen over the knapsack approach since it is the method used by the reference paper of the TextRank algorithm and also by many other automatic summarization studies [2, 1, 13, 15], which makes it easier to compare the results of this thesis with other similar studies. The procedure is shown in pseudocode in Algorithm 2, where pop(S) removes and returns the highest-ranked remaining sentence.

Data: Set S of sentences ordered by rank, maximum length threshold t
summary ← ∅;
while numWords(summary) < t do
    summary ← summary + pop(S);
end
Algorithm 2: Pseudocode for the sentence picking procedure.

This gave a summary that was slightly longer than the threshold. When evaluating the summaries, they were split at the threshold word number to make the comparison between summaries fair.
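
A direct rendering of Algorithm 2 together with the word-level cut used at evaluation time might look as follows; this is a sketch with hypothetical names, not the exact implementation.

import java.util.Arrays;
import java.util.List;

/** Greedy sentence picking of Algorithm 2, followed by the word-level cut used at evaluation (illustrative sketch). */
public class SentencePicker {

    /** Sentences must be given in decreasing order of their PageRank score. */
    static String pick(List<String> rankedSentences, int wordThreshold) {
        StringBuilder summary = new StringBuilder();
        int words = 0;
        for (String sentence : rankedSentences) {
            if (words >= wordThreshold) break;            // stop once the threshold has been reached
            summary.append(sentence).append(' ');
            words += sentence.split("\\s+").length;
        }
        // The summary may now be slightly too long; cut it at the threshold for a fair comparison.
        String[] tokens = summary.toString().trim().split("\\s+");
        int keep = Math.min(wordThreshold, tokens.length);
        return String.join(" ", Arrays.copyOfRange(tokens, 0, keep));
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("The storm damaged several roofs.", "Power lines fell.", "Schools were closed.");
        System.out.println(pick(ranked, 8));
    }
}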

Chapter 4

Evaluation

The automatic summarizers were evaluated by comparing their output summaries with the gold-standard summaries of DUC using the ROUGE-N measure. This is explained in more detail throughout this chapter.

4.1 Data

The evaluation data consisted of 438 documents from DUC 2002, divided into 60 document sets, each describing a specific news event [49]. The partition of the documents into test, training and validation sets is shown in Table 4.1. Out of the total 438 documents, 136 were used for testing, 262 for training and 40 for validation. In order to create a test set evenly balanced among the different news topics, documents were chosen for the test set randomly from each document set in an iterative fashion. The training documents were used for calculating the idf statistics for the summarizer using tf-idf as similarity measure. Each evaluation document from DUC came with a set of corresponding gold-standard summaries from human judges.

      Test    Training   Validation   Total
#     136     262        40           438
%     0.311   0.598      0.091        1.00

Table 4.1: The partition of the evaluation data into test, training and validation sets.

4.2 Gold-standard comparison

As explained in Section 2.4, there are two different types of evaluation: intrinsic and extrinsic, and for this thesis intrinsic evaluation was chosen. A factor for this was


that extrinsic evaluation requires a set of test persons, which is not needed for intrinsic evaluation. Furthermore, intrinsic evaluation is the most used of the two [11], and the reference papers of TextRank use intrinsic evaluation [2, 1]. To assess the performance of each summary from the test set created by the summarizers of this thesis (system summaries), they were compared to the gold-standard abstract summaries of DUC (reference summaries). The regular precision and recall measurements of individual sentences cannot be used, since the system summaries are in the form of extracts and the reference summaries are in the form of abstracts. Instead, the ROUGE-N metric from the ROUGE metric set was used, which is a metric created for evaluating automatic summarizations and makes it possible to compare extracts with abstracts [5]. ROUGE-N was used to compare each system summary with the corresponding reference summary using n-gram overlaps. In the context of n-gram overlapping, precision is defined based on the number of n-grams in the system summary that are present in the reference summary, and recall is defined based on the number of n-grams in the reference summary that are present in the system summary. For this thesis, 1-grams were used since they have been shown to be a reasonable choice when comparing summaries [5, 50].

4.3 Preprocessing of system and reference summaries

Before the n-gram overlap was calculated (as described in Section 4.2), both the system summary and the reference summary were stemmed. The reason for this is that the system summaries were in the form of extracts, but the reference summaries were in the form of abstracts, which means that the same information from the system summary could possibly be described in the reference summary with a different inflection. With stemming applied, the words can be compared by their stems instead.

4.4 Measures

ROUGE-N uses three measures: precision, recall and F1 score.

4.4.1 Precision

Precision is the fraction of the n-grams in the system summary that are also present in the reference summary, among all the n-grams in the system summary.

\text{precision} = \frac{|\{\text{n-grams in reference summary}\} \cap \{\text{n-grams in system summary}\}|}{|\{\text{n-grams in system summary}\}|}

Precision measures the fraction of n-grams in the system summary that are relevant, according to the reference summary.

4.4.2 Recall

Recall is the fraction of the n-grams that are present in both the system summary and the reference summary, among all the n-grams in the reference summary.

\text{recall} = \frac{|\{\text{n-grams in reference summary}\} \cap \{\text{n-grams in system summary}\}|}{|\{\text{n-grams in reference summary}\}|}

Recall measures the fraction of the n-grams that are relevant according to the reference summary that are present in the system summary.

4.4.3 F1 score

The F1 score is the harmonic mean of precision and recall, with the purpose of getting a weighted average of the two. The F1 score is defined as below.

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
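
Putting the three definitions together for unigrams, a simplified ROUGE-1 computation could be sketched as below. It counts distinct unigrams after the stemming described in Section 4.3 has been applied, which is a simplification of the official ROUGE toolkit; tokens and names are hypothetical.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified ROUGE-1: precision, recall and F1 over distinct unigrams (illustrative sketch, not the official toolkit). */
public class Rouge1 {

    static double[] score(List<String> systemTokens, List<String> referenceTokens) {
        Set<String> system = new HashSet<>(systemTokens);
        Set<String> reference = new HashSet<>(referenceTokens);
        Set<String> overlap = new HashSet<>(system);
        overlap.retainAll(reference);                               // unigrams present in both summaries

        double precision = system.isEmpty() ? 0 : (double) overlap.size() / system.size();
        double recall = reference.isEmpty() ? 0 : (double) overlap.size() / reference.size();
        double f1 = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        List<String> system = List.of("storm", "damag", "roof", "town");
        List<String> reference = List.of("storm", "damag", "hous", "town", "tree");
        double[] s = score(system, reference);
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", s[0], s[1], s[2]);
    }
}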

4.4.4 Upper bound of the ROUGE score

Since it is not possible to rewrite sentences in extractive summarization, it is not possible to get a perfect score: the gold-standard summaries will be different and hence impossible to match exactly. Bengtsson et al. [13] did an experiment on the DUC 2002 data where the summaries were constructed by first selecting sentences from the gold standard and then picking the sentences in the source text that most resembled each such sentence. Even if this does not necessarily give the formal upper bound of the score, it gives an idea of the largest possible score. That summarizer got an F1 score of 0.5657.

Chapter 5

Results

5.1 The average score

Table 5.1 shows the results for the summarizer using semantic folding as similarity measure, Table 5.2 shows the results for the overlap measure, and Table 5.3 shows the results for the tf-idf measure. Each row displays the results of a PoS filter, and the columns show the recall, precision and F1 score.

For semantic folding, the summarizer with the PoS filter using nouns, adjectives and verbs performed best for all three measures, with 0.4566, 0.4622 and 0.4592 for recall, precision and F1 score respectively. For overlap, the summarizer with the PoS filter using nouns and adjectives performed best for all three measures, with 0.4660, 0.4712 and 0.4684 for recall, precision and F1 score respectively. For tf-idf, the summarizer with the PoS filter using nouns, adjectives and verbs performed best for all three measures, with 0.4662, 0.4718 and 0.4688 for recall, precision and F1 score respectively.

Table 5.4 shows the overall score for each method and its corresponding best-performing PoS filter. The best summarizer overall was the one using the tf-idf similarity measure with a PoS filter of nouns, adjectives, and verbs. This summarizer had a recall that was 0.0002 higher than overlap and 0.0096 higher than semantic folding, a precision 0.0006 and 0.0096 higher than overlap and semantic folding, and an F1 score 0.0004 and 0.0096 higher than overlap and semantic folding.

Table 5.4 also shows the results of a random summarizer picking sentences randomly. This summarizer scores 0.0624, 0.0632 and 0.0628 lower for recall, precision and F-score than the one using semantic folding as similarity measure. The same numbers for the tf-idf similarity measure are 0.072, 0.0728 and 0.0724, and for overlap 0.0718, 0.0722 and 0.072. The median and standard deviation of the scores are displayed in Appendix B.1 and B.2 respectively.


PoS-filter    Recall   Precision   F1
-             0.4511   0.4567      0.4537
N             0.4556   0.4605      0.4579
N, Adj        0.4551   0.4601      0.4574
N, Vb         0.4514   0.4571      0.4540
N, Adj, Vb    0.4566   0.4622      0.4592

Table 5.1: The average ROUGE-1 scores for using TextRank with semantic folding as similarity measure.

PoS-filter    Recall   Precision   F1
-             0.4385   0.4440      0.4410
N             0.4645   0.4695      0.4668
N, Adj        0.4660   0.4712      0.4684
N, Vb         0.4474   0.4540      0.4505
N, Adj, Vb    0.4490   0.4551      0.4519

Table 5.2: The average ROUGE-1 scores for using TextRank with term overlap as similarity measure.

PoS-filter    Recall   Precision   F1
-             0.4592   0.4647      0.4618
N             0.4612   0.4667      0.4638
N, Adj        0.4640   0.4701      0.4668
N, Vb         0.4616   0.4674      0.4643
N, Adj, Vb    0.4662   0.4718      0.4688

Table 5.3: The average ROUGE-1 scores for using TextRank with cosine similarity of tf-idf vectors as similarity measure.

Similarity measure   PoS filter    Recall   Precision   F1
Semantic Folding     N, Adj, Vb    0.4566   0.4622      0.4592
Tf-idf               N, Adj, Vb    0.4662   0.4718      0.4688
Overlap              N, Adj        0.4660   0.4712      0.4684
Random-sim           -             0.3942   0.3990      0.3964

Table 5.4: The best-performing PoS filter for each method, together with the score of the random summarizer.

Chapter 6

Discussion

This chapter discusses certain aspects of this thesis: Section 6.1 covers the chosen methodology, Section 6.2 the results, and Section 6.3 some potential criticism.

6.1 Methodology

A different possible methodology would have been to implement only the semantic folding based algorithm and compare the results with the evaluation by Mihalcea and Tarau [1], who evaluated TextRank with term overlap, and by Erkan and Radev [2], who evaluated TextRank with cosine similarity of tf-idf vectors. However, since the preprocessing is not presented in detail in either of those papers, it would have been hard to implement an algorithm comparable with theirs. Another limitation of this proposed methodology is that neither paper states whether the reported ROUGE-N scores refer to precision, recall or F-score, which would have made the comparison hard.

Section 2.4 described different methods for evaluating an automatic summarization system and pointed out that the goal of intrinsic evaluation is to measure the informativeness and readability of the summary. It is important to point out that the evaluation method chosen for this thesis does not evaluate the readability of the summaries, but only their information content. This means that the results of this thesis might be different if evaluated with readability in mind or with extrinsic methods. The TextRank algorithm was constructed to maximize information content rather than readability or other properties of a summarization system, and hence the evaluation was chosen accordingly.

As mentioned in Section 4.4.4 there is an upper bound to the ROUGE-N score, since the gold-standard abstracts differ from the extractive system summaries and a perfect score is therefore impossible to obtain. An experimental upper bound for the DUC 2002 data has been found to be around 0.5657, but it is hard to give any hard limit on what can be achieved by an algorithm on this dataset. The score is nevertheless useful when comparing different algorithms, and that is why it was chosen for this thesis.

The sentence picking strategy (see Section 3.5) chosen for this thesis was the naive approach, where sentences are added to the summary one by one in decreasing order of score until the length threshold is met, after which the summary is chopped off at 100 words. This approach guarantees that the most important sentences are chosen for the summary, but does not guarantee that the optimal score for the summary as a whole is found. Choosing sentences with the knapsack algorithm guarantees an optimal solution for the summary as a whole, but does not guarantee that the most important sentences end up in the summary. It is possible that the latter approach would yield a different result for this thesis, but published evaluations of automatic summarization typically use the former approach, and hence the results of this thesis can be compared with other studies, which would not have been possible with the knapsack approach.
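A minimal sketch of this naive strategy is shown below, assuming the sentences have already been scored by TextRank and that the word count is approximated by whitespace splitting; the class and parameter names are illustrative, not the exact implementation of this thesis.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the naive picking strategy: take sentences in decreasing score
// order until the word limit (100 words for DUC 2002) is reached; the final
// summary text would then be truncated at exactly that limit.
public final class GreedyPicker {

    public static List<String> pick(Map<String, Double> sentenceScores, int wordLimit) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(sentenceScores.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));   // highest score first

        List<String> summary = new ArrayList<>();
        int words = 0;
        for (Map.Entry<String, Double> entry : ranked) {
            if (words >= wordLimit) {
                break;                                    // threshold met, stop adding sentences
            }
            summary.add(entry.getKey());
            words += entry.getKey().split("\\s+").length; // rough word count of the sentence
        }
        return summary;
    }
}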

As described in Section 3.3, the out-edges of all nodes in the similarity graph were normalized to sum to one in order to maintain the original semantics of the PageRank algorithm. Normalizing the out-edges keeps the interpretation of a random walk over the graph with probabilities. Dropping this property of the original PageRank algorithm would be possible, and could possibly affect the results.
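As a sketch of what this normalization does, assume the similarity graph is kept as a nested map from sentence index to weighted out-edges (an assumed representation for illustration, not the JUNG graph structure used in the implementation):

import java.util.Map;

// Sketch: scale the out-edge weights of every node so that they sum to one,
// preserving the random-walk (probability) interpretation used by PageRank.
public final class OutEdgeNormalizer {

    public static void normalize(Map<Integer, Map<Integer, Double>> graph) {
        for (Map<Integer, Double> outEdges : graph.values()) {
            double sum = outEdges.values().stream().mapToDouble(Double::doubleValue).sum();
            if (sum == 0.0) {
                continue;                                  // isolated node: nothing to normalize
            }
            for (Map.Entry<Integer, Double> edge : outEdges.entrySet()) {
                edge.setValue(edge.getValue() / sum);      // weight becomes a transition probability
            }
        }
    }
}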

6.2 Results

To be considered a valid choice of similarity measure for TextRank, a measure has to at least outperform the random summarizer. All three methods in this thesis outperform the random summarizer and can therefore be considered valid choices. Semantic folding is competitive as a similarity measure for TextRank, since it scores close to the other two measures and clearly beats the random baseline. There is, however, a small difference: semantic folding performs less than one percentage point worse than both tf-idf and term overlap on all measures. Considering that semantic folding, unlike the tf-idf model, was not trained on domain-specific data, one could argue that it is comparable to the tf-idf and overlap measures and would possibly perform better if trained on domain-specific data. This is, however, not a conclusion that can be drawn with certainty from this study, but rather something that should be investigated in future work.

The performance of the random summarizer could perhaps appear quite high, with scores only about 0.06–0.07 lower than the other summarizers. A likely reason for the fairly high score is that the evaluation documents are short and therefore dense with information. This means that however the sentences are chosen, they will often contain at least some important points, which leads to the reasonably high score of the random summarizer. For longer texts, however, the gap between the random summarizer and the others would most likely be greater.

As described in Section 3.2.4, the similarity measures were evaluated with five different PoS filters. For semantic folding, the PoS filter keeping nouns, adjectives and verbs performed best, as can be seen in Table 5.4 in Chapter 5; the second best was the filter keeping only nouns. For cosine similarity of tf-idf, the best-performing filter was the same as for semantic folding, but the second best was the one keeping nouns and adjectives. For term overlap, the filter keeping nouns and adjectives performed best, with the filter keeping only nouns second best.

From the results it is safe to say that PoS filtering increases the performance of automatic summarization for the similarity measures of this thesis: omitting PoS filtering entirely resulted in the worst score for all measures with all methods. The overlap measure seems to suffer most from removing the PoS filter, scoring 2.74 percentage points lower in F1 than with its best-performing filter; the corresponding differences are 0.55 and 0.70 percentage points for semantic folding and tf-idf respectively. A possible reason could be that, for the overlap measure, the PoS filter is the only mechanism determining the importance of terms, whereas with the tf-idf measure the idf statistics can still capture term importance even without the filter.
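As an illustration of the kind of filter discussed here, the sketch below keeps nouns, adjectives and verbs, assuming Penn Treebank style tags (NN*, JJ*, VB*) aligned with the token list, as produced for example by the OpenNLP tagger [43]. The tag prefixes and class name are assumptions for illustration, not the exact filter definitions used in this thesis.

import java.util.ArrayList;
import java.util.List;

// Sketch of a PoS filter keeping nouns, adjectives and verbs,
// given tokens and their PoS tags at matching positions.
public final class PosFilter {

    public static List<String> keep(List<String> tokens, List<String> tags) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (tag.startsWith("NN") || tag.startsWith("JJ") || tag.startsWith("VB")) {
                kept.add(tokens.get(i));   // token survives the filter
            }
        }
        return kept;
    }
}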

Another consideration is that term overlap is essentially language independent, in contrast to both tf-idf and semantic folding: semantic folding needs to be trained on a text corpus to learn word contexts, and tf-idf needs to learn the idf statistics. Term overlap requires no such training (not taking into account any preprocessing that might be language dependent). So when the domain is uncertain, or when no training corpus is available, term overlap could be a better choice than semantic folding and tf-idf.

6.3 Criticism

A possible source of criticism is that the tf-idf summarizer was trained on the evaluation corpus while the semantic folding model was trained on Wikipedia. The most likely effect of this is that tf-idf gains an advantage in the comparison with semantic folding. Since the research question is whether semantic folding is feasible as a similarity measure, this is not a problem: it raises the bar rather than lowers it, and semantic folding still performs comparably with tf-idf. Had it been the other way around, with semantic folding having the advantageous setting in the comparison, it would have been a problem for answering the research question.

The implementation of semantic folding by Cortical.io is a black box, which means that the details of the implementation are hidden and cannot be reviewed. However, this issue does not prevent answering the research question of whether semantic folding is feasible to use as a similarity measure in automatic summarization.

6.4 Ethics and sustainability

There were no ethical issues directly involved with the research in this thesis. A possible risk concerning automatic summarization in general, however, is that major and important points could be excluded from a text if the summarization is done without human supervision, which may turn the text into something other than what its author intended. If care is not taken, a text could become misleading and spread incorrect information.

Chapter 7

Conclusion

Using semantic folding as similarity measure in TextRank performed less than one percentage point worse than using cosine similarity of tf-idf vectors, which was the similarity measure that scored best in the evaluation of this thesis. Since the difference between the measures is so small, semantic folding can be considered a viable choice of similarity measure in TextRank, compared with the other similarity measures evaluated here. However, the choice of similarity measure depends on more than the scores on the DUC documents. Semantic folding is a viable choice when constructing an automatic summarization system if the domain is known and there are resources available to train the model. If the domain is unknown, or if resources are limited, another similarity measure such as term overlap could be a better choice. Furthermore, according to the results, semantic folding is not as sensitive to the choice of PoS filter as the two other similarity measures evaluated, which could also be a factor to consider when choosing similarity measure. The PoS filter that performed best for semantic folding was the one keeping nouns, adjectives and verbs.

7.1 Future work

This section concludes the thesis with some suggestions for future work.

7.1.1 Train model on specific domain

This project used a general semantic folding model trained on English Wikipedia articles, which probably had a negative effect on the results for semantic folding. A relevant improvement for future work would therefore be to train the semantic folding model on a domain relevant to the evaluation; the semantic folding similarity measure would probably benefit from a model trained on the domain it will be used on.


7.1.2 Preprocessing

Another relevant direction for future work would be to investigate the preprocessing for semantic folding more thoroughly and how it affects the results. This thesis only evaluated different PoS filters, but other preprocessing steps such as stop word removal, stemming or lemmatization could also impact the results.

7.1.3 Extrinsic evaluation

Only intrinsic evaluation was done for this thesis. It is possible that extrinsic evaluation would reveal properties of the summaries that were not found with intrinsic evaluation alone. Hence, performing extrinsic evaluation and observing how this impacts the results would be a relevant addition to this thesis.

7.1.4 Evaluate readability

As stated in Section 2.4, there are essentially two criteria for evaluating a summary with intrinsic evaluation: readability and informativeness. For this thesis, only the informativeness of the summaries was considered during evaluation, and it would therefore be interesting to investigate what the results would be with more focus on the readability of the summaries instead.

Bibliography

[1] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W04-3252.

[2] Günes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., pages 457–479, USA, December 2004. AI Access Foundation. URL http://dl.acm.org/citation.cfm?id=1622487.1622501.

[3] Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54, 2008. ISSN 11202726. URL https://www.diva-portal.org/smash/get/diva2:1041938/FULLTEXT01.pdf.

[4] Francisco De Sousa Webber. Semantic folding theory and its application in semantic fingerprinting. CoRR, abs/1511.08855, 2015. URL http://arxiv.org/abs/1511.08855.

[5] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), (1):25–26, 2004. ISSN 00036951. URL http://anthology.aclweb.org/W/W04/W04-1013.pdf.

[6] Julia Murphy and Max Roser. Internet, 2017. URL https://ourworldindata.org/internet/. Accessed: 2017-02-21.

[7] The Economist. Data, data everywhere, 2010. URL http://www.economist.com/node/15557443. Accessed: 2017-02-21.

[8] H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, April 1958. ISSN 0018-8646. doi: 10.1147/rd.22.0159. URL http://dx.doi.org/10.1147/rd.22.0159.

[9] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, April 1969. ISSN 0004-5411. doi: 10.1145/321510.321519. URL http://doi.acm.org/10.1145/321510.321519.


[10] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95, pages 68–73, New York, NY, USA, 1995. ACM. ISBN 0-89791-714-6. doi: 10.1145/215206.215333. URL http://doi.acm.org/10.1145/215206.215333.

[11] Martin Hassel. Resource lean and portable automatic text summarization. Trita-CSC-A, ISSN 1653-5723; 2007:9, 2007. ISBN 978-91-7178-704-0. doi: 10.1017/CBO9781107415324.004. URL http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A12198&dswid=4310.

[12] Bianca Bosker. Google knowledge graph could make clicking unnecessary, 2017. URL http://www.huffingtonpost.com/2012/05/16/google-knowledge-graph_n_1521292.html. Accessed: 2017-04-27.

[13] Jonatan Bengtsson and Christoffer Skeppstedt. Automatic extractive single document summarization: an unsupervised approach. Chalmers, Gothenburg, Sweden, 2012. URL http://publications.lib.chalmers.se/records/fulltext/174136/174136.pdf.

[14] Chin-Yew Lin and Eduard Hovy. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 457–464, 2002. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.5675.

[15] Niladri Chatterjee and Shiwali Mohan. Extraction-based single-document summarization using random indexing. In Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, 2007. ISBN 076953015X. doi: 10.1109/ICTAI.2007.28. URL http://ieeexplore.ieee.org/document/4410421/.

[16] René Arnulfo García-Hernández, Romyna Montiel, Yulia Ledeneva, Eréndira Rendón, Alexander Gelbukh, and Rafael Cruz. Text Summarization by Sentence Extraction Using Unsupervised Learning, pages 133–143. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. ISBN 978-3-540-88636-5. doi: 10.1007/978-3-540-88636-5_12. URL http://dx.doi.org/10.1007/978-3-540-88636-5_12.

[17] Daniel S. Leite, Lucia Helena Machado Rino, Thiago A. S. Pardo, and Maria Das Gracas Volpe Nunes. Extractive automatic summarization: Does more linguistic knowledge make a difference? São Carlos - SP, Brazil, 2007. Núcleo Interinstitucional de Lingüística Computacional (NILC). URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.145.9662.

[18] Daniel Saraiva Leite and Lucia Helena Rino. Combining multiple features for automatic text summarization through machine learning. In Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language, PROPOR '08, pages 122–132, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-85979-6. doi: 10.1007/978-3-540-85980-2_13. URL http://dx.doi.org/10.1007/978-3-540-85980-2_13.

[19] Vishal Gupta and Gurpreet Singh Lehal. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2(3):258–268, August 2010. URL https://www.researchgate.net/publication/228619779_A_Survey_of_Text_Summarization_Extractive_Techniques.

[20] Inderjeet Mani and Eric Bloedorn. Multi-document summarization by graph search and matching. Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, cmp-lg/9712004:622–628, 1997. doi: 10.3115/1119467.1119476. URL http://arxiv.org/abs/cmp-lg/9712004.

[21] Eduard Hovy and Chin-Yew Lin. Automated text summarization in SUMMARIST. Advances in Automatic Text Summarization, pages 81–94, 1999. doi: 10.3115/1119089.1119121. URL http://research.microsoft.com/en-us/um/people/cyl/download/papers/mit-book-paper-final-cyl.pdf.

[22] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. Information Processing & Management, 40(6):919–938, 2004. ISSN 03064573. doi: 10.1016/j.ipm.2003.10.006. URL http://arxiv.org/abs/cs/0005020.

[23] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. URL http://ilpubs.stanford.edu:8090/422/. Previous number = SIDL-WP-1999-0120.

[24] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998. ISSN 0169-7552. doi: 10.1016/S0169-7552(98)00110-X. URL http://www.sciencedirect.com/science/article/pii/S016975529800110X.

[25] Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965. ISSN 00010782. doi: 10.1145/365628.365657. URL http://dl.acm.org/citation.cfm?id=365657.

[26] David E. Rumelhart and David Zipser. Feature discovery by competitive learning. Cognitive Science, 9(1):75–112, 1985. ISSN 1551-6709. doi: 10.1207/s15516709cog0901_5. URL http://dx.doi.org/10.1207/s15516709cog0901_5.

[27] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

[28] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, MA, 2008.

[29] The Stanford Natural Language Processing Group. Stanford log-linear part-of-speech tagger. URL https://nlp.stanford.edu/software/tagger.shtml. Accessed: 2017-04-19.

[30] Wikipedia. Stemming. URL https://en.wikipedia.org/wiki/Stemming. Accessed: 2017-02-20.

[31] Wikipedia. Lemmatisation. URL https://en.wikipedia.org/wiki/Lemmatisation. Accessed: 2017-02-20.

[32] Inderjeet Mani. Summarization evaluation: An overview. Reston, VA, 2001. The MITRE Corporation. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.2078.

[33] Scott Deerwester, Susan T. Dumais, George W. Furnas, T. K. Landauer, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990. ISSN 00028231. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. URL http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9/abstract.

[34] Magnus Sahlgren. An introduction to random indexing. Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, pages 1–9, 2005. ISSN 1570-7075. doi: 10.1.1.96.2230. URL http://krextown.googlecode.com/svn-history/r41/trunk/infomap/RI_intro.pdf.

[35] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013. URL https://arxiv.org/abs/1310.4546.

[36] Niladri Chatterjee and Pramod Kumar Sahoo. Random indexing and modified random indexing based approach for extractive text summarization. In Proceedings - Computer Speech and Language, 2014. doi: 10.1016. URL http://www.sciencedirect.com/science/article/pii/S0885230814000722.

[37] Liu Jiangzhen and Ning Jianfei. Using word2vec with TextRank to extract keywords. Data Analysis and Knowledge Discovery, 32(6):20, 2016. doi: 10.11925/infotech.1003-3513.2016.06.03. URL http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/abstract/article_4233.shtml.

[38] Yihong Gong and Xin Liu. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '01, pages 19–25, 2001. ISSN 01635840. doi: 10.1145/383952.383955. URL http://portal.acm.org/citation.cfm?doid=383952.383955.

[39] Makbule Gulcin Ozsoy, Ferda Nur Alpaslan, and Ilyas Cicekli. Text summarization using latent semantic analysis. J. Inf. Sci., 37(4):405–417, August 2011. ISSN 0165-5515. doi: 10.1177/0165551511408848. URL http://dx.doi.org/10.1177/0165551511408848.

[40] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: Sentence detector. URL https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect. Accessed: 2017-03-29.

[41] Apache OpenNLP Development Community. Models for 1.5 series. URL http://opennlp.sourceforge.net/models-1.5/. Accessed: 2017-05-05.

[42] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: Tokenization. URL https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.tokenizer.cmdline. Accessed: 2017-05-05.

[43] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: Part-of-speech tagger. URL https://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger. Accessed: 2017-04-20.

[44] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: PorterStemmer. URL https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/stemmer/PorterStemmer.html. Accessed: 2017-05-05.

[45] Cortical.io. Cortical.io semantic folding API. URL http://www.cortical.io/product_retina_api.html. Accessed: 2017-03-01.

[46] Cortical.io. Cortical.io Java client SDK. URL https://github.com/cortical-io/java-client-sdk. Accessed: 2017-03-01.

[47] JUNG. Java class PageRank. URL http://jung.sourceforge.net/doc/api/edu/uci/ics/jung/algorithms/scoring/PageRank.html. Accessed: 2017-03-01.

[48] Amy N. Langville and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2012. ISBN 0691152667, 9780691152660. URL http://press.princeton.edu/titles/8216.html.

[49] National Institute of Standards and Technology (NIST). DUC 2002 guidelines. URL http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html. Accessed: 2017-04-18.

[50] Chin-Yew Lin and Marina Rey. Looking for a few good metrics: ROUGE and its evaluation. NTCIR Workshop, (June):2–4, 2004. ISSN 15397890. URL https://www.researchgate.net/publication/229010212_Looking_for_a_few_good_metrics_ROUGE_and_its_evaluation.

Appendix A

Preprocessing

A.1 Steps of the Porter stemming algorithm

The rules of each step are listed below, with ε denoting the empty suffix; a short code sketch after the rule listing illustrates how the rules are applied.

Step 1a SSES→SS

IES→I

SS→SS

S→ ε

Step 1b (m > 0) EED→EE

(∗v∗)ED→ ε

(∗v∗)ING→ ε

Step 1b’ This step is done if rule two or three in step 1b matched.

AT→ATE

BL→BLE

IZ→IZE

(∗d ∧ ¬(∗L∨∗S∨∗Z))→prefix without the last letter

(m = 1∧∗o)→E


Step 2 (m > 0) ATIONAL→ATE

(m > 0) TIONAL→TION

(m > 0) ENCI→ENCE

(m > 0) ANCI→ANCE

(m > 0) IZER→IZE

(m > 0) ABLI→ABLE

(m > 0) ALLI→AL

(m > 0) ENTLI→ENT

(m > 0) ELI→E

(m > 0) OUSLI→OUS

(m > 0) IZATION→IZE

(m > 0) ATION→ATE

(m > 0) IVENESS→IVE

(m > 0) FULNESS→FUL

(m > 0) OUSNESS→OUS

(m > 0) IVITI→IVE

(m > 0) BILITI→BLE

Step 3 (m > 0) ICATE→IC

(m > 0) ATIVE→ ε

(m > 0) ALIZE→AL

(m > 0) ICITI→IC

(m > 0) ICAL→IC

(m > 0) FUL→ ε

(m > 0) NESS→ ε

Step 4 (m > 1) AL→ ε

(m > 1) ANCE→ ε

(m > 1) ENCE→ ε

(m > 1) ER→ ε

(m > 1) IC→ ε

(m > 1) ABLE→ ε

(m > 1) IBLE→ ε

(m > 1) ANT→ ε

(m > 1) EMENT→ ε

(m > 1) MENT→ ε

(m > 1) ENT→ ε

(m > 1∧(∗S∨∗T)) ION→ ε

(m > 1) OU→ ε

(m > 1) ISM→ ε

(m > 1) ATE→ ε

(m > 1) ITI→ ε

(m > 1) OUS→ ε

(m > 1) IVE→ ε

(m > 1) IZE→ ε

Step 5a (m > 1) E→ ε

(m = 1∧¬∗o) E→ ε

Step 5b (m > 1∧∗d∧∗L)→prefix without the last letter
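To illustrate how the rules above are read, the sketch below implements Step 1a only: the first matching suffix rule rewrites the end of the word. It is not a complete Porter stemmer; the actual implementation used in this thesis was the OpenNLP PorterStemmer [44], and the class name here is purely illustrative.

// Minimal sketch of Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (empty).
public final class Step1a {

    public static String apply(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2);    // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2);    // ponies -> poni
        }
        if (word.endsWith("ss")) {
            return word;                                    // caress -> caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);    // cats -> cat
        }
        return word;                                        // no rule matched
    }
}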

A.2 The stoplist used

the, and, of, in, to, on, for, that, by, was, is, at, it, said, from, with, as, an, be, have, he, but, who, had, were, year, been, has, are, not, him, about, after, two, which, new, this, there, their, other, when, all, would, more, also, or, no, will, than, time, we, out, if, some, could, can, them, where, what, just, because, off, so, did, make, while, still, she, then, her, want

Figure A.1: A list of some terms usually considered stop words.

Appendix B

Results

B.1 The median score

PoS-filter    Recall   Precision   F1
-             0.4575   0.4549      0.4556
N             0.4528   0.4550      0.4538
N, Adj        0.4477   0.4605      0.4557
N, Vb         0.4490   0.4559      0.4526
N, Adj, Vb    0.4586   0.4650      0.4601

Table B.1: The median ROUGE-1 scores for using TextRank with semantic folding as similarity measure.

PoS-filter    Recall   Precision   F1
-             0.4412   0.4366      0.4371
N             0.4675   0.4681      0.4684
N, Adj        0.4691   0.4757      0.4716
N, Vb         0.4583   0.4602      0.4594
N, Adj, Vb    0.4568   0.4573      0.4561

Table B.2: The median ROUGE-1 scores for using TextRank with term overlap as similarity measure.


PoS-filter    Recall   Precision   F1
-             0.4654   0.4709      0.4698
N             0.4616   0.4707      0.4652
N, Adj        0.4712   0.4803      0.4750
N, Vb         0.4602   0.4683      0.4652
N, Adj, Vb    0.4787   0.4781      0.4816

Table B.3: The median ROUGE-1 scores for using TextRank with cosine similarity of tf-idf vectors as similarity measure.

B.2 The standard deviation score

PoS-filter    Recall    Precision   F1
-             0.08940   0.08874     0.08846
N             0.08590   0.08454     0.08467
N, Adj        0.08574   0.08472     0.08467
N, Vb         0.08634   0.08759     0.08644
N, Adj, Vb    0.08870   0.08880     0.08818

Table B.4: The standard deviation of the ROUGE-1 scores for using TextRank with semantic folding as similarity measure.

PoS-filter    Recall    Precision   F1
-             0.09158   0.09173     0.09116
N             0.08545   0.08610     0.08515
N, Adj        0.08119   0.08184     0.08085
N, Vb         0.08614   0.08620     0.08566
N, Adj, Vb    0.08139   0.08265     0.08143

Table B.5: The standard deviation of the ROUGE-1 scores for using TextRank with term overlap as similarity measure.

PoS-filter    Recall    Precision   F1
-             0.07810   0.07975     0.07834
N             0.08279   0.08324     0.08242
N, Adj        0.07943   0.08042     0.07934
N, Vb         0.08472   0.08392     0.08372
N, Adj, Vb    0.08285   0.08372     0.08274

Table B.6: The standard deviation of the ROUGE-1 scores for using TextRank with cosine similarity of tf-idf vectors as similarity measure.