DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2017

Using semantic folding with TextRank for automatic summarization

SIMON KARLSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Using semantic folding with TextRank for automatic summarization

SIMON KARLSSON [email protected]

Master in Computer Science
Date: June 27, 2017
Principal: Findwise AB
Supervisor at Findwise: Henrik Laurentz
Supervisor at KTH: Stefan Nilsson
Examiner at KTH: Olov Engwall
Swedish title: TextRank med semantisk vikning för automatisk sammanfattning

Abstract

This master thesis deals with automatic summarization of text and how semantic folding can be used as a similarity measure between sentences in the TextRank algorithm. The method was implemented and compared with two common similarity measures. These two similarity measures were cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences.

The three methods were implemented, and the linguistic features used in the construction were stop words, part-of-speech filtering and stemming. Five different part-of-speech filters were used, with different mixtures of nouns, verbs, and adjectives.

The three methods were evaluated by summarizing documents from the Document Understanding Conference and comparing them to gold-standard summaries created by human judges. Comparison between the system summaries and gold-standard summaries was made with the ROUGE-1 measure. The algorithm with semantic folding performed worst of the three methods, but only 0.0096 worse in F-score than cosine similarity of tf-idf vectors, which performed best. For semantic folding, the average precision was 46.2% and recall 45.7% for the best-performing part-of-speech filter.

Sammanfattning

Det här examensarbetet behandlar automatisk textsammanfattning och hur semantisk vikning kan användas som likhetsmått mellan meningar i algoritmen TextRank. Metoden implementerades och jämfördes med två vanliga likhetsmått. Dessa två likhetsmått var cosinus-likhet mellan tf-idf-vektorer samt antal överlappande termer i två meningar.

De tre metoderna implementerades och de lingvistiska särdragen som användes vid konstruktionen var stoppord, filtrering av ordklasser samt en avstämmare. Fem olika filter för ordklasser användes, med olika blandningar av substantiv, verb och adjektiv.

De tre metoderna utvärderades genom att sammanfatta dokument från DUC och jämföra dessa mot guldsammanfattningar skapade av mänskliga domare. Jämförelse mellan systemsammanfattningar och guldsammanfattningar gjordes med måttet ROUGE-1. Algoritmen med semantisk vikning presterade sämst av de tre jämförda metoderna, dock bara 0.0096 sämre i F-score än cosinus-likhet mellan tf-idf-vektorer som presterade bäst. För semantisk vikning var den genomsnittliga precisionen 46.2% och recall 45.7% för det ordklassfiltret som presterade bäst.

Acknowledgements

I would like to thank

Henrik Laurentz for invaluable guidance throughout the entire project. This thesis would simply not be what it is if it was not for you,

Simon Stenström for the initial discussions that made this thesis possible,

Stefan Nilsson for your feedback on the scientific approach of the project,

Olov Engwall for examining this thesis,

Findwise for making me feel welcome every single day,

Josefine for your never ending love and support throughout my years at KTH,

Family and friends for always being there for me.

Contents

1 Introduction
  1.1 Problem definition
  1.2 Objective
  1.3 Purpose
  1.4 Delimitations
  1.5 Outline of the report

2 Theory
  2.1 Automatic summarization
    2.1.1 Categories of automatic summarization
    2.1.2 TextRank: Automatic summarization using graph representations
    2.1.3 Sentence similarity
  2.2 Distributional semantics
    2.2.1 Semantic folding
  2.3 Linguistic features
    2.3.1 Tokenization
    2.3.2 Stop words
    2.3.3 PoS tagging
    2.3.4 Stemming
    2.3.5 Lemmatisation
  2.4 Evaluation
    2.4.1 Intrinsic evaluation
    2.4.2 Extrinsic evaluation
  2.5 Related work
    2.5.1 Distributional semantics
    2.5.2 Automatic summarization using context

3 Implementation
  3.1 Architecture
  3.2 Preprocessing
    3.2.1 Sentence tokenization


    3.2.2 Word tokenization
    3.2.3 Stop words
    3.2.4 PoS filtering
    3.2.5 Stemming
  3.3 Creation of the graph
    3.3.1 Tf-idf
    3.3.2 Term overlap
    3.3.3 Semantic folding
  3.4 Scoring of the graph
  3.5 Picking the sentences

4 Evaluation
  4.1 Data
  4.2 Gold-standard comparison
  4.3 Preprocessing of system and reference summaries
  4.4 Measures
    4.4.1 Precision
    4.4.2 Recall
    4.4.3 F1 score
    4.4.4 Upper bound of the ROUGE score

5 Results
  5.1 The average score

6 Discussion
  6.1 Methodology
  6.2 Results
  6.3 Criticism
  6.4 Ethics and sustainability

7 Conclusion
  7.1 Future work
    7.1.1 Train model on specific domain
    7.1.2 Preprocessing
    7.1.3 Extrinsic evaluation
    7.1.4 Evaluate readability

Bibliography

A Preprocessing
  A.1 Steps of the Porter stemming algorithm
  A.2 The stoplist used

B Results
  B.1 The median score
  B.2 The standard deviation score

Abbreviations

DUC Document Understanding Conference
idf inverse document frequency

JUNG Java Universal Network/Graph Framework

LSA Latent Semantic Analysis

PoS Part of Speech

ROUGE Recall-Oriented Understudy for Gisting Evaluation
tf term frequency

Glossary

co-occurrence When two terms occur alongside each other in a text.
corpus A large set of text.
n-gram A sequence of n words from a text.
snippet A small piece of text.
vocabulary A set of words used for a specific purpose.

Chapter 1

Introduction

This thesis presents a novel approach to automatic summarization, using the TextRank algorithm with semantic folding. This method was evaluated against two state-of-the-art methods.

Automatic summarization is the concept of creating a summary of a text in such a way as to maximize the relevant information from the original text in the summary. The summary can either be extractive, which means that a subset of the sentences from the original text is chosen for the summary, or abstractive, which means that entirely new sentences are constructed and placed in the summary. One approach to extractive summarization is to create a graph from the original text in which the nodes represent sentences and the edges represent some kind of similarity between the corresponding sentences. Sentences are added to the summary based on centrality in the graph, with the hypothesis that central sentences are important and hence should be in the summary. This algorithm is known as TextRank [1].¹

This thesis evaluates the performance of this algorithm using a novel method of assessing the similarity between sentences in TextRank, based on distributional semantics. Distributional semantics builds upon the hypothesis that words which are similar in meaning occur in similar contexts, which means that the similarity of text units can be estimated by examining the distributional similarity between text units [3]. Semantic folding is a distributional semantics method with the purpose of capturing the meaning of a text unit with the context in which it appears [4]. The novel similarity measure described in this thesis is based on semantic folding.

¹ Another graph-based summarization algorithm, LexRank [2], was developed at the same time as TextRank, using different similarity measures. However, TextRank will be used to denote this summarization method throughout this thesis.


1.1 Problem definition

The research question of this thesis is:

How will the TextRank algorithm perform with semantic folding as similarity measure?

To answer the research question, the method of using semantic folding as a similarity measure in TextRank was evaluated against two state-of-the-art methods of comparing the similarity between texts. These similarity measures are cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences. To answer the research question, the problem was divided into four parts:

1. Implementation of the TextRank algorithm with the overlap measure for similarity between sentences.

2. Implementation of the TextRank algorithm with the cosine similarity of tf-idf vectors measure.

3. Incorporation of semantic folding in the TextRank algorithm for similarity between sentences.

4. Evaluation of the three different similarity measures above.

1.2 Objective

The objective of this thesis was to implement the TextRank algorithm for automatic summarization with three different similarity measures: cosine similarity of tf-idf vectors, term overlap, and a novel approach based on semantic folding. The different methods were evaluated by comparing the constructed summaries with human-created gold-standard summaries from the Document Understanding Conference (DUC). The comparison was made with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric set, and specifically with the ROUGE-N measure, which is based on co-occurrences of n-grams [5].

1.3 Purpose

Information overload is a term describing the difficulty of making decisions caused by the existence of too much information. Information overload is becoming more of a problem as the amount of information online has been steadily increasing since the 90's, with digital information growing by a factor of ten every five years [6, 7]. As the problem of information overload grows, methods of assessing the relevance of textual information also become more important. The process of deciding whether a text is relevant or not

could be made simpler by presenting the text as a shorter version of itself, containing the same information. This is the purpose of an automatic summarization system. At the same time, this great amount of information could potentially be utilized with semantic folding or other distributional semantics methods to form a quantified measure of similarity between text units, which could possibly be used to improve automatic summarization.

This thesis was commissioned by Findwise AB. The main interest of Findwise is to evaluate the validity of using semantic folding in the context of different natural language problems.

1.4 Delimitations

This thesis was limited to extractive summarization. Furthermore, summarization can be done based either on one source document or on multiple source documents. This thesis only handled the former case, namely single-document summarization. When processing text, typically some preprocessing will be needed. Preprocessing of text could include removing stop words and reducing words to their base forms. This thesis only evaluated different forms of Part of Speech (PoS) filtering for the summarization system, with the rest of the preprocessing steps fixed. The semantic folding model needs to be trained on a corpus, and for this thesis, it was only trained on non-domain-specific data, namely English Wikipedia.

1.5 Outline of the report

The outline of the report will follow the structure of:

1. Introduction

2. Theory/background

3. Implementation

4. Evaluation

5. Results

6. Discussion

7. Conclusion

Chapter 2 presents the theory needed to understand this thesis and presents some related work. Chapter 3 describes the steps necessary to replicate the work of the thesis. Chapter 4 describes how the evaluation of the proposed system was done. Chapter 5

presents the results of the thesis, and they are discussed in Chapter 6. The thesis ends with conclusions and suggestions for future work in Chapter 7.

Chapter 2

Theory

This chapter provides the reader with all the background theory that is necessary to understand the rest of this thesis. Section 2.1 gives an introduction to automatic summarization and presents the key algorithm of this thesis: TextRank. Section 2.2 describes distributional semantics, including semantic folding, and how it is applied to calculate the similarity between sentences. Section 2.3 describes some linguistic features that must be taken into consideration when creating an automatic summarization system. Section 2.4 describes different ways of evaluating an automatic summarization system. The chapter ends with Section 2.5, which describes some related work relevant to this thesis.

2.1 Automatic summarization

The early work on automatic summarization goes back to 1958, when Luhn [8] automatically created summaries of technical literature. He defined the significance of words as a combination of the frequency of a given word in the document and the relative position of the word in the sentence in which it appears. This significance was then used to pick sentences from the source text to put in the summary, by combining the significance of all words in the sentence. Edmundson [9] extended the novel work of Luhn by taking more features into consideration, like cue words¹, title and heading words, and the location of the sentence. Kupiec et al. [10] trained a classifier determining if a sentence should be present in the summary or not by representing a sentence using seven different linguistic features. Introducing multi-feature representations into the field of automatic summarization in this way led the research into the era of machine learning based approaches.

According to Hassel [11], the original incentive of the automatic summarization research was different applications of digitalization of books and scientific papers, as

¹ A connective expression that signals semantic relations in a text.


Luhn's efforts of summarizing technical literature indicate. This incentive later grew as the internet became established and new digital information became available online at a rate never seen before. One way in which summarization has become an ordinary part of online activity is the way search engines use automatic summarization to summarize a term searched for or a page retrieved [12].

With the rise of the internet came the problem of summarizing multiple documents regarding the same topic [13]. This type of automatic summarization is called multi-document summarization, and it brings new challenges to automatic summarization. Early key research in multi-document summarization includes that of Lin et al. [14], who used the ideas of Luhn and Edmundson and applied them to multi-document summarization.

In 2004 came the influential work of Mihalcea et al. [1] and Günes et al. [2], who at the same time developed a new way of determining sentence importance in a text. This way of extracting important sentences has been used widely in automatic summarization, by selecting the most important sentences for the output summary. This approach, named TextRank (or LexRank), is still often used as a benchmark when new summarization methods are constructed [13, 15, 16, 17, 18]. As the key algorithm of this thesis, it will be presented in detail in Section 2.1.2.

2.1.1 Categories of automatic summarization

As already pointed out earlier in the thesis, there are multiple different types of automatic summarization, with different input source texts, different goals for the summary, or other parameters differentiating them. This section will go through the different categories and explain the differences between them.

Extractive vs. abstractive

In extractive summarization, a subset of the text from the original text is chosen for the summary, e.g. sentences, paragraphs or phrases [19]. Extractive summarization can be viewed as a process with two main steps: a representation of the source text and a mechanism for selecting sentences from that representation. In abstractive summarization, the content of the text is learned, and a new summary is constructed in new words [19]. A majority of the research done has been on extractive summarization rather than abstractive [13]. Extractive summarization will be explained more in depth in Section 2.1.2.

Single-document vs. multi-document

In single-document summarization, there is only one document as input to the summarization algorithm. In multi-document summarization, multiple documents possibly

regarding the same topic are to be summarized. This raises problems due to the fact that highly similar sentences describing the same things might be present in different documents, which means that a mechanism for selecting only one of these sentences must be constructed [20].

Indicative vs. informative

In informative summarization, the goal is to make the summary hold as much information from the original text as possible, while in indicative summarization, the goal is merely to indicate the subject of the original text [21].

Generic vs. query-oriented

Generic summarization focuses on creating a general summary of the input text, while query-oriented summarization creates a summary based on the information need given by a user [22].

2.1.2 TextRank: Automatic summarization using graph representations

This section will describe an algorithm for automatic summarization called TextRank. The algorithm can be described as a three-step process including sentence representation, sentence ranking, and sentence selection. The following subsections will describe each of these steps.

Sentence representation

The input text is represented as a graph, where each sentence is converted to a node, and an edge between two nodes represents the similarity between the two sentences [2, 1]. There are multiple different ways of measuring the similarity between sentences, and the two original measures will be explained in Section 2.1.3. Günes et al. [2] suggest an unweighted graph, with edges between sentences if the similarity is above a predefined threshold, whilst Mihalcea et al. [1] suggest a fully connected graph with edges between nodes stating the similarity between the corresponding sentences, no matter how similar they are.

Sentence ranking

When the sentences have been converted to a graph, the next step is to rank each node (sentence) in the graph, which is done with the PageRank algorithm. PageRank was developed by Brin et al. [23, 24] in 1998 and is the basis of the Google search engine. PageRank determines the relative importance of web pages using the link structure of the web. The result is a ranking where web pages with high centrality on the web are given preference and are considered more important. The web is modeled as a graph

where web pages are represented by nodes, with directed edges if there is a link from one page to another. Even if the model was initially developed for web pages, it can be used on all graphs with edges representing some kind of recommendation of nodes, no matter what the nodes represent. In the context of graph-based summarization, the nodes represent sentences, and the edges represent similarity between them, but the underlying mechanics work just the same as first described by Brin et al. Informally, the model can be thought of as a random surfer that is surfing the web completely at random. The PageRank score is then the fraction of time the random surfer is spending on each web page, assuming that the surfer stays the same amount of time on each page. At each transition from a page, the random surfer picks one of the outlinks at random, which means that if the set of outlinks from page j is o_j, the probability of going to each neighbour of j is 1/|o_j|. This gives a simplified definition of the PageRank score as in Equation 2.1, where PR(u) is the PageRank of node u, L(u) is the number of outlinks from u, and B_u is the set of pages having a link to u:

PR(u) = \sum_{v \in B_u} \frac{1}{L(v)} \cdot PR(v)    (2.1)

However, the model described so far has a problem. If a page does not have any outlinks, the surfer will be stuck and will have no way to continue the transitions. To account for this, a probability d of jumping to any other node at each transition is introduced. This gives the final definition of the PageRank score as in Equation 2.2, with the same notations as in Equation 2.1 and with N as the total number of nodes in the graph.

PR(u) = \frac{d}{N} + (1 - d) \cdot \sum_{v \in B_u} \frac{1}{L(v)} \cdot PR(v)    (2.2)

The PageRank scores can be found by starting with an even score for each web page and iterating through the pages, updating each page's PageRank according to the definition above, until the changes are smaller than a given convergence criterion.
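
To make the iteration concrete, the sketch below shows the PageRank update of Equation 2.2 in Java. It is only an illustration of the formula and the convergence loop under the assumption of a uniform random jump; the implementation used in this thesis is JUNG's PageRank (see Section 3.4), and all names here are hypothetical.

import java.util.Arrays;

/** Minimal PageRank iteration following Equation 2.2 (illustrative sketch, not the JUNG implementation). */
public class PageRank {

    /**
     * @param inLinks   inLinks[u] lists the nodes v that link to u (the set B_u)
     * @param outDegree outDegree[v] is the number of outlinks L(v) from v
     * @param d         probability of jumping to a random node
     * @param epsilon   convergence criterion on the largest score change
     */
    static double[] score(int[][] inLinks, int[] outDegree, double d, double epsilon) {
        int n = inLinks.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                    // start with an even score for each node

        double change = Double.MAX_VALUE;
        while (change > epsilon) {
            double[] next = new double[n];
            for (int u = 0; u < n; u++) {
                double sum = 0.0;
                for (int v : inLinks[u]) {
                    sum += pr[v] / outDegree[v];     // (1 / L(v)) * PR(v)
                }
                next[u] = d / n + (1 - d) * sum;     // Equation 2.2
            }
            change = 0.0;
            for (int u = 0; u < n; u++) {
                change = Math.max(change, Math.abs(next[u] - pr[u]));
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Tiny example graph: 0 <-> 1, 2 -> 0
        int[][] inLinks = { {1, 2}, {0}, {} };
        int[] outDegree = { 1, 1, 1 };
        System.out.println(Arrays.toString(score(inLinks, outDegree, 0.15, 1e-8)));
    }
}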

In the context of TextRank, the nodes represent sentences instead of pages, and the edges are the similarity between these sentences instead of links between web pages. Note that in the case of similarity between sentences, there will always be edges in both directions between nodes, due to the symmetric relationship between sentences, unlike with web pages, where links often go in only one direction.

Sentence selection

When each sentence has been scored, the last step is to select which sentences to output as the summary given the desired length of the summary. One usual approach is to simply pick the highest scoring sentences until the desired length is obtained [1].

This might yield a summary that is slightly longer than the desired length. Another approach is to consider the sentence selection as an optimization problem and solve it with the knapsack algorithm, where a set of items with weights and values is picked to maximize the value given a weight limit [13]. More formally, let v_i and w_i be the value and weight of item i, W the weight limit, and define x_i to be one if the item is picked or zero if it is not picked. The optimization problem can be stated as

\text{maximize } \sum_{i=1}^{n} v_i x_i \quad \text{subject to } \sum_{i=1}^{n} w_i x_i \le W \text{ and } x_i \in \{0, 1\}.

In the application of picking sentences, v_i and w_i will be the score and length of sentence i, and W will be the desired length of the summary.
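
A minimal sketch of this 0/1 knapsack formulation applied to sentences is shown below. It is only illustrative: the thesis itself uses the simpler greedy selection (Section 3.5), and the scores, lengths and names are hypothetical.

/** 0/1 knapsack over sentences: value = score, weight = length in words (illustrative sketch). */
public class KnapsackSelection {

    /** Returns, for each sentence, whether it is picked, maximising total score within the word limit W. */
    static boolean[] select(double[] score, int[] length, int W) {
        int n = score.length;
        double[][] best = new double[n + 1][W + 1];
        for (int i = 1; i <= n; i++) {
            for (int w = 0; w <= W; w++) {
                best[i][w] = best[i - 1][w];                        // skip sentence i-1
                if (length[i - 1] <= w) {                           // or take it if it fits
                    best[i][w] = Math.max(best[i][w],
                            best[i - 1][w - length[i - 1]] + score[i - 1]);
                }
            }
        }
        boolean[] picked = new boolean[n];
        int w = W;
        for (int i = n; i >= 1; i--) {                              // trace back the choices
            if (best[i][w] != best[i - 1][w]) {
                picked[i - 1] = true;
                w -= length[i - 1];
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        double[] score = { 0.4, 0.3, 0.3 };
        int[] length = { 60, 50, 45 };
        System.out.println(java.util.Arrays.toString(select(score, length, 100))); // [false, true, true]
    }
}

Note how this optimises the summary as a whole rather than guaranteeing that the single highest-scoring sentence is included, which is the trade-off discussed again in Section 6.1.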

2.1.3 Sentence similarity

Section 2.1.2 described the general characteristics of the TextRank algorithm, leaving out how to assess the similarity between sentences. That is described in this section.

Cosine similarity of tf-idf vectors

One approach is to calculate the similarity between sentences by representing them as vectors and calculating the distance between these vectors in a geometric space [2]. The dimensionality of each sentence vector is the total number of terms in the vocabulary, with each position in the vector corresponding to a term, and the value based on the occurrence of that term. One possible approach would be to only count the number of occurrences of the term in the sentence, but that would give common terms preference over uncommon terms, even though uncommon terms often define a text better than the common terms that most texts contain. To account for this, the frequency of a term is weighted with the inverse document frequency (idf). The purpose of idf is to boost the value of rare terms. This is done by taking the logarithm of the number of documents N in the given corpus divided by the number of documents n_t that contain a given term t, as Equation 2.3 shows.

\log \frac{N}{n_t}    (2.3)

The idf-score will be high for a term if it is only present in a small number of documents in the corpus. The idf-score is combined with the term frequency (tf), giving the so-called tf-idf score. The tf-idf for a given term t, document d and corpus D is defined in Equation 2.4.

\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)    (2.4)

This yields a vector for each sentence, with each element being the tf-idf score of the term corresponding to that index of the vector. The distance between the vectors is then

calculated with the cosine distance. Equation 2.5 displays the cosine similarity measure, where A_i and B_i are the components of vectors A and B.

\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}    (2.5)

The edges of the similarity graph described in Section 2.1.2 are weighted according to the cosine similarity measure. Another possibility is to create the graph with unweighted edges, where an edge exists between two nodes if the similarity is above some predefined threshold. According to Erkan et al. [2], a threshold of 0.1 has been shown to perform best.
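
As an illustration of this measure, the sketch below builds tf-idf maps for two tokenized sentences and applies Equation 2.5. It assumes precomputed document frequencies from a training corpus; the counts, tokens and names are hypothetical, and the preprocessing pipeline of Chapter 3 is omitted.

import java.util.*;

/** Cosine similarity of tf-idf vectors for two token lists (illustrative sketch). */
public class TfIdfCosine {

    /** idf(t, D) = log(N / n_t), from precomputed document frequencies. */
    static double idf(String term, Map<String, Integer> docFreq, int numDocs) {
        return Math.log((double) numDocs / docFreq.getOrDefault(term, 1));
    }

    static Map<String, Double> tfIdfVector(List<String> tokens, Map<String, Integer> docFreq, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) {
            vec.merge(t, 1.0, Double::sum);                       // raw term frequency
        }
        vec.replaceAll((t, tf) -> tf * idf(t, docFreq, numDocs)); // weight by idf
        return vec;
    }

    /** Equation 2.5: dot product divided by the product of the vector norms. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies from a small training corpus of 100 documents.
        Map<String, Integer> docFreq = Map.of("storm", 5, "damage", 10, "said", 90, "roof", 3);
        List<String> s1 = List.of("storm", "damage", "roof");
        List<String> s2 = List.of("storm", "said", "damage");
        System.out.println(cosine(tfIdfVector(s1, docFreq, 100), tfIdfVector(s2, docFreq, 100)));
    }
}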

Term overlap

Instead of calculating the distance between word vectors, one can simply calculate the number of terms overlapping between sentences. The sentence similarity defined by Mihalcea et al. [1] is the term overlap divided by the sum of the logarithms of the lengths of the two sentences, as shown in Equation 2.6.

\frac{|S_1 \cap S_2|}{\log(|S_1|) + \log(|S_2|)}    (2.6)

The original TextRank paper also suggests considering only specific word classes of each sentence, which gives the definition in Equation 2.7 below, where F is a function keeping only the terms from the given word classes.

\frac{|F(S_1) \cap F(S_2)|}{\log(|S_1|) + \log(|S_2|)}    (2.7)
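
A small sketch of the filtered overlap in Equation 2.7 follows. It assumes the word-class filter F has already been applied, so the inputs are the filtered token lists; tokens and names are hypothetical.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Term overlap similarity, Equation 2.7 (illustrative sketch; tokens are assumed already PoS-filtered). */
public class TermOverlap {

    static double similarity(List<String> s1, List<String> s2) {
        Set<String> shared = new HashSet<>(s1);
        shared.retainAll(new HashSet<>(s2));                 // |F(S1) ∩ F(S2)|
        double denom = Math.log(s1.size()) + Math.log(s2.size());
        return denom == 0 ? 0 : shared.size() / denom;       // normalise by the sentence lengths
    }

    public static void main(String[] args) {
        List<String> s1 = List.of("storm", "damage", "roof", "tree");
        List<String> s2 = List.of("storm", "tree", "power", "line");
        System.out.println(similarity(s1, s2));              // 2 / (log 4 + log 4)
    }
}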

2.2 Distributional semantics

Section 2.1.3 described how the similarity of two sentences can be calculated with word vectors and term overlap. This section explains a different way of assessing the similarity between text units, namely the concept of semantic folding, a distributional semantics method that creates a representation of a text unit based on the context in which it appears. Distributional semantics builds on the distributional hypothesis, which states that "words which are similar in meaning occur in similar contexts" [25]. The hypothesis states that there is a correlation between distributional similarity and meaning similarity, which means that the meaning similarity of text units can be estimated by examining the distributional similarity between text units [3]. This section deals with semantic folding; three other methods of distributional semantics are briefly explained in Section 2.5.

2.2.1 Semantic folding

Semantic folding theory was created by De Sousa Webber [4] with the purpose of capturing the context of text units in a way such that they can be compared. Semantic folding uses a set of reference documents to capture the context of words, which results in a word vector for each word that represents its meaning based on its context. The word vectors of individual words are then combined to form vectors of sentences, which in turn can be compared to find the similarity of two sentences. The creation of the word vectors works as follows:

1. Each document in the document set is split into snippets. Each of these snippets will capture the context for every word contained in it. The text snippets are normally in the form of n-grams where n ≤ 3.

2. Each snippet is mapped to a 2-dimensional vector so that snippets that share words are close to each other in the vector.

3. A vector is created for each word by setting the corresponding position in its vector to one if the word is present at that index of the 2-dimensional vector from step 2.

This results in a vector for each word that represents the semantic meaning of the word based on its context in the document set, a so-called semantic fingerprint. The semantic fingerprints for individual words are then combined to form fingerprints for sentences, which in turn can be used to calculate similarities between sentences, based on the similarity of their semantic fingerprints. Figure 2.1 illustrates the process of combining the fingerprints for all words in a sentence into a sentence fingerprint. To keep the sparsity of the vectors, the bit stacks of each entry in the combined sentence fingerprint are used to calculate which entries should be active, which is illustrated in Figure 2.2.
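
The aggregation step of Figures 2.1 and 2.2 can be sketched as below: the bits of the word fingerprints are stacked, and only the highest stacks are kept active to preserve sparsity. This is a rough illustration only; the actual fingerprints and thresholding come from Cortical.io's Retina (Section 3.3.3), and the cutoff rule and names here are hypothetical assumptions.

import java.util.Arrays;

/** Combining word fingerprints into a sentence fingerprint (rough sketch of the aggregation in Figures 2.1-2.2). */
public class SentenceFingerprint {

    /**
     * @param wordFingerprints one binary fingerprint per word, all of the same length
     * @param keep             roughly how many of the most frequently set positions to keep active
     */
    static boolean[] combine(boolean[][] wordFingerprints, int keep) {
        int size = wordFingerprints[0].length;
        int[] stack = new int[size];
        for (boolean[] fp : wordFingerprints) {              // stack the bits of every word fingerprint
            for (int i = 0; i < size; i++) {
                if (fp[i]) stack[i]++;
            }
        }
        // Find a stack-height cutoff so that roughly the `keep` highest positions stay active (ties may keep more).
        int[] sorted = stack.clone();
        Arrays.sort(sorted);
        int threshold = sorted[Math.max(0, size - keep)];
        boolean[] sentence = new boolean[size];
        for (int i = 0; i < size; i++) {
            sentence[i] = stack[i] >= threshold && stack[i] > 0;
        }
        return sentence;
    }

    public static void main(String[] args) {
        boolean[][] words = {
                { true, true, false, false, true, false },
                { true, false, false, true, true, false },
                { false, true, false, false, true, true }
        };
        System.out.println(Arrays.toString(combine(words, 3)));
    }
}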

Competitive learning

The mapping in step 2 is made with competitive learning, which is a family of artificial neural network methods. Rumelhart et al. [26] describe competitive learning as a three-step process:

1. Start with a set of units that are all the same except for some randomly distributed parameter, which makes each of them respond slightly differently to a set of input patterns.

2. Limit the "strength" of the response from each unit so that it only responds to one pattern.

3. Allow the units to compete in some way for the right to respond to a given subset of inputs.

The essence of a competitive learning network is to have a set of nodes where each node responds to a subset of similar inputs.

2.3 Linguistic features

When creating an automatic summarization system, some preprocessing is needed based on some linguistic features of the input text. This will be the subject of this section.

2.3.1 Tokenization

Tokenization is the part of the preprocessing where the input text is split into units called tokens [27]. The tokens can be either individual words or entire sentences.

2.3.2 Stop words

Stop words are words that are excluded from the vocabulary because they are too common to carry any meaning. A usual strategy to create a stop list of stop words is to sort the terms by the total number of times they occur in the corpus, manually go through the most frequent terms, and add them to the stop list if they do not have any meaningful semantics given the domain of the corpus. Figure 2.3 shows some usual stop words [28].
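
The counting part of this strategy is simple to automate; a sketch of producing the frequency-ranked candidate list (the manual domain filtering cannot be automated) might look as follows, with hypothetical tokens and names.

import java.util.*;
import java.util.stream.Collectors;

/** Produces a frequency-ranked list of stop-word candidates for manual inspection (illustrative sketch). */
public class StopListCandidates {

    static List<String> topTerms(List<String> corpusTokens, int howMany) {
        Map<String, Long> counts = new HashMap<>();
        for (String token : corpusTokens) {
            counts.merge(token.toLowerCase(), 1L, Long::sum);     // count occurrences over the whole corpus
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "storm", "hit", "the", "town", "and", "the", "roof");
        System.out.println(topTerms(tokens, 3)); // "the" comes first; the remaining ties appear in arbitrary order
    }
}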

2.3.3 PoS tagging

A PoS tagger classifies words into specific word classes [28]. An example of this would be to tag a word as a noun, adjective or verb, etc. Often more fine-grained PoS tags are used that classify words with tags like noun-plural, for example [29]. A PoS filter is a filter that filters a text based on the PoS tags of the tokens [27].

2.3.4 Stemming

A word will occur in a text in different forms, e.g. organize, organizes, organizing. Furthermore, there might be related words with similar meanings, like democracy, democratic and democratization. Even if the words are different on a syntactical level, they might have the same meaning. Stemming is a heuristic approach that chops off the ends of words so that these similar words look the same [28]. Below, four examples from

Figure 2.1: An illustration of the process of combining the fingerprints of individual words into a fingerprint of a sentence. Image credited to De Sousa Webber [4].

Figure 2.2: An illustration of the process of combining the bit stacks of each individual word to form the fingerprint of a sentence using a threshold. Image credited to De Sousa Webber [4].

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Figure 2.3: A list of some terms usually considered stop words.

Wikipedia [30] are shown. The first three successfully chop the end of the word to form the stem, while the fourth illustrates the weakness of the heuristic when trying to form stems of different forms of "argue" and "argus", which all get the same stem.

cats, catlike, catty → cat
stems, stemmer, stemming, stemmed → stem
fishing, fished, fisher → fish
argue, argued, argues, arguing, argus → argu

2.3.5 Lemmatisation

Lemmatisation is closely related to stemming in that it has the same goal. But where stemming uses a heuristic, lemmatisation uses a vocabulary and morphological analysis of words to return the base or dictionary form of a word [28]. For example, "better" is reduced to its lemma "good" and "walking" is reduced to its lemma "walk" [31]. Note that in the case of "walking", stemming and lemmatisation result in the same base.

2.4 Evaluation

This section will give an overview of how to evaluate an automatic summarization system. Automatic summarization evaluation can be split into two types: intrinsic and extrinsic evaluation [32]. Intrinsic evaluation focuses on the performance of the summarization system itself, while extrinsic evaluation measures the performance of the summarization system by performing another task with the summarizer [11].

2.4.1 Intrinsic evaluation

When using an intrinsic method, there are basically two criteria for evaluating a summary: readability and informativeness of the summary. The coherence of the summary is high if it lacks dangling anaphors and gaps in rhetorical structure, and if it preserves environment-dependent structures, such as lists etc. Informativeness, on the other hand, is high if the information from the source text is preserved in the summary [32]. Most automatic summarization evaluation is intrinsic and is often done with a gold-standard comparison, which means that the output from the summarizer is compared with a summary created by a human judge [11].

Intrinsic measures

Two often used measures in intrinsic evaluation are sentence precision and sentence recall. Sentence recall measures the ratio of sentences in the gold-standard summary that are present in the created summary, and precision measures the ratio of sentences in the generated summary that are in the gold-standard summary [11].

2.4.2 Extrinsic evaluation

Extrinsic evaluation, in contrast to intrinsic, does not focus on the summarization system itself but evaluates it by considering the end user, and how well that user can perform some given task with the system. An example of such a task is the question game, in which the user should answer a couple of questions about a text after reading the summary of it [11].

2.5 Related work

There has been a great amount of research in the field of automatic summarization. However, the amount of research on context-based methods in automatic summarization is not that great. This section will describe some related work on automatic summarization using context-based methods. Section 2.5.1 will describe some methods of capturing the context of text, comparable with semantic folding, and Section 2.5.2 will describe some related work using those methods.

2.5.1 Distributional semantics

Section 2.2 explained distributional semantics and how semantic folding can be used to utilize the distributional hypothesis. However, semantic folding is not the only method that utilizes the distributional hypothesis. This section will briefly explain Latent Semantic Analysis (LSA), random indexing and word2vec, which are three methods that are similar, but not equal to, semantic folding.

Latent semantic analysis

LSA is a distributional semantics method where the document set is represented by a matrix X in which each term is represented as a row and each text unit as a column, as shown in Figure 2.4. The element X_ij is the number of occurrences of term i in text unit j. Singular value decomposition is used to reduce the number of rows of the matrix while the structure of the columns stays the same. The similarity of two terms is then compared with the cosine similarity measure between the two term vectors [33].

X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{pmatrix}, \quad \text{rows } t_i \text{ (terms)}, \text{ columns } d_j \text{ (text units)}

Figure 2.4: The LSA matrix. The rows represent the terms and the columns represent each text unit. The values represent the number of occurrences of a term in the text unit.

Random indexing

Random indexing is another distributional semantics method, with the idea to accumulate context vectors based on the occurrence of words in contexts [34]. Random indexing captures the contexts with a two-step operation:

1. Each text unit is assigned a unique and randomly generated index vector.

2. A context vector is created for each text unit by going through the entire document set. Whenever a text unit occurs within a sliding context window, the index vectors of all the words in the sliding context window are added to the text unit's context vector.

The similarity between two text units can then be calculated using the context vectors.

Word2Vec

Word2vec was created by Tomas Mikolov et al. [35] and models natural language with neural networks that create a vector space of word vectors from a large text corpus.

2.5.2 Automatic summarization using context

Chatterjee et al. [15] used a similar approach to context-based similarity, using a graph-based similarity method with random indexing and scoring sentences with the PageRank algorithm. The approach was evaluated by summarizing fifteen documents containing 200 to 300 words each, and the resulting summaries were compared with manually created summaries. The method was compared with the commercially available summarization systems Copernic and Word. They used precision, recall and F-score when comparing the summaries with the reference summary, and the results showed that their proposed method produced summaries closer to the reference summaries. Chatterjee et al. [36] later made some improvements to this method.

Ning Jianfei et al. [37] used word2vec together with TextRank to extract keywords.

They did this by creating a graph with nodes representing the words of a document, with edges between nodes weighted by similarity according to the word vectors from word2vec. Using this method increased the accuracy of extracting keywords from the individual documents.

Yihong Gong et al. [38] first suggested using LSA in automatic summarization and evaluated it against a standard method. The two methods were compared against human-generated summaries and were found to perform comparably. Other reports regarding LSA in automatic summarization include that of Makbule Gulcin Ozsoy et al. [39], who evaluated different LSA-based summarization methods on Turkish and English texts. The LSA-based methods were among the best-performing methods.

Chapter 3

Implementation

The goal of this thesis was to evaluate if semantic folding could be incorporated in the TextRank algorithm and perform on par with the original sentence similarity methods, term overlap and cosine similarity of tf-idf vectors. This chapter, together with the next one, describes how this was done. This chapter deals with the overall architecture of the summarization system, and the next describes how the evaluation was done. Section 3.1 gives an overview of the architecture of the system. Section 3.2 explains the preprocessing pipeline used. Section 3.3 describes the creation of the graph of sentences with edges based on similarity, and Section 3.4 describes how this graph was scored with PageRank. Section 3.5 explains how the sentences were picked for the summary based on the PageRank scores.

3.1 Architecture

Figure 3.1 shows the pipeline of the summarization system built for this thesis. The SentenceExtractor performed sentence tokenization of the input text. The Preprocessor performed word tokenization of the input sentences, removed stop words, stemmed the sentences, and did PoS filtering of tokens. The GraphCreator created a graph, with each sentence as a node, with edges between similar sentences. The GraphScorer used PageRank to score the graph, which yielded a score for each sentence. The SentencePicker picked sentences for the summary until the preferred length of the summary was met. These sentences were put in the same order as in the source text and were presented as the summary. The following sections will describe each of these steps in more detail and motivate the implementation choices made.

3.2 Preprocessing

Figure 3.2 shows an overview of the preprocessing steps for each method. As can be seen in the figure, there are basically two different pipelines.


Figure 3.1: An overview of the system showing all the individual components.

The two methods using the tf-idf and overlap measures used the entire pipeline, while the semantic folding method only used the first four steps: sentence tokenization, word tokenization, stop word removal and PoS filtering. Stemming was left out for semantic folding since it is a method based on context and not syntax. If two words with different stems have the same meaning, they will occur in the same context and get similar fingerprints anyway. If they do not get the same fingerprints, they do not occur in the same context and should be considered different. The rest of this section will describe each step in the preprocessing pipeline in more detail.

3.2.1 Sentence tokenization

To represent a text as a graph, with sentences as nodes, the text needs to be tokenized into sentences. A naïve approach would be to always tokenize at punctuation. However, the sentence "I hold an M.Sc in Computer Science" would then be tokenized into the following two sentences: "I hold an M." and "Sc in Computer Science", which is clearly wrong. For this thesis, OpenNLP's SentenceDetector [40] was used, which is a sentence tokenizer that tries to determine whether a punctuation character marks the end of a sentence or not. The OpenNLP SentenceDetector is a method that needs a trained model. For this thesis, OpenNLP's default model for sentence detection [41] was deemed sufficient after manual inspection of the sentence tokenization of the training corpus. The model that was used was trained before any preprocessing was done, and hence the sentence tokenization has to be done before any other preprocessing for this thesis as well.

Figure 3.2: An overview of the preprocessing pipeline used for each similarity measure.

The model can handle multiple hard cases like abbreviations, dates, emails etc. Below, several hard cases that are correctly tokenized by the sentence tokenizer used are displayed. The sentences are from the evaluation corpus.

"The chairman, Sen. Claiborne Pell, D-R.I., and Sen. Alan Cranston, D- Calif., voted “present,” meaning they took no position."

"People in Alorton, Ill., a village of 2,700 about five miles southeast of East St. Louis, were injured when trees fell on their houses or mobile homes, said Bill Gamblin of the St. Clair County Emergency Services and Disaster Agency."

"The heaviest rainfall during the six hours up to 1 p.m. EST was 1.22 inches at Peru, Ind. Wind up to 65 mph damaged roofs, power lines and trees in parts of Kentucky and Louisiana, the weather service said." CHAPTER 3. IMPLEMENTATION 21

3.2.2 Word tokenization

Word tokenization is done by splitting a sentence into multiple tokens, each representing a word. This was done with the Learnable Tokenizer [42] created by Apache OpenNLP, which is a maximum entropy tokenizer that detects token boundaries based on a probability model. The model used for this thesis was the default model created by Apache OpenNLP [41].

3.2.3 Stop words

The stop words were chosen based on the most frequent words in the training corpus. A list of the 100 most frequent words was created, and this list was manually filtered to keep only those words with no relevance to the domain. The complete stoplist, containing 71 words, can be found in Appendix A.2.

3.2.4 PoS filtering

The PoS tagger used for this thesis was the OpenNLP Part-of-Speech Tagger [43]. It decides the tag of a token using both the token itself and the context of the token. The PoS tag to use is decided with a probability model. The model used for this thesis was 'en-pos-maxent' [41]. Since the PoS tagging can possibly influence the results heavily, each method was tested with five different PoS filters to test which performed best. Table 3.1 displays these PoS filters. The word classes in the table are those kept by the filter.

PoS filter
-
Nouns
Nouns, Verbs
Nouns, Adjectives
Nouns, Verbs, Adjectives

Table 3.1: The PoS filters used for the evaluation. The word classes shown in the table are those that are kept by the filter.

3.2.5 Stemming

For stemming, Apache OpenNLP's implementation of the Porter stemming algorithm [44] was used. The algorithm works by iteratively removing the suffix of a word, ending up with a stem.

Each step of the algorithm consists of a set of rules of the form (condition) S1 → S2, where the suffix S1 of a word is replaced with S2 if the stem before the suffix satisfies the condition. The longest matching rule is applied, which means that the rule with the longest matching S1 is used if there are multiple rules that match. Below, the first set of rules of the Porter stemming algorithm is displayed.

SSES→SS

IES→I

SS→SS

S→ ε

This will, for example, reduce "possesses" to "possess" and "policies" to "polici", according to the first and the second rule.

This reduction is done in four steps, which are shown in their entirety in Appendix A.1.
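
Taken literally, the rule group above can be sketched as follows. Only the four step-1a suffix rules from the excerpt are shown; the OpenNLP Porter stemmer used in this thesis also applies the conditions and remaining steps listed in Appendix A.1.

/** The step-1a suffix rules of the Porter stemmer, exactly as listed above (illustrative sketch only). */
public class PorterStep1a {

    static String apply(String word) {
        // The longest matching suffix wins, so the rules are tried in order of decreasing suffix length.
        if (word.endsWith("sses")) return word.substring(0, word.length() - 4) + "ss"; // SSES -> SS
        if (word.endsWith("ies"))  return word.substring(0, word.length() - 3) + "i";  // IES  -> I
        if (word.endsWith("ss"))   return word;                                        // SS   -> SS
        if (word.endsWith("s"))    return word.substring(0, word.length() - 1);        // S    -> (empty)
        return word;
    }

    public static void main(String[] args) {
        System.out.println(apply("possesses")); // possess
        System.out.println(apply("policies"));  // polici
        System.out.println(apply("cats"));      // cat
    }
}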

3.3 Creation of the graph

There were basically two choices that had to be made when choosing the type of graph: if the graph should be weighted or unweighted, and if the graph should be directed or undirected. As explained in Section 2.1.2, Günes et al. suggest an unweighted graph, with edges between sentences if the similarity is above a predefined threshold, whilst Mihalcea et al. suggest a fully connected graph with edges between nodes stating the similarity between the corresponding sentences, no matter how similar they are. In the case of the former, which used cosine similarity of tf-idf vectors, the threshold approach is possible since the similarity score is always in the range of 0 to 1. However, in the case of the latter, using the term overlap measure, there are no bounds on how similar two sentences can be, and hence it is impossible to find a reasonable threshold to use. Since this thesis used the term overlap measure, the graph needed to be weighted for that measure, and to make the results comparable between the three similarity measures, the same approach was used for all of them. The similarity of two sentences could be considered symmetric. However, using an undirected graph would not have worked with weighted edges when running the PageRank algorithm. Instead, a directed graph was used, and two edges were created between two nodes, one in each direction, with the same value. After that, the values of the outedges from each node were normalized to sum to 1, to preserve the underlying semantics of the PageRank algorithm, where the outedges from each node are considered transition probabilities. This was the reason for using directed edges.

The graph was created by iterating through each pair of sentences and adding a directed edge in both directions with the value of the similarity between them. This is shown in Algorithm 1.

Data: N number of sentences S
for i ← 0 to N do
    for j ← i to N do
        if i ≠ j then
            addEdge(i, j, similarity(S(i), S(j)));
            addEdge(j, i, similarity(S(i), S(j)));
        end
    end
end
Algorithm 1: Pseudocode for the creation of the graph.

The similarity assessment was different for each similarity measure, and the rest of this section will go through how the similarity assessment was done for each of them.
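
A concrete version of Algorithm 1, including the outedge normalization described above, could look like the sketch below. It uses a plain adjacency matrix instead of the JUNG graph classes, and the similarity function is a placeholder for any of the three measures; all names are hypothetical.

import java.util.*;
import java.util.function.BiFunction;

/** Builds the weighted, directed similarity graph of Algorithm 1 and normalises each node's outedges to sum to 1. */
public class SimilarityGraph {

    static double[][] build(String[] sentences, BiFunction<String, String, Double> similarity) {
        int n = sentences.length;
        double[][] weight = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sim = similarity.apply(sentences[i], sentences[j]);
                weight[i][j] = sim;                  // one edge in each direction, with the same value
                weight[j][i] = sim;
            }
        }
        // Normalise the outedges of each node so that they act as transition probabilities.
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (double w : weight[i]) sum += w;
            if (sum > 0) {
                for (int j = 0; j < n; j++) weight[i][j] /= sum;
            }
        }
        return weight;
    }

    public static void main(String[] args) {
        String[] sentences = { "a b c", "a b d", "e f g" };
        // Placeholder similarity: shared-token count; the thesis plugs in tf-idf, overlap or semantic folding here.
        BiFunction<String, String, Double> sharedTokens = (s1, s2) -> {
            Set<String> shared = new HashSet<>(Arrays.asList(s1.split(" ")));
            shared.retainAll(Arrays.asList(s2.split(" ")));
            return (double) shared.size();
        };
        System.out.println(Arrays.deepToString(build(sentences, sharedTokens)));
    }
}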

3.3.1 Tf-idf

As explained in Section 2.1.3, tf-idf is defined as below, where tf(t, d) is the term frequency of term t in sentence d, N is the number of documents in the training corpus, and n_t is the number of documents in the training corpus containing the term t.

\text{tf}(t, d) \cdot \log \frac{N}{n_t}    (3.1)

To create a training corpus to learn the tf-idf statistics from, about 60% of the evaluation set was used. How this was done is further explained in Section 4.1. The cosine similarity of the tf-idf vectors was calculated as explained in Section 2.1.3.

3.3.2 Term overlap

This thesis used the term overlap measure with PoS filtering, as recommended by Mihalcea et al., which was explained in Section 2.1.3. It is defined as below, where S1 and S2 are sentences, and F is a function keeping only the tokens of the given word classes.

\frac{|F(S_1) \cap F(S_2)|}{\log(|S_1|) + \log(|S_2|)}    (3.2)

3.3.3 Semantic folding

Semantic folding is implemented by Cortical.io, where the fingerprint of each term of the corpus is indexed and stored in a database called Retina [4]. The Retina database is available through a public API [45]. This thesis used the Java client SDK [46] for accessing the API. The model used was trained on 400,000 Wikipedia documents [4]. The similarity between fingerprints was calculated with the cosine similarity between them.

3.4 Scoring of the graph

After the input had been converted to a graph, it was scored with the PageRank algorithm. This thesis used the PageRank implementation provided by the Java Universal Network/Graph Framework (JUNG) [47]. The key parameter of PageRank is the alpha value, which states the probability for the random surfer to jump to a random node at each transition in the graph. Brin et al. [24] suggested an alpha value of 0.85 in the reference paper of PageRank, and it is also a value often used by others [24, 48], so it was used for this thesis as well. The algorithm was executed until convergence for each run.

3.5 Picking the sentences

After the sentences were scored with PageRank, they were picked from highest score to lowest until a given threshold was met. This method was chosen over the knapsack approach since it is the method used by the reference paper of the TextRank algorithm and also by many other automatic summarization studies [2, 1, 13, 15], which makes it easier to compare the results of this thesis with other similar studies. The procedure is shown in pseudocode in Algorithm 2, where pop(S) removes and returns the highest-ranked remaining sentence.

Data: Set S of sentences ordered by rank, maximum length threshold t
summary ← ∅;
while numWords(summary) < t do
    summary ← summary + pop(S);
end
Algorithm 2: Pseudocode for the sentence picking procedure.

This gave a summary that was slightly longer than the threshold. When evaluating the summaries, they were split at the threshold word number to make the comparison between summaries fair.
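
A direct rendering of Algorithm 2 together with the word-level cut used at evaluation time might look as follows; this is a sketch with hypothetical names, not the exact implementation.

import java.util.Arrays;
import java.util.List;

/** Greedy sentence picking of Algorithm 2, followed by the word-level cut used at evaluation (illustrative sketch). */
public class SentencePicker {

    /** Sentences must be given in decreasing order of their PageRank score. */
    static String pick(List<String> rankedSentences, int wordThreshold) {
        StringBuilder summary = new StringBuilder();
        int words = 0;
        for (String sentence : rankedSentences) {
            if (words >= wordThreshold) break;            // stop once the threshold has been reached
            summary.append(sentence).append(' ');
            words += sentence.split("\\s+").length;
        }
        // The summary may now be slightly too long; cut it at the threshold for a fair comparison.
        String[] tokens = summary.toString().trim().split("\\s+");
        int keep = Math.min(wordThreshold, tokens.length);
        return String.join(" ", Arrays.copyOfRange(tokens, 0, keep));
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("The storm damaged several roofs.", "Power lines fell.", "Schools were closed.");
        System.out.println(pick(ranked, 8));
    }
}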

Chapter 4

Evaluation

The automatic summarizers were evaluated by comparing their output summaries with the gold-standard summaries of DUC using the ROUGE-N measure. This is explained in more detail throughout this chapter.

4.1 Data

The evaluation data consisted of 438 documents from DUC 2002, divided into 60 document sets, each describing a specific news event [49]. The partition of the documents into test, training and validation sets is shown in Table 4.1. Out of the total 438 documents, 136 were used for testing, 262 for training and 40 for validation. In order to create a test set evenly balanced among the different news topics, documents were chosen for the test set randomly from each document set in an iterative fashion. The training documents were used for calculating the idf statistics for the summarizer using tf-idf as similarity measure. Each evaluation document from DUC came with a set of corresponding gold-standard summaries from human judges.

      Test    Training   Validation   Total
#     136     262        40           438
%     0.311   0.598      0.091        1.00

Table 4.1: The partition of the evaluation data into test, training and validation sets.

4.2 Gold-standard comparison

As explained in Section 2.4, there are two different types of evaluation: intrinsic and extrinsic, and for this thesis intrinsic evaluation was chosen. A factor for this was


that extrinsic evaluation requires a set of test persons, which is not needed for intrinsic evaluation. Furthermore, intrinsic evaluation is the most used of the two [11], and the reference papers of TextRank use intrinsic evaluation [2, 1]. To assess the performance of each summary from the test set created by the summarizers of this thesis (system summaries), they were compared to the gold-standard abstract summaries of DUC (reference summaries). The regular precision and recall measurements of individual sentences cannot be used, since the system summaries are in the form of extracts and the reference summaries are in the form of abstracts. Instead, the ROUGE-N metric from the ROUGE metric set was used, which is a metric created for evaluating automatic summarizations and makes it possible to compare extracts with abstracts [5]. ROUGE-N was used to compare each system summary with the corresponding reference summary using n-gram overlaps. In the context of n-gram overlapping, precision is defined based on the number of n-grams in the system summary that are present in the reference summary, and recall is defined based on the number of n-grams in the reference summary that are present in the system summary. For this thesis, 1-grams were used since they have been shown to be a reasonable choice when comparing summaries [5, 50].

4.3 Preprocessing of system and reference summaries

Before the n-gram overlap was calculated (as described in Section 4.2), both the system summary and the reference summary were stemmed. The reason for this is that the system summaries were in the form of extracts, but the reference summaries were in the form of abstracts, which means that the same information from the system summary could possibly be described in the reference summary with a different inflection. With stemming applied, the words can be compared by their stems instead.

4.4 Measures

ROUGE-N uses three measures: precision, recall and F1 score.

4.4.1 Precision

Precision is the fraction of the n-grams in the system summary that are also present in the reference summary, among all the n-grams in the system summary.

\text{precision} = \frac{|\{\text{n-grams in reference summary}\} \cap \{\text{n-grams in system summary}\}|}{|\{\text{n-grams in system summary}\}|}

Precision measures the fraction of n-grams in the system summary that are relevant, according to the reference summary.

4.4.2 Recall

Recall is the fraction of the n-grams that are present in both the system summary and the reference summary, among all the n-grams in the reference summary.

\text{recall} = \frac{|\{\text{n-grams in reference summary}\} \cap \{\text{n-grams in system summary}\}|}{|\{\text{n-grams in reference summary}\}|}

Recall measures the fraction of the n-grams that are relevant according to the reference summary that are present in the system summary.

4.4.3 F1 score

The F1 score is the harmonic mean of precision and recall, with the purpose of getting a weighted average of the two. The F1 score is defined as below.

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
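
Putting the three definitions together for unigrams, a simplified ROUGE-1 computation could be sketched as below. It counts distinct unigrams after the stemming described in Section 4.3 has been applied, which is a simplification of the official ROUGE toolkit; tokens and names are hypothetical.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified ROUGE-1: precision, recall and F1 over distinct unigrams (illustrative sketch, not the official toolkit). */
public class Rouge1 {

    static double[] score(List<String> systemTokens, List<String> referenceTokens) {
        Set<String> system = new HashSet<>(systemTokens);
        Set<String> reference = new HashSet<>(referenceTokens);
        Set<String> overlap = new HashSet<>(system);
        overlap.retainAll(reference);                               // unigrams present in both summaries

        double precision = system.isEmpty() ? 0 : (double) overlap.size() / system.size();
        double recall = reference.isEmpty() ? 0 : (double) overlap.size() / reference.size();
        double f1 = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        List<String> system = List.of("storm", "damag", "roof", "town");
        List<String> reference = List.of("storm", "damag", "hous", "town", "tree");
        double[] s = score(system, reference);
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", s[0], s[1], s[2]);
    }
}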

4.4.4 Upper bound of the ROUGE score

Since it is not possible to rewrite sentences in extractive summarization, it is not possible to get a perfect score: the gold-standard summaries will be different and hence impossible to match exactly. Bengtsson et al. [13] did an experiment on the DUC 2002 data where the summaries were constructed by first selecting sentences from the gold standard and then picking the sentences in the source text that most resembled each such sentence. Even if this does not necessarily give the formal upper bound of the score, it gives an idea of the largest possible score. That summarizer got an F1 score of 0.5657.

Chapter 5

Results

5.1 The average score

Table 5.1 shows the results for the summarizer using semantic folding as similarity measure, Table 5.2 shows the results for the overlap measure, and Table 5.3 shows the results for the tf-idf measure. Each row displays the results of a PoS filter, and the columns show the recall, precision and F1 score.

For semantic folding, the summarizer with the PoS filter using nouns, adjectives and verbs performed best for all three measures, with 0.4566, 0.4622 and 0.4592 for recall, precision and F1 score respectively. For overlap, the summarizer with the PoS filter using nouns and adjectives performed best for all three measures, with 0.4660, 0.4712 and 0.4684 for recall, precision and F1 score respectively. For tf-idf, the summarizer with the PoS filter using nouns, adjectives and verbs performed best for all three measures, with 0.4662, 0.4718 and 0.4688 for recall, precision and F1 score respectively.

Table 5.4 shows the overall score for each method and its corresponding best-performing PoS filter. The best summarizer overall was the one using the tf-idf similarity measure with a PoS filter of nouns, adjectives, and verbs. This summarizer had a recall that was 0.0002 higher than overlap and 0.0096 higher than semantic folding, a precision 0.0006 and 0.0096 higher than overlap and semantic folding, and an F1 score 0.0004 and 0.0096 higher than overlap and semantic folding.

Table 5.4 also shows the results of a random summarizer picking sentences randomly. This summarizer scores 0.0624, 0.0632 and 0.0628 lower for recall, precision and F-score than the one using semantic folding as similarity measure. The same numbers for the tf-idf similarity measure are 0.072, 0.0728 and 0.0724, and for overlap 0.0718, 0.0722 and 0.072. The median and standard deviation of the scores are displayed in Appendix B.1 and B.2 respectively.


PoS-filter    Recall   Precision   F1
-             0.4511   0.4567      0.4537
N             0.4556   0.4605      0.4579
N, Adj        0.4551   0.4601      0.4574
N, Vb         0.4514   0.4571      0.4540
N, Adj, Vb    0.4566   0.4622      0.4592

Table 5.1: The average ROUGE-1 scores for using TextRank with semantic folding as similarity measure.

PoS-filter    Recall   Precision   F1
-             0.4385   0.4440      0.4410
N             0.4645   0.4695      0.4668
N, Adj        0.4660   0.4712      0.4684
N, Vb         0.4474   0.4540      0.4505
N, Adj, Vb    0.4490   0.4551      0.4519

Table 5.2: The average ROUGE-1 scores for using TextRank with term overlap as similarity measure.

PoS-filter    Recall   Precision   F1
-             0.4592   0.4647      0.4618
N             0.4612   0.4667      0.4638
N, Adj        0.4640   0.4701      0.4668
N, Vb         0.4616   0.4674      0.4643
N, Adj, Vb    0.4662   0.4718      0.4688

Table 5.3: The average ROUGE-1 scores for using TextRank with cosine similarity of tf-idf vectors as similarity measure.

Similarity measure   PoS filter    Recall   Precision   F1
Semantic Folding     N, Adj, Vb    0.4566   0.4622      0.4592
Tf-idf               N, Adj, Vb    0.4662   0.4718      0.4688
Overlap              N, Adj        0.4660   0.4712      0.4684
Random-sim           -             0.3942   0.3990      0.3964

Table 5.4: The best-performing PoS filter for each method, together with the score of the random summarizer.

Chapter 6

Discussion

This chapter discusses certain aspects of this thesis: Section 6.1 covers the chosen methodology, Section 6.2 the results, and Section 6.3 some potential criticism.

6.1 Methodology

A different possible methodology would have been to implement only the semantic folding based algorithm and compare the results with the evaluation by Mihalcea and Tarau [1], who evaluated TextRank with term overlap, and by Erkan and Radev [2], who evaluated TextRank with cosine similarity of tf-idf vectors. However, since the preprocessing is not presented in detail in either of those papers, it would have been hard to implement an algorithm comparable with theirs. Another limitation of this proposed methodology is that neither paper states whether the reported ROUGE-N scores refer to precision, recall or F-score, which would have made the comparison hard.

Section 2.4 described different methods for evaluating an automatic summarization system and pointed out that the goal of intrinsic evaluation is to measure the informativeness and readability of the summary. It is important to point out that the evaluation method chosen for this thesis does not evaluate the readability of the summaries, but only their information content. This means that the results of this thesis might be different if evaluated with readability in mind or with extrinsic methods. The TextRank algorithm was constructed to maximize information content rather than readability or other properties of a summarization system, and hence the evaluation was chosen accordingly.

As mentioned in Section 4.4.4 there is an upper bound to the ROUGE-N score, since the gold-standard abstracts differ from the extractive system summaries and a perfect score is therefore impossible to obtain. An experimental upper bound for the DUC 2002 data has been found to be around 0.5657, but it is hard to give any hard limit on what can be achieved by an algorithm on this dataset. The score is nevertheless useful when comparing different algorithms, and that is why it was chosen for this thesis.

The sentence picking strategy (see Section 3.5) chosen for this thesis was the naive approach, where sentences are added to the summary one by one in decreasing order of score until the length threshold is met, after which the summary is chopped off at 100 words. This approach guarantees that the most important sentences are chosen for the summary, but does not guarantee that the optimal score for the summary as a whole is found. Choosing sentences with the knapsack algorithm guarantees an optimal solution for the summary as a whole, but does not guarantee that the most important sentences end up in the summary. It is possible that the latter approach would yield a different result for this thesis, but published evaluations of automatic summarization typically use the former approach, and hence the results of this thesis can be compared with other studies, which would not have been possible with the knapsack approach.
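A minimal sketch of this naive strategy is shown below, assuming the sentences have already been scored by TextRank and that the word count is approximated by whitespace splitting; the class and parameter names are illustrative, not the exact implementation of this thesis.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the naive picking strategy: take sentences in decreasing score
// order until the word limit (100 words for DUC 2002) is reached; the final
// summary text would then be truncated at exactly that limit.
public final class GreedyPicker {

    public static List<String> pick(Map<String, Double> sentenceScores, int wordLimit) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(sentenceScores.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));   // highest score first

        List<String> summary = new ArrayList<>();
        int words = 0;
        for (Map.Entry<String, Double> entry : ranked) {
            if (words >= wordLimit) {
                break;                                    // threshold met, stop adding sentences
            }
            summary.add(entry.getKey());
            words += entry.getKey().split("\\s+").length; // rough word count of the sentence
        }
        return summary;
    }
}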

As described in Section 3.3, the out-edges of all nodes in the similarity graph were normalized to sum to one in order to maintain the original semantics of the PageRank algorithm. Normalizing the out-edges keeps the interpretation of a random walk over the graph with probabilities. Dropping this property of the original PageRank algorithm would be possible, and could possibly affect the results.
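As a sketch of what this normalization does, assume the similarity graph is kept as a nested map from sentence index to weighted out-edges (an assumed representation for illustration, not the JUNG graph structure used in the implementation):

import java.util.Map;

// Sketch: scale the out-edge weights of every node so that they sum to one,
// preserving the random-walk (probability) interpretation used by PageRank.
public final class OutEdgeNormalizer {

    public static void normalize(Map<Integer, Map<Integer, Double>> graph) {
        for (Map<Integer, Double> outEdges : graph.values()) {
            double sum = outEdges.values().stream().mapToDouble(Double::doubleValue).sum();
            if (sum == 0.0) {
                continue;                                  // isolated node: nothing to normalize
            }
            for (Map.Entry<Integer, Double> edge : outEdges.entrySet()) {
                edge.setValue(edge.getValue() / sum);      // weight becomes a transition probability
            }
        }
    }
}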

6.2 Results

To be considered a valid choice of similarity measure for TextRank, a measure has to at least outperform the random summarizer. All three methods in this thesis outperform the random summarizer and can therefore be considered valid choices. Semantic folding is competitive as a similarity measure for TextRank, since it scores close to the other two measures and clearly beats the random baseline. There is, however, a small difference: semantic folding performs less than one percentage point worse than both tf-idf and term overlap on all measures. Considering that semantic folding, unlike the tf-idf model, was not trained on domain-specific data, one could argue that it is comparable to the tf-idf and overlap measures and would possibly perform better if trained on domain-specific data. This is, however, not a conclusion that can be drawn with certainty from this study, but rather something that should be investigated in future work.

The performance of the random summarizer could perhaps appear quite high, with scores only about 0.06–0.07 lower than the other summarizers. A likely reason for the fairly high score is that the evaluation documents are short and therefore dense with information. This means that however the sentences are chosen, they will often contain at least some important points, which leads to the reasonably high score of the random summarizer. For longer texts, however, the gap between the random summarizer and the others would most likely be greater.

As described in Section 3.2.4, the similarity measures were evaluated with five different PoS filters. For semantic folding, the PoS filter keeping nouns, adjectives and verbs performed best, as can be seen in Table 5.4 in Chapter 5; the second best was the filter keeping only nouns. For cosine similarity of tf-idf, the best-performing filter was the same as for semantic folding, but the second best was the one keeping nouns and adjectives. For term overlap, the filter keeping nouns and adjectives performed best, with the filter keeping only nouns second best.

From the results it is safe to say that PoS filtering increases the performance of automatic summarization for the similarity measures of this thesis: omitting PoS filtering entirely resulted in the worst score for all measures with all methods. The overlap measure seems to suffer most from removing the PoS filter, scoring 2.74 percentage points lower in F1 than with its best-performing filter; the corresponding differences are 0.55 and 0.70 percentage points for semantic folding and tf-idf respectively. A possible reason could be that, for the overlap measure, the PoS filter is the only mechanism determining the importance of terms, whereas with the tf-idf measure the idf statistics can still capture term importance even without the filter.
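As an illustration of the kind of filter discussed here, the sketch below keeps nouns, adjectives and verbs, assuming Penn Treebank style tags (NN*, JJ*, VB*) aligned with the token list, as produced for example by the OpenNLP tagger [43]. The tag prefixes and class name are assumptions for illustration, not the exact filter definitions used in this thesis.

import java.util.ArrayList;
import java.util.List;

// Sketch of a PoS filter keeping nouns, adjectives and verbs,
// given tokens and their PoS tags at matching positions.
public final class PosFilter {

    public static List<String> keep(List<String> tokens, List<String> tags) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (tag.startsWith("NN") || tag.startsWith("JJ") || tag.startsWith("VB")) {
                kept.add(tokens.get(i));   // token survives the filter
            }
        }
        return kept;
    }
}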

Another consideration is that term overlap is essentially language independent, in contrast to both tf-idf and semantic folding: semantic folding needs to be trained on a text corpus to learn word contexts, and tf-idf needs to learn the idf statistics. Term overlap requires no such training (not taking into account any preprocessing that might be language dependent). So when the domain is uncertain, or when no training corpus is available, term overlap could be a better choice than semantic folding and tf-idf.

6.3 Criticism

A possible source of criticism is that the tf-idf summarizer was trained on the evaluation corpus while the semantic folding model was trained on Wikipedia. The most likely effect of this is that tf-idf gains an advantage in the comparison with semantic folding. Since the research question is whether semantic folding is feasible as a similarity measure, this is not a problem: it raises the bar rather than lowers it, and semantic folding still performs comparably with tf-idf. Had it been the other way around, with semantic folding having the advantageous setting in the comparison, it would have been a problem for answering the research question.

The implementation of semantic folding by Cortical.io is a black box, which means that the details of the implementation are hidden and cannot be reviewed. However, this issue does not prevent answering the research question of whether semantic folding is feasible to use as a similarity measure in automatic summarization.

6.4 Ethics and sustainability

There were no ethical issues directly involved with the research in this thesis. A possible risk concerning automatic summarization in general, however, is that major and important points could be excluded from a text if the summarization is done without human supervision, which may turn the text into something other than what its author intended. If care is not taken, a text could become misleading and spread incorrect information.

Chapter 7

Conclusion

Using semantic folding as similarity measure in TextRank performed less than one percentage point worse than using cosine similarity of tf-idf vectors, which was the similarity measure that scored best in the evaluation of this thesis. Since the difference between the measures is so small, semantic folding can be considered a viable choice of similarity measure in TextRank, compared with the other similarity measures evaluated here. However, the choice of similarity measure depends on more than the scores on the DUC documents. Semantic folding is a viable choice when constructing an automatic summarization system if the domain is known and there are resources available to train the model. If the domain is unknown, or if resources are limited, another similarity measure such as term overlap could be a better choice. Furthermore, according to the results, semantic folding is not as sensitive to the choice of PoS filter as the two other similarity measures evaluated, which could also be a factor to consider when choosing similarity measure. The PoS filter that performed best for semantic folding was the one keeping nouns, adjectives and verbs.

7.1 Future work

This section concludes the thesis with some suggestions for future work.

7.1.1 Train model on specific domain

This project used a general semantic folding model trained on English Wikipedia articles, which probably had a negative effect on the results for semantic folding. A relevant improvement for future work would therefore be to train the semantic folding model on a domain relevant to the evaluation; the semantic folding similarity measure would probably benefit from a model trained on the domain it will be used on.


7.1.2 Preprocessing

Another relevant direction for future work would be to investigate the preprocessing for semantic folding more thoroughly and how it affects the results. This thesis only evaluated different PoS filters, but other preprocessing steps such as stop word removal, stemming or lemmatization could also impact the results.

7.1.3 Extrinsic evaluation

Only intrinsic evaluation was done for this thesis. It is possible that extrinsic evaluation would reveal properties of the summaries that were not found with intrinsic evaluation alone. Hence, performing extrinsic evaluation and observing how this impacts the results would be a relevant addition to this thesis.

7.1.4 Evaluate readability

As stated in Section 2.4, there are essentially two criteria for evaluating a summary with intrinsic evaluation: readability and informativeness. For this thesis, only the informativeness of the summaries was considered during evaluation, and it would therefore be interesting to investigate what the results would be with more focus on the readability of the summaries instead.

Bibliography

[1] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W04-3252.

[2] Günes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., pages 457–479, USA, December 2004. AI Access Foundation. URL http://dl.acm.org/citation.cfm?id=1622487.1622501.

[3] Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54, 2008. ISSN 11202726. URL https://www.diva-portal.org/smash/get/diva2:1041938/FULLTEXT01.pdf.

[4] Francisco De Sousa Webber. Semantic folding theory and its application in semantic fingerprinting. CoRR, abs/1511.08855, 2015. URL http://arxiv.org/abs/1511.08855.

[5] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), (1):25–26, 2004. ISSN 00036951. URL http://anthology.aclweb.org/W/W04/W04-1013.pdf.

[6] Julia Murphy and Max Roser. Internet, 2017. URL https://ourworldindata.org/internet/. Accessed: 2017-02-21.

[7] The Economist. Data, data everywhere, 2010. URL http://www.economist.com/node/15557443. Accessed: 2017-02-21.

[8] H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, April 1958. ISSN 0018-8646. doi: 10.1147/rd.22.0159. URL http://dx.doi.org/10.1147/rd.22.0159.

[9] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, April 1969. ISSN 0004-5411. doi: 10.1145/321510.321519. URL http://doi.acm.org/10.1145/321510.321519.


[10] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95, pages 68–73, New York, NY, USA, 1995. ACM. ISBN 0-89791-714-6. doi: 10.1145/215206.215333. URL http://doi.acm.org/10.1145/215206.215333.

[11] Martin Hassel. Resource lean and portable automatic text summarization. Trita-CSC-A, ISSN 1653-5723; 2007:9, 2007. ISBN 978-91-7178-704-0. doi: 10.1017/CBO9781107415324.004. URL http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A12198&dswid=4310.

[12] Bianca Bosker. Google knowledge graph could make clicking unnecessary, 2017. URL http://www.huffingtonpost.com/2012/05/16/google-knowledge-graph_n_1521292.html. Accessed: 2017-04-27.

[13] Jonatan Bengtsson and Christoffer Skeppstedt. Automatic extractive single document summarization: an unsupervised approach. Chalmers, Gothenburg, Sweden, 2012. URL http://publications.lib.chalmers.se/records/fulltext/174136/174136.pdf.

[14] Chin-Yew Lin and Eduard Hovy. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 457–464, 2002. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.5675.

[15] Niladri Chatterjee and Shiwali Mohan. Extraction-based single-document summarization using random indexing. In Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, 2007. ISBN 076953015X. doi: 10.1109/ICTAI.2007.28. URL http://ieeexplore.ieee.org/document/4410421/.

[16] René Arnulfo García-Hernández, Romyna Montiel, Yulia Ledeneva, Eréndira Rendón, Alexander Gelbukh, and Rafael Cruz. Text Summarization by Sentence Extraction Using Unsupervised Learning, pages 133–143. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. ISBN 978-3-540-88636-5. doi: 10.1007/978-3-540-88636-5_12. URL http://dx.doi.org/10.1007/978-3-540-88636-5_12.

[17] Daniel S. Leite, Lucia Helena Machado Rino, Thiago A. S. Pardo, and Maria Das Gracas Volpe Nunes. Extractive automatic summarization: Does more linguistic knowledge make a difference? São Carlos - SP, Brazil, 2007. Núcleo Interinstitucional de Lingüística Computacional (NILC). URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.145.9662.

[18] Daniel Saraiva Leite and Lucia Helena Rino. Combining multiple features for automatic text summarization through machine learning. In Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language, PROPOR '08, pages 122–132, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-85979-6. doi: 10.1007/978-3-540-85980-2_13. URL http://dx.doi.org/10.1007/978-3-540-85980-2_13.

[19] Vishal Gupta and Gurpreet Singh Lehal. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2(3):258–268, August 2010. URL https://www.researchgate.net/publication/228619779_A_Survey_of_Text_Summarization_Extractive_Techniques.

[20] Inderjeet Mani and Eric Bloedorn. Multi-document summarization by graph search and matching. Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, cmp-lg/9712004:622–628, 1997. doi: 10.3115/1119467.1119476. URL http://arxiv.org/abs/cmp-lg/9712004.

[21] Eduard Hovy and Chin-Yew Lin. Automated text summarization in SUMMARIST. Advances in Automatic Text Summarization, pages 81–94, 1999. doi: 10.3115/1119089.1119121. URL http://research.microsoft.com/en-us/um/people/cyl/download/papers/mit-book-paper-final-cyl.pdf.

[22] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. Information Processing & Management, 40(6):919–938, 2004. ISSN 03064573. doi: 10.1016/j.ipm.2003.10.006. URL http://arxiv.org/abs/cs/0005020.

[23] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. URL http://ilpubs.stanford.edu:8090/422/. Previous number = SIDL-WP-1999-0120.

[24] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998. ISSN 0169-7552. doi: 10.1016/S0169-7552(98)00110-X. URL http://www.sciencedirect.com/science/article/pii/S016975529800110X.

[25] Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965. ISSN 00010782. doi: 10.1145/365628.365657. URL http://dl.acm.org/citation.cfm?id=365657.

[26] David E. Rumelhart and David Zipser. Feature discovery by competitive learning. Cognitive Science, 9(1):75–112, 1985. ISSN 1551-6709. doi: 10.1207/s15516709cog0901_5. URL http://dx.doi.org/10.1207/s15516709cog0901_5.

[27] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

[28] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, MA, 2008.

[29] The Stanford Natural Language Processing Group. Stanford log-linear part-of-speech tagger. URL https://nlp.stanford.edu/software/tagger.shtml. Accessed: 2017-04-19.

[30] Wikipedia. Stemming. URL https://en.wikipedia.org/wiki/Stemming. Accessed: 2017-02-20.

[31] Wikipedia. Lemmatisation. URL https://en.wikipedia.org/wiki/Lemmatisation. Accessed: 2017-02-20.

[32] Inderjeet Mani. Summarization evaluation: An overview. Reston, VA, 2001. The MITRE Corporation. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.2078.

[33] Scott Deerwester, Susan T. Dumais, George W. Furnas, T. K. Landauer, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990. ISSN 00028231. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. URL http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9/abstract.

[34] Magnus Sahlgren. An introduction to random indexing. Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, pages 1–9, 2005. ISSN 1570-7075. doi: 10.1.1.96.2230. URL http://krextown.googlecode.com/svn-history/r41/trunk/infomap/RI_intro.pdf.

[35] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013. URL https://arxiv.org/abs/1310.4546.

[36] Niladri Chatterjee and Pramod Kumar Sahoo. Random indexing and modified random indexing based approach for extractive text summarization. In Proceedings - Computer Speech and Language, 2014. doi: 10.1016. URL http://www.sciencedirect.com/science/article/pii/S0885230814000722.

[37] Liu Jiangzhen and Ning Jianfei. Using word2vec with TextRank to extract keywords. Data Analysis and Knowledge Discovery, 32(6):20, 2016. doi: 10.11925/infotech.1003-3513.2016.06.03. URL http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/abstract/article_4233.shtml.

[38] Yihong Gong and Xin Liu. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '01, pages 19–25, 2001. ISSN 01635840. doi: 10.1145/383952.383955. URL http://portal.acm.org/citation.cfm?doid=383952.383955.

[39] Makbule Gulcin Ozsoy, Ferda Nur Alpaslan, and Ilyas Cicekli. Text summarization using latent semantic analysis. J. Inf. Sci., 37(4):405–417, August 2011. ISSN 0165-5515. doi: 10.1177/0165551511408848. URL http://dx.doi.org/10.1177/0165551511408848.

[40] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: Sentence detector. URL https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect. Accessed: 2017-03-29.

[41] Apache OpenNLP Development Community. Models for 1.5 series. URL http://opennlp.sourceforge.net/models-1.5/. Accessed: 2017-05-05.

[42] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: Tokenization. URL https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.tokenizer.cmdline. Accessed: 2017-05-05.

[43] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: Part-of-speech tagger. URL https://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger. Accessed: 2017-04-20.

[44] Apache OpenNLP Development Community. Apache OpenNLP developer documentation: PorterStemmer. URL https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/stemmer/PorterStemmer.html. Accessed: 2017-05-05.

[45] Cortical.io. Cortical.io semantic folding API. URL http://www.cortical.io/product_retina_api.html. Accessed: 2017-03-01.

[46] Cortical.io. Cortical.io Java client SDK. URL https://github.com/cortical-io/java-client-sdk. Accessed: 2017-03-01.

[47] JUNG. Java class PageRank. URL http://jung.sourceforge.net/doc/api/edu/uci/ics/jung/algorithms/scoring/PageRank.html. Accessed: 2017-03-01.

[48] Amy N. Langville and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2012. ISBN 0691152667, 9780691152660. URL http://press.princeton.edu/titles/8216.html.

[49] National Institute of Standards and Technology (NIST). DUC 2002 guidelines. URL http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html. Accessed: 2017-04-18.

[50] Chin-Yew Lin and Marina Rey. Looking for a few good metrics: ROUGE and its evaluation. NTCIR Workshop, (June):2–4, 2004. ISSN 15397890. URL https://www.researchgate.net/publication/229010212_Looking_for_a_few_good_metrics_ROUGE_and_its_evaluation.

Appendix A

Preprocessing

A.1 Steps of the Porter stemming algorithm

The rules of each step are listed below, with ε denoting the empty suffix; a short code sketch after the rule listing illustrates how the rules are applied.

Step 1a SSES→SS

IES→I

SS→SS

S→ ε

Step 1b (m > 0) EED→EE

(∗v∗)ED→ ε

(∗v∗)ING→ ε

Step 1b’ This step is done if rule two or three in step 1b matched.

AT→ATE

BL→BLE

IZ→IZE

(∗d ∧ ¬(∗L∨∗S∨∗Z))→prefix without the last letter

(m = 1∧∗o)→E


Step 2 (m > 0) ATIONAL→ATE

(m > 0) TIONAL→TION

(m > 0) ENCI→ENCE

(m > 0) ANCI→ANCE

(m > 0) IZER→IZE

(m > 0) ABLI→ABLE

(m > 0) ALLI→AL

(m > 0) ENTLI→ENT

(m > 0) ELI→E

(m > 0) OUSLI→OUS

(m > 0) IZATION→IZE

(m > 0) ATION→ATE

(m > 0) IVENESS→IVE

(m > 0) FULNESS→FUL

(m > 0) OUSNESS→OUS

(m > 0) IVITI→IVE

(m > 0) BILITI→BLE

Step 3 (m > 0) ICATE→IC

(m > 0) ATIVE→ ε

(m > 0) ALIZE→AL

(m > 0) ICITI→IC

(m > 0) ICAL→IC

(m > 0) FUL→ ε

(m > 0) NESS→ ε

Step 4 (m > 1) AL→ ε

(m > 1) ANCE→ ε

(m > 1) ENCE→ ε

(m > 1) ER→ ε

(m > 1) IC→ ε

(m > 1) ABLE→ ε

(m > 1) IBLE→ ε

(m > 1) ANT→ ε

(m > 1) EMENT→ ε

(m > 1) MENT→ ε

(m > 1) ENT→ ε

(m > 1∧(∗S∨∗T)) ION→ ε

(m > 1) OU→ ε

(m > 1) ISM→ ε

(m > 1) ATE→ ε

(m > 1) ITI→ ε

(m > 1) OUS→ ε

(m > 1) IVE→ ε

(m > 1) IZE→ ε

Step 5a (m > 1) E→ ε

(m = 1∧¬∗o) E→ ε

Step 5b (m > 1∧∗d∧∗L)→prefix without the last letter
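To illustrate how the rules above are read, the sketch below implements Step 1a only: the first matching suffix rule rewrites the end of the word. It is not a complete Porter stemmer; the actual implementation used in this thesis was the OpenNLP PorterStemmer [44], and the class name here is purely illustrative.

// Minimal sketch of Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (empty).
public final class Step1a {

    public static String apply(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2);    // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2);    // ponies -> poni
        }
        if (word.endsWith("ss")) {
            return word;                                    // caress -> caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);    // cats -> cat
        }
        return word;                                        // no rule matched
    }
}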

A.2 The stoplist used

the, and, of, in, to, on, for, that, by, was, is, at, it, said, from, with, as, an, be, have, he, but, who, had, were, year, been, has, are, not, him, about, after, two, which, new, this, there, their, other, when, all, would, more, also, or, no, will, than, time, we, out, if, some, could, can, them, where, what, just, because, off, so, did, make, while, still, she, then, her, want

Figure A.1: A list of some terms usually considered stop words.

Appendix B

Results

B.1 The median score

PoS-filter    Recall   Precision   F1
-             0.4575   0.4549      0.4556
N             0.4528   0.4550      0.4538
N, Adj        0.4477   0.4605      0.4557
N, Vb         0.4490   0.4559      0.4526
N, Adj, Vb    0.4586   0.4650      0.4601

Table B.1: The median ROUGE-1 scores for using TextRank with semantic folding as similarity measure.

PoS-filter    Recall   Precision   F1
-             0.4412   0.4366      0.4371
N             0.4675   0.4681      0.4684
N, Adj        0.4691   0.4757      0.4716
N, Vb         0.4583   0.4602      0.4594
N, Adj, Vb    0.4568   0.4573      0.4561

Table B.2: The median ROUGE-1 scores for using TextRank with term overlap as similarity measure.


PoS-filter    Recall   Precision   F1
-             0.4654   0.4709      0.4698
N             0.4616   0.4707      0.4652
N, Adj        0.4712   0.4803      0.4750
N, Vb         0.4602   0.4683      0.4652
N, Adj, Vb    0.4787   0.4781      0.4816

Table B.3: The median ROUGE-1 scores for using TextRank with cosine similarity of tf-idf vectors as similarity measure.

B.2 The standard deviation score

PoS-filter    Recall    Precision   F1
-             0.08940   0.08874     0.08846
N             0.08590   0.08454     0.08467
N, Adj        0.08574   0.08472     0.08467
N, Vb         0.08634   0.08759     0.08644
N, Adj, Vb    0.08870   0.08880     0.08818

Table B.4: The standard deviation of the ROUGE-1 scores for using TextRank with semantic folding as similarity measure.

PoS-filter    Recall    Precision   F1
-             0.09158   0.09173     0.09116
N             0.08545   0.08610     0.08515
N, Adj        0.08119   0.08184     0.08085
N, Vb         0.08614   0.08620     0.08566
N, Adj, Vb    0.08139   0.08265     0.08143

Table B.5: The standard deviation of the ROUGE-1 scores for using TextRank with term overlap as similarity measure.

PoS-filter    Recall    Precision   F1
-             0.07810   0.07975     0.07834
N             0.08279   0.08324     0.08242
N, Adj        0.07943   0.08042     0.07934
N, Vb         0.08472   0.08392     0.08372
N, Adj, Vb    0.08285   0.08372     0.08274

Table B.6: The standard deviation of the ROUGE-1 scores for using TextRank with cosine similarity of tf-idf vectors as similarity measure.