LiU-ITN-TEK-A--21/016-SE

Extractive Text Summarization of Norwegian News Articles Using BERT

Thomas Indrias Biniam
Adam Morén

2021-06-04

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

Extractive Text Summarization of Norwegian News Articles Using BERT

Thesis work carried out in Datateknik (Computer Engineering) at the Institute of Technology (Tekniska högskolan), Linköping University

Thomas Indrias Biniam
Adam Morén

Norrköping 2021-06-04


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Thomas Indrias Biniam, Adam Morén

Abstract

Extractive text summarization has over the years been an important research area in natural language processing. Numerous methods have been proposed for extracting information from text documents. Recent work has shown great success on English summarization tasks by fine-tuning the language model BERT using large summarization datasets. However, less research has been devoted to low-resource languages. This work contributes by investigating how BERT can be used for Norwegian text summarization. Two models are developed by applying a modified BERT architecture, called BERTSum, to pre-trained Norwegian and Multilingual BERT. The resulting models are able to predict key sentences from articles to generate bullet-point summaries. These models are evaluated with the automatic metric ROUGE, and in this evaluation the Multilingual BERT model outperforms the Norwegian model. The multilingual model is further evaluated in a human evaluation by journalists, revealing that the generated summaries are not entirely satisfactory in some aspects. With some improvements, the model shows potential as a valuable tool for journalists to edit and rewrite generated summaries, saving time and workload.

Acknowledgments

We want to start by giving our gratitude to our supervisor, Elmira Zohrevandi, and examiner, Pierangelo Dellacqua, at Linköping University for their commitment and valuable advice. We also want to express our gratitude to our supervisor at Schibsted, Björn Schiffler, for actively committing and guiding us with both technical and academic advice throughout our work. We thank the contextual team at Schibsted for welcoming us into the team and providing us with advice and the necessary data to perform our research.

Norrköping, June 2021 Adam Morén and Thomas Indrias

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures viii

List of Tables ix

1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Aim 2
1.4 Research question 3
1.5 Delimitations 3

2 Theory 4
2.1 Natural Language Processing 5
2.1.1 Text Processing 5
2.1.2 Statistical Methods 6
2.1.3 Artificial Neural Networks 7
2.2 Sequential models 9
2.2.1 RNN 9
2.2.2 Encoder-Decoder 11
2.2.3 Attention 11
2.2.4 Transformers 12
2.3 BERT 15
2.3.1 Input and Output Embeddings 15
2.3.2 Pre-training 17
2.3.3 Fine-tuning 18
2.3.4 Pretrained BERT models 18
2.4 Extractive Text Summarization Methods 19
2.4.1 TF-IDF 19
2.4.2 TextRank 20

2.4.3 BERTSum 21
2.5 Evaluation metrics for summarization 23
2.5.1 Precision, Recall and F-Score 23
2.5.2 ROUGE 23
2.5.3 Qualitative Evaluation 25

3 Method 27
3.1 Datasets 27
3.1.1 CNN/DailyMail 27
3.1.2 Aftenposten/Oppsummert 28
3.2 Implementation 30
3.2.1 Restructure of the AP/Oppsummert dataset 31
3.2.2 Truncation of articles 33
3.2.3 Oracle Summary Generation 33
3.2.4 Models 34
3.2.5 Hyper-Parameters 36
3.2.6 Fine tuning 37
3.2.7 Prediction 38
3.2.8 Hardware 38
3.3 Evaluation 39
3.3.1 ROUGE Evaluation 39
3.3.2 Human Evaluation with Journalists 40

4 Results 41
4.1 Implementation 41
4.2 Evaluation 45
4.2.1 ROUGE Evaluation 45
4.2.2 Human Evaluation with Journalists 45

5 Discussion 51
5.1 Results 51
5.1.1 ROUGE Scores 51
5.1.2 Sentence Selection 52
5.1.3 Human Evaluation with Journalists 53
5.2 Method 56
5.2.1 Datasets 56
5.2.2 Implementation 58
5.2.3 Metrics 58
5.3 The work in a wider context 60
5.3.1 Ethical Aspects 60

6 Conclusion 62
6.1 Conclusion 62
6.2 Future Work 64

Bibliography 66

A Appendix I
A.1 All responses from Human Evaluation I
A.1.1 Article 1 I
A.1.2 Article 2 III
A.1.3 Article 3 V
A.1.4 Article 4 VIII

List of Figures

2.1 Perceptron model 8
2.2 Illustration of a multilayer neural network 9
2.3 RNN illustrated by C. Olah [30] 10
2.4 RNN unpacked, illustrated by C. Olah [30] 10
2.5 RNN Encoder-Decoder sequence-to-sequence model illustrated by Kostadinov [19] 12
2.6 Attention example illustrated by Bahdanau et al. [2] 13
2.7 Model architecture of a transformer illustrated by Vaswani et al. [43] 14
2.8 The input embeddings and embedding layers for BERT illustrated by Devlin et al. [7] 15
2.9 Two words broken down into sub-words using WordPiece tokenization 16
2.10 Binary labels generated by a pair of inputs 17
2.11 Position embeddings layer 18
2.12 Architecture of BERTSum proposed by Liu [23] 22

3.1 Summaries associated with x articles in the AP/Oppsummert dataset 30
3.2 Number of sentences in the AP/Oppsummert summaries dataset 30
3.3 ROUGE-2 and ROUGE-L recall scores for summaries with one article in (a) and (b), summaries with more articles and the top-scoring articles in (c) and (d), and summaries with more articles and the second-best articles in (e) and (f) 32
3.4 Proportion of sentences with highest ROUGE score according to their position in the original article 34

4.1 Sentence selection for Norwegian BERT fine-tuned on (a) Oracle-3, (b) Oracle-7, (c) Oracle-10 with trigram blocking and on (d) Oracle-3, (e) Oracle-7 and (f) Oracle-10 without trigram blocking 43
4.2 Sentence selection for Multilingual BERT fine-tuned on (a) Oracle-3, (b) Oracle-7, (c) Oracle-10 with trigram blocking and on (d) Oracle-3, (e) Oracle-7 and (f) Oracle-10 without trigram blocking 44
4.3 Average human evaluation scores for each category, where the highest score for each example is 20 46

List of Tables

3.1 Average token and sentence count for news articles and summaries in the CNN/DailyMail dataset 28
3.2 Dataset split 28
3.3 Article data type in the AP/Oppsummert dataset 29
3.4 Summary data type in the AP/Oppsummert dataset 29
3.5 Average token and sentence count for news articles and summaries for AP/Oppsummert 29

4.1 Time taken to fine-tune the Norwegian and Multilingual BERT models 41
4.2 ROUGE scores on AP/Oppsummert test data (116 articles). *With trigram blocking 47
4.3 Journalists' opinions reflecting their satisfaction with generated summaries 48
4.4 Journalists' opinions on generated summaries mentioning features on which they found the algorithm performed weakly 49
4.5 Journalists' opinions on generated summaries reflecting potential for improvement 50
4.6 Overall comments from journalists 50

A.1 Responses from the journalists on article 1 III
A.2 Responses from the journalists on article 2 V
A.3 Responses from the journalists on article 3 VII
A.4 Responses from the journalists on article 4 IX

1 Introduction

Over recent years, the amount of data available to both users and companies has kept increasing massively. In response, summarizing data has become a popular topic in data science. Text summarization is part of this, focusing on representing content in a shorter format. Considering the amount of text data in news and media, this is one field where automatic text summarization could be beneficial.

1.1 Background

Aftenposten is Norway's largest daily newspaper, based in Oslo. It is a private company owned by Schibsted and has an estimated 1.2 million readers. To save readers time, Aftenposten developed a daily brief called Oppsummert, which features the most important stories of the day in a summarized format. The idea is to help readers stay updated on the most important daily news in a time-efficient way and at the same time offer a consistent and standardized reading experience.

Summarizing articles manually leads to an increased workload for journalists. Additionally, many journalists want to focus on great journalism and innovation, not on re-writing shorter versions of already written articles. The challenge is to deliver daily briefs to readers while, at the same time, using fewer resources from the newsroom and its journalists. For this challenge, we see the potential of automatic text summarization, achieved by teaching machines to understand and process human language.

There are two main text summarization strategies: extractive and abstractive.


Extractive techniques are about identifying the most important sentences of a text and extracting them. In contrast, abstractive techniques produce new, shorter text that captures the context of the original longer text. When implementing automatic text summarization in this thesis, the approach will be to use extractive techniques. The motivation for this is that we want the summaries to use sentences written by the original journalist. Abstractive summarization techniques can sometimes lead to misinformation or biased generated results, which we want to prevent.

Traditional approaches for extractive text summarization are based on statistical and graph-based methods such as TF-IDF and TextRank. However, these methods have recently started to be replaced by methods based on neural networks, such as BERT. BERT is a new state-of-the-art language model that can learn to perform specific tasks using labeled data.

In our case, the amount of summarized articles from Aftenposten is limited since Oppsummert is a newly released feature. Therefore, we hypothesize that it will be challenging to train a Norwegian BERT model and get good results. Our approach will investigate this and try different BERT models and methods.

1.2 Motivation

In the massive flood of news and media seen today, it can be challenging for newsreaders to filter out the most important daily news. Furthermore, due to the rapidness of our daily lives, users often want to be as time-efficient as possible. Therefore, the motivation of news summaries is to help readers stay updated on the most important daily news in a time-efficient way. However, writing these summaries manually leads to an increased workload for the journalists. This is where we see the potential of machine learning for generating these summaries automatically. By implementing a model that can extract key sentences from an article, we can reduce the workload for journalists and at the same time deliver summaries with the most important content to the newsreaders.

1.3 Aim

The thesis project aims to develop a model that can extract the most relevant sentences from an article written in Norwegian on which journalists can base their summaries. This will be done by investigating possible approaches for extractive text summarization using BERT with a limited labeled Norwegian data set and evaluating the results.


1.4 Research question

The current work aims to answer the following research question:

• How can a high-performance BERT-based extractive summarization model be developed based on a limited amount of news summaries in Norwegian?

To this end we aim to investigate:

• How news summaries can be used to generate the labeled data that is required for a supervised learning model.

• How the model’s performance should be evaluated and assessed.

• How BERT can be used for extractive text summarization on a low-resource language.

• Limitations with BERT and how they should be dealt with.

1.5 Delimitations

The study focuses on BERT-based models for extractive text summarization. However, it will also explore traditional methods for comparison purposes. The articles and summaries from Aftenposten are in Bokmål, one of Norway’s official writing standards. Therefore, we narrow the scope of the language to only Bokmål.

2 Theory

Automatic text summarization is the process of a machine condensing a longer text into a shorter, comprehensive synopsis. The technique can be either abstractive or extractive. The abstractive approach aims to present a text with newly generated sentences, while the extractive technique aims to find and re-use key sentences from the original text. The output format can be bullet points, quotes, questions or speakable summaries. These outputs are usually analyzed and rated by how well they capture the main points, grammar, text quality, etc. A summarization architecture must be able to capture the essence of longer articles, and summary evaluation therefore becomes a crucial part of automatic text summarization.

Extractive text summarization can be treated either as a sentence scoring and selection task or as a sentence classification task. Sentence scoring and selection is the traditional approach where each sentence is scored based on its importance to the text, and the sentences with the highest scores are selected for the final summary. With the approach of sentence classification, each sentence is instead classified into one of two different classes: extracted or not extracted. The former approach is part of statistical methods, and the latter approach utilizes neural networks for task-learning.

In the following, we cover common methods and tasks with a focus on text and text summarization. In section 2.1, we introduce natural language processing, the research area to which automatic text summarization belongs. Secondly, in section 2.2, previous methods for textual tasks based on neural networks are introduced. Thirdly, in section 2.3, the current state-of-the-art language model BERT is introduced. Finally, in sections 2.4 and 2.5, we present different methods for extractive text summarization and how these models can be evaluated.

2.1 Natural Language Processing

Natural language processing (NLP) is the field of computer science and artificial intelligence that deals with enabling computers to understand and process human language. This includes the understanding of both written and spoken language, which comes with many complex challenges. Today, computer applications are expected to translate, answer voice commands, give directions, and even produce human-like texts. These challenges are hard for computers to manage because of how abstract and inconsistent human language is in its nature. Humor, sarcasm, irony, intent, and sentiment are a few examples that vary not only between languages but also between people.

NLP aims to solve these challenges by converting language to numerical computational inputs that a computer can understand and process. By then combining computer algorithms with statistics, it’s possible to extract, classify, and label language.

2.1.1 Text Processing

For a computer to be able to work with text and solve larger NLP tasks, the text input must first be processed. Text processing contains several subtasks, such as:

Tokenization: Tokenization is usually the first subtask in text processing. It is used to separate a chunk of continuous text into tokens that help the computer better understand the text. For example, the sentence "The firefighter saved the cat. Awesome!" would with word tokenization be converted into ["The", "firefighter", "saved", "the", "cat", ".", "Awesome", "!"], and with sentence tokenization into ["The firefighter saved the cat.", "Awesome!"]. After a text input has been tokenized, the computer can use the processed input for other important processes, such as stemming and lemmatization.
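To make this concrete, the following is a minimal sketch using the NLTK library (an assumption of ours; it requires the "punkt" tokenizer data to be downloaded) applied to the example sentence above:

```python
# Minimal tokenization sketch using NLTK (assumes `nltk` is installed and the
# "punkt" tokenizer models have been downloaded via nltk.download("punkt")).
from nltk.tokenize import word_tokenize, sent_tokenize

text = "The firefighter saved the cat. Awesome!"

print(word_tokenize(text))  # word tokens: ['The', 'firefighter', 'saved', 'the', 'cat', '.', 'Awesome', '!']
print(sent_tokenize(text))  # sentence tokens: ['The firefighter saved the cat.', 'Awesome!']
```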

Stemming and Lemmatization: Stemming and lemmatization are methods for trimming words down to their root form. For example, the word "saved" has the root "save". The difference between the two methods is that stemming solely changes the form of a word by stripping affixes, whereas lemmatization finds the dictionary form of the word. This means that after applying lemmatization, we always get a valid word.
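As an illustration, a small sketch using NLTK's Porter stemmer and WordNet lemmatizer (again an assumption of ours, requiring the WordNet data to be available):

```python
# Stemming vs. lemmatization sketch using NLTK (assumes the "wordnet" corpus
# has been downloaded via nltk.download("wordnet")).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("saved"))                   # 'save'  (rule-based suffix stripping)
print(lemmatizer.lemmatize("saved", pos="v"))  # 'save'  (dictionary form, given the verb POS)
print(stemmer.stem("studies"))                 # 'studi' (stems are not always valid words)
print(lemmatizer.lemmatize("studies"))         # 'study' (lemmas are valid words)
```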


Stop Words: Stop words are usually words with little semantic content of their own, and they are therefore considered not to provide any relevant information for the task. The English language contains several hundred stop words, such as "the" or "and", which carry no meaning by themselves and are therefore often removed from documents [39].

POS tagging: Part-of-speech tagging (POS tagging) is the method for identifying the part of speech of each word (noun, adverb, adjective, etc.). Some words in a language can be used as more than one part of speech. For example, consider the word "book", which can be used both as a noun, as in "The firefighter read a book", and as a verb, as in "Did the firefighter book a table?". This example shows why POS tagging becomes important when processing the meaning of a text.

Sentence boundary identification: Sentence boundary identification is needed for the system to recognize the end of sentences in a document. Establishing where a sentence ends and the next one begins is important for a clear sentence structure in many NLP tasks.

Word Embeddings: Word embeddings are a method for representing words in a way that captures syntactic and semantic information. Usually, words are mapped to a vector space of fixed dimension, where words with similar meaning lie closer together. This makes it possible, for example, to detect synonymous words or to suggest additional words for sentences. Word embeddings are obtained from, and used by, language models that use neural networks to learn word associations from a large corpus of text.
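A toy illustration of the "closer together" idea, using made-up three-dimensional vectors (real embeddings are learned by a language model and typically have hundreds of dimensions):

```python
# Toy illustration of word vectors and similarity; the 3-dimensional vectors
# below are invented for illustration only.
import numpy as np

embeddings = {
    "cat":         np.array([0.90, 0.10, 0.00]),
    "kitten":      np.array([0.85, 0.15, 0.05]),
    "firefighter": np.array([0.10, 0.90, 0.30]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar meaning end up closer together in the vector space.
print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))       # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["firefighter"]))  # clearly smaller
```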

2.1.2 Statistical Methods

For many years, the most common NLP methods were based solely on statistics. For text, this includes algorithms and rules built on statistics of the words and sentences in a document. An example is TF-IDF, a numerical statistic that reflects a word's importance in a collection of text documents. Machine learning algorithms should also be mentioned here, as they revolutionized natural language processing when introduced in the 1980s. Popular machine learning classifiers and algorithms are Naive Bayes, Support Vector Machines, decision trees, and graph-based structures.

Today, statistical methods in NLP have largely been replaced by neural networks. However, statistical methods continue to be relevant in some contexts, for example when the amount of training data is insufficient. In section 2.4 we investigate two statistical methods for extractive text summarization: TF-IDF and a graph-based method called TextRank.


2.1.3 Artificial Neural Networks

Most methods that currently achieve state-of-the-art results for NLP tasks employ neural networks. A neural network is an artificial intelligence system that mimics how a biological brain works through artificial neurons. It enables models to learn tasks iteratively, and one of the reasons for its success in recent years is the massive increase in data on which models can train. In this section, we give an overview of the main concepts of neural networks in order to better understand what is happening when a model learns to perform a specific task.

Perceptron: Single Layer Neural Net

The simplest form of a neural network is a single-layer perceptron, capable of classifying linearly separable data. A perceptron is an algorithm that can be described as a simplified biological neuron. It takes in a signal and outputs a binary signal. A vector of numbers represents the input, and the classification of this input represents the output. The framework for a perceptron is the following:

• Input: x = (x1, x2, ..., xd)
• Output: y

• Model: Weight vector w = (w1, w2, ..., wd) and bias b

The perceptron makes its predictions based on the prediction function presented in Eq 2.1. An illustration of this equation is also shown in Fig 2.1, where f is an activation function that can differ between types of neurons, w is the weight vector that represents the strength of the connections, T denotes the transpose, and b is the bias.

y = f(w^T x + b)    (2.1)
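A minimal sketch of Eq 2.1 with a step function as the activation f; the weight, bias, and input values are made up for illustration:

```python
# Minimal perceptron prediction sketch implementing Eq. 2.1, y = f(w^T x + b),
# with a step function as the activation f. All numeric values are illustrative.
import numpy as np

def step(z):
    return 1 if z >= 0 else 0

def perceptron_predict(x, w, b):
    return step(w @ x + b)

w = np.array([0.4, -0.7, 0.2])   # weight vector
b = 0.1                          # bias
x = np.array([1.0, 0.5, 2.0])    # one input example

print(perceptron_predict(x, w, b))  # binary output, here 1 since 0.45 + 0.1 >= 0
```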

Training Perceptrons

The training of perceptrons is a process known as gradient descent. The goal of gradient descent is to find the optimal parameters w that minimize the loss function. A training set is used during training, which contains input values X and their corresponding correct values Y. The model predicts a value ŷ from the input x, and this prediction is then compared with the actual value y. The difference between predicted values and actual values is the error E of the model. E can be calculated in different ways depending on the output type of the model. A model with a binary output usually uses the binary cross-entropy loss. Since the goal is to minimize the error of the loss function, we are looking for the point where the gradient of the loss function is zero.


Figure 2.1: Perceptron model

This is why it is called gradient descent: the goal is to go down the gradient until it no longer has a slope, i.e., the error becomes as small as it can possibly get.

The model parameters are updated via the equation presented in Eq 2.2. Here, we calculate the new values of the parameters as the old parameters minus a step in the direction of the derivative. ε is called the learning rate, and it is a value that determines how big the step should be. The step size is important because if it is too large, we risk stepping over the optimal point, and if it is too small, the descent takes too much computational time.

w_i(t + 1) = w_i(t) − ε ∂E/∂w_i    (2.2)

Training of perceptrons happens in epochs. An epoch is defined as a full cycle through the training data. In the standard gradient descent method, we accumulate the loss for every example in the training set before updating the weights. The problem with standard gradient descent is that if the number of training samples is large, it may be time-consuming, because we have to run through the whole training set for every parameter update. Instead, Stochastic Gradient Descent (SGD) is often applied for faster convergence. There are two main methods for SGD:

• Update weights for each sample

• Minibatch SGD: Update weights for a small set of samples

Updating the weights for each sample is fast, but it makes the model very sensitive to noise in the training set. Using minibatch SGD is both fast and robust to noise. That is why it is often preferred in training.
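A minimal sketch of minibatch SGD for a single sigmoid neuron trained with binary cross-entropy, following the update in Eq 2.2; the data, learning rate, and batch size below are made up for illustration:

```python
# Minibatch SGD sketch for a single sigmoid neuron with binary cross-entropy,
# following w <- w - lr * dE/dw (Eq. 2.2). Data and settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 samples, 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy binary labels

w, b = np.zeros(3), 0.0
lr, batch_size, epochs = 0.1, 16, 20

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(epochs):                 # one epoch = one full pass over the data
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        pred = sigmoid(xb @ w + b)
        error = pred - yb                   # gradient of the BCE loss w.r.t. the pre-activation
        w -= lr * xb.T @ error / len(idx)   # parameter update
        b -= lr * error.mean()
```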


Figure 2.2: Illustration of a Multilayer neural network

Multi-Layer NN as a non-linear classifier

The problem with a single-layer neural net is that it can only be used as a linear classifier and not for feature learning. To solve this, multiple perceptrons can be combined to form a neural network with more layers, called hidden layers. An illustration of such a neural network, consisting of an input layer, two hidden layers, and an output layer, is shown in Fig 2.2. The advantage of a multi-layer neural network is that it is able to model non-linear functions, unlike single-layer neural networks that can only model linear functions.

2.2 Sequential models

When working with textual data in NLP, the data is generally transformed into a sequence. It is essential to keep the order of the sequence, since if the order of words is changed, the sentence's context could also change. For example, ["the", "firefighter", "saved", "a", "cat"] has a different meaning than ["a", "cat", "saved", "the", "firefighter"]. Data where the order is important is called sequential data, and models working with this type of data are called sequential models. Another requirement for sequential models is that, while processing a sequence, they should remember previous important parts of it. For example, if the data is a sequence of sentences, such as ["X was walking home", ..., "he forgot to buy milk on the way"], it is essential to remember specific keywords such as "X" and "home".

2.2.1 RNN

A Recurrent Neural Network (RNN) is a family of neural networks for processing sequential data. RNNs can process long sequences and sequences with variable length, meaning that the input sequence does not have to be the same length as the output sequence.

Figure 2.3: RNN illustrated by C. Olah [30]

Figure 2.3 above shows an RNN with loops. The model A takes in input sequence xt and outputs a value ht. The model also passes its past state to the next step. The same RNN can be visualized as an unpacked network instead, shown in the following figure 2.4.

Figure 2.4: RNN unpacked illustrated by C. Olah [30]

RNNs can have different structures and combinations. For example, an RNN can have multiple layers, so that the output from one layer is used as input to another layer. Such layered models are often called deep RNNs. Goldberg [13] observed empirically that deep RNNs work better than shallower RNNs on some tasks. However, it is not theoretically clear why they perform better.

Another extension of the RNN is the bidirectional RNN (BI-RNN). A conventional RNN only uses the past state, as seen in figure 2.4. However, the future state might also hold useful information about the following words in a sequence. A BI-RNN attempts to deal with this by maintaining two separate states, each with its own layers, and runs the input in two directions: one from front to back and one from back to front.


Simple RNN

The most conventional RNN is called the simple RNN (S-RNN), and it was proposed by Elman [10]. Mikolov [27] later explored the S-RNN for use in NLP [13]. It builds a strong foundation for tasks such as sequence tagging and language modelling. However, the S-RNN introduces a problem that causes the gradients that carry information used in a parameter update to increase or decrease rapidly over time. This problem is known as the exploding or vanishing gradients problem, and it results in the gradients becoming so big or so small that the parameter updates carry no significant changes. In other words, this problem causes the model not to learn. In later work, Hochreiter [15] proposed an architecture known as Long Short-Term Memory, which managed to overcome the exploding and vanishing gradient problem.

LSTM

Long Short-Term Memory networks (LSTMs) are a special kind of RNN capable of learning long-term dependencies [30]. The main difference between an RNN and an LSTM is that an LSTM is made up of a memory cell, input and output gates, and a forget gate [24]. The memory cell is responsible for remembering dependencies in the input sequence, while the gates control how much of the previous states should be memorized.
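As a sketch of how such a recurrent layer is used in practice, the following runs PyTorch's built-in LSTM over a batch of random sequences; all sizes are arbitrary and only illustrate the input/output shapes:

```python
# Sketch of running a (bidirectional) LSTM over a batch of sequences with
# PyTorch. All sizes are arbitrary.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
               batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)           # batch of 4 sequences, 10 time steps, 16 features
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 10, 64]) - hidden states for every step, both directions
print(h_n.shape)     # torch.Size([4, 4, 32])  - final hidden state per layer and direction
```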

2.2.2 Encoder-Decoder

The encoder-decoder architecture is a standard method used in sequence-to-sequence (seq2seq) NLP tasks such as translation. For RNNs (section 2.2.1), an encoder-decoder structure was first proposed by Cho et al. (2014) [4]. The encoder takes a sequence as input and produces an encoder vector used as input to the decoder. The decoder then predicts an output at each step with respect to the previous states (auto-regression) until some END token is generated. Figure 2.5 shows an RNN encoder-decoder architecture for seq2seq tasks, where h_i is the hidden state, x_i is the input sequence and y_j is the output sequence.

2.2.3 Attention

An apparent disadvantage of the conventional encoder-decoder models described in section 2.2.2 is that the input sequence is compressed into a fixed-length vector. This limits the model's ability to learn from later parts of a sequence, which are effectively truncated. Additionally, early parts of long sequences are often forgotten once the model has processed the entire sequence [4].

Bahdanau et al. (2016) [2] proposed an approach to address the limitations of encoder-decoder models by extending the encoder-decoder architecture with a technique called Attention.


Figure 2.5: RNN Encoder-Decoder sequence-to-sequence model illustrated by Kostadinov [19]

Unlike the conventional encoder-decoder, Attention allows the model to focus on relevant parts of an input sequence. The process is done in two steps. First, instead of only passing the last encoder hidden state (the context vector) to the decoder, the encoder passes all of its hidden states to the decoder. Second, the decoder gives each encoder hidden state a score, where each of these states is associated with a certain word from the input sequence. This way, the model does not rely on a single context vector but rather learns which parts of a sequence to pay attention to. Bahdanau et al. provide the example shown in figure 2.6. It illustrates Attention when translating the English input sequence [", This, will, change, my, future, with, my, family, ., ", the, man, said] to the French target sequence [", Cela, va, changer, mon, avenir, avec, ma, famille, ", a, dit, l', homme, .]. It can be seen in the figure that the alignment of the words is largely monotonic, hence the high attention scores along the diagonal. However, some words are non-monotonic. For example, the English word "man" is "l'homme" in French, and in the example we find high attention scores for both "l'" and "homme".
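Note that Bahdanau et al. use an additive scoring function; the sketch below instead uses the simpler (scaled) dot-product score to illustrate the same idea of turning scores into weights over the encoder states. The shapes are arbitrary:

```python
# Minimal attention sketch in NumPy: each query (decoder state) is scored
# against all keys (encoder states), the scores are normalized with a softmax,
# and the output is a weighted sum of the values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # attention distribution over input positions
    return weights @ V, weights          # context vectors and the attention weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 decoder positions (queries)
K = rng.normal(size=(5, 8))   # 5 encoder positions (keys)
V = rng.normal(size=(5, 8))   # 5 encoder positions (values)

context, weights = attention(Q, K, V)
print(context.shape, weights.shape)   # (3, 8) (3, 5)
```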

2.2.4 Transformers

Transformers are an attention-based architecture consisting of two main components, an encoder and a decoder. The model was introduced by Vaswani et al. [43] to solve existing problems with the recurrent models presented in section 2.2.1, which preclude parallelization and therefore lead to longer training times and a drop in performance for longer dependencies. Due to the attention-based, non-sequential nature of transformers, they can be highly parallelized and reach a constant number of sequential operations and a constant maximum path length, O(1). Transformers were originally used to solve translation problems, where the aim is to find relationships between words in an input sentence and combine them with an existing translation of that sentence [3].


Figure 2.6: Attention example illustrated by Bahdanau et al. [2]

Encoder: The encoder consists of multiple encoder layers, where each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism. Looking at the translation example in section 2.2.3, self-attention means that the target sequence is the same as the input sequence; in other words, self-attention is a form of attention mechanism that relates different positions of the same input sequence. The term "multi-head" means that instead of computing the attention once, the mechanism applies scaled dot-product attention several times, allowing multiple attention computations in parallel [43]. The second sub-layer is a simple position-wise, fully connected feed-forward network. All sub-layers have a residual connection and layer normalization, the purpose of which is to add the output of each sub-layer to its input. The left block in figure 2.7 shows the encoder of a transformer.

Decoder: The decoder is structured similarly to the encoder, but it has an additional sub-layer called masked multi-head attention, a modified multi-head attention mechanism that, unlike self-attention, is prevented from attending to subsequent positions. The goal of masking positions is to ensure that predictions are not looking into the future of the target sequence [43]. The right block in figure 2.7 shows the decoder of a transformer.

For instance, in translation tasks, the encoder is fed with the words of a specific language, processing all words simultaneously. It then generates embeddings for each word, which are vectors that describe the meaning of the words in the form of numbers; similar words have closer values in their respective vectors. The decoder can then be used to predict the translation of a word in a sentence by combining the embeddings from the encoder with the previously generated part of the translated sentence. The decoder predicts one word at a time until the end of the sentence is reached.

Figure 2.7: Model architecture of a transformer illustrated by Vaswani et al. [43]

Transformers have a token limit of 512. The reason is that the memory and computation requirements of a transformer grow quadratically with sequence length, making it impractical to process long sequences [43]. This means that such transformers can only process inputs of up to 512 tokens. Later, new solutions were introduced, such as Transformer-XL, which uses a recurrence mechanism to handle text sequences longer than 512 tokens [6]. However, in most cases it is sufficient to truncate sequences longer than 512 tokens to make them fit.

In general, the encoder learns what a word means in relation to the source language, its grammar and, more importantly, its context. In contrast, the decoder learns how a word in the source language relates to the corresponding word in the target language.


2.3 BERT

BERT is a transformer-based model introduced by Devlin et al. [7]. The authors argue that previous language representation models, such as RNNs, were limited in how they encode tokens, since they only consider the tokens in one direction. Unlike RNNs, the authors utilize transformers, described in section 2.2.4, to design Bidirectional Encoder Representations from Transformers (BERT), which is able to encode a token using tokens from both directions.

BERT can be used to solve various types of NLP problems, including text summarization. However, these problems require an understanding of language, which is achieved through a pre-training phase and a fine-tuning phase. The first phase consists of pre-training BERT to understand language, and fine-tuning is then done so that BERT can learn to solve a specific task.

2.3.1 Input and Output Embeddings

Similar to other language models, BERT processes each input token through a token embedding layer to create a vector representation. Additionally, BERT has two more embedding layers, called segment and position embeddings. In figure 2.8, an illustration of the three embedding layers can be seen.

Figure 2.8: The input embeddings and embedding layers for BERT illustrated by Devlin et al. [7]

Token, segment, and position embeddings are summed to form the input embedding for a given token. Before the embedding layers process the input, the input text is tokenized using WordPiece.

WordPiece

BERT adopts the WordPiece tokenization proposed by Wu et al. [44]. The aim of WordPiece tokenization is to improve the handling of rare words. The solution is to divide words into sub-words (WordPieces) using a fixed vocabulary set; BERT has a vocabulary of 30,000 WordPieces. In figure 2.9, an example of two words broken down into subwords is shown. As the rarity of a word increases, the word can be broken down as far as into single characters. Additionally, every subword except the first subword of a word is marked with a "##" prefix. The first subword is kept unmarked because it often carries most of the meaning of the whole word. For example, the word "bedding" can be split into the subwords "bed" and "##ding", where the subword "bed" conveys meaning because "bedding" is closely related to the word "bed".

Figure 2.9: Two words broken down into sub-words using WordPiece tok- enization

Finally, a "[CLS]" token is added to the start and "[SEP]" token is added to the end of a tokenized sentence. The objective of adding the extra tokens is to distinguish a pair of sentences which will help create segment embeddings 2.3.1. Since BERT uses default transformer encoders (2.2.4), BERT is limited to process input sequences up to 512 tokens.

Token Embeddings

The first step is to create vector representations of the tokenized input in the token embeddings layer. Each token is represented by a vector with a hidden size of 768 (a 1×768 vector). For N input tokens, the token embedding results in a matrix of shape N×768 or, as a tensor, 1×N×768.

Segment Embeddings

BERT can handle a pair of input sentences, as shown in figure 2.10. The inputs are tokenized and concatenated to create a pair of tokenized sentences. Thanks to the [SEP] token, BERT can distinguish the two sentences and label the sequence with binary segment labels.

Figure 2.10: Binary labels generated by a pair of inputs.

The label sequence is then expanded into the same matrix shape as for the token embeddings, N×768, where N is the number of tokens. For example, for the paired input in figure 2.10, the segment embedding would result in a matrix of shape 8×768.

Position Embeddings

BERT is a transformer-based model and therefore does not process tokens sequentially. Thus, to avoid BERT losing the order of the tokens, position embeddings are required. The position embeddings layer can be seen as a look-up table, as illustrated in figure 2.11, where the index of a row represents a token position. For example, the two sentences "Cat is stuck" and "Call the firefighter" have identical position embeddings for the words at the same positions: "Cat" - "Call", "is" - "the" and "stuck" - "firefighter".
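The three embedding layers can be sketched as look-up tables whose outputs are summed element-wise. The sizes below follow BERT-base (hidden size 768, at most 512 positions, two segment labels); the token ids and vocabulary size are illustrative, and this is not BERT's actual implementation:

```python
# Sketch of BERT-style input embeddings: token, segment, and position
# embeddings are looked up per token and summed. Sizes follow BERT-base.
import torch
import torch.nn as nn

vocab_size, hidden, max_pos = 30000, 768, 512
token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)        # sentence A = 0, sentence B = 1
position_emb = nn.Embedding(max_pos, hidden)  # row index = token position

token_ids   = torch.tensor([[101, 7592, 2088, 102]])    # illustrative ids for [CLS] ... [SEP]
segment_ids = torch.zeros_like(token_ids)                # single-sentence input
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

input_embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(input_embeddings.shape)   # torch.Size([1, 4, 768]), i.e. 1 x N x 768
```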

2.3.2 Pre-training

The pre-training phase is done by training on two unsupervised tasks simultaneously: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) [16].

Masked Language Model

Masked Language Modeling (MLM) is an unsupervised task performed during the pre-training of BERT. The goal of MLM is to help BERT learn deep bidirectional representations. MLM is done by randomly masking 15% of all WordPiece tokens in the input sequence. The tokens are masked by replacing them with a [MASK] token, which BERT then identifies and predicts.
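A simplified sketch of the masking step is shown below; it replaces every selected token with [MASK], whereas BERT's actual recipe replaces 80% of the selected tokens with [MASK], 10% with a random token, and keeps 10% unchanged:

```python
# Simplified sketch of MLM input corruption: 15% of the tokens are selected at
# random and replaced with "[MASK]"; the original tokens become the targets.
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model should predict the original token here
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the firefighter saved the cat".split()
print(mask_tokens(tokens))
```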


Figure 2.11: Position embeddings layer.

Next Sentence Prediction

Next Sentence Prediction is another unsupervised task performed during the pre-training of BERT. The objective of this task is to capture the relationship between two sentences. To do so, BERT is pre-trained on a binarized next sentence prediction task that can be generated from any monolingual corpus [7]. This is done by setting 50% of the inputs to sentence pairs where the second sentence is the subsequent sentence from the corpus; in the other 50%, the second sentence is instead a random sentence selected from the corpus. For example, if A is a sentence from the corpus, then 50% of the time B is the subsequent sentence of A, and the other 50% of the time B is a random sentence from the corpus.

2.3.3 Fine-tuning

Fine-tuning allows the pre-trained BERT model to be used for specific NLP tasks through supervised learning. It works by replacing the fully connected output layers of the pre-trained BERT model with a new set of output layers that can output an answer for the NLP problem at hand. The new model performs supervised learning on labeled data to update the weights. Since the model starts from pre-trained weights and mainly needs to learn the new output layers, fine-tuning is relatively fast compared to pre-training [7].
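A sketch of such a fine-tuning setup for a sentence-level classification task: a pre-trained BERT encoder from the transformers library with a small task-specific linear layer on top of the [CLS] representation. The model name and number of labels are placeholders, and the training loop (optimizer, loss, data) is omitted:

```python
# Fine-tuning sketch: a pre-trained BERT encoder with a task-specific linear
# head on top of the [CLS] output. Model name and label count are placeholders.
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased", num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)                  # pre-trained encoder
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)   # new output layer

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]   # representation of the [CLS] token
        return self.head(cls_vector)                   # task-specific logits
```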

2.3.4 Pretrained BERT models

As described earlier in section 2.3, a BERT model has to be pre-trained before it is fine-tuned on different tasks, because the model needs to be taught to encode language. This process is both time- and resource-consuming. For example, Devlin et al. [7] pre-trained the BERT model for four days using four cloud TPUs (16 TPU chips in total). Therefore, many BERT models are released as pre-trained models with initialized parameters, and fine-tuning can then be done on the pre-trained model for a particular task.

Norwegian BERT: At the time of writing, the state-of-the-art monolingual BERT model supporting the Norwegian language (both Bokmål and Nynorsk) is made by the National Library of Norway.¹ It is based on the same architecture as Multilingual BERT (described below) and trained on a wide variety of Norwegian text in both Bokmål and Nynorsk from the last 200 years.

Multilingual BERT: Multilingual BERT (M-BERT) is a BERT-based model pre-trained on concatenated monolingual Wikipedia corpora from 104 languages.² In a study by Pires et al. [38], it is shown that M-BERT performs exceptionally well on zero-shot cross-lingual model transfer, meaning that M-BERT can be fine-tuned using task-specific supervised data from one language and evaluated on a different language. This was done in a paper by Elmadani et al. [9], who applied M-BERT to Arabic text summarization and showed how effective it can be in low-resource situations for both extractive and abstractive approaches.

¹ https://github.com/NBAiLab/notram
² https://github.com/google-research/bert/blob/master/multilingual.md

2.4 Extractive Text Summarization Methods

Today, there exist different extractive methods for text summarization. In this section, two well-known unsupervised methods, TF-IDF and TextRank, are presented in sections 2.4.1 and 2.4.2. Furthermore, section 2.4.3 presents a supervised method called BERTSum, which utilizes the language model BERT for text summarization.

2.4.1 TF-IDF

TF-IDF is short for term frequency-inverse document frequency. It is a numerical statistic that reflects how important a word is to a document within a corpus [39].

Term weighting based on term frequency was first introduced by Luhn [25]. Luhn stated that the importance of a term is proportional to its frequency. In mathematical terms, this can be described as:

tf(t, d) = f_t,d    (2.3)

As seen in eq. 2.3, the term frequency tf is equal to the frequency f of a term t found in a document d. For example, in the text "The firefighter rescued a cat. The cat is safe now.", the term "cat" would have high importance because it is mentioned multiple times. However, common terms such as "the" are also weighted as important.

To solve the issue of common terms appearing as important words, Jones [17] proposed a metric called inverse document frequency (IDF). The idea is to reduce the weights of common terms and increase the weights of terms that occur infrequently, see Equation 2.4.

idf(t, d) = log(n / n_t)    (2.4)

Here, terms are weighted based on the inverse fraction of the documents containing a term. The fraction is calculated by dividing the total number of documents n by the number of documents n_t containing the term t.

The combination of TF and IDF favors more unique terms and damps common terms that occur in several documents. The combined equation is presented as:

tf-idf(t, d) = f_t,d × log(n / n_t)    (2.5)

For sentence weighting, the same principle can be used. The document d in eq. 2.5 can be reformulated as a sentence s, and the term t can be represented as a word w. In this case, n is the total number of sentences, and n_w is the number of sentences containing the word w. The final equation for sentence weighting is:

tf-idf(w, s) = f_w,s × log(n / n_w)    (2.6)
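A small sketch implementing the sentence-level weighting of Eq. 2.6; scoring a sentence by summing the tf-idf weights of its words is our own simplification for illustration, and the example sentences are made up:

```python
# TF-IDF sentence scoring sketch following Eq. 2.6: treat each sentence as a
# "document", weight each word by tf * log(n / n_w), and score a sentence by
# the sum of its word weights (the summation is an illustrative choice).
import math
from collections import Counter

sentences = [
    "the firefighter rescued a cat",
    "the cat is safe now",
    "the firefighter went home",
]
tokenized = [s.split() for s in sentences]
n = len(tokenized)

# n_w: number of sentences containing each word
sentence_freq = Counter(word for sent in tokenized for word in set(sent))

def sentence_score(sent):
    counts = Counter(sent)
    return sum(counts[w] * math.log(n / sentence_freq[w]) for w in counts)

for s, sent in zip(sentences, tokenized):
    print(f"{sentence_score(sent):.3f}  {s}")
```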

2.4.2 TextRank

TextRank is a graph-based ranking algorithm proposed by Mihalcea and Tarau [26]. The ranking is done by deciding the importance of a vertex within a graph based on global information drawn recursively from the entire graph. This is done by linking vertices to one another. The importance of a vertex is measured by the number of links to other vertices as well as the scores of the vertices casting the votes.

A directed graph can be defined as G = (V, E), where V is a set of vertices and E is a set of edges. E is, in turn, a subset of V × V. For a vertex V_i, let In(V_i) be the set of vertices pointing to it and let Out(V_i) be the set of vertices that V_i points to [26]. The score of a vertex V_i, indicating its importance, is based on Brin et al. [35]:

S(V_i) = (1 − d) + d × Σ_{j ∈ In(V_i)} (1 / |Out(V_j)|) × S(V_j),   where 0 < d < 1    (2.7)

In equation 2.7, d is a damping factor that sets the probability of jumping from a given vertex to another.

TextRank can be applied to sentence extraction, as proposed by Mihalcea and Tarau. This is done by letting the vertices of the TextRank graph correspond to the sentences of the document, so that each vertex represents one sentence. The weight of an edge between two vertices is given by the similarity between the two sentences. Additionally, the similarity function proposed by Mihalcea and Tarau normalizes the content overlap by sentence length to avoid promoting long sentences.

Given two sentences S_i and S_j, where a sentence S_i is represented by the set of N_i words it contains (S_i = {w_1^i, w_2^i, ..., w_{N_i}^i}), the similarity between S_i and S_j is defined as (Mihalcea and Tarau):

Similarity(S_i, S_j) = |{w_k : w_k ∈ S_i and w_k ∈ S_j}| / (log |S_i| + log |S_j|)    (2.8)

2.4.3 BERTSum

BERT cannot be directly used for extractive summarization. There are two problems that Liu (2019) [23] points out. Firstly, BERT is trained using a masked language model (section 2.3.2); therefore, the output vectors correspond to tokens rather than sentences. Secondly, although BERT has segment embeddings for indicating different sentences, it can only differentiate a pair of sentences, because BERT is trained on next sentence prediction (section 2.3.2).

Liu [23] proposes a method for handling multiple sentences with BERT by inserting a [CLS] token before each sentence and a [SEP] token after each sentence. To distinguish multiple sentences rather than just two, interval segment embeddings are used. This means that all tokens in a sentence are assigned the embedding E_A or E_B depending on whether the position of the sentence is odd or even. As seen in figure 2.12, the outputs of the BERT layer, shown as T_i, are the vectors at the [CLS] positions from the top BERT layer. Each T_i is treated as the sentence representation of sentence i.


Figure 2.12: Architecture of BERTSum proposed by Liu [23]

After obtaining sentence representations for multiple sentences, Liu suggests several methods to capture document-level features for extracting summaries:

1. Using a simple classifier on top of the BERT outputs together with a sigmoid function to get a predicted score.

2. An inter-sentence transformer, adding more transformer layers on top of the BERT outputs, followed by a simple classifier with a sigmoid function.

3. Applying an LSTM layer on top of the BERT outputs together with a simple classifier and a sigmoid function.

In Liu's experiments, the second option in the list above, with two transformer layers, showed the best performance. The loss of the model is the binary cross-entropy of the predictions against the gold labels [23].

The sentences predicted by BERTSum are ranked by their importance, which is represented by a score. When selecting sentences by score, Liu applies Trigram Blocking, introduced by Paulus et al. (2018) [37], to reduce redundancy by minimizing the similarity between the selected sentences in the predicted summary.
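A sketch of the trigram-blocking selection step: sentences are considered in order of decreasing score, and a candidate is skipped if it shares a word trigram with the summary built so far. The helper names are our own:

```python
# Trigram blocking sketch: go through sentences in order of decreasing score
# and skip any sentence that shares a word trigram with the summary so far.
def trigrams(sentence):
    words = sentence.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def select_summary(scored_sentences, max_sentences=3):
    """scored_sentences: list of (score, sentence) pairs."""
    selected, seen_trigrams = [], set()
    for _, sentence in sorted(scored_sentences, key=lambda p: p[0], reverse=True):
        if trigrams(sentence) & seen_trigrams:
            continue                      # too similar to an already selected sentence
        selected.append(sentence)
        seen_trigrams |= trigrams(sentence)
        if len(selected) == max_sentences:
            break
    return selected
```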

Like BERT, the sequence input for BERTSum has a limit of 512 tokens.


2.5 Evaluation metrics for summarization

Metrics that would traditionally be used to evaluate text summaries are coherence, conciseness, grammaticality, readability, and content [21]. These are metrics that experts consider when writing summaries, and since experts are going to use the developed tool, they become important. Evaluating summaries manually does not scale well, since it would require huge amounts of time and effort to evaluate the hundreds, or even thousands, of summaries that exist. Therefore, it is crucial to complement human evaluation with quantitative methods and metrics that evaluate summaries automatically.

2.5.1 Precision, Recall and F-Score

Extractive text summarization can be seen as a binary classification problem, where 1 indicates that a sentence from the document is extracted and 0 indicates that it is not. In the statistical analysis of binary classification, Precision, Recall, and F-Score measure the test's accuracy. The precision score is the number of true positive results divided by all selected positive results. The recall score is the number of true positives divided by all positive values. Another way to interpret these values is to think of the precision score as how many of the selected items are relevant and the recall score as how many of the relevant items are selected. It then becomes clear that these values alone are not always applicable. For example, if we were to pick out three red apples in a bowl of ten apples, we could achieve a high precision score by simply picking one red apple. Similarly, we would achieve a high recall score by simply picking all ten apples in the bowl. In these cases, the F1 score, known as the harmonic mean of precision and recall, becomes useful. The F1 score is calculated as in Equation 2.9.

F1 = 2 · (precision score · recall score) / (precision score + recall score)    (2.9)
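A small worked sketch of these quantities for made-up binary sentence labels:

```python
# Precision, recall and F1 for made-up binary sentence labels
# (1 = sentence selected for the summary, 0 = not selected).
gold      = [1, 0, 1, 0, 1, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0]

tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)  # 2 true positives
precision = tp / sum(predicted)                                    # 2/3
recall    = tp / sum(gold)                                         # 2/3
f1 = 2 * precision * recall / (precision + recall)                 # Eq. 2.9 -> 2/3

print(round(precision, 3), round(recall, 3), round(f1, 3))
```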

2.5.2 ROUGE

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics presented by Lin [21] in 2004 for the automatic evaluation of summaries. The metrics compare machine-generated summaries against one or multiple reference summaries created by humans. The ROUGE-1, ROUGE-2, and ROUGE-L metrics are commonly used for benchmarking document summarization models, such as on the leaderboard for document summarization on the CNN/Daily Mail dataset [5]. For each metric, the recall, precision, and F1 score are generally computed. With ROUGE, the true positives are the words that also appear in the sentences of the reference summaries.


ROUGE-N: N-gram Co-Occurrence Statistics

ROUGE-N is defined as the overlap of n-grams between the candidate summary and the reference summary. As mentioned, the most common metrics are ROUGE-1 and ROUGE-2, where ROUGE-1 refers to the overlap of unigrams (single words), and ROUGE-2 refers to the overlap of bigrams (two adjacent words).

ROUGE-L: Longest Common Subsequence

ROUGE-L refers to the Longest Common Subsequence (LCS) of words between a candidate summary and a reference summary. It reflects similarity on the sentence level based on the longest in-sequence matches of words.

ROUGE-W: Weighted Longest Common Subsequence

One disadvantage of the longest common subsequence in ROUGE-L is that it does not favor consecutive matches. This means that word sequences with smaller spatial differences receive the same ROUGE score as sequences with larger spatial differences. ROUGE-W deals with this problem by taking the length of consecutive matches into account, giving a weighted longest common subsequence [21].

ROUGE-S: Skip-Bigram Co-Occurrence Statistics

One can think of ROUGE-S as a generalization of ROUGE-2. Instead of measuring the overlap of bigrams, ROUGE-S measures the overlap of skip-bigrams between the candidate and the set of references. A skip-bigram is any pair of words in their sentence order, allowing arbitrary gaps. A sentence with x words has x!/(2!(x − 2)!) skip-bigrams.

ROUGE-SU: Extension of ROUGE-S

ROUGE-SU is an extension of ROUGE-S that additionally considers unigrams as a counting unit. The extension can be necessary if, for example, a candidate sentence is the exact reverse of the reference summary. In that case, using only ROUGE-S would result in a score of zero even though the sentence has single-word co-occurrences. With ROUGE-SU, a reversed candidate sentence would get a higher score than sentences that do not have a single word co-occurrence with the reference sentence.

ROUGE Example

For clarification on ROUGE scores, let us investigate the following example: sentence one is a reference sentence, and sentences two and three are candidates.


1. The firefighter saved the cat.

2. The firefighter rescued cat.

3. Cat saved the firefighter.

Considering ROUGE-1, we can see that sentence three gives the best match, with a recall score of 4/5 = 0.8 and a precision score of 4/4 = 1. In the case of ROUGE-L, sentence two is preferred, with a recall score of 3/5 = 0.6 and a precision score of 3/4 = 0.75. For ROUGE-2, sentence three matches two bigrams, giving a recall score of 2/4 = 0.5 and a precision score of 2/3 ≈ 0.67, while sentence two matches only one bigram, giving a recall score of 1/4 = 0.25 and a precision score of 1/3 ≈ 0.33. The importance of this example is to understand that focusing on only one type of ROUGE score does not always provide good insight. In our example, it can intuitively be agreed that sentence two is the one that best fits the reference sentence, since sentence three completely changes the context. But according to ROUGE-1, sentence three is preferred. This example shows why combining the three metrics is often a good idea, and why it is important to complement the results with a qualitative evaluation.
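The ROUGE-1 numbers above can be reproduced with a few lines of clipped unigram counting; punctuation is ignored here, as in the example:

```python
# ROUGE-1 recall and precision via clipped unigram counts for the example
# above (punctuation ignored, everything lower-cased).
from collections import Counter

def rouge_1(candidate, reference):
    cand = Counter(candidate.lower().replace(".", "").split())
    ref = Counter(reference.lower().replace(".", "").split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)   # clipped matches
    return overlap / sum(ref.values()), overlap / sum(cand.values())  # (recall, precision)

reference = "The firefighter saved the cat."
print(rouge_1("The firefighter rescued cat.", reference))   # (0.6, 0.75)
print(rouge_1("Cat saved the firefighter.", reference))     # (0.8, 1.0)
```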

ROUGE Limitations

Regardless of ROUGE's popularity in papers on automatic text summarization, there are some limitations that must be addressed:

1. ROUGE only considers content selection and not other aspects such as fluency and coherence.

2. ROUGE relies only on exact overlaps, but a summary can express the same content as an article without exact overlaps, using other words and synonyms.

3. ROUGE was first designed to be used with multiple reference summaries, taking into account that summaries are subjective. However, most datasets today provide only a single reference summary for each input.

2.5.3 Qualitative Evaluation

A qualitative evaluation method is often used together with quantitative data to deepen the understanding of the statistical numbers. Patton [36] suggests three kinds of data collection methods for qualitative evaluation:

• Open-ended interviews

• Direct observations.

• Written documents.


The purpose of these methods is to gather information and insights that are useful for decision-making. Qualitative methods should therefore be appropriate and suitable, which means that it is essential to determine qualitative strategies, data collection options, and analysis approaches based on the evaluation's purpose. An example of a method that combines quantitative measurements and qualitative data is a questionnaire, or interview, that asks both fixed-choice questions and open-ended questions.

3 Method

In this chapter, the methodology for creating a text summarization model is described. Firstly, in section 3.1, the datasets that were used and their properties and features are introduced. Secondly, in section 3.2, implementation techniques are described in three parts: pre-processing, binary label generation, and fine-tuning. Finally, in section 3.3, we cover the methods used for evaluating our different models.

3.1 Datasets

The following section presents the features and properties of the two datasets used in this work.

3.1.1 CNN/DailyMail The CNN/DailyMail dataset1 was initially developed for machine reading comprehension and abstractive question answering by Hermann et al. [14]. The dataset contains over 300k unique news articles written in English by journalists at CNN and the Daily Mail. Their script to obtain the data was later modified by Nallapati et al. [29] to support training models for abstractive and extractive text summarization, using multi-sentence summaries. Both of these datasets are anonymized versions, where the data has been pre-processed to replace named entities with unique identifier labels. A third version of the dataset also exists, which operates directly on the original text (non-anonymized), see [41].

1https://cs.nyu.edu/~kcho/DMQA/


The CNN/DailyMail dataset consists of two main features: articles, which are strings containing the body of the news article, and multi-sentence summaries, which are strings containing the highlight of the article as written by the article author. Table 3.1 displays the average token count and number of sentences in the dataset.

Table 3.1: Average token and sentence count for news articles and summaries in the CNN/DailyMail dataset

Type                        Average token count    Average nr of sentences
News Articles               781                    29.74
Multi-Sentence Summaries    56                     3.75

Furthermore, the dataset is split into a train, validation and test set according to Table 3.2.

Table 3.2: Dataset split

Dataset Split    Number of Instances
Train            287,113
Validation       13,368
Test             11,490

Model performance on the CNN/DailyMail dataset is measured by the ROUGE score of the model's predicted summaries when compared to the golden summaries. The highest-achieving models can be found on the Papers With Code leaderboard2.

3.1.2 Aftenposten/Oppsummert The Norwegian articles and summaries provided by Aftenposten (AP) and Oppsummert make up two datasets: one with 162k articles and one with 979 summaries. The columns of the article dataset are presented in Table 3.3. The summary dataset contains an array of article IDs, which are the articles that each summary is based on. Table 3.4 presents each column in the summary dataset.

To get an idea of how many articles from the article dataset were used to create the summaries, we plot this relation in Figure 3.1. As for the CNN/DailyMail dataset, we were interested in examining the average number of sentences in the AP/Oppsummert summaries. This plot is presented in Figure 3.2. Table 3.5 also displays the average token count and the average number of sentences in the articles and summaries datasets. 2https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail

28 3.1. Datasets

Table 3.3: Article data type in the AP/Oppsummert dataset

Field               Description
ARTICLE_ID          The article's ID
ARTICLE_TITLE       The title of the article
ARTICLE_TEXT        Raw article text data
ARTICLE_NEWSROOM    The newsroom, which is Aftenposten
LAST_UPDATE         The date when the article was last updated

Table 3.4: Summary data type in the AP/Oppsummert dataset

Field                  Description
ARTICLE_ID             The summary's ID
ARTICLE_TITLE          The title of the summary
ARTICLE_TEXT           Raw summary text data
ARTICLE_NEWSROOM       The newsroom, which is always Aftenposten
LAST_UPDATE            The date when the summary was last updated
SUMMARIZED_ARTICLES    An array of connected article IDs

Table 3.5: Average token and sentence count for news articles and summaries for AP/Oppsummert

Feature                     Mean token count    Average nr of sentences
News Articles               703                 40.3
Multi-Sentence Summaries    154                 9.5

Compared to the CNN/DailyMail dataset, the AP/Oppsummert dataset is more varied, both in the number of sentences per summary and in the number of articles associated with each summary.


Figure 3.1: Summaries associated with x articles in the AP/Oppsummert dataset

Figure 3.2: Number of sentences in the AP/Oppsummert summaries dataset

3.2 Implementation

In this section, the implementation of an automatic extractive text summarization model is described, together with the different problems we had to overcome. Firstly, in sections 3.2.1, 3.2.2 and 3.2.3, the dataset is restructured, truncated and labeled. Secondly, in section 3.2.4, the different model implementations are described. Lastly, in sections 3.2.6, 3.2.7 and 3.2.8, the fine-tuning, prediction and the hardware used for implementing a BERT-based model are described.

3.2.1 Restructure of the AP/Oppsummert dataset When training a model, one of the most important aspects is to have good training data. The CNN/DailyMail dataset is relatively straightforward to use, having one summary per article. However, this was not the case for the AP/Oppsummert dataset, since some of the summaries have multiple related articles. We therefore analyzed these summaries, together with their related articles, to identify where the summaries' content comes from. Our method for this was to plot the ROUGE scores of the articles that maximize the ROUGE-2 and ROUGE-L recall scores against their gold summaries, an approach similar to the sentence selection algorithm presented in the BERTSum paper [23]. The objective was to visualize and compare the scores of the top-scoring articles and the second-best articles. For the summaries with only one article, we plot their ROUGE-2 and ROUGE-L recall scores against the summary directly to understand how extractive they are, i.e., a high ROUGE score means that the summary uses similar words and sentences as the connected article. These plots are presented in Figure 3.3. The reason for using the recall score is that we were not interested in the length variation of the articles, only in to what extent the summaries use content from the different articles.

From the graphs presented in Figure 3.3 we can draw two important conclu- sions about the AP/Oppsummert dataset:

1. The summaries with only one article are predominantly extractively written (since they have high ROUGE-2 and ROUGE-L scores).

2. The summaries with more articles regularly use sentences from only one main article (since the scores of the second-best article are far worse than the scores of the top-scoring article).

With these two conclusions, the dataset was restructured so that summaries with multiple related article IDs were connected only to the highest-scoring article in that set. The field SUMMARIZED_ARTICLES was therefore updated from an array of IDs to a single article ID.
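A minimal sketch of this restructuring step is given below. It assumes a generic rouge_recall(article_text, summary_text) helper (for example based on ROUGE-2 recall) and uses the field names from Table 3.4; it is an illustration rather than the exact script used in this work.

def pick_main_article(summary, articles_by_id, rouge_recall):
    # Keep only the related article whose recall against the summary is highest.
    best_id = max(
        summary["SUMMARIZED_ARTICLES"],
        key=lambda art_id: rouge_recall(articles_by_id[art_id]["ARTICLE_TEXT"],
                                        summary["ARTICLE_TEXT"]),
    )
    summary["SUMMARIZED_ARTICLES"] = best_id  # array of IDs replaced by a single ID
    return summary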


Figure 3.3: ROUGE-2 and ROUGE-L recall scores for summaries with one article in (a) and (b), summaries with more articles and the top-scoring articles in (c) and (d), and summaries with more articles and the second-best articles in (e) and (f).


3.2.2 Truncation of articles Before text articles can be used in a model like BERTSum, token limits must be addressed. As mentioned in section 2.4.3, both BERT and BERTSum have an input limit of 512 tokens. The news articles in the CNN/DailyMail dataset have a mean token count of 781, and the news articles in the AP/Oppsummert dataset have a mean token count of 703. This means that the token limit of BERTSum is indeed a problem when using these datasets.

Different approaches to handling token limitation have been suggested in previous works [23] [42]. A standard method is to truncate longer texts to fit the model’s token limit. The problem with this method is the loss of data that it introduces. If important information is discarded, it will result in a poorly trained model.

For news articles, important information is primarily presented in the first third of the article [20]. This is also the case for the CNN/DailyMail dataset, as demonstrated by Liu [23]. Using ROUGE, we examined whether this also holds for the AP/Oppsummert dataset. We did this by plotting the position of every document's Oracle sentences, i.e., the sentences with the highest ROUGE score against the document's golden summary, see Figure 3.4. In this particular plot, we chose Oracle-3, which is the top three scoring sentences.

Figure 3.4 shows that the top-scoring sentences in the AP/Oppsummert dataset mainly occur at the beginning of the articles. It was therefore decided that longer articles that do not fit the token limit of 512 should be truncated to keep only the first sentences, in order, since this is where the most important information in each document is known to be. The same truncation approach was chosen for the CNN/DailyMail dataset, given the results from Liu [23].
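The truncation strategy can be expressed as the following sketch, which keeps leading sentences until the 512-token budget is exhausted. The tokenize and sentence-splitting functions are placeholders; in practice the BERT WordPiece tokenizer determines the actual token count.

def truncate_article(sentences, tokenize, max_tokens=512):
    # Keep the leading sentences whose combined token count stays within max_tokens.
    kept, used = [], 0
    for sentence in sentences:
        n_tokens = len(tokenize(sentence))
        if used + n_tokens > max_tokens:
            break
        kept.append(sentence)
        used += n_tokens
    return kept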

3.2.3 Oracle Summary Generation Similar to the approach followed in [22], we indirectly use the abstractive summaries (gold summaries) to create oracle summaries for supervised learning. Since the gold summaries from AP/Oppsummert are abstractive, they cannot directly be used for supervised learning. Therefore, a greedy algorithm that maximizes the ROUGE-2 score between sentences in the gold summary and the article is used to generate an oracle summary. An oracle summary contains label 1 for selected sentences and 0 for the rest. Liu [22] also suggests a second algorithm for oracle summary generation. The second algorithm considers all sentence combinations for maximizing the ROUGE-2 score; however, for many combinations, the algorithm can be time-consuming.

Figure 3.4: Proportion of sentences with the highest ROUGE score according to their position in the original article

3.2.4 Models Six types of models were implemented for the task of extractive text summa- rization. With each model, we predicted three, seven, and ten sentences for the summaries. Out of the six models, TextRank and TF-IDF will only be used as a comparison to the BERTSum models.

Oracle Oracle was not only used as the label generation method described in section 3.2.3, but also as an upper limit for our BERT models. Since oracle summaries are used as labels for our BERT models, the models cannot score higher than the oracle summaries. The oracle summaries can therefore be seen as a ceiling for our BERT models.

We also experimented with both the greedy and the combination algorithm mentioned in section 3.2.3. However, the combination algorithm resulted in slow performance, as Liu [22] mentioned, especially when selecting more than three sentences. Thus, the greedy algorithm, sketched below, was used throughout the experiments.
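The sketch below illustrates the greedy selection idea from section 3.2.3. It assumes a rouge_2(candidate_sentences, gold_summary) scoring helper and stops when adding another sentence no longer improves the score; it mirrors the described algorithm rather than reproducing the exact BERTSum code.

def greedy_oracle(article_sentences, gold_summary, rouge_2, max_sentences=3):
    # Greedily add the sentence that most increases ROUGE-2 against the gold summary.
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i, sent in enumerate(article_sentences):
            if i in selected:
                continue
            score = rouge_2([article_sentences[j] for j in selected] + [sent], gold_summary)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    # Binary labels: 1 for selected sentences, 0 for the rest.
    return [1 if i in selected else 0 for i in range(len(article_sentences))]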


LEAD We used LEAD as a baseline, which selects the first sentences in an article. From the previously presented analysis in section 3.2.2, with the position of the highest ROUGE scoring sentences shown in Fig 3.4, we considered LEAD to be a good baseline to use.

With Oracle and LEAD, we now had a range within which we wanted our models' ROUGE scores to fall. For clarification: our models should score above LEAD and at or below Oracle.

Next, we implemented two models based on statistical methods and two models based on BERT and BERTsum.

TextRank We adopted a Python implementation3 of TextRank based on the approach followed in [26]. This implementation performs both keyword extraction and text summarization. We used NLTK4 to download the necessary resources for stop words, tokenization, and stemming.

TF-IDF An implementation5 of TF-IDF was adopted for extractive text summarization. The source code was updated to support Norwegian using spaCy6, an NLP toolkit similar to NLTK. Key sentences could then be extracted by ranking the scores of each sentence in descending order.
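The idea behind TF-IDF sentence ranking can be sketched as follows using scikit-learn; this is a simplified illustration rather than the adopted implementation, and the Norwegian stop-word list is assumed to be supplied separately (for example via spaCy).

from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences_tfidf(sentences, stop_words=None, top_n=3):
    # Score each sentence by the sum of its TF-IDF term weights and return the top-n.
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    matrix = vectorizer.fit_transform(sentences)   # one row per sentence
    scores = matrix.sum(axis=1).A1                 # total TF-IDF weight per sentence
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]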

BERTSum We used BERTSum, described in section 2.4.3, to fine-tune two pre-trained BERT models for the task of extractive text summarization. The original BERTSum code uses an older version of PyTorch and is therefore not directly suitable for integrating new models. Thus, an updated version of BERTSum by Microsoft7 was used. PyTorch, introduced by Klein et al. [18], is a toolkit for deep learning in Python. There are other currently popular deep learning libraries, such as TensorFlow (Abadi et al. [1]). However, PyTorch was chosen for our task since both the original and the updated BERTSum code are built on PyTorch. PyTorch is also more suitable for development in Python, as it is a more pythonic framework than TensorFlow. In this case, PyTorch is used for model processing, fitting, and prediction.

3https://github.com/acatovic/textrank 4https://www.nltk.org/ 5https://github.com/luisfredgs/extractive-text-summarization 6https://spacy.io/ 7https://github.com/microsoft/nlp-recipes

The updated BERTSum code from Microsoft includes other natural language processing tools and is still maintained today. However, only the relevant functions were included in our source code. The updated code also provides a more readable code structure, better optimization techniques, and a more scalable solution for adding pre-trained BERT models than the original BERTSum code from the author Yang.

Data loader: Data comes in many different formats and shapes. The provided code from Microsoft supports both text files and lists of strings as data input. For loading and processing the CNN/DM dataset, a dataset loader is already included by Microsoft. However, for the AP/Oppsummert dataset, a script was implemented to shuffle and split the data into a train and test set to match the input format.

Pre-trained models: The pre-trained BERT models were added using a Python-based library called Transformers, provided by Hugging Face8. Transformers provides general-purpose architectures for different tasks in both PyTorch and TensorFlow, with currently over 32 pre-trained models spanning many different languages. Transformers includes an Auto Class feature that enables dynamic tokenizer and model initialization. Both pre-trained models, Norwegian BERT and Multilingual BERT mentioned in section 2.3.4, were found through the Hugging Face API and could be integrated using Transformers.
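As an illustration of the Auto Class feature, the snippet below loads a pre-trained encoder from the Hugging Face hub. The checkpoint identifiers are assumptions about which public models correspond to Norwegian BERT and Multilingual BERT; integrating the encoder into BERTSum involves additional steps not shown here.

from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint names; adjust to the exact models used.
NORWEGIAN_BERT = "NbAiLab/nb-bert-base"
MULTILINGUAL_BERT = "bert-base-multilingual-cased"

def load_encoder(checkpoint):
    # Download and initialize a tokenizer/model pair via the Auto classes.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    return tokenizer, model

tokenizer, model = load_encoder(MULTILINGUAL_BERT)
batch = tokenizer("En kort norsk setning.", return_tensors="pt",
                  truncation=True, max_length=512)
outputs = model(**batch)  # last_hidden_state feeds the summarization layers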

3.2.5 Hyper-Parameters Before fine-tuning our models, some hyper-parameters had to be set:

Summarization layer: As described in section 2.4.3, one can choose between multiple summarization layer modes. The author Liu [23] suggests using two transformer layers as the summarization layers, and this is what we used.

Sequence Length: Since we want to use as many tokens as possible, the sequence length was set to the maximum sequence length that could be used with transformers, which was 512 tokens.

Batch size: To keep the memory consumption low, Peltarion [28] suggests as a rule of thumb to keep the product batch size × sequence length < 3000. Therefore, with the sequence length set to 512, the batch size used is approximately 6. 8https://huggingface.co/transformers/


Learning rate: A low learning rate was chosen to avoid catastrophic forgetting, which can happen when a model is fine-tuned: the newly fine-tuned model can forget previously learned information. This issue is avoided by setting a very low learning rate, of the order of 10^-5.

Epochs: Fine-tuning requires only a few epochs. Devlin et al. [7] used three epochs for their fine-tuning experiments, and Peltarion [28] also suggests using three epochs or less. Thus we aimed to use three epochs.

Max steps: The number of steps was calculated to represent the number of epochs we want to train. With the following formula (3.1), we estimated the maximum number of steps:

Max steps = (training size × epochs) / batch size    (3.1)

Where training size is the total number of data points in the training set. Epochs is the number of epochs we use, and batch size is the number of data points used per step.
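As a worked example, under the assumption of roughly 880 training summaries (about 90 % of the 979 AP/Oppsummert summaries), three epochs, and a batch size of 6, equation 3.1 gives:

training_size = 880   # assumed: ~90 % of the 979 AP/Oppsummert summaries
epochs = 3
batch_size = 6

max_steps = (training_size * epochs) // batch_size
print(max_steps)  # 440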

3.2.6 Fine tuning Once the pre-trained models had been integrated with BERTSum, we fine-tuned Norwegian BERT and Multilingual BERT for the task of extractive text summarization. Each model was fine-tuned with a different number of oracle sentences to test how this affects the performance of the model during sentence prediction. In this case, both models were fine-tuned with three, seven, and ten oracle sentences. The motivation for choosing these numbers was based on Figure 3.2. Firstly, the figure shows that most summaries for AP/Oppsummert are seven sentences long, which motivates the selection of seven oracle sentences. Secondly, three oracle sentences were included since the CNN/DM dataset by default only uses three oracle sentences. Thirdly, ten oracle sentences were chosen to investigate how a longer oracle summary would affect the performance of the models.

We used the data loader described in section 3.2.4 to load the AP/Oppsummert dataset, which was then shuffled and split into a train and test set with 90 % and 10 % of the data, respectively. Using the provided code from Microsoft, binary labels were generated through oracle summary generation (section 3.2.3). Using Transformers, the pre-trained Norwegian BERT model is downloaded and automatically initialized. The training data is also tokenized automatically using the Auto Class feature to convert the sequence of words in each article of the training set to IDs representing the pre-trained model's vocabulary (section 2.1.1). Then, using PyTorch and the specified hyper-parameters (section 3.2.5), a fitting is done, which automatically divides the training data into batches and iteratively calculates the loss, backpropagates to compute gradients, and updates the model weights.

The process of fine-tuning Multilingual BERT was done in two steps. Firstly, the code from Microsoft provides a data loader for the CNN/DM dataset, which we used to load that dataset. Because of the hardware limitations mentioned in section 3.2.8, 10 000 articles were used in the training set and 1 000 articles in the test set. Similar to Norwegian BERT, the data is tokenized and the model is fine-tuned through BERTSum using the pre-trained Multilingual BERT model.

Secondly, to improve the fine-tuned Multilingual BERT model, an additional round of fine-tuning was performed using the AP/Oppsummert dataset. This was done by taking the weights from the Multilingual BERT model fine-tuned on the CNN/DM dataset and fitting the model on the processed AP/Oppsummert data. The process is otherwise similar to the first step.

3.2.7 Prediction Finally, predictions were made on the AP/Oppsummert test set using the model, obtaining a score for each sentence in every article of the test set. The sentences of each article are then ranked by their scores from highest to lowest, and the top-N sentences are selected as the summary, where N is the number of oracle sentences the model was fine-tuned on (see the sketch below).
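A sketch of the ranking step; scores is assumed to be the per-sentence output of the fine-tuned model for one article, and trigram blocking is omitted for brevity.

def select_top_n(sentences, scores, n):
    # Rank sentences by model score and return the top-n in their original article order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # keep article order in the summary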

As mentioned in section 2.4.3, BERTSum uses trigram blocking by default during prediction to reduce redundancy. However, to check the necessity of that, predictions with and without trigram blocking were tested.

3.2.8 Hardware The training was done through Google Colab9, which allows anyone to write and execute Python code through the browser. Google Colab provides single GPUs such as Nvidia K80, T4, P4, and P100. However, there is no way for users to choose which type of GPU to use. By displaying the allocated GPU during runtime, we noticed that an Nvidia K80 was used in our case. In terms of memory, we were limited to 12 GB. Also, free sessions in Google Colab last at most 12 hours. 9https://colab.research.google.com


3.3 Evaluation

We used the automatic evaluation metric ROUGE and a human evaluation by journalists from Aftenposten to evaluate our fine-tuned models. In this section, we describe our implementation of these.

3.3.1 ROUGE Evaluation The most popular package for computing ROUGE scores is the pyrouge package10, which is a wrapper for the ROUGE summarization evaluation package. At the time of writing, pyrouge does not, however, support languages other than English. It was therefore decided to use another package called py-rouge11, which is a Python implementation of the ROUGE metric that can be extended to other languages. Since we must evaluate articles in Norwegian, this package seemed to be the best fit.

We adopted the evaluation utilities provided by nlp-recipes12 to implement a script based on py-rouge that supports both English and Norwegian text. To add support for Norwegian, we added five Norwegian language-specific arguments (a preprocessing sketch follows the list below):

• Sentence Splitter

• Word Tokenizer

• Pattern of characters to remove

• Stemmer

• Word Splitter
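The Norwegian-specific arguments listed above can, for example, be provided through NLTK components as in the following sketch. The exact wiring into py-rouge differs, so this only illustrates the preprocessing side under the assumption that NLTK's Norwegian punkt model has been downloaded.

import re
import nltk
from nltk.stem.snowball import SnowballStemmer

# Assumed Norwegian components; run nltk.download("punkt") beforehand.
stemmer = SnowballStemmer("norwegian")
REMOVE_CHARS = re.compile(r"[^a-zA-ZæøåÆØÅ0-9 ]")

def split_sentences(text):
    # Norwegian sentence splitter.
    return nltk.sent_tokenize(text, language="norwegian")

def tokenize(sentence):
    # Remove unwanted characters, tokenize with the Norwegian model, then stem.
    cleaned = REMOVE_CHARS.sub(" ", sentence.lower())
    return [stemmer.stem(tok) for tok in nltk.word_tokenize(cleaned, language="norwegian")]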

Our models were then evaluated using the AP/Oppsummert test data. The models' predictions were used as candidate summaries, and the summaries written by the journalists at AP were used as reference summaries. Once the ROUGE scores were computed, we saved the ROUGE-1, ROUGE-2, and ROUGE-L scores. These metrics were chosen for easy comparison with results on the CNN/DailyMail dataset and because they are the most commonly used for summarization tasks in other scientific research. 10https://pypi.org/project/pyrouge/ 11https://pypi.org/project/py-rouge/ 12https://github.com/microsoft/nlp-recipes/tree/master/utils_nlp/eval/rouge


3.3.2 Human Evaluation with Journalists The human evaluation was performed with a questionnaire of both fixed-choice questions and open-ended questions, answered by journalists at Aftenposten. The fixed-choice questions represented the quantitative part of the human evaluation and were designed to investigate our model's performance through the journalists' assessment. The following measures were considered: extracted key sentences, length, redundancy, and content coverage. The open-ended questions, on the other hand, represented the qualitative part of the human evaluation. The journalists were presented with four articles and, for each article, a summary predicted by M-BERT. The examples were taken from the test set, where two examples had summaries with a high ROUGE score and two examples had summaries with a low ROUGE score. This selection process was chosen to avoid giving our model either an advantage or a disadvantage in terms of ROUGE score.

The journalists were then prompted to rank these summaries in different categories by answering the following questions:

1. How well do you think this summary managed to extract key sentences from the article?

2. How do you rate the summary in terms of length adequacy?

3. How do you rate the summary in terms of not being redundant (not having repetitive sentences)?

4. How do you rate the summary in terms of content coverage?

For the open-ended questions, we asked the journalists to comment on other weaknesses and strengths of the summaries. At the end of the questionnaire, we also asked for overall improvement suggestions and comments.

4 Results

In this chapter, we present the results from our methods. First, in section 4.1, we cover our implemented models and how they manage to predict which sentences should be picked from the articles. Secondly, in section 4.2, we present how well the models performed in our evaluations.

4.1 Implementation

As mentioned in section 3.2.4, six models for sentence prediction were im- plemented. When these models were used on the test data, they managed to extract a selected number of sentences from different positions in the original articles. The time it took to fine-tune the Norwegian BERT and Multilingual BERT is presented in Table 4.1.

Table 4.1: Time it took to fine-tune the Norwegian and Multilingual BERT models

Model      Dataset                   Fine-tune time
nb-BERT    AP/Oppsummert             45m 33s
mBERT      CNN/DM + AP/Oppsummert    11h 8m 48s + 43m 6s

For Norwegian BERT and Multilingual BERT, the selected sentences' positions in their original articles are shown in Figures 4.1 and 4.2. Six subplots are made for each model. The first subplot shows the respective model fine-tuned on the Oracle-3 training set with a prediction of three sentences on the Oracle-3 test set. The second and third subplots follow the same principle but for Oracle-7 with seven predicted sentences and Oracle-10 with ten predicted sentences. This is then repeated in subplots four, five and six, but without the use of trigram blocking. Oracle sentence positions are also presented in these subplots, showing where the top-scoring sentences occur in the original articles.

Since BERTSum has a limit of 512 tokens, the articles are truncated to fit that limit, and the predicted sentences will therefore lie within the truncated part. Assuming roughly one token per word, counting the number of tokens per sentence for each article in our AP/Oppsummert test set shows that 512 tokens correspond to an average of 26 sentences.


Figure 4.1: Sentence selection for Norwegian BERT fine-tuned on (a) Oracle-3 (b) Oracle-7 (c) Oracle-10 with trigram blocking and on (d) Oracle-3 (e) Oracle-7 and (f) Oracle-10 without trigram blocking.


Figure 4.2: Sentence selection for Multilingual BERT fine-tuned on (a) Oracle-3 (b) Oracle-7 (c) Oracle-10 with trigram blocking and on (d) Oracle-3 (e) Oracle-7 and (f) Oracle-10 without trigram blocking.


4.2 Evaluation

We present the evaluation results of the implemented models in this section. In section 4.2.1, the automatic ROUGE evaluation is presented, and in section 4.2.2 the human evaluation that was carried out with the journalists of Aftenposten.

4.2.1 ROUGE Evaluation The performance of TextRank, TF-IDF, NB-BERT, and M-BERT is presented in Table 4.2. The BERT models are also evaluated with and without trigram blocking (section 3.2.7).

4.2.2 Human Evaluation with Journalists The fixed-choice quantitative human evaluation results with the journalists are shown in Figure 4.3. The scores, ranging from 1 to 5, indicate whether the quality of a measure is very weak, weak, satisfactory, strong or very strong. Here, examples one and two were chosen based on summaries with a high ROUGE score, and examples three and four based on summaries with a low ROUGE score. Regarding the categories in Figure 4.3, it can be seen that the summaries with a high ROUGE score are:

1. strong in content coverage

2. satisfactory in non-redundancy

3. strong in summary length

4. strong in key sentence extraction

Furthermore, the summaries with low ROUGE scores are:

1. weak in content coverage

2. satisfactory in non-redundancy

3. satisfactory in summary length

4. weak in key sentence extraction

Overall, the journalists found the summaries with high ROUGE scores (examples 1 & 2) to perform better than those with low ROUGE scores (examples 3 & 4).

For the qualitative open-ended questions, we present comments on the summaries' strengths in Table 4.3 and comments about the summaries' weaknesses in Table 4.4. The overall improvement suggestions and comments are presented in Table 4.5 and Table 4.6, respectively.


Figure 4.3: Average human evaluation scores for each category where the highest score for each example is 20


Table 4.2: ROUGE scores on AP/Oppsummert test data (116 articles). *With trigram blocking

Model          Dataset                   R1      R2      RL

Oracle-3       -                         58.69   50.06   57.71
Oracle-7       -                         71.66   59.81   70.53
Oracle-10      -                         72.64   59.88   71.5
Oracle-all     -                         72.11   59.06   70.87

LEAD-3         -                         39.6    28.61   38.42
LEAD-7         -                         55.44   40.8    53.92
LEAD-10        -                         41.48   29.94   40.63

Textrank-3     -                         33.87   15.1    31.3
Textrank-7     -                         41.38   20.23   39.08
Textrank-10    -                         40.56   20.98   38.74

TF-IDF-3       -                         33.6    12.37   30.72
TF-IDF-7       -                         38.68   17.85   36.46
TF-IDF-10      -                         38.11   19.5    36.39

nb-BERT-3      AP/Oppsummert             35.62   23.1    30.49
nb-BERT-7      AP/Oppsummert             55.12   40.3    48.18
nb-BERT-10     AP/Oppsummert             54.16   39.25   47.4

mBERT-3        CNN/DM + AP/Oppsummert    39.6    28.61   34.75
mBERT-7        CNN/DM + AP/Oppsummert    55.41   40.82   48.61
mBERT-10       CNN/DM + AP/Oppsummert    54.58   39.86   47.96

nb-BERT-3*     AP/Oppsummert             35.59   23.07   30.48
nb-BERT-7*     AP/Oppsummert             54.85   39.85   48.08
nb-BERT-10*    AP/Oppsummert             53.14   37.62   46.3

mBERT-3*       CNN/DM + AP/Oppsummert    39.67   28.72   35
mBERT-7*       CNN/DM + AP/Oppsummert    55      40.11   48.27
mBERT-10*      CNN/DM + AP/Oppsummert    53.43   37.92   46.64


Table 4.3: Journalists' opinion reflecting their satisfaction with generated summaries

Measures Sample comments

Key sentences: "The algorithm is good at picking out the key sen- tences at the start of the articles" Content Coverage "I think the summary all in all has captured the most important aspects of the article" "The most important content of the main story is present in the summary" "Most of the story is well covered here" Summary Length "I think it was impressive to reduce the (quite long) story to a summary this short" "The length is good for such a long article" "I think the length is good"


Table 4.4: Journalists' opinion on generated summaries, mentioning features where they found the algorithm performed weakly.

Measures Sample comments

Key sentences: "The summary is missing some key information, and contains some information that is not neces- sary at all." "Bullet point 5 and 6 doesn’t provide any insight to the security issues. Bullet point 4 does nothing else than provide gender" Content Coverage: "The summaries doesn’t reflect all aspects of the articles" "The summary doesn’t get the main and most important point here" Context: "I don’t think quotes should be used in sum- maries, especially if it has no context (which is the case here) "Bullet point 9 is a quote, but that is not clear in the summary" "Bullet point 4 is a quote, but it doesn’t say from whom" Redundancy: "A few sentences are unnecessarily repetitive" "Bullet point 6 is redundant when you have point 7" "feels a little repetitive and excessive" Subtitles in summary: "In several sentences, the algorithm has com- bined subheaders and text" "Bullet point 6 is just a one word subtitle" Coherence: "It doesn’t seem like a summery (text), but more like a listing of facts." Truncation: "The summary seems to just cover the first half of the main article (?)" "Sometimes it seems as if the summaries did not include the last third of the original stories"


Table 4.5: Journalists’ opinion on generated summaries reflecting potential for improvement.

Response 1 "Better coverage of the whole article" 2 "The content. It doesn’t extract the most important points of the article, and the way it’s written is not very good." 3 "Mellomtitler should never be included in a summary - this should be easy to avoid. In crime stories it is often touchy ethical questions involved. I do not think the summaries handled this in a satisfying way (the lawyer’s quote in the Prinsdal-story)." 4 "See my comments on each story." 5 "The algorithm is good at picking out the key sen- tences at the start of the articles, but seems to struggle with contextualisation and order. Also, towards the end of the articles, it often misses out on key informa- tion or fails to summarize longer parts efficiently."

Table 4.6: Overall comments from journalists

Response 1 "This is impressively good, but not quite there. Some important points/facts from some of the articles are missing (when the main article is long, the second half of it seems to be missing in the summary?)" 2 "I think the two first summaries were the best ones." 3 "No answer" 4 "Overall this was not too bad considering." 5 "With some improvements, this could be a useful tool for creating a base structure for us to edit and rewrite. That could save time. But as it stands, it would not be a reliable automated tool for writing summaries."

5 Discussion

This chapter presents our thoughts and reflections about different parts of the thesis. In section 5.1, we analyze our results and discuss their meaning. Then, in section 5.2, we discuss our methods and how they could be improved. Moreover, in section 5.3, we focus on our work in a wider context, giving our thoughts on related ethical and societal aspects.

5.1 Results

In the following, we highlight our main findings from the results that were presented in chapter 4.

5.1.1 ROUGE Scores Table 4.2 compared ROUGE scores between various models. However, it is worth noting that the Oracle models cannot be used directly for sentence prediction, since they are created from the golden summaries. The Oracles are presented as an upper limit for our models, to better understand how well a model could perform if it always extracted the articles' exact top ROUGE-scoring sentences. Keep in mind that the Oracles select sentences with a greedy algorithm, making it possible to dynamically choose the number of sentences that best match the golden summary for each article. In contrast, the prediction models have a fixed output. They will therefore never perform above or on par with the Oracle scores, simply because they cannot dynamically choose the number of sentences to predict for different articles.

The results presented in Table 4.2 show that the LEAD baselines give high ROUGE scores compared to the golden summaries. This observation strengthens the statement that the summaries provided in the AP/Oppsummert dataset are strongly extractive and usually take sentences from the beginning of the document. What is somewhat surprising is that this naive approach performs better than any other model for three- and seven-sentence predictions. This means that, to create a summary with three or seven sentences, selecting the first three or seven sentences of an article would result in higher ROUGE scores. Thus, the models have not yet learned to reliably pick the top-scoring Oracle sentences that we see in Table 4.2, a problem that is further discussed in section 5.1.2.

When predicting summaries with ten sentences, the BERT models perform better than LEAD-10. We therefore consider this score more relevant for seeing what the models have learned during training. When predicting ten sentences, the models evidently manage to pick better sentences than those just at the beginning of each document.

When comparing nb-BERT with m-BERT, it can be noticed that m-BERT, fine-tuned with the CNN/DM dataset, outperforms nb-BERT in all experiments. This indicates that m-BERT performs better than nb-BERT in few-shot learning. Furthermore, this result is supported by the work of [9] and [38], in which it was shown that m-BERT performs better than other models when training data is limited.

It can be seen that the models without trigram blocking generally give a slightly better ROUGE score than the ones with trigram blocking. This result is expected considering the lead bias. However, the ROUGE score alone does not show how well the trigram blocking reduced redundancy in the selected sentences. This is something that the human evaluation gives insight into, which was the reason for selecting the model with trigram blocking for the human evaluation. The results from this are discussed in section 5.1.3.

The statistical models did not perform well in terms of the ROUGE score and we will not further discuss their scores and performances.

5.1.2 Sentence Selection It is seen from the figures in section 4.1 that the fine-tuned models are prone to picking sentences from the beginning of an article, which results in a lead bias. For example, in Figure 4.2(a), where three sentences are predicted, we see that the first three sentences have a much higher proportion of being selected than the rest of the sentences in AP/Oppsummert's test set. The same pattern can be seen in Figure 4.2(b), where the model predicts seven sentences, but sometimes it also picks sentences from positions eight, nine, and ten. In Figure 4.2(c), ten predictions are made. In this case, however, the proportions of the selected sentences are more distributed across the sentence positions in AP/Oppsummert's test set.

The behavior described above comes from the convention of placing key sentences in the first part of an article, as shown in Figure 3.4 for AP/Oppsummert. The same problem is mentioned by Liu [23], who tested a non-pre-trained transformer called TransformerEXT that uses the same architecture as BERTSum. The author observed a clear bias towards the first sentences of the articles in the CNN/DM dataset and states that TransformerEXT relied more on shallow positional features and less on deeper document representations. In our case, the same statement can be made. It is, however, unclear why the model is biased towards the first sentences. One factor could be that the limited data we currently have is not enough for the model to learn deeper document representations.

Zhu et al. [45] discuss that, especially in news articles, the leading sentences are often the most important part of the article. Therefore, LEAD is in general a hard baseline to beat for many deep learning models. Moreover, comparing the LEAD-3 score for CNN/DM mentioned in Liu's paper [23] to our LEAD-3 score for AP/Oppsummert, we see that ours is significantly higher. One explanation for this is that the data used in our work might have a higher positional bias than the CNN/DM dataset used in the work of [23].

In Figures 4.1 (d-f) and 4.2 (d-f), we observe a stronger lead bias when our BERT models predict summaries without trigram blocking. The BERT models predicting without trigram blocking got a slightly better ROUGE score than those predicting with trigram blocking. However, we see a better distribution of the selected sentences with trigram blocking. Therefore, trigram blocking is applied in the model we developed in our work.

5.1.3 Human Evaluation with Journalists Quantitative assessment Regarding our quantitative human evaluation shown in Figure 4.3, we presented four different examples to five journalists. Every block of a stacked bar in the figure shows the average score of the journalists' responses. These blocks represent the different categories of a summary that we wanted to assess, presented through the questions shown in section 3.3.2. Since the scores range from one to five, we consider our model to have performed well enough on an example if every block in the bar has a score above three. For instance, examples one and two in Figure 4.3 show scores above three, which means that the model managed to satisfy the journalists for those examples. However, the two last examples proved less satisfying to the journalists, because the model underperformed in aspects such as content coverage and key sentence extraction.

As described in the results (section 4.2.2), the first two examples were chosen based on summaries with a high ROUGE score, while the last two examples were selected based on summaries with a low ROUGE score. This is interesting because, as observed in Figure 4.3, the journalists scored the first two examples as satisfying, whereas the last two examples underperformed in the four categories presented in the figure. This suggests that there could be agreement between how the journalists assess the quality of a summary and the ROUGE metric.

Qualitative assessment For the qualitative human evaluation, the results in section 4.2.2 present the journalists' thoughts on different aspects of the summaries. In the following, we discuss these comments:

Key sentences: Regarding key sentence extraction, we notice that the journalists consider the summaries to miss meaningful sentences and to include sentences that do not bring any value. It is, however, observed that sentences picked from the beginning are considered relevant for the summary, although these sentences alone are not sufficient for a satisfying summary.

Content coverage: The results in Table 4.3 indicate that, in some cases, the summaries managed to cover the essential aspects of their main article. This means that, for those cases, the leading sentences indeed cover the main content of the whole article, which matches the discussion about LEAD sentence selection in section 5.1.2. However, most comments in Tables 4.4 and 4.5 state that the summaries did not have sufficient content coverage, which means that only considering leading sentences is not enough.

Summary length: The summary length seems to be appreciated, which indicates a correct analysis of the lengths of the previously written summaries, as presented in Figure 3.2.

Context: According to the results, most cases where sentences are considered to be out of context occur when the model has extracted quotes. This could be due to the fact that quoted persons are usually presented before or after the actual quote. If the model misses selecting this presentation, the quoted sentence will appear out of context. Another problem with context is when it is not clear that a selected sentence is part of a quote. This happens when the quote consists of several sentences, as in the following example: - "Sentence A. Sentence B. Sentence C.", said the firefighter. In this case, if only sentence B is extracted, nothing in the summary states that this sentence is a quote.

One reason for the journalists' dissatisfaction about context could be our extractive approach to text summarization. As mentioned earlier, the intention behind such an approach was to avoid misleading context by creating an extractive rather than an abstractive summarization model. That way, it was believed that summaries generated from extracted sentences would, in most cases, not go out of context. However, from the previous discussion, we observe that even though the sentences are extracted from the article, the summary as a whole can still give a different impression than the original article. On the other hand, one strength of the extractive summarization technique is that it guarantees that the summaries at least do not introduce new words and definitions that might result in more significant problems.

Redundancy: Even though trigram blocking was used, the journalists were still not entirely satisfied with the model's performance in terms of redundancy. As a future direction, we aim to introduce techniques that further reduce redundancy. These comments are interesting, considering the model's tendency to select most of the sentences from the beginning of the article. This indicates that the beginnings of these articles also contain redundant sentences. One would think that redundancy would become a bigger problem once the model learns to select sentences from a broader range of positions in the article.

Sub headers in summary: We also observed from the comments that a summary could contain the subheaders of an article. This problem could be linked to the format of the AP/Oppsummert dataset, which we discuss further in section 5.2.1.

Coherence: The journalists felt that the summaries read more like a listing of facts than a coherent summary. However, the summaries were never intended to be a fluent text, but rather bullet points of the most important sentences of the longer article.

Truncation: It is evident that the journalists notice the model's limitation of only covering the first part of the articles, and that this has a negative impact on the summary. This result speaks against both BERT's and BERTSum's 512-token limit and the lead bias during fine-tuning. It seems that even though the first sentences often contain essential information, that is not reason enough to truncate articles and only use the first sentences for summarization.

We interpreted the journalists' general opinion to be that they do not believe the model in its current state can be used directly for article summarization. However, with some improvements, they see potential in the model.

5.2 Method

In this section, our chosen methods are discussed. More specifically, we look at resource limitations and things that could have been done differently. We also reflect on how these relate to the outcome of the results and whether other approaches could lead to better results.

5.2.1 Datasets In general, when training a model to perform a specific task with neural networks, it is essential to have datasets of high quality. In this section, we therefore discuss the quality of the datasets that were used to implement our models.

CNN/DM CNN/DM is one of the most commonly used datasets for comparing text summarization models. It was also the dataset used by Liu [23] in the BERTSum paper. Therefore, we used CNN/DM as our dataset for fine-tuning the Multilingual BERT model. The advantage of this is that the results of our model can be compared with those of Liu [23]. However, some drawbacks of using this dataset for this project have been identified:

1. The average number of sentences in the summaries is 3.75, while for AP/Oppsummert it is 9.5. The fine-tuning of mBERT could have made more sense if the model were first trained on a dataset with a similar number of sentences in the summaries.

2. Like the AP/Oppsummert dataset, the CNN/DM dataset also has its most important sentences at the beginning of the documents (see [23]). This means that after fine-tuning on CNN/DM, the model could already have a bias towards picking the first sentences of the articles. As already discussed in section 5.1.2, this phenomenon is observed in the results of the model implemented in the current work as well. With this in mind, having the model train on CNN/DM, which likewise is biased towards the first sentences, may not be the best approach. It would have been interesting to investigate what the output would be if the model instead were trained on a dataset with a more distributed sentence selection.

AP/Oppsummert The initial limitation of the data provided by Aftenposten and Oppsummert was known to be the limited amount of 979 summaries. However, during implementation, further limitations were observed. In the following, we list some of these issues, which could be improved in the future:

1. The fact that the summaries mainly consist of sentences that have just been picked from the beginning of the article. We believe that was the main reason behind the model’s tendency towards choosing most of the sentences from the beginning of the article.

2. The variation in the number of related articles for each summary, which led to the removal of articles. In our work, restructuring the dataset was essential for training a single-document classifier. However, summaries based on several articles can lead to non-optimal results, since we remove potential top Oracle sentences when removing these articles.

3. The raw article data does not contain HTML tags. The articles are presented online with HTML, but the data in the dataset consisted only of raw text. This made it impossible for us to separate headlines from paragraphs, which resulted in headlines being treated the same as sentences during model training and prediction.

4. The summary data contained additional unrelated text. This was a problem we discovered late in the development process. Some of the summaries include text that is not necessarily relevant to the summary, for example a prompt to read more about the topic in the main article.

Dealing with the first issue could not be done during the project's implementation period. The human-written summaries would have to be updated with new summaries that consider other sentences of the articles.

For the second issue, a solution could be to implement a multi-document approach for the summaries written from more than one article. This is something we never tested and considered out of scope for the project, especially since the goal was to implement a model that could perform sentence selection on single documents.

The third and fourth issues deal with how the dataset was processed before we got access to the data. They could have been dealt with by manually going through the 979 summaries or by re-doing the data extraction process. However, since we did not have the resources, it was decided to stick with the current state of the data to save both time and labor.

5.2.2 Implementation All fine-tuning of the BERT models was done using Google Colab, which worked efficiently for the AP/Oppsummert dataset but not for CNN/DM, as that dataset is much larger. Fine-tuning on CNN/DM resulted in longer training periods, causing sessions in Google Colab to time out because of the limitations described in section 3.2.8. Therefore, as mentioned in section 3.2.6, where CNN/DM was used in our experiments we only extracted 10 000 samples for our training set. The time it took to fine-tune the BERT models is shown in Table 4.1. It would have been interesting to fine-tune M-BERT on all samples in CNN/DM. However, because of the limitations of Google Colab, that was out of the scope of this work.

The initial thought for model optimization was to include a validation set during the fine-tuning process, giving us insight into when to stop training. However, we chose to skip this step for two main reasons:

1. We are limited by the number of samples we have in AP/Oppsummert’s dataset. Splitting that into an additional set for validation would reduce the number of samples we have in our training set.

2. As mentioned in section 3.2.5, we decided to fine-tune our models for a fixed number of three epochs. With the number of epochs fixed, there is no need to have a validation set for deciding when training should stop.

5.2.3 Metrics The evaluation was done in two parts. In this section, we discuss the adequacy of ROUGE and our human evaluation.

ROUGE The limitations of ROUGE, as mentioned in section 2.5.2, have been widely discussed across different papers. Many authors question its suitability as an evaluation metric for summaries and the fact that it is used to claim state-of-the-art performance of models. However, to our knowledge, there is no other established metric today that can be used to compare evaluation results across papers, and developing a new metric would be out of scope for this project. Therefore, we decided to use ROUGE as one of our metrics. However, we were also interested in knowing how journalists assess the model's performance. Therefore, an evaluation study was carried out as a complement to ROUGE.


Using ROUGE, we assumed that the golden summary is the true best summary of the article. Such an assumption has some disadvantages. There is no reason why one summary has to be better than another, since this is subjective. Two very different summaries could be equally good, simply because both can capture the full context of an article. This means that potentially good summaries are penalized because they do not use words similar to the reference summaries. For evaluation, this problem can be addressed by additional evaluation. However, that is not the case when we use ROUGE to create Oracle sentences for fine-tuning our models. This means that the model will limit its learning to the "true best" summary that we have for that article.

One way to address the issue above is to formulate a new assumption that does not rely on the golden summary being the only solution. ROUGE does support comparison against several summaries, so one solution could be to train on a dataset with more than one summary for each article. This, however, can become costly, since each article would need more than one journalist for summary writing, for which there were simply no resources during the limited project period.

One limitation that we did attempt to deal with was that ROUGE only considers content selection. We did this by using Liu's [23] trigram blocking algorithm for reducing redundancy among the selected sentences. According to the results presented in section 4.2, the usage of trigram blocking did indeed lead to a broader range of selected sentences. This means that the algorithm managed to find sentences that were considered redundant.

Regarding other aspects of the content selection limitation of ROUGE, such as coherence, we did not directly implement anything to counteract this, because our summaries are presented as bullet points; considering such measures was therefore out of the scope of our work. If the summaries were presented more like a fluent text, more research would have been spent on covering this limitation.

Human Evaluation with Journalists We consider having our model's performance evaluated by journalists highly valuable to the research. Since these journalists write summaries themselves, we consider their answers highly reliable. However, we should have used these resources to a greater extent, especially at the beginning of the project. Throughout the implementation of the project, ROUGE was the only metric used for tracking progress, while the extensive human evaluation was only applied at the end.


We believe that the questionnaire was formed in a way that covered the quality of the summaries well. What makes a human evaluation of summaries hard is how to interpret different opinions about quality, that is, what is really meant by a summary being "really good" or "really bad". This needs to be made explicit somehow. Our way of doing this was to have the journalists rate how well each summary performed in specific categories. We defined a good summary as one that captures key sentences from the article, has a decent length, is not redundant, and covers the original article's content well.

One could argue that the human evaluation is less reliable due to the few participants in the questionnaire. This would indeed be a big problem if the results fluctuated. However, since the journalists more or less share the same opinion, we consider the human evaluation still sufficient to draw conclusions from.

5.3 The work in a wider context

Language models such as BERT require large datasets to be trained on to achieve good results. Because of this, current advancements in monolingual language models have mostly been made for large-resource languages such as English. This makes it hard to train a monolingual model for other, low-resource languages. However, we see potential in using a multilingual model such as M-BERT, which supports many languages as mentioned in section 2.3.4. From the results explained in section 5.1.1, we see that fine-tuning the model with a larger dataset (CNN/DM) in a large-resource language, such as English, can boost the performance of the task-specific model significantly. When fine-tuning M-BERT with CNN/DM and our limited dataset in Norwegian, we could outperform the monolingual Norwegian BERT model. Similarly, our approach can be applied to other tasks involving a BERT model in other low-resource languages.

5.3.1 Ethical Aspects In this section, we highlight the ethical aspects of the summaries that our model automatically generates. The model extracts sentences from an article, which ensures that we are using the journalists' own sentences. However, there are still some primary concerns that need to be addressed.

Firstly, when journalists write a summary, they are aware of what should be highlighted and how the summary should be structured to convey the article's main points transparently. A machine-generated summary, however, might not pick the sentences that the journalist finds important. This could result in a shift of the main point in the summary. This problem can be seen in some of our extracted summaries and should be expected.

Secondly, a machine-generated summary might fail to pick sentences that depend on each other. This can be problematic because if the summary only includes one of the dependent sentences, the information conveyed can be false. For example, if the related sentences are "A is true." and "However, not if B is false.", and only the first sentence is picked, the summary states that A is always true, which in the original context it was not.

Depending on the topic and content, the issues mentioned can become critical. In its current state, the machine cannot make these judgments itself. Therefore, we advise using the extractive summarization model, BERTSum, as a tool to help journalists write their summaries.

6 Conclusion

This chapter will summarize our concluding thoughts on the purpose and the research questions described in our thesis. Ideas for improvements will be discussed at the end of the chapter.

6.1 Conclusion

This thesis project aimed to develop a model that could extract the most relevant sentences from Norwegian news articles. This aim has to an extent been achieved, but with limitations. In the following, we conclude our investigations and then answer the main research question.

• How news summaries can be used to generate labeled data that is required for a supervised model

To generate labeled data from summaries, two algorithms have been presented:

1. Greedy sentence selection algorithm
2. Combination sentence selection algorithm

These algorithms create labels by maximizing the ROUGE score between the summaries and their original articles. In this work, the greedy algorithm is used because the combination algorithm quickly becomes computationally heavy as more sentences are selected. Thus, in terms of computational efficiency, the greedy algorithm proved to be the better label generator for this task.


• How the model’s performance should be evaluated and assessed

The performance of a summarization model can be evaluated by comparing the model’s generated summaries with other, ideal summaries. Two ways of doing this have been presented:

1. With the automatic metric ROUGE, which compares the generated summaries with human-written summaries using n-grams and word sequences.
2. With human evaluation, where humans compare the generated summaries with the best summary they have in mind.

The first method has limitations regarding summary quality, while the second method is not sustainable on a larger scale. Additionally, manual experiments are difficult to compare across papers. Therefore, until another metric is developed, ROUGE will continue to be the method of choice to evaluate text summarization models on larger scales.

• How BERT can be used for extractive text summarization on a low-resource language

Two approaches have been presented for using BERT for extractive text summarization on a low-resource language, both based on a modified BERT architecture called BERTSum. The first approach is to use, if available, a monolingual model in the target language, which in this case is the Norwegian BERT model. Otherwise, Multilingual BERT can be used, allowing cross-lingual fine-tuning. On the limited Norwegian news dataset provided by AP/Oppsummert, the Multilingual BERT performed slightly better than the Norwegian BERT in terms of ROUGE score. In conclusion, the Multilingual BERT is the better model, as it can be trained on a limited Norwegian dataset and reach a higher performance level than the monolingual BERT model. Furthermore, this shows great potential for applying the multilingual model to other low-resource languages where data is limited.

• Limitations with BERT and how they should be dealt with

The two main limitations of using BERT in this project have been shown to be:

1. It cannot directly be used for extractive text summarization, since BERT is only able to differentiate between a pair of sentences.
2. It has a token limit of 512, which means that longer articles cannot directly be used as input to the model.

The first limitation was dealt with by implementing BERTSum, which modifies BERT’s input sequence and embeddings to differentiate and extract sentences. The second limitation was dealt with by truncating longer texts to only use the first 512 tokens. This method was motivated by the fact that key sentences were found at the beginning of each article in the dataset. However, the human evaluation has made it clear that this option was not sufficient for creating high-quality summaries. Even if the first sentences contain key information, sentences later in the article still need to be considered to create a satisfying summary in terms of content coverage. (A minimal sketch of this input preparation and truncation is given below.)
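The sketch below illustrates the input preparation and truncation just described: each sentence is wrapped in [CLS] ... [SEP], the segment ids alternate per sentence, and everything beyond 512 tokens is dropped. It is a simplified illustration rather than the exact BERTSum implementation; whitespace tokenization stands in for BERT’s WordPiece tokenizer, and the helper name is ours.

def prepare_bertsum_input(sentences, max_tokens=512):
    """Return tokens, interval segment ids and the positions of the
    per-sentence [CLS] tokens."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        sent_tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
        if len(tokens) + len(sent_tokens) > max_tokens:
            break  # truncation: later sentences are simply discarded
        cls_positions.append(len(tokens))               # this [CLS] represents the sentence
        tokens.extend(sent_tokens)
        segment_ids.extend([i % 2] * len(sent_tokens))  # alternating interval segments
    return tokens, segment_ids, cls_positions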

Together with the results and discussion, the main research question can now be answered from these investigations:

How can a high-performance BERT-based extractive summarization model be developed based on a limited amount of news summaries in Norwegian?

By fine-tuning a pre-trained model using the modified BERTSum architecture, we developed an extractive summarization model. By investigating two approaches, we found Multilingual BERT to perform best in terms of ROUGE score. Even though the model outperformed the other presented models in ROUGE score, the journalists pointed out that, in its current state, the model would not be a reliable automated tool for writing summaries. However, with some future work, it could become a valuable tool for creating a base structure for journalists at AP/Oppsummert to edit and rewrite, saving them time and workload.

6.2 Future Work

To improve this work further, we suggest the following changes to the implementation, which we suspect could positively impact the final results of this project.

1. We have seen that the model can learn sentence selection based on the data it is provided with. Therefore, we expect that with a better structured and less lead-biased dataset, the model should learn to pick sentences based on context rather than on positions.

2. As shown by the human evaluation, experiments with other solutions to BERT’s token limit have to be made to see if they yield better final results. One idea could be to split the input articles into multiple sub-articles, classify them, and combine the results (a rough sketch of this idea is given after this list). This approach would be more expensive, but it could potentially solve the content coverage problem.


3. Another experiment would be to replace the CNN/DM dataset with a dataset containing summaries with more sentences, to see if this gives better results when fine-tuning M-BERT.

4. Continued work must also be done on the problem of subheaders and unattributed quotes appearing in the generated summaries. One possible solution would be to filter these out of the article during prediction. This solution requires some automatic tagging to differentiate subheaders, quotes, and the article text itself.
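As a rough sketch of the splitting idea in point 2 (one possible interpretation, not an implemented solution), an article could be cut into overlapping token windows of at most 512 tokens so that every sentence can be scored, after which the per-window scores would have to be merged. The whitespace tokenization and the overlap size are assumptions made purely for illustration.

def split_into_windows(tokens, window_size=512, overlap=64):
    """Yield token windows of at most window_size, each overlapping the
    previous one by overlap tokens."""
    step = window_size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + window_size]

# Usage sketch: score each window separately with the summarization model and
# combine the per-sentence scores, e.g. by taking the maximum over windows.
# windows = list(split_into_windows(article_text.split()))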

Bibliography

[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. “TensorFlow: A system for large-scale machine learning”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016, pp. 265–283. URL: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2016. arXiv: 1409.0473 [cs.CL].

[3] Miguel Romero Calvo. Dissecting BERT Part 1: Understanding the Transformer. https://medium.com/@mromerocalvo/dissecting--part1-6dcf5360b07f. Accessed: 2020-12-06.

[4] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. DOI: 10.3115/v1/D14-1179. URL: https://www.aclweb.org/anthology/D14-1179.

[5] Papers With Code. Document Summarization on CNN / Daily Mail. https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail. Accessed: 2021-03-18.


[6] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. 2019. arXiv: 1901.02860 [cs.LG].

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805. URL: http://arxiv.org/abs/1810.04805.

[8] Henning Carr Ekroll and Kjetil Magne Sørenes. Avgåtte regjeringspolitikere får karantenelønn på grunn av egne selskaper. Enkelte har vært helt uvirksomme. https://www.aftenposten.no/norge/politikk/i/P9BX5J/avgaatte-regjeringspolitikere-faar-karanteneloenn-paa-grunn-av-egne-selska. Accessed: 2021-05-13.

[9] Khalid N Elmadani, Mukhtar Elgezouli, and Anas Showk. “BERT Fine-tuning For Arabic Text Summarization”. In: arXiv preprint arXiv:2004.14135 (2020).

[10] Jeffrey L. Elman. “Finding structure in time”. In: Cognitive Science 14.2 (1990), pp. 179–211. ISSN: 0364-0213. DOI: https://doi.org/10.1016/0364-0213(90)90002-E. URL: https://www.sciencedirect.com/science/article/pii/036402139090002E.

[11] Wenche Fuglehaug Fallsen. Tiltalte: Ble provosert da Mohammed sa «jeg elsker henne». https://www.aftenposten.no/norge/i/Jo2V6P/tiltalte-ble-provosert-da-mohammed-sa-jeg-elsker-henne. Accessed: 2021-05-13.

[12] Jan Gunnar Furuly and Hans O. Torgersen. Havarikommisjon kritiserer sikkerhetsbrudd etter at 15-åring omkom i strømulykke. https://www.aftenposten.no/norge/i/8mo9bw/havarikommisjon-kritiserer-sikkerhetsbrudd-etter-at-15-aaring-omkom-i-s. Accessed: 2021-05-13.

[13] Yoav Goldberg. “A Primer on Neural Network Models for Natural Language Processing”. In: CoRR abs/1510.00726 (2015). arXiv: 1510.00726. URL: http://arxiv.org/abs/1510.00726.

[14] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. “Teaching Machines to Read and Comprehend”. In: NIPS. 2015, pp. 1693–1701. URL: http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.

[15] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In: Neural Computation 9 (Dec. 1997), pp. 1735–80. DOI: 10.1162/neco.1997.9.8.1735.


[16] Rani Horev. BERT Explained: State of the art language model for NLP. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270. Accessed: 2020-12-06.

[17] Karen Spärck Jones. “A statistical interpretation of term specificity and its application in retrieval”. In: Journal of Documentation 28 (1972), pp. 11–21.

[18] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. “OpenNMT: Open-Source Toolkit for Neural Machine Translation”. In: CoRR abs/1701.02810 (2017). arXiv: 1701.02810. URL: http://arxiv.org/abs/1701.02810.

[19] Simeon Kotadinov. Understanding Encoder-Decoder Sequence to Sequence Model. https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346. Accessed: 2021-03-18.

[20] Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. “Neural text summarization: A critical evaluation”. In: arXiv preprint arXiv:1908.08960 (2019).

[21] Chin-Yew Lin. “ROUGE: A Package for Automatic Evaluation of Summaries”. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.

[22] Yang Liu. “Fine-tune BERT for Extractive Summarization”. In: CoRR abs/1903.10318 (2019). arXiv: 1903.10318. URL: http://arxiv.org/abs/1903.10318.

[23] Yang Liu and Mirella Lapata. “Text summarization with pretrained encoders”. In: arXiv preprint arXiv:1908.08345 (2019).

[24] Infolks Pvt Ltd. Recurrent Neural Network and Long Term Dependencies. https://medium.com/tech-break/recurrent-neural-network-and-long-term-dependencies-e21773defd92. Accessed: 2021-03-18.

[25] H. P. Luhn. “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”. In: IBM Journal of Research and Development 1.4 (1957), pp. 309–317. DOI: 10.1147/rd.14.0309.

[26] Rada Mihalcea and Paul Tarau. “TextRank: Bringing Order into Text”. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 404–411. URL: https://www.aclweb.org/anthology/W04-3252.


[27] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. “Recurrent neural network based language model”. In: vol. 2. Jan. 2010, pp. 1045–1048.

[28] Multilingual BERT snippet. URL: https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/pretrained-snippets/multilingual-bert-snippet.

[29] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. “Abstractive text summarization using sequence-to-sequence RNNs and beyond”. In: arXiv preprint arXiv:1602.06023 (2016).

[30] Christopher Olah. Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/. Accessed: 2021-03-17.

[31] Oppsummert. Egne selskaper ga regjeringspolitikerne etterlønn. https://www.aftenposten.no/norge/i/70gbJ4/egne-selskaper-ga-regjeringspolitikerne-etterloenn. Accessed: 2021-05-13.

[32] Oppsummert. Kritikk etter dødsulykken på Filipstad. https://www.aftenposten.no/verden/i/xPWQ6G/kritikk-etter-doedsulykken-paa-filipstad. Accessed: 2021-05-13.

[33] Oppsummert. Prinsdal: Politiet knytter ny siktet til gjengmiljøet. https://www.aftenposten.no/norge/i/WbBReG/prinsdal-politiet-knytter-ny-siktet-til-gjengmiljoeet. Accessed: 2021-05-13.

[34] Oppsummert. Tiltalte: Ble provosert da Mohammed sa «jeg elsker henne». https://www.aftenposten.no/norge/i/RR23Jd/tiltalte-ble-provosert-da-mohammed-sa-jeg-elsker-henne. Accessed: 2021-05-13.

[35] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Previous number = SIDL-WP-1999-0120. Stanford InfoLab, Nov. 1999. URL: http://ilpubs.stanford.edu:8090/422/.

[36] Michael Quinn Patton. “Qualitative evaluation checklist”. In: Evaluation checklists project 21 (2003), pp. 1–13.

[37] Romain Paulus, Caiming Xiong, and Richard Socher. “A Deep Reinforced Model for Abstractive Summarization”. In: CoRR abs/1705.04304 (2017). arXiv: 1705.04304. URL: http://arxiv.org/abs/1705.04304.

[38] Telmo Pires, Eva Schlinger, and Dan Garrette. “How multilingual is Multilingual BERT?” In: CoRR abs/1906.01502 (2019). arXiv: 1906.01502. URL: http://arxiv.org/abs/1906.01502.


[39] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.

[40] Wasim Riaz, Daniel Røed-Johansen, Frøydis Braathen, and Harald Stolt-Nielsen. Politiet: 20-åringen som er siktet i Prinsdal-saken, har tilknytning til gjengmiljøet. https://www.aftenposten.no/norge/i/mRdr1l/politiet-20-aaringen-som-er-siktet-i-prinsdal-saken-har-tilknytning-t. Accessed: 2021-05-13.

[41] Abigail See, Peter J. Liu, and Christopher D. Manning. “Get to the point: Summarization with pointer-generator networks”. In: arXiv preprint arXiv:1704.04368 (2017).

[42] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. “How to Fine-Tune BERT for Text Classification?” In: CoRR abs/1905.05583 (2019). arXiv: 1905.05583. URL: http://arxiv.org/abs/1905.05583.

[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. 2017. arXiv: 1706.03762 [cs.CL].

[44] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. In: CoRR abs/1609.08144 (2016). arXiv: 1609.08144. URL: http://arxiv.org/abs/1609.08144.

[45] Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, and Xuedong Huang. Leveraging Lead Bias for Zero-shot Abstractive News Summarization. 2021. arXiv: 1912.11602 [cs.CL].

A Appendix

A.1 All responses from Human Evaluation

Each section presents an article, its generated summary, and the journalists’ responses to the questionnaire.

A.1.1 Article 1

The article can be found in [12] and its gold summary in [32].

The generated summary, where each bullet point is an extracted sentence from the original article:

• Statens havarikommisjon for transport retter kritikk om mangler ved inngjerding, skilting og sikkerhetsvurdinger etter ulykken ved Filipstad der en 15-åring døde og to ble skadet.

• Ulykken skjedde 24. februar 2019 da de tre ungdommene hadde tatt seg inn på Filipstad driftsbanegård, like ved Ruseløkka fritidsklubb.

• Da de klatret opp på et hensatt togsett, ble alle tre utsatt for strøm fra kjøreledningen.

• En 15-årig gutt døde, mens en 15 år gammel gutt og en 16 år gammel jente ble alvorlig skadet.

• Den ene av de to ungdommene som falt på bakken, klarte ifølge rap- porten å gå ut av togtunnelen på stedet og tilkalle hjelp etter det som hadde skjedd.


• Vedkommende løp deretter tilbake til ulykkesstedet for å hjelpe sine venner.

• Gjerdet hadde hull og var for lavt Ungdommene hadde forut for ulykken tatt seg gjennom et hull i et gjerde som ikke var i henhold til sikkerhetsforskriften: Det skulle være 180 cm, men var på det laveste kun 106 cm.

• I tillegg manglet denne delen av gjerdet korrekt skilting med «Adgang forbudt».

• Oppe på taket beveget de seg slik at de kom i berøring med kontaktled- ningsanlegget og fikk strømgjennomgang.

• I det kraftige strømstøtet som fulgte ble den omkomne liggende på taket, mens de to andre ble kastet ned på bakken.

Qualitative responses from the journalists are presented in Table A.1:


Table A.1: Responses from the journalists on article 1

Response Weaknesses/strength of the summary?

1 This summary doesn’t reflect all the aspects of the article

2 I think the summary is a bit long, but that’s mostly because it sometimes is quite repetitive. If you could take out the repetitive parts, I think the summary all in all has captured the most important aspects of the article.

3 When thinking about Oppsummert-articles, I think the shorter, the better. Hence the grade 3 on length adequacy. I think I would have preferred the text to have a little less details about the actual event and one sentence containing an answer from Bane Nor. I think these two sentences could have been dropped: • En 15-årig gutt døde, mens en 15 år gammel gutt og en 16 år gammel jente ble alvorlig skadet (some of it has already been said in the first sentence). • Oppe på taket beveget de seg slik at de kom i berøring med kontaktledningsanlegget og fikk strømgjennomgang. Also, there is something strange with this sentence (but maybe it is simply missing a bullet point before the word Ungdommene?). • Gjerdet hadde hull og var for lavt Ungdommene hadde forut for ulykken tatt seg gjennom et hull i et gjerde som ikke var i henhold til sikkerhetsforskriften: Det skulle være 180 cm, men var på det laveste kun 106 cm.

4 Bullet point 5 and 6 doesn’t provide any insight to the security issues. Bullet point 4 does nothing else than provide gender.

5 The summary is accurate, but feels a little repetitive and excessive. For instance two sentences are about the number of deaths/injuries. The faults found in the report could have been summarized more succinctly and it should have included one quote/sentence about the response from Bane NOR.

A.1.2 Article 2

The article can be found in [40] and its gold summary in [33].

The generated summary, where each bullet point is an extracted sentence from the original article:


• Politiet har siktet enda en 20 år gammel mann for drapet eller medvirkn- ing til drapet på Halil Kara (21) i Prinsdal.

• Ifølge en fersk kjennelse er han godt kjent for politiet.

• Den 20 år gamle mannen er fra Oslo og er den tredje som blir siktet i drapssaken i Prinsdal sør i Oslo.

• En kamerat av ham ble skadet i basketaket som fulgte.

• 20-åringen som fredag ble siktet i drapssaken, var allerede varetekts- fengslet for fire uker på bakgrunn av grov ulovlig bevæpning på of- fentlig sted.

• Han ble fremstilt for fengsling i Oslo tingrett torsdag.

• Han ble fengslet med brev- og besøksforbud i to uker.

• Han er blitt tatt to ganger, i løpet av kort tid, med svært farlig skytevåpen».

• Politiinspektør Grete Lien Metlid har gitt uttrykk for at politiet foreløpig ikke kjenner motivet for drapet.

• Aftenposten er kjent med at det over tid har vært flere konflikter mellom unge menn fra Holmlia og unge menn fra Mortensrud.

Qualitative responses from the journalists are presented in Table A.2:


Table A.2: Responses from the journalists on article 2

Response Weaknesses/strength of the summary?

1 The most important content of the main story is present in the summary. One small error, and some small repeats. But all over, a quite good summary.

2 Some spelling errors, and it’s sometimes difficult to understand if the text is talking about the deceased or the accused. It seems more like a summary, then a coherent text, so it’s not the best reading experience.

3 I think one of the sentences in the summary is in fact incorrect – the sentence «En kamerat av ham ble skadet i basketaket som fulgte». I think «ham» in this case actually is the killed person, Halil Kara (21). But I think the original text is written in a way that makes it easy to misunderstand this. I think it was impressive to reduce the (quite long) story to a summary this short. This (quite important sentence) is the only one I am missing from the summary (would have been included if I had written it myself): Øyvind Bratlien er forsvarer for 20-åringen som ble siktet fredag. Han skriver i en SMS til Aftenposten at hans klient sier at han er uskyldig.

4 Bullet point 4 is confusing and lacks context. Bullet point 3 provides little new information with many words. Bullet point 6 is redundant when you have point 7.

5 Most of the story is well covered here, but a few sentences are unnecessarily repetitive. For instance, "20-åringen som fredag ble siktet i drapssaken," could have just been "Den siktede". Also, this sentence appears out of context and should just be removed: "En kamerat av ham ble skadet i basketaket som fulgte."

A.1.3 Article 3

The article can be found in [8] and its gold summary in [31].

The generated summary, where each bullet point is an extracted sentence from the original article:

• Flere av Erna Solbergs tidligere regjeringskollegaer har fått ekstra etter- lønn fordi de har opprettet selskaper som gir interessekonflikter.


• Eksstatsråd Robert Erikssons selskap er allerede avviklet – uten at det har gitt inntekter.

• Som Aftenposten tidligere har skrevet, har Erna Solberg satt rekord i antallet statsrådsutskiftninger i sitt regjeringsprosjekt.

• Den store gjennomtrekken av politikere i regjeringsapparatet har ført til en stadig voksende etterlønnsregning til skattebetalerne.

• 20 millioner i etterlønn før Frp-exit Aftenposten har sammenstilt data fra Statsministerens kontor og Karantenenemnda.

• Også Stavanger Aftenblad har omtalt omfanget av etterlønnsutbetalin- gene.

• Aftenposten har sett nærmere på hva som har skjedd i disse virk- somhetene.

• Konsulentselskap uten inntekter ga høy etterlønn Blant dem som har fått aller mest etterlønn er Fremskrittspartiets Robert Eriksson, som gikk av som arbeidsminister i desember 2015.

• Da tenkte jeg at jeg får livnære meg selv, med det jeg gjorde før jeg gikk inn i politikken.

• Da drev jeg med rådgivning om pensjon og forsikring, sier Eriksson, som i dag er administrerende direktør i arbeidsgiverorganisasjonen Sjø- matbedriftene.

Qualitative responses from the journalists are presented in Table A.3:


Table A.3: Responses from the journalists on article 3

Response Weaknesses/strength of the summary?

1 This is a rather long and complicated main article. The summary doesn’t get the main and most important point here, that this is something the politicians are doing on purpose. And some of the sentences in the summary are not important at all.

2 I think this summary is missing some key points. The part of the article that is summarized, does not make me more able to understand the pressure points of the case. It seems more like a list of facts, then a summary of this particular case. And I don’t think quotes should be used in summaries, especially if it has no context (which is the case here)

3 The original article is a bit complicated, so making a summary here is a real challenge, I would say. A couple of the sentences contain both a sentence AND the «mellomtittel». I have here marked the mellomtittel with () - they should ideally have been removed: • (20 millioner i etterlønn før Frp-exit) Aftenposten har sammenstilt data fra Statsministerens kontor og Karantenenemnda. AND «• (Konsulentselskap uten inntekter ga høy etterlønn) Blant dem som har fått aller mest etterlønn er Fremskrittspartiets Robert Eriksson, som gikk av som arbeidsminister i desember 2015.» It should ideally be made clearer that this sentence is a quote from R. Erikssen: • Da tenkte jeg at jeg får livnære meg selv, med det jeg gjorde før jeg gikk inn i politikken.

4 Bullet point 3 is not very relevant. Bullet point 5 and 8 have included subtitles that give no meaning in the summary. Bullet points 6 and 7 give nothing in terms of content. Bullet point 9 is a quote, but that is not clear in the summary.

5 This one had quite a few weaknesses. In several sentences, the algorithm has combined subheaders and text (like here: "20 millioner i etterlønn før Frp-exit Aftenposten har sammenstilt data fra Statsministerens kontor og Karantenenemnda.") The length is good for such a long article, but the review of different politicians could more efficiently have been merged into one or two summarizing sentences. Some sentences are also irrelevant in a summary, such as the one about Stavanger Aftenblad. Also, key pieces of information, like the total sum of money (20 million), is not included properly.


A.1.4 Article 4

The article can be found in [11] and its gold summary in [34].

The generated summary, where each bullet point is an extracted sentence from the original article:

• Mohammed Altai (16), som døde etter en voldsepisode på Holmlia i 2017, skal ha gått på hjertemedisiner da ugjerningen mot ham skjedde.

• Den tiltalte 21-åringen nekter å forklare seg for retten.

• I Oslo tingrett sitter en 21 år gammel mann som er tiltalt for grov kropp- skrenkelse og for å ha etterlatt Mohammed Altai i en hjelpeløs tilstand.

• – Min klient opplever saken som svært belastende, og avisoppslagene de siste dagene har vært en ekstra påkjenning for ham.

• Saken er dratt ut av proporsjoner der både ære og drap er trukket inn.

• Hjertefeil?

• I retten fortalte aktor Børge Enoksen at obduksjonsrapporten av Mo- hammed Altai viser at slagene fra tiltalte var relativt beskjedne, og at de var med flat hånd.

• Mohammed døde uansett av skadene.

• Han fikk hjerteflimmer som følge av oksygenmangel til hjernen, falt i koma og døde fem uker senere, 25. juli 2017.

• Mohammeds mulige hjertesykdom skal belyses senere denne uken når den medisinske sakkyndig rapporten senere skal legges frem i retten.

Qualitative responses from the journalists are presented in Table A.4:


Table A.4: Responses from the journalists on article 4

Response Weaknesses/strength of the summary?

1 The summary seems to just cover the first half of the main article (?)

2 The length is good. The summary is missing some key information, and contains some information that is not necessary at all. Still it doesn’t seem like a summery (text), but more like a listing of facts.

3 The summary is quite confusing to read and I do not think I would understand much had I not known the whole original story beforehand. The summary sentences seem very randomized when I read them. If the third sentence (• I Oslo tingrett sitter en 21 år gammel mann som er tiltalt for grov kroppskrenkelse og for å ha etterlatt Mohammed Altai i en hjelpeløs tilstand.) had been first in the summary, it may have been easier to get into the story. On this note, it should be noted that the original text is quite complicated, and the text is quite long. The motive behind the (possible) crime is not really made clear in the summary (the story about the relationship with the sister). It also seems that the summary emphasizes the first and the middle part of the original text – and not the last third of the original txt. Also, this word is just a «mellomtittel» and could have been dropped: • Hjertefeil?

4 Bullet point 4 is a quote, but it doesn’t say from whom. Bullet point 5 is also a quote, but it is not presented as such. Bullet point 6 is just a one word subtitle. The summary does not include anything about the accused’s younger sister.

5 This one did not work so well. It is a good example that the first sentences often have key information, but still needs to be contextualized and ordered properly in a summary. Sentence 2 and 3 should have been switched around, and as the summary goes on, more bullet points appear totally out of context.
