DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Employing a Transformer Language Model for Information Retrieval and Document Classification

Using OpenAI's generative pre-trained transformer, GPT-2

ANTON BJÖÖRN

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in
Date: July 11, 2020
Supervisor: Håkan Lane
Examiner: Viggo Kann
School of Electrical Engineering and Computer Science
Host company: SSC - Swedish Space Corporation
Company Supervisor: Jacob Ask
Swedish title: Transformermodellers användbarhet inom informationssökning och dokumentklassificering


Abstract

As the information flow on the Internet keeps growing, it becomes increasingly easy to miss important news which does not have a mass appeal. Combating this problem calls for increasingly sophisticated information retrieval methods. Pre-trained transformer based language models have shown great generalization performance on many natural language processing tasks. This work investigates how well such a language model, OpenAI's Generative Pre-trained Transformer 2 (GPT-2), generalizes to information retrieval and classification of online news articles, written in English, with the purpose of comparing this approach with the more traditional method of Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The aim is to shed light on how useful state-of-the-art transformer based language models are for the construction of personalized information retrieval systems. Using transfer learning, the smallest version of GPT-2 is trained to rank and classify news articles, achieving results similar to the purely TF-IDF based approach. While the average Normalized Discounted Cumulative Gain (NDCG) achieved by the GPT-2 based model was about 0.74 percentage points higher, the sample size was too small to give these results high statistical certainty.

Keywords: Deep Learning, Transformer Models, Information Retrieval, Ranking, Generative Pre-training, Document Classification

Sammanfattning

Informationsflödet på Internet fortsätter att öka vilket gör det allt lättare att missa viktiga nyheter som inte intresserar en stor mängd människor. För att bekämpa detta problem behövs allt mer sofistikerade informationssökningsmetoder. Förtränade transformermodeller har sedan ett par år tillbaka tagit över som de mest framstående neurala nätverken för att hantera text. Det här arbetet undersöker hur väl en sådan språkmodell, OpenAIs Generative Pre-trained Transformer 2 (GPT-2), kan generalisera från att generera text till att användas för informationssökning och klassificering av texter. För att utvärdera detta jämförs en transformerbaserad modell med en mer traditionell Term Frequency-Inverse Document Frequency (TF-IDF) vektoriseringsmodell. Målet är att klargöra hur användbara förtränade transformermodeller faktiskt är i skapandet av specialiserade informationssökningssystem. Den minsta versionen av språkmodellen GPT-2 anpassas och tränas om till att ranka och klassificera nyhetsartiklar, skrivna på engelska, och uppnår liknande prestanda som den TF-IDF-baserade modellen. Den GPT-2-baserade modellen hade i genomsnitt 0.74 procentenheter högre Normalized Discounted Cumulative Gain (NDCG) men provstorleken var ej stor nog för att ge dessa resultat hög statistisk säkerhet.

Nyckelord: djupinlärning, transformermodeller, informationssökning, ranking, generativ förträning, dokumentklassificering

Contents

1 Introduction 1
  1.1 Overview ...... 1
    1.1.1 The Space Industry ...... 1
    1.1.2 Swedish Space Corporation ...... 2
    1.1.3 Problem Formulation ...... 2
    1.1.4 Approach ...... 2
  1.2 Research Question ...... 3
    1.2.1 Hypothesis ...... 3
    1.2.2 Conditions & Limitations ...... 3

2 Background 5
  2.1 Artificial Neural Networks ...... 5
    2.1.1 The Perceptron ...... 5
    2.1.2 Deep Neural Networks & Stochastic Gradient Descent ...... 6
  2.2 Transformer Networks ...... 8
    2.2.1 Network Architecture ...... 10
    2.2.2 Generative Pre-Training and GPT-2 ...... 12
    2.2.3 Generalization Capabilities of GPT-2 ...... 13
  2.3 Information Retrieval ...... 16
    2.3.1 Document Ranking & Relevance Score ...... 16
    2.3.2 TF-IDF ...... 17
    2.3.3 Neural Relevance Scoring ...... 18
    2.3.4 Ensemble Relevance Scoring (TF-IDF + GPT-2) ...... 19
  2.4 Evaluation ...... 19

3 Method 21
  3.1 Data Collection ...... 21
    3.1.1 Web Scraping ...... 21


    3.1.2 SSC News Summaries & Fundamental Data Set ...... 21
    3.1.3 Daily Web Scraping & Main Data Set ...... 22
    3.1.4 Labeling of Main Data Set ...... 24
  3.2 TF-IDF Based Model ...... 25
    3.2.1 Training ...... 25
    3.2.2 Scoring and Classification ...... 26
  3.3 GPT-2 Based Model ...... 27
    3.3.1 Adaptation ...... 27
    3.3.2 Relevance Classifier ...... 28
    3.3.3 Zone Classifier ...... 29
    3.3.4 Data Set ...... 30
    3.3.5 Training ...... 30
    3.3.6 Scoring & Classification ...... 31
  3.4 Ensemble Models ...... 32
    3.4.1 Multiplicative Model ...... 32
    3.4.2 Re-ranking Model ...... 32
  3.5 Evaluation ...... 33
    3.5.1 N-Fold Cross Validation ...... 33
    3.5.2 Normalized Discounted Cumulative Gain ...... 33
    3.5.3 Classification Accuracy ...... 34
    3.5.4 Resource Demands ...... 34
  3.6 Statistical Significance & Null Hypothesis ...... 34

4 Results 36
  4.1 Relevance Scoring ...... 36
  4.2 Classification Accuracy ...... 37
  4.3 NDCG vs Accuracy ...... 38
  4.4 NDCG@10 ...... 40
  4.5 T-Test ...... 41
  4.6 Resource Demands ...... 41

5 Discussion 42
  5.1 Model Performance ...... 42
  5.2 Model Robustness ...... 43
  5.3 Ensemble Models ...... 43
  5.4 Resource Demands ...... 44
  5.5 Ethical Considerations ...... 45
  5.6 Sustainability & Societal Considerations ...... 46

6 Conclusions 47
  6.1 Employment of Transformer Language Models ...... 47
  6.2 Future Work ...... 48

Bibliography 50

Chapter 1

Introduction

This chapter introduces the problem and provides some basic information about the host company and the approach that was applied.

1.1 Overview

The amount of information published on the Internet every day is vast. For some companies, the news published within their industry alone can be more than a single person has time to read, making it hard to stay up to date. Furthermore, there are many forces at work behind what is published and how it is published (click-baiting, biases, paid articles) which can make finding the most relevant and trustworthy news challenging. This flood of information calls for sophisticated information retrieval techniques to aid in navigating the growing information landscape.

1.1.1 The Space Industry

The space industry has seen big changes since the beginning of the century, with launch prices reduced by about a factor of 20 [1], accompanied by a major shift towards commercialisation [2, 1]. With a boom in space related startups and commercial interest [1], the information flow within the industry is higher than it has ever been.

With future plans ranging from huge satellite constellations providing global Internet services [3] to the colonization of Mars [4] and new moon missions [5] there seems to be no end to the coming growth and change.


1.1.2 Swedish Space Corporation

Swedish Space Corporation (SSC), previously known as Rymdbolaget, is a mainly commercial space company owned by the Swedish government. SSC operates the Esrange spaceport in northern Sweden, where it carries out missions for its customers, including launching experiments aboard sounding rockets and weather balloons. The company also provides science and launch services, satellite ground network services from its many ground stations around the world, and engineering and spacecraft operation services.

1.1.3 Problem Formulation

Swedish Space Corporation (SSC) wants to help its employees stay up to date with industry events in a time-efficient manner by constructing a program that can automatically retrieve and present the most relevant news items for SSC. The system should automatically collect news articles from selected sites and, upon request, produce a summary of news collected within a given range of dates.

The scientific question being investigated is whether a modern transformer model, more specifically Generative Pre-trained Transformer 2 (GPT-2) [6], can be used to improve the results of such a system by being employed for document ranking and classification. The primary objective of the system will, however, be document ranking; classification is a secondary objective.

1.1.4 Approach

Collection of news articles will be done by crawling and scraping pages from a list of online news sites created with the help of SSC. This web scraper will be used both to collect data to be labeled for training and to collect articles for creating the news summaries in the finished software.

Two different methods for document relevance scoring (ranking) and classification will be implemented. One will be a vector space model based on TF-IDF (term frequency-inverse document frequency) and will serve as a baseline for performance. The other method will perform relevance scoring and classification via transfer learning on the pre-trained transformer model GPT-2 [6]. All web pages scraped will be written in English, since GPT-2 is mainly trained on English text.

1.2 Research Question

"Is employing a transformer language model using transfer learning better than a proven traditional approach for the purpose of information retrieval and document classification?"

This research question is evaluated by comparing the performance of the two different approaches. In this work, "better" for the purpose of information retrieval is defined as retrieving results of higher relevance, and "better" for document classification is defined as achieving a higher classification accuracy. In addition to these metrics of performance, the resource demands of both approaches will also be taken into consideration in the form of execution time.

To answer this question, software must be constructed and two different relevance scoring and classification methods implemented into it. One method needs to represent a standard approach to the task, to serve as a baseline, and the other needs to be based on a state-of-the-art transformer language model. To construct these methods, and the information retrieval part of the system, data collection in the form of web crawling and scraping will have to be implemented. To perform initial training of both methods, news items previously manually collected and labeled by SSC will have to be compiled into a data set. Once both methods are implemented they need to be evaluated in three areas: relevance scoring, classification accuracy and resource demands.

1.2.1 Hypothesis

Training the smallest version of GPT-2 for document ranking and classification on a small data set yields results competitive with a more standard approach.

1.2.2 Conditions & Limitations

This project was carried out during a 6-month period, with work on the actual program and testing focused on a roughly 4-month period. SSC provided computers and the expertise of their employees for labeling data, as well as guidance in steering the project towards a useful end product. The amount of GPU resources available was limited and impacted the choice of model and method.

The project aimed to investigate ways to fine-tune an existing model, i.e. to make relatively small additions or changes to an existing model rather than to construct a new one. The project did not aim to improve a traditional approach to information retrieval and classification, but to implement it up to industry standard in order to compare it with a promising new method. Creation of a basic graphical user interface for the program was necessary, but features aimed strictly towards usability were beyond the scope of the project.

Chapter 2

Background

In this chapter, basic theory about artificial neural networks and information retrieval is provided, along with more in-depth explanations of the models and methods used. Information about related previous work in the area is also presented, along with its implications for this work.

2.1 Artificial Neural Networks

Neurons are one of the two major types of cells that make up the brains of animals, the other being glia [7]. Neurons are the main cells responsible for sending electrical nerve signals through the body upon being sufficiently electrically stimulated themselves [7]. The creation of what is now called artificial neurons began with McCulloch and Pitts [8] creating a computational model of neural networks. This was followed by work by Kleene [9] on the link between McCulloch-Pitts neurons and finite automata, a type of computation machine. Hence artificial neurons and artificial neural networks were originally inspired by and named after biological neurons.

2.1.1 The Perceptron

The perceptron could be called the forefather of today's deep and complex artificial neural networks. It can be seen as the simplest form of an artificial neural network, consisting of a single artificial neuron which produces an output signal based on a sum of its input signals. In its simplest form the signal is binary (on or off) based on a threshold θ, but it can also be a real unthresholded value or mapped using an activation function such as a sigmoid [10].


A Simple Binary Perceptron:

$$\text{output} = \varphi(X), \quad X = \text{the set of all inputs } x$$

$$\varphi(X) = \begin{cases} 1 & \text{if } \sum_{x \in X} c_x x > \theta \\ 0 \text{ or } {-1} & \text{otherwise, depending on network design} \end{cases} \tag{2.1}$$

Here the coefficients $c_x$ are what decide the function of the perceptron. These coefficients correspond to the weights in deep artificial neural networks where, inside the network, the inputs would be the outputs of the previous layer of artificial neurons. One way to look at a single binary perceptron is that it divides the input space into two halves using a single line (or plane or hyperplane) [11].

2.1.2 Deep Neural Networks & Stochastic Gradient Descent

Connecting multiple perceptrons in sequence allows for a more complex division of the input space, provided that some form of non-linearity (activation function) is applied to the outputs of all perceptrons [11]. Without the non-linearity there is no point in making a network deeper. If the operation performed on a vector $X_{n-1}$ by layer $n$ is called $f_n$ and we know that $f_n$ has no non-linearity, i.e. that it is a linear combination of the elements of $X$, the following holds:

$$X_2 = f_2(f_1(X_0)) = M_{f_2} M_{f_1} X_0 = (M_{f_2} M_{f_1}) X_0 = M_{f_{1,2}} X_0 = f_{1,2}(X_0) \tag{2.2}$$

Here $M_{f_n}$ is a matrix representation of the operation performed by layer $n$, which is only possible because the operation is strictly a linear combination of elements in $X$. Once all layers are simply matrix multiplications, the associative property of matrix multiplication allows us to calculate a new matrix to represent the operations of all layers at once, i.e. to create a single layer which performs the exact same calculation. This means no complexity was gained by stacking layers.

When training a single layer of artificial neurons (a perceptron), setting the coefficients for each input can be done in many ways, for example by simply minimizing the square error on a data set. The deeper our networks get, the harder it becomes to tune the coefficients of each layer, since each layer's optimal set of coefficients depends on all the others. This is where stochastic gradient descent (SGD), the key to training artificial neural networks, comes in. In short, SGD can be summarized by the following steps [11]:

• One or more data points are fed through the network and their outputs are calculated

• The calculated outputs are compared to the desired outputs for those data points

• The error between the desired and calculated outputs is calculated; this is called the cost or loss

• Starting from the end of the network and working backwards, the derivative of the cost with respect to all coefficients is calculated, giving us a gradient for the entire network

• All coefficients are updated by moving a small step in the negative gra- dient direction, i.e. to reduce the cost

• The above steps are repeated until some criterion is met, at which point training is finished

Figure 2.1: Illustration of a simple two-layer-deep artificial neural network.

For the network illustrated in figure 2.1 the parameters learnt are $M_1$ and $M_2$, the network weights. As outlined in the steps above, given an input vector we first calculate the output:

$$\text{Output} = Y_{\text{output}} = M_2 X_1 = M_2 L(M_1 X_0) \tag{2.3}$$

Next we calculate the error between the output and the desired output $Y_{\text{label}}$, the "label" of the data point. Let's suppose we want to minimize the square error, so we choose this as our cost function:

$$\text{Cost Function} = E(Y_{\text{label}}, Y_{\text{output}}) = (Y_{\text{label}} - Y_{\text{output}})^2 \tag{2.4}$$

Now all that remains is to calculate the gradients of the cost with respect to $M_1$ and $M_2$ and to change both by a small amount in the negative gradient direction:

$$M_1^* = M_1 - \eta \frac{\partial E}{\partial M_1}, \qquad M_2^* = M_2 - \eta \frac{\partial E}{\partial M_2} \tag{2.5}$$

Here the asterisk '*' marks the updated values of $M_1$ and $M_2$. The variable $\eta$ is called the learning rate and is typically a small value (about 0.1 to 1e-6) which is often varied during training and tuned based on the task and network design. This type of SGD is the most basic way to train a network; recent years have seen many improvements to the method, such as the inclusion of momentum and other averaging strategies for the gradient, a good example being the ADAM optimizer [12].
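To make the update concrete, the following is a minimal NumPy sketch of one SGD step for the two-layer network of figure 2.1, assuming the non-linearity $L$ is a sigmoid; the names mirror the equations above and are illustrative only, not the thesis's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(M1, M2, x0, y_label, eta=0.01):
    """One SGD step for Y = M2 L(M1 X0) with squared-error cost
    (equations 2.3-2.5), assuming L is a sigmoid."""
    # Forward pass (eq. 2.3)
    x1 = sigmoid(M1 @ x0)
    y_out = M2 @ x1
    # Gradient of the squared-error cost (eq. 2.4) w.r.t. the weights
    err = 2 * (y_out - y_label)
    grad_M2 = np.outer(err, x1)
    grad_M1 = np.outer((M2.T @ err) * x1 * (1 - x1), x0)
    # Step in the negative gradient direction (eq. 2.5)
    return M1 - eta * grad_M1, M2 - eta * grad_M2
```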

2.2 Transformer Networks

The paper "Attention Is All You Need" by Vaswani et al. [13], at , is often quoted as the original transformer work and goes into depth explaining the workings of transformers. In this report we will more briefly summarise how transformers came about and how they work.

Before transformers, the most commonly used networks for handling text were different kinds of Recurrent Neural Networks (RNNs) [13]. An RNN handles generating and reading text sequentially, i.e. by looking at it one piece (often a word) at a time, taking as input at each step the previous internal state and the current underlying state.

Figure 2.2: Computation graph of a Recurrent Neural Network.

As illustrated in figure 2.2, the states $S_t$ are the network's internal representation states of the sequence, constructed from the previous such state and the currently observed underlying state $x_t$, which for text is often a vector representation of a word. Using the internal state, an output $y_t$ is produced for each time step $t$. Or equivalently: $y_t = W_y(W_s S_{t-1} + W_x x_t)$

Improvements to RNNs include Long Short-Term Memory networks (LSTMs) [14], which also include a memory state that is maintained throughout the sequence and given as input at each time step. The main issue with training RNNs is their recurrent nature, each step being dependent on previous steps, which makes their training unsuitable for parallelization [13]. The work by Vaswani et al. [13] did away with the recurrent nature of RNNs by focusing their entire architecture on a mechanism called "attention", which was introduced as an improvement to RNNs in 2014 by Bahdanau, Cho, and Bengio [15].

The idea behind attention is that the network learns which other parts of the data to "pay attention to" at the current step, so that it can use the information from that part to produce the correct output. A simple example would be a translation network trying to translate a word with two different meanings, such as "address", having to pay attention to the context in order to know if it is a verb or a noun. If the full sentence is "What is your address?" then paying attention to "what is" might indicate that it is a noun, as opposed to a context such as "Who did you address?". This type of attention is often called "self-attention" since we calculate attention for the input on itself.

2.2.1 Network Architecture

Inside a transformer language model the words of a text are represented as vectors of numbers, word embeddings. Before passing these vectors into the network, another vector of equal length, called a positional embedding, is added element-wise to each word vector. The positional embedding of each word contains information about its location in the text. In this way a text can be represented as a series of vectors (a matrix) where each vector describes a word and its location in the text.

Figure 2.3: An illustrative mock-up example of GPT-2’s text encoding

In the case of GPT-2 each vector represents a token rather than a word. The tokens can be anything from individual symbols to pairs of words [6]. A single attention block typically learns 3 mappings (matrices of weights), $W_q$, $W_k$, $W_v$, which respectively map each word embedding vector $h_i$ to its query, key and value vectors.

A Single Attention Block:

$$h_i W_q = q_i, \quad h_i W_k = k_i, \quad h_i W_v = v_i \tag{2.6}$$

Each new word embedding $h'_i$ is produced as a sum of the values $v_j$ scaled by the similarity of $q_i$ and $k_j$; Vaswani et al. [13] and Radford et al. [6] used a scaled dot-product as the similarity measure. The similarity between $q_i$ and $k_j$ is what is called the attention.

With $d$ the dimension of the query and key vectors:

$$h'_i = \sum_j \text{Softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right) v_j \tag{2.7}$$

The softmax function normalizes a vector into a probability distribution using the exponentials of its elements:

$$\text{Softmax}(x)_i = p_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \tag{2.8}$$

The attention blocks actually used by Vaswani et al. [13] and Radford et al. [6] are called multi-head attention blocks. This simply means applying multiple different single attention blocks in parallel, concatenating their resulting vectors and applying a final linear mapping $W_o$ before passing the results onward.
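As an illustration of equations 2.6-2.8, the following NumPy sketch implements a single scaled dot-product self-attention block; the random weight matrices stand in for the learnt mappings $W_q$, $W_k$, $W_v$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def single_attention_block(H, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a matrix H with one
    token embedding per row (equations 2.6-2.8)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv            # queries, keys, values (eq. 2.6)
    d = Q.shape[-1]
    attention = softmax(Q @ K.T / np.sqrt(d))   # attention of token i on token j
    return attention @ V                        # new embeddings h'_i (eq. 2.7)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 16))                    # 5 tokens, embedding size 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(single_attention_block(H, Wq, Wk, Wv).shape)  # -> (5, 8)
```

A multi-head block simply runs several such blocks in parallel, concatenates their outputs and applies a final linear mapping.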

In GPT-2 a single "transformer block" consists of layer normalization followed by a multi-head attention block, followed by another layer normalization and a feed forward layer. The feed forward layer and the multi-head attention block have skip-connections [16, 6]. This is illustrated in figure 2.4.

Figure 2.4: An illustration of a single transformer block of GPT-2.

Transformer blocks are stacked one after the other and make up the main part of the model.

Figure 2.5: A schematic illustration of the entire GPT-2 architecture.

As seen in figure 2.5 the number of transformer blocks, N, varies between different versions of GPT-2. For the smallest version N = 12, for the medium version N = 24, for the large version N = 36 and for the largest version N = 48. This report uses the smallest version so the model has 12 transformer blocks.

2.2.2 Generative Pre-Training and GPT-2

This work uses the transformer model GPT-2 (Generative Pre-trained Transformer 2) from "Language Models are Unsupervised Multitask Learners" by Radford et al. [6] at OpenAI. Starting from the pre-trained weights of the transformer language model in that paper, the model is adapted to score relevance and to classify documents. This model was chosen for the following reasons:

• It is relatively new, the first release being in February 2019.

• It is trained on similar texts to those involved in this project (web text, amongst which are news articles).

• The model is available in four different sizes which allows starting out using the smallest and then scaling up if feasible considering time and computation needed.

• It achieves state-of-the-art performance (the largest version of the model).

The big feature after which GPT-2 is named is generative pre-training. One of the big problems for complex machine learning models is the need for large amounts of annotated data to train on [17]. For text processing, the amount of annotated data available is dwarfed by the amount of available unannotated data, which is quickly realized by the fact that the entire Internet is a collection of unannotated text. The idea behind GPT-2 is to leverage this by performing unsupervised pre-training on vast amounts of Internet text before being fine-tuned on more specific tasks.

Simply put, GPT-2 is trained to guess the next word (token) in a text given the entire text so far as context (up to a maximum of 1024 tokens ≈ 1024 words). This lets the model train on any text, as the correct answer is always given so long as there is a next token in the sequence. The results of this are impressive, creating a language model with state-of-the-art performance on 7 out of 8 language modelling data sets tested [6].

2.2.3 Generalization Capabilities of GPT-2

With such extensive pre-training and good language modelling capabilities, it is of interest to see what other tasks GPT-2 could be made to perform given supervised training. Xu et al. [18] showed that GPT-2 can be adapted to perform text classification. They compared training GPT-2 from scratch in a supervised manner with first performing unsupervised pre-training and then fine-tuning it on a classification task. They found that the pre-trained version needed far fewer data points to reach the same performance as the version trained only in a supervised fashion. A few other examples of research on the generalization capabilities of pre-trained transformers (using GPT-2 or similar architectures) are:

• "Learning to Few-Shot Learn Across Diverse Natural Language Classi- fication Tasks" by Bansal, Jha, and McCallum [19]

Using BERT [20] as the underlying architecture for their transformer model they created a model called LEOPARD. From abstract: "LEOPARD ... shows better generalization to tasks not seen at all during training, with as few as 4 examples per label. Across 17 NLP tasks ... we show that LEOPARD learns better initial parameters for few-shot learning than self-supervised pre-training or multi-task training, outperforming many strong baselines, for example, yielding 14.5% average relative gain in accuracy on unseen tasks with only 4 examples per label."

Implication: The work shows that in some cases it is possible for a pre- trained transformer to generalize to new tasks with very little additional data.

• "TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents" by Wolf et al. [21]

Trained a transformer model based on GPT [16] (the model preceding GPT-2) into an open-domain dialogue system called TransferTransfo. From abstract: "The resulting fine-tuned model shows strong improvements over the current state-of-the-art end-to-end conversational models like memory augmented seq2seq and information-retrieval models. On the privately held PERSONA-CHAT dataset of the Conversational Intelligence Challenge 2, this approach obtains a new state-of-the-art, with respective perplexity, Hits@1 and F1 metrics of 16.28 (45% absolute improvement), 80.7 (46% absolute improvement) and 19.5 (20% absolute improvement)."

Implication: Results indicate that GPT's general knowledge of language helps it generalize to achieve state-of-the-art performance on dialogue tasks.

• "Enriching Conversation Context in Retrieval-based Chatbots" by Tahami and Shakery [22]

Used BERT [20] to construct a retrieval-based dialogue system, i.e. a dialogue system based on matching queries to available responses. From conclusion: "The model improves upon the BERT Bi-Encoder baseline without greatly affecting inference speed."

Implication: It is possible to perform matching between different text pieces, query and answer, using transformer models with good results.

Based on these findings it should not be surprising that GPT-2 can be used to classify text, more specifically news articles in this case. The model is trained to generate text, i.e. to give the next word in a sequence given the sequence thus far. This is what is exploited to classify texts using GPT-2. By feeding the model texts in a special format ending with the correct label, the model can be conditioned to predict this label given an incomplete sequence. In the work on the first version of GPT [16], each sequence started and ended with randomly initialized start and end tokens, and parts within the sequence were separated by "$" as a delimiter token.

Figure 2.6: An example of how GPT-2 could be used as a Part of Speech tagger.

GPT-2 could for example be used as a Part of Speech tagger if given sequences consisting of a sentence, a delimiter token and then the word to tag. In the labeled training examples in figure 2.6, the label placeholder would be replaced with the correct label.

Basically, the task of "predicting the word" following the sequence becomes the task of assigning a label. There is no need to actually extract the word at the label position. Instead we can simply use the activations in the network at that point and feed them to a classifier network which performs the classification and compares the result to a label, which could for example use normal one-hot encoding. After that it is just a matter of propagating the error backwards through the network. This approach requires retraining the entire network, which for the larger models requires significant computation even when just fine-tuning.

This work uses a very similar approach. By attaching two different end tokens to each passage (of about 96 words) from an article, the attention mechanism inside the network can be fine-tuned so that the output vector for the first token can be used for zone classification and the output vector of the second token for relevance scoring. To each of these vectors a small feed-forward classifier network is then attached, one for zone classification and one for relevance scoring. This approach has the advantage of performing both tasks at once by running the model just once.

2.3 Information Retrieval

The book "Introduction to Information Retrieval", by Manning, Raghavan, and Schütze [23], starts by defining information retrieval (IR) as an academic field of study in the following way:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

In the case of this project the "material" is news articles and "large collections" is the Internet. The information need is to stay informed regarding the events and developments within the space industry, from the perspective of SSC.

2.3.1 Document Ranking & Relevance Score

One way of retrieving documents is to set up a criterion, such as "all documents containing the word 'space'", and then simply present all documents that fulfill this criterion. This is known as boolean retrieval and is best suited for smaller collections of text or very specific sets of criteria which will result in very few retrieved documents. In order to find "the best" document in a collection, it is necessary to sort the documents according to their relevance to our query. If our relevance measure is good enough, the documents with the highest scores will hopefully be what we are looking for [23].

This project investigates setting this relevance score by applying a neural network to each potential document. In order to properly determine the effectiveness of this approach, a more standard approach to the task is implemented as a baseline, namely TF-IDF scoring (term frequency-inverse document frequency).

2.3.2 TF-IDF

TF-IDF (term frequency-inverse document frequency) is what is called a vector-space model. This means that it produces a vector to represent each document in a vector space. The task of determining the relevance of a document is then simply to calculate the similarity between the document's vector and some vector representing our query, usually using cosine similarity [23].

Document $d$'s TF-IDF vector is formed by first calculating the term frequencies ($TF_{d,t}$) for all terms $t$ in the document that occur in the corpus:

$$TF_{d,t} = \frac{\text{Occurrences of } t \text{ in } d}{\text{Number of terms in } d} \tag{2.9}$$

Then each term is weighted by how informative it is. TF-IDF builds on the assumption that a term's informativeness is inversely proportional to its document frequency, i.e. that the fewer documents a word occurs in, the more informative it is. The inverse document frequency ($IDF_t$) for each term $t$ in the corpus is calculated and multiplied with each term frequency to form its final TF-IDF value:

$$IDF_t = \log\left(\frac{\text{Total Documents in Corpus}}{\text{Documents Containing } t}\right) \tag{2.10}$$

$$TF\text{-}IDF_{d,t} = TF_{d,t} \times IDF_t \tag{2.11}$$

This gives one TF-IDF vector per document, with one dimension per term in the corpus.

TF-IDF is one of the most widely used methods and weighting schemes for document ranking [24, 25]. The original TF-IDF work is by Luhn [26] and dates back to 1957. Despite this, it is not hard to find more recent examples of TF-IDF being used, [27] (2003), [28] (2009), and very recent ones [29, 24] (2020). A TF-IDF based relevance model will therefore be used as the baseline for performance in this work. In addition to being widely used, it is a mathematically and computationally quite simple method, which makes it easier to ensure that it was implemented correctly.

2.3.3 Neural Relevance Scoring

Producing relevance scores using neural networks is in itself not new. Already in 2013, Lu and Li [30] developed an architecture for matching short texts, which can be used to retrieve relevant documents if one of the texts is a query. More recent work includes the Deep Relevance Matching Model (DRMM) by Guo et al. [31], using a feed-forward network, and DeepRank by Pang et al. [32], which utilizes Convolutional Neural Networks and Recurrent Neural Networks (more specifically Gated Recurrent Units). A transformer architecture similar to GPT-2, namely BERT by Devlin et al. [20], has been used for relevance ranking by Qiao et al. [33], Nogueira and Cho [34], and also by Tahami and Shakery [22].

The current approach to query relevance scoring with a transformer based architecture seems to be concatenating the document being scored to the end of the query, separated by a delimiter token, and then feeding it to the network. In this project we are not interested in varying the query, so instead the query is programmed into the weights of the network and is part of what the network learns. Whatever text is shown to the network, it will only tell you how relevant that text is to SSC. This implicit query is taught through the way the data is labeled, i.e. by labeling data according to a general relevance to SSC we teach the network to perform that particular task. This means that the network won't be a general search engine; to change the query, it would have to be trained again on new data whose relevance was labeled according to that query. The query implicitly taught to the network could abstractly be summarised as "documents that are relevant to SSC". The idea is that by making the model highly specialized it will perform its task better.

In the work of Pang et al. [32] and Guo et al. [31], the loss function used to train the network for relevance scoring is a form of sigmoid hinge-loss. Given two documents that have been assigned relevance scores $R(p)$ by the model in the range [0.0, 1.0] (by application of a sigmoid), determine which is labeled as having higher relevance, $p^+$, and which lower, $p^-$. The loss is calculated as follows:

$$L(p^+, p^-) = \max\left(0,\; 1 - R(p^+) + R(p^-)\right) \tag{2.12}$$

This loss makes sure that the most relevant documents frequently have their relevance boosted, while documents of lower relevance frequently have their relevance diminished. Documents of average relevance are both boosted and diminished. This loss measure has been shown to work for training models with a function similar to the one developed in this work [32, 31], and so it is incorporated into the cost function used to train the model. The sigmoid hinge-loss should cause the model to distribute documents more evenly across the entire relevance score interval from 0 to 1, instead of clustering documents at either 0 or 1.

2.3.4 Ensemble Relevance Scoring (TF-IDF + GPT-2)

Nogueira and Cho [34] used the transformer model BERT [20] for re-ranking documents, i.e. to improve the ordering of the top results as ranked by another method. They show that using a transformer in this way can improve the ranking. A key advantage of this approach is that it reduces the number of documents to which the transformer model needs to be applied, which is desirable since it is computationally more demanding than vector space models such as TF-IDF. In this work, a TF-IDF model is combined with a GPT-2 model in a similar way: the former performs an initial ranking of news items while the latter reorders (re-ranks) the top 10%-50% of documents. This method can potentially leverage the strengths of both approaches to produce an ensemble method which outperforms both.

In addition to combining the TF-IDF and GPT-2 models in this way, a simple multiplication of their respective relevance scores is also investigated as a potential ensemble method.

2.4 Evaluation

The work of Pang et al. [32], DeepRank, used three different measures for determining the quality of their results: Normalized Discounted Cumulative Gain at K (NDCG@K), Precision at K (P@K) and Mean Average Precision (MAP). For these measures, K is the number of top results over which the measure is calculated; in the case of DeepRank the values of K used were 1, 3, 5 and 10. MAP is not applicable in this case, since it is an average over queries and the query will not be varied. Precision at K is not well suited for non-binary relevance, and in this work relevance is scored on an integer scale from 1 to 5; hence only NDCG will be used. The paper on the Deep Relevance Matching Model by Guo et al. [31] also uses these three metrics, although with K = 20. NDCG evaluates the quality of the ranking of a list of documents.

Precision at K is simply the number of relevant documents amongst the top K ranked documents, divided by the number of documents, K:

$$P@K = \frac{\text{Number of Relevant Documents}}{K} \tag{2.13}$$

Discounted Cumulative Gain is similar to Precision at K. The difference is that DCG@K takes into account the position at which a relevant document is found and is more compatible with non-binary relevance. The Normalized Discounted Cumulative Gain simply divides the Discounted Cumulative Gain by its highest achievable value.

$$DCG@K = \sum_{i=1}^{K} \frac{\text{relevance}(i)}{\log_2(i + 1)} \tag{2.14}$$

Now if we compute the ideal DCG@K score and call it IDCG@K, then we get NDCG@K simply as:

$$NDCG@K = \frac{DCG@K}{IDCG@K} \tag{2.15}$$

To measure the zone labeling performance, a simple accuracy measure will be used:

$$\text{Accuracy} = \frac{\text{Correctly Labeled Zones}}{\text{Total Zone Labels}} \tag{2.16}$$
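As a concrete illustration of equations 2.14 and 2.15, NDCG@K can be computed in a few lines of Python; relevances is assumed to be the list of relevance labels of the documents in ranked order:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the top k results (eq. 2.14)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG@K: DCG of the ranking divided by the ideal DCG (eq. 2.15)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels (1-5) of documents, in the order the model ranked them:
print(ndcg_at_k([5, 3, 4, 1, 2], k=5))
```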

Chapter 3

Method

This chapter goes into depth about how all parts of the project were carried out including collection & labeling of data, construction & application of models and evaluation of the models.

3.1 Data Collection

3.1.1 Web Scraping

Web scraping was a central component in constructing the data set for this project, as well as in the function of the end product. A web scraper was constructed using the Python framework Scrapy [35] in conjunction with the Python library Newspaper3k [36]. Scrapy was used to control the crawling of the web pages and the extraction of the HTML, while Newspaper3k was used for parsing the scraped HTML into article title and main text.
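A sketch of how the two libraries can be combined: Scrapy handles crawling and fetching, and the fetched HTML is handed to Newspaper3k's Article class for parsing. The helper below is a simplified, hypothetical illustration, not the project's actual code:

```python
from newspaper import Article  # Newspaper3k

def parse_article(url, html):
    """Parse the title and main text out of HTML already fetched by Scrapy."""
    article = Article(url)
    article.download(input_html=html)  # use the given HTML instead of re-fetching
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}
```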

3.1.2 SSC News Summaries & Fundamental Data Set

The main beneficiary of the project at SSC has in the past manually created news summaries for sharing with colleagues on a weekly to monthly basis. These PDF files contain links to news articles that are of interest to SSC. With only a few exceptions these articles are in English; after removing any articles not in English, there are 320 articles in total in these summaries. This collection of articles was used as a starting point for the project and for the construction of the main data set.

Almost all of the articles in the summaries had URL links provided, allowing the contents of the articles to be scraped using the constructed web scraper. These links were compiled manually into a list, and each article was then scraped, parsed and saved to file using the web scraper. In total, 292 of the 320 articles were successfully scraped and put into a fundamental data set. The articles in this data set have the labels given to them in the news summaries, such as which geographical zone an article belongs to, but are unlabeled relevance-wise, besides being implicitly labeled as relevant by appearing in a summary.

3.1.3 Daily Web Scraping & Main Data Set

To construct the main data set, which was used to train the models in this work, new articles were collected daily by scraping any new articles that appeared on a selection of web pages over the course of about a month, more specifically from 2020-02-05 to 2020-03-05. The set of pages scraped was initially all of the domains on which the articles in the fundamental data set were published. During the course of the data collection, additions to and revisions of the list of domains were made in cooperation with SSC, according to their expertise and preferences. All of the scraped domains were in English.

To detect which articles are new on a page, the scraper saves, for each domain, all URLs it has previously seen. Each day when the domain is scraped, only unseen URLs are scraped, based on the assumptions that the links (URLs) that vary on a typical news site are the article links, and that URLs are not reused. To summarize the scraping process: the scraper creates 32 simultaneous spiders (threads), each responsible for crawling one domain. Once that domain has been crawled to a depth of 1, meaning only links on the main page are followed and not links on subsequent pages, the spider terminates and a new spider is created and given a new domain to crawl, as long as there are uncrawled domains left. The first time a new domain is crawled no articles are saved; the scraper simply maps the domain by marking all the found links as seen. The following day the domain is scraped as normal, and the contents of the pages behind any new links that have appeared are scraped and saved as published on that day.

The scraper was only run on weekdays, meaning that any article that appeared during a weekend was marked as published the following Monday. By not scraping all domains during weekends, articles published on those days run a higher risk of not being scraped, since a link to an article can appear on and be removed from the main page within that time. This is in general an inherent risk of this scraping approach, especially for pages that publish a great number of articles every day. However, since multiple news sources tend to cover the same news and since this approach is very versatile, the trade-off was deemed acceptable.

During the data collection, about a thousand articles were scraped every day, a majority of which were not related to the space industry and an even larger majority not relevant to SSC. To pick out articles to be labeled and put into the main data set, a rudimentary TF-IDF model was constructed using the fundamental data set. A TF-IDF vector was then created by taking the average TF-IDF vector of all articles in the fundamental data set, representing an average article in an SSC news summary. By calculating the cosine similarity between this average vector and all scraped documents, it became possible to crudely sort them by relevance. The main data set was then constructed by saving the top 20 articles, sorted by this crude relevance, scraped during each of 17 days, the earliest being the 5th of February and the latest the 5th of March. This yielded 340 articles for manual relevance and zone tag labeling. Since there is a bias towards higher relevance in the top, another 20 articles from rank 50 downwards were included from 3 different days to balance the data set. In the end the data set consisted of 389 articles; 11 articles were excluded due to being duplicates or having otherwise failed to scrape properly.

In this work, relevance was measured on an integer scale from 1 to 5. At the time of training it was discovered that the data set contained too many articles with a relevance of 1 (low relevance). For the purpose of training specifically GPT-2, there also turned out to be too few articles with a relevance of 5 (high relevance). To remedy this, half of the items with a relevance of 1 were randomly removed from the data set used. Furthermore, during training of the GPT-2 based model, relevance 5 items were duplicated, i.e. shown twice per epoch.

Label            Unchunked Data   Chunked Data
Relevance 1      47               843
Relevance 2      92               1043
Relevance 3      127              1361
Relevance 4      54               639
Relevance 5      23               259*
Zone Tag: EMEA   91               1171
Zone Tag: APAC   113              1314
Zone Tag: AMER   162              1982
Total            343              4145

Table 3.1: Distribution of labels in the main data set as it was used in training. *Doubled during training of GPT-2.

3.1.4 Labeling of Main Data Set

To label relevance, a simple integer scale from 1 to 5 was used, where 1 represents articles of lowest possible relevance and 5 the maximum. The geographical zone category of each data point was labeled by assigning it any combination of three different labels (EMEA, APAC and AMER), including none of them. The labeling was done manually by 2 employees at SSC, both of whom received the same instructions for how to label. The instructions (definitions) for the different relevance and zone categories are given in table 3.2.

Label Category   Definition of label category
Relevance 1      Completely irrelevant, the article does not pertain to space in any way
Relevance 2      Mostly irrelevant, the article is related to space but irrelevant to SSC
Relevance 3      Relevant, the article is related to or about events in the space industry
Relevance 4      Very relevant, the article is relevant and has a clear connection to SSC
Relevance 5      Critical, the article is of maximum relevance since it is closely connected to or otherwise important to SSC
Zone EMEA        Actors or activities mentioned are based in Europe, the Middle East or Africa
Zone APAC        Actors or activities mentioned are based in the Asia-Pacific region
Zone AMER        Actors or activities mentioned are based in North, Central or South America

Table 3.2: Definitions of the different label categories used.

3.2 TF-IDF Based Model

To construct Term Frequency-Inverse Document Frequency (TF-IDF) models, the Python library Scikit-learn was used [37], more specifically the TfidfVectorizer class, which converts raw documents into TF-IDF features. N-grams of up to length 2 (bigrams) were allowed as features, and the default list of English stop words was used to filter out common non-informative words.
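The vectorizer configuration described above amounts to something like the following, where corpus_texts is an assumed list of raw article strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram and bigram features, filtering out common English stop words.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus_texts)  # sparse TF-IDF matrix
```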

Since TF-IDF is a vector space model, the first step to classification and relevance scoring is vectorizing the document (news article) to be classified and scored. Once the document is represented as a vector of TF-IDF features, it is compared to four different TF-IDF vectors which the model has learnt; cosine similarity was used to compare vectors. The four vectors are the relevance vector and the three zone classification vectors (EMEA, APAC and AMER), where similarity to the relevance vector is used as the relevance score of the document and the similarities to the zone classification vectors are used for zone classification. In other words, the scoring and classification of a document is based on its TF-IDF feature similarity to learnt document representations.

3.2.1 Training

The training process of the TF-IDF model needs to do three things: find TF-IDF features, calculate the inverse document frequencies, and construct four TF-IDF vectors, one for relevance scoring and three for zone classification.

All of the text from the documents in the main data set was passed to the TfidfVectorizer class, which extracted TF-IDF features and calculated the inverse document frequencies; it also calculated the TF-IDF representations of all documents in the data set using these features. To calculate the four vectors, the average TF-IDF vector, $V_a$, of all documents is first computed.

With:

$N$ = number of documents in the main data set
$d_i$ = TF-IDF vector of a document with relevance $i$
$D$ = the set of all documents $d_i$ in the main data set

$$V_a = \frac{1}{N} \sum_{d_i \in D} d_i \tag{3.1}$$

The relevance vector, $V_r$, is then formed as a weighted average of the TF-IDF vectors of all documents with relevance 3 or higher. The average vector is also subtracted to increase orthogonality to the other vectors.

$N_i$ = number of documents with a relevance of $i$
$\max(V, 0)$ = vector formed by taking the element-wise max of vector $V$ and zero

$$V_r = \max\left(\frac{1}{N_3 + 2N_4 + 3N_5} \sum_{d_i \in D,\; i > 2} (i - 2)\, d_i - V_a,\; 0\right) \tag{3.2}$$

The zone vectors are formed in the same manner, except for the weighting scheme.

$d_z$ = TF-IDF vector of a document labeled with zone $z \in \{\text{EMEA}, \text{APAC}, \text{AMER}\}$
$N_z$ = number of documents labeled with zone $z$ in $D$

$$V_z = \max\left(\frac{1}{N_z} \sum_{d_z \in D} d_z - V_a,\; 0\right) \tag{3.3}$$

Finally, the vectors $V_r$, $V_{\text{EMEA}}$, $V_{\text{APAC}}$ and $V_{\text{AMER}}$ were all normalized to a length of 1 using the L2 norm.

3.2.2 Scoring and Classification

As mentioned previously, the relevance score of a document according to the TF-IDF model is simply its cosine similarity to the learnt relevance vector $V_r$, which, because the vectors are normalized, is simply a dot product between them. With $d$ the TF-IDF feature vector of a text document:

$$\text{Document } d\text{'s TF-IDF Relevance} = V_r \cdot d \tag{3.4}$$

When classifying which zone labels a document should have, the individual zone scores are first calculated in the same manner as the relevance score, by a dot product with the corresponding zone vector.

$$\text{Document } d\text{'s TF-IDF zone scores} = [V_{\text{EMEA}} \cdot d,\; V_{\text{APAC}} \cdot d,\; V_{\text{AMER}} \cdot d] \tag{3.5}$$

The zone scores are normalized so that they sum to 1.0 (using the L1 norm), and a threshold of 0.28 is then used to decide whether the document should have each label or not. Each zone with a score above 0.28 is assigned as a label to the document. The threshold was tuned based on the average number of labels per article in the training data, so that the model assigns on average an equal number of labels, maximizing the chance of high labeling accuracy.

3.3 GPT-2 Based Model

The transformer model approach was based on the GPT-2 architecture by Radford et al. [6] and used the original code and models provided alongside their paper. More specifically, the smallest version of GPT-2, with 124M parameters, was used.

3.3.1 Adaptation

To adapt the architecture for relevance scoring and zone classification, the activations from the last transformer block (GPT-2 Small has 12 blocks) were used and everything after that point in the model was removed. The activations were fed into two new classifiers, one for relevance scoring and one for zone classification. The masking of attention weights was removed, since each batch during training now contains different samples and not different maskings of the same chunk of text.

GPT-2 can take as input samples of length at most 1024 tokens, and for dimensionality reasons all samples in a batch must be of the same length. Because of this, and for performance reasons, articles were split into chunks of 96 tokens instead of being used in their entirety, since using whole articles would require all texts to be padded to the length of the longest text in their batch. Another reason for chunking the data in this way is to let the model "read" articles about one or two paragraphs at a time (96 tokens roughly equals 96 words), which allows for a more interesting relevance measure, see the Scoring and Classification section (3.3.6). Splitting the data this way also allows the model to train on sections of articles at a time, which might help convergence during training. Whenever the end of an article was reached, or an article was too short for a full chunk of 96 tokens to be made, the chunk was padded using spaces (" "), based on the assumption that empty space should not impact the model's interpretation of the text much. By using the same padding token each time, the model can potentially learn to ignore it.
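A sketch of the chunking step, assuming encode is the GPT-2 BPE encoder from the original GPT-2 code (the helper itself is illustrative):

```python
CHUNK_LEN = 96

def chunk_article(text, encode):
    """Split an article into 96-token chunks, padding the last chunk
    with space tokens so all chunks have the same length."""
    tokens = encode(text)
    pad_token = encode(" ")[0]  # the same padding token every time
    chunks = []
    for start in range(0, len(tokens), CHUNK_LEN):
        chunk = tokens[start:start + CHUNK_LEN]
        chunk += [pad_token] * (CHUNK_LEN - len(chunk))
        chunks.append(chunk)
    return chunks
```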

Once the text chunks in a batch have been encoded, 2 special tokens are appended to the end of each vector of tokens. These tokens were selected to be symbols very rarely used in news articles. The first token appended was "$$" (encoded as 13702), the zone classification token, and the second was "><" (encoded as 6927), the relevance scoring token. The tokens' purpose is to give the transformer specific tokens for which to relearn attention mappings, in order to make their activation vectors at the end of the network (after the last transformer block) usable by the zone and relevance classifiers. The activations for the zone classification token are passed to the zone classifier and the activations for the relevance token to the relevance classifier. When their errors are propagated back through the network this changes the attention mappings of those tokens; however, the error also propagates to other attention mappings, allowing the model to specialize further.

3.3.2 Relevance Classifier

The relevance classifier is a two-layer feed-forward network which takes the relevance token output vector as input and has 10 nodes in its hidden layer connected to a single output node. The 10 hidden nodes use a ReLU (rectified linear unit) activation function, and the sigmoid function is applied to the output node to produce a relevance score in the interval [0.0, 1.0].

$$\text{ReLU}(x) = \max(0, x), \quad \text{element-wise on vector } x \tag{3.6}$$

The relevance classifier uses a loss function inspired by previous work on neural relevance scoring [31, 32]. Each mini-batch of 20 chunks is randomly arranged, with the restriction that the documents in the first half of the batch are more relevant than their position-wise counterparts in the second half.

With $R_i$ the relevance label of the document chunk at position $i$ in the training batch:

$$R_i > R_{i+10}, \quad 1 \le i \le 10 \tag{3.7}$$

In the loss function, the chunks in the first half have their relevance boosted while the chunks in the second half have their relevance diminished. In practice this means that chunks from relevance 5 documents are always boosted, relevance 4 chunks are often boosted, relevance 1 chunks are never boosted, and so on. The loss function used to achieve this is a modified hinge-loss.

With $O_i$ the relevance classifier output for the chunk at position $i$ in a batch:

$$F(x) = 2x - 1 \tag{3.8}$$

$$\text{Total Relevance Loss} = \sum_{i=1}^{10} \max\left(0,\; 1 - F(O_i) + F(O_{i+10})\right) \tag{3.9}$$

The weight parameters of the relevance classifier were initialized using He initialization [38, 39] for the ReLU activated layer and Xavier initialization [40, 39] for the (sigmoid activated) output layer; the biases were initialized as zero-centered with a standard deviation of 0.01.

3.3.3 Zone Classifier

The zone classifier is simply a single feed-forward layer which takes the zone token output vector as input (dimension 1x1024) and linearly maps it to a vector of dimension 1x3, each element representing one zone. The sigmoid function is then applied element-wise to this vector of zone logits, mapping the values to the interval [0.0, 1.0].

The error propagated from this classifier comes from a hinge-loss between each score and its label. To use hinge-loss, each output is mapped from [0.0, 1.0] to [-1.0, 1.0]; the labels are given as -1 if the data point does not have the label and 1 if it does. The zone classification loss is then the sum of each zone's hinge-loss.

With:

$z$ = zone index, $c$ = a chunk of a document
$L_{cz}$ = zone $z$ label of chunk $c$: $-1$ if the chunk does not have the label, $1$ if it does
$O_{cz}$ = network output for zone $z$ and chunk $c$

$$F(x) = 2x - 1 \tag{3.10}$$

$$\text{Zone Classification Loss on } c = \sum_{z=1}^{3} \max\left(0,\; 1 - L_{cz} F(O_{cz})\right) \tag{3.11}$$

To get the total zone classification loss, we simply sum this loss over all 20 chunks in a batch.

The weight parameters of the zone classifier were initialized using Xavier initialization [40, 39] and the biases were initialized as zero-centered with a standard deviation of 0.01.

3.3.4 Data Set

Training of the GPT-2 based model was done using the same data set as for the TF-IDF model, i.e. the main data set with half of the relevance 1 items randomly removed. All documents in the data set were divided into chunks of 96 tokens, as encoded by the GPT-2 encoder. To put extra emphasis on the high relevance items and to balance the data set, all chunks from relevance 5 documents were doubled, i.e. shown twice per epoch.

3.3.5 Training

Training of the GPT-2 based model was done on one GPU, more specifically an NVIDIA GTX 980 graphics card with 4 GB of memory. Each model was trained for 20 epochs.

During the first half of the epochs a weight factor w_zcl, called the "zone classifier loss weight", is linearly increased from 0 to 1 and multiplied with the total zone classification loss. This allows the model to focus initially on learning relevance classification, our primary learning goal, and then slowly introduces the secondary goal of zone classification. During the second half of the epochs w_zcl is fixed at 1 and the model consolidates relevance scoring with zone classification. The complete loss function minimized during training was the average of the total relevance loss and the total zone classification loss, plus an L2 regularization loss applied to all unfrozen weights with λ = 0.005.

w = an unfrozen weight parameter
W = the set of all unfrozen weights
w_zcl = zone classifier loss weight
L_rel = total relevance loss
L_zcl = total zone classification loss

Total Loss = (L_rel + w_zcl L_zcl) / 2 + 0.005 × Σ_{w∈W} w²    (3.12)

To minimize this loss TensorFlow's implementation of the Adam optimizer [12] was used with parameters β1 = 0.9, β2 = 0.999 and learning rate 5e-5. All biases and weights in the classifiers were of course unfrozen (changed during training), but in addition to these all weights and biases of the attention mappings in all 12 transformer blocks were unfrozen. All other parameters of the model were left frozen, i.e. kept constant during gradient descent. Note that only weights were subject to regularization; the biases were not regularized.

Before each epoch of training the data points are randomly shuffled to reduce variance and overfitting. After shuffling, the data is sorted into batches that satisfy the inequality given in equation 3.7, i.e. so that the first half of each batch has a strictly higher relevance than the second half. The shuffling ensures that these pairings vary, but it is of course not truly random since relevance 5 items will always end up in the first half of a batch, relevance 1 items in the second half, and so on.

An early stopping heuristic was implemented which puts emphasis on relevance scoring over zone classification accuracy. Every time the heuristic reaches a new maximum during training the model is saved, and once training concludes the checkpoint which achieved the highest heuristic value is used to produce the NDCG and zone classification accuracy. Since calculating the NDCG and accuracy takes some time the heuristic is only computed twice per epoch: in the middle of each epoch and at its end.

Early Stopping Heuristic = NDCG + 0.1 × Accuracy (3.13)

During training, especially early on, it is possible for the model to get stuck in a local minimum. To avoid such outliers impacting the results an automated restart criterion was implemented: if the relevance loss stays within 2% of the same value for 20 consecutive batches, training is aborted, the model is reset and reinitialized, and training begins anew.
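A sketch of the checkpointing heuristic (equation 3.13) and the restart criterion as described above; the 0.1 weight and the 2%/20-batch thresholds are from the text, while the interpretation of "within 2% of each other" as a relative band and all names are assumptions:

def stopping_heuristic(ndcg, accuracy):
    """Equation 3.13: favor NDCG over zone classification accuracy."""
    return ndcg + 0.1 * accuracy

def should_restart(recent_losses, window=20, tolerance=0.02):
    """True if the last `window` relevance losses all lie within 2% of each other."""
    if len(recent_losses) < window:
        return False
    tail = recent_losses[-window:]
    return (max(tail) - min(tail)) <= tolerance * max(tail)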

3.3.6 Scoring & Classification

Once the GPT-2 based model is trained it can be used to assign a relevance score and three zone scores (EMEA, APAC, AMER) to a chunk of text. Since the model is trained on chunks of 96 tokens (using the GPT-2 encoder) it should not be applied to entire articles at once like the TF-IDF model. To score an article, its text is split into chunks of 96 tokens and the model is applied to each chunk separately. The total relevance score of the article is then the maximum relevance score over all chunks in the article, based on the assumption that an article is overall as relevant as its most relevant chunk of text. Intuitively this means that an article can be about something irrelevant as long as some part of it is very relevant to SSC, making it worth reading. The zone classification scores, on the other hand, are formed as the average over the individual scores of each chunk in the article.

This means that a geographical theme has to be present throughout the article to affect the zone classification score; a single mention of, for example, Washington will not greatly change the AMER zone score.
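A sketch of this article-level aggregation, max over chunk relevances and mean over chunk zone scores; the function name and data layout are assumptions for illustration:

def score_article(chunk_relevances, chunk_zone_scores):
    """chunk_relevances: list of floats; chunk_zone_scores: list of [EMEA, APAC, AMER]."""
    relevance = max(chunk_relevances)  # article is as relevant as its best chunk
    n = len(chunk_zone_scores)
    zone_scores = [sum(chunk[z] for chunk in chunk_zone_scores) / n for z in range(3)]
    return relevance, zone_scores

relevance, zones = score_article(
    [0.2, 0.9, 0.4],
    [[0.1, 0.3, 0.8], [0.2, 0.2, 0.9], [0.0, 0.4, 0.7]],
)  # relevance = 0.9, zones = per-zone averages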

In order to decide whether to assign an article a zone label when calculating the accuracy, the scores are normalized so that they sum to 1.0 (using the L1 norm). Each zone with a normalized score above a threshold of 0.28 is assigned as a label to the document. The threshold was tuned so that the model assigns on average the same number of labels per article as in the training data, maximizing the chance of a high labeling accuracy.
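A minimal sketch of this labeling rule; the 0.28 threshold is from the text, the function name is illustrative:

def assign_zone_labels(zone_scores, threshold=0.28):
    """L1-normalize the three zone scores and flag every zone above the threshold."""
    total = sum(zone_scores)
    normalized = [s / total for s in zone_scores]
    return [score > threshold for score in normalized]

labels = assign_zone_labels([0.7, 0.2, 0.6])  # -> [True, False, True]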

3.4 Ensemble Models

The ensemble models tested were designed to potentially produce a better relevance score than the TF-IDF and GPT-2 models applied separately. No attempt was made to create an ensemble model to improve zone classification.

3.4.1 Multiplicative Model

The idea behind the multiplicative model is simple: the total relevance score of an article according to this model is the product of the TF-IDF model's relevance score and the GPT-2 model's relevance score.

3.4.2 Re-ranking Model

The re-ranking model produces its relevance-sorted list of articles by first scoring and sorting all articles using the TF-IDF model. A fraction of the top-ranking documents (0.1, 0.2 or 0.5) is then re-sorted using the GPT-2 model before the NDCG score of the final ordering is calculated, as sketched below.
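A sketch of the re-ranking ensemble; `tfidf_score` and `gpt2_score` are placeholder callables standing in for the two trained models, not functions from the original code:

def rerank(articles, tfidf_score, gpt2_score, fraction=0.2):
    """Rank all articles by TF-IDF, then re-sort the top fraction by GPT-2 score."""
    ranked = sorted(articles, key=tfidf_score, reverse=True)
    cutoff = int(len(ranked) * fraction)
    top = sorted(ranked[:cutoff], key=gpt2_score, reverse=True)
    return top + ranked[cutoff:]

This keeps the expensive GPT-2 model off the bulk of the ranking while letting it refine the ordering where it matters most.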

3.5 Evaluation

This section describes in more detail how the results presented in the results chapter were produced and how the NDCG and classification accuracy were calculated.

3.5.1 N-Fold Cross-Validation

All results were produced using 5-fold cross-validation with the same 5 randomly generated folds of the main data set. During each of the 5 measurements one fold is set aside as the test set while the other 4 are seen by the models as training data, meaning that the evaluation metrics (NDCG and accuracy) are computed on unseen data from the same data set. To increase the sample size of the main evaluation of NDCG and accuracy, the 5-fold cross-validation was repeated three times on the same folds; a sketch of the protocol follows.
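A minimal sketch of this protocol using scikit-learn [37]; the random seed and the stand-in article array are illustrative, not the actual split used in this project:

import numpy as np
from sklearn.model_selection import KFold

articles = np.arange(340)  # stand-ins for the data points (5 folds of 68 articles)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(articles):
    # train on articles[train_idx], evaluate NDCG/accuracy on articles[test_idx]
    pass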

Label type      Fold 1  Fold 2  Fold 3  Fold 4  Fold 5
Relevance 1     12      9       10      8       8
Relevance 2     13      19      19      20      19
Relevance 3     23      28      24      25      27
Relevance 4     14      10      11      10      8
Relevance 5     6       2       4       5       6
Zone EMEA       20      21      15      16      18
Zone APAC       23      24      22      22      22
Zone AMER       34      37      29      34      28
Total Articles  68      68      68      68      68

Table 3.3: The distribution of labels in the data folds used.

3.5.2 Normalized Discounted Cumulative Gain

The normalized discounted cumulative gain (NDCG) was calculated by first computing the discounted cumulative gain (DCG) of the model on the current test fold of the data. This is done by calculating the model's relevance score for each article in the test set and sorting the articles in descending order. The DCG is then calculated according to equation 2.14 with K equal to the number of data points in the fold, i.e. by summing the relevance labels of the articles weighted logarithmically by their rank in the sorted list. The articles in the test fold are then sorted according to their actual relevance labels and the DCG of this ordering calculated; this is the ideal discounted cumulative gain (IDCG), since it represents a perfect sorting. From the DCG and IDCG we get the NDCG as their quotient, equation 2.15.

3.5.3 Classification Accuracy

The classification accuracy is only calculated on data points that are labeled with at least one zone tag. Data points without any zone tags were included in training but excluded when calculating the accuracy measure, since the lack of tags was usually due to uncertainty on behalf of the labeler. We include data points with no zone tags in training to reduce the certainty of the model on similar data points, but during accuracy testing we are only interested in articles with a labeled geographical connection.

3.5.4 Resource Demands

In order to compare the resource demands of the different approaches two benchmarks were considered:

• The time required to train the model

• The time required to run the model on 1 week worth of scraped articles

The time was measured using the Python standard library's time module, more specifically the function time.time(), which measures elapsed wall-clock time. Time was measured in this way since we are interested in the time it actually takes for a user to run the program.
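A minimal sketch of how such a timing could be taken with the standard library; the placeholder comment stands for the training or scoring call being measured:

import time

start = time.time()
# ... train the model or score one week of articles here ...
elapsed = time.time() - start  # elapsed wall-clock seconds
print(f"Elapsed: {elapsed:.1f} s")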

The week of articles used for the benchmark ran from 2020-05-11 to 2020-05-15, i.e. containing articles from 2020-05-09 to 2020-05-15, since articles published during weekends are scraped the following Monday. This week contained 7970 scraped articles.

3.6 Statistical Significance & Null Hypothesis

To calculate some level of statistical significance a null hypothesis is needed. For the purpose of determining whether GPT-2 outperforms TF-IDF a good null hypothesis is that they perform equally well, i.e. that the results are drawn from the same normal distribution. The statistical significance of the results was then calculated by an unpaired two-sample t-test using unequal variance

for the compared distributions [41]. This test gives us two measures of significance: the t-score and the p-value. The t-score is the ratio of the difference between the two distributions to their internal variance, indicating how different the groups are. The p-value measures the likelihood of the results occurring by chance. To calculate the t-score and p-value we need the sample size of both distributions. For GPT-2 this is the number of data points (trained models). Since training TF-IDF is not a stochastic process the correct sample size is less obvious; logically it would be either 5 or arbitrarily large, since we have 5 separate results (one for each fold) which can be reproduced any number of times. To solve this problem the sample size for TF-IDF was set to be the same as for GPT-2.
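A sketch of this unequal-variance (Welch's) t-test using scipy; the score arrays below are dummy values, not the thesis data:

from scipy import stats

gpt2_ndcg = [0.86, 0.84, 0.88, 0.83, 0.87]   # illustrative per-run NDCG values
tfidf_ndcg = [0.85, 0.83, 0.85, 0.84, 0.85]
t_score, p_value = stats.ttest_ind(gpt2_ndcg, tfidf_ndcg, equal_var=False)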

Chapter 4

Results

In this chapter the results produced are presented in both figures and text.

4.1 Relevance Scoring

Figure 4.1: NDCG achieved by all the different models, K=68


As can just about be seen in figure 4.1, the model that achieved the highest average NDCG score was the ensemble model RR 50%, i.e. the re-ranking model which re-ranks the top 50% of articles, as ranked by the TF-IDF model, using the GPT-2 model. It outperforms the purely GPT-2 based model by a very small margin (about 0.0007305) but suffers from a larger standard deviation. The models re-ranking a smaller proportion (RR 20% and RR 10%) performed worse than GPT-2, followed by Mult, the multiplicative ensemble model. The model using only TF-IDF achieved the lowest average, 0.74 percentage points lower than GPT-2.

The results in figure 4.1 are compiled from 3 separate training attempts on the same 5 folds using cross-validation. The results on each fold from all three runs are presented as colored crosses. We can see that the green (fold 2) and blue (fold 3) folds were the hardest for all models and that red (fold 1) and magenta (fold 4) were the easiest.

4.2 Classification Accuracy

Figure 4.2: Zone Classification Accuracy of TF-IDF and GPT-2 models

The results in figure 4.2 were produced by the same models as those that produced the results in figure 4.1.

The average classification accuracies of TF-IDF and GPT-2 are very similar, with GPT-2's average being slightly higher (by 0.8482 percentage points), however with a greater standard deviation (1.31 percentage points larger). Looking at the results within the individual folds, TF-IDF performs more evenly (across all folds) and yields the same results each time it is retrained due to its non-stochastic training. GPT-2, meanwhile, achieves very high results (80%+) at least once for each fold, as well as very low results for most folds, especially yellow (fold 5) and blue (fold 3).

4.3 NDCG vs Accuracy

Figure 4.3: NDCG vs Zone Classification Accuracy for each trained model

In figure 4.3 the same data as presented separately in figures 4.1 and 4.2 is plotted, but this time the achieved NDCG of each model is plotted against its classification accuracy. Only the TF-IDF and GPT-2 models are shown; the ensemble models are omitted for clarity and because they are merely combinations of these models.

As can be seen in figure 4.3 there is no clear trade-off between achieving a high NDCG and a high classification accuracy. The covariance between NDCG and classification accuracy for GPT-2 was −3.17e−5 and for TF-IDF 7.97e−5; covariances this close to zero imply very low correlation (the corresponding correlation coefficients, which are bounded to [-1, 1], are close to zero). For the TF-IDF models this is expected, since the way the model handles each task has no impact on the other. For GPT-2, however, training to reduce the zone classification loss could come at the cost of a higher relevance loss and vice versa. For fold 1 (red) we see that both the TF-IDF model and the GPT-2 models perform similarly well. For fold 2 (green) the TF-IDF model performs considerably worse than the GPT-2 models, both in NDCG and accuracy. For fold 3 (blue) both models seem to struggle to reach a high NDCG; the GPT-2 models manage to achieve a higher NDCG and accuracy than the TF-IDF model, but not at the same time. For fold 4 (magenta) one GPT-2 model barely manages to get a higher NDCG score than the TF-IDF model while also having a higher classification accuracy, giving GPT-2 a slight edge. For fold 5 two out of three GPT-2 models perform better than the TF-IDF model, again giving GPT-2 the edge.

4.4 NDCG@10

Figure 4.4: NDCG at 10 achieved by all the different models, K = 10

The results presented in figure 4.4 were generated by a 5-fold cross validation run separate from all other results but using the same folds.

In figure 4.4 it seems that, at improving the NDCG amongst the top results, the re-ranking ensemble models perform better than the purely TF-IDF and GPT-2 based models. The multiplicative ensemble model still achieves mediocre results. Overall the NDCG@10, i.e. amongst the top results, is lower than the NDCG over the entire test fold as it was calculated in figure 4.1. Judging by figure 4.4 the best performing model for producing a high NDCG@10 was RR 20%, the model that re-ranks the top 20% of results, having both the highest average score and the lowest standard deviation.

4.5 T-Test

NDCG
The t-score between the NDCG of the TF-IDF and the purely GPT-2 based model is 1.35 and the p-value is 0.19. Sample size = 15.

Classification Accuracy
The t-score between the classification accuracy of the TF-IDF and the purely GPT-2 based model is 0.48 and the p-value is 0.64. Sample size = 15.

NDCG@10, RR 20%
The sample size for the NDCG@10 test was only 5, so these results have a high uncertainty. The t-score between the NDCG@10 of the TF-IDF model and the Re-rank 20% ensemble model is 1.24 and the p-value is 0.26.

4.6 Resource Demands

Training of the GPT-2 based model was done on GPU, more specifically an NVIDIA GTX 980 graphics card with 4 GB of memory. Training of the TF-IDF model used an Intel Core i7-6600U CPU (2.60 GHz base clock, boosting to 2.81 GHz).

            TF-IDF model    GPT-2 model
Training    6 ± 1 s         2400 ± 120 s = 40 ± 2 min
Scoring     22 ± 1 s        720 ± 60 s = 12 ± 1 min

Table 4.1: The training and scoring times for the two models.

The time for training the models was measured during the production of the results presented in this chapter, and the scoring time was measured over 10 repetitions of the same scoring. The error margins were placed so as to encompass all measured times.

According to table 4.1, training a TF-IDF model is very fast on modern CPUs, taking about 6 seconds, compared to training the GPT-2 based model, which took on average about 40 minutes on a graphics card that was six years old at the time of the experiment. The TF-IDF model took longer to score articles than to train, while the GPT-2 model took longer to train than to score articles. Overall, both training and scoring were faster with TF-IDF.

Chapter 5

Discussion

In this chapter the implications of the results are discussed, as well as the ethical, sustainability and societal considerations of the project.

5.1 Model Performance

At first glance the results seem to indicate that a GPT-2 based model can outperform a TF-IDF based model even when trained on relatively small amounts of data. For relevance scoring (figure 4.1) this is clearer than for zone classification accuracy (figure 4.2), where the improvement was very small, especially compared to the standard deviation. However, looking at the p-value for the relevance results (NDCG) we see that the statistical significance is only about 20%, meaning that this indication is very weak. The p-value (0.64) for the classification accuracy is even worse and lends no credibility to the small difference between the models, which is also indicated by the t-score (0.48).

Looking at figure 4.3 it could be argued that further work on training stability could remedy this, since GPT-2 manages to outperform TF-IDF on 3 out of 5 folds and performs similarly well on the other 2 folds. The covariance between NDCG and classification accuracy indicates that there is no trade-off between them. If training stability were improved it might be possible for the GPT-2 model to achieve a better classification accuracy without reducing the NDCG, which might improve results. Regardless of training stability, with differences this small the sample size needs to be increased; for example, increasing the sample size from 15 to about 35 would be necessary to give a p-value below 5%.


Comparing these results with those of DeepRank [32] and DRMM [31] is difficult, since the data sets they used were likely a lot harder: they used non-fixed queries, meaning their data sets likely contain far lower concentrations of relevant documents.

5.2 Model Robustness

Consistency is important when choosing a model, and according to the results in this study the model with the most even relevance scoring performance was the purely GPT-2 based approach. When it comes to zone classification accuracy, however, the TF-IDF model was drastically more robust. The great variation in the zone classification accuracies achieved by GPT-2 is likely due to the early stopping heuristic, which was designed to put emphasis on achieving a high NDCG score over a high classification accuracy. Furthermore, since relevance scoring is the main objective of the model, the training loss was set up in such a way as to ensure that the model learnt relevance scoring over zone classification. Looking at performance on individual folds in figure 4.3 we see that the training of GPT-2 likely partially failed for two models, one for fold 3 (blue) and one for fold 5 (yellow), since these achieved both low classification accuracies and low NDCG scores.

5.3 Ensemble Models

When calculating the NDCG over the entire test fold (figure 4.1) the ensemble models all performed better than the TF-IDF model but worse than the GPT-2 model, with the exception of RR 50%, which achieved a slightly higher average score than GPT-2, although with a higher standard deviation. No considerable improvement was achieved by the ensemble models with respect to NDCG over the entire test fold (K=68). Overall their performance should not come as a surprise, since they can be seen as "averaging" between the performance of TF-IDF and GPT-2 when one of them performs poorly.

Looking at the NDCG over the entire ranked list is interesting as it tells us how well the models learn to separate articles of varying relevance, but it is not always indicative of how well they will perform in a real-world application. The system developed in this project will sort through more than 1000 articles every day, and a person might reasonably be expected to look through only the top 20 to 100 articles. It is therefore more important to have a high NDCG amongst the top results than throughout the entire ranking. For this reason the NDCG@10 of the model types was also calculated, although with a third of the sample size due to time constraints. Looking at these results (figure 4.4) we see that all ensemble models, except for the multiplicative one, achieved a higher average NDCG@10 than both TF-IDF and GPT-2. It would seem, then, that the ensemble models sacrifice sorting quality throughout the ranking in order to achieve a higher quality ranking amongst the top results. The Re-rank 20% (RR 20%) model achieved especially good results, having the highest average NDCG@10 and the lowest standard deviation. To give some indication of the reliability of these results a t-test was made of the NDCG@10 for RR 20%. The t-score (1.24) and p-value (0.26) were both worse than for the main NDCG (K = 68) t-test, and since the sample size is only 5 the uncertainty is even higher. Because of this it can only be said that there is the mildest indication that RR 20% performs better than TF-IDF with regard to NDCG@10.

Worth noting is that the heuristic used for early stopping during training still used the NDCG over the entire ranking (K = 68), which is likely why the results are overall lower than in figure 4.1. A model trained for practical application should probably use a smaller K than 68 (the entire test fold) for the stopping heuristic.

5.4 Resource Demands

According to NVIDIA, the graphics card used in this project (GTX 980) has a CUDA compute capability of 5.2 while an equivalent card today (RTX 2080) has 7.5; unfortunately this is just an abstract measure created by NVIDIA and is more akin to a version number than a comparable metric [42]. A better metric for comparison is probably the number of CUDA cores, in which case an equivalent card today has about 40% more, and a card more dedicated to running and training neural networks (RTX Titan) has about 125% more CUDA cores [43]. It is therefore quite safe to say that the times presented here for training and for scoring articles using GPT-2 (smallest version) are higher than what most would experience using a more current GPU.

The TF-IDF model takes longer to score articles than to train, but this is mostly because the model is trained on only 343 articles while the scoring was done on 7970 articles. More interestingly, while the GPT-2 model takes a lot longer to train, once trained it does not take an excessively long time to run on articles. Considering that the scenario tested was to create a summary of articles scraped during one week, running the model for about 12 minutes per week is reasonable. By employing the ensemble model RR 20% this time could be cut down by about a further 80%, since it only runs GPT-2 on 20% of the articles, and using a modern GPU on top of that would bring the time down to just a few minutes.

5.5 Ethical Considerations

When collecting large amounts of data automatically it is important to consider the process from an ethical perspective. It is important to obey the laws and conventions in place regarding data collection, such as the GDPR [44], as well as intellectual property rights and regulations against distributed denial-of-service (DDoS) attacks. Different laws apply in different countries and the Internet can still be a legal gray zone in many cases, but despite this anyone performing large scale data collection has an ethical responsibility to do so in the most sustainable and considerate manner possible. This can be done by following the existing conventions for web scraping, such as the robots.txt convention, where each website can publish a text file called "robots.txt" containing rules that automated web scrapers should obey while scraping the domain [45].

All web scraping done as part of this project respected the rules outlined in the robots.txt files of the scraped domains. This was achieved by using the Python library Scrapy's built-in robots.txt interpreter [35]. To avoid taxing websites the scraper also never sent more than one request to any domain at the same time.

While all text scraped from websites during this project was publicly available on the sites, care was taken never to compile sensitive or personal information from scraped text, such as names or emails, into lists which would make it more easily available than on the original site. All text was kept as it was scraped from the websites and saved locally on two different computers for the sole purpose of training the models.

5.6 Sustainability & Societal Considerations

As a growing proportion of the world's population comes online [46] the rate at which information is published on the Internet is sure to keep increasing. Compared to its infancy in the 90s the Internet now contains a large number of sites, even for a purpose as narrow as space industry news. Efficient information retrieval techniques such as search engines have played an important role in allowing the Internet to grow this much without becoming cumbersome to navigate. In the future, as more businesses digitise and increasing amounts of news are published online, it could become impossible to keep track of all relevant sites. Most of us already rely on news feeds that summarize articles published within specific fields, but often we cannot be sure how these feeds are curated. Methods for creating your own news feeds could become increasingly important and useful, as they allow you to curate the feed's bias yourself.

Intelligent and customizable news feeds that can read and understand articles could become an important tool for navigating the Internet as well as for combating misinformation and click baiting. They would allow the Internet to grow in a more sustainable manner, as the amount of information a robot curator can go through is only limited by connection speeds and computational power. Recent transformer language models such as GPT-2, and even more recently GPT-3 [47], show an increasingly deep understanding of language which could be leveraged to make these news feeds more intelligent.

Keeping the spread of information sustainable is ultimately important for all aspects of society. For some areas, such as healthcare, politics and the military, it is especially important, since the manipulation of information in those areas can be harmful in a very direct way. An intelligent robot curator could for example find and flag dangerous misinformation, such as medically unsound health trends, fake accounts sowing dissent and news articles containing false information.

Chapter 6

Conclusions

This chapter discusses the general conclusions drawn from the project with respect to the hypothesis and proposes areas for future work.

6.1 Employment of Transformer Language Models

The goal of the work in this report was to determine whether or not modern transformer language models, such as GPT-2, can be used for information retrieval and document classification. While the main goal was to, through transfer learning, have a version of GPT-2 perform document ranking, having it simultaneously perform document classification was also investigated. In short, one could say that the report investigates how modern transformer language models could be used as a catch-all data driven solution to handling text documents.

The work was carried out with the hypothesis that even the smallest version of GPT-2, trained on a small data set, would yield results competitive with a more standard approach to the task, in this case TF-IDF. The difference in performance between the models was very small, and the results show that GPT-2 achieves a level of performance very similar to that of the implemented TF-IDF approach. This could be seen as competitive performance were it not for the fact that TF-IDF is shown to be a considerably faster and computationally cheaper approach. Because of this the hypothesis is not supported by the findings in this work.


Even though the use of large neural network models requires more computational power, such as a dedicated graphics card or a cloud computing based solution, such options are becoming increasingly widely available. Their increased availability should allow more people with access to small annotated data sets to at least partially automate text classification or information retrieval tasks.

With access to more computational power it is also possible that using a larger version of GPT-2, such as the medium or large versions (section 2.2.1), would achieve better performance than TF-IDF, although this might also require a larger training data set. If the performance of a GPT-2 based model scales with the size of the model and data set, it could be a competitive alternative, provided that the performance of the TF-IDF model does not scale similarly.

The tests on ensemble models utilizing both the TF-IDF and GPT-2 based approaches achieved results very similar to using each approach individually. The results indicated, with poor statistical significance, that combining the models via a re-ranking of the top 20% of articles (as ranked by TF-IDF) using GPT-2 looked the most promising. If a GPT-2 based model were successfully trained to achieve significantly better results than a TF-IDF model, then using a re-ranking approach one could potentially benefit from the performance boost of the transformer language model at a lower computational cost.

Comparing how well the GPT-2 model performed at document ranking versus classification, it was concluded that while there was no clear trade-off between the two learning goals, the early stopping heuristic seems to have had a large influence on the classification accuracy, increasing its variance.

6.2 Future Work

Due to the limited computational power used during this project a proper optimization of all hyperparameters could not be performed, and values were largely based on previous work with some manual tweaking. If training times could be cut down it would be interesting to perform a proper automated optimization of the hyperparameters to see if training could be made to converge faster and be more stable.

This work merely investigated how well the different models were able to model the data set. It would be enlightening to investigate the different approaches in use, to see what qualitative differences there may be between recommended articles, since TF-IDF uses only word frequencies while GPT-2 could potentially learn to understand the text on a deeper level.

There is a limit to how many tasks a finitely sized neural network can perform well, but it is unclear whether this limit was reached in this work. It would be interesting to see how the performance of transformer language models scales as more tasks are added, and to see if and how much increasing the model size, for example from GPT-2 Small to GPT-2 Medium, affects this correlation.

Finally, due to the limited computational power available this work was limited to training the smallest version of GPT-2. While this version is shown to achieve performance similar to TF-IDF, the question remains whether simply upgrading the size of the model would increase performance.

Bibliography

[1] Space Foundation. The Space Report 2019, 4 Quarterly Reports: PDF download. https://www.thespacereport.org/register/the-space-report-2019-4-quarterly-reports-pdf-download/. 2019.
[2] Eng Teong See. "Commercialization of Space Activities - The Laws and Implications". In: J. Air L. & Com. 82 (2017), p. 145. HeinOnline.
[3] Mark Harris. "Tech giants race to build orbital internet [News]". In: IEEE Spectrum 55.6 (2018), pp. 10-11. IEEE.
[4] John Karcz et al. "Red Dragon: Low-cost access to the surface of Mars using commercial capabilities". In: (2012).
[5] Brian Dunbar. Moon to Mars Overview. https://www.nasa.gov/topics/moon-to-mars/overview. [Online; accessed 24-Apr-2020]. 2019.
[6] Alec Radford et al. "Language models are unsupervised multitask learners". In: OpenAI Blog 1.8 (2019), p. 9.
[7] Irwin B Levitan and Leonard K Kaczmarek. The Neuron: Cell and Molecular Biology. Oxford University Press, USA, 2015.
[8] Warren S McCulloch and Walter Pitts. "A logical calculus of the ideas immanent in nervous activity". In: The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115-133. Springer.
[9] Stephen Cole Kleene. Representation of events in nerve nets and finite automata. Tech. rep. RAND Project Air Force, Santa Monica, CA, 1951.
[10] Marvin Minsky and Seymour A Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 2017.
[11] B. Mehlig. Artificial Neural Networks. 2019. arXiv: 1901.05639 [cs.LG].


[12] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv: 1412.6980 [cs.LG].
[13] Ashish Vaswani et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998-6008. url: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
[14] Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-term Memory". In: Neural Computation 9 (Dec. 1997), pp. 1735-80.
[15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014. arXiv: 1409.0473 [cs.CL].
[16] Alec Radford et al. "Improving language understanding with unsupervised learning". In: Technical report, OpenAI (2018).
[17] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2020.
[18] Binbin Xu et al. Neural Language Model for Automated Classification of Electronic Medical Records at the Emergency Room. The Significant Benefit of Unsupervised Generative Pre-training. 2019. arXiv: 1909.01136 [cs.CL].
[19] Trapit Bansal, Rishikesh Jha, and Andrew McCallum. Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks. 2019. arXiv: 1911.03863 [cs.CL].
[20] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. arXiv: 1810.04805 [cs.CL].
[21] Thomas Wolf et al. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. 2019. arXiv: 1901.08149 [cs.CL].
[22] Amir Vakili Tahami and Azadeh Shakery. Enriching Conversation Context in Retrieval-based Chatbots. 2019. arXiv: 1911.02290 [cs.CL].
[23] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008. isbn: 978-0-521-86571-5. url: http://nlp.stanford.edu/IR-book/information-retrieval-book.html.
[24] Paul Sheridan and Mikael Onsjö. A hypergeometric test interpretation of a common tf-idf variant. 2020. arXiv: 2002.11844 [cs.IR].

[25] Joeran Beel et al. "Research-paper recommender systems: a literature survey". In: International Journal on Digital Libraries 17.4 (2016), pp. 305-338. issn: 1432-5012.
[26] Hans Peter Luhn. "A statistical approach to mechanized encoding and searching of literary information". In: IBM Journal of Research and Development 1.4 (1957), pp. 309-317. IBM.
[27] Juan Ramos et al. "Using tf-idf to determine word relevance in document queries". In: Proceedings of the First Instructional Conference on Machine Learning. Vol. 242. Piscataway, NJ. 2003, pp. 133-142.
[28] Yi-hong Lu and Yan Huang. "Document categorization with entropy based TF/IDF classifier". In: 2009 WRI Global Congress on Intelligent Systems. Vol. 4. IEEE. 2009, pp. 269-273.
[29] Amir Jalilifard et al. Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. 2020. arXiv: 2001.09896 [cs.IR].
[30] Zhengdong Lu and Hang Li. "A Deep Architecture for Matching Short Texts". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., 2013, pp. 1367-1375. url: http://papers.nips.cc/paper/5019-a-deep-architecture-for-matching-short-texts.pdf.
[31] Jiafeng Guo et al. "A Deep Relevance Matching Model for Ad-hoc Retrieval". In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management - CIKM '16 (2016). ACM Press.
[32] Liang Pang et al. "DeepRank". In: Proceedings of the 2017 ACM Conference on Information and Knowledge Management - CIKM '17 (2017). ACM Press.
[33] Yifan Qiao et al. Understanding the Behaviors of BERT in Ranking. 2019. arXiv: 1904.07531 [cs.IR].
[34] Rodrigo Nogueira and Kyunghyun Cho. Passage Re-ranking with BERT. 2019. arXiv: 1901.04085 [cs.IR].
[35] Scrapinghub. Scrapy. https://github.com/scrapy/scrapy. [Online; accessed 11-Feb-2020]. 2020.
[36] Lucas Ou-Yang. Newspaper3k: Article scraping & curation. https://github.com/codelucas/newspaper. [Online; accessed 11-Feb-2020]. 2019.

[37] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825-2830.
[38] Kaiming He et al. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026-1034.
[39] Siddharth Krishna Kumar. On weight initialization in deep neural networks. 2017. arXiv: 1704.08863 [cs.LG].
[40] Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks". In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 249-256.
[41] Graeme D. Ruxton. "The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test". In: Behavioral Ecology 17.4 (May 2006), pp. 688-690. issn: 1045-2249. eprint: https://academic.oup.com/beheco/article-pdf/17/4/688/17275561/ark016.pdf.
[42] NVIDIA. CUDA GPUs. https://developer.nvidia.com/cuda-gpus. [Online; accessed 22-May-2020]. 2020.
[43] Studio 1 Productions, Inc. NVidia Graphics Card Specification Chart. https://www.studio1productions.com/Articles/NVidia-GPU-Chart.htm. [Online; accessed 22-May-2020]. 2020.
[44] European Parliament and Council of the European Union. General Data Protection Regulation. https://gdpr-info.eu/. [Online; accessed 22-May-2020]. 2020.
[45] Google. Introduction to robots.txt. https://support.google.com/webmasters/answer/6062608?hl=en. [Online; accessed 22-May-2020]. 2020.
[46] Internet World Stats. Internet Growth Statistics. https://www.internetworldstats.com/emarketing.htm. [Online; accessed 29-May-2020]. 2020.
[47] Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL].

TRITA-EECS-EX-2020:547

www.kth.se