DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Prerequisites for Extracting Entity Relations from Swedish Texts

ERIK LENAS

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Author Erik Lenas

University KTH Royal Institute of Technology

Supervisor Anders Sjögren

Examiner Fadil Galjic

Bachelor Thesis Degree Project in Computer Engineering, First Cycle

Faculty Electrical Engineering and Computer Science

Partner Riksarkivet

Abstract

Natural language processing (NLP) is a vibrant area of research with many practical applications today, like sentiment analysis, text labeling, question answering and automatic text summarization. At the moment, research is mainly focused on the English language, although many other languages are trying to catch up. This work focuses on an area within NLP called information extraction, and more specifically on relation extraction, that is, extracting relations between entities in a text. What this work aims at is to use machine learning techniques to build a Swedish language processing pipeline with part-of-speech tagging, dependency parsing, named entity recognition and coreference resolution to use as a base for later relation extraction from archival texts. The obvious difficulty lies in the scarcity of Swedish annotated datasets. For example, no large enough Swedish dataset for coreference resolution exists today. An important part of this work, therefore, is to create a Swedish coreference solver using distantly supervised machine learning, which means creating a Swedish dataset by applying an English coreference solver to an unannotated bilingual corpus, using a word-aligner to translate this machine-annotated English dataset into a Swedish dataset, and then training a Swedish model on this dataset. Using AllenNLP's end-to-end coreference resolution model, both for creating the Swedish dataset and for training the Swedish model, this work achieves an F1-score of 0.5. For named entity recognition this work uses the Swedish BERT models released by the National Library of Sweden in February 2020 and achieves an overall F1-score of 0.95. To put all of these NLP models within a single Language Processing Pipeline, Spacy is used as a unifying framework.

Keywords

Machine Learning, Natural Language Processing, Relation Extraction, Named Entity Recognition, Coreference Resolution, BERT

Abstract (Swedish)

Natural Language Processing (NLP) is a large and active research area today with many practical applications such as sentiment analysis, text categorization, machine translation and automatic text summarization. Research is currently focused mostly on the English language, but many other language communities are trying to catch up. This work focuses on an area within NLP called information extraction, and more specifically relation extraction, that is, extracting relations between named entities in a text. What this work attempts to do is to use different machine learning techniques to create a Swedish Language Processing Pipeline consisting of part-of-speech tagging, dependency parsing, named entity recognition and coreference resolution. This pipeline is then intended to be used as a base for later relation extraction from Swedish archival material. The obvious difficulty with this lies in the scarcity of large, annotated Swedish datasets. For example, there is no sufficiently large Swedish dataset for coreference resolution. A large part of this work therefore consists of creating a Swedish coreference solver by implementing distantly supervised machine learning, which means using an English coreference solver on an unannotated English-Swedish corpus, then using a word-aligner to translate this machine-annotated English dataset into a Swedish one, and then training a Swedish coreference solver on this dataset. This work uses AllenNLP's end-to-end coreference solver, both for creating the Swedish dataset and for training the Swedish model, and achieves an F1-score of 0.5. For named entity recognition, this work uses the BERT models of the National Library of Sweden (Kungliga Biblioteket) as a base, and thereby achieves an F1-score of 0.95. Spacy is used as a unifying framework to gather all of these NLP components within a single pipeline.

Keywords

Machine Learning, Natural Language Processing, Relation Extraction, Named Entity Recognition, Coreference Resolution, BERT

Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Research Method
1.5 Scope
1.6 Disposition

2 Theoretical Background
2.1 Machine Learning
2.2 Natural Language Processing
2.3 Related Works

3 Method
3.1 Research Method
3.2 Relation Extraction Methods
3.3 Technological Methods

4 Training the Spacy Language Processing Pipeline
4.1 Datasets
4.2 Part-of-Speech Tagging
4.3 Dependency Parsing
4.4 Named Entity Recognition
4.5 Coreference Resolution

5 Results
5.1 Part-of-Speech Tagging
5.2 Dependency Parsing
5.3 Named Entity Recognition
5.4 Coreference Resolution

6 Discussion
6.1 Natural Language Processing on Swedish Texts
6.2 Discussion of the Results
6.3 Future Work

7 Appendix A - Datasets and Spacy Training Format

1 Introduction

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence (AI), linguistics and computer science that deals with the ability of computers to draw meaning from spoken or written language. The field has been around since the 1950s, with its ups and downs along the way. In the last 10-15 years remarkable progress has been made, mainly due to new techniques in the field of machine learning, coupled with the explosion of data made available in the last decade.

1.1 Background
The aim of this work is to use machine learning techniques to build a base model for extracting relations between entities in Swedish texts. An entity is some sort of noun phrase that corresponds to an entity in the world, like a person, an organization, a time expression or a building. An n-ary relation between entities is some sort of relation between n entities in a span of text, such as "Erik sold Downton Abbey in 1968", which would be a 3-ary relation between Erik, Downton Abbey (which would refer to an actual building in the world) and the year 1968 [12]. Before this relation can be extracted from an unseen text, several prerequisites need to be met. For example, we need to recognize and categorize the different entities present in the text, focusing on the types of entities that will take part in the relations we want to extract. But extracting entities isn't enough; we also need to extract grammatical and syntactical information about the text, as well as some kind of numerical representation of the meaning of words. Another thing that needs to be solved before extracting the relations is coreference clusters, that is, linking different noun phrases to each other, such as linking "he" to "Erik" and "it" to "computer" in "Erik has just bought a new computer. He is very happy with it." The project was done in collaboration with Riksarkivet (the Swedish National Archives), and the goal is to use this work's resulting model as a base when implementing relation extraction from archival descriptions. These relations can then be structured in a database, thereby adding search functionality to the archives.

1.2 Problem
To extract relations between entities in a text, information about the text first has to be gathered. For example, the relevant entities need to be extracted before relations between them can be extracted. But other information has to be extracted as well. For example, "Erik owns Fårö Herrgård" and "Erik works at Fårö Herrgård" signify two different relations even though the named entities (Erik, Fårö Herrgård) are the same. What information about this sentence, and the tokens comprising it, could aid in separating these two, and other kinds of, relations? What information is needed to extract the particular relation of interest, and not all relations with the same entity types? Another important condition is that this work is focused on Swedish texts. This presents other problems than if the work were focused on English texts. The Swedish datasets needed for supervised machine learning, if they exist at all, are much smaller than their English counterparts. Thus we arrive at the central problem underlying this work:

What are the prerequisites for extracting entity relations from Swedish texts, and how do you meet these prerequisites given that annotated datasets are scarce?

1.3 Purpose
The purpose of this work is to produce a model capable of extracting entity relations from archival descriptions. The intent is that the model, with task-specific adjustments, can be used for many different kinds of relation extraction tasks within an ongoing project at Riksarkivet to use machine learning techniques to make their archives more accessible.

1.4 Research Method
The research methods used for this project were literature studies of books on the subject, different types of web material, and research papers covering the latest developments. For the actual construction of models, observation combined with experimentation was used. In the case of machine learning this means training a model on only a part of the dataset and keeping the other part unseen for later validation and testing.

1.5 Scope
This work was limited to creating a base model for relation extraction. The actual relation extraction wasn't included in the work; instead, the purpose was to design a model that can serve as a base for many different types of relation extraction tasks.

1.6 Disposition
Chapter 2 describes the theory behind the work and explains the concepts and structures relevant to the task at hand. The chapter refrains from mathematical descriptions, since these would be too in-depth, and also because, when one actually builds models, one usually does it with a high-level API like Keras or Pytorch, or using an NLP framework like AllenNLP.

Chapter 3 describes the research method used in this work, the phases governing the work, and what was done in each phase and why. It starts with the problem formulation and describes how this problem formulation leads to several sub-questions that need to be answered for the work to reach its aim. The chapter then tries to answer these questions in order to come up with a strategy for the work.

Chapter 4 describes the frameworks and methods used to implement this strategy. The focus is on named entity recognition and coreference resolution, but the other parts, POS-tagging and dependency parsing, are also covered.

Chapter 5 presents the results of each step of the work and compares these results with the results of related work.

Chapter 6 discusses and evaluates the results presented in the previous chapter, and analyzes the different methods with regard to their strengths and weaknesses.

2 Theoretical Background

This chapter will explain the theoretical background for the thesis. The explanations offered will be introductory and aim at conceptual understanding, rather than detailed mathematical insight. The first part will cover machine learning in general, the basic building blocks of models, and the different ways to build models. Then the basics of Natural Language Processing will be covered, the general concepts and the techniques used when representing text numerically. The final part references important related works, and how they were used in this thesis.

2.1 Machine Learning
Machine Learning (ML) is a subsection of the more general field of AI, which is basically any system that behaves in a rational way. A system is said to be rational if it "does the 'right thing' given what it knows" [40]. By the "right thing" you can, of course, mean different things. But for a computer system it generally means coming up with the right answer or the right behaviour given its input, where the answer or behaviour is judged correct or incorrect according to some predefined measure. For a long time the paradigm technique for designing intelligent systems was rule-based AI, where you hard-coded the behaviour of the system via algorithms and rules. The problem with this approach is that it behaves badly on new data, data that is not covered by the rules. Here is where machine learning comes in. Machine learning models get data and answers as input, and output the rules that govern their behaviour on new, unseen data. The rules no longer have to be hard-coded, since the system comes up with them by itself [7]. There are many different types of machine learning, but this work will focus on three kinds: supervised machine learning, semi-supervised machine learning and distantly supervised machine learning.

2.1.1 Types of Machine Learning

Supervised Machine Learning
Supervised ML consists of teaching a system to map input data to known targets. The input data together with the targets are called annotations, or a dataset [8]. From these annotations, the system learns the correct parameters to map new unannotated data to the correct targets. The system can be a simple regression model or a complex neural network, and the parameters governing the behaviour of this system can range from a few thousand to billions.

Semi-Supervised Machine Learning
Large hand-annotated datasets require a considerable amount of work, and are often hard to come by, especially in the field of NLP in languages other than English. Semi-supervised machine learning tries to get around this by starting from a small seed of annotations and using a computer system to automatically generate a large dataset from this seed [13]. Imagine for instance that you want to extract all relations in a text consisting of a person, [PRS], buying a company, [ORG]. You can start with the relation [PRS] bought [ORG], and then bootstrap this relation by finding semantically close words to "bought", using different tenses and word orders, and then scanning an unannotated dataset for positive examples. You can then add these examples to your dataset and train your model with it.

Distantly Supervised Machine Learning
This is a technique where you use an already existing dataset to create a new dataset designed for your particular purpose [13]. This work used a model trained on a large annotated English coreference dataset to create a dataset for Swedish coreference resolution. This involved a number of different techniques, which are described in chapter 4.

Transfer Learning
In humans, transfer learning is ubiquitous. One skill is learned, say running, and then this skill is used when learning another skill, say playing football. This is what transfer learning is: using the experience gained from learning one skill when learning another [39].

2.1.2 Neural Networks
The fundamental building blocks of a neural network are layers, consisting of a number of mathematical units. These layers take multidimensional vectors, so-called tensors, as input and produce an output, also in the form of tensors. The layers are stacked on top of each other, and the output from one layer becomes the input to the next layer (see figure 1). Each layer has a set of parameters that, together with a mathematical function called an activation function, determines its output. By comparing the output of the final layer to the actual target, a network can adjust its parameters to get closer to the target. When the network has found the right parameters, that is, parameters that take it as close to the target as possible for all examples in the training set, the network is said to be trained [9]. Neural networks are a kind of machine learning, since adjusting the parameters that determine the output of each layer, and thereby also the final output, is a way of coming up with the rules determining the output on different inputs. The rules are written in the internal relationships of the parameters, and the network learns these relationships by itself by adjusting the parameters to get as close as possible to the correct output for each input.

Figure 1: Example of a simple neural net.
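To make this concrete, here is a minimal sketch of a stacked-layer network and one training step, written in PyTorch (one of the high-level APIs mentioned in the disposition); the layer sizes and the random data are invented for illustration:

```python
import torch
import torch.nn as nn

# Two layers stacked on top of each other: the output tensor of the first
# becomes the input tensor of the second.
model = nn.Sequential(
    nn.Linear(10, 32),  # layer with trainable parameters
    nn.ReLU(),          # activation function
    nn.Linear(32, 2),   # final layer, mapping to two output categories
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10)               # a batch of 8 input tensors
target = torch.randint(0, 2, (8,))   # the known targets

output = model(x)                    # forward pass through the layers
loss = loss_fn(output, target)       # compare final output to the target
loss.backward()                      # compute parameter adjustments
optimizer.step()                     # adjust the parameters towards the target
```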

Deep Neural Networks
Simply put, a deep neural network consists of many layers stacked on top of each other. That is why it's called a deep network.

Convolutional Neural Networks
A convolutional network is a network that, in at least one of its layers, uses a mathematical operation called convolution instead of matrix multiplication. There's no need to go into the details of this operation, but simplified it means that the layer feeds a more coarse-grained map of the input features to the next layer. Thereby it becomes possible to extract relations between larger parts of the input, rather than between single pixels if the input is an image [10]. Convolutional neural networks are the standard in image processing, but are also used extensively within NLP, notably by Spacy, which is one of the NLP libraries used in this thesis.

2.1.3 Model Evaluation
When training a neural network it is important to evaluate the model with each training iteration, and also after the training is done. This is done by splitting the dataset into three parts: one for training, one for evaluation during the training, and one for testing after the training is done. The evaluation set isn't used in the actual training, but is used to evaluate the results of the network after each iteration through the dataset. The test set is used to evaluate the network once all the training iterations are completed.

Overfitting
A neural network is said to suffer from overfitting when its parameters get so adjusted to the training data examples that it starts to perform worse on new data. To prevent overfitting, the training should stop when the model starts performing worse on the evaluation set. The reason that there is both an evaluation set and a test set is that if the model is trained for many iterations, the problem of overfitting can apply to the evaluation set as well [9].

Evaluation Scores
Machine learning is largely about categorizing data. When the model categorizes an input there are four possible results: false positive, true positive, false negative and true negative. A false positive is when it incorrectly decides that an input belongs to a certain category. A true positive is when it correctly decides that an input belongs to a certain category. A true negative is when it correctly decides that an input doesn't belong to a category, and a false negative is when the model incorrectly decides that an input doesn't belong to a certain category. Depending on the specific categorization task the model has to perform, different evaluation metrics are used that focus on these different types of results. Accuracy quantifies the proportion of correct categorizations, positive or negative. Precision quantifies the proportion of positive categorizations that actually belong to the category (see figure 2). Recall quantifies the number of true positives out of all actual positive examples. The F1-score is a score that balances both precision and recall [5].

Figure 2: Precision and recall visualized. The F1-score balances both.

If, for example, every input can be categorized, then accuracy would be the chosen measure, since it measures the percentage of correct categorizations. If, on the other hand, only a small part of the inputs can be categorized, and the rest of the inputs are said to belong to the "none-category", then the F1-score would be used as evaluation, since one wants to measure both how many true positives there are (out of the total number of guessed positives), and also how many of the actual positives the model guessed correctly. These different metrics will be treated further in chapter 5.
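As a small worked example of these metrics, the scores can be computed directly from the raw counts (a generic sketch, not code from this work):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts of true positives (tp),
    false positives (fp) and false negatives (fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 40 correctly extracted entities, 10 spurious ones, 20 missed.
print(precision_recall_f1(40, 10, 20))  # (0.8, 0.666..., 0.727...)
```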

2.2 Natural Language Processing
Natural language processing is a subfield of linguistics, computer science, information engineering and artificial intelligence. Its aim is to make computer programs, or models, that analyze large bodies of spoken or written language and extract meaning from them [28]. It's interesting to stop and think about this for a minute: How can a computer system, which only works with numbers, extract meaning from a text? That is, how is it possible to represent something as subjective as meaning with only numbers?

2.2.1 Word Vectors
The basic idea behind word vectors is to represent words in a multidimensional vector space, where words that are similar to one another are positioned close to each other in this space (see figure 4). The similarity of two different words can then be quantified by taking the dot-product of their respective vectors. A fundamental idea in linguistics is that you can represent the meaning of words by analyzing their distribution in texts. And if the distribution of words tells us something about their meaning, it can be represented by vectors, so-called word embeddings. Finding such word representations is an example of self-supervised learning. The model learns these representations by itself, just by studying the distribution of words in the text, rather than minimizing the loss from an annotated dataset. The basic intuition behind representing words as vectors is to study how often they co-occur, and set the vectors accordingly [14]. The simplest way to do this is to have an n-dimensional vector for each word, where n is the number of words in the text, and where each vector entry, for each source word, counts how many times the source word co-occurs with the target word, given a specific window size (the distance in words within which co-occurrence is assigned). An example of a co-occurrence matrix is shown in figure 3. The problem with this approach is that the vectors become very large and very sparse.
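The counting scheme behind such a co-occurrence matrix is simple enough to sketch in a few lines of Python (a toy illustration; the window size and the sentence are invented):

```python
from collections import Counter, defaultdict

def cooccurrence(tokens, window=2):
    """Count how often each source word co-occurs with each target word
    within `window` tokens to either side."""
    counts = defaultdict(Counter)
    for i, source in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[source][tokens[j]] += 1
    return counts

tokens = "erik bought a new computer and erik likes the computer".split()
print(cooccurrence(tokens)["computer"])
# Counter({'a': 1, 'new': 1, 'and': 1, 'erik': 1, 'likes': 1, 'the': 1})
```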

2.2.2 The word2vec Algorithm
The word2vec algorithm instead tries to create dense word vectors, with a dimension of 300 or less. The idea is to calculate the probability of the co-occurrence of a target word and other words, and then use these probabilities as an annotated dataset. The probabilities are then fed into a simple machine learning model for binary classification, and the trained weights (parameters) of the last layer are used as the word-vector representation for the target word [15]. The similarity between two words is then calculated by taking the dot-product of their vectors.

Figure 3: Co-occurrence vectors for four words in the Wikipedia corpus, showing six of the dimensions.

Figure 4: A two-dimensional projection of embeddings for some words and phrases, showing that words with similar meanings are nearby in space.

2.2.3 Contextual Word Representations
One obvious problem with the word2vec algorithm is that each word only has one representation. Take a word like "date": there is no way for a word2vec embedding to tell when the word is used as in "I went on a date last night" or as in "what date is it today?". This is a typical word with multiple meanings, but the difference can also be more subtle than that, like for "need" in "you need to see this movie" and "humans need oxygen to survive". Here you would want the representations for "need" to be similar, but not quite the same. Contextual word representations try to achieve this by encoding the context of the token, rather than having a single, global representation for each token. Now, these contextual embeddings can't just be a matrix representation of each context, since the possible contexts are endless. Instead, a large neural network with an architecture designed to take context into account is trained on a huge unlabeled corpus, such as a Wikipedia dump or something similar. This model is trained on a specific task, such as predicting the next word given the words before it, or predicting randomly masked words given the surrounding words. With this pretrained model in place, one can feed it some input sequence for a downstream task, say named entity recognition, and for each token (the next or the masked one) the final hidden layers of the pretrained model provide a contextual representation of the token in question [39]. This is an example of transfer learning. A model is trained on, for instance, detecting the masked word or sequence given the context, and then this model is used as a basis for other tasks, such as named entity recognition. No one really knows exactly what information, whether structural or semantic, is conveyed in these contextual representations, but they have been shown to outperform word2vec embeddings on almost all NLP tasks [34]. This work has tried three different pretrained models for contextual representation, namely ELMO, BERT and ALBERT.

ELMO is a model architecture that uses word2vec representations but concatenates these representations with representations of the context to the left and to the right of the token. It consists of a deep, bidirectional neural network that is trained to predict the next word given the previous ones. A weakness is that it doesn't encode any relationship between the left and the right context, even though the bidirectionality allows it to concatenate the right and the left context [34].

The BERT model was first published by Google in 2018 [34]. It utilises something called the transformer architecture, a certain neural network design that has an attention mechanism which places weights on surrounding contextual words based on their relevance. This architecture can capture left and right context, and it can also capture relationships between contextual words. Its aim is to predict masked words in an input sequence, and the final hidden states of this neural network can be used downstream as a contextual representation for the masked word [39].

ALBERT is a smaller, more condensed version of BERT that has fewer parameters, but performs almost as well as BERT on most NLP tasks [34].
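As a sketch of what "using the final hidden states" looks like in practice, the snippet below pulls contextual token representations from a pretrained BERT model through the Transformers library; the model id is, to the best of my knowledge, the Swedish BERT published by the National Library of Sweden, but any BERT-style checkpoint works the same way:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "KB/bert-base-swedish-cased"   # Swedish BERT from KB (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

inputs = tokenizer("Erik har precis köpt en ny dator.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape (batch, tokens, hidden size): one contextual vector per (sub)token,
# usable downstream for e.g. named entity recognition.
print(outputs.last_hidden_state.shape)
```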

2.2.4 Part-of-Speech Tagging
Part-of-speech tagging (POS-tagging) is the task of categorizing words in a text as belonging to a certain word-class, like a verb or an adjective. Since this is a categorization task it is ideal for a machine learning model, given that you have a large enough annotated dataset.

2.2.5 Dependency Parsing
Dependency parsing is the task of analyzing the grammatical relations between words in a sentence and constructing a tree structure from these relations, with one word as root. This is also a categorization task, where you label the parent of each word in the tree and then specify the syntactic relations between the different nodes [16]. The theory behind dependency parsing dives deep into the field of linguistics, so it will not be explained in detail here. The task is definitely important for this thesis though, since the quality of the dependency parse, together with the tagger, is crucial for something called mention extraction - extracting pronouns, named entities, noun phrases and prepositional phrases from a text - which in turn was used when creating the Swedish coreference resolution dataset.

2.2.6 Named Entity Recognition
The final goal of this thesis was to provide a Swedish language model for extracting relations between entities in a text, and thus extracting these entities, and categorizing them, is crucial. An entity type can be, for example, a person, an organization, an event, or a time or date expression. Labeling these entities in a text is also a categorization task, and can be achieved by training a neural network on an annotated dataset.

2.2.7 Coreference Resolution
Coreference resolution means extracting coreferring mentions from a text into so-called coreference clusters, and also identifying the main mention in each cluster [17]. Consider the following text:

"Victoria Chen, CFO of Megabucks Banking, saw her pay jump to 2.3 million, as the 38-year-old became the company’s president. It is widely known that she came to Megabucks from rival Lotsabucks."

One of the coreference clusters here would be [Victoria Chen, her, the 38-year-old, she], and the main mention Victoria Chen. Another would be [Megabucks Banking, the company, Megabucks], and the main mention Megabucks Banking. To extract these coreferring mentions is considered a hard task within NLP, since the coreferring mentions can differ so much, and because they can be very far apart in the text. It is an important subtask for relation extraction though, since you need to resolve the coreferring mentions to capture all the entity relations in a text. A significant part of the work done in this thesis has been focused on creating a Swedish coreference solver via distantly supervised machine learning, and in later chapters this work is explained in more detail.
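For later chapters it helps to see how such a clustering is typically represented in code, namely as token-index spans into the document; the spans below are hand-made for the example text, in the style of (but not produced by) AllenNLP's coreference output:

```python
# Each cluster is a list of (start, end) token spans, main mention first.
tokens = ["Victoria", "Chen", ",", "CFO", "of", "Megabucks", "Banking", ",",
          "saw", "her", "pay", "jump", "..."]
clusters = [
    [(0, 1), (9, 9)],   # [Victoria Chen, her, ...]
    [(5, 6)],           # [Megabucks Banking, ...]
]
# The main mention of each cluster is its first span.
main_mentions = [tokens[c[0][0]:c[0][1] + 1] for c in clusters]
print(main_mentions)  # [['Victoria', 'Chen'], ['Megabucks', 'Banking']]
```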

2.2.8 Relation Extraction
Relation extraction means extracting certain types of relations between named entities in a text. A relation can be anything that binds together two or more entities (see figure 5) [18].

Figure 5: Semantic relations with examples and the named entity types they involve.

Providing a Swedish language model as a basis for later relation extraction between entities is what this thesis aims at. And when one has word vectors, or a pretrained contextual model like BERT, a tagger, a parser, a named entity recognizer (NER), and a coreference solver, models to extract the specific relations of interest for the task at hand can be engineered. This can be done with a rule-based model or with a machine learning model, but in both cases the language model first needs to process the text, and the results of this processing are then used as features for a relation extraction model.

2.3 Related Works
One important related work for this thesis was Alexander Wallin's master's thesis in computer science at KTH, "Creating a Coreference Solver for Swedish and German Using Distant Supervision" [49]. The method for creating a Swedish coreference solver has been drawn largely from this thesis. There are some differences though. Wallin, for instance, didn't use Spacy but designed the neural networks from scratch. This work also used a different word aligner, and a different type of model for the actual coreference resolution, based on contextual representations rather than features (POS-tags, dependency parse trees etc.).

Another important work for the setup of the project, and the overall layout of the work, was Eric Hallström's master's thesis in computer science at KTH, "Relation Extraction on Swedish Text by the Use of Semantic Fields and Deep Multi-Channel Convolutional Neural Networks" [24]. This work investigates a method for relation extraction on Swedish texts which utilizes many of the NLP tasks used in this work, such as POS-tagging, word representations, dependency parsing and named entity recognition. The method worked well for its intended purpose, extracting relations between entities from police reports. It doesn't include coreference resolution though, which no doubt affects the final result.

One other related work was Marcus Klang's doctoral thesis at the Department of Computer Science at Lund University, "Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking" [26]. This work presented a great overview of the NLP field. The aim of the work was a bit different from mine though, focusing on entity linking, which is the task of linking named entities in a text to actual entities in the world, stored in some sort of knowledge base. This will be very important for future work, linking entities in archival texts and descriptions to real-world entities, and thereby structuring the information contained in the archives.

3 Method

This chapter describes the main phases governing this work, and what conclusions were drawn from these phases. It then expands the research question into sub-questions, focusing on what methods there are for relation extraction, and what the prerequisites are for each of these methods. The chapter ends with a motivation of the frameworks chosen to meet these prerequisites.

3.1 Research Method
The phases of this work correspond to the ones originally laid out by Bunge in his Epistemology and Methodology from 1983 [6]. He proposes an iterative research process where the problem formulation phase is where one defines the problem, followed by the research phase where tentative solution strategies are formulated. In the development phase these tentative solutions are tried out and implemented, and in the evaluation phase these implementations are evaluated, after which the next iteration begins (see figure 6).

Figure 6: Research Method.

3.1.1 Problem Formulation
In the problem formulation phase, the goal of this thesis was formulated in discussion with Catharina Grönqvist at Riksarkivet. The purpose was to find a way to extract and structure data from Riksarkivet's archival material. To do this, relevant entities, and relations between these entities, have to be extracted. The aim of this work, therefore, was to create a base model for relation extraction from Swedish archival texts and descriptions. The central problem governing this thesis was thus to find out what the prerequisites are for extracting relations from Swedish texts, and then to find a way to meet these prerequisites, with the important proviso that Swedish annotated datasets are scarce. Another important aim was that the base model created should be as flexible as possible, or in other words, that it can be used for many different types of relation extraction tasks in the future. This problem formulation then gave rise to several sub-questions that each needed to be answered before a solution to the central research question could be found:

1. What different methods exist for relation extraction?

2. Which of these methods are most suitable to a situation where annotated datasets are scarce?

3. What are the prerequisites for each of these methods, that is, what information needs to be gathered from the investigated texts before you can implement any of these methods?

4. What methods can be used to meet these prerequisites?

5. Can a unifying framework be found to help meet these prerequisites?

The problem formulation phase proceeded in iterations because results from the other phases impacted on, and constrained, the original problem formulation. At first, the goal was to do the actual relation extraction, but due to the uncertainty as to how much time would be spent on each step of the work, the problem formulation was constrained to how one creates a base model that meets the requirements for relation extraction, without doing the actual relation extraction itself.

3.1.2 Research and Literature Studies
This phase also proceeded in iterations, and each time new results came in from the later phases, research was done into how these results could be improved. During the first iteration it was about getting an overview of the NLP field, and gauging what was possible and what could be done, given that the investigated texts are in Swedish and not in English. Jurafsky, 2019 [11] gives an excellent overview of the NLP field, and from this book a tentative strategy was formulated. In later iterations it was about reading research papers on relation extraction and related subjects. The field of NLP is evolving at high speed, and so are the methods, so what gave state-of-the-art results a couple of years ago is often outdated now. The problem therefore was to filter out which methods are relevant today, and also which methods were doable given the time constraints of this work. Reading books is a good way to understand the theory behind the algorithms, but when actually implementing these theories, frameworks have to be chosen and learned. So research during the later iterations was mainly focused on reading the documentation for these frameworks. Only open-source frameworks were utilized in this work, so naturally a big part of the research on these frameworks consisted of reading the actual code and, when needed, rewriting or adding to it.

3.1.3 Implementation and Development
During this phase, the chosen frameworks were used to implement tentative solutions to the problems emanating from the problem formulation phase and the research phase. Python was chosen as the programming language due to the abundance of open-source ML frameworks written for Python. Pycharm was chosen as the IDE, and almost all code was written in ipynb notebooks, mainly because such code is very easy to debug and can be written step by step thanks to the partition of the code into separately executable cells. A large part of the work when writing ML models consists of translating data between different formats. The datasets used were all in different formats depending on which task they were intended for, and of course, more often than not, the data format required by the frameworks differs from that of the datasets. When a dataset and a ready-made model exist, almost the only work that has to be done is to convert data between different formats. When a dataset doesn't exist, more work has to be done, because a dataset has to be created first, using the techniques of semi-supervised or distantly supervised machine learning. How to implement these techniques varies with the task at hand, because the data you want to create looks very different depending on what is to be done with it.

3.1.4 Evaluation and Observation
This phase concluded each iteration. It consisted of evaluating the models created in the development phase. Sometimes this is done by the framework itself, and sometimes the code for evaluation has to be written manually. Evaluating the models was done with the evaluation metrics described in the theory chapter, and the results were compared with the results obtained from English models, trained on English datasets with the same methods. Observations of the results were then analysed, and explanations of the differences in scores from the English models were formulated. With these tentative explanations the work could then return to the first phase of the research method with new insight as to what could be achievable (problem formulation), how the results could be improved (research and literature studies), and how to implement these improvements (implementation and development).

3.2 Relation Extraction Methods
This section describes the different methods used for relation extraction, and what prerequisites need to be met to implement these methods.

3.2.1 Rule-based Relation Extraction
Rule-based relation extraction means that you use lexico-syntactic patterns to extract relations from texts. This method was the first one to arrive and it is still in common use today [12].

Imagine for instance that you want to extract hyponym relations from a text (A is a kind of B). A few handcrafted rules, together with examples of sentences conforming to the rules, are given in figure 7.

Figure 7: Handcrafted rules to extract the hyponym relation. NP stands for noun-phrase, {} for optional, * for repeated sequence, and | for logical or. (NP subscript H is the hyponym.)

To write patterns such as these, and more elaborate ones, information about the investigated text is needed. If named entities are included in the patterns, these have to be extracted first, and to extract all the entities from the text, coreference resolution is needed. It is also common to use part-of-speech tags and dependency parse trees when writing the patterns [12]. For instance, if there is a certain dependency between the relevant entities, or a certain structure of the parse tree containing the relation, generalizing patterns can be drawn from this example. The same goes for POS-tags. If the words in between, and around, the entities containing the relation are of a certain POS class, then a generalizing pattern can be drawn from this example as well. The goal is for the patterns to be as general as possible without introducing false relations into the extracted relations. The definition of a relation here is an ordered set of tuples on a specific domain (see figure 8). Another useful feature when writing the patterns is word similarity. For instance, if one pattern for a certain relation is "PERSON sold OBJECT to PERSON", then it would be useful to find semantically close words to "sold" to generalize the pattern. This can easily be implemented with word vectors, generated for instance with the word2vec algorithm. Hand-built patterns have the advantage of high precision, and can be tailored to lots of different relations, but they have low recall, that is, they behave badly on new data, since it is very difficult to hand-craft patterns for all possible expressions of the relation.
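A sketch of what one such lexico-syntactic pattern can look like in code, using spaCy's rule matcher (v3-style API; the English model is an assumption here, and the same idea applies to a trained Swedish pipeline):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed
matcher = Matcher(nlp.vocab)

# "NP_H such as NP": a classic hyponym pattern in the spirit of figure 7,
# approximated with token attributes instead of full noun-phrase matching.
matcher.add("HYPONYM", [[
    {"POS": "NOUN"},                      # the hypernym head noun
    {"LOWER": "such"},
    {"LOWER": "as"},
    {"POS": {"IN": ["NOUN", "PROPN"]}},   # the related noun
]])

doc = nlp("He visited cities such as Stockholm and Uppsala.")
for match_id, start, end in matcher(doc):
    print(doc[start:end])   # cities such as Stockholm
```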

3.2.2 Relation Extraction via Supervised Machine Learning
This method requires a large annotated dataset where the positive examples of the relations under investigation are marked. A model, usually some form of neural network, is trained on the dataset, enhanced with features such as entities, POS-tags, dependency paths and word embeddings [12]. The model then learns the patterns by itself and learns how to extract the annotated relations from unseen text. Figure 9 shows an algorithm for supervised relation extraction. If the texts under investigation are similar enough to the training data, then supervised relation extraction can result in high accuracy, but it generalizes badly for texts in different genres [12]. Labeling a new training set for each relation extraction task is also very expensive and time-consuming, which is why a lot of the research on relation extraction has focused on semi-supervised or distantly supervised ML [12].

Figure 8: A model of the restaurant world. The domain consists of restaurants, persons, and cuisines. Noisy is a property of restaurants, and the relations "likes" and "serves" are defined as ordered sets of tuples on the domain.

Figure 9: Algorithm for supervised relation extraction. First it finds all entity pairs, and then, for all pairs, if the entities are related, it classifies the relation.

3.2.3 Semi-Supervised Relation Extraction via Bootstrapping
Suppose for instance that one wants to find the owners of different mansions by investigating a collection of documents. Semi-supervised relation extraction, or bootstrapping, means starting with a few entity seed-tuples known to have the relation R, for instance [Peter Hanson, Fårö Herrgård]. From these seeds generalizing patterns are extracted, with features such as dependency paths, entity types, surrounding words etc. The documents are then processed, scanning the texts for sentences containing the entities in the seed tuples. After that, the context in between and around the entities in the found sentences is generalized and used to construct new patterns. These patterns are then used to find more tuples. To filter the newly found tuples, a confidence measure is calculated for the pattern used to extract each new tuple. This confidence measure is based on two factors: first, how the pattern performs on the already accepted tuples, that is, whether it manages to extract ordered entity pairs that are known to have the relation R, and second, its productivity in terms of the number of matches it produces in the document collection. If the pattern has high confidence, then the tuples extracted are added to the relation domain. The result of this process is a list of entity tuples defining the relation "PERSON owns MANSION". The algorithm is summarized in figure 10 [12].

Figure 10: Algorithm for semi-supervised relation extraction.
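To make the loop concrete, here is a toy, self-contained sketch of the bootstrapping idea, where the "pattern" is just the literal context between the two entities; a real system would generalize it with POS-tags, dependency paths and word vectors, and the sentences and names below are invented:

```python
import re

corpus = [
    "Peter Hanson bought Fårö Herrgård in 1968.",
    "Anna Berg bought Ekolsund Castle last year.",
    "Karl Ek visited Skokloster Castle.",
]
tuples = {("Peter Hanson", "Fårö Herrgård")}   # seed tuples for relation R
patterns = set()

for _ in range(2):                              # a couple of iterations
    # 1. extract patterns from sentences containing known tuples
    for sentence in corpus:
        for a, b in list(tuples):
            if a in sentence and b in sentence:
                start = sentence.index(a) + len(a)
                patterns.add(sentence[start:sentence.index(b)])
    # 2. use the patterns to find new tuples: two capitalized name groups
    #    around the literal pattern text
    name = r"([A-ZÅÄÖ]\w+(?: [A-ZÅÄÖ]\w+)*)"
    for sentence in corpus:
        for p in patterns:
            m = re.match(name + re.escape(p) + name, sentence)
            if m:
                tuples.add((m.group(1), m.group(2)))

print(tuples)   # the seed plus ('Anna Berg', 'Ekolsund Castle')
```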

3.2.4 Relation Extraction via Distantly Supervised Machine Learning
Relation extraction via distantly supervised machine learning combines the techniques of bootstrapping with supervised machine learning. Instead of starting with a few seeds, a large database, like DBPedia, is used to find any number of seeds. For instance, to train a model to extract the place-of-birth relation, DBPedia has over a hundred thousand tuples having this relation. Next, a large collection of documents, perhaps a text dump of Wikipedia, is scanned for sentences containing entities in the tuples. For each match, different features, like the named entity labels of the two mentions, the words and dependency paths in between and around the mentions, and neighboring words, are extracted and used to construct a training instance. A supervised classifier, like a logistic regression model or a neural network, is then trained on all of these extracted training instances, and the result is a model that can extract this relation from unseen text. The algorithm is summarized in figure 11 [12].

Figure 11: Algorithm for distantly supervised relation extraction. Observations is the resulting training set, and f are the features gathered from the sentences.
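A compact, runnable sketch of the same loop, with invented data standing in for DBPedia and the document collection:

```python
knowledge_base = {("Astrid Lindgren", "Vimmerby")}   # place-of-birth tuples
corpus = [
    "Astrid Lindgren was born in Vimmerby in 1907.",
    "Astrid Lindgren wrote Pippi Longstocking.",
]

observations = []   # the resulting training set
for e1, e2 in knowledge_base:
    for sentence in corpus:
        if e1 in sentence and e2 in sentence:
            # f: features gathered from the sentence. Here just the words in
            # between; a real system adds entity labels, dependency paths etc.
            between = sentence.split(e1)[1].split(e2)[0].strip()
            observations.append(({"between": between}, "place-of-birth"))

print(observations)  # [({'between': 'was born in'}, 'place-of-birth')]
# A supervised classifier is then trained on these observations.
```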

3.3 Technological Methods

3.3.1 Selection of the Methods
The overview given of the different methods used for relation extraction gives some answers as to what was needed from this work's model. In all of the methods described above, word vectors, POS-tagging, dependency parsing, named entity recognition, and preferably coreference resolution, are needed to extract patterns and contexts from the texts. The question therefore becomes: How does one build a model that accommodates all of these features, given that annotated Swedish datasets, at least for coreference resolution and relation extraction itself, are scarce? The goal was to build a model that is flexible and can be used as a base model for rule-based, semi-supervised and distantly supervised relation extraction alike. The decision was made to go with Spacy as a unifying NLP framework to build a language processing pipeline featuring POS-tagging, dependency parsing and named entity recognition. But Spacy doesn't support contextual representations, so a choice was made to try out two additional frameworks, AllenNLP and Transformers, for the crucial task of named entity recognition, and then integrate these frameworks with Spacy to get a unified pipeline. For coreference resolution the first choice was to go with the framework NeuralCoref, which features a ready-built model for coreference resolution that can be retrained in languages other than English. But initial results weren't promising, so it was decided to instead go with AllenNLP's coreference resolution model, which is specially built to use pretrained contextual word representations. Since no large enough Swedish training set exists for coreference resolution, distantly supervised machine learning was implemented to create one, which succinctly means that the English AllenNLP model for coreference resolution was used on a bilingual corpus to create a large enough Swedish dataset for a blank AllenNLP coreference model to be trained on. More on this in the next chapter.

3.3.2 Spacy as a Unifying Framework
Spacy is an open-source framework for NLP written in Python. For English there are trained models to download for a wealth of NLP tasks, but for Swedish there is only the skeleton; no trained model was available at the time of writing this thesis. One of the ideas behind Spacy, though, is that its untrained models, consisting of deep, convolutional neural networks for the most common NLP tasks, can be used to train a model in any language [50]. Custom pipeline components, designed for the specific task at hand, can also be added to the pipeline, which was important for this work, since models from other frameworks were used and then wrapped by a custom Spacy component, thereby integrating these frameworks with the Spacy pipeline. Figure 12 shows the Spacy pipeline. It consists of preprocessing (tokenization), POS-tagging, dependency parsing and named entity recognition [27].

Figure 12: Spacy Language Processing Pipeline.

The advantage of having a single pipeline, rather than separate models for each NLP task, is that you get all the resulting data and features in one single Doc object, which can then be used for customized tasks downstream, for instance relation extraction. To train a Spacy pipeline component in a language other than English, one needs an annotated dataset that is converted into one of Spacy's training formats. The idea behind having a tagger, a parser and an NER first in the pipeline is that these features are relevant for almost all more complicated NLP tasks, like relation extraction. Things have been happening, though, during the last couple of years, with the advent of contextual embeddings and the transformer architecture. On many NLP tasks, end-to-end models, which replace the variety of features, like POS-tags and parse trees, with just the contextual embeddings, have been shown to outperform the more traditional models. This work uses one of these end-to-end models for coreference resolution. Still, the decision was made to go with Spacy as the overarching NLP framework, since it is geared towards production, and since it can incorporate other, more research-oriented frameworks into its pipeline. It is also very easy to work with: one gets a single Doc object that each of the pipeline components manipulates and then passes on to the next component. When building a language model for Swedish, the starting point is a blank model where no pipeline components are trained. What's attractive about Spacy as an overarching framework is also that it is very flexible. Custom components, as mentioned, can be added, and for the NER, new entities can be added to an already existing model, and already included entities can be enhanced with lists and regular expressions [20]. There are several ways to train a Spacy model. One can use the Spacy command line interface (Spacy CLI), which requires the data to be in jsonl format [4], or one can use the Spacy API to train the model in Python code, which requires the data to be in a different format. For each of the NLP tasks in this work where the actual Spacy models were used, both ways of training were tried, but the one found most practical was the Spacy CLI, mainly because evaluating the results, and protecting the model from overfitting, were much easier this way.
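As an illustration of the wrapping idea, here is a minimal sketch of a custom pipeline component in the spaCy v2 style used at the time of this work (the coreference call is left as a stub, and the extension attribute name is invented; spaCy v3 registers components differently):

```python
import spacy
from spacy.tokens import Doc

# Custom extension attribute where an external model's output can be stored.
Doc.set_extension("coref_clusters", default=None)

def coref_component(doc):
    # Here one would run e.g. an AllenNLP coreference model on doc.text
    # and write the resulting clusters to doc._.coref_clusters.
    doc._.coref_clusters = []
    return doc

nlp = spacy.blank("sv")                          # blank Swedish model
nlp.add_pipe(coref_component, name="coref", last=True)

doc = nlp("Erik har köpt en ny dator. Han är nöjd med den.")
print(nlp.pipe_names, doc._.coref_clusters)      # ['coref'] []
```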

3.3.3 NeuralCoref
NeuralCoref is an open-source framework for coreference resolution fully compatible with Spacy. It features a trained model for English only, but can be retrained for other languages, given that one has a large enough dataset. This work started out using NeuralCoref's trained English model to create the Swedish dataset, and a blank NeuralCoref model to be trained on this dataset. As stated earlier, though, the initial results weren't promising, so it was decided to switch to AllenNLP's end-to-end model for coreference resolution, which utilizes contextual representations and achieves a much higher F1-score for English coreference resolution. However, when constructing the Swedish dataset, NeuralCoref's mention extractor was indeed used. NeuralCoref works by first extracting potential mentions from the text, mentions being noun phrases, pronouns, entities or prepositional phrases, and then, for each possible mention pair, calculating the probability that the two mentions are coreferring. The mention extractor is rule-based and dependent on the POS-tagger and the parser, but the calculation of probabilities is done through a neural network [25]. A problem with the mention extractor is that it is dependent on the standard Spacy English parser, which uses the Stanford Dependencies scheme designed for the English language. The training sets available for Swedish dependency parsing instead use the Universal Dependencies scheme. This means that, for this thesis, the mention extractor in NeuralCoref had to be rewritten for Universal Dependencies. See figure 13 for an example of mention extraction, where the mentions are marked by a blue box.

Figure 13: Mention extraction. "I", "me" and "it" are pronouns, and "many friends", "almost everyone" and "this SMS" are noun phrases. All of these mentions can corefer to one another.
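A toy sketch of what rule-based mention extraction over a Universal Dependencies parse amounts to (a simplification of NeuralCoref's extractor; the sentence, tags and head indices below are hand-made):

```python
tokens = ["Erik", "köpte", "en", "dator", "och", "han", "är", "nöjd"]
pos    = ["PROPN", "VERB", "DET", "NOUN", "CCONJ", "PRON", "AUX", "ADJ"]
heads  = [1, 1, 3, 1, 7, 7, 7, 1]   # UD-style head indices; root -> itself

def subtree(i):
    """All token indices in the dependency subtree rooted at token i."""
    ids, changed = {i}, True
    while changed:
        changed = False
        for j, h in enumerate(heads):
            if h in ids and j not in ids and j != h:
                ids.add(j)
                changed = True
    return sorted(ids)

mentions = []
for i, p in enumerate(pos):
    if p == "PRON":
        mentions.append([tokens[i]])             # pronouns are mentions
    elif p in ("NOUN", "PROPN"):
        mentions.append([tokens[j] for j in subtree(i)])  # full noun phrase
print(mentions)   # [['Erik'], ['en', 'dator'], ['han']]
```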

3.3.4 AllenNLP
AllenNLP is an NLP framework mainly built for research [2]. It is built on top of Pytorch and can be used to build any type of NLP model. It also features ready-built models for named entity recognition and coreference resolution, amongst others, and the framework is designed to make use of contextual representations, which is the reason why it was used in this work. The English NER model uses ELMO combined with word2vec vectors, and the English end-to-end coreference model uses SpanBERT, a variant of BERT that encodes random spans of tokens instead of single tokens [34].

3.3.5 Transformers
Transformers is a framework that, as the name suggests, is directed towards implementing the transformer architecture for NLP tasks [23]. It hosts a wealth of pretrained models ready to be used in its pipeline components. Importantly for this work, it hosts the Swedish BERT and ALBERT models trained by the National Library of Sweden, and these pretrained models can easily be used within the Transformers framework.

3.3.6 Google Colab
Training large models can be very time-consuming unless you have a really advanced GPU or TPU. At first, when training the Spacy models, this wasn't a problem, since Spacy models are designed to be compact and fast. But during the later phases of the work, when AllenNLP and pretrained transformers were used, training was impossible with the computing resources available for this work. The solution was to use Google Colab instead, which is a Python notebook environment that features a cloud GPU to use with each notebook created. Colab Pro guarantees a 16 GB GPU for each notebook, and several notebooks can be created and run simultaneously. It is also easy to mount one's Google Drive storage in the notebook environment, so the datasets and the configuration files can be accessed via Google Drive, and the trained models can be uploaded to the drive from the notebook. Switching to training the models on the cloud was essential to this work.
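The Drive setup amounts to a couple of lines at the top of each notebook (the directory layout below is a hypothetical example):

```python
# Mount Google Drive in the Colab notebook so datasets and configuration
# files can be read, and trained models written back.
from google.colab import drive
drive.mount('/content/drive')

DATA_DIR = "/content/drive/My Drive/thesis/datasets"   # hypothetical layout
MODEL_DIR = "/content/drive/My Drive/thesis/models"
```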

4 Training the Spacy Language Processing Pipeline

This chapter describes the details of training the Swedish Spacy language processing pipeline. It will start with a description of the datasets used for each sub-task. It will then go on to describe the implementation details for each of the sub-tasks.

4.1 Datasets
This section describes the datasets used for the different NLP tasks when training the Spacy model. Extracts from the different datasets used, as well as Spacy's different training formats, can be found in appendix A.

4.1.1 POS-Tagging and Named Entity Recognition
For POS-tagging and NER the Stockholm Umeå Corpus (SUC 3.0) was used [38]. It features about one million tokens annotated with part of speech and named entities. It is in XML format and can be used by anyone who applies to Språkbanken for a license. Språkbanken is a Swedish research institute that provides support for research projects related to the Swedish language [31]. It also freely provides a large collection of Swedish annotated datasets, as well as different tools for language processing. SUC 3.0 features a lot of different named entities and is by far the largest Swedish dataset for POS-tagging and NER. A promising new project is being undertaken by AI Innovation of Sweden, Lindholmen, and partner organizations, who, supported by Vinnova [48], will work from June 2019 to June 2021 to develop new, extensive training sets for Swedish NLP, focusing on NER [42]. When these new datasets are released, the model developed in this work can be trained on them without being retrained from scratch, incorporating both SUC 3.0 and these new datasets.

4.1.2 Dependency Parsing
For dependency parsing two different datasets were used, "UD Swedish Talbanken" [46] and "UD Swedish Lines" [45]. Both datasets use the Universal Dependencies annotation scheme, and they both consist of about a hundred thousand annotated tokens. These are by no means large datasets, but they are the most extensive ones available for Swedish dependency parsing.

4.1.3 Coreference Resolution
When it comes to coreference resolution, the only Swedish dataset available is SUC-Core [41], issued by Stockholm University and Umeå University in collaboration. This dataset only contains 17 000 annotated tokens. This can be compared to OntoNotes [32], the standard coreference dataset for English, which consists of 1.5 million tokens. This is why distant supervision was used for the coreference solver, creating a Swedish dataset using an English AllenNLP model trained on OntoNotes. The Swedish model trained on this distantly created dataset can't be evaluated on a part of this dataset, since the dataset isn't hand-annotated and in fact is created using the same model architecture, which means the results would be skewed if part of it were used as an evaluation set. This is where SUC-Core becomes useful, because it is still large enough to use as an evaluation set for the distantly trained coreference solver. To create the Swedish dataset for coreference resolution, Europarl [21] was used. This is a bilingual, sentence-aligned corpus consisting of about 60 million tokens, extracted from the proceedings of the European Parliament.

4.2 Part-of-speech tagging

At first, the Spacy POS-tagger was trained on SUC 3.0, which was split into a training set, an evaluation set and a test set. The Spacy API rather than the Spacy CLI was used. No word vectors were used for the first training; instead the results from this first training iteration served as a baseline to evaluate later improvements. The problem with using the API is that the code to evaluate the results has to be written manually. Another issue is that it is hard to know when the model starts overfitting: the model has to be trained a number of times to see how many training iterations yield the best result. The results of this first attempt weren't promising. They improved significantly, though, when word vectors were added to the model. Swedish FastText vectors for about two million words were used. FastText vectors build on the Word2Vec algorithm, but with certain modifications that make them fare better with unknown words [47]. As a guide to writing the training code, the official Spacy docs were used [36]; they are excellent, although sometimes a bit hard to navigate. One issue with training the tagger, or any other pipeline component this way, is the tokenization, and this tokenization issue proved to be significant for almost every step of this project. At first, the Spacy simple training format (see Appendix A), recommended in the docs, was used. With this format one provides the text sentence by sentence, and then the tags for each word in the sentence. The problem is that the tags follow the tokenization of the training set, while the text of the sentence is tokenized by the Spacy tokenizer ahead of training. Thus, the tokenization can differ between the text and the tags, resulting in out-of-sync annotations for that particular sentence, which in turn results in a less accurate model. This was the main reason why the Spacy simple training style was abandoned in favour of the Spacy CLI format. With the CLI the text is provided token by token in jsonl format together with the annotations [4]. Since Spacy doesn't first tokenize the text, the tags are sure to be in sync with the provided text. Also, when training with the CLI, there are lots of parameters to play around with, one of which guards against overfitting by stopping the training when the evaluation results haven't improved after a certain number of iterations. When training with the CLI, Spacy saves a model to disk after each iteration, so if, for instance, the model is trained for 35 iterations, 35 different models are saved to disk, and one can simply choose the one with the best evaluation score.
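To make the simple training format concrete, the following is a minimal sketch of an API-style training loop using the Spacy 2.x API; the Swedish sentence and its SUC-style tags are illustrative, not taken from the actual training set.

import random
import spacy

# One example in the Spacy simple training format: the raw sentence
# plus one tag per token. Note that the tags follow the tokenization
# of the training data, while Spacy re-tokenizes the raw text; this is
# the out-of-sync problem described above.
TRAIN_DATA = [
    ("Jag bor i Stockholm .", {"tags": ["PN", "VB", "PP", "PM", "MAD"]}),
]

nlp = spacy.blank("sv")
tagger = nlp.create_pipe("tagger")
for _, annotations in TRAIN_DATA:
    for tag in annotations["tags"]:
        tagger.add_label(tag)
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(i, losses)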

4.3 Dependency parsing

For the first project iteration, here too, the Spacy simple training format was used, and the code for evaluation had to be written manually. The training took a fair amount of time, because it was necessary to manually delete every sentence from the training set where the hand-annotation didn't build the parse tree correctly, that is, where the parse tree contained cycles. Spacy gives an error when such a sentence is spotted, but it meant that the model had to be retrained about a hundred times, each time stopping and deleting the offending sentence from the training set whenever the error occurred. The time spent on this was the reason that only one of the two available training sets was used for the first project iteration. When returning to the parser later on, after having success with the Spacy CLI when training the tagger, the CLI was used for the parser too, and this went much faster: the training algorithm simply skipped the sentences with erroneous tree structure. Now both training sets could be used, which resulted in a more accurate model. An issue faced when using the CLI was that when testing the parser on a text document, the parser didn't separate the sentences, but tried to parse the whole document as one long sentence, even though it correctly marked the dot at the end of each sentence as an end-of-sentence relation. A question about this was posted on Stack Overflow, and it turned out that there was only one sentence per paragraph in the parsed jsonl file, but for the model to learn to separate sentences from each other there have to be several sentences per paragraph [37].
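A pre-filtering step along the lines of the sketch below would have avoided the repeated retraining for cyclic annotations. It is a sketch under the simple-format convention that heads are absolute token indices and that the root token is its own head; train_data is a hypothetical list of (text, annotations) pairs.

def has_cycle(heads):
    """Return True if the head annotation of a sentence contains a cycle
    instead of a proper tree rooted in a self-headed token."""
    for start in range(len(heads)):
        seen = set()
        token = start
        while heads[token] != token:  # walk the chain towards the root
            if token in seen:
                return True
            seen.add(token)
            token = heads[token]
    return False

# Drop the offending sentences once, up front, instead of restarting
# the training run every time Spacy raises an error.
clean_data = [(text, annotations) for text, annotations in train_data
              if not has_cycle(annotations["heads"])]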

4.4 Named Entity Recognition

When training the Spacy NER, the main body of work consisted in converting SUC 3.0, which is in XML format, to the different annotation schemes used for entity marking. Figures 14, 15 and 16 describe the most common schemes; see figure 20 in Appendix A for an extract from SUC 3.0 [44][4].

Figure 14: BILOU Format. B stands for the first token in an entity, I for an inside token in an entity, L for the last token in an entity, O for a token that is not part of an entity, and U for a single-token entity.

Figure 15: IOB Format. B stands for the first token of an entity, I for every following token in an entity, and O for a token that isn't part of an entity.

Figure 16: Spacy Simple Training Format

As seen, the Spacy simple training format uses character offsets into the sentences to mark the entities. The Spacy CLI train command, on the other hand, uses the BILOU scheme in a jsonl token-separated format (see Appendix A). For the first project iteration the Spacy simple training format was used, and SUC 3.0 was first converted into an IOB text file. This file was then converted into the Spacy simple training format, keeping only the entity categories that were relevant as well as frequent enough to yield an acceptable F1 score. Later on, though, the decision was made to go with the Spacy CLI format for the NER as well, thereby converting the SUC 3.0 XML to Spacy BILOU jsonl. The converter to the Spacy simple training format can still be useful in the future, since that format is the preferred option if one wants to add entities to an already existing model. The resulting Spacy NER was fairly accurate when it came to names and places, but performed badly on rarer entities like organizations or time expressions. About halfway into this work, the Swedish National Library released their pretrained Swedish BERT models [43], and the NLP research framework AllenNLP was also becoming more popular, so a decision was made to try to improve the results of the NER with these new frameworks and models, and then integrate them with Spacy. For AllenNLP the integration was easy, since it actually uses Spacy's tokenizer, which means the indexing of the entity tokens is in sync with the Spacy Doc object. For the Swedish BERT-based model, combined with the Transformers NER model, it was a bit more difficult, since BERT models use their own tokenizer, and the predicted tags had to be aligned with the Spacy-tokenized Doc object. The solution was to, sentence by sentence, transform the predicted entities into lists of tokens, then search the corresponding Spacy-tokenized sentence for these sublists, and add the entities to the Spacy Doc object.

The AllenNLP NER model uses ELMO combined with FastText embeddings as a contextual representation, and the results were better than the Spacy NER, but still lacking when it came to rare entities. The best results were obtained from training the AllenNLP NER model with a Swedish BERT model, finetuned on SUC 3.0, as a base. To try out this NER, and also show some of the benefits of working with Spacy, it was applied to an archival description of Jordberga Gods. An entity relation of interest here could for example be which person owned the mansion in which year. To extract this relation, in this archive description and in descriptions of other mansions, the persons, the mansions and the years have to be extracted. The trained NER already extracts persons, but it doesn't extract mansions. Thus, a Wikipedia list of Swedish castles and mansions was webscraped, and the list was combined in different ways to construct patterns for a new, custom entity (MNS) to be added to the NER. A regular expression was then used to extract years and time periods. Figure 17 shows an example of entity extraction. PRS stands for person, LOC for location, MNS for mansion, and YEAR for years and time periods. One thing to note is that Harlösa wasn't in the webscraped list of mansions, and thus was marked as a location instead, the rule-based extraction taking place after the regular NER and overwriting conflicting entities.
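A simplified sketch of the sublist search mentioned above is shown below; in the real pipeline the BERT WordPiece sub-tokens first have to be merged back into whole words, which is omitted here, and the function names are illustrative.

from spacy.tokens import Span

def find_sublist(entity_tokens, doc_tokens):
    """Return the (start, end) token offsets of the first occurrence of
    entity_tokens inside doc_tokens, or None if it doesn't occur."""
    n = len(entity_tokens)
    for i in range(len(doc_tokens) - n + 1):
        if doc_tokens[i:i + n] == entity_tokens:
            return i, i + n
    return None

def add_entity(doc, entity_tokens, label):
    """Align a predicted entity with the Spacy tokenization and attach
    it to the Doc object as a Span."""
    match = find_sublist(entity_tokens, [t.text for t in doc])
    if match is not None:
        start, end = match
        doc.ents = list(doc.ents) + [Span(doc, start, end, label=label)]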

Figure 17: An example of entity extraction, combining a machine learning NER model with rule-based named entity extraction.
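Spacy's EntityRuler [20] supports exactly this kind of rule-based pass after the statistical NER. The sketch below shows roughly how MNS patterns and a year regex could be added; both patterns are illustrative rather than the actual ones used in this work.

from spacy.pipeline import EntityRuler

# overwrite_ents=True lets the rule-based entities overwrite
# conflicting predictions from the statistical NER, as described above.
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([
    # One mansion name from the webscraped Wikipedia list.
    {"label": "MNS", "pattern": [{"LOWER": "jordberga"}, {"LOWER": "gods"}]},
    # A crude year pattern matching 1600 through 1999.
    {"label": "YEAR", "pattern": [{"TEXT": {"REGEX": r"^1[6-9]\d\d$"}}]},
])
nlp.add_pipe(ruler, after="ner")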

4.5 Coreference resolution

Coreference resolution was by far the most time-consuming and technically demanding part of the work. At first an open source Python framework, NeuralCoref, was chosen [30]. It features a model for English coreference resolution trained on OntoNotes, the standard coreference dataset for English. NeuralCoref was chosen because it is open source, one of the requirements of this project, and because it is fully compatible with Spacy. The English part of a bilingual sentence-aligned corpus was fed to this model, marking all the coreferences in the English text. Then a word aligner was used to translate these English coreference annotations into an annotated Swedish coreference dataset. Later on in the project, the decision was made to switch to AllenNLP's coreference model, which uses contextual representations at its base and is end-to-end, meaning that it doesn't depend on other pipeline components such as POS-tagging or dependency parsing. The process of creating the Swedish dataset was the same for both frameworks. Algorithm 1 is a summary of the algorithm for creating the Swedish dataset, after which follows a more detailed description of the method used to create both the dataset and the coreference model via distant supervision.

Algorithm 1 Creating the Dataset
 1: for every document do
 2:   AliFileEng, AliFileSwe = createAlignedFiles(XMLLinks, XMLEng, XMLSwe)
 3:   AlignmentsFile = runEflomal(AliFileEng, AliFileSwe)
 4:   wordAlignments = extractWordAlignments(AlignmentsFile)
 5:   nlpEng = loadEnglishModel()
 6:   nlpSwe = loadSwedishModel()
 7:   engDoc = runModel(nlpEng, AliFileEng)
 8:   sweDoc = runModel(nlpSwe, AliFileSwe)
 9:   corefSpansSwe = translateCorefSpans(wordAlignments, engDoc)
10:   corefAnnotationsSwe = []
11:   for every cluster in corefSpansSwe do
12:     sweCorefMentionsInCluster = []
13:     for every span in cluster do
14:       sentText = extractSentContainingSpan(span, sweDoc)
15:       sentDoc = runModel(nlpSwe, sentText)
16:       mentionsInSent = extractMentions(sentDoc)
17:       maxMentionInSpan = extractMaxMentionInSpan(span, mentionsInSent)
18:       valid = validateMention(maxMentionInSpan)
19:       if valid then
20:         sweCorefMentionsInCluster.append(maxMentionInSpan)
21:       end if
22:     end for
23:     if sweCorefMentionsInCluster.length > 1 then
24:       corefAnnotationsSwe.append(sweCorefMentionsInCluster)
25:     end if
26:   end for
27:   documentDataframe = saveToDataframe(corefAnnotationsSwe, sweDoc)
28: end for

4.5.1 Europarl

Europarl [21] seemed ideal for this work's purpose of building a Swedish coreference solver via distantly supervised machine learning, due to its bilingual sentence alignment between English and Swedish. It was also the corpus used by Wallin (2017) [49]. The Swedish-English aligned version of Europarl consists of about 2.6 million sentences, with a total of 45 million tokens, divided into about 11 200 documents. This corpus is far too large to process in its entirety. Instead, what this work was aiming at was creating a Swedish coreference dataset out of this unannotated corpus comparable to OntoNotes with its 1.5 million tokens. However, constraints on both the time available for the work and the computer resources to train the models made it impossible to work with such a large dataset. Instead, a dataset of about 250 000 tokens was created, which is still more than ten times the size of SUC-Core.

4.5.2 Creating Tokenized, Aligned Text Files for Swedish and English Sentences

The creation of the dataset was done document by document until a suitable total size was reached. The raw Europarl corpus consists of 11 200 XML files each for English and Swedish, as well as 11 200 XML link files, one for each document, linking the English sentences to the Swedish sentences. The first step in processing a particular document was to create, from these XML files, sentence-aligned and tokenized text files, one for Swedish and one for English. It was crucial here to use the same English tokenizer and the same Swedish tokenizer as were used later in the dataset creation process. This will become apparent in what follows.

4.5.3 Word Alignment

The next step was to calculate, for each aligned sentence, word alignments from English to Swedish (actually one sentence in English can be aligned to several sentences in Swedish, and vice versa, but the important thing is that the original text and the translated text are aligned by lines). The University of Helsinki has an excellent NLP resource site for handling parallel corpora, and their open source word aligner Eflomal was used for this task [35]. The word aligner was first trained on the corpus in its entirety. After this, the trained Eflomal model could be run on the document being processed. The result was a text file containing word alignments from English to Swedish, line by line from the tokenized English and Swedish text files. From this alignment file a Python dictionary was created with English token IDs as keys and the aligned Swedish token IDs as values.
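Assuming a Pharaoh-style alignment file, where each line holds space-separated "english-swedish" index pairs for one sentence pair, the dictionary can be built roughly as in this sketch.

from collections import defaultdict

def read_alignments(path):
    """Parse a Pharaoh-style alignment file ('0-0 1-2 3-2 ...', one line
    per sentence pair) into a list of dicts mapping each English token
    index to the list of Swedish token indexes aligned to it."""
    alignments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            pairs = defaultdict(list)
            for pair in line.split():
                eng, swe = pair.split("-")
                pairs[int(eng)].append(int(swe))
            alignments.append(dict(pairs))
    return alignments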

4.5.4 Running the English Coreference Solver

At this point the tokenized English document was processed by the English coreference solver (NeuralCoref and, later, AllenNLP's coreference model). Experimenting with different document sizes yielded somewhat different results. Short documents seemed to give slightly better results, but only processing short documents would have taken too long, since too many documents would have had to be processed. The chosen solution was instead to process medium-sized documents, but to split them into many parts before running the English model on them. The coreference annotations are structured in so-called coreference clusters, which identify groups of coreferring mentions; within each cluster there is a main mention. The annotations were then translated to a nested Python list of coreference spans, containing token indexes for all coreferring mentions in each coreference cluster. It was important here that the text to be processed was tokenized in exactly the same way as when the word alignments were calculated, because otherwise the indexing of the coreference annotations would be out of sync with the word alignment dictionary, and the following steps would have been impossible.

4.5.5 Translating the Coreference Spans

Now that the English coreference spans were resolved, the word alignment dictionary could be used to get the corresponding Swedish spans. From the aligned Swedish token indexes for each token in the English coreferring mention, the maximum continuous aligned span in the Swedish document was calculated. Each span was then saved to a nested list (corefSpansSwe).
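A sketch of that calculation: given all Swedish token indexes aligned to the tokens of one English mention, find the longest run of consecutive indexes.

def max_continuous_span(aligned_indexes):
    """Return the longest run of consecutive token indexes as a
    (start, end) span with end exclusive, or None if nothing aligned."""
    if not aligned_indexes:
        return None
    idx = sorted(set(aligned_indexes))
    best_start, best_len = idx[0], 1
    run_start, run_len = idx[0], 1
    for prev, cur in zip(idx, idx[1:]):
        if cur == prev + 1:
            run_len += 1
        else:
            run_start, run_len = cur, 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_start + best_len

# For example: max_continuous_span([3, 4, 9, 7, 8]) == (7, 10)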

4.5.6 Mention Extraction

The next step was to do mention extraction, for each span in corefSpansSwe, on the Swedish sentence containing the span. This required rewriting NeuralCoref's rule-based mention extractor, because it depends on a different tagging scheme and a different dependency-parsing scheme than the trained Swedish model. The differences pertain to how prepositional phrases and copula constructions ("Jag är...", "He is...") are handled. NeuralCoref's mention extractor depends on the parser, the tagger and the NER, so for this step the tokenized Swedish text was processed by the trained Swedish Spacy model, including the BERT-based AllenNLP NER.

4.5.7 Filtering the Mentions

Once the Swedish mentions had been extracted from the sentence containing the span, the idea was to check whether any of these mentions were contained in the Swedish maximum continuous span. If a mention was found, it was saved to a nested list of Swedish coreferring mention spans. If several mentions were found in the span, the mention with the largest span was chosen.

4.5.8 Storing the Annotations

The last step of processing a document was to store the annotations in a Pandas dataframe, with one entry for each token in the document, together with the coreference annotations and other relevant information like document index, sentence index and so on, and then convert this dataframe to the CoNLL-2012 format used by AllenNLP's coreference model.
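The coreference column of CoNLL-2012 encodes each cluster with numbered parentheses. A sketch of generating that column from span annotations (spans given as inclusive token offsets) could look like this; the exact conversion code in this work also handled the other CoNLL-2012 columns, which are omitted here.

def coref_column(num_tokens, clusters):
    """Build the CoNLL-2012 coreference column: '(3' opens a mention of
    cluster 3, '3)' closes it, '(3)' marks a single-token mention, and
    '-' means no annotation; several marks on one token are joined by '|'."""
    cells = [[] for _ in range(num_tokens)]
    for cluster_id, spans in enumerate(clusters):
        for start, end in spans:  # end inclusive
            if start == end:
                cells[start].append(f"({cluster_id})")
            else:
                cells[start].append(f"({cluster_id}")
                cells[end].append(f"{cluster_id})")
    return ["|".join(cell) if cell else "-" for cell in cells]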

4.5.9 Training the Swedish Coreference Model

When training the AllenNLP coreference model on the created Swedish dataset, SUC-Core was used as an evaluation set, and the Swedish National Library's BERT model was used as a contextual embedding. At first, the free version of Google Colab was used, where one gets access to a 12 GB GPU, but this wasn't enough for a dataset consisting of 250 000 tokens, so the account was upgraded to Colab Pro, where one gets a 16 GB GPU, which was enough for a dataset of this size.

5 Results

In this chapter the results from all the different steps of building a Swedish language processing pipeline are presented. Spacy serves as the overarching framework, and Spacy models were trained for POS-tagging, dependency parsing and NER. Given the importance of NER for future use with archival texts, several frameworks, with several embeddings, were tried to achieve as good results as possible for this crucial task. For coreference resolution, this work finally decided to go with AllenNLP's end-to-end coreference model, which achieves state-of-the-art results on the English coreference dataset OntoNotes [3]. Depending on the type of NLP task, different evaluation metrics were used, as explained in the theory chapter.

5.1 Part-of-Speech Tagging

For POS-tagging, accuracy was used as the evaluation metric, since every word belongs to a word class. When training the tagger with the Spacy API simple training format without word vectors, an accuracy of 82 percent was achieved. This is quite low compared to the Spacy English models, and would have impacted badly on the later coreference resolution task, with the mention extraction being dependent on the tagger. When training with the FastText vectors the accuracy improved to 89 percent, which was still considerably lower than the English models. The low results were suspected to be caused by the tokens being out of sync with the tags for certain sentences. Switching to training with the Spacy CLI, where an already tokenized text is fed to the model, improved the accuracy to 96.6 percent. See table 1 for a comparison of different Spacy models.

5.2 Dependency Parsing

Two metrics were used for evaluating the parser, unlabeled attachment score (UAS) and labeled attachment score (LAS). The unlabeled score measures the correctness of the tree structure, that is, which words are dependent on which, or more specifically, which word is the head of a given word in a sentence. The labeled score also checks the correctness of the dependency labels, that is, whether the words are assigned the correct dependency category. When training the model with the Spacy CLI on only one of the two datasets (Swedish LinES), the UAS was 83.7 percent and the LAS 78.8 percent. When training the model on both datasets (Swedish LinES as well as Talbanken), a UAS of 85.7 percent and an LAS of 80.03 percent were achieved. This is a significantly lower score than that achieved by the Spacy English models, but better than the official Danish Spacy model (see table 1). A possible explanation is that the English Spacy tokenizer is much more accurate than the corresponding Swedish or Danish ones.
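As a concrete definition of the two metrics, the sketch below computes them from gold and predicted (head, label) pairs per token; the data layout is assumed, not Spacy's internal one.

def uas_las(gold_sents, pred_sents):
    """UAS: fraction of tokens with the correct head. LAS: fraction of
    tokens with both the correct head and the correct dependency label.
    Each sentence is a list of (head_index, dep_label) pairs."""
    total = head_correct = both_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for (g_head, g_dep), (p_head, p_dep) in zip(gold, pred):
            total += 1
            if g_head == p_head:
                head_correct += 1
                if g_dep == p_dep:
                    both_correct += 1
    return head_correct / total, both_correct / total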

Table 1: Evaluation scores for Spacy dependency parsing and POS-tagging. LAS is the labeled attachment score, UAS the unlabeled attachment score, and POS the part-of-speech accuracy [22]

5.3 Named Entity Recognition

For named entity recognition the F1-score is the recommended evaluation metric, since most tokens don't belong to any entity at all, and hence accuracy would be high even if no entities were extracted. Different frameworks, models and embeddings were tried for this step, given its importance in future work. The best results, as can be seen in table 2, were obtained from training AllenNLP's fine-grained NER model with a pretrained Swedish BERT model, finetuned on the SUC 3.0 dataset, as a contextual embedding. Finetuning a BERT model means adding a few neural layers and an output layer to the pretrained BERT model, feeding it an annotated dataset, and letting the model fine-tune the BERT model's parameters to achieve minimal loss on the training set. This fine-tuned BERT model can then serve as a base for a NER model, in the same way an original BERT model can. The actual fine-tuning wasn't part of this work but was done by the Swedish National Library. The entities extracted were persons (PRS), organizations (ORG), time expressions (TME) and locations (LOC). All the models were trained on SUC 3.0 using a split between training set, evaluation set and test set. No time expressions were extracted with the Spacy model, so its overall result is not quite comparable with the other models. Notable is the significantly higher score achieved on organizations with the fine-tuned BERT model as a base. This is probably due to the model being fine-tuned on each of the entities.
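For reference, entity-level F1 is the harmonic mean of precision and recall, computed from true positives, false positives and false negatives as in this small sketch.

def f1_score(tp, fp, fn):
    """F1 = 2PR / (P + R), where P = tp / (tp + fp) is the precision
    and R = tp / (tp + fn) is the recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)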

Table 2: NER evaluation scores for the Spacy NER with FastText, AllenNLP's fine-grained NER with ELMO and FastText, with BERT-base, and with a BERT-base finetuned on SUC 3.0

5.4 Coreference Resolution

The evaluation metric used for coreference resolution, recommended by the CoNLL-2012 shared task [49], is a metric called MELA (the CoNLL coreference F1-score). It is the weighted average of three other coreference evaluation metrics called MUC, B3 and CEAFE. MUC models mentions as vertices in a graph, with the coreference links as edges, and what the metric measures is how many links must be added to or deleted from the coreference chains such that all chains are correct [49]. B3 does something similar but assumes perfect mention detection. CEAFE is an entity-based metric, rather than a mention-based metric.

5.4.1 Word Alignment

As stated before, the University of Helsinki's word aligner Eflomal was used when aligning the English words to the Swedish words in Europarl. The average error rate for English-Swedish sentences with Eflomal is about 0.13 [19].

5.4.2 Translated Coreference Annotations

When creating the Swedish dataset, about 83% of the English coreference annotations were translated into Swedish ones. The untranslated annotations could be due to erroneous word alignments or incomplete mention detection in the Swedish documents.

5.4.3 Trained Model

Table 3 shows a comparison of the coreference F1-scores for some English models, Wallin's model, and this work's model. Wallin (2017) used the exact same evaluation set as this work (SUC-Core), so the results are directly comparable. This work achieved a 43% improvement on Wallin's result, even though the generated dataset was about six times smaller. This is undoubtedly due to the fact that a better English model was used when creating the Swedish dataset, and also when training the Swedish coreference solver on this dataset.

Table 3: Coreference F1-scores for Lee et al.'s original end-to-end coreference solver using GloVe embeddings, AllenNLP's model, based on Lee's model but using SpanBERT and BERT instead (LSTM is a specific recurrent neural network architecture), as well as Wallin's (2017) Swedish coreference solver and this work's coreference solver, using AllenNLP's BERT model [1]

6 Discussion

This chapter discusses the process of this work, outlines implications for future work, and draws possible conclusions from the results. The chapter starts with a general discussion of doing natural language processing on Swedish texts, reviewing the frameworks used in this work and discussing the pros and cons of different types of machine learning methods. The results gathered from the work are then analyzed, and suggestions for possible improvements are put forward.

6.1 Natural Language Processing on Swedish Texts

The field of NLP is already large, and continues to grow as more and more practical implementations are developed. Of course, the biggest developments take place within the English language, but other languages are trying to catch up. A number of difficulties must be overcome when working with NLP in smaller languages, the main one being the scarcity of annotated datasets. This obstacle, though, can be overcome with techniques such as semi- and distantly supervised machine learning, as well as transfer learning. Another difficulty is that the existing NLP frameworks are mainly designed for use with the English language, and often require a bit of work to sync with other languages. But since most of them are open source and can be customized, this is not a big obstacle.

6.1.1 Frameworks

The main frameworks used for this thesis were Spacy and AllenNLP. They are two very different frameworks, Spacy being developed with speed and production as the main goals, and AllenNLP mainly for research. Spacy has great documentation and is very easy to work with. It was designed to work with many different languages, featuring, for example, specialized tokenizers and POS tag maps for over 20 languages. The models are designed to be fast, both in training and in production, which is great if one actually wants to deploy the models. The drawback is that the models aren't that accurate, and the framework is yet to implement support for pretrained transformer models like BERT or ALBERT. AllenNLP is an open source NLP framework built on top of PyTorch, aimed at research, and features an easily understandable API for constructing your own models, as well as many ready-made state-of-the-art models for a number of NLP tasks. The documentation isn't great, and one often has to read the code and the comments to understand how to use it, which, on the other hand, is a great way to learn. During this work a number of questions about implementation details were posted on the AllenNLP forum, and they all got answered within a day, so the forum is very active. AllenNLP is also easy to integrate with Spacy: one can train an AllenNLP model and then wrap it in a custom Spacy pipeline component. This way one can choose: if speed is important and there exists a Spacy model for the particular task, then go with Spacy, but if accuracy is important, or the task is such that the model has to be built from scratch, then go with AllenNLP and integrate the trained model with Spacy.
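A rough sketch of such an integration is shown below, under the assumption that a trained AllenNLP coreference model has been saved as a model archive; the model path, the pipeline name and the extension attribute are hypothetical.

import spacy
from allennlp.predictors.predictor import Predictor
from spacy.tokens import Doc

# A user-defined extension attribute to hold the predicted clusters.
Doc.set_extension("coref_clusters", default=None)

# Hypothetical path to a trained AllenNLP coreference model archive.
predictor = Predictor.from_path("models/swedish-coref.tar.gz")

def coref_component(doc):
    """Custom Spacy 2.x pipeline component: run the AllenNLP predictor
    on the document text and store the clusters on the Doc object."""
    result = predictor.predict(document=doc.text)
    doc._.coref_clusters = result.get("clusters")
    return doc

nlp = spacy.load("sv_pipeline")  # hypothetical trained Swedish model
nlp.add_pipe(coref_component, name="coref", last=True)
doc = nlp("Erik bor i Lund. Han arbetar på universitetet.")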

6.1.2 Supervised Machine Learning

SUC 3.0, the Swedish dataset for NER and POS-tagging, is comparatively large, which was reflected in the results for these two tasks. But for other tasks, such as coreference resolution or relation extraction, the Swedish datasets are either small or nonexistent. Thus, the need for other forms of machine learning becomes apparent. Also, supervised machine learning is only applicable to specialized categorization tasks [39]. When the tasks become more complicated it has to be replaced, or combined with other, more general methods. Transfer learning represents such a generalized method, and according to many within the AI community, AI has to make a shift towards these more general methods to live up to its promises [39].

6.1.3 Transfer Learning

For humans, transfer learning is the main paradigm: one uses skills already learned to learn new, more specialized skills. The challenge within AI is how to model transfer learning in an efficient way. Transfer learning has long been the standard within computer vision, where a large model is trained on some generalized task and then serves as a base for other, more specialized tasks. With the advent of transformer models, transfer learning made its way into NLP as well. A model is trained on some general task, for instance masked language modelling, and the hidden layers of this trained model are then used to represent contextual information about a token, or a sequence of tokens. Exactly what information is encoded in these hidden layers is quite poorly understood [39], but specialized models built on top of these generalized models have been shown to consistently outperform feature-based models on almost all NLP tasks, a fact that was also reflected in the results of this work.

6.1.4 Distantly Supervised Machine Learning

Broadly, distantly supervised machine learning means to create, via some automated process, a dataset from another dataset that is either smaller or written in another language. Of course, this is a very relevant method for situations where hand-annotated datasets are scarce, for example when doing NLP in a language other than English. In this work, the method was used for coreference resolution, and the idea of using English models, trained on extensive hand-annotated English datasets, to create Swedish datasets where none exist, is an important one, and will probably be used quite extensively in the future.

6.2 Discussion of the Results

This section discusses the results of the different parts of this work, draws conclusions from them, and suggests possible areas of improvement.

6.2.1 POS-tagging and Dependency Parsing

The results for POS-tagging obtained from training the Spacy POS model were quite good compared to the official English models, which reflects the quality and size of SUC 3.0. For dependency parsing, the results were not equally good. Perhaps they would have been better with a more accurate tokenizer (the sizes of the English and French datasets for dependency parsing are not very much larger than the sizes of the Swedish datasets). It would be interesting to try AllenNLP's model for dependency parsing, and to try it with different tokenizers. It is possible to write one's own tokenizer in AllenNLP, but this was out of scope for this work.

6.2.2 Named Entity Recognition

Named entity recognition was one of the main goals of this work, given its importance for relation extraction between named entities, so a lot of work was put into converting SUC 3.0 to different formats and training different models with different embeddings. The best results were obtained by using AllenNLP's fine-grained NER model with the Swedish National Library's finetuned BERT model as a base. These results are comparable with English models trained on CoNLL-2003, the largest annotated English dataset for NER [29]. Clearly, contextual embeddings and transfer learning improve the results of NER quite significantly. Figures 18 and 19 show an illustrative difference between using contextual and non-contextual embeddings. The name Christian Bille is correctly extracted in the first sentence, but missed in the second, probably because the rare word "bortförplantade" in the second sentence gives a very different contextual embedding. This miss was not present in the ELMO-FastText model, probably because it uses FastText non-contextual embeddings as well as ELMO.

Figure 18: Named Entity Extraction with BERT.

Figure 19: Named Entity Extraction with ELMO and FastText.

6.2.3 Coreference Resolution

The results obtained for the Swedish coreference solver, although significantly lower than those of the English AllenNLP model, show some promise. The Swedish model was trained on a dataset six times smaller than the English model's, and still got an F1-score of 0.50, which is a significant improvement on Wallin's (2017) Swedish coreference solver, which got an F1-score of 0.34 [49]. The improvement, though, is not a reflection of a better method for creating the Swedish dataset, but rather of the development of end-to-end coreference models that use pretrained transformer models as contextual embeddings. It would have been interesting to train the same model on a distantly generated Swedish dataset with a size comparable to OntoNotes 5.0, but this requires computational power that simply wasn't available for this work. Another thing to note is that the English models were evaluated on a part of OntoNotes, the dataset they were trained on (although not on that specific part), which might yield higher results, since the annotation procedure was the same for all of OntoNotes, whereas the Swedish model developed in this work was evaluated on SUC-Core, which has no relation to the created training set based on Europarl.

6.3 Future Work

The intention of this work was to create a language processing pipeline that can serve as a base for future work, so obviously there is a lot to be done further on, the main thing being to use the model developed in this work for extracting relations between named entities in archival texts and descriptions, using the methods described in chapter 3. A main part of this work was of course to develop a Swedish coreference solver, and as stated in the previous section, when the required computational power is available, it would be of great interest to extend the Swedish dataset using the code developed in this project. It would also be interesting to try out different bilingual corpora for creating this dataset, for instance ParaCrawl instead of Europarl, ParaCrawl consisting of parallel texts obtained from crawling the web [33]. Perhaps ParaCrawl would yield a more varied dataset, and perhaps achieve a higher score. After all, proceedings of the European Parliament form a pretty narrow base for a dataset.

7 Appendix A - Datasets and Spacy Training Format

Figure 20: An extract from the Stockholm Umeå Corpus (SUC 3.0); the entities "Litauens", "Vytautas Landsbergis" and "Gorbatjov" are marked out.

Figure 21: A sample of the IOB1 format used when training the AllenNLP NER. All tokens within an entity are marked I-ENT, unless a new entity begins directly after a preceding entity, in which case its first token is marked B-ENT.

Figure 22: A sample of the Spacy CLI training format for NER, marked according to the BILOU annotation scheme

Figure 23: An extract of the created coreference resolution dataset in CoNLL-2012 format. The coreference annotations are in the rightmost column. Since the model to be trained is end-to-end, no parsing or tagging information is included.

References

[1] Kenton Lee et al. End-to-end Neural Coreference Resolution. Dec. 15, 2017. [2] AllenNLP. URL: https://www.allennlp.org. [3] AllenNLP - Demo. URL: https://demo.allennlp.org/coreference-resolution. [4] Annotation Specification - Spacy API Documentation. URL: https://spacy.io/api/annotation#json-input. [5] Jason Brownlee. How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification. URL: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/. [6] Mario Bunge. Epistemology and Methodology 1: Exploring the World. Kluwer Academic Publishers, 1983. [7] Francois Chollet. “Deep Learning with Python”. In: chap. 1.1. [8] Francois Chollet. “Deep Learning with Python”. In: chap. 4.1.1. [9] Francois Chollet. “Deep Learning with Python”. In: chap. 3.1. [10] Francois Chollet. “Deep Learning with Python”. In: chap. 5.1. [11] Daniel Jurafsky and James H. Martin. Speech and Language Processing, Third Edition draft. 2019. [12] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 18.2. [13] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 18.2.3. [14] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 6.3. [15] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 6.8. [16] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 15. [17] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 22. [18] Daniel Jurafsky and James H. Martin. “Speech and Language Processing, Third Edition draft”. In: chap. 18.2. [19] Eflomal, Efficient Low-Memory Word Aligner. URL: https://github.com/robertostling/eflomal. [20] EntityRuler - Spacy API Documentation. URL: https://spacy.io/api/entityruler.

[21] EuroParl Parallel Corpus. URL: http://statmt.org/europarl/. [22] Facts & Figures, Spacy Usage Documentation. URL: https://spacy.io/usage/facts-figures#benchmarks. [23] Giuliano Giacaglia. How Transformers Work. URL: https://towardsdatascience.com/transformers-141e32e69591. [24] Eric Hallström. Relation Extraction on Swedish Text by the Use of Semantic Fields and Deep Multi-Channel Convolutional Neural Networks. July 3, 2019. [25] How to train a neural Coreference Model. URL: https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe. [26] Marcus Klang. “Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking”. PhD thesis. Department of Computer Science, Lund University, 2019. [27] Language Processing Pipeline. URL: https://spacy.io/usage/processing-pipelines. [28] Natural Language Processing. URL: https://en.wikipedia.org/wiki/Natural_language_processing. [29] NER algo benchmarks. URL: https://towardsdatascience.com/benchmark-ner-algorithm-d4ab01b2d4c3. [30] NeuralCoref. URL: https://huggingface.co/coref/. [31] Om oss - Språkbanken Text. URL: https://spraakbanken.gu.se/om. [32] OntoNotes Release 5 - Linguistic Data Consortium. URL: https://catalog.ldc.upenn.edu/LDC2013T19. [33] Paracrawl - releases. URL: https://paracrawl.eu/. [34] Qi Liu, Matt J. Kusner, and Phil Blunsom. A Survey on Contextual Embeddings. Apr. 13, 2020. [35] robertostling/eflomal - efficient low memory word-aligner. URL: https://github.com/robertostling/eflomal. [36] Spacy 101: Everything you need to know, spaCy usage. URL: https://spacy.io/usage/spacy-101. [37] Stack Overflow. URL: https://stackoverflow.com/questions/60958714/spacy-parser-parses-the-whole-document-as-one-sentence. [38] Stockholm Umeå Corpus. URL: https://www.ling.su.se/english/nlp/corpora-and-resources/suc/stockholm-ume%C3%A5-corpus-suc-1.14045. [39] Stuart Russell and Peter Norvig. “Artificial Intelligence, a Modern Approach”. In: chap. 24.

[40] Stuart Russell and Peter Norvig. “Artificial Intelligence, A Modern Approach, Third Edition”. In: chap. 1.1. [41] SUC-CORE - Stockholm University. URL: https://www.ling.su.se/english/nlp/corpora-and-resources/suc-core. [42] Svenskt Språkdatalabb. URL: https://www.vinnova.se/p/svenskt-sprakdatalabb/. [43] swedish-bert-models. URL: https://github.com/Kungbib/swedish-bert-models. [44] Tagging Scheme For NER - Donovan Ong. URL: https://donovanong.github.io/ner/tagging-scheme-for-ner.html.

[45] UD Swedish-LinES. URL: https://universaldependencies.org/treebanks/sv_lines/index.html.

[46] UD Swedish-Talbanken. URL: https://universaldependencies.org/treebanks/sv_talbanken/index.html. [47] Understanding FastText: An Embedding To Look Forward To. URL: https://medium.com/@adityamohanty/understanding-fasttext-an-embedding-to-look-forward-to-3ee9aa08787. [48] Välkommen till Vinnova, Sveriges Innovationsmyndighet. URL: https://www.vinnova.se/. [49] Alexander Wallin. Creating a coreference solver for Swedish and German using distant supervision. Apr. 2, 2017. [50] What's New in Spacy 2.0. URL: https://spacy.io/usage/v2#features-models.

TRITA-EECS-EX-2020:607
