DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

A compact language model for Swedish text anonymization

VICTOR WIKLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: August 28, 2020
Supervisor: Mats Nordahl
Examiner: Olov Engwall
School of Electrical Engineering and Computer Science
Swedish title: En kompakt språkmodell för svensk textanonymisering


Abstract

The General Data Protection Regulation (GDPR) that came into effect in 2018 states that for personal information to be freely used for research and statistics it needs to be anonymized first. To properly anonymize a text one needs to identify the words that carry personally identifying information, such as names, locations and organizations. Named Entity Recognition (NER) is the task of detecting these kinds of words, and in the last decade a lot of progress has been made on it. This progress can largely be attributed to machine learning, in particular the development of language models that are trained on vast amounts of textual data in the target language. These models are powerful but very computationally demanding to run, which limits their accessibility. ALBERT is a recently developed language model that manages to provide almost the same level of performance at only a fraction of the size. In this thesis we explore the use of ALBERT as a component in Swedish anonymization by combining the model with a one-layer BiLSTM classifier and testing it on the Stockholm-Umeå corpus. The results show that the system can separate personally identifying words from ordinary words 79.4% of the time and that the model performs best at detecting names, with an F1-score of 87.7 percent. Looking at the average performance across eight categories we obtain an F1-score of 77.8% with five-fold cross-validation and 77.0 ± 0.2% on the test set with 95% confidence. We find that the system as-is could be used for the anonymization of some types of information, but would perhaps be better suited as an aid for a human controller. We discuss ways to enhance the performance of the system and conclude that ALBERT can be a useful component in Swedish anonymization, provided that it is optimized further.

Sammanfattning

I och med dataskyddsförordningen (GDPR) som började gälla 2018 krävs det att personlig information anonymiseras innan den kan användas fritt för statistik och forskning. För att anonymisera en text krävs det att man kan upptäcka de ord som bär på personlig information såsom namn, platser och organisationer. Named Entity Recognition (NER) är ett område inom datavetenskap som handlar om hur man automatiskt kan upptäcka dessa typer av ord, och under det senaste årtiondet har flera framsteg gjorts inom det. Dessa framsteg är i allmänhet resultatet av kombinationen av maskininlärning och bättre datorer, men speciellt utvecklingen av allmänna språkmodeller tränade på massiva mängder språkdata har varit viktig. Dessa modeller är kraftfulla men kräver mycket systemresurser för att använda, vilket begränsar deras tillgänglighet. ALBERT är en nyutvecklad språkmodell som levererar liknande prestanda med bara en bråkdel av antalet parametrar. I det här arbetet utforskar vi användningen av ALBERT för anonymisering av svensk text genom att kombinera modellen med en enkel BiLSTM-klassificerare och testa den på Stockholm-Umeå-korpuset. Våra resultat visar att systemet lyckas skilja på personligt identifierande information och vanliga ord i 79.4 procent av fallen samt att det är bäst på att känna igen namn, med en F1-poäng på 87.7 procent. Sett över de åtta mest intressanta ordkategorierna i korpuset erhåller vi en F1-poäng på 77.8% med femfaldig korsvalidering och 77.0 ± 0.2% på testdatan med 95% konfidens. Vi finner att systemet i dess nuvarande tillstånd skulle kunna anonymisera vissa typer av information, men att det troligtvis skulle fungera bättre tillsammans med en människa som dubbelkollar dess förslag. Vi diskuterar sätt att förbättra systemets prestanda och drar slutsatsen att ALBERT kan vara en användbar komponent i svensk anonymisering förutsatt att den optimeras till en högre grad.

Contents

1 Introduction
  1.1 Research Question
    1.1.1 Limitations
    1.1.2 Evaluation

2 Background
  2.1 GDPR
    2.1.1 Who does the GDPR apply to?
    2.1.2 What is personal data
    2.1.3 How is the data protected?
    2.1.4 GDPR and NER
  2.2 Named Entity Recognition
  2.3 Language models
    2.3.1 BERT
    2.3.2 ALBERT
    2.3.3 The issue of having a limited vocabulary
  2.4 Wordpiece Labels
  2.5 The SUC 3.0 Corpus
    2.5.1 Named Entity Abbreviations
  2.6 Related work
    2.6.1 Recent developments in NER with BERT
    2.6.2 A chronology of Swedish NER

3 Method
  3.1 Data Preprocessing
    3.1.1 Extraction of relevant data from SUC
    3.1.2 Wordpiece tokenization
    3.1.3 Wordpiece tagging
    3.1.4 Data cleaning
    3.1.5 Padding and Truncating
    3.1.6 Formatting
  3.2 Generating embeddings with ALBERT
  3.3 Model selection
    3.3.1 Architecture
    3.3.2 Conceptual understanding of the classifier
  3.4 Evaluation
    3.4.1 Training
    3.4.2 Experiments
    3.4.3 Metrics
  3.5 Resources and code
    3.5.1 Resources
    3.5.2 Code

4 Results
  4.1 Average cross validated performance
  4.2 Cross validation training
  4.3 Entity/Non-entity confusion
  4.4 Main Category Performance
  4.5 Full Category Performance
  4.6 Category confusion
  4.7 Model output
  4.8 Statistical measures
    4.8.1 Confidence intervals
    4.8.2 Spread of results

5 Discussion
  5.1 ALBERT and anonymization
  5.2 Class confusion
  5.3 Wordpiece results
  5.4 Ethical aspects
  5.5 Sustainability considerations
  5.6 Time and resources used
  5.7 Comparison with other results
  5.8 Future work
    5.8.1 Going more complex
    5.8.2 Exploring other datasets

6 Conclusion

Chapter 1

Introduction

It is common in modern society for personal information to end up stored digitally. This information can be everything from blog posts about your daily life to personal emails and medical history. Opinions on what information needs to be protected and to what degree may vary, but it is reasonable to state that most would prefer, if given the option, for their personal information not to be collected and exploited by others. At the same time the pursuit of perfect confidentiality and privacy should not be taken too far. Issues of feasibility aside, there are a lot of useful things that can be done with personal information. Consider the case where one could better study the effects of a viral treatment with full access to patient data, or when one wants to analyze traffic flow with location data from cellphones to avoid congestion in cities, or perhaps look into trends in crime or education on a nation-wide scale. There are without a doubt benefits to this kind of research, but the privacy of the individual should take precedence. One way to reconcile the two is to simply scrub or replace all information that could be used to identify an individual, a process known as anonymization or pseudonymization that was made compulsory on the 25th of May 2018 as part of the EU's General Data Protection Regulation (GDPR). Accordingly, one is only allowed to freely use private data for research and statistics if it is not possible to tie the data to a unique individual [42]. Detecting all such sensitive information is a time-consuming process that eventually becomes completely infeasible to perform manually as the amount of data to scan through grows too large. A natural way of countering this is by automation, and luckily the field of Natural Language Processing (NLP) has made great leaps in the past decade. In particular the pieces of data that could be considered as personally identifying (such as names, locations, dates,


objects, organizations, products, etc.) fall under the scope of Named Entity Recognition (NER), an active subfield of NLP. Some of the best results achieved on named entity recognition were obtained with a language model released by Google known under the name BERT [10]. This model is computationally heavy however, with the base model requiring a GPU with 12-16 GB RAM to run, and larger models going beyond that [10]. In fact most NLP models now require multiple instances of specialized hardware such as GPUs or TPUs, which limits the accessibility of the technology [38]. This constitutes a problem as high-quality NER is useful for all actors who deal with personal information, but not all have access to these resources. In 2020 Lan et al. [25] published a new model that went against the trend of increasing model complexity with their release of ALBERT, a much more compact version of BERT parameter-wise with only a minor degradation in performance. This combination of few parameters and high performance in a single model makes state-of-the-art NLP something more people could make use of, though given how recent the model is it had not yet been evaluated for NER at the time of writing [25]. Most research in NER has been conducted on English, but privacy and GDPR compliance are relevant for all languages. Given the ALBERT model's potential as a key component in lightweight anonymization systems, this thesis aims to evaluate it for the Swedish use-case.

1.1 Research Question

Is the language model ALBERT suited for anonymization of Swedish texts?

1.1.1 Limitations

• This work will not seek to optimize performance and should only be considered a proof of concept.

• This work will only treat the use-case of Swedish text.

• This work does not pre-train its own ALBERT model, but uses one provided by the Swedish KB-lab [18].

• This work will only consider eight categories of named entities, as a more fine-grained approach would lead to too few examples of each category.

• The discussion on the usability of the system will be limited to the areas of privacy and anonymization.

1.1.2 Evaluation

To answer this research question we will combine ALBERT with a simple classifier and measure the system's ability to detect named entities from different categories on the Swedish Stockholm-Umeå 3.0 corpus. We will compare per-category performance and measure precision, recall, and F1-scores, as well as study how the system handles class confusion. We will look at the rate of false positives and false negatives for the different classes and discuss how this might affect anonymization efforts. We will also consider how the model handles words that can be classified as multiple types of entities. In order to enhance the scientific value of the results we will make use of five-fold cross validation and repeat the experiments with fifty differently initialized classifiers.

Chapter 2

Background

To better understand how GDPR, NER and ALBERT are connected this chapter presents a more detailed description of each of these subjects together with information about the dataset and related work. In particular this chapter describes:

• GDPR and privacy

• NER for anonymization

• ALBERT and wordpiece embeddings

• BIO-tagging

• The SUC 3.0 corpus

• Related work


2.1 GDPR

The General Data Protection Regulation was put into effect on the 25th of May 2018 with the goal of safeguarding the personal data of all individuals within the territories of the EU. This section describes who needs to adhere to the regulation, what data the regulation aims to protect, in what way the data is protected, as well as why NER is relevant for ensuring this protection.

2.1.1 Who does the GDPR apply to?

GDPR applies to any entity that in any way processes the personal data of an individual physically present in the territories of the European Union, regardless of whether this processing is done within the Union or outside, and regardless of whether the individual is a European citizen or not. Processing is defined in the regulation as:

‘...any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;’ - GDPR Article 4, "Definitions" [35]

It can thus be stated that any industry, organization or entity, regardless of size, that interacts with someone within the EU is subject to the regulation.

2.1.2 What is personal data

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person; - GDPR Article 4, "Definitions" [35]

In brief, all data that can be associated with a unique individual is personal data for that individual. As an important distinction from earlier regulations the GDPR does not only apply to structured data, but also to unstructured data. That means that the regulation also covers digital content such as social media posts, emails, text documents, audio, images, video and more, which was previously out of scope. This is significant since it is estimated that over eighty percent of all data in an organization exists in an unstructured format [33], and this type of information is growing at a fast rate.

2.1.3 How is the data protected?

Processing personal data in any way is prohibited by default. Per Article 6 of the GDPR there are only six cases where an entity is legally allowed to process personal data.

‘Processing shall be lawful only if and to the extent that at least one of the following applies:

(a) the data subject has given consent to the processing of his or her personal data for one or more specific purposes;

(b) processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract;

(c) processing is necessary for compliance with a legal obligation to which the controller is subject;

(d) processing is necessary in order to protect the vital interests of the data subject or of another natural person;

(e) processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller;

(f) processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.

- GDPR Article 6, "Lawfulness of processing" [35]

In general terms the article states that personal data can only be processed if a person has given their explicit consent, or if it is necessary and in their interest. If an organization violates this rule it becomes subject to fines of up to twenty million euro or up to four percent of their annual worldwide turnover, whichever is higher [35]. While this is excellent for promoting confidentiality and privacy it does bring about some drawbacks. In particular there are many societal benefits to be had in using private data. Location data for example can be used for optimization of traffic flow and urban planning, and medical history can be used to study long-term effects of medicine and treatments. For this reason it is stated in the regulation that:

"... principles of data protection should therefore not apply to anonymous information, namely information which does not re- late to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes" - GDPR Recital 26, "Not Applicable to Anonymous Data" [35]

As long as the data is anonymous it is legal to use, and anonymization (in which all identifying information is removed) is one of the suggested ways to make an organization compliant with the GDPR, as is pseudonymization (in which all identifying information is replaced by pseudonyms). For a text to be considered properly anonymized it should not be possible to single an individual out from the data in the text [35].

2.1.4 GDPR and NER

From the previous sections it is clear that GDPR compliance is relevant to nearly all organizations, big or small. The majority of data an organization processes tends to exist as unstructured, difficult-to-overview information containing personal data, making it subject to the regulation. One way to support compliance is through anonymization, which also makes the data possible to use for research and statistical purposes. Of course, to actually anonymize a text, one has to obscure all personally identifying information within it in such a way that no unique individual can be singled out. To that end the automated recognition of named entities - NER - is a useful

tool. It allows for the automatic detection of names, locations, organizations, numbers, dates, products and events which could be used to single someone out, without manual effort. For this reason the development of named entity recognition systems is very interesting in the context of privacy and GDPR-compliance.

2.2 Named Entity Recognition

Named Entity Recognition (NER) is a topic concerned with the question "Given a word 'X', what kind of entity is 'X'?" Entity in this context is a colloquial term in natural language processing that can be considered equivalent to 'category', the most common being Names, Locations, Organizations and Miscellaneous [52]. More fine-grained groupings exist as well, tuned for specific use-cases and different corpora. This thesis for example also includes the entities Events, Products, Measurements, Times and Works of Art in its scope. The recognition of these entities is useful in many contexts. It has played a central role in improving performance on tasks such as Machine Translation [3], Question Answering [39] and Information Retrieval [19], to name some. More relevant to this thesis is the way it can be used to detect words that might carry sensitive information such as names, organisations or locations. Due to this potential NER has been evaluated for the anonymization of court orders [31], the anonymization of clinical texts [4] and the removal of sensitive data in material from the police [14]. All in all there seem to exist many uses for a system that can detect and categorize named entities with high precision. Developing sufficiently good NER systems has not been trivial however. Early attempts at NER consisted of checking words against lists of manually added named entities, so-called "gazetteers" [52, 17, 28]. This approach, while easy to implement, suffered from several drawbacks. For one it could not deal with words that had not been added to the gazetteer. If a parent ever gave their child a name not in the lists, these systems would not be able to anonymize text about that child. As a natural next step, more advanced NER systems made use of grammatical and syntactical rules to determine whether something was an entity or not, often in combination with expert domain knowledge and adapted to specific use-cases [52, 28]. One such system could be tuned for medical journals, consisting of rules that checked punctuation, whether a word was capitalized or not, as well as searching through words for subparts such as 'cardio',

'osseo', 'neuro', et cetera. These systems were more powerful, especially in combination with gazetteers, but they still suffered from drawbacks. They were costly to develop and tune for each specific use-case, they relied on humans not making any syntactical or grammatical mistakes, and they had problems with ambiguity - the fact that a word can have different meanings depending on context [52, 28]. The most recent advances in named entity recognition, starting with the work of Collobert et al. in 2008 [7], have aimed to combat all these problems. In particular machine learning, word embeddings and deep learning have pushed the field of NER forward [49]. Through machine learning the need to rely on experts has been reduced; as long as enough data exists the system can 'learn' these rules on its own. With word embeddings, systems no longer operate on just the literal words, but on vectors embedded with great amounts of semantic information describing each word. Through deep learning, systems have become able to model long-term dependencies and thus take context into account in their calculations. Leveraging all of these advances, the most recent results in NER are obtained by creating large-scale machine learning models that are trained on vast amounts of text in order to develop information-rich word embeddings [49, 10]. BERT, and its successor ALBERT that is used in this thesis, are two examples of such language modeling architectures.

2.3 Language models

The development of general purpose language models is one of the great steps forward in natural language processing. These models tend to be multi-layer neural networks trained on large amounts of textual data to generate powerful word representations. These similarities aside there are however many differences in the conceptual foundation and architectural details of these models, each with their own advantages and disadvantages [10, 32, 51]. In this thesis two of these models are examined in greater detail: BERT, developed by Devlin et al. [10], and its successor ALBERT, developed by Lan et al. [25]. The following subsections describe the BERT architecture, the idea behind word embeddings, the way these models gain the ability to model language, as well as the differences between BERT and ALBERT.

2.3.1 BERT

BERT (Bidirectional Encoder Representations from Transformers) is a general-purpose language modeling architecture released in 2018 by Devlin et al. [10]. It is a neural network with several components that exists in different variations, with most of these variations stemming from changes to the size of its hidden layers. In its base version the model consists of twelve layers with a hidden size of 768 and 110M parameters [10]. The most significant advantage BERT offers over earlier language models lies in its bidirectionality. When acting upon a word in a sentence it not only takes into account context to the left of the word, but context to the right as well. This bidirectionality comes from the fact that BERT uses so-called transformers, a concept introduced by Vaswani et al. [41] as an alternative to recurrent and convolutional neural networks. Recurrent and convolutional neural networks have proved adequate for many sequence-related tasks, but show limitations when it comes to speed of processing and the ability to relate elements far apart in sequences, something that can be quite important when dealing with sentences [41]. Transformers get around this problem by leveraging a technique known as attention, in which each element is replaced by a weighted average of all elements in the sequence. In this way the entire sequence processing can be made parallel, improving speed, and the distance between elements in the sequence becomes a non-issue. BERT does not, however, use a single transformer. Instead it stacks multiple ones on top of each other. After passing through one transformer layer each element in the sentence becomes a weighted average of all elements in the sequence. By passing through multiple transformers different levels of abstraction are produced, with the first layers producing slightly coupled representations and the later layers producing highly coupled transformations. In the context of this thesis these elements would be words, though it would be more correct to describe them as representations of words instead. These representations are vectors, arrays of numbers that describe different aspects of the word. This idea of replacing words with representations of them can be traced back to the work of Rumelhart et al. [36] in 1986, though the way they are generated has changed greatly since then. In their current state these representations, better known as word embeddings, act as one of the key components in most recent general-purpose language models [10, 32, 51].
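To make the attention idea above concrete, the following is a deliberately simplified sketch (no learned query/key/value projections and no multiple heads, so it is not code from BERT or from this thesis) in which each token vector is replaced by a similarity-weighted average of all token vectors in the sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(X):
    # X has shape (sequence length, embedding size); each row is one token.
    scores = X @ X.T / np.sqrt(X.shape[-1])   # pairwise similarities between tokens
    weights = softmax(scores, axis=-1)        # each row sums to one
    return weights @ X                        # weighted average over all tokens

X = np.random.randn(3, 4)                     # three toy tokens, 4 dimensions each
print(toy_attention(X).shape)                 # (3, 4): same shape, now contextualized
```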

Word embeddings

A word embedding is nothing more than a different way to represent a word to make it more meaningful for computers to interact with. As a toy example one could consider the word "Gladly". Instead of that sequence of characters one could embed information about the word in a vector such as:
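A hypothetical labeled embedding for this toy example might look like the following (the feature names and values are invented purely for illustration; real learned embeddings carry no such human-readable labels):

"Gladly" → [ positivity: 0.9, is adverb: 1.0, formality: 0.4, relates to emotion: 0.8, ... ]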

In such a way each word is described in a richer fashion, and since the data in the embeddings can consist of numbers rather than characters it becomes possible to perform operations on the different dimensions of the words. One of the more influential papers on the subject of word embeddings was published in 2013 by Mikolov et al. [27], describing the potential of using neural networks to learn embeddings rather than having a human define them. Embeddings created this way were shown to explicitly encode many linguistic regularities and patterns, capturing both syntactic and semantic word relationships [27]. One of the largest drawbacks with this approach is the fact that it is difficult to understand what each value in the embedding actually means. In the toy example above it is clear what type of feature each value corresponds to, but when the embeddings are learnt automatically only the numbers are produced, without any convenient labels. Despite this lack of insight into what the values mean explicitly, these embeddings have been used to great effect and most modern language models tend to use them. This holds for BERT and ALBERT as well, both developing their own word representations during a process known as pretraining [10, 25].

Pretraining: How the models learn a language

Both BERT and ALBERT are models created with the intent to emulate a general understanding of language. In practice this understanding is achieved by training on vast amounts of textual data in an unsupervised fashion with the goal of maximizing performance on two tasks. The first of these tasks is Masked Language Modeling (MLM) [10], in which some percentage of the words in the input to the model are hidden, and the goal of the model is to predict these words based on their context. The second is Next Sentence Prediction (NSP) [10], in which the goal is for the model to, given an arbitrary sentence, successfully predict which among all the sentences in the data should be the next one. Becoming adept at these two tasks is taken as an indication that the model has achieved a general understanding of language. In essence, the ability to

infer what words should be at what position in a sentence and how to chain sentences in a logical way for a particular language. One important aspect of becoming better at these tasks is modifying the 768-dimensional internal representation the model keeps of each word. Each step of the training makes the embeddings better and better descriptors of each word, until the model stores embeddings that work well for the aforementioned tasks no matter which sentence they appear in. This leads to the model producing general-use embeddings for each word, a process that takes approximately ninety-six hours on sixteen TPU chips for the most basic BERT model [10]. While embeddings that work decently in any context are desirable on their own, this is usually only the first step in using BERT and ALBERT. The next step is fine-tuning, adapting the embeddings for a specific task.
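As a toy illustration of the masked language modeling objective described above (an invented helper, not the actual pretraining code, which additionally involves random token replacement and operates on wordpieces rather than whole words), the snippet below hides a fraction of the tokens in a sentence; during pretraining the model is trained to recover the hidden originals from the surrounding context:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)   # the model only sees the mask...
            targets.append(tok)         # ...and must predict the original token
        else:
            masked.append(tok)
            targets.append(None)        # no prediction needed for unmasked positions
    return masked, targets

print(mask_tokens("Hunden jagade katten genom parken".split()))
```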

Finetuning: becoming specialized at a task

Once pretrained, BERT and ALBERT come with a good general language understanding in the form of powerful word embeddings that work well in any context. There are however many tasks in which performance can be enhanced by fine-tuning the model [10]. Such tasks could be sentiment analysis, text summarization, translation or, as in this thesis, named entity recognition. While the exact process varies depending on the task, fine-tuning tends to take hours rather than days like pre-training, providing yet another advantage of general language models [10]. Once pretrained, models can be saved and shared, making pretraining a one-time effort. Following this procedure Devlin et al. [10] produced state-of-the-art results on eleven tasks when fine-tuning the same pretrained BERT model, and it has since played a key role for both research and industrial purposes. One drawback of using BERT is that in order to fine-tune even the most basic version of it for a specific task, it is recommended to use a GPU with at least 12 GB of RAM, with larger versions of the model exceeding that [13]. To counter this problem researchers looked into which factors contributed the most to performance and which contributed the least in order to create a more efficient version, with ALBERT being the realization of these efforts.

2.3.2 ALBERT

ALBERT (A Lite BERT) is one of the most recently released general purpose language models at the time of writing, becoming open-sourced shortly after the 20th of December 2019 by Lan et al. [25]. Different from many other developments in natural language processing, the primary purpose behind creating

ALBERT was not to push the state of the art forward for some arbitrary tasks. Instead it was to develop a model capable of leveraging the insights gained from studying its predecessor, BERT, in order to create a lighter model.

How ALBERT differs from BERT

Much like the basic BERT model described previously, the ALBERT model used in this paper is a transformer encoder with twelve layers and a hidden size of 768 that uses the attention mechanism to produce context-dependent word embeddings [25]. The model is functionally the same as BERT, though it has two important structural changes implemented for the sake of efficiency.

1. Decreased redundancy: Transformer-based neural networks such as BERT rely on stacking independent layers on top of each other, with the idea that each layer will create its own representation of the input. The actual operations carried out at each layer were however found to do essentially the same things, and as such ALBERT introduced parameter sharing across all layers, at the cost of some accuracy [25].

2. Decoupling hidden space from embedding space: In BERT the dimensionality of the embeddings corresponds to the size of the hidden layers of the model [25]. If you use a BERT with a hidden size of 768, the embeddings will be 768-dimensional as well. If you use a hidden size of 2048, the embeddings will also be 2048-dimensional. With ALBERT the embedding space is considered separate from the hidden space. In practice this change reduced the number of embedding parameters from

O(V × H) to O(V × E + E × H), where V is the size of the vocabulary of the model, H is the size of the hidden space and E is the size of the embedding space. This parameter reduction is significant when one scales the models up, though it was noted to lead to some loss in performance [25]; a quick numeric illustration is given below.
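To get a feel for the magnitude, the following back-of-the-envelope calculation uses the commonly reported ALBERT-base settings V = 30,000, H = 768 and E = 128 (values taken from the ALBERT paper, not from this thesis):

```python
V, H, E = 30_000, 768, 128

bert_style = V * H            # one V x H embedding table, ~23.0M parameters
albert_style = V * E + E * H  # a V x E table plus an E x H projection, ~3.9M parameters

print(f"{bert_style:,} -> {albert_style:,}")  # 23,040,000 -> 3,938,304
```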

The cumulative effect of these changes is the base model of ALBERT, a model with only twelve million parameters in contrast with the 110 million parameters of BERT. Despite reducing the number of parameters by eighty-nine percent this model nevertheless showed respectable performance across the

benchmarks it was tested on [25]. Not only does this reduction make high-end NLP more accessible, it also reduces training time and acts as a form of regularization for the model. The parameter-sharing and decoupling introduced with ALBERT also make the model more scalable, and the model actually outperforms BERT on many tasks when it is scaled to a similar size [25]. Since the focus of this thesis is on the compact size of the model rather than pushing the state of the art for some task, only the basic unscaled version of ALBERT will be used in this project.

2.3.3 The issue of having a limited vocabulary

During pretraining the model creates and modifies an internal representation of the words it encounters. Yet it is computationally and space-wise infeasible for the model to keep a representation of each possible word in the language. In Swedish alone more than 100,000 words exist in the official dictionary, and that is without taking into account all possible variants of a word through conjugation, slang or even misspellings. Each of the representations the model produces is a 768-dimensional vector where each element is a 64-bit float, and creating a lookup table for each possible string would be slow and wasteful. For this reason both BERT and ALBERT have a limited vocabulary of 30,000 words. If a word is not in that vocabulary the models cannot represent it. To get around this problem these models make use of wordpiece tokenization, a technique introduced by Wu et al. [48] in which words outside of the vocabulary get broken down into pieces that are in the vocabulary. This is a powerful approach for two main reasons.

1. The model becomes capable of interacting with any input, from misspelled words to slang to strange Swedish composite words that have never even been written before.

2. Words actually tend to consist of meaningful subwords, and splitting words into pieces may thus enhance the model's ability to model and understand language. Consider the words "Invulnerable", "Indigestible" and "Infinite", which may be split into "In" and "vulnerable", "In" and "digestible", "In" and "finite". There is a relation there, that "In" negates the meaning of the second part when the two are made into a composite. This sort of relation may be captured during the pretraining of the model.

2.4 Wordpiece Labels

ALBERT expects its input to be in the form of wordpieces, a type of tokenization described in Section 2.3.3. But one also has to consider what happens with the labels of each word when the words are split apart. The word "Oscar" might be labeled as a "person" (abbreviated as PRS in the SUC dataset), but if Oscar is not one of the words in the vocabulary of the model it might get split into "Os" and "car". Just keeping the PRS label for the first piece would be incorrect, since "Os" is not actually a name. Labeling both pieces with PRS shares the same problem. The most common approach to solving this problem is the so-called BIO (or IOB) tagging scheme. In this scheme labels are prepended with a B (Beginning) if they correspond to the first token of a split entity, I (Inside) if they correspond to a token inside the entity, and O (Outside) if the token is not part of an entity. Following the example from earlier, "Os" would be tagged with "B-PRS" and "car" would be tagged with "I-PRS". It is worth noting that many other studies have found improved results with alternative tagging schemes. One study [34] tested a BILOU scheme, an expansion of the BIO scheme that also takes into account whether a wordpiece is the last one in the entity (L) or constitutes a single-unit entity (U), and found it to significantly outperform the BIO scheme with an average increase of 0.7 percent accuracy on both validation and test sets. In a medical context Dai et al. [8] compared BIO, BIOE (with E corresponding to End, functionally equivalent to the L tag in the BILOU scheme) and BIOES (with S corresponding to Single unit, for the case where a word was not split into pieces) schemes and found that BIOES was the best with regards to precision, recall and F1-score. While some work [24] seems to have noted only a negligible improvement from using BIOES as compared to BIO, Yang et al. [50] found a statistically significant (p < 0.05) advantage of BIOES over BIO.
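As a small illustration of the scheme (an invented sentence, not data from the corpus), a tokenized sequence and its BIO labels line up as follows:

```python
# "Oscar bor i Sovjetunionen", with "Oscar" split into "Os"/"car" and
# "Sovjetunionen" split into "Sovjet"/"unionen" by the wordpiece tokenizer.
wordpieces = ["Os",    "car",   "bor", "i", "Sovjet", "unionen"]
bio_labels = ["B-PRS", "I-PRS", "O",   "O", "B-LOC",  "I-LOC"]
```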

2.5 The SUC 3.0 Corpus

The Stockholm-Umeå Corpus (SUC) is a collection of Swedish texts totalling 1,166,593 words in 74,245 sentences obtained from a wide array of sources from the 1990s onwards. The latest version (3.0) was released in 2012 with the intent to serve as a benchmark corpus for the Swedish language. To that end the creators have tried to ensure that it contains a balanced mix of styles and genres, that it reflects commonly read texts, and that it avoids using texts translated into Swedish to the greatest extent possible. Each word

is annotated with features such as part-of-speech tags, genre, source, morphological features and more [15]. Most important for this work is the fact that the creators also made use of named entity tagging. Each named entity belongs to one out of fourteen classes described in Fig. 2.1 and one of the sixty-one subclasses that are used for a more fine-grained description of an entity. For example, while "Eric" and "Freja" are both names, one has subtype person while the other has subtype mythological. For this project such a fine-grained approach has not been considered, in order to reduce the number of classes.

Figure 2.1: The principal entity types in the dataset. A description of all the abbreviations can be found in Section 2.5.1

Fig. 2.1 presents a breakdown of the frequency of named entities in the cor- pus after cleaning had been applied to remove erroneous entries. The cleaning process is detailed in Section 3.1.

2.5.1 Named Entity Abbreviations

The following list is a comprehensive set of the abbreviations for the named entities, ignoring subtypes, together with some examples of each type of entity. The category descriptions were taken from Borin et al. [5] and the examples were selected at random from the entire corpus.

• Entity: Person (PRS): Description: people names (forenames, surnames), animal/pet names, mythological names, etc. Examples: [’Mårten Landqvist’, ’Veikko Ahvenainen’, ’Klara Johanson’]

• Entity: Location (LOC): Description: functional, geographical, geo-political, astronomical, street names. Examples: [’Skå’, ’Zaires’, ’Norge’]

• Entity: Organization (ORG): Description: political, athletic, media, military, transportation, education etc. Examples: [’Bostadsdepartementet’, ’Veritas’, ’bulten’]

• Entity: Artifact (OBJ): Description: food/wine products, prizes, means of communication (vehicles), etc. Examples: [’Dry Martini’, ’SAAB 9000’, ’Chrysler Voyager’]

• Entity: Work&Art (WRK): Description: printed material, names of films, novels and newspapers, sculptures, etc. Examples: [’" Karl XII och släkten Carlstierna "’, Teknikens Värld, ’Världsbankens rapport’]

• Entity: Event (EVN): Description: religious, athletic, scientific, cultural, races, championships, battles, etc. Examples: [’the International Congress of Accountants’, ’teaterföreställningen Frankensteins monster’, ’Skandiacupen’]

• Entity: Measure/Numerical (MSR): Description: volume, age, index, dosage, web-related, speed etc. Examples: [’462 kronor’, ’84 procent’, ’565 Mkr’]

• Entity: Temporal (TME): Description: relating to time, both relative and absolute expressions Examples: [’de senaste två åren’, ’redan under medeltiden’, ’den 21 mars’]

• Entity: Outside (O): Description: any non-structural token not classified as an entity Examples: ’han’, ’blomma’, ’möjligtvis’

Outside of the main categories described above, the corpus also contained composite categories. No direct description of these was found but they were assumed to correspond to entities that an annotator had considered as equally likely to belong to multiple categories. One example from the corpus could be "Bilen utgör en stor satsning för Ford" (Eng: "The car is a big investment for Ford"), in which "Ford" could be thought to refer to either the person (Henry Ford), or the company (Ford).

• Entity: Location/Person (LOC/PRS): Examples: [’Hartford’, ’Irvines’, ’Basel’]

• Entity: Location/Organizations (LOC/ORG): Examples: [’Abisko’, ’Höganäs’, ’PERSTORP’]

• Entity: Organizations/Person (ORG/PRS): Examples: [’Nobel’, ’Carli’, ’Nike’]

• Entity: Location/Location (LOC/LOC): Examples: [’Smyrna’, ’Gelsenkirchen’, ’Kastrup’]

• Entity: Person/Work&Art (PRS/WRK): Examples: [’Allers’, ’Valkyrian’]

• Entity: Objects/Organizations (OBJ/ORG): Examples: [’Rolls-Royce’]

2.6 Related work

ALBERT is a very recent model and no research with it in the context of NER, Swedish or otherwise, had been published at the time of writing. For the sake of comparison this section describes some recent results in NER with BERT, as well as a chronology of how Swedish NER has developed.

2.6.1 Recent developments in NER with BERT

In 2018 Devlin et al. [10] released the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In it the performance

of BERT as a language model was evaluated for several tasks, among them named entity recognition on the CoNLL-2003 NER task [10]. Using a case-preserving wordpiece model with maximal document context they used two different methods, one consisting of fine-tuning the entire model for the NER task using the input data, the other consisting of using the model to first generate embeddings and then train a cheaper classifier on those embeddings. The first approach was noted to require more computational resources and achieved a final F1-score of 96.4%, while the second methodology (in which the last four hidden states were simply concatenated before being passed to the classifier) resulted in a validation F1-score of 96.1%, using a two-layer 768-dimensional BiLSTM. Only the first method was evaluated on a test set, resulting in an F1-score of 92.4% using the large BERT model with 340M parameters, claiming state-of-the-art performance [10]. In 2019 Souza et al. [37] made an attempt to pre-train and evaluate a large version of the BERT model for use with the Portuguese language. Much like the original paper on BERT [10] both the feature- and fine-tuning-based approaches were used. Differently from that paper however, the authors also added a linear-chain conditional random field on top of the token-level classifier (a one-layer BiLSTM with one hundred units in each direction). When training for fifteen epochs they achieved state-of-the-art F1 scores on a Portuguese test set containing ten named-entity classes using the pre-training approach, with precision 80.08%, recall 77.31% and F1 78.67%. They do not report their results for the feature-based approach, but note that the performance gap was much higher than the reported values for NER on English. Also in 2019 Pires et al. [30] decided to evaluate a BERT model intended to work well no matter which language it was tested on, a "multilingual" BERT. They did this by training a BERT model on data from 104 languages before fine-tuning it for English, German, Dutch and Spanish NER respectively. Each fine-tuned model was tested on all four languages, and the four intra-language F1 scores obtained were respectively 90.70%, 82.00%, 89.86% and 87.18%, significantly higher than the inter-language scores. The classification layer used was not specified.

2.6.2 A chronology of Swedish NER

One of the earliest attempts at Swedish named entity recognition and classification was made in 2001 by Dalianis and Åström [9]. They elected to use lexicons of known entities which were expanded by using learning rules on a training corpus of news articles. Testing on one hundred manually tagged

news texts they obtained a strict F1-score of 49% on recognition and correct classification, an increase of 9% compared to only using the original lexicons and word lists. Another early work in the area, Johannessen et al. [17], compared the performance of six systems developed concurrently in Sweden, Denmark and Norway, testing them in 2005 on news datasets from local newspapers concerning the same topic during the same time period. Using rule-based approaches, constraint grammars and finite-state grammars as well as some statistical methods based on maximum entropy, on average all systems obtained F1 scores below seventy percent, with the highest performing ones being those that made ample use of gazetteers (word-entity lists). Removing the gazetteers from the most successful system caused recall to drop from ninety-one to fifty-three percent however, highlighting the importance of tailoring systems by hand to achieve performance in a specific domain. In 2007 Kokkinakis et al. [22] made use of a model consisting of multi-word entity detection, finite-state grammars, detection of new entities, detection of entities via extensive domain-specific word-lists and refinement applied in a sequential fashion, developed for the Nomen Nescio project [17]. Testing this model on a medical corpus of 1450 entities they obtained an F-score of ninety-three percent - an excellent result, however owing much to the extensive word-lists tailored for identifying medical entities such as diseases and dysfunctions. In 2010 Borin et al. [5] made an attempt to apply NER to Swedish literary works to reduce the need for manual identification of names, locations, organizations, numerical values, objects and more, basing their approach on an earlier publication on the same subject [6]. Their method consisted of a pipeline of modules that checked entities against multi-word lists and single-word lists, as well as using a rule-based approach to detect entities. The main difference from the earlier publication lay in augmenting the original models with more complex rules based on gender, context and more. The entity categories used were however not described. The following year Ek et al. [11] created a similar application for information extraction from Swedish SMSs using regular expressions and training a classifier with logistic regression. While the corpus they used ended up being quite small for a training-based approach, they found that the regex and logistic classification approaches resulted in roughly the same score, the logistic classifier having slightly higher performance with an average F1 score of 77.73%. In 2014 Kokkinakis et al. [21] adapted a NER tagger conceptualized by

Kokkinakis in 2004 to a so-called Helsinki Finite-State Transducer platform and tested it on the Stockholm-Umeå Corpus, SUC version 3.0, containing fifteen different classes, achieving a balanced F1 score of 74.55%. This NER tagger was based on the idea of a sequential set of modules that carried out word and sentence matching against lists, grammars for detection of common entities and templates for finding new entities, in the same way as similar works in the area [5, 6, 22]. In 2016 the first attempt to use neural networks for Swedish entity recognition was carried out, the model in question being a character-based bi-directional LSTM. This work by Almgren et al. [2] made use of the Stockholm EPR medical corpus, trying to identify (1) disorders and findings, (2) pharmaceutical drugs, and (3) body structure. They achieved an average F1 score of 75% when it came to classifying detected entities, and 35% when taking into account both correct detection and correct classification. As an important distinction from previous works, no word lists or gazetteers were used in this approach. One of the more recent approaches, by Weegar et al. [45], made use of a Swedish medical corpus consisting of a training set with 4754 entities and 1513 test entities split over the entities "Body part", "Disorder" and "Finding" [44, 45]. They achieved an average F1 score of 76.04% and concluded that deep methods were superior to shallow methods for this type of entity extraction. This is the latest of several articles published by the same author, who had previously studied medical corpora as well as shallow, deep and semi-supervised methods for Swedish and Spanish NER. The most recent results found using SUC 3.0 were published in 2018 by Klang et al. [20]. They designed a BiLSTM combined with CNNs and CRFs and extracted word-level, character-level and word-case-level features, generating embeddings with the Swedish Culturomics Gigaword Corpus. They selected three classes, PRS, LOC and ORG, grouping the remaining five classes into a MISC category before classifying. They obtained an average precision of 84.31%, recall of 84.32% and F1-score of 84.31%.

Chapter 3

Method

Named entity recognition is an instance of a classification problem: given a token, is it a named entity - and if so, what kind? Making use of ALBERT in NER is in essence a matter of combining the context-rich word embeddings generated by ALBERT with a sufficiently powerful classifier. This chapter describes the entire pipeline from data cleaning and formatting to classification and evaluation of the results, as well as the rationale behind the decisions. In particular the questions this chapter answers are:

• How was the data preprocessed for ALBERT?

• How was the output from ALBERT selected?

• How was the classifier architecture chosen?

• How was the model trained?

• How was the model evaluated?

• What resources were used and what code was written?


3.1 Data Preprocessing

The Stockholm-Umeå corpus used in this project is a richly annotated XML-tree. In order for the ALBERT model to be able to interact with it, it first needs to be processed into sequences of the exact same length consisting of encoded wordpieces. This section describes the six steps required to format and clean the data.

3.1.1 Extraction of relevant data from SUC

The XML-tree of the SUC has annotations on three levels: Text, Sentence, and Word. Each level is tagged with different information such as the text of origin, morphological structure, part-of-speech information and more. All of these features are possibly useful for the task of classifying a word as a named entity or not, however they are not used by ALBERT. The model only interacts with the words that exist in its vocabulary and disregards the descriptive information created by human experts. As such all these features were removed, keeping only the actual words in the sentences. When it came to selecting which labels to keep, the SUC provides eight possible types (e.g. PRS, LOC, ORG) and sixty-one possible subtypes (e.g. Animal, Person and Mythological are the subtypes of the PRS tag). Words with a type/subtype tag are the named entities that are relevant for this project, and one of the first decisions to make was which level to work on. While it could have been interesting to look at the fine-grained classification performance using all sixty-one subclasses, this was decided against for three reasons.

1. First and foremost using all sixty-one subtypes would risk diluting the support for the different labels to such a degree that the model would entirely fail to classify them, when it might be able to find a pattern in a more coarse grouping.

2. Secondly, it would have made comparisons with other work done in the field more difficult, as most have elected to use only the four classes LOC, PRS, ORG, MISC.

3. Thirdly, it would have made the results unnecessarily difficult to overview and reason about, especially as such a fine-grained approach was outside of the principal focus of the thesis.

This type of extraction resulted in 1,166,593 (word, type) pairs, each of which was tagged with the index of the sentence it appeared in. The reason

for this was that ALBERT acts upon a sentence level, and a way to easily group the words into sentences was needed.
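A rough sketch of this extraction step is given below. It is not the thesis code, and the element and attribute names ("sentence", "w", "name") are assumptions about the SUC 3.0 XML layout rather than its documented schema:

```python
import xml.etree.ElementTree as ET

def extract_pairs(path):
    """Collect (word, entity type, sentence index) triples from a SUC-style XML file."""
    pairs = []
    tree = ET.parse(path)
    for sent_idx, sentence in enumerate(tree.iter("sentence")):
        for token in sentence.iter("w"):
            entity = token.get("name", "O")   # fall back to "O" for non-entities
            pairs.append((token.text, entity, sent_idx))
    return pairs
```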

3.1.2 Wordpiece tokenization

The next step was to split all words outside ALBERT's vocabulary into smaller pieces that are in the vocabulary, the process known as wordpiece tokenization described in Section 2.3.3. This step was carried out using the AlbertTokenizer provided by the Swedish KB-lab through the HuggingFace transformers library [18, 16]. The AlbertTokenizer is an implementation of the SentencePiece tokenizer published in 2018 by Kudo et al. [23], an unsupervised method to find the most frequent and diverse wordpieces that exist in a corpus, given that exactly x wordpieces must exist. In practice the tokenizer takes each word it encounters and attempts to split it into the largest possible pieces that exist in its vocabulary. For common words like "Efteråt" (Eng: "Afterwards"), no split at all may happen. A compound word - of which there are especially many in Swedish - like "Studievägledningsdiskussionen" (Eng: "The student guidance discussion") may be split into the pieces "Studie", "väglednings", and "diskussionen". If the model encounters truly strange words like "Åz$xY" it may have to go as far as splitting the word into the character-level pieces "Å", "z", "$", "x", and "Y". The fewer pieces a word gets split into, the more common it is. Because ALBERT continuously modifies its internal representation of each wordpiece in its vocabulary during pretraining, it ends up developing better embeddings for the larger (more common) wordpieces, and worse embeddings for misspelled and rare words.
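A minimal sketch of this step using the HuggingFace API is shown below; the model identifier is an assumption about how the KB-lab model is published and may differ from the exact checkpoint used in the thesis:

```python
from transformers import AlbertTokenizer

# Assumed identifier for the Swedish ALBERT published by the KB-lab.
tokenizer = AlbertTokenizer.from_pretrained("KB/albert-base-swedish-cased-alpha")

print(tokenizer.tokenize("Efteråt"))                        # common word: likely a single piece
print(tokenizer.tokenize("Studievägledningsdiskussionen"))  # compound word: several pieces
```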

3.1.3 Wordpiece tagging

When a word is split into subwords one has to decide what label to use for the resulting pieces. Consider "Sovjetunionen" (The Soviet Union) which has the label "LOC". When split into "Sovjet" and "-unionen" one wants to use a labeling scheme that conveys that these two pieces are still part of the same named entity. Labeling both with "LOC" would cause them to be treated as distinct entities, and labeling the second as a non-entity would be incorrect, since it is not a non-entity in this context. For that reason a special tagging scheme is needed. The majority of related work makes use of BIO-tagging as described in Section 2.4, and it is also the one used in this paper.

It is important to note that this thesis diverged from the common way of applying the BIO scheme, in which any non-entity - regardless of it being a beginning or inside wordpiece - is simply labeled with "O". Instead it was decided to label any non-entity beginning wordpiece as "B-O", and any non-entity inside wordpiece as "I-O". The reason for this was the idea that it might improve the system's ability to model language. For example, "Starting" is a non-entity and thus has the label "O". After wordpiece tokenization, the word might have been split into the pieces "Start" and "ing". With the traditional scheme these would both be labeled "O". The modification used in this paper would have the first piece labeled "B-O" and the second "I-O". Since "ing" is a common suffix this would let the model learn that if it encounters the piece "ing", then the preceding piece is likely part of a non-entity as well. In this way the system may become able to model some rules of conjugation and tense, which could be an aid in classification. This possibility was deemed interesting enough to motivate the expense of splitting the class "O" into two.
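A minimal sketch of the modified scheme (not the thesis code) could look as follows:

```python
def tag_wordpieces(word_label, pieces):
    """BIO-style labels for the pieces of one word; non-entities ("O") are split
    into "B-O"/"I-O" as described above."""
    return [("B-" if i == 0 else "I-") + word_label for i in range(len(pieces))]

print(tag_wordpieces("PRS", ["Os", "car"]))     # ['B-PRS', 'I-PRS']
print(tag_wordpieces("O",   ["Start", "ing"]))  # ['B-O', 'I-O']
```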

3.1.4 Data cleaning

After the wordpiece segmentation and BIO-tagging were applied, the integrity of the data was checked in several ways. First the sentence sequences and corresponding label sequences were compared to ensure they were all the same length. They were not. 1127242 tokens (0.7% of all tokens in total) were found to be the token '\n'. Each of these newline tokens had a label that differed from the norm in the dataset. As an example, one of these tokens could have 'person' as its label instead of 'PRS'. Since the tokens in question were not actual words and their labels diverged from the rest of the corpus it was decided to treat them as erroneous data and remove them, after which the lengths of all sentences and their labels were verified to match. Following this, ten sentences from the entire dataset were picked at random and scrutinized in detail to ensure that the tagging scheme had been applied consistently and that labels and tokens matched. After this had been confirmed, five samples of each category were selected at random to ensure that they matched the description of the category. They did, though it needs to be stated that the definitions and labeling used in the corpus were not as rigorous as one might wish. An example of this would be the category TME (Time), containing all entities that relate to time in some manner - an already vague description. One entity belonging to this category is "redan under medeltiden"

("already during the middle-ages"). One could argue that "medeltiden" alone would have been better suited as a named entity, or even "under medeltiden". For better or worse the categories allow for flexibility when designating the named entities that belong to them. This was especially true for the compos- ite categories LOC/PRS, LOC/ORG, ORG/PRS, LOC/LOC, PRS/WRK, and OBJ/ORG. The randomly selected items did seem to support the assumption that they corresponded to the cases where it was difficult for even a human to decide between two classes. For this reason it was decided to keep them. Unfortunately it was observed later in the project that several of these entities were in fact incorrectly labeled. In hindsight there may have been better ways to deal with the composite classes, some of which are discussed in Section 5.8, Future work.

3.1.5 Padding and Truncating

The ALBERT model requires all input sequences to have the same length. This can be achieved by truncating sentences that are too long and appending padding to sentences that are too short. The most common approach to this part of the preprocessing is to simply pad all sentences to the length of the longest sentence, avoiding any truncation. This is done to ensure that the model is not trained on any broken or incomplete sentence structures, at the cost of requiring more memory and processing power when training the classifier. This project did not set the sequence length equal to the maximum sentence length. The reason for this was to preserve efficiency, in both computation time and disk space. With ALBERT one obtains an embedding matrix of shape (number of sentences) × (sequence length) × (hidden size). The longest sentence had a length of 404 tokens, so padding all sentences to this length would have resulted in an embedding matrix of size 74165 × 404 × 768 of sixty-four-bit floats, occupying roughly 184 GB of disk space. Studying the distribution of sentence lengths in Fig. 3.1, it was found that only 0.1 percent of all sentences were longer than one hundred tokens. It was therefore decided to truncate all sentences to one hundred tokens, reducing the size of the embedding matrix by roughly a factor of four at the cost of truncating seventy-seven sentences.

Figure 3.1: Distribution of tokenized sentence lengths

3.1.6 Formatting

After all sentences had been adjusted to the same length, a special classification token [CLS] was prepended to each sentence and a separation token [SEP] was appended to its end. The reason for this is that the ALBERT model creates special embeddings for the first and last tokens; [CLS], for example, gets an embedding that is representative of the entire sequence. These are not (necessarily) used for named entity recognition, but were added to comply with the input format ALBERT expects. Any sentence with fewer than one hundred tokens was padded with a special padding token. After padding, so-called attention masks were created for each sentence. Attention masks are another expected part of the input to ALBERT, in essence simple boolean arrays where True (1) means that the token should be attended to and False (0) that it should not. The masks were created so that all padding would be ignored and only the actual wordpiece tokens would be attended to by the model. The labels were added to the ALBERT vocabulary to ensure that they could be encoded, all tokens were then encoded into integers, and finally the entire 74165 × 100 token matrix and the attention-mask matrix were converted into tensors.
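The snippet below sketches the formatting described in Sections 3.1.5 and 3.1.6 for a single tokenized sentence: truncation, the [CLS] and [SEP] tokens, padding and the attention mask. The padding token string and the exact order of the operations are assumptions made for illustration; the label sequences are handled analogously in the real pipeline.

MAX_LEN = 100

def format_for_albert(pieces, pad_token="<pad>"):
    """Truncate, add [CLS]/[SEP], pad to MAX_LEN and build the attention mask."""
    tokens = ["[CLS]"] + pieces[:MAX_LEN - 2] + ["[SEP]"]
    attention_mask = [1] * len(tokens)
    while len(tokens) < MAX_LEN:
        tokens.append(pad_token)
        attention_mask.append(0)
    return tokens, attention_mask

tokens, mask = format_for_albert(["Efter", "åt", "regnade", "det", "."])
# tokens -> ['[CLS]', 'Efter', 'åt', 'regnade', 'det', '.', '[SEP]', '<pad>', ...]
# mask   -> [1, 1, 1, 1, 1, 1, 1, 0, ...]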

3.2 Generating embeddings with ALBERT

As described in the background, pretraining a language model is a computationally expensive step even for more compact models such as ALBERT, in some cases requiring several days on multiple powerful GPUs. It was possible to avoid this time-consuming step thanks to the efforts of the Swedish KB data lab [18], who released a case-sensitive ALBERT pretrained

on Swedish in 2020 and made it available through the HuggingFace transformers library [16]. Functionally speaking, ALBERT is a multi-layer neural network where each hidden layer generates a 768-dimensional representation for each token. Each of these representations is context-dependent, and each layer focuses on different parts of the context [43]. The original paper on BERT found that the best results were achieved when training the whole model together with the classifier as one unit. The authors did however also note that this was more computationally expensive than generating the features once and then feeding them into a separate classifier. The second approach resulted in a 0.3% drop in F1-score when the representations of the last four layers were concatenated into a 3072-dimensional feature vector, with an additional drop of 1% when only the 768-dimensional feature vector of the last layer was used [10]. Since part of the rationale behind choosing ALBERT was its lightweight nature, it was decided to use only the embeddings generated by the last layer, at the likely cost of some performance. This was additionally motivated by space concerns, as an embedding matrix using features from x layers would require x × 46 GB of disk space.
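The following is a minimal sketch of this feature-extraction approach: the sentences are run once through the frozen ALBERT model and only the last hidden layer is kept. The model identifier is an assumption, and the exact output attributes depend on the version of the transformers library.

import torch
from transformers import AlbertModel, AlbertTokenizer

name = "KBLab/albert-base-swedish-cased-alpha"  # assumed identifier
tokenizer = AlbertTokenizer.from_pretrained(name)
albert = AlbertModel.from_pretrained(name)
albert.eval()  # the language model is used as a frozen feature extractor

enc = tokenizer("Oscar har studerat på KTH i Stockholm.",
                padding="max_length", truncation=True, max_length=100,
                return_tensors="pt")

with torch.no_grad():
    output = albert(**enc)

embeddings = output.last_hidden_state  # shape: (1, 100, 768)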

3.3 Model selection

After deciding to only use the feature vectors from the last layer of ALBERT, the next step was defining a classifier architecture suited to classifying this output. The architecture chosen was a one-layer BiLSTM with 200 units in each direction and a recurrent dropout rate of 0.1, followed by a time-distributed classification layer, as described in Table 3.1. The model was created using the Keras library [1]. Section 3.3.1 describes the rationale behind the model architecture while Section 3.3.2 attempts to provide insight into how the classifier works conceptually.

Table 3.1: Classifier structure and parameters

Layer               Output shape       Parameters
Input               (None, 100, 768)   0
Bidirectional LSTM  (None, 100, 400)   1550400
TimeDistributed     (None, 100, 33)    13233

Optimizer: Adam
Loss function: Categorical cross-entropy
Activation function: Softmax
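A minimal Keras sketch of the architecture in Table 3.1 is given below. It reproduces the layer shapes and parameter counts of the table; the import path (tf.keras versus standalone Keras) is an assumption.

from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Input, TimeDistributed
from tensorflow.keras.models import Model

inputs = Input(shape=(100, 768))                        # ALBERT embeddings per sentence
x = Bidirectional(LSTM(200, return_sequences=True,
                       recurrent_dropout=0.1))(inputs)  # 1550400 parameters
outputs = TimeDistributed(Dense(33, activation="softmax"))(x)  # 13233 parameters

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()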

3.3.1 Architecture

The decision to use a BiLSTM for the hidden layer was motivated by this type of network being a standard way to achieve excellent results on many sequence-based tasks [45, 26, 29], owing to its ability to leverage information both before and after the token of interest. The reason only one layer was used instead of several was twofold. First, excellent results were achieved with this configuration during subset testing on 5000 sentences for different numbers of cells, as seen in Fig. 3.2. Second, as mentioned before, one motivating factor behind this thesis was the interest in evaluating a modern lightweight model, and just one additional BiLSTM layer with the same structure would increase the number of parameters by a factor of 1.65. Unless otherwise specified, all hyperparameters were the defaults provided by Keras [1]. The reason 200 units were chosen for the BiLSTM layer can be inferred from the results in Fig. 3.2. Not only did the cell count affect the results only marginally, it was also observed that validation accuracy goes down while training accuracy goes up as the number of cells increases. It is quite possible that the same results could be achieved with more units trained for fewer epochs, but it was reasoned that using fewer units and training for more epochs would allow the best cutoff point during training to be selected with greater precision while also keeping the parameter count down.

Figure 3.2: Accuracy and loss as a function of LSTM cells. Note that the ranges of the y-axes are very small

Figure 3.3: Parameters as a function of LSTM cells. Each unit on the y-axis equals one million parameters

3.3.2 Conceptual understanding of the classifier

The basic unit of information the model acts upon is a single sentence consisting of one hundred tokens, where each token has its semantic meaning represented by a 768-dimensional feature vector. This sequence of word embeddings is passed through each cell in the BiLSTM layer. Each cell acts upon the first word, "remembering" some of the information carried by the embedding and "forgetting" the rest. When the cell acts upon the second word this remembered information is taken as an additional input to the cell. Once again some information is remembered and used as an additional input when the cell acts upon the third word. This process is repeated all the way to the one-hundredth word, whose output carries information remembered from all the preceding words. The further back a word lies in the sequence, the less likely its information is to have been remembered, and vice versa. By default only this last output is returned by the cell, which would be useful if the goal were to classify the next, or perhaps the last, word. Since the goal is to classify all words in the sequence, however, the output is returned at each step. This does not solve the problem entirely, since the first word has no contextual information added to it, the second word has only one piece of contextual information, and so on. This is where the bidirectionality of the cell comes in; each cell is duplicated upon creation and fed the sequence in reverse. For each word two outputs are thus produced, one where information from all preceding words is remembered and one where information from all following words is remembered. Each cell thus returns two outputs for each word, and each cell learns to emphasize different parts of the context. For this model an output of shape 100 words × 200 contexts × 2 directions is obtained.

This output is sliced along the word axis by a time-distributed layer, which takes each slice of 200 contexts × 2 directions and passes it through a dense layer for classification into one of the thirty-three possible classes. In this way the system combines the rich contextual embeddings from ALBERT with a classifier capable of handling sequential data (such as words in a sentence) to predict what kind of entity each word is likely to be.

3.4 Evaluation

In order to evaluate the system, five-fold cross-validation was used as described in Section 3.4.1 to get a less biased estimate of the model's performance, after which fifty models were trained with randomly initialized weights for the number of epochs at which the lowest validation loss had been observed. The average scores of these models on different experiments constitute the main results, with a single model being used to provide examples of how the system acts upon sentences. These experiments are described in Section 3.4.2.

3.4.1 Training

All data was shuffled using a set random seed to ensure that no particular style or text would be over- or underrepresented in either the training or the test data. After shuffling, the data was split into a training set consisting of 90% of all data and a test set consisting of the remaining 10%. The distribution shown in Fig. 3.4 was consulted to ensure that the spread of entities in the training and test sets was roughly the same, with no over- or underrepresentation of any class.

Figure 3.4: Entity distribution in train and test set

To decide for how long to train the final model, five-fold cross-validation was used to get a less biased estimate of the model's performance, using twenty percent of each fold for validation. The hyperparameters used were the defaults provided by Keras. Training was conducted over ten epochs and the aggregate history over all folds was collected and used to determine for how many epochs to train the final model. The lowest average validation loss was observed at the end of epoch three, which guided the decision to train the final model on all training data for three epochs. Following this, fifty additional models were created and trained on the whole training set for three epochs, each model initialized with random weights drawn from a uniform distribution by the Keras Glorot uniform initializer [1].
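The sketch below outlines this epoch-selection procedure: five-fold cross-validation over the training data, tracking the epoch with the lowest average validation loss. The embedding and label arrays, their file names and the build_classifier helper (the BiLSTM sketched after Table 3.1) are assumptions made for illustration.

import numpy as np
from sklearn.model_selection import KFold

X = np.load("albert_embeddings_train.npy")   # assumed file, shape (N, 100, 768)
y = np.load("bio_labels_train.npy")          # assumed file, shape (N, 100, 33)

EPOCHS = 10
val_losses = np.zeros((5, EPOCHS))

for fold, (tr, va) in enumerate(KFold(n_splits=5, shuffle=True, random_state=42).split(X)):
    model = build_classifier()               # the BiLSTM from Table 3.1
    history = model.fit(X[tr], y[tr], validation_data=(X[va], y[va]),
                        epochs=EPOCHS, batch_size=32)
    val_losses[fold] = history.history["val_loss"]

best_epoch = int(np.argmin(val_losses.mean(axis=0))) + 1  # epoch three in this project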

3.4.2 Experiments

This subsection contains brief descriptions of the aspects of the system that were deemed interesting to test in the context of the research question.

Unbiased estimation of the system's performance on unseen data

The system's ability to deal with unseen data is important, as personal information comes in many forms, some never seen before (such as alternative spellings

of names, new organizations, et cetera). To reduce the risk of bias in the results, five-fold cross-validation is used, training each model on eighty percent of the training data and testing it on the remaining twenty percent. Precision, recall and F1-scores are aggregated and averaged over the five models.

Ability to discriminate between entities and non-entities

When it comes to anonymization or pseudonymization of data, what one needs is the ability to detect information that makes it possible to single a person out from others. Names are one such entity type, but locations, organizations and events may also be used to single an individual out. Since all named entity classes contain this type of identifying information, it may be useful to first consider the model's performance on a very coarse level - is a token a named entity or not? These results are calculated as the averaged performance of fifty models on the test set.

Ability to discriminate between main named entity classes

In some instances it may be more interesting to only remove one type of named entity rather than to erase data indiscriminately. In this case the per-category metrics are of interest. Only the B (beginning) tokens are taken into account, as this is a common way to treat named entity classification with wordpieces; the rationale is that if the first token is correctly classified it is easy to apply that label to the whole word in practice. These results do not take composite classes into account and are calculated as the averaged results of fifty models on the test set.

Ability to discriminate between all named entity classes

While the common approach is to only consider the B-tokens for classification, that does not mean it is the only one. Looking at the performance on all labels makes it possible to study the system on a more fine-grained level while also opening up a discussion of alternative ways to generate the final results. This experiment was selected because there may be better ways to conduct named entity recognition on wordpieces than just using B-tokens. These results do not take composite classes into account and are calculated as the average performance of fifty models on the test set.

Class confusion

It is common to only consider four categories for NER, these being PER, LOC, ORG and MISC. SUC however provides eight main categories, which increases the possibility of class confusion. A confusion matrix is created in order to see which categories are mistaken for one another, as this may be helpful information when adapting models for practical purposes. Only categories containing more than one hundred samples are shown in order to keep the confusion matrix human-readable. These results are calculated as the average performance of fifty models on the test set.

Actual output

For the purpose of understanding it can be helpful to see the actual output of the system. For this reason several examples of interest are presented. These results were created using a single model.

Statistics

In order to get a better understanding of how the performance varies across models, the spread of the results from the different models is plotted and confidence intervals are calculated for each category.

3.4.3 Metrics

This subsection contains a short description of the metrics used in the thesis and how they are calculated.

• True Positives (TP); When the category is x and we guess x

• True Negatives (TN); When the category is NOT x and we guess NOT x

• False Positives (FP); When the category is NOT x but we guess x

• False Negatives (FN); When the category is x but we guess NOT x

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision

Precision = TP / (TP + FP)

Recall

Recall = TP / (TP + FN)

F1

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Micro average

Obtained by summing the TP, TN, FP and FN over all classes and computing the metric of interest from these totals. Reflective of the performance when classes with little support are unimportant.

Macro average

Obtained by computing the metric of interest for each class separately and then taking the unweighted mean over all classes. Reflective of the performance when all classes are equally important.

Weighted average

Obtained by computing the metric of interest for each class, weighting it by the support of the class and normalizing by the total support. Reflective of the performance when the support of each class determines its importance.
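As an illustration, the snippet below computes the per-class metrics and the three averages with scikit-learn on a tiny hand-made example; the label strings and values are purely illustrative, and the thesis results may have been computed with other tooling.

from sklearn.metrics import classification_report, precision_recall_fscore_support

y_true = ["B-PRS", "B-LOC", "B-O", "B-O", "B-TME", "B-PRS"]
y_pred = ["B-PRS", "B-O",   "B-O", "B-O", "B-TME", "B-LOC"]

print(classification_report(y_true, y_pred, zero_division=0))

for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")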

3.5 Resources and code

This section describes the resources used for the project as well as the code written for the thesis.

3.5.1 Resources

A pre-trained ALBERT

One of the great advantages of general-purpose language models such as ALBERT is that once the model has been trained, adapting it to a specific corpus or task (such as NER) requires comparatively little effort. The pretraining phase is costly, but it is a one-time cost for a single group as long as they elect to share the trained model afterwards. Despite how recent the ALBERT model is, the computer lab of the National Library of Sweden (KB-lab) [18] has already pretrained one on 15-20 GB of text from various sources such as the Swedish Wikipedia and posts from internet forums [18]. Aside from their GitHub repository, the models are also hosted on S3 by Huggingface [47], who also provide a convenient package and API for interacting with the models [16]. Upon contact, the KB-lab stated that they believe they will be able to improve their ALBERT model by training it further. Even so, good results were observed when testing on a subset of the SUC dataset, motivating the choice to use their model rather than training one from scratch.

The SUC corpus

The Stockholm-Umeå corpus used in this thesis is provided by the Department of Linguistics at Stockholm University and is free to use for research, provided that the user signs a licence agreement with Språkbanken at the University of Gothenburg [15].

3.5.2 Code

All the code developed for the thesis is stored in a public GitHub repository [46]. The repository contains five notebooks with descriptive functions and saved output to allow for reader comprehension and reproduction of the results. The SUC XML file is not provided as it is subject to licensing. The model weights and embedding matrices are not provided due to space limitations. The notebooks in the repository are:

• Data_preparation.ipynb: Contains the entire data preprocessing pipeline, including conversion from XML to CSV, removal of erroneous entries, BIO-tagging, creation of attention masks, padding et cetera.

• Cross_validation.ipynb: The notebook containing the main results for the report. Contains code for creating and selecting the BiLSTM classifier architecture based on subset testing, for training and testing on each fold of the data, for training and evaluating a single model, for generating classification reports on different categories of named entities as well as confusion matrices, for training and averaging the results of an arbitrary number of models, and for calculating statistics on the models.

• Data_visualization.ipynb: Contains code for exploring the data, comparisons between named entity distributions in the training and test sets, a closer examination of words belonging to different categories along with their context, a closer examination of subclasses, and code for creating Section 2.5.1.

• main.ipynb: An attempt to move all central parts of the notebooks into a single one, to allow others to reproduce the entire process from scratch to final results by running a single notebook. Incomplete due to time limitations. Contains code for generating the full embedding matrix of ALBERT and storing it on disk.

• Sentence_tester.ipynb: Contains functions for loading the fully trained NER system and using it to classify any sentence, including those outside the SUC corpus. Generates descriptive tables where each wordpiece token, its predicted class and the probability of the prediction are presented, examples of which can be found in Section 4.7.

Chapter 4

Results

This chapter consists of eight sections, starting with the average cross-validated performance (Section 4.1) and the training history during cross-validation (Section 4.2). Then the system's results on the test set are presented, going from its ability to separate named entities from non-entities (Section 4.3) to its performance on the principal named entity types (Section 4.4) and its performance on all categories (Section 4.5). A confusion matrix is produced (Section 4.6) and system output for some example sentences is shown (Section 4.7), and the chapter ends with a statistical analysis of the models that were used to generate the results (Section 4.8).


4.1 Average cross validated performance

Figure 4.1: Average results over the five folds at the ten epoch mark

Fig. 4.1 shows the average results obtained when performing five-fold cross-validation for the eight principal categories, consisting of Events (EVN), Objects (OBJ), Measures (MSR), Persons (PRS), Times (TME), Organizations (ORG), Work&Art (WRK) and Locations (LOC). Only results on the B-labels are shown, since these are often considered the most important; if the first part of a word is labeled correctly, it is easy to apply that label to the rest of the word. The figure shows that recall is always lower than both precision and F1, emphasized by the darker coloring. This means that the model is more likely to produce false negatives than false positives, i.e. it is more likely to miss an entity of a given class than to wrongly assign that class. Recall is especially low for category OBJ: only 29% of all Objects were correctly classified, in contrast with category PRS (person), of which 85.5% of all entities were correctly classified. Overall, weighting the scores by the support of each class results in an average F1-score of 77.8% with a standard deviation of 0.1%.

4.2 Cross validation training

Figure 4.2: The evolution of metrics over epochs and folds during cross-validation

During cross-validation the lowest validation loss (0.018) was observed at epoch three, with a training loss of 0.020. The highest validation accuracy (0.993) was observed at epoch six, when training accuracy was 0.996. While it can be seen that the model begins to overfit during these ten epochs, it is worth noting that the scale of the overfitting is quite small; e.g. the difference in accuracy between the first epoch and the last is one percent, making the risks associated with overfitting very minor in this case. It is also worth noting that the metrics tracked during training were a function of all categories. For this reason the structural elements [CLS], [SEP] and the non-entities B-O and I-O were also taken into account, and their weighted contribution greatly skews the scores given their abundance. The accuracy and loss of the final model, trained for three epochs, were 0.993 and 0.021 respectively, values very similar to those observed during cross-validation.

4.3 Entity/Non-entity confusion

This report is written in the context of detecting personally identifying information in text. As all named entity classes are liable to hold such information, it is interesting to see the system's overall ability to separate entities from non-entities (labeled "O" for "Outside").

Figure 4.3: Confusion between named entities and non-entities

Fig. 4.3 shows the confusion between named entities and non-entities on the test set. The figure was created by using the system to predict labels for all wordpieces in all sentences in the test set, then comparing the predicted labels with the true labels. The predictions shown in the figure are the aggregate of the predictions of all fifty models. In total there were 133082 non-entities and 10176 entities in the test set. It can be seen that the model classifies entities as entities 79.4 percent of the time, entities as non-entities 20.6 percent of the time, non-entities as entities 0.7 percent of the time and non-entities as non-entities 99.3 percent of the time. This indicates that the model has a moderate tendency to classify entities as non-entities, though it is unlikely to make mistakes in the other direction. This tendency poses a problem for anonymization efforts, since it means that on average one in five entities (ignoring the specific categories) remains undetected by the system.

4.4 Main Category Performance

This section studies the performance of the system on the categories WRK, EVN, PRS, MSR, LOC, OBJ, TME and ORG on the test set. The composite categories LOC/PRS, LOC/ORG, ORG/PRS and LOC/LOC are not shown, as they all had a score of 0 on all metrics, owing to their rarity in both the training and test data. Only B-labels are shown, as this is the approach most commonly seen in related work, and the results are calculated as the average of fifty models.

Table 4.1: Classification report for the main entities

              precision  recall  f1-score  support
B-PRS         0.901      0.854   0.877     1553
B-LOC         0.864      0.779   0.820     968
B-TME         0.783      0.621   0.693     1227
B-MSR         0.725      0.629   0.673     247
B-ORG         0.755      0.551   0.637     316
B-WRK         0.678      0.352   0.463     80
B-EVN         0.667      0.245   0.359     41
B-OBJ         0.450      0.214   0.290     125
micro avg     0.839      0.722   0.776     10176
macro avg     0.728      0.531   0.601     10176
weighted avg  0.831      0.722   0.770     10176

Table 4.1 shows that the highest results were achieved on the categories Persons (PRS) and Locations (LOC) and the lowest on Work&Art (WRK), Events (EVN) and Objects (OBJ), with Time (TME), Organizations (ORG) and Measures (MSR) in between. The high scores on PRS and LOC are important, as names and locations are two types of information that, when combined, can act as powerful identifiers. Compared to the results from cross-validation a slightly higher precision can be observed, while recall and F1-score are lower. Nevertheless the values are similar enough that it seems reasonable to consider the cross-validated results a decent estimator of performance.

4.5 Full Category Performance

Table 4.2: Classification report for all entities

              precision  recall  f1-score  support
B-PRS         0.901      0.854   0.877     1553
I-PRS         0.884      0.858   0.871     2105
B-LOC         0.864      0.779   0.820     968
I-MSR         0.833      0.774   0.802     517
I-TME         0.799      0.739   0.768     1703
B-TME         0.783      0.621   0.693     1227
I-ORG         0.716      0.656   0.685     456
I-LOC         0.724      0.640   0.679     541
B-MSR         0.725      0.629   0.673     247
B-ORG         0.755      0.551   0.637     316
B-WRK         0.678      0.352   0.463     80
I-WRK         0.593      0.322   0.417     236
B-EVN         0.667      0.245   0.359     41
I-OBJ         0.485      0.224   0.307     49
B-OBJ         0.450      0.214   0.290     25
I-EVN         0.311      0.094   0.144     76
micro avg     0.824      0.732   0.775     10140
macro avg     0.698      0.535   0.593     10140
weighted avg  0.815      0.732   0.768     10140

Table 4.2 presents the averaged results from fifty models on all categories except the composite ones, ordered by F1-score. The composite categories were excluded because they all had an F1-score of zero. It can be seen that B-PRS remains the category with the highest score, followed by I-PRS, indicating that the model also performs well at classifying intermediate pieces of names. For both the Time and Measure categories the system appears more adept at detecting the intermediate pieces than the beginning pieces (76.8% vs 69.3% for TME, 80.2% vs 67.3% for MSR). Despite the larger number of classes, the weighted average remains close to the case where only the main categories were considered, with an F1-score of 76.8%.

4.6 Category confusion

To better understand where the system makes mistakes it is worth looking at the confusion matrix on the test set. The results in Fig. 4.4 are based on all categories that had a support of at least one hundred in the test data; this selection was made to improve the readability of the figure. Each element (x, y) shows the probability that the system predicts category y when the true category is x. As in the previous sections, the results are based on the aggregated predictions made by all fifty models.

Figure 4.4: Predictions on entities in test-set where support was greater than one hundred

Based on the confusion matrix, category I-WRK is the most difficult to classify correctly, with the models only succeeding 65.0 percent of the time and mistakenly classifying it as I-ORG with a probability of 13.8 percent and as I-PRS with a probability of 14.2 percent. The models also seem to have an

elevated risk of making mistakes within a category. E.g. I-TME is more likely to be confused for B-TME than for anything else and vice versa, and this pattern can be seen for all categories. I-TME is in fact the category least likely to be confused for any other, with a score of 96.7%, followed by B-PRS, I-PRS and B-LOC in that order, though B-TME entities are quite likely to be labeled as I-TME (with a probability of 6.6%).

4.7 Model output

This section highlights and exemplifies how a single model acts upon sentences. All sentences were written manually in order to show different aspects of the model's performance. In all subsequent tables the structural elements [CLS], [SEP] and the padding token have been removed as they are uninteresting for named entity classification, though it is worth mentioning that they were correctly identified and labeled with a near one hundred percent probability across all sentences. A reference for all the labels can be found in Section 2.5.1. The model used for these tests had a weighted average F1-score of 76.9%, marginally (0.1%) less than the sample mean, and should thus be a decent representative of how the system interacts with sentences.

Table 4.3: The predicted labels and corresponding probability for each token in a sentence

Word       Predicted_Label  Probability
Osc        B-PRS            0.778
uar        I-PRS            0.805
Osc        I-PRS            0.830
uar        I-PRS            0.935
sson       I-PRS            0.961
har        B-O              1.000
studerat   B-O              0.996
på         B-O              0.972
KTH        B-ORG            0.947
,          B-O              0.775
Stockholm  B-LOC            0.769
,          B-O              0.615
sedan      B-TME            0.537
några      I-TME            0.729
månader    I-TME            0.449
tillbaka   I-O              0.740

In Table 4.3 the system acts upon the sentence "Oscuar Oscuarsson har studerat på KTH, Stockholm, sedan några månader tillbaka" (Eng: "Oscuar Oscuarsson has been studying at KTH, Stockholm, for a few months"). Something worth noting is that neither "Oscuar" nor "Oscuarsson" nor "Oscuar Oscuarsson" exists in the Stockholm-Umeå corpus, and as such the model has never been trained on them. Despite the fact that these words do not exist in the training data, the model correctly infers that they are supposed to be names in this context, an example of the model's robustness to unknown data and alternative spellings. KTH is correctly determined to be an organization with high probability, and Stockholm too is correctly classified as a location. The model is less certain when it comes to the words "sedan några månader tillbaka", and one could argue whether or not "tillbaka" should have been tagged with I-TME as well in this case.

Table 4.4: The predicted labels and corresponding probability for each token in a sentence

Word      Predicted_Label  Probability
Ericsson  B-PRS            0.878
hade      B-O              0.996
arbetat   B-O              1.000
på        B-O              0.998
Ericsson  B-ORG            0.708
länge     B-O              0.977

In Table 4.4 the system acts upon the sentence "Ericsson hade arbetat på Ericsson länge" (Eng: "Ericsson had been working at Ericsson for a long time"). In this example one can see the model applying two different labels to the same word: Ericsson is a person when the word occurs the first time, and an organization when it occurs the second time. This exemplifies how the model is capable of using information about position, preceding words and succeeding words to determine the correct labels.

Table 4.5: The predicted labels and corresponding probability for each token in a sentence

Word       Predicted_Label  Probability
De         B-O              0.998
tävlade    B-O              0.996
i          B-O              0.989
Fren       B-EVN            0.765
ch         I-EVN            0.690
Open       I-EVN            0.807
,          B-O              0.975
Frankrike  B-LOC            0.742
,          B-O              0.848
och        B-O              0.967
vann       B-O              0.969

Table 4.6: The predicted labels and corresponding probability for each token in a sentence

Word       Predicted_Label  Probability
De         B-O              0.998
tävlade    B-O              0.994
i          B-O              0.996
Fren       B-ORG            0.357
ch         I-ORG            0.385
Open       I-EVN            0.307
,          B-O              0.718
Frankrike  B-O              0.652

In Tables 4.5 and 4.6 the system acts upon two sentences with a small difference. The first sentence, "De tävlade i French Open, Frankrike, och vann" (Eng: "They competed in the French Open, France, and won"), has French Open correctly classified as an event and France as a location. But in the second sentence, where the three last tokens are removed, performance suffers: French Open is classified as an organization with low probability, and France is labeled as a non-entity, meaning the model fails on all named entities in the sentence. This highlights how context-sensitive the embeddings that ALBERT generates are.

4.8 Statistical measures

The results in Sections 4.3 through 4.6 were created as the average of fifty models. This section studies the spread of the per-category metrics across these models as well as the corresponding confidence intervals.

4.8.1 Confidence intervals

Table 4.7: Sample mean, min and max values and 95% confidence intervals

       mean  interval       min   max   support
I-EVN  13.5  (11.6, 15.4)   0.0   36.1  76
B-OBJ  28.8  (26.9, 30.6)   18.8  38.3  25
I-OBJ  29.8  (27.4, 32.2)   10.5  46.6  49
B-EVN  35.6  (34.0, 37.2)   24.0  46.2  41
I-WRK  41.0  (40.3, 41.8)   32.3  51.5  236
B-WRK  46.1  (45.2, 46.9)   35.8  53.9  80
B-ORG  63.6  (63.3, 63.9)   56.1  68.2  316
B-MSR  67.3  (67.1, 67.5)   63.9  71.6  247
I-LOC  67.9  (67.8, 68.1)   63.2  70.9  541
I-ORG  68.4  (68.3, 68.6)   62.6  72.1  456
B-TME  69.2  (69.2, 69.3)   66.1  72.3  1227
I-TME  76.8  (76.7, 76.8)   74.8  78.4  1703
I-MSR  80.2  (80.1, 80.3)   74.6  83.1  517
B-LOC  82.0  (81.9, 82.0)   79.0  83.3  968
I-PRS  87.1  (87.1, 87.1)   84.1  88.3  2105
B-PRS  87.7  (87.7, 87.7)   86.8  88.3  1553

Table 4.7 shows statistics calculated on the fifty models that were used for the main results. In particular it shows the mean, minimum and maximum values for each category, as well as confidence intervals calculated as the mean ± 1.96 times the sample standard deviation divided by the square root of the support. One can see that the system has problems with category I-EVN, consisting of intermediate pieces of event names: at least one model failed entirely on these entities and the span of the confidence interval is large, covering nearly four percentage points. In contrast the system is stable for categories B-PRS and I-PRS, describing personal names, with their respective 95% confidence intervals spanning less than a tenth of a percent.

4.8.2 Spread of results

Figure 4.5: The spread of weighted average F1-scores across fifty samples. The blue vertical line corresponds to the mean. The red dotted line corresponds to the model used to generate sentences. The orange lines indicate the 95% confidence interval. The curve corresponds to the kernel density estimation

Figure 4.6: Per-category spread of results across fifty samples. The blue vertical line corresponds to the mean. The red dotted line corresponds to the model used to generate sentences. The orange lines indicate the 95% confidence interval. The curve corresponds to the kernel density estimation

Fig. 4.5 shows the spread of the average performance obtained with the fifty models, with Fig. 4.6 providing a per-category perspective. The scores of the model used to generate sentences in Section 4.7 are highlighted in red for comparison. From Fig. 4.5 it can be seen that the sample mean is 77.0%, with some outliers among the samples scoring above 78% or below 76%. Based on the fifty samples, the results indicate with 95% confidence that the true weighted average F1-score of the model is 77.0 ± 0.2%. Looking at the individual classes, it can be seen that the model used to generate sentences was better than the sample mean on categories PRS, LOC, MSR and EVN, and worse on categories TME, ORG and WRK. It is worth noting that while the distributions may look similar, each spread is plotted on its own range, scaled for a perfect fit. PRS for example has values that all fall within a few percent of its mean, while for OBJ the difference can be greater than ten percent.

Chapter 5

Discussion

In this chapter we begin by discussing what the results imply for using ALBERT in anonymization (Section 5.1). We analyse how performance depends on the type of category (Section 5.2) and note some problems with the wordpiece tokenization process (Section 5.3). We then consider the results in the context of ethics (Section 5.4), sustainability (Section 5.5) and practicality (Section 5.6). We finish the chapter by comparing the results with similar work (Section 5.7) and proposing ways to improve upon the results (Section 5.8).


5.1 ALBERT and anonymization

In this project we combined the recent ALBERT language model with a comparatively simple classifier and evaluated its use for named entity recognition across a wide range of styles and texts in Swedish. The hope was that the model, despite its lightweight nature, would be a strong candidate for Swedish text anonymization without the need for high-end hardware. We found a weighted average F1-score of 77.0 ± 0.2% across the eight main categories on the test set, with the highest results on names (PRS), locations (LOC) and time (TME) and the lowest on works of art (WRK), products (OBJ) and events (EVN). Ideally we would have liked to see a recall close to one hundred percent for some categories even at the expense of precision, as this would have meant that we correctly detected all named entities in that category, even at the risk of false positives. This is because it is preferable to risk obscuring non-sensitive information (the effect of false positives) than to leave sensitive information unobscured. As it currently stands, only names (though this is arguably the most important category) could be expected to be anonymized with any degree of reliability, further evidenced by the example sentence in Table 4.3, in which the system correctly identified all parts of a never-before-seen name with high confidence. The system would be slightly less reliable for obscuring locations, decent but inexact when it comes to obscuring time- and measure-related words, and unreliable for the others. There is of course a way to improve the recall and F1-score, at the cost of precision. The model labels each word according to the category with the highest probability, but these probabilities can be adjusted according to the needs of the user. If we have a use-case where we want to remove all organizations (ORG) from a text, it would be easy to increase the probability of a word being classified as such by a fixed amount, say ten percentage points. This would lead to an increase in false positives, and consequently more non-sensitive information would be obscured. However, as argued earlier, for the sake of privacy it is better to err on the side of caution. Another way to enhance the performance of the model applies to the case where one simply wants to obscure all potentially identifying data. If the goal is to remove all such information from a text, does it matter if Ericsson is a place, a name or an organization? For practical purposes, no; no matter the category it is private information that should be removed. In this case, where we can ignore the effect of class confusion, the model detects 79.4 percent of all entities, as seen in Fig. 4.3. Roughly twenty out of one hundred named entities are thus missed by the system, but once again it would be possible to adjust

the classification probabilities to bring this miss rate to almost zero, at the cost of an increased risk of obscuring non-sensitive information. It is also important to acknowledge that the categories the model is most likely to make mistakes on, such as objects, events and works of art, are less important than the categories it is good at. This makes the miss rate seem more severe than it might actually be for practical purposes. Of course it is quite possible that one does not want to scrub all information carried by named entities. For example the category Measure (MSR), consisting of words describing metrics and numbers, might be the actual information of interest. In this case it is necessary to avoid false negatives for the MSR tokens, since these would cause the tokens to be labeled as another category and be scrubbed, and to avoid false positives for the MSR tokens, since those would mean that data we wanted scrubbed is kept. To minimize the risk of false positives and false negatives on a specific category without harming overall performance, one could consult a confusion matrix such as Fig. 4.4 to see where classes are mistaken for one another. Considering only the B-tokens we would have 3.3 percent false negatives (MSR tokens categorized as B-TME and then scrubbed) and 0.3 percent false positives (B-TME tokens categorized as B-MSR and kept). So the risk is rather low that we reveal information we wanted to keep hidden (0.3 percent), and the risk is also low that we lose information we wanted to keep (3.3 percent). Whether this is acceptable or not would in the end depend on the needs of the user: is the risk of revealing some time-related information too big, or does it matter little? In general one can obtain the desired performance on one category by sacrificing performance on others. Alternatively one could always supplement the model with additional rules depending on the specific needs of the user. In conclusion, the system as-is would be a poor choice for scrubbing data about works of art, events and commercial products, but possibly useful for scrubbing names and locations. For a practical implementation it would be important to perform a task-specific analysis and manually adjust the model to achieve the desired performance. A more realistic use of the system could be to have it act as an aid rather than a completely autonomous process. In this case the function of the system would simply be to suggest words to obscure to a human controller, still saving a lot of time and effort while reducing the risk of mistakes.
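The probability adjustment discussed above can be illustrated with a small sketch: the class of interest gets a fixed boost before the argmax is taken, trading precision for recall. The class indices, probabilities and the size of the boost are purely illustrative.

import numpy as np

def predict_with_boost(probs, boost_class, boost=0.10):
    """Favour one class before taking the argmax over the softmax output.

    probs: array of shape (tokens, classes) with the classifier's probabilities.
    """
    adjusted = probs.copy()
    adjusted[:, boost_class] += boost
    return adjusted.argmax(axis=1)

# Two tokens, three classes (0 = O, 1 = B-ORG, 2 = B-LOC).
probs = np.array([[0.52, 0.45, 0.03],
                  [0.20, 0.10, 0.70]])
print(predict_with_boost(probs, boost_class=1))
# [1 2]: the first token flips from O to B-ORG, the second keeps its label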

5.2 Class confusion

From a language perspective it is interesting to see which classes get confused for one another and to reason about why, questions which may aid in developing models with better discriminatory ability. Based on Fig. 4.4 we can for example see that I-LOC is most likely to be misclassified as I-ORG or I-PRS, but not as any other category. There is in fact a trend throughout the entire confusion matrix of I-categories only being confused for other I-categories and B-categories only for other B-categories. This indicates that the model recognizes when a token is only the first part of a split word, as well as which intermediate tokens belong to the word. The only case where this trend is broken is within each category; e.g. B-TME is most likely to be misclassified as another B-token, or as I-TME. In essence, the model either mistakes the position (B vs I) but gets the class correct, or it gets the position correct but makes a mistake on the class. It is quite rare for both position and class to be wrong at the same time. Performance-wise, I-WRK is the least successful category by a large margin, often mistaken for both I-ORG and I-PRS. This could be an effect of the low support of the class, though there are instances where confusion is rampant despite high support, such as the confusion of I-LOC with I-ORG and I-PRS. This could be a result of the strong ambiguity inherent in some words; for example, Ericsson can be a person, an organization and a location. In Table 4.4 we see the model correctly determining the labels for the two instances of the word, but in Tables 4.5 and 4.6 we see how a difference of just three tokens is enough to make the classifications completely different. Dealing with this kind of context dependency should be one of the advantages of using the embeddings generated by ALBERT, and it may be that the classifier used was not complex enough to fully make use of them. The choice of a single-layer BiLSTM was made with the rationale that we wanted to keep everything scaled-down and simple. Even so, it may have been better to add one more layer to the classifier or to use more features from ALBERT's hidden layers, both of which would have increased the model's capacity for abstraction. This idea is supported by the fact that many of the most modern NER systems use more complex classifiers; Devlin et al. [10] used a multi-layer BiLSTM and concatenated embeddings from four layers of BERT to achieve their results, while others made use not only of multi-layer BiLSTMs, but also added convolutional neural networks and conditional random fields to their classifiers [20]. Making the classifier more complex would have increased training time

and memory footprint, but the possible advantages in reducing class confusion may well have been worth it.

5.3 Wordpiece results

The question "when is a named entity correctly classified" is simple when one operates on whole words. Things get more problematic with wordpieces, where a single named entity can consist of two tokens, or three, or even more. The common approach (as is the one used for comparing the results in this paper with other work in Section 5.7), is to only consider the classification of the first token in the named entity. This makes sense since it is easy to recombine wordpieces, and requiring all parts of a word be correctly classified rather than just the first would only lead to worse results. On the other hand one could easily imagine other, equally valid ways to score the results. One such way could be to do a majority count of the labels placed for each entity and select the most common one, with priority given to the first label in case of a tie. This would make sense as the model is good at both I- and B-token classification, and the longer an entity is the more I-tokens it consists of. The use of wordpieces have undeniable advantages, but the confusion be- tween B and I labels within categories that result from it does pose some prob- lems. Finding a way to reduce this confusion would raise the score for all categories, significantly so for some. Looking at Fig. 4.4 we see that enti- ties belonging to the B-TME category are classified as I-TME 6.6 percent of the time, and if this confusion could be avoided the accuracy on time-related entities could be raised from 92.4 to 99.0 percent, for example.

5.4 Ethical aspects

From a practical standpoint, lightweight anonymization systems are interesting because they can make it easier to comply with the rules and regulations that govern the use of personal information. The potential fines of more than twenty million euro are a strong motivator for complying with the GDPR, but it is important to also consider the reason why the regulation came to be in the first place. Like many other laws, the regulation rests on the foundation of an ethical argument, in this case based on the idea of human dignity.

Human dignity is inviolable. It must be respected and protected.

- EU Charter Article 1, Human dignity [40]

Human dignity is the central point around which much of the GDPR has been developed, though the legal rationale and basis for it can also be inferred from Article 3, "Right to the integrity of the person", or Article 8, "Protection of personal data" [40]. In the statement of the purpose of the GDPR one can read:

This Regulation protects fundamental rights and freedoms of natural persons and in particular their right to the protection of personal data. - GDPR Article 1, Subject-matter and objectives [35]

Despite the fact that the central idea of the GDPR revolves around this concept of protection, we began this thesis by discussing the importance of research and statistics, treating anonymization as a means to an end, an argument made from a utilitarian perspective. But it is important to acknowledge that anonymization is important in and of itself. Human beings have a right to have their privacy respected and their personal information protected from malicious actors, and anonymization is one way to support this pursuit. In safeguarding the sanctity of private life, one also acts to preserve human dignity. One could however look at the pursuit of powerful anonymization systems from another point of view. The system developed in this report would only be the first component of an anonymization system; at this stage all it does is collect and categorize potentially personal information from unstructured text. That by itself does not promote privacy, and one could imagine the system instead being used to systematically collect rather than obscure personal information. Like any tool, an advanced NER system could be used for both good and bad purposes. The potential value of such systems for promoting research, development and the protection of human dignity nevertheless makes them a worthwhile pursuit.

5.5 Sustainability considerations

One of the major advantages of ALBERT is its comparatively low hardware requirements. It makes modern NLP models available to many more users than would otherwise be the case, be it for named entity recognition or other tasks. But even aside from the benefit to accessibility, ALBERT is

interesting from an environmental perspective. Strubell et al. [12] recently published a paper concerned with the environmental resources needed to train and develop NLP models, echoing concerns from others that while great efforts have been made to make computer hardware more power-efficient, the same cannot be said for the software. In their paper they found that training a BERT-base model (110 M parameters, ninety-six hours on sixteen TPU chips) for English had a power footprint of 12041 W, producing 652 kilograms of CO2 emissions - the equivalent of a trans-American flight [38]. One of the Sustainable Development Goals is to combat climate change caused by CO2 and other greenhouse gases, and the trend of developing massively complex models for the chance of winning a few more percentage points of performance runs counter to this. Given how recent the model is, no numbers like those referenced for BERT have been released, so it is difficult to estimate how big the difference between the two would be in practice. Nevertheless, when the trend is increasing power consumption and carbon emissions in exchange for a few percentage points of performance, ALBERT represents a welcome break with its ten-fold reduction in parameters and increased efficiency.

5.6 Time and resources used

Once the entire pipeline was set up, going from parsing the XML tree to generating embeddings with ALBERT to testing the model with five-fold cross-validation and plotting the final graphs took approximately half a day. This was done on a portable laptop with a GTX 1050 Ti graphics card with 4 GB of RAM. The greatest speed limitation lay in the fact that the full embedding matrix of the wordpiece-tokenized data took around 50 GB, creating the need to save to and read from disk rather than keep the entire matrix in RAM.

5.7 Comparison with other results

Table 5.1 lists other work in the field that can be considered related to this thesis, be it because they use BERT, test their language models on NER in different languages, or are simply recent. There are however several caveats to take into account when trying to make any direct comparison:

1. The number of classes used is different, the most common number being four while this thesis uses eight.

2. The datasets used for evaluation are different, so differences in the results cannot necessarily be attributed to the models themselves.

3. ALBERT is a new model that has not been evaluated on NER for English or any other language.

4. Most of the models are developed with English in mind while this thesis focuses on the Swedish use-case. It is possible the results are a function of the language rather than the model itself.

5. The system in this thesis was developed for efficiency rather than for beating the state of the art, making it different in purpose from the other works referred to.

These differences are extensive enough to render any direct comparisons essentially meaningless. Nevertheless we have chosen to include them for the sake of providing some sort of context.

Table 5.1: Recent NER results using deep neural architectures, ordered by average F1-score

Paper           Year  Language    F1    Classes  Architecture
Devlin et al.   2018  English     96.1  4        BERT-BiLSTM
Pires et al.    2019  English     90.7  4        BERT-BiLSTM
Pires et al.    2019  Dutch       89.9  4        BERT-BiLSTM
Pires et al.    2019  Spanish     87.2  4        BERT-BiLSTM
Klang et al.    2018  Swedish     84.7  4        BiLSTM-CNN-CRF
Pires et al.    2019  German      82.0  4        BERT-BiLSTM
Souza et al.    2019  Portuguese  78.7  10       BERT-CRF
This thesis     2020  Swedish     77.0  8        ALBERT-BiLSTM
Weegar et al.   2019  Swedish     76.0  3        BiLSTM-CRF

5.8 Future work

This thesis made use of one of the most recent and lightweight language models made publicly available. All choices, from feature selection to classifier design, were made with the intent of keeping the system computationally inexpensive and simple. Some suggestions for improving performance without making the system more complex would be to:

• Use a more advanced tagging scheme than BIO, as such schemes tend to increase the average F1-score, sometimes by several percentage points, as discussed in Section 2.4.

• Make use of a fully trained ALBERT model. Upon contact, the Swedish KB-lab who pretrained the model used for this thesis stated that they expected performance to increase with more training.

• Mimic the standard approach and group all categories except PRS, LOC and ORG into a miscellaneous category, reducing the number of classes from eight to four. This would reduce the risk of category confusion.

Aside from these suggestions, it would also be worth looking into ways to enhance the SUC dataset; two such ideas are presented below.

Removing composite classes

It was mentioned in Section 3.1.4 that there might have been better ways to deal with the composite categories LOC/PRS, LOC/ORG, ORG/PRS, LOC/LOC, PRS/WRK and OBJ/ORG. These were likely used when an annotator was uncertain which class a word should belong to, an example being "The car is a big investment for Ford", where Ford could refer to either a person or an organization. In this thesis we kept each of these ambiguous cases as a separate class, but this had some drawbacks. It increased the total number of classes by six (twelve if one also considers the separation into B- and I-labels) despite less than one percent of all named entities belonging to these categories. This flaw is seen at its most extreme in the word "Rolls-Royce", the only word in the entire corpus that fell into the category OBJ/ORG. As a natural solution to this problem, one could instead take each sentence where such a composite label appears, duplicate it, and use one part of the label in each copy. Reusing the previous example, this would result in two sentences, one where Ford is labeled as a person and one where Ford is labeled as an organization. This would pose no problem, since the sentence does have two perfectly valid interpretations to the human mind, and that is the level of language understanding we would ideally want to reach. This would increase the support of the main classes, reduce the complexity of the classification task and make the results easier to interpret, at virtually no cost. Admittedly this approach relies on the annotations being correct in the first place, something that was not always the case.

Increasing the amount of data

In Section 4.7 of the results we saw two very similar sentences in Table 4.5 and Table 4.6, where only the first sentence had the named entities correctly classified. This shows how important context is for correctly labeling entities; with a difference of only three tokens the model went from being completely right to completely wrong. The easiest way to improve the model's performance in such scenarios would likely be to add more training data, a solution limited by the fact that gathering and manually annotating more text is very time-consuming. As an alternative way to get more data, we suggest augmenting the sentences in SUC with subsets of the sentences already in it. If we have the sentence "They competed in French open, France, and won" with correct labels, why not also add "They competed in French open, France" and "They competed in French open"? Once we have a correctly labeled original sentence, more sentences can be obtained by taking subsets of it. This would be a low-cost way to greatly increase the amount of training data while also making the model better at dealing with varying amounts of context. One could argue that the model would then no longer be trained only on well-formed sentences, but unstructured data is not necessarily well-formed in the first place, which weakens that objection. Another way to increase the amount of data would be to duplicate sentences and replace words with their synonyms, exposing the model to even more contexts and possible sentence structures.
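The sketch below illustrates one possible reading of this idea: generating labeled prefixes of a sentence while never cutting inside an entity span. The cut-point rule, the function name and the label abbreviations in the example are assumptions made for illustration, not a specification from the thesis.

```python
def prefix_augment(tokens, labels, min_length=3):
    """Generate labeled prefix subsets of a sentence.

    A prefix is kept only if it does not end in the middle of an entity,
    i.e. the label of the next token is not an I- (inside) tag.
    """
    augmented = []
    for end in range(min_length, len(tokens)):
        if labels[end].startswith("I-"):
            continue  # cutting here would split an entity span in half
        augmented.append((tokens[:end], labels[:end]))
    augmented.append((tokens, labels))  # keep the original sentence as well
    return augmented


tokens = ["They", "competed", "in", "French", "open", ",", "France", ",", "and", "won"]
labels = ["O", "O", "O", "B-EVN", "I-EVN", "O", "B-LOC", "O", "O", "O"]
for toks, _ in prefix_augment(tokens, labels):
    print(" ".join(toks))
```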

5.8.1 Going more complex

It is of course possible to take another direction with ALBERT: to go for performance and beating benchmarks rather than efficiency and accessibility. Aside from the steps mentioned in the previous section, there are some changes to both ALBERT and the classifier that are likely to increase performance, for example:

• Scaling ALBERT up. Tests of this have yielded state-of-the-art results on the GLUE, SQuAD, and RACE benchmarks for natural language understanding, beating even BERT [25].

• Optimizing the classifier. The popular approach seems to be multi-layer BiLSTMs with conditional random fields, but even a more complete grid search over the classifier architecture and hyperparameters would likely produce a better classifier than the one used in this thesis.

• Using full fine-tuning on the model and classifier as described by Devlin et al. [10].

• Using more features than just those of the last layer of ALBERT.
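As a sketch of the last point, the example below uses the Hugging Face transformers library [47] to expose all of ALBERT's hidden layers and concatenates the last four as token features instead of using only the final layer. The model identifier "albert-base-v2" is a stand-in; the thesis used the Swedish ALBERT pretrained by KB-lab [18], and the exact output format can differ between library versions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "albert-base-v2" is a placeholder; swap in the Swedish ALBERT checkpoint
# from KB-lab [18] to reproduce the setting in this thesis.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2", output_hidden_states=True)
model.eval()

inputs = tokenizer("Victor bor i Stockholm", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple with the embedding output plus one tensor per
# layer, each of shape (batch, tokens, hidden_size). Concatenating the last
# four layers gives richer token features than the final layer alone.
features = torch.cat(outputs.hidden_states[-4:], dim=-1)
print(features.shape)  # (1, num_tokens, 4 * hidden_size)
```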

5.8.2 Exploring other datasets

Looking beyond improving the model's performance, it would also be interesting to see its use in a more specific context. The recognition of medical entities such as diseases, chemicals and disabilities is, for example, of great importance when summarizing and parsing large amounts of biomedical data, and evaluating the system on such a use case would be worthwhile.

Chapter 6

Conclusion

In this thesis we trained and evaluated a system for named entity recognition in Swedish using the compact ALBERT language model. The system achieved its best results on words describing names and locations, followed by organizations, times and measures, while scoring the lowest on works of art, objects and events. Whether the system as-is could be used for anonymization of Swedish text thus depends on the type of words that need to be obscured. If only names, and perhaps locations, need to be removed it would be worth considering; for any other category the system would likely be too unreliable. Instead of using the system on its own, it may therefore be more appropriate to consider its utility as an aid for a human controller, with the system suggesting data to obscure and a human making the final call. Regardless of which scenario is envisioned, there are several improvements to the system that would be worth exploring first. Examples of these include scaling the ALBERT model up, optimizing the classifier, as well as augmenting the SUC dataset and managing its inconsistencies better. We conclude that the ALBERT model can be a useful component in Swedish NER and that it has potential worth exploring further.

Bibliography

[1] Layer weight initializers. https://keras.io/api/layers/initializers/. Accessed: 2020-05-30.

[2] Simon Almgren, Sean Pavlov, and Olof Mogren. Named entity recognition in swedish health records with character-based deep bidirectional lstms. In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pages 30–39, 2016.

[3] Bogdan Babych and Anthony Hartley. Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th International EAMT workshop on MT and other language technology tools, Improving MT through other language technology tools, Resource and tools for building MT at EACL 2003, 2003.

[4] Hanna Berg, Taridzo Chomutare, and Hercules Dalianis. Building a de-identification system for real swedish clinical text using pseudonymised clinical text. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pages 118–125, 2019.

[5] Lars Borin and Dimitrios Kokkinakis. Literary and language technology. In Literary Education and Digital Learning: Methods and Technologies for Humanities Studies, pages 53–78. IGI Global, 2010.

[6] Lars Borin, Dimitrios Kokkinakis, and Leif-Jöran Olsson. Naming the past: Named entity and animacy recognition in 19th century swedish literature. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 1–8, 2007.

[7] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167, 2008.

[8] Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of cheminformatics, 7(S1):S14, 2015.

[9] Hercules Dalianis and Erik Åström. Swenam - a swedish named entity recognizer: its construction, training and evaluation. Technical Report TRITA-NA-PO113, 2001.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[11] Tobias Ek, Camilla Kirkegaard, Håkan Jonsson, and Pierre Nugues. Named entity recognition for short text messages. Procedia-Social and Behavioral Sciences, 27:178–187, 2011.

[12] Eva García-Martín, Crefeda Faviola Rodrigues, Graham Riley, and Håkan Grahn. Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134:75–88, 2019.

[13] Google Research. bert. github.com/google-research/bert. Accessed: 2020-05-30.

[14] Filip Graliński, Krzysztof Jassem, Michał Marcińczuk, and Paweł Wawrzyniak. Named entity recognition in machine anonymization. Recent Advances in Intelligent Information Systems, pages 247–260, 2009.

[15] Sofia Gustafson-Capková and Britt Hartmann. Manual of the stockholm umeå corpus version 2.0. 2006.

[16] Huggingface. transformers. github.com/huggingface/transformers. Accessed: 2020-02-30.

[17] Janne Bondi Johannessen, Kristin Hagen, Åsne Haaland, Andra Björk Jónsdottir, Anders Nøklestad, Dimitris Kokkinakis, Paul Meurer, Eckhard Bick, and Dorte Haltrup. Named entity recognition for the mainland scandinavian languages. Literary and Linguistic Computing, 20(1):91–102, 2005.

[18] KB-labb. swedish-bert-models. github.com/Kungbib/swedish-bert-models. Accessed: 2020-02-30.

[19] Mahboob Alam Khalid, Valentin Jijkoun, and Maarten De Rijke. The impact of named entity normalization on information retrieval for question answering. In European Conference on Information Retrieval, pages 705–710. Springer, 2008.

[20] Marcus Klang and Pierre Nugues. Comparing lstm and fofe-based architectures for named entity recognition. page 54, 2018.

[21] Dimitrios Kokkinakis, Jyrki Niemi, Sam Hardwick, Krister Lindén, and Lars Borin. Hfst-swener—a new ner resource for swedish. In LREC, pages 2537–2543, 2014.

[22] Dimitrios Kokkinakis and Anders Thurin. Anonymisation of swedish clinical data. In Conference on Artificial Intelligence in Medicine in Europe, pages 237–241. Springer, 2007.

[23] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics.

[24] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270, 2016.

[25] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2019.

[26] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, 2016.

[27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[28] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

[29] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.

[30] Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual bert? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, 2019.

[31] Claus Povlsen, Bart Jongejan, Dorte H Hansen, and Bo Krantz Simonsen. Anonymization of court orders. In 2016 11th Iberian Conference on Information Systems and Technologies (CISTI), pages 1–4. IEEE, 2016.

[32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[33] Prabhakar Raghavan, S Amer-Yahia, and L Gravano. Structure in text: Extraction and exploitation. In Proceedings of the 7th international Workshop on the Web and Databases (WebDB), ACM SIGMOD/PODS, volume 1, 2004.

[34] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado, June 2009. Association for Computational Linguistics.

[35] General Data Protection Regulation. Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46. Official Journal of the European Union (OJ), 59(1-88):294, 2016.

[36] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[37] Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. Portuguese named entity recognition using bert-crf. arXiv preprint arXiv:1909.10649, 2019.

[38] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019.

[39] Antonio Toral, Elisa Noguera, Fernando Llopis, and Rafael Munoz. Improving question answering using named entity recognition. In International Conference on Application of Natural Language to Information Systems, pages 181–191. Springer, 2005.

[40] European Union. Charter of fundamental rights of the european union. https://www.refworld.org/docid/3ae6b3b70.html, October 2012. Accessed 16 June 2020.

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[42] Nicholas Vollmer. Article 4 eu general data protection regulation (eu-gdpr), Sep 2018.

[43] Bin Wang and C-C Jay Kuo. Sbert-wk: A sentence embedding method by dissecting bert-based word models. arXiv preprint arXiv:2002.06652, 2020.

[44] Rebecka Weegar, Alicia Pérez, Arantza Casillas, and Maite Oronoz. Deep medical entity recognition for swedish and spanish. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1595–1601. IEEE, 2018.

[45] Rebecka Weegar, Alicia Pérez, Arantza Casillas, and Maite Oronoz. Recent advances in swedish and spanish medical entity recognition in clinical texts using deep neural approaches. BMC Medical Informatics and Decision Making, 19(7):274, 2019.

[46] Victor Wiklund. thesis. github.com/laureleo/thesis. Accessed: 2020-05-30.

[47] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771, 2019.

[48] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[49] Vikas Yadav and Steven Bethard. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, 2018.

[50] Jie Yang, Shuailong Liang, and Yue Zhang. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879–3889, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

[51] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc., 2019.

[52] Imed Zitouni. Natural language processing of semitic languages. Springer, 2014.
