2.2 Named Entity Recognition

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

A compact language model for Swedish text anonymization

VICTOR WIKLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: August 28, 2020
Supervisor: Mats Nordahl
Examiner: Olov Engwall
School of Electrical Engineering and Computer Science
Swedish title: En kompakt språkmodell för svensk textanonymisering

Abstract

The General Data Protection Regulation (GDPR) that came into effect in 2018 states that for personal information to be used freely for research and statistics, it needs to be anonymized first. To properly anonymize a text, one needs to identify the words that carry personally identifying information, such as names, locations and organizations. Named Entity Recognition (NER) is the task of detecting these kinds of words, and in the last decade a lot of progress has been made on it. This progress can be largely attributed to machine learning, in particular the development of language models that are trained on vast amounts of textual data in the target language. These models are powerful but very computationally demanding to run, which limits their accessibility. ALBERT is a recently developed language model that manages to provide almost the same level of performance at only a fraction of the size. In this thesis we explore the use of ALBERT as a component in Swedish anonymization by combining the model with a one-layer BiLSTM classifier and testing it on the Stockholm-Umeå Corpus. The results show that the system can separate personally identifying words from ordinary words 79.4% of the time, and that the model performs best when it comes to detecting names, with an F1-score of 87.7%.
Looking at the average performance across eight categories, we obtain an F1-score of 77.8% with five-fold cross-validation and 77.0 ± 0.2% on the test set with 95% confidence. We find that the system as-is could be used for the anonymization of some types of information, but would perhaps be better suited as an aid for a human controller. We discuss ways to enhance the performance of the system and conclude that ALBERT can be a useful component in Swedish anonymization, provided that it is optimized further.

Sammanfattning

With the General Data Protection Regulation (GDPR), which came into effect in 2018, personal information must be anonymized before it can be used freely for statistics and research. To anonymize a text, one must be able to detect the words that carry personal information, such as names, places and organizations. Named Entity Recognition (NER) is an area of computer science concerned with how to detect these types of words automatically, and over the last decade several advances have been made in it. These advances are generally the result of combining machine learning with better computers, but the development of general language models trained on massive amounts of language data has been especially important. These models are powerful but require considerable system resources to use, which limits their accessibility. ALBERT is a newly developed language model that delivers similar performance with only a fraction of the number of parameters. In this work we explore the use of ALBERT for the anonymization of Swedish text by combining the model with a simple BiLSTM classifier and testing it on the Stockholm-Umeå Corpus. Our results show that the system manages to distinguish personally identifying information from ordinary words in 79.4 percent of cases, and that it is best at recognizing names, with an F1-score of 87.7 percent.
Measured over the eight most interesting word categories in the corpus, we obtain an F1-score of 77.8% with five-fold cross-validation and 77.0 ± 0.2% on the test data with 95% confidence. We find that the system in its current state could anonymize certain types of information, but that it would probably work better together with a human who double-checks its suggestions. We discuss ways to improve the system's performance and conclude that ALBERT can be a useful component in Swedish anonymization, provided that it is optimized to a higher degree.

Contents

1 Introduction
  1.1 Research Question
    1.1.1 Limitations
    1.1.2 Evaluation
2 Background
  2.1 GDPR
    2.1.1 Who does the GDPR apply to?
    2.1.2 What is personal data?
    2.1.3 How is the data protected?
    2.1.4 GDPR and NER
  2.2 Named Entity Recognition
  2.3 Language models
    2.3.1 BERT
    2.3.2 ALBERT
    2.3.3 The issue of having a limited vocabulary
  2.4 Wordpiece Labels
  2.5 The SUC 3.0 Corpus
    2.5.1 Named Entity Abbreviations
  2.6 Related work
    2.6.1 Recent developments in NER with BERT
    2.6.2 A chronology of Swedish NER
3 Method
  3.1 Data Preprocessing
    3.1.1 Extraction of relevant data from SUC
    3.1.2 Wordpiece tokenization
    3.1.3 Wordpiece tagging
    3.1.4 Data cleaning
    3.1.5 Padding and Truncating
    3.1.6 Formatting
  3.2 Generating embeddings with ALBERT
  3.3 Model selection
    3.3.1 Architecture
    3.3.2 Conceptual understanding of the classifier
  3.4 Evaluation
    3.4.1 Training
    3.4.2 Experiments
    3.4.3 Metrics
  3.5 Resources and code
    3.5.1 Resources
    3.5.2 Code
4 Results
  4.1 Average cross validated performance
  4.2 Cross validation training
  4.3 Entity/Non-entity confusion
  4.4 Main Category Performance
  4.5 Full Category Performance
  4.6 Category confusion
  4.7 Model output
  4.8 Statistical measures
    4.8.1 Confidence intervals
    4.8.2 Spread of results
5 Discussion
  5.1 ALBERT and anonymization
  5.2 Class confusion
  5.3 Wordpiece results
  5.4 Ethical aspects
  5.5 Sustainability considerations
  5.6 Time and resources used
  5.7 Comparison with other results
  5.8 Future work
    5.8.1 Going more complex
    5.8.2 Exploring other datasets
6 Conclusion

Chapter 1

Introduction

It is common in modern society for personal information to end up stored digitally. This information can be everything from blog posts about your daily life to personal emails and medical history. Opinions on what information needs to be protected, and to what degree, may vary, but it is reasonable to state that most people would prefer, if given the option, for their personal information not to be collected and exploited by others.

At the same time, the pursuit of perfect confidentiality and privacy should not be taken too far. Issues of feasibility aside, there are a lot of useful things that can be done with personal information. Consider the case where one could better study the effects of a treatment for a viral disease with full access to patient data, or where one wants to analyze traffic flow with location data from cellphones to avoid congestion in cities, or perhaps look into trends in crime or education on a nationwide scale. There are without a doubt benefits to this kind of research, but the privacy of the individual should take precedence. One way to ensure this is to simply scrub or replace all information that could be used to identify an individual, a process known as anonymization or pseudonymization that was made compulsory on the 25th of May 2018 as part of the EU's General Data Protection Regulation (GDPR). Accordingly, one is only allowed to use private data freely for research and statistics if it is not possible to tie the data to a unique individual [42].
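To make the pseudonymization step concrete, here is a minimal sketch of how detected entities could be scrubbed from a text. The sentence, tag names, and `pseudonymize` helper are invented for illustration; they are not the thesis's actual code or tag set.

```python
# Hypothetical illustration of NER-based pseudonymization: tokens that a
# tagger has labelled as entities are replaced by category placeholders.
def pseudonymize(tokens, tags):
    """Replace each token tagged as an entity with a [CATEGORY] placeholder."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "O":          # "O" marks ordinary, non-entity words
            out.append(token)
        else:                   # e.g. "PERSON", "ORG", "LOCATION"
            out.append(f"[{tag}]")
    return " ".join(out)

tokens = ["Anna", "Lind", "works", "at", "KTH", "in", "Stockholm"]
tags   = ["PERSON", "PERSON", "O", "O", "ORG", "O", "LOCATION"]
print(pseudonymize(tokens, tags))
# → [PERSON] [PERSON] works at [ORG] in [LOCATION]
```

The quality of the anonymization thus depends entirely on how reliably the tagger labels the entity tokens, which is what the rest of this thesis evaluates.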
Detecting all such sensitive information is a time-consuming process that eventually becomes completely infeasible to perform manually as the amount of data to scan through grows too large. A natural way of countering this is automation, and luckily the field of Natural Language Processing (NLP) has made great leaps in the past decade. In particular, the pieces of data that could be considered personally identifying (such as names, locations, dates, objects, organizations, products, etc.) fall under the scope of Named Entity Recognition (NER), an active subfield of NLP.

Some of the best results achieved on named entity recognition were obtained with a language model released by Google known under the acronym BERT [10]. This model is computationally heavy, however, with the base model requiring a GPU with 12-16 GB of RAM to run, and larger models going beyond that [10]. In fact, most NLP models now require multiple instances of specialized hardware such as GPUs or TPUs, which limits the accessibility of the technology [38]. This constitutes a problem, as high-quality NER is useful for all actors who deal with personal information, but not all of them have access to these resources.

In 2020, Lan et al. [25] published a new model that went against the trend of increasing model complexity with their release of ALBERT, a much more compact version of BERT parameter-wise, with only a minor degradation in performance. This combination of few parameters and high performance in a single model makes state-of-the-art NLP something more people could make use of, though given how recent the model is, it has not yet been evaluated for NER at the time of writing [25].
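The abstract reports per-category F1-scores (87.7% for names) and an average over categories (77.8%). As a rough sketch of how such a metric is computed, the following shows per-class and macro-averaged F1 from true-positive/false-positive/false-negative counts; the counts and category names here are invented, not results from the thesis.

```python
# Illustrative computation of per-category and macro-averaged F1.
def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (true positive, false positive, false negative) counts per category.
counts = {"PERSON": (90, 10, 15), "LOCATION": (70, 20, 25), "ORG": (50, 30, 30)}
per_class = {cat: f1(*c) for cat, c in counts.items()}
macro_f1 = sum(per_class.values()) / len(per_class)
```

Macro-averaging weights every category equally, so a model that excels on frequent classes like names but struggles on rare ones is penalized accordingly, which is why the averaged score (77.8%) sits below the best single category (87.7%).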
