
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Prerequisites for Extracting Entity Relations from Swedish Texts

ERIK LENAS

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Author: Erik Lenas
University: KTH Royal Institute of Technology
Supervisor: Anders Sjögren
Examiner: Fadil Galjic
Thesis: Bachelor Degree Project in Computer Engineering, First Cycle
Faculty: Electrical Engineering and Computer Science
Partner: Riksarkivet

Abstract

Natural language processing (NLP) is a vibrant area of research with many practical applications today, like sentiment analysis, text labeling, question answering, machine translation and automatic text summarization. At the moment, research is mainly focused on the English language, although many other languages are trying to catch up. This work focuses on an area within NLP called information extraction, and more specifically on relation extraction, that is, extracting relations between entities in a text. This work aims to use machine learning techniques to build a Swedish language processing pipeline with part-of-speech tagging, dependency parsing, named entity recognition and coreference resolution, to use as a base for later relation extraction from archival texts. The obvious difficulty lies in the scarcity of annotated Swedish datasets. For example, no sufficiently large Swedish dataset for coreference resolution exists today. An important part of this work, therefore, is to create a Swedish coreference solver using distantly supervised machine learning: creating a Swedish dataset by applying an English coreference solver to an unannotated bilingual corpus, using a word-aligner to translate this machine-annotated English dataset into a Swedish dataset, and then training a Swedish model on this dataset.
Using Allen NLP's end-to-end coreference resolution model, both for creating the Swedish dataset and for training the Swedish model, this work achieves an F1-score of 0.5. For named entity recognition, this work uses the Swedish BERT models released by the Royal Library of Sweden in February 2020 and achieves an overall F1-score of 0.95. To put all of these NLP models within a single language processing pipeline, Spacy is used as a unifying framework.

Keywords: Machine Learning, Natural Language Processing, Relation Extraction, Named Entity Recognition, Coreference Resolution, BERT

Abstract (in Swedish)

Natural Language Processing (NLP) is a large and topical research area today with many practical applications, such as sentiment analysis, text categorization, machine translation and automatic text summarization. Research is currently mostly focused on the English language, but many other language communities are trying to catch up. This work focuses on an area within NLP called information extraction, and more specifically relation extraction, that is, extracting relations between named entities in a text. What this work attempts is to use various machine learning techniques to create a Swedish language processing pipeline consisting of part-of-speech tagging, dependency parsing, named entity recognition and coreference resolution. This pipeline is then intended to be used as a base for later relation extraction from Swedish archival material. The obvious difficulty with this lies in the scarcity of large, annotated Swedish datasets. For example, no sufficiently large Swedish dataset for coreference resolution exists.
A large part of this work therefore consists of creating a Swedish coreference solver by implementing distantly supervised machine learning, by which is meant applying an English coreference solver to an unannotated English-Swedish corpus, then using a word-aligner to translate this machine-annotated English dataset into a Swedish one, and then training a Swedish coreference solver on this dataset. This work uses Allen NLP's end-to-end coreference solver, both for creating the Swedish dataset and for training the Swedish model, and achieves an F1-score of 0.5. For named entity recognition, this work uses the Royal Library of Sweden's BERT models as a base, and thereby achieves an F1-score of 0.95. Spacy is used as a unifying framework to gather all of these NLP components within a single pipeline.

Keywords: Machine Learning, Natural Language Processing, Relation Extraction, Named Entity Recognition, Coreference Resolution, BERT

Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Research Method
  1.5 Scope
  1.6 Disposition
2 Theoretical Background
  2.1 Machine Learning
  2.2 Natural Language Processing
  2.3 Related Works
3 Method
  3.1 Research Method
  3.2 Relation Extraction Methods
  3.3 Technological Methods
4 Training the Spacy Language Processing Pipeline
  4.1 Datasets
  4.2 Part-of-Speech Tagging
  4.3 Dependency Parsing
  4.4 Named Entity Recognition
  4.5 Coreference Resolution
5 Results
  5.1 Part-of-Speech Tagging
  5.2 Dependency Parsing
  5.3 Named Entity Recognition
  5.4 Coreference Resolution
6 Discussion
  6.1 Natural Language Processing on Swedish Texts
  6.2 Discussion of the Results
  6.3 Future Work
7 Appendix A - Datasets and Spacy Training Format

1 Introduction

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence (AI), linguistics and computer science that deals with the ability of computers to draw meaning from spoken or written language. The field has been around since the 1950s, with ups and downs along the way. In the last 10-15 years remarkable progress has been made, mainly due to new techniques in the field of machine learning, coupled with the explosion of data made available in the last decade.

1.1 Background

The aim of this work is to use machine learning techniques to build a base model for extracting relations between entities in Swedish texts. An entity is some sort of noun phrase that corresponds to an entity in the world, like a person, an organization, a time expression or a building. An n-ary relation between entities is some sort of relation between n entities in a span of text, such as "Erik sold the Downton Abbey in 1968", which would be a 3-ary relation between Erik, Downton Abbey (which would refer to an actual building in the world) and the year 1968. [12]

Before this relation can be extracted from an unseen text, several prerequisites need to be met. For example, we need to recognize and categorize the different entities present in the text, focusing on the types of entities that will take part in the relations we want to extract. But extracting entities isn't enough; we also need to extract grammatical and syntactic information about the text, as well as some kind of numerical representation of the meaning of words. One other thing that needs to be solved before extracting the relations is coreference clusters, that is, linking different noun phrases to each other, such as linking "he" to "Erik" and "computer" to "it" in "Erik has just bought a new computer. He is very happy with it."
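To make these notions concrete, the example relation and coreference clusters above can be represented with plain Python data structures. This is an illustrative sketch only: the `Entity` class, the tuple layout, and the entity labels are hypothetical and are not taken from any library used in this work.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    text: str   # surface form of the noun phrase
    label: str  # entity type, e.g. PERSON, BUILDING, TIME
    start: int  # token offset where the span begins
    end: int    # token offset one past the span's last token

# The 3-ary relation in "Erik sold the Downton Abbey in 1968":
# a predicate plus the three participating entities.
relation = (
    "sell",
    Entity("Erik", "PERSON", 0, 1),
    Entity("Downton Abbey", "BUILDING", 3, 5),
    Entity("1968", "TIME", 6, 7),
)

# Coreference clusters for "Erik has just bought a new computer.
# He is very happy with it." -- each cluster groups mentions that
# refer to the same real-world entity.
clusters = [
    ["Erik", "He"],            # the pronoun "He" resolved to "Erik"
    ["a new computer", "it"],  # "it" resolved to "a new computer"
]

print(len(relation) - 1)  # arity of the relation: 3
```

A relation extractor would emit tuples of roughly this shape, while a coreference solver's output corresponds to the list of clusters.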
The project was done in collaboration with Riksarkivet (the Swedish National Archives), and the goal is to use this work's resulting model as a base when implementing relation extraction from archival descriptions. These relations can then be structured in a database, thereby adding search functionality to the archives.

1.2 Problem

To extract relations between entities in a text, information about the text first has to be gathered. For example, the relevant entities need to be extracted before relations between them can be extracted. But other information has to be extracted as well. For example, "Erik owns Fårö Herrgård" and "Erik works at Fårö Herrgård" signify two different relations even though the named entities (Erik, Fårö Herrgård) are the same. What information about this sentence, and the tokens comprising the sentence, could aid in separating these two, and other kinds of relations? What information is needed to extract the particular relation of interest, and not all relations with the same entity types?

Another important condition for this work is that it is focused on Swedish texts. This presents other problems than if the work were focused on English texts. The Swedish datasets needed for supervised machine learning, if they exist at all, are much smaller than their English counterparts. Thus we arrive at the central problem underlying this work: what are the prerequisites for extracting entity relations from Swedish texts, and how do you meet these prerequisites given that annotated datasets are scarce?

1.3 Purpose

The purpose of this work is to produce a model capable of extracting entity relations from archival descriptions. The intent is that the model, with task-specific adjustments, can be used for many different kinds of relation extraction tasks within an ongoing project at Riksarkivet to use machine learning techniques to make their archives more accessible.
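The ownership/employment contrast in the problem statement above can be illustrated with a toy sketch: when two sentences share the same entity pair, token-level information such as the lemma of the governing verb is one signal that separates the relation types. The feature names and the relation labels below are hypothetical, chosen for illustration only, and do not come from the final model.

```python
# Two sentences with identical entity pairs but different relations.
# The lemma of the predicate connecting the entities is the feature
# that tells an ownership relation apart from an employment relation.
examples = [
    {"sentence": "Erik owns Fårö Herrgård",
     "entities": ("Erik", "Fårö Herrgård"),
     "predicate_lemma": "own"},
    {"sentence": "Erik works at Fårö Herrgård",
     "entities": ("Erik", "Fårö Herrgård"),
     "predicate_lemma": "work"},
]

# A hypothetical mapping from predicate lemma to relation label.
RELATION_BY_PREDICATE = {"own": "OWNERSHIP", "work": "EMPLOYMENT"}

for ex in examples:
    label = RELATION_BY_PREDICATE[ex["predicate_lemma"]]
    print(ex["sentence"], "->", label)
```

In practice such signals come from the pipeline components this work builds: the part-of-speech tagger and dependency parser supply the predicate and the path between the entity spans.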
1.4 Research Method

The research methods used for this project were literature studies, both of books on the subject, different types of web material, and research papers to learn about the latest developments. Then, for the actual construction of the models, observation combined with experimentation was used. In the case of machine learning this means training a model on only a part of the dataset, keeping the other part unseen for later validation and testing.

1.5 Scope

This work was limited to creating a base model for relation extraction. The actual relation extraction wasn't included in the work; instead, the purpose was to design a model that can serve as a base for many different types of relation extraction tasks.

1.6 Disposition

Chapter 2 describes the theory behind the work and explains the concepts and structures relevant to the task at hand.
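The train/validation/test separation described in Section 1.4 can be sketched as a simple random split. This is a generic illustration: the function name and the proportions are made up here and are not the ones used in the project.

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the dataset and hold out separate portions for
    validation (dev) and final testing; the rest is for training."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = examples[:]          # copy, so the input is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

train, dev, test = train_dev_test_split(list(range(100)))
print(len(train), len(dev), len(test))  # 80 10 10
```

The essential point is that the dev and test portions stay unseen during training, so scores measured on them estimate performance on new text rather than memorization.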