<<

Building and Evaluating Universal Named-Entity Recognition English corpus

Diego Alves[0000−0001−8311−2240], Gaurish Thakkar[0000−0002−8119−5078], and Marko Tadić[0000−0001−6325−820X]

Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb 10000, Croatia {dfvalio,marko.tadic}@ffzg.hr, [email protected]

Abstract. This article presents the application of the Universal Named Entity (UNER) framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and metadata and DBpedia information, we generated an English dataset which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. The final dataset is available, and the established workflow can be applied to any language with an existing Wikipedia and DBpedia. As part of future research, we intend to continue improving the annotation process and extend it to other languages.

Keywords: named entity recognition · data extraction · multilingual nlp

1 Introduction

Named entity recognition and classification (NERC) is an important field within Natural Language Processing (NLP), being a crucial task of information extraction from texts. It was first defined in 1995 at the 6th Message Understanding Conference (MUC-6)[5] and has since been used in multiple NLP applications such as event and relation extraction and entity-oriented search systems. As shown by Alves et al.[1], NERC corpora and tools present an immense variety in terms of annotation hierarchies and formats. The NERC hierarchy structure is usually defined locally, considering the final NLP application in which it will be used. While types such as "Person", "Location", and "Organization" are present in almost every NERC system, some corpora use more complex annotation types. This is the case, for example, of the Portuguese Second HAREM[7], the Czech Named Entity Corpus 2.0[18] and the Romanian RONEC[6]. Multilingual alternatives also exist, such as the spaCy software[8], which proposes two different single-level hierarchies composed of either 18 or 4 NERC types, following OntoNotes 5.0[19] and Wikipedia[15] respectively.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Unlike Part-of-Speech and dependency tagging, which have Universal Dependencies¹, there is no universal alternative for NERC in terms of annotation framework and multilingual repository following the same standards. Hence, we use the Universal Named Entity (UNER) framework, which is composed of a complex NERC hierarchy inspired by Sekine's work[17], and propose a process which parses data from Wikipedia², extracts named entities through hyperlinks, aligns them with DBpedia³[11] entity classes and translates them into UNER types and subtypes. This process can be applied to any language present in both Wikipedia and DBpedia, therefore generating multilingual NERC corpora following the same procedure and hierarchy. Thus, UNER is useful for multilingual NLP tasks which need recognition and classification of named entities going beyond the classical NERC hierarchy involving only a few types. UNER data can be used in its totality or can be easily adapted to specific needs. For example, considering the classic "Location" type usually present in NERC corpora, UNER can be used to obtain more detailed information: whether an entity is more specifically a country, mountain, island, etc.

This paper presents the UNER hierarchy and its workflow for data extraction and annotation. It details the application of the proposed process to the English language, with qualitative and quantitative evaluation of the automatically annotated data. It also presents the evaluation of different alternatives implemented for the improvement of the generated dataset.

This article is organized as follows: in Section 2, we present the state of the art concerning NERC automatic data generation workflows; in Section 3, we describe the UNER framework and hierarchy, and in Section 4 the details of the data extraction and annotation workflow and the evaluation of the generated dataset. In Section 5, we report the experiments that were conducted for improving the dataset quality in terms of precision and recall. Section 6 is dedicated to the discussion of the results, and in Section 7, we present our conclusions and possible future directions for research.

2 Related Work

The UNER framework was first introduced by us in a previous article [2], where we defined its hierarchy. In this article, the framework is revised, and a workflow for automatic text annotation is developed and applied to generate an annotated corpus in English, together with its evaluation.

Deep learning has been employed in NERC systems in recent years, improving state-of-the-art performance and therefore increasing the need for quality annotated datasets, as stated by Yadav and Bethard[21] and Li et al.[12]. These authors have provided a large overview of existing techniques for extracting and classifying named entities using machine and deep learning methods.

1 https://universaldependencies.org
2 https://www.wikipedia.org/
3 https://www.dbpedia.org/

The problem of composing new NERC datasets has been the object of the study proposed by Lawson et al.[10]. Since manual annotation of large corpora is usually very costly, the authors propose the usage of Amazon Mechanical Turk as a low-cost alternative. However, this method still depends on a specific budget for the task and can be very time-consuming. It also depends on the availability of annotators for each specific language, which may be problematic if the aim is to generate large multilingual corpora.

A generic method for extracting entities from Wikipedia articles was proposed by Bekavac and Tadić[3] and includes multi-word extraction of named entities using local regular grammars. Therefore, for each targeted language, a new set of rules must be defined. Other automatic multilingual solutions have been proposed by Ni et al.[14] and Kim et al.[9], using either annotation projection on comparable corpora or Wikipedia metadata on parallel datasets. Both methods, however, still require manual annotations that are language-dependent and cannot be applied universally. Furthermore, Weber & Vieira[13] use a process similar to the one presented in this article for annotating Wikipedia texts using DBpedia information. However, their focus is on Portuguese only, with a very simple NERC hierarchy.

The idea of using Wikipedia metadata to annotate multilingual corpora has also been proposed by Nothman et al.[15] for English, German, Spanish, Dutch, and Russian. Despite the multilingual approach, it also requires manually annotated text.

3 UNER Framework Description

As mentioned in the previous section, the UNER hierarchy was introduced by Alves et al.[1]. It was built upon the 4-level NERC hierarchy proposed by Sekine[17], which was chosen as it presents a high conceptual hierarchy. The changes between the two structures have been detailed by the authors. The proposed UNER hierarchy is also composed of 4 levels, level 0 being the root node from which all the other levels derive. Level 1 consists of three main classes: Name, Time Expression and Numerical Expression. Level 2 is composed of 29 named-entity categories, which can be detailed in a third level with 95 types. Additionally, level 4 contains 129 subtypes (Alves et al.[1]). This first version of the UNER hierarchy therefore encompasses 215 labels, which can contain up to four levels of granularity depending on how detailed the named-entity type is.

UNER labels are composed of tags from each level separated by a hyphen "-". As level 0 is the root and common to all entities, it is not present in the label. For example:

– UNER label Name-Event-Natural Phenomenon-Earthquake is composed of level 1 Name, level 2 Event, level 3 Natural Phenomenon and level 4 Earthquake.
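As a minimal illustration (assuming, as in the example above, that the hyphen only separates hierarchy levels and never occurs inside a level name), such a label can be decomposed programmatically:

```python
# Minimal illustration of the UNER label format: levels are joined by "-".
# Assumes the hyphen never occurs inside a level name, as in the example above.
label = "Name-Event-Natural Phenomenon-Earthquake"

for depth, node in enumerate(label.split("-"), start=1):
    print(f"level {depth}: {node}")
# level 1: Name
# level 2: Event
# level 3: Natural Phenomenon
# level 4: Earthquake
```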

The idea of using both Wikipedia data and metadata associated with DBpedia information to generate UNER-annotated datasets compelled us to revise the first proposed UNER hierarchy. The main reason is that the automatic annotation process is based on a list of equivalences between UNER labels and DBpedia classes. While generating this list of equivalences, it became noticeable that not all UNER labels would have an equivalent DBpedia class. This is the case for the majority of Time and Numerical Expressions. These cases will have to be dealt with by other automatic methods in future work. Therefore, for this article, we consider version 2 of UNER, presented on the GitHub webpage of the project⁴. It is composed of 124 labels and its hierarchy is detailed in Table 1.

Table 1. Description of the number of nodes per level inside UNER v2 hierarchy.

Level     Number of nodes
0 (root)  1
1         3
2         14
3         53
4         88

In the annotation process, we have decided to use the IOB format[16], as it is widely used by many NERC systems, as shown by Alves et al.[1]. Therefore, each annotated entity token receives, at the beginning of its UNER label, the letter "B" if the token is the first of the entity or "I" if it is inside the entity. Non-entity tokens receive only the tag "O".
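A small sketch of how these prefixes are attached to the tokens of a single annotated entity follows; the entity and its UNER label are taken from the example discussed in Section 4.1, and the trailing punctuation token is only illustrative.

```python
# Sketch: attaching IOB prefixes to the tokens of one annotated entity.
entity_tokens = ["2015", "European", "Games"]   # entity from Section 4.1
uner_label = "Name-Event-Occasion-Game"

tagged = [(tok, ("B-" if i == 0 else "I-") + uner_label)
          for i, tok in enumerate(entity_tokens)]
tagged.append((",", "O"))  # a following non-entity token receives "O"

for token, tag in tagged:
    print(token, tag)
# 2015     B-Name-Event-Occasion-Game
# European I-Name-Event-Occasion-Game
# Games    I-Name-Event-Occasion-Game
# ,        O
```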

4 Data Extraction and Annotation

The workflow we have developed allows the extraction of texts and metadata from Wikipedia (for any language present in this database), followed by the identification of the DBpedia classes via the hyperlinks associated with certain tokens (entities) and the translation to UNER types and subtypes (these last two steps being language-independent). Once the main process of data extraction and annotation is over, the workflow proposes post-processing steps to improve the tokenization, implement the IOB format [16] and gather statistical information concerning the generated corpus. The whole workflow is presented in detail on the project's GitHub webpage, together with all the scripts that have been used, and it can be applied to any other Wikipedia language.

4.1 Process description

1. UNER/DBpedia Mapping: This is a mapper that connects each pertinent DBpedia class with a single UNER tag. It was created by the members of the project, analysing each DBpedia class and associating it to the most pertinent UNER tag. A single extracted named entity might have more than one DBpedia class. For example, the entity 2015 European Games has the following DBpedia classes with the respective UNER equivalences:
   – dbo:Event – Name-Event-Historical-Event
   – dbo:SoccerTournament – Name-Event-Occasion-Game
   – dbo:SocietalEvent – Name-Event-Historical-Event
   – dbo:SportsEvent – Name-Event-Occasion-Game
   – owl:Thing – NULL
The value on the left represents a DBpedia class, and its UNER equivalent is on the right side of the class. The mapper maps all the DBpedia classes to equivalent UNER classes.
2. DBpedia Hierarchy: This mapper assigns a priority to each DBpedia class. It is used to select a single DBpedia class from the collection of classes that are associated with an entity. The following are examples of classes and their priorities:
   – dbo:Event – 2
   – dbo:SoccerTournament – 4
   – dbo:SocietalEvent – 2
   – dbo:SportsEvent – 4
   – owl:Thing – 1
For the entity 2015 European Games, the DBpedia class SoccerTournament presides over the other classes as it has a higher priority value. If the extracted entity has two assigned classes with the same hierarchy value, the first from the list is chosen as the final one (a minimal sketch of this resolution step is given after this list). All the DBpedia classes were assigned a hierarchy value according to the DBpedia Ontology⁵, where classes are presented in a structural order which allowed us to define the hierarchical levels.

4 https://github.com/cleopatra-itn/MIDAS/
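The following is a minimal sketch of this resolution logic, reproducing only the example mappings above; the complete mapping and priority tables live in the project repository, and the function name is ours.

```python
# Sketch of the DBpedia-to-UNER resolution described in items 1 and 2 above,
# using only the example mappings from this subsection.
uner_mapping = {
    "dbo:Event": "Name-Event-Historical-Event",
    "dbo:SoccerTournament": "Name-Event-Occasion-Game",
    "dbo:SocietalEvent": "Name-Event-Historical-Event",
    "dbo:SportsEvent": "Name-Event-Occasion-Game",
    "owl:Thing": None,
}
priority = {"dbo:Event": 2, "dbo:SoccerTournament": 4, "dbo:SocietalEvent": 2,
            "dbo:SportsEvent": 4, "owl:Thing": 1}

def resolve_uner(dbpedia_classes):
    """Keep the class with the highest priority value (ties: first in the list)
    and translate it into its UNER tag."""
    best = max(dbpedia_classes, key=lambda c: priority.get(c, 0))
    return uner_mapping.get(best)

classes = ["dbo:Event", "dbo:SoccerTournament", "dbo:SocietalEvent",
           "dbo:SportsEvent", "owl:Thing"]
print(resolve_uner(classes))  # Name-Event-Occasion-Game
```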

4.2 Main process

The main process is schematized in the figure below and is divided into three sub-processes.

1. Extraction from Wikipedia dumps: For a given language, we obtain its latest dump from the Wikimedia website⁶. Next, we perform text extraction, preserving the hyperlinks in the articles, using WikiExtractor⁷. These hyperlinks point to other Wikipedia pages and serve as unique identifiers of the corresponding named entities. We extract all the unique hyperlinks and sort them alphabetically. These hyperlinks will be referred to as named entities henceforth.

5 http://mappings.dbpedia.org/server/ontology/classes/
6 https://dumps.wikimedia.org/
7 https://github.com/attardi/wikiextractor

Fig. 1. Main process steps for Wikipedia data extraction and DBpedia/UNER anno- tations. Squares represent data and diamonds represent processing steps.

2. Wikipedia-DBpedia entity linking: For all the unique named entities from the dumps, we query the DBpedia endpoint using a SPARQL query issued with SPARQLWrapper⁸ to identify the various classes associated with each entity. This step produces, for each named entity from step 1, the set of DBpedia classes it belongs to (a minimal query sketch is given after this list).
3. Wikipedia-DBpedia-UNER back-mapping: For every extracted named entity obtained in step 1, we use the set of classes produced in step 2, along with the UNER/DBpedia mapping schema, to assign a UNER class to each named entity. For an entity, all the classes obtained from the DBpedia response are mapped to a hierarchy value, the highest-valued class is resolved and chosen, and it is then mapped to its UNER class. For constructing the final annotated dataset, we only select those sentences that contain at least one named entity. This reduces the sparsity of annotations and thus reduces the false-negative rate in our test models. This step produces an initial tagged corpus from the whole Wikipedia dump for a specific language.
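A minimal sketch of the entity-linking query in step 2 is given below. It uses SPARQLWrapper against the public DBpedia endpoint; the exact query used in the project scripts may differ, and the entity URI is only an illustrative example.

```python
# Sketch of step 2: retrieving the DBpedia classes of one entity via SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
entity_uri = "http://dbpedia.org/resource/2015_European_Games"  # illustrative

endpoint.setQuery(f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT DISTINCT ?class WHERE {{ <{entity_uri}> rdf:type ?class . }}
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# The set of DBpedia (and other ontology) classes the entity belongs to.
dbpedia_classes = {b["class"]["value"] for b in results["results"]["bindings"]}
print(dbpedia_classes)
```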

8 https://rdflib.dev/sparqlwrapper/

4.3 Post-processing steps

The post-processing steps correspond to three different scripts that provide:

1. The improvement of the tokenization (using regular expressions) by isolating punctuation characters that were attached to adjacent tokens. In addition, it applies the IOB format[16] to the UNER annotations inside the text.
2. The calculation of the following statistical information concerning the generated corpus: total number of tokens, number of non-entity tokens (tag "O"), number of entity tokens (tags "B" or "I"), and number of entities (tag "B"). The script also provides a list of all UNER tags with the number of occurrences of each tag inside the corpus (a counting sketch is given after this list).
3. The listing of the entities inside the corpus (tokens and the corresponding UNER tag). Each identified entity appears once in this list, even if it has multiple occurrences in the corpus.
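A counting sketch for the statistics in item 2 is shown below; the input format, a sequence of (token, tag) pairs in IOB form, and the sample sentence are illustrative assumptions.

```python
# Sketch of the statistics in item 2, computed over (token, tag) pairs in IOB form.
from collections import Counter

def corpus_statistics(token_tag_pairs):
    stats, tag_counts = Counter(), Counter()
    for _token, tag in token_tag_pairs:
        stats["total tokens"] += 1
        if tag == "O":
            stats["non-entity tokens"] += 1
        else:
            stats["entity tokens"] += 1
            tag_counts[tag] += 1            # occurrences per UNER tag
            if tag.startswith("B-"):
                stats["entities"] += 1      # one "B" tag per entity
    return stats, tag_counts

sample = [("Bengkulu", "B-Name-Location-GPE-City"), ("is", "O"),
          ("a", "O"), ("city", "O"), (".", "O")]
print(corpus_statistics(sample))
```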

The whole process and post-processing steps were applied to the English language, generating the UNER English corpus, which is described and evaluated in the following section. This baseline corpus is the basis for the improvement experiments presented in later sections.

4.4 UNER English Corpus (Baseline)

General Information The English Wikipedia [20] is composed of 6,188,204 articles. After applying the main process of the proposed workflow, we obtained annotated text files divided into folders. The size of the English UNER corpus is presented in Table 2.

Table 2. Corpus Size.

Corpus        Size    Folders  Files
English UNER  3.3 GB  172      17,150

Statistical information concerning the corpus is obtained by applying the post-processing steps previously described. Table 3 presents the main statistics about the number of tokens and entities. Inside the UNER English Corpus, 8.9% of tokens are entities.

Table 3. Corpora Annotation Statistics.

                              English UNER Corpus
Total Number of Tokens        325,395,838
Number of Non-Entity Tokens   320,719,350
Number of Entity Tokens       31,676,488
Number of Entities            15,101,318
Number of Different Entities  630,519

As presented in Section 3, the UNER hierarchy used for annotating the English Wikipedia texts is composed of 124 different multi-level labels with equivalences to DBpedia classes. However, the baseline UNER English corpus contains 99 different UNER tags (80%). As explained previously, the UNER hierarchy is composed of categories, types, and subtypes. UNER includes the most common classes used in NERC (Person, Location, Organization), while being more detailed (subtypes):

– Person: corresponds to the UNER label Name-Person-Name
– Location: corresponds to all subtypes inside the UNER type Name-Location
– Organization: corresponds to all subtypes inside the UNER type Name-Organization

Therefore, it is possible to analyse the generated corpus in terms of these more generic classes.
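A small sketch of this grouping (as used for Table 4 below) is given here; the prefix test simply follows the three correspondences listed above.

```python
# Sketch: folding fine-grained UNER labels into the three coarse NERC classes.
def coarse_class(uner_label):
    if uner_label.startswith("Name-Person-Name"):
        return "Person"
    if uner_label.startswith("Name-Location"):
        return "Location"
    if uner_label.startswith("Name-Organization"):
        return "Organization"
    return None  # other UNER categories are not counted in Table 4

print(coarse_class("Name-Location-GPE-City"))    # Location
print(coarse_class("Name-Event-Occasion-Game"))  # None
```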

Table 4. Corpora Annotation Statistics in terms of number of occurrences of the most used NERC classes (and % of all entity occurrences).

              English UNER Corpus
Person        4,200,313 (27.8%)
Location      2,613,248 (17.3%)
Organization  3,489,813 (23.1%)

These main classes correspond to 68.2% of NEs in the generated corpus.

Qualitative evaluation The proposed process requires the identification of the DBpedia classes associated with the respective tokens (via hyperlinks) and the translation to UNER using the UNER/DBpedia equivalences. An analysis of 943 entities randomly selected from the UNER English Corpus has been performed to evaluate this step of the workflow. For each entity, we have checked the associated DBpedia classes and the final UNER tag chosen. Table 5 presents the results of this evaluation.

Table 5. Evaluation of the annotation step: DBpedia class extraction and translation to the UNER hierarchy.

Tag Evaluation                     Number of Occurrences  Percentage
Correct                            797                     85%
Correct but vague                  55                      6%
Incorrect due to DBpedia           62                      7%
Incorrect due to UNER association  29                      3%

In the selected sample, 91% of the entities are correctly tagged with UNER tags. Nevertheless, 6% are associated with the correct UNER type but with a more generic subtype. For example, Bengkulu should be tagged as Name-Location-GPE-City but received the tag Name-Location-GPE-GPE Other. Errors may come from mistakes in the DBpedia classes associated with the tokens or from the prioritization rules and equivalences defined between DBpedia and UNER:

– Buddhism is associated only with the DBpedia class EthnicGroup and, therefore, is wrongly tagged as Name-Organization-Ethnic Group other, while it should be associated with the UNER tag Name-Product-Doctrine Method-Religion.
– Brit Awards, due to the prioritization of the DBpedia class hierarchy in the choice of UNER tags, is wrongly tagged as Name-Organization-Corporation-Company, while it should receive the tag Name-Product-Award.

UNER English Golden dataset Besides the statistical information presented above, a sample from the generated corpus has been selected and corrected using WebAnno[4] by one annotator. The sample corresponds to one entire file from the output folders and contains 519 sentences and 105 different UNER labels (out of the 124 from the list of UNER-DBpedia equivalences). The annotations were done by a non-native English speaker who is a member of the project. He followed objective guidelines, and for some specific entities, research using Wikipedia was done. In cases of multiple possible assignments, a final choice was made by the annotator so that each entity would have only one label in the golden set. Table 6 presents the evaluation results of the baseline annotations of the file used to create the Golden dataset in terms of precision, recall, and F1-measure, considering the mean value over all 105 labels for each metric.

Table 6. Precision, Recall, and F1 Measure values of UNER EN dataset considering 519 manually annotated sentences.

Experiment  Precision  Recall  F1-measure
Baseline    61.9       27.2    37.8

As explained previously, the annotation of a certain named entity depends on the existence of hyperlinks. However, these links are not always associated with the tokens if the entity is mentioned repeatedly in the article. This may be one of the main reasons for the low value obtained for recall.
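For reference, the label-wise scores reported in Table 6 (and in the experiments of Section 5) are precision, recall, and F1 computed per IOB-prefixed UNER tag and then macro-averaged over the tags considered. A minimal sketch of such a computation, under the assumption of token-level comparison, is given below; the two tag sequences are illustrative.

```python
# Sketch: per-tag precision/recall/F1 on IOB-prefixed UNER labels, macro-averaged.
def per_label_scores(gold, predicted, labels):
    scores = {}
    for label in labels:
        tp = sum(g == p == label for g, p in zip(gold, predicted))
        fp = sum(p == label and g != label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    macro = tuple(sum(s[i] for s in scores.values()) / len(scores) for i in range(3))
    return scores, macro

gold = ["B-Name-Person-Name", "O", "B-Name-Location-GPE-City"]
pred = ["B-Name-Person-Name", "O", "O"]
print(per_label_scores(gold, pred,
                       ["B-Name-Person-Name", "B-Name-Location-GPE-City"])[1])
```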

5 Dataset Improvement

Evaluation of the baseline annotated file using the Golden UNER English Corpus shows that the automatic annotation workflow has room for improvement, especially in terms of reducing the number of false negatives. Strategies for completing the annotation using dictionaries and a knowledge graph were applied to the English corpus. The ensemble of experiments and their evaluation is presented in the subsections below.

5.1 Experiment Design

Seven different experiments were conducted:

1. Global Dictionary: From the whole UNER English Corpus, we have established a dictionary of entities and their respective UNER labels. As the same entity may appear in the corpus with different UNER tags (due to the associated DBpedia classes), we have selected for each entity the label with the highest number of occurrences. This dictionary is then used to complete the annotations of the corpus. Only entities longer than 2 characters were considered, and numerical entities were excluded from the dictionary. The final size of the global dictionary is 826,371 entities.
2. Global Dictionary only with multi-token entities: Similar to the previous experiment, but in this case only entities with more than one token were considered. In total, the global dictionary is composed of 665,081 multi-token entities.
3. Local Dictionaries: In this setup we processed every Wikipedia dump file as a single article. Every entity in the article that is linked to UNER is cached into a local lookup dictionary with its text as the key and its UNER class as the value. For every subsequent occurrence of the text in the given article, we annotated the text with the corresponding UNER class. We performed this step with the speculation that entities are more likely to reappear within a single article than in a completely unrelated article. For example, Barack Obama as a person is more likely to appear in an article describing him as president than as a fictional character in fictional content about him.
4. Global OEKG Dictionary: The Open Event Knowledge Graph (OEKG)⁹ is a multilingual event-centric resource. Its instances have specific DBpedia classes; therefore, we intersected all the entries from the global dictionary with the elements from OEKG. For each entity, its associated DBpedia class from OEKG was then mapped to UNER. The global OEKG dictionary contains 128,813 entries.
5. Global OEKG Dictionary only with multi-token entities: Similar to experiment 4, only in this case only entities with more than one token were considered (110,226 entities in total).
6. Local Dictionaries followed by Global OEKG Dictionary: Combination of experiment 3 with completion of the annotations using the dictionary established for experiment 4.
7. Local Dictionaries followed by the OEKG Dictionary only with multi-token entities: The corpus from experiment 3 is completed using the dictionary from experiment 5.

In all experiments, the dictionaries were ordered from the longest entities to the shortest ones, to guarantee that multi-token entities were preferably annotated rather than mono-token ones (a sketch of this dictionary-based completion is given below).
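A minimal sketch of the dictionary construction and completion used in experiments 1 and 2 follows; the function names are ours, and a real implementation would additionally mark the matched spans with IOB tags and skip tokens already annotated by the main workflow.

```python
# Sketch of experiments 1-2: build a global entity dictionary and apply it
# longest-entity-first to complete missing annotations.
from collections import Counter, defaultdict

def build_global_dictionary(annotated_entities, multi_token_only=False):
    """Keep the most frequent UNER label per entity string; drop entities of
    two characters or fewer and purely numerical ones."""
    votes = defaultdict(Counter)
    for text, label in annotated_entities:      # (entity string, UNER label)
        if len(text) <= 2 or text.replace(" ", "").isdigit():
            continue
        if multi_token_only and len(text.split()) < 2:
            continue
        votes[text][label] += 1
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}

def complete_annotations(sentence, dictionary):
    """Return dictionary entries found in the sentence, longest entities first,
    so multi-token matches take precedence over mono-token ones."""
    matches = []
    for text in sorted(dictionary, key=len, reverse=True):
        if text in sentence:
            matches.append((text, dictionary[text]))
    return matches
```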

9 http://cleopatra-project.eu/index.php/open-event-knowledge-graph/

5.2 Evaluation

The evaluation was conducted using the Golden Corpus presented previously. The baseline is the corresponding file with the automatic annotations resulting from the workflow described in Section 4. The Golden Corpus has 105 different UNER labels; however, the baseline annotated file has only 62. For each possible label, we calculated precision, recall, and F1-measure. The IOB format[16] was applied; therefore, each UNER label can start either with "B" or "I", and non-entity tokens were tagged with "O". Of the 62 labels of the baseline, only 45 presented results different from 0. Therefore, the values in the following table consider only these tags and represent the mean value over all the tags taken into account. Table 7 presents the metrics obtained for the baseline and each of the experiments described in the previous subsection.

Table 7. Precision, Recall, and F1 Measure values of experiments for improving UNER EN dataset.

Experiment  Precision  Recall  F1-measure
Baseline    72.9       32.0    39.2
1           32.1       35.7    27.0
2           47.5       34.8    34.0
3           73.6       29.6    36.8
4           71.1       33.9    40.8
5           73.0       33.4    40.5
6           72.1       32.1    39.1
7           74.0       31.5    38.7

Using the global dictionary (experiment 1) provides the highest value of recall (+3.7 compared to the baseline), but precision is considerably lower (-40.8). The situation is similar when the global dictionary is used only with multi-token entities (experiment 2). The other experiments do not decrease precision so drastically, and in some cases this metric is even increased. Recall is increased, compared to the baseline, for all experiments except 3, 6 and 7. The usage of local dictionaries was not an effective solution for improving this evaluation metric.

The best option, considering the F1-measure, is the usage of the dictionary verified with OEKG (experiment 4). Precision is slightly lower than the baseline (-1.8), while recall and F1-measure are higher (+1.9 and +1.6 respectively). If we consider only level 3 of the UNER hierarchy, the possible tags are: Disease, Event, Facility, Location, Natural Object, Organization, Person and Product. The evaluation of each experiment considering only this upper level of the UNER hierarchy is presented in Table 8. The IOB format was also considered; therefore, UNER labels could be preceded by either "B" or "I" and non-entity tokens were tagged with "O".

Table 8. Precision, Recall, and F1 Measure values of experiments for improving UNER EN dataset considering only level 3 of UNER hierarchy.

Experiment  Precision  Recall  F1-measure
Baseline    76.9       25.1    34.0
1           25.9       31.0    25.2
2           37.2       27.8    31.1
3           76.8       24.5    33.2
4           74.6       27.3    36.0
5           76.6       26.3    35.3
6           74.6       26.9    35.5
7           76.6       25.8    34.7

In this scenario, the highest precision is obtained by the baseline. The best recall is obtained when the global dictionary is used (experiment 1) but, as observed before, in this case precision is heavily impacted compared to the baseline (-51.0). Experiment 4 is the one with the highest F1-measure, as in the previous evaluation where all UNER levels were considered.

6 Discussion

In Section 4 we presented the whole process for generating UNER-annotated corpora and the application of this workflow to create a dataset for the English language. The evaluation of the entity extraction and translation to UNER (Table 5) showed that 85% of the analysed entities were correctly annotated. However, errors are introduced into the dataset mainly because of wrong associations between the tokens inside the Wikipedia text and DBpedia classes. Furthermore, the selection rule in our process (choosing the DBpedia class having the highest granularity inside the DBpedia hierarchy) is also a source of mistakes.

The corpus generated by the proposed workflow was also evaluated using the manually annotated Golden dataset. It is noticeable that, even with the reduced version of the UNER hierarchy (UNER v2, 124 labels with DBpedia equivalences) used for this task, the whole dataset was annotated with only 99 different labels, of which only 62 presented values of precision and recall different from zero in the baseline sample file when compared to the Golden dataset. Therefore, further changes in the UNER hierarchy should be implemented; for example, Time and Numerical Expressions (already reduced in UNER v2) should be excluded in this step of the annotation, and some detailed labels concerning Events, Organizations and Locations can be joined into more generic categories.

As expected, recall is much lower than precision in all conducted evaluations. This is due to the fact that inside Wikipedia not all entities are linked to DBpedia. This problem was also encountered by Weber & Vieira[13], who used a similar workflow for a Portuguese named-entity task with a much simpler hierarchy. The authors did not conduct any intrinsic evaluation of the generated corpus but, instead, used it to train models and evaluate the results against existing Golden sets.

Concerning the improvement experiments, the best identified option was to use a dictionary fine-tuned with the Open Event Knowledge Graph. This graph allows the identification of a more precise DBpedia class and therefore helps improve recall without a considerable loss in precision. In this article, we presented the use of this method as a post-processing step to complete the initial annotations of the baseline. However, the usage of OEKG information can also be implemented inside the workflow as a way of improving overall precision.

7 Conclusions and Future Directions

In this paper, we described an automatic workflow for generating multilingual named-entity recognition corpora by using Wikipedia and DBpedia data and following the UNER hierarchy. The whole process is available and can be applied to any language having Wikipedia and DBpedia. We also presented the application of the extraction and annotation method used to generate the UNER English corpus. The generated dataset has been described and evaluated with a manually annotated Golden set. Furthermore, an ensemble of experiments was conducted to improve the final annotated dataset. We have identified that the best results were obtained by using a dictionary of entities with verification of the associated DBpedia class using the Open Event Knowledge Graph: 76.9 for precision, 31.0 for recall and 36.0 for F1-measure. Nevertheless, there is still room for improvement in both recall and F1-measure.

In our future work, we plan to continue exploring OEKG by integrating it into the extraction and annotation workflow and not only as a post-processing step. We also intend to extend our corpus to other languages, especially under-resourced ones, while evaluating our workflow's performance across languages. Moreover, to complete this intrinsic evaluation of our dataset, we plan to evaluate it extrinsically by using the generated datasets to train machine and deep learning models.

8 Acknowledgements

The work presented in this paper has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

References

1. Alves, D., Kuculo, T., Amaral, G., Thakkar, G., Tadić, M.: UNER: Universal Named-Entity Recognition framework. In: Proceedings of the 1st International Workshop on Cross-lingual Event-centric Open Analytics. pp. 72–79. Association for Computational Linguistics (2020), https://www.aclweb.org/anthology/W10-0712
2. Alves, D., Thakkar, G., Tadić, M.: Evaluating language tools for fifteen EU-official under-resourced languages. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 1866–1873. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.230
3. Bekavac, B., Tadić, M.: A generic method for multi word extraction from Wikipedia. In: Proceedings of the 30th International Conference on Information Technology Interfaces (2008), https://www.bib.irb.hr/348724
4. Eckart de Castilho, R., Mújdricza-Maydt, É., Yimam, S.M., Hartmann, S., Gurevych, I., Frank, A., Biemann, C.: A web-based tool for the integrated annotation of semantic and syntactic structures. In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). pp. 76–84. The COLING 2016 Organizing Committee, Osaka, Japan (Dec 2016), https://www.aclweb.org/anthology/W16-4011
5. Chinchor, N., Robinson, P.: Appendix E: MUC-7 Named Entity Task Definition (version 3.5). In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998 (1998), https://www.aclweb.org/anthology/M98-1028
6. Dumitrescu, S.D., Avram, A.: Introducing RONEC - the Romanian named entity corpus. CoRR abs/1909.01247 (2019), http://arxiv.org/abs/1909.01247
7. Freitas, C., Carvalho, P., Gonçalo Oliveira, H., Mota, C., Santos, D.: Second HAREM: advancing the state of the art of named entity recognition in Portuguese. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), Valletta, 17-23 May 2010. European Language Resources Association (2010)
8. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303
9. Kim, S., Toutanova, K., Yu, H.: Multilingual named entity recognition using parallel data and metadata from Wikipedia. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 694–702. Association for Computational Linguistics, Jeju Island, Korea (Jul 2012), https://www.aclweb.org/anthology/P12-1073
10. Lawson, N., Eustice, K., Perkowitz, M., Yetisgen-Yildiz, M.: Annotating large email datasets for named entity recognition with Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. pp. 71–79. Association for Computational Linguistics, Los Angeles (Jun 2010), https://www.aclweb.org/anthology/W10-0712
11. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015). https://doi.org/10.3233/SW-140134, https://madoc.bib.uni-mannheim.de/37476/
12. Li, J., Sun, A., Han, R., Li, C.: A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering PP, 1–1 (03 2020). https://doi.org/10.1109/TKDE.2020.2981314

13. Menezes, D.S., Savarese, P., Milidiú, R.L.: Building a massive corpus for named entity recognition using free open data sources (2019), http://arxiv.org/abs/1908.05758
14. Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. CoRR abs/1707.02483 (2017), http://arxiv.org/abs/1707.02483
15. Nothman, J., Ringland, N., Radford, W., Murphy, T., Curran, J.R.: Learning multilingual named entity recognition from Wikipedia. Artif. Intell. 194, 151–175 (2013), http://dblp.uni-trier.de/db/journals/ai/ai194.html/NothmanRRMC13
16. Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995), https://www.aclweb.org/anthology/W95-0107
17. Sekine, S.: The Definition of Sekine's Extended Named Entities. https://nlp.cs.nyu.edu/ene/version7_1_0Beng.html (07 2007), (Accessed on 28/02/2020)
18. Ševčíková, M., Žabokrtský, Z., Křůza, O.: Named entities in Czech: annotating data and developing NE tagger. In: International Conference on Text, Speech and Dialogue. pp. 188–195. Springer (2007)
19. Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., Xue, N., Taylor, A., Kaufman, J., Franchini, M., El-Bachouti, M., Belvin, R., Houston, A.: OntoNotes Release 5.0 (2013). https://doi.org/11272.1/AB2/MKJJ2R, https://hdl.handle.net/11272.1/AB2/MKJJ2R
20. Wikipedia: English Wikipedia — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=English%20Wikipedia&oldid=987449701 (2020), [Online; accessed 14-November-2020]
21. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. CoRR abs/1910.11470 (2019), http://arxiv.org/abs/1910.11470