pioNER: Datasets and Baselines for Armenian Named Entity Recognition

Tsolak Ghukasyan1, Garnik Davtyan2, Karen Avetisyan3, Ivan Andrianov4
Ivannikov Laboratory for System Programming at Russian-Armenian University1,2,3, Yerevan, Armenia
Ivannikov Institute for System Programming of the Russian Academy of Sciences4, Moscow, Russia
Email: {1tsggukasyan,2garnik.davtyan,3karavet,4ivan.andrianov}@ispras.ru

Abstract—In this work, we tackle the problem of Armenian named entity recognition, providing silver- and gold-standard datasets as well as establishing baseline results on popular models. We present a 163000-token named entity corpus automatically generated and annotated from Wikipedia, and another 53400-token corpus of news sentences with manual annotation of person, organization, and location named entities. The corpora were used to train and evaluate several popular named entity recognition models. Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and encyclopedia.

Index Terms—machine learning, deep learning, natural language processing, named entity recognition, word embeddings

I. INTRODUCTION

Named entity recognition is an important task of natural language processing, featuring in many popular text processing toolkits. The area has been actively studied in recent decades, and the advent of deep learning has reinvigorated research on more effective and accurate models. However, most existing approaches require large annotated corpora. To the best of our knowledge, no such work has been done for the Armenian language, and in this work we address several problems, including the creation of a corpus for training machine learning models, the development of a gold-standard test corpus, and the evaluation of the effectiveness of established approaches for the Armenian language.

Considering the cost of creating a manually annotated named entity corpus, we focused on alternative approaches. The lack of named entity corpora is a common problem for many languages and has drawn the attention of researchers around the globe. Projection-based transfer schemes have been shown to be very effective (e.g. [1], [2], [3]), using a resource-rich language's corpora to generate annotated data for the low-resource language. In this approach, the annotations of the high-resource language are projected onto the corresponding tokens of parallel low-resource texts. This strategy can be applied to language pairs that have parallel corpora; however, it would not work for Armenian, as we did not have access to a sufficiently large parallel corpus with a resource-rich language.

Another popular approach is using Wikipedia. Klesti Hoxha and Artur Baxhaku employ gazetteers extracted from Wikipedia to generate an annotated corpus for Albanian [4], and Weber and Pötzl propose a rule-based system for German that leverages information from Wikipedia [5]. However, the latter relies on external tools such as part-of-speech taggers, making it nonviable for the Armenian language.

Nothman et al. generated a silver-standard corpus for 9 languages by extracting Wikipedia article texts with outgoing links and turning those links into named entity annotations based on the target article's type [6]. Sysoev and Andrianov used a similar approach for the Russian language [7], [8]. Based on its success for a wide range of languages, our choice fell on this model to tackle automated data generation and annotation for the Armenian language.

Aside from the lack of training data, we also address the absence of a benchmark dataset of Armenian texts for named entity recognition. We propose a gold-standard corpus with manual annotation of the CoNLL named entity categories person, location, and organization [9], [10], hoping it will be used to evaluate future named entity recognition models.

Furthermore, popular entity recognition models were applied to the mentioned data in order to obtain baseline results for future research in the area. Along with the datasets, we developed GloVe [11] word embeddings to train and evaluate the deep learning models in our experiments.

The contributions of this work are (i) the silver-standard training corpus, (ii) the gold-standard test corpus, (iii) GloVe word embeddings, and (iv) baseline results for 3 different models on the proposed benchmark dataset. All aforementioned resources are available on GitHub (https://github.com/ispras-texterra/pioner).

II. AUTOMATED TRAINING CORPUS GENERATION

We used Sysoev and Andrianov's modification of the Nothman et al. approach to automatically generate data for training a named entity recognizer. This approach uses links between Wikipedia articles to generate sequences of named-entity annotated tokens.
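For illustration, a sentence annotated in the widely used CoNLL-style BIO scheme might look as follows; this is a constructed English example for readability, not a fragment of the corpus, and is not meant to imply the exact tagging scheme of the released data:

Kim          B-PER
Kashkashian  I-PER
was          O
born         O
in           O
Detroit      B-LOC
.            O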
A. Dataset extraction

The main steps of the dataset extraction system are described in Figure 1.

[Fig. 1: Steps of automatic dataset extraction from Wikipedia. The depicted steps are: classification of Wikipedia articles into NE types; labelling common article aliases to increase coverage; extraction of text fragments with outgoing links; labelling links according to their target article's type; adjustment of labeled entities' boundaries.]

First, each Wikipedia article is assigned a named entity class (e.g. the article Qim Qaqayan (Kim Kashkashian) is classified as PER (person), Azgeri liga (League of Nations) as ORG (organization), Siria (Syria) as LOC, etc.). One of the core differences between our approach and Nothman's system is that we do not rely on manual classification of articles and do not use inter-language links to project article classifications across languages. Instead, our classification algorithm uses only an article's Wikidata entry's first "instance of" label's parent "subclass of" labels, which are, incidentally, language independent and thus can be used for any language.

Then, outgoing links in articles are assigned the type of the article they lead to. Sentences are included in the training corpus only if they contain at least one named entity and all contained capitalized words have an outgoing link to an article of known type. Since in Wikipedia articles only the first mention of each entity is linked, this approach becomes very restrictive, and in order to include more sentences, additional links are inferred. This is accomplished by compiling a list of common aliases for articles corresponding to named entities, and then finding text fragments matching those aliases to assign a named entity label. An article's aliases include its title, titles of disambiguation pages with the article, and texts of links leading to the article (e.g. Leningrad (Leningrad), Petrograd (Petrograd), and Peterbowrg (Peterburg) are aliases for Sankt Peterbowrg (Saint Petersburg)). The list of aliases is compiled for all PER, ORG, and LOC articles.
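To make the sentence selection and alias-based label inference concrete, the following Python sketch shows one possible implementation of these two steps. It is a simplified illustration under our own assumptions, not the authors' actual code; the function name and the data structures (links, article_type, alias_index) are hypothetical.

# Illustrative sketch (not the authors' implementation) of link-based
# labelling, alias-based label inference, and sentence selection.
def label_sentence(tokens, links, article_type, alias_index):
    # tokens: list of word strings in the sentence
    # links: dict mapping (start, end) token spans to target article titles
    # article_type: dict mapping article titles to NE types ("PER"/"ORG"/"LOC")
    # alias_index: dict mapping alias strings to article titles
    labels = ["O"] * len(tokens)

    def mark(start, end, ne_type):
        labels[start] = "B-" + ne_type
        for i in range(start + 1, end):
            labels[i] = "I-" + ne_type

    # 1. Label spans covered by outgoing links with their target article's type.
    for (start, end), target in links.items():
        if target in article_type:
            mark(start, end, article_type[target])

    # 2. Infer additional labels from known aliases of PER/ORG/LOC articles
    #    (longest match first at every position, without overwriting step 1).
    for start in range(len(tokens)):
        for end in range(len(tokens), start, -1):
            target = alias_index.get(" ".join(tokens[start:end]))
            if target in article_type and all(l == "O" for l in labels[start:end]):
                mark(start, end, article_type[target])
                break

    # 3. Keep the sentence only if it contains at least one named entity and
    #    every capitalized token ended up inside a labelled span.
    has_entity = any(l != "O" for l in labels)
    covered = all(not t[:1].isupper() or l != "O" for t, l in zip(tokens, labels))
    return list(zip(tokens, labels)) if has_entity and covered else None

In the full pipeline this span labelling would be followed by the boundary adjustment step shown in Figure 1.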
B. Article classification

The classification could be done using solely "instance of" labels, but these labels are unnecessarily specific for the task, and building a mapping on them would require more time-consuming and meticulous work. Therefore, we classified articles based on their first "instance of" attribute's "subclass of" values. Table I displays the mapping between these values and named entity types. Using higher-level "subclass of" values was not an option, as their values were often too general, making it impossible to derive the correct named entity category.

C. Generated data

Using the algorithm described above, we generated 7455 annotated sentences with 163247 tokens based on the 20 February 2018 dump of the Armenian Wikipedia.

The generated data is still significantly smaller than the manually annotated corpora from CoNLL 2002 and 2003. For comparison, the train set of the English CoNLL 2003 corpus contains 203621 tokens and the German one 206931, while the Spanish and Dutch corpora from CoNLL 2002 contain 273037 and 218737 lines, respectively. The smaller size of our generated data can be attributed to the strict selection of candidate sentences as well as simply to the relatively small size of the Armenian Wikipedia.

The accuracy of annotation in the generated corpus relies heavily on the quality of links in Wikipedia articles. During generation, we assumed that the first mention of every named entity has an outgoing link to its article; however, this was not always the case in the actual source data, and as a result the train set contained sentences where not all named entities are labeled. Annotation inaccuracies also stemmed from wrongly assigned link boundaries: for example, in the Wikipedia article Arowr Owelsli Velingon (Arthur Wellesley), a link to the Napoleon article covers text corresponding to "Napoleon is" when it should cover only "Napoleon". Another kind of common annotation error occurred when a named entity appeared inside a link not targeting a LOC, ORG, or PER article (e.g. "AMN naxagahakan ntrowyownnerowm" ("USA presidential elections") is linked to the article AMN naxagahakan ntrowyownner 2016 (United States presidential election, 2016), and as a result [LOC AMN] (USA) is lost).

III. TEST DATASET

In order to evaluate the models trained on generated data,