
Hedwig: A Named Entity Linker

Marcus Klang, Pierre Nugues
Lund University, Department of Computer Science
{marcus.klang, pierre.nugues}@cs.lth.se

Abstract
Named entity linking is the task of identifying mentions of named things in text, such as "Barack Obama" or "New York", and linking these mentions to unique identifiers. In this paper, we describe Hedwig, an end-to-end named entity linker, which uses a combination of word and character BILSTM models for mention detection, a Wikipedia- and Wikidata-derived knowledge base with global information aggregated over nine language editions, and a PageRank algorithm for entity linking. We evaluated Hedwig on the TAC2017 dataset, consisting of news texts and discussion forums, and we obtained a final score of 59.9% on CEAFmC+, an improvement over our previous generation linker Ugglan, and a trilingual entity link score of 71.9%.

Keywords: named entity recognition, named entity linking, named entity annotation

1. Introduction

Named entity linking (NEL) is the task of automatically finding and linking mentions of things to unique identifiers. The word thing is too broad for the linkage problem; a more concrete definition used in this paper is linking uniquely separable things, which we can identify by a name, i.e. named entities. The classes of named entities we will then try to link are instances of persons, organizations, locations, etc.
Take for instance the named entity of class location: "New York". This mention can refer to the state of New York (https://en.wikipedia.org/wiki/New_York_(state)) or the large city (https://en.wikipedia.org/wiki/New_York_City) situated in that particular state. For the latter, a matching unique identifier could be the English Wikipedia label: New_York_City.
A typical NEL pipeline consists of many phases, including a name finding, mention detection (MD) phase (e.g. detecting "New York" in a text), a candidate generation (CD) phase (e.g. state or city), and an entity linking (EL) phase (e.g. assigning the label). In addition, these phases might be defined independently (Cucerzan, 2007) or trained jointly (Ganea and Hofmann, 2017). The MD phase is frequently a named entity recognizer (NER), which finds and classifies spans of strings into a set of predefined classes such as persons, organizations, locations, etc. The CD phase uses the classified mention as input, possibly with context, and from this information generates a list of entity candidates. Finally, the entity linking phase ranks and selects the most probable or coherent set of candidate entities. It assigns each mention a label, which corresponds to the unique identifier of the selected candidate.
The unique identifier can be local or global, and its concrete format is determined by the linker method, which can span the spectrum of fully supervised to unsupervised. This paper uses a supervised approach by linking to predetermined identifiers. These identifiers are provided by an entity repository which we refer to as the knowledge base (KB).
EDL annotated datasets use different kinds of corpora and come with different evaluation procedures. This paper will present a named entity linker for the 2017 edition of the Text Analysis Conference (TAC) Entity Discovery and Linking (EDL) task with its provided benchmark (Ji et al., 2017). This task was selected because it provides a multilingual gold standard. This dataset is diverse in its content and is a combination of real-world noisy texts found on the internet. This type of dataset presents challenges that all entity linkers would encounter when applied in the real world.

1.1. TAC EDL 2017

The TAC EDL task consists of linking two categories of mentions:

• Named mentions divided into five classes:
  – PER, Persons (nonfictional)
  – ORG, Organizations (companies, institutions, etc.)
  – LOC, Locations (natural locations, such as mountains, oceans, lakes, etc.)
  – GPE, Geopolitical entities (cities, administrative areas, countries, states, municipalities, etc.)
  – FAC, Facilities (airports, transportation infrastructure, man-made buildings, hospitals, etc.)

• Nominal mentions, involving a limited set of common nouns coreferring with a named mention. For instance, Barack Obama would be the named mention and the nominal mention would be president. Other common relations are son, wife, daughter, father, company, area, etc. The nominal mentions are classified and linked in the same manner as named mentions.

The corpus is a mixture of discussion forum (DF) and newswire (NW) text in three languages: English, Spanish, and Mainland Chinese, stored in XML and HTML. The gold standard provides links to the Freebase KB and out-of-KB labels. These latter labels start with "NIL" followed by a number, and this numbering spans all three languages. The Freebase identifiers are connected in the BaseKB provided by TAC.
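For illustration, a minimal record structure for one gold mention could look as follows. This is a hypothetical sketch of ours, not the official TAC tab-separated annotation format; all field names and the example Freebase identifier are placeholders.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class GoldMention:
    """Illustrative only: our own field names, not the official TAC EDL layout."""
    doc_id: str                                   # source document (NW or DF)
    start: int                                    # character offsets of the mention span
    end: int
    mention_type: Literal["NAM", "NOM"]           # named or nominal mention
    entity_class: Literal["PER", "ORG", "LOC", "GPE", "FAC"]
    kb_id: str                                    # Freebase identifier, or "NIL" + cluster number

named = GoldMention("ENG_NW_0001", 120, 132, "NAM", "PER", "m.0xxxxxx")   # placeholder mid
nominal = GoldMention("ENG_NW_0001", 210, 219, "NOM", "PER", "m.0xxxxxx") # corefers with the named mention
```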

The final score is based on the performance on all three languages.
We subdivided the corpus into a training and test set based on years. The 2017 dataset is the test set, and 2014-2016 is our training set. It is important to note that the 2014 edition does not have Spanish or Chinese texts.
Nominal mentions were added in 2015, but they were not fully annotated until 2016, which means scarce data is available for training nominal detection using deep models.

1.2. Specifics and Limitations to Hedwig

Hedwig uses data and statistics from Wikipedia, almost exclusively, and as such, it is natural for us to use Wikidata as the primary KB. Wikidata provides unique identifiers in the form of Q-numbers, e.g. "Barack Obama", the president, has identifier Q76. Wikidata binds the Wikipedia language editions together and is continuously updated. It is thus more up-to-date than other repositories, making it the most logical choice for us.
To be compliant with the TAC EDL gold standard, we converted the Q-numbers into Freebase identifiers using a mapping provided by Google (https://developers.google.com/freebase/), produced at the archiving and termination of Freebase as a public open knowledge base. This mapping is not perfect: a few Q-numbers are represented by multiple Freebase entries. We resolved them heuristically using the lowest Q-number.

2. Related Work

Multilingual named entity recognition has its modern roots with the CoNLL03 task of Language-Independent Named Entity Recognition (Tjong Kim Sang and De Meulder, 2003), where the best model used simple linear classifiers (Florian et al., 2003). Neural models, starting with feed-forward architectures, improved the recognition performance. Examples of such models include Collobert et al. (2011) and the exponential weight encoding method (Fixed-Size Ordinally-Forgetting Encoding, FOFE) (Zhang et al., 2015). These architectures were ultimately surpassed by deeper recurrent neural models using LSTMs (Hochreiter and Schmidhuber, 1997) and CRFs in different combinations, with or without word character encoders (Chiu and Nichols, 2016; Ma and Hovy, 2016). Straková et al. (2019) proposed a sequence-to-sequence model with attention, converting an input token sequence into a label sequence. This sequence-to-sequence model makes use of multiple contextualized word embeddings as input. Finally, a recent system by Jiang et al. (2019) improved the state-of-the-art performance on the CoNLL03 task with a differential neural network search method.
Word embeddings are a key ingredient to NER; the most commonly used started with Mikolov et al. (2013), followed by Pennington et al. (2014), to more recent developments by Mikolov et al. (2018), and deeper models by Peters et al. (2018). Flair (Akbik et al., 2019) is a recent NLP framework that provides a simplified interface to many state-of-the-art word embeddings.
Modern entity linking uses a variety of methods such as simple classification models (Bunescu and Paşca, 2006; Cucerzan, 2007; Milne and Witten, 2008), end-to-end linkage with a voting scheme for linkage (Ferragina and Scaiella, 2010), graphical models (Hoffart et al., 2011; Guo and Barbosa, 2014), integer linear programming (Cheng and Roth, 2013), and fully probabilistic models (Ganea et al., 2016), to deeper neural models (Ganea and Hofmann, 2017). Wainwright et al. (2008), cited in Globerson et al. (2016), proved that entity linkage which tries to maximize local and global agreement jointly during linkage is NP-hard to solve, which most authors approximate or simplify to reach feasibility.
The system described in this paper, Hedwig, builds on Klang and Nugues (2018) with significant improvements and contributions in four areas:

1. Updated data sources and preprocessing;
2. A named entity recognizer based on a BILSTM neural network architecture;
3. The linker considers a larger candidate graph with more features, and an improved PageRank solver;
4. Finally, the introduction of an entity type mapping that can remove unwanted entities with no relevance to the evaluation.

3. Data

In the making of Hedwig, we used these data sources:

• Nine Wikipedia editions: en, es, fr, de, sv, ru, zh, da, no, scraped using the Wikipedia REST API (https://[lang].wikipedia.org/api/rest_v1/) in October 2018;
• Wikidata JSON dump from October 2018;
• TAC EDL data 2014-2016;
• Manually annotated mappings of classes in Wikipedia to a set of predefined classes.

3.1. Wikidata: Our KB

The Wikidata JSON dump is delivered as one large gzip or bzip2 file. When decompressed, this file is one single JSON object; it does, however, use a "one JSON object per line" approach for easier processing. Using standard bash tools, we split it into multiple parts with 50,000 objects per file to enable efficient cluster processing. We converted this dump with a Wikidata parser, which transformed the JSON dump into Parquet files for further processing and information extraction. The information we converted was: Q-number, description, aliases, claims (also known as properties), and sitelinks. A subset of the most common claim datatypes is supported; the rest is either ignored or encoded as plain strings.
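As a rough illustration of this step, the one-object-per-line layout of the dump can be streamed and reduced to the fields listed above. This is a sketch under our own assumptions (field selection, kept properties), not the authors' actual parser, and it writes plain dicts rather than Parquet rows.

```python
import bz2, json

KEPT_PROPS = {"P31", "P279", "P1269", "P1889"}   # claim types used later for entity classification

def iter_entities(path):
    """Stream a wikidata-*.json.bz2 dump: one JSON object per line, wrapped in [ ... ]."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue                          # skip the enclosing JSON array brackets
            yield json.loads(line)

def simplify(entity):
    """Keep only Q-number, labels, sitelinks, and a subset of item-valued claims;
    other claim datatypes are ignored in this sketch."""
    claims = {}
    for prop, statements in entity.get("claims", {}).items():
        if prop not in KEPT_PROPS:
            continue
        targets = []
        for st in statements:
            value = st.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:   # wikibase-item target, e.g. "Q5"
                targets.append(value["id"])
        claims[prop] = targets
    return {
        "qid": entity["id"],
        "labels": {lang: v["value"] for lang, v in entity.get("labels", {}).items()},
        "sitelinks": {site: v["title"] for site, v in entity.get("sitelinks", {}).items()},
        "claims": claims,
    }
```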

3.2. Wikipedia

We scraped the nine editions using the REST API by first downloading a list of page names from Wikimedia's dump site (http://dumps.wikimedia.org/), more specifically the [lang]wiki-[date]-all-titles-in-ns0.gz file, which contains all page labels (the Wikipedia name for article titles). We then processed this file in sequence, downloading pages in parallel with a custom Python tool. This list of page names is slightly out of sync with the online version, resulting in some nonexisting pages; pages not found were ignored. Pages that failed to download due to server errors or network failures were retried once more at the end of the scraping process. In total, the retries amounted to 5-250 pages; most of these pages were eventually retrieved.
The REST API provided by Wikimedia uses Parsoid (https://www.mediawiki.org/wiki/Parsoid) internally and produces HTML output with added metadata tags and attributes for the elements. This parser does not produce an exactly identical structure to the publicly facing rendering. However, when sampling pages, no visible differences in content or format were evident. For the Chinese version, a translation table was included in the output to convert the character sequences between the different variants: Hong Kong, Singapore, etc.

4. Data Preprocessing

4.1. Wikipedia

Wikipedia is refined in two steps:

1. Import, converting HTML to Docria layers;
2. Link, resolving Wikipedia anchors to Wikidata.

In the import step, we parse HTML using JSoup (https://jsoup.org/), which converts the raw HTML string into a structured Document Object Model (DOM). We used rules applied recursively to the DOM tree to filter out the markup and produced a flattened document consisting only of readable text. In addition, during the flattening process or tree traversal, enough information was retained to produce multiple layers of spans with added metadata covering paragraphs, sections, anchors, lists, tables, italics, bold, etc. These layers were stored using Docria (Klang and Nugues, 2019), which can represent this type of data and store exact string mappings. With the exception of the Chinese version, all the editions were parsed identically. The Chinese version required a pre-conversion step to produce a Mainland Chinese version using the provided translation table; this conversion might not be perfectly accurate.
The link step converts all the anchors in the processed Wikipedia dump into Q-numbers using sitelinks in the Wikidata dump. It is necessary to do this link step for two reasons: links in Wikipedia refer to pages using titles, not unique identifiers, and many links are placeholder suggestions for future articles.
Ultimately, this produces a Docria dump which is fully linked to Wikidata and contains enough semantic information and structure for all further processing.
We chose nine editions to increase the statistical basis for entity-to-entity co-occurrences. Links in the Docria dump consist of multilingual Wikidata identifiers. Because of this, they can provide prior information that may benefit lower resourced languages. Information from all nine languages is only used in the linking stage.

4.2. Word Embeddings

We trained the word embeddings on a 2016 version of the English, Spanish, and Chinese Wikipedia using the tool released by Mikolov et al. (2013).

4.3. Entity Classification

Many articles in Wikipedia do not conform to acceptable classes in TAC EDL. To improve entity linkage precision and reduce noise, we pruned the entity candidates based on their class. To determine the class of an entity, we used a classifier that we trained on a dataset extracted from Wikidata.
Using the TAC annotation guidelines as a basis, we wrote rules to sample a diverse set of entities with usable instance-of relations. We mapped them to nine classes:

PER, natural persons, humans;
PER_F, fictional characters or persons;
LOC, natural locations, continents, lakes, rivers, mountains, streams, etc.;
GPE, geopolitical entities, countries, administrative areas, regions, etc.;
ORG, organizations, institutions, business entities, etc.;
EVT, events, sport events, conflicts, wars, etc.;
WRK, products, newspapers, books, films, TV series, etc.;
FAC, man-made buildings, transportation infrastructure, airports, hospitals, landmarks, etc.;
NONE, everything that does not fall into any class above. This explicitly includes Wikipedia categories, templates, disambiguation pages, etc.

The input features for the model were:

• Direct relations such as P31 (instance-of) Q5 (human);
• A Boolean indicating the existence of a P31 claim. Empirically, most entities without an instance-of are exceptionally hard to disambiguate and should, most of the time, be classified as NONE;
• All path segments from the initial seed entity, with source, relation, and target, to a depth of 5.

We used a breadth-first search (BFS) to a depth of 5 to find suitable segments by following edges for the relations P31 (instance-of), P279 (subclass-of), P1269 (facet-of), and P1889 (different-from). The training data consists of seed entities mapped to one of these nine classes, and all features were generated on the fly. To produce negative examples, all unclassified entities are assumed to have the label NONE. To balance positive and negative bias, the negative share was set to 10% per mini-batch. In practice, this makes the classifier default to NONE for all unknown relations; the large positive share should force the classifier to learn the given examples and retain the most significant patterns common to each class. Phrased differently, we favored precision over recall in this setting.
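A minimal sketch of this path-segment extraction, under our own assumption about the data layout (a dict from Q-number to its outgoing (property, target) pairs); this is not the authors' implementation.

```python
from collections import deque

FOLLOW = ("P31", "P279", "P1269", "P1889")   # instance-of, subclass-of, facet-of, different-from

def path_features(seed, graph, max_depth=5):
    """Breadth-first search from a seed Q-number, emitting (source, relation, target)
    segments up to max_depth hops away. `graph` maps a Q-number to a list of
    (property, target Q-number) pairs; this layout is an assumption for the sketch."""
    features = set()
    visited = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for prop, target in graph.get(node, []):
            if prop not in FOLLOW:
                continue
            features.add((node, prop, target))        # one path segment = one feature
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return features

# Toy example: Paris (Q90) -P31-> city (Q515) -P279-> urban area (Q702492)
toy = {"Q90": [("P31", "Q515")], "Q515": [("P279", "Q702492")]}
print(path_features("Q90", toy))
```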

We trained a three-layer neural model with a simplified attention mechanism to classify types, as shown in Figure 1. This model treats all input features as trainable embeddings; a simplified attention mechanism produces a single vector from all inputs through a weighted average, followed by three RELU layers with a final softmax layer to predict the output.

Figure 1: Entity classifier neural network. P31 is instance-of, Q515 is the "city" entity, and Q702492 is "urban area", which is a subclass of (P279) city.

5. Mention-Entity Dictionary

5.1. Mention Dictionary

We created a mention-to-entity dictionary from all anchor texts, aliases, and titles of Wikipedia articles, where for each mention, we extracted the entity candidates.
This dictionary is associated with a tokenizer based on a JFlex parser and described in Sect. 6.1. To normalize the token sequences, we use the Lucene analysis infrastructure to convert token sequences into query terms to find entries in the dictionary, which is detailed in Sect. 6.2. The dictionary uses the finite-state transducer (FST) implemented in Lucene. We selected it for its query performance and lower memory requirements.

5.2. Link Density

Some mentions are systematically linked in Wikipedia, while others never are. To count the share of matches in linked vs. non-linked text, we created a link density measure to assess if a mention should be linked or not:

    ld(m_i) = \frac{C(m_i \text{ is linked})}{C(m_i)}.    (1)

We used this metric as a baseline mention filtering method.
We extracted the counts from an auto-linked version of each Wikipedia article. With auto-linking, we refer to the process of linkage densification by using the existing (link, label) pairs per document and creating links for sequences of tokens that exactly match a label.
If multiple links can apply to a same segment, we resolve them using a dominant-right rule, which resolves overlaps into the longest segment, or to the furthest to the right for equal-length segments.

5.3. Entity, Mentions, and Context

We have two types of statistically extracted mappings:

Monolingual, modeling the relation of mentions and words to entities in one language;
Multilingual, modeling the entity to entities in context over all the editions.

5.3.1. Multilingual

To create the multilingual model, we computed the pointwise mutual information (PMI) of an entity e_i being in the same context as entity e_j:

    PMI(e_i, e_j) = \log \frac{P(e_i, e_j)}{P(e_i) P(e_j)}.    (2)

PMI produces an unbounded value; we therefore used a slightly different formalization to compute the normalized PMI (NPMI) (Bouma, 2009), in this case using raw counts C:

    NPMI(e_i, e_j) = \frac{\log \frac{C(e_i, e_j)}{C(e_i) C(e_j) / total}}{-\log \frac{C(e_i, e_j)}{total}},    (3)

where

    total = \sum_i \sum_j C(e_i, e_j).    (4)

Practically, this is computed by counting the occurrences of two entities in the same context, ultimately resulting in a sorted list of entities based on PMI values. Specifically, there is one list per entity e_i with all other entities e_j found in the same context as e_i. This e_i list may be large and is filtered to only retain the top-k e_j entries.
The context in this paper is defined as the entities in the same paragraph. Top-k filtering produces sub-optimal results, as PMI will favor entities with strong but not statistically significant mutual occurrences. This property is not desirable for frequently occurring entities, as semantically relevant entities will, in this case, have lower PMI values and thus be filtered out.
To improve the lists of entity-to-entity PMI, a cutoff based on the raw counts is dynamically determined for each list. Specifically, all counts are sorted from highest to lowest, and the count limit is determined by when a total share of 80% has been reached, or a minimum of 2 if there are too few entries. Subjectively, this heuristic produces better results for frequently occurring entities. It makes little difference for rare entities, where we do not have statistically significant mutual occurrences. Nonetheless, it implies more noise for less frequent entities.
We used information from all nine editions for this computation.
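A small sketch of the NPMI computation of Eq. (3) together with one possible reading of the 80% count-mass cutoff described above; the data layout (pre-aggregated per-paragraph co-occurrence counters) is our own assumption, not the authors' code.

```python
import math
from collections import Counter

def npmi(c_ij, c_i, c_j, total):
    """Normalized PMI from raw counts, Eq. (3): log(C_ij / (C_i*C_j/total)) / -log(C_ij/total)."""
    pmi = math.log(c_ij / (c_i * c_j / total))
    return pmi / -math.log(c_ij / total)

def context_list(e_i, cooc, counts, total, mass=0.80, min_count=2):
    """Build the filtered e_j list for one entity e_i.
    `cooc[e_i]` is a Counter of co-occurring entities; the raw-count cutoff keeps the
    most frequent pairs until 80% of the count mass is covered (minimum count of 2),
    then ranks the survivors by NPMI, highest first."""
    pairs = cooc[e_i].most_common()                    # sorted from highest to lowest count
    budget = mass * sum(c for _, c in pairs)
    cum, limit = 0, min_count
    for _, c in pairs:
        cum += c
        if cum >= budget:
            limit = max(c, min_count)
            break
    kept = [(e_j, c) for e_j, c in pairs if c >= limit]
    return sorted(((e_j, npmi(c, counts[e_i], counts[e_j], total)) for e_j, c in kept),
                  key=lambda x: -x[1])
```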

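The dominant-right overlap resolution of Sect. 5.2, which is also reused for dictionary detection in Sect. 6.2, can be sketched as follows; representing spans as (start, end) token offsets is our assumption.

```python
def dominant_right(spans):
    """Resolve overlapping candidate spans: prefer the longest span and, for
    equal lengths, the span furthest to the right. `spans` is a list of
    (start, end) pairs with `end` exclusive."""
    # Sort so that the preferred span in any overlapping group comes first:
    # longer first, then more to the right.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), -s[0]))
    kept = []
    for start, end in ordered:
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)

# Example: "New York City" overlapping "New York" and "York City"
print(dominant_right([(0, 2), (1, 3), (0, 3)]))   # -> [(0, 3)]
```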
5.3.2. Monolingual

In addition to the multilingual entity relations, we computed localized mention/word-to-entity relations using the same method as above. Words have a minimum NPMI value of 0.1 to surpass, which seems to be where most nonsense words start to appear. This loosely results in a statistical basis for linkage based on words, mentions, and other entities.

6. Setup

6.1. Segmentation and Tokenization

We created a tokenizer using a custom JFlex parser optimized to find common patterns in many languages. The parser is fully defined in that unknown characters or rules will yield separate tokens. It will split tokens based on whitespace for all relevant languages. The parser also applies rule-based pattern matching to detect acronyms, title-cased words, uppercased words, numbers, years, periods in many Unicode variants, citations, parentheses, etc.
Sentence splitting is based on rules and uses the tokenization stream and its token classification as input. Concretely, it uses a split rule with a minimum sentence length of four tokens to retain data even if the segmentation is incorrect, but also to avoid common rule-based mistakes: mistakes that frequently occur around the period, such as certain acronyms, person titles, and non-standard word contractions. Such mistakes are hard to avoid without the use of a machine-learning model.

6.2. Mention Detection

The mention detection consists of two steps:

1. A dictionary-based detection, classification, and overlap resolution using the dominant-right rule;
2. A neural NER model with the dictionary detection as part of the input features.

6.2.1. Baseline: Dictionary Detection

Using only the localized mention dictionary and cherry-picked link density (ld) cutoff values per language (0.35 for English, 0.25 for Spanish, and 0.45 for Chinese), we extracted mentions from the input sentences. However, this only finds potential mentions without class information. Named entity classes are determined using candidate generation and picking the top candidate entity and its class according to our entity class mapping from Sect. 4.3.

6.2.2. Named Entity Recognition using ML

The mention detector model shown in Fig. 2 is a BILSTM-BILSTM CHAR-CRF model. Concretely, this model contains the following elements:

• Character sequences per word using a BILSTM layer, essentially compressing the character embeddings into one vector that is concatenated with the word embedding;
• Word sequences in sentences using a BILSTM layer with three inputs: the word embedding, the character embeddings from above, and per-class dictionary features from the dictionary detection;
• Finally, all output is downprojected and fed into a linear-chain CRF layer to predict the most likely tag sequence.

Figure 2: BILSTM-CRF NER model

This model predicts all mentions used for linkage. To generate candidates, we take the token spans marked by the model and use the mention dictionary to find all the candidates.
Model details: The NER models are trained independently for each language with different hyper-parameters, which may affect the architecture. For instance, the Chinese NER model skips the CHAR BILSTM features as characters are treated as words. This makes the combined features for the Chinese NER model based on individual characters and the per-class features from the dictionary detection. The Chinese NER model supports frequently occurring words and acronyms written in other scripts; they will be treated as words, not as sequences of characters.
The notable hyper-parameters we used during training are:

Character features: One (Spanish) or two (English) BILSTM layers, 64 dimensions, hidden state as the final feature vector, 25% dropout on this hidden state vector;
Word features: 256-dimension embeddings, 10% word dropout (randomly replacing words with an unknown word embedding);
Combined features: One or two (Chinese) bidirectional LSTM layers, 100 dimensions, 25% dropout. This feature vector is run through a batch normalization layer before being fed to the BILSTM layer;
Epochs: 48 with early stopping based on a validation set evaluation;
Optimizer: Adam with the default learning rate and a weight decay of 10^-4.

Additionally, within a batch, repeated words will reuse the same character embeddings. The CRF layer and its loss function come from AllenNLP (Gardner et al., 2018).
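For concreteness, a condensed PyTorch sketch of the encoder just described (a character BILSTM per word, a word BILSTM over word embeddings, character vectors, and per-class dictionary features), under our own simplifications: single BILSTM layers, no dropout or batch normalization, and the linear-chain CRF, which the paper takes from AllenNLP, is replaced by plain per-token emission scores.

```python
import torch
import torch.nn as nn

class BilstmNerEncoder(nn.Module):
    """Sketch of the BILSTM-BILSTM CHAR encoder; dimensions follow the listed
    hyper-parameters, regularization and the CRF layer are omitted."""
    def __init__(self, n_words, n_chars, n_dict_feats, n_tags,
                 word_dim=256, char_dim=64, hidden_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, batch_first=True, bidirectional=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim + n_dict_feats, hidden_dim,
                                 batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)    # emission scores; a CRF would sit on top

    def forward(self, words, chars, dict_feats):
        # words: (batch, seq); chars: (batch, seq, max_chars); dict_feats: (batch, seq, n_dict_feats)
        b, s, c = chars.shape
        # Character BILSTM: compress each word's character sequence into one vector.
        char_out, _ = self.char_lstm(self.char_emb(chars.view(b * s, c)))
        char_vec = char_out[:, -1, :].view(b, s, -1)
        # Word BILSTM over [word embedding; char vector; per-class dictionary features].
        token_repr = torch.cat([self.word_emb(words), char_vec, dict_feats.float()], dim=-1)
        word_out, _ = self.word_lstm(token_repr)
        return self.out(word_out)                       # (batch, seq, n_tags)

scores = BilstmNerEncoder(10000, 128, 9, 17)(
    torch.randint(0, 10000, (2, 6)), torch.randint(0, 128, (2, 6, 10)), torch.zeros(2, 6, 9))
```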

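Candidate generation from the predicted spans is then a dictionary lookup. A minimal sketch with a plain Python dict standing in for the Lucene FST and lowercasing standing in for the Lucene analysis chain (both assumptions of ours):

```python
def generate_candidates(tokens, spans, mention_dict, normalize=str.lower):
    """For each predicted span (start, end), normalize the surface form and look up
    the entity candidates (Q-numbers) in the mention-to-entity dictionary.
    `mention_dict` maps a normalized mention string to a list of candidate Q-numbers;
    in Hedwig itself this dictionary is a Lucene FST, here it is a plain dict."""
    candidates = {}
    for start, end in spans:
        surface = normalize(" ".join(tokens[start:end]))
        candidates[(start, end)] = mention_dict.get(surface, [])
    return candidates

mention_dict = {"new york": ["Q1384", "Q60"], "new york city": ["Q60"]}
tokens = ["She", "moved", "to", "New", "York", "City", "."]
print(generate_candidates(tokens, [(3, 6)], mention_dict))   # {(3, 6): ['Q60']}
```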
6.3. Linking Stage

We start with all candidate lists consisting of the possible Q-numbers. The PageRank method will give a weight to each vertex in a directed graph based on the available edges. Figure 3 shows a candidate graph.

Figure 3: Candidate graph

In the context of linking, a vertex is either a word, a mention, or an entity. Each candidate entity has an associated list of words, mentions, and other entities it co-occurs with. We built these lists using the NPMI cutoff method described in Sect. 5.3. We create the edges using these lists by searching the context for the existence of these mentions, words, and candidate entities, and generating vertices and directed edges towards all the relevant candidates. The context is unbounded, meaning the entire document. This produces a graph with directed edges: from words and mentions to candidate entities, and entity-to-entity edges.
The final output is a normalized PageRank value for each candidate per mention. The normalized PageRank values are sorted to produce a ranking list. The highest-ranking candidate is selected for linkage. There is a minimal NIL detection, filtering out entities which have the NONE class.
Our PageRank implementation uses power iteration until convergence or a maximal number of iterations, with a default damping value of 0.15 and an α value of 0.85.

7. Results

We evaluated our system with neleval (https://github.com/wikilinks/neleval) and Table 1 shows the resulting scores. Among all the figures output by neleval, we included the following ones:

NER: Named entity recognition score;
NERC: NER and type classification (e.g. person, organization, etc.);
KBIDs: Linkage per document;
CEAFmC+: Identification, classification, and linking.

The baseline is weak, which was expected as it only uses the mention dictionary. Compared to the results we obtained using a reranker (Klang et al., 2017), the new model is competitive and improves on almost all metrics except NER on English and linkage on English and Spanish; however, the differences are small.
A general observation is that the model favors precision over recall. Chinese is particularly strong on entity linkage and increased the most over the previous model (CEAFmC+). For a full comparison, please refer to the complete results in the TAC 2017 Entity Discovery and Linking overview (Ji et al., 2017).

8. Discussion

8.1. Mention Detection

Mention detectors based on neural models are not perfect and can have recall issues. When trained on a noisy dataset such as TAC, they tend to favor precision heavily, resulting in some mentions not being predicted. One way of mitigating this is to expand mentions by using the found mentions and searching for partial or full matches, e.g. linking all occurrences of "Obama" when "Barack Obama" is found as a mention.
TAC contains a large amount of spelling mistakes, particularly in the discussion forum texts. Domain-specific variations which produce different surface forms are also commonplace, such as the "Microsoft" and "Micro$oft" forms, which are easily recognizable to humans.
When training neural models such as LSTMs, it can be tricky to know when to stop. We therefore employ model checkpointing using the F1 score on a validation set during training, ultimately picking the final parameters from the best saved model. The validation set in this case is a 10% sample of the original data. Training on all data might yield an improvement. For Spanish and Chinese, improvements might be larger as their corpus sizes are about 50% of English, and in some cases even less when broken down into annotated tags. From the results, the NOM category is the most difficult category to accurately discover and link, as it tends to be highly ambiguous. The available training data for nominal mentions (NOM) are sparse, with more variation than named entity observations, making learning useful patterns more difficult.
One potential approach that could mitigate the lack of annotated data would be to train on TAC 2016 data, predict on TAC 2015 data, and then merge and train again, as described by Bernier-Colborne et al. (2017).
We trained the embeddings on relatively limited corpora. The motivation was that these can be reproduced in any language that has a Wikipedia edition, enabling us to expand the linker to many more languages. The disadvantage is that their quality is probably inferior to that of publicly available pre-trained embeddings. Another possible improvement would be to use such embeddings where they are available for the TAC languages.
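As a reference point for the linking stage of Sect. 6.3, a minimal non-personalized PageRank by power iteration over an adjacency structure, using the α = 0.85 quoted above. This is a generic sketch, not Hedwig's solver, and the graph layout (a dict from vertex to outgoing edges) is our assumption.

```python
def pagerank(out_edges, alpha=0.85, max_iter=100, tol=1e-8):
    """Plain (non-personalized) PageRank by power iteration.
    `out_edges` maps a vertex (word, mention, or candidate entity) to the list of
    vertices it points to; every edge target must also appear as a key.
    Dangling vertices redistribute their mass uniformly."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not out_edges[v])
        new = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            share = alpha * rank[v] / len(out_edges[v]) if out_edges[v] else 0.0
            for u in out_edges[v]:
                new[u] += share
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:    # converged
            return new
        rank = new
    return rank

# Tiny candidate graph: a mention points to two candidates; one candidate is also
# supported by an entity-in-context edge and should end up ranked higher.
graph = {"mention": ["Q60", "Q1384"], "context_entity": ["Q60"], "Q60": [], "Q1384": []}
scores = pagerank(graph)
print(scores["Q60"] > scores["Q1384"])   # True
```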

              NER               NERC              KBIDs             CEAFmC+
              P    R    F1      P    R    F1      P    R    F1      P    R    F1
Tri  Baseline 78.4 43.1 55.6    73.5 40.4 52.1    62.1 53.8 57.6    66.7 36.7 47.3
     Ugglan   89.4 58.4 70.6    83.0 54.3 65.6    80.1 61.7 69.7    70.9 46.4 56.1
     NER-Only 91.5 61.6 73.6    87.9 59.2 70.8     0.0  0.0  0.0    12.9  8.7 10.4
     Pagerank 91.4 61.7 73.7    86.2 58.3 69.5    82.4 63.8 71.9    74.3 50.2 59.9
Eng  Baseline 81.1 41.3 54.7    75.4 38.4 50.9    65.3 52.1 58.0    67.3 34.3 45.4
     Ugglan   90.6 65.0 75.7    83.8 60.1 70.0    82.3 62.1 70.7    69.4 49.8 58.0
     NER-Only 93.2 62.8 75.0    89.2 60.2 71.9     0.0  0.0  0.0    22.1 14.9 17.8
     Pagerank 92.5 63.4 75.3    85.6 58.7 69.6    77.4 64.6 70.4    71.6 49.0 58.2
Spa  Baseline 71.5 47.7 57.3    64.7 43.2 51.8    52.7 55.1 53.8    55.7 37.2 44.6
     Ugglan   88.5 59.1 70.8    84.8 56.6 67.9    83.9 59.5 69.6    70.0 46.7 56.0
     NER-Only 92.2 58.4 71.5    88.5 56.0 68.6     0.0  0.0  0.0    21.7 13.7 16.8
     Pagerank 92.2 58.3 71.5    88.7 56.0 68.7    84.6 58.4 69.1    74.0 46.8 57.3
Cmn  Baseline 83.0 41.0 54.9    80.6 39.8 53.3    71.8 53.9 61.5    76.7 37.9 50.7
     Ugglan   89.1 57.4 69.8    82.4 53.1 64.6    85.4 64.3 73.4    71.2 45.8 55.8
     NER-Only 90.0 63.0 74.1    86.8 60.8 71.5     0.0  0.0  0.0    13.8  9.7 11.4
     Pagerank 90.0 63.0 74.1    85.0 59.5 70.0    84.8 68.2 75.6    76.4 53.5 62.9

Table 1: Results: CEAFmC+ corresponds to the final identification and linking score

8.2. Linker

The linker in Hedwig uses a plain, non-personalized variant of PageRank, which is solved using the power iteration algorithm. Because it is non-personalized, important weight information, such as the available computed PMI values, is not used; this might reduce its ability to resolve important entities and instead favor frequent entities, which in this context improves the results.
The linker stage uses the entire document with all possible candidates to form a collective agreement; the consequence is that all unique candidates will receive a single weight. This means that the linker is unable to model topic drift or a change of meaning for different mentions. It will pick the globally most coherent choice. For TAC, this limitation is reasonable but might not hold for longer texts. In Ugglan, a moving overlapping window was used, and the same approach might be applied to Hedwig.
Compared to Ugglan (Klang et al., 2017), which had a machine-learned reranker for the linking component, we can see that for English and Spanish, the results are slightly worse for linkage. A future improvement might be to add a reranker.

9. Acknowledgments

This research was supported by Vetenskapsrådet, the Swedish research council, under the Det digitaliserade samhället program, grant number 340-2012-5738.
We are also grateful to the NVIDIA GPU Grant program for providing us with a Titan GPU.

10. Bibliographical References

Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., and Vollgraf, R. (2019). Flair: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.
Bernier-Colborne, G., Barrière, C., and Ménard, P. A. (2017). CRIM's systems for the tri-lingual entity detection and linking task. In TAC.
Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, pages 31–40.
Bunescu, R. and Paşca, M. (2006). Using encyclopedic knowledge for named entity disambiguation. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
Cheng, X. and Roth, D. (2013). Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796.
Chiu, J. P. and Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
Ferragina, P. and Scaiella, U. (2010). TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625–1628. ACM.
Florian, R., Ittycheriah, A., Jing, H., and Zhang, T. (2003). Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 168–171. Association for Computational Linguistics.
Ganea, O.-E. and Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. arXiv preprint arXiv:1704.04920.
Ganea, O.-E., Ganea, M., Lucchi, A., Eickhoff, C., and Hofmann, T. (2016). Probabilistic bag-of-hyperlinks model for entity linking. In Proceedings of the 25th International Conference on World Wide Web, pages 927–938. International World Wide Web Conferences Steering Committee.
Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. (2018). AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia, July. Association for Computational Linguistics.
Globerson, A., Lazic, N., Chakrabarti, S., Subramanya, A., Ringaard, M., and Pereira, F. (2016). Collective entity resolution with multi-focal attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 621–631.
Guo, Z. and Barbosa, D. (2014). Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 499–508. ACM.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., and Weikum, G. (2011). Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.
Ji, H., Pan, X., Zhang, B., Nothman, J., Mayfield, J., McNamee, P., and Costello, C. (2017). Overview of TAC-KBP2017 13 languages entity discovery and linking. In Proceedings of the 2017 Text Analysis Conference, TAC 2017, Gaithersburg, Maryland, USA, November 13-14, 2017. NIST.
Jiang, Y., Hu, C., Xiao, T., Zhang, C., and Zhu, J. (2019). Improved differentiable architecture search for language modeling and named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3585–3590, Hong Kong, China, November. Association for Computational Linguistics.
Klang, M. and Nugues, P. (2018). Linking, searching, and visualizing entities in Wikipedia. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3426–3432.
Klang, M. and Nugues, P. (2019). Docria: Processing and storing linguistic data with Wikipedia. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, October.
Klang, M., Dib, F., and Nugues, P. (2017). Overview of the Ugglan entity discovery and linking system. In Proceedings of the 2017 Text Analysis Conference, TAC 2017, Gaithersburg, Maryland, USA, November 13-14, 2017. NIST.
Ma, X. and Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1064–1074.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2018). Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
Milne, D. and Witten, I. H. (2008). Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509–518. ACM.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Straková, J., Straka, M., and Hajič, J. (2019). Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy, July. Association for Computational Linguistics.
Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.
Wainwright, M. J., Jordan, M. I., et al. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
Zhang, S., Jiang, H., Xu, M., Hou, J., and Dai, L. (2015). The fixed-size ordinally-forgetting encoding method for neural network language models. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 495–500.
