End-to-End Neural Entity Linking

Nikolaos Kolitsas∗
ETH Zürich
[email protected]

Octavian-Eugen Ganea∗
ETH Zürich
[email protected]

Thomas Hofmann
ETH Zürich
[email protected]

∗ Equal contribution.

Abstract

Entity Linking (EL) is an essential task for semantic text understanding and information extraction. Popular methods separately address the Mention Detection (MD) and Entity Disambiguation (ED) stages of EL, without leveraging their mutual dependency. We here propose the first neural end-to-end EL system that jointly discovers and links entities in a text document. The main idea is to consider all possible spans as potential mentions and learn contextual similarity scores over their entity candidates that are useful for both MD and ED decisions. Key components are context-aware mention embeddings, entity embeddings and a probabilistic mention - entity map, without demanding other engineered features. Empirically, we show that our end-to-end method significantly outperforms popular systems on the Gerbil platform when enough training data is available. Conversely, if testing datasets follow different annotation conventions compared to the training set (e.g. queries/tweets vs news documents), our ED model coupled with a traditional NER system offers the best or second best EL accuracy.

1) MD may split a larger span into two mentions of less informative entities:
   B. Obama's wife gave a speech [...]
   Federer's coach [...]

2) MD may split a larger span into two mentions of incorrect entities:
   Obama Castle was built in 1601 in Japan.
   The Kennel Club is UK's official kennel club.
   A bird dog is a type of gun dog or hunting dog.
   Romeo and Juliet by Shakespeare [...]
   Natural killer cells are a type of lymphocyte.
   Mary and Max, the 2009 movie [...]

3) MD may choose a shorter span, referring to an incorrect entity:
   The Apple is played again in cinemas.
   The New York Times is a popular newspaper.

4) MD may choose a longer span, referring to an incorrect entity:
   Babies Romeo and Juliet were born hours apart.

Table 1: Examples where MD may benefit from ED and vice versa. Each wrong MD decision (underlined) can be avoided by proper context understanding. The correct spans are shown in blue.

1 Introduction and Motivation

Towards the goal of automatic text understanding, machine learning models are expected to accurately extract potentially ambiguous mentions of entities from a textual document and link them to a knowledge base (KB), e.g. Wikipedia or Freebase. Known as entity linking, this problem is an essential building block for various Natural Language Processing tasks, e.g. automatic KB construction, question-answering, text summarization, or relation extraction.

An EL system typically performs two tasks: i) Mention Detection (MD) – or Named Entity Recognition (NER) when restricted to named entities – which extracts entity references in a raw textual input, and ii) Entity Disambiguation (ED) – which links these spans to their corresponding entities in a KB. Until recently, the common approach of popular systems Ceccarelli et al. (2013); van Erp et al. (2013); Piccinno and Ferragina (2014); Daiber et al. (2013); Hoffart et al. (2011); Steinmetz and Sack (2013) was to solve these two sub-problems independently. However, the important dependency between the two steps is then ignored, and errors caused by MD/NER propagate to ED without any possibility of recovery Sil and Yates (2013); Luo et al. (2015).

We here advocate for models that address the end-to-end EL task, informally arguing that humans understand and generate text in a similar joint manner, discussing entities that are gradually introduced, referenced under multiple names and evolving over time Ji et al. (2017). Further, we emphasize the importance of the mutual dependency between MD and ED. First, the numerous and more informative linkable spans found by MD obviously offer more contextual cues for ED. Second, finding the true entities appearing in a specific context encourages better mention boundaries, especially for multi-word mentions. For example, in the first sentence of Table 1, understanding the presence of the entity Michelle Obama helps detecting its true mention "B. Obama's wife", as opposed to separately linking B. Obama and wife to less informative concepts.

We propose a simple, yet competitive, model for end-to-end EL. Taking inspiration from the recent works of Lee et al. (2017) and Ganea and Hofmann (2017), our model first generates all possible spans (mentions) that have at least one possible entity candidate. Then, each mention - candidate pair receives a context-aware compatibility score based on word and entity embeddings coupled with a neural attention and a global voting mechanism. During training, we enforce the scores of gold entity - mention pairs to be higher than all possible scores of incorrect candidates or invalid mentions, thus jointly taking the ED and MD decisions.

Our contributions are:

• We address the end-to-end EL task using a simple model that conditions the "linkable" quality of a mention on the strongest context support of its best entity candidate. We do not require expensive manually annotated negative examples of non-linkable mentions. Moreover, we are able to train competitive models using little and only partially annotated documents (with named entities only, such as the CoNLL-AIDA dataset).

• We are among the first to show that, with one single exception, engineered features can be fully replaced by neural embeddings automatically learned for the joint MD & ED task.

• On the Gerbil1 benchmarking platform, we empirically show significant gains for the end-to-end EL task when test and training data come from the same domain. Moreover, when testing datasets follow different annotation schemes or exhibit different statistics, our method is still effective in achieving state-of-the-art or close performance, but needs to be coupled with a popular NER system.

1 http://gerbil.aksw.org/gerbil/

2 Related Work

With few exceptions, MD/NER and ED are treated separately in the vast EL literature.

Traditional NER models usually view the problem as word sequence labeling, modeled using conditional random fields on top of engineered features Finkel et al. (2005) or, more recently, using bi-LSTM architectures Lample et al. (2016); Chiu and Nichols (2016); Liu et al. (2017) capable of learning complex lexical and syntactic features.

In the context of ED, recent neural methods He et al. (2013); Sun et al. (2015); Yamada et al. (2016); Ganea and Hofmann (2017); Le and Titov (2018); Yang et al. (2018); Radhakrishnan et al. (2018) have established state-of-the-art results, outperforming models based on engineered features. Context-aware word, span and entity embeddings, together with neural similarity functions, are essential in these frameworks.

End-to-end EL is the realistic task and ultimate goal, but challenges in joint NER/MD and ED modeling arise from their different natures. Few previous methods tackle the joint task, in which errors in one stage can be recovered by the next stage. In one of the first attempts, Sil and Yates (2013) use a popular NER model to over-generate mentions and let the linking step take the final decisions. However, their method is limited by the dependence on a good mention spotter and by the usage of hand-engineered features. It is also unclear how linking can improve their MD phase. Later, Luo et al. (2015) presented one of the most competitive joint MD and ED models, leveraging semi-Conditional Random Fields (semi-CRFs). However, there are several weaknesses in this work. First, the mutual task dependency is weak, being captured only by type-category correlation features; the other engineered features used in their model are either NER or ED specific. Second, while their probabilistic graphical model allows for tractable learning and inference, it suffers from high computational complexity caused by the usage of the cartesian product of all possible document span segmentations, NER categories and entity assignments.

Another approach is J-NERD Nguyen et al. (2016), which addresses the end-to-end task using only engineered features and a probabilistic graphical model on top of sentence parse trees.

3 Neural Joint Mention Detection and Entity Disambiguation

We formally introduce the tasks of interest. For EL, the input is a text document (or a query or tweet) given as a sequence D = {w_1, . . . , w_n} of words from a dictionary, w_k ∈ W. The output of an EL model is a list of mention - entity pairs {(m_i, e_i)}_{i ∈ 1,T}, where each mention is a word subsequence of the input document, m = w_q, . . . , w_r, and each entity is an entry in a knowledge base KB (e.g. Wikipedia), e ∈ E. For the ED task, the list of entity mentions {m_i}_{i=1,T} that need to be disambiguated is additionally provided as input. The expected output is a list of corresponding annotations {e_i}_{i=1,T} ∈ E^T.

Note that, in this work, we only link mentions that have a valid gold KB entity, a setting referred to in Röder et al. (2017) as InKB evaluation. Thus, we treat mentions referring to entities outside the KB as "non-linkable". This is in line with a few previous models, e.g. Luo et al. (2015); Ganea and Hofmann (2017); Yamada et al. (2016). We leave the interesting setting of discovering out-of-KB entities as future work.

We now describe the components of our neural end-to-end EL model, depicted in Figure 1. We aim for simplicity, but competitive accuracy.

Word and Char Embeddings. We use pre-trained Word2Vec word vectors Mikolov et al. (2013). In addition, we train character embeddings that capture important word lexical information. Following Lample et al. (2016), for each word independently, we use bidirectional LSTMs Hochreiter and Schmidhuber (1997) on top of learnable char embeddings. These character LSTMs do not extend beyond single word boundaries, but they share the same parameters. Formally, let {z_1, . . . , z_L} be the character vectors of word w. We use the forward and backward LSTM formulations defined recursively as in Lample et al. (2016):

  h_t^f = FWD-LSTM(h_{t-1}^f, z_t)    (1)
  h_t^b = BKWD-LSTM(h_{t+1}^b, z_t)

Then, we form the character embedding of w as [h_L^f; h_1^b], i.e. the hidden state of the forward LSTM corresponding to the last character concatenated with the hidden state of the backward LSTM corresponding to the first character. This is then concatenated with the pre-trained word embedding, forming the context-independent word-character embedding of w. We denote the sequence of these vectors as {v_k}_{k ∈ 1,n} and depict it as the first neural layer in Figure 1.
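To make the character pathway concrete, here is a minimal PyTorch sketch of the word-character embedding just described. It is an illustration under assumed dimensions (50-dim characters, 300-dim words, matching the hyper-parameters reported in Section 4), not the authors' released code; the class and variable names are invented for this example.

```python
# Illustrative sketch (not the authors' code): a per-word character bi-LSTM
# whose final forward/backward states are concatenated with the pre-trained
# word vector, as in Eq. (1) and the paragraph above.
import torch
import torch.nn as nn

class CharWordEmbedder(nn.Module):
    def __init__(self, n_chars=100, char_dim=50, word_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)    # learnable z_t
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True)

    def forward(self, char_ids, word_vec):
        # char_ids: (L,) character indices of one word
        # word_vec: (word_dim,) fixed pre-trained word vector
        z = self.char_emb(char_ids).unsqueeze(1)           # (L, 1, char_dim)
        _, (h, _) = self.char_lstm(z)
        # h[0] = forward state after the last char (h_L^f),
        # h[1] = backward state after the first char (h_1^b)
        char_vec = torch.cat([h[0, 0], h[1, 0]], dim=-1)   # (2 * char_dim,)
        return torch.cat([word_vec, char_vec], dim=-1)     # v_k, 400-dim here

embedder = CharWordEmbedder()
v = embedder(torch.tensor([5, 12, 7]), torch.randn(300))
print(v.shape)  # torch.Size([400])
```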
Mention Representation. We find it crucial to make word embeddings aware of their local context, thus being informative for both mention boundary detection and entity disambiguation (leveraging contextual cues, e.g. "newspaper"). We thus encode context information into words using a bi-LSTM layer on top of the word-character embeddings {v_k}_{k ∈ 1,n}. The hidden states of the forward and backward LSTMs corresponding to each word are then concatenated into context-aware word embeddings, whose sequence is denoted as {x_k}_{k ∈ 1,n}.

Next, for each possible mention, we produce a fixed size representation inspired by Lee et al. (2017). Given a mention m = w_q, . . . , w_r, we first concatenate the embeddings of the first, last and the "soft head" words of the mention:

  g^m = [x_q; x_r; x̂^m]    (2)

The soft head embedding x̂^m is built using an attention mechanism on top of the mention's word embeddings, similar to Lee et al. (2017):

  α_k = ⟨w_α, x_k⟩
  a_k^m = exp(α_k) / Σ_{t=q}^{r} exp(α_t)    (3)
  x̂^m = Σ_{k=q}^{r} a_k^m · v_k

However, we found the soft head embedding to only marginally improve results, probably due to the fact that most mentions are at most 2 words long. To learn non-linear interactions between the component word vectors, we project g^m to a final mention representation with the same size as the entity embeddings (see below) using a shallow feed-forward neural network FFNN (a simple projection layer):

  x^m = FFNN_1(g^m)    (4)

Figure 1: Our global model architecture, shown for the mention The New York Times. The final score is used for both the mention linking and entity disambiguation decisions.

Entity Embeddings. We use fixed continuous entity representations, namely the pre-trained entity embeddings of Ganea and Hofmann (2017), due to their simplicity and compatibility with the pre-trained word vectors of Mikolov et al. (2013). Briefly, these vectors are computed for each entity in isolation using the following exponential model that approximates the empirical conditional word-entity distribution p̂(w|e) obtained from co-occurrence counts:

  p̂(w|e) ≈ exp(⟨x_w, y_e⟩) / Σ_{w' ∈ W} exp(⟨x_{w'}, y_e⟩)    (5)

Here, x_w are the fixed pre-trained word vectors and y_e is the entity embedding to be trained. In practice, Ganea and Hofmann (2017) re-write this as a max-margin objective loss.

Candidate Selection. For each span m we select up to s entity candidates that might be referred to by this mention. These are the top entities based on an empirical probabilistic entity map p(e|m) built by Ganea and Hofmann (2017) from Wikipedia hyperlinks, Crosswikis Spitkovsky and Chang and YAGO Hoffart et al. (2011) dictionaries. We denote by C(m) this candidate set and use it both at training and test time (a toy sketch follows at the end of this section).

Final Local Score. For each span m that can possibly refer to an entity (i.e. |C(m)| ≥ 1) and for each of its entity candidates e_j ∈ C(m), we compute a similarity score using an embedding dot-product that supposedly captures useful information for both MD and ED decisions. We then combine it with the log-prior probability using a shallow FFNN, giving the context-aware entity - mention score:

  Ψ(e_j, m) = FFNN_2([log p(e_j|m); ⟨x^m, y_j⟩])    (6)

Long Range Context Attention. In some cases, our model might be improved by explicitly capturing long context dependencies. To test this, we experimented with the attention model of Ganea and Hofmann (2017). This gives one context embedding per mention, based on informative context words that are related to at least one of the candidate entities. We use this additional context embedding for computing a dot-product similarity with any of the candidate entity embeddings. This value is fed as an additional input of FFNN_2 in Eq. 6. We refer to this model as long range context attention.
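The following sketch renders Eqs. (2)-(4) and the dot-product of Eq. (6) in PyTorch, under the same assumptions as the previous snippet: FFNN_1 is a single projection layer as described in the text, and all names and dimensions are illustrative, not the authors' implementation.

```python
# Illustrative sketch of the mention representation (Eqs. 2-4) and the
# dot-product similarity used inside the local score of Eq. (6).
import torch
import torch.nn as nn

class MentionEncoder(nn.Module):
    def __init__(self, word_dim=300, ent_dim=300):
        super().__init__()
        self.attn = nn.Linear(word_dim, 1, bias=False)     # w_alpha in Eq. (3)
        self.proj = nn.Linear(3 * word_dim, ent_dim)       # FFNN_1 in Eq. (4)

    def forward(self, x, v, q, r):
        # x: (n, word_dim) context-aware word embeddings of the document
        # v: (n, word_dim) context-independent word-character embeddings
        # q, r: inclusive start/end token indices of the mention
        alpha = self.attn(x[q:r + 1]).squeeze(-1)          # scores alpha_k
        a = torch.softmax(alpha, dim=0)                    # attention a_k^m
        soft_head = (a.unsqueeze(-1) * v[q:r + 1]).sum(0)  # x_hat^m
        g = torch.cat([x[q], x[r], soft_head], dim=-1)     # g^m of Eq. (2)
        return self.proj(g)                                # x^m of Eq. (4)

# usage: dot-product with a candidate entity embedding, then FFNN_2 combines
# it with the log-prior as in Eq. (6); the 0.9 prior is a made-up value
enc = MentionEncoder()
x, v = torch.randn(10, 300), torch.randn(10, 300)
x_m = enc(x, v, q=2, r=4)
y_e = torch.randn(300)                                     # candidate entity
sim = torch.dot(x_m, y_e)
ffnn2 = nn.Linear(2, 1)                                    # FFNN_2, a projection
psi = ffnn2(torch.stack([torch.log(torch.tensor(0.9)), sim]))
```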

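Before turning to training, here is a toy illustration of the candidate selection step above. The p(e|m) map is modeled as a plain dictionary from lowercased surface forms to (entity, prior) pairs; the entries shown are invented for this example, not taken from the real Wikipedia/Crosswikis/YAGO dictionaries.

```python
# Toy p(e|m) dictionary and top-s candidate lookup C(m); data is made up.
P_E_M = {
    "new york times": [("The_New_York_Times", 0.91),
                       ("New_York_Times_Building", 0.04)],
    "obama": [("Barack_Obama", 0.83), ("Obama,_Fukui", 0.02)],
}

def candidates(mention, s=30):
    # returns up to s entity candidates with their priors p(e|m)
    entries = P_E_M.get(mention.lower(), [])
    return sorted(entries, key=lambda pair: -pair[1])[:s]

print(candidates("Obama"))  # [('Barack_Obama', 0.83), ('Obama,_Fukui', 0.02)]
```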
Training. We assume a corpus with documents and gold entity - mention pairs G = {(m_i, e_i*)}_{i=1,K} is available. At training time, for each input document we collect the set M of all (potentially overlapping) token spans m for which |C(m)| ≥ 1. We then train the parameters θ of our model using the following minimization procedure:

  θ* = argmin_θ Σ_{m ∈ M} Σ_{e ∈ C(m)} V(Ψ_θ(e, m))    (7)

where the violation term V enforces the scores of gold pairs to be linearly separable from the scores of negative pairs, i.e.

  V(Ψ(e, m)) = max(0, γ − Ψ(e, m))   if (e, m) ∈ G    (8)
               max(0, Ψ(e, m))       otherwise

Note that, in the absence of annotated negative examples of "non-linkable" mentions, we assume that all spans in M and their candidates that do not appear in G should not be linked. The model is thus enforced to only output negative scores for all entity candidates of such mentions.

We call the above setting all spans training. Our method can also be used to perform ED only, in which case we train only on gold mentions, i.e. M = {m | m ∈ G}. This is referred to as gold spans training. A minimal code sketch of the violation term of Eq. (8) follows.
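As referenced above, a minimal PyTorch sketch of the violation term of Eq. (8); the batching of mention-candidate scores into flat tensors is an assumption of this example.

```python
# Max-margin violation term of Eq. (8): gold pairs are pushed above the
# margin gamma, every other candidate pair is pushed below zero.
import torch

def violation(score: torch.Tensor, is_gold: torch.Tensor, gamma: float = 0.2):
    # score: (N,) local scores Psi(e, m); is_gold: (N,) bool mask for G
    gold_loss = torch.clamp(gamma - score, min=0.0)
    neg_loss = torch.clamp(score, min=0.0)
    return torch.where(is_gold, gold_loss, neg_loss).sum()

scores = torch.tensor([2.5, -0.3, 0.7])
gold = torch.tensor([True, False, False])
print(violation(scores, gold))  # tensor(0.7000): only the last pair violates
```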

Inference. At test time, our method can be applied to both EL and ED only, as follows. First, for each document in our validation or test sets, we select all possibly linkable token spans, i.e. M = {m | |C(m)| ≥ 1} for EL, or the input set of mentions M = {m | m ∈ G} for ED, respectively. Second, the best linking threshold δ is found on the validation set such that the micro F1 metric is maximized when only linking mention - entity pairs with a Ψ score greater than δ. At test time, only entity - mention pairs with a score higher than δ are kept and sorted according to their Ψ scores; the final annotations are greedily produced based on this set, such that only spans not overlapping with previously selected spans (of higher scores) are chosen.

Global Disambiguation. Our current model is "local", i.e. it performs disambiguation of each candidate span independently. To enhance it, we add an extra layer to our neural network that promotes coherence among linked and disambiguated entities inside the same document, i.e. the global disambiguation layer. Specifically, we compute a "global" mention-entity score based on which we produce the final annotations. We first define the set of mention-entity pairs that are allowed to participate in the global disambiguation voting, namely those that already have a high local score:

  V_G = {(m, e) | m ∈ M, e ∈ C(m), Ψ(e, m) ≥ γ'}

Since we initially consider all possible spans and, for each span, up to s candidate entities, this filtering step is important to avoid both undesired noise and exponential complexity for the EL task, for which M is typically much bigger than for ED. The final "global" score G(e_j, m) for entity candidate e_j of mention m is given by the cosine similarity between the entity embedding and the (normalized) sum of all other voting entities' embeddings (of the other mentions m'):

  V_G^m = {e | (m', e) ∈ V_G ∧ m' ≠ m}
  y_G^m = Σ_{e ∈ V_G^m} y_e
  G(e_j, m) = cos(y_{e_j}, y_G^m)

This is combined with the local score, yielding

  Φ(e_j, m) = FFNN_3([Ψ(e_j, m); G(e_j, m)])

The final loss function is now slightly modified. Specifically, we enforce the linear separability in two places: in Ψ(e, m) (exactly as before), but also in Φ(e, m), as follows:

  θ* = argmin_θ Σ_{d ∈ D} Σ_{m ∈ M} Σ_{e ∈ C(m)} [V(Ψ_θ(e, m)) + V(Φ_θ(e, m))]    (9)

The inference procedure remains unchanged in this case, with the exception that it only uses the Φ(e, m) global score; a sketch of the voting score and the greedy decoding is shown below.

Coreference Resolution Heuristic. In a few cases we found it important to be able to solve simple coreference resolution cases (e.g. "Alan" referring to "Alan Shearer"). These cases are difficult to handle by our candidate selection strategy. We thus adopt the simple heuristic described in Ganea and Hofmann (2017) and observed between 0.5% and 1% improvement on all datasets.
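Below is a sketch of the two pieces of test-time logic described above: the cosine voting score G(e_j, m) and the greedy, overlap-free decoding with threshold δ. The (q, r, entity, score) tuple format with inclusive token spans is assumed purely for illustration; scores would be Ψ for the local model or Φ for the global one.

```python
# Sketch of global voting and greedy decoding; data formats are assumptions.
import numpy as np

def global_score(y_ej, voting_embs):
    # G(e_j, m): cosine between the candidate entity embedding and the sum
    # of the other mentions' voting entity embeddings (the set V_G^m above)
    y_g = np.sum(voting_embs, axis=0)
    return float(y_ej @ y_g / (np.linalg.norm(y_ej) * np.linalg.norm(y_g)))

def greedy_decode(scored, delta):
    # keep pairs above the threshold, best first, skipping overlapping spans
    kept = sorted((c for c in scored if c[3] > delta), key=lambda c: -c[3])
    chosen, occupied = [], set()
    for q, r, ent, score in kept:
        span = set(range(q, r + 1))
        if span & occupied:          # a higher-scoring span already covers this
            continue
        occupied |= span
        chosen.append((q, r, ent, score))
    return chosen

# two overlapping candidate spans: only the higher-scoring one survives
print(greedy_decode([(0, 2, "The_New_York_Times", 3.1),
                     (1, 2, "New_York", 2.4)], delta=0.0))
```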
(each cell: F1@MA / F1@MI)

System                          AIDA A     AIDA B     MSNBC      OKE-2015   OKE-2016   N3-Reuters-128  N3-RSS-500  Derczynski  KORE50
FREME                           23.6/37.6  23.8/36.3  15.8/19.9  26.1/31.6  22.7/28.5  26.8/30.9       32.5/27.8   31.4/18.9   12.3/14.5
FOX                             54.7/58.0  58.1/57.0  11.2/8.3   53.9/56.8  49.5/50.5  52.4/53.3       35.1/33.8   42.0/38.0   28.3/30.8
Babelfy                         41.2/47.2  42.4/48.5  36.6/39.7  39.3/41.9  37.8/37.7  19.6/23.0       32.1/29.1   28.9/29.8   52.5/55.9
Entityclassifier.eu             43.0/44.7  42.9/45.0  41.4/42.2  29.2/29.5  33.8/32.5  24.7/27.9       23.1/22.7   16.3/16.9   25.2/28.0
Kea                             36.8/40.4  39.0/42.3  30.6/30.9  44.6/46.2  46.3/46.4  17.5/18.1       22.7/20.5   31.3/26.5   41.0/46.8
DBpedia Spotlight               49.9/55.2  52.0/57.8  42.4/40.6  42.0/44.4  41.4/43.1  21.5/24.8       26.7/27.2   33.7/32.2   29.4/34.9
AIDA                            68.8/72.4  71.9/72.8  62.7/65.1  58.7/63.1  0.0/0.0    42.6/46.4       42.6/42.4   40.6/32.6   49.6/55.4
WAT                             69.2/72.8  70.8/73.0  62.6/64.5  53.2/56.4  51.8/53.9  45.0/49.2       45.3/42.3   44.4/38.0   37.3/49.6
Best baseline                   69.2/72.8  71.9/73.0  62.7/65.1  58.7/63.1  51.8/53.9  52.4/53.3       45.3/42.4   44.4/38.0   52.5/55.9
base model                      86.6/89.1  81.1/80.5  64.5/65.7  54.3/58.2  43.6/46.0  47.7/49.0       44.2/38.8   43.5/38.1   34.9/42.0
base model + att                86.5/88.9  81.9/82.3  69.4/69.5  56.6/60.7  49.2/51.6  48.3/51.1       46.0/40.5   47.9/42.3   36.0/42.2
base model + att + global       86.6/89.4  82.6/82.4  73.0/72.4  56.6/61.9  47.8/52.7  45.4/50.3       43.8/38.2   43.2/34.1   26.2/35.2
ED base model + att + global
  using Stanford NER mentions   75.7/80.3  73.3/74.6  71.1/71.0  62.9/66.9  57.1/58.4  54.2/54.6       45.9/42.2   48.8/42.3   40.3/46.0

Table 2: EL strong matching results on the Gerbil platform. Macro and Micro F1 scores are shown (each cell: F1@MA / F1@MI). We highlight the best and second best models, respectively. Training was done on the AIDA-train set.

4 Experiments

Datasets and Metrics. We used Wikipedia 2014 as our KB. We conducted experiments on the most important public EL datasets using the Gerbil platform Röder et al. (2017). Dataset statistics are provided in Tables 5 and 6 (Appendix). This benchmarking framework offers reliable and trustworthy evaluation and comparison with state-of-the-art EL/ED methods on most of the public datasets for this task. It also shows how well different systems generalize to datasets from very different domains and annotation schemes compared to their training sets. Moreover, it offers evaluation metrics for the end-to-end EL task, as opposed to some works that only evaluate NER and ED separately, e.g. Luo et al. (2015).

As previously explained, we do not use the NIL mentions (those without a valid KB entity) and only compare against other systems using the InKB metrics.

For training we used the biggest publicly available EL dataset, AIDA/CoNLL Hoffart et al. (2011), consisting of a training set of 18,448 linked mentions in 946 documents, a validation set of 4,791 mentions in 216 documents, and a test set of 4,485 mentions in 231 documents.

We report micro and macro InKB F1 scores for both EL and ED. For EL, these metrics are computed in both the strong matching and weak matching settings. The former requires exactly predicting the gold mention boundaries and their entity annotations, whereas the latter gives a perfect score to spans that merely overlap with the gold mentions and are linked to the correct gold entities. A toy sketch of the strong-matching micro F1 is given at the end of this subsection.

Baselines. We compare with popular and state-of-the-art EL and ED public systems, e.g. WAT Piccinno and Ferragina (2014), DBpedia Spotlight Mendes et al. (2011), NERD-ML van Erp et al. (2013), KEA Steinmetz and Sack (2013), AIDA Hoffart et al. (2011), FREME, FOX Speck and Ngomo (2014), PBOH Ganea et al. (2016) and Babelfy Moro et al. (2014). These are integrated within the Gerbil platform. However, we did not compare with Luo et al. (2015) and other models outside Gerbil that do not use end-to-end EL metrics.

Training Details and Model Hyper-Parameters. We use the same model hyper-parameters for both EL and ED training. The only difference between the two settings is the span set M: EL uses the all spans training setting, whereas ED uses the gold spans training setting, as explained before.
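As referenced above, the toy function below computes strong-matching InKB micro F1 over sets of (document, start, end, entity) annotations; a prediction counts only if both the exact gold span and the gold entity match. This tuple format is an assumption of the example, not Gerbil's actual implementation.

```python
# Toy strong-matching micro F1: exact span and entity must both match.
def micro_f1(pred, gold):
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("d1", 0, 2, "The_New_York_Times")}
pred = {("d1", 0, 2, "The_New_York_Times"), ("d1", 5, 5, "London")}
print(micro_f1(pred, gold))  # 0.666...: one true positive, one spurious span
```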
(each cell: % solved / % matched gold entity)

position of ground truth in p(e|m)   #mentions   ED Global    ED Base+Att   ED Base      EL Global    EL Base+Att   EL Base      EL Global w/o log p(e|m)
1                                    3390        98.8 / 98.8  98.6 / 98.9   98.3 / 98.8  96.7 / 99.0  96.6 / 98.6   96.2 / 99.0  93.3 / 96.7
2                                    655         89.9 / 89.9  88.1 / 88.5   88.5 / 88.9  86.8 / 90.8  86.8 / 90.8   85.0 / 88.4  86.9 / 89.8
3                                    108         83.3 / 84.3  81.5 / 82.4   75.9 / 78.7  79.1 / 84.5  80.2 / 84.7   74.8 / 81.1  84.3 / 88.9
4-8                                  262         78.2 / 78.2  76.3 / 79.0   74.8 / 76.0  69.5 / 78.2  68.8 / 78.7   68.7 / 76.0  78.9 / 83.5
9+                                   247         59.9 / 63.6  53.4 / 60.7   53.0 / 58.7  47.8 / 58.2  46.2 / 59.4   50.4 / 54.8  62.7 / 67.5

Table 3: AIDA A dataset: gold mentions split by the position at which their ground truth entity appears in the p(e|m) dictionary. In each cell, the first value is the percentage of gold mentions that were annotated with the correct entity (recall), whereas the second value is the percentage of gold mentions for which our system's highest scored entity is the ground truth entity, but which might not be annotated in the end because its score is below the threshold δ.

Our pre-trained word and entity embeddings are 300-dimensional, while 50-dimensional trainable character vectors were used. The char LSTMs have a hidden dimension of 50. Thus, word-character embeddings are 400-dimensional. The contextual LSTMs have a hidden size of 150, resulting in 300-dimensional context-aware word vectors. We apply dropout on the concatenated word-character embeddings, on the output of the bidirectional context LSTM and on the entity embeddings used in Eq. 6. The three FFNNs in our model are simple projections without hidden layers (no improvements were obtained with deeper layers). For the long range context attention we used a word window size of K = 200 and keep the top R = 10 words after the hard attention layer (notations from Ganea and Hofmann (2017)). We use at most s = 30 entity candidates per mention, both at train and test time. γ is set to 0.2 without further investigation. γ' is set to 0, but a value of 0.1 gave similar results.

For the loss optimization we use Adam Kingma and Ba (2014) with a learning rate of 0.001. We perform early stopping by evaluating the model on the AIDA validation set every 10 minutes and stopping after 6 consecutive evaluations with no significant improvement in the macro F1 score.

Results and Discussion. The following models are used:
i) Base model: only uses the mention local score and the log-prior. It uses neither long range attention nor global disambiguation, and does not use the head attention mechanism.
ii) Base model + att: the Base model plus Long Range Context Attention.
iii) Base model + att + global: our Global model (depicted in Figure 1).
iv) ED base model + att + global Stanford NER: our ED Global model run on top of the mentions detected by the Stanford NER system Finkel et al. (2005).

EL strong and weak matching results are presented in Tables 2 and 7 (Appendix).

We first note that our system outperforms all baselines on the end-to-end EL task on both the AIDA-A (dev) and AIDA-B (test) datasets, which are the biggest EL datasets publicly available. Moreover, we surpass all competitors on both EL and ED by a large margin, at least 9%, showcasing the effectiveness of our method. We also outperform systems that optimize MD and ED separately, including our ED base model + att + global Stanford NER. This demonstrates the merit of joint MD + ED optimization.

In addition, one can observe that the weak matching EL results are comparable with the strong matching results, showcasing that our method is very good at detecting mention boundaries.

At this point, our main goal was achieved: if enough training data is available with the same characteristics or annotation schemes as the test data, then our joint EL offers the best model. This is true not only when training on AIDA, but also for other types of datasets such as queries (Table 11) or tweets (Table 12).
1) Annotated document:
SOCCER - [SHEARER] NAMED AS [1 [ENGLAND] CAPTAIN]. [LONDON] 1996-08-30 The world's costliest footballer [Alan Shearer] was named as the new [England] captain on Friday. The 26-year-old, who joined [Newcastle] for 15 million pounds sterling, takes over from [Tony Adams], who led the side during the [European] championship in June, and former captain [David Platt]. [2 Adams] and [Platt] are both injured and will miss [England]'s opening [3 [World Cup]] qualifier against [Moldova] on Sunday. [Shearer] takes the captaincy on a trial basis, but new coach [Glenn Hoddle] said he saw no reason why the former [Blackburn] and [Southampton] skipper should not make the post his own. "I'm sure [4 Alan] is the man for the job," [Hoddle] said. [...] I spoke to [5 Alan] he was up for it [...]. [Shearer]'s [Euro 96] striking partner [...].

Analysis: [1 is considered a false negative, where the ground truth is the entity England_national_football_team. Our annotation was for the span "ENGLAND CAPTAIN", wrongly linked to England_cricket_team. [3 The ground truth here is 1998_FIFA_World_Cup, whereas our model links it to FIFA_World_Cup. [2, [4 and [5 are correctly solved due to our coreference resolution heuristic.

2) Annotated document:
[N. Korea] urges [S. Korea] to return war veteran. [SEOUL] 1996-08-31 [North Korea] demanded on Saturday that [South Korea] return a northern war veteran who has been in the [1 South] since the 1950-53 war, [Seoul]'s unification ministry said. "...I request the immediate repatriation of Kim In-so to [North Korea] where his family is waiting," [1 North Korean] Red Cross president Li Song-ho said in a telephone message to his southern couterpart, [2 Kang Young-hoon]. Li said Kim had been critically ill with a cerebral haemorrhage. The message was distributed to the press by the [South Korean] unification ministry. Kim, an unrepentant communist, was captured during the [2 [3 Korean] War] and released after spending more than 30 years in a southern jail. He submitted a petition to the [International Red Cross] in 1993 asking for his repatriation. The domestic [Yonhap] news agency said the [South Korean] government would consider the northern demand only if the [3 North] accepted [Seoul]'s requests, which include regular reunions of families split by the [4 [4 Korean] War]. Government officials were not available to comment. [South Korea] in 1993 unconditionally repatriated Li In-mo, a nothern partisan seized by the [5 South] during the war and jailed for more than three decades.

Analysis: the [1, [3, and [5 cases illustrate the main source of errors. These are false negatives in which our model has the correct ground truth entity pair as the highest scored one for that mention, but since it is not confident enough (score < δ) it decides not to annotate that mention. In this specific document, these errors could probably be avoided with a better coreference resolution mechanism. The [3 and [4 cases illustrate that the gold standard can be problematic. Specifically, instead of annotating the whole span Korean War and linking it to the war of 1950, the gold annotation only includes Korean and links it to the general entity Korea_(country). [2 is correctly annotated by our system but is not included in the gold standard.

Table 4: Error analysis on a sample document. Green corresponds to true positives (correctly discovered and annotated mentions), red to false negatives (ground truth mentions or entities that were not annotated) and orange to false positives (incorrect mention or entity annotations).

However, when the testing data has different statistics or follows different conventions than the training data, our method is shown to work best in conjunction with a state-of-the-art NER system, as can be seen from the results of our ED base model + att + global Stanford NER for different datasets in Table 2. It is expected that such a NER system, designed with a broader generalization scheme in mind, would help in this case, which is confirmed by our results on different datasets.

While the main focus of this paper is the end-to-end EL and not the ED-only task, we do show ED results in Tables 8 and 9. We observe that our models are slightly behind recent top performing systems, but our unified EL - ED architecture has to deal with other challenges, e.g. being able to exchange global information between many more mentions at the EL stage, and is thus not suitable for expensive global ED strategies. We leave bridging this gap for ED as future work.

Additional results and insights are shown in the Appendix.

Ablation study. Table 3 shows an ablation study of our method. One can see that the log p(e|m) prior is very helpful for correctly linking unambiguous mentions, but introduces noise when gold entities are not frequent. For this category of rare entities, removing this prior completely results in a significant improvement, but this is not a practical choice since the gold entity is unknown at test time.

Error Analysis. We conducted a qualitative experiment shown in Table 4. We showcase correct annotations, as well as errors made by our system on the AIDA datasets. Inspecting the output annotations of our EL model, we discovered the remarkable property of not over-generating incorrect mentions, nor under-generating (missing) gold spans. We also observed that the additional mentions generated by our model do correspond, in the majority of cases, to actual KB entities, but are incorrectly missing from the gold annotations.

5 Conclusion

We presented the first neural end-to-end entity linking model and showed the benefit of jointly optimizing entity recognition and linking. Leveraging key components, namely word, entity and mention embeddings, we prove that engineered features can be almost completely replaced by modern neural networks. Empirically, on the established Gerbil benchmarking platform, we exhibit state-of-the-art performance for EL on the biggest public dataset, AIDA/CoNLL, also showing good generalization ability on other datasets with very different characteristics when combining our model with the popular Stanford NER system.

Our code is publicly available2.

2 https://github.com/dalab/end2end_neural_el

References

David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich, Bo-June Paul Hsu, and Kuansan Wang. 2014. ERD'14: entity recognition and disambiguation challenge. In ACM SIGIR Forum, volume 48, pages 63–77. ACM.

Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. 2013. Dexter: an open source framework for entity linking. In Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval, pages 17–20. ACM.

Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita, Stefan Rüd, and Hinrich Schütze. 2016. A piggyback system for joint entity mention detection and linking in web queries. In Proceedings of the 25th International Conference on World Wide Web, pages 567–578. International World Wide Web Conferences Steering Committee.

Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems, pages 121–124. ACM.

Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.

MGJ van Erp, G Rizzo, and R Troncy. 2013. Learning with the web: Spotting named entities on the intersection of NERD and machine learning. In CEUR Workshop Proceedings, pages 27–30.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

Octavian-Eugen Ganea, Marina Ganea, Aurelien Lucchi, Carsten Eickhoff, and Thomas Hofmann. 2016. Probabilistic bag-of-hyperlinks model for entity linking. In Proceedings of the 25th International Conference on World Wide Web, pages 927–938. International World Wide Web Conferences Steering Committee.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629.

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 30–34.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 545–554. ACM.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A Smith. 2017. Dynamic entity representations in neural language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1830–1839.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.

Phong Le and Ivan Titov. 2018. Improving entity linking by modeling latent relations between mentions. arXiv preprint arXiv:1804.10637.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197.

Liyuan Liu, Jingbo Shang, Frank Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2017. Empower sequence labeling with task-aware neural language model. arXiv preprint arXiv:1709.04109.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888.

Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.

Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2016. J-NERD: joint named entity recognition and disambiguation with rich linguistic features. Transactions of the Association for Computational Linguistics, 4:215–229.

Andrea Giovanni Nuzzolese, Anna Lisa Gentile, Valentina Presutti, Aldo Gangemi, Darío Garigliotti, and Roberto Navigli. 2015. Open knowledge extraction challenge. In Semantic Web Evaluation Challenge, pages 3–15. Springer.

Francesco Piccinno and Paolo Ferragina. 2014. From TagME to WAT: a new entity annotator. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, pages 55–62. ACM.

Priya Radhakrishnan, Partha Talukdar, and Vasudeva Varma. 2018. ELDEN: Improved entity linking using densified knowledge graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1844–1853.

Giuseppe Rizzo, Marieke van Erp, and Raphaël Troncy. Benchmarking the extraction and disambiguation of named entities on the semantic web.

Michael Röder, Ricardo Usbeck, and Axel-Cyrille Ngonga Ngomo. 2017. GERBIL – benchmarking named entity recognition and linking consistently. Semantic Web, (Preprint):1–21.

Avirup Sil and Alexander Yates. 2013. Re-ranking for joint named-entity recognition and linking. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2369–2374. ACM.

René Speck and Axel-Cyrille Ngonga Ngomo. 2014. Ensemble learning for named entity recognition. In International Semantic Web Conference, pages 519–534. Springer.

Valentin I Spitkovsky and Angel X Chang. A cross-lingual dictionary for English Wikipedia concepts.

Nadine Steinmetz and Harald Sack. 2013. Semantic multimedia information retrieval based on contextual descriptions. In Extended Semantic Web Conference, pages 382–396. Springer.

Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2015. Modeling mention, context and entity with neural networks for entity disambiguation. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259.

Yi Yang, Ozan Irsoy, and Kazi Shefaet Rahman. 2018. Collective entity disambiguation with structured gradient tree boosting. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 777–786.

3 https://github.com/dice-group/gerbil/issues/98
