Contextualized End-to-End Neural Entity Linking

Haotian Chen, Andrej Zukov-Gregoric, Xi (David) Li (BlackRock) and Sahil Wadhwa (University of Illinois at Urbana-Champaign; work done while at BlackRock)

Abstract

We propose an entity linking (EL) model that jointly learns mention detection (MD) and entity disambiguation (ED). Our model applies task-specific heads on top of shared BERT contextualized embeddings. We achieve state-of-the-art results on a standard EL dataset using our model; we also study our model's performance in the setting where hand-crafted entity candidate sets are not available and find that the model performs well in that setting too.

1 Introduction

Entity linking (EL) [1], in our context, refers to the joint task of recognizing named entity mentions in text through mention detection (MD) and linking each mention to a unique entity in a knowledge base (KB) through entity disambiguation (ED) [2]. For example, in the sentence "The Times began publication under its current name in 1788," the span The Times should be detected as a named entity mention and then linked to the corresponding entity: The Times, a British newspaper. However, an EL model which applies MD and ED disjointly might easily confuse this mention with The New York Times, an American newspaper. Since our model jointly learns MD and ED from the same contextualized BERT embeddings, its final EL prediction is partially informed by both. As a result, it is able to generalize better.

Another common approach employed in previous EL research is candidate generation, where for each detected mention a set of candidate entities is generated and the entities within it are ranked by a model to find the best match. Such sets are built using hand-crafted rules which define which entities make it in and which do not. This risks (1) skipping valid entities which should be in the candidate set and (2) inflating model performance, since candidate sets often contain only one or two items. These sets are almost always used at prediction time and sometimes even during training. Our model has the option of not relying on them during prediction, and never uses them during training.

We introduce two main contributions: (i) we propose a new end-to-end differentiable neural EL model that jointly performs MD and ED and achieves state-of-the-art performance; (ii) we study the performance of our model when candidate sets are removed, to see whether EL can perform well without them.

[1] Also known as the A2KB task in the GERBIL evaluation platform (Röder et al., 2018) and as end-to-end entity linking in some literature.
[2] Also known as the D2KB task in GERBIL.

2 Related Work

Neural-network based models have recently achieved strong results across standard EL datasets. Research has focused on learning better entity representations and on extracting better local and global features through novel model architectures.

Entity representation. Good KB entity representations are a key component of most ED and EL models. Representation learning has been addressed by Yamada et al. (2016), Ganea and Hofmann (2017), Cao et al. (2017) and Yamada et al. (2017). Sil et al. (2018) and Cao et al. (2018) extend it to the cross-lingual setting. More recently, Yamada and Shindo (2019) have suggested learning entity representations using BERT, which achieves state-of-the-art results in ED.

Entity disambiguation (ED). The ED task assumes already-labelled mention spans which are then disambiguated. Recent work on ED has focused on extracting global features (Ratinov et al., 2011; Globerson et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018), extending the scope of ED to more non-standard datasets (Eshel et al., 2017), and positing the problem in new ways, such as building separate classifiers for KB entities (Barrena et al., 2018).

Entity linking (EL). Early work by Sil and Yates (2013), Luo et al. (2015) and Nguyen et al. (2016) introduced models that jointly learn NER and ED using engineered features. More recently, Kolitsas et al. (2018) propose a neural model that first generates all combinations of spans as potential mentions and then learns similarity scores over their entity candidates. MD is handled implicitly by only considering mention spans which have non-empty candidate entity sets. Martins et al. (2019) propose training a multi-task NER and ED objective using a Stack-LSTM (Dyer et al., 2015). Finally, Poerner et al. (2019) and Broscheit (2019) both propose end-to-end EL models based on BERT. Poerner et al. (2019) model the similarity between entity embeddings and contextualized word embeddings by mapping the former onto the latter, whereas Broscheit (2019) in essence does the opposite. Our work differs in three important ways: our training objective is different in that we explicitly model MD; we analyze the performance of our model when candidate sets are expanded to include the entire universe of entity embeddings; and we outperform both models by a wide margin.

3 Model Description

Given a document containing a sequence of n tokens w = {w_1, ..., w_n} with mention label indicators y_md ∈ {I, O, B}^n and entity IDs y_ed ∈ {j ∈ Z : j ∈ [1, k]}^n which index a pre-trained entity embedding matrix E ∈ R^{k×d} of entity universe size k and entity embedding dimension d, the model is trained to tag each token with its correct mention indicator and link each mention to its correct entity ID. We use the standard inside-outside-beginning (IOB) tagging format introduced by Ramshaw and Marcus (1995).

3.1 Text Encoder

The text input to our model is encoded by BERT (Devlin et al., 2019). We initialize the pre-trained weights from BERT-BASE (https://github.com/google-research/bert). The text input is tokenized by the cased WordPiece (Johnson et al., 2017) sub-word tokenizer. The text encoder outputs n contextualized WordPiece embeddings h which are grouped to form the embedding matrix H ∈ R^{n×m}, where m is the embedding dimension. In the case of BERT-BASE, m is equal to 768.

The transformation from word-level to WordPiece sub-word-level labels is handled similarly to the BERT NER task, where the head WordPiece token represents the entire word and tail tokens are disregarded (see the sketch at the end of this subsection).

BERT comes in two settings: feature-based and fine-tuned. Under the feature-based setting, BERT parameters are not trainable in the domain task (EL), whereas the fine-tuned setting allows BERT parameters to adapt to the domain task.
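The paper does not provide code for this word-to-WordPiece label alignment; the following is a minimal sketch of one common way to implement it, assuming the HuggingFace transformers tokenizer. The tokenizer class, the ignore-index convention, and the integer tag encoding are our illustrative choices, not the authors'.

```python
# Minimal sketch (not the authors' code): propagate word-level IOB labels to
# WordPiece tokens so that only each word's head (first) sub-token keeps a
# label and tail sub-tokens are ignored during training.
from transformers import BertTokenizer  # assumed tooling; the paper names no library

IGNORE_INDEX = -100  # conventional "ignore this position" label for tail tokens
TAG_TO_ID = {"O": 0, "I": 1, "B": 2}  # illustrative encoding of the IOB tag set

def align_labels(words, word_tags, tokenizer):
    """Tokenize `words` into WordPieces and copy each word's tag onto its head
    sub-token only; tail sub-tokens receive IGNORE_INDEX."""
    pieces, tags = [], []
    for word, tag in zip(words, word_tags):
        sub_tokens = tokenizer.tokenize(word)
        pieces.extend(sub_tokens)
        tags.extend([TAG_TO_ID[tag]] + [IGNORE_INDEX] * (len(sub_tokens) - 1))
    return pieces, tags

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
words = ["Leicestershire", "beat", "Somerset", "County", "Cricket", "Club"]
word_tags = ["B", "O", "B", "I", "I", "I"]  # IOB tags for Figure 1's example
print(align_labels(words, word_tags, tokenizer))
```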
3.2 EL model

MD is modeled as a sequence labelling task. Contextualized embeddings h are passed through a feed-forward neural network and then softmaxed for classification over the IOB tags:

    m_md = W_md h + b_md        (1)
    p_md = softmax(m_md)        (2)

where b_md ∈ R^3 is the bias term, W_md ∈ R^{3×m} is a weight matrix, and p_md ∈ R^3 is the predicted distribution across the {I, O, B} tag set. The predicted tag is then simply:

    ŷ_md = argmax_i p_md(i)     (3)
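For concreteness, here is a minimal PyTorch sketch of the MD head in Eqs. (1)-(3). The module and variable names and default sizes are ours; the paper specifies only the linear map, the softmax, and the argmax.

```python
# Minimal sketch (not the authors' code) of the MD head in Eqs. (1)-(3):
# a single linear layer over each contextualized embedding followed by a
# softmax over the {I, O, B} tag set and an argmax to pick the tag.
import torch
import torch.nn as nn

class MentionDetectionHead(nn.Module):
    def __init__(self, hidden_dim=768, num_tags=3):  # m = 768 for BERT-BASE
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_tags)  # holds W_md and b_md

    def forward(self, h):
        # h: (batch, n, m) contextualized WordPiece embeddings from BERT
        m_md = self.linear(h)               # Eq. (1)
        p_md = torch.softmax(m_md, dim=-1)  # Eq. (2)
        y_md = p_md.argmax(dim=-1)          # Eq. (3): predicted I/O/B tag per token
        return p_md, y_md
```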
ED is modeled by finding the entity closest to the predicted entity embedding (during inference this entity can come from either the entire entity universe or some candidate set). We do this by applying an additional ED-specific feed-forward neural network to h:

    m_ed = tanh(W_ed h + b_ed)
    p_ed = s(m_ed, E)           (4)
    ŷ_ed = argmax_j p_ed(j)

where b_ed ∈ R^d is the bias term, W_ed ∈ R^{d×m} is a weight matrix, m_ed ∈ R^d is the same size as an entity embedding, and s is any similarity measure which relates m_ed to every entity embedding in E. In our case, we use cosine similarity. Our predicted entity is the index of p_ed with the highest similarity score.
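A corresponding PyTorch sketch of the ED head in Eq. (4), with cosine similarity as the measure s. The projection size and the normalization details are our assumptions; the paper specifies only the tanh feed-forward layer and a similarity against every entity embedding in E.

```python
# Minimal sketch (not the authors' code) of the ED head in Eq. (4): a tanh
# feed-forward projection into the entity embedding space, then cosine
# similarity against every row of the pre-trained entity embedding matrix E.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityDisambiguationHead(nn.Module):
    def __init__(self, hidden_dim=768, entity_dim=300):  # entity_dim d is illustrative
        super().__init__()
        self.proj = nn.Linear(hidden_dim, entity_dim)  # holds W_ed and b_ed

    def forward(self, h, entity_embeddings):
        # h: (batch, n, m); entity_embeddings: (k, d) matrix E
        m_ed = torch.tanh(self.proj(h))        # tanh(W_ed h + b_ed)
        # cosine similarity s(m_ed, E): normalize both sides, then dot product
        m_norm = F.normalize(m_ed, dim=-1)
        e_norm = F.normalize(entity_embeddings, dim=-1)
        p_ed = m_norm @ e_norm.t()             # (batch, n, k) similarity scores
        y_ed = p_ed.argmax(dim=-1)             # predicted entity ID per token
        return p_ed, y_ed
```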

[Figure 1: Architecture of the proposed model. WordPiece tokens are passed through BERT, forming contextualized embeddings. Each contextualized embedding is passed through two task-specific feed-forward neural networks, for MD and ED respectively. The entity ID predicted at the 'B' MD tag is extended to the entire mention span.]

We use pre-trained entity embeddings from wikipedia2vec (Yamada et al., 2018), as pre-training optimal entity representations is beyond the scope of this work. Ideally, pre-trained entity embeddings should come from an architecture similar to our EL model, but experiments show strong results even when they do not.

The MD and ED losses are combined into a joint training objective using a hyperparameter λ (in our case λ = 0.1). Note that L_md is calculated for all non-pad head WordPiece tokens, whereas L_ed is calculated only for the first WordPiece token of every labelled entity mention with a linkable and valid entity ID.
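One way such an objective could look in PyTorch is sketched below. The masking rules and λ = 0.1 follow the text above, but the exact combination (a weighted sum with λ on the ED term) and all names are our assumptions.

```python
# Minimal sketch (not the authors' code) of a joint MD + ED objective
# consistent with the description above; the weighted sum is our assumption.
import torch.nn.functional as F

def joint_loss(md_logits, y_md, ed_scores, y_ed, md_mask, ed_mask, lam=0.1):
    # md_logits: (batch, n, 3) pre-softmax MD scores (m_md in Eq. 1)
    # ed_scores: (batch, n, k) similarity of each token to every entity (p_ed),
    #            treated here as logits, which is our assumption
    # md_mask:   True at every non-pad head WordPiece token
    # ed_mask:   True only at the first WordPiece of each labelled mention
    #            that has a linkable, valid entity ID
    loss_md = F.cross_entropy(md_logits[md_mask], y_md[md_mask])  # L_md
    loss_ed = F.cross_entropy(ed_scores[ed_mask], y_ed[ed_mask])  # L_ed
    return loss_md + lam * loss_ed  # combined with hyperparameter lambda = 0.1
```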

4 Experiments

4.1 Dataset and Performance Metrics
