Global Entity Disambiguation with Pretrained Contextualized Embeddings of Words and Entities
Ikuya Yamada (1,4), Koki Washio (2,4), Hiroyuki Shindo (3,4), Yuji Matsumoto (4)
1 Studio Ousia, Tokyo, Japan  2 The University of Tokyo, Tokyo, Japan  3 Nara Institute of Science and Technology, Nara, Japan  4 RIKEN AIP, Tokyo, Japan
arXiv:1909.00426v2 [cs.CL] 4 Apr 2020

Abstract

We propose a new global entity disambiguation (ED) model based on contextualized embeddings of words and entities. Our model is based on a bidirectional transformer encoder (i.e., BERT) and produces contextualized embeddings for the words and entities in the input text. The model is trained using a new masked entity prediction task that aims to train the model by predicting randomly masked entities in entity-annotated texts obtained from Wikipedia. We further extend the model by solving ED as a sequential decision task to capture global contextual information. We evaluate our model using six standard ED datasets and achieve new state-of-the-art results on all but one dataset.

1 Introduction

Entity disambiguation (ED) refers to the task of assigning entity mentions in a text to corresponding entries in a knowledge base (KB). This task is challenging because of the ambiguity between entity names (e.g., "World Cup") and the entities they refer to (e.g., FIFA World Cup or Rugby World Cup). Recent ED models typically rely on two types of contextual information: local information based on words that co-occur with the mention, and global information based on the document-level coherence of the disambiguation decisions. A key to improving the performance of ED is to combine both local and global information, as observed in most recent ED models.

In this study, we propose a novel ED model based on contextualized embeddings of words and entities. The proposed model is based on BERT (Devlin et al., 2019). Our model takes the words and entities in the input document and produces a contextualized embedding for each word and entity. Inspired by the masked language model (MLM) adopted in BERT, we propose masked entity prediction (MEP), a novel task that aims to train the model by predicting randomly masked entities based on words and non-masked entities. We train the model using texts and their entity annotations retrieved from Wikipedia.

Furthermore, we introduce a simple extension to the inference step of the model to capture global contextual information. Specifically, similar to the approach used in past work (Fang et al., 2019; Yang et al., 2019), we address ED as a sequential decision task that disambiguates mentions one by one and uses the words and the already disambiguated entities to disambiguate new mentions.

We evaluate the proposed model using six standard ED datasets and achieve new state-of-the-art results on all but one dataset. Furthermore, we will publicize our code and trained embeddings.

2 Background and Related Work

Neural network-based approaches have recently achieved strong results on ED (Ganea and Hofmann, 2017; Yamada et al., 2017; Le and Titov, 2018; Cao et al., 2018; Le and Titov, 2019; Yang et al., 2019). These approaches are typically based on embeddings of words and entities trained using a large KB (e.g., Wikipedia). Such embeddings enable us to design ED models that capture the contextual information required to address ED. These embeddings are typically based on conventional word embedding models (e.g., skip-gram (Mikolov et al., 2013)) that assign a fixed embedding to each word and entity (Yamada et al., 2016; Cao et al., 2017; Ganea and Hofmann, 2017).

Shahbazi et al. (2019) and Broscheit (2019) proposed ED models based on contextualized word embeddings, namely ELMo (Peters et al., 2018) and BERT, respectively. These models predict the referent entity of a mention using the contextualized embeddings of the constituent or surrounding words of the mention. However, unlike our proposed model, these models address the task based only on local contextual information.
[Figure 1: Architecture of the proposed contextualized embeddings of words and entities. The example input consists of the words "[CLS] madonna lives in new york [SEP]" and the entities Madonna and New_York_City; each input embedding is the sum of a token embedding, a token type embedding, and a position embedding, and a bidirectional transformer produces an output embedding for each token.]

3 Contextualized Embeddings of Words and Entities for ED

Figure 1 illustrates the architecture of our contextualized embeddings of words and entities. Our model adopts a multi-layer bidirectional transformer encoder (Vaswani et al., 2017).

Given a document, the model first constructs a sequence of tokens consisting of the words in the document and the entities appearing in the document. Then, the model represents the sequence as a sequence of input embeddings, one for each token, and generates a contextualized output embedding for each token. Both the input and output embeddings have H dimensions. Hereafter, we denote the number of words and the number of entities in the vocabulary of our model by $V_w$ and $V_e$, respectively.

3.1 Input Representation

Similar to the approach adopted in BERT (Devlin et al., 2019), the input representation of a given token (word or entity) is constructed by summing the following three embeddings of H dimensions:

• Token embedding is the embedding of the corresponding token. The matrices of the word and entity token embeddings are represented as $A \in \mathbb{R}^{V_w \times H}$ and $B \in \mathbb{R}^{V_e \times H}$, respectively.

• Token type embedding represents the type of the token, namely, word type (denoted by $C_{word}$) or entity type (denoted by $C_{entity}$).

• Position embedding represents the position of the token in the word sequence. A word and an entity appearing at the i-th position in the sequence are represented as $D_i$ and $E_i$, respectively. If an entity name contains multiple words, its position embedding is computed by averaging the embeddings of the corresponding positions.

Following BERT (Devlin et al., 2019), we insert the special tokens [CLS] and [SEP] into the word sequence as the first and last words, respectively.

3.2 Masked Entity Prediction

To train the model, we propose masked entity prediction (MEP), a novel task based on MLM. In particular, some percentage of the input entities are masked at random; then, the model learns to predict the masked entities based on words and non-masked entities. We represent masked entities using the special [MASK] entity token.

We adopt a model equivalent to the one used to predict words in MLM. Specifically, we predict the original entity corresponding to a masked entity by applying the softmax function over all entities in our vocabulary:

    $\hat{y}_{MEP} = \mathrm{softmax}(B m + b_o)$,    (1)

where $b_o \in \mathbb{R}^{V_e}$ is the output bias, and $m \in \mathbb{R}^H$ is derived as

    $m = \mathrm{layer\_norm}(\mathrm{gelu}(W_f h + b_f))$,    (2)

where $h \in \mathbb{R}^H$ is the output embedding corresponding to the masked entity, $W_f \in \mathbb{R}^{H \times H}$ is the weight matrix, $b_f \in \mathbb{R}^H$ is the bias, gelu(·) is the gelu activation function (Hendrycks and Gimpel, 2016), and layer_norm(·) is the layer normalization function (Lei Ba et al., 2016).
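To make the prediction head concrete, the following is a minimal PyTorch sketch of Eq. (1) and Eq. (2). The class and variable names (MEPHead, entity_embeddings, output_bias) are our own illustrative choices rather than the authors' released implementation; reusing the entity embedding matrix B as the output projection follows Eq. (1), and H = 1024 is the hidden size of BERT_LARGE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MEPHead(nn.Module):
    """Sketch of the masked entity prediction head (Eq. (1) and Eq. (2))."""

    def __init__(self, entity_embeddings: nn.Embedding, hidden_size: int):
        super().__init__()
        v_e = entity_embeddings.num_embeddings
        # Eq. (2): m = layer_norm(gelu(W_f h + b_f))
        self.transform = nn.Linear(hidden_size, hidden_size)  # W_f, b_f
        self.layer_norm = nn.LayerNorm(hidden_size)
        # Eq. (1): scores = B m + b_o, with B the entity token embedding matrix
        self.entity_embeddings = entity_embeddings             # B in R^{V_e x H}
        self.output_bias = nn.Parameter(torch.zeros(v_e))      # b_o in R^{V_e}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: output embeddings at masked entity positions, shape (..., H)
        m = self.layer_norm(F.gelu(self.transform(h)))
        logits = m @ self.entity_embeddings.weight.t() + self.output_bias
        return torch.softmax(logits, dim=-1)  # \hat{y}_MEP over the entity vocabulary


# Example with the vocabulary size reported in Section 3.3 and H = 1024
entity_emb = nn.Embedding(num_embeddings=128_040, embedding_dim=1024)
head = MEPHead(entity_emb, hidden_size=1024)
h = torch.randn(2, 1024)        # output embeddings of two masked entities
probs = head(h)                 # shape (2, 128040)
```

In practice one would compute the training loss from the logits with cross-entropy rather than materializing the softmax, but the softmax form is kept here to mirror Eq. (1).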
3.3 Training

We used the same transformer architecture adopted in the BERT_LARGE model (Devlin et al., 2019). We initialized the parameters of our model that were common with BERT (i.e., the parameters in the transformer encoder and the embeddings for words) using the uncased version of the pretrained BERT_LARGE model.[1] The other parameters, namely, the parameters in the MEP and the embeddings for entities, were initialized randomly.

The model was trained via iterations over Wikipedia pages in a random order for seven epochs. We treated the hyperlinks as entity annotations and masked 30% of all entities at random. The input text was tokenized into words using BERT's tokenizer with its vocabulary consisting of $V_w$ = 30,000 words. Similar to Ganea and Hofmann (2017), we built an entity vocabulary consisting of $V_e$ = 128,040 entities, which were contained in the entity candidates in the datasets used in our experiments. We optimized the model by maximizing the log likelihood of MEP's predictions using Adam (Kingma and Ba, 2014). Further details are provided in Appendix A.

[1] We initialized $C_{word}$ using BERT's segment embedding for sentence A.

Name                          Train   Accuracy
Yamada et al. (2016)          ✓       91.5
Ganea and Hofmann (2017)      ✓       92.22±0.14
Yang et al. (2018)            ✓       93.0
Le and Titov (2018)           ✓       93.07±0.27
Cao et al. (2018)                     80
Fang et al. (2019)            ✓       94.3
Shahbazi et al. (2019)        ✓       93.46±0.14
Le and Titov (2019)                   89.66±0.16
Broscheit (2019)              ✓       87.9
Yang et al. (2019) (DCA-SL)   ✓       94.64±0.2
Yang et al. (2019) (DCA-RL)   ✓       93.73±0.2
Our (confidence-order)        ✓       95.04±0.24
Our (natural-order)           ✓       94.76±0.26
Our (local)                   ✓       94.49±0.22
Our (confidence-order)                92.42
Our (natural-order)                   91.68
Our (local)                           90.80

Table 1: In-KB accuracy on the CoNLL dataset.

4 Our ED Model

Algorithm 1: Algorithm of our global ED model.
  Input: words and mentions m_1, ..., m_N in the input document
  Initialize: e_i ← [MASK], i = 1 ... N
  repeat N times
    For all mentions, obtain entity predictions ê_1, ..., ê_N using Eq. (2) and Eq. (3), using the words and the entities e_1, ..., e_N as inputs
    Select the mention m_j that has the most confident prediction among all unresolved mentions
    e_j ← ê_j
  end
  return {e_1, ..., e_N}
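As a complement to Algorithm 1, the following is a minimal Python sketch of the confidence-order decoding loop. The callable predict_entities is a hypothetical stand-in for the model's per-mention prediction step (the use of Eq. (2) and Eq. (3) in Algorithm 1); it is assumed for illustration and is not part of the authors' code.

```python
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"

def global_ed(
    words: Sequence[str],
    mentions: Sequence[str],
    predict_entities: Callable[
        [Sequence[str], Sequence[str], List[str]], List[Tuple[str, float]]
    ],
) -> List[str]:
    """Confidence-order decoding sketched from Algorithm 1.

    predict_entities(words, mentions, entities) is a hypothetical function that
    returns, for every mention, a (predicted_entity, confidence) pair given the
    words and the current (partially resolved) entity assignments.
    """
    n = len(mentions)
    entities: List[str] = [MASK] * n      # Initialize: e_i <- [MASK]
    resolved = [False] * n

    for _ in range(n):                    # repeat N times
        predictions = predict_entities(words, mentions, entities)
        # Select the unresolved mention with the most confident prediction
        j = max(
            (i for i in range(n) if not resolved[i]),
            key=lambda i: predictions[i][1],
        )
        entities[j] = predictions[j][0]   # e_j <- ê_j
        resolved[j] = True

    return entities
```

Resolving mentions in descending order of model confidence lets later, harder decisions condition on the entities already fixed for easier mentions, which is the global signal the sequential formulation is meant to exploit.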