<<

Improving Entity Linking by Modeling Latent Relations between Mentions

Phong Le1 and Ivan Titov1,2 1University of Edinburgh 2University of Amsterdam {ple,ititov}@inf.ed.ac.uk

Abstract sides coreference, there are many other relations between entities which constrain or favor certain Entity linking involves aligning textual alignment configurations. For example, consider mentions of named entities to their corre- relation participant in in Figure 1: if “” sponding entries in a knowledge base. En- is aligned to the entity FIFA WORLD CUP then tity linking systems often exploit relations we expect the second “England” to refer to a foot- between textual mentions in a document ball team rather than a basketball one. (e.g., coreference) to decide if the linking NEL methods typically consider only corefer- decisions are compatible. Unlike previous ence, relying either on off-the-shelf systems or approaches, which relied on supervised some simple heuristics (Lazic et al., 2015), and systems or heuristics to predict these rela- exploit them in a pipeline fashion, though some tions, we treat relations as latent variables (e.g., Cheng and Roth (2013); Ren et al. (2017)) in our neural entity-linking model. We in- additionally exploit a range of syntactic-semantic duce the relations without any supervision relations such as apposition and possessives. An- while optimizing the entity-linking sys- other line of work ignores relations altogether and tem in an end-to-end fashion. Our multi- models the predicted sequence of KB entities as a relational model achieves the best reported bag (Globerson et al., 2016; Yamada et al., 2016; scores on the standard benchmark (AIDA- Ganea and Hofmann, 2017). Though they are able CoNLL) and substantially outperforms its to capture some degree of coherence (e.g., pref- relation-agnostic version. Its training also erence towards entities from the same general do- converges much faster, suggesting that the main) and are generally empirically successful, the injected structural bias helps to explain underlying assumption is too coarse. For example, regularities in the training data. they would favor assigning all the occurrences of “England” in Figure 1 to the same entity. 1 Introduction We hypothesize that relations useful for NEL Named entity linking (NEL) is the task of as- can be induced without (or only with little) domain signing entity mentions in a text to corresponding expertise. In order to prove this, we encode rela- entries in a knowledge base (KB). For example, tions as latent variables and induce them by opti- consider Figure 1 where a mention “World Cup” mizing the entity-linking model in an end-to-end refers to a KB entity FIFA WORLD CUP. NEL fashion. In this way, relations between mentions is often regarded as crucial for natural language in documents will be induced in such a way as to understanding and commonly used as preprocess- be beneficial for NEL. As with other recent ap- ing for tasks such as (Hoff- proaches to NEL (Yamada et al., 2017; Ganea and mann et al., 2011) and question answering (Yih Hofmann, 2017), we rely on representation learn- et al., 2015). ing and learn embeddings of mentions, contexts Potential assignments of mentions to entities are and relations. This further reduces the amount regulated by semantic and discourse constraints. of human expertise required to construct the sys- For example, the second and third occurrences of tem and, in principle, may make it more portable mention “England” in Figure 1 are coreferent and across languages and domains. thus should be assigned to the same entity. Be- Our multi-relational neural model achieves an

1595 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 1595–1604 Melbourne, Australia, July 15 - 20, 2018. c 2018 Association for Computational Linguistics England England West_Germany England_national_ England_national Germany_national_ FIFA_World_Cup football_team _football_team football_team FIBA_Basketball_ England_national_ England_national Germany_national_ World_Cup basketball_team _basketball_team basketball_team ......

World Cup 1966 was held in England …. England won… The final saw England beat West Germany .

located_in coreference beat participant_in

Figure 1: Example for NEL, linking each mention to an entity in a KB (e.g. “World Cup” to FIFA WORLD CUP rather than FIBA BASKETBALL WORLD CUP). Note that the first and the sec- ond “England” are in different relations to “World Cup”. improvement of 0.85% F1 over the best re- 2.2 Local and global models ported scores on the standard AIDA-CoNLL Local models rely only on local contexts of men- dataset (Ganea and Hofmann, 2017). Substan- tions and completely ignore interdependencies be- tial improvements over the relation-agnostic ver- tween the linking decisions in the document (these sion show that the induced relations are indeed interdependencies are usually referred to as coher- beneficial for NEL. Surprisingly its training also ence). Let ci be a local context of mention mi and converges much faster: training of the full model Ψ(ei, ci) be a local score function. A local model requires ten times shorter wall-clock time than then tackles the problem by searching for what is needed for estimating the simpler relation- ∗ agnostic version. This may suggest that the in- ei = arg max Ψ(ei, ci) (1) ∈ jected structural bias helps to explain regularities ei Ci in the training data, making the optimization task for each i ∈ {1, ..., n} (Bunescu and Pas¸ca, 2006; easier. We qualitatively examine induced rela- Lazic et al., 2015; Yamada et al., 2017). tions. Though we do not observe direct counter- A global model, besides using local context parts of linguistic relations, we, for example, see within Ψ(ei, ci), takes into account entity co- that some of the induced relations are closely re- herency. It is captured by a coherence score func- lated to coreference whereas others encode forms tion Φ(E,D): of semantic relatedness between the mentions. ∑n ∗ E = arg max Ψ(ei, ci) + Φ(E,D) ∈ × × E C1 ... Cn i=1 2 Background and Related work where E = (e1, ..., en). The coherence score function, in the simplest form, is a sum over 2.1 Named entity linking problem all pairwise scores Φ(ei, ej,D) (Ratinov et al., 2011; Huang et al., 2015; Chisholm and Hachey, Formally, given a document D containing a list of 2015; Ganea et al., 2016; Guo and Barbosa, 2016; mentions m , ..., m , an entity linker assigns to 1 n Globerson et al., 2016; Yamada et al., 2016), re- each m an KB entity e or predicts that there is no i i sulting in: corresponding entry in the KB (i.e., ei = NILL). ∑n Because a KB can be very large, it is stan- ∗ E = arg max Ψ(ei, ci)+ dard to use an heuristic to choose potential can- E∈C ×...×Cn 1 ∑i=1 didates, eliminating options which are highly un- Φ(e , e ,D) (2) likely. This preprocessing step is called candidate i j i≠ j selection. The task of a statistical model is thus re- duced to choosing the best option among a smaller A disadvantage of global models is that exact list of candidates Ci = (ei1, ..., eili ). In what fol- decoding (Equation 2) is NP-hard (Wainwright lows, we will discuss two classes of approaches et al., 2008). Ganea and Hofmann (2017) over- tackling this problem: local and global modeling. come this using loopy belief propagation (LBP),

1596 an approximate inference method based on mes- At the other extreme, feature engineering is al- sage passing (Murphy et al., 1999). Globerson most completely replaced by representation learn- et al. (2016) propose a star model which approxi- ing. These approaches rely on pretrained embed- mates the decoding problem in Equation 2 by ap- dings of words (Mikolov et al., 2013; Penning- proximately decomposing it into n decoding prob- ton et al., 2014) and entities (He et al., 2013; Ya- lems, one per each ei. mada et al., 2017; Ganea and Hofmann, 2017) and often do not use virtually any other hand-crafted 2.3 Related work features. Ganea and Hofmann (2017) showed Our work focuses on modeling pairwise score that such an approach can yield SOTA accuracy functions Φ and is related to previous approaches on a standard benchmark (AIDA-CoNLL dataset). in the two following aspects. Their local and pairwise score functions are

T Relations between mentions Ψ(ei, ci) = ei Bf(ci) A relation widely used by NEL systems is corefer- 1 T Φ(ei, ej,D) = − ei Rej (3) ence: two mentions are coreferent if they refer to n 1 the same entity. Though, as we discussed in Sec- e , e ∈ Rd tion 1, other linguistic relations constrain entity as- where i j are the embeddings of entity e , e B, R ∈ Rd×d signments, only a few approaches (e.g., Cheng and i j, are diagonal matrices. The f(c ) Roth (2013); Ren et al. (2017)), exploit any rela- mapping i applies an attention mechanism to c tions other than coreference. We believe that the context words in i to obtain a feature representa- f(c ) ∈ Rd) reason for this is that predicting and selecting rel- tions of context ( i . evant (often semantic) relations is in itself a chal- Note that the global component (the pairwise lenging problem. scores) is agnostic to any relations between enti- ties or even to their ordering: it models e , ..., e In Cheng and Roth (2013), relations between 1 n simply as a bag of entities. Our work is in line with mentions are extracted using a labor-intensive ap- Ganea and Hofmann (2017) in the sense that fea- proach, requiring a set of hand-crafted rules and a ture engineering plays no role in computing local KB containing relations between entities. This ap- and pair-wise scores. Furthermore, we argue that proach is difficult to generalize to languages and pair-wise scores should take into account relations domains which do not have such KBs or the set- between mentions which are represented by rela- tings where no experts are available to design the tion embeddings. rules. We, in contrast, focus on automating the process using representation learning. 3 Multi-relational models Most of these methods relied on relations pre- dicted by external tools, usually a coreference sys- 3.1 General form tem. One notable exception is Durrett and Klein (2014): they use a joint model of entity linking and We assume that there are K latent relations. Each coreference resolution. Nevertheless their corefer- relation k is assigned to a mention pair (mi, mj) ence component is still supervised, whereas our with a non-negative weight (‘confidence’) αijk. relations are latent even at training time. The pairwise score (mi, mj) is computed as a weighted sum of relation-specific pairwise scores Representation learning (see Figure 2, top): Ψ How can we define local score functions and ∑K pairwise score functions Φ? Previous approaches Φ(ei, ej,D) = αijkΦk(ei, ej,D) employ a wide spectrum of techniques. k=1 At one extreme, extensive feature engineering was used to define useful features. For example, Φk(ei, ej,D) can be any pairwise score func- Ratinov et al. (2011) use cosine similarities be- tion, but here we adopt the one from Equation 3. tween titles and local contexts as a fea- Namely, we represent each relation k by a diago- d×d ture when computing the local scores. For pair- nal matrix Rk ∈ R , and wise scores they exploit information about links T between Wikipedia pages. Φk(ei, ej,D) = ei Rkej 1597 The weights αijk are normalized scores: 3.2 Rel-norm: Relation-wise normalization

For rel-norm, coefficients αijk are normalized { } over relations k, in other words, T 1 f (mi, ci)√Dkf(mj, cj) { } αijk = exp ∑K T Z f (mi, ci)D ′ f(mj, cj) ijk d Z = exp √ k (4) ijk ′ d where Z is a normalization factor, f(m , c ) is k =1 ijk i i ∑ (m , c ) Rd D ∈ K a function mapping i i onto , and k so that αijk = 1 (see Figure 2, middle). We d×d k=1 R is a diagonal matrix. can also re-write the pairwise scores as

T Φ(ei, ej,D) = ei Rijej (5) (general form) α Φ (e ,e ,D) ij1 1 i j ∑ K where Rij = k=1 αijkRk.

αij2Φ2(ei,ej,D) tanh, dropout ei,mi,ci ej,mj,cj

αij3Φ3(ei,ej,D)

(rel-norm)

ei,mi,ci ej,mj,cj In foreign policy Bill Clinton ordered U.S. military

Figure 3: Function f(m , c ) is a single-layer neu- normalize over relations: αij1 + αij2 + αij3 = 1 i i ral network, with tanh activation function and a (ment-norm) e1,m1,c1 layer of dropout on top. ... Intuitively, αijk is the probability of assigning a ei,mi,ci ej,mj,cj ... k-th relation to a mention pair (mi, mj). For ev- ery pair rel-norm uses these probabilities to choose e ,m ,c n n n one relation from the pool and relies on the corre- normalize over mentions: R α + α + … + α + … + α = 1 sponding relation embedding k to compute the i12 i22 ij2 in2 compatibility score. For K = 1 rel-norm reduces (up to a scal- Figure 2: Multi-relational models: general form ing factor) to the bag-of-entities model defined in (top), rel-norm (middle) and ment-norm (bottom). Equation 3. Each color corresponds to one relation. In principle, instead of relying on the linear combination of relation embeddings matrices Rk, we could directly predict a context-specific rela- In our experiments, we use a single-layer neural tion embedding Rij = diag{g(mi, ci, mj, cj)} network as f (see Figure 3) where ci is a concate- where g is a neural network. However, in prelim- nation of the average embedding of words in the inary experiments we observed that this resulted left context with the average embedding of words 1 in overfitting and poor performance. Instead, we in the right context of the mention. choose to use a small fixed number of relations as As αijk is indexed both by mention index j a way to constrain the model and improve gener- and relation index k, we have two choices for alization. Z : normalization over relations and normaliza- ijk 3.3 Ment-norm: Mention-wise normalization tion over mentions. We consider both versions of the model. We can also normalize αijk over j: { } ∑n T f (mi, ci)Dkf(mj′ , cj′ ) Zijk = exp √ 1 d We also experimented with LSTMs but we could not pre- j′=1 vent them from severely overfitting, and the results were poor. j′≠ i 1598 ∑ n This implies that j=1,j≠ i αijk = 1 (see Figure 2, where some relations are inapplicable. For ex- bottom). If we rewrite the pairwise scores as ample, consider applying relation coreference to mention “West Germany” in Figure 1. The men- ∑K tion is non-anaphoric: there are no mentions Φ(e , e ,D) = α eT R e , (6) i j ijk i k j co-referent with it. Still the ment-norm model k=1 has to distribute the weight across the mentions. we can see that Equation 3 is a special case of ∑This problem occurs because of the normalization n ment-norm when K = 1 and D1 = 0. In other j=1,j≠ i αijk = 1. Note that this issue does words, Ganea and Hofmann (2017) is our mono- not affect standard applications of attention: nor- relational ment-norm with uniform α. mally the attention-weighted signal is input to an- The intuition behind ment-norm is that for each other transformation (e.g., a flexible neural model) relation k and mention mi, we are looking for which can then disregard this signal when it is use- mentions related to mi with relation k. For each less. This is not possible within our model, as it pair of mi and mj we can distinguish two cases: simply uses αijk to weight the bilinear terms with- (i) αijk is small for all k: mi and mj are not re- out any extra transformation. lated under any relation, (ii) αijk is large for one Luckily, there is an easy way to circumvent this or more k: there are one or more relations which problem. We add to each document a padding are predicted for mi and mj. mention mpad linked to a padding entity epad. In In principle, rel-norm can also indirectly handle this way, the model can use the padding mention both these cases. For example, it can master (i) by to damp the probability mass that the other men- dedicating a distinct ‘none’ relation to represent tions receive. This method is similar to the way lack of relation between the two mentions (with some mention-ranking coreference models deal the corresponding matrix Rk set to 0). Though with non-anaphoric mentions (e.g. Wiseman et al. it cannot assign large weights (i.e., close to 1) to (2015)). multiple relations (as needed for (ii)), it can in 3.4 Implementation principle use the ‘none’ relation to vary the proba- bility mass assigned to the rest of relations across Following Ganea and Hofmann (2017) we use mention pairs, thus achieving the same effect (up Equation 2 to define a conditional random field to a multiplicative factor). Nevertheless, in con- (CRF). We use the local score function identical trast to ment-norm, we do not observe this behav- to theirs and the pairwise scores are defined as ex- ior for rel-norm in our experiments: the inductive plained above: basis seems to disfavor such configurations.     Ment-norm is in line with the current trend ∑n ∑ q(E|D) ∝ exp Ψ(e , c ) + Φ(e , e ,D) of using the attention mechanism in deep learn-  i i i j  ̸ ing (Bahdanau et al., 2014), and especially related i=1 i=j to multi-head attention of Vaswani et al. (2017). We also use max-product loopy belief propagation For each mention mi and for each k, we can inter- (LBP) to estimate the max-marginal probability pret αijk as the probability of choosing a mention mj among the set of mentions in the document. qˆi(ei|D) ≈ max q(E|D) e1,...,ei−1 Because here we have K relations, each mention ei+1,...,en mi will have maximally K mentions (i.e. heads in terminology of Vaswani et al. (2017)) to focus for each mention mi. The final score function for on. Note though that they use multi-head attention mi is given by: for choosing input features in each layer, whereas ρ (e) = g(ˆq (e|D), pˆ(e|m )) we rely on this mechanism to compute pairwise i i i scoring functions for the structured output (i.e. to where g is a two-layer neural network and pˆ(e|mi) compute potential functions in the corresponding is the probability of selecting e conditioned only undirected graphical model, see Section 3.4). on mi. This probability is computed by mix- Mention padding ing mention-entity hyperlink count statistics from Wikipedia, a large Web corpus and YAGO.2 A potentially serious drawback of ment-norm is that the model uses all K relations even in cases 2See Ganea and Hofmann (2017, Section 6).

1599 We minimize the following ranking loss: Datasets ∑ ∑ ∑ For in-domain scenario, we used AIDA-CoNLL L(θ) = h(m , e) (7) i dataset3 (Hoffart et al., 2011). This dataset D∈D m ∈D e∈C ( i i ) contains AIDA-train for training, AIDA-A for − ∗ h(mi, e) = max 0, γ ρi(ei ) + ρi(e) dev, and AIDA-B for testing, having respectively 946, 216, and 231 documents. For out-domain where θ are the model parameters, D is a training ∗ scenario, we evaluated the models trained on dataset, and e is the ground-truth entity. Adam i AIDA-train, on five popular test sets: MSNBC, (Kingma and Ba, 2014) is used as an optimizer. AQUAINT, ACE2004, which were cleaned and For ment-norm, the padding mention is treated updated by Guo and Barbosa (2016); WNED- like any other mentions. We add f = pad CWEB (CWEB), WNED-WIKI (WIKI), which f(m , c ) and e ∈ Rd, an embedding of pad pad pad were automatically extracted from ClueWeb and e , to the model parameter list, and tune them pad Wikipedia (Guo and Barbosa, 2016; Gabrilovich while training the model. et al., 2013). The first three are small with 20, 50, In order to encourage the models to explore dif- and 36 documents whereas the last two are much ferent relations, we add the following regulariza- larger with 320 documents each. Following previ- tion term to the loss function in Equation 7: ∑ ∑ ous works (Yamada et al., 2016; Ganea and Hof- mann, 2017), we considered only mentions that λ1 dist(Ri, Rj) + λ2 dist(Di, Dj) i,j i,j have entities in the KB (i.e., Wikipedia).

−7 Candidate selection where λ1, λ2 are set to −10 in our experiments, dist(x, y) can be any distance metric. We use: For each mention mi, we selected 30 top candi- dates using pˆ(e|mi). We then kept 4 candidates x y − with the highest pˆ(e|m ) and 3 candidates with the dist(x, y) = ∥ ∥ ∥ ∥ (∑ i ) x 2 y 2 2 highest scores eT w , where e, w ∈ Rd w∈di Using this regularization to favor diversity is are entity and word embeddings, di is the 50-word important as otherwise relations tend to collapse: window context around mi. their relation embeddings Rk end up being very Hyper-parameter setting similar to each other. We set d = 300 and used GloVe (Pennington et al., 4 Experiments 2014) word embeddings trained on 840B tokens for computing f in Equation 4, and entity embed- We evaluated four models: (i) rel-norm proposed dings from Ganea and Hofmann (2017).4 We use in Section 3.2; (ii) ment-norm proposed in Sec- the following parameter values: γ = 0.01 (see K = 1 tion 3.3; (iii) ment-norm ( ): the mono- Equation 7), the number of LBP loops is 10, the relational version of ment-norm; and (iv) ment- dropout rate for f was set to 0.3, the window size norm (no pad): the ment-norm without using men- of local contexts ci (for the pairwise score func- tion padding. Recall also that our mono-relational tions) is 6. For rel-norm, we initialized diag(Rk) (i.e. K = 1) rel-norm is equivalent to the relation- and diag(Dk) by sampling from N (0, 0.1) for all agnostic baseline of Ganea and Hofmann (2017). k. For ment-norm, we did the same except that We implemented our models in PyTorch and diag(R1) was sampled from N (1, 0.1). run experiments on a Titan X GPU. The source To select the best number of relations K, we code and trained models will be publicly avail- considered all values of K ≤ 7 (K > 7 would not able at https://github.com/lephong/ fit in our GPU memory, as some of the documents mulrel-nel. are large). We selected the best ones based on the 4.1 Setup development scores: 6 for rel-norm, 3 for ment- norm, and 3 for ment-norm (no pad). We set up our experiments similarly to those of When training the models, we applied early Ganea and Hofmann (2017), run each model 5 stopping. For rel-norm, when the model reached times, and report average and 95% confidence in- terval of the standard micro F1 score (aggregates 3TAC KBP datasets are no longer available. over all mentions). 4https://github.com/dalab/deep-ed

1600 91% F1 on the dev set, 5 we reduced the learning 4.3 Analysis −4 −5 rate from 10 to 10 . We then stopped the train- Mono-relational v.s. multi-relational ing when F1 was not improved after 20 epochs. For rel-norm, the mono-relational version (i.e., We did the same for ment-norm except that the Ganea and Hofmann (2017)) is outperformed learning rate was changed at 91.5% F1. by the multi-relational one on AIDA-CoNLL, Note that all the hyper-parameters except K and but performs significantly better on all five out- the turning point for early stopping were set to the domain datasets. This implies that multi-relational values used by Ganea and Hofmann (2017). Sys- rel-norm does not generalize well across domains. tematic tuning is expensive though may have fur- For ment-norm, the mono-relational version ther increased the result of our models. performs worse than the multi-relational one on all 4.2 Results test sets except AQUAINT. We speculate that this is due to multi-relational ment-norm being less Methods Aida-B sensitive to prediction errors. Since it can rely on Chisholm and Hachey (2015) 88.7 Guo and Barbosa (2016) 89.0 multiple factors more easily, a single mistake in Globerson et al. (2016) 91.0 assignment is unlikely to have large influence on Yamada et al. (2016) 91.5 its predictions. Ganea and Hofmann (2017) 92.22  0.14  rel-norm 92.41 0.19 Oracle ment-norm 93.07  0.27 ment-norm (K = 1) 92.89  0.21 ment-norm (no pad) 92.37  0.26 94.5 LBP oracle Table 1: F1 scores on AIDA-B (test set). 94

93.5 Table 1 shows micro F1 scores on AIDA-B of the SOTA methods and ours, which all use 93 Wikipedia and YAGO mention-entity index. To 92.5 our knowledge, ours are the only (unsupervis- edly) inducing and employing more than one re- 92 G&H rel-norm ment-norm ment-norm lations on this dataset. The others use only one (K=1) relation, coreference, which is given by simple heuristics or supervised third-party resolvers. All Figure 4: F1 on AIDA-B when using LBP and the four our models outperform any previous method, oracle. G&H is Ganea and Hofmann (2017). with ment-norm achieving the best results, 0.85% higher than that of Ganea and Hofmann (2017). In order to examine learned relations in a more Table 2 shows micro F1 scores on 5 out-domain transparant setting, we consider an idealistic sce- test sets. Besides ours, only Cheng and Roth nario where imperfection of LBP, as well as mis- (2013) employs several mention relations. Ment- takes in predicting other entities, are taken out of norm achieves the highest F1 scores on MSNBC the equation using an oracle. This oracle, when and ACE2004. On average, ment-norm’s F1 score we make a prediction for mention mi, will tell ∗ is 0.3% higher than that of Ganea and Hofmann us the correct entity ej for every other mentions (2017), but 0.2% lower than Guo and Barbosa mj, j ≠ i. We also used AIDA-A (development (2016)’s. It is worth noting that Guo and Barbosa set) for selecting the numbers of relations for rel- (2016) performs exceptionally well on WIKI, but norm and ment-norm. They are set to 6 and 3, substantially worse than ment-norm on all other respectively. Figure 4 shows the micro F1 scores. datasets. Our other three models, however, have Surprisingly, the performance of oracle rel- lower average F1 scores compared to the best pre- norm is close to that of oracle ment-norm, al- vious model. though without using the oracle the difference The experimental results show that ment-norm was substantial. This suggests that rel-norm is outperforms rel-norm, and that mention padding more sensitive to prediction errors than ment- plays an important role. norm. Ganea and Hofmann (2017), even with the 5We chose the highest F1 that rel-norm always achieved help of the oracle, can only perform slightly bet- without the learning rate reduction. ter than LBP (i.e. non-oracle) ment-norm. This

1601 Methods MSNBC AQUAINT ACE2004 CWEB WIKI Avg Milne and Witten (2008) 78 85 81 64.1 81.7 77.96 Hoffart et al. (2011) 79 56 80 58.6 63 67.32 Ratinov et al. (2011) 75 83 82 56.2 67.2 72.68 Cheng and Roth (2013) 90 90 86 67.5 73.4 81.38 Guo and Barbosa (2016) 92 87 88 77 84.5 85.7 Ganea and Hofmann (2017) 93.7  0.1 88.5  0.4 88.5  0.3 77.9  0.1 77.5  0.1 85.22 rel-norm 92.2  0.3 86.7  0.7 87.9  0.3 75.2  0.5 76.4  0.3 83.67 ment-norm 93.9  0.2 88.3  0.6 89.9  0.8 77.5  0.1 78.0  0.1 85.51 ment-norm (K = 1) 93.2  0.3 88.4  0.4 88.9  1.0 77.0  0.2 77.2  0.1 84.94 ment-norm (no pad) 93.6  0.3 87.8  0.5 90.0  0.3 77.0  0.2 77.3  0.3 85.13

Table 2: F1 scores on five out-domain test sets. Underlined scores show cases where the corresponding model outperforms the baseline. suggests that its global coherence scoring com- Complexity ponent is indeed too simplistic. Also note that The complexity of rel-norm and ment-norm is lin- both multi-relational oracle models substantially ear in K, so in principle our models should be outperform the two mono-relational oracle mod- considerably more expensive than Ganea and Hof- els. This shows the benefit of using more than one mann (2017). However, our models converge relations, and the potential of achieving higher ac- much faster than their relation-agnostic model: curacy with more accurate inference methods. on average ours needs 120 epochs, compared to theirs 1250 epochs. We believe that the structural Relations bias helps the model to capture necessary regu- In this section we qualitatively examine relations larities more easily. In terms of wall-clock time, that the models learned by looking at the prob- our model requires just under 1.5 hours to train, abilities αijk. See Figure 5 for an example. In that is ten times faster than the relation agnostic that example we focus on mention “Liege” in the model (Ganea and Hofmann, 2017). In addition, sentence at the top and study which mentions are the difference in testing time is negligible when related to it under two versions of our model: using a GPU. rel-norm (leftmost column) and ment-norm (right- most column). 5 Conclusion and Future work For rel-norm it is difficult to interpret the mean- We have shown the benefits of using relations in ing of the relations. It seems that the first relation NEL. Our models consider relations as latent vari- dominates the other two, with very high weights ables, thus do not require any extra supervision. for most of the mentions. Nevertheless, the fact Representation learning was used to learn rela- that rel-norm outperforms the baseline suggests tion embeddings, eliminating the need for exten- that those learned relations encode some useful in- sive feature engineering. The experimental results formation. show that our best model achieves the best re- For ment-norm, the first relation is similar to ported F1 on AIDA-CoNLL with an improvement coreference: the relation prefers those mentions of 0.85% F1 over the best previous results. that potentially refer to the same entity (and/or Conceptually, modeling multiple relations is have semantically similar mentions): see Figure substantially different from simply modeling co- 5 (left, third column). The second and third rela- herence (as in Ganea and Hofmann (2017)). In tions behave differently from the first relation as this way we also hope it will lead to interest- they prefer mentions having more distant mean- ing follow-up work, as individual relations can be ings and are complementary to the first relation. informed by injecting prior knowledge (e.g., by They assign large weights to (1) “Belgium” and training jointly with relation extraction models). (2) “Brussels” but small weights to (4) and (6) In future work, we would like to use syntac- “Liege”. The two relations look similar in this tic and discourse structures (e.g., syntactic depen- example, however they are not identical in gen- dency paths between mentions) to encourage the eral. See a histogram of bucketed values of their models to discover a richer set of relations. We weights in Figure 5 (right): their α have quite dif- also would like to combine ment-norm and rel- ferent distributions. norm. Besides, we would like to examine whether

1602 60

α ,2 rel-norm on Friday , Liege police said in ment-norm • α ,3 (1) missing teenagers in Belgium . 50 • (2) UNK BRUSSELS UNK (3) UNK Belgian police said on 40 (4) , ” a Liege police official told (5) police official told Reuters . 30 (6) eastern town of Liege on Thursday , (7) home village of UNK . 20 Marc Dutroux (8) link with the case , the 10 (9) which has rocked Belgium in the past

0 0.25 0.30 0.35 0.40 0.45 0.50 0.55 α

Figure 5: (Left) Examples of α. The first and third columns show αijk for oracle rel-norm and oracle ment-norm, respectively. (Right) Histograms of α•k for k = 2, 3, corresponding to the second and third relations from oracle ment-norm. Only α > 0.25 (i.e. high attentions) are shown.

the induced latent relations could be helpful for re- Octavian-Eugen Ganea, Marina Ganea, Aurelien Luc- lation extract. chi, Carsten Eickhoff, and Thomas Hofmann. 2016. Probabilistic bag-of-hyperlinks model for entity linking. In Proceedings of the 25th International Acknowledgments Conference on World Wide Web, pages 927–938. In- ternational World Wide Web Conferences Steering We would like to thank anonymous reviewers for Committee. their suggestions and comments. The project was supported by the European Research Council Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural (ERC StG BroadSem 678254), the Dutch National attention. In Proceedings of the 2017 Conference Science Foundation (NWO VIDI 639.022.518), on Empirical Methods in Natural Language Pro- and an Amazon Web Services (AWS) grant. cessing, pages 2609–2619. Association for Compu- tational Linguistics. Amir Globerson, Nevena Lazic, Soumen Chakrabarti, References Amarnag Subramanya, Michael Ringaard, and Fer- nando Pereira. 2016. Collective entity resolution Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- with multi-focal attention. In Proceedings of the gio. 2014. Neural machine translation by jointly 54th Annual Meeting of the Association for Compu- learning to align and translate. arXiv preprint tational Linguistics (Volume 1: Long Papers), pages arXiv:1409.0473. 621–631. Association for Computational Linguis- tics. Razvan Bunescu and Marius Pas¸ca. 2006. Using en- cyclopedic knowledge for named entity disambigua- Zhaochen Guo and Denilson Barbosa. 2016. Robust tion. In 11th Conference of the European Chapter of named entity disambiguation with random walks. the Association for Computational Linguistics. Semantic Web, (Preprint).

Xiao Cheng and Dan Roth. 2013. Relational inference Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai for wikification. In Proceedings of the 2013 Con- Zhang, and Houfeng Wang. 2013. Learning entity ference on Empirical Methods in Natural Language representation for entity disambiguation. In Pro- Processing, pages 1787–1796, Seattle, Washington, ceedings of the 51st Annual Meeting of the Associa- USA. Association for Computational Linguistics. tion for Computational Linguistics (Volume 2: Short Papers), pages 30–34, Sofia, Bulgaria. Association Andrew Chisholm and Ben Hachey. 2015. Entity dis- for Computational Linguistics. ambiguation with web links. Transactions of the As- Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bor- sociation of Computational Linguistics, 3:145–156. dino, Hagen Furstenau,¨ Manfred Pinkal, Marc Span- iol, Bilyana Taneva, Stefan Thater, and Gerhard Greg Durrett and Dan Klein. 2014. A joint model for Weikum. 2011. Robust disambiguation of named entity analysis: Coreference, typing, and linking. entities in text. In Proceedings of the 2011 Con- Transactions of the Association for Computational ference on Empirical Methods in Natural Language Linguistics, 2:477–490. Processing, pages 782–792. Association for Com- putational Linguistics. Evgeniy Gabrilovich, Michael Ringgaard, and Amar- nag Subramanya. 2013. Facc1: Freebase annotation Raphael Hoffmann, Congle Zhang, Xiao Ling, of clueweb corpora. Luke Zettlemoyer, and Daniel S. Weld. 2011.

1603 Knowledge-based weak supervision for information H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar- extraction of overlapping relations. In Proceedings nett, editors, Advances in Neural Information Pro- of the 49th Annual Meeting of the Association for cessing Systems 30, pages 6000–6010. Curran As- Computational Linguistics: Human Language Tech- sociates, Inc. nologies, pages 541–550, Portland, Oregon, USA. Association for Computational Linguistics. Martin J Wainwright, Michael I Jordan, et al. 2008. Graphical models, exponential families, and varia- Hongzhao Huang, Larry Heck, and Heng Ji. 2015. tional inference. Foundations and Trends⃝R in Ma- Leveraging deep neural networks and knowledge chine Learning, 1(1–2):1–305. graphs for entity disambiguation. arXiv preprint arXiv:1504.07678. Sam Wiseman, Alexander M. Rush, Stuart Shieber, and Jason Weston. 2015. Learning anaphoricity and an- Diederik Kingma and Jimmy Ba. 2014. Adam: A tecedent ranking features for coreference resolution. method for stochastic optimization. arXiv preprint In Proceedings of the 53rd Annual Meeting of the arXiv:1412.6980. Association for Computational Linguistics and the 7th International Joint Conference on Natural Lan- Nevena Lazic, Amarnag Subramanya, Michael Ring- guage Processing (Volume 1: Long Papers), pages gaard, and Fernando Pereira. 2015. Plato: A selec- 1416–1426. Association for Computational Linguis- tive context model for entity resolution. Transac- tics. tions of the Association for Computational Linguis- tics, 3:503–515. Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the em- Tomas Mikolov, Kai Chen, Greg Corrado, and Jef- bedding of words and entities for named entity dis- frey Dean. 2013. Efficient estimation of word ambiguation. In Proceedings of The 20th SIGNLL representations in vector space. arXiv preprint Conference on Computational Natural Language arXiv:1301.3781. Learning, pages 250–259. Association for Compu- tational Linguistics. David Milne and Ian H Witten. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and conference on Information and knowledge manage- Yoshiyasu Takefuji. 2017. Learning distributed rep- ment, pages 509–518. ACM. resentations of texts and entities from knowledge base. arXiv preprint arXiv:1705.02494. Kevin P Murphy, Yair Weiss, and Michael I Jordan. 1999. Loopy belief propagation for approximate in- Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and ference: An empirical study. In Proceedings of the Jianfeng Gao. 2015. Semantic parsing via staged Fifteenth conference on Uncertainty in artificial in- query graph generation: Question answering with telligence, pages 467–475. Morgan Kaufmann Pub- knowledge base. In Proceedings of the 53rd Annual lishers Inc. Meeting of the Association for Computational Lin- Jeffrey Pennington, Richard Socher, and Christopher guistics and the 7th International Joint Conference Manning. 2014. Glove: Global vectors for word on Natural Language Processing (Volume 1: Long representation. In Proceedings of the 2014 Con- Papers), pages 1321–1331, Beijing, China. Associ- ference on Empirical Methods in Natural Language ation for Computational Linguistics. Processing (EMNLP), pages 1532–1543. Associa- tion for Computational Linguistics. Lev Ratinov, Dan Roth, Doug Downey, and Mike An- derson. 2011. Local and global algorithms for dis- ambiguation to wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Com- putational Linguistics: Human Language Technolo- gies, pages 1375–1384. Association for Computa- tional Linguistics. Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, Tarek F Abdelzaher, and Jiawei Han. 2017. Cotype: Joint extraction of typed entities and relations with knowledge bases. In Proceedings of the 26th International Conference on World Wide Web, pages 1015–1024. International World Wide Web Conferences Steering Committee. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio,

1604