Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks

Matthew Francis-Landau, Greg Durrett and Dan Klein
Computer Science Division
University of California, Berkeley
{mfl,gdurrett,klein}@cs.berkeley.edu

Abstract

A key challenge in entity linking is making effective use of contextual information to disambiguate mentions that might refer to different entities in different contexts. We present a model that uses convolutional neural networks to capture semantic correspondence between a mention's context and a proposed target entity. These convolutional networks operate at multiple granularities to exploit various kinds of topic information, and their rich parameterization gives them the capacity to learn which n-grams characterize different topics. We combine these networks with a sparse linear model to achieve state-of-the-art performance on multiple entity linking datasets, outperforming the prior systems of Durrett and Klein (2014) and Nguyen et al. (2014).1

1 Source available at github.com/matthewfl/nlp-entity-convnet

1 Introduction

One of the major challenges of entity linking is resolving contextually polysemous mentions. For example, Germany may refer to a nation, to that nation's government, or even to a soccer team. Past approaches to such cases have often focused on collective entity linking: nearby mentions in a document might be expected to link to topically-similar entities, which can give us clues about the identity of the mention currently being resolved (Ratinov et al., 2011; Hoffart et al., 2011; He et al., 2013; Cheng and Roth, 2013; Durrett and Klein, 2014). But an even simpler approach is to use context information from just the words in the source document itself to make sure the entity is being resolved sensibly in context. In past work, these approaches have typically relied on heuristics such as tf-idf (Ratinov et al., 2011), but such heuristics are hard to calibrate and they capture structure in a coarser way than learning-based methods.

In this work, we model semantic similarity between a mention's source document context and its potential entity targets using convolutional neural networks (CNNs). CNNs have been shown to be effective for sentence classification tasks (Kalchbrenner et al., 2014; Kim, 2014; Iyyer et al., 2015) and for capturing similarity in models for entity linking (Sun et al., 2015) and other related tasks (Dong et al., 2015; Shen et al., 2014), so we expect them to be effective at isolating the relevant topic semantics for entity linking. We show that convolutions over multiple granularities of the input document are useful for providing different notions of semantic context. Finally, we show how to integrate these networks with a preexisting entity linking system (Durrett and Klein, 2014). Through a combination of these two distinct methods into a single system that leverages their complementary strengths, we achieve state-of-the-art performance across several datasets.

2 Model

Our model focuses on two core ideas: first, that topic semantics at different granularities in a document are helpful in determining the genres of entities for entity linking, and second, that CNNs can distill a block of text into a meaningful topic vector.

Our entity linking model is a log-linear model that places distributions over target entities t given a mention x and its containing source document. For now, we take P(t | x) ∝ exp(w^T f_C(x, t; θ)), where f_C produces a vector of features based on CNNs with parameters θ as discussed in Section 2.1. Section 2.2 describes how we combine this simple model with a full-fledged entity linking system.
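The CNN-only scoring rule above is just a softmax over each mention's candidate entities. A minimal sketch, with made-up candidates and feature values rather than the paper's learned features:

```python
import numpy as np

def link_probabilities(f_C, w):
    """P(t | x) proportional to exp(w^T f_C(x, t)): a softmax over the
    candidate entities for one mention.

    f_C: (num_candidates, num_features) array of CNN similarity features
    w:   (num_features,) weight vector
    """
    scores = f_C @ w
    scores = scores - scores.max()   # stabilize the exponentials
    exp = np.exp(scores)
    return exp / exp.sum()

# Two hypothetical candidates for one mention, six similarity features each.
f_C = np.array([[0.9, 0.8, 0.7, 0.6, 0.9, 0.8],   # topically compatible entity
                [0.1, 0.2, 0.1, 0.3, 0.2, 0.1]])  # incompatible entity
w = np.ones(6)
p = link_probabilities(f_C, w)
# The topically compatible candidate receives most of the probability mass.
```
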

Figure 1: Extraction of convolutional vector space features f_C(x, t_e). Three types of information from the input document and two types of information from the proposed title are fed through convolutional networks to produce vectors, which are systematically compared with cosine similarity to derive real-valued semantic similarity features.

As shown in the middle of Figure 1, each feature in f_C is a cosine similarity between a topic vector associated with the source document and a topic vector associated with the target entity. These vectors are computed by distinct CNNs operating over different subsets of relevant text.

Figure 1 shows an example of why different kinds of context are important for entity linking. In this case, we are considering whether Pink Floyd might link to the article Gavin Floyd on Wikipedia (imagine that Pink Floyd might be a person's nickname). If we look at the source document, we see that the immediate source document context around the mention Pink Floyd is referring to rock groups (Led Zeppelin, Van Halen) and the target entity's Wikipedia page is primarily about sports (baseball starting pitcher). Distilling these texts into succinct topic descriptors and then comparing those helps tell us that this is an improbable entity link pair. In this case, the broader source document context actually does not help very much, since it contains other generic last names like Campbell and Savage that might not necessarily indicate the document to be in the music genre. However, in general, the whole document might provide a more robust topic estimate than a small context window does.

2.1 Convolutional Semantic Similarity

Figure 1 shows our method for computing topic vectors and using those to extract features for a potential Wikipedia link. For each of three text granularities in the source document (the mention, that mention's immediate context, and the entire document) and two text granularities on the target entity side (title and article text), we produce vector representations with CNNs as follows. We first embed each word into a d-dimensional vector space using standard embedding techniques (discussed in Section 3.2), yielding a sequence of vectors w_1, ..., w_n. We then map those words into a fixed-size vector using a convolutional network parameterized with a filter bank M_g ∈ R^{k×dℓ}. We put the result through a rectified linear unit (ReLU) and combine the results with sum pooling, giving the following formulation:

    conv_g(w_{1:n}) = Σ_{j=1}^{n−ℓ} max{0, M_g w_{j:j+ℓ}}    (1)

where w_{j:j+ℓ} is a concatenation of the given word vectors and the max is element-wise.2 Each convolution granularity (mention, context, etc.) has a distinct set of filter parameters M_g.

This process produces multiple representative topic vectors s_ment, s_context, and s_doc for the source document and t_title and t_doc for the target entity, as shown in Figure 1. All pairs of these vectors between the source and the target are then compared using cosine similarity, as shown in the middle of Figure 1. This yields the vector of features f_C(s, t_e) which indicates the different types of similarity; this vector can then be combined with other sparse features and fed into a final logistic regression layer (maintaining end-to-end inference and learning of the filters). When trained with backpropagation, the convolutional networks should learn to map text into vector spaces that are informative about whether the document and entity are related or not. Note that f_C has its own internal parameters θ because it relies on CNNs with learned filters; however, we can compute gradients for these parameters with standard backpropagation.

2 For all experiments, we set ℓ = 5 and k = 150.
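Equation (1) and the cosine comparison can be sketched as follows. This is an illustrative reimplementation with toy dimensions and random parameters, not the authors' code (which uses d = 300, ℓ = 5, and k = 150):

```python
import numpy as np

def conv_topic_vector(words, M, ell=5):
    """Equation (1): slide a width-ell filter bank M (k x d*ell) over the
    word vectors, apply an element-wise ReLU, and sum-pool into one vector.

    words: (n, d) array of word embeddings, with n >= ell
    """
    n, _ = words.shape
    out = np.zeros(M.shape[0])
    for j in range(n - ell + 1):
        window = words[j:j + ell].reshape(-1)   # concatenation w_{j:j+ell}
        out += np.maximum(0.0, M @ window)      # ReLU, then sum pooling
    return out

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy dimensions with random (untrained) filters and embeddings.
rng = np.random.default_rng(0)
d, ell, k = 8, 3, 6
M_doc = rng.normal(size=(k, d * ell))           # filters for the document
M_title = rng.normal(size=(k, d * ell))         # filters for the entity title
s_doc = conv_topic_vector(rng.normal(size=(20, d)), M_doc, ell)
t_title = conv_topic_vector(rng.normal(size=(4, d)), M_title, ell)
f = cosine(s_doc, t_title)   # one of the six similarity features in f_C
```

In the full model, this pairing is repeated for all combinations of the three source-side and two target-side topic vectors to produce the six-dimensional feature vector f_C.
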
2.2 Integrating with a Sparse Model

The dense model presented in Section 2.1 is effective at capturing semantic topic similarity, but it is most effective when combined with other signals for entity linking. An important cue for resolving a mention is the use of link counts from hyperlinks in Wikipedia (Cucerzan, 2007; Milne and Witten, 2008; Ji and Grishman, 2011), which tell us how often a given mention was linked to each article on Wikipedia. This information can serve as a useful prior, but only if we can leverage it effectively by targeting the most salient part of a mention. For example, we may have never observed President Barack Obama as a linked string on Wikipedia, even though we have seen the substring Barack Obama and it unambiguously indicates the correct answer.

Following Durrett and Klein (2014), we introduce a latent variable q to capture which subset of a mention (known as a query) we resolve. Query generation includes potentially removing stop words, plural suffixes, punctuation, and leading or trailing words. This process generates on average 9 queries for each mention. Conveniently, this set of queries also defines the set of candidate entities that we consider linking a mention to: each query generates a set of potential entities based on link counts, whose union is then taken to give the possible entity targets for each mention (including the null link). In the example shown in Figure 1, the query phrases are Pink Floyd and Floyd, which generate Pink Floyd and Gavin Floyd as potential link targets (among other options that might be derived from the Floyd query).

Our final model has the form P(t | x) = Σ_q P(t, q | x). We parameterize P(t, q | x) in a log-linear way with three separate components:

    P(t, q | x) ∝ exp(w^T (f_Q(x, q) + f_E(x, q, t) + f_C(x, t; θ)))

f_Q and f_E are both sparse feature vectors taken from previous work (Durrett and Klein, 2014); f_C is as discussed in Section 2.1. The whole model is trained to maximize the log likelihood of a labeled training corpus using Adadelta (Zeiler, 2012).

The indicator features f_Q and f_E are described in more detail in Durrett and Klein (2014). f_Q only impacts which query is selected and not the disambiguation to a title. It is designed to roughly capture the basic shape of a query to measure its desirability, indicating whether suffixes were removed and whether the query captures the capitalized subsequence of a mention, as well as standard lexical, POS, and named entity type features. f_E mostly captures how likely the selected query is to correspond to a given entity based on factors like anchor text counts from Wikipedia, string match with proposed Wikipedia titles, and discretized cosine similarities of tf-idf vectors (Ratinov et al., 2011). Adding tf-idf indicators is the only modification we made to the features of Durrett and Klein (2014).

3 Experimental Results

We performed experiments on 4 different entity linking datasets.

- ACE (NIST, 2005; Bentivogli et al., 2010): This corpus was used in Fahrni and Strube (2014) and Durrett and Klein (2014).

- CoNLL-YAGO (Hoffart et al., 2011): This corpus is based on the CoNLL 2003 dataset; the test set consists of 231 news articles and contains a number of rarer entities.

- WP (Heath and Bizer, 2011): This dataset consists of short snippets from Wikipedia.

- Wikipedia (Ratinov et al., 2011): This dataset consists of 10,000 randomly sampled Wikipedia articles, with the task being to resolve the links in each article.3

3 We do not compare to Ratinov et al. (2011) on this dataset because we do not have access to the original Wikipedia dump they used for their work and as a result could not duplicate their results or conduct comparable experiments, a problem which was also noted by Nguyen et al. (2014).
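The marginalization over latent queries works as in this sketch, where the raw scores w^T(f_Q + f_E + f_C) for each (entity, query) pair are hypothetical numbers chosen to mirror the Figure 1 example, not values from the trained model:

```python
import math

def entity_posterior(scores):
    """P(t | x) = sum_q P(t, q | x), where P(t, q | x) is proportional
    to exp(w^T (f_Q + f_E + f_C)).

    scores: dict mapping (entity, query) -> raw linear score
    Returns a dict mapping entity -> marginal probability.
    """
    m = max(scores.values())
    joint = {tq: math.exp(s - m) for tq, s in scores.items()}  # stabilized
    Z = sum(joint.values())
    post = {}
    for (t, q), v in joint.items():            # marginalize out q
        post[t] = post.get(t, 0.0) + v / Z
    return post

# Hypothetical scores: the queries "Pink Floyd" and "Floyd" generate the
# candidates Pink Floyd and Gavin Floyd, as in Figure 1.
scores = {("Pink Floyd", "Pink Floyd"): 2.5,
          ("Pink Floyd", "Floyd"): 0.5,
          ("Gavin Floyd", "Floyd"): 1.0}
post = entity_posterior(scores)
```
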
                      ACE    CoNLL   WP     Wiki4
Previous work
  DK2014              79.6   —       —      —
  AIDA-LIGHT          —      84.8    —      —
This work
  Sparse features     83.6   74.9    81.1   81.5
  CNN features        84.5   81.2    87.7   75.7
  Full                89.9   85.5    90.7   82.2

Table 1: Performance of the system in this work (Full) compared to two baselines from prior work and two ablations. Our results outperform those of Durrett and Klein (2014) and Nguyen et al. (2014). In general, we also see that the convolutional networks by themselves can outperform the system using only sparse features, and in all cases these stack to give substantial benefit.

                          ACE     CoNLL   WP
  cosim(s_doc, t_doc)     77.43   79.76   72.93
  cosim(s_ment, t_title)  80.19   80.86   70.25
  All CNN pairs           84.85   86.91   82.02

Table 2: Comparison of using only topic information derived from the document and target article, only information derived from the mention itself and the target entity title, and the full set of information (six features, as shown in Figure 1). Neither the finest nor coarsest convolutional context can give the performance of the complete set. Numbers are reported on a development set.

We use standard train-test splits for all datasets except for WP, where no standard split is available. In this case, we randomly sample a test set. For all experiments, we use word vectors computed by running word2vec (Mikolov et al., 2013) on all of Wikipedia, as described in Section 3.2.

Table 1 shows results for two baselines and three variants of our system. Our main contribution is the combination of indicator features and CNN features (Full). We see that this system outperforms the results of Durrett and Klein (2014) and the AIDA-LIGHT system of Nguyen et al. (2014). We can also compare to two ablations: using just the sparse features (a system which is a direct extension of Durrett and Klein (2014)) or using just the CNN-derived features.5 Our CNN features generally outperform the sparse features and improve even further when stacked with them. This reflects that they capture orthogonal sources of information: for example, the sparse features can capture how frequently the target document was linked to, whereas the CNNs can capture document context in a more nuanced way. These CNN features also clearly supersede the sparse features based on tf-idf (taken from Ratinov et al. (2011)), showing that CNNs are indeed better at learning semantic topic similarity than heuristics like tf-idf.

In the sparse feature system, the highest weighted features are typically those indicating the frequency that a page was linked to and those indicating specific lexical items in the choice of the latent query variable q. This suggests that the system of Durrett and Klein (2014) has the power to pick the right span of a mention to resolve, but is then left to generally pick the most common link target in Wikipedia, which is not always correct. By contrast, the full system has a greater ability to pick less common link targets if the topic indicators distilled from the CNNs indicate that it should do so.

3.1 Multiple Granularities of Convolution

One question we might ask is how much we gain by having multiple convolutions on the source and target side. Table 2 compares our full suite of CNN features, i.e. the six features specified in Figure 1, with two specific convolutional features in isolation. Using convolutions over just the source document (s_doc) and target article text (t_doc) gives a system6 that performs, in aggregate, comparably to using convolutions over just the mention (s_ment) and the entity title (t_title). These represent two extremes of the system: consuming the maximum amount of context, which might give the most robust representation of topic semantics, and consuming the minimum amount of context, which gives the most focused representation of topic semantics (and which more generally might allow the system to directly memorize train-test pairs observed in training). However, neither performs as well as the combination of all CNN features, showing that the different granularities capture complementary aspects of the entity linking task.

4 The test set for this dataset is only 40 out of 10,000 documents and subject to wide variation in performance.
5 In this model, the set of possible link targets for each mention is still populated using anchor text information from Wikipedia (Section 2.2), but note that link counts are not used as a feature here.
6 This model is roughly comparable to Model 2 as presented in Sun et al. (2015).

destroying missiles . spy planes      | has died of his wounds         | him which was more depressing
and destroying missiles . spy        | vittorio sacerdoti has told his | a trip and you see
by U.N. weapons inspectors .         | his bail hearing , his          | " i can see why
inspectors are discovering and destroying | bail hearing , his lawyer  | i think so many americans
are discovering and destroying missiles | died of his wounds after     | his life from the depression
an attack using chemical weapons     | from scott peterson 's attorney | trip and you see him
discovering and destroying missiles . | 's murder trial . she          | , but dumb liberal could
attack munitions or j-dam weapons    | has told his remarkable tale    | i can see why he
sanctions targeting iraq civilians , | murder trial . she asking       | one passage . you cite
its nuclear weapons and missile      | trial lawyers are driving doctors | think so many americans are

Table 3: Five-grams representing the maximal activations from different filters in the convolution over the source document (M_doc, producing s_doc in Figure 1). Some filters tend towards singular topics as shown in the first and second columns, which focus on weapons and trials, respectively. Others may have a mix of seemingly unrelated topics, as shown in the third column, which does not have a coherent theme. However, such filters might represent a superposition of filters for various topics which never co-occur and thus never need to be disambiguated between.
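The analysis behind Table 3 amounts to scanning every five-gram in the corpus and keeping those whose windows most strongly activate a given filter row. A sketch with random embeddings and filters standing in for the trained ones:

```python
import numpy as np

def top_ngrams_for_filter(tokens, embed, M, row, ell=5, topk=3):
    """Return the ell-grams whose concatenated embeddings maximally
    activate filter `row` (one row of the filter bank M), as in Table 3.

    tokens: list of n words; embed: dict mapping word -> (d,) vector
    """
    filt = M[row]                                  # (d * ell,)
    acts = []
    for j in range(len(tokens) - ell + 1):
        window = np.concatenate([embed[w] for w in tokens[j:j + ell]])
        acts.append((float(filt @ window), " ".join(tokens[j:j + ell])))
    acts.sort(reverse=True)                        # highest activation first
    return [ng for _, ng in acts[:topk]]

# Toy corpus and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(1)
d, ell = 8, 5
tokens = ("inspectors are discovering and destroying missiles "
          "near his bail hearing today").split()
embed = {w: rng.normal(size=d) for w in tokens}
M = rng.normal(size=(4, d * ell))
best = top_ngrams_for_filter(tokens, embed, M, row=0, ell=ell)
```
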

                 ACE    CoNLL   WP
  Google News    87.5   89.6    83.8
  Wikipedia      89.5   90.6    85.5

Table 4: Results of the full model (sparse and convolutional features) comparing word vectors derived from Google News vs. Wikipedia on development sets for each corpus.

3.2 Embedding Vectors

We also explored two different sources of embedding vectors for the convolutions. Table 4 shows that word vectors trained on Wikipedia outperformed Google News word vectors trained on a larger corpus. Further investigation revealed that the Google News vectors had much higher out-of-vocabulary rates. For learning the vectors, we use the standard word2vec toolkit (Mikolov et al., 2013) with vector length set to 300, window set to 21 (larger windows produce more semantically-focused vectors (Levy and Goldberg, 2014)), 10 negative samples, and 10 iterations through Wikipedia. We do not fine-tune word vectors during training of our model, as that was not found to improve performance.
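Under these settings, training the vectors with the word2vec toolkit would look roughly like the following (the file names are illustrative assumptions; wiki-text.txt stands for a pre-tokenized Wikipedia dump):

```shell
# Hypothetical word2vec toolkit invocation matching the settings above.
./word2vec -train wiki-text.txt -output wiki-vectors.bin \
    -size 300 -window 21 -negative 10 -iter 10 -binary 1
```
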
3.3 Analysis of Learned Convolutions

One downside of our system compared to its purely indicator-based variant is that its operation is less interpretable. However, one way we can inspect the learned system is by examining what causes high activations of the various convolutional filters (rows of the matrices M_g from Equation 1). Table 3 shows the n-grams in the ACE dataset leading to maximal activations of three of the filters from M_doc. Some filters tend to learn to pick up on n-grams characteristic of a particular topic. In other cases, a single filter might be somewhat inscrutable, as with the third column of Table 3. There are a few possible explanations for this. First, the filter may generally have low activations and therefore have little impact on the final feature computation. Second, the extreme points of the filter may not be characteristic of its overall behavior, since the bulk of n-grams will lead to more moderate activations. Finally, such a filter may represent the superposition of a few topics that we are unlikely to ever need to disambiguate between; in a particular context, this filter will then play a clear role, but one which is hard to determine from the overall shape of the parameters.

4 Conclusion

In this work, we investigated using convolutional networks to capture semantic similarity between source documents and potential entity link targets. Using multiple granularities of convolutions to evaluate the compatibility of a mention in context and several potential link targets gives strong performance on its own; moreover, such features also improve a pre-existing entity linking system based on sparse indicator features, showing that these sources of information are complementary.

Acknowledgments

This work was partially supported by NSF Grant CNS-1237265 and a Google Faculty Research Award. Thanks to the anonymous reviewers for their helpful comments.

References

Luisa Bentivogli, Pamela Forner, Claudio Giuliano, Alessandro Marchetti, Emanuele Pianta, and Kateryna Tymoshenko. 2010. Extending English ACE 2005 Corpus Annotation with Ground-truth Links to Wikipedia. In Proceedings of the Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources.

Xiao Cheng and Dan Roth. 2013. Relational Inference for Wikification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Silviu Cucerzan. 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question Answering over Freebase with Multi-Column Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL) and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Greg Durrett and Dan Klein. 2014. A Joint Model for Entity Analysis: Coreference, Typing, and Linking. In Transactions of the Association for Computational Linguistics (TACL).

Angela Fahrni and Michael Strube. 2014. A Latent Variable Model for Discourse-aware Concept and Entity Disambiguation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 491–500.

Zhengyan He, Shujie Liu, Yang Song, Mu Li, Ming Zhou, and Houfeng Wang. 2013. Efficient Collective Entity Linking with Stacking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 426–435.

Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 1st edition.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust Disambiguation of Named Entities in Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. In Proceedings of the Association for Computational Linguistics (ACL).

Heng Ji and Ralph Grishman. 2011. Knowledge Base Population: Successful Approaches and Challenges. In Proceedings of the Association for Computational Linguistics (ACL).

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems (NIPS) 26, pages 3111–3119.

David Milne and Ian H. Witten. 2008. Learning to Link with Wikipedia. In Proceedings of the Conference on Information and Knowledge Management (CIKM).

Dat Ba Nguyen, Johannes Hoffart, Martin Theobald, and Gerhard Weikum. 2014. AIDA-light: High-Throughput Named-Entity Disambiguation. In Proceedings of the Workshop on Linked Data on the Web, co-located with the 23rd International World Wide Web Conference (WWW).

NIST. 2005. The ACE 2005 Evaluation Plan.

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1375–1384.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of the 23rd International Conference on World Wide Web (WWW).

Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2015. Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1333–1339.

Matthew D. Zeiler. 2012. AdaDelta: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.