Aggregated Semantic Matching for Short Text Entity Linking

Feng Nie1,∗ Shuyan Zhou2,∗ Jing Liu3,∗ Jinpeng Wang4, Chin-Yew Lin4, Rong Pan1,∗
1Sun Yat-Sen University 2Harbin Institute of Technology 3Baidu Inc. 4Microsoft Research Asia
{fengniesysu, alexisxy0418}@gmail.com, [email protected], {jinpwa, cyl}@microsoft.com, [email protected]

Abstract

The task of entity linking aims to identify concepts mentioned in text fragments and link them to a reference knowledge base. Entity linking in long text has been well studied in previous work. However, short text entity linking is more challenging since the texts are noisy and less coherent. To better utilize the local information provided in short texts, we propose a novel neural network framework, Aggregated Semantic Matching (ASM), in which two different aspects of semantic information between the local context and the candidate entity are captured via representation-based and interaction-based neural semantic matching models, and then the two matching signals work jointly for disambiguation with a rank aggregation mechanism. Our evaluation shows that the proposed model outperforms the state of the art on public tweet datasets.

1 Introduction

The task of entity linking aims to link a mention that appears in a piece of text to an entry (i.e., entity) in a knowledge base. For example, as shown in Table 1, given a mention Trump in a tweet, it should be linked to the entity Donald Trump¹ in Wikipedia. Recent research has shown that entity linking can help better understand the text of a document (Schuhmacher and Ponzetto, 2014) and benefits several tasks, including named entity recognition (Luo et al.) and information retrieval (Xiong et al., 2017b). The research on entity linking mainly considers two types of documents: long text (e.g., news articles and web documents) and short text (e.g., tweets). In this paper, we focus on short text, particularly tweet entity linking.

Tweet:              The vile #Trump humanity raises its gentle face in Canada ... chapeau to #Trudeau
Entity candidates:  Donald Trump, Trump (card games), ...
Table 1: An illustration of short text entity linking for the mention Trump.

One of the major challenges in the entity linking task is ambiguity, where an entity mention could denote multiple entities in a knowledge base. As shown in Table 1, the mention Trump can refer to U.S. president Donald Trump and also to the card name Trump (card games). Many recent approaches for long text entity linking take advantage of global context, which captures the coherence among the mapped entities for a set of related mentions in a single document (Cucerzan, 2007; Han et al., 2011; Globerson et al., 2016; Heinzerling et al., 2017). However, short texts like tweets are often concise and less coherent, and lack the necessary information for the global methods. In the NEEL dataset (Weller et al., 2016), there are only 3.4 mentions in each tweet on average. Several studies (Liu et al., 2013; Huang et al., 2014) investigate collective tweet entity linking by pre-collecting and considering multiple tweets simultaneously. However, multiple texts are not always available for collection and the process is time-consuming. Thus, we argue that an efficient entity disambiguation method that requires only a single short text (e.g., a tweet) and can utilize local contexts well is better suited to real-world applications.

In this paper, we investigate entity disambiguation in a setting where only local information is available. Recent neural approaches have shown their superiority in capturing rich semantic similarities from mention contexts and entity contents. Sun et al. (2015) and Francis-Landau et al. (2016) proposed using convolutional neural networks (CNN) with a Siamese (symmetric) architecture to capture the similarity between texts. These approaches can be viewed as representation-focused semantic matching models. The representation-focused model first builds a representation for a single text (e.g., a context or an entity description) with a neural network, and then conducts matching between the abstract representations of two pieces of text. Even though such models capture distinguishable information from both the mention and the entity side, some concrete matching signals (e.g., exact match) are lost, since the matching between two texts happens after their individual abstract representations have been obtained. To enhance the representation-focused models, inspired by recent advances in information retrieval (Lu and Li, 2013; Guo et al., 2016; Xiong et al., 2017a), we propose using an interaction-focused approach to capture the concrete matching signals. The interaction-focused method builds local interactions (e.g., cosine similarity) between two pieces of text, and then uses neural networks to learn the final matching score based on the local interactions.

The representation- and interaction-focused approaches capture abstract- and concrete-level matching signals respectively, and they complement each other if designed appropriately. One straightforward way to combine multiple semantic matching signals is to apply a linear regression layer to learn a static weight for each matching signal (Francis-Landau et al., 2016). However, we observe that the importance of different signals varies from case to case. For example, as shown in Table 1, the context word Canada is the most important word for the disambiguation of Trudeau. In this case, the concrete-level matching signal is required. For the tweet "#StarWars #theForceAwakens #StarWarsForceAwakens @StarWars", in contrast, @StarWars is linked to the entity Star Wars². In this case, the whole tweet describes the same topic "Star Wars", thus the abstract-level matching signal is helpful. To address this issue, we propose using a rank aggregation method to dynamically combine multiple semantic matching signals for disambiguation.

In summary, we focus on entity disambiguation by leveraging only the local information. Specifically, we propose using both a representation-focused model and an interaction-focused model for semantic matching and view them as complementary to each other. To overcome the issue of the static weights in linear regression, we apply rank aggregation to combine multiple semantic matching signals captured by the two neural models on multiple text pairs. We conduct extensive experiments to examine the effectiveness of our proposed approach, ASM, on both the NEEL dataset and the MSR tweet entity linking (MSR-TEL for short) dataset.

∗ Correspondence author is Rong Pan. This work was done when the first and second author were interns and the third author was an employee at Microsoft Research Asia.
¹ https://en.wikipedia.org/wiki/Donald_Trump
² https://en.wikipedia.org/wiki/Star_Wars

Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 476–485, Brussels, Belgium, October 31 - November 1, 2018. ©2018 Association for Computational Linguistics

2 Background

2.1 Notations

A tweet $t$ contains a set of identified queries $Q = (q_1, \ldots, q_n)$. Each query $q$ in a tweet $t$ consists of $m$ and $ctx$, where $m$ denotes an entity mention and $ctx$ denotes the context of the mention, i.e., a piece of text surrounding $m$ in the tweet $t$. An entity is an unambiguous page (e.g., Donald Trump) in a referent Knowledge Base (KB). Each entity $e$ consists of $ttl$ and $desc$, where $ttl$ denotes the title of $e$ and $desc$ denotes the description of $e$ (e.g., the article defining $e$).

2.2 An Overview of the Linking System

Typically, an entity linking system consists of three components: mention detection, candidate generation and entity disambiguation. In this section, we briefly present the existing solutions for the first two components. In the next section, we introduce our proposed aggregated semantic matching for entity disambiguation.

2.2.1 Mention Detection

Given a tweet $t$ with a sequence of words $w_1, \ldots, w_n$, our goal is to identify the possible entity mentions in the tweet $t$. Specifically, every word $w_i$ in tweet $t$ requires a label that indicates whether it is an entity mention word or not. Therefore, we view it as a traditional named entity recognition (NER) problem and use the BIO tagging schema. Given the tweet $t$, we aim to assign labels $y = (y_1, \ldots, y_n)$ to the words in the tweet $t$:

$$y_i = \begin{cases} B & w_i \text{ is a begin word of a mention,} \\ I & w_i \text{ is a non-begin word of a mention,} \\ O & w_i \text{ is not a mention word.} \end{cases}$$

In our implementation, we apply an LSTM-CRF based NER tagging model which automatically

learns contextual features for sequence tagging via recurrent neural networks (Lample et al., 2016).

Figure 1: An overview of aggregated semantic matching for entity disambiguation. (Pipeline shown: Knowledge Base and Tweet Data; Mention Detection and Candidate Generation; Semantic Matching with a Convolution Neural Model with Max-Pooling and a Neural Relevance Model with Kernel-Pooling; Rank Aggregation; Linking Results.)

2.2.2 Candidate Generation

Given a mention $m$, we use several heuristic rules to generate candidate entities, similar to (Bunescu and Pasca, 2006; Huang et al., 2014; Sun et al., 2015). Specifically, given a mention $m$, we retrieve an entity as a candidate from the KB if it matches one of the following conditions: (a) the entity title exactly matches the mention, (b) the anchor text of the entity exactly matches the mention, (c) the title of the entity's redirected page exactly matches the mention. Additionally, we add a special candidate NIL for each mention, which refers to a new entity outside the KB. Given a mention, multiple candidates can be retrieved. Hence, we need to do entity disambiguation.

3 Aggregated Semantic Matching Model

In this paper, we investigate entity disambiguation using only the local information provided in short texts. Here, the local information includes a mention and its context in a tweet. Similar to (Francis-Landau et al., 2016), given a query $q$ and an entity $e$, we consider semantic matching on four text pairs for disambiguation: (1) the similarity $sim(m, ttl)$ between the mention and the entity title, (2) the similarity $sim(m, desc)$ between the mention and the entity description, (3) the similarity $sim(ctx, desc)$ between the context and the entity description, (4) the similarity $sim(ctx, ttl)$ between the context and the entity title. Fig. 1 illustrates an overview of our proposed Aggregated Semantic Matching for entity disambiguation. First, we use a representation-focused model and an interaction-focused neural model for semantic matching on the four text pairs. Then, we introduce a pairwise rank aggregation to combine the multiple semantic matching signals captured by the two neural models on the four text pairs.

3.1 Semantic Matching

Formally, given two texts $T_1$ and $T_2$, the similarity of the two texts is measured as a score produced by a matching function based on the representation of each text:

$$match(T_1, T_2) = F(\Phi(T_1), \Phi(T_2)) \tag{1}$$

where $\Phi$ is a function to learn the text representation, and $F$ is the matching function based on the interaction between the representations.

Existing neural semantic matching models can be categorized into two types: (a) the representation-focused model, which takes a complex representation learning function and uses a relatively simple matching function, (b) the interaction-focused model, which usually takes a simple representation learning function and uses a complex matching function. In the remainder of this section, we present the details of a representation-focused model (M-CNN) and an interaction-focused model (K-NRM). We also discuss the advantages of these two models in the entity linking task.

3.1.1 Convolution Neural Matching with Max Pooling (M-CNN)

Given two pieces of text $T_1 = \{w^1_1, \ldots, w^1_n\}$ and $T_2 = \{w^2_1, \ldots, w^2_m\}$, M-CNN aims to learn compositional and abstract representations ($\Phi$) for $T_1$ and $T_2$ using a convolution neural network with a max pooling layer (Francis-Landau et al., 2016). Figure 2a illustrates the architecture of the M-CNN model. Given a sequence of words $w_1, \ldots, w_n$, we embed each word into a $d$ dimensional vector, which yields a set of word vectors $v_1, \ldots, v_n$. We then map those word vectors into a fixed-size vector using a convolution network with a filter bank $M \in \mathbb{R}^{u \times d}$, where the window size is $l$ and $u$ is the number of filters. The convolution feature matrix $H \in \mathbb{R}^{u \times (n-l+1)}$ is obtained by concatenating the convolution outputs $\vec{h}_j$:

$$\vec{h}_j = \max\{0, M v_{j:(j+l)}\} \tag{2}$$
$$H = [\vec{h}_1, \ldots, \vec{h}_{n-l+1}]$$

where $v_{j:(j+l)}$ is a concatenation of the given word vectors and the max is element-wise. In this way, we extract word-level n-gram features of $T_1$ and

$T_2$ respectively. To capture the distinguishable information of $T_1$ and $T_2$, a max-pooling layer is applied and yields fixed-length vectors $\vec{z}_1$ and $\vec{z}_2$ for $T_1$ and $T_2$. The semantic similarity between $T_1$ and $T_2$ is measured using a cosine similarity, $match(T_1, T_2) = cosine(\vec{z}_1, \vec{z}_2)$.

Figure 2: The architecture of the models: (a) M-CNN model, (b) K-NRM model.

In summary, M-CNN extracts distinguishable information representing the overall semantics (i.e., representations) of a text string by using a convolution neural network with max-pooling. However, the concrete matching signals (e.g., exact match) are lost, as the matching happens after the individual representations are built. We therefore introduce an interaction-focused model to better capture the concrete matching in the next section.

3.1.2 Neural Relevance Model with Kernel Pooling (K-NRM)

As shown in Fig. 2b, K-NRM captures the local interactions between $T_1$ and $T_2$, and then uses a kernel-pooling layer (Xiong et al., 2017a) to softly count the frequencies of the local patterns. The final matching score is computed based on the patterns. Therefore, the concrete matching information is captured.

Different from M-CNN, K-NRM builds the local interactions between $T_1$ and $T_2$ based on the word-level n-gram feature matrix calculated in Eq. 2. Formally, we construct a translation matrix $M$, where each element in $M$ is the cosine similarity between an n-gram feature vector $\vec{h}^q_i$ in $T_1$ and an n-gram feature vector $\vec{h}^e_j$ in $T_2$, calculated as $M_{ij} = cosine(\vec{h}^q_i, \vec{h}^e_j)$.

Then, a scoring feature vector $\phi(M)$ is generated by a kernel-pooling technique:

$$\phi(M) = \sum_{i=1}^{n-l+1} \sqrt{\vec{K}(M_i)} \tag{3}$$
$$\vec{K}(M_i) = \{K_1(M_i), \ldots, K_K(M_i)\}$$

where $\vec{K}(M_i)$ applies $K$ kernels to the $i$-th row of the translation matrix, and generates a $K$-dimensional scoring feature vector for the $i$-th n-gram feature in the query. The sqrt-sum of the scoring feature vectors of all n-gram features in the query forms the scoring feature vector $\phi$ for the whole query, where the sqrt reduces the range of the values in each kernel vector. Note that the effect of $\vec{K}$ depends on the kernel used. We use the RBF kernel in this paper:

$$K_k(M_i) = \sum_j \exp\left(-\frac{(M_{ij} - \mu_k)^2}{2\sigma^2}\right) \tag{4}$$

The RBF kernel $K_k$ calculates how the pairwise similarities between n-gram feature vectors are distributed around its mean $\mu_k$: the more similarities lie close to its mean $\mu_k$, the higher the output value is. The kernel functions act as 'soft-TF' bins, where $\mu$ defines the similarity level that the 'soft-TF' focuses on and $\sigma$ defines the range of its 'soft-TF' count. Then the semantic similarity is captured with a linear layer $match(T_1, T_2) = w^T \phi(M) + b$, where $\phi(M)$ is the scoring feature vector.

In summary, K-NRM captures the concrete matching signals based on word-level n-gram feature interactions between $T_1$ and $T_2$. In contrast, M-CNN captures the compositional and abstract meaning of a whole text. Thus, we produce the semantic matching signals using both models to capture different aspects of semantics that are useful for entity linking.

3.2 Normalization Scoring Layer

We compute 4 types of semantic similarities between the query $q$ and the candidate entity $e$ (i.e., $sim(m, ttl)$, $sim(m, desc)$, $sim(ctx, ttl)$, $sim(ctx, desc)$) with the above two semantic matching models. We obtain 8 semantic matching signals in total, denoted as $f_1(q, e), \ldots, f_8(q, e)$. The normalized ranking score for each semantic matching signal $f_i(q, e)$ is calculated as

$$s_i(q, e, f) = \frac{\exp(f_i(q, e))}{\sum_{e'} \exp(f_i(q, e'))} \tag{5}$$

where $e'$ stands for any of the candidate entities for the given mention $m$. We then produce 8 semantic matching scores for each candidate entity of $m$, denoted as $S_{q,e} = \{s_1, \ldots, s_8\}$.

3.3 Rank Aggregation

Given a query $q$, we obtain multiple semantic matching signals for each entity candidate after the last step. To take advantage of different semantic matching models on different text pairs, a straightforward approach is to use a linear regression layer to combine the multiple semantic matching signals (Francis-Landau et al., 2016). The linear combination learns a static weight for each matching signal. However, as we pointed out previously, the importance of different signals varies for different queries. In some cases the abstract-level signals are important, while in other cases the concrete-level signals are more important. To address this issue, we introduce a pairwise rank aggregation method to aggregate multiple semantic matching signals.

In the area of information retrieval, rank aggregation combines rankings from multiple retrieval systems and produces a better new ranking (Carterette and Petkova, 2006). In our problem, given a query $q$, we have one ranking of the entity candidates for each semantic matching signal. We aim to find the final ranking by aggregating multiple rankings. Specifically, given a ranking of entities for one semantic matching signal, $e_1 \succ e_2 \succ e_3 \ldots$, where $e_i \succ e_j$ means entity $i$ is ranked above $j$, we extract all entity pairs $(e_i, e_j)$ from the ranking and assume that if $e_i \succ e_j$, then $e_i$ is preferred to $e_j$. We union all pairwise preferences generated from multiple rankings as a single set, from which the final ranking is learned. In this paper, we apply TrueSkill (Herbrich et al., 2006), which is a Bayesian skill rating model. We present a two-layer version of TrueSkill with no-draw.

TrueSkill assumes that the practical performance of each player in a game follows a normal distribution $N(\mu, \sigma^2)$, where $\mu$ means the skill level of the player and $\sigma$ stands for the uncertainty of the estimated skill level. Basically, TrueSkill learns the skill levels of players by leveraging Bayes' theorem. Given the current estimated skill levels of two players (prior probability) and the outcome of a new game between them (likelihood), the TrueSkill model updates its estimation of the player skill levels (posterior probability).

TrueSkill updates the skill level $\mu$ and the uncertainty $\sigma$ intuitively: (a) if the outcome of a new competition is expected, i.e., the player with the higher skill level wins the game, it will cause small updates in skill level $\mu$ and uncertainty $\sigma$; (b) if the outcome of a new competition is unexpected, i.e., the player with the lower skill level wins the game, it will cause large updates in skill level $\mu$ and uncertainty $\sigma$. According to these intuitions, the equations to update the skill level $\mu$ and uncertainty $\sigma$ are as follows:

$$\begin{aligned}
\mu_{winner} &= \mu_{winner} + \frac{\sigma^2_{winner}}{c} \cdot v\left(\frac{t}{c}, \frac{\varepsilon}{c}\right) \\
\mu_{loser} &= \mu_{loser} - \frac{\sigma^2_{loser}}{c} \cdot v\left(\frac{t}{c}, \frac{\varepsilon}{c}\right) \\
\sigma^2_{winner} &= \sigma^2_{winner} \cdot \left[1 - \frac{\sigma^2_{winner}}{c^2} \cdot w\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right] \\
\sigma^2_{loser} &= \sigma^2_{loser} \cdot \left[1 - \frac{\sigma^2_{loser}}{c^2} \cdot w\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right]
\end{aligned} \tag{6}$$

where $t = \mu_{winner} - \mu_{loser}$ and $c^2 = 2\beta^2 + \sigma^2_{winner} + \sigma^2_{loser}$. Here, $\varepsilon$ is a parameter representing the probability of a draw in one game, and $v(t, \varepsilon)$ and $w(t, \varepsilon)$ are weighting factors for skill level $\mu$ and standard deviation $\sigma$ respectively. $\beta$ is a parameter representing the range of skills. In this paper, we set the initial values of the skill level $\mu$ and the standard deviation $\sigma$ of each player to the default values used in (Herbrich et al., 2006). We use $\mu - 3\beta$ to rank entities following (Herbrich et al., 2006).

4 Experiments

In this section, we describe our experimental results on tweet entity linking. In particular, we investigate the difference between the two semantic matching models and the effectiveness of jointly combining these two semantic matching signals.

4.1 Datasets & Evaluation Metric

In our experiments, we evaluate our proposed model ASM on the following two datasets.

NEEL (Weller et al., 2016). We use the dataset of the Named Entity Extraction & Linking Challenge 2016. The training dataset consists of 6,025 tweets and includes 6,374 non-NIL queries and 2,291

NIL queries. The validation dataset consists of 100 tweets and includes 253 non-NIL queries and 85 NIL queries. The testing dataset consists of 300 tweets and includes 738 non-NIL queries and 284 NIL queries.

MSR-TEL (Guo et al., 2013)³. This dataset consists of 428 tweets and 770 non-NIL queries. Since the NEEL test dataset has a distribution bias problem, we add MSR-TEL as another dataset for the evaluation. In the NEEL testing dataset, 384 out of 1022 queries refer to three entities: 'Donald Trump', 'Star Wars' and 'Star Wars (The Force Awakens)'.

In this paper, we use accuracy as the major evaluation metric for entity disambiguation. Formally, we denote $N$ as the number of queries and $M$ as the number of correctly linked mentions given the gold mentions (i.e., the top-ranked entity is the golden entity), so that $accuracy = M / N$. Besides, we use precision, recall and F1 measure to evaluate the end-to-end system. Formally, we denote $N'$ as the number of mentions identified by a system and $M'$ as the number of correctly linked mentions. Thus, $precision = M' / N'$, $recall = M' / N$ and $F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$.

4.2 Data Preprocessing

Tweet data. All tweets are normalized in the following way. First, we use the Twitter-aware tokenizer in NLTK⁴ to tokenize words in a tweet. We convert each hyperlink in tweets to a special token URL. Since hashtags usually do not contain any space between words, we use a web service⁵ to break hashtags into tokens (e.g., the service will break '#TheForceAwakens' into 'the force awakens'), following (Guo et al., 2013). Regarding usernames (@) in tweets, we replace them with their screen names (e.g., the screen name of the user '@jimmyfallon' is 'jimmy fallon').

Wikipedia data. We use the Wikipedia dump of December 2015 as the reference knowledge base. Since the most important information about an entity is usually at the beginning of its Wikipedia article, we utilize only the first 200 words of the article as its entity description. We use the default English word tokenizer in NLTK to tokenize each Wikipedia article.

Word embedding. We use the word2vec toolkit (Mikolov et al., 2013) to pre-train word embeddings on the whole English Wikipedia dump. The dimensionality of the word embeddings is set to 400. Note that we do not update the word embeddings during training.

³ Guo et al. (2013) only used a subset of this dataset for evaluation. Instead, we test on the full dataset.
⁴ Natural Language Toolkit. http://www.nltk.org
⁵ http://web-ngram.research.microsoft.com/info/break.html

4.3 Experimental Setup

In our main experiment, we compare our proposed approaches with the following baselines:
(a) The officially ranked 1st and 2nd systems in the NEEL 2016 challenge. We denote these two systems as Rank1 and Rank2.
(b) TagMe. Ferragina and Scaiella (2010) is an end-to-end linking system, which jointly performs mention detection and entity disambiguation. It focuses on short texts, including tweets.
(c) Cucerzan. (Cucerzan, 2007) is a supervised entity disambiguation system that won the TAC KBP competition in 2010.
(d) M-CNN. To the best of our knowledge, (Francis-Landau et al., 2016) is the state-of-the-art neural disambiguation model.
(e) Ensemble. The rank aggregated combination of two M-CNN models with different random seeds.

To fairly compare with the baselines of Cucerzan and M-CNN, we use the same mention detection and candidate generation for them and for our approaches. We train an LSTM-CRF based tagger (Lample et al., 2016) for mention detection by using the NEEL training dataset. The precision, recall, and F1 of mention detection on the NEEL testing dataset are 96.1%, 89.2% and 92.6% respectively. The precision, recall, and F1 of mention detection on the MSR-TEL dataset are 80.3%, 83.8% and 82% respectively. As described in the previous section, we use heuristic rules for candidate generation. The recall of candidate generation on NEEL and MSR-TEL is 88.7% and 92.5% respectively.

When training our model, we use the stochastic gradient descent algorithm and the AdaDelta optimizer (Zeiler, 2012). The gradients are computed via back-propagation. The dimensionality of the hidden units in the convolution neural network is set to 300. All the parameters are initialized with a uniform distribution $U(-0.01, 0.01)$. Since there is a NIL entity in the dataset, we tune a NIL threshold for the prediction of NIL entities on the validation dataset.

4.4 Main Results

The end-to-end performance of various approaches on the two datasets is shown in Table 2. Since there are no publicly available codes of

Rank1 and Rank2, we give only the F1 scores of these two systems on the NEEL dataset according to Weller et al. (2016). Note that the baseline systems Rank1, Rank2 and TagMe use different mention detection.

               NEEL                          MSR-TEL⁶
Methods        Precision  Recall  F1         Precision  Recall  F1
Rank 1         -          -       50.1       -          -       -
Rank 2         -          -       39.6       -          -       -
TagMe          25.3       62.9    36.2       14.5       69.2    23.8
Cucerzan       65.4       57.9    61.4       62.6       63.3    62.9
M-CNN          69.5       64.9    67.1       61.6       62.3    62.1
 +pre-train    69.7       65.1    67.3       64.5       65.2    64.8
Ensemble       69.7       65.1    67.3       63.5       64.2    63.8
 +pre-train    70.2       65.5    67.8       64.9       65.6    65.2
ASM            70.6       65.9    68.2       64.2       64.9    64.5
 +pre-train    72.2       67.4    69.7       66.2       66.9    66.5
Table 2: End-to-end performance of the systems on the two datasets.

⁶ Note that the performance of all systems on the MSR-TEL dataset might be underestimated, since not all mentions in each tweet were manually annotated. For example, a correctly identified mention given by a system, which was not manually annotated, will be judged as wrong. But we still give the comparisons of different approaches on the MSR-TEL dataset.

The systems of Rank1, Rank2, TagMe and Cucerzan are feature engineering based approaches. The systems of M-CNN and ASM are neural based approaches. From Table 2, we can observe that the neural based approaches are superior to the feature engineering based approaches. Table 2 also shows that ASM outperforms the neural based method M-CNN. Our proposed method ASM also shows improvements over Ensemble, which indicates the necessity of combining representation- and interaction-focused models in entity disambiguation.

Moreover, we pre-train M-CNN, Ensemble and ASM by using 0.5 million anchors in Wikipedia, and fine-tune the model parameters using the non-NIL queries in the NEEL training dataset. From Table 2, we can observe that the performance of the neural models is improved by pre-training. The results in Table 2 show that our proposed ASM is still superior to M-CNN and Ensemble in the setting of pre-training.

Since entity disambiguation is our focus, we also give the disambiguation accuracy of different approaches by using the golden mentions in Table 3. Similarly, we observe that our proposed ASM outperforms the baseline systems.

Methods        NEEL    MSR-TEL
Cucerzan       65.4    75.5
M-CNN          72.8    74.7
 +pre-train    72.9    77.6
Ensemble       72.9    76.4
 +pre-train    73.5    78.1
ASM            73.9    77.4
 +pre-train    75.5    79.4
Table 3: The accuracy of entity disambiguation with golden mentions on the two datasets.

4.5 Model Analysis

In this section, we discuss several key observations based on the experimental results, and we mainly report the entity disambiguation accuracy when given the golden mentions.

4.5.1 Effect of Different Semantic Matching Methods

We empirically analyze the difference between the two semantic matching models (M-CNN and K-NRM) and show the benefits of combining the semantic matching signals from these two models.

Methods    (m, ttl)    (ctx, desc)    All Pairs
M-CNN      64.8        66.7           72.8
K-NRM      64.1        66.8           72.7
ASM        65.1        69.7           73.9
Table 4: The performance of two semantic matching models and their combinations on the NEEL dataset.
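To make the contrast between the two matching styles concrete, the following minimal sketch scores one toy text pair both ways. It assumes the n-gram feature vectors have already been produced by a convolution layer; the vectors, the kernel means `MUS`, the width `SIGMA` and all function names are hypothetical choices for illustration, and the learned linear layer on top of the kernel-pooled features is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_pool(features):
    """Element-wise max over a list of n-gram feature vectors."""
    return [max(column) for column in zip(*features)]


def mcnn_match(h_query, h_entity):
    """Representation-focused score: pool each side first, then compare once."""
    return cosine(max_pool(h_query), max_pool(h_entity))


def kernel_pool(h_query, h_entity, mus, sigma):
    """Interaction-focused features: compare every n-gram pair, then pool.

    Builds each row of the cosine translation matrix, applies one RBF
    kernel per mean in `mus`, and sums the square roots over query n-grams.
    """
    phi = [0.0] * len(mus)
    for hq in h_query:
        row = [cosine(hq, he) for he in h_entity]  # one row of the matrix M
        for k, mu in enumerate(mus):
            soft_tf = sum(math.exp(-((m - mu) ** 2) / (2 * sigma ** 2)) for m in row)
            phi[k] += math.sqrt(soft_tf)
    return phi


# Toy, hand-made n-gram feature vectors (hypothetical values).
H_QUERY = [[1.0, 0.0], [0.0, 1.0]]
H_ENTITY = [[1.0, 0.0], [0.6, 0.8]]
MUS = [1.0, 0.5, 0.0]  # assumed kernel means; mu = 1.0 acts as an exact-match bin
SIGMA = 0.1            # assumed kernel width

score = mcnn_match(H_QUERY, H_ENTITY)
phi = kernel_pool(H_QUERY, H_ENTITY, MUS, SIGMA)
print("M-CNN style score:", round(score, 4))
print("K-NRM style soft-TF features:", [round(p, 4) for p in phi])
```

The pooled representation collapses each text to a single vector before any comparison, whereas the kernel-pooled features keep a separate soft count for exact matches (the bin at 1.0) and for weaker similarities, which is the kind of concrete signal the interaction-focused model is meant to retain.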

We first compare the performance of the two semantic matching models over two text pairs: (a) (m, ttl) and (b) (ctx, desc). These two pairs represent two extremes of the information used in the systems: (m, ttl) consumes the minimum amount of information from a query and an entity, while (ctx, desc) consumes the maximum amount. From the first two columns in Table 4, we can observe that M-CNN performs comparably with K-NRM on the two text pairs. ASM, which combines the two models, obtains performance gains on both individual text pairs. The third column in Table 4 also shows that ASM gives performance gains when using all text pairs. This indicates that M-CNN and K-NRM capture complementary information for entity disambiguation.

Moreover, we observe that the performance gains differ between the two pairs (m, ttl) and (ctx, desc): the gain on (ctx, desc) is relatively larger. This indicates that M-CNN and K-NRM capture more distinct information when the text is long. Additionally, we show the win-loss analysis of the two semantic matching models for non-NIL queries on (ctx, desc) in Table 5. The two models disagree on 12.1% (= 6.3% + 5.8%) of the queries, which confirms the necessity of combination.

              M-CNN win    M-CNN loss
K-NRM win     58.3%        6.3%
K-NRM loss    5.8%         29.6%

Table 5: The win-loss analysis of M-CNN and K-NRM on the pair (ctx, desc).

To further investigate the difference between the two semantic matching models on short text, we conducted a case study. Table 6 gives two examples. In the first example, the correct answer is ‘Justin Trudeau’, whose entity description contains the words ‘Canada’ and ‘trump’. However, M-CNN fails to capture this concrete matching information, since the concrete information of the text might be lost after the convolution layer and the max-pooling layer. In contrast, K-NRM builds n-gram level local interactions between the texts, and thus successfully captures the concrete matching information (e.g. exact match), which results in a correct linking result. In the second example, both candidate entities ‘Star Wars’ and ‘Comparison of Star Trek and Star Wars’ contain the phrase ‘Star Wars’ multiple times in their entity descriptions. In this case, K-NRM fails to distinguish the correct entity ‘Star Wars’ from the wrong entity ‘Comparison of Star Trek and Star Wars’, because it relies too heavily on the soft-TF information for matching, and the soft-TF information in the descriptions of the two entities is similar. In contrast, M-CNN captures the whole meaning of the text and links the mention to the correct entity. A detailed analysis of n-grams extracted from the M-CNN is provided in the Appendix.

Query:  the vile #Trump humanity raises its gentle face in Canada ... chapeau to #Trudeau,URL
M-CNN:  Kevin Trudeau
K-NRM:  Justin Trudeau

Query:  RT @ MingNa : What is my plan to avoid spoiler about #theForceAwakens ? No except to post my @StarWars
M-CNN:  Star Wars
K-NRM:  Comparison of Star Trek and Star Wars

Table 6: The top-1 results of M-CNN and K-NRM using the (ctx, desc) pair for two queries. Mention is in bold and the golden answer is underlined.

4.6 Effect of Rank Aggregation

Table 4 shows that the combination of multiple semantic matching signals yields the best performance. Table 7 compares two different ways of combining the M-CNN and K-NRM models; the results show that the rank aggregation method outperforms the linear combination. The rank aggregation method dynamically summarizes win-loss results for each signal and generates the final overall ranking by considering all win-loss results. The improvement of our method over the linear combination confirms that the importance of different semantic signals varies for different queries, and that our method is more suitable for combining multiple semantic signals.

Method    Without Pre-Train      With Pre-Train
          NEEL     MSR-TEL       NEEL     MSR-TEL
Linear    73.1     75.7          73.8     78.1
ASM       73.9     77.4          75.5     79.4

Table 7: Comparison of rank aggregation and linear combination on two datasets.
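
The contrast drawn in Section 4.6 between a linear combination and a win-loss based rank aggregation can be illustrated with a toy sketch. This is not the aggregation procedure used by ASM: it assumes a simple win-counting rule, in which every signal votes on every pair of candidates and candidates are ranked by their total number of pairwise wins. The signal names, candidate names, scores and weights below are all hypothetical.

```python
from itertools import combinations

# Hypothetical matching scores: signal -> candidate -> score.
# Signal and candidate names are illustrative only.
scores = {
    "signal_1": {"entity_a": 0.9, "entity_b": 0.1, "entity_c": 0.0},
    "signal_2": {"entity_a": 0.4, "entity_b": 0.5, "entity_c": 0.1},
    "signal_3": {"entity_a": 0.4, "entity_b": 0.5, "entity_c": 0.2},
}
# Hypothetical fixed weights for the linear baseline.
weights = {"signal_1": 1.0, "signal_2": 1.0, "signal_3": 1.0}


def linear_combination(scores, weights):
    """Rank candidates by a fixed weighted sum of their signal scores."""
    candidates = list(next(iter(scores.values())))
    total = {
        c: sum(weights[s] * scores[s][c] for s in scores) for c in candidates
    }
    return sorted(candidates, key=lambda c: total[c], reverse=True)


def pairwise_rank_aggregation(scores):
    """Rank candidates by pairwise wins counted over every signal."""
    candidates = list(next(iter(scores.values())))
    wins = {c: 0 for c in candidates}
    for signal_scores in scores.values():
        # Each signal casts one vote per candidate pair; ties give no vote.
        for a, b in combinations(candidates, 2):
            if signal_scores[a] > signal_scores[b]:
                wins[a] += 1
            elif signal_scores[b] > signal_scores[a]:
                wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


linear_ranking = linear_combination(scores, weights)
aggregated_ranking = pairwise_rank_aggregation(scores)
print(linear_ranking)
print(aggregated_ranking)
```

In this toy input, one signal scores entity_a far above the others and dominates the weighted sum, so the linear ranking puts entity_a first, whereas entity_b wins more pairwise comparisons across the signals and heads the aggregated ranking.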

5 Related Work

Existing entity linking methods roughly fall into two categories. Early work focuses on local approaches, which identify one mention at a time and disambiguate each mention separately using hand-crafted features (Bunescu and Pasca, 2006; Ji and Grishman, 2008; Milne and Witten, 2008; Zheng et al., 2010). Recent work on entity linking has largely focused on global methods, which take the mentions in a document as input and find their corresponding entities simultaneously by considering the coherence of entity assignments within the document (Cucerzan, 2007; Hoffart et al., 2011; Globerson et al., 2016; Ganea and Hofmann, 2017).

Global models can tap into highly discriminative semantic signals (e.g. coreference and entity relatedness) that are unavailable to local methods, and have significantly outperformed the local approach on standard datasets (Globerson et al., 2016). However, global approaches are difficult to apply in domains where only short and noisy text is available (e.g. tweets). Many techniques have been proposed for short texts, including tweets. Liu et al. (2013) and Huang et al. (2014) investigate collective tweet entity linking by considering multiple tweets simultaneously. Meij et al. (2012) and Guo et al. (2013) perform joint detection and disambiguation of mentions for tweet entity linking using feature-based learning methods.

Recently, some neural network methods have been applied to entity linking to model the local contextual information. He et al. (2013) investigate Stacked Denoising Auto-encoders to learn entity representations. Sun et al. (2015) and Francis-Landau et al. (2016) apply convolutional neural networks to entity linking. Eshel et al. (2017) use recurrent neural networks to model the mention contexts. Nie et al. (2018) use a co-attention mechanism to select informative contexts and entity descriptions for entity disambiguation. However, none of these methods combines representation- and interaction-focused semantic matching methods to capture the semantic similarity for entity linking, or uses a rank aggregation method to combine multiple semantic signals.

6 Conclusion

We propose an aggregated semantic matching framework, ASM, for short text entity linking. The combination of the representation-focused semantic matching method and the interaction-focused semantic matching method captures both compositional and concrete matching signals (e.g. exact match). Moreover, pairwise rank aggregation is applied to better combine multiple semantic signals. We have shown the effectiveness of ASM on two datasets through comprehensive experiments. In the future, we will try our model on long text entity linking.

7 Acknowledgement

We thank the anonymous reviewers for their helpful comments. We also thank Jin-Ge Yao, Zhirui Zhang, Shuangzhi Wu and Yin Lin for helpful conversations and comments on the work.

References

R Bunescu and M Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL, Trento, Italy.

Ben Carterette and Desislava Petkova. 2006. Learning a ranking from pairwise preferences. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 629–630. ACM.

S Cucerzan. 2007. Large-scale named entity disambiguation based on wikipedia data. In EMNLP-CoNLL, volume 2007.

Yotam Eshel, Noam Cohen, and Kira Radinsky. 2017. Named entity disambiguation for noisy text. In CoNLL, volume 2017.

Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of CIKM 2010, pages 1625–1628.

Matthew Francis-Landau, Greg Durrett, and Dan Klein. 2016. Capturing semantic similarity for entity linking with convolutional neural networks. In Proceedings of NAACL-HLT 2016, pages 1256–1261.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringgaard, and Fernando Pereira. 2016. Collective entity resolution with multi-focal attention. In Proceedings of ACL 2016.

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM 2016, pages 55–64.

Stephen Guo, Ming-Wei Chang, and Emre Kiciman. 2013. To link or not to link? a study on end-to-end tweet entity linking. In NAACL-HLT 2013.

Xianpei Han, Le Sun, and Jun Zhao. 2011. Collective entity linking in web text: a graph-based method. In Proceedings of SIGIR 2011, pages 765–774.

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In Proceedings of ACL 2013.

Benjamin Heinzerling, Michael Strube, and Chin-Yew Lin. 2017. Trust, but verify! Better entity linking through automatic verification. In EACL.

Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill™: A bayesian skill rating system. In Proceedings of NIPS 2006, pages 569–576.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of EMNLP 2011.

Hongzhao Huang, Yunbo Cao, Xiaojiang Huang, Heng Ji, and Chin-Yew Lin. 2014. Collective tweet wikification based on semi-supervised graph regularization. In Proceedings of ACL 2014.

Heng Ji and Ralf Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL 2008.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016, pages 260–270.

Xiaohua Liu, Yitong Li, Haocheng Wu, Ming Zhou, Furu Wei, and Yi Lu. 2013. Entity linking for tweets. In ACL (1), pages 1304–1311.

Zhengdong Lu and Hang Li. 2013. A deep architecture for matching short texts. In Proceedings of NIPS 2013, pages 1367–1375.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015.

Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. 2012. Adding semantics to microblog posts. In Proceedings of WSDM 2012, pages 563–572.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119.

David N. Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of CIKM 2008.

Feng Nie, Yunbo Cao, Jinpeng Wang, Chin-Yew Lin, and Rong Pan. 2018. Mention and entity description co-attention for entity disambiguation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018.

Michael Schuhmacher and Simone Paolo Ponzetto. 2014. Knowledge-based graph document modeling. In Proceedings of CIKM 2014, pages 543–552.

Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2015. Modeling mention, context and entity with neural networks for entity disambiguation. In Proceedings of IJCAI 2015.

Katrin Weller, Aba-Sah Dadzie, and Danica Radovanovic. 2016. Making sense of microposts (#microposts2016) social sciences track. In Proceedings of the 6th Workshop on ‘Making Sense of Microposts’ co-located with the 25th International World Wide Web Conference (WWW 2016), Montréal, Canada, April 11, 2016, pages 29–32.

Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017a. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of SIGIR 2017, pages 55–64.

Chenyan Xiong, Russell Power, and Jamie Callan. 2017b. Explicit semantic ranking for academic search via knowledge graph embedding. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 1271–1279.

Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method.

Zhicheng Zheng, Fangtao Li, Minlie Huang, and Xiaoyan Zhu. 2010. Learning to link entities with knowledge base. In NAACL-HLT 2010.
