Language and Domain Independent Entity Linking with Quantified Collective Validation

Han Wang∗,1, Jin Guang Zheng∗,2, Xiaogang Ma1, Peter Fox1, and Heng Ji2
1 Tetherless World Constellation, 2 Computer Science Department
Rensselaer Polytechnic Institute, Troy, NY, USA
{wangh17, zhengj6, max7, pfox, jih}@rpi.edu
∗ These authors contributed equally to this work.

Abstract

Linking named mentions detected in a source document to an existing knowledge base provides disambiguated entity referents for the mentions. This allows better document analysis, knowledge extraction, and knowledge base population. Most of the previous research extensively exploited the linguistic features of the source documents in a supervised or semi-supervised way. These systems therefore cannot be easily applied to a new language or domain. In this paper, we present a novel unsupervised algorithm named Quantified Collective Validation that avoids excessive linguistic analysis on the source documents and fully leverages the knowledge base structure for the entity linking task. We show that our approach achieves state-of-the-art English entity linking performance and demonstrate successful deployment in a new language (Chinese) and two new domains (Biomedical and Earth Science). Experiment datasets and a system demonstration are available at http://tw.rpi.edu/web/doc/hanwang_emnlp_2015 for research purposes.

1 Introduction and Motivation

The entity linking (EL) task aims at analyzing each named entity mention in a source document and linking it to its referent in a knowledge base (KB). Consider the following example: "One day after released by the Patriots, Florida born Caldwell visited the Jets...... The Jets have six receivers on the roster: Cotchery, Coles, ...". Here "Caldwell" is an ambiguous mention, because not only are there thousands of people with different professions named "Caldwell", but even among football players, as most people would recognize from the context, there are several "Caldwell"s who are or were associated with either "the Patriots" or "the Jets". An EL system should be able to disambiguate the mention by carefully examining the context and then identify the correct KB referent, which is Reche Caldwell in this case.

Although EL has attracted a lot of community attention in recent years, most research efforts have focused on developing systems that are only effective for generic English corpora. When these systems are migrated to a new language or domain, their performance usually suffers a noticeable decline due to the following reasons: 1) State-of-the-art EL systems have developed comprehensive linguistic features from the source documents to generate advanced representations of the mentions and their context. While this methodology has proved rewarding for a resource-rich language such as English, it prevents the systems from being adopted for a new language, especially one with limited linguistic resources. One can imagine that it would be very difficult, if not impossible, for an English EL system that benefits from part-of-speech tagging, dependency parsing, and named entity recognition to be deployed to a new language such as Chinese that has quite different linguistic characteristics. 2) The current EL approaches mostly target people, organizations, and geo-political entities, which are widely present in a general KB such as Wikipedia. However, domain-specific EL tends to pay more attention to entities beyond the above three types. For instance, in the biomedical science domain, protein is a major class of entities that greatly interests scientists. Conventional EL systems are very likely to fail in linking protein mentions in the text due to the lack of labeled training data. Moreover, their reliance on general reference KBs seems insufficient for a specific domain. Take "A20", a type of protein, as an example. Wikipedia has more than a few items listed under the name "A20", and their types range from aircraft to roads. This diversified information inevitably introduces noise for a biomedical EL application.

One potential solution to tackle these limitations is, instead of concentrating on the source documents, to conduct a more deliberate study of the KB. Structured KBs such as DBpedia (http://wiki.dbpedia.org) typically offer detailed descriptions of entities, a large collection of named relations between entities, and a growing number of multi-lingual entity surface forms. By embracing this ready-for-use information and these linked structures, we are able to obtain sufficient contextual information for disambiguation without generating a full list of linguistic features from the source documents, and therefore eliminate the language dependency. Moreover, there currently exist numerous publicly available domain ontology repositories such as BioPortal (http://bioportal.bioontology.org) and OBO Foundry (http://www.obofoundry.org) which provide significantly more domain knowledge than general KBs for EL to leverage. By incorporating these domain ontologies, we can easily increase the entity coverage and reduce noise when deploying EL in various new domains.

In order to make the most of the KB structure, the mention context should be matched against the KB such that the relevant KB information can be extracted. A collective way of aligning co-occurring mentions to the KB graph has proved to be a successful strategy to better represent the source context (Pennacchiotti and Pantel, 2009; Fernandez et al., 2010; Cucerzan, 2011; Han et al., 2011; Ratinov et al., 2011; Dalton and Dietz, 2013; Zheng et al., 2014; Pan et al., 2015). We take a further step and consider quantitatively differentiating entity relations in the KB in order to evaluate entity candidates more precisely. Meanwhile, we jointly validate these candidates by aligning them back to the source context and integrating multiple ranking results. This novel EL framework deeply exploits the KB structure with a lightweight representation of the source context, and thus enables a smooth migration to new languages and domains.

The main novel contributions of this paper are summarized as follows: 1) We design an unsupervised EL algorithm, namely Quantified Collective Validation (QCV), that builds KB entity candidate graphs with quantified relations for the purpose of collective disambiguation and inference. 2) We develop a procedure for building language and domain independent EL systems by incorporating various ontologies into the QCV component. 3) We demonstrate that our system is able to achieve state-of-the-art performance in English EL, and that it can also produce promising results for Chinese EL as well as EL in Biomedical Science and Earth Science.
2 Baseline Collective EL

As a baseline, we adopt a competitive unsupervised collective EL system (Zheng et al., 2014) utilizing structured KBs. It defines entropy-based weights for the KB relations and embeds them in a two-step candidate ranking process to produce the EL results.

Structured KB Terminologies: In a structured KB, a fact is usually expressed in the form of a triple (e_h, r, e_t), where e_h and e_t are called the head entity and the tail entity, respectively, and r is the relation between e_h and e_t.

Entropy-Based KB Relation Weights: The goal is to leverage the various levels of granularity of KB relations. The calculation of the relation weight H(r) is given in Equation (1):

    H(r) = -\sum_{e_t \in E_t(r)} P(e_t) \log P(e_t)    (1)

where E_t(r) is the tail entity set for r in the KB, and P(e_t) is the probability of e_t appearing as the tail entity for r in the KB.
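To make Equation (1) concrete, the following minimal Python sketch computes the entropy-based weight of every relation from a list of (head, relation, tail) triples. It only illustrates the formula, not the authors' implementation; the toy triples and all names are our own.

```python
import math
from collections import defaultdict

def relation_weights(triples):
    """Entropy-based relation weight H(r) of Equation (1), computed from KB
    facts given as (head, relation, tail) triples."""
    tails_per_relation = defaultdict(list)
    for _head, relation, tail in triples:
        tails_per_relation[relation].append(tail)

    weights = {}
    for relation, tails in tails_per_relation.items():
        total = len(tails)
        counts = defaultdict(int)
        for tail in tails:
            counts[tail] += 1
        # H(r) = -sum over tail entities of P(e_t) * log P(e_t)
        weights[relation] = -sum(
            (n / total) * math.log(n / total) for n in counts.values()
        )
    return weights

# Toy KB: "birth place" has a more spread-out tail distribution than
# "wiki link" here, so Equation (1) assigns it a larger weight.
toy_triples = [
    ("Reche_Caldwell", "birth place", "Florida"),
    ("Andre_Caldwell", "birth place", "Ohio"),
    ("Reche_Caldwell", "wiki link", "New_England_Patriots"),
    ("Andre_Caldwell", "wiki link", "New_England_Patriots"),
]
print(relation_weights(toy_triples))
```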
Salience Ranking: As the first ranking step, we examine the candidates without the context and prefer those with higher importance in the KB. Equation (2) computes the salience score Sa(c) for a candidate c:

    Sa(c) = \sum_{r \in R(c),\, e_t \in E_t(r)} H(r) \frac{Sa(e_t)}{L(e_t)}    (2)

where R(c) is the relation set for c in the KB; H(r) is given by Equation (1); E_t(r) is the tail entity set with c being the head entity and r being the connecting relation in the KB; and L(e_t) denotes the cardinality of the tail entity set with e_t being the head entity in the KB. Sa(c) is recursively computed until convergence.
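The recursive salience score of Equation (2) can be approximated with a simple fixed-point iteration, sketched below. The paper does not spell out initialization, normalization, or the convergence test, so the uniform start and the fixed iteration cap here are assumptions.

```python
from collections import defaultdict

def salience_scores(triples, rel_weights, iterations=20):
    """Fixed-point evaluation of the salience score Sa(c) in Equation (2).
    `triples` are (head, relation, tail) facts; `rel_weights` maps each
    relation r to its entropy weight H(r)."""
    out_edges = defaultdict(list)        # head entity -> [(r, tail), ...]
    entities = set()
    for head, relation, tail in triples:
        out_edges[head].append((relation, tail))
        entities.update((head, tail))

    sa = {e: 1.0 for e in entities}      # assumed uniform initialization
    for _ in range(iterations):          # assumed cap instead of a convergence test
        updated = {}
        for c in entities:
            # Sa(c) = sum of H(r) * Sa(e_t) / L(e_t) over c's outgoing facts,
            # where L(e_t) counts the tails that have e_t as their head.
            updated[c] = sum(
                rel_weights.get(r, 0.0) * sa[t] / max(len(out_edges[t]), 1)
                for r, t in out_edges[c]
            )
        sa = updated
    return sa
```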

Collective Ranking: The similarity Sim^F(m, c) between a candidate c and its mention m is defined in Equation (3) as the final ranking score:

    Sim^F(m, c) = \alpha \cdot JS(m, c) \cdot Sa(c) + \beta \sum_{r \in R(c)} \sum_{n \in E_t(r) \cap C(m)} H(r) \cdot Sa(n)    (3)

where JS(m, c) is the Jaccard similarity between the string surface forms of m and c; Sa(c) and Sa(n) are both evaluated by Equation (2); C(m) denotes the candidate set for mention m; and \alpha and \beta are hyperparameters.
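Below is a sketch of the baseline's final ranking score in Equation (3). How the pool C(m) is assembled is not detailed in this excerpt, so the `candidate_pool` argument and the α, β defaults are assumptions of ours.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity, used as JS(m, c)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def baseline_score(mention, candidate, out_edges, rel_weights, salience,
                   candidate_pool, alpha=0.5, beta=0.5):
    """Final baseline ranking score of Equation (3): surface similarity scaled
    by salience, plus the weighted salience of the candidate's KB neighbors
    that also appear in the candidate pool C(m)."""
    score = alpha * jaccard(mention, candidate) * salience.get(candidate, 0.0)
    for relation, tail in out_edges.get(candidate, []):
        if tail in candidate_pool:
            score += beta * rel_weights.get(relation, 0.0) * salience.get(tail, 0.0)
    return score
```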

3 Quantified Collective Validation

Incorporating the KB relation weighting mechanism of the baseline system, our QCV algorithm constructs a number of candidate graphs for a given set of collaborative mentions, and then performs a two-level ranking followed by a collective validation on those candidate graphs to acquire the linking results. Because this procedure minimally relies on linguistic analysis of the source documents and mainly uses the KB structure, which by nature stays detached from any specific language or domain, we claim that QCV comes with language and domain independence.

3.1 Candidate Graph Construction

The KB entity candidate graphs are constructed based on a mention context graph and a KB graph. We introduce them in order as follows.

Mention Context Graph: To avoid abusing linguistic knowledge from the source documents, we construct a mention context graph G_m that simply involves mention co-occurrence. Figure 1 depicts a constructed G_m for the Caldwell example at the beginning of Section 1. In this figure, the mentions "New York Jets", "Cotchery" and "Coles" are brought into G_m through the coreference between "Jets" and "New York Jets", since the three of them are outside the context window of "Caldwell", "Florida", "Patriots", and "Jets". G_m contains a set of vertices representing the mentions extracted from the source document and a set of undirected edges. There is an edge between two mention vertices if both of them fall into a context window of width w_m in the source document. Ideally, w_m should cover a single discourse according to the one-sense-per-discourse assumption (Gale et al., 1992), but for simplicity we heuristically set w_m to be 7 sentences wide as a hyperparameter. Two mention vertices are connected via a dashed edge if they are coreferential but are not located in the same context window. Here we determine coreference by performing substring matching and abbreviation expansion. The dashed edge indicates that the out-of-context coreferential mention, together with its neighbors, will be indirectly included in G_m as extended context to later facilitate the candidate graph collective validation. Note that all of these loose settings comply with our intention of generating a lightweight source context representation born with domain and language independence.

Figure 1: Mention context graph for the Caldwell example.
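The sketch below builds the solid and dashed edges of G_m from mention positions. The 7-sentence window is taken from the text; treating the window as a simple sentence-distance threshold and using bare substring matching (without the abbreviation expansion mentioned above) are simplifications of ours.

```python
from itertools import combinations

def build_mention_graph(mentions, window=7):
    """Build the edges of the mention context graph G_m. `mentions` is a list
    of (surface_form, sentence_index) pairs. Solid edges connect mentions that
    fall within a `window`-sentence span; dashed edges connect coreferential
    mentions (here: bare substring matching) that fall outside it."""
    solid, dashed = set(), set()
    for (m1, s1), (m2, s2) in combinations(mentions, 2):
        if abs(s1 - s2) < window:
            solid.add((m1, m2))
        elif m1.lower() in m2.lower() or m2.lower() in m1.lower():
            dashed.add((m1, m2))   # out-of-window coreference
    return solid, dashed

# Toy run on the Caldwell example (sentence indices are invented):
mentions = [("Patriots", 0), ("Florida", 0), ("Caldwell", 0), ("Jets", 0),
            ("New York Jets", 9), ("Cotchery", 9), ("Coles", 9)]
solid, dashed = build_mention_graph(mentions)
# ("Jets", "New York Jets") comes out as the single dashed (coreference) edge.
```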

KB Graph: A structured KB such as DBpedia can be represented as a weighted graph G_k that consists of a set of vertices representing the entities and a set of directed edges labeled with relations between the entities. The weights of the relations are computed using Equation (1). In order to further enrich the KB relations, we add a type of relation named "wiki link" between two entities if one of them appears in the Wikipedia article of the other. Figure 2 presents a subgraph of the DBpedia KB graph containing the relevant entities in the Caldwell example.

Figure 2: KB graph for the Caldwell example. Edges are labeled with relations such as "birth place", "former team", and "wiki link", together with their entropy-based weights.

Candidate Graph: The candidate graph is a set of graphs G_c^i (i = 1, 2, ...) used for computing ranking scores for the KB entity candidates. For each of the mentions extracted from the source context, we first select a list of entity candidates from G_k with heuristic rules such as fuzzy string matching, synonyms, Wikipedia redirects, etc. Then we pick one candidate from each of the mentions to constitute the vertices of a G_c^i. In each G_c^i, we add an edge between two vertices if they are connected in G_k by some relation r and their mentions are connected in G_m. The edge label r from G_k is transferred to G_c^i. Upon completion, every G_c^i represents a collective linking solution to the given mention set. Figure 3 shows three of the constructed candidate graphs for the Caldwell example. One can see that the first two graphs are very likely to be good solutions, since they inherit many of the relation edges from G_k, while the third one is probably a poor collection, as its candidates barely connect to one another. In the next section, we will more formally reveal how to rank these candidate graphs to obtain the optimal linking results.

Figure 3: Candidate graphs for the Caldwell example. Graphs (A) and (B) link "Caldwell" to Reche Caldwell and Andre Caldwell, respectively, and remain well connected; graph (C) links the mentions to largely unrelated entities such as James Caldwell (clergyman) and Newcastle Jets FC.
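A sketch of how one candidate graph G_c^i can be produced per combination of candidates is shown below. The exhaustive Cartesian product is only workable for short candidate lists, and the data-structure layout (dictionaries keyed by mention and by entity pair) is our assumption, not the authors' implementation.

```python
from itertools import product

def enumerate_candidate_graphs(cands_per_mention, kb_edges, mention_edges):
    """Yield one candidate graph G_c^i per combination of candidates.
    `cands_per_mention` maps a mention to its candidate list, `kb_edges` maps
    an ordered (head, tail) entity pair to its relation label in G_k, and
    `mention_edges` is the edge set of G_m."""
    mentions = list(cands_per_mention)
    for choice in product(*(cands_per_mention[m] for m in mentions)):
        assignment = dict(zip(mentions, choice))   # mention -> chosen candidate
        edges = []
        for m1, m2 in mention_edges:
            c1, c2 = assignment.get(m1), assignment.get(m2)
            relation = kb_edges.get((c1, c2)) or kb_edges.get((c2, c1))
            if relation is not None:
                # keep the KB edge only if the two mentions co-occur in G_m
                edges.append((c1, relation, c2))
        yield assignment, edges
```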

3.2 Candidate Ranking

With the constructed candidate graphs, QCV performs two levels of ranking. First, it uses Equation (2) to compute the candidates' salience scores as a priori ranking. Then it compares each candidate graph with the mention context graph and evaluates their vertex set similarity for context similarity ranking. Finally, by considering the relation weights in the candidate graphs as well as the previous ranking scores, QCV collectively validates all the candidates and assembles the linking results. Below we focus on introducing the context similarity ranking and the collective validation, since the salience ranking resembles that of our baseline system.

Context Similarity Ranking: As shown in Figure 3, among the constructed candidate graphs, some contain many connected vertices while some are otherwise quite disconnected. Intuitively, we would like to measure this structural difference by comparing each candidate graph G_c^i with its mention context graph G_m. Granted, we can only assert co-occurrence between two connected mentions in G_m, but it should be of great probability that two co-occurring mentions have their entity referents connected by some relation in the KB. In other words, the more a G_c^i is structurally similar to its G_m, the better the candidates in this G_c^i represent their mentions in G_m. Therefore, we define the context similarity S_m(m_c, c) between a candidate c and its mention m_c using Jaccard similarity in Equation (4):

    S_m(m_c, c) = \frac{|\Theta^{G_m}(m_c) \cap \Theta^{G_c^i}(c)|}{|\Theta^{G_m}(m_c) \cup \Theta^{G_c^i}(c)|}    (4)

where \Theta^{G_m}(m_c) and \Theta^{G_c^i}(c) denote m_c's neighbor set in G_m and c's neighbor set in G_c^i, respectively. The intersection takes the candidates of those mentions in \Theta^{G_m}(m_c) that appear in \Theta^{G_c^i}(c), and the union is equivalent to \Theta^{G_m}(m_c) due to the way we construct G_c^i. We rank each G_c^i using the summation of the context similarity of every c in G_c^i. Note that our baseline system uses Jaccard similarity to achieve approximate string matching between the surface forms of a mention and a candidate, while we alternatively use it to capture the graphs' structural similarity. After ranking with the context similarity, those G_c^i with more connected vertices, such as Figure 3A and Figure 3B, will get closer to the top of the ranked candidate graph list.

Candidate Graph Collective Validation: Besides the salience, the context similarity provides another ranking score for each candidate c in G_c^i, and it promotes those candidates remaining connected in G_c^i. However, it fails to differentiate how two candidates are connected. In Figure 3A, Reche Caldwell is a former player of New England Patriots, while in Figure 3B, Andre Caldwell's Wikipedia article merely includes a hyperlink pointing to New England Patriots. The former seems a "tighter" relation than the latter. Although these two distinct relations imply that the two candidate pairs are related with different relation types, the context similarity rankings for these two candidate graphs are identical. Based on this observation, assuming that a "tighter" relation between two candidates is more likely to be an appropriate representation of the relation between their co-occurring mentions in the source context, we propose a novel validation step that not only considers the two previous ranking scores of each candidate but also quantitatively examines the relations between candidates. We transfer the calculated relation weights from G_k to G_c^i as positive indicators of how tightly two candidates are related, and then define the composite graph weight W(G_c^i) for each G_c^i in Equation (5) as the final ranking metric:

    W(G_c^i) = \sum_{c \in V(G_c^i)} Sa(c)\, S_m(m_c, c) + \sum_{r \in E(G_c^i)} H(r)    (5)

where V(G_c^i) and E(G_c^i) are the vertex set and the edge set of G_c^i; Sa(c), S_m(m_c, c), and H(r) are given by Equation (2), Equation (4), and Equation (1), respectively. With this composite graph weight, since the relation "former team" has a greater weight than "wiki link", the candidate graph in Figure 3A outweighs that in Figure 3B, and is therefore ranked to the top.
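The two QCV-specific scores, the context similarity of Equation (4) and the composite graph weight of Equation (5), are sketched below. The sketch assumes the data structures from the earlier examples (neighbor sets for G_m and G_c^i, the per-graph candidate assignment, and the relation-labelled edge list); it is an illustration, not the authors' code.

```python
def context_similarity(mention, candidate, mention_neighbors, cand_neighbors,
                       assignment):
    """Equation (4): Jaccard similarity between the mention's neighborhood in
    G_m (mapped through the per-graph candidate `assignment`) and the
    candidate's neighborhood in G_c^i."""
    mapped = {assignment[m] for m in mention_neighbors.get(mention, set())
              if m in assignment}
    cand_nb = cand_neighbors.get(candidate, set())
    union = mapped | cand_nb
    return len(mapped & cand_nb) / len(union) if union else 0.0

def composite_graph_weight(assignment, edges, salience, rel_weights,
                           mention_neighbors, cand_neighbors):
    """Equation (5): W(G_c^i) = sum_c Sa(c) * S_m(m_c, c) + sum_r H(r).
    `edges` is the (candidate, relation, candidate) list of this G_c^i."""
    vertex_term = sum(
        salience.get(c, 0.0) *
        context_similarity(m, c, mention_neighbors, cand_neighbors, assignment)
        for m, c in assignment.items()
    )
    edge_term = sum(rel_weights.get(r, 0.0) for _, r, _ in edges)
    return vertex_term + edge_term
```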
4 Experiments

In this section, we first show QCV's performance on generic English corpora and compare it with our baseline as well as other state-of-the-art EL systems. Then we move to a new language (Chinese) and two new domains (Biomedical Science and Earth Science) to demonstrate the language and domain independent nature of our algorithm.

4.1 EL on Generic English Corpora

For this evaluation, we used the TAC-KBP2013 EL dataset (http://www.nist.gov/tac/2013/KBP/data.html), which contains 2,190 mentions extracted from English newswire, web blogs, and discussion forums. We selected a subset of 1,090 linkable mentions that have entity referents in the KB for our experiment. DBpedia 3.9, which was generated from the Wikipedia dump in early 2013 and includes more than 4 million entities and more than 470 million facts (http://wiki.dbpedia.org/services-resources/datasets/data-set-39), was used as our KB. We followed the KBP EL track in using B-Cubed+ (Ji et al., 2011) as the evaluation metric. Table 1 presents the results of QCV, our baseline system, and the top 3 supervised and top 3 unsupervised participant systems of the TAC-KBP2013 EL track (due to NIST policy, the participant system names are not revealed; their scores are taken from http://www.nist.gov/tac/publications/2013/papers.html).

System                     B^3+ F1
Supervised 1st             0.724
Supervised 2nd             0.721
Supervised 3rd             0.718
Unsupervised 1st           0.632
Unsupervised 2nd           0.576
Unsupervised 3rd           0.573
Baseline (unsupervised)    0.697
QCV (unsupervised)         0.749

Table 1: Performance on the TAC-KBP2013 EL Dataset (1,090 linkable mentions).

As shown in Table 1, QCV not only substantially outperforms the best unsupervised systems but also beats the best supervised systems from the KBP participants. In order to understand this notable advancement, we broke our system down into components and evaluated them cumulatively on the same dataset. The experiment results are summarized in Table 2.

Components       B^3+ P   B^3+ R   B^3+ F1
SR               0.680    0.598    0.636
SR + CS          0.699    0.624    0.659
SR + CS + CV     0.789    0.712    0.749

Table 2: QCV Performance by Component.

In Table 2, SR, CS, and CV correspond to the Salience Ranking, the Context Similarity Ranking, and the Collective Validation in our QCV algorithm, respectively. It can be seen that SR alone already outperforms the best KBP unsupervised systems from Table 1. This is mainly attributed to the entropy-based relation weights, which inject the impact of different relations into the entity salience.
Notwithstanding being somewhat effective, SR depends solely on the KB and plays its role without the source context. It should be straightforward that the system performance improves after enabling CS, since the source context has been incorporated. However, it was a little puzzling that the performance boost from enabling CS turned out to be relatively small. We took a careful look at the intermediate experiment results and discovered that although CS did not produce many more correct linking results than SR did, it did promote a great number of good candidates to the top of the ranking list. For example, in the Caldwell case, CS successfully raised the rankings of the context-related candidates such as Reche Caldwell, Andre Caldwell, and Jim Caldwell, despite the fact that it delivered Andre Caldwell instead of Reche Caldwell as the final linking result. This convincingly implies that CS is able to capture the context of the target mentions well, but meanwhile it is deficient in recognizing the subtle contextual differences among similar candidates. In Table 2 there is a significant performance gain after enabling CV. As described in Section 3.2, CV collectively validates the candidates of the target mention "Caldwell" and the mentions in its context such as "Florida", "Patriots", and "Jets" by integrating their SR and CS scores as well as the weights of the KB relations between them. This improvement is therefore reasonably substantial.

By investigating the remaining errors, we identified several potential causes: 1) Our system occasionally could not capture enough context for the target mention. This happened more frequently for web blogs and discussion forums, where the language was informal and casual. Without any linguistic analysis on the source documents, it was difficult for us to extract additional context words. 2) Our simple coreference rules sometimes failed to work correctly and introduced false candidates, which, without clear context to disambiguate, could lead to linking errors. 3) Our KB had limited knowledge about some entities, in the sense that certain relations were missing. This kept us from creating necessary links in the candidate graphs and from further effectively validating the graphs.

4.2 EL on Generic Chinese Corpora

Using Chinese as a case study, we evaluate the language portability of our approach. We used the TAC-KBP2012 Chinese EL dataset (http://www.nist.gov/tac/2012/KBP/data.html), and selected a subset of 1,240 linkable mentions out of the total 2,122 mentions extracted from Chinese newswire, web blogs, and discussion forums. For the KB, we still used DBpedia because it contains multilingual surface forms for its entities. For instance, the entity Barack Obama has surface forms in over 30 languages, including the Chinese one: "贝拉克·奥巴马". This cross-lingual surface form mapping naturally provides us with a convenient translation tool. Table 3 shows the linking performance comparison among QCV, our baseline system, and the top 3 participant systems of the KBP Chinese EL track. Again, we employed the B-Cubed+ metric.

System                                        B^3+ F1
Clarke et al. (2012) (supervised)             0.493
Monahan and Carpenter (2012) (supervised)     0.660
Fahrni et al. (2012) (supervised)             0.736
Baseline (unsupervised)                       0.648
QCV (unsupervised)                            0.671

Table 3: Performance on the TAC-KBP2012 Chinese EL Dataset (1,240 linkable mentions).
As shown in Table 3, the best performance is achieved by Fahrni et al. (2012), a supervised system using over 20 fine-tuned features and many linguistic resources. In contrast, our QCV is an unsupervised approach that does not use any labeled data or linguistic resources. During the error analysis, we found that in this dataset multiple mentions are often variants of the surface form of a single KB entity. For example, "奥巴马" and "欧巴马", being just different Chinese transliterations, both refer to "Obama". This fact tends to result in a low recall for our system, because one or more of the mention variants may not exist in the KB. We decided to heuristically apply substring matching, in addition to the Wikipedia redirection mapping, to boost the recall. However, as one can imagine, this simple strategy will impair the system precision due to the introduced noise. Take "奥巴马" again, for example. If we only match its second and third characters, "欧巴马" will be correctly picked, but "巴马镇" (a small town in China) will also be falsely included. Fortunately, our QCV algorithm was able to select and rank candidates complying with the source context. Consequently, most of this kind of noise was filtered out, and we thus could produce balanced precision and recall.

We acknowledge that, without performing deeper linguistic analysis on the source documents, the cross-language surface form mapping of the KB plays a crucial role in our approach. One can replace it with any machine translation product, which, however, is not always available, especially for a low-resource language. We should take advantage of existing KBs where such cross-lingual mappings have already been widely created. The latest DBpedia, for instance, provides localized versions in 125 languages (http://wiki.dbpedia.org/about).

4.3 EL in Biomedical Science

To demonstrate the domain portability of our approach, we first take the biomedical science domain as a case study. We conducted our experiment using the evaluation dataset created by Zheng et al. (2014), which contains 208 linkable mentions extracted from several biomedical publications. We built our KB with over 300 domain ontologies downloaded from BioPortal. Table 4 compares the linking accuracy of QCV and our baseline system.

System     Correct   Total   Accuracy
Baseline   173       208     83.17%
QCV        177       208     85.10%

Table 4: Biomedical Science EL Performance.

As shown in Table 4, our approach achieves performance similar to our baseline system, which is the state of the art to our knowledge. However, we were curious why QCV did not improve over the baseline system in the biomedical domain as much as it did in the general domain. After some in-depth analysis of the experiment results, we discovered that in this dataset the candidates of the related mentions (i.e., those mentions within the same context window) mostly have similar relations in the KB. In other words, for each mention, the candidate entity types are not as diverse as those in the general domain. As a consequence, the collective validation step in QCV does not have much effect, since the weights of the involved relations are quite close to one another. On such a dataset, the context similarity ranking plays the major part in the disambiguation, and QCV cannot function at its full power. Nonetheless, from the results we can see that our approach can be efficiently and effectively adapted to this new domain.

4.4 EL in Earth Science

Now we move to another new domain, Earth Science. As far as we know, we are the first to study EL in this domain. In order to create an evaluation dataset, our domain expert selected three scientific papers about Early Triassic discovery, Global Stratotype Section, and Triassic crisis, which are three different aspects of Earth Science related discovery, and then identified 296 mentions that can be linked to DBpedia entities. Table 5 presents the linking accuracy comparison between QCV and our baseline system. We can see that QCV provided significant gains.

System     Correct   Total   Accuracy
Baseline   221       296     74.66%
QCV        236       296     79.73%

Table 5: Earth Science EL Performance.

The linking errors were mainly caused by the following reasons: 1) As a general KB, DBpedia introduced certain noise into our domain-specific EL. For example, in Geology, the term "Beds" mostly refers to "Geology Bed", which is a division of a geologic formation.
But in general usage, "Beds" means the beds people sleep on. Being much more common in the KB, the latter had such a significantly higher salience score than the former that the final ranking score of our system got biased. 2) Some relations between Earth Science related entities are not clearly defined in DBpedia. For instance, in the geologic time scale, the period "Chattian" is immediately preceded by the period "Rupelian". An explicit relation such as "preceded by" should connect these two period entities. Instead, only a vague "wiki link" relation is present in our KB. This directly diminishes the differentiating power of our system on the KB relations.

It is worth mentioning that there exists a large number of well-established ontologies for different sub-domains of Earth Science. The SWEET ontologies (http://sweet.jpl.nasa.gov), for example, widely capture Earth and Environmental terminologies. By adopting these ontologies, we will be able to considerably improve our domain EL performance, and the benefits of EL in the domain will be further revealed.

4.5 System Complexity

We indexed our KB and ontologies in the format of triples using Apache Lucene (https://lucene.apache.org/) such that retrieving the entity candidates of a mention is O(1). We precomputed all the entropy-based relation weights and entity salience scores with complexities of O(n_r · n_e) and O(n_e · k), respectively, where n_r is the number of KB relations, n_e is the number of KB entities, and k is the number of iterations it took for the salience scores to converge. For the final QCV score computation, the upper bound of the computing time to link all the mentions in a document is O(n_m · n_c · n_nc · n_nm), where n_m is the number of linkable mentions in the document, n_c is the number of candidates for each mention, n_nc is the number of neighbor nodes of a candidate, and n_nm is the number of neighbors of a mention.
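As a rough illustration of the constant-time candidate lookup described above, here is a small in-memory surface-form index. The paper indexes triples with Apache Lucene; the plain dictionary, the example surface forms, and the function name are ours.

```python
from collections import defaultdict

def build_surface_index(surface_forms):
    """In-memory stand-in for the candidate index: maps a lowercased surface
    form to the set of entities that carry it, so lookup is a dictionary hit."""
    index = defaultdict(set)
    for entity, names in surface_forms.items():
        for name in names:
            index[name.lower()].add(entity)
    return index

index = build_surface_index({
    "Reche_Caldwell": ["Reche Caldwell", "Caldwell"],
    "Andre_Caldwell": ["Andre Caldwell", "Caldwell"],
})
candidates = index["caldwell"]   # retrieves both Caldwell entities in O(1)
```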
5 Related Work

In recent years, collective inference methods for EL have become increasingly popular. Many efforts have been devoted to encoding linguistic features from the source documents in order to precisely select collaborator mentions for collective inference. These features include topic modeling (Xu et al., 2012; Cassidy et al., 2012), relation constraints (Cheng and Roth, 2013), coreferential chaining (Nguyen et al., 2012; Huang et al., 2014), and dependency restrictions (Ling et al., 2014). Some recent work utilized multi-layer linguistic analysis integration to capture contextual properties for better mention collection (Pan et al., 2015). While many of these approaches have proved to be effective, their dependency on deep linguistic knowledge makes it difficult to migrate them to a new language or domain. In contrast to these methods, we establish a very loose setting for the mention selection, and rely on the quantified information computed from the structured KB to collectively evaluate and validate the entity candidates. Since the KB is relatively universal across languages and domains, our approach is inherently language and domain independent.

Recent cross-lingual EL approaches can be divided into two types. The first type (McNamee et al., 2011; Cassidy et al., 2011; McNamee et al., 2012; Guo et al., 2012; Miao et al., 2013) translated entity mentions and source documents from the new language into English and then ran English mono-lingual EL to link to the English KB. The second type (Monahan et al., 2011; Fahrni and Strube, 2011; Fahrni et al., 2012; Monahan and Carpenter, 2012; Clarke et al., 2012; Fahrni et al., 2013) developed EL systems in the new language and used cross-lingual KB links to map the linking results back to the English KB. While the bottleneck of the former method usually lies in translation errors, the latter approach heavily relies on the linguistic resources and the KB of the new language. In comparison, our system mainly uses the English KB and a mention surface form mapping that can come either from translation or from cross-lingual KB links, and requires minimal linguistic resources from the new language.

There is a limited amount of research work in the literature that has focused solely on domain-specific EL (Zheng et al., 2014). In the biomedical domain, a few studies have addressed EL-related tasks such as scientific name discovery (Akella et al., 2012), gene name normalization (Hirschman et al., 2005; Fang et al., 2006; Dai et al., 2010), biomedical named entity recognition (Usami et al., 2011; Van Landeghem et al., 2012), and concept mention extraction (Tsai et al., 2013). The baseline system (Zheng et al., 2014) in this paper is the work most similar to ours in the sense of collectively aligning mentions to structured KBs. However, our system differs by integrating a context similarity ranking and a candidate validation to conduct a two-way collective inference with better performance.

6 Conclusions and Future Work

Language and domain independence is a new requirement for EL systems, and this capability is particularly welcome in low-resource language applications and among domain scientists. In this paper we demonstrated a high-performance EL approach that can be easily migrated to new languages and domains due to its minimal reliance on linguistic analysis and its deep utilization of structured KBs. In the future, we plan to improve the source document processing such that the system can better extract the mention context without involving extensive linguistic knowledge. We are also experimenting with our collective validation algorithm to incorporate the impact of more distant KB entities beyond just the neighbors.

7 Acknowledgement

This work was supported by the U.S. DARPA DEFT Program No. FA8750-13-2-0041, ARL NS-CTA No. W911NF-09-2-0053, NSF CAREER Award IIS-1523198, DARPA LORELEI, the AFRL DREAM project, and gift awards from IBM, Google, Disney and Bosch. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

L. M. Akella, C. N. Norton, and H. Miller. 2012. NetiNeti: Discovery of Scientific Names from Text Using Machine Learning Methods. BMC Bioinformatics, 13:211.

T. Cassidy, Z. Chen, J. Artiles, H. Ji, H. Deng, L. Ratinov, J. Zheng, J. Han, and D. Roth. 2011. CUNY-UIUC-SRI TAC-KBP2011 Entity Linking System Description. In Proceedings of Text Analysis Conference 2011.

T. Cassidy, H. Ji, L. Ratinov, A. Zubiaga, and H. Huang. 2012. Analysis and Enhancement of Wikification for Microblogs with Context Expansion. In Proceedings of the 25th International Conference on Computational Linguistics.

X. Cheng and D. Roth. 2013. Relational Inference for Wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

J. Clarke, Y. Merhav, G. Suleiman, S. Zheng, and D. Murgatroyd. 2012. Basis Technology at TAC 2012 Entity Linking. In Proceedings of Text Analysis Conference 2012.

S. Cucerzan. 2011. TAC Entity Linking by Performing Full-Document Entity Extraction and Disambiguation. In Proceedings of Text Analysis Conference 2011.

H. Dai, P. Lai, and R. T. Tsai. 2010. Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):412–420.

J. Dalton and L. Dietz. 2013. A Neighborhood Relevance Model for Entity Linking. In Proceedings of the 10th Conference on Open Research Areas in Information Retrieval.

A. Fahrni and M. Strube. 2011. HITS' Cross-lingual Entity Linking System at TAC 2011: One Model for All Languages. In Proceedings of Text Analysis Conference 2011.

A. Fahrni, T. Göckel, and M. Strube. 2012. HITS' Monolingual and Cross-lingual Entity Linking System at TAC 2012: A Joint Approach. In Proceedings of Text Analysis Conference 2012.

A. Fahrni, B. Heinzerling, T. Göckel, and M. Strube. 2013. HITS' Monolingual and Cross-lingual Entity Linking System at TAC 2013. In Proceedings of Text Analysis Conference 2013.

H. Fang, K. Murphy, Y. Jin, J. S. Kim, and P. S. White. 2006. Human Gene Name Normalization Using Text Matching with Automatically Extracted Synonym Dictionaries. In Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, pages 41–48.

N. Fernandez, J. A. Fisteus, L. Sanchez, and E. Martin. 2010. WebTLab: A Cooccurrence-Based Approach to KBP 2010 Entity-Linking Task. In Proceedings of Text Analysis Conference 2010.

W. A. Gale, K. W. Church, and D. Yarowsky. 1992. One Sense per Discourse. In Proceedings of the Fifth DARPA Speech and Natural Language Workshop.

Z. Guo, Y. Xu, F. Mesquita, D. Barbosa, and G. Kondrak. 2012. ualberta at TAC-KBP 2012: English and Cross-Lingual Entity Linking. In Proceedings of Text Analysis Conference 2012.

X. Han, L. Sun, and J. Zhao. 2011. Collective Entity Linking in Web Text: A Graph-Based Method. In Proceedings of the 34th Annual ACM SIGIR Conference.

L. Hirschman, M. Colosimo, A. Morgan, and A. Yeh. 2005. Overview of BioCreAtIvE Task 1B: Normalized Gene Lists. BMC Bioinformatics, 6.

H. Huang, Y. Cao, X. Huang, H. Ji, and C. Lin. 2014. Collective Tweet Wikification based on Semi-supervised Graph Regularization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

H. Ji, R. Grishman, and H. T. Dang. 2011. Overview of the TAC 2011 Knowledge Base Population Track. In Proceedings of Text Analysis Conference 2011.

X. Ling, S. Singh, and D. S. Weld. 2014. Context Representation for Named Entity Linking. In Proceedings of the 3rd Pacific Northwest Regional NLP Workshop.

P. McNamee, J. Mayfield, D. Lawrie, D. W. Oard, and D. Doermann. 2011. Cross-Language Entity Linking. In Proceedings of the 5th International Joint Conference on Natural Language Processing.

P. McNamee, V. Stoyanov, J. Mayfield, T. Finin, T. Oates, T. Xu, D. W. Oard, and D. Lawrie. 2012. HLTCOE Participation at TAC 2012: Entity Linking and Cold Start Knowledge Base Construction. In Proceedings of Text Analysis Conference 2012.

Q. Miao, R. Fang, Y. Meng, and S. Zhang. 2013. FRDC's Cross-lingual Entity Linking System at TAC 2013. In Proceedings of Text Analysis Conference 2013.

S. Monahan and D. Carpenter. 2012. Lorify: A Knowledge Base from Scratch. In Proceedings of Text Analysis Conference 2012.

S. Monahan, J. Lehmann, T. Nyberg, J. Plymale, and A. Jung. 2011. Cross-Lingual Cross-Document Coreference with Entity Linking. In Proceedings of Text Analysis Conference 2011.

H. Nguyen, H. Minh, T. Cao, and T. Nguyen. 2012. JVN-TDT Entity Linking Systems at TAC-KBP2012. In Proceedings of Text Analysis Conference 2012.

X. Pan, T. Cassidy, U. Hermjakob, H. Ji, and K. Knight. 2015. Unsupervised Entity Linking with Abstract Meaning Representation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

M. Pennacchiotti and P. Pantel. 2009. Entity Extraction via Ensemble Semantics. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).

L. Ratinov, D. Roth, D. Downey, and M. Anderson. 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

C. Tsai, G. Kundu, and D. Roth. 2013. Concept-Based Analysis of Scientific Literature. In Proceedings of CIKM.

Y. Usami, H. Cho, N. Okazaki, and J. Tsujii. 2011. Automatic Acquisition of Huge Training Data for Bio-medical Named Entity Recognition. In Proceedings of the BioNLP 2011 Workshop.

S. Van Landeghem, J. Björne, T. Abeel, B. De Baets, T. Salakoski, and Y. Van de Peer. 2012. Semantically Linking Molecular Entities in Literature through Entity Relationships. BMC Bioinformatics, 13.

J. Xu, Q. Lu, J. Liu, and R. Xu. 2012. NLPComp in TAC 2012 Entity Linking and Slot-Filling. In Proceedings of Text Analysis Conference 2012.

J. Zheng, D. Howsmon, B. Zhang, J. Hahn, D. McGuinness, J. Hendler, and H. Ji. 2014. Entity Linking for Biomedical Literature. In Proceedings of the ACM 8th International Workshop on Data and Text Mining in Bioinformatics, pages 3–4, New York, NY, USA. ACM.