Language and Domain Independent Entity Linking with Quantified Collective Validation

Han Wang∗,1, Jin Guang Zheng∗,2, Xiaogang Ma1, Peter Fox1, and Heng Ji2
1 Tetherless World Constellation, 2 Computer Science Department
Rensselaer Polytechnic Institute, Troy, NY, USA
{wangh17, zhengj6, max7, pfox, jih}@rpi.edu
∗ These authors contributed equally to this work.

Abstract

Linking named mentions detected in a source document to an existing knowledge base provides disambiguated entity referents for the mentions. This allows better document analysis, knowledge extraction, and knowledge base population. Most of the previous research extensively exploited the linguistic features of the source documents in a supervised or semi-supervised way. These systems therefore cannot be easily applied to a new language or domain. In this paper, we present a novel unsupervised algorithm named Quantified Collective Validation that avoids excessive linguistic analysis on the source documents and fully leverages the knowledge base structure for the entity linking task. We show that our approach achieves state-of-the-art English entity linking performance and demonstrate successful deployment in a new language (Chinese) and two new domains (Biomedical and Earth Science). Experiment datasets and a system demonstration are available at http://tw.rpi.edu/web/doc/hanwang_emnlp_2015 for research purposes.

1 Introduction and Motivation

The entity linking (EL) task aims at analyzing each named entity mention in a source document and linking it to its referent in a knowledge base (KB). Consider the following example: "One day after released by the Patriots, Florida born Caldwell visited the Jets...... The Jets have six receivers on the roster: Cotchery, Coles, ...". Here "Caldwell" is an ambiguous mention, because not only are there thousands of people with different professions named "Caldwell", but even among football players, as most people would recognize from the context, there are several "Caldwell"s who are or were associated with either "the Patriots" or "the Jets". An EL system should be able to disambiguate the mention by carefully examining the context and then identify the correct KB referent, which is Reche Caldwell in this case.

Although EL has attracted a lot of community attention in recent years, most research efforts have focused on developing systems that are only effective for generic English corpora. When these systems are migrated to a new language or domain, their performance usually suffers a noticeable decline due to the following reasons: 1) State-of-the-art EL systems have developed comprehensive linguistic features from the source documents to generate advanced representations of the mentions and their context. While this methodology has proved rewarding for a resource-rich language such as English, it prevents the systems from being adopted for a new language, especially one with limited linguistic resources. One can imagine that it would be very difficult, if not impossible, for an English EL system that benefits from part-of-speech tagging, dependency parsing, and named entity recognition to be deployed to a new language such as Chinese that has quite different linguistic characteristics. 2) The current EL approaches mostly target people, organizations, and geo-political entities, which are widely present in a general KB such as Wikipedia. However, domain-specific EL tends to pay more attention to entities beyond the above three types. For instance, in the biomedical science domain, protein is a major class of entities that greatly interests scientists. Conventional EL systems are very likely to fail in linking protein mentions in the text due to the lack of labeled training data. Moreover, their reliance on general reference KBs seems insufficient for a specific domain. Take "A20", a type of protein, as an example. Wikipedia has more than a few items listed under the name "A20", and their types range from aircraft to roads. This diversified information inevitably introduces noise for a biomedical EL application.

One potential solution to tackle these limitations is, instead of concentrating on the source documents, to conduct a more deliberate study of the KB. Structured KBs such as DBpedia (http://wiki.dbpedia.org) typically offer detailed descriptions of entities, a large collection of named relations between entities, and a growing number of multi-lingual entity surface forms. By embracing this ready-for-use information and these linked structures, we are able to obtain sufficient contextual information for disambiguation without generating a full list of linguistic features from the source documents, and therefore eliminate the language dependency. Moreover, there currently exist numerous publicly available domain ontology repositories such as BioPortal (http://bioportal.bioontology.org) and OBO Foundry (http://www.obofoundry.org) which provide significantly more domain knowledge than general KBs for EL to leverage. By incorporating these domain ontologies, we can easily increase the entity coverage and reduce noise when deploying EL in various new domains.

In order to make the most of the KB structure, the mention context should be matched against the KB such that the relevant KB information can be extracted. A collective way of aligning co-occurring mentions to the KB graph has proved to be a successful strategy to better represent the source context (Pennacchiotti and Pantel, 2009; Fernandez et al., 2010; Cucerzan, 2011; Han et al., 2011; Ratinov et al., 2011; Dalton and Dietz, 2013; Zheng et al., 2014; Pan et al., 2015). We take a further step and consider quantitatively differentiating entity relations in the KB in order to evaluate entity candidates more precisely. Meanwhile, we jointly validate these candidates by aligning them back to the source context and integrating multiple ranking results. This novel EL framework deeply exploits the KB structure with a lightweight representation of the source context, and thus enables a smooth migration to new languages and domains.

The main novel contributions of this paper are summarized as follows: 1) We design an unsupervised EL algorithm, namely Quantified Collective Validation (QCV), that builds KB entity candidate graphs with quantified relations for the purpose of collective disambiguation and inference. 2) We develop a procedure for building language and domain independent EL systems by incorporating various ontologies into the QCV component. 3) We demonstrate that our system is able to achieve state-of-the-art performance in English EL, and that it can also produce promising results for Chinese EL as well as EL in Biomedical Science and Earth Science.
2 Baseline Collective EL

As a baseline, we adopt a competitive unsupervised collective EL system (Zheng et al., 2014) utilizing structured KBs. It defines entropy-based weights for the KB relations and embeds them in a two-step candidate ranking process to produce the EL results.

Structured KB Terminologies: In a structured KB, a fact is usually expressed in the form of a triple (e_h, r, e_t), where e_h and e_t are called the head entity and the tail entity, respectively, and r is the relation between e_h and e_t.

Entropy-Based KB Relation Weights: The goal is to leverage the various levels of granularity of KB relations. The calculation of the relation weight H(r) is given in Equation (1):

    H(r) = -\sum_{e_t \in E_t(r)} P(e_t) \log P(e_t)    (1)

where E_t(r) is the tail entity set for r in the KB, and P(e_t) is the probability of e_t appearing as the tail entity for r in the KB.
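To make Equation (1) concrete, the following minimal Python sketch computes the entropy-based weight of every relation from a list of (head, relation, tail) triples. It only illustrates the formula, not the authors' implementation; the toy triples and all names are our own.

```python
import math
from collections import defaultdict

def relation_weights(triples):
    """Entropy-based relation weight H(r) of Equation (1), computed from KB
    facts given as (head, relation, tail) triples."""
    tails_per_relation = defaultdict(list)
    for _head, relation, tail in triples:
        tails_per_relation[relation].append(tail)

    weights = {}
    for relation, tails in tails_per_relation.items():
        total = len(tails)
        counts = defaultdict(int)
        for tail in tails:
            counts[tail] += 1
        # H(r) = -sum over tail entities of P(e_t) * log P(e_t)
        weights[relation] = -sum(
            (n / total) * math.log(n / total) for n in counts.values()
        )
    return weights

# Toy KB: "birth place" has a more spread-out tail distribution than
# "wiki link" here, so Equation (1) assigns it a larger weight.
toy_triples = [
    ("Reche_Caldwell", "birth place", "Florida"),
    ("Andre_Caldwell", "birth place", "Ohio"),
    ("Reche_Caldwell", "wiki link", "New_England_Patriots"),
    ("Andre_Caldwell", "wiki link", "New_England_Patriots"),
]
print(relation_weights(toy_triples))
```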
Salience Ranking: As the first ranking step, we examine the candidates without the context and prefer those with higher importance in the KB. Equation (2) computes the salience score Sa(c) for a candidate c:

    Sa(c) = \sum_{r \in R(c),\, e_t \in E_t(r)} H(r) \frac{Sa(e_t)}{L(e_t)}    (2)

where R(c) is the relation set for c in the KB; H(r) is given by Equation (1); E_t(r) is the tail entity set with c being the head entity and r being the connecting relation in the KB; and L(e_t) denotes the cardinality of the tail entity set with e_t being the head entity in the KB. Sa(c) is recursively computed until convergence.
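The recursive salience score of Equation (2) can be approximated with a simple fixed-point iteration, sketched below. The paper does not spell out initialization, normalization, or the convergence test, so the uniform start and the fixed iteration cap here are assumptions.

```python
from collections import defaultdict

def salience_scores(triples, rel_weights, iterations=20):
    """Fixed-point evaluation of the salience score Sa(c) in Equation (2).
    `triples` are (head, relation, tail) facts; `rel_weights` maps each
    relation r to its entropy weight H(r)."""
    out_edges = defaultdict(list)        # head entity -> [(r, tail), ...]
    entities = set()
    for head, relation, tail in triples:
        out_edges[head].append((relation, tail))
        entities.update((head, tail))

    sa = {e: 1.0 for e in entities}      # assumed uniform initialization
    for _ in range(iterations):          # assumed cap instead of a convergence test
        updated = {}
        for c in entities:
            # Sa(c) = sum of H(r) * Sa(e_t) / L(e_t) over c's outgoing facts,
            # where L(e_t) counts the tails that have e_t as their head.
            updated[c] = sum(
                rel_weights.get(r, 0.0) * sa[t] / max(len(out_edges[t]), 1)
                for r, t in out_edges[c]
            )
        sa = updated
    return sa
```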

Collective Ranking: The similarity Sim^F(m, c) between a candidate c and its mention m is defined in Equation (3) as the final ranking score:

    Sim^F(m, c) = \alpha \cdot JS(m, c) \cdot Sa(c) + \beta \sum_{r \in R(c)} \sum_{n \in E_t(r) \cap C(m)} H(r) \cdot Sa(n)    (3)

where JS(m, c) is the Jaccard similarity between the string surface forms of m and c; Sa(c) and Sa(n) are both evaluated by Equation (2); C(m) denotes the candidate set for mention m; and \alpha and \beta are hyperparameters.
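Below is a sketch of the baseline's final ranking score in Equation (3). How the pool C(m) is assembled is not detailed in this excerpt, so the `candidate_pool` argument and the α, β defaults are assumptions of ours.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity, used as JS(m, c)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def baseline_score(mention, candidate, out_edges, rel_weights, salience,
                   candidate_pool, alpha=0.5, beta=0.5):
    """Final baseline ranking score of Equation (3): surface similarity scaled
    by salience, plus the weighted salience of the candidate's KB neighbors
    that also appear in the candidate pool C(m)."""
    score = alpha * jaccard(mention, candidate) * salience.get(candidate, 0.0)
    for relation, tail in out_edges.get(candidate, []):
        if tail in candidate_pool:
            score += beta * rel_weights.get(relation, 0.0) * salience.get(tail, 0.0)
    return score
```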

3 Quantified Collective Validation

Incorporating the KB relation weighting mechanism of the baseline system, our QCV algorithm constructs a number of candidate graphs for a given set of collaborative mentions, and then performs a two-level ranking followed by a collective validation on those candidate graphs to acquire the linking results. Because this procedure minimally relies on linguistic analysis of the source documents and mainly uses the KB structure, which by nature stays detached from any specific language or domain, we claim that QCV comes with language and domain independence.

3.1 Candidate Graph Construction

The KB entity candidate graphs are constructed based on a mention context graph and a KB graph. We introduce them in order as follows.

Mention Context Graph: To avoid abusing linguistic knowledge from the source documents, we construct a mention context graph G_m that simply involves mention co-occurrence. Figure 1 depicts a constructed G_m for the Caldwell example at the beginning of Section 1. In this figure, the mentions "New York Jets", "Cotchery" and "Coles" are brought into G_m through the coreference between "Jets" and "New York Jets", since the three of them are outside the context window of "Caldwell", "Florida", "Patriots", and "Jets". G_m contains a set of vertices representing the mentions extracted from the source document and a set of undirected edges. There is an edge between two mention vertices if both of them fall into a context window of width w_m in the source document. Ideally, w_m should cover a single discourse according to the one-sense-per-discourse assumption (Gale et al., 1992), but for simplicity we heuristically set w_m to be 7 sentences wide as a hyperparameter. Two mention vertices are connected via a dashed edge if they are coreferential but are not located in the same context window. Here we determine coreference by performing substring matching and abbreviation expansion. The dashed edge indicates that the out-of-context coreferential mention, together with its neighbors, will be indirectly included in G_m as extended context to later facilitate the candidate graph collective validation. Note that all of these loose settings comply with our intention of generating a lightweight source context representation born with domain and language independence.

Figure 1: Mention context graph for the Caldwell example.
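The sketch below builds the solid and dashed edges of G_m from mention positions. The 7-sentence window is taken from the text; treating the window as a simple sentence-distance threshold and using bare substring matching (without the abbreviation expansion mentioned above) are simplifications of ours.

```python
from itertools import combinations

def build_mention_graph(mentions, window=7):
    """Build the edges of the mention context graph G_m. `mentions` is a list
    of (surface_form, sentence_index) pairs. Solid edges connect mentions that
    fall within a `window`-sentence span; dashed edges connect coreferential
    mentions (here: bare substring matching) that fall outside it."""
    solid, dashed = set(), set()
    for (m1, s1), (m2, s2) in combinations(mentions, 2):
        if abs(s1 - s2) < window:
            solid.add((m1, m2))
        elif m1.lower() in m2.lower() or m2.lower() in m1.lower():
            dashed.add((m1, m2))   # out-of-window coreference
    return solid, dashed

# Toy run on the Caldwell example (sentence indices are invented):
mentions = [("Patriots", 0), ("Florida", 0), ("Caldwell", 0), ("Jets", 0),
            ("New York Jets", 9), ("Cotchery", 9), ("Coles", 9)]
solid, dashed = build_mention_graph(mentions)
# ("Jets", "New York Jets") comes out as the single dashed (coreference) edge.
```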

KB Graph: A structured KB such as DBpedia can be represented as a weighted graph G_k that consists of a set of vertices representing the entities and a set of directed edges labeled with relations between the entities. The weights of the relations are computed using Equation (1). In order to further enrich the KB relations, we add a type of relation named "wiki link" between two entities if one of them appears in the Wikipedia article of the other. Figure 2 presents a subgraph of the DBpedia KB graph containing the relevant entities in the Caldwell example.

Figure 2: KB graph for the Caldwell example. Edges are labeled with relations such as "birth place", "former team", and "wiki link", together with their entropy-based weights.

Candidate Graph: The candidate graph is a set of graphs G_c^i (i = 1, 2, ...) used for computing ranking scores for the KB entity candidates. For each of the mentions extracted from the source context, we first select a list of entity candidates from G_k with heuristic rules such as fuzzy string matching, synonyms, Wikipedia redirects, etc. Then we pick one candidate from each of the mentions to constitute the vertices of a G_c^i. In each G_c^i, we add an edge between two vertices if they are connected in G_k by some relation r and their mentions are connected in G_m. The edge label r from G_k is transferred to G_c^i. Upon completion, every G_c^i represents a collective linking solution to the given mention set. Figure 3 shows three of the constructed candidate graphs for the Caldwell example. One can see that the first two graphs are very likely to be good solutions, since they inherit many of the relation edges from G_k, while the third one is probably a poor collection, as its candidates barely connect to one another. In the next section, we will more formally reveal how to rank these candidate graphs to obtain the optimal linking results.

Figure 3: Candidate graphs for the Caldwell example. Graphs (A) and (B) link "Caldwell" to Reche Caldwell and Andre Caldwell, respectively, and remain well connected; graph (C) links the mentions to largely unrelated entities such as James Caldwell (clergyman) and Newcastle Jets FC.
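A sketch of how one candidate graph G_c^i can be produced per combination of candidates is shown below. The exhaustive Cartesian product is only workable for short candidate lists, and the data-structure layout (dictionaries keyed by mention and by entity pair) is our assumption, not the authors' implementation.

```python
from itertools import product

def enumerate_candidate_graphs(cands_per_mention, kb_edges, mention_edges):
    """Yield one candidate graph G_c^i per combination of candidates.
    `cands_per_mention` maps a mention to its candidate list, `kb_edges` maps
    an ordered (head, tail) entity pair to its relation label in G_k, and
    `mention_edges` is the edge set of G_m."""
    mentions = list(cands_per_mention)
    for choice in product(*(cands_per_mention[m] for m in mentions)):
        assignment = dict(zip(mentions, choice))   # mention -> chosen candidate
        edges = []
        for m1, m2 in mention_edges:
            c1, c2 = assignment.get(m1), assignment.get(m2)
            relation = kb_edges.get((c1, c2)) or kb_edges.get((c2, c1))
            if relation is not None:
                # keep the KB edge only if the two mentions co-occur in G_m
                edges.append((c1, relation, c2))
        yield assignment, edges
```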

3.2 Candidate Ranking

With the constructed candidate graphs, QCV performs two levels of ranking. First, it uses Equation (2) to compute the candidates' salience scores as a priori ranking. Then it compares each candidate graph with the mention context graph and evaluates their vertex set similarity for context similarity ranking. Finally, by considering the relation weights in the candidate graphs as well as the previous ranking scores, QCV collectively validates all the candidates and assembles the linking results. Below we focus on introducing the context similarity ranking and the collective validation, since the salience ranking resembles that of our baseline system.

Context Similarity Ranking: As shown in Figure 3, among the constructed candidate graphs, some contain many connected vertices while some are otherwise quite disconnected. Intuitively, we would like to measure this structural difference by comparing each candidate graph G_c^i with its mention context graph G_m. Granted, we can only assert co-occurrence between two connected mentions in G_m, but it should be of great probability that two co-occurring mentions have their entity referents connected by some relation in the KB. In other words, the more a G_c^i is structurally similar to its G_m, the better the candidates in this G_c^i represent their mentions in G_m. Therefore, we define the context similarity S_m(m_c, c) between a candidate c and its mention m_c using Jaccard similarity in Equation (4):

    S_m(m_c, c) = \frac{|\Theta^{G_m}(m_c) \cap \Theta^{G_c^i}(c)|}{|\Theta^{G_m}(m_c) \cup \Theta^{G_c^i}(c)|}    (4)

where \Theta^{G_m}(m_c) and \Theta^{G_c^i}(c) denote m_c's neighbor set in G_m and c's neighbor set in G_c^i, respectively. The intersection takes the candidates of those mentions in \Theta^{G_m}(m_c) that appear in \Theta^{G_c^i}(c), and the union is equivalent to \Theta^{G_m}(m_c) due to the way we construct G_c^i. We rank each G_c^i using the summation of the context similarity of every c in G_c^i. Note that our baseline system uses Jaccard similarity to achieve approximate string matching between the surface forms of a mention and a candidate, while we alternatively use it to capture the graphs' structural similarity. After ranking with the context similarity, those G_c^i with more connected vertices, such as Figure 3A and Figure 3B, will get closer to the top of the ranked candidate graph list.

Candidate Graph Collective Validation: Besides the salience, the context similarity provides another ranking score for each candidate c in G_c^i, and it promotes those candidates remaining connected in G_c^i. However, it fails to differentiate how two candidates are connected. In Figure 3A, Reche Caldwell is a former player of New England Patriots, while in Figure 3B, Andre Caldwell's Wikipedia article merely includes a hyperlink pointing to New England Patriots. The former seems a "tighter" relation than the latter. Although these two distinct relations imply that the two candidate pairs are related with different relation types, the context similarity rankings for these two candidate graphs are identical. Based on this observation, assuming that a "tighter" relation between two candidates is more likely to be an appropriate representation of the relation between their co-occurring mentions in the source context, we propose a novel validation step that not only considers the two previous ranking scores of each candidate but also quantitatively examines the relations between candidates. We transfer the calculated relation weights from G_k to G_c^i as positive indicators of how tightly two candidates are related, and then define the composite graph weight W(G_c^i) for each G_c^i in Equation (5) as the final ranking metric:

    W(G_c^i) = \sum_{c \in V(G_c^i)} Sa(c)\, S_m(m_c, c) + \sum_{r \in E(G_c^i)} H(r)    (5)

where V(G_c^i) and E(G_c^i) are the vertex set and the edge set of G_c^i; Sa(c), S_m(m_c, c), and H(r) are given by Equation (2), Equation (4), and Equation (1), respectively. With this composite graph weight, since the relation "former team" has a greater weight than "wiki link", the candidate graph in Figure 3A outweighs that in Figure 3B, and is therefore ranked to the top.
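The two QCV-specific scores, the context similarity of Equation (4) and the composite graph weight of Equation (5), are sketched below. The sketch assumes the data structures from the earlier examples (neighbor sets for G_m and G_c^i, the per-graph candidate assignment, and the relation-labelled edge list); it is an illustration, not the authors' code.

```python
def context_similarity(mention, candidate, mention_neighbors, cand_neighbors,
                       assignment):
    """Equation (4): Jaccard similarity between the mention's neighborhood in
    G_m (mapped through the per-graph candidate `assignment`) and the
    candidate's neighborhood in G_c^i."""
    mapped = {assignment[m] for m in mention_neighbors.get(mention, set())
              if m in assignment}
    cand_nb = cand_neighbors.get(candidate, set())
    union = mapped | cand_nb
    return len(mapped & cand_nb) / len(union) if union else 0.0

def composite_graph_weight(assignment, edges, salience, rel_weights,
                           mention_neighbors, cand_neighbors):
    """Equation (5): W(G_c^i) = sum_c Sa(c) * S_m(m_c, c) + sum_r H(r).
    `edges` is the (candidate, relation, candidate) list of this G_c^i."""
    vertex_term = sum(
        salience.get(c, 0.0) *
        context_similarity(m, c, mention_neighbors, cand_neighbors, assignment)
        for m, c in assignment.items()
    )
    edge_term = sum(rel_weights.get(r, 0.0) for _, r, _ in edges)
    return vertex_term + edge_term
```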
4 Experiments

In this section, we first show QCV's performance on generic English corpora and compare it with our baseline as well as other state-of-the-art EL systems. Then we move to a new language (Chinese) and two new domains (Biomedical Science and Earth Science) to demonstrate the language and domain independent nature of our algorithm.

4.1 EL on Generic English Corpora

For this evaluation, we used the TAC-KBP2013 EL dataset (http://www.nist.gov/tac/2013/KBP/data.html), which contains 2,190 mentions extracted from English newswire, web blogs, and discussion forums. We selected a subset of 1,090 linkable mentions that have entity referents in the KB for our experiment. DBpedia 3.9, which was generated from the Wikipedia dump in early 2013 and includes more than 4 million entities and more than 470 million facts (http://wiki.dbpedia.org/services-resources/datasets/data-set-39), was used as our KB. We followed the KBP EL track in using B-Cubed+ (Ji et al., 2011) as the evaluation metric. Table 1 presents the results of QCV, our baseline system, and the top 3 supervised and top 3 unsupervised participant systems of the TAC-KBP2013 EL track (due to NIST policy, the participant system names are not revealed; their scores are taken from http://www.nist.gov/tac/publications/2013/papers.html).

System                     B^3+ F1
Supervised 1st             0.724
Supervised 2nd             0.721
Supervised 3rd             0.718
Unsupervised 1st           0.632
Unsupervised 2nd           0.576
Unsupervised 3rd           0.573
Baseline (unsupervised)    0.697
QCV (unsupervised)         0.749

Table 1: Performance on the TAC-KBP2013 EL Dataset (1,090 linkable mentions).

As shown in Table 1, QCV not only substantially outperforms the best unsupervised systems but also beats the best supervised systems from the KBP participants. In order to understand this notable advancement, we broke our system down into components and evaluated them cumulatively on the same dataset. The experiment results are summarized in Table 2.

Components       B^3+ P   B^3+ R   B^3+ F1
SR               0.680    0.598    0.636
SR + CS          0.699    0.624    0.659
SR + CS + CV     0.789    0.712    0.749

Table 2: QCV Performance by Component.

In Table 2, SR, CS, and CV correspond to the Salience Ranking, the Context Similarity Ranking, and the Collective Validation in our QCV algorithm, respectively. It can be seen that SR alone already outperforms the best KBP unsupervised systems from Table 1. This is mainly attributed to the entropy-based relation weights, which inject the impact of different relations into the entity salience.
Notwithstanding being somewhat effective, SR depends solely on the KB and plays its role without the source context. It should be straightforward that the system performance improves after enabling CS, since the source context has been incorporated. However, it was a little puzzling that the performance boost from enabling CS turned out to be relatively small. We took a careful look at the intermediate experiment results and discovered that although CS did not produce many more correct linking results than SR did, it did promote a great number of good candidates to the top of the ranking list. For example, in the Caldwell case, CS successfully raised the rankings of the context-related candidates such as Reche Caldwell, Andre Caldwell, and Jim Caldwell, despite the fact that it delivered Andre Caldwell instead of Reche Caldwell as the final linking result. This convincingly implies that CS is able to capture the context of the target mentions well, but meanwhile it is deficient in recognizing the subtle contextual differences among similar candidates. In Table 2 there is a significant performance gain after enabling CV. As described in Section 3.2, CV collectively validates the candidates of the target mention "Caldwell" and the mentions in its context such as "Florida", "Patriots", and "Jets" by integrating their SR and CS scores as well as the weights of the KB relations between them. This improvement is therefore reasonably substantial.

By investigating the remaining errors, we identified several potential causes: 1) Our system occasionally could not capture enough context for the target mention. This happened more frequently for web blogs and discussion forums, where the language was informal and casual. Without any linguistic analysis on the source documents, it was difficult for us to extract additional context words. 2) Our simple coreference rules sometimes failed to work correctly and introduced false candidates, which, without clear context to disambiguate, could lead to linking errors. 3) Our KB had limited knowledge about some entities, in the sense that certain relations were missing. This kept us from creating necessary links in the candidate graphs and from further effectively validating the graphs.

4.2 EL on Generic Chinese Corpora

Using Chinese as a case study, we evaluate the language portability of our approach. We used the TAC-KBP2012 Chinese EL dataset (http://www.nist.gov/tac/2012/KBP/data.html), and selected a subset of 1,240 linkable mentions out of the total 2,122 mentions extracted from Chinese newswire, web blogs, and discussion forums. For the KB, we still used DBpedia because it contains multilingual surface forms for its entities. For instance, the entity Barack Obama has surface forms in over 30 languages, including the Chinese one: "贝拉克·奥巴马". This cross-lingual surface form mapping naturally provides us with a convenient translation tool. Table 3 shows the linking performance comparison among QCV, our baseline system, and the top 3 participant systems of the KBP Chinese EL track. Again, we employed the B-Cubed+ metric.

System                                        B^3+ F1
Clarke et al. (2012) (supervised)             0.493
Monahan and Carpenter (2012) (supervised)     0.660
Fahrni et al. (2012) (supervised)             0.736
Baseline (unsupervised)                       0.648
QCV (unsupervised)                            0.671

Table 3: Performance on the TAC-KBP2012 Chinese EL Dataset (1,240 linkable mentions).
As shown in Table 3, the best performance is achieved by Fahrni et al. (2012), a supervised system using over 20 fine-tuned features and many linguistic resources. In contrast, our QCV is an unsupervised approach that does not use any labeled data or linguistic resources. During the error analysis, we found that in this dataset multiple mentions are often variants of the surface form of a single KB entity. For example, "奥巴马" and "欧巴马", being just different Chinese transliterations, both refer to "Obama". This fact tends to result in a low recall for our system, because one or more of the mention variants may not exist in the KB. We decided to heuristically apply substring matching, in addition to the Wikipedia redirection mapping, to boost the recall. However, as one can imagine, this simple strategy will impair the system precision due to the introduced noise. Take "奥巴马" again, for example. If we only match its second and third characters, "欧巴马" will be correctly picked, but "巴马镇" (a small town in China) will also be falsely included. Fortunately, our QCV algorithm was able to select and rank candidates complying with the source context. Consequently, most of this kind of noise was filtered out, and we thus could produce balanced precision and recall.

We acknowledge that, without performing deeper linguistic analysis on the source documents, the cross-language surface form mapping of the KB plays a crucial role in our approach. One can replace it with any machine translation product, which, however, is not always available, especially for a low-resource language. We should take advantage of existing KBs where such cross-lingual mappings have already been widely created. The latest DBpedia, for instance, provides localized versions in 125 languages (http://wiki.dbpedia.org/about).

4.3 EL in Biomedical Science

To demonstrate the domain portability of our approach, we first take the biomedical science domain as a case study. We conducted our experiment using the evaluation dataset created by Zheng et al. (2014), which contains 208 linkable mentions extracted from several biomedical publications. We built our KB with over 300 domain ontologies downloaded from BioPortal. Table 4 compares the linking accuracy of QCV and our baseline system.

System     Correct   Total   Accuracy
Baseline   173       208     83.17%
QCV        177       208     85.10%

Table 4: Biomedical Science EL Performance.

As shown in Table 4, our approach achieves performance similar to our baseline system, which is the state of the art to our knowledge. However, we were curious why QCV did not improve over the baseline system in the biomedical domain as much as it did in the general domain. After some in-depth analysis of the experiment results, we discovered that in this dataset the candidates of the related mentions (i.e., those mentions within the same context window) mostly have similar relations in the KB. In other words, for each mention, the candidate entity types are not as diverse as those in the general domain. As a consequence, the collective validation step in QCV does not have much effect, since the weights of the involved relations are quite close to one another. On such a dataset, the context similarity ranking plays the major part in the disambiguation, and QCV cannot function at its full power. Nonetheless, from the results we can see that our approach can be efficiently and effectively adapted to this new domain.

4.4 EL in Earth Science

Now we move to another new domain, Earth Science. As far as we know, we are the first to study EL in this domain. In order to create an evaluation dataset, our domain expert selected three scientific papers about Early Triassic discovery, Global Stratotype Section, and Triassic crisis, which are three different aspects of Earth Science related discovery, and then identified 296 mentions that can be linked to DBpedia entities. Table 5 presents the linking accuracy comparison between QCV and our baseline system. We can see that QCV provided significant gains.

System     Correct   Total   Accuracy
Baseline   221       296     74.66%
QCV        236       296     79.73%

Table 5: Earth Science EL Performance.

The linking errors were mainly caused by the following reasons: 1) As a general KB, DBpedia introduced certain noise into our domain-specific EL. For example, in Geology, the term "Beds" mostly refers to "Geology Bed", which is a division of a geologic formation.
But in general usage, "Beds" means the beds people sleep on. Being much more common in the KB, the latter had such a significantly higher salience score than the former that the final ranking score of our system got biased. 2) Some relations between Earth Science related entities are not clearly defined in DBpedia. For instance, in the geologic time scale, the period "Chattian" is immediately preceded by the period "Rupelian". An explicit relation such as "preceded by" should connect these two period entities. Instead, only a vague "wiki link" relation is present in our KB. This directly diminishes the differentiating power of our system on the KB relations.

It is worth mentioning that there exists a large number of well-established ontologies for different sub-domains of Earth Science. The SWEET ontologies (http://sweet.jpl.nasa.gov), for example, widely capture Earth and Environmental terminologies. By adopting these ontologies, we will be able to considerably improve our domain EL performance, and the benefits of EL in the domain will be further revealed.

4.5 System Complexity

We indexed our KB and ontologies in the format of triples using Apache Lucene (https://lucene.apache.org/) such that retrieving the entity candidates of a mention is O(1). We precomputed all the entropy-based relation weights and entity salience scores with complexities of O(n_r · n_e) and O(n_e · k), respectively, where n_r is the number of KB relations, n_e is the number of KB entities, and k is the number of iterations it took for the salience scores to converge. For the final QCV score computation, the upper bound of the computing time to link all the mentions in a document is O(n_m · n_c · n_nc · n_nm), where n_m is the number of linkable mentions in the document, n_c is the number of candidates for each mention, n_nc is the number of neighbor nodes of a candidate, and n_nm is the number of neighbors of a mention.
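As a rough illustration of the constant-time candidate lookup described above, here is a small in-memory surface-form index. The paper indexes triples with Apache Lucene; the plain dictionary, the example surface forms, and the function name are ours.

```python
from collections import defaultdict

def build_surface_index(surface_forms):
    """In-memory stand-in for the candidate index: maps a lowercased surface
    form to the set of entities that carry it, so lookup is a dictionary hit."""
    index = defaultdict(set)
    for entity, names in surface_forms.items():
        for name in names:
            index[name.lower()].add(entity)
    return index

index = build_surface_index({
    "Reche_Caldwell": ["Reche Caldwell", "Caldwell"],
    "Andre_Caldwell": ["Andre Caldwell", "Caldwell"],
})
candidates = index["caldwell"]   # retrieves both Caldwell entities in O(1)
```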
5 Related Work

In recent years, collective inference methods for EL have become increasingly popular. Many efforts have been devoted to encoding linguistic features from the source documents in order to precisely select collaborator mentions for collective inference. These features include topic modeling (Xu et al., 2012; Cassidy et al., 2012), relation constraints (Cheng and Roth, 2013), coreferential chaining (Nguyen et al., 2012; Huang et al., 2014), and dependency restrictions (Ling et al., 2014). Some recent work utilized multi-layer linguistic analysis integration to capture contextual properties for better mention collection (Pan et al., 2015). While many of these approaches have proved to be effective, their dependency on deep linguistic knowledge makes it difficult to migrate them to a new language or domain. In contrast to these methods, we establish a very loose setting for the mention selection, and rely on the quantified information computed from the structured KB to collectively evaluate and validate the entity candidates. Since the KB is relatively universal across languages and domains, our approach is inherently language and domain independent.

Recent cross-lingual EL approaches can be divided into two types. The first type (McNamee et al., 2011; Cassidy et al., 2011; McNamee et al., 2012; Guo et al., 2012; Miao et al., 2013) translated entity mentions and source documents from the new language into English and then ran English mono-lingual EL to link to the English KB. The second type (Monahan et al., 2011; Fahrni and Strube, 2011; Fahrni et al., 2012; Monahan and Carpenter, 2012; Clarke et al., 2012; Fahrni et al., 2013) developed EL systems in the new language and used cross-lingual KB links to map the linking results back to the English KB. While the bottleneck of the former method usually lies in translation errors, the latter approach heavily relies on the linguistic resources and the KB of the new language. In comparison, our system mainly uses the English KB and a mention surface form mapping that can come either from translation or from cross-lingual KB links, and requires minimal linguistic resources from the new language.

There is a limited amount of research work in the literature that has focused solely on domain-specific EL (Zheng et al., 2014). In the biomedical domain, a few studies have addressed EL-related tasks such as scientific name discovery (Akella et al., 2012), gene name normalization (Hirschman et al., 2005; Fang et al., 2006; Dai et al., 2010), biomedical named entity recognition (Usami et al., 2011; Van Landeghem et al., 2012), and concept mention extraction (Tsai et al., 2013). The baseline system (Zheng et al., 2014) in this paper is the work most similar to ours in the sense of collectively aligning mentions to structured KBs. However, our system differs by integrating a context similarity ranking and a candidate validation to conduct a two-way collective inference with better performance.

6 Conclusions and Future Work

Language and domain independence is a new requirement for EL systems, and this capability is particularly welcome in low-resource language applications and among domain scientists. In this paper we demonstrated a high-performance EL approach that can be easily migrated to new languages and domains due to its minimal reliance on linguistic analysis and its deep utilization of structured KBs. In the future, we plan to improve the source document processing such that the system can better extract the mention context without involving extensive linguistic knowledge. We are also experimenting with our collective validation algorithm to incorporate the impact of more distant KB entities beyond just the neighbors.

7 Acknowledgement

This work was supported by the U.S. DARPA DEFT Program No. FA8750-13-2-0041, ARL NS-CTA No. W911NF-09-2-0053, NSF CAREER Award IIS-1523198, DARPA LORELEI, the AFRL DREAM project, and gift awards from IBM, Google, Disney and Bosch. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

L. M. Akella, C. N. Norton, and H. Miller. 2012. NetiNeti: Discovery of Scientific Names from Text Using Machine Learning Methods. BMC Bioinformatics, 13:211.

T. Cassidy, Z. Chen, J. Artiles, H. Ji, H. Deng, L. Ratinov, J. Zheng, J. Han, and D. Roth. 2011. CUNY-UIUC-SRI TAC-KBP2011 Entity Linking System Description. In Proceedings of Text Analysis Conference 2011.

T. Cassidy, H. Ji, L. Ratinov, A. Zubiaga, and H. Huang. 2012. Analysis and Enhancement of Wikification for Microblogs with Context Expansion. In Proceedings of the 25th International Conference on Computational Linguistics.

X. Cheng and D. Roth. 2013. Relational Inference for Wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

J. Clarke, Y. Merhav, G. Suleiman, S. Zheng, and D. Murgatroyd. 2012. Basis Technology at TAC 2012 Entity Linking. In Proceedings of Text Analysis Conference 2012.

S. Cucerzan. 2011. TAC Entity Linking by Performing Full-Document Entity Extraction and Disambiguation. In Proceedings of Text Analysis Conference 2011.

H. Dai, P. Lai, and R. T. Tsai. 2010. Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):412–420.

J. Dalton and L. Dietz. 2013. A Neighborhood Relevance Model for Entity Linking. In Proceedings of the 10th Conference on Open Research Areas in Information Retrieval.

A. Fahrni and M. Strube. 2011. HITS' Cross-lingual Entity Linking System at TAC 2011: One Model for All Languages. In Proceedings of Text Analysis Conference 2011.

A. Fahrni, T. Göckel, and M. Strube. 2012. HITS' Monolingual and Cross-lingual Entity Linking System at TAC 2012: A Joint Approach. In Proceedings of Text Analysis Conference 2012.

A. Fahrni, B. Heinzerling, T. Göckel, and M. Strube. 2013. HITS' Monolingual and Cross-lingual Entity Linking System at TAC 2013. In Proceedings of Text Analysis Conference 2013.

H. Fang, K. Murphy, Y. Jin, J. S. Kim, and P. S. White. 2006. Human Gene Name Normalization Using Text Matching with Automatically Extracted Synonym Dictionaries. In Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, pages 41–48.

N. Fernandez, J. A. Fisteus, L. Sanchez, and E. Martin. 2010. WebTLab: A Cooccurrence-Based Approach to KBP 2010 Entity-Linking Task. In Proceedings of Text Analysis Conference 2010.

W. A. Gale, K. W. Church, and D. Yarowsky. 1992. One Sense per Discourse. In Proceedings of the Fifth DARPA Speech and Natural Language Workshop.

Z. Guo, Y. Xu, F. Mesquita, D. Barbosa, and G. Kondrak. 2012. ualberta at TAC-KBP 2012: English and Cross-Lingual Entity Linking. In Proceedings of Text Analysis Conference 2012.

X. Han, L. Sun, and J. Zhao. 2011. Collective Entity Linking in Web Text: A Graph-Based Method. In Proceedings of the 34th Annual ACM SIGIR Conference.

L. Hirschman, M. Colosimo, A. Morgan, and A. Yeh. 2005. Overview of BioCreAtIvE Task 1B: Normalized Gene Lists. BMC Bioinformatics, 6.

H. Huang, Y. Cao, X. Huang, H. Ji, and C. Lin. 2014. Collective Tweet Wikification based on Semi-supervised Graph Regularization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

H. Ji, R. Grishman, and H. T. Dang. 2011. Overview of the TAC 2011 Knowledge Base Population Track. In Proceedings of Text Analysis Conference 2011.

X. Ling, S. Singh, and D. S. Weld. 2014. Context Representation for Named Entity Linking. In Proceedings of the 3rd Pacific Northwest Regional NLP Workshop.

P. McNamee, J. Mayfield, D. Lawrie, D. W. Oard, and D. Doermann. 2011. Cross-Language Entity Linking. In Proceedings of the 5th International Joint Conference on Natural Language Processing.

P. McNamee, V. Stoyanov, J. Mayfield, T. Finin, T. Oates, T. Xu, D. W. Oard, and D. Lawrie. 2012. HLTCOE Participation at TAC 2012: Entity Linking and Cold Start Knowledge Base Construction. In Proceedings of Text Analysis Conference 2012.

Q. Miao, R. Fang, Y. Meng, and S. Zhang. 2013. FRDC's Cross-lingual Entity Linking System at TAC 2013. In Proceedings of Text Analysis Conference 2013.

S. Monahan and D. Carpenter. 2012. Lorify: A Knowledge Base from Scratch. In Proceedings of Text Analysis Conference 2012.

S. Monahan, J. Lehmann, T. Nyberg, J. Plymale, and A. Jung. 2011. Cross-Lingual Cross-Document Coreference with Entity Linking. In Proceedings of Text Analysis Conference 2011.

H. Nguyen, H. Minh, T. Cao, and T. Nguyen. 2012. JVN-TDT Entity Linking Systems at TAC-KBP2012. In Proceedings of Text Analysis Conference 2012.

X. Pan, T. Cassidy, U. Hermjakob, H. Ji, and K. Knight. 2015. Unsupervised Entity Linking with Abstract Meaning Representation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

M. Pennacchiotti and P. Pantel. 2009. Entity Extraction via Ensemble Semantics. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).

L. Ratinov, D. Roth, D. Downey, and M. Anderson. 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

C. Tsai, G. Kundu, and D. Roth. 2013. Concept-Based Analysis of Scientific Literature. In Proceedings of CIKM.

Y. Usami, H. Cho, N. Okazaki, and J. Tsujii. 2011. Automatic Acquisition of Huge Training Data for Bio-medical Named Entity Recognition. In Proceedings of the BioNLP 2011 Workshop.

S. Van Landeghem, J. Björne, T. Abeel, B. De Baets, T. Salakoski, and Y. Van de Peer. 2012. Semantically Linking Molecular Entities in Literature through Entity Relationships. BMC Bioinformatics, 13.

J. Xu, Q. Lu, J. Liu, and R. Xu. 2012. NLPComp in TAC 2012 Entity Linking and Slot-Filling. In Proceedings of Text Analysis Conference 2012.

J. Zheng, D. Howsmon, B. Zhang, J. Hahn, D. McGuinness, J. Hendler, and H. Ji. 2014. Entity Linking for Biomedical Literature. In Proceedings of the ACM 8th International Workshop on Data and Text Mining in Bioinformatics, pages 3–4, New York, NY, USA. ACM.