Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Entity Linking with Effective Acronym Expansion, Instance Selection and Topic Modeling

Wei Zhang†  Yan Chuan Sim‡  Jian Su‡  Chew Lim Tan†

†School of Computing, National University of Singapore. {z-wei, tancl}@comp.nus.edu.sg
‡Institute for Infocomm Research. {ycsim, sujian}@i2r.a-star.edu.sg

Abstract

Entity linking maps name mentions in documents to entries in a knowledge base by resolving name variations and ambiguities. In this paper, we propose three advancements for entity linking. Firstly, expanding acronyms can effectively reduce the ambiguity of acronym mentions. However, only rule-based approaches relying heavily on the presence of text markers have been used for entity linking so far. In this paper, we propose a supervised learning algorithm to expand more complicated acronyms, which leads to a 15.1% accuracy improvement over state-of-the-art acronym expansion methods. Secondly, as entity linking annotation is expensive and labor intensive, we propose an instance selection strategy to effectively utilize automatically generated annotation, automating the annotation process without compromising accuracy. In our selection strategy, an informative and diverse set of instances is selected for effective disambiguation. Lastly, topic modeling is used to model the semantic topics of the articles. These advancements give statistically significant improvements to entity linking individually. Collectively they lead to the highest performance on the KBP-2010 task.

1 Introduction

Knowledge base population (KBP) involves gathering information scattered among documents to populate a knowledge base (KB) (e.g. Wikipedia). This requires either linking entity mentions in the documents with entries in the KB or highlighting these mentions as new entries to the current KB.

Entity linking [McNamee and Dang, 2009] involves both finding name variants (e.g. both "George H. W. Bush" and "George Bush Senior" refer to the 41st U.S. president) and name disambiguation (e.g. given "George Bush" and its context, we should be able to disambiguate which president in the KB it is referring to).

In the name variant finding stage, expanding an acronym (an all-capitalized short-form word) from its context can effectively reduce the ambiguity of the acronym mention, under the assumption that two variants in the same document refer to the same entity. For example, TSE refers to 33 entries, but with its full name Tokyo Stock Exchange, which is unambiguous, we can directly link it to the correct entry without the need for disambiguation. However, only two simple rule-based approaches have been used for entity linking so far. Han and Zhao [2009] only allow expansions adjacent to the acronym in parentheses (e.g. ...Israeli Air Force (IAF)...). The other, an N-gram approach [Varma et al., 2009], suffers from low precision, because only one rule is used: "N" continuous tokens have the same initials as the acronym.

Beyond entity linking, previous work on finding acronym expansions relies heavily on the presence of text markers such as "long (short)" or "short (long)" [Pustejovsky et al., 2001; Taghva and Gilbreth, 1999; Schwartz and Hearst, 2003; Nadeau and Turney, 2005; Chang et al., 2002], the same as what is used by Han and Zhao [2009], or on linguistic cues [Larkey et al., 2000; Park and Byrd, 2001], which are keywords like: also called, known as, short for. These systems are mainly built for biomedical literature, where acronyms are introduced in a more "formal" manner. However, in the newswire domain, acronym-expansion pairs often do not occur in the same sentence, nor do they follow the familiar pattern of being formed from the full form's leading characters. There are acronyms such as CCP (Communist Party of China), and any of MOD/MINDEF/MD can stand for Ministry of Defense. Leading characters may be dropped; multiple letters from a full name could be used; or the order of the letters may be swapped, adding to the complexity of decoding acronyms.

Only the following two exceptions use supervised learning approaches. Chang et al. [2002] use features that describe the alignments between the acronym and candidate expansion (i.e. whether acronym letters are aligned at the beginning of a word, at a syllable boundary, etc.). Nadeau and Turney [2005] incorporate part-of-speech information in addition to alignment information. However, these supervised learning approaches have the constraint that the acronym and its candidate expansion are located in the same sentence. When extended to the rest of the document, they are greatly affected by noisy data and do not perform as well.

For the name disambiguation stage, Varma et al. [2009] rank the entries in the KB through a Vector Space Model, which cannot combine bag-of-words with other useful features effectively.

In contrast, current state-of-the-art entity linking systems are based on supervised learning approaches [Dredze et al., 2010; Zheng et al., 2010]. However, they require many annotated training examples to achieve good performance. Entity linking annotation is expensive and labor intensive because of the large size of the referred KB. Meanwhile, entity linking annotation is highly dependent on the KB: when a new KB comes, the annotation process needs to be repeated. Zhang et al. [2010] have tried to automate this entity linking annotation process. However, as discussed in that paper, the distribution of the automatically generated data is not consistent with the real data set, because only some types of training instances can be generated.

In this paper, we propose three advancements for the entity linking problem: (1) a supervised learning algorithm for flexibly finding an acronym's expanded forms in newswire articles, without relying solely on text markers or linguistic cues, (2) an instance selection strategy to effectively utilize the auto-generated annotation and reduce the effect of the distribution problem mentioned above, and (3) a topic model that effectively captures the semantic information shared between a document and a KB entry. We conduct an empirical evaluation on KBP-2010 data [Ji et al., 2010]. The evaluation results show that our acronym expansion method produces a 15.1% improvement over state-of-the-art methods. Meanwhile, these three advancements achieve statistically significant improvements to the entity linking result individually. Collectively they give the highest performance on the KBP-2010 task.

The remainder of this paper is organized as follows. Section 2 introduces the framework for entity linking. We present our machine learning method for acronym expansion and our instance selection strategy for the automatically annotated data in Sections 3 and 4 respectively, and the usage of the topic information feature in Section 5. Section 6 shows the experiments and discussions. Section 7 concludes our work.

2 Entity Linking Framework

Name variation resolution finds variants for each entry in the KB and then generates the possible KB candidates for a given name mention by string matching. Name disambiguation maps a name mention to the correct entry in the candidate set.

2.1 Name Variation Resolution

Wikipedia contains many name variants of entities, such as confusable names, spelling variations, nick names, etc. We extract the name variants of an entity in the KB by leveraging the knowledge sources in Wikipedia: "titles of entity pages", "disambiguation pages", "redirect pages" and "anchor texts" [Zhang et al., 2010]. With the acquired name variants for entries in the KB, the possible KB candidates for a given name mention can be retrieved by string matching. In particular, if the given mention is an acronym, we first expand it from the given document, and then apply the entity linking process. Section 3 will elaborate our proposed approach for acronym expansion.

2.2 Name Disambiguation

First, using a learning to rank method, we rank all the retrieved KB candidates to identify the most likely candidate. In this learning to rank method, each name mention and its associated candidates are represented by a list of feature vectors. The instances linked with NIL (no corresponding entry in the KB) and the instances with only one KB candidate are removed from the training set. During linking, the score for each candidate entry is given by the ranker. The learning algorithm we use is ranking SVM [Herbrich et al., 2000].

Next, the top-ranked KB candidate is presented to a binary classifier to determine whether it is believed to be the target entry for the name mention. To determine the likelihood of a correct mapping from the name mention to the top candidate, we employ a SVM classifier [Vapnik, 1995], which returns a class label. From here, we can decide whether the mention and the top candidate are linked. If not, the mention has no corresponding entry in the KB (NIL).

The base features adopted for both learning to rank and classification include 15 feature groups divided into 3 categories. A summary of the features is listed in Table 1. The feature details can be found in [Zheng et al., 2010].

Categories | Feature Names
Surface | Exact Equal Surface, Start With Query, End With Query, Equal Word Num, Miss Word Num
Contextual | TF Sim Context, Similarity Rank, All Words in Text, NE Number Match, Country in Text Match, Country in Text Miss, Country in Title Match, Country in Title Miss, City in Title Match
Others | NE Type

Table 1: Base Feature Set
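To make the two-stage decision of Section 2.2 concrete, the following is a minimal sketch of the rank-then-classify pipeline. It is not the authors' SVMLight/ranking-SVM implementation; the function names (link_mention, rank_score, is_link) and the plain feature-vector representation are illustrative assumptions.

```python
# Minimal sketch of the two-stage disambiguation decision: a ranker picks
# the top KB candidate, then a binary classifier validates it (else NIL).
# The trained models are passed in as callables; names are illustrative.
from typing import Callable, List, Optional, Sequence

def link_mention(
    candidates: List[str],
    feature_vectors: List[Sequence[float]],
    rank_score: Callable[[Sequence[float]], float],   # ranking SVM score
    is_link: Callable[[Sequence[float]], bool],       # binary SVM decision
) -> Optional[str]:
    """Return the linked KB entry id, or None for a NIL mention."""
    if not candidates:
        return None  # no candidate retrieved: NIL
    # Stage 1: rank all retrieved KB candidates, keep the top one.
    best = max(range(len(candidates)),
               key=lambda i: rank_score(feature_vectors[i]))
    # Stage 2: the binary classifier validates the top candidate.
    return candidates[best] if is_link(feature_vectors[best]) else None

# Toy usage with stub models:
entry = link_mention(
    ["KB:Abbott_Laboratories", "KB:Bud_Abbott"],
    [[0.9, 0.8], [0.2, 0.1]],
    rank_score=lambda v: sum(v),
    is_link=lambda v: v[0] > 0.5,
)
print(entry)  # KB:Abbott_Laboratories
```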
3 Acronym Expansion

In this section, we describe our algorithm for finding an acronym's expansion in a given document.

3.1 Identifying Candidate Expansions

Given a document D and an acronym A=a1a2...am, which is usually a capitalized word in the document, we want to find all possible candidate expansions. First, we add all strings found in the pattern "A (string)" to our set of candidates, C. Next, we find patterns of "(A)", and extract the longest contiguous sequence of tokens, E, before "(A)" that does not contain punctuation or more than 2 stop words. Our stop words come from a predefined list of commonly used prepositions, conjunctions, determiners and auxiliaries. For example, in the sentence John received an award from the Association for Computing Machinery (ACM), we extract E=the Association for Computing Machinery. We add E and all its sub-strings ending with the last token of E to C. Thus, the Association for Computing Machinery, Association for Computing Machinery, for Computing Machinery, Computing Machinery and Machinery will be added to C.
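A minimal sketch of this "(A)"-pattern extraction step follows. It assumes whitespace tokenization and uses a small illustrative stop word list (the paper's predefined list is fuller), and it covers only the "(A)" pattern, not the "A (string)" pattern or the free search described next.

```python
import re

# Illustrative stop word list; the paper uses a fuller predefined list of
# prepositions, conjunctions, determiners and auxiliaries.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "or", "in", "on", "to", "from"}

def expansions_before_parenthesized_acronym(text: str, acronym: str) -> set:
    """For each occurrence of '(A)', take the longest contiguous token
    sequence E before it containing no punctuation and at most 2 stop
    words, then add E and every sub-string of E ending with its last token."""
    candidates = set()
    for match in re.finditer(re.escape("(" + acronym + ")"), text):
        tokens = text[: match.start()].split()
        seq, stops = [], 0
        # Walk backwards from '(A)', stopping at punctuation or a 3rd stop word.
        for tok in reversed(tokens):
            if re.search(r"[^\w-]", tok):  # token carries punctuation
                break
            if tok.lower() in STOP_WORDS:
                stops += 1
                if stops > 2:
                    break
            seq.insert(0, tok)
        # All sub-strings of E that end with its last token (i.e. suffixes).
        for i in range(len(seq)):
            candidates.add(" ".join(seq[i:]))
    return candidates

text = "John received an award from the Association for Computing Machinery (ACM)."
print(expansions_before_parenthesized_acronym(text, "ACM"))
```

Running it on the ACM sentence yields exactly the paper's example set, from "the Association for Computing Machinery" down to "Machinery".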

To find expansions that are not captured by text markers, we search the document for all tokens whose first letter matches the first letter of the acronym. When we encounter a token that starts with a1, we extract the longest contiguous sequence of tokens that does not contain punctuation or more than 2 stop words. From the sentence the Association for Computing Machinery has granted the..., for the acronym "ACM", we extract E=Association for Computing Machinery has, containing two stop words. Similarly, we consider all substrings of E that include the first token of E. Thus, Assoc... Machinery has, Assoc... Machinery, Assoc... Computing, etc., will be added to C.

Using the above strategies, we obtain a very large set of possible candidate expansions for a given acronym, about 8 negative candidates for each positive one in our experiments. Hence, we rely on a classifier to select the correct candidate expansion.

3.2 Feature Set for Supervised Learning

With the candidate expansion set C extracted from document D for acronym A, we apply a supervised learning algorithm to select the expansions that are valid. We represent each of these candidate expansions in the form of a feature vector, which can be used as input to supervised learning algorithms. As shown in Table 2, in addition to conventional features used by previous acronym decoding systems to capture alignment information between acronym and expansion, and part-of-speech features introduced and used in Park and Byrd [2001] and Nadeau and Turney [2005], we add new features that model swapped acronym letters and completely lowercase expansions. The features Leading match 1 and 2 ignore the order of words in the expansion and measure the overlap between the expansion's leading characters and the acronym. This accounts for variations where the words are swapped, as in the case of Communist Party of China (CCP). For the Sentence dist feature, we use an exponential scale instead of a linear scale, according to our experiments.

Each candidate acronym-expansion pair is presented to a SVM classifier. We return a NIL response if there are no positively classified expansions. Otherwise, the candidate with the highest confidence score is selected.

Feature | Description
Conventional Features
Acronym length | Length of the acronym A.
First/last char match | 1 if the first/last character of the acronym and the first/last character of the candidate expansion are the same (case insensitive); 0 otherwise.
Consec left/right tokens | The number of tokens on the left/right of the candidate that are lowercase; 0 if the entire string is lowercase.
LCS full | Length of the LCS between the acronym and the expansion string (case insensitive).
LCS lead | Length of the LCS between the acronym and the leading characters of the expansion.
Length diff | Absolute difference between the acronym length and the number of tokens in the candidate.
Sentence dist | Number of sentences between the acronym and the expansion.
POS Features [Park and Byrd, 2001; Nadeau and Turney, 2005]
PCD start/last | 1 if the first/last token of the expansion is a preposition, conjunction or determiner; 0 otherwise.
PCD count | Number of prepositions, conjunctions and determiners in the expansion.
Verb count | Number of verbs in the expansion.
New Features
Leading match 1 | Number of common characters (case insensitive) between the acronym and the leading characters of the expansion. For example, CCP and Communist Party of China will be 3 for this feature.
Leading match 2 | Same as Leading match 1 but case sensitive.
Sentence dist exp | Exp(-|k|), where k is the number of sentences between the acronym and the expansion.

Table 2: Acronym Expansion Feature Set
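To make the two new features concrete, here is a small sketch of how Leading match 1 and Sentence dist exp could be computed. The function names and whitespace tokenization are assumptions for illustration, and sentence indices are assumed to be known from a sentence splitter.

```python
import math
from collections import Counter

def leading_match_1(acronym: str, expansion: str) -> int:
    """Order-insensitive overlap (case insensitive) between the acronym's
    characters and the leading characters of the expansion's words.
    E.g. 'CCP' vs 'Communist Party of China' -> 3 ('c', 'c', 'p')."""
    leads = Counter(w[0].lower() for w in expansion.split() if w)
    overlap = Counter(acronym.lower()) & leads  # multiset intersection
    return sum(overlap.values())

def sentence_dist_exp(acronym_sent: int, expansion_sent: int) -> float:
    """Exp(-|k|) with k the sentence distance; nearby expansions score
    close to 1 and remote ones decay quickly toward 0."""
    return math.exp(-abs(acronym_sent - expansion_sent))

print(leading_match_1("CCP", "Communist Party of China"))  # 3
print(round(sentence_dist_exp(4, 6), 3))                   # 0.135
```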
4 Instance Selection Strategy

In this section, we explore a method to effectively utilize large-scale auto-generated data. Zhang et al. [2010] propose automatically gathering large-scale training instances for entity linking. The basic idea is to take a document with an unambiguous mention referring to an entity e1 in the KB and replace the mention with a variation of it which may refer to e1, e2 or others. For example, the entity mention "Abbott Laboratories" in a document refers to only one KB entry, "Abbott Laboratories". We replace "Abbott Laboratories" in the document with its ambiguous synonyms, including "Abbott", "ABT", etc. Following this approach, from the 1.7 million documents in the KBP-2010 text collection, we generate 45,000 instances.

However, the distribution of the auto-generated data is not consistent with the real data set, as the data generation process can only create some types of training instances. In the case of "Abbott Laboratories", more than ten "Abbott" mentions are linked to the "Abbott Laboratories" entry in the KB, but no "Abbott" example is linked to other entries like "Bud Abbott", "Abbott Texas", etc. Thus, we use an instance selection strategy to select a more balanced subset from the auto-annotated instances and reduce the effect of the distribution problem. This instance selection strategy is similar to active learning [Brinker, 2003; Shen et al., 2004], which reduces the manual annotation effort by proposing only the useful candidates to annotators. As we already have a large set of automatically generated training instances, the selection here is a fully automatic process to obtain a useful and more balanced subset.
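Before turning to the selection details, here is a minimal sketch of the substitution-based generation step described above. The tiny synonym table and document strings are illustrative stand-ins for the Wikipedia-derived name variants and the KBP-2010 text collection.

```python
# Sketch of substitution-based training instance generation: a document
# whose mention unambiguously refers to one KB entry is copied once per
# ambiguous synonym, keeping the original entry as the gold label.
from typing import List, Tuple

def generate_instances(
    doc: str,
    unambiguous_mention: str,
    kb_entry: str,
    ambiguous_synonyms: List[str],
) -> List[Tuple[str, str, str]]:
    """Emit one (document, mention, gold KB entry) instance per synonym."""
    instances = []
    if unambiguous_mention not in doc:
        return instances
    for synonym in ambiguous_synonyms:
        instances.append(
            (doc.replace(unambiguous_mention, synonym), synonym, kb_entry)
        )
    return instances

doc = "Abbott Laboratories reported strong quarterly earnings."
for text, mention, entry in generate_instances(
    doc, "Abbott Laboratories", "KB:Abbott_Laboratories", ["Abbott", "ABT"]
):
    print(mention, "->", entry, "|", text)
```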

We use the SVM classifier mentioned in Section 2.2 to select the instances from the auto-generated data set. The initial classifier can be trained on a set of initial training instances, which can be a small part of the whole automatically generated data. At each iteration, we select an informative and diverse batch of instances and add them to the current training instance set, to further adjust the current hyperplane for more accurate classification in an iterative process.

We use the distance of instances to the hyperplane as the measure to select the informative instances. The distance of an instance's feature vector w to the hyperplane is computed as follows:

\[ \mathrm{Dist}(w) = \Big| \sum_{i=1}^{N} \alpha_i \, y_i \, K(s_i, w) + b \Big| \tag{1} \]

where w is the feature vector of the instance; α_i, y_i and s_i correspond to the weight, the class label and the feature vector of the i-th support vector respectively; N is the number of support vectors in the current model; and K and b are the kernel function and bias of the current model.

Clearly, the current classifier is not able to reliably classify the instances close to the hyperplane. Thus, the instance with the smallest Dist(w) to the current hyperplane is selected into the batch first. The most informative instances in the remaining set are then compared individually with the instances already selected into the batch, to make sure their similarity is less than a threshold β. This diversifies the training instances in the batch to maximize the contribution of each instance; we set β to the average similarity between the instances in the original data set.

When a batch of instances (80 in our experiments) has been selected, we add it to the training instance set and retrain the classifier. This batch learning process iterates until the entity linking performance of the current classifier and of the classifier from the previous iteration converge on a development set.
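A runnable sketch of this informativeness-plus-diversity selection loop is given below. It uses scikit-learn's SVC in place of the authors' SVMLight setup, and the synthetic data, fixed iteration count and convergence handling are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of the batch selection loop: pick instances closest to the
# hyperplane (informative), skip those too similar to the chosen batch
# (diverse), retrain, and repeat.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 15))             # auto-generated instances
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

train_idx = list(range(100))                     # small initial training set
pool_idx = list(range(100, 1000))
BATCH = 80

for _ in range(5):  # in the paper this iterates until dev-set convergence
    clf = SVC(kernel="linear").fit(X_pool[train_idx], y_pool[train_idx])
    # Informativeness: smallest |decision value| = closest to the hyperplane.
    dist = np.abs(clf.decision_function(X_pool[pool_idx]))
    order = np.argsort(dist)
    beta = cosine_similarity(X_pool[pool_idx]).mean()  # diversity threshold
    batch = []
    for j in order:
        cand = pool_idx[j]
        # Diversity: skip candidates too similar to anything already chosen.
        if all(cosine_similarity(X_pool[[cand]], X_pool[[b]])[0, 0] < beta
               for b in batch):
            batch.append(cand)
        if len(batch) == BATCH:
            break
    train_idx.extend(batch)
    chosen = set(batch)
    pool_idx = [i for i in pool_idx if i not in chosen]

print(len(train_idx), "training instances after selection")
```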
5 Incorporating Semantic Feature

Previous approaches treat the context of the mention as a bag of words, n-grams, noun phrases and/or co-occurring named entities, and measure context similarity by comparing the weighted literal term vectors [Varma et al., 2009; Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010]. Such literal matching suffers from two problems: lack of semantic information, and sparsity. Thus, we introduce a topic model, latent Dirichlet allocation (LDA) [Blei et al., 2003], to entity linking, to discover the underlying topics of documents and KB entries.

LDA is a three-level hierarchical Bayesian model. Each document is represented as a set of N words, and the collection has M documents. Each word w in a document is generated from a topic distribution, which is a multinomial distribution over words. The topic indicator Z of the word w is assumed to have a multinomial distribution over topics, which in turn has a Dirichlet prior with hyperparameter α.

For our task, we train LDA models on the KB, where the text of each entry is treated as a document. Once the model is trained, we map the document where the name mention appears to the hidden topic space by calculating the topic proportions θ. Then, the topic probability of each KB entry and of the mention document is learned. Thus, we can calculate the context similarity in the K-dimensional topic space by the Hellinger distance between the topic proportion vectors θ_d of the mention document and θ_e of a KB entry:

\[ \mathrm{Dist}(\theta_d, \theta_e) = \frac{1}{\sqrt{2}} \sqrt{ \sum_{k=1}^{K} \big( \sqrt{\theta_{d,k}} - \sqrt{\theta_{e,k}} \big)^2 } \tag{2} \]

Finally, such semantic similarity can be combined with the other term matching features in the SVM ranker and classifier for entity linking.
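For concreteness, a small sketch computing equation (2) for two topic proportion vectors follows. Deriving a similarity feature as 1 minus the distance is an assumption for illustration; the paper does not specify the exact transformation.

```python
import numpy as np

def hellinger_distance(theta_d: np.ndarray, theta_e: np.ndarray) -> float:
    """Equation (2): Hellinger distance between two topic proportion
    vectors (each non-negative and summing to 1)."""
    diff = np.sqrt(theta_d) - np.sqrt(theta_e)
    return float(np.sqrt(np.sum(diff ** 2)) / np.sqrt(2.0))

# Toy K=4 topic proportions for a mention document and a KB entry.
theta_doc = np.array([0.7, 0.1, 0.1, 0.1])
theta_entry = np.array([0.6, 0.2, 0.1, 0.1])
dist = hellinger_distance(theta_doc, theta_entry)
print(dist, 1.0 - dist)  # distance in [0, 1]; 1 - dist as a similarity feature
```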

6 Experiments and Discussions

6.1 Experimental Setup

In our study, we use the KBP-2010¹ KB and document collection to evaluate our approach. The KB has been auto-generated from Wikipedia. Each KB entry consists of the Wikipedia infobox² and the corresponding Wikipedia text. The KB contains 818,741 entries and the document collection contains 1.7 million documents. The KBP-2010 name mentions comprise 5,404 training instances and 2,250 test instances, across three named entity types: Person, Geo-Political Entity and Organization. The documents containing these mentions are from newswire and blog text. We randomly select 500 examples from the training set as our development data set for instance selection. For pre-processing, we perform Named Entity Recognition using a SVM based system trained and tested on ACE 2005 with 92.5 (P), 84.3 (R) and 88.2 (F). In our implementation, we use the binary SVMLight by Joachims [1999] and the Stanford Topic Modeling Toolbox³ with default learning parameters. In the selection process, we select the KB entries with more than 15 linked documents in the auto-generated data as our initial training set (1,800 instances) to train the initial classifier. We adopt the micro-averaged accuracy used in KBP-2010 to evaluate our Entity Linker, i.e. the number of correct links divided by the total number of mentions.

¹ http://nlp.cs.qc.cuny.edu/kbp/2010/
² http://en.wikipedia.org/wiki/Template:Infobox
³ http://nlp.stanford.edu/software/tmt/tmt-0.3/

6.2 System with Acronym Expansion

We semi-automatically collect training instances (170 instances) for our acronym classifier from the KBP document collection. We mine name mentions which are acronyms, and for each of these mentions, we search the same document for its expansion. If we do not find the expansion in the document, we label it as NIL. Our test instances (1,042 instances) for this experiment are the acronyms in the set of KBP-10 name mentions. Their expansions are the names of the linked entries in the KB.

To benchmark the performance of our approach, we implement Schwartz and Hearst's [2003] and Nadeau and Turney's [2005] algorithms, and evaluate them against our test data. Schwartz and Hearst's algorithm is the approach used by previous entity linking systems. Nadeau and Turney [2005], on the other hand, represent the recent advancement in acronym expansion. In their paper, they search for candidate expansions within the same sentence as the acronym. We expand the search space for their algorithm to include expansions from the rest of the text.

Table 3 shows the results on our test instances, which contain 306 NIL, 368 (A) and 368 FREE acronyms.

Method | All | NIL | (A) | FREE
Conventional + POS | 82.4 | 95.8 | 78.5 | 75.0
+ leading match | 88.9 | 97.1 | 87.0 | 84.0
+ sent dist exp | 92.9 | 98.8 | 90.0 | 91.6
Nadeau+ | 77.8 | 98.0 | 71.6 | 67.0
Schwartz | 54.3 | 98.0 | 72.8 | 0

Table 3: Results of Acronym Expansion. NIL are acronyms whose expansions are not found in the document. (A) refers to acronyms and expansions that are adjacent. FREE refers to expansions that can be found anywhere in the document and are not adjacent to the acronym.

Using the Leading match features significantly improves the accuracy of our classifier for all types of acronyms. The sent dist exp feature is especially useful for FREE acronyms, where it is most relevant. We see that our approach for decoding acronyms achieves a 15.1% or greater improvement over Nadeau's and Schwartz's algorithms. Our classifier does better for (A) acronyms because it can better capture the variability of expansions in the newswire and web domains. Similarly, for FREE acronyms, our classifier is also significantly better.

We further conduct experiments on our 1,042 test instances to test the effectiveness of our acronym expansion method for entity linking. As mentioned in Section 1, previous work only uses rule-based methods for entity linking. Table 4 lists the performance of entity linking with overall accuracy as well as accuracy on subsets of the data. It compares our method with no acronym processing (No Algo), the rule-based method [Schwartz and Hearst, 2003], and the extension of a machine learning method [Nadeau and Turney, 2005] (Nadeau+). The table shows that our method achieves statistically significant improvements over the other three methods. Table 5 shows the reason for the improvements: using our expansion method, the number of retrieved candidates for each name mention is only 4.2 on average, which means significantly less ambiguity. Meanwhile, this smaller candidate set does not compromise recall. Our acronym expansion method is used in the following experiments on instance selection and semantic features.

Methods | ALL | NIL | Non-NIL | ORG | GPE | PER
No Algo | 76.1 | 94.0 | 50.9 | 76.5 | 73.5 | 40.0
Schwartz | 77.6 | 95.2 | 52.9 | 78.1 | 73.5 | 40.0
Nadeau+ | 79.4 | 95.8 | 56.4 | 80.0 | 73.5 | 40.0
Ours | 82.8 | 96.8 | 63.2 | 83.5 | 76.5 | 40.0

Table 4: Results of Entity Linking for Acronym Mentions

Methods | Recall | Avg. # of Candidates
No Algo | 92.9 | 9.40
Schwartz | 94.0 | 7.16
Nadeau+ | 94.1 | 6.68
Ours | 94.1 | 4.20

Table 5: Results of Candidate Retrieval

6.3 System with Instance Selection

Table 6 compares the performance of different training sets on the KBP-10 testing data. Row 1 (All) uses the data set auto-generated by [Zhang et al., 2010]. Row 2 (KBP) is trained on the KBP-10 training set, which is manually annotated. Row 3 (SubSet) refers to the subset selected from the auto-generated set by the instance selection strategy in Section 4.

Methods | ALL | NIL | Non-NIL | ORG | GPE | PER
All | 81.2 | 81.8 | 80.5 | 80.8 | 72.5 | 90.3
KBP | 82.8 | 84.3 | 80.9 | 82.3 | 76.1 | 90.0
SubSet | 83.6 | 85.3 | 81.6 | 82.7 | 76.2 | 91.9

Table 6: Results of Entity Linking for Instance Selection

Comparing Row 3 with Row 1, our instance selection process gives statistically significant improvements to entity linking. This shows that the selection process makes the training set more balanced. Comparing Row 3 with Row 2, our method achieves better performance without the labor-intensive work of annotating 5,404 articles. For Figure 1, we train the system on different numbers of instances, from the full KBP-10 training set down to 10% of it. As the training data decreases, the accuracy decreases as well. The performance becomes significantly different from our method at 50%. This means that systems using manually annotated data need more than 2,702 training examples to perform as well as our fully automatic instance generation.

Figure 1: Learning curves of systems

6.4 System with Semantic Features

Table 7 shows the effectiveness of the semantic features on the KBP-10 test data. The first row uses only the base features from Section 2.2, which treat the context as literal term vectors. The second row reports the results incorporating the semantic features. Both are trained on the selected subset of auto-generated data.

Features | ALL | NIL | Non-NIL | ORG | GPE | PER
Baseline | 83.6 | 85.3 | 81.6 | 82.7 | 76.2 | 91.9
+ Semantic | 86.1 | 89.2 | 82.3 | 85.2 | 79.4 | 93.6

Table 7: Results of Entity Linking for Semantic Features

We can see that using the topic model significantly outperforms the baseline system, which models context similarity as literal term matching. This proves that the underlying topics of documents have good disambiguation power for entity linking. As discussed in Section 5, literal matching suffers from a sparsity issue; the topic model helps to solve this problem.

Finally, we compare our method with the top 7 systems in the KBP-2010 shared task [Ji et al., 2010].⁴ As shown in Figure 2, the first column shows the performance of our system for entity linking, which outperforms the best solution.

Figure 2: A comparison with KBP-10 systems

7 Conclusion

In this paper, we propose a supervised learning algorithm for acronym expansion, which gives a statistically significant improvement over state-of-the-art acronym expansion methods. Furthermore, we propose an instance selection strategy to reduce the effect of the distribution problem in auto-generated data. With it, the entity linking system achieves better performance without hard labor. Finally, a topic model is introduced to entity linking, which can discover the semantic knowledge between words.

Acknowledgment

This work is partially supported by the Microsoft Research Asia eHealth Theme Program.

References

[Blei et al., 2003] D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[Brinker, 2003] K. Brinker. Incorporating Diversity in Active Learning with Support Vector Machines. In Proceedings of ICML, 2003.

[Chang et al., 2002] J.T. Chang et al. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612, 2002.

[Dredze et al., 2010] M. Dredze, et al. Entity Disambiguation for Knowledge Base Population. In Proceedings of the 23rd International Conference on Computational Linguistics, China, 2010.

[Han and Zhao, 2009] X. Han and J. Zhao. NLPR_KBP in TAC 2009 KBP Track: A Two-Stage Method to Entity Linking. In Proceedings of Text Analysis Conference, 2009.

[Herbrich et al., 2000] R. Herbrich, T. Graepel, and K. Obermayer. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers, 2000.

[Ji et al., 2010] H. Ji, et al. Overview of the TAC 2010 Knowledge Base Population Track. In Proceedings of Text Analysis Conference, 2010.

[Joachims, 1999] T. Joachims. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.

[Larkey et al., 2000] L.S. Larkey et al. Acrophile: an automated acronym extractor and server. In Proceedings of the fifth ACM conference on Digital libraries, 2000.

[McNamee and Dang, 2009] P. McNamee and H. T. Dang. Overview of the TAC 2009 Knowledge Base Population Track. In Proceedings of Text Analysis Conference, 2009.

[Nadeau and Turney, 2005] D. Nadeau and P.D. Turney. A supervised learning approach to acronym identification. Advances in Artificial Intelligence, (May):319-329, 2005.

[Park and Byrd, 2001] Youngja Park and R.J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Proceedings of EMNLP, 2001.

[Pustejovsky et al., 2001] J. Pustejovsky et al. Automatic extraction of acronym meaning pairs from MEDLINE databases. Studies in Health Technology and Informatics, 2001.

[Schwartz and Hearst, 2003] A. Schwartz and M. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, 2003.

[Shen et al., 2004] D. Shen, et al. Multi-Criteria-based Active Learning for Named Entity Recognition. In Proceedings of the ACL, 2004.

[Taghva and Gilbreth, 1999] Kazem Taghva and Jeff Gilbreth. Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition, 1999.

[Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[Varma et al., 2009] V. Varma et al. IIIT Hyderabad at TAC 2009. In Proceedings of Text Analysis Conference, 2009.

[Zhang et al., 2010] W. Zhang, et al. Entity Linking Leveraging Automatically Generated Annotation. In Proceedings of the International Conference on Computational Linguistics, China, 2010.

[Zheng et al., 2010] Z. Zheng, et al. Learning to Link Entities with Knowledge Base. In Proceedings of the North American Chapter of the Association for Computational Linguistics, 2010.

⁴ Another system submission shows 86.8%. However, it accesses the web, which is not allowed in the KBP benchmark, as the purpose is to develop a standalone system, which is our focus here as well.
