Wikipedia Search as Effective Entity Linking Algorithm

Milan Dojchinovski (1,2), Tomáš Kliegr (1), Ivo Lašek (1,2) and Ondřej Zamazal (1)

(1) Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic ([email protected])
(2) Web Engineering Group, Faculty of Information Technology, Czech Technical University in Prague, Czech Republic ([email protected])

Abstract

This paper reports on the participation of the LKD team in the English entity linking task at the TAC KBP 2013. We evaluated various modifications and combinations of Most-Frequent-Sense (MFS) based linking, Entity Co-occurrence (ECC) based linking, and Explicit Semantic Analysis (ESA) based linking. We employed two of our Wikipedia-based NER systems, Entityclassifier.eu and SemiTags. Additionally, two Lucene-based entity linking systems were developed. For the competition we submitted 9 submissions in total, of which 5 used the textual context of the entities and 4 did not. Surprisingly, the MFS method based on Wikipedia Search proved to be the most effective approach: it achieved the best B^3+ F1 score of all our submissions (0.555) and a high B^3+ F1 score (0.677) for Geo-Political (GPE) entities. In addition, the ESA based method achieved the best B^3+ F1 (0.483) for Organization (ORG) entities.

1 Introduction

In the last decade the number of Named Entity Recognition (NER) systems which recognize, classify and link entities in text with entities in other knowledge bases has been constantly increasing. One of the key tasks of a NER system is entity linking. Its ultimate goal is to enable linkage of text corpora (unstructured information) with other knowledge bases (structured information).

To enable linking of entities in text with other knowledge bases, in our previous work we developed the Entityclassifier.eu (http://entityclassifier.eu/) (Dojchinovski and Kliegr, 2013) and SemiTags (http://ner.vse.cz/SemiTags/) (Lašek and Vojtáš, 2013) NER systems. Both systems can spot entities in text, link them with entities in Wikipedia (resp. DBpedia) and, finally, classify them with concepts from the DBpedia Ontology (http://dbpedia.org/Ontology). While the Entityclassifier.eu system performs entity linking based on the Most-Frequent-Sense (MFS) method and does not use the text context around the entities, SemiTags utilizes a more advanced entity linking method which is based on Entity Co-occurrence (ECC) in Wikipedia and uses the entity text context.

In this paper, we report on the evaluation of the entity linking methods of the Entityclassifier.eu and SemiTags NER systems on the TAC KBP 2013 Entity Linking task (http://www.nist.gov/tac/2013/KBP/EntityLinking/). For the task we also developed and evaluated an additional entity linking method based on the Explicit Semantic Analysis (ESA) approach (Gabrilovich and Markovitch, 2007). We also evaluated two additional variants of the MFS method used by the Entityclassifier.eu system.

The remainder of the paper is structured as follows. Section 2 describes how the provided TAC KBP knowledge base was prepared and linked with our Wikipedia (resp. DBpedia) knowledge base. Section 3 describes our entity linking methodology and the evaluated entity linking methods, and provides a description of each of the 9 submissions. Section 4 presents and discusses the achieved results. Finally, Section 5 concludes the paper.

2 Task Description and Data Preparation

2.1 The Entity Linking Task

The entity linking task, as described by the challenge organizers (http://www.nist.gov/tac/2013/KBP/EntityLinking/), is the task of linking entity mentions in a document corpus to entities in a reference KB. If an entity is not already in the reference KB, a new entity node should be added to the KB. Each participating team was given a set of 2190 queries consisting of a queryID, a docID, a name (the name mention of an entity) and beg and end entity offsets in the document.

Further, the system performing the entity linking task had to output results providing the queryID, the KB link (or a NIL entity identifier, if the entity was not present in the KB) and a confidence value.

2.2 Data Preparation

Since the TAC KBP reference knowledge base uses custom identifiers for the entities (e.g. E0522900) and our systems identify entities with DBpedia URIs, it was necessary to map between these identifiers.

In the TAC KBP knowledge base each entity entry provides the custom identifier of the entity and the URL path segment of a Wikipedia article describing the entity (e.g. the entity with URL Sam Butler and id E0522900). Since DBpedia also derives its URIs from the URIs of the Wikipedia articles, we used the URLs of the Wikipedia articles describing the entities to map them to DBpedia. For example, the entity in the TAC KBP KB with identifier E0522900 was mapped to the DBpedia URI http://dbpedia.org/resource/Sam_Butler. This way we could relate the DBpedia URI identifiers used by our systems with the entity identifiers in the TAC reference knowledge base.
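To illustrate this mapping, the following minimal Python sketch derives a DBpedia resource URI from the Wikipedia article title recorded in a TAC KBP entity entry. The entry values (E0522900, Sam Butler) come from the example above; the function name and the simple string handling are our own illustration, not the implementation actually used in our systems.

from urllib.parse import quote

def wikipedia_title_to_dbpedia_uri(title: str) -> str:
    # DBpedia mints its resource URIs from the URL path segment of the
    # corresponding English Wikipedia article: spaces become underscores
    # and the remaining characters are percent-encoded where necessary.
    path_segment = title.strip().replace(" ", "_")
    return "http://dbpedia.org/resource/" + quote(path_segment, safe="_(),'")

# Example from the text: TAC KBP entity E0522900 ("Sam Butler")
kb_entry = {"id": "E0522900", "wiki_title": "Sam Butler"}
print(kb_entry["id"], "->", wikipedia_title_to_dbpedia_uri(kb_entry["wiki_title"]))
# E0522900 -> http://dbpedia.org/resource/Sam_Butler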
3 Methodology

In our entity linking approach we developed and used three different independent methods. Two of the methods, the MFS method and the ECC method, are already used for entity linking in the Entityclassifier.eu and SemiTags NER systems, respectively. A third, novel method based on the ESA approach was developed specifically for the entity linking challenge.

All three methods follow a three-step approach. First, candidate selection is applied, where a set of entity candidates is retrieved for the given entity. Second, disambiguation is performed, where one candidate from the candidate list is selected as the correct one. Finally, the selected entity candidate is linked, i.e. a reference in the TAC KBP knowledge base is identified. If the entity is not found in the KB, a new NIL entity node is added.

Below we describe each entity linking method, followed by a description of each submission.

3.1 Most Frequent Sense Method

The MFS method is a context-independent method: it does not use the text surrounding the entity, only the entity name, when performing the linking. In this approach the entity is linked with the most-frequent-sense entity found in the reference knowledge base. To realize the MFS approach we used the public English Wikipedia Search API and a specialized Lucene index (http://www.mediawiki.org/wiki/Extension:Lucene-search) which extends the Apache Lucene search API. It primarily ranks pages based on the number of backlinks and the Wikipedia articles' titles. Note that the Wikipedia Search API is built on top of the Lucene index and offers some additional functionality. This approach was considered as a baseline.
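As a concrete illustration of the MFS baseline, the sketch below queries the public MediaWiki search API for an entity name and links the name to the top-ranked article. This is a simplified stand-in under our own assumptions: the actual runs combined the Wikipedia Search API with the specialized Lucene index described above, and the function name, parameters and use of the requests library are ours, not part of the submitted systems.

import requests
from urllib.parse import quote

WIKI_API = "https://en.wikipedia.org/w/api.php"

def mfs_link(entity_name: str) -> str | None:
    # Link an entity name to its most-frequent-sense article by taking the
    # top hit returned by the Wikipedia search API (context-independent).
    params = {
        "action": "query",
        "list": "search",
        "srsearch": entity_name,
        "srlimit": 1,
        "format": "json",
    }
    hits = requests.get(WIKI_API, params=params, timeout=10).json()["query"]["search"]
    if not hits:
        return None  # no candidate found; the query would become a NIL entity
    title = hits[0]["title"]
    # Reuse the Wikipedia-title-to-DBpedia mapping from Section 2.2.
    return "http://dbpedia.org/resource/" + quote(title.replace(" ", "_"), safe="_")

print(mfs_link("New York"))  # always the top-ranked sense, regardless of context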
3.2 Entity Co-occurrence Method

With the ECC method, we aim at capturing relations between entities rather than their textual representation. An example of a similar structural representation is described in (Milne and Witten, 2008), where each entity (represented as a Wikipedia article) is characterized by the structure of its incoming links instead of its textual description.

Contrary to the approach presented in (Milne and Witten, 2008), our ECC based model does not compare similarities of individual entities. We search for the best combination of candidates (possible meanings) for the individual surface forms in an analysed text, where individual paragraphs represent the context.

For example, let us consider the following sentence: "Michael Bloomberg was the mayor of New York." Simple observation shows that the entity Michael Bloomberg (former mayor of New York) co-occurs in the same paragraph of our knowledge base with the correct entity New York City in the United States much more often (88 times) than with the New York in England (0 times).

Because generating all candidate combinations is a very demanding task, we developed a heuristic that quantifies the impact of co-occurrences in the same paragraph. We construct an incidence matrix I of size |C| x |C|, where C is the set of candidate entities (possible meanings of the identified name). The weights in the matrix are the co-occurrence measures; in our case we measure the number of paragraphs in our knowledge base (Wikipedia) in which the two candidates occur together.

Then we compute a score e_{i,s} for each candidate as the sum of the row of the matrix representing the candidate (Equation 1):

  e_{i,s} = sum_{j=1}^{|C|} I_{i,j}   (1)

3.3 Explicit Semantic Analysis Method

In the ESA approach, an input text T is represented as a weighted vector of Wikipedia concepts. For each word w_i in T, the method uses an inverted index to retrieve the Wikipedia articles c_1, ..., c_n containing w_i. The semantic relatedness of the word w_i with concept c_j is computed such that the strength of association between w_i and c_j is multiplied by the TF-IDF weight of w_i in T. The relatedness score for any two documents is determined by computing the cosine similarity between their vectors of document-concept semantic relatedness.

ESA has a number of follow-up papers describing its applications in various areas of information retrieval, including cross-language information retrieval (cf. (Gottron et al., 2011) for an overview). The use of ESA for disambiguating a surface form to a Wikipedia URL was proposed in (Fernandez et al., 2011).

3.4 Submissions Description

For the TAC KBP 2013 Entity Linking task we submitted 9 runs based on different variations of the MFS, ECC and ESA methods. Below we provide a detailed description of each individual submission. The first four runs rely on the MFS linking approach, the fifth run relies on the ECC linking approach, the sixth on the ESA linking approach, and the seventh is a merged submission of the ECC and ESA linking approaches.
