AIDArabic: A Named-Entity Disambiguation Framework for Arabic Text

Mohamed Amir Yosef, Marc Spaniol, Gerhard Weikum
Max-Planck-Institut für Informatik, Saarbrücken, Germany
{mamir|mspaniol|weikum}@mpi-inf.mpg.de

Abstract

There has recently been great progress in the field of automatically generated knowledge bases and corresponding disambiguation systems that are capable of mapping text mentions onto canonical entities. Efforts like the aforementioned have enabled researchers and analysts from various disciplines to semantically "understand" contents. However, most of the approaches have been specifically designed for the English language and - in particular - support for Arabic is still in its infancy. Since the amount of Arabic Web contents (e.g. in social media) has been increasing dramatically over the last years, we see a great potential for endeavors that support entity-level analytics of these data. To this end, we have developed a framework called AIDArabic that extends the existing AIDA system by additional components that allow the disambiguation of Arabic texts based on an automatically generated knowledge base distilled from Wikipedia. Even further, we overcome the still existing sparsity of the Arabic Wikipedia by exploiting the interwiki links between Arabic and English contents in Wikipedia, thus enriching the entity catalog as well as the disambiguation context.

1 Introduction

1.1 Motivation

Internet data, including news articles and web pages, contain mentions of named entities such as people, places, organizations, etc. While in many cases the intended meanings of the mentions are obvious (and unique), in many others the mentions are ambiguous and have many different possible meanings. Therefore, Named-Entity Disambiguation (NED) is essential for many applications in the domain of Information Retrieval (such as information extraction). It also enables producing more useful and accurate analytics. The problem has been exhaustively studied in the literature. The essence of all NED techniques is to use background information extracted from various sources (e.g. Wikipedia), and to use such information to determine the correct/intended meaning of the mention.

Arabic content is growing enormously on the Internet; nevertheless, background information for Arabic is clearly lagging behind other languages such as English. Consider Wikipedia, for example: while the English Wikipedia contains more than 4.5 million articles, the Arabic version contains less than 0.3 million¹. As a result, and to our knowledge, there is no serious work that has been done in the area of performing NED for Arabic input text.

1.2 Problem statement

NED is the problem of mapping ambiguous names of entities (mentions) to canonical entities registered in an entity catalog (knowledge base) such as Freebase (www.freebase.com), DBpedia (Auer et al., 2007), or YAGO (Hoffart et al., 2013). For example, given the text "I like to visit Sheikh Zayed. Despite being close to Cairo, it is known to be a quiet district" (or its Arabic counterpart), when processing this text automatically, we need to be able to tell that Sheikh Zayed denotes the city in Egypt², not the mosque in Abu Dhabi³ or the President of the United Arab

¹ as of July 2014
² http://en.wikipedia.org/wiki/Sheikh_Zayed_City
³ http://en.wikipedia.org/wiki/Sheikh_Zayed_Mosque
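To make the disambiguation task above concrete, the following toy sketch (not AIDA's actual algorithm; the catalog entries and keywords below are invented for illustration) picks, for an ambiguous mention, the candidate entity whose characteristic keywords overlap most with the surrounding context:

```python
# Illustrative sketch of dictionary lookup plus context-overlap scoring.
# All entity names and keyword sets are hypothetical toy data.

def disambiguate(mention, context_words, catalog):
    """Return the candidate entity with the largest context overlap,
    or None if the mention has no candidates in the catalog."""
    candidates = catalog.get(mention)
    if not candidates:
        return None  # entity not in the repository -> map to null
    scores = {
        entity: len(context_words & keywords)
        for entity, keywords in candidates.items()
    }
    return max(scores, key=scores.get)

# Toy catalog: mention -> {candidate entity -> characteristic keywords}
catalog = {
    "Sheikh Zayed": {
        "Sheikh_Zayed_City": {"cairo", "egypt", "district", "giza"},
        "Sheikh_Zayed_Mosque": {"abu", "dhabi", "mosque", "uae"},
    }
}

context = {"visit", "close", "cairo", "quiet", "district"}
print(disambiguate("Sheikh Zayed", context, catalog))
# -> Sheikh_Zayed_City: the Cairo clue favors the Egyptian city
```

A real system additionally weights keywords by specificity and enforces coherence among all mentions in the text, as described below.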

Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 187–195, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Emirates⁴. In order to automatically establish such mappings, the machine needs to be aware of the characteristic description of each entity, and try to find the most suitable one given the input context. In our example, knowing that the input text mentioned the city of Cairo favors the Egyptian city over the mosque in Abu Dhabi, for example. In principle, state-of-the-art NED frameworks require four main ingredients to solve this problem:

- Entity Repository: A predefined universal catalog of all entities known to the NED framework. In other words, each mention in the input text must be mapped to an entity in the repository, or to null, indicating that the correct entity is not included in the repository.

- Name-Entity Dictionary: A many-to-many relation between possible mentions and the entities in the repository. It connects an entity with the different possible mentions that might be used to refer to it, as well as connecting a mention with all potential candidate entities it might denote.

- Entity-Descriptions: Keeps per entity a bag of characteristic keywords or keyphrases that distinguishes one entity from another. In addition, they come with a scoring scheme that signifies the specificity of each keyword to that entity.

- Entity-Entity Relatedness Model: For coherent text, the entities used for mapping all the mentions in the input text should be semantically related. For that reason, an entity-entity relatedness model is required to assess the coherence.

For the English language, all of the ingredients mentioned above are richly available. For instance, the English Wikipedia is a comprehensive, up-to-date resource. Many NED systems use Wikipedia as their entity repository. Furthermore, many knowledge bases are extracted from Wikipedia as well. When trying to apply the existing NED approaches to Arabic text, we face the following challenges:

- Entity Repository: There is no comprehensive entity catalog. The Arabic Wikipedia is an order of magnitude smaller than the English one. In addition, many entities in the Arabic Wikipedia are specific to the Arabic culture, with no corresponding English counterpart. As a consequence, even many prominent entities are missing from the Arabic Wikipedia.

- Name-Entity Dictionary: Most of the name-entity dictionary entries originate from manual input (e.g. anchor links). As outlined before, the Arabic Wikipedia has fewer resources from which to extract name-entity mappings, caused by the lack of entities and the lack of manual input.

- Entity-Descriptions: As already mentioned, there is a scarcity of anchor links in the Arabic Wikipedia. Further, the categorization system of entities is insufficient. Both are essential sources for building the entity descriptions. Hence, it is more challenging to produce a comprehensive description of each entity.

- Entity-Entity Relatedness Model: Relatedness among entities is usually estimated using the overlap in the entities' descriptions and/or the link structure of Wikipedia. Due to the previously mentioned scarcity of contents in the Arabic Wikipedia, it is also difficult to accurately estimate the entity-entity relatedness.

As a consequence, the main challenge in performing NED on Arabic text is the lack of a comprehensive entity catalog together with rich descriptions of each entity. We considered our open-source AIDA system (Hoffart et al., 2011)⁵ - mentioned as a state-of-the-art NED system by (Ferrucci, 2012) - as a starting point and modified its data acquisition pipeline in order to generate a schema suitable for performing NED on Arabic text.

1.3 Contribution

We developed an approach to exploit and fuse cross-lingual evidence to enrich the background information we have about entities in Arabic, in order to build a comprehensive entity catalog together with entity contexts that are not restricted to the Arabic Wikipedia. Our contributions can be summarized in the following points:

- Entity Repository: We switched to YAGO3 (Mahdisoltani et al., 2014), the multilingual version of YAGO2s. YAGO3 comes with a more comprehensive catalog that covers entities from different languages (extracted from different Wikipedia dumps). While we selected YAGO3 to be our background knowledge base, any multi-lingual knowledge base such as Freebase could be used as well.

- Name-Entity Dictionary: We compiled a dictionary from YAGO3 and Freebase to provide the potential candidate entities for each mention string. While the mention is in Arabic, the entity can belong to either the English or the Arabic Wikipedia.

- Entity-Descriptions: We harnessed different ingredients in YAGO3 and Wikipedia to produce a rich entity context schema. For the sake of precision, we did not employ any automated translation.

- Entity-Entity Relatedness Model: We fused the link structure of both the English and Arabic Wikipedias to compute a comprehensive relatedness measure between the entities.

⁴ http://en.wikipedia.org/wiki/Zayed_bin_Sultan_Al_Nahyan
⁵ https://www.github.com/yago-naga/aida

2 Related Work

NED is one of the classical NLP problems that is essential for many Information Retrieval tasks. Hence, it has been extensively addressed in NLP research. Most NED approaches use Wikipedia as their knowledge repository. (Bunescu and Pasca, 2006) defined a similarity measure that compared the context of a mention to the Wikipedia categories of the entity candidate. (Cucerzan, 2007; Milne and Witten, 2008; Nguyen and Cao, 2008) extended this framework by using richer features for similarity comparison. (Milne and Witten, 2008) introduced the notion of semantic relatedness and estimated it using the co-occurrence counts in Wikipedia. They used the Wikipedia link structure as an indication of co-occurrence. Below, we give a brief overview of the most recent NED systems:

The AIDA system is an open-source system that employs contextual features extracted from Wikipedia (Hoffart et al., 2011; Yosef et al., 2011). It casts the NED problem into a graph problem with two types of nodes (mention nodes and entity nodes). The weights on the edges between the mentions and the entities are the contextual similarities between the mention's context and the entity's context. The weights on the edges between the entities are the semantic relatedness among those entities. In a subsequent process, the graph is iteratively reduced to achieve a dense sub-graph in which each mention is connected to exactly one entity.

The CSAW system uses local scores computed from 12 features extracted from the context surrounding the mention and the candidate entities (Kulkarni et al., 2009). In addition, it computes global scores that capture relatedness among annotations. NED is then formulated as a quadratic programming optimization problem, which negatively affects the performance. The software, however, is not available.

DBpedia Spotlight uses Wikipedia anchors, titles and redirects to search for mentions in the input text (Mendes et al., 2011). It casts the context of the mention and the entity into a vector-space model. Cosine similarity is then applied to identify the candidate with the highest similarity. Nevertheless, their model does not incorporate any semantic relatedness among entities. The software is currently available as a service.

TagMe 2 exploits the Wikipedia link structure to estimate the relatedness among entities (Ferragina and Scaiella, 2010). It uses the measure defined by (Milne and Witten, 2008) and incorporates a voting scheme to pick the right mapping. According to the authors, the system is geared towards short input text with limited context. Therefore, the approach favors coherence among entities over contextual similarity. TagMe 2 is available as a service.

Illinois Wikifier formulates NED as an optimization problem with an objective function designed for higher global coherence among all mentions (Ratinov et al., 2011). In contrast to AIDA and TagMe 2, it does not incorporate the link structure of Wikipedia to estimate the relatedness among entities. Instead, it uses the normalized Google similarity distance (NGD) and pointwise mutual information. The software is likewise available as a service.

Wikipedia Miner is a machine-learning based approach (Milne and Witten, 2008). It exploits three features in order to train the classifier: the prior probability that a mention refers to a specific entity, properties extracted from the mention context, and finally the entity-entity relatedness. The software of Wikipedia


Figure 1: AIDArabic Architecture
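The data flow of Figure 1 (Extraction, AIDA Schema Building, Translation, Filtration) can be pictured as four composed stages. The sketch below is purely illustrative - the stage functions, the interwiki mapping, and the toy schema contents are invented for this example and are not AIDArabic's actual code:

```python
# Illustrative sketch of the four-stage pipeline in Figure 1.
# Stage functions and all data are invented stand-ins.

def extract(dumps):
    """Extraction: pull entities, interwiki mappings, and raw context
    from the English and Arabic dumps (toy output here)."""
    return {
        "entities": {"Egypt"},
        "entity_dict": {"ar/مصر": "Egypt"},
        "context": {"Egypt": ["Cairo", "Nile", "ar/النيل"]},
    }

def build_schema(data):
    """AIDA Schema Building: assemble the extracted pieces into one schema."""
    return dict(data)

def translate(schema):
    """Translation: map English context terms to Arabic only where an
    interwiki/category mapping exists (no machine translation)."""
    mapping = {"Cairo": "ar/القاهرة"}  # hypothetical interwiki mapping
    schema["context"] = {
        e: [mapping.get(t, t) for t in terms]
        for e, terms in schema["context"].items()
    }
    return schema

def filter_english(schema):
    """Filtration: drop context terms that remained in English."""
    schema["context"] = {
        e: [t for t in terms if t.startswith("ar/")]
        for e, terms in schema["context"].items()
    }
    return schema

schema = filter_english(translate(build_schema(extract(None))))
print(schema["context"]["Egypt"])  # -> ['ar/القاهرة', 'ar/النيل']
```

The key design choice this mirrors is that untranslatable English context is discarded at the end rather than machine-translated, trading coverage for precision.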

Miner is available on their website.

The approaches mentioned before have been developed for English-language NED. As such, none of them is ready to handle Arabic input without major modification. As of now, no previous research exploits cross-lingual resources to enable NED for Arabic text. Nevertheless, cross-lingual resources have been used to improve Arabic NER (Darwish, 2013). They used the Arabic and English Wikipedias together with DBpedia in order to build a large Arabic-English dictionary for names. This augments the Arabic names with a capitalization feature, which is missing in the Arabic language.

3 Architecture

In order to build AIDArabic, we have extended the pipeline used for building an English AIDA schema from the YAGO knowledge base. The new architecture is shown in Figure 1, which indicates the components that have been added for AIDArabic. These are pre- and post-processing stages around the original AIDA schema extractor. The new pipeline can be divided into the following stages:

Extraction

We have configured a dedicated YAGO3 extractor to provide the data necessary for AIDArabic. To this end, we feed the English and Arabic Wikipedias into the YAGO3 extractor to produce three major outputs:

- Entity Repository: A comprehensive set of the entities that exist in both the English and Arabic Wikipedias, together with the corresponding anchor texts, categories, and links from and to each entity.

- Entity Dictionary: An automatically compiled set of mappings that captures the interwiki links between the English and the Arabic Wikipedias.

- Categories Dictionary: An automatically harvested list of mappings between the English and the Arabic Wikipedia categories.

More details about the data generated by each extractor are given in Section 4.

AIDA Schema Building

In this stage we invoke the original AIDA schema builder without any language information. However, we additionally add the Freebase knowledge base to AIDA and map Freebase entities to YAGO3 entities. Freebase is used here solely to harness its coverage of multi-lingual names of different entities. It is worth noting that Freebase is used merely to enrich YAGO3; the set of entities is gathered from YAGO. In other words, if there is an entity in Freebase without a YAGO counterpart, it gets discarded.

Translation

Although it is generally viable to use machine translation or "off the shelf" English-Arabic dictionaries to translate the context of entities, we confine ourselves to the dictionaries extracted from Wikipedia that map entities as well as categories from English to Arabic. This is done in order to achieve the high precision derived from the manual labor inherent in interwiki links and assigned categories.

Filtration

This is a final cleaning stage. Despite translating the context of entities using the Wikipedia-based dictionaries as comprehensively as possible, a considerable amount of context information remains in English (e.g. those English categories that do not have an Arabic counterpart). To this end, any remaining leftovers in English are discarded.

4 Implementation

This section explains the implementation of the pipeline described in Section 3. We first highlight the differences between YAGO2 and YAGO3, which justify the switch of the underlying knowledge base. Then, we present the techniques we have developed in order to build the dictionary between mentions and candidate entities. After that, we explain the context enrichment for Arabic entities by exploiting cross-lingual evidence. Finally, we briefly explain the entity-entity relatedness measure applied for disambiguation. Table 1 summarizes the terminology used in the following subsections.

4.1 Entity Repository

YAGO3 has been specifically designed as a multilingual knowledge base. Hence, the standard YAGO3 extractors take as input a set of Wikipedia dumps from different languages, and produce a unified repository of named entities across all languages. This is done by considering interwiki links: if an entity in language l ∈ L − {en} has an English counterpart, the English one is kept instead of that in language l; otherwise, the original entity is kept. For example, in our repository, the entity used to represent Egypt is "Egypt", coming from the English Wikipedia, instead of "ar/مصر", coming from the Arabic Wikipedia. However, the entity that refers to the western part of Cairo is identified as "ar/غرب القاهرة", because it has no counterpart in the English Wikipedia. Formally, the set of entities in YAGO3 is defined as follows:

E = E_en ∪ E_ar

After the extraction is done, YAGO3 generates an entity dictionary for each and every language. This dictionary translates any language-specific entity into the one that is used in YAGO3 (whether the original one, or the English counterpart). Based on the previous example, the following entries are created in the dictionary:

ar/مصر → Egypt
ar/غرب القاهرة → ar/غرب القاهرة

Such a dictionary is essential for all further processing we do over YAGO3 to enrich the Arabic knowledge base using the English one. It is worth noting here that this dictionary is harvested completely automatically from the interwiki links in Wikipedia, and hence no automated machine translation and/or transliteration is invoked (e.g. for person names, organization names, etc.). While this may harm the coverage of our linkage, it guarantees the precision of our mapping, thanks to the high quality of interwiki links between named entities in Wikipedia.

4.2 Name-Entity Dictionary

The dictionary in the context of NED refers to the relation that connects strings to canonical entities. In other words, given a mention string, the dictionary provides a list of potential canonical entities this string may refer to. In our original implementation of AIDA, this dictionary was compiled from four sources extracted from Wikipedia (titles, disambiguation pages, redirects, and anchor texts). We used the same sources after adapting them to the Arabic domain, and added to them entries coming from Freebase. In the following, we briefly summarize the main ingredients used to populate our dictionary:

- Titles: The most natural possible name of a canonical entity is the title of its corresponding page in Wikipedia. This is different from the entity ID itself. For example, for the entity "Egypt", which gets its ID from the English Wikipedia, we consider the title "مصر" coming from the Arabic Wikipedia.

- Disambiguation Pages: These pages (called "صفحات التوضيح" in the Arabic Wikipedia) are dedicated pages that list the different possible meanings of a specific name. We harness all the links in a disambiguation page and add them as

l              - a language in Wikipedia
L              - the set of all languages in Wikipedia
e_en           - an entity originating from the English Wikipedia
e_ar           - an entity originating from the Arabic Wikipedia
e              - an entity in the final collection of YAGO3
E              - the set of the corresponding entities
Cat_en(e)      - the set of categories of an entity e in the English Wikipedia
Cat_ar(e)      - the set of categories of an entity e in the Arabic Wikipedia
Inlink_en(e)   - the set of incoming links to an entity e in the English Wikipedia
Inlink_ar(e)   - the set of incoming links to an entity e in the Arabic Wikipedia
Trans_en→ar(S) - the translation of each element in S from English to Arabic using the appropriate dictionaries

Table 1: Terminology
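Using the notation of Table 1, the cross-lingual fusion applied to entity descriptions in Section 4.3 - Inlink(e) = Inlink_ar(e) ∪ Trans_en→ar(Inlink_en(e)), and likewise for Cat(e) - together with an inlink-overlap relatedness in the spirit of (Milne and Witten, 2008), can be sketched as follows. The interwiki dictionary and link sets are toy data, and the relatedness function is our reading of the Milne-Witten measure, not AIDA's exact code:

```python
import math

def trans(s, interwiki):
    """Trans(S): map each element of S through the interwiki dictionary;
    elements without a mapping are kept (and dropped later by filtration)."""
    return {interwiki.get(x, x) for x in s}

def fuse(ar_set, en_set, interwiki):
    """Inlink(e) = Inlink_ar(e) | Trans(Inlink_en(e)); same shape for Cat(e)."""
    return ar_set | trans(en_set, interwiki)

def relatedness(in_a, in_b, n_entities):
    """Inlink-overlap relatedness in the spirit of Milne and Witten (2008)."""
    overlap = len(in_a & in_b)
    if overlap == 0:
        return 0.0
    big = max(len(in_a), len(in_b))
    small = min(len(in_a), len(in_b))
    return 1 - (math.log(big) - math.log(overlap)) / \
               (math.log(n_entities) - math.log(small))

# Toy data: Arabic inlink titles, English inlink titles, interwiki mapping.
interwiki = {"Cairo": "ar/القاهرة"}
inlink_ar = {"ar/الجيزة"}
inlink_en = {"Cairo", "Giza Governorate"}

fused = fuse(inlink_ar, inlink_en, interwiki)
# fused contains ar/الجيزة, ar/القاهرة, and the untranslated English
# title "Giza Governorate", which the filtration stage would remove.
```

Fusing the two link structures before computing overlap is what lets relatedness estimates benefit from the denser English Wikipedia graph.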

potential entities for that name. To this end, we extract this content solely from the Arabic Wikipedia. For instance, the phrase "مدينة زايد" has a disambiguation page that lists all the cities called Zayed, including the ones in Egypt and the United Arab Emirates.

- Redirects: "تحويلات" denotes redirects in the Arabic Wikipedia. Those are pages where you search for a name and are redirected to the most prominent meaning of this name. We extract these from the Arabic Wikipedia as well. For example, if you search the Arabic Wikipedia for the string "زايد", you are automatically redirected to the page of the president of the United Arab Emirates.

- Anchor Text: When people create links in Wikipedia, they sometimes use names different from the title of the entity page as the anchor text. This indicates that this new name is also a possible name for that entity. Therefore, we collect all anchors in the Arabic Wikipedia and associate them with the appropriate entities. For example, the Arabic Wikipedia page of Sheikh Zayed contains an anchor link to the city of Al Ain "ar/العين", while the anchor text reads "المنطقة الشرقية" (in English: "The Eastern Area"). Therefore, when there is a mention called "The Eastern Area", one of the potential candidate meanings is the city of Al Ain in the United Arab Emirates.

- Freebase: Freebase is a comprehensive resource which comes with multi-lingual labels for different entities. In addition, there is a one-to-one mapping between (most of the) Freebase entities and YAGO3 entities, because Freebase is extracted from Wikipedia as well. Therefore, we carry over the Arabic names of the entities from Freebase to our AIDA schema after mapping the entities to their corresponding ones in YAGO3.

4.3 Entity-Descriptions

The context of an entity is the cornerstone of the data required to perform the NED task with high quality. Having a comprehensive and "clean" context for each entity facilitates the task of the NED algorithm by providing good clues for the correct mapping. We follow the same approach that we used in the original AIDA framework by representing an entity context as a set of characteristic keyphrases that capture the specifics of this entity. The keyphrases are further decomposed into keywords, with specificity scores assigned to each of them in order to estimate the global and entity-specific prominence of each keyword. The original implementation of AIDA extracted keyphrases from four different sources (anchor texts, inlink titles, categories, as well as citation titles and external links). Below we summarize how we adapted the extraction to accommodate the disambiguation of Arabic text.

- Anchor Text: Anchors in a Wikipedia page are usually good indicators of the most important aspects of that page. In the original implementation of AIDA, all anchors in a page are associated with the corresponding entity of this page and added to the set of its keyphrases. The same holds for AIDArabic; however, we extract the anchors from the Arabic Wikipedia to get Arabic context.

- Inlink Titles: In the same fashion that links to other entities are good clues for the aspects of an entity, links coming from other entities are as well. In AIDA, the titles of the pages that link to an entity were considered among the keyphrases of that entity. We pursued the same approach here, and fused the incoming links to an entity from both the English and the Arabic Wikipedia. Once the set of incoming links was fully built, we applied - when applicable - interwiki links to translate the titles of the entities coming from the English Wikipedia into Arabic. Formally:

  Inlink(e) = Inlink_ar(e) ∪ Trans_en→ar(Inlink_en(e))

- Categories: Each Wikipedia page belongs to one or more categories, which are mentioned at the bottom of the page. We configured YAGO3 to provide the union of the categories from both the English and the Arabic Wikipedia. We exploit the interwiki links among categories to translate the English categories into Arabic. This comes with two benefits: we use the category mappings, which result in fairly accurate translations in contrast to machine translation; in addition, we enrich the category system of the Arabic Wikipedia with categories from the English one for entities that have a corresponding English counterpart. Formally:

  Cat(e) = Cat_ar(e) ∪ Trans_en→ar(Cat_en(e))

- Citation Titles and External Links: Those were two sources of entity context in the original AIDA. Due to their small coverage in the Arabic Wikipedia, we ignored them in AIDArabic.

Table 2 summarizes which context resources have been translated and/or enriched from the English Wikipedia.

4.4 Entity-Entity Relatedness Model

For coherent text, there should be a connection between all entities mentioned in the text. In other words, a piece of text cannot cover too many aspects at the same time. Therefore, recent NED techniques exploit entity-entity relatedness to further improve the quality of mapping mentions to entities. The original implementation of AIDA used for that purpose a measure introduced by (Milne and Witten, 2008) that estimates the relatedness or coherence between two entities using the overlap in their incoming links in the English Wikipedia.

Despite the cultural differences, it is fairly conceivable to assume that if two entities are related in the English Wikipedia, they should also be related in the Arabic one. In addition, we enrich the link structure used in AIDA with the link structure of the Arabic Wikipedia. Hence, we estimate the relatedness between entities using the overlap in incoming links in both the English and Arabic Wikipedias together.

5 Experimentation

5.1 Setup and Results

To our knowledge, there is no standard Arabic dataset available for a systematic evaluation of NED. In order to assess the quality of our system, we manually prepared a small benchmark collection. To this end, we gathered 10 news articles from www.aljazeera.net from the domains of sports and politics, including regional as well as international news. We manually annotated the mentions in the text, and disambiguated the text using AIDArabic. In our setup, we used the LOCAL configuration setting of AIDA together with the original weights. The data set contains a total of 103 mentions. AIDArabic managed to annotate 34 of them correctly, and assigned 68 to NULL, while one mention was mapped wrongly.

5.2 Discussion

AIDArabic's performance in terms of precision is impressive (97.1%). Performance in that regard is positively influenced by testing on "clean" input of news articles. Nevertheless, AIDArabic loses on recall. Mentions that are mapped to NULL either

Context Source            | Arabic Wikipedia | English Wikipedia
Anchor Text               | +                | -
Categories                | +                | +
Title of Incoming Links   | +                | +

Table 2: Entity Context Sources

have no correct entity in the entity repository, or the entity exists but lacks the corresponding name-entity dictionary entry. This observation confirms our initial hypothesis that the lack of data is one of the main challenges for applying NED to Arabic text. Another aspect that harms recall is the nature of the Arabic language. Letters get attached to the beginning and/or the end of words (e.g. connected prepositions and pronouns). In such cases, when querying the dictionary, AIDArabic is not able to retrieve the correct candidates for a mention like "بفرنسا" ("in France"), because of the "ب" at the beginning. Similar difficulties arise when matching the entity descriptions: many keywords fail to match the input text because they appear in a modified version, augmented with some extra letters.

6 Conclusion & Outlook

In this paper, we have introduced the AIDArabic framework, which allows named-entity disambiguation of Arabic texts based on an automatically generated knowledge base derived from Wikipedia. Our proof-of-concept implementation shows that entity disambiguation for Arabic texts becomes viable, although the underlying data sources (in particular Wikipedia) are still relatively sparse. Since our approach "integrates" knowledge encapsulated in interwiki links from the English Wikipedia, we are able to boost the amount of available context information compared to a solely monolingual approach.

As a next step, we intend to build a proper dataset that we will use for a systematic evaluation of AIDArabic. In addition, we plan to apply machine translation/transliteration techniques to keyphrases and/or dictionary lookup to keywords in order to provide even more context information for each and every entity. We may also employ approximate matching approaches for keyphrases to account for the existence of additional letters connected to words. As a byproduct, we will be able to apply AIDArabic to less formal text (e.g. social media), which contains a considerable amount of misspellings, for example. Apart from assessing and improving AIDArabic, a natural next step is to extend the framework by extractors for other languages, such as French or German. By doing so, we are going to create a framework which will, in its final version, be fully language-agnostic.

Acknowledgments

We would like to thank Fabian M. Suchanek and Joanna Biega for their help with adapting the YAGO3 extraction code to fulfill AIDArabic's requirements.

References

[Auer et al. 2007] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference, pages 11–15, Busan, Korea.

[Bunescu and Pasca 2006] Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), pages 9–16, Trento, Italy.

[Cucerzan 2007] S. Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL 2007, pages 708–716, Prague, Czech Republic.

[Darwish 2013] Kareem Darwish. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 1558–1567, Sofia, Bulgaria.

[Ferragina and Scaiella 2010] Paolo Ferragina and Ugo Scaiella. 2010. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), pages 1625–1628, New York, NY, USA.

[Ferrucci 2012] D. A. Ferrucci. 2012. Introduction to "This is Watson". IBM Journal of Research and Development (Volume 56, Issue 3), pages 235–249.

[Hoffart et al. 2011] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 782–792, Edinburgh, Scotland.

[Hoffart et al. 2013] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence (Volume 194), pages 28–61.

[Kulkarni et al. 2009] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2009. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2009), pages 457–466, New York, NY, USA.

[Mahdisoltani et al. 2014] Farzane Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. 2014. A knowledge base from multilingual Wikipedias - YAGO3. Technical report, Telecom ParisTech. http://suchanek.name/work/publications/yago3tr.pdf

[Mendes et al. 2011] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics 2011), pages 1–8, New York, NY, USA.

[Milne and Witten 2008] David N. Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of the 17th ACM International Conference on Information and Knowledge Management (CIKM 2008), pages 509–518, New York, NY, USA.

[Nguyen and Cao 2008] Hien T. Nguyen and Tru H. Cao. 2008. Named entity disambiguation on an ontology enriched by Wikipedia. In Proceedings of the IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2008), pages 247–254, Ho Chi Minh City, Vietnam.

[Ratinov et al. 2011] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT 2011), pages 1375–1384, Stroudsburg, PA, USA.

[Yosef et al. 2011] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum. 2011. AIDA: An online tool for accurate disambiguation of named entities in text and tables. In Proceedings of the 37th International Conference on Very Large Data Bases (VLDB 2011), pages 1450–1453, Seattle, WA, USA.
