AIDArabic: A Named-Entity Disambiguation Framework for Arabic Text
Mohamed Amir Yosef, Marc Spaniol, Gerhard Weikum
Max-Planck-Institut für Informatik, Saarbrücken, Germany
{mamir|mspaniol|weikum}@mpi-inf.mpg.de

Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 187–195, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Abstract

There has recently been great progress in the field of automatically generated knowledge bases and corresponding disambiguation systems that are capable of mapping text mentions onto canonical entities. Such efforts have enabled researchers and analysts from various disciplines to semantically "understand" contents. However, most of the approaches have been specifically designed for the English language and, in particular, support for Arabic is still in its infancy. Since the amount of Arabic Web content (e.g. in social media) has been increasing dramatically over the last years, we see great potential for endeavors that support entity-level analytics of these data. To this end, we have developed a framework called AIDArabic that extends the existing AIDA system by additional components that allow the disambiguation of Arabic texts based on an automatically generated knowledge base distilled from Wikipedia. Furthermore, we overcome the still existing sparsity of the Arabic Wikipedia by exploiting the interwiki links between Arabic and English contents in Wikipedia, thus enriching the entity catalog as well as the disambiguation context.

1 Introduction

1.1 Motivation

Internet data, including news articles and web pages, contain mentions of named entities such as people, places, organizations, etc. While in many cases the intended meaning of a mention is obvious (and unique), in many others the mentions are ambiguous and have many different possible meanings. Therefore, Named-Entity Disambiguation (NED) is essential for many applications in the domain of Information Retrieval (such as information extraction). It also enables producing more useful and accurate analytics. The problem has been exhaustively studied in the literature. The essence of all NED techniques is extracting background information from various sources (e.g. Wikipedia) and using it to determine the correct, intended meaning of the mention.

Arabic content on the Internet is growing enormously; nevertheless, the available background information clearly lags behind that of other languages such as English. Consider Wikipedia, for example: while the English Wikipedia contains more than 4.5 million articles, the Arabic version contains fewer than 0.3 million [1]. As a result, and to the best of our knowledge, no serious work has been done in the area of performing NED for Arabic input text.

1.2 Problem statement

NED is the problem of mapping ambiguous names of entities (mentions) to canonical entities registered in an entity catalog (knowledge base) such as Freebase (www.freebase.com), DBpedia (Auer et al., 2007), or Yago (Hoffart et al., 2013). For example, consider the text "I like to visit Sheikh Zayed. Despite being close to Cairo, it is known to be a quiet district", or in Arabic, "أحب زيارة الشيخ زايد. فهي تتميز بالهدوء بالرغم من قربها من القاهرة". When processing this text automatically, we need to be able to tell that Sheikh Zayed denotes the city in Egypt [2], not the mosque in Abu Dhabi [3] or the President of the United Arab Emirates [4].

[1] as of July 2014
[2] http://en.wikipedia.org/wiki/Sheikh_Zayed_City, http://ar.wikipedia.org/wiki/مدينة_الشيخ_زايد
[3] http://en.wikipedia.org/wiki/Sheikh_Zayed_Mosque, http://ar.wikipedia.org/wiki/جامع_الشيخ_زايد
[4] http://en.wikipedia.org/wiki/Zayed_bin_Sultan_Al_Nahyan, http://ar.wikipedia.org/wiki/زايد_بن_سلطان_آل_نهيان

In order to automatically establish such mappings, the machine needs to be aware of the characteristic description of each entity, and try to find the most suitable one given the input context. In our example, knowing that the input text mentions the city of Cairo favors the Egyptian city over the mosque in Abu Dhabi. In principle, state-of-the-art NED frameworks require four main ingredients to solve this problem:

• Entity Repository: A predefined universal catalog of all entities known to the NED framework. In other words, each mention in the input text must be mapped to an entity in the repository, or to null, indicating that the correct entity is not included in the repository.

• Name-Entity Dictionary: A many-to-many relation between possible mentions and the entities in the repository. It connects an entity with the different mentions that might be used to refer to it, and connects a mention with all potential candidate entities it might denote.

• Entity-Descriptions: For each entity, a bag of characteristic keywords or keyphrases that distinguishes it from other entities. In addition, the keywords come with a scoring scheme that signifies the specificity of each keyword to that entity.

• Entity-Entity Relatedness Model: For coherent text, the entities that are used for mapping all the mentions in the input text should be semantically related. For that reason, an entity-entity relatedness model is required to assess the coherence.

For the English language, all of the ingredients mentioned above are richly available. For instance, the English Wikipedia is a comprehensive, up-to-date resource. Many NED systems use Wikipedia as their entity repository. Furthermore, many knowledge bases are extracted from Wikipedia as well. When trying to apply the existing NED approaches to Arabic text, we face the following challenges:

• Entity Repository: There is no such comprehensive entity catalog. The Arabic Wikipedia is an order of magnitude smaller than the English one. In addition, many entities in the Arabic Wikipedia are specific to the Arabic culture, with no corresponding English counterpart. As a consequence, even many prominent entities are missing from the Arabic Wikipedia.

• Name-Entity Dictionary: Most of the name-entity dictionary entries originate from manual input (e.g. anchor links). As outlined before, the Arabic Wikipedia offers fewer resources from which to extract name-entity mappings, caused by the lack of entities and the lack of manual input.

• Entity-Descriptions: As already mentioned, there is a scarcity of anchor links in the Arabic Wikipedia. Further, the categorization system of entities is insufficient. Both are essential sources for building the entity descriptions. Hence, it is more challenging to produce a comprehensive description of each entity.

• Entity-Entity Relatedness Model: Relatedness among entities is usually estimated using the overlap in the entity descriptions and/or the link structure of Wikipedia. Due to the previously mentioned scarcity of content in the Arabic Wikipedia, it is also difficult to accurately estimate the entity-entity relatedness.

As a consequence, the main challenge in performing NED on Arabic text is the lack of a comprehensive entity catalog together with rich descriptions of each entity. We took our open source AIDA system [5] (Hoffart et al., 2011), described as a state-of-the-art NED system by Ferrucci (2012), as a starting point and modified its data acquisition pipeline in order to generate a schema suitable for performing NED on Arabic text.

[5] https://www.github.com/yago-naga/aida

1.3 Contribution

We developed an approach to exploit and fuse cross-lingual evidence to enrich the background information we have about entities in Arabic, in order to build a comprehensive entity catalog, together with the entities' context, that is not restricted to the Arabic Wikipedia. Our contributions can be summarized in the following points:

• Entity Repository: We switched to YAGO3 (Mahdisoltani et al., 2014), the multilingual version of YAGO2s. YAGO3 comes with a more comprehensive catalog that covers entities from different languages (extracted from different Wikipedia dumps). While we selected YAGO3 to be our background knowledge base, any multilingual knowledge base such as Freebase could be used as well.

• Name-Entity Dictionary: We compiled a dictionary from YAGO3 and Freebase to provide the potential candidate entities for each mention string. While the mention is in Arabic, the entity can belong to either the English or the Arabic Wikipedia.

• Entity-Descriptions: We harnessed different ingredients in YAGO3 and Wikipedia to produce a rich entity context schema. For the sake of precision, we did not employ any automated translation.

• Entity-Entity Relatedness Model: We fused the link structure of both the English and Arabic Wikipedias to compute a com-

mentions and the entities are the contextual similarity between mention's context and entity's context. The weights on the edges between the entities are the semantic relatedness among those entities. In a subsequent process, the graph is iteratively reduced to achieve a dense sub-graph where each mention is connected to exactly one entity.

The CSAW system uses local scores computed from 12 features extracted from the context surrounding the mention and the candidate entities (Kulkarni et al., 2009). In addition, it computes global scores that capture relatedness among annotations. NED is then formulated as a quadratic programming optimization problem, which negatively affects the performance. The software, however, is not available.

DBpedia Spotlight uses Wikipedia anchors, titles and redirects to search for mentions in the input text (Mendes et al., 2011). It casts the context of the mention and the entity into a vector-space model. Cosine similarity is then applied to identify the candidate with the highest similarity. Nevertheless, their model did not incorporate any semantic relatedness among entities.
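The vector-space ranking attributed to DBpedia Spotlight above can be illustrated with a minimal sketch. It is not the implementation of DBpedia Spotlight or of AIDArabic: the function names, the candidate identifiers, and the keyword bags are hypothetical, and raw term counts stand in for whatever term weighting a real system would use. The sketch reuses the Sheikh Zayed example from Section 1.2 and ranks candidate entities by the cosine similarity between the mention context and each entity's keyword bag.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(context_tokens, candidates):
    """Return (entity, score) pairs sorted by descending similarity."""
    context_vec = Counter(t.lower() for t in context_tokens)
    scored = [
        (entity, cosine_similarity(context_vec, Counter(k.lower() for k in keywords)))
        for entity, keywords in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical keyword bags; a real system would derive them from a knowledge base.
CANDIDATES = {
    "Sheikh_Zayed_City": ["city", "egypt", "cairo", "district", "giza"],
    "Sheikh_Zayed_Mosque": ["mosque", "abu", "dhabi", "islam", "architecture"],
    "Zayed_bin_Sultan_Al_Nahyan": ["president", "united", "arab", "emirates", "ruler"],
}

context = (
    "i like to visit sheikh zayed despite being close to cairo "
    "it is known to be a quiet district"
).split()

ranking = rank_candidates(context, CANDIDATES)
best_entity = ranking[0][0]  # the candidate sharing "cairo" and "district" with the context
```

As the text notes, a ranking of this kind scores each mention in isolation; it has no notion of relatedness among the entities chosen for different mentions.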