Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration

Wolodja Wentland Johannes Knopp Carina Silberer Matthias Hartung

Department of Computational Linguistics, Heidelberg University
{wentland, knopp, silberer, hartung}@cl.uni-heidelberg.de

Abstract

In this paper, we present HeiNER, the multilingual Heidelberg Named Entity Resource. HeiNER contains 1,547,586 disambiguated English Named Entities together with translations and transliterations to 15 languages. Our work builds on the approach described in (Bunescu and Pasca, 2006), yet extends it to a multilingual dimension. Translating Named Entities into the various target languages is carried out by exploiting crosslingual information contained in the online encyclopedia Wikipedia. In addition, HeiNER provides linguistic contexts for every NE in all target languages, which makes it a valuable resource for multilingual Named Entity Recognition, Disambiguation and Classification. The results of our evaluation against the assessments of human annotators yield a high precision of 0.95 for the NEs we extract from the English Wikipedia. These source language NEs are thus very reliable seeds for our multilingual NE translation method.

1. Introduction

Named entities (NEs) are fundamental constituents in texts, but are usually unknown words. Thus, named entity recognition (NER) and disambiguation (NED) are a crucial prerequisite for successful information extraction, information retrieval, question answering, discourse analysis and machine translation. In line with (Bunescu, 2007), we consider NED as the task of identifying the unique entity that corresponds to potentially ambiguous occurrences of proper names in texts. (As an example, (Bunescu, 2007) mentions the ambiguous proper name Michael Jordan, which can refer either to the basketball player or to a university professor at Berkeley.) This is particularly important when information about a NE is supposed to be collected and integrated from multiple documents. Especially multilingual settings in information retrieval (Virga and Khudanpur, 2003; Gao et al., 2004) or question answering (Ligozat et al., 2006) require methods for transliteration and translation of NEs.

In this paper we propose a method to acquire NEs in one source language and translate and transliterate them to a great variety of target languages. (Throughout this paper we use the term translate in a wider sense that subsumes transliteration into different scripts as well.) The method we apply in order to recognise and disambiguate NEs in the source language follows the approach of (Bunescu and Pasca, 2006). For multilingual NE acquisition we exploit and enrich crosslingual information contained in the online encyclopedia Wikipedia. With this account we demonstrate a viable solution for the efficient and reliable acquisition of disambiguated NEs that is particularly effective for resource-poor target languages, employing heuristic NER in a single source language. For the NED task, we rely on Wikipedia's internal link structures.

In its current state, HeiNER comprises the following components, which are freely available at http://heiner.cl.uni-heidelberg.de:

• a Translation Dictionary of all NEs encountered in the English version of Wikipedia with their translations into 15 other languages available in Wikipedia,

• a Disambiguation Dictionary for each of the 16 languages, which maps all ambiguous proper names to the set of unique NEs they refer to, and

• a Multilingual Context Database for all disambiguated NEs.

From our perspective, it is especially the collection of linguistic contexts provided for the disambiguated NEs in every target language that makes HeiNER a valuable resource for NE-related tasks in multilingual settings. These contexts can be used for supervised training of classifiers for tasks such as NER, NED or NE Classification in languages for which no suitable systems are available so far.

The paper is structured as follows: In Section 2, we introduce the aspects of Wikipedia's internal structure that are essential for our approach. Section 3 describes the details of how we acquire NEs monolingually for one source language and translate them to many target languages. In order to assess the quality of HeiNER, we evaluate the seeds of the translation step, i.e. the set of NEs acquired in the source language, against the judgements of human annotators. The results of the evaluation are presented in Section 4. Section 5 concludes and gives an outlook on future work.

2. Wikipedia

2.1. Overview

Wikipedia (http://wikipedia.org) is an international project which is devoted to the creation of a multilingual free online encyclopedia. It uses freely available Wiki software (http://www.mediawiki.org) to allow users to create and edit its content collaboratively.

This process relies entirely on subsequent edits by multiple users, who are advised to adhere to Wikipedia's quality standards as stated in the Manual of Style (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style).

Wikipedia has received a great deal of attention within the research community over the last years and has been successfully utilised in such diverse fields as machine translation (Alegria et al., 2006), NE transliteration (Sproat et al., 2006), word sense disambiguation (Mihalcea, 2007), parallel corpus construction (Adafre and de Rijke, 2006) and ontology construction (Ponzetto and Strube, 2007). From a NLP perspective, the attractiveness of employing the linguistic data provided by Wikipedia lies in the huge amount of NEs that it contains, in contrast to commonly used lexical resources such as WordNet (http://wordnet.princeton.edu/).

Wikipedia has grown impressively large over the last years. As of December 2007, Wikipedia had approximately 9.25 million articles in 253 languages, comprising a total of over 1.74 billion words for all Wikipedias (data taken from http://en.wikipedia.org/wiki/Wikipedia). Wikipedia versions for almost all major European languages, Japanese and Chinese have surpassed a volume of 100,000 articles (see Figures 1 and 2 for an overview).

[Figure 1: Number of articles found in different languages.]

[Figure 2: Language editions found in Wikipedia by number of articles.]

The enormous size of this comparable corpus raised the question whether Wikipedia can be used to conduct multilingual research. (As opposed to parallel corpora, comparable corpora are not strictly aligned, but contain topically related sentences (Sproat et al., 2006).) We believe that the internal structure of Wikipedia and its multilingualism can be harnessed for several NLP tasks. In the following, we describe the technical properties of Wikipedia we make use of for the acquisition of multilingual NE information, restricting ourselves to those properties we actually exploit for the purposes of this paper.

2.2. Structure

Wikipedia is a large hypertext document which consists of individual articles, or pages, which are interconnected by links. The majority of links found in the articles is conceptual rather than merely navigational. They often refer to more detailed descriptions of the respective entity denoted by the link's anchor text (Adafre and de Rijke, 2005). (Throughout this paper, we use the term entity as a generalization over all kinds of referents that are designated by Wikipedia pages, i.e. NEs, concepts, events, etc.) An important characteristic of articles found in Wikipedia is that they are usually restricted to the description of one single, unambiguous entity.

Pages

Wikipedia pages can be loosely divided into four classes: Concept Pages, Named Entity Pages and Category Pages account for most of Wikipedia's content and are joined by Meta Pages, which are used for administrative tasks, inclusion of different media and disambiguation of ambiguous strings. (The distinction between Named Entity and Concept Pages is not reflected by Wikipedia's structure. Instead, it has to be accounted for by specific methods; see Section 3.1.1.)

Redirect Pages

Redirect Pages are used to normalise different surface forms, denominations and morphological variants of a given unambiguous NE or concept to a unique form or ID. To exemplify the use of Redirect Pages, consider that USA, United States of America and Yankee land are all Redirect Pages to the article United States. These redirects enable us to identify USA, United States of America or Yankee land as varying surface forms of the unique NE United States.
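Operationally, a Redirect Page acts as a one-way mapping from a variant title to a canonical one. The following minimal Python sketch illustrates this normalisation; the redirect pairs are hard-coded for illustration, whereas HeiNER harvests them from the Wikipedia dumps:

    # Illustrative redirect table; in HeiNER these pairs are extracted
    # from Wikipedia's Redirect Pages.
    redirects = {
        "USA": "United States",
        "United States of America": "United States",
        "Yankee land": "United States",
    }

    def canonical_form(title):
        """Map a surface variant to its canonical article title, if any."""
        return redirects.get(title, title)

    assert canonical_form("USA") == "United States"
    assert canonical_form("Pythonidae") == "Pythonidae"  # no redirect: unchanged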

Disambiguation Pages

Disambiguation Pages are used to disambiguate proper names to a set of distinct and unique NEs that may be denoted by them. The Disambiguation Page for Python lists articles discussing Python (programming language), Monty Python and Pythonidae (a family of non-venomous snakes).

If a Disambiguation Page solely contains person names, geographical locations, schools, hospitals, places of worship, names of ships and roads or mathematical titles, users are advised to mark them as such. This enables us to extract additional lists of classified NEs, as marked by these specific Disambiguation Pages. (These lists will be provided independently from the other resources and can be used as seed lists for other NER approaches.)

Links

Just like links found on web pages, links in Wikipedia place a reference to another article within a text and can specify the surface form that will be displayed in the text. A typical link might look like [[United Nations|UN]], where UN is the link's surface form while United Nations is the referenced article and link target.

Cross-language links

Cross-language links are links from any page describing an entity in one Wikipedia language to a page describing the same entity in another language. Cross-language links take the form [[language code:Title]], where the language code is the two- or three-letter code as per ISO 639.

We will exploit the existence of cross-language links within Wikipedia to acquire translations of NE readings from one to all other languages in Wikipedia.
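Both link types can be recognised with simple pattern matching over the raw wiki markup. The following sketch is an illustrative re-implementation, not the authors' API; the regular expressions are simplifications that ignore complications such as templates and namespaces:

    import re

    LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
    LANG_RE = re.compile(r"^([a-z]{2,3}):(.+)$")

    def parse_links(markup):
        """Split wiki links into article links and cross-language links."""
        article_links, cross_lang_links = [], []
        for match in LINK_RE.finditer(markup):
            target = match.group(1)
            surface = match.group(2) or target  # [[target]] displays the target
            lang = LANG_RE.match(target)
            if lang:   # e.g. [[de:Vereinte Nationen]]
                cross_lang_links.append((lang.group(1), lang.group(2)))
            else:      # e.g. [[United Nations|UN]]
                article_links.append((target, surface))
        return article_links, cross_lang_links

    links, cross = parse_links("The [[United Nations|UN]] ... [[de:Vereinte Nationen]]")
    # links == [('United Nations', 'UN')]
    # cross == [('de', 'Vereinte Nationen')]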
2.3. Access to the Wikipedia data

For our task of identifying Named Entity pages and extracting alternative surface forms for these from all Redirect and Disambiguation pages that belong to a given NE, we have written a comprehensive Python API to access Wikipedia as a lexical semantic resource, which is also freely available. (For other Wikipedia APIs for similar purposes, see (Strube and Ponzetto, 2006; Zesch et al., 2007). The source code and the documentation of our API can be downloaded from http://heiner.cl.uni-heidelberg.de.) Although our API was designed specifically for this project, its class library is easily extensible to account for different tasks and differentiates between a plethora of different page, link and template types in Wikipedia. Its design is almost completely language-independent, thus fostering the design of algorithms suited for multilingual processing. (The only adaptations needed to account for other languages and scripts are (i) word segmentation, especially in East Asian scripts, and (ii) translation of Wikipedia namespaces; e.g. Image: is translated to Bild: in the German Wikipedia.)

3. Multilingual Resource Construction for HeiNER

3.1. Multilingual NE Acquisition

Our aim is to construct a multilingual NE resource comprising a disambiguation dictionary and a context dataset for every language available in Wikipedia.

As a necessary prerequisite, we need an initial list of NEs for every target language. There are two approaches to generate such lists:

• Implement a named entity recogniser or elaborate a language specific heuristic for NER for every target language.

• Develop a NER system or heuristic for NER for one source language and translate the NEs it recognises to the target languages of interest.

Due to language specific differences, NE recognisers are cumbersome to implement. Further, the performance of NER heuristics is often difficult to assess in advance. Therefore, we decided to follow the second approach for acquiring the multilingual components of HeiNER.

In the following subsections we describe the monolingual NER heuristic that we apply to the source language and how the resulting list of NEs is translated to the target languages by exploiting and enriching Wikipedia's crosslingual link structure.

3.1.1. Heuristic for Monolingual NE Acquisition in the Source Language

We chose English as the source language for mainly two reasons: First, the English Wikipedia contains by far the largest amount of articles (cf. Figure 1). English therefore provides an optimal seed for multilingual NE acquisition in terms of recall. Second, linguistic insight reveals that the English language bears peculiarities (mainly capitalization patterns) which can be exploited as a considerably precise heuristic for NE recognition (see Section 4). Insofar, using English as our source language seems to be a good choice from the perspective of both recall and precision.

Note that our multilingual approach is (i) independent of the choice of a specific source language and (ii) that any NER method can be applied for the initial NE acquisition in the source language. (In principle, any language other than English might be chosen, with varying performance being the consequence, of course. In particular, it might turn out that a different source language performs even better for specific language pairs.)

As most of the Wikipedia articles refer to entities, there is per se a high probability for an article to describe a NE. Thus, by considering every article title as a possible NE, we avoid some of the non-trivial problems involved in NER, such as NE boundary detection (Nadeau et al., 2006). (The titles are normalised by removing additional information which is often contained in Wikipedia article titles. For instance, The Firm (1993 film) is normalised to The Firm.) Therefore, it is not necessary to apply a sophisticated NER system. Instead, a heuristic is applied on each article to determine if it corresponds to a NE or not. Additionally, by restricting our acquisition method to article titles, we avoid any efforts for morphological normalisation.

The heuristics we apply in order to acquire NEs in the source language are those suggested by (Bunescu and Pasca, 2006). Let e ∈ E be a given Wikipedia entity for which we have to decide whether it is an NE or not, and let e.title be the title attributed to this entity in Wikipedia:

1. If e.title is a multiword title, check the capitalization of all content words in e.title, i.e. words other than prepositions, determiners, conjunctions, relative pronouns or negations. Consider e a named entity if and only if all content words in the title are capitalized.

2. If e.title is a one word title that contains at least two capital letters, then e is a named entity. Otherwise, go to step 3.

3. Count how frequently e.title occurs in the text of the article that is related to e, in positions other than at the beginning of sentences. If at least 75% of these occurrences are capitalized, then e is a named entity.
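The following minimal, runnable Python sketch illustrates the three steps. The whitespace tokenisation and the small function-word list are illustrative simplifications, not the authors' implementation:

    import re

    # Illustrative, incomplete list of non-content words (step 1).
    FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for",
                      "and", "or", "but", "that", "which", "who", "not", "no"}

    def is_named_entity(title, article_text):
        words = title.split()
        if len(words) > 1:
            # Step 1: a multiword title is a NE iff all content words
            # are capitalized.
            content = [w for w in words if w.lower() not in FUNCTION_WORDS]
            return all(w[:1].isupper() for w in content)
        if sum(c.isupper() for c in title) >= 2:
            # Step 2: one-word title with at least two capital letters.
            return True
        # Step 3: capitalization ratio of non-sentence-initial occurrences.
        first_chars = []
        for m in re.finditer(re.escape(title), article_text, re.IGNORECASE):
            preceding = article_text[:m.start()].rstrip()
            if preceding and preceding[-1] not in ".!?":  # not sentence-initial
                first_chars.append(article_text[m.start()])
        if not first_chars:
            return False
        return sum(c.isupper() for c in first_chars) / len(first_chars) >= 0.75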

3.1.2. NE Translation from the Source Language to the Target Languages

After having identified the NEs occurring in the source language, we acquire translations of them into other languages by applying the following method: For all NEs extracted in the source language, mappings to corresponding NEs, i.e. their translations into the various target languages, are computed by exploiting Wikipedia's cross-language links (cf. Section 2.2). We represent these references in a translation/transliteration dictionary, transDict: For each NE recognised in the source language, the corresponding translations/transliterations into the target languages are stored in the transDict (see Figure 4 for an example entry). This method avoids the drawbacks of the alternative monolingual approaches sketched above.

[Figure 4: Translation and transliteration dictionary entry for the named entity United Nations, listing equivalents across languages and scripts, e.g. ONU, Ujedinjeni narodi, Egyesült Nemzetek Szervezete, Организация Объединённых Наций, Birleşmiş Milletler, 国際連合 and 联合国.]

3.1.3. Filling Linkage Holes by Triangulation

As our approach crucially depends on the density of Wikipedia's cross-language links, we implemented a method to fill linkage holes in order to increase the coverage of the acquisition step. The details of this method are described in the following.

A linkage hole means that there are corresponding NEs within two languages, but there is no direct link between the Wikipedia articles describing them. An example of this is illustrated in Figure 3. The figure shows three lists of cross-language links contained in the English (top left), French (bottom left) and Italian (right) article about 'Le Crestet'. Note that no link exists between the English and Italian Wikipedia pages. However, there is a link from the English to the corresponding French page and from the French to the corresponding Italian page. As a consequence, we add the Italian translation of 'Le Crestet' to our transDict. For the rest of the links the same inferencing can be applied, but is not illustrated here.

[Figure 3: Linkage holes in cross-language links.]

In order to achieve optimal density within Wikipedia's network of cross-language links, we make use of triangulation, a technique which is well-known from machine translation (Kay, 1997; Eisele, 2005). Generally, triangulation allows us to infer a relation between lexical items of two different languages by detour via a third language. We apply triangulation to our problem of linkage holes by inferring missing cross-language links between languages L1 and L2, l(ne_L1, ne_L2), according to the following rule:

l(ne_L1, ne_L3) ∧ l(ne_L3, ne_L2) ⇒ l(ne_L1, ne_L2)

The underlying assumption is that the network of cross-language links should, in fact, obey the principle of transitivity. Hence, in our approach, filling linkage holes by triangulation is accomplished by computing the transitive closure on the set of existing cross-language links within Wikipedia, thus collecting all NE translations in the transDict.
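A compact sketch of this closure computation, under the simplifying assumption that cross-language links are symmetric and that each connected component of the link graph describes a single entity; the toy edges mirror the 'Le Crestet' example:

    from collections import defaultdict, deque

    def build_trans_dict(edges, source_lang="en"):
        """For every source-language NE, collect all titles reachable in the
        cross-language link graph, i.e. the transitive closure of the links."""
        neighbours = defaultdict(set)
        for a, b in edges:              # an edge encodes l(ne_L1, ne_L2)
            neighbours[a].add(b)
            neighbours[b].add(a)

        trans_dict = {}
        for node in list(neighbours):
            lang, title = node
            if lang != source_lang:
                continue
            seen, queue = {node}, deque([node])
            while queue:                # graph traversal = transitive closure
                for nxt in neighbours[queue.popleft()] - seen:
                    seen.add(nxt)
                    queue.append(nxt)
            trans_dict[title] = {l: t for (l, t) in seen if l != source_lang}
        return trans_dict

    # 'Le Crestet': en-fr and fr-it links exist, but no direct en-it link.
    edges = [(("en", "Le Crestet"), ("fr", "Le Crestet")),
             (("fr", "Le Crestet"), ("it", "Le Crestet"))]
    print(build_trans_dict(edges))
    # {'Le Crestet': {'fr': 'Le Crestet', 'it': 'Le Crestet'}}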

3.2. Creating the Disambiguation Dictionary and Context Dataset for Each Language

After extension by means of triangulation, the transDict serves as the base for building the final multilingual NE resource, comprising a disambiguation dictionary and a context dataset for every language available in Wikipedia. The methods for the creation follow in substance the settings from (Bunescu and Pasca, 2006). These are described below.

3.2.1. Creating the Disambiguation Dictionary

For each language, the Disambiguation Dictionary (DisDict) is constructed by executing the following steps:

1. for each NE page ne_i: update DisDict with an entry that contains ne_i.title as key and also as its value (this mapping is inserted, as each NE can be referred to by itself, of course);

2. for all Disambiguation Pages d_i which contain one or more links to a NE ne_k: add the value ne_k to the DisDict entry with the key d_i.title (the addition "(disambiguation)" in English, and its equivalent in the other languages, is removed);

3. for each Redirect Page r_j which links to a NE ne_l: update DisDict with an entry that contains r_j.title as key and ne_l as its value.

Thus, the DisDict represents relations between proper names and the corresponding NEs they may denote. These relations can be of one or more different types. Among these are:

• lexical relations, such as synonymy (e.g. the redirect from Charles Chaplin to Charlie Chaplin), meronymy (e.g. the redirect from Phil Stanton to Blue Man Group) or polysemy and homonymy respectively (e.g. Chaplin, which refers to people (e.g. Charlie Chaplin), to films (Chaplin (1992 film)) or to places),

• abbreviations, such as acronyms (e.g. the disambiguation page ACC, which may, among others, refer to ACCBank, Asian Cricket Council or American College of Cardiology),

• spelling variations, including misspellings (e.g. the redirect from Cristopher Lee to Christopher Lee) or alternative spellings.

Synonymy, meronymy and other spellings are covered by Redirect Pages, whereas polysemy and homonymy are covered by Disambiguation Pages. Abbreviations are covered by both Redirect and Disambiguation Pages.

Within the DisDict, the RedirectDict (cf. step 3 above) is distinguished from the rest of the dictionary. This is necessary in order to be able to correctly identify and process links of the type [[redirect]], where redirect is a redirect to a NE, which can't therefore be directly inferred from the link itself.
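The three steps translate almost directly into code. A minimal sketch over toy page inventories (in HeiNER the inventories come from the parsed dumps, one DisDict per language):

    from collections import defaultdict

    ne_pages = ["Monty Python", "Python (programming language)", "Pythonidae"]
    disambiguation_pages = {  # key with "(disambiguation)" already stripped
        "Python": ["Python (programming language)", "Monty Python", "Pythonidae"],
    }
    redirect_pages = {"Charles Chaplin": "Charlie Chaplin"}

    dis_dict = defaultdict(set)
    redirect_dict = {}

    for ne in ne_pages:                                  # step 1: NE maps to itself
        dis_dict[ne].add(ne)
    for name, targets in disambiguation_pages.items():   # step 2: name -> NEs
        dis_dict[name].update(targets)
    for name, target in redirect_pages.items():          # step 3: kept separately,
        redirect_dict[name] = target                     # as the RedirectDict

    print(sorted(dis_dict["Python"]))
    # ['Monty Python', 'Python (programming language)', 'Pythonidae']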

3.2.2. Creating the Context Dataset

Let l_nej be a link with the surface form srf and the target tar, where tar is a unique NE ne_j. tar can be given within the link, i.e. l_nej has the form [[tar|srf]] or just [[tar]], or it is yielded by the RedirectDict, i.e. l_nej is of the form [[srf]] and ne_j is the corresponding value of the key srf in the RedirectDict.

The Context Dataset comprises all contexts of link occurrences of the type l_nej, i.e. contexts of all occurrences of NEs. These contexts are extracted from Wikipedia articles. The NEs are disambiguated as determined by the linking structure. In order to allow for a wide range of usages of the dataset, the context window of a NE ne_j is set to the entire paragraph of the Wikipedia article in which ne_j occurs.

In addition, the complete DisDict serves the creation of a Context Dataset which comprises solely contexts of ambiguous occurrences of NEs: An ambiguous occurrence is present if the DisDict contains a link l_nej's surface form srf and there are at least two NEs l_nej can denote, namely the corresponding values of srf in the DisDict, one of which being ne_j (cf. (Bunescu and Pasca, 2006)).

The Context Dataset is created for each language. An example of the constructed dataset for Chinese and English can be seen in Figure 5.

[Figure 5: Excerpts from the Chinese and English context dataset for the NE United Nations (联合国), e.g. a Chinese paragraph on the United Nations headquarters in Manhattan and an English paragraph describing the World Health Organization (WHO) as a specialized agency of the United Nations (UN).]
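A sketch of this harvesting step is given below. It reuses parse_links from the sketch in Section 2.2 and treats blank-line-separated blocks as paragraphs; both are simplifying assumptions:

    from collections import defaultdict

    def extract_contexts(article_markup, ne_titles, redirect_dict):
        """Collect one context (the enclosing paragraph) per NE link occurrence."""
        contexts = defaultdict(list)
        for paragraph in article_markup.split("\n\n"):  # paragraph = context window
            article_links, _ = parse_links(paragraph)   # from the earlier sketch
            for target, _surface in article_links:
                ne = redirect_dict.get(target, target)  # resolve [[redirect]] links
                if ne in ne_titles:
                    contexts[ne].append(paragraph)
        return contexts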

4. Evaluation

The performance of our translation approach depends on two main factors: the quality of the initial NE list acquired in the source language, and the density of the cross-language links along which we translate the NEs to the target languages. We therefore evaluated our system with regard to (i) the precision of our NER heuristic for the English Wikipedia and (ii) the increase in coverage yielded by our triangulation method. For the precision of the acquired target language NEs we rely on the quality of the cross-language links manually coded in Wikipedia.

4.1. Precision of the NER heuristics

4.1.1. Annotation

In order to estimate the actual precision of the heuristics utilised by our system to recognise NEs, we manually annotated two sets of strings recognised by our system as NEs. Both sets consisted of 2000 examples and were annotated by two and three annotators respectively.

We asked our annotators to adhere to the annotation guidelines for the CoNLL-2003 language independent NER shared task (Sang and Meulder, 2003; http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt).

Annotator Agreement

To measure the difficulty of the task of filtering NEs from Wikipedia titles, we calculated the inter-annotator agreement in terms of Cohen's Kappa for the first annotation set, and pairwise Cohen's Kappa and Fleiss' Kappa for the second. See Table 1 for an overview. Kappa values are computed according to (Cohen, 1960; Fleiss, 1971).

Annotation Set Overview
Set  Markables  Annotators  Kappa
1    2000       2           0.774
2    2000       3           0.771

First Annotation Set Results
Annotator  True Positives  False Positives  Precision
1          1900            100              0.950
2          1924            76               0.962

Second Annotation Set Results
Annotator  True Positives  False Positives  Precision
1          1872            128              0.936
2          1911            89               0.956
3          1914            86               0.957

Pairwise Annotator Agreement
Annotator  1      2      3
1          1.000  0.734  0.922
2          0.734  1.000  0.684

Table 1: Evaluation Results

The inter-annotator agreement of 0.774 on the first annotation set and 0.771 on the second confirms our initial assumption that recognising NEs is a challenging task, even if annotators are given annotation guidelines. We would like to point out that the Fleiss Kappa value on the second set stems mainly from the fact that one of the three annotators is in disagreement with the other two. This is reflected in values of 0.734 and 0.684 for the pairwise Cohen's Kappa for the former and an exceptionally high value of 0.922 for the latter.

There were a total of 38 disagreements between the two annotators of the first set, of which 36.84% are due to the fact that strings denoting biological classifications were handled differently, and 21.05% were obvious annotation errors due to inconsistencies with the annotation guidelines. Disagreements in the other cases were mainly due to strings that do not unambiguously denote a single entity, but rather a class or group of multiple entities with specific features.
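For reference, Cohen's kappa corrects the observed agreement for the agreement expected by chance from each annotator's label distribution. A small worked example on invented binary NE/not-NE labels (not the study's annotation data):

    def cohen_kappa(ann_a, ann_b):
        """Cohen's kappa for two annotators labelling the same items."""
        n = len(ann_a)
        p_observed = sum(x == y for x, y in zip(ann_a, ann_b)) / n
        # chance agreement from the annotators' marginal label distributions
        p_chance = sum((ann_a.count(lbl) / n) * (ann_b.count(lbl) / n)
                       for lbl in set(ann_a) | set(ann_b))
        return (p_observed - p_chance) / (1 - p_chance)

    # Invented toy labels, NOT the study's annotation data:
    ann_1 = ["NE"] * 90 + ["notNE"] * 10
    ann_2 = ["NE"] * 85 + ["notNE"] * 5 + ["NE"] * 2 + ["notNE"] * 8
    print(round(cohen_kappa(ann_1, ann_2), 3))  # approx. 0.657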
4.1.2. Results

Using (Bunescu and Pasca, 2006)'s heuristic, we extracted a total of 1,547,586 NEs from the English Wikipedia (XML database dump enwiki-20080103-pages-articles.xml). The heuristic achieved an average precision of 0.95, from which we conclude that the set of NEs we extracted from the English Wikipedia indeed builds a highly reliable seed for our translation step.

Nevertheless, an improvement of the NER for English is desirable. Analysing the false positives agreed upon by at least two annotators shows a promising direction to do so: 75.32% of these errors arise because article titles like History of the Caribbean, Politics of the People's Republic of China and Geography of Egypt are erroneously recognised by the heuristic as being a NE. This type of error can be easily dealt with by identifying typical phrases such as History of which, if encountered in a given string, will cause it to be classified as not being a NE.

4.2. Multilingual Approach: Coverage by Triangulation

We applied the triangulation to the 13 largest language versions of Wikipedia in number of articles (cf. Figure 2), as well as to the Turkish and the Swahili Wikipedia. The number of NEs acquired for each language by triangulation is given in Table 2. In total, 77,694 additional NE translations could be acquired, corresponding to a 4.5% increase in coverage. The Turkish, Spanish, Italian and Russian languages profited most from the computation of the transitive closure, with up to 7.7% increase.

4.3. Multilingual Approach: Context Datasets

For English and for each of the languages mentioned in subsection 4.2, a DisDict and a Context Dataset were created. Table 3 shows the number of NE contexts extracted from the Wikipedia dumps for each language. The third column presents additional figures for the number of contexts that pertain to ambiguous NEs. Keep in mind that these occurrences have been disambiguated as determined by the linking structure. These contexts can serve as training data for NED classifiers. Moreover, we report the average number of contexts for all ambiguous NEs of one language in terms of mean and median.

Table 4 shows the distribution of the sizes of extracted context sets per NE for the English Context Dataset. For four NEs, more than 100,000 contexts could be extracted. These are: United States (371,706 contexts), England (122,924), United Kingdom (114,140) and Germany (100,717). We extracted 23,874 NEs with more than 200 contexts, i.e. 1.5% of all NEs correspond to 44% of all English contexts. Figure 6 gives an overview of the extracted number of contexts per NE for the other languages.

Language  Initial    Final      Difference (absolute)  Difference (%)
de        243,903    250,049    6,146                  2.46
es        127,518    137,606    10,088                 7.33
fi        67,095     71,052     3,957                  5.57
fr        215,479    222,712    7,233                  3.25
it        135,852    145,889    10,037                 6.88
ja        116,488    120,056    3,568                  2.97
nl        166,708    176,203    9,495                  5.39
no        63,431     66,786     3,355                  5.02
pl        128,078    134,250    6,172                  4.60
pt        132,778    137,227    4,449                  3.24
ru        81,331     87,224     5,893                  6.76
sv        97,270     99,710     2,440                  2.45
sw        2,765      2,962      197                    6.65
tr        26,814     29,059     2,245                  7.73
zh        56,652     59,071     2,419                  4.10
total     1,662,162  1,739,856  77,694                 4.47

Table 2: Increase in coverage by means of triangulation. The Initial column shows the number of NEs acquired in the target languages by extracting the cross-language links from the English Wikipedia. The Final column shows the number of NEs after having applied triangulation.

Language  Contexts (absolute)  Contexts (ambig. NE)  Mean  Median
de        9,665,648            1,573,173             50    6
en        43,065,047           8,076,626             51    5
es        4,198,613            499,051               76    6
fi        1,364,604            59,550                28    4
fr        7,627,032            1,096,986             60    6
it        4,819,325            517,419               54    5
ja        6,831,990            545,398               59    6
nl        3,784,999            590,949               25    6
no        1,345,096            86,519                28    2
pl        3,923,401            241,336               33    4
pt        2,853,306            365,945               53    7
ru        2,131,456            157,726               33    4
sv        1,898,103            172,796               55    4
sw        30,250               225                   6     5
tr        625,072              21,468                37    4
zh        1,924,618            38,323                31    4
mean      6,005,535            877,718               42    -

Table 3: Number of contexts extracted from the different Wikipedia language versions.

Number of contexts  Number of NEs
>= 100,000          4
>= 7,500            235
>= 5,000            181
>= 2,500            585
>= 1,000            2,108
>= 500              4,439
>= 200              16,326
total               23,878

Table 4: Number of contexts per NE in the English Context Dataset.

[Figure 6: Number of contexts per NE, shown per language (de, fr, pl, ja, it, nl, es, pt, sv, ru, zh, no, fi, tr, sw) and bucketed by context set size (>= 200 up to >= 7,500).]

5. Conclusion and Outlook

We have presented a method to build a huge multilingual resource for NE recognition and NE disambiguation by acquiring translations of NEs from the Wikipedia encyclopedia. We extracted NEs for one source language and acquired translational equivalents for them in many target languages, making use of Wikipedia's cross-language links. We built a comprehensive database of NEs that comprises unambiguously translated and transliterated NEs, as well as ambiguous proper names with corresponding disambiguation dictionaries for the NEs the ambiguous proper names can denote. For all NEs, our resource provides language-specific context datasets of disambiguated NEs. These can serve as annotated training data for NER and NED systems.

Up to now, HeiNER does not contain semantic class labels for NEs/NE types. This further extension could be achieved by associating the extracted NEs with the Wikipedia categories their respective pages are labelled with. By linking the NEs to the Wikipedia taxonomy of (Ponzetto and Strube, 2007), suitable class labels could be automatically extracted from the taxonomy. This represents a non-trivial task that we leave for future work.

Nevertheless, due to the Translation Dictionary contained in HeiNER, it is possible to apply one of the well-known NE recognizers available for English, e.g. the Stanford CRFClassifier (available at http://nlp.stanford.edu/software/CRF-NER.shtml; the underlying work is reported in (Finkel et al., 2005)), to the English NEs contained in HeiNER, and to transfer the class/type information yielded by the NE classifier to the target language of interest. In a second step, linguistic contexts for the resulting set of NEs in the target language can be looked up in HeiNER's context dataset and used to bootstrap other NEs of the same type in the same language. Analogous scenarios for crosslingual NED are also conceivable.

Our resource should be of high value in particular for resource-poor languages. For such languages in particular, NE recognisers can be built or improved quickly, using the context information as training material. HeiNER will be successively extended by applying the methods presented in this paper to further language versions of Wikipedia.

Our resource can be exploited for many interesting NLP applications and research issues. In future work, we plan to use the language specific datasets of NE contexts to create topic signatures, i.e. families of topically related terms with associated weights (Lin and Hovy, 2000).

HeiNER will be made publicly available at http://heiner.cl.uni-heidelberg.de.

6. Acknowledgements

We thank Anette Frank, Philipp Cimiano and Nils Reiter for numerous valuable and very helpful discussions throughout the work reported in this paper.

7. References

Sisay F. Adafre and Maarten de Rijke. 2005. Discovering missing links in Wikipedia. In LinkKDD '05: Proceedings of the 3rd International Workshop on Link Discovery, pages 90-97. ACM Press.

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the EACL 2006 Workshop on New Text: Wikis and Blogs and Other Dynamic Text Sources.

I. Alegria, N. Ezeiza, and I. Fernandez. 2006. Named entities translation based on comparable corpora. In Proceedings of the EACL Workshop on Multi-Word-Expressions in a Multilingual Context, Genoa, Italy.

Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), Trento, Italy, pages 9-16, April.

Razvan Bunescu. 2007. Learning for Information Extraction: From Named Entity Recognition and Disambiguation to Relation Extraction. Ph.D. thesis, University of Texas at Austin.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.

Andreas Eisele. 2005. First steps towards multi-engine machine translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 155-158.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into Information Extraction systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the ACL, pages 363-370.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.

Wei Gao, Kam-Fai Wong, and Wai Lam. 2004. Phoneme-based transliteration of foreign names for the OOV problem. In Proceedings of IJCNLP, pages 110-119.

Martin Kay. 1997. The proper place of men and machines in language translation. Machine Translation, 12:3-23.

Anne-Laure Ligozat, Brigitte Grau, Isabelle Robba, and Anne Vilnat. 2006. Evaluation and improvement of cross-lingual Question Answering strategies. In Proceedings of the EACL Workshop on Multilingual Question Answering, pages 110-119.

Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495-501, Morristown, NJ, USA.

Rada Mihalcea. 2007. Using Wikipedia for automatic Word Sense Disambiguation. In Proceedings of NAACL 2007, Rochester, April.

David Nadeau, Peter D. Turney, and Stan Matwin. 2006. Unsupervised named entity recognition: Generating gazetteers and resolving ambiguity. In Proceedings of the 19th Canadian Conference on Artificial Intelligence, pages 266-277, Quebec.

Simone Paolo Ponzetto and Michael Strube. 2007. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI-07), pages 1440-1447, July.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL 2003, pages 142-147, Morristown, NJ, USA.

Richard Sproat, Tao Tao, and ChengXiang Zhai. 2006. Named entity transliteration with comparable corpora. In Proceedings of the 44th Annual Meeting of the ACL, Sydney.

Michael Strube and Simone Paolo Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1219-1224, Boston.

Paola Virga and Sanjeev Khudanpur. 2003. Transliteration of proper names in cross-lingual Information Retrieval. In Proceedings of the ACL Workshop on Multilingual and Mixed-Language Named Entity Recognition, pages 57-64.

Torsten Zesch, Iryna Gurevych, and Max Mühlhäuser. 2007. Analyzing and accessing Wikipedia as a lexical semantic resource. In Georg Rehm, Andreas Witt, and Lothar Lemnitzer, editors, Data Structures for Linguistic Resources and Applications, pages 197-205. Narr, Tübingen.
