Mining Wiki Resources for Multilingual Named Entity Recognition
Total Page:16
File Type:pdf, Size:1020Kb
Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman Patrick Schone Department of Defense Department of Defense Washington, DC 20310 Fort George G. Meade, MD 20755 [email protected] [email protected] required during development). The expectation is Abstract that for any language in which Wikipedia is sufficiently well-developed, a usable set of training In this paper, we describe a system by which data can be obtained with minimal human the multilingual characteristics of Wikipedia intervention. As Wikipedia is constantly can be utilized to annotate a large corpus of expanding, it follows that the derived models are text with Named Entity Recognition (NER) continually improved and that increasingly many tags requiring minimal human intervention languages can be usefully modeled by this method. and no linguistic expertise. This process, though of value in languages for which In order to make sure that the process is as resources exist, is particularly useful for less language-independent as possible, we declined to commonly taught languages. We show how make use of any non-English linguistic resources the Wikipedia format can be used to identify outside of the Wikimedia domain (specifically, possible named entities and discuss in detail Wikipedia and the English language Wiktionary the process by which we use the Category (en.wiktionary.org)). In particular, we did not use structure inherent to Wikipedia to determine any semantic resources such as WordNet or part of the named entity type of a proposed entity. speech taggers. We used our automatically anno- We further describe the methods by which tated corpus along with an internally modified English language data can be used to variant of BBN's IdentiFinder (Bikel et al., 1999), bootstrap the NER process in other languages. We demonstrate the system by using the specifically modified to emphasize fast text generated corpus as training sets for a variant processing, called “PhoenixIDF,” to create several of BBN's Identifinder in French, Ukrainian, language models that could be tested outside of the Spanish, Polish, Russian, and Portuguese, Wikipedia framework. We built on top of an achieving overall F-scores as high as 84.7% existing system, and left existing lists and tables on independent, human-annotated corpora, intact. Depending on language, we evaluated our comparable to a system trained on up to derived models against human or machine 40,000 words of human-annotated newswire. annotated data sets to test the system. 1 Introduction 2 Wikipedia Named Entity Recognition (NER) has long been a 2.1 Structure major task of natural language processing. Most of Wikipedia is a multilingual, collaborative encyclo- the research in the field has been restricted to a few pedia on the Web which is freely available for re- languages and almost all methods require substan- search purposes. As of October 2007, there were tial linguistic expertise, whether creating a rule- over 2 million articles in English, with versions based technique specific to a language or manually available in 250 languages. This includes 30 lan- annotating a body of text to be used as a training guages with at least 50,000 articles and another 40 set for a statistical engine or machine learning. with at least 10,000 articles. Each language is In this paper, we focus on using the multilingual available for download (download.wikimedia.org) Wikipedia (wikipedia.org) to automatically create in a text format suitable for inclusion in a database. an annotated corpus of text in any given language, For the remainder of this paper, we refer to this with no linguistic expertise required on the part of format. the user at run-time (and only English knowledge 1 Proceedings of ACL-08: HLT, pages 1–9, Columbus, Ohio, USA, June 2008. c 2008 Association for Computational Linguistics Within Wikipedia, we take advantage of five A redirect page is a short entry whose sole pur- major features: pose is to direct a query to the proper page. There • Article links, links from one article to another are a few reasons that redirect pages exist, but the of the same language; primary purpose is exemplified by the fact that • Category links , links from an article to special “USA” is an entry which redirects the user to the “Category” pages; page entitled “United States.” That is, in the vast • Interwiki links , links from an article to a majority of cases, redirect pages provide another presumably equivalent, article in another name for an entity. language; A disambiguation page is a special article • Redirect pages , short pages which often which contains little content but typically lists a provide equivalent names for an entity; and number of entries which might be what the user • Disambiguation pages , a page with little was seeking. For instance, the page “Franklin” content that links to multiple similarly named contains 70 links, including the singer “Aretha articles. Franklin,” the town “Franklin, Virginia,” the The first three types are collectively referred to as “Franklin River” in Tasmania, and the cartoon wikilinks . character “Franklin (Peanuts).” Most disambigua- A typical sentence in the database format looks tion pages are in Category:Disambiguation or one like the following: of its subcategories. “Nescopeck Creek is a [[tributary]] of the [[North 2.2 Related Studies Branch Susquehanna River]] in [[Luzerne County, Wikipedia has been the subject of a considerable Pennsylvania|Luzerne County]].” amount of research in recent years including Gabrilovich and Markovitch (2007), Strube and The double bracket is used to signify wikilinks. In Ponzetto (2006), Milne et al. (2006), Zesch et al. this snippet, there are three articles links to English (2007), and Weale (2007). The most relevant to language Wikipedia pages, titled “Tributary,” our work are Kazama and Torisawa (2007), Toral “North Branch Susquehanna River,” and “Luzerne and Muñoz (2006), and Cucerzan (2007). More County, Pennsylvania.” Notice that in the last link, details follow, but it is worth noting that all known the phrase preceding the vertical bar is the name of prior results are fundamentally monolingual, often the article, while the following phrase is what is developing algorithms that can be adapted to other actually displayed to a visitor of the webpage. languages pending availability of the appropriate Near the end of the same article, we find the semantic resource. In this paper, we emphasize the following representations of Category links: use of links between articles of different languages, [[Category:Luzerne County, Pennsylvania]], specifically between English (the largest and best [[Category:Rivers of Pennsylvania]], {{Pennsyl- linked Wikipedia) and other languages. vania-geo-stub}}. The first two are direct links to Toral and Muñoz (2006) used Wikipedia to cre- Category pages. The third is a link to a Template, ate lists of named entities. They used the first which (among other things) links the article to sentence of Wikipedia articles as likely definitions “Category:Pennsylvania geography stubs”. We of the article titles, and used them to attempt to will typically say that a given entity belongs to classify the titles as people, locations, organiza- those categories to which it is linked in these ways. tions, or none. Unlike the method presented in this The last major type of wikilink is the link be- paper, their algorithm relied on WordNet (or an tween different languages. For example, in the equivalent resource in another language). The au- Turkish language article “Kanuni Sultan Süley- thors noted that their results would need to pass a man” one finds a set of links including [[en:Sulei- manual supervision step before being useful for the man the Magnificent]] and [[ru: Сулейман I]]. NER task, and thus did not evaluate their results in These represent links to the English language the context of a full NER system. article “Suleiman the Magnificent” and the Russian Similarly, Kazama and Torisawa (2007) used language article “ Сулейман I.” In almost all Wikipedia, particularly the first sentence of each cases, the articles linked in this manner represent article, to create lists of entities. Rather than articles on the same subject. building entity dictionaries associating words and 2 phrases to the classical NER tags (PERSON, LO- not marked in an existing corpus or not needed for CATION, etc.) they used a noun phrase following a given purpose, the system can easily be adapted. forms of the verb “to be” to derive a label. For ex- Our goal was to automatically annotate the text ample, they used the sentence “Franz Fischler ... is portion of a large number of non-English articles an Austrian politician” to associate the label “poli- with tags like <ENAMEX TYPE=“GPE”>Place tician” to the surface form “Franz Fischler.” They Name</ENAMEX> as used in MUC (Message proceeded to show that the dictionaries generated Understanding Conference). In order to do so, our by their method are useful when integrated into an system first identifies words and phrases within the NER system. We note that their technique relies text that might represent entities, primarily through upon a part of speech tagger, and thus was not ap- the use of wikilinks. The system then uses catego- propriate for inclusion as part of our non-English ry links and/or interwiki links to associate that system. phrase with an English language phrase or set of Cucerzan (2007), by contrast to the above, Categories. Finally, it determines the appropriate used Wikipedia primarily for Named Entity Dis- type of the English language data and assumes that ambiguation, following the path of Bunescu and the original phrase is of the same type. Pa şca (2006). As in this paper, and unlike the In practice, the English language categorization above mentioned works, Cucerzan made use of the should be treated as one-time work, since it is explicit Category information found within Wiki- identical regardless of the language model being pedia.