Classifying Articles in English and German Wikipedia
Nicky Ringland and Joel Nothman and Tara Murphy and James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia
{nicky,joel,tm,james}@it.usyd.edu.au

Abstract

Named Entity (NE) information is critical for Information Extraction (IE) tasks. However, the cost of manually annotating sufficient data for training purposes, especially for multiple languages, is prohibitive, meaning automated methods for developing resources are crucial. We investigate the automatic generation of NE annotated data in German from Wikipedia. By incorporating structural features of Wikipedia, we can develop a German corpus which accurately classifies Wikipedia articles into NE categories to within 1% F-score of the state-of-the-art process in English.

1 Introduction

Machine Learning methods in Natural Language Processing (NLP) often require large annotated training corpora. Wikipedia can be used to automatically generate robust annotated corpora for tasks like Named Entity Recognition (NER), competitive with manual annotation (Nothman et al., 2009). The CoNLL-2002 shared task defined NER as the task of identifying and classifying the names of people (PER), organisations (ORG), places (LOC) and other entities (MISC) within text (Tjong Kim Sang, 2002). There has been extensive research into recognising NEs in newspaper text and domain-specific corpora; however, most of this has been in English. The cost of producing sufficient NER annotated data required for training makes manual annotation unfeasible, and the generation of this data is even more important for languages other than English, where gold-standard corpora are harder to obtain.

German NER is especially challenging since various features used successfully in English NER, including proper noun capitalisation, do not apply to German language data, making NEs harder to detect and classify (Nothman et al., 2008). Furthermore, German has partially free word order, which affects the reliability of contextual evidence, such as previous and next word features, for NE detection.

Nothman et al. (2008) devised a novel method of automatically generating English NE training data by utilising Wikipedia's internal structure. The approach involves classifying all articles in Wikipedia into classes using a features-based bootstrapping algorithm, and then creating a corpus of sentences containing links to articles identified and classified based on the link's target.

We extend the features used in Nothman et al. (2008) for use with German Wikipedia by creating new heuristics for classification. We endeavour to make these as language-independent as possible, and evaluate on English and German. Our experiments show that we can accurately classify German Wikipedia articles at an F-score of 88%, and 91% for entity classes only, achieving results very close to the state-of-the-art method for English data by Nothman et al. (2008), who reported 89% on all and 92% on entities only. Nothman et al.'s (2009) NER training corpus created from these entity classifications outperforms the best cross-corpus results with gold-standard training data by up to 12% F-score using CoNLL-2003-style evaluation. Thus, we show that it is possible to create free, high-coverage NE annotated German-language corpora from Wikipedia.

2 Background

The area of NER has developed considerably from the Message Understanding Conferences (MUC) of the 1990s, where the task first emerged. MET, the Multilingual Entity Task associated with MUC, introduced NER in languages other than English (Merchant et al., 1996), which had previously made up the majority of research in the area. The CoNLL evaluations of 2002 and 2003 shifted the focus to machine learning, and furthered multilingual research incorporating language-independent NER: CoNLL-2002 evaluated on Spanish and Dutch (Tjong Kim Sang, 2002) and CoNLL-2003 on English and German (Tjong Kim Sang and De Meulder, 2003).

The results of the CoNLL-2002 shared task showed that whilst choosing an appropriate machine learning technique affected performance, feature choice was also vital. All of the top 8 systems at CoNLL-2003 used lexical features, POS tags, affix information, previously predicted NE tags and orthographic features.

The best-performing CoNLL-2003 system achieved an F-score of 88.8% on English and 72.4% on German (Florian et al., 2003). It combined Maximum Entropy Models and robust risk minimisation with the use of external knowledge in the form of a small gazetteer of names. This was collected by manually browsing web pages for about two hours and was composed of 4500 first and last names, 4800 locations in Germany and 190 countries. Gazetteers are very costly to create and maintain, and so considerable research has gone into their automatic generation from online sources including Wikipedia (Toral and Muñoz, 2006).

The CoNLL-2003 results for German were considerably lower than for English, with up to 25% difference in F-score (Tjong Kim Sang and De Meulder, 2003). The top performing systems all achieved F-scores on English more than 15 points higher than on German.

2.1 German NER

German is a very challenging language for NER, because various features used in English do not apply. There is no distinction in the capitalisation of common and proper nouns, so the number of word forms which must be considered as potential NEs is much larger than for languages such as English. German's partially free word order also means that surface cues, such as PER entities often preceding verbs of communication, are much weaker.

A final consideration is gender. The name Mark is likely to be on any list of German person names, but also makes up part of Germany's old currency, the Deutsche Mark, also known as D-Mark or just Mark (gender: feminine; die), and also has the meaning 'marrow' (gender: neuter; das). Whilst gender can sometimes disambiguate word senses, in more complicated sentence constructions, gender distinctions reflected on articles and adjectives can change or be lost when a noun is used in different cases.

2.2 Cross-language Wikipedia

The cross-lingual link structure of Wikipedia represents a valuable resource which can be exploited for inter-language NLP applications. Sorg and Cimiano (2008) developed a method to automatically induce new inter-language links by classifying pairs of articles of two different languages as connected by an inter-language link. They use a classifier utilising various text- and graph-based features, including edit distance between the titles of articles and link patterns. They find that since the fraction of bidirectional links (cases where the English article, e.g. Dog, is linked to the German article Hund, which is linked to the original English article) is around 95% for German and English, they can be used in a bootstrapping manner to find new inter-language links. The consistency and accuracy of the links was also found to vary, with roughly 50% of German-language articles being linked to their English equivalents, and only 14% from English to German.

Richman and Schone (2008) proposed a system in which English Wikipedia article classifications are used to produce NE-annotated corpora in other languages, achieving an F-score of up to 84.7% on French language data, evaluated against human-annotated corpora with the MUC evaluation metric. So far there has been very little research into directly classifying articles in non-English Wikipedias.
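The bidirectional-link test used as a bootstrapping seed filter can be sketched in a few lines of Python. The link tables below are invented toy data, and `bidirectional_pairs` is our own illustrative helper, not code from Sorg and Cimiano (2008):

```python
# Sketch of the bidirectional inter-language link check: an
# English-German article pair is treated as a reliable seed only if
# each article links to the other. The link tables here are invented
# toy data, not taken from real Wikipedia dumps.

def bidirectional_pairs(en_to_de, de_to_en):
    """Return (English title, German title) pairs linked in both directions."""
    return [(en, de) for en, de in en_to_de.items()
            if de_to_en.get(de) == en]

en_to_de = {"Dog": "Hund", "Sydney": "Sydney", "Mark": "Marke"}
de_to_en = {"Hund": "Dog", "Sydney": "Sydney", "Marke": "Brand"}

print(bidirectional_pairs(en_to_de, de_to_en))
# [('Dog', 'Hund'), ('Sydney', 'Sydney')]
```

The asymmetric pair ("Mark" → "Marke" → "Brand") is dropped, mirroring how non-reciprocated links are excluded from the seed set.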
2.3 Learning NER

Machine learning approaches to NER are flexible due to their statistical data-driven approach, but training data is key to their performance (Nothman et al., 2009). The size and topical coverage of Wikipedia makes its text appropriate for training general NLP systems.

The method of Nothman et al. (2008) for transforming Wikipedia into an NE annotated corpus relies on the fact that links between Wikipedia articles often correspond to NEs. By using structural features to classify an article, links to it can be labelled with an NE class.

The process of deriving a corpus of NE annotated sentences from Wikipedia consists of two main sub-tasks: (1) selecting sentences to include in the corpus; and (2) classifying articles linked in those sentences into NE classes. By relying on redundancy, articles that are difficult to classify with confidence may simply be discarded.

This method of processing Wikipedia enables the creation of free, much larger NE-annotated corpora than have previously been available, with wider domain applicability and up-to-date, copyright-free text. We focus on the first phase of this process: accurate classification of articles.

Rank  Article                   Pageviews
1     2008 Summer Olympics      4 437 251
2     Wiki                      4 030 068
3     Sarah Palin               4 004 853
4     Michael Phelps            3 476 803
5     YouTube                   2 685 316
6     Bernie Mac                2 013 775
7     Olympic Games             2 003 678
8     Joe Biden                 1 966 877
9     Georgia (country)         1 757 967
10    The Dark Knight (film)    1 427 277

Table 1: Most frequently viewed Wikipedia articles from August 2008, retrieved from http://stats.grok.se

Rank  Title                 Inlinks
1     United States         543 995
2     Australia             344 969
3     Wikipedia             272 073
4     Association Football  241 514
5     France                227 464

Table 2: Most linked-to articles of English Wikipedia.

tions. Both of these consist of randomly selected articles, Dakka and Cucerzan's consisting of a random set of 800 pages, expanded by list co-occurrence. Nothman et al.'s data set initially consisted of 1100 randomly sampled articles from among all Wikipedia articles. This biased the sample towards entity types that are

NLP tasks in languages other than English are disadvantaged by the lack of available data for training and testing.
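The two sub-tasks of Section 2.3 can be illustrated with a minimal sketch: label each linked token with the NE class of its target article, and discard any sentence containing a link that cannot be classified confidently. The article classifications, the CoNLL-style IOB output, and the 0.9 confidence threshold are all assumptions made for this example, not values from Nothman et al. (2008):

```python
# Illustrative sketch of deriving NE-annotated sentences from linked
# Wikipedia text. All data below is invented for the example.

# Hypothetical article classifications: title -> (NE class, confidence).
ARTICLE_CLASSES = {
    "Joe Biden": ("PER", 0.97),
    "United States": ("LOC", 0.99),
    "Olympic Games": ("MISC", 0.55),  # low confidence
}

CONF_THRESHOLD = 0.9  # assumed cut-off, not a value from the paper


def label_sentence(tokens):
    """tokens: list of (word, link_target_or_None) pairs.

    Returns CoNLL-style IOB tags, or None to discard the sentence
    when any linked article cannot be classified confidently.
    """
    tags = []
    prev_target = None
    for word, target in tokens:
        if target is None:
            tags.append("O")
        else:
            entry = ARTICLE_CLASSES.get(target)
            if entry is None or entry[1] < CONF_THRESHOLD:
                return None  # discard rather than guess (redundancy allows this)
            prefix = "I-" if target == prev_target else "B-"
            tags.append(prefix + entry[0])
        prev_target = target
    return tags


sentence = [("Joe", "Joe Biden"), ("Biden", "Joe Biden"),
            ("visited", None), ("the", None),
            ("United", "United States"), ("States", "United States")]
print(label_sentence(sentence))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC']
```

A sentence linking to the low-confidence "Olympic Games" article would return `None` and be left out of the corpus, reflecting the redundancy argument above.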