Deriving a Large-Scale Taxonomy from Wikipedia

Deriving a Large Scale Taxonomy from Wikipedia Simone Paolo Ponzetto and Michael Strube EML Research gGmbH Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract label the relations between categories. As a result we are able to derive a large scale taxonomy. We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico- Motivation syntactic matching. As a result we are able to derive a large Arguments for the necessity of symbolically encoded scale taxonomy containing a large amount of subsumption, knowledge for AI date back at least to McCarthy (1959). i.e. isa, relations. We evaluate the quality of the created re- source by comparing it with ResearchCyc, one of the largest Such need has become clearer throughout the last decades, manually annotated ontologies, as well as computing seman- as it became obvious that AI subfields such as information tic similarity between words in benchmarking datasets. retrieval, knowledge management, and natural language processing (NLP) all profit from machine accessible knowledge (see Cimiano et al. (2005) for a broader motivation). Introduction E.g., from a computational linguistics perspective, knowl- The availability of large coverage, machine readable knowledge bases for NLP applications should be: edge is a crucial theme for Artificial Intelligence. While • domain independent, i.e. have a large coverage, in partic- advances towards robust statistical inference methods (cf. ular at the instance level; e.g. Domingos et al. (2006) and Punyakanok et al. (2006)) will certainly improve the computational modeling of intelli- • up-to-date, in order to process current information; gence, we believe that crucial advances will also come from • multilingual, in order to process information in a language rediscovering the deployment of large knowledge bases. independent fashion. Creating knowledge bases, however, is expensive and they are time-consuming to maintain. In addition, most of the The Wikipedia categorization system satisfies all these existing knowledge bases are domain dependent or have a points. Unfortunately, the Wikipedia categories do not form limited and arbitrary coverage – Cyc (Lenat & Guha, 1990) a taxonomy with a fully-fledged subsumption hierarchy, but and WordNet (Fellbaum, 1998) being notable exceptions. only a thematically organized thesaurus. As an example, the The field of ontology learning deals with these problems category CAPITALS IN ASIA1 is categorized in the upper by taking textual input and transforming it into a taxon- category CAPITALS (isa), whereas a category such as PHI- omy or a proper ontology. However, the learned ontolo- LOSOPHY is categorized under ABSTRACTION and BELIEF gies are small and mostly domain dependent, and evalua- (deals-with?) as well as HUMANITIES (isa) and SCIENCE tions have revealed a rather poor performance (see Buitelaar (isa). Another example is a page such as EUROPEAN MI- et al. (2005) for an extensive overview). CROSTATES which belongs to the categories EUROPE (are- We try to overcome such problems by relying on a wide located-in) and MICROSTATES (isa). coverage online encyclopedia developed by a large number of users, namely Wikipedia. We use semi-structured input Related Work by taking the category system in Wikipedia as a conceptual network. This provides us with pairs of related concepts There is a large body of work concerned with acquiring whose semantic relation is unspecified. The task of creat- knowledge for AI and NLP applications. Many NLP com- ing a subsumption hierarchy then boils down to distinguish ponents can get along with rather unstructured, associative between isa and notisa relations. We use methods based on knowledge as provided by the cooccurence of words in large connectivity in the network and lexico-syntactic patterns to corpora, e.g., distributional similarity (Church & Hanks, Copyright c 2007, Association for the Advancement of Artificial 1We use Sans Serif for words and queries, CAPITALS for Intelligence (www.aaai.org). All rights reserved. Wikipedia pages and SMALL CAPS for Wikipedia categories. 1440 1990; Lee, 1999; Weeds & Weir, 2005, inter alia) and vec- Category network cleanup (1) tor space models (Schutze,¨ 1998). Such unlabeled relations We start with the full categorization network consisting of between words proved to be as useful for disambiguating 165,744 category nodes with 349,263 direct links between syntactic and semantic analyses as the manually assembled them. We first clean the network from meta-categories knowledge provided by WordNet. used for encyclopedia management, e.g. the categories un- However, the availability of reliable preprocessing com- der WIKIPEDIA ADMINISTRATION. Since this category is ponents like POS taggers, syntactic and semantic parsers connected to many content bearing categories, we cannot allows the field to move towards higher level tasks, such remove this portion of the graph entirely. We remove in- as question answering, textual entailment, or complete di- stead all those nodes whose labels contain any of the fol- alogue systems which require to understand language. This lowing strings: wikipedia, wikiprojects, lists, mediawiki, lets researchers focus (again) on taxonomic and ontological template, user, portal, categories, articles, pages. This resources. The manually constructed Cyc ontology provides leaves 127,325 categories and 267,707 links still to be pro- a large amount of domain independent knowledge. How- cessed. ever, Cyc cannot (and is not intended to) cope with most spe- cific domains and current events. The emerging field of on- Refinement link identification (2) tology learning tries to overcome these problems by learning The next preprocessing step includes identifying so-called (mostly) domain dependent ontologies from scratch. How- refinement links. Wikipedia users tend to organize many ever, the generated ontologies are relatively small and the category pairs using patterns such as YXand XBYZ(e.g. results rather poor (e.g., Cimiano et al. (2005) report an F- MILES DAVIS ALBUMS and ALBUMS BY ARTIST). We la- measure of about 33% with regard to an existing ontology bel these patterns as expressing is-refined-by semantic re- of less than 300 concepts). It seems to be more promising lations between categories. While these links could be in to extend existing resources such as Cyc (Matuszek et al., principle assigned a full isa semantics, they represent meta- 2005) or WordNet (Snow et al., 2006). The examples shown categorization relations, i.e., their sole purpose is to better in these works, however, seem to indicate that the exten- structure and simplify the categorization network. We take sion takes place mainly with respect to named entities, a all categories containing by in the name and label all links task which is arguably not as difficult as creating a complete with their subcategories with an is-refined-by relation. This (domain-) ontology from scratch. labels 54,504 category links and leaves 213,203 relations to Another approach for building large knowledge bases re- be analyzed. lies on input by volunteers, i.e., on collaboration among the users of an ontology (Richardson & Domingos, 2003). How- Syntax-based methods (3) ever, the current status of the Open Mind and MindPixel projects2 does indicate that they are largely academic en- The first set of processing methods to label relations between terprises. Similar to the Semantic Web (Berners-Lee et al., categories as isa is based on string matching of syntactic 2001), where users are supposed to explicitly define the se- components of the category labels. mantics of the contents of web pages, they may be hin- Head matching. The first method labels pairs of cate- dered by too high an entrance barrier. In contrast, Wikipedia gories sharing the same lexical head, e.g. BRITISH COM- and its categorization system feature a low entrance barrier PUTER SCIENTISTS isa COMPUTER SCIENTISTS. We parse achieving quality by collaboration. In Strube & Ponzetto the category labels using the Stanford parser (Klein & Man- (2006) we proposed to take the Wikipedia categorization ning, 2003). Since we parse mostly NP fragments, we con- system as a semantic network which served as basis for com- strain the output of the head finding algorithm (Collins, puting the semantic relatedness of words. In the present 1999) to return a lexical head labeled as either a noun or work we develop this idea a step further by automatically a 3rd person singular present verb (this is to tolerate errors assigning isa and notisa labels to relations between the cat- where plural noun heads have been wrongly identified as egories. That way we are able to compute the semantic sim- verbs). In addition, we modify the head finding rules to re- ilarity between words instead of their relatedness. turn both nouns for NP coordinations (e.g. both buildings and infrastructures for BUILDINGS AND INFRASTRUC- Methods TURES IN JAPAN). Finally, we label a category link as isa if Since May 2004 Wikipedia allows for structured access by the two categories share the same head lemma, as given by means of categories3. The categories form a graph which a finite-state morphological analyzer (Minnen et al., 2001). can be taken to represent a conceptual network with un- Modifier matching. We next label category pairs as not- specified semantic relations (Strube & Ponzetto, 2006). We isa in case the stem of the lexical head of one of the cate- present here our methods to derive isa and notisa relations gories, as given by the Porter stemmer (Porter, 1980), occurs from these generic links. in non-head position in the other category. This is to rule out thematic categorization links such as CRIME COMICS and 2 www.openmind.org and www.mindpixel.com CRIME or ISLAMIC MYSTICISM and ISLAM. 3Wikipedia can be downloaded at http://download. wikimedia.org. In our experiments we use the English Both methods achieve a good coverage by identifying re- Wikipedia database dump from 25 September 2006.

Load more