Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia

Simone Paolo Ponzetto
Seminar für Computerlinguistik, University of Heidelberg

Roberto Navigli
Dipartimento di Informatica, Università di Roma “La Sapienza”

Abstract

We present a knowledge-rich methodology for disambiguating Wikipedia categories with WordNet synsets and using this semantic information to restructure a taxonomy automatically generated from the Wikipedia system of categories. We evaluate against a manual gold standard and show that both category disambiguation and taxonomy restructuring perform with high accuracy. Besides, we assess these methods on automatically generated datasets and show that we are able to effectively enrich WordNet with a large number of instances from Wikipedia. Our approach produces an integrated resource, thus bringing together the fine-grained classification of instances in Wikipedia and a well-structured top-level taxonomy from WordNet.

1 Introduction

The need for structured knowledge in intelligent systems is a leitmotiv of Artificial Intelligence (AI), from [McCarthy, 1959] to current echoes in [Schubert, 2006]. Previous efforts aiming at maximizing the quality of knowledge repositories have concentrated on collecting this knowledge manually: the WordNet project [Fellbaum, 1998], for instance, provides a semantic lexicon for English and has become the de facto most widely used knowledge resource in Natural Language Processing (NLP). However, while providing a comprehensive repository of word senses, WordNet contains very little domain-oriented knowledge and is populated with only a few thousand instances, i.e. named entities.

To overcome the limitations of manually assembled knowledge repositories, research efforts in AI and NLP have been devoted to automatically harvesting that knowledge [Buitelaar et al., 2005]. In particular, recent years have seen growing interest in the automatic acquisition of machine-readable knowledge from semi-structured knowledge repositories such as Wikipedia [Suchanek et al., 2007; Nastase and Strube, 2008; Wu and Weld, 2008, inter alia]. Nonetheless, questions remain whether these automatically induced knowledge resources achieve the same quality as manually engineered ones, such as WordNet or Cyc [Lenat and Guha, 1990].

The most notable strength of Wikipedia, i.e. its very large coverage, lies not only in its large number of encyclopedic entries, but also in the domain orientation of its categorization network, i.e. very specific categories such as MEDICINAL PLANTS or WOODLAND SALAMANDERS¹. However, such a categorization system is merely a thematically organized thesaurus. Although methods have been developed to induce taxonomies from it [Ponzetto and Strube, 2007, WikiTaxonomy henceforth], these cope badly with very general concepts. This is because the upper regions of the Wikipedia categorization are almost exclusively thematic, and no subsumption relation can be found while remaining inside the category network. For instance, COUNTRIES is categorized under (isa) PLACES, which in turn is categorized under GEOGRAPHY and NATURE, thus having no suitable parent dominating it with a subsumption relation. This is reflected in the 3,487 roots included in WikiTaxonomy: the resource is a sparse set of taxonomic islands that need to be linked to more general concepts for it to resemble a sane taxonomy. In addition, since the resource is automatically generated, manual inspection reveals several errors, e.g. FRUITS isa PLANTS. Such errors can be automatically corrected by enforcing taxonomic constraints from a reference ontology, i.e. given a taxonomy mapping, one could recover from errors by aligning the automatically generated taxonomy to a manual one.

We tackle these issues by proposing a two-phase methodology. The method starts with WikiTaxonomy, although in principle any taxonomy can be input. In a first step, the taxonomy is automatically mapped to WordNet. This mapping can be cast as a Word Sense Disambiguation (WSD) problem [Navigli, 2009]: given a Wikipedia category (e.g. PLANTS), the objective is to find the WordNet synset that best captures the meaning of the category label (e.g. $\mathit{plant}^2_n$)². The optimal mapping is found based on a knowledge-rich method which maximizes the structural overlap between the source and target knowledge resources. As a result, the Wikipedia taxonomy is automatically ‘ontologized’. Secondly, the mapping outcome of the first phase is used to restructure the Wikipedia taxonomy itself. Restructuring operations are applied to those Wikipedia categories which convey the highest degree of inconsistency with respect to the corresponding part of the WordNet subsumption hierarchy. This ensures that the structure of the Wikipedia taxonomy better complies with a reference manual resource. In fact, category disambiguation and taxonomy restructuring synergetically profit from each other: disambiguated categories make it possible to enforce taxonomic constraints, and a restructured taxonomy in turn provides a better context for category disambiguation.

Our approach to taxonomy mapping and restructuring provides three contributions: first, it represents a sound and effective methodology for enhancing the quality of an automatically extracted Wikipedia taxonomy; second, as an additional outcome, we are able to populate a reference taxonomy such as WordNet with a large number of instances from Wikipedia; finally, by linking WikiTaxonomy to WordNet we create a new subsumption hierarchy which includes in its lowest regions the fine-grained classification from Wikipedia, and in its upper regions the better structured content from WordNet. This makes it possible to connect the taxonomic islands found in WikiTaxonomy via WordNet, since the higher regions of the merged resource are provided by the latter.

¹ We use Sans Serif for words, CAPITALS for Wikipedia pages and SMALL CAPS for Wikipedia categories.
² We denote with $w^i_p$ the $i$-th sense of a word $w$ with part of speech $p$. We use word senses to unambiguously denote the corresponding synsets (e.g. $\mathit{plant}^2_n$ for { plant, flora, plant life }).

2 Methodology

Our methodology takes as input a Wikipedia taxonomy (Section 2.1). First, it associates a synset with each Wikipedia category in the taxonomy (Section 2.2). Next, it restructures the taxonomy in order to increase its alignment with the WordNet subsumption hierarchy (Section 2.3).

2.1 Preliminaries

We take as input WikiTaxonomy³. We can view the taxonomy as a forest $F$ of category trees. As an example, in Figure 1 we show an excerpt of the category tree rooted at PLANTS. Each vertex in the tree represents a Wikipedia category. The label of this category is often a complex phrase, e.g. JAZZ HARMONICA PLAYERS BY NATIONALITY. In order to produce a mapping to WordNet, we need to find the lexical items $heads(c)$ best matching each category label $c$, e.g. JAZZ HARMONICA PLAYERS can be mapped to any WordNet sense of player. Terms in WordNet (we use version 3.0) are first searched for a full match with the category label, e.g. plant for PLANTS. If no full match is found, we fall back to the head of the category. First, the lexical heads of a category label are found using a state-of-the-art parser [Klein and Manning, 2003]. Then, we take as head of a category the minimal NP projection of its lexical head, e.g. public transport for PUBLIC TRANSPORT IN GERMANY. This NP is found in the parse tree by taking the head terminal and percolating up the tree until the first NP node is found.

[Figure 1: An excerpt of the Wikipedia category tree rooted at PLANTS. Node labels, by row: PLANTS; LEGUMES, TREES, BOTANY, EDIBLE PLANTS; BEANS, PEAS, ACACIA, PALMS, HERBS, CROPS; MEDICINAL HERBS, FRUITS; PEARS, APPLES.]

[...] sense) with category PLANTS from Figure 1. Category disambiguation is performed in two steps:

1. WordNet graph construction. We start with an empty graph $G = (V, E)$. For each category $c \in T$, and for each head $h \in heads(c)$, the set of synsets containing $h$ is added to $V$. For instance, given the category BEANS we add to $V$ the synsets which contain the four WordNet senses of bean (namely, ‘edible seed’, ‘similar-to-bean seed’, ‘plant’, and ‘human head’). Next, for each vertex $v_0 \in V$ we set $v = v_0$ and we climb up the WordNet isa hierarchy until either we reach its root or we encounter a vertex $v' \in V$ (e.g. $\mathit{legume}^1_n$ is a parent of $\mathit{bean}^3_n$). In the latter case, if $(v, v') \notin E$ we add it to $E$ (e.g. we add $(\mathit{bean}^3_n, \mathit{legume}^1_n)$ to $E$) and set its weight $w(v, v')$ to 0. Finally, for each category $c' \in T$ whose head occurs in the synset $v'$ (in our example, LEGUMES), the edge weight $w(v, v')$ is increased as follows:

$$w(v, v') = w(v, v') + \frac{1}{2^{d_{WN}(v_0, v') - 1} \cdot 2^{d_{Wiki}(c_0, c') - 1}}$$

where $d_{WN}(v_0, v')$ is the number of subsumption edges between $v_0$ and $v'$ in WordNet and $d_{Wiki}(c_0, c')$ is the number of edges between $c_0$ (the category corresponding to $v_0$) and $c'$ in the category tree $T$ (set to the depth $D$ of our tree if $c'$ is not an ancestor of $c_0$). The procedure is repeated iteratively by setting $v = v'$, until the root of the WordNet hierarchy is reached. In our example, we have that $d_{WN}(\mathit{bean}^3_n, \mathit{legume}^1_n) = 1$ and $d_{Wiki}(\text{BEANS}, \text{LEGUMES}) = 1$, thus the weight of the corresponding edge is set to $1/(2^{1-1} \cdot 2^{1-1}) = 1$. Analogously, we update the weights on the path $\mathit{legume}^1_n \rightarrow \cdots \rightarrow \mathit{plant}^2_n$. We note that the contribution added to $w(v, v')$ exponentially decreases with the distance between $v_0$ and $v'$ and between $c_0$ and $c'$. At the end of this step, we obtain a graph $G$ in-

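The head-finding step of Section 2.1 (percolating up from the head terminal to the first NP node) can also be sketched in Python. The sketch assumes a parse tree encoded as nested tuples and assumes that the position of the head terminal is already known, e.g. from the parser's head rules; both the encoding and the function names are hypothetical and do not reflect the parser's actual API.

```python
from typing import List, Optional, Union

# A node is (label, child, child, ...); a terminal is a plain string.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """Return the terminals dominated by `tree`, left to right."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def path_to_leaf(tree: Tree, index: int) -> List[tuple]:
    """Non-terminal nodes from the root down to the terminal at `index` (0-based)."""
    path: List[tuple] = []
    node = tree
    while not isinstance(node, str):
        path.append(node)
        for child in node[1:]:
            size = len(leaves(child))
            if index < size:
                node = child
                break
            index -= size
        else:
            raise IndexError("terminal index out of range")
    return path


def minimal_np(tree: Tree, head_index: int) -> Optional[List[str]]:
    """Climb from the head terminal to the first NP node and return its words."""
    for node in reversed(path_to_leaf(tree, head_index)):
        if node[0] == "NP":
            return leaves(node)
    return None


# Hand-written parse of the category label PUBLIC TRANSPORT IN GERMANY.
PARSE = ("NP",
         ("NP", ("JJ", "Public"), ("NN", "transport")),
         ("PP", ("IN", "in"), ("NP", ("NNP", "Germany"))))

# The head terminal is assumed to be "transport", at position 1.
print(minimal_np(PARSE, 1))  # ['Public', 'transport']
```

Stopping at the first NP rather than the root is what keeps the modifier "Public" while discarding the prepositional phrase "in Germany".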