Bootstrapping Domain Ontologies from Wikipedia: A Uniform Approach
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Daniil Mirylenka and Andrea Passerini
University of Trento
Via Sommarive 9, 38123, Trento, Italy

Luciano Serafini
Fondazione Bruno Kessler
Via Sommarive 18, 38123, Trento, Italy

Abstract

Building ontologies is a difficult task requiring skills in logic and ontological analysis. Domain experts usually get as far as organizing a set of concepts into a hierarchy in which the semantics of the relations is under-specified. The categorization of Wikipedia is a huge concept hierarchy of this form, covering a broad range of areas. We propose an automatic method for bootstrapping domain ontologies from the categories of Wikipedia. The method first selects a subset of concepts that are relevant for a given domain. The relevant concepts are subsequently split into classes and individuals, and, finally, the relations between the concepts are classified into subclass of, instance of, part of, and generic related to. We evaluate our method by generating ontology skeletons for the domains of Computing and Music. The quality of the generated ontologies has been measured against manually built ground truth datasets of several hundred nodes.

1 Introduction

Building ontologies is a difficult task that requires expertise in the domain that is being modeled, as well as in logic and ontological analysis. Firstly, domain knowledge is necessary to decide “the scope and the boundaries of the ontology” [Iqbal et al., 2013], that is, to separate the entities relevant to the modeled domain from the accessory elements that should not be included in the ontology. In the subsequent phase, domain expertise is needed to express the relations between the selected entities. The most important relations are hierarchical ones, such as meronomy, and the relation between an entity and its type. A clear distinction between relation types requires additional competence in logic and ontological analysis. Domain experts tend to merge these relations into a generic broader/narrower relation, which results in partially formalized knowledge resources such as classification schemes, lexicons, and thesauri.

These types of partially structured descriptions of the domain are widely used in the Semantic Web, spanning from global categorizations, such as that of Wikipedia, to domain-specific schemes, such as the ACM Computing Classification System, and providing great support for structural access to web resources. An example of a system that exploits such a resource is ScienScan [Mirylenka and Passerini, 2013], which provides structured access to computer science literature by navigating the Wikipedia category network. A fully developed formal representation of this structure would enhance these types of applications with the possibility of a more flexible semantic search and navigation [Osborne et al., 2013]. For this purpose, we argue, it is worth facilitating the transformation of informally structured knowledge representations into fully formalized ontologies.

In this paper we propose an automatic method based on machine learning techniques for extracting domain ontology skeletons from the Wikipedia category hierarchy. We use the term ontology skeleton to indicate a basic version of an ontology that contains all the primitive concepts, the essential hierarchical relations between them, and some track of the remaining generic relations of unspecified type. An ontology skeleton is meant to be further refined in two ways: first, by providing corrections to the solutions proposed by the automatic algorithm, and second, by assigning more specific relation types to generic relations. The method first selects a subset of the categories that are relevant for the given domain. The relevant categories are then split into classes and individuals, and, finally, the relations between the categories are classified as either subclass of, instance of, part of, or generic related to.

We evaluate our method by generating ontology skeletons for the domains of Computing and Music. The quality of the generated ontologies has been measured against manually built ground truth datasets. The results suggest high quality in selecting the relevant nodes, discriminating classes from individuals, and identifying the subclass of relation. The accuracy of identifying the instance of relation is on par with the state of the art. The more difficult part of relation between Wikipedia categories has not been addressed in the literature, and our initial results need further improvement. The code, data, and experiments are available online¹.

¹ https://github.com/anonymous-ijcai/dsw-ont-ijcai

2 Problem statement

According to [Suárez-Figueroa et al., 2012], our problem fits into the scenario of reusing non-ontological resources for building ontologies. Our resource, the categorization of Wikipedia², is represented by a hierarchy of labels used for organizing Wikipedia articles. With some exceptions, the categories follow the established naming conventions³, such as:

• the names of the topic categories should be singular, normally corresponding to the name of a Wikipedia article;
• the names of the set categories should be plural;
• the meaning of a name should be independent of the way the category is connected to other categories.

As suggested above, there are two main kinds of categories: topic categories, named after a topic, and set categories, named after a class. An example of a topic category is France, which contains articles about France as a whole; an example of a set category is Cities in France.

With the exception of occasional cycles, Wikipedia categories are organized into a lattice. There is a top-level category, and all other categories have at least one parent. The semantics of the hierarchical relation is not specified, and individual relations may be of different ontological nature. The main types of relations are:

1. subset relation between the set categories, e.g. Countries in Europe ← Baltic countries,
2. membership relation between a set category and a topic category, e.g. Countries in Europe ← Albania,
3. part-of relation, usually between two topic categories, e.g. Scandinavia ← Sweden,
4. sub-topic relation between two topic categories, e.g. European culture ← European folklore,
5. other relations, whose nature may be specified by the category labels, e.g. Europe ← Languages of Europe.

Formalizing these relations in description logics requires:

The definition of the signature: For each set category, such as Country in Europe, one should introduce a class. For each topic category, such as Sweden, one should introduce an individual. Then, the relations between the entities should be declared, such as part of and subtopic of. New relational symbols should be introduced to formalize the relations implied by the category names, such as the relation spoken in implied by the category Language of Europe.

The definition of the axioms: Finally, the sub-category relations between Wikipedia categories should be transformed into axioms of description logic, such as:

• Baltic countries ⊑ Country in Europe,
• part of(Sweden, Scandinavia), etc.

In the rest of the paper we propose an automatic method for extracting domain ontology skeletons from the category network of Wikipedia. At a high level, the method consists of the following three steps: 1) selecting the relevant subset of categories, 2) transforming each category into a class or an individual, and 3) establishing the semantic relations between the categories.

3 Solution

Each of the three steps of our method (selecting the relevant categories, splitting them into classes and individuals, and classifying the relations) is cast as a binary classification problem. More precisely, the first step reduces to discriminating between relevant and irrelevant nodes, the second to discriminating between classes and individuals, and the last to discriminating between a specific relation (such as subclass of) and the generic related to. Solving the problem at each step requires providing manually annotated examples: for instance, those of relevant and irrelevant categories. The rest of the work is done by a machine learning algorithm. In the following sections we provide a detailed description of the three steps of the method.

3.1 Selecting the subgraph of relevant categories

In order to identify the relevant categories, we first select the most general category representing the domain of interest (henceforth referred to as “the root”). For the computer science domain, for instance, we choose the category Computing.

The category selection algorithm is based on the following two observations. First, all relevant categories are descendants of the root with respect to the sub-category relations. Second, the categories further from the root are more likely to be irrelevant (following sub-category links, one can arrive from Computing to Buddhism in China in just 9 hops).

The algorithm (Algorithm 1) performs a breadth-first traversal of the category graph, starting from the root. For each category being visited, a decision is made (line 7) whether the category is relevant and should be scheduled for recursive traversal. The decision is made by a trained classifier, based on the features of the category (discussed further). To ensure the termination of the algorithm, we set a limit max_depth on the maximum allowed traversal depth. Based on our experience, we claim that using max_depth=20 is safe, in the sense that any category that is further than 20 hops from the root is irrelevant. In practice a much smaller value can be chosen, depending on the domain. For Computing we empirically found that max_depth=7 is sufficient with high confidence.

Algorithm 1: Selection of the relevant categories.
input : root; max_depth
output: relevant: the set of relevant categories
1 queue ← empty FIFO queue
2 visited ← {root}
3 relevant ← ∅
4 queue.push(root)
5 while queue is not empty :
6     category ← queue.pop()