
Relation Discovery on the DBpedia

Obey Liu ([email protected]) May 2009

Abstract

DBpedia is a community effort to extract structured information from Wikipedia. It has the interesting properties of having both a deep ontology and a massive dataset, in addition to being very diversified, dynamic and multilingual. We describe here how DBpedia can be specifically used to discover relations between entities within DBpedia itself and beyond, in other linked ontologies. In the course of this research, many insights on the nature of the DBpedia dataset have been noted and will also be presented.

Keywords: DBpedia, relation discovery, ontology mapping, background knowledge, semantic web

TER 2009, supervised by Jérôme Euzenat

1 Introduction

1.1 Background

Exploring the Semantic Web has become more and more useful thanks to the growth in number and breadth of semantic data sources. Current approaches often involve exploring multiple distributed ontologies [SdM08], but this is highly dependent on the quality of cross-ontology mappings. Instead, other approaches concentrate on using a web of manually linked datasets such as the Linked Data project. The Linked Data project1 connects through the Web related datasets that weren't previously linked, or makes it easier to link datasets that were previously linked through various incompatible methods. Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF." Among the many datasets linked by this project, the DBpedia dataset has become a de facto core, or nucleus, becoming the point of rendez-vous for many other datasets because of its large size, wide thematic coverage and dynamism.

Today, most ontologies and datasets cover only specific domains, are created by relatively small groups, and are very cost-intensive to keep up-to-date. At the same time, Wikipedia has grown into one of the most central and versatile knowledge sources available, maintained by thousands of contributors. The DBpedia project2 leverages this gigantic source of knowledge by extracting structured information from Wikipedia.

Relation discovery is the problem of finding a path between two nodes on a semantic graph. This problem has applications, for example, in semantic enrichment of wordsets or finding paths of logical inference. Because of the way Wikipedia is created, through loosely coordinated incremental editing by volunteers, and its seldom-attained size, more than 2.6 million things, DBpedia creates unique challenges and opportunities for these applications. Our particular goal here is to automatically create paths between concept nodes on the DBpedia semantic graph that are similar to what a human tasked with doing it would find manually. We aspire to reproduce the common sense that a manual operator would apply.

For this paper, we analyzed the structure of the DBpedia extracted ontology and dataset, classifying various kinds of relations and their uses in relation discovery. With this knowledge, we experimented by trial and error with search algorithms to discover paths between nodes in an efficient way. To easily execute these searches, we wrote various tools in Python against a Virtuoso RDF datastore containing the latest full dump of DBpedia in all its languages.

1.2 Contributions

Overall, this paper makes the following contributions:

• development of high-level Python tool building blocks to work with DBpedia
• analysis of the various kinds of relations in DBpedia and their usefulness in the Semantic Web
• experimentation with various DBpedia-specific search algorithms and graph pruning heuristics

1.3 Purposes

Ontology mapping Ontology mapping techniques are essential to semantically bridge isolated ontologies and datasets. One interesting way to map two ontologies is to compare their topologies and recognize duplicated relations between classes [SdM06]. Using DBpedia as background knowledge to provide an intermediary mapping requires an efficient way to discover DBpedia relations between all the classes of an ontology.

Relation discovery in DBpedia Discovering relations between concepts in DBpedia without specific attention to its specificities has already been done [LSA07] and is rather straightforward, but the discovered relations only rely on a small part of the DBpedia dataset and are not easily usable in other algorithms. Creating paths between concepts in DBpedia could benefit from discriminating between various kinds of relations and from expanding the breadth of exploited datasets.

1 http://linkeddata.org/
2 http://dbpedia.org/

Relation discovery in other knowledge bases DBpedia is a rather particular knowledge base, with a great variety of relations of various semantic strengths. Insights from relation discovery in DBpedia could be reused for relation discovery in other knowledge bases or across heterogeneous knowledge bases.

2 Storing and exploring the DBpedia ontologies and dataset

Because DBpedia is a large dataset with multiple ontologies organizing its concepts, particular care had to be taken to store this vast amount of data and make its properties accessible in an efficient way.

2.1 The DBpedia dataset

The DBpedia dataset is presented as a large multi-domain ontology. It currently describes 2.6 million "things" with 274 million "facts" (as of November 2008). It uses the Resource Description Framework (RDF)3 as a flexible data model for representing the information and the SPARQL4 query language to query it. Wikipedia source articles consist mostly of free text, but also contain some structured information, such as templates, category trees, geo-coordinates and links to external web pages.

Identifying and describing "Things" Each thing in the DBpedia data set is identified by a URI reference of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Every DBpedia resource is then at least described by a label, a short and long abstract and a link to the corresponding Wikipedia page.
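This Name-for-Name convention makes it easy to move between the two URI spaces programmatically. The following is a minimal sketch of that mapping (the function name is illustrative; the percent-encoded Name is assumed to be reused as-is):

def wikipedia_url_to_dbpedia_uri(wikipedia_url):
    # "http://en.wikipedia.org/wiki/Berlin" -> "Berlin"
    name = wikipedia_url.rsplit("/wiki/", 1)[1]
    # Name is reused verbatim in the DBpedia resource namespace
    return "http://dbpedia.org/resource/" + name

print(wikipedia_url_to_dbpedia_uri("http://en.wikipedia.org/wiki/Berlin"))
# http://dbpedia.org/resource/Berlin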

Classifications DBpedia provides several classification schemata for these “Things”:

• Wikipedia Categories are represented using the SKOS vocabulary5
• The YAGO Classification is derived from the Wikipedia Categories using Word Net [SKW08].
• The Word Net Synset Links are derived from the Wikipedia infobox templates using Word Net synsets. This is more semantically precise than datasets relying on the Wikipedia Categories.

Infobox Data Wikipedia Infoboxes offer a very specific, faceted approach to a broad range of things and are thus a very valuable source of structured information that can be queried very expressively. Two main datasets are extracted from infoboxes:

• The Infobox Dataset is created by extracting the properties of all infoboxes within all articles. The resulting 22.8 million pieces of information are represented through 8000 different property types, with combinations varying with infobox types. There is no formal ontology in this data set.

• The Infobox Ontology is based on the Infobox Dataset, but with hand-generated mappings of Wikipedia infoboxes to a newly created DBpedia ontology. The ontology consists of 170 classes which form a subsumption hierarchy and have altogether 900 properties. The mappings address weaknesses in the Wikipedia infobox system, like duplicate infoboxes for the same classes or using different names for the same property. Therefore, the instance data within the Infobox Ontology is much cleaner and better structured, but doesn't currently cover the whole range of infoboxes and properties. The full ontology only contains about 882,000 instances.

External Links Because Wikipedia and the Semantic Web are deeply rooted in a whole ecosystem of linked information, external links to other sources of information are naturally provided:

• HTML links to external web pages, either as reference pages or as the "official" homepage of the Thing

3 http://www.w3.org/TR/rdf-primer/
4 http://www.w3.org/TR/rdf-sparql-query/
5 http://www.w3.org/2004/02/skos/

• RDF links to external data sources, using the owl:sameAs property; for example, countries can be linked with the Geonames6, Eurostat7 or CIA Factbook8 ontologies and authors can be linked to the Project Gutenberg9 ontology.

2.2 RDF Datastore and SPARQL Queries

RDF Datastore For this research, we used the full dump of DBpedia 3.2 in 35 languages, totaling 38 Gb of uncompressed triples in NT format, separated in various files depending on the dataset and language. When loaded into native datastore format or into memory cache, this takes about 60 Gb. The choice of an RDF datastore was quite straightforward: Sesame10 and Jena11, popular Java-based datastores, could not scale to about 300 million triples, leaving OpenLink Virtuoso12 as the only real contender [BS09], which is also the datastore chosen by the DBpedia creators.

Another interesting storage possibility was BRAHMS, a high-performance C-based in-memory datastore, but it would have forced us to write even prototype code in C and, although performance would be very high, not enough memory would be available on most systems. Of course, 60 Gb would not fit in most systems' memory, but a subset of the dataset, such as only the English dataset slices, could, perhaps even used in combination with a regular on-disk datastore.

SPARQL Queries Here is an example of a rather advanced SPARQL query and its graphical output on Figure 1:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?name ?birth ?description ?person WHERE {
  ?person dbo:birthplace <http://dbpedia.org/resource/Berlin> .
  ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
  ?person dbo:birthdate ?birth .
  ?person foaf:name ?name .
  ?person rdfs:comment ?description .
}

This queries for the name, birthdate, description and URI of German musicians who were born in Berlin. It demonstrates the combined use of multiple datasets (a Python sketch of running this query against a local endpoint follows the list):

• birth date (dbo:birthdate) and birth place (dbo:birthplace) are Infobox Ontology properties, the birth place linking to a city entity

• Category:German_musicians is a Wikipedia Category, linked through the SKOS vocabulary (skos:subject)

• name (foaf:name) and description (rdfs:comment) are regular Friend Of A Friend13 and RDF Schema14 RDF properties
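As mentioned above, here is a minimal sketch of how such a query can be run from Python. It assumes the SPARQLWrapper library and a local Virtuoso instance exposing its default SPARQL endpoint at http://localhost:8890/sparql; all namespace prefixes are declared explicitly so the query is self-contained, and the property names follow the usage in this paper.

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?name ?birth ?description ?person WHERE {
  ?person dbo:birthplace <http://dbpedia.org/resource/Berlin> .
  ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
  ?person dbo:birthdate ?birth .
  ?person foaf:name ?name .
  ?person rdfs:comment ?description .
}
"""

endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # assumed local Virtuoso endpoint
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)

# Each binding is a dictionary of variable name -> {"type": ..., "value": ...}
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["birth"]["value"])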

3 Basic relation discovery in DBpedia

DBpedia is, after all, a very large directed graph. One of the most elementary tasks, or so it seems, is to find a path between two nodes, Things, on this graph. The challenge here is to find one or more useful (to be defined) paths in a quick and efficient way. A good path can bring valuable information about the relation between the two nodes.

6 http://geonames.org/
7 http://www4.wiwiss.fu-berlin.de/eurostat/
8 http://www4.wiwiss.fu-berlin.de/factbook/
9 http://www4.wiwiss.fu-berlin.de/gutendata/
10 http://www.openrdf.org/
11 http://jena.sourceforge.net/
12 http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
13 http://www.foaf-project.org/
14 http://www.w3.org/TR/rdf-schema/

[Figure 1: A query for German musicians born in Berlin. The categories are shown in bold, the two matching musicians (Klaus Voormann and Alexander Marcus) in gray, and the rest are their properties.]

3.1 Relations exploration

The basic algorithmic problem here is that of uninformed, multiple-source, multiple-destination shortest paths in a positively weighted graph. For this, we examined two algorithms:

• a Bidirectional Breadth-First Search with a FIFO queue, executed sequentially on all source-destination pairs

• a variant of Bidirectional Breadth-First Search with a priority queue, executed simultaneously from all sources and destinations

Bidirectional Breadth-First Search We first gradually implemented the Bidirectional Breadth-First Search. Bidirectional Breadth-First Search is a graph search algorithm that finds a shortest path from a source node to a destination node in a directed graph. It runs two simultaneous searches: one forward from the source, and one backward from the destination, stopping when the two meet in the middle. The reason for this approach is that in many cases it is faster: for instance, if we assume that both searches expand a tree with branching factor b, and the distance from source to destination is d, each of the two searches has complexity O(b^(d/2)), and the sum of these two search times is much less than the O(b^d) complexity that would result from a single search from the source to the destination. A Python sketch of this procedure is given after the step list below.

1. Enqueue the root node in the forward queue: the source Thing in the path we want to discover

2. Enqueue the target node in the backward queue: the destination Thing in the path we want to discover

3. Dequeue a forward node and examine it

• If the forward node is found in the backward zone, we have met in the middle: quit the search and return the complete path
• Otherwise, enqueue any successors (the direct forward neighbors) that have not yet been examined in the forward queue and add them to the forward zone

4. Dequeue a backward node and examine it

• If the backward node is found in the forward zone, we have met in the middle: quit the search and return the complete path
• Otherwise, enqueue any predecessors (the direct backward neighbors) that have not yet been examined in the backward queue and add them to the backward zone

5. If either of the two queues is empty, every node in the connected component has been examined: quit the search and return "not found"

6. If we are still under the depth limit, repeat from Step 3
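Here is the minimal Python sketch announced above. It expands one full depth level in one direction before switching to the other (as in the multithreaded variant described below); the successors and predecessors callbacks, for instance backed by SPARQL queries against the datastore, as well as the function names and the default depth limit, are illustrative assumptions rather than our exact implementation.

from collections import deque

def bidirectional_bfs(source, target, successors, predecessors, max_depth=6):
    # Parent maps double as the forward and backward "zones".
    if source == target:
        return [source]
    fwd_parent = {source: None}
    bwd_parent = {target: None}
    fwd_queue, bwd_queue = deque([source]), deque([target])

    def build_path(meeting):
        # Walk back to the source, then forward to the target.
        path, node = [], meeting
        while node is not None:
            path.append(node)
            node = fwd_parent[node]
        path.reverse()
        node = bwd_parent[meeting]
        while node is not None:
            path.append(node)
            node = bwd_parent[node]
        return path

    for _ in range(max_depth):
        if not fwd_queue or not bwd_queue:
            return None  # a connected component exhausted: no path exists
        for _ in range(len(fwd_queue)):            # expand one forward level
            node = fwd_queue.popleft()
            for nxt in successors(node):
                if nxt in fwd_parent:
                    continue
                fwd_parent[nxt] = node
                if nxt in bwd_parent:              # zones met in the middle
                    return build_path(nxt)
                fwd_queue.append(nxt)
        for _ in range(len(bwd_queue)):            # expand one backward level
            node = bwd_queue.popleft()
            for prev in predecessors(node):
                if prev in bwd_parent:
                    continue
                bwd_parent[prev] = node
                if prev in fwd_parent:
                    return build_path(prev)
                bwd_queue.append(prev)
    return None  # depth limit reached

In our tools, successors and predecessors would issue SPARQL queries for the outgoing and incoming triples of a node; any pair of callables returning lists of neighbours works here.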

We later implemented a multithreaded version, requesting neighbors for multiple nodes simultaneously, improving performance on multicore systems. To sidestep synchronization issues, we examine all the nodes of a given depth in one direction before proceeding with the other direction and then the next depth. When looking for all relations between nodes in a set of nodes, we execute the algorithm on all pairs of nodes, relying on memory caching in the RDF datastore to speed up redundant searches.

3.2 First examination of paths

This search returns the shortest path considering the graph as unweighted. This leads to a broad range of issues. First among these is the heavy presence of certain kinds of relations of questionable usefulness. Three examples are the disambiguation nodes, the redirections and the wikilink relations.

[Figure 2: The "Mercury" disambiguation node with its disambiguated nodes. This demonstrates how a disambiguation node inappropriately links a Roman god, an automobile company and a pop singer.]

Disambiguation nodes Disambiguation nodes are a way to resolve conflicts in article titles when a single term can be associated with more than one topic, making that term likely to be the natural title for more than one article. For example, the word "Mercury" can refer to several different things, including an element, a planet, an automobile brand, a record label, a NASA manned-spaceflight project, a plant, and a Roman god. Since only one Thing can have the generic name "Mercury", unambiguous titles are used for each of these topics (see Figure 2). There must then be a way to direct to the correct specific article when an ambiguous term is referenced by a, most probably parsed, RDF relation; this is what is known as disambiguation. In this case it is achieved using Mercury as a node linking to all the other articles.

The presence of disambiguation nodes is clearly detrimental to the quality of the results. Things linked by a disambiguation node are only related through rough homonymy. This kind of relation is a symptom of the partly parsed nature of DBpedia: links from an article are sometimes created with a disambiguation page instead of the disambiguated article, either because of oversight or because of subsequent replacement of an article with a disambiguation page. This kind of issue never appears in internal relations in a hand-crafted ontology.

To resolve this issue, we decided to outright exclude all disambiguation nodes from path exploration. When considering a relation leading to a disambiguation page, only one subsequent link (the actually disambiguated page) is useful and all others are noise. There are methods [NC08] to automatically disambiguate links on Wikipedia but we chose not to implement them for this research.
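The exclusion itself can be implemented as a simple filter on neighbour expansion. Below is a hedged sketch of that filter; the predicate URI and the idea of pre-loading the set of disambiguation nodes from the disambiguation-links dataset are assumptions for illustration, not the exact property names used by DBpedia 3.2.

# Assumed predicate URI for "X disambiguates Y" triples (illustrative).
DISAMBIGUATES = "http://dbpedia.org/ontology/wikiPageDisambiguates"

def load_disambiguation_nodes(triples):
    # Every node that is the source of a "disambiguates" triple is a disambiguation page.
    return {s for (s, p, o) in triples if p == DISAMBIGUATES}

def filtered_successors(node, successors, disambiguation_nodes):
    # Never expand a disambiguation node, and never step onto one.
    if node in disambiguation_nodes:
        return []
    return [n for n in successors(node) if n not in disambiguation_nodes]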

Redirections Redirections are RDF nodes that are only used to point to another node. They are used for multiple purposes:

• misspellings, alternative spellings, tenses, punctuation, capitalizations: for example, "GOOGLE" redirects to "Google" and "E. coli" redirects to "Escherichia coli"

• sub-topics and small topics in broader contexts: for example, "Distributed denial of service" redirects to "Denial of service", or worse, "Blackberry 8820" redirects to "List of BlackBerry products"; this creates some semantic noise

According to [HSB07], 78% of redirections are perfect synonyms, 12% of redirections reflect pages which have been incorporated into a larger page and the last 10% are either not easily identifiable or noise. Considering these statistics, we chose to keep redirections as useful ways to provide alignment between labels representing the same thing.
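In practice, keeping redirections amounts to collapsing a redirecting node onto its target so that alternative labels align to one canonical Thing. The following sketch assumes a redirect map built from the redirects dataset; the predicate URI is an assumption, as the property name varies across DBpedia releases.

REDIRECTS = "http://dbpedia.org/ontology/wikiPageRedirects"  # assumed predicate URI

def build_redirect_map(triples):
    return {s: o for (s, p, o) in triples if p == REDIRECTS}

def canonical(node, redirect_map, max_hops=5):
    # Follow redirect hops, with a small bound to guard against cycles.
    for _ in range(max_hops):
        if node not in redirect_map:
            break
        node = redirect_map[node]
    return node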

Wikilinks Wikilinks are the simplest links from one Wikipedia article to another. They are parsed from Wikipedia article bodies for DBpedia as simple "source page" and "destination page" pairs. Compared to the other kinds of RDF triples in DBpedia, they are the most general, in the sense that they cover the most kinds of relations, yet they are the least precise, because they don't have a relation property, only using a generic "wikilink" relation type.

[Figure 3: Paths between the Seine and the Nile. The results are overwhelmed by wikilinks, yet there is no direct link between the two major rivers.]

Because they are the bread and butter of Wikipedia articles, they outnumber all other datasets in DBpedia: there are 70 million Wikilink triples, compared to 30 million Infobox dataset triples or only 7 million Wikipedia Categories dataset triples. Because of their number, they tend to make the DBpedia graph overly dense and, in almost all cases, paths including wikilinks are strictly shorter than any path excluding wikilinks. Shortest paths using wikilinks also tend, by recursive application of the previous argument, to be made only of wikilinks. In a subjective examination, wikilinks almost always provide no useful semantic value.

4 Improved algorithms and heuristics

To help with our work, we would like to define a rough distance metric for each triple. This distance metric roughly correlates with the semantic strength of the measured relation. First, we need to modify our exploration algorithm to handle a weighted graph.

4.1 Multi-directional Breadth-First Search with Priority Queue

For the second iteration of the relation explorer, we decided to change a few objectives:

• instead of only looking for the shortest path in one direction between two nodes, we simultaneously examine both directions, creating a forward and a backward expanding zone

• when given a set of nodes, instead of examining each pair, we execute all Breadth-First Searches simultaneously, looking for contacts between zones expanding from all nodes

• instead of a FIFO queue, we use a priority queue based on a distance metric to reproduce an algorithm similar to a Dijkstra search

• because of serious synchronization problems, we reverted back to a monothreaded system

Here is an extract of the core search code:

for vertex in vertices_set:
    if vertex == edge.source:
        break
    if edge.triple.destination in zones[vertex].backward:
        # met another zone, note contact
        add_contact(Contact(edge.source, edge.triple,
                            zones[vertex].backward[edge.triple.destination],
                            vertex))

if edge.triple.source != edge.triple.destination \
        and edge.triple.destination in vertices_set:
    # met a vertex, stop
    return

neighbours = forward_spawn(edge.triple)
for triple in neighbours:
    if triple.destination not in zones[edge.source].forward:
        zones[edge.source].forward[triple.destination] = triple
        new_edge = Edge(edge.distance + triple.distance, triple,
                        edge.source, True)
        edges_queue.put(new_edge)
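To situate this extract, the following is a hedged sketch of the driver loop that surrounds it. The names (Edge, zones, edges_queue) mirror the extract, but the exact structure, the seeding of the queue and the distance limit are reconstructions for illustration, not the verbatim implementation.

import queue

class Edge(object):
    # Accumulated-distance edge; ordered by distance for the priority queue.
    def __init__(self, distance, triple, source, forward):
        self.distance, self.triple = distance, triple
        self.source, self.forward = source, forward
    def __lt__(self, other):
        return self.distance < other.distance

edges_queue = queue.PriorityQueue()

def search(vertices_set, initial_triples, process_edge, distance_limit=20.0):
    # Seed the queue with the outgoing triples of every source vertex,
    # then always expand the cheapest accumulated edge first (Dijkstra-like).
    for vertex in vertices_set:
        for triple in initial_triples(vertex):
            edges_queue.put(Edge(triple.distance, triple, vertex, True))
    while not edges_queue.empty():
        edge = edges_queue.get()
        if edge.distance > distance_limit:
            break                 # everything still queued can only be farther
        process_edge(edge)        # the extract above: record contacts, spawn new edges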

4.2 Categories

The Wikipedia Categories dataset is made of two parts:

• a forest of RDF nodes with all the categories and their subsumptions
• a set of triples linking non-category Things to the categories they belong to with a SubjectOf relation

Our search algorithm currently only searches directed paths from a source to a destination. Because categories only have incoming relations or relations leading to other categories, it cannot detect that two source nodes have paths leading to the same category or parent categories. The Wikipedia Categories dataset is important because it can hierarchically organize islands of RDF data. We have found that most category relations attached to articles are accompanied by similar wikilink relations; expressed with the SKOS vocabulary, categories provide a much better replacement for them. To correct this, we modified our search algorithm to store the encountered Categories and the number of sources with a path to each Category. After the main search, we enumerate the Categories with two or more sources and report the paths leading to them.
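The bookkeeping this requires is small. Here is a minimal sketch of it (names are illustrative): during the search, each time a category is reached, the source it was reached from is recorded; afterwards, categories reached from two or more sources are reported.

from collections import defaultdict

sources_per_category = defaultdict(set)

def note_category(category, source):
    # Called whenever the expansion from `source` reaches `category`.
    sources_per_category[category].add(source)

def shared_categories(min_sources=2):
    # Categories that connect several sources through the category hierarchy.
    return {cat: srcs for cat, srcs in sources_per_category.items()
            if len(srcs) >= min_sources}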

4.3 Experiments with simple weights

This is not entirely related to other work about distance metrics in ontologies, since it does not evaluate the equality of two entities and is strongly tied to specifics of Wikipedia. Each triple comprises a source, a relation and a destination. We can then calculate a distance based on any of these elements or combinations thereof. In our implementation, we start with a base distance of 1 and add various values based on cases we match. An infinite distance means the triple will be dropped from any calculations. Here are a few examples (a Python sketch of such a distance function follows the list):

• +∞, the source/destination is a disambiguation node
• +10, the relation is a Wikilink

[Figure 4: Paths between a few Things: "Bordeaux wine", "Bordeaux" and "Appellation d'Origine Contrôlée". We see here that "Bordeaux wine" would not be linked to the other Things without the "Bordeaux" Category.]

• +5, the relation is a subsumption between two categories
• +40, the source/destination is a year article: Wikipedia has year articles (e.g. 200915) which cover a very large number of unrelated topics, mostly with Wikilinks
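As announced above, here is a sketch of such a distance function. The predicate URIs, the year-article pattern and the set of disambiguation nodes are assumptions for illustration; only the penalty values come from the list above.

import re

WIKILINK = "http://dbpedia.org/property/wikilink"             # assumed predicate URI
BROADER = "http://www.w3.org/2004/02/skos/core#broader"       # category subsumption
YEAR_ARTICLE = re.compile(r"^http://dbpedia\.org/resource/\d{1,4}$")  # e.g. .../2009

def triple_distance(source, relation, destination, disambiguation_nodes=frozenset()):
    distance = 1.0                                             # base distance
    if source in disambiguation_nodes or destination in disambiguation_nodes:
        return float("inf")                                    # drop the triple entirely
    if relation == WIKILINK:
        distance += 10
    if relation == BROADER:
        distance += 5
    if YEAR_ARTICLE.match(source) or YEAR_ARTICLE.match(destination):
        distance += 40                                         # year articles: huge unrelated fan-out
    return distance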

So far, these distance values are determined in an arbitrary fashion, based on experimental observations, but they already improve results for a few test cases. For the examples we examined, the chosen weights produced good results, but we discovered that changing topics often led to degraded results. This may be explained by a few reasons:

• semantic relations in DBpedia are highly dependent on user-added information in infoboxes, in categorization or in wikilinks, and thus different mixes need different weights

• the weight matchers we have chosen are too general and totally disjoint: much better granularity is necessary to ensure a smooth gradation of accumulated weights and thus a more useful ranking of paths

5 Conclusions

5.1 Acknowledgments

This research was made possible thanks to the TER program at Ensimag and the supervision of Jérôme Euzenat of the INRIA EXMO16 team.

5.2 Conclusions

For this paper, we analyzed the structure of the DBpedia extracted ontology and dataset. When classifying the various kinds of relations and their uses in relation discovery, we found a very varied range of relations, some of which warranted special treatment to be made useful or to be excluded. Although still far from the quality of hand-picked semantic links, the links we were able to create at the end of this research were good enough to create rudimentary ontologies from word sets, by building semantic links between entities from the set using our algorithms. To execute our various relation searches, we wrote many generic tools in Python that can be reused for other DBpedia research. We have confirmed that the multiple and vast datasets of DBpedia can open a large range of study topics and be useful to test multiple semantic exploration algorithms.

15 http://en.wikipedia.org/wiki/2009
16 http://exmo.inrialpes.fr/

5.3 Future work

Many areas are left to explore, using more datasets, deeper analysis and better heuristics.

More datasets With new versions, the DBpedia dataset gets more data and becomes more semantic:

• English is still the largest dataset but other languages are catching up. It would be interesting to exploit multilingualism. No dictionary has the multi-domain breadth of DBpedia, so using it to align ontologies in different languages could prove useful

• The OWL Ontology dataset has only been used as RDF in this paper. The OWL-specific properties of the DBpedia Ontology, such as collections, unions and disjunctions, could be used to improve the quality of our paths

Better heuristics We have only developed simple heuristics for adding weights to the DBpedia semantic graph.

• More global matchers, such as particular sequences of patterns or PageRank-style metrics, have proven effective in other areas

• Since we wrote our search tools in Python, we can leverage general Bayesian network libraries such as Open Bayes to massively create new weights based on subjective human input

References

[BS09] Christian Bizer and Andreas Schultz. The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems, 5(1), 2009.

[HSB07] Martin Hepp, Katharina Siorpaes, and Daniel Bachlechner. Harvesting wiki consensus: Using Wikipedia entries as vocabulary for knowledge management. IEEE Internet Computing, 11(5):54–65, 2007.

[LSA07] Jens Lehmann, Jörg Schüppel, and Sören Auer. Discovering unknown connections - the DBpedia relationship finder. In Sören Auer, Christian Bizer, Claudia Müller, and Anna V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 99–110. GI, 2007.

[NC08] Hien T. Nguyen and Tru H. Cao. Named entity disambiguation on an ontology enriched by Wikipedia. In RIVF, pages 247–254. IEEE, 2008.

[SdM06] Marta Sabou, Mathieu d’Aquin, and Enrico Motta. Using the semantic web as background knowl- edge for ontology mapping. In Pavel Shvaiko, Jérôme Euzenat, Natalya Fridman Noy, Heiner Stuckenschmidt, V. Richard Benjamins, and Michael Uschold, editors, Ontology Matching, vol- ume 225 of CEUR Workshop Proceedings. CEUR-WS.org, 2006.

[SdM08] Marta Sabou, Mathieu d’Aquin, and Enrico Motta. Scarlet: Semantic relation discovery by har- vesting online ontologies. In Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and Manolis Koubarakis, editors, ESWC, volume 5021 of Lecture Notes in Computer Science, pages 854–858. Springer, 2008.

[SKW08] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A large ontology from Wikipedia and WordNet. J. Web Sem., 6(3):203–217, 2008.
