
Relation Discovery on the DBpedia

Obey Liu ([email protected]) May 2009

Abstract

DBpedia is a community effort to extract structured information from Wikipedia. It has the interesting properties of having both a deep ontology and a massive dataset, in addition to being very diversified, dynamic and multilingual. We describe here how DBpedia can be specifically used to discover relations between entities within DBpedia itself and beyond, in other linked ontologies. In the course of this research, many insights on the nature of the DBpedia dataset have been noted and will also be presented.

Keywords: DBpedia, relation discovery, ontology mapping, background knowledge, semantic web

TER 2009, supervised by Jérôme Euzenat

1 Introduction

1.1 Background

Exploring the Semantic Web has become more and more useful thanks to the growth in number and breadth of semantic data sources. Current approaches often involve exploring multiple distributed ontologies [SdM08], but this is highly dependent on the quality of cross-ontology mappings. Instead, other approaches concentrate on using a web of manually linked datasets such as the Linked Data project. The Linked Data project1 connects through the Web related datasets that weren't previously linked, or makes it easier to link datasets that were previously linked through various incompatible methods. Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF." Among the many datasets linked by this project, the DBpedia dataset has become a de facto core, or nucleus, becoming the point of rendez-vous for many other datasets because of its large size, wide thematic coverage and dynamism.

Today, most ontologies and datasets cover only specific domains, are created by relatively small groups, and are very cost-intensive to keep up-to-date. At the same time, Wikipedia has grown into one of the most central and versatile knowledge sources available, maintained by thousands of contributors. The DBpedia project2 leverages this gigantic source of knowledge by extracting structured information from Wikipedia.

Relation discovery is the problem of finding a path between two nodes on a semantic graph. This problem has applications, for example, in semantic enrichment of wordsets or finding paths of logical inference. Because of the way Wikipedia is created, through loosely coordinated incremental editing by volunteers, and its seldom-attained size, more than 2.6 million things, DBpedia creates unique challenges and opportunities for these applications. Our particular goal here is to automatically create paths between concept nodes on the DBpedia semantic graph that are similar to what a human tasked with doing it would find manually. We aspire to reproduce the common sense that a manual operator would apply.

For this paper, we analyzed the structure of the DBpedia extracted ontology and dataset, classifying various kinds of relations and their uses in relation discovery. With this knowledge, we experimented by trial and error with search algorithms to discover paths between nodes in an efficient way. To easily execute these searches, we wrote various tools in Python against a Virtuoso RDF datastore containing the latest full dump of DBpedia in all its languages.

1.2 Contributions

Overall, this paper makes the following contributions:

• development of high-level Python tool building blocks to work with DBpedia
• analysis of the various kinds of relations in DBpedia and their usefulness in the Semantic Web
• experimentation with various DBpedia-specific search algorithms and graph pruning heuristics

1.3 Purposes

Ontology mapping Ontology mapping techniques are essential to semantically bridge isolated ontologies and datasets. One interesting way to map two ontologies is to compare their topologies and recognize duplicated relations between classes [SdM06]. Using DBpedia as background knowledge to provide an intermediary mapping requires an efficient way to discover DBpedia relations between all the classes of an ontology.

Relation discovery in DBpedia Discovering relations between concepts in DBpedia without specific attention to its specificities has already been done [LSA07] and is rather straightforward, but the discovered relations only rely on a small part of the DBpedia dataset and are not easily usable in other algorithms. Creating paths between concepts in DBpedia could benefit from discriminating between various kinds of relations and from expanding the breadth of exploited datasets.

1 http://linkeddata.org/
2 http://dbpedia.org/

Relation discovery in other knowledge bases DBpedia is a rather particular knowledge base, with a great variety of relations of various semantic strengths. Insights from relation discovery in DBpedia could be reused for relation discovery in other knowledge bases or across heterogeneous knowledge bases.

2 Storing and exploring the DBpedia ontologies and dataset

Because DBpedia is a large dataset with multiple ontologies organizing its concepts, particular care had to be taken to store this vast amount of data and make its properties accessible in an efficient way.

2.1 The DBpedia dataset

The DBpedia dataset is presented as a large multi-domain ontology. It currently describes 2.6 million "things" with 274 million "facts" (as of November 2008). It uses the Resource Description Framework (RDF)3 as a flexible data model for representing the information and the SPARQL4 query language to query it. Wikipedia source articles consist mostly of free text, but also contain some structured information, such as templates, category trees, geo-coordinates and links to external web pages.

Identifying and describing "Things" Each thing in the DBpedia data set is identified by a URI reference of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Every DBpedia resource is then at least described by a label, a short and long abstract and a link to the corresponding Wikipedia page.
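This Name-for-Name convention makes it easy to move between the two URI spaces programmatically. The following is a minimal sketch of that mapping (the function name is illustrative; the percent-encoded Name is assumed to be reused as-is):

def wikipedia_url_to_dbpedia_uri(wikipedia_url):
    # "http://en.wikipedia.org/wiki/Berlin" -> "Berlin"
    name = wikipedia_url.rsplit("/wiki/", 1)[1]
    # Name is reused verbatim in the DBpedia resource namespace
    return "http://dbpedia.org/resource/" + name

print(wikipedia_url_to_dbpedia_uri("http://en.wikipedia.org/wiki/Berlin"))
# http://dbpedia.org/resource/Berlin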

Classifications DBpedia provides several classification schemata for these “Things”:

• Wikipedia Categories are represented using the SKOS vocabulary5
• The YAGO Classification is derived from the Wikipedia Categories using Word Net [SKW08].
• The Word Net Synset Links are derived from the Wikipedia infobox templates using Word Net synsets. This is more semantically precise than datasets relying on the Wikipedia Categories.

Infobox Data Wikipedia Infoboxes offer a very specific, faceted approach to a broad range of things and are thus a very valuable source of structured information that can be queried very expressively. Two main datasets are extracted from infoboxes:

• The Infobox Dataset is created by extracting the properties of all infoboxes within all articles. The resulting 22.8 million pieces of information are represented through 8000 different property types, with combinations varying with infobox types. There is no formal ontology in this data set.

• The Infobox Ontology is based on the Infobox Dataset, but with hand-generated mappings of Wikipedia infoboxes to a newly created DBpedia ontology. The ontology consists of 170 classes which form a subsumption hierarchy and have altogether 900 properties. The mappings address weaknesses in the Wikipedia infobox system, like duplicate infoboxes for the same classes or using different names for the same property. Therefore, the instance data within the Infobox Ontology is much cleaner and better structured, but doesn't currently cover the whole range of infoboxes and properties. The full ontology only contains about 882,000 instances.

External Links Because Wikipedia and the Semantic Web are deeply rooted in a whole ecosystem of linked information, external links to other sources of information are naturally provided:

• HTML links to external web pages, either as reference pages or as the "official" homepage of the Thing

3 http://www.w3.org/TR/rdf-primer/
4 http://www.w3.org/TR/rdf-sparql-query/
5 http://www.w3.org/2004/02/skos/

• RDF links to external data sources, using the owl:sameAs property; for example, countries can be linked with the Geonames6, Eurostat7 or CIA Factbook8 ontologies and authors can be linked to the Project Gutenberg9 ontology.

2.2 RDF Datastore and SPARQL Queries

RDF Datastore For this research, we used the full dump of DBpedia 3.2 in 35 languages, totaling 38 Gb of uncompressed triples in NT format, separated in various files depending on the dataset and language. When loaded into native datastore format or into memory cache, this takes about 60 Gb. The choice of an RDF datastore was quite straightforward: Sesame10 and Jena11, popular Java-based datastores, could not scale to about 300 million triples, leaving OpenLink Virtuoso12 as the only real contender [BS09], which is also the datastore chosen by the DBpedia creators.

Another interesting storage possibility was BRAHMS, a high-performance C-based in-memory datastore, but it would have forced us to write even prototype code in C and, although performance would be very high, not enough memory would be available on most systems. Of course, 60 Gb would not fit in most systems' memory, but a subset of the dataset, such as only the English dataset slices, could, perhaps even used in combination with a regular on-disk datastore.

SPARQL Queries Here is an example of a rather advanced SPARQL query and its graphical output on Figure 1:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?name ?birth ?description ?person WHERE {
  ?person dbo:birthplace <http://dbpedia.org/resource/Berlin> .
  ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
  ?person dbo:birthdate ?birth .
  ?person foaf:name ?name .
  ?person rdfs:comment ?description .
}

This queries for the name, birthdate, description and URI of German musicians who were born in Berlin. It demonstrates the combined use of multiple datasets (a Python sketch of running this query against a local endpoint follows the list):

• birth date (dbo:birthdate) and birth place (dbo:birthplace) are Infobox Ontology properties, the birth place linking to a city entity

• Category:German_musicians is a Wikipedia Category, linked through the SKOS vocabulary (skos:subject)

• name (foaf:name) and description (rdfs:comment) are regular Friend Of A Friend13 and RDF Schema14 RDF properties
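As mentioned above, here is a minimal sketch of how such a query can be run from Python. It assumes the SPARQLWrapper library and a local Virtuoso instance exposing its default SPARQL endpoint at http://localhost:8890/sparql; all namespace prefixes are declared explicitly so the query is self-contained, and the property names follow the usage in this paper.

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?name ?birth ?description ?person WHERE {
  ?person dbo:birthplace <http://dbpedia.org/resource/Berlin> .
  ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
  ?person dbo:birthdate ?birth .
  ?person foaf:name ?name .
  ?person rdfs:comment ?description .
}
"""

endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # assumed local Virtuoso endpoint
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)

# Each binding is a dictionary of variable name -> {"type": ..., "value": ...}
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["birth"]["value"])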

3 Basic relation discovery in DBpedia

DBpedia is, after all, a very large directed graph. One of the most elementary tasks, or so it seems, is to find a path between two nodes, Things, on this graph. The challenge here is to find one or more useful (to be defined) paths in a quick and efficient way. A good path can bring valuable information about the relation between the two nodes.

6 http://geonames.org/
7 http://www4.wiwiss.fu-berlin.de/eurostat/
8 http://www4.wiwiss.fu-berlin.de/factbook/
9 http://www4.wiwiss.fu-berlin.de/gutendata/
10 http://www.openrdf.org/
11 http://jena.sourceforge.net/
12 http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
13 http://www.foaf-project.org/
14 http://www.w3.org/TR/rdf-schema/

[Figure 1: A query for German musicians born in Berlin. The categories are shown in bold, the two matching musicians (Klaus Voormann and Alexander Marcus) in gray, and the rest are their properties.]

3.1 Relations exploration

The basic algorithmic problem here is that of uninformed, multiple-source, multiple-destination shortest paths in a positively weighted graph. For this, we examined two algorithms:

• a Bidirectional Breadth-First Search with a FIFO queue, executed sequentially on all source-destination pairs

• a variant of Bidirectional Breadth-First Search with a priority queue, executed simultaneously from all sources and destinations

Bidirectional Breadth-First Search We first gradually implemented the Bidirectional Breadth-First Search. Bidirectional Breadth-First Search is a graph search algorithm that finds a shortest path from a source node to a destination node in a directed graph. It runs two simultaneous searches: one forward from the source, and one backward from the destination, stopping when the two meet in the middle. The reason for this approach is that in many cases it is faster: for instance, if we assume that both searches expand a tree with branching factor b, and the distance from source to destination is d, each of the two searches has complexity O(b^(d/2)), and the sum of these two search times is much less than the O(b^d) complexity that would result from a single search from the source to the destination. A Python sketch of this procedure is given after the step list below.

1. Enqueue the root node in the forward queue: the source Thing in the path we want to discover

2. Enqueue the target node in the backward queue: the destination Thing in the path we want to discover

3. Dequeue a forward node and examine it

• If the forward node is found in the backward zone, we have met in the middle: quit the search and return the complete path
• Otherwise, enqueue any successors (the direct forward neighbors) that have not yet been examined in the forward queue and add them to the forward zone

4. Dequeue a backward node and examine it

• If the backward node is found in the forward zone, we have met in the middle: quit the search and return the complete path
• Otherwise, enqueue any predecessors (the direct backward neighbors) that have not yet been examined in the backward queue and add them to the backward zone

5. If either of the two queues is empty, every node in the connected component has been examined: quit the search and return "not found"

6. If we are still under the depth limit, repeat from Step 3
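Here is the minimal Python sketch announced above. It expands one full depth level in one direction before switching to the other (as in the multithreaded variant described below); the successors and predecessors callbacks, for instance backed by SPARQL queries against the datastore, as well as the function names and the default depth limit, are illustrative assumptions rather than our exact implementation.

from collections import deque

def bidirectional_bfs(source, target, successors, predecessors, max_depth=6):
    # Parent maps double as the forward and backward "zones".
    if source == target:
        return [source]
    fwd_parent = {source: None}
    bwd_parent = {target: None}
    fwd_queue, bwd_queue = deque([source]), deque([target])

    def build_path(meeting):
        # Walk back to the source, then forward to the target.
        path, node = [], meeting
        while node is not None:
            path.append(node)
            node = fwd_parent[node]
        path.reverse()
        node = bwd_parent[meeting]
        while node is not None:
            path.append(node)
            node = bwd_parent[node]
        return path

    for _ in range(max_depth):
        if not fwd_queue or not bwd_queue:
            return None  # a connected component exhausted: no path exists
        for _ in range(len(fwd_queue)):            # expand one forward level
            node = fwd_queue.popleft()
            for nxt in successors(node):
                if nxt in fwd_parent:
                    continue
                fwd_parent[nxt] = node
                if nxt in bwd_parent:              # zones met in the middle
                    return build_path(nxt)
                fwd_queue.append(nxt)
        for _ in range(len(bwd_queue)):            # expand one backward level
            node = bwd_queue.popleft()
            for prev in predecessors(node):
                if prev in bwd_parent:
                    continue
                bwd_parent[prev] = node
                if prev in fwd_parent:
                    return build_path(prev)
                bwd_queue.append(prev)
    return None  # depth limit reached

In our tools, successors and predecessors would issue SPARQL queries for the outgoing and incoming triples of a node; any pair of callables returning lists of neighbours works here.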

We later implemented a multithreaded version, requesting neighbors for multiple nodes simultaneously, improving performance on multicore systems. To sidestep synchronization issues, we examine all the nodes of a given depth in one direction before proceeding with the other direction and then the next depth. When looking for all relations between nodes in a set of nodes, we execute the algorithm on all pairs of nodes, relying on memory caching in the RDF datastore to speed up redundant searches.

3.2 First examination of paths

This search returns the shortest path considering the graph as unweighted. This leads to a broad range of issues. First among these is the heavy presence of certain kinds of relations of questionable usefulness. Three examples are the disambiguation nodes, the redirections and the wikilink relations.

[Figure 2: The "Mercury" disambiguation node with its disambiguated nodes. This demonstrates how a disambiguation node inappropriately links a Roman god, an automobile company and a pop singer.]

Disambiguation nodes Disambiguation nodes are a way to resolve conflicts in article titles when a single term can be associated with more than one topic, making that term likely to be the natural title for more than one article. For example, the word "Mercury" can refer to several different things, including an element, a planet, an automobile brand, a record label, a NASA manned-spaceflight project, a plant, and a Roman god. Since only one Thing can have the generic name "Mercury", unambiguous titles are used for each of these topics (see Figure 2). There must then be a way to direct to the correct specific article when an ambiguous term is referenced by a, most probably parsed, RDF relation; this is what is known as disambiguation. In this case it is achieved using Mercury as a node linking to all the other articles.

The presence of disambiguation nodes is clearly detrimental to the quality of the results. Things linked by a disambiguation node are only related through rough homonymy. This kind of relation is a symptom of the partly parsed nature of DBpedia: links from an article are sometimes created with a disambiguation page instead of the disambiguated article, either because of oversight or because of subsequent replacement of an article with a disambiguation page. This kind of issue never appears in internal relations in a hand-crafted ontology.

To resolve this issue, we decided to outright exclude all disambiguation nodes from path exploration. When considering a relation leading to a disambiguation page, only one subsequent link (the actually disambiguated page) is useful and all others are noise. There are methods [NC08] to automatically disambiguate links on Wikipedia but we chose not to implement them for this research.
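The exclusion itself can be implemented as a simple filter on neighbour expansion. Below is a hedged sketch of that filter; the predicate URI and the idea of pre-loading the set of disambiguation nodes from the disambiguation-links dataset are assumptions for illustration, not the exact property names used by DBpedia 3.2.

# Assumed predicate URI for "X disambiguates Y" triples (illustrative).
DISAMBIGUATES = "http://dbpedia.org/ontology/wikiPageDisambiguates"

def load_disambiguation_nodes(triples):
    # Every node that is the source of a "disambiguates" triple is a disambiguation page.
    return {s for (s, p, o) in triples if p == DISAMBIGUATES}

def filtered_successors(node, successors, disambiguation_nodes):
    # Never expand a disambiguation node, and never step onto one.
    if node in disambiguation_nodes:
        return []
    return [n for n in successors(node) if n not in disambiguation_nodes]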

Redirections Redirections are RDF nodes that are only used to point to another node. They are used for multiple purposes:

• misspellings, alternative spellings, tenses, punctuation, capitalizations: for example, "GOOGLE" redirects to "Google" and "E. coli" redirects to "Escherichia coli"

• sub-topics and small topics in broader contexts: for example, "Distributed denial of service" redirects to "Denial of service", or worse, "Blackberry 8820" redirects to "List of BlackBerry products"; this creates some semantic noise

According to [HSB07], 78% of redirections are perfect synonyms, 12% of redirections reflect pages which have been incorporated into a larger page and the last 10% are either not easily identifiable or noise. Considering these statistics, we chose to keep redirections as useful ways to provide alignment between labels representing the same thing.
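In practice, keeping redirections amounts to collapsing a redirecting node onto its target so that alternative labels align to one canonical Thing. The following sketch assumes a redirect map built from the redirects dataset; the predicate URI is an assumption, as the property name varies across DBpedia releases.

REDIRECTS = "http://dbpedia.org/ontology/wikiPageRedirects"  # assumed predicate URI

def build_redirect_map(triples):
    return {s: o for (s, p, o) in triples if p == REDIRECTS}

def canonical(node, redirect_map, max_hops=5):
    # Follow redirect hops, with a small bound to guard against cycles.
    for _ in range(max_hops):
        if node not in redirect_map:
            break
        node = redirect_map[node]
    return node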

Wikilinks Wikilinks are the simplest links from one Wikipedia article to another. They are parsed from Wikipedia article bodies for DBpedia as simple "source page" and "destination page" pairs. Compared to the other kinds of RDF triples in DBpedia, they are the most general, in the sense that they cover the most kinds of relations, yet they are the least precise, because they don't have a relation property, only using a generic "wikilink" relation type.

[Figure 3: Paths between the Seine and the Nile. The results are overwhelmed by wikilinks, yet there is no direct link between the two major rivers.]

Because they are the bread and butter of Wikipedia articles, they outnumber all other datasets in DBpedia: there are 70 million Wikilink triples, compared to 30 million Infobox dataset triples or only 7 million Wikipedia Categories dataset triples. Because of their number, they tend to make the DBpedia graph overly dense and, in almost all cases, paths including wikilinks are strictly shorter than any path excluding wikilinks. Shortest paths using wikilinks also tend, by recursive application of the previous argument, to be made only of wikilinks. In a subjective examination, wikilinks almost always provide no useful semantic value.

4 Improved algorithms and heuristics

To help with our work, we would like to define a rough distance metric for each triple. This distance metric roughly correlates with the semantic strength of the measured relation. First, we need to modify our exploration algorithm to handle a weighted graph.

4.1 Multi-directional Breadth-First Search with Priority Queue

For the second iteration of the relation explorer, we decided to change a few objectives:

• instead of only looking for the shortest path in one direction between two nodes, we simultaneously examine both directions, creating a forward and a backward expanding zone

• when given a set of nodes, instead of examining each pair, we execute all Breadth-First Searches simultaneously, looking for contacts between zones expanding from all nodes

• instead of a FIFO queue, we use a priority queue based on a distance metric to reproduce an algorithm similar to a Dijkstra search

• because of serious synchronization problems, we reverted back to a monothreaded system

Here is an extract of the core search code:

for vertex in vertices_set:
    if vertex == edge.source:
        break
    if edge.triple.destination in zones[vertex].backward:
        # met another zone, note contact
        add_contact(Contact(edge.source, edge.triple,
                            zones[vertex].backward[edge.triple.destination],
                            vertex))

if edge.triple.source != edge.triple.destination \
        and edge.triple.destination in vertices_set:
    # met a vertex, stop
    return

neighbours = forward_spawn(edge.triple)
for triple in neighbours:
    if triple.destination not in zones[edge.source].forward:
        zones[edge.source].forward[triple.destination] = triple
        new_edge = Edge(edge.distance + triple.distance, triple,
                        edge.source, True)
        edges_queue.put(new_edge)
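To situate this extract, the following is a hedged sketch of the driver loop that surrounds it. The names (Edge, zones, edges_queue) mirror the extract, but the exact structure, the seeding of the queue and the distance limit are reconstructions for illustration, not the verbatim implementation.

import queue

class Edge(object):
    # Accumulated-distance edge; ordered by distance for the priority queue.
    def __init__(self, distance, triple, source, forward):
        self.distance, self.triple = distance, triple
        self.source, self.forward = source, forward
    def __lt__(self, other):
        return self.distance < other.distance

edges_queue = queue.PriorityQueue()

def search(vertices_set, initial_triples, process_edge, distance_limit=20.0):
    # Seed the queue with the outgoing triples of every source vertex,
    # then always expand the cheapest accumulated edge first (Dijkstra-like).
    for vertex in vertices_set:
        for triple in initial_triples(vertex):
            edges_queue.put(Edge(triple.distance, triple, vertex, True))
    while not edges_queue.empty():
        edge = edges_queue.get()
        if edge.distance > distance_limit:
            break                 # everything still queued can only be farther
        process_edge(edge)        # the extract above: record contacts, spawn new edges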

4.2 Categories

The Wikipedia Categories dataset is made of two parts:

• a forest of RDF nodes with all the categories and their subsumptions
• a set of triples linking non-category Things to the categories they belong to with a SubjectOf relation

Our search algorithm currently only searches directed paths from a source to a destination. Because categories only have incoming relations or relations leading to other categories, it cannot detect that two source nodes have paths leading to the same category or parent categories. The Wikipedia Categories dataset is important because it can hierarchically organize islands of RDF data. We have found that most category relations attached to articles are accompanied by similar wikilink relations; expressed with the SKOS vocabulary, categories provide a much better replacement for them. To correct this, we modified our search algorithm to store the encountered Categories and the number of sources with a path to each Category. After the main search, we enumerate the Categories with two or more sources and report the paths leading to them.
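The bookkeeping this requires is small. Here is a minimal sketch of it (names are illustrative): during the search, each time a category is reached, the source it was reached from is recorded; afterwards, categories reached from two or more sources are reported.

from collections import defaultdict

sources_per_category = defaultdict(set)

def note_category(category, source):
    # Called whenever the expansion from `source` reaches `category`.
    sources_per_category[category].add(source)

def shared_categories(min_sources=2):
    # Categories that connect several sources through the category hierarchy.
    return {cat: srcs for cat, srcs in sources_per_category.items()
            if len(srcs) >= min_sources}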

4.3 Experiments with simple weights

This is not entirely related to other work about distance metrics in ontologies, since it does not evaluate the equality of two entities and is strongly tied to specifics of Wikipedia. Each triple comprises a source, a relation and a destination. We can then calculate a distance based on any of these elements or combinations thereof. In our implementation, we start with a base distance of 1 and add various values based on cases we match. An infinite distance means the triple will be dropped from any calculations. Here are a few examples (a Python sketch of such a distance function follows the list):

• +∞, the source/destination is a disambiguation node
• +10, the relation is a Wikilink

[Figure 4: Paths between a few Things: "Bordeaux wine", "Bordeaux" and "Appellation d'Origine Contrôlée". We see here that "Bordeaux wine" would not be linked to the other Things without the "Bordeaux" Category.]

• +5, the relation is a subsumption between two categories
• +40, the source/destination is a year article: Wikipedia has year articles (e.g. 200915) which cover a very large number of unrelated topics, mostly with Wikilinks
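As announced above, here is a sketch of such a distance function. The predicate URIs, the year-article pattern and the set of disambiguation nodes are assumptions for illustration; only the penalty values come from the list above.

import re

WIKILINK = "http://dbpedia.org/property/wikilink"             # assumed predicate URI
BROADER = "http://www.w3.org/2004/02/skos/core#broader"       # category subsumption
YEAR_ARTICLE = re.compile(r"^http://dbpedia\.org/resource/\d{1,4}$")  # e.g. .../2009

def triple_distance(source, relation, destination, disambiguation_nodes=frozenset()):
    distance = 1.0                                             # base distance
    if source in disambiguation_nodes or destination in disambiguation_nodes:
        return float("inf")                                    # drop the triple entirely
    if relation == WIKILINK:
        distance += 10
    if relation == BROADER:
        distance += 5
    if YEAR_ARTICLE.match(source) or YEAR_ARTICLE.match(destination):
        distance += 40                                         # year articles: huge unrelated fan-out
    return distance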

So far, these distance values are determined in an arbitrary fashion, based on experimental observations, but they already improve results for a few test cases. For the examples we examined, the chosen weights produced good results, but we discovered that changing topics often led to degraded results. This may be explained by a few reasons:

• semantic relations in DBpedia are highly dependent on user-added information in infoboxes, in categorization or in wikilinks, and thus different mixes need different weights

• the weight matchers we have chosen are too general and totally disjoint: much better granularity is necessary to ensure a smooth gradation of accumulated weights and thus a more useful ranking of paths

5 Conclusions

5.1 Acknowledgments

This research was made possible thanks to the TER program at Ensimag and the supervision of Jérôme Euzenat of the INRIA EXMO16 team.

5.2 Conclusions

For this paper, we analyzed the structure of the DBpedia extracted ontology and dataset. When classifying the various kinds of relations and their uses in relation discovery, we found a very varied range of relations, some of which warranted special treatment to be made useful or to be excluded. Although still far from the quality of hand-picked semantic links, the links we were able to create at the end of this research were good enough to create rudimentary ontologies from word sets, by building semantic links between entities from the set using our algorithms. To execute our various relation searches, we wrote many generic tools in Python that can be reused for other DBpedia research. We have confirmed that the multiple and vast datasets of DBpedia can open a large range of study topics and be useful to test multiple semantic exploration algorithms.

15 http://en.wikipedia.org/wiki/2009
16 http://exmo.inrialpes.fr/

5.3 Future work

Many areas are left to explore, using more datasets, deeper analysis and better heuristics.

More datasets With new versions, the DBpedia dataset gets more data and becomes more semantic:

• English is still the largest dataset but other languages are catching up. It would be interesting to exploit multilingualism. No dictionary has the multi-domain breadth of DBpedia, so using it to align ontologies in different languages could prove useful

• The OWL Ontology dataset has only been used as RDF in this paper. The OWL-specific properties of the DBpedia Ontology, such as collections, unions and disjunctions, could be used to improve the quality of our paths

Better heuristics We have only developed simple heuristics for adding weights to the DBpedia semantic graph.

• More global matchers, such as particular sequences of patterns or PageRank-style metrics, have proven effective in other areas

• Since we wrote our search tools in Python, we can leverage general Bayesian network libraries such as Open Bayes to massively create new weights based on subjective human input

References

[BS09] Christian Bizer and Andreas Schultz. The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems, 5(1), 2009.

[HSB07] Martin Hepp, Katharina Siorpaes, and Daniel Bachlechner. Harvesting wiki consensus: Using Wikipedia entries as vocabulary for knowledge management. IEEE Internet Computing, 11(5):54–65, 2007.

[LSA07] Jens Lehmann, Jörg Schüppel, and Sören Auer. Discovering unknown connections - the DBpedia relationship finder. In Sören Auer, Christian Bizer, Claudia Müller, and Anna V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 99–110. GI, 2007.

[NC08] Hien T. Nguyen and Tru H. Cao. Named entity disambiguation on an ontology enriched by Wikipedia. In RIVF, pages 247–254. IEEE, 2008.

[SdM06] Marta Sabou, Mathieu d’Aquin, and Enrico Motta. Using the semantic web as background knowl- edge for ontology mapping. In Pavel Shvaiko, Jérôme Euzenat, Natalya Fridman Noy, Heiner Stuckenschmidt, V. Richard Benjamins, and Michael Uschold, editors, Ontology Matching, vol- ume 225 of CEUR Workshop Proceedings. CEUR-WS.org, 2006.

[SdM08] Marta Sabou, Mathieu d’Aquin, and Enrico Motta. Scarlet: Semantic relation discovery by har- vesting online ontologies. In Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and Manolis Koubarakis, editors, ESWC, volume 5021 of Lecture Notes in Computer Science, pages 854–858. Springer, 2008.

[SKW08] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A large ontology from Wikipedia and WordNet. J. Web Sem., 6(3):203–217, 2008.
