
Using DBpedia Categories to Evaluate and Explain Similarity in Linked Open Data

Houcine Senoussi Quartz Laboratory, EISTI, Cergy, France

Keywords: DBpedia, DBpedia Categories, Linked Open Data.

Abstract: Similarity is defined as the degree of resemblance between two objects. In this paper we present a new method to evaluate similarity between resources in Linked Open Data. The input of our method is a pair of resources belonging to the same type (e.g. Person or Painter), described by their DBpedia categories. We first compute the 'distance' between each pair of categories. For that we need to explore the graph whose vertices are the categories and whose edges connect categories and sub-categories. Then we deduce a measure of the similarity/dissimilarity between the two resources. The output of our method is not limited to this measure but includes other quantitative and qualitative information explaining the similarity/dissimilarity of the two resources. In order to validate our method, we implemented it and applied it to a set of DBpedia resources that refer to painters belonging to different countries, centuries and artistic movements.

1 INTRODUCTION

DBpedia (Lehmann et al., 2015) is one of the most important semantic datasets freely accessible on the web. It contains structured knowledge extracted from Wikipedia. To define RDF triples, DBpedia uses its own vocabulary[1], the RDF[2], RDFS[3] and OWL vocabularies, and other ontologies such as the Dublin Core Metadata Initiative[4] (dcmi), SKOS[5] and FOAF[6].

DBpedia uses many thousands of predicates to describe resources, but all these predicates don't have the same importance. For example, in the French version of DBpedia[7] we have 208796[8] (resp. 3352) articles belonging to the type Person (resp. Painter). These articles use 2887 (resp. 345) predicates, but only 13 (resp. 14) predicates are used in 99% of the articles and only 18 (resp. 21) predicates are used in 80% of the articles. Only these common predicates can be used to compare resources. dcterms:subject is one of these few predicates. Its values are DBpedia categories and it is intended to define the topics of the resources.

Categories contain all important information about a resource. For example, when the article is about a novel, they give us all information about it (author, date, language, genre, ...). Therefore, categories contain all the elements we need to compare two resources and to measure their similarity/dissimilarity.

Similarity is defined as the degree of resemblance between two objects (Meymandpour and Davis, 2016). According to Tversky (Tversky, 1977) it serves to "classify objects, form concepts and make generalizations". Many methods to evaluate similarity have been presented by researchers. An overview of these methods can be found in (Meng et al., 2013) and (Meymandpour and Davis, 2016). In this article, we present a new method for evaluating and explaining similarity between objects. These objects are represented as sets of DBpedia categories.

The rest of this paper is organised as follows. In section 2 we describe DBpedia categories and their organization. Section 3 summarizes the motivation of this work and its contributions. Sections 4, 5 and 6 give a detailed description of our method. In section 7 we present our experimental results. Section 8 describes related work. We conclude in section 9.

1 http://dbpedia.org/ontology/
2 https://www.w3.org/1999/02/22-rdf-syntax-ns
3 https://www.w3.org/2000/01/rdf-schema
4 http://dublincore.org/documents/2012/06/14/dcmi-terms/
5 https://www.w3.org/2009/08/skos-reference/skos.html
6 http://xmlns.com/foaf/spec/
7 http://fr.dbpedia.org/
8 Retrieved April 28, 2017
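The dcterms:subject values of a resource can be retrieved from the public SPARQL endpoint mentioned above. The following sketch illustrates the idea with the Python standard library; the endpoint URL is the paper's setting, while the function names are ours and purely illustrative:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://fr.dbpedia.org/sparql"
SUBJECT = "http://purl.org/dc/terms/subject"

def category_query(resource_uri):
    # SELECT all dcterms:subject values (i.e. categories) of the given resource.
    return "SELECT ?cat WHERE { <%s> <%s> ?cat . }" % (resource_uri, SUBJECT)

def fetch_categories(resource_uri):
    # Query the endpoint and return the list of category URIs.
    params = urllib.parse.urlencode({
        "query": category_query(resource_uri),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(ENDPOINT + "?" + params) as resp:
        data = json.load(resp)
    return [b["cat"]["value"] for b in data["results"]["bindings"]]
```

The same query can of course be issued through the SPARQLWrapper package that the authors use in section 7; only the result plumbing changes.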


Senoussi, H. Using DBpedia Categories to Evaluate and Explain Similarity in Linked Open Data. DOI: 10.5220/0006939001170127. In Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018) - Volume 2: KEOD, pages 117-127. ISBN: 978-989-758-330-8. Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
KEOD 2018 - 10th International Conference on Knowledge Engineering and Ontology Development

2 DBpedia CATEGORIES

There are two main types of categories: administrative categories and content categories. Administrative categories are used to organise the Wikipedia project. They are non-semantic categories. The content categories are used to group articles dealing with the same subject. In other words, two Wikipedia articles belong to the same category if they share some property: e.g. Leonardo da Vinci, Raphael and Michelangelo belong to the category 16th-century Italian painters.

Some categories are called container categories: they are intended to be populated entirely by subcategories. Other categories can contain only articles, or both sub-categories and articles. We call the former pure categories and the latter mixed categories.

Given what we have summarized above, DBpedia categories are organised using two graphs. The first one is a bipartite graph GS=(R, C, ES), where R is the set of all resources, C the set of categories and ES the set of edges defined by the predicate dcterms:subject. The second one is an acyclic directed graph GB=(C, EB) where EB is the set of edges defined by the predicate skos:broader. For two categories cat1 and cat2, we have cat1 skos:broader cat2 if cat1 is a sub-category of cat2. We notice that in this graph about 13% of the vertices are isolated. The majority of these vertices correspond to date categories. We also have a small number of sinks (vertices without outgoing edges). About 50% are sources (vertices without incoming edges). These vertices represent pure categories.

3 MOTIVATION AND CONTRIBUTION

Our aim in this work is to define a similarity measure for LOD resources that simulates as well as possible the human notion of similarity. Given how humans evaluate the similarity between objects, such a measure must have at least the two following properties:

1. Be able to detect hidden commonality: let us for example consider two paintings defined by the following sets of features: P1 = {Author=Claude Monet, Creation Year=1914, Museum=Musée de l'Orangerie} and P2 = {Author=Auguste Renoir, Creation Year=1911, Museum=Petit Palais}. These two paintings don't have common features. A standard feature-based similarity measure will conclude that their similarity is equal to 0. However, they have an important commonality: both of them were painted by French impressionist painters, were created about 1910, and are on display in a Parisian museum. We call that a 'hidden' commonality.

2. To give different weights to features depending on their 'obvious' importance: let us for example consider the following objects: P1 = {..., Author=Claude Monet, Category=Vandalized works of art, ...}, P2 = {..., Author=Claude Monet, ...} and P3 = {..., Category=Vandalized works of art, ...}. According to a standard feature-based similarity measure, the similarity between P1 and P2 is equal to the similarity between P1 and P3 because the two pairs of paintings have the same number of common features and the same number of different features. But in the definition of a painting the feature 'Author=' is obviously more 'important' than the feature 'Category=Vandalized works of art'. Therefore, in a 'good' similarity measure, the contribution of the former feature should be more important than that of the latter.

In addition to these two essential properties, we want our similarity measure to have some other nice properties: to be intuitive, data type-independent and dataset-independent, and to have easily explained results. To the best of our knowledge, none of the known methods has all these properties (see section 8).

The main contributions of this paper can be summarized as follows:

1. Defining a unified representation of LOD resources using weighted DBpedia categories.
2. An intuitive algorithm that uses the categories' graph to measure similarity between resources.
3. The output of this algorithm is not limited to the similarity measure but contains qualitative elements explaining it.

4 PROBLEM FORMALIZATION

Given two DBpedia resources belonging to the same type, our objective is to measure their similarity and give an explanation of this similarity/dissimilarity.

A resource is described by its categories, and each category is assigned a weight. As input we have two resources represented by their categories and their weights. In other words, each resource is described by a set of couples: R = {(ci, wi)} where the ci's are categories and the wi's are real numbers such that ∑wi = 1.
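Concretely, a resource can be held as an ordered category list turned into normalized weights. The helper below is our own naming; it uses the constant weighting f(r) = 1/n, the simplest of the rank-based schemes discussed later in section 6.2:

```python
# Sketch: a resource as a set of (category, weight) couples with sum(w) = 1.
# The constant scheme gives every category the same weight 1/n.

def as_weighted_categories(ordered_categories):
    """Turn an ordered category list into {category: weight}, weights summing to 1."""
    n = len(ordered_categories)
    return {c: 1.0 / n for c in ordered_categories}

raphael = as_weighted_categories([
    "Italian Renaissance painters",
    "Italian Renaissance architects",
    "Mythological painters",
])
```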


To measure similarity between R1 = {(c1i, w1i)} and R2 = {(c2i, w2i)} we need a function computing the 'distance' between categories. Let us call dist such a function. We use dist to compute first the distance between each pair (c1i, c2j), then the distance between each category c and the other resource, and finally the distance between the two resources.

The desired output contains 3 levels. The level 0 contains the couples {(c1, c2) ∈ R1 × R2} such that c1 is close to c2. These categories explain the resources' similarity. This level also contains the categories which are not close to other categories. These categories explain the dissimilarity between the two resources. The level 1 contains a 1-dimension table summarizing the level 0 content. The top level contains a measure of the similarity between the two resources.

5 DISTANCE BETWEEN CATEGORIES

The predicate skos:broader is a particular case of the 'is-a' relation. This relation has been extensively studied and we know (Rada et al., 1989) that in this case the shortest path length between categories defines a semantic distance. We call dist this semantic distance and we compute it using the following algorithm:

- Input: the graph GB, the two categories c1 and c2, an integer DEPTH_MAX and a 'big' integer INF.
- Output: the integer value dist(c1, c2).
- Starting from c1 and c2, explore the graph GB using a breadth-first traversal. Limit the graph exploration to a depth DEPTH_MAX.
- If we find a common ancestor cc: dist(c1, c2) = length(c1 -> cc) + length(c2 -> cc).
- Else: dist(c1, c2) = INF.

In the following we will take INF = 2 * DEPTH_MAX + 1. It results that dist(c1, c2) ∈ {0, ..., 2 * DEPTH_MAX + 1}.

5.1 Particular Case of Isolated Categories

Some categories are isolated vertices in the graph GB. It follows that if we apply the general definition of the distance between categories we will have: for each isolated category cd, for each category c ≠ cd, dist(cd, c) = INF.

In this work, we considered more specially the birth and death categories (YYYY births and YYYY deaths) that we find in resources belonging to the type Person. These categories are processed as follows:

1. The distance between two birth (resp. death) categories is the number of generations between them. If this number of generations is greater than 2 * DEPTH_MAX we consider that the distance is INF. In this work we take GEN=25 (years per generation).
2. For each other category c, the distance between a birth (resp. death) category and c is equal to INF.

6 DISTANCE BETWEEN RESOURCES

Given two resources R1 = {(c1i, w1i)} and R2 = {(c2i, w2i)}, we compute the distance dist(R1, R2) between them as follows:

1. For each pair (c1i, c2j) compute the distance dij = dist(c1i, c2j).
2. For each category c1i compute the distance between c1i and the resource R2, defined by dist(c1i) = dist(c1i, R2) = min_j(dij).
3. For each category c2j compute the distance between c2j and the resource R1, defined by dist(c2j) = dist(c2j, R1) = min_i(dij).
4. When computing the latter distances we define the set T as follows: T = {(c, dist(c), c')} where c is a category of R1 or R2, and dist(c) = dist(c, c'); in other words c' is a category that minimizes the distance between c and the categories of the other resource.
5. Define the (2 * DEPTH_MAX + 2)-size table tab as follows: tab[i] = cumulated weight of the categories whose distance to the other resource is equal to i.
6. dist(R1, R2) = ∑i (i * tab[i]) / ∑i tab[i].

This distance is comprised between 0 and INF = 2 * DEPTH_MAX + 1. We can normalize it if we want to compare distances computed with different values of DEPTH_MAX. Similar resources have low values of dist. Dissimilar ones have values close to INF.

To explain the similarity/dissimilarity of the two resources, we use:


1. the set T: if the d value of a triplet (c, d, c') is low, the categories c, c' and their common successor cs explain similarity. If d is high, c explains dissimilarity.

2. the table tab: this table summarizes the content of T. Its left cells (lowest indexes) contain the cumulated weights of similar elements and its right cells (highest indexes) contain the cumulated weights of dissimilar elements.
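The two levels of the method (category distance by breadth-first exploration of the broader-graph, then resource distance through the tab table and the set T) can be sketched as follows. The adjacency encoding and function names are ours, not the paper's code, and the birth/death special case of section 5.1 is omitted for brevity:

```python
from collections import deque

DEPTH_MAX = 2
INF = 2 * DEPTH_MAX + 1  # distance used when no common ancestor is found

def ancestors_by_level(graph, start):
    """BFS upwards along skos:broader edges; maps category -> depth (<= DEPTH_MAX)."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        c = queue.popleft()
        if depths[c] == DEPTH_MAX:
            continue
        for sup in graph.get(c, ()):
            if sup not in depths:
                depths[sup] = depths[c] + 1
                queue.append(sup)
    return depths

def cat_dist(graph, c1, c2):
    """Shortest path through a common ancestor, capped at INF."""
    d1, d2 = ancestors_by_level(graph, c1), ancestors_by_level(graph, c2)
    common = set(d1) & set(d2)
    return min((d1[c] + d2[c] for c in common), default=INF)

def res_dist(graph, r1, r2):
    """r1, r2: dicts {category: weight}. Returns (distance, tab, T)."""
    tab = [0.0] * (2 * DEPTH_MAX + 2)
    T = []
    for a, b in ((r1, r2), (r2, r1)):
        for c, w in a.items():
            best = min(b, key=lambda other: cat_dist(graph, c, other))
            d = cat_dist(graph, c, best)
            tab[d] += w        # level 1: cumulated weight per distance value
            T.append((c, d, best))  # level 0: explanatory triplets
    dist = sum(i * tab[i] for i in range(len(tab))) / sum(tab)
    return dist, tab, T
```

With `graph = {"A": ["X"], "B": ["X"]}`, two resources holding only A and B are at distance 2 (one step up to the shared ancestor X from each side).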

Figure 1: Functions for the weights ((a) constant, (b) affine, (c) logistic, (d) inverse).

6.1 Algorithm

- Input:
  - Two resources R1 and R2 defined by R1 = {(c1i, w1i)} and R2 = {(c2i, w2i)}.
  - Two integers DEPTH_MAX, GEN.
- Output:
  - T: the set of triplets {(c, d, c')}.
  - tab: the (2 * DEPTH_MAX + 2)-size table.
  - dist(R1, R2): the similarity measure.

1. Initialize T to the empty set and tab to {0, ..., 0}.
2. Clean R1 and R2:
   - Remove administrative categories.
   - Remove categories which are super-categories of other categories in the same resource.
3. For each couple (c1i, c2j) compute the distance dij = dist(c1i, c2j).
4. For each category c1i:
   - compute d1i = dist(c1i, R2).
   - Add w1i to tab[d1i].
   - Add the triplets (c1i, d1i, c') such that dist(c1i, c') = d1i to T.
5. Do the same with the categories c2j.
6. Compute dist(R1, R2) = ∑i (i * tab[i]) / ∑i tab[i].

6.2 Defining the Weights wi

As noted by Cheekula et al. (Cheekula et al., 2015), Wikipedia has a convention stating that the categories of a particular article should be ordered according to their significance in the article. We define the weights as a function of this order. In other words, the weight w of a category c is defined by w = f(r), where f is a decreasing or constant function and r is the rank of the category. We used 4 different functions:

1. The constant function f1(x) = 1/n, where n is the total number of categories defining the resource (figure 1 (a)).
2. The affine function f2(x) = ax + b (figure 1 (b)).
3. The logistic function f3(x) = α / (1 + exp(lx)) (figure 1 (c)).
4. The inverse function f4(x) = β / x (figure 1 (d)).

f1 considers that all categories have the same importance. f2 decreases slowly; we use it when we want to limit the difference between high weights and low weights. f3 decreases very slowly then more rapidly; it divides the categories into 3 groups: the first one is given a high weight, the second is given an average weight and the last is given a low weight. f4 decreases very rapidly: only the very first categories are taken into account in the evaluation of the similarity.

6.3 An Example

Let us take an example. We want to measure the similarity between the two painters Raphael and Leonardo da Vinci. We take DEPTH_MAX=4 and use the affine function to compute the weights.

1. Raphael is defined by an ordered list of 15 categories: {Italian Renaissance painters, Italian Renaissance architects, Mythological painters, ..., 1483 births, 1520 deaths, Death in Rome, ...}.

2. Leonardo da Vinci is defined by an ordered list of 15 categories: {Italian Renaissance painters, Italian Renaissance architects, ..., 1452 births, 1519 deaths, ..., Humanists, ..., Hydraulic engineers, ..., Anatomists}.

3. sim(R1, R2) = 2.97 out of 2*4+1 = 9: it is a high similarity but not as high as we could expect. The reason is that the two painters have many similar elements and some dissimilar ones. The explanation of this value is given by the set T and the table tab.

4. The table tab is shown by figure 2. We see in this bar chart that the weight of the distance 0 represents more than 35% of the total weight: this is


due to the important number of common categories and to their high weight. We also notice that the right part of the figure is not empty: the distance 2 * DEPTH_MAX = 8 represents more than 10% of the total weight, ...

5. The set of triplets T = {(Italian Renaissance painters, Italian Renaissance painters, 0), (Italian Renaissance architects, Italian Renaissance architects, 0), (1483 births, 1452 births, 1), (1520 deaths, 1519 deaths, 0), (Mythological painters, Religious painters, 4), (Death in Rome, People of the Republic of Florence, 5), ...}.

Figure 2: Raphael-Leonardo Da Vinci Similarity (bar chart of tab, distances 0 to 8, similarity on the left, dissimilarity on the right).

7 EXPERIMENTAL RESULTS-EVALUATION

There are two main kinds of methods to evaluate computational measures of similarity. The first is correlating these values with those of human judgments (Resnik, 1995). The second is application-oriented evaluation: for example, the similarity measure is included in a recommender system ((di Noia et al., 2012), (Meymandpour and Davis, 2016)) and the predictions of this system are compared to actual users' behaviour. In this work, we chose to validate our method by showing that it simulates the human notion of similarity. For that, we conducted several independent series of experiments. In the one we present in this paper, we use our similarity measures to rank resources with respect to their similarity with a given resource, then we compare the obtained results to those given by a human group.

7.1 Experimental Setting

The dataset we used in our experiments is the French version of DBpedia[9]. This dataset can be queried via its SPARQL endpoint[10]. The DBpedia resources we considered refer to painters belonging to different countries, centuries and artistic movements.

To implement our method we used the Python programming language and its package SPARQLWrapper. The first task accomplished by our programs is to extract the set of categories of each resource and clean it by removing administrative categories and categories which are super-categories of other categories. These sets are then completed by adding the rank of each category.

For the next steps we need to choose the value(s) of DEPTH_MAX. Both the accuracy of our measures and the time efficiency of our programs depend on the value of this parameter. In these experiments we chose these values empirically. For that, we started by trying a large set of values (between 2 and 10) and observed the category lists obtained in each level and the changes in the measures obtained. We then noticed that starting from the level 4, categories are too general and/or not well correlated with the considered resources. We also noticed that the similarity measures change very little when we increase the value of DEPTH_MAX. Considering this, in all our experiments we used DEPTH_MAX=2 or DEPTH_MAX=3.

Generality of categories and their correlation with resources can be precisely measured using respectively information content and relatedness. For example, table 1 presents a short example giving the average relatedness between the DBpedia resource Édouard Manet[11] and the categories met when exploring the graph starting from this resource. In this evaluation, the relatedness value is between 0 for very weakly correlated resource/category pairs (e.g. Manet and Cultural anthropology) and 5 for very highly related ones (e.g. Manet and French Impressionist painters).

7.2 Application to Ranking

In this series we evaluated our results by correlating our similarity values with those of human judgments. For that we formed two groups: the first was composed of 10 engineering school teachers and the second was composed of 10 graduate students. We gave the list of 12 painters shown in table 2 to every member of these groups and asked them to rate the similarity for

9 http://fr.dbpedia.org/
10 http://fr.dbpedia.org/sparql
11 http://fr.dbpedia.org/resource/Édouard Manet


Table 1: Average relatedness/level.
Level             0    1    2    3    4    5    6    7    8    9
Avg. Relatedness  2.75 2.22 1.61 0.75 0.26 0.13 0.08 0.03 0.03 0.03

each pair of painters on a scale from 0 (no similarity) to 4 (very high similarity). We computed the 4 similarity measures (constant, affine, logistic and inverse) between each pair of painters. Then for each painter, we rank the 11 other painters with respect to their similarity with this painter. It follows that for each painter P we have 6 rankings: RT (resp. RS) is the ranking wrt the average value of the similarity measures rated by the teachers (resp. students) group. RC (resp. RA, RL and RI) is the ranking wrt the values calculated by our programs in the constant (resp. affine, logistic and inverse) case. We compute Spearman's rank correlation coefficient ((Saporta, 2011), chapter 6) between RT and RS, then between RT and the 4 computational rankings.

Table 2 presents the URIs of the resources used in this series. Table 3 contains a detailed example: the similarity measures (columns S*) and the corresponding rankings (columns R*) for the painter Claude Monet[12].

Table 4 contains the Spearman coefficient for each painter and for each pair of rankings, the best value (column BestV) of this coefficient among our 4 cases, the case in which we obtain this value (column BestF) and the probability to have a Spearman coefficient greater than the best value we obtained. This table shows that:

1. The Spearman coefficients we obtain in the best case vary from 0.47 to 0.97. Those between the two human rankings vary from 0.52 to 0.93.
2. The probability to have a better correlation (last column) is equal to 17% in one case (the painter Henri Matisse) and smaller than or equal to 5% in all the other cases.
3. In all but two cases, we obtain the best correlation with a method taking into account the weights (affine, logistic or inverse).

Using these three remarks we can conclude that our approach simulates very well the human notion of similarity and that giving the right weights to the categories in the description of a resource is essential for similarity measure accuracy.

7.3 Similarity Measure Considered as a Random Variable

According to G. Saporta ((Saporta, 2011), chapter 2), the Laplace-Gauss distribution is usually used to describe the "distribution of measurement errors around the 'true value'". Given two objects, similarity measure values, rated by humans or calculated by algorithms, can be seen as approximations of the "true value" of the similarity between these two objects. Errors in these approximations are due to subjective judgments or lack of knowledge. In this section we state the following hypothesis: "Similarity measure between two objects can be represented by a random variable following the normal distribution". To test this hypothesis we use two properties of the normal distribution: its skewness s is 0 and its kurtosis k is 3. Considering this, we split our hypothesis into two:

1. H0: s = 0; H1: s ≠ 0.
2. H0': k = 3; H1': k ≠ 3.

To test these two hypotheses we use the fact that for size-N samples from a normal distribution we have (Bobée and Robitaille, 1975):

1. s follows a normal distribution with expectation ms = 0 and variance std²(s) = 6N(N−1) / ((N−2)(N+1)(N+3)).
2. k follows a normal distribution with expectation mk = 3 and variance std²(k) = 24N(N−1)² / ((N−3)(N−2)(N+3)(N+5)).

It follows that we can accept the null hypotheses H0 and H0' with a type 1 error α = 5% if we have:

−1.96 ≤ ŝ / std(s) ≤ 1.96 and −1.96 ≤ (k̂ − 3) / std(k) ≤ 1.96

where ŝ and k̂ are the sample's skewness and kurtosis.

We applied this normality test to all our pairs of painters (66). The samples we used were obtained by merging, for each pair, the values proposed by the teachers' group and those proposed by the students' group: thus, we had 20 observations for each test.
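The acceptance test above can be reproduced with a short script. The skewness and kurtosis below are the plain moment estimators (the paper does not name its estimator, so this is an assumption), and the variances are the ones just given:

```python
import math

def skewness(xs):
    n, m = len(xs), sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def kurtosis(xs):
    n, m = len(xs), sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / n / s2 ** 2

def accept_normality(xs):
    """Accept H0 (s = 0) and H0' (k = 3) at the 5% level, with the paper's variances."""
    n = len(xs)
    var_s = 6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3))
    var_k = 24 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5))
    zs = skewness(xs) / math.sqrt(var_s)
    zk = (kurtosis(xs) - 3) / math.sqrt(var_k)
    return abs(zs) <= 1.96 and abs(zk) <= 1.96
```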
The normality test has been accepted for 49 pairs of painters and rejected for the 17 others (both H0 and H0' have been rejected for 15 pairs, H0 has been rejected for the 2 others). For each one of these 49 pairs we wanted to measure how close our 4 similarity measures are to the "true value" (the mean m of the normal distribution). For that we calculated the value

12 http://fr.dbpedia.org/resource/Claude Monet


Table 2: Resources URIs.
DAL dbo:Salvador Dalí                GAU dbo:Paul Gauguin
MAN dbo:Édouard Manet                MAT dbo:Henri Matisse
MCA dbo:Michel-Ange                  MON dbo:Claude Monet
PIC dbo:Pablo Picasso                REN dbo:Auguste Renoir
RPH dbo:Raphaël (peintre)            TIT dbo:Titien
TLT dbo:Henri de Toulouse-Lautrec    VNC dbo:Léonard de Vinci

Table 3: Similarity and rank for Claude Monet.
Painter  ST   RT  SS   RS  SC   RC  SA   RA  SL   RL  SI   RI
DAL      4.3  10  3.2   5  5.4   6  4.9   6  4.2   6  4.4   6
GAU      3.0   3  1.8   3  3.5   3  3.1   3  3.1   3  2.6   3
MAN      2.2   2  1.6   1  2.7   1  2.4   1  2.3   2  1.9   1
MAT      3.6   5  3.4   7  4.1   5  3.9   5  3.4   4  3.5   5
MCA      4.0   7  4.4   9  6.5   8  6.3  10  5.9  10  5.7   9
PIC      3.8   6  3.2   6  6.1   7  5.6   7  4.8   7  4.5   7
REN      1.6   1  1.7   2  3.2   2  2.7   2  2.1   1  2.0   2
RPH      4.1   9  4.4  10  6.5   9  6.2   9  5.6   8  5.7  10
TIT      4.0   8  4.3   8  6.6  11  6.5  11  6.3  11  6.3  11
TLT      3.2   4  2.9   4  3.8   4  3.5   4  3.6   5  3.1   4
VNC      4.3  11  4.5  11  6.5  10  6.1   8  5.6   9  5.1   8
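Rankings such as the R* columns of Table 3 can be derived from the distance values by sorting: a lower distance means a more similar painter and thus a better rank. A minimal sketch with our own function name (ties, which do occur in the table, are broken arbitrarily here):

```python
def rank_by_similarity(distances):
    """distances: {painter: distance to the reference painter}.
    Returns {painter: rank}, rank 1 = most similar (lowest distance)."""
    ordered = sorted(distances, key=distances.get)
    return {painter: i + 1 for i, painter in enumerate(ordered)}

# e.g. a slice of the SC column for Claude Monet: MAN (2.7) comes out first.
ranks = rank_by_similarity({"MAN": 2.7, "REN": 3.2, "GAU": 3.5})
```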

Table 4: Spearman coefficients.
Painter  RT,RS  RT,RC  RT,RA  RT,RL  RT,RI  BestV  BestF     P(|r|>BestV)
GAU      0.73   0.7    0.65   0.77   0.66   0.77   Logistic  1%
MAN      0.52   0.5    0.56   0.65   0.48   0.65   Logistic  4%
MAT      0.87   0.44   0.44   0.46   0.47   0.47   Inverse   17%
MCA      0.65   0.81   0.82   0.8    0.79   0.82   Affine    0.5%
MON      0.83   0.76   0.76   0.76   0.76   0.76   All       1%
PIC      0.93   0.67   0.81   0.84   0.92   0.92   Logistic  0.03%
REN      0.65   0.64   0.6    0.57   0.57   0.64   Constant  5%
RPH      0.77   0.65   0.68   0.73   0.77   0.77   Inverse   1%
TIT      0.6    0.8    0.81   0.84   0.81   0.84   Logistic  0.4%
TLT      0.59   0.93   0.97   0.93   0.91   0.97   Affine    0.00002%
VNC      0.69   0.69   0.68   0.62   0.55   0.69   Constant  3%
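Spearman's coefficient between two rankings can be computed directly from the rank differences, ρ = 1 − 6 Σd² / (n(n²−1)), which is valid when there are no ties; the helper below is our own sketch:

```python
def spearman(rank_a, rank_b):
    """rank_a, rank_b: {item: rank} over the same items, no ties assumed."""
    n = len(rank_a)
    d2 = sum((rank_a[x] - rank_b[x]) ** 2 for x in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Identical rankings give ρ = 1 and fully reversed rankings give ρ = −1.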

u = |x − m| / σ for each similarity x. Table 5 contains a summary of these values.

We conclude from this summary that, on average, between 52% and 62% of similarity values are further from the "real value" than our measures. We also notice that the best results are obtained when we take into account the weights of the categories in the resources' description.

7.4 Concluding Remarks

In the previous subsections we presented the experiments we conducted to validate our approach. The results of these experiments show that our approach approximates well the human notion of similarity applied to complex objects: linked data resources. In all our experiments, the best results have been obtained with the logistic or inverse weight functions. This fact proves the importance of weights in the description of complex objects and that the similarity value depends mainly on the most important features (categories). To complete this conclusion, let us note that during our experiments, we have been faced with some characteristics of DBpedia content and ontology that have limited the precision of our programs. These characteristics can be summarized as follows:

- Missing information (1): We often noticed differences between Wikipedia articles' categories and those of the corresponding DBpedia resources. DBpedia is not up to date.

- Missing information (2): In some Wikipedia articles, obvious categories are missing. This is cer-


Table 5: Position of the similarity measures in the normal distribution.
          min(u)  max(u)  av=average(u)  P(|X|>av)
Constant  0.03    1.57    0.65           0.52
Affine    0.05    1.45    0.58           0.56
Logistic  0.05    1.29    0.49           0.62
Inverse   0.03    1.70    0.51           0.61
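The u statistic of Table 5 measures, in standard deviations, how far a computed similarity lies from the fitted mean; for a standard normal X, the share of values further from the mean, P(|X| > u), can be obtained from the error function. A small sketch (our own naming):

```python
import math

def u_value(x, m, sigma):
    """Distance of similarity x from the fitted mean m, in standard deviations."""
    return abs(x - m) / sigma

def prob_further(u):
    """P(|X| > u) for a standard normal X: P = 2(1 - Phi(u)) = 1 - erf(u / sqrt(2))."""
    return 1 - math.erf(u / math.sqrt(2))
```

For instance, prob_further(0.65) is about 0.52, matching the Constant row of Table 5.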

tainly due to the fact that it is crowdsourced.

- "Shallow" schema (Gunaratna et al., 2011): For example, there is no difference between administration categories and content categories. During the graph exploration we were obliged to remove the former manually.

- Some information can not be extracted from Wikipedia, e.g. the order of categories within resources. We were obliged to add it manually.

8 RELATED WORK

Several works aimed to define a similarity measure for complex objects and more specially for linked data resources. In the following we present these works and qualitatively compare them to ours.

- (Meymandpour and Davis, 2016): A linked data resource r is defined by a set of features Fr representing its outgoing and incoming relations. Outgoing (resp. incoming) relations correspond to RDF triples in which the resource is the subject (resp. the object). The information content IC(f) of a feature f is defined, as is well known in Information theory, as a decreasing function of its relative frequency. The information content of a set of features (a resource, the intersection or the difference of two resources) is defined as the sum of the information content of its features. The similarity of two resources r and s is defined by the following formula:

  sim(r, s) = IC(Fr ∩ Fs) / (IC(Fr ∩ Fs) + IC(Fr − Fs) + IC(Fs − Fr))

  Several important differences can be noticed between this work and ours. We use exclusively categories, which in general summarize well the important properties of the resources, while they use all the resources' properties. Unlike ours, this work doesn't take into account hidden commonalities. We explicitly assign a weight to each feature while they consider that the importance of features is related to their information content. Characterizing the importance of a feature by its information content doesn't correspond to human judgment (to which we want to correlate our similarity measures): for example, among the properties of the resource representing the painter Claude Monet, 19th century French painter is 35 times more frequent than People with cataracts. Therefore, the information content of the former is much lower than that of the latter. But, when measuring the similarity of Claude Monet with any other person, a human will give much more importance to the first property. Let us note, to finish this comparison, that the information content of features is indirectly taken into account in our similarity measure since in our category graph exploration the more specific categories (higher information content, e.g. French Impressionist painters) are met before the more general categories (lower information content, e.g. French painters).

- (Ostuni et al., 2014): Two resources are similar if they are related to similar entities, in other words if they share a similar neighborhood in the RDF graph (considered as undirected). Each resource r is represented by the subgraph Gh(r) obtained when performing a breadth-first search up to a limited depth h. A feature vector representation Φh(r) is then deduced from Gh. Φh(r) is defined as follows:

  Φh(r) = (wr,e1, ..., wr,et)

  where the ej are the edges of Gh and the wr,ej their weights. The weight wr,ej depends on the number of edges involving ej in each level l = 1, ..., h of Gh. The similarity between two resources r1 and r2 is computed by taking the scalar product of the feature vectors Φh(r1) and Φh(r2). This similarity measure gives different weights to the features (depending, as outlined by the authors, on their 'occurrence' and 'locality') and takes into account hidden commonalities on h levels. The main difference between this work and ours is that we use a unified representation of resources (as sets of DBpedia categories), and therefore we use a smaller set of features (while capturing the same information) and we explore the categories' graph instead of the whole RDF graph. We also have two different definitions of the feature weights: ours are computed as a function of their human-defined rank.

- (Zadeh and Reformat, 2013): Resources are described by their RDF triples (resource, property, value). In other words a resource is represented


• (Zadeh and Reformat, 2013): Resources are described by their RDF triples (resource, property, value). In other words, a resource is represented by the features property=value. Properties are given different weights reflecting their importance in describing the resources, and are grouped according to their importance to form fuzzy sets l_1, ..., l_n. To define the properties' importance, the authors proceed as follows: first they discard properties not included in the Wikipedia infobox of the resource, then they categorize the remaining properties based on the location of their domains in a hierarchy of domains, considering that properties with more abstract domains are more important. The contributions of features to the similarity value are computed with respect to two layers: the first layer corresponds to common features, the second to different values for the same property. The contribution of each fuzzy set is computed by averaging those of the properties belonging to it. Finally, the overall similarity value is obtained by aggregating the fuzzy sets' contributions. Like ours, this work takes hidden commonalities into account (the contribution of the second layer) and considers that features should have different importance (weight) in similarity assessment. However, there are at least two main differences between the two works. First, the authors use a definition of feature/property weights which doesn't correspond to the human notion of importance: for example, if we consider books, the literaryGenre and author properties are very important but their domain is not abstract. Second, exploration of the (resources') graph for hidden commonalities is limited to only one layer, while we showed above that it is useful to go deeper (2 or 3 levels) in the (categories') graph.

• (di Noia et al., 2012): The authors present a content-based recommender system exploiting exclusively LOD datasets. To suggest new items to a user, this recommender system needs to compute a similarity value between pairs of LOD resources belonging to the same type (e.g. movies). The first step consists of computing this similarity with respect to each common property p (e.g. director, starring, ...). For that, the two resources are represented as two vectors m_{1,p} and m_{2,p} showing commonalities and differences between them with respect to p. The vectors' values are TF-IDF weights. Similarity with respect to p is computed as the cosine of the angle between these two vectors. These values of similarity are combined to compute an overall similarity value according to a user profile. In this overall similarity value, a weight α_p is assigned to each property, representing its worth with respect to the user profile. These weights are learned using two methods: a genetic algorithm or a statistical analysis on […]'s recommender system. The main difference between this similarity measure and ours is that the former was specially designed for a recommender system: it is application-dependent (e.g. the weights are computed with respect to a user profile).

• (Damljanovic et al., 2012): In the context of a search for concepts (Linked Data resources) relevant to a given set of seed/initial concepts, the authors present a similarity measure in which they distinguish two kinds of links between concepts: hierarchical links, which serve to organize resources into classes (e.g. rdf:type or dcterms:subject), and transversal links (all other links). The contributions of the two types of links are computed differently. Several variants of the distances for calculating these contributions are described. This approach was extended by (Paul et al., 2016) and used to enrich annotated documents and evaluate their similarity. The main differences between this similarity measure and ours are that (1) we give different weights to categories and (2) we don't use "transversal links", because we noticed information redundancy between the two kinds of links (properties).

• (Passant, 2010): To evaluate similarity between two resources, the author measures their "semantic distance" using a function named LDSD (Linked Data Semantic Distance). Six versions of LDSD are presented and compared by correlating their results with human judgments. This function relies on the numbers of direct and indirect, incoming and outgoing links between resources. A direct link between two resources r1 and r2 means that for some property p we have p(r1) = r2 or p(r2) = r1. An indirect link means that for some property p we have p(r1) = p(r2), or p(r3) = r1 and p(r3) = r2 for some other resource r3. In other words, indirect links represent common properties of the two resources, while direct links represent relatedness between resources. Since these measures rely on the "number" of links, all properties are supposed to have the same importance. Since exploration of indirect links is limited to the first level, hidden commonalities are ignored.
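The direct/indirect link distinction at the heart of LDSD can be sketched as follows; the triples are invented toy data, and the raw counts are our simplification of the quantities LDSD combines, not one of Passant's six variants:

```python
# Toy (subject, property, object) triples; invented for illustration.
triples = [
    ("Picasso", "movement", "Cubism"),
    ("Braque", "movement", "Cubism"),
    ("Picasso", "influenced", "Braque"),
    ("Guernica", "author", "Picasso"),
]

def direct_links(r1, r2):
    # Direct link: p(r1) = r2 or p(r2) = r1 for some property p.
    return sum(1 for s, p, o in triples if (s, o) in ((r1, r2), (r2, r1)))

def indirect_links(r1, r2):
    # Indirect link: p(r1) = p(r2) (a shared outgoing property/value),
    # or p(r3) = r1 and p(r3) = r2 for some third resource r3.
    out1 = {(p, o) for s, p, o in triples if s == r1}
    out2 = {(p, o) for s, p, o in triples if s == r2}
    in1 = {(s, p) for s, p, o in triples if o == r1}
    in2 = {(s, p) for s, p, o in triples if o == r2}
    return len(out1 & out2) + len(in1 & in2)

print(direct_links("Picasso", "Braque"))    # 1: the 'influenced' triple
print(indirect_links("Picasso", "Braque"))  # 1: both have movement = Cubism
```

Because only counts enter such a measure, every property weighs the same, and because r3 is one hop away, commonalities hidden deeper in the graph stay invisible: the two limitations noted above.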

125 KEOD 2018 - 10th International Conference on Knowledge Engineering and Ontology Development

• (Ponzetto and Strube, 2007): The authors use Wikipedia categories for computing semantic relatedness. They consider the system of categories as a semantic network. To compute semantic relatedness between a pair of words, they retrieve two unambiguous pages associated with the words. For each page they extract the set of categories the page is assigned to. They then compute the set of paths between all pairs of categories of the two pages. Once they have all the paths, they select the shortest path or the path with the most common subsumer (two kinds of measures) and then apply Resnik's measure (Resnik, 1995). Like ours, this work uses exclusively Wikipedia/DBpedia categories to represent complex objects and to measure similarity between them. But at least two important differences exist between the two works, probably due to the different types of objects compared (words versus LOD resources): first, we explicitly assign weights to categories, and second, our measure combines the contributions of all categories and not only the ones corresponding to the shortest path.

• (Fouss et al., 2005): To measure similarity between elements of a database, the authors define a weighted undirected graph in which nodes correspond to database elements (e.g. movies, people and movie categories) and edges to links between them (e.g. has watched). The weight w_{ij} of the edge connecting two nodes i and j is defined as follows: the more important the relation between elements i and j, the larger the value of w_{ij}. A Markov chain is defined in which a state is associated with each node of the graph and the probability of jumping from a node i to an adjacent node j is proportional to the weight w_{ij}. Using the properties of this Markov chain, the authors show that similar resources are connected by a comparably large number of short paths, while dissimilar resources have fewer paths connecting them, and these paths are longer. They also show that a similarity measure can be extracted from the pseudoinverse of the Laplacian matrix of the graph. This method was not designed for linked data, and if we want to adapt it we must add weights to the RDF graph's edges, i.e. to each RDF triple's predicate. Such weights should represent relatedness between the resources connected by the concerned edges. In other words, to apply this method to linked data, we must first compute relatedness between each pair of resources belonging to an RDF triple. This is not realistic.

9 CONCLUSION AND FUTURE WORK

In this work we aimed to define a simple similarity measure for Linked Data that is highly correlated to human judgment. We positively answered the question: can we measure and explain similarity using exclusively DBpedia categories? But this work is part of a larger project in which we also deal with the following problems:

1. To show that DBpedia categories can be used as a unified representation for all linked data resources, and not only those of DBpedia.

2. To use machine learning methods to create new categories and to assign categories to resources.

3. To give a general characterization of feature-based similarity of which the measure presented in this paper will be a special case.

REFERENCES

Bobée, B. and Robitaille, R. (1975). Étude sur les coefficients d'asymétrie et d'aplatissement d'un échantillon, rapport num. 49. Technical report, INRS-Eau, Université du Québec, Québec.

Cheekula, S. K., Kapanipathi, P., Doran, D., Jain, P., and Sheth, A. P. (2015). Entity recommendations using hierarchical knowledge bases.

Damljanovic, D., Stankovic, M., and Laublet, P. (2012). Linked data-based concept recommendation: Comparison of different methods in open innovation scenario. In Proceedings of ESWC 2012, pages 24–38.

di Noia, T., Mirizzi, R., Ostuni, V. C., Romito, D., and Zanker, M. (2012). Linked open data to support content-based recommender systems. In Proceedings of the 8th International Conference on Semantic Systems, pages 1–8.

Fouss, F., Pirotte, A., and Saerens, M. (2005). A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation. In Proceedings of Web Intelligence 2005, pages 550–556.

Gunaratna, K., Lalithsena, S., Jain, P., Henson, C. A., and Sheth, A. P. (2011). A systematic property mapping using category hierarchy and data. Technical report, http://corescholar.libraries.wright.edu/knoesis/601.

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., and Bizer, C. (2015). DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Meng, L., Huang, R., and Gu, J. (2013). A review of semantic similarity measures in WordNet. International Journal of Hybrid Information Technology, 6(1):1–12.

Meymandpour, R. and Davis, J. G. (2016). A semantic similarity measure for linked data: An information content-based approach. Knowledge-Based Systems, 109:276–293.

Ostuni, V. C., di Noia, T., Mirizzi, R., and Sciascio, E. D. (2014). A linked data recommender system using a neighborhood-based graph kernel. In Proceedings of the 15th EC-Web, pages 89–100.

Passant, A. (2010). Measuring semantic distance on linking data and using it for resources recommendations. In Proceedings of AAAI Spring Symposium: Linked Data Meets Artificial Intelligence.

Paul, C., Rettinger, A., Mogadala, A., Knoblock, C. A., and Szekely, P. A. (2016). Efficient graph-based document similarity. In ESWC 2016, pages 334–349.

Ponzetto, S. P. and Strube, M. (2007). Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, 30:181–212.

Rada, R., Mili, H., Bicknell, E., and Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th IJCAI, volume 1, pages 448–453.

Saporta, G. (2011). Probabilités, analyse des données et statistique. Éditions TECHNIP, Paris, 3rd edition.

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4):327–352.

Zadeh, P. D. H. and Reformat, M. (2013). Fuzzy semantic similarity in linked data using Wikipedia infobox. In Proceedings of IFSA/NAFIPS, pages 395–400.