Classifying Inconsistencies in Dbpedia Language Specific Chapters
Total Page:16
File Type:pdf, Size:1020Kb
Classifying Inconsistencies in DBpedia Language Specific Chapters Elena Cabrio, Serena Villata, Fabien Gandon INRIA 2004 Route des Lucioles BP93, 06902 Sophia Antipolis, France fi[email protected] Abstract This paper proposes a methodology to identify and classify the semantic relations holding among the possible different answers obtained for a certain query on DBpedia language specific chapters. The goal is to reconcile information provided by language specific DBpedia chapters to obtain a consistent results set. Starting from the identified semantic relations between two pieces of information, we further classify them as positive or negative, and we exploit bipolar abstract argumentation to represent the result set as a unique graph, where using argumentation semantics we are able to detect the (possible multiple) consistent sets of elements of the query result. We experimented with the proposed methodology over a sample of triples extracted from 10 DBpedia ontology properties. We define the LingRel ontology to represent how the extracted information from different chapters is related to each other, and we map the properties of the LingRel ontology to the properties of the SIOC-Argumentation ontology to built argumentation graphs. The result is a pilot resource that can be profitably used both to train and to evaluate NLP applications querying linked data in detecting the semantic relations among the extracted values, in order to output consistent information sets. Keywords: DBpedia language specific chapters, inconsistencies detection, bipolar argumentation 1. Introduction France), and ii) the combination of these query results may The Web is an information space which is evolving from lead to inconsistent information sets about the same topic. sharing textual documents to publishing structured data. In To better understand the problem, let us consider for in- particular, the Linked Data initiative aims at fostering the stance a question answering system that takes as input NL publication and interlinking of data on the Web, giving birth questions and queries two (or more) language specific chap- 4 to the so called Web of Data, a dataspace where structured ters of DBpedia for answer retrieval (e.g. QAKiS (Cabrio data are interlinked with each other. The best known ex- et al., 2013a)). Ideally, we would expect that each DBpedia ample of Linked Data is DBpedia (Bizer et al., 2009), a endpoint returns its own answer and all these answers are structured information dataset extracted from Wikipedia. identical. However, this is not always the case, and three The advantage brought by the availability of such informa- possible situations arise, illustrated in Figure 1 (Cojan et tion in a structured format (usually expressed in RDF1) is al., 2013): i) the same result is provided from both chapters that in this way sophisticated queries can be raised against (i.e., both datasets contain the same value for the same rela- Wikipedia data to extract further information (e.g., us- tion), ii) two different results are provided (i.e., the datasets ing the SPARQL query language2), and new links to dif- contain different values for the same relation), iii) the re- ferent datasets on the Web may be exploited. DBpedia sult is provided from one chapter only (only one dataset presents several language specific chapters3 extracted from provides the value for the queried relation). Considering the language specific Wikipedia chapters, where informa- the two DBpedia biggest chapters, i.e. English and French tion is presented in a structured format. DBpedia chap- ones, on the total of all the properties used in the two chap- ters, well connected through Wikipedia instance interlink- ters, 11.4% of the properties have the same value in the two ing, can contain different information from one language to chapters (∼343,000), 5.8% have different values in the two another. In particular, language specific chapters can pro- chapters (∼175,000), 67.3% have a value only in the En- vide i) more details about certain topics, ii) fill informa- glish chapter (∼2,022,500), and 15.5% has a value only in tion gaps (provide information missed in the other DBpedia the French chapter (∼460,500). If the results are the same, chapters), or iii) conceptualize certain relations according the answer to be returned to the user is precisely this result to a specific cultural viewpoint (Rinser et al., 2012). If on set, but if the results are different then a specific manage- the one side relying on information contained in language ment of the query result has to be done before returning the specific DBpedia chapters in Natural Language Processing answer to the user. (NLP) interfaces and applications (e.g., NL question an- While several initiatives are proposed to push DBpedia con- swering systems over Linked Data) would bring promising tributors into extending the property mapping to DBpedia improvements on systems coverage, on the other side they common ontology, in this paper we focus on case (ii), and face the following problems: i) different results can be ob- we address the following research question: tained for the same query (e.g., English and French DBpe- • How to detect the inconsistencies arising from merg- dia chapters provide a different date of birth of Louis IX of ing information from DBpedia language specific chap- ters? 1http://www.w3.org/RDF/ 2http://www.w3.org/TR/rdf-sparql-query/ In particular, the research question breaks down into the 3http://wiki.dbpedia.org/ Internationalization/Chapters 4http://qakis.org/qakis2/ 1443 Figure 1: Possible outcomes of the values comparison among DBpedia language specific chapters. following sub-questions: 2. Inconsistencies classification The first step of our approach consists in identifying which 1. Which kind of semantic relations can be established kind of semantic relation holds among the different values to reconcile information provided by language specific obtained by querying a set of language specific DBpedia DBpedia chapters to obtain a consistent results set? chapters with a certain query. More precisely, we want 2. How to compute in an automated way consistent re- to detect the linguistic phenomena holding among the in- sults sets starting from the identified relations? consistent values obtained querying two DBpedia language specific chapters at a time, given a certain subject and a To answer the research questions, we propose a data-driven certain ontological property. approach to identify and classify the semantic relations In the following, we propose our classification of the lin- holding among the possible different answers obtained for guistic phenomena we detected, defined referring to widely a certain query on DBpedia language specific chapters. We accepted linguistic categories in the literature. We identify consider 17 DBpedia ontology properties, and we collect the following positive relations between values (mainly dis- the data from the different multilingual chapters of DBpe- course and lexical semantics relations): dia concerning such properties. We further identify which kind of relations hold between the query results items. In a) identity, i.e. same value but in different languages order to provide a machine-readable version of our dataset, (missing SameAs link in DBpedia) we introduce the LingRel ontology, a lightweight vocabu- E.g., Dairy product vs Produits laitiers lary composed of properties only, where the relations we (property: product); Yevgeny Kaspersky vs identified are formally represented. Eugene Kaspersky (property: owner) Starting from the identified semantic relations between two b) acronym, i.e. initial components in a phrase or a word posi- pieces of information, we further classify them as E.g., PSDB vs Partito della Social Demo- tive negative or , and we exploit bipolar abstract argumen- crazia Brasiliana (property: leaderParty); tation (Dung, 1995; Cayrol and Lagasquie-Schiex, 2011) RFID vs Radio-frequency identification to represent the result set as a unique graph, such that ar- (property: product) gumentation semantics can be adopted to detect the (pos- sible multiple) consistent sets of elements of the query re- c) disambiguated entity, i.e. one value contains in the sult. Only this consistent set of information items has to be name the class to which the entity belongs returned to the user. Again, in order to provide such argu- E.g., Michael Lewis (Author) vs Michael mentation graphs in a machine-readable format, we adopt Lewis (property: author); David Williams (oil the extended SIOC Argumentation vocabulary5 proposed baron) vs David Williams (property: fundedBy) by Cabrio et al. (2013c) where two kinds of relations among the nodes of the graph hold, i.e., a support relation and an d) identity:stage name, i.e. pen/stage names pointing to attack relation. the same entity The remainder of the paper is as follows. We first present E.g., Lemony Snicket vs Daniel Handler (prop- our classification of the relations holding between the val- erty: author); The Wachowskis vs Andy et ues we retrieved from the results set over different lan- Larry Wachowski (property: director) guage specific DBpedia chapters (Section 2.), and second e) coreference, i.e. an expression referring to another we adopt argumentation theory to detect the inconsisten- expression describing the same thing (in particular, cies arising in the results set of the query (Section 3.). The non normalized expressions) feasibility study of our data-driven approach is detailed in E.g., William Burroughs vs William S. Section 4. Section 5. compares the proposed approach with Burroughs (property: author); Arup vs Arup other approaches adopting argumentation theory for on- Associates (property: architect) tologies alignment and communities management. Finally, some conclusions are drawn. f) renaming, i.e. reformulation of the same entity name in time 5http://bit.ly/SIOC-Argumentation E.g., Charleville (Ardennes), old name of 1444 Charleville-Mezi´ eres` (property: birthPlace); An AF encodes, through the conflict (i.e., attack) relation, Edo, old name of Tokyo (property: birthPlace) the existing conflicts within a set of arguments.