RDF Shape Induction using Knowledge Base Profiling

Nandana Mihindukulasooriya, Universidad Politécnica de Madrid, Spain ([email protected])
Mohammad Rifat Ahmmad Rashid, Politecnico di Torino, Italy ([email protected])
Giuseppe Rizzo, Istituto Superiore Mario Boella, Italy ([email protected])
Raúl García-Castro, Universidad Politécnica de Madrid, Spain ([email protected])
Oscar Corcho, Universidad Politécnica de Madrid, Spain ([email protected])
Marco Torchiano, Politecnico di Torino, Italy ([email protected])

ABSTRACT

Knowledge Graphs (KGs) are becoming the core of most artificial intelligent and cognitive applications. Popular KGs such as DBpedia and Wikidata have chosen the RDF data model to represent their data. Despite its advantages, there are challenges in using RDF data, for example, in data validation. Ontologies for specifying domain conceptualizations in RDF data are designed for entailment rather than validation, and most ontologies lack the granular information needed for validating constraints. Recent work on RDF Shapes and the standardization of languages such as SHACL and ShEx provide better mechanisms for representing integrity constraints for RDF data. However, manually creating constraints for large KGs is still a tedious task. In this paper, we present a data-driven approach for inducing integrity constraints for RDF data using data profiling. Those constraints can be combined into RDF Shapes and can be used to validate RDF graphs. Our method uses machine learning techniques to automatically generate RDF shapes using profiled RDF data as features. In the experiments, the proposed approach achieved 97% precision in deriving RDF Shapes with cardinality constraints for a subset of DBpedia data.

CCS CONCEPTS

• Information systems → Data cleaning; • Computing methodologies → Knowledge representation and reasoning;

KEYWORDS

RDF Shape, Knowledge Base, Data Quality, Machine Learning

1 INTRODUCTION

Knowledge Graphs (KGs) are becoming the core of most artificial intelligent applications driven by domain knowledge. Popular KGs such as DBpedia, YAGO2, and Wikidata have chosen the RDF data model for knowledge representation. RDF is a graph-based data model which is the de facto model in Semantic Web and Linked Data applications. RDF graphs can capture and represent domain information in a semantically rich manner using ontologies. As with any other data model, in order to specify the conditions that must be satisfied by an RDF graph, constraints need to be imposed. In practical settings, constraints are commonly used for three main tasks: (a) specifying the properties that data should hold; (b) handling contradictions within data or with respect to the domain under consideration; or (c) helping with query optimization.

Ontology languages are designed for inferring new knowledge using known axioms rather than for validating data based on axioms. A reasoner and a validator have different functions: a reasoner is used for inferring new knowledge while a validator is used for finding violations against a set of constraints. The underpinning principles used in OWL, such as the Open World Assumption (OWA) and the Non-Unique Name Assumption, can lead to unexpected results in a validator. Thus, it is challenging to perform certain validation tasks using such languages.

Validation against a pre-defined schema is one of the key steps in most conventional data storage and publishing paradigms; for instance, validation against DDL constraints in SQL or against XML Schema or RelaxNG in XML documents [27]. For RDF data, typically OWL or RDF Schema based ontologies are used for this kind of validation. However, the use of ontologies for validating RDF data leads to several problems: (a) the mismatch between intended purposes, i.e., inference (for which the ontologies are designed) versus validation, and (b) the lack of information useful for validation in most ontologies.

In general, data quality remains a critical aspect for obtaining the trust of users. Data quality relates to the perception of the "fitness for use" in a given context [32]. Constraints are limitations on the data that are supposed to be satisfied at all times by its instances [2]. They are useful for users to understand the data as they represent characteristics that the data naturally exhibits [22]. Data profiling is defined as "the activity of creating small but informative summaries of a database" [17]. In this study, we present a data-driven approach that applies machine learning techniques to profiled RDF data in order to automatically generate integrity constraints. Then, the constraints related to a specific class are combined to generate an RDF Shape as a collection of constraints. These RDF shapes can be used to validate RDF graphs.
One of the main use cases for generating RDF Shapes from data is quality assessment. In this work, we use DBpedia [21] (version 2016-04) as our target KB and we experiment on assessing the quality of all dbo:Person instances in the DBpedia KB. The main requirement for such a quality assessment is to identify the validation rules that each person instance should conform to. For example, a person should have exactly one value for the "dbo:birthDate" property, or the values of the "dbo:height" property should always be a positive number. The instances of the Person class have more than 13,000 associated properties (including dbo, DBpedia ontology properties, and dbp, auto-generated properties from Wikipedia infobox keys). Thus, it is a tedious, time-consuming, and error-prone task to generate such validation rules manually. Some of the validation rules can be encoded into the ontology, but that still requires a lot of manual effort. This work proposes an approach for inducing such validation rules in the form of RDF shapes by profiling the data and using inductive approaches to extract those rules. Other use cases for inducing shapes include describing the data, which is helpful in generating queries or creating dynamic user interfaces.

The main contribution of this paper is a novel approach for generating RDF Shapes based on inducing constraints using machine learning algorithms with data profiling information as features. The proposed approach is defined in a generic way that applies to any type of constraint and is validated using two common constraint types: (a) cardinality constraints, and (b) range constraints.

The remainder of this paper is structured as follows: in Section 2 we define the problem setting; in Section 3 we provide references and background on related research activities; in Section 4 we describe our approach; we report the experimental setup in Section 5 and the results achieved by our approach in Section 6. We conclude in Section 7 providing insights on future research endeavours.
2 PROBLEM DEFINITION

A knowledge graph is defined to be consistent if it does not contain conflicting or contradictory data [16]. When a schema is available with the integrity constraints that the data should comply with, the data usually goes through a validation process that verifies compliance against those constraints. These integrity constraints encapsulate the consistency requirements for the data in order for it to be fit for a given set of use cases. For instance, in relational databases, integrity constraints are expressed in a data definition language (DDL), and the database management system (DBMS) ensures that any data inserted into the database will not lead to any inconsistency in the entire database. However, validation of RDF data is not done in this manner due to several reasons, such as the lack of a language for expressing constraints (RDFS and OWL are designed for entailment and not for validation, as discussed in the introduction) or having generic models suitable for wide use but not for specific use cases.

When an ontology is available with TBox axioms that define the conceptualization of the domain, a reasoner can be used to verify whether the dataset is consistent with the domain by verifying the axioms defined in the ontology. However, most ontologies do not have rich axioms that could help to detect inconsistencies in data. Most of the large knowledge bases, such as DBpedia, lack a definition of integrity constraints, and it is a tedious task to manually create these constraint definitions from scratch. Further, most of the schema information about RDF data is only available in the form of OWL ontologies, which are better suited for entailment than for validation. However, most of the practical use cases that utilize RDF data need validation according to integrity constraints.

In recent years, the W3C community has put an effort into standardizing a language for describing structural integrity constraints and validation rules for RDF instance data, through the RDF Data Shapes Working Group1 and later on through the Shape Expression Community Group2. The outcomes of such efforts, languages such as the W3C Shapes Constraint Language (SHACL) [19] and the Shape Expressions Language [26], allow integrity constraints to be defined for RDF data and used to validate data. However, for large RDF knowledge bases such as DBpedia, which contains more than 750 classes and 60,000 properties (dbo: and dbp:), it is still a challenge to generate integrity constraints manually. In this context, the main research problem addressed in this paper is: How to automatically generate integrity constraints for RDF data?

In this study, we propose a data-driven approach to automatically generate RDF shapes for a given dataset by applying machine learning techniques to RDF data profiling information extracted from the dataset.

1 https://www.w3.org/2014/data-shapes/charter
2 https://www.w3.org/community/shex/

2.1 Objective

Using the information extracted from statistical data profilers, our objective is to generate representative RDF Shapes for the data in a given dataset. We generate shapes at the class level. An example excerpt of an RDF Shape in SHACL for the dbo:Person class is illustrated in Listing 1.

Listing 1: A snippet from an example Person shape

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

ex:DBpediaPerson a sh:NodeShape ;
  sh:targetClass dbo:Person ;
  sh:property [
    sh:path foaf:name ;
    sh:minCount 1 ;
    sh:nodeKind sh:Literal ] ;
  sh:property [
    sh:path dbo:birthDate ;
    sh:datatype xsd:date ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:nodeKind sh:Literal ] ;
  sh:property [
    sh:path dbo:birthPlace ;
    sh:class dbo:Place ;
    sh:nodeKind sh:BlankNodeOrIRI ;
    sh:minCount 1 ;
    sh:maxCount 1 ] .

To this end, a large number of special-purpose systems and techniques have been developed for solving such constraint-based mining and learning problems [10]. Constraints can be used both in a descriptive setting, such as discovering association rules [3], and in a predictive setting, such as rule learning [13]. The key contribution of this paper is that we study how profiled data can be applied to RDF shape induction, which, to the best of our knowledge, has not been addressed for RDF data validation problems.

Our main hypothesis is that machine learning techniques can be used to generate correct RDF Shapes by using data profiling information as features. In Section 4 we define an approach for generating RDF Shapes and validate our hypothesis by evaluating the generated RDF Shapes to ensure that the constraints that are defined are accurate (using manually labelled test data).

3 RELATED WORK

Ontology Learning (OL) is commonly defined as a field that comprises techniques for the automated acquisition of ontological knowledge from data. The paradigm has shifted such that many approaches do not aim to generate a full-fledged, gold-standard ontology from data anymore, but rather focus on acquiring axioms of certain shapes such as concept definitions, atomic subsumptions, or disjointness axioms. There are several works on the induction of axioms using methods such as:

Association rule mining (ARM). Abedjan et al. [1] present rule-based approaches for predicate suggestion, data enrichment, ontology improvement, and query relaxation. They identified inconsistencies in the data through predicate suggestion, enrichment with missing facts, and alignment of the corresponding ontology. They also allow users to handle inconsistencies during query formulation through predicate expansion techniques.

Probabilistic graphical models (PGMs). Probabilistic graphical models allow generating interpretable models that are constructed and then manipulated by reasoning algorithms [20]. These models can also be learned automatically from data, allowing the approach to be used in cases where manually building a model is difficult or even impossible.

Statistical Relational Learning (SRL). SRL is a branch of machine learning that tries to model a joint distribution over relational data [14]. It combines statistical learning, which addresses uncertainty in data, and relational learning, which deals with complex relational structures [18].

Inductive logic programming (ILP). Bühmann et al. [8] present an approach for inductive lexical learning of class expressions by combining an existing logical learning approach with statistical relevance measures applied to textual resources.

Pattern extraction. In this area of work, RDF data is analyzed to extract common patterns, for example, in the form of Frequent Graph Patterns [4] or Statistical Knowledge Patterns [7]. These approaches analyze the underlying RDF data and extract characteristics related to the ontological axioms based on the most frequent patterns. This approach is closely related to the work presented in this paper.

Our final goal is different from these research approaches: instead of inducing ontological axioms, our goal is to induce validation rules. In recent years, various RDF validation languages based on expressing constraints have been introduced:

• The Web Ontology Language (OWL) [23] is an expressive ontology language based on Description Logics (DL). The semantics of OWL addresses distributed knowledge representation scenarios where complete knowledge about the domain cannot be assumed. To address the mismatch discussed above, some approaches use OWL expressions with the Closed World Assumption and a weak Unique Name Assumption so that OWL expressions can be used for validation purposes, such as Stardog ICV3 and the approach of Tao et al. [31].

• The W3C Shapes Constraint Language (SHACL) [19] is used for validating RDF graphs against a set of conditions. These conditions are provided as shapes and other constructs expressed in the form of an RDF graph. In particular, it helps to identify constraints by using SPARQL. It also provides a high-level vocabulary to identify predicates and their associated cardinalities and datatypes.

• The Shape Expressions (ShEx) [26] language describes RDF nodes and graph structures. A node constraint describes an RDF node (IRI, blank node, or literal) and a shape describes the triples involving nodes in an RDF graph. These descriptions identify predicates and their associated cardinalities and datatypes.

• SPARQL Inferencing Notation (SPIN)4 constraints associate RDF types or nodes with validation rules. In particular, SPIN allows users to use SPARQL to specify rules and logical constraints.

3 https://www.stardog.com/docs/
4 http://spinrdf.org/

These shape expression languages, namely ShEx, SHACL, and SPIN, aim to validate RDF data and to communicate data semantics among users. They cover constraints such as keys and cardinality; however, their expressivity is limited and they require user intervention. In our approach, we aim to automate the process of shape generation using machine learning. More specifically, various constraints can be envisioned in RDF based on their success in the relational model, such as cardinality and range constraints. Neumann and Moerkotte have proposed "characteristic sets" for performing cardinality estimations for RDF queries with multiple joins [25]. These works differ from the work presented in this paper on two axes. First, they are focused on determining the cardinalities of each value rather than the cardinality of the entity-value relation. Second, they are focused on query optimization rather than integrity constraint validation. However, we build on the basis of these works, such as the analysis of mean, variance, and other statistical features, to derive an approach for cardinality estimation for integrity constraint validation.

4 APPROACH

As discussed above, RDF Shapes could help validate existing KGs; nevertheless, creating them manually is a tedious task. Thus, to address the challenge of automatically generating RDF Shapes, we propose a workflow for RDF Shape induction based on data profiling. We define a generic workflow so that it can be applied to any type of constraint. We provide examples for two types of constraints, namely, cardinality and range constraints. Similarly, it can be extended to other types of constraints such as value range constraints (min and max values), string constraints (minLength, maxLength, pattern, languagesIn, uniqueLanguage), or property pair constraints (lessThan, lessThanOrEquals, disjoint, equal) [19].

The goal of the workflow is to extract constraints by analyzing data patterns and statistics and to combine the extracted constraints to generate RDF Shapes that can be used for validation.

The main steps of the proposed workflow are illustrated in Figure 1. The input to the workflow is statistical profiling information coming from Loupe [24], a tool for profiling RDF data.

Figure 1: Workflow for RDF constraints generation.

4.1 Task formulation

The first step of the workflow is the formulation of the specific constraint extraction goal as a typical machine learning task.

Cardinality constraints: For the cardinality constraints, our goal is to generate two cardinality constraints when relevant, i.e., the minimum cardinality and the maximum cardinality. As a first step towards understanding the task better, we gathered the expected cardinalities (i.e., a gold standard) for a subset of 215 properties from the 3cixty knowledge base [33] and 166 properties of the Spanish National Library (BNE) dataset. When analyzing these datasets, we found only a few common patterns repeated in both of them. Table 1 shows the common cardinality patterns we found in the gold standard.

Table 1: Minimum and maximum cardinality levels.

Key     Description
MIN0    Minimum Cardinality = 0
MIN1    Minimum Cardinality = 1
MIN1+   Minimum Cardinality > 1
MAX1    Maximum Cardinality = 1
MAX1+   Maximum Cardinality > 1

We could observe the same trend in the vocabularies where cardinality constraints are explicitly expressed. When we analyzed the 551 vocabularies in the Linked Open Vocabularies (LOV) catalogue for the values of cardinality restrictions, 96.91% (848 out of 875) of the owl:maxCardinality constraints have the value 1 and 93.76% (631 out of 673) of the owl:minCardinality values are either 0 or 1. Thus, in our work, rather than trying to estimate the exact numbers for the minimum and maximum cardinality values, we determine which cardinality category each property has with respect to a given class. By doing so, we reduce the problem from a regression problem to a classification problem.

Range constraints: For range constraints, our goals are twofold. First, we want to generate an object node kind constraint for each property associated with a given class. We use a subset of the node kinds already identified in SHACL, i.e., IRI, Literal, BlankNode, and BlankNodeOrIRI. Once the node kind is determined, more specific range constraints have to be decided. If the node kind is Literal, the corresponding datatype has to be determined. If the node kind is either IRI, BlankNode, or BlankNodeOrIRI, the class type of the object has to be determined.

We formulate this task also as a classification problem. We classify each property associated with instances of a given class into one of the aforementioned node kinds. The second task, assigning the corresponding datatype or class as the range of each property, is done based on heuristics over the datatype or class type distributions among the set of objects associated with the property.
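To make the reduction from regression to classification concrete, the following minimal sketch (our own illustration, not the authors' code) shows how an observed per-subject value count for one class-property pair maps onto the cardinality categories of Table 1. In the actual approach these categories are the prediction targets, and the gold-standard labels were assigned manually.

```r
# Map an observed cardinality distribution (number of values per subject)
# onto the categories of Table 1. Illustrative only: in the paper the
# category is predicted by a classifier, not derived by a fixed rule.
cardinality_category <- function(values_per_subject) {
  min_card <- min(values_per_subject)
  max_card <- max(values_per_subject)
  c(min = if (min_card == 0) "MIN0" else if (min_card == 1) "MIN1" else "MIN1+",
    max = if (max_card == 1) "MAX1" else "MAX1+")
}

# Example: a property where some subjects have no value and some have two
cardinality_category(c(0, 1, 1, 2))   # -> min = "MIN0", max = "MAX1+"
```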

4.2 Feature engineering

Cardinality constraints: Loupe provides cardinality information for each property associated with instances of a given class. For example, by analyzing 1,767,272 dbo:Person instances in DBpedia, Loupe extracts the cardinality distribution for the dbo:Person-dbo:deathDate combination illustrated in Table 2.

Table 2: Cardinality counts for dbo:Person-dbo:deathDate

Cardinality   Instances   Percentage
0             1,355,038   76.67%
1               404,069   22.87%
2                 8,165    0.46%

During the feature engineering step, this raw profiling data is used to derive a set of features that can be used for predicting the cardinality. Concretely, we have derived 30 features from this data, including min-max cardinalities, mean, mode, standard deviation, variance, and other distribution-related metrics. The complete list can be found online5.

5 https://doi.org/10.6084/m9.figshare.5270710
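As a minimal sketch of this step (our own illustration; the exact 30 features are listed in the material linked above), the distribution-related features can be derived directly from the profiled counts of Table 2:

```r
# Derive distribution features from a profiled cardinality distribution
# (counts of dbo:Person subjects having 0, 1, or 2 dbo:deathDate values).
card_dist <- data.frame(cardinality = c(0, 1, 2),
                        instances   = c(1355038, 404069, 8165))

# Expand the histogram into one observation per subject, then summarise.
expanded <- rep(card_dist$cardinality, times = card_dist$instances)

features <- c(
  min_card = min(expanded),
  max_card = max(expanded),
  mean     = mean(expanded),
  mode     = card_dist$cardinality[which.max(card_dist$instances)],
  std_dev  = sd(expanded),
  variance = var(expanded)
)
round(features, 4)
```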

Range constraints: Loupe provides statistics about the number of IRIs, literals, and blank nodes for each property associated with instances of a given class, as shown in Table 3. The blank node counts are also generated by Loupe, but they are not shown in the table because there were no blank nodes in this example.

Table 3: Object node type information

                             IRIs                  Literals
Class-property               Total      Distinct   Total      Distinct
dbo:Person-dbp:birthPlace    89,355     21,845     44,639     20,405
dbo:Person-dbp:name          21,496     15,746     115,848    100,931
dbo:Person-dbp:deathDate     127        111        65,272     32,449
dbo:Person-dbp:religion      8,374      786        6,977      407

Further, Loupe also extracts object type information by analyzing all the IRI and blank node objects. Table 4 shows an example of object type information obtained by analyzing all the objects of the dbo:Person-dbp:birthPlace class-property combination. It contains the number of objects, the number of distinct objects of each class type, and their respective percentages. As can be seen, the objects of dbo:Person-dbp:birthPlace are typed with many different classes. In fact, most objects are typed with multiple classes (e.g., with equivalent classes and super classes). There are also some objects that should not be associated (i.e., are inconsistent) with the dbp:birthPlace property; for example, a Broadcaster should not be the birth place of a person. Further, there are some objects for which the type information is not available.

Table 4: Classes of dbo:Person-dbp:birthPlace objects

                       Objects (89,355)      Distinct Objects (21,845)
Object Class           Count      %          Count      %
schema:Place           71,748     80.29      16,502     75.54
dbo:Place              71,748     80.29      16,502     75.54
dbo:PopulatedPlace     71,542     80.07      16,353     74.86
dbo:Settlement         41,216     46.13      14,184     64.93
(other rows are omitted for brevity)
schema:Product         2           0.00      2           0.01
dbo:Broadcaster        2           0.00      2           0.01
Unknown                9,790      10.95      2,888      13.22

Similarly, for literal objects Loupe extracts information about their datatypes. Table 5 shows an example of the extracted information for the class-property combination dbo:Person-dbp:deathDate. For each datatype, it shows the number of objects, the number of distinct objects, and their corresponding percentages. This information provides heuristics about which should be the corresponding datatype.

Table 5: Datatypes of dbo:Person-dbp:deathDate literals

                 Objects (65,272)      Distinct Objects (32,449)
Datatype         Count      %          Count      %
xsd:date         39,761     60.92      26,726     82.36
xsd:integer      13,543     20.75      1,758       5.42
rdf:langString   6,388       9.79      3,512      10.82
xsd:gMonthDay    5,446       8.34      366         1.13
dt:second        113         0.17      66          0.20
xsd:double       20          0.03      20          0.06
dt:hour          1           0.00      1           0.00
Total            65,272     100        32,449     100

We use all the aforementioned information as features for the two tasks of detecting the object node kind and detecting the class type for IRI objects or the datatype for literal objects.
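A minimal sketch of the datatype heuristic mentioned above, under our own assumptions (the dominance threshold is illustrative and not taken from the paper): once the node kind is predicted to be Literal, the datatype constraint can be chosen from the profiled distribution in Table 5.

```r
# Pick a datatype constraint when one datatype clearly dominates the
# profiled literal objects of a class-property pair (Table 5).
datatype_counts <- c("xsd:date" = 39761, "xsd:integer" = 13543,
                     "rdf:langString" = 6388, "xsd:gMonthDay" = 5446,
                     "dt:second" = 113, "xsd:double" = 20, "dt:hour" = 1)

pick_datatype <- function(counts, dominance = 0.60) {
  shares <- counts / sum(counts)
  if (max(shares) >= dominance) names(which.max(shares)) else NA_character_
}

pick_datatype(datatype_counts)   # "xsd:date" (about 61% of the objects)
```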
4.3 Algorithm selection

In the algorithm selection step, the goal is to select the most appropriate algorithm for each of the tasks identified in the first step. As we have formulated both the cardinality prediction and the range prediction tasks as classification problems, we describe the algorithm selection step in a generic way.

For selecting the algorithm, we look at the accuracy of a model as the number of correct predictions out of all the predictions made; in classical machine learning this is known as classification accuracy. However, running an algorithm over a training dataset with different hyper-parameter settings will result in different models [6]. In particular, we are interested in selecting the best-performing model. We used classifier accuracy to evaluate the predictive performance of the various models. In detail, the steps for evaluating the predictive performance of a model are listed below:

(1) We want to increase the classifier accuracy by tweaking the learning algorithm and selecting the best-performing model from a given hypothesis space. In our profiled dataset we observed that rare events occur for the selected constraints as response variables, where the variations between two variables are less than 15%. We applied SMOTE (Synthetic Minority Over-sampling Technique) [9] for over-sampling the rare events. The SMOTE function over-samples response variables by using bootstrapping and k-nearest neighbours to synthetically create additional observations of that response variable.

(2) To reduce the variance of an accuracy score, we need to ensure that each instance is used an equal number of times for training. We applied k-fold cross-validation, where k is the number of splits to make in the dataset. In this approach we chose a value of k=10. This results in splitting the dataset into 10 parts (10 folds) and running the algorithm 10 times.

(3) It is important to establish the baseline performance on a predictive modeling problem. In this experimental analysis we applied the ZeroR classification method. ZeroR is a trivial classifier, but it gives a lower bound on the performance on a given dataset, which should be significantly improved upon by more complex classifiers. As such, it is a reasonable test of how well the class can be predicted without considering the other attributes [34].

(4) Finally, to identify the machine learning algorithm that is best suited for the problem at hand, we performed a comparison between the selected algorithms. We chose 9 algorithms from seven categories of machine learning approaches. Based on the highest value of classifier accuracy, we selected the best-performing model from the hypothesis space.
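The steps above can be sketched with standard R packages as follows. This is our own illustration under stated assumptions, not the authors' script: it assumes a data frame training whose column min_cardinality is the factor to predict and whose remaining columns are the profiling features, it shows only two of the nine algorithms, and the DMwR and RWeka packages are assumed to be available.

```r
library(DMwR)    # SMOTE()
library(caret)   # trainControl(), train(), getTrainPerf()

set.seed(1)

# Step (1): balance rare cardinality categories with SMOTE.
balanced <- SMOTE(min_cardinality ~ ., data = training,
                  perc.over = 100, perc.under = 200)

# Step (2): 10-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 10)

# Step (3): a ZeroR-style baseline, i.e. always predict the most frequent class.
majority     <- names(which.max(table(balanced$min_cardinality)))
baseline_acc <- mean(balanced$min_cardinality == majority)

# Step (4): train candidate classifiers (two of the nine shown here).
rf_fit  <- train(min_cardinality ~ ., data = balanced, method = "rf",  trControl = ctrl)
c45_fit <- train(min_cardinality ~ ., data = balanced, method = "J48", trControl = ctrl)

# Cross-validated accuracy (and Kappa) of each candidate, compared to the baseline.
getTrainPerf(rf_fit)
getTrainPerf(c45_fit)
baseline_acc
```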
The goal of this experimental anal- # for MIN1 and MIN1+ ysis is to evaluate our approach in the context of how many correct sh:property [sh:path foaf:name; predictions were made. In particular, we evaluate the cardinality sh:minCount 1 ]; classifier performance compared to the baseline accuracy. More specifically, one of the central problems of applying ma- # for MAX1 chine learning in KB graphs is to transform and identify features sh:property [ sh:path dbo:birthDate; while retaining the subtleties of the information it contains. Our sh:maxCount 1] . goal is to validate the prediction of two cardinality constraints, i.e., minimal and maximum cardinality from profiled data based ona Range constraints: For range constraints, we generate two gold standard. In this experiment, we created a gold standard with types of constraints. First, we generate if the objects of a property four human annotators labeling the data. The annotators mutually associated with a given class are either IRI, Literal, BlankNode, and revised the annotations and a third annotator was called when there BlankNodeOrIRI. These nodetype constraints are represented as was no agreement. illustrated in Listing 3. Then, we generate the class type constrainsts For DBpedia, we created a gold standard of expected cardinalities for IRI objects and datatype constraints for literal objects. for a subset of 174 properties. Similarly, for 3cixty we created one for a subset of 215 properties. In particular, the collected dataset Listing 3: Node type constraints presents cardinality information for each property associated with @prefix dbo: . instances of a given class for 174 properties of DBpedia and 215 6 @prefix dbp: . properties of 3cixty. The datasets can be accessed online . We ap- @prefix sh: . plied 10-fold cross validation for algorithm section tasks by splitting the dataset into 10 parts. For each algorithm training, it will be run ex:DBpediaPerson a sh:NodeShape; on 90% of the data and testing on the 10%. In particular, it will use sh:targetClass dbo:Person; each data instance as a training instance exactly 9 times and each # node type IRI test instance 1 time. Our approach has been implemented with a 7 sh:property [sh:path dbp:birthPlace; prototype written in R. sh:nodeKind sh:IRI; sh:or ( [sh:class schema:Place] 6 RESULT AND ANALYSIS [ sh:class dbo:Place ] ) We perform model evaluation based on classifier performance in ]; order to identify the best fitting model. To evaluate classifier perfor- mance, we used both the DBpedia and 3cixty datasets comparing # node type literal with a baseline accuracy. In general, the accuracy score measures sh:property [ sh:path dbp:deathDate; how many correct predictions are made compared to the gold stan- sh:nodeKind sh:Literal; dard. In our approach, we used the ZeroR [34] as baseline classifica- sh:datatype xsd:date ] . tion method for minimum and maximum cardinality. It is a simple classification method which relies on the target and ignores all predictors. ZeroR is based on predicting the most-frequent class 4.5 Shape Generation in a target variable. For example, DBpedia minimum cardinality The goal of the final step is to combine all the constraints related training set has two classes M0 and M1. As the frequency of M0 to a given class and to generate an RDF Shape for the class. In is 215, and that of M1 is 65, the ZeroR classifier will be based on the previous step, the constraints were generated for each class- M0. 
5 EXPERIMENTAL SETUP

We performed an experimental analysis of the cardinality constraints part of our approach on two KBs, namely DBpedia and 3cixty. It is interesting to look at cardinality constraints for a given class and property, i.e., how many objects are linked to a subject. If we consider both the subject and the object of the relation represented by a property, we can consider the different cardinality patterns used by Entity-Relationship (ER) models, such as one-to-one (1-1), one-to-many (1-N), many-to-one (N-1), and many-to-many (N-N). Thus, we have formulated the cardinality prediction task as a classification problem. The goal of this experimental analysis is to evaluate our approach in terms of how many correct predictions were made. In particular, we evaluate the cardinality classifier performance compared to the baseline accuracy.

More specifically, one of the central problems of applying machine learning to KB graphs is to transform and identify features while retaining the subtleties of the information the graph contains. Our goal is to validate the prediction of the two cardinality constraints, i.e., minimum and maximum cardinality, from profiled data based on a gold standard. In this experiment, we created a gold standard with four human annotators labeling the data. The annotators mutually revised the annotations, and an additional annotator was called in when there was no agreement.

For DBpedia, we created a gold standard of expected cardinalities for a subset of 174 properties. Similarly, for 3cixty we created one for a subset of 215 properties. In particular, the collected dataset presents cardinality information for each property associated with instances of a given class, for 174 properties of DBpedia and 215 properties of 3cixty. The datasets can be accessed online6. We applied 10-fold cross-validation for the algorithm selection tasks by splitting the dataset into 10 parts. Each algorithm is trained on 90% of the data and tested on the remaining 10%; in particular, each data instance is used as a training instance exactly 9 times and as a test instance once. Our approach has been implemented in a prototype written in R7.

6 https://github.com/rifat963/RDFShapeInduction/tree/master/dataset
7 https://github.com/rifat963/RDFShapeInduction

6 RESULT AND ANALYSIS

We perform model evaluation based on classifier performance in order to identify the best fitting model. To evaluate classifier performance, we used both the DBpedia and 3cixty datasets, comparing against a baseline accuracy. In general, the accuracy score measures how many correct predictions are made compared to the gold standard. In our approach, we used ZeroR [34] as the baseline classification method for minimum and maximum cardinality. It is a simple classification method which relies only on the target and ignores all predictors: ZeroR simply predicts the most frequent class of the target variable. For example, the DBpedia minimum cardinality training set has two classes, M0 and M1. As the frequency of M0 is 215 and that of M1 is 65, the ZeroR classifier will always predict M0. In Table 6 we present the baseline accuracy for minimum and maximum cardinality for both datasets using ZeroR. Although ZeroR has no predictive power, it is useful for determining a baseline performance as a benchmark for the other classification methods.

Table 6: Baseline accuracy for 3cixty KB and DBpedia KB

Knowledge base   Min. cardinality   Max. cardinality
3cixty           53.5%              80.5%
DBpedia          76.4%              52.9%
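As a small worked example of the setup described above (our own illustration; the class counts are taken from the ZeroR example in the text), the 10-fold split and the majority-class baseline can be checked as follows:

```r
library(caret)   # createFolds()

# DBpedia minimum-cardinality example: 215 M0 labels and 65 M1 labels.
labels <- factor(c(rep("M0", 215), rep("M1", 65)))

# With 10-fold cross-validation, every instance appears in exactly one
# test fold, i.e. it is used 9 times for training and once for testing.
folds <- createFolds(labels, k = 10)
length(unlist(folds)) == length(labels)   # TRUE

# ZeroR-style baseline: always predict the most frequent class (M0).
max(table(labels)) / length(labels)       # about 0.77, the majority-class share
```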

We increased the predictive performance by applying SMOTE to the training dataset. The SMOTE [9] approach performs a combination of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class. In our experiment, we applied an over-sampling value of 100 to double the quantity of positive cases and an under-sampling value of 200 to keep half of what was created as negative cases. This balances the classifier and achieves better performance than only under-sampling the majority class. The results of applying SMOTE to the training set are presented in Table 7.

Table 7: DBpedia and 3cixty distribution of cardinality constraints

                           Minimum Cardinality     Maximum Cardinality
Distribution               MIN0       MIN1         MAX1       ANY
3cixty KB
  Without SMOTE            47.2%      52.8%        79.2%      20.8%
  With SMOTE(100,200)      50%        50%          50%        50%
DBpedia KB
  Without SMOTE            76.5%      23.5%        53%        47%
  With SMOTE(100,200)      50%        50%          50%        50%
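The 50/50 distributions in Table 7 follow directly from the SMOTE parameters. A small sketch of the arithmetic, with the parameter semantics of the DMwR implementation of SMOTE and an invented minority-class count:

```r
# SMOTE(perc.over = 100, perc.under = 200): for each original minority case,
# create one synthetic minority case, then keep twice the number of created
# cases from the majority class.
n_minority <- 100                               # invented example count
perc_over  <- 100
perc_under <- 200

created  <- n_minority * perc_over / 100        # 100 synthetic minority cases
minority <- n_minority + created                # 200 minority cases in total
majority <- created * perc_under / 100          # 200 majority cases kept

c(minority = minority, majority = majority)     # a 50/50 split, as in Table 7
```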

In order to evaluate cardinality classifier accuracy, we chose the most common machine learning algorithms, considering a wide range of categories. Classifier accuracy gives an insight into performance based on the ratio of the number of correct predictions to all predictions made, presented as a percentage where 100% is the best an algorithm can achieve. We estimated classifier accuracy for nine algorithms covering seven categories of machine learning approaches. We applied 10-fold cross-validation together with parameter optimization to reduce the problem of choosing a set of optimal hyperparameters for a learning algorithm. In Table 8 we present the cardinality classifier accuracy for the DBpedia KB and the 3cixty KB. As Table 8 shows, most of the machine learning algorithms have a high accuracy compared to the baseline algorithm. In the case of the DBpedia KB, both the minimum and maximum cardinality classifiers have a high accuracy; for example, the Random Forest model reaches 97.6% for minimum cardinality and 97.03% for maximum cardinality. However, for the 3cixty KB, minimum and maximum cardinality show significant variation in classifier performance. For instance, the Naive Bayesian (58.3%) and C4.5 decision tree (55.56%) models are close to the baseline accuracy of 53.5% for minimum cardinality, whereas the Random Forest model reaches a higher accuracy of 74.8%.

We further looked into the Kappa metric [5], which presents the same information while taking the class balance into account. Detailed Kappa results can be found online8. The Kappa values gave similar results for both use cases. For example, in Figure 2 we present the 3cixty KB minimum cardinality Kappa value comparison across models. Based on the Kappa values, the Random Forest model has a higher value of 0.65 compared to the other models. Based on the classifier performance, the ensemble Random Forest model has the highest accuracy for both the DBpedia and 3cixty datasets.

Figure 2: Box plot comparing model results using Kappa Value for 3cixty KB Minimum Cardinality.

Finally, we generated the cardinality constraints using the Random Forest model for DBpedia. The 174 property cardinality constraints generated using the Random Forest model can be found online for minimum cardinality9 and for maximum cardinality10.

8 https://rifat963.github.io/kappa/kapp.html
9 https://github.com/rifat963/RDFShapeInduction/blob/master/dataset/dbp-min-training.csv
10 https://github.com/rifat963/RDFShapeInduction/blob/master/dataset/dbp-max-training.csv
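For reproducibility, the accuracy and Kappa figures reported in Table 8 and Figure 2 can be read off a fitted caret model. A minimal sketch, assuming the rf_fit model and a held-out test_set in the spirit of the Section 4.3 sketch:

```r
library(caret)   # confusionMatrix()

# Accuracy and Cohen's Kappa of the fitted classifier on held-out data.
pred <- predict(rf_fit, newdata = test_set)
cm   <- confusionMatrix(pred, test_set$min_cardinality)
cm$overall[c("Accuracy", "Kappa")]
```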
Based on the classi- fier performance ensemble model, Random Forest has the highest 9https://github.com/rifat963/RDFShapeInduction/blob/master/dataset/ percentage of accuracy for both the DBpedia and 3cixty datasets. dbp-min-training.csv 10https://github.com/rifat963/RDFShapeInduction/blob/master/dataset/ 8https://rifat963.github.io/kappa/kapp.html dbp-max-training.csv 7 Table 8: Cardinality Classifier Performance for DBpedia KB [8] Lorenz Bühmann, Daniel Fleischhacker, Jens Lehmann, Andre Melo, and Johanna and 3cixty KB Völker. 2014. Inductive lexical learning of class expressions. In International Conference on Knowledge Engineering and . Springer, 42–53. DBpedia KB 3cixty KB [9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Machine Learning Classifier Accuracy Classifier Accuracy 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial Algorithm intelligence research 16 (2002), 321–357. Minimum Maximum Minimum Maximum [10] Luc De Raedt, Tias Guns, and Siegfried Nijssen. 2010. Constraint programming Cardinality Cardinality Cardinality Cardinality for data mining and machine learning. In Proceedings of the Twenty-Fourth AAAI Baseline ZeroR[34] 76.4% 52.9% 53.5% 80.5% Conference on Artificial Intelligence (AAAI-10). 1671–1675. [11] David A Freedman. 2009. Statistical models: theory and practice. cambridge Logistic university press. Regression Regression 92.2% 84.74% 66.67% 78.4% [12] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting [11] machine. Annals of statistics (2001), 1189–1232. Naive [13] Johannes Fürnkranz and Peter A Flach. 2005. Roc âĂŸnâĂŹrule learningâĂŤto- Bayesian 87.83% 84.04% 58.3% 69.44% Bayes [5] wards a better understanding of covering algorithms. Machine Learning 58, 1 (2005), 39–77. Least [14] Lise Getoor and Ben Taskar. 2007. Introduction to statistical relational learning. Support Squares MIT press. Vector Support 94.84% 88.67% 63.89% 79.31% [15] Tin Kam Ho. 1995. Random decision forests. In Document Analysis and Recognition, Machine Vector 1995., Proceedings of the Third International Conference on, Vol. 1. IEEE, 278–282. Machine[30] [16] Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel Polleres. 2010. Weaving the Pedantic Web. In Proceedings of the on Instance k-Nearest 92.64% 85.5% 63.89% 79.31% the Web (LDOW 2010), Vol. 628. CEUR Workshop Proceedings. Based Neighbour[5] [17] Theodore Johnson. 2009. Data Profiling. In Encyclopedia of Database Systems, Classification LING LIU and M. TAMER ÖZSU (Eds.). Springer US, Boston, MA, 604–608. https: and //doi.org/10.1007/978-0-387-39940-9_601 Decision Regression 96.63% 87.29% 66.88% 76.29% [18] Hassan Khosravi and Bahareh Bina. 2010. A Survey on Statistical Relational Tree Tree Learning.. In Canadian Conference on AI. Springer, 256–268. (CART)[29] [19] Holger Knublauch and Dimitris Kontokostas. 2017. W3C Shapes Constraint Language (SHACL). (July 2017). https://www.w3.org/TR/shacl/ C4.5[28] 95.64% 93.84% 55.56% 79.34% [20] Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles Random and Techniques - Adaptive Computation and Machine Learning. The MIT Press. 97.61% 97.03% 74.89% 81.52% Ensemble Forest[15] [21] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Gradient Auer, et al. 2015. DBpedia–a large-scale, multilingual knowledge base extracted Boosting 96.17% 94.86% 55.56% 79.89% from Wikipedia. 
Semantic Web 6, 2 (2015), 167–195. Machines[12] [22] Stephen W Liddle, David W Embley, and Scott N Woodfield. 1993. Cardinality Neural constraints in semantic data models. Data & Knowledge Engineering 11, 3 (1993), Perceptron [5] 95.1% 89.6% 61.11% 79.31% Network 235–270. [23] Deborah L McGuinness, Frank Van Harmelen, et al. 2004. OWL web ontology language overview. W3C recommendation 10, 10 (2004), 2004. [24] Nandana Mihindukulasooriya, María Poveda-Villalón, Raúl García-Castro, and Asunción Gómez-Pérez. 2015. Loupe-An Online Tool for Inspecting Datasets in in the underlying KB, then the constraint shapes generated can the Linked Data Cloud. In Demo at the 14th International Semantic Web Conference. predict consistency issues by using our approach. This will help to Bethlehem, USA. detect quality issues which are automatically present in dynamic [25] Thomas Neumann and Guido Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Data Engineering knowledge bases. (ICDE), 2011 IEEE 27th International Conference on. IEEE, 984–994. [26] Eric Prud’hommeaux, Iovka Boneva, Jose Emilio Labra-Gayo, and Gregg Kellogg. 2017. Shape Expressions Language 2.0. (July 2017). http://shex.io/shex-semantics/ ACKNOWLEDGMENTS [27] Eric Prud’hommeaux, Jose Emilio Labra Gayo, and Harold Solbrig. 2014. Shape This work was funded by the Spanish government with the BES- expressions: an RDF validation and transformation language. In Proceedings of the 10th International Conference on Semantic Systems. ACM, 32–40. 2014-068449 grant under the 4V project (TIN2013-46238-C4-2-R). [28] J Ross Quinlan. 2014. C4. 5: programs for machine learning. (2014). [29] Dan Steinberg and Phillip Colla. 2009. CART: classification and regression trees. The top ten algorithms in data mining 9 (2009), 179. REFERENCES [30] Johan AK Suykens, Tony Van Gestel, and Jos De Brabanter. 2002. Least squares [1] Ziawasch Abedjan and Felix Naumann. 2013. Improving RDF Data Through support vector machines. World Scientific. Association Rule Mining. Datenbank-Spektrum 13, 2 (01 Jul 2013), 111–120. [31] Jiao Tao, Evren Sirin, Jie Bao, and Deborah L McGuinness. 2010. Extending OWL https://doi.org/10.1007/s13222-013-0126-x with Integrity Constraints. Description Logics 573 (2010). [2] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of databases: [32] Giri Kumar Tayi and Donald P Ballou. 1998. Examining Data Quality. Commun. the logical level. (1995). ACM 41, 2 (1998), 54–57. [3] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, [33] Raphael Troncy and Giuseppe Rizzo et al. 2017. 3cixty: Building Comprehensive A Inkeri Verkamo, et al. 1996. Fast discovery of association rules. Advances Knowledge Bases for City Exploration. Web Semantics: Science, Services and in knowledge discovery and data mining 12, 1 (1996), 307–328. Agents on the 46-47, Supplement C (2017), 2 – 13. https://doi. [4] Adrien Basse, Fabien Gandon, Isabelle Mirbel, and Moussa Lo. 2010. DFS-based org/10.1016/j.websem.2017.07.002 frequent graph pattern extraction to characterize the content of RDF Triple Stores. [34] WEKA. 2013. Weka Manual for Version 3-7-8. Technical Report. WEKA. https: In Web Science Conference 2010 (WebSci10). //pdfs.semanticscholar.org/d617/d41097bdf97d994d1481adbcfe0c05a51696.pdf [5] Christopher M Bishop. 2006. Pattern recognition and machine learning. springer. [6] Peter Bloem and Gerben K. D. De Vries. 2014. Machine Learning on Linked Data, a Position Paper. 
In Proceedings of the 1st International Conference on Linked Data for Knowledge Discovery - Volume 1232 (LD4KD’14). CEUR-WS.org, Aachen, Germany, Germany, 64–68. http://dl.acm.org/citation.cfm?id=3053827.3053834 [7] Eva Blomqvist, Ziqi Zhang, Anna Lisa Gentile, Isabelle Augenstein, and Fabio Ciravegna. 2013. Statistical knowledge patterns for characterising linked data. In Proceedings of the 4th International Conference on Ontology and Semantic Web Patterns-Volume 1188. CEUR-WS. org, 1–13.
