Ontop: Answering SPARQL Queries Over Relational Databases
Undefined 0 (0) 1
IOS Press

Diego Calvanese a, Benjamin Cogrel a, Sarah Komla-Ebri a, Roman Kontchakov b, Davide Lanti a, Martin Rezk a, Mariano Rodriguez-Muro c, and Guohui Xiao a
a Free University of Bozen-Bolzano, {calvanese,bcogrel,sakomlaebri,dlanti,mrezk,xiao}@inf.unibz.it
b Birkbeck, University of London, [email protected]
c IBM TJ Watson, [email protected]

Abstract. In this paper we present Ontop, an open-source Ontology Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA that avoids materializing triples and that is implemented through query rewriting techniques, extensive optimizations exploiting all elements of the OBDA architecture, its compliance with all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.

Keywords: Ontop, OBDA, Databases, RDF, SPARQL, Ontologies, R2RML, OWL

0000-0000/0-1900/$00.00 © 0 – IOS Press and the authors. All rights reserved

1. Introduction

Over the past 20 years we have moved from a world where most companies had one all-knowing, self-contained central database to a world where companies buy and sell their data, interact with several data sources, and analyze patterns and statistics coming from all of them. The challenge is shifting from obtaining information to finding the right information. It has always been the case that information is power, but today attention rather than information becomes the scarce resource, and those who can distinguish valuable information from background clutter gain power [16]. To separate the wheat from the chaff, companies need a comprehensive understanding of their data and the ability to cope with diversity in the data.

Since the mid-2000s, Ontology-Based Data Access (OBDA) has become a popular approach to tackling this problem [27]. In OBDA, a conceptual layer is given in the form of an ontology that defines a shared vocabulary, models the domain, hides the structure of the data sources, and can enrich incomplete data with background knowledge. Then, queries are posed over this high-level conceptual view, and the users no longer need an understanding of the data sources, the relation between them, or the encoding of the data. Queries are translated by the OBDA system into queries over potentially very large (usually relational and federated) data sources. The ontology is connected to the data sources through a declarative specification given in terms of mappings that relate symbols in the ontology (classes and properties) to (SQL) views over data. The W3C standard R2RML [12] was created with the goal of providing a language for the specification of mappings in the OBDA setting. The ontology together with the mappings exposes a virtual RDF graph, which can be queried using SPARQL, the standard query language in the Semantic Web community. These virtual RDF graphs can be materialized, generating RDF triples that can be used with RDF triplestores, or alternatively they can be kept virtual and queried only during query execution. The virtual approach avoids the cost of materialization and can profit from more than 30 years' maturity of relational systems (efficient query answering, security, robust transaction support, etc.).

To illustrate these concepts and the different notions in this paper, we will use the following running example. All the material needed to run this example in the OBDA system Ontop (and a complementary tutorial) can be found online (https://github.com/ontop/ontop-examples/tree/master/swj-2015).

Example 1.1 (Hospital Database). We consider a hospital database with a single table tbl_patient that contains information about lung cancer patients. The table has 4 attributes: the patient identifier (pid), his/her name, the type of cancer (tumor) and its stage. The lung cancer can be of two types: Non-Small Cell Lung Carcinoma (NSCLC) and Small Cell Lung Carcinoma (SCLC), which are encoded in the table by a boolean value type as follows:
– false for NSCLC and true for SCLC.
The stage of the cancer is encoded by a positive integer value stage as follows:
– 1–6 for NSCLC stages I, II, III, IIIa, IIIb and IV,
– 7–8 for SCLC stages Limited and Extensive.
Finally, our sample table contains the following data:

pid   name     type    stage
1     'Mary'   false   4
2     'John'   true    7

Suppose we need a simple piece of information from this database: “Give me the names of patients with a tumor of stage IIIa”. Even this simple query in this tiny database already presents some challenges, since to create the query and to understand and analyze the results we need to know how the information is encoded in the data. In the following sections we describe how to use the Ontop system to address this challenge by enhancing the database with a semantic layer.

In this paper we present the OBDA system Ontop (http://ontop.inf.unibz.it/), a mature open-source system, which is currently being used in a number of projects. Ontop supports all the W3C recommendations related to OBDA: OWL 2 QL, R2RML, SPARQL, SWRL, and the OWL 2 QL entailment regime in SPARQL. The system is available as a Protégé 4 plugin, a SPARQL endpoint through Sesame Workbench, and a Java library supporting OWL API and Sesame API.

The structure of the paper is as follows. Section 2 presents a high-level overview of the architecture of Ontop. Section 3 describes the SPARQL query answering techniques implemented in Ontop. Section 4 outlines the applications of Ontop, in particular the Statoil and Siemens use cases in the context of the Optique project. Related SPARQL query answering systems are surveyed in Section 5. Section 6 is a retrospective on the development of Ontop over the past five years. Finally, Section 7 concludes the paper.

2. Architecture of Ontop

Ontop is an open-source (http://github.com/ontop/ontop/) OBDA system released under the Apache license, developed at the Free University of Bozen-Bolzano. The Ontop system exposes relational databases as virtual RDF graphs by linking the terms (classes and properties) in the ontology to the data sources through mappings. This virtual RDF graph can then be queried using SPARQL, by translating the SPARQL queries into SQL queries over the relational databases. This translation process is transparent to the user.

The architecture of Ontop, which is illustrated in Fig. 1, can be divided into four layers: (i) the inputs, i.e., the domain-specific artifacts such as the ontology, database, mappings, and queries; (ii) the core of the system in charge of query translation, optimization, and execution; (iii) the APIs exposing standard Java interfaces to users of the system; and (iv) the applications that allow end-users to execute SPARQL queries over databases.

We explore each of these components in turn.

2.1. Inputs: Ontology, Mappings, Queries, and Databases

To the best of our knowledge, Ontop is the first OBDA system that supports all the W3C recommendations related to OBDA: OWL 2 QL, R2RML, SPARQL, SWRL, and the OWL 2 QL entailment regime in SPARQL (SWRL and the OWL 2 QL entailment regime are currently supported experimentally). In addition, it supports all
the major commercial and open-source relational databases.

[Figure 1: layered architecture diagram. Application layer: Protege, Sesame Workbench & SPARQL Endpoint, Optique Platform. API layer: OWL-API, Sesame Storage And Inference Layer (SAIL) API. Ontop Core: Ontop SPARQL Query Answering Engine (Quest), with OWL-API (OWL Parser), Sesame API (SPARQL Parser), R2RML API, and JDBC. Inputs: OWL 2 QL Ontologies, R2RML Mappings, Relational Databases, SPARQL Queries.]
Fig. 1. Architecture of the Ontop system

Ontology. Ontop allows for RDFS [6] and OWL 2 QL [24] as ontology languages. OWL 2 QL is based on the DL-Lite family of lightweight description logics [9,2], which guarantees that queries over the ontology can be rewritten into equivalent queries over the databases. Recently Ontop has been extended to also support a fragment of SWRL [43].

Example 2.1. The following ontology captures the domain knowledge of our running example. It describes the concepts of cancer and cancer patient with the following OWL axioms:

    :NSCLC rdfs:subClassOf :LungCancer .
    :SCLC rdfs:subClassOf :LungCancer .
    :LungCancer rdfs:subClassOf :Neoplasm .
    :hasNeoplasm rdfs:domain :Patient .
    :hasNeoplasm rdfs:range :Neoplasm .
    :hasName rdf:type owl:DatatypeProperty .
    :hasStage rdf:type owl:ObjectProperty .

In particular, classes :NSCLC and :SCLC are both subclasses of :LungCancer (that is, they both are types of lung cancer), which in turn is a subclass of :Neoplasm.

mapping language, which is easier to learn and use. Ontop allows users to convert native mappings into R2RML mappings and vice-versa. Intuitively, a mapping assertion consists of a source (an SQL query retrieving values from the database) and a target (defining RDF triples with values from the source).

Example 2.2. The ontology in Example 2.1 can be populated from the database in Example 1.1 by means of the following mappings (each target is followed by its source query):

    :db1/{pid} rdf:type :Patient .
        SELECT pid FROM tbl_patient
    :db1/neoplasm/{pid} rdf:type :NSCLC .
        SELECT pid, stage FROM tbl_patient WHERE type = false
    :db1/neoplasm/{pid} rdf:type :SCLC .
        SELECT pid, stage FROM tbl_patient WHERE type = true
    :db1/{pid} :hasName {name} .
        SELECT pid, name FROM tbl_patient
    :db1/{pid} :hasNeoplasm :db1/neoplasm/{pid} .
        SELECT pid FROM tbl_patient
    :db1/neoplasm/{pid} :hasStage :stage-IIIa .
        SELECT pid FROM tbl_patient WHERE stage = 4

(To ease readability we use a slightly simplified
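
To make the source/target reading of the mappings in Example 2.2 concrete, the following Python sketch evaluates each source query against the table of Example 1.1 and instantiates the corresponding target template, yielding RDF triples. It is only an illustration under stated assumptions: it materializes the triples, whereas Ontop keeps the graph virtual and rewrites SPARQL into SQL; an in-memory SQLite database stands in for the hospital database (with false/true stored as 0/1); and the names MAPPINGS, build_database, materialize, and patients_with_stage are hypothetical helpers, not part of Ontop.

```python
import sqlite3

# Mapping assertions of Example 2.2 as (subject, predicate, object, source SQL).
# "{col}" placeholders are filled from the columns returned by the source query.
# Assumption: SQLite has no boolean column type, so false/true are stored as 0/1.
MAPPINGS = [
    (":db1/{pid}", "rdf:type", ":Patient",
     "SELECT pid FROM tbl_patient"),
    (":db1/neoplasm/{pid}", "rdf:type", ":NSCLC",
     "SELECT pid, stage FROM tbl_patient WHERE type = 0"),
    (":db1/neoplasm/{pid}", "rdf:type", ":SCLC",
     "SELECT pid, stage FROM tbl_patient WHERE type = 1"),
    (":db1/{pid}", ":hasName", "{name}",
     "SELECT pid, name FROM tbl_patient"),
    (":db1/{pid}", ":hasNeoplasm", ":db1/neoplasm/{pid}",
     "SELECT pid FROM tbl_patient"),
    (":db1/neoplasm/{pid}", ":hasStage", ":stage-IIIa",
     "SELECT pid FROM tbl_patient WHERE stage = 4"),
]


def build_database():
    """Create the tbl_patient table of Example 1.1 in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows reading columns by name
    conn.execute(
        "CREATE TABLE tbl_patient "
        "(pid INTEGER PRIMARY KEY, name TEXT, type INTEGER, stage INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
        [(1, "Mary", 0, 4), (2, "John", 1, 7)],
    )
    return conn


def materialize(conn):
    """Run every source query and instantiate its target triple template."""
    triples = set()
    for subj, pred, obj, sql in MAPPINGS:
        for row in conn.execute(sql):
            values = {col: row[col] for col in row.keys()}
            triples.add((subj.format(**values), pred, obj.format(**values)))
    return triples


def patients_with_stage(triples, stage):
    """Names of patients that have a neoplasm with the given stage."""
    names = set()
    for subj, pred, obj in triples:
        if pred == ":hasNeoplasm" and (obj, ":hasStage", stage) in triples:
            names.update(n for (s, p, n) in triples
                         if s == subj and p == ":hasName")
    return sorted(names)


if __name__ == "__main__":
    graph = materialize(build_database())
    print(patients_with_stage(graph, ":stage-IIIa"))
```

For the two sample rows, the question from Example 1.1 ("names of patients with a tumor of stage IIIa") is answered without the caller knowing that stage IIIa is stored as the integer 4; only the mapping carries that encoding.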