Ontop: Answering SPARQL Queries Over Relational Databases
Semantic Web 0 (0) 1
IOS Press

Editor(s): Óscar Corcho, Universidad Politécnica de Madrid, Spain
Solicited review(s): Jean Paul Calbimonte, École Polytechnique Fédérale de Lausanne, Switzerland; José Luis Ambite, University of Southern California, USA; one anonymous reviewer

Diego Calvanese a, Benjamin Cogrel a, Sarah Komla-Ebri a, Roman Kontchakov b, Davide Lanti a, Martin Rezk a, Mariano Rodriguez-Muro c, and Guohui Xiao a
a Free University of Bozen-Bolzano, Italy
{calvanese,bcogrel,sakomlaebri,dlanti,mrezk,xiao}@inf.unibz.it
b Birkbeck, University of London, UK
[email protected]
c IBM TJ Watson, US
[email protected]

Abstract. We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance with all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.

Keywords: Ontop, OBDA, Databases, RDF, SPARQL, Ontologies, R2RML, OWL

1570-0844/0-1900/$27.50 © 0 – IOS Press and the authors. All rights reserved

1. Introduction

Over the past 20 years we have moved from a world where most companies operated with a single all-knowing, self-contained, central database to a world where companies buy and sell their data, interact with several data sources, and analyze patterns and statistics coming from them. The focus is shifting from obtaining information to finding the right information. It has always been the case that information is power, but today attention rather than information becomes the scarce resource, and those who can distinguish valuable information from background clutter gain power [28]. To separate the wheat from the chaff, companies need a comprehensive understanding of their data and the ability to cope with diversity in the data.

Since the mid 2000s, Ontology-Based Data Access (OBDA) has become a popular approach for tackling this challenge [42]. In OBDA, a conceptual layer is provided in the form of an ontology that defines a shared vocabulary, models the domain, hides the structure of the data sources, and enriches incomplete data with background knowledge. Then, queries are posed over this high-level conceptual view, and the users no longer need an understanding of the data sources, the relation between them, or the encoding of the data. Queries are translated by the OBDA system into queries over potentially very large (usually relational and federated) data sources. The ontology is connected to the data sources through a declarative specification given in terms of mappings that relate symbols in the ontology (classes and properties) to (SQL) views over the data. The W3C standard R2RML [17] was created with the goal of providing a language for specifying mappings in the OBDA setting. The ontology together with the mappings exposes a virtual RDF graph, which can be queried using SPARQL, the standard query language in the Semantic Web community. This virtual RDF graph can be materialized, generating RDF triples to be used with RDF triplestores, or alternatively it can be kept virtual and queried only during query execution. The virtual approach avoids the cost of materialization and can profit from more than 30 years' maturity of relational database systems (efficient query answering, security, robust transaction support, etc.).

To illustrate these concepts and different notions in this article, we will use the following running example. All the material required to run this example in the OBDA system Ontop (and a complementary tutorial) can be found online¹.

Example 1.1 (Hospital Database). We consider a hospital database with a single table tbl_patient that contains information on lung cancer patients. The table has 4 attributes: the patient identifier (pid), the patient's name (name), the type of cancer (type), and its stage (stage). The lung cancer can be of two types: Non-Small Cell Lung Carcinoma (NSCLC) and Small Cell Lung Carcinoma (SCLC), which are encoded in the table by a boolean value type as follows:

– false for NSCLC and true for SCLC.

The stage of the cancer is encoded by a positive integer value stage as follows:

– NSCLC: 1–6 for stages I, II, III, IIIa, IIIb, and IV, respectively;
– SCLC: 1 and 2 for stages Limited and Extensive, respectively.

Our sample table contains the following data:

pid   name     type    stage
1     'Mary'   false   4
2     'John'   true    1

Suppose we need a simple piece of information from this database: "Give me the names of patients with a tumor of stage IIIa". Even this simple query in this tiny database already presents some challenges, because in order to formulate the query and to understand and analyze the results we need to know how the information is encoded in the data. In this paper, we describe how to use the Ontop system to address this challenge by enhancing the database with a semantic layer.

We present the OBDA system Ontop², a mature open-source system, which is currently being used in a number of projects. Ontop supports all the W3C recommendations related to OBDA: OWL 2 QL, R2RML, SPARQL, SWRL, and the OWL 2 QL entailment regime in SPARQL. The system is available as a Protégé plugin, a SPARQL endpoint through Sesame Workbench, and a Java library supporting OWL API and Sesame API.

The structure of the article is as follows. Section 2 presents a high-level overview of the architecture of Ontop. Section 3 surveys additional tools that can be used with Ontop for creating, deploying, and querying OBDA systems. Section 4 describes the SPARQL query answering techniques implemented in Ontop. Section 5 outlines the applications of Ontop, in particular the Statoil and Siemens use cases in the context of the Optique EU project. Related SPARQL query answering systems are surveyed in Section 6. Section 7 is a retrospective on the development of Ontop over the past five years. Finally, Section 8 concludes the article.

2. Architecture of Ontop

Ontop is an open-source³ OBDA system released under the Apache license, developed at the Free University of Bozen-Bolzano. The Ontop system exposes relational databases as virtual RDF graphs by linking the terms (classes and properties) in the ontology to the data sources through mappings. This virtual RDF graph can then be queried using SPARQL by translating the SPARQL queries into SQL queries over the relational databases. This translation process is transparent to the user.

The architecture of Ontop, which is illustrated in Fig. 1, can be divided into four layers: (i) the inputs, i.e., the domain-specific artifacts such as the ontology, mappings, database, and queries; (ii) the core of the system in charge of query translation, optimization, and execution; (iii) the APIs exposing standard Java interfaces to users of the system; and (iv) the applications that allow end-users to execute SPARQL queries over databases. We explore each of these components in turn.
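The encoding issue raised in Example 1.1 can be made concrete with a small sketch. It assumes an in-memory SQLite database standing in for the hospital database, with the boolean column stored as 0/1; the helper names `build_sample_db` and `patients_with_nsclc_stage` are hypothetical and not part of Ontop. It shows the hand-written SQL a user needs when no semantic layer hides the encoding, not the SQL that Ontop itself generates.

```python
import sqlite3

# NSCLC stage labels in the order given in Example 1.1 (codes 1 to 6).
NSCLC_STAGES = ["I", "II", "III", "IIIa", "IIIb", "IV"]


def build_sample_db():
    """Create the sample tbl_patient table (assumption: SQLite, booleans as 0/1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl_patient ("
        "pid INTEGER PRIMARY KEY, name TEXT, type INTEGER, stage INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
        [(1, "Mary", 0, 4), (2, "John", 1, 1)],  # type: 0 = NSCLC, 1 = SCLC
    )
    return conn


def patients_with_nsclc_stage(conn, label):
    """Return the names of NSCLC patients whose tumor has the given stage label."""
    code = NSCLC_STAGES.index(label) + 1  # the user must know the integer encoding
    rows = conn.execute(
        "SELECT name FROM tbl_patient WHERE type = 0 AND stage = ? ORDER BY pid",
        (code,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(patients_with_nsclc_stage(build_sample_db(), "IIIa"))
```

Note that the query must combine two encoded columns: stage code 4 means IIIa only when type is false (NSCLC), which is exactly the knowledge the ontology and mappings are meant to capture.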
¹ https://github.com/ontop/ontop-examples/tree/master/swj-2015
² http://ontop.inf.unibz.it
³ http://github.com/ontop/ontop

Fig. 1. Architecture of the Ontop system. The figure shows four layers, from top to bottom: the application layer (Protégé; Sesame Workbench & SPARQL endpoint; Optique Platform), the API layer (OWL API; Sesame Storage And Inference Layer (SAIL) API), the Ontop core (Ontop SPARQL Query Answering Engine (Quest), built on OWL API (OWL parser), Sesame API (SPARQL parser), R2RML API, and JDBC), and the inputs (OWL 2 QL ontologies, R2RML mappings, relational databases, SPARQL queries).

2.1. Inputs: Ontology, Mappings, Queries, and Databases

To the best of our knowledge, Ontop is the first OBDA system that supports all the W3C recommendations related to OBDA: OWL 2 QL, R2RML, SPARQL, SWRL, and the OWL 2 QL entailment regime in SPARQL⁴. In addition, it supports all major commercial and open-source relational databases.

Ontology. Ontop uses RDFS [8] and OWL 2 QL [39] as ontology languages. OWL 2 QL is based on the DL-Lite family of lightweight description logics [11,3], which guarantees that queries over the ontology can be rewritten into equivalent queries over the data alone. Recently Ontop has been extended to support also a fragment of SWRL [61].

Example 2.1. The following ontology captures the domain knowledge of our running example. It describes the concepts of cancer and cancer patient with the following OWL axioms:

    :NSCLC rdfs:subClassOf :LungCancer .
    :SCLC rdfs:subClassOf :LungCancer .
    :LungCancer rdfs:subClassOf :Neoplasm .
    :hasNeoplasm rdfs:domain :Patient .
    :hasNeoplasm rdfs:range :Neoplasm .
    :hasName a owl:DatatypeProperty .

classes of :LungCancer (that is, both are types of lung cancer), which in turn is a subclass of :Neoplasm. The object property :hasNeoplasm has class :Patient as its domain and :Neoplasm as its range (in other words, it relates patients to neoplasms). We also have a datatype property :hasName and an object property :hasStage.

Mappings. Ontop supports two mapping languages: the W3C RDB2RDF Mapping Language (R2RML), which is a widely used standard; and the native Ontop mapping language, which is easier to learn and use. Ontop includes tools for converting native mappings into R2RML mappings and vice-versa. Intuitively, a mapping assertion consists of a source, which is an SQL query retrieving values from the database, and a target, which constructs RDF triples with values from the source.
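The source/target structure of a mapping assertion can be illustrated with a minimal sketch. It is not Ontop's mapping syntax or API: the dictionary layout, the `:db1/{pid}` IRI template, and the function names are assumptions made for illustration. The sketch also materializes the triples for readability, whereas Ontop keeps the RDF graph virtual.

```python
import sqlite3

# Hypothetical mapping assertions: each pairs a source SQL query with a
# target triple template whose {placeholders} are filled from each result row.
MAPPINGS = [
    {
        "source": "SELECT pid FROM tbl_patient",
        "target": (":db1/{pid}", "rdf:type", ":Patient"),
    },
    {
        "source": "SELECT pid, name FROM tbl_patient",
        "target": (":db1/{pid}", ":hasName", '"{name}"'),
    },
]


def build_sample_db():
    """Sample hospital table (assumption: SQLite, booleans stored as 0/1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl_patient ("
        "pid INTEGER PRIMARY KEY, name TEXT, type INTEGER, stage INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tbl_patient VALUES (?, ?, ?, ?)",
        [(1, "Mary", 0, 4), (2, "John", 1, 1)],
    )
    return conn


def apply_mappings(conn, mappings):
    """Run each source query and instantiate its target template per row."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    triples = []
    for mapping in mappings:
        for row in conn.execute(mapping["source"]):
            values = dict(row)
            triples.append(
                tuple(part.format(**values) for part in mapping["target"])
            )
    return triples


if __name__ == "__main__":
    for subject, predicate, obj in apply_mappings(build_sample_db(), MAPPINGS):
        print(subject, predicate, obj, ".")
```

Each source row yields one triple per assertion, so the two sample patients produce four triples in total: one class membership and one name per patient.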