Conjunctive Query Answering in the Description Logic EL Using a Relational System

Carsten Lutz - Universität Bremen, Germany
David Toman - University of Waterloo, Canada
Frank Wolter - University of Liverpool, UK

Presented by: Jasmeet Jagdev

Background

• One of the main applications of ontologies is data access
• Ontologies formalize conceptual information about the data
• Calvanese et al. have argued that true scalability of CQ answering over DL ontologies can only be achieved by making use of RDBMSs
• This is not straightforward!

• RDBMSs are unaware of TBoxes (the DL mechanism for storing conceptual information) and adopt the closed-world semantics
• In contrast, ABoxes (the DL mechanism for storing data) and the associated ontologies employ the open-world semantics

Introduction

• An approach for using RDBMSs for CQ answering over DL ontologies
• We apply it to an extension of the EL family of DLs
• EL is widely used as an ontology language for large-scale biomedical ontologies such as SNOMED CT and NCI

• EL allows
  – Concept intersection
  – Existential restrictions
• ELH⊥^dr = EL + bottom concept, role inclusions, and domain and range restrictions

Main Idea

• To incorporate the consequences of the TBox T into the relational instance corresponding to the given ABox A
• To introduce the idea of combined first-order (FO) rewritability
• Combined FO rewritability is possible only if:
  – A and T can be rewritten into an FO structure
  – q and T can be rewritten into an FO query q*

• Properties of this approach:
  – It applies to DLs for which the data complexity of CQ answering is PTIME-complete, such as ELH⊥^dr
  – For ELH⊥^dr, the rewriting step can be carried out in polynomial time and produces only a polynomial blowup

Notations

• In ELH⊥^dr, concepts are built according to the rule:

  C ::= A | ⊤ | ⊥ | C ⊓ D | ∃r.C

  – A ranges over concept names taken from the set N_C
  – r ranges over role names taken from the set N_R
  – C, D range over concepts
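
The grammar above maps naturally onto a small algebraic data type. The following Python sketch is only an illustration: the class names (`Name`, `Top`, `Bottom`, `And`, `Exists`), the helpers `show` and `subconcepts`, and the example concept and role names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Name:
    """A concept name A taken from N_C."""
    name: str


@dataclass(frozen=True)
class Top:
    """The top concept."""


@dataclass(frozen=True)
class Bottom:
    """The bottom concept."""


@dataclass(frozen=True)
class And:
    """Concept intersection of two concepts."""
    left: "Concept"
    right: "Concept"


@dataclass(frozen=True)
class Exists:
    """Existential restriction over a role name from N_R."""
    role: str
    filler: "Concept"


Concept = Union[Name, Top, Bottom, And, Exists]


def show(c: Concept) -> str:
    """Render a concept in the usual DL notation."""
    if isinstance(c, Name):
        return c.name
    if isinstance(c, Top):
        return "\u22a4"
    if isinstance(c, Bottom):
        return "\u22a5"
    if isinstance(c, And):
        return f"({show(c.left)} \u2293 {show(c.right)})"
    if isinstance(c, Exists):
        return f"\u2203{c.role}.{show(c.filler)}"
    raise TypeError(f"not a concept: {c!r}")


def subconcepts(c: Concept) -> set:
    """Return c together with every concept nested inside it."""
    result = {c}
    if isinstance(c, And):
        result |= subconcepts(c.left) | subconcepts(c.right)
    elif isinstance(c, Exists):
        result |= subconcepts(c.filler)
    return result


# Hypothetical example concept: Nerve AND (EXISTS partOf.Organ)
example = And(Name("Nerve"), Exists("partOf", Name("Organ")))
print(show(example))
```

Frozen dataclasses are hashable, so concepts can be collected in sets, which is convenient when gathering the subconcepts of a whole TBox.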

• A TBox is a finite set of
  – concept inclusions C ⊑ D
  – role inclusions r ⊑ s
  – domain restrictions dom(r) ⊑ C
  – range restrictions ran(r) ⊑ C
• An ABox is a finite set of
  – concept assertions A(a)
  – role assertions r(a, b)
  where a, b are individual names taken from the set N_I
• A knowledge base is a pair (T, A) with a TBox T and an ABox A

• The sets N_V (of variables) and N_I (of individual names) together form the set N_T of terms
• A first-order (FO) query q is a first-order formula built from N_T and the unary and binary predicates from N_C and N_R
• We write q = φ(v), where φ is an FO formula and v = v1, ..., vk are its free variables, called the answer variables
• q is k-ary if it has k answer variables

• A conjunctive query (CQ) is an FO query of the form q = ∃u.ψ(u, v), where
  – ψ is a conjunction of concept atoms A(t) and role atoms r(t, t')
  – u are the quantified variables and v the answer variables of q

• We use:
  – var(q): the set of all variables in u and v
  – qvar(q): the set of quantified variables
  – avar(q): the set of answer variables
  – term(q): the set of terms in q

Prime Concept

• For a given ELH⊥^dr knowledge base K and a k-ary CQ q, cert(q, K) denotes the set of all certain answers to q over K
• A certain answer is a tuple (a1, ..., ak) such that a1, ..., ak occur in K and I ⊨ q[a1, ..., ak] for each model I of K
• I ⊨ q[a1, ..., ak] means that I satisfies q with vi assigned to ai^I, 1 ≤ i ≤ k

ABox Rewriting

• Extension of the ABox to a canonical model of the knowledge base
  – K = (T, A) is an ELH⊥^dr knowledge base
  – sub(T) = the set of subconcepts of concepts occurring in T
  – rol(T) = role names in T
  – Ind(A) = individual names in A
  – ran_T(r) denotes the (unique) concept C with ran(r) ⊑ C ∈ T
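
The satisfaction check behind certain answers (does an interpretation satisfy q under a given assignment?) can be sketched for a single finite interpretation. The code below is a brute-force illustration, not the method of the paper, and it does not compute certain answers, which quantify over all models. The assumptions are labeled: variables are strings starting with `?`, every individual name is interpreted as itself, and the dictionary layout of `interp` and the function names are hypothetical.

```python
from itertools import product


def is_var(term):
    """Assumption: variables are written with a leading '?'."""
    return term.startswith("?")


def evaluate(concept_atoms, role_atoms, answer_vars, interp):
    """Return all answer tuples of a CQ over one finite interpretation.

    concept_atoms: list of (A, t); role_atoms: list of (r, t, t2).
    interp: {"domain": set, "concepts": {A: set}, "roles": {r: set of pairs}}.
    Every answer variable is assumed to occur in some atom.
    """
    variables = sorted(
        {t for _, t in concept_atoms if is_var(t)}
        | {t for _, s, u in role_atoms for t in (s, u) if is_var(t)}
    )
    domain = sorted(interp["domain"])
    answers = set()
    # Try every assignment of domain elements to the variables of the query.
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))

        def val(term):
            # Individuals denote themselves (assumption stated above).
            return assignment[term] if is_var(term) else term

        concepts_ok = all(
            val(t) in interp["concepts"].get(a, set()) for a, t in concept_atoms
        )
        roles_ok = all(
            (val(s), val(u)) in interp["roles"].get(r, set())
            for r, s, u in role_atoms
        )
        if concepts_ok and roles_ok:
            answers.add(tuple(assignment[v] for v in answer_vars))
    return answers


# Hypothetical interpretation and query q(?x) = EXISTS ?y. Nerve(?x), partOf(?x, ?y), Organ(?y)
interp = {
    "domain": {"a", "b", "x1"},
    "concepts": {"Nerve": {"a", "x1"}, "Organ": {"b"}},
    "roles": {"partOf": {("a", "b")}},
}
result = evaluate(
    [("Nerve", "?x"), ("Organ", "?y")], [("partOf", "?x", "?y")], ("?x",), interp
)
print(result)
```

The enumeration is exponential in the number of variables; a relational engine evaluates the same conjunction with joins instead.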

Canonical Model I_K

Problems

• This situation can be detected by observing that the object x_{T,A} is not reachable from a by a role chain in I_K
• This deficiency of I_K is easily repaired by restricting it to elements reachable from some a^{I_K} with a ∈ Ind(A)
• A path in I is a finite sequence d0 r1 d1 ... rn dn, n ≥ 0, where d0 = a^I for some a ∈ Ind(A) and (di, di+1) ∈ r_{i+1}^I for all i < n
• paths_A(I) denotes the set of all paths in I, and tail(p) denotes the last element dn of a path p
• I_K^r denotes the restriction of I_K to those d ∈ Δ^{I_K} for which there exists a p ∈ paths_A(I_K) such that d = tail(p)
• Then I_K^r provides the correct certain answers to the query q
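
The restriction to elements that are the tail of some path amounts to a graph search starting from the ABox individuals. A minimal sketch, assuming the interpretation is given as Python sets and that each individual is interpreted as itself; the function name and data layout are hypothetical.

```python
from collections import deque


def restrict_to_reachable(domain, roles, abox_individuals):
    """Keep only elements reachable from ABox individuals via role edges.

    roles: {role name: set of (d, e) pairs}.
    Returns (reachable elements, roles restricted to those elements).
    """
    # Merge all roles into one successor map; the role name does not matter
    # for reachability.
    successors = {}
    for pairs in roles.values():
        for d, e in pairs:
            successors.setdefault(d, set()).add(e)

    # Paths of length 0 make every ABox individual reachable.
    reachable = set(abox_individuals) & set(domain)
    queue = deque(reachable)
    while queue:
        d = queue.popleft()
        for e in successors.get(d, ()):
            if e not in reachable:
                reachable.add(e)
                queue.append(e)

    restricted = {
        r: {(d, e) for d, e in pairs if d in reachable and e in reachable}
        for r, pairs in roles.items()
    }
    return reachable, restricted


# Hypothetical example: x2 is an auxiliary element no individual can reach.
elements, kept_roles = restrict_to_reachable(
    {"a", "x1", "x2"},
    {"r": {("a", "x1")}, "s": {("x2", "x2")}},
    {"a"},
)
print(sorted(elements))
```

Breadth-first search visits each element and edge once, so the restriction is linear in the size of the interpretation.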

• This problem can be overcome by replacing I_K^r with its unraveling into a less constrained, tree-like model
• U_K is the unraveling of I_K^r; the (A, R)-unraveling J of an interpretation I is defined as follows: [definition omitted]
• U_K can be infinite
• We therefore work with I_K^r and instead rewrite the query q to q* such that

  U_K ⊨ q[a1, ..., ak] iff I_K^r ⊨ q*[a1, ..., ak]

  for all a1, ..., ak ∈ Ind(A)

Query Rewriting

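• q*_R contains one additional unary predicate, Aux
• Aux(x) identifies the auxiliary elements of I_K^r, i.e. those that do not stem from individuals in Ind(A)
• ∼_q denotes the smallest relation on term(q) that includes the identity relation, is transitive, and satisfies a closure condition on the role atoms of q [formula omitted]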
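
A relation of this kind can be computed as a least fixpoint. The sketch below is hedged: the slide text does not preserve the closure condition, so the code assumes the condition "if r1(s, t) and r2(s', t') are atoms of q and t ∼_q t', then s ∼_q s'". That condition, the function names, and the example query are assumptions made for illustration.

```python
def sim_q(terms, role_atoms):
    """Least relation on terms containing identity, closed under
    transitivity and under the ASSUMED closure condition:
    r1(s, t), r2(s2, t2) in q and t ~ t2  implies  s ~ s2.

    role_atoms: list of (role, s, t) triples.
    """
    rel = {(t, t) for t in terms}
    changed = True
    while changed:
        new = set()
        # Assumed closure condition over all pairs of role atoms.
        for _, s, t in role_atoms:
            for _, s2, t2 in role_atoms:
                if (t, t2) in rel:
                    new.add((s, s2))
        # Transitivity.
        for a, b in rel:
            for c, d in rel:
                if b == c:
                    new.add((a, d))
        changed = not new <= rel
        rel |= new
    return rel


def equivalence_classes(terms, rel):
    """Group terms by the relation (which is an equivalence here)."""
    return {frozenset(u for u in terms if (t, u) in rel) for t in terms}


# Hypothetical query: r(x, z), s(y, z), p(u, x), p2(w, y).
# x and y share the successor z, so they are merged; then u and w follow.
atoms = [("r", "x", "z"), ("s", "y", "z"), ("p", "u", "x"), ("p2", "w", "y")]
terms = {"x", "y", "z", "u", "w"}
print(equivalence_classes(terms, sim_q(terms, atoms)))
```

The quadratic loops are fine for queries, which are small compared with the data.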

• For any equivalence class ζ of ∼_q, the sets pre(ζ) and in(ζ) are defined [formulas omitted]
• Fork_= is the set of pairs (pre(ζ), ζ)
• Fork_≠ is the set of variables v ∈ qvar(q) such that there is no implicant of in([v])

• For a CQ q, the rewritten query q*_R is defined as [formula omitted]

Transformed Queries

Implementation and Experiments

• Based on the NCI Thesaurus, a well-known ontology from the biomedical domain
• The extracted EL-TBox contains
  – 65K primitive concept names
  – 70 primitive roles
  – 70K concept inclusions and concept definitions

• The auxiliary part of the canonical model, which is independent of the ABox, consists of
  – 702K concept assertions
  – 171K role assertions
• IBM DB2 was used as the DBMS

• We have used only two relations to represent I_K^r
  – acbox (conceptid, indid)
  – arbox (roleid, domain-indid, range-indid)
• Where
  – conceptid and roleid are numerical identifiers for concept names and role names
  – indid, domain-indid, and range-indid are numerical identifiers for individuals from N_I ∪ N_I^aux
  – acbox represents concept memberships
  – arbox represents role memberships
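
The two-relation layout can be tried out on a small scale. The sketch below uses Python's built-in `sqlite3` module as a stand-in for DB2; the identifiers, the inserted rows, and the function names are made up. Two assumptions are worth flagging: column names use underscores instead of hyphens so they need no quoting, and auxiliary individuals are given non-positive identifiers so that excluding them reduces to `indid > 0`.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with the two relations and toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE acbox (conceptid INTEGER, indid INTEGER);
        CREATE TABLE arbox (roleid INTEGER, domain_indid INTEGER, range_indid INTEGER);
        """
    )
    # Hypothetical data: concept 1 holds for individual 10 and for the
    # auxiliary element -1; concept 2 holds for 11 and for auxiliary -2.
    conn.executemany(
        "INSERT INTO acbox VALUES (?, ?)",
        [(1, 10), (1, -1), (2, 11), (2, -2)],
    )
    # Role 7 links 10 to 11 and the auxiliary elements -1 to -2.
    conn.executemany(
        "INSERT INTO arbox VALUES (?, ?, ?)",
        [(7, 10, 11), (7, -1, -2)],
    )
    conn.commit()
    return conn


def named_members(conn, conceptid):
    """Members of a concept that are not auxiliary (assumed: indid > 0)."""
    rows = conn.execute(
        "SELECT indid FROM acbox WHERE conceptid = ? AND indid > 0 ORDER BY indid",
        (conceptid,),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
print(named_members(conn, 1))
```

Encoding the auxiliary flag in the sign of the identifier avoids a third relation for Aux, at the cost of fixing the numbering scheme up front.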

• Sample query: q = Nerve(x) ∧ ¬Aux(x)
• Equivalent SQL statement:

  select indid from acbox where conceptid=141723 and indid > 0

Results

• Simple chains (Q1)
• Star queries (Q2, Q3, Q4)
• Cyclic queries (Q5)

Conclusion

• A novel approach to CQ answering in DLs using RDBMSs is provided
• It can be used for DLs for which the data complexity is between LOGSPACE and PTIME
• One drawback of this approach is the blowup of the data, which is polynomial but still considerable on large data sets
• Future work could reduce this blowup by incorporating the TBox partly into the data and partly into the query