From: AAAI-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.
What can Knowledge Representation do for Semi-Structured Data?
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini Dipartimento di Informatica e Sistemistica Universit`a di Roma “La Sapienza”
Via Salaria 113, 00198 Roma, Italy g fcalvanese,degiacomo,lenzerini @dis.uniroma1.it
Abstract data are kept. The labels of edges in the schemas are formu- D lae of a certain theory T , and the notion of a database
The problem of modeling semi-structured data is important in being coherent to a schema S is given in terms of a spe- many application areas such as multimedia data management, cial relation, called simulation, between the graph repre- biological databases, digital libraries, and data integration. senting the database and the graph representing the schema. Graph schemas (Buneman et al. 1997) have been proposed Roughly speaking, a simulation is a correspondence be-
recently as a simple and elegant formalism for representing S
semistructured data. In this model, schemas are represented tween the edges of D and those of such that, whenever D
as graphs whose edges are labeled with unary formulae of there is an edge labeled a in , there is a corresponding a a theory, and the notions of conformance of a database to edge in S labeled with a formula satisfied by (but not nec- a schema and of subsumption between two schemas are de- essarily vice-versa). The notion of simulation is less rigid fined in terms of a simulation relation. Several authors have than the usual notion of satisfaction, and suitably reflects stressed the need of extending graph schemas with various the need of dealing with less strict structures of data. types of constraints, such as edge existence and constraints In (Buneman et al. 1997), the authors point out that, for on the number of outgoing edges. In this paper we analyze several tasks related to data management, it is important to the appropriateness of various knowledge representation for- be able to reason about schemas, in particular to check sub- malisms for representing and reasoning about graph schemas sumption between two schemas, which is the task of de- extended with constraints. We argue that neither First Order Logic, nor Logic Programming nor Frame-based languages ciding whether every database conforming to one schema are satisfactory for this purpose, and present a solution based always conforms to another schema. They also present al- on very expressive Description Logics. We provide tech- gorithms for, and analyze the complexity of checking sub- niques and complexity analysis for the problem of deciding sumption in BDFS. schema subsumption and conformance in various interesting Several papers indicate that in many applications there is cases, that differ by the expressive power in the specification the need to extend the BDFS model with different types of of constraints. constraints. Indeed, in (Buneman et al. 1997) all the proper- ties of the schema are expressed in terms of the structure of the graph, and therefore, there is no possibility of specifying Introduction additional conditions, such as existence of edges or bounds on the number of edges emanating from a node, or imposing The ability to represent data whose structure is less rigid that a certain subgraph is well-founded. and strict than in conventional databases is considered a Our intuition suggests that Knowledge Representation crucial aspect in modern approaches to data modeling, and (KR) techniques should be very useful for the above pur- is important in many application areas, such as biologi- pose. After all, the problem deals with representing con- cal databases, digital libraries, data integration, and access straints, and reasoning about schemas with constraints. The to web databases (Abiteboul 1997; Buneman et al. 1997; basic goal of the work reported in this paper was to verify Christophides et al. 1994; Mendelzon, Mihaila, & Milo this intuition, and we present here the following results of 1997; Quass et al. 1995). Consider, for example, the set of our investigation: home pages designed by the faculties for a University web
site. Since different home pages may vary considerably one We analyze the appropriateness of various KR formalisms from another, it is extremely hard to describe their structure for representing and reasoning about graph schemas ex- in a rigid form such as the one imposed, say, by relational tended with constraints, and demonstrate that neither First databases. Indeed, we need structuring mechanisms that are Order Logic, nor Logic Programming nor Frame-based much more flexible than traditional data models. languages are satisfactory for this purpose.
Following (Abiteboul 1997), we define semi-structured We show that very expressive Description Logics (DLs), data as data that is neither raw, nor strictly typed as in con- such as the ones studied in (De Giacomo & Lenzerini ventional database systems. BDFS (Basic Data model For 1996; Calvanese 1996; De Giacomo & Lenzerini 1997), Semi-structured data) (Buneman et al. 1997) is a formal are the right tools for modeling and reasoning about semi- and elegant data model, based on graphs with labeled edges, structured data with constraints. In particular, we propose where information on both the values and the schema for the to express constraints in terms of DLs formulae associated
to nodes of the schema. A formula on a node u imposes
D S
Copyright c 1998, American Association for Artificial Intel- a condition that, for every database conforming to , u ligence (www.aaai.org). All rights reserved. must be satisfied by every node of D simulating . We consider languages for specifying constraints with dif- In (Buneman et al. 1997), an algorithm is presented for
ferent expressive power, and present several results on checking subsumption (and also conformance, being a T -
the corresponding reasoning problems. We show that database a special case of T -schema). The algorithm es-
adding various types of local constraints (i.e. constraints sentially looks for the greatest simulation between the nodes
O 1
m t m that impose conditions only on the edges emanating from O
of the two schemas, and works in time T ,
t x a node) does not increase the complexity of reasoning. m
where is the size of the two schemas and T is the time T On the other hand, we present an intractability result for needed to check whether a formula of size x is valid in .
the case of non-local constraints. Finally, we study the In general it is meaningful not to consider T to be part of
case where the constraints are expressed in a very pow- the input of the problem (Buneman et al. 1997). Therefore,
ALC Q
m m erful DL, namely, (De Giacomo & Lenzerini t
whenever T may be assumed to be independent of ,
m m 1997), that allows for imposing complex conditions on t
T can be replaced by a constant (e.g. when is polyno-
T S the schema, such as well-foundedness of subgraphs. We mial in the size jS j of a -schema , which is considerably
present a technique for checking subsumption in this case, smaller than jT j). showing that the problem is decidable in double exponen- If not specified otherwise, we also make the assumption
tial time. that the theory T is not part of the input to the reasoning Our presentation starts with a brief description of both problems addressed in the paper (namely, consistency and
subsumption). ALC Q
BDFS, and the description logic . ALC Q The Description Logic Preliminaries Description logics (DLs) allow one to represent a domain In this section, we describe the basic characteristics of the of interest in terms of concepts and roles. Concepts model
BDFS model for semi-structured data and the description classes of individuals, while roles model relationships be- ALC Q
tween classes. We concentrate on the DL studied ALC Q logic . in (De Giacomo & Lenzerini 1997), where a correspondence was shown with a well-known logic of programs, called The BDFS Data Model modal mu-calculus (Kozen 1983; Streett & Emerson 1989), ThedatamodelBDFS, which is the basis of our investiga- that has been recently investigated for expressing temporal
tion, is an edge-labeled graph model of semi-structured data, properties of reactive and parallel processes (Stirling 1996; ALC Q
where labels are unary formulae of a first order language Emerson 1996). can be viewed as a well-behaved L
L fragment of first-order logic with fixpoints (Park 1970; T T . The language is constituted by a set of predicates, including the equality predicate “=”, and one constant for Abiteboul, Hull, & Vianu 1995). We make use of the stan-
every element of a universe U . dard first-order notions of scope, bound and free occurrences A schema in BDFS always refers to a complete and decid- of variables, closed formulae, etc., treating and as quan-
tifiers.
U T
able theory T on .Inotherwords, is the set of first order ALC Q The primitive symbols in are atomic concepts, formulae which are true (or valid) for the elements of U ,and
(concept) variables,andatomic roles (in the following
p L it is decidable to check whether a formula in T is valid
called simply roles). Concepts are formed according to the
T j= p in T (in notation, ). following syntax:
Definition 1 A BDFS T -schema is a rooted connected graph
C ::= A j :C j C u C j nR C j X C j X 2
1 . .
L T
whose edges are labeled with unary formulae of T .A -
R n database is a rooted connected graph whose edges are la- where A denotes an atomic concept, arole, a natural
number, and X a variable, and the restriction is made that
beled with constants of T .
X C
every free occurrence of X in . is in the scope of an G
For any rooted graph G, we denote the root of by even number of negations.
C t C
1 2
oot G G Nodes G
r , the set of nodes of by ,andtheset We introduce the following abbreviations: for
::C u:C > A t:A ? :> 9R C
1 2
Edges G
of edges of G by . We denote an edge from node , for , for , . for
a
1 R C 8R C :9R :C nR C :
. , . for . , . for
v a u ! v
u to node labeled by with .
n+1 R = nR C nR C u nR C .C , . for . . ,and
In order to establish if a database is coherent with a
C :X :C [X=:X ] C [X=:X ] X . for . (where is the con- schema, or if a schema is more general than another schema,
cept obtained by substituting all free occurrences of X with
the notions of conformance and subsumption are defined as X
: ). I
follows. I
= ;
An interpretation I consists of an interpre-
I I
D T
Definition 2 A T -database conforms to a BDFS - tation domain ,andaninterpretation function ,which
I
D S
schema S , in notation , if there exists a simulation maps every atomic concept to a subset of ,andevery
I I
D S D
from to , i.e. a binary relation from the nodes of atomic role to a subset of . The presence of free
0
root D root S u u
to those of S satisfying: (1) ,(2) variables does not allow us to extend the interpretation func-
a
I
! v D
implies that for each edge u in , there exists an edge tion directly to every concept of the logic. For this reason
p
0 0
0 we introduce valuations. A valuation on an interpretation
! v S T j= pa v v
u in such that , and .
I
I is a mapping from variables to subsets of .Givena
0 0
S T S [X=E ]
Definition 3 If S and are two BDFS -schemas, then valuation , we denote by the valuation identical to
0
SvS T D [X=E ]X = E
subsumes S , in notation , if for every -database , except for the fact that .
0 0 0
S D S S S SvS I I
D implies . is equivalent to if and Let be an interpretation and a valuation on .We
0
vS I S . assign meaning to concepts of the logic by associating to
I
and an extension function , mapping concepts to subsets extending it to an interpretation by computing the exten-
I
n ;:::;n N sion of the predicates 1 on the basis of the for-
of , as follows:
OLS
mulae in F .
I I
= X X
M FOLS
Given a logical model of , if we traverse the
I I I
A = A
dge M
relation e in starting from one object satisfying the
I I I
= C :C
root predicate, we should obtain a structure that corre-
I I I
C u C = C \ C
1 2 1 2
sponds to a database conforming to S , and, on the con-
I I
nR C = fs 2 j
. trary, every database conforming to S should correspond
0 0 I 0 I
fs j s; s 2 R s 2 C gng
FOLS
and to a logical model of in which the extension of
T
I I I
X C = fE j C E g
.
[X=E ]
the root predicate is not empty. S
Consider, for example, the BDFS schema 1 with two I
Observe that C can be seen as an operator from sub-
[X=E ]
u u u u
2 1 2
nodes 1 and , and one edge from to labeled with
I I
P FOLS
E sets of to subsets of , and that, by the syntactic . The corresponding set 1 of formulae in FOL is:
restriction enforced on variables, such an operator is guar-
f 8x n x , 8y 8z edge x; y ; z P z ^ n y ; 2
anteed to be monotonic wrt set inclusion. The constructs 1
8x n x , :9y 9z edge x; y ; z g
2
C X C
X . and . denote respectively the least fixpoint and
D d e
Consider the database 1 with two nodes and and an
the greatest fixpoint of the operator. The extension of closed
e t P t T edge from d to labeled with such that is valid in .It
concepts is independent of the valuation, and therefore for
D S 1 is easy to see that 1 conforms to , and that it corresponds
closed concepts we do not consider the valuation explicitly.
dge d; e; t
to the pre-interpretation where e is true, which
ALC Q C v
A knowledge base is a finite set of axioms 1
FOLS d 2 n 1
is extended to a model of 1 such that and
C C C ALC Q
2 2
where 1 and are closed concepts of .An
e 2 n
2 . Observe that the empty database with just one node
I I
I C v C C C 2
interpretation satisfies an axiom 1 ,if .
1 2 S
and no edges also conforms to 1 and corresponds to the