From: AAAI-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.

What can Knowledge Representation do for Semi-Structured Data?

Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini Dipartimento di Informatica e Sistemistica Universit`a di Roma “La Sapienza”

Via Salaria 113, 00198 Roma, Italy g fcalvanese,degiacomo,lenzerini @dis.uniroma1.it

Abstract data are kept. The labels of edges in the schemas are formu- D lae of a certain theory T , and the notion of a database

The problem of modeling semi-structured data is important in being coherent to a schema S is given in terms of a spe- many application areas such as multimedia data management, cial relation, called simulation, between the graph repre- biological databases, digital libraries, and data integration. senting the database and the graph representing the schema. Graph schemas (Buneman et al. 1997) have been proposed Roughly speaking, a simulation is a correspondence be-

recently as a simple and elegant formalism for representing S

semistructured data. In this model, schemas are represented tween the edges of D and those of such that, whenever D

as graphs whose edges are labeled with unary formulae of there is an edge labeled a in , there is a corresponding a a theory, and the notions of conformance of a database to edge in S labeled with a formula satisfied by (but not nec- a schema and of subsumption between two schemas are de- essarily vice-versa). The notion of simulation is less rigid fined in terms of a simulation relation. Several authors have than the usual notion of satisfaction, and suitably reflects stressed the need of extending graph schemas with various the need of dealing with less strict structures of data. types of constraints, such as edge existence and constraints In (Buneman et al. 1997), the authors point out that, for on the number of outgoing edges. In this paper we analyze several tasks related to data management, it is important to the appropriateness of various knowledge representation for- be able to reason about schemas, in particular to check sub- malisms for representing and reasoning about graph schemas sumption between two schemas, which is the task of de- extended with constraints. We argue that neither First Order , nor Logic Programming nor Frame-based languages ciding whether every database conforming to one schema are satisfactory for this purpose, and present a solution based always conforms to another schema. They also present al- on very expressive Description . We provide tech- gorithms for, and analyze the complexity of checking sub- niques and complexity analysis for the problem of deciding sumption in BDFS. schema subsumption and conformance in various interesting Several papers indicate that in many applications there is cases, that differ by the expressive power in the specification the need to extend the BDFS model with different types of of constraints. constraints. Indeed, in (Buneman et al. 1997) all the proper- ties of the schema are expressed in terms of the structure of the graph, and therefore, there is no possibility of specifying Introduction additional conditions, such as existence of edges or bounds on the number of edges emanating from a node, or imposing The ability to represent data whose structure is less rigid that a certain subgraph is well-founded. and strict than in conventional databases is considered a Our intuition suggests that Knowledge Representation crucial aspect in modern approaches to data modeling, and (KR) techniques should be very useful for the above pur- is important in many application areas, such as biologi- pose. After all, the problem deals with representing con- cal databases, digital libraries, data integration, and access straints, and reasoning about schemas with constraints. The to web databases (Abiteboul 1997; Buneman et al. 1997; basic goal of the work reported in this paper was to verify Christophides et al. 1994; Mendelzon, Mihaila, & Milo this intuition, and we present here the following results of 1997; Quass et al. 1995). Consider, for example, the set of our investigation: home pages designed by the faculties for a University web

site. Since different home pages may vary considerably one  We analyze the appropriateness of various KR formalisms from another, it is extremely hard to describe their structure for representing and reasoning about graph schemas ex- in a rigid form such as the one imposed, say, by relational tended with constraints, and demonstrate that neither First databases. Indeed, we need structuring mechanisms that are Order Logic, nor Logic Programming nor Frame-based much more flexible than traditional data models. languages are satisfactory for this purpose.

Following (Abiteboul 1997), we define semi-structured  We show that very expressive Description Logics (DLs), data as data that is neither raw, nor strictly typed as in con- such as the ones studied in (De Giacomo & Lenzerini ventional database systems. BDFS (Basic Data model For 1996; Calvanese 1996; De Giacomo & Lenzerini 1997), Semi-structured data) (Buneman et al. 1997) is a formal are the right tools for modeling and reasoning about semi- and elegant data model, based on graphs with labeled edges, structured data with constraints. In particular, we propose where information on both the values and the schema for the to express constraints in terms of DLs formulae associated

to nodes of the schema. A formula on a node u imposes

D S

Copyright c 1998, American Association for Artificial Intel- a condition that, for every database conforming to , u ligence (www.aaai.org). All rights reserved. must be satisfied by every node of D simulating .  We consider languages for specifying constraints with dif- In (Buneman et al. 1997), an algorithm is presented for

ferent expressive power, and present several results on checking subsumption (and also conformance, being a T -

the corresponding reasoning problems. We show that database a special case of T -schema). The algorithm es-

adding various types of local constraints (i.e. constraints sentially looks for the greatest simulation between the nodes

O 1

m  t m that impose conditions only on the edges emanating from O

of the two schemas, and works in time T ,

t x a node) does not increase the complexity of reasoning. m

where is the size of the two schemas and T is the time T On the other hand, we present an intractability result for needed to check whether a formula of size x is valid in .

the case of non-local constraints. Finally, we study the In general it is meaningful not to consider T to be part of

case where the constraints are expressed in a very pow- the input of the problem (Buneman et al. 1997). Therefore,

ALC Q

m m erful DL, namely, (De Giacomo & Lenzerini t

whenever T may be assumed to be independent of ,

m m 1997), that allows for imposing complex conditions on t

T can be replaced by a constant (e.g. when is polyno-

T S the schema, such as well-foundedness of subgraphs. We mial in the size jS j of a -schema , which is considerably

present a technique for checking subsumption in this case, smaller than jT j). showing that the problem is decidable in double exponen- If not specified otherwise, we also make the assumption

tial time. that the theory T is not part of the input to the reasoning Our presentation starts with a brief description of both problems addressed in the paper (namely, consistency and

subsumption). ALC Q

BDFS, and the description logic  . ALC Q The Description Logic  Preliminaries Description logics (DLs) allow one to represent a domain In this section, we describe the basic characteristics of the of interest in terms of concepts and roles. Concepts model

BDFS model for semi-structured data and the description classes of individuals, while roles model relationships be- ALC Q

tween classes. We concentrate on the DL  studied ALC Q logic  . in (De Giacomo & Lenzerini 1997), where a correspondence was shown with a well-known logic of programs, called The BDFS Data Model modal mu-calculus (Kozen 1983; Streett & Emerson 1989), ThedatamodelBDFS, which is the basis of our investiga- that has been recently investigated for expressing temporal

tion, is an edge-labeled graph model of semi-structured data, properties of reactive and parallel processes (Stirling 1996; ALC Q

where labels are unary formulae of a first order language Emerson 1996).  can be viewed as a well-behaved L

L fragment of first-order logic with fixpoints (Park 1970; T T . The language is constituted by a set of predicates, including the equality predicate “=”, and one constant for Abiteboul, Hull, & Vianu 1995). We make use of the stan-

every element of a universe U . dard first-order notions of scope, bound and free occurrences  A schema in BDFS always refers to a complete and decid- of variables, closed formulae, etc., treating  and as quan-

tifiers.

U T

able theory T on .Inotherwords, is the set of first order ALC Q The primitive symbols in  are atomic concepts, formulae which are true (or valid) for the elements of U ,and

(concept) variables,andatomic roles (in the following

p L it is decidable to check whether a formula in T is valid

called simply roles). Concepts are formed according to the

T j= p in T (in notation, ). following syntax:

Definition 1 A BDFS T -schema is a rooted connected graph

C ::= A j :C j C u C j  nR C  j X C j X 2

1 . .

L T

whose edges are labeled with unary formulae of T .A -

R n database is a rooted connected graph whose edges are la- where A denotes an atomic concept, arole, a natural

number, and X a variable, and the restriction is made that

beled with constants of T .

X C

every free occurrence of X in . is in the scope of an G

For any rooted graph G, we denote the root of by even number of negations.

C t C

1 2

oot G G Nodes G

r , the set of nodes of by ,andtheset We introduce the following abbreviations: for

::C u:C  > A t:A ? :> 9R C

1 2

Edges G

of edges of G by . We denote an edge from node , for , for , . for

a

 1 R C  8R C :9R :C  nR C  :

 . , . for . , . for

v a u ! v

u to node labeled by with .

n+1 R  = nR C   nR C  u  nR C  .C , . for . . ,and

In order to establish if a database is coherent with a

C :X :C [X=:X ] C [X=:X ] X . for . (where is the con- schema, or if a schema is more general than another schema,

cept obtained by substituting all free occurrences of X with

the notions of conformance and subsumption are defined as X

: ). I

follows. I

=  ;  

An I consists of an interpre-

I I

 

D T

Definition 2 A T -database conforms to a BDFS - tation domain ,andaninterpretation function ,which

I



D  S

schema S , in notation , if there exists a simulation maps every atomic concept to a subset of ,andevery

I I

D S  D

 

from to , i.e. a binary relation from the nodes of atomic role to a subset of  . The presence of free

0

root D   root S  u  u

to those of S satisfying: (1) ,(2) variables does not allow us to extend the interpretation func-

a

I



! v D

implies that for each edge u in , there exists an edge tion directly to every concept of the logic. For this reason

p



0 0

0 we introduce valuations. A valuation on an interpretation

! v S T j= pa v  v

u in such that , and .

I 

I is a mapping from variables to subsets of .Givena

0 0

S T S  [X=E ]

Definition 3 If S and are two BDFS -schemas, then valuation , we denote by the valuation identical to

0

SvS T D  [X=E ]X = E

subsumes S , in notation , if for every -database , except for the fact that .

0 0 0

S D S S S SvS I  I

D implies . is equivalent to if and Let be an interpretation and a valuation on .We

0

vS I S . assign meaning to concepts of the logic by associating to

I 

and  an extension function , mapping concepts to subsets extending it to an interpretation by computing the exten-



I

n ;:::;n N sion of the predicates 1 on the basis of the for-

of  , as follows:

OLS 

mulae in F .

I I

= X   X



M FOLS 

 Given a logical model of , if we traverse the

I I I

A = A 



dge M

relation e in starting from one object satisfying the

I I I

=  C :C  

 root predicate, we should obtain a structure that corre-

I I I

C u C  = C  \ C 

1 2 1 2

  

sponds to a database conforming to S , and, on the con-

I I

 nR C  = fs 2  j 

. trary, every database conforming to S should correspond

0 0 I 0 I

fs j s; s  2 R s 2 C gng

FOLS 

and  to a logical model of in which the extension of

T

I I I

X C  = fE  j C E g

. 

[X=E ]

 the root predicate is not empty. S

Consider, for example, the BDFS schema 1 with two I

Observe that C can be seen as an operator from sub-

[X=E ]

u u u u

2 1 2

nodes 1 and , and one edge from to labeled with

I I

P FOLS 

E   sets of to subsets of , and that, by the syntactic . The corresponding set 1 of formulae in FOL is:

restriction enforced on variables, such an operator is guar-

f 8x n x , 8y 8z edge x; y ; z   P z  ^ n y ; 2

anteed to be monotonic wrt set inclusion. The constructs 1

8x n x , :9y 9z edge x; y ; z  g

2

C X C

X . and . denote respectively the least fixpoint and

D d e

Consider the database 1 with two nodes and and an

the greatest fixpoint of the operator. The extension of closed

e t P t T edge from d to labeled with such that is valid in .It

concepts is independent of the valuation, and therefore for

D S 1 is easy to see that 1 conforms to , and that it corresponds

closed concepts we do not consider the valuation explicitly.

dge d; e; t

to the pre-interpretation where e is true, which

ALC Q C v

A knowledge base is a finite set of 1

FOLS  d 2 n 1

is extended to a model of 1 such that and

C C C ALC Q

2 2

where 1 and are closed concepts of .An

e 2 n

2 . Observe that the empty database with just one node

I I

I C v C C C 2

interpretation satisfies an 1 ,if .

1 2 S

and no edges also conforms to 1 and corresponds to the

I

I is a model of a knowledge base ,if satisfies all axioms

dge x; y ; z 

pre-interpretation where e is always false. This C

in . A closed concept is satisfiable in a knowledge base

FOLS 

is correctly captured by 1 .

I

I C 6= ;

u if there exists a model of such that . S

On the other hand, consider a schema 2 with one node

u u P FOLS  Theorem 4 (De Giacomo & Lenzerini 1997) Satisfiability and one edge from to labeled with .Theset 2

is simply

ALC Q ALC Q

of closed  concepts in knowledge bases is

x nx , 8y 8z edge x; y ; z   P z  ^ ny  g an EXPTIME-complete problem. f8

and it is easy to see that there is at least one model of

OLS  Which KR Formalism for Semi-Structured F

2 which does not correspond to any database con-

S I

Data Modeling? forming to 2 . Indeed, consider a pre-interpretation as-

dge d; d; t P t T In this section we discuss how the technology of KR can signing true to e , with valid in . Clearly,

contribute to the problem of modeling semi-structured data such a pre-interpretation can be extended to a model of

FOLS  n

with constraints. 2 simply by letting the extension of be empty. n We start our investigation by verifying whether First Or- However, since d is not in the extension of , the resulting

der Logic (FOL) is suited for representing BDFS schemas. model does not reflect the fact that the database correspond-

I S We observe that, in principle, FOL would be an interesting ing to conforms to 2 . tool, because it would allow expressing very complex con- The above example shows a general problem that FOL straints on the schema. However, we need to check if FOL is has in representing BDFS schemas. The first order semantics able to model the notions of BDFS schemas, databases, and is intrinsically too liberal for capturing the notion of simu- simulation. lation. Indeed, in order to reflect such a notion, we need a

A reasonable approach to expressing BDFS schemas in type of semantics that forces an object o to be in the exten- n FOL is based on the following observations: sion of a predicate i whenever there is no evidence that it

cannot satisfy such predicate. We observe that the greatest

 u ;:::;u N

If 1 are the nodes of the schema, we make use fixpoint semantics (Baader 1996) satisfies exactly this prop-

n ;:::;n edge n x

N i

of predicate symbols, 1 and ,where erty. Thus, FOL extended with fixpoints would be suitable,

x n

means that is a node (the i corresponding to the root but FOL itself is not.

dge x; y ; z 

is called the root predicate), and e means that For the same reason, one can verify that neither Logic

y z

there is an edge from x to labeled with . The fact that Programming languages under the completion seman-

F T

the label z satisfies the formula of is represented by tics (Lloyd 1987), nor Frame-based languages are suited

z 

F . for modeling BDFS schemas. In particular, the difficulties

S FOLS 

 We represent the schema by means of a set of arise when BDFS schemas with cycles are taken into account

S 2 u (see the schema in the example above). On the contrary,

suitable formulae. For example, the fact that a node i

u P

u schemas without cycles may be modeled correctly, for ex-

k 1 has two outgoing edges to j and labeled with and

P ample by using KR systems such as CLASSIC (Borgida &

2 , is represented by the formula

Patel-Schneider 1994). Note, however, that the assumption

x n x , 8y 8z edge x; y ; z  

8 of acyclicity of semi-structured data schemas is too restric-

i

P z  ^ n y  _ P z  ^ n y 

1 j 2 k tive in practice.

The above observations tell us that, in order to use the

 OLS  An interpretation of F is obtained by choosing a KR technology for modeling and reasoning about semi-

so-called pre-interpretation that assigns a truth value to structured data with constraints, we must resort to KR for-

dge x; y ; z  every ground formula of the form e ,andthen malisms that are able:

a

u ! v T j= pa

 to model graphs with no limitations on cycles; outgoing edge such that , while a constraint

1 p of the form 9 , called functionality-constraint, imposes  to interpret such graphs by making use of the greatest fix-

point semantics; that u has at most one such outgoing edge. More precisely,

S =G ; C  T L D

let be a -schema with l -constraints. and a

 to express complex constraints on the graphs;

u D

T -database. Then a node of satisfies a constraint ,in

 u j=

to provide reasoning procedures for computing the sub- notation c , if the following conditions are satisfied:

sumption relation between schemas.

u j= >

c always

a ALC Q

Below we show that  has exactly the above char-

u j= 9p 9u ! v 2 Edges D  T j= pa

c iff .

acteristics. However, as a first step, we need to formally de- a

u j= :9p 8u ! v 2 Edges D  T j= :pa

c iff .

fine graph schemas with constraints and the associated rea- a

1

fu ! v 2 Edges D  jT j= pag1 u j= 9 p

soning tasks. c iff

u j= ^ u j=  ^ u j= 

c 1 2 1 c 2 iff c

Schemas with Constraints

D T

Note that we can view a T -database as a -schema

D; C  C u=> u We address the problem of extending the BDFS data model  with constraints, where for every node

in order to express constraints on the graph representing a of D (such a schema is always consistent).

schema. We conceive a constraint for a BDFS schema S as Checking the consistency of a schema amounts to visiting

a formula associated to a node u of the schema. The for- the graph and removing nodes that violate constraints, which

mula is expressed in a certain language L, and its role is to can be detected by a local check. An algorithm for subsump-

impose a condition that, for every database D conforming tion is obtained essentially by incorporating local checks for

D u to S , must be satisfied by every node of simulating .In constraint violations into the algorithm of (Buneman et al. other words, constraints are used to impose additional con- 1997). ditions on the schema, with respect to those already implied

Theorem 8 Consistency and subsumption of T -schemas

by the structure of the graph. L

with l -constraints, can be checked in polynomial time in

L S =

Definition 5 A T -schema with -constraints is a pair the size of the schemas.

G ; C  G T C  ,where is a BDFS -schema, and is a total func- tion from the nodes of G to formulae of a constraint language Non-Local Constraints

L. Next we consider languages in which the constraints are not

D T

Definition 6 A T -database conforms to a -schema with local, i.e. they can express conditions on edges that are not

S =G ; C  D S L-constraints , in notation , if there exists directly connected to the node labeled with the constraint.

a constraint-consistent simulation, i.e. a binary relation  We show that even in a simple non-local constraint lan-

L ALE

D G root D  

from the nodes of to those of satisfying: (1) guage, namely ALE inspired by the DL (Donini et

0 0

T

u  u u C u  oot G 

r ,(2) implies that (2.1) satisfies , and al. 1992), consistency and subsumption of -schemas be-

q

a 0

0 come intractable.

! v D u ! v

(2.2) for each edge u in , there exists an edge

L 0

Theformulaeof ALE have the following syntax:

T j= q a v  v

in S such that , and .

::= > j 9p" j 8p" j ^ 2 Since constraints may contradict each other, or may even 1 be incompatible with the structure of the graph, the notion where the additional rules for the satisfaction of constraints

of consistency becomes relevant.

L u T

of ALE in a node of a -database are:

a

L S =

Definition 7 For a T -schema with -constraints

u j= 9p" 9u ! v 2 Edges D  T j= pa ^ v j=  c

c iff .

u 2 Nodes G  G ; C 

, a node is consistent if there is at a

8u ! v 2 Edges D  T j= pa  v j=  u j= 8p"

0 0 c

c iff .

G ; C  G

least one T -database which conforms to ,where

0

root G  = u S

is equal to G except that . is consistent,if L

Observe that ALE is not local since the constraints im-

oot G  r is consistent. posed on one node may imply other constraints on adjacent The notion of subsumption remains unchanged. nodes. By exploiting this property and the hardness results We consider now different constraint languages, and in (Donini et al. 1992), we can show that consistency check- study consistency and subsumption checking for schemas ing is coNP-hard.

with constraints. Being conformance a special case of sub- S Theorem 9 Checking the consistency of a T -schema with

sumption, we do not explicitly deal with conformance.

L S T

ALE -constraints is coNP-hard in the size of , even if is

empty, i.e. all edges of S are labeled with true. Local Constraints

By observing that checking consistency can be reduced L We first consider a language l in which only local con- to checking subsumption wrt an inconsistent schema, we

straints can be expressed, i.e. only constraints on the edges immediately get that subsumption is NP-hard. Theorem 9 L directly emanating from a node. l is inspired by DLs with

shows also that consistency stays coNP-hard, even if T can

number restrictions and its formulae have the following syn- be used as an oracle for validity. The complexity of checking

p 2 tax ( , 1 and denote constraints, and denotes a formula consistency in the presence of non-local constraints lies in

of T ): the necessity to verify whether a database may exist, whose

1

topology is determined by the constraints. Since T cannot

::= > j 9p j :9p j 9 p j ^ 2

1 predict anything about the possible topologies of databases,

p u T Intuitively, a constraint of the form 9 on a node , called the validity checker of cannot be used to “hide” a poten-

edge-existence constraint, imposes that u has at least one tially exponential calculation. Note that this is different from

Bib are represented by the graph, whereas the content of the u Paper 4

pages is modeled by T . Notice the use of constraints to u

u Section 0

1 state complex conditions on the structure of the allowed

8 "X Section X Section

Body databases. In particular, the constraint .

u u

2 3 u

associated with 1 rules out all databases that have loops in

1 1

u  = 9 ^9 ^ X 8 "X C the connections of the various sections.

1 Section Bib . Section

1 1

C u  = 9 _9

2 Section Body L

Checking Subsumption for  We develop now a tech-

C u  = C u  = C u  = >

0 3 4

T L

nique for checking subsumption of -schemas with  -

8 9

: u: u:

Paper v Bib Section Body T =

< constraints, which works in the case where the theory can

: u:

Section v Bib Body ALC Q

be expressed in terms of axioms of  . In order to il-

T :

Body v Bib ;

: lustrate the features of the technique, we further assume that t

Index v Itemize Enumerate U

T is interpreted over a fixed finite universe , includes only

d

unary predicates, one distinct constant c for each element

d 2U pa L T , and is presented as a finite set containing either

Figure 1: A -schema with  -constraints

1

pa p a

or : for each predicate and constant . T the case of local constraints, where the aspects related to the The formulae of T that label the edges of a -schema are

topology enforced by the constraints can be embedded in an boolean combinations of atomic formulae in the language of

T self = a a

appropriate formula of T . and of expressions of the form ,where is a

pa

constant of T . We define when a formula labeling an

T j= pa

Fixpoint Constraints edge is valid in T , in notation , as follows:

0 0

j=self = a a a = a

We now extend our framework to a very expressive con- T iff

T 6j= pa T j= :pa ALC Q

straint language, which is a variant of  ,andshow iff

T j=p ^ p a T j= p a ^T j= p a

2 1 2

decidability of consistency and subsumption. 1 iff L

The constraint language  is the set of closed formulae

T T

constructed according to the following syntax (p denotes a It is immediate to view a -database as a -schema, sim-

a self = a

n X formula of T , a positive integer, and a variable): ply by replacing each edge label by . Therefore,

as in BDFS, conformance is a special case of subsumption.

n

::= X j 9 F j : j ^ j X 2

1 . The technique we use for checking subsumption is based

ALC Q

::= p j " j :F j F ^ F

F on a reduction to unsatisfiability in knowledge

1 2

bases. Differently from the previous cases, in what follows

X with the restriction that every free occurrence of X in . we consider T to be part of the input to subsumption check-

is in the scope of an even number of negations. ing.

_ :: ^: 

2 1 2

We introduce the abbreviations: 1 for ,

T S S 2

Given two -schemas 1 and , we reduce the problem

1

> _: 8p" :9 p ^": 

S vS 2

for ,and for . of deciding whether 1 , to the problem of deciding the

D T M T

u: ALC Q  S

Let be a -database, and be a model of .A unsatisfiability of the concept S in the

2 1

 D

ALC Q  

T S S

valuation on is a mapping from variables to subsets knowledge base T ,where , ,and are

1 2

des D  [X=E ]

of No . We denote by the valuation iden- defined as follows.

[X=E ]X  = E

tical to  except for . For each node

T

T :encodingof and of the general properties of BDFS

2 Nodes D  u u , we define when satisfies a constraint

graphs To encode the general properties of BDFS graphs,

 ; u j=

under a valuation , in notation c , as follows:

T exploits reification of edges, as used in (Buneman et al.

E

; u j= X u 2 X 

c iff 1997). Specifically, we use a special role and split each

a

n

a E E

; u j= 9 F fu ! v 2 Edges D  j

c iff

u ! v u ! e ! v

labeled edge into two edges uv ,byintro-

a

; u ! v j= F gn

c

e a T

ducing an intermediate node uv labeled by . contains

; u 6j= ; u j= : c

c iff

> > >

E D

the following axioms ( N , ,and are new atomic

; u j= ^ ; u j=  ^ ; u j= 

1 2 c 1 c 2 c iff

concepts, and L is a new role):

; u j= X 8E Nodes D  8v 2 Nodes D 

c . iff . .

> v > t> t> > v :>

N E D N E

[X=E ];v j=  [X=E ];v j= X   [X=E ];u j= X

c c c

> v :> > v :>

E D D N

> > v 8E E

where N .

> v 8E > u = 1 E > u8L > u = 1 L >

E N D

a . . . .

; u ! v j= p T j= pa

c iff

a

! v j= " ; v j=

; u Intuitively, these axioms partition the interpretation domain c

c iff

a a

> > E

into objects denoting nodes ( N ), edges ( ), and con-

; u ! v j= :F ; u ! v 6j= F c

iff c

a a a

T >

stants of ( D ), and specify the correct links for those

; u ! v j= F  ^ ; u ! v j= F  ; u ! v j= F ^ F

c 1 c 2 1 2 c iff

object denoting nodes and edges. T

L In addition, in order to encode the theory , we introduce

Since the constraints in  are closed formulae, satisfaction

C T O a

is independent of the valuation, and we denote it simply by one concept p for each predicate of , and one concept

a T

j= u (called an object-concept) for each constant of , and, for

c .

C O

a T each pair p , we add to the axiom: Example The schema shown in Figure 1 represents a set of web pages such as those generated by “latex2html” when 1We point out that we restrict ourselves to such simple kinds of A translating a LTEX article containing nested sections and theories for the sake of simplicity, but our approach works when T possibly a bibliography. The connections between the pages has a more general form. Conclusions

The result of our investigation is that very expressive DLs

> u O v C T j= pa

a p

D if

u O v:C T j= :pa

> are interesting tools for modeling and reasoning about semi-

a p D if structured data with constraints. The analysis presented in

the paper shows that the complexity of subsumption rises

j j jT j

Observe that, T is linear in .

even when simple non-local constraints are added to BDFS. S  This justifies our approach that aims at adding as much

S : encoding of the schema In order to define the en-

T S =G ; C   expressive power as possible in specifying the constraints,

coding S of a -schema we define a mapping

ALC Q  from constraint expressions to formulae as fol- without loosing decidability. lows:

Acknowledgments This work was partly supported by ES-

 X =X

p p= 8L

 PRIT LTR Prj. No. 22469 DWQ and the Italian Space Agency. n

 .

 F  9 F = n E

 .

" =8E   

 .

 : =:  

:F =: F 

 References

  ^ =   u   

1 2 1 2

 F ^ F = F  u  F 

2 1 2

1 Abiteboul, S.; Hull, R.; and Vianu, V. 1995. Foundations of

 X =X    . .

Databases. Addison Wesley Publ. Co., Reading, Massachussetts.

2 Nodes G =fu ;:::;u g

u Abiteboul, S. 1997. Querying semi-structured data. In ICDT-97, h We construct for each node 1

2 1–18.

ALC Q a characteristic concept u as follows : Consider

Baader, F. 1996. Using automata theory for characterizing the u the set of mutual recursive equations, one for each node i

semantics of terminological cycles. Ann. of Math. and AI 18:175–

des G  in No

219.

F

> u  C u  u8E > u 8L p u8E X  X

p

N 1 E v

u . . . Borgida, A., and Patel-Schneider, P. F. 1994. A semantics and

1

u !v

1 complete algorithm for subsumption in the CLASSIC description 

F logic. J. of Artificial Intelligence Research 1:277–308.

X > u  C u  u8E > u p 8L p u8E X 

u N E v

h . . .

h

u !v h Buneman, P.; Davidson, S.; Fernandez, M.; and Suciu, D. 1997. and eliminate, one at the time, each of the above equations, Adding structure to unstructured data. In ICDT-97, 336–350.

Calvanese, D. 1996. Finite model reasoning in description logics. X except the one for u as follows: Eliminate the equation

i In KR-96, 292–303.

X = C X

j u

u and substitute each occurrence of in the j

j Christophides, V.; Abiteboul, S.; Cluet, S.; and Scholl, M. 1994.

X C X = C

j u i

remaining equations by u . .Let be the re- i

j From structured documents to novel query facilities. In ACM SIG-

X C

u i

sulting equation. The concept u is . . The encoding i

i MOD, 313–324.

 S  =

S S

oot G  of is r .

De Giacomo, G., and Lenzerini, M. 1996. TBox and ABox rea-

j j Observe that, in the worst case, S is exponential with soning in expressive description logics. In KR-96, 316–327. respect to jS j. De Giacomo, G., and Lenzerini, M. 1997. A uniform framework Properties of the encoding The following three proper- for concept definitions in description logics. J. of Artificial Intelli- ties of the encoding establish decidability and complexity gence Research 6:87–110.

Donini, F. M.; Hollunder, B.; Lenzerini, M.; Spaccamela, A. M.;

T L of checking subsumption between two -schemas with  -

Nardi, D.; and Nutt, W. 1992. The complexity of existential quan-

S S 2 constraints 1 and .

tification in concept languages. Artif. Intell. 2–3:309–327.

S S 2

Theorem 10 1 is subsumed by if and only if there is no Emerson, E. A. 1996. Automated temporal reasoning about re-

 u:

S S

model of T that satisfies and interprets every active systems. In Logics for Concurrency: Structure versus Au- 2 object-concept as a singleton.1 tomata, volume 1043 of LNCS. Springer-Verlag. 41–101.

Kozen, D. 1983. Results on the propositional -calculus. Theor.

 

S S

Theorem 11 Let T , , and be as defined above. 2

1 Comp. Sci. 27:333–354.

0

ALC Q Then there exists a  knowledge base whose size

Lloyd, J. W. 1987. Foundations of Logic Programming (Second,

j j + j j + j j  u:

S S S S

is polynomial in T such that:

2 1 2

1 Extended Edition). Springer-Verlag. is satisfied in a model of T that interprets every object-

Mendelzon, A.; Mihaila, G. A.; and Milo, T. 1997. Querying the

 u: S

concept as a singleton, if and only if S is satisfi-

1 2

0 World Wide Web. Int. J. on Digital Libraries 1(1):54–67. able in .

Park, D. 1970. Fixpoint induction and proofs of program prop- S

Theorem 12 Checking whether 1 is subsumed erties. In Machine Intelligence, volume 5. Edinburgh University

S Press. 59–78.

by 2 is EXPTIME-hard and decidable in time

pj j+j j+j j

T S S

1 2

O 2 . Quass, D.; Rajaraman, A.; Sagiv, I.; Ullman, J.; and Widom, J.

1995. Querying semistructured heterogeneous information. In

j j jS j

Since S may be exponential with respect to , it follows DOOD-95, 319–344. Springer-Verlag. L that subsumption checking in the presence of  -constraints Steffen, B., and Ing´olfsd´ottir, A. 1994. Characteristic formu- can be done in deterministic double exponential time with lae for processes with divergence. Information and Computation respect to the size of the two schemas. 110:149–163.

2 Stirling, C. 1996. Modal and temporal logics for processes. In This construction is analogous to the one used in Process Al- Logics for Concurrency: Structure versus Automata, volume 1043 gebra for defining a characteristic formula of a process (Steffen of LNCS. Springer-Verlag. 149–237. &Ing´olfsd´ottir 1994), i.e. a formula which is satisfied by exactly Streett, R. E., and Emerson, E. A. 1989. An automata theoretic all processes that are equivalent to the process under bisimulation.

decision procedure for the propositional -calculus. Information  In a certain sense, we may say that S characterizes, exactly all and Computation 81:249–264. databases that conform to S .