Constraint databases: A tutorial introduction∗

Jan Van den Bussche Limburg University, Belgium

1 Introduction yielding a new goal

We give a tutorial introduction to the basic defini- S(u, u). tions surrounding the idea of constraint databases, In constraint logic programming, one starts from and survey and indicate some of the achieved research the observation that matching is only a very simple results on this subject. This paper is not written kind of constraint solving, and generalizes standard as a scholarly piece, nor as polished course notes, logic programming by replacing matching by con- but rather as something like the transcript of an in- straint solving. For example, in the system CLP(<) vited talk I gave at a meeting bringing together re- developed at IBM Watson, you can program with lin- searchers from finite model theory, database theory, ear inequality constraints over the reals. So, if you and computer-aided verification, which was held at match a rule Schloss Dagstuhl in October 1999. R(x, y) ← x

z =2andu = x = y, The Universe of Discourse

∗ neglected Database Principles Column. Column editor: Leonid Libkin, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 08974. E-mail: In our tutorial introduction to the notion of a con- [email protected]. straint database, let us begin by reviewing rela-

1 tional databases as they are classically formalized in The relation R in D could contain, for example the database theory. two tuples A relational database is traditionally formalized as (john, 30) and (mary, 25), a collection of finite relations over some universe U of atomic values. This “Universe of Discourse” typ- among others. ically has a structure of its own. For example, you We can now continue to use the popular logic- will have numbers in it, with predicates like < and based query languages such as first-order logic (FO), functions like + defined on them, and strings, with Datalog, etc. The structure on U is now automat- predicates such as like in SQL and operations such ically fully visible to queries expressed in these lan- as concatenation. In SQL we routinely write queries guages, because it is part of the databases themselves like in the way we formalized databases. For example, the SQL query above is written in FO as follows: select x + y from R {x + y | R(x, y) ∧ x

2 7 Constraint databases: First relations in the query formula. For example, if we definition have the query ∀x((∃yx= y + y) → S(x)) In databases we usually insist that output relations can later be used as input relations. This principle on a database where relation S is defined by formula of compositionality, or closure as we often call it in ϕS, we plug this formula in the query formula and databases, is important for example in views. obtain a definition of the result of the query on that Hence, we should allow, instead of only finite re- database: lations over U, any definable relations over U in a database! Again, such relations will be represented ∀x((∃yx= y + y) → ϕS (x)) symbolically by their defining formulas. This idea U of a database consisting of definable relations repre- Note that this is a formula over only: it does not sented by formulas is a second major idea in con- mention any database relations, and thus is indeed straint databases. the symbolic representation of a definable relation So we come to our first definition of what a con- as we introduced above, and as we agreed to rep- straint database is (we will later refine it). A con- resent symbolically using formulas. Note also that straint database over U is a finite collection of first- the query was actually a yes/no query, and corre- order formulas over U: spondingly, our result formula is a sentence (formula without free variables) which will, on U evaluate to (ϕR,ϕS,ϕT ,...). either true or false, as desired.

Each formula defines a (possibily infinite) relation over U: 9 Testing emptiness of defin- (R,S,T,...). able relations

The constraint database represents the infinite struc- So far, so good. It is all well to just use formulas over ture obtained from expanding the structure of the U as symbolic representations of possibly infinite, de- U universe with these relations, for example: finable relations over U. But what can we do with these representations? Can we compute with them? (U; <, +, ; R, S, T ). For one thing, can we even decide whether the rela- tion defined by a given formula over U is empty or We should point out at this point that classical fi- not? nite relation databases can be viewed as special cases We can do this if the first-order theory of U,that of constraint databases, as follows. The only struc- is, the set of all sentences over U that evaluate to true ture on U that we have are constant symbols for all 1 on U, is a decidable set. There are many examples elements of U, and the equality predicate of course. of structures that do not have a decidable theory. A finite relation like, for example, R: Two famous examples are the natural numbers with john 30 plus and times: (N;+, ×), and the strings over some mary 25 alphabet Σ with the single-letter strings as constants ∗ and concatenation as function: (Σ ;(a)a∈Σ, concat). can then always be defined by a formula, for example However, there are also a few structures that do ϕR: have a decidable theory: • Z , , ,< (x1 =john∧ x2 = 30) ( ;+ 0 1 ): the integers with their order and addition, but without multiplication; ∨(x1 = mary ∧ x2 = 25) • the same with the rationals Q instead of the in- Z 8 Plug-in evaluation tegers ; • (R;+, ×, 0, 1,<): the reals, even with multipli- How do we solve the problem of effective evaluation cation! described above? Answer: we simply don’t evaluate! • All we do is plug in the definitions of the database The algebra of all terms built using a given sig- nature of function symbols; 1Recall that in logic, constant symbols are 0-ary function symbols. • Boolean algebras.

3 It is with such universes that we can play the game 11 Constraint databases: Sec- of constraint databases. ond definition Of course there is also the issue of computational complexity. Typically, decidable theories have a huge We are now ready for an improvement of our first def- complexity: it is typically not easy to determine the inition of what constraint databases are. The idea is truth of a sentence. However, this complexity can to use only structures U with a decidable theory, and often be isolated in the number of quantifiers of the which have quantifier elimination. Formulas in the sentence. For example, Basu’s algorithms for the the- constraint databases can then without loss of expres- ory of the reals [2] split up into a combinatorial part, sive power be required to be quantifier free.Query which deals with the actual polynomials occurring in evaluation is still plug-in evaluation as explained ear- the formula (obtained by applying plus and times to lier, but the result of the plug in still has the quan- variables), and in an algebraic part, which deals with tifiers from the query formula. We eliminate these the actual number of variables and quantifiers, the (remember that we now assume that the universe has degree of the polynomials, etc. It is only in the alge- quantifier eliminiation), and the resulting quantifier- braic part that the algorithms are exponential. The free formula is the real result of the query. combinatorial part is polynomial-time. In constraint Working only with quantifier-free formulas helps databases, we would like similarly to keep the alge- tremendously in keeping the complexity of testing braic part really simple so that the algorithms have a emptiness of our definable relations down. Although better chance of being efficient in their totality. This quantifier elimination algorithms typically have a brings us to the next point. very high computational complexity, we now have a handle to control this complexity. Indeed, as we just saw, we only have to eliminate the quantifiers that 10 Quantifier elimination were in the query, and the query is fixed, regardless of how large the database is to which we apply it. So, the number of quantifiers to be eliminated is fixed, The method of quantifier elimination is an old tech- and hence complexity stays manageable. nique from the field of model theory in mathematical We should immediately admit, however, that we logic [11, 16]. When we try to apply it to a struc- are painting here a very optimistic picture of con- ture U, we try to express every formula over U as a straint query evaluation. In practice we still have to boolean combination of “base formulas”. If we can do worry about other things, such as the sizes of the logi- with just the atomic formulas as base formulas, then cal terms resulting from quantifier elimination, which we say that “U has quantifier elimination”: in this could also become very big. But, it remains indis- case every formula can be written as a quantifier-free putable that having only quantifier-free formulas in one. Often, however, the atomic formulas alone are the database helps a lot. not enough, and other formulas must be used as base formulas as well. So, the idea of applying quantifier elimination to 12 Research in constraint U amounts to expand U with suitable definable func- tions or predicates, until we actually have quanti- databases fier elimination. For example, take the structure (Z;+, 0, 1,<). We cannot express the formula Research in constraint databases could be classified as follows: ∃xy x x x = + + +2 1. Classical database theory topics reconsidered, now that the universe has a structure. So here without quantifiers, but if we add all modulo func- we are dealing with the special case of finite tions, we can: then it becomes equivalent to the databases over U. quantifier-free formula 2. New research topics made possible by the new possibilities given by full constraint databases, y mod 3 = 2. notably, to represent infinite relations.

The method of quantifier elimination can be suc- However, numerous links between 1 and 2 exist. cessfully applied to all the structures we mentioned Sometimes we have to do type-1 research before we above with a decidable theory. even can begin to do type-2 research.

4 In the remainder of this paper I will give some However, in this particular example, we can samples of the two kinds of research in constraint rewrite the query so that it is okay to let quantifiers databases. range over the active domain only. Indeed, we note that the common radius (or rather, its square), if it 2 2 exists, equals x0 + y0 for some arbitrary pair (x0,y0) 13 O-minimal structures in relation R. So, we can equivalently rewrite the query as follows: If we know something more about our universe U, apart from what we already agreed on (decidable the- ∃x ∃y ∀x∀y R x, y → x2 y2 x2 y2 , ory, quantifier elimination), we can obtain further 0 0 ( ( ) + = 0 + 0) results. One nice property is that of o-minimality. This property is an abstraction of the nice properties and here it is perfectly okay if the quantifiers range the universe of the reals has in connection with the over the active domain only. “tame” topological behaviour of definable sets in the Now is this just a peculiar example, or can we al- reals [26]. ways pull such a trick? If U is o-minimal, you can Concretely, a structure U is called o-minimal if it indeed always do this, and there even is an algorithm is ordered: one of its predicates is a total ordering on that will do the rewriting for you [6]. U, and in addition the following holds: U 1. Take a formula over : 15 Safety ϕ(x1,...,xk). Another issue of type 1, so finite databases only. Let’s 2. Substitute values a2,...,ak for x2,...,xk.Note return to the safety problem: is the result of my query that we leave x1 free! always finite on finite inputs? For example, the fol- lowing query is not safe if the universe is the rationals 3. Then the set or the reals (which are densely ordered):

{x1 ∈ U | ϕ(x1,a2,...,ak)trueinU} {z |∃x∃y(R(x, y) ∧ x

5 16 Conjunctive query contain- • Do S1 and S2 overlap? Is simply the same as ment asking whether their intersection has dimension two (see first query). Still on finite databases, another natural question we may ask is whether the classical problem of conjunc- • Do S1 and S2 “touch”? Is imply the same as tive query containment remains decidable when we asking whether all points on the intersection lie now have a structure on our universe U.Givenour on the borders (see second query). basic assumption that the theory of U is decidable, the answer is “yes” [17, 3]. These examples serve to illustrate that you can al- Of course conjunctive queries also make perfect ready express quite a lot with just the basic opera- sense on general constraint databases with infinite re- tors of first-order logic FO. There is no need to add a lations. Much less is known about containment test- special order for interior, for border, for overlap, for ing in this setting, however. touch, and so on, as is the case in more traditional approaches to spatial database query languages. However, other interesting spatial database queries 17 Spatial databases as con- are not expressible in FO. A case in point is topo- straint databases logical connectivity. This inability can be rigorously proved, and the proof is a nice illustration of a link Now some attention to type-2 research. Using the between infinite constraint databases and finite ones. reals as our universe U, and using definable k-ary re- To give an idea of this proof, we have to talk first lations over the reals, we get what is known in real about another phenomenon which is interesting in its algebraic geometry as the “semi-algebraic sets in Rk. own right. This is a very well-behaved class of geometrical fig- ures, and a lot is known about their topological and geometrical properties [10, 4]. So, it seems very nat- ural to use constraint databases over the reals to rep- 18 Arithmetical collapse resent spatial data [22]. So, every spatial datum is a semi-algebraic set, which is symbolically stored as a Let’s go back to finite databases for a second. Al- U defining formula over the reals. though we now have structure on our universe , More traditional approaches to spatial databases some queries are not interested in this structure, and [27, 23] represent spatial data as polyhedral subdi- continue to view the elements in the database as ab- visions of the real space, or using a finite number stract atomic data elements. A simple example is the of spatial abstract datatypes such as point, line seg- query on a given finite set “are there at least five el- ment, polyline, circle, arc segment, etc. This allows ements in this set?” Clearly, for the outcome of this for more efficient implementations of specific oper- query it does not matter whether the elements are, ations on spatial data. However, elegant, flexible, say, integers on which addition is defined, or the el- closed, logical query languages are much harder to get ements are just abstract elements without any inter- in the traditional approaches than in the constraint preted predicates and functions. We call such queries approach. generic. Indeed, consider the following spatial queries: Now suppose we have an FO query which happens to be generic. This query might still use the inter- • Set S of points in the plane has dimension two: preted predicates and functions in its formula, but that seems not very useful, as the outcome of the 2 ∃x0∃y0∃ε>0 ∀x∀y(d((x, y), (x0,y0)) >ε → S(x, yquery)), is indifferent to these. This basic intuition can be formally proven: generic FO queries can always 2 where d((x, y), (x0,y0)) stands for (x − x0) + be expressed by a classical relational calculus formula 2 (y − y0) (Euclidean distance). that does not mention the structure of U, but only the database relations [21]. This is the principle of • Give me the border of S: arithmetical collapse. To be correct we should add that there is one small {(x, y) |∀ε>0 ∃x1∃y1∃x2∃y2 exception: if the universe if ordered, then we some- times still need to use the order, even though the d((x, y), (x1,y1)) <ε∧ d((x, y), (x2,y2)) <ε query is generic. But that’s okay, it won’t hurt for ∧ S x ,y ∧¬S x ,y } ( 1 1) ( 2 2) what we are going to say next.

6 19 Why the topological connec- • Concrete algorithms and implementations. The tivity query is not in FO DEDALE system stands out in the linear con- straint context. We need good implemented sys- Consider finite sets S of, say real numbers (we are tems also for other universes. back to spatial database queries so the universe is • Play the constraint database game for other uni- the reals). It turns out to be possible to write an FO verses than the typical numeric ones, for example ψ S query that, for any such input , outputs an infinite term algebras [12, 9]. region in the plane that is topologically connected if and only if the cardinality of S is even [15]. Now sup- • In practice we will need “mixed” constraint pose topological connectivity were expressible in FO, databases, where the universe is many-sorted, by some query formula γ. Then the composed query containing many different types of atomic ele- ψ; γ clearly is the parity query. The parity query is ments, all with their own interpreted functions generic, so by arithmetical collapse and natural-active and predicates, and possibly some functions and collapse we can rewrite ψ; γ to an FO query that is predicates mixing the different types. This has a standard relational calculus query: it does not use not yet been adequately formalized. the structure on the reals: addition, multiplication. But this is a contradiction: it is well known that such a standard relational calculus query cannot express References the parity query [1]. [1] S. Abiteboul, R. Hull, and V. Vianu. Founda- tions of Databases. Addison-Wesley, 1995. 20 Other constraint database [2] S. Basu. Algorithms in Semi-Algebraic Geome- research topics try. PhD thesis, New York University, 1996.

Research topics that were also actively studied, but [3] M. Baudinet, J. Chomicki, and P. Wolper. which we did not touch upon in this tutorial, are the Constraint-generating dependencies. Journal of following. See the book [20] for information. Computer and System Sciences, 59(1):94–115, 1999. • Aggregate operators, such as volume in a spatial context. [4] R. Benedetti and J.-J. Risler. Real Algebraic and Semi-Algebraic Sets. Hermann, 1990. • Linear constraint databases: we consider only regions definable by formulas that do not use [5] M. Benedikt, M. Grohe, L. Libkin, and multiplication, only linear equations and in- L. Segoufin. Reachability and connectivity equalities. Some nice theoretical results have queries in constraint databases. In Proceedings been obtained, for example on encoding such 19th ACM Symposium on Principles of Database databases into finite databases, and it is an im- Systems, 2000. portant practical case where a rather complete implemented constraint database system exists [6] M. Benedikt and L. Libkin. Relational queries (the DEDALE system). over interpreted structures. Journal of the ACM, 47, 2000. • The question of understanding exactly which topological queries can be expressed in FO, and [7] M. Benedikt and L. Libkin. Safe constraint how to extend FO towards those we cannot ex- queries. SIAM Journal on Computing, 29:1652– press (two recent papers in this direction not cov- 1682, 2000. ered by the book are by Grohe and Segoufin [14] and by Benedikt, Grohe, Libkin and Segoufin [8] F. Benhamon and A. Colmerauer, editors. Con- [5]). straint Logic Programming: Selected Research. MIT Press, 1993. • Adding recursion to constraint query languages. Here a major issue is termination of the recur- [9] A. Blumensath and E. Gr¨adel. Automatic struc- sion, given that universes and relations are now tures. In Proceedings 15th IEEE Symposium on infinite. Two recent papers are by Kreutzer [19] Logic in , 2000. and by Geerts and Kuijpers [13]. [10] J. Bochnak, M. Coste, and M.-F. Roy. Real Al- Just a few open research directions are: gebraic Geometry. Springer-Verlag, 1998.

7 [11] C.C. Chang and H.J. Keisler. Model Theory. [24] J. Ullman. Principles of Database and North Holland, 3rd edition, 1990. Knowledge-Base Systems, volume I. Computer Science Press, 1988. [12] E. Dantsin and A. Voronkov. Expressive power and data complexity of query languages for trees [25] J. Ullman. Principles of Database and and lists. In Proceedings 19th ACM Symposium Knowledge-Base Systems, volume II. Computer on Principles of Database Systems, 2000. Science Press, 1989.

[13] F. Geerts and B. Kuijpers. Linear approxima- [26] L. van den Dries. Tame Topology and O-Minimal tion of planar spatial databases using transitive- Structures. Cambridge University Press, 1998. closure logic. In Proceedings 19th ACM Sympo- [27] Michael F. Worboys. GIS: A Computing Per- sium on Principles of Database Systems, 2000. spective. Taylor & Francis, 1995. [14] M. Grohe and L. Segoufin. On first-order topo- logical queries. In Proceedings 15th IEEE Sym- posium on Logic in Computer Science, 2000.

[15] S. Grumbach and J. Su. Queries with arithmeti- cal constraints. Theoretical Computer Science, 173(1):151–181, 1997.

[16] W. Hodges. Model Theory. Cambridge Univer- sity Press, 1993.

[17] O.H. Ibarra and J. Su. On the containment and equivalence of database queries with linear con- straints. In Proceedings 16th ACM Symposium on Principles of Database Systems, pages 32–43, 1997.

[18] P.C. Kanellakis, G.M. Kuper, and P.Z. Revesz. Constraint query languages. Journal of Com- puter and System Sciences, 51(1):26–52, August 1995.

[19] S. Kreutzer. Fixed-point query languages for lin- ear constraint databases. In 19th ACM Sympo- sium on Principles of Database Systems, 2000.

[20] L. Libkin, G. Kuper, and J. Paredaens, editors. Constraint Databases. Springer, 2000.

[21] M. Otto and J. Van den Bussche. First- order queries on databases embedded in an in- finite structure. Information Processing Letters, 60:37–41, 1996.

[22] J. Paredaens, J. Van den Bussche, and D. Van Gucht. Towards a theory of spatial database queries. In Proceedings 13th ACM Symposium on Principles of Database Systems, pages 279–288. ACM Press, 1994.

[23] D. Thompson R. Laurini. Fundamentals of Spa- tial Information Systems. Number37inAPIC Series. Academic Press, 1992.

8