<<

Background

A model contains the means for

¡ Specifying particular data structures

¡ Constraining the data sets Database Theory: Lecture 1 ¡ Manipulating the data

Introduction and the A DDL (Data Definition Language) provides the means to specify the structures and constraints. Dr G.M. Bierman A DML (Data Manipulation Language) provides the means to manipulate the data; specifically querying and updating. www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 1 c GMB/AD Database Theory 2003 2 c GMB/AD ¢

Database models This lecture course

Database models are typically characterized by the prominent data structure, e.g. We will study some of these database models from a theoretical perspective. ¥ £ Graphs: Study closely the data model

Network model – ¥ Study closely the associated query languages:

– Semantic model – Choices of query languages

– Object model – Relationships between them

– XML (?) – Their characteristics (typically expressivity)

£ Trees: Three database models considered in this course: – Hierarchical model 1. Relational

£ Relations: 2. Object – Relational model 3. Semi-structured/XML

Database Theory 2003 3 c GMB/AD Database Theory 2003 4 c GMB/AD ¦ ¤ Warning Books

This is a brand new course! Recommended textbooks:

There will be eight lectures, two examples classes, and two lecturers! ¨ Foundations of Abiteboul, Hull and Vianu. Addison Wesley, 1995 Schedule: ¨ Data on the web Abiteboul, Buneman and Suciu. Morgan Kaufmann, 2000. 1. GMB - (now!) relational model Also: Some useful information in any of the database books from the IB 2. AD - Relational calculus Databases course. 3. AD - Deductive databases

4. AD - Recursion and negation

5. AD - Expressivity and complexity

6. GMB - Complex values

7. GMB - Object model

8. GMB - Semistructured/XML model

Database Theory 2003 5 c GMB/AD Database Theory 2003 6 c GMB/AD © §

What is the relational model?

Codd 70:

– Data structure: flat relations

– Simple algebra of queries The relational model – No constraints – No updates

Codd 72-:

– Same data structure

– Second (based on FOL)

– Proof of equivalence to algebra language

– Integrity constraints - functional dependencies

Database Theory 2003 7 c GMB/AD Database Theory 2003 8 c GMB/AD

What is the relational model? cont. Basics     Therafter a whole host of extensions. Thus the relational model means the class Fix a countably infinite set of attribute names, with a total order   of database models that have relations as the core data structure and incorporate   Fix a single countably infinite set  : the underlying domain some of Codd’s approach.     Fix a countably infinite set    of relation names     Assume that there is a sort for each element of    , i.e. a function           

! . " #  The arity of a relation $ is then defined %&' ( 01 & ( + + ) * ,.- / * , /  A relation schema is just a relation name. We may write $ 3 to denote $ is 2 4 5 of sort 3 , and $ to denote it is of arity 265 4

 A is just a nonempty finite set of relation schema, e.g. 8 < < < 3 $ 3 $ 7 : : = = ; ; 4> 2 9 2 4

Database Theory 2003 9 c GMB/AD Database Theory 2003 10 c GMB/AD ?

Example relations Choicepoint: names

Q: Are the attribute names part of the explicit database schema or not? F H P AB CD E IJ A CK IQ ER A I B T AU C @ M M S M O O G L N L N L NV In SQL they are available, e.g. SELECT p.Name FROM Persons p Movies Title Director Actor

Magnolia Anderson Moore Guide Title Cinema Time But are they compiled away by the system to just integers?

Magnolia Anderson Cruise Rocky Warner 12:00 More theoretically: Spiderman Raimi Maguire Spiderman Picturehouse 19:00 X _ ]^ : A tuple over relation schema Y [ is a map from [ to

Named Z \ Spiderman Raimi Dunst ...

... Spiderman Phoenix 19:00 X : A tuple over relation schema Y is a element of the cartesian

Unnamed Z6` \ Rocky Avildsen Stallone Magnolia Picturehouse 22:00 _ba product ]^ RockyII Stalone Stallone The choice of model impacts on the query language.

Location Cinema Address Tel Having assumed a total order on the attribute names provides us with a Picturehouse Cambridge 504444 correspondence between the two. Thus we’ll flip between the two during this Phoenix Oxford 512526 course Warner Cambridge 560225

Database Theory 2003 11 c GMB/AD Database Theory 2003 12 c GMB/AD c W Unnamed: SPC algebra Example

(Simplest possible algebra first) List the name and addresses of cinemas playing a Almodovar film f e e g

SPC query d Base relation

Singleton constant relation h ij6k lm Ž “ ‘ Š ’ ‹ ˆ”• – Š ˆ— | | ˆ‰Š ‹Œ  ƒ ƒ } ‚ } „ ‚  ƒ     €6 € €6 €6 } €‡† Ž ~ Almodovar n p q

d Select (constant) h r s o Alternatively: n p t d Select (attribute) h r s o u x x x

d Projection h v r s y w w o o z Ž Ž “ ‘ Š ’ ‹ ˆ” • – Š ˆ — |  ˆ‰Š ‹Œ  d d

Cartesian product ƒ ƒ h „ „ ‚ ˜ ™ ƒ     €6 €6 € } €‡† ~ Almodovar — What is this? (Assume š is of arity ) Ž | • š ƒ ‚ › › › ‚ œ  œ ž €6 € Ÿ ¡¢   ~ ~

Database Theory 2003 13 c GMB/AD Database Theory 2003 14 c GMB/AD £ {

Arity judgements for SPC queries Semantics © ¤ ¦ ¨ Ö ÔÕ ¥ § × Ø Ñ Ñ Ó Ù Ú Ð Ð Ò Ò ª « ± © ® ¬­ ¯° ª « © ¤ ¦ Ö ÔÕ × Ý Ý Ó ÛÜ Þß Ð ÐÛÜ Þß Ò Ò Ö ÔÕ × × å à å Ý â æ « ª Ó Ó « ª ¦ ¦ © ã ¦ ¦ » ® ¼ ¾ © ä ä è á Û Ð Ð Ù Ú Ò Ò Ð Ð Ò Ò ç Ù Ú ß ³ ³ ² ½ ² ´ ´ ¹ º Ö ÔÕ × × å å å ê à â æ Ó Ó é ä ä è Û Ù Ú Ò Ò Ð Ð Ò Ò ç Ù Ú Ù Úß Ð Ð á « ª ¦ © µ « ª ¦ © µ · · ¸ ¿ ² ² ¹ º ¶ ¹ º ¶ Ö ÔÕ × ò ò ò å ë å å æ ð Ó î î î Ó é ñ ñ ä ä è è ìí í á á Ú Þ ç Ð Ð ÛÜ Ù Ú Ù Ð Ð Ò Ò ß Ù Ú Ò Ò ï Ö ÔÕ Á Á Á « ª ª « ª « ¦ ¦ » ® ¼ © © ¦ © » À À Ç ¿ ³ ½ ½ ² ² ² ´ ´ º ¹ × ò ò ò ò ò ò ó å å ö ÷ ÷ ø å ÷ æ æ ð ô Ó ð Ó ô Ó ñ ñ ñ ñ ñ ä ä ä ñ ä ÛÜ Ù6õ Ú Ù Ú Ù6õ Ú Ù Ú Þ ç Ð Ð Ò Ò Ð Ð Ò Ò Ð Ð Ò Ò ß ª È « « ª © ¦ » ¾ © Â É Å Å Å À Ç ² ² ² ÃÄ Ä ¶ ¶ ¹ º ú û Æ ù ö ð

where ä ú û ù ø ð Í Î Ì Ì Ë Ë ä Ê A query Ê is well-formed wrt a schema if there exists such that .

Aside: Some queries can return ü for all instances, e.g.

þ ¤ ¥¦§

ý

ÿ ÿ ¨ ¡£¢ ©     â  â ð ã Ó ð        

Database Theory 2003 15 c GMB/AD Database Theory 2003 16 c GMB/AD  Ï Generalized SPC algebra Normal forms    1. Intersection   Define an SPC normal form to be a query of the form

2. Generalized Selection % % % " " U U U U U U ) !  ' , $ $   T T T T T ( & * G P P W Y Y +

Of form where # # and and # is or  K K K Q X Q V Z I MNO RS N O RS [ [ M L J J H H " - + d d d d d d ^ \ ^ _ a e Q ` g ] ] ] ] ] ] ] where f f S N S Ncb 3. Equijoin

Proposition 1 For every (extended) SPC query h , there exists an extended % % % " " ! . / - $ $     * & + # where # # where is . i i

query h in normal form such that h h . k k l lnm k k l l j j 1 4 1 4 ? ? 5 5 5 8 9: <= > 5 8 9: A = > 0 0 3 3 6 6 3 6 2 2 7 7 ; ; @ @ ; ; @ @ 1 4 5 5 0 BC E 3 6 3 6 D 2 2

Exercise 1

1. Code up these operators

2. Formalise the semantics of the equijoin operator

Database Theory 2003 17 c GMB/AD Database Theory 2003 18 c GMB/AD o F

Named perspective Named: SPJR algebra p t rs q ‚

Recall: Tuples are functions to   ƒ

SPJR query € Base relation  p ˆ w z w x { Singleton constant relation „ †c‡ ‰Š We write them as, e.g. y ucv | ‹  Œ Ž € Select (constant) „   p Note: We do not allow repeated names in a relation sort ‹  Œ ‘

€ Select (attributes) „  

p We have to ensure that the result of a query does not have repeated names, ’ Œ • • • Œ € Projection „ “   – ” ” ~ e.g. } } —

€ Renaming „   – We’ll need to be able to rename attributes ˜ ™š

€ € Natural join „

p A natural join operator reveals itself: join on the common attributes

Database Theory 2003 19 c GMB/AD Database Theory 2003 20 c GMB/AD ›  Sort judgements for SPJR queries Natural join œ ž ¡

È Ë È Ë Ç Ì Ç Ì  Ÿ Ê Í Ê Í É É ¢ £ £ ¦ ¡ ¨ ¤¥§¦ ©ª ¢ £ ¡ œ ž È ÎÏ Ë Ç Ì Ð Ì Ê Í Ê Í É É ¢ £ ¦ ¢ £ ¦ ³ ¡ ž ¡ ž ž ž

« « ² Ó Ó Ñ Õ Î Ï Ì Ò Ì ¢ £ ¢ £ ¡ ¬ ž ¡ ¬ ž ® ® Ô ­ ­ ´ ¯ Ê Í

« « If then ° ± ° ± Ó Ó Ó Ñ Î Ï Ì Ò Ì Ì Ò Ì Í Í Ê If Í then ¢ £ ¦ ¶ ¶ ¶ ¦ ¦ ¡ ž ž µ · ¹ ¸ « ² ² distinct ¤ ª ¢ £ ¦ ¶ ¶ ¶ ¦ ¡ º ­ ½ ½ ½ ­ µ · « ² ² ¾ »¼ ¼ ° ± ¢ £ ¡ ž ¢ £ ¢ £ ¡ ž ¡ ž µ ¿ µ ¿ « « « ¢ £ ¡ à ž ¢ £ ÀÁ ¡ ž  ž µ ¿ µ ¿ « Å « « ° ± ° ± Ä

Database Theory 2003 21 c GMB/AD Database Theory 2003 22 c GMB/AD Ö Æ

More Union

× Can define similar generalizations to SPC algebra We’d like to add some disjunctive flavour to our query language, e.g.

× Normal form: Where can I watch “Magnolia” or “Spiderman”?

1. Union é é é é é é çè çè ã çè çè ã çè í Ø ä ä ë ï ï   þ   ¦

 â Ù Ü Ü Ü Ù â â ì ò ê ê ¡¢ £¤ ð Þ ð ð ð Ú Þß àcá åæ ß àcá å æ Þ§í Ú Þ ÿ ñ Ý

Û Û î î ¥   ¥§¦ ¨ ¥  © ¢ ¨ Magnolia © ¢ Spiderman á ø ø ø á ú ø ø ø ú á ó õ ÷ â ù â ö ê û ü ô ô ô ô ô where ô and are distinct ß æ ß æ 2. Extended (disjunctive) selection ×   þ A corresponding normal form theorem ¡¢ £ ¤ ÿ

¥   ¥§¦ ¨ ¨ © ¢ © ¢ Magnolia  Spiderman

3. Non-singleton constant relations          þ  

¡¢ £ ¤ ÿ Magnolia  Spiderman        ¥

The addition of Union to our algebras, leads to SPCU and SPJRU algebra

Aside The addition of union to the corresponding calculus is non-trivial!

Database Theory 2003 23 c GMB/AD Database Theory 2003 24 c GMB/AD

ý Further power Difference operator

Even with Union, our algebra is quite weak, e.g. Unnamed Named Which movies where Stallone did not act were directed by Stallone? 0 3 0 3 0 3 0 3 Put more formally, our query algebras are monotonic. / 4 / 4 / 7 / 7 2 5 2 5 1 1 1 1 ( 6 6 0 3 0 3 / 4 / 7 ! ! % & & 2 5 2 5 " $ ' 1 1 1 1

Definition 1 An algebra is monotonic if # If then - + + ,

# # . ) ) * * ) ) * * Our example becomes, e.g.

To rectify this we’ll add the difference operator. 6 8 ? FGH IJ FGH IJ 8 ? 9 9 :;<= : ; < = D D > @ >§E K K >§E K K > L :A = B ;C A Stallone B ; C A Stallone

M The (unnamed) is SPCU with difference operator.

M The (named) relational algebra is SPJRU with difference operator.

(It is traditional to add to these the generalizations, e.g. N for unnamed relational algebra)

Database Theory 2003 25 c GMB/AD Database Theory 2003 26 c GMB/AD O .

What’s coming up

P Relational model

– Different query language perspective: logic

– Obvious questions:

Q Equivalence with algebra Database Theory: Lecture 2

Q Expressivity?

Q Natural extensions and their expressivity The Relational Calculus

P Different data models Dr A. Dawar

www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 27 c GMB/AD Database Theory 2003 28 c GMB/AD S R Queries Queries in Algebra

A query is a formal syntactic rendering of a question we wish to pose to a In the (SPC or SPJR) relational algebra, a query is a term in the algebra. database. Each term denotes a relation. Examples Relations are combined using various unary operations (selection, projection) and

T Who is the director of “Spiderman”? binary operations (Cartesian product, union).

T Where is “Spiderman” playing?

A Boolean query is one where the output schema is a relation of arity Z .

T Is there a film directed by Almodovar playing in Cambridge? [ [ Formally, a query defines a query mapping U . It has an input schema (typically

So, for any instance , \ is either or . ] ^ _` _ab ` a database schema) and an output schema (typically a relation schema). V V

It maps an instance over the input schema to an instance U over the output W X schema.

Database Theory 2003 29 c GMB/AD Database Theory 2003 30 c GMB/AD c Y

Conjunctive Queries Conjunctive Calculus

The various relational calculi (of which the conjunctive calculus is the Formally, a conjunctive formula x is given by the following syntax z y y € € € } { ~    x atom simplest) are based on first-order predicate logic. |§} ‚ „ x x

e l conjunction j k ƒ f h f h d d i

If is a relation of arity and g , then g is a fact that is either m n o

true or false in a given database instance . † x existential quantification ƒ } † { o p f p ‡ d where is a relation name, is a variable and each is either a variable or a

If is a variable, then g is true or false given an instance and a m n Š ˆ‰ valuation for p . constant in . A has the form p f p d

g is a simple query. q r m ns € € € } } ~    x Example ‹Œ  ƒ Ž

t Who is the director of “Spiderman”? where 

x is a conjunctive formula; p v p v g Movies “Spiderman” g ns m q ru  }

each ‡ is either a variable or a constant; and  € € € ‘’ ’ } } ~    x the set of variables in is exactly  . | ‚

Database Theory 2003 31 c GMB/AD Database Theory 2003 32 c GMB/AD “ w Semantics Active Domain š º»¼ ½ ”• ˜™ — – – ¹ The of a formula ¹ , denoted is the set of constants A valuation is a function , where is a set of variables. active domain ¾ ¿

appearing in ¹ . œ –

If includes all free variables in a formula › , we can define satisfaction › Ÿž À À inductively by: Similarly, the active domain of an instance , denoted º »¼ ½ is the set of all ¾ ¿

constants appearing in any relation of À . £ ž œ ” ¥ ¥ ¡ ¡ §

› and ¦ ¢ ¦ ¢ ¢¤£ ¦ À Å Ã Ä Show, by induction, that if Ä , then for any free variable of , ÁŸÂ ž œ œ © ¨ ª ¨ ª › › › and › and › Ÿž Ÿž ž À Æ º »¼ ½ Å Ç š œ « ˜™ ® ¬ § ¾ ¿È ¾ ¿

° ² › ­ and there is a with ­ Ÿž ¯ ± ³

£ Exercise! œ ¥ ´ The query mapping defined by the query › maps the instance to µ  ¶

It follows that in evaluating a query ¹ , we only need to consider valuations whose ž ž œ ” œ ” ¥

´ › for some ¢ ¦ µ ¢·£ ¦   ¶ À º»¼ ½ º»¼ ½ É

range is in ¹ . ¾ ¿ ¾ ¿

Database Theory 2003 33 c GMB/AD Database Theory 2003 34 c GMB/AD Ê ¸

Normal Form Equality

A conjuctive formula is said to be in normal form if all quantifiers precede all Every query in the calculus defined so far is satisfiable. conjunctions. That is, it’s of the form ã ã â That is, for every â , there is an such that is not empty. ä å Ö Ö Ö Ï Ë Ï Ï Ï Ì Ì Õ Õ Ò Ò Í Í Í × × Ð Î Î æ Ñ§Ó Ô Ñ§Ó Ô Ô æ Ñ ë è è ç ç é Compare ê ä ä å å

Every query in the conjunctive calculus can be defined with a formula in normal The calculus could be extended to express such queries by allowing equalities ì ì form. ( çîí ï ) as atomic formulae.

Use, í í ñ õ ñ ô ñ ó ó ë ö å ð§ñ ò ä Ë Ë Ì Ì Ø Ù Ù Ø Ù 1. Ø is equivalent to if does not occur in . Ú Û Ü Ë Ë Ë Ë Ì Ì Ì Õ Õ Ý Ù Ù Ø Ý 2. Ø ÙÝ is equivalent to if is not free in , is not Ñ Ô Ñ Ô Ñ Ô Ì Ù Ø

free in and Þàß .

Database Theory 2003 35 c GMB/AD Database Theory 2003 36 c GMB/AD ÷ á Equivalence Algebra to Calculus ùûú Every query definable in SPC algebra is definable in the conjunctive calculus Given a query ù in the SPC algebra, we define a query in the calculus.

Moreover, if the query is satisfiable, then it is definable without the use of ÿ ýþ ú

¦ ¦ ¦ ¦ ¦ ¦ ¢ ¢ ¢ ü ü ¤ ¤ § § ¥ ¥ ¥ ¥

¨ © equality. ¡£¢ ÿ ýþ ú

true ¡ ¨ ¡ 

Every query definable in the conjunctive calculus is definable in the SPC algebra ú ú

¦ ¦ ¦ ¦ ¦ ¦   ¤  ¤ ¤ §  ù ¥ ¥ ù ¥ ¥ If  and  and there are no ¨ ¨ ¤ ¡£  ¡£ variables in common, then ÿ ýþ ú

 ¦ ¦ ¦ ¦ ¦ ¦     ¤  ¤ ¤ ¤   § ¥ ¥ ù ¥ ¥ ¥ ù   ¨ © ¡£

ÿ ýþ ú

¦ ¦ ¦ ¦ ¦ ¦  ¤ ¤ ¤    ¥ ¥ ¥ ¥ ù      ¨ ¡£ © ©£         ¦ ¦ ¦ ¦ ¦ ¦   ¤ ¤ § ¥ ¥ ¥ ¥ where   enumerates all variables among that are not in ¦ ¦ ¦  

¥ ¥ .    

Database Theory 2003 37 c GMB/AD Database Theory 2003 38 c GMB/AD

ø

Selection Calculus to Algebra $ ) ) ) & ' * !#" ( ( If , , Given a query %£& + - ; G G G = @ F F B B C C H H 798 A£D E A£D E EI :<; >? A " G G G 1 J J B B 2 ! .£/ . 3 3 C H 0 take the cross-product , 4 is obtained by replacing& by . O D K 0 M P apply a selection N for each constant appearing among the , L For D K M P apply a selection Q for every repeated variable among the , L " 1 5 ! .£/ . 3 3 0 ; R @

apply a projection U U U to get rid of the columns corresponding to , S V T T L L & &

if either or 5 is a variable, it can be replaced by the other. If they are both 0 apply further selections corresponding to constants and repeated variables in constants, we need equality! ;

= .

Database Theory 2003 39 c GMB/AD Database Theory 2003 40 c GMB/AD W 6 Disjunction Negation

To extend the conjunctive calculus to one equivalent to SPCU algebra, it would The full relational calculus is obtained by allowing negation in addition to appear we need disjunction. disjunction. This allows the full power of first-order predicate logic in specifying a query. Where can I watch “Magnolia” or “Spiderman”? k k k h i l e9f j j n m o g£h

with rules for formulae (in addition to those of conjunctive formulae): Y Y ` ^ ^ ^ ^ \ Guide “Magnolia” \ Guide “Spiderman” \ X£Y Z[ ] ] ] _ _ _a f q p p n n negation Aside: How would you prove this cannot be done in the conjunctive calculus? r

n n disjunction m t s

n universal quantification However, allowing unrestricted disjunction can result in unsafe queries, i.e. m queries with possibly infinite answers. In the presence of negation, the last two are not strictly necessary, but they are convenient. b Y b c ` c ^ ^ ^ ^ \ \ \ Z ] _a X£Y _ ]

It is no longer the case that a satisfying valuation is restricted to the active domain.

Database Theory 2003 41 c GMB/AD Database Theory 2003 42 c GMB/AD u d

Safety Domain Independence Œ Š ‡ ‡‰ ˆ Using negation and disjunction, one can write queries that admit infinite relations If we fix a domain , the interpretation of a query ‹ over an instance

as answers. can be relativized to ‡ . w

Movies “Kika” | “Almodovar” | { }~ v£w xzy All valuations are restricted to have range in ‡ . w w € | | | | |   Movies “Kika” “Almodovar” Movies “Almodovar” “Maura” }~ } { v x { Œ  The resulting interpretation is ‹ . Ž  ƒ ‚ Œ Š ‡ ‡’‘ ‡‰

One possibility might be to restrict to some fixed finite set. Still, some ˆ A query ‹ is domain independent if for any  and any , queries are not domain independent. • Œ Œ   ‹ ‹ Ž ”“ Ž – w „ |   }~ v£w x { It is undecidable whether a given expression of the relational calculus defines a domain independent query.

If a query is domain independent, it can be evaluated by restricting all valuations to the active domain.

Database Theory 2003 43 c GMB/AD Database Theory 2003 44 c GMB/AD — † Equivalence

Every query expressible in the relational algebra is equivalent to a domain independent query in the relational calculus.

Every query ˜ in the relational calculus interpreted with domain restricted to Database Theory: Lecture 3 Ÿ ™š› œ

˜ž is equivalent to a query in the relational algebra.  Deductive Databases

Dr A. Dawar

www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 45 c GMB/AD Database Theory 2003 46 c GMB/AD ¢ ¡

Rules Syntax

Rules provide an alternative syntax (inspired by logic programming) for Formally, a conjunctive rule is: conjunctive queries. ½ ½ ½ ¶ ¶ ¶ » » ¾ ¾ ¼ ¼ ·£¸ ¹ ·£¸ ¹£º ·£¸ ¹ £ ª ¥ ª ª ª ¬ ® « « ­ ¦ ¦ ¦ ¦ ¦ ¦ ¦ ¦ ¦ § § Movies “Almodovar” Guide Location ¨ ¤ ¨ ¤£¥ ¨£© ¤£¥ ¨ ¤£ª where expresses the query ¿ ¶ is a relation name not in the input database schema. ² ² ² ª ª ª ª ¥ ª ª ³ ³ ® « ¬ « « ® ­ ¦ ¦ ¦ ¦ ¦ ¦ ¦ ¦ ¦ § § Movies “Almodovar” Guide Location ¤£¥ ¨´ ¤£ª ¨ ¤£¥ ¨ ¯ °± ¿ ¶

Each À is a relation in the input schema. ¿ ¸ ¸

Each of and À is a tuple of terms (i.e. variables or constants). and, in addition, gives the resulting relation the name £ . ¿ ¸ ¸

Every variable occurring in occurs in at least one of the À .

Database Theory 2003 47 c GMB/AD Database Theory 2003 48 c GMB/AD Á µ Semantics Disjunction

The meaning of the rule Disjunction is easily added by allowing multiple rules for the same relation

É É É Where can I watch “Magnolia” or “Spiderman”? Â Â Â Ç Ç Ê Ê È È Ã£Ä Å£Æ Ã£Ä Å Ã£Ä Å

is given by Î Ö Û Ø Ý Ý Ü Guide “Magnolia” Ü ×£Ø Ù£Ú × × Ù É É É Ï Ä Ä Â Ð Ð Â Ç Ç Ê Ê Ã ÅÑ Ë ÌÍ Ã£Ä Å Î Ö Û Ø Ø Ý Ý Ü Ü Guide “Spiderman” × Ù × Ù£Ú × É É É Ä Ï Ä Ä Ç Ê where is the sequence of all variables in È È not occurring in .

Note: Occurrences of a variable in distinct rules are to be treated as distinct

 is called the head of the rule. Ã£Ä Å variables. É É É Â Â Ç Ç Ê Ê È È is the of the rule. Ã£Ä Å Ã£Ä Å body

Ó This form of disjunction is always safe. Ò

Aside: by allowing equalities ( Ô ) in the body, we can ensure that there are no constants or repeated variables in the head.

Database Theory 2003 49 c GMB/AD Database Theory 2003 50 c GMB/AD Þ Õ

Multiple Relations Negation

As the syntax of conjunctive rules allows us to name the result of the query, it is One possible extension would be to allow negation on atomic formulae in the easy to simultaneously define multiple relations. body of a rule.

This can allow unsafe queries. ä æ ä ê é ß à ç ã ã ã ã ã è è á£â å á£â å á å ô ä æ â ê ê ì ì ä ß ß é ß ï ñ ë à à ã ã ã ã ã ã ã ã è è õ õ á£â å á å á å á å Movies “Kika” “Almodovar” ð ò ð£ñ ò£ó

As long as the dependencies among the relations are non-circular, each However, there is a simple syntactic restriction that guarantees safety. individual relation is still a conjunctive query. For instance, ê ì ì ä ê ê ê æ â ä é é é ß ë ë ë à à à ë ã ã ã ã ã ã ã ã ã ã ã ã ç ç ã ã è è è è è è å á åí å á å á å á á£â å á é

Relations (such as ç ã ) occurring in the database schema are called extensional database relations. Relations that occur in the heads of rules are intensional database relations.

Database Theory 2003 51 c GMB/AD Database Theory 2003 52 c GMB/AD ö î Range Restriction Universal Quantification

A Range-restricted conjunctive rule with negation is If negation is restricted to edb (extensional database) relations, not all relational calculus queries are expressible. ÷ ü ÿ ÿ ÿ ü ý ý

þ þ ø£ù ú ø£ù ú£û ø£ù ú Who are the actors that appeared in every Almodovar film? where

        Movies  “Almodovar” Movies  “Almodovar”         ¤ ¡ ü £ £ £

Each ¢ is a literal, i.e. either or for some relation in the database schema. Informally, a set of conjunctive rules with negation only on edb relations is equivalent to a formula that is the disjunction of formulae of the form ¡ ù ù

Every variable occurring in occurs in at least one of the ¢ . " " "    

!     # #         ¡ Every variable in the body appears in some positive literal.

¥ § ¥ § A literal is positive if it’s of the form ¨ . It is negative if it’s of the form ¨ . ¦ © ¦ © Simulating the universal quantifier requires placing a negation outside the existential quantifier.

Database Theory 2003 53 c GMB/AD Database Theory 2003 54 c GMB/AD $

Universal Quantification Example: Path Finding

Who are the actors that appeared in every Almodovar film? A railway database contains the relation Railway Service From To

WAGN Cambridge Ely & * ' ' ' ( Almo-actor ( Movies “Almodovar” % ) %& ) WAGN Cambridge King’s Cross

Central Ely Birmingham & & & * - , - ' ' ' ' '/. Not-all Almo-actor Almo-actor ( Almo-actor %+ ) %+ ) % ) %+ ) Virgin Euston Birmingham

... & * GNER Peterbrough Newcastle

All Almo-actor ' '/. Not-all ) %+ %+ )0 %+ ) Great Western Paddington Cardiff

This is dynamically updated, to reflect cancellations and line failures.

Is it possible to get from Cambridge to Glasgow?

Database Theory 2003 55 c GMB/AD Database Theory 2003 56 c GMB/AD 2 1 Path Finding Recursion

In the relational calculus, we can ask: We can allow recursion, by allowing a relation name to appear in both the head and body of a rule. Which stations are reachable from Cambridge without change? Example: 5 3 9 9 Reach Railway 8 “Cambridge” 6 45 67 4

Which stations are reachable from Cambridge with one change? ? E F B B C C @A D @A D ; 5 ; : < : ? ? 9 9 9 9 9 E G G 8 8 Reach Railway “Cambridge” Railway F 6 4 6 4 45 67 B B B B C C @A D D @ @A DH

Which stations are reachable from Cambridge with two changes? ? Here, F is a relation given in the database. The query defines a relation that is

the transitive closure of F . ; 5 ; ; ; < : : < : < < = 9 9 9 9 9 9 9 9 Reach Railway 8 “Cambridge” Railway 8 Railway 8 6 6 6 4 4 4 45 67

But, there is no way to ask “which stations are reachable from Cambridge?”. is the language of conjunctive rules with recursion.

Database Theory 2003 57 c GMB/AD Database Theory 2003 58 c GMB/AD I >

Route Finding Syntax

Which stations are reachable from Cambridge? A Datalog rule is an expression Z Z Z K M S S S X X [ [ O O Y Y Reach Railway N “Cambridge” L J JK L TU V TU VW TU V Q K Q M P O O O Reach Railway N Reach J L J L JK L where \ Z Z Z U U U X [

The tuples of terms Y Y Y satisfy the restrictions on conjunctive rules. Is it possible to get from Cambridge to Glasgow? \ S

Each ] is either a relation in the input database scheme, or appears in the

Camb-Glas M Reach “Glasgow” L J head of some rule.

A Datalog program ^ is a sequence of rules. ^ We write _`a for the database schema consisting of the relation names that T V ^ ^ occur only in the body of rules in . bcd is the schema consisting of all T V

relations occurring in ^ .

Note: no negation.

Database Theory 2003 59 c GMB/AD Database Theory 2003 60 c GMB/AD e R Model-Theoretic Semantics Model-Theoretic Semantics contd. ‚ ‚ ‚ With a rule If is a program, a database instance over ƒ„ is said to be a of if

† ‡ model m m m f f f k k n n ˆ l l gh i gh ij gh i it satisfies ‰ . we associate the formula of first-order predicate logic: ‚ Š Š ‚ If is a database instance over ‹Œ , the semantics of over is defined to † ‡ s s s m m m p p h f r r f f o k k k n n n l l q g gh i g i it gh iu ‚ Š Ž Ž Ž be the minimum instance such that  and is a model of . m m m p p k n where l l enumerates all the variables appearing in the rule.

How do we know such a model exists? v w

For a program , we write x for the conjunction of all formulae associated

with rules in v . v w y z{|

Since there are no free variables in x , for any instance over , we g i have either y w y w

x satisfies x . }~

or y w y w x x

€ does not satisfy . }~

Database Theory 2003 61 c GMB/AD Database Theory 2003 62 c GMB/AD  

Fixed-point Semantics Fixed-point Semantics contd. ° ‘ ‘ ’ “”•

Let be a program and an instance over . The function ± is a on instances.

– — monotone map ˜ ’ ² ° ² ° Define the instance ™ as follows ´ ´ – — If ³ , then ± ³ ± . ¶ µ ¶ µ š ‘ ˜ ’ ’ œž › › ›

For every relation in , ™ . – — – — – —Ÿ – — Exercise: prove it! š ‘ œž › For every relation › not in , a tuple is in if, and only if, there is a

– — Fixed-point theorem ² ² ´ ´ rule For any instance , there is a least such that ³ which is a fixed-point of ° ¨ ¨ ¨ £ › › › ±

¥ ¥ . © © § § –¢¡ —¤ –¦ — –¦ — ¡ Ÿ Ÿ ’ ª ª ® ¦ £ › ¬ and a valuation such that and ­ ­ for each . « – — – — ° ¸ ´ ´ ± µ ¶· ˜ ’

That is, ™ is formed of the set of facts that can be by

– — immediately justified ’ the program ‘ and the instance .

Database Theory 2003 63 c GMB/AD Database Theory 2003 64 c GMB/AD ¹ ¯ Iteration

The least fixed point can be reached by iteration. º º » Given a program and an instance over ¼½¾ , take ¿ À » Á ÂÄÃ Ã È Á Á Ç É Å Å . Database Theory: Lecture 4 Æ À ¿ Ã Á Á Ç Å Repeat until Å . Æ Recursion and Negation

Dr A. Dawar

www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 65 c GMB/AD Database Theory 2003 66 c GMB/AD Ë Ê

Datalog Monotonicity

Datalog extends the conjunctive calculus with recursion, but does not allow Every query definable in Datalog is monotonic. negation. Í Î ÎÐ Í ÑÒÓ

If is a Datalog program and Ï instances over , then Ô Õ

We can express Ð Î ÎÐ Í Î Í Î

If Ö then Ö . Ô Õ Ô Õ Which stations are reachable from Cambridge?

but not Ð Î Î Ö For which pairs of stations is there no direct connection? Ð × Î × Î Ø Ø Ö Ô Õ Ô Õ Ð × × Î × × Î Ø Ø Ø Ø Ö Ô Ô Õ Õ The expressive power of Datalog and the relational calculus are incomparable. Ô Ô Õ Õ Ð ×ÚÙ Î ×ÚÙ Î Ö Ô Õ Ô Õ Ø Ø . .

Database Theory 2003 67 c GMB/AD Database Theory 2003 68 c GMB/AD Û Ì Example Adding Negation

Edge Source Target Allowing negation in datalog programs is not a trivial task, as the fixed-point

a c semantics was based on monotonicity. Given a graph as a binary relation a b

b d We now consider a series of ever more expressive query languages obtained by different ways of incorporating negation into datalog.

We can ask for which pairs of nodes there is a path: All have the proviso that negation is range-restricted. á Ý Þ Þ ß Path Edge ß ÜÝ à Ü à ç ê That is, if a è appears in the body of a rule, every variable in éê ë á Ý â â Þ Þ Þ Þ ß Path Edge Path ß ÜÝ à Ü à Ü àã appears in some positive literal in the body of the same rule. Ý Þ But we cannot ask: for which pairs ß is every path of even length. å

Adding the pair Þ to the above table would require removing it from the result à Üä of the query. This shows that the query is not monotone.

Database Theory 2003 69 c GMB/AD Database Theory 2003 70 c GMB/AD ì æ

Semipositive Datalog Semipositive Datalog contd.

ÿ

The simplest method of extending datalog is to allow negation only on edb For a semipositive datalog program , the map ¡ on instances is still monotone predicates. in a restricted sense. ¢ ¢¤£ ÿ A semipositive datalog program is a collection of rules If and are instances over ¥¦ § such that: ¨ ©

¢ ¢¤£

ò õ õ õ ò ï í , and ó ó ö ö ô ô îï ðñ îï ð î ð

ÿ ¢ ¢¤£   for each in , . ¨ © ¨ © ¨ © ø ò í í

where if ÷ is ÷ , then ÷ does not appear in the head of any rule. and the usual restrictions on variables apply. then £ ÿ ¢ ÿ ¢

¨ © ¨ © ø ù ñ ü ô ô û û îú ð îú ð This suffices for proving the existence of minimal fixed-points and therefore giving ø õ ù ù ý ñ ý ü ô ô ô ô û û ð îú ð î îú ð a semantics for semipositive datalog.

This computes the transitive closure of the complement of ü .

Database Theory 2003 71 c GMB/AD Database Theory 2003 72 c GMB/AD  þ Semipositive Datalog contd. Stratified Datalog

While semipositive datalog can express some queries that are neither in the A stratified datalog program 2 is a set of rules relational calculus nor in datalog, it cannot express universal quantification. 8 ; ; ; 8 5 3 9 9 < < : : 45 67 45 6 4 6 ? ? ? = 2 2 2 > > 9 , which can be partitioned into sets < such that        Movies “Almodovar” Movies “Almodovar” ! "#      @ A For every relation 3 , there is an such that every rule such that all rules with 2 A 3 3

in the head are in B (we call the defining stratum for ; for edb $ $¤& '() +

Suppose % are two instances over such that for any tuple over *

& relations, the defining stratum is 0). $ $ $ + + , (- . 0 0 , / if, and only if, / .    & @ 8 3 3 D * $ * $ If is a positive literal , then the defining stratum of the defining C C C Then % .  

stratum of 3 . The proof of this is similar to monotonicity of ordinary datalog programs. @ 8 3 If is a negative literal 3 , then the defining stratum of the defining CFE To get universal quantification, we have to find a way of alternating rule C C

stratum of 3 . application with negation.

Database Theory 2003 73 c GMB/AD Database Theory 2003 74 c GMB/AD G 1

Example Semantics Z U If is a database instance over VWX , where is a stratified datalog program YZ [ _ _ _ \ Z Z Z ^ ^ ] ` , we define the following sequence of instances for I M J J J K Almo-actor K Movies “Almodovar” H L HI L c d a

b b . I I I M h fg P O P J J J J JRQ Not-all Almo-actor Almo-actor K Almo-actor HN L HN L H L HN L \ U U e h fg \ Z U U U ^ ] ] i i i i j j Y [ I M

All Almo-actor J JRQ Not-all HN L HN LS HN L h fg \ Z U U

Then, ` . Y [ This example does not involve recursion.

Indeed, stratified datalog without recursion is exactly equivalent to the relational For a given program k , there may be more than one valid stratification. Show that calculus. the semantics does not depend on which stratification is chosen.

Database Theory 2003 75 c GMB/AD Database Theory 2003 76 c GMB/AD l T Example: Game Fixed-Point Calculus

Relational equations such as q no p tu p w z {| m y v x v} x~ rs — ™

Win-pos Win ˜ “White” ”• – ”• – Move Player Position Next Win Position Player š ž • ž ž œ  ˜ ˜ ˜ ˜ › › Move “White” › Move “Black” Win-pos –Ÿ ” – – ” – ” ”

Black n   White have solutions.

White n € . . . If

White  

¢ £ ¥

. . . ¡ is a relational expression over schema ; ¤ ¦

¥

no occurrence of in ¡ is in the scope of a negation; and ˆ †‡ q „ „ ‚

Win Win y “White” ƒ ƒ

§ ¢ ˆ †‡ is an instance over schema . q Œ „ Ž   ‘  „ ‹ ‰ ‰ y y y y Š   Win  Move “White” Move “Black” Win ƒ ƒ ƒ ƒ ƒ § ¨ ¥

There is a minimum extension of satisfying ¡ .

Each Win ’ is definable in stratified datalog but the set of all winning positions for White is not definable.

Database Theory 2003 77 c GMB/AD Database Theory 2003 78 c GMB/AD © “

Fixed-Point Calculus Definition Extending the Algebra ¬ ª « « Predicate ­ Base relation One way of extending relational algebra to express recursive queries like transitive closure ­ ° ² ²

¯ where occurs only positively in ® ± ³ ½ Ã Ä À À Á Á ¾¿  ¾¿  ¬ ª « « ¶ ½ ½ Ã Å Å Ä ² with matching arities

Formula ±µ´ ³ À À À À Á Á ¾¿  ¾¿  ¾ ÂÆ · ² ² ® is to add while loops. ¸ ¹ ² ® È ½Ç Ä º ² ® while change do » ­ begin ° ² ²

For the semantics, interpret ¯ as the least relation satisfying . ± ³ È Ñ ½ ½Ç ½ Ê Ä É Ì Ì Ë Í Ï Ð Â Â ¾ Every query definable in stratified datalog is expressible in the fixed-point calculus. ¾Î end

Database Theory 2003 79 c GMB/AD Database Theory 2003 80 c GMB/AD Ò ¼ While Programs

Every query definable in the fixed-point calculus is definable by a while program of the form. Õ Ô Ö Ó × while change do Database Theory: Lecture 5 begin Õ Ô Ö Ó

Ø Expressivity and Complexity end Dr A. Dawar Ö Ö

where × and Ø are terms of the relational algebra. Ú Ú Ú Ö ÓÙ

Moreover, if Ø is of the form , the converse holds as well. www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 81 c GMB/AD Database Theory 2003 82 c GMB/AD Ü Û

Evaluating a Query Running Time

Given a datalog program Ý , a na¨ıve method of evaluating it on a database Let

instance Þ would proceed as follows: ý û § ü ¡¢ ¦ þÿ £¥¤ ¨ þ á à Þ Ý é êë â ä ä è æ 1. ß . ì íî ã å ç ý û ¤ © the maximum number of distinct variables occurring in any rule in Ý Ý Þ ï ð êñò

2. For every valuation of the variables occurring in over ó , if ì í ý û ¤ the maximum arity of any relation in the head of a rule in  ö ä body  ìõô í÷ ¤ § ü Then, the na¨ıve evaluation of on takes steps. £ ¨ á û ß ø is a rule such that body, then ü ç We only need to consider distinct evaluations for any given rule.  á û ü à ï ö ä â ä ß ß The number of iterations is at most . ì í ã ìõô íîù ì í ¤ ¤

For any program , there is a polynomial  such that can be evaluated in time 3. Repeat step (2) until no new tuples are added. § §

 on an instance . £ þ þ ¨

Database Theory 2003 83 c GMB/AD Database Theory 2003 84 c GMB/AD  ú Running Time NP-complete Query

The upper bound on the running time also holds for stratified datalog, as each Is there a way to travel around Britain so that I visit every railway station stratum is evaluated to completion just once. exactly once?

Since the query expressed above is an NP-complete property of a database One can also derive such bounds for relational algebra. instance, we cannot hope to express it in datalog.  For any relational algebra query  , there is a polynomial-time algorithm to Indeed, one can prove that it is not definable. evaluate it on all instances.

 The degree of the polynomial is bounded by the maximum arity of any Is every polynomial-time computable query expressible in datalog? In stratified datalog? In the fixed-point calculus?

sub-query of  .

Many optimization techniques are designed to reduce the arities of intermediate relations.

Database Theory 2003 85 c GMB/AD Database Theory 2003 86 c GMB/AD  

Polynomial-time Queries Generic Queries  

Give me the set of elements in the active domain which are at even Given database instances  over the same schema, a bijection positions when listed lexicographically.       This query is computable in polynomial time.  !

is an isomorphism from to if for any relation and any tuple :  ! !

" "

Unless the lexicographical order is explicitly given as a relation, it is not definable  iff  # $ # $ # $¥% in any language we have considered.    A query mapping & is generic if, whenever is an isomorphism from to , it is The query is not generic    & also an isomorphism from & to . # $ # $ # $ ) ) ' ( '( denotes the query obtained from by replacing every constant by

* + .

Every query language we have considered only defines generic queries.

Database Theory 2003 87 c GMB/AD Database Theory 2003 88 c GMB/AD ,  Order Fixed-Point with Order 0 ./

Let - be a total order relation on . The fixed-point calculus with order can express non-generic queries: 0 (for instance, if ./ is the set of natural numbers) H E H J K

Is there a pair G in the relation with ? I DFE Allow queries in the fixed-point calculus to use the order. Immerman and Vardi showed: 3 1 2 2 Predicate 4 Base relation A query is expressible in fixed-point calculus with order if, and only if, it is 4 7 6 9 9 where occurs only positively in 5 8 : computable in polynomial time. 3 2 2 1 = In the absence of order, there are polynomial-time, queries that cannot Formula 9 with matching arities generic 8<; :

= = be expressed in the fixed-point calculus. > 5 ? 9 9 5 Is the number of tuples in J even? @ A 9 5 B 9 5 Open Question: Is there a language that can express exactly the polynomial-time computable generic queries?

Database Theory 2003 89 c GMB/AD Database Theory 2003 90 c GMB/AD L C

Inexpressivity

The transitive closure query: N T U Q Q R R OFP S OFP S N N T V V U Q Q Q Q R R OFP S OFP S O S¥W Proving Inexpressivity is not definable in the relational calculus, without recursion.

How do we prove this?

We can show that the query:

Is the graph U strongly connected?

is not definable. N N P X X Q If were definable, this query would be R R . OFP S

Database Theory 2003 91 c GMB/AD Database Theory 2003 92 c GMB/AD Y M Quantifier Rank Proving Inexpressibility t q q \ s s u r v [ For a formula Z , we define its Z by: For two instances , we write to denote that for any formula with

quantifier rank ] ^ x | { w v and no free variables: y z \ ` [ Z 1. if Z is atomic then , ] ^ _ ~ q  s

v iff v } ~ } _ a \ \ [ [ b Z 2. if Z b then , ] ^ ] ^ _ _ _ f d c e c e Z b b 3. if Z b b or then g i \ h \ \ e c j [ [ [ b b Z . ] ^ ^ ] ^ ] ^ _ ] To prove that strong connectedness is not definable in relational calculus, we  | € u show that for each there are graphs and u such that _ _ o k \ l \ l m n [ [ b Z b Z 4. if Z b or then ] ^ ] ^ _ t  € u u u \ [ That is, Z is the depth of nesting of quantifiers in Z . ] ^  €

but u is strongly connected while u is not.

Database Theory 2003 93 c GMB/AD Database Theory 2003 94 c GMB/AD ‚ p

Ehrenfeucht Games Ehrenfeucht Games contd. Ÿ Ÿ Ÿ œ œ › š 

We define a game played between two players Spoiler and Duplicator on two If, for any relation and any tuple taken from ž ž we have ƒ

database instances and „ over the same schema. ¢ ¥ › › š š ¡ ¡ if, and only if, ¦ £ ¤ £ ¤ £ ¤ † The game consists of rounds, where in each round : then Duplicator has won the game. Otherwise Spoiler has won. 1. Spoiler chooses an element from the active domain of one of the two ƒ ‡ Š‹Œ ‰ ‰ „ ˆ ˆ

instances: either or  . ‘ Ž  Ž

Duplicator has a strategy for winning the § -round Ehrenfeucht game on 2. Duplicator responds with an element from the active domain of the other ¢

and ¦ if, and only if, ‡ ˆ ˆ

instance:  or . ¨ ¢ Ÿ ¦

” ” ” ” ” ” ‡ ‡ ’ ’ • • “ “ “ “ At the end of play, we have two sequences and   . — ‡ ˜ – ˆ ˆ

Let denote the mapping  .

Database Theory 2003 95 c GMB/AD Database Theory 2003 96 c GMB/AD © ™ Cycles « ¬ ­ ª

Take « to be a graph consisting of a single cycle of length and « to be a «

graph consisting of two disconnected cycles, each of length ® . ¯ ­ ª

We can show « « « Database Theory: Lecture 6

This proves that the transitive closure query is not definable in the relational Complex value model calculus. Dr G.M. Bierman

www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 97 c GMB/AD Database Theory 2003 98 c GMB/AD ± °

History Foundations

² First Normal Form: Assumption in original Codd 1970 paper The notions of relation names, attributes and constants are as in the relational model ² But...lots of application areas require more flexible model (even our Cinema ¸ · · » ¹º database from lecture 1!) Sorts ¶ Base relation · Á Á Á ¾ · ¶ ¶ Ç Æ ¿   ² À À Tuples È ÄFÅ First proposal: Makinouchi [1977] ¼ ½¥¾ Ã

¶ Sets ¼ É Ê ² Main activity in 80s:

– Jaeschke & Schell [1982] Ë An element of a sort is called a complex value Ë » – Thomas & Fischer [1983] Useful shorthand: Often eliminate occurrences of ¹º , e.g. – Abiteboul & Bidoit [1984] ÍÎ Ñ Ó Î Ñ Î ÍÎ Ñ Ö Î Ñ Í Ó Î Í Ö ÏÐ ÏÐ Ï Ð ÏÐ Ô Ô Ò Ò Ò Ò Ò Ò – Roth, Korth & Silberschatz [1986/8] Ì Õ Ì ×Ø ×ÚÙ Ì Õ Ì ×Ø ×

Ë Note that a complex value may belong to more than one sort – Schek & Scholl [1986]

Some differences, but basic idea: drop requirement that entries are atomic ³ Other names: NFNF, NF ,´ 1NF, nested relational model, ...

Database Theory 2003 99 c GMB/AD Database Theory 2003 100 c GMB/AD Û µ Extended language ALGcv: Complex value algebra é Ü è è ê

Relational algebra deals with sets of tuples. Thus complex value algebra ALGcv ç Base relation

í Constant values ë ì should deal with sets of complex values. î ï

ç ç Union ë Ü A complex value relation of sort Ý is a finite set of values of sort Ý .

çñð ç Set difference ë Ü á ò â õ ó ö ç Select (constant) Common confusion: A complex value relation of sort à à is a set of ë ÷ ø Þ¥ß ã ô ò õ ß á â ó ó

ç Select (attribute) ë ÷ ø

tuples over attributes . The entire relation is also a complex value of ô ù ò ó ó á ç Select (attribute mem) ú â ë ÷ ø ô

sort . ù ä Þ¥ß ãå ò õ ó ó û ü ç Select (attribute comp) ë ÷ ø ô ù

Ü Note: Sort of a relation need not be a tuple. ý ó û û û ó

ç Projection ë ÷ ø þÿ ÿ

£ ¥ ¢ ¤ ¦ ¤§

¡ ç Powerset ë ÷ ø ¥

§ ¨ © ¤ § ¤

 ¡ ç ç Tuple ë ÷ ø ó û û û ó þ ÿ ÿ

¥ ¦ ¤§ © ¤ § ¤

ç Set creation ë ÷ ø  ¥ § ¨ ¤ ¦ § ¢ ¡  ç Tuple destroy ë ÷ ø  ¥ ¦ ¤§ ¤ ¦ § ¢

 ç Set destroy ë ÷ ø

Database Theory 2003 101 c GMB/AD Database Theory 2003 102 c GMB/AD  æ

Semantics of ALGcv Sort judgements R T U U Q Q S means that S denotes a relation of sort , given database schema .                X V [  Z   W Y         \ ] c _ ab [  ^ ` " " ! # ! #   

\ ] X             [ V   $ $  ! # ! #   

            \ ] \ ] \ ] \ ] X X X X [ [ [ [ e f e f d d d d     -. * * %  + (   & / )

         ,   ' h   \ ] \ ] X X [ [ g e f e f d d d d   -. - . % * * * + ( / & &  

           , 1 ' 0    -. -. % * * * + + / & &  

  2          , 1 ' 0     - . - -. % * * * 6 + ( & &  /  5

           , 1 ' 043    -. - . - - - . < . < * * * 7 + ! !   & 3 3 3 & > >

= = ? ,         ;   89 9 :    B D C E CF G G A  

H @          ,      ------. < < - - - . D G G G G CK F C FI J + + ! ! ! ! !    > > > > >

= = = = = @     ? ,        ; & 3 3 3 &    89 9 :    D E CF J CK F C  

              N < D L G G C E F A FI +  

M @  , ; ?              D L C E F A E CF  

M           O

Database Theory 2003 103 c GMB/AD Database Theory 2003 104 c GMB/AD i P Sort judgements cont. Sort judgements cont. k m o q q q s m w q q q Œ Ž Ž – – – Ž › – – – › – – – n n uv ’ ’ j ‹     ‘ ‘ œ ‘ t “ — “ — l r r  • • • • ž • • p x  ™ š  š  ” ˜ Œ k m Ž › Ž – – – › Ž Ÿ y n ’ ’ ‹ j | œ ‘ ¢ ¢ ¢ z

} l  • • ~  ¤ ¥  ™ £ £ ”¡ ¡ ” { Œ Ž Ž – – – Œ ’ ’ ‹ ‹ k m o q q q s m w q q q s w q q q n n u v u v j œ œ ‘ ‘   t l r r r r p x €m Œ ª Ž › Ž – – – – – – › Ž ¦§ © «¬ ¦ « ’ ’ ‹ k m y n j ‘ ‘ œ ‘ œ œ | ¨  • •  • • z z ¤ ¥  ™ ¢ ¢ ¢ l ~  £ ”¡ ¡ { ‚ Œ Ž › Ž ’ ‹ k m o q q q s m q q q s m q q q n n n n j   ™ t t t l r r r r p ƒ „ x € Œ ­ ª Ž ¦ § «® ¦ ¯ ’ ‹ m k n y j ¨ °  ¤ ¥ z z l

~  { ‚ Œ Ž ’ ‹ Œ Ž ’ ‹   š  k m o q q q s m q q q s q q q m q q q q q q n n n n j ‡ t t t l r r r r r r p p x x €†m Œ ª Ž ® « ¦ © «¬ ¦ « ’ ‹ Œ ­ ª Ž ® « ¦ «® ¦ ¯ ’ ‹  °  ¤ ¥ š  ¤ ¥ k m y n j | z z ‰ l ~  { ‚ˆ Œ Ž ’ ‹  ± ª Ž Œ ¯ « ® « ¦ ’ ‹ ¨  ¤ ¥ š 

Database Theory 2003 105 c GMB/AD Database Theory 2003 106 c GMB/AD ² Š

Generalized algebra Further extensions: Nest and Unnest

As in the relational algebra, there are a number of useful generalizations that can (Simplified forms) be coded up in ALGcv Ê Ê Â Ì Â À Â Â Ç Â ÅËÊ Å Å É ¿ Æ Á Æ ÍÎ Í Ã†Ä È Ã ³ · º · ¸

Complex Constants, e.g. ¹ ´µ†¶ ´4» ¼½ ¼ Ê Ê Ê À Ð Ð Â Â Â Ì Â Ï ÑÒÓ Å Å Å ¿ É Ô Á Æ Æ Õ Ö Ã†Ä Í

³ Renaming Ê Ê Ê ³ À Â Â Ç Â Â Å Å Å ¿ É

Cross Product Á Æ Æ Ã†Ä Í Ê À Ð Â Ä Â Ì Â Ç Â Â Ñ Ò Ó Å ÅËÊ ÅËÊ ¿ É Ø × Ô Ú Á Æ Æ ³ Õ Ö Ã È Ã ÍÎ Í Join Ù

³ N-ary set creation

Aside: Clear that ALGcv subsumes relational algebra.

Database Theory 2003 107 c GMB/AD Database Theory 2003 108 c GMB/AD Û ¾ Examples Further examples è è Ý § § ¦ ã æ ì ã æ ã æ â ã æ äå äå äå äå ë ¨ Ü ß

1. Renaming of attributes to © ç Assume ç ç . àá†ì Þ í éê í Þ àá†â éê  ß #        ! 1. Union of and some constant tuples:        !         

       " "  " " "      ö ö # $% ! & &  ( (    '    '           " " "   ó ò ó ó ò ó îï ô ÷ ù ù õ õ õ ðñ4ò ø ñ4ò øú *,+ )

2. Tuples inë where the first component is containing in the second component 2. Unnesting of on ? þ û ü ü ý ÿ ¡

3. Cartesian product ofß andë ? Ý â ì 4. Join ofß andë on ¤ û î £ ¢ ü ÿ ¡

Database Theory 2003 109 c GMB/AD Database Theory 2003 110 c GMB/AD - ¥

CALCcv: Complex value calculus Extensions to CALCcv 0 / / Term . 1 Constant value Similar to relational calculus

3 Variable 2 G I I P P P K I N N H M M Q Q e.g. Assume O O

Complex constants JLK R 45 3 Tuple element 2 ^ ] d d d _ _ _ X X X ` X Xc Xc S U S b U U b b U V V V Y Y Y W W a a a a T T Z Z\[ 0 / / 6 . 7 Relation atoms

Literal 8 9 G O e.g. h egf i 0 Complex terms . . Equality 2 . . : Membership 2 G l j Complex literals e.g. k . .

; Subset 2 G P P P o m q M Q O O O h h Queries as terms e.g. where p i r s,t e egm nLo i 0 < / / Formula 6 Literal < < < < < = > Logical connectives 2 2@? 2 A 4 < 3 Powerset 2 4 < 3 B Tuple 2 0 < / / 3

Query C D 2 E

Database Theory 2003 111 c GMB/AD Database Theory 2003 112 c GMB/AD u F Examples Transitive closure! ¥ ¥ « §®­ ¥ « ©ª ©ª ¤ ¨

Recall ¬ . Define Again assume ¦L§ ¯ ‚ ‚ w } € | } € } € † } € ~ ~ ~ ~ y v » ¹º

   . x z{L| ƒ„ z{L† x ‡ ƒ„ ‡ ¼  ŠÄÇÆ Å Å ÄÇÆ Å ½¾ À Á ¾ À ·Ê À Á · °±² ³´µ · É É ¿ ¿ ¿ ¿ ÃgÄ È ÃgÄ È ¶ ¸ y

1. Union of and some constant tuples: ¼ Å ÄÇÆ Å ¾ Á · Ë É ¿ ÃgÄ È   ’ ‘–• ’ ’ ‘–• ’ ‰ ‹ ‰ Ž ‰ “ — Ž ‰ ™ ™ ” ” ˆ Š Œ  g‘ ˜ g‘ ˜š » ¹º ¼ ¼ Â Ñ · ½Ñ Ð Ñ ° ² ÌÍÎ Ï Ì ³Ð · Ë É ¶ ¸ ¶ ¸

2. Tuples in where the first component is contained in the second component Then transitive closure can be defined as follows: • ž Ÿ ž Ÿ › œ ›  › ›

ˆ Š Œ  š ¼  ½ · °±² ³´µ · Ê ° ² ÌÍÎ Ï Ì ³ Ð · · Ë É Ó Ó Ò Ô ¶ ¸ ¶ ¸ Õ

3. Cartesian product ofy and ? Recall that this can not be expressed in the relational calculus/algebra!! w | † 4. Join ofy and on

• • Where did this extra expressive power come from?  ž ’ ž ‘ ‘–• ’ ž ‘–• Ÿ ’ ž Ÿ Ÿ ’ ž Ÿ ¡ ‰ › ‹ ‰  œ ›  ¡ ‰ ‰ › › ” ” ” ” ˆ Šg¢ Œ  Œ  g‘ ˜š

Database Theory 2003 113 c GMB/AD Database Theory 2003 114 c GMB/AD Ö £

Equivalence: From ALGcv to CALCcv Equivalence: From CALCcv to ALGcv  Ü Given an ALGcv expression × we can define a CALCcv query × . This proof proceeds in exactly the same way as for the relational calculus to ØgÙ ÚÛ Û Ý relational algebra ã áâ ä æ ß ß  à Þ Þ å ç (Again we will only consider domain-independent queries) ã áâ ä ä é æ é à Þè ê Þ  ã áâ The only extra complication is in the creation of the active domains ä ä òó ë æ é ñ î à à ì ô ï ð ð Þ å ç Þ Þ Þ å ç í ã áâ ä ä òó ò ó ë æ æ ñ î à à ì ì ô ð ð Þ å ç Þ Þ Þ å ç ö í õ ò ò ò ã áâ ä ä ü ò ó ò ó ò ò ò ó òó æ ÷ ñ ÿ ÿ à ì ú ú ú ì ¢ ¢ ð ý ý ¡ ¡ ý ð Þ ç Þ å þ £ Þ å ç Þ ¤ øù ù û ã áâ ä ¥ ñ ¦ à ÿ ¦ à ÿ à ð ð ð ð Þ Þ Þ Þ Þ Þ ã áâ ä ä ò ò ò ü ò ò ò ò ó ò ò ò ó § æ § ¨ à ÿ ÿ ÿ ÿ ¢ ¢ ¢ ¢ ð ¡ ¡ ð ý © ý ý ¡ ¡ ý ç Þ ì ú ú ú ì å å þ £ Þ øù ù û    ñ ñ ñ ÿ ¢ ð ð Þ Þ Þ Þ ç ø ¤ ¤ û ã áâ ä ä ü ò   æ  §  § ¨ ñ à ð ý  ð ý © Þ ç Þ å þ £ å ç Þ Þ ¤ ò ò ò

Database Theory 2003 115 c GMB/AD Database Theory 2003 116 c GMB/AD   Constructing the active domains Variants of Complex Value Model   Define a function  that maps a relation of sort  to a relation of sort Nested relational model: Set and tuple constructors are required to alternate, (essentially just returns the ground elements). e.g. l i g i n h f f f f ok op op o j d+m j d+k d+e l g i h f f f not ok o o d+k d+e $ "# % ! (  &' ' $ "#

: further restriction of the nested relational model where at least % V-relation model ( 4 7 - &' . 12 56 1 8 ' . , , , 3 9 0 0 : : / )+* $ "#

one component of every tuple sort must be atomic, e.g. A A A % ( - - - ( ( ( * - &' . = ' @ @ . = . . ' , , > > , , < < < < ; ? 0 0 )+* / : 0 0 : : : 0 0 )+* / : 0 )+* / 0 : : : B $ "#

% l i g i n h ( 4 7 . &' . 6 5 1 5 6 1 8 ' f f f f ok op op o j d+m j d+k d+e 9 C D 0 0 : : g i l i n h f f f not ok o p op o j d j d+m d+e F M M M   H E H I I N N

Then for a relation schema L L and instance G J K J KO Further assumption: atomic values in a tuple form a key F M M M Q P I N L L GRQ O Work on the complex value model has naturally lead to work on object-relational ] \ ` ` ` F Q Q P b ^ ^ STU V _ _ _ I N X X L a Y W Y W I W Y W Y[Z N and object models (next lecture!). b X X where a are the free elements in . W Y

Database Theory 2003 117 c GMB/AD Database Theory 2003 118 c GMB/AD q c

Manifestos

By the late 80s following from work on Complex Value Models, and from work in the OO community, came suggestions for adding objects to database systems.

1. The OODBMS manifesto Atkinson, Bancilhon, DeWitt, Dittrich, Maier and Database Theory: Lecture 7 Zdonik. 1989 Object model 2. The third-generation database system manifesto Committee for advanced DBMS function. 1990

Dr G.M. Bierman (Date and Darwen have subsequently suggested a “third manifesto”)

www.cl.cam.ac.uk/Teaching/2002/DBaseThy/

Database Theory 2003 119 c GMB/AD Database Theory 2003 120 c GMB/AD s r The OODBMS manifesto Mandatory commandments

v Thou shalt support complex objects t Paper attempts to give a generic definition of an OODBMS

v Thou shalt support object identity

t It describes the main features and characteristics for a system to qualify as an

v Thou shalt encapusulate thine objects OODBMS

v Thou shalt support types or classes

t These were grouped as follows:

v Thine classes or types shalt inherit from their ancestors 1. Mandatory Features that your system must have

v Thou shalt not bind prematurely 2. Optional Features that might make your system better v Thou shalt be computationally complete 3. Open Features where you have free design space v Thou shalt be extensible

These have had enormous influence both on OODBMSs and RDBMSs too! v Thou shalt remember thy data

v Thou shalt manage very large databases

v Thou shalt accept concurrent users

v Thou shalt recover from hardware and software failures

v Thou shalt have a simple way of querying data

Database Theory 2003 121 c GMB/AD Database Theory 2003 122 c GMB/AD w u

ODMG: history The ODMG architecture

x Essentially an industrial response to the manifesto & a number of competing OIF products

x Founded by Rick Cattell (Sun) in 1991 to produce an industry standard for object databases (but “avoid design by committee”!)

Object Model x Published various releases of the standard (1.1–Jan 94, 1.2–Jan 96, 2.0–May OQL ODL 97, 3.0–Jan 2000)

x Links with the OMG, and X3H2 (SQL) standard committee (amongst others) Language Binding

x Now (2001-) refocused on the JDO design effort (Persistent objects for Java)

x Whilst now dormant: still an influential design and most OODBMS products Java C++ Smalltalk are “compliant”

Database Theory 2003 123 c GMB/AD Database Theory 2003 124 c GMB/AD z y ODL: Object Definition Language ODL: Primitive types

{ Language for specifying the structure of database objects A whole host of primitive (non-object) types, e.g. } { Independent of your host programming language float, long, short, unsigned short, unsigned long, ... } { Class-based model boolean, char, string } { (Superset of IDL - CORBA definition language) date, time, timestamp, ... } { A class consists of enumerations and structs (like in C)

– Attributes (cf. fields) } set, bag, list, array, dictionary – Methods (The latter are like the polymorphic constructor types in SML) – Relationships (specific to OODBMS)

Database Theory 2003 125 c GMB/AD Database Theory 2003 126 c GMB/AD ~ |

Example ODL definitions ODL methods Movie class  ODL allows declarations of methods (code is written in your favourite host attribute string title; language) attribute integer year; class Movie Film color,BW filmType; ‚

attribute enum  € attribute string title; ; € attribute integer year; attribute enum Film color,BW filmType; ƒ Star ‚ class  float lengthInHours(); attribute string name; set(string) starNames(); attribute Struct Addr ; string street, string city address; ƒ €  ; €

Database Theory 2003 127 c GMB/AD Database Theory 2003 128 c GMB/AD „  ODL relationships Classes

ODL provides binary relationships. As in Java and ˆ C , we allow single inheritance between classes, and provide class Movie interfaces with mutiple inheritance. attribute string title; class Cartoon attribute integer year; extends Movie ‰ attribute enum Film color,BW filmType; relationship Set voices † float lengthInHours(); inverse Star::VoiceIn; ; set(string) starNames(); Š relationship Set stars class MurderMystery inverse Star::starredin; extends Movie ‰ ; attribute string weapon; † ; Š class Star

attribute string name; attribute Struct Addr string street, string city address; † relationship Set starredin inverse Movie::stars; ; †

Database Theory 2003 129 c GMB/AD Database Theory 2003 130 c GMB/AD ‹ ‡

Extents Expressivity

ODL provides us with a handle on the collection of all current objects in a class: Consider the relational schema extent ”  ‘’“  Cinema,Title,StartingTime from Lecture 1. In ODL: class Movie ( )

extent Movies Œ attribute string title; class Movie (extent Movies) attribute integer year; { attribute enum Film color,BW filmType; Œ  string Cinema; float lengthInHours(); string Title; set(string) starNames(); relationship Set stars Time StartingTime; inverse Star::starredin; }; ;  What about Complex Value Model? The extent name, e.g. Movies, can be used in queries, cf. relation name in SQL.

Database Theory 2003 131 c GMB/AD Database Theory 2003 132 c GMB/AD • Ž More types An introduction to OQL

Rather confusingly, ODL also provides collection classes ™ Based on SQL’s SELECT FROM WHERE construct – Set, Bag, List, Array and Dictionary ™ Looks like SQL-92 – These are classes ™ Properly orthogonal: — – In fact they are generic classes (not yet supported in Java/C , but like “OQL is a functional language whose operators can freely be templates in C++) composed, as long as the operands respect the type system.”

Database Theory 2003 133 c GMB/AD Database Theory 2003 134 c GMB/AD š ˜

Some example queries Further features

1. Movies; 1. Casting

2. select s.name from Stars as s; select (Movie)m from MurderMysteries as m 3. select m.year from Movies as m where where m.weapon="AK47"; m.title="Magnolia";

4. select s.name from Movies as m, 2. Object creation m.stars as s select new Studio(owner: s.name) where m.title="Magnolia"; from Stars as s;

5. select t.name from (select m.stars from Movies as m where m.title="Magnolia") as t where t.address.city="Hollywood";

Database Theory 2003 135 c GMB/AD Database Theory 2003 136 c GMB/AD œ › Language binding Java language binding

 Intention is to embed the database operations transparently into the host Ÿ ODMG specify a host of interfaces, e.g. for ODL types, OQL queries, e.g.

programming language public interface OQLQuery public void create(String query) ... public void bind(Object parameter)

... public Object execute ...... ;  Database programming now just looks like normal programming

Ÿ Your vendor supplies the classes that implement them  For most languages this is via an API

Ÿ Java database programming now becomes just like normal programming  Runtime language objects persist by backend magic OQLQuery query; query.create("select ... where p.salary>$1"); x=new Double(50000.0); query.bind(x); RichProfs=(DBag)query.execute();

Database Theory 2003 137 c GMB/AD Database Theory 2003 138 c GMB/AD

ž

Formalizing: MiniODMG MiniODL: Design choices

¡ One of the big problems with the ODMG was a lack of formalization 1. No interfaces

¡ So let’s formalize! (Cutting edge [SIGMOD 2003!]) 2. Limited primitive types £ ¡ We’ll define Just integers and booleans

– MiniODL 3. No relationships

– MiniOQL 4. No collection classes (requires generics!)

¡ We’ll assume that methods have been written in our favourite programming language, MiniDFlat.

Database Theory 2003 139 c GMB/AD Database Theory 2003 140 c GMB/AD ¤ ¢ MiniODL: Definitions MiniODL: Example class Employee extends Person Class definition (extent Employees) ¨ § § ¥¦ class C © extends C ª { attribute int EmpID;

extent ¬ « ­ attribute int GrossSalary; ° ° ° ¯ ¦ ¯ ¦ © ± attribute Manager UniqueManager; ® ° ° ° ² ¦ ² ¦ © ³ int NetSalary (int TaxRate); } ´

Attribute definition ¨ § § ¶ ¯ ¦

attribute µ ·

Method definition ¨ § § ° ° ° ² ¦ ¸ º º ¼ ¼ ¹ ¹ µ µ » » µ · « ­

Database Theory 2003 141 c GMB/AD Database Theory 2003 142 c GMB/AD ¾ ½

MiniOQL: Definition MiniOQL: Definition cont.

Query expression Query expression cont. Õ Á Ô Ô Ö Ö Ö À À Â Ó ¿ integer

true false booleans C Ó cast × Ø Ù Ã Ã Ö Ú Ä

identifier Ó attribute access × Ã Ö Ö Ö Ö È È È Û Ü Æ Þ É Ó Ó Ý Ý Ó ¿ Ç Ç ¿ set method invocation × Ø Ù Ã Å Ê Ô Ö Ö Ö Ô Ú Ú Ë Ì Ü Ü Þ Þ Ó Ý Ý Ó ¿ sop ¿ set ops new C object creation × Ø Ù Ã ß à Ë Ì á Ó Ó Ó ¿ iop ¿ int ops if then else conditional × Ã â Ë Ì Ó Ó Ó ¿ = ¿ int equality select from in where select × Ã Ë Ì

¿ == ¿ object equality à À È È È À Î Î Ë Ë É É ¿ Ç Ç ¿ record Ã Í Ï È Î

¿ record access Ã

size ¿ set size Ð Ñ Ã

Database Theory 2003 143 c GMB/AD Database Theory 2003 144 c GMB/AD ã Ò Type system for MiniOQL Type system

A query typing judgement is written

MiniOQL type ó ö ø ù æ ôõ ÷ ï ï ï å å å å ì ä ä ä ä í í ð ð î î ç set ñ é ê èë+ì è

where ú ó is a map from extent identifiers to their type ú

õ is a map from free identifiers of the query ÷ to their type

Database Theory 2003 145 c GMB/AD Database Theory 2003 146 c GMB/AD û ò

Typing judgements Typing judgements

(Int) (Bool) " " "     ! ! ü ÿ ¡ ü ÿ ¡

# #     ýþ

ýþ true false bool

int ¢ (Rec)   ( ( ( ( ( ( % ! % ! # # # #   ' '  ' ' $&% ) $&% ) (Id) (Extent) ü ¡ ÿ ¡ ¦ ¦ ü ¡ ÿ ¡ ¤ ¥ ¤ ¥ £ ýþ

ýþ £ C set C § ¨   ( ( ( , ( ( ! % ! * - + # #    ' ' $&% ) (Rec access)   ( % ! . .   ü ÿ ¡ ¦ ¥

ýþ set § ¨ (Size) ¡ ü ÿ ¦

ýþ size int 0 § ¨   /

  C C C (Upcast) 0 0

 

  C C 2 1 ü ÿ ¡ ü ÿ ¡    ¥ ¥  ¥ ¥    

ýþ © ýþ © (Set) ¡ ü ÿ    ¥

ýþ © £ £ © set § ¨         ! 4 ! ! 4 ! 3 5   '   '    set bool 1 2 (Select)

  ! 4 3 5     select from in where set 1 2 ü ÿ ¡ ü ÿ ¡ ¥ ¥   ýþ © ýþ © set set § ¨ § ¨ (Union) ¡ ü ÿ ¥    © ý þ © set § ¨

Database Theory 2003 147 c GMB/AD Database Theory 2003 148 c GMB/AD 6  Observable nondeterminism Even worse... class P extends Object class F extends Object class P extends Object ( extent Ps) ( extent Fs) ( extent Ps) { attribute int name; { attribute int name; { attribute int name; }; attribute P pal;}; F loop(); /* This method always loops! */ }; SELECT (if size(Fs)<1 then (new F(name:007,pal:p)).name else p.name) SELECT (if size(Fs)<1 and p.name=001 FROM p in Ps; then p.loop() else new F(name:007,pal:p) Assume we have two P objects (001 and 002) and zero F objects. FROM p in Ps;

What is the result of this query? What is the result of this query?

Database Theory 2003 149 c GMB/AD Database Theory 2003 150 c GMB/AD 8 7

Can it be fixed? Method support

Yes! In two ways: Notice that MiniOQL allows you to invoke methods, which have been written in MiniDFlat. 1. Remove object creation from queries!

: ALL DBMS maufacturers think you need this! 2. Make object creation invisible to current query (!)

: What is the meaning of a query? 3. Use a fancy type-based analysis (GMB) – Dependent on meaning of arbitrary code!

– How does this tie-in with semantic techniques so-far in DBT?

– Queries can loop - which set is this denoted by?

Database Theory 2003 151 c GMB/AD Database Theory 2003 152 c GMB/AD ; 9 Semantics of queries revisited Useful URLs > < Dependent on meaning of arbitrary code! www.odmg.org Fine - we have operational semantics for Java/SML/...! > www.service-architecture.com/articles

< How does this tie-in with semantic techniques in DBT? > java.sun.com/products/jdo Maybe it’s been wrong all along!

< Queries can loop - which set is this denoted by? Either move to domain theory; or use operational semantics!

Database Theory 2003 153 c GMB/AD Database Theory 2003 154 c GMB/AD ? =

Schema-less data

A Underlying assumption so far: Data has a fixed schema declared in advance

A E.g. CREATE TABLE in SQL, ODL in ODMG object databases

Database Theory: Lecture 8 A Obvious question: What happens if we drop this assumption?

Semistructured model and XML A Note: This predates the web-based applications!

A First prototype system: TSIMMIS/Lore at Stanford [1995] Dr G.M. Bierman

A [Aside: Another assumption is persistent data sets. Dropping this yields a data stream model—a HOT area!] www.cl.cam.ac.uk/Teaching/2002/DBaseThy

Database Theory 2003 155 c GMB/AD Database Theory 2003 156 c GMB/AD B @ Is this a good idea? Is this a good idea?

The world is really quite unstructured, for example: To query the database one currently needs to understand the schema

C Data exchange formats (pre-XML) But...lots of users either don’t understand the schema or don’t want to know about the schema

C The World-Wide Web

E Where in the database is there information on “Magnolia”?

C ACeDB - a database used by biologists (from Sanger Centre, Cambridge)

E Are there integers less than -256?

E Which database objects have attribute names beginning with “Name”?

So let’s move to schema-less data model!

Database Theory 2003 157 c GMB/AD Database Theory 2003 158 c GMB/AD F D

Semistructured data SSD and graphs

G Rather than devising a syntax for types, we need a general syntax for values

G First start with simple idea: records {name:"Britney", email:"[email protected]"} person person

G Thus a value can be a collection of label-value pairs, e.g. {name:{first:"Britney",second:"Spears"}, email:"[email protected]"}

G (We’ll drop the standard convention that labels must be unique) {person:{name:"Britney", email:"[email protected]"}, name email name email person:{name:"Avril", email:"[email protected]"}}

"Britney" "[email protected]" "Avril" "[email protected]"

Notice that these values can be represented using edge-labelled trees

Database Theory 2003 159 c GMB/AD Database Theory 2003 160 c GMB/AD I H Relational model Object model

It’s pretty obvious that relational database instances can be represented using Z Objects have identity this model. Z Let’s extend our semi-structured model to label values with oids and allow K R W M S U J

For example Q Q Q Q . An instance could be stored as L NPO T NPV TX oids to be values, e.g. {R:{row{A:"Britney",B:"St Johns", C:"E1(a)"}, {person: &o1{name:"Britney", row{A:"Christina",B:"Trinity",C:"F7"}}, friend: &o2, S:{row{D:"Myleene",E:"Thursdays"}, ex-bf: &o3}, row{D:"Avril",E:"Fridays"} person: &o2{name:"Gavin", row{D:"Britney",E:"Saturdays"}} friend: &o2, } friend: &o3}, person: &o3{name:"Justin", friend: &o2, ex-gf: &o1} }

Database Theory 2003 161 c GMB/AD Database Theory 2003 162 c GMB/AD [ Y

Graphs! Semi-structured data model _ ^ ^ Drawing this example graphically: Data is now a graph! SSD ] ` Value

&o ` Value with identity a

&o Object identity a

person person person _ ^ ^ Value ` b Atomic values friend

c Complex value a &o1 &o2 &o3 _ ^ ^ ^ f f f ^ c d ] d ]

Complex value { e e } friend friend

ex−bf ex−gf friend name name

"Britney" "Gavin" "Justin"

Database Theory 2003 163 c GMB/AD Database Theory 2003 164 c GMB/AD g \ Coding up ODMG schema Some example SSD compliant data class Movie (extent Movies) h {Movies:{Movie:&m01{title:"Magnolia", attribute string title; year:1999, attribute integer year; attribute enum Film color,BW filmType; film:{colour}, h i relationship Set stars stars:{Star:&s01,Star:&s02}}, inverse Star::starredin; Movie:&m02{title:"Safe", ; i ...}}, class Star (extent Stars) Stars:{Star:&s01{name:"Julianne Moore", h attribute string name; Addr:{street:"Sunset Blvd",city:"LA"}, attribute Struct Addr starredin:{Movie:&m01,Movie:&m02}}, string street, string city address; h i relationship Set starredin Star:&s02{name:"Tom Cruise", inverse Movie::stars; ...}} ; i }

Thus it is easy to embed ODMG data into SSD model

Database Theory 2003 165 c GMB/AD Database Theory 2003 166 c GMB/AD k j

Expressivity cont. SSD systems l We have seen that we can code up relational and object model instances into n A number of these flourished in the mid to late 1990s. SSD. What about the other direction? n Most notable work at Stanford:

l Not so easy! The point of the SSD model is its flexibility, e.g. – TSIMMIS [1995-] {Friend:{Name:"Britney", Tel:337890}, – Lore [1997-] Friend:{Name:"Emma",

n Other interesting systems include ACeDB: Email:"[email protected]"}, Friend:{Email:"[email protected]", – Built at the Sanger Centre (Cambridge) Tel:898989}, – Originally to store genetic data about C. elegans organism Friend:{Name:"Myleene"} } – Essentially a SSD model (has a schema language, but allows labels to be missing) l SSD model offers a very flexible data model!

l How might we store SSD values in a relational model?

Database Theory 2003 167 c GMB/AD Database Theory 2003 168 c GMB/AD o m XML XML graph model

p Standard for data exchange on web

p Textual format, consisting of elements wrapped with matching tags (mark-up), person person e.g. Britney 21 name email [email protected] age "Britney" "21" "[email protected]" p This clearly has a corresponding SSD value name age email {person:{name:"Britney",age:21,email:"[email protected]"}} "Britney" "21" "[email protected]"

r XML is a node-labelled graph

r SSD is an edge-labelled graph

Database Theory 2003 169 c GMB/AD Database Theory 2003 170 c GMB/AD s q

Building graphs in XML Julianne Moore XML provides special attributes to build graphs: ID and IDREF, e.g. Sunset BlvdLA id="&s02" Magnolia Tom Cruise 1999 ... colour Safe ... ...

Database Theory 2003 171 c GMB/AD Database Theory 2003 172 c GMB/AD u t Querying XML XPath x v There were a series of proposals, XML-QL, XQL, UnQL, Quilt... http://www.w3.org/TR/xpath x v Lead to current proposal: XQuery Building block for other W3C standards (not just XQuery) e.g. XLink, XPointer etc.

v XQuery consists of two parts

1. XPath: A language for describing nodes in an XML tree

2. XQuery: Query-like language surrounding XPath

Database Theory 2003 173 c GMB/AD Database Theory 2003 174 c GMB/AD y w

XPath Example XML document z Inspiration is that navigation of a tree is like navigating a Unix-style directory Addison-Wesley

z All paths start from a context node Serge Abiteboul Rick

z Result is (approximately) the set of nodes reachable from the context node Hull using the given path expression Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998

Database Theory 2003 175 c GMB/AD Database Theory 2003 176 c GMB/AD | { Example XPath XPath: Functions

Query: /bib/book/year Query: /bib/book/author/text()

Result: 1995 1998 Result: Serge Abiteboul Functions available: Jeffrey D. Ullman

Query: /bib/paper/year ~ text() – matches the text value Result: ~ node() – matches any node

~ name() – returns the name of the tag Query: //author

Result: Serge Abiteboul Rick ...

Query: /bib//first-name

Result: Rick

Database Theory 2003 177 c GMB/AD Database Theory 2003 178 c GMB/AD  }

XPath: Wildcard XPath: Attribute Nodes

Query: //author/* Query: /bib/book/@price

Result: Result: 55

NB: price is an attribute

Database Theory 2003 179 c GMB/AD Database Theory 2003 180 c GMB/AD  € XPath: Qualifiers XPath summary

Query: /bib/book/author[first-name] bib matches a bib element Rick Result: * matches any element Hull / matches the root element

/bib matches a bib element under root Query: /bib/book[@price<"60"] bib/paper matches a paper in bib

bib//paper matches a paper in bib, at any depth Query: /bib/book[author/@age<"25"] //paper matches a paper at any depth Thus we get a primitive form of querying (although result is a set of nodes, not valid XML) paper|book matches a paper or a book @price matches a price attribute

bib/book/@price matches price attribute in book, in bib

Database Theory 2003 181 c GMB/AD Database Theory 2003 182 c GMB/AD ƒ ‚

XQuery Simple XQuery

XPath is central to XQuery. In addition, XQuery provides FOR $x IN document("bib.xml")/bib/book WHERE $x/year > 1995 „ XML glue to turn XPath results back into XML RETURN $x/title

„ Variables to communicate between XPath and XQuery Returns:

„ Fancy database features, e.g. joins, aggregates etc. abc def

„ Fundamental structure: ghi for let where return

Database Theory 2003 183 c GMB/AD Database Theory 2003 184 c GMB/AD †

Simple XQuery Bindings

FOR $a IN distinct(document("bib.xml") ˆ FOR $x IN e – binds $x to each element in the expression e /bib/book[publisher="Morgan Kaufmann"]/author)

ˆ LET $x = e’ – binds $x to the entire expression e’ RETURN $a, FOR $t IN /bib/book[author=$a]/title RETURN $t

distinct is a function that removes duplicates

Database Theory 2003 185 c GMB/AD Database Theory 2003 186 c GMB/AD ‰ ‡

Example LET/WHERE query XQuery

‹ Lots of other features (sorting, filters, recursive function definitions, FOR $p IN distinct(document("bib.xml")//publisher) conditionals, ...) LET $b := document("bib.xml")/book[publisher = $p]

WHERE count($b) > 100 ‹ Really cool thing: RETURN $p – http://www.w3.org/TR/query-semantics/ – This contains an operational semantics and type system for the core count is an aggregate function that returns the number of elements fragment of XQuery! – Industrial application of 1B Semantics material!

Database Theory 2003 187 c GMB/AD Database Theory 2003 188 c GMB/AD Œ Š