<<

Chap. 2: Datalog and SQL 2

IntelligentIntelligent Information Information Systems Systems

WS 2018/19

DeductionDeductionDeduction ininin DatalogDatalogDatalog andandand SQLSQLSQL

Chapter 2

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 1 At the Core of Intelligent : Managing Derived Data 2

• The essential topic of this lecture is to understand the importance of the idea of using the concept of derived data (and of the corresponding technology). derivedderived

• We already learnt last week, that data - bases „containing “stored as well as derived data are called deductive databases in scientific terminology.

• In order to manage derived data, we need special abilities of our DBMS that turn it into a deductive DBMS .

• Managing derived data means • designing a deductive DB schema • incl. specifying derivation rules • evaluating queries over derived data storedstored • controlling consequences of changes of derived data

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 2 Stored or Derived (or both?) 2

• Storing data is the „normal “ way of keeping data (in some storage device).

• Every data element can be stored (in principle). derivedderived • But there are data elements that do not (necessarily) have to be stored , but (possibly) can be computed from stored (or other) derived data elements instead. We call such pieces of data „derived “ data (elements).

• It would be more precise to speak about „derivable “ data, because may be a derivable piece of data is actually never derived later on (or has never been derived up till now). We will use the term „derived “ data nevertheless.

• Part of the derivable data can redundantly be stored (after having been derived first) in order storedstored to avoid costly rederivation. We call such data materialized (derived) data.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 3 Another Topic of the Lecture: How to Derive Data? 2

derivedderived

DBMS

Inference Engine

How do inference (derivation) engines look like? Should they be part of a DBMS or external tools? storedstored

WeWe follow follow a a specific specificapproach approach in in IIS! IIS!

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 4 Learning by Doing 2

IntelligentIntelligent Information Information Systems Systems WS 2018/19

2.2. Datalog Datalog and and SQL SQL

2.12.1 Datalog Datalog and and SQL: SQL: Learning Learning by by Doing Doing 2.22.2 Datalog: Datalog: Syntax Syntax and and (Basic) (Basic) Semantics Semantics 2.32.3 From From SQL SQL to to Datalog Datalog (and (and Back) Back)

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 5 Case Study: The British Royal Family 2

A case study on using derived data will give us a concrete start into using the techniques of Deductive technology: A Genealogical Database

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 6 The Royal Family Tree: Our Example Application Domain 2

Husband -Wife Parent -Child

TheThe current current family family tree tree of of the the BritishBritish royal royal family family (starting (starting fromfrom the the Queen Queen and and her her husband) husband)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 7 The Royal Family Tree: More Recent Updates 2

Husband -Wife Parent -Child

SinceSince last last year, year, several several „ „updates“updates “ happened: happened: ••William William and and Kate Kate have have a a third third child, child, Louis Louis ••Harry Harry married married Meghan Meghan ••Eugenie Eugenie married married Jack Jack ••(to (to come: come: Meghan Meghan is is pregnant pregnant...) ...)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 8 From Family Tree to Genealogical Database? 2

??

HowHow to to „ „putput a a family family tree tree into into a a (relational) (relational) database database“?“? (i.e., how to design a basic genealogical database)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 9 Conceptual Modeling of the Family Tree DB 2

• Traditional way of designing a relational DB: Start by conceptually modelling the resp. application domain

• Corresponding ER diagram (Entity -Relationship approach):

father husband

parents child person marriage wife married mother title divorced name birth death sex

• Basic technique for logical design : deriving a relational DB schema from an ER diagram • per entity type one table: Each attribute of the E -type turned into attribute of table • per relationship type on table, too: Each attribute of the R -type into attribute of corr. table, plus key attributes of of participating E -types (as foreign keys)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 10 Logical Modeling of the Family Tree DB (1) 2

• The basic relational schema for this ER diagram looks as follows:

• person (title, name , birth, death, sex) • parents ( father , mother , child ) • marriage ( husband , wife , married , divorced)

• Roles in R -types are turned into attributes of the corr. R -table.

father husband

parents child person marriage wife married mother title divorced name birth death sex

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 11 Logical Modeling of the Family Tree DB (2) 2

• person (title, name , birth, death, sex) • parents (father, mother, child ) • marriage ( husband , wife , married , divorced)

The initial design can be improved by integrating certain relati onship tables into entity tables in case of 1:N functionalities : father husband 1 N

M parents child person marriage N wife

1 mother

• person (title, name , birth, death, sex , father, mother) • marriage ( husband , wife , married , divorced)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 12 Corresponding Genealogical Database 2

Person Key One possible (initial) relational format for family trees: just two tables! Title

Marriage

The entire family tree is „in “ these two tables.

Disadvantage(?): Many empty cells ( „nulls “)

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 13 Just Stored Data in Our Genealogical DB – Up Till Now 2

• All the data representing what we know about the Royal family members (in the family tree) have to be stored in tables

Title of our relational DB. • All the information in these two tables are „given “ data reported from the „real world “. • None of this could have been derived from other parts of the database.

CanCan we we turn turn this this DB DB intointo a a deductive deductiveDB DB byby extending extending it it with with Marriage additionaladditional derived derived data data??

HowHowto to do do this, this, ...... Person ...if...if we we can? can?

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 14 Genealogical Terminology: Potential for Deduction 2

Blood relation (or: consanguinity) only!

Numbers indicate percentage of „common blood “

In this graph, relatives from the maternal side are given only – paternal relatives analogously.

• Data about „existence “ (birth, death, sex, name etc.) cannot be derived , but have to be entered manually and stored in tables . The same applies to marriage and divorce. • All other data about family relationships, however, are derivable from these stored data, provided we can express them using suitable derivation rules .

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 15 Example: Deriving Uncle Data from the Stored Family Tree Data 2

e.g.: HowHow to to derive derive data data about about the the unclesunclesof of a a person? person?

(Aunts could be treated analogously.)

UncleUncle (from(from Latin: Latin: avunculus avunculus "little"little grandfather", grandfather", the the diminutive diminutive of of avus avus "grandfather")"grandfather") isis a a family family relationship relationship or or kinship, kinship, between between a a person person and and his his or or her her parent's parent's brother brother,, parent'sparent's brother-in-law brother-in-law or or parent's parent's cousin. cousin. (from: en/wikipedia) (Here we restrict ourselves to uncle in the narrow sense)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 16 „Uncle “ as a Derived Table 2

Person UncleUncle

Title

ItIt would would be be nice nice to to havehave the the data data about about unclesuncles in in the the Royal Royal familyfamily in in a a table table of of itsits own own – –best best in in a a derivedderived table table!! (If(Ifpossible!) possible!)

stored derived (Uncles of Charlotte and Mia still missing.)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 17 Uncle Relationships in the Royal Family Tree?

??

AnAn uncleuncle ofofa a person person is is (one (one of) of) hishis (or (or her) her) parent's parent's brothers brothers..

© 201 8 Prof. Dr. Rainer Manthey Uncle Relationships in the Royal Family Tree! 2

!!

AndAnd several several additional additionaluncle uncle relationshipsrelationships – –how how many? many?

(Do Savannah and Isla have any uncles?)

© 201 8 Prof. Dr. Rainer Manthey How to Derive a Particular Uncle Row? 2

UncleUncle Person HowHow to to derive derive the Uncle table Title the Uncle table fromfrom the the Person Person table?table?

A derivation strate - gy a human would use, might serve as a suitable strategy of a deduction „engine “.

(Is this some kind of „artificial intelligence “ ?)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 20 Relational Views as Specifications of a Derivation Strategy 2

Derived tables in a relational DB are Title obtainable as the result of a query . Each query thus expresses a deduc - tion „strategy “ of a table.

Named queries can be stored in a relational DB, introducing a named derived table. In SQL, these queries are called views .

HowHow does does the the SQL SQL viewviewlook look like like thatthat can can „ „do“do “ this this Derived table kindkind of of deduction? deduction?

Stored table

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 21 „Uncle “ in SQL – 1st Attempt 2

CREATECREATE VIEW VIEW Uncle UncleAS AS (( (SELECT(SELECT P1.Name, P1.Name, P3.NameP3.Name AS AS Uncle Uncle self FROMFROMPerson Person AS AS P1, P1, self father paternal uncle PersonPerson AS AS P2, P2, uncle PersonPerson AS AS P3 P3 WHEREWHEREP1. P1.FatherFather == P2.Name P2.Name ANDAND P2.Mother P2.Mother = = P3.Mother P3.Mother ANDAND P2.Father P2.Father = = P3.Father P3.Father ANDAND P2.Name P2.Name <> <> P3.Name P3.Name ANDAND P3.Sex P3.Sex = = ' m'm')') ? UNIONUNION (SELECT(SELECT P1.Name, P1.Name, P3.NameP3.Name AS AS Uncle Uncle self FROMFROMPerson Person AS AS P1, P1, mother PersonPerson AS AS P2, P2, maternal uncle uncle PersonPerson AS AS P3 P3 uncle WHEREWHEREP1. P1.MotherMother == P2.Name P2.Name ANDAND P2.Mother P2.Mother = = P3.Mother P3.Mother ANDAND P2.Father P2.Father = = P3.Father P3.Father ANDAND P2.Name P2.Name <> <> P3.Name P3.Name ANDAND P3.Sex P3.Sex = = ' m'm')') )) ApplyApply this this code code to to the the Person Person table table on on the the previous previous slide! slide!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 22 „Uncle “ in SQL – 2nd Attempt 2

The direct derivation of Uncle from the base table Person is very complex – why not try to simulate the Wikipedia definition , which uses auxiliary concepts „parent “ and „brother “?

UncleUncle ...... is is a a family family relationship relationship ...... between between a a person person and and his his or or her her parent parent's's's brother brother......

Assume we had views Parent(Name, Parent) and Brother(Name,Brother) available – then Uncle could be specified like this:

CREATECREATE VIEW VIEW Uncle UncleAS AS (SELECT(SELECT P.Name P.Name AS AS Name, Name, B.BrotherB.Brother AS AS Uncle Uncle FROMFROMParent Parent AS AS P, P, BrotherBrotherAS AS B B WHEREWHEREP.Parent P.Parent = = B.Name) B.Name)

Much, much simpler – but we still need specifications of the two other views ...

Parent( 'Peter ', 'Anne ') e.g.: Uncle( 'Peter ', 'Charles ') Brother( 'Anne ', 'Charles ')

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 23 „Parent “ in SQL 2

• Both, father and mother of a person are the parent s of this person.

• In English, the term parent exists as a singular noun, too – unlike, e.g., in German.

• So the SQL specification of Parent needs two cases , one for paternal and maternal parent each:

CREATECREATE VIEW VIEW Parent ParentAS AS (( (SELECT Name, paternal parent (SELECT(SELECT Name, Name, FatherFather ASAS Parent Parent FROMFROMPerson) Person) UNIONUNION (SELECT(SELECT Name, Name, maternal parent MotherMother ASAS Parent Parent FROMFROMPerson) Person) ))

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 24 „Brother “ in SQL 2

• The brother of a person is a male with exactly the same pair of parents.

• If just one of mother or father is the same, we speak of a half -brother .

• The SQL view Brother looks like this:

CREATECREATE VIEW VIEW Brother BrotherAS AS (SELECT(SELECT P1.Name, P1.Name, P2.NameP2.Name AS AS Brother Brother FROMFROMPerson Person AS AS P1, P1, self PersonPerson AS AS P2, P2, brother WHERE P1.Mother = P2.Mother WHERE P1.Mother = P2.Mother same parents ANDAND P1.Father P1.Father = = P2.Father P2.Father ANDAND P1.Name P1.Name <> <> P2.Name P2.Name different persons ANDAND P2.Sex P2.Sex = = ' m'm')') male

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 25 Two -level Specification of „Uncle “ 2

CREATECREATE VIEW VIEW Uncle UncleAS AS (SELECT(SELECT P.Name P.Name AS AS Name, Name, View based on B.BrotherB.Brother AS AS Uncle Uncle two „lower level “ FROM Parent AS P, views . FROM Parent AS P, BrotherBrotherAS AS B B WHEREWHEREP.Parent P.Parent = = B.Name) B.Name)

CREATECREATE VIEW VIEW Parent ParentAS AS CREATECREATE VIEW VIEW Brother BrotherAS AS ((SELECT((SELECT Name, Name, (SELECT(SELECT P1.Name, P1.Name, FatherFather ASAS Parent Parent P2.NameP2.Name AS AS Brother Brother FROMFROMPerson) Person) Views based on FROMFROMPerson Person AS AS P1, P1, two tables . UNIONUNION two tables . PersonPerson AS AS P2, P2, (SELECT(SELECT Name, Name, WHEREWHEREP1.Mother P1.Mother = = P2.Mother P2.Mother MotherMother ASAS Parent Parent ANDAND P1.Father P1.Father = = P2.Father P2.Father FROMFROMPerson)) Person)) ANDAND P1.Name P1.Name <> <> P2.Name P2.Name ANDAND P2.Sex P2.Sex = = ' m'm')')

PersonPerson Table

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 26 First Lessons About Using Views 2

• Using views enables us to extend a given database of stored tables by means of additional derived tables that are automatically generated by the DBMS on demand .

• We call this generation process deduction (as it follows laws of inference invented in ). Thus, databases using views are deductive databases .

• View definitions are pieces of code in SQL – i.e., you „program “ SQL, if specifying views. View definitions are the counterpart to procedures (or methods) in imperative programming.

• If using views properly, each view is a declarative specification of a concept (or: a term) of the respective application domain. Each of these specifica tions has to be constructive , i.e., usable for generating all instances of the concept defined over the given base data.

• It is not (always) easy to correctly and completely specify a concept! You need a precise understanding of the definition of the resp. concept in na tural language before coding SQL.

• Quite often there are various alternatives how to formulate a specification. They may differ in elegance of style, ease of understandibility and – most important – in efficiency of orga - nizing deduction processes.

• A multi -level specification, using intermediate concepts, is often preferable .

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 27 Datalog vs. SQL: Some Basics Ahead 2

• Research in deductive databases has a nearly 40 -years history (as old as SQL), but has been using a different declarative language (not SQL!) most of the time, strongly influenced by the language : Datalog

• Nearly all publications in this area have been using Datalog – that ‘s why we will use Datalog during this lecture, too (and you will have to learn it!).

• Many results of DDB research have been transferred to the SQL world recently! That ‘s why SQL will also be appearing throughout the lecture in vario us places.

SQL: Datalog: • used in industry and commerce • used in academia only • supported by many DBMS products • just few academic protoypes • standardized • no standards • user -friendly („controlled English “) • mathematical style • rich set of syntactic features • minimalistic syntax

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 28 About this Section 2

• In 2.1: Informal introduction to Datalog by means of an extended example .

• Simultaneously: Example -based introduction to . . . • . . . specifying concepts in a declarative DB language • . . . answering queries over deductive rules • . . . deduction using SQL views

• At first: • No rigorous treatment of concepts and ideas – „wetting “ your appetite is the goal. • Just core concepts presented and used in examples, more to come.

• In 2.2: More in -depth systematic treatment of Datalog (in full).

• Aim : Enable students to start „speaking “ the language straightaway.

• So, do make use of this chance: Exercise yourselves – don ‘t wait to be „forced “!

• There is no use reflecting about theory if you do not have a ny first -hand experience in practice – try doing things in SQL, too, using your favorite DBMS.

Intelligent Information Systems 29 Datalog: A Little Bit of Background 2

•"Data log " : " Data base + Pro log ", a notion coined around 1984 in the USA.

• Syntactically: Strong influence by logic Prolog (but: Only simple form of "pure" Prolog adapted) • Semantically: Strongly different! Set -oriented like other languages, e.g., SQL (instead of instance-oriented like Prolog)

• „Lingua franca" in research on deductive databases („de facto“ standard) • But: Up till now not used commercially, no standardisation, no DBMS product!

• Rather uniform syntax and semantics of facts and rules • Various different proposals for queries, updates, constraints, and schemas

• Datalog is based on the domain relational calculus (DRC), while SQL is based on tuple relational calculus (TRC) and (RA). Datalog uses just a minimal set of logical operators: ConjunctionConjunctionand andNegation

• Later in this chapter: Some more background about the three formal relational DB languages just mentioned – RA , DRC , and TRC ! A bit more about Prolog will follow.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 30 Datalog vs. SQL: Comparison in a Nutshell 2

SQLSQL

Datalog CREATE VIEW s AS Datalog (SELECT a FROM p) UNION ← (SELECT b FROM r); s(X)s(X) ← p(X,Y).p(X,Y). ← s(X)s(X) ← r(Y,X).r(Y,X). CREATE VIEW t AS ← SELECT a, b, t(X,Y,Z)t(X,Y,Z) ← p(X,p(X,YY),), r( r(YY,Z).,Z). FROM p, r ← WHERE p.b = r.a , w(X)w(X) ← s(X),s(X), not not q(X).q(X). CREATE VIEW w AS (TABLE s) MINUS (TABLE q);

Views in SQL (as named, stored queries) have rules in Datalog as analogous counterparts. Three derived relations are defined in both cases: s, t, w. In b oth cases, three tables are used: p, q, r. View w depends on view s, too.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 31 Basic Datalog Syntax on a Single Slide 2

ConstantsConstants FactsFacts ImplicitImplicit disjunction disjunction (or) (or) p(1,a).p(1,a). p(2,b). Rules p(2,b). ← RulesRules p(3,c). s(X)s(X) ← p(X,Y).p(X,Y). p(3,c). ← s(X)s(X) ← r(Y,X).r(Y,X). q(2).q(2). t(X,Y,Z) ← p(X,Y), r(Y,Z). q(5).q(5). VariablesVariables t(X,Y,Z) ← p(X,Y), r(Y,Z). w(X) ← s(X), not q(X). r(a,1).r(a,1). w(X) ← s(X), not q(X). r(a,2).r(a,2). r(b,3).r(b,3). ConjunctionConjunction (and) (and) Negation (not) RelationRelation Names Names Negation (not) p, q, r: Base relations s, t, w: Derived relations

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 32 No Nulls in Datalog (1) 2

Person First problem (?):

DatalogDatalog cannot accomodate null values , Title cannot accomodate null values, i.e.,i.e., no no empty empty cells cells in in a a table! table!

Reasons for nulls in the „SQL database “:

• Everyone except Diana is still alive.

• No ancestors for the royal couple.

• No ancestor data for spouses of royals.

• Some spouses don ‘t have any title.

Therefore: „Datalog DB “ needs more than two (normalized) tables

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 33 No Nulls in Datalog (2) 2

Person (SQL) Person (Datalog) Three null -free tables rather than one

Title Child (Datalog)

Death (Datalog)

(Title dropped from now on)

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 34 No Nulls in Datalog (3) 2

Marriage (Datalog)

Marriage (SQL)

Divorce (Datalog)

Similar normalization required for Marriage (as most royal couples are not – yet – divorced).

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 35 „Brother“ in SQL Using the „Datalog“ Schema 2

CREATECREATE VIEW VIEW Brother Brother AS AS In order to compare the SQL views with (SELECT(SELECT P1.Name, P1.Name, P2.Name AS Brother the corresponding Datalog rules, we first P2.Name AS Brother have to „translate “ view definitions to the old version FROMFROMPerson Person AS AS P1, P1, have to „translate “ view definitions to the PersonPerson AS AS P2, P2, new 5 -table schema. WHEREWHEREP1.Mother P1.Mother = = P2.Mother P2.Mother ANDAND P1.Father P1.Father = = P2.Father P2.Father ANDAND P1.Name P1.Name <> <> P2.Name P2.Name ANDAND P2.Sex P2.Sex = = ' m'm')') new version CREATECREATE VIEW VIEW Brother Brother AS AS (SELECT(SELECT C1.Child C1.Child AS AS Name Name,, C2.NameC2.Name AS AS Brother Brother FROMFROMChild ChildAS AS C1, C1, Person information in Datalog: ChildChildAS AS C2, C2, person(Name, Birth, Sex) PersonPerson AS AS P2 P2 child(Father, Mother, Child) WHEREWHEREC1.Mother C1.Mother = = C2.Mother C2.Mother death(Name, Year) ANDAND C1.Father C1.Father = = C2.Father C2.Father ANDAND C1. C1.ChildChild <> <> C2. C2.ChildChild ANDAND C1.Child C1.Child = = P2.Name P2.Name ANDAND P2.Sex P2.Sex = = ' m'm')') Person table in SQL: Title

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 36 „Brother“ in Datalog 2 SQL: Datalog : CREATECREATE VIEW VIEW Brother Brother AS AS (SELECT C1.Child AS Name, (SELECT C1.Child AS Name, ← C2.NameC2.Name AS AS Brother Brother brotherbrother(N,B)(N,B) ← FROMFROMChild Child AS AS C1, C1, child( F,M,N), Child AS C2, child(F,M,N), Child AS C2, child(child(F,M,B),F,M,B), PersonPerson AS AS P2 P2 person( B,_, 'm'), WHEREWHEREC1.Mother C1.Mother = = C2.Mother C2.Mother person(B,_, 'm'), ANDAND C1.Father C1.Father = = C2.Father C2.Father NN <> <> B. B. ANDAND C1.Child C1.Child <> <> C2.Child C2.Child ANDAND C1.Child C1.Child = = P2.Name P2.Name ANDAND P2.Sex P2.Sex = = ' m'm')')

brother (N,B) ← brother (N,B) ← child( F,M,N), child( F,M,N), child( F,M,B), child( F,M,B), person( B,_, 'm'), person( B,_, 'm'), N <> B. N <> B.

In SQL: Variables for rows of tables. In Datalog: Variables for attribute values of rows.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 37 First Observations About Datalog 2

• The basic building blocks of Datalog rules are called literals . They consist of a relation name and a list of parameters (either variables or constant s), e.g., child(F,M,N).

• Variables in Datalog stand for individual attribute values , i.e., elements of data types occupying a single cell in the table/view under considerati on.

• Datalog doesn ‘t make use of any attributes (column names), though. Columns are identified by their position within the parameter list. The order of columns matters and has to be fixed.

• Datalog rules are expressions of the form Head ← Body. where Head is a literal represen - ting name and column structure of the newly defined view ( derived relation). The rule body is a conjunction of literals accessing tables or other vie ws on which the new view depends.

• The same variable occurring in different places in a rule ← • The same variable occurring in different places in a rule brotherbrother(N,B)(N,B) ← always represents the same value (variable binding), child(child(F,M,N),F,M,N), whereas different variables may (but don ‘t have to) whereas different variables may (but don ‘t have to) child(child(F,M,B),F,M,B), represent different values. person(person(B,_,B,_, 'm'm'),'), NN <> <> B. B. • Underscores represent „unnamed “ variables „filling up “ irrelevant positions in a parameter list.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 38 „Parent“ and „Uncle“ in Datalog 2

New feature : underscore _ for „don ‘t care “ positions in parameter lists of literals CREATECREATE VIEW VIEW Parent ParentAS AS in parameter lists of literals ((SELECT((SELECT Child Child AS AS Name, Name, FatherFather ASAS Parent Parent ← FROMFROMChild) Child) parentparent(N,P)(N,P) ← child(child(P,_,N).P,_, N). UNION ← UNION parentparent(N,P)(N,P) ← child(_,child(_,P,N).P,N). (SELECT(SELECT Child Child AS AS Name, Name, MotherMother ASAS Parent Parent FROMFROMChild)) Child))

CREATECREATE VIEW VIEW Uncle UncleAS AS (SELECT(SELECT P.Name P.Name AS AS Name, Name, uncle (N, U) ←← B.Brother AS Uncle uncle(N, U) B.BrotherB.Brother AS AS Uncle Uncle parent(N, P), FROM Parent AS P, parent(N,P), FROM Parent AS P, brother( P,U). BrotherBrotherAS AS B B brother(P,U). WHEREWHEREP.Name P.Name = = B.Name) B.Name)

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 39 Uncle Definition in Datalog: Full Dependency Graph 2

← uncleuncle(N,(N, UU) ) ← parent(N,parent(N,P),P), brother(brother(P,P,UU).).

← brotherbrother(N,B)(N,B) ← ← child( F,M,N), parentparent(N,P)(N,P) ← child(child(P,_,N).P,_, N). child(F,M,N), ← child( F,M,B), parentparent(N,P)(N,P) ← child(_,child(_,P,N).P,N). child(F,M,B), person(person(B,_,B,_, 'm'm'),'), NN <> <> B. B.

child/3 person/3

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 40 SQL Version in Comparison 2

CREATECREATE VIEW VIEW Uncle UncleAS AS (SELECT(SELECT P.Name P.Name AS AS Name, Name, B.BrotherB.Brother AS AS Uncle Uncle FROMFROMParent Parent AS AS P, P, BrotherBrotherAS AS B B WHEREWHEREP.Name P.Name = = B.Name) B.Name)

CREATE VIEW Parent AS CREATECREATE VIEW VIEW Brother BrotherAS AS CREATE VIEW Parent AS (SELECT C1.Child AS Name, ((SELECT((SELECT Child Child AS AS Name, Name, (SELECT C1.Child AS Name, Father AS Parent C2.NameC2.Name AS AS Brother Brother Father AS Parent FROM Child AS C1, FROMFROMChild) Child) FROM Child AS C1, UNION ChildChild AS AS C2, C2, UNION Person AS P2 (SELECT(SELECT Child Child AS AS Name, Name, Person AS P2 Mother AS Parent WHEREWHEREC1.Mother C1.Mother = = C2.Mother C2.Mother Mother AS Parent AND C1.Father = C2.Father FROMFROMChild)) Child)) AND C1.Father = C2.Father ANDAND C1.Child C1.Child <> <> C2.Child C2.Child ANDAND C1.Child C1.Child = = P2.Name P2.Name ANDAND P2.Sex P2.Sex = = ' m'm')')

Child Person

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 41 Using Negation in Datalog and SQL 2

• In our initial example comparing Datalog and SQL (when declaring the same derived relations/tables) we used one rule/view containing logical negation :

...... CREATE VIEW w AS ← (TABLE s) w(X)w(X) ← s(X),s(X), not not q(X).q(X). MINUS (TABLE q);

• In Datalog , the Boolean operator not appears in the rule body next to the operator and written in Datalog -style as a comma symbol. In the SQL view declaration the set -theoretic counterpart MINUS is used instead.

• Negation (regardless in which syntatic variant) potentially causes trouble in a deductive (relational) database – to be explained later. Therefore we will have to treat negation with particular care !

• SQL knows logical NOT, too. How to express this example rule in SQL using NOT?

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 42 Negative and Positive Facts in Databases 2

In (relational) databases: • All stored facts are (supposed to be) true . + • All true facts are (supposed to be) stored (or derivable).

Complement(DB)

DB + −−

−− • All non -stored or non -derivable facts are (supposed to be) false . • No false facts are (supposed to be) stored (or derivable).

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 43 Datalog: Negation by Example 2

• We need a bit of preparation before being able to use negation in our genealogy context.

• In the exercises, we will try to formalize the concept of bei ng ( currently ) married – a rather difficult affair. Now let us define who was ever married, and who has at least one child. For making things easier, we just look at the male case (hus band, 1 st parameter of marriage as well as child ): ← ← ever_marriedever_married(P)(P) ← has_childhas_child(P)(P) ← marriage(marriage(P,_,_).P,_, _). child(child(P,_,_).P,_, _).

• Now for the rule requiring negation : An uncle who has no children and was never married is interesting (because we might inherit his fortune, when he dies!).

← interesting_uncleinteresting_uncle(N,(N, UU) ) ← uncle(N,uncle(N,UU),), notnot ever_married(ever_married(U),U), notnot has_child(has_child(UU).).

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 44 Corresponding Data over Royals DB 2

← interesting_uncleinteresting_uncle(N,(N, UU) ) ← uncle(N,uncle(N,UU),), notnot ever_married(ever_married(U),U), Uncle notnot has_child(has_child(UU).).

Has_child

Ever_married How to compute interesting uncles if you only have positive data at hand ?

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 45 Discovering Negative Data (1) 2

← interesting_uncleinteresting_uncle(N,(N, UU) ) ← Although we do not have data uncle(N,uncle(N,UU),), about persons who never married, notnot ever_married(ever_married(U),U), Uncle or are child less . . . notnot has_child(has_child(UU).).

Has_child

Ever_married

. . . we can use the positive data about persons who did marry or do have a child for eliminating uncles who definitely are not interesting! No need to „access “ the complement of any table explicitly.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 46 Discovering Negative Data (2) 2

← interesting_uncleinteresting_uncle(N,(N, UU) ) ← uncle(N,uncle(N,UU),), notnot ever_married(ever_married(U),U), Uncle notnot has_child(has_child(UU).).

Has_child

At the end, only Henry is interesting (for George)! Ever_married

Interesting_uncle

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 47 Translation to SQL 2

← interesting_uncleinteresting_uncle(N,(N, UU) ) ← uncle(N,uncle(N,UU),), notnot ever_married(ever_married(U),U), notnot has_child(has_child(UU).).

In SQL, the corresponding query defining the view requires two embedded subqueries correlated with the main query by means of NOT IN (or, similarly, NOT EXISTS ):

A formulation CREATECREATE VIEW VIEW Interesting_uncle Interesting_uncle AS AS using MINUS (SELECT(SELECT * * instead (and no FROMFROM Uncle Uncle WHERE part) WHERE Uncle NOT IN works, too, as WHERE Uncle NOT IN previously. (SELECT(SELECT * * FROM FROM Ever_married) Ever_married) ANDAND Uncle Uncle NOT NOT IN IN (SELECT(SELECT * * FROM FROM Has_child)) Has_child))

The evaluation strategy is the same in SQL: Reduce Uncle by eliminating those rows for which the NOT IN condition does not hold!

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 48 Open Question Remaining about Negation 2

← ever_marriedever_married(P)(P) ← ← marriage(marriage(P,_,_).P,_, _). interesting_uncleinteresting_uncle(N,(N, UU) ) ← uncle(N,uncle(N,UU),), notnot ever_married(ever_married(U),U), has_child (P) ← has_child(P) ← notnot has_child(has_child(UU).). child(child(P,_,_).P,_, _).

WhyWhy not not simply simply use use one one rule rule rather rather than than three? three?

← interesting_uncleinteresting_uncle(N,(N, UU) ) ← There is a good – even though uncle(N,uncle(N,UU),), debatable – reason for using notnot marriage(marriage(UU,_,,_,_),_), the three rules rather than one. notnot child(child(UU,_,,_,_)._). This will be discussed in the Don ‘t use this in Datalog! next section, please wait !

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 49 Traversing Family Trees: Fixed Number of Steps (1) 2

great -grandparent

grandparent

parent Don ‘t be surprised, this is the Royals family tree image from last year

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 50 Traversing Family Trees: Fixed Number of Steps (2) 2

parent (N,P) ←← child(P,_,N). parent(N,P) child(P,_,N). 1 step down the tree parent (N,P) ←← child(_,P,N). 1 step down the tree parent(N,P) child(_,P,N). (in direction of ancestors, i.e. towards the root of the family tree)

grandparent (N,GP) ← grandparent(N,GP) ← 2 steps down the tree parent(N,P),parent(N,P), parent parent(P,GP)(P,GP) . .

These arrows are not referring to the family tree, but represent dependeny of relations on each other!) ← greatgreat-grandparent(N,GP)-grandparent (N,GP) ← 3 steps down the tree parent(N,P),parent(N,P), grandparent grandparent(P,GP)(P,GP) . .

WhatWhat to to do do if if we we don don‘t‘t even even know know howhow deep deepwe we can can go go in in the the tree? tree?

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 51 Recursive Views: Generalizing Tree Traversal 2

Traversal of a family tree down to arbitrary depth can be expressed very elegantly using the most powerful syntactic feature of declarative query languages:

← Datalog: ancestor(X,Y)ancestor(X,Y) ← parent(X,Y).parent(X,Y). ← ancestor(X,Y)ancestor(X,Y) ← parent(X,Z),parent(X,Z), ancestor ancestor(Z,Y).(Z,Y).

CREATECREATE RECURSIVE RECURSIVEVIEW VIEW Ancestor Ancestor AS AS (((SELECT (SELECT Name, Name, ParentParentAS AS Ancestor Ancestor SQL: FROM Parent) Recursive views FROM Parent) are allowed in UNION UNIONUNION SQL since the (SELECT(SELECT P.Name, P.Name, standard of 1999 ! A.AncestorA.Ancestor FROMFROM Parent Parent AS AS P, P, AncestorAncestorAS AS A A WHEREWHERE A.Name A.Name = = P.Parent)) P.Parent))

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 52 Ancestors and Descendants 2

All the children of a person X are ancestors of X, as well as all ancestors of the children of X (and so on): Ancestor is a recursively defined concept!

ancestor(X,Y) ← parent(X,Y). ancestor(X,Y) ← parent(X,Z), ancestor (Z,Y).

Descendant is the inverse concept of ancestor: descendant(X,Y) ← ancestor(Y,X).

The generation (or level) of an ancestor can be expressed by means of an addit ional parameter, which is recursively incremented :

ancestor(X,Y, 1) ← parent(X,Y). ancestor(X,Y, J) ← parent(X,Z), ancestor(Z,Y, I), J = I+1 .

descendant(X,Y, I) ← ancestor(Y,X, I).

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 53 Computing Royal Ancestors Recursively (1) 2

← ancestor(X,Y,1)ancestor(X,Y,1) ← parent(X,Y).parent(X,Y). ← ancestor(X,Y,J)ancestor(X,Y,J) ← parent(X,Z),parent(X,Z), ancestor ancestor(Z,Y,I),(Z,Y,I), J J = = I+1 I+1 .. ParentParent

Computation of the ancestor table is done iteratively .

1 At the beginning , there are no ancestor facts yet, so that the recursive rule cannot „produce “ anything.

Just the non -recursive rule is able to provide an initial bunch of ancestor facts „copied “ from parent.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 54 Computing Royal Ancestors Recursively (2) 2

← ancestor(X,Y,1)ancestor(X,Y,1) ← parent(X,Y).parent(X,Y). ← ancestor(X,Y,J)ancestor(X,Y,J) ← parent(X,Z),parent(X,Z), ancestor ancestor(Z,Y,I),(Z,Y,I), J J = = I+1 I+1 ..

1 ParentParent AncestorAncestor 1

2 First application of the recursive rule produces further ancestor facts.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 55 Computing Royal Ancestors Recursively (3) 2

← ancestor(X,Y,1)ancestor(X,Y,1) ← parent(X,Y).parent(X,Y). ← ancestor(X,Y,J)ancestor(X,Y,J) ← parent(X,Z),parent(X,Z), ancestor ancestor(Z,Y,I),(Z,Y,I), J J = = I+1 I+1 ..

ParentParent

2 AncestorAncestor 2

3 Similarly in iteration 3: Combining generation 2 facts with parent facts.

Nothing new in iteration 4: stop !.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 56 Computing Royal Ancestors Recursively (4) 2

← ancestor(X,Y)ancestor(X,Y) ← parent(X,Y).parent(X,Y). ← ancestor(X,Y)ancestor(X,Y) ← parent(X,Z),parent(X,Z), ancestor ancestor(Z,Y).(Z,Y). 1 AncestorAncestor 1

AncestorAncestor 2 AncestorAncestor 2

3 AncestorAncestor 3

30 facts + 22 facts + 6 facts = 58 facts

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 57 Degrees of Kinship 2

(Male form only, due to space limitations)

Great-Grandfather 4 3 Granduncle Grandfather 3 2 Father-in-law Father Uncle 2 4 1

Cousin Brother Person of Spouse Brother-in-law Reference 3 1 Nephew Son 4 2 Grandnephew Grandson 3

TheThe degree degreeof of kinship kinship (being (being relative relative or or relative relative-in-in -law)-law) Great-Grandson isis determined determined by by the the number number of of intermittend intermittend births. births.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 58 Rules of Kinship 2

The concept "relative of" can be defined via several Datalog rules, too, using the concepts ancestor and descendant introduced before (to be discussed in the exercises ): ← relative(X,relative(X, Y, Y, Degree) Degree) ← ancestor(X,ancestor(X, Y, Y, Degree). Degree). in the main ← line relative(X,relative(X, Y, Y, Degree) Degree) ← descendant(X,descendant(X, Y, Y, Degree). Degree). ← relative(X,relative(X, Y, Y, Degree) Degree) ← ancestor(Z,ancestor(Z, X, X, Degree1), Degree1), ancestor(Z,ancestor(Z, Y, Y, Degree2), Degree2), in the sideline notnot ancestor(X,ancestor(X, Y), Y), notnot ancestor(Y,ancestor(Y, X), X), notnot has_younger_common_anc(X,Y,Z),has_younger_common_anc(X,Y,Z), XX \= \= Y, Y, = + DegreeDegree =Degree1Degree1 +Degree2.Degree2. has_younger_common_anc(X,Y,Z)has_younger_common_anc(X,Y,Z) ←← ancestor(Z1,ancestor(Z1, X,X, _),_), ancestor(Z1,ancestor(Z1, X,X, _),_), ancestor(Z,ancestor(Z, Z1,Z1, _)._).

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 59 Now for Some Theory! 2

2.12.1 Datalog Datalog and and SQL: SQL: Learning Learning by by Doing Doing 2.22.2 Datalog: Datalog: Syntax Syntax and and (Basic) (Basic) Semantics Semantics 2.32.3 From From SQL SQL to to Datalog Datalog (and (and Back) Back)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 60 About this Section 2

• After an intuitive, example -based introduction to the most important principles of Datalog in 2.1, 2.2 will look at the features and convention s of this language from a more abstract, systematic point of view .

• The introduction will not be a formal one, however, even though it is no problem to come along with a formal grammar for the syntax of each construct.

• Semantics is more problematic and will be dealt with in chapter 3 (formal ly, at least in part).

• Many of the introductory slides will just repeat „officially “ what was already mentioned before – but there will be key features (such as safety , CWA , or negation -as -failure ) which will be discussed more in -depth as they are a bit intricate.

• Altogether, this is a section for your own reading more than for lengthy oral presentation within the lecture.

• Take this serious nevertheless – the details of the language will be relevant for the rest of the semester (including the exam).

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 61 Facts in Datalog 2

Facts are represented by atomic formulas, the parameters city('Berlin',city('Berlin', 030, 030, 'B', 'B', 3399). 3399). of which are all constants . Parameter list

city Name Phone Car Inhabitants Relation name Berlin 030 B 3399 Hamburg 040 HH 1700 München 089 M 1189 Köln 0221 K 963 Frankfurt 069 F 644 Essen 0201 E 603

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 62 Constants, Variables and Relation Names 2

• Variables : Capital letters or strings beginning with a capital:

e.g.e.g.: : X X X_1 X_1 City City X1a2b% X1a2b%

• Constants : Digits, lower case letters, or strings beginning with a lower case letter or a digit, . . .

e.g. : x 1 city 1a2b% („Inherited “ from e.g. : x 1 city 1a2b% conventions in most Prolog systems) . . . or arbitrary strings in apostrophes :

e.g.e.g.: : 'City' 'City' 'X' 'X' '?-abc-!' '?-abc-!'

• Relation names : Strings beginning with a lower case letter

e.g.e.g.: : city city p p q_1 q_1

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 63 Tables 2

city(city( 'Berlin', 'Berlin', 030, 030, 'B', 'B', 3399 3399 ). ). city(city( 'Hamburg', 'Hamburg', 040, 040, 'HH', 'HH', 1700 1700 ). ). Table (relation) as a set of facts city(city( 'München', 'München', 089, 089, 'M', 'M', 1189 1189 ). ). in Datalog city(city( 'Köln', 'Köln', 0221, 0221, 'K', 'K', 963 963 ). ). city(city( 'Frankfurt', 'Frankfurt', 069, 069, 'F', 'F', 644 644 ). ). city(city( 'Essen', 'Essen', 0201, 0201, 'E', 'E', 603 603 ). ).

City Name Phone Car Inhabitants

Berlin 030 B 3399 conventional table Hamburg 040 HH 1700 (à la SQL) München 089 M 1189 Köln 0221 K 963 Frankfurt 069 F 644 Essen 0201 E 603

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 64 Deductive Rules in Datalog: Structure 2

Implication symbol as in mathematical logic Rule head Implication symbol as in mathematical logic

population(City,population(City, Inhabitants) Inhabitants) ←← city(City,city(City, Phone, Phone, Car, Car, Inhabitants) Inhabitants)..

Rule body

population(population( 'Berlin', 'Berlin', 3399 3399 ). ). city(city( 'Berlin', 'Berlin', 030, 030, 'B', 'B', 3399 3399 ). ). population(population( 'Hamburg', 'Hamburg', 1700 1700 ). ). city(city( 'Hamburg', 'Hamburg', 040, 040, 'HH', 'HH', 1700 1700 ). ). city( 'München', 089, 'M', 1189 ). population(population( 'M 'Müünchen',nchen', 1189 1189 ). ). city( 'München', 089, 'M', 1189 ). population( 'K öln', 963 ). city(city( 'Köln', 'Köln', 0221, 0221, 'K', 'K', 963 963 ). ). population( 'Köln', 963 ). city( 'Frankfurt', 069, 'F', 644 ). population(population( 'Frankfurt', 'Frankfurt', 644 644 ). ). city( 'Frankfurt', 069, 'F', 644 ). city(city( 'Essen', 'Essen', 0201, 0201, 'E', 'E', 603 603 ). ). population(population( 'Essen', 'Essen', 603 603 ). ). Base relation Derived relation

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 65 Just a Convention, but . . . : The Dot at the End! 2

Another "syntactical tradition" from the Prolog world: Facts and rules are always terminated with a dot !

populationpopulation(( 'Berlin', 'Berlin', 3399 3399 ) )..

populationpopulation(City,(City, Inhabitants) Inhabitants) ← ← city(City,city(City, Phone, Phone, Car, Car, Inhabitants) Inhabitants)..

(As(As DatalogDatalog isis notnot standardizedstandardized –– like, like, e.g.,e.g., SQLSQL –– such such syntacticsyntactic rulesrules areare quitequite ofoftenten neglectedneglected byby authorsauthors followingfollowing aa traditiontradition ofof theirtheir own.own. TakeTake care!)care!)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 66 Literals 2

• The basic constituents of all Datalog-expressions are positive or negative atomic formulas , from which facts and rules are built.

• Such formulas are called literals .

positive literals negative literals p(X,Y)p(X,Y) not not p(X,Y)p(X,Y) q(a,q(a, Y,Y, 1)1) not not q(a,Y,1)q(a,Y,1) r(a,b)r(a,b) not not r(a,b)r(a,b)

• Rule heads and facts are positive literals; rule bodies may contain both, positive and (possibly) negative literals.

• Literals without any variables are called ground literals . Therefore, all facts are ground literals in Datalog.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 67 Literals (2) 2

• Literals play a double role in Datalog : • In facts and rule heads: For asserting confirmed or assumed data. • In rule bodies (and later on in queries): For finding confirmed or assumed data.

• A fact in a asserts, that something is the case: child_of( 'William ', 'Diana ', 'Charles '). !! • This is true for derived facts as well, thus rule heads play an asserting role, too: father_of(Y, Z) ← child_of(Z, X, Y). !! • Literals in role bodies, however, serve as questions if something is the case resp. for which variable substitutions something is the case: father_of(Y, Z) ← child_of(Z, X, Y). ??

• In order to derive father_of-facts with this rule, it is necessary to find child_of-facts and to transfer variable bindings found to the head literal, in order to assert new facts .

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 68 Meaning of Literals 2

• Literals on „asserting“ positions – i.e., in facts and rule heads – „mean themselves“.

• Literals in „querying“ positions – i.e., in particular in rule bodies – get their meaning via an evaluation over a certain set of facts only, e.g.:

← father_of(Y,father_of(Y, Z) Z) ← child_of(Z,child_of(Z, X, X, Y). Y). This is like simple ? ! pattern matching!

child_of('Charles',child_of('Charles', 'Elizabeth', 'Elizabeth', 'Philip'). 'Philip'). child_of('Anne',child_of('Anne', 'Elizabeth', 'Elizabeth', 'Philip'). 'Philip'). child_of('Andrew',child_of('Andrew', 'Elizabeth', 'Elizabeth', 'Philip'). 'Philip'). child_of('Edward',child_of('Edward', 'Elizabeth', 'Elizabeth', 'Philip'). 'Philip').

• The body literal represents an (atomic) query : For which (X,Y,Z)-combination is the resulting fact true?

• The expected answer is the set of all variable substitutions that can be obtained by evaluating the literal over the given set of facts.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 69 Application of Deductive Rules: Principle 2

Transfer of variable bindings 2.2. from body to head

populationpopulation(City,(City, Inhabitants) Inhabitants) ← ← city(City,city(City, Phone, Phone, Car, Car, Inhabitants) Inhabitants)..

Rule head serves as a pattern Evaluation of the rule body 1.1. for derivable facts produces variable bindings 3.3.

populationpopulation(( 'Berlin', 'Berlin', 3399 3399 ). ). city(city( 'Berlin', 'Berlin', 030, 030, 'B', 'B', 3399 3399 ). )......

Derived facts Base facts or previously derived facts

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 70 Safety of Rules 2

• Rules are able to serve as "fact producers" only if the evaluation of the rule body generates variable bindings for all variables in the rule head.

• Thus : All variables in the head have to appear in the body of a rule, too!

• Rules satisfying this requirement are called safe .

All variables in the rule head . . .

← population(population(City,City, Inhabitants Inhabitants)) ← city(city(City,City , Phone Phone,, Car Car,, Inhabitants Inhabitants))..

. . . have to appear in the rule body , too! (But not necessarily vice versa!)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 71 Unsafe Rules 2

unsafe ! ?? unsafe !

← in_state(in_state(CityCity , ,State State)) ← city(city(CityCity , ,Phone, Phone, Car, Car, Inhabitants) Inhabitants). .

• An unsafe rule (like this one) is not able to produce complete facts for the defined relation ‚in_state': Where to take values for binding variable 'State' from?

• Im principle : For ''State' any constant value could be substituted, so that the resulting relation ‚in_state' ought to be infinitely large !

• Later in this lecture : Unsafe rules will be admitted though under certain contraints!

• But for now : All rules are assumed to be safe – no unsafe rules !

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 72 Conjunctive Rules 2

is_a_capital(is_a_capital( 'München', 'München', 'M' 'M' ). ). ∧ CommaComma as as ∧-operator-operator

← is_a_capitalis_a_capital (City, (City, Car) Car) ← capital_of(State,capital_of(State, City City)) , , city(city(City,City , Phone, Phone, Car, Car, Inhabitants) Inhabitants)..

Concatenation via identical variables

capital_of(capital_of( 'Bavaria', 'Bavaria', 'M'M 'München'ünchen' ). ). city(city('M'M 'München',ünchen' , 089, 089, 'M', 'M', 1189 1189 ). )......

Substitution with identical values

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 73 Relations Defined by Several Rules 2

Derived Relations may be defined by more than one rule :

← european_city(X)european_city(X) ← situated_in(X,situated_in(X, 'Denmark' 'Denmark' ) ). . Implicit ← european_city(X)european_city(X) ← situated_in(X,situated_in(X, 'Germany' 'Germany' ) ). . disjunction ← (logical ∨) european_city(X)european_city(X) ← situated_in(X,situated_in(X, 'Russia' 'Russia' ) )......

Each rule is able to derive a partial relation .

Some derived facts may be simultaneously derivable by several rules. As relations are sets, just one „copy survives“!

The entire relation thus is the union of all these partial relations „Union semantics “ (i.e., subsets of the full relation).

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 74 Order Independence in Datalog 2

• Datalog has been conceived as a purely declarative language , for which any aspect of execution – in particular of efficient evaluation – is irrelevant for the definition of syntax and semantics.

• Thus : The order of notation is irrelevant . . . • . . . if several rules define the same relation. • . . . if several literals occur in a rule body.

• The following rule sets are considered equivalent in Datalog, even though they are syntactically different:

← ← p(X)p(X) ← q(X,Y),q(X,Y), not not s(Y).s(Y). p(X)p(X) ← t(X),t(X), w(X), w(X), r(X). r(X). ← ← p(X)p(X) ← r(X),r(X), t(X), t(X), w(X). w(X). p(X)p(X) ← notnot s(Y),s(Y), q(X,Y). q(X,Y).

• For any concrete evaluation strategy, however, an evaluation order for literals and rules has to be fixed, though, and to be planned well for efficiency‘s sake.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 75 ∃-Position for Local Variables: Motivation 2

• The convention „Implicit ∃-quantifiers are always assumed to stand in front of all literals!“ seems to be unnecessary on first glance . For this reason, it it worthwhile to discuss alternatives by means of a more complex example:

← p(X)p(X) ← q(X,Z),q(X,Z), r(Z,X, r(Z,X, Y), Y), s(Y,W). s(Y,W).

∃ ∃ Y,Z,WY,Z,W • The only true alternative would be to keep the „scope“ (range of validity) of quantifiers as small as possible and thus to make them individually different:

← p(X)p(X) ← q(X,Z),q(X,Z), r(Z,X, r(Z,X, Y), Y), s(Y,W). s(Y,W). Disadvantage: • Considerably more complex ∃ ∃ • Different quantifier structure ∃ ZZ ∃ WW for each reordering of the rule ∃ body ∃ YY

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 76 Anonymous Variables (1) 2

• Frequent situation in Datalog-rules: Various local variables are not used for connecting literals or for representing output values, but just for „filling“ parameter positions.

is_a_capital(is_a_capital(City,City , Car Car)) ← ← capital_of(capital_of(State,State , City City)) , , city(city(City,City , Phone Phone,, Car Car,, Inhabitants Inhabitants).).

• Adopted from Logic Programming: Abbreviating notation for this kind of "fill-up parameters" by underscore

is_a_capital(is_a_capital(City,City , Car Car)) ← ← capital_ofcapital_of ( (_ _, ,City City)) , , city(city(City,City , _ _, ,Car Car,, _ _). ).

"anonymous"anonymous variables" variables" (aka : "don't care"-variables)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 77 Anonymous Variables (2) 2

• Although the same symbol (underscore) is used for representing anonymous variables, each occurrence of such a variable stands for a different, completely new variable occurring in this position only (the „true name“ of which we do not know):

is_a_capital(is_a_capital(City,City , Car Car)) ← ← capital_ofcapital_of ( (_ _, ,City City)) , ,city(city(City,City , _ _, ,Car Car,, _ _). ).

is_a_capital(is_a_capital(City,City , Car Car)) ← ← capital_of ( X , City ) , city( City , X , Car , X ). capital_of ( X11, City) , city(City, X22, Car, X33).

∃∃X , X , X X11, X22, X33 • Each of these variables has its own implicit quantifier . All these quantifiers is placed (as usual) at the beginning of the rule body (see above).

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 78 Hierarchical Dependencies 2

Important in particular for constructing terminological hierarchies:

DerivedDerived relations relations may may depend depend on on other other derived derived relations relations (not(not only only on on base base relations), relations), i.e.: i.e.: InIn the the body body of of a a rule, rule, arbitrary arbitrary relations relations may may be be r referencedeferenced byby literals. literals.

e.g .: Corresponding dependency graph : ← p p(X,Y)p(X,Y) ← q(X,Y),q(X,Y), r(Y, r(Y, X). X). ← q(X,Y)q(X,Y) ← s(X,s(X, Y, Y, Z). Z). ← q r r(Y,r(Y, X) X) ← t(X,Y),t(X,Y), not not w(Y).w(Y). not

s t w

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 79 Recursive Dependencies 2

• Moreover : Rule-defined relations may depend on themselves , i.e., rules may be recursive .

• But : Recursive rules ought to come with at least one non -recursive rule defining the same relation („well-founded recursion")

e.g .: RestrictionRestriction to to be be liftedlifted later later on! on! ← reachable_from(A,reachable_from(A, B) B) ← Non -recursive adjacent(A,adjacent(A, B). B). ← reachable_from(A,reachable_from(A, B) B) ← adjacent(A, C), reachable_from Recursive adjacent(A, C), reachable_fromreachable_from(C,(C, B). B).

adjacent

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 80 Instances of Datalog Rules 2

• By stepwise replacement of variables by constants in a rule, various instances of this rule can be obtained:

← p(X,p(X, Y) Y) ← q(X,q(X, Z), Z), r(Z, r(Z, Y). Y). "Ground instance" X/ a X/ b, Y/ 1, Z/ c (All variables have been replaced by constants.) ← p(p(aa, ,Y) Y) ← q(q(aa, ,Z), Z), r(Z, r(Z, Y). Y). ← p(p(bb, ,1 1) ) ← q(q(bb, ,c c),), r( r(cc, ,1 1).).

• In each step, consistent replacements are admissible only, i.e., different occurrences of the same variable are always replaced by the same constant.

Not an p( a, Y) ←← q( b, Z), r(Z, Y). p( a, Y) q( b, Z), r(Z, Y). instance!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 81 No Nulls 2

• In SQL, „empty fields “ in a table are permitted, in case information about the particular attribute is missing for the particular object represented in the resp. row.

• Theoretically, empty fields are considered to contain a special value – called NULL , or: a null value – representing an „existing, but unknown “ value in the resp. domain.

• Thus, NULL cannot (or better: ought not) be used for representing cases where, e.g., the resp. attribute does not apply , or where we do not know whether a value exists at all for the resp. field.

• Null values introduce quite sophisticated problems for query evaluation to SQL – ultimately, a 3-valued logic (with an extra value „unknown“) has to be introduced in order to properly deal with the implications of using NULL in the particular inter- pretation fixed in the SQL standard.

• Even though nulls are rather helpful in many practical cases, we do not allow NULL in Datalog , due to the problems resulting. Researchers in deductive DBs have been agreeing on this till now. NoNo NULL NULL in in Datalog Datalog ! !

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 82 Standardization of Deductive Databases 2

• A deductive database in Datalog is a set of facts and rules (later on in this lecture, we will reconsider this "definition" a bit).

• By now people are used to assume that each relation in a Datalog-DB is either defined by stored facts only (base relation) or defined by rules only (derived relation): "Standardization Assumption"

• If you want to extend a rule-defined relation with some facts (for expressing special cases), however, an auxiliary relation summarizing the special cases is necessary again:

p (a). p11(a). p(a). p (b). p(a). p11(b). p(b).p(b). ← ← p(X)p(X) ← q(X,Y).q(X,Y). p(X)p(X) ← pp1(X).(X). ← ← 1 p(X)p(X) ← s(X),s(X), t(X). t(X). p(X)p(X) ← q(X,Y).q(X,Y). ← p(X)p(X) ← s(X),s(X), t(X). t(X). Not admissible ! Standardized

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 83 "Closed World Assumption" (CWA) 2

• In databases up till now, one does not store negative information , but positive data only (or data derivable from stored data).

• Nevertheless, many (not all!) queries containing negation can be answered, e.g.:

WhichWhich citiescities areare nono majormajor cities?cities?

• This is possible, because we (tacitly) assume that all facts in DB-relations which are not stored are wrong in reality – and, of course, that all stored facts are true !

• This assumption is called the "Closed World Assumption" (CWA) in DDB-literature: • Each piece of knowledge about "the world" (i.e., the resp. application area) is represented in the DB in form of positive facts (stored or derivable). • There is no doubt about true and false information (2-valued logic). • In an "open world" it would be necessary to distinguish between negative and unknown information (e.g. by storing false facts explicitly in addition to true ones).

Obviously,Obviously, CWA CWA is is hopelessly hopelessly idealistic idealistic – –but but there there is is hardly hardly any any alternative! alternative!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 84 CWA (2) 2

CWA : 1) All true facts are represented in the DB. 2) All facts in the DB are true . 3) All false facts form the complement (in set-theoretic terms) of the DB and, thus, exist implicitly only .

Complement(DB) Reference set for constructing the complement:

Set of all DB syntactically + constructable facts −−

Constructable from The complement of the DB is never • all relation names in the schema explicitly computed or even stored • all constant in all value domains (for efficiency reasons)!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 85 Negation in Datalog 2

• Consequence of CWA: Negation in Datalog is admissible in rule bodies only , not in facts and not in the head of a rule (i.e., not for derivable facts).

• There is no directly expressible negative information in a Datalog-DB:

notnot capital_of('Germany',capital_of('Germany', 'Bonn' 'Bonn' ) ) ..

No stored "negative facts" !!

• Derivation of negative information is excluded, too:

← notnot is_a_capital(is_a_capital( X X ) ) ← situated_in(situated_in( X, X, 'Bavaria' 'Bavaria' ) ). .

No derivable "negative facts" !!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 86 Negation in Rules 2

• Deriving positive information exclusively from negative information is not possible in Datalog, too, because the complement of the DB would have to be made explicit for this purpose (which is unrealistic, see above):

← north_german_city(X)north_german_city(X) ← notnot is_situated_in(X,is_situated_in(X, 'Bavaria') 'Bavaria') . .

Negation is "unproductive" !!

• Negation in Datalog is admissible in combination with positive facts only , i.e., only in connection with logical conjunction: and not

• Negation can be used for "testing" variable bindings only, which have been "produced" in the positive parts of the rule before : ← north_german_city(X)north_german_city(X) ← city(city(XX, ,Y1, Y1, Y2, Y2, Y3) Y3) , , not not is_situated_in(is_situated_in(XX, ,'Bavaria') 'Bavaria') . .

Production Test

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 87 Evaluating Negative Literals 2

normal_city(normal_city( 'Köln', 'Köln', 'K' 'K' ). ). 33

← normal_city(City,normal_city(City, Car) Car) ← city(City,city(City, _ _ , ,Car, Car, _ _ ) ), ,not not is_a_capital(City,is_a_capital(City, Car) Car)..

2a ? yes ! 11 City/'Köln' 2a Car/'K' is_a_capital('Kis_a_capital('Köln',öln', 'K')'K') ?? Positive city( 'K öln', 0221, 'K', 963 ). „side evaluation" city( 'Köln', 0221, 'K', 963 ). 2b ? city(city( 'München', 'München', 089, 089, 'M', 'M', 1189 1189 ). ). 2b no ! ...... is_a_capital('München',is_a_capital('München', 'M' 'M' ). ). is_a_capital('Düsseldorf',is_a_capital('Düsseldorf', 'D'). 'D')......

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 88 "Negation as Failure" 2

• The corresponding evaluation principle for negative literals is called "Negation as failure" (If we fail to find the positive literal to be tested in the DB, we assume it to be false.) • Prerequisite : • Before evaluating any negative literals, all variables contained in this literal have to be bound by evaluating "suitable" positive literals. • Only negative ground literals are evaluable via "negation as failure".

• In order to evaluate a literal not F, . . . • . . . try to answer its positive part F. • If F is true, then not F is false. • If F is false, then not F is true: "(Proof of the) negation (of 'F') by failure (to prove 'F')"

• Negative literals with variables are not evaluable by accessing DB-facts: • 'not p(X)': Find X-bindings, so that 'p(X)' is not true (in the DB)! • Inspecting the p-Relation produces only such X-bindings, for which 'p(X)' is true! • Where to find "all the other possible" X-bindings? (Complement remains implicit due to CWA!)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 89 Safe Negation (1) 2

• Thus, we need an additional safety condition for negated literals:

EachEach variable variable occurring occurring in in a a negative negative literal literal has has t oto occur occur inin at at least least one onepositive positive literal, literal, too. too.

• Dangerous : Erroneous application of unsafe variables may easily happen, if misinterpreting the assuption about implicit existential quantification

← normal_city(City,normal_city(City, Car) Car) ← ∃ State: ∃ State: city(City,city(City, _ _ , ,Car, Car, _ _ ) ), , notnot capital_of(capital_of(State,State , City) City) . .

∃ ∃State:State: Would be intuitively meaningful, but doesn ‘t belong here according to the „quantifier rule “!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 90 Safe Negation (2) 2

← normal_city(City,normal_city(City, Car) Car) ← ∃ State: ∃ State: city(City,city(City, _ _ , ,Car, Car, _ _ ) ), , notnot capital_of(capital_of(State,State , City) City) . .

∃ ∃State:State:

If the "forbidden" form of existential quantification is to be expressed in a different way, it is necessary to "swap" the existential quantifier into a separate rule:

← normal_city(City,normal_city(City, Car) Car) ← city(City, _ , Car, _ ) , ∃ city(City, _ , Car, _ ) , ∃State:State: notnot is_a_capitalis_a_capital(City)(City) . .

← is_a_capitalis_a_capital(City)(City) ← Auxiliary capital_of(capital_of(State,State , City) City). . relation

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 91 No Nesting in Datalog! 2

• Negation is admitted only if occurring directly in front of individual literals, not for negating entire conjunctive expressions. Thus, rule bodies may not contain nesting of logical operators!

← north_or_west_german_city(north_or_west_german_city( X X ) ) ← city(city(XX, ,_, _, _, _, _) _) , , notnot ((south_ofsouth_of ( X( X, ,'Hannover' 'Hannover' ), ), east_of(X, east_of(X, 'Lübeck') 'Lübeck') ) .).

• If such a rule is to be expressed differently in Datalog, it is necessary to introduce another auxiliary relation definied by a separate rule (without nesting):

← north_or_west_german_city(north_or_west_german_city( X X ) ) ← city(city(XX, ,_, _, _, _, _) _) , ,not not south_east_of(X)south_east_of(X). . ← south_east_of(X)south_east_of(X) ← south_ofsouth_of ( X( X, ,'Hannover' 'Hannover' ), ), east_of(X, east_of(X, 'Lübeck'). 'Lübeck').

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 92 Simulating Universal Quantifiers (1) 2 • We learnt previously, that all local variables in a rule body are implicitly existentially quantified. What about universal quantifiers („forall“ in logic – symbolically: ∀)?

• As a motivating example , consider the following natural language sentence (to be turned into a Datalog rule): AA student student is is successful successful if if (s)he (s)he has has passed passed exams exams o of fall allmandatory mandatory modules. modules. • For the corresponding Datalog formalization, consider relations students(MatrNr), exams(MatrNr,ModNr,Result), and modules(ModNr,Status). Further assume that exam results are either pass or fail , and that module status is either m(andatory) or o(ptional).

• There is no forall in Datalog , not even implicity (as for SQL, also lacking forall)!

• Instead, a law of equivalence from predicate logic has to be exploited for „simulating“ forall by means of not and exists : ∀ ≡ ¬ ∃ ¬ ∀x:x: F(x) F(x) ≡ ¬ ∃x:x: ¬F(x)F(x)

• In natural language, applying the same reasoning principle leads to the reformulation of the example as follows:

AA student student is is successful successful if if there there is is no nomandatory mandatory module module (s)he (s)he did did not notpass. pass.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 93 Simulating Universal Quantifiers (2) 2

AA student student is is successful successful if if there there is is no nomandatory mandatory module module (s)he (s)he did did not notpass. pass.

• If Datalog would permit explicit quantifiers (and nesting), the following rule could be written, formalizing the above sentence:

successful(S) ← students(S) and not (∃ M: modules(M,'m') and not exams(S,M, ''pass')).

• Beware ! This is not Datalog, but hypothetical „extended Datalog“, we just use didactically!

• Applying techniques introduced before (unnesting, implicit existential quantification), we obtain the following (equivalent) version – using an auxiliary rule – which is proper Datalog :

successful(S) ← students(S), not has_ failed_module(S). has_failed_module(S) ← students(S), modules(M,'m'), not exams(S,M,'pass')).

∃ M

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 94 Built in -Predicates 2

• Doesn‘t belong to the „core“ of Datalog, by unavoidable in practice: << ComparisonComparison operators operators =<=< = • In logic : Relation names, too = (like names of DB-relations) \=\= >=>= >> • But: These „relations“ are obviously not definable by facts oder rules in extensional form, but have to be realized in external programming languages by means of suitable "test procedures" (i.e., from the perspective of Datalog as "built-ins")

• For better distinction between (DB-)relations and this kind of test relations we use another notion from logic (more or less synonym with „relation“): "Predicate"

• In Datalog, we use test predicates in test literals , e.g.

XX > >YY X X =< =< 11 not not aa = =bb

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 95 Safe Comparisons 2

• Comparison predicates are used exclusively for testing whether two elements of a certain data type denoted by two terms satisfy the resp. test.

• Comparisons are possible only if none of the two parameters of a test literal are still variable when performing the test.

• A test literal thus is subject to similar safety requirements as negative literals:

EachEach variable variable in in a a test test literal literal has has to to occur occur in in at at least least one, one, positivepositive DB DB-literal-literal within within the the same same conjunction conjunction which which containscontains the the resp. resp. test test literal. literal.

• Examples for safe resp. unsafe usage of comparisons:

p(X,Z)p(X,Z) , ,X X > >YY , ,q(Z,Y) q(Z,Y) p(X,Y),p(X,Y), Y Y < < Z Z Safe Unsafe

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 96 Built in -Functions 2

• Unavoidable, too: Arithmetic operations (and possibly other elementary operations on data types)

• Such "built in" -functions are to be realized in an external programming language, too. + • Evaluable (functional) terms in DB-literals are "disturbing" as they + have to be treated different from the "matching"-based evaluation of -- DB-literals over facts and rules: // ⇒ ** ⇒⇒ FunctionalFunctional terms terms are are (for (for now) now)admissible admissible inin test test literals literals only! only!

• Reason : Test literals are to be evaluated externally anyway; moreover, functions rely on "safety" of all parameters, too.

← ← p(X,p(X, X+1) X+1) ← q(X,q(X, X-1) X-1) . . p(X,Y)p(X,Y) ←YY = = X+1, X+1, q(X,Z), q(X,Z), Z Z = = X-1. X-1. Not really tests anymore!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 97 Aggregate Functions (1) 2

• important class of „built -in “ functions in SQL: AggregateAggregate functions functions

COUNTCOUNT Number of SUMSUM Sum AVGAVG Average MAXMAX Maximum MINMIN Minimum

• Aggregate functions compute one scalar value out of a set of scalar values (the „aggregate “) originating from one column of one table:

Table Aggregate Function value

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 98 Aggregates in SQL: Reminder 2

SELECTSELECT P. P. Rank, Rank, AVG AVG(( P.Age P.Age ) ) AS AS AvgAge AvgAge This is how FROMFROM professors professors AS AS P P aggregates WHEREWHERE P.Name P.Name <> <> ‚Ken‘ ‚Ken‘ work in SQL: GROUPGROUP BY BYP. P. Rank Rank

GROUP BY

NameName Rank Rank Age Age NameName Rank Rank Age Age JimJim C4 C4 43 43 Lisa C4 39 AVG JimJim C4 C4 43 43 Lisa C4 39 RankRank AvgAge AvgAge JohnJohn C3 C3 33 33 C2C2 32.0 32.0 John C3 33 KenKen C4 C4 57 57 John C3 33 C3C3 34.5 34.5 Eva C3 36 LisaLisa C4 C4 39 39 Eva C3 36 C4C4 41.0 41.0 TomTom C2 C2 32 32 EvaEva C3 C3 36 36 TomTom C2 C2 32 32

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 99 Aggregate Functions (2) 2

• We need aggregate functions in Datalog as well – they ought to be treated similarly with other functions (at least as far as they exhibit comparable properties), e.g., terms containing aggregate functions should appear in test literals only.

• But aggregates are considerably different from, e.g., arithmetic functions as they require the prior computation of „the aggregate“, i.e., the collection of objects to which they are to be applied.

• Therefore, we need a special construct expressing an aggregate inside the body of the rule, which means a kind of nesting „hidden“ by a special syntax, e.g.:

← avg_salary(Dept,avg_salary(Dept, AvgS) AvgS) ←AvgSAvgS = = avg avg(Salary,(Salary, Dept, Dept, employee(E, employee(E, Dept, Dept, Salary)) Salary)) • There are special ternary aggregate functions: avg, sum, max, min, card. (We prefer card(inality ) rather than count .) • The 1 st parameter of each aggregate term is the variable to be aggregated about. • The 2 nd parameter is the grouping variable. • The 3 rd parameter is a literal defining the grouping condition. • Thus, the example reads: AvgS is the average salary per department computed from the employee relation.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 100 Summary of Notations and Concepts 2

fact CWA rule negation as failure rule head dependency graph rule body recursion literal standardization ground literal built-ins retrieval literal aggregates test literal safety conjunction, disjunction negation quantifier variable anonymous variable local variable order independence instance ground instance

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 101 A Tale of Two Languages 2

2.12.1 Datalog Datalog and and SQL: SQL: Learning Learning by by Doing Doing 2.22.2 Datalog: Datalog: Syntax Syntax and and (Basic) (Basic) Semantics Semantics 2.32.3 From From SQL SQL to to Datalog Datalog (and (and Back) Back)

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 102 About this Section (1) 2

• Section 2.3 plays an important role in various respects: • It tries to link the („academic“) key language of this lecture, Datalog , to the key language in the „professional“ DB world, SQL , in order to make sure that you know about the relevance of our topics for the „real world“. • It tries to provide you with a solid background on the theoretical basis of the languages covered, a background that is relevant for more recent approaches to IS/DB models and formalisms, too (such as XML and RDF, e.g.).

• At the same time, this section – extending just to the next lecture this week – is not able to provide you with a more detailed introduction to the languages addressed, in particular SQL. Effort of your own is needed, at least for making sure you leave the university with skills you can directly use in practice.

• For our exam expectations , however, we try to treat everything you need explicitly and deep enough.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 103 About this Section (2) 2

• The order in which topics will be presented this week is not the one which will be chosen finally , due to „context reasons“. In particular, the foundations ought to be discussed first, not last (as we will have to do in the lecture).

• SQL basics would be necessary as second topic in 2.3, as Datalog has been well introduced enough. This topic will also appear in a bit more detail and in a different position within the slide order later.

• Even though the „Datalog to SQL “ transformation is the more important one (wrt „exporting“ results from this lecture to the SQL „world“) we will start the other way round and address “SQL to Datalog ” first, again for “context reasons”.

• We will „call“ this part SQL2Datalog in the headings. Slides are from last year (and thus directly „re-usable“).

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 104 SQL2Datalog: General Remarks Ahead 2

• Every result (to be) presented during this lecture in the context of Datalog can be trans - ferred to SQL – the only exception concerning unstratifiable rules to be discussed in 3.3.

• We will discuss translation of SQL views (vice versa) into Datalog rules first in this sec- tion. The examples chosen can be easily generalized.

• A core sublanguage of SQL is sufficient for mapping into Datalog (and back). Core SQL offers the following operators: • SELECT-FROM-WHERE queries (without nesting in SELECT and FROM) • JOIN and its variants are omitted: They can be expressed using „normal“ FROM and join conditions in the WHERE-part. • UNION is the only operator needed for forming complex queries. OR is not needed, just AND occurs in the WHERE-part. • Nested subqueries (after EXISTS with and without NOT) are sufficient for expressing MINUS and INTERSECT).

• Transformation in the other direction (Datalog into SQL) uses the same techniques in prin- ciple – two slides on this issue will be discussed at the end of this chapter.

© 201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 105 SQL2Datalog (1): Simple 1 -Table Query 2

CREATE VIEW p AS Table t: SELECT A SQL-Style: attributes, no positions SELECT A A, B, C, D, E FROM t FROM t Datalog-Style: positions, no attributes WHEREWHERE B>0 B>0 AND AND C=D C=D AND AND NOT NOT D=E D=E Correspondence: 1. A, 2. B, 3. C, 4. D, 5. E

Direct translation into Datalog rule: X, Y, Z, V, W are variables, p(X) ← t(X, Y, Z, V, W), Y>0, Z=V , not V=W. p(X) ← t(X, Y, Z, V, W), Y>0, Z=V, not V=W. not attributes! SELECT FROM WHERE

More compact rule by expressing (consequences of) equation using the same variable: ← p(X)p(X) ← t(X,t(X, Y, Y, Z Z,, Z Z,, W), W), Y>0, Y>0, not not ZZ=W.=W.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 106 SQL2Datalog (1a): Some Background on TRC vs. DRC 2

CREATE VIEW p AS SELECTSELECTA A FROMFROM t t TRC: Tuple Relational Calculus WHEREWHERE B>0 B>0 AND AND C=D C=D AND AND NOT NOT D=E D=E

Implicit tuple variable in query:

SELECTSELECT x x.A.A FROMFROM t t AS AS x x WHEREWHERE x x.B>0.B>0 AND AND x x.C=x.D.C= x.D AND AND NOT NOT x x.D=x.E.D= x.E • x bound to t-tuples one after the other • Attributes are functions „extracting“ components from current value of x. • Function application expressed in postfix notation

← p(X)p(X) ← t(X,t(X, Y, Y, Z, Z, V, V, W), W), Y>0, Y>0, Z=V, Z=V, not not V=W.V=W.

• Variables represent components of the same tuple in t DRC: Domain Relational Calculus • „same tuple in t“: expressed by common t-literal

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 107 SQL2Datalog (2): Simple Multi -Table Query 2

CREATE VIEW p AS New tables: r(A,B,C) SELECTSELECTA A s(A,B) FROMFROM r, r, s s WHEREWHERE r.B=s.A r.B=s.A

Variant with double occurrence of s and aliasing:

Same translation idea: CREATE VIEW p AS ← SELECT A p(X)p(X) ← r(X,r(X, Y, Y, Z), Z), s(V,W), s(V,W), Y=V. Y=V. SELECT A FROMFROM s s AS ASs1, s1, s s AS AS s2 s2 SELECT FROM WHERE WHEREWHERE s1.B=s.2A s1.B=s.2A

Compactification by multiple variable occurrence and anonymous variables: ← p(X) ←← s(X, Y), s(Y,_). p(X)p(X) ← r(X,r(X, Y, Y, _), _), s(Y,_). s(Y,_). p(X) s(X, Y), s(Y,_).

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 108 SQL2Datalog (2a): JOIN as „Syntactic Sugar “ 2

CREATE VIEW p AS JOIN queries are dispensable in SQL, they are just syntactic alternatives SELECT A SELECT A for products queries with conditions FROMFROM r rJOIN JOIN s sON ON r.B r.B = = s.A s.A in the WHERE part:

SELECTSELECTA A FROMFROM r, r, s s WHEREWHERE r.B r.B = = s.A s.A

← p(X)p(X) ← r(X,r(X, Y, Y, _), _), s(Y,_). s(Y,_). Datalog doesn‘t offer any such special „luxury“ notations: Same translation!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 109 SQL2Datalog (3): Complex UNION -Query 2

CREATE VIEW p AS Each subquery in a UNION query SELECTSELECTA A is turned into a rule of its own, defining the same relation: FROMFROM r r WHEREWHERE B=C B=C ← p(X)p(X) ← r(X,r(X, Y, Y, Y). Y). UNION ← p(X)p(X) ← ss(X,(X, Z), Z), Z>0. Z>0. SELECTSELECTA A FROMFROM s s WHEREWHERE B>0 B>0

Variant: Y instead of Z in 2 nd rule (Ys are different in different rules!) ← p(X)p(X) ← r(X,r(X, Y, Y, Y). Y). ← p(X)p(X) ← ss(X,(X, Y), Y), Y>0. Y>0.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 110 SQL2Datalog (3a): OR in WHERE -part Turned into UNION 2

CREATE VIEW p AS OR in WHERE parts is dispensable, SELECTSELECTA A UNION serves the same purpose! FROMFROM t t WHERE B>0 OR C=D WHERE B>0 OR C=D CREATE VIEW p AS SELECTSELECTA A FROMFROM t t WHEREWHERE C=D C=D

UNION

SELECTSELECTA A FROMFROM t t WHEREWHERE B>0 B>0

← p(X)p(X) ← t(X,t(X, _, _, Z, Z, Z, Z, _). _). ← p(X)p(X) ← t(X,t(X, Y, Y, _, _, _, _, _), _), Y>0. Y>0.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 111 SQL2Datalog (3b): UNION Cannot Always be Turned into OR in WHERE -part 2

CREATE VIEW p AS Both tables in FROM part generate product of tables! SELECTSELECTA A FROMFROM r r WHEREWHERE B=C B=C SELECTSELECTA A FROMFROM r, r, s s UNION WHEREWHERE B>0 B>0 OR ORB=C B=C SELECTSELECTA A FROMFROM s s WHEREWHERE B>0 B>0 UNION is more general than OR as it applies to inhomogeneous input as well (two different tables in sub-queries):

In such a case: No transformation using OR in a single query is possible! ← p(X)p(X) ← t(X,t(X, _, _, Z, Z, Z, Z, _). _). ← p(X)p(X) ← t(X,t(X, Y, Y, _, _, _, _, _), _), Y>0. Y>0.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 112 SQL2Datalog (4): Complex Query With Nesting in WHERE -part 2

CREATE VIEW p AS SELECTSELECTA A FROMFROM t t WHEREWHERE B>0 B>0 AND AND (C=D (C=D OR ORD=E) D=E) 2nd step: Expressing OR by UNION 1st step: Un-nesting using laws of (as before) Boolean algebra til OR is outermost SELECTSELECTA A FROMFROM t t SELECT A SELECT A WHEREWHERE B>0 B>0 AND AND C=D C=D FROMFROM t t WHERE (B>0 AND C=D) OR WHERE (B>0 AND C=D) OR UNION (B>0(B>0 AND AND D=E) D=E) SELECTSELECTA A FROMFROM t t WHEREWHERE B>0 B>0 AND AND D=E D=E

← p(X)p(X) ← t(X,t(X, Y, Y, Z, Z, Z, Z, _). _). Y>0. Y>0. ← p(X)p(X) ← t(X,t(X, Y, Y, _, _, Z, Z, Z), Z), Y>0. Y>0.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 113 SQL2Datalog (5): MINUS Queries Translated Via NOT EXISTS 2 CREATE VIEW p AS SELECTSELECTA A FROMFROM r r MINUS is dispensable, too, as NOT EXISTS serves the WHEREWHERE B=C B=C same purpose (and shows clearer which subquery plays „generator role“ and which one „filter role“) MINUS

SELECTSELECTA A FROMFROM s s SELECT A WHEREWHERE B>0 B>0 SELECT A FROMFROM r r WHEREWHERE B=C B=C AND AND NOT NOT EXISTS EXISTS ( ( SELECTSELECT* * FROMFROM s s WHEREWHERE B>0 B>0 AND AND s.A=r.A)s.A=r.A) ))

← p(X)p(X) ← r(X,r(X, Y, Y, Y), Y), not not ss‘(X)‘(X) ← ss‘(X)‘(X) ← ss(X,Z),(X,Z), Z>0. Z>0. Auxiliary rule needed for embedded subquery in order to guarantee safe negation.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 114 SQL2Datalog (6): INTERSECT Queries Translated Via EXISTS 2 CREATE VIEW p AS SELECTSELECTA A FROMFROM r r WHEREWHERE B=C B=C

INTERSECT

SELECTSELECTA A FROMFROM s s SELECT A WHEREWHERE B>0 B>0 SELECT A FROMFROM r r WHEREWHERE B=C B=C AND AND EXISTS EXISTS ( ( SELECTSELECT* * FROMFROM s s WHEREWHERE B>0 B>0 AND AND s.A=r.A)s.A=r.A) ))

p(X) ← r(X, Y, Y), s ‘(X). p(X)p(X) ← rr(X,(X, Y, Y, Y), Y), s s‘(X).‘(X). Auxiliary relation s‘ instead of nesting in Datalog. ← ss‘(X)‘(X) ← ss(X,Z),(X,Z), Z>0. Z>0.

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 115 SQL2Datalog (6a): INTERSECT Queries Translated Without EXISTS 2

CREATE VIEW p AS As opposed to NOT EXISTS: EXISTS queries can SELECTSELECTA A be „folded back“ into a single-level query! FROMFROM r r WHEREWHERE B=C B=C SELECTSELECTr.A r.A FROM r, s INTERSECT FROM r, s WHEREWHERE r.B r.B = = r.C r.C AND AND s.B>0 SELECTSELECTA A s.B>0 AND s.A = r.A FROMFROM s s AND s.A = r.A WHEREWHERE B>0 B>0

← p(X)p(X) ← r(X,r(X, Y, Y, Y), Y), s s(X,Z),(X,Z), Z>0. Z>0.

Corresponding effect in Datalog: No auxiliary rule necessary as not is missing!

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 116 SQL2Datalog (7): Recursion 2

CREATE RECURSIVE VIEW p AS ( (SELECT * FROM s ) p(X,Y) ← s(X,Y). UNION (SELECT s.A, p.B p(X,Y) ← s(X,Z), p(Z,Y). FROM s, p WHERE s.B=p.A ) )

SQL has recursive views , like Datalog Therefore, every recursive view in SQL having recursive rules. can be equivalently translated into a set of recursive Datalog rules (but not always There are certain restrictions remaining vice versa). in SQL, though, most notable: - linear recursion - stratifiable rule sets (see chapter 3)

© 2017201 8 Prof. Dr. Rainer Manthey Intelligent Information Systems 117