
Database Theory
Lecture 1: Introduction / Relational data model

Markus Krötzsch
Knowledge-Based Systems

TU Dresden, 2nd Apr 2019

Course information

• Lectures: Tuesday, DS2 and DS3 (9:20–12:40)
• Exercise classes: to be decided (communicated in next week’s lectures)
  ⇝ taught by David Carral
• Oral examination (details based on applicable examination regulations)
• Course homepage (dates, slides, exercise sheets):
  https://iccl.inf.tu-dresden.de/web/Database_Theory_(SS2019)

Markus Krötzsch, 2nd Apr 2019 Database Theory slide 2 of 27

Aims of the course

Obtain an understanding of key topics in database theory with a special focus on query formalisms:
• Relational data model
• Basic and advanced query languages
• Expressive power of query languages
• Complexity of query answering + some algorithmic approaches
• Modelling with constraints

Connect databases with other advanced topics in logic/KR/formal methods

Literature, prerequisites, related courses

• Serge Abiteboul, Richard Hull, Victor Vianu: Foundations of Databases. Addison-Wesley. 1994.
  – Available at http://webdam.inria.fr/Alice/
  – Slight deviations in the lecture
  – Further literature will be given for advanced topics
• Prerequisites: basics of first-order logic, Turing machines, worst-case complexity
• Related courses at TUD:
  – Advanced Logic
  – Foundations of Semantic Web Technologies
  – Introduction to Logic Programming
  – Introduction to Constraint Programming
  – Datenbanken (Grundlagen)
  – Theoretische Informatik & Logik

What is a database?

A Database Management System (DBMS) is software for managing collections of data.
⇝ highly important class of software systems
⇝ major role in industry and in research
⇝ extremely wide variety of concepts and implementations

General three-level architecture of DBMS:
• External Level: Application-specific user views
• Logical Level: Abstract data model, independent of implementation, conceptual view
• Physical Level: Data structures and algorithms, platform-specific

In this lecture: focus on logical view for relational data model

What is a database? (2)

Basic functionality of DBMS:
• Schema definition: specify how data should be logically organised
• Update: insert/delete/update stored data
• Query: retrieve stored data or information derived from it
• Administration: user rights management, configuration, recovery, data export, etc.

Many related concerns:
• Persistence: data retained when DBMS is shut down
• Optimisation: ensure maximal efficiency
• Scalability: cope with increasing loads by adding resources
• Concurrency: support many update and query operations in parallel
• Distribution: combine data from several locations
• Interfaces: APIs, query languages, update languages, etc.
• ...

In this lecture: schema, query languages, some optimisation

Overview: Topics covered (excerpt)

1. Introduction / Relational data model
2. First-order queries
3. Complexity of query answering
4. Complexity of FO query answering
5. Conjunctive queries
6. Tree-like conjunctive queries
7. Query optimisation
8. Optimisation / First-Order Expressiveness
9. First-Order Expressiveness / Introduction to Datalog
10. Expressive Power and Complexity of Datalog
11. Optimisation and Evaluation of Datalog
12. Database dependencies
13. Query answering under constraints

The Relational Data Model

Markus Krötzsch, 2nd Apr 2019 Database Theory slide 7 of 27 Markus Krötzsch, 2nd Apr 2019 Database Theory slide 8 of 27 Database = collection of tables Towards a formal definition of “table”

Lines: Stops: A table row has one value for each column Line Type SID Stop Accessible { row = function from the attributes of the table schema to specific values 85 bus 17 Hauptbahnhof true 3 tram 42 Helmholtzstr. true Example: The row

F1 ferry 57 Stadtgutstr. true SID Stop Accessible ...... 123 Gustav-Freytag-Str. false ...... 42 Helmholtzstr. true Connect: ...... From To Line Every table has a schema: can be represented by the function: 57 42 85 Lines[Line:string, Type:string] • f : SID 42, Stop "Helmholtzstr.", Accessible true 17 789 3 Stops[SID:int, Stop:string, { 7→ 7→ 7→ } • ...... Accessible:bool] Connect[From:int, To:int, Line:string] • Markus Krötzsch, 2nd Apr 2019 Database Theory slide 9 of 27 Markus Krötzsch, 2nd Apr 2019 Database Theory slide 10 of 27

Database = set of tables

Let $\mathbf{dom}$ (“domain”) be the (infinite) set of conceivable values in tables.
For simplicity, we drop the datatypes of database columns and assume that each column uses the same datatype that supports all values in $\mathbf{dom}$.

Definition 1.1:
• A relation schema $R[U]$ consists of a relation name $R$ and a finite set $U$ of attributes ($|U|$ is the arity of $R[U]$)
• A table for $R[U]$ is a finite set of functions from $U$ to $\mathbf{dom}$
• A database instance $\mathcal{I}$ is a finite set of tables

Note: we disregard the order and multiplicity of rows.

Tables are also called relation instances. The table with relation schema $R[U]$ in the database instance $\mathcal{I}$ is written $R^{\mathcal{I}}$.

Database = set of relations

Observation: Attribute names don’t matter. Instead of the function

$\{ \text{SID} \mapsto 42,\ \text{Stop} \mapsto \text{"Helmholtzstr."},\ \text{Accessible} \mapsto \text{true} \}$

we could also use a tuple:

$\langle 42, \text{"Helmholtzstr."}, \text{true} \rangle$

Necessary assumption: Attributes have a fixed order.

Definition 1.2:
• A relation schema $R[U]$ is defined as before
• A table for $R[U]$ is a finite subset of $\mathbf{dom}^{|U|}$
• A database instance $\mathcal{I}$ is a finite set of tables

Recall that a subset of $\mathbf{dom}^{|U|}$ is just a $|U|$-ary relation. Sets of relations are also called relational structures.
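The step from Definition 1.1 (rows as functions) to Definition 1.2 (rows as tuples) can be sketched as follows. The helper names `row` and `to_tuple` and the constant `STOPS_ORDER` are assumptions made for this illustration. Rows are stored as frozensets of (attribute, value) pairs so that a table can be a Python set, which disregards order and multiplicity of rows.

```python
# Assumed fixed attribute order for Stops (Definition 1.2 needs one).
STOPS_ORDER = ("SID", "Stop", "Accessible")


def row(**values):
    """A named row: an immutable function from attribute names to values."""
    return frozenset(values.items())


def to_tuple(named_row, order):
    """Turn a named row into a tuple, using the fixed attribute order."""
    f = dict(named_row)
    return tuple(f[attribute] for attribute in order)


# A table is a finite set of rows, so the duplicate row below vanishes.
stops_named = {
    row(SID=42, Stop="Helmholtzstr.", Accessible=True),
    row(SID=42, Stop="Helmholtzstr.", Accessible=True),  # duplicate row
    row(SID=123, Stop="Gustav-Freytag-Str.", Accessible=False),
}

# The same table in the unnamed perspective: a set of 3-tuples.
stops_unnamed = {to_tuple(r, STOPS_ORDER) for r in stops_named}
```

Choosing a different attribute order would give a different, but equally valid, set of tuples; this is why the fixed order is a necessary assumption.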

Markus Krötzsch, 2nd Apr 2019 Database Theory slide 11 of 27 Markus Krötzsch, 2nd Apr 2019 Database Theory slide 12 of 27 Database = interpretation of first-order logic Database = set of facts

Recall: Another convenient way to write databases: First-order logic is based on predicate symbols with a fixed arity (we won’t need • Lines(85, "bus") function symbols here) Lines(F1, "ferry") An interpretation of first-order logic is a pair ∆I, I : , , • I h · i Stops(42 "Helmholtzstr." true) – ∆I is a set (the domain of interpretation) ... – maps n-ary predicates p to n-ary relations p (∆ )n ·I I ⊆ I This is (almost) a database instance! Definition 1.4: A fact is an expression p(t1, ... , tn) where p is an n-ary predicate symbol Definition 1.3: • t1,... tn are constant symbols domain of interpretation ∆I = database domain dom • • A database instance is a finite set of facts. predicate symbol = relation name • interpretation of predicate symbol (if finite!) = table • When interpreting these facts logically, their least model is again the database instance finite first-order logic interpretation = database instance (viewed as a first-order logic interpretation). •
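The correspondence between a set of facts (Definition 1.4) and an interpretation that maps predicates to relations (Definition 1.3) can be sketched in Python. The encoding of a fact as a pair and the function name `to_interpretation` are assumptions of this sketch. Line names are written as strings, following the schema Lines[Line:string, Type:string].

```python
from collections import defaultdict

# A fact p(t1, ..., tn) is encoded as (predicate name, tuple of constants).
facts = {
    ("Lines", ("85", "bus")),
    ("Lines", ("F1", "ferry")),
    ("Stops", (42, "Helmholtzstr.", True)),
}


def to_interpretation(facts):
    """Group facts by predicate symbol: each predicate is mapped to the
    finite relation that contains exactly its facts' argument tuples."""
    relations = defaultdict(set)
    for predicate, constants in facts:
        relations[predicate].add(constants)
    return dict(relations)


interpretation = to_interpretation(facts)
```

Each value of `interpretation` is a table in the unnamed perspective, so the same data can be read either as facts or as relations.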


Visualising relations

Binary relations (sets of pairs) can be viewed as directed graphs.

Example:

  Source  Target
  1       2
  1       3
  2       5
  3       2
  3       4
  4       3
  5       3

Many binary tables in one graph? Use table name to label edges!

Database = hypergraph

What to do with tables of arity $\neq 2$?
⇝ generalise graphs to hypergraphs

Definition 1.5: A hypergraph is a triple $\langle V, E, \rho \rangle$, where
• $V$ is a set of vertices
• $E$ is a set of edge names
• $\rho$ maps each edge name $e \in E$ to an n-ary relation $\rho(e) \subseteq V^n$

In other words: finite hypergraphs are databases.
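Both views can be sketched in Python: the Source/Target table as a directed graph, and a database as a hypergraph triple in the sense of Definition 1.5. The function names and the relation name "Edge" are hypothetical and chosen only for this illustration.

```python
# The Source/Target table from the example, as a set of pairs.
edges = {(1, 2), (1, 3), (2, 5), (3, 2), (3, 4), (4, 3), (5, 3)}


def successors(relation, vertex):
    """Vertices reached from `vertex` along one directed edge."""
    return {target for source, target in relation if source == vertex}


def as_hypergraph(database):
    """View a database (relation name -> set of tuples) as a triple
    (V, E, rho): vertices, edge names, and the map from names to relations."""
    vertices = {v for tuples in database.values() for t in tuples for v in t}
    edge_names = set(database)
    return vertices, edge_names, dict(database)


# "Edge" is an assumed name for the example table.
V, E, rho = as_hypergraph({"Edge": edges})
```

Several binary tables would simply add further entries to the dictionary, which matches the idea of labelling edges with the table name.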

Summary

Relational databases are everywhere:
• sets of tables with named attributes (“named perspective”)
• sets of relations (“unnamed perspective”)
• first-order logic interpretations
• sets of logical facts (ground atoms)
• hypergraphs (and graphs as a special case)
. . . all restricted to finite sets

Important elements of the theory of relational databases are very widely applicable, also to many data models that are not the classical relational one (e.g., graph databases, RDF databases, XML databases).


Relational Algebra Queries

Relational algebra queries are based on a set of operations on databases.
Each operation refers to one or more tables and produces another table
(we often simplify notation and write a table name rather than a table instance)

Main operations of the named perspective:
• Selection $\sigma$
• Projection $\pi$
• Join $\bowtie$
• Renaming $\delta$
• Difference $-$
• Union $\cup$
• Intersection $\cap$

Selection

“Find all bus lines”

$\sigma_{\text{Type}=\text{"bus"}}\,\text{Lines}$

“Find all connections that begin and end in the same stop”

$\sigma_{\text{From}=\text{To}}\,\text{Connect}$

Definition 1.6: The selection operator has the form $\sigma_{n=m}$
• $n$ is an attribute name
• $m$ is an attribute name or a constant value
Consider a table $R^{\mathcal{I}}$ for $R[U]$.
• For $m$ constant value: $\sigma_{n=m}(R^{\mathcal{I}}) = \{ f \in R^{\mathcal{I}} \mid f(n) = m \}$
• For $m$ attribute name: $\sigma_{n=m}(R^{\mathcal{I}}) = \{ f \in R^{\mathcal{I}} \mid f(n) = f(m) \}$
This is only defined if $U$ contains the required attribute names.
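The two cases of Definition 1.6 can be sketched in Python on the example tables. The function names `select_const` and `select_attr` and the helper `row` are assumptions of this sketch. Rows are frozensets of (attribute, value) pairs, and a missing attribute raises a `KeyError`, which mirrors the remark that selection is only defined if U contains the required attribute names.

```python
def row(**values):
    """A named row: an immutable function from attribute names to values."""
    return frozenset(values.items())


def select_const(table, n, m):
    """sigma_{n=m} for a constant m: keep the rows f with f(n) = m."""
    return {f for f in table if dict(f)[n] == m}


def select_attr(table, n, m):
    """sigma_{n=m} for an attribute name m: keep the rows f with f(n) = f(m)."""
    return {f for f in table if dict(f)[n] == dict(f)[m]}


# The rows shown in the example tables (line names as strings).
lines = {
    row(Line="85", Type="bus"),
    row(Line="3", Type="tram"),
    row(Line="F1", Type="ferry"),
}
connect = {
    row(From=57, To=42, Line="85"),
    row(From=17, To=789, Line="3"),
}

bus_lines = select_const(lines, "Type", "bus")   # "Find all bus lines"
loops = select_attr(connect, "From", "To")       # same start and end stop
```

On the two Connect rows shown, no connection begins and ends in the same stop, so `loops` is the empty table.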

Projection

“Find all possible types of lines”

$\pi_{\text{Type}}\,\text{Lines}$

“Find all pairs of adjacent stops on line 85”

$\pi_{\text{From},\text{To}}(\sigma_{\text{Line}=\text{"85"}}\,\text{Connect})$

Definition 1.7: The projection operator has the form $\pi_{a_1,\ldots,a_n}$ where each $a_i$ is an attribute name.
Consider a table $R^{\mathcal{I}}$ for $R[U]$.

$\pi_{a_1,\ldots,a_n}(R^{\mathcal{I}}) = \{ f_{\{a_1,\ldots,a_n\}} \mid f \in R^{\mathcal{I}} \}$

where $f_{\{a_1,\ldots,a_n\}}$ is the restriction of $f$ to the domain $\{a_1,\ldots,a_n\}$, i.e., the function $\{ a_1 \mapsto f(a_1), \ldots, a_n \mapsto f(a_n) \}$.
Of course this projection is only defined if $a_i \in U$ for each $a_i$.

Natural join

“Find all connections and their type of line”

Connect:
  From  To   Line
  57    42   85
  17    789  3
  ...   ...  ...

Lines:
  Line  Type
  85    bus
  3     tram
  F1    ferry
  ...   ...

Connect $\bowtie$ Lines:
  From  To   Line  Type
  57    42   85    bus
  17    789  3     tram
  ...   ...  ...   ...

Definition 1.8: The natural join operator has the form $\bowtie$.
Consider tables $R^{\mathcal{I}}$ for $R[U]$ and $S^{\mathcal{I}}$ for $S[V]$.

$R^{\mathcal{I}} \bowtie S^{\mathcal{I}} = \{ f : U \cup V \to \mathbf{dom} \mid f_U \in R^{\mathcal{I}} \text{ and } f_V \in S^{\mathcal{I}} \}$

where $f_U$ ($f_V$) is the restriction of $f$ to elements in $U$ ($V$) as before
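Definitions 1.7 and 1.8 can be sketched in Python on the example tables. The function names `project` and `natural_join` and the helper `row` are assumptions of this sketch, not part of the lecture. The join merges every pair of rows that agree on all shared attributes, which yields exactly the functions on U ∪ V whose restrictions lie in the two tables.

```python
def row(**values):
    """A named row: an immutable function from attribute names to values."""
    return frozenset(values.items())


def project(table, attributes):
    """pi_{a1,...,an}: restrict every row f to the given attributes."""
    return {
        frozenset((a, dict(f)[a]) for a in attributes)
        for f in table
    }


def natural_join(r, s):
    """R join S: merge each pair of rows that agree on all shared attributes."""
    result = set()
    for f in r:
        for g in s:
            fd, gd = dict(f), dict(g)
            if all(fd[a] == gd[a] for a in fd.keys() & gd.keys()):
                result.add(frozenset({**fd, **gd}.items()))
    return result


# The rows shown in the example tables (line names as strings).
connect = {
    row(From=57, To=42, Line="85"),
    row(From=17, To=789, Line="3"),
}
lines = {
    row(Line="85", Type="bus"),
    row(Line="3", Type="tram"),
    row(Line="F1", Type="ferry"),
}

joined = natural_join(connect, lines)   # connections with their type of line
types = project(lines, ["Type"])        # all possible types of lines
```

The nested loop is the simplest possible evaluation strategy; it is meant to mirror the definition, not to be efficient.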


Renaming

“Find all lines that depart from an accessible stop”

Stops:
  SID  Stop                 Accessible
  57   Stadtgutstr.         true
  123  Gustav-Freytag-Str.  false
  ...  ...                  ...

Connect:
  From  To   Line
  57    42   85
  17    789  3
  ...   ...  ...

We need to join Stops.SID with Connect.From
⇝ use renaming

$\pi_{\text{Line}}\,\sigma_{\text{Accessible}=\text{"true"}}(\text{Stops} \bowtie \delta_{\text{From},\text{To},\text{Line} \to \text{SID},\text{To},\text{Line}}(\text{Connect}))$

Definition 1.9: The renaming operator has the form $\delta_{a_1,\ldots,a_n \to b_1,\ldots,b_n}$ with all $a_i$ mutually distinct attribute names, and likewise for all $b_i$.
Consider a table $R^{\mathcal{I}}$ for $R[\{a_1, \ldots, a_n\}]$.

$\delta_{a_1,\ldots,a_n \to b_1,\ldots,b_n}(R^{\mathcal{I}}) = \{ f \circ g \mid f \in R^{\mathcal{I}} \text{ and } g : \{ b_i \mapsto a_i \}_{1 \leq i \leq n} \}$

where $f \circ g$ is function composition: $(f \circ g)(x) = f(g(x))$

Difference, Union, Intersection

Binary operators on tables of the same relational schema, defined like the usual set operations.

“Find all stops where line 3 departs, but line 8 does not depart.”

“Find all stops where either line 3 or line 8 departs.”

“Find all stops where both line 3 and line 8 depart.”

Table constants in queries

It is sometimes convenient to define constant tables in queries.

“Find all stops near Helmholtzstr. (SID 42), including Helmholtzstr.”

$\delta_{\text{To} \to \text{StopId}}(\pi_{\text{To}}(\sigma_{\text{From}=\text{"42"}}\,\text{Connect})) \cup \{\{ \text{StopId} \mapsto 42 \}\}$

One can generalise this to constant tables with more than one column or more than one table (no additional expressive power, see exercise).

Reachability

Generalising the previous example:

“Stops that are Helmholtzstr.”

$R_0 = \{\{ \text{From} \mapsto 42 \}\}$

“Stops that are next to Helmholtzstr.”

$R_1 = \delta_{\text{To} \to \text{From}}(\pi_{\text{To}}(\text{Connect} \bowtie R_0))$

“Stops at distance 2 from Helmholtzstr.”

$R_2 = \delta_{\text{To} \to \text{From}}(\pi_{\text{To}}(\text{Connect} \bowtie R_1))$

Stops reachable from Helmholtzstr. with a short-distance ticket:

$R_0 \cup R_1 \cup R_2 \cup R_3 \cup R_4$

What about all stops reachable from Helmholtzstr.?
⇝ see upcoming lectures . . .
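The renaming operator of Definition 1.9 and the construction of R0 to R4 can be sketched in Python. All function names are assumptions of this sketch. The Connect rows below are hypothetical: they are invented for illustration so that stop 42 has outgoing connections, and they are not the rows from the example tables.

```python
def row(**values):
    """A named row: an immutable function from attribute names to values."""
    return frozenset(values.items())


def project(table, attributes):
    """pi: restrict every row to the given attributes."""
    return {frozenset((a, dict(f)[a]) for a in attributes) for f in table}


def rename(table, mapping):
    """delta_{a1..an -> b1..bn}: `mapping` sends each old name ai to bi."""
    return {frozenset((mapping[a], v) for a, v in f) for f in table}


def natural_join(r, s):
    """Merge each pair of rows that agree on all shared attributes."""
    result = set()
    for f in r:
        for g in s:
            fd, gd = dict(f), dict(g)
            if all(fd[a] == gd[a] for a in fd.keys() & gd.keys()):
                result.add(frozenset({**fd, **gd}.items()))
    return result


def step(connect, r):
    """One step further: delta_{To -> From}(pi_To(Connect join r))."""
    return rename(project(natural_join(connect, r), ["To"]), {"To": "From"})


# Hypothetical connections (assumed data): 42 -> 57 -> 123 -> 17.
connect = {
    row(From=42, To=57, Line="85"),
    row(From=57, To=123, Line="85"),
    row(From=123, To=17, Line="3"),
}

r0 = {row(From=42)}                    # constant table {{From -> 42}}
levels = [r0]
for _ in range(4):                     # compute R1, R2, R3, R4
    levels.append(step(connect, levels[-1]))

short_distance = set().union(*levels)  # R0 u R1 u R2 u R3 u R4
```

The loop bound is fixed in advance, which reflects the limitation raised above: each fixed relational algebra expression only looks a bounded number of steps ahead.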

Summary and Outlook

The relational model is very versatile

Relational algebra allows us to define queries with operators

Many operators exist, not all are really needed (see exercise)

Open questions:
• What does this have to do with logic? (next lecture)
• How hard is it to actually answer such queries? (complexity)
• How can we study the expressiveness of query languages?
