
Database Theory
Lecture 1: Introduction / Relational data model

Markus Krötzsch
TU Dresden, 4 April 2016

Course information

• 14 lectures: Thursday, DS4 (13:00–14:30)
• 13 exercise classes: Monday, DS3 (11:10–12:40)
  → taught by Francesco Kriegel
• Oral examination (details based on applicable examination regulations)
• Course homepage (dates, slides, exercise sheets):
  https://ddll.inf.tu-dresden.de/web/Database_Theory_%28SS2016%29/en

Markus Krötzsch, 4 April 2016 Database Theory slide 2 of 27

Aims of the course

Obtain an understanding of key topics in database theory with a special focus on query formalisms:
• Relational data model
• Basic and advanced query languages
• Expressive power of query languages
• Complexity of query answering + some algorithmic approaches
• Modelling with constraints

Connect databases with other advanced topics in logic/KR/formal methods

Literature, prerequisites, related courses

• Serge Abiteboul, Richard Hull, Victor Vianu: Foundations of Databases. Addison-Wesley. 1994.
  – Available at http://webdam.inria.fr/Alice/
  – Slight deviations in the lecture
  – Further literature will be given for advanced topics
• Prerequisites: basics of first-order logic, Turing machines, worst-case complexity
• Related courses at TUD:
  – Advanced Logic
  – Foundations of Semantic Web Technologies
  – Introduction to Logic Programming
  – Introduction to Constraint Programming
  – Datenbanken (Grundlagen)
  – Intelligent Information Systems

What is a database?

A Database Management System (DBMS) is software for managing collections of data.
→ highly important class of software systems
→ major role in industry and in research
→ extremely wide variety of concepts and implementations

General three-level architecture of DBMS:
• External Level: Application-specific user views
• Logical Level: Abstract data model, independent of implementation, conceptual view
• Physical Level: Data structures and algorithms, platform-specific

In this lecture: focus on logical view for relational data model

What is a database? (2)

Basic functionality of DBMS:
• Schema definition: specify how data should be logically organised
• Update: insert/delete/update stored data
• Query: retrieve stored data or information derived from it
• Administration: user rights management, configuration, recovery, data export, etc.

Many related concerns:
• Persistence: data retained when DBMS is shut down
• Optimisation: ensure maximal efficiency
• Scalability: cope with increasing loads by adding resources
• Concurrency: support many update and query operations in parallel
• Distribution: combine data from several locations
• Interfaces: APIs, query languages, update languages, etc.
• ...

In this lecture: schema, query languages, some optimisation


Overview

1. Introduction / Relational data model
2. First-order queries
3. Complexity of query answering
4. Complexity of FO query answering
5. Conjunctive queries
6. Tree-like conjunctive queries
7. Query optimisation
8. Optimisation / First-Order Expressiveness
9. First-Order Expressiveness / Introduction to Datalog
10. Expressive Power and Complexity of Datalog
11. Optimisation and Evaluation of Datalog
12. Evaluation of Datalog (2)
13. Graph Databases and Path Queries
14. Outlook: database theory in practice

See course homepage for more information and materials

The Relational Data Model

Database = collection of tables

Lines:
Line  Type
85    bus
3     tram
F1    ferry
...   ...

Stops:
SID  Stop                 Accessible
17   Hauptbahnhof         true
42   Helmholtzstr.        true
57   Stadtgutstr.         true
123  Gustav-Freytag-Str.  false
...  ...                  ...

Connect:
From  To   Line
57    42   85
17    789  3
...   ...  ...

Every table has a schema:
• Lines[Line:string, Type:string]
• Stops[SID:int, Stop:string, Accessible:bool]
• Connect[From:int, To:int, Line:string]

Towards a formal definition of “table”

A table row has one value for each column
→ row = function from the attributes of the table schema to specific values

Example: The row

SID  Stop           Accessible
42   Helmholtzstr.  true

can be represented by the function:

$$f : \{ \text{SID} \mapsto 42,\ \text{Stop} \mapsto \text{"Helmholtzstr."},\ \text{Accessible} \mapsto \text{true} \}$$
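The view of a row as a function can be tried out directly. The following is a minimal Python sketch, not part of the lecture: the helper names `make_row` and `value_of` are assumptions made for illustration, and rows are stored as frozen sets of attribute-value pairs so that a table can be an ordinary set of rows.

```python
# Sketch only: a row is a function from attribute names to values,
# modelled here as an immutable set of (attribute, value) pairs.
def make_row(**values):
    """Build a hashable row from keyword arguments (hypothetical helper)."""
    return frozenset(values.items())


def value_of(row, attribute):
    """Apply the row, viewed as a function, to one attribute name."""
    return dict(row)[attribute]


# The Stops table from the example, as a set of rows.
stops = {
    make_row(SID=17, Stop="Hauptbahnhof", Accessible=True),
    make_row(SID=42, Stop="Helmholtzstr.", Accessible=True),
    make_row(SID=57, Stop="Stadtgutstr.", Accessible=True),
    make_row(SID=123, Stop="Gustav-Freytag-Str.", Accessible=False),
}

# The function f from the example.
f = make_row(SID=42, Stop="Helmholtzstr.", Accessible=True)
```

Because the table is a set, adding `f` a second time changes nothing, which matches the convention of disregarding the order and multiplicity of rows.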


Database = set of tables

Let $\mathbf{dom}$ (“domain”) be the (infinite) set of conceivable values in tables.

For simplicity, we drop the datatypes of database columns and assume that each column uses the same datatype that supports all values in $\mathbf{dom}$.

Definition
• A relation schema $R[U]$ consists of a relation name $R$ and a finite set $U$ of attributes ($|U|$ is the arity of $R[U]$)
• A table for $R[U]$ is a finite set of functions from $U$ to $\mathbf{dom}$
• A database instance $\mathcal{I}$ is a finite set of tables

Note: we disregard the order and multiplicity of rows.

Tables are also called relation instances. The table with relation schema $R[U]$ in the database instance $\mathcal{I}$ is written $R^{\mathcal{I}}$.

Database = set of relations

Observation: Attribute names don’t matter. Instead of the function

$$\{ \text{SID} \mapsto 42,\ \text{Stop} \mapsto \text{"Helmholtzstr."},\ \text{Accessible} \mapsto \text{true} \}$$

we could also use a tuple:

$$\langle 42,\ \text{"Helmholtzstr."},\ \text{true} \rangle$$

Necessary assumption: Attributes have a fixed order.

Definition
• A relation schema $R[U]$ is defined as before
• A table for $R[U]$ is a finite subset of $\mathbf{dom}^{|U|}$
• A database instance $\mathcal{I}$ is a finite set of tables

Recall that a subset of $\mathbf{dom}^{|U|}$ is just a $|U|$-ary relation. Sets of relations are also called relational structures.

Database = interpretation of first-order logic

Recall:
• First-order logic is based on predicate symbols with a fixed arity (we won’t need function symbols here)
• An interpretation $\mathcal{I}$ of first-order logic is a pair $\langle \Delta^{\mathcal{I}}, \cdot^{\mathcal{I}} \rangle$:
  – $\Delta^{\mathcal{I}}$ is a set (the domain of interpretation)
  – $\cdot^{\mathcal{I}}$ maps $n$-ary predicates $p$ to $n$-ary relations $p^{\mathcal{I}} \subseteq (\Delta^{\mathcal{I}})^n$

This is (almost) a database instance!
• domain of interpretation $\Delta^{\mathcal{I}}$ = database domain $\mathbf{dom}$
• predicate symbol = relation name
• interpretation of predicate symbol (if finite!) = table
• finite first-order logic interpretation = database instance

Database = set of facts

Another convenient way to write databases:

Lines(85, "bus")
Lines(F1, "ferry")
Stops(42, "Helmholtzstr.", true)
...

Definition
A fact is an expression $p(t_1, \ldots, t_n)$ where
• $p$ is an $n$-ary predicate symbol
• $t_1, \ldots, t_n$ are constant symbols
A database instance is a finite set of facts.

When interpreting these facts logically, their least model is again the database instance (viewed as a first-order logic interpretation).
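The named perspective, the unnamed perspective and the set-of-facts notation carry the same information once an attribute order is fixed. The following Python sketch converts one row of Stops between the three. The dictionary `SCHEMA` and the helpers `to_tuple`, `to_row` and `to_facts` are assumptions made for illustration, and a fact is encoded as a pair of relation name and tuple.

```python
# Sketch only: the fixed attribute order is an assumption stored in SCHEMA.
SCHEMA = {"Stops": ("SID", "Stop", "Accessible")}


def to_tuple(relation, row):
    """Named perspective (dict) -> unnamed perspective (tuple)."""
    return tuple(row[a] for a in SCHEMA[relation])


def to_row(relation, values):
    """Unnamed perspective (tuple) -> named perspective (dict)."""
    return dict(zip(SCHEMA[relation], values))


def to_facts(database):
    """Database {relation name: set of tuples} -> set of facts (name, tuple)."""
    return {(name, t) for name, table in database.items() for t in table}


row = {"SID": 42, "Stop": "Helmholtzstr.", "Accessible": True}
t = to_tuple("Stops", row)          # the tuple <42, "Helmholtzstr.", true>
facts = to_facts({"Stops": {t}})    # the fact Stops(42, "Helmholtzstr.", true)
```

Converting the tuple back with `to_row` recovers the original row, which illustrates why a fixed attribute order is the only extra assumption needed.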


Visualising relations

Binary relations (sets of pairs) can be viewed as directed graphs.

Example:

Source  Target
1       2
1       3
2       5
3       2
3       4
4       3
5       3

Many binary tables in one graph? Use table name to label edges!

Database = hypergraph

What to do with tables of arity $\neq 2$?
→ generalise graphs to hypergraphs

Definition
A hypergraph is a triple $\langle V, E, \rho \rangle$, where
• $V$ is a set of vertices
• $E$ is a set of edge names
• $\rho$ maps each edge name $e \in E$ to an $n$-ary relation $\rho(e) \subseteq V^n$

In other words: finite hypergraphs are databases.
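The graph reading of a binary relation can be made concrete with a small Python sketch. It uses the Source/Target pairs of the example; the function name `adjacency` and the edge name "Edge" are assumptions made for illustration.

```python
from collections import defaultdict

# The binary relation from the example, as a set of (Source, Target) pairs.
edges = {(1, 2), (1, 3), (2, 5), (3, 2), (3, 4), (4, 3), (5, 3)}


def adjacency(relation):
    """Group the pairs of a binary relation by their first component."""
    graph = defaultdict(set)
    for source, target in relation:
        graph[source].add(target)
    return dict(graph)


graph = adjacency(edges)

# A hypergraph <V, E, rho>: rho maps each edge name to an n-ary relation over V.
vertices = {1, 2, 3, 4, 5}
rho = {"Edge": edges}  # "Edge" is a hypothetical table name used as edge label
```

The dictionary `rho` has the same shape as a database instance in the unnamed perspective: relation names mapped to finite sets of tuples.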

Summary: the relational data model

Relational databases are everywhere:
• sets of tables with named attributes (“named perspective”)
• sets of relations (“unnamed perspective”)
• first-order logic interpretations
• sets of logical facts (ground atoms)
• hypergraphs (and graphs as a special case)
. . . all restricted to finite sets

Important elements of the theory of relational databases are very widely applicable, also to many data models that are not the classical relational one (e.g., graph databases, RDF databases, XML databases).


Relational Algebra

Relational algebra queries are based on a set of operations on databases.

Each operation refers to one or more tables and produces another table
(we often simplify notation and write a table name rather than a table instance)

Main operations of the named perspective:
• Selection $\sigma$
• Projection $\pi$
• Join $\bowtie$
• Renaming $\delta$
• Difference $-$
• Union $\cup$
• Intersection $\cap$

Selection

“Find all bus lines”

$$\sigma_{\text{Type}=\text{"bus"}}\,\text{Lines}$$

“Find all connections that begin and end in the same stop”

$$\sigma_{\text{From}=\text{To}}\,\text{Connect}$$

Definition
The selection operator has the form $\sigma_{n=m}$
• $n$ is an attribute name
• $m$ is an attribute name or a constant value
Consider a table $R^{\mathcal{I}}$ for $R[U]$.
• For $m$ constant value: $\sigma_{n=m}(R^{\mathcal{I}}) = \{ f \in R^{\mathcal{I}} \mid f(n) = m \}$
• For $m$ attribute name: $\sigma_{n=m}(R^{\mathcal{I}}) = \{ f \in R^{\mathcal{I}} \mid f(n) = f(m) \}$
This is only defined if $U$ contains the required attribute names.
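The two cases of the selection operator can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's own code: the helper `row`, the function `select` and the flag `m_is_attribute` (which tells the two cases apart) are hypothetical names, and only the rows shown in the example tables are used.

```python
def row(**values):
    """Hypothetical helper: a row as a hashable set of (attribute, value) pairs."""
    return frozenset(values.items())


def select(table, n, m, m_is_attribute=False):
    """sigma_{n=m}: keep rows f with f(n) = m (constant) or f(n) = f(m) (attribute)."""
    result = set()
    for f in table:
        values = dict(f)
        # The operator is only defined if the schema contains the attributes used.
        if n not in values or (m_is_attribute and m not in values):
            raise KeyError("selection is undefined: attribute not in schema")
        target = values[m] if m_is_attribute else m
        if values[n] == target:
            result.add(f)
    return result


lines = {row(Line="85", Type="bus"), row(Line="3", Type="tram"),
         row(Line="F1", Type="ferry")}
connect = {row(From=57, To=42, Line="85"), row(From=17, To=789, Line="3")}

bus_lines = select(lines, "Type", "bus")                         # "Find all bus lines"
same_stop = select(connect, "From", "To", m_is_attribute=True)   # From = To
```

The explicit flag is a design choice of this sketch; the algebra itself distinguishes attribute names from constants syntactically.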

Projection

“Find all possible types of lines”

$$\pi_{\text{Type}}\,\text{Lines}$$

“Find all pairs of adjacent stops on line 85”

$$\pi_{\text{From},\text{To}}(\sigma_{\text{Line}=\text{"85"}}\,\text{Connect})$$

Definition
The projection operator has the form $\pi_{a_1,\ldots,a_n}$ where each $a_i$ is an attribute name. Consider a table $R^{\mathcal{I}}$ for $R[U]$.

$$\pi_{a_1,\ldots,a_n}(R^{\mathcal{I}}) = \{ f_{\{a_1,\ldots,a_n\}} \mid f \in R^{\mathcal{I}} \}$$

where $f_{\{a_1,\ldots,a_n\}}$ is the restriction of $f$ to the domain $\{a_1, \ldots, a_n\}$, i.e., the function $\{ a_1 \mapsto f(a_1), \ldots, a_n \mapsto f(a_n) \}$.
Of course this projection is only defined if $a_i \in U$ for each $a_i$.

Natural join

“Find all connections and their type of line”

$$\text{Connect} \bowtie \text{Lines}$$

Connect:
From  To   Line
57    42   85
17    789  3
...   ...  ...

Lines:
Line  Type
85    bus
3     tram
F1    ferry
...   ...

Connect $\bowtie$ Lines:
From  To   Line  Type
57    42   85    bus
17    789  3     tram
...   ...  ...   ...

Definition
The natural join operator has the form $\bowtie$.
Consider tables $R^{\mathcal{I}}$ for $R[U]$ and $S^{\mathcal{I}}$ for $S[V]$.

$$R^{\mathcal{I}} \bowtie S^{\mathcal{I}} = \{ f : U \cup V \to \mathbf{dom} \mid f_U \in R^{\mathcal{I}} \text{ and } f_V \in S^{\mathcal{I}} \}$$

where $f_U$ ($f_V$) is the restriction of $f$ to elements in $U$ ($V$) as before
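Projection and natural join can be sketched in Python in the same style. The helper `row` and the functions `project` and `natural_join` are hypothetical names chosen for this illustration. The join uses a simple nested loop, which follows the definition directly but is not meant as an efficient evaluation strategy.

```python
def row(**values):
    """Hypothetical helper: a row as a hashable set of (attribute, value) pairs."""
    return frozenset(values.items())


def project(table, attributes):
    """pi_{a1,...,an}: restrict every row to the listed attributes."""
    result = set()
    for f in table:
        values = dict(f)
        # values[a] raises KeyError if a is not in the schema (projection undefined).
        result.add(frozenset((a, values[a]) for a in attributes))
    return result


def natural_join(r, s):
    """Combine each pair of rows that agree on all shared attributes."""
    result = set()
    for f in r:
        for g in s:
            fv, gv = dict(f), dict(g)
            if all(fv[a] == gv[a] for a in fv.keys() & gv.keys()):
                result.add(frozenset({**fv, **gv}.items()))
    return result


lines = {row(Line="85", Type="bus"), row(Line="3", Type="tram"),
         row(Line="F1", Type="ferry")}
connect = {row(From=57, To=42, Line="85"), row(From=17, To=789, Line="3")}

types = project(lines, ["Type"])         # "Find all possible types of lines"
joined = natural_join(connect, lines)    # "Find all connections and their type of line"
```

On the rows shown in the example, the ferry line F1 has no matching connection and therefore does not appear in the join result.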

Renaming

“Find all lines that depart from an accessible stop”

Stops:
SID  Stop                 Accessible
57   Stadtgutstr.         true
123  Gustav-Freytag-Str.  false
...  ...                  ...

Connect:
From  To   Line
57    42   85
17    789  3
...   ...  ...

We need to join Stops.SID with Connect.From → use renaming

$$\pi_{\text{Line}}\bigl(\sigma_{\text{Accessible}=\text{"true"}}(\text{Stops} \bowtie \delta_{\text{From},\text{To},\text{Line} \to \text{SID},\text{To},\text{Line}}(\text{Connect}))\bigr)$$

Definition
The renaming operator has the form $\delta_{a_1,\ldots,a_n \to b_1,\ldots,b_n}$ with all $a_i$ mutually distinct attribute names, and likewise for all $b_i$. Consider a table $R^{\mathcal{I}}$ for $R[\{a_1, \ldots, a_n\}]$.

$$\delta_{a_1,\ldots,a_n \to b_1,\ldots,b_n}(R^{\mathcal{I}}) = \{ f \circ g \mid f \in R^{\mathcal{I}} \text{ and } g : \{ b_i \mapsto a_i \}_{1 \le i \le n} \}$$

where $f \circ g$ is function composition: $(f \circ g)(x) = f(g(x))$

Difference, Union, Intersection

Binary operators on tables of the same relational schema, defined like the usual set operations.

“Find all stops where line 3 departs, but line 8 does not depart.”

“Find all stops where either line 3 or line 8 departs.”

“Find all stops where both line 3 and line 8 depart.”

Table constants in queries

It is sometimes convenient to define constant tables in queries.

“Find all stops near Helmholtzstr. (SID 42), including Helmholtzstr.”

$$\delta_{\text{To} \to \text{StopId}}(\pi_{\text{To}}(\sigma_{\text{From}=\text{"42"}}\,\text{Connect})) \cup \{\{ \text{StopId} \mapsto 42 \}\}$$

One can generalise this to constant tables with more than one column or more than one table (no additional expressive power, see exercise).

Reachability

Generalising the previous example:

“Stops that are Helmholtzstr.”

$$R_0 = \{\{ \text{From} \mapsto 42 \}\}$$

“Stops that are next to Helmholtzstr.”

$$R_1 = \delta_{\text{To} \to \text{From}}(\pi_{\text{To}}(\text{Connect} \bowtie R_0))$$

“Stops at distance 2 from Helmholtzstr.”

$$R_2 = \delta_{\text{To} \to \text{From}}(\pi_{\text{To}}(\text{Connect} \bowtie R_1))$$

Stops reachable from Helmholtzstr. with a short-distance ticket:

$$R_0 \cup R_1 \cup R_2 \cup R_3 \cup R_4$$

What about all stops reachable from Helmholtzstr.?
→ see upcoming lectures . . .
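The reachability construction combines join, projection, renaming and union, and can be sketched in Python as below. Everything here is an illustration under assumptions: the helper names are hypothetical, `rename` takes a partial mapping from old to new attribute names (unlisted attributes stay unchanged, a convenience not found in the formal definition), and the Connect rows are invented so that several steps starting from stop 42 are possible.

```python
def row(**values):
    """Hypothetical helper: a row as a hashable set of (attribute, value) pairs."""
    return frozenset(values.items())


def project(table, attributes):
    """Restrict every row to the listed attributes."""
    return {frozenset((a, dict(f)[a]) for a in attributes) for f in table}


def natural_join(r, s):
    """Combine each pair of rows that agree on all shared attributes."""
    return {
        frozenset({**dict(f), **dict(g)}.items())
        for f in r for g in s
        if all(dict(f)[a] == dict(g)[a] for a in dict(f).keys() & dict(g).keys())
    }


def rename(table, mapping):
    """Rename attributes according to mapping {old name: new name}."""
    return {frozenset((mapping.get(a, a), v) for a, v in f) for f in table}


# Hypothetical Connect rows (not the ones from the slides).
connect = {
    row(From=42, To=57, Line="85"),
    row(From=57, To=17, Line="85"),
    row(From=17, To=789, Line="3"),
}


def step(r):
    """One step: rename To to From in the projection on To of (Connect join r)."""
    return rename(project(natural_join(connect, r), ["To"]), {"To": "From"})


levels = [{row(From=42)}]            # R_0: the constant table for stop 42
for _ in range(4):                   # R_1 .. R_4
    levels.append(step(levels[-1]))

short_distance = set().union(*levels)   # union of R_0 .. R_4
```

The loop has to be unrolled a fixed number of times; a query for all reachable stops, with no bound on the distance, is not covered by this pattern.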

Summary and Outlook

The relational model is very versatile

Relational algebra allows us to define queries with operators

Many operators exist, not all are really needed (see exercise)

Open questions:
• What does this have to do with logic? (next lecture)
• How hard is it to actually answer such queries? (complexity)
• How can we study the expressiveness of query languages?
