Overview

• Homework
• Normalization
– Functional dependencies
– 2NF
– 3NF
– BCNF
– 4NF, 5NF, 6NF

Relational Database Systems 1
Wolf-Tilo Balke, Joachim Selke
Institut für Informationssysteme, Technische Universität Braunschweig
www.ifis.cs.tu-bs.de

Relational Database Systems 1 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

Exercise 8.1

• Again, our hotel database:
Hotel(hotelNo, hotelName, city)
Room(roomNo, hotelNo → Hotel, type, price)
Booking(hotelNo → Hotel/Room, guestNo → Guest, dateFrom, dateTo, roomNo → Room)
Guest(guestNo, guestName, guestAddress)
• Provide all SQL statements (in the right order!) that are necessary to create the table structure given above (including all primary keys and referential integrity), and additionally ensure the following:
– All tables should be contained in a new schema called hotelinfo
– Allowed room types are single, double, and family
– The price of each room must be between 10 and 100 Euros
– The same guest cannot have overlapping bookings at the same hotel

Solution:
• CREATE SCHEMA hotelinfo
• SET SCHEMA hotelinfo
• CREATE TABLE hotel (
    hotelNo INTEGER NOT NULL PRIMARY KEY,
    hotelName VARCHAR(200) NOT NULL,
    city VARCHAR(100) NOT NULL)

• CREATE TABLE room (
    roomNo INTEGER NOT NULL,
    hotelNo INTEGER NOT NULL REFERENCES hotel,
    type VARCHAR(10) NOT NULL
      CHECK (type IN ('single', 'double', 'family')),
    price NUMERIC(5, 2) NOT NULL
      CHECK (price BETWEEN 10 AND 100),
    PRIMARY KEY (roomNo, hotelNo))
• CREATE TABLE guest (
    guestNo INTEGER NOT NULL PRIMARY KEY,
    guestName VARCHAR(200) NOT NULL,
    guestAddress VARCHAR(1000))

• CREATE TABLE booking (
    hotelNo INTEGER NOT NULL REFERENCES hotel,
    guestNo INTEGER NOT NULL REFERENCES guest,
    dateFrom DATE NOT NULL,
    dateTo DATE NOT NULL,
    roomNo INTEGER NOT NULL,
    PRIMARY KEY (hotelNo, guestNo, dateFrom),
    FOREIGN KEY (roomNo, hotelNo) REFERENCES room (roomNo, hotelNo),
    CHECK (dateFrom <= dateTo),
    CHECK (NOT EXISTS (
      SELECT * FROM booking b1, booking b2
      WHERE b1.hotelNo = b2.hotelNo AND b1.guestNo = b2.guestNo
        AND b1.dateFrom < b2.dateFrom AND b1.dateTo > b2.dateFrom)))

Exercise 8.2

• The relational schema of a product database:
– product(model, maker, price)
– pc(model → product, speed, ram, hd)
– laptop(model → product, speed, ram, hd, screen)
– printer(model → product, color, type)

• Give SQL statements to modify the database as follows:

a) Delete all PCs with less than 200 gigabytes of hard disk.
    CREATE TABLE temp (model INTEGER)
    INSERT INTO temp (SELECT model FROM pc WHERE hd < 200)
    DELETE FROM pc WHERE model IN (SELECT * FROM temp)
    DELETE FROM product WHERE model IN (SELECT * FROM temp)
    DROP TABLE temp
(As an alternative solution, you can make the assumption that all foreign keys have been defined using ON DELETE CASCADE. Then, only the deletion from the pc table has to be performed.)

b) Store the fact that PC model 1500 is made by manufacturer A, has speed 3.1, RAM 1024, hard disk 300, and sells for 2499 Euros.
    INSERT INTO product VALUES (1500, 'A', 2499)
    INSERT INTO pc VALUES (1500, 3.1, 1024, 300)

c) Delete all laptops made by a manufacturer that does not make PCs.
    CREATE TABLE temp (model INTEGER)
    INSERT INTO temp
      (SELECT l.model
       FROM laptop l JOIN product p ON l.model = p.model
       WHERE p.maker NOT IN
         (SELECT pr.maker FROM pc JOIN product pr ON pc.model = pr.model))
    DELETE FROM laptop WHERE model IN (SELECT * FROM temp)
    DELETE FROM product WHERE model IN (SELECT * FROM temp)
    DROP TABLE temp
(Alternative solution: see above)

d) Manufacturer B buys manufacturer C. Change all products made by C so they are now made by B.
    UPDATE product SET maker = 'B' WHERE maker = 'C'

e) For each PC, double the amount of hard disk and add 1024 megabytes to the amount of RAM.
    UPDATE pc SET (hd, ram) = (2 * hd, ram + 1024)

f) For each laptop made by manufacturer D, add 1 inch to the screen size and subtract 200 Euros from the price.
    CREATE TABLE temp (model INTEGER)
    INSERT INTO temp
      (SELECT l.model
       FROM laptop l JOIN product p ON l.model = p.model
       WHERE maker = 'D')
    UPDATE laptop SET screen = screen + 1
      WHERE model IN (SELECT * FROM temp)
    UPDATE product SET price = price - 200
      WHERE model IN (SELECT * FROM temp)
    DROP TABLE temp

g) Insert the facts that for every laptop there is a PC with the same manufacturer, speed, RAM, hard disk, a model number that is 1100 less, and a price that is 500 Euros less.
    CREATE TABLE temp (model INTEGER)
    INSERT INTO temp
      (SELECT model FROM laptop l WHERE NOT EXISTS
        (SELECT * FROM product p WHERE p.model = l.model - 1100))
    INSERT INTO product
      (SELECT t.model - 1100, maker, price - 500
       FROM temp t JOIN product p ON t.model = p.model)
    INSERT INTO pc
      (SELECT t.model - 1100, speed, ram, hd
       FROM temp t JOIN laptop l ON t.model = l.model)
    DROP TABLE temp

Exercise 8.3

• A relation containing exam scores:
– marks(studentid, score)
• We wish to assign grades to students based on scores as follows:
– Grade F if score < 40,
– grade C if 40 ≤ score < 60,
– grade B if 60 ≤ score < 80, and
– grade A if 80 ≤ score.
• Write SQL queries to do the following:

a) Display the grade for each student, based on the marks relation.
    SELECT studentid,
      CASE
        WHEN score < 40 THEN 'F'
        WHEN score < 60 THEN 'C'
        WHEN score < 80 THEN 'B'
        ELSE 'A'
      END
    FROM marks
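
The grading query can be tried out without a full database server. The following is a minimal sketch that runs it against an in-memory SQLite database through Python's standard sqlite3 module. The sample scores, the `grade` column alias, and the ORDER BY clause are assumptions added for illustration; they are not part of the exercise.

```python
import sqlite3

# In-memory database with a few made-up scores (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (studentid INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany(
    "INSERT INTO marks VALUES (?, ?)",
    [(1, 35), (2, 40), (3, 59), (4, 60), (5, 80), (6, 95)],
)

# The WHEN branches are tested top to bottom, so each branch only needs
# the upper bound of its score range.
GRADE_QUERY = """
    SELECT studentid,
           CASE
               WHEN score < 40 THEN 'F'
               WHEN score < 60 THEN 'C'
               WHEN score < 80 THEN 'B'
               ELSE 'A'
           END AS grade
    FROM marks
    ORDER BY studentid
"""

grades = conn.execute(GRADE_QUERY).fetchall()
for studentid, grade in grades:
    print(studentid, grade)
```

The boundary scores 40, 60, and 80 are included on purpose: they show that each boundary value falls into the higher grade, as required by the grade definition.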

b) Find the number of students with each grade.
    WITH grades(studentid, grade) AS
      (SELECT studentid,
         CASE
           WHEN score < 40 THEN 'F'
           WHEN score < 60 THEN 'C'
           WHEN score < 80 THEN 'B'
           ELSE 'A'
         END AS grade
       FROM marks)
    SELECT grade, COUNT(*) AS count
    FROM grades
    GROUP BY grade

Exercise 8.4

• The SQL-92 standard provides an n-ary operation called COALESCE, which is defined as follows:
COALESCE(A1, A2, ..., An) returns the first nonnull Ai in the list A1, A2, ..., An and returns null if all of A1, A2, ..., An are null.
• Show how to express the COALESCE operation using the CASE operation.

– COALESCE(A1, A2, ..., An) is equivalent to the following CASE operation with searched WHEN-clause:
    CASE
      WHEN A1 IS NOT NULL THEN A1
      WHEN A2 IS NOT NULL THEN A2
      ...
      WHEN An IS NOT NULL THEN An
      ELSE NULL    (optional)
    END

9.1 Introduction

• Up to now, we have learned ...
– ... how the relational model works.
– ... how it is implemented in current RDBMS.
– ... how to create relational databases (SQL DDL).
– ... how to define constraints (SQL DDL).
– ... how to query relational databases.
– ... how to insert, delete, and update data (SQL DML).
• What’s missing?
– How to create a “good” database design?
– By the way: What is a “good” database design?

• Which table design is better?

Design A:
heroID  teamID  heroName          teamName        joinYear
1       1       Thor              The Avengers    1963
2       2       Mister Fantastic  Fantastic Four  1961
3       1       Iron Man          The Avengers    1963
4       1       Hulk              The Avengers    1963
5       1       Captain America   The Avengers    1964
6       2       Invisible Girl    Fantastic Four  1961

Design B:
heroID  heroName
1       Thor
2       Mister Fantastic
3       Iron Man
4       Hulk
5       Captain America
6       Invisible Girl

teamID  teamName
1       The Avengers
2       Fantastic Four

heroID  teamID  joinYear
1       1       1963
2       2       1961
3       1       1963
4       1       1963
5       1       1964
6       2       1961

• What’s wrong with design A?
– Redundancy: The fact that certain teams have certain names is represented several times.
– Inferior expressiveness: We cannot represent heroes that currently have no team.
– Modification anomalies: (see next slide)

• There are three kinds of modification anomalies (illustrated by design A):
– Insertion anomalies
  • How do you add heroes that currently have no team?
  • How do you (consistently!) add new tuples?
– Deletion anomalies
  • Deleting Mister Fantastic and Invisible Girl also deletes all information about the Fantastic Four
– Update anomalies
  • Renaming a team requires updating several tuples (due to redundancy)

• In general, “good” relational database designs have the following properties:
– Redundancy is minimized
  • That is: no information is represented several times!
  • Logically distinct information is placed in distinct relation schemes
– Modification anomalies are prevented “by design”
  • That is: by using keys and foreign keys, not by enforcing an excessive amount of (hard to check) constraints!
– In practice, “good” designs should also match the characteristics of the used RDBMS
  • Enable efficient query processing
• In essence, it’s all about splitting up tables ...
– Remember design B

9.2 Normalization

• These “rules of thumb” can be formalized by the concept of relational database normalization
• But before going into details, let’s recap some definitions from the relational model:
– Data is represented using relation schemas R(A1, ..., An), where A1, ..., An are attributes
– A relational database schema consists of
  • A set of relation schemas
  • A set of integrity constraints (e.g. “heroID is unique” and “heroID determines heroName”)
– A relational database instance (or extension) is
  • A set of tuples adhering to the respective schemas and respecting all integrity constraints

• For this lecture, let’s assume the following:
– R(A1, ..., An) is a relation schema
– 𝒞 is a set of constraints satisfied by all extensions of R
• Our ultimate goal is to enhance the database design by decomposing R into a set of smaller relation schemas, as we did in our example:
– (heroID, teamID, heroName, teamName, joinYear) is decomposed into (heroID, heroName), (teamID, teamName), and (heroID, teamID, joinYear)

• Definition (decomposition):
– Let α1, ..., αk ⊆ {A1, ..., An} be k subsets of R’s attributes
  • Note that these subsets may be overlapping
– Then, for any αi, a new relation Ri can be derived: Ri = παi(R)
– α1, ..., αk is called a decomposition of R
• “Good” decompositions are reversible:
– The decomposition α1, ..., αk is called lossless if and only if R = R1 ⋈ R2 ⋈ ⋯ ⋈ Rk, for any extension of R satisfying the constraints 𝒞

• Example:

heroes
heroID  teamID  heroName          teamName        joinYear
1       1       Thor              The Avengers    1963
2       2       Mister Fantastic  Fantastic Four  1961
3       1       Iron Man          The Avengers    1963
4       1       Hulk              The Avengers    1963
5       1       Captain America   The Avengers    1964
6       2       Invisible Girl    Fantastic Four  1961

𝒞 = {“{heroID, teamID} is unique”,
     “heroID determines heroName”,
     “teamID determines teamName”,
     “{heroID, teamID} determines joinYear”}

– Our example decomposition is lossless:
  α1 = {heroID, heroName}, α2 = {teamID, teamName}, α3 = {heroID, teamID, joinYear}

πα1(heroes)
heroID  heroName
1       Thor
2       Mister Fantastic
3       Iron Man
4       Hulk
5       Captain America
6       Invisible Girl

πα2(heroes)
teamID  teamName
1       The Avengers
2       Fantastic Four

πα3(heroes)
heroID  teamID  joinYear
1       1       1963
2       2       1961
3       1       1963
4       1       1963
5       1       1964
6       2       1961

• Example (heroes relation and constraints 𝒞 as before):
– Is the following a lossless decomposition?
  α1 = {heroID, teamID, joinYear}, α2 = {teamID, heroName, teamName, joinYear}

πα1(heroes)
heroID  teamID  joinYear
1       1       1963
2       2       1961
3       1       1963
4       1       1963
5       1       1964
6       2       1961

πα2(heroes)
teamID  heroName          teamName        joinYear
1       Thor              The Avengers    1963
2       Mister Fantastic  Fantastic Four  1961
1       Iron Man          The Avengers    1963
1       Hulk              The Avengers    1963
1       Captain America   The Avengers    1964
2       Invisible Girl    Fantastic Four  1961

• Normalizing a relation schema R means replacing R by a lossless decomposition of itself
• However, this raises some new questions:
– Under which conditions is there a (nontrivial) lossless decomposition?
  • Decompositions involving αi = {A1, ..., An} or αi = ∅ are called trivial
– If there is a lossless decomposition, how to find it?
– How to measure a relation schema’s “design quality”?
  • We may abstain from further normalization if the quality is “good enough” ...
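
Whether a decomposition reproduces a given extension can be checked mechanically by projecting and re-joining. The following is a minimal sketch of that check on the heroes extension. The helper names (`as_relation`, `project`, `natural_join`, `reproduces`) are assumptions made for this illustration. Note that losslessness is defined over all legal extensions, so a successful check on one extension is only evidence, whereas a failed check is a counterexample.

```python
HEROES_ATTRS = ("heroID", "teamID", "heroName", "teamName", "joinYear")
HEROES_ROWS = [
    (1, 1, "Thor", "The Avengers", 1963),
    (2, 2, "Mister Fantastic", "Fantastic Four", 1961),
    (3, 1, "Iron Man", "The Avengers", 1963),
    (4, 1, "Hulk", "The Avengers", 1963),
    (5, 1, "Captain America", "The Avengers", 1964),
    (6, 2, "Invisible Girl", "Fantastic Four", 1961),
]


def as_relation(attrs, rows):
    """Model a relation as a set of tuples; a tuple is a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}


def project(relation, attrs):
    """Projection: keep only the given attributes (duplicates vanish in the set)."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in relation}


def natural_join(r, s):
    """Natural join: combine all pairs of tuples that agree on their common attributes."""
    result = set()
    for t in r:
        for u in s:
            merged = dict(t)
            if all(merged.get(a, v) == v for a, v in u):
                merged.update(u)
                result.add(frozenset(merged.items()))
    return result


def reproduces(relation, decomposition):
    """True if joining the projections yields exactly the original relation."""
    parts = [project(relation, set(alpha)) for alpha in decomposition]
    joined = parts[0]
    for part in parts[1:]:
        joined = natural_join(joined, part)
    return joined == relation


heroes = as_relation(HEROES_ATTRS, HEROES_ROWS)

good = [{"heroID", "heroName"}, {"teamID", "teamName"}, {"heroID", "teamID", "joinYear"}]
bad = [{"heroID", "teamID", "joinYear"}, {"teamID", "heroName", "teamName", "joinYear"}]

print(reproduces(heroes, good))  # three-way decomposition
print(reproduces(heroes, bad))   # two-way decomposition from the question above
```

For the second decomposition, the join on teamID and joinYear pairs every Avengers member who joined in 1963 with every hero name from that team and year, so the result contains tuples that were never in heroes.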

• The normalization of R depends entirely on the set of constraints 𝒞 imposed on R
• Instead of dealing with constraints of arbitrary complexity, we restrict 𝒞 to the class of functional dependencies (FDs)
– “heroName is completely determined by heroID” is an example for a functional dependency
– Most update anomalies and problems of redundancy occurring in practice can be traced back to violations of such constraints
• Typically, functional dependencies are all you need

9.3 Functional Dependencies

• Informally, functional dependencies can be described as follows:
– Let X and Y be some sets of attributes
– “If Y functionally depends on X, and two tuples agree on their X values, then they also have to agree on their Y values”
• Examples:
– “{end time} functionally depends on {start time, duration}”
– “{duration} functionally depends on {start time, end time}”
– “{end time} functionally depends on {end time}”

Formal definition:
• Let X and Y be subsets of R’s attributes
– That is, X, Y ⊆ {A1, ..., An}
• There is a functional dependency (FD) between X and Y (denoted as X → Y), if and only if, ...
– ... for any two tuples t1 and t2 within any instance of R, the following is true:

If t1[X] = t2[X], then t1[Y] = t2[Y]

• If X → Y, then one says that ...
– X functionally determines Y, and
– Y functionally depends on X.
• X is called the determinant of the FD X → Y
• Y is called the dependent of the FD X → Y
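
The condition “if t1[X] = t2[X], then t1[Y] = t2[Y]” translates directly into a check over the tuples of one instance. The following is a minimal sketch; the function name `satisfies` and the data layout are assumptions made for this illustration. Since an FD has to hold in every instance of R, such a check can refute an FD, but a positive result on one instance does not establish it.

```python
def satisfies(attrs, rows, X, Y):
    """Return True if no two rows agree on X while differing on Y."""
    index = {a: i for i, a in enumerate(attrs)}
    seen = {}  # maps each X value to the Y value first observed with it
    for row in rows:
        x_val = tuple(row[index[a]] for a in sorted(X))
        y_val = tuple(row[index[a]] for a in sorted(Y))
        # setdefault stores y_val on first sight and returns the stored value otherwise
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


ATTRS = ("heroID", "teamID", "heroName", "teamName", "joinYear")
ROWS = [
    (1, 1, "Thor", "The Avengers", 1963),
    (2, 2, "Mister Fantastic", "Fantastic Four", 1961),
    (3, 1, "Iron Man", "The Avengers", 1963),
    (4, 1, "Hulk", "The Avengers", 1963),
    (5, 1, "Captain America", "The Avengers", 1964),
    (6, 2, "Invisible Girl", "Fantastic Four", 1961),
]

print(satisfies(ATTRS, ROWS, {"heroID"}, {"heroName"}))   # heroID -> heroName
print(satisfies(ATTRS, ROWS, {"teamID"}, {"teamName"}))   # teamID -> teamName
print(satisfies(ATTRS, ROWS, {"teamID"}, {"heroName"}))   # teamID -> heroName
```

The third call fails because several heroes share teamID 1 while having different names.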

• Functional dependencies are semantic properties of the underlying domain and data model
• FDs are NOT a property of a particular instance (extension) of the relation schema R!
– The designer is responsible for identifying FDs
– FDs are manually defined integrity constraints on R
– All extensions respecting R’s functional dependencies are called legal extensions of R

• In fact, functional dependencies are a generalization of key constraints
• To see this, we need a short recap:
– A set of attributes X is a (candidate) key for R if and only if it has both of the following properties:
  • Uniqueness: No legal instance of R ever contains two distinct tuples with the same value for X
  • Irreducibility: No proper subset of X has the uniqueness property
– A superkey is a superset of a key
  • That is, only uniqueness is required

• In practice, if there is more than one key, we usually choose one and call it the primary key
– However, for normalization purposes, only keys are important. Thus, we ignore primary keys today.
• The following is true:
– X is a superkey of R if and only if X → {A1, ..., An} is a functional dependency in R

• Example:
– A relation containing students, with attributes (matriculationNo, firstName, lastName, dateOfBirth)
  • Semantics: matriculationNo is unique
  • {matriculationNo} → {firstName, lastName, dateOfBirth}

• Example:
– A relation containing real names and aliases of heroes, where each hero has only one unique alias, with attributes (alias, realName)
  • {alias} → {realName}

• Example:
– A relation containing license plates and the type of the respective car, with attributes (areaCode, characterCode, numberCode, carType)
  • {areaCode, characterCode, numberCode} → {carType}

• What FDs can be derived from the following description of an address book?
– Attributes: (name, street, city, state, zip)
– For any given zip code, there is just one city and state.
– For any given street, city, and state, there is just one zip code.
• FDs and candidate keys?

• One possible solution:
– {zip} → {city, state}
– {street, city, state} → {zip}
• Typically, not all actual FDs are modeled explicitly:
– {zip} → {city}
– {street} → {street}
– {state} → ∅
– ...

• Obviously, some FDs are implied by others
– {zip} → {city, state} implies {zip} → {city}
• Moreover, some FDs are trivial
– {street} → {street}
– {state} → ∅
– Definition: The FD X → Y is called trivial iff X ⊇ Y
• What do we need?
– A compact representation for sets of FD constraints
  • No redundant FDs
– An algorithm to compute the set of all implied FDs

• Definition: For any set F of FDs, the closure of F (denoted F+) is the set of all FDs that are logically implied by F
– F implies the FD X → Y, if and only if any extension of R satisfying all FDs in F also satisfies the FD X → Y
• Fortunately, the closure of F can easily be computed using a small set of inference rules

• For any attribute sets X, Y, Z, the following is true:
– Reflexivity: If X ⊇ Y, then X → Y
– Augmentation: If X → Y, then X ∪ Z → Y ∪ Z
– Transitivity: If X → Y and Y → Z, then X → Z
• These rules are called Armstrong’s axioms
– One can show that they are complete and sound
  • Completeness: Every implied FD can be derived
  • Soundness: No non-implied FD can be derived
– It’s that simple!

• To simplify the practical task of computing F+ from F, several additional rules can be derived from Armstrong’s axioms:
– Decomposition: If X → Y ∪ Z, then X → Y and X → Z
– Union: If X → Y and X → Z, then X → Y ∪ Z
– Composition: If X → Y and Z → W, then X ∪ Z → Y ∪ W

• Example:
– Relational schema R(A, B, C, D, E, F)
– FDs: {A} → {B, C}, {B} → {E}, {C, D} → {E, F}
– Then we can make the following derivation:
  1. {A} → {B, C} (given)
  2. {A} → {C} (by decomposition)
  3. {A, D} → {C, D} (by augmentation)
  4. {A, D} → {E, F} (by transitivity with given {C, D} → {E, F})
  5. {A, D} → {F} (by decomposition)

• In principle, we can compute the closure F+ of a given set F of FDs by means of the following algorithm:
– Repeatedly apply the six inference rules until they stop producing new FDs.
• In practice, this algorithm is not very efficient
– However, there usually is little need to compute the closure per se
– Instead, it often suffices to compute a certain subset of the closure: namely, that subset consisting of all FDs with a certain left side

• Definition: Given a set of attributes X and a set of FDs F, the closure of X under F, written as (X, F)+, consists of all attributes that functionally depend on X
– That is, (X, F)+ ≔ {Ai | X → Ai is implied by F}
• The following algorithm computes (X, F)+:

    unused = F
    closure = X
    repeat
      for each Y → Z ∈ unused do
        if Y ⊆ closure then
          unused = unused ∖ {Y → Z}
          closure = closure ∪ Z
    until unused and closure did not change on this iteration
    return closure

• Quiz:
– F = { {A} → {B, C}, {E} → {C, F}, {B} → {E}, {C, D} → {E, F} }
– What is the closure of {A, B} under F?
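
The closure algorithm can be transcribed almost line by line. The following is a minimal sketch in Python; the representation of an FD as a pair of strings, with one character per attribute, is an assumption chosen to keep the example short. It is applied to the set F from the quiz.

```python
def closure(attrs, fds):
    """Compute the closure (X, F)+ of the attribute set attrs under the FDs fds."""
    unused = list(fds)
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for fd in list(unused):      # iterate over a copy, since unused shrinks
            lhs, rhs = fd
            if set(lhs) <= result:   # left side is already contained in the closure
                unused.remove(fd)
                result |= set(rhs)
                changed = True
    return result


# F from the quiz: each FD is a (left side, right side) pair of attribute strings.
F = [("A", "BC"), ("E", "CF"), ("B", "E"), ("CD", "EF")]

print(sorted(closure("AB", F)))  # ['A', 'B', 'C', 'E', 'F']
```

D never enters the closure, because no FD has D on its right side and {C, D} → {E, F} cannot fire without D.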

• Now, we can do the following:
– Given a set F of FDs, we can easily tell whether a specific FD X → Y is contained in F+
  • Just check whether Y ⊆ (X, F)+
– In particular, we can find out whether a set of attributes X is a superkey of R
  • Just check whether (X, F)+ = {A1, ..., An}
• What’s still missing?
– Given a set of FDs F, how to find a set of FDs G, such that F+ = G+, and G is as small as possible?
– Given sets of FDs F and G, does F+ = G+ hold?

• Definition: Two sets of FDs F and G are equivalent iff F+ = G+
• How can we find out whether two given sets of FDs F and G are equivalent?
– Theorem: F+ = G+ iff for any FD X → Y ∈ F ∪ G, it is (X, F)+ = (X, G)+
– Proof:
  • Let F’ = {X → (X, F)+ | X → Y ∈ F ∪ G}
  • Analogously, derive G’ from G
  • Obviously, then F’+ = F+ and G’+ = G+
  • Moreover, every left side of an FD in F’ occurs as a left side of an FD in G’ (and reverse)
  • If F’ and G’ are different, then also F+ and G+ must be different
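
All three checks reduce to attribute closures. The following is a minimal sketch; the function names `implies`, `is_superkey`, and `equivalent`, and the string representation of FDs (one character per attribute), are assumptions made for this illustration. The sets F and G are the ones used in the equivalence example of this section.

```python
def closure(attrs, fds):
    """Attribute closure (X, F)+; FDs are (left side, right side) pairs of attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """X -> Y is in F+ iff Y is a subset of (X, F)+."""
    return set(rhs) <= closure(lhs, fds)


def is_superkey(attrs, all_attrs, fds):
    """X is a superkey iff (X, F)+ contains every attribute of R."""
    return closure(attrs, fds) == set(all_attrs)


def equivalent(f, g):
    """F and G are equivalent iff the closures agree for every left side occurring in F or G."""
    return all(closure(lhs, f) == closure(lhs, g) for lhs, _ in list(f) + list(g))


F = [("AB", "C"), ("C", "B")]
G = [("A", "C"), ("AC", "B")]

print(implies(F, "AB", "C"))        # given FD, so it is implied
print(is_superkey("AB", "ABC", F))  # closure of {A, B} covers R(A, B, C)
print(is_superkey("C", "ABC", F))   # closure of {C} is only {B, C}
print(equivalent(F, G))             # differs on the left side {C}
```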

• Example:
– F = { {A, B} → {C}, {C} → {B} }
– G = { {A} → {C}, {A, C} → {B} }
– Are F and G equivalent?
– We must check (X, F)+ = (X, G)+ for the following X:
  • {A, B}, {C}, {A}, and {A, C}
– ({A, B}, F)+ = {A, B, C} and ({A, B}, G)+ = {A, B, C}
– ({C}, F)+ = {B, C} and ({C}, G)+ = {C}
– Therefore, F and G are not equivalent!

• Remember: To have a small representation of F, we want to find a G, such that:
– F and G are equivalent
– G is “as small as possible” (we will call this property minimality)
• Definition: A set of FDs F is minimal iff the following is true:
– Every FD X → Y in F is in canonical form
  • That is, Y consists of exactly one attribute
– Every FD X → Y in F is left-irreducible
  • That is, no attribute can be removed from X without changing F+
– Every FD X → Y in F is non-redundant
  • That is, X → Y cannot be removed from F without changing F+
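
Each of the three minimality conditions can be enforced with attribute closures: split the right sides, drop left-side attributes whose removal leaves the dependent derivable, and drop FDs that follow from the remaining ones. The following is a minimal sketch under these assumptions: the helper names `fd`, `closure`, and `minimize` are made up for this illustration, and attributes are processed in alphabetical order, so a different order may yield a different (but still minimal and equivalent) result. The input is the set F from the worked minimization example in this section.

```python
def fd(lhs, rhs):
    """Build an FD from two attribute strings, one character per attribute."""
    return (frozenset(lhs), frozenset(rhs))


def closure(attrs, fds):
    """Attribute closure (X, F)+."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def minimize(fds):
    # Step 1: canonical form, one attribute per right side.
    work = []
    for lhs, rhs in fds:
        for attr in sorted(rhs):
            single = (lhs, frozenset([attr]))
            if single not in work:
                work.append(single)
    # Step 2: left-irreducibility. An attribute is redundant if the right
    # side is still derivable from the reduced left side.
    for i, (lhs, rhs) in enumerate(work):
        for attr in sorted(lhs):
            reduced = lhs - {attr}
            if rhs <= closure(reduced, work):
                lhs = reduced
                work[i] = (lhs, rhs)
    # Step 3: non-redundancy. An FD is redundant if the other FDs imply it.
    result = list(dict.fromkeys(work))  # drop duplicates created by step 2
    for candidate in list(result):
        rest = [f for f in result if f != candidate]
        if candidate[1] <= closure(candidate[0], rest):
            result = rest
    return result


F = [fd("A", "BC"), fd("B", "C"), fd("A", "B"), fd("AB", "C"), fd("AC", "D")]

for lhs, rhs in minimize(F):
    print(sorted(lhs), "->", sorted(rhs))
```

On this input the sketch ends with {A} → {B}, {B} → {C}, and {A} → {D}. The intermediate steps may differ from a manual derivation, for example in which attribute of {A, B} → {C} is removed first.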

• The following algorithm “minimizes” F, that is, it transforms F into a minimal equivalent of F:
  1. Split up all right sides to get FDs in canonical form
  2. Remove all redundant attributes from the left sides (by checking which attribute removals change F+)
  3. Remove all redundant FDs from F (by checking which FD removals change F+)

• Example:
– Given F = { {A} → {B, C}, {B} → {C}, {A} → {B}, {A, B} → {C}, {A, C} → {D} }
  1. Split up the right sides:
     {A} → {B}, {A} → {C}, {B} → {C}, {A, B} → {C}, {A, C} → {D}
  2. Remove C from {A, C} → {D}
     • {A} → {C} implies {A} → {A, C} (augmentation)
     • {A} → {A, C} and {A, C} → {D} imply {A} → {D} (transitivity)

– Now we have:
  {A} → {B}, {A} → {C}, {B} → {C}, {A, B} → {C}, {A} → {D}
  3. Remove {A, B} → {C}
     • {A} → {C} implies {A, B} → {C}
  4. Remove {A} → {C}
     • {A} → {B} and {B} → {C} imply {A} → {C} (transitivity)
– Finally, we end up with a minimal equivalent of F:
  {A} → {B}, {B} → {C}, {A} → {D}

• Functional dependencies are the perfect tool for performing lossless decompositions
– Heath’s Theorem: Let X → Y be an FD constraint of the relation schema R(A1, ..., An). Then, the following decomposition of R is lossless: α1 = X ∪ Y and α2 = {A1, ..., An} ∖ Y.
– Example:
  • Relation schema: (heroID, teamID, heroName, teamName, joinYear)
  • FDs: {heroID} → {heroName}, {teamID} → {teamName}, {heroID, teamID} → {joinYear}
  • Decompose with respect to {heroID} → {heroName}:
    (heroID, heroName) and (heroID, teamID, teamName, joinYear)
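
The decomposition from Heath’s Theorem is a pair of set operations. The following is a minimal sketch; the function name `heath_decompose` is an assumption, as is the guard requiring X and Y to be disjoint (otherwise α2 = {A1, ..., An} ∖ Y would lose attributes of X). Applying it twice, first with {heroID} → {heroName} and then with {teamID} → {teamName} on the remainder, yields the three schemas of design B.

```python
def heath_decompose(all_attrs, X, Y):
    """Split a schema along the FD X -> Y into X ∪ Y and all_attrs minus Y."""
    all_attrs, X, Y = set(all_attrs), set(X), set(Y)
    if X & Y:
        # Assumption of this sketch: determinant and dependent do not overlap.
        raise ValueError("X and Y must be disjoint")
    return X | Y, all_attrs - Y


R = {"heroID", "teamID", "heroName", "teamName", "joinYear"}

# First step: decompose with respect to {heroID} -> {heroName}.
alpha1, rest = heath_decompose(R, {"heroID"}, {"heroName"})
# Second step: decompose the remainder with respect to {teamID} -> {teamName}.
alpha2, alpha3 = heath_decompose(rest, {"teamID"}, {"teamName"})

print(sorted(alpha1))
print(sorted(alpha2))
print(sorted(alpha3))
```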

• How to come up with functional dependencies?
– There are several ways:
  • Based on “domain knowledge”
  • Based on an explicit data model
  • Based on existing data

1. Based on “domain knowledge”
– “Obvious” FDs are easy to find
– What about more complicated FDs?
– No guarantee that you found all (important) FDs!

2. Based on an explicit model
[Figure: ER diagram with entity types A and B connected by a relationship type r, annotated with the cardinalities 1 and (1, 2)]
– Automated generation of FDs possible
– But: Are all actual FDs present in the model?
  • What about FDs between attributes of the same entity?

3. Based on existing data
– In practice, often there is already some data available (that is, tuples)
– We can use the data to derive FD constraints
– Obviously:
  • All FDs that hold in general for some relation schema also hold for any given extension
  • Therefore, the set of all FDs that hold in some extension is a superset of all “true” FDs of the relation schema
– What we can do:
  • Find all FDs that hold in a given extension
  • Find a minimal representation of this FD set
  • Ask a domain expert what FDs are generally true

A  B  C  D  E
1  1  1  1  1
1  2  2  2  1
2  1  3  3  1
2  1  4  3  1
3  2  5  1  1

• Which of the following FDs are satisfied in this particular extension?
  a) {C} → {A, B}
  b) {A, D} → {C}
  c) ∅ → {E}
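
Finding the FDs that hold in a given extension can be done by brute force for small schemas. The following is a minimal sketch that answers the three questions above and enumerates all nontrivial FDs X → Y in canonical form that the five tuples satisfy. The function names and the single-character attribute encoding are assumptions made for this illustration. The enumeration tries every subset X, so it is exponential in the number of attributes.

```python
from itertools import combinations

ATTRS = "ABCDE"
ROWS = [
    (1, 1, 1, 1, 1),
    (1, 2, 2, 2, 1),
    (2, 1, 3, 3, 1),
    (2, 1, 4, 3, 1),
    (3, 2, 5, 1, 1),
]


def satisfies(X, Y):
    """True if the FD X -> Y holds in ROWS; X and Y are iterables of attribute names."""
    seen = {}
    for row in ROWS:
        x_val = tuple(row[ATTRS.index(a)] for a in sorted(X))
        y_val = tuple(row[ATTRS.index(a)] for a in sorted(Y))
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


def satisfied_canonical_fds():
    """All nontrivial FDs X -> {y} with a single attribute y that hold in ROWS."""
    found = []
    for size in range(len(ATTRS) + 1):
        for X in combinations(ATTRS, size):
            for y in ATTRS:
                if y not in X and satisfies(X, y):
                    found.append(("".join(X), y))
    return found


quiz = {
    "a": satisfies("C", "AB"),  # C is unique in this extension
    "b": satisfies("AD", "C"),  # rows 3 and 4 agree on A and D but differ on C
    "c": satisfies("", "E"),    # E has the same value in every row
}
print(quiz)

found = satisfied_canonical_fds()
print(len(found), "nontrivial canonical FDs hold in this extension")
```

The result still has to be reviewed by a domain expert: an FD such as ∅ → {E} holds in these five tuples, but that alone does not make it a constraint of the relation schema.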

Relational Database Systems 1 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61 Relational Database Systems 1 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62

9.3 Functional Dependencies

– Find all FDs that are satisfied in this extension!

A  B  C  D  E
1  1  1  1  1
1  2  2  2  1
2  1  3  3  1
2  1  4  3  1
3  2  5  1  1

• We will check any FD X → Y in canonical form, i.e.,
– X is a subset of {A, B, C, D, E} and
– Y is an element of {A, B, C, D, E}

9.4 Normal Forms

• Back to normalization:
– Remember: Normalization = Finding lossless decompositions
– But only decompose if the relation schema is of “bad quality”
• How to measure the quality of a relation schema?
– Clearly: The quality depends on the constraints
– In our case: Quality depends on the FDs of the relation schema
– Schemas can be classified into different “quality levels,” which are called normal forms


9.4 Normal Forms

• Part of a schema design process is to choose a desired normal form and convert the schema into that form
• There are seven normal forms
– The higher the number, ...
• ... the stricter the requirements,
• ... the fewer anomalies and the less redundancy, and
• ... the better the “design quality.”
[Diagram: 6NF, 5NF, 4NF, BCNF, 3NF, 2NF, 1NF]

9.4 1NF

• First normal form (1NF)
– Already known from previous lectures
• Has nothing to do with functional dependencies!
– Restricts relations to being “flat”
• Only atomic attribute values are allowed
– Multi-valued attributes must be normalized, e.g., by
A) Introducing a new relation for the multi-valued attribute
B) Replicating the tuple for each multi-value
C) Introducing an own attribute for each multi-value (if there is a small maximum number of values)
– Solution A is usually considered the best

9.4 1NF

• A: Introducing a new relation
– Uses old key and multi-attribute as composite key

heroID  heroName   powers
1       Storm      weather control, flight
2       Wolverine  extreme cellular regeneration
3       Phoenix    omnipotence, indestructibility, limitless energy manipulation

is split into

heroID  heroName
1       Storm
2       Wolverine
3       Phoenix

heroID  power
1       weather control
1       flight
2       extreme cellular regeneration
3       omnipotence
3       indestructibility
3       limitless energy manipulation

• B: Replicating the tuple for each multi-value
– Uses old key and multi-attribute as composite key

The same original table becomes

heroID  heroName   powers
1       Storm      weather control
1       Storm      flight
2       Wolverine  extreme cellular regeneration
3       Phoenix    omnipotence
3       Phoenix    indestructibility
3       Phoenix    limitless energy manipulation
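Solution A can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the function name split_multivalued and the in-memory tuple representation are assumptions, and the multi-valued attribute is modeled as a Python list.

```python
# Hypothetical sketch of solution A: move the multi-valued attribute
# "powers" into a relation of its own, keyed by (heroID, power).
heroes = [
    (1, "Storm", ["weather control", "flight"]),
    (2, "Wolverine", ["extreme cellular regeneration"]),
    (3, "Phoenix", ["omnipotence", "indestructibility",
                    "limitless energy manipulation"]),
]


def split_multivalued(rows):
    """Return (hero, hero_power): two flat relations with atomic values only."""
    hero = [(hero_id, name) for hero_id, name, _ in rows]
    hero_power = [(hero_id, power)
                  for hero_id, _, powers in rows
                  for power in powers]
    return hero, hero_power


hero, hero_power = split_multivalued(heroes)
for row in hero + hero_power:
    print(row)
```

In contrast to solution B, the hero name is stored only once per hero, so renaming a hero touches a single tuple.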


9.4 1NF

• C: Introducing an own attribute for each multi-value

heroID  heroName   powers
1       Storm      weather control, flight
2       Wolverine  extreme cellular regeneration
3       Phoenix    omnipotence, indestructibility, limitless energy manipulation

becomes

heroID  heroName   power1                 power2             power3
1       Storm      weather control        flight             NULL
2       Wolverine  cellular regeneration  NULL               NULL
3       Phoenix    omnipotence            indestructibility  limitless energy manipulation

9.4 2NF

• The second normal form (2NF)
– The 2NF aims to avoid attributes that are functionally dependent on (proper) subsets of keys
– Remember:
• A set of attributes X is a (candidate) key if and only if X → {A1, ..., An} is a valid FD and no proper subset of X has this property
• An attribute Ai is a key attribute if and only if it is contained in some key; otherwise, it is a non-key attribute
– Definition (2NF):
A relation schema is in 2NF (wrt. a set of FDs) iff ...
• it is in 1NF and
• no non-key attribute is functionally dependent on a proper subset of some key.


9.4 2NF

• Normalization into 2NF is achieved by decomposition according to the “non-2NF” FDs
– If X → Y is a valid FD and X is a proper subset of some key, then decompose into α1 = X ∪ Y and α2 = {A1, ..., An} ∖ Y
– According to Heath’s Theorem, this decomposition is lossless

Relation: (heroID, teamID, heroName, teamName, joinYear)
FDs:
{heroID} → {heroName}
{teamID} → {teamName}
{heroID, teamID} → {joinYear}
Decompose with respect to {heroID} → {heroName}:
(heroID, heroName)
(heroID, teamID, teamName, joinYear)

• Functional dependence on key parts is only a problem in relation schemas with composite keys
– A key is called composite key if it consists of more than one attribute
• Corollary: Every 1NF-relation without constant attributes and without composite keys also is in 2NF.
– 2NF is violated if there is a composite key and some non-key attribute depends only on a proper subset of this composite key
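The 2NF test and the decomposition step can be sketched in Python. This is an illustrative sketch under assumptions, not the lecture's own algorithm: the names closure, candidate_keys, violations_2nf and heath_split are hypothetical, keys are found by brute force (fine for a handful of attributes), and heath_split assumes that X and Y are disjoint.

```python
from itertools import combinations

ATTRS = frozenset({"heroID", "teamID", "heroName", "teamName", "joinYear"})
FDS = [
    (frozenset({"heroID"}), frozenset({"heroName"})),
    (frozenset({"teamID"}), frozenset({"teamName"})),
    (frozenset({"heroID", "teamID"}), frozenset({"joinYear"})),
]


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def candidate_keys(attrs, fds):
    """Brute force: minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(len(attrs) + 1):          # small sets first => minimality
        for combo in combinations(sorted(attrs), size):
            cand = frozenset(combo)
            if closure(cand, fds) == attrs and not any(k <= cand for k in keys):
                keys.append(cand)
    return keys


def violations_2nf(attrs, fds):
    """Pairs (X, A): non-key attribute A depends on a proper subset X of a key."""
    keys = candidate_keys(attrs, fds)
    key_attrs = frozenset().union(*keys)
    found = set()
    for key in keys:
        for size in range(len(key)):            # proper subsets, incl. empty set
            for combo in combinations(sorted(key), size):
                part = frozenset(combo)
                for attr in closure(part, fds) - part - key_attrs:
                    found.add((part, attr))
    return found


def heath_split(attrs, lhs, rhs):
    """Decomposition from Heath's Theorem (assumes lhs and rhs are disjoint)."""
    return lhs | rhs, attrs - rhs


print(candidate_keys(ATTRS, FDS))
print(violations_2nf(ATTRS, FDS))
print(heath_split(ATTRS, frozenset({"heroID"}), frozenset({"heroName"})))
```

The empty set counts as a proper subset of every key, which is why constant attributes appear in the corollary: an attribute determined by ∅ also violates 2NF.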

9.4 2NF

• Repeat this decomposition step for every created relation schema that is still not in 2NF

Relation: (heroID, teamID, teamName, joinYear)
FDs:
{teamID} → {teamName}
{heroID, teamID} → {joinYear}
Decompose with respect to {teamID} → {teamName}:
(heroID, teamID, joinYear)
(teamID, teamName)

9.4 3NF

• The third normal form (3NF)
– The 3NF relies on the concept of transitive FDs
– Definition:
Given a set of FDs F, an FD X → Z ∈ F+ is transitive in F, if and only if there is an attribute set Y such that:
• X → Y ∈ F+,
• Y → X ∉ F+, and
• Y → Z ∈ F+.
– Example:

heroID  heroName     homeCityID  homeCityName
11      Professor X  563         New York
12      Wolverine    782         Alberta
13      Cyclops      112         Anchorage
14      Phoenix      563         New York

• {heroID} → {heroName}
• {heroID} → {homeCityID}
• {heroID} → {homeCityName}
• {homeCityID} → {homeCityName}


9.4 3NF

• Definition:
A relation schema is in 3NF if and only if:
– It is in 2NF and
– no key transitively determines a non-key attribute.

heroID  heroName     homeCityID  homeCityName
11      Professor X  563         New York
12      Wolverine    782         Alberta
13      Cyclops      112         Anchorage
14      Phoenix      563         New York

FDs:
{heroID} → {heroName}
{heroID} → {homeCityID}
{homeCityID} → {homeCityName}

• Assume that the “non-3NF” transitive FD X → Z has been created by FDs X → Y and Y → Z
• Then, normalization into 3NF is achieved by decomposition according to Y → Z
– Again, this decomposition is lossless

Relation: (heroID, heroName, homeCityID, homeCityName)
FDs:
{heroID} → {heroName}
{heroID} → {homeCityID}
{homeCityID} → {homeCityName}
Decompose with respect to {homeCityID} → {homeCityName}:
(heroID, heroName, homeCityID)
(homeCityID, homeCityName)
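That the decomposition with respect to {homeCityID} → {homeCityName} is lossless can be checked on the example data: projecting onto both attribute sets and joining the projections must give back exactly the original tuples. The following Python sketch is an illustration only; the helpers project and natural_join and the representation of a tuple as a frozenset of (attribute, value) pairs are assumptions, not part of the lecture.

```python
# Each tuple is a frozenset of (attribute, value) pairs, so a relation is a set.
ROWS = [
    {"heroID": 11, "heroName": "Professor X", "homeCityID": 563, "homeCityName": "New York"},
    {"heroID": 12, "heroName": "Wolverine", "homeCityID": 782, "homeCityName": "Alberta"},
    {"heroID": 13, "heroName": "Cyclops", "homeCityID": 112, "homeCityName": "Anchorage"},
    {"heroID": 14, "heroName": "Phoenix", "homeCityID": 563, "homeCityName": "New York"},
]
relation = {frozenset(row.items()) for row in ROWS}


def project(rel, attrs):
    """Relational projection; duplicates vanish because the result is a set."""
    return {frozenset((a, dict(t)[a]) for a in attrs) for t in rel}


def natural_join(left, right):
    """Combine all pairs of tuples that agree on their common attributes."""
    joined = set()
    for lt in left:
        for rt in right:
            ld, rd = dict(lt), dict(rt)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                joined.add(frozenset({**ld, **rd}.items()))
    return joined


hero = project(relation, ["heroID", "heroName", "homeCityID"])
city = project(relation, ["homeCityID", "homeCityName"])

print(len(hero), len(city))                    # sizes of the two projections
print(natural_join(hero, city) == relation)    # lossless for this extension?
```

The city projection contains only three tuples, since New York is now stored once instead of twice. Such a check confirms losslessness for one extension only; the general guarantee comes from Heath's Theorem.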


9.4 BCNF

• Boyce-Codd normal form (BCNF)
– Was actually proposed by Ian Heath (he called it 3NF) three years before Boyce and Codd did
– Definition:
A relation schema R is in BCNF if and only if, in any non-trivial FD X → Y, the set X is a superkey
• All BCNF schemas are also in 3NF, and most 3NF schemas are also in BCNF
– There are some rare exceptions

– BCNF is very similar to 3NF:
• BCNF: In any non-trivial FD X → Y, the set X is a superkey.
• 3NF (alternative definition): In any non-trivial FD X → Y, the set X is a superkey, or every attribute of Y is contained in some key.
– A 3NF schema can fail to be in BCNF only if it has two or more overlapping composite keys.
• That is: There are different keys X and Y such that |X|, |Y| ≥ 2 and X ∩ Y ≠ ∅.

9.4 BCNF

• Example:
– Students, a topic, and an advisor

student  topic    advisor
100      Math     Gauss
100      Physics  Einstein
101      Math     Leibniz
102      Math     Gauss

– Let’s assume that the following dependencies hold:
• {student, topic} → {advisor}
• {advisor} → {topic}
– That is: For each topic, a student has a specific advisor. Each advisor is responsible for a single specific topic.

– Consequently, there are the following keys:
• {student, topic}
• {student, advisor}
– The schema is in 3NF, because it is in 1NF and there are no non-key attributes
– However, it is not in BCNF
• We have {advisor} → {topic}, but {advisor} is not a superkey
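The BCNF test for this example can be sketched in Python. The sketch is an illustration under assumptions, not lecture code: the names closure and bcnf_violations are hypothetical, and only the given FDs are inspected, which is sufficient when testing the original schema against its own FD set.

```python
ATTRS = frozenset({"student", "topic", "advisor"})
FDS = [
    (frozenset({"student", "topic"}), frozenset({"advisor"})),
    (frozenset({"advisor"}), frozenset({"topic"})),
]


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def bcnf_violations(attrs, fds):
    """Non-trivial FDs X -> Y whose left-hand side X is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != attrs]


for lhs, rhs in bcnf_violations(ATTRS, FDS):
    print(sorted(lhs), "->", sorted(rhs), "violates BCNF")
```

For the FDs above, the only violation reported is {advisor} → {topic}: the closure of {advisor} is {advisor, topic}, which lacks student.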


9.4 BCNF

• Moreover, there are modification anomalies:

Student  Topic    Advisor
100      Math     Gauss
100      Physics  Einstein
101      Math     Leibniz
102      Math     Gauss

If you delete the row (101, Math, Leibniz), all information about Leibniz doing math is lost
• Normalization by decomposition prevents these anomalies:

Student  Advisor
100      Gauss
100      Einstein
101      Leibniz
102      Gauss

Advisor   Topic
Gauss     Math
Einstein  Physics
Leibniz   Math

9.4 Higher Normal Forms

• BCNF is the “ultimate” normal form when using only functional dependencies as constraints
– “Every attribute depends on a key, a whole key, and nothing but a key, so help me Codd.”
• However, there are higher normal forms (4NF to 6NF) that rely on generalizations of FDs
– 4NF: Multivalued dependencies
– 5NF/6NF: Join dependencies


9.4 4NF

• The 4NF is about multivalued dependencies (MVDs)
• Example:

course   teacher      textbook
Physics  Prof. Green  Basic Mechanics
Physics  Prof. Green  Principles of Optics
Physics  Prof. Brown  Basic Mechanics
Physics  Prof. Brown  Principles of Optics
Math     Prof. Green  Basic Mechanics
Math     Prof. Green  Vector Analysis
Math     Prof. Green  Trigonometry

Dependencies:
• “For any course, there is a fixed set of teachers.” (written as {course} ↠ {teacher})
• “For any course, there is a fixed set of textbooks, which is independent of the teacher” (written as {course} ↠ {textbook})
• In fact, every FD can be expressed as a MVD:
– If X → Y then also X ↠ Y
– But both expressions are not equivalent!

• Definition:
A relation schema is in 4NF if and only if, for any non-trivial multivalued dependency X ↠ Y, the set X is a superkey (so that the functional dependency X → Y holds as well)
– “There are no two attributes in a 1:n relationship with a key”
• Decomposition into 4NF schemas:

course   teacher
Physics  Prof. Green
Physics  Prof. Brown
Math     Prof. Green

course   textbook
Physics  Basic Mechanics
Physics  Principles of Optics
Math     Basic Mechanics
Math     Vector Analysis
Math     Trigonometry
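Whether an MVD is satisfied by a given extension can be tested directly. The Python sketch below is an illustration, not part of the lecture; the function name satisfies_mvd and the tuple-based table representation are assumptions. It uses the characterization that X ↠ Y holds if, within each group of tuples sharing the same X value, the Y values and the values of the remaining attributes occur in every possible combination.

```python
from itertools import product

COLS = ("course", "teacher", "textbook")
ROWS = [
    ("Physics", "Prof. Green", "Basic Mechanics"),
    ("Physics", "Prof. Green", "Principles of Optics"),
    ("Physics", "Prof. Brown", "Basic Mechanics"),
    ("Physics", "Prof. Brown", "Principles of Optics"),
    ("Math", "Prof. Green", "Basic Mechanics"),
    ("Math", "Prof. Green", "Vector Analysis"),
    ("Math", "Prof. Green", "Trigonometry"),
]


def satisfies_mvd(rows, cols, lhs, rhs):
    """True iff lhs ->> rhs holds in this extension."""
    rest = [c for c in cols if c not in lhs and c not in rhs]
    index = {c: i for i, c in enumerate(cols)}

    def pick(row, names):
        return tuple(row[index[n]] for n in names)

    # Group the (rhs value, rest value) pairs by their lhs value.
    groups = {}
    for row in rows:
        groups.setdefault(pick(row, lhs), set()).add(
            (pick(row, rhs), pick(row, rest)))

    # Each group must be the full cross product of its rhs and rest values.
    for pairs in groups.values():
        ys = {y for y, _ in pairs}
        zs = {z for _, z in pairs}
        if pairs != set(product(ys, zs)):
            return False
    return True


print(satisfies_mvd(ROWS, COLS, ["course"], ["teacher"]))
print(satisfies_mvd(ROWS, COLS, ["course"], ["textbook"]))
print(satisfies_mvd(ROWS, COLS, ["teacher"], ["course"]))
```

On the example data, the first two checks succeed and the third fails: Prof. Green teaches two courses and appears with four textbooks, but only five of the eight possible combinations are present.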

9.4 5NF

• The 5NF deals with join dependencies (JDs)
– Directly related to lossless decompositions
– Definition:
Let α1, ..., αk ⊆ {A1, ..., An} be k subsets of R’s attributes (possibly overlapping). We say that R satisfies the join dependency ∗{α1, ..., αk} if and only if α1, ..., αk is a lossless decomposition of R.
– Definition:
A relation schema is in 5NF if and only if, for every non-trivial join dependency ∗{α1, ..., αk}, each αi is a superkey.

9.4 6NF

• The 6NF also is about join dependencies
– Definition:
A relation schema is in 6NF if and only if it satisfies no non-trivial JDs at all.
• In other words: You cannot decompose it anymore.
• Decomposition into 6NF means that every resulting relation schema contains a key and one(!) additional non-key attribute
– This means a lot of tables!
• By definition, 6NF is the final word on normalization by lossless decomposition
– All kinds of dependencies can be expressed by key and foreign key constraints


9.5 Denormalization

• Normalization in real world databases:
– Guided by normal form theory
– But: Normalization is not everything!
– Trade-off: Redundancy/anomalies vs. speed
• General design: Avoid redundancy wherever possible, because redundancies often lead to inconsistent states
• An exception: Materialized views (≈ precomputed joins)
– expensive to maintain, but can boost read efficiency

• Usually, a schema in a higher normal form is better than one in a lower normal form
– However, sometimes it is a good idea to artificially create lower-form schemas to, e.g., increase read performance
• This is called denormalization
• Denormalization usually increases query speed and decreases update efficiency due to the introduction of redundancy


9.5 Denormalization

• Rules of thumb:
– A good data model almost always directly leads to relational schemas in high normal forms
• Carefully design your models!
• Think of dependencies and other constraints!
• Have normal forms in mind during modeling!
– Denormalize only when faced with a performance problem that cannot be resolved by:
• money
• hardware scalability
• current SQL technology
• network optimization
• parallelization
• other performance techniques

– Sometimes, you can even perform denormalization at the physical level of the database
• Let your RDBMS know which attributes are often accessed together, even if they are located in different tables
• State-of-the-art RDBMS can exploit this information to physically cluster data or precompute some joins, even without changing your table designs!
