Normalisation

Database Systems Lectures 11-12
Natasha Alechina

In This Lecture

• Idea of normalisation
• Functional dependencies
• Normal forms
• Decompositions
• 2NF, 3NF, BCNF

Functional Dependencies

• Redundancy is often caused by a functional dependency
• A functional dependency (FD) is a link between two sets of attributes in a relation
• We can normalise a relation by removing undesirable FDs
• A set of attributes, A, functionally determines another set, B, or: there exists a functional dependency between A and B (A → B), if whenever two rows of the relation have the same values for all the attributes in A, then they also have the same values for all the attributes in B.

Example

• {ID, modCode} → {First, Last, modName}
• {modCode} → {modName}
• {ID} → {First, Last}

ID   First  Last    modCode  modName
111  Joe    Bloggs  G51PRG   Programming
222  Anne   Smith   G51DBS   Databases
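The definition of an FD can be tried out on a concrete table. The following is a minimal Python sketch, not part of the lecture: the helper name `fd_holds` and the third row of data are assumptions made for illustration. Note that a single instance can only refute an FD; the FD itself is a claim about every legal instance of the relation.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in this instance (a list of dicts)."""
    seen = {}  # value of the lhs attributes -> rhs value first seen with it
    for row in rows:
        a_value = tuple(row[a] for a in lhs)
        b_value = tuple(row[b] for b in rhs)
        if seen.setdefault(a_value, b_value) != b_value:
            return False  # two rows agree on lhs but differ on rhs
    return True


rows = [
    {"ID": 111, "First": "Joe", "Last": "Bloggs", "modCode": "G51PRG", "modName": "Programming"},
    {"ID": 222, "First": "Anne", "Last": "Smith", "modCode": "G51DBS", "modName": "Databases"},
    # Hypothetical extra row (not in the lecture): Joe also takes Databases.
    {"ID": 111, "First": "Joe", "Last": "Bloggs", "modCode": "G51DBS", "modName": "Databases"},
]

print(fd_holds(rows, ["ID"], ["First", "Last"]))      # True
print(fd_holds(rows, ["modCode"], ["modName"]))       # True
print(fd_holds(rows, ["ID"], ["modName"]))            # False: ID 111 has two module names
```

The last call shows why {ID} alone cannot be the key of this relation: the same student appears with two different modules.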

FDs and Normalisation

• We define a set of 'normal forms'
  • Each normal form has fewer FDs than the last
  • Since FDs represent redundancy, each normal form has less redundancy than the last
• Not all FDs cause a problem
  • We identify various sorts of FD that do
  • Each normal form removes a type of FD that is a problem
  • We will also need a way to remove FDs

Key attributes and superkeys

• We call an attribute a key attribute if this attribute is part of some candidate key. Alternative terminology is 'prime' attribute.
• We call a set of attributes a superkey if it includes a candidate key (or is a candidate key).

Partial FDs and 2NF

• Partial FDs:
  • A FD, A → B is a partial FD, if some attribute of A can be removed and the FD still holds
  • Formally, there is some proper subset of A, C ⊂ A, such that C → B
• Let us call attributes which are part of some candidate key, key attributes, and the rest non-key attributes.

Second normal form:
• A relation is in second normal form (2NF) if it is in 1NF and no non-key attribute is partially dependent on a candidate key
• In other words, no C → B where C is a strict subset of a candidate key and B is a non-key attribute.

1NF
Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6

• 1NF is not in 2NF
• We have the FD {Module, Text} → {Lecturer, Dept}
• But also {Module} → {Lecturer, Dept}
• And so Lecturer and Dept are partially dependent on the primary key
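The "in other words" form of the 2NF condition translates almost directly into code. The sketch below is an illustration only: the function name `partial_dependencies` is an assumption, and it inspects just the FDs it is given rather than everything they imply.

```python
def partial_dependencies(fds, candidate_keys):
    """List FDs C -> B where C is a strict subset of a candidate key
    and B contains at least one non-key attribute."""
    key_attributes = set().union(*candidate_keys)
    found = []
    for lhs, rhs in fds:
        non_key_rhs = rhs - key_attributes
        # `lhs < key` is Python's strict-subset test on sets
        if non_key_rhs and any(lhs < key for key in candidate_keys):
            found.append((lhs, non_key_rhs))
    return found


# FDs and candidate key of the 1NF relation from the lecture
fds = [
    ({"Module", "Text"}, {"Lecturer", "Dept"}),
    ({"Module"}, {"Lecturer", "Dept"}),
]
candidate_keys = [{"Module", "Text"}]

bad = partial_dependencies(fds, candidate_keys)
print(bad)  # one entry: {Module} -> {Lecturer, Dept}
```

A relation in 1NF is in 2NF exactly when no such FD exists, so a non-empty result here means the relation is not in 2NF.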

Removing FDs

• Suppose we have a relation R with scheme S and the FD A → B where A ∩ B = { }
• Let C = S – (A U B)
• In other words:
  • A – attributes on the left hand side of the FD
  • B – attributes on the right hand side of the FD
  • C – all other attributes
• It turns out that we can split R into two parts:
  • R1, with scheme C U A
  • R2, with scheme A U B
• The original relation can be recovered as the natural join of R1 and R2:
  • R = R1 NATURAL JOIN R2

1NF to 2NF – Example

Here A = {Module}, B = {Dept, Lecturer} and C = {Text}. 2NFa has scheme A, B, where A → B is the 'bad' dependency violating 2NF; 2NFb has scheme A, C.

1NF
Module  Dept  Lecturer  Text
M1      D1    L1        T1
M1      D1    L1        T2
M2      D1    L1        T1
M2      D1    L1        T3
M3      D1    L2        T4
M4      D2    L3        T1
M4      D2    L3        T5
M5      D2    L4        T6

2NFa
Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

2NFb
Module  Text
M1      T1
M1      T2
M2      T1
M2      T3
M3      T4
M4      T1
M4      T5
M5      T6
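The split into R1 (scheme C U A) and R2 (scheme A U B) can be checked mechanically. The following Python sketch is one possible encoding, not the lecture's own: rows are stored as frozensets of (attribute, value) pairs, and the helper names `relation`, `project`, `natural_join` and `split_on_fd` are assumptions.

```python
def relation(schema, tuples):
    """A relation as a set of rows; each row is a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(schema, values)) for values in tuples}


def project(rel, attrs):
    """Projection onto attrs; duplicates disappear because the result is a set."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rel}


def natural_join(r1, r2):
    """Combine every pair of rows that agree on all shared attributes."""
    joined = set()
    for row1 in r1:
        for row2 in r2:
            merged = dict(row1)
            if all(merged.get(a, v) == v for a, v in row2):
                merged.update(row2)
                joined.add(frozenset(merged.items()))
    return joined


def split_on_fd(rel, schema, lhs, rhs):
    """Split on the FD lhs -> rhs into R1 (C U A) and R2 (A U B)."""
    a, b = set(lhs), set(rhs)
    c = set(schema) - (a | b)
    return project(rel, c | a), project(rel, a | b)


schema = ("Module", "Dept", "Lecturer", "Text")
one_nf = relation(schema, [
    ("M1", "D1", "L1", "T1"), ("M1", "D1", "L1", "T2"),
    ("M2", "D1", "L1", "T1"), ("M2", "D1", "L1", "T3"),
    ("M3", "D1", "L2", "T4"), ("M4", "D2", "L3", "T1"),
    ("M4", "D2", "L3", "T5"), ("M5", "D2", "L4", "T6"),
])

two_nf_b, two_nf_a = split_on_fd(one_nf, schema, {"Module"}, {"Dept", "Lecturer"})
print(len(two_nf_a), len(two_nf_b))              # 5 8
print(natural_join(two_nf_b, two_nf_a) == one_nf)  # True
```

The final comparison confirms, for this instance, that the natural join of the two parts gives back exactly the original eight rows.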

Problems Resolved in 2NF

• Problems in 1NF
  • INSERT – Can't add a module with no texts
  • UPDATE – To change lecturer for M1, we have to change two rows
  • DELETE – If we remove M3, we remove L2 as well
• In 2NF the first two are resolved, but not the third one

Problems Remaining in 2NF

2NFa
Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

• INSERT anomalies
  • Can't add lecturers who teach no modules
• UPDATE anomalies
  • To change the department for L1 we must alter two rows
• DELETE anomalies
  • If we delete M3 we delete L2 as well

Transitive FDs and 3NF

• Transitive FDs:
  • A FD, A → C is a transitive FD, if there is some set B such that A → B and B → C are non-trivial FDs
  • A → B non-trivial means: B is not a subset of A
  • We have A → B → C
• Third normal form
  • A relation is in third normal form (3NF) if it is in 2NF and no non-key attribute is transitively dependent on a candidate key
  • Alternative (simpler) definition: a relation is in 3NF if in every non-trivial FD A → B either B is a key attribute or A is a superkey.

2NFa
Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

• 2NFa is not in 3NF
• We have the FDs
  • {Module} → {Lecturer}
  • {Lecturer} → {Dept}
• So there is a transitive FD from the primary key {Module} to {Dept}

2NF to 3NF – Example

2NFa
Module  Dept  Lecturer
M1      D1    L1
M2      D1    L1
M3      D1    L2
M4      D2    L3
M5      D2    L4

3NFa
Lecturer  Dept
L1        D1
L2        D1
L3        D2
L4        D2

3NFb
Module  Lecturer
M1      L1
M2      L1
M3      L2
M4      L3
M5      L4

Problems Resolved in 3NF

• Problems in 2NF
  • INSERT – Can't add lecturers who teach no modules
  • UPDATE – To change the department for L1 we must alter two rows
  • DELETE – If we delete M3 we delete L2 as well
• In 3NF all of these are resolved (for this relation – but 3NF can still have anomalies!)

Normalisation so Far

• First normal form
  • All data values are atomic
• Second normal form
  • In 1NF plus no non-key attribute is partially dependent on a candidate key
• Third normal form
  • In 2NF plus no non-key attribute depends transitively on a candidate key (or, no dependencies of non-key on non-superkey)

The Stream Relation

• Consider a relation, Stream, which stores information about times for various streams of courses
• For example: labs for first years
• Each course has several streams
• Only one stream (of any course at all) takes place at any given time
• Each student taking a course is assigned to a single stream for it

The Stream Relation

Student  Course       Time
John     Databases    12:00
Mary     Databases    12:00
Richard  Databases    15:00
Richard  Programming  10:00
Mary     Programming  10:00
Rebecca  Programming  13:00

Candidate keys: {Student, Course} and {Student, Time}

FDs in the Stream Relation

• Stream has the following non-trivial FDs
  • {Student, Course} → {Time}
  • {Time} → {Course}
• Since all attributes are key attributes, Stream is in 3NF
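The two candidate keys of Stream can be derived from its FDs by computing attribute closures. The sketch below uses a brute-force search, which is fine for three attributes but is only an illustration; the function names `closure` and `candidate_keys` are assumptions, not something the lecture defines.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # already contains a smaller key, so it is not minimal
            if closure(candidate, fds) == set(schema):
                keys.append(candidate)
    return keys


schema = {"Student", "Course", "Time"}
fds = [({"Student", "Course"}, {"Time"}), ({"Time"}, {"Course"})]

keys = candidate_keys(schema, fds)
key_attributes = set().union(*keys)
print(keys)                      # {Course, Student} and {Student, Time}
print(key_attributes == schema)  # True: every attribute is a key attribute
```

Because every attribute belongs to some candidate key, there are no non-key attributes left to violate 2NF or 3NF, which matches the conclusion on the slide.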

Anomalies in Stream

• INSERT anomalies
  • You can't add an empty stream
• UPDATE anomalies
  • Moving the 12:00 class to 9:00 means changing two rows
• DELETE anomalies
  • Deleting Rebecca removes a stream

Boyce-Codd Normal Form

• A relation is in Boyce-Codd normal form (BCNF) if for every FD A → B either
  • B is contained in A (the FD is trivial), or
  • A contains a candidate key of the relation
• In other words: every determinant in a non-trivial dependency is a (super) key.
• The same as 3NF except in 3NF we only worry about non-key Bs
• If there is only one candidate key then 3NF and BCNF are the same
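The difference between the 3NF and BCNF conditions is a single extra escape clause, which a short check makes visible. This is a sketch under stated assumptions: the function name `violations` is invented here, the candidate keys are supplied by hand, and only the listed FDs are inspected (not everything they imply).

```python
def violations(fds, candidate_keys, form):
    """Return the listed FDs A -> B that break the given form ('3NF' or 'BCNF')."""
    key_attributes = set().union(*candidate_keys)
    bad = []
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial FD: B is contained in A
        if any(key <= lhs for key in candidate_keys):
            continue  # A contains a candidate key, so A is a superkey
        if form == "3NF" and (rhs - lhs) <= key_attributes:
            continue  # 3NF only worries about non-key Bs
        bad.append((lhs, rhs))
    return bad


stream_fds = [({"Student", "Course"}, {"Time"}), ({"Time"}, {"Course"})]
stream_keys = [{"Student", "Course"}, {"Student", "Time"}]

two_nf_a_fds = [({"Module"}, {"Lecturer"}), ({"Lecturer"}, {"Dept"})]
two_nf_a_keys = [{"Module"}]

print(violations(stream_fds, stream_keys, "3NF"))      # []
print(violations(stream_fds, stream_keys, "BCNF"))     # {Time} -> {Course}
print(violations(two_nf_a_fds, two_nf_a_keys, "3NF"))  # {Lecturer} -> {Dept}
```

Stream passes the 3NF test only because Course is a key attribute; dropping that exemption is exactly what makes BCNF stricter.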

Stream and BCNF

• Stream is not in BCNF as the FD {Time} → {Course} is non-trivial and {Time} does not contain a candidate key

Conversion to BCNF

Stream (Student, Course, Time) is split into two relations, (Student, Time) and (Course, Time).

Stream has been put into BCNF but we have lost the FD {Student, Course} → {Time}
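What "losing" an FD means in practice can be seen by inserting a row that neither decomposed table can reject. The sketch below is illustrative only: the variable names and the inserted row (John at 15:00) are assumptions, and the join is written inline for these two specific tables.

```python
# The two BCNF tables obtained by projecting Stream
student_time = {("John", "12:00"), ("Mary", "12:00"), ("Richard", "15:00"),
                ("Richard", "10:00"), ("Mary", "10:00"), ("Rebecca", "13:00")}
course_time = {("Databases", "12:00"), ("Databases", "15:00"),
               ("Programming", "10:00"), ("Programming", "13:00")}


def rebuild_stream(student_time, course_time):
    """Natural join on Time, giving (Student, Course, Time) rows."""
    return {(s, c, t) for s, t in student_time for c, t2 in course_time if t == t2}


def student_course_determines_time(rows):
    """Check the FD {Student, Course} -> {Time} on the joined rows."""
    times = {}
    for student, course, time in rows:
        if times.setdefault((student, course), time) != time:
            return False
    return True


before = student_course_determines_time(rebuild_stream(student_time, course_time))

# Hypothetical insert: John is also assigned to the 15:00 stream.
# (Student, Time) has no non-trivial FD to break and (Course, Time) is unchanged,
# so neither table on its own can detect a problem.
student_time.add(("John", "15:00"))
after = student_course_determines_time(rebuild_stream(student_time, course_time))

print(before, after)  # True False
```

After the insert John is in two Databases streams, which the original single-table key {Student, Course} would have prevented; enforcing it now requires a check across both tables.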

Decomposition Properties

• Lossless: Data should not be lost or created when splitting relations up
• Dependency preservation: It is desirable that FDs are preserved when splitting relations up
• Normalisation to 3NF is always lossless and dependency preserving
• Normalisation to BCNF is lossless, but may not preserve all dependencies

Higher Normal Forms

• BCNF is as far as we can go with FDs
• Higher normal forms are based on other sorts of dependency
• Fourth normal form removes multi-valued dependencies
• Fifth normal form removes join dependencies

[Figure: nested sets, 1NF Relations ⊃ 2NF Relations ⊃ 3NF Relations ⊃ BCNF Relations ⊃ 4NF Relations ⊃ 5NF Relations]

Denormalisation

• Normalisation
  • Removes data redundancy
  • Solves INSERT, UPDATE, and DELETE anomalies
  • This makes it easier to maintain the information in the database in a consistent state
• However
  • It leads to more tables in the database
  • Often these need to be joined back together, which is expensive to do
  • So sometimes (not often) it is worth 'denormalising'
• You might want to denormalise if
  • Database speeds are unacceptable (not just a bit slow)
  • There are going to be very few INSERTs, UPDATEs, or DELETEs
  • There are going to be lots of SELECTs that involve the joining of tables

Address (Number, Street, City, Postcode)
  Not normalised since {Postcode} → {City}

Address1 (Number, Street, Postcode)
Address2 (Postcode, City)
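The trade-off in the Address example can be tried out with Python's built-in sqlite3 module. This is a rough sketch: the table and column names come from the slide, but every data value (street, postcode, city) is made up for illustration, and nothing here measures actual query speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised schema from the lecture; all data values are made up.
    CREATE TABLE Address1 (Number INTEGER, Street TEXT, Postcode TEXT);
    CREATE TABLE Address2 (Postcode TEXT PRIMARY KEY, City TEXT);
    INSERT INTO Address2 VALUES ('AB1 2CD', 'Sometown');
    INSERT INTO Address1 VALUES (1, 'Example Road', 'AB1 2CD');
    INSERT INTO Address1 VALUES (2, 'Example Road', 'AB1 2CD');
    -- Denormalised table: City is repeated on every row with the same Postcode.
    CREATE TABLE Address AS
        SELECT Number, Street, City, Postcode FROM Address1 NATURAL JOIN Address2;
""")

# The same question needs a join on the normalised tables but not on Address.
with_join = conn.execute(
    "SELECT Number, Street, City FROM Address1 NATURAL JOIN Address2 ORDER BY Number"
).fetchall()
without_join = conn.execute(
    "SELECT Number, Street, City FROM Address ORDER BY Number"
).fetchall()
print(with_join == without_join)  # True

# The price of denormalising: renaming the city touches more rows.
rows_normalised = conn.execute(
    "UPDATE Address2 SET City = 'Othertown' WHERE Postcode = 'AB1 2CD'").rowcount
rows_denormalised = conn.execute(
    "UPDATE Address SET City = 'Othertown' WHERE Postcode = 'AB1 2CD'").rowcount
print(rows_normalised, rows_denormalised)  # 1 2
```

Reads get simpler and writes get riskier, which is why the slide recommends denormalising only when SELECTs dominate and updates are rare.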

Lossless decomposition

• To normalise a relation, we used projections
• If R(A,B,C) satisfies A → B then we can project it on A,B and A,C without losing information
• Lossless decomposition: R = π_AB(R) ⋈ π_AC(R) where π_AB(R) is projection of R on AB and ⋈ is natural join.
• Reminder of projection: [Figure: relation R with columns A, B, C and its projection π_AB(R) with columns A, B]

Relational algebra reminder: selection

R
A  B  C  D
1  x  c  c
2  y  d  e
3  z  a  a
4  u  b  c
5  w  c  d

σ_C=D(R)
A  B  C  D
1  x  c  c
3  z  a  a

Relational algebra reminder: product

R1
A  B
1  x
2  y

R2
A  C
1  w
2  v
3  u

R1×R2
A  B  A  C
1  x  1  w
1  x  2  v
1  x  3  u
2  y  1  w
2  y  2  v
2  y  3  u

Connection to SQL

    SELECT A,B
    FROM R1, R2, R3
    WHERE (some property α holds)

translates into relational algebra

    π_A,B σ_α (R1×R2×R3)
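The translation from a SELECT-FROM-WHERE query into projection, selection and product can be sketched in a few lines of Python. This is an illustration, not the lecture's notation: the helper names `times`, `select` and `project` and the WHERE condition are assumptions, only two relations are used, and columns are addressed by position because both R1 and R2 have a column called A.

```python
from itertools import product

R1 = [(1, "x"), (2, "y")]            # columns A, B
R2 = [(1, "w"), (2, "v"), (3, "u")]  # columns A, C


def times(*relations):
    """Cartesian product: concatenate one row from each relation."""
    return [sum(rows, ()) for rows in product(*relations)]


def select(rows, predicate):
    """Selection: keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, positions):
    """Projection onto column positions; duplicates are dropped, order is kept."""
    return list(dict.fromkeys(tuple(row[i] for i in positions) for row in rows))


r1_x_r2 = times(R1, R2)  # columns R1.A, B, R2.A, C
print(len(r1_x_r2))      # 6

# SELECT DISTINCT R1.A, B FROM R1, R2 WHERE R2.A > R1.A   (hypothetical condition)
answer = project(select(r1_x_r2, lambda row: row[2] > row[0]), [0, 1])
print(answer)            # [(1, 'x'), (2, 'y')]
```

One detail worth noting: relational algebra works on sets, so the projection removes duplicates, whereas a plain SQL SELECT keeps them unless DISTINCT is given.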

Relational algebra: natural join

R1⋈R2 = π_R1.A,B,C σ_R1.A=R2.A (R1×R2)

R1
A  B
1  x
2  y

R2
A  C
1  w
2  v
3  u

R1⋈R2
A  B  C
1  x  w
2  y  v

When is decomposition lossless: Module → Lecturer

R
Module  Lecturer  Text
DBS     nza       CB
DBS     nza       UW
RDB     nza       UW
APS     rcb       B

π_Module,Lecturer R
Module  Lecturer
DBS     nza
RDB     nza
APS     rcb

π_Module,Text R
Module  Text
DBS     CB
DBS     UW
RDB     UW
APS     B

When decomposition is not lossless: no FD

S
First  Last   Age
John   Smith  20
John   Brown  30
Mary   Smith  20
Tom    Brown  10

π_First,Last S
First  Last
John   Smith
John   Brown
Mary   Smith
Tom    Brown

π_First,Age S
First  Age
John   20
John   30
Mary   20
Tom    10

π_First,Last S ⋈ π_First,Age S
First  Last   Age
John   Smith  20
John   Smith  30
John   Brown  20
John   Brown  30
Mary   Smith  20
Tom    Brown  10

The join contains two rows, (John, Smith, 30) and (John, Brown, 20), that are not in S.
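The spurious rows in this example can be computed directly. The following short Python sketch uses set comprehensions for the two projections and the join; the variable names are assumptions made for illustration.

```python
S = {("John", "Smith", 20), ("John", "Brown", 30),
     ("Mary", "Smith", 20), ("Tom", "Brown", 10)}

first_last = {(first, last) for first, last, _ in S}  # projection on First, Last
first_age = {(first, age) for first, _, age in S}     # projection on First, Age

# Natural join of the two projections on the shared attribute First
joined = {(first, last, age)
          for first, last in first_last
          for other, age in first_age
          if first == other}

spurious = joined - S
print(len(S), len(joined))  # 4 6
print(sorted(spurious))     # [('John', 'Brown', 20), ('John', 'Smith', 30)]
print(S <= joined)          # True: no original row is lost, but rows are created
```

The extra rows appear because First determines neither Last nor Age in S, so neither of the FDs that would make this decomposition lossless holds.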

Heath's theorem

• A relation R(A,B,C) that satisfies a functional dependency A → B can always be non-loss decomposed into its projections R1 = π_AB(R) and R2 = π_AC(R).

Proof.

• First we show that R ⊆ π_AB(R) ⋈_A π_AC(R). This actually holds for any relation; it does not have to satisfy A → B.
• Assume r ∈ R. We need to show r ∈ π_AB(R) ⋈_A π_AC(R). Since r ∈ R, r(A,B) ∈ π_AB(R) and r(A,C) ∈ π_AC(R). Since r(A,B) and r(A,C) have the same value for A, their join r(A,B,C) = r is in π_AB(R) ⋈_A π_AC(R).
• Now we show that π_AB(R) ⋈_A π_AC(R) ⊆ R. This only holds if R satisfies A → B.
• Assume r ∈ π_AB(R) ⋈_A π_AC(R).
• So, r(A,B) ∈ π_AB(R) and r(A,C) ∈ π_AC(R).
• By the definition of projection, if r(A,B) ∈ π_AB(R), then there is an s1 ∈ R such that s1(A,B) = r(A,B). Similarly, since r(A,C) ∈ π_AC(R), there is an s2 ∈ R such that s2(A,C) = r(A,C).
• Since s1(A,B) = r(A,B) and s2(A,C) = r(A,C), s1(A) = s2(A). So because of A → B, s1(B) = s2(B). This means that s2(B) = s1(B) = r(B), and since s2(A,C) = r(A,C), we get s2(A,B,C) = r. As s2 ∈ R, r ∈ R.

Normalisation in exams

• Consider a relation Book with attributes Author, Title, Publisher, City, Country, Year, ISBN. There are two candidate keys: ISBN and (Author, Title, Publisher, Year). City is the place where the book is published, and there are functional dependencies Publisher → City and City → Country. Is this relation in 2NF? Explain your answer. (4 marks)
• Is this relation in 3NF? Explain your answer. (5 marks)
• Is the relation above in BCNF? If not, decompose it to BCNF and explain why the resulting tables are in BCNF. (5 marks)

Next Lecture

• Physical DB Issues
  • RAID arrays for recovery and speed
  • Indexes and query efficiency
• Query optimisation
  • Query trees
• For more information
  • Connolly and Begg chapter 21 and appendix C.5, Ullman and Widom 5.2.8

Next Lecture

• More normalisation
  • Lossless decomposition; why our reduction to 2NF and 3NF is lossless
  • Boyce-Codd normal form (BCNF)
  • Higher normal forms
  • Denormalisation
• For more information
  • Connolly and Begg chapter 14
  • Ullman and Widom chapter 3.6
