Session 7 – Main Theme
Database Systems
Session 7 – Main Theme
Functional Dependencies and Normalization

Dr. Jean-Claude Franchitti
New York University
Computer Science Department
Courant Institute of Mathematical Sciences

Presentation material partially based on textbook slides for Fundamentals of Database Systems (6th Edition) by Ramez Elmasri and Shamkant Navathe, slides copyright © 2011, and on slides produced by Zvi Kedem, copyright © 2014

Agenda
1 Session Overview
2 Logical Database Design - Normalization
3 Normalization Process Detailed
4 Summary and Conclusion

Session Agenda
• Logical Database Design - Normalization
• Normalization Process Detailed
• Summary & Conclusion

What is the class about?
• Course description and syllabus:
  » http://www.nyu.edu/classes/jcf/CSCI-GA.2433-001
  » http://cs.nyu.edu/courses/spring15/CSCI-GA.2433-001/
• Textbooks:
  » Fundamentals of Database Systems (6th Edition)
    Ramez Elmasri and Shamkant Navathe
    Addison-Wesley
    ISBN-10: 0-1360-8620-9, ISBN-13: 978-0136086208
    6th Edition (04/10)

Icons / Metaphors
Information, Common Realization, Knowledge/Competency Pattern, Governance, Alignment, Solution Approach

Agenda
• Informal guidelines for good design
• Functional dependency
  » Basic tool for analyzing relational schemas
• Informal Design Guidelines for Relation Schemas
• Normalization:
  » 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
    • Normal Forms Based on Primary Keys
    • General Definitions of Second and Third Normal Forms
    • Boyce-Codd Normal Form
    • Multivalued Dependency and Fourth Normal Form
    • Join Dependencies and Fifth Normal Form

Logical Database Design
• We are given a set of tables specifying the database
  » The base tables, which probably are the community (conceptual) level
• They may have come from some ER diagram or from somewhere else
• We will need to examine whether the specific choice of tables is good for
  » Storing the information needed
  » Enforcing constraints
  » Avoiding anomalies, such as redundancies
• If there are issues to address, we may want to restructure the database, of course without losing any information
• Let us quickly review an example from a “long time ago”

A Fragment Of A Sample Relational Database

R
Name  SSN  DOB   Grade  Salary
A     121  2367  2      80
A     132  3678  3      70
B     101  3498  4      70
C     106  2987  2      80

Business rule (one among several):
• The value of Salary is determined only by the value of Grade
Comment:
• We keep track of the various Grades for more than just computing salaries, though we do not show it
• For instance, DOB and Grade together determine the number of vacation days, which may therefore be different for SSN 121 and 106

Anomalies

Name  SSN  DOB   Grade  Salary
A     121  2367  2      80
A     132  3678  3      70
B     101  3498  4      70
C     106  2987  2      80

• “Grade = 2 implies Salary = 80” is written twice
• There are additional problems with this design.
  » We are unable to store the salary structure for a Grade that does not currently exist for any employee.
  » For example, we cannot store that Grade = 1 implies Salary = 90
  » For example, if the employee with SSN = 132 leaves, we forget which Salary should be paid to an employee with Grade = 3
  » We could perhaps invent a fake employee with such a Grade and such a Salary, but this brings up additional problems, e.g., what is the SSN of such a fake employee? It cannot be NULL, as SSN is the primary key

Better Representation Of Information
• The problem can be solved by replacing

R
Name  SSN  DOB   Grade  Salary
A     121  2367  2      80
A     132  3678  3      70
B     101  3498  4      70
C     106  2987  2      80

• by two tables

S
Name  SSN  DOB   Grade
A     121  2367  2
A     132  3678  3
B     101  3498  4
C     106  2987  2

T
Grade  Salary
2      80
3      70
4      70

Decomposition
• SELECT Name, SSN, DOB, Grade INTO S FROM R;
• SELECT Grade, Salary INTO T FROM R;

Better Representation Of Information
• And now we can
  » Store “Grade = 3 implies Salary = 70”, even after the last employee with this Grade leaves
  » Store “Grade = 1 implies Salary = 90”, planning for hiring employees with Grade = 1, while we do not yet have any employees with this Grade

S
Name  SSN  DOB   Grade
A     121  2367  2
B     101  3498  4
C     106  2987  2

T
Grade  Salary
1      90
2      80
3      70
4      70

No Information Was Lost
• Given S and T, we can reconstruct R using natural join

S
Name  SSN  DOB   Grade
A     121  2367  2
A     132  3678  3
B     101  3498  4
C     106  2987  2

T
Grade  Salary
2      80
3      70
4      70

SELECT Name, SSN, DOB, S.Grade AS Grade, Salary
INTO R
FROM T, S
WHERE T.Grade = S.Grade;

R
Name  SSN  DOB   Grade  Salary
A     121  2367  2      80
A     132  3678  3      70
B     101  3498  4      70
C     106  2987  2      80

Natural Join (Briefly, More Later)
• Given several tables, say R1, R2, …, Rn, their natural join is computed using the following “template”:
  SELECT one copy of each column name
  INTO R
  FROM R1, R2, …, Rn
  WHERE equal-named columns have to be equal
• The intuition is that R was “decomposed” into R1, R2, …, Rn by appropriate SELECT statements, and now we are putting it back together

Comment On Decomposition
• It does not matter whether we remove duplicate rows
• But some systems insist that a row cannot appear more than once with a specific value of a primary key
• So this would be OK for such a system

T
Grade  Salary
2      80
3      70
4      70

• This would not be OK for such a system

T
Grade  Salary
2      80
3      70
4      70
2      80

Comment On Decomposition
• We can always make sure, in a system in which DISTINCT is allowed, that there are no duplicate rows by writing
  SELECT DISTINCT Grade, Salary INTO T FROM R;
• And similarly elsewhere

Natural Join And Lossless Join Decomposition
• Natural Join is:
  » Cartesian join with condition of equality on corresponding columns
  » Only one copy of each column is kept
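The decomposition of R into S and T, and its reconstruction by natural join, can be tried out with a small script. This is only a sketch under stated assumptions: it uses Python's standard sqlite3 module with an in-memory database, and because SQLite has no SELECT INTO, CREATE TABLE ... AS SELECT stands in for it. The variable names (rebuilt, original, lossless, t_rows) are illustrative and do not come from the slides.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for "the system".
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table R from the running example, with SSN as the primary key.
cur.execute(
    "CREATE TABLE R (Name TEXT, SSN INTEGER PRIMARY KEY, "
    "DOB INTEGER, Grade INTEGER, Salary INTEGER)"
)
cur.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 121, 2367, 2, 80),
        ("A", 132, 3678, 3, 70),
        ("B", 101, 3498, 4, 70),
        ("C", 106, 2987, 2, 80),
    ],
)

# Decompose R into S and T (CREATE TABLE AS replaces SELECT INTO here).
# DISTINCT keeps a single row per Grade in T.
cur.execute("CREATE TABLE S AS SELECT Name, SSN, DOB, Grade FROM R")
cur.execute("CREATE TABLE T AS SELECT DISTINCT Grade, Salary FROM R")

# Natural join: rows are matched on the equally named column Grade,
# and only one copy of Grade appears in the result.
rebuilt = cur.execute(
    "SELECT Name, SSN, DOB, Grade, Salary FROM S NATURAL JOIN T ORDER BY SSN"
).fetchall()
original = cur.execute(
    "SELECT Name, SSN, DOB, Grade, Salary FROM R ORDER BY SSN"
).fetchall()

# The join of S and T gives back exactly the rows of R for this data.
lossless = rebuilt == original

t_rows = cur.execute("SELECT Grade, Salary FROM T ORDER BY Grade").fetchall()
print(t_rows)    # [(2, 80), (3, 70), (4, 70)]
print(lossless)  # True
```

Each Grade-to-Salary rule now appears once in T, and the comparison confirms that, for this particular data, joining S and T returns exactly the rows of R.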
• “Lossless join decomposition” is another term for information not being lost; that is, we can reconstruct the original table by “combining” information from the two new tables by means of natural join
• This does not necessarily always hold
• We will have more material about this later
• Here we just observe that our decomposition satisfied this condition, at least in our example

Elaboration On “Corresponding Columns” (Using Semantically “Equal” Columns)
• Some suggest that no two columns in the database should have the same name, to avoid confusion; we would then have columns and a join similar to these

S
S_Name  S_SSN  S_DOB  S_Grade
A       121    2367   2
A       132    3678   3
B       101    3498   4
C       106    2987   2

T
T_Grade  T_Salary
2        80
3        70
4        70

SELECT S_Name AS R_Name, S_SSN AS R_SSN, S_DOB AS R_DOB, S_Grade AS R_Grade, T_Salary AS R_Salary
INTO R
FROM T, S
WHERE T_Grade = S_Grade;

R
R_Name  R_SSN  R_DOB  R_Grade  R_Salary
A       121    2367   2        80
A       132    3678   3        70
B       101    3498   4        70
C       106    2987   2        80

Mathematical Notation For Natural Join (We Will Use Sparingly)
• There is a special mathematical symbol for natural join
• It is not part of SQL, of course, which only allows standard ANSI font
• In mathematical, relational algebra notation, natural join of two tables is denoted by a bow-like symbol, ⋈ (this symbol appears only in special mathematical fonts, so a substitute may be used in these notes instead)
• So we have: R = S ⋈ T
• It is used when “corresponding columns” means “equal columns”

Revisiting The Problem
• Let us look at

R
Name  SSN  DOB   Grade  Salary
A     121  2367  2      80
A     132  3678  3      70
B     101  3498  4      70
C     106  2987  2      80
A     132  3678  3      70
B     101  3498  4      70

• The problem is not that there are duplicate rows
• The problem is the same as before: the business rule assigning Salary to Grade is written a number of times
• So how can we “generalize” the problem?

Stating The Problem In General

R
Name  SSN  DOB   Grade  Salary
A     121  2367  2      80
A     132  3678  3      70
B     101  3498  4      70
C     106  2987  2      80
A     132  3678  3      70
B     101  3498  4      70

• We have a problem whenever we have two sets of columns X and Y (here X is just Grade and Y is just Salary), such that
  1. X does not contain a key, either primary or unique (thus there could be several/many non-identical rows with the same value of X)
  2. Whenever two rows are equal on X, they must be equal on Y
• Why a problem: the business rule specifying how X “forces” Y is “embedded” in different rows and therefore
  » Inherently written redundantly
  » Cannot be stored by itself

What Did We Do? Think X = Grade And Y = Salary
• We had a table with columns
  U  X  V  Y  W
• We replaced this one table by two tables with columns
  U  X  V  W
  X  Y

Goodness of Relational Schemas
• Levels at which we can discuss goodness of relation schemas
  » Logical (or conceptual) level
  » Implementation (or physical storage) level
• Approaches to database design:
  » Bottom-up or top-down

Informal Design Guidelines for Relation Schemas
• Measures of quality
  » Making sure attribute semantics are clear
  » Reducing redundant information in tuples
  » Reducing NULL values in tuples
  » Disallowing possibility of generating spurious tuples

Imparting Clear Semantics to Attributes in Relations
• Semantics of a relation
  » Meaning resulting from interpretation of attribute values in a tuple