XML Normal Form (XNF)
Ryan Marcotte
www.cs.uregina.ca/~marcottr
CS 475 (Advanced Topics in Databases)
March 14, 2011

Outline
- Introduction to XNF and motivation for its creation
- Analysis of XNF’s link to BCNF
- Algorithm for converting a DTD to XNF
- Example

Introduction
- XML is used for data storage and exchange
- Data is stored in a hierarchical fashion
- Duplicates and inconsistencies may exist in the data store
- Relational databases store data according to some schema
- XML also stores data according to some schema, such as a Document Type Definition (DTD)
- Some schemas are better than others
- A normal form is needed that eliminates redundancy, which reduces the storage required and keeps the data consistent
- XNF was proposed by Marcelo Arenas and Leonid Libkin (University of Toronto) in a 2004 paper titled “A Normal Form for XML Documents”
- They recognized a need for good XML data design as “a lot of data is being put on the web”
- “Once massive web databases are created, it is very hard to change their organization; thus, there is a risk of having large amounts of widely accessible, but at the same time poorly organized legacy data.”
- XNF provides a set of rules that describe well-designed DTDs
- Poorly designed DTDs can be transformed into well-designed ones (through normalization, just like relational databases!)
- Well-designed DTDs avoid redundancies and update anomalies

Review of Basic Terms
- Recall the definition of functional dependencies (FDs)
- Given a relation schema R, a set of attributes X is said to functionally determine another set of attributes Y (also in R), written X → Y, if and only if each value of X is associated with exactly one value of Y
- F+ is the closure of a set of FDs F, that is, all FDs that can be derived from F using Armstrong’s axioms:
    reflexivity (if Y ⊆ X, then X → Y)
    augmentation (if X → Y, then XZ → YZ)
    transitivity (if X → Y and Y → Z, then X → Z)
- Every set of FDs has a canonical cover (a minimal set of FDs from which all other FDs in the closure can be derived using the above axioms)
- An element represents a node in the XML tree and includes everything from its start tag to its end tag
- An attribute provides additional information about an element; in paths, attribute names are prefixed with @
- A path in an XML document is a sequence of element names separated by periods, ending with an element name, an attribute name, or the symbol S

    <!DOCTYPE student_list [
      <!ELEMENT student_list (student)*>
      <!ELEMENT student (first_name, last_name)>
      <!ELEMENT first_name (#PCDATA)>
      <!ELEMENT last_name (#PCDATA)>
      <!ATTLIST student id CDATA #REQUIRED>
    ]>

- For example:
    student_list.student.first_name.S
    student_list.student.@id
- The term S represents a string value (corresponding to the #PCDATA keyword in the DTD)
- For example, if the element name is first_name and the element is <first_name>Paul</first_name>, then S = Paul
- Redundancy occurs when the same piece of data is stored more than once
- Update anomalies take two forms:
    Because data for an element is stored multiple times, updating one copy creates an inconsistency
    Removing an element may remove the data it carries from the document entirely
- Examples of the above will be given later in the presentation

Boyce-Codd Normal Form
- A relational schema is in BCNF if and only if for every one of its nontrivial FDs X → Y, X is a superkey (X is either a candidate key or a superset thereof)
- Simply speaking, each value of X identifies a single tuple, so the value of Y it determines is stored only once (no redundancy)
- Note that the number of attributes in the key X should be minimized for ease of identification among individual tuples
- Examples:
    sid, first_name, last_name → age (BAD: not minimum size)
    sid → first_name, last_name, age (GOOD: only one attribute)
    cid → course_name, semester_offered
    course_name → course_description

XNF Versus BCNF
- XNF generalizes Boyce-Codd Normal Form
- XNF disallows redundancy-causing FDs

XML Normal Form
- Let P1 and P2 be paths in an XML document
- A DTD D with a set of FDs F is in XNF if and only if, for every nontrivial FD implied by F of the form P1 → P2.@a (where @a is an attribute) or P1 → P2.E (where E is an element), P1 → P2 is also implied by F
- In layman’s terms, each value of P1 determines a single node P2, so whatever is stored below P2 is stored only once
- This is remarkably similar to our definition of BCNF!
- In fact, a relational database schema is in BCNF if and only if its XML schema equivalent is in XNF (this will not be proven here)

    <!DOCTYPE student_list [
      <!ELEMENT student_list (student)*>
      <!ELEMENT student (first_name, last_name)>
      <!ELEMENT first_name (#PCDATA)>
      <!ELEMENT last_name (#PCDATA)>
      <!ATTLIST student id CDATA #REQUIRED>
    ]>

    student_list.student.@id → student_list.student.first_name.S, student_list.student.last_name.S

Relational Schema to XML
- Let R be a relation over attributes A, B, C
- The schema R(A, B, C) with FD A → BC translates to:

    <!ELEMENT db (G*)>
    <!ELEMENT G EMPTY>
    <!ATTLIST G
      A CDATA #REQUIRED
      B CDATA #REQUIRED
      C CDATA #REQUIRED>

- ... with FD db.G.@A → db.G.@B, db.G.@C

Usage
- The following algorithm should be used in the design stage of XML database creation
- Once data exists in the XML database, modifying the schema can be tedious and difficult (and modifications done by hand may introduce errors)

Assumptions
- DTDs are assumed to be nonrecursive (recursive DTDs lead to an infinite number of paths)
- Note that we can allow for recursion by considering that FDs only specify a finite number of paths, so we can restrict our attention to a finite number of ‘unfoldings’ of the recursive rules
- FDs are assumed to have at least one element path on the left-hand side of the rule (that is, FDs are of the form { p, p1.@a1, p1.@a2, ..., p1.@an } → q)

Basic Operations
- Move attributes / child elements from an existing element to another one
- Create a new element type

Algorithm
- Given a DTD D and set of FDs F:
    If (D, F) is in XNF, return
    Otherwise, find an anomalous FD and use the two basic operations to modify D to eliminate the anomalous FD
    Repeat the above; the first step
will cause the algorithm to terminate once (D, F) is in XNF
- Just like other normalization algorithms (for 1NF, 2NF, 3NF, and BCNF), the algorithm:
    Is simple
    Decomposes the schema into separate data structures (tables for relational databases, trees for XML)
    Is lossless (the original data can be reconstructed from the decomposed schema)
- The algorithm always terminates; this will not be proven here

Example Schema

    <!DOCTYPE courses [
      <!ELEMENT courses (course*)>
      <!ELEMENT course (title, taken_by)>
      <!ATTLIST course cno CDATA #REQUIRED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT taken_by (student*)>
      <!ELEMENT student (name, grade)>
      <!ATTLIST student sid CDATA #REQUIRED>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT grade (#PCDATA)>
    ]>

- FDs:
    courses.course.@cno → courses.course
    { courses.course, courses.course.taken_by.student.@sid } → courses.course.taken_by.student
    courses.course.taken_by.student.@sid → courses.course.taken_by.student.name.S
- The previous FDs enforce the following constraints:
    A course number uniquely identifies a course
    Two distinct students of the same course cannot have the same student ID
    Two students with the same student ID must have the same name

Schema Problems
- Or do they? Consider the third FD:
    courses.course.taken_by.student.@sid → courses.course.taken_by.student.name.S
- For the schema to be in XNF, the following must also hold:
    courses.course.taken_by.student.@sid → courses.course.taken_by.student.name
- It does not. Why?
- A single @sid value can appear under several student nodes, each with its own name node, so @sid does not determine a unique name node!
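The violation described above can be reproduced on a small document. The following is a minimal sketch, not part of the original slides: the sample data, the course titles, and the helper names (names_by_sid, fd_holds_on_values, fd_holds_on_nodes) are all hypothetical, and the student attribute is assumed to be named sid as in the FDs. It checks the third FD twice, once on string values (name.S) and once on name nodes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document following the example DTD.
# Student s1 takes two courses, so the name "Paul" is stored twice.
SAMPLE = """
<courses>
  <course cno="CS475">
    <title>Advanced Topics in Databases</title>
    <taken_by>
      <student sid="s1"><name>Paul</name><grade>A</grade></student>
    </taken_by>
  </course>
  <course cno="CS310">
    <title>Discrete Structures</title>
    <taken_by>
      <student sid="s1"><name>Paul</name><grade>B</grade></student>
      <student sid="s2"><name>Anna</name><grade>A</grade></student>
    </taken_by>
  </course>
</courses>
""".strip()


def names_by_sid(root):
    """Group the name nodes on courses.course.taken_by.student.name by @sid."""
    groups = defaultdict(list)
    for student in root.findall("./course/taken_by/student"):
        groups[student.get("sid")].append(student.find("name"))
    return groups


def fd_holds_on_values(groups):
    """@sid -> name.S: every name node for a given sid has the same text."""
    return all(len({node.text for node in nodes}) == 1
               for nodes in groups.values())


def fd_holds_on_nodes(groups):
    """@sid -> name: a given sid leads to exactly one name node."""
    return all(len({id(node) for node in nodes}) == 1
               for nodes in groups.values())


root = ET.fromstring(SAMPLE)
groups = names_by_sid(root)
print(fd_holds_on_values(groups))  # True: the stored names agree
print(fd_holds_on_nodes(groups))   # False: s1 has two separate name nodes
```

The first check passes only because both copies of the name happen to agree; the second fails because the copies exist at all, which is the redundancy that XNF rules out.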
- Under the current schema, an update can violate the third FD
- This is because a copy of the name element is stored for every course a student takes; changing the name in one place introduces an inconsistency
- Also, deleting a student from a course could remove that student from the database if that is the only copy of the student’s information
- The above two points are examples of update anomalies

Using the Algorithm
- Fix by creating a new element type student_info with @sid as its key
- Move the name element from the student element to the student_info element
- Though it is not part of the algorithm, we will rename the root element from “courses” to “db” (database) to better reflect intended semantics

    <!DOCTYPE db [
      <!ELEMENT db (course*, student_info*)>
      <!ELEMENT course (title, taken_by)>
      <!ATTLIST course