Ryan Marcotte
www.cs.uregina.ca/~marcottr
CS 475 (Advanced Topics in )
March 14, 2011

Outline
- Introduction to XNF and motivation for its creation
- Analysis of XNF’s link to BCNF
- Algorithm for converting a DTD to XNF
- Example

Introduction
- XML is used for data storage and exchange
- Data is stored in a hierarchical fashion
- Duplicates and inconsistencies may occur in the data store

Introduction
- Relational databases store data according to some schema
- XML also stores data according to some schema, such as a Document Type Definition (DTD)
- Some schemas are better than others
- A normal form is needed that reduces the amount of storage needed while ensuring consistency and eliminating redundancy

Introduction
- XNF was proposed by Marcelo Arenas and Leonid Libkin (University of Toronto) in a 2004 paper titled “A Normal Form for XML Documents”
- The authors recognized a need for good XML data design, as “a lot of data is being put on the web”
- “Once massive web databases are created, it is very hard to change their organization; thus, there is a risk of having large amounts of widely accessible, but at the same time poorly organized legacy data.”

Introduction
- XNF provides a set of rules that describe well-designed DTDs
- Poorly designed DTDs can be transformed into well-designed ones (through normalization, just like relational databases)
- Well-designed DTDs avoid redundancies and update anomalies

Review of Basic Terms
- Recall the definition of functional dependencies (FDs)
- Given a schema R, a set of attributes X is said to functionally determine another set of attributes Y (also in R), written X → Y, if and only if each value of X is associated with exactly one value of Y
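The FD definition can be checked mechanically on a concrete table. The following is a minimal sketch, assuming a relation is represented as a list of Python dicts; the function name `fd_holds` and the sample rows are hypothetical and not taken from the slides.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val the first time key is seen and
        # returns the stored value on every later occurrence.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample data.
students = [
    {"sid": 1, "first_name": "Paul", "age": 20},
    {"sid": 2, "first_name": "Anna", "age": 20},
    {"sid": 1, "first_name": "Paul", "age": 20},
]

print(fd_holds(students, ["sid"], ["first_name"]))  # True
print(fd_holds(students, ["age"], ["first_name"]))  # False
```

Note that a check like this can only refute an FD on given data; whether an FD is a design constraint is decided by the schema designer.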

Review of Basic Terms
- F+ is the closure of a set of FDs F: all FDs that can be derived from F using Armstrong’s axioms:
  - reflexivity (if Y ⊆ X, then X → Y)
  - augmentation (if X → Y, then XZ → YZ)
  - transitivity (if X → Y and Y → Z, then X → Z)
- Every set of FDs has a canonical cover (a minimal set of FDs from which all other FDs in the closure can be derived using the above axioms)
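In practice, membership in F+ is usually tested without enumerating F+, by computing the closure of an attribute set. The sketch below uses that standard technique; the names `attribute_closure` and `implies` and the sample FDs are illustrative assumptions.

```python
def attribute_closure(attrs, fds):
    """Return all attributes determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs of attribute sets.
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side is already determined, so is the right.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def implies(fds, lhs, rhs):
    """Return True if lhs -> rhs is in the closure of fds."""
    return set(rhs) <= attribute_closure(lhs, fds)


# Hypothetical FDs: A -> B and B -> C.
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]

print(sorted(attribute_closure({"A"}, fds)))  # ['A', 'B', 'C']
print(implies(fds, {"A"}, {"C"}))             # True (transitivity)
print(implies(fds, {"C"}, {"A"}))             # False
```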

Review of Basic Terms
- An element represents a node in the XML tree and includes everything from its start tag to its end tag
- An attribute provides additional information about an element; in path notation, attribute names are prefixed with @
- A path in an XML document is a sequence of element names separated by periods, ending with an element name or an attribute name

Review of Basic Terms
[DTD listing missing from source]

For example:
- student_list.student.first_name.S
- student_list.student.@id

Review of Basic Terms
- The term S represents a string value (corresponding to the #PCDATA keyword in the DTD)
- For example, if the element name is first_name and the element is <first_name>Paul</first_name>, then S = Paul
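The path notation can be made concrete by listing every path of a small document. This is a rough sketch using Python's standard `xml.etree.ElementTree`; the function name `paths`, the sample document, and the id value `s1` are assumptions, with only the element and attribute names taken from the example paths above.

```python
import xml.etree.ElementTree as ET


def paths(elem, prefix=""):
    """Yield (path, value) pairs; value is None for element paths."""
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    yield path, None
    # Attribute paths end in @name.
    for name, value in elem.attrib.items():
        yield f"{path}.@{name}", value
    # Text content is reported under the special step S.
    text = (elem.text or "").strip()
    if text:
        yield f"{path}.S", text
    for child in elem:
        yield from paths(child, path)


# Hypothetical document that matches the example paths.
doc = ET.fromstring(
    "<student_list>"
    '<student id="s1"><first_name>Paul</first_name></student>'
    "</student_list>"
)

for p, v in paths(doc):
    print(p, v)
```

Among the pairs printed are `student_list.student.@id` with value `s1` and `student_list.student.first_name.S` with value `Paul`.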

Review of Basic Terms
- Redundancy occurs when data corresponding to a single element is stored more than once
- Update anomalies take two forms:
  - Because data for an element is stored multiple times, updating only one copy creates an inconsistency
  - Removing an element may remove the only copy of its data, so that the information disappears from the document entirely
- Examples of the above will be given later in the presentation

Boyce-Codd Normal Form
- A relational schema is in BCNF if and only if, for every one of its nontrivial FDs X → Y, X is a superkey (X is either a candidate key or a superset thereof)
- Simply speaking, each distinct value of X has exactly one value of Y (no redundancy)
- Note that the number of attributes in the key X should be minimized for ease of identification among individual tuples

Boyce-Codd Normal Form
Examples:
- sid, first_name, last_name → age (BAD: not minimum size)
- sid → first_name, last_name, age (GOOD: only one attribute)
- cid → course_name, semester_offered
- course_name → course_description
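The superkey condition can be tested with attribute closures. The sketch below assumes, purely for illustration, that the last two example FDs belong to a single course relation with the four attributes shown; the slides do not state this, and the function names are hypothetical.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Return the nontrivial FDs in fds whose left-hand side is not a superkey."""
    bad = []
    for lhs, rhs in fds:
        nontrivial = not set(rhs) <= set(lhs)
        is_superkey = closure(lhs, fds) == set(all_attrs)
        if nontrivial and not is_superkey:
            bad.append((lhs, rhs))
    return bad


# Assumption: both FDs apply to one relation over these four attributes.
attrs = {"cid", "course_name", "semester_offered", "course_description"}
fds = [
    ({"cid"}, {"course_name", "semester_offered"}),
    ({"course_name"}, {"course_description"}),
]

print(bcnf_violations(attrs, fds))
```

Under that assumption only course_name → course_description is reported, since course_name does not determine cid or semester_offered.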

XNF Versus BCNF
- XNF generalizes Boyce-Codd Normal Form
- XNF disallows redundancy-causing FDs

XML Normal Form
- Let P1 and P2 be paths in an XML document
- A DTD D together with its set of FDs F is in XNF if and only if, for every nontrivial FD in F+ of the form P1 → P2.@a (where @a is an attribute) or P1 → P2.E (where E is an element), it is the case that P1 → P2 is also in F+ (that is, implied by F)

XML Normal Form
- In layman’s terms, for distinct values of P1, there is only one value for P2
- This is remarkably similar to our definition of BCNF!
- In fact, a schema is in BCNF if and only if its XML schema equivalent is in XNF (this will not be proven here)

XML Normal Form
[DTD listing missing from source]

student_list.student.@id → student_list.student.first_name.S, student_list.student.last_name.S

Relational Schema to XML
Let R be a relation over attributes A, B, C.
The schema R(A, B, C) with FD A → BC translates to:

... with FD db.G.@A → db.G.@B, db.G.@C
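The paths in this FD suggest an encoding with a root element db whose G children each carry one tuple as attributes. The sketch below builds such a document from rows and checks the translated FD. The exact DTD is an assumption inferred from the paths, and the helper names and sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET


def relation_to_xml(rows, root_tag="db", tuple_tag="G"):
    """Encode each row as one child element whose attributes hold the tuple."""
    root = ET.Element(root_tag)
    for row in rows:
        ET.SubElement(root, tuple_tag, {k: str(v) for k, v in row.items()})
    return root


def xml_fd_holds(root, tuple_tag, lhs, rhs):
    """Check an FD between attributes of the tuple elements."""
    seen = {}
    for g in root.iter(tuple_tag):
        key = tuple(g.get(a) for a in lhs)
        val = tuple(g.get(a) for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical instance of R(A, B, C) satisfying A -> BC.
rows = [{"A": 1, "B": "x", "C": "y"}, {"A": 2, "B": "x", "C": "z"}]
db = relation_to_xml(rows)

print(ET.tostring(db, encoding="unicode"))
print(xml_fd_holds(db, "G", ["A"], ["B", "C"]))  # True
print(xml_fd_holds(db, "G", ["B"], ["C"]))       # False
```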

Usage
- The following algorithm should be applied in the design stage of XML database creation
- Once data exists in the XML database, it can be very tedious and/or difficult to modify the schema (also, errors may be introduced if the database modifications are done by hand)

Assumptions
- DTDs are assumed to be nonrecursive (recursive DTDs lead to an infinite number of paths)
- Note that recursion can still be allowed: FDs only specify a finite number of paths, so we can restrict our attention to a finite number of ‘unfoldings’ of the recursive rules
- FDs are assumed to have at most one element path on the left-hand side of the rule (that is, FDs are of the form { p, p1.@a1, p1.@a2, ..., p1.@an } → q)

Basic Operations
- Move attributes / child elements from an existing element to another one
- Create a new element type

Algorithm
Given a DTD D and set of FDs F:
- If (D, F) is in XNF, return
- Otherwise, find an anomalous FD and use the two basic operations to modify D to eliminate the anomalous FD
- Repeat the above; the first step will cause the algorithm to terminate once (D, F) is in XNF

Algorithm
- Just like other normalization algorithms (for 1NF, 2NF, 3NF, and BCNF), the algorithm:
  - Is simple
  - Decomposes the schema into separate data structures (tables for relational databases, trees for XML)
  - Preserves the FDs and is lossless (no information is lost)
- The algorithm always terminates; this will not be proven here

Example Schema
[DTD listing missing from source]

FDs:
- courses.course.@cno → courses.course
- { courses.course, courses.course.taken_by.student.@sid } → courses.course.taken_by.student
- courses.course.taken_by.student.@sid → courses.course.taken_by.student.name.S

Example Schema
The previous FDs enforce the following constraints:
- A course ID uniquely identifies a course
- Two distinct students of the same course cannot have the same student ID
- Two students with the same student ID must have the same name

Schema Problems
Or do they? Consider the third FD:
courses.course.taken_by.student.@sid → courses.course.taken_by.student.name.S

By XNF, the following must hold: courses.course.taken_by.student.@sid → courses.course.taken_by.student.name

It does not. Why?

Schema Problems

A single @sid identifies two distinct paths! The same student appears under every course taken, so the tree holds one copy of the name element per enrolment.

Schema Problems
- The third FD can be violated under the current schema
- This is because multiple copies of the name element are stored for each unique @sid; because of this, changing a value in one place introduces inconsistency
- Also, deleting student information from a course could remove that student from the database if only one copy of that student’s information exists
- The above two points are examples of update anomalies
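The anomaly can be reproduced on a small document. The following sketch assumes a document shaped like the paths in the example FDs; the course numbers, student data, and function names are hypothetical. It separates the value-level FD (@sid → name.S) from the node-level FD that XNF requires (@sid → name).

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: student s1 is enrolled in two courses.
doc = ET.fromstring("""
<courses>
  <course cno="cs475">
    <taken_by>
      <student sid="s1"><name>Paul</name></student>
    </taken_by>
  </course>
  <course cno="cs310">
    <taken_by>
      <student sid="s1"><name>Paul</name></student>
      <student sid="s2"><name>Anna</name></student>
    </taken_by>
  </course>
</courses>
""")


def names_by_sid(root):
    """Map each @sid to the list of name elements reachable from it."""
    found = {}
    for student in root.iter("student"):
        found.setdefault(student.get("sid"), []).append(student.find("name"))
    return found


def value_fd_holds(root):
    """@sid -> name.S: equal sids carry equal name strings."""
    return all(len({n.text for n in nodes}) == 1
               for nodes in names_by_sid(root).values())


def node_fd_holds(root):
    """@sid -> name: each sid reaches exactly one name node."""
    return all(len(nodes) == 1 for nodes in names_by_sid(root).values())


before = (value_fd_holds(doc), node_fd_holds(doc))
print(before)  # (True, False): the strings agree, but s1 has two name nodes

# Update anomaly: change only the first copy of the name for s1.
doc.find("course/taken_by/student/name").text = "Pauline"
after = value_fd_holds(doc)
print(after)   # False: the two copies now disagree
```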

Using the Algorithm
- Fix by creating a new element type student_info with @sid as its key
- Move the name element from the student element to the student_info element
- Though it is not part of the algorithm, we will rename the root element from “courses” to “db” (database) to better reflect the intended semantics

Using the Algorithm
[DTD listing missing from source]

FDs:
- db.course.@cno → db.course
- { db.course, db.course.taken_by.student.@sid } → db.course.taken_by.student
- db.course.taken_by.student.@sid → db.student_info.name.S
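The same two operations can be applied to a document instance to see their effect on the data. This is only a sketch under stated assumptions: the input document, the function name `to_xnf`, and placing student_info directly under db are illustrative, and the code assumes the value-level FD @sid → name.S already holds, so keeping the first name seen for each @sid loses nothing.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical instance of the original schema.
courses = ET.fromstring("""
<courses>
  <course cno="cs475">
    <taken_by>
      <student sid="s1"><name>Paul</name></student>
    </taken_by>
  </course>
  <course cno="cs310">
    <taken_by>
      <student sid="s1"><name>Paul</name></student>
      <student sid="s2"><name>Anna</name></student>
    </taken_by>
  </course>
</courses>
""")


def to_xnf(root):
    """Return a new db tree with each name stored once per @sid."""
    db = ET.Element("db")
    infos = {}  # sid -> student_info element
    for course in root.findall("course"):
        course = copy.deepcopy(course)  # leave the input untouched
        for student in course.findall("taken_by/student"):
            name = student.find("name")
            if name is None:
                continue
            student.remove(name)  # move name out of student
            sid = student.get("sid")
            if sid not in infos:  # new element type, keyed by @sid
                info = ET.Element("student_info", {"sid": sid})
                info.append(name)
                infos[sid] = info
        db.append(course)
    for info in infos.values():
        db.append(info)
    return db


db = to_xnf(courses)
print([(i.get("sid"), i.find("name").text) for i in db.findall("student_info")])
# [('s1', 'Paul'), ('s2', 'Anna')]
```

After the transformation no student element contains a name, so updating a name touches exactly one node.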

Using the Algorithm
- No additional anomalous FDs exist; the schema is in XNF
- FDs have been preserved

Do you have any questions?

Resources
- Marcelo Arenas and Leonid Libkin (University of Toronto), “A Normal Form for XML Documents”
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.7590&rep=rep1&type=pdf
