Design Guidelines for Reducing Redundancy in Relational and XML Data

Design Guidelines for Reducing Redundancy in Relational and XML Data by Solmaz Kolahi A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto Copyright °c 2008 by Solmaz Kolahi Abstract Design Guidelines for Reducing Redundancy in Relational and XML Data Solmaz Kolahi Doctor of Philosophy Graduate Department of Computer Science University of Toronto 2008 In this dissertation, we propose new design guidelines to reduce the amount of redundancy that databases carry. We use techniques from information theory to de¯ne a measure that evaluates a database design based on the worst possible redundancy carried in the instances. We then continue by revisiting the design problem of relational data with functional dependencies, and measure the lowest price, in terms of redundancy, that has to be paid to guarantee a dependency-preserving normalization for all schemas. We provide a formal justi¯cation for the Third Normal Form (3NF) by showing that we can achieve this lowest price by doing a good 3NF normalization. We then study the design problem for XML documents that are views of relational data. We show that we can design a redundancy-free XML representation for some relational schemas while preserving all data dependencies. We present an algorithm for converting a relational schema to such an XML design. We ¯nally study the design problem for XML documents that are stored in relational databases. We look for XML design criteria that ensure a relational storage with low redundancy. First, we characterize XML designs that have a redundancy-free relational storage. Then we propose a restrictive condition for XML functional dependencies that guarantees a low redundancy for data values in the relational storage. ii To my beloved family: Vahid, Mom, Dad, and Siamak. iii Acknowledgements I would like to acknowledge the help and support of many people, who made the com- pletion of this thesis possible. First and foremost, I would like to express my deep and sincere gratitude to my supervisor, Professor Leonid Libkin, for everything he taught me. His enthusiasm, broad knowledge, and insightful ideas have made working with him a great pleasure. I would like to thank the members of my Ph.D. advisory committee, Professors Sheila McIlraith and Ren¶eeJ. Miller for their helpful comments and advices over the years. I would also like to thank my external examiner, Professor Leopoldo Bertossi, and also Professors Michael Gruninger and John Mylopoulos for reading my thesis and providing useful feedbacks and enlightening questions. I am also grateful to Professor Alberto O. Mendelzon for being a great help in the early years of my studies before he sadly passed away. He will always be dearly remembered. I would like to acknowledge the ¯nancial support I received from the Department of Computer Science, University of Toronto, and the Government of Ontario. I feel very lucky to have had the opportunity of working and learning among a great group of people in the database group at the University of Toronto. I would like to thank all my friends in the database group, especially Marcelo Arenas, for starting a very nice research, which was a great inspiration for me, and Ramona Truta for being a nice instructor and a good friend when I was her TA. I would like to say a big thank you to all my Iranian friends in Toronto, who made me feel home in Canada. I will always remember the wonderful memories we shared. My special thanks goes to Afra Alishahi, Niloofar Arshadi, Elham Fazli, Afsaneh Fazly, and Nargol Rezvani for always listening to my complaints patiently, and Azadeh Farzan, Shiva Nejati, and Azadeh Shakery for being such great and inspiring friends ever since high school days. I could have never ¯nished my degree without the exceptional support of my loving iv family. I would like to thank my dear husband, Vahid, for his technical support and feedbacks on my work over the years, for his patience and understanding during hard times, and more importantly, for being my best friend and companion. I am deeply indebted to my parents for their endless love, support, and encouragement in all aspects of my life, and for being my ¯rst and greatest teachers. I would also like to thank my dear brother, Siamak, for the happy childhood memories, and for always being such a caring and supportive friend. v Contents 1 Introduction 1 1.1 Summary of Contributions . 4 2 Relational Databases - An Overview 6 2.1 Database Schemas and Instances . 6 2.2 Data Dependencies . 9 2.2.1 Functional and Key Dependencies . 9 2.2.2 Multivalued and Join Dependencies . 11 2.2.3 Equality-Generating Dependencies . 12 2.2.4 Inclusion and Foreign Key Dependencies . 13 2.3 Designing Relational Data . 14 2.3.1 Normalizing Relational Data . 14 3 Information Theory and Normal Forms - An Overview 21 3.1 Preliminaries . 22 3.1.1 Schemas and Instances . 22 3.1.2 Basics of Information Theory . 22 3.2 Information Theory and Normal Forms . 23 3.2.1 Relative Information Content . 24 3.2.2 Justifying Perfect Normal Forms . 28 vi 4 Characterizing Almost-Perfect Normal Forms 31 4.1 Introduction . 32 4.2 Price of Dependency Preservation . 35 4.2.1 Guaranteed Information Content . 36 4.3 Comparing Normal Forms . 43 4.3.1 Guaranteed Average Information Content . 50 4.4 How Good Is an Arbitrary Design? . 54 4.5 Related Work . 58 5 XML Documents - An Overview 60 5.1 DTDs and XML Documents . 60 5.2 Key and Foreign Key Constraints for XML . 66 5.2.1 Keys De¯ned by Attributes . 68 5.2.2 Keys De¯ned by Path Expressions . 70 5.3 Functional Dependencies for XML . 72 5.3.1 Functional Dependencies De¯ned by Tree Tuples . 72 5.3.2 Other De¯nitions for Functional Dependencies . 75 5.4 Designing XML Data . 79 5.4.1 XNF: A Normal Form for XML Documents . 79 5.4.2 XNF Normalization and Dependency Preservation . 80 5.4.3 Justifying Perfect XML Normal Forms . 84 6 A Redundancy-Free XML Design for Relational Data 86 6.1 Introduction . 86 6.2 Dependency-Preserving Hierarchical Representation of Relations . 89 6.3 Algorithm . 94 6.4 A More General Representation . 96 6.5 Related Work . 98 vii 7 XML Design for Relational Storage 99 7.1 Introduction . 100 7.2 XML-to-Relational Mapping Scheme . 102 7.2.1 Translating XML Constraints . 105 7.3 Relational Storage for Perfect XML Designs . 106 7.4 What is an Almost-Perfect Design for XML? . 108 7.4.1 Guaranteed Information Content . 111 7.5 A Di®erent Mapping Scheme . 114 7.6 Related Work . 118 8 Conclusions 119 9 Future Work 121 Bibliography 123 viii List of Figures 2.1 Movies Database. 7 2.2 An algorithm for computing the closure [AHV95]. 10 2.3 An algorithm for producing BCNF schemas [AHV95]. 17 2.4 A 3NF relation. 18 2.5 An algorithm for synthesizing 3NF schemas. 19 k 3.1 Calculating RicI (p j §). 26 3.2 Information content vs. redundancy, where §1 = fA ! Cg and §2 = fA ! C; B ! Cg................................ 27 4.1 An algorithm for synthesizing 3NF schemas [AHV95]. 38 4.2 A database instance for the proofs of Propositions 4.5 and 4.10. 40 4.3 A database instance for the proof of Proposition 4.9. 45 4.4 A database instance for the proof of Proposition 4.16. 52 5.1 An XML document. 62 5.2 A DTD for the company database. 63 5.3 Tree representation of XML document. 64 5.4 A tree tuple. 73 5.5 An XML schema tree [HL03]. 76 5.6 An XML data tree [HL03]. 77 5.7 A school database [WT05]. 78 ix 5.8 XNF decomposition operations: (a) Moving attributes, (b) Creating new element types [AL04]. 81 5.9 XNF decomposition algorithm [AL04]. 82 5.10 A redundancy-free XML document. 84 6.1 Conversion of relational data into a redundancy-free XML document. 88 6.2 Dependency-preserving translation of relational data into redundancy-free XML documents. 93 6.3 Finding the orderings of attributes using the algorithm in Section 6.3. 95 7.1 An XML tree. 103 x Chapter 1 Introduction With the rapid growth and development of information systems, it becomes more essential to organize data in such a way that fast access and analysis of the data is feasible. Since the early 1970s, relational databases continue to serve as a popular framework for organizing and querying data. In relational databases, data values are grouped into tables or relations, where each table has a number of related columns or attributes. The schema of a relational database consists of all the tables and their column names. Database design is the process of producing a good schema for a database by deciding how to group the columns of interest into tables. There are several criteria for a good design [BPR88]: performance, or how fast the data is accessible; integrity, or to what extent the schema guarantees that the data is correct with respect to some constraints; understandability, or how coherent the struc- ture of the database is to a user; and extensibility, or how easily the database can be extended to new applications. The most important factor in maintaining the integrity or correctness of a database is controlling the redundancy of data, which is the main concern of this thesis in the process of schema design. Di®erent methodologies exist to facilitate the process of database design.

Design Guidelines for Reducing Redundancy in Relational and XML Data

Health Sensor Data Management in Cloud

Requirements for XML Document Database Systems Airi Salminen Frank Wm

XML Normal Form (XNF)

Model-Based XML to Relational Database Mapping Choices

XML in Oracle 9I

Transactional Support in Native Xml Databases

XML Native Databases and Legislative Documents: a White Paper

Examining Database Persistence of ISO/EN 13606 Standardized Electronic Health Record Extracts: Relational Vs

Translating Between XML and Relational Databases Using XML Schema and Automed

XML Programming with SQL/XML and Xquery

Database Management Systems Ebooks for All Edition (

Synopsis Data Structures for XML Databases: Models, Issues, and Research Perspectives