Instituto Tecnológico Y De Estudios Superiores De Monterrey Campus Monterrey
INSTITUTO TECNOLÓGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY
CAMPUS MONTERREY
GRADUATE PROGRAM IN MECHATRONICS AND INFORMATION TECHNOLOGIES

AN INTEGRATED DATA MODEL AND WEB PROTOCOL FOR ARBITRARILY STRUCTURED INFORMATION

DOCTORAL DISSERTATION PRESENTED AS A PARTIAL REQUISITE TO OBTAIN THE ACADEMIC DEGREE OF:
Doctor of Philosophy in Information Technologies and Communications
Major in Computer Science

BY: FRANCISCO ÁLVAREZ CAVAZOS

MONTERREY, NUEVO LEÓN, MÉXICO
DECEMBER 2007

AN INTEGRATED DATA MODEL AND WEB PROTOCOL FOR ARBITRARILY STRUCTURED INFORMATION

A Dissertation Presented by
Francisco Álvarez Cavazos

Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Information Technologies and Communications
Major: Computer Science

Thesis Committee:
Dr. José I. Icaza Acereto, ITESM Campus Monterrey
Dr. Juan C. Lavariega Jarquín, ITESM Campus Monterrey
Dr. David A. Garza Salazar, ITESM Campus Monterrey
Dr. Lorena G. Gómez Martínez, ITESM Campus Monterrey
Dr. Susan D. Urban, Arizona State University

Informatics Research Center
Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Monterrey
December 2007

Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Monterrey
Graduate Program in Mechatronics and Information Technologies

The committee members hereby recommend the dissertation presented by Francisco Álvarez Cavazos to be accepted as a partial fulfillment of requirements to be admitted to the Degree of Doctor of Philosophy in Information Technologies and Communications, Major in Computer Science.

Committee members:
Dr. José I. Icaza Acereto, Advisor
Dr. Juan C. Lavariega Jarquín, Advisor
Dr. David A. Garza Salazar
Dr. Lorena G. Gómez Martínez
Dr. Susan D. Urban

Dr. Graciano Dieck Assad
Director of Graduate Program in Mechatronics and Information Technologies

Declaration

I hereby declare that I composed this dissertation entirely myself and that it describes my own research.
Francisco Álvarez Cavazos
Monterrey, N.L., México
December 2007

Dedication

To Ana Marianela, beloved partner, betrothed, and warmth of my life. Thanks for teaching me to be at ease with myself and others; and for being my closest companion in this just-concluded adventure of spiritual growth, and for looking forward to all others to come. Although you value this work as your own, I will compensate you for the time I could not give you while I was working on this project by pledging to love and honor you in intention and attention, knowing that in these deeds is also my reward.

To my sister, Yvett, and to my brothers, Eduardo and Gerardo, for teaching me to laugh and to feel joy.

To my mother, for teaching me to care and love.

To my father, for teaching me to seek truth by reason.

To you, Lord Jesus; Peace, Joy, Love and Light of the world, for making it all possible.

Acknowledgements

I thank Ana Marianela Ríos for her company, support and encouragement dedicated to me during the course of my doctoral studies. Thanks for always inspiring me to do my best.

To my family, for always being ready to lend a helping hand whenever I need it.

To my friends, José A. Aldana, Juan F. Álvarez, and Abraham D. López, through whom I have had the opportunity to stay in touch with this world and others.

To Dr. Lorena G. Gómez and Dr. Susan D. Urban, for the interest and support given to this dissertation.

To Dr. David A. Garza, for always being there, edifying me through example and wise counsel through all stages, both the pleasurable and the harsh, of my graduate studies and professional development.

To Dr. Juan C. Lavariega, advisor of this dissertation, for his invaluable help and accurate recommendations, crucial in bringing this research to completion. Thanks in particular, for the advice given during my master’s studies and in the first two years of my doctoral studies, years from which my subsequent research and practical contributions have been derived.

To Dr. José I.
Icaza, principal advisor of this dissertation, for the guidance through the conceptual thicket of computer science. For directing me towards the Realms of Novelty and Value, by keeping an eye on the Feasibility, Repeatability, Efficiency and Effectiveness landmarks. Since good research makes good teaching and good teaching makes good research, it follows that a good teacher makes a good researcher. For this, I thank you dearly.

To all others that were my professors or classmates during my graduate studies. In some way or another, all the lessons and experiences we have shared have contributed to the products of this research, both human and technical. Many thanks to you all!

Abstract

Within the Web’s data ecosystem dwell applications that consume and produce information with varying degrees of structuring, ranging from very structured business data to the semi-structured or unstructured data found in documents which contain a significant amount of text. Current database technology was not designed for the Web and, consequently, database communication protocols, query models, and even data models are inadequate for the demands of “data everywhere.” Thus, a technique to uniformly store, search, transport and update all the variety of information within Web or intranet environments has yet to be designed.

The Web context requires the data management community to address: (a) data modeling and basic querying to support multiple data models to accommodate many types of data sources, (b) powerful search mechanisms that accept keyword queries and select relevant structured sources that may answer them, and (c) the ability to combine answers from structured and unstructured data in a principled way. In consequence, this dissertation constructively designs a technique to store, search, transport and update unstructured and structured information for Web or intranet-based environments: the Relational-text (RELTEX) protocol.
Central to the design of the protocol is an integrated model for structured and unstructured data and its associated declarative language interface, namely, the RELTEX model and calculus. The RELTEX model is constructively defined departing from the relational and information retrieval models and their associated retrieval strategies. The model’s data items are tuples with structured “columns” and unstructured “fields” that further allow idiosyncratic schema in the form of “extension fields”, which are tuple-specific name/value pairs. This flexibility allows representation of totally unstructured information, totally structured information, and mixtures of structured and unstructured data, such as tables where tuples have a varying number of fields over time. RELTEX calculus extends tuple relational calculus to consider text fields, similarity matches, match ranking, and sort order. Then, building on top of the formally-defined RELTEX data model and calculus and departing from the architecture of the Web, the RELTEX protocol is defined as a resource-centric protocol to describe and manipulate data and schema of unstructured and structured data sources.

An equivalence mapping between RELTEX and the relational and information retrieval models is provided. The mapping suggests a wide range of applicability for RELTEX, thus proving the model’s value. On the other hand, the RELTEX protocol is distinguished from other techniques for data access and storage in the Web since (a) it supports structured and unstructured data manipulation and retrieval, (b) it offers operations to describe and manipulate both common and idiosyncratic schema of data items and (c) it directly federates data items to the Web over a compound key; thus demonstrating novelty and value. The RELTEX protocol, model and calculus are proven feasible by means of a proof-of-concept implementation.
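As an informal illustration of the data items described above, the following Python sketch models a tuple with structured columns, unstructured fields, and tuple-specific extension fields, and then combines a structured predicate with a ranked text match. It is only a sketch under stated assumptions: the names `Item`, `similarity`, and `search`, the sample data, and the term-overlap score are hypothetical and are not taken from the dissertation, whose calculus and ranking are defined formally in later chapters.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple
import re


@dataclass
class Item:
    """Hypothetical RELTEX-style data item (illustrative names only)."""
    columns: Dict[str, Any]   # structured values, shared schema
    fields: Dict[str, str]    # unstructured text, shared schema
    # Tuple-specific name/value pairs ("extension fields").
    extensions: Dict[str, str] = field(default_factory=dict)


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric terms of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(query: str, text: str) -> float:
    """Toy score (an assumption): fraction of query terms found in the text."""
    q = set(tokens(query))
    if not q:
        return 0.0
    return len(q & set(tokens(text))) / len(q)


def search(
    items: List[Item],
    predicate: Callable[[Dict[str, Any]], bool],
    field_name: str,
    query: str,
) -> List[Tuple[float, Item]]:
    """Keep items whose columns satisfy the predicate, score the named
    text field (or extension field) against the query, drop zero scores,
    and return the rest ordered by descending score."""
    scored = []
    for item in items:
        if not predicate(item.columns):
            continue
        text = item.fields.get(field_name, item.extensions.get(field_name, ""))
        score = similarity(query, text)
        if score > 0:
            scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


items = [
    Item({"id": 1, "year": 2007}, {"abstract": "web protocol for structured data"}),
    Item({"id": 2, "year": 2005}, {"abstract": "web protocol"}),
    Item({"id": 3, "year": 2007}, {"abstract": "relational calculus"}, {"note": "draft"}),
    Item({"id": 4, "year": 2007}, {"abstract": "notes on the web"}),
]

# Structured condition (year = 2007) combined with a ranked keyword match.
results = search(items, lambda c: c["year"] == 2007, "abstract", "web data")
print([(item.columns["id"], score) for score, item in results])
```

Only item 3 carries an extension field here, which mirrors the idea that individual tuples may hold name/value pairs that the rest of the collection does not share.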
Departing from a motivating scenario, the prototype is used to provide representative examples of data and schema operations.

Having demonstrated that the RELTEX protocol and model contribute towards the data modeling and basic querying challenge imposed by the Web, we expect that this dissertation benefits researchers and practitioners alike with a novel, valuable, effective and feasible technique to store, search, transport and update unstructured and structured information in the Web environment.

Table of Contents

Dedication
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
    1.1 Research Problem & Hypotheses
    1.2 Validation
    1.3 Justification
    1.4 Scope
    1.5 Contributions
    1.6 Document Organization
2 A Web of Data in the Making
    2.1 The Structure of Data Retrieval
    2.2 The Web Service Sea Change
    2.3 Structuring the Unstructured
    2.4 Related Work
    2.5 Motivating Scenario
3 Model for Describing Unstructured and Structured Data
    3.1 Modeling Data with Relations and Documents
        3.1.1 Relational Model
        3.1.2 Information Retrieval Model
    3.2 Relational-text Model
        3.2.1 Relational-text Calculus
        3.2.2 Modeling Relations and Document Collections
        3.2.3 Calculus Example: SQL statements
        3.2.4 Relational-text Schema
        3.2.5 Relational-text Database Sample Operations
    3.3 Relational-text Model Assessment
        3.3.1 Wide Range of Applicability
    3.4 Conclusions
4 Relational-text Protocol
    4.1 RESTful Design
    4.2 Data Context
        4.2.1 Formalizing the Protocol
        4.2.2 Data Retrieval Options
    4.3 Schema Context
    4.4 Metadata Context
    4.5 Relational-text Protocol Sample Operations
    4.6 Securing the Protocol
    4.7 Relational-text Protocol Assessment
    4.8 Conclusions
5 Relational-text Implementation
    5.1 Reference Architecture
        5.1.1 Component-based Architecture
            5.1.1.1 Implementation Platform
            5.1.1.2 Implementation Statistics
        5.1.2 Function-based Architecture
            5.1.2.1 Data Access API
            5.1.2.2 Query Decomposition and Merger of Results
            5.1.2.3 Relational-text API Sample Operations