Data on the Web: from Relations to Semistructured Data and XML
Total Page:16
File Type:pdf, Size:1020Kb
Data on the Web: From Relations to Semistructured Data and XML Serge Abiteboul Peter Buneman Dan Suciu February 19, 2014 1 Data on the Web: from Relations to Semistructured Data and XML Serge Abiteboul Peter Buneman Dan Suciu An initial draft of part of the book Not for distribution 2 Contents 1 Introduction 7 1.1 Audience . 8 1.2 Web Data and the Two Cultures . 9 1.3 Organization . 15 I Data Model 17 2 A Syntax for Data 19 2.1 Base types . 21 2.2 Representing Relational Databases . 21 2.3 Representing Object Databases . 22 2.4 Specification of syntax . 26 2.5 The Object Exchange Model, OEM . 26 2.6 Object databases . 27 2.7 Other representations . 30 2.7.1 ACeDB . 30 2.8 Terminology . 32 2.9 Bibliographic Remarks . 34 3 XML 37 3.1 Basic syntax . 39 3.1.1 XML Elements . 39 3.1.2 XML Attributes . 41 3.1.3 Well-Formed XML Documents . 42 3.2 XML and Semistructured Data . 42 3.2.1 XML Graph Model . 43 3.2.2 XML References . 44 3 4 CONTENTS 3.2.3 Order . 46 3.2.4 Mixing elements and text . 47 3.2.5 Other XML Constructs . 47 3.3 Document Type Declarations . 48 3.3.1 A Simple DTD . 49 3.3.2 DTD's as Grammars . 49 3.3.3 DTD's as Schemas . 50 3.3.4 Declaring Attributes in DTDs . 52 3.3.5 Valid XML Documents . 54 3.3.6 Limitations of DTD's as schemas . 55 3.4 Document Navigation . 56 3.5 DCD . 57 3.6 Paraphernalia . 58 3.7 Bibliographic Remarks . 61 II Queries 63 4 Query Languages 65 4.1 Path expressions . 67 4.2 A core language . 70 4.2.1 The basic syntax . 71 4.3 More on Lorel . 74 4.4 UnQL . 77 4.5 Label and path variables . 79 4.6 Mixing with structured data . 81 4.7 Bibliographic Remarks . 84 5 Query Languages for XML 87 5.1 XML-QL . 87 5.2 XSL . 98 5.3 Bibliographic Remarks . 102 6 Interpretation and advanced features 105 6.1 First-order interpretation . 106 6.2 Object creation . 110 6.3 Graphical languages . 114 6.4 Structural Recursion . 115 6.4.1 Structural recursion on trees . 115 CONTENTS 5 6.4.2 XSL and Structural Recursion . 118 6.4.3 Bisimulation in Semistructured Data . 120 6.4.4 Structural recursion on cyclic data . 124 6.5 StruQL . 127 III Types 131 7 Typing semistructured data 133 7.1 What is typing good for? . 135 7.1.1 Browsing and querying data . 135 7.1.2 Optimizing query evaluation . 135 7.1.3 Improving storage . 138 7.2 Analyzing the problem . 138 7.3 Schema Formalisms . 140 7.3.1 Logic . 140 7.3.2 Datalog . 142 7.3.3 Simulation . 145 7.3.4 Comparison between datalog rules and simulation . 152 7.4 Extracting Schemas From Data . 154 7.4.1 Data Guides . 154 7.4.2 Extracting datalog rules from data . 160 7.5 Inferring Schemas from Queries . 164 7.6 Sharing, Multiplicity, and Order . 165 7.6.1 Sharing . 165 7.6.2 Attribute Multiplicity . 168 7.6.3 Order . 169 7.7 Path constraints . 170 7.7.1 Path constraints in semistructured data . 172 7.7.2 The constraint inference problem . 175 7.8 Bibliographic Remarks . 176 IV Systems 179 8 Query Processing 181 8.1 Architecture . 181 8.2 Semistructured Data Servers . 184 8.2.1 Storage . 186 6 CONTENTS 8.2.2 Indexing . 194 8.2.3 Distributed Evaluation . 203 8.3 Mediators for Semistructured Data . 211 8.3.1 A Simple Mediator: Converting Relational Data to XML211 8.3.2 Mediators for Data Integration . 213 8.4 Incremental Maintenance of Semistructured Data . 220 8.5 Bibliographic Remarks . 222 9 The Lore system 225 9.1 Architecture . 226 9.2 Query processing and indexes . 227 9.3 Other aspects of Lore . 230 10 Strudel 235 10.0.1 An Example . 236 10.0.2 Advantages of Declarative Web Site Design . 245 11 Database products supporting XML 249 Bibliography 254 Index 254 Chapter 1 Introduction Until a few years ago the publication of electronic data was limited to a few scientific and technical areas. It is now becoming universal. Most people see such data as Web documents, but these documents rather than being man- ually composed are increasingly generated automatically from databases. The documents therefore have some regularity or some underlying structure that may or may not be understood by the user. It is possible to pub- lish enormous volumes of data in this way, and we are now starting to see the development of software that extracts structured data from Web pages that were generated to be readable by humans. From a document perspec- tive, issues such as efficient retrieval, version control, change management and sophisticated methods of querying documents, which were formerly the province of database technology, are now important. From a database per- spective, the Web has generated an enormous demand for recently developed database architectures such as data warehouses and mediation systems for database integration, and it has led to the development of semistructured data models with languages adapted to this model. The emergence of XML as a standard for data representation on the Web is expected greatly to facilitate the publication of electronic data by provid- ing a simple syntax for data that is both human- and machine-readable. While XML is itself relatively simple, it is surrounded by a confusing num- ber of XML-enabled systems by various software vendors and and an al- phabet soup of proposals (RDF, XML-Data, XML-Schema, SOX, etc.) for XML-related standards. Although the document and database viewpoints were, until quite re- cently, irreconcilable, there is now a convergence in technologies brought 7 8 CHAPTER 1. INTRODUCTION about by the development of XML for data on the Web and the closely re- lated development of semistructured data in the database community. This book is primarily about this convergence. Its main goal is to present founda- tions for the management of data found on the Web. New shifts of paradigms are needed from an architecture and a data model viewpoint. These form the core of the book. A constant theme is bridging the gap between logi- cal representation of data and data modeling on one hand and syntax and functionalities of document systems on the other. We hope this book will help clarify these concepts and make it easier to understand the variety of powerful tools that are being developed for Web data. 1.1 Audience The book aims at laying the foundations for future Web-based data-intensive applications. As such it has several potential audiences. The prime audience consists of people developing tools or doing research related to the management of data on the Web. Most of the topics presented in the book are today the focus of active research. The book can serve as an entry point to this rapidly evolving domain. For readers with a data management background, it will serve as an introduction to Web data and notably to XML. For people coming from Web publishing, this book aims to explain why modern database technology is needed for the integrated storage and retrieval of Web data. It will present a perhaps unexpected view of the future use of XML, as a data exchange format, as opposed to a standardized document markup language. A second audience consists of students and teachers interested in semistruc- tured data and in the management of data on the Web. The book contains an extensive survey of the literature, and can thus serve as basis for an advanced seminar. A third audience consists of information systems managers in charge of publishing data in Web sites. This book is intended to take them away from their day to day struggle with new tools and changes of standards. It is hoped that it may help them understand the main technical solutions as well as bottlenecks and give them a feeling of longer term goals in the representation of data. This book may help them achieve a better vision of the new breed of information systems that will soon pervade the Web. 1.2. WEB DATA AND THE TWO CULTURES 9 1.2 Web Data and the Two Cultures Today's Web The Web provides a simple and universal standard for the exchange of information. The central principle is to decompose information into units that can be named and transmitted. Today, the unit of infor- mation is typically a file that is created by one Web user and shared with others by making available its name in the form of a URL (Uniform Re- source Locator). Other users and systems keep the URL in order to retrieve the file when required. Information, however, has structure. The success of the Web is derived from the development of HTML (Hypertext Markup Language), a means of structuring text for visual presentation. HTML de- scribes both an intra-document structure (the layout and format of the text), and an inter-document structure (references to other documents through hy- perlinks). The introduction of HTTP as a standard and use of HTML for composing documents is at the root of the universal acceptance of the Web as the medium of information exchange. The Database Culture There is another long-standing view of the struc- ture of information that is almost orthogonal to that of textual structure. This is the view developed for data management systems.