XML Storage

Comp. by: CsenthilkumaranGalleys0000861598 Date:30/8/08 Time:02:46:19 Stage:First Proof File Path://ppdys1108/Womat3/Production/PRODENV/0000000005/0000008302/0000000016/ 0000861598.3D Proof by: QC by:

DENILSON BARBOSA¹, PHILIP BOHANNON², JULIANA FREIRE³, CARL-CHRISTIAN KANNE⁴, IOANA MANOLESCU⁵, VASILIS VASSALOS⁶, MASATOSHI YOSHIKAWA⁷
¹ Department of Computer Science, University of Calgary, Calgary, AB, Canada
² Yahoo! Research, CA, USA
³ School of Computing, University of Utah, Salt Lake City, UT, USA
⁴ Department of Computer Science, University of Manheim, Manheim, PA, USA
⁵ INRIA Futurs, Le Chesnay, France
⁶ Department of Informatics, Athens University of Economics and Business, Athens, Greece
⁷ University of Kyoto, Japan

Synonyms
XML persistence; XML database

Definition
A wide variety of technologies may be employed to physically persist XML documents for later retrieval or update, from relational database management systems to hierarchical systems to native file systems. Once the target technology is chosen, there is still a large number of storage mapping strategies that define how parts of the document or document collection will be represented in the back-end technology. Additionally, there are issues of optimization of the technology and strategy used for the mapping. XML Storage covers all the above aspects of persisting XML document collections.

Historical Background
Even though the need for XML storage naturally arose after the emergence of XML, similar techniques had been developed earlier, since the mid-1990s, to store semi-structured data ([EDS reference: Semi-structured data]). For example, the LORE system included a storage manager specifically designed for semi-structured objects, while the STORED system allowed the definition of mappings from semi-structured data to relations. Even earlier, storage techniques and storage systems had been developed for object-oriented data ([EDS reference: Object data models]). These techniques focused on storing individual objects, including their private and public data and their methods. Important tasks included performing garbage collection, managing object migration and maintaining class extents. Object clustering techniques were developed that used the class hierarchy and the composition hierarchy (i.e., which object is a component of which other object) to help determine object location. These techniques, and the implemented object storage systems, such as the O2 storage system, influenced the development of subsequent semi-structured and XML storage systems.

Moreover, the above solutions or ad-hoc approaches had also been used for the storage of large SGML (Standard Generalized Markup Language, a superset and precursor to XML) documents.

Scientific Fundamentals
Given the wide use of XML, most applications need or will need to process and manipulate XML documents, and many applications will need to store and retrieve data from large documents, large collections of documents, or both. As an exchange format, XML can simply be serialized and stored in a file, but serialized document storage is often very inefficient for query processing and updates. As a result, a large-scale XML storage infrastructure is critical to modern application performance.

Figure 1a shows a simple graphical outline of an XML DTD ([EDS reference]) for movies and television shows.

XML Storage. Figure 1. Movie DTD and example native storage strategy.

As this example shows, XML data may exhibit great variety in their structure. At one extreme, relational-style data such as the title, year and boxoff children of show in Fig. 1a may be represented in XML. At the opposite extreme are highly irregular structures such as might be found under the reviews tag. Figure 2 shows a similar graphical representation of a real-life DTD for scientific articles. Since every HTML structure or formatting element is also an XML element or attribute, the corresponding XML tree is very deep and wide, and no two sections are likely to have the same structure.

XML Storage. Figure 2. Graphical outline of a complex article DTD (strongly simplified).

XML processing workloads are also diverse. Queries and updates may affect or return few or many nodes. They may also need to "visit" large portions of an XML tree and return nodes that are far apart, such as all the box-office receipts for movies, or may only return or affect nodes that are "close" together, such as all the information pertaining to a single review.

A few different ways of persisting XML document collections are used, and each addresses the challenges posed by the varied XML documents and workloads differently.

Instance-Driven Storage
In instance-driven storage, the storage of XML content is driven by the tree structure of the individual document, ignoring the types assigned to the nodes by a schema (if one exists). In some cases, e.g., when documents have irregular structure or an application mostly generates navigations to individual elements, instance-driven storage can greatly simplify the task of storing XML content. One instance-driven technique is to store nodes and edges in one or more relational tables. A second approach is to implement an XML data model natively.

Tabular Storage of Trees. A relational schema for encoding any XML instance may include relations child, modeling the parent-child relationship, and tag, attr, id, text, associating to each element node respectively a tag, an attribute, an identity and a text value, as well as sets that contain the root of the document and the set of all its elements. Notice that such a schema does not allow full reconstruction of an original XML document, as it does not retain information on element order, whitespace, comments, entity references, etc. The encoding of element order, which is a critical feature of XML, is discussed later in this article.

A relational schema for encoding XML may also need to capture built-in integrity constraints of XML documents, such as the fact that every child has a single parent, every element has exactly one tag, etc.

Tabular storage of trees as described enables the use of relational storage engines as the target storage technology for XML document collections. While this approach is capable of storing arbitrary documents, a large number of joins may be required to answer queries, especially when reconstructing subtrees. This is the basic storage mapping supported by Microsoft's SQL Server as of 2007.

Native XML Storage. Native XML storage software implements data structures and algorithms specifically designed to store XML documents on secondary memory. These data structures support one or more of the XML data models ([EDS reference to infoset/psvi/XQuery data models]). Salient functional requirements implied by standard data models include the preservation of child order, a stable node identity, and support for type information (depending on the data model supported). An additional functional requirement in XML data stores is the ability to reconstruct the exact textual representation of an XML document, including details such as encoding, whitespace, attribute order, namespace prefixes, and entity references.

A native XML storage implementation generally [...] store. Large text nodes and large child node lists are handled by chunking them and/or introducing auxiliary nodes. This organization supports efficient local navigation and tree reconstruction without, for example, loading the entire tree into memory. Such an approach is used in the commercial DBMS IBM DB2 as of 2007 (starting with version 9). Native stores can efficiently support updates, concurrency control mechanisms and traditional recovery schemes to preserve durability ([EDS reference ACID Properties]).

Figure 1b shows a hypothetical instance of the schema of Fig. 1a. The types of nodes are indicated by shape. One potential assignment of nodes to physical storage records is shown as groupings inside dashed lines. Note that show elements are often physically stored with their review children, and reviews are frequently stored with the next or previous review in document order.

Physical-level heuristics that can be implemented to improve performance include compressed representation of node pointers inside a block, and string dictionaries allowing integers to replace strings appearing repeatedly, such as tag names and namespace URIs.

Schema-Driven Storage
When information about the structure of XML documents is given, e.g., in a DTD or an XML Schema ([EDS references to XML Schema]), techniques have been developed for XML storage that exploit this information. In general, nodes of the same type according to the schema are mapped in the same way, for example to a relational table. Schema information is primarily exploited for tabular storage of XML document collections, and in particular in conjunction with
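The tabular storage of trees described under Instance-Driven Storage can be illustrated with a minimal sketch. It is not the mapping used by any particular product: the table names xml_child, xml_tag and xml_text, the functions shred and titles_with_years, and the sample document are assumptions made for this example, and the attr and id relations as well as element order are left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the movie example
# in the text, the values are made up for illustration.
DOC = """<shows>
  <show><title>Alpha</title><year>1999</year></show>
  <show><title>Beta</title><year>2005</year></show>
</shows>"""

# Assumed schema: one relation per aspect of an element node.
# The primary keys capture two built-in constraints mentioned in the text:
# every element has exactly one tag, and every child has a single parent.
SCHEMA = """
CREATE TABLE xml_tag   (node INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE xml_child (parent INTEGER NOT NULL, node INTEGER PRIMARY KEY);
CREATE TABLE xml_text  (node INTEGER PRIMARY KEY, value TEXT);
"""


def shred(xml_text, conn):
    """Store each element as rows in the three relations; return node count."""
    conn.executescript(SCHEMA)
    next_id = 0
    stack = [(None, ET.fromstring(xml_text))]
    while stack:
        parent_id, elem = stack.pop()
        node_id = next_id
        next_id += 1
        conn.execute("INSERT INTO xml_tag VALUES (?, ?)", (node_id, elem.tag))
        if parent_id is not None:
            conn.execute("INSERT INTO xml_child VALUES (?, ?)", (parent_id, node_id))
        if elem.text and elem.text.strip():
            conn.execute("INSERT INTO xml_text VALUES (?, ?)", (node_id, elem.text.strip()))
        for sub in elem:
            stack.append((node_id, sub))
    return next_id


def titles_with_years(conn):
    """Pair each show's title with its year; note how many joins this takes."""
    return conn.execute("""
        SELECT t_txt.value, y_txt.value
        FROM xml_tag s
        JOIN xml_child ct    ON ct.parent = s.node
        JOIN xml_tag   tt    ON tt.node = ct.node AND tt.name = 'title'
        JOIN xml_text  t_txt ON t_txt.node = ct.node
        JOIN xml_child cy    ON cy.parent = s.node
        JOIN xml_tag   ty    ON ty.node = cy.node AND ty.name = 'year'
        JOIN xml_text  y_txt ON y_txt.node = cy.node
        WHERE s.name = 'show'
        ORDER BY t_txt.value
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    shred(DOC, connection)
    print(titles_with_years(connection))
```

Even this small query, which reassembles two children of each show element, needs several joins over the edge and tag tables, which is the cost the text attributes to reconstructing subtrees from tabular storage.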
