Comp. by: CsenthilkumaranGalleys0000861598 Date:30/8/08 Time:02:46:19 Stage:First Proof File Path://ppdys1108/Womat3/Production/PRODENV/0000000005/0000008302/0000000016/ 0000861598.3D Proof by: QC by:

X

XML Storage

DENILSON BARBOSA¹, PHILIP BOHANNON², JULIANA FREIRE³, CARL-CHRISTIAN KANNE⁴, IOANA MANOLESCU⁵, VASILIS VASSALOS⁶, MASATOSHI YOSHIKAWA⁷
¹ Department of Computer Science, University of Calgary, Calgary, AB, Canada
² Yahoo! Research, CA, USA
³ School of Computing, University of Utah, Salt Lake City, UT, USA
⁴ Department of Computer Science, University of Manheim, Manheim, PA, USA
⁵ INRIA Futurs, Le Chesnay, France
⁶ Department of Informatics, Athens University of Economics and Business, Athens, Greece
⁷ University of Kyoto, Japan

Synonyms
XML persistence; XML

Definition
A wide variety of technologies may be employed to physically persist XML documents for later retrieval or update, from relational database management systems to hierarchical systems to native file systems. Once the target technology is chosen, there is still a large number of storage mapping strategies that define how parts of the document or document collection will be represented in the back-end technology. Additionally, there are issues of optimization of the technology and strategy used for the mapping. XML Storage covers all the above aspects of persisting XML document collections.

Historical Background
Even though the need for XML storage naturally arose after the emergence of XML, similar techniques had been developed earlier, since the mid-1990s, to store semi-structured data ([EDS reference: Semi-structured data]). For example, the LORE system included a storage manager specifically designed for semi-structured objects, while the STORED system allowed the definition of mappings from semi-structured data to relations. Even earlier, storage techniques and storage systems had been developed for object-oriented data ([EDS reference: Object data models]). These techniques focused on storing individual objects, including their private and public data and their methods. Important tasks included performing garbage collection, managing object migration and maintaining class extents. Object clustering techniques were developed that used the class hierarchy and the composition hierarchy (i.e., which object is a component of which other object) to help determine object location. These techniques, and the implemented object storage systems, such as the O2 storage system, influenced the development of subsequent semi-structured and XML storage systems. Moreover, the above solutions or ad-hoc approaches had also been used for the storage of large SGML (Standard Generalized Markup Language, a superset and precursor to XML) documents.

Scientific Fundamentals
Given the wide use of XML, most applications need or will need to process and manipulate XML documents, and many applications will need to store and retrieve data from large documents, large collections of documents, or both. As an exchange format, XML can be simply serialized and stored in a file, but serialized document storage often is very inefficient for query processing and updates. As a result, a large-scale XML storage infrastructure is critical to modern application performance.

Figure 1a shows a simple graphical outline of an XML DTD ([EDS reference]) for movies and television shows.

As this example shows, XML data may exhibit great variety in their structure. At one extreme, relational-style data like title, year and boxoff children of show in Fig. 1a may be represented in XML. At the opposite extreme are highly irregular structures such as might be found under the reviews tag. Figure 2 shows a similar graphical representation of a real-life DTD for


XML Storage. Figure 1. Movie DTD and example native storage strategy.

scientific articles. Since every HTML structure or formatting element is also an XML element or attribute, the corresponding XML tree is very deep and wide, and no two sections are likely to have the same structure.

XML processing workloads are also diverse. Queries and updates may affect or return few or many nodes. They may also need to "visit" large portions of an XML tree and return nodes that are far apart, such as all the box-office receipts for movies, or may only return or affect nodes that are "close" together, such as all the information pertaining to a single review.

A few different ways of persisting XML document collections are used, and each addresses differently the challenges posed by the varied XML documents and workloads.

Instance-Driven Storage
In instance-driven storage, the storage of XML content is driven by the tree structure of the individual document, ignoring the types assigned to the nodes by a schema (if one exists). In some cases, e.g., when documents have irregular structure or an application mostly generates navigations to individual elements, instance-driven storage can greatly simplify the task of storing XML content. One instance-driven technique is to store nodes and edges in one or more relational tables. A second approach is to implement an XML data model natively.

Tabular Storage of Trees. A relational schema for encoding any XML instance may include relations child, modeling the parent-child relationship, and tag, attr, id, text, associating to each element node respectively a tag, an attribute, an identity and a text value, as well as sets that contain the root of the document and the set of all its elements. Notice that such a schema does not allow full reconstruction of an

XML Storage. Figure 2. Graphical outline of a complex article DTD (strongly simplified).


original XML document, as it does not retain information on element order, whitespace, comments, entity references, etc. The encoding of element order, which is a critical feature of XML, is discussed later in this article.

A relational schema for encoding XML may also need to capture built-in integrity constraints of XML documents, such as the fact that every child has a single parent, every element has exactly one tag, etc.

Tabular storage of trees as described enables the use of relational storage engines as the target storage technology for XML document collections. While this approach can store arbitrary documents, a large number of joins may be required to answer queries, especially when reconstructing subtrees. This is the basic storage mapping supported by Microsoft's SQL Server as of 2007.

Native XML Storage. Native XML storage software implements data structures and algorithms specifically designed to store XML documents on secondary memory. These data structures support one or more of the XML data models ([EDS reference to infoset/psvi/XQuery data models]). Salient functional requirements implied by standard data models include the preservation of child order, a stable node identity, and support for type information (depending on the data model supported). An additional functional requirement in XML data stores is the ability to reconstruct the exact textual representation of an XML document, including details such as encoding, whitespace, attribute order, namespace prefixes, and entity references.

A native XML storage implementation generally maps tree nodes and edges to storage blocks in a manner that preserves tree locality, i.e., that stores parents and children in the same block. The strategy is to map XML tree structures onto records managed by a storage manager for variable-size records. One possible approach is to map the complete document to a single Binary Large Object and use the record manager's large object management to deal with documents larger than a page. This is one of the approaches for XML storage supported by the commercial DBMS Oracle as of 2007. This approach incurs significant costs both for update and for query processing.

A more sophisticated strategy is to divide the document into partitions smaller than a disk block and map each to a single record in the underlying store. Large text nodes and large child node lists are handled by chunking them and/or introducing auxiliary nodes. This organization supports efficient local navigation and tree reconstruction without, for example, loading the entire tree into memory. Such an approach is used in the commercial DBMS IBM DB2 as of 2007 (starting with version 9). Native stores can efficiently support updates, concurrency control mechanisms and traditional recovery schemes to preserve durability ([EDS reference ACID Properties]).

Figure 1b shows a hypothetical instance of the schema of Fig. 1a. The types of nodes are indicated by shape. One potential assignment of nodes to physical storage records is shown as groupings inside dashed lines. Note that show elements are often physically stored with their review children, and reviews are frequently stored with the next or previous review in document order.

Physical-level heuristics that can be implemented to improve performance include compressed representation of node pointers inside a block, and string dictionaries allowing integers to replace strings appearing repeatedly, such as tag names and namespace URIs.

Schema-Driven Storage
When information about the structure of XML documents is given, e.g., in a DTD or an XML Schema ([EDS references to XML Schema]), techniques have been developed for XML storage that exploit this information. In general, nodes of the same type according to the schema are mapped in the same way, for example to a relational table. Schema information is primarily exploited for tabular storage of XML document collections, and in particular in conjunction with the use of a relational storage engine as the underlying technology, as described in the next paragraph. In hybrid XML storage, different data models, and potentially even different systems, store different document parts.

Relational Storage for XML Documents
Techniques have been developed that enable the effective use of a relational database management system to store XML. Figure 3a illustrates the main tasks that must be performed for storing XML in relational databases. First, the schema of the XML document is mapped into a suitable relational schema that can preserve the information in the original XML documents (Storage Design). The resulting relational schema needs


XML Storage. Figure 3. Relational storage workflow and example.

to be optimized at the physical level, e.g., with the selection of appropriate file structures and the creation of indices, taking into account the distinctive characteristics of XML queries and updates in general and of the application workload in particular. XML documents are then shredded and loaded into the flat tables (Data Loading). At runtime, XML queries are translated into relational queries, e.g., in SQL, submitted to the underlying relational system, and the results are translated back into XML (Query Translation). ([EDS references to XML publishing, XML query processing]) Schema-driven relational storage mappings for XML documents are supported by the Oracle DBMS.

An XML-to-relational mapping scheme consists of view definitions that express what data from the XML document should appear in each relational table and constraints over the relational schema. The views generally map elements with the same type or tag name to a table and define a storage mapping. For example, in Fig. 3b, two views, V1 and V2, are used to populate the Actors and Shows tables respectively. A particular set of storage views and constraints along with physical


storage and indexing options together comprise a storage design. The process of parsing an XML document and populating a set of relational views according to a storage design is referred to as shredding.

Due to the mismatch between the tree-structure of XML documents and the flat structure of relational tables, there are many possible storage designs. For example, in Fig. 3b, if an element, such as show, is guaranteed to have only a single child of a particular type, such as seasons, then the child type may optionally be inlined, i.e., stored in the same table as the parent. On the other hand, due to the nature of XML queries and updates, certain indexing and file organization options have been shown to be generally useful. In particular, the use of B-tree indexes (as opposed to hash-based indexes) is usually beneficial, as the translation of XML queries into relational languages often involves range conditions. There is evidence that the best file organization for the relations resulting from XML shredding is index-organized tables ([EDS reference to Index Creation and File Structure]), with the index on the attribute(s) encoding the order of XML elements. With such file organization, index scanning allows the retrieval of the XML elements in document order, as required by XPath semantics, with a minimum number of random disk accesses. The use of a path index that stores complete root-to-node paths for all XML elements also provides benefits.

Cost-Based Approaches. A key quality of a storage mapping is efficiency: whether queries and updates in the workload can be executed quickly. Cost-based mapping strategies can derive mappings that are more efficient than mappings generated using fixed strategies. In order to apply such strategies, statistics on the values and structure of an XML document collection need to be gathered. A set of transformations and annotations can be applied to the XML schema to derive different schemas that result in different relational storage mappings, for example by merging or splitting the types of different XML elements, and hence mapping them into the same or different relational tables. Then, an efficient mapping is selected by comparing the estimated cost of executing a given application workload on the relational schema produced by each mapping. The optimizer of the relational database used as storage engine can be used for the cost estimation. Due to the size of the search space for mappings generated by the schema transformations, efficient heuristics are needed to reduce the cost without missing the most efficient mappings. Physical database design options, such as vertical partitioning of relations and the creation of indices, can be considered in addition to logical database design options, to include potentially more efficient mappings in the search space. The basic principles and techniques of cost-based approaches for XML storage are shared with relational cost-based schema design.

Correctness and Losslessness. An important issue in designing mappings is correctness, notably, whether a given mapping preserves enough information. A mapping scheme is lossless if it allows the reconstruction of the original documents, and it is validating if all legal relational database instances correspond to a valid XML document. While losslessness is enough for applications involving only queries over the documents, if documents must conform to an XML schema and the application involves both queries and updates to the documents, schema mappings that are validating are necessary. Many of the mapping strategies proposed in the literature are (or can be extended to be) lossless. While none of them are validating, they can be extended with the addition of constraints to only allow updates that maintain the validity of the XML document. In particular, even though losslessness and validation are undecidable for a large class of mapping schemes, it is possible to guarantee information preservation by designing mapping procedures which guarantee these properties by construction.

Order Encoding Schemes. Different techniques have been proposed to preserve the structure and order of XML elements that are mapped into a relational schema. In particular, different labeling schemes have been proposed to capture the positional information of each XML element via the assignment of node labels. An important goal of such schemes is to be able to express structural properties among nodes, e.g., the child, descendant, following sibling and other relationships, as conditions on the labels. Most schemes are either prefix-based or range-based and can be used with both schema-driven and instance-based relational storage of XML.

In prefix-based schemes, a node's label includes as a prefix the label of its parent. Dewey-based order encodings are the best known prefix-based schemes.
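
The prefix property can be made concrete with a minimal sketch. It assumes labels are dot-separated sibling positions (the Dewey-style form) and uses a hypothetical in-memory tree of (tag, children) tuples; the tree, the tag names and every function name below are illustrative assumptions, not part of any system discussed in this entry. The sketch shows how child, descendant and following-sibling relationships reduce to conditions on labels alone.

```python
from typing import List, Tuple

# Hypothetical in-memory tree for illustration: (tag, [child, ...]).
# The tags loosely follow the movie example; this is not a copy of Fig. 1b.
TREE = ("shows", [
    ("show", [
        ("title", []),
        ("year", []),
        ("reviews", [("review", []), ("review", [])]),
    ]),
])


def dewey_labels(root) -> List[Tuple[str, str]]:
    """Return (label, tag) pairs in document order.

    A node's label is its parent's label followed by the node's
    1-based position among its siblings; the root is labeled "1".
    """
    out: List[Tuple[str, str]] = []

    def visit(node, label: str) -> None:
        tag, children = node
        out.append((label, tag))
        for pos, child in enumerate(children, start=1):
            visit(child, f"{label}.{pos}")

    visit(root, "1")
    return out


def parse(label: str) -> Tuple[int, ...]:
    """Split a label into integer components so that 1.10 sorts after 1.2."""
    return tuple(int(part) for part in label.split("."))


def is_descendant(a: str, b: str) -> bool:
    """True if b's label is a proper prefix of a's label."""
    pa, pb = parse(a), parse(b)
    return len(pa) > len(pb) and pa[:len(pb)] == pb


def is_child(a: str, b: str) -> bool:
    """True if b's label equals a's label minus its last component."""
    return parse(a)[:-1] == parse(b)


def is_following_sibling(a: str, b: str) -> bool:
    """True if a and b share a parent label and a comes after b."""
    pa, pb = parse(a), parse(b)
    return pa[:-1] == pb[:-1] and pa[-1] > pb[-1]


LABELS = dewey_labels(TREE)
```

Sorting labels with `key=parse` yields document order, which is why a relational encoding along these lines would typically compare labels component by component (or store them in an order-preserving binary form) rather than as plain strings.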


The Dewey Decimal Classification was originally developed for general knowledge classification. The basic Dewey-based encoding assigns to each node in an XML tree an identifier that records the position of a node among its siblings, prefixed by the identifier of its parent node. In Fig. 1b, the Dewey-based encoding would assign the identifier 1.1.2 to the dashed-line year element. In range-based order encodings, such as interval or pre/post encoding, a unique {start,end} interval identifies each node in the document tree. This interval can be generated in multiple ways. The most common method is to create a unique identifier, start, for each node in a preorder traversal of the document tree, and a unique identifier, end, in a postorder traversal. Additionally, in order to distinguish children from descendants, a level number needs to be recorded with each node.

An important consideration for any order-encoding scheme is to be able to handle updates in the XML documents, and many improvements have been made to the above basic encodings to reduce the overhead associated with updates.

Hybrid XML Storage
Some XML documents have both very structured and very unstructured parts. This has led to the idea of hybrid XML storage, where different data models, and even systems using different storage technologies, store different document parts. For example, in Fig. 3b, review elements and their subtrees can be stored very differently from show elements, for example by serializing each review according to the dashed lines in the figure or storing them in a native XML storage system.

Prototype systems such as MARS and XAM have been proposed that support a hybrid storage model at the system level, i.e., provide physical data independence. In these systems, different access methods corresponding to the different storage mappings are formally described using views and constraints, and query processing involves the use of query rewriting using views ([EDS reference to Query rewriting using views]). Moreover, an appropriate tool or language is necessary to specify hybrid storage designs effectively and declaratively.

An additional consideration in favor of hybrid XML storage is that storing some information redundantly using different techniques can improve the performance of querying and data retrieval significantly by combining their benefits. For example, schema-directed relational storage mappings often give better performance for identifying the elements that satisfy an XPath query, while native storage allows the direct retrieval of large elements. In environments where updates are infrequent or update cost is less important than query performance, such as various web-based query systems, such redundant storage approaches can be beneficial.

Key Applications
XML Storage techniques are used to efficiently store XML documents, XML messages, accumulated XML streams and any other form of XML-encoded content. XML Storage is a key component of an XML database management system. It can also provide significant benefits for the storage of semi-structured information with mostly tree structure, including scientific data.

Cross-references
▶ Dataguides
▶ Deweys
▶ Intervals
▶ Storage Management
▶ XML Document
▶ XML Indexing
▶ XML Query Processing
▶ XML Schema
▶ XPath/XQuery

Recommended Reading
1. Arion A., Benzaken V., Manolescu I., and Papakonstantinou Y. Structured materialized views for XML queries. In VLDB. Vienna, Austria, 2007, pp. 87–98.
2. Barbosa D., Freire J., and Mendelzon A.O. Designing information-preserving mapping schemes for XML. In VLDB. VLDB Endowment, Trondheim, Norway, 2005, pp. 109–120.
3. Beyer K., Cochrane R.J., Josifovski V., Kleewein J., Lapis G., Lohman G., Lyle B., Özcan F., Pirahesh H., Seemann N., Truong T., der Linden B.V., Vickery B., and Zhang C. System RX: one part relational, one part XML. In SIGMOD Conference. ACM, New York, NY, USA, 2005, pp. 347–358.
4. Chaudhuri S., Chen Z., Shim K., and Wu Y. Storing XML (with XSD) in SQL databases: interplay of logical and physical designs. IEEE Trans. Knowl. Data Eng., 17(12):1595–1609, 2005.
5. Deutsch A., Fernandez M., and Suciu D. Storing semi-structured data with STORED. In SIGMOD Conference. ACM, New York, NY, USA, 1999, pp. 431–442.
6. Fiebig T., Helmer S., Kanne C.C., Moerkotte G., Neumann J., Schiele R., and Westmann T. Anatomy of a native XML base management system. VLDB J., 11(4):292–314, 2003.


7. Georgiadis H. and Vassalos V. XPath on steroids: exploiting relational engines for XPath performance. In SIGMOD Conference. ACM, New York, NY, USA, 2007, pp. 317–328.
8. Härder T., Haustein M., Mathis C., and Wagner M. Node labeling schemes for dynamic XML documents reconsidered. Data Knowl. Eng., 60(1):126–149, 2007.
9. McHugh J., Abiteboul S., Goldman R., Quass D., and Widom J. Lore: a database management system for semistructured data. SIGMOD Rec., 26:54–66, 1997.
10. Shanmugasundaram J., Tufte K., He G., Zhang C., DeWitt D., and Naughton J. Relational databases for querying XML documents: limitations and opportunities. In VLDB. Morgan Kaufmann, San Francisco, CA, USA, 1999, pp. 302–314.
11. Vélez F., Bernard G., and Darnis V. The O2 object manager: an overview. In Building an Object-Oriented Database System, The Story of O2. Morgan Kaufmann, San Francisco, CA, USA, 1992, pp. 343–368.
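
As a supplementary sketch of the range-based (pre/post) order encoding described under Order Encoding Schemes, the following assigns each node a start rank from a preorder traversal, an end rank from a postorder traversal, and a level number. The tree representation, the unique node names (used only as dictionary keys) and the function names are assumptions made for illustration; real systems use many variants of this idea.

```python
from typing import Dict, Tuple

# Hypothetical tree: (name, [children]). Names are unique here only so
# that they can serve as dictionary keys in this illustration.
TREE = ("shows", [
    ("show", [
        ("title", []),
        ("year", []),
        ("reviews", [("review1", []), ("review2", [])]),
    ]),
])

Label = Tuple[int, int, int]  # (start, end, level)


def prepost_labels(root) -> Dict[str, Label]:
    """Assign (preorder rank, postorder rank, depth) to every node."""
    labels: Dict[str, Label] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, level: int) -> None:
        name, children = node
        counters["pre"] += 1            # rank when the node is first reached
        start = counters["pre"]
        for child in children:
            visit(child, level + 1)
        counters["post"] += 1           # rank when the node's subtree is done
        labels[name] = (start, counters["post"], level)

    visit(root, 0)
    return labels


def is_descendant(a: Label, b: Label) -> bool:
    """a is below b: a starts after b and finishes before b."""
    return b[0] < a[0] and a[1] < b[1]


def is_child(a: Label, b: Label) -> bool:
    """A child is a descendant exactly one level deeper."""
    return is_descendant(a, b) and a[2] == b[2] + 1


L = prepost_labels(TREE)
```

In this variant start and end are ranks in two separate traversals, so a descendant test is a pair of range comparisons; this matches the observation in the text that translated queries often involve range conditions and benefit from B-tree indexes. The level component is what separates children from deeper descendants.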