NATIVE XML DATABASES vs. RELATIONAL DATABASES IN DEALING WITH XML DOCUMENTS
Kragujevac J. Math. 30 (2007) 181-199.

Gordana Pavlović-Lažetić
Faculty of Mathematics, University of Belgrade, 11000 Belgrade, Serbia
(e-mail: [email protected])

(Received October 30, 2006)

Abstract. When dealing with data-centric XML documents, it is possible to convert XML documents into a relational database, which can then be queried using SQL. Such relational databases are called XML-enabled databases. On the other hand, the best choice for storing, updating and retrieving document-centric XML documents is usually a native XML database (NXD). NXDs store XML documents as logical units, and retrieve documents using specific query languages such as XPath or XQuery. This paper presents different approaches to accessing XML documents from relational databases, as well as from native XML databases. The approaches are compared by how generally they handle different types of XML documents and by how expressive they are in stating requests for data, especially recursive queries. Two examples of different types of XML documents are presented. The first one is a part explosion problem, as a data-centric example. The second one is a large, highly hierarchical XML document, the Serbian language wordnet (a lexical-semantic network), as a document-centric example.

1. INTRODUCTION

XML (eXtensible Markup Language) has been designed as a markup language and a textual file format. It provides for a description of a document's contents, with non-predefined tags, and does not provide for any presentational characteristics.

The following is an example of an XML-tagged document (an excerpt from a restaurant menu), contained in the file simple.xml.
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE menu (View Source for full doctype...)>
    <menu date='5.10.2006'>
      <food>
        <name>Homemade Bean Soup</name>
        <price>100.00</price>
        <description>a bowl of white bean soup with pepper and onion</description>
        <calories>650</calories>
      </food>
      <food>
        <name>French Toast</name>
        <price>60.50din</price>
        <description>thick slices made from our homemade bread</description>
        <calories>300</calories>
      </food>
      <food>
        <name>Homestyle Breakfast</name>
        <price>195.95din</price>
        <description>two eggs, bacon or sausage, toast</description>
        <calories>950</calories>
      </food>
    </menu>

Still, XML provides yet another type of data model: an ordered tree with typed, named nodes and data in leaf nodes only. An XML document is a linearization of the tree structure.

There are a number of advantages of using the XML data format over other data formats, e.g., the relational one. For example, heterogeneity of data records is supported in a more natural way, extensibility is provided by allowing different data types in a single document, and flexibility is provided through variety in size and configuration among instances of the same data type. Sometimes, manipulating data in the XML format is significantly more efficient than in traditional formats. At the same time, the XML data format has obvious disadvantages, one of them being its inefficient record format. Data manipulation is often slower than in traditional formats, and optimization is complex due to the richness and expressiveness of the query languages.

2. XML DATA MODELS

Unlike the relational data model, there is no unique XML data model. Still, all XML data models are extensions of the basic one, and some of them will be briefly presented.

The basic XML model includes different types of nodes, such as [1]:

• element node, e.g., <price>195.95din</price> or <menu date='5.10.2006'> ... </menu>;
• document node, one special kind of element node, represented by the <!DOCTYPE> node;
• processing instruction node, e.g., <?xml version...?>;
• comment (in the form of <!--c-->);
• data, which always reside in leaf nodes and have only one characteristic: the data itself, e.g., "Homestyle Breakfast" or "two eggs, bacon or sausage, toast".

An element node has a type, e.g., price, food or menu. It may also have an ordered list of children (for the element node of type food these are element nodes of type name, price, description, calories) and an unordered set of attributes in the form of (attribute name - attribute value) pairs, e.g., "date='5.10.2006'". The document node has a type (menu in our example) but no attributes. It also has exactly one element node child, which must have the same type as the document node (in our example, the menu-typed element node).

XML may also be considered as an abstract data type (ADT) which is very rich, containing Strings and Identifiers, partly ordered in a sequence, partly hierarchical, and partly in an unordered database-like keyword/value system. The ADT XML Node actually imports the previously defined data types Identifier, URL, char, int, boolean and nil, and defines mappings between the types (functions) for creating an XML document, setting and getting the schema and attributes, and manipulating nodes, as well as linearization functions.

Finally, there is an XML data model based on the languages for querying XML documents, XQuery and XPath. The so-called XDM, the XQuery 1.0 and XPath 2.0 Data Model, is a W3C Candidate Recommendation as of the end of 2005.

2.1. DATA-CENTRIC AND DOCUMENT-CENTRIC XML DOCUMENTS

With respect to the rigidity of their structure, XML documents fall into two broad categories: data-centric and document-centric. Data-centric documents are those containing structured data, such as price lists. Data appear in a regular order and are usually stored in databases, while XML is used just for data exchange and publishing.
They may also contain semi-structured data, such as phonebooks or patient records. In general, relational databases are efficient enough for storing data contained in data-centric XML documents.

Document-centric XML documents are those characterized by irregular structure and mixed content, such as user's manuals and marketing brochures. Storing and manipulating various XML documents in a shared repository usually requires more than a relational database.

3. XML AND DATABASES

Regardless of whether XML is used as a storage or interchange format for data-centric data, or for creating semi-structured, document-centric documents such as XHTML, it is sometimes necessary to store the XML in some sort of repository or database that allows for more sophisticated storage and retrieval of the data, especially if the XML is to be accessed by multiple users.

There are conceptually two different ways to store XML documents in a database. The first is to map the document's data model to a database model and convert XML data into the database according to that mapping. The second is to map the XML model into a fixed set of persistent (database) structures that can store any XML document. Databases that support the first method are called XML-enabled databases. Databases that support the second method are called native XML databases. XML-enabled databases have their own data model (relational, hierarchical, or object-oriented), and they map instances of the XML data model to instances of their data model. Native XML databases use the XML data model directly [2]. Although the choice is somewhat arbitrary, it is usually more convenient and more efficient to store and manipulate data-centric XML documents using XML-enabled databases, and document-centric XML documents using native XML databases.

3.1. XML-ENABLED DATABASES

One way to map XML documents into an XML-enabled database is to create relational views on XML documents stored in columns of a relational database, which can then be queried using SQL. In XML-enabled databases, different XML document schemas correspond to different database schemas. XML is external to the database and invisible inside it.

As an example of such a database we may consider the "menu" XML document. It may be represented (either physically or as a relational view) as the following relation:

    food
    name                  price    description                                        calories
    Homemade Bean Soup    100.00   a bowl of white bean soup with pepper and onion    650
    French Toast          60.50    thick slices made from our homemade bread          300
    Homestyle Breakfast   195.95   two eggs, bacon or sausage, toast                  950

A query asking for foods with more than 600 calories may now be stated by the following SQL statement:

    select name
    from food
    where calories > 600

If we had another two relations,

    restaurant(name, address), serving(food_name, rest_name)

we would be able to formulate an SQL join query asking where one can eat and for what amount of money:

    select restaurant.address, food.price
    from food, restaurant, serving
    where food.name = serving.food_name and restaurant.name = serving.rest_name

3.2. NATIVE XML DATABASES

A native XML database (NXD) is best described as a database that has an XML document (or its rooted part) as its fundamental unit of (logical) storage and defines a (logical) model for an XML document, as opposed to the data in that document (its contents). It represents the logical XML document model (not the XML document data model), and stores and manipulates documents according to that model [2].
Basic characteristics of an NXD are the following:

• a logical unit of an NXD is an XML document or its rooted part, and it corresponds to a row in a relational database,
• it includes at least the following components: elements, attributes, textual data (PCDATA), and document order,
• the physical model (and the type of persistent NXD storage) is unspecified.

In a native XML database, XML is visible inside the database. There is a unique database for all XML schemas and documents. Native XML databases are especially suitable for storing irregular, deeply hierarchical, recursive data.

Following the definition, an important characteristic of a native XML database is that its physical model is unspecified.
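The contrast between Sections 3.1 and 3.2 can be illustrated on the menu document with a minimal sketch. It assumes Python's standard sqlite3 and xml.etree.ElementTree modules as stand-ins: sqlite3 plays the role of an XML-enabled relational database, and an in-memory element tree only imitates the way an NXD keeps the document itself as the queried unit. Neither is a system discussed in this paper, and the function names query_relational and query_document are hypothetical. The menu is abbreviated (descriptions omitted), and the "din" suffix is stripped from prices before they are stored as numbers.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Abbreviated copy of the menu document (descriptions omitted for brevity).
MENU_XML = """<menu date='5.10.2006'>
  <food><name>Homemade Bean Soup</name><price>100.00</price><calories>650</calories></food>
  <food><name>French Toast</name><price>60.50din</price><calories>300</calories></food>
  <food><name>Homestyle Breakfast</name><price>195.95din</price><calories>950</calories></food>
</menu>"""


def text_of(food, tag):
    """Return the stripped text of the child element `tag` of a food element."""
    return food.findtext(tag).strip()


def query_relational(xml_text, min_calories):
    """XML-enabled style: map the document to a relation, then query it with SQL."""
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")
    conn.execute("create table food (name text, price real, calories integer)")
    for food in root.findall("food"):
        # Assumption: prices are plain numbers, optionally followed by "din".
        price = float(text_of(food, "price").replace("din", ""))
        conn.execute(
            "insert into food values (?, ?, ?)",
            (text_of(food, "name"), price, int(text_of(food, "calories"))),
        )
    rows = conn.execute(
        "select name from food where calories > ? order by rowid",
        (min_calories,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


def query_document(xml_text, min_calories):
    """Native style: keep the document as a tree and select nodes in document order."""
    root = ET.fromstring(xml_text)
    return [
        text_of(food, "name")
        for food in root.findall("food")
        if int(text_of(food, "calories")) > min_calories
    ]


if __name__ == "__main__":
    print(query_relational(MENU_XML, 600))
    print(query_document(MENU_XML, 600))
```

Both functions answer the query "foods with more than 600 calories". In the first, the XML disappears once the rows are inserted, so a different document schema would need a different table. In the second, the document structure stays visible to the query. ElementTree supports only a subset of XPath, so the numeric comparison is done in Python here; a real NXD would express it in XPath or XQuery.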