XML

Introduction:

Relational Database:

A relational database is a powerful data storage and retrieval technology in which data is stored as rows in tables, and a database has one or more tables. Each row of a table has the same columns as every other row in that table. Data is related between tables through “foreign keys”, so that a row of one table can be associated with one or more rows of another table.

Data in a relational database is readable by executing SQL queries in a management tool to extract and present the data in any number of ways. The extraction requires an understanding of the database structure, including the relationships. Designing a good non-trivial relational database requires significant training and/or significant experience with relational database design techniques.
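As a rough illustration of tables, foreign keys, and SQL extraction, the sketch below uses Python's built-in sqlite3 module. The table and column names (company, employee, company_id) are made up for this example and are not taken from any particular system.

```python
import sqlite3

# In-memory database; the schema below is purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE company (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE employee (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    company_id INTEGER NOT NULL REFERENCES company(id)  -- foreign key
);
""")

conn.execute("INSERT INTO company (id, name) VALUES (1, 'XML Database')")
conn.executemany(
    "INSERT INTO employee (name, company_id) VALUES (?, ?)",
    [("Ruth Sam", 1), ("Tommy Bryan", 1)],
)

# Extracting the data requires knowing how the two tables are related.
rows = conn.execute("""
    SELECT e.name, c.name
    FROM employee AS e
    JOIN company AS c ON c.id = e.company_id
    ORDER BY e.name
""").fetchall()
print(rows)
```

The join condition is where knowledge of the database structure comes in: without knowing that employee.company_id refers to company.id, the two tables cannot be combined meaningfully.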

XML Database:

XML has emerged as the standard for representing and exchanging data on the World Wide Web. The growing number of XML documents creates a need to store and query them efficiently.

An XML database is used to store large amounts of information in XML format.

XML Schema is commonly known as XML Schema Definition (XSD). It is used to describe and validate the structure and content of XML data. An XML schema defines the elements, attributes, and data types.

XML is becoming the predominant data format in a variety of application domains (e.g., supply chain, scientific data processing, telecommunication infrastructure). Many such applications produce and consume large volumes of XML data and thus require efficient and reliable storage systems. The use of relational database systems for this purpose has attracted considerable interest from both the research community and database vendors.

Elements are the fundamental units of XML content.

Element name: wrapped in tags (markup); it describes the content.

Element content: anything that goes between a pair of opening and closing tags.
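The distinction between element name and element content can be seen by parsing a small document. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the contact, name, and phone elements are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny document: <name> and <phone> are element names (tags),
# the text between each pair of tags is the element content.
doc = "<contact><name>Ruth Sam</name><phone>443-123-4567</phone></contact>"

root = ET.fromstring(doc)

# Children are returned in document order.
pairs = [(child.tag, child.text) for child in root]
print(pairs)
```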

Major differences between XML data and relational data

XML data is hierarchical; relational data is represented in a model of logical relationships. An XML document contains information about the relationship of data items to each other in the form of the hierarchy. With the relational model, the only types of relationships that can be defined are parent table and dependent table relationships.

XML data is self-describing; relational data is not. An XML document contains not only the data, but also tagging for the data that explains what it is. A single document can have different types of data. With the relational model, the content of the data is defined by its column definition. All data in a column must have the same type.

XML data has inherent ordering; relational data does not. For an XML document, the order in which data items are specified is assumed to be the order of the data in the document. There is often no other way to specify order within the document. For relational data, the order of the rows is not guaranteed unless you specify an ORDER BY clause on one or more columns.

Comparison of Concepts between XML Database and Relational Database Systems:

1. Structuring and Typing Mechanisms

The basic mechanisms used to specify the structure of XML documents and relational schemata are element types and attributes for XML, and relations and attributes for RDBS. For each XML document, all component element types must be rooted in a single element type. This is in contrast to RDBS, where part-of hierarchies cannot be realized by means of nesting, since relations consist of atomic-valued attributes only. However, part-of hierarchies can be expressed in RDBS by means of foreign key constraints.

2. Uniqueness of Names

The name of a relation is required to be unique within the whole relational schema, similar to the name of an XML element type being unique throughout the DTD. By means of so-called namespaces, XML allows element types to have the same name by using different namespace prefixes. The name of an XML attribute defined within a DTD or an XML Schema has to be unique within its element type, again similar to an RDBS attribute's name, which has to be unique within its relation.

3. Null Values and Default Values

Similar to RDBS, XML can express null values as well as default values. In RDBS the concept of null values is defined for attributes only. XML, however, supports null values for both attributes and elements. In DTDs, default values may be applied to XML attributes only, whereas XML Schema supports default values for XML element types, too.

4. Identification

In RDBS, the unique identification of tuples is done by means of a primary key, which may be composed of one or more attributes of the corresponding relation. In DTDs, only a single attribute of an element type can be designated as the identifying attribute, by means of the special attribute type ID, which may in turn contain a string value. XML Schema allows not just attributes, but also element types of any arbitrary atomic domain, and combinations thereof, to serve as keys. The scope of identification in RDBS is a single relation, i.e., the value of the primary key uniquely identifies each tuple within a relation. In DTDs, the scope of identification is broader in the sense that the value of an ID attribute is unique within the whole XML document. This allows the unique identification of an element not only with respect to other elements of the same element type but across all elements of any element type. XML Schema allows the scope of each key to be specified by means of an XPath expression.

5. Relationships

In RDBS, relationships can be expressed between relations by means of foreign keys, i.e., arbitrary attributes that refer to the primary key of the same relation or of another relation. The number of tuples which may participate in a relationship can be constrained by defining the foreign key as NOT NULL and/or UNIQUE.

Fig: Comparison of Relationships

Example showing XML and Relational Database:

EmployeeDB

Name          Company        Phone
Ruth Sam      XML Database   443-123-4567
Tommy Bryan   XML Database   443-789-4567

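<EmployeeDB>
   <Employee>
      <Name>Ruth Sam</Name>
      <Company>XML Database</Company>
      <Phone>443-123-4567</Phone>
   </Employee>
   <Employee>
      <Name>Tommy Bryan</Name>
      <Company>XML Database</Company>
      <Phone>443-789-4567</Phone>
   </Employee>
</EmployeeDB>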
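The same two records can be moved between row form and XML form programmatically. The sketch below is only an illustration: the element names EmployeeDB, Employee, Name, Company, and Phone are assumptions derived from the table caption and column headings, not names prescribed by any tool.

```python
import xml.etree.ElementTree as ET

# The two records from the example, as relational-style rows.
columns = ("Name", "Company", "Phone")
rows = [
    ("Ruth Sam", "XML Database", "443-123-4567"),
    ("Tommy Bryan", "XML Database", "443-789-4567"),
]

def rows_to_xml(rows):
    """Build one <Employee> element per row (element names are assumed)."""
    root = ET.Element("EmployeeDB")
    for row in rows:
        employee = ET.SubElement(root, "Employee")
        for column, value in zip(columns, row):
            ET.SubElement(employee, column).text = value
    return root

def xml_to_rows(root):
    """Flatten each <Employee> element back into a tuple of column values."""
    return [
        tuple(employee.findtext(column) for column in columns)
        for employee in root.findall("Employee")
    ]

xml_text = ET.tostring(rows_to_xml(rows), encoding="unicode")
round_tripped = xml_to_rows(ET.fromstring(xml_text))
print(xml_text)
```

The round trip works here only because every record has exactly the same flat structure; nested or repeating elements would not fit a single table.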

XML Database Types:

There are three different types of XML database:

1. Native XML Database (NXD):

(a) Defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models are the XPath data model, the XML Infoset, and the models implied by the DOM and the events in SAX 1.0.

(b) Has an XML document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage.

(c) Is not required to have any particular underlying physical storage model. For example, it can be built on a relational, hierarchical, or object-oriented database, or use a proprietary storage format such as indexed, compressed files.

2. XML Enabled Database (XEDB): A database that has an added XML mapping layer provided either by the database vendor or a third party. This mapping layer manages the storage and retrieval of XML data. Data that is mapped into the database is mapped into application specific formats and the original XML meta-data and structure may be lost. Data retrieved as XML is NOT guaranteed to have originated in XML form. Data manipulation may occur via either XML specific technologies (e.g. XPath, XSLT, DOM or SAX) or other database technologies (e.g. SQL). The fundamental unit of storage in an XML Enabled Database is implementation dependent.

3. Hybrid XML Databases (HXD): A database that can be treated as either a Native XML Database or as an XML Enabled Database depending on the requirements of the application.

XML documents can be data-centric or document-centric:

Data-centric documents are produced as an import or export format, that is, they are intended for machine consumption. These documents are used for communicating data between companies or applications, and the fact that XML is used as a common format is simply a matter of convenience, for reasons of interoperability. Examples of data-centric documents are sales orders, scientific data, and stock quotes.

Document-centric documents are usually designed for human consumption, with examples ranging from books to hand-written XHTML documents. They are usually composed directly in XML, or in some other format and then converted to XML. Document-centric documents do not need to have a regular structure, have coarse-grained data (that is, the smallest independent data unit may well be the document itself), and have mixed content. For example, the following memo document is document-centric.

Converting XML to relational database

There are various ways to convert XML data into and out of relational databases effectively and automatically. DB2 offers two methods for shredding XML data. The first method uses SQL INSERT statements with the XMLTABLE function. One such INSERT statement is required for each target table, and multiple statements can be combined in a stored procedure to avoid repetitive parsing of the same XML document. The shredding statements can include XQuery and SQL functions, joins to other tables, or references to DB2 sequences. These features allow for customization and a high degree of flexibility in the shredding process, but require manual coding. The second approach for shredding XML data uses annotations in an XML Schema to define the mapping from XML to relational tables and columns. IBM Data Studio Developer provides a visual interface to create this mapping conveniently with little or no manual coding.

Different techniques used for conversion:

1. Query Translator

An XPath-to-SQL query translator supports a subset of XPath. The query translator is generic and does not hard-code mapping choices; instead, it uses the information provided by the mapping API to perform the translation.

The translator algorithm consists of the following steps:

Step 1: Resolve wildcards, so that a set of simple paths is obtained

Step 2: For each simple path, consult the mapping API and bind XML-to-relational mapping information to the nodes in the path

Step 3: Generate SQL query for the annotated path

Step 4: Union the SQL queries (each of them corresponds to one path)
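The four steps can be sketched roughly as follows. This is a toy illustration, not an actual translator: the MAPPING dictionary stands in for the mapping API, all table and column names are hypothetical, and only child steps and the * wildcard are handled (no predicates, no // axis, no joins). UNION ALL is used here so that duplicate values are kept; plain UNION would remove them.

```python
# Hypothetical mapping information: simple XML path -> (table, column).
# In a real system this would come from the mapping API.
MAPPING = {
    ("customer", "name"): ("customer", "name"),
    ("customer", "phone"): ("phone", "number"),
    ("customer", "address", "city"): ("customer", "city"),
}

def resolve_wildcards(xpath):
    """Step 1: expand '*' steps into the set of known simple paths."""
    steps = tuple(xpath.strip("/").split("/"))
    return [
        path for path in MAPPING
        if len(path) == len(steps)
        and all(step in ("*", name) for step, name in zip(steps, path))
    ]

def bind_mapping(path):
    """Step 2: annotate a simple path with its relational mapping."""
    table, column = MAPPING[path]
    return {"path": path, "table": table, "column": column}

def generate_sql(bound):
    """Step 3: generate one SQL query for an annotated path."""
    return f"SELECT {bound['column']} AS value FROM {bound['table']}"

def translate(xpath):
    """Step 4: union the per-path queries into a single statement."""
    queries = [generate_sql(bind_mapping(p)) for p in resolve_wildcards(xpath)]
    return " UNION ALL ".join(queries)

print(translate("/customer/*"))
print(translate("/customer/address/city"))
```

Because the mapping is looked up rather than hard-coded in the translation functions, swapping in a different XML-to-relational mapping only means changing the MAPPING data.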

2. XML to Relational Mapping

Mappings between an XML document schema and a database schema are performed on the element types, attributes, and text, not on the physical structure. Two mappings are commonly used to map an XML document schema to the database schema: the table-based mapping and the object-relational mapping.

Table Based Mapping: The table-based mapping is used by many of the middleware products that transfer data between an XML document and a relational database. It maps XML documents as a single table or set of tables. The structure of the XML document must be as follows:

The table-based mapping is useful for serializing relational data, such as when transferring data between two relational databases. Its drawback is that it cannot be used for any XML documents that do not match the above format.
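A minimal sketch of a table-based load is shown below. It assumes a document layout in which the root element contains one element per table, each table element contains row elements, and each row contains one child element per column; the names database, contacts, and row are assumptions made for this example. Identifiers are taken straight from the XML, so the input is assumed to be trusted.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed layout: <database> / <table name> / <row> / <column name>.
doc = """
<database>
  <contacts>
    <row><name>Ruth Sam</name><phone>443-123-4567</phone></row>
    <row><name>Tommy Bryan</name><phone>443-789-4567</phone></row>
  </contacts>
</database>
"""

def load_table_based(xml_text, conn):
    """Create one table per child of the root and one row per <row> element."""
    root = ET.fromstring(xml_text)
    for table in root:
        rows = table.findall("row")
        if not rows:
            continue
        # Column names come from the first row's child elements.
        columns = [column.tag for column in rows[0]]
        column_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(f'CREATE TABLE "{table.tag}" ({column_list})')
        for row in rows:
            values = [row.findtext(c) for c in columns]
            conn.execute(
                f'INSERT INTO "{table.tag}" VALUES ({placeholders})', values
            )

conn = sqlite3.connect(":memory:")
load_table_based(doc, conn)
loaded = conn.execute("SELECT name, phone FROM contacts ORDER BY name").fetchall()
print(loaded)
```

A document with nested elements or mixed content would not fit this loader, which reflects the drawback noted above.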

Object-Relational Mapping: It models the data in the XML document as a tree of objects that are specific to the data in the document. Element types with attributes, element content, or mixed content are generally modeled as classes. Simple element types are modeled as scalar properties. The model is then mapped to relational databases using traditional object-relational mapping. That is, classes are mapped to tables, scalar properties are mapped to columns, and object-valued properties are mapped to primary key / foreign key pairs. Some products allow you to generate the classes in the model and then use objects of these classes in your application. With such products, data is transferred between the XML document and these objects, and between these objects and the database. Other products use the objects only as a tool to help visualize the mapping, and transfer data directly between the XML document and the database.

Example of Object Relational Mapping:
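One way to picture the object-relational mapping is the sketch below. The order document, the Order and Item classes, and the orders and items tables are all invented for illustration. The complex element types (order, item) become classes and tables, the simple element types (customer, part, quantity) become scalar properties and columns, and the object-valued property Order.items becomes a primary key / foreign key pair.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

doc = """
<order number="1234">
  <customer>Ruth Sam</customer>
  <item><part>A-10</part><quantity>2</quantity></item>
  <item><part>B-20</part><quantity>1</quantity></item>
</order>
"""

@dataclass
class Item:            # complex element type <item> -> class
    part: str          # simple element type -> scalar property
    quantity: int

@dataclass
class Order:           # complex element type <order> -> class
    number: int        # XML attribute -> scalar property
    customer: str
    items: list = field(default_factory=list)  # object-valued property

def xml_to_object(xml_text):
    """Transfer data from the XML document into objects."""
    root = ET.fromstring(xml_text)
    order = Order(int(root.get("number")), root.findtext("customer"))
    for node in root.findall("item"):
        order.items.append(
            Item(node.findtext("part"), int(node.findtext("quantity")))
        )
    return order

def object_to_tables(order, conn):
    """Transfer data from the objects into tables: class -> table."""
    conn.executescript("""
        CREATE TABLE orders (number INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE items (
            order_number INTEGER REFERENCES orders(number),
            part TEXT,
            quantity INTEGER
        );
    """)
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order.number, order.customer))
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [(order.number, item.part, item.quantity) for item in order.items],
    )

conn = sqlite3.connect(":memory:")
order = xml_to_object(doc)
object_to_tables(order, conn)
item_rows = conn.execute(
    "SELECT order_number, part, quantity FROM items ORDER BY part"
).fetchall()
print(item_rows)
```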

3. Shredding:

Shredding is the conversion of XML documents into rows in relational tables. Shredding is still used because legacy applications, packaged business applications, and reporting software do not always understand XML and have fixed relational interfaces.

Example of Shredding of an XML Document

In this example, XML documents with customer name, address, and phone information are mapped to two relational tables. The documents can contain multiple phone elements because there is a one-to-many relationship between customers and phones. Hence, phone numbers are shredded into a separate table. Each repeating element, such as phone, leads to an additional table in the relational target schema. Suppose the customer information can also contain multiple email addresses, multiple accounts, a list of most recent orders, multiple products per order, and other repeating items. The number of tables required in the relational target schema can increase very quickly.
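The customer and phone case can be sketched with Python's standard library as follows. The document layout and the customer and phone table definitions are assumptions made for this illustration, and SQLite stands in for the target relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical customer document with a repeating <phone> element.
doc = """
<customer id="1001">
  <name>Ruth Sam</name>
  <address>12 Main Street</address>
  <phone>443-123-4567</phone>
  <phone>443-555-0100</phone>
</customer>
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    -- The repeating element gets its own table, linked by a foreign key.
    CREATE TABLE phone (customer_id INTEGER REFERENCES customer(id), number TEXT);
""")

def shred(xml_text, conn):
    """Split one customer document into rows of the two target tables."""
    root = ET.fromstring(xml_text)
    customer_id = int(root.get("id"))
    conn.execute(
        "INSERT INTO customer VALUES (?, ?, ?)",
        (customer_id, root.findtext("name"), root.findtext("address")),
    )
    conn.executemany(
        "INSERT INTO phone VALUES (?, ?)",
        [(customer_id, phone.text) for phone in root.findall("phone")],
    )

shred(doc, conn)

# Reassembling the customer view already needs a join across both tables.
joined = conn.execute("""
    SELECT c.name, p.number
    FROM customer AS c
    JOIN phone AS p ON p.customer_id = c.id
    ORDER BY p.number
""").fetchall()
print(joined)
```

Each further repeating element (email, account, order) would add another table and another join to this query.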

Types of Shredding:

Partial shredding means that only a subset of the elements or attributes from each incoming XML document are shredded into relational tables. This is useful if a relational application does not require all data values from each XML document.

Hybrid XML storage means that upon insert of an XML document into an XML column, selected element or attribute values are extracted and redundantly stored in relational columns.
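Hybrid storage can be approximated in a few lines. In this sketch the whole document is kept intact while one selected value is extracted into its own relational column. SQLite has no XML column type, so a TEXT column stands in for it here, and the customer_doc table and its columns are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_doc (
        id   INTEGER PRIMARY KEY,  -- extracted from the id attribute
        name TEXT,                 -- extracted redundantly from <name>
        doc  TEXT NOT NULL         -- full document (stand-in for an XML column)
    )
""")

def insert_hybrid(xml_text, conn):
    """Store the full document plus selected values in relational columns."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO customer_doc VALUES (?, ?, ?)",
        (int(root.get("id")), root.findtext("name"), xml_text),
    )

doc = '<customer id="1001"><name>Ruth Sam</name><phone>443-123-4567</phone></customer>'
insert_hybrid(doc, conn)

name, stored = conn.execute(
    "SELECT name, doc FROM customer_doc WHERE id = 1001"
).fetchone()
print(name)
```

Dropping the doc column from this sketch would turn it into partial shredding: only the selected values (id and name) would reach the relational side, and the phone element would be discarded.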

Advantages: The shredding process is very flexible and allows users to set various parameters (e.g., target database system, login information, bulk loading option) either through the command line or through a configuration file.

Disadvantages: Shredding XML into a large number of tables can lead to a complex and unnatural fragmentation of your logical business objects that makes application development difficult and error-prone. Querying the shredded data or reassembling the original documents may require complex multiway joins.

Summary: To conclude, relational tables have unordered collections of rows with strictly typed columns, and each row in a table must have the same structure. One-to-many relationships are expressed by using multiple tables and join relationships between them. In contrast, XML documents tend to have a hierarchical and nested structure that can represent multiple one-to-many relationships in a single document. XML allows elements to be repeated any number of times, and XML Schemas can define hundreds or thousands of optional elements and attributes that may or may not appear in any given document. If the structure of XML data is of limited complexity such that it can easily be mapped to relational tables, and if the XML format is unlikely to change over time, then XML shredding can sometimes be useful to feed existing relational applications and reporting software.
