With DB2 9 it became possible to store, retrieve, and query XML documents in DB2 for z/OS using the pureXML functionality. Although most DB2 DBAs today are familiar with the concepts of this hybrid technology, it takes some effort to acquire the skills needed to implement it on site. This presentation focuses on a practical case that the speaker used to acquire these skills, and shows experiences and scenarios that everyone can try out in their own environment.

The objectives of this presentation are:
• Understand the benefits of storing XML data in pureXML format instead of relational format
• Create and populate an XML database with real data and realistic volumes to experiment with
• Use SQL and XPath to query this XML data, and handle namespaces
• Implement XML schemas
• Understand the use of XML indexes to improve query performance

XML, the Extensible Markup Language, acts as a flexible and self-describing data format for data exchange, web services, and service-oriented architectures. XML is also a hierarchical data model that is inherently different from the relational model. While relational data processing is based on rigorous and predefined schemas that allow for limited flexibility, XML is well suited to represent data with variable or evolving schemas. XML is also commonly used as a data format for semi-structured and unstructured data. Depending on the performance and flexibility requirements of particular applications, you will find that in some cases XML is a better choice than a relational schema, and in other cases relational data has advantages over XML. Many scenarios also exist in which a hybrid approach, that is, a mix of XML and relational data, is the best solution. DB2 pureXML provides sophisticated capabilities for storing, indexing, querying, updating, and validating XML documents. The pureXML technology and its native XML storage format provide significantly higher performance and flexibility than alternative storage options for XML data, such as LOBs or shredding. DB2 pureXML also enables seamless integration of XML and relational data.

(From DB2 pureXML Cookbook, p. 13)

One reason to still store an XML document as a LOB is to be able to retrieve it afterwards byte for byte identical to the original document (for example, for compliance or auditing reasons). If you store an XML document as the pureXML data type, the XML parser might remove insignificant whitespace.

DB2 pureXML has been designed to overcome the problems that are inherent in LOB storage and shredding. The advantages of DB2 pureXML and its native XML storage format include:
• Retaining awareness of the internal structure of the XML data: Contrary to LOB storage, DB2 pureXML stores XML in a parsed tree format that explicitly represents the structure of each XML document. As a result, applications can query and update XML data using XQuery, XPath, and SQL/XML without XML parsing at runtime. This is a critical performance benefit. Additionally, indexes can be created on specific nodes.
• Keeping business objects intact: DB2 pureXML stores each XML document as a cohesive unit that belongs to one row in a table, providing a very intuitive storage and processing model. In contrast, XML shredding scatters the values of each XML document over a number of tables. Hence, shredding can result in an unwieldy relational schema that is difficult to understand and inefficient for queries and the reconstruction of XML documents.
• Schema flexibility: While shredding requires all XML documents to adhere to a single XML Schema that is mapped to relational tables, DB2 pureXML can store documents for variable or evolving schemas in the same XML column. The cost of schema evolution is much lower for DB2 pureXML than for a shredding approach.
• Faster application development: Because DB2 pureXML does not require any schema mapping and uses a single XML column instead of a complex relational schema, prototyping and designing applications can be much simpler with DB2 pureXML than with shredding.

(From DB2 pureXML Cookbook, p. 10)

In DB2 pureXML Cookbook, two of IBM's leading experts (Matthias Nicola and Pav Kumar-Chatterjee) provide the single most comprehensive coverage of DB2's pureXML capabilities. This book explains DB2 pureXML in more than 700 practical examples, including 250+ XQuery and SQL/XML queries, taking the reader from simple introductions all the way to advanced scenarios. The authors have distilled their hands-on experience with many pureXML applications so that you can benefit from best practices, tips and tricks, performance guidelines, and other gems that are not documented elsewhere. This book is invaluable for database administrators and application developers, beginners and DB2 experts. The topics are organized by typical user tasks throughout the life cycle of XML database projects, from planning, designing, and implementing databases all the way to tuning, problem determination, and application development. It includes code samples for Java, .NET, COBOL, PL/1, C, PHP, and Perl programmers. The DB2 pureXML Cookbook provides proven recipes rather than a mere reference of ingredients.

In DB2 for z/OS, the installation job DSNTEJ1 creates five tables with XML columns. These tables are in the relational schema DSN8910 and are named PRODUCT, CUSTOMER, PURCHASEORDER, CATALOG, and SUPPLIERS. Only table DSN8910.PRODUCT is populated by the installation jobs. There are several ways to populate some of these tables. For example, if you have a DB2 for Linux, UNIX, and Windows installation, such as the free DB2 Express-C, you can create the sample database and select or export the data from there. The data can then be imported or inserted into the z/OS tables using SPUFI or an import job. The PDF document "DB2 Version 9.1 for z/OS XML Guide" (SC18-9858) provides the DDL and three INSERT statements with XML data for a table called MYCUSTOMER. You can copy and paste these statements into SPUFI to build a sample table to work with.

An XML document basically consists of elements with zero, one, or more attributes. Each element consists of a start tag and an end tag. These tags are enclosed in angle brackets. Elements can have a value or contain other elements. Empty elements can have attributes and can be represented by a single empty tag. Elements can occur multiple times. Attributes always have a value. A well-formed document has a single root element. The order of elements is significant. The order of attributes is not significant. An XML document is case sensitive.
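The rules above can be tried out with a small sketch using Python's standard xml.etree.ElementTree module. The document below is hypothetical and is not one of the DB2 sample documents; it only illustrates repeated elements, attributes, an empty element, and case sensitivity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption, not from the DB2 samples).
doc = """<customer id="1001">
  <name>Kathy Smith</name>
  <phone type="work">416-555-1358</phone>
  <phone type="home">416-555-2937</phone>
  <fax/>
</customer>"""

root = ET.fromstring(doc)            # raises ParseError if not well formed
phones = root.findall("phone")       # elements can occur multiple times
phone_types = [p.get("type") for p in phones]
fax = root.find("fax")               # empty element: no value, no children

print(root.tag, root.attrib)
print(phone_types)
print(fax.text)

# Names are case sensitive: <Name> is not the same element as <name>.
print(root.find("Name"))
```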

This sample XML document is very simple in nature (no encoding declaration, no XML version, no namespaces, and a limited number of elements and attributes).

The IBM GSDB sample database is available to use in your own projects and for learning about IBM products (like IBM Data Studio). The sample database contains a rich set of sample data that follows the fictional Sample Outdoor company and its sales and operations. It can be downloaded from the web. To set up the sample database on DB2 for z/OS, you run the setup scripts from a workstation and install the database on a cataloged remote DB2 for z/OS subsystem. It contains one table with an XML column and 212 rows of data.

Note that cust-order-details1 and cust_order_details2 are empty elements with attributes, each represented by a single empty tag.

The XMLSERIALIZE function converts an XML data type to XML text. The opposite function is XMLPARSE, which converts XML text to an XML data type. Internally, a parsed XML document is stored in UTF-8.

Difference between AS CLOB and AS BLOB: the XML data is always returned in the UTF-8 encoding scheme. With CLOB it is shown as EBCDIC on the 3270 screen because of conversion to the application encoding scheme. With BLOB the resulting string stays in UTF-8.

IBM Data Studio Developer also has an XML document viewer and editor.

In QMF you can export the result of a QMF query or a table in XML format by using the DATAFORMAT=XML clause on the EXPORT DATA or EXPORT TABLE command. This format must be used when the data contains XML columns, but it can also be used when the data or table to be exported does not contain XML columns. When you export data or tables in XML format, the data is exported to the UNIX HFS file, the TSO data set, or the CICS data queue that is specified in the command. QMF uses the XML 1.0 specification (fourth edition) when exporting data. QMF uses z/OS XML parse services as well as z/OS Unicode conversion services when processing XML data for export, so these services must be configured and active.

XML data exported from QMF is always in Unicode UTF-8 format. The Unicode character set can include characters from almost all of the living languages of the world. In UTF-8, ASCII and control characters are represented by their usual ASCII single-byte codes, and other characters are two to four bytes long. The IBM UTF-8 implementation is defined by code page 1208. UTF-8 stands for "UCS Transformation Format 8".
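The byte lengths described above, and the difference between UTF-8 and EBCDIC bytes, can be checked with a short Python sketch. The choice of EBCDIC code page 037 (Python codec "cp037") is an assumption made for illustration; your site may use a different EBCDIC CCSID.

```python
# One character each from the 1-, 2-, 3- and 4-byte UTF-8 ranges:
# LATIN CAPITAL A, e with acute accent, euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# The same text has different byte values in UTF-8 and in EBCDIC.
# cp037 is an assumed EBCDIC code page, used only for illustration.
text = "<Cell>"
utf8_bytes = text.encode("utf-8")
ebcdic_bytes = text.encode("cp037")
print(utf8_bytes.hex().upper())
print(ebcdic_bytes.hex().upper())
```

This is why UTF-8 content looks unreadable when it is displayed as if it were EBCDIC, and the other way around.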

To illustrate, we use a simple result set with 2 columns and 2 rows, as shown above.

The header records in the exported XML file contain the version of XML used, the encoding scheme, and a line that references the style sheet to use to format the exported XML document. QMF provides a default style sheet "qmf_dataset.xslt" as member DSQ1STSH of the QMF samples data set SDSQSAPn. You can copy this default style sheet to the location where the exported file resides. When you uncomment the style sheet statement shown above, the XML document is formatted to these specifications when it is opened.

There is also a namespace definition and an optional xsi:schemaLocation definition that can be added (as attributes of the root element). XML namespaces are used to provide uniquely named elements and attributes in an XML document by means of a prefix. QMF uses a default namespace "http://www.ibm.com/qmf" and a prefixed namespace "http://www.w3.org/2001/XMLSchema-instance" with prefix xsi. Elements without a prefix belong to the first namespace; elements and attributes with the prefix xsi belong to the second. Attributes without a prefix have no namespace. Note that the namespaces take the form of a URL, but these URLs are not real web addresses, just unique identifiers.

In many XML documents the xsi namespace "http://www.w3.org/2001/XMLSchema-instance" has a special meaning. It is used to reference the XML schema of the document with an xsi:schemaLocation attribute (for documentation reasons, but sometimes also for processing reasons). The basic idea is that it works like a magic cookie: either the processing software has been programmed to recognize it, and thus acts on the basis of what it means, or it has not. Other attributes of this namespace are xsi:type, xsi:nil, and xsi:noNamespaceSchemaLocation, with fixed predefined meanings. In DB2 V10 the schema location attribute is used to match the registered schema version in the XSR directory that should be used during automatic validation (see further). It consists of a namespace name and a schema location hint, which together uniquely identify a registered schema in the XSR.

Ex: xsi:schemaLocation="http://www.example.com/P02

The column metadata consists of the number of columns, column names, column labels (if applicable), data types, data lengths, whether the data is null, and the format. The exported XML file contains one column-description block for each column. Only the non-empty elements are present, and which ones appear depends on the data type (for example, VARCHAR versus DECIMAL). This is different from the relational model.

The exported file contains one row-definition block for each row of exported data. Data records are in variable block spanned (VBS) format. A Cell tag identifies each column in the row by number. The attribute "id" of Cell corresponds with the attribute "id" of ColumnDescription.
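The effect of a default namespace on path expressions can be sketched with Python's xml.etree.ElementTree. The document below is a hypothetical, heavily reduced stand-in for a QMF export: the element names and the two namespace URIs come from the notes, but the exact layout and the prefix q are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical stand-in for a QMF exported file.
doc = """<DataSet xmlns="http://www.ibm.com/qmf"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ResultSet>
    <MetaData>
      <ColumnDescription id="1"><Name>CREATOR</Name></ColumnDescription>
      <ColumnDescription id="2"><Name>NAME</Name></ColumnDescription>
    </MetaData>
    <Data>
      <Row><Cell id="1">Q</Cell><Cell id="2">STAFF</Cell></Row>
    </Data>
  </ResultSet>
</DataSet>"""

root = ET.fromstring(doc)

# Elements without a prefix belong to the default namespace.
print(root.tag)

# A path that ignores the namespace finds nothing.
without_ns = root.findall("ResultSet/Data/Row/Cell")

# The prefix "q" is arbitrary; only the namespace URI has to match.
ns = {"q": "http://www.ibm.com/qmf"}
with_ns = root.findall("q:ResultSet/q:Data/q:Row/q:Cell", ns)
values = [cell.text for cell in with_ns]

# Attributes without a prefix have no namespace.
attribute_names = list(with_ns[0].attrib)
print(len(without_ns), values, attribute_names)
```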

Note that empty elements are represented by a single empty tag.

Because of the UTF-8 encoding scheme, you have to use the "display UTF8" command in ISPF to browse an exported XML file.


In our shop we use IBM's Distributed File Service (DFS) to allow users to access and share data in a distributed environment across IBM and Windows platforms. DFS support includes DFS client and file server support for DCE. DCE support is provided by the IBM z/OS DCE Base Services element. The DFS implementation is based on source code developed by the Open Software Foundation (OSF). To use the DFS support, the DCE Base Services element of z/OS must be installed, configured, and running on the system.

The easiest way to manipulate the content of an XML file is from the Windows platform. Many tools exist there. We used Windows Notepad and Windows Internet Explorer.

Here we show what happens if we open an exported XML file with Windows Internet Explorer. The XML document is shown in a Windows Internet Explorer window (read only). If the document is not a valid XML document, error messages are shown.

Here we show what happens if we copy the default QMF style sheet to the location where the exported file resides and uncomment the style sheet statement. When the XML file is opened with Windows Internet Explorer, the result is formatted according to the specifications in the style sheet.

We decided to build a test case with a DB2 table containing metadata of other existing DB2 tables in our environment. The new table SIDDAGO.TABLE_XML has 3 relational columns and 2 XML columns. The first and second relational columns are the table creator and table name of each table. The first XML column contains an XML document with the metadata of the corresponding table. The second XML column contains an XML document with the metadata of all columns of this table. Each XML document is produced by running a QMF query and exporting the result to an XML file on a shared DFS directory. The XML files are named T0000001.xml, T0000002.xml, T0000003.xml, … T0002571.xml and C0000001.xml, C0000002.xml, C0000003.xml, … C0002571.xml. The suffix 0000001, 0000002, … of these files is stored in the relational column TB_ID for reference. The QMF REXX script that creates all these XML files also builds an input file for the DB2 LOAD utility to populate the table.

We used the TRANSLATE function to replace '&' with 'A' for tables and columns that have the XML special character '&' in their REMARKS or LABEL column in the DB2 catalog.
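As an aside, the reason a raw '&' breaks parsing, and a common alternative to replacing it with another character, can be sketched in Python. This is not what the QMF REXX script did; it only illustrates escaping the special characters instead of substituting them. The remark text is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

remark = "R&D <test> table"          # hypothetical catalog remark

raw_ok = is_well_formed("<Cell>" + remark + "</Cell>")

# escape() turns &, < and > into &amp;, &lt; and &gt;.
escaped = escape(remark)
escaped_ok = is_well_formed("<Cell>" + escaped + "</Cell>")

# After parsing, the original text comes back unchanged.
round_trip = ET.fromstring("<Cell>" + escaped + "</Cell>").text
print(raw_ok, escaped_ok, escaped)
```

Escaping keeps the original value intact and also covers '<' and '>', which a substitution of '&' alone does not handle.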

The input SYSREC file for the LOAD utility is a VB file with delimited input, containing 2571 records, one for each table.

We used the LOAD syntax shown above to load the generated XML data into the metadata table. The TEMPLATE definitions for the LOAD utility are not shown in the example but are standard. The input XML fields are always parsed to UTF-8 and stored in a parsed format in the XML table space. If a parsing error occurs, SQLCODE -20398 is returned to the LOAD utility. You can discard input records with bad XML files by specifying a discard file. In our case 6 input records were discarded because of special XML characters other than '&' in their input XML data.

Although the DB2 V9 manuals clearly state that the input XML files can be PDS, PDSE, or HFS, we did not succeed in loading XML files from a PDS or PDSE (SQLCODE -452 reason 12). The default behavior of LOAD is not to preserve whitespace (CR, LF, tab) during parsing. We saw no difference between specifying CLOBF or BLOBF.

You cannot use a SYSREC file coded in EBCDIC that refers to .xml files with UTF-8 content. If the content of the .xml files is UTF-8 (as in our case), the SYSREC file must also be coded in UTF-8, and UNICODE CCSID(01208,00000,00000) must be specified to indicate that the input file is in UTF-8. We used the following technique to convert EBCDIC strings to UTF-8 in our QMF REXX code:
• Convert the string to UTF-8 in hexadecimal format: SELECT HEX(UNICODE_STR('string',UTF8)) FROM SYSIBM.SYSDUMMY1
• Convert the hexadecimal value back to a string using the REXX x2c function: string = x2c(string)
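The two-step conversion (string to UTF-8 hexadecimal, then hexadecimal back to raw bytes) can be mirrored in Python to see what each step produces. This is only an analogy of the SQL HEX and REXX x2c steps described above, not the REXX code itself; the sample string and the use of code page 037 for the EBCDIC side are assumptions.

```python
text = "TABLE_XML"                       # hypothetical sample string

# Step 1: the UTF-8 bytes written as hexadecimal digits
# (roughly what the HEX(...) query returns).
hex_utf8 = text.encode("utf-8").hex().upper()

# Step 2: turn the hexadecimal digits back into raw bytes
# (roughly what REXX x2c does). The result is UTF-8, not EBCDIC.
utf8_bytes = bytes.fromhex(hex_utf8)

# For comparison: the same string in an assumed EBCDIC code page.
ebcdic_bytes = text.encode("cp037")

print(hex_utf8)
print(utf8_bytes)
print(ebcdic_bytes.hex().upper())
```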

Unlike LOB table spaces, XML table spaces can be compressed. We used the SQL statements and REORG statements shown above to enable compression. Because of the storage in parsed format, the XML table spaces are much smaller than the native .xml files, but even then the gain from compression is still significant.

This is a repeat of the layout of a QMF exported XML file:
• Header record
• Optional style sheet
• Root element with default namespace
• Metadata section: description of column 1, description of column 2, …
• Data section: for column 1: value row 1, value row 2, …; for column 2: value row 1, value row 2, …

The TB_DESCRIPTION XML column contains metadata about a table (in this example Q.APPLICANT):

DBNAME = 'DSQ1STBB' ; TSNAME = 'DSQ1STBT' ; COLCOUNT = 5 ; CARDF = 1.000E+01 ; AUDITING = 'C' ; CREATEDTS = '1990-03-12-09.51.52.580000' ; REMARKS = 'QMF SAMPLE TABLE' ; LABEL = ' '

Many different XPath expressions can lead to the above result. Some XPath characteristics:
• Everything is case sensitive
• Character strings are delimited by " "; numeric values have no delimiter: "0" is a character string, 0 is numeric
• Be careful with sorting and comparing: 2 < 13 but "2" > "13"
• The namespaces used must always be present, including the default namespace
• PASSING defines the "initial context"
• Predicates are delimited by [ ]; and, or, etc. are possible in predicates
• ../ refers to the parent context; ./ refers to the current context
• * is the wildcard for tags; // is the wildcard for paths
• It is best to avoid wildcards as much as possible for performance reasons, especially if indexes are present
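The comparison pitfall in the list above (numbers versus character strings) is not specific to XPath and can be reproduced in a few lines of Python, together with a simple predicate. Python's string comparison is used here as a stand-in for XPath string comparison, and the Row fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Numeric versus string comparison.
numeric_less = 2 < 13            # compared as numbers
string_greater = "2" > "13"      # compared character by character: "2" > "1"

numeric_order = sorted([13, 2, 100])
string_order = sorted(["13", "2", "100"])
print(numeric_order, string_order)

# Hypothetical fragment to show a predicate in [ ] and case sensitivity.
row = ET.fromstring("<Row><Cell id='1'>Q</Cell><Cell id='2'>STAFF</Cell></Row>")
cell = row.find("Cell[@id='2']")         # predicate on an attribute value
wrong_case = row.find("cell[@id='2']")   # no match: names are case sensitive
print(cell.text, wrong_case)
```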

The result came back unexpectedly quickly (because the elements are stored in parsed format).

Here we combine SQL with two XMLEXISTS queries.

We also use an XPath variable $doc, which holds the initial context.

XMLQUERY always returns the XML data type. XMLCAST returns the value of an XML element as an SQL type (without tags and attributes). Use XMLCAST to cast from XML to another data type. XMLCAST can only be used when XMLQUERY returns one element; otherwise, return the result in native XML format with multiple elements or use the XMLTABLE function.

The CO_DESCRIPTION XML column contains metadata about the columns of a DB2 table (in this example Q.STAFF):

COLNO = 1 ; COLNAME = 'ID' ; COLTYPE = 'SMALLINT' ; LENGTH = 2 ; SCALE = 0 ; NULLS = 'N' ; REMARKS = ''
COLNO = 2 ; COLNAME = 'NAME' ; COLTYPE = 'VARCHAR' ; LENGTH = 9 ; SCALE = 0 ; NULLS = 'Y' ; REMARKS = ''
COLNO = 3 ; COLNAME = 'DEPT' ; COLTYPE = 'SMALLINT' ; LENGTH = 2 ; SCALE = 0 ; NULLS = 'N' ; REMARKS = ''
…

We get SQLCODE -16003 because XMLCAST can only be used when XMLQUERY returns one value. Otherwise, return the result in native XML format with multiple values or use the XMLTABLE function.
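The "one value only" rule can be illustrated with a small Python analogy. The helper cast_single and the sample document are hypothetical and are not DB2 code; the helper simply refuses to turn a sequence of several nodes into one scalar, which is the situation behind the error above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one Count element, several Column/Name elements.
doc = ET.fromstring(
    "<Columns>"
    "<Count>3</Count>"
    "<Column><Name>ID</Name></Column>"
    "<Column><Name>NAME</Name></Column>"
    "<Column><Name>DEPT</Name></Column>"
    "</Columns>"
)

def cast_single(nodes, convert=str):
    """Convert the text of at most one node; refuse longer sequences."""
    if len(nodes) > 1:
        raise ValueError("more than one item: cannot cast to a scalar")
    if not nodes:
        return None                      # empty sequence: no value
    return convert(nodes[0].text)

count = cast_single(doc.findall("Count"), int)      # exactly one node
missing = cast_single(doc.findall("Label"))         # no node at all

try:
    cast_single(doc.findall("Column/Name"))         # three nodes
    multi_error = None
except ValueError as exc:
    multi_error = str(exc)

# With several values, keep them as a sequence instead of casting.
names = [n.text for n in doc.findall("Column/Name")]
print(count, missing, multi_error, names)
```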

Use the XMLTABLE function to return XML data in relational (table) format. Missing elements are returned as the NULL value. SQL scalar functions and aggregate functions can be used on the result table afterwards.

Namespaces have to be declared in all XPath expressions.

In this example we use the column metadata to find the id attribute of COLNAME: 'Cell[@id = /DataSet/ResultSet/MetaData/ColumnDescription[Name="COLNAME"]/@id]', which is equivalent to 'Cell[@id = "2"]'.

The XMLTABLE function contains one row-generating XQuery expression and, in the COLUMNS clause, multiple column-generating expressions. The row-generating expression is applied to each XML document in the XML column and produces one or more elements per document. The number of elements produced by the row-generating XQuery expression determines the number of rows produced by the XMLTABLE function.

The COLUMNS clause transforms XML data into relational format. Each of the entries in this clause defines a column with a column name and an SQL data type. The row-generating expression provides the context for the column-generating expressions. This means that the column-generating expressions are not absolute paths, but relative to the row-generating expression. You can typically append the column-generating expressions to the row-generating expression to get an intuitive idea of what a given XMLTABLE function returns in its columns.

The result set of the XMLTABLE query can be treated like any SQL table. You can query and manipulate it much like you use regular row sets or views. The column definitions in the COLUMNS clause can use any SQL data type, such as INTEGER, DECIMAL, CHAR, DATE, and so on. If an extracted XML value cannot be cast to the assigned SQL type, the query fails with an error message.
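The interplay of the row-generating expression and the relative column-generating expressions can be mimicked in Python. The function xml_table and the sample document are hypothetical; this is a sketch of the idea, not of how DB2 implements XMLTABLE. Missing elements become None, in the same spirit as NULL.

```python
import xml.etree.ElementTree as ET

# Hypothetical document describing two columns of a table.
doc = ET.fromstring(
    "<Columns>"
    "<Column><ColNo>1</ColNo><Name>ID</Name></Column>"
    "<Column><ColNo>2</ColNo><Name>NAME</Name>"
    "<Remarks>Employee name</Remarks></Column>"
    "</Columns>"
)

def xml_table(root, row_path, columns):
    """One output row per node matched by row_path.

    columns maps a column name to (relative path, conversion function).
    """
    rows = []
    for node in root.findall(row_path):          # row-generating step
        row = {}
        for name, (path, convert) in columns.items():
            text = node.findtext(path)           # relative to the row node
            row[name] = convert(text) if text is not None else None
        rows.append(row)
    return rows

rows = xml_table(
    doc,
    "Column",
    {
        "COLNO": ("ColNo", int),     # conversion fails if not numeric
        "COLNAME": ("Name", str),
        "REMARKS": ("Remarks", str),
    },
)
for row in rows:
    print(row)
```

Because the result is an ordinary list of rows, it can be filtered, sorted, or aggregated afterwards, much like the result table of XMLTABLE.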

Namespaces have to be present in the row-generating expression and in all column-generating expressions. Use the XMLNAMESPACES declaration to define the default namespace for all generating expressions.

Additional SQL scalar functions and aggregate functions can be used on the result table afterwards.

An XML schema consists of one primary schema document and, optionally, one or more additional schema documents (.xsd schema files). The additional files are linked to the primary file through include or import statements to form one document. Schemas can also evolve over time (for example, when new elements are added). Therefore a schema can have multiple versions over time. During validation it is important to use the right version.

In DB2 V9, validation is optional and always user controlled through the DSN_XMLVALIDATE function. In DB2 V10 an XML document can be validated automatically during INSERT, UPDATE, or LOAD. For this purpose an XML type modifier is added to the XML column definition. This type modifier contains the registered schemas for the XML documents in this XML column. During validation DB2 chooses one of these schemas based on the namespace of the root element, or on the namespace and schema location hint specified in the xsi:schemaLocation attribute.

QMF provides a default schema "qmf_data.xsd" as member DSQ1SCEM of the QMF samples data set SDSQSAPn (in EBCDIC).

The first implementation of XML validation was through the user-defined function SYSFUN.DSN_XMLVALIDATE, called from within the XMLPARSE function. This UDF is now deprecated (see APAR PK90040); always use the newer and better SYSIBM.DSN_XMLVALIDATE built-in function (2010). There are several forms of DSN_XMLVALIDATE available, with one, two, or three parameters. As an example we used the form shown above with two parameters. The first parameter is a CLOB expression corresponding to the document to validate, and the second parameter is the name of a registered schema in the XSR. The result is always an XML type. The form with one parameter uses the registered schema based on the namespace of the root element, or on the namespace and schema location hint specified in the xsi:schemaLocation attribute in the document. The form with three parameters uses the registered schema corresponding to the specified namespace and schema location hint.

Because DB2 V9 only supports user-controlled validation through the DSN_XMLVALIDATE function, validation cannot be done by the LOAD utility in V9. One workaround is to LOAD the XML data into a temporary table first and then INSERT it into the final target table using DSN_XMLVALIDATE. Alternatively, issue a SELECT statement with DSN_XMLVALIDATE on the loaded table to see whether one or all XML documents are valid.

A schema is also an XML document. It defines which elements an XML document can contain, how they are organized, and which attributes and attribute types can be assigned to elements. It is written in the W3C XML Schema Definition language (XSD). It is much richer and allows more complex semantic rules than the DTD (Document Type Definition) language.

A schema contains a targetNamespace, which corresponds to the default namespace of the XML document it describes. It also has a default namespace and a namespace with the prefix xs.

We found that one xs prefix was missing in the QMF default schema, in the "type" attribute of element "Scale" ("integer" belongs to the xs namespace and not to the default namespace of the schema).

The XML Schema Repository (XSR) consists of a database DSNXSR with 8 tables and 4 DB2 stored procedures (optional installation job DSNTIJRT).

A schema can be registered using the stored procedures from a z/OS script or application, the Windows DB2 CLP, or a Windows tool that supports the XSR, like IBM Data Studio.

I used the CLP from DB2 Connect V9.5 and did not find an equivalent CLP command to remove a schema from the XSR.

Because SYSPROC.XSR_COMPLETE is a Java stored procedure, your z/OS environment must be set up to be able to use Java stored procedures and the IBM Data Server Driver for JDBC and SQLJ.

Here we show the commands used to register the QMF schema in the z/OS XSR from a DB2 Connect V9.5 CLP. Help is available using the ? command.

Our z/OS environment was not yet set up to run Java stored procedures with the IBM Data Server Driver for JDBC and SQLJ, which SYSPROC.XSR_COMPLETE requires, and we encountered some difficulties getting it to work. A very good list of actions to take and of common error messages can be found in the mentioned article. In addition, we encountered two more problems. To solve the first problem we had to add an arbitrary non-APF-authorized library to the STEPLIB of the application environment for Java stored procedures. To solve the second problem we had to issue the SETPROG LPA command shown (to be repeated after each IPL).

An XML index can be used to improve the efficiency of queries on XML documents that are stored in an XML column. Instead of providing access to the beginning of a document, index entries in an XML index provide access to nodes within the document by creating index keys based on XML pattern expressions. Because multiple parts of an XML document can satisfy an XML pattern, DB2 might generate multiple index keys when it inserts values for a single document into the index. Such an index can then be used when this pattern is present in a predicate.

In the above case, indexes I_TABLE_XML_01 and I_TABLE_XML_02 contain 8 keys per document and I_TABLE_XML_03 only 7, because the elements with id="8" never contain a value (table LABELs are never present). As with any XPath expression, the namespace must always be specified. An XMLPATTERN cannot contain a predicate [ … ].

For testing purposes we defined 3 different XML indexes: one on the element node Cell, one on the attribute node id of Cell, and one on the text node of Cell. These indexes are so-called "lean" indexes because they contain only fully qualified paths (no wildcards * or //), which is recommended for optimal performance. For the above query we have 2 predicates: one on path 'DataSet/ResultSet/Data/Row/Cell/@id' and one on path 'DataSet/ResultSet/Data/Row/Cell/text()'.

Because we have a mix of character values and numeric values, we use the VARCHAR type and not the DECFLOAT type. Use DECFLOAT with care because it can only be used to index numeric values. Nodes with values that cannot be cast to DECFLOAT are not indexed, and such indexes are only used to evaluate numeric predicates (that is why incomplete DECFLOAT indexes are safe after all: they can never lead to incomplete results). Make sure that the VARCHAR length is big enough to contain the element value or attribute value. We cannot use UNIQUE indexes because all rows of table TABLE_XML contain documents with the same nodes.

An XML index is eligible if it can be used to answer a query predicate. An XML index stores only the values of nodes that match the XPath pattern and the data type in the index definition.
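Why one document can contribute several keys to an XML index, and why a pattern on text() yields fewer keys when some elements are empty, can be sketched with a toy index in Python. The function build_index, the DOCID values, and the Row fragments are hypothetical; this models the idea only, not DB2's index structure.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents, keyed by a made-up DOCID.
docs = {
    1: "<Row><Cell id='1'>Q</Cell><Cell id='2'>STAFF</Cell><Cell id='8'/></Row>",
    2: "<Row><Cell id='1'>Q</Cell><Cell id='2'>ORG</Cell><Cell id='8'/></Row>",
}

def build_index(documents, extract):
    """Map each key value to the set of DOCIDs that contain it."""
    index = defaultdict(set)
    keys_per_doc = {}
    for docid, text in documents.items():
        keys = [k for k in extract(ET.fromstring(text)) if k is not None]
        keys_per_doc[docid] = len(keys)
        for key in keys:
            index[key].add(docid)
    return index, keys_per_doc

# Pattern on the attribute: every Cell has an id, so 3 keys per document.
id_index, id_counts = build_index(
    docs, lambda row: [c.get("id") for c in row.findall("Cell")])

# Pattern on the text node: the empty Cell has no value, so only 2 keys.
text_index, text_counts = build_index(
    docs, lambda row: [c.text for c in row.findall("Cell")])

print(id_counts, text_counts)
print(sorted(text_index["STAFF"]), sorted(id_index["2"]))
```

The example also hints at selectivity: the value "STAFF" points to a single document, while id "2" occurs in every document.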

Run EXPLAIN and check the PLAN_TABLE to see whether the DB2 optimizer considers using the XML indexes. In this case DB2 uses indexes I_TABLE_XML_02 and I_TABLE_XML_03, as expected, to resolve the double predicate on Cell[@id] and Cell[text()]. DB2 starts with the index on text(), which is very selective in this case.

The relevant ACCESSTYPE values for XML queries in the PLAN_TABLE are:
• M: multiple index scans followed by an intersection or union of the returned lists
• DX: an XML index scan on the index that is named in ACCESSNAME. It returns all documents that match the XMLPATTERN of the index and the predicate, in the form of a DOCID list
• DI: an intersection of multiple DOCID lists
• DU: a union of multiple DOCID lists
• R: table space scan on the XML table space; no XML indexes used
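The DI and DU steps in the list above are plain set operations on DOCID lists. A minimal Python sketch with made-up DOCID values shows what each step hands on to the next one:

```python
# Hypothetical DOCID lists returned by two XML index scans (DX).
id_hits = {3, 7, 12, 40}       # documents matching the predicate on @id
text_hits = {7, 40, 99}        # documents matching the predicate on text()

# DI: documents that satisfy both predicates (AND).
docid_intersection = sorted(id_hits & text_hits)

# DU: documents that satisfy at least one predicate (OR).
docid_union = sorted(id_hits | text_hits)

print(docid_intersection)
print(docid_union)
```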

To compare different access paths, look at the PROCSU column for an estimate of the query cost in service units (SU). After dropping index I_TABLE_XML_03, DB2 uses index I_TABLE_XML_02 followed by I_TABLE_XML_01 in a similar way, at a much higher cost of 5408 SU. After also dropping index I_TABLE_XML_02, index I_TABLE_XML_01 is not used and we get a scan of the complete XML table space (38967 SU).
