An XSD 1.1 Schema Validator Written in XSLT 3.0

Michael Kay, Saxonica

Abstract

The paper describes an implementation of an XML Schema validator written entirely in XSLT 3.0. The main benefit of doing this is that the validator becomes available on any platform where XSLT 3.0 is implemented. At the same time, the experiment provides a valuable test of the performance and usability of an XSLT 3.0 implementation, and the experience gained from writing this application can feed back improvements to the XSLT processor that will benefit a large range of applications.


1. Introduction

This paper describes the outline design of an XML Schema (XSD 1.1) validator written entirely in XSLT 3.0.

There were two reasons for undertaking this project:

• Firstly, we see a demand to make implementations of the latest XML standards from W3C available on a wider variety of platforms. The base standards (XML 1.0, XPath 1.0, XSLT 1.0, XSD 1.0) that came out between 1998 and 2001 were very widely implemented, but subsequent refinements of these standards have seen far less vendor support. One can identify a number of reasons for this: perhaps the first versions of the standards met 90% of the requirements, and the market for subsequent enhancements is therefore much smaller; perhaps the early implementors found that their version 1 products were commercially unsuccessful and they therefore found it difficult to make a business case for further investment. In the case of products like libxml/libxslt produced by enthusiasts working in their spare time, perhaps the enthusiasts are more excited by a brand new technology than by version 2 of something well established. Whatever the causes, XSD 1.1 in particular only has three known implementations (Altova, Saxon, and Xerces) compared with dozens of implementations of XSD 1.0, despite the fact that there is a high level of consensus in the user community that the new features are extremely useful.

• Secondly, as editor of the XSLT 3.0 specification and developer of a leading implementation of the language, I felt a need to "get my hands dirty" by developing an application in XSLT 3.0 that would stretch the capabilities of the language. I am often called on to provide advice to XSLT 3.0 users, and even now that the language specification is complete, I find myself designing extensions to meet requirements that go beyond the base standard, and this can only be done on the basis of personal experience in using the language "in anger" for real applications. In addition, the usability of a language compiler depends in high measure on the quality of diagnostics available, and improving the quality of diagnostics depends entirely on a feedback process: you need to make real programming mistakes, spot when the diagnostics could be more helpful, and make the required changes.

More specifically, Saxonica has released an implementation of XSLT 3.0 written in Javascript to run in the browser, and an obvious next step is to extend that implementation to run under Node.js on the server. But for the server environment we need to implement a larger subset of the language, and so the question arose, should we write a version of the schema processor in Javascript? However, Javascript is not the only language of interest: we're also seeing demand for our technology from users whose programming environment is C, C++, C#, PHP, or Python. Since (using a variety of tools) we've been trying to deliver XSLT 3.0 capability in all these environments, a natural next step is to try to deliver a schema processor that will run on any of these platforms, by writing it in portable XSLT 3.0 code.

Architecturally, there are two parts to an XSD schema processor:

• The first part is the schema compiler. This takes as input the source XSD schema documents, and performs as much pre-processing of this input as it can prior to the validation of actual XML instances against the schema. This preprocessing includes verification that the schema documents themselves represent a valid schema, and other tasks such as expanding defaults. The XSD specification from W3C describes a "schema component model" which is an abstract description of a data structure that can be used to represent a compiled schema; and it gives detailed rules for translating source schema documents to this abstract data model. The Saxon schema processor follows this approach very closely, and it also defines a concrete XML-based representation of the schema component model which realises the output of the schema compilation phase in concrete form. This output in fact contains a little more than the SCM components and their properties; it also contains a representation of the finite state machines used to implement the grammar of complex types defined in the schema.

• The second part is the instance validator. The instance validator takes two inputs: the compiled schema (SCM) output by the schema compiler, and an XML instance document to be validated. In principle its output is a PSVI (post schema validation infoset) containing the instance document as a tree, decorated with information derived from schema validation, including information identifying nodes that are found to be invalid. In practice, in the Saxon implementation, there are two outputs: a validation report, which in its canonical form is an XML document listing all the validation errors that were found, and a tree representation of the instance document with added type annotations, in the form prescribed by the XDM data model used by XPath, XQuery, and XSLT. Because the Saxon schema validator is primarily designed to meet the schema processing requirements of XSLT and XQuery, it only delivers the subset of PSVI required in that environment: this means that the output XML tree is only delivered if it is found to be valid, and the only tree decorations added by the validator are references from valid element and attribute nodes to the schema types against which they were validated. In effect, if the input is valid, then the output of the Saxon schema validator is a representation of the input XML instances with added type annotations; while if the input is invalid, then the output is a report listing validation errors.

The software described in this paper is an XSLT 3.0 implementation of the instance validator. It would certainly be possible to write an XSLT 3.0 implementation of the schema compiler, and we may well do that next, but we haven't tackled this yet.

Moreover, the only output of the instance validator described here is a validation report showing the validation errors. The software doesn't attempt to produce an XDM representation of the input document with type annotations. In fact, this isn't possible to do using standard XSLT 3.0 without extensions. XSLT 3.0 has no instructions to output elements and attributes with specific type annotations; the only way it can generate typed output is by putting the output through a schema validator, which is assumed to exist as an external component. So to write that part of the validator in XSLT 3.0, we would need to invent some language extensions.


2. The Validation Task

In this section we describe what the validator actually does.

Let's start with an example of a very simple schema, like this:
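A schema along the following lines matches the component-by-component description given later in this section; the pattern facet, the occurrence bounds, and the name of the key constraint are illustrative guesses rather than the original text:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning"
           vc:minVersion="1.1">

  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="isbn-key">
      <xs:selector xpath="book"/>
      <xs:field xpath="@isbn"/>
    </xs:key>
  </xs:element>

  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="publisher" type="xs:string"/>
        <xs:element name="date" type="xs:gYear"/>
        <xs:element name="price" type="moneyType"/>
      </xs:sequence>
      <xs:attribute name="isbn" type="ISBNType"/>
      <xs:assert test="if (publisher eq 'McGraw Hill')
                       then starts-with(@isbn, '007')
                       else if (publisher eq 'Academic Press')
                       then starts-with(@isbn, '012')
                       else true()"/>
    </xs:complexType>
  </xs:element>

  <xs:complexType name="moneyType">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="currencyType"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:simpleType name="currencyType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="EUR"/>
      <xs:enumeration value="USD"/>
      <xs:enumeration value="CAD"/>
      <xs:enumeration value="GBP"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="ISBNType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{10}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```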

A valid XML instance conforming to this schema might look like this:


<books>
   <book isbn="0070491712">
      <title>Apple PASCAL: a hands-on approach</title>
      <author>Arthur Luehrmann</author>
      <author>Herbert Peckham</author>
      <publisher>McGraw-Hill</publisher>
      <date>1981</date>
      <price currency="USD">13.95</price>
   </book>
   <book isbn="0126427109">
      <title>An Introduction to Direct Access Storage Devices</title>
      <author>Hugh Sierra</author>
      <publisher>Academic Press</publisher>
      <date>1990</date>
      <price currency="USD">72.95</price>
   </book>
</books>

The Saxon-EE schema compiler can be invoked to convert the source schema into an SCM file, for example with a command such as: java com.saxonica.Validate -xsd:books.xsd -scmout:books.scm

Here is the resulting SCM file:
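In outline it looks like this (abridged; the namespace URI and some attribute names are inferred from the component descriptions that follow rather than quoted exactly):

```xml
<scm:schema xmlns:scm="http://ns.saxonica.com/schema-component-model">

  <scm:simpleType id="C0" name="currencyType" base="#string" variety="atomic">
    <scm:enumeration value="EUR"/>
    <scm:enumeration value="USD"/>
    <scm:enumeration value="CAD"/>
    <scm:enumeration value="GBP"/>
  </scm:simpleType>

  <scm:simpleType id="C1" name="ISBNType" base="#string" variety="atomic">
    <scm:pattern value="[0-9]{10}"/>
  </scm:simpleType>

  <scm:complexType id="C2" name="moneyType" variety="simple" simpleType="#decimal">
    <scm:attributeUse ref="C3"/>
  </scm:complexType>

  <scm:attribute id="C3" name="currency" type="C0" global="false"/>

  <scm:element id="C4" name="book" type="C5"
               global="true" nillable="false" abstract="false"/>

  <scm:complexType id="C5" base="#anyType" derivationMethod="restriction"
                   variety="element-only">
    <scm:modelGroupParticle compositor="sequence">
      <scm:elementParticle minOccurs="1" maxOccurs="1" ref="C9"/>
      <!-- ... further particles for author, publisher, date, price ... -->
    </scm:modelGroupParticle>
    <scm:finiteStateMachine initialState="0">
      <scm:state nr="0">
        <scm:edge term="C9" to="1"/>
      </scm:state>
      <!-- ... remaining states and edges; the final state is 5 ... -->
    </scm:finiteStateMachine>
    <scm:assertion test="if (publisher eq 'McGraw Hill') ..."/>
  </scm:complexType>

  <!-- C6, C7: the books element and its complex type; C8: the xs:key
       constraint; C9 to C13: the declarations for the details of a book -->

</scm:schema>
```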




Let's look briefly at what this contains. The children of the scm:schema element represent different schema components such as element declarations, attribute declarations, simple and complex types, each with a unique identifier. For convenience I've rearranged these in order of the component identifier (the actual order doesn't matter).


• C0 is the simple type named currencyType. It is an atomic type derived from xs:string (the built-in type is represented as #string). The scm:enumeration elements list the permitted values.

• C1 is the simple type named ISBNType. It is an atomic type derived from xs:string, with a pattern facet constraining the permitted values.

• C2 is the complex type named moneyType. It is a "complex type with simple content", allowing simple content of type xs:decimal, and an attribute. The complex type contains an scm:attributeUse element which is a reference to the attribute declaration defining the attribute; like all references from one component to another, this uses the component identifier, in this case C3.

• C3 is the attribute declaration for currency; it is a local declaration (global="false"). The type of the attribute is defined by a reference to the simple type component C0.

• C4 is the element declaration for the book element; it is defined largely by a reference to the complex type C5 which defines the allowed content.

• C5 is the complex type defining the allowed content of book elements. The complex type itself has no name. The type is derived by restriction from xs:anyType, and the permitted content is defined in terms of a modelGroupParticle containing a sequence of elementParticles representing the permitted child elements: these are references to local element declarations appearing later in the SCM file. The complex type component also contains a representation of the (deterministic) finite state machine used to check instances against the grammar defined in the source schema. This defines a set of states (the initial state is 0, the final state is 5) and the permitted transitions between them. The transitions (edges) are defined by reference to the schema components for the contained element particles. Finally, the complex type component contains the XPath assertion that constrains the relationship between publishers and ISBNs. The scm:assertion element includes an xml:base attribute because it is possible (in theory) for the evaluation of the XPath assertion to depend on the base URI of the containing element in the source schema.

• C6 is the element declaration for the outermost books element: it refers to the complex type definition C7 and the identity constraint (the xs:key constraint) C8.

• C7 is the complex type definition for the outermost books element. Like C5, it contains a (very simple) finite state machine used to enforce the grammar.

• C8 represents the xs:key constraint specifying that ISBNs must be unique.

• C9 to C13 are the attribute and element declarations for the details of a book, and are all very simple. The meanings of the attributes are very closely aligned with the properties of the abstract Schema Component Model defined in the W3C specification.

Given this schema and this instance document, the task of the validator is to produce an empty validation report showing that there are no errors. The validation report becomes more interesting if the instance is invalid. For example, we can use this command: java com.saxonica.Validate -xsd:books.scm -s:books-invalid.xml -report:report.xml to validate this invalid instance:

Apple PASCAL: a hands-on approach McGraw-Hill 1981 13.95
An Introduction to Direct Access Storage Devices Hugh Sierra Academic Press 1990-04 72.95

and the result is the following report:


In content of element <book>: The content model does not allow element <Q{}date> to appear immediately after element <publisher>. No further elements are allowed at this point.
Value "NZD" contravenes the enumeration facet "EUR, USD, CAD, GBP" of the type Q{}currencyType
In content of element <book>: The content model does not allow element <Q{}author> to appear immediately after element <title>. No further elements are allowed at this point.
The content "1990-04" of element <date> does not match the required simple type. Cannot convert '1990-04' to a gYear
Element book does not satisfy assertion if (publisher eq 'McGraw Hill') then starts-with(@isbn, '007') else if (publisher eq 'Academic Press') then starts-with(@isbn, '012') else true()

The report shown here comes from the existing Saxon-EE validator written in Java. Our task is to reproduce this report with a validator written entirely in portable XSLT.

3. Design Considerations

This section describes some of the design challenges posed.

3.1. Generic Stylesheet or Generated Stylesheet?

At the beginning of this exercise we considered two alternative designs: the validator could run as a generic XSLT stylesheet taking its rules dynamically from the SCM input document, or it could be a custom XSLT stylesheet generated from the SCM input document and dedicated to doing validation against this particular schema.

Both approaches have potential advantages, but we chose the first on the grounds of simplicity. As we will see later, there are some technical challenges where the second approach might have afforded a solution.


3.2. Subset of XSLT 3.0

We decided that, if possible, the validator should be written in the subset of XSLT 3.0 that is supported by Saxon-JS (including the use of xsl:evaluate, which requires a later release of Saxon-JS). This decision means that we cannot use higher-order functions: these are an optional XSLT 3.0 feature which Saxon-JS does not currently support.

3.3. Streaming

In an ideal world, we would use an XSLT 3.0 streaming transformation to process the input instance document, so that it does not need to be held completely in memory.

The existing Saxon validator, written in Java, uses streamed processing wherever possible. The main case where streaming is not possible is in evaluating XSD 1.1 assertions: assertions can use arbitrary XPath expressions to process the subtree of the source document rooted at the element to which the assertion applies. The existing validator starts building an in-memory tree when it encounters such an assertion; in the absence of such assertions it is fully streamed. It would be possible in principle to avoid building the subtree if the assertion uses a streamable subset of XPath, but the validator does not attempt this.

Emulating this behaviour in the new XSLT validator might be possible, but it is not easy, and in the current project we have not attempted it. One of the main reasons for this is that there are other XSD features (notably the evaluation of uniqueness and referential constraints) for which a streamed implementation is even more difficult.

The main constraint here is that xsl:evaluate (the XSLT 3.0 instruction to perform evaluation of a dynamically-constructed XPath expression) is not streamable, because static analysis has no access to the XPath expression in question, and streamability analysis is always done statically. Since xsl:evaluate is essential to enable XSD 1.1 features such as assertions and type alternatives to be evaluated, this is a show-stopper. We might be able to get around it by using the alternative design considered (a generated stylesheet in which the XPath expressions become statically analyzable), but we decided not to go that way.

3.4. Typed data

We have already mentioned that XSLT 3.0 can only generate a typed result tree (specifically, a node with non-trivial type annotations) by invoking a schema validator to produce the type annotations, and this precludes the possibility of writing that schema validator in XSLT 3.0. So the first obvious restriction we have to live with is that our validator will only do that part of the job concerned with detecting invalidity, not the other part concerned with augmenting the supplied instance with type information.

Unfortunately, even the task of distinguishing valid from invalid documents requires some use of type information associated with nodes. The most obvious example is that assertions (XPath expressions in xsd:assert declarations) are defined to operate on "semi-typed" data - that is, data that has been validated using all the constraints in the schema other than assertions, and typed accordingly. For example, if @price and @discount are attributes of type xs:decimal, then the assertion test="@discount lt @price" is defined to do a decimal comparison, not a string comparison.

We don't currently have a solution to this problem. We found, however, that very few schemas are affected. With such schemas, it is currently necessary to rewrite the assertion to do explicit typing: in this case test="xs:decimal(@discount) lt xs:decimal(@price)".

Another case where typed data can be important is in key and keyref constraints. For example, if a uniqueness constraint applies to an attribute @birthDate of type xs:date, then values of that attribute are compared as dates, not as strings. Again, we found that this is rarely a problem in practice, but in this case we do have a solution that covers most cases: by doing static analysis of the XPath expressions used in the selector and field elements of the constraint, we can usually determine the data type of the values these expressions are selecting, and we can put this inferred type into the SCM and use it to cast the values before comparison. The only cases where this approach proves inadequate are pathological ones where the type cannot be statically inferred, for example when the XPath expressions use union types or wildcards.

3.5. Support for xsi:schemaLocation

The current project is implementing a validator only, with no access to a schema compiler.

The xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes allow an instance document to reference the location of a (source) schema document containing a schema for the instance document. Because this validator has no access to a schema compiler, it cannot implement this capability.

This isn't a fundamental show-stopper. Support for these attributes isn't required for XSD conformance, and many applications avoid using them (there are a number of arguments against using them, not least that when you validate a document, it's because you don't trust it to be valid; and if you don't trust it to be valid, why should you trust it to tell you where the schema is?).


If we had a schema compiler written in XSLT, then it would of course be possible to interface the two.

4. Implementation

This section of the paper is not a complete top-down documentation of the working implementation. Rather it gives a selection of snapshots into interesting aspects of the implementation, showing some of the techniques used and obstacles encountered. It is particularly focused on showing how new features in XSLT 3.0 proved valuable.

4.1. Use of Maps for Returned Values

A classic problem with functional programming is that a function can only return one result. If you want to compute two values (say a maximum and minimum) from the same input then you have two choices. You can either process the input more than once (which may involve some redundant computation), or you can return a composite result.

In this application there are many cases where we need to return a composite result.

To take an example, suppose we are validating an attribute, and we find that the declared type for that attribute is a user-defined list type, where the item type of the list is part-number, and part-number is derived from xs:ID. The calling code wants to know (a) whether the attribute is valid against its declared type; (b) what error messages to report if not; and (c) whether the value contains any xs:ID or xs:IDREF values that need to be added to global tables of xs:ID and xs:IDREF values for document-level validity checking at the end.

Our solution to this is that we recurse through the instance document in a tree-walk driven by xsl:apply-templates in the normal way, but the return value from xsl:apply-templates is a map containing all the information gleaned from the processing of this subtree.

Very often the information returned from several calls on xsl:apply-templates (for example, one call for child elements and another for attributes) will need to be combined into a single map. At the top level, when we return from the initial call on xsl:apply-templates on the root node, all the information that is needed to produce the validation report is present in one large map, and the final stage of processing takes this map and generates the XML report.

The maps that are produced by the different processing stages thus typically include some subset of a common set of fields. These include:

Table 1. The structure of maps used to return partial results of processing

valid:    An xs:boolean indicating whether the subtree is valid
errors:   A set of error objects indicating error information to be included in the validation report
value:    The typed value of an element or attribute
type:     The type against which a subtree was validated
lexical:  The lexical form of an attribute or text node after whitespace normalization
id:       A set of xs:ID values found in the subtree
id-map:   A mapping from xs:ID values found in the subtree, to the nodes on which they appeared
idrefs:   A set of xs:IDREF values found in the subtree
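For illustration, the merging of two such maps (whose rules are described in the next paragraph) might be sketched as a function; the vr namespace and the function name are illustrative, not taken from the actual implementation:

```xml
<xsl:function name="vr:combine" as="map(xs:string, item()*)">
  <xsl:param name="a" as="map(xs:string, item()*)"/>
  <xsl:param name="b" as="map(xs:string, item()*)"/>
  <xsl:sequence select="map{
    'valid'  : (($a?valid, true())[1] and ($b?valid, true())[1]),
    'errors' : ($a?errors, $b?errors),
    'id'     : distinct-values(($a?id, $b?id)),
    'idrefs' : ($a?idrefs, $b?idrefs),
    'id-map' : ($a?id-map, $b?id-map)
  }"/>
</xsl:function>
```

Note that the valid field defaults to true when absent, and that value and type are deliberately dropped from the merged result.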

When two of these maps representing properties of different subtrees are combined, different rules apply to each field. For example, for the id, idrefs, and errors fields we can take the union of the two values. For the valid property, we can apply a logical AND; a tree is valid only if all its subtrees are valid. For value and type we can drop the value; these fields are used only at the next level up, and do not propagate all the way to the root.

4.2. Declaring Map Types

Standard XSLT 3.0 allows maps (as described in the previous section) to be returned from templates, stored in variables, and so on, so this style of processing is perfectly possible without departing from the standard. However, the facilities for declaring the type of these maps are very weak. The closest we can get is map(xs:string, item()*) which is satisfied by any map whose keys are strings.

Much of the debugging process for this stylesheet involves understanding the contents of these returned maps, and it is therefore frustrating that the XSLT 3.0 type system is so poor at validating these maps and reporting type errors. Saxon therefore introduces an extension to XSLT 3.0, called tuple types. Here is a declaration of the returned structure as a tuple type:


tuple(
    valid:   xs:boolean?,                (: indicates whether the subtree is valid (default = true) :)
    errors:  element(vr:error)*,         (: a list of errors found when validating the subtree :)
    value:   xs:anyAtomicType*,          (: the typed value of an element or attribute :)
    type:    xs:string*,                 (: the types of the typed values, as component IDs :)
    lexical: xs:string?,                 (: the lexical form of a value after whitespace normalization :)
    id:      xs:string*,                 (: any ID values found while validating an element or attribute :)
    id-map:  map(xs:string, element())*, (: a mapping from ID values to elements :)
    idrefs:  xs:string*)                 (: any IDREF values found while validating an element or attribute :)

This clearly documents the expected contents of the map much more precisely than the bland declaration map(xs:string, item()*).

Tuples in Saxon are not a separate data type in the way that maps and arrays are separate data types. Rather, a tuple type is an alternative way of constraining the content of a map. It defines the (string-valued) keys that can appear in the map, and for each permitted key, the permitted type of the corresponding values. Declaring the expected type of a map in this form gives much improved static and dynamic type checking. For example, attempting to reference a non-existent field using the lookup expression $result?Value can generate a static error message, as can its use in an inappropriate context such as $result?valid eq "true".

Because tuple type declarations are often quite lengthy, as in this example, Saxon allows them to be declared once using a type alias:
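The declaration takes a form along these lines (the alias name is ours; see the Saxon documentation for the exact syntax):

```xml
<saxon:type-alias name="result"
    type="tuple(
        valid:   xs:boolean?,
        errors:  element(vr:error)*,
        value:   xs:anyAtomicType*,
        type:    xs:string*,
        lexical: xs:string?,
        id:      xs:string*,
        id-map:  map(xs:string, element())*,
        idrefs:  xs:string*)"/>
```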

And the type can then be referenced wherever an as attribute can appear, for example:
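For instance, on a function declaration (the function name and parameters here are hypothetical):

```xml
<xsl:function name="vr:check-simple-type" as="map(*)" saxon:as="~result">
  <xsl:param name="type" as="element()"/>
  <xsl:param name="value" as="xs:string"/>
  <!-- ... -->
</xsl:function>
```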

The syntax of saxon:as is an XPath SequenceType augmented with Saxon-specific syntax, in this case a reference to a type alias marked as such by the presence of the leading tilde (~). The semantics of saxon:as are that it provides type information additional to that contained in the as attribute. Because (under the XSLT extensibility rules) attributes in the Saxon namespace are ignored by XSLT processors other than Saxon, this whole mechanism enables Saxon to do extra compile-time and run-time type checking, without in any way sacrificing the interoperability of the stylesheet: it still functions correctly under other standards-conforming XSLT 3.0 processors.

4.3. Assessment against Complex Types using Finite State Machines

As we've seen, the finite state machines used to evaluate a sequence of elements against the grammar rules for a complex type are constructed by the schema compiler and embedded in the SCM file that is used as input to the validator.

A simplified validator for a simple finite state machine could be written like this:
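The following is a sketch rather than the actual code: it assumes that $machine holds the compiled finite state machine from the SCM, and that vr:matches-element, vr:is-wildcard, and vr:error are helper functions.

```xml
<xsl:template match="*" mode="check-content-model">
  <xsl:param name="machine" as="element()"/>
  <xsl:variable name="parent" select="."/>
  <xsl:iterate select="*">
    <xsl:param name="state" as="xs:integer" select="0"/>
    <xsl:on-completion>
      <!-- at the end of the children, we must be in a final state -->
      <xsl:if test="not($machine/scm:state[@nr = string($state)]/@final = 'true')">
        <xsl:sequence select="map{'errors': vr:error($parent, 'Content is incomplete')}"/>
      </xsl:if>
    </xsl:on-completion>
    <xsl:variable name="edges"
                  select="$machine/scm:state[@nr = string($state)]/scm:edge"/>
    <xsl:variable name="named" select="$edges[vr:matches-element(., current())]"/>
    <xsl:variable name="wild" select="$edges[vr:is-wildcard(.)]"/>
    <xsl:choose>
      <xsl:when test="empty(($named, $wild))">
        <!-- no transition: this child is not allowed in the current state -->
        <xsl:break select="map{'errors':
            vr:error(., 'Element ' || name() || ' is not allowed here')}"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- validate the child against the declaration (or wildcard) it matched -->
        <xsl:apply-templates select="."/>
        <xsl:next-iteration>
          <xsl:with-param name="state" select="xs:integer(($named, $wild)[1]/@to)"/>
        </xsl:next-iteration>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:iterate>
</xsl:template>
```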


The way this code works is as follows:

• The xsl:iterate instruction is new in XSLT 3.0. It is rather like xsl:for-each, except that it processes the selected items strictly in sequence; the code for processing one item can set parameters for processing the next item; and it is possible to break out of the loop early. The same effect could be achieved with a recursive template, but xsl:iterate is often easier to understand. In this case we are iterating over the children of the element being validated.

• There is a single parameter, the current state, which is initially set (by the calling code) to the state numbered 0.

• The xsl:on-completion instruction is executed when we reach the end of the sequence of child elements. If the current state is a final state, we return nothing (meaning all is well, the input is valid). Otherwise we return a map containing an error value.

• There are two kinds of transition possible in a given state: named element transitions, and wildcard transitions. We first find all the matching named element transitions (the schema compiler will have ensured there can be at most one) and all the matching wildcard transitions.


• If both sets are empty, there is no legal transition for the current child element in this state, so we return an error value.

• If there is a wildcard transition possible, but no named-element transition, then we check that the wildcard transition is really allowed and that the element is valid against the wildcard (this will take account of its processContents attribute), and then proceed to process the next child element in the state reached by this transition.

• If there is a named-element transition possible, then we call apply-templates to check that the child element is valid against the required type for the named element, and then proceed to process the next child element in the state reached by this transition.

The actual logic is more complex than this. Firstly, we use a finite state machine with counters, to reduce the size of the finite state machine needed for a grammar that allows a large number of repetitions of an element (a large maxOccurs value). Secondly, XSD 1.1 allows "open content" which allows elements matching a given wildcard to appear either (a) anywhere (interleaved content), or (b) at the end of the sequence (suffix content). The possibility of open content is not integrated into the finite state machine, but is instead handled by the validator as it arises. However, the basic principle is retained of stepping through the children using xsl:iterate to maintain the current state.

4.4. Checking Assertions

The code for checking input data against assertions defined in the schema is very straightforward. Here is the actual logic (no simplifications this time):
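A sketch of the shape of this code follows (it is not the original listing: $assertions is assumed to hold the scm:assertion elements for the governing type, and vr:error is an assumed helper that constructs an error object):

```xml
<xsl:mode name="strip-comments-and-pis" on-no-match="shallow-copy"/>
<xsl:template match="comment() | processing-instruction()"
              mode="strip-comments-and-pis"/>

<xsl:function name="vr:check-assertions" as="map(*)">
  <xsl:param name="element" as="element()"/>
  <xsl:param name="assertions" as="element()*"/>
  <!-- the spec requires assertions to see a copy of the subtree in which
       comments and processing instructions have been removed -->
  <xsl:variable name="clean" as="element()">
    <xsl:apply-templates select="$element" mode="strip-comments-and-pis"/>
  </xsl:variable>
  <xsl:map>
    <xsl:map-entry key="'errors'">
      <xsl:for-each select="$assertions">
        <xsl:try>
          <xsl:variable name="outcome" as="item()*">
            <!-- the real code also supplies $value and the base URI from the SCM -->
            <xsl:evaluate xpath="@test" context-item="$clean"
                          namespace-context="."/>
          </xsl:variable>
          <xsl:if test="not(boolean($outcome))">
            <xsl:sequence select="vr:error($element,
                'does not satisfy assertion ' || @test)"/>
          </xsl:if>
          <xsl:catch>
            <xsl:sequence select="vr:error($element,
                'dynamic error evaluating assertion ' || @test)"/>
          </xsl:catch>
        </xsl:try>
      </xsl:for-each>
    </xsl:map-entry>
  </xsl:map>
</xsl:function>
```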

Notes relating to this code:

• The function returns a map, but the saxon:as declaration reveals that there is only one field defined in this map, namely the errors field. If the constraint is satisfied, an empty map is returned. The reason for defining it this way is that the calling code can use its standard mechanism for combining the results of different validation processes.

• The function makes a copy of the element being validated, in which comments and processing instructions have been removed. This is prescribed by the specification. A copy of the subtree is needed to ensure that the XPath expressions in the assertion have no access to nodes in the input document that fall outside the subtree being validated.


• The assertion is evaluated using xsl:evaluate, a new XSLT 3.0 instruction that evaluates XPath expressions known only dynamically (in this case, an expression read from the SCM file). The instruction provides machinery to establish the static and dynamic context for evaluating the expression, here including the context item, the value of the $value variable, the namespace context, and the base URI.

• If the effective boolean value of the assertion result is false, the function returns an error value.

• If a dynamic error occurs while evaluating the assertion, this is caught using the new xsl:try instruction in XSLT 3.0, and the function returns an error value.

• The whole process is repeated for each defined assertion. If more than one assertion fails, then more than one error will be returned.

4.5. Other Complications

It's worth mentioning a few other complications that the implementation has to deal with, without going into gory detail:

• Gregorian types. XSD 1.1 introduces a new facet which allows you to specify that the timezone on a date/time value is mandatory or optional. It turns out to be easy to check this using XPath expressions for an xs:date, xs:time, or xs:dateTime; but there's no easy way to do it for xs:gYear, xs:gYearMonth, and friends. Similarly, XSD 1.1 allows facets to control the range of these values, for which XPath offers no support. The validator therefore includes a library of functions for handling the Gregorian types.
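For example, a timezone check on a Gregorian value can fall back on inspecting the lexical form; a sketch, with an assumed function name:

```xml
<!-- true if the lexical form of a gYear (or similar Gregorian value)
     ends with an explicit timezone, e.g. "1981Z" or "1981+05:00" -->
<xsl:function name="vr:has-timezone" as="xs:boolean">
  <xsl:param name="lexical" as="xs:string"/>
  <xsl:sequence select="matches($lexical, '(Z|[+-][0-9]{2}:[0-9]{2})$')"/>
</xsl:function>
```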

• Regular Expressions. The syntax for regular expressions contained in the XSD pattern facet is very similar to the syntax for the XPath fn:matches() function -- but not quite the same. Most of the differences are extensions in the XPath version (for example, support for back-references), and since the schema compiler has done static validation on the expression, we can ignore these differences. The remaining difficulty is the characters "^" and "$", which represent themselves in XSD, but are meta-characters in XPath. To handle this we need to pre-process the regular expression to replace occurrences of "^" and "$" (if not within square brackets) by "\^" and "\$".
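This pre-processing can be sketched using xsl:analyze-string; the function name is illustrative, and for simplicity this version ignores escaped square brackets:

```xml
<xsl:function name="vr:xsd-to-xpath-regex" as="xs:string">
  <xsl:param name="pattern" as="xs:string"/>
  <xsl:variable name="parts" as="xs:string*">
    <!-- leave character groups [...] untouched; escape ^ and $ elsewhere -->
    <xsl:analyze-string select="$pattern" regex="\[[^\]]*\]">
      <xsl:matching-substring>
        <xsl:sequence select="."/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:sequence select="replace(., '([\^\$])', '\\$1')"/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:variable>
  <xsl:sequence select="string-join($parts)"/>
</xsl:function>
```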

• Equality semantics. The equality matching rules in XSD 1.1, used for example by the enumeration facet or in key/keyref comparison, don't correspond directly to any of the ways of testing equality in XPath. For example, in XPath the integer 3 and the double 3e0 are equal; in XSD they are not. This also makes it difficult to use XPath maps to enforce uniqueness constraints. It is therefore necessary to use a custom function for comparing atomic values: a function schemaComparable(x) with the property that schemaComparable(x) eq schemaComparable(y) is true under the XPath rules if and only if x and y are equal under the XSD rules.

5. Results

This section attempts to assess the quality metrics of the completed validator.

Saxonica's quality objectives for all its software are conformance, usability, and performance, in that order. We therefore assess the validator against these criteria.

5.1. Conformance

W3C publishes a comprehensive test suite for XSD 1.1. The XSLT validator is currently passing 41330 out of 41363 tests. This is after excluding tests that are not applicable, for example, tests that rely on xsi:schemaLocation. This level of conformance comfortably exceeds that of many widely-used schema processors, and the failures largely involve edge cases that few users will ever encounter.

5.2. Usability

Usability is measured primarily by the quality of the error messages. This is not yet as good as the Java validator. Here is the validation report produced for the invalid booklist document supplied earlier:

Element date is not allowed here
Element author is not allowed here
must satisfy assertion if (publisher eq 'McGraw Hill') then starts-with(@isbn, '007')
    else if (publisher eq 'Academic Press') then starts-with(@isbn, '012')
    else true()

In comparison with the report produced by the Java validator, we see:

• There are no line and column numbers associated with the errors, only a path. This omission is because XSLT provides no standard way of obtaining the line or column number of a node. Some versions of Saxon provide extension functions to get this information, but we want to avoid using extensions; and in any case, a key target environment is Saxon-JS, where we rely on the JavaScript DOM as our data model, and the JavaScript DOM does not provide line and column information.

• There is no information about which constraint in the XSD specification is violated (this information is of little value to most users, but it is required by the conformance rules).

• There are only three errors reported, rather than five. This is because the XSLT validator is quicker to stop validating sibling elements once one of them has been found to be in error.

So there is some work still to be done to get the usability up to the level of the existing validator.

5.3. Performance

No serious work on optimizing (or measuring) performance has yet been done. However, it's useful to get some very preliminary data to assess whether performance is going to be a major obstacle to the feasibility of the approach.

I constructed a valid data file containing ten thousand book elements.

With the existing Saxon-EE schema validator, validation took 1.1 seconds.

With the XSLT validator, validation (using Saxon-EE on Java as the XSLT engine) took 9.4 seconds.

This represents a ballpark estimate of the relative efficiency: a slowdown of roughly 8.5 times. It's not a thorough benchmark in any way; there is no point in doing a thorough benchmark until some basic performance tuning has been done.

There are clearly many opportunities for performance improvement. Some of the obvious inefficiencies, relative to the Java validator, include:

• In evaluating items with a pattern facet, the regular expression is recompiled every time an item is validated. This is because the fn:matches() function precompiles the regular expression if it is known statically, but it makes no attempt to cache the compiled regular expression if the same regex is used repeatedly. The regex in this case is read from the SCM file at run-time, so no compile-time optimization is possible.

• Similarly, XPath expressions used in assertions may be recompiled every time they are used. There are some circumstances in which xsl:evaluate will cache compiled XPath expressions that are used repeatedly, but this doesn't appear to be happening in this stylesheet.

• Too much data is being retained from validation and passed upwards from the validation of a child to the validation of its parent. This results in bloated maps containing validation outcomes that take a long time to combine. It's probably not difficult to find "low-hanging" optimizations in this area.
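The first of these inefficiencies is the kind of thing a small run-time cache fixes. A Python sketch of the idea (illustrative only, not the Saxon implementation): memoize the compiled form on the pattern string, so each distinct pattern is compiled once however many items it validates.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled(pattern: str) -> re.Pattern:
    # compile each distinct pattern once; repeat calls hit the cache
    return re.compile(pattern)

def pattern_facet_ok(value: str, pattern: str) -> bool:
    # XSD pattern facets are implicitly anchored, hence fullmatch
    return compiled(pattern).fullmatch(value) is not None
```

Validating ten thousand book elements against the same pattern facet then pays the compilation cost once rather than ten thousand times.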

It's clear that a lot of the time is being spent creating and combining the maps that are used to pass data up the tree. The whole application relies very heavily on maps, and its performance depends on the performance of map operations such as map:put and map:merge. It's possible that it might benefit from a different implementation of maps that is tailored to the usage patterns that occur when map types are declared as tuple types. It could also benefit from changes to the application to make more selective use of maps. In particular, we seem to be incurring heavy costs inspecting and copying maps that are actually empty, because all the data is valid: there's clearly an opportunity for optimizations here.
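The empty-map observation suggests an easy win: short-circuit the combine step when one side contributed nothing, which is the common case for a valid document. A Python sketch of the idea, with dicts standing in for XPath maps (the path-to-errors shape of the map is invented for illustration):

```python
def merge_outcomes(parent: dict, child: dict) -> dict:
    """Combine validation outcomes from a child node into its parent's."""
    if not child:
        return parent      # valid subtree: no copying, no allocation
    if not parent:
        return child
    merged = {path: list(errs) for path, errs in parent.items()}
    for path, errs in child.items():
        merged.setdefault(path, []).extend(errs)
    return merged
```

When the whole subtree is valid, the parent's map is returned unchanged, so no copying cost is incurred at all on the happy path.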

For the moment, all we can conclude about performance is that more work needs to be done. Making the XSLT validator as fast as the existing Java validator is probably unachievable, but we should be able to get acceptably close.

6. Conclusions

Firstly, we have shown that writing an XSD 1.1 validator in XSLT 3.0 is feasible, with a few caveats: the main limitation is that XSLT 3.0 does not allow the creation of typed element and attribute nodes except by use of an XSD validator, which creates a recursive dependency. This limitation could probably be solved by means of a few simple XSLT extension functions.


Whether an XSD 1.1 validator written in this way can achieve an acceptable level of usability and performance remains an open question. We're certainly close, but we're not quite there yet.

The experiment has certainly yielded insights on how to design complex applications to take advantage of new XSLT 3.0 capabilities; and it has provided usability feedback that has led directly to improvements in the XSLT 3.0 processor, for example in the way type-checking errors with maps are reported and with features for tracing and diagnostics.

It has not so far proved possible to write a streaming validator by exploiting the streaming capabilities of XSLT 3.0. The main obstacle is that the xsl:evaluate instruction is intrinsically unstreamable because it cannot be statically analyzed. One way around this problem would be to generate a custom stylesheet to perform validation against a specific schema. Another approach might be to allow a stylesheet author to assert that an xsl:evaluate instruction is streamable, and to have the XPath compiler check this assertion when the instruction is evaluated.
