XML Technologies Dissected

Spotlight XML Technologies Dissected Erik Wilde • Swiss Federal Institute of Technology, Zürich The lack of well-defined information models in many XML technologies can generate compatibility problems and lower the odds of their long-term success. ML technologies have become ubiquitous izing the entities or an API for accessing the in applications for computer communica- entities from within applications. X tions, business-to-business IT architec- tures, and Web-based services. They are evolving It’s important to clearly distinguish between the rapidly, with developer requirements fueling both properties of an information model and those of its the maturation of core technologies and the emer- data model. For example, I’m concerned here only gence of new ones. However, many of these XML with passive data structures. This might at times technologies lack well-designed support, making seem contradictory. The Document Object Model’s life harder for developers and creating possible application programming interface, for example, obstacles to ultimate success. clearly provides operations for such things as Here, I systematically examine the information adding attributes to elements. However, this is not and data models of several influential XML tech- a feature of the DOM information model of XML, nologies. I define information model and data which simply defines that elements can have any model in keeping with the terms discussed in RFC number of attributes. The DOM API as a data model 3444.1 In doing so, my hope is that the clarification provides functions for inserting attributes, but this that this RFC offers can both benefit XML tech- is essentially the same as using the XML serializa- nologies and help identify areas for future work. tion data model (that is, an XML 1.0 document) and inserting the attribute using a text editor. In both Information and Data Models cases, the information model behind the data model I’ve adapted the following definitions to fit XML is modified to include a new attribute, using a and a discussion of passive data structures; specif- mechanism that the particular data model enables. ically, the definitions exclude the notion of operations, which is only relevant for modeling objects. XML Technologies Researchers and developers have always hotly • An information model is an abstract descrip- debated the question of whether an XML information of a given application area’s entities and tion model should be clearly defined. Given the their relations. The information model’s goal is success of many Internet technologies without an to identify, describe, and relate the target enti- explicitly defined information model — including ties, rather than define a specific means for re- XML — an information model is clearly not presenting, accessing, or manipulating them. required as a prerequisite for success. However, an •A data model can be regarded as a lower-level information model is beneficial for the sake of implementation of the higher-level information proper layering, well-defined semantics, and the model. Data models are based on two things: ease of defining new data models. Widely used the information model they’re implementing XML technologies should also support at least two and their particular implementation method. data models: one for representation, and another This method might define a syntax for serial- for program access through an API. Although 2 SEPTEMBER • OCTOBER 2003 Published by the IEEE Computer Society 1089-7801/03/$17.00©2003 IEEE IEEE INTERNET COMPUTING XML Technologies technologies can evidently thrive without meeting through the infoset can be accessed through these these requirements, such models offer benefits for models (although SAX2 requires extensions). Data extensibility and portability. models for infoset serialization are XML 1.0 and Canonical XML, which is the normalized encoding XML Documents of an infoset as an XML 1.0 document. XML 1.0 basically defines a data model (a representation) without an explicit information model. XML Path Language 1.0. XPath 1.0 (www.w3. Whether this is a boon or a bane is the subject of org/TR/xpath) is the foundation for several other ongoing and lively discussions within various cir- XML technologies, in particular XSL Transforma- cles. XML 1.0 does contain an implicit informa- tions (XSLT; see www.w3.org/TR/xslt) 1.0. XPath tion model (through its markup structures), but — 1.0 uses a slight simplification of the infoset’s as specification writers discovered when they information model and is defined in terms of attempted to define other XML 1.0 data models — infoset information items. Consequently, the XPath there’s no clear consensus about exactly what that 1.0 information model excludes the same infor- information model is or should be. Because vari- mation as the infoset. ous communities hold different views of XML, XPath 1.0’s most popular data model uses XPath there are now several XML information models. 1.0’s path-based expression syntax. XPath 1.0 has been designed so that other specifications can Document Object Model. Developers introduced extend it; XSLT 1.0 and the XML Pointer Language DOM for HTML, then adapted it to XML. Different (XPointer), for example, extend XPath by adding versions implicitly defined different information constructs and functions. Developers can also use models; only the latest version, DOM level 3,2 XPath 1.0 syntax with DOM3’s XPath module. The includes enough information to make its informa- only established data model for XPath 1.0 serial- tion model a superset of the XML information set. ization is XML 1.0 itself, even though some XSLT A DOM information model’s data model processors can consume and produce SAX events, includes DOM itself, which is available in many thus pipelining multiple XSLT processing steps programming language mappings. The Simple API without repeated serialization and parsing. for XML (SAX; see www.saxproject.org) imple- ments only a subset of DOM’s information model, XML Schema. XML Schema (www.w3.org/xml/ but developers can easily extend SAX2 to support schema) is a rather complex schema language that the full DOM information model. The JDOM API, uses Post Schema Validation infoset (PSVI) contri- designed specifically for Java, also supports DOM’s butions to extend the infoset with information from information model (see www.jdom.org). XML Schema processing. Formally, XML Schema DOM has no specific serialization data model. defines XML Schema processing in terms of infos- Some DOM implementations proposed a persistent et augmentation, rather than as processing an XML DOM (PDOM) as a serialization format, but these document. Consequently, the PSVI lacks the same formats were proprietary and never caught on. information that is missing in the infoset. Developers can, however, use XML 1.0 document Unfortunately, there is no agreed-upon data syntax as a serialization format. model for the PSVI part of XML Schema’s information model. This is true for representations as XML Information Set. The XML Infoset (www.w3. well as for APIs. For a representation, it would be org/TR/xml-infoset) is an attempt to define an necessary to represent the considerable amount of XML information model. Although the infoset is PSVI information, and there are some examples the foundation for other information models (most for doing this using XML 1.0 syntax, but these notably XPath and XML Schema), it omits infor- have not caught on. On the API side, there are only mation from XML documents that is indispensable proprietary proposals. Although DOM3’s modular in some application scenarios — most notably, structure would be ideally suited to define a PSVI information about an XML document’s entity module, the DOM working group has yet to sup- structure and about CDATA sections. Furthermore, port PSVI access. only documents that are compliant with XML namespaces have an infoset. XML Query Language 1.0 and XML Path Language Data models for the infoset are DOM3 and SAX2. 2.0. For XQuery 1.0 (www.w3.org/TR/xquery) and Despite subtle differences (such as the treatment of XPath 2.0, developers had to define a new infor- namespace attributes) all information available mation model (because XQuery developers wanted IEEE INTERNET COMPUTING http://computer.org/internet/ SEPTEMBER • OCTOBER 2003 3 Spotlight to add data-type support). They’re still building the in the abstract questions behind the current diver- model — which they unfortunately called “data sity of XML information and data models, it’s model” in the specification — but it clearly builds important that they keep them in mind. The fact on top of the XML Schema information model that different XML technologies are built on dif- (that is, the infoset with PSVI augmentations) with ferent information models directly influences how some extensions, such as document fragments and a developer views and processes XML documents. atomic values. Basically, the XQuery 1.0 and XPath 2.0 information model is a stripped-down version XML Schema Languages of the XML Schema information model, with only Developers use XML schema languages to express partially visible PSVI information. XML document constraints. In many cases, schema As for data models, the XQuery 1.0 and XPath 2.0 languages are grammar-based — such as Document information model takes the same approach as the Type Definitions and XML Schema — but other XPath 1.0 information model: it serves as a founda- approaches have proven useful for certain appli- tion for other specifications, including XQuery 1.0 cation scenarios (such as the rule-based Schema- and XSLT 2.0. Perhaps later versions of DOM will tron language, used for expressing XPath-based include an XPath 2.0 module. For serialization, the constraints). In the following list, I focus on the same limitations apply as those for the XML Schema schema languages’ information and data models, information model: developers can easily serialize rather than on how validation might affect the val- idated document.

XML Technologies Dissected

On the Integrity and Trustworthiness of Web Produced Data

XML Signature/Encryption — the Basis of Web Services Security

Sams Teach Yourself XML in 21 Days

Bibliography of Erik Wilde

Feasibility and Performance Evaluation of Canonical XML

A Layered Approach to XML Canonicalization

Exploring the XML World

Information Integration with XML

Extensible Layout in Functional Documents♦

Oracle Xml Db

[ Team Lib ] .NET & XML Provides an In-Depth, Concentrated Tutorial for Intermediate to Advanced-Level Developers. Additiona

XML Family of Languages Overview and Classification of W3C Specifications