Extreme G22.3033-007

Session 3 - Sub-Topic 4 XML Information Processing

Dr. Jean-Claude Franchitti

New York University Computer Science Department Courant Institute of Mathematical Sciences

Agenda

Q XML applications development tools for Java

Q XML Application Development Using the XML Java APIs

Q Java-based XML application support frameworks

Q Advanced XML Parser Technology

Q JDOM: Java-Centric API for XML

Q JAXP: Java API for XML Processing

Q Parsers comparison

Q Latest W3C APIs and Standards for Processing XML

Q XML Infoset, DOM Level 3, Canonical XML

Q XML Signatures, XBase, XInclude

Q XML Schema Adjuncts

Q Java-Based XML Data Processing Frameworks

XML-Based Development

Q Business Engineering Methodology

Q Language + Process + Tools

Q e.g., Rational Unified Process (RUP)

Q XML Application Development Infrastructure

Q Metadata Management (e.g., XMI)

Q XML APIs (e.g., JAXP, JAXB)

Q XML Tools (e.g., XML Editors, XML Parsers)

Q XML Applications:

Q Application(s) of XML

Q XML-based applications/services (mediators)

Q MOM & POP

Q Other Services (e.g., persistence, transaction, etc.)

Q Application Infrastructure Frameworks

More on XML Information Modeling

Q Using UML use cases to support the development of DTDs and XML Schemas

Q Establish linking relationship

Q See Family tree application of XML

Part I

XML Application Development Tools for Java

Java-enabled XML Technologies

Q XML provides a universal syntax for Java semantics (behavior)

Q Portable, reusable data descriptions in XML

Q Portable Java code that makes the data behave in various ways

Q XML standard extension

Q Basic plumbing that translates XML into Java

Q Parser, namespace support in the parser, Simple API for XML (SAX), and Document Object Model (DOM)

Q XML data binding standard extension

XML Processors Characteristics

Q An XML engine is a general purpose XML data processor

Q An XML processor/parser is a software engine that checks the syntax (well-formedness) of XML documents

Q If a schema (or DTD) is included, the parser can (optionally) validate the correctness of XML documents’ structure against it

Q A parser reads the XML document’s information and makes it accessible to the XML application via a standard API
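The difference between the well-formedness check and optional validation can be sketched with the JAXP classes that ship with the JDK. This is a minimal illustration, not a recommended design: the class name ParserChecks and the sample "note" documents are hypothetical, and the DTD is embedded in the document so that no external file is needed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class used only for this illustration.
final class ParserChecks {

    /**
     * Returns true if the text parses. With validate == false only
     * well-formedness is checked; with validate == true, DTD validity
     * errors also make the check fail.
     */
    static boolean parses(String xml, boolean validate) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validate); // validation is optional and off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // warnings do not affect the result
                }
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity error
                }
                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // well-formedness error
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
        System.out.println(parses("<note><to>Ann</note>", false));                 // mismatched tags
        System.out.println(parses(dtd + "<note><from>Bob</from></note>", false));  // well-formed
        System.out.println(parses(dtd + "<note><from>Bob</from></note>", true));   // not valid
    }
}
```

A document that violates its DTD still passes the first kind of check, which is why validation has to be requested explicitly.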

Common XML APIs

Q Document Object Model (DOM) API

Q Tree structure-based API

Q Issued as a W3C recommendation (10/98)

Q See Session 5 Sub-Topic 1 Presentation

Q Simple API for XML (SAX)

Q Event-driven API

Q Developed by David Megginson

Q ElementHandler API

Q Event-driven proprietary API provided by IBM’s XML4J

Q Pure Java APIs: JDOM (Open Source) and JAXP

Java API Packages

Q javax.xml.parsers

Q The JAXP APIs, which provide a common interface for different vendors' SAX and DOM parsers.

Q Two vendor-neutral factory classes: SAXParserFactory and DocumentBuilderFactory that give you a SAXParser and a DocumentBuilder, respectively. The DocumentBuilder, in turn, creates a DOM-compliant Document object.

Q org.w3c.dom

Q Defines the Document class (a DOM), as well as classes for all of the components of a DOM.

Q org.xml.sax

Q Defines the basic SAX APIs.

Q javax.xml.transform

Q Defines the XSLT APIs that let you transform XML into other forms.

Simple API for XML (SAX) Parsing APIs

SAX API Packages

Q org.xml.sax

Q Defines the SAX interfaces.

Q org.xml.sax.ext

Q Defines SAX extensions that are used when doing more sophisticated SAX processing, for example, to process a document type definition (DTD) or to see the detailed syntax for a file.

Q org.xml.sax.helpers

Q Contains helper classes that make it easier to use SAX -- for example, by defining a default handler that has null-methods for all of the interfaces, so you only need to override the ones you actually want to implement.

Q javax.xml.parsers

Q Defines the SAXParserFactory class which returns the SAXParser. Also defines exception classes for reporting errors.
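As a rough sketch of how these packages fit together, the following obtains a SAXParser from SAXParserFactory and overrides a single DefaultHandler callback. The class name SaxElementLister and the sample input are hypothetical; only the JDK's bundled parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects element names in the order
// the parser reports their start tags.
final class SaxElementLister {

    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        // DefaultHandler supplies empty methods for every callback,
        // so only startElement needs to be overridden here.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><b/><c><b/></c></a>"));
    }
}
```

No tree is built: the handler sees each start tag once, in document order, and keeps only what it chooses to record.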

DOM Parsing APIs

DOM API Packages

Q org.w3c.dom

Q Defines the DOM programming interfaces for XML (and, optionally, HTML) documents, as specified by the W3C.

Q javax.xml.parsers

Q Defines the DocumentBuilderFactory class and the DocumentBuilder class, which returns an object that implements the W3C Document interface. The factory that is used to create the builder is determined by the javax.xml.parsers.DocumentBuilderFactory system property, which can be set from the command line or overridden when invoking the newInstance method. This package also defines the ParserConfigurationException class for reporting errors.
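A minimal sketch of the DOM route, assuming the JDK's default parser: DocumentBuilderFactory yields a DocumentBuilder, which returns an org.w3c.dom.Document that can then be walked. The class name DomAttributeReader and the "person" vocabulary are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: reads attribute values from a parsed DOM tree.
final class DomAttributeReader {

    /** Returns the "id" attribute of every person element, comma-separated, in document order. */
    static String personIds(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList people = doc.getElementsByTagName("person");
            StringBuilder ids = new StringBuilder();
            for (int i = 0; i < people.getLength(); i++) {
                Element person = (Element) people.item(i);
                if (i > 0) {
                    ids.append(',');
                }
                ids.append(person.getAttribute("id"));
            }
            return ids.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(personIds("<personnel><person id=\"a1\"/><person id=\"b2\"/></personnel>"));
    }
}
```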

XSLT APIs

XSLT API Packages

Q See Session 3 handout on “Processing XML Documents in Java Using XPath and XSLT”

Q javax.xml.transform

Q Defines the TransformerFactory and Transformer classes, which you use to get an object capable of doing transformations. After creating a transformer object, you invoke its transform() method, providing it with an input (source) and output (result).

Q javax.xml.transform.dom

Q Classes to create input (source) and output (result) objects from a DOM.

Q javax.xml.transform.sax

Q Classes to create input (source) from a SAX parser and output (result) objects from a SAX event handler.

Q javax.xml.transform.stream

Q Classes to create input (source) and output (result) objects from an I/O stream.
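The transform() call described above can be sketched with the stream classes. The stylesheet, the class name XsltDemo, and the "people" sample are assumptions made for illustration; the transformer is whichever implementation TransformerFactory locates at run time.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class: applies a small, made-up stylesheet to an XML string.
final class XsltDemo {

    // Illustrative stylesheet: writes the text of each name element followed by ';'.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//name'>"
        + "<xsl:value-of select='.'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            // The stylesheet is itself supplied as a Source.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            // Input (source) and output (result) are both stream-based here.
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<people><name>Ann</name><name>Bob</name></people>"));
    }
}
```

Swapping StreamSource for a DOMSource or SAXSource changes where the input comes from without touching the transform() call itself.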

JAXP and Associated XML APIs

Q JAXP: Java API for XML Processing

Q Common interface to SAX, DOM, and XSLT APIs in Java, regardless of which vendor's implementation is actually being used.

Q JAXB: Java Architecture for XML Binding

Q Mechanism for writing out Java objects as XML (marshalling) and for creating Java objects from such structures (unmarshalling).

Q JDOM: Java DOM

Q Provides an object tree which is easier to use than a DOM tree, and it can be created from an XML structure without a compilation step.

Q JAXM: Java API for XML Messaging

Q Mechanism for exchanging XML messages between applications.

Q JAXR: Java API for XML Registries

Q Mechanism for publishing available services in an external registry, and for consulting the registry to find those services.

Content of Jar Files

Q jaxp.jar (interfaces)

Q javax.xml.parsers

Q javax.xml.transform

Q javax.xml.transform.dom

Q javax.xml.transform.sax

Q javax.xml.transform.stream

Q crimson.jar (interfaces and helper classes)

Q org.xml.sax

Q org.xml.sax.helpers

Q org.xml.sax.ext

Q org.w3c.dom

Q xalan.jar (contains all of the above implementation classes)

Sample XML parsers and engines

Q XML parsers

Q RXP, Dan Connolly’s XML parser, XML-Toolkit, LTXML, expat, TCLXML, xparse, XP, DataChannel XPLparser (DXP), XML:Parse, PyXMLTok, Lark, Microsoft’s XML parser, IBM’s XML for Java, Apache’s Xerces-J, Ælfred, xmlproc, xmllib, Windows foundation classes, Java Project X Parser (Crimson), OpenXML Parser, Oracle XML Parser, etc.

Q SGML/XML parsers

Q SGMLSpm, SP

Sample XML Parsers and Engines (continued)

Q XML middleware: Xpublish (Media Design), XML middleware 1.0

Q DSSSL engines: Jade 1.1, DAE SDK, DAE Server SDK

Q XSL processors: Sparse, Microsoft XSL processor, doproc, xslj, LotusXSL, Xalan, XSL:P

Q XLink processors: xmllinks

Comprehensive List of XML Processors

Q A comprehensive list of parsers is available at http://www.xmlsoftware.com/parsers

Q Includes links to latest product pages

Q Includes Version numbers, Licensing information, and Platform details

Q Research work being done around MetaParsers and parallel XML parsers

Mainstream Java-Based XML Processors

Q Sun’s Java Project X Parser

Q Donated on April 13, 2000 to Apache’s XML Project under the name “Crimson”

Q Apache’s XercesJ

Q XercesJ is strongly recommended for this course

Q Oracle’s XML Parser for Java

Other Java-Based XML Processors

Q Sun’s JAXP

Q Jason Hunter and Brett McLaughlin’s OpenSource JDOM

Q IBM Alphaworks’s XML for Java (XML4J)

Q Based on the XML Parser

Q DataChannel’s XJParser

XML Data Binding Standard Extension

Q Aims to automatically generate substantial portions of the Java platform code that processes XML data

Q A Sun project, codenamed “Adelard”

Q See JSR-31 XML Data Binding Specification

Q see http://java.sun.com/xml/jaxp-1.0.1/docs/binding/DataBinding.

Part II

XML Application Development Using the XML Java APIs

Typical XML Processor Installation

Q Pick a processor based on the features it provides to match your requirements

Q Download and install the latest (or supported) version of the JDK from http://www.javasoft.com

Q Install the XML processor

Q Update the PATH and CLASSPATH variables as needed, and test the processor

Reading XML Documents

Q Use Apache’s XercesJ or Alphaworks’ XML Parser for Java

Q The “SimpleParse.java” application provided in section 2.4 of “XML and Java” will need to be adapted to support the latest version of the parsers

Q We suggest looking at the source for the sample applications located in XercesSamples.jar

Q For testing, use XML and Java’s sample document or the “personal.xml” sample XML document provided with XML4J

Presenting XML Documents Using Java Tools

Q Presenting an XML document requires processing of the XML document by accessing its internal structure

Q An XML document’s structure can be accessed using the various XML APIs

Q Various third party tools have been implemented using such APIs to apply XSL style sheets to XML documents and generate HTML output (e.g., Xalan, LotusXSL)

XML Data Exchange Protocols

Q Message format alternatives

Q Text-based (e.g., EDI, RFC822, SGML, XML)

Q Binary (e.g., ASN.1, CORBA/IIOP)

Q See XML and Java sections 7.2, and 7.4

Q An API that provides a common interface to work with EDI or XML/EDI objects is supported by OpenBusinessObjects

Q Guidelines for using XML for EDI are provided at http://www.geocities.com/WallStreet/Floor/5815/guide.htm and http://www.xmledi-group.org/

XML Fragment Interchange

Q Defines a way to send fragments of an XML document without having to send all of the containing document up to the fragment

Q Fragments are not limited to predetermined entities

Q The approach captures the context that the fragment had in the larger document to make it available to the recipient

Q See http://www.w3.org/TR/WD-xml-fragment

XML APIs Characteristics

Q DOM API: (See http://www.developerlife.com/domintro/default.htm)

Q In DOM, an XML document is represented as a tree, which becomes accessible via the API

Q The XML processor generates the whole tree in memory and hands it to an application program

Q SAX API: (See http://java.sun.com/xml/docs/tutorial/sax/index.html)

Q Does not generate a data structure

Q Scans an XML document and generates events as elements are processed

Q Events can be trapped by an application program via the API

Q ElementHandler:

Q Event-driven like SAX, but also creates a DOM tree

Q Open Source Pure Java API (JDOM)

Related Java Bindings

Q Sun’s Java API for XML Processing (JAXP)

Q Provides a standard way to seamlessly integrate any XML-compliant parser with a Java application

Q Developers can swap between XML parsers without changing the application

Q The reference implementation uses Sun’s Java Project X as its default XML parser

Q DOM 3.0, DOM 2.0 and DOM 1.0 Java binding specification (http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/java-language-binding.zip)
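One way to see the parser-swapping idea at work is to ask the JAXP factories which concrete classes they resolved to. The sketch below assumes nothing beyond the JDK; the class name ParserLookup is hypothetical, and the printed names depend on the JDK and classpath in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical example class: reports which parser implementation JAXP selected.
final class ParserLookup {

    static String domFactoryClass() {
        // Application code names only the abstract factory; the concrete
        // class is located by JAXP at run time.
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```

Starting the JVM with the javax.xml.parsers.DocumentBuilderFactory or javax.xml.parsers.SAXParserFactory system property set to another vendor's factory class (which must be on the classpath) should change the reported names without recompiling the application.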

XML Data Processing Examples

Q Section 2.7 of “XML and Java” covers various examples of XML document processing using the DOM, SAX, and ElementHandler APIs.

Q Session 2’s Sub-Topic 2.2.8.1 on “Enterprise Application Integration with XML and Java” illustrates the use of XML for data interchange

Part III

Java-Based Application Support Frameworks

XML MOM and POP Frameworks

Q An XML support framework must include:

Q XML Parser (conformity checker)

Q XML applications that use the output of the parser to achieve unique objectives

Q See sub-section 2.3.2 of the weekly notes on “XML MOM Application Server Frameworks” for a complete description of a general purpose XML MOM framework

POP Applications Support Frameworks

Q Objective is to “serve” XML

Q HTML generation applications are provided

Q Sample solutions

Q XML::Parser module with Perl

Q XML processing via Java servlets

Q e.g., IBM Alphaworks’ XMLEnabler

Q See session 2’s sub-topic 2.3.2 on “XML POP Application Server Framework”

Q Apache’s Cocoon

Q Active Server Pages (ASP) with MSXML (see “Serving XML with ASP”, and rocket

MOM Applications Support Frameworks

Q Many applications can be envisioned

Q One objective is to support application integration via XML data interchange

Q Sample solutions:

Q XML::Parser module with Perl

Q XML processing via Java applications

Part IV

Advanced XML Parser Technology

XML Data Processing Patterns

Q Processing and Transformation

Q Reading XML Documents

Q Working with Encodings in XML Documents

Q XML Document to/from DOM (generation)

Q Printing XML Documents from DOM

Q Building and Working with DOM Structures

Q Creating a DOM Structure from Scratch

Q Building a Valid DOM Tree

Q Manipulations Using DOM API
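Two of the patterns listed above, creating a DOM structure from scratch and printing a document from DOM, can be sketched as follows. Serializing through an identity Transformer is one common approach and is an assumption of this sketch, not something the slides prescribe; the class name DomBuildAndPrint and the "personnel" vocabulary are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: builds a small DOM tree and serializes it.
final class DomBuildAndPrint {

    /** Creates a DOM structure from scratch, without parsing any input. */
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("personnel");
            doc.appendChild(root);
            Element person = doc.createElement("person");
            person.setAttribute("id", "a1");
            person.appendChild(doc.createTextNode("Ann"));
            root.appendChild(person);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Prints the document by running it through a transformer with no stylesheet. */
    static String print(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(print(build()));
    }
}
```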

JDOM

Q JSR-102

Q Beta 6 Available

Q Lightweight API with best of SAX and DOM

Q Small memory footprint

Q Does not require whole document in memory

Q Easy to use

Q JDOM: String text = element.getText();

Q DOM: String content = element.getFirstChild().getNodeValue();

Q Converting from DOM to JDOM

Q Use org.jdom.input.DOMBuilder class

Q http://www.ibiblio.org/xml/slides/xmlsig/jdom/JDOM.html 39
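The DOM half of the ease-of-use comparison can be tried with the JDK's built-in parser. The sketch below shows why reading text through the first child node is fragile, and uses getTextContent(), a DOM Level 3 method that is newer than these slides, as a point of comparison. JDOM is a separate library and is deliberately not used here; the class name DomTextAccess and the sample markup are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: two ways of reading element text with plain DOM.
final class DomTextAccess {

    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** DOM Level 1 style: value of whatever node happens to come first. */
    static String firstChildValue(Element element) {
        Node first = element.getFirstChild();
        return first == null ? null : first.getNodeValue();
    }

    /** DOM Level 3 style: concatenated text, skipping comments. */
    static String textContent(Element element) {
        return element.getTextContent();
    }

    public static void main(String[] args) {
        Element plain = root("<title>XML and Java</title>");
        Element commented = root("<title><!-- 2nd ed. -->XML and Java</title>");
        System.out.println(firstChildValue(plain));
        System.out.println(firstChildValue(commented)); // the comment text, not the title
        System.out.println(textContent(commented));
    }
}
```

When the first child is a comment, the first-child idiom returns the comment's text rather than the element's character data, which is the kind of detail a Java-centric API aims to hide.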

JAXP

Q JAXP 1.1

Q Vendor-neutral code for parsing/transforming documents

Q Updated support for SAX 2.0 & DOM Level 2 standards

Q Addition of TrAX (Transformation API for XML)

Q DOM 2 usage patterns:

Q Manipulate Documents, Elements, and Nodes (DOM 1)

Q Views, Stylesheets, Events, traversal, DOM

Q http://www-106.ibm.com/developerworks/library/x-jaxp1.html?dwzone=xml

Parsers Comparison

Q See: http://www.webreference.com/xml/column22/2.html

Q Apache Xerces supports XML-Schema, XSL-T, SAX 2.0, DOM Level 2 1.0, and is open source

Q The Organization for the Advancement of Structured Information Standards (OASIS) has defined an XML conformance test suite

Q Sun’s Parser passes all the tests

Q Oracle’s v2 XML parser is most efficient

Part V

Latest W3C APIs and Standards for Processing XML

XML Infoset

Q Candidate recommendation (May 14, 2001)

Q XML Information Set is a work in progress

Q Provides a means to describe the abstract logical representation of an XML document

Q Infoset’s prime citizens are information items

Q Information items can be used to express the data model of XML documents, and DTDs

Q Not an API

DOM Level 3

Q W3C Working Draft (June 5, 2001)

Q Focus on Platform and language neutral interface to DOM

Canonical XML

Q W3C Recommendation (March 15, 2001)

Q Canonical XML will be used to represent the result of parsing an XML document

Q Canonical XML is necessary to establish the logical “equivalence” of XML documents

Q Every well-formed XML document has a unique structurally equivalent canonical XML document

Q When “canonicalizing” an XML document, parsers preserve the minimum information needed by an XML application

XML Signatures

Q Joint W3C and IETF Working Group

Q W3C Candidate Recommendation (April 19, 2001)

Q XML compliant syntax used for representing the signature of Web resources and portions of protocol messages

Q Procedures for computing and verifying such signatures (i.e., integrity and authentication)

Q Does not address encryption or authorization

XBase

Q W3C Working Draft (February 21, 2001)

Q xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity, which is normally used to resolve relative URIs

Q Equivalent of HTML BASE functionality
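How xml:base changes the resolution of a relative URI can be sketched with a namespace-aware DOM parse and java.net.URI. This is an illustrative reading of the rule, not a complete XML Base processor: the class name XmlBaseResolver, the "link" element, and its "href" attribute are hypothetical, and the example.org URIs are placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: resolves a relative href against the nearest xml:base.
final class XmlBaseResolver {

    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    /** Base URI in effect for a node: ancestors' xml:base values applied from the root down. */
    static URI baseOf(Node node, URI documentUri) {
        if (!(node instanceof Element)) {
            return documentUri; // reached the Document node (or null)
        }
        URI inherited = baseOf(node.getParentNode(), documentUri);
        Element element = (Element) node;
        return element.hasAttributeNS(XML_NS, "base")
                ? inherited.resolve(element.getAttributeNS(XML_NS, "base"))
                : inherited;
    }

    /** Resolves the href attribute of the first link element in the document. */
    static String resolveFirstLink(String xml, String documentUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to see xml:base as a namespaced attribute
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            Element link = (Element) doc.getElementsByTagName("link").item(0);
            return baseOf(link, URI.create(documentUri)).resolve(link.getAttribute("href")).toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<doc xml:base='http://example.org/today/'><p><link href='new.xml'/></p></doc>";
        System.out.println(resolveFirstLink(doc, "http://example.org/index.xml"));
    }
}
```

Without the xml:base attribute, the same href would resolve against the document's own URI instead.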

XInclude

Q W3C Working Draft (May 16, 2001)

Q Processing model and syntax for merging multiple Infosets in a single “composite” Infoset

Q Enables modularity and merging capabilities
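The merging of two documents into one composite tree can be sketched with the XInclude switch that later JAXP versions expose on DocumentBuilderFactory (setXIncludeAware, which postdates these slides). The sketch assumes the JDK's bundled parser supports it and that a temporary directory is writable; the class name XIncludeDemo and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical example class: parses a document that pulls in a second file via xi:include.
final class XIncludeDemo {

    /** Writes two small files, parses the outer one, and counts chapter elements in the result. */
    static int mergedChapterCount() {
        try {
            Path dir = Files.createTempDirectory("xinclude-demo");
            Files.write(dir.resolve("chapter.xml"),
                    "<chapter>Parsing</chapter>".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("book.xml"),
                    ("<book xmlns:xi='http://www.w3.org/2001/XInclude'>"
                   + "<xi:include href='chapter.xml'/>"
                   + "</book>").getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);  // XInclude processing relies on namespaces
            factory.setXIncludeAware(true);   // replace xi:include elements while parsing
            Document doc = factory.newDocumentBuilder().parse(dir.resolve("book.xml").toFile());
            return doc.getElementsByTagName("chapter").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("chapter elements after inclusion: " + mergedChapterCount());
    }
}
```

The application sees a single tree; the xi:include element itself does not appear in the parsed result.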

XML Schema Adjuncts

Q Mechanism for extending XML schema languages, and for providing information from such extensions to programs that process XML instances

Q Schema Adjunct Framework

Q XML-based language used to associate domain-specific data (i.e., adjunct-data) with schemas and their instances, effectively extending the power of existing XML schema languages such as DTDs or XML Schema

Q Domain-specific data include O/R mappings, etc.

Part VI

Java-Based XML Data Processing Frameworks and APIs

Applets, Servlets, and JSPs

Q These component models provide infrastructure support for XML processing

Q See Session 6 handout on “Applets, Servlets, and JSPs”

Xerces-J

Q Xerces 2 is a redesigned implementation of Xerces that emphasizes modularity

Q Xerces 2’s architecture is based on the Xerces Native Interface (XNI) which streams the processing of XML documents via “Scanner”, “Validator” and “Parser” modules

Q A parser configuration object controls the use of various internal components (e.g., symbol table, validator, etc.)

Xalan

Q Xalan-J version 2.1.0 is the latest

Q Provides XSL-T processing for transforming XML documents into HTML, text, or other XML document types

Q Built on top of SAX 2.0, DOM Level 2 1.0, JAXP 1.1

Q Implements the TraX subset of JAXP 1.1

Xang

Q Framework for building data-driven, cross- platform Web applications that integrate disparate data sources

Q Separates data, logic and presentation

Q Example .xap command handler

Q Xang:

Q Identifies the element addressed by URLs

Q Maps the HTTP request to an application method on the targeted element

Q Gathers a body of script available to the command handler for the targeted element

Q Dispatches the HTTP request to the command handler

FOP

Q Latest version is 0.19

Q Print formatter driven by XSL-FO objects

Q Formatted output is in PDF format for now

Q Can be embedded in a Java application by instantiating org.apache.fop.apps.Driver

Cocoon

Q Cocoon 2 is a completely redesigned version

Q It supports an event-based model, and manages memory very efficiently

SOAP

Q Lightweight protocol for information exchange

Q Envelope

Q Framework that describes a message and its processing

Q Set of encoding rules

Q Used to express instances of application datatypes

Q Convention

Q RPC calls and responses

Q Apache distributes an implementation of SOAP referred to as “AXIS”
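The envelope and RPC convention can be sketched by building a SOAP 1.1 message with plain DOM calls. No SOAP toolkit is used here; the class name SoapEnvelopeSketch, the service namespace urn:example:quotes, and the GetLastTradePrice call with its symbol parameter are illustrative placeholders rather than a real service.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: assembles a SOAP 1.1 envelope carrying one RPC-style call.
final class SoapEnvelopeSketch {

    // Namespace of the SOAP 1.1 envelope.
    static final String SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/";

    static Document rpcCall(String serviceNs, String method, String paramName, String paramValue) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            // Envelope: the outer frame that describes the message.
            Element envelope = doc.createElementNS(SOAP_ENV, "SOAP-ENV:Envelope");
            doc.appendChild(envelope);

            // Body: carries the call itself.
            Element body = doc.createElementNS(SOAP_ENV, "SOAP-ENV:Body");
            envelope.appendChild(body);

            // RPC convention: one element named after the method, one child per parameter.
            Element call = doc.createElementNS(serviceNs, "m:" + method);
            body.appendChild(call);
            Element param = doc.createElementNS(null, paramName);
            param.appendChild(doc.createTextNode(paramValue));
            call.appendChild(param);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = rpcCall("urn:example:quotes", "GetLastTradePrice", "symbol", "IBM");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

A toolkit such as the Apache implementation mentioned above would normally generate and transport this structure; the sketch only shows the shape of the message.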

Batik

Q Provides a set of core modules to support SVG solutions

Q SVG parsers

Q SVG generators

Q SVG implementations

Crimson

Q Implements JAXP 1.1 without the javax.xml.transform package

Q Supports SAX 2.0, and DOM Level 2 1.0

Q Xalan-J 2 implements the missing package

Q Based on Sun Project X parser

Q Will move under Xerces-J 2

Part VII

Conclusions

Summary

Q Mainstream MOM and POP application development tools are being supported by IBM, Sun, Oracle, and Microsoft

Q Java MOM and POP applications are developed using Java bindings to the DOM, and SAX APIs

Q XML provides a standard data interchange message format

Q Parsers support a common set of XML data processing patterns

Q JDOM is a lightweight API that supports the best of DOM and SAX

Q JAXP 1.1 provides parser-independent interfaces

Q Important comparison criteria for parsers are the support of the latest W3C specification, conformance, and efficiency

Summary (continued)

Q The latest W3C APIs and standards for processing XML have not reached the level of full recommendations and are not embedded in the functionality supported by parsers

Q Various specialized XML processing frameworks are being developed by the Apache Software Foundation. These frameworks are processing engines that build on parsing and rendition mechanisms, and operate on top of common Java component models

Q The W3C XML-Fragments specification focuses on the handling of XML document fragments

Q MOM and POP (Java-based) application support frameworks are still emerging and are becoming common facilities in the ubiquitous Web Services Infrastructure

More on Industry-Specific Markup Languages (see http://www.oasis-open.org/cover/xml.html#contentsApps)

Q Extensible Business Reporting Language (XBRL)

Q Bank Internet Payment System (BIPS)

Q Electronic Business XML (ebXML)

Q Privacy-Enabled Customer Data Interchange (CPExchange)

Q Visa XML Invoice Specification

Q Legal XML

Q NewsML

Q Electronic Catalog XML (eCX)

Q Open eBook Publication Structure
