XML and Related Technologies Certification Prep, Part 3: XML Processing
Explore How to Parse and Validate XML Documents Plus How to Use XQuery
Skill Level: Intermediate

Mark Lorenz
Senior Application Architect
Hatteras Software, Inc.

26 Sep 2006

© Copyright IBM Corporation 1994, 2008. All rights reserved. developerWorks® ibm.com/developerWorks

Parsing and validation represent the core of XML. Knowing how to use these capabilities well is vital to the successful introduction of XML to your project. This tutorial on XML processing teaches you how to parse and validate XML files as well as use XQuery. It is the third tutorial in a series of five tutorials that you can use to help prepare for the IBM certification Test 142, XML and Related Technologies.

Section 1. Before you start

In this section, you'll find out what to expect from this tutorial and how to get the most out of it.

About this series

This series of five tutorials helps you prepare to take the IBM certification Test 142, XML and Related Technologies, to attain the IBM Certified Solution Developer - XML and Related Technologies certification. This certification identifies an intermediate-level developer who designs and implements applications that make use of XML and related technologies such as XML Schema, Extensible Stylesheet Language Transformation (XSLT), and XPath. This developer has a strong understanding of XML fundamentals; has knowledge of XML concepts and related technologies; understands how data relates to XML, in particular with issues associated with information modeling, XML processing, XML rendering, and Web services; has a thorough knowledge of core XML-related World Wide Web Consortium (W3C) recommendations; and is familiar with well-known best practices.
Anyone working in software development for the last few years is aware that XML provides cross-platform capabilities for data, just as the Java® programming language does for application logic. This series of tutorials is for anyone who wants to go beyond the basics of using XML technologies.

About this tutorial

This tutorial is the third in the "XML and Related Technologies certification prep" series that takes you through the key aspects of effectively using XML technologies on Java projects. This third tutorial focuses on XML processing -- that is, how to parse and validate XML documents. It lays the groundwork for Part 4, which focuses on transformation, including the use of XSLT, XPath, and Cascading Style Sheets (CSS).

This tutorial is written for Java programmers who have a basic understanding of XML and whose skills and experience are at a beginning to intermediate level. You should have a general familiarity with defining, validating, and reading XML documents, as well as a working knowledge of the Java language.

Objectives

After completing this tutorial, you will know how to:

• Parse XML documents using the Simple API for XML 2 (SAX2) and Document Object Model 2 (DOM2) parsers
• Validate XML documents against Document Type Definitions (DTDs) and XML Schemas
• Access XML content from databases using XQuery

Prerequisites

This tutorial is written for developers who have a background in programming and scripting and who have an understanding of basic computer-science models and data structures. You should be familiar with the following XML-related, computer-science concepts: tree traversal, recursion, and reuse of data. You should be familiar with Internet standards and concepts, such as Web browser, client-server, documenting, formatting, e-commerce, and Web applications. Experience designing and implementing Java-based computer applications and working with relational databases is also recommended.
System requirements

To run the examples in this tutorial, you need a Linux® or Microsoft® Windows® box with at least 50MB of free disk space and administrative access to install software. The tutorial uses, but does not require, the following software:

• Java software development kit (JDK) 1.4.2 or later
• Eclipse 3.1 or later
• XMLBuddy 2.0 or later (Note: Some portions of the series use capabilities of XMLBuddy Pro, which is not free.)

See Resources for links to download the above software.

Section 2. Parsing XML documents

You can parse an XML document in multiple ways (see Part 1 of this series, which focuses on architecture), but the SAX parser and the DOM parser constitute the primary ways. Part 1 features a high-level comparison of the two (see Resources).

StAX

A new API, called Streaming API for XML (StAX), is to be released in late 2006. It is a pull API, as opposed to SAX's push model, so it keeps control with the application rather than the parser. You can also use StAX to modify the document being parsed. Read more in "An Introduction to StAX" (see Resources).

XML instance document

This tutorial uses a store's catalog of available DVDs for purchase as the document throughout. Conceptually, the catalog contains a collection of DVDs with information about each DVD associated with it. The actual document is a short catalog with only four DVDs in it, but it has enough complexity for you to learn about XML processing, including validation. Listing 1 shows the file.

Listing 1. The XML instance document for the DVD catalog

    <?xml version="1.0"?>
    <!DOCTYPE catalog SYSTEM "dvd.dtd">
    <!-- DVD inventory -->
    <catalog>
        <dvd code="_1234567">
            <title>Terminator 2</title>
            <description>
                A shape-shifting cyborg is sent back from the future
                to kill the leader of the resistance.
            </description>
            <price>19.95</price>
            <year>1991</year>
        </dvd>
        <dvd code="_7654321">
            <title>The Matrix</title>
            <price>12.95</price>
            <year>1999</year>
        </dvd>
        <dvd code="_2255577" genre="Drama">
            <title>Life as a House</title>
            <description>
                When a man is diagnosed with terminal cancer,
                he takes custody of his misanthropic teenage son.
            </description>
            <price>15.95</price>
            <year>2001</year>
        </dvd>
        <dvd code="_7755522" genre="Action">
            <title>Raiders of the Lost Ark</title>
            <price>14.95</price>
            <year>1981</year>
        </dvd>
    </catalog>

Using the SAX parser

As Part 1 of this series discussed, the SAX parser is an event-based parser. This means that the parser sends events to callback methods as it parses a document (see Figure 1). For simplicity, Figure 1 doesn't show all the events that would actually occur.

Figure 1. SAX parser events

These events are pushed out to the application in real time, as the parser moves across the document contents. One benefit of this processing model is that you can handle large documents with relatively little memory. A downside is that you must do more work yourself to handle all these events.

The org.xml.sax package contains a set of interfaces. One of these, XMLReader, is the interface to the parser. You can set up for parsing like this:

    // Imports needed by the fragment below
    import java.io.IOException;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    try {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.parse( "myDocument.xml" ); //complete path
    } catch ( SAXParseException e ) {
        //document is not well-formed
    } catch ( SAXException e ) {
        //could not find an implementation of XMLReader
    } catch ( IOException e ) {
        //problem reading document file
    }

Apache Xerces2 parser

If you need a parser, you can download the open source Apache Xerces2 parser from The Apache Software Foundation Web site (see Resources).
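
The setup code above reads a file from disk. As a minimal sketch, the same try/catch pattern can be exercised against an in-memory string to check well-formedness. It assumes the parser bundled with the JDK is reachable through JAXP's SAXParserFactory (used here in place of XMLReaderFactory, which is deprecated in recent JDKs). The class WellFormedCheck and its isWellFormed method are hypothetical names introduced for illustration, not part of the tutorial's sample code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical helper for illustration only.
class WellFormedCheck {

    // Returns true if the text parses without a fatal (well-formedness) error.
    static boolean isWellFormed(String xml) {
        try {
            // Assumption: a JAXP-provided parser stands in for XMLReaderFactory.
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // The document is not well-formed.
            return false;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            // No usable parser, or a problem reading the input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<catalog><dvd code=\"_1\"/></catalog>"));
        System.out.println(isWellFormed("<catalog><dvd></catalog>")); // unclosed <dvd>
    }
}
```

No ContentHandler is registered here, so the parser's events are simply discarded; only the exception path matters. Depending on the parser implementation, a fatal error may also be echoed to standard error before the exception is thrown.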
Tip: Reuse the parser instance if possible. Creating a parser is expensive. If you have multiple threads running, you can reuse parser instances from a resource pool.

This is all well and good so far, but how does your application get events from the parser? I'm glad you asked.

Handling SAX events

To receive events from the parser, you implement the ContentHandler interface. This interface has a number of methods that you can implement to process your document. Alternatively, if you only want to handle one or two callbacks, you can subclass DefaultHandler, which implements all the ContentHandler methods (doing nothing), and override only the methods you need.

Either way, you write logic to do whatever processing you require upon receiving startElement, characters, endDocument, and other callback methods invoked by the SAX parser. You can see all the method calls from a document as they would occur on pages 351-355 of XML in a Nutshell, Third Edition (see Resources).

The callback events are the normal events from a document as it's being parsed. You can also handle validity callbacks by implementing an ErrorHandler. I'll discuss this topic after I go over validation, so stay tuned.

To learn more about parsing with SAX, check out Chapter 20 of XML in a Nutshell, Third Edition or read "Serial Access with the Simple API for XML (SAX)" (see Resources).

SAX parser exception handling

By default, the parser silently ignores validity errors. To take action upon an invalid or non-well-formed document, you must implement an ErrorHandler (note that DefaultHandler implements this as well as the ContentHandler interface) and define an error() method:

    public class SAXEcho extends DefaultHandler {
        ...
        //Handle validity errors
        public void error( SAXParseException e ) {
            echo( e.getMessage() );
            echo( "Line " + e.getLineNumber() +
                  " Column " + e.getColumnNumber() );
        }
    }

Then you must turn on the validation feature:

    parser.setFeature( "http://xml.org/sax/features/validation", true );

Finally, register the handler with the parser:

    parser.setErrorHandler( saxEcho );

Remember, parser is an instance of XMLReader (and saxEcho here is a SAXEcho instance). The parser calls the error() method if the document violates a schema (DTD or XML Schema) rule.
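
The pieces of this section can be pulled together in one small program. This is a sketch under stated assumptions rather than the tutorial's own SAXEcho class: a DefaultHandler subclass overrides a few ContentHandler callbacks plus error(), validation is switched on with the feature URI shown earlier, and the handler is registered for both roles. The class name TitleCollector, its fields, and the inline DTD are hypothetical (the series' dvd.dtd is not shown in this section). The parser is obtained through JAXP's SAXParserFactory, and the code uses Java 5 syntax, which is newer than the JDK 1.4.2 baseline.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects <title> text and records validity errors.
class TitleCollector extends DefaultHandler {
    // Assumed inline DTD for illustration; not the tutorial's dvd.dtd.
    static final String DOCTYPE =
          "<!DOCTYPE catalog [\n"
        + "<!ELEMENT catalog (dvd+)>\n"
        + "<!ELEMENT dvd (title, price)>\n"
        + "<!ATTLIST dvd code ID #REQUIRED>\n"
        + "<!ELEMENT title (#PCDATA)>\n"
        + "<!ELEMENT price (#PCDATA)>\n"
        + "]>\n";

    final List<String> titles = new ArrayList<String>();
    final List<String> validityErrors = new ArrayList<String>();
    private StringBuilder text; // non-null only while inside a <title> element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("title".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString().trim());
            text = null;
        }
    }

    // Called for validity errors; parsing continues because nothing is thrown.
    @Override
    public void error(SAXParseException e) {
        validityErrors.add("Line " + e.getLineNumber()
            + " Column " + e.getColumnNumber() + ": " + e.getMessage());
    }

    static TitleCollector parse(String xml) {
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setFeature("http://xml.org/sax/features/validation", true);
            TitleCollector handler = new TitleCollector();
            parser.setContentHandler(handler);
            parser.setErrorHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (Exception e) {
            // Fatal (well-formedness) errors and I/O problems end up here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = DOCTYPE + "<catalog>"
            + "<dvd code=\"_1234567\"><title>Terminator 2</title><price>19.95</price></dvd>"
            + "<dvd code=\"_7654321\"><title>The Matrix</title></dvd>" // no <price>
            + "</catalog>";
        TitleCollector result = parse(xml);
        System.out.println("Titles: " + result.titles);
        System.out.println("Validity errors: " + result.validityErrors);
    }
}
```

Because error() only records the problem, the parser keeps going after the second dvd element fails the content model, so both titles are still collected. Rethrowing the SAXParseException from error() would instead stop parsing at the first validity violation.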