WritingWriting XMLXML withwith JavaJava

• You don’t need to know any special like DOM or SAX or JDOM. All you have to know is how to System.out.println(). If you want to store your XML document in a file, you can use the FileOutputStream class ApplicationApplication DevelopmentDevelopment instead. withwith XMLXML andand JavaJava • We are going to develop a program that writes Fibonacci numbers into an XML document. • the principles of XML we will use here Lecture 3 will be much more broadly applicable Writing XML with Java to other, more complex systems. Introduction to Parsing • The key idea is that data arrives from some source, is encoded in XML, and is then output. Where the input comes from, whether an algorithm, a file, a network socket, user input, or some other source, really doesn’t concern us here. Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java FibonacciFibonacci NumbersNumbers FibonacciFibonacci NumbersNumbers

• As far as we know, the Fibonacci series • It’s very easy to calculate the was first discovered by Leonardo of Pisa Fibonacci sequence by computer. A around 1200 C.E. Leonardo “How many simple for loop will do. For example, pairs of rabbits are born in one year from this code fragment prints the first 40 one pair?” To solve his problem, Fibonacci numbers: Leonardo estimated that rabbits have a int low = 1; one month gestation period, and can first int high = 1; mate at the age of one month, so that for (int i = 1; i <= 40; i++) { each female rabbit has its first litter at two System.out.println(low); int temp = high; months. He made the simplifying high = high+low; assumption that each litter consisted of low = temp; } exactly one male and one female. • However, the Fibonacci numbers do • The rabbits aren’t so important, but the grow very large very quickly, and math is. Each integer in the series is exceed the bounds of an int shortly formed by the sum of the two previous before the fiftieth generation. integers. The first two integers in the Consequently, it’s better to use the series are 1 and 1.The Fibonacci series turns up in some very unexpected places java.math.BigInteger class instead including decimal expansions of π.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java FibonacciFibonacci NumbersNumbers ProducingProducing XMLXML documentdocument import java.math.BigInteger; public class FibonacciNumbers { 1 public static void main(String[] 1 args) { 2 BigInteger low = BigInteger.ONE; 3 BigInteger high = BigInteger.ONE; 5 for (int i = 1; i <= 10; i++) 8 { 13 System.out.println(low); 21 BigInteger temp = high; 34 high = high.add(low); 55 low = temp; } } } • To produce this, just add string • When run, this program produces the first ten Fibonacci numbers: literals for the and tags inside the print java FibonacciNumbers statements, as well as a few extra print 1 1 2 3 5 8 13 21 34 55 statements to produce the XML declaration and the root element start- and end-tags. XML documents are just text, and you can output them any way you’d output any other text document.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java JavaJava ExampleExample XMLXML--RPCRPC import java.math.BigInteger; • An XML-RPC request is just an XML public class FibonacciXML document sent over a network socket. { • We will write a very simple XML-RPC public static void main(String[] args) { client for a particular service. BigInteger low = BigInteger.ONE; – We will just ask the user for the data to BigInteger high = BigInteger.ONE; send to the server, wrap it up in some XML, and write it onto a URL object System.out.println(""); System.out.println(""); well. Since we haven’t yet learned how to read XML documents, Response will be for (int i = 0; i < 10; i++) { a text to System.out. Later we’ll pay more System.out.print(" "); attention to the response, and provide a System.out.print(low); nicer user interface. System.out.println(""); • The specific XML-RPC service we’re BigInteger temp = high; going to talk to is a Fibonacci high = high.add(low); generator. The request passes an int to low = temp; the server. The server responds with } the value of that Fibonacci number. System.out.println(""); } }

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java ExampleExample ExampleExample

• This request document asks for the • The client needs to read an integer input by value of the 23rd Fibonacci number: the user, wrap that integer in the XML-RPC envelope, then send that document to the server using HTTP POST. calculateFibonacci • In Java the simplest way to post a document is through the java.net.HttpURLConnection class. You get an instance of this class by calling a URL 23 object’s openConnection() method, and then casting the resulting URLConnection HttpURLConnection object to . • Once you have an HttpURLConnection, And here’s the response: you set its protected doOutput field to true using the setDoOutput() method and its request method to POST using the setRequestMethod() method. 28657< • Then you grab hold of an output stream /value> using getOutputStream() and proceed to write your request body on that.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java XMLXML ParsersParsers WhatWhat isis anan XMLXML ParserParser

ƒ What is a Parser? The parser is a software library (a Java class) that reads the XML document and checks it ƒ Defining Parser Responsibilities for well-formedness. Client applications use ƒ Evaluating Parsers method calls defined in the parser API to receive or request information the parser ƒ Validation retrieves from the XML document. – The parser shields the client application from ƒ Validating v. Nonvalidating Parsers all the complex and not particularly relevant ƒ XML Interfaces details of XML including: – Transcoding the document to Unicode ƒ Object v. Tree Based Interfaces – Assembling the different parts of a document ƒ Interface Standards: DOM, SAX divided into multiple entities. – Resolving character references ƒ Java XML Parsers – Understanding CDATA sections – Checking hundreds of well-formedness constraints – Maintaining a list of the namespaces in-scope on each element. – Validating the document against its DTD or schema – Associating unparsed entities with particular URLs and notations – Assigning types to attributes

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java TheThe BigBig PicturePicture

Java Application XML XML Or Parser Document Servlet WhatWhat isis anan XMLXML Parser?Parser?

An XML Parser enables your Java application or Servlet to more easily access XML Data.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java DefiningDefining ParserParser ThreeThree WhyWhy useuse anan XMLXML Parser?Parser? ResponsibilitiesResponsibilities 1. Retrieve and Read an XML ƒ If your application is going to use document XML, you could write your own ƒ For example, the file may reside on the local file system or on another web site. parser. ƒ The parser takes care of all the ƒ But, it makes more sense to use a necessary network connections and/or file connections. pre-built XML parser. ƒ This helps simplify your work, as you ƒ This enables you to do build your do not need to worry about creating application much more quickly. your own network connections. 2. Ensure that the document adheres to specific standards. Parser Evaluation ƒ Does the document match the DTD? ƒ When evaluating which XML ƒ Is the document well-formed? Parser to use, there are two very 3. Make the document contents important questions to ask: available to your application. ƒ Is the Parser validating or non- ƒ The parser will parse the XML validating? document, and make this data available to your application. ƒ What interface does the parser provide to the XML document?

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java XMLXML ValidationValidation PerformancePerformance andand MemoryMemory

ƒ Validating Parser ƒ Questions: ƒ Which parser will have better performance? ƒ a parser that verifies that the XML ƒ Which parser will take up less memory? document adheres to the DTD. ƒ Validating parsers: ƒ Non-Validating Parser ƒ more useful ƒ a parser that does not check the ƒ slower DTD. ƒ take up more memory ƒ Lots of parsers provide an option ƒ Non-validating parsers: ƒ less useful to turn validation on or off. ƒ faster ƒ take up less memory ƒ Therefore, when high performance and low-memory are the most important criteria, use a non-validating parser. ƒ Examples: ƒ Java applets ƒ Palm Pilot Applications ƒ Huge XML Documents

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java XMLXML InterfacesInterfaces SampleSample XMLXML DocumentDocument

ƒ Broadly, there are two types of interfaces provided by XML Parsers: ƒ Object/Tree Interface Object / Tree Interface ƒ Definition: Parser reads the 20 XML document, and creates an 12 in-memory “tree” of data. ƒ For example: ƒ Given a sample XML document on the next slide, what kind of tree would be produced?

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java EventEvent BasedBased ParserParser

Weather On Object Tree for a sample XML ƒ Definition: Parser reads the Document. The tree represents XML document, and generates the hierarchy of the XML document. events for each parsing event. City ƒ For example: ƒ Given the same XML document, Hi Note the Text what kind of events would be Nodes produced?

Text: 20

Lo

Text: 12

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java XMLXML ParsingParsing EventsEvents EventEvent BasedBased InterfaceInterface

ƒ Events generated: ƒ For each of these events, the 1. Start of Element your application implements 2. Start of Element “event handlers.” 3. Start of Element ƒ Each time an event occurs, a 4. Character Event: 20 different event handler is called. 5. End of Element ƒ Your application intercepts these 6. Start of Element events, and handles them in any 7. Character Event: 12 way you want. 8. End of Element 9. End of Element 10. End of Element

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java PerformancePerformance andand MemoryMemory XMLXML InterfaceInterface StandardsStandards

ƒ Questions: ƒ Standards are important: ƒ Which parser is faster? ƒ Easier to create XML applications ƒ Which parser takes up less memory? ƒ Tree based: ƒ You can swap parsers as your ƒ slower application evolves. ƒ takes up more memory ƒ There are two main XML ƒ Event based: Interface standards: ƒ faster ƒ Tree Based: Document Object ƒ takes up much less memory Model (DOM) ƒ Therefore, when high performance and ƒ Event Based: Simple API for low-memory are the most important XML (SAX) criteria, use an event-based parser. ƒ Examples: ƒ Java applets ƒ Palm Pilot Applications ƒ Parsing Huge Data files

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java DOMDOM // SAXSAX SAXSAX

DOM ƒ Simple API for XML ƒ ƒ Event Based ƒ Tree Based Interface ƒ Developed by the W3C ƒ Developed by volunteers on the ƒ Supports both XML and HTML xml-dev mailing list. ƒ Originally specified using an IDL (Interface ƒ http://saxproject.org Definition Language). ƒ Hence, DOM Versions exist for Java, ƒ More on this next lecture… JavaScript, C++, Perl, Python. ƒ In this course, we will be studying JDOM (which is similar to DOM.)

SAX ƒ Simple API for XML ƒ Event Based ƒ Developed by volunteers on the xml-dev mailing list. ƒ http://saxproject.org

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java JavaJava XMLXML ParsersParsers JavaJava XMLXML ParsersParsers

ƒ ƒ For a full list of XML Parsers, go ƒ Validating and Non-validating to Options http://www.xmlsoftware.com/p ƒ Supports DOM and SAX. arsers ƒ http://xml.apache.org ƒ Note that XML Parsers also exist for lots of other languages: C/C++, JavaScript, Python, Perl, etc. ƒ Most parsers support both DOM and SAX, and most have options for turning validation on or off.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java OutputOutput WithoutWithout MarkupMarkup ParsingParsing KnownKnown XMLXML

• The clients for the XML-RPC server we Talking to specific server you know details have seen simply printed the entire about the server like: document on the console. Now we will – the root element is methodResponse. extract just the answer and strip out all – the methodResponse element contains a the markup. single params element that in turn contains a param element. • That is, the user interface will look – the param element contains a single value something like this: element. C:\XMLJAVA>java FibonacciClient 9 – the value element contains a single double 34 element that in turn contains a string • From the user’s perspective, the XML is representing a double. If the server returns a completely hidden. The user neither value with a type other than double, you’d probably respond by throwing an exception, knows nor cares that the request is being just as you would if a local method you sent and the response received in an expected to return a Double instead XML document. returned a String. • In fact, the user may not even know that All of this is specified by the XML-RPC the request is being sent over the specification. If any of this is violated in the network rather than being processed response you get back from the server, then locally. All the user sees is the very basic that server is not sending correct XML-RPC. command line interface. You’d probably respond to this by throwing an exception.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java ParsingParsing KnownKnown XMLXML ExampleExample

• The main point is this: most programs you • The readFibonacciXMLRPCResponse() write are going to read documents written method in the first example does exactly in a specific XML vocabulary. this by • They are not going to be designed to – reading the entire XML document into a handle absolutely any well-formed StringBuffer document that comes down the pipe. – converting the buffer to a String, • However, you do need to make some – using the indexOf() and substring() methods assumptions about the format of your to extract the desired information. documents before you can reasonably – The main() method connects to the server process them. using the URL and URLConnection classes. • It’s simple enough to hook up an – sends a request document to the server using InputStreamReader to the document, and the OutputStream and OutputStreamWriter read it out. classes. public printXML(InputStream xml) { int – passes InputStream containing the response c; while ((c = xml.read()) != -1) System.out.write(c); } XML document to the • To actually extract the information you readFibonacciXMLRPCResponse() method. need to determine which pieces of the input you actually want and separate those out from all the rest of the text.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java WhatWhat cancan gogo wrongwrong XMLXML ParserParser

• This stream- and string-based solution is • Straight text parsing is not the appropriate tool far from robust. In particular, it will fail if: with which to navigate an XML document. – The document returned is encoded in UTF- • The structure and semantics of an XML 16 instead of UTF-8 document is encoded in the document’s markup, – An earlier part of the document contains the its tags and its attributes; and we need a tool that text “”, even in a is designed to recognize and understand this comment. structure as well as reporting any possible errors – The response is written with line breaks in this structure. between the value and double tags like this: 28657 This tool is called an XML parser. • The parser is a software library (in Java it’s a – There’s extra white space inside the double class) that reads the XML document and checks tags like this: it for well-formedness. 28657 • One of the original goals of XML was that it be simple enough that a DPH be able to write an • And this is a simple example where we XML parser. The exact interpretation of this just want one piece of data that’s clearly requirement varied from person to person. On marked up. The more data you want from one extreme, the DPH was assumed to be a Web an XML document, and the more designer without any formal training in complex and flexible the markup, the programming who was going to do it in a harder it is to find using basic string weekend. On the other the DPH was assumed matching or even the regular expressions to be Larry Wall and he was allowed two months introduced in Java 1.4. for the task. The middle ground was a smart grad student and a couple of weeks. Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java XMLXML ParserParser ChoosingChoosing anan APIAPI

• The parser shields the client application • There are two major standard APIs for from all the complex and not particularly processing XML documents with Java, relevant details of XML including: the Simple API for XML (SAX) and the – Transcoding the document to Unicode – Assembling the different parts of a Document Object Model (DOM) document divided into multiple entities. • In addition there are a host of other, – Resolving character references – Understanding CDATA sections somewhat idiosyncratic APIs including, – Checking hundreds of well-formedness JDOM, , ElectricXML, and constraints XMLPULL. – Maintaining a list of the namespaces in- scope on each element. • Each specific parser generally has a native – Validating the document against its DTD or schema API that it exposes below the level of the – Associating unparsed entities with particular standard APIs. For instance, the Xerces URLs and notations – Assigning types to attributes parser has the Xerces Native Interface • The most important decision you'll make • Each of these APIs has its own strengths at the start of an XML project is the application programming interface (API) and weaknesses. you'll use. Many APIs are implemented by multiple vendors, so if the specific parser gives you trouble you can swap in an alternative, often without even recompiling your code. Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java ChoosingChoosing aa ParserParser ChoosingChoosing aa ParserParser

• When choosing a parser library many • Parsers also differ in their support for factors come into play. These include subsequent specifications and what features the parser has, how much technologies. All parsers worth it costs, which APIs it implements, how considering support namespaces and buggy it is, and last and certainly least automatically check for namespace well- how fast the parser parses. formedness as well as XML 1.0 well- Features formedness. – Parsers can be roughly divided into three categories: • Xerces and Oracle are Java parsers that – Fully validating parsers support schema validation – Parsers that do not validate, but do read the • Some parsers also provide extra external DTD information not required for normal – Parsers that read only the internal DTD subset and do not validate. XML parsing. For instance, at your – There’s also a fourth category of parsers that request, Xerces can inform you of the reads the instance document but do not ELEMENT, ATTLIST, and ENTITY perform all the mandated well-formedness declarations in the DTD. Crimson will checks. Technically such parsers are not allowed by the XML specification, but there not do this, so if you need to read the are still a lot of them. DTD you’d pick Xerces over Crimson.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java ChoosingChoosing aa parserparser ChoosingChoosing aa parserparser

API support Correctness • Most of the major parsers support both • An often overlooked criterion for SAX and DOM. ( Xerces and Crimson). choosing a parser is correctness, how • The other APIs including JDOM and much of the relevant specifications are dom4j generally don’t provide parsers of implemented how well. Although no their own. Instead they use an existing parser is perfect, some parsers are SAX or DOM parser to read a document definitely more reliable than others. which they then convert into their own • the fact is that every single parser has tree model. Thus they can work with any sooner or later exhibited significant convenient parser. The notable exception conformance bugs. Most of the time here is ElectricXML which does include these fall into two categories: its own built-in parser. – Reporting correct constructs as errors License – Failing to report incorrect syntax. • One often overlooked consideration • Unnecessarily rejecting well-formed when choosing a parser is the license documents prevents you from handling under which the parser is published. data others send you. Most parsers are free however, license • If a parser fails to report incorrect XML documents, will cause problems for restrictions can still get in the way. people and systems who receive the malformed documents.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java ChoosingChoosing aa ParserParser AvailableAvailable ParsersParsers

Efficiency Xerces • The last consideration is efficiency, how • This is a very complete, validating parser fast the parser is and how much memory that has the best conformance to the it uses. XML 1.0 and Namespaces in XML • Efficiency should be your last concern specifications. when choosing a parser. As long as you • It fully supports the SAX2 and DOM use standard APIs and keep parser- Level 2 APIs, as well as JAXP, dependent code to a minimum, you can • Xerces-J is highly configurable and always change the underlying parser later . suitable for almost any parsing needs. • The speed of parsing tends to be • Xerces-J is also notable supporting the dominated by I/O considerations. If the W3C XML Schema Language. XML document is served over the • The Apache XML Project publishes network, it’s entirely possible that the Xerces-J under the very liberal open speed with which data can move over the source Apache license. network is the bottleneck, not the XML • Essentially, you can do anything you like parsing at all. with it except use the Apache name in • In situations where the XML is being read your own advertising. from the disk, the time to read the data can still be significant even if it’s not quite • Xerces-J 1.x was based on IBM’s XML the bottleneck it is in network for Java parser, whose code base IBM applications. donated to the Apache XML Project.

Kosmas Kosmopoulos Application development with XML and Java Kosmas Kosmopoulos Application development with XML and Java AvailableAvailable ParsersParsers

Crimson • Crimson, previously known as Java Project X, is the parser Sun bundles with the JDK 1.4. • Crimson supports more or less the same APIs and specifications Xerces does— SAX2, DOM2, JAXP, XML 1.0, Namespaces in XML, etc. —with the notable exception of schemas. Piccolo • Yuval Oren’s Piccolo is the latest entry into the parser arena. It is a very small, very fast, open source, non-validating XML parser. However, it does read the external DTD subset in order to apply default attribute values and resolve external entity references. Piccolo supports the SAX API exclusively. It does not have a DOM implementation.

Kosmas Kosmopoulos Application development with XML and Java