11/8/10

Program for today

• Introduction to XML
• Parsing XML
• Writing XML

XML and Java: XML Parsing

ICW 2010 Lecture 11

Marco Cova, Internet Computing Workshop

Extensible Markup Language (XML)

• We know how to store data on the server
  – Relational database, SQL, and JDBC
• We know how to present data to users
  – HTML, CSS, and JavaScript
• We know how to send data over the network
  – Sockets
  – HTTP
• We'll see how you can handle user requests and send back appropriate responses over HTTP
  – Servlets and JSP (up next!)
• What if data has to be processed by another piece of software instead of being displayed to a human user?

• How can we send todo items from users to the server?
  – Plain text?
  – HTML?
• It would be easy if we had tags to describe the items we exchange (todos, in this case)
• XML allows us to do that
  – Feeds: RSS, ATOM
  – RPC: SOAP

[Diagram: Alice's todos, Bob's todos, and Charlie's todos are sent to the Todo server]


Example: RSS

[RSS feed excerpt; only the following values are recoverable]
• Generator: http://wordpress.org/?v=2.7.1
• Language: en
• Item title: Computer Science student named top 100 graduate in the UK
• Item link: http://www.cs.bham.ac.uk/sys/news/content/2010/09/02/computer-science-student-named-top-100-graduate-in-the-uk/
• Item publication date: Thu, 02 Sep 2010 16:51:26 +0100

Well-formed, valid XML

• Well-formed document
  – Elements nest properly
• Valid document
  – Well-formed
  – Has an associated document type declaration
  – Complies with the constraints of the document type declaration
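One way to make the well-formedness rule concrete is to hand a document to a parser and see whether it is accepted. The following is a minimal sketch, not part of the original slides: the class name `WellFormedCheck` and the sample strings are hypothetical, and it checks well-formedness only (the parser is non-validating by default).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, introduced only for this illustration.
class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler so errors are not printed to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input (e.g., improperly nested elements).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Elements nest properly.
        System.out.println(isWellFormed("<todo><title>slides</title></todo>"));
        // Close tags are in the wrong order.
        System.out.println(isWellFormed("<todo><title>slides</todo></title>"));
    }
}
```

Checking validity as well would additionally require a document type declaration and a validating parser.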


Reading XML documents

Parsing approaches:
• Tree-based: an object representation of the XML document is stored in memory (as a tree)
  – +: easy to manipulate the document
  – -: high memory requirement
• Event-based: as parts of the XML document are read, events are generated (e.g., element started)
  – +: low memory requirement
  – -: cannot easily manipulate the document

Tree-based parsing: DOM

[Diagram: XML file (a todo item with title "slides", date 2010-09-30, and flag "y") → DocumentBuilderFactory → DocumentBuilder → DOM Document, accessed through the Java DOM API]


Tree-based parsing: DOM

// instantiate the factory
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
// instantiate the document builder
DocumentBuilder db = dbf.newDocumentBuilder();
// parse the file
Document doc = db.parse(new File("todo.xml"));

Note:
• By default, the parser is not validating
  – See dbf.setValidating()
• Various options to tweak the generated tree, e.g.:
  – Whitespace
  – CDATA coalescing
  – Comments

DOM use

Node types
• Attr
• Comment
• Document
• Element
• Text
• Less common: CDATASection, CharacterData, DocumentFragment, DocumentType, Entity, EntityReference, Notation, ProcessingInstruction

Tree navigation
• Node.getChildNodes
• Node.getFirstChild
• Node.getLastChild
• Document.getDocumentElement
• getElementsByTagName
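The DOM setup and navigation calls listed above can be combined into a small runnable sketch. It parses from a string rather than from todo.xml so that it is self-contained. The class name `DomNavigation`, the `due` element, and the sample data are assumptions made for illustration; only `todo` and `title` appear on the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class, introduced only for this illustration.
class DomNavigation {

    // Parses XML text into a DOM tree held in memory.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Two of the tree-tweaking options: drop comments, merge CDATA into text.
            dbf.setIgnoringComments(true);
            dbf.setCoalescing(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects the text of every <title> element, in document order.
    static List<String> titles(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("title");
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample document; the element names are assumptions.
        String xml = "<todos>"
                + "<todo><title>slides</title><due>2010-09-30</due></todo>"
                + "<todo><title>exam</title></todo>"
                + "</todos>";
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName());
        System.out.println(root.getChildNodes().getLength());
        System.out.println(root.getFirstChild().getNodeName());
        System.out.println(titles(doc));
    }
}
```

Note that whitespace between elements would show up as extra Text children, which is why the sample string contains none.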


Event-based parsing

• The parser produces notifications in correspondence of certain events (e.g., element started, element closed)
• 2 approaches:
  – Push-based
    • Once the document is read, all events must be handled as the parser delivers them
    • Simple API for XML (SAX)
  – Pull-based
    • The program controls the parsing (start, pause, resume)
    • Streaming API for XML (StAX)

SAX

[Diagram: XML file (a todo item with title "slides", date 2010-09-30) → SAXParserFactory → SAXParser → EventHandler, which receives events such as "todo el. start", "title el. start", "todo el. end"]



SAX

Parse setup

// instantiate the factory
SAXParserFactory factory = SAXParserFactory.newInstance();
// instantiate the parser
SAXParser sp = factory.newSAXParser();
// parse the document
sp.parse(new File("todo.xml"), new MyEventHandler());

Event handling

class MyEventHandler extends DefaultHandler {
  // invoked when close tag found
  public void endElement(String uri, String localName, String qName) { }

  // invoked when start tag found
  public void startElement(String uri, String localName, String qName, Attributes attributes) { }

  // invoked when text is found
  public void characters(char[] ch, int start, int length) { }
}

StAX

[Diagram: XML file (a todo item with title "slides", date 2010-09-30) → factory → parser → Program; the program calls "next" and receives events such as "todo el. start"]
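To compare the push (SAX) and pull (StAX) styles side by side, here is a minimal sketch that extracts the todo titles both ways. The class name `EventParsing`, its method names, and the sample document are hypothetical; the StAX half uses `getElementText()` as a shortcut, which assumes `title` contains text only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, introduced only for this illustration.
class EventParsing {

    // Push style (SAX): the parser calls back into this handler.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder text; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equals("title")) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                titles.add(text.toString());
                text = null;
            }
        }
    }

    static List<String> titlesWithSax(String xml) {
        try {
            SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
            TitleHandler handler = new TitleHandler();
            sp.parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Pull style (StAX): the program asks the reader for the next event.
    static List<String> titlesWithStax(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int et = reader.next();
                if (et == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("title")) {
                    // Reads the text up to the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<todos><todo><title>slides</title></todo>"
                + "<todo><title>exam</title></todo></todos>";
        System.out.println(titlesWithSax(xml));
        System.out.println(titlesWithStax(xml));
    }
}
```

Both methods should produce the same list. The difference is in who drives: with SAX the state has to live in the handler between callbacks, while with StAX the loop itself decides when to advance.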


StAX

Parse setup

// instantiate the factory
XMLInputFactory factory = XMLInputFactory.newFactory();
// instantiate the reader
XMLStreamReader reader = factory.createXMLStreamReader(new FileReader("todo.xml"));

Event handling

while (reader.hasNext()) {
  // get event type
  int et = reader.getEventType();

  // handle START_ELEMENT
  if (et == XMLStreamConstants.START_ELEMENT) {
    …
  }

  // handle other event types
  …

  // get next event
  reader.next();
}

Modifying XML documents

• Add element
  – Document.createElement()
  – Element.appendChild
• Add attributes
  – Element.setAttribute(attr, value)
• Remove element
  – Element.removeChild
• Remove attribute
  – Element.removeAttribute
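The DOM modification calls listed above can be exercised on a small in-memory tree. This is a sketch under stated assumptions: the class `DomModify`, its helper methods, and the `id` attribute are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class, introduced only for this illustration.
class DomModify {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Adds <todo id="..."><title>...</title></todo> under the root element.
    static Element addTodo(Document doc, String id, String title) {
        Element todo = doc.createElement("todo");
        todo.setAttribute("id", id); // "id" is an assumed attribute name
        Element t = doc.createElement("title");
        t.appendChild(doc.createTextNode(title));
        todo.appendChild(t);
        doc.getDocumentElement().appendChild(todo);
        return todo;
    }

    // Removes the first <todo> with the given id; returns false if none matches.
    static boolean removeTodo(Document doc, String id) {
        NodeList todos = doc.getElementsByTagName("todo");
        for (int i = 0; i < todos.getLength(); i++) {
            Element e = (Element) todos.item(i);
            if (id.equals(e.getAttribute("id"))) {
                e.getParentNode().removeChild(e);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse("<todos/>");
        Element first = addTodo(doc, "1", "slides");
        addTodo(doc, "2", "exam");
        first.removeAttribute("id"); // remove attribute
        removeTodo(doc, "2");        // remove element
        System.out.println(doc.getElementsByTagName("todo").getLength());
        System.out.println(first.hasAttribute("id"));
    }
}
```

New nodes are always created through the owning Document and only become part of the tree once appended to a parent.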


Writing XML documents

We could simply do:

out = System.out;
out.println("");
out.println("");
…
for (Item e : items) {
  out.println("");
  out.println(" <title>" + e.title + "");
  ..
}
out.write("");

But:
• It's hard to read and maintain
• It's easy to make mistakes
  – Can you spot the error?

Writing XML documents – cont'd

• A better approach: serialize a DOM object
• Two techniques
  – … on current element
  – Leverage the transformation mechanism: … transformation that saves its result in a file (see code example)
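As an illustration of the transformation-based technique, the sketch below builds a small DOM tree and serializes it with a default `Transformer` (created with no stylesheet, so it copies its input to its output). The class `DomWriter` and its method names are hypothetical, and the result is written to a string to keep the example self-contained; passing `new StreamResult(new File(...))` instead would save it to a file.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class, introduced only for this illustration.
class DomWriter {

    // Builds <todos><todo><title>...</title></todo>...</todos> in memory.
    static Document buildTodos(String... titles) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("todos");
            doc.appendChild(root);
            for (String t : titles) {
                Element todo = doc.createElement("todo");
                Element title = doc.createElement("title");
                title.appendChild(doc.createTextNode(t));
                todo.appendChild(title);
                root.appendChild(todo);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the tree; a Transformer created without a stylesheet
    // copies the source to the result unchanged.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The second title contains a character that must be escaped in XML.
        System.out.println(serialize(buildTodos("slides", "a < b")));
    }
}
```

Unlike hand-written println calls, the serializer closes every tag it opens and escapes special characters in text, which removes two common sources of mistakes.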



Writing XML documents – cont'd

writeNode(doc.getDocumentElement(), fout);

void writeNode(Node n, FileWriter fout) throws IOException {
  if (n instanceof Element)
    writeElement((Element) n, fout);
  else if (n instanceof Text)
    writeText((Text) n, fout);
  …
}

// write a text node
void writeText(Text t, FileWriter fout) throws IOException {
  fout.write(t.getNodeValue());
}

void writeElement(Element e, FileWriter fout) throws IOException {
  String n = e.getNodeName();
  // open tag
  fout.write("<" + n + ">");
  // handle children
  NodeList c = e.getChildNodes();
  for (int i = 0; i < c.getLength(); i++) {
    writeNode(c.item(i), fout);
  }
  // close tag
  fout.write("</" + n + ">");
}

Charset, character encodings

• Character set: set of characters
• Coded character set: character set in which every letter is mapped to a number (code point)
• Character encoding: specifies how coded characters are mapped to bytes

Courtesy of http://www.w3.org/International/articles/definitions-characters/
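The distinction between a code point and its encoded bytes can be seen directly in Java. The following sketch (the class `EncodingDemo` and the helper `byteLength` are hypothetical names) takes the single character é, whose code point is U+00E9, and encodes it with three different character encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class, introduced only for this illustration.
class EncodingDemo {

    // Number of bytes needed to represent s in the given encoding.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // the letter e with an acute accent

        // Coded character set: the character is mapped to a number.
        System.out.println(s.codePointAt(0)); // prints 233 (0xE9)

        // Character encoding: the same code point becomes different bytes.
        System.out.println(byteLength(s, StandardCharsets.ISO_8859_1)); // prints 1
        System.out.println(byteLength(s, StandardCharsets.UTF_8));      // prints 2
        System.out.println(byteLength(s, StandardCharsets.UTF_16BE));   // prints 2
    }
}
```

This is why a writer and a reader of an XML file have to agree on the encoding: the code point is the same, but the bytes on disk are not.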

Namespaces

• Certain elements and attributes define a vocabulary that may be reused in multiple documents
  – E.g., the "Dublin Core Metadata Element Set" defines 15 general properties, such as creator, title, etc.

Namespaces – cont'd

Solution: XML namespace
• Elements are specified ("fully qualified") by namespace prefix and name
  dc:title
• Namespace prefix is associated to a namespace name
  xmlns:dc=http://purl.org/dc/elements/1.1/
• Document declares the bindings it will use:

• XML documents may wish to combine several