XML and Java XML Parsing
ICW 2010 Lecture 11
Marco Cova, Internet Computing Workshop
11/8/10

Program for today
• Introduction to XML
• Parsing XML
• Writing XML

Extensible Markup Language (XML)
• We know how to store data on the server
  – Relational database, SQL, and JDBC
• We know how to present data to users
  – HTML, CSS, and JavaScript
• We know how to send data over the network
  – Sockets
  – HTTP
• We'll see how you can handle user requests and send back appropriate responses over HTTP
  – Servlets and JSP (up next)
• What if data has to be processed by another piece of software instead of being displayed to a human user?

Extensible Markup Language (XML)
• How can we send todo items from users to the server?
  – Plain text?
  – HTML?
• It would be easy if we had tags to describe the items we exchange (todos, in this case)
• XML allows us to do that
  – Feeds: RSS, ATOM
  – RPC: SOAP
[Figure: Alice's todos, Bob's todos, and Charlie's todos are sent to the Todo server]

Example: RSS

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title> » School News</title>
        <link>http://www.cs.bham.ac.uk/sys/news/content</link>
        <description>School of Computer Science, The University of Birmingham</description>
        <pubDate>Tue, 21 Sep 2010 12:08:34 +0000</pubDate>
        <generator>http://wordpress.org/?v=2.7.1</generator>
        <language>en</language>
        <item>
          <title>Computer Science student named top 100 graduate in the UK</title>
          <link>http://www.cs.bham.ac.uk/sys/news/content/2010/09/02/computer-science-student-named-top-100-graduate-in-the-uk/</link>
          <pubDate>Thu, 02 Sep 2010 16:51:26 +0100</pubDate>
          <category><![CDATA[School News]]></category>
          <description><![CDATA[Computer Science graduate…]]></description>
        </item>
      </channel>
    </rss>

Well-formed, valid XML
• Well-formed document
  – Starts with prolog, e.g., <?xml version="1.0" encoding="UTF-8"?>
  – Has one root element, e.g., <rss version="2.0">
  – Elements nest properly
• Valid document
  – Well-formed
  – Has an associated document type declaration
  – Complies with the constraints of the document type declaration

Reading XML documents
Parsing approaches:
• Tree-based: an object representation of the XML document is stored in memory (as a tree)
  – +: easy to manipulate the document
  – -: high memory requirement
• Event-based: as parts of the XML document are read, events are generated (e.g., element started)
  – +: low memory requirement
  – -: cannot easily manipulate the document

Tree-based parsing: DOM
[Figure: XML file → DocumentBuilderFactory → DocumentBuilder → DOM (Java DOM API)]
Sample XML file:

    <todo>
      <title>slides</title>
      <due>2010-09-30</due>
      <done>y</done>
    </todo>

Tree-based parsing: DOM

    // instantiate the factory
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // instantiate the document builder
    DocumentBuilder db = dbf.newDocumentBuilder();
    // parse the file
    Document doc = db.parse(new File("todo.xml"));

Note:
• By default, the parser is not validating
  – See dbf.setValidating()
• Various options to tweak the generated tree, e.g.:
  – Whitespace
  – CDATA coalescing
  – Comments

DOM use
Node types
• Attr
• Comment
• Document
• Element
• Text
• Less common: CDATASection, CharacterData, DocumentFragment, DocumentType, Entity, EntityReference, Notation, ProcessingInstruction
Tree navigation
• Node.getChildNodes
• Node.getFirstChild
• Node.getLastChild
• Document.getDocumentElement
• getElementsByTagName

Event-based parsing
• The parser produces notifications when certain events occur (e.g., element started, element closed)
• 2 approaches:
  – Push-based
    • Once parsing starts, the application must handle every event the parser delivers
    • Simple API for XML (SAX)
  – Pull-based
    • The application controls the parsing (start, pause, resume)
    • Streaming API for XML (StAX)

SAX
[Figure: XML file → SAXParserFactory → SAXParser → EventHandler; for the sample todo file the handler receives "todo el. start", "title el. start", …, "todo el. end"]

SAX
Parse setup

    // instantiate the factory
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // instantiate the parser
    SAXParser sp = factory.newSAXParser();
    // parse the document
    sp.parse(new File("todo.xml"), new MyEventHandler());

Event handling

    class MyEventHandler extends DefaultHandler {
        // invoked when a close tag is found
        public void endElement(String uri, String localName, String qName) { }

        // invoked when a start tag is found
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) { }

        // invoked when character data (text) is found
        public void characters(char[] ch, int start, int length) { }
    }

StAX
[Figure: XML file → parser → Program; the program asks for "next" and receives an event such as "todo el. start"]

StAX
Parse setup

    // instantiate the factory
    XMLInputFactory factory = XMLInputFactory.newFactory();
    // instantiate the reader
    XMLStreamReader reader =
        factory.createXMLStreamReader(new FileReader("todo.xml"));

Event handling

    while (reader.hasNext()) {
        // get event type
        int et = reader.getEventType();
        // handle START_ELEMENT
        if (et == XMLStreamConstants.START_ELEMENT) {
            …
        }
        // handle other event types
        …
        // get next event
        reader.next();
    }

Modifying XML documents
• Add element
  – Document.createElement()
  – Element.appendChild
• Add attributes
  – Element.setAttribute(attr, value)
• Remove element
  – Element.removeChild
• Remove attribute
  – Element.removeAttribute

Writing XML documents
We could simply do:

    out = System.out;
    out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    out.println("<rss version=\"2.0\">");
    …
    for (Item e : items) {
        out.println("<item>");
        out.println("  <title>" + e.title + "</title>");
        ..
    }
    out.write("</rss>");

But:
• It's hard to read and maintain
• It's easy to make mistakes
  – Can you spot the error?

Writing XML documents – cont'd
• A better approach: serialize a DOM object
• Two techniques
  – Visit the DOM tree and produce output depending on the current element
  – Leverage the transformation mechanism (javax.xml.transform) to apply an identity transformation that saves its result in a file (see code example)

Writing XML documents – cont'd

    writeNode(doc.getDocumentElement(), fout);

    // dispatch on the node type
    void writeNode(Node n, FileWriter fout) throws IOException {
        if (n instanceof Element)
            writeElement((Element) n, fout);
        else if (n instanceof Text)
            writeText((Text) n, fout);
        …
    }

    // write a text node
    void writeText(Text t, FileWriter fout) throws IOException {
        fout.write(t.getNodeValue());
    }

    void writeElement(Element e, FileWriter fout) throws IOException {
        String n = e.getNodeName();
        // open tag
        fout.write("<" + n + ">");
        // handle children
        NodeList c = e.getChildNodes();
        for (int i = 0; i < c.getLength(); i++) {
            writeNode(c.item(i), fout);
        }
        // close tag
        fout.write("</" + n + ">");
    }

Charset, character encodings
• Character set: a set of characters
• Coded character set: a character set in which every letter is mapped to a number (code point)
• Character encoding: specifies how coded characters are mapped to bytes
Courtesy of http://www.w3.org/International/articles/definitions-characters/

Namespaces
• Certain elements and attributes define a vocabulary that may be reused in multiple documents
  – E.g., the "Dublin Core Metadata Element Set" defines 15 general properties, such as creator, title, etc.
• XML documents may wish to combine several vocabularies
• Issue: collisions
  – E.g., both Dublin Core and Atom define a title element

Namespaces – cont'd
Solution: XML namespace
• Elements are specified ("fully qualified") by namespace prefix and name
    dc:title
• A namespace prefix is associated with a namespace name
    xmlns:dc="http://purl.org/dc/elements/1.1/"
• The document declares the bindings it will use:
    <rss version="2.0"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:atom="http://www.w3.org/2005/Atom">
• When using an element, use its fully qualified name:
    <dc:title>Title</dc:title>

Escaping
• What if we want to write a todo item such as

    <todo>
      <title>add new field <category></title>
    </todo>

• <category> should be interpreted as simple text, but the parser will consider it to be a tag and will raise an error (Why?)
• Solution: CDATA section
  – Forces the parser to consider its content as text (not markup)
  – <![CDATA[ … ]]>

    <todo>
      <title>add new field <![CDATA[<category>]]></title>
    </todo>

References
• Extensible Markup Language (XML) 1.0, http://www.w3.org/TR/xml/
• Introducing Character Sets and Encodings, http://www.w3.org/International/articles/definitions-characters/
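Code example (sketch): the slides on DOM parsing, modifying documents, and writing XML mention an identity transformation but do not include the code. The following is one possible way to put those steps together, not necessarily the example used in the lecture. It parses the sample todo document from a string, reads the title, adds a child element and an attribute, and serializes the tree with javax.xml.transform. The class name TodoRoundTrip, the helper names, and the "edited" attribute are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class TodoRoundTrip {

    // Parse an XML string into a DOM tree (non-validating, which is the default).
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    // Append a <category> child holding the given text to the root element,
    // and mark the root with a hypothetical "edited" attribute.
    public static void addCategory(Document doc, String text) {
        Element root = doc.getDocumentElement();
        Element category = doc.createElement("category");
        category.setTextContent(text); // stored as plain text, not markup
        root.appendChild(category);
        root.setAttribute("edited", "true");
    }

    // Serialize the DOM tree with an identity transformation:
    // a Transformer created without a stylesheet copies input to output.
    public static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<todo><title>slides</title><due>2010-09-30</due><done>y</done></todo>");

        // Tree navigation: look up the first <title> element and read its text.
        String title = doc.getElementsByTagName("title").item(0).getTextContent();
        System.out.println("title = " + title);

        // The text contains markup characters; the serializer escapes them.
        addCategory(doc, "add new field <category>");
        System.out.println(serialize(doc));
    }
}
```

Because the serializer escapes markup characters in text nodes, this approach avoids both the hand-written println mistakes and the escaping problem from the "Escaping" slide. To save to a file instead of a string, a StreamResult built from a java.io.File can be passed to transform.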