XML Error Handling
Theories of Errors
Bijan Parsia
COMP60372
Feb. 19, 2010
Monday, 1 March 2010

Finishing Types
Since we are gluttons for punishment

Recall
• We now have a theory of matching
  – i.e., we know what it is for a value to match a type
  – e.g., a simple value matches xs:string iff it’s a string
  – Matching was complex for elements
• But matching isn’t our key service
  – validation is
• Two conceptions of validation
  – Grammar based recognition
    • Validate as instance of some DTD
    • Validate “against” some DTD
    • Determine if valid wrt some DTD
  – PSVI production
    • Go from an untyped value (or its string rep) to a typed one
    • validation and erasure

Validation (subset)
Compare with matching!

Erasure
A complexity!
    integer-of-string("01") = integer-of-string("1")
Wildcard info lost!

Validation & Erasure
• Features of “external representation”
  – Self-describing and round-tripping
• Round-tripping failure comes from cases where
  – erases is a relation (trivial)
    • “01” to 1 to “1”
  – erases obliterates type (Self-description failure!)
    • {“1”, 1} to “1 1” to {1, 1} (or {“1”, “1”})

Coursework Retrospective
• Some Tricky Bits™
  – No one expects the Spanish Inquisition
    • No one!
  – minidtdx.xsd describes the syntax of (mini)DTD/XML
    • It describes an XML syntax for a fragment of DTDs
    • It does not describe the semantics!
      – Esp. not by example!
    • It is incomplete
      – It is not the tightest schema possible
      – Questionable syntax choices?
        » Repetition in choice
• XML Schema in different modes
  – Validate where?
  – Xerces Has A Bug
    • nillable="false"

Error Reporting
• What’s wrong with...
    declare function ssd:convertDeclarationOrComment($dec) {
      validate { typeswitch ($dec)
        …..
        case element(ref) return
Error reported: Severity: fatal. Description: Attribute @ref is not allowed on element <element>.
The name is in one of the disallowed namespaces for the wildcard
          <xs:element ref="$dec/@ref"/>
        ….
        default return...
      }
    };

    declare function ssd:convertDeclarationOrComment($dec) {
      validate { typeswitch ($dec)
        …..
        case element(empty) return
          <xs:complexType/>
        ….
        default return...
      }
    };
Error reported: Severity: fatal. Description: Required attribute @name is missing
On what?!

What is the problem?
• Validation!
    declare function ssd:convertDeclarationOrComment($dec) {
      validate { typeswitch ($dec)
        …..
        case element(empty) return
          <xs:complexType/>
        ….
        default return...
      }
    };
Why doesn’t validate find the “most appropriate” type?
    case element(empty) return
      validate {<xs:element name="foo"><xs:complexType/></xs:element>}/*[1]
(Note this requires adjustment elsewhere.)
Why don’t constructors support type?

minidtdx2wxs.xquery
• What does this describe?
  – Ideally, a set of WXS
  – What’s the most specific output type?
  – Given the input, and the functions
    • we have a constrained output
    • we can define additional constraints along the way
      – e.g., that no @ref appears on a global element
• Compare with minidtdx.xsd
  – Both can be seen as constructive
    • WXS produces PSVI given input
    • minidtdx2wxs.xquery produces a WXS
  – Both can be seen as checking
    • (May want to modify some aspects of the query)
  – Which (XQuery or WXS) is more expressive?
    • Which is more analyzable?

What’s right?
Wherein we think about correctness

What is an XML “Document”?
• Layers
  – A series of octets
  – A series of unicode characters
  – A series of “events”
    • SAX perspective
    • E.g., Start/End tags
    • Events are tokens
  – A tree structure
    • A DOM/Infoset
  – A tree of a certain shape
    • A Validated Infoset
  – An adorned tree of a certain shape
    • A PSVI wrt a WXS
Callouts, top to bottom: “Errors here mean no XML!”; “SAX ErrorHandler”; “Yay! XPath! XSLT! Etc.”; “Types in play”
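The layer list above can be made concrete with a small sketch. This is only an illustration in Python, not something from the slides: the function name describe_layers, the EventLogger class, and the choice of UTF-8 as the encoding are all assumptions made for the example. It pushes a byte string up through the first four layers (octets, characters, SAX events, tree) and reports the first layer at which it fails, which is roughly what “errors here mean no XML” amounts to.

```python
import xml.sax
import xml.etree.ElementTree as ET


class EventLogger(xml.sax.ContentHandler):
    """Collects SAX events as (kind, name) tokens."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def describe_layers(octets):
    """Report how far up the layer stack a byte string gets.

    Hypothetical helper for illustration; UTF-8 is assumed.
    """
    # Layer 1: a series of octets.
    report = {"octets": len(octets)}

    # Layer 2: a series of Unicode characters.
    try:
        report["characters"] = len(octets.decode("utf-8"))
    except UnicodeDecodeError as err:
        report["error"] = f"not characters: {err.reason}"
        return report

    # Layer 3: a series of events. With no explicit error handler,
    # xml.sax uses its default ErrorHandler, which raises on fatal
    # (well-formedness) errors.
    handler = EventLogger()
    try:
        xml.sax.parseString(octets, handler)
    except xml.sax.SAXParseException as err:
        report["error"] = f"not well-formed: {err.getMessage()}"
        return report
    report["events"] = handler.events

    # Layer 4: a tree structure (no validation, no types yet).
    report["tree_root"] = ET.fromstring(octets).tag
    return report


if __name__ == "__main__":
    for sample in (b"<log><entry/></log>", b"<a><b></a>", b"\xff<a/>"):
        print(sample, "->", describe_layers(sample))
```

The validated and typed layers (a Validated Infoset, a PSVI) are deliberately left out: they need a schema processor, which the Python standard library does not provide.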
• Layers (the same list as on the preceding slide)
Callouts: “validate” at the top of the list, “erase” at the bottom

What is an XML “Document”?
• Layers (the same list again)
Callout beside the validated-tree layers: “Same” inputs can have different “meanings”! (external validation)

What is an XML “Document”?
• Layers (the same list again)
Generally looks like
    <configuration xmlns="http://saxon.sf.net/ns/configuration" edition="EE">
      <serialization method="xml" />
    </configuration>
But can look otherwise!
    element configuration {
      attribute edition {"ee"},
      element serialization {attribute method {"xml"}}}
Same “meaning”, different spelling

What is an XML “Document”?
• Layers
  – A series of octets
  – A series of unicode characters
  – A series of “events”
(Callout: Can have many...)
    • SAX perspective
    • E.g., Start/End tags
    • Events are tokens
(Callout: ..for “the same” meaning)
  – A tree structure
    • A DOM/Infoset
  – A tree of a certain shape
    • A Validated Infoset
  – An adorned tree of a certain shape
    • A PSVI wrt a WXS
  – A picture (or document, or action, or…)
    • Application meaning

A Case to Study
Wherein we go all trendy

A Case to Study
• Consider weblogs
  – Chronologically reversed series of “items”
  – Each item has an author and a timestamp
  – Items are generally short, but can contain all sorts of hypermedia
  – Generally intended to be read by people
    • Closer to a magazine than to a stock ticker
• Different aspects
  – Writing
  – Reading
  – Publishing
    • As a web site
    • As a “feed” for syndication
  – Aggregating

A Weblog Workflow

Weblog Data Formats
• For writing
  – HTML (directly or via a Web App)
  – Markdown/Wikilike Languages
• For reading
  – HTML
• For publishing
  – HTML (websites)
  – RSSx/Atom (syndication)
• For aggregation
  – RSSx/Atom
  – HTML? (via scraping)
    • hAtom?

HTML as SSD
• HTML files (tend to) correspond to documents
  – Text/narrative heavy
  – Complex, irregular (but with some, and some treelike) structure
  – Lots of features (doc structure, formatting, forms, etc.)
• HTML is not XML
  – Most XHTML on the Web is served as text/html
  – Permissive parsing: No need for well formedness
  – Omit close tags, quotes around attributes
  – Misnest tags -- the browsers will cope!
• HTML is Not SGML
  – See, for example, the case of comments

A Simple HTML Weblog (1)
Authentic Voice of a Person. Reverse Chronological Order. On the web. These are essential characteristics of a online Journal or weblog. Given the statements above, a well formed log entry would contain at a minimum an author, a creationDate, and a permaLink. And, of course, content.
-- Sam Ruby
    <h1>My Weblog</h1>
    <h2>What I Did Today</h2>
    <h3><a id="160220081"></a> Feb.
    11, 2008; Bijan Parsia</h3>
    <p>Taught class and it went <i>very</i> well.</p>
What is this notion of “well-formed”?

A Simple HTML Weblog (2)
• We can radically change the markup
    <h1>My Weblog</h1>
    <ul>
      <li>
        <b>What I Did Today</b><br/>
        <i><a id="160220081"> Feb. 11, 2008; Bijan Parsia</a></i></br>
        <p>Taught a class and it went <em>very</em> well.
      </li>
    </ul>
• Which is “right”?
• Where is the structure, semi or otherwise?
  – Is this a “well formed” weblog entry?
    • By Ruby’s criteria?

A Simple Atom Entry
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>My Weblog</title>
      <updated>2008-02-13T18:30:02Z</updated>
      <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
      <entry>
        <author>
          <name>Bijan Parsia</name>
        </author>
        <title>What I Did Today</title>
        <id>urn:uuid:1225c695-cfb8-4ebb-aaaa</id>
        <updated>2008-02-13T18:30:02Z</updated>
        <content type="xhtml" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
          <p>Taught class and it went <em>very</em> well.</p>
        </content>
      </entry>
    </feed>

A Structured HTML (1)
    <div class=title>My Weblog</div>
    <div class="entry">
      <div class=title>What I Did Today</div>
      <div class=byline>
        <span class=date>Feb. 16, 2009</span>
        <span class=author>Bijan Parsia</span>
      </div>
      <div class="content">
        <p>Taught a class and it went <i>very</i> well.</p>
      </div>
    </div>
• What do we see?
  – Not well formed XML
  – Will not render as nicely
    • With the right style (CSS), will look like the others!
  – Not a well formed log entry!
    • Missing a permalink
    • Though perhaps it’s implicit?

Some CSS
• Which layer does this describe?
    <style type="text/css">
      .title {font-weight: bold}
      div.title {text-align:center; font-size: 24; }
      div.entry div.title {text-align: left; font-variant: normal}
      span.date {font-style: italic}
      span.date:after{content:" by"}
      div.content {font-style: italic}
      div.content i {font-style: normal; font-weight: bold}
    </style>
(Slide labels: Structure, Presentation)

A Structured HTML (2)
    <div class="hfeed">
      <p>My Weblog</p>
      <div class="hentry" id="112993192128302715">
        <strong class="entry-title"> What I Did Today </strong>
        <div class="entry-content">
          <p>Taught a class and it went <i>very</i> well.</p>
        </div>
      </div>
      <div>
        <span class="byline">posted by
          <span class="author vcard">
            <span class="fn">Bijan Parsia</span> at
            <a rel="bookmark" href="2009-16-02-post1">
              <abbr class="published">Feb.