Theories of Errors
Bijan Parsia COMP60372 Feb. 19, 2010
Monday, 1 March 2010 1 Finishing Types
Since we are gluttons for punishment
Recall
• We now have a theory of matching
  – i.e., we know what it is for a value to match a type
  – e.g., a simple value matches xs:string iff it’s a string
  – Matching was complex for elements
• But matching isn’t our key service
  – validation is
• Two conceptions of validation
  – Grammar based recognition
    • Validate as instance of some DTD
    • Validate “against” some DTD
    • Determine if valid wrt some DTD
  – PSVI production
    • Go from an untyped value (or its string rep) to a typed one
    • validation and erasure
Validation (subset)
Compare with matching!
Erasure
A complexity!
integer-of-string(“01”) = integer-of-string(“1”)
Wildcard info lost!
Validation & Erasure
• Features of “external representation”
  – Self-describing and round-tripping
• Round-tripping failure comes from cases where
  – erases is a relation (trivial)
    • “01” to 1 to “1”
  – erases obliterates type: self-description failure!
    • {“1”, 1} to “1 1” to {1, 1} (or {“1”, “1”})
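The two failures above can be sketched in a few lines of Python. This is only an illustration, not the formal semantics: `validate_integer` and `erase` are hypothetical helpers standing in for validation against xs:integer and for erasure to a string representation.

```python
def validate_integer(text: str) -> int:
    """Hypothetical stand-in for validating a string against xs:integer."""
    return int(text)


def erase(values) -> str:
    """Hypothetical stand-in for erasure: items become one space-separated string."""
    return " ".join(str(v) for v in values)


# "01" validates to 1, which erases to "1": the original lexical form is lost.
round_tripped = erase([validate_integer("01")])

# The mixed sequence {"1", 1} erases to "1 1": the string/integer distinction is gone,
# so revalidating as integers yields {1, 1}.
erased = erase(["1", 1])
revalidated = [validate_integer(token) for token in erased.split()]

print(round_tripped)  # 1
print(erased)         # 1 1
print(revalidated)    # [1, 1]
```

Neither step is wrong on its own; the loss only shows up when the two are composed.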
Coursework Retrospective
• Some Tricky Bits™
  – No one expects the Spanish Inquisition
    • No one!
  – minidtdx.xsd describes the syntax of (mini)DTD/XML
    • It describes an XML syntax for a fragment of DTDs
    • It does not describe the semantics!
      – Esp. not by example!
    • It is incomplete
      – It is not the tightest schema possible
      – Questionable syntax choices?
        » Repetition in choice
• XML Schema in different modes
  – Validate where?
  – Xerces Has A Bug
    • nillable=”false”
Error Reporting
• What’s wrong with...

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
    …..
    case element(ref) return

Severity: fatal
Description: Attribute @ref is not allowed on element

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
    …..
    case element(empty) return

Severity: fatal
Description: Required attribute @name is missing
What is the problem?
• Validation!

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
    …..
    case element(empty) return

case element(empty) return validate {
(Note this requires adjustment elsewhere.)
Why don’t constructors support type?
minidtdx2wxs.xquery
• What does this describe?
  – Ideally, a set of WXS
  – What’s the most specific output type?
  – Given the input, and the functions
    • we have a constrained output
    • we can define additional constraints along the way
      – e.g., that no @ref appears on a global element
• Compare with minidtdx.xsd
  – Both can be seen as constructive
    • WXS produces PSVI given input
    • minidtdx2wxs.xquery produces a WXS
  – Both can be seen as checking
    • (May want to modify some aspects of the query)
  – Which (XQuery or WXS) is more expressive?
    • Which is more analyzable?
What’s right?
Wherein we think about correctness
What is an XML “Document”?
• Layers
  – A series of octets
  – A series of Unicode characters
  – A series of “events”
    • SAX perspective
    • E.g., Start/End tags
    • Events are tokens
    (Errors here mean no XML! SAX ErrorHandler)
  – A tree structure
    • A DOM/Infoset
    (Yay! XPath! XSLT! Etc.)
  – A tree of a certain shape
    • A Validated Infoset
  – An adorned tree of a certain shape
    • A PSVI wrt a WXS
    (Types in play)
What is an XML “Document”?
• Layers
  – A series of octets
  – A series of Unicode characters
  – A series of “events”
    • SAX perspective
    • E.g., Start/End tags
    • Events are tokens
  – A tree structure
    • A DOM/Infoset
  – A tree of a certain shape
    • A Validated Infoset
  – An adorned tree of a certain shape
    • A PSVI wrt a WXS
  (validate goes from the untyped layers to the PSVI; erase goes back)
What is an XML “Document”?
• Layers
  – A series of octets
  – A series of Unicode characters
  – A series of “events”
    • SAX perspective
    • E.g., Start/End tags
    • Events are tokens
  – A tree structure
    • A DOM/Infoset
  – A tree of a certain shape
    • A Validated Infoset
  – An adorned tree of a certain shape
    • A PSVI wrt a WXS
  (“Same” inputs can have different “meanings”! (external validation))
What is an XML “Document”?
• Layers
Generally looks like
What is an XML “Document”?
• Layers
  – A series of octets
  – A series of Unicode characters
  – A series of “events”
    • SAX perspective
    • E.g., Start/End tags
    • Events are tokens
  – A tree structure
    • A DOM/Infoset
  – A tree of a certain shape
    • A Validated Infoset
  – An adorned tree of a certain shape
    • A PSVI wrt a WXS
  – A picture (or document, or action, or…)
    • Application meaning
  (Can have many lower-level forms... for “the same” meaning)
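The first four layers can be walked through with Python's standard library. This is a rough sketch under stated assumptions: the sample document is invented for illustration, and `iterparse` stands in for a SAX-style event stream.

```python
import io
import xml.etree.ElementTree as ET

# Layer 1: a series of octets (an invented sample document).
octets = b'<?xml version="1.0" encoding="UTF-8"?><log><entry>Hi</entry></log>'

# Layer 2: the octets decoded to Unicode characters.
characters = octets.decode("utf-8")

# Layer 3: a series of start/end events (tokens), in the SAX spirit.
events = [
    (event, element.tag)
    for event, element in ET.iterparse(io.BytesIO(octets), events=("start", "end"))
]

# Layer 4: a tree structure that path expressions can navigate.
tree = ET.fromstring(octets)
entry_text = tree.find("entry").text

print(events)
print(entry_text)
```

The validated and adorned layers need a schema processor, which the standard library does not provide, so they are left out here.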
A Case to Study
Wherein we go all trendy
A Case to Study
• Consider weblogs
  – Chronologically reversed series of “items”
  – Each item has an author and a timestamp
  – Items are generally short, but can contain all sorts of hypermedia
  – Generally intended to be read by people
    • Closer to a magazine than to a stock ticker
• Different aspects
  – Writing
  – Reading
  – Publishing
    • As a web site
    • As a “feed” for syndication
  – Aggregating
A Weblog Workflow
Weblog Data Formats
• For writing
  – HTML (directly or via a Web App)
  – Markdown/Wikilike Languages
• For reading
  – HTML
• For publishing
  – HTML (websites)
  – RSSx/Atom (syndication)
• For aggregation
  – RSSx/Atom
  – HTML? (via scraping)
    • hAtom?
HTML as SSD
• HTML files (tend to) correspond to documents
  – Text/narrative heavy
  – Complex, irregular (but with some, and some treelike) structure
  – Lots of features (doc structure, formatting, forms, etc.)
• HTML is not XML
  – Most XHTML on the Web is served as text/html
  – Permissive parsing: No need for well formedness
  – Omit close tags, quotes around attributes
  – Misnest tags -- the browsers will cope!
• HTML is not SGML
  – See, for example, the case of comments
A Simple HTML Weblog (1)
Authentic Voice of a Person. Reverse Chronological Order. On the web. These are essential characteristics of a online Journal or weblog. Given the statements above, a well formed log entry would contain at a minimum an author, a creationDate, and a permaLink. And, of course, content.
-- Sam Ruby
My Weblog
What I Did Today
Feb. 11, 2008; Bijan Parsia
Taught class and it went very well.
What is this notion of “well-formed”?
A Simple HTML Weblog (2)
• We can radically change the markup
My Weblog
- What I Did Today
Feb. 11, 2008; Bijan Parsia
Taught a class and it went very well.
• Which is “right”?
• Where is the structure, semi or otherwise?
  – Is this a “well formed” weblog entry?
    • By Ruby’s criteria?
A Simple Atom Entry
Taught class and it went very well.
A Structured HTML (1)
Taught a class and it went very well.
• What do we see?
  – Not well formed XML
  – Will not render as nicely
    • With the right style (CSS), will look like the others!
  – Not a well formed log entry!
    • Missing a permalink
    • Though perhaps it’s implicit?
Some CSS
• Which layer does this describe?
Structure Presentation
A Structured HTML (2)
My Weblog
Taught a class and it went very well.
This uses the hAtom microformat
MicroDigression
• class and similar attributes widely used
  – Hooks for CSS “queries”
  – “Semantic” names
    • Header, footer
    • Note, etc.
• MicroFormats exploit this
  – Standardize class names
    • With standard semantics, not just style
  – Embed (semi-)structured data in regular old HTML
    • Map to established formats
• For our purposes
  – MicroFormats are just surface syntax
    • But surface syntax is important!!!
Let’s Think Queries
Taught class and it went very well.
My Weblog
Taught a class and it went very well.
AIEEE!
• So many models…
  – Which is the “right” one?
    • (For what purpose? Under what constraints?)
  – What is the model?
    • a well formed log entry
      – would contain at a minimum
        » an author,
        » a creationDate,
        » and a permaLink.
        » And, of course, content.
    • A weblog
      – Authentic Voice of a Person (so many entries?)
      – Reverse Chronological Order (Ah, yes)
      – On the web
• Many encodings of this model!
  – At every level
    • Parsing vs. transforming
  – Some easy to see, others are not
Many DOMs
Think about the XQuery needed to get all titles…
How to cope?
• With which task?
  – Authoring, aggregating, querying…
• Settle on a core representation of the model
  – Perhaps the Atom DOM
• Coerce/transform/extract other models
  – To the representative one
  – Or build software that mediates the difference
    • Hope that there aren’t too many
• Advocate standards!
  – Or make them
  – The nice thing about standards is that there are so many of them to choose from.
    • Kent Pitman and others
Postel’s Law
Be liberal in what you accept, and conservative in what you send.
• Liberality
  – Many DOMs, all expressing the same thing
  – Many surface syntaxes (perhaps) for each DOM
• Conservativity
  – What should we send?
    • It depends on the receiver!
  – Minimal standards?
    • Well formed XML?
    • Valid according to a popular schema/format?
    • HTML?
Structure and Presentation
• We’ve called this “DOM” and “Application” Layer
  – A very common application layer is “rendering”
    • Text, images
    • Like, y’know, the web
• Standard vs. default renderings
• Goes back to SGML
This sentence is false. Correct rendering
This sentence is false. Fallback!
(Still see this in XSLT!)
Why Separate them?
• Presentation is more fluid than structure
  – The "look" may need updating
• Presentation needs may vary
  – What works for 21" screens doesn't for mobile phones
    • (Or maybe not!)
• Accessibility
  – (content should be perceivable by everyone)
• Programmatic processing needs
Another digression: CSS
• The style language for the Web
  – Strong separation of presentation
• CSS is
  – not an XML/angle brackets format
    • Oh NOES! Not another one!
  – annotative, not transformative
    • Well, sorta
  – mostly “formats” nodes
  – ubiquitous on the Web, esp. client side
  – works with arbitrary XML
    • But most clients work with (X)HTML
Basic Component
• Rules
  – Which consist of
    • Selectors
      – Like XPath expressions
      – But only forward, with some syntactic sugar
    • Declaration blocks
      – Sets of property/value pairs
div.title {
  text-align: center;
  font-size: 24pt;
}
Taught a class and it went very well.
Try it in http://software.hixie.ch/utilities/js/live-dom-viewer/
Media Types
• Different sets of rules can be contextualized to media
  – Screen, Print, Braille, Aural…
• This is done with groupings called “@media rule”s

@media print {
  BODY { font-size: 10pt }
}
@media screen {
  BODY { font-size: 12pt }
}

Larger font size for screen
Cascading
• CSS Rules cascade
  – That is, there is overriding (and non-overriding) inheritance
    • That is, rules combine in different ways
  – http://www.w3.org/TR/CSS21/cascade.html#cascade
• General principles
  – Distance to the node is significant
  – Precision of selectors is significant
  – Order of appearance is significant
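Two of the general principles (precision of selectors, then order of appearance) can be sketched as a simple tie-breaking rule. This is a deliberate simplification, not the full CSS 2.1 cascade: it ignores origin, importance, and inheritance distance, and the `Declaration` type and sample rules are made up for illustration.

```python
from typing import NamedTuple


class Declaration(NamedTuple):
    specificity: tuple  # (ids, classes, element names) counted in the selector
    order: int          # position of the rule in the style sheet
    value: str


def winning_value(declarations):
    """Most specific selector wins; among equals, the later rule wins."""
    return max(declarations, key=lambda d: (d.specificity, d.order)).value


font_size = [
    Declaration((0, 0, 1), 0, "12pt"),  # div { font-size: 12pt }
    Declaration((0, 1, 1), 1, "24pt"),  # div.title { font-size: 24pt }
    Declaration((0, 0, 1), 2, "10pt"),  # div { font-size: 10pt }
]

print(winning_value(font_size))  # 24pt
```

Dropping the `div.title` rule leaves two equally specific rules, and the later one ("10pt") wins.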
Error Handling
• XML has “draconian” error handling
  – Well formedness error…BOOM
• CSS has “forgiving” error handling
  – “Rules for handling parsing errors” http://www.w3.org/TR/CSS21/syndata.html#parsing-errors
    • That is, how to interpret illegal documents
    • Not just reporting errors, but working around them
  – E.g., “User agents must ignore a declaration with an unknown property.”
    • Replace: “h1 { color: red; rotation: 70minutes }”
    • With: “h1 { color: red }”
• Study the error handling rules!
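The “ignore a declaration with an unknown property” rule can be sketched as a filter over a declaration block. This is a toy illustration only: `KNOWN_PROPERTIES` is an assumed, tiny subset, and the splitting on `;` and `:` is far cruder than a real CSS tokenizer.

```python
# Assumed, tiny subset of properties a hypothetical user agent understands.
KNOWN_PROPERTIES = {"color", "font-size", "text-align"}


def drop_unknown_declarations(block: str) -> str:
    """Keep only declarations whose property name is known; skip the rest silently."""
    kept = []
    for declaration in block.split(";"):
        if ":" not in declaration:
            continue  # empty or malformed fragment: ignore it rather than fail
        name, value = declaration.split(":", 1)
        if name.strip().lower() in KNOWN_PROPERTIES:
            kept.append(f"{name.strip()}: {value.strip()}")
    return "; ".join(kept)


print(drop_unknown_declarations("color: red; rotation: 70minutes"))  # color: red
```

The point is the shape of the behaviour: the bad declaration disappears and the rest of the rule still applies, with no error surfaced to the reader.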
CSS Robustness
• Has to deal with Web conditions
  1. People borrowing
  2. People collaborating
  3. Different devices
  4. Different kinds of audiences (and authors)
  5. Maintainability
  6. Aesthetics
• CSS is designed for this
  – Cascading & Inheritance help with 1, 2, 5
    • And importing, of course
  – @media rules help with 3-6
  – Error handling helps with 1, 2, 4
Error Detection and Reporting
Wherein we learn about Schematron
Error Reporting

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
    …..
    case element(ref) return

Severity: fatal
Description: Attribute @ref is not allowed on element

(Perceived) Affordances
• (Perceived) Affordance
  – an available action that is salient to the actor
Donald Norman, The Design of Everyday Things
Attractive Nuisances
• A dominant or attractive affordance
  – with a bad or wrong action
  – In law, “a hazardous object or condition on the land that is likely to attract children who are unable to appreciate the risk posed by the object or condition” -- ye olde Wikipedia
  – We can reformulate
    • “a hazardous or misleading language or UI feature that is likely to be misused by (even) an educated user”
• Contrast with “merely” hard to use
  – An attractive nuisance is easy to attempt, hard to use (correctly), and has bad (to catastrophic) effects
Typical Schema Languages
• Grammar (and maybe type based)
  – Recognize all or none
    • Though what the “all” is can be rather flexible
  – Restrictive by default
    • Slogan: What is not permitted is forbidden
  – Error detection and reporting
    • Is at the discretion of the system
    • “Not accepted” is the starting place
    • The point where an error is detected
      – might not be the point where it occurred
      – might not be the most helpful point to look at!
    • Programs!
      – Null pointer deref
        » Is the right point the deref or the setting to null?
      – Non-crashing errors
The SSD Way
• Explore before prescribe
• Describe rather than define
• Take what you can, when you can take it
• Extra or missing stuff is (can be) OK
  – Irregular structure!
• Adhere to the task at hand
• Adore Postel’s Law
Schematron
• A different sort of schema language
  – Not grammar or object/type based
  – Rule based
  – Test oriented
  – Complementary
• Conceptually simple
  – Patterns contain rules
    • Rules set a context and contain asserts and reports (A&Rs)
    • A&Rs contain
      – Tests, which are XPath expressions, and
      – Assertions, which are natural language descriptions
DTDx Schematron
• “Only 1 Element declaration with a given name”
  – (Ok, could handle this with Keys in XML Schema!)
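The rule-based flavour of such a check can be imitated in Python. This is not Schematron itself, just a sketch of the same idea (a context, a test, and a natural language message); the `<dtd>` and `<element name="…">` vocabulary is a hypothetical stand-in, since the real minidtdx syntax may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical DTDx-like input; the element and attribute names are assumptions.
doc = ET.fromstring(
    '<dtd><element name="a"/><element name="b"/><element name="a"/></dtd>'
)


def check_unique_element_names(root):
    """Context: every element declaration. Test: its name is declared exactly once."""
    counts = Counter(decl.get("name") for decl in root.iter("element"))
    return [
        f"Only 1 element declaration with a given name: '{name}' is declared {n} times"
        for name, n in counts.items()
        if n > 1
    ]


reports = check_unique_element_names(doc)
for message in reports:
    print(message)
```

Note that the check produces reports instead of rejecting the document; anything the rule does not mention passes untouched.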
From HTML5: Exclusions
• HTML5 validator
• http://hsivonen.iki.fi/thesis/
  – RELAX NG schema
  – Schematron assertions
  – Custom code
• Often want contextual exclusions
  – To break circles:
    • Paragraphs contain footnotes
    • Footnotes contain paragraphs
    • Footnote paragraphs may not contain footnotes
• Without exclusions, would need many paragraph productions
Exclusions Examples
DFN Defined
From:
## Defining Instance:
dfn.elem = element dfn { dfn.inner & dfn.attrs }
dfn.attrs = ( common.attrs & common.attrs.aria? )
dfn.inner = ( common.inner.phrasing )
common.elem.phrasing |= dfn.elem
From:
## Phrase Content
common.inner.phrasing = ( text & common.elem.phrasing* )
DFN Redefined
From:
## Defining Instance:
dfn.elem = element dfn { dfn.inner & dfn.attrs }
dfn.attrs = ( common.attrs & common.attrs.aria? )
dfn.inner = ( common.inner.phrasing.without.dfn )
common.elem.phrasing |= dfn.elem
common.elem.phrasing.without.noscript |= dfn.elem
…
We could extend our formalism (again!)
Tip of the iceberg
• Computations
  – Using XPath functions and variables
• Dynamic checks
  – Can pull stuff from other files
• Elaborate reports
  – diagnostics has (value-ofed) expressions
  – “Generate paths” to errors
    • Sound familiar?
• General case
  – Thin shim over XSLT
  – Closer to “arbitrary code”
Interesting Points
• DTDx has a WXS
  – Schematron doesn’t care
  – Two phase validation
    • RELAX NG has a way of embedding
    • WXSbis incorporating similar rules
• Arbitrary XPath for context and test
  – Plus variables!
• What isn’t forbidden is permitted
  – Unlike all the other schema languages!
  – We’re not performing runs
    • We’re firing rules
  – Somewhat easy to use
    • If you know XPath
    • If you don’t need coverage
  – What about analysis?
Schematron Presumes…
• …well formed XML
  – As do all XML schema languages
    • Work on DOM!
  – So can’t help with e.g., overlapping tags
    • Or tag soup in general
    • Namespace Analysis!?
• …authorial repair
  – At least, in the default case
    • Communicate errors to people
    • Thus, not the basis of a modern browser!
  – Unlike CSS
• Is this enough liberality?
  – Or rather, does it support enough liberality?
Styles of Error Handling
Wherein we reconsider the basics
XML Error Handling
• De facto XML motto
  – Be strict about the well formedness of what you accept, and strict in what you send
  – Draconian error handling
  – Severe consequences on the Web
    • And other places
• Fail early and fail hard
• What about higher levels?
  – Validity and other analysis?
  – Most schema languages poor at error reporting
    • How about XQuery’s type error reporting?
XML Error Handling
• The spec:
  – fatal error [Definition: An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).]
• What should an application do?
  – To or for its users
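The contrast between draconian and forgiving processing can be seen with two parsers from Python's standard library. This is a small sketch, not a statement about browsers: the overlapping-tag markup is an invented example, and `TagCollector` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented example with overlapping <b> and <i> tags: not well formed XML.
broken = "<p><b>hi <i>there</b>, my good friend</i></p>"

# Draconian: the XML parser reports a fatal error and hands back no tree.
try:
    ET.fromstring(broken)
    xml_result = "parsed"
except ET.ParseError:
    xml_result = "fatal error"


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Forgiving: the HTML parser keeps tokenizing and leaves recovery to the caller.
collector = TagCollector()
collector.feed(broken)
collector.close()

print(xml_result)
print(collector.tags)
```

What the application then does with the error, or with the unrepaired tag stream, is exactly the question the slide asks.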
Take the following sample XHTML code:
01. 02.
03.Hello to you!
08.Can you spot the problem? 09. 10.
Slide due to Iain Flynn
HTML:
XHTML:
Slide due to Iain Flynn
Validation In The Wild
• HTML
  – 1%-5% of web pages are valid
  – Validation is very weak!
  – All sorts of breakage
    • E.g., overlapping tags
    • hi there, my good friend
• Syndication Formats
  – 10% of feeds are not well-formed
  – Where do the problems come from?
    • Hand authoring
    • Generation bugs
    • String concat based generation
    • Composition from random sources
More recently
In 2005, the developers of Google Reader (Google’s RSS and Atom feed parser) took a snapshot of the XML documents they parsed in one day.
• Approximately 7% of these documents contained at least one well-formedness error.
• Google Reader deals with millions of feeds per day.
  – That’s a lot of broken documents
Source: http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html
Slide due to Iain Flynn
Encoding
Structure
Entity
Typo Text
Slide due to Iain Flynn
Error Breakdown
[Chart: share of errors by category (Encoding, Structure, Entity, Typo)]
Slide due to Iain Flynn
A Thought Experiment
• “Imagine...that all web browsers use strict XML parsers”
• “...that you were using a publishing tool that [was strict]”
  – “All of its default templates were valid XHTML.”
  – “It incorporated a nifty layout editor to ensure that you couldn’t introduce any invalid XHTML...”
• “You click ‘Publish’”
  – “the page that you...validly authored is now not well-formed”
• Problem: “a trackback with some illegal characters”
  – “...your publishing tool had a bug”
  – “The administration page itself tries to display the trackbacks you’ve received, and you get an XML processing error.”
http://diveintomark.org/archives/2004/01/14/thought_experiment
Real Life
Lesson #1
• We are dealing with socio-political (and economic) phenomena
  – Complex ones!
  – Many players; many sorts of player
  – Lots of historical specifics
  – Lots of interaction effects
• Human factors critical
  – What do people do (and why?)
  – How to influence them?
  – Affordances and incentives
  – Dealing with “bozos”
    • “There’s just no nice way to say this: Anyone who can’t make a syndication feed that’s well-formed XML is an incompetent fool.”
Error Handling Styles
• Draconian
  – Fail hard and fast
• Ignore errors
  – CSS
• Hard coded DWIM repair
  – HTML, HTML5
• “Repair-sheet”/Schema based repair
  – Instead of just reports, Schematron rules could trigger repairs
• Ultimately, (some) errors are propagated
  – The key is to fail correctly
    • In the right way, at the right time, for the right reason
  – With the right message!