Theories of Errors

Bijan Parsia COMP60372 Feb. 19, 2010

Monday, 1 March 2010 1 Finishing Types

Since we are gluttons for punishment

Monday, 1 March 2010 2 Recall • We now have a theory of matching – i.e., we know what it is for a value to match a type – e.g., a simple value matches xs:string iff it’s a string – Matching was complex for elements • But matching isn’t our key service – validation is • Two conceptions of validation – Grammar based recognition • Validate as instance of some DTD • Validate “against” some DTD • Determine if valid wrt some DTD – PSVI production • Go from an untyped value (or its string rep) to a typed one • validation and erasure

Monday, 1 March 2010 3 Validation (subset)

Compare with matching!

Monday, 1 March 2010 4 Erasure

A complexity!

integer-of-string(“01”) = integer-of-string(“1”)

Wildcard info lost!

Monday, 1 March 2010 5 Validation & Erasure

• Features of “external representation” – Self-describing and round-tripping • Round-tripping failure comes from cases where – erases is a relation (trivial) • “01” to 1 to “1” – erases obliterates type (self-description failure!) • {“1”, 1} to “1 1” to {1, 1} (or {“1”, “1”})
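A minimal Python sketch of the two round-tripping failures just listed. The helper names (validate_integer, erase, erase_sequence) are hypothetical and only model xs:integer and xs:string values; they are not taken from any XQuery or XML Schema implementation.

```python
def validate_integer(lexical: str) -> int:
    """Toy 'validation': go from a string representation to a typed value."""
    return int(lexical)

def erase(value) -> str:
    """Toy 'erasure': go from a typed value back to a string representation."""
    return str(value)

def erase_sequence(values) -> str:
    """Erase a sequence of atomic values to one space-separated string."""
    return " ".join(erase(v) for v in values)

# Several lexical forms map to one value, so the original spelling is lost.
same_value = validate_integer("01") == validate_integer("1")
round_tripped = erase(validate_integer("01"))

# A mixed sequence loses its types: the string "1" and the integer 1 erase alike.
erased = erase_sequence(["1", 1])
as_integers = [validate_integer(token) for token in erased.split()]
as_strings = erased.split()
```

After erasure nothing in the string says whether the sequence should be read back as integers or as strings, which is the self-description failure.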

Monday, 1 March 2010 6 Coursework Retrospective • Some Tricky Bits™ – No one expects the Spanish Inquisition • No one! – minidtdx.xsd describes the syntax of (mini)DTD/XML • It describes an XML syntax for a fragment of DTDs • It does not describe the semantics! – Esp. not by example! • It is incomplete – It is not the tightest schema possible – Questionable syntax choices? » Repetition in choice • XML Schema in different modes – Validate where? – Xerces Has A Bug • nillable=”false”

Monday, 1 March 2010 7 Error Reporting • What’s wrong with...

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
      …..
      case element(ref) return ….
      default return...
  }
};

Severity: fatal
Description: Attribute @ref is not allowed on element. The name is in one of the disallowed namespaces for the wildcard

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
      …..
      case element(empty) return ….
      default return...
  }
};

Severity: fatal
Description: Required attribute @name is missing

On what?!

Monday, 1 March 2010 8 What is the problem? • Validation!

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
      …..
      case element(empty) return ….
      default return...
  }
};

Why doesn’t validate find the “most appropriate” type?

case element(empty) return validate {}/*[1]

(Note this requires adjustment elsewhere.)

Why don’t constructors support type?

Monday, 1 March 2010 9 minidtdx2wxs.xquery • What does this describe? – Ideally, a set of WXS – What’s the most specific output type? – Given the input, and the functions • we have a constrained output • we can define additional constraints along the way – e.g., that no @ref appears on a global element • Compare with minidtdx.xsd – Both can be seen as constructive • WXS produces PSVI given input • minidtdx2wxs.xquery produces a WXS – Both can be seen as checking • (May want to modify some aspects of the query) – Which (XQuery or WXS) is more expressive? • Which is more analyzable?

Monday, 1 March 2010 10 What’s right?

Wherein we think about correctness

Monday, 1 March 2010 11 What is an XML “Document”? • Layers – A series of octets – A series of Unicode characters – A series of “events” [Errors here mean no XML! SAX ErrorHandler] • SAX perspective • E.g., Start/End tags • Events are tokens – A tree structure [Yay! XPath! XSLT! Etc.] • A DOM/Infoset – A tree of a certain shape • A Validated Infoset – An adorned tree of a certain shape [Types in play] • A PSVI wrt a WXS

Monday, 1 March 2010 12 What is an XML “Document”? • Layers [validate] – A series of octets – A series of Unicode characters – A series of “events” • SAX perspective • E.g., Start/End tags • Events are tokens – A tree structure • A DOM/Infoset – A tree of a certain shape • A Validated Infoset – An adorned tree of a certain shape • A PSVI wrt a WXS [erase]

Monday, 1 March 2010 13 What is an XML “Document”? • Layers – A series of octets – A series of Unicode characters – A series of “events” • SAX perspective • E.g., Start/End tags • Events are tokens – A tree structure • A DOM/Infoset – A tree of a certain shape • A Validated Infoset – An adorned tree of a certain shape • A PSVI wrt a WXS [“Same” inputs can have different “meanings”! (external validation)]

Monday, 1 March 2010 14 What is an XML “Document”? • Layers – A series of octets – A series of Unicode characters – A series of “events” • SAX perspective • E.g., Start/End tags • Events are tokens – A tree structure • A DOM/Infoset – A tree of a certain shape • A Validated Infoset – An adorned tree of a certain shape • A PSVI wrt a WXS

Generally looks like [markup example not preserved]. But can look otherwise!

element configuration {
  attribute edition {"ee"},
  element serialization { attribute method {""} }
}

Same “meaning”, different spelling

Monday, 1 March 2010 15 What is an XML “Document”? • Layers – A series of octets – A series of Unicode characters – A series of “events” • SAX perspective • E.g., Start/End tags • Events are tokens – A tree structure • A DOM/Infoset – A tree of a certain shape • A Validated Infoset – An adorned tree of a certain shape • A PSVI wrt a WXS – A picture (or document, or action, or…) • Application meaning [Can have many… for “the same” meaning]
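The lower layers in this list can be walked through with Python's standard library. This is only a sketch: the sample document and the EventRecorder class are made up for illustration, and the validated and PSVI layers are left out because the standard library has no schema validator.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Layer 1: a series of octets (a hypothetical document).
octets = b'<?xml version="1.0" encoding="UTF-8"?><entry><title>What I Did Today</title></entry>'

# Layer 2: a series of Unicode characters.
text = octets.decode("utf-8")

# Layer 3: a series of events (the SAX perspective).
class EventRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

recorder = EventRecorder()
xml.sax.parseString(octets, recorder)

# Layer 4: a tree structure that can be queried.
tree = ET.fromstring(octets)
title_text = tree.findtext("title")
```

An encoding or well-formedness problem would surface at the second or third step, before any tree exists to query.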

Monday, 1 March 2010 16 A Case to Study

Wherein we go all trendy

Monday, 1 March 2010 17 A Case to Study

• Consider weblogs – Chronologically reversed series of "items” – Each item has an author and a timestamp – Items are generally short, but can contain all sorts of hypermedia – Generally intended to be read by people • Closer to a magazine than to a stock ticker • Different aspects – Writing – Reading – Publishing • As a web site • As a "feed" for syndication – Aggregating

Monday, 1 March 2010 18 A Weblog Workflow

Monday, 1 March 2010 19 Weblog Data Formats

• For writing – HTML (directly or via a Web App) – Markdown/Wikilike Languages • For reading – HTML • For publishing – HTML (websites) – RSSx/Atom (syndication) • For aggregation – RSSx/Atom – HTML? (via scraping) • hAtom?

Monday, 1 March 2010 20 HTML as SSD

• HTML files (tend to) correspond to documents – Text/narrative heavy – Complex, irregular (but with some, and some treelike) structure – Lots of features (doc structure, formatting, forms, etc.) • HTML is not XML – Most XHTML on the Web is served as text/ – Permissive parsing: No need for well formedness – Omit close tags, quotes around attributes – Misnest tags -- the browsers will cope! • HTML is Not SGML – See, for example, the case of comments

Monday, 1 March 2010 21 A Simple HTML Weblog (1) Authentic Voice of a Person. Reverse Chronological Order. On the web. These are essential characteristics of a online Journal or weblog. Given the statements above, a well formed log entry would contain at a minimum an author, a creationDate, and a permaLink. And, of course, content.

My Weblog

-- Sam Ruby

What I Did Today

Feb. 11, 2008; Bijan Parsia

Taught class and it went very well.

What is this notion of “well-formed”?

Monday, 1 March 2010 22 A Simple HTML Weblog (2) • We can radically change the markup

My Weblog

• Which is “right”? • Where is the structure, semi or otherwise? – Is this a “well formed” weblog entry? • By Ruby’s criteria?

Monday, 1 March 2010 23 A Simple Atom Entry

My Weblog 2008-02-13T18:30:02Z urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6 Bijan Parsia What I Did Today urn:uuid:1225c695-cfb8-4ebb-aaaa 2008-02-13T18:30:02Z

Taught class and it went very well.

Monday, 1 March 2010 24 A Structured HTML (1)

My Weblog
What I Did Today

Taught a class and it went very well.

• What do we see? – Not well formed XML – Will not render as nicely • With the right style (CSS), will look like the others! – Not a well formed log entry! • Missing a permalink • Though perhaps it’s implicit?

Monday, 1 March 2010 25 Some CSS • Which layer does this describe?

Structure Presentation

Monday, 1 March 2010 26 A Structured HTML (2)

My Weblog

What I Did Today

Taught a class and it went very well.

This uses the hAtom microformat

Monday, 1 March 2010 27 MicroDigression

• class and similar attributes widely used – Hooks for CSS “queries” – “Semantic” names • Header, footer • Note, etc. • MicroFormats exploit this – Standardize class names • With standard semantics, not just style – Embed (semi-)structured data in regular old HTML • Map to established formats • For our purposes – MicroFormats are just surface syntax • But surface syntax is important!!!

Monday, 1 March 2010 28 Let’s Think Queries

• For Atom – Titles of entries • //title ? – /feed/title vs. //entry/title – Entries by me • //entry[author/name/text()=…] – Dates • //updated

My Weblog 2008-02-13T18:30:02Z urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6 Bijan Parsia What I Did Today urn:uuid:1225c695-cfb8-4ebb-aaaa 2008-02-13T18:30:02Z

Taught class and it went very well.

• For hAtom – Titles • //*[@class="entry-title"] – Entries by me • //*[@class="published"]

My Weblog

What I Did Today

Taught a class and it went very well.
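A rough Python sketch of the title queries, run against two made-up documents. The slide markup did not survive extraction, so both samples below are assumptions that only follow the general shape of Atom and hAtom. ElementTree supports a subset of XPath, so the paths are written in its dialect, with the Atom namespace bound to the hypothetical prefix "a".

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}

# Hypothetical Atom feed with a single entry.
atom_doc = ET.fromstring("""
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Weblog</title>
  <entry>
    <title>What I Did Today</title>
    <author><name>Bijan Parsia</name></author>
    <updated>2008-02-13T18:30:02Z</updated>
  </entry>
</feed>
""")

# Hypothetical hAtom fragment: the structure lives in class attributes.
hatom_doc = ET.fromstring("""
<div class="hfeed">
  <h1>My Weblog</h1>
  <div class="hentry">
    <h2 class="entry-title">What I Did Today</h2>
    <p class="entry-content">Taught a class and it went very well.</p>
  </div>
</div>
""")

# //title also matches the feed title; //entry/title does not.
all_titles = [t.text for t in atom_doc.findall(".//a:title", ATOM)]
entry_titles = [t.text for t in atom_doc.findall(".//a:entry/a:title", ATOM)]

# The hAtom query selects on the class attribute rather than the element name.
hatom_titles = [e.text for e in hatom_doc.findall(".//*[@class='entry-title']")]
```

The exact-match test on @class misses elements that carry several class names, which is one reason microformat queries tend to be more fragile than queries over dedicated element names.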

Monday, 1 March 2010 29 AIEEE! • So many models… – Which is the “right” one? • (For what purpose? Under what constraints?) – What is the model? • a well formed log entry – would contain at a minimum » an author, » a creationDate, » and a permaLink. » And, of course, content. • A weblog – Authentic Voice of a Person (so many entries?) – Reverse Chronological Order (Ah, yes) – On the web • Many encodings of this model! – At every level • Parsing vs. transforming – Some easy to see, others are not

Monday, 1 March 2010 30 Many DOMs

Think about the XQuery needed to get all titles…

Monday, 1 March 2010 31 How to cope?

• With which task? – Authoring, aggregating, querying… • Settle on a core representation of the model – Perhaps the Atom DOM • Coerce/transform/extract other models – To the representative one – Or build software that mediates the difference • Hope that there aren’t too many • Advocate standards! – Or make them – The nice thing about standards is that there are so many of them to choose from. • Kent Pitman and others

Monday, 1 March 2010 32 Postel’s Law Be liberal in what you accept, and conservative in what you send. • Liberality – Many DOMs, all expressing the same thing – Many surface syntaxes (perhaps) for each DOM • Conservativity – What should we send? • It depends on the receiver! – Minimal standards? • Well formed XML? • Valid according to a popular schema/format? • HTML?

Monday, 1 March 2010 33 Structure and Presentation

• We’ve called this “DOM” and “Application” Layer –A very common application layer is “rendering” • Text, images • Like, y’know, the web • Standard vs. default renderings • Goes back to SGML This sentence is false.

This sentence is false. Correct rendering

This sentence is false. Fallback!

(Still see this in XSLT!)

Monday, 1 March 2010 34 Why Separate them?

• Presentation is more fluid than structure –The "look" may need updating • Presentation needs may vary –What works for 21" screens doesn't for mobile phones • (Or maybe not!) • Accessibility –(content should be perceivable by everyone) • Programmatic processing needs


Monday, 1 March 2010 35 Another digression: CSS

• The style language for the Web – Strong separation of presentation • CSS is – not an XML/angle brackets format • Oh NOES! Not another one! – annotative, not transformative • Well, sorta – mostly “formats” nodes – ubiquitous on the Web, esp. client side – works with arbitrary XML • But most clients work with (X)HTML


Monday, 1 March 2010 36 Basic Component

• Rules –Which consist of • Selectors – Like XPath expressions – But only forward, with some syntactic sugar • Declaration blocks –Sets of property/value pairs

div.title { text-align: center; font-size: 24px; }


Monday, 1 March 2010 37 A bit of style

My Weblog
What I Did Today

Taught a class and it went very well.

Try it in http://software.hixie.ch/utilities/js/live-dom-viewer/

Monday, 1 March 2010 38 Media Types

• Different sets of rules can be contextualized to media – Screen, Print, Braille, Aural… • This is done with groupings called “@media rule”s

@media print {
  BODY { font-size: 10pt }
}
@media screen {
  BODY { font-size: 12pt }
}

Larger font size for screen


Monday, 1 March 2010 39 Cascading

• CSS Rules cascade –That is, there is overriding (and non-overriding) inheritance • That is, rules combine in different ways –http://www.w3.org/TR/CSS21/cascade.html#cascade • General principles –Distance to the node is significant –Precision of selectors is significant –Order of appearance is significant


Monday, 1 March 2010 40 Error Handling

• XML has “draconian” error handling – Well formedness error…BOOM • CSS has “forgiving” error handling – “Rules for handling parsing errors” http://www.w3.org/TR/CSS21/syndata.html#parsing-errors • That is, how to interpret illegal documents • Not just reporting errors, but working around them – E.g.,“User agents must ignore a declaration with an unknown property.” • Replace: “h1 { color: red; rotation: 70minutes }” • With: “h1 { color: red }” • Study the error handling rules!


Monday, 1 March 2010 41 CSS Robustness • Has to deal with Web conditions 1. People borrowing 2. People collaborating 3. Different devices 4. Different kinds of audiences (and authors) 5. Maintainability 6. Aesthetics • CSS is designed for this – Cascading & Inheritance help with 1, 2, 5 • And importing, of course – @media rules help with 3-6 – Error handling helps with 1, 2, 4


Monday, 1 March 2010 42 Error Detection and Reporting

Wherein we learn about Schematron

Monday, 1 March 2010 43 Error Reporting

declare function ssd:convertDeclarationOrComment($dec) {
  validate {
    typeswitch ($dec)
      …..
      case element(ref) return ….
      default return...
  }
};

Severity: fatal
Description: Attribute @ref is not allowed on element. The name is in one of the disallowed namespaces for the wildcard

• What’s wrong with this (from your experience)? – What’s accidental – What’s fundamental • What would be better? – Better error messages? – Fewer things being (type) errors? • That we can have, already

Monday, 1 March 2010 44 (Perceived) Affordances • (Perceived) Affordance – an available action that is salient to the actor

Donald Norman, The Design of Everyday Things

Monday, 1 March 2010 45 Attractive Nuisances • A dominant or attractive affordance – with a bad or wrong action – In law, “a hazardous object or condition on the land that is likely to attract children who are unable to appreciate the risk posed by the object or condition” -- ye olde Wikipedia – We can reformulate • “a hazardous or misleading language or UI feature that is likely to be misused by (even) an educated user” • Contrast with “merely” hard to use – An attractive nuisance is easy to attempt, hard to use (correctly), and has bad (to catastrophic) effects

Monday, 1 March 2010 46 Typical Schema Languages • Grammar (and maybe type based) – Recognize all or none • Though what the “all” is can be rather flexible – Restrictive by default • Slogan: What is not permitted is forbidden – Error detection and reporting • Is at the discretion of the system • “Not accepted” is the starting place • The point where an error is detected – might not be the point where it occurred – might not be the most helpful point to look at! • Programs! – Null pointer deref » Is the right point the deref or the setting to null? – Non-crashing errors
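The null-pointer point can be made concrete with a small Python sketch. The function names and the sample data are hypothetical; the only claim is that the error is raised in a different place from where the None was introduced.

```python
def find_entry(entries, wanted_title):
    """Return the matching entry, or None when nothing matches."""
    for entry in entries:
        if entry["title"] == wanted_title:
            return entry
    return None  # the "setting to null" happens here

def author_of(entry):
    return entry["author"]  # the "deref" happens here

entries = [{"title": "What I Did Today", "author": "Bijan Parsia"}]
missing = find_entry(entries, "No Such Post")

try:
    author_of(missing)
    detected_at = None
    message = ""
except TypeError as err:
    # Python reports the failure inside author_of, not inside find_entry.
    detected_at = "author_of"
    message = str(err)
```

The traceback points at author_of, yet the more helpful place to look is arguably the call to find_entry that produced the None.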

Monday, 1 March 2010 47 The SSD Way • Explore before prescribe • Describe rather than define • Take what you can, when you can take it • Extra or missing stuff is (can be) OK – Irregular structure! • Adhere to the task at hand • Adore Postel’s Law

Monday, 1 March 2010 48 Schematron

• A different sort of schema language – Not grammar or object/type based – Rule based – Test oriented – Complementary • Conceptually simple – Patterns contain rules • Rules set a context and contain asserts and reports (A&Rs) • A&Rs contain – Tests, which are XPath expressions, and – Assertions, which are natural language descriptions

Monday, 1 March 2010 49 DTDx Schematron

• “Only 1 Element declaration with a given name” – (Ok, could handle this with Keys in XML Schema!) There can be only one element declaration with a given name. • “Every element reference must have a corresponding element declaration ” There must be an element declaration (with the right name) for elementref to refer to.
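The Schematron markup for these two rules was lost in extraction, so what follows is not Schematron. It is a hedged Python sketch of the same two checks over a made-up DTDx-like document; the element and attribute names (dtd, element, elementref, name) are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: "a" is declared twice; "b" and "d" are never declared.
doc = ET.fromstring("""
<dtd>
  <element name="a"><elementref name="b"/></element>
  <element name="a"/>
  <element name="c"><elementref name="d"/></element>
</dtd>
""")

def check(root):
    """Run both rules and collect a natural-language message per failure."""
    messages = []
    declared = Counter(e.get("name") for e in root.iter("element"))
    for name, count in sorted(declared.items()):
        if count > 1:
            messages.append(
                f"There can be only one element declaration with a given name: {name!r} has {count}."
            )
    for ref in root.iter("elementref"):
        if ref.get("name") not in declared:
            messages.append(
                f"There must be an element declaration for elementref {ref.get('name')!r} to refer to."
            )
    return messages

reports = check(doc)
```

Each rule fires independently and every failure gets its own message, in contrast to a grammar-based validator that rejects the document as a whole.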

Monday, 1 March 2010 50 From HTML5: Exclusions

• HTML5 • http://hsivonen.iki.fi/thesis/ –Relax NG schema –Schematron assertions –Custom code • Often want contextual exclusions –To break circles: •Paragraphs contain footnotes •Footnotes contain paragraphs •Footnote paragraphs may not contain footnotes • Without exclusions, would need many paragraph productions

Monday, 1 March 2010 51 Exclusions Examples

The "dfn" element cannot contain any nested "dfn" elements. The "noscript" element cannot contain any nested "noscript" elements.
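An exclusion of this kind amounts to a descendant test. The Python sketch below expresses it with ElementTree; the helper name nested_violations and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

def nested_violations(root, name):
    """Return every `name` element that sits inside another `name` element."""
    found = []
    for outer in root.iter(name):
        for inner in outer.iter(name):
            # iter() also yields the starting element itself, so skip it.
            if inner is not outer and inner not in found:
                found.append(inner)
    return found

fragment = ET.fromstring(
    "<p><dfn id='outer'>term <dfn id='inner'>nested</dfn></dfn> and <dfn id='ok'>fine</dfn></p>"
)
bad = [e.get("id") for e in nested_violations(fragment, "dfn")]
```

The check is one short function here, whereas a grammar has to thread a “without dfn” variant through every production that can contain phrasing content.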

Monday, 1 March 2010 52 DFN Defined

From:

## Defining Instance:

dfn.elem = element dfn { dfn.inner & dfn.attrs }
dfn.attrs = ( common.attrs & common.attrs.aria? )
dfn.inner = ( common.inner.phrasing )

common.elem.phrasing |= dfn.elem

From:

## Phrase Content

common.inner.phrasing = ( text & common.elem.phrasing* )

Monday, 1 March 2010 53 DFN Redefined

From:

## Defining Instance:

dfn.elem = element dfn { dfn.inner & dfn.attrs }
dfn.attrs = ( common.attrs & common.attrs.aria? )
dfn.inner = ( common.inner.phrasing.without.dfn )
common.elem.phrasing |= dfn.elem
common.elem.phrasing.without.noscript |= dfn.elem
…

We could extend our formalism (again!)

Monday, 1 March 2010 54 Tip of the iceberg

• Computations –Using XPath functions and variables • Dynamic checks –Can pull stuff from other file • Elaborate reports –diagnostics has (value-ofed) expressions –“Generate paths” to errors •Sound familiar? • General case –Thin shim over XSLT –Closer to “arbitrary code”

Monday, 1 March 2010 55 Interesting Points • DTDx has a WXS – Schematron doesn’t care – Two phase validation •RELAX NG has a way of embedding •WXSbis incorporating similar rules • Arbitrary XPath for context and test – Plus variables! • What isn’t forbidden is permitted – Unlike all the other schema languages! – We’re not performing runs • We’re firing rules – Somewhat easy to use • If you know XPath • If you don’t need coverage – What about analysis?

Monday, 1 March 2010 56 Schematron Presumes… • …well formed XML –As do all XML schema languages •Work on DOM! –So can’t help with e.g., overlapping tags •Or tag soup in general •Namespace Analysis!? • …authorial repair –At least, in the default case •Communicate errors to people •Thus, not the basis of a modern browser! –Unlike CSS • Is this enough liberality? –Or rather, does it support enough liberality?

Monday, 1 March 2010 57 Styles of Error Handling

Wherein we reconsider the basics

Monday, 1 March 2010 58 XML Error Handling • De facto XML motto – Be strict about the well formedness of what you accept, and strict in what you send – Draconian error handling – Severe consequences on the Web • And other places • Fail early and fail hard • What about higher levels? – Validity and other analysis? – Most schema languages poor at error reporting • How about XQuery’s type error reporting?

Monday, 1 March 2010 59 XML Error Handling • The spec: – fatal error [Definition: An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).] • What should an application do? – To or for its users
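What this looks like from the application side can be sketched with Python's ElementTree, which raises ParseError at the first well-formedness error and hands back no tree. The wrapper parse_strictly and the overlapping-tags input are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_strictly(text):
    """Return (tree, None) on success or (None, description) on a fatal error."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        line, column = err.position
        return None, f"not well-formed at line {line}, column {column}"

# Hypothetical input with overlapping tags.
tree, problem = parse_strictly("<p>hi <b>there, <i>my</b> good friend</i></p>")
ok_tree, ok_problem = parse_strictly("<p>hi there</p>")
```

The parser's job ends at detecting and reporting; whether to show the user the position, the raw text, or nothing at all is left to the application.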

Monday, 1 March 2010 60 Take the following sample XHTML code:

Hello!

Hello to you!

Can you spot the problem?

Slide due to Iain Flynn

Monday, 1 March 2010 61 HTML:

XHTML:

Slide due to Iain Flynn

Monday, 1 March 2010 62 Validation In The Wild • HTML – 1%-5% of web pages are valid – Validation is very weak! – All sorts of breakage • E.g., overlapping tags • hi there, my good friend • Syndication Formats – 10% feeds not well-formed – Where do the problems come from? • Hand authoring • Generation bugs • String concat based generation • Composition from random sources

Monday, 1 March 2010 63 More recently In 2005, the developers of Google Reader (Google’s RSS and Atom feed parser) took a snapshot of the XML documents they parsed in one day. • Approximately 7% of these documents contained at least one well-formedness error. • Google Reader deals with millions of feeds per day. – That’s a lot of broken documents

Source: http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html
Slide due to Iain Flynn

Monday, 1 March 2010 64

Encoding

Structure

Entity

Typo Text

Slide due to Iain Flynn

Monday, 1 March 2010 65 [Chart: feed errors by category (Encoding, Structure, Entity, Typo); the percentages are not recoverable]

Slide due to Iain Flynn

Monday, 1 March 2010 66 A Thought Experiment • “Imagine...that all web browsers use strict XML parsers” • “...that you were using a publishing tool that [was strict] – “All of its default templates were valid XHTML.” – “It incorporated a nifty layout editor to ensure that you couldn’t introduce any invalid XHTML...” • “You click ‘Publish’” – “the page that you...validly authored is now not well-formed” • Problem: “a trackback with some illegal characters” – “...your publishing tool had a bug” – “The administration page itself tries to display the trackbacks you’ve received, and you get an XML processing error.”

http://diveintomark.org/archives/2004/01/14/thought_experiment

Monday, 1 March 2010 67 Real Life

Monday, 1 March 2010 68 Lesson #1 • We are dealing with socio-political (and economic) phenomena – Complex ones! – Many players; many sorts of player – Lots of historical specifics – Lots of interaction effects • Human factors critical – What do people do (and why?) – How to influence them? – Affordances and incentives – Dealing with “bozos” • “There’s just no nice way to say this: Anyone who can’t make a syndication feed that’s well-formed XML is an incompetent fool.”

Monday, 1 March 2010 69 Error Handling Styles • Draconian – Fail hard and fast • Ignore errors – CSS • Hard coded DWIM repair – HTML, HTML5 • “Repair-sheet”/Schema based repair – Instead of just reports, Schematron rules could trigger repairs • Ultimately, (some) errors are propagated – The key is to fail correctly • In the right way, at the right time, for the right reason – With the right message!
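Three of these styles can be contrasted on one made-up rule, “every entry has an updated element”. The feed, the rule, the read_entries function and the "unknown" repair value are all assumptions for illustration; real repair strategies (HTML5's, for instance) are far more elaborate.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the second entry lacks <updated>.
FEED = (
    "<feed>"
    "<entry><title>A</title><updated>2008-02-13</updated></entry>"
    "<entry><title>B</title></entry>"
    "</feed>"
)

def read_entries(feed_text, style):
    """Return (title, updated) pairs, handling a missing <updated> per `style`."""
    kept = []
    for entry in list(ET.fromstring(feed_text).iter("entry")):
        if entry.find("updated") is None:
            if style == "draconian":
                # Fail hard and fast, but say which entry is at fault.
                raise ValueError(f"entry {entry.findtext('title')!r} has no <updated>")
            if style == "ignore":
                continue  # drop the offending entry, keep the rest
            if style == "repair":
                ET.SubElement(entry, "updated").text = "unknown"
        kept.append((entry.findtext("title"), entry.findtext("updated")))
    return kept

ignored = read_entries(FEED, "ignore")
repaired = read_entries(FEED, "repair")
```

In this sketch the draconian branch names the offending entry in its message, which is one small step toward failing with the right message.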
