XMLXML andand WebWeb ApplicationApplication ProgrammingProgramming

Copyright © 2007 Anders Møller OverviewOverview

SchemaSchema languageslanguages forfor XMLXML XMLXML inin programmingprogramming languageslanguages WebWeb applicationapplication frameworksframeworks

2 PartPart II

Schema languages for XML – what is the essence of DTD / XML Schema / RELAX NG ?

• expressiveness of schema languages • formal models, variations of regular tree grammars

• foundation for Part II

3 PartPart II

Schema languages for XML here yourr – what is cthehem essencea isis oonnee of sspe peDTDcc ww /h ere you ““WW33CC XXMMLL SSchema XML Schema / aRELAXnd you NGrr hh ee?a add ssttiillll ggoo eyes wwililll ssttiillll ttwwiittcchh and you eyes ad it for tthhee tteenntthh ZZZ even wwhheenn yyoouu rreead it for bubuzzzzZZZ even r be • expressivenesshat y ofou schema wwililll nnoo languages lolonngeger be ttimimee.. EExxcceeptpt tthat you urprises..”” ssuurprrpr• formalisiseedd byby models, ssurpr isvariationses of regular tree grammars

• foundation for Part II T 2.0 y, editor ooff WW33CC’s’s XXSLSLT 2.0 ––MMiicchahaeell KKaay, editor

4 PartPart IIII

XML in programming languages – how can we integrate XML processing in programming languages?

• XML schemas as types in programming languages • research projects: –XDuce –XACT

5 PartPart IIIIII

Web application frameworks – why is Web programming so complicated? • Java Servlets and JSP • JWIG • Google Web Toolkit

6 PartPart IIIIII

» Wow ! Tha »WoWebw ! Thannkk ss applicationf oforr r reeleleasing another Jav aframeworks W f asing another Java Weebb a apppplilcicaatiton development rfraammeewwoorrkk. .T he 31,918 framewor ion development The 31,918 frameworkkss c cuurrreenntltyly available don't offer – why is Web programming availabl eso don 'tcomplicated? offereennoouugghh c chhooice. You know, ice. You know ,I I w waass j ujusstt s aying to my friend th saying to my friend thee o otthheerr d daayy that we really do not eennoouugghh J Jaava web development that we really do not h haavvee va web developmentf rfraammeewwoorrkkss. "Why," I said to him days si•nce JavaI read aServletsb and JSP. "Why," I said to him, ," "itit's's b beeeenn t hree days since I read aboouutt a a n neeww w web development fram three of innova eb development frameewwoorrkk o onn T TSS ! Is the pace of innovattioionn s slolowwining in the Java world? A SS ! Is the pace • g in the Java world? Arree w wee t hthatat f ar along the road to oobbssoolelessccenJWIGce ? Can death be far along the road to ence ? Can death bef afarr a awwaayy ? ?"" But now I feel bet But no•w I Googlefeel betteerr. .[ Web[...].] w whh eToolkitn I saw the AJAX wo agilen en I saw the AJAX worrdd I I k knneeww t ht at the richness and agileneessss o of ft hthee e end-user experience w hat the richness and nd-user experience woouuldld b bee e exxceeded only by the us ssccaaleleability, and, most imp ceeded only by the usaabbiliiltityy, , ability, and, most impoorrtatanntt o of fa all, testability of applica [>]] . . ced by the You are You are s soommee w wilidld and crazy guys, and and crazy guys, andI I, ,f oforr o onnee, ,a am going to nominate ththiningg n ame>] for the Web A m going to nominate [ [<] for the Web Apppplilcicaattioionn D evelopment Framew I'm serious ! Yo Development Framewoorrkk o of ft thhee W Week. No, I'm serious ! Youu'v'vee e eaarrnneedd it ! Stand up and tak eek. No, it ! Stand up and takee a a b booww ! ! « « ––on theserverside.c on theserverside.coommaass a a c comment to a tool annou omment to a tool announncceemmeenntt 7 ObjectivesObjectives

These topics illustrate a few areas in (XML∪WWW) ∩ Programming Languages At different stages: • XML and schemas: now well-understood, essential to other areas of XML/WWW • XML programming: many proposals, (still) no “silver bullet” • Web programming: new frameworks appear frequently, needs consolidation Lots of remaining research opportunities!

8 PrerequisitesPrerequisites

I assume that you have a basic knowledge of

• Regular languages

•Java (and ML)

• WWW and XML/HTML

9 AA tourtour ofof schemaschema languageslanguages forfor XML:XML:

DTDDTD,, XMLXML SchemaSchema,, andand RELAXRELAX NGNG OverviewOverview

Motivation DTD XML Schema RELAX NG

– focus on language constructs that are essential to expressiveness

11 Example:Example: XMLXML forfor businessbusiness cardscards

JohnJohn DoeDoe CEO,<title>CEO, WidgetWidget Inc.Inc. [email protected][email protected] the contents (202)(202) 555-1414555-1414 of the card uri="widget.gif"/> element cardcard XML document namename titletitle ...

JohnJohn DoeDoe CEO,CEO, ...... XML tree An XML-based language is a set (with some semantics) of XML documents 12 SchemasSchemas forfor XMLXML

Validation of XML documents: •A schema describes the syntax of an XML-based language •A schema language (DTD / XML Schema / RELAX NG / ...) is a formal notation for specifying schemas – like BNF for programming language syntax •A schema processor checks validity of a document, relative to a schema

13 ThisThis presentationpresentation isis basedbased on...on...

Anders Møller and Michael Schwartzbach, An Introduction to XML and Web Technologies, Addison-Wesley, 2006

Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi, Taxonomy of XML schema languages using formal language theory, ACM Trans. Internet Techn. 5(4): 660- 704, 2005

14 XMLXML inin programmingprogramming languageslanguages

Many recent programming languages and APIs are designed for processing XML data

schema ~ type Type checking? Type inference??

We need formal models of schemas and schema languages!

15 OverviewOverview

Motivation DTD XML Schema RELAX NG

16 DTDDTD –– DocumentDocument TypeType DefinitionDefinition

Introduced with XML 1.0 Focuses on elements and attributes Little control of attribute values and chardata No support for namespaces Widely used (still!)

17 ElementElement declarationsdeclarations

> where content-model is generally a restricted regular expression over element names and #PCDATA Observations: the content model of a particular element only depends on its name! content models must be deterministic limited control over chardata

18 AttributeAttribute declarationsdeclarations

> Each attribute definition consists of • an attribute name • an attribute type (CDATA, enum, ID, IDREF, ...) •a default declaration (#REQUIRED, #IMPLIED, ...) Observations: the requirements for a particular attribute only depends on its name and the name of the element! limited control over attribute values

19 ExampleExample

,title,email,phone?,logo?)>

(#PCDATA)> (#PCDATA)> (#PCDATA)> (#PCDATA)>

EMPTY> #REQUIRED>

20 GeneralGeneral observationsobservations

DTD can only express

“local” properties cardcard on XML trees! namename titletitle ...

An XML document JohnJohn DoeDoe CEO,CEO, ...... can be validated by checking the element nodes in any order

21 OverviewOverview

Motivation DTD XML Schema RELAX NG

22 XMLXML SchemaSchema

W3C’s design goal: make “a better DTD”...

Successes: • Namespace support • Modularization • Data types • Type derivation mechanism Critique: • lots... it’s an easy prey ☺

23 TypesTypes andand declarationsdeclarations

Simple type definition: defines a family of Unicode text strings (i.e. no markup)

Complex type definition: defines a content and attribute model

Element declaration: associates an element name with a simple or complex type

Attribute declaration: associates an attribute name with a simple type

24 ExampleExample (1/2)(1/2)

://businesscard.org">

type="b:card_type"/> type="string"/> type="string"/> type="string"/> type="string"/> type="b:logo_type"/>

type="anyURI"/>

25 ExampleExample (2/2)(2/2)

name="card_type"> ref="b:name"/> ref="b:title"/> ref="b:email"/> minOccurs="0"/> minOccurs="0"/>

name="logo_type"> use="required"/>

26 ComplicatedComplicated structuralstructural featuresfeatures

Type derivation by restriction or extension Substitution groups Overloading with local declarations ...

– many of these features are not essential to the formal expressiveness

27 OverloadingOverloading withwith locallocal declarationsdeclarations

name=”cardlist”> ”/> .../>

name=”card”> ”/> ...... 28 TheThe ElementElement DeclarationsDeclarations ConsistentConsistent rulerule

If a content model contains two or more element declarations with the they must have the

name=”E”>same element name same type definition ”/> ”/> then

Simplifies implementation, central for formal modeling of schemas... 29 UniqueUnique interpretationsinterpretations

For some applications, an important side-effect of validation is that each (i.e. antree interpretation) node is assigned a Aka. the “post-schema-validation infoset” schema type

The EDC rule in XML Schema ensures unique interpretations!

30 WhatWhat’’ss beingbeing usedused inin practice?practice?

A study of 819 schemas shows: simple types and restrictions are heavily used! complex type extensions used in only 20% substitution groups used in only 6% only 27% pass the Schema Quality Check • of those, the element structure could in 85% be expressed using DTD! (i.e. no overloading) • in most of the remaining 15%, types only depend on the parent context! – Any good explanations?

[Martens et al., TODS] 31 GeneralGeneral observationsobservations

The type associated with an element may (only) card depend on the node card itself and its ancestors namename titletitle ...

JohnJohn DoeDoe CEO,CEO, ......

The crucial mechanisms are • overloading with local declarations • the EDC rule

32 OverviewOverview

Motivation DTD XML Schema RELAX NG

33 RELAXRELAX NGNG

OASIS + ISO competitor to XML Schema

Designed for simplicity and expressiveness, solid mathematical foundation

34 ProcessingProcessing modelmodel

For a valid instance document, the root element must match a designated pattern

A pattern may match elements, attributes, or character data

Element patterns can contain sub- patterns that describe contents and attributes

35 PatternsPatterns

... ...

... (concatenation) ... (union) ... (Kleene star)

... ... ...

36 ExampleExample

name="card"> name="name"> name="title"> name="email"> name="phone"> name="logo"> name="uri"> 37 GrammarsGrammars

Pattern definitions and references allow description of recursive structures

...> ......

name="..."> ...... ......

38 GrammarGrammar simplificationsimplification

The spec explains a simplification process

§4.19: • every element is the child of a define • the child of every define is an element (i.e. we may assume a 1-1 correspondence between named pattern definitions and element patterns) ...

(we need this later...)

39 NoNo surprisingsurprising restrictions!restrictions!

"> ”> name=”P”/> name=”A”> name=”B”> ”> 40 GeneralGeneral observationsobservations

RELAX NG schemas can express many card “non-local” card properties namename titletitle ... on XML trees! JohnJohn DoeDoe CEO,CEO, ......

Validation is more difficult but can be performed in linear time (in the size of the XML tree)! Unique-interpretations property fails!

We shall later examine the formal model underlying the design of RELAX NG... 41 TranslatingTranslating schemasschemas

Trang

DTD XML Schema

Trang Trang Trang Trang dk.brics.schematools

RELAX NG

Trang: http://www.thaiopensource.com/relaxng/trang.html dk.brics.schematools: http://www.brics.dk/schematools/ 42 FormalFormal modelsmodels ofof XMLXML schemas:schemas: regularregular treetree grammarsgrammars andand subclassessubclasses ∀ Σ δ ε ⇒ SchemasSchemas asas formalformal languageslanguages

Let S be a schema (written in DTD, XML Schema, or RELAX NG)

Define L(S ) as the set of XML documents that are valid relative to S:

L(S )= {X | X is valid relative to S }

44 DecisionDecision problemsproblems

[membership] The problem “given a schema S and an XML document X, is X∈ L(S ) ?” is (fortunately) decidable

How about the following?

[equality] “given two schemas S1 and S2, is L(S1) = L(S2) ?”

[subset] “given two schemas S1 and S2, ⊆ is L(S1) L(S2) ?” [emptiness] “given a schema S, is L(S )=Ø ?” 45 ClosureClosure propertiesproperties

Is DTD (or XML Schema or RELAX NG) closed under union? ... intersection? ... complement?

AllAll thesethese decisiondecision problemsproblems andand closureclosure propertiesproperties areare relevantrelevant whenwhen consideringconsidering programming languages! programmingschemasschemas languages! asas typestypes

46 inin OverviewOverview all sets of XML trees

regular tree grammars ~ RELAX NG single-type tree grammars ~ XML Schema

local tree grammars ~ DTD

[Murata et al., TOIT 2005] 47 RegularRegular languageslanguages

You know about regular languages on strings... (finite automata, regular grammars, etc.)

We shall now work with regular languages on trees

48 WhichWhich kindskinds ofof trees?trees?

Labeled, ordered, unranked trees • each node has a label (i.e. the element name; we ignore attributes and character data...) • each node has an ordered sequence of child nodes • the child sequences

may have cardcard different lengths namename titletitle ...

JohnJohn DoeDoe CEO,CEO, ......

49 ContextContext--freefree grammarsgrammars (over(over strings)strings)

A context-free grammar is a 4-tuple G = (N, T, S, P ) where • N is a finite set of nonterminals • T is a finite set of terminals • S is a set of start symbols, where S ⊆ N • P is a finite set of production rules of the form X → α where X ∈ N and α ∈ (T∪N )*

The language of G, denoted L(G ), is the set of strings over T that can be derived from a start symbol (with the usual definition of “derived”...) 50 RegularRegular treetree grammarsgrammars

A regular tree grammar is a 4-tuple G = (N, T, S, P ) where NN:: types types • N is a finite set of nonterminalsTT:: element element names names • T is a finite set of terminals • S is a set of start symbols, where S ⊆ N • P is a finite set of production rules of the form X → a[r ]where X∈N, a∈T, and r is a regular expression over N

The language of G, denoted L(G ), is the set of trees over T that can be derived from a start symbol (with the obvious definition of “derived”...) 51 ExampleExample

LetLet GG == ((NN,, TT,, SS,, PP )) bebe defineddefined byby ••NN == {{DocDoc,, Para1Para1,, Para2Para2,, PcdataPcdata}} N : types ••TT == {{docdoc,, parapara,, pcdatapcdata}} N : types TT:: element element names names ••SS == {{DocDoc}} ••PP == {{ DocDoc →→ docdoc[[Para1Para1,,Para2Para2*],*], Para1Para1 →→ parapara[[εε],], Para2 →→ para[Pcdata], Para2 para[Pcdata], doc PcdataPcdata →→ pcdatapcdata[[εε]] }}

A tree that can be derived from G : para para (The corresponding XML document: text chunk ) pcdata What is L(G )? 52 RegularRegular treetree grammarsgrammars andand finitefinite--statestate automataautomata Everything you know about regular languages and finite automata on strings generalizes elegantly to trees! • grammars vs. automata • automata operations • closure properties • decision procedures • ... – and (hopefully) also your intuition about expressibility and limitations!

53 RegularRegular treetree grammarsgrammars vs.vs. RELAXRELAX NGNG

From (simplified) RELAX NG schema to regular tree grammar*:

• named pattern ⇒ nonterminal

• element name ⇒ terminal

* Ignoring name classes, interleave, attributes, and chardata... 54 ExampleExample

...> name=”Start”/> name=”Start”> name=”addressBook”> name=”Card”/> ...name=”Card”>... ...... StartStart →→ addressBookaddressBook[[CardCard*]*] CardCard →→ ...... 55 LocalLocal treetree grammarsgrammars

A local tree grammar is a regular tree grammar G = (N, T, S, P ) where P contains no two productions on the form X → a[r ] Y → a[r’ ] where X≠Y, a∈T, and r and r’ are regular expressions over N

–in other words, the terminal (=element name) uniquely determines the regular expression (=content model)

56 ExampleExample

LetLet GG == ((NN,, TT,, SS,, PP)) bebe defineddefined byby ••NN == {{DocDoc,, Para1Para1,, Para2Para2,, PcdataPcdata}} ••TT == {{docdoc,, parapara,, pcdatapcdata}} ••SS == {{DocDoc}} ••PP == {{ DocDoc →→ docdoc[[Para1Para1,,Para2Para2*],*], Para1Para1 →→ parapara[[εε],], Para2Para2 →→ parapara[[PcdataPcdata],], PcdataPcdata →→ pcdatapcdata[[εε]] }}

–is G a local tree grammar?

57 LocalLocal treetree grammarsgrammars vs.vs. DTDDTD

In DTD, the element name uniquely determines the content model !

From DTD schema to local tree grammar*:

(bar|baz+)> > >

FooFoo →→ foofoo[[BarBar|Baz+]|Baz+] BarBar →→ barbar[...][...] BazBaz →→ bazbaz[...][...] * Ignoring attributes and chardata... 58 RegularRegular treetree grammarsgrammars vs.vs. XMLXML SchemaSchema

[Murata et al., TOIT 2005] explains how XML Schema can be modeled with regular tree grammars

• complex type ⇒ nonterminal • element declaration ⇒ terminal

We omit the (many) details.....

59 SingleSingle--typetype treetree grammarsgrammars

Let G = (N, T, S, P ) be a regular tree grammar ∈ Two different nonterminals X,Y N compete if P contains two productions on the form X → a[...] Y → a[...] for some a∈T G is a single-type tree grammar if • for each production rule in P, no two nonterminals on the right-hand-side compete, and • start symbols in S do not compete G local ⇒ G single-type

60 ExampleExample

LetLet GG == ((NN,, TT,, SS,, PP)) be be defineddefined byby ••NN == {{DocDoc,, Para1Para1,, Para2Para2,, PcdataPcdata}} ••TT == {{docdoc,, parapara,, pcdatapcdata}} ••SS == {{DocDoc}} ••PP == {{ DocDoc →→ docdoc[[Para1Para1,,Para2Para2*],*], Para1Para1 →→ parapara[[εε],], Para2Para2 →→ parapara[[PcdataPcdata],], PcdataPcdata →→ pcdatapcdata[[εε]] }}

–is G a single-type tree grammar?

61 SingleSingle--typetype treetree grammarsgrammars vs.vs. XMLXML SchemaSchema

Why the restriction of nonterminal competition?

• it implies that types can be uniquely determined from element names when processing a child sequence!

• it exactly corresponds to the infamous “Element Declarations Consistent” restriction in XML Schema!

62 ExampleExample ofof violationviolation

name=”E”> violates ”/> ”/> EDC!

→→ ...... EE[[FF11||FF22]] →→ F1 and F2 FF11 FF[[XX]] →→ compete! FF22 FF[[YY]]

63 AA designdesign bug...bug...

XML Schema can essentially be modeled as single-type tree grammars, but...

the any constructs in XML Schema makes it possible to express certain non -single-type tree grammars!

See example in [Murata et al., TOIT 2005]

64 ClosureClosure propertiesproperties

∩ ∪ \

regular (RELAX NG)

single-type (XML Schema) ÷ ÷

local (DTD) ÷ ÷

[Murata et al., TOIT 2005] 65 SummarySummary

all sets of XML trees regular tree grammars ~ RELAX NG single-type tree grammars ~ XML Schema

local tree grammars ~ DTD

– foundation for treating XML schemas as types in XML programming languages

66