The automaton approach to XML schema languages: from practice to theory

Frank Neven

Theoretical Computer Science Group, Hasselt University, Agoralaan, 3590 Diepenbeek, Belgium

27 February 2006

Frank Neven (Hasselt University) Automata and XML schema languages 27 February 2006 1 / 109 Introduction to XML Outline

1 Introduction to XML

2 Document Type Definitions

3 Unranked Tree Automata

4 Extended Document Type Definitions
    Definition
    XML Schema
    Properties of single-type EDTDs
    Single-type EDTDs in practice
    1-pass preorder typing
    Relax NG

5 Decision problems for XML schema languages

6 Conclusion

XML is a data exchange format

W3C standard

[Figure: heterogeneous sources (a geographical database, an OODB, a relational database, a car retailer, car reviews) exchange XML with a user over the Internet]

A self-describing data format

Example
<store>
  <dvd>
    <title>Fabuleux destin d’Amelie</title>
    <price>17</price>
  </dvd>
  <dvd>
    <title>Goodbye Lenin</title>
    <price>20</price>
    <discount>4</discount>
  </dvd>
</store>

start tag: <title>
element: <title>...</title>
end tag: </title>

XML as a hierarchical structure

Example
store
  dvd
    title
      "Amélie"
    price
      17
  dvd
    title
      "Good bye, Lenin!"
    price
      20
    discount
      4

Attributes

Example
<store name="DVDPlanet">
  <dvd category="romance">
    <title>Fabuleux ... d’Amelie</title>
    <price>17</price>
  </dvd>
  <dvd category="drama">
    <title>Goodbye Lenin</title>
    <price>20</price>
    <discount>4</discount>
  </dvd>
</store>

XML as a hierarchical structure

Example
store[name="DVDPlanet"]
  dvd[category="romance"]
    title
      "Amélie"
    price
      17
  dvd[category="drama"]
    title
      "Good bye, Lenin!"
    price
      20
    discount
      4

Trees as conceptual abstraction of XML documents

XML documents are ordered unranked trees over a finite alphabet Σ of tag names. We assume an infinite set of data values D for attribute and leaf values.

Flexibility of XML: representation of the relational model

Relation R
A   B
a1  b1
a2  b2

XML encoding
<R>
  <tuple><A>a1</A><B>b1</B></tuple>
  <tuple><A>a2</A><B>b2</B></tuple>
</R>

XML tree
R
  tuple
    A
      a1
    B
      b1
  tuple
    A
      a2
    B
      b2
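The encoding of a relation as an XML tree can be sketched in a few lines. This is only an illustration: the function name `relation_to_xml` and the choice of Python's standard `xml.etree.ElementTree` module are assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Encode a relation as an XML element (hypothetical helper).

    The root carries the relation name, each row becomes a <tuple> child,
    and each attribute value becomes a child of that tuple.
    """
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = value
    return root

# The relation R(A, B) with tuples (a1, b1) and (a2, b2).
r = relation_to_xml("R", ["A", "B"], [("a1", "b1"), ("a2", "b2")])
xml_text = ET.tostring(r, encoding="unicode")
print(xml_text)
```

The column names become tag names, so the tag alphabet stays finite while the values a1, b1, ... live in the data domain D.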

XML schema languages

Schema
A schema defines the set of allowable tags and the way they can be structured.

Advantages
- automatic validation
- automatic integration of data
- automatic translation
- query optimization
- provides a user with a concrete semantics of the document
- aids in the specification of meaningful queries over XML data

XML schema languages

Example
- DTDs (W3C)
- XML Schema (W3C)
- Relax NG (Clark, Murata)
- several dozen others (DSD, Schematron, ...)

In formal language theoretic terms
A schema defines a tree language.

Overview of XML Theory

Cross fertilization

[Figure: triangle linking XML, Automata, and Logic]

Different sorts of automata: grammars, tree automata, tree-walking automata, register automata, transducers, ...

Automata serve as
- an algorithmic toolbox
- an abstract formal model of schema languages, query and pattern languages

Summary slide

What to remember?
- XML is an international standard
- XML documents or XML data are simply ordered unranked labeled trees with data values
- a schema defines a tree language (no data values)

Focus of this talk
Automata as a formal model for schema languages

Document Type Definitions (DTDs)

Example
<!DOCTYPE store [
  <!ELEMENT store (dvd+)>
  <!ELEMENT dvd (title, price, discount?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ELEMENT discount (#PCDATA)>
]>

Corresponding grammar (start symbol store)
store → dvd dvd∗
dvd → title price (discount + ε)
title → DATA
price → DATA
discount → DATA

XML Document
store
  dvd
    title
      "Amélie"
    price
      17
  dvd
    title
      "Good bye, Lenin!"
    price
      20

No data values

XML Document
store
  dvd
    title
    price
  dvd
    title
    price

Corresponding grammar (start symbol store)
store → dvd dvd∗
dvd → title price (discount + ε)

Extended context-free grammars as a formal abstraction

Definition
A DTD is a pair (d, s_d) where
- s_d ∈ Σ is the start symbol
- d maps every Σ-symbol to a regular expression over Σ

Definition
A tree t satisfies d (is valid) iff
- the root of t is labeled s_d
- for every vertex v labeled a, the string formed by the labels of the children of v belongs to d(a).

DTD validator
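The validity condition above translates almost literally into a recursive check. The following is a minimal sketch, not a production validator: trees are assumed to be pairs `(label, children)` without data values, and each content model d(a) is written as a Python regular expression over tag names that are each followed by one space (an encoding chosen here only to keep the example short).

```python
import re

def valid(tree, d, start):
    """Check that `tree` satisfies the DTD (d, start)."""
    label, _ = tree
    return label == start and locally_valid(tree, d)

def locally_valid(tree, d):
    """Check the content model of every vertex in `tree`."""
    label, children = tree
    # String formed by the children's labels; the trailing space after each
    # label lets an ordinary regex handle multi-character tag names.
    child_string = "".join(child[0] + " " for child in children)
    if label not in d or re.fullmatch(d[label], child_string) is None:
        return False
    return all(locally_valid(child, d) for child in children)

# The store grammar with data values dropped (title, price, discount are leaves).
d = {
    "store": r"dvd (dvd )*",
    "dvd": r"title price (discount )?",
    "title": r"",
    "price": r"",
    "discount": r"",
}

good = ("store", [
    ("dvd", [("title", []), ("price", [])]),
    ("dvd", [("title", []), ("price", []), ("discount", [])]),
])
bad = ("store", [("dvd", [("price", []), ("title", [])])])

print(valid(good, d, "store"), valid(bad, d, "store"))
```

Python's `re` engine backtracks, so this sketch says nothing about the one-pass matching that deterministic content models are meant to enable.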

Optimization questions

Schema containment (⊆)
Given: schemas d1, d2
Question: Is L(d1) ⊆ L(d2)?

DTD containment reduces to containment of regular expressions:

d1 ⊆ d2 iff d1(a) ⊆ d2(a), ∀a ∈ Σ

(when d1 and d2 are trimmed).

Theorem (Meyer, Stockmeyer, 1973)
Containment of regular expressions is PSPACE-complete.

Corollary
DTD containment is PSPACE-complete.

Regular Expressions in DTDs Should Be Deterministic

How accurate is our abstraction?

Backward compatibility with SGML
The XML specification requires regular expressions to be deterministic: for every symbol in the input string, we can uniquely determine which symbol in the regular expression it should match, without looking ahead in the input string.

Example
The expression (a + b)∗a is not deterministic. Counterexample: baa. After reading b, the next a matches the first a of the expression if more input follows and the last a otherwise.
The expression b∗a(b∗a)∗ is deterministic.

Why this restriction?

Regular Expressions in DTDs Should Be Deterministic

Relevant questions
1 How do we recognize deterministic regular expressions? → DTD validator
2 Can every regular language be denoted by a deterministic regular expression?
3 Are deterministic regular languages a robust class?
4 If a regular expression is not deterministic, can you find an equivalent one that is? → smart DTD validator

Formalization by Brüggemann-Klein and Wood [1998]

Definition
A marking r′ of a regular expression r is an assignment of numbers to every symbol in r. For w ∈ L(r′), we denote by w# the corresponding unmarked string in L(r).

Example
(a₁ + b₂)∗a₃ is a marking of (a + b)∗a.
For w = b₂a₁a₃, w# = baa.

Definition
A regular expression r is deterministic (one-unambiguous) iff there are no strings uxv, uyw ∈ L(r′) with
- |x| = |y| = 1,
- x ≠ y (x and y are different marked symbols),
- x# = y# (their unmarking is the same).

Example
(a + b)∗a is not deterministic: take uxv = b₂a₁a₃ and uyw = b₂a₃, that is, u = b₂, x = a₁, v = a₃ and y = a₃, w = ε.

Tool
Glushkov construction preserves one-step unambiguity.

Glushkov automaton for b∗a(b∗a)∗

Marked expression: b₁∗a₂(b₃∗a₄)∗

[Figure: step-by-step construction of the Glushkov automaton, with one state per marked symbol b₁, a₂, b₃, a₄; in the last step the marks are dropped, leaving the labels b, a, b, a]

Glushkov automaton construction

[Figure: Glushkov automata side by side for the marked expressions b₁∗a₂(b₃∗a₄)∗ (states b₁, a₂, b₃, a₄) and (a₁ + b₂)∗a₃ (states a₁, b₂, a₃)]

Recognition of deterministic regular expressions

Theorem (Book et al 1971, Brüggemann-Klein, Wood, 1998)
- A regular expression is deterministic (one-unambiguous) iff its Glushkov automaton is deterministic.
- It is decidable in quadratic time whether a regular expression is deterministic.
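The first part of the theorem suggests a direct test: mark the expression, compute the first and follow sets that define the Glushkov automaton, and look for two distinct marked symbols with the same unmarking among the possible next positions. The sketch below follows that idea; the tuple-based expression syntax (`("sym", a)`, `("cat", r, s)`, `("alt", r, s)`, `("star", r)`) and the function names are assumptions made for illustration, and no attempt is made to reach the quadratic bound.

```python
from itertools import count

def glushkov(expr):
    """Return (first, follow, symbol_of) for the marked version of expr."""
    counter = count(1)
    symbol_of = {}  # position (mark) -> unmarked symbol
    follow = {}     # position -> positions that may be read next

    def walk(e):
        # Returns (nullable, first positions, last positions) of e.
        kind = e[0]
        if kind == "sym":
            p = next(counter)
            symbol_of[p] = e[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "star":
            _, first, last = walk(e[1])
            for p in last:
                follow[p] |= first
            return True, first, last
        n1, f1, l1 = walk(e[1])
        n2, f2, l2 = walk(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        # kind == "cat": positions ending e[1] may be followed by those starting e[2]
        for p in l1:
            follow[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last

    _, first, _ = walk(expr)
    return first, follow, symbol_of

def is_deterministic(expr):
    """True iff no state of the Glushkov automaton has two successors with the same symbol."""
    first, follow, symbol_of = glushkov(expr)

    def clash(positions):
        symbols = [symbol_of[p] for p in positions]
        return len(symbols) != len(set(symbols))

    return not clash(first) and not any(clash(s) for s in follow.values())

a, b = ("sym", "a"), ("sym", "b")
not_det = ("cat", ("star", ("alt", a, b)), a)            # (a + b)*a
block = ("cat", ("star", b), a)                          # b*a
det = ("cat", block, ("star", block))                    # b*a(b*a)*
print(is_deterministic(not_det), is_deterministic(det))
```

For (a + b)∗a the initial state can move to both a₁ and a₃ on the symbol a, which is exactly the clash exhibited by the strings b₂a₁a₃ and b₂a₃ in the definition.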

Properties of deterministic regular languages

Theorem (Brüggemann-Klein, Wood, 1998)
- Not every regular language can be denoted by a deterministic regular expression. E.g., (a + b)∗a(a + b).
- Deterministic regular languages are not closed under union, concatenation, or Kleene-star. (No syntax for deterministic regular languages.)
- It can be decided in PTIME whether a DFA denotes a deterministic regular language. (Orbit property.)
- If it exists, an equivalent deterministic regular expression can be constructed in exponential time.

Results provide formal machinery for dealing with DTDs.

Complexity of basic decision problems revisited

Schema containment (⊆)
Given: schemas d1, d2
Question: Is L(d1) ⊆ L(d2)?

DTD containment reduces to containment of regular expressions:

d1 ⊆ d2 iff d1(a) ⊆ d2(a), ∀a ∈ Σ

(when d1 and d2 are trimmed).

Theorem
Containment of DTDs with deterministic regular expressions is in PTIME.
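One way to see the PTIME bound: a deterministic regular expression yields a DFA directly (its Glushkov automaton), and containment between DFAs can be decided by exploring pairs of states. The sketch below assumes each content model has already been turned into a DFA given as `(start, finals, delta)` with a possibly partial `delta`; this representation and the function names are illustrative assumptions.

```python
from collections import deque

def dfa_contained(a, b):
    """Decide L(a) ⊆ L(b) for DFAs given as (start, finals, delta).

    delta maps (state, symbol) to a state and may be partial; a missing
    transition in b is modelled by the implicit sink state None.
    """
    (sa, fa, da), (sb, fb, db) = a, b
    seen = {(sa, sb)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in fa and q not in fb:
            return False  # some word is accepted by a but not by b
        for (state, sym), p2 in da.items():
            if state != p:
                continue
            q2 = db.get((q, sym))  # None once b has no matching transition
            if (p2, q2) not in seen:
                seen.add((p2, q2))
                queue.append((p2, q2))
    return True

def dtd_contained(d1, d2):
    """Symbol-wise containment test; assumes d1 and d2 are trimmed."""
    return all(sym in d2 and dfa_contained(d1[sym], d2[sym]) for sym in d1)

# DFA for b*a(b*a)* (all strings over {a, b} that end in a) and for a a*.
ends_in_a = (0, {1}, {(0, "b"): 0, (0, "a"): 1, (1, "a"): 1, (1, "b"): 0})
a_plus = (0, {1}, {(0, "a"): 1, (1, "a"): 1})
print(dfa_contained(a_plus, ends_in_a), dfa_contained(ends_in_a, a_plus))
```

The number of explored pairs is at most the product of the two state counts plus the pairs involving the sink, which keeps the test polynomial.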

Summary slide

What to remember?
- XML DTDs are context-free grammars with deterministic regular expressions
- deterministic regular expressions are a semantic notion: no easy syntax, non-transparent to users
- advantage: optimization problems are tractable

Question
What is the largest robust class of regular expressions that can be translated to DFAs in PTIME?

Unranked Tree Automata

Deterministic Tree Automata over Binary Trees

Definition
Formally, M = (Q, Σ, δ, F) with Q = {f, t}, Σ = {0, 1, ∧, ∨}, F = {t}, and

δ(0) = f          δ(1) = t
δ(f, f, ∧) = f    δ(f, f, ∨) = f
δ(t, f, ∧) = f    δ(t, f, ∨) = t
δ(f, t, ∧) = f    δ(f, t, ∨) = t
δ(t, t, ∧) = t    δ(t, t, ∨) = t

[Figure: the tree ∧(∨(0, 1), ∧(1, 1)) labeled bottom-up with states: the leaves 0, 1, 1, 1 get f, t, t, t; both inner nodes get t; the root gets t, so the tree is accepted]
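The bottom-up run of M can be sketched as a short recursive evaluation. The tree representation (strings for leaves, triples for inner nodes) and the names `LEAF`, `DELTA`, `run`, and `accepts` are assumptions made for this illustration; the transition table is the one given above, with ∧ and ∨ spelled "and" and "or".

```python
# Leaves are the strings "0" and "1"; inner nodes are triples (op, left, right).
LEAF = {"0": "f", "1": "t"}
DELTA = {
    ("f", "f", "and"): "f", ("f", "f", "or"): "f",
    ("t", "f", "and"): "f", ("t", "f", "or"): "t",
    ("f", "t", "and"): "f", ("f", "t", "or"): "t",
    ("t", "t", "and"): "t", ("t", "t", "or"): "t",
}
FINAL = {"t"}

def run(tree):
    """Return the unique state M assigns to the root of `tree`."""
    if isinstance(tree, str):
        return LEAF[tree]
    op, left, right = tree
    return DELTA[(run(left), run(right), op)]

def accepts(tree):
    return run(tree) in FINAL

# The tree from the slide: ∧(∨(0, 1), ∧(1, 1)).
example = ("and", ("or", "0", "1"), ("and", "1", "1"))
print(run(example), accepts(example))
```

Because δ is a function, every tree has exactly one run, which is what makes the automaton deterministic.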

Tree Automata over Binary Trees

Definition
A set of binary trees is regular iff it is accepted by a tree automaton.

Deterministic versus non-deterministic
- Det: δ : Q × Q × Σ → Q
- Non-det: δ : Q × Q × Σ → 2^Q
- Semantics: a tree is accepted if there is a labeling of its nodes with states that is consistent with the transition function and labels the root with an accepting state
- Top-down: δ : Q × Σ → 2^(Q×Q)

Tree Automata over Binary Trees

Robust class
- det. bottom-up TA = non-det. bottom-up TA (subset construction)
- non-det. top-down TA = non-det. bottom-up TA
- Closed under Boolean operations:
    Union, intersection: product construction
    Complement: complete automaton, determinize, swap final and non-final states
- Many equivalent notions: alternating, two-way, tree-walking + restricted pushdown, MSO, ...
- Decision problems: containment is EXPTIME-complete for non-det. TA [Seidl 1990], PTIME-complete for det. TA.
- PTIME minimization for det. TA, unique minimal TA

Bottom-up Tree Automata over Unranked Trees

Binary versus unranked
- binary tree: δ : Q × Q × Σ → Q
- unranked tree: δ : (⋃_{i=0}^{∞} Q^i) × Σ → Q
- specify transition functions by regular string languages over states: δ(q, a) ⊆ Q∗ is a regular language

[Figure: a node labeled a whose children carry states q1, q2, q3 is assigned state q when q1 q2 q3 ∈ δ(q, a)]

Bottom-up Tree Automata over Unranked Trees

[Figure: a tree with root ∧ whose children are labeled ∨, ∧, ∨ and whose nine leaves read 0 1 0 1 1 1 0 1 1; the accepting run assigns f t f t t t f t t to the leaves and t to every inner node, including the root]

Transition function, F = {t}
δ(f, 0) = {ε}                 δ(f, 1) = ∅
δ(t, 1) = {ε}                 δ(t, 0) = ∅
δ(f, ∧) = (f + t)∗f(f + t)∗    δ(t, ∧) = t∗
δ(f, ∨) = f∗                  δ(t, ∨) = (f + t)∗t(f + t)∗
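The run of this unranked automaton can be sketched with ordinary regular expressions over state names. Several things here are assumptions made for the example: states are single characters so that a string of child states is a plain Python string, ∧ and ∨ are spelled "and" and "or", empty transition languages are simply left out of the table, and the grouping of the nine leaves under the three inner nodes is one arrangement consistent with the leaf sequence of the figure.

```python
import re
from itertools import product

FINAL = {"t"}
# (state, label) -> regular expression over the states "f" and "t".
DELTA = {
    ("f", "0"): r"",
    ("t", "1"): r"",
    ("f", "and"): r"[ft]*f[ft]*",
    ("t", "and"): r"t*",
    ("f", "or"): r"f*",
    ("t", "or"): r"[ft]*t[ft]*",
}

def states(tree):
    """Set of states that some run can assign to the root of `tree`."""
    label, children = tree
    child_states = [states(child) for child in children]
    result = set()
    for (q, a), regex in DELTA.items():
        if a != label:
            continue
        # Try every way of picking one reachable state per child.
        if any(re.fullmatch(regex, "".join(w)) for w in product(*child_states)):
            result.add(q)
    return result

def accepts(tree):
    return bool(states(tree) & FINAL)

def leaf(bit):
    return (bit, [])

example = ("and", [
    ("or", [leaf("0"), leaf("1"), leaf("0")]),
    ("and", [leaf("1"), leaf("1"), leaf("1")]),
    ("or", [leaf("0"), leaf("1"), leaf("1")]),
])
print(states(example), accepts(example))
```

Enumerating all combinations of child states is exponential in the number of children; it is used here only because it mirrors the definition of a run most directly.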

Definition
A non-deterministic tree automaton (NTA) is a tuple B = (Q, Σ, δ, F), where
- Q is a finite set of states,
- F ⊆ Q is the set of final states,
- δ is a function Q × Σ → 2^(Q∗) such that δ(q, a) is a regular string language over Q for every a ∈ Σ and q ∈ Q.

History
- Resurrected by Brüggemann-Klein, Murata, Wood [1995-2001] in the context of XML
- Originally: Pair and Quere [1968], Takahashi [1975], Thatcher [1967], ...

Unranked versus Binary

Trading width for depth: first-child next-sibling encoding

[Figure: an unranked tree over the labels a and b and its binary encoding over a, b, and #; enc maps the unranked tree to the binary tree and dec maps it back]
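The first-child next-sibling encoding can be sketched as two mutually inverse functions. The conventions below are assumptions: unranked trees are pairs `(label, children)`, binary trees are triples `(label, left, right)`, and `"#"` stands for a missing first child or next sibling. The slide's figure may place the `#` leaves slightly differently.

```python
def enc_forest(trees):
    """Encode a list of sibling trees: left = first child, right = next sibling."""
    if not trees:
        return "#"
    (label, children), rest = trees[0], trees[1:]
    return (label, enc_forest(children), enc_forest(rest))

def enc(tree):
    """Binary encoding of a single unranked tree."""
    return enc_forest([tree])

def dec_forest(binary):
    """Inverse of enc_forest: rebuild the list of sibling trees."""
    if binary == "#":
        return []
    label, left, right = binary
    return [(label, dec_forest(left))] + dec_forest(right)

def dec(binary):
    """Decode a binary tree whose root has no next sibling."""
    (tree,) = dec_forest(binary)
    return tree

# b with children a and b, where the second child has a single child a.
t = ("b", [("a", []), ("b", [("a", [])])])
print(enc(t))
print(dec(enc(t)) == t)
```

A node with many children becomes a long right spine in the binary tree, which is the sense in which width is traded for depth.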

Binary Regular ≡ Unranked Regular

Theorem [Folklore]
- For every unranked NTA B there is a binary TA A such that L(A) = {enc(t) | t ∈ L(B)}.
- For every binary TA A there is an unranked NTA B such that L(B) = {dec(t) | t ∈ L(A)}.

Properties preserved by the encoding
- closure properties (e.g., Boolean closure)
- equivalent characterizations (e.g., MSO definability)
- decidability (e.g., containment)

Not everything carries over.

Encoding does not preserve complexity

Representation
NTA(S) is the class of NTAs where the transition functions are represented by elements from S. E.g., NTA(NFA), NTA(REG), NTA(2AFA), ...

Emptiness
Given: automaton A
Question: Is L(A) = ∅?

Theorem
- Emptiness of NTA(2AFA) is PSPACE-complete. [Martens, Neven 2003]
- Emptiness of two-way alternating tree automata is EXPTIME-complete. [Vardi 1998, Kupferman, Piterman, Vardi 2002]

Deterministic unranked tree automata are not so deterministic

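Definition
An NTA(DFA) is bottom-up deterministic iff δ(q, a) ∩ δ(q′, a) = ∅ for all q, q′ ∈ Q with q ≠ q′ and all a ∈ Σ.

[Figure: a node labeled a whose children carry states q1, q2, q3 is assigned state q when q1 q2 q3 ∈ δ(q, a)]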

Equivalence of deterministic tree automata

Equivalence
Given: DTA A and B
Question: Is L(A) = L(B)?

Equivalence of deterministic unranked tree automata
1 Compute the complements ¬A and ¬B (in PTIME):
    Make the automaton complete: add δ(q, q′, a) = q_trash for every undefined triple.
    Exchange final and non-final states.
2 Test whether the symmetric difference is empty: (A ∩ ¬B) ∪ (B ∩ ¬A) = ∅ (in PTIME)

Equivalence of deterministic unranked tree automata

Completing unranked deterministic automata is problematic
δ(q_trash, a) = Q∗ − ⋃_{q∈Q} δ(q, a) is exponentially bigger.

Solution
- The binary encoding of a DTA is unambiguous.
- Testing equivalence of unambiguous binary TAs is in PTIME. [Seidl 1990]

Unranked bottom-up DTA(DFA)s are exponentially more succinct than binary bottom-up DTAs [Martens, Niehren 2005]

Minimization

Theorem (Martens, Niehren 2005)
- Minimization of DTA(DFA) is NP-complete.
- There does not always exist a unique minimal DTA(DFA).

Crux
Minimizing DTA(DFA)s is related to minimizing disjoint unions of DFAs: δ(q1, a) ∪ ··· ∪ δ(qn, a).

Other models
- Stepwise tree automata [Carme, Niehren, Tommasi 2004]
- Instead of n automata representing δ(q1, a), ..., δ(qn, a), use one automaton N_a with an output function [Cristau, Löding, Thomas 2005]

Summary slide

What to remember?
Tree automata are a very robust class (much like string automata).
Many properties for unranked automata carry over from the ranked case through the encoding, ... but not all.
A DTA is not 100% deterministic.
XML Schema is usually abstracted by unranked tree automata ... but this is not entirely accurate (as we will explain next).

Questions
Given a DTA A, can you compute ¬A in PTIME?
What is the right notion of deterministic unranked TA?

Extended Document Type Definitions

Extended DTDs
Grammar based approach to unranked regular tree languages

Definition (Papakonstantinou, Vianu, 2000)
Let Σ^ℕ := {σ^n | σ ∈ Σ, n ∈ ℕ} be the alphabet of types; the type σ^n is written σn in the examples (e.g. dvd1, dvd2).

An extended DTD (EDTD) is a tuple D = (Σ, d, s_d), where (d, s_d) is a (finite) DTD over Σ ∪ Σ^ℕ. A tree t is valid w.r.t. D if there is an assignment of types to its nodes such that the typed tree is a derivation tree of d.

Example
store → (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗
dvd1 → title price
dvd2 → title price discount

Tree t

store
  dvd (title "Amélie", price 17)
  dvd (title "Good bye, Lenin!", price 20)
  dvd (title "Gothika", price 15, discount 4)
  dvd (title "Pulp Fiction", price 11, discount 6)


Typed tree t′

store
  dvd1 (title "Amélie", price 17)
  dvd1 (title "Good bye, Lenin!", price 20)
  dvd2 (title "Gothika", price 15, discount 4)
  dvd2 (title "Pulp Fiction", price 11, discount 6)

EDTDs versus Tree Automata

Theorem (Papakonstantinou, Vianu, 2000) NTAs and EDTDs define precisely the class of (homogeneous) regular unranked tree languages.

Example (types written with ^, e.g. ∧^0 is ∧ with type 0)

EDTD                                   NTA
0^0 → ε, 1^1 → ε                       δ(f, 0) = {ε}; δ(t, 1) = {ε}
∧^0 → .∗ (0^0 + ∨^0 + ∧^0) .∗          δ(f, ∧) = .∗ f .∗
∧^1 → (1^1 + ∨^1 + ∧^1)∗               δ(t, ∧) = t∗
∨^1 → .∗ (1^1 + ∨^1 + ∧^1) .∗          δ(t, ∨) = .∗ t .∗
∨^0 → (0^0 + ∨^0 + ∧^0)∗               δ(f, ∨) = f∗
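The NTA of this example can be run bottom-up on a small circuit. The following is only a sketch under stated assumptions: trees are encoded as (label, children) pairs, the labels "and" and "or" stand for ∧ and ∨, and each δ(q, a) is written as a Python regular expression over the one-letter state names f and t. None of these encodings come from the slides.

```python
import re
from itertools import product

# DELTA[(state, label)] is delta(state, label) from the example, written as a
# Python regex over the state names "f" (false) and "t" (true).
# The labels "and" / "or" are an assumed ASCII spelling of the two gate symbols.
DELTA = {
    ("f", "0"): r"",
    ("t", "1"): r"",
    ("f", "and"): r"[ft]*f[ft]*",
    ("t", "and"): r"t*",
    ("t", "or"): r"[ft]*t[ft]*",
    ("f", "or"): r"f*",
}
STATES = ("f", "t")


def reachable_states(tree):
    """Return the set of states the NTA can assign to the root of `tree`."""
    label, children = tree
    child_states = [reachable_states(child) for child in children]
    result = set()
    for state in STATES:
        pattern = DELTA.get((state, label))
        if pattern is None:
            continue  # no transition for this (state, label) pair
        # Try every combination of states that the children can take.
        for choice in product(*child_states):
            if re.fullmatch(pattern, "".join(choice)) is not None:
                result.add(state)
                break
    return result


# The circuit 1 AND (0 OR 1).
circuit = ("and", [("1", []), ("or", [("0", []), ("1", [])])])
print(reachable_states(circuit))
```

Enumerating all combinations of child states is exponential in general; it is used here only to keep the sketch short.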

XML Schema

Rejected: Violates the Constraint.

A formalization of XML Schema: single-type EDTDs

XML Schema 1: Element Declarations Consistent constraint (Section 3.8.6) It is illegal to have two elements of the same name [. . . ] but different types in a content model [. . . ].

Definition (Murata, Lee, Mani, 2001) A single-type EDTD is an EDTD in which no regular expression contains two types bi and bj with i ≠ j.

Not single-type
store → (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗
dvd1 → title price
dvd2 → title price discount


Example
store → regulars discounts
regulars → (dvd1)∗
discounts → dvd2 (dvd2)∗
dvd1 → title price
dvd2 → title price discount


Formal abstraction XML Schema ≈ single-type EDTDs

Immediate Questions
Can you recognize single-type EDTDs? (Trivial XML Schema validator)
What kind of languages can be defined by single-type EDTDs?
Is it decidable whether an EDTD is equivalent to a single-type EDTD? (Smart XML Schema validator)

Properties of single-type EDTDs

Validation and typing

Validation and typing: Given a tree t and an EDTD D = (Σ, d, a0):
validation: is t ∈ L(D), i.e., does there exist a typed version t′ of t with t′ ∈ L(d)?
typing: compute for every b-labeled node its type bi in t′.

Single-type EDTDs: simple top-down typing

Algorithm to validate and type a tree [Murata et al., 2001]
Given: tree t and single-type EDTD D = (Σ, d, a0)
1 Check that the root of t is labeled with a, and assign it type a0.
2 For every interior node u with type bi, test whether the children of u match µ(d(bi)). If so, assign the unique type to every child. Else fail.

µ erases the types, e.g. µ(a1 + b1 c2) = a + bc

Corollary Single-typedness implies unique top-down typing.
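The top-down algorithm can be sketched in a few lines of Python. This is a minimal sketch, assuming a hypothetical encoding that is not part of the slides: trees are (label, children) pairs, and for every type the table RULES stores µ(d(type)) as a Python regex over element names that are each followed by one space, together with the unique type of each element name in that content model. The rules are those of the single-type DVD store example, with extra leaf rules added for title, price and discount.

```python
import re

# Hypothetical encoding of a single-type EDTD: type -> (mu(d(type)), child types).
# Content models are regexes over element names, each name followed by a space.
RULES = {
    "store": (r"regulars discounts ", {"regulars": "regulars", "discounts": "discounts"}),
    "regulars": (r"(dvd )*", {"dvd": "dvd1"}),
    "discounts": (r"dvd (dvd )*", {"dvd": "dvd2"}),
    "dvd1": (r"title price ", {"title": "title", "price": "price"}),
    "dvd2": (r"title price discount ",
             {"title": "title", "price": "price", "discount": "discount"}),
    # Leaf rules (assumed; the slides leave them implicit).
    "title": (r"", {}),
    "price": (r"", {}),
    "discount": (r"", {}),
}


def type_top_down(tree, root_label, root_type, rules):
    """Return the typed copy of `tree`, or None if the tree is not valid."""
    if tree[0] != root_label:
        return None  # step 1: the root must carry the start label

    def visit(node, node_type):
        label, children = node
        content_model, child_type = rules[node_type]
        word = "".join(child[0] + " " for child in children)
        if re.fullmatch(content_model, word) is None:
            return None  # step 2: children do not match mu(d(node_type))
        typed_children = []
        for child in children:
            # Single-typedness makes this lookup unambiguous.
            typed = visit(child, child_type[child[0]])
            if typed is None:
                return None
            typed_children.append(typed)
        return (node_type, typed_children)

    return visit(tree, root_type)


def leaf(name):
    return (name, [])


doc = ("store", [
    ("regulars", [("dvd", [leaf("title"), leaf("price")])]),
    ("discounts", [("dvd", [leaf("title"), leaf("price"), leaf("discount")])]),
])
typed_doc = type_top_down(doc, "store", "store", RULES)
```

Each node is visited once and its type is fixed before its children are inspected, which is the point of the corollary.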

Motivation

Two-pass and ambiguous typing

Example a → b1 + b2, b1 → c, b2 → d

Tree: a with a single child b, which has a single child c. Is b of type b1 or b2? The type of b is known only after its child c has been read (two passes).

Example a → b1 + b2, b1 → c∗, b2 → d∗

Tree: a with a single child b, which has no children. Is b of type b1 or b2? Both typings are correct (ambiguous typing).

Towards a characterization of single-type EDTDs

The Ancestor-String

Notation anc-str^t(u) = the ancestor-string of a tree t at node u, i.e., the string of labels on the path from the root of t to u (including u itself)

Single-type EDTDs: simple top-down typing

Definition

An EDTD D = (Σ, d, s_d) has ancestor-based types if there is a function f : Σ∗ → Σ^ℕ such that, for each tree t ∈ L(D), t has exactly one witness t′ ∈ L(d), and t′ results from t by assigning to each node v the type f(anc-str^t(v)).

Intuition: The type of a node depends on its ancestor-string, and on nothing else.

Proposition When a tree language T is definable by a single-type EDTD, then it has ancestor-based types.

Proof Let T be defined by the single-type EDTD D = (Σ, d, a0). Then define f inductively as follows:
f(a) = a0;
for any string w · a · b with w ∈ Σ∗ and a, b ∈ Σ, f(w · a · b) = bj, where bj occurs in d(ai) and f(w · a) = ai.
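The inductive definition of f translates directly into a loop over the ancestor-string. The sketch below assumes a hypothetical table CHILD_TYPE that records, for each type, the unique type of every element name in its content model (here for the single-type DVD store example). Unlike the total function f of the proof, it returns None when the string is not the ancestor-string of any node in a valid tree.

```python
# Hypothetical table: type -> {element name -> its unique type in d(type)}.
CHILD_TYPE = {
    "store": {"regulars": "regulars", "discounts": "discounts"},
    "regulars": {"dvd": "dvd1"},
    "discounts": {"dvd": "dvd2"},
    "dvd1": {"title": "title", "price": "price"},
    "dvd2": {"title": "title", "price": "price", "discount": "discount"},
}


def ancestor_type(ancestor_string, root_label, root_type, child_type):
    """Compute f(ancestor_string) for a list of labels, or None if undefined."""
    if not ancestor_string or ancestor_string[0] != root_label:
        return None
    current = root_type  # base case: f(a) = a0
    for label in ancestor_string[1:]:
        # inductive case: f(w.a.b) = bj where bj occurs in d(f(w.a))
        current = child_type.get(current, {}).get(label)
        if current is None:
            return None
    return current


print(ancestor_type(["store", "regulars", "dvd"], "store", "store", CHILD_TYPE))
print(ancestor_type(["store", "discounts", "dvd"], "store", "store", CHILD_TYPE))
```

The two calls differ only in the second label of the ancestor-string, which is enough to separate the types dvd1 and dvd2.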

An exchange property for single-type EDTDs

Ancestor-Guarded Subtree Exchange
T is a regular tree language. T is closed under ancestor-guarded subtree exchange if, for all trees t1, t2 ∈ T and nodes u1 of t1 and u2 of t2 with the same ancestor-string, the tree obtained from t1 by replacing the subtree at u1 with the subtree of t2 at u2 is again in T.

Theorem (Martens, Nev., Schw., 2005) A regular tree language is definable by a single-type EDTD iff it is closed under ancestor-guarded subtree exchange.

Proof
⇒ A single-type EDTD has ancestor-based types.
⇐ Compute the single-type closure D′ of the given EDTD D. E.g., a1 → b1 b2 and a2 → b3 becomes

a{1} → b{1,2} b{1,2}
a{2} → b{3}
a{1,2} → b{1,2,3} b{1,2,3} + b{1,2,3}

Obviously, L(D) ⊆ L(D′). Now, L(D) ⊇ L(D′) iff L(D) is closed under ancestor-guarded subtree exchange.
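One rule of the single-type closure can be computed mechanically. The following sketch assumes that typed symbols are written as a lowercase name followed by an index (a1, b3) and that content models are plain strings; the helper closure_rule and this encoding are illustrative assumptions, not taken from the slides. For a set type a{S}, it unions the content models d(ai) for i ∈ S and replaces every typed symbol bj by b{S′}, where S′ collects all indices of b in those content models.

```python
import re

# The two rules of the EDTD D used in the proof.
RULES = {"a1": "b1 b2", "a2": "b3"}

# A typed symbol: lowercase element name followed by its type index (assumed syntax).
TYPED_SYMBOL = re.compile(r"([a-z]+)(\d+)")


def closure_rule(name, indices, rules):
    """Return the right-hand side of the closure rule for the set type name{indices}."""
    bodies = [rules[f"{name}{i}"] for i in sorted(indices)]
    # For each element name, collect every index it carries in these bodies.
    merged = {}
    for body in bodies:
        for child, index in TYPED_SYMBOL.findall(body):
            merged.setdefault(child, set()).add(int(index))

    def retype(match):
        child = match.group(1)
        return child + "{" + ",".join(str(i) for i in sorted(merged[child])) + "}"

    # The union of the original content models, with merged set types.
    return " + ".join(TYPED_SYMBOL.sub(retype, body) for body in bodies)


print(closure_rule("a", {1, 2}, RULES))  # prints: b{1,2,3} b{1,2,3} + b{1,2,3}
```

A full closure construction would start from the set type of the start symbol and add rules for every set type that becomes reachable; that worklist loop is omitted here.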

Tool for proving inexpressibility

Evaluation of Boolean circuits is not single-type

Two trees:
store(dvd(title, price), dvd(title, price, discount))
store(dvd(title, price, discount), dvd(title, price))

Tree obtained by exchanging dvd subtrees between them:
store(dvd(title, price), dvd(title, price))

Single-type EDTDs are not closed under union

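Example

D1: a → b, b → c
D2: a → bb, b → d

The tree a(b(c)) is in L(D1) and the tree a(b(d), b(d)) is in L(D2). Exchanging the subtree b(c) of the first tree for a subtree b(d) of the second (both have ancestor-string a b) yields

a(b(d)) ∉ L(D1) ∪ L(D2)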
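The counterexample can be replayed with a small implementation of ancestor-guarded subtree exchange. This is a sketch under assumptions: trees are (label, children) pairs, nodes are addressed by a list of child indices, c and d are leaves, and the helper names are made up for illustration.

```python
def subtree_at(tree, path):
    """Return the subtree reached by following the child indices in `path`."""
    for index in path:
        tree = tree[1][index]
    return tree


def ancestor_string(tree, path):
    """Labels on the path from the root to the node addressed by `path`."""
    labels = [tree[0]]
    for index in path:
        tree = tree[1][index]
        labels.append(tree[0])
    return labels


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the subtree at `path` replaced."""
    if not path:
        return new_subtree
    label, children = tree
    new_children = list(children)
    new_children[path[0]] = replace_at(children[path[0]], path[1:], new_subtree)
    return (label, new_children)


def exchange(t1, path1, t2, path2):
    """Ancestor-guarded subtree exchange: plug the subtree of t2 into t1."""
    if ancestor_string(t1, path1) != ancestor_string(t2, path2):
        raise ValueError("ancestor-strings differ")
    return replace_at(t1, path1, subtree_at(t2, path2))


T1 = ("a", [("b", [("c", [])])])                        # the only tree in L(D1)
T2 = ("a", [("b", [("d", [])]), ("b", [("d", [])])])    # the only tree in L(D2)
UNION = [T1, T2]                                        # L(D1) together with L(D2)

result = exchange(T1, [0], T2, [0])
print(result, result in UNION)
```

Because each of D1 and D2 accepts exactly one tree when c and d are leaves, membership in the union reduces to a list lookup in this sketch.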

Characterization of DTDs
DTDs define precisely the local tree languages

Theorem (Papakonstantinou, Vianu, 2000) A regular tree language is definable by a DTD iff it is closed under subtree exchange.

Smart validator

Theorem (Martens, Nev., Schw., 2005) Deciding whether an EDTD is equivalent to a single-type EDTD or a DTD is EXPTIME-complete.

Upper bound
Compute the single-type closure D′ of the given EDTD D. E.g., a1 → b1 b2 and a2 → b3 becomes

a{1} → b{1,2} b{1,2}
a{2} → b{3}
a{1,2} → b{1,2,3} b{1,2,3} + b{1,2,3}

L(D′) = L(D) iff L(D) is single-type. We know that L(D) ⊆ L(D′). So we only need to test L(D′) ⊆ L(D), i.e., whether D′ ∩ ¬D = ∅.


Lower bound
For r and s arbitrary regular expressions over Σ − {b}, the EDTD

a → r · b1 + s · b2
b1 → c
b2 → d

is equivalent to a single-type EDTD iff L(r) = L(s) (a PSPACE-hard problem). The equivalent DTD is a → r · b, b → c + d.

Single-type EDTDs in practice

A practical study of XSDs

XML Schema: successor of DTDs
Adds data types, a referencing mechanism, modularity, XML syntax, and more expressive power.

Corpus 819 XSDs from the Cover pages. 726 XSDs through Google’s web services. Only 225 are syntactically correct.

Practical XSDs are local
85% of the XSDs are structurally equivalent to a DTD: at most one type is associated with every element name. One example used types: a1 → b and a2 → b.

What do the 15% non-local XSDs look like?

In 90% of the cases, types only depend on the parent context:

store → regulars discounts
regulars → (dvd1)∗
discounts → dvd2 dvd2 (dvd2)∗
dvd1 → title price
dvd2 → title price discount

The remaining 10% are of the following form:

a → b + c        h1 → j1
b → e d1 f       h2 → j2
c → e d2 f       j1 → k l
d1 → g h1 i      j2 → m n
d2 → g h2 i

A j1 element can only occur as the great-grandchild of a b element.

Why isn’t the expressiveness of XSDs used to its full extent?

Two possible reasons

1 Extra non-local expressiveness is simply not needed in practice.
2 Users are not aware of the possibilities of XSDs: provide a simple formalism that makes types dependent on ancestors.

Making dependencies explicit

Definition An ancestor-based DTD A is a set of rules r → s where r and s are regular expressions over Σ.

Definition A tree t is valid w.r.t. A iff for every vertex v there is some r → s such that anc-str^t(v) ∈ L(r) and the children of v match s.


Theorem Ancestor-based DTDs and single-type EDTDs define the same class of tree languages.

Ancestor-guarded DTDs can be used as a light-weight front-end for XML Schema


single-type EDTD
store → regulars discounts
regulars → (dvd1)∗
discounts → dvd2 dvd2 (dvd2)∗
dvd1 → title price
dvd2 → title price discount

Ancestor-guarded DTD
store → regulars discounts
regulars → dvd∗
discounts → dvd dvd dvd∗
Σ∗ · regulars · dvd ⇒ title price
Σ∗ · discounts · dvd ⇒ title price discount
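Validation against such a DTD follows the definition literally: every node needs a rule whose left side matches its ancestor-string and whose right side matches its children. The sketch below makes several assumptions that are not in the slides: trees are (label, children) pairs, both sides of a rule are Python regexes over element names that are each followed by a space, a short rule such as regulars → dvd∗ is read as "any ancestors, then regulars", and leaf rules for title, price and discount are added.

```python
import re

ANY = r"(\w+ )*"  # any sequence of ancestor labels

# Rules r -> s of the DVD store example in the assumed regex encoding.
RULES = [
    (ANY + r"store ", r"regulars discounts "),
    (ANY + r"regulars ", r"(dvd )*"),
    (ANY + r"discounts ", r"dvd dvd (dvd )*"),
    (ANY + r"regulars dvd ", r"title price "),
    (ANY + r"discounts dvd ", r"title price discount "),
    (ANY + r"(title|price|discount) ", r""),  # assumed leaf rules
]


def is_valid(tree, rules, ancestors=""):
    """Check that every node has a rule matching its ancestor-string and children."""
    label, children = tree
    anc_str = ancestors + label + " "
    word = "".join(child[0] + " " for child in children)
    matched = any(
        re.fullmatch(r, anc_str) is not None and re.fullmatch(s, word) is not None
        for r, s in rules
    )
    return matched and all(is_valid(child, rules, anc_str) for child in children)


def leaf(name):
    return (name, [])


regular_dvd = ("dvd", [leaf("title"), leaf("price")])
discount_dvd = ("dvd", [leaf("title"), leaf("price"), leaf("discount")])
good = ("store", [("regulars", [regular_dvd]), ("discounts", [discount_dvd, discount_dvd])])
bad = ("store", [("regulars", [discount_dvd]), ("discounts", [discount_dvd, discount_dvd])])
print(is_valid(good, RULES), is_valid(bad, RULES))
```

Note that the definition as stated has no start symbol, so this check does not force the root to be labeled store.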


single-type EDTD
a → b + c        h1 → j1
b → e d1 f       h2 → j2
c → e d2 f       j1 → k l
d1 → g h1 i      j2 → m n
d2 → g h2 i

Ancestor-guarded DTD
a → b + c        h → j
b → e d f        Σ∗ · b · Σ∗ · j ⇒ k l
c → e d f        Σ∗ · c · Σ∗ · j ⇒ m n
d → g h i

1-Pass preorder typing

Streaming: XML as an unparsed sequence of start and stop tags (SAX).

[Figure: a pipeline of operators on XML streams: typing, followed by validation, XPath, and routing]

Typing as the first operator in a pipeline

1-Pass Preorder Typing versus single-type EDTDs

Observations
Streaming (preorder) typing is not possible for every EDTD:
  a → b1 + b2, b1 → c, b2 → d   (tree: a(b(c)))
Every single-type EDTD is preorder typable: the type of a child depends only on the type of its parent.
Single-type EDTDs are not the largest class that is preorder typable:
  a → b1 b2, b1 → c, b2 → d   (tree: a(b(c), b(d)))

Restrained Competition EDTDs: left-to-right unique typing

Definition (Murata, Lee, Mani, 2001) A regular expression r restrains competition if there are no strings w ai v and w aj v′ in L(r) with i ≠ j. An EDTD is restrained-competition iff all regular expressions occurring in its rules restrain competition.

Not restrained-competition
store → (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗
dvd1 → title price
dvd2 → title price discount

Both dvd1 dvd2 dvd2 and dvd2 dvd2 dvd2 match the content model of store, so the first dvd can receive either type.

Restrained-competition
store → (dvd1)∗ discounts dvd2 dvd2 (dvd2)∗
discounts → ε
dvd1 → title price
dvd2 → title price discount
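A bounded brute-force search can look for a violation of the restrained-competition condition in a single content model. This is only a sketch and not a decision procedure: it enumerates typed words up to a given length, so a negative answer says nothing about longer words. The encoding is an assumption: each typed symbol is written as name plus index followed by a space, and the untyped symbol discounts is given the dummy index 1.

```python
import re
from itertools import product


def violates_restrained_competition(pattern, typed_symbols, max_length):
    """Search words of length at most max_length for w a_i v and w a_j v' with i != j.

    `typed_symbols` is a list of (element name, type index) pairs and `pattern`
    is a Python regex over tokens "name+index " (each token ends with a space).
    """
    seen = {}  # (typed prefix w, element name) -> type index observed after w
    for length in range(1, max_length + 1):
        for word in product(typed_symbols, repeat=length):
            text = "".join(f"{name}{index} " for name, index in word)
            if re.fullmatch(pattern, text) is None:
                continue
            for position, (name, index) in enumerate(word):
                key = (word[:position], name)
                if seen.setdefault(key, index) != index:
                    return True  # same prefix and name, two different types
    return False


DVDS = [("dvd", 1), ("dvd", 2)]
not_restrained = r"(dvd1 |dvd2 )*dvd2 (dvd1 |dvd2 )*"
restrained = r"(dvd1 )*discounts1 dvd2 dvd2 (dvd2 )*"

print(violates_restrained_competition(not_restrained, DVDS, 3))
print(violates_restrained_competition(restrained, DVDS + [("discounts", 1)], 5))
```

For the first pattern the words dvd2 and dvd1 dvd2 already conflict at the empty prefix; for the second, no conflict shows up within the length bound.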

Towards characterizations of 1-pass preorder typing

The ancestor-sibling string

Theorem (Martens, Neven, Schwentick, 2005)
For a regular tree language T, the following are equivalent:
T is 1-pass preorder typable;
T is definable by a restrained-competition EDTD;
T is closed under ancestor-sibling-guarded subtree exchange;
T is definable by an ancestor-sibling-based DTD.

Summary slide

What to remember?
DTD ≈ extended context-free grammars
XML Schema ≈ single-type EDTDs
XML Schema is much closer to DTDs than to tree automata.
Single-typedness is not the most liberal restriction that gives unique top-down (1-pass) typing: restrained-competition EDTDs are more liberal.
Actually, the determinism constraint alone already implies 1-pass typing.

Relax NG

James Clark and Makoto Murata [2001]
Based on RELAX (Regular Language description for XML) and TREX (Tree Regular Expressions for XML)
Clean specification: 40 pages (XML Schema: 170 pages)
O’Reilly book by Eric van der Vlist
Motivated by unranked regular tree languages. Very similar to extended DTDs.
Closed under Boolean operations.

Relax NG: abbreviated syntax

store = element store { (dvd1 | dvd2)*, dvd2, (dvd1 | dvd2)* }
dvd1 = element dvd {
  element title { xsd:NCName },
  element price { xsd:integer }
}
dvd2 = element dvd {
  element title { xsd:NCName },
  element price { xsd:integer },
  element discount { xsd:integer }
}

EDTD
store → (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗
dvd1 → title price
dvd2 → title price discount

Relax NG: XML syntax

Decision problems for XML schema languages

Complexity of basic decision problems

Schema CONTAINMENT (⊆)

Given: schemas d1, d2
Question: Is L(d1) ⊆ L(d2)?

Schema EQUIVALENCE (=)

Given: schemas d1, d2
Question: Is L(d1) = L(d2)?

Schema INTERSECTION (∩)

Given: schemas d1, ..., dn
Question: Is L(d1) ∩ ··· ∩ L(dn) = ∅?


Theorem (Seidl 1990, 1994) CONTAINMENT, EQUIVALENCE, and INTERSECTION are EXPTIME-complete for EDTDs and NTA(NFA)s.

Proposition
Let R be a class of regular expressions and C a complexity class. Then the following are equivalent:
  CONTAINMENT for R is in C;
  CONTAINMENT for DTD(R) is in C;
  CONTAINMENT for single-type EDTD(R) is in C;
  CONTAINMENT for restrained-competition EDTD(R) is in C.

Proposition
Let R be a class of regular expressions and C a complexity class. Then the following are equivalent:
  INTERSECTION for R is in C;
  INTERSECTION for DTD(R) is in C.

Theorem (Martens, Neven, Schwentick 2005)
There is a class of regular expressions X such that
  INTERSECTION for X is NP-complete;
  INTERSECTION for single-type EDTD(X) is EXPTIME-complete.

Complexity of regular expressions

The complexity of the basic decision problems for regular expressions carries over to schema languages.
These problems have been studied in depth (Hunt III et al., Kozen, Meyer and Stockmeyer, ...).
More than ninety percent of the regular expressions occurring in practical DTDs and XSDs are chain regular expressions (CHAREs) (Bex et al. 2004).

Definition
A base symbol is a regular expression s, s∗, s+, or s?, where s is a non-empty string; a factor is of the form e, e∗, e+, or e?, where e is a disjunction of base symbols. A chain regular expression (CHARE) is ∅, ε, or a sequence of factors.

Example
((abc)∗ + b∗)(a + b)?(ab)+(ac + b)∗ is a CHARE.
(a + b) + (a∗b∗) is not a CHARE.
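The shape of a CHARE can be mirrored by a small builder that only allows base symbols inside factors and factors in sequence. The Python sketch below compiles such an expression to a pattern for the standard re module; the helper names base, factor, and chare are hypothetical, and the builder is only an illustration of the definition, not a parser for arbitrary expressions.

```python
import re

MODIFIERS = ("", "*", "+", "?")

def base(word, mod=""):
    """Base symbol: a non-empty string s, optionally followed by *, + or ?."""
    assert word and mod in MODIFIERS
    return "(?:" + re.escape(word) + ")" + mod

def factor(bases, mod=""):
    """Factor: a disjunction of base symbols, optionally followed by *, + or ?."""
    assert bases and mod in MODIFIERS
    return "(?:" + "|".join(bases) + ")" + mod

def chare(*factors):
    """CHARE: a sequence of factors, compiled to a Python pattern."""
    return re.compile("".join(factors))

# ((abc)* + b*) (a + b)? (ab)+ (ac + b)*
example = chare(
    factor([base("abc", "*"), base("b", "*")]),
    factor([base("a"), base("b")], "?"),
    factor([base("ab")], "+"),
    factor([base("ac"), base("b")], "*"),
)

print(bool(example.fullmatch("abcabcaabab")))
print(bool(example.fullmatch("ba")))
```

The builder offers no union of factors at the top level, which is why an expression such as (a + b) + (a∗b∗) cannot be written with it.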

Chain Regular Expressions (CHAREs)

Abbreviations (a1, ..., an are alphabet symbols; w1, ..., wn are non-empty strings)

Factor               Abbr.     Factor               Abbr.
(a1 + ··· + an)      (+a)      (w1 + ··· + wn)      (+w)
(a1 + ··· + an)∗     (+a)∗     (w1 + ··· + wn)∗     (+w)∗
(a1 + ··· + an)+     (+a)+     (w1 + ··· + wn)+     (+w)+
(a1 + ··· + an)?     (+a)?     (w1 + ··· + wn)?     (+w)?
(a1∗ + ··· + an∗)    (+a∗)     (w1∗ + ··· + wn∗)    (+w∗)
(a1+ + ··· + an+)    (+a+)     (w1+ + ··· + wn+)    (+w+)

Complexity of CHAREs

Known results
  CONTAINMENT for RE(a?, (+a)∗) is in PTIME [Abdulla, Bouajjani, Jonsson 1998]
  CONTAINMENT for RE(a, Σ, Σ∗) is in PTIME [Milo, Suciu 1999]
  INTERSECTION for RE((+w)∗) is PSPACE-hard [Bala 2002]

Complexity of CHAREs [Martens, Neven, Schwentick 2004]

RE-fragment                              Inclusion          Equivalence    Intersection
a, a+                                    in PTIME (DFA!)    in PTIME       in PTIME
a, a∗                                    coNP               in PTIME       NP
a, a?                                    coNP               in PTIME       NP
CHAREs − {(+a)∗, (+w)∗, (+a)+, (+w)+}    coNP               in coNP        NP
a, (+a)∗                                 PSPACE             in PSPACE      NP
CHAREs − {(+w)∗, (+w)+}                  PSPACE             in PSPACE      NP
a, (+w)∗                                 PSPACE             in PSPACE      PSPACE
CHAREs                                   PSPACE             in PSPACE      PSPACE
RE≤k (k ≥ 3)                             in PTIME           in PTIME       PSPACE
deterministic                            in PTIME           in PTIME       PSPACE

Equivalence of a, a∗ is in PTIME

Put each expression in sequence normal form. E.g., aaa∗bb∗cccc∗ becomes a^{≥2} b^{≥1} c^{≥3}.
There are equivalent expressions with different sequence normal forms:

a^{≥i} b∗ a∗ b^{≥1} a^{≥j} = a^{≥i} b^{≥1} a∗ b∗ a^{≥j}

Good news: this is the only exception, although the proof is non-trivial.
Conjecture: equivalence is tractable for much larger fragments.
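As a small illustration, the Python sketch below computes a sequence normal form by merging adjacent factors over the same symbol, and then compares the two sides of the exceptional identity on all short words. The token encoding, the triple representation (symbol, minimum, unbounded), and the function names are assumptions made for this example; the comparison is a bounded sanity check for i = j = 1, not a proof and not the PTIME algorithm itself.

```python
import re
from itertools import groupby, product

def sequence_normal_form(tokens):
    """Merge maximal runs of factors over the same symbol.

    `tokens` is a list such as ["a", "a", "a*", "b", "b*"]. The result is a
    list of (symbol, minimum, unbounded) triples; ("a", 2, True) stands for
    "at least two a's".
    """
    result = []
    for symbol, run in groupby(tokens, key=lambda tok: tok[0]):
        run = list(run)
        minimum = sum(1 for tok in run if tok == symbol)     # plain occurrences
        unbounded = any(tok == symbol + "*" for tok in run)  # at least one star
        result.append((symbol, minimum, unbounded))
    return result

def same_language_up_to(pattern1, pattern2, alphabet, max_len):
    """Compare two Python patterns on every word of length <= max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            in1 = re.fullmatch(pattern1, word) is not None
            in2 = re.fullmatch(pattern2, word) is not None
            if in1 != in2:
                return False
    return True

# aaa*bb*cccc* from the example above.
print(sequence_normal_form(["a", "a", "a*", "b", "b*", "c", "c", "c", "c*"]))

# The two sides of the exceptional identity, with i = j = 1.
left_tokens = ["a", "a*", "b*", "a*", "b", "b*", "a", "a*"]
right_tokens = ["a", "a*", "b", "b*", "a*", "b*", "a", "a*"]
left_pattern = r"a{1,}b*a*b{1,}a{1,}"
right_pattern = r"a{1,}b{1,}a*b*a{1,}"

print(sequence_normal_form(left_tokens) != sequence_normal_form(right_tokens))
print(same_language_up_to(left_pattern, right_pattern, "ab", 8))
```

The normal forms of the two sides differ, yet the bounded comparison finds no distinguishing word, which is consistent with the identity stated above.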

Summary

What to remember?
Decision problems for XML Schema translate to decision problems for regular expressions.

Question
What is the largest class of regular expressions for which equivalence is in PTIME?

6 Conclusion

DTDs are almost extended context-free grammars
Unranked tree automata are a robust class, but questions remain
XML Schema is closer to DTDs than to tree automata
XML (schema) research is a good excuse to do theory
