Foundations of XML Types: Tree Automata

Pierre Genevès CNRS

(slides mostly based on slides by W. Martens and T. Schwentick)

University Grenoble Alpes, 2020–2021

1 / 43 Why Tree Automata?

• Foundations of XML type languages (DTD, XML Schema, Relax NG...) • Provide a general framework for XML type languages • A tool to define regular tree languages with an operational semantics • Provide algorithms for efficient validation • Basic tool for static analysis (proofs, decision procedures in logic)

2 / 43 Prelude: Word Automata

b b a

start even odd

a

Transitions even →a odd odd →a even ...

3 / 43 From Words to Trees: Binary Trees

Binary trees with an even number of a’s

a

a a

b a b b

How to write transitions?

4 / 43 From Words to Trees: Binary Trees

Binary trees with an even number of a’s

a even

a even a odd

b even a odd b even b even

How to write transitions?

4 / 43 From Words to Trees: Binary Trees

Binary trees with an even number of a’s

a even

a even a odd

b even a odd b even b even

How to write transitions? (even, odd) →a even (even, even) →a odd etc.

4 / 43 Ranked Trees?

They come from parse trees of data (or programs)...

A function call f (a, b) is a ranked tree

f

a b

5 / 43 Ranked Trees?

They come from parse trees of data (or programs)...

A function call f (g(a, b, c), h(i)) is a ranked tree

f

g h

a b c i

5 / 43 Ranked Alphabet

A ranked alphabet symbol is:

• a formalisation of a function call • a symbol a with an integer arity(a) • arity(a) indicates the number of children of a

Notation a(k): symbol a with arity(a)= k

6 / 43 Example

Alphabet: {a(2), b(2), c(3), #(0)} Possible tree:

a

b a

a c # #

# # a # b

# # a #

# #

7 / 43 Ranked Tree Automata

A ranked A consists in:

Alphabet(A): finite alphabet of symbols States(A): finite of states Rules(A): finite set of transition rules Final(A): finite set of final states (⊆ States(A)) where :

a(k) Rules(A) are of the form (q1, ..., qk ) → q

(0) if k = 0, we write  a→ q

8 / 43 Ranked Tree Automata

A ranked tree automaton A consists in:

Alphabet(A): finite alphabet of symbols States(A): finite set of states Rules(A): finite set of transition rules Final(A): finite set of final states (⊆ States(A)) where :

a(k) Rules(A) are of the form (q1, ..., qk ) → q

(0) if k = 0, we write  a→ q

8 / 43 How do they work? Example: Boolean Expressions

∨ ∧

∧ ∨ ∨ ∧

0 1 1 0 0 1 1 1

Rules(A) Principle 0 1  → q0  → q1 ∧ ∨ • Alphabet(A)= {∧, ∨, 0, 1} (q1, q1) → q1 (q0, q1) → q1 ∧ ∨ • States(A)= {q0, q1} (q0, q1) → q0 (q1, q0) → q1 ∧ ∨ • 1 accepting state at the root: (q1, q0) → q0 (q1, q1) → q1 ∧ ∨ Final(A)= {q1} (q0, q0) → q0 (q0, q0) → q0

9 / 43 How do they work? Example: Boolean Expressions

∨ ∧

∧ ∨ ∨ ∧

0 q0 1 q1 1 q1 0 q0 0 q0 1 q1 1 q1 1 q1

Rules(A) Principle 0 1  → q0  → q1 ∧ ∨ • Alphabet(A)= {∧, ∨, 0, 1} (q1, q1) → q1 (q0, q1) → q1 ∧ ∨ • States(A)= {q0, q1} (q0, q1) → q0 (q1, q0) → q1 ∧ ∨ • 1 accepting state at the root: (q1, q0) → q0 (q1, q1) → q1 ∧ ∨ Final(A)= {q1} (q0, q0) → q0 (q0, q0) → q0

9 / 43 How do they work? Example: Boolean Expressions

∧ q1

∨ q1 ∧ q1

∧q0 ∨q1 ∨q1 ∧q1

0 q0 1 q1 1 q1 0 q0 0 q0 1 q1 1 q1 1 q1

Rules(A) Principle 0 1  → q0  → q1 ∧ ∨ • Alphabet(A)= {∧, ∨, 0, 1} (q1, q1) → q1 (q0, q1) → q1 ∧ ∨ • States(A)= {q0, q1} (q0, q1) → q0 (q1, q0) → q1 ∧ ∨ • 1 accepting state at the root: (q1, q0) → q0 (q1, q1) → q1 ∧ ∨ Final(A)= {q1} (q0, q0) → q0 (q0, q0) → q0

9 / 43 Terminology

• Language(A) : set of trees accepted by A • For a tree automaton A, Language(A) is a regular tree language [Thatcher and Wright, 1968, Doner, 1970]

10 / 43 Example

Tree automaton A over {a(2), b(2), #(0)} which recognizes trees with an even number of a’s Alphabet(A): {a, b, #} States(A): Final(A): Rules(A):

11 / 43 Example

Tree automaton A over {a(2), b(2), #(0)} which recognizes trees with an even number of a’s Alphabet(A): {a, b, #} States(A): {even, odd} Final(A): {even} Rules(A): (even, even) →a odd (even, even) →b even (even, odd) →a even (even, odd) →b odd (odd, even) →a even (odd, even) →b odd (odd, odd) →a odd (odd, odd) →b even #  → even

11 / 43 Outline

1. Can we implement a tree automaton efficiently? (notion of determinism) 2. Are tree automata closed under set-theoretic operations? 3. Can we check type inclusion? 4. Can we build equivalent top-down tree automata? 5. Nice theory. But... what should I do with my unranked XML trees? → Can we apply this for XML type-checking?

12 / 43 Deterministic Tree Automata

Deterministic does not have two rules of the form:

a(k) (q1, ..., qk ) → q (k) a 0 (q1, ..., qk ) → q for two different states q and q0

Intuition At most one possible transition at a given node → implementation...

13 / 43 Can we Make a Tree Automaton Deterministic?

Theorem (determinization) From a given non-deterministic (bottom-up) tree automaton we can build a deterministic tree automaton Corollary Non-deterministic and deterministic (bottom-up) tree automata recognize the same languages. Complexity |States(A)| EXPTIME (|States(Adet)| = 2 )

14 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b We can implement even if A is b non-deterministic... b Example a Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q b b b a qb → qf (q, q) → q c c

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b We can implement even if A is b non-deterministic... b Example a Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q b b b a qb → qf (q, q) → q {q} c c {q}

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b We can implement even if A is b non-deterministic... b Example a Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q {q, qb} b b {q, qb} b a qb → qf (q, q) → q {q} c c {q}

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b We can implement even if A is b non-deterministic... b Example a {q} Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q {q, qb} b b {q, qb} b a qb → qf (q, q) → q {q} c c {q}

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b We can implement even if A is b non-deterministic... b {q, qb} Example a {q} Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q {q, qb} b b {q, qb} b a qb → qf (q, q) → q {q} c c {q}

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b We can implement even if A is b {q, qb, qf } non-deterministic... b {q, qb} Example a {q} Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q {q, qb} b b {q, qb} b a qb → qf (q, q) → q {q} c c {q}

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b {q, qb, qf } We can implement even if A is b {q, qb, qf } non-deterministic... b {q, qb} Example a {q} Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q {q, qb} b b {q, qb} b a qb → qf (q, q) → q {q} c c {q}

15 / 43 Implementing Validation

Membership Checking Given a tree automaton A and a tree t, does t ∈ Language(A)?

Remark b {q, qb, qf } We can implement even if A is b {q, qb, qf } non-deterministic... b {q, qb} Example a {q} Automaton with Final(A)= {qf } and : c b b  → q q → qb q → q {q, qb} b b {q, qb} b a qb → qf (q, q) → q {q} c c {q} Complexity Membership-Checking is in PTIME (time linear in the size of the tree)

15 / 43 Set-Theoretic Operations

Recall • We have seen that neither local tree grammars nor single-type tree grammars are closed under boolean operations (e.g. union) • What about tree automata?

16 / 43 Closure under Union and Intersection... Example

• Automaton A: even number of a’s a • (even, even) → odd • Automaton B: even number of b’s a • (even, even) → even

17 / 43 Closure under Union and Intersection... Example

• Automaton A: even number of a’s a • (even, even) → odd • Automaton B: even number of b’s a • (even, even) → even

a odd even

b even a even even even

17 / 43 Closure under Union and Intersection... Example

• Automaton A: even number of a’s a • (even, even) → odd • Automaton B: even number of b’s a • (even, even) → even

a odd even

b even a even even even

((even, even), (even, even)) →a (odd, even)

17 / 43 Product Construction

Given A and B, build A × B • Alphabet(A × B)= Alphabet(A) ∪ Alphabet(B) • States(A × B)= States(A) × States(B)

• Final(A × B)= {(qa, qb) | qa ∈ Final(A) ∧ qb ∈ Final(B)} • Rules(A × B)=   a(k) (k) 1 k  1 1 k k  a qa , ..., qa → qa ∈ Rules(A)  (qa , qb ), ..., (qa , qb ) → (qa, qb) (k) 1 k a  qb , ..., qb → qb ∈ Rules(B) 

18 / 43 Closure under Union

Given A and B, build A ∪ B • Alphabet(A ∪ B)= Alphabet(A) ∪ Alphabet(B) • States(A ∪ B)= States(A) × States(B) • Rules(A ∪ B)=   a(k) (k) 1 k  1 1 k k  a qa , ..., qa → qa ∈ Rules(A)  (qa , qb ), ..., (qa , qb ) → (qa, qb) (k) 1 k a  qb , ..., qb → qb ∈ Rules(B) 

• Final(A ∪ B)= {(qa, qb) | qa ∈ Final(A) ∨ qb ∈ Final(B)}

19 / 43 Closure under intersection

Given A and B, build A ∩ B • Alphabet(A ∩ B)= Alphabet(A) ∪ Alphabet(B) • States(A ∩ B)= States(A) × States(B) • Rules(A ∩ B)=   a(k) (k) 1 k  1 1 k k  a qa , ..., qa → qa ∈ Rules(A)  (qa , qb ), ..., (qa , qb ) → (qa, qb) (k) 1 k a  qb , ..., qb → qb ∈ Rules(B) 

• Final(A ∩ B)= {(qa, qb) | qa ∈ Final(A) ∧ qb ∈ Final(B)}

20 / 43 Complexity of the Product

Size of the Result Automaton • |States(A × B)| = |States(A)| · |States(B)| • |Rules(A × B)| ≤ |Rules(A)| · |Rules(B)| Quadratic increase in size

21 / 43 Closure under Negation: Completion

Definition : Complete Tree Automaton (k) For each a ∈ Alphabet(A) and q1, ..., qk ∈ States(A), there exists a rule

a (q1, ..., qk ) → q with some q Intuition At least one transition at a given node...

22 / 43 Example

Incomplete (deterministic) tree automaton Tree automaton A for {a(b, b)} : b  → qb a (qb, qb) → qa with Final(A)= {qa}

Completion of A

Add a sink state qp b a  → qb  → qp a a a a (qb, qb) → qa (qb, qa) → qp (qa, qb) → qp (qa, qa) → qp b b b b (qb, qb) → qp (qb, qa) → qp (qa, qb) → qp (qa, qa) → qp σ (qp, q) → qp for all q ∈ {qa, qb, qp} and σ ∈ {a, b} σ (q, qp) → qp for all q ∈ {qa, qb, qp} and σ ∈ {a, b} with Final(A)= {qa}

23 / 43 Example

Incomplete (deterministic) tree automaton Tree automaton A for {a(b, b)} : b  → qb a (qb, qb) → qa with Final(A)= {qa}

Completion of A, Complementation of A

Add a sink state qp b a  → qb  → qp a a a a (qb, qb) → qa (qb, qa) → qp (qa, qb) → qp (qa, qa) → qp b b b b (qb, qb) → qp (qb, qa) → qp (qa, qb) → qp (qa, qa) → qp σ (qp, q) → qp for all q ∈ {qa, qb, qp} and σ ∈ {a, b} σ (q, qp) → qp for all q ∈ {qa, qb, qp} and σ ∈ {a, b} with Final(A)= {qb, qp}

23 / 43 Closure under Negation: Summary

Building the Complement of A

• Make A deterministic • Complete the result • Switch final ↔ non-final states

Complexity

• Determinization of A : exponential explosion (states: 2States(A)) • Completion of the result: exponential explosion of the number of rules:  k |Alphabet(A)| · 2|States(A)| where k is the maximal rank • Switching final ↔ non-final states: linear Total: exponential explosion

24 / 43 Emptiness Test

Given a tree automaton A, is Language(A) 6= ∅?

Principle Compute the set of reachable states and then see if any of them are in the final set

Complexity PTIME (time proportional to |A|)

25 / 43 Application for Checking Type Inclusion

Type Inclusion

Given two tree automata A1 and A2, is Language(A1) ⊆ Language(A2) ? Theorem Containment for non-deterministic tree automata can be decided in exponential time Principle

? • Language(A1 ∩ A2) = ∅

|A2| • For this purpose, we must make A2 deterministic (size: O(2 )) → EXPTIME • Essentially no better solution [Seidl, 1990]

26 / 43 Application for Checking Type Inclusion

Type Inclusion

Given two tree automata A1 and A2, is Language(A1) ⊆ Language(A2) ? Theorem Containment for non-deterministic tree automata can be decided in exponential time Principle

? • Language(A1 ∩ A2) = ∅

|A2| • For this purpose, we must make A2 deterministic (size: O(2 )) → EXPTIME • Essentially no better solution [Seidl, 1990]

26 / 43 Top-Down Tree Automata

Is that useful?... Example: Connection with Strings

a

b c

abcd = a(b(c(d))) = d

Reading strings from left to right = reading trees top-down (→ e.g. streaming validation...)

27 / 43 Top-Down Tree Automata: Example

∧ ∨ ∧

∧ ∨ ∨ ∧

0 1 1 0 0 1 1 1

Principle • starting from the root, guess Transitions ∧ ∨ correct values q1 → (q1, q1) q1 → (q0, q1) ∧ ∨ • check at leaves q0 → (q0, q1) q1 → (q1, q0) ∧ ∨ q0 → (q1, q0) q1 → (q1, q1) • 3 states: q0, q1, acc ∧ ∨ q0 → (q0, q0) q0 → (q0, q0) • initial state at the root: q1 1 0 q1 → acc q0 → acc • accepting if all leaves labeled acc 28 / 43 Top-Down Tree Automata: Example

∧ q1 ∨ ∧

∧ ∨ ∨ ∧

0 1 1 0 0 1 1 1

Principle • starting from the root, guess Transitions ∧ ∨ correct values q1 → (q1, q1) q1 → (q0, q1) ∧ ∨ • check at leaves q0 → (q0, q1) q1 → (q1, q0) ∧ ∨ q0 → (q1, q0) q1 → (q1, q1) • 3 states: q0, q1, acc ∧ ∨ q0 → (q0, q0) q0 → (q0, q0) • initial state at the root: q1 1 0 q1 → acc q0 → acc • accepting if all leaves labeled acc 28 / 43 Top-Down Tree Automata: Example

∧ q1

q1 ∨ ∧ q1

∧q0 ∨q1 ∨q1 ∧q1

0 q0 1 q1 1 q1 0 q0 0 q0 1 q1 1 q1 1 q1 acc acc acc acc acc acc acc acc

Principle • starting from the root, guess Transitions ∧ ∨ correct values q1 → (q1, q1) q1 → (q0, q1) ∧ ∨ • check at leaves q0 → (q0, q1) q1 → (q1, q0) ∧ ∨ q0 → (q1, q0) q1 → (q1, q1) • 3 states: q0, q1, acc ∧ ∨ q0 → (q0, q0) q0 → (q0, q0) • initial state at the root: q1 1 0 q1 → acc q0 → acc • accepting if all leaves labeled acc 28 / 43 Top-Down Tree Automata

A top-down tree automaton A consists in:

Alphabet(A): finite alphabet of symbols States(A): finite set of states Rules(A): finite set of transition rules Initial(A): finite set of initial states (⊆ States(A)) where:

a(k) Rules(A) are of the form q → (q1, ..., qk )

Top-down tree automata also recognize all regular tree languages

29 / 43 Top-Down Tree Automata

A top-down tree automaton A consists in:

Alphabet(A): finite alphabet of symbols States(A): finite set of states Rules(A): finite set of transition rules Initial(A): finite set of initial states (⊆ States(A)) where:

a(k) Rules(A) are of the form q → (q1, ..., qk )

Top-down tree automata also recognize all regular tree languages

29 / 43 Top-down Determinism

Deterministic Top-Down Tree Automaton

• for each q ∈ States(A) and a ∈ Alphabet(A) there is at most one rule

a(k) q → (q1, ...., qk )

• there is at most one initial state

30 / 43 Top-down Determinism

Deterministic Top-Down Tree Automaton

• for each q ∈ States(A) and a ∈ Alphabet(A) there is at most one rule

a(k) q → (q1, ...., qk )

• there is at most one initial state

Can We Make Top-Down Automata Deterministic?

(a) Yes (b) No (c) Maybe...

30 / 43 Can We Make Top-Down Automata Deterministic?

Maybe !

31 / 43 Can We Make Top-Down Automata Deterministic?

Maybe ! Deterministic top-down tree automata do not recognize all regular tree languages Example

a a

b c c b

31 / 43 Can We Make Top-Down Automata Deterministic?

Maybe ! Deterministic top-down tree automata do not recognize all regular tree languages Example

a a

b c c b

Initial(A)= q0

31 / 43 Can We Make Top-Down Automata Deterministic?

Maybe ! Deterministic top-down tree automata do not recognize all regular tree languages Example

a a

b c c b

Initial(A)= q0 a q0 → (q, q)

31 / 43 Can We Make Top-Down Automata Deterministic?

Maybe ! Deterministic top-down tree automata do not recognize all regular tree languages Example

a a

b c c b

Initial(A)= q0 a q0 → (q, q) q →b 

31 / 43 Can We Make Top-Down Automata Deterministic?

Maybe ! Deterministic top-down tree automata do not recognize all regular tree languages Example

a a

b c c b

Initial(A)= q0 a q0 → (q, q) q →b  q →c 

31 / 43 Can We Make Top-Down Automata Deterministic?

Maybe ! Deterministic top-down tree automata do not recognize all regular tree languages Example

a a

b c c b

Initial(A)= q0 a q0 → (q, q) q →b  q →c  also accepts...

a a

b b c c

31 / 43 Expressive Power of Tree Automata: Summary

Theorem The following properties are equivalent for a tree language L: (a) L is recognized by a bottom-up non-deterministic tree automaton (b) Lis recognized by a bottom-up deterministic tree automaton (c) L is recognized by a top-down non-deterministic tree automaton (d) L is generated by a regular tree grammar

Proof Idea (a) ⇒ (b): determinization (see [Comon et al., 1997]) (a) ⇔ (c): same thing seen from 2 different ways (d) ⇔ (a) : ? (horizontal recursion a∗?)

32 / 43 Unranked Trees

String Ranked Tree Unranked Tree as Tree a a a b b e c d a ... c d b c b a a c a a c b a ... c c a e a b a b c e b c b e b

Unranked Tree Automata? 1. either we adapt ranked tree automata 2. or we encode unranked trees are ranked trees...

33 / 43 Unranked Tree Automata

Ranked Trees q σ Transitions can be described by finite sets: σ1 σ2 δ(σ, q)= {(q1, q2), (q3, q4), ...}

q1 q2 Unranked Trees q σ

σ1 σ2 ...... σn

q1 q2 ...... qn ∈ δ(σ, q)?

δ(σ, q)

• For unranked trees, δ(σ, q) is a regular tree language • δ(σ, q) may be specified by a or by a finite word automaton [Murata, 1999]

34 / 43 Second Option

Can we encode unranked trees as ranked trees? a

b e c d a ... c d a a ? b a ... c c a e

e b c b e

35 / 43 Encoding Unranked Trees As Binary Trees

0 0 1 1 2 2 3 3

Bijective Encoding

• “first child; next sibling” encoding • Allows to focus on binary trees without loss of generality • Results for ranked trees hold for unranked trees as well

36 / 43 Tree Automata: Summary

Definition A tree language is regular iff it is recognized by a non-deterministic tree automaton Advantages

• Closure, decidable operations • General tool (theoretical and algorithmic)

Limitations • anbn

37 / 43 Application for Static Type-Checking

Recall: The XML Static Type-Checking Problem

• Given the following: • Source code of some program f (in e.g. XQuery) • A type Tin for input documents • A type Tout for expected output documents

• Problem: Does f (t) ∈ Tout for all t ∈ Tin?

38 / 43 Static Type-Checking for XQuery: Principles

{ FOR $author in /books/book/author[...] RETURN ... { /books/book/title } }

f ()

t ? T ⊆ Tout f (t) inf Tin

Approach in two steps:

1. Compute Tinf = {f (t)|t ∈ Tin}

2. Check whether Tinf ⊆ Tout holds

→ In case Tinf ⊆ Tout holds, then we know that for any t ∈ Tin, f (t) ∈ Tout

39 / 43 Static Type-Checking Using Tree Automata

• Many research papers • XQuery’s standard type system inspired from advances in PLs, e.g.: • XDuce [Hosoya and Pierce, 2003]: a domain specific language • Sound static type system based on an implementation of tree automata containment optimized for practical cases • Use pattern-matching for selection (no support for XPath)

XQuery’s standard type system

• As defined in the spec: • Sound • PTIME → Sound and more precise type systems in EXPTIME

40 / 43 XQuery and Static Type-Checking

Open Issues

• Type inference requires a balance between precision and complexity • Sound approximations can be more or less imprecise • Poor precision (e.g. type approximation) in type inference yields more false alarms • XQuery’s standard type system is limited: • Currently: no or poor support of XPath (e.g. only child and descendant axes are – poorly – supported) • Support for typing XPath has been open for years (in particular for backward axes) → Can you guess why is that difficult? → 2015 ICFP paper with a solution for this issue (based on advances in tree logics)

41 / 43 Concluding Remarks on Tree Automata

• Tree automata are part of the theoretical tools that provide the underlying guiding principles for XML (like the relational algebra provides the underlying principles for relational databases) • Important research challenges still remain

42 / 43

For those who want to learn more Pointers to related research results:

[1] XQuery and Static Typing: Tackling the Problem of Backward Axes. Pierre Genevès and Nils Gesbert. In ICFP’15: Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP), Sept. 2015. http://tyrex.inria.fr/publications/icfp15.pdf

43 / 43 Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., and Tommasi, M. (1997). Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata. release October, 1st 2002.

Doner, J. (1970). Tree acceptors and some of their applications. Journal of Computer and System Sciences, 4:406–451.

Hosoya, H. and Pierce, B. C. (2003). XDuce: A statically typed XML processing language. ACM Trans. Inter. Tech., 3(2):117–148.

Murata, M. (1999). Hedge automata: a formal model for XML schemata. http://www.xml.gr.jp/relax/hedge_nice.html.

Seidl, H. (1990). Deciding equivalence of finite tree automata. SIAM J. Comput., 19(3):424–437.

Thatcher, J. W. and Wright, J. B. (1968). 43 / 43 Generalized finite with an application to a decision problem of second-order logic. Mathematical Systems Theory, 2(1):57–81.

43 / 43