The Automaton Approach to XML Schema Languages: from Practice to Theory


Frank Neven
Theoretical Computer Science Group, Hasselt University
Agoralaan, 3590 Diepenbeek, Belgium
27 February 2006

Frank Neven (Hasselt University), Automata and XML schema languages, 27 February 2006

Outline

  1 Introduction to XML
  2 Document Type Definitions
  3 Unranked Tree Automata
  4 Extended Document Type Definitions
      Definition
      XML Schema
      Properties of single-type EDTDs
      Single-type EDTDs in practice
      1-pass preorder typing
      Relax NG
  5 Decision problems for XML schema languages
  6 Conclusion

1 Introduction to XML

XML is a data exchange format

  W3C standard

  [Figure: XML exchanged over the Internet between a user and several sources: geographical db, OODB, Rel DB, car retailer, car reviews]

A self-describing data format

  Example

    <store>
      <dvd>
        <title> Fabuleux destin d’Amelie </title>
        <price> 17 </price>
      </dvd>
      <dvd>
        <title> Goodbye Lenin </title>
        <price> 20 </price>
        <discount> 4 </discount>
      </dvd>
    </store>

  start tag: <title>
  element:   <title>...</title>
  end tag:   </title>

XML as a hierarchical structure

  Example

    store
      dvd
        title: "Amélie"
        price: 17
      dvd
        title: "Good bye, Lenin!"
        price: 20
        discount: 4

Attributes

  Example

    <store name="DVDPlanet">
      <dvd category="romance">
        <title> Fabuleux ... d’Amelie </title>
        <price> 17 </price>
      </dvd>
      <dvd category="drama">
        <title> Goodbye Lenin </title>
        <price> 20 </price>
        <discount> 4 </discount>
      </dvd>
    </store>

XML as a hierarchical structure

  Example

    store[name="DVDPlanet"]
      dvd[category="romance"]
        title: "Amélie"
        price: 17
      dvd[category="drama"]
        title: "Good bye, Lenin!"
        price: 20
        discount: 4

Trees as conceptual abstraction of XML documents

  XML documents are ordered unranked trees over a finite alphabet Σ of tag names.
  We assume an infinite set of data values D for attribute and leaf values.

Flexibility of XML

  Representation of the relational model

  Relation R:

    A    B
    a1   b1
    a2   b2

  XML encoding:

    <R>
      <tuple>
        <A> a1 </A>
        <B> b1 </B>
      </tuple>
      <tuple>
        <A> a2 </A>
        <B> b2 </B>
      </tuple>
    </R>

  XML tree:

    R
      tuple
        A: a1
        B: b1
      tuple
        A: a2
        B: b2

XML schema languages

  Schema
    A schema defines the set of allowable tags and the way they can be structured.

  Advantages
    automatic validation
    automatic integration of data
    automatic translation
    query optimization
    provides a user with a concrete semantics of the document
    aids in the specification of meaningful queries over XML data

  Example
    DTDs (W3C)
    XML Schema (W3C)
    Relax NG (Clark, Murata)
    several dozen others (DSD, Schematron, ...)
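
The tree abstraction used in the slides above (ordered unranked trees over tag names, with data values at attributes and leaves) can be made concrete with a small sketch. The `Node` class and the helpers `to_tree` and `skeleton` are hypothetical names introduced here for illustration; they are not part of the talk. Parsing relies on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One vertex of an ordered unranked tree (illustrative model only)."""
    label: str                                   # tag name, drawn from the alphabet Σ
    attributes: Dict[str, str] = field(default_factory=dict)  # attribute data values
    value: Optional[str] = None                  # leaf data value, if any
    children: List["Node"] = field(default_factory=list)      # ordered, any number


def to_tree(elem: ET.Element) -> Node:
    """Convert a parsed XML element into the tree model, keeping child order."""
    text = (elem.text or "").strip()
    return Node(elem.tag, dict(elem.attrib), text or None,
                [to_tree(child) for child in elem])


def skeleton(node: Node):
    """Drop all data values and keep only labels and structure."""
    return (node.label, tuple(skeleton(child) for child in node.children))


STORE_XML = """
<store name="DVDPlanet">
  <dvd category="romance"><title>Amelie</title><price>17</price></dvd>
  <dvd category="drama"><title>Goodbye Lenin</title><price>20</price><discount>4</discount></dvd>
</store>
"""

store = to_tree(ET.fromstring(STORE_XML.strip()))
print(skeleton(store))
```

The `skeleton` view is the part of a document that a schema constrains when data values are ignored.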

  In formal language-theoretic terms
    A schema defines a tree language.

Overview of XML Theory

  Cross-fertilization between XML, automata, and logic.

  Different sorts of automata: grammars, tree automata, tree-walking automata, register automata, transducers, ...

  Automata serve as
    an algorithmic toolbox
    an abstract formal model of schema languages, query and pattern languages

Summary slide

  What to remember?
    XML is an international standard
    XML documents or XML data are simply ordered unranked labeled trees with data values
    a schema defines a tree language (no data values)

  Focus of this talk
    Automata as a formal model for schema languages

2 Document Type Definitions

Document Type Definitions (DTDs)

  Example

    <!DOCTYPE store [
      <!ELEMENT store (dvd,dvd*)>
      <!ELEMENT dvd (title,price,discount?)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT price (#PCDATA)>
      <!ELEMENT discount (#PCDATA)>
    ]>

  Corresponding grammar (start symbol store)

    store    → dvd dvd*
    dvd      → title price (discount + ε)
    title    → DATA
    price    → DATA
    discount → DATA

  XML document

    store
      dvd
        title: "Amélie"
        price: 17
      dvd
        title: "Good bye, Lenin!"
        price: 20

  No data values

    XML document

      store
        dvd
          title
          price
        dvd
          title
          price

    Corresponding grammar (start symbol store)

      store → dvd dvd*
      dvd   → title price (discount + ε)

Extended context-free grammars as a formal abstraction

  Definition
    A DTD is a pair (d, s_d) where
      s_d ∈ Σ is the start symbol
      d maps every Σ-symbol to a regular expression over Σ

  Definition
    A tree t satisfies d (is valid) iff
      the root of t is labeled s_d
      for every vertex v labeled a, the string formed by the labels of the children of v belongs to the language of d(a).

  DTD validator

Optimization questions

  Schema containment (⊆)
    Given: schemas d1, d2
    Question: Is L(d1) ⊆ L(d2)?

  DTD containment reduces to containment of regular expressions:
    d1 ⊆ d2 iff d1(a) ⊆ d2(a) for all a ∈ Σ (when d1 and d2 are trimmed).

  Theorem (Meyer, Stockmeyer, 1973)
    Containment of regular expressions is PSPACE-complete.
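
The definition of validity given earlier in this section (root labeled with the start symbol, and every vertex's child string belonging to the content model of its label) can be sketched as a small validator. This is a minimal illustration under stated assumptions, not the validator referred to in the talk: the dictionary `DTD`, the constant `START`, and the function `is_valid` are hypothetical names, and each content model is encoded as a Python regular expression over child tag names, every name followed by `;`. Data values are ignored, as in the "no data values" abstraction.

```python
import re
import xml.etree.ElementTree as ET

# Assumed encoding: d(a) as a Python regex over child tag names,
# each name terminated by ";" so that names cannot run together.
DTD = {
    "store": r"dvd;(dvd;)*",              # store -> dvd dvd*
    "dvd": r"title;price;(discount;)?",   # dvd -> title price (discount + ε)
    "title": r"",                         # DATA: no element children
    "price": r"",
    "discount": r"",
}
START = "store"  # the start symbol s_d


def is_valid(root, rules=DTD, start=START):
    """Check the two conditions of the validity definition on an element tree."""
    if root.tag != start:                 # the root must carry the start symbol
        return False
    stack = [root]
    while stack:
        node = stack.pop()
        model = rules.get(node.tag)
        if model is None:                 # label outside the alphabet Σ
            return False
        child_word = "".join(child.tag + ";" for child in node)
        if re.fullmatch(model, child_word) is None:
            return False                  # child string not in d(label)
        stack.extend(list(node))          # visit every vertex of the tree
    return True


good = ET.fromstring(
    "<store><dvd><title>Goodbye Lenin</title><price>20</price>"
    "<discount>4</discount></dvd></store>"
)
bad = ET.fromstring("<store><dvd><price>20</price><title>x</title></dvd></store>")
print(is_valid(good), is_valid(bad))
```

Each vertex is checked against a single regular expression that depends only on its own label, which is why questions about DTDs reduce to questions about the individual expressions d(a).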

  Corollary
    DTD containment is PSPACE-complete.

Regular Expressions in DTDs Should Be Deterministic

  How accurate is our abstraction?

  Backward compatibility with SGML
    The XML specification requires regular expressions to be deterministic: for every symbol in the input string, we can uniquely determine which symbol in the regular expression it should match, without looking ahead in the input string.

  Example
    The expression (a + b)*a is not deterministic. Counterexample: baa.
    The expression b*a(b*a)* is deterministic.

  Why this restriction?

  Relevant questions
    1 How do we recognize
