Regular Rooted Graph Grammars: A Web Type and Schema Language

Dissertation

for the attainment of the academic degree of Doctor of Natural Sciences at the Faculty of Mathematics, Informatics and Statistics of the Ludwig-Maximilians-Universität München

by Sacha Berger

December 2007

First reviewer: Prof. Dr. François Bry (Universität München)
Second reviewer: Prof. Dr. Claude Kirchner (INRIA Bordeaux)
Date of the oral examination: 4 February 2008

Abstract

This thesis investigates a pragmatic approach to typing, static analysis and static optimization of Web query languages, in particular the Web query language Xcerpt [43]. The approach is pragmatic in the sense that no restrictions are imposed on the types for decidability or efficiency reasons; instead, precision is given up where necessary. Pragmatics on the dynamic side means using types not only to ensure the validity of the objects operated on, but also to influence query selection based on types. A typing language for graph-structured data on the Web is introduced. The graphs considered are based on spanning trees with references; the typing language is based on regular tree grammars [37, 38] extended with typed references. Besides ordered data in the spirit of XML, unordered data (i.e. in the spirit of the Xcerpt data model or RDF) can be modelled using regular expressions under unordered interpretation; this approach is new. An operational semantics for ordered and unordered types is given, based on specialized regular tree automata [21] and counting constraints [67] (the latter in turn based on Presburger arithmetic formulae). Static type checking of Xcerpt query and construct terms is introduced, as well as optimization of Xcerpt query terms based on schema information.

Zusammenfassung

This thesis investigates a pragmatic approach to typing, static analysis and optimization of Web query languages, in particular Xcerpt [43]. The approach is pragmatic in the sense that the user faces no restrictions on the modellable types for decidability or efficiency reasons; efficiency and decidability are instead bought, where necessary, by approximations during type checking. A type language for typing graph-structured data on the Web is introduced. The modellable graphs are so-called rooted graphs, built from a spanning tree and cross-references. The type language is based on regular tree grammars [37, 38], extended with typed references. Besides ordered structured data as common on the Web with XML, unordered data, as common in Xcerpt or RDF, can also be modelled. The approach used for this, the unordered interpretation of regular expressions, is new. An operational semantics for ordered as well as unordered types is given on the basis of specialized tree automata [21] and so-called counting constraints [67] (which in turn are based on Presburger arithmetic expressions [42]). Furthermore, static type checking and inference of Xcerpt query and construct terms, as well as optimization of Xcerpt queries on the basis of type information, are introduced.

Acknowledgements

This thesis was developed during my four-year employment as a research and teaching assistant at the University of Munich. I am grateful not only for having been able to study and become passionate about computer science at the University of Munich, but also for having the opportunity to convey this passion to the next generation of computer science students. In the course of this thesis, I would like to thank

• François Bry from the University of Munich, who supervised this thesis. Numerous fruitful, sometimes heated, but always supportive conversations with him have been an important driving force for me throughout.

• Claude Kirchner from INRIA (Institut National de Recherche en Informatique et en Automatique), France, for being the second reviewer of this thesis.

• Tim Furche from the University of Munich, for spending countless hours discussing with me high-level principles as well as the lowest-level details of Xcerpt, the value of typing and optimization of queries, and generic module systems.

• Norbert Eisinger from the University of Munich, for introducing me to scientific work and for spending much time with me discussing details of rule-based calculi.

• Paula-Lavinia Pătrânjan from the University of Munich, for being such a valuable friend and helping me to keep my head up when times were hard for me.

• Sebastian Schaffert from Salzburg Research, not only for being the one to introduce me to Xcerpt, but also (and especially) for being a good friend.

• Włodzimierz Drabent from the Polish Academy of Sciences and Linköpings universitet (LiU), for his expertise on the topic of descriptive type systems.

• Philipp Obermeier, who worked with me on the implementation of a prototype and who has thereby been invaluable in gaining insight into how to improve the understandability of this thesis.

• Andreas Häusler, who worked with me on the concepts of type-based query optimization.

. . . and last but not least: most gratitude goes to my family and my friends for supporting me patiently during the course of this thesis. Special thanks to Stefan Kollmannsberger for reviewing text and supporting me when needed most.

This research has been partly funded by the European Commission and by the Swiss Federal Office for Education and Science within the 6th Framework Programme project REWERSE number 506779 (cf. http://rewerse.net/).

Munich, 12th of December 2007

Contents

I Introduction and Preliminaries

1 Introduction
  1.1 Scope
  1.2 Contributions
  1.3 Outline of the Thesis

2 Preliminaries
  2.1 Context of this Research
    2.1.1 Information vs. Knowledge
    2.1.2 Documents and Data on The Web
    2.1.3 Obtaining Knowledge from Information
  2.2 Semistructured Data
  2.3 XML—The Extensible Markup Language
  2.4 From Schema-less Structure to Valid Data
    2.4.1 DTD—Document Type Declarations
    2.4.2 XML Schema
    2.4.3 Relax NG
  2.5 Querying The Web
    2.5.1 XPath
    2.5.2 XSLT
    2.5.3 XQuery
    2.5.4 Xcerpt

II R2G2

3 R2G2—Regular Rooted Graph Grammars
  3.1 Regular Tree Grammars
  3.2 Regular Tree Grammars for Unordered Unranked Trees
  3.3 Regular Rooted Graph Grammars
    3.3.1 Reference Types and Typed References
    3.3.2 About (Non Tree Structured) Graphs and Tree Grammars
  3.4 The Syntax of R2G2
    3.4.1 Core R2G2 Syntax
    3.4.2 Base Data Types
    3.4.3 Conversion Functions
    3.4.4 Base Functions
    3.4.5 The Non-Core R2G2 Constructs
  3.5 XML Serialisation of R2G2 Valid Data Terms
    3.5.1 Examples and Explanations of (De-)Reference Serialisation
  3.6 Semantics of R2G2


4 From a Generic Module System to a Module System for R2G2
  4.1 The Purpose of Modules
  4.2 Modules and XML Name Spaces
  4.3 Modular R2G2
    4.3.1 Syntax of Modular R2G2
  4.4 Realizing the Module System using the “Divide and Rule” Approach

5 Use Cases—Modelling Data and Documents with R2G2
  5.1 Beyond Regular Tree Grammars—The Use of Macros
  5.2 Design Patterns—R2G2 Best Practices
    5.2.1 Global versus Local
    5.2.2 Composition vs. Sub Classing
    5.2.3 ‘eXtreme eXtensibility’

III Automata Models

6 An Automaton Model for Regular Rooted Graph Languages
  6.1 Introduction to Regular Tree Automata
    6.1.1 Handling Ranked Trees
  6.2 An Automaton Model for Unranked Regular Rooted Graph Languages
    6.2.1 Labelled Directed Hyper Graphs as Non-Det. Regular Tree Automata
    6.2.2 Membership Test for a Tree using Hyper Graph Automata
    6.2.3 Recognition of a Rooted Graph using Hyper Graph Automata
  6.3 A Calculus Relating Automata and R2G2
  6.4 Some Theoretic Computations on Hyper Graph Automata
    6.4.1 The Emptiness Test
    6.4.2 Intersection of Regular Rooted Graph Automata
    6.4.3 Automata Based Subset Test for two Regular Rooted Graph Languages

7 A Model for Regular Languages of Multisets
  7.1 Introduction to Multisets and Multiset Languages
    7.1.1 Multisets
    7.1.2 Multiset Languages
  7.2 Counting Constraints
  7.3 A Calculus for Translation of Regular Expressions to Counting Constraints
    7.3.1 Example of a Regular Expression Translated to a Counting Constraint
  7.4 Some Set Theoretic Computations on Counting Constraints

IV Type Checking

8 Type Checking using Regular Rooted Graphs as Data Paradigm
  8.1 Types and Query- and Programming Languages
    8.1.1 Dynamic Typing and Type Checking in Programming Languages
    8.1.2 Static Typing and Type Checking in Programming Languages
    8.1.3 Combined Static and Dynamic Typing
    8.1.4 From Typed Programming Languages to Typed Query Languages
  8.2 Type Systems for Xcerpt
    8.2.1 XcerptT—Descriptive Typing for Xcerpt
    8.2.2 Prescriptive Typing: from CLP to Xcerpt
    8.2.3 Typing with R2G2—Differences to the Former Approaches
    8.2.4 The Syntax of R2G2 Typed Xcerpt
  8.3 Type Checking and Type Inference for R2G2 Typed Xcerpt Programs


    8.3.1 Paying Attention to Type Annotation in Queries
    8.3.2 Typing Ordered Queries
    8.3.3 Typing Unordered Queries
    8.3.4 Typing of Construct Terms
    8.3.5 Typing Rules
    8.3.6 Typing Programs
    8.3.7 Coverage of Current Xcerpt Constructs

V Outlook & Conclusion

9 Outlook
  9.1 Type Based Querying—an Extension to Simulation Unification
    9.1.1 Extending Ground Query Term Simulation With Active Type Querying
    9.1.2 Extending Simulation Unification with Active Type Querying
  9.2 Optimizing Xcerpt Query Evaluation Based on Type Information
  9.3 Integrating Types Into visXcerpt

10 Conclusion

Bibliography

Curriculum Vitae


Part I

Introduction and Preliminaries


1 Introduction

This dissertation is about types and schemata for Web languages such as Web query and transformation languages.

1.1 Scope

Currently, many new application fields, together with their domain-specific processing and modelling languages, are emerging on the Web. Domain-specific modelling languages are often defined by standardisation consortia like the W3C, ECMA, OASIS and others. Usually, definitions come along with a human-language description as well as formal definitions using grammar and schema languages like EBNF, DTD [61], XML Schema [57, 58, 59], or Relax NG [30].

Domain-specific processing languages are often query languages defined by the same standardisation consortia issuing the corresponding modelling languages. An example of such a domain-specific language is the Semantic Web query language SPARQL [62], used to query knowledge modelled in the Semantic Web modelling language RDF. Along with domain-specific processing languages, some general-purpose processing languages exist, e.g. XSLT [47], XQuery [63] and Xcerpt [10, 43], with the rationale of being usable for the implementation of arbitrary Web applications, applicable to any domain-specific data, and also for writing inter-domain applications. The processing languages on the Web, whether general-purpose or domain-specific, are usually untyped or dynamically typed.

Modern programming languages nowadays generally come with type systems to optimize evaluation or memory representation, to check for errors, or simply to support programmers using sophisticated IDEs.1 Type declarations in most typed languages are based on the concept of grammars: while a grammar defines a language, which is a set of words, a type declaration defines a class of objects, which is a set of data instances. Surprisingly, while many formalisms for data on the Web are defined using grammars or schemata, their potential to serve as type declarations for processing languages is hardly exploited. The goal of this work was to evaluate the use of Web schemata in practice for the typing of Web query and transformation languages.

1Integrated Development Environment—these applications integrate all kinds of tools useful for programming in a given programming language, ranging from specialized text editors through re-factoring tools up to compilers.


The typing of Web query languages is defined such that types in the style of schema information are exploited to

1. find static errors in programs with the help of schema knowledge (i.e. find errors prior to execution time, e.g. at compile time),

2. hint at possibly odd behaviours of programs that could be intended but probably are not (e.g. constant results of a query, independent of the input data), and

3. enable automatic optimizations by rewriting queries, under given schema constraints for input or output data, into more efficient equivalent queries (this is especially applicable considering that syntactically different queries can be found for the same task).

A first step towards typing of Web query languages was to evaluate the quality of current schema languages on the Web for modelling data, from the point of view of a person willing to use a query language with type support and willing to use the given schema languages as type declarations. Almost all current schema languages on the Web are based on so-called regular tree grammars [38], with some extensions and also some restrictions. Common extensions, as e.g. found in XML Schema, are modelling facilities from object-oriented systems. Restrictions are often very subtle trade-offs in favour of simple adaptation of legacy systems like SGML DTDs. During this evaluation it turned out that some restrictions were undesirable and some further extensions desirable. As the outcome of this step, a new type and schema language called R2G2 (short for Regular Rooted Graph Grammars) is presented.

The second step was to define which static properties are of interest and detectable when the type or schema of the queried data, or of the desired result, is known. In this part of the evaluation, properties related to static program analysis are addressed. Static program analysis, e.g. static type checking and type inference for programming languages, is vital to ensure secure access to the random-access-memory representation of data structures. For query languages, the necessity of secure memory access is arguably not an issue: query languages are used to select, project and construct data of a fixed data meta-model. The meta-model for the Web query language XQuery is XML.2 As query languages operate ‘within the bounds’ of their meta-model, no memory access violation is possible.

So, what kinds of static properties of query languages could be relevant, and what impact could those properties have on the use of query languages? Assuming a query selects data from a document or database and the schema of the document is given, then some structural properties of the document are known prior to run time. For a query to data on the Web, it could e.g. be known that the resources to be queried are HTML [49] documents. Assuming further that the selection construct used is structured in such a way that it may match data found in an SVG [54] document, but under no condition in an HTML document, then the query can be considered to never select anything for valid input data: the programmer most likely made a mistake and should be informed about it before run time.

Another static check that can be performed on a query typed with schema information is to check the validity of results, if a schema is given as type for the construction parts of the query. A query can be considered ill-typed if the data resulting from query evaluation could be invalid with respect to the given schema. While a query that never selects anything may be intended by the author, a query constructing invalid data should be considered an error. In both cases, these are sources of valuable messages to the programmer prior to program execution, and possible hints on how to improve or correct the query. The outcome of the second step is a type checking and type inference algorithm for Xcerpt (and Web query languages in general) with type declarations in R2G2. For pragmatic reasons, the notion of types is defined in such a way that type checking and inference are still possible in polynomial time.
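To make the two kinds of static errors concrete, consider (in schematic, Xcerpt-like query term notation, which is only introduced in section 2.5.4; all element names here are invented for illustration) a program whose input is known to be valid HTML. The query term

```
desc rect {{ var W }}
```

can never match, since rect elements occur in SVG but not in HTML documents; a type checker knowing the HTML schema can report this before run time. Conversely, if the output schema requires every book element to contain a title, a construct term like

```
book [ var A ]
```

is ill-typed whenever the binding of the variable A is not guaranteed to be a title.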

The third step was to evaluate and exploit some possibilities for static optimization of typed Xcerpt programs, under the assumption that the queried data is always valid with respect to the

2While the meta-model for the query language SQL is the relational database model.

given types or schemata. The static optimizations in mind relate to program or query rewriting, not to modification of the program run-time environment, program translation or the query evaluation engine. For given data there are usually different queries producing the same result. For many query language implementations it is likely that queries with higher selectivity at the beginning of query evaluation evaluate faster, as they sieve out irrelevant intermediate results at an earlier stage of execution. This means that fewer candidate elements have to be tested during evaluation. A prerequisite for this assumption is that the programming or query language contains constructs for the expression of incompleteness. Such constructs of incompleteness are not only more convenient for the query author; in many situations they are even indispensable due to the incompleteness of the structure and shape of data on the Web. Under the assumption of more certainty about the queried data due to given schemata or type definitions, it is often possible to reduce incompleteness by replacing the corresponding constructs of incompleteness with query constructs that selectively restrict the query, such that it focuses early on portions of data containing the desired results.

Achieving an optimal query by rewriting a given query, under type assumptions as well as for untyped queries, is an undecidable problem, as the optimal shape of the query depends on the actual shape of the queried data instance. Furthermore, some query language implementations may be more efficient under certain forms of incompleteness, by reducing the checks in concrete evaluation steps. A pragmatic approach to optimization is hence to try to achieve queries with usually better behaviour, based on heuristics about the given query evaluation engine. The outcome of this part of the thesis is a rewriting rule system relating typed Xcerpt queries with incompleteness to equivalent Xcerpt queries without incompleteness. This work has been presented at the First International Conference on Web Reasoning and Rule Systems [11]. The task of evaluating possible heuristics has not been addressed, as insufficient knowledge about the run-time behaviour of current Xcerpt run-time engines exists.
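As a schematic illustration of this kind of type-based rewriting (again in Xcerpt-like notation introduced in section 2.5.4; element names are invented), a query using the incompleteness construct desc (‘descendant’), such as

```
html {{ desc title {{ var T }} }}
```

could, under an HTML-like schema guaranteeing that title elements occur only inside head elements, be rewritten into the equivalent but more selective

```
html {{ head {{ title {{ var T }} }} }}
```

which avoids searching subtrees that cannot contain a title.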

1.2 Contributions

The main contributions of this thesis can be summarized as follows:

1. a Web schema language to model (serialisations of) graph-structured data, including a new modelling approach for unordered data or multisets, based on unordered interpretations of regular expressions, and

2. a typing approach for Web query languages based on unrestricted regular tree grammars with the aforementioned extensions.

The first contribution, the type and schema language R2G2, has been implemented for integration, e.g. in Web query or transformation languages, but also for stand-alone use as a schema language for XML or Xcerpt data terms. The modelling approach for unordered data based on unordered interpretation of regular expressions is implemented as a prototype based on constraint solvers in GNU Prolog [24]. The typing of Web query languages has been implemented, based on R2G2, for the Web query language Xcerpt. Currently two implementations exist: a prototypical one supporting ordered and unordered query terms, and a second implementation for integration into the prototype of Xcerpt currently under development. The second version offers no support for unordered data models as of now.

1.3 Outline of the Thesis

First, the context in which typing is to be applied is introduced. This spans various Web technologies: XML as a document and data formalism, the currently existing schema languages, which will be used as a starting point for the proposed type and schema language, as well as various Web query languages.


Next, some shortcomings of the current schema languages are pointed out, and R2G2, a new type and schema formalism, is presented. R2G2 has been conceived to overcome the mentioned shortcomings. A novelty in this language is the use of regular expressions for the modelling of unordered data. A declarative semantics of R2G2 is presented along with the syntax. The schema language is conceived with modularity in mind, so as to be applicable not only to small examples but also to large-scale projects. A generic module system (originally conceived for R2G2 and Xcerpt) is used to modularize R2G2. The part about R2G2 concludes with a use case section based on the popular XML Schema authoring style guidelines “XML Schema best practices” [22].

As a basis for implementations of R2G2 as a schema language as well as a type formalism, the operational semantics is presented. It is based on regular tree automata, as e.g. introduced in [21], for ordered content, and on counting constraints [67] for unordered data. Regular tree grammars are a very general model, specified for reasoning about ranked tree structures. Common techniques exist for lifting unranked trees—common in the context of the Web and XML—to ranked trees; as, however, the context of R2G2 is the world of unranked trees (more precisely, rooted graphs based on spanning trees with typed references), a new automaton model dedicated to reasoning about unranked data is presented. The new automaton model is arguably easier to use in the context of Web schema languages, as it aggregates in one place all the information necessary to validate a node in the context of its graph, and provides a direct representation of a type for an XML node—transitions in this model represent node types. The automaton model is combined with Presburger arithmetic expressions, so-called counting constraints, where unordered content models are needed. These constraints express relationships between the multiplicities of elements in a multiset—multisets are an appropriate model for the unordered content of an element. Presburger arithmetic counting constraints are decidable. A calculus for deriving counting constraints from given regular expressions is presented. Last but not least, for the automaton model as well as for languages of multisets declared using counting constraints, algorithms for deciding some set-theoretic properties—namely the emptiness test, union, intersection and subset test—are presented. These algorithms are crucial for type checking.

The next part presents a type checking algorithm for Web query languages based on the algorithms and techniques presented in the part about automata and counting constraints. The type checking is presented for the Web and Semantic Web query language Xcerpt; however, it should be easy to adapt the techniques to other Web query languages like e.g. XPath and XQuery, as all Web query languages share the concepts of data selection, variables, data projection and/or result construction. The presented algorithm focuses on data selection, construction and variable consistency checking. The algorithm is not only suited to type checking of fully type-annotated programs, but also provides type inference for partly type-annotated programs.
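As an illustration of the counting-constraint idea (the notation here is informal; the precise calculus is developed in chapter 7): under unordered interpretation, a regular expression constrains only the multiplicities of symbols in a multiset $M$ over the expression's alphabet. Writing $\#_M(a)$ for the number of occurrences of $a$ in $M$, one obtains, for example:

```latex
% Unordered interpretations of regular expressions as
% constraints on symbol multiplicities (informal notation):
\begin{align*}
  a^{*}\, b\, b  &\;\rightsquigarrow\; \#_M(b) = 2 \\
  (a\, b)^{*}    &\;\rightsquigarrow\; \#_M(a) = \#_M(b) \\
  a\,(b \mid c)  &\;\rightsquigarrow\; \#_M(a) = 1 \;\wedge\; \#_M(b) + \#_M(c) = 1
\end{align*}
```

Each right-hand side is a formula of Presburger arithmetic over the multiplicities, which is why emptiness, intersection and inclusion of the corresponding multiset languages remain decidable.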

2 Preliminaries

2.1 Context of this Research

The World Wide Web (known as “WWW”, “Web” or “W3”) is the universe of network-accessible information, the embodiment of human knowledge.

[W3C, http://www.w3.org/WWW/]

The Web is a kind of network of any information anybody around the world considers worthwhile to publish; almost no boundaries for quality, quantity, style, medium and format exist.

2.1.1 Information vs. Knowledge

A common mistake when talking about information is to misconceive information as knowledge. In short, information exists without any context, while knowledge is the result of thinking, understanding and solving problems. Information is useful to obtain knowledge, and knowledge can be turned into information. Knowledge is information that is accessible at the right time and in the right context to solve the right task.

The tremendous amount of information that forms the current World Wide Web makes the difference between knowledge and information especially clear: it is usually easy to find information we are able to name precisely by using current search engines, but it is difficult, with current Web technologies, to solve tasks involving information that cannot be searched for, or whose name we do not know. The root of the problem is that machines are not able to understand – to know – the information stored on the Web.

Making the Web accessible and understandable to machines as well as to humans will arguably be a major task in the future. While the amount of information increases vastly, it becomes intractable for humans, such that some sort of “machine-assisted thinking” is needed. Steps in this direction are taken by initiatives such as the Semantic Web, which claims that machine-understandable and semantically motivated annotation of the Web is necessary. Generally speaking, end users of the Web will need general-purpose tool support, as it is known nowadays to programmers in the form of programming languages, to command their machines to assist them in digging for the right information on the Web.


2.1.2 Documents and Data on The Web

Initially, the Web was invented as a network of documents, where documents are written text meant to be read and understood by humans. Markup for stylistic enhancement of text, markup giving semantic hints about text or phrase structure or content type, and a reference mechanism to external resources like pictures made the authoring of almost any content fairly easy on the Web. Furthermore, the Web provided handy support for citations, which made it possible, through the use of hyperlinks, to access other documents in a seamless way not experienced earlier. Shortly thereafter, the Web document markup language, namely HTML, was extended to also represent tables and forms. Later on, increasing numbers of views of traditional databases were made available on the Web, mostly as tables accessible through forms: Data on the Web [3] was introduced, interwoven with the Web of documents. Currently, increasing numbers of machine-accessible formats and so-called Web Services extend the Web landscape with the data-oriented aspect.

The main difference between data and documents can be summarized as the usually more homogeneous structure used for data than for documents, and often the lack of human-language phrases: documents are instances of natural languages while data are instances of artificial languages. Prominent examples of database-oriented content on the Web are on-line shops like Amazon (e.g. see http://www.amazon.com/). Both aspects of the Web are equally important to human consumers, while automatic consumption is arguably focused mainly on data. Recently, traditional database approaches like querying data have been extended to Web data and documents as an answer to the rising demand for flexible automated access methods to the Web for end users.

The interwoven nature of the Web as a network of (mostly) structured data with (mostly) arbitrary schemata and (mostly) unstructured documents motivates the so-called Semistructured Data model, as presented in [3] and briefly introduced in section 2.2. Arguably, Semistructured Data is the formal foundation of current technologies around automated processing of data and information on the Web. Still, the current Web query languages are far from being end-user-friendly methods to extract information from the Web, and even further from being tools to obtain new knowledge from the Web.

2.1.3 Obtaining Knowledge from Information

Earlier approaches supporting users in gathering new knowledge out of information are known under the term data mining. Data mining is usually tailored towards experts on a topic, often with a heavy statistical background, who dig in a closed and very homogeneous application-specific domain database called a data warehouse. The general nature of data mining in data warehouses can be summarized as follows: (1) the data warehouse is a (mostly) static materialisation of (mostly) many joins between data, (2) a closed world of data is queried, and (3) the structure is well known. Application of traditional data warehouse approaches to a broad-scale Web user base is arguably not suited for the task of extracting knowledge on the Web, for various reasons: (1) there is no feasible way of visiting the whole Web from an end user's point of view, (2) the amount of data is by far too large to materialize any sort of expansion of the whole Web, and (3) there is no inherently given structure of Web data as a whole, just as there is arguably no inherent structure of the knowledge of humanity.

An assumption about the way to dig for knowledge on the future Web is that classical Web searching is to be combined by the end user with automated Web crawling, deductive methods, perhaps inspired by artificial intelligence research, and data querying on the Web as proposed for the querying of Semistructured data. Combining all those methods in an end-user-friendly way is a highly challenging task that is by far not solved today.

In the following, some foundations of the current and of the future Web, as standardized to date, are presented. A step towards merging two of those areas—querying and schematizing of data—is motivated as one of many steps towards achieving the ambitious goal of end-user-friendly access to the future Web. This is the step that is addressed in this thesis. The benefits of the integration

of schematizing and querying lie in supporting the user friendliness of query languages by

1. extended error detection support,

2. potential optimization of queries and hence increased system reactivity, and

3. potential help for authors using specialized query development environments, e.g. by generating auto-completions or copy-and-paste templates based on schema information.

2.2 Semistructured Data

“Semistructured Data” are self-explanatory (or self-describing) data with or without a schema (or data model). The denomination “Semistructured” refers to the structure conveyed by tags combined with (possibly large) unstructured portions of text. The structure in question is a node-labelled graph structure with ordered or unordered sequences of adjacent nodes. The graph structure is serialized using so-called Semistructured expressions, defined as follows:

• A Semistructured expression is either a (quoted) textual item, a data term or a reference to a data term.

• A data term consists of a label l, an optional unique identifier id followed by an @-sign (e.g. i1@l{}), and a (possibly empty) sequence of sub-terms t1, . . . , tn, that are

– either enclosed in square brackets (i.e. [t1, . . . , tn], denoting an ordered sequence of sub-terms),

– or enclosed in curly braces (i.e. {t1, . . . , tn}, denoting an unordered multiset of sub-terms).

• A reference to a data term uses the identifier of the referenced term, prefixed by a ‘hat’ sign (e.g. ˆi1). Note that reference and identifier declaration may occur at arbitrary depth, as long as they occur below a common Semistructured expression.
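To illustrate, the following is a sketch of a Semistructured expression; the labels (book, authors, and so on) are invented for illustration. It combines an ordered sequence, an unordered multiset, an identifier declaration and a reference:

```
book[
  title[ "Data on the Web" ],
  authors{
    a1@author[ "Abiteboul" ],
    author[ "Buneman" ],
    author[ "Suciu" ]
  },
  corresponding-author[ ^a1 ]
]
```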

Considering Semistructured data in databases was first proposed and investigated in a few research projects in the mid-1990s. The book “Data on the Web” by Abiteboul et al. [3] introduces this direction of research. Semistructured data is an interesting abstraction of XML data as presented in section 2.3. The Web and Semantic Web query and transformation language Xcerpt, as introduced in section 2.5.4, is syntactically strongly based on Semistructured expressions, and its data model is graph-shaped, node-labelled data with ordered or unordered sequences of adjacent nodes—just as Semistructured data.

2.3 XML—The Extensible Markup Language

XML is a generic markup language. Markup languages are traditionally used to mark up or annotate text; generic markup languages allow markup with arbitrary annotation elements. XML has been developed as a simplification of SGML, an earlier markup language developed in the 1970s [1]. The simplification was mainly motivated by easing the implementation of efficient XML-aware software and by supporting simpler document authoring through reducing the amount of rarely used or obscure features. To some extent the streamlining of SGML into what became XML was arguably inspired by the success of Java—roughly speaking, a language close to a streamlined C++ that quickly gained success for its cleanness and simplicity compared to its predecessor. XML has been introduced by the W3C and is positioned as a development towards a more generic markup language than HTML. One of the first XML applications was a reformulation of HTML as an XML application, called XHTML.


Figure 2.1: As an example of a document with elements, consider the apple tree modelled in this example. Various nodes of different name and/or type can be arranged in a hierarchical manner. The XML serialisation of the tree data structure on the right is shown on the left.

XML is a serialisation format for tree-structured or, to some extent, graph-structured data (see figure 2.1 for an example of a tree-structured document and its XML serialisation—in figure 2.3 a graph-shaped example is presented, using so-called XML attributes, which have not been introduced so far). Different kinds of nodes with different properties form an XML instance. An XML instance is called well formed when the serialisation is syntactically correct. XML is a semistructured data model, where semistructured means that no structure is predefined or given for instances, as it is e.g. in structured data models as known from structured programming languages (cf. struct in C or record in Modula). On the other hand, the data is structured by the given structure of the instances. To a certain extent the data is structured in an abstraction of the data model that represents tree structures in general. Another interpretation of semistructured in the context of XML is the possibility to provide a schema for the data, where neither the use of a schema nor validity with respect to a schema is enforced—instances may be structured according to a schema or they may be schema-less. Schemata for XML are currently known under the names DTD, XML Schema and Relax NG.

Elements are XML nodes that are always the child of exactly one node and may contain arbitrarily many child nodes. See figure 2.1 for an example. The child nodes are organized in an ordered sequence. Elements also have a textual label. In a well formed XML document, elements are serialized as an opening tag and a closing tag surrounding the serialisation of the child nodes. An opening tag is serialized as the label in angle brackets and the closing tag as the label preceded by a slash in angle brackets. Empty nodes can be abbreviated as the label followed by a slash in angle brackets. As elements always have a parent node, a kind of root is needed; usually a document node forms that root. A document node may only contain one root element.
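As a sketch of this serialisation scheme (the element names are chosen freely in the spirit of the apple tree example, which is not reproduced in the text), an element with children, a nested element and an abbreviated empty element could be serialized as:

```xml
<tree>
  <branch>
    <leaf></leaf>
    <apple/>   <!-- abbreviated form of an empty element -->
  </branch>
</tree>
```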

Attributes are nodes with a textual name and a textual value which are not members of the child node sequence of an element but of an (unordered) attribute set. Each element has one attribute set, which may also be empty. See figure 2.2 for an example of a document with elements and attributes. In a well formed XML document instance, two attributes with the same name may not occur in one attribute set. Attribute sets are written within the angle brackets of the opening element tag. A well formed attribute is written as its name followed by an equals sign and the value quoted either in single or double quotes, excluding the possibility to use the chosen quote character within the value text. Commonly, attributes are interpreted as meta information about the element, but there is no formally dictated use for attributes. When modelling graph-shaped structures, attributes are commonly used to declare identifiers and references to identifiers, as presented in figure 2.3.


Figure 2.2: This example extends the apple tree example in figure 2.1 such that the elements are annotated using attributes. Elements may contain arbitrarily many attributes, but no two of them may have the same name. Note that attributes have no explicit order in which they occur.

Figure 2.3: This example presents the ability to model graph structures beyond tree shaped data in XML—identifiers and references to identifiers are used in this case. Identifiers and references are given as XML attributes. Actually, the attributes have to be declared to be of identifier or reference type, e.g. using a DTD as introduced later in section 2.4.1.
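A minimal sketch of this technique (the attribute names id and ref are hypothetical; as noted in the caption, they would have to be declared as identifier and reference typed, e.g. in a DTD):

```xml
<tree>
  <branch id="b1">
    <leaf/>
  </branch>
  <branch>
    <!-- a cross-reference turning the tree into a graph -->
    <graftedFrom ref="b1"/>
  </branch>
</tree>
```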



Figure 2.4: This example presents a document centric XML instance. In this case, the serialisation (left) is a more appropriate view on the content than the tree representation on the right; some words or phrases are annotated with markup surrounding parts of the text.
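Based on its textual content, the serialisation of the example in figure 2.4 may be sketched roughly as follows (the element names are guesses):

```xml
<pamphlet>
  <title>Document Markup with XML</title>
  <p>This pamphlet illustrates the use of XML
  <term>character data</term>, also known as <abbrev>CDATA</abbrev>,
  together with markup. Note, how <em>semantical</em> markup is used
  along with <bold>stylistical</bold> markup.</p>
</pamphlet>
```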

Character Data Character data nodes are nodes without child nodes and attributes; they consist only of textual information. Character data nodes are members of a child node list, i.e. they are members of the ordered nodes. In a well formed XML document instance, character data nodes are serialized just as the text they represent, without any quotation. Angle brackets1 and the ampersand sign may not occur in the well formed serialisation; they have to be substituted by so-called character entities, which are themselves general entities. Two character data nodes may not be direct neighbours, as they would then collapse into one node in the serialisation. Initially, e.g. in the context of SGML documents, the textual content was considered to be of major importance, while elements were mere annotation of the textual content. The lack of quotation of character data makes XML especially convenient for document authoring, where the major content is arguably textual. In the context of data modelled in XML, textual content and elements are arguably of equal importance, yet character data nodes are often ignored in research for pragmatic reasons.2 See figure 2.4 for an example of an XML document with character data.
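For instance, the predefined character entities &lt; and &amp; substitute the reserved characters within character data:

```xml
<condition>a &lt; b &amp;&amp; b &lt; c</condition>
```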

Further Core Features Further features of XML are only mentioned briefly here, since they are not needed in this thesis:

• Character encoding declarations specify the character encoding used for the serialisation of character data, attributes and elements.

• Processing instructions are meta information about the document, to be considered by the application processing the serialised data. Like character data, processing instructions may not be further structured.

• Comments are not members of the tree structure either. Like character data, comments may not be further structured.

• General entities are a macro expansion or substitution mechanism for abbreviating XML node sequences or special characters not expressible in the chosen document text encoding—any Unicode character can be generated this way using a so-called numerical character entity, which is itself a general entity.

1Indeed, only the opening angle bracket may not occur in character data serialisations. 2Conceptually, there is no big difference between a leaf element and a character data node.

2.4 From Schema-less Structure to Valid Data

Two different interpretations of the term “Semistructured” exist for Semistructured Data, of which XML is a representative:

• Data may be created without any notion of a schema declaring a structure, so it is unstructured in the sense of schema-less, while it is structured in the sense of being tree or graph structured data that has the structure of the instance—being structured with respect to the data meta model.

• A schema for data may exist and data can be an instance of the given schema, but data without a schema is also tractable, as so-called well formed data.

In both interpretations the notion of a schema is involved. In between the two extremes, a schema can be given and applied to many sections of a data instance, while exceptions may exist for other sections, in the form of allowing elements of any type for these sections. For XML, different schema formalisms have been introduced, partly ad hoc or inspired by the predecessor SGML, partly conceived from a theoretical foundation. The widely accepted theoretical foundation of XML schema languages are so-called regular tree grammars [15, 38]. Regular tree grammars are a subcategory of the context free languages that provides a controlled way of modelling the bracket structure inherent to tree serialisations, yet preserves the desirable properties of regular languages of being closed under union, intersection and difference. A well formed XML document is called valid if it is an instance of the language generated by the grammar given in the form of a schema instance. An in-depth explanation of regular tree grammars is given in the chapter about R2G2 (see 3).

2.4.1 DTD—Document Type Declarations

DTDs are the oldest schema mechanism for XML, inherited from XML’s predecessor SGML. While an SGML document requires a DTD (and validity with respect to that DTD), an XML document may or may not have a DTD and may or may not be valid with respect to that DTD. The DTD may be defined in the XML document instance or at an external resource. Note that every valid XML document has to be well formed, while not every well formed document has to be valid.

Element Type A DTD consists of a set of element type declarations, each describing the structure of an element. An element type declaration relates an element name to the valid content model of elements with that name. For each element name declared in an element declaration, there is no other element declaration with the same element name, i.e. the element declarations are unambiguous. In a valid document instance, all elements are valid with respect to an element type declaration. An element type declaration for an element name l and a content model c is written as
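In DTD syntax, this generic form reads (l and c are placeholders):

```dtd
<!ELEMENT l (c)>

<!-- e.g., a branch element containing arbitrarily many leaf elements: -->
<!ELEMENT branch (leaf*)>
```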

Document Type In a valid document not only are all elements valid; additionally, the root element has to be of the type given by the document type declaration. The document type declaration is always part of the document instance, while the element type declarations (and also the attribute sets mentioned later) may also reside in an external file. The document type declaration also relates a set of element type declarations to a document. One form of document type declaration relates an external set S of element type (and attribute type) declarations to a document and to its root node with label l, written as
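In DTD syntax (l and S are placeholders):

```dtd
<!DOCTYPE l SYSTEM "S">

<!-- e.g., referring to an external file tree.dtd: -->
<!DOCTYPE tree SYSTEM "tree.dtd">
```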


The other form relates the root label l of the document with the element type (and attribute type) declarations R directly in the document as
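In DTD syntax (l and R are placeholders):

```dtd
<!DOCTYPE l [ R ]>

<!-- e.g., a complete internal declaration set for an apple tree document
     (element names are illustrative): -->
<!DOCTYPE tree [
  <!ELEMENT tree   (branch*)>
  <!ELEMENT branch (leaf*)>
  <!ELEMENT leaf   EMPTY>
]>
```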

In both cases the document type is declared after the XML declaration specifying the document encoding and before the opening tag of the root element.

Content Models A content model for an element is given as a regular expression over element names. An element is valid with respect to an element type declaration if its sequence of child nodes is recognized by the regular expression, i.e. the sequence of element names of the child nodes is recognized by the regular expression and the child nodes are themselves valid with respect to the element declarations for their names. Regular expressions are formed by sequencing element names with commas; optional parts get a question mark appended to the regular expression part; disjunctions are formed by separating two disjunctive parts with a pipe sign (vertical bar); and repetitions are modelled by appending a star or a plus sign (star for zero-to-many, plus for one-to-many repetitions) to a regular expression. The following example motivates the concept of regular expressions in the context of DTDs. A valid article element contains first a title element, followed by an arbitrary number of section or paragraph elements, and at last an appendix element, which may also be omitted. Note that the repetition of section or paragraph involves a recursive composition of regular expressions, as the disjunction is itself a regular expression.
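In DTD syntax, the described article type reads:

```dtd
<!ELEMENT article (title, (section | paragraph)*, appendix?)>
```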

Attribute Sets Another part of the content model of an element are the attributes. Attributes are modelled in a different way, as they do not occur in any explicit order, whereas regular expressions impose an order on the atom instances occurring in accordance with a regular expression. The attributes of an element are declared with an attribute set. An attribute set relates an element name to some attribute definitions. An attribute definition can be optional (in DTD terms, implied) or required. Attribute values may be arbitrary character data, a fixed value, an (almost) arbitrary string of character data that is unique among the attribute values of that type, forming a unique identifier, or a reference (even a sequence of references) to such unique identifiers. An attribute set declaration is introduced by the keyword ATTLIST. The first token in the attribute set declaration is the name of the containing element. It is followed by a sequence of triples consisting of (1) an attribute name token, (2) an attribute value type declaration and (3) an attribute default declaration. As an example, consider the following attribute set declaration for leaf elements as they occurred in figure 2.2. The color is a mandatory attribute while the state is optional.
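In DTD syntax, this attribute set declaration may look as follows (the value type CDATA is an assumption; an enumeration type would also be possible):

```dtd
<!ATTLIST leaf
  color CDATA #REQUIRED
  state CDATA #IMPLIED>
```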

Types of Character Data The content of an element may also be declared to be character data. In this case no elements may occur together with the text.

Mixed Content The content of an element may be a mixture of character data and arbitrary elements, so called mixed content. In this case any declared element or character data may occur in arbitrary combination.
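In DTD syntax, the two content models just described are written using the keyword #PCDATA; mixed content is a repeated disjunction of #PCDATA and element names (element names are illustrative):

```dtd
<!-- character data only: -->
<!ELEMENT title (#PCDATA)>

<!-- mixed content of text, em and bold elements: -->
<!ELEMENT p (#PCDATA | em | bold)*>
```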

Limitations The document definition language DTD has some limitations. These led to the development of its successor, the XML Schema language.


Alternative Content Models For Equally Named Elements There is no way to specify different content models for an element according to the context it occurs in. The example of a university schedule (Code Example 1) illustrates the problem:

Code Example 1 An example of a university schedule modelled using XML. The element name name is reused in different contexts (each time meaningful, but with a different meaning); this cannot be modelled in a satisfying way using DTD.
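Based on the description below and on the content fragments of the original example (the lecture name, the rooms E101 and D1.03, the tutor Benedikt Linse and the lecturer Prof. Francois Bry), the instance may be sketched roughly as follows (the element names are guesses):

```xml
<lecture>
  <name>Semi Structured Data and Markup Languages</name>
  <tutor>
    <name><first>Benedikt</first><last>Linse</last></name>
  </tutor>
  <lecturer>
    <name><title>Prof.</title><first>Francois</first><last>Bry</last></name>
  </lecturer>
  <lecture><room>E101</room></lecture>
  <lecture><room>D1.03</room></lecture>
</lecture>
```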

A lecture is presented. A lecture has a name, may have arbitrarily many tutors for exercise courses, and a lecturer for the lectures. The ambiguous term lecture gives rise to the bizarre fact that a lecture as an administrative unit contains the information about the lecture as a time- and location-bound entity, the event of instructing. In this example the same name has been used for the different meanings of the same term, as in human language. Although the elements share the same name, they occur in different parent node contexts and they have a different structure. A second example of label ambiguity is the name element, which is used once with plain CDATA as content, to represent the name of the lecture, and once with structure, to model the names of the lecturer and the tutors. Again, the parent node context is essential for the choice of the proper content model of equally named elements. The hierarchical context is not, for all possible XML applications, the property determining the content model of ambiguous elements; the occurrence in the list of siblings or the attributes of an element may also determine the choice of the right content model. When modelling data with DTDs, the content model is associated with elements of a certain name, not with elements in a certain context. As there is no way to declare concurrent element declarations, and especially no way to refer to one alternative of the concurrent declarations, documents such as the presented university schedule cannot be modelled properly. As a workaround, the alternative content models are usually combined into one disjunctive content model, which however allows the use of the wrong content model in a given context.

Data Types for Atomic Values An often criticised limitation of DTDs is the lack of integrated data types other than character data, e.g. integers, enumeration types or similar. For text document centric modelling, the lack of such data types is not a major issue. The first goal of XML is, as stated by the W3C itself, that “XML is for structuring data”. For data modelling a rich set

of data types is mandatory. The successor of DTDs, XML Schema, is equipped with a rich and slightly extensible set of data types to model many restrictions of strings, numerous number formats, date information and some other data types.

Namespaces DTDs are not namespace aware; it is not possible to model document types with labels in a specific namespace. More or less difficult workarounds exist, but they always rely on modifying the DTD in the document instances by overriding parts of it, e.g. with special entities simulating namespace expansion.

Deterministic Content Models A less often mentioned limitation of DTDs is the restriction to so-called deterministic content models. As an example of a non-deterministic content model, consider the following element declaration of a chess game with alternating moves of the black and the white pieces:
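The declaration in question may be written as follows (matching the sub-expressions discussed below):

```dtd
<!ELEMENT game ((white, black)*, white?)>
```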

The content model allows the game to end after each move, while modelling the alternating colors of the moves. Intuitively speaking, when a white move instance occurs, the DTD validator may not be sure which part of the content model to choose (e.g. whether the move is valid with respect to the (white,black)* or the white? sub-expression). It is easy to think of a deterministic automaton representing exactly the given content model: it consists of two final states, interlinked with a “white” transition from the one to the other and a “black” transition from the other back to the one state3. The W3C explains the non-determinism of DTD content models in a non-normative section of the XML Recommendation (see http://www.w3.org/TR/REC-xml/#determinism) as follows:

[...] a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each posi- tion in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following posi- tion is labeled with the same element type name, then the content model is in error and may be reported as an error. [...]

Weak Unordered Content Support Two different notions of disorder can be addressed: (1) the order of elements in a data instance is relevant, but from the point of view of a schema, valid data instances may occur in arbitrary order; (2) the order of elements in data instances is irrelevant, which applies e.g. when modelling a set or multiset. To be able to express that certain elements may occur in arbitrary order, the author of an XML DTD is usually forced to give all possible permutations of the order. In addition, the deterministic content model property mentioned in the previous paragraph must be retained, making the task even more cumbersome. With the XML predecessor SGML, the DTD formalism had an option to model a sequence of elements with arbitrary order (an unordered sequence operator was available), but the feature has been dropped for simplification purposes. When it comes to the necessity of modelling unordered content of elements of different types, XML DTD authors hence tend to simplify and to state that arbitrarily many elements of a choice of types are allowed in arbitrary order—this can easily be modelled as a disjunction with a Kleene star, i.e. (A|B|C)* for elements of type A, B and C in arbitrary order and arbitrary multiplicity. For modelling data without order, i.e. sets, XML provides no more than attributes; their order is irrelevant, as their interpretation is usually a set of name/value pairs. No way exists to model multisets in XML, except by shifting the unordered interpretation of a sequence to the application level.
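As a sketch, the permutation workaround for three elements a, b and c that are to occur exactly once each, in arbitrary order, looks as follows (note that the nested choices keep the content model deterministic), contrasted with the commonly used, much weaker simplification:

```dtd
<!-- all six permutations of a, b and c: -->
<!ELEMENT m ((a, ((b, c) | (c, b)))
           | (b, ((a, c) | (c, a)))
           | (c, ((a, b) | (b, a))))>

<!-- the usual simplification, which also admits repetitions and omissions: -->
<!ELEMENT m ((a | b | c)*)>
```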

3Automata are a common way to implement recognition of instances based on regular expressions. See chapter 6 about automata for more information on that topic.


Modelling Attributes Concerning the ways of modelling attributes, two limitations are known: First, attributes can only be modelled as a list or a set of valid attributes with or without optionality. It is not possible to model dependent attributes, e.g. subsets that may only occur together; neither choices of attributes nor subsets of attributes can be modelled. Second, attributes and the content model of an element are modelled independently—it is not possible to make e.g. the content of an element depend on a chosen attribute.

Typed References In the context of modelling graph structured data, a further shortcoming appears: ID and IDREF typed attributes are suitable for modelling graph shaped data; an identifiable element annotated with an ID attribute is usually easy to obtain by de-referencing its ID as found in any IDREF attribute, using common XML tools or application programming interfaces like the DOM. However, it is not possible to model data in a schema in such a way that a reference is a reference to an element of a specific type. This virtually renders the reference mechanism a typeless or schema-less reference mechanism. Using DTDs it is not possible either to restrict the number of references in an IDREF attribute.

2.4.2 XML Schema

XML Schema has been introduced as the successor of DTDs to overcome (most of) its limitations. XML Schema is a W3C Recommendation. It is widely accepted that XML Schema is by far the most complicated recommendation currently available from the W3C. XML Schema is itself an XML application, i.e. it has an XML syntax.

The Schema A schema is defined in an XML document with a root node labelled in the namespace http://www.w3.org/2001/XMLSchema. A schema contains element and type declarations. There is no way of declaring schema parts directly in the instances, yet there is the possibility to directly annotate instance nodes with schema or type information using the so-called schema instance attributes [58]. The schema validation is sometimes seen as a transformation of the information set4 representing the XML document into the so-called Post Schema Validation Infoset (see the W3C’s XML Schema recommendation part 1 [58]), where each information item of the information set is annotated with additional type or schema information deduced from the schema.

Element and Type Declarations An element declaration declares an element with a given name in a given namespace and with a given content, where the content is also called type in terms of XML Schema. Two different kinds of types exist in XML Schema: (1) so-called simple types, which represent solely character data information, and (2) complex types, which may be schemata for element sequences, character data and attribute sets. Character data as modelled using simple types may be any XML Schema data type or derived variations thereof. When modelling complex content, different possibilities exist for structuring the schema: (1) Child content of a node can be modelled as nested element declarations inside of the corresponding complex type of an element, which in turn is nested inside the element declaration. This paradigm is often referred to as Russian doll design—all information is nested. (2) The element declarations for child content can be referred to using an identifier (possibly different from the element label) from inside of the content model declaration located inside of the element declaration. This paradigm is commonly referred to as salami slices design—the element declarations are much thinner, and the reference (thin as a salami slice) points to the “beefier” part. (3) Element declarations occur nested inside of the content type declarations; content type declarations are declared at top level and are referred to from inside of the element declarations using unique names. This paradigm is referred to as venetian blinds design. The name is due to an arguably obscure XML Schema feature concerning namespace exposure that can best be exploited using this design pattern. With a small change to the schema

4The information set is a normative W3C recommendation [56] defining an abstract syntax in the spirit of a parse tree of XML documents.

big amounts of the schema can be exposed or hidden, like venetian blinds open and close with a small trigger action. Note that e.g. the reference mechanism for content model references in elements helps to remove one of the limitations of DTDs: alternative content models for equally named elements according to the context they occur in. Attributes may occur in complex content declarations. Attributes have to be declared after the element declarations; they may not be intertwined with them.
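The three designs can be sketched for a simple branch/leaf content model as follows (the element and type names are illustrative):

```xml
<!-- (1) Russian doll: everything nested -->
<xs:element name="branch">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="leaf" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<!-- (2) salami slices: references to global element declarations -->
<xs:element name="leaf"/>
<xs:element name="branch">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="leaf" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<!-- (3) venetian blinds: named top-level types -->
<xs:complexType name="BranchType">
  <xs:sequence>
    <xs:element name="leaf" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
<xs:element name="branch" type="BranchType"/>
```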

Namespaces Arguably the most prominent advantage that XML Schema provides over DTD is the support of namespaces. Elements can be declared to belong to a namespace, and it is possible to import and incorporate declarations of elements of another namespace in a type declaration.

Regular Content Models in XML Schema Content models can be modelled using constructs that are mostly inspired by regular expressions. The XML Schema elements sequence, choice and all are used to model content models. They may be nested, roughly representing a structure similar to a parse tree of a regular expression. With sequence it is possible to give a list of elements or content model components that have to be matched in sequence to conform to that content model. With choice, alternatives may be specified. The all construct is a shorthand for an unordered sequence of elements that have to be matched; any sequential combination of the elements specified in an all is accepted. This unordered specification is very restricted: only elements may be child nodes of an all construct, and the multiplicity of the elements always has to be exactly one. Repetitions can be modelled quite accurately using the minOccurs and maxOccurs attributes of the regular constructs or of the element declarations. To represent unbounded repetition, as modelled in DTDs and regular expressions using the Kleene star, the special multiplicity value unbounded is used.
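For instance, a content model like the article example from the DTD section—a title, then arbitrarily many section or paragraph elements, then an optional appendix—can be rendered with these constructs as:

```xml
<xs:sequence>
  <xs:element name="title"/>
  <xs:choice minOccurs="0" maxOccurs="unbounded">
    <xs:element name="section"/>
    <xs:element name="paragraph"/>
  </xs:choice>
  <xs:element name="appendix" minOccurs="0"/>
</xs:sequence>
```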

Attributes Attributes are defined in a similar way as in DTDs—their name and value type are given, and they may be declared mandatory, optional or fixed. The value type of attributes, however, may be specified in a much richer way than is possible with DTDs: any simple type is allowed.

Data Types XML Schema provides a fixed yet versatile set of predefined simple types for modelling various number formats, Gregorian calendar data and various forms of strings and tokens. User defined restrictions and extensions of the predefined simple types can be derived. There is also one predefined polymorphic data type that can be user restricted: a polymorphic list of (non-list) simple data types. Clearly this helps to overcome a limitation of DTD: DTD has no data types except strings.

XML Schema Example Code Example 2 illustrates the presented concepts of XML Schema. A university database is modelled. A university contains a sequence of student elements, at least one and as many as needed. Student elements contain a name, a matriculation number, the semester information of the student and an element with the results of the exams of that student. All those elements must occur, but they may occur in arbitrary order. A name consists of a first and a last name, each represented by elements containing plain text strings; the semester information is an element containing an integer number between 1 and 24. The exam elements contain an integer representing the grade and a string for the name of the lecture, both given in the form of an attribute. The content of the university element is modelled outside of the nesting focus of the element, in the spirit of the formerly mentioned salami slices paradigm; the student is mostly modelled according to the Russian doll paradigm, where all elements are declared inside of the nesting scope of the student element, except the name element, which has its content type declared out

of scope. The exam element as well has its content type defined out of scope, in the spirit of the venetian blinds paradigm.

Code Example 2 A simple XML Schema modelling some aspects of administrative data found in a university.
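Following the description above, the schema may be sketched roughly as follows (the element and type names are guesses):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- salami slices: university refers to a global student element -->
  <xs:element name="university">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="student" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Russian doll: content nested inside the declaration -->
  <xs:element name="student">
    <xs:complexType>
      <xs:all>
        <xs:element name="name" type="NameType"/>
        <xs:element name="matriculation" type="xs:integer"/>
        <xs:element name="semester">
          <xs:simpleType>
            <xs:restriction base="xs:integer">
              <xs:minInclusive value="1"/>
              <xs:maxInclusive value="24"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="exams">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="exam" type="ExamType"
                          minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:all>
    </xs:complexType>
  </xs:element>

  <!-- venetian blinds: named top-level types -->
  <xs:complexType name="NameType">
    <xs:sequence>
      <xs:element name="first" type="xs:string"/>
      <xs:element name="last" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="ExamType">
    <xs:attribute name="grade" type="xs:integer" use="required"/>
    <xs:attribute name="lecture" type="xs:string" use="required"/>
  </xs:complexType>

</xs:schema>
```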

Extending and Restricting Declarations An XML Schema content modelling feature orthogonal to the formerly presented ones is the extension and restriction of types. When modelling simple (data) types, it is possible to restrict a given simple type using type specific facets. This has already been shown in the example above for the restriction of integers to the range between 1 and 24. Strings can be restricted in a very powerful way using regular expressions. For complex types, in addition to restriction, extension of the content model is also possible. A typical example of extending a content model is to add attributes to an already declared complex type. Restriction and extension of complex types is, however, a mere integrity constraint on the complex type supplied, i.e. a content type can be denoted to be an extension or restriction of another content type, but the user (the author of the schema) has to model it according to that constraint; otherwise the schema is invalid.
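A self-contained sketch of both mechanisms—a simple type restricted by facets, and a complex type extended with an attribute (all names are illustrative):

```xml
<!-- restriction of a simple type by facets -->
<xs:simpleType name="Semester">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="24"/>
  </xs:restriction>
</xs:simpleType>

<xs:complexType name="Person">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

<!-- extension: Person plus a matriculation attribute -->
<xs:complexType name="Student">
  <xs:complexContent>
    <xs:extension base="Person">
      <xs:attribute name="matriculation" type="xs:integer"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
```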


Limitations of XML Schema As presented, many of DTD’s shortcomings have been overcome in XML Schema (e.g. the lack of namespace support, the lack of data types and the lack of alternative content model declarations for different element contexts), yet some remain and some new ones arise.

The Predefined Set of Data Types While a rich set of data types helps in modelling data, the predefined data types lack some flexibility. Only predefined ways of restriction exist. For many applications, kinds of atomic data types exist that cannot be modelled using XML Schema. As an example, consider IP addresses: while they can—with some difficulty—be modelled using e.g. regular expressions (and hence sub-typed from plain strings), there is no way of supporting the data type itself without further application-dependent parsing (e.g. as a quadruple of 4 bytes). Some kind of modularity or external interface is arguably desirable for simple data types. Another drawback is the fixed constraints on the format of the predefined data types: the date data types are arguably of restricted usefulness, as there is no way to allow localization of the date format and as the calendar model is restricted to Gregorian dates (which is a nuisance e.g. in China).

Deterministic Content Models A limitation of DTDs that has been inherited by the XML Schema language is the restriction to deterministic content models. Hence, the chess game use case from the discussion of DTD limitations cannot be modelled in XML Schema either.

Attributes The modelling of attributes of an element has conceptually mostly been inherited from DTDs, and thus comes with the same limitations: attributes are declared independently of each other; it is not possible to declare, for example, that a set of attributes has to occur together or excludes another set of attributes. The independence of content model and attributes has also been inherited from DTDs, as attributes must be declared in the complex content model at the end, after all element declarations. It is therefore not possible to have an element content that depends e.g. on the occurrence of an attribute or on a certain value of an element.

Typed References DTDs' lack of typed references is roughly compensated by the possibility to syntactically restrict attribute values while declaring them as identifiers or references to identifiers. A disadvantage of this approach is that the typed reference has to be reflected in the way instances of the schema are authored: the values have to fulfill syntactical properties (e.g. they need a certain prefix) that are conceptually irrelevant to the instances.

Root Element A rather surprising limitation is the lack of an explicit root element declaration when modelling using XML Schema. Each element that is valid with respect to some element declaration of a schema can also be the root element of a valid document. The following example document

is hence perfectly valid with respect to the following schema:

The document is valid, as the element "foo" may contain arbitrary content, hence also "bar". Arguably, however, the author of the schema intended to state that the root element has to be a "foo" element.


Unordered Content Practical applications have demonstrated the relevance of unordered content modelling for XML, especially for data modelling. It reflects the grouping and encapsulation of attributes5 in an object-oriented modelling paradigm, where it is relevant e.g. to assign the right values to the attributes, yet the order is irrelevant. While SGML DTDs have some support for this task, XML DTDs lack the feature. XML Schema reintroduced the SGML feature and thus provides a limited way of modelling unordered content using the all construct, by specifying a set of elements that have to occur in a valid content instance. When it is necessary to model unordered content involving optionality, multiplicity or dependent multiplicity of attributes, XML Schema's all construct cannot be used. Arguably, the same expressiveness as given by regular expressions for sequences would be desirable for unordered content modelling.
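How unordered content with multiplicities could be validated is sketched below in Python (a hand-rolled illustration of counting constraints, not an XML Schema feature): each label is mapped to an occurrence range, and a child sequence is checked against the ranges regardless of order. XML Schema's all group corresponds to the special case where every range is (0, 1) or (1, 1).

```python
from collections import Counter

# An unordered content model as per-label multiplicity ranges:
# exactly one name, one to three phone numbers, an optional address.
MODEL = {"name": (1, 1), "phone": (1, 3), "address": (0, 1)}

def matches_unordered(children, model=MODEL):
    """Validate a child-label sequence against the ranges, ignoring order."""
    counts = Counter(children)
    if set(counts) - set(model):          # unknown labels are not allowed
        return False
    return all(lo <= counts[label] <= hi for label, (lo, hi) in model.items())

print(matches_unordered(["phone", "name", "phone"]))   # True: order is irrelevant
print(matches_unordered(["name"]))                     # False: the phone is missing
```

Note that the range (1, 3) for phone is precisely the kind of multiplicity constraint that the all construct cannot express.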

Syntax and Semantics Though XML Schema solved some of the shortcomings of DTDs, some problems remain. Furthermore, XML Schema enjoys the questionable fame of being by far the most complicated W3C recommendation available. The recommendation is often expressed in terms of how the validation procedure works and how it has to be implemented, rather than explaining in a declarative way how the language is to be used. This is advantageous for parties implementing XML Schema, yet it is very inconvenient for XML Schema users. Considerable expertise is required to understand a W3C XML Schema correctly. Additionally, there is to date no accepted or official formal description of XML Schema, merely a working draft "XML Schema: Formal Description" (see http://www.w3.org/TR/xmlschema-formal/), last edited in September 2001.

2.4.3 Relax NG

Relax NG (for REgular LAnguage for XML, Next Generation) is a schema language for XML based on RELAX [29] and TREX [51]. A Relax NG schema specifies a pattern for the structure and content of XML documents. Relax NG has two syntaxes—an XML syntax and a textual, so-called compact syntax. Compared to e.g. XML Schema, Relax NG is arguably simpler while having the same expressive power. It is defined by a committee specification of the OASIS Relax NG technical committee, and also by part two of the international standard ISO/IEC 19757: Document Schema Definition Languages (DSDL) [2]. Relax NG has been defined on a theoretical basis, more precisely on regular tree grammars [21], which are considered to be an appropriate formal model for most current XML type and schema languages [38].

Grammar A schema or type declaration in Relax NG is called a grammar. A grammar consists of one or more grammar rules, of which one is declared to be the starting rule, the rule defining the type of the root of valid documents.

Element Declarations A Relax NG grammar rule declares the shape of an element. As an example, consider the grammar of a name element consisting of a first- and a last name component (see example 3). In contrast to DTDs, there is a separation of non-terminal and terminal symbols, i.e. it is possible to use different labels for the element instances and for the rule (in example 3, an element with label first is modelled; its rule is called firstName in the grammar). The separation of label and non-terminal in the grammar is essential for reusing element labels for various element declarations with different structure, e.g. in different content models. For the sake of simplicity, just the compact syntax is used in the Relax NG examples—conceptually, the XML syntax and the compact syntax are equivalent.

5In this context by ‘attributes’, attributes of an object are meant, not forcibly XML attributes.

21 2.4 FROM SCHEMA-LESS STRUCTURETO VALID DATA

grammar {
  start = element name { firstName, lastName }
  firstName = element first { text }
  lastName = element last { text }
}

Code Example 3 A Relax NG example modelling a name element composed of a first- and a last name component.

Regular Content Models in Relax NG The content models of elements are modelled using regular expressions, as seen for DTDs. In contrast to the DTD regular expressions, there is no restriction of the possible regular expressions to one-unambiguous expressions. It is also possible to model data the "Russian doll" way as in XML Schema—by embedding the element declarations anonymously in a content model declaration (see example 4 for nested element declarations and full-blown regular expression content models).

grammar {
  start = element addressBook {
    element card {
      ( ( firstName, lastName ) | companyName ),
      element phone { text }?,
      element address { text }?
    }*
  }
  firstName = element first { text }
  lastName = element last { text }
  companyName = element company { text }
}

Code Example 4 This example models a versatile address book. Some elements are defined in a nested way in the content model declarations of their parent node, while others are declared in the traditional grammar way—using non-terminals and a declaring rule.

Named Patterns What corresponds to a complex type in XML Schema (see section 2.4.2) is called a named pattern in Relax NG. A named pattern is a regular expression, or a part of a regular expression, whose name, i.e. non-terminal, can be used when defining content models (see example 5).

element card {
  name,
  element phone { text }?,
  element address { text }?
}*
name = ( firstName, lastName ) | companyName
firstName = element first { text }
lastName = element last { text }
companyName = element company { text }

Code Example 5 The card declaration of example 4, but with the content model part declaring valid use of names factored out as a named pattern.

To retain regularity of the content model, it is not possible to recursively reference a name in its declaration. Nor is recursion allowed via intermediate named patterns.

Interleaving While XML DTDs have very poor support for the modelling of unordered data, Relax NG provides a way of modelling an unordered sequence of elements. In example 5, an address book card was modelled as a sequence of a name, phone number and address element. While the schema dictates an order on those elements, there is no conceptual necessity for an order in this case. When modelling the content as shown in example 6 using the interleaving operator, the elements can be given in arbitrary order. Surprisingly, the interleaving construct was already available in XML DTD's predecessor, SGML DTDs. The feature was removed in XML DTDs for the sake of simplicity of implementation. Note that XML Schema also provides a way of modelling interleaved content.

element card {
  ( name | companyName ) &
  element phone { text }? &
  element address { text }?
}

Code Example 6 A variation of the address card for the address book in example 5—this card requires all elements, but no order is dictated on them.
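The core semantics of the interleave operator for two plain sequence operands can be sketched as a shuffle test in Python (an illustration under the simplifying assumption that both operands are fixed sequences; choice, repetition and attributes are ignored):

```python
from functools import lru_cache

def is_interleaving(word, left, right):
    """Check whether 'word' is an interleaving (shuffle) of 'left' and 'right':
    both operands appear as subsequences, each with its internal order preserved."""
    word, left, right = tuple(word), tuple(left), tuple(right)

    @lru_cache(maxsize=None)
    def go(i, j):
        k = i + j                          # position in 'word' consumed so far
        if k == len(word):
            return i == len(left) and j == len(right)
        take_left = i < len(left) and left[i] == word[k] and go(i + 1, j)
        take_right = j < len(right) and right[j] == word[k] and go(i, j + 1)
        return take_left or take_right

    return len(word) == len(left) + len(right) and go(0, 0)

# A card's children may come in any order relative to the other operand ...
print(is_interleaving(["phone", "name", "address"],
                      ["name"], ["phone", "address"]))   # True
# ... but the order *within* one operand must be preserved:
print(is_interleaving(["name", "address", "phone"],
                      ["name"], ["phone", "address"]))   # False
```

The second call fails because the right operand requires phone before address, illustrating that interleaving relaxes order only between operands, not within them.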

While interleaving is used to model unordered sequences of elements, some attention has to be paid to the way it is used in regular expressions, as the interleaving is only applied to sibling expressions in a regular expression. This is best explained along an example: let us assume we have to model an unordered sequence of A, B and C elements, or, instead of B and C, a D element. This is seemingly nicely expressed as A & ( (B & C) | D ). Unfortunately, this expression does not fulfill the requirement, as the sequence BAC is not an instance of the given expression. This is due to the fact that each sub-expression has to be fulfilled by a part of the word, i.e. there is no part of the word valid with respect to the sub-expression (B & C). In this case, an expression fulfilling the requirement exists: (A & B & C) | (A & D).

Attributes An arguably special feature of Relax NG (compared to DTD and XML Schema) is the equal treatment of attributes and elements in content model declarations. Attribute occurrence declarations hence become part of the regular expression, as shown in example 7. The advantage of this approach is that the occurrence of an attribute can be tightly coupled to the occurrence of an element, which is hardly possible when elements and attributes are declared separately. Further, the modelling of attribute sets can benefit from the full possibilities of regular expression modelling, like optionality and disjunctive modelling.

element addressBook {
  element card {
    ( element name { text } | attribute name { text } ),
    ( element email { text } | attribute email { text } )
  }*
}

Code Example 7 Another variant of the address book example—in this case cards have name and email content; whether this is modelled as attribute or element is left to the instance author. The author may even model e.g. the name as an attribute while modelling the email address as an element of the card element.

The advantages are unfortunately tightly coupled with disadvantages: (1) content models can be modelled that enforce multiple occurrences of the same attribute name (e.g. using the Kleene star), which is invalid in XML. (2) While a regular expression of elements models an ordered sequence of elements with the known regular expression semantics, attributes do not occur at the positions modelled in the content model, as they have their fixed position in the XML syntax—inside the opening tag of the containing element. In practice, these problems have turned out to be mostly irrelevant, and the choice of mixing element and attribute content declarations has proven to be pragmatically useful.

Data Types Relax NG allows patterns to reference externally defined data types. Relax NG implementations may differ in what data types they support; only those data types supported by the Relax NG implementation used can actually be employed. The most commonly used data types are those defined by W3C XML Schema Datatypes [59]. This data type flexibility is arguably both good and bad—maximum freedom, in this case regarding data types, is forward-looking in an open system such as the Web; on the other hand, the portability of Relax NG schemata using data types is questionable. Apart from externally defined data types, enumeration data types are included (see example 8): an enumeration of valid strings can be defined as a data type.

element card {
  attribute name { text },
  attribute email { text },
  attribute preferredFormat { "html" | "text" }
}

Code Example 8 An extension of an email address card, as defined (among others) in example 7. The email address card now has an attribute for specifying the preferred formatting of emails for the recipient. This is done using an enumeration consisting of two options: "text" and "html".

Relax NG Example Relax NG is considered to be a clean, formally well-founded alternative to XML Schema. The W3C, inventor of XML Schema, itself uses the "competitor" as modelling language for various normative standards, e.g. XHTML 2.0 [55] (see appendix B and C of [55]). A short Relax NG grammar is shown in example 9.

grammar {
  start = university
  university = element university { student+ }
  student = element student {
    element name { nameType },
    element matriculation-number { xsd:integer },
    element semester { xsd:integer { minInclusive = "1" maxInclusive = "24" } },
    element exams {
      element exam {
        attribute lecture { xsd:integer },
        attribute grade { text }
      }*
    }
  }
  nameType = ( element first-name { text }, element last-name { text } )
}

Code Example 9 For comparability, the XML Schema example presented earlier (see section 2.4.2) is recast here as a Relax NG grammar. The W3C XML Schema data types are used, which is a possible and existing way to integrate data types into Relax NG.

Limitations As mentioned, basic data types are not covered in Relax NG itself. To obtain portable grammars, they can hence not be used. Another limitation, by now present in all schema languages discussed, is the lack of typed graph support. Without an external data type library, Relax NG does not even support ID/IDREF-based references and is hence not able to model graph-structured data at all. Unordered content is partly supported by the interleaving paradigm. As shown above in the paragraph about interleaving, patterns using interleaving do not model unordered data in a homogeneous way; sometimes "chunks" of content may not be interleaved. This happens when combining the interleaving operator with the other operators. It is likely, but unproven, that any unordered content model with variable multiplicity, disjunction and arbitrary symbols can be modelled using regular expressions with the interleave operator. However, this is highly demanding for the schema author and error-prone, possibly with a size exponential in the number of symbols in the worst case.6

6The assumption is unproven. It is based on the observation that factorizations in the problem formulation have to be unfolded, which means copying the expanded symbol as often as factors occur.

2.5 Querying The Web

Note that Web querying is to be distinguished from Web searching: while Web searching means the use of search engines like http://www.ask.com or http://www.msn.com and Web directories like http://dmoz.org/ or http://www.yahoo.com by typing in textual search terms or by browsing search structures, Web querying means writing programs in Web query languages, very much in the spirit of writing database queries in SQL.

2.5.1 XPath

With XPath [48] one can localise (sets of) nodes within the tree associated with an XML document, i.e. elements, attributes, and text. XPath has similarity with path expressions as used in shell or command line environments of operating systems, e.g. the Bash on Unix derivatives or command.com on Windows; however, XPath is also inspired by so-called "regular path expressions" first introduced in the context of the XML query languages XML-QL [23] and Quilt [19], a predecessor of the current XML query language XQuery [63]. XPath [48] uses so-called "axis specifiers" and "node tests". The axis specifiers, e.g. descendant or child, specify the traversal of the document tree. A node test corresponds to a label in a regular path expression. The axis specifiers give rise to navigation in all directions relative to the context node (to the leaves, to the root, to nodes preceding or following in "document order").

<addressBook>
  <card>
    <name>Sacha</name>
    <phone>21809339</phone>
    <email>[email protected]</email>
  </card>
  <card>
    <name>Francois</name>
    <phone>21809310</phone>
    <email>[email protected]</email>
  </card>
</addressBook>

Code Example 10 An XML document representing an address book with phone numbers and email addresses. The example is used in section 2.5.1 to illustrate XPath expression usage.

As an example, consider the XPath expression descendant::email[position()=1] applied to the example document 10. The descendant axis indicates searching from the start element (in this case the root element) at arbitrary depth. The elements selected along the descendant axis are restricted to those fulfilling the node test for email, and from those elements, only those fulfilling the conditions in square brackets are taken. As the condition restricts the result to the first element in the sequence of intermediate results, the result is the first email element of the document. XPath has been conceived as an embeddable selection language, not as a complete query language. It lacks features for projection and construction of data; solely selection of data from a given context is covered. Prominent languages incorporating XPath as selection language are XSLT (see section 2.5.2) and XQuery (see section 2.5.3).
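The selection just discussed can be reproduced with Python's standard library (a sketch: ElementTree supports only a subset of XPath, so descendant::email[position()=1] is approximated by taking the first .//email match in document order; the document is a stand-in for example 10 with placeholder email addresses):

```python
import xml.etree.ElementTree as ET

# A stand-in for the address book of example 10 (the email addresses
# are placeholders, as in the text).
DOC = """
<addressBook>
  <card><name>Sacha</name><phone>21809339</phone><email>[email protected]</email></card>
  <card><name>Francois</name><phone>21809310</phone><email>[email protected]</email></card>
</addressBook>
"""

root = ET.fromstring(DOC)
# find() returns the first match in document order, approximating
# descendant::email[position()=1].
first_email = root.find(".//email")
print(first_email.text)
```

A full XPath engine (e.g. the one in lxml) would accept the original axis syntax verbatim; the standard library trades that generality for simplicity.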

2.5.2 XSLT

XSLT [47] is a language for transforming XML documents. It has been conceived as a part of XSL [50], the eXtensible Stylesheet Language. XSLT is a (sort of) functional language with XML syntax. An XSLT program consists of a set of rules expressing how to transform elements of an input document into elements of an output document. XSLT embeds XPath in two contexts: (1) elements from the input document are transformed by the rule matching its XPath expression, given as its match expression; (2) within rules, the transformation of further content is applied recursively to the elements matching the so-called select expression of the application. Consider example 11 for a simple XSLT program:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="addressBook">
    <html><body>
      <table>
        <tr><th>NAME</th><th>PHONE</th><th>EMAIL</th></tr>
        <xsl:apply-templates select=".//card"/>
      </table>
    </body></html>
  </xsl:template>

  <xsl:template match="card">
    <tr>
      <td><xsl:value-of select="name"/></td>
      <td><xsl:value-of select="phone"/></td>
      <td><xsl:value-of select="email"/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>

Code Example 11 An XSLT program transforming address book documents, as e.g. seen in example 10, into an HTML document with an address book table.

The example 11 consists of two rules, called templates in XSLT and represented by template elements. The first rule matches addressBook elements and constructs an HTML document with a table with the first row as header row with fixed text columns. Then, the program is applied to all card elements that can be found in the given context. The selected card elements are transformed using the second template, as this one matches with card elements. The second template constructs table rows for the card element it is applied to. For the document in example 10, the resulting HTML document is shown in example 12.

<html><body>
  <table>
    <tr><th>NAME</th><th>PHONE</th><th>EMAIL</th></tr>
    <tr><td>Sacha</td><td>21809339</td><td>[email protected]</td></tr>
    <tr><td>Francois</td><td>21809310</td><td>[email protected]</td></tr>
  </table>
</body></html>

Code Example 12 The result of transforming the document in example 10 using the style sheet in example 11.
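XSLT's rule-driven processing model (templates selected by a match pattern, recursion via apply-templates) can be mimicked with a toy dispatcher in Python. All names here (apply_templates, TEMPLATES) are illustrative, matching is by tag name only, and only the NAME and PHONE columns are produced:

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<addressBook>"
    "<card><name>Sacha</name><phone>21809339</phone></card>"
    "<card><name>Francois</name><phone>21809310</phone></card>"
    "</addressBook>")

def apply_templates(node):
    """Dispatch to the template registered for the node's tag,
    mirroring xsl:apply-templates (a toy model: match is by tag name only)."""
    return TEMPLATES[node.tag](node)

def address_book_template(node):
    # Corresponds to the template matching addressBook: a table with a
    # fixed header row, then recursion over the card elements.
    rows = "".join(apply_templates(card) for card in node.findall("card"))
    return "<table><tr><th>NAME</th><th>PHONE</th></tr>%s</table>" % rows

def card_template(node):
    # Corresponds to the template matching card: one table row per card.
    return "<tr><td>%s</td><td>%s</td></tr>" % (
        node.findtext("name"), node.findtext("phone"))

TEMPLATES = {"addressBook": address_book_template, "card": card_template}

print(apply_templates(DOC))
```

The dispatcher makes the control flow explicit: the first template does not iterate itself; it delegates to whichever template matches each selected child.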

Limitations A rather surprising limitation for users of functional programming languages is that templates may never be applied to content that has been constructed; templates are always applied to content selected from the input document. Common paradigms, like stepwise refinement of results, are not applicable in XSLT. The computation is hence solely driven by the input document. Nevertheless, there is evidence that the computation model of XSLT is Turing-complete [46].

Further Features of XSLT Besides the base concepts—template based content modelling and controlled template application—further features help writing reasonably compact and understandable transformation programs:


Conditionals An if and a choose construct are used to model conditionals similar to if-then-else and case-switch in traditional imperative programming languages like C or Java. A test, expressed as an XPath expression, is performed on the current context node; if the test selects something, it is considered successful, and the content enclosed by the conditional is processed like regular template content.

Imperative loops Besides apply-templates, which iterates over the selected content, the for-each construct can be used to loop over sequences of selected elements. The child content of the for-each element is then treated like regular template content for each iteration step. The main difference to apply-templates is that no template selection takes place, i.e. each selected node is processed directly with the child content of the for-each element.

Construction Constructs exist to construct elements and/or attributes, e.g. if their name has to be computed at run time.

Projection Using the value-of construct, it is possible to project parts of the selected data into the result document. This partly happens by default, i.e. if no template catches text nodes and apply-templates is called on them, they are inserted into the result document. value-of may also be used for calculations, using built-in functions for string processing and numerical computation. Another projection construct is copy, which copies, either deep or flat, a selected node (with or without its attributes).

Variables & Parameters Variables can be used to store and reuse selected content. In the spirit of functional programming languages, the binding of a variable may not be altered within the invocation context of its containing scope. The scope of a variable may be a template, or the child content of conditionals or of for-each loops. Parameters are like variables of a template that can be bound at application, i.e. in the apply-templates construct, very much like function parameters bound at function application. Note that it is not possible to bind variables or parameters to the result of constructing new content; variables can only be bound to selected input document content. This is related to the fact that templates may only be applied to content from the input document(s), but never to constructed content.

Transformation vs. Query Common definitions of query languages claim that query languages provide selection, construction and projection of data from a given data model to (possibly) another data model. All those properties are fulfilled by XSLT, rendering it by its features a query language as well as a transformation language. Indeed, the difference between transformation language and query language is vague. XML has been proposed by the W3C as a data (meta) model to be followed by styling, transformation and query language proposals. The transformation language XSLT was proposed before a query language. The versatility of XSLT led to a not so surprising reaction of its aficionados: when the query language XQuery was proposed by the W3C, it was suggested that its reference implementation be realized in XSLT, while the necessity of a distinct query language was questioned at the same time.

2.5.3 XQuery

The fact that with XQuery [63] and XSLT there are two expressive XML processing languages that can be used for mainly the same purposes is somewhat surprising. While XSLT was developed by the 'document community', XQuery is an XML query language originating from the database community, which can already be observed from its syntactical similarity to SQL. XQuery, also called XML Query, is derived from a query language called Quilt [19]. Quilt had been conceived by Jonathan Robie (Software AG), Daniela Florescu (INRIA) and Don Chamberlin (IBM Almaden Research Center) as a query and transformation language for XML and semistructured data, integrating most of the ideas of prototype query and transformation languages designed since the mid-1990s—hence the name Quilt.


Like Quilt and XSLT, XQuery uses XPath for locating nodes and/or text in the input data, and like XSLT, XQuery is a sort of functional programming language. Indeed, a plain XPath expression is syntactically already a valid XQuery program—returning as result exactly the content selected by the XPath expression from a context (e.g. document) passed (e.g. as command argument) to the XQuery implementation or run-time environment.

//card/@name

Code Example 13 An XQuery program in the form of an XPath expression, to be applied to example 10—the result is the text of all name attributes of the card elements.

As the result of example 13 already indicates (plain text is returned), the result of an XQuery program is not always an XML document. The result may be an XML document, plain text, or a sequence of XML nodes (i.e. text nodes, attributes, elements, ...). The sequence of XML nodes is an important concept, used by many language primitives and functions in XQuery.

FLWOR Expressions The main language construct in XQuery is the so-called FLWOR expression (pronounced "flower" expression). FLWOR is an acronym for For-Let-Where-Orderby-Return. Example 14 uses all parts of a FLWOR expression; however, it is possible to omit parts of the expression (i.e. the for, the let etc.).

FOR $p in /addressBook/card/phone
LET $c := $p/..
WHERE not($c/@confidential = "yes")
ORDER BY $p/text() ASCENDING
RETURN <card phone="{$p/text()}"><name>{$c/name/text()}</name></card>

Code Example 14 In this example a FLWOR expression is used to construct an "inverted phone list" of non-confidential phone numbers, where each card element contains the phone number as an attribute (and the cards are ordered ascending with respect to the phone numbers) and a name element is a child node of the card. Information about confidentiality is given in the form of an attribute of the card elements.

FLWOR expressions may also be nested, giving rise to more complex constructions, especially those where one level of iteration is not enough. In example 15, a two-dimensional structure, namely the result of the XSLT example 11 (the document in example 12), is queried—an HTML table is the input document, and as a result a rotated HTML table is constructed, i.e. a table where the columns of the input are the rows of the output and vice versa. For each dimension (first row, then column), one level of iteration is needed.

LET $table := $htmlDocument//table
RETURN
  <table>{
    FOR $x in $table/tr[1]/td/position()
    RETURN
      <tr>{
        FOR $tr in $table/tr
        RETURN
          FOR $td in $tr/td
          WHERE $td/position() = $x
          RETURN $td
      }</tr>
  }</table>

Code Example 15 An example rotating an HTML table. Note that for simplicity it is assumed that the table has rectangular shape, which is not enforced in HTML.
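The rotation performed by the nested iterations amounts to transposing the row/column matrix. In Python this is a one-liner with zip (assuming, as the example does, a rectangular table; the data is illustrative):

```python
# Rows of the input table, header row included.
rows = [["NAME", "PHONE"],
        ["Sacha", "21809339"],
        ["Francois", "21809310"]]

# zip(*rows) pairs up the i-th cell of every row, i.e. it yields the columns:
# exactly the row/column swap of example 15.
rotated = [list(column) for column in zip(*rows)]
print(rotated)
# [['NAME', 'Sacha', 'Francois'], ['PHONE', '21809339', '21809310']]
```

The contrast is instructive: what is a built-in primitive here has to be spelled out as two nested iterations with a positional join in XQuery.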


Functions XPath queries, or parts of them, can be organized in functions. Indeed, some functions are even predefined. In contrast to XSLT, functions can construct new XML content that can be queried or processed in the same program, e.g. in another function. Example 16 shows how a function is used to construct data that is then decomposed by another query. This approach is reasonable e.g. to provide an interface to versatile data formats, i.e. to hide the complexity of information selection and projection.

DECLARE FUNCTION harvestData( $url ) {
  FOR $row IN document($url)//table/tr[position() > 1]
  RETURN
    <entry>
      <name>{$row/td[1]/text()}</name>
      <phone>{$row/td[2]/text()}</phone>
      <email>{$row/td[3]/text()}</email>
    </entry>
}
<addressBook>{
  FOR $d in harvestData( "http://example.com/addresses.html" )
  RETURN
    <card>
      <name>{$d/name/text()}</name>
      { IF ( $d/phone/text() ) THEN <phone>{$d/phone/text()}</phone> ELSE () }
      { IF ( $d/email/text() ) THEN <email>{$d/email/text()}</email> ELSE () }
    </card>
}</addressBook>

Code Example 16 The example is based on data as shown in example 12, an HTML table of name-, phone-, email-triples. A function called harvestData is used to retrieve the data. The query then constructs an addressBook document in the spirit of example 10. Note how, in contrast to common programming languages like Java, the return statement is iterated—this results in a sequence of data elements.
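The idea behind harvestData, hiding the selection and projection from a versatile source format behind a function interface, can be sketched in Python (illustrative names and data; the table is parsed from a string instead of being fetched from a URL, and empty cells are dropped like empty optional fields):

```python
import xml.etree.ElementTree as ET

# A miniature, well-formed stand-in for the harvested HTML table;
# the header row is followed by data rows, the email cell is empty.
HTML = ("<table>"
        "<tr><th>NAME</th><th>PHONE</th><th>EMAIL</th></tr>"
        "<tr><td>Sacha</td><td>21809339</td><td></td></tr>"
        "</table>")

def harvest_data(html):
    """Encapsulate selection and projection: yield one dict per data row."""
    table = ET.fromstring(html)
    for row in table.findall("tr")[1:]:          # skip the header row
        name, phone, email = (td.text for td in row.findall("td"))
        yield {"name": name, "phone": phone, "email": email}

# The caller only sees clean records; absent values are filtered out.
cards = [{k: v for k, v in d.items() if v} for d in harvest_data(HTML)]
print(cards)  # [{'name': 'Sacha', 'phone': '21809339'}]
```

As in the XQuery version, the consuming code never touches the table layout; only the harvesting function knows about rows, columns and the header.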

Types XQuery is a language with an "optional type system", which means that (1) the availability of the type system in implementations of the XQuery specification is optional, and (2) the use of types is optional for the user. Optional types for the user are in this context not to be understood as optional type annotations of generally typed programs due to the availability of type inference, as e.g. in Haskell—untyped program fragments exist alongside typed ones. Typed fragments of XQuery are annotated with XML Schema type information, which may be type names. XQuery (when typed) comes along with a static or a dynamic type system; when statically typed, it also needs to be dynamically typed. Types are based on concepts like

• structured or textual content of any type—any of the known types; the sub-elements of a structured element may in turn be of any of these types

• structured content of unknown type—no assumption on the type can be made at the call; this corresponds to a most general type

• structured or textual content conforming to some XML Schema type declaration.

Functions, variables and XPath selections may be typed in XQuery. If an XQuery implementation is statically typed, a type-error-free program must lead to the same result as the execution of the program in a dynamically typed implementation, i.e. static type information cannot lead to a different result than dynamic type information.

Further Features of XQuery Besides FLWOR expressions and functional abstraction, further features help writing reasonably compact and understandable queries:


Conditionals Based on boolean operations and XPath expression evaluation (to an empty or non-empty selection), if-then-else conditionals can be formed as in common programming languages. Surprisingly, while the language description [63] puts much emphasis on FLWOR expressions, the evaluation of XQuery [64] is based on rewriting programs into an (arguably) minimal subset of XQuery and XPath not containing FLWOR expressions anymore, but deeply nested if-then-else expressions.

Modules Complex XQuery programs can be split up into modules, enabling the reuse of functions in various programs. A query is then composed of exactly one main module and arbitrarily many library modules. A main module may contain a FLWOR expression at "root" level, i.e. not in the context of a function, as well as function and variable declarations; a library module may only contain function and variable declarations.

Limitations The expressiveness of XQuery gives rise to arbitrary computation, i.e. Turing completeness does not leave much room for complaints about expressiveness. The downside of the high expressiveness is the difficulty of providing highly optimized query evaluation. Query optimization is an important field of database development, and XQuery was initially conceived as an (XML) database query language with potential for optimization. One source of problems is so-called nested querying, where e.g. inside an iteration for result construction another sub-query (again possibly with iteration) occurs. Nested queries are unfortunately the means of choice to implement complex grouping in XQuery. While nested querying allows the programmer to write highly hand-optimized code (gaining full control over the evaluation order of the query), automatic optimization across the nesting levels is very hard, maybe impossible in general. In many cases, more high-level constructs, e.g. grouping, ordering and duplicate elimination constructs, would make the use of nested queries unnecessary, giving comfortable tools to the programmer and opening the way to automatic optimization in XQuery run-time engines. Another limitation of XQuery lies in its specification relating to XPath: some parts of XPath (i.e. most reverse axes) are optional in implementations. This does not restrict the expressiveness, as the missing axes can all be simulated using other axes and constructs, but on the other hand, the user is encumbered in the use of XPath when trying to write portable XQuery programs. As it has been shown how to rewrite XPath expressions involving reverse axes, the rewriting could have been included in the specification as a must for implementations without native support of the axes in question.
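The kind of dedicated grouping construct argued for above can be illustrated in Python (illustrative data): with a built-in grouping primitive, the grouping intent is stated directly instead of being encoded in nested loops, which is exactly what makes it amenable to automatic optimization.

```python
from itertools import groupby

# (city, name) pairs to be grouped by city; in XQuery this grouping would
# typically be written as a FLWOR expression nested inside another one.
cards = [("Munich", "Sacha"), ("Paris", "Francois"), ("Munich", "Tim")]

# groupby requires its input to be sorted on the grouping key; the grouping
# itself is then a single declarative step.
by_city = {
    city: [name for _, name in group]
    for city, group in groupby(sorted(cards), key=lambda card: card[0])
}
print(by_city)  # {'Munich': ['Sacha', 'Tim'], 'Paris': ['Francois']}
```

An engine that sees "group by city" as one operator can pick a hash- or sort-based strategy itself; an engine that sees two nested loops generally cannot.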

2.5.4 Xcerpt

Xcerpt is a Web and Semantic Web query and transformation language currently in development at the Institute for Informatics of the University of Munich [43][17]. The main claim is that the use of arguably declarative programming language paradigms eases the writing of programs as well as efficient evaluation, with less need to consider optimization. Xcerpt is conceived to query and construct XML data in particular, and semistructured data in general. Xcerpt has its own (syntactic and semantic) data model for semistructured data, called Xcerpt data terms (see example 17). Xcerpt data terms are able to represent tree shaped data with attributes, textual or structural nodes, and ordered or unordered sequences of child nodes, as well as graph shaped, directed, node labelled structures with a dedicated root node. Graph shaped documents are syntactically represented as spanning trees with references to nodes of the tree.

30 2.5.4 XCERPT

bibliography{
  a1@author{ name["Eric","van der Vlist"], publications{ ^b1, ^b2 } },
  a2@author{ name["Jean-Jacques","Thomason"], publications{ ^b2 } },
  b1@book{ "RELAX NG" , authors[ ^a1 ] },
  b2@book{ "XML Schéma (édition française)" , authors[ ^a1, ^a2 ] }
}
Code Example 17 An example of an Xcerpt data term, representing a bibliographic database. Nested elements are represented in term syntax, where square brackets indicate that the sibling order of the nested elements is relevant, while curly braces indicate that the order of the child elements can be considered irrelevant. Some elements are made referable by an identifier, prefixed to them and separated by an ‘at’ sign. Referable elements may be referenced using their identifier, prefixed by a ‘hat’ sign (caret). Textual content is quoted text and can occur along with other elements.

The general structure of an Xcerpt program can be compared to a program in logic programming languages or to a database query in Datalog. The main syntactic concepts of Xcerpt, along with their informal explanation, are as follows:

• A program consists of one or more rules that may use each other's results in the construction of new results.

• A rule consists of a query part clearly separated syntactically from a construct part.

– The query part is able to query external data on the Web or internal data constructed by other rules.
– The construct part is able to construct result data to be used as output of the program or as input for other rules to query.
– Query and construct parts of a rule are connected using logical variables. Variables are (exclusively) bound in the query part, and multiple variable bindings are possible, in the spirit of variables in Prolog or Datalog.

• Query parts consist of so called query patterns, which are syntactically an extension of the queried data model (i.e. XML or Xcerpt data terms). The extensions are (1) incompleteness constructs to leave room for uncertainty about shape, size and multiplicity of the queried data structures and (2) variables to select data fragments from the input data.

• Construct parts of rules are so called construct patterns, a syntactic extension of XML or Xcerpt data terms. The extensions are (1) aggregation and grouping constructs to aggregate and group multiple variable bindings and (2) variables, which are replaced by their bindings at program evaluation time to construct data.

The evaluation of an Xcerpt rule consists of two phases: (1) the evaluation of the query, resulting in a so called substitution set for the variables, and (2) the evaluation of the construction, resulting in data terms due to the evaluation of aggregation and grouping constructs and the substitution of the variables according to the substitution set gathered in the query part evaluation.

Xcerpt Rule Query Evaluation is done using so called simulation unification, a non standard unification algorithm based on the simulation preorder of graphs. First, the simulation preorder and its properties are briefly presented, then the focus will pan back to simulation unification, which yields a set of substitutions, the result of Xcerpt query evaluation. Speaking informally, a graph simulates another graph if each state in the second graph is simulated by a state in the first graph. A state in the first graph simulates a state in the second one if each state adjacent to the state in question in the second graph is simulated by a state adjacent to the state in question in the first graph. For labelled or attributed edges or nodes or plain textual nodes, equal labelling, attributing or text equality can be requested on simulating states or on the traversed edges (see figure 2.5 for an example of the simulation relation between two graphs with node labels and textual nodes).

Figure 2.5: The left graph structure is simulated by the right one—for each node of the left hand side graph, there is a corresponding node on the right hand side graph, such that (1) the labelling is equal and (2) the nodes following the node in question on the left hand side are simulated by nodes following the corresponding node on the right hand side. Textual nodes on the left hand side are simulated by textual nodes on the right hand side. It is possible that one node on the left hand side is simulated by many nodes on the right hand side (i.e. when the left hand side node has multiple incoming edges).

If the simulation preorder holds in both directions for the two graphs in question, it is called bisimulation. Bisimulation is a kind of intuitive equality between two graphs, when the graphs are considered e.g. to be automata—both automata “behave” the same way, if the behaviour of automata is seen as a kind of graph traversal. However, bisimulation is a weaker equivalence relation than graph isomorphism; for an example of the difference, consider figure 3.1 in section 3.3.2. Focusing back on simulation unification, the second graph in the explanation of the simulation preorder may now contain logical variables as nodes. The second graph is the query pattern, while the first one can be considered to be a data instance. A simulation unifier is then a mapping of variables to sub-graphs of the first graph, such that, after substituting the variables in the second graph with the sub-graphs in question, the first graph simulates the second one. Possibly multiple such unifiers exist; the result of simulation unification is the set of all such unifiers. Figure 2.6 illustrates simulation unification by example.
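The fixpoint characterization of the simulation preorder can be implemented directly. The following is a minimal illustrative sketch (not Xcerpt's actual algorithm) for node-labelled graphs given as successor maps: it starts from all label-compatible pairs and removes pairs until the remaining relation is a simulation.

```python
def simulation(query, q_labels, data, d_labels):
    """Greatest simulation relation: pairs (q, d) such that the data
    node d simulates the query node q (equal labels, and every
    successor of q is simulated by some successor of d)."""
    rel = {(q, d) for q in query for d in data if q_labels[q] == d_labels[d]}
    changed = True
    while changed:
        changed = False
        for (q, d) in sorted(rel):  # iterate over a snapshot
            if not all(any((qc, dc) in rel for dc in data[d])
                       for qc in query[q]):
                rel.discard((q, d))
                changed = True
    return rel

# query graph a{ b, c } against data graph a{ b, c, d }
query = {'q0': ['q1', 'q2'], 'q1': [], 'q2': []}
q_labels = {'q0': 'a', 'q1': 'b', 'q2': 'c'}
data = {'d0': ['d1', 'd2', 'd3'], 'd1': [], 'd2': [], 'd3': []}
d_labels = {'d0': 'a', 'd1': 'b', 'd2': 'c', 'd3': 'd'}
assert ('q0', 'd0') in simulation(query, q_labels, data, d_labels)
```

Note that the extra node d3 in the data does not prevent simulation: nothing is required of data nodes not needed to simulate query nodes, which is exactly the incompleteness in breadth discussed below.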
Incompleteness in breadth (i.e. partial specification of adjacent node sequences) is already addressed by the default simulation preorder: it is only required that all states in the second graph are simulated by states in the first graph, while nothing is required of all states in the first graph. Hence, completeness (in breadth) can be considered a side condition about the states in the first graph. Incompleteness in order is likewise addressed by the standard simulation preorder, as no condition about the order of the simulated states is defined in common graph models; implementing ordered simulation is hence a condition about the order of the simulated states in addition to the default simulation preorder. This requires some extension like numbered edges in the graph model.

Xcerpt Rule Construct Evaluation takes a set of substitutions (i.e. the result of an Xcerpt query part evaluation) as input. In the simplest case, a construct term without variables, nothing more than the construct term interpreted as a data term is returned; there are as many results as there are substitutions in the set, all identical. For a construct term with variables (and without grouping constructs), for each substitution the resulting data term is


X     Y
c[]   c[]
c[]   c["tabunkel"]

Figure 2.6: A demonstration of simulation unification, where the left graph (serializable as the Xcerpt query term a{{ b{ var X } , var Y }}) (in this case tree shaped) has two variables. Possible bindings for the variables are indicated by grey lines; when substituting the variables by the corresponding nodes from the right hand side (a data graph represented by the Xcerpt data term a{ b{ ^c1 } , c1@c{} , c{ "tabunkel" } }), the graph on the left hand side is simulated by the one on the right hand side. The result of simulation unification is a set of unifiers, shown as a table on the right hand side—each substitution is a row in the table, each column represents a variable.

obtained by substituting the variables in the construct term with the data terms associated to the corresponding variable in the substitution. Note that all variables in the construct term need to be substituted, otherwise the whole rule is considered ill formed—this can already be ensured at compile time, as it simply means that each variable in the construct part also has to occur in the query part of the same rule. For the evaluation of construct terms with grouping (and/or aggregation), the scope of the variables with respect to the grouping constructs is relevant: all construct terms in the scope of a grouping construct are sequentially constructed by substituting their variables, called the variables bound by the grouping construct in question. Not all substitutions are used at once in the construction of the terms in the scope of the grouping construct: the grouping is performed for each subset of the substitution set with equal free variable substitution; the free variable substitutions in such a subset are then used in just one construction. For each such subset, the grouping is evaluated, resulting in the end in as many instantiations of the innermost grouped variables as there are substitutions—just that those instantiations get grouped in different contexts.
Consider figure 2.7 for a demonstration of how a substitution set is applied to a construct term.
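The grouping semantics can be made concrete with a small illustrative sketch (a simplified model, not Xcerpt's implementation). Construct terms are encoded as tuples: ('term', label, children), ('var', name) and ('all', subterm); a substitution set is a list of dicts. The all construct expands its subterm once per distinct binding of the subterm's free variables, using the whole group of matching substitutions for nested groupings:

```python
def free_vars(t):
    """Variables of t that are not in the scope of a nested 'all'."""
    if t[0] == 'var':
        return {t[1]}
    if t[0] == 'all':
        return set()
    return set().union(set(), *(free_vars(c) for c in t[2]))

def construct(t, substs):
    """List of data terms the construct term t contributes under substs."""
    if t[0] == 'var':
        return [substs[0][t[1]]]        # substs agree on free variables here
    if t[0] == 'all':
        sub, out, seen = t[1], [], set()
        for s in substs:
            key = frozenset((v, s[v]) for v in free_vars(sub))
            if key in seen:
                continue                 # this free-variable binding is done
            seen.add(key)
            group = [s2 for s2 in substs
                     if all(s2[v] == s[v] for v in free_vars(sub))]
            out.extend(construct(sub, group))
        return out
    kids = []
    for c in t[2]:
        kids.extend(construct(c, substs))
    return [('term', t[1], tuple(kids))]

# the substitution table and construct term of figure 2.7
c_empty, c_tab = ('term', 'c', ()), ('term', 'c', ('tabunkel',))
substs = [{'X': c_empty, 'Y': c_empty}, {'X': c_empty, 'Y': c_tab},
          {'X': 'bar', 'Y': c_empty}, {'X': 'bar', 'Y': c_tab}]
term = ('term', 'results',
        (('all', ('term', 'result',
                  (('term', 'x', (('var', 'X'),)),
                   ('all', ('var', 'Y'))))),))
[result] = construct(term, substs)
assert result[1] == 'results' and len(result[2]) == 2   # one result per X
assert len(result[2][0][2]) == 3                        # x[...] plus two Ys
```

The outer all groups the four substitutions by the free variable X into two groups, and within each group the inner all instantiates var Y once per distinct Y binding, mirroring the table-to-term application shown in figure 2.7.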

Xcerpt Rule Evaluation. After introducing construct and query evaluation, not a lot is left to say about the evaluation of an Xcerpt rule—evaluating the rule consists of evaluating the query part, also called the body of the rule, and feeding the resulting substitution set to the construct part, also called the head of the rule, to obtain a set of data terms as result. A query part in a rule may be adorned by an in-construct, stating that the query is to be evaluated against an external resource, given as argument of the in-construct. Otherwise the rule is to be evaluated against the set of data terms constructable by all the rules in the program. Consider figure 2.8 for an example of rule evaluation. A special rule is the so called goal. A goal does not contribute to the set of data terms constructable by the program; its result is considered to be the result of evaluating the whole program. As the result has to be a data term or an XML document, just one data term may be instantiated by the goal; it is hence likely that the head of the goal uses grouping in such a way that there are no free variables.7

7The head of a goal may have free variables; it is not defined which resulting element is returned by a program constructing multiple outputs due to more than one substitution being applied during program evaluation to the goal's head.


X       Y
c[]     c[]
c[]     c{"tabunkel"}
"bar"   c[]
"bar"   c{"tabunkel"}

results[ all result[ x[var X] , all var Y ] ]
————————————————————————————–
results[ result[ x[c{}] , c{ "tabunkel" } , c{} ] , result[ x["bar"] , c{ "tabunkel" } , c{} ] ]

Figure 2.7: The application of a set of substitutions (in the form of a table) to an Xcerpt construct term (shown above the line) is demonstrated here. The result is the Xcerpt data term shown below the line.

CONSTRUCT
  html[ head[ title[ "Hyperlinks on www.google.com" ] ],
        body[ all ( a(href=var HREF)[ all var CONTENT ] , br[] ) ] ]
FROM
  in resource("http://www.google.de")(
    html[[ body[[ a(href=var HREF)[[ var CONTENT ]] ]] ]] )
END
————————————————————————————–
Hyperlinks on www.google.de
In Englisch
Bei Google Anmelden
...

Figure 2.8: Above the line an Xcerpt rule is given. It is conceived to retrieve all the hyperlinks from the start page of Google and to present them in a plain HTML document. The result of evaluating the rule is shown below the line—this time not in Xcerpt data term syntax, but in XML (and hence XHTML) syntax.


Xcerpt Program Evaluation The evaluation of an Xcerpt program is the chaining of the evaluation of the rules. The goal rule yields the result; the rules attached to external resources using the in-construct get input data (which is not strictly necessary, as data can also be defined internally). All the other rules query the internal program store and feed their results into this store. The operational semantics of chaining can be defined in two ways; there is no commitment to one of them by now, possibly both ways will be available as a choice for the programmer. In example 18 an Xcerpt program is presented.

GOAL TheResultOfTheCalculationThreePlusTwoIs[ var Result ] FROM plus[ three[] , two[] , var Result ] END

CONSTRUCT var NAT FROM in resource "http://example.com/naturalNumbers.xml" ( naturalNumbers{{ var NAT }} ) END

CONSTRUCT plus[ var X , zero[] , var X ] FROM succ[[ var X ]] END

CONSTRUCT plus[ var X , var Y , var RESULT ] FROM and( plus[ var X , var PRE_Y , var PRE_RESULT ] , succ[ var PRE_Y , var Y ] , succ[ var PRE_RESULT , var RESULT ] ) END
Code Example 18 This example illustrates an Xcerpt program for the addition of natural numbers, very much in the spirit of its mathematical definition. First, natural numbers are defined as a set of elements and a binary successor relationship. The natural numbers are considered to be in an external document (theoretically of infinite size, but for the sake of concreteness, an upper bound can be assumed). The document hence looks something like this: “naturalNumbers{ succ[ zero[] , one[] ] , succ[ one[] , two[] ] , ... }”. The program consists of a goal returning the sum of three and two. The addition is defined as a ternary relation between the first and second operand and the result. The definition is recursive and based on the successor relationship and the base case x + 0 = x.

The first kind of chaining, called forward chaining, is driven by the input data. The rules attached to the input data are evaluated first and initially feed the internal store. The other rules are then applied until saturation is reached, i.e. no further rule application to the internal store produces content that is not already in the store (the store can be seen as a set of data terms). Then, the goal is evaluated against the store and the result is returned. This type of chaining is very much in the spirit of Datalog. The advantage is that, even without focusing on a goal, all knowledge is pre-processed and rapidly accessible for various possible goals.8

The second chaining approach is called backward, or goal driven, chaining. When Xcerpt is implemented using goal driven chaining, the query term of the goal is evaluated against the heads of all rules, using a slightly extended version of simulation unification, able to unify a query term with a construct term. The rules for which the simulation unification with the head provided non-empty substitution sets are then recursively evaluated the same way as the goal (i.e. by simulation unification with other rule heads). In substitution sets produced by this kind of simulation unification, variables may not only be mapped to data terms, but may also be mapped to construct terms containing variables. When a branch in the recursive process reaches an end, the substitution sets are applied along the branch to the goal in a similar way to applying them to a single construct term. In comparison to forward chaining, backward chaining may calculate fewer ‘useless’ facts, as only facts that could contribute to the result are evaluated. The approach is especially useful when the set of facts to reason about or the data to query is very big, possibly infinite, but the query is very selective. On the other hand, the backward driven approach has no means to store intermediate results, called materialized views previously in the context of forward chaining. Further, some recursive programs on finite data can trap the evaluation in an endless loop, as there is no inherent recognition of the construction of duplicates, as in the set concept of the store in forward chaining.

8Note that an Xcerpt program always contains only one goal, so the pre-processing can be compared to a system setting where programs are seen as queries to a database system with materialized views.

Further Constructs of Xcerpt are presented now. They are considered “further” because they are not essential for understanding Xcerpt's paradigm. Some of them can be replaced using the base constructs; they are useful to ease programming or to enable efficient evaluation due to the optimization of higher level constructs.

Descendant Sometimes it is desirable to state that a certain sub-term may occur as a child term of its parent, but that it may also be a grand-child, great-grand-child or any descendant. As a practical example, consider the case of searching for hyperlinks (as previously when introducing Xcerpt rules, consider example 2.8) in a Web page: hyperlinks may actually occur at any nesting level (e.g. nested in style sections, tables, etc.). To capture really all hyperlinks, the descendant construct is useful (see example 19). A term can be adorned with the leading desc keyword to indicate that the desired sub-term may occur at any depth in the document, starting from the sub-term's position.

html[[ desc a(href=var HREF)[[]] ]] Code Example 19 This query term can be considered a replacement for the query term in the rule of example 2.8. Using this query pattern, not only hyperlinks that are direct children of the body-element will be found, but also those nested at a deeper level in the document structure.
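The behaviour of desc can be pictured with a short sketch (a hypothetical tree model (label, children), not Xcerpt's implementation): the sub-query is tried at the current node and, recursively, at every descendant.

```python
def desc_matches(pattern_label, tree):
    """Collect all sub-trees whose label matches, at any depth."""
    label, children = tree
    found = [tree] if label == pattern_label else []
    for child in children:
        found.extend(desc_matches(pattern_label, child))
    return found

# anchors at two different nesting depths, as in the hyperlink example
page = ('html', [('body', [('div', [('a', [])]), ('a', [])])])
assert len(desc_matches('a', page)) == 2
```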

As The as-construct is useful to specify a query (sub-)term and a variable at the same location. This is useful to get a variable binding restricted to terms simulated by the query term associated to the variable. As an example (see example 20), consider the case that, in an extension of the Xcerpt rule example (see example 2.8), the list of hyperlinks found on Google should not only retrieve the URLs, but the whole hyperlink (hence the a-element with all attributes and the anchor text).

html[[ desc var HYPERLINK -> a[[]] ]] Code Example 20 This query term is an extension of the term in example 19. It is used to harvest hyper- links in HTML documents. They are easily found using the a[[]] sub-query. Using the as-construct, those elements are bound to a variable called HYPERLINK.

Optional A sub-term of a query can be marked as an optional pattern. An optional pattern simulates if possible—if the queried data contains appropriate sub-terms, they will be simulated by the optional term, providing corresponding variable bindings for variables occurring in the optional part; if no simulation is possible for the optional (sub-)term, but for the non-optional parts, the query term simulates as well. Assuming that variables are in the scope of an optional term not simulating data, the variables will be bound to a default value that can be given in the program. Example 21 presents a query term with optional parts.

university{{ student(name=var NAME){{ optional grade{ var GRADE with-default "No Grades by now" } }} }} Code Example 21 To query a university database for all the students with the grades they received, this query is useful. Assuming that not all students achieved grades, the variable GRADE is only bound to data if grades are available. Otherwise, a default binding, the text "No Grades by now", is given for the variable.

Since the as-construct restricts the bindings of a variable, it is also called variable restriction.

Negation A (sub-)query term may be adorned by not or without, indicating that the query term containing the negated sub-term only matches if there is no match for the negated sub-term.

Position Sometimes it is necessary to specify the position at which a term matched by a query term has to be in its sequence of sibling nodes. The corresponding query term is attributed with the position-construct, which gets a number as argument. It is also possible to use a variable as argument; in this case position does not restrict the position of the matched sub-term, but binds the variable to the position of the sub-term.

Regular Expression Regular expressions are useful to restrict the simulation of textual content, i.e. text nodes. It is also possible to use regular expressions for element labels, giving rise to patterns matching various, differently named, elements.

Logical Connectors In a query it is possible to specify several query terms, connected with the logical connectors “and” and “or”. Multiple occurrences of the same variable in a conjunction of query terms may only unify if they have the same bindings in the simulations of both query terms.

Where Clause The where-construct is used to specify restrictions on variables that are not of structural nature or not related to the query term. This can be e.g. restricting the match of a variable to a certain value, fulfilling conditions involving functions, like arithmetic functions, for certain variables, or simply expressing relations between several variables.


Part II

R2G2


3 R2G2—Regular Rooted Graph Grammars

For the typing of Web query languages, a type declaration formalism is needed. From the point of view of a user who wants to use a query language with type support and to use the given schema languages as type declarations, the shortcomings presented in sections 2.4.1 and 2.4.2 are annoying. The new type and schema language R2G2 is a schema language for tree and graph shaped semistructured data like XML documents (or Xcerpt data terms). R2G2 is formally based on extensions of regular rooted tree grammars and is hence similar to existing schema languages such as DTD, XML Schema, Relax NG or the type system of XDuce. New contributions are:

• Handling of graph shaped data based on the simulation preorder of graphs and rational trees.

• Unordered interpretation of regular expressions for unordered content specifications.

• Typed references for XML.

• Modelling of tree and reference based serializations of graph shaped data.

3.1 Regular Tree Grammars

Regular tree grammars describe sets of trees by their shape. They are precisely introduced in [21]. They are syntactically a subset of context free grammars (as shown e.g. in [25]). As a prerequisite, trees over an alphabet Σ are defined. Definition 3.1 Let Σ be a set of tree node labels and Σp ⊆ Σ all labels of nodes that have p direct child nodes. The set T (Σ) denotes the set of all trees that can be constructed using the symbols in Σ applying the following rules:

• Σ0 ⊆ T (Σ), those nodes will be called leaf nodes.

• for p ≥ 1, if l ∈ Σp and t1, . . . , tp ∈ T (Σ), then l(t1, . . . , tp) ∈ T (Σ)


Definition 3.2 A regular tree grammar G = (S, N, Σ, R) contains a so called axiom S, a set of non terminal symbols N, a set of terminal symbols or tree labels Σ and a set R of production rules of shape α → β with α ∈ N and β ∈ T (Σ ∪ N). Note that T (Σ ∪ N) denotes trees where the symbols of N all have arity 0 and the symbols of Σ arbitrary, but fixed, arity.

A (tree) grammar is sometimes called a generator for a language (of trees). A grammar G = (S, N, Σ, R) generates the language L(G) consisting of all tree instances that can be derived starting from the axiom S of G using the rules R of G. A grammar generates trees that can be derived by replacing non terminal symbols—symbols of N—in trees of T (Σ ∪ N) by the right hand side of a grammar rule of R, if the left hand side of the rule corresponds to the given non terminal symbol. The starting non terminal symbol is always the axiom S. A non terminal replacement step is called a derivation step. A valid data tree is generated by a sequence of derivation steps that replaces all non terminals in the intermediate tree of T (Σ ∪ N), i.e. a successful sequence of derivations always yields an instance of T (Σ).

Example 3.1 A language of (head/tail encoded) lists of (unary encoded) natural numbers can be modelled as a regular tree grammar as follows: G = (List, {List, Nat}, {0, nil, s/1, cons/2}, R) where
R = { List → nil ,
      List → cons(Nat, List) ,
      Nat → 0 ,
      Nat → s(Nat) }
An instance of the language L(G) is e.g. cons(s(s(0)), cons(0, nil))—this corresponds to the (unary encoded) sequence of (decimal encoded) natural numbers [2, 0].
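Membership in L(G) for the grammar of example 3.1 can be decided by straightforward top-down matching of trees against the production rules. The following is an illustrative sketch, with trees as nested tuples:

```python
# trees as nested tuples: ('cons', t1, t2), ('s', t), ('0',), ('nil',)
RULES = {
    'List': [('nil',), ('cons', 'Nat', 'List')],
    'Nat':  [('0',), ('s', 'Nat')],
}

def derives(nonterminal, tree):
    """True if `nonterminal` can derive `tree` using RULES: some rule's
    root label matches and every child derives from its non-terminal."""
    for rhs in RULES[nonterminal]:
        if rhs[0] == tree[0] and len(rhs) == len(tree) and \
           all(derives(nt, sub) for nt, sub in zip(rhs[1:], tree[1:])):
            return True
    return False

two_zero = ('cons', ('s', ('s', ('0',))), ('cons', ('0',), ('nil',)))
assert derives('List', two_zero)             # the instance [2, 0] from above
assert not derives('List', ('s', ('nil',)))  # a Nat symbol at List position
```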

Regular Tree Grammars for Unranked Trees The presented tree grammar formalism is able to generate ranked tree languages, i.e. tree languages with symbols of fixed arities—in example 3.1, s is of arity 1, nil of arity 0 and cons of arity 2. Semistructured data, XML and Xcerpt data terms are data models with trees of variable arity, also called unranked trees—a tree of a given type or label may have arbitrarily many child trees. All properties of languages of regular ranked trees can be extended to unranked trees if a special transcription of unranked trees to ranked trees is applied—unranked trees are decomposed in such a way that the child trees of a tree get sequenced in a list as presented in example 3.1. To directly model unranked regular trees using a tree grammar formalism, two extensions can be thought of:

1. Allow additional grammar rules of the shape A → B, t with A, B ∈ N (where N is the set of non terminals) and t ∈ T (N ∪ Σ). Further, the symbols in Σ have to be unranked, i.e. allowing sequences of arbitrarily many child terms. The additional grammar rules correspond to production rules of Chomsky type III grammars and can model sequences of trees.

2. Allow regular expressions of non terminals at the position of the child non terminals in the right hand side of production rules, e.g. A → l[ re ] where A ∈ N and re is a regular expression over N.1

1The square braces “[]” have been chosen to better distinguish unranked tree grammar rules from ranked ones for which parenthesis “()” are used.

The approach with Chomsky type III production rules is arguably syntactically simpler, yet more demanding for the author of a grammar, as it is easy to write non type III rules violating the regularity constraint on the language. Further, the intended sequence is less obvious, as it is split up into rules and mixed with tree structure rules—the order of element sequences is not automatically reflected by the order of the grammar rules. The regular expression approach will be followed in this thesis due to the shortcomings of the first approach, even though it is a bigger syntactical extension of regular tree grammars than the type III production rule extension. The regular expression approach is also more established than the pure tree grammar approach.

Definition 3.3 A regular unranked tree grammar G = (S, N, Σ, R) contains a so called axiom S, a set of non terminal symbols N, a set of terminal symbols or tree labels Σ and a set of production rules R of shape α → β with α ∈ N and β of shape l[ re ] where l ∈ Σ and re is defined as follows:

re ::= ε
re ::= a        (a ∈ N)
re ::= ( re )
re ::= re, re
re ::= re | re
re ::= re+
re ::= re∗
re ::= re?

By ε the empty regular expression is denoted; it accepts the empty word. In the concrete syntax of regular expressions, the ε is omitted.

Example 3.2 A regular tree grammar for apple trees with branches containing arbitrarily many leaves and apples, or other branches (at least two), is presented as an example. G = (Tree, {Tree, Branch, Leaf, Apple}, {tree, branch, leaf, apple}, R) where
R = { Tree → tree[Branch] ,
      Branch → branch[(Branch, Branch+) | (Leaf | Apple)∗] ,
      Leaf → leaf[] ,
      Apple → apple[] }
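Validation against such content models can be reduced to ordinary word-regular-expression matching over the children's non-terminals. The following illustrative sketch checks trees against the apple-tree grammar of example 3.2 using Python's re module, encoding each non-terminal as one character; it assumes, for brevity, that each label determines its non-terminal uniquely (which holds in this example, but not for tree grammars in general):

```python
import re

CODE = {'Tree': 'T', 'Branch': 'B', 'Leaf': 'L', 'Apple': 'A'}
LABEL2NT = {'tree': 'Tree', 'branch': 'Branch',
            'leaf': 'Leaf', 'apple': 'Apple'}
CONTENT = {                 # right-hand side content models over the codes
    'Tree':   'B',
    'Branch': '(BB+)|((L|A)*)',
    'Leaf':   '',
    'Apple':  '',
}

def valid(tree, nt):
    """tree = (label, [children]); check it against non-terminal nt."""
    label, children = tree
    if LABEL2NT.get(label) != nt:
        return False
    word = ''.join(CODE[LABEL2NT[c[0]]] for c in children)
    return re.fullmatch(CONTENT[nt], word) is not None and \
           all(valid(c, LABEL2NT[c[0]]) for c in children)

t = ('tree', [('branch', [('leaf', []), ('apple', []), ('leaf', [])])])
assert valid(t, 'Tree')
bad = ('tree', [('branch', [('branch', [('leaf', [])])])])
assert not valid(bad, 'Tree')   # a branch must contain at least two branches
```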

3.2 Regular Tree Grammars for Unordered Unranked Trees

In many applications of data modelling, the order of instances in some context is of minor importance. Typical examples are object oriented databases or programming languages, where the order in which the attributes of a class are defined is irrelevant e.g. for their access—the name is the only property used for accessing attributes. With XML in contrast, the document centric paradigm always imposes an order on the elements in a sequence of child nodes. However, attributes are considered to be unordered information of an element, so XML provides a limited2 concept of mixed ordered and unordered content. An example of XML documents where the order of the elements is of high importance is an HTML document—headings and paragraphs occur in an order absolutely relevant for the document, and ignoring the order of those elements could arguably render the document meaningless. Many applications of XML are data centric and not document centric, e.g. more in the spirit of semistructured databases. For many such applications, the order imposed by the document is irrelevant. As an example, consider a bibliographic database: the order of the books in the database may be completely arbitrary, e.g. perhaps the order in which the librarian grabbed the books out of the shelves to index them. The relevant information for accessing the books is e.g.

2limited, because attributes are arguably less expressive than elements, as they may not contain structured information but only atomic values.

title, author or any information about the book suitable for indexing. For such an application, arguably the following two databases can be considered to be equivalent, yet the XML semantics considers them to be different.

<book>
  Data on the Web
  <authors> S.Abiteboul, and P.Buneman, and D.Suciu </authors>
</book>
<book>
  Automata, Languages, and Programming
  <authors> S.Abiteboul, and E.Shamir </authors>
</book>
Code Example 22 First variation of a bibliography with arguably irrelevant order of the book elements

<book>
  Automata, Languages, and Programming
  <authors> S.Abiteboul, and E.Shamir </authors>
</book>
<book>
  Data on the Web
  <authors> S.Abiteboul, and P.Buneman, and D.Suciu </authors>
</book>
Code Example 23 Other variation of the bibliography of code example 22, with the book elements in a different order

Xcerpt distinguishes ordered child sequences from unordered child multisets, to be able to realize data for both application paradigms—data centric and document centric—in an appropriate manner. Using unordered Xcerpt content specifications, the former two databases are modelled to be semantically equivalent. To be able to model unordered content appropriately in a grammar, a dedicated modelling formalism is necessary. With the type checking goal in mind, a formalism needs to be decidable concerning membership test, emptiness check and subset test, and it needs to be closed under intersection, as motivated in 8.

Current Modelling Formalisms for languages of Multisets

Various formalisms for modelling languages of multisets have been proposed, of which some will be briefly presented here:

Multiplicity Lists Multiplicity lists [16] are a metalanguage for specifying decidable sets of data terms, which are used in later work of the authors [65][13] as types in the processing of tree-structured data. The idea is motivated by DTDs and by XML Schema. A multiplicity list is a regular type expression of the form s1(n1 : m1) ··· sk(nk : mk) where k ≥ 0 and s1, . . . , sk are distinct type names. Informally, the meaning is that a multiset of symbols is valid with respect to the given multiplicity list if each symbol si occurs at least ni times but not more than mi times. Symbols other than the mentioned s1 through sk may not occur. Multiplicity lists are closed under intersection but not under union and complement. Multiplicity lists have been applied successfully in static type checking of Xcerpt programs as presented in [Descriptive Typing Rules for Xcerpt and their Soundness] and [A Prototype of a Descriptive Type System for Xcerpt].
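The validity condition reads off directly as an algorithm. The following sketch checks a multiset of symbols against a multiplicity list given as (symbol, lower bound, upper bound) triples (an illustration of the definition, not the cited implementation; the element names are made up):

```python
from collections import Counter

def valid_multiset(symbols, mlist):
    """mlist: list of (symbol, lo, hi) with distinct symbols. Valid iff
    every listed symbol occurs within its bounds and no other occurs."""
    counts = Counter(symbols)
    allowed = {s for (s, _, _) in mlist}
    if set(counts) - allowed:          # a symbol not mentioned in the list
        return False
    return all(lo <= counts[s] <= hi for (s, lo, hi) in mlist)

# hypothetical type: author(1:4) title(1:1)
mlist = [('author', 1, 4), ('title', 1, 1)]
assert valid_multiset(['title', 'author', 'author'], mlist)
assert not valid_multiset(['title', 'title', 'author'], mlist)     # two titles
assert not valid_multiset(['title', 'author', 'isbn'], mlist)      # isbn not listed
```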

L-Formulae Frank Neven and Thomas Schwentick presented L-Formulae [39] as a decidable formalism for unordered content models on the Web. L-Formulae are defined as ϕ ::= true | false | a = i | a ≥ i | ¬ϕ | ϕ ∨ ϕ with a ∈ Σ and i ∈ N. Roughly speaking, a = i means that

the symbol a occurs exactly i times in a valid multiset, and a ≥ i that it occurs at least i times. L-Formulae are closed under intersection, complement and union. When typing or type checking semi-structured data queries in general and Xcerpt programs in particular, an interesting property emerges when querying ordered data without caring about the order, i.e. querying ordered data using order-agnostic query constructs. As an order-agnostic query construct has unordered type semantics, the compatibility of an unordered type with ordered, typed data has to be checked. Ordered regular expressions can model arity dependencies between symbols; e.g. the regular expression "(aab)+" states that there are twice as many a symbols as b symbols in valid data. Using the formulae of Neven and Schwentick, it is not possible to express such dependent occurrence constraints, as the multiplicities are always stated absolutely. Linear equation systems restricted to natural number solutions can express such constraints. Interestingly, it is almost enough to use Presburger arithmetic [42], whose formulae are much easier to solve under the given natural number solution constraint. This approach is used as the starting point of an unordered type specification language later on.
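A minimal sketch of this observation (the function name is hypothetical): under unordered interpretation, membership in "(aab)+" collapses to a linear counting constraint over the natural numbers:

```python
def satisfies_aab_plus_unordered(n_a, n_b):
    """Unordered interpretation of (aab)+: each repetition contributes
    two a symbols and one b symbol, so the multiplicities must satisfy
    n_a = 2*k and n_b = k for some natural k >= 1 -- a linear equation
    system over the natural numbers."""
    return n_b >= 1 and n_a == 2 * n_b

print(satisfies_aab_plus_unordered(4, 2))  # True: e.g. a permutation of "aabaab"
print(satisfies_aab_plus_unordered(3, 2))  # False: no repetition count fits
```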

Presburger Arithmetic Expressions A formalism not specifically conceived for unordered content models, yet formally elegant, are Presburger arithmetic formulae [42]. The approach has been applied to unordered content models in XML by Denis Lugiez and Silvano Dal Zilio to formalize XML Schema [67] and by Seidl, Schwentick, Muscholl, and Habermehl [45] as a formal model for unordered content models. Presburger arithmetic is the first-order theory of the natural numbers with addition but without multiplication. For any given statement of Presburger arithmetic it is decidable whether it is true and, if it contains free variables, under which variable bindings it is true. The relationship between languages of multisets and Presburger arithmetic rests on the commutativity and associativity of addition (in Presburger arithmetic) and of the sequence operator (in the serialisation of multisets). The commutativity and associativity of sequence operators in serialisations of bags, multisets or sets has been presented e.g. in [18]. A multiset of symbols is fully characterized by the multiplicities of the occurring symbols. A Presburger arithmetic expression with variables may have many solutions, i.e. many natural number assignments for its variables. Each assignment can be interpreted as the multiplicities of the symbols of a multiset which is valid with respect to the given Presburger arithmetic expression.

Applications, Shortcomings and Extension of Current Approaches While multiplicity lists are very restricted and not closed under union and complement, they are arguably an elegant way to model multisets in the context of mixed ordered and unordered content, as multiplicity lists are restricted regular expressions wrapped with a clear and simple interpretation of unorder. While Presburger arithmetic is more expressive and offers the decidability and closedness properties needed for type checking, end users without scientific background can arguably be confused by the very mathematical formalism and by the need to learn two content model specification mechanisms—regular expressions for ordered content models and Presburger arithmetic formulae for unordered content models. For modelling unordered content with the grammar formalism proposed later, an extension of multiplicity lists will be introduced—unrestricted regular expressions with unordered interpretation. As a not yet formally introduced example, consider the regular expression for a data object representing a course in a dancing school as seen in code example 24:

DancingMaster, ((Boy, Girl)+ | BalletDancer+ ) Code Example 24 A regular content model representing well balanced dancing classes for either ballet dancers or standard couple dancers. While the regular content model implies an order, the order of the objects are arguably not of importance for the application.


For every course there must be a dancing master. Courses may be standard dance courses, in which the same number of boys and girls have to be registered so that everyone can always have a partner. Courses may be ballet dancing courses, in which case at least one ballet dancer has to be registered. The order of the objects in a content set corresponding to this regular expression is irrelevant—e.g. the dancing master may be assigned at the end, when the kind of course is clear. Therefore every word whose symbols are a permutation of a word matched by the regular expression is a member of the unordered language represented by the regular expression. A Presburger arithmetic expression representing the same constraints on a valid dancing course is arguably more difficult to understand and to write for authors: let d be the multiplicity of DancingMaster objects, m the number of Boys (m for male), f of Girls (f for female) and b of BalletDancers. A multiset of objects is in the given language if the multiplicities of the symbols fulfil the following formula:

d = 1 ∧ ((m = f ∧ m ≥ 1 ∧ f ≥ 1 ∧ b = 0) ∨ (m = 0 ∧ f = 0 ∧ b ≥ 1))
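The formula can be evaluated directly over candidate multiplicities; a sketch (the function name is hypothetical):

```python
def valid_course(d, m, f, b):
    """Evaluate the Presburger formula over the multiplicities:
    d = 1 and ((m = f and m >= 1 and f >= 1 and b = 0)
               or (m = 0 and f = 0 and b >= 1))."""
    return d == 1 and ((m == f and m >= 1 and f >= 1 and b == 0)
                       or (m == 0 and f == 0 and b >= 1))

print(valid_course(1, 3, 3, 0))  # True: standard course with 3 couples
print(valid_course(1, 2, 3, 0))  # False: unbalanced couples
print(valid_course(1, 0, 0, 5))  # True: ballet course
```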

As shown in Chapter 7, the unordered interpretation of regular expressions can always be mapped to Presburger arithmetic constraints and is therefore decidable and closed under the desired properties; no system for translating Presburger formulae and constraints back to regular expressions will be presented, as it is not strictly needed for type checking. An advantage of this approach is that it can also be used for type checking unordered queries on ordered content models (see e.g. section 2.5.4), which is valid and useful in Xcerpt as presented in [43].

3.3 Regular Rooted Graph Grammars

In the context of the Web and the Semantic Web, graph shaped data is often considered in addition to tree shaped data. Usually a dominant spanning tree of the graph is given and the 'missing' links are added by means of some reference mechanism. XML provides an integral reference mechanism known as ID/ID-Ref [61], and various linking and reference based standards have been built around XML, like XML Fragment Interchange [52], the XML Linking Language (XLink) [53] and many others. A practical example of such graph structured data is widespread in HTML documents—the use of internal links. Consider the HTML document in code example 25, with two references modelling a kind of circular link structure between two paragraphs:

This paragraph refers to the next paragraph (and is referred by it).

This paragraph refers to the previous paragraph (and is referred by it).

Code Example 25 An HTML document with two references modelling a kind of circular link structure between two paragraphs (only the visible text of the listing is reproduced here).
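For concreteness, a minimal sketch of such a circular link structure; the id values and anchor placement are illustrative assumptions, not the thesis's original markup:

```html
<html>
  <body>
    <p id="first">This paragraph <a href="#second">refers to the next
      paragraph</a> (and is referred by it).</p>
    <p id="second">This paragraph <a href="#first">refers to the previous
      paragraph</a> (and is referred by it).</p>
  </body>
</html>
```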

Another example of typically graph shaped structures modelled in XML are RDF documents.

3.3.1 Reference Types and Typed References

Modelling of graph shaped structures is supported in current XML schema formalisms like DTD, Relax NG and XML Schema by modelling elements containing ID and ID-Ref attributes. The


ID/ID-Ref mechanism is global throughout the whole document, meaning that any reference may refer to any identifier. Unfortunately this does not permit modelling typed references, i.e. references to elements of a certain type. Consider the following example for an illustration of the ID/ID-Ref mechanism:

Example 3.3 Consider a grammar, using a similar syntax as in example 3.2, that models books and authors in a kind of bibliographical database. To prevent redundancy and misspelling of the author names, the authors are kept in an index of authors and are referred to from the definitions of the books in another section of the document. The grammar uses the special type name "ˆ" for references, and the type name "@" is used to denote IDs. An element should only get one identifier, which arguably simplifies understanding document instances. To emphasise the special role in multiplicity of an identifier (each element may not contain more than one identifier), it is prefixed to the element name in the type rules. An author element contains an "@", denoting that instances of that type have an identifier, and the authors list of a book has an arbitrary number of references to author elements. Note that, for the sake of conciseness with the syntax of R2G2 presented later on, the type names "ˆ" and "@" have been chosen such that they harmonize with R2G2. In turn, the syntax of R2G2 has been streamlined with the syntax of Xcerpt. The mapping to XML attributes named id and ref is given in an opaque way in this example; a concrete schema or type formalism would need a way to specify such mappings. Further on, the type names Name and Title are synonyms for plain text, or CDATA in the sense of XML—again, a concrete schema or type formalism needs support for atomic data types to be practically useful.

G = (Bibliography, {ˆ, @, Bibliography, AuthorIndex, Author, BookIndex, Book, Authors}, {bib, authors, author, books, book}, R) where R = { Bibliography → bib[AuthorIndex, BookIndex] , AuthorIndex → authors[Author+] , Author → @author[Name] , BookIndex → books[Book+] , Book → book[Title, Authors] , Authors → authors[ˆ+] }

See code example 26 for an XML instance valid with respect to the schema shown above.


[Author index entries: Shamir Eli, Stevens W., Abiteboul Serge, Buneman Peter, Suciu Dan; book entries: Automata, Languages, and Programming; Data on the Web; Advanced Programming in the Unix environment; TCP_IP Illustrated.]

Code Example 26 An XML document instance valid with respect to the schema shown above.
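For concreteness, one plausible serialisation consistent with the grammar of example 3.3 is sketched below; the use of an author element with a ref attribute for references, the id values, and the assignment of authors to books are assumptions of this sketch:

```xml
<bib>
  <authors>
    <author id="a1">Shamir Eli</author>
    <author id="a2">Stevens W.</author>
    <author id="a3">Abiteboul Serge</author>
    <author id="a4">Buneman Peter</author>
    <author id="a5">Suciu Dan</author>
  </authors>
  <books>
    <book>Automata, Languages, and Programming
      <authors><author ref="a3"/><author ref="a1"/></authors>
    </book>
    <book>Data on the Web
      <authors><author ref="a3"/><author ref="a4"/><author ref="a5"/></authors>
    </book>
    <book>Advanced Programming in the Unix environment
      <authors><author ref="a2"/></authors>
    </book>
    <book>TCP_IP Illustrated
      <authors><author ref="a2"/></authors>
    </book>
  </books>
</bib>
```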

The former document could be graphically interpreted as follows:

Now, consider the following example for an illustration of the problems with untyped references: Example 3.4 In contrast to the former example, books are also referable elements here, and author elements contain references to their books. This is modelled by providing the reference type name "ˆ" and the identifier type name "@" in both the author and book elements. G = (Bibliography, {ˆ, @, Bibliography, AuthorIndex, Author, BookIndex, Book, Authors},


{bib, authors, author, books, book}, R) where R = { Bibliography → bib[AuthorIndex, BookIndex] , AuthorIndex → authors[Author+] , Author → @author[Name, ˆ+] , BookIndex → books[Book+] , Book → @book[Title, Authors] , Authors → authors[ˆ+] }

The following example document illustrates a valid document with respect to the grammar. Unfortunately, it is not possible to distinguish references to authors from references to books, resulting in a conceptually invalid document, where a book is referred to as an author of another book and an author contains another author in his list of published books.

[Author index entries: Shamir Eli, Stevens W., Abiteboul Serge, Buneman Peter, Suciu Dan; book entries: Automata, Languages, and Programming; Data on the Web; Advanced Programming in the Unix environment; TCP_IP Illustrated—with some references crossed between the author and book sections.]

Code Example 27 This document is conceptually wrong with respect to the schema of example 3.4.

The document in code example 27 could be graphically interpreted as follows (the conceptu- ally wrong references are denoted by flashes crossing the edges):


The proposed schema language R2G2 will add typed references to regular tree grammars so as to model graph shaped data in a more precise way. Syntactically, typed references are type name extensions of references—the type name intended to be referred to is appended. Example 3.5 This grammar is an extension of the grammar in example 3.4, such that references to authors and references to books are clearly separated. The conceptually erroneous example document of example 3.4 is invalid under this grammar: G = (Bibliography, {ˆ, @, Bibliography, AuthorIndex, Author, BookIndex, Book, Authors}, {bib, authors, author, books, book}, R) where R = { Bibliography → bib[AuthorIndex, BookIndex] , AuthorIndex → authors[Author+] , Author → @author[Name, ˆBook+] , BookIndex → books[Book+] , Book → @book[Title, Authors] , Authors → authors[ˆAuthor+] }

3.3.2 About (Non Tree Structured) Graphs and Tree Grammars

In some graph serialisation formalisms like RDF, the structure of the serialisation is semantically irrelevant in the sense that the underlying graph semantics of different serialisations is considered to be isomorphic. The underlying graph structure may nevertheless need some sort of schematizing. As long as a graph has a special node, called the root node from now on, which is chosen as the starting point for graph traversals, a tree grammar can also be used to model some structural properties of the graph. For a given root, the set of all possible graph traversals is unambiguously determined. A tractable way to realize rooted graph modelling using tree grammars is based on the simulation preorder: a tree grammar is a generator for the language of all trees that can be obtained by means of rule application, and also of all rooted graphs that can be obtained by sharing of nodes that result from the same (possibly infinite) chain of rule applications. An implication is that, concerning schema validity, there is no distinction between two graphs where one of them shares one instance of a node in many positions, and the other one has multiple instances with identical shape

or value used instead of sharing. One graph is then indistinguishable from another one if it is simulated by the other graph. Graph isomorphism is arguably the most precise notion of indistinguishability for graphs; simulation is weaker in the sense that graphs that are not isomorphic may simulate or even bi-simulate each other. Consider figure 3.1 for two rooted directed graphs that bi-simulate but that are not isomorphic—the central difference between (bi)simulation and isomorphism is that bi-simulation does not require a bijection between the nodes of the two graphs such that related nodes have similar inbound and outbound behaviour, while such a bijection is needed for isomorphic graphs. The disadvantage of identification, distinction or recognition of objects using graph isomorphism is that decidability comes at exponential cost, while the simulation preorder of two graphs can be checked in polynomial time. Arguably, identification, distinction or recognition of objects based on simulation is useful in many practical contexts on the Web, as many practical use cases with Xcerpt show [33]. For a brief introduction to simulation and simulation unification along with some examples, see section 2.5.4.

Figure 3.1: Two graphs that bi-simulate but that are not isomorphic.
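The polynomial simulation check mentioned above can be sketched as a greatest-fixpoint computation; the graph encoding and all names below are choices of this sketch, not of the thesis:

```python
def simulates(g1, g2, l1, l2):
    """Compute the simulation preorder between two rooted, labelled
    directed graphs by greatest-fixpoint refinement: start from all
    label-compatible pairs and remove (n, m) whenever some successor
    of n has no matching successor of m. Polynomial, in contrast to
    graph isomorphism. g1/g2: {node: [successors]}, l1/l2: {node: label}.
    (n, m) in the result means m simulates n."""
    rel = {(n, m) for n in g1 for m in g2 if l1[n] == l2[m]}
    changed = True
    while changed:
        changed = False
        for (n, m) in list(rel):
            # every successor of n must be simulated by some successor of m
            if any(all((s, t) not in rel for t in g2[m]) for s in g1[n]):
                rel.discard((n, m))
                changed = True
    return rel

# In the spirit of figure 3.1: a shared child versus two duplicated
# children -- the graphs bi-simulate although they are not isomorphic.
g1 = {"r": ["x"], "x": []}
g2 = {"r": ["x1", "x2"], "x1": [], "x2": []}
l1 = {"r": "a", "x": "b"}
l2 = {"r": "a", "x1": "b", "x2": "b"}
rel = simulates(g1, g2, l1, l2)
print(("r", "r") in rel)  # True: g2's root simulates g1's root
```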

3.4 The Syntax of R2G2

A grammar for R2G2 is given as a set of syntax graphs. The graphs are interleaved with corre- sponding explanations.

3.4.1 Core R2G2 Syntax

The core R2G2 syntax consists of all constructs necessary to model rooted graph grammars. Features such as namespace handling, modules, predefined atomic data types and the various global settings of a grammar are not core syntax features. The separation of core and non-core features is roughly motivated by the application of R2G2 definitions for type checking or validation—non-core R2G2 definitions can mostly be translated into core definitions, and operations like automata construction are done on the core level. To get a concise language definition, non terminals for non-core features are used in the language definition rules of the core syntax, yet their definitions follow in a later subsection.

Definition 3.4 (grammar)

[Syntax diagram: a repetition of rule, globalsetting and root declarations.]


A grammar contains one to many rules and relates them to root declarations. In the context of Xcerpt a grammar as type definition will be located either in a separate file (being included using the modularization mechanism presented later) or it may be in a special context inside the Xcerpt program. A grammar may also contain global switches that affect the behaviour or semantics of the whole grammar.

Definition 3.5 (root)

[Syntax diagram: the keyword root followed by a typename.]

A root declaration declares a type name as a possible type for a root element, and therefore as a possible type for a data tree or graph. The type name used has to be declared later on, otherwise the grammar is invalid. In the syntax diagrams, white space comprises the space character as well as \n, \r and \t, i.e. new line, carriage return and the tabulator character.

Definition 3.6 (rule)

[Syntax diagram: an optional keyword (element or type), a typename, "=", and a typeterm, optionally terminated by a semicolon.]

A rule is introduced either by the keyword element or by the keyword type. The keyword type is optional and especially useful for compatibility with type definitions as found in Xduce [28], as many Xduce type definitions are also R2G2 definitions with similar syntax and semantics. A rule relates a type name (sometimes also called non terminal) and a type term. A type definition rule may or may not be terminated using a semicolon and arbitrary spaces and new lines. The same type name may be used in many rules involving different type terms.

Definition 3.7 (typename)

NMTOKEN

Definition 3.8 (label)

NMTOKEN

Type names and labels are alphanumerical strings with a letter as first character. For the purpose of compatibility with XML, this can be extended to NMTOKEN as defined in the XML recommendation [61]. The label of a type term corresponds to the element labels of data trees matching the type term. Note that NMTOKEN allows the use of the colon sign ":", which enables namespace aware schema modelling. Namespaces and namespace prefixes can be declared as shown in section 3.4.5. Type names and type term labels may coincide, as the context determines the meaning of a corresponding token.

Definition 3.9 (typeterm)


[Syntax diagram: a typeterm is either a reference type term—"ˆ" followed by a typename—or a type declaration term: an optional "@", a label (or a posixRegExp enclosed in slashes), an optional attributetype list in parentheses, and a content model (ordCntOrdSchema, ordCntUnordSchema or unordCnt).]

A type term may be a type declaration term, optionally prefixed with an "@"—then it is a type declaration term for referable instances—or it may be a reference type term—a type name with a hat as prefix. A type declaration term always has a label declaration and a content model; this content model may either be an unordered content model or an ordered content model. Label declarations may be specified as regular (string) expressions at the position of the label; regular (string) expressions at label position are enclosed by slashes. Ordered content models can be specified in an unordered way or in an ordered way. Optionally, a type declaration term may have attribute type declarations in the sense of XML attributes. They are declared in parentheses between the label and the content specification. While usually any data instance may be a reference to some defining instance, even if the corresponding type is not declared as a reference type, the global grammar setting strictreferences can be used to restrict valid data instances such that references may only be instances of reference types. A non-reference type term can be prefixed by an "@", denoting that instances of that type must be referable; e.g. in an XML ID/ID-Ref setting they must have an identifier. While usually any data instance may be defined using an identifier, even if the corresponding type is not declared as referable, the global grammar setting strictreferences can be used to restrict valid data instances such that only referable types can and must be indexed.

Definition 3.10 (attributetype)

[Syntax diagram: a label, "=", and one of enumerationDeclaration, basicDataType or posixRegExp.]

Attribute types are tuples of a label and a base type (base types are explained later in section 3.4.2) that may be braced and question-mark annotated for optionality, in the spirit of regular expression optionality.

53 3.4 THE SYNTAX OF R2G2

Definition 3.11 (ordCnt)

[Syntax diagram: a regexp enclosed in square brackets.]

The ordered content of a type term is a regular expression (of type terms and type names) surrounded by square brackets. An ordered type term specifies a type such that all data tree instances of that type have ordered content and the types of the child trees match the regular expression of the content model. This corresponds roughly to the content definitions expressible using XML Schema, DTD and Relax NG. To specify types with an empty content model, the brackets can be left empty.

Definition 3.12 (ordCntUnordSchema)

[Syntax diagram: a regexp enclosed in "[{" and "}]".]

Some ordered content may be specified more comfortably in an unordered way. As an example, consider an element representing the member list of a well balanced dancing class with the same number of boys and girls:

[An alphabetically sorted member list containing boy and girl elements: Anna, Fitzgerald, Guillermo, Quibee.]

Code Example 28 An example document representing a well balanced dancing class—there are as many boys as girls in the class. The order of the members is conceptually irrelevant.

The given constraint on the language is not regular—this can easily be seen, as the language contains the sub-language "n boy elements followed by n girl elements", which is well known to be non-regular. However, the language "one boy followed by one girl, and this repeated n times" is regular (e.g. (boy, girl)∗). Ordered content with unordered specification uses a regular expression to model a language, but additionally all permutations of the words contained in the language of that regular expression are members of the unordered specified ordered language. The order of the data instances is considered to be relevant—e.g. in the former example we have an alphabetically sorted list of dancing class members—but the specification does not impose an order.
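The unordered interpretation described here—a word belongs to the language if some permutation of it matches the regular expression—can be sketched naively (an exponential brute force, for illustration only; b and g abbreviate boy and girl):

```python
import re
from itertools import permutations

def in_unordered_language(word, pattern):
    """Unordered interpretation of a regular expression: a word belongs
    to the language if SOME permutation of it matches the (ordered)
    expression. Brute force over permutations -- usable only for tiny
    inputs; Chapter 7 reduces this to counting constraints instead."""
    regex = re.compile(pattern)
    return any(regex.fullmatch("".join(p)) for p in permutations(word))

# (bg)* with b = boy, g = girl: balanced classes in any order
print(in_unordered_language("gbgb", "(bg)*"))  # True: permutation "bgbg" matches
print(in_unordered_language("gbb", "(bg)*"))   # False: no permutation is balanced
```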

Definition 3.13 (unordCnt)

[Syntax diagram: a regexp enclosed in curly braces.]

A type term may have an unordered content specification, expressing that conforming data trees have an (unordered) multiset of sub trees instead of an (ordered) sequence of sub trees. By modelling data with unordered content, a storage or indexing system is given the freedom to ignore the order given at data instantiation and to reorder, e.g. for more efficient querying.


Definition 3.14 (regexp)

[Syntax diagram: a regexp is a typeterm, a typename or a contentMacroName; a parenthesized regexp; a regexp followed by one of the repetition operators +, *, ? or a numeric repetition range; a sequence of regexps; or a disjunction regexp | regexp.]

Regular expressions as defined here have sequences, disjunctions, optionality and various forms of repetition. The symbols or atoms of the regular expressions may be type names or type terms. A type name specifies a sub tree of that type in the tree containing the current content model. A type term as atom specifies a tree of anonymous type name. Obviously, type names may only be used if declared elsewhere, either in the grammar or imported by some means. In XML Schema and DTD a restriction called deterministic content model is imposed on the regular expression content models. In the W3C recommendation of XML it is said:

[...] More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error. [...]


As there are algorithms constructing non-deterministic finite state automata out of arbitrary regular expressions, the mentioned restriction is arguably not necessary. The language class of regular expressions with this restriction is not closed under union; this is not essential for type checking of Xcerpt, but arguably a nuisance, as the type of a query disjunction becomes inconvenient—if not impossible—to model.

3.4.2 Base Data Types

An important part of an XML document are the textual leaf nodes that a structuring element may contain. Often this data is the relevant information of a document; in this case the element structure wrapping it is considered to be mere markup or auxiliary information. Textual leaf data may occur wherever any structured element is allowed (except at the document root—it must be an element), but two textual leaf nodes may not occur as direct neighbours in a sequence of nodes. The only information that an attribute may contain is also a textual node. In XML the textual information is often called CDATA, for character data. As data oriented applications also use numerical or other data, such nodes will be referred to as data or data of base type. Arguably, such data can nevertheless be called character data, as in the XML world it has a textual representation in the corresponding XML serialisation—the proposition here is that the schema formalism models the data on a conceptual level, not on a syntactical level. The versatile nature of XML and its query languages makes the choice of an arbitrary set of fixed base data types questionable. For R2G2, a generic base data type system is hence proposed that can be instantiated with various concrete base data types. A small set of base data types is then given as a pragmatic, yet simple to use and to implement, set of basic data types. The generic base type system has been conceived with the needs of the Web and Semantic Web query language Xcerpt in mind and hence has features like functions, which are not relevant for R2G2 as a schema language at this point. Functions are an important concept for base data types in Web query languages and are of relevance for the application of R2G2 as a type language for Xcerpt, hence for Web query languages in general, and in the course of this thesis for Xcerpt and Xcerpt type checking.

Definition of the Generic (abstract) Base Data Type System By T the set of all base types is denoted. The set F = F∗ ∪ ⋃i Fi is the set of all functions, where Fi ⊆ T × T^i × Σ × 𝔽i is the set of all functions of arity i with result type out of T and argument types out of T^i. By Σ the set of all function names is denoted. By 𝔽i the set of functions of arity i at implementation level is denoted. The functions at implementation level are untyped in the sense of the discussed type system—they may be typed at the level of the implementation language, if typed dynamic function application is available in the hosting programming language. By F∗ ⊆ T × T × 𝔽∗ aggregation functions are denoted. They operate on a sequence or set of values of a homogeneous type, returning a value of a (maybe) different type. A value in an instance document is always a tuple (v, t) where t ∈ T and v is an opaque object at the level of the type system—an implementation level representation of a value of type t. The directed relation S ⊆ T × T between two types t1 and t2 contains the information that the given transformation function (t1, t2, ct2) ∈ F between t1 ∈ T and t2 ∈ T never fails. The reflexive transitive closure of S over all types of T is S~. The relation S is defined by the user of the general purpose type system, which is usually the creator of a new base data type. When defining a new base data type it is hence possible to define conversion functions and to inform the type system about convertibility, but it would be inconvenient to define convertibility for every base data type on the Web—especially for upcoming base types it is not reasonable. The subtype concept is related to S~, yet different—generally, the subtype concept implies that instances of a subtype have all the properties of instances of their supertypes and possibly further ones.
For the proposed base type system for Web query and schema languages, the relation S~ implies only one property—instances of a type are (for sure) convertible to another type. The rationale behind this

is that data objects are not representations of structures and functions as e.g. in programming languages, but serialisations of XML documents. As an example of the difference between S~ and the subtype relation as commonly known, consider the types of natural numbers and of strings—each natural number can be transformed into a string, but natural numbers are not a subtype of strings in many programming languages, as they do not share common semantics—strings have string operations like concatenation and tokenization, and natural numbers have arithmetic operations. As a counterpart to the S~ relation, the undirected relation D~ ⊆ T × T contains the information which type is for sure not convertible to which other type. D is the relation of types marked by the user of the generic base type system, and D~ is the reflexive transitive closure over D. Note that T × T ≠ D ∪ S, as some types have instances that can be converted to another type while not all instances are convertible to that type; these tuples are neither in S nor in D.
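The closure S~ of the declared conversion relation S can be computed with a standard transitive-closure algorithm; the sketch below uses hypothetical type names:

```python
def reflexive_transitive_closure(types, s):
    """Compute S~ from the user-declared safe-conversion relation S by
    Warshall's algorithm; (t1, t2) in the result means instances of t1
    are always convertible to t2."""
    closure = set(s) | {(t, t) for t in types}  # reflexivity
    for k in types:          # intermediate type
        for i in types:
            for j in types:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure

# Hypothetical declarations mirroring the safe conversions of section 3.4.3:
types = {"integer", "rational", "string"}
s = {("integer", "rational"), ("rational", "string")}
closure = reflexive_transitive_closure(types, s)
print(("integer", "string") in closure)  # True, via rational
```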

A pragmatic set of base types and functions

To show that the type and function framework can actually be used to extend Xcerpt reasonably, a pragmatic set of base types and functions is presented.

Figure 3.2: Base type system hierarchy (comprising string, token, NMTOKEN, NCName, language, ID, IDREF, ENTITY, integer, rational, boolean and IRI).

Figure 3.2 shows the base type hierarchy. The types are a direct copy of a branch of the hierarchy presented in the XML Schema data types recommendation [59]. Five of the arguably most important types of the XML Schema recommendation are rebuilt using the generic type system. Others are possible, but for pragmatic reasons not treated in this context.


3.4.3 Conversion functions

The following set of conversion functions between instances of the base types is proposed. They cover safe and unsafe functions. Note that all functions convert retaining syntactical equivalence: it is e.g. possible to convert the integer 12 to the rational number 12, but there is no Boolean interpretation of the integer 12. Type conversion retaining syntactical equivalence is important for Web query languages, as data of various types always gets serialized to XML character data in the end.

string : rational → string (safe conversion from rational to string)
string : boolean → string (safe conversion from boolean to string)
string : IRI → string (safe conversion from IRI to string)
rational : string → rational (conversion from string to rational)
rational : integer → rational (safe conversion from integer to rational)
integer : rational → integer (conversion from rational to integer)
boolean : string → boolean (conversion from string to boolean)
IRI : string → IRI (conversion from string to IRI)

The following set of base functions by no means claims to be exhaustive of what one may want to do with the proposed base data types. It is, however, a small collection of arguably most useful functions and operations for the given types—nothing more than a pragmatic set of functions.
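The distinction between safe and unsafe conversions can be sketched as follows; Python stands in for the implementation level, and the function names mirror, but are not, the proposed conversion functions:

```python
from fractions import Fraction

def string_of_rational(r):
    """Safe: every rational has a character-data serialisation."""
    return str(r)

def rational_of_string(s):
    """Unsafe: fails when the character data is not a rational."""
    return Fraction(s)  # raises ValueError for e.g. "twelve"

def boolean_of_string(s):
    """Unsafe: only 'true'/'false' have a Boolean interpretation --
    in particular, the serialisation of the integer 12 does not."""
    if s in ("true", "false"):
        return s == "true"
    raise ValueError(f"no boolean interpretation of {s!r}")

print(string_of_rational(Fraction(12)))  # "12"
print(boolean_of_string("true"))         # True
try:
    boolean_of_string("12")
except ValueError:
    print("conversion failed, as expected")
```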

3.4.4 Base functions

string functions

concat : string × string → string (concatenation of two strings)
substr : string × integer → string (extracts a sub string from a string)
substr : string × integer × integer → string (... with given length)
lowerCase : string → string (converts a string to lower case)
upperCase : string → string (converts a string to upper case)
invCase : string → string (inverts the case of a string)
trim : string → string (trims a string)
find : string × string → integer (finds a search string inside of a string)
find : string × string × integer → integer (... with given start position)
replace : string × string × string → string (replaces a search string inside of a string)
replace : string × string × string × integer → string (... with given start position for search)
replaceAll : string × string × string → string (... with all occurrences being replaced)
tokenize : string × string × integer → string (string tokenization)

rational functions

abs : rational → rational                 calculates the absolute value of a given number
sign : rational → rational                calculates the sign of a given number
inv : rational → rational                 calculates the (additive) inverse of a given number
add : rational × rational → rational      addition of two numbers
sub : rational × rational → rational      subtraction of two numbers
mul : rational × rational → rational      multiplication of two numbers
div : rational × rational → rational      (rational) division of two numbers
pow : rational × rational → rational      exponentiation of a number
sqrt : rational → rational                calculates the square root of a number
round : rational → integer                rounds a number (explicit, lossy conversion to integer)

integer functions

abs : integer → integer                   calculates the absolute value of a given number
sign : integer → integer                  calculates the sign of a given number
inv : integer → integer                   calculates the (additive) inverse of a given number
add : integer × integer → integer         addition of two numbers
sub : integer × integer → integer         subtraction of two numbers
mul : integer × integer → integer         multiplication of two numbers
intDiv : integer × integer → integer      integer division of two numbers
mod : integer × integer → integer         calculates the remainder of an integer division
pow : integer × integer → integer         exponentiation of a number
intSqrt : integer → integer               calculates the square root of a number

boolean functions

and : boolean × boolean → boolean         logical conjunction of two Boolean values
or : boolean × boolean → boolean          logical disjunction of two Boolean values
not : boolean → boolean                   logical negation of a Boolean value

3.4.5 The Non-Core R2G2 Constructs

Definition 3.15 ( contentMacro -)

content contentmacroname = regexp ;

Content macros are substitutions for regular expressions as used in content model specifications. A content macro is a regular expression of type names, content macro names or type terms that is associated with the content macro name on the left hand side of the macro declaration rule. Content macro names, data type names and element type names all share the same set of names, i.e. they may not conflict. As content macros are expanded statically, content macro declarations may not be circular. Note that non-circularity is a sufficient criterion for retaining the regularity property of content models, as it intuitively prevents the construction of context-free-like grammar rules with a pumping circle surrounded by pumped terminal symbols.
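The static expansion together with the circularity check can be sketched as follows; representing content models as flat sequences of names is a simplification for illustration (real content models are regular expressions):

```python
# Sketch of static content-macro expansion with a circularity check.
# A macro maps its name to a content model, here simplified to a flat
# sequence of names; names that are themselves macros are substituted
# in place, and a cycle among macros is rejected.
def expand_macros(macros: dict[str, list[str]]) -> dict[str, list[str]]:
    expanded: dict[str, list[str]] = {}

    def expand(name: str, in_progress: set[str]) -> list[str]:
        if name in in_progress:
            raise ValueError(f"circular content macro: {name}")
        if name in expanded:
            return expanded[name]
        result: list[str] = []
        for sym in macros[name]:
            if sym in macros:                  # macro reference -> expand it
                result.extend(expand(sym, in_progress | {name}))
            else:                              # ordinary type name -> keep
                result.append(sym)
        expanded[name] = result
        return result

    for name in macros:
        expand(name, set())
    return expanded
```

After expansion, macro names no longer occur in any content model, which is why they can be dropped from the schema before validation.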

Namespaces From a practical point of view, a very important property of XML is namespace handling. The purpose of namespace handling is to distinguish equal names by partitioning the set of labels into sets that are likely to be named in a globally unique way. Namespace handling is realized as an extension of the naming scheme—label and attribute names are tuples of a so-called local name part and a namespace part. URIs are used as namespaces, as they arguably fulfill the necessity of uniqueness by definition. As URIs tend to be long strings, cumbersome to type repeatedly, a substitution mechanism called namespace prefixing is used. The user defines a short, yet unique (in its scope, e.g. the grammar) namespace prefix for a namespace and uses it whenever the namespace is to be referenced; the system resolves the prefix to the expected namespace URI. Namespaces have to be reflected when modelling schemata—labels of elements or attributes have to be composed of a local and a namespace part, as instance documents may need namespaces. In addition to label and attribute names, R2G2 contains type names, which may also need partitioning into different “name spaces”3, as name conflicts can occur between type names of different

3“Name space” is quoted here, as it refers to the concept of name spaces, not to XML name spaces specially.

grammars that may interact when modelling modular grammars or grammar fragments. Name conflicts for type names are not handled using the XML name space mechanism, but by a module system specialized for Web rule languages, presented later (see Chapter 4).

Definition 3.16 ( nsPrefixDeclaration -)

namespace prefix = URI ;

Definition 3.17 ( prefix -)

label

Definition 3.18 ( defaultNs -)

defaultnamespace URI ;

A namespace prefix declaration associates a URI with a prefix. Prefixes are (like type names) tokens similar to labels. The set of prefix definitions and the set of type definitions are not shared, i.e. a prefix and a type name may be identical.4 A default namespace can be declared, which applies to all definitions—element labels or type names—without a namespace prefix. By default, the default namespace is empty.
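Prefix resolution as described above can be sketched as follows; the function name and the tuple representation of expanded names are assumptions for illustration:

```python
# Sketch of namespace-prefix resolution: a possibly prefixed name is
# expanded to a (namespace URI, local name) pair; unprefixed names fall
# into the default namespace, which is empty by default.
def resolve(name: str,
            prefixes: dict[str, str],
            default_ns: str = "") -> tuple[str, str]:
    if ":" in name:
        prefix, local = name.split(":", 1)
        return (prefixes[prefix], local)   # KeyError on undeclared prefix
    return (default_ns, name)
```

For example, `r2g2:ancestor-id` resolves to the pair of the declared r2g2 namespace URI and the local name `ancestor-id`.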

3.5 XML Serialisation of R2G2 Valid Data Terms

When modelling graph structured data using R2G2, the typed reference mechanism is able to conceptually reflect a graph structure, yet the XML serialisation cannot directly reflect this conceptual graph without further help. Modelling of graph references and identifiers leaves unspecified how to reflect the information, e.g., in ID and ID-Ref based node identities in data instances: how ID or ID-Ref attributes have to be named and especially how references have to be de-referenced. As ID-Ref instances are attribute values, they cannot be replaced by the referred elements, since attributes may not contain structured information. The approach chosen to solve this problem is to specify the structural relationship between the references, respectively identifiers, and their de-referencing, respectively the identified constructs. R2G2 provides some special built-in constructs that represent serialisations of identifiers or references of a so-called context. The context of an identifier construct is either the document order or the nesting relation of an identifiable element type or a reference type; more concretely:

4ideally the sentence would be “Prefix definitions and type names do not share the same namespace”—the namespace meant here is not an XML namespace, but the definition space of names in R2G2.


• An identifier serialisation type can occur as a descendant of an identifiable element type. No other identifiable element type may occur on the path between the identifier serialisation location and the identifiable element.

• An identifier serialisation type can occur in document order before or after an identifiable element type. No other identifiable element type may occur in document order between the identifier serialisation location and the identifiable element. Solely for easier comprehension, the identifier serialisation location and the identifiable element must occur in the same content model definition, i.e. they may not be spread over different element definition rules5.

• A reference serialisation type can occur before or after a typed reference location in document order of an instance of a type. Multiple instances of the same reference type can occur aggregated using the Kleene star or plus operator in a regular expression, if the corresponding reference type serialisation is also multiplied using the same regular expression operator. This makes it possible, e.g., to model an ID-refs attribute with a list of references, all reflecting objects to be linked in the content of the corresponding element.

The presented approach does not claim to completely solve the problem for all structural possibilities of a relationship between the concept and the serialisation of identifiers and references. It is a proposal of a pragmatic approach to the problem: it is needed for the practical usability of R2G2, and some useful cases can be modelled using this technique. The serialisation of the conceptual graph structure is considered to be of minor importance in the context of this thesis, yet worthwhile further investigation.

3.5.1 Examples and Explanations of (De-)Reference Serialisation

Consider the following example of how to specify reference and de-reference serialisation types, followed by explanations.

Serialisation of Referable Objects

namespace r2g2 = http://pms.ifi.lmu.de/ns/r2g2/v1.0/ ;
element Book = @book[ Authors , Title , key[ r2g2:ancestor-id ] ] ;
element Authors = .... ;
element Title = .... ;

Code Example 29 Using the ancestor-id type to locate an identifier in an XML serialisation of a referable book type declaration.

Example 29 models an object that represents a book as it may be found in a bibliographic database. The book element is intended to be referable, hence indexed. The index identifying the book is contained as a unique identifier text node in an element called key that is a direct child of the book element. This is done by using the special type name ancestor-id in the namespace http://pms.ifi.lmu.de/ns/r2g2/v1.0/, which serves to generate a textual value that is assured to be a valid identifier of the nearest ancestor node that is tagged as referable (using the @ prefix). Hence the ancestor-id has to be a descendant of the indexed book element type term. The following instance document is valid with respect to the schema in example 29; it can hence be parsed and transformed with an appropriate R2G2-aware graph builder into a conforming conceptual graph, and a conceptual graph conforming to the former schema can likewise be serialized to that instance document.
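The resolution of an ancestor-id to the nearest referable ancestor can be sketched as follows; the list representation of the ancestor chain is an assumption for illustration:

```python
# Sketch of ancestor-id resolution: walk up from the identifier's location
# and pick the label of the nearest ancestor tagged as referable (the @
# prefix in the schema). The ancestor chain is given innermost-first.
def nearest_referable(ancestors: list[tuple[str, bool]]) -> str:
    """ancestors lists (label, referable) pairs from the identifier's
    parent up to the root; return the first referable one."""
    for label, referable in ancestors:
        if referable:
            return label
    raise ValueError("no referable ancestor for identifier")
```

In example 29, an identifier inside key would resolve to the enclosing book element, since book is the nearest ancestor marked with @.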

5This should prevent the use of identifier serialisation types out of the scope of an identifiable element by mistake.


......            id1@book[
...                 authors[...],
...                 title[...],
id1                 key["id1"],
......            ]

Code Example 30 An XML instance, given on the left, that is a serialisation of a conceptual graph, given on the right in Xcerpt data term syntax; both valid with respect to the schema in example 29.

Example 31 models a kind of technical report where sections are referable. In the serialisations, the identifier of a referable section is given as a preceding node, more precisely as a named anchor in the spirit of HTML.

namespace r2g2 = http://pms.ifi.lmu.de/ns/r2g2/v1.0/ ;
element TR = technical-report[ (Title, Section)* ];
element Title = title[ A, r2g2:any-text ];
element Section = @section[ (r2g2:any-text|(Title,SubSection))+ ];
element SubSection = @sub-section[ r2g2:any-text ];
element A = a(name=r2g2:previous-id)[ ];

Code Example 31 Using the previous-id type to locate an identifier in an XML serialisation of a referable section declaration.

The predefined type previous-id is an identifier serialisation that occurs before the referable object, in the example's case the section or subsection. If the identifier is to occur after the referable object, the predefined type following-id can be used instead. The following instance document, example 32, is valid with respect to the schema in example 31.

                                  technical-report[
<a name="s1"/>First Section         title[ a(name="s1")[], "First Section" ],
                                    s1@section[
Some content ...                      "Some content ...",
<a name="ss1"/>A Subsection           title[ a(name="ss1")[], "A Subsection" ],
Some sub content...                   ss1@subsection[ "Some sub content..." ]
                                    ],
<a name="s2"/>Second Section        title[ a(name="s2")[], "Second Section" ],
Content in 2nd section.             s2@section[ "Content in 2nd section." ]
                                  ]

Code Example 32 An XML instance, given on the left, that is a serialisation of a conceptual graph, given on the right in Xcerpt data term syntax; both valid with respect to the schema in example 31.

Serialisation of References

The book example (example 29) is extended in example 33 such that the referable books are referenced in elements representing author information and vice versa.


namespace r2g2 = http://pms.ifi.lmu.de/ns/r2g2/v1.0/ ;
element Book = @book[ Authors , Title , key[ r2g2:ancestor-id ] ] ;
element Author = @author[ Name, Books ] ;
element Authors = authors[ (key[r2g2:following-ref], ˆAuthor)* ];
element Books = books[ (ˆBook, key[r2g2:previous-ref])* ];
element Title = title[r2g2:any-text] ;
element Name = name[r2g2:any-text] ;

Code Example 33 A schema for books and authors, e.g. in a bibliography database, that may refer to each other, with the reference serialisation types r2g2:previous-ref and r2g2:following-ref.

While identifier serialisations may be descendant nodes of the identified elements, references are atomic, i.e. have no structure, and hence may not contain content representing their serialisation. Therefore, references always have their serialisations before or after their occurrence in document order. The example schema 33 provides the reference serialisations of elements of type Author just before the reference, in a key-element containing the identifier; the references of type Book are followed by similar key elements, describing the serialisation of those Book-typed elements.

...
b1@book[
  authors[ key["a1"] , ˆa1 , key["a2"] , ˆa2 ],
  title["..."],
  key["b1"]
]
...

Code Example 34 An example instance of a bibliographic database conforming to example 33.

Some regular expressions involving references and reference serialisations may however introduce problems: if, between the reference and its serialisation, a type with non-empty intersection with the reference serialisation occurs, the interpretation of the serialisation may become ambiguous, especially if the intermediate type is optional or repeated. To prevent such problems, the sequence of types in document order between the reference and its serialisation must have empty intersection with the reference serialisation type.
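This condition can be sketched as a check over a flattened content model; representing the model as a list of type names and testing name equality instead of language intersection is a simplification for illustration:

```python
# Sketch of the ambiguity condition: between a reference and its
# serialisation, no type overlapping the serialisation type may occur.
# Intersection of type languages is simplified to exact name equality.
def serialisation_unambiguous(content: list[str], ref: str, ser: str) -> bool:
    """content is a flattened content model in document order; ref is the
    reference type, ser its serialisation type (first occurrences used)."""
    i, j = content.index(ser), content.index(ref)
    lo, hi = sorted((i, j))
    return ser not in content[lo + 1 : hi]
```

An optional or repeated occurrence of the serialisation type between the two positions would make it unclear which occurrence carries the identifier, which is exactly what the check rejects.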

In DTDs, XML Schema and Relax NG there exist kinds of serialisations that all correspond to the DTD concept called IDREFS attribute value—an attribute with a sequence of references. As attribute values may not represent structured values, the need arises of anchoring a sequence of references in the context of a sequence of its serialisations. This obviously breaks the just stated condition of empty intersection for intermediate types between a reference and its serialisation—a sequence of serialisations brings ambiguity problems for any except the last element in the sequence of serialisations. An extra condition exists for this purpose: references may occur in an uninterrupted sequence, i.e. in abstract syntax as child of a Kleene star or plus construct, if reference serialisation types occur as a repetition sequence in the scope of exactly the same regular repetition construct, without ambiguity constraint violation in between. As an example, consider the following modification of a bibliography database similar to example 33, but with the authors of a book now located in a sequence of reference values in an appropriate authorrefs-attribute.


namespace r2g2 = http://pms.ifi.lmu.de/ns/r2g2/v1.0/ ;
element Book = @book( authorrefs=(r2g2:following-ref)* )[ ˆAuthor* , Title , key[ r2g2:ancestor-id ] ];
element Author = @author[ Name, Books ] ;
element Books = books[ (ˆBook, key[r2g2:previous-ref])* ];
element Title = title[r2g2:any-text] ;
element Name = name[r2g2:any-text] ;

Code Example 35 A schema for books and authors, e.g. in a bibliography database, that may refer to each other, with reference serialisation types; references may occur in a sequence as long as their serialisations also occur in a sequence, e.g. using the r2g2:following-ref type in a sequence.

...
b1@book(authorrefs="a1 a2")[
  ˆa1, ˆa2,
  title["..."],
  key["b1"]
]
...

Code Example 36 An example instance of a bibliographic database conforming to example 35.

3.6 Semantics of R2G2

A declarative semantics for R2G2 is presented now, mostly in the spirit of the acceptance relation of sheaves automata as presented in [35]—the acceptance of a data term under a schema is expressed. For implementation purposes and for later type checking, the declarative semantics is arguably too vague, so an operational semantics is also given later, as a semantics on the automata that are considered to be the implementation base of R2G2. Complexity estimations are also more conveniently done at the level of the operational semantics. The declarative semantics is, however, arguably easier to comprehend, as it relates more clearly to data and schema. The declarative semantics of R2G2 is given as a relation Γ,R ⊢ d : τ, meaning that an instance or a list of instances of data d satisfies a regular expression τ6 under the schema declarations in Γ, where Γ is the set of all element rules. The content macros are already expanded at that stage and are not part of the schema anymore. The root declarations, more precisely the type names of the root declarations, form the set R. Γ will also be referred to as environment, as is common in the literature about typing, e.g. in [41] and [35]. As the environment is not altered during the process of checking acceptance of a term, the usual extension operation on environments as in [35] is omitted here. The empty environment is denoted ∅; the empty environment is well formed. A type name or variable is uniquely associated with one type declaration, i.e. for element X1 = d1 and element X2 = d2, if X1 = X2 then also d1 = d2 in a well formed environment or schema. In contrast to [35], the environment is considered to be fixed: the schema is the environment. In [35] the name-schema associations are read out of the schema with an extra rule and introduced into the environment.

The document (root) must be accepted by a root type The possible types for a document, or better, for the root of a document, are declared as root types in the R2G2 grammar and are therefore in the set R. The document type is denoted by □; it is not really a type as modelled in R2G2, just as document( . ) is not a construct of the data tree. Document and document type are external prerequisites to match a tree with a root type, to bootstrap the relationship. The notion of element types is recursively defined; the root type gives an entry point for the recursion—this is

6τ is a regular expression of the domain of types

in accordance with the recursive definition of XML data as defined by the W3C, e.g. in the Infoset [56] or in the Document Object Model (short DOM) [60].

X ∈ R        Γ,R ⊢ t : X
――――――――――――――――――――――――― (ROOTTYPE)
Γ,R ⊢ document( t ) : □

Types with ordered content model match nodes with ordered content The type of a node has the same name as its type declaration term, represented by the atomic regular expression consisting of only that type, and the content can be typed using the declaration term's content regular expression.

Γ,R ⊢ t1, …, tn : τ
――――――――――――――――――――――――― (ORDEREDNODETYPING)
Γ,R ⊢ l[t1, …, tn] : l[τ]

element X = τ ∈ Γ        Γ,R ⊢ t1, …, tn : τ
――――――――――――――――――――――――――――――――――――――――――― (NAMEDTYPES)
Γ,R ⊢ l[t1, …, tn] : X

A data tree with named type has the type declared by the type term that is associated with the element declaration rule for the corresponding type name.

Optional Content An optional regular expression may be matched by no node or node set; that means, if no node (or node set) matches that regular expression, it can be ignored. The rule is complementary to the second optional rule, which states that an optional regular expression may match.

――――――――――――――――――― (OPTIONAL1)
Γ,R ⊢ ε : l[re]?

Γ,R ⊢ t : l[re]
――――――――――――――――――― (OPTIONAL2)
Γ,R ⊢ t : l[re]?

Sequence types and node sequences A sequence of nodes or sub sequences can be matched by a sequence of regular expressions, if each node or sub sequence is matched by a (positionally) corresponding regular expression in the regular expression type sequence.

Γ,R ⊢ t1, …, tn−1 : τ1        Γ,R ⊢ tn, …, tn+m : τ2
――――――――――――――――――――――――――――――――――――――――――――――――――― (SEQUENCE)
Γ,R ⊢ t1, …, tn−1, tn, …, tn+m : τ1, τ2

Type disjunctions A node or node sequence can be matched by a disjunctive regular expression, if any of the regular expressions matches the node or sequence.

Γ,R ⊢ t : τ1
――――――――――――――――― (DISJUNCTION1)
Γ,R ⊢ t : τ1|τ2

Γ,R ⊢ t : τ2
――――――――――――――――― (DISJUNCTION2)
Γ,R ⊢ t : τ1|τ2


Kleene star An arbitrarily repeatable regular expression—an expression in the scope of a Kleene star—may match as many nodes or sub-sequences in a sequence of nodes as possible, without interruption of the sequence by any other data.

Γ,R ⊢ t1, …, tn : τ        Γ,R ⊢ tn+1, …, tn+n : τ        ⋯        Γ,R ⊢ tn×m+1, …, tn×m+n : τ
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― (KLEENESTAR)
Γ,R ⊢ t1, …, tn, tn+1, …, tn+n, …, tn×m+1, …, tn×m+n : τ∗

Nodes with unordered content are matched by types with unordered content models Unordered content models are modelled as regular expressions that have to match some permutation of a given content sequence. Mainly this corresponds to an associative-commutative interpretation of the sequencing operator in regular expressions, as well as of the list operator in data instances. The semantics rule uses the auxiliary function permutations(d), which returns the set of all permutations of a sequence of nodes d.

t′1, …, t′n ∈ permutations(t1, …, tn)        Γ,R ⊢ t′1, …, t′n : τ
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― (UNORDEREDNODETYPING)
Γ,R ⊢ l{t1, …, tn} : l{τ}
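The UNORDEREDNODETYPING rule can be sketched operationally as follows; representing content models as ordinary regular expressions over single-letter type names, and enumerating permutations naively, are simplifications for illustration (the thesis instead uses counting constraints for efficiency):

```python
import re
from itertools import permutations

# Sketch of the rule above: a content sequence matches an unordered
# content model iff SOME permutation of it matches the ordered model.
# Type names are single letters; the model is a Python regular expression.
def matches_unordered(content: str, model: str) -> bool:
    return any(re.fullmatch(model, "".join(p)) is not None
               for p in permutations(content))
```

For the model `ab*` the content `ba` matches (permutation `ab` does), while `bb` does not, since no permutation starts with an `a`.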

Matching nodes with ordered content by types with ordered content specified in an unordered way Ordered node content can be matched by unordered specified ordered content the same way as unordered node content.

Γ,R ⊢ l{t1, …, tn} : l{τ}
――――――――――――――――――――――――――― (ORDEREDUNORDEREDSPECNODETYPING)
Γ,R ⊢ l[t1, …, tn] : l[{τ}]

Matching leaf nodes with basic data types Leaf data is matched by the basic data types, if the super type matches the data and if the constraints are also fulfilled. The matching of the basic data types is defined in applications of the generic basic data type mechanism as introduced in section 3.4.2, hence not further treated here. Note that some typing may also involve type coercion, e.g. each textual leaf node is per se a string, but due to numerical typing it may be rescanned and coerced to basic data of another type. However, coercion is an application dependent topic and not a matter at the level of the schema declaration language. The following rule just states the subtype condition as used for R2G2 and the delegation of basic type checking to the level where the basic data types and a notion of their language (denoted as L( . )) are defined.

4 From a Generic Module System to a Module System for R2G2

When many developers work on large scale grammars, or if they want to reuse given languages such as HTML, all the work can conceptually be merged together into one large grammar. Such large grammars are difficult to maintain, as type names are globally scoped over the whole grammar. Name conflicts are likely to occur and there is no separation of concern. To overcome the mess arising from large scale monolithic applications, modules for R2G2 are introduced now. While conceiving the module system for R2G2 it turned out that the module system in mind was more general and applicable to many rule languages with only marginal parametrization. As Xcerpt, the rule based query language to be extended with static typing using R2G2, did not have a module system either, a general purpose module system has been developed and applied in two use cases, (1) to Datalog [8] and (2) to Xcerpt [7]. In this chapter, the general module system, as presented in [8] and [7], is briefly introduced and applied to R2G2.

4.1 The Purpose of Modules

From a practical point of view modules are very important for various applications, especially when the data to be managed for those applications gets complex. Typical candidates for applications that need module support are query, programming or schema languages, as well as large scale CAD programs or even ontologies. In the following, modules will only be considered for artificial languages like query or schema languages, yet generalizing to arbitrary applications in need of modules mostly requires no more than a replacement of language-specific terms in the following paragraphs. A module is a unit of language terms, e.g. a part of a query or a part of a schema. A module does not necessarily have to be a usable instance of an application of that language; it is often a building block, library or skeleton for other applications. A module is itself a named entity, yet prone to (rare, maybe even intended) name conflicts. The purpose of modules is to

1. provide distinct name spaces for user defined names or symbols in terms of the language,

2. hide certain information or complexity in the form of some names or symbols in the application,


3. provide an easy way to integrate different units or building blocks of an application with the goal of providing potential for reuse of code.

In the context of the Web, mhtml [32] has been proposed as a module system for HTML and XHTML. The idea of modularization in HTML helps factoring HTML pages into reusable units in the spirit of page templates and snippets. Using mhtml means (1) declaring modules—usually HTML documents or parts of them—and (2) combining them using the import clause referring to a module in the including module. To realize non-trivial real world scenarios, parametrization is also needed: modules can be parametrized with named parameters that can be instantiated with other modules, which are imported into the parametrized module. The approach can be compared with parametrized macro expansion as used, e.g., by the C programming language preprocessor. An implementation of mhtml exists using the TOM1 pattern matching compiler developed at INRIA2.

4.2 Modules and XML Name Spaces

Pragmatically, the XML Name Space mechanism has a lot in common with module systems. The purpose of name space handling is to distinguish equal names by partitioning the set of labels into sets that are likely to be named in a globally unique way. Name spaces in XML are realized as an extension of the naming scheme—label and attribute names are combinations of a so-called local name part and a namespace part. Concluding, name spaces are identified again using names, delegating the name conflict problem to a higher level; but as one name space usually groups many (local) names, the “name space” name space is arguably less populated, and name space identifiers may be chosen in a way less prone to ambiguity conflicts—URIs are chosen as name space identifiers. As name spaces are URIs, and may therefore be very long strings, a substitution mechanism called namespace prefixes is used. Conceptually, the name space and the label name—also called local name—can be merged into one name by concatenating the namespace (after namespace prefix expansion) and the local name. Name spaces have to be reflected when querying, constructing or schematizing data—labels of elements or attributes have to be composed of a local and a namespace part. To compare module systems and XML Name Spaces, the three principles of module systems are recalled here: a module system (1) provides distinct name spaces for user defined names or symbols in terms of the language, (2) hides complexity, and (3) provides means for integration of components. The common property of XML Namespaces and modules is the distinction of name spaces, point (1) in the former enumeration, in the sense of preventing name conflicts. Arguably, to some extent point (3), integration of building blocks, is partly common to modularization and XML Namespaces—XML name spaces facilitate the combination of various XML languages in one document instance, modules facilitate the integration of different instance applications of a language.
The module principle (2), though, has no correspondence in XML Namespaces—hiding information in data is arguably not a reasonable property at the level of the data structure formalization, which XML is. However, visibility of data portions may be of interest for applications involving XML technology, like e.g. an XML database management system, but this is already at an application level from the “point of view” of XML. (XML related) languages or applications in need of name spaces hence may have two sorts of user definable names that have to be distinguished—names at the level of (XML) data (like element and attribute names) and names at the level of the language or application, like variable, method, function, predicate or type names and so on. To a certain extent this distinction of modules and name spaces is weakened by the HTML module system mhtml mentioned earlier, where there is no distinction between data and program symbols and no visibility rule—it is not relevant to distinguish symbols in HTML, as there

1see http://tom.loria.fr/ for the official project page of TOM. 2see http://www.inria.fr/ for the homepage of INRIA.

is no potential naming conflict in HTML data—a module is a component of hypertext which is to be used at a given place.

4.3 Modular R2G2

A module extension to R2G2 is proposed. The purpose of the module system for R2G2 is the proper partitioning and integration of type names.

4.3.1 Syntax of Modular R2G2

The syntax is given as a set of EBNF grammar rules extending the current grammar rule set as presented in section 3.4 on page 51 ff. Note that the former top-level non-terminal grammar is no longer the top level; module takes the role of the top-level non-terminal now.

module ← grammar

A module may be used as a grammar used to be; it is then not identified as a module and is not importable into any other module.

module ← visibility? “module” identifier parameter∗ use∗ grammar

visibility ← “public” | “private”

identifier ← URI

parameter ← “parameter” identifier

use ← “use” identifier

use ← “use” identifier “as” prefix

prefix ← label

A module is usually associated with a module identifier; a URI is used as module identifier. The visibility given in the module declaration is the default visibility of the types defined in the module; the visibility may be public or private, and private types are not visible outside of the module. An invisible type cannot be used in content models outside of the module defining it; it can also be called a local type. An invisible type can later not be used for annotation in a typed Xcerpt program by the programmer, yet the system is able to infer it internally for the purpose of type checking. Modules can be imported with the use-clause; they can also be imported in a qualified manner, in which case the visible types are imported in a qualified manner. Qualified types can be used in the scope of the importing module by prefixing them with the qualifier; it is up to the module author to choose a conflict-free prefix (i.e. a prefix not occurring as prefix in other types or non-terminals declared in the current module). Imported types, no matter whether qualified or unqualified, are not reexported and are hence private in the importing module. Parameters are used to declare prefixes that are not bound to an imported module at the time of writing the parametrized module, but when the parametrized module itself is imported. A parametric module hence exposes an interface of types that the user of a module can inject into the used module by parametrizing it. As an example, consider a container format representing a website consisting of a hierarchy of document containers containing just documents of a user-selectable type.
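Qualified import as described above can be sketched as follows; the dictionary representation of a module's type definitions and visibilities is an assumption for illustration:

```python
# Sketch of qualified import: the visible (public) types of the imported
# module become available under the chosen prefix in the importing module;
# private types stay hidden, and nothing is reexported further.
def qualified_import(module: dict[str, tuple[str, bool]],
                     prefix: str) -> dict[str, str]:
    """module maps type name -> (definition, is_public); returns the
    importer's view with visible names qualified by the prefix."""
    return {prefix + name: defn
            for name, (defn, public) in module.items() if public}
```

An unqualified import is the special case of an empty prefix, where the module author must ensure the imported names do not clash with locally declared ones.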


module "WEBSITE-CONTAINER" parameter "DOCUMENTDEF-";

element Container = container(name=String)[ ContainerContent ];
element ContainerContent = ((Container|Document)*);
element Document = document(name=String)[ DOCUMENTDEF-DOCUMENT ];

Code Example 37 This is a parametric module modelling a website of homogeneous documents. The container may contain either (recursively) other containers or documents of the type DOCUMENT, which has to be declared in a module passed in for the parameter DOCUMENTDEF-.

The website container format (see example 37) is parametric in the type of the documents it may contain. The type of the payload document will be bound to the module prefix DOCUMENTDEF-. The interface of the module parameter is given by the types used from DOCUMENTDEF-; in the case of example 37 this is just DOCUMENT. Note that the author of a module is advised to document the interface properly; there is no formal documentation mechanism. A module using the website container (see example 38) binds the parameter to a schema declaring a type called DOCUMENT.

module "MY-WEBSITE"
use "WEBSITE-CONTAINER" where "DOCUMENTDEF-"="HTMLDEF";

root MySite;
element MySite = website(baseurl="http://example.com/mySite/")[ Container ];

Code Example 38 This module models a website using the parametric container model of example 37. The parameter is set to the module HTMLDEF defined in example 39, hence modelling a website consisting of a container with solely XHTML documents.

module "HTMLDEF"
use "XHTML";

content DOCUMENT = (HTML);

Code Example 39 This module is created for parametrization of the module in example 37. It is used in example 38.

rule ← visibility? "element " typename " *= *" typeterm " *;?[ \n\t]*"
rule ← visibility? "type " typename " *= *" typeterm " *;?[ \n\t]*"

The rule production as defined earlier is extended by a visibility that may optionally be prefixed to a type declaration rule. If no visibility is prefixed, the default visibility as declared in the module rule applies.
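The extended rule production can be exercised with an ordinary regular expression. The following Python sketch is illustrative only — the regex and its group names are assumptions, not part of any R2G2 implementation — but it shows how the optional visibility prefix composes with the element/type rule forms:

```python
import re

# A sketch of the extended rule production: an optional visibility
# ("public" or "private") prefixed to an element or type declaration.
rule_re = re.compile(
    r"(?P<visibility>public|private)?\s*"   # optional visibility prefix
    r"(?P<kind>element|type)\s+"            # rule kind
    r"(?P<typename>\w[\w.-]*)\s*=\s*"       # declared type name
    r"(?P<typeterm>[^;]+?)\s*;?\s*$")       # type term, optional ';'

m = rule_re.fullmatch("private element C = ccc[ ];")
```

A rule without annotation simply leaves the `visibility` group unmatched, so the module's default visibility can be applied afterwards.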

4.4 Realizing the Module System using the “Divide and Rule” approach

The generic module approach presented in [8] is based on a reduction semantics for module operators, i.e. modular rule language programs are translated into non-modular programs. A necessary requirement on the rule language to be modularized is the notion of some kind of chaining or rule dependency. In the case of R2G2 this is given by the dependency between right- and left hand side occurrences of non-terminal symbols or type names. The term rule dependency as used in [8] is defined as follows:

Definition 4.1 (Rule dependency) With a program P, a (necessarily finite) relation ∆ ⊆ N² × N² can be associated such that (r1, b, r2, h) ∈ ∆ iff the condition expressed by the b-th body part of rule r1 is dependent on results of the h-th head part of the r2-th rule in P (such as derived data, actions taken, state changes, or triggered events), i.e., it may be affected by that head part's evaluation.

When applied to R2G2, there is just one rule part in the head—the type defined by the current rule, more precisely the type name. On the right hand side, the parts are the occurrences of type names. Using the rule dependency quadruple, the rule parts are mapped to natural numbers; applied to R2G2 this could be according to the sequential order of the occurrences of the type names. Example 40 shows an R2G2 'program' consisting of three rules. Assuming that the rules as well as the body parts are counted in sequential order, starting with index 1, the corresponding rule dependency quadruple set could be as follows:

∆1 = {(r1, 1, r2, 1), (r2, 1, r3, 1), (r2, 2, r3, 1), (r3, 1, r3, 1)}

The first rule, denoted r1, has B as the first (and only) type on its right hand side, and B is declared in the second rule (i.e. r2). As all rules declare exactly one type and hence have one 'part' (i.e. non-terminal or type name) on the left hand side, the 4th component of the quadruples is always 1 for R2G2 rules. The rule r2 has two parts on its right hand side (in this case both depending on r3); this is reflected by two quadruples modelling the dependency of r2. The third rule is recursive, which is reflected by the fact that the modelled dependency is a self-dependency.

The dependency presented in the last paragraph is a precise dependency, reflecting exactly which rule depends on which other rules. This, however, requires precise knowledge of the semantics of the rule language, which is not strictly required by the generic module system. Assuming that example 40 is a module, another possible dependency, reflecting the fact that all rules in a module may depend on each other, would be

∆2 = {(ri, j, rk, 1) | (i, j) ∈ {(1, 1), (2, 1), (2, 2), (3, 1)} ∧ 1 ≤ k ≤ 3}

∆2 is the kind of dependency used further on for the module system.

element A = a[ B ];
element B = b[ C,C+ ];
element C = c[ cc[ C ]* ]

Code Example 40 A simple R2G2 grammar.
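Both dependency relations can be derived mechanically from the grammar. The following Python sketch is illustrative (the rule representation and names are assumptions, not R2G2 tooling): it computes the precise relation ∆1 from the type-name occurrences in example 40, and the coarse all-pairs relation ∆2:

```python
# Rules of example 40 as (head type name, [body type-name occurrences]).
rules = [
    ("A", ["B"]),        # element A = a[ B ];
    ("B", ["C", "C"]),   # element B = b[ C,C+ ];   (two occurrences of C)
    ("C", ["C"]),        # element C = c[ cc[ C ]* ] (recursive)
]

# Map each type name to the 1-based index of its declaring rule.
head_of = {head: i for i, (head, _) in enumerate(rules, start=1)}

# Precise dependency: the b-th body occurrence in rule r1 depends on the
# (single, hence index 1) head of the rule declaring that type name.
delta1 = {(r1, b, head_of[t], 1)
          for r1, (_, body) in enumerate(rules, start=1)
          for b, t in enumerate(body, start=1)}

# Coarse dependency: every body part may depend on every rule head.
parts = {(r1, b)
         for r1, (_, body) in enumerate(rules, start=1)
         for b, _t in enumerate(body, start=1)}
delta2 = {(r1, b, r2, 1) for (r1, b) in parts
          for r2 in range(1, len(rules) + 1)}
```

The precise relation is a subset of the coarse one; the coarse relation is what the module system uses, since it needs no knowledge of the rule language's semantics.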

A module in [8] is defined as follows:

Definition 4.2 (Module) A module M is a triple (Rpriv, Rpub, ∆) ∈ R × R × P(N² × N²), where R is the set of all finite sequences over the set of permissible rules for a given rule language. We call Rpriv the private and Rpub the public rules of M, and ∆ the dependency relation for M. For the purpose of numbering rules,³ we consider R = Rpriv ⧺ Rpub the sequence of all rules in M.

The application to R2G2 is straightforward—public rules of an R2G2 module belong to Rpub, private ones to Rpriv. If a module is declared as a public module, all rules without visibility annotation or with a public visibility annotation are public; only rules with a private visibility annotation are private. If a module is declared as a private module, all rules without visibility annotation or with a private visibility annotation are private; only rules with a public visibility annotation are public. The dependency relation, without consideration of used (or imported) modules, is complete with respect to the rules of the module, i.e. each part on a right hand side of the module depends on all rules (more precisely, on their heads) in the module.

The use statements are treated by the so-called scoped import, the one and only module composition operator in [8].

Definition 4.3 (Scope) Let M = (Rpriv, Rpub, ∆) be a module (or program if Rpub is the empty sequence). Then a set of body parts from M is called a scope in M. More precisely, a scope S in M is a set of pairs from N² such that each pair identifies one rule and one body part within that rule.
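The visibility resolution described above amounts to a simple default rule. A minimal sketch in Python (the function and parameter names are illustrative assumptions):

```python
# A rule's effective visibility: an explicit annotation wins; otherwise
# the module's declared default visibility applies.
def is_public(module_default, rule_visibility):
    """module_default: 'public' or 'private';
    rule_visibility: 'public', 'private', or None (no annotation)."""
    if rule_visibility is not None:
        return rule_visibility == "public"
    return module_default == "public"
```

For instance, in a public module an unannotated rule is public, while in a private module only rules explicitly annotated public are public.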

³ ⧺ denotes sequence concatenation, i.e., s1 ⧺ s2 returns the sequence starting with the elements of s1 followed by the elements of s2, preserving the order of each sequence.


Definition 4.4 (Scoped import ∘)
Let M′ = (R′priv, R′pub, ∆′) and M″ = (R″priv, R″pub, ∆″) be two modules and S a scope in M′. Then

M′ ∘S M″ := (Rpriv = R′priv ⧺ R″priv ⧺ R″pub, Rpub = R′pub, ∆′slided ∪ ∆″slided ∪ ∆inter), where

• ∆′slided = slide(∆′, |R′priv|, |R′pub|, |R′priv| + |R″priv| + |R″pub|) is the dependency relation of the importing module M′ with the public rules slid to the very end of the rule sequence of the new module (i.e., after the rules from M″),

• ∆″slided = slide(∆″, 0, |R″priv| + |R″pub|, |R′priv|) is the dependency relation of the imported module M″ with all its rules slid in between the private and the public rules of the importing module (they have to be "between" because they are part of the private rules of the new module),

• ∆inter = {(r1, b, r2, h) : (r1, b) ∈ S ∧ ∃ a rule in Rpriv with index r2 and head part h such that r2 > |R′priv| + |R″priv|} is the set of inter-dependencies between rules from the importing and rules from the imported module. We simply allow each body part in the scope to depend on all the public rules of the imported module. This suffices for our purpose, but we could also choose a more precise dependency relation (e.g., by testing whether a body part can at all match with a head part).

Here slide is an operator used to renumber rules. The 'sliding' of the rule numbering is a purely technical tool used to restore a unique numbering and to fix up the dependencies, which becomes necessary when concatenating sequences of rules. Given a dependency relation ∆, slide computes a new dependency relation by moving all rules in the slide window W = [s + 1, s + length] in such a way that the window afterwards starts at snew + 1:

slide(∆, s, length, snew) = {(r1′, b, r2′, h) : (r1, b, r2, h) ∈ ∆,
    where r1′ = snew + (r1 − s) if r1 ∈ W and r1′ = r1 otherwise,
    and r2′ = snew + (r2 − s) if r2 ∈ W and r2′ = r2 otherwise}
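A minimal sketch of the slide operator in Python. The exact index conventions are an assumption here: the window is taken to be W = [s+1, s+length], and rules in the window are moved so that the window afterwards starts at snew + 1:

```python
def slide(delta, s, length, s_new):
    """Renumber rule indices in a dependency relation: rules in the
    window W = [s+1, s+length] are moved so that the window afterwards
    starts at s_new + 1; all other indices are left untouched."""
    def move(r):
        return s_new + (r - s) if s + 1 <= r <= s + length else r
    return {(move(r1), b, move(r2), h) for (r1, b, r2, h) in delta}
```

For example, sliding a two-rule module past two preceding rules, `slide(delta, 0, 2, 2)`, shifts every rule index by 2.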

After unwinding all the formal definitions, importing a module is a mere concatenation of the rule sequences and a union of the rule dependencies, after a proper re-indexing of the rules to prevent conflicts. Further, for a scoped import ∘S, the resulting dependency relation is extended by new dependencies derived from the information found in S, such that the addressed rule parts of the importing rule set get dependencies to all the public rule heads declared in the imported rule set.

public module "MAIN-MODULE"
use "SUB-MODULE"

element A = a[ B , C ];
element C = c[ ]

Code Example 41 A simple R2G2 module that uses another module called ‘SUB-MODULE’.

public module "SUB-MODULE"

element B = b[ C ];
private element C = ccc[ ];

Code Example 42 A simple R2G2 module with one private rule and one public (by default) rule.

As an example consider the two modules presented in the examples 41 and 42. Formally, these modules, without considering the module use statements, can be denoted as

RMAIN = ({}, {1, 2}, {(1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 1, 1), (1, 2, 2, 1)})

RSUB = ({2}, {1}, {(1, 1, 1, 1), (1, 1, 2, 1)})

The use-statement of example 41 is not a scoped import. However, the scoped import is more general than the default use-statement, and the latter can be translated into the former—each rule and each body part of the importing module gets scoped. For example 41 the use-statement hence amounts to the scoped import RMAIN ∘{(1,1),(1,2)} RSUB. Application of the scoped import operator results in the following use-statement-free module:

RMAIN ∘{(1,1),(1,2)} RSUB = ({4, 3}, {1, 2},
    {(1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 1, 1), (1, 2, 2, 1),
     (3, 1, 3, 1), (3, 1, 4, 1),
     (1, 1, 3, 1), (1, 2, 3, 1)})
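This computation can be replayed in a short Python sketch. The module representation is an illustrative assumption; since every R2G2 rule declares exactly one type, the head part index is always 1, and, as in the worked example, the imported rules are simply appended after the importing module's rules:

```python
def scoped_import(importing, imported, scope):
    """Scoped import M' o_S M'': concatenate the rule sequences (imported
    rules renumbered by an offset), union the dependency relations, and let
    every body part in `scope` depend on each public head of the imported
    module. Imported rules, even public ones, become private in the result."""
    n = importing["n_rules"]
    shifted = {(r1 + n, b, r2 + n, h) for (r1, b, r2, h) in imported["delta"]}
    inter = {(r1, b, r2 + n, 1) for (r1, b) in scope for r2 in imported["public"]}
    return {
        "n_rules": n + imported["n_rules"],
        "public": set(importing["public"]),
        "private": set(importing["private"])
                   | {r + n for r in imported["public"] | imported["private"]},
        "delta": set(importing["delta"]) | shifted | inter,
    }

# Example 41: element A = a[ B , C ]; element C = c[ ]   (both public)
MAIN = {"n_rules": 2, "public": {1, 2}, "private": set(),
        "delta": {(1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 1, 1), (1, 2, 2, 1)}}
# Example 42: element B = b[ C ] (public); private element C = ccc[ ];
SUB = {"n_rules": 2, "public": {1}, "private": {2},
       "delta": {(1, 1, 1, 1), (1, 1, 2, 1)}}

result = scoped_import(MAIN, SUB, scope={(1, 1), (1, 2)})
```

Running the sketch reproduces the dependency set shown above: SUB's rules become rules 3 and 4 (both private), and the two scoped body parts of MAIN's first rule gain dependencies on SUB's public head.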

So, the resulting module is a concatenation of all modules with a restricted rule dependency (compared to plain module concatenation). A promise of the generic module system is that, based on the reduction semantics, no change to the core rule language is strictly necessary. How can this goal be achieved, i.e. how can the controlled rule dependency be realized? A simple way in the case of R2G2 is a realization by rewriting: the type names on the left hand side are rewritten such that they reflect the separation of modules, e.g. by prefixing a module identifier; those on the right hand side are replaced by disjunctions over all modules involved in dependencies of the body part in question, e.g. a disjunction consisting of all involved module identifiers each postfixed by the former type name. As an example, the result of applying the module system reduction to the examples 41 and 42 is presented in example 43.

element MAIN_A = a[ (MAIN_B|SUB_B) , MAIN_C ];
element MAIN_C = c[ ]
element SUB_B = b[ SUB_C ];
element SUB_C = ccc[ ];

Code Example 43 An R2G2 grammar resulting from the application of the generic module system to the examples 41 and 42. The dependency modification is achieved by type renaming, i.e. by prefixing the (unique) module identifiers to the type names. To preserve the original type names for the outside, type aliases can easily be used. Note that the first rule uses an undefined type name, MAIN_B, for technical reasons. Orphaned right hand side type names are, however, easily detected and can always be removed from the grammar.
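The rewriting itself can be sketched as follows. This is a deliberate simplification under illustrative names (a real R2G2 reduction would also handle aliases, qualified prefixes and orphaned type names): left hand side type names are prefixed with their module identifier, and each right hand side occurrence becomes a disjunction over all modules whose rules the body part depends on:

```python
def rewrite(modules, deps):
    """modules: {module_id: {type_name: [body type names]}}.
    deps: {(module_id, type_name, body_name): [module_ids that may
    provide body_name]}; a missing entry defaults to the own module."""
    out = {}
    for mod, rules in modules.items():
        for name, body in rules.items():
            new_body = []
            for t in body:
                providers = deps.get((mod, name, t), [mod])
                new_body.append("|".join(f"{p}_{t}" for p in providers))
            out[f"{mod}_{name}"] = new_body
    return out

MAIN = {"A": ["B", "C"], "C": []}
SUB = {"B": ["C"], "C": []}
deps = {("MAIN", "A", "B"): ["MAIN", "SUB"],   # scoped import: B may come
        ("MAIN", "A", "C"): ["MAIN"],          # from MAIN or SUB
        ("SUB", "B", "C"): ["SUB"]}

result = rewrite({"MAIN": MAIN, "SUB": SUB}, deps)
```

The result mirrors example 43: `MAIN_A` refers to the disjunction `MAIN_B|SUB_B`, including the orphaned name `MAIN_B`, which a later cleanup pass would remove.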

Further features of the module system not presented in detail, i.e. not demonstrated by example, are:

qualified modules Arguably, qualified modules are exactly what the scoped import is needed for, as it allows expressing that the imported rules become visible only for the body parts using the qualifier. This can be expressed by rewriting, selectively applying the rewriting to the qualified body parts in the same way as presented in example 43 for all body parts.

parametrized module import Parametrized module use is similar to qualified module use, except that the assignment of a module to a qualifier does not occur in the declaring module, but in the using module. As the module reduction takes place on all modules, globally and at compile time, this is not a big deal for the generic module system—the assignment of a module to a parameter is just postponed until it is bound for a module use. Note, however, that for multiple imports (in possibly different modules) of the same parametrized module, multiple instances of the rules using the parameter have to be created during reduction, as they possibly have dependencies to different types. Right hand side occurrences of the types declared by the multiple instances accordingly select their corresponding rule instance.

parametric modules A parameter of a parametric module is treated exactly the same way as a qualified module use, after binding of the parameter.
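The instantiation of a parametrized module for one concrete parameter binding, as described above, can be sketched like this (the representation and all names are illustrative assumptions; the grammar is the website container of example 37 bound as in example 38):

```python
def instantiate(rules, param_prefix, binding, instance_id):
    """Create one instance of a parametrized module: occurrences of type
    names carrying `param_prefix` are redirected to the bound module
    `binding`; all other names are tagged with this instance's id, so that
    multiple imports with different bindings yield distinct rule copies."""
    out = {}
    for name, body in rules.items():
        out[f"{instance_id}_{name}"] = [
            f"{binding}_{t[len(param_prefix):]}" if t.startswith(param_prefix)
            else f"{instance_id}_{t}"
            for t in body
        ]
    return out

# Example 37, reduced to type-name occurrences:
CONTAINER = {"Container": ["ContainerContent"],
             "ContainerContent": ["Container", "Document"],
             "Document": ["DOCUMENTDEF-DOCUMENT"]}

# Example 38 binds DOCUMENTDEF- to the module HTMLDEF:
inst = instantiate(CONTAINER, "DOCUMENTDEF-", "HTMLDEF", "MYSITE")
```

A second import with a different binding would simply produce a second instance under another instance id, which is exactly the rule duplication mentioned above.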


5 Use Cases—Modelling Data and Documents with R2G2

5.1 Beyond Regular Tree Grammars—The Use of Macros

Arguably in practice, especially for larger real-world applications, any possibility of factorization of code is welcome and used by programmers. When modelling content models, a way of factoring out common content of different element types is to use content macros.

The normative schema of XHTML¹ is written in Relax NG, employing a large number of grammar rules that correspond to the R2G2 macro rules. Macros are used to factor out content elements as well as common attribute declarations. The XHTML draft defines different modules, represented by different schema modules that are imported, partly overridden in sub-modules and aggregated to the complete XHTML schema. The separation into modules is useful to provide small units of XHTML that can be used in other XML schema definitions; e.g. the hyperlink mechanism is defined as a separate module and can thus be reused for another document markup language. The extract of R2G2 declarations in example 44 corresponds to the parts of the XHTML 2.0 schema draft [49] used to declare hyperlinks:

module "hypertext-module"
use "text-model-module"
use "common-attributes"

element A = a(A.AttList)[ Text.Model ] ;
alias A.AttList = Common-Attrib ;
alias Text.Class = Text.Class | A ;

Code Example 44 The example is mostly a transcription of the XHTML module for hyperlinks written in Relax NG. Comprehension is rather easy, as most of the declarations irrelevant in this context are located in other modules.

The simplicity of the definition is obviously due to modularization.

¹ Either 1.0 or 2.0; even the HTML DTDs share this property.


5.2 Design Patterns—R2G2 Best Practices

In the spirit of the well known “XML Schema best practices” [22], some design patterns for R2G2 will be presented. Many of them are inspired by the XML Schema counterpart.

Design Objectives In the style of the design objectives document of “XML Schema best practices”, three design principles can be distilled:

• The instance author's power vs. the rigidness of the schema, or to which extent the most general type is used, such that an instance author may choose which type of data to use in certain contexts.

• Reusability, or to which extent a user may reuse the schema declarations for another schema, e.g. in another namespace.

• Extendability, or to which extent the schema should be extensible by third parties, e.g. using parametrization.

The first principle is freedom of data and structure choice for the document author. This can be divided into two aspects: the freedom of choice concerning atomic data, i.e. arbitrary CDATA in terms of XML, and the freedom of choice concerning structured data, i.e. arbitrary, maybe untyped or schema-less XML sections of the document. The first aspect has been known since DTDs, which provide PCDATA as a content type. In R2G2 this is achieved using either a most general regular expression at content level or the predefined string data type.

element AnyText1 = /.*/;
element AnyText2 = r2g2:String ;

To model arbitrary structured content, the R2G2 schema author can use the predefined Any type, which represents exactly one arbitrary element or text node. Defining the Any type using R2G2 is also possible and actually not complicated, as shown in the following example:

element Any = /.*/( (/.*/ = /.*/)* )[ Any* ]

element Any = /.*/( AnyAttr* )[ Any* ]
            | /.*/( AnyAttr* ){ Any* }
            | r2g2:String ;
attribute AnyAttr = /.*/ = r2g2:String ;

The second design principle is reusability. A schema author has to decide to which extent the schema is exposed to other schema authors, such that they can easily include it in their own schema declarations. In R2G2 the schema author controls the reuse of his schema declarations by controlling the visibility of type names for users of a module. Only visible type names can be used in other grammars or applications. A further notion of reusability is to leave the namespace of elements unspecified, thus giving users of the module the ability to fully integrate the declarations into their own schemata, e.g. using their own namespace. Having homogeneous namespaces in a schema is desirable from the point of view of the document authors, as it arguably eases reading and authoring of a document. A disadvantage of a schema not bound to a specific namespace, yet being bound to different namespaces of hosting schemata, is that a query author or a processing application needs to be aware of all possible hosting namespaces to support such a variable-namespace document component.

The third design principle is extendability. Third party schema authors provide the schemata for the components of a given schema that models a container-like structure. To achieve this in R2G2, parametric modules can be used. The feature is roughly comparable to the term dangling type as used in the 'XML Schema best practices' documents—dangling types are types used but not declared in a schema (they have to be declared by the user of the schema).


A schema using such a module hence has to implement the module parameter by providing a module with the type declarations for the type names in question. As an example, consider the website container format of the examples 37, 38 and 39 presented in section 4.3.

5.2.1 Global versus Local

In the 'XML Schema best practices' documents, this issue is about when and how to declare elements locally and when to declare them globally. In this context the terms local and global are used to distinguish named types, i.e. types declared with a rule (called global types), from those declared in-line as child type terms in the content model of their parent type term (called local types). Using R2G2, the declaration of local element definitions can be achieved in two ways: indirectly, by declaring element types using the private visibility modifier for a rule, or directly, by declaring element types locally as child elements in the element declaration of another element. For global declarations, the declared element has to be in a publicly visible rule.

The direct modelling of local definitions is similar to the way of locally defining elements in XML Schema using the Russian doll principle, and the principle is directly applicable to R2G2. When modelling schemata using the Russian doll paradigm, elements are declared in a nested way in a type term. The grammar in example 45 declares an address book in the spirit of the Russian doll paradigm.

root AddressBook;
element AddressBook =
  addressbook{
    card{
      name[ /.*/ ],
      ( phone[ /.*/ ]
      | email[ /.*/ ]
      | im-contact[ type[ /(icq)|(msm)|(aim)|(irc)|(yabber)/ ] ,
                    user-id[ /.*/ ] ]
      | address[ street[ /.*/ ], detail[ /.*/ ], city[ /.*/ ],
                 zip-code[ /.*/ ], state[ /.*/ ]? , country[ /.*/ ]? ]
      )+
    }*
  };

Code Example 45 An address book modelled in the spirit of the Russian doll design principle—the child node types of a node are declared using type terms as child terms of their corresponding parent type.

The advantage of the Russian doll paradigm is that the schema reflects, almost like an example, the structure of instance documents—element declarations occur in a nesting context similar to that of their instances. A disadvantage of the Russian doll approach is that multiple occurrences of the same type result in multiple declarations of that type. Note that the Russian doll paradigm does not allow modelling data with recursive declarations: a circular structure requires referring to a declaration from inside that declaration (possibly at a deeper nesting level), but referencing a type name makes it necessary to use a rule for the declaration of that type name.

The opposite of the Russian doll paradigm is the so-called salami slices paradigm, where each element type declaration is represented by a rule of its own, very much in the traditional way of formal tree grammars as found in sections 3.1 and 3.2 or in [21]. Example 46 presents the same document type as the Russian doll example (see example 45), modelled using the salami slices paradigm. Each element declaration is associated with a type name of its own, hence exporting all symbols can easily be achieved.


root AddressBook;
element AddressBook = addressbook{ Card* };
element Card = card{ Name , (Phone|Email|IMContact|Address)* };
element Name = name[ /.*/ ];
element Phone = phone[ /.*/ ];
element Email = email[ /.*/ ];
element IMContact = im-contact[ IMType , IMUID ] ;
element IMType = type[ IMTypeToken ] ;
element IMTypeToken = /(icq)|(msm)|(aim)|(irc)|(yabber)/ ;
element AnyText = /.*/;
element IMUID = user-id[ AnyText ] ;
element Address = address[ Street , Detail , City , ZipCode , State? , Country? ] ;
element Street = street[ AnyText ] ;
element Detail = detail[ AnyText ] ;
element City = city[ AnyText ] ;
element ZipCode = zip-code[ AnyText ] ;
element State = state[ AnyText ] ;
element Country = country[ AnyText ] ;

Code Example 46 An address book modelled in the spirit of the salami slices design principle—all types are declared employing rules at the top level of the grammar; the child node types of a node are referenced in its content model using the type name declared in the corresponding rule.

In the 'XML Schema best practices' document, a third paradigm between Russian dolls and salami slices is presented, the so-called venetian blinds. In XML Schema, a type is not bound to an element declaration but to a content model; an element declaration hence involves a label declaration and a type declaration. The venetian blinds pattern consists in referencing type declarations in element declarations that occur themselves nested in type declarations. This contrasts with the salami slices paradigm, where element declarations are referenced in type declarations. The concept is applicable to R2G2 using content non-terminals.

Example 47 illustrates the same address book as used in the examples 45 and 46, in the spirit of the venetian blinds paradigm using content macros—the only necessary explicit element declaration is the root element declaration:

root AddressBook;
element AddressBook = addressbook{ card{ Card }* };
content Card = name[ AnyText ] ,
               ( phone[ AnyText ]
               | email[ AnyText ]
               | im-contact[ IMContact ]
               | address[ Address ]
               )* ;
content IMContact = type[ IMType ] , user-id[ AnyText ] ;
content IMType = /(icq)|(msm)|(aim)|(irc)|(jabber)/ ;
content AnyText = /.*/;
content Address = street[ AnyText ] , detail[ AnyText ] , city[ AnyText ] ,
                  zip-code[ AnyText ] , state[ AnyText ]? , country[ AnyText ]? ;

Code Example 47 The address book of the examples 45 and 46, modelled in the spirit of the venetian blinds paradigm using content macros.

Note that content is a mere macro preprocessing mechanism; therefore the content non-terminals are not type names that can be accessed, e.g., in validated and type-annotated data instances. Similar to the Russian doll example, only one element type is declared in this example, hence all other, implicit element declarations are anonymous and cannot be exported for import or explicit type annotation. In XML Schema the different paradigms are of essential importance for the exposure or hiding of declarations. In R2G2 dedicated constructs for import, export and

namespace handling have been introduced, rendering the exposure and hiding aspect of the modelling style less relevant, yet leaving the freedom of taste to the schema author. On the other hand, such structural properties of XML Schemata may be reflected in R2G2 type declarations, making transitions of legacy schemata easier. Some rules of thumb for the application of the different paradigms are given here, yet they only reflect the author's preferences:

• Introduce a new type name when the type is reused in different contexts

• When modelling schemata with querying in mind, use more type names than when modelling for document validation, as this exposes the possibility of type annotation to the query programmer

• Try to expose specialized textual content types, as they are the most likely to be queried or carried over into constructions. Furthermore, regular expressions over characters are arguably harder to read for document or data authors than well chosen type names.

• Hide type names of types that are still in development

• When elements with different labels always share the same content and the elements have similar semantics, it may be advisable to use a label regular expression in one and the same type declaration, instead of using one content declaration and many element type declarations.

• Try to factor out common parts of content models in content declarations.

5.2.2 Composition vs. Sub Classing

In XML Schema a powerful object oriented modelling feature that is arguably orthogonal to the grammar philosophy has been introduced—modelling by sub classing. When modelling by sub classing, a new type can be derived from an existing type by restricting the set of elements belonging to the super class of the subclass. This feature is powerful in the sense of object oriented modelling or programming, as it allows the programmer or modeller to specialize the behaviour of a new class of objects based on the behaviour of a base or super class. In XML Schema, sub classing or restriction is little more than a contract of a type to fulfill the subset property of its base or super class, i.e. the author of a sub type has to model the type in such a way that it fulfills a stated subset, base type or restriction property; an XML Schema validator has to check the given type for conformance to that property. The advantage for a program operating on data structures fulfilling the given subclass property is the same as for object oriented programming—the program can rely on the compatibility of operations designed for the super type when applied to its subtypes. For the schema author, no benefit exists over modelling data without the subtype support. However, for a query author, for example, it is valuable to know about the subtype property of given types, as this ensures the applicability of queries written for a super type to elements of the subtype. Arguably, type checking does not really need to rely on sub typing information for useful operation, as we will see in the chapter about type checking of Xcerpt—type inference is able to deduce the super type of given subtypes without that information. When modelling elements of a given type, there is the benefit of automatically being able to use the subtypes of the given type.
However, the base type has to be modelled from the beginning in a way that is easily extensible; otherwise it has to be altered or widened along with new sub classes. Interestingly, even though the object oriented principle of sub classing has been introduced to ease the separation of concerns, a tight integration of complexity and so-called tight coupling is the result. Let us assume as an example a database of a camera vendor modelled following a sub classing paradigm: starting as a vendor of cameras, there exists a type Camera with all relevant properties. When digital cameras came to the market, some (analogue) camera properties became irrelevant while other, digital camera properties gained relevance. This logically leads to a sub classing of camera, where some of the former camera features have to be shifted into a subclass of

camera, AnalogCamera, while digital camera features are aggregated in the DigitalCamera sub class. As the store may also sell tripods as non-optical devices, or sound recording devices, or projectors as optical devices along with cameras, a consequence for the schema designer is to provide a super class to the Camera class—the OpticalDevice class. When altering the optical device class, potentially all camera classes may have to be adapted along with valid instances, which is a consequence of the so-called tight coupling.

An alternative approach to sub classing is called composition or loose coupling. When modelling using composition and loose coupling, the main principle is to encapsulate the differences instead of abstracting away the similarities of objects. In XML and related schema languages, this is achieved by extending the content model, adding new elements or element containers to a given type, possibly optional ones to retain backward compatibility. In the camera vendor example, the difference between digital and analogue cameras could be modelled either as an alternative between analogue and digital camera child members, or as a common abstract container for the RecordingMedia. This way, altering the data model will not affect e.g. 'camera container structures', as the modification is internal to the camera.
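The contrast between the two modelling styles can be sketched in a few lines of Python (class and field names are illustrative assumptions, not taken from any schema):

```python
from dataclasses import dataclass

# Sub classing (tight coupling): a change to OpticalDevice ripples through
# every subclass and every valid instance.
@dataclass
class OpticalDevice:
    lens_mount: str

@dataclass
class AnalogCamera(OpticalDevice):
    film_format: str

@dataclass
class DigitalCamera(OpticalDevice):
    sensor_resolution_mp: float

# Composition (loose coupling): the difference is encapsulated in an
# exchangeable recording-media member; container structures holding
# cameras are unaffected by its evolution.
@dataclass
class Film:
    format: str

@dataclass
class Sensor:
    resolution_mp: float

@dataclass
class Camera:
    lens_mount: str
    media: object  # Film or Sensor -- the encapsulated difference
```

In the composed version, adding a third recording medium touches only the media classes, whereas the subclassing version would require a new Camera subclass.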

Unquestionably, design by sub classing has an important role to play in XML Schema design. However, it is being greatly overused. Long, extended type hierarchies lead to brittle, non-modifiable designs that are virtually impossible to understand. Design by composition is the preferred approach. It yields simpler, robust, modifiable, plug-and-play designs. "Favoring element composition over type inheritance helps you keep each element encapsulated and focused on one task. Your type hierarchies will remain small and will be less likely to grow into unmanageable monsters." [22]²

As a consequence, the concept of sub typing by restriction is not included in R2G2; yet, due to the regularity of the language, it is easy for external applications to ensure the subtype property of R2G2 type declarations. Sub typing is a major concept for the type checking of query programs using R2G2 types.

5.2.3 'eXtreme eXtensibility'

A design pattern presented in the "XML Schema best practices" document is the 'eXtreme eXtensibility' pattern. The concept behind it is to couple a schematized, typed or structured world with the freedom of semi-structured data, such that no restrictions on the structure of data are imposed beyond the ones already available. As a concrete example, think of a schema for bibliographical information: a first party may provide a schema for data containing just author and title along with the data; another party may provide extended information containing, e.g., the publisher and the number of pages of books. A third party again may provide, based on the first schema, an extension for, e.g., reviews of books.

Technically, the concept is based on the use of a type disjunction containing the most general type in a repetition sequence, so as to allow arbitrary elements (represented by the most general type) along with well defined elements (the other disjuncts). It is, however, advisable to restrict the most general type to a namespace specially devoted to the purpose of being the namespace of arbitrary extension elements for the concrete application in mind. Third party document and schema authors then declare the elements they need in that namespace and thus extend the schema to their needs. The following example models the previously mentioned bibliographic database: first the base schema and a document instance, second and third two derived schemata and document instances. Last, an aggregation of all the schemata and documents is presented, which is a common usage scenario of documents and schemata applying the 'eXtreme eXtensibility' pattern.

² In the "XML Schemas: best practices" collection of documents the cited part resides in the document "Composition versus Subclassing" at http://www.xfront.com/composition-versus-subclassing.html


default namespace http://book.ext.ext.org/ ;
namespace ex = http://bookext.ext.ext.org/ ;

root BookDB;

element BookDB = literature{ @Book* };
element Book = book{ BookContent* };
content BookContent = (ex:AnyBookContent | Title | Author) ;
element Title = title[ r2g2:String ] ;
element Author = author[ r2g2:String ] ;
element ex:AnyBookContent = ex:/.*/{ r2g2:Any };

Code Example 48 The grammar models a bibliographic database following the “eXtreme eXtensibility” paradigm. It is a container for book elements about which no more is specified than that each contains a title element, an author element, and arbitrary elements, as long as they are in the namespace bound to the prefix ex in this example.

use http://book.ext.ext.org/schema.r2g2 ;
namespace bdb = http://book.ext.ext.org/ ;
default namespace http://bookext.ext.ext.org/ ;

root bdb:BookDB;

element NrOfPages = pages{ r2g2:Integer } ;
element Publisher = publisher{ r2g2:String } ;

Code Example 49 A grammar defining types for the number of pages of a book and for information about a publisher. The types have been designed to be used in a bibliographic database as shown in example 48. The contract between schemata committed to the “eXtreme eXtensibility” paradigm is to use a given namespace.

use http://book.ext.ext.org/schema.r2g2 ;
namespace bdb = http://book.ext.ext.org/ ;
default namespace http://bookext.ext.ext.org/ ;

root bdb:BookDB;

element Review = review[ r2g2:String ] ;

Code Example 50 Another grammar defining a type for the extension of the grammar in example 48.

<literature xmlns="http://book.ext.ext.org/">
  <book>
    <title>Three to See the King.</title>
    <author>Magnus Mills</author>
  </book>
  <book>
    <title>His Master’s Voice</title>
    <author>Stanislaw Lem</author>
  </book>
</literature>

Code Example 51 An instance document valid with respect to the grammar of example 48.


<literature xmlns="http://book.ext.ext.org/" xmlns:ex="http://bookext.ext.ext.org/">
  <book>
    <ex:pages>167</ex:pages>
    <ex:publisher>HarperPerennial</ex:publisher>
  </book>
  <book>
    <ex:pages>199</ex:pages>
    <ex:publisher>Northwestern University Press</ex:publisher>
  </book>
  <book>
    <ex:pages>290</ex:pages>
    <ex:publisher>Virgin Books</ex:publisher>
  </book>
</literature>

Code Example 52 An instance document valid with respect to the grammars of examples 48 and 49.

<literature xmlns="http://book.ext.ext.org/" xmlns:ex="http://bookext.ext.ext.org/">
  <book>
    <title>Three to See the King.</title>
    <ex:review>Novella-like in form, Magnus Mills’ Three to See the King is an uneasy read that ...</ex:review>
  </book>
  <book>
    <title>The Restraint of Beasts</title>
    <ex:review>Building high-tension fencing with a couple of rural Scots louts--what could be a more ...</ex:review>
  </book>
</literature>

Code Example 53 An instance valid with respect to the grammars of examples 48 and 50 used together.


199 Northwestern University Press His Master’s Voice Stanislaw Lem 290 Virgin Books Three to See the King. 167 HarperPerennial Novella-like in form, Magnus Mills’ Three to See the King is an uneasy read that ... The Restraint of Beasts Building high-tension fencing with a couple of rural Scots louts--what could be a more ... Code Example 54 This instance document could have been generated by an aggregation service (e.g. written as a web query). It is the join of all three document examples 51, 52, and 53. As a consequence of the “eXtreme eXtensibility” approach, the document is valid with respect to the union of all the involved grammars, i.e. the examples 48, 49, and 50.

A drawback of this approach is that name conflicts may arise in the extension namespace. An advantage of this approach is the high flexibility and the ease of merging different document instances due to the common outer structure.

Relationship between ‘eXtreme eXtensibility’ and RDF The application scenario presented in the examples for ‘eXtreme eXtensibility’ is interestingly related to RDF: using ‘eXtreme eXtensibility’, it is possible to describe different aspects of the same concept in a distributed manner, providing means of aggregating the knowledge. RDF to some extent has the same goal—describing properties of subjects (identified using URIs) using a triple-like relation between subject and object along a predicate. A proposed W3C recommendation—the RDF/XML Syntax Specification—provides a very loose schema for XML documents and an interpretation in RDF of the data of such documents. The main idea is to have a kind of striping in the depth of subject/predicate/object, such that

• the subject is an element (directly under the root element3), identified e.g. using an id attribute,

• the child nodes of a subject are predicates

• the child node of a predicate (just one is allowed) is the object of the subject/predicate/object relationship; it may be CDATA.

Arguably, RDF/XML is not of high relevance for providing an RDF interpretation of legacy XML applications, as they are very unlikely to fulfill the RDF/XML syntactic requirements and even less likely to provide RDF with meaningful semantics. For new applications, however, RDF/XML provides a way to define an XML application that, when well designed, at the same time yields an RDF application.

3Subjects may also occur in other, deeper locations in the document, please refer to http://www.w3.org/TR/2003/PR-rdf-syntax-grammar-20031215/.

The presented example application fulfills the RDF/XML requirements. The data may be interpreted as RDF data. The schema, however, gives no hint about that. A schema for data fulfilling the RDF/XML recommendation must be the schema of a language that is a subset of the RDF/XML specification. Unfortunately, due to the high freedom and flexibility of the label naming conventions, it is neither possible to give a general schema for RDF/XML using XML Schema, nor DTD, nor Relax NG. Using R2G2 it is possible. The distinguishing feature is the regular expression based element label specification. As it is possible to check the subtype (or language subset) relationship between two R2G2 specifications, it is possible to check an R2G2 schema for RDF/XML conformance. Note that this does not necessarily lead to schemata or documents (conforming to those schemata) with a useful RDF interpretation; it is merely a useful tool along the way of defining XML languages that also fulfill the RDF/XML specification.

Part III

Automata Models


6 An Automaton Model for Regular Rooted Graph Languages

An automaton model used for validation and type checking with languages defined using R2G2 is presented. First, tree-shaped data is considered to be handled by the automaton model; then the approach is extended to graph-shaped data. The presented approach is based on specialized non-deterministic finite state automata. The specialisation copes with unranked tree-shaped data. Graph-shaped data will be treated as trees, possibly infinite in depth. The choice of using non-deterministic automata is motivated by complexity issues: as the tree automata are based on regular expressions, non-deterministic automata are a necessary intermediate step. Arguably, deterministic tree automata are more efficient at validating data, but the derivation of such automata from non-deterministic ones comes with potentially exponential costs. As all the needed algorithms can be achieved on non-deterministic automata in sub-exponential time and space complexity, no need to transform to deterministic automata arises.

6.1 Introduction to Regular Tree Automata

Traditionally, regular tree automata are defined as follows (cf. [21]).

Definition 6.1 (Non-deterministic Finite Tree Automata) A non-deterministic finite tree automaton (NFTA) over Σ is a tuple A = (Q, Σ, QF , ∆), where Q is a set of (unary) states, QF ⊆ Q is a set of final states, and ∆ is a set of transition rules of type f(q1(x1), . . . , qn(xn)) → q(f(x1, . . . , xn)), where n ≥ 0, f ∈ Σn, q, q1, . . . , qn ∈ Q, x1, . . . , xn ∈ X.

The set Σ contains the symbols, or the alphabet, of the trees. Note that traditionally regular tree automata operate on ranked trees; therefore the symbols have fixed arity—the number of child nodes in a corresponding tree is fixed. The set Σp ⊆ Σ is the set of all symbols in Σ with arity p. The set T (Σ) denotes the set of all trees that can be constructed using the symbols in Σ. Therefore

• Σ0 ⊆ T (Σ)

• for p ≥ 1, if f ∈ Σp and t1, . . . , tp ∈ T (Σ), then f(t1, . . . , tp) ∈ T (Σ)


Example 6.1 A non-deterministic1 finite tree automaton able to recognize a language containing (among many others) the tree f(g(a, b), c, g(c)) is given. The figure on the right of the automaton informally illustrates the relationship between the states,2 transitions and the data tree: a transition is denoted as a kind of tube. If some sub-trees of the data tree have been recognized, the automaton is in the corresponding states. A transition is used if the automaton is in all the corresponding input states (in the example, below the tube) and the father node of the sub-trees recognized with those states is labeled like the tube. The automaton is then no longer in the states below the transition, but in the target state (precisely, in all the target states of all the transitions traversable in that step). The root of the tree has to be accepted in such a way that the resulting state is a final state. Note that the two instances of the c transition and the state q2 denote the same objects in the automaton; they have been duplicated to illustrate acceptance of input data that contains two sub-trees accepted by the same transition.

A = {{q1, q2, q3, q4, q5, q6, q7, q8}

, {f/3, a/0, b/0, c/0, g/2, g/1}

, {q4}

, {f(q1(X), q2(Y ), q3(Z)) → q4(f(X,Y,Z))

, g(q5(X), q6(Y )) → q1(g(X,Y ))

, c → q2(c)

, a → q5(a)

, b → q6(b)

, g(q2(X)) → q3(g(X)) } }

Acceptance Procedure The acceptance procedure recognizes whether a given tree is a member of the tree language represented by a given automaton. A tree t is in the language L(A) if it is accepted by A. The acceptance procedure can be defined as a non-deterministic algorithm expressed by a set of rules. The rules relate so-called configurations of an automaton to each other. A configuration is a tree in which some nodes are annotated with a state; more formally, c ∈ T (Σ ∪ Q)—note that Q is defined as a set of unary states, i.e. a state can be seen as a (special) node in a tree with exactly one child node. The rules have the following general shape:

C1   . . .   Cn
t
─────────────────── (EXAMPLE RULE)
t′

where t and t′ denote configurations. Ci denotes constraints on the configurations, parts of them, or their sub-trees. The style of rules presented here is inspired by Gentzen or tableaux calculus rules and is often used in the context of type system formalization [41]. Whenever t is matched in the current configuration, t can be replaced by t′. The rules are applied until no rule is applicable anymore, resulting in a sequence of configurations. If more than one rule is applicable to one configuration (which is usually the case), a tree of possible configurations exists, with sequences of configurations as paths through the tree. The use of rules to express the acceptance procedure of finite automata is not common, yet it is useful to introduce the rule formalism that will be used throughout this thesis in different places.

1Indeed it is deterministic, but the difference is not relevant at the moment.
2It corresponds loosely to what will later on be introduced as “aggregated acceptance path in the derivation tree”.

Rules for the Acceptance Procedure Based on Finite Tree Automata A given tree t is a member of a language L(A) for an automaton A = (Q, Σ, QF , ∆) if there is a derivation of configurations based on the following rules with at least one closed branch of the derivation tree.

q ∈ QF    t ∈ T (Σ)
q(t)
─────────────────── (ROOT)

The first rule states that the given tree t is accepted when a configuration is derivable such that t is accepted with a final state q ∈ QF . Such a branch of configuration derivations is successfully closed. At least one successfully closed branch is necessary to prove membership of a given tree in the language represented by the given automaton.

ui ∈ T (Σ)    f ∈ Σn    f(q1(X1), . . . , qn(Xn)) → q(f(X1, . . . , Xn)) ∈ ∆    q, q1, . . . , qn ∈ Q
f(q1(u1), . . . , qn(un))
───────────────────────── (REC)
q(f(u1, . . . , un))

The REC rule relates two configurations if the tree contains a (sub-)tree matching the left hand side of a transition in ∆. The (sub-)tree is then replaced by the tree on the right hand side of the transition rule, with all variables (e.g. Xi) substituted with the bindings of the left hand side. The only difference between the two configurations is the annotation of nodes with states. This is due to the nature of the transitions—left and right hand side are identical except for the change of intermediate state-labeled branch parts.
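Read bottom-up, the REC and ROOT rules suggest a simple executable sketch. The following Python fragment is our illustration, not the thesis's implementation: transitions are encoded as a map from a symbol and a tuple of child states to the set of reachable states, and the automaton encoded is the one of Example 6.1.

```python
from itertools import product

# Transitions of the automaton A from Example 6.1, encoded as
# (symbol, tuple of child states) -> set of reachable states.
DELTA = {
    ('a', ()): {'q5'},
    ('b', ()): {'q6'},
    ('c', ()): {'q2'},
    ('g', ('q5', 'q6')): {'q1'},
    ('g', ('q2',)): {'q3'},
    ('f', ('q1', 'q2', 'q3')): {'q4'},
}
FINAL = {'q4'}

def states(tree, delta):
    """Bottom-up: all states the automaton can reach at the root of `tree`.

    A tree is a pair (symbol, list of child trees)."""
    symbol, children = tree
    result = set()
    # Combine every choice of states for the children (non-determinism, REC rule).
    for combo in product(*(states(c, delta) for c in children)):
        result |= delta.get((symbol, combo), set())
    return result

def accepts(tree, delta, final):
    """ROOT rule: the tree is accepted if a final state is reachable at the root."""
    return bool(states(tree, delta) & final)

# The tree f(g(a, b), c, g(c)) from Example 6.1.
t = ('f', [('g', [('a', []), ('b', [])]), ('c', []), ('g', [('c', [])])])
print(accepts(t, DELTA, FINAL))           # True
print(accepts(('a', []), DELTA, FINAL))   # False: q5 is not a final state
```

Note that the two transitions for g with different arities coexist without conflict, as the child-state tuples have different lengths.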

Example 6.2 Given the tree f(g(a, b), c, g(c)) and the example automaton A presented above, the following derivation justifies the recognition of the tree as an instance of the language represented by A:


Aggregated Acceptance Path of the Derivation Tree Given a path of rule applications that proves the membership of a tree in the language of the corresponding automaton, the aggregated acceptance path is the tree resulting when aggregating all the configurations of the path into one configuration, such that all state annotations interlaced with the path are part of this configuration. This artificial configuration gives the information which node was accepted with which transition. The transition can then be seen as some sort of type annotation for the nodes of the tree. Later on (see chapter 8), for type checking Xcerpt, this is used to deduce the types of nodes, as the transitions are shown to be related to grammar rules and therefore to grammar non-terminal symbols, which in turn represent types or type names.

ε-rules It is possible to extend the non-deterministic regular tree automata with ε-rules. Those are rules of the form q → q′. While ε-rules are convenient in some cases (e.g. for the construction of an automaton based on some regular expression like formalism, as shown in 6.3), they neither enhance nor restrict the expressiveness of non-deterministic regular tree automata.3

Deterministic finite tree automata Another common variant of non-deterministic finite tree automata are deterministic finite tree automata. A tree automaton A = (Q, Σ, Qf , ∆) is deterministic (DFTA) if there are no two rules with the same left-hand side (and no ε-rules). Many textbook approaches to standard operations on automata like intersection and union require deterministic automata. It is always possible to obtain a deterministic automaton from a non-deterministic one, yet the resulting automaton may be of exponential size with respect to the input.
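The claim that ε-rules can be eliminated can be illustrated with a small closure computation. This is a generic sketch under our own encoding (transitions as a map from (symbol, child-state tuple) to result states), not the construction of [21]:

```python
def eliminate_epsilon(delta, eps):
    """Remove ε-rules q -> q' by closing transition targets under eps.

    delta: dict (symbol, child_states) -> set of result states
    eps:   set of pairs (q, q') representing the ε-rule q -> q'
    """
    # Collect every state mentioned, then compute each state's
    # reflexive-transitive ε-closure by a simple worklist traversal.
    seen = {q for qs in delta.values() for q in qs} | {q for p in eps for q in p}
    closure = {}
    for q in seen:
        reach, todo = {q}, [q]
        while todo:
            cur = todo.pop()
            for (a, b) in eps:
                if a == cur and b not in reach:
                    reach.add(b)
                    todo.append(b)
        closure[q] = reach
    # Extend every ordinary transition to all ε-reachable targets.
    return {lhs: set().union(*(closure[q] for q in qs)) for lhs, qs in delta.items()}

delta = {('a', ()): {'q1'}}
eps = {('q1', 'q2'), ('q2', 'q3')}
print(sorted(eliminate_epsilon(delta, eps)[('a', ())]))  # ['q1', 'q2', 'q3']
```

The closure makes chains of ε-rules such as q1 → q2 → q3 collapse into the ordinary transitions, so the resulting automaton needs no ε-rules.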

6.1.1 Handling Ranked Trees For XML and any ordered semi-structured data model, using regular tree automata for ranked trees is not possible without modification, as the data models are unranked. A common way to handle unranked trees with tree automata is to map the unranked trees to ranked counterparts. One way to achieve this is to lift tree nodes to a view where nodes are represented by a node/3 item,4 with the label as the first child of node/3, the first child of the unranked tree as the second child of node/3, and the following sibling—if present—of the current node in the context of its parent node as the third child of node/3. As node/3 is always of arity 3, it is necessary to provide an additional node type denoting the end of a branch, e.g. the end of a list of siblings or an empty child list—this node will be called eob/0, for end of branch.

Example 6.3 The unranked tree f[a, b[d], c] is mapped to the ranked counterpart

node(f, node(a, eob, node(b, node(d, eob, eob), node(c, eob, eob))), eob)
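The mapping to node/3 and eob/0 terms can be sketched as a short recursion; the pair encoding of unranked trees below is an assumption of this sketch, not notation from the thesis:

```python
def to_ranked(tree, following=()):
    """Encode an unranked tree as its node/3, eob/0 ranked counterpart.

    An unranked tree is a pair (label, tuple of children); the ranked
    result is rendered as a term string for readability.
    """
    label, children = tree
    # First child of node/3: the label; second: the encoded child list;
    # third: the encoded list of following siblings.
    content = to_ranked(children[0], children[1:]) if children else 'eob'
    sibling = to_ranked(following[0], following[1:]) if following else 'eob'
    return f'node({label}, {content}, {sibling})'

# The unranked tree f[a, b[d], c] from Example 6.3.
t = ('f', (('a', ()), ('b', (('d', ()),)), ('c', ())))
print(to_ranked(t))
# node(f, node(a, eob, node(b, node(d, eob, eob), node(c, eob, eob))), eob)
```

The printed term coincides with the ranked transcription given in Example 6.3.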

A corresponding automaton has to recognize the ranked transcription of unranked trees in the usual way.

Example 6.4 An automaton accepting the former example could be:

A = {{q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, q11}

, {node/3, a/0, b/0, c/0, d/0, f/0, eob/0}

, {q11}

, { node(q3(X), q6(Y ), q6(Z)) → q7(node(X,Y,Z))

, a → q1(a)

, b → q2(b)

, c → q3(c)

, d → q4(d)

, f → q5(f)

, eob → q6(eob)

, node(q4(X), q6(Y ), q6(Z)) → q8(node(X,Y,Z))

, node(q2(X), q8(Y ), q7(Z)) → q9(node(X,Y,Z))

, node(q1(X), q6(Y ), q9(Z)) → q10(node(X,Y,Z))

, node(q5(X), q10(Y ), q6(Z)) → q11(node(X,Y,Z)) } }

3For an equivalence proof see [21], page 20.
4A label l is mapped to a node l/0 in the ranked mapping.


6.2 An Automaton Model for Unranked Regular Rooted Graph Languages

In this section an automaton model for R2G2 is introduced. As R2G2 models languages of unranked trees, handling of unranked ordered trees is essential for the automaton model sought. As the class of tree grammars in use can be captured using automata operating directly on unranked trees, instead of on ranked transcriptions of them, it is useful to introduce an automaton model solely coping with such kinds of languages. A new hyper graph based formalism is introduced. This formalism has proved useful for didactic purposes along this thesis as well as easy to implement. All methods involving data handling (e.g. validation or typing) with automata are formulated directly on the unranked data formalism. For this reason, the automata are considered to be automata for unranked trees, as opposed to automata for ranked trees as presented in [21].

6.2.1 Labelled Directed Hyper Graphs as Non-Deterministic Regular Tree Automata In the spirit of non-deterministic tree automata (for ranked trees) as briefly introduced in section 6.1, a hyper graph based automaton approach is presented now. The main difference is the treatment of unranked trees or graphs in the style of XML abstract syntax trees instead of ranked trees in the style of classical logical terms with symbols of fixed arity. The main difference in formalisation is the use of (hyper) graph edges as transitions instead of term transitions as presented in section 6.1 and widely used in [21].

Definition 6.2 (Labelled Directed Hyper Graphs as Non-Det. Regular Tree Automata) A non-deterministic regular tree automaton M is a 5-tuple (Q, ∆,F,R, Σ) with label alphabet Σ, states Q, final states F where F ⊆ Q, transitions ∆ where ∆ ⊆ (Q×Σ×Q×Q) ∪ (Q×Q) (regular transitions are of the domain Q × Σ × Q × Q and ε-transitions are of the domain Q × Q) and a set of root transitions R with R ⊆ ∆.5 A transition (s, e) ∈ ∆ will be called an “ε-transition” from now on, where s is called “start state” of the transition and e is called its “end state”. A transition (s, l, c, e) ∈ ∆ is a non-ε-transition, where s is called “start state”, e is called “end state”, l is called the “label” of the transition and c is called the “content start state”. The transition may be traversed partly, or in one dimension, where a traversal along the component from s to e is called “horizontal transition step” and from s to c “depth transition step”.

The hyper graph automata introduce ε-edges as commonly used for (string, not tree) finite automata, as e.g. presented in [44] and [25]. They do not raise the expressiveness of the automaton model; they are just more convenient for automata construction based on regular expressions. Algorithms for ε-edge removal exist, as shown in [44] and [25].

Definition 6.3 (Projection of hyper graph components) For an automaton A = (Q, ∆, F, R, Σ), projection of the components is defined as QA = Q, ∆A = ∆, FA = F , RA = R and ΣA = Σ.

The union of two automata A1 and A2 is defined as the pairwise union of their components, i.e. A1 ∪ A2 = (QA1 ∪ QA2 , ∆A1 ∪ ∆A2 , FA1 ∪ FA2 , RA1 ∪ RA2 , ΣA1 ∪ ΣA2 ). The difference of two automata is defined in a similar way, yet consistency of all transitions must be retained, i.e. all states and symbols involved in transitions must be defined in Q, respectively Σ. Useful functions for the construction of automata are the addition and subtraction of transitions to (or from) an automaton, defined as follows: A1 + τ = A2 such that τ = (s, l, c, e) and A1 ∪ ({s, c, e}, {τ}, {}, {}, {l}) = A2 ; A1 − τ = A2 such that τ = (s, l, c, e) and A1 \ ({s, c, e}, {τ}, {}, {}, {l}) = A2 .
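Definition 6.3 translates directly into component-wise set operations; a minimal sketch, with names of our choosing:

```python
from typing import NamedTuple

# A hyper graph automaton as the 5-tuple (Q, Delta, F, R, Sigma) of Definition 6.2.
class Automaton(NamedTuple):
    Q: frozenset
    Delta: frozenset
    F: frozenset
    R: frozenset
    Sigma: frozenset

def union(a1, a2):
    """Pairwise union of the components (Definition 6.3)."""
    return Automaton(*(x | y for x, y in zip(a1, a2)))

def add_transition(a, tau):
    """A + tau: add a non-ε-transition tau = (s, l, c, e) with its states and label."""
    s, l, c, e = tau
    return union(a, Automaton(frozenset({s, c, e}), frozenset({tau}),
                              frozenset(), frozenset(), frozenset({l})))

empty = Automaton(frozenset(), frozenset(), frozenset(), frozenset(), frozenset())
a = add_transition(empty, (0, 'a', 2, 1))
print(sorted(a.Q), sorted(a.Sigma))  # [0, 1, 2] ['a']
```

Subtraction of transitions would be analogous, using component-wise difference while keeping the remaining transitions consistent.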

For the sake of concreteness an example automaton for the following grammar G is presented:

5Usually we need just one root transition, but for technical reasons it is convenient to have a set of root transitions.


A = ( { 0, 1, 2, 3, 4, 5 },
      { (0, a, 2, 1), (2, b, 5, 3), (3, b, 5, 4), (4, 2), (5, a, 2, 5) },
      { 4, 5 },
      { (0, a, 2, 1) },
      { a, b } )

Figure 6.1: An example hypergraph automaton in a typical textual representation on the left and a graphical representation on the right.

The language generated by the grammar

element A → a[(B,B)+ ]; element B → b[ A∗ ];

is accepted by the automaton shown in Figure 6.1.

Comparing Hyper Graph Automata and Ranked Tree Automata Applied to Unranked Data Transcription The hyper graph automaton model is a special notation for fixed-arity tree automata where nodes of arity 3 are matched—one position represents the label of the matched unranked node, one the content list of this node, and one the following sibling node in the content list in which this node is contained. This is reflected in the hyper edges of arity 4, relating (1) the start state of the transition, (2) the label, (3) the start state of the child list and (4) the start state of the list of following siblings. The advantage of this approach is the arguably less bulky notation, as the node abstraction is not explicit. The disadvantage is the need to redefine many operations already available for ranked tree automata for hyper graph automata. From a practical point of view, the hyper graphs are arguably well suited as automata models for type checking of Xcerpt.

Deterministic vs. Non-Deterministic Automata It is possible to restrict tree automata to deterministic tree automata—deterministic tree automata always have exactly one matching transition from a given left hand side to a new state along a given node label, while non-deterministic automata may have more than one matching transition. The decision procedure for the membership test is simpler using deterministic automata, as all possible derivations (i.e. paths of a decision tree) lead to a successfully closed branch. If no derivation rule is applicable anymore, the data tree is not a member of the language represented by the deterministic automaton. With non-deterministic automata it is still possible that, earlier in the derivation tree, another decision (i.e. another choice of a transition) leads to a successfully closed branch. A deterministic algorithm checking membership using non-deterministic automata therefore has to retract choices at dead ends of the derivation tree, if further derivations are possible and membership has not been proved at that moment. The property of not having to retract derivation choices is called confluence. The decision procedure for the membership test on deterministic automata is a confluent system. While deterministic automata are favorable for membership testing, their creation from regular expressions may be of exponential complexity. Generation of non-deterministic finite automata based on regular expressions as language specification can be done in polynomial time.6 As R2G2 uses regular expressions to specify content models, the translation of R2G2 to deterministic finite automata may be of exponential complexity. Fortunately, all operations necessary for type checking, e.g. intersection, emptiness test and subset test, can be implemented with polynomial complexity directly on non-deterministic automata. This will be shown along this chapter as they are introduced.

6.2.2 Membership Test for a Tree using Hyper Graph Automata An algorithm able to test membership of unranked trees in a language represented by a hyper graph based automaton is presented. In contrast to the standard approach for ranked trees as shown in [21] and introduced earlier, this algorithm is able to operate directly on the unranked tree model, without prior transcription of data instances to a ranked tree representation. Calculus rules are used to explain the algorithm in a non-deterministic way. Rules are of the following shape:

C1   . . .   Cn
e1 : a1   . . .   en : an
───────────────────────── (EXAMPLE)
e : a

Ci denote constraints on e, a, ei, ai, and e, ei are trees or content lists of trees, i.e. sequences of trees that all share the same parent node. By a, ai either states or transitions of an automaton are denoted. An expression e : a will also be called a configuration of the automaton. The rules relate configurations of automata. Two different kinds of configuration exist: (1) configurations of shape t : τ where t is a tree and τ is a transition, or (2) [t1, . . . , tn] : S where [t1, . . . , tn] is a list of trees and S is a state of the automaton.

τ ∈ RA    (s, l, c, e) = τ
────────────────────────── (ROOT)
t : τ

The ROOT rule matches the root of the data tree if there is a transition in the set of root transitions from which a whole derivation tree can be found.

c ∈ FA
────── (END)
[] : c

The END rule accepts an empty list if the configuration involves an empty list and a state of the set of final states.

τ ∈ ∆A    τ = (s, l, c, e)    [t1, . . . , tn] : c
────────────────────────────────────────────── (NODE)
l[t1, . . . , tn] : τ

n ≥ 1    τ ∈ ∆A    τ = (s, l, c, e)    t1 : τ    [t2, . . . , tn] : e
──────────────────────────────────────────────────────────────── (LIST)
[t1, t2, . . . , tn] : s

6With respect to the size of the regular expression.


In a successful derivation, applications of the NODE and LIST rules are interwoven; all branches end with an application of the END rule, while the root of the derivation tree is an application of the ROOT rule. A tree without a possible derivation is not valid with respect to the given automaton; multiple derivations may exist.

Example of a Tree Recognition using a Hyper Graph Automaton Given the tree a[b[], b[a[b[], b[]]]] and the automaton A as presented in the former example (Figure 6.1), the following derivation is a possible recognition:
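The ROOT, NODE, LIST and END rules can also be read as a direct recursive procedure. The following sketch uses our own encoding (trees as (label, children) pairs, ε-transitions as pairs) and runs the rules on the automaton of Figure 6.1; it assumes the automaton has no ε-cycles:

```python
# Automaton of Figure 6.1 for the grammar A -> a[(B,B)+], B -> b[A*].
DELTA = {(0, 'a', 2, 1), (2, 'b', 5, 3), (3, 'b', 5, 4), (4, 2), (5, 'a', 2, 5)}
FINAL = {4, 5}
ROOTS = {(0, 'a', 2, 1)}

def tree_ok(tree, tau):
    """NODE rule: the tree's label matches tau and its content list is
    accepted starting from tau's content start state."""
    label, children = tree
    s, l, c, e = tau
    return l == label and list_ok(children, c)

def list_ok(trees, state):
    # END rule: an empty list is accepted in a final state.
    if not trees and state in FINAL:
        return True
    # ε-transitions (assumed ε-cycle-free in this sketch).
    for t in DELTA:
        if len(t) == 2 and t[0] == state and list_ok(trees, t[1]):
            return True
    # LIST rule: consume the first tree, continue with the end state.
    for t in DELTA:
        if len(t) == 4 and t[0] == state:
            if trees and tree_ok(trees[0], t) and list_ok(trees[1:], t[3]):
                return True
    return False

def accepts(tree):
    """ROOT rule: some root transition accepts the whole tree."""
    return any(tree_ok(tree, tau) for tau in ROOTS)

a = lambda *kids: ('a', list(kids))
b = lambda *kids: ('b', list(kids))
print(accepts(a(b(), b(a(b(), b())))))  # True: the tree a[b[], b[a[b[], b[]]]]
print(accepts(a(b())))                  # False: (B,B)+ requires pairs of b's
```

The ε-transition (4, 2) is what realizes the repetition of the (B,B) group.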


Operational Semantics of the Recognition Rules The rules presented above give an abstract description of a recognition algorithm. Neither the control flow nor decisions in case of ambiguity are captured by the rules. It is possible that an automaton can recognize a data tree using different derivations; a concrete algorithm should be designed to choose one of those derivations, possibly driven by other parameters. As an example, a simple algorithm choosing one rule is sketched now:

Algorithm 6.2.1: MEMBERSHIPTEST(A = (Q, Σ, ∆, S, F ), e : a, ρ)

comment: ρ denotes the set of rules
comment: e denotes either a tree or a list of trees
comment: either a ∈ Q or a ∈ ∆
comment: To check a tree t, call MEMBERSHIPTEST(A, t : s, ρ) with s ∈ S

for each rule in ρ with conclusion matching e : a, constraints C1, . . . , Cm
and premise configurations e1 : a1, . . . , en : an
  do if C1 ∧ · · · ∧ Cm = true
       and MEMBERSHIPTEST(A, e1 : a1, ρ) = true ∧ · · · ∧ MEMBERSHIPTEST(A, en : an, ρ) = true
     then return (true)
return (false)

An Upper Bound Complexity for the Membership Test of Tree Shaped Data Various ways of document validation with automata have been proposed in “Tree Automata Techniques and Applications” [21]. The non-deterministic bottom-up approach is easily adaptable to the presented approach. The upper complexity bound is polynomial in the number of nodes and the number of states. Applied to the former algorithm, this can be explained as follows:

1. On each node, at most all rules may have to be applied, each with all states or edges (depending on the rule—some apply to states, others to edges) to be checked. The number of rules is constant (there are 4 rules); the number of edges is linear in the size of the grammar, as shown in section 6.3.

2. A naive, top-down approach could easily result in a combinatorial search, yielding exponential complexity. A bottom-up approach is an easy way to overcome that:

(a) Each child node is reached up in a recursive application in all possible typings. Note that reaching up in the recursive application corresponds to reaching
• the typed sequence of nodes up to its parent node in applications of the NODE typing rule,
• the typed sequence of following siblings up to its direct preceding sibling in applications of the LIST typing rule.
(b) Only the typed contributions that can contribute to a successful typing in a given recursion are kept; at the same time, if they occur in different successful typings in the given recursion, they can be reused without recalculating them—their validity is independent of their context.

3. When checking a node and a type (state or transition), this type may have to be checked against all types returned by the recursive calls (at most two of them exist—one for the following sibling and one for the content model). As a typed node may have at most as many types as edges exist, this step is quadratic in the number of edges.


4. Hence, as a consequence of (1), each node has at most ‘number of edges’ types, and as a consequence of (2), it is not necessary to calculate the possible types of a node more than once. As a consequence of (3), each node takes at most quadratic time in the computation—this gives a polynomial complexity in the order of O(N × M²) where N is the size of the tree (the number of nodes) and M is the size of the automaton (the number of edges as an upper bound).

6.2.3 Recognition of a Rooted Graph using Hyper Graph Automata The recognition of rooted graphs is defined in analogy to the recognition of trees—a rooted graph is recognized by an automaton if it is in the language accepted by this automaton. There is a certain correlation between the acceptance of a word by an automaton and the recognition of equality of two words: for each word, it is easy to obtain an automaton such that exactly this word is accepted. Therefore, this automaton provides a way to decide equality of two words. The most precise way to judge equality of two graphs is graph isomorphism. The decision procedure for graph isomorphism has exponential complexity. Assuming that a recognition procedure for graphs is based on graph isomorphism—a graph is accepted if it is isomorphic to a graph in the language accepted by the automaton—then the whole process has to have exponential complexity, as otherwise graph isomorphism itself would have sub-exponential complexity (as it could be reformulated by means of graph recognition). A weaker kind of membership relation between graphs will be chosen: the simulation relation—a graph is accepted by an automaton if there is an instance in the language accepted by the automaton that simulates the graph.

A simulation preorder is a relation between [graphs] associating systems which be- have in the same way in the sense that one system simulates the other. Intuitively, a system simulates another system if it can match all of its moves.7

A first advantage of using the simulation preorder as the basis of the membership test in the recognition procedure is that the decision procedure can be achieved in polynomial time. A second advantage: Xcerpt is based on a non-standard unification called simulation unification, which itself is based on the simulation preorder. In [43] (especially section 4.4, “Query Evaluation: Ground Query Term Simulation”) the simulation preorder on so-called ground query terms—of which data trees and graphs are a subset—is presented, including complexity results. Ground Xcerpt query term simulation has been shown to be a useful relation between trees or graphs. Arguably, the simulation preorder reflects well a notion of “expected result” for many applications of querying Web and Semantic Web data.

(Possibly Infinite) Tree Representations of Graphs As the recognition procedure is defined for trees so far, it is arguably useful to base the handling of graph shaped data on tree recognition. Note, that trees are a special kind of graphs, so trivially those graphs are already handled. For directed, acyclic graphs it is always possible to find a spanning, finite tree, where nodes accessible from one node chosen as a root using different paths are duplicated in the tree representation. In general, a graph can be spanned by different trees, capturing different possible graph traversals. As the root in a rooted graph is fixed, there is only one possible such spanning tree. Cyclic graphs can conceptually be represented using infinite trees where infinite always means finite in breadth (a node in a finite graph can only have finite many successors, so can the corresponding node in the tree) and branches of infinite depth for cycles. By applying the tree approach for recognition on acyclic graphs, an algorithm is achieved that possibly checks the same nodes multiple times, but that always terminates. It is possible, that the same node is checked multiple times using different or the same automata transitions. Arguably it is reasonable to remember acceptance results of nodes with corresponding transitions, as not only the testing of validity of a certain node in a context can be omitted, but also the testing of all

7 From http://en.wikipedia.org/wiki/Simulation_preorder.

child nodes can be skipped. The process of remembering the results of earlier computations in this way is called memoization. By applying the tree approach to cyclic graphs, the recognition process gets stuck in non-termination. However, the memoization extension just explained guarantees termination: any data tree node can in the worst case only be tested against finitely many transitions of the automaton, as the automata are finite.

An Algorithm for the Recognition of Rooted Graphs The former algorithm for tree recognition is now extended by memoization to recognize rooted directed graphs. A graph is recognized by an automaton A if it is simulated by a graph in the language L(A). The set of rules is not affected by this change, only the algorithm for the application of the rules to a given data tree. This emphasizes the declarative nature of the rules and the fact that, conceptually, trees and graphs are handled in a similar way:

Algorithm 6.2.2: MEMBERSHIPTEST(A = (Q, Σ, ∆, S, F ), e : a, ρ)

global memo
comment: ρ denotes the set of rules
comment: e denotes either a graph node or an adjacency list of graph nodes
comment: either a ∈ Q or a ∈ ∆
comment: to check a graph t, call MEMBERSHIPTEST(A, t : s, ρ) with s ∈ S and memo = {}

if e : a ↦ b ∈ memo
  then return (b)
memo ← memo ∪ {e : a ↦ true}
for each rule in ρ with conclusion e : a, premises e1 : a1, . . . , en : an and side conditions C1, . . . , Cm
  do if C1 ∧ · · · ∧ Cm = true
        ∧ MEMBERSHIPTEST(A, e1 : a1, ρ) = true
        ∧ · · · ∧ MEMBERSHIPTEST(A, en : an, ρ) = true
     then return (true)
memo ← memo \ {e : a ↦ true}
memo ← memo ∪ {e : a ↦ false}
return (false)

An Upper Bound Complexity for the Membership Test of Graph Shaped Data The membership test for graph shaped data presented here is thus an extension of the membership test for tree shaped data: the result of the validation of nodes is memoized and used to end the validation of cyclic structures. The complexity of the membership test of a node in a graph shaped document is hence the same as the complexity of a node validation for tree shaped data (see the end of section 6.2) plus the cost of memoizing each node's validation result. If the data structure used to represent nodes can store meta data of the validation process, the overhead is constant, hence graph validation based on simulation has the same complexity as validation of tree shaped data. If, as in the pseudo code example above, a look-up table is used for memoization, the cost for updating and reading the look-up table has to be added to the cost of the validation of a node. An appropriate table data structure provides this with at most logarithmic (in the number of nodes in the data graph) overhead, hence the cost of graph validation with look-up table based memoization is the same as for tree shaped data (section 6.2): polynomial in the size of the data graph.
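As a concrete illustration of the memoized recognition just described, the following Python sketch checks a (possibly cyclic) rooted graph against a hyper-edge automaton. The Node class, the encoding of transitions as quadruples (s, l, c, e) and all state names are illustrative assumptions, not the thesis' exact data structures; cycles are cut by tentatively memoizing a check as true before recursing.

```python
class Node:
    """A data graph node; children may contain back references (cycles)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

def accepts_breadth(nodes, state, delta, final, memo):
    """Check whether the sibling list `nodes` is accepted from `state`.

    delta: hyper edges (s, l, c, e); final: set of final states.
    memo is keyed by (node ids, state); a check is tentatively memoized
    as true before recursing, which cuts cycles in the data graph."""
    key = (tuple(id(n) for n in nodes), state)
    if key in memo:
        return memo[key]
    memo[key] = True
    if not nodes:
        result = state in final
    else:
        head, rest = nodes[0], nodes[1:]
        result = any(
            accepts_breadth(head.children, c, delta, final, memo)
            and accepts_breadth(rest, e, delta, final, memo)
            for (s, l, c, e) in delta
            if s == state and l == head.label
        )
    memo[key] = result
    return result

# Automaton for a[ b[]* ]: an `a` node with any number of `b` children.
delta = {(0, 'a', 2, 1), (2, 'b', 3, 2)}
final = {1, 2, 3}
doc = Node('a', [Node('b'), Node('b')])
print(accepts_breadth([doc], 0, delta, final, {}))   # True

# A cyclic document (a node that is its own child) also terminates:
cyc = Node('a')
cyc.children = [cyc]
print(accepts_breadth([cyc], 0, {(0, 'a', 0, 1)}, {0, 1}, {}))   # True
```

The tentative true default corresponds to the relaxed acceptance of cyclic data discussed above: a cycle in the data that matches a cycle in the automaton is accepted.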


6.3 A Calculus Relating Automata and R2G2

While the presented automaton model is well suited as an execution model for the recognition of trees and graphs against regular rooted graph languages, it is not convenient as a language definition formalism for the end user, e.g. for document schema authors and programmers. XML Schema, DTD, Relax NG and R2G2 are arguably appropriate formalisms for this task. The following set of rules describes an algorithm for the generation of an automaton, as formerly introduced, for a given R2G2 instance. The rules are strictly defined along the structure of the R2G2 syntax, hence an implementation could be a recursive descent function along the abstract syntax tree of an R2G2 instance. Adapting the algorithm to Relax NG or DTD as input languages is not difficult: the rules have to be adapted to the abstract syntax components of those languages. Adapting the algorithm to XML Schema as input language requires some additional processing due to its object oriented modelling features, but application to the tree grammar based core of XML Schema is comparable to the translation of R2G2. The rules have a local aspect, in that they are driven by the abstract syntax tree and there is no context sensitive property affecting the applicability of the rules. In addition to the local behaviour of the rules with respect to the abstract syntax tree, two global environments are altered and queried by the algorithm: the automaton, constructed while processing the abstract syntax tree, which at the end accepts exactly the language generated by the grammar, and a relation relating type names of the grammar to labels and start states of the automata parts implementing the content model of the given type name. In an implementation of the rules, it is possible to split the context sensitive processing (look-up of automaton components for type name definitions) and the context free processing (i.e. the recursive descent along the abstract syntax tree of the R2G2 definitions) into different phases. The rules have the following general structure:

∆A    ∆L    t1 ↦ (i1, o1)   · · ·   tn ↦ (in, on)
─────────────────────────────────────────────── (EXAMPLE)
T(t1, . . . , tn) ↦ (i, o)

By T(t1, . . . , tn) a term of the structure of the R2G2 syntax definition is denoted; the ti are its sub-terms. The rules hence relate the necessary automata construction operations with the automata construction operations of the sub-terms. Rule applications are functional, mapping to a tuple of states, denoted by (i, o) or (ii, oi). Some states in the tuples above the line may occur in the tuple below the line, but the tuples above the line, i.e. the tuples of the applications to the sub-terms ti, are independent of each other. In situations where the states are irrelevant, the tuple ε is used; an implementation may return any tuple of states or nothing here (i.e. null values). By ∆A, a condition on the automaton A is denoted. This may imply editing the automaton in an implementation. The conditions are always of the form A′ ⊆ A: they are always positive statements about components of the automaton, never negative ones, hence a concrete implementation of the algorithm in an imperative programming language can implement the condition as an addition of the components of A′ to the components of A. By ∆L, querying or altering the global environment L is denoted. L ⊆ N × Σ × S is a ternary relation between non-terminals as found in the R2G2 instance, symbols or term labels, and automata states. As type names on the right hand side of an R2G2 rule may be used even if their declaration (i.e. the rule with that type name on the left hand side) is still pending, more than one pass over the abstract syntax tree of the R2G2 instance is necessary in an implementation of the algorithm. The rule based abstraction of the algorithm neglects the necessity of multiple passes.

({}, {}, {(s, l, c, e)}, {e}, {}) ⊆ A    L(Nr) = (?, lr, cr, ?)    ρj = element Nj → l[· · ·]    A | ρ1 ↦ (i1, o1)   · · ·   A | ρn ↦ (in, on)
─────────────────────────────────────────────── (GRAMMAR)
A | ρ1, · · · , ρn, root = Nr ↦ (?, ?)

The automata construction rule for grammars describes the relationship between the results of automata construction rule applications for all R2G2 rules in a given grammar. The result state tuple of an application of this rule has no meaning and is hence undefined; an implementation may safely return e.g. null values or any dummy tuple here. The result state tuples of the automata construction algorithm applied to the rules (which have neither use nor meaning, as will be shown in the automata construction rule for R2G2 rules) are not considered for any construction and can hence be ignored. Apart from recursively applying automata construction to all rules, the root declarations are treated in the grammar case.

({s, e}, {l}, {(s, l, sre, e)}, {}, {ere}) ⊆ A    L(N) = (l, sre)    A | re ↦ (sre, ere)
─────────────────────────────────────────────── (RULE)
A | element N → l[ re ] ↦ (s?, e?)

An application of the automata construction rule to an R2G2 rule intentionally results in a meaningless state tuple, which is no problem, as the tuple is not used; recall that these rule applications may only occur in the context of a grammar rule application, which ignores the result state tuple anyway. The look-up table or global environment L relates the given type name on the left hand side of the R2G2 rule to the label of the type term on the right hand side and to the start state of the automaton part realizing the content model of the right hand side type term. For an application of the automaton construction rule to an R2G2 rule, the automaton contains an edge (s, l, sre, e) with two states s and e not introduced elsewhere, and the state ere is declared as a final state.

({s, e}, {l}, {(s, l, sre, e)}, {}, {ere}) ⊆ A    A | re ↦ (sre, ere)
─────────────────────────────────────────────── (TYPETERM)
A | l[ re ] ↦ (s, e)

The type term automata construction rule is similar to the previously introduced "Rule" rule, except that the look-up table is not involved, as a type term by itself is not associated to a non-terminal or type name. Recall that type terms may occur on the right hand side of a rule, and also as child terms of type terms; the first case is caught by applications of the "Rule" rule, the second case by applications of the "Type term" rule. The result tuple consists of two new states.

({s, e}, {}, {(s, l, c, e)}, {}, {}) ⊆ A    L(N) = (l, c)
─────────────────────────────────────────────── (TYPENAME)
A | N ↦ (s, e)

The "Typename" rule is applicable to type names occurring in type terms. For each such type name, the label and the content model start state introduced by the corresponding application of the "Rule" rule are used to construct an edge from a new state s to a new state e. The new states form the result tuple of the rule application.

({}, {}, {(ere1, sre2)}, {}, {}) ⊆ A    A | re1 ↦ (sre1, ere1)    A | re2 ↦ (sre2, ere2)
─────────────────────────────────────────────── (RESEQ)
A | re1, re2 ↦ (sre1, ere2)

The automaton construction for a sequence of two regular expressions (recursively, of arbitrarily many expressions) is handled by a "ReSeq" rule application. The result tuples of

the automaton construction rule applications to the two regular expressions to be sequenced are used to (1) return a result tuple consisting of the first state of the first expression's tuple and the last state of the second expression's tuple, and (2) tie the two automata parts of the two expressions together using an ε-edge. All the automata construction rules for regular expressions are similar: the result of the automata construction of the sub-expressions is connected in some way using ε-edges, sometimes without involving new states. The explanation of the rules is presented in a graphical way from now on. In this visualisation, the automaton part being constructed is represented as a diamond shaped polygon, sub-automata likewise, contained inside the bounds of the automaton. The first state of the tuple is depicted as a black circle, the second as a little black square. ε-edges are depicted by dotted arrows. The text in the graphics has a merely documentary character.

({s, e}, {}, {(s, sre1), (s, sre2), (ere1, e), (ere2, e)}, {}, {}) ⊆ A    A | re1 ↦ (sre1, ere1)    A | re2 ↦ (sre2, ere2)
─────────────────────────────────────────────── (REDISJ)
A | re1 | re2 ↦ (s, e)

({}, {}, {(sre, ere)}, {}, {}) ⊆ A    A | re ↦ (sre, ere)
─────────────────────────────────────────────── (REOPT)
A | re? ↦ (sre, ere)

({}, {}, {(ere, sre)}, {}, {}) ⊆ A    A | re ↦ (sre, ere)
─────────────────────────────────────────────── (REPLUS)
A | re+ ↦ (sre, ere)

({}, {}, {(sre, ere), (ere, sre)}, {}, {}) ⊆ A    A | re ↦ (sre, ere)
─────────────────────────────────────────────── (REKLEENE)
A | re* ↦ (sre, ere)
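The ε-edge constructions of the rules ReSeq, ReDisj, ReOpt, RePlus and ReKleene can be sketched in Python as follows. The nested-tuple AST encoding ('seq', ...), the integer state numbering and the edge representation are illustrative assumptions of this sketch, not part of R2G2 itself:

```python
import itertools

_fresh = itertools.count()

def new_state():
    return next(_fresh)

def build(re, edges):
    """Return (start, end) states for the regular expression AST `re`;
    labelled edges are triples (s, label, e), epsilon edges pairs (s, e)."""
    kind = re[0]
    if kind == 'sym':                  # one labelled transition
        s, e = new_state(), new_state()
        edges.add((s, re[1], e))
        return s, e
    if kind == 'seq':                  # ReSeq: one epsilon edge ties the parts
        s1, e1 = build(re[1], edges)
        s2, e2 = build(re[2], edges)
        edges.add((e1, s2))
        return s1, e2
    if kind == 'disj':                 # ReDisj: two new states, four eps edges
        s, e = new_state(), new_state()
        for sub in re[1:]:
            si, ei = build(sub, edges)
            edges.add((s, si))
            edges.add((ei, e))
        return s, e
    if kind == 'opt':                  # ReOpt: one epsilon edge skips the body
        s, e = build(re[1], edges)
        edges.add((s, e))
        return s, e
    if kind == 'plus':                 # RePlus: one epsilon edge loops back
        s, e = build(re[1], edges)
        edges.add((e, s))
        return s, e
    if kind == 'star':                 # ReKleene: skip edge plus loop edge
        s, e = build(re[1], edges)
        edges.add((s, e))
        edges.add((e, s))
        return s, e
    raise ValueError(kind)

edges = set()
start, end = build(('seq', ('sym', 'a'), ('star', ('sym', 'b'))), edges)
print(len(edges))   # 5 edges for "a b*": two labelled, three epsilon
```

Each rule application adds a constant number of edges, which is the basis of the linear size bound argued below.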

An Upper Bound Complexity for the Non-Deterministic Automaton Generated out of an R2G2 Instance The cost of constructing an automaton out of a grammar is polynomial in the size of the grammar. The size of the resulting automaton, measured in the number of edges, is in the order of O(N) for N the size of the grammar (i.e. the number of abstract syntax items of the parsed grammar); the time complexity is O(N) as well. To reason about the complexity, it is useful to evaluate the cost of each automaton generation rule. The total cost is then the sum of all generation rule applications, as each rule application consumes a part of the input grammar and no backtracking is needed (the choice of the applicable rule is unambiguous).

ReDisj A disjunction of two regular expressions adds the cost of four ε-edges to the cost of the translation of the two regular expressions. It can be assumed that construction of an edge has constant complexity.

ReOpt An optional regular expression adds the cost of one ε-edge to the cost of the translation of the regular expression.

RePlus A plus-adorned regular expression adds the cost of one ε-edge to the cost of the translation of the regular expression.

ReKleene A Kleene-star-adorned regular expression adds the cost of one ε-edge to the cost of the translation of the regular expression.

ReSeq Sequencing two regular expressions adds the cost of one ε-edge to the cost of the two regular expressions.

Typename Translating a type name to an automaton component costs as much as the construction of an ε-edge. Looking up the target of the edge in the look-up table can be assumed to be possible in at most logarithmic time with respect to the number of type names, using an appropriate data structure, e.g. a search tree or hash table.

Typeterm Translating a type term corresponds to the construction of an edge. It can be assumed that construction of the edge has constant complexity, which adds to the cost of translating the regular expression for the content model.

Rule A rule constructs no new content, but has to alter the look-up table. Again, with an appropriate data structure this can be assumed to be possible in at most logarithmic time with respect to the number of type names, which adds to the cost of translating the right hand side type term.

Grammar Translating a grammar into an automaton costs as much as adding up the costs of the translation of all rules.

6.4 Some Set Theoretic Computations on Hyper Graph Automata

For the purpose of static type checking, it is necessary to analyse some set properties of languages, namely (1) emptiness of a language, (2) the subset property between two languages, and (3) the intersection of two languages.

6.4.1 The Emptiness Test

The emptiness test determines whether, for an automaton, there is any data instance accepted by this automaton, i.e. whether the language accepted by the automaton is non-empty. For an automaton to accept finite trees, it is obviously necessary to find paths along the hyper edges ending in final states. For infinite trees or graphs containing loops, this property can be relaxed, as such data instances can be accepted by loops without a final state in the automaton. Automata constructed from R2G2 definitions arguably always accept non-empty languages, for three reasons: (1) they have a root transition by definition (based on the mandatory root declaration); (2) along the breadth axis of the automata there is always either an end state at the end of each path or the path is a loop containing an end state, because the last state constructed by regular expression decomposition is always an end state, or a looping ε-edge is added to an end state terminated path to express repetition; (3) along the depth axis there is either an end state due to empty content, or a transition to a state representing another grammar rule, and this state again is part of a non-empty breadth axis and either of a final state terminated depth axis or of a depth axis recursively fulfilling reason (3). A depth axis loop without a final state can therefore only accept infinite trees or graphs containing an appropriate loop. Nevertheless, automata representing empty languages do occur in practice: the intersection of two automata can lead to an automaton accepting the empty language, e.g. by construction of an automaton without a start transition, or of an automaton whose root transitions have all outgoing edges ending only in branches without loops and final states. Detection of automata representing empty languages is important for type checking. An algorithm for the detection of emptiness for a given automaton is sketched now:


Algorithm 6.4.1: ISEMPTY(A = (Q, Σ, ∆, S, F ))

memoisation ← create a look-up table of truth values indexed over Q
comment: memoisation is defined in each call of ISEMPTY().

procedure RECURSIVETRANSITIONTEST(δ = (s, l, c, e))
  return (DEPTHTEST(c) ∧ BREADTHTEST(e))

procedure RECURSIVETRANSITIONTEST(δ = (s, e))    comment: handling of ε-transitions
  return (BREADTHTEST(e))

procedure BREADTHTEST(v)
  if v ∈ memoisation
    then return (memoisation[v])
  memoisation[v] ← true
  for (v, l, c, e) ∈ ∆
    do if RECURSIVETRANSITIONTEST((v, l, c, e)) = false
         then memoisation[v] ← false
              return (false)
  return (true)

procedure DEPTHTEST(v)
  if v ∈ memoisation
    then return (memoisation[v])
  memoisation[v] ← false
  for (v, l, c, e) ∈ ∆
    do if RECURSIVETRANSITIONTEST((v, l, c, e)) = false
         then memoisation[v] ← false
              return (false)
  memoisation[v] ← true
  return (true)

main
  for δ ∈ S
    do if RECURSIVETRANSITIONTEST(δ) = false
         then return (false)
  return (true)

An Upper Bound Complexity for the Emptiness Test The complexity of the emptiness test is linear in the size of the automaton. This follows from the fact that the graph structure of the automaton is traversed to check whether the paths in depth and breadth are closed. When the graph traversal reaches a state already visited, the memoization stops the recursion at the (apparently cyclic) branch and returns the value of the last computation at this state; hence no edge is traversed twice. The cost for the look-up of the memoized states is logarithmic in the size of the automaton, but if the states are extended with some means of annotating the traversal, the look-up can be replaced by a constant time check of the annotation.
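A possible rendering of the emptiness test in Python is sketched below, restricted for simplicity to finite-tree acceptance (a least fixpoint with a pessimistic memo default); the relaxed acceptance of cyclic data discussed above would use an optimistic default instead. The encoding of transitions as quadruples and of the automaton components as plain sets is an illustrative assumption:

```python
def is_empty(delta, start_transitions, final):
    """Emptiness test on a hyper-edge automaton (transitions (s, l, c, e)),
    restricted to finite-tree acceptance: a transition contributes only if
    both its content and its breadth continuation can reach final states."""
    memo = {}

    def breadth_ok(v):
        # true iff from state v some accepted continuation exists
        if v in memo:
            return memo[v]
        memo[v] = False                     # pessimistic default cuts cycles
        ok = v in final or any(
            breadth_ok(c) and breadth_ok(e)
            for (s, l, c, e) in delta if s == v
        )
        memo[v] = ok
        return ok

    return not any(breadth_ok(c) and breadth_ok(e)
                   for (s, l, c, e) in start_transitions)

# root A; element A = a[ ]  has a non-empty language:
print(is_empty({(0, 'a', 2, 1)}, {(0, 'a', 2, 1)}, {1, 2}))   # False
# root A; element A = a[ A ]  accepts only infinite trees:
print(is_empty({(0, 'a', 0, 1)}, {(0, 'a', 0, 1)}, {1}))      # True
```

The memoization makes the traversal visit each state at most once, matching the linear bound argued above.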

6.4.2 Intersection of Regular Rooted Graph Automata Calculating the intersection of two regular languages is a common exercise in text books on theoretical computer science and automata theory, and it is also of high practical use. Given e.g.

two language definitions for two versions of a data format, the intersection reflects a kind of conservative transitional data format providing guaranteed backward and forward compatibility. In type checking of the Xcerpt query language, non-empty intersection can play an important role for checking selection constructs: given a query with multiple occurrences of the same variable, the occurrences may have different types. If the types have empty intersection, no data exists conforming to the type constraints of the variables; the selection can therefore never select any valid data with respect to the types and is arguably useless. Note that different type annotations may occur either due to a query programmer's annotation or due to type inference. Checking the consistency of such concurrent type annotations in Xcerpt query terms can thus also be handled using an emptiness test on the type intersection.

Intersection of Regular (String) Languages Using DFAs The presented approach is a classical text book approach as found in [25] and [44]. It serves as an introduction to a technique of calculating intersections and will be modified first for non-deterministic and then for regular graph automata. It would be easy to construct the intersection of L1 and L2 if union and complement were defined, as L1 ∩ L2 is, by De Morgan's law, the complement of the union of the complements of L1 and L2. A direct construction is presented instead, as neither union nor complement has been presented so far, and neither is strictly necessary for the type checking of Xcerpt later on. A direct construction is achieved by simulating the parallel execution of the two deterministic finite automata representing L1 and L2. This corresponds to the construction of the product automaton: Let deterministic automata be defined as 5-tuples (Q, Σ, ∆, s, F) with Q as the states of the automaton, Σ as the alphabet of the corresponding language, s the start state and F ⊆ Q as the final states. The transitions ∆ ⊆ Q × Σ × Q are defined such that for every v ∈ Q and every l ∈ Σ there is a transition (v, l, v′) ∈ ∆ and no other transition (v, l, v″) ∈ ∆ with v′, v″ ∈ Q and v′ ≠ v″. For L1 accepted by A1 = (Q1, Σ1, ∆1, s1, F1) and L2 accepted by A2 = (Q2, Σ2, ∆2, s2, F2), the intersection L1 ∩ L2 is accepted by A∩ = (Q1 × Q2, Σ1 ∩ Σ2, ∆∩, (s1, s2), F1 × F2) where ∆∩((p, p′), a, (q, q′)) = (∆1(p, a, q), ∆2(p′, a, q′)). See Figure 6.2 for an example of the product of two automata. An algorithm for the construction of an automaton A∩ = (Q∩, Σ∩, ∆∩, s∩, F∩) from two automata A1 = (Q1, Σ1, ∆1, s1, F1) and A2 = (Q2, Σ2, ∆2, s2, F2) is presented now:

Algorithm 6.4.2: INTERSECTIONDFA(A1,A2)

Σ∩ ← Σ1 ∩ Σ2
s∩ ← (s1, s2)
Q∩ ← Q1 × Q2
F∩ ← {(v1, v2) ∈ Q∩ | v1 ∈ F1 ∧ v2 ∈ F2}
for (v1, l, v1′) ∈ ∆1
  do for (v2, l, v2′) ∈ ∆2
       do ∆∩ ← ∆∩ ∪ {((v1, v2), l, (v1′, v2′))}
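Algorithm 6.4.2 can be rendered in Python roughly as follows; the tuple encoding of automata and the helper accepts used to exercise the product are illustrative assumptions of this sketch:

```python
def intersection_dfa(a1, a2):
    """Product construction; automata are tuples
    (Q, Sigma, delta, start, finals) with delta a set of triples."""
    q1, sig1, d1, s1, f1 = a1
    q2, sig2, d2, s2, f2 = a2
    return ({(v1, v2) for v1 in q1 for v2 in q2},
            sig1 & sig2,
            {((v1, v2), l, (w1, w2))
             for (v1, l, w1) in d1 for (v2, l2, w2) in d2 if l == l2},
            (s1, s2),
            {(v1, v2) for v1 in f1 for v2 in f2})

def accepts(dfa, word):
    q, sig, d, s, f = dfa
    trans = {(v, l): w for (v, l, w) in d}
    state = s
    for sym in word:
        if (state, sym) not in trans:
            return False              # partial DFA: a missing edge rejects
        state = trans[(state, sym)]
    return state in f

# a*b intersected with (a|b)*b is a*b:
a1 = ({1, 2}, {'a', 'b'}, {(1, 'a', 1), (1, 'b', 2)}, 1, {2})
a2 = ({3, 4}, {'a', 'b'},
      {(3, 'a', 3), (3, 'b', 4), (4, 'a', 3), (4, 'b', 4)}, 3, {4})
prod = intersection_dfa(a1, a2)
print(accepts(prod, "aab"))   # True
print(accepts(prod, "aba"))   # False
```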

Extending the Approach to Non-Deterministic Finite Automata The presented approach has the drawback of requiring deterministic automata, which may be exponentially larger than a corresponding non-deterministic automaton. The automata focused on in this thesis are usually non-deterministic. Fortunately, the approach can be extended to non-deterministic automata without exponential blowup in time or space. The difference between deterministic and non-deterministic automata in a nutshell is that (1) non-deterministic automata may have spontaneous state transitions along so-called ε-edges without consumption of an input symbol, and (2) while in deterministic automata each symbol in Σ has exactly one outgoing transition from each state, any number of such edges may occur in the non-deterministic case. Let non-deterministic automata be defined as 5-tuples (Q, Σ, ∆, s, F) with Q as the states of the automaton, Σ as the alphabet of the corresponding language, s the start state and F ⊆ Q as



Figure 6.2: The product automaton on the right accepts the intersection of the language of the two automata on the left.

the final states. The transitions are defined as ∆ ⊆ ((Q × Σ × Q) ∪ ∆ε) with ∆ε ⊆ (Q × Q). To simulate the parallel execution of two automata in a product automaton with an ε-transition (a, e) in one automaton, it is necessary to provide an ε-edge from every product state (a, v) to the corresponding state (e, v). This reflects the possibility of a spontaneous transition every time the first automaton is in state a, independent of the state of the other automaton. To handle an arbitrary number of equally labelled edges from a state, no further change is necessary, as the deterministic algorithm already relates all edges of one automaton with all edges of the other one, as long as the transition labels match. In the deterministic case, by definition only one edge per state and label exists, so the same algorithm behaves as defined for the deterministic case. An algorithm for the construction of an automaton A∩ = (Q∩, Σ∩, ∆∩, s∩, F∩) from two non-deterministic automata A1 = (Q1, Σ1, ∆1, s1, F1) and A2 = (Q2, Σ2, ∆2, s2, F2) is presented now:

Algorithm 6.4.3: INTERSECTIONNFA(A1,A2)

Σ∩ ← Σ1 ∩ Σ2
s∩ ← (s1, s2)
Q∩ ← Q1 × Q2
F∩ ← {(v1, v2) ∈ Q∩ | v1 ∈ F1 ∧ v2 ∈ F2}
for (v1, l, v1′) ∈ ∆1
  do for (v2, l, v2′) ∈ ∆2
       do ∆∩ ← ∆∩ ∪ {((v1, v2), l, (v1′, v2′))}
for (v1, v1′) ∈ ∆ε1
  do for v2 ∈ Q2
       do ∆∩ ← ∆∩ ∪ {((v1, v2), (v1′, v2))}
for (v2, v2′) ∈ ∆ε2
  do for v1 ∈ Q1
       do ∆∩ ← ∆∩ ∪ {((v1, v2), (v1, v2′))}
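Algorithm 6.4.3 can be sketched analogously; here labelled edges and ε-edges are kept in separate sets (an illustrative encoding), and a small ε-closure based acceptance test is included to exercise the construction:

```python
def intersection_nfa(a1, a2):
    """Product construction for NFAs; automata are tuples
    (Q, Sigma, labelled edges, epsilon edges, start, finals)."""
    q1, sig1, d1, e1, s1, f1 = a1
    q2, sig2, d2, e2, s2, f2 = a2
    d = {((v1, v2), l, (w1, w2))
         for (v1, l, w1) in d1 for (v2, l2, w2) in d2 if l == l2}
    # an epsilon move of one automaton is paired with every state of the other
    eps = {((v1, v2), (w1, v2)) for (v1, w1) in e1 for v2 in q2} | \
          {((v1, v2), (v1, w2)) for (v2, w2) in e2 for v1 in q1}
    return ({(v1, v2) for v1 in q1 for v2 in q2}, sig1 & sig2,
            d, eps, (s1, s2), {(v1, v2) for v1 in f1 for v2 in f2})

def eps_closure(states, eps):
    todo, seen = list(states), set(states)
    while todo:
        v = todo.pop()
        for (a, b) in eps:
            if a == v and b not in seen:
                seen.add(b)
                todo.append(b)
    return seen

def accepts(nfa, word):
    q, sig, d, eps, s, f = nfa
    cur = eps_closure({s}, eps)
    for sym in word:
        cur = eps_closure({w for (v, l, w) in d if v in cur and l == sym}, eps)
    return bool(cur & f)

# a+ intersected with (a|b)+ yields a+:
a1 = ({1, 2}, {'a', 'b'}, {(1, 'a', 2)}, {(2, 1)}, 1, {2})
a2 = ({3, 4}, {'a', 'b'}, {(3, 'a', 4), (3, 'b', 4)}, {(4, 3)}, 3, {4})
prod = intersection_nfa(a1, a2)
print(accepts(prod, "aa"))   # True
print(accepts(prod, "b"))    # False
```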

Extending the Approach to Graph Automata The main difference between string and graph automata is the shape of the transitions: triples for string automata and quadruples for graph automata. Fortunately, the calculation of an automaton accepting the language intersection of two graph automata is easily derived from the string automaton case. Informally, the only difference is the handling of the third state.


Algorithm 6.4.4: INTERSECTIONNDFTA(A1,A2)

Σ∩ ← Σ1 ∩ Σ2
let (a1, l1, c1, e1) = s1
let (a2, l2, c2, e2) = s2
s∩ ← ((a1, a2), l, (c1, c2), (e1, e2))
Q∩ ← Q1 × Q2
F∩ ← {(v1, v2) ∈ Q∩ | v1 ∈ F1 ∧ v2 ∈ F2}
for (va1, l, vc1, ve1) ∈ ∆1
  do for (va2, l, vc2, ve2) ∈ ∆2
       do ∆∩ ← ∆∩ ∪ {((va1, va2), l, (vc1, vc2), (ve1, ve2))}
for (v1, v1′) ∈ ∆ε1
  do for v2 ∈ Q2
       do ∆∩ ← ∆∩ ∪ {((v1, v2), (v1′, v2))}
for (v2, v2′) ∈ ∆ε2
  do for v1 ∈ Q1
       do ∆∩ ← ∆∩ ∪ {((v1, v2), (v1, v2′))}

Figure 6.3 illustrates the cross product of two automata. The automata accept the languages defined by the grammar (for the upper left automaton) root A; element A = a[ A* ];

and the grammar (for the lower left automaton) root A; element A = a[ A|B ]; element B = b[ ];

The resulting automaton accepts the language represented for example by the grammar root A; element A = a[ A ];

The resulting intersection automaton contains some unreachable states and transitions, which could easily be removed using a minimization algorithm or simply by applying a reachability algorithm. As this is not essential to the tractability of type checking later on, automata minimization will not be considered.
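The reachability clean-up mentioned above might be sketched as follows; the quadruple edge encoding is an illustrative assumption, and ε-edges are omitted for brevity:

```python
def prune_unreachable(delta, start_transitions):
    """Keep only states reachable from the root transitions along the
    content (third) and breadth (fourth) components of the hyper edges."""
    reachable = set()
    todo = [st for tr in start_transitions for st in (tr[0], tr[2], tr[3])]
    while todo:
        v = todo.pop()
        if v in reachable:
            continue
        reachable.add(v)
        for (s, l, c, e) in delta:
            if s == v:
                todo.extend([c, e])
    return {tr for tr in delta if tr[0] in reachable}, reachable

delta = {(0, 'a', 2, 1), (2, 'b', 3, 2), (7, 'c', 8, 9)}  # 7, 8, 9 unreachable
pruned, states = prune_unreachable(delta, {(0, 'a', 2, 1)})
print(sorted(states))   # [0, 1, 2, 3]
```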

An Upper Bound Complexity for Intersection of two NDFTA The complexity of calculating the intersection corresponds to calculating the cross product of the two automata A1 and A2. For M = |A1| and N = |A2| as the sizes of the automata (e.g. the number of edges), this gives a time and memory complexity in the order of O(M × N).

6.4.3 Automata Based Subset Test for two Regular Rooted Graph Languages Given a regular language, testing whether it is a subset of another regular language is an important task, e.g. in type checking. If it is possible to infer the type of a variable used in the output or construction part of a query (maybe the type is implied by a selection), this variable is well typed with respect to a given type if the inferred type is a subtype of the type given by the programmer. Another very practical use case is schema checking for special document schemata: if one wants to write a schema for HTML documents of a certain shape, e.g. a web page supporting the corporate look and feel by using certain navigation elements, testing that this schema represents a subset of HTML is desirable.



Figure 6.3: The product automaton on the right accepts the intersection of the graph language of the two automata on the left.


The approach to subset testing presented here is based on the simulation preorder between two automata. A simulation preorder is a relation between state transition systems associating systems which behave in the same way, in the sense that one system simulates the other. Formally, given a state transition system with states S, a simulation preorder is a binary relation R ⊆ S × S such that if (p, q) ∈ R, then for each transition (p, a, p′) there is a transition (q, a, q′) such that (p′, q′) ∈ R. For string language automata (DFAs or NFAs) A1 and A2, the simulation preorder is specialised in such a way that A1 and A2 are in simulation preorder (written A1 ⪯ A2 later on) if each initial state of A1 is simulated by an initial state of A2 and for each final state of A1 there is a final state in A2 by which it is simulated. For automata defined as A = (S, T, F, s0, Σ), where S is the set of states, T ⊆ S × Σ × S is the set of transitions, s0 ∈ S is the start state and F ⊆ S is the set of final states, the definition of the simulation preorder over label equality can be written as:

A1 ⪯ A2 iff ∀(s, l, e) ∈ TA1 ∃(s′, l′, e′) ∈ TA2 . (s, l, e) ⪯ (s′, l′, e′)

(s, l, e) ⪯ (s′, l′, e′) iff l = l′ ∧ ∀(e, m, f) ∈ TA1 ∃(e′, m′, f′) ∈ TA2 . (e, m, f) ⪯ (e′, m′, f′)
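The transition-wise simulation check defined above can be sketched for string automata as follows; the coinductive treatment of loops (a pair is tentatively assumed to simulate on re-entry) and the omission of the final-state side condition are simplifications of this sketch:

```python
def simulates(t1, t2, d1, d2, memo=None):
    """Transition-wise simulation: t1 is simulated by t2 if the labels agree
    and every successor transition of t1 is simulated by some successor
    transition of t2. On re-entry of a pair, simulation is assumed to hold
    (a coinductive reading), which makes loops terminate."""
    if memo is None:
        memo = {}
    key = (t1, t2)
    if key in memo:
        return memo[key]
    memo[key] = True
    (s1, l1, e1), (s2, l2, e2) = t1, t2
    ok = l1 == l2 and all(
        any(simulates(n1, n2, d1, d2, memo) for n2 in d2 if n2[0] == e2)
        for n1 in d1 if n1[0] == e1
    )
    memo[key] = ok
    return ok

def automaton_simulated(d1, starts1, d2, starts2):
    return all(any(simulates(t1, t2, d1, d2) for t2 in starts2)
               for t1 in starts1)

# a+ (one looping a-transition) is simulated by (a|b)+, but not vice versa:
d1 = {(1, 'a', 1)}
d2 = {(2, 'a', 2), (2, 'b', 2)}
print(automaton_simulated(d1, d1, d2, d2))   # True
print(automaton_simulated(d2, d2, d1, d1))   # False
```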

Extending the definition of automata simulation to graph automata simulation is straightforward: the recursive ⪯ condition is tested along both dimensions of the hyper edges as used in the tree automata:

A1 ⪯ A2 iff ∀(s, l, c, e) ∈ TA1 ∃(s′, l′, c′, e′) ∈ TA2 . (s, l, c, e) ⪯ (s′, l′, c′, e′)

(s, l, c, e) ⪯ (s′, l′, c′, e′) iff l = l′
  ∧ ∀(c, m, d, f) ∈ TA1 ∃(c′, m′, d′, f′) ∈ TA2 . (c, m, d, f) ⪯ (c′, m′, d′, f′)
  ∧ ∀(e, m, d, f) ∈ TA1 ∃(e′, m′, d′, f′) ∈ TA2 . (e, m, d, f) ⪯ (e′, m′, d′, f′)

An Algorithm for the Subset Test on Graph and Tree Automata As a sketch for implementation and for the complexity analysis of the presented simulation relation on tree automata, the following algorithm is proposed. The algorithm is applied to two automata A1 and A2:

• A two dimensional matrix of truth values of size |TA1| × |TA2| is initialized in such a way that for each transition pair (τ1, τ2) ∈ TA1 × TA2 the corresponding field in the matrix is true if the labels of τ1 and τ2 are identical, and false otherwise.

• Each matrix field with value true is set to false if ((s1, l, c1, e1), (s2, l, c2, e2)) is the corresponding transition pair and either s1, c1 or e1 is a final state but the corresponding s2, c2 or e2 is not.

• The matrix is modified until a fixpoint is reached by:

– setting each matrix field with value true and corresponding transition pair ((s1, l, c1, e1), (s2, l, c2, e2)) to false, if for some transition τ1 = (e1, m, d, f) in A1 (following in the breadth dimension) there is no corresponding transition τ2 = (e2, m, d′, f′) in A2 such that the field (τ1, τ2) in the matrix is true;

– setting each true field in the matrix with corresponding transition pair ((s1, l, c1, e1), (s2, l, c2, e2)) to false, if for some transition τ1 = (c1, m, d, f) in A1 (following in the content dimension) there is no corresponding transition τ2 = (c2, m, d′, f′) in A2 such that the field (τ1, τ2) in the matrix is true.

• If for some transition τ1 ∈ TA1 there is no transition τ2 ∈ TA2 such that the field (τ1, τ2) in the matrix is true, then the language accepted by A1 is not a subset of the language accepted by A2.


A1 = ( {10, 11, 12, 13, 14, 15},
       { (10, a, 12, 11), (12, a, 12, 14), (14, b, 15, 14), (15, a, 12, 13) },
       {11, 13, 14},
       (10, a, 12, 11),
       {a, b} )

A2 = ( {1, 2, 3, 4, 5, 6, 7, 8, 9},
       { (1, a, 3, 2), (3, a, 3, 4), (4, a, 3, 6), (6, b, 6, 7), (7, c, 8, 9), (6, b, 5, 6), (5, a, 3, 6), (3, a, 3, 6) },
       {2, 6, 8, 9},
       (1, a, 3, 2),
       {a, b, c} )

Figure 6.4: Two automata used to demonstrate the sub-language test.

Figure 6.5: Visual representation of the two automata in example 6.4


Example 6.5 The condition L(A1) ⊆ L(A2) is to be tested using the presented algorithm. A run of the algorithm is visualized with a table representing the matrix. The edges of A1 are used as column labels and the edges of A2 as row labels. Final states are emphasized using a bold font. The cells contain a series of ones (1) and zeros (0) representing the truth values true and false a field takes in the various stages of the computation. Note that once a 0 occurs in a cell, the 0 is the ultimate value of this cell, as the algorithm only changes true values to false values. Four stages of the computation are represented, so a cell either contains 1, 1, 1, 1 and is thereby true, or it contains fewer entries, the last of which is 0, e.g. 1, 1, 0. The stages represented are: 1. after performing the label check,

2. after checking that final states in transitions of A1 fall on final states of corresponding transitions of A2, 3. after the first iteration of checking follow-up transitions in both dimensions (two cells changed truth value),

4. second (and last) iteration of checking following transitions in both dimensions (no cells changed truth value)

↓ A2 \ A1 →    (10, a, 12, 11)   (12, a, 12, 14)   (14, b, 15, 14)   (15, a, 12, 13)
(1, a, 3, 2)   1, 1, 1, 1        1, 1, 0           0                 1, 1, 1, 1
(3, a, 3, 4)   1, 0              1, 0              0                 1, 0
(4, a, 3, 6)   1, 1, 1, 1        1, 1, 1, 1        0                 1, 1, 1, 1
(6, b, 7, 6)   0                 0                 1, 1, 0           0
(7, c, 8, 9)   0                 0                 0                 0
(6, b, 5, 6)   0                 0                 1, 1, 1, 1        0
(5, a, 3, 6)   1, 1, 1, 1        1, 1, 1, 1        0                 1, 1, 1, 1
(3, a, 3, 6)   1, 1, 1, 1        1, 1, 1, 1        0                 1, 1, 1, 1

After the last iteration, the columns are checked for consistency, i.e. each column should contain at least one true cell, that is, a cell containing 1, 1, 1, 1. As this is the case, A1 is an automaton accepting a sub-language of the language represented by A2.

An Upper Bound Complexity for the Subset Test The presented algorithm shows that the subset test has a polynomial upper bound in time and space. The space complexity is determined by the matrix, whose size is the product of the numbers of transitions of the two automata, i.e. O(|∆1| × |∆2|). The time complexity is the sum of initializing the matrix (including the label test and the final state condition) and of the iterative refinement of the matrix. The refinement process must terminate: either no change is made to the matrix, and then the refinement is over, or at least one cell changes its truth value from true to false; truth values are never changed from false back to true. Assuming the worst case, in which each iteration alters just one cell, we need |∆1| × |∆2| iterations, and each iteration has a complexity of O(|∆1| × |∆2|). The final step is the consistency check of the columns, which also takes O(|∆1| × |∆2|) time. The total cost is therefore O((k + (|∆1| × |∆2|)) × (|∆1| × |∆2|)), with k a constant factor covering the initialisation and consistency check of one cell.
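The matrix algorithm can be sketched in Python. This is one reading of the algorithm, not code from the thesis: in particular, the final-state condition is interpreted here as "a final end state of an A1 transition must fall on a final end state of the corresponding A2 transition", and the transitions of A2 are taken as they appear in the matrix of example 6.5.

```python
from itertools import product

def subset_test(delta1, final1, delta2, final2):
    """True iff, per the matrix algorithm, L(A1) is a subset of L(A2).
    Transitions are quadruples (state, label, content, end)."""
    def leaving(delta, state):
        return [t for t in delta if t[0] == state]

    # initialisation: label check and final-end-state condition
    m = {(t1, t2): t1[1] == t2[1] and (t1[3] not in final1 or t2[3] in final2)
         for t1, t2 in product(delta1, delta2)}

    changed = True
    while changed:  # refinement only flips true -> false, so it terminates
        changed = False
        for t1, t2 in product(delta1, delta2):
            if not m[(t1, t2)]:
                continue
            # follow transitions from the end states (index 3) and content states (index 2)
            for pos in (3, 2):
                if any(not any(m[(f1, f2)] for f2 in leaving(delta2, t2[pos]))
                       for f1 in leaving(delta1, t1[pos])):
                    m[(t1, t2)] = False
                    changed = True
                    break

    # every A1 transition needs at least one corresponding true A2 transition
    return all(any(m[(t1, t2)] for t2 in delta2) for t1 in delta1)

# The automata of example 6.5:
D1 = [(10, 'a', 12, 11), (12, 'a', 12, 14), (14, 'b', 15, 14), (15, 'a', 12, 13)]
D2 = [(1, 'a', 3, 2), (3, 'a', 3, 4), (4, 'a', 3, 6), (6, 'b', 7, 6),
      (7, 'c', 8, 9), (6, 'b', 5, 6), (5, 'a', 3, 6), (3, 'a', 3, 6)]
assert subset_test(D1, {11, 13, 14}, D2, {2, 6, 8, 9})      # sub-language holds
assert not subset_test(D2, {2, 6, 8, 9}, D1, {11, 13, 14})  # converse fails (no c in A1)
```

The refinement loop computes a greatest fixpoint, mirroring the four stages of the worked example.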


7 A Model for Regular Languages of Multisets

The analysis of sets of multisets in the framework of the language theory introduced by Chomsky was first considered in [40]. In that work, Rohit J. Parikh analyses properties of context free languages when the order of the symbols within words is ignored. He concludes that, when symbol order is irrelevant, regular and context free languages are of the same expressiveness, i.e. for each context free language there is a regular language representing the same language modulo order. The analysis of multisets, or unordered sequences, is an important topic in the context of this thesis, as R2G2 has been conceived to model graph shaped data with ordered sequences and unordered multisets of sibling nodes, usually as children of a parent node. The languages without symbol order analysed by Parikh are exactly the class of languages used by R2G2 for modelling multisets of nodes. Moreover, the language class in question is general enough to capture existing models of unordered node sequences, e.g. as modelled in XML Schema.

7.1 Introduction to Multisets and Multiset Languages

7.1.1 Multisets

A multiset is a set, container or bag in which multiple occurrences of the same value may occur. This differs from the usual notion of a set, where each value occurs at most once.

Definition 7.1 (Syntax of Multisets) In the following, a multiset M will be denoted in the usual set notation, as M = {v1, . . . , vn} with vi ∈ Σ, where Σ is the set of defined values for M, and vi and vj with i ≠ j may be the same element of Σ. Σ is (for now) required to be a finite set of symbols.

• {} is a multiset

• if { content } is a multiset and v is a valid defined value, then { v, content } is a multiset.

Hence, M ∈ ⋃i∈ℕ Σ^i. Note that M ⊆ Σ does not hold; as M is not a set, the subset relation is not even well typed here. This is due to the fact that there may be i ≠ j such that vi = vj for vi, vj ∈ M.


As multisets do not impose an order on their elements, two multisets with different notation according to the syntax may be considered equivalent. As an example, the multiset {v1, v2, v1} is the same as the multiset {v2, v1, v1}. This can be defined by interpreting the multiset member separator (in this case the comma) as an associative and commutative function.

Definition 7.2 (Semantics of Multisets)

• { v, content } = { content, v }

• { v1, (v2, content) } = { (v1, v2), content }

An important property is the multiplicity of the symbols of Σ in a given multiset. The Parikh mapping, see section 7.2, defines this multiplicity.
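This order-insensitive equality is exactly what a multiplicity-based representation provides for free. As an aside (not part of the thesis formalism), a minimal illustration in Python using collections.Counter:

```python
from collections import Counter

# Two notations of the same multiset, differing only in element order
m1 = Counter(["v1", "v2", "v1"])
m2 = Counter(["v2", "v1", "v1"])

assert m1 == m2        # equality ignores the order of notation
assert m1["v1"] == 2   # ... but respects multiplicity
```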

7.1.2 Multiset Languages

A multiset language is a set of multisets; hence a multiset language L over Σ satisfies L ⊆ ⋃i∈ℕ Σ^i. What was called the “set of defined values” in the definition of multisets is often called vocabulary in linguistics or set of symbols in language theory. A language of multisets may be finite or infinite, hence a finite notation for languages of multisets is necessary to formalize them.

Existing Approaches to Formalize Multiset Languages Different approaches have been proposed to formalize sets of multisets, among them so-called multiplicity lists [16], L-formulae [39] and so-called counting constraints [67].

Definition 7.3 (Multiplicity Lists) A multiplicity list is a regular type expression of the form s1(n1 : m1) · · · sk(nk : mk) where k ≥ 0 and s1, . . . , sk are distinct symbols. ni expresses the minimum and mi the maximum number of occurrences of si.

A problem of the interval notation is that it cannot express gaps in the multiplicity of the occurrence of a symbol. It is, e.g., not possible to express that the multiplicity of a symbol should be at least n and at most m but not p with n < p < m. Further, it is not possible to express that the multiplicity of a symbol should be an odd or an even number, or a multiple of a given factor.
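To make the definition concrete, a multiplicity list can be checked against a multiset as follows; the dict encoding of the intervals is an illustrative choice of this sketch, not notation from the thesis:

```python
from collections import Counter

# Encode the multiplicity list s1(n1:m1) ... sk(nk:mk) as a dict
# mapping each symbol to its (min, max) interval.
def matches(multiset, intervals):
    counts = Counter(multiset)
    if any(s not in intervals for s in counts):
        return False  # undeclared symbol occurs
    return all(lo <= counts[s] <= hi for s, (lo, hi) in intervals.items())

decl = {"a": (1, 3), "b": (0, 1)}               # a(1:3) b(0:1)
assert matches(["a", "a", "b"], decl)
assert not matches(["a", "a", "a", "a"], decl)  # too many a's
```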

Definition 7.4 (L≥-Formulae) An L≥-formula is an expression ϕ of the shape

ϕ ::= true | false | a = i | a ≥ i | ¬ϕ | ϕ ∨ ϕ

where a ∈ Σ and i ∈ ℕ. The language of ϕ is L(ϕ) = {w | w ⊨ ϕ}, with w ⊨ a = i iff the multiplicity of a in w is i, w ⊨ a ≥ i iff the multiplicity of a in w is greater than or equal to i, w ⊨ ¬ϕ iff not w ⊨ ϕ, w ⊨ ϕ1 ∨ ϕ2 iff w ⊨ ϕ1 or w ⊨ ϕ2; never w ⊨ false, and always w ⊨ true.
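A small evaluator makes the semantics concrete; the nested-tuple encoding of formulae below is a hedged sketch of this author's own devising, not notation from the thesis:

```python
from collections import Counter

def holds(w, phi):
    """Evaluate an L>=-formula, encoded as nested tuples, over a multiset w."""
    counts, op = Counter(w), phi[0]
    if op == "true":  return True
    if op == "false": return False
    if op == "eq":    return counts[phi[1]] == phi[2]   # a = i
    if op == "geq":   return counts[phi[1]] >= phi[2]   # a >= i
    if op == "not":   return not holds(w, phi[1])
    if op == "or":    return holds(w, phi[1]) or holds(w, phi[2])
    raise ValueError(op)

# "a occurs exactly once or exactly three times": a gap at multiplicity 2
phi = ("or", ("eq", "a", 1), ("eq", "a", 3))
assert holds(["a"], phi)
assert not holds(["a", "a"], phi)
assert holds(["a", "a", "a"], phi)
```

The example formula expresses a gap in a multiplicity interval, which the multiplicity lists of Definition 7.3 cannot.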

L≥-Formulae are closed under union, intersection and complement. Compared to multiplicity lists, L≥-Formulae are more expressive, as they allow the expression of gaps in symbol multiplicity intervals. However, every gap has to be spelled out; there is no abstraction over the concept of a gap. If, e.g., odd multiplicity is to be expressed, a gap of one has to be declared after each valid multiplicity. Such multiplicities are commonly expressed using regular expressions; in the case of odd multiplicity for the symbol s, the regular expression s, (s, s)* can be used.

Definition 7.5 (Counting Constraints (a.k.a. Presburger Constraints)) A counting constraint is a formula ϕ of the shape:

ϕ, ψ ::= (Exp1 = Exp2) | ¬ϕ | ϕ ∨ ψ | ∃N.ϕ
Expi ::= n | N | Exp1 + Exp2


where n is a natural number and N is a natural number variable. Later on, ϕ(N1, . . . , Nm) will denote the counting constraint with the free variables N1, . . . , Nm. Counting constraints are decidable, i.e. for every ϕ(N1, . . . , Nm) it is decidable whether there exist n1, . . . , nm for the free variables such that ϕ(n1, . . . , nm) is true.

In contrast to L≥-Formulae, counting constraints are able to express repeating interval/gap multiplicities by the use of variables and sums over multiplicities. As an illustration, consider the odd multiplicity of a symbol: let the multiplicity of that symbol be S; then S = 1 + N + N ∧ N ≥ 0, for N and S natural numbers, expresses the fact that S is an odd number. While counting constraints are decidable, the decision complexity is in general high [26]. In practice it has turned out that many efficient approaches based on heuristics exist [27], making type checking using counting constraints feasible.
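The odd-multiplicity constraint can be decided mechanically. The following brute-force sketch (illustrative only) exploits that any witness N for ∃N. S = 1 + N + N is bounded by S, so a finite search suffices:

```python
def is_odd_multiplicity(s):
    # decide  ∃N. S = 1 + N + N  for a concrete natural number S
    return any(s == 1 + n + n for n in range(s + 1))

assert is_odd_multiplicity(1) and is_odd_multiplicity(7)
assert not is_odd_multiplicity(0) and not is_odd_multiplicity(4)
```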

Motivating a new Approach Type and schema declarations for XML based languages are usually based on regular expressions; content models are ordered sequences of elements. For unordered content, different formalisms are usually used, as introduced earlier in this section. The new approach suggested in this thesis is to use regular expressions with unordered semantics as the multiset language formalism. This releases the user from the burden of learning different formalisms for ordered and unordered content. Further, it gives rise to a simple treatment of ordered types under unordered queries, as presented now: often when querying, and sometimes when modelling, order is irrelevant. When querying XML using XPath it is generally possible to query data based on the sequential order (using the so-called following, preceding, following-sibling and preceding-sibling axes), but in many applications the order of the elements is ignored, especially when querying database-like documents. In Xcerpt, explicitly ordered or unordered query patterns are used to query documents. An unordered query can be considered to have a type with an unordered content specification, which itself is a language of multisets. This is motivated by the fact that it is arguably reasonable to interpret the type of a query as the type representing all data instances that can be queried by the query term under the assumption of a given type or schema for the queried data. If no type or schema is given for the queried data, the most general type may safely be assumed. The type of a query is a subset of the type of the queried data. When a query has multiple occurrences of the same variable, it becomes necessary to calculate the variable intersection to infer the type of the query, as shown in section 8.3.2. When constructing unordered content, e.g. in Xcerpt, out of queries to ordered typed data, it is easy to construct content whose type is the unordered interpretation of an ordered content model. As an example consider the following Xcerpt rule:

CONSTRUCT
    unorderedData{ all var C }
FROM
    orderedData[[ var C ]]
END

To infer the type of the result of that rule, it is necessary to give an unordered interpretation of the type of the content queried by the query pattern orderedData[[ var C ]]. Unfortunately, such a type may not be represented precisely using multiplicity lists or L≥-Formulae. As an example, think of a regular content model of the shape (C, C)*, which models content with an even number of elements of type C. An unordered interpretation of such a type should arguably reflect the even multiplicity of its C-typed members. Neither L≥-Formulae nor multiplicity lists allow this. With counting constraints or Presburger arithmetic, the desired set of multisets can be expressed. R2G2 uses regular expressions with unordered interpretation to model unordered content models. For the operational semantics, counting constraints or Presburger arithmetic formulae,

very much in the spirit of sheaves automata [67], are used. The main difference to sheaves automata and the sheaves logic lies on the surface: unordered content models in R2G2 are not specified using a special formalism, but with regular expressions.

7.2 Counting Constraints

Languages of Multisets A language of multisets over a (finite) alphabet Σ is defined as a (possibly infinite) set of finite multisets over the alphabet Σ.

Parikh Mapping For a word w over a finite alphabet Σ with m symbols (i.e. |Σ| = m), the Parikh mapping #(w) = (n1, . . . , nm) is a vector where ni denotes the number of occurrences of the i-th symbol of Σ in the word w (as Σ is finite, it is enumerable; it is hence sensible to speak of the symbol at position i). For example, for Σ = {a, b, c, d} and w = “cabba”, the Parikh mapping is #(w) = (2, 2, 1, 0). A language of multisets can also be defined as a set of vectors as returned by the Parikh mapping.
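The mapping is straightforward to compute; a minimal sketch reproducing the example above:

```python
def parikh(word, alphabet):
    """Parikh mapping: per-symbol occurrence counts, in the fixed
    enumeration order of the alphabet."""
    return tuple(word.count(s) for s in alphabet)

assert parikh("cabba", ("a", "b", "c", "d")) == (2, 2, 1, 0)
```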

Counting Constraint A counting constraint is a formula ϕ of the shape:

ϕ, ψ ::= (Exp1 = Exp2)|¬ϕ|ϕ ∨ ψ|∃N.ϕ

Expi ::= n|N|Exp1 + Exp2

where n is a natural number and N is a natural number variable. ϕ(N1, . . . , Nm) denotes the counting constraint with the free variables N1, . . . , Nm. Counting constraints are decidable, i.e. for every ϕ(N1, . . . , Nm) it is decidable whether there exist n1, . . . , nm for the free variables such that ϕ(n1, . . . , nm) is true. For a language L of multisets over Σ, a counting constraint ϕ with a free variable for every symbol in Σ can be used to test membership of a word (or multiset) using the Parikh mapping: w ∈ L, when

#(w) = (n1, . . . , n|Σ|) ∧ ϕ(n1, . . . , n|Σ|)

The expression can be understood as follows: #(w) yields the Parikh mapping, i.e. the multiplicity of each symbol of the language in the given word w. The multiplicities are related to the variables (n1, . . . , n|Σ|), one for each symbol of the set of symbols of the language. The word w is in the language modelled by ϕ if ϕ, instantiated according to the Parikh mapping (i.e. ϕ(n1, . . . , n|Σ|)), holds.

Defining Languages of Multisets with Regular Expressions Given a regular expression r, L({r}) is defined as the language of multisets for r such that a word w is in L({r}) if and only if there exists a permutation w′ ∈ P(w) with w′ ∈ L(r) (i.e. w′ is in the regular (string) language L(r)).
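This definition can be executed literally for small words; the following sketch (illustrative only, exponential in the word length) tests membership by trying all permutations against the regular expression read as a string regex:

```python
import re
from itertools import permutations

def in_multiset_language(word, regex):
    """Membership in L({r}): some permutation of the word is in L(r)."""
    return any(re.fullmatch(regex, "".join(p)) for p in permutations(word))

# r = (a,b)*,c  written as the string regex (ab)*c
assert in_multiset_language("bac", "(ab)*c")    # permutation "abc" matches
assert in_multiset_language("c", "(ab)*c")
assert not in_multiset_language("aac", "(ab)*c")
```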

7.3 A Calculus for Translation of Regular Expressions to Counting Constraints

For a given regular expression, it is possible to derive a counting constraint such that the counting constraint accepts exactly the language of multisets defined by (the permutations of) all words accepted by the regular expression. As an example, consider the regular expression (A,B)*,C: the language of multisets defined by this regular expression over, e.g., the alphabet Σ = {A, B, C} is the set of all multisets with the same number of As and Bs and exactly one C. A counting constraint for this language could be A = B ∧ C = 1, where the free variables A, B and C stand for the multiplicities of the corresponding symbols.

The following set of rules nondeterministically describes a simple algorithm for the translation of regular expressions, interpreted as definitions of languages of multisets, to counting constraints. The regular expression is given in curly braces (i.e. {re}) in accordance with the notation in R2G2 and, for technical reasons, to distinguish the root of the abstract syntax tree of (recursively defined) regular expressions. The domain of the rules is Γ, N / re ↦ C, where N is a natural number variable or a natural number to be used in the counting constraint, re is the regular expression, C is the, possibly partially specified, counting constraint, and Γ is a mapping of symbols as used in the regular expression to sets of natural number variables. To ease the formal handling in the rules, the mapping Γ is used as a set of equations, where each equation has the symbol on the left hand side and the set of natural number variables, read as a sum, on the right hand side. The role of N is a bit unconventional: it is not a variable for a natural number, but a meta-variable for either a natural number (indeed, only the number 1 occurs) or a natural number variable. It is hence sometimes necessary to construct a natural number variable; fortunately only new variables are needed, and they are constructed using the symbol Nnew. The rules relate one expression of the domain above the line with several expressions of the domain below the line, where usually the expressions below the line are decompositions of the expression above the line. For a regular expression with the atoms (or symbols) a1, . . . , am, the general scheme of the resulting counting constraint is

ϕ(Xa1, . . . , Xam) ::=
    ∃Y11, . . . , Y1p, . . . , Ym1, . . . , Ymq .
        Xa1 = Y11 + · · · + Y1p ∧ . . . ∧ Xam = Ym1 + · · · + Ymq
        ∧ ϕ(. . . , Yij, . . .)

The variables Xai are then bound to all the variables in the mapping Γ for the symbol ai. This is formalized in the root rule (the last in the following set of rules); applied to the regular expression, it yields the counting constraint representing the language of multisets defined by the regular expression.

Atoms (or symbols) in regular expressions are at the leaf level of the rule based constraint construction. Indeed, the rule based construction spans a tree structure shaped exactly like the abstract syntax tree of the regular expression, as there is exactly one rule for the decomposition of each abstract syntax tree node. So, when an atom occurs, the multiplicity of this symbol, represented by the natural number meta-variable N, has to be propagated to the mapping Γ, which is used in the end at root level.

(a, {. . . , N, . . .}) ∈ Γ
--------------------------- (ATOM)
Γ, N / a ↦ ∅

For a sequence of two regular expressions, the multiplicity of the current context is passed to the two components, the partial constraints of the two parts are connected by a conjunction, and the resulting mappings of the two part-expressions are merged. This reflects the fact that the multiplicity of the regular expression rs implies that the expression parts r and s occur with the same multiplicity in a valid multiset.

Γr ∪ Γs, N / rs ↦ Cr ∧ Cs
----------------------------------- (SEQ)
Γr, N / r ↦ Cr    Γs, N / s ↦ Cs

For a disjunctive regular expression, its multiplicity has to be divided between the two options. This is reflected by the sum N = M + P, where N is the multiplicity of the disjunction, and M and P are the multiplicities of the two components of the disjunction. The sum becomes a new part of the counting constraint. Note that M and P are both new variables constructed using the Nnew constructor.


Γr ∪ Γs, N / r|s ↦ Cr ∧ Cs ∧ N = M + P
------------------------------------------------- (DISJ)
Γr, M = Nnew / r ↦ Cr    Γs, P = Nnew / s ↦ Cs

For an optional regular expression occurring N times, the optional sub-expression can occur at most N times. This is expressed using a new multiplicity variable M for the sub-expression and the inequality M ≤ N.

Γ, N / r? ↦ C ∧ M ≤ N
-------------------------- (OPT)
Γ, M = Nnew / r ↦ C

The Kleene star is a bit tricky: the sub-expression r may be repeated arbitrarily often. Its multiplicity is independent of the multiplicity of r*, so the sub-expression gets a new multiplicity variable. On the other hand, if r* does not occur at all in the valid word, then the sub-expression r cannot occur either. This is reflected by the new constraint part ¬(N = 0 ∧ M ≠ N): if N is 0 then M must also be 0, otherwise any values for M and N are admissible.

Γ, N / r* ↦ C ∧ ¬(N = 0 ∧ M ≠ N)
------------------------------------ (KLEENE)
Γ, M = Nnew / r ↦ C

The regular expression plus construct, like the Kleene star, has to consider two cases: either the whole expression r+ does not occur at all in the valid word, or it occurs. If it does not occur, then the sub-expression does not occur either; otherwise the sub-expression occurs at least as often as the expression.

Γ, N / r+ ↦ C ∧ ((N > 0 ∧ M ≥ N) ∨ (N = 0 ∧ M = 0))
------------------------------------------------------- (PLUS)
Γ, M = Nnew / r ↦ C

The root of a regular expression, more precisely of its abstract syntax tree, is used to finalize the constraint by adding the information of the symbol mapping. The multiplicity meta-variable is set to one, as a valid word with respect to the regular expression fits exactly once into the regular expression. The mapping Γ is used in the previous rules to collect, for each symbol, all the variables whose multiplicities relate to that symbol. The mapping gets new members in the ATOM rule, and members may be altered in the sequence and disjunction rules (whenever two occurrences of the same symbol are represented in two mappings to be merged, the member becomes the sum of the two occurrences). In the root rule, the members of the mapping are interpreted as sums (as which they syntactically occur) and are all added as a conjunction to the constraint. All (free) right hand side variables are existentially bound; the symbols from the mapping (on the left hand side of the mapping equations) are now interpreted as variables for the multiplicity of their corresponding symbol, and they are the only free variables of the constraint. Every binding of the free variables which yields a solution of the constraint gives the Parikh mapping of a multiset in the language modelled by the constraint.

{}, 1 / {r} ↦ ∃ . . . , Nij, . . . . C ∧ a1 = N11 + · · · + N1p ∧ · · · ∧ am = Nm1 + · · · + Nmq
-------------------------------------------------------------------------------- (REGEXPROOT)
{a1 = N11 + · · · + N1p, . . . , am = Nm1 + · · · + Nmq}, 1 / r ↦ C

7.3.1 Example of a Regular Expression Translated to a Counting Constraint

The calculus for translation of regular expressions to counting constraints will be applied to the following example expression: ((a, b)*, c)|(d, a?)

In words, this regular expression models a language of multisets with the following properties:

• the set either contains a c or a d.

• If there is no c in the set, then there is no b either.

• Further, if there is no c (and hence there is a d) in the set, then there is at most one a contained in the set.

• If there is no d (and hence there is a c), there are as many bs as as in the set.

Let us see what counting constraint the calculus produces for this regular expression:

{}, 1 / {((a, b)*, c)|(d, a?)} ↦ ∃O∃P∃M∃N . a = O + P ∧ b = P ∧ c = M ∧ d = N ∧ ¬(M = 0 ∧ M ≠ P) ∧ O ≤ N ∧ 1 = M + N

{a = O + P, b = P, c = M, d = N}, 1 / ((a, b)*, c)|(d, a?) ↦ ¬(M = 0 ∧ M ≠ P) ∧ O ≤ N ∧ 1 = M + N
{a = P, b = P, c = M}, M / (a, b)*, c ↦ ¬(M = 0 ∧ M ≠ P)
{a = O, d = N}, N / d, a? ↦ O ≤ N
{a = P, b = P}, M / (a, b)* ↦ ¬(M = 0 ∧ M ≠ P)
{a = O}, N / a? ↦ O ≤ N
{a = P, b = P}, P / a, b ↦ ∅
{a = P}, P / a ↦ ∅
{b = P}, P / b ↦ ∅
{c = M}, M / c ↦ ∅
{d = N}, N / d ↦ ∅
{a = O}, O / a ↦ ∅

The counting constraint is the consequence above the top line, ∃O∃P∃M∃N . a = O + P ∧ b = P ∧ c = M ∧ d = N ∧ ¬(M = 0 ∧ M ≠ P) ∧ O ≤ N ∧ 1 = M + N, and it arguably fulfills the verbal requirements.
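The derived constraint can also be cross-checked mechanically against the permutation semantics. The following brute-force comparison is illustrative only; the bound on the existential variables is ad hoc but sufficient for the tested range of multiplicities:

```python
from itertools import product

# The language of ((a,b)*,c)|(d,a?) described directly: multisets with
# equally many a's and b's and one c, or one d, no c, no b and at most one a.
def direct(a, b, c, d):
    return (a == b and c == 1 and d == 0) or \
           (d == 1 and c == 0 and b == 0 and a <= 1)

# The derived counting constraint, with the existential variables
# O, P, M, N searched by brute force over a sufficient finite box.
def constraint(a, b, c, d):
    return any(a == o + p and b == p and c == m and d == n
               and not (m == 0 and m != p) and o <= n and m + n == 1
               for o, p, m, n in product(range(5), repeat=4))

# Agreement on all multisets with at most three occurrences per symbol
assert all(constraint(*v) == direct(*v) for v in product(range(4), repeat=4))
```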

7.4 Some Set Theoretic Computations on Counting Constraints

As for hyper graph automata, for the purpose of static type checking, it is necessary to analyse the same set properties on languages modelled using counting constraints: (1) emptiness test, (2) computation of language intersection, and (3) subset test.

Testing Emptiness of a Multiset Language defined by Counting Constraints Using counting constraints it is possible to define empty languages due to unsatisfiable constraints. For example, a declaration with the constraint A = B ∧ A ≤ 1 ∧ B ≥ 2 (with A and B the multiplicity variables of two symbols) is unsatisfiable and therefore declares an empty language. Hence, the emptiness test for a language can be reduced to the satisfiability test of the counting constraints. In the context of R2G2, a counting constraint generated from a regular expression never models an empty language. Nevertheless, in the context of type checking, empty languages may result as intermediate steps of the type checking process, e.g. as the intersection¹ of two languages defined using counting constraints; hence the emptiness check is of practical relevance. As solutions have to be in ℕ (the variables are defined over ℕ: they represent multiplicities of symbols, and non-integer multiplicities of symbols make no sense), a lower bound is always available for the variables (e.g. 0, or given by a constraint n ≤ X). There is not necessarily an upper bound for the solutions (note that languages can be infinite sets of multisets), but as all multisets are finite, there are always smallest ones (e.g. with the least number of symbols). It is always possible to compute an upper bound for the smallest word, such that if no solution is found between the lower and the upper bound for all variables, the language must be empty. To see this, consider that Presburger arithmetic formulae are decidable. Given upper and lower bounds in ℕ for all variables, finding a solution is a finite domain problem that always terminates. Unfortunately, the search space is then of size (u − l)^|Vars|, with |Vars| the number of variables, u the upper bound and l the lower bound. For all linear equation parts, and for some non-linear ones, there fortunately exist solvers with polynomial complexity.

1Intersection of two languages defined using Counting Constraints is introduced later

119 7.4 SOME SET THEORETIC COMPUTATIONS ON COUNTING CONSTRAINTS

Combining them with search over the intractable part of the problem possibly yields reasonable average case behaviour. A witness for a non-empty set is given by the multiset that results from applying a solution to the counting constraint of the tested language.
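The bounded search underlying this argument can be sketched as follows; the encoding of a counting constraint as a Python predicate over the free (symbol multiplicity) variables is an assumption of this sketch:

```python
from itertools import product

def is_empty(constraint, num_vars, upper):
    """Bounded-box emptiness test: search [0, upper]^num_vars for a
    satisfying assignment; empty iff none exists within the bound."""
    box = product(range(upper + 1), repeat=num_vars)
    return not any(constraint(*v) for v in box)

# A = B ∧ A ≤ 1 ∧ B ≥ 2 is unsatisfiable, hence the language is empty
assert is_empty(lambda a, b: a == b and a <= 1 and b >= 2, 2, 5)
# A = B alone is satisfiable (e.g. A = B = 0), hence non-empty
assert not is_empty(lambda a, b: a == b, 2, 5)
```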

Membership Test with Counting Constraints The membership test determines whether a multiset w is contained in the set of multisets defined by the counting constraint C (this set of multisets may also be called the language L(C)). The test can be achieved by instantiating the free variables of C with the number of occurrences of the corresponding symbols in w. The test can easily be expressed as a specialisation of the emptiness test: if a multiset w is in L(C), there must be a solution such that w = T under C. For w = a1^m1 ⊕ · · · ⊕ an^mn and T = a1^M1 ⊕ · · · ⊕ an^Mn, the equation w = T, i.e. a1^m1 ⊕ · · · ⊕ an^mn = a1^M1 ⊕ · · · ⊕ an^Mn, is to be interpreted as m1 = M1 ∧ · · · ∧ mn = Mn. Solving w = T ∧ C with the solver used for the emptiness test yields that w is a valid multiset in L(C) if there is a solution.
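In the degenerate case where the constraint is already executable, the membership test reduces to plugging the Parikh mapping of w into the free variables; a minimal sketch (the predicate encoding is, again, an assumption of the sketch):

```python
from collections import Counter

def member(word, alphabet, constraint):
    """Membership by instantiation: evaluate the constraint on the
    Parikh mapping of the word."""
    counts = Counter(word)
    return constraint(*(counts[s] for s in alphabet))

# L(C): equally many a's and b's, exactly one c
C = lambda a, b, c: a == b and c == 1
assert member("bac", ("a", "b", "c"), C)
assert not member("aac", ("a", "b", "c"), C)
```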

Intersection of Languages declared using Counting Constraints The intersection of two languages L(M1) and L(M2) is an extension of the membership test: the intersection of M1 = (T1, C1) and M2 = (T2, C2) is the language where T1 = T2 under C1 ∧ C2. The problem can be reformulated as C1 ∧ C2 ∧ M1_1 = M2_1 ∧ · · · ∧ M1_n = M2_n, where Ti = a1^Mi_1 ⊕ · · · ⊕ an^Mi_n.

Part IV

Type Checking


8 Type Checking using Regular Rooted Graphs as Data Paradigm

This chapter can be considered the core of this thesis. After a brief introduction to types in programming and query languages, a typing approach for Xcerpt based on the automaton model presented in Chapter 6 (and hence based on R2G2) is presented.

8.1 Types and Query- and Programming Languages

In programming and query languages, types are used to give an approximation of the values that certain components and constructs of the language can take. Usually this approximation is a superset of the concrete values to which the constructs evaluate at run time. One use of the approximations is to provide specialized and appropriate memory representations for data instances. A dual use of those approximations is to explain, find or prevent errors caused by possibly or necessarily invalid values at run time. Different approaches to the recognition of such errors exist:

1. The errors can be detected at run time by validating the data instances against the types.

2. The errors are detected before run time, e.g. at compile time by analysing the compatibility of the types.

8.1.1 Dynamic Typing and Type Checking in Programming Languages

In dynamically typed languages, values have an associated type and type correctness is checked at run time. The type check is usually explicitly programmed in the core libraries and, if needed or wished, implemented by the users of the given programming language in their own programs. The concept of dynamic typing and dynamic type checking is mostly inspired by practice: the explicit type check is easy to include in the run time system of a programming language, and no static type system has to be conceived for the programming language. Most common scripting languages like Perl, Python, Ruby, etc. are conceived with such dynamic typing facilities. Hence, as well as static type checking, dynamic typing can prevent type

related system corruption in the sense that operations boiling down to predefined functions interacting with the system are checked for type-correct application at run time. This distinguishes dynamically typed languages from untyped languages. In contrast to statically typed languages, which prevent even the execution of wrongly typed programs, an error may occur relatively late; hence, along the execution line, prior or parent functions may already have been applied in a type violating way without warning. The use of static type systems usually hinders the application of some programming techniques that may be semantically well founded and intended by the programmer, but are not well typed within the framework of the type system. For example, most type systems consider lists or arrays well typed only if they contain elements of just one type. Nevertheless, it may be convenient to have mixed type lists for some applications. With dynamic typing, the restrictions on what can be programmed are usually weaker than with common widespread statically typed programming languages, yet type errors can still be discovered more easily than with untyped languages.

8.1.2 Static Typing and Type Checking in Programming Languages

In statically typed languages, program fragments, e.g. variables, data constructors or function invocations, have an associated type, and type correctness can be checked prior to run time, usually at compile time. Examples of statically typed programming languages are SML, Haskell and many others. A common approach in many programming languages is to deny the compilation of ill typed programs, i.e. programs in which multiple occurrences of the same variable have different types. The term different types is often, e.g. in object oriented languages, relaxed to different types that are not in a subtype, i.e. subset, relationship. If a program is not ill typed, it is assured that all function invocations are well typed, hence there is often no need for dynamic type checking at run time. The type information can then be discarded at run time; the run time system may be completely untyped, yet program evaluation will not yield any type errors. Hence, as a veteran of static type checking, Robin Milner, states: “Well-typed programs do not go wrong” [36]. Still, many run time errors remain possible, e.g. hanging a program in an infinite loop, division by zero and many more. A more appropriate statement (yet arguably not of practical relevance) about typing has been made by S. Kahrs: “Well-going programs can be typed” [31].

8.1.3 Combined Static and Dynamic Typing

Some programming languages provide reflection frameworks or meta programming, enabling a program to reason about types, values and methods at run time. In Java, e.g., the reflection API provides the ability to invoke methods, read object attributes and obtain the type name of an object by invoking reflection methods. This makes it necessary to keep the type information of the objects at run time. Arguably, for late binding [5], a common feature of most object oriented languages, the objects need to be annotated with their type anyway, as it is necessary to dispatch method invocations at run time according to the proper subtype an object may have. Combined statically and dynamically typed languages such as Java, C++ and C-Sharp have in common with statically typed programming languages that ill typed programs are detected at compile time. With dynamically typed languages they have in common that at run time the types of values are checked when late binding is necessary. This kind of check can never fail; it is merely the decision process of which concrete implementation of a method has to be called in the given context. The advantage of late binding is the higher extensibility of the code applying it, as object types unknown at implementation time of the code in question remain usable as long as they fulfill the necessary interfaces. This advantage does not have to be paid for by sacrificing static type safety, as the dynamic check can never fail at run time. The disadvantage of late binding is the higher overhead at run time.


Further dynamic features, e.g. reflection, mostly share the same advantages and disadvantages as late binding—they are defined in a statically typed context, using statically typed abstractions of the language features. In the case of reflection, an invocation may fail, as any dynamic programming may fail due to type errors at run time, but in the context of the hosting language the failure is handled in a type safe way—e.g. in Java with appropriate exceptions.

8.1.4 From Typed Programming languages to typed Query Languages

In database theory querying is often formalized in an abstract manner as an algebra of selection, join and projection [4]. Less formally, querying—and hence a query language—is a composition of selecting some instances of a set of given data, further combining or filtering candidate instances by joining selections, and projecting intended parts of the selected data instances to result instances (possibly constructing new content for the result instances). Types, as approximations of values, can be of use for error detection and prevention in query languages just as they are for programming languages. The use of types in this context can be decomposed along the same principles as queries:

Selecting some data instances, of which a common type will be assumed, has to be done in a way compatible with the data’s type. A selection can hence be given a type. A common approach here [63][28] is to consider the type of the selection construct to be a type representing all data instances that the selection may select from the set of all possible data instances of the data formalism. A reasonable piece of information about a selection, under the condition of a given data type for the queried data, is hence the answer to the question “may this selection yield any result?”

Joining of different selections often involves multiple occurrences of the same variable, hence somehow comparing the values of selected instances of multiple selections. This may be done e.g. using equivalence relations (for example “are there equivalent values in the selected instances of two selections?”) or using arbitrary relations and operations on values (for example “is the sum of some values greater than a given bound?”). The idea of typing variables, functions and operations in general purpose programming languages for error detection or prevention is applicable in the same way here.

Projecting parts of the selected and ‘joined’ instances and constructing results yields new data instances that form a set—the answer set—and hence a type. A typed query program could be annotated with an intended result type; an ill typed query would then be a query that may produce invalid results with respect to the given result type. A query language with the presented typing philosophy is of practical relevance. Consider e.g. a query program in the context of a data warehouse: such queries tend to be long running processes due to the sheer amount of data, yet the data warehouse schema is often fairly simple (compared to the data complexity). If the query has an ill typed selection, it can quickly be rejected, reducing development costs. The type tests of joins may hint at potential run time problems, e.g. due to type incompatible function application; they may further reject queries that can never yield results. As an example of the usefulness of type checking of projections and constructions, assume a query producing an HTML document as result: the intended result type could be given as HTML, and a query that may produce invalid HTML would hence be ill typed.

8.2 Type Systems for Xcerpt

The query language Xcerpt exists as an untyped query language for the Web and the Semantic Web, yet different approaches for extending Xcerpt with types have been proposed.


8.2.1 XcerptT—Descriptive Typing for Xcerpt

XcerptT [13][66] is a type system for a substantial fragment of Xcerpt. The type system is called a descriptive type system, i.e. a typing approximates the semantics of a program (in an untyped programming language). This means finding an approximation of the semantics of the given program. The “counterpart” of descriptive typing in typing terminology is prescriptive typing. In prescriptive typing the types are usually defined aligned with the semantics of the language; type checking is then not related to the language’s semantics anymore. In our case, for a given Xcerpt program and a type of data (i.e. the set of data objects to which the program may be applied), the type system provides a type of the program results (i.e. a super-set of the set of the program results). This is type inference; if a type of expected results is given, then type checking can be performed by checking whether the obtained type of results is a subset of the given one. The given types for results and for the data queried (and in the end also the internal representation of types) are regular tree languages. It is also possible to use DTDs as type definitions. Two different algorithms are presented in the context of XcerptT. The first one has polynomial complexity and is based on a slight restriction of regular tree grammars called ‘proper type definitions’. Proper type definitions are regular tree grammars where no two distinct types occurring in the regular expression content models describe elements with identical label. The second algorithm handles arbitrary type definitions, but has exponential worst case complexity. In contrast to the approach presented in this thesis, the XcerptT approach always calculates the types as precisely as possible with respect to the expressiveness of the given type declaration formalism (i.e. proper type definitions or arbitrary regular tree grammars).
The approach in this thesis is to use unrestricted type declaration formalisms and to retain acceptable complexity—at least for ordered queries and ordered type declarations1—by relaxing the precision of type checking and inference. Another difference between the type checking in XcerptT and this thesis is the treatment of unordered content models: the modelling approach for unordered content models in XcerptT is the so called multiplicity list, as also presented in section 7.1.2. Multiplicity lists have been shown to be closed under intersection, which is arguably enough expressiveness for type checking in most situations, but they are not closed under union, which is arguably desirable for schema languages, e.g. for the creation of new schemata by uniting two given schemata.
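The closure behaviour of multiplicity lists can be sketched in a few lines; the interval encoding below (a dict from label to an occurrence interval) is an illustrative assumption, not the thesis’s formal definition:

```python
def intersect(m1, m2):
    """Intersect two multiplicity lists, each mapping a label to an
    (lo, hi) interval of allowed occurrence counts; labels absent from
    a list are taken as (0, 0). Returns None for the empty language."""
    result = {}
    for label in set(m1) | set(m2):
        lo1, hi1 = m1.get(label, (0, 0))
        lo2, hi2 = m2.get(label, (0, 0))
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:
            return None  # no multiset satisfies both intervals
        result[label] = (lo, hi)
    return result

# Union, in contrast, is not expressible in general: the union of
# "exactly one a" and "exactly one b" contains {a} and {b}, but the
# smallest multiplicity list covering both, {a: (0,1), b: (0,1)},
# also admits {} and {a, b}.
```

Per-label interval intersection is why the intersection of two multiplicity lists is again a multiplicity list, in polynomial time, while unions can only be over-approximated.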

8.2.2 Prescriptive Typing: from CLP to Xcerpt

In the work of Francois Fages and Emmanuel Coquery [20], the adaptation of a prescriptive type system for Constraint Logic Programming (i.e. CLP) to Xcerpt is presented. The type system for CLP is based on types with a sub-typing relation forming a lattice over the types, with a most general type at the top and an empty type at the bottom. As CLP is based on a conventional term concept, where terms and function symbols with a given label have fixed arity, the adaptation to Xcerpt comprises an extension for the treatment of semi structured data. So, opposed to the typing of CLP programs, where language constructs may be typed with one type along a path in the sub-type relation lattice, language constructs in Xcerpt may be typed with various types at the same time. This is the consequence of having repetition constructs, optionality and disjunctions in type declarations based on regular tree grammars. In contrast to the XcerptT approach of section 8.2.1, this approach is able to treat whole programs with rules and the chaining of them, as this property has already been inherited from the CLP typing. The specification of type checking in the adaptation of CLP type checking to Xcerpt type checking is directly based on decomposition of regular expressions. This distinguishes the work from the approach chosen in this thesis, where type checking is based on an automaton model for ordered data and queries. Unordered type declarations are not considered in the CLP to Xcerpt typing adaptation; however the problem itself is approached, as unordered query patterns can be type checked against (ordered) type declarations, which is comparable. The solution of type checking of unordered queries is expressed in the spirit of a declarative semantics

1 For unordered type declarations the approach of type checking using R2G2 in this thesis has exponential complexity.

of unordered types, where typing is valid if there is an instance of the type of which an arbitrary permutation can be matched by the query pattern.

8.2.3 Typing with R2G2—Differences to the Former Approaches

The R2G2-based typing approach for Xcerpt, as presented in the remainder of this section2, has similarities as well as differences to the two approaches presented above. A common feature of all approaches is relying on regular tree languages for ordered type declarations. The typing algorithms however are expressed differently: while XcerptT and the prescriptive typing of E. Coquery directly rely on the use of regular expressions and regular grammars, the approach presented here is based on regular tree automata. Directly using grammars and regular expressions in the formalization of type checking has the advantage of being simple and ‘inexpensive in formal definitions’—no intermediate formalism that is not visible to the user is necessary. However, when it comes to implementation, the automaton approach is arguably an advantage, as breadth and depth are treated in similar ways—both are conceptually just traversals of transitions. The use of automata for deciding the word problem is a standard method known to virtually any computer scientist and many programmers, extending finite automata techniques to tree automata is a well known technique, and the use of the hyper graph based automata as presented in chapter 6 arguably makes the ‘knowledge transition’ from words to trees rather smooth. When it comes to type checking, the approach presented here is considered an extension of the word problem, i.e. query language expressions are considered as words that are extended by constructs of incertitude, variables and incompleteness—hence type checking boils down to a beefed-up word problem test with automata.
The treatment of unordered data models, as presented in chapter 7, is different in all three approaches: while XcerptT follows the simple and pragmatic approach of using multiplicity lists, the prescriptive typing approach based on the adaptation of CLP typing has no modelling support for unordered data models, and the approach presented here relies on unordered interpretations of regular expressions. The use of multiplicity lists is computationally desirable, as the membership test and relevant set operations (i.e. subset test, intersection, emptiness test) can be performed in polynomial time. The downside is the restricted expressiveness of the approach; some constraints on languages expressible with regular expressions are not expressible using multiplicity lists, possibly forcing the user in practice into using ordered data models where unordered models would be more appropriate. The use of unordered interpretations of regular expressions pragmatically overcomes the modelling restrictions of multiplicity lists, however at a super-polynomial cost. In experiments the use of constraint solving techniques has shown good results and consistently fast solutions of the relevant set operations; it has not been possible to find regular expression modelling problems falling into the so called thick tail, the class of problems that are indeed hard to solve with the given constraint solver.3 The approach presented in this thesis further extends the former approaches by a modelling technique for graph shaped data and for serialisations of graph shaped data by means of spanning trees and typed references. This feature leads to almost no change in the operational semantics of validation or type checking—no more than cycle detection and a type compatibility check in the cyclic case is needed. However it is considered to be a relevant extension of current schema languages, motivated by the availability of typed references in general purpose programming languages.
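The defining property of the unordered interpretation—a word belongs to it iff some permutation of the word belongs to the ordered language—can be made concrete with a deliberately naive brute-force sketch (symbols are single characters here, an assumption made only for brevity):

```python
import itertools
import re

def unordered_match(pattern, symbols):
    """A word belongs to the unordered interpretation of a regular
    expression iff some permutation of its symbols belongs to the
    ordinary (ordered) language. Brute force over permutations, so only
    viable for short words; the factorial cost is in line with the
    super-polynomial price of unordered interpretations noted above,
    which motivates the counting-constraint approach of chapter 7."""
    rx = re.compile(pattern)
    return any(rx.fullmatch("".join(p))
               for p in set(itertools.permutations(symbols)))
```

For instance, the symbols b, c, a, b match the unordered interpretation of ab*c (via the permutation abbc), while b, b do not.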

2Note that the typing approach for Xcerpt can easily be adapted to other Web query languages, e.g. XQuery—Web query languages in general follow an approach of having selection, projection and construction. These features can also be found in Xcerpt. Adapting the typing process to other query languages could, in the simplest form, be achieved by seeing Xcerpt as an intermediate language to which other query languages may be translated (consider the work in [34] as an example of such a translation) and on which type checking is performed.
3A finite domain constraint solver from GNU Prolog has been used. Even if the problem is not a finite domain problem, there is always either a solution within a finite bound or no solution; this is implied by the fact that Presburger arithmetic formulae are decidable. It is however an open question how to statically obtain the upper bound for the solution space. No further investigation in this direction has been made in this thesis; as solvers for Presburger arithmetic constraints exist, it may however be research of interest for the finite domain solving community.


type AddressBook = addresses{ Address* };
type Address     = addr{ Name, Street, Nr, ZIP?, City? };
type Name        = name[ String ];
type Street      = street[ String ];
type City        = city[ String ];
type Nr          = nr[ Integer ];
type ZIP         = zip-code[ Integer ];

addresses{
  addr{
    name[ "Sacha Berger"ˆˆString ]ˆˆName,
    street[ "Oettingen"ˆˆString ]ˆˆStreet,
    city[ "Munich"ˆˆString ]ˆˆCity,
    nr[ 68ˆˆInteger ]ˆˆNr,
    zip[ 80639ˆˆInteger ]ˆˆZIP
  }ˆˆAddress
}ˆˆAddressBook

Figure 8.1: An example presenting a set of R2G2 type declarations along with a corresponding, completely type annotated Xcerpt data term.

8.2.4 The Syntax of R2G2 Typed Xcerpt

A typed Xcerpt program is an Xcerpt program where every language term may be typed. The “may” implies that any Xcerpt program in the current untyped syntax is also a valid program in typed Xcerpt syntax. Programs hence do not have to be completely type annotated to benefit from many type based services. The rationale behind this is that the programmer should not be forced into annotating all of the program with type information when he does not want to; types can be inferred from partial type annotation. A programmer may, e.g., be willing to annotate just the type of an input resource or of the output resource, as he wants to ensure valid input or output.

Typed Data Terms

Apart from Xcerpt programs, data terms may be type annotated; they can hence be validated using the type checking algorithm. Data terms are structurally the base of query and construct terms, hence of Xcerpt.

data-term        := ( oid "@" )? ns-label list ( type-annotation )?
type-annotation  := "ˆˆ" ( typename | type-disjunction )
type-disjunction := "(" ( typename | typeterm ) ( "|" ( typename | typeterm ) )* ")"
ns-label         := ( ns-prefix ":" )? label
ns-prefix        := label | '"' iri '"'
list             := ordered-list | unordered-list
ordered-list     := "[" data-subterms? "]"
unordered-list   := "{" data-subterms? "}"
data-subterms    := data-subterm ( "," data-subterm )*
data-subterm     := data-term | '"' string '"' | number | "ˆ" oid

Consider example 8.1 for a type declaration in R2G2 syntax along with the (totally) type annotated data term:

In example 8.1 only type names are used for annotation. If e.g. the element of type City had been annotated with a type term instead, it would read as follows:

city[ "Munich"ˆˆString ]ˆˆ(city[ String ])

Typed Xcerpt Construct Terms

Xcerpt construct terms extend data terms by (1) variables, (2) grouping constructs (e.g. all) and (3) default settings for optional parts.


Variables are type annotated as if they were data terms. For variables occurring in the group-by clause of the grouping constructs, the given type may be a primitive data type implementing an order, hence being informative to the ordering algorithm for the grouping of constructed results.

Grouping constructs like all or some are not type annotated, as they represent sequences of elements, and type annotation is inherently conceived for elements. However, the contained term can be annotated, as can the variables occurring in the (optional) group-by clause.

Optional parts in Xcerpt construct terms denote parts that are only constructed if there are variable bindings for the scoped optional variable. As optional may span sequences of terms, the optional part is not type annotated, and neither is the (optional) with-default clause.

addresses[
  all (
    addr{
      var NˆˆName,
      var CˆˆCity,
      optional ( var ZˆˆZIP ) with-default ( zip[ 0ˆˆInteger ]ˆˆZIP ),
      var NRˆˆNr,
      var SˆˆStreet
    }ˆˆAddress
  ) ordered-by ( var NˆˆString )
]ˆˆAddressBook

Typed Xcerpt Query Terms

Query terms arguably are extensions of construct terms, in the sense that they have constructs for (1) incertitude and (2) negation; they are restrictions of them in the sense that they have neither (1) grouping constructs nor (2) a with-default clause for variables tagged as optional.

Incertitude in order or breadth is given in Xcerpt by different brace or bracket styles in terms; it has no influence on the type annotation of terms.

Incertitude in depth—the desc construct—is type annotated like a term. As it contains itself a term, and the type annotation hence lexically looks like a type annotation of a type annotation, the descendant tagged term has to be set in parentheses if the descendant construct is to be typed. The query term desc a[[]] may e.g. be used to query hyperlinks at arbitrary depth in an HTML document. For its typed variant desc( a[[]]ˆˆA )ˆˆA the type A is considered to be the type of hyperlinks.

Negation in Xcerpt is expressed using the not or without constructs. Their type annotation paradigm is the same as for the descendant construct.

Optional query parts may also span sequences and are hence not type annotated.

Typed Xcerpt Queries

Queries are conjunctions or disjunctions of queries or query terms. As they are term structured themselves, they can be type annotated in the same way as other terms.

and(
  html[[]]ˆˆHTML,
  or( var XˆˆHTML, var YˆˆHTML )ˆˆHTML
)ˆˆHTML


Typed Xcerpt Rules

A rule is typed implicitly, i.e. without a special syntactic construct, by the type of the query and the type of the construct term. A rule is hence not type annotated.

Typed Xcerpt Programs

An Xcerpt program per se has no type. Each rule of a program is typed; this also holds for goal rules. The type of a program can be seen as the disjunction of the types of all goals in the program. A program is hence not type annotated.

8.3 Type checking and Type Inference for R2G2 Typed Xcerpt Programs

Typing Rules for Query Terms

The following rules describe the algorithm used to infer a total type annotation for a (possibly partly type annotated) query term. The algorithm is a function which returns the possible types out of a given set of types for a query term under a given automaton A. The automaton itself is constructed out of R2G2 declarations as presented in section 6.3. As a function, the algorithm’s signature is denoted as Γ, A | t, τ ↦ τ', where t is an Xcerpt program term, i.e. a program, a rule or a part of a rule. A is the automaton representing the types declared to be used in the program and is constant in the whole context of the typing algorithm application. By Γ a global variable environment used to represent the types of variables is denoted. It consists of tuples (v, T) where v is the variable (or its name) and T the type of v. Depending on the part of the Xcerpt program to be typed, τ resp. τ' is of different kind. In general, for Xcerpt query or construct terms, τ is a set of transitions as defined in A, more precisely in A_∆. As the type of a rule is given by the type of its query part and its construct part, the types for rules are also of a twofold nature—the type of a rule is a tuple of sets of transitions representing the types for the input and for the output.4 For the type inference for sequences of query or construct terms, τ is a set of states as found in A, more precisely in A_S, and for sequences of construct terms in the context of a grouping construct (i.e. the all construct), τ is a set of tuples of states as found in A_S. Note that the functions for typing terms and those for typing sequences (ordered and unordered) rely recursively on each other in an interlaced fashion.

8.3.1 Paying Attention to Type Annotation in Queries

The typing algorithm as presented does not forcibly need type annotated programs, yet it can consider even incomplete type annotation. When type annotation is available, it is propagated to the set of assumptions about a program segment. In the case of a query term this means that only those types inferred for the term that have non empty intersection with the type annotation are considered to be possible types. The rationale behind this is that two types with non empty intersection have common members, and only such members are arguably candidates for a valid query evaluation. By T̄ the user provided type annotation, i.e. the transitions realizing the type annotation in the automaton A, is denoted. The set T gets filtered to those types of T that have non empty intersection with a member of T̄; the resulting set T'' is then used for the actual typing of t. The result of this typing is then also the result of the typing of the type annotated term t : T̄. Computing T'' relies on checking the non empty intersection of two languages represented by two types, L(τ) ∩ L(τ̄) ≠ ∅. This can be calculated with the algorithms for intersection and emptiness check

4Note that a finer rule type would use a set of transition or type tuples, to relate each input type more precisely to an output type. The pros and cons of such a rule type have not been analysed in this thesis.

on automata as presented in sections 6.4.1 and 6.4.2, applied to two variations of A, varying in the starting transitions—set to τ, resp. τ̄.

T'' = { τ ∈ T | ∃τ̄ ∈ T̄ . L(τ) ∩ L(τ̄) ≠ ∅ }     Γ, A | t, T'' ↦ T'''
────────────────────────────────────────────────── (QUERYTYPEANNOTATION)
Γ, A | t : T̄, T ↦ T'''

8.3.2 Typing Ordered Queries

The typing of ordered and unordered sequences of terms is inherently different—ordered data is typed based on the use of automata as presented in chapter 6, while unordered data is typed on the basis of counting constraints as presented in chapter 7.

Ordered Total Query Patterns

When the typing function is applied to a query term l[ cnt ] with an ordered, total sequence of sub terms and the type disjunction T consisting of a set of transitions from A, the following rule applies:

S = { c | (_, l, c, _) ∈ T }     Γ, A | [ cnt ], S ↦ S'     T' = { (s, l, c, e) ∈ T | c ∈ S' }
────────────────────────────────────────────────── (TOTORDQUERYTERM)
Γ, A | l[ cnt ], T ↦ T'

The result of the application of this rule is the set of transitions T' ⊆ T. T' consists only of transitions with label l—as found on the query term to type—and with a content start state against which the sequence of sub terms can be typed. This is achieved by passing the content start states S of the transitions with matching label l to the typing of ordered total sequences with the content sequence cnt of the query term, and using the result S' in the restriction of T to T'. When the typing function for ordered sequences is applied to an ordered, total, empty sequence with a set of states as input, the following rule applies:

S' = { s ∈ S | ∃s'' ∈ {s} ∪ ε-reachable-states(s, A) . s'' ∈ A_F }
────────────────────────────────────────────────── (TOTORDEMPTYLIST)
Γ, A | [ ], S ↦ S'

The set S' ⊆ S is generated by checking the input states of S for their membership in A_F, the set of final states of A. Note that it is common to all the rules, except the desc rule, that they restrict the set of types used as input of the check. As the automata may contain ε-transitions, it is also necessary to traverse all possible ε-transitions from the states in S that are not final, as they may lead to final states. The ε-reachable-states(s, A) function returns a subset of A_S such that each state of the subset can be reached from s ∈ A_S along paths starting at s made up only of ε-transitions. The rule has no rule application above the line and hence does not recursively apply further typing. Type checking non-empty sequences is achieved in a recursive divide-and-conquer way, where the first element of the sequence is checked as well as the rest of the sequence. The typing of the sequence [ n1, n2, . . . , nm ] receives the set of states S as input and returns S' ⊆ S. The transitions T for checking the head n1 of the sequence are those that have a state of S as start state, or that have a start state that can be reached along ε-transitions from a state found in S. Further on, the transitions in T need to have a (horizontal) end state in S⃗', the set of states that an application of the typing function to the tail of the sequence returns. This application receives S⃗, the set of (horizontal) end states of T', the result of typing the first element of the sequence.


T = { (s', l, c, e) | (s', l, c, e) ∈ ε-reachable-transitions(s, A) ∧ s ∈ S ∧ e ∈ S⃗' }
S⃗ = { e | (_, _, _, e) ∈ T' }
S' = { s ∈ S | ∃t ∈ T' . t ∈ ε-reachable-transitions(s, A) }
Γ, A | n1, T ↦ T'     Γ, A | [ n2, . . . , nm ], S⃗ ↦ S⃗'
────────────────────────────────────────────────── (TOTORDLIST)
Γ, A | [ n1, n2, . . . , nm ], S ↦ S'

At first glance this looks, operationally, like a hopelessly recursive dependency in need of fix-point approximation. Unfortunately, the types of the elements of a sequence are not independent; this is due to the type of the sequence. Hence, checking the type of the first element influences the type of the rest of the sequence, and the type of the rest of the sequence influences the type of the head. Fortunately it turns out that, from an operational point of view, not more than one refinement of the head is needed: the key here is that the typing algorithm refines a given type disjunction (in the form of a set of types) by reducing the set to the possible types. After performing this reduction on the head element of the sequence, the tail of the sequence is typed with the outcome of the head typing. The tail then may reduce the input set of possible (type) states. The typing of the head can now be refined to use just the types represented by transitions with (horizontal) end states that are contained in the result of the typing of the tail. Retyping the tail is then not necessary, as it will yield exactly the same set of states. Arguably, the second typing application to the head is not necessary either, as it is already confirmed that the input types used will succeed; a kind of type propagation would be enough, but for the sake of brevity, the typing itself is used for the type propagation.

Ordered Partial Query Patterns

Typing of an ordered partial query pattern l[[ cnt ]] is very similar to the typing of an ordered total query pattern; the only distinction is the application of the typing rule for ordered partial sequences to the sequence of sub terms.

S = { c | (_, l, c, _) ∈ T }     Γ, A | [[ cnt ]], S ↦ S'     T' = { (s, l, c, e) ∈ T | c ∈ S' }
────────────────────────────────────────────────── (PARTORDQUERYTERM)
Γ, A | l[[ cnt ]], T ↦ T'

Typing of partial ordered empty sequences is similar to typing of total ordered empty sequences:

S' = { s ∈ S | ∃s'' ∈ {s} ∪ reachable-states(s, A) . s'' ∈ A_F }
────────────────────────────────────────────────── (PARTORDEMPTYLIST)
Γ, A | [[ ]], S ↦ S'

The only difference is the extension to non final states—instead of just following paths made up of ε-transitions, paths made up of ε-transitions and horizontal steps along non-ε-transitions can now be used to reach final states. This is given by the reachable-states(s, A) function. Unsurprisingly, the typing of (non-empty) ordered partial sequences is similar to the total case. The only difference is to consider, for the typing of the first sequence element, not just transitions with start states found in S (possibly connected using ε-transitions), but any transition reachable by horizontal traversal starting at the states of S. The rationale behind this is that an incomplete pattern matches data terms where, in between the matched (i.e. bound to query terms) sub terms, arbitrarily many other sub terms may occur. These sub terms, if valid with respect to the given type, would hence be accepted by corresponding automaton transitions.


T = { (s', l, c, e) | (s', l, c, e) ∈ reachable-transitions(s, A) ∧ s ∈ S ∧ e ∈ S⃗' }
S⃗ = { e | (_, _, _, e) ∈ T' }
S' = { s ∈ S | ∃t ∈ T' . t ∈ reachable-transitions(s, A) }
Γ, A | n1, T ↦ T'     Γ, A | [[ n2, . . . , nm ]], S⃗ ↦ S⃗'
────────────────────────────────────────────────── (PARTORDLIST)
Γ, A | [[ n1, n2, . . . , nm ]], S ↦ S'

Descendants

A descendant query term consists of a term n adorned with the desc keyword. The query semantics is that it either matches a data term at the level of the term sequence that is to be simulated by the query term sequence containing the descendant query term, or any term containing, at arbitrary nesting depth, a term simulated by n. A typing strategy for this semantics is that the descendant term is a placeholder for any term not only of types valid for typing n, but also of the types t that contain at arbitrary depth types against which n may be typed. This is restricted by the fact that the type t must also be valid with respect to the given position of the descendant term, and must hence be in the set of the types used as input of a typing rule application on the descendant term. Note that the descendant construct itself may not be type annotated by the user, yet the typing process may associate type information with it, which is valuable for query optimization. The typing of a descendant term receives the set of transitions T as input and returns a restriction T' ⊆ T such that each transition t ∈ T' either is a valid transition for n, or a valid transition for n can be reached from the content model start state of t in breadth or depth. For the typing of n, all transitions of T are used plus all transitions reachable in breadth or depth from the content model start states of the transitions in T.

T⃗ = T ∪ { t | t ∈ breadthAndDepthReachableTransitions(c, A) ∧ (_, _, c, _) ∈ T }
T' = { t ∈ T | t ∈ T⃗' ∨ ( ∃t⃗' ∈ T⃗' . t = (_, _, c, _) ∧ t⃗' ∈ breadthAndDepthReachableTransitions(c, A) ) }
Γ, A | n, T⃗ ↦ T⃗'
────────────────────────────────────────────────── (DESCENDANT)
Γ, A | desc n, T ↦ T'

Variables

The semantics of an unrestricted variable is to match any data term, as long as this is consistent with the positional mapping of data terms to query terms in the simulation unification. From the typing point of view, positional information is considered at the level of sequences, resulting in a set of (positionally) appropriate types in T. As a consequence of not restricting the matching any further, the result of the application of the variable typing rule with a given set of transitions T is the unaltered set T.

(X, T_X) ∈ Γ     ∀τ_X ∈ T_X . ∃τ ∈ T . L(τ_X) ∩ L(τ) ≠ ∅
────────────────────────────────────────────────── (QUERYVARIABLE)
Γ, A | var X, T ↦ T_X

In addition to the structural properties of queries, variables only match data terms in a query in such a way that multiple occurrences of the same variable match identical data terms (where ‘identity’ in this context is given if the terms bi-simulate using ground query term simulation). The same property holds for multiple instances of the same variable in a conjunction of query terms. A typing view of this property is that the terms matched by different occurrences of the same variable are forcibly of the same type. Nevertheless, the variable occurrences may have different type annotations, as long as these have non empty intersection. The global environment Γ is used for this purpose. It ensures that all types assigned to a variable have non empty intersection with at least one type of any other occurrence of the same variable. In implementations of the rule, this usually means that, independently of applying structural typing to the whole rule, the environment needs to be checked, possibly resulting in the necessity to repeat the typing phase, as the altered type of the variable may affect sibling element types or the types of elements on the

same path as the variable (and hence possibly the types in the whole structure). This process can again influence the variable types, requiring a fixpoint based iteration approach. In the worst case, each variable (not each occurrence) is typed with all possible types and each iteration eliminates no more than one type from one variable, until one variable reaches an empty (and hence invalid) state; the number of repetitions is therefore in the order of the size of the automaton times the number of variables (which is in the order of the size of the query). As such a strong relationship between the types of different occurrences of the same variable and the structure is usually quite unlikely, it can be assumed that no more than one iteration (or a small fixed number of iterations) is needed. A fixpoint not reached after an iteration is e.g. worth a warning: the query is structurally valid, but the satisfiability of the different occurrences of a variable (the one responsible for not reaching the fixpoint) cannot be ensured, which means that it is possible that the query never matches any valid data.

Variables with variable restrictions are evaluated, while querying, by binding only those data terms to the variable that match a given query term n. They are hence evaluated like regular query terms, but in addition the matching data is bound to the given variable. As a consequence, for typing, the type of the variable is propagated from the typing of the restricting query term.

Γ, A | t, T ↦ T′    Γ, A | var X, T′ ↦ T″
-------------------------------------------------- (QUERYVARIABLERESTRICTED)
Γ, A | var X → t, T ↦ T″
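The fixpoint iteration over the variable environment described above can be sketched as follows. This is a hypothetical, minimal sketch: types are modelled as plain sets of type names, and `retype` stands in for re-running the structural typing phase under the current environment; neither name is part of the thesis formalism.

```python
# Sketch of the bounded fixpoint iteration for variable types (assumption:
# types are represented as sets of type names; `retype` is a user-supplied
# callback performing one structural typing pass under the environment).

def fixpoint_variable_types(gamma, retype, max_rounds=None):
    """gamma: dict variable -> set of type names.
    Returns (environment, fixpoint_reached). A variable with an empty type
    set means the query can never match any valid data."""
    # worst case: each round removes at least one type from one variable
    limit = max_rounds if max_rounds is not None else \
        sum(len(ts) for ts in gamma.values()) + 1
    for _ in range(limit):
        new = retype(gamma)
        if any(not ts for ts in new.values()):
            return new, True          # empty type: unsatisfiable query
        if new == gamma:
            return gamma, True        # fixpoint reached
        gamma = new
    return gamma, False               # not reached within the bound: warn
```

A non-reached fixpoint (`False` in the second component) corresponds to the warning discussed above.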

8.3.3 Typing Unordered Queries

When typing unordered query terms, counting constraints are used to check the consistency of the language represented by the given types with the language of terms that the unordered (total or partial) sequence of query terms may match. This is achieved by checking the intersection of the two languages for emptiness. The intersection of languages represented by counting constraints is constructed by building the conjunction of the counting constraints representing the two languages and adding constraints that force those variables of the two languages which represent multiplicities of equal symbols to be equal. Note that, in the following set of rules, the grammar G on which the automaton A is based is also assumed to be globally available. It has, however, not been integrated into the rule signature, to reduce formal buzz in the cases of ordered content.
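As a toy illustration of this emptiness check, the following sketch enumerates bounded multiplicities instead of using the Presburger-based machinery of chapter 7; the representation of a counting-constraint language as a (symbols, predicate) pair and the `bound` parameter are assumptions of the sketch. The `extended` flag switches the per-symbol equality constraints to less-equal, as used later for partial query terms.

```python
from itertools import product

# Toy emptiness check for the (extended) intersection of two
# counting-constraint languages, by bounded enumeration of multiplicities.
# A language is (symbols, predicate); the predicate judges a dict mapping
# every symbol to its multiplicity (symbols outside the alphabet must be 0).

def intersection_nonempty(lang_q, lang_t, bound=4, extended=False):
    syms_q, pred_q = lang_q
    syms_t, pred_t = lang_t
    syms = sorted(set(syms_q) | set(syms_t))
    rng = range(bound + 1)
    for counts_q in product(rng, repeat=len(syms)):
        mq = dict(zip(syms, counts_q))
        if any(mq[s] and s not in syms_q for s in syms) or not pred_q(mq):
            continue
        for counts_t in product(rng, repeat=len(syms)):
            mt = dict(zip(syms, counts_t))
            if any(mt[s] and s not in syms_t for s in syms) or not pred_t(mt):
                continue
            # conjunction plus per-symbol constraints on multiplicities:
            # equality for intersection, <= for extended intersection
            if all(mq[s] <= mt[s] if extended else mq[s] == mt[s]
                   for s in syms):
                return True
    return False
```

In the actual system the check operates on Presburger arithmetic formulae; the bounded enumeration here merely makes the constraint construction tangible.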

Unordered Total Query Terms

When the typing function is applied to a query term l{ cnt } with an unordered, total sequence of sub-terms and the type disjunction T, the following rule applies:

T′ = { (s, l, c, e) | (s, l, c, e) ∈ T ∧ c ∈ S′ }
S = { c | (_, l, c, _) ∈ T }
T″ = { (c, re) | τ ∈ T ∧ τtrans = (_, l, c, _) ∧ “element τname = l⟨re⟩” ∈ G }
Γ, A | { cnt }, T″ ↦ S′
-------------------------------------------------- (TOTUNORDQUERYTERM)
Γ, A | l{ cnt }, T ↦ T′

The term is typed with T′ ⊆ T, which is obtained much in the way of typing total ordered query terms. The only addition is that typing unordered sequences is done with a set of types represented as tuples of a state and a regular expression, where the state is the start state of the automaton part implementing the regular expression—the signature of the unordered sequence typing rule is the rationale for that. The regular expressions of the type declarations of the types in T therefore have to be accessible. This is not difficult, as the type name of a term is available in that

context, as is the grammar; it is hence just a look-up of the typing rule with the given type name and an extraction of the regular expression from the right-hand-side type term.5 Typing an unordered sequence of query terms relies on the use of counting constraint based type annotations, as mentioned at the beginning of the section. As the counting constraint language representations introduced in chapter 7 are generated from regular expressions, the signature of the typing rule for unordered sequences needs a set of regular expressions as the given type formalism. The regular expressions are passed in along with the automaton state implementing the start state of the expression, as the typing of terms—the typing rule from which sequence typing is applied—expects those states as the result of the recursive typing process. The starting state of the automata implementation of the regular expression is hence just needed to relate the regular expression to the type occurrence it originates from.

τij ∈ T′i ∧ re = (τ11name | ··· | τ1jname | ···), ··· , (τm1name | ··· | τmkname | ···)
S′ = { s | (s, re′) ∈ T ∧ countingConstraints(re′) ∩ countingConstraints(re) ≠ ∅ }
T̄ = { τ | τtrans ∈ ε-reachableTransitions(s) ∧ (s, _) ∈ T }
Γ, A | n1, T̄ ↦ T′1    ···    Γ, A | nm, T̄ ↦ T′m
-------------------------------------------------- (TOTUNORDSEQ)
Γ, A | { n1, …, nm }, T ↦ S′

The rule is applied to a sequence n1, …, nm of query terms and a set T of automaton state / regular expression tuples. The automaton states in T denote starting states of the automata components implementing the corresponding regular expressions. From those states, all reachable transitions are gathered using the ε-reachableTransitions function; the resulting set of transitions is used for typing the nodes ni. The nodes are checked against all of these types, as they may be at any position in the query term and may hence match any of them. The results T′i of typing the nodes are then used to construct a regular expression, by using the type names τij_name of the types in T′i to build a group of options (… | τij_name | …) for each T′i and sequencing the option groups. The rationale behind this regular expression is that it represents a reasonable type approximation of the sequences matched by the given query term sequence. For this regular expression (called re), a counting constraint is then generated and checked for non-emptiness of the intersection with the counting constraints of the regular expressions found in the tuples of T, the set of input types. For non-empty intersections, the state corresponding to the respective regular expression is then finally included in the result set S′.
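The construction of the option-group regular expression from the node typing results can be transcribed directly; the string encoding of regular expressions used here is an assumption of this sketch.

```python
def build_re(node_typings):
    """node_typings: for each query sub-term n_i, the list of names of the
    types it was typed with (T'_i). Builds the sequence of option groups
    (t_11|t_12|...), ..., (t_m1|...) described above, as a plain string."""
    return ", ".join("(" + "|".join(names) + ")" for names in node_typings)
```

The resulting expression is what the counting constraint is generated from.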

Unordered Partial Query Term

The typing of partial unordered query terms is almost identical to the typing of total unordered query terms; the mere difference is the application of the typing rule for partial unordered sequences instead of the rule for total unordered sequences:

T′ = { (s, l, c, e) | (s, l, c, e) ∈ T ∧ c ∈ S′ }
S = { c | (_, l, c, _) ∈ T }
T″ = { (c, re) | τ ∈ T ∧ τtrans = (_, l, c, _) ∧ “element τname = l⟨re⟩” ∈ G }
Γ, A | {{ cnt }}, T″ ↦ S′
-------------------------------------------------- (PARTUNORDQUERYTERM)
Γ, A | l{{ cnt }}, T ↦ T′

The typing of partial unordered sequences is very similar to the typing of total unordered sequences. Instead of calculating the intersection of the two languages, another operator is used; it is called extended intersection and is defined now: while the normal intersection of counting constraint languages is constructed by conjoining the two constraint sets and adding equality constraints for the variables representing the same symbol's multiplicities in both languages, the equality is exchanged for less-equal (≤),

5 The look-up of the grammar rule is formalized as a pattern matching with “element τname = l⟨re⟩” in the definition of T″, where ⟨ and ⟩ depict arbitrary braces, because it is irrelevant whether the grammar rule models an ordered or an unordered data model.

such that the symbol's multiplicities on the side of the partial query term's counting constraint language definition may be smaller than on the side of the given type. Symbol multiplicities of one language not matched by variables of the other language get zero multiplicity on that side, as in the intersection case, but now related by less-equal instead of equality. The 'inequality' relation reflects the fact that more symbols may occur in valid data instances than in a query pattern they match, but not the other way round.

τij ∈ T′i ∧ re = (τ11name | ··· | τ1jname | ···), ··· , (τm1name | ··· | τmjname | ···)
S′ = { s | (s, re′) ∈ T ∧ countingConstraints(re′) ∩≤ countingConstraints(re) ≠ ∅ }
T̄ = { τ | τtrans ∈ ε-reachableTransitions(s) ∧ (s, _) ∈ T }
Γ, A | n1, T̄ ↦ T′1    ···    Γ, A | nm, T̄ ↦ T′m
-------------------------------------------------- (PARTUNORDLIST)
Γ, A | {{ n1, …, nm }}, T ↦ S′

8.3.4 Typing of Construct Terms

Typing of Xcerpt construct terms in particular, and of data in general, is comparable to the typing of queries. Similarities are that

• the structural properties of the construction are checked for validity based on the automaton model used for internal type representation.

• the typing is based on the semantics of the construct terms, which manifests itself in a set of data terms valid with respect to the given type.

Differences between the typing of construct terms and the typing of query terms are that

• construct terms have no incompleteness.

• construct terms have grouping or repetition components.

• no typing for variables is deduced; the typing deduced in the query parts is used.

• while query terms semantically represent sets of data terms6 and a well-typed query term is one with a non-empty intersection of this set and the set of data terms represented by the type, a well-typed construct term in contrast represents a set of data terms7 that must be a subset of the given type, as otherwise the construct term may yield invalid results.

Typing of construct terms is presented mostly with the same formalism as typing of queries. A small formal extension is used in the rules for typing the sequence of nodes in the scope of a grouping construct like all—the signature of such rules is A | lst, S ↦ SE, where the only difference to the signature of the former rules is that the result of typing is a set of tuples of states SE ⊆ S × S, where A is an automaton and S the set of states occurring in A, as presented formerly for query terms.

Ordered Total Construct Terms

Typing a construct term with label l, formalized in the rule ORDCONSTRUCTTERM, requires l to be the label of at least one transition in the set of transitions/types T. Further on, the content model cnt of the term has to be valid with respect to the content model of one of the possible types. The result of the application of the typing rule is the set T′ ⊆ T, consisting of all types for which the construct term validation passes. Note that the rule is identical to the typing rule for a

6 A query term can be seen as a representation of a set of data terms—of the data terms that can be matched by the query term.
7 A construct term can be seen as a representation of a set of data terms—of the data terms that can be constructed by the construct term for arbitrary well-typed substitution sets as input. A well-typed substitution set is one where variables are always substituted by data terms valid with respect to the type of the variable.

totally ordered query term. Indeed, such a query term (without any incompleteness like descendants or incomplete sub-terms at arbitrary depth and without variables) semantically represents the same set of data terms as such a construct term—they represent a set consisting of exactly one data term, syntactically equivalent to those query and construct terms! This property is called referential transparency [10]. Referential transparency means that the same term within a given context has the same meaning (or represents the same data).

T′ = { (s, l, c, e) | (s, l, c, e) ∈ T ∧ c ∈ S′ }
S = { c | (_, l, c, _) ∈ T }
Γ, A | [ cnt ], S ↦ S′
-------------------------------------------------- (ORDCONSTRUCTTERM)
Γ, A | l[ cnt ], T ↦ T′

The ordered sequence of construct terms of a construct term is typed in a similar way as the ordered sequence of query terms—starting with the empty sequence, which can successfully be typed by a final state. The set of states used as input, S, is restricted to S′ via the final states of the automaton A (found in FA).

S′ = { s | s ∈ S ∧ ∃s′ ∈ {s} ∪ ε-reachable-states(s, A) . s′ ∈ FA }
-------------------------------------------------- (ORDCONSTRUCTEMPTYLIST)
Γ, A | [ ], S ↦ S′

Type checking non-empty sequences of construct terms is similar to type checking of non-empty total sequences of query terms. It is achieved in a recursive divide-and-conquer way, where the first element of the sequence is checked as well as the rest of the sequence. The typing of the sequence [ n1, n2, …, nm ] receives the set of states S as input and returns S′ ⊆ S. The transitions T for checking the head n1 of the sequence are those that have a state of S as start state, or that have a start state that can be reached along ε-transitions from a state found in S. Further on, the transitions in T need to have a (horizontal) end state in S̃′, the set of states that the application of the typing function to the tail of the sequence returns. This application receives S̃, the set of (horizontal) end states of T′, the result of typing the first element of the sequence.

S′ = { s | s ∈ S ∧ ∃t ∈ T′ . t ∈ ε-reachable-transitions(s, A) }
S̃ = { e | (_, _, _, e) ∈ T′ }
T = { (s′, l, c, e) | (s′, l, c, e) ∈ ε-reachable-transitions(s, A) ∧ s ∈ S ∧ e ∈ S̃′ }
Γ, A | n1, T ↦ T′    Γ, A | [ n2, …, nm ], S̃ ↦ S̃′
-------------------------------------------------- (ORDCONSTRUCTLIST)
Γ, A | [ n1, n2, …, nm ], S ↦ S′
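The divide-and-conquer recursion can be sketched as follows; the concrete encoding of the automaton (transition 4-tuples, an ε-reachability map) and the `type_node` callback standing in for the recursive typing of a single term are assumptions of this sketch, not the thesis's formal definitions.

```python
# Sketch of ordered construct-term sequence typing over an automaton.
# transitions: list of (start, label, content, end); eps: dict mapping a
# state to the states reachable via epsilon transitions; final: set of
# final states; type_node(node, offered) returns the subset of the offered
# transitions the node types with.

def type_sequence(nodes, states, transitions, eps, final, type_node):
    if not nodes:  # empty sequence: succeed on (eps-reachable) final states
        return {s for s in states
                if ({s} | eps.get(s, set())) & final}
    # transitions offered for the head: start state in S or eps-reachable
    offered = [t for s in states for t in transitions
               if t[0] == s or t[0] in eps.get(s, set())]
    head = type_node(nodes[0], offered)
    # type the tail from the (horizontal) end states of the head transitions
    tail_ok = type_sequence(nodes[1:], {t[3] for t in head},
                            transitions, eps, final, type_node)
    head = [t for t in head if t[3] in tail_ok]  # ends must feed the tail
    return {s for s in states
            if any(t[0] == s or t[0] in eps.get(s, set()) for t in head)}
```

For instance, over a two-transition automaton accepting the sequence a, b, typing [a, b] from the initial state succeeds, while typing [a] alone fails because the end state of the a-transition is not final.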

Variables are not restricted in construct terms, in the sense that multiple occurrences of the same variable in a construct term have no influence on each other, as they receive their bindings and typing from the query part of a rule—all occurrences of the same variable are typed equally in the query. It is, however, important to ensure that the type of the variable in the global environment does not exceed the type propagated to the typing procedure. A variable type in Γ violating this requirement indicates that substitutions for this variable may exist that would result in invalid instances of the construct term. This requirement is considered by checking that each type associated with the given variable is contained in the set of types propagated to the construct variable rule application—hence the type for the variable X in Γ has to be a subset of (or equal to) T. It is, however, possible to restrict the type of the variable based on the global environment's type—if it is a proper subset of T, it can be returned as the consequence of the rule application to the variable.

(X, TX) ∈ Γ ∧ ∀τX ∈ TX . ∃τ ∈ T . (τX) ⊆ (τ)
-------------------------------------------------- (ORDCONSTRUCTVARIABLE)
Γ, A | var X, T ↦ TX


The Xcerpt grouping construct all8 is rather complex, as it involves three typing rules and a special case in the typing rule signature—the case where the application of the typing function returns a set of tuples of states. A grouping construct can be considered as a sequence of nodes inside another sequence (hence the grouping construct is a sub-sequence of a sequence of nodes). The node sequence of the grouping construct is to be repeated arbitrarily often, at least once. Hence, the content model of a construct term containing a grouping construct among its child node sequence needs to support repetition appropriate to capture the repeatability property of the grouping construct.

As typing in this thesis is operationally based on automata, some means of repetition in automata has to be found that is appropriate for typing grouping constructs. Repetition is modelled by loops in automata. A grouping construct is hence valid with respect to a type or a state (remember that sequences are typed using automaton states instead of transitions) when there is a path from this state such that the whole grouping sequence can be validated along this path, and when the last state of this path is either also the first state, or the first state is reachable via ε-edges from this last state. For this reason the typing of grouping sequences returns a set of tuples of states, consisting of the (start) state with which the grouping sequence is typed, as well as the end state. The end state is (1) used to check the cycle back to the start state, and (2) used as start state for the validation of the following sibling nodes of the grouping construct. The start state is one that has been reached while validating the sequence of sibling nodes to the left of the grouping construct. The typing rule is not applied strictly to the grouping construct, but to the sequence consisting of the grouping construct as first element and the following siblings. Note that the pattern this rule applies to hooks into the typing of construct term sequences in general.

The sequence [ all( n11, …, n1p ), n2, …, nm ] is hence typed with the states S such that the sequence of nodes ( n11, …, n1p )all, which are the nodes in the context of the grouping construct, can be typed using S. The result is the set of state tuples SE. The nodes following the grouping construct are typed with the set of states S̃, where each e ∈ S̃ comes from a tuple (s, e) ∈ SE such that s is ε-reachable from e.

S̃ = { e | (s, e) ∈ SE ∧ s ∈ {e} ∪ ε-reachable-states(e, A) }
S′ = { s | (s, e) ∈ SE ∧ ({e} ∪ ε-reachable-states(e, A)) ∩ S̃′ ≠ ∅ }
Γ, A | ( n11, …, n1p )all, S ↦ SE    Γ, A | [ n2, …, nm ], S̃ ↦ S̃′
-------------------------------------------------- (ORDCONSTRUCTGROUPING)
Γ, A | [ all( n11, …, n1p ), n2, …, nm ], S ↦ S′
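The loop requirement on the state tuples can be illustrated by a small sketch; the encoding of the ε-reachability relation as a plain dict is an assumption of this sketch.

```python
# Sketch of the loop check for grouping constructs: an end state of the
# grouping sequence is usable for typing the following siblings only if
# the start state of the repetition is reachable again from it, directly
# or via epsilon edges.

def grouping_continuations(SE, eps):
    """SE: set of (start, end) tuples from typing the grouping sequence.
    eps: dict state -> set of epsilon-reachable states.
    Returns the end states from which the siblings may be typed."""
    return {e for (s, e) in SE if s == e or s in eps.get(e, set())}
```

Tuples whose end state cannot loop back to the start state are discarded, exactly because the grouping sequence must be repeatable along the same automaton path.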

When a grouping construct sequence is empty, the state its validation starts with is per definition also the state it ends with. Note that an empty grouping construct sequence never occurs in real programs; it is, however, a valid pattern in the recursive decomposition of a program when typing it—the empty grouping construct sequence is the recursion's base case.

SE = { (s, s) | s ∈ S }
-------------------------------------------------- (ORDCONSTRUCTGROUPINGEMPTYLIST)
Γ, A | ( )all, S ↦ SE

The typing of a non-empty grouping construct sequence is similar to the typing of construct term sequences. The difference lies in the return type (the set of state tuples) and in how it is constructed. The set consists of tuples (s, e) of a state s with which the sequence is validated (this is identical to the typing of general sequences) and the state e that is propagated from the validation of the tail and which is the end state of the validation of the sequence. In contrast to general construct term sequence validation, the end state of the validation does not have to be a member of the set of final states of the automaton A, as the grouping sequence is possibly followed by sibling nodes in its containing sequence of nodes.

8 Other grouping constructs like some have been kept out for the sake of brevity; their typing is not substantially different from the typing of all.


SE = { (s, e) | s ∈ S ∧ (s̃, e) ∈ S̃E ∧ ∃(s′, l, c, s̃) ∈ T′ . (s′, l, c, s̃) ∈ ε-reachable-transitions(s, A) }
S̃ = { e | (_, _, _, e) ∈ T′ }
T = { (s′, l, c, e) | (s′, l, c, e) ∈ ε-reachable-transitions(s, A) ∧ s ∈ S ∧ (e, ẽ) ∈ S̃E }
Γ, A | n1, T ↦ T′    Γ, A | ( n2, …, nm )all, S̃ ↦ S̃E
-------------------------------------------------- (ORDCONSTRUCTGROUPINGLIST)
Γ, A | ( n1, n2, …, nm )all, S ↦ SE

Unordered Construct Terms

Briefly, typing of unordered construct terms is similar to typing of unordered total query terms, with two differences: (1) an unordered construct term may contain grouping constructs that have to be treated, and (2) the type calculated when type checking a construct term has to be a subset of the type used for annotation (in contrast to the non-empty intersection between calculated and annotated type in the case of a query). The treatment of grouping constructs (shown exemplarily on the Xcerpt grouping construct all) unfortunately introduces another modification of the rule signature: the rule for the abstract syntax component all receives a set of types as input and returns one regular expression.

T′ = { (s, l, c, e) | (s, l, c, e) ∈ T ∧ c ∈ S′ }
S = { c | (_, l, c, _) ∈ T }
T″ = { (c, re) | τ ∈ T ∧ τtrans = (_, l, c, _) ∧ “element τname = l⟨re⟩” ∈ G }
Γ, A | { cnt }, T″ ↦ S′
-------------------------------------------------- (UNORDCONSTRUCTTERM)
Γ, A | l{ cnt }, T ↦ T′

At the abstract syntax level of the unordered construct term (not at the level of its sequence of child terms), typing is identical to the typing of the unordered query term. This is not so surprising when looking at what exactly happens: the term is typed with T′ ⊆ T. Types of T with the same label l as the construct term in question are further considered for typing the sequence of child terms. Recall that by τname the type name, as used in the grammar, of a type or transition is meant. Respectively, τtrans represents the transition for a type name.

re = re-component(n1, T′1), ··· , re-component(nm, T′m)
S′ = { s | (s, re′) ∈ T ∧ countingConstraints(re) ⊆ countingConstraints(re′) }
T̄ = { τ | τtrans ∈ ε-reachable-transitions(s) ∧ (s, _) ∈ T }
Γ, A | n1, T̄ ↦ T′1    ···    Γ, A | nm, T̄ ↦ T′m
-------------------------------------------------- (UNORDCONSTRUCTLIST)
Γ, A | { n1, …, nm }, T ↦ S′

Typing an unordered sequence of child nodes of a construct term is done by first typing all child nodes and then checking that the regular expression re, which is constructed from the results of typing the child terms, models an unordered language that is a subset of (or equal to) the language modelled by the regular expressions provided in the set of tuples T. States from the set of tuples T are returned as valid sequence types only if they were associated with a regular expression modelling an unordered super-set language of re. The regular expression re is constructed by concatenating the results of applying the auxiliary function re-component to the consequences of typing all child nodes ni. re-component is defined as follows:

re-component(l[ ··· ], {τ1, …, τm}) = (τ1name | ··· | τmname)
re-component(l{ ··· }, {τ1, …, τm}) = (τ1name | ··· | τmname)
re-component(all( ··· ), re) = re

re-component returns the regular expression associated with the typing of an all construct, or the disjunction of all types (more precisely, all type names) returned as the consequence of typing any other construct term.


re = ( re-component(n1, T″1), ··· , re-component(nm, T″m) )+
Γ, A | n1, T ↦ T″1    ···    Γ, A | nm, T ↦ T″m
-------------------------------------------------- (UNORDCONSTRUCTGROUPING)
Γ, A | all( n1, …, nm ), T ↦ re

Typing a sequence of terms enclosed by the grouping construct all is done by checking all those sub-terms and returning a regular expression constructed similarly to the one used in the typing rule for sequences—the only difference is the appended "plus", which indicates that a grouping construct produces arbitrarily many (but at least one) instances of the type of the enclosed sequence of construct terms.

(X, TX) ∈ Γ ∧ ∀τX ∈ TX . ∃τ ∈ T . (τX) ⊆ (τ)
-------------------------------------------------- (UNORDCONSTRUCTVARIABLE)
Γ, A | var X, T ↦ TX

Typing of variables in unordered construct terms is identical to the typing in ordered construct terms.

8.3.5 Typing Rules

The typing of a rule is done by connecting the typing of the query part and the construct part. Connecting the types is mere propagation of the global environment Γ, as obtained by the typing of the query, to the typing of the construct part. Depending on whether typing of programs and rule chaining is considered, the type of the rule itself is relevant or not—a rule is seen as a transformation of data from the type of the query part to the type of the construct part.

Γ, A | c, Tc ↦ T′c    Γ, A | q, Tq ↦ T′q
-------------------------------------------------- (RULE)
Γ, A | c ← q, Tc ← Tq ↦ T′c ← T′q

8.3.6 Typing Programs

For rule based languages like Xcerpt, typing a program arguably means first typing all rules. However, rules interact by so-called rule chaining, which means that the query part of a rule is evaluated against the consequences of all rules, i.e. against the data constructed by the other rules. Typing Xcerpt programs considering rule chaining has first been conceived in work by W. Drabent and A. Wilk [66]. Different aspects of rule typing have been considered in this work, due to the fact that in some cases termination of the typing cannot be ensured—if the consequence of the rules restricts the type of the queries, the restricted queries can again restrict the consequences of the construction, and hence the query type may be further restricted. Approximations may be defined in such a way that termination can be guaranteed, e.g. by recognizing cyclic type restriction and aborting after a fixed number of cycles.
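The termination safeguard just mentioned can be sketched as a bounded iteration; the `refine` callback, standing in for one round of mutual restriction of query and construct types across all rules, and the cycle budget are assumptions of this sketch.

```python
# Sketch of chaining-aware program typing with a fixed cycle budget:
# iterate mutual type refinement until stable, or abort (and thereby
# approximate) after max_cycles rounds.

def type_program(types, refine, max_cycles=10):
    """types: the current rule-type assignment; refine(types) performs one
    round of restriction across all rules. Returns (types, exact)."""
    for _ in range(max_cycles):
        new = refine(types)
        if new == types:
            return types, True        # stable: exact result
        types = new
    return types, False               # budget exhausted: approximation
```

An aborted run is still sound as an approximation, as the types are only ever restricted, never widened.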

8.3.7 Coverage of Current Xcerpt Constructs

The presented typing rules do not cover all of Xcerpt's current features. As Xcerpt is in active development, features currently come and go. The features typed in the preceding sections are either generally useful when typing query languages for the Web or are principally new to the way typing is done in e.g. programming languages. Features of Xcerpt currently available or under development that have not been covered here are:

1. conjunctive and disjunctive queries
2. resource specifiers
3. conditions (a.k.a. where-clauses)
4. optional variable binding modifiers


5. query negation modifiers
6. function applications

7. rule chaining

Handling of conjunction and disjunction in queries (1) can be seen mostly as separate typing of the different components sharing a common environment for the variables, where variables occurring in different conjuncts have to have a non-empty intersection, and variables in different components of a disjunction get typed with the union of the types of both disjunctive components. Resource specifications (2) occur at the root level of a query. They therefore have no logical impact on the contained query or construct terms. However, they provide a prominent place to indicate information about the type of the data intended to be queried, e.g. a resource stating where to find a schema or type declaration. From a prescriptive view, modifiers—optional (4) and negation (5)—can be treated as regular terms, as their occurrence makes no sense if they are not applicable at their given occurrence. At the same time they should be treated as optional content in the given data model, as it makes no sense to negate, or treat as optional, obligatory content. In the ordered automaton model, optionality can be achieved by optionally skipping a (non-ε) horizontal transition step; for unordered content models, a corresponding optionality has to be introduced in the expression constructed from the Xcerpt term sequence. Conditions (3) do not use the Xcerpt term syntax; they are more in the spirit of (un)equations over variables and function applications. As such they can be handled by traditional typing (e.g. in the spirit of typing for functional programming languages) adapted to R2G2 type declarations. The necessary adaptation is to use type intersection instead of type equality when checking variable type conformance in the environment. Rule chaining (7) has not been considered in this thesis. It is, among the current Web query languages, a very special feature applicable to Xcerpt but not to most of the common Web query languages.
An important property to consider for a type system treating chaining in Xcerpt is subject reduction: is a program with chaining well typed, or maybe even type-able, at every stage of the evaluation of the chaining? In functional and imperative programming languages, this usually means after application of β-reduction. For logic languages or deductive rule languages like Xcerpt, it means after the application of substitutions.
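Returning to the treatment of optional modifiers (4) above—skipping a (non-ε) horizontal transition step in the ordered automaton model—the idea can be sketched as follows; the transition encoding and the marking of optional elements by label are assumptions of this sketch.

```python
# Sketch of optionality in the ordered automaton model: a horizontal
# transition carrying an optional element may be skipped, so matching may
# also continue from its end state (transitively, for chains of optional
# elements). transitions: (start, label, content, end); eps: dict state ->
# set of epsilon-reachable states.

def optional_starts(state, transitions, eps, optional_labels):
    """All states matching may continue from when optional elements are
    allowed to be absent."""
    seen, todo = set(), [state]
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        for (s0, label, _content, e) in transitions:
            if (s0 == s or s0 in eps.get(s, set())) and label in optional_labels:
                todo.append(e)        # the optional element may be absent
    return seen
```

For unordered content models the analogous effect would be obtained by weakening the corresponding counting constraint, as noted above.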


Part V

Outlook & Conclusion


9 Outlook

Many promising continuations of the practical work in this thesis have emerged. Some of them are summarized now.

9.1 Type Based Querying—an Extension to Simulation Unification

The traditional use of types in programming and query languages is to enhance security, performance, documentation, or the verbosity of errors. For many languages it holds that a well typed program is equivalent to what the program would be after removing all type information and running it in a sibling language without type support. In traditional settings, types arguably do not alter the semantics of (well typed) programs; they may just help finding ill-typed programs, and they can be considered passive at run time.

For Web querying in general, and for Xcerpt in particular, types on the side of selection constructs can be used to ensure that the query is reasonable for data of the given type, where an unreasonable query is one that never matches any data of the given type. Types hence represent a set that must not have an empty intersection with the set of data the given selection construct matches. As an example, the rather vague Xcerpt query term in figure 55 is expected to query HTML table elements in its given context. As the query may query arbitrary data, it is obviously well typed. However, as nothing restricts the variable TABLE from matching arbitrary content, the query will most likely not fulfill the author's intention. The type annotation is very restrictive about what the query actually queries (and it can be assumed that the query is not reasonable under the traditional, passive type semantics). It is possible that, in the given situation, the programmer really expected to query HTML tables and assumed that the given query would more or less fulfill this requirement.


CONSTRUCT
  result{ all var TABLE }
FROM
  in document("http://example.com")
    desc var TABLE^^Table
END

Code Example 55 An Xcerpt query querying arbitrary data in an HTML document—however, the type annotation indicates that the author had something else in mind...

A proposed extension to the query semantics in general, and to simulation unification in the case of Xcerpt, is to use the type information of well typed queries to restrict the query—to make types an active part of the querying process. As type information is usually given, and mostly designed in an unambiguous way, sometimes involving complex rules with many details, it can be a powerful tool for the query author to specify exactly the elements he or she is interested in. In the example in figure 55, the type would hence restrict the rather generic query pattern to match only HTML tables, even though the untyped query would additionally match other elements (in the given case very likely, e.g. the content of the HTML table).

9.1.1 Extending Ground Query Term Simulation With Active Type Querying To use active types to express queries in Xcerpt, the query evaluation has to be altered. The query evaluation in Xcerpt is based on a non standard unification, called Simulation Unification. Simulation unification is the method binding sub-terms of queried data to variables in a query pattern, for a query pattern matching the data. A preliminary concept of Simulation unification is the so called ground query term simulation. Ground query term simulation is well defined in [43]. A query term is called ground, if it does not contain (sub-term, label, namespace, or positional) variables. Tg ( Tq denotes the set of all ground query terms, where Tq is the set of all query terms, and Td ( Tg denotes the set of all data terms. In essence, a ground query term simulates a data term, if (1) the labels match, and (2) there is a mapping of the child terms of the query term to the child terms of the data term, such that the query sub-terms simulate data sub-terms. The properties of the mapping are defined according to partiality or order specification of the query term, e.g. the mapping for a total unordered query term to an unordered data term has to be bijective. Ground query term simulation is extended, such that (3) for a query term with active type annotation (which is a disjunction of types) to simulate a data term, the data term has to be valid with respect to at least one of the given data types. Operationally, this can be achieved by applying the type checking algorithm with the given type of the top level query term to the incoming data term at evaluation time and then, if the data term is valid with respect to the given type, applying the query algorithm. Type checking a data term is reasonable, as data terms are syntactically a subset of query terms and their query term semantics is to match exactly the data term they are. 
As a query term may carry various type annotations on its sub-terms (e.g., in the form of a complete type annotation resulting from type checking the query term), the type annotations of the child query terms have to be checked against the type annotations of the data terms. For this purpose, the intersection of the type disjunctions of the query and data (sub-)term is checked for non-emptiness.

9.1.2 Extending Simulation Unification with Active Type Querying

Based on ground query term simulation and standard unification, simulation unification is defined in [43] as the building block of Xcerpt responsible for evaluating the query parts of Xcerpt rules. Simulation unification is an algorithm that, given two terms t1 and t2, determines variable substitutions such that the ground instances of t1 and t2 simulate. The outcome of simulation unification is a set of substitutions called a simulation unifier. Informally, a simulation unifier for a query term tq and a construct term tc is a set of substitutions Σ such that each ground instance t′q of tq in Σ simulates into a ground instance t′c of tc in Σ. Each substitution is a mapping of the free variables in tq and tc to ground terms.

To extend simulation unification with active type support, ground query term simulation has to be replaced with ground query term simulation extended with active types as introduced in section 9.1.1. Furthermore, the set of substitutions Σ has to be restricted to those substitutions in which each mapping of a variable to a ground term fulfills the requirement that the type of the variable has a non-empty intersection with the type of the ground term.
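Under the simplifying assumption that a type can be represented as a finite set of ground terms, the restriction of Σ can be sketched as follows. The names and the set representation are illustrative only; the thesis checks non-emptiness of type intersections on automata rather than on enumerated sets.

```python
# Sketch: restrict a set of substitutions Σ to the type-consistent ones.
# Types are modelled as finite sets of ground terms, so the non-empty
# intersection condition degenerates to a membership test: the ground
# term bound to a variable must belong to the variable's type.
# Variables without a declared type are unconstrained.

def restrict(substitutions, var_types):
    return [
        sigma for sigma in substitutions
        if all(term in var_types[var]
               for var, term in sigma.items() if var in var_types)
    ]

# Hypothetical example: variable X must be bound to an author element.
var_types = {"X": {("author", "Doe"), ("author", "Smith")}}
sigmas = [{"X": ("author", "Doe")}, {"X": ("title", "On Types")}]
print(restrict(sigmas, var_types))   # only the first substitution survives
```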

9.2 Optimizing Xcerpt Query Evaluation Based on Type Information

This work has been published [11] and presented at the First International Conference on Web Reasoning and Rule Systems.

Given the vast amount of data on the Web and Semantic Web, reducing the costs of data transfer and query evaluation for Web queries is crucial. To reduce costs, it is necessary to narrow the data candidates to query, to simplify complex queries, and to reduce intermediate results. This can indeed be achieved using type information in queries and/or in the queried data. A static approach to the optimization of Web queries, based on the typing proposed in this thesis, has been proposed in [11]. Static optimization here means rewriting queries such that, under the assumption that the queried data is valid (with respect to the type of the schema), the rewritten query returns the same results as the original query. A set of rules which achieves the desired optimization by schema- and type-based query rewriting is proposed. The approach consists in using schema information for removing incompleteness (as expressed by 'descendant' constructs and disjunctions) from queries. The approach is presented for the query language Xcerpt, though it is applicable to other query languages like XQuery.

Incomplete query constructs have proved to be both essential tools for expressing Web queries and a great convenience for query authors, who can focus on the parts of the query they are most interested in. Though some evaluation approaches, e.g., [14] (usually limited to tree-shaped data), can handle certain incomplete queries (viz., those involving descendant or following) efficiently, most approaches suffer from lower performance when evaluating incomplete queries than when evaluating queries without incompleteness. The latter is particularly true for query processors with limited or no index support (a typical case in a Web context, where query processors are often used in scenarios where data is transient rather than persistent).
In Web queries, incompleteness occurs in three forms: breadth, depth, and order. In [11], mostly breadth and depth have been addressed, though order incompleteness is also briefly considered.

1. Incompleteness in depth allows a query to express restrictions of the form “there is a path between the paper and the author” without specifying the path’s exact shape. The most common construct for expressing depth incompleteness is XPath’s descendant or Xcerpt’s desc, an unqualified, arbitrary-length path between two nodes.

2. Incompleteness in breadth allows a query to express restrictions on some children of a node without restricting others (“there is an author child of the paper but there may be other unspecified children”). Breadth incompleteness is an essential ability of all query languages. Indeed, in many languages breadth completeness is much harder to express than incompleteness—a reason why many query authors prefer to write queries with breadth incompleteness where queries with completeness would be more efficient.

3. Incompleteness in order allows a query to express that the order of the children of a node is irrelevant (“there is an author child of the paper and a title child of the same paper, but their order does not matter”).
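The three forms of incompleteness can be illustrated with a toy matcher. This sketch uses hypothetical (label, children) terms and patterns flagged as partial and/or unordered; order-preserving matching for partial ordered patterns is omitted for brevity, and depth incompleteness is realised by a separate descendant helper.

```python
from itertools import permutations

# Toy matcher for breadth and order incompleteness. Terms are
# (label, children); patterns are (label, children, partial, unordered).
# A partial pattern needs an injective mapping of its children into the
# term's children, a total one a bijective mapping; for total ordered
# patterns the mapping is simply positionwise.

def matches(pattern, term):
    plabel, pchildren, partial, unordered = pattern
    tlabel, tchildren = term
    if plabel != tlabel:
        return False
    if len(pchildren) > len(tchildren):
        return False
    if not partial and len(pchildren) != len(tchildren):
        return False
    if not partial and not unordered:
        return all(matches(p, t) for p, t in zip(pchildren, tchildren))
    return any(
        all(matches(p, tchildren[i]) for p, i in zip(pchildren, perm))
        for perm in permutations(range(len(tchildren)), len(pchildren))
    )

def matches_desc(pattern, term):
    """Depth incompleteness: the pattern matches the term or a descendant."""
    return matches(pattern, term) or any(matches_desc(pattern, c) for c in term[1])

paper = ("paper", [("title", []), ("author", []), ("year", [])])
# breadth + order incompleteness: some children, in any order
print(matches(("paper", [("author", [], False, False),
                         ("title", [], False, False)], True, True), paper))   # True
# depth incompleteness: an author somewhere below the root
print(matches_desc(("author", [], True, True), ("proceedings", [paper])))     # True
```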

Figure 9.1: An example session of the visXcerpt Web and Semantic Web query editor. Queries (i.e., their patterns or terms) are visualized as nested boxes, possibly folded in the spirit of tabs. The editing window consists of three panels: on the left, the current program being edited or used for querying; on the right at the top, a set of templates for copying and pasting into the editor panel; and at the bottom, a parameter window for the query execution. The query results are shown in a pop-up window using a term visualisation very similar to that of visXcerpt queries.

The approach presented in [11], however, does not yet provide a fully applicable optimization procedure. First of all, the problem of finding an equivalent optimal query under schema constraints is undecidable. However, heuristic approaches exist for rewriting queries such that the resulting query should be more efficient in many cases for a given query engine. The work in [11] presents rewriting rules that can be used to add or remove incompleteness under schema constraints such that the resulting query is equivalent. The choice of which rule to apply depends on a heuristic which is not presented in that work, as it is highly dependent on the requirements of the query execution system.
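To illustrate the kind of rewriting involved, consider removing depth incompleteness under an acyclic schema: a descendant step between two labels can be replaced by the finite union of explicit label paths the schema permits. The following sketch uses a made-up schema representation (a mapping from labels to permitted child labels); the actual rules of [11] operate on R2G2 types.

```python
# Sketch: replace a descendant step by explicit paths under an acyclic
# schema. The schema representation (label -> permitted child labels) is
# illustrative only; each enumerated path stops at the first occurrence
# of the target label.

def expand_desc(schema, start, target):
    """Enumerate all label paths start -> ... -> target the schema permits."""
    paths = []
    def walk(label, path):
        if label == target and len(path) > 1:
            paths.append(path)
            return
        for child in schema.get(label, []):
            walk(child, path + [child])
    walk(start, [start])
    return paths

schema = {"html": ["body"], "body": ["div", "table"],
          "div": ["table"], "table": ["tr"]}
# desc(html, table) becomes the union of two explicit paths:
print(expand_desc(schema, "html", "table"))
# [['html', 'body', 'div', 'table'], ['html', 'body', 'table']]
```

Whether such an expansion actually pays off depends on the query engine: the rewritten query avoids an unqualified descendant traversal but may be larger, which is exactly why rule selection is left to engine-specific heuristics.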

9.3 Integrating Types Into visXcerpt—an Interactive Visual Development Environment for Xcerpt

visXcerpt is a visual dialect of Xcerpt integrated with a development environment. It was first presented in [9] and [12]. When programming in visXcerpt, the programmer uses templates to drag, drop, copy, and paste queries, here and there editing various labels. This way, a steep learning curve is expected to be flattened, as users may derive their work from existing work, i.e., learn by example. Consider figure 9.1 for a typical visXcerpt session. However, when constructing queries from existing example queries or documents, only a restricted variety of templates is available, and nothing prevents the combination of invalid components. Using types in such a visual environment could be greatly beneficial. Types can be used for two purposes:

1. Based on the current editing context and its type, applicable templates can be constructed and provided for query construction.

2. Based on types, invalid editing operations can be deactivated in certain contexts. For example, pasting an element that is invalid (at least in the given context) into a container can be prevented. On the other hand, required attributes or content can be enforced, or alternatively incompleteness can be provided.

The use of types in programming environments is known, e.g., from well-known integrated development environments like Microsoft Visual Studio or Eclipse. Construction of valid XML and SGML documents based on DTD information has also been applied in practice for some time, e.g., in the Emacs text editor. However, schema-driven query construction is a new field not yet applied in practice.


10 Conclusion

The contributions of this thesis can be summarized as

1. a Web schema language to model (serialisations of) graph-structured data, including a new modelling approach for unordered data or multisets based on unordered interpretation of regular expressions, and

2. a typing approach for Web query languages based on unrestricted regular tree grammars with the aforementioned extension.

The first contribution, the type and schema language R2G2, has been implemented for integration, e.g., in Web query or transformation languages, but also for stand-alone use as a schema language for XML or Xcerpt data terms. The modelling approach for unordered data based on unordered interpretation of regular expressions (1) is implemented as a prototype based on constraint solvers in GNU Prolog [24]. Typing of Web query languages (2) has been implemented based on (1) for the Web query language Xcerpt. Currently two implementations exist: a prototypical one supporting ordered and unordered query terms, and a second implementation for integration into the prototype of Xcerpt currently under development. The second version offers no support for unordered data models yet.

The need and usefulness of an extension of current type and schema languages for the Web arises from

• the lack of modelling support for graph-shaped documents and their serialisations, and

• the lack of modelling techniques for unordered content models with the expressiveness of regular expressions for ordered data.

The relevance of graph-shaped data on the Web is acknowledged by prominent movements like the Semantic Web, with, e.g., RDF at its base, a formalism arguably modelling graph-shaped data. However, many standard Web applications can also benefit from modelling possibilities for valid graph-shaped data. Modelling of graph-shaped data and its serialisations has been achieved by adding typed references to the grammar language and cycle detection to the automata execution.

Unordered content models are either not available in Web schema languages (i.e., DTD) or of lesser expressiveness than the modelling techniques for ordered data (i.e., XML Schema, Relax NG). Regular expressions provide (besides the ability to model the order of elements) a powerful mechanism to express the interdependency of elements and/or their optionality in content models. As shown in small use cases in chapter 3, this is of practical relevance. As a consequence, many schemata of data formats on the Web dictate order where order is in fact not relevant. It is, however, difficult to give clear numbers or hard facts about the misuse of order, as the lack of such modelling techniques leads to an "as it is" resignation among schema developers, where the current technical possibilities are used as they are.

As first shown in [15], unrestricted regular tree grammars are computationally expensive in some situations for type checking. Various work [6] has shown arguably "pessimistic" results about the cost or decidability of typing Web query languages under general regular tree language constraints; nevertheless, this approach has been followed in this thesis. The goal is not at all to contradict the excellent theoretical results of the work mentioned above: the approach followed in this thesis relies heavily on sacrificing precision of the typing to achieve acceptable performance and complexity (at least in the ordered data model case). The premise was not to restrict the expressiveness of the modelling language, as unrestricted languages are usually easier to understand; the restrictions are often not based on syntactical properties of the grammar formalism, but on properties of derivable automata or on features of underlying algorithms. Another advantage of unrestricted grammars for the user is their value as documentation: an easy-to-read and expressive grammar is valuable documentation of the data format. If the grammar of a data format has to be widened, generalized, or expressed in a verbose way due to restrictions of the schema formalism used to model it, its value as documentation is reduced.
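The unordered interpretation of regular expressions has a simple declarative reading: a multiset of child labels matches an expression if and only if some linearization of the multiset matches it in the usual ordered sense. The following brute-force sketch illustrates only this semantics; the thesis evaluates such constraints using counting constraints and Presburger arithmetic rather than enumeration.

```python
import re
from itertools import permutations

# Naive reference semantics for unordered interpretation of a regular
# expression: a multiset of child labels matches iff SOME ordering of it
# matches the expression in the ordered sense. Exponential in the number
# of children, for illustration only.

def matches_unordered(regex, labels):
    pat = re.compile(regex)
    return any(pat.fullmatch("".join(p)) for p in set(permutations(labels)))

print(matches_unordered("a(bc)*", ["c", "b", "a"]))  # True: ordering "abc" matches
print(matches_unordered("a(bc)*", ["a", "b"]))       # False: no ordering matches
```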
While the typing approach presented here has been tailored to the Web and Semantic Web query language Xcerpt, it can easily be adapted to other Web or Semantic Web query languages like, e.g., XQuery. The main requirement is the usefulness of R2G2 as a type declaration language for the given query language. A possible approach to achieve typing is the translation of the given query language to Xcerpt for the purpose of typing (the translation between Xcerpt and XQuery has been investigated in [34]), but the rules may also easily be adapted to other query languages, as long as the concepts of construction, projection and selection are applicable (which arguably is a central conceptualisation of all query languages).

So, what are the benefits of a (statically) typed Web query language over an untyped one? The most obvious one is help for the developer of queries, by providing information about errors prior to run time (e.g., at compile time). One could assume that errors can easily be found by test-running a program; however, this may only find errors along the execution paths used while test-running, and errors may stay hidden for a long time until triggered. When type checking, only a certain class of errors may be found (i.e., errors related to type conflicts), but they are all found, as type checking analyses the whole program. Furthermore, types often serve as valuable documentation of the code.

Another arguably important benefit of typed queries is the potential optimization of the evaluation based on type information. This can happen at run time or at compile time. As a rule of thumb, this kind of optimization is profitable whenever the schema is quite complex and the type information is rather selective about the data fragments, while queries are rather brief or less selective. Run-time optimization could, for instance,
rely on checking the types of validated data against the expected results, without deeply checking structures or data values in the case of a type mismatch. An approach briefly investigated alongside this thesis is the static optimization of queries based on type annotations (see section 9.2, as well as [11]). The idea is to rewrite queries based on their type annotation such that their selectivity is increased or their evaluation cost is reduced. A set of rewriting rules for queries under schema constraints, yielding semantically equivalent queries (under the assumption that the queried data is valid with respect to the given schema), has been proposed in this work. However, the choice of which rules to apply to achieve optimized queries can only be based on heuristics specialized for the evaluation engine of the query language in question. No such heuristics have been given yet.

Design of type systems is a difficult task, and while this thesis gives a rather concrete guide to building a type system for Xcerpt, many tasks can still be performed in the research on typing Xcerpt and Web query languages.


Bibliography

[1] JTC 1/SC 34. Standard Generalized Markup Language (SGML). International Organization for Standardization (ISO), 1986. ISO 8879:1986.

[2] JTC 1/SC 34. Document Schema Definition Languages (DSDL). International Organization for Standardization (ISO), 2004. ISO/IEC 19757.

[3] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.

[4] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.

[5] A. V. Aho and J. D. Ullman. Principles of Compiler Design. Addison-Wesley, 1977.

[6] Noga Alon, Tova Milo, Frank Neven, Dan Suciu, and Victor Vianu. XML with data values: Typechecking revisited. In Symposium on Principles of Database Systems, 2001.

[7] Uwe Aßmann, Sacha Berger, François Bry, Tim Furche, Jakob Henriksson, and Jendrik Johannes. Modular Web Queries—From Rules to Stores. In Proc. Int'l. Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), 2007.

[8] Uwe Aßmann, Sacha Berger, François Bry, Tim Furche, Jakob Henriksson, and Paula-Lavinia Patranjan. A Generic Module System for Web Rule Languages: Divide and Rule. In Proc. Int'l. RuleML Symp. on Rule Interchange and Applications, 2007.

[9] Sacha Berger. Conception of a graphical interface for querying XML. Diplomarbeit/diploma thesis, Institute of Computer Science, LMU, Munich, 2003.

[10] Sacha Berger, François Bry, and Tim Furche. Xcerpt and visXcerpt: Integrating Web querying. In Proceedings of Programming Language Technologies for XML, Charleston, South Carolina (14th January 2006), 2006.

[11] Sacha Berger, François Bry, Tim Furche, and Andreas J. Häusler. Completing Queries: Rewriting of Incomplete Web Queries under Schema Constraints. In Proceedings of First International Conference on Web Reasoning and Rule Systems, Innsbruck, Austria (7th–8th June 2007), volume 4524 of LNCS, pages 319–328, 2007.

[12] Sacha Berger, François Bry, Sebastian Schaffert, and Christoph Wieser. Xcerpt and visXcerpt: From pattern-based to visual querying of XML and semistructured data. In Proceedings of 29th Intl. Conference on Very Large Data Bases, Berlin, Germany (9th–12th September 2003), 2003.

[13] Sacha Berger, Emmanuel Coquery, Wlodzimierz Drabent, and Artur Wilk. Descriptive Typing Rules for Xcerpt. In PPSWR, pages 85–100, 2005.


[14] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: a fast XQuery Processor powered by a Relational Engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 479–490, New York, NY, USA, 2006. ACM Press.

[15] A. Brüggemann-Klein. Formal Models in Document Processing. Habilitation, Institut für Informatik, Universität Freiburg, Freiburg, Germany, 1993.

[16] François Bry, Wlodzimierz Drabent, and Jan Maluszynski. On Subtyping of Tree-Structured Data: A Polynomial Approach. In Principles and Practice of Semantic Web Reasoning, Second International Workshop, PPSWR 2004, St. Malo, France, September 6-10, 2004, Proceedings, volume 3208 of Lecture Notes in Computer Science, pages 1–18. Springer, 2004.

[17] François Bry and Sebastian Schaffert. A gentle introduction into Xcerpt, a rule-based query and transformation language for XML. In Proceedings of International Workshop on Rule Markup Languages for Business Rules on the Semantic Web, Sardinia, Italy (14th June 2002), 2002.

[18] Hans-Jürgen Bürckert, A. Herold, D. Kapur, Jörg Siekmann, M. E. Stickel, M. Tepp, and H. Zhang. Opening the AC-Unification Race. Journal of Automated Reasoning, 4(4):465–474, 1989.

[19] Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML query language for heterogeneous data sources. Lecture Notes in Computer Science, 1997:1–25, 2001.

[20] Horatiu Cirstea, Emmanuel Coquery, Wlodzimierz Drabent, François Fages, Claude Kirchner, Luigi Liquori, Benjamin Wack, and Artur Wilk. Types for REWERSE reasoning and query languages, 2005.

[21] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available at: http://www.grappa.univ-lille3.fr/tata, 1997. Release of October 12th, 2007.

[22] Roger L. Costello, Bertrand Poisson, and Melissa Utzinger. XML Schemas: Best Practices. Technical report, xml-dev mailing list, 2006.

[23] Eduard Derksen, Peter Fankhauser, Ed Howland, Gerald Huck, Ingo Macherius, Makoto Murata, Michael Resnick, and Harald Schöning. XQL (XML Query Language). Technical report, August 1999.

[24] Daniel Diaz and Philippe Codognet. The GNU Prolog system and its implementation. In SAC (2), pages 728–732, 2000.

[25] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[26] M. J. Fischer and M. O. Rabin. Super-Exponential Complexity of Presburger Arithmetic. In SIAM-AMS: Complexity of Computation: Proceedings of a Symposium in Applied Mathematics of the American Mathematical Society and the Society for Industrial and Applied Mathematics, volume 7, pages 27–41, 1974.

[27] V. Ganesh, S. Berezin, and D. Dill. Deciding Presburger arithmetic by model checking and comparisons with other methods. In Formal Methods in Computer-Aided Design (FMCAD '02), number 2517 in LNCS, pages 171–186. Springer, November 2002.

[28] Haruo Hosoya and Benjamin C. Pierce. XDuce: A statically typed XML processing language. ACM Transactions on Internet Technology (TOIT), 3(2):117–148, 2003.

[29] International Organization for Standardization. Document Description and Processing Languages – Regular Language Description for XML (RELAX) – Part 1: RELAX Core, October 2000. ISO/IEC DTR 22250-1.


[30] International Organization for Standardization. Information technology – Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based validation – RELAX NG, November 2003. ISO/IEC 19757-2:2003.

[31] Stefan Kahrs. Well-Going Programs Can Be Typed. In Martin Hofmann, editor, Typed Lambda Calculi and Applications, number 2701 in LNCS, pages 167–179. Springer, June 2003.

[32] Claude Kirchner, Hélène Kirchner, and Anderson Santana. Anchoring modularity in HTML. In First International Workshop on Automated Specification and Verification of Web Sites (WWV 2005), pages 139–151, 2005.

[33] Sebastian Kraus. Use Cases für Xcerpt: Eine positionelle Anfrage- und Transformationssprache für das Web. Diplomarbeit/diploma thesis, Institute of Computer Science, LMU, Munich, 2004.

[34] Benedikt Linse. Automatic translation between XQuery and Xcerpt. Diplomarbeit/diploma thesis, Institute of Computer Science, LMU, Munich, 2006.

[35] Denis Lugiez and Silvano Dal Zilio. XML Schema, Tree Logic and Sheaves Automata. Applicable Algebra in Engineering, Communication and Computing, 17(5):337–377, 2006.

[36] Robin Milner. Fully abstract models of typed lambda-calculi. Theoretical Computer Science, 4(1):1–22, 1977.

[37] M. Murata. Hedge Automata: a Formal Model for XML Schemata. Web page, 2000.

[38] M. Murata, D. Lee, and M. Mani. Taxonomy of XML schema languages using formal language theory. In Extreme Markup Languages, Montreal, Canada, 2001.

[39] F. Neven and T. Schwentick. XML Schemas without Order, 1999.

[40] Rohit J. Parikh. On context-free languages. J. ACM, 13(4):570–581, 1966.

[41] Benjamin C. Pierce. Types and Programming Languages. MIT Press, Cambridge, MA, USA, 2002.

[42] Mojżesz Presburger. Über die Vollständigkeit eines gewissen Systems der Arithmetik ganzer Zahlen, in welchem die Addition als einzige Operation hervortritt. In Comptes Rendus du I congrès de Mathématiciens des Pays Slaves, pages 92–101, 1929.

[43] Sebastian Schaffert. Xcerpt: A Rule-Based Query and Transformation Language for the Web. Dissertation/Ph.D. thesis, Institute for Informatics, University of Munich (LMU), 2004.

[44] Uwe Schöning. Theoretische Informatik – kurzgefasst. Spektrum Akademischer Verlag in Elsevier, 1997.

[45] H. Seidl, T. Schwentick, A. Muscholl, and P. Habermehl. Counting in trees for free, 2004.

[46] Unidex. Turing Machine Markup Language (TMML), March 2001. http://www.unidex.com/turing/tmml.htm.

[47] W3 Consortium. Extensible Stylesheet Language Transformations (XSLT) Version 1.0, November 1999. W3C Recommendation, http://www.w3.org/TR/xslt.

[48] W3 Consortium. XML Path Language (XPath) Version 1.0, November 1999. W3C Recommen- dation, http://www.w3.org/TR/xpath.

[49] W3 Consortium. XHTML(TM) 1.0 The Extensible HyperText Markup Language (Second Edition), January 2000. W3C Recommendation, http://www.w3.org/TR/xhtml1/.


[50] W3 Consortium. Extensible Stylesheet Language (XSL) Version 1.0, October 2001. W3C Recommendation, http://www.w3.org/TR/2001/REC-xsl-20011015/.

[51] W3 Consortium. TREX — Tree Regular Expressions for XML, February 2001. http://www.thaiopensource.com/trex/.

[52] W3 Consortium. XML Fragment Interchange, February 2001. http://www.w3.org/TR/xml-fragment.

[53] W3 Consortium. XML Linking Language (XLink) Version 1.0, June 2001. W3C Recommendation, http://www.w3.org/TR/xlink/.

[54] W3 Consortium. Scalable Vector Graphics (SVG) 1.1 Specification, January 2003. W3C Recommendation, http://www.w3.org/TR/SVG/.

[55] W3 Consortium. XHTML(TM) 2.0, May 2003. W3C Working Draft, http://www.w3.org/TR/xhtml2/.

[56] W3 Consortium. XML Information Set (Second Edition), February 2004. W3C Recommendation, http://www.w3.org/TR/xml-infoset/.

[57] W3 Consortium. XML Schema Part 0: Primer Second Edition, October 2004. W3C Recommendation, http://www.w3.org/TR/xmlschema-0/.

[58] W3 Consortium. XML Schema Part 1: Structures Second Edition, October 2004. W3C Recommendation, http://www.w3.org/TR/xmlschema-1/.

[59] W3 Consortium. XML Schema Part 2: Datatypes Second Edition, October 2004. W3C Recommendation, http://www.w3.org/TR/xmlschema-2/.

[60] W3 Consortium. Document Object Model (DOM), January 2005. W3C Architecture domain, http://www.w3.org/DOM/.

[61] W3 Consortium. Extensible Markup Language (XML) 1.0 (Fourth Edition), August 2006. W3C Recommendation, http://www.w3.org/TR/xml/.

[62] W3 Consortium. SPARQL Query Language for RDF, June 2007. W3C Candidate Recommendation, http://www.w3.org/TR/rdf-sparql-query/.

[63] W3 Consortium. XQuery 1.0: An XML Query Language, January 2007. W3C Recommendation, http://www.w3.org/TR/xquery/.

[64] W3 Consortium. XQuery 1.0 and XPath 2.0 Formal Semantics, January 2007. W3C Recommendation, http://www.w3.org/TR/xquery-semantics/.

[65] Artur Wilk and Wlodzimierz Drabent. A Prototype of a Descriptive Type System for Xcerpt. In Principles and Practice of Semantic Web Reasoning, 4th International Workshop, PPSWR 2006, Budva, Montenegro, June 10-11, 2006, Revised Selected Papers, volume 4187 of Lecture Notes in Computer Science, pages 262–275. Springer, 2006.

[66] Artur Wilk and Wlodzimierz Drabent. On Types for XML Query Language Xcerpt. In François Bry, Nicola Henze, and Jan Maluszynski, editors, Principles and Practice of Semantic Web Reasoning, number 2901 in LNCS, pages 128–145. Springer, 2003.

[67] Silvano Dal Zilio, Denis Lugiez, and Charles Meyssonnier. A logic you can count on. In POPL '04: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 135–146, New York, NY, USA, 2004. ACM.

Curriculum Vitae

Sacha Berger was born on January 29th, 1974 in Perpignan, France, near which he lived for five years. Since 1979 he has lived in Munich, Germany. He received his high school degree (Abitur) in 1993. From 1996 to 2002 he studied computer science with a minor in computational linguistics at the Institute for Informatics of the University of Munich (LMU). His diploma thesis was on “Conception of a Graphical Interface for Querying XML”. He finished his studies with honours. Since September 2003, Sacha has been working as a research and teaching assistant at the research unit for Programming and Modelling Languages headed by Prof. Dr. François Bry at the Institute for Informatics of the University of Munich. His research interests include static properties of Web and Semantic Web query languages. Sacha Berger is married and has a son. Besides his professional interests, he likes Mexican culture, photography and cooking.
