Regular Rooted Graph Grammars a Web Type and Schema Language

Total Page:16

File Type:pdf, Size:1020Kb

Regular Rooted Graph Grammars a Web Type and Schema Language Regular Rooted Graph Grammars A Web Type and Schema Language Dissertation zur Erlangung des akademischen Grades des Doktors der Naturwissenschaften an der Fakultat¨ fur¨ Mathematik, Informatik und Statistik der Ludwig-Maximilians-Universitat¨ Munchen¨ von Sacha Berger Dezember 2007 Erstgutachter: Prof. Dr. Franc¸ois Bry (Universitat¨ Munchen)¨ Zweitgutachter: Prof. Dr. Claude Kirchner (INRIA Bordeaux) Tag der m ¨undlichenPr ¨ufung: 4. Februar 2008 Abstract This thesis investigates a pragmatic approach to typing, static analysis and static optimization of Web query languages, in special the Web query language Xcerpt[43]. The approach is pragmatic in the sense, that no restriction on the types are made for decidability or efficiency reasons, instead precision is given up if necessary. Prag- matics on the dynamic side means to use types not only to ensure validity of objects operating on, but also influencing query selection based on types. A typing language for typing of graph structured data on the Web is introduced. The Graphs in mind are based on spanning trees with references, the typing languages is based on regular tree grammars [37, 38] with typed reference extensions. Beside ordered data in the spirit of XML, unordered data (i.e. in the spirit of the Xcerpt data model or RDF) can be modelled using regular expressions under unordered in- terpretation – this approach is new. An operational semantics for ordered and un- ordered types is given based on specialized regular tree automata[21] and counting constraints[67] (them again based on Presburger arithmetic formulae). Static type checking of Xcerpt query and construct terms is introduced, as well as optimization of Xcerpt query terms based on schema information. Zusammenfassung In dieser Arbeit wir ein pragmatischer Ansatz zur Typisierung, statischen Analyse und Optimierung von Web-Anfragespachen, speziell Xcerpt [43], untersucht. Prag- matisch ist der Ansatz in dem Sinne, dass dem Benutzer keinerlei Einschrankungen¨ aus Entscheidbarkeits- oder Effizienzgrunden¨ auf modellierbare Typen gestellt wer- den. Effizienz und Entscheidbarkeit werden stattdessen, falls notig,¨ durch Vergroberun-¨ gen bei der Typprufung¨ erkauft. Eine Typsprache zur Typisierung von Graph-strukturierten Daten im Web wird eingefuhrt.¨ Modellierbare Graphen sind so genannte gewurzelte Graphen, welche aus einem Spannbaum und Querreferenzen aufgebaut sind. Die Typsprache basiert auf regulare¨ Baum Grammatiken [37, 38], welche um typisierte Referenzen erweitert wur- de. Neben wie im Web mit XML ublichen¨ geordneten strukturierten Daten, sind auch ungeordnete Daten, wie etwa in Xcerpt oder RDF ublich,¨ modellierbar. Der dazu ver- wendete Ansatz—ungeordnete Interpretation Regularer¨ Ausdrucke—ist¨ neu. Eine operationale Semantik fur¨ geordnete wie ungeordnete Typen wird auf Basis spezial- isierter Baumautomaten [21] und sog. Counting Constraints [67] (welche wiederum auf presburgerarithmetische Ausdrucke¨ [42] basieren). Es wird ferner statische Typ -Prufung¨ und -Inferenz von Xcerpt Anfrage- und Konstrukttermen, wie auch Opti- mierung von Xcerpt Anfragen auf Basis von Typinformation eingefuhrt.¨ III Acknowledgements This thesis has been developed during my four-year employment as research and teaching assis- tant at the University of Munich. I am grateful not only for having been able to study and get passionated about computer science at the University of Munich, but also for having an oppor- tunity to convey this passion to the next generation of computer science students. Especially in the course of this thesis, I would like to thank • Fran¸coisBry from the University of Munich, who supervised this thesis. Numerous fruitful, sometimes heated, but always supportive conversations with him have been an important driving force for me during all the time. • Claude Kirchner from INRIA (Institut National de Recherche en Informatique et en Automa- tique) France, for being the second reviewer of this thesis • Tim Furche from the University of Munich, for spending countless hours in discussing high level principles as well as the lowest details of Xcerpt, the value of typing and optimization of queries and generic module systems with me. • Norbert Eisinger from the University of Munich, for introducing me into scientific work and for spending much time with me discussing details of automata theory and rule based calculi. • Paula-Lavinia P˘atrˆanjan from the University of Munich, for being such a valuable friend and helping me to keep the head up when times were hard for me. • Sebastian Schaffert from Salzburg Research, not only for being the one to introduce me with Xcerpt, also (and specially) for being a good friend. • Wlodzimierz Drabent from the Polish Academy of Science and Linkopings¨ Universitet (LiU) for his expertise on the topic of descriptive type systems. • Philipp Obermeier who has worked with me on the implementation of a prototype and who by this has been invaluable in gaining insight about how to improve understandability of this thesis. • Andreas H¨ausler who has worked with me on the concepts of type based query optimization. and last but not least: most gratitude goes to my family and my friends for supporting me patiently during the course of this thesis. Special thanks to Stefan Kollmannsberger for reviewing text and supporting me when needed most. This research has been partly funded by the European Commission and by the Swiss Fed- eral Office for Education and Science within the 6th Framework Programme project REWERSE number 506779 (cf. http://rewerse.net/). Munich, 12th of December 2007 IV Contents I Introduction and Preliminaries 1 1 Introduction 3 1.1 Scope . .3 1.2 Contributions . .5 1.3 Outline of the Thesis . .5 2 Preliminaries 7 2.1 Context of this Research . .7 2.1.1 Information vs. Knowledge . .7 2.1.2 Documents and Data on The Web . .8 2.1.3 Obtaining Knowledge from Information . .8 2.2 Semistructured Data . .9 2.3 XML—The Extensible Markup Language . .9 2.4 From Schema-less Structure to Valid Data . 13 2.4.1 DTD—Document Type Declarations . 13 2.4.2 XML Schema . 17 2.4.3 Relax NG . 21 2.5 Querying The Web . 25 2.5.1 XPath . 25 2.5.2 XSLT . 25 2.5.3 XQuery . 27 2.5.4 Xcerpt . 30 II R2G2 39 3 R2G2—Regular Rooted Graph Grammars 41 3.1 Regular Tree Grammars . 41 3.2 Regular Tree Grammars for Unordered Unranked Trees . 43 3.3 Regular Rooted Graph Grammars . 46 3.3.1 Reference Types and Typed References . 46 3.3.2 About (Non Tree Structured) Graphs and Tree Grammars . 50 3.4 The Syntax of R2G2 ..................................... 51 3.4.1 Core R2G2 Syntax . 51 3.4.2 Base Data Types . 56 3.4.3 Conversion functions . 57 3.4.4 Base functions . 58 3.4.5 The Non-Core R2G2 Constructs . 59 3.5 XML Serialisation of R2G2 Valid Data Terms . 60 3.5.1 Examples and Explanations of (De-)Reference Serialisation . 61 3.6 Semantics of R2G2 ..................................... 63 V CONTENTS 4 From a Generic Module System to a Module System for R2G2 67 4.1 The Purpose of Modules . 67 4.2 Modules and XML Name Spaces . 68 4.3 Modular R2G2 ........................................ 69 4.3.1 Syntax of Modular R2G2 .............................. 69 4.4 Realizing the Module System using the “Divide and Rule” approach . 70 5 Use Cases—Modelling Data and Documents with R2G2 75 5.1 Beyond Regular Tree Grammars—The Use of Macros . 75 5.2 Design Patterns—R2G2 Best Practices . 76 5.2.1 Global versus Local . 77 5.2.2 Composition vs. Sub Classing . 79 5.2.3 ‘eXtreme eXtensibility’ . 80 III Automata Models 85 6 An Automaton Model for Regular Rooted Graph Languages 87 6.1 Introduction to Regular Tree Automata . 87 6.1.1 Handling Ranked Trees . 91 6.2 An Automaton Model for Unranked Regular Rooted Graph Languages . 92 6.2.1 Labelled Directed Hyper Graphs as Non-Det. Regular Tree Automata . 92 6.2.2 Membership Test for a Tree using Hyper Graph Automata . 94 6.2.3 Recognition of a Rooted Graph using Hyper Graph Automata . 98 6.3 A Calculus Relating Automata and R2G2 ........................ 100 6.4 Some Set Theoretic Computations on Hyper Graph Automata . 103 6.4.1 The Emptiness Test . 103 6.4.2 Intersection of Regular Rooted Graph Automata . 104 6.4.3 Automata Based Subset Test for two Regular Rooted Graph Languages . 107 7 A Model for Regular Languages of Multisets 113 7.1 Introduction to Multisets and Multiset Languages . 113 7.1.1 Multisets . 113 7.1.2 Multiset Languages . 114 7.2 Counting Constraints . 116 7.3 A Calculus for Translation of Regular Expressions to Counting Constraints . 116 7.3.1 Example of a Regular Expression Translated to a Counting Constraint . 118 7.4 Some Set Theoretic Computations on Counting Constraints . 119 IV Type Checking 121 8 Type Checking using Regular Rooted Graphs as Data Paradigm 123 8.1 Types and Query- and Programming Languages . 123 8.1.1 Dynamic Typing and Type Checking in Programming Languages . 123 8.1.2 Static Typing and Type Checking in Programming Languages . 124 8.1.3 Combined Static and Dynamic Typing . 124 8.1.4 From Typed Programming languages to typed Query Languages . 125 8.2 Type Systems for Xcerpt . 125 8.2.1 XcerptT—Descriptive Typing for Xcerpt . 126 8.2.2 Prescriptive typing: from CLP to Xcerpt . 126 8.2.3 Typing with R2G2—Differences to the Former Approaches . 127 8.2.4 The Syntax of R2G2 typed Xcerpt . 128 8.3 Type checking and Type Inference for R2G2 Typed Xcerpt Programs . 130 VI CONTENTS 8.3.1 Paying Attention to Type Annotation in Queries . 130 8.3.2 Typing Ordered Queries . 131 8.3.3 Typing Unordered Queries . 134 8.3.4 Typing of Construct Terms . 136 8.3.5 Typing Rules . 140 8.3.6 Typing Programs . 140 8.3.7 Coverage of Current Xcerpt Constructs . 140 V Outlook & Conclusion 143 9 Outlook 145 9.1 Type Based Querying—an Extension to Simulation Unification .
Recommended publications
  • Regular Description of Context-Free Graph Languages
    Regular Description of ContextFree Graph Languages Jo ost Engelfriet and Vincent van Oostrom Department of Computer Science Leiden University POBox RA Leiden The Netherlands email engelfriwileidenunivnl Abstract A set of lab eled graphs can b e dened by a regular tree language and one regular string language for each p ossible edge lab el as follows For each tree t from the regular tree language the graph g r t has the same no des as t with the same lab els and there is an edge with lab el from no de x to no de y if the string of lab els of the no des on the shortest path from x to y in t b elongs to the regular string language for Slightly generalizing this denition scheme we allow g r t to have only those no des of t that have certain lab els and we allow a relab eling of these no des It is shown that in this way exactly the class of CedNCE graph languages generated by CedNCE graph grammars is obtained one of the largest known classes of contextfree graph languages Introduction There are many kinds of contextfree graph grammars see eg ENRR EKR Some are no de rewriting and others are edge rewriting In b oth cases a pro duc tion of the grammar is of the form X D C Application of such a pro duction to a lab eled graph H consists of removing a no de or edge lab eled X from H replacing it by the graph D and connecting D to the remainder of H accord ing to the embedding pro cedure C Since these grammars are contextfree in the sense that one no de or edge is replaced their derivations can b e mo d eled by derivation trees as
    [Show full text]
  • Weighted Regular Tree Grammars with Storage
    Discrete Mathematics and Theoretical Computer Science DMTCS vol. 20:1, 2018, #26 Weighted Regular Tree Grammars with Storage 1 2 2 Zoltan´ Ful¨ op¨ ∗ Luisa Herrmann † Heiko Vogler 1 Department of Foundations of Computer Science, University of Szeged, Hungary 2 Faculty of Computer Science, Technische Universitat¨ Dresden, Germany received 19th May 2017, revised 11th June 2018, accepted 22nd June 2018. We introduce weighted regular tree grammars with storage as combination of (a) regular tree grammars with storage and (b) weighted tree automata over multioperator monoids. Each weighted regular tree grammar with storage gen- erates a weighted tree language, which is a mapping from the set of trees to the multioperator monoid. We prove that, for multioperator monoids canonically associated to particular strong bimonoids, the support of the generated weighted tree languages can be generated by (unweighted) regular tree grammars with storage. We characterize the class of all generated weighted tree languages by the composition of three basic concepts. Moreover, we prove results on the elimination of chain rules and of finite storage types, and we characterize weighted regular tree grammars with storage by a new weighted MSO-logic. Keywords: regular tree grammars, weighted tree automata, multioperator monoids, storage types, weighted MSO- logic 1 Introduction In automata theory, weighted string automata with storage are a very recent topic of research [HV15, HV16, VDH16]. This model generalizes finite-state string automata in two directions. On the one hand, it is based on the concept of “sequential program + machine”, which Scott [Sco67] proposed in order to har- monize the wealth of upcoming automaton models.
    [Show full text]
  • Tree Automata Techniques and Applications
    Tree Automata Techniques and Applications Hubert Comon Max Dauchet Remi´ Gilleron Florent Jacquemard Denis Lugiez Sophie Tison Marc Tommasi Contents Introduction 7 Preliminaries 11 1 Recognizable Tree Languages and Finite Tree Automata 13 1.1 Finite Tree Automata . 14 1.2 The pumping Lemma for Recognizable Tree Languages . 22 1.3 Closure Properties of Recognizable Tree Languages . 23 1.4 Tree homomorphisms . 24 1.5 Minimizing Tree Automata . 29 1.6 Top Down Tree Automata . 31 1.7 Decision problems and their complexity . 32 1.8 Exercises . 35 1.9 Bibliographic Notes . 38 2 Regular grammars and regular expressions 41 2.1 Tree Grammar . 41 2.1.1 Definitions . 41 2.1.2 Regular tree grammar and recognizable tree languages . 44 2.2 Regular expressions. Kleene’s theorem for tree languages. 44 2.2.1 Substitution and iteration . 45 2.2.2 Regular expressions and regular tree languages. 48 2.3 Regular equations. 51 2.4 Context-free word languages and regular tree languages . 53 2.5 Beyond regular tree languages: context-free tree languages . 56 2.5.1 Context-free tree languages . 56 2.5.2 IO and OI tree grammars . 57 2.6 Exercises . 58 2.7 Bibliographic notes . 61 3 Automata and n-ary relations 63 3.1 Introduction . 63 3.2 Automata on tuples of finite trees . 65 3.2.1 Three notions of recognizability . 65 3.2.2 Examples of the three notions of recognizability . 67 3.2.3 Comparisons between the three classes . 69 3.2.4 Closure properties for Rec£ and Rec; cylindrification and projection .
    [Show full text]
  • Taxonomy of XML Schema Languages Using Formal Language Theory
    Taxonomy of XML Schema Languages using Formal Language Theory MAKOTO MURATA IBM Tokyo Research Lab DONGWON LEE Penn State University MURALI MANI Worcester Polytechnic Institute and KOHSUKE KAWAGUCHI Sun Microsystems On the basis of regular tree grammars, we present a formal framework for XML schema languages. This framework helps to describe, compare, and implement such schema languages in a rigorous manner. Our main results are as follows: (1) a simple framework to study three classes of tree languages (“local”, “single-type”, and “regular”); (2) classification and comparison of schema languages (DTD, W3C XML Schema, and RELAX NG) based on these classes; (3) efficient doc- ument validation algorithms for these classes; and (4) other grammatical concepts and advanced validation algorithms relevant to XML model (e.g., binarization, derivative-based validation). Categories and Subject Descriptors: H.2.1 [Database Management]: Logical Design—Schema and subschema; F.4.3 [Mathematical Logic and Formal Languages]: Formal Languages— Classes defined by grammars or automata General Terms: Algorithms, Languages, Theory Additional Key Words and Phrases: XML, schema, validation, tree automaton, interpretation 1. INTRODUCTION XML [Bray et al. 2000] is a meta language for creating markup languages. To represent an XML based language, we design a collection of names for elements and attributes that the language uses. These names (i.e., tag names) are then used by application programs dedicated to this type of information. For instance, XHTML [Altheim and McCarron (Eds) 2000] is such an XML-based language. In it, permissible element names include p, a, ul, and li, and permissible attribute names include href and style.
    [Show full text]
  • Tree Automata
    Foundations of XML Types: Tree Automata Pierre Genevès CNRS (slides mostly based on slides by W. Martens and T. Schwentick) University Grenoble Alpes, 2020–2021 1 / 43 Why Tree Automata? • Foundations of XML type languages (DTD, XML Schema, Relax NG...) • Provide a general framework for XML type languages • A tool to define regular tree languages with an operational semantics • Provide algorithms for efficient validation • Basic tool for static analysis (proofs, decision procedures in logic) 2 / 43 Prelude: Word Automata b b a start even odd a Transitions even !a odd odd !a even ... 3 / 43 From Words to Trees: Binary Trees Binary trees with an even number of a’s a a a b a b b How to write transitions? 4 / 43 From Words to Trees: Binary Trees Binary trees with an even number of a’s a even a even a odd b even a odd b even b even How to write transitions? 4 / 43 From Words to Trees: Binary Trees Binary trees with an even number of a’s a even a even a odd b even a odd b even b even How to write transitions? (even; odd) !a even (even; even) !a odd etc: 4 / 43 Ranked Trees? They come from parse trees of data (or programs)... A function call f (a; b) is a ranked tree f a b 5 / 43 Ranked Trees? They come from parse trees of data (or programs)... A function call f (g(a; b; c); h(i)) is a ranked tree f g h a b c i 5 / 43 Ranked Alphabet A ranked alphabet symbol is: • a formalisation of a function call • a symbol a with an integer arity(a) • arity(a) indicates the number of children of a Notation a(k): symbol a with arity(a)= k 6 / 43 Example Alphabet:
    [Show full text]
  • Language, Automata and Logic for Finite Trees
    Language, Automata and Logic for Finite Trees Olivier Gauwin UMons Feb/March 2010 Olivier Gauwin (UMons) Finite Tree Automata Feb/March 2010 1 / 66 Languages, Automata, Logic Example for regular word languages: Σ∗.a.b.Σ∗ Complexity a, b a, b a b ∃x. ∃y. lab (x) S A B Automata Logic a ∧ labb(y) ∧ succ(x, y) Grammars / Expressions S → aS S → aB X → aX X → ǫ S → bS B → bX X → bX Olivier Gauwin (UMons) Finite Tree Automata Feb/March 2010 2 / 66 Languages, Automata, Logic Example for regular word languages: Σ∗.a.b.Σ∗ Example for regular tree languages: all trees with an a-node having a b-child Complexity a, b a, b a b ∃x. ∃y. lab (x) S A B Automata Logic a ∧ labb(y) ∧ succ(x, y) (S, X ) → S ∃x. ∃y. laba(x) (X , S) → S ∧ labb(y) ∧ child(x, y) a(B, X ) → S a(X , B) → S Grammars / Expressions ... S → aS S → aB X → aX X → ǫ b → B S → bS B → bX X → bX ... S → (S, X ) S → a(B, X ) B → b(X , X ) X → (X , X ) S → (X , S) S → a(X , B) B → b X → Olivier Gauwin (UMons) Finite Tree Automata Feb/March 2010 2 / 66 References The main reference for this talk is the TATA book [CDG+07]: Tree Automata, Techniques and Applications by Hubert Comon, Max Dauchet, R´emi Gilleron, Christof L¨oding, Florent Jacquemard, Denis Lugier, Sophie Tison, Marc Tommasi. Other references will be mentionned progressively. Olivier Gauwin (UMons) Finite Tree Automata Feb/March 2010 3 / 66 1 Ranked Trees Trees on Ranked Alphabet Tree Automata Tree Grammars Logic 2 Unranked Trees Unranked Trees Automata Logic Olivier Gauwin (UMons) Finite Tree Automata Feb/March 2010 4 / 66 1 Ranked Trees Trees on Ranked Alphabet Tree Automata Tree Grammars Logic 2 Unranked Trees Unranked Trees Automata Logic Olivier Gauwin (UMons) Finite Tree Automata Feb/March 2010 5 / 66 Trees on Ranked Alphabet Ranked alphabet Ranked alphabet = finite alphabet + arity function Σr Σ= {a, b, c} ar:Σ → N ar(a) = 2 ar(b) = 2 ar(c) = 0 Ranked trees over Σr TΣr , the set of ranked trees, is the smallest set of terms f (t1,..., tk ) such that: f ∈ Σr , k = ar(f ), and ti ∈TΣr for all 1 ≤ i ≤ k.
    [Show full text]
  • A Model-Theoretic Description of Tree Adjoining Grammars 1
    p() URL: http://www.elsevier.nl/locate/entcs/volume53.html 23 pages A Model-Theoretic Description of Tree Adjoining Grammars 1 Frank Morawietz and Uwe M¨onnich 2,3 Seminar f¨ur Sprachwissenschaft Universit¨at T¨ubingen T¨ubingen, Germany Abstract In this paper we show that non-context-free phenomena can be captured using only limited logical means. In particular, we show how to encode a Tree Adjoining Gram- mar [16] into a weakly equivalent monadic context-free tree grammar (MCFTG). By viewing MCFTG-rules as terms in a free Lawvere theory, we can translate a given MCFTG into a regular tree grammar. The latter is characterizable by both a tree automaton and a corresponding formula in monadic second-order (MSO) logic. The trees of the resulting regular tree language are then unpacked into the intended “linguistic” trees through a model-theoretic interpretation in the form of an MSO transduction based upon tree-walking automata. This two-step approach gives a logical as well as an operational description of the tree sets involved. 1 Introduction Algebraic, logical and regular characterizations of (tree) languages provide a natural framework for the denotational and operational semantics of grammar formalisms relying on the use of trees for their intended models. In the present context the combination of algebraic, logical and regular techniques does not onlyadd another description of mildlycontext-sensitive languages to an alreadylong list of weak equivalences between grammar for- malisms. It also makes available the whole bodyof techniques that have been developed in the tradition of algebraic language theory, logic and automata theory.
    [Show full text]
  • Uniform Vs. Nonuniform Membership for Mildly Context-Sensitive Languages: a Brief Survey
    algorithms Article Uniform vs. Nonuniform Membership for Mildly Context-Sensitive Languages: A Brief Survey Henrik Björklund *, Martin Berglund and Petter Ericson Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden; [email protected] (M.B.); [email protected] (P.E.) * Correspondence: [email protected]; Tel.: +46-90-786-9789 Academic Editors: Henning Fernau and Florin Manea Received: 8 March 2016; Accepted: 27 April 2016; Published: 11 May 2016 Abstract: Parsing for mildly context-sensitive language formalisms is an important area within natural language processing. While the complexity of the parsing problem for some such formalisms is known to be polynomial, this is not the case for all of them. This article presents a series of results regarding the complexity of parsing for linear context-free rewriting systems and deterministic tree-walking transducers. We discuss the difference between uniform and nonuniform complexity measures and how parameterized complexity theory can be used to investigate how different aspects of the formalisms influence how hard the parsing problem is. The main results we survey are all hardness results and indicate that parsing is hard even for relatively small values of parameters such as rank and fan-out in a rewriting system. Keywords: mildly context-sensitive languages; parsing; formal languages; parameterized complexity 1. Introduction Context-free and context-sensitive grammars were introduced in the 1950s by Noam Chomsky, as possible formalisms for describing the structure of natural languages; see, e.g., [1]. Context-free grammars (CFG for short), in particular, have been very successful, both when it comes to natural language modelling and, perhaps to an even larger degree, the description of the syntax of programming languages.
    [Show full text]
  • Context-Free Graph Grammars and Concatenation of Graphs
    Context-free graph grammars and concatenation of graphs Citation for published version (APA): Engelfriet, J., & Vereijken, J. J. (1995). Context-free graph grammars and concatenation of graphs. (Computing science reports; Vol. 9533). Technische Universiteit Eindhoven. Document status and date: Published: 01/01/1995 Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal.
    [Show full text]
  • Hybrid Grammars for Parsing of Discontinuous Phrase Structures and Non-Projective Dependency Structures
    Hybrid Grammars for Parsing of Discontinuous Phrase Structures and Non-Projective Dependency Structures Kilian Gebhardt∗ Technische Universität Dresden Mark-Jan Nederhof∗∗ University of St. Andrews Heiko Vogler∗ Technische Universität Dresden We explore the concept of hybrid grammars, which formalize and generalize a range of existing frameworks for dealing with discontinuous syntactic structures. Covered are both discontinuous phrase structures and non-projective dependency structures. Technically, hybrid grammars are related to synchronous grammars, where one grammar component generates linear structures and another generates hierarchical structures. By coupling lexical elements of both components together, discontinuous structures result. Several types of hybrid grammars are characterized. We also discuss grammar induction from treebanks. The main advantage over existing frameworks is the ability of hybrid grammars to separate discontinuity of the desired structures from time complexity of parsing. This permits exploration of a large variety of parsing algorithms for discontinuous structures, with different properties. This is confirmed by the reported experimental results, which show a wide variety of running time, accuracy, and frequency of parse failures. 1. Introduction Much of the theory of parsing assumes syntactic structures that are trees, formalized such that the children of each node are ordered, and the yield of a tree, that is, the leaves read from left to right, is the sentence. In different terms, each node in the hierarchical syntactic structure of a sentence corresponds to a phrase that is a list of adjacent words, without any gaps. Such a structure is easy to represent in terms of bracketed notation, which is used, for instance, in the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993).
    [Show full text]
  • Efficient Techniques for Parsing with Tree Automata
    Efficient techniques for parsing with tree automata Jonas Groschwitz and Alexander Koller Mark Johnson Department of Linguistics Department of Computing University of Potsdam Macquarie University Germany Australia groschwi|[email protected] [email protected] Abstract ing algorithm which recursively explores substruc- tures of the input object and assigns them non- Parsing for a wide variety of grammar for- terminal symbols, but their exact relationship is malisms can be performed by intersecting rarely made explicit. On the practical side, this finite tree automata. However, naive im- parsing algorithm and its extensions (e.g. to EM plementations of parsing by intersection training) have to be implemented and optimized are very inefficient. We present techniques from scratch for each new grammar formalism. that speed up tree-automata-based pars- Thus, development time is spent on reinventing ing, to the point that it becomes practically wheels that are slightly different from previous feasible on realistic data when applied to ones, and the resulting implementations still tend context-free, TAG, and graph parsing. For to underperform. graph parsing, we obtain the best runtimes Koller and Kuhlmann (2011) introduced Inter- in the literature. preted Regular Tree Grammars (IRTGs) in order to address this situation. An IRTG represents 1 Introduction a language by describing a regular language of Grammar formalisms that go beyond context-free derivation trees, each of which is mapped to a term grammars have recently enjoyed renewed atten-
    [Show full text]
  • Practical MAT Learning of Natural Languages Using Treebanks
    Practical MAT learning of natural languages using treebanks Petter Ericson Supervisor: Johanna Bj¨orklund Department of Computing Science, Ume˚aUniversity S{901 87 Ume˚a,Sweden, [email protected] Abstract. In this thesis, an implementation of the MAT algorithm for query learning is improved and tested, specifically by attempted simula- tion of the MAT oracle using large corpora. In order to arrive at suitable test corpora, algorithms for generating positive and negative examples in relation to regular tree languages are developed. Table of Contents Practical MAT learning of natural languages using treebanks ::::::::::: i Petter Ericson Supervisor: Johanna Bj¨orklund 1 Introduction . 1 1.1 Motivation . 1 1.2 Language identification . 1 1.3 MAT learners . 2 Basics . 2 1.4 Outline . 3 2 Preliminaries . 4 2.1 Introduction to trees and tree languages . 4 Alphabets, strings and trees . 4 Automata . 6 Tree automata . 6 Grammars . 8 2.2 MAT learners . 10 Contexts and substitution . 10 Myhill-Nerode and equivalence classes. 10 Observation tables . 11 Putting it all together. 12 Implementation status . 12 3 Method....................................................... 13 3.1 Implementation of the algorithm . 13 Optimising for speed . 13 Adding required functionality . 14 3.2 Tree generation . 14 3.3 Test runs . 16 Optimisations . 16 Tree generation . 16 Corpus runs . 17 4 Results . 18 4.1 Tower languages . 18 4.2 Optimisation . 18 4.3 Generalisation . 19 4.4 Corpora . 19 Real-world corpus gaps . 20 4.5 Tree generation . 22 5 Major difficulties . 23 5.1 Lack of robustness . 23 5.2 Selecting counterexample . 23 5.3 Suitable testing languages . 24 6 Future work .
    [Show full text]