Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz

Total Page:16

File Type:pdf, Size:1020Kb

Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz To cite this version: Sylvain Schmitz. Approximating Context-Free Grammars for Parsing and Verification. Software Engineering [cs.SE]. Université Nice Sophia Antipolis, 2007. English. tel-00271168 HAL Id: tel-00271168 https://tel.archives-ouvertes.fr/tel-00271168 Submitted on 8 Apr 2008 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Approximating Context-Free Grammars for Parsing and Verification The thorn bush of ambiguity. Sylvain Schmitz Laboratoire I3S, Universit´ede Nice - Sophia Antipolis & CNRS, France Ph.D. Thesis Universit´ede Nice - Sophia Antipolis, Ecole´ Doctorale ✭✭ Sciences et Technologies de l’Information et de la Communication ✮✮ Informatique Subject Jacques Farr´e, Universit´ede Nice - Sophia Antipolis Promotor Referees Eberhard Bertsch, Ruhr-Universit¨atBochum Pierre Boullier, INRIA Rocquencourt Defense September 24th, 2007 Date Sophia Antipolis, France Place Jury Pierre Boullier, INRIA Rocquencourt Jacques Farr´e, Universit´ede Nice - Sophia Antipolis Jos´eFortes G´alvez, Universidad de Las Palmas de Gran Canaria Olivier Lecarme, Universit´ede Nice - Sophia Antipolis Igor Litovsky, Universit´ede Nice - Sophia Antipolis Mark-Jan Nederhof, University of St Andrews G´eraud S´enizergues, Universit´ede Bordeaux I Approximating Context-Free Grammars for Parsing and Verification Abstract Programming language developers are blessed with the availability of efficient, powerful tools for parser generation. But even with au- tomated help, the implementation of a parser remains often overly complex. Although programs should convey an unambiguous meaning, the parsers produced for their syntax, specified by a context-free gram- mar, are most often nondeterministic, and even ambiguous. Fac- ing the limitations of traditional deterministic parsers, developers have embraced general parsing techniques, capable of exploring ev- ery nondeterministic choice, but offering no unambiguity warranty. The real challenge in parser development lies then in the proper identification and treatment of ambiguities—but these issues are undecidable. The grammar approximation technique discussed in the thesis is ap- plied to nondeterminism and ambiguity issues in two different ways. The first application is the generation of noncanonical parsers, less prone to nondeterminism, mostly thanks to their ability to exploit an unbounded context-free language as right context to guide their decision. Such parsers enforce the unambiguity of the grammar, and furthermore warrant a linear time parsing complexity. The second application is ambiguity detection in itself, with the insurance that a grammar reported as unambiguous is actually so, whatever level of approximation we might choose. Key Words Ambiguity, context-free grammar, noncanonical parser, position automaton, grammar engineering, verification of infinite systems ACM Categories D.2.4 [Software Engineering]: Software/Program Verification; D.3.1 [Programming Languages]: Formal Definitions and Theory— Syntax ; D.3.4 [Programming Languages]: Processors—Parsing ; F.1.1 [Computation by Abstract Devices]: Models of Computation; F.3.1 [Logics and Meanings of Programs]: Specifying and Verify- ing and Reasoning about Programs; F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems Approximation de grammaires alg´ebriques pour l’analyse syntaxique et la v´erification R´esum´e La th`ese s’int´eresse au probl`eme de l’analyse syntaxique pour les langages de programmation. Si ce sujet a d´ej`a´et´etrait´e`amaintes reprises, et bien que des outils performants pour la g´en´eration d’analyseurs syntaxiques existent et soient largement employ´es, l’impl´ementation de la partie frontale d’un compilateur reste en- core extrˆemement complexe. Ainsi, si le texte d’un programme informatique se doit de n’avoir qu’une seule interpr´etation possible, l’analyse des langages de pro- grammation, fond´ee sur une grammaire alg´ebrique, est, pour sa part, le plus souvent non d´eterministe, voire ambigu¨e. Confront´es aux insuffisances des analyseurs d´eterministes traditionnels, les d´eveloppeurs de parties frontales se sont tourn´es massivement vers des techniques d’analyse g´en´erale, `amˆeme d’explorer des choix non d´eterministes, mais aussi au prix de la garantie d’avoir bien trait´e toutes les ambigu¨ıt´es grammaticales. Une difficult´emajeure dans l’impl´ementation d’un compilateur r´eside alors dans l’identification (non d´ecidable en g´en´eral) et le traitement de ces ambigu¨ıt´es. Les techniques d´ecrites dans la th`ese s’articulent autour d’appro- ximations des grammaires `adeux fins. L’une est la g´en´eration d’a- nalyseurs syntaxiques non canoniques, qui sont moins sensibles aux difficult´es grammaticales, en particulier parce qu’ils peuvent exploi- ter un langage alg´ebrique non fini en guise de contexte droit pour r´esoudre un choix non d´eterministe. Ces analyseurs r´etablissent la garantie de non ambigu¨ıt´ede la grammaire, et en sus assurent un traitement en temps lin´eaire du texte `aanalyser. L’autre est la d´etection d’ambigu¨ıt´een tant que telle, avec l’assurance qu’une grammaire accept´ee est bien non ambigu¨equel que soit le degr´e d’approximation employ´e. Mots clefs Ambigu¨ıt´e, grammaire alg´ebrique, analyseur syntaxique non canon- ique, automate de positions, ing´enierie des grammaires, v´erification de syst`emes infinis Foreword This thesis compiles and corrects the results published in several articles (Schmitz, 2006; Fortes G´alvez et al., 2006; Schmitz, 2007a,b) written over the last four years at the Laboratoire d’Informatique Signaux et Syst`emes de Sophia Antipolis (I3S). I am indebted for the excellent working environment I have been offered during these years, and further grateful for the advice and encouragements provided by countless colleagues, friends, and family members. I am indebted to many friends and members of the laboratory, notably to Ana Almeida Matos, Jan Cederquist, Philippe Lahire, Igor Litovsky, Bruno Martin and S´ebastien Verel, and to the members of my research team, Carine F´ed`ele, Michel Gautero, Vincent Granet, Franck Guingne, Olivier Lecarme, and Lionel Nicolas, who have been extremely helpful and supportive beyond any possible repay. I especially thank Igor Litovsky and Olivier Lecarme for accepting to examine my work and attend to my defense. I am immensely thankful to Jacques Farr´efor his insightful supervision, excellent guidance, and extreme kindness, that greatly exceed everything I could have wished for. I thank Jos´eFortes G´alvez for his fruitful collaboration, his warm wel- come on the occasion of my visit in Las Palmas, and his acceptance to work on my thesis jury. I gratefully acknowledge several insightful email discus- sions with Dick Grune, who taught me more on the literature on parsing technologies than anyone could wish for, and who first suggested that study- ing the construction of noncanonical parsers from the NFA to DFA point of view could bring interesting results. I am also thankful to Terence Parr for our discussions on LL(*), to Bas Basten for sharing his work on the comparison of ambiguity checking algorithms, and to Claus Brabrand and Anders Møller for our discussions on ambiguity detection and for granting me access to their tool. I am obliged to the many anonymous referees of my articles, whose remarks helped considerably improving my work, and I am much obliged to Eberhard Bertsch and Pierre Boullier for their reviews of the present thesis, and the numerous corrections they offered. I am honored by their interest x Foreword in my work, as well as by the interest shown by the remaining members of my jury, Mark-Jan Nederhof and G´eraud S´enizergues. Finally, I thank all my friends and my family for their support, and my beloved Adeline for bearing with me through these years. Sophia Antipolis, October 16, 2007 Contents Abstract v R´esum´e vii Foreword ix Contents xi 1 Introduction 1 1.1 Motivation ............................ 1 1.2 Overview ............................. 3 2 General Background 5 2.1 Topics ............................... 5 2.1.1 Formal Syntax ...................... 5 2.1.1.1 Context-Free Grammars ............ 7 2.1.1.2 Ambiguity ................... 8 2.1.2 Parsing .......................... 10 2.1.2.1 Pushdown Automata . 10 2.1.2.2 The Determinism Hypothesis . 12 2.2 Scope ............................... 13 2.2.1 Parsers for Programming Languages . 13 2.2.1.1 Front Ends ................... 13 2.2.1.2 Requirements on Parsers . 15 2.2.2 Deterministic Parsers . 16 2.2.2.1 LR(k) Parsers . 17 2.2.2.2 Other Deterministic Parsers . 18 2.2.2.3 Issues with Deterministic Parsers . 20 2.2.3 Ambiguity in Programming Languages . 21 2.2.3.1 Further Nondeterminism Issues . 23 xii Contents 3 Advanced Parsers and their Issues 25 3.1 Case Studies ........................... 26 3.1.1 Java Modifiers ...................... 26 3.1.1.1 Grammar .................... 26 3.1.1.2 Input Examples . 26 3.1.1.3 Resolution ................... 28 3.1.2 Standard ML Layered Patterns . 28 3.1.2.1 Grammar .................... 28 3.1.2.2 Input Examples . 28 3.1.2.3 Resolution ................... 29 3.1.3 C++ Qualified Identifiers . 29 3.1.3.1 Grammar .................... 30 3.1.3.2 Input Examples . 30 3.1.3.3 Resolution ................... 30 3.1.4 Standard ML Case Expressions . 31 3.1.4.1 Grammar .................... 31 3.1.4.2 Input Examples . 32 3.1.4.3 Resolution ................... 34 3.2 Advanced Deterministic Parsing . 34 3.2.1 Predicated Parsers ...................
Recommended publications
  • Rewriting Games on Nested Words
    Rewriting Games on Nested Words Martin Schuster Thomas Schwentick TU Dortmund Highlights 2014 Context-Free Games On Strings: Intuition Basic idea: Context-free grammar as Example two-player game T “ tafba, aafafa, abafau, JULIET chooses function f Ñ af | b symbols af f a ROMEO chooses Ó replacement strings af afa Ó JULIET wins if a target abafa string is reached Algorithmic problem JWIN Given: A context-free game G and a string w Question: Does JULIET have a winning strategy on w in G? Previous Results (Some) results from [Muscholl, Schwentick, Segoufin 2006]: JWIN is undecidable in general ù Left-to-right (L2R) restriction With L2R restriction and DFA-represented target language, JWin is EXPTIME-complete for finite or (D/N)FA-represented regular replacement languages PTIME-complete for finite replacement languages and bounded recursion Application: Active XML Schema Rewriting Server Client City Ñ Name, Weather, Events City Ñ Name, Weather, Events Events Ñ @event_svc Events Ñ (Sports|Concert)* event_svc: Ñ (Sports|Concert)* Server (CC-BY-SA 3.0) RRZE Icons, Client (LGPL) Everaldo Coelho Goal: Rewrite server documents into client schema Milo et al., 2003: DTD schema languages ù Reduction to string rewriting ù Claimed PTIME algorithm for bounded recursion (?) Our interest: stronger schema languages (XML Schema) ù Main target: identify tractable restrictions ù cfGs on nested words Nested Words Idea: Correct nesting of tags over label alphabet Σ ù Linearisations of Σ-labelled forests Example a a xayx{ayxayxbyxcyx{cyx{byxayx{ayx{ay b a c Schema
    [Show full text]
  • Regular Description of Context-Free Graph Languages
    Regular Description of ContextFree Graph Languages Jo ost Engelfriet and Vincent van Oostrom Department of Computer Science Leiden University POBox RA Leiden The Netherlands email engelfriwileidenunivnl Abstract A set of lab eled graphs can b e dened by a regular tree language and one regular string language for each p ossible edge lab el as follows For each tree t from the regular tree language the graph g r t has the same no des as t with the same lab els and there is an edge with lab el from no de x to no de y if the string of lab els of the no des on the shortest path from x to y in t b elongs to the regular string language for Slightly generalizing this denition scheme we allow g r t to have only those no des of t that have certain lab els and we allow a relab eling of these no des It is shown that in this way exactly the class of CedNCE graph languages generated by CedNCE graph grammars is obtained one of the largest known classes of contextfree graph languages Introduction There are many kinds of contextfree graph grammars see eg ENRR EKR Some are no de rewriting and others are edge rewriting In b oth cases a pro duc tion of the grammar is of the form X D C Application of such a pro duction to a lab eled graph H consists of removing a no de or edge lab eled X from H replacing it by the graph D and connecting D to the remainder of H accord ing to the embedding pro cedure C Since these grammars are contextfree in the sense that one no de or edge is replaced their derivations can b e mo d eled by derivation trees as
    [Show full text]
  • Grammars (Cfls & Cfgs) Reading: Chapter 5
    Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5 1 Not all languages are regular So what happens to the languages which are not regular? Can we still come up with a language recognizer? i.e., something that will accept (or reject) strings that belong (or do not belong) to the language? 2 Context-Free Languages A language class larger than the class of regular languages Supports natural, recursive notation called “context- free grammar” Applications: Context- Parse trees, compilers Regular free XML (FA/RE) (PDA/CFG) 3 An Example A palindrome is a word that reads identical from both ends E.g., madam, redivider, malayalam, 010010010 Let L = { w | w is a binary palindrome} Is L regular? No. Proof: N N (assuming N to be the p/l constant) Let w=0 10 k By Pumping lemma, w can be rewritten as xyz, such that xy z is also L (for any k≥0) But |xy|≤N and y≠ + ==> y=0 k ==> xy z will NOT be in L for k=0 ==> Contradiction 4 But the language of palindromes… is a CFL, because it supports recursive substitution (in the form of a CFG) This is because we can construct a “grammar” like this: Same as: 1. A ==> A => 0A0 | 1A1 | 0 | 1 | Terminal 2. A ==> 0 3. A ==> 1 Variable or non-terminal Productions 4. A ==> 0A0 5. A ==> 1A1 How does this grammar work? 5 How does the CFG for palindromes work? An input string belongs to the language (i.e., accepted) iff it can be generated by the CFG G: Example: w=01110 A => 0A0 | 1A1 | 0 | 1 | G can generate w as follows: Generating a string from a grammar: 1.
    [Show full text]
  • A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages
    Journal of Information Processing Vol.24 No.2 256–264 (Mar. 2016) [DOI: 10.2197/ipsjjip.24.256] Regular Paper A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages Tetsuro Matsumura1,†1 Kimio Kuramitsu1,a) Received: May 18, 2015, Accepted: November 5, 2015 Abstract: Parsing Expression Grammars are a popular foundation for describing syntax. Unfortunately, several syn- tax of programming languages are still hard to recognize with pure PEGs. Notorious cases appears: typedef-defined names in C/C++, indentation-based code layout in Python, and HERE document in many scripting languages. To recognize such PEG-hard syntax, we have addressed a declarative extension to PEGs. The “declarative” extension means no programmed semantic actions, which are traditionally used to realize the extended parsing behavior. Nez is our extended PEG language, including symbol tables and conditional parsing. This paper demonstrates that the use of Nez Extensions can realize many practical programming languages, such as C, C#, Ruby, and Python, which involve PEG-hard syntax. Keywords: parsing expression grammars, semantic actions, context-sensitive syntax, and case studies on program- ming languages invalidate the declarative property of PEGs, thus resulting in re- 1. Introduction duced reusability of grammars. As a result, many developers need Parsing Expression Grammars [5], or PEGs, are a popu- to redevelop grammars for their software engineering tools. lar foundation for describing programming language syntax [6], In this paper, we propose a declarative extension of PEGs for [15]. Indeed, the formalism of PEGs has many desirable prop- recognizing context-sensitive syntax. The “declarative” exten- erties, including deterministic behaviors, unlimited look-aheads, sion means no arbitrary semantic actions that are written in a gen- and integrated lexical analysis known as scanner-less parsing.
    [Show full text]
  • Weighted Regular Tree Grammars with Storage
    Discrete Mathematics and Theoretical Computer Science DMTCS vol. 20:1, 2018, #26 Weighted Regular Tree Grammars with Storage 1 2 2 Zoltan´ Ful¨ op¨ ∗ Luisa Herrmann † Heiko Vogler 1 Department of Foundations of Computer Science, University of Szeged, Hungary 2 Faculty of Computer Science, Technische Universitat¨ Dresden, Germany received 19th May 2017, revised 11th June 2018, accepted 22nd June 2018. We introduce weighted regular tree grammars with storage as combination of (a) regular tree grammars with storage and (b) weighted tree automata over multioperator monoids. Each weighted regular tree grammar with storage gen- erates a weighted tree language, which is a mapping from the set of trees to the multioperator monoid. We prove that, for multioperator monoids canonically associated to particular strong bimonoids, the support of the generated weighted tree languages can be generated by (unweighted) regular tree grammars with storage. We characterize the class of all generated weighted tree languages by the composition of three basic concepts. Moreover, we prove results on the elimination of chain rules and of finite storage types, and we characterize weighted regular tree grammars with storage by a new weighted MSO-logic. Keywords: regular tree grammars, weighted tree automata, multioperator monoids, storage types, weighted MSO- logic 1 Introduction In automata theory, weighted string automata with storage are a very recent topic of research [HV15, HV16, VDH16]. This model generalizes finite-state string automata in two directions. On the one hand, it is based on the concept of “sequential program + machine”, which Scott [Sco67] proposed in order to har- monize the wealth of upcoming automaton models.
    [Show full text]
  • A Note on Nested Words
    A Note on Nested Words Andreas Blass¤ Yuri Gurevichy Abstract For every regular language of nested words, the underlying strings form a context-free language, and every context-free language can be obtained in this way. Nested words and nested-word automata are gen- eralized to motley words and motley-word automata. Every motley- word automation is equivalent to a deterministic one. For every reg- ular language of motley words, the underlying strings form a ¯nite intersection of context-free languages, and every ¯nite intersection of context-free languages can be obtained in this way. 1 Introduction In [1], Rajeev Alur and P. Madhusudan introduced and studied nested words, nested-word automata and regular languages of nested words. A nested word is a string of letters endowed with a so-called nested relation. Alur and Madhusudan prove that every nested-word automaton is equivalent to a deterministic one. In Section 2, we recall some of their de¯nitions. We quote from the introduction to [1]. \The motivating application area for our results has been soft- ware veri¯cation. Given a sequential program P with stack-based control flow, the execution of P is modeled as a nested word with nesting edges from calls to returns. Speci¯cation of the program is given as a nested word automaton A, and veri¯cation corre- sponds to checking whether every nested word generated by P is ¤Math Dept, University of Michigan, Ann Arbor, MI 48109; [email protected] yMicrosoft Research, Redmond, WA 98052; [email protected] 1 accepted by A. Nested-word automata can express a variety of requirements such as stack-inspection properties, pre-post condi- tions, and interprocedural data-flow properties." In Section 3, we show that all context-free properties, and only context- free properties, can be captured by a nested-word automaton in the following precise sense.
    [Show full text]
  • Practical Experiments with Regular Approximation of Context-Free Languages
    Practical Experiments with Regular Approximation of Context-Free Languages Mark-Jan Nederhof German Research Center for Arti®cial Intelligence Several methods are discussed that construct a ®nite automaton given a context-free grammar, including both methods that lead to subsets and those that lead to supersets of the original context-free language. Some of these methods of regular approximation are new, and some others are presented here in a more re®ned form with respect to existing literature. Practical experiments with the different methods of regular approximation are performed for spoken-language input: hypotheses from a speech recognizer are ®ltered through a ®nite automaton. 1. Introduction Several methods of regular approximation of context-free languages have been pro- posed in the literature. For some, the regular language is a superset of the context-free language, and for others it is a subset. We have implemented a large number of meth- ods, and where necessary, re®ned them with an analysis of the grammar. We also propose a number of new methods. The analysis of the grammar is based on a suf®cient condition for context-free grammars to generate regular languages. For an arbitrary grammar, this analysis iden- ti®es sets of rules that need to be processed in a special way in order to obtain a regular language. The nature of this processing differs for the respective approximation meth- ods. For other parts of the grammar, no special treatment is needed and the grammar rules are translated to the states and transitions of a ®nite automaton without affecting the language.
    [Show full text]
  • Formal Language and Automata Theory (CS21004)
    Context Free Grammar Normal Forms Derivations and Ambiguities Pumping lemma for CFLs PDA Parsing CFL Properties Formal Language and Automata Theory (CS21004) Soumyajit Dey CSE, IIT Kharagpur Context Free Formal Language and Automata Theory Grammar (CS21004) Normal Forms Derivations and Ambiguities Pumping lemma Soumyajit Dey for CFLs CSE, IIT Kharagpur PDA Parsing CFL Properties DPDA, DCFL Membership CSL Soumyajit Dey CSE, IIT Kharagpur Formal Language and Automata Theory (CS21004) Context Free Grammar Normal Forms Derivations and Ambiguities Pumping lemma for CFLs PDA Parsing CFL Properties Formal Language and Automata Announcements Theory (CS21004) Soumyajit Dey CSE, IIT Kharagpur Context Free Grammar Normal Forms Derivations and The slide is just a short summary Ambiguities Follow the discussion and the boardwork Pumping lemma for CFLs Solve problems (apart from those we dish out in class) PDA Parsing CFL Properties DPDA, DCFL Membership CSL Soumyajit Dey CSE, IIT Kharagpur Formal Language and Automata Theory (CS21004) Context Free Grammar Normal Forms Derivations and Ambiguities Pumping lemma for CFLs PDA Parsing CFL Properties Formal Language and Automata Table of Contents Theory (CS21004) Soumyajit Dey 1 Context Free Grammar CSE, IIT Kharagpur 2 Normal Forms Context Free Grammar 3 Derivations and Ambiguities Normal Forms 4 Derivations and Pumping lemma for CFLs Ambiguities 5 Pumping lemma PDA for CFLs 6 Parsing PDA Parsing 7 CFL Properties CFL Properties DPDA, DCFL 8 DPDA, DCFL Membership 9 Membership CSL 10 CSL Soumyajit Dey CSE,
    [Show full text]
  • Efficient Recursive Parsing
    Purdue University Purdue e-Pubs Department of Computer Science Technical Reports Department of Computer Science 1976 Efficient Recursive Parsing Christoph M. Hoffmann Purdue University, [email protected] Report Number: 77-215 Hoffmann, Christoph M., "Efficient Recursive Parsing" (1976). Department of Computer Science Technical Reports. Paper 155. https://docs.lib.purdue.edu/cstech/155 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. EFFICIENT RECURSIVE PARSING Christoph M. Hoffmann Computer Science Department Purdue University West Lafayette, Indiana 47907 CSD-TR 215 December 1976 Efficient Recursive Parsing Christoph M. Hoffmann Computer Science Department Purdue University- Abstract Algorithms are developed which construct from a given LL(l) grammar a recursive descent parser with as much, recursion resolved by iteration as is possible without introducing auxiliary memory. Unlike other proposed methods in the literature designed to arrive at parsers of this kind, the algorithms do not require extensions of the notational formalism nor alter the grammar in any way. The algorithms constructing the parsers operate in 0(k«s) steps, where s is the size of the grammar, i.e. the sum of the lengths of all productions, and k is a grammar - dependent constant. A speedup of the algorithm is possible which improves the bound to 0(s) for all LL(l) grammars, and constructs smaller parsers with some auxiliary memory in form of parameters
    [Show full text]
  • Equivalence of Deterministic Nested Word to Word Transducers
    Equivalence of Deterministic Nested Word to Word Transducers S lawomir Staworko13 and Gr´egoireLaurence23 and Aur´elienLemay23 and Joachim Niehren13 1 INRIA, Lille 2 University of Lille 3 Mostrare project, INRIA & LIFL (CNRS UMR8022) Abstract. We study the equivalence problem of deterministic nested word to word transducers and show it to be surprisingly robust. Modulo polynomial time reductions, it can be identified with 4 equivalence prob- lems for diverse classes of deterministic non-copying order-preserving transducers. In particular, we present polynomial time back and fourth reductions to the morphism equivalence problem on context free lan- guages, which is known to be solvable in polynomial time. Keywords: trees, transducers, automata, context-free grammars, XML. 1 Introduction Nested word automata (nas) [1] are tree automata, that operate on linearizations of unranked trees in streaming manner. All nodes of the tree are visited twice, as usual in preorder traversals. At the first visit (opening event), a symbol is to be pushed onto a stack that is popped at the second visit (closing event). nas were introduced as a reformulation of visibly pushdown automata [2] and proved equivalent to pushdown forest automata [11] and streaming tree automata [6]. More formally, a nested word over Σ is a word of parenthesis in Σ^ = fop; clg × Σ, such that all opening parenthesis are properly closed. We consider nested words as linearizations of unranked trees. For instance, the linearization of a(b(c); d) is the nested word (op; a) · (op; b) · (op; c) · (cl; c) · (cl; b) · (op; d) · (cl; d) · (cl; a), or in XML notation <a><b><c></c></b><d></d></a>.
    [Show full text]
  • Tree Automata Techniques and Applications
    Tree Automata Techniques and Applications Hubert Comon Max Dauchet Remi´ Gilleron Florent Jacquemard Denis Lugiez Sophie Tison Marc Tommasi Contents Introduction 7 Preliminaries 11 1 Recognizable Tree Languages and Finite Tree Automata 13 1.1 Finite Tree Automata . 14 1.2 The pumping Lemma for Recognizable Tree Languages . 22 1.3 Closure Properties of Recognizable Tree Languages . 23 1.4 Tree homomorphisms . 24 1.5 Minimizing Tree Automata . 29 1.6 Top Down Tree Automata . 31 1.7 Decision problems and their complexity . 32 1.8 Exercises . 35 1.9 Bibliographic Notes . 38 2 Regular grammars and regular expressions 41 2.1 Tree Grammar . 41 2.1.1 Definitions . 41 2.1.2 Regular tree grammar and recognizable tree languages . 44 2.2 Regular expressions. Kleene’s theorem for tree languages. 44 2.2.1 Substitution and iteration . 45 2.2.2 Regular expressions and regular tree languages. 48 2.3 Regular equations. 51 2.4 Context-free word languages and regular tree languages . 53 2.5 Beyond regular tree languages: context-free tree languages . 56 2.5.1 Context-free tree languages . 56 2.5.2 IO and OI tree grammars . 57 2.6 Exercises . 58 2.7 Bibliographic notes . 61 3 Automata and n-ary relations 63 3.1 Introduction . 63 3.2 Automata on tuples of finite trees . 65 3.2.1 Three notions of recognizability . 65 3.2.2 Examples of the three notions of recognizability . 67 3.2.3 Comparisons between the three classes . 69 3.2.4 Closure properties for Rec£ and Rec; cylindrification and projection .
    [Show full text]
  • Advanced Parsing Techniques
    Advanced Parsing Techniques Announcements ● Written Set 1 graded. ● Hard copies available for pickup right now. ● Electronic submissions: feedback returned later today. Where We Are Where We Are Parsing so Far ● We've explored five deterministic parsing algorithms: ● LL(1) ● LR(0) ● SLR(1) ● LALR(1) ● LR(1) ● These algorithms all have their limitations. ● Can we parse arbitrary context-free grammars? Why Parse Arbitrary Grammars? ● They're easier to write. ● Can leave operator precedence and associativity out of the grammar. ● No worries about shift/reduce or FIRST/FOLLOW conflicts. ● If ambiguous, can filter out invalid trees at the end. ● Generate candidate parse trees, then eliminate them when not needed. ● Practical concern for some languages. ● We need to have C and C++ compilers! Questions for Today ● How do you go about parsing ambiguous grammars efficiently? ● How do you produce all possible parse trees? ● What else can we do with a general parser? The Earley Parser Motivation: The Limits of LR ● LR parsers use shift and reduce actions to reduce the input to the start symbol. ● LR parsers cannot deterministically handle shift/reduce or reduce/reduce conflicts. ● However, they can nondeterministically handle these conflicts by guessing which option to choose. ● What if we try all options and see if any of them work? The Earley Parser ● Maintain a collection of Earley items, which are LR(0) items annotated with a start position. ● The item A → α·ω @n means we are working on recognizing A → αω, have seen α, and the start position of the item was the nth token. ● Using techniques similar to LR parsing, try to scan across the input creating these items.
    [Show full text]