Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz

Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz To cite this version: Sylvain Schmitz. Approximating Context-Free Grammars for Parsing and Verification. Software Engineering [cs.SE]. Université Nice Sophia Antipolis, 2007. English. tel-00271168 HAL Id: tel-00271168 https://tel.archives-ouvertes.fr/tel-00271168 Submitted on 8 Apr 2008 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Approximating Context-Free Grammars for Parsing and Verification The thorn bush of ambiguity. Sylvain Schmitz Laboratoire I3S, Universitéde Nice - Sophia Antipolis & CNRS, France Ph.D. Thesis Universitéde Nice - Sophia Antipolis, Ecole´ Doctorale ✭✭ Sciences et Technologies de l’Information et de la Communication ✮✮ Informatique Subject Jacques Farré, Universitéde Nice - Sophia Antipolis Promotor Referees Eberhard Bertsch, Ruhr-UniversitätBochum Pierre Boullier, INRIA Rocquencourt Defense September 24th, 2007 Date Sophia Antipolis, France Place Jury Pierre Boullier, INRIA Rocquencourt Jacques Farré, Universitéde Nice - Sophia Antipolis JoséFortes Gálvez, Universidad de Las Palmas de Gran Canaria Olivier Lecarme, Universitéde Nice - Sophia Antipolis Igor Litovsky, Universitéde Nice - Sophia Antipolis Mark-Jan Nederhof, University of St Andrews Géraud Sénizergues, Universitéde Bordeaux I Approximating Context-Free Grammars for Parsing and Verification Abstract Programming language developers are blessed with the availability of efficient, powerful tools for parser generation. But even with au- tomated help, the implementation of a parser remains often overly complex. Although programs should convey an unambiguous meaning, the parsers produced for their syntax, specified by a context-free grammar, are most often nondeterministic, and even ambiguous. Fac- ing the limitations of traditional deterministic parsers, developers have embraced general parsing techniques, capable of exploring ev- ery nondeterministic choice, but offering no unambiguity warranty. The real challenge in parser development lies then in the proper identification and treatment of ambiguities—but these issues are undecidable. The grammar approximation technique discussed in the thesis is ap- plied to nondeterminism and ambiguity issues in two different ways. The first application is the generation of noncanonical parsers, less prone to nondeterminism, mostly thanks to their ability to exploit an unbounded context-free language as right context to guide their decision. Such parsers enforce the unambiguity of the grammar, and furthermore warrant a linear time parsing complexity. The second application is ambiguity detection in itself, with the insurance that a grammar reported as unambiguous is actually so, whatever level of approximation we might choose. Key Words Ambiguity, context-free grammar, noncanonical parser, position automaton, grammar engineering, verification of infinite systems ACM Categories D.2.4 [Software Engineering]: Software/Program Verification; D.3.1 [Programming Languages]: Formal Definitions and Theory— Syntax ; D.3.4 [Programming Languages]: Processors—Parsing ; F.1.1 [Computation by Abstract Devices]: Models of Computation; F.3.1 [Logics and Meanings of Programs]: Specifying and Verify- ing and Reasoning about Programs; F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems Approximation de grammaires algébriques pour l’analyse syntaxique et la vérification Résumé La thèse s’intéresse au problème de l’analyse syntaxique pour les langages de programmation. Si ce sujet a déjàététraitéàmaintes reprises, et bien que des outils performants pour la génération d’analyseurs syntaxiques existent et soient largement employés, l’implémentation de la partie frontale d’un compilateur reste en- core extrêmement complexe. Ainsi, si le texte d’un programme informatique se doit de n’avoir qu’une seule interprétation possible, l’analyse des langages de programmation, fondée sur une grammaire algébrique, est, pour sa part, le plus souvent non déterministe, voire ambiguë. Confrontés aux insuffisances des analyseurs déterministes traditionnels, les développeurs de parties frontales se sont tournés massivement vers des techniques d’analyse générale, àmême d’explorer des choix non déterministes, mais aussi au prix de la garantie d’avoir bien traité toutes les ambigu¨ıtés grammaticales. Une difficultémajeure dans l’implémentation d’un compilateur réside alors dans l’identification (non décidable en général) et le traitement de ces ambigu¨ıtés. Les techniques décrites dans la thèse s’articulent autour d’appro- ximations des grammaires àdeux fins. L’une est la génération d’a- nalyseurs syntaxiques non canoniques, qui sont moins sensibles aux difficultés grammaticales, en particulier parce qu’ils peuvent exploi- ter un langage algébrique non fini en guise de contexte droit pour résoudre un choix non déterministe. Ces analyseurs rétablissent la garantie de non ambigu¨ıtéde la grammaire, et en sus assurent un traitement en temps linéaire du texte àanalyser. L’autre est la détection d’ambigu¨ıtéen tant que telle, avec l’assurance qu’une grammaire acceptée est bien non ambiguëquel que soit le degré d’approximation employé. Mots clefs Ambigu¨ıté, grammaire algébrique, analyseur syntaxique non canon- ique, automate de positions, ingénierie des grammaires, vérification de systèmes infinis Foreword This thesis compiles and corrects the results published in several articles (Schmitz, 2006; Fortes Gálvez et al., 2006; Schmitz, 2007a,b) written over the last four years at the Laboratoire d’Informatique Signaux et Systèmes de Sophia Antipolis (I3S). I am indebted for the excellent working environment I have been offered during these years, and further grateful for the advice and encouragements provided by countless colleagues, friends, and family members. I am indebted to many friends and members of the laboratory, notably to Ana Almeida Matos, Jan Cederquist, Philippe Lahire, Igor Litovsky, Bruno Martin and Sébastien Verel, and to the members of my research team, Carine Fédèle, Michel Gautero, Vincent Granet, Franck Guingne, Olivier Lecarme, and Lionel Nicolas, who have been extremely helpful and supportive beyond any possible repay. I especially thank Igor Litovsky and Olivier Lecarme for accepting to examine my work and attend to my defense. I am immensely thankful to Jacques Farréfor his insightful supervision, excellent guidance, and extreme kindness, that greatly exceed everything I could have wished for. I thank JoséFortes Gálvez for his fruitful collaboration, his warm wel- come on the occasion of my visit in Las Palmas, and his acceptance to work on my thesis jury. I gratefully acknowledge several insightful email discussions with Dick Grune, who taught me more on the literature on parsing technologies than anyone could wish for, and who first suggested that study- ing the construction of noncanonical parsers from the NFA to DFA point of view could bring interesting results. I am also thankful to Terence Parr for our discussions on LL(*), to Bas Basten for sharing his work on the comparison of ambiguity checking algorithms, and to Claus Brabrand and Anders Møller for our discussions on ambiguity detection and for granting me access to their tool. I am obliged to the many anonymous referees of my articles, whose remarks helped considerably improving my work, and I am much obliged to Eberhard Bertsch and Pierre Boullier for their reviews of the present thesis, and the numerous corrections they offered. I am honored by their interest x Foreword in my work, as well as by the interest shown by the remaining members of my jury, Mark-Jan Nederhof and Géraud Sénizergues. Finally, I thank all my friends and my family for their support, and my beloved Adeline for bearing with me through these years. Sophia Antipolis, October 16, 2007 Contents Abstract v Résumé vii Foreword ix Contents xi 1 Introduction 1 1.1 Motivation ............................ 1 1.2 Overview ............................. 3 2 General Background 5 2.1 Topics ............................... 5 2.1.1 Formal Syntax ...................... 5 2.1.1.1 Context-Free Grammars ............ 7 2.1.1.2 Ambiguity ................... 8 2.1.2 Parsing .......................... 10 2.1.2.1 Pushdown Automata . 10 2.1.2.2 The Determinism Hypothesis . 12 2.2 Scope ............................... 13 2.2.1 Parsers for Programming Languages . 13 2.2.1.1 Front Ends ................... 13 2.2.1.2 Requirements on Parsers . 15 2.2.2 Deterministic Parsers . 16 2.2.2.1 LR(k) Parsers . 17 2.2.2.2 Other Deterministic Parsers . 18 2.2.2.3 Issues with Deterministic Parsers . 20 2.2.3 Ambiguity in Programming Languages . 21 2.2.3.1 Further Nondeterminism Issues . 23 xii Contents 3 Advanced Parsers and their Issues 25 3.1 Case Studies ........................... 26 3.1.1 Java Modifiers ...................... 26 3.1.1.1 Grammar .................... 26 3.1.1.2 Input Examples . 26 3.1.1.3 Resolution ................... 28 3.1.2 Standard ML Layered Patterns . 28 3.1.2.1 Grammar .................... 28 3.1.2.2 Input Examples . 28 3.1.2.3 Resolution ................... 29 3.1.3 C++ Qualified Identifiers . 29 3.1.3.1 Grammar .................... 30 3.1.3.2 Input Examples . 30 3.1.3.3 Resolution ................... 30 3.1.4 Standard ML Case Expressions . 31 3.1.4.1 Grammar .................... 31 3.1.4.2 Input Examples . 32 3.1.4.3 Resolution ................... 34 3.2 Advanced Deterministic Parsing . 34 3.2.1 Predicated Parsers ...................

Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz

Rewriting Games on Nested Words

Regular Description of Context-Free Graph Languages

Grammars (Cfls & Cfgs) Reading: Chapter 5

A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages

Weighted Regular Tree Grammars with Storage

A Note on Nested Words

Practical Experiments with Regular Approximation of Context-Free Languages

Formal Language and Automata Theory (CS21004)

Efficient Recursive Parsing

Equivalence of Deterministic Nested Word to Word Transducers

Tree Automata Techniques and Applications

Advanced Parsing Techniques