Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz
Total Page:16
File Type:pdf, Size:1020Kb
Approximating Context-Free Grammars for Parsing and Verification Sylvain Schmitz To cite this version: Sylvain Schmitz. Approximating Context-Free Grammars for Parsing and Verification. Software Engineering [cs.SE]. Université Nice Sophia Antipolis, 2007. English. tel-00271168 HAL Id: tel-00271168 https://tel.archives-ouvertes.fr/tel-00271168 Submitted on 8 Apr 2008 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Approximating Context-Free Grammars for Parsing and Verification The thorn bush of ambiguity. Sylvain Schmitz Laboratoire I3S, Universit´ede Nice - Sophia Antipolis & CNRS, France Ph.D. Thesis Universit´ede Nice - Sophia Antipolis, Ecole´ Doctorale ✭✭ Sciences et Technologies de l’Information et de la Communication ✮✮ Informatique Subject Jacques Farr´e, Universit´ede Nice - Sophia Antipolis Promotor Referees Eberhard Bertsch, Ruhr-Universit¨atBochum Pierre Boullier, INRIA Rocquencourt Defense September 24th, 2007 Date Sophia Antipolis, France Place Jury Pierre Boullier, INRIA Rocquencourt Jacques Farr´e, Universit´ede Nice - Sophia Antipolis Jos´eFortes G´alvez, Universidad de Las Palmas de Gran Canaria Olivier Lecarme, Universit´ede Nice - Sophia Antipolis Igor Litovsky, Universit´ede Nice - Sophia Antipolis Mark-Jan Nederhof, University of St Andrews G´eraud S´enizergues, Universit´ede Bordeaux I Approximating Context-Free Grammars for Parsing and Verification Abstract Programming language developers are blessed with the availability of efficient, powerful tools for parser generation. But even with au- tomated help, the implementation of a parser remains often overly complex. Although programs should convey an unambiguous meaning, the parsers produced for their syntax, specified by a context-free gram- mar, are most often nondeterministic, and even ambiguous. Fac- ing the limitations of traditional deterministic parsers, developers have embraced general parsing techniques, capable of exploring ev- ery nondeterministic choice, but offering no unambiguity warranty. The real challenge in parser development lies then in the proper identification and treatment of ambiguities—but these issues are undecidable. The grammar approximation technique discussed in the thesis is ap- plied to nondeterminism and ambiguity issues in two different ways. The first application is the generation of noncanonical parsers, less prone to nondeterminism, mostly thanks to their ability to exploit an unbounded context-free language as right context to guide their decision. Such parsers enforce the unambiguity of the grammar, and furthermore warrant a linear time parsing complexity. The second application is ambiguity detection in itself, with the insurance that a grammar reported as unambiguous is actually so, whatever level of approximation we might choose. Key Words Ambiguity, context-free grammar, noncanonical parser, position automaton, grammar engineering, verification of infinite systems ACM Categories D.2.4 [Software Engineering]: Software/Program Verification; D.3.1 [Programming Languages]: Formal Definitions and Theory— Syntax ; D.3.4 [Programming Languages]: Processors—Parsing ; F.1.1 [Computation by Abstract Devices]: Models of Computation; F.3.1 [Logics and Meanings of Programs]: Specifying and Verify- ing and Reasoning about Programs; F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems Approximation de grammaires alg´ebriques pour l’analyse syntaxique et la v´erification R´esum´e La th`ese s’int´eresse au probl`eme de l’analyse syntaxique pour les langages de programmation. Si ce sujet a d´ej`a´et´etrait´e`amaintes reprises, et bien que des outils performants pour la g´en´eration d’analyseurs syntaxiques existent et soient largement employ´es, l’impl´ementation de la partie frontale d’un compilateur reste en- core extrˆemement complexe. Ainsi, si le texte d’un programme informatique se doit de n’avoir qu’une seule interpr´etation possible, l’analyse des langages de pro- grammation, fond´ee sur une grammaire alg´ebrique, est, pour sa part, le plus souvent non d´eterministe, voire ambigu¨e. Confront´es aux insuffisances des analyseurs d´eterministes traditionnels, les d´eveloppeurs de parties frontales se sont tourn´es massivement vers des techniques d’analyse g´en´erale, `amˆeme d’explorer des choix non d´eterministes, mais aussi au prix de la garantie d’avoir bien trait´e toutes les ambigu¨ıt´es grammaticales. Une difficult´emajeure dans l’impl´ementation d’un compilateur r´eside alors dans l’identification (non d´ecidable en g´en´eral) et le traitement de ces ambigu¨ıt´es. Les techniques d´ecrites dans la th`ese s’articulent autour d’appro- ximations des grammaires `adeux fins. L’une est la g´en´eration d’a- nalyseurs syntaxiques non canoniques, qui sont moins sensibles aux difficult´es grammaticales, en particulier parce qu’ils peuvent exploi- ter un langage alg´ebrique non fini en guise de contexte droit pour r´esoudre un choix non d´eterministe. Ces analyseurs r´etablissent la garantie de non ambigu¨ıt´ede la grammaire, et en sus assurent un traitement en temps lin´eaire du texte `aanalyser. L’autre est la d´etection d’ambigu¨ıt´een tant que telle, avec l’assurance qu’une grammaire accept´ee est bien non ambigu¨equel que soit le degr´e d’approximation employ´e. Mots clefs Ambigu¨ıt´e, grammaire alg´ebrique, analyseur syntaxique non canon- ique, automate de positions, ing´enierie des grammaires, v´erification de syst`emes infinis Foreword This thesis compiles and corrects the results published in several articles (Schmitz, 2006; Fortes G´alvez et al., 2006; Schmitz, 2007a,b) written over the last four years at the Laboratoire d’Informatique Signaux et Syst`emes de Sophia Antipolis (I3S). I am indebted for the excellent working environment I have been offered during these years, and further grateful for the advice and encouragements provided by countless colleagues, friends, and family members. I am indebted to many friends and members of the laboratory, notably to Ana Almeida Matos, Jan Cederquist, Philippe Lahire, Igor Litovsky, Bruno Martin and S´ebastien Verel, and to the members of my research team, Carine F´ed`ele, Michel Gautero, Vincent Granet, Franck Guingne, Olivier Lecarme, and Lionel Nicolas, who have been extremely helpful and supportive beyond any possible repay. I especially thank Igor Litovsky and Olivier Lecarme for accepting to examine my work and attend to my defense. I am immensely thankful to Jacques Farr´efor his insightful supervision, excellent guidance, and extreme kindness, that greatly exceed everything I could have wished for. I thank Jos´eFortes G´alvez for his fruitful collaboration, his warm wel- come on the occasion of my visit in Las Palmas, and his acceptance to work on my thesis jury. I gratefully acknowledge several insightful email discus- sions with Dick Grune, who taught me more on the literature on parsing technologies than anyone could wish for, and who first suggested that study- ing the construction of noncanonical parsers from the NFA to DFA point of view could bring interesting results. I am also thankful to Terence Parr for our discussions on LL(*), to Bas Basten for sharing his work on the comparison of ambiguity checking algorithms, and to Claus Brabrand and Anders Møller for our discussions on ambiguity detection and for granting me access to their tool. I am obliged to the many anonymous referees of my articles, whose remarks helped considerably improving my work, and I am much obliged to Eberhard Bertsch and Pierre Boullier for their reviews of the present thesis, and the numerous corrections they offered. I am honored by their interest x Foreword in my work, as well as by the interest shown by the remaining members of my jury, Mark-Jan Nederhof and G´eraud S´enizergues. Finally, I thank all my friends and my family for their support, and my beloved Adeline for bearing with me through these years. Sophia Antipolis, October 16, 2007 Contents Abstract v R´esum´e vii Foreword ix Contents xi 1 Introduction 1 1.1 Motivation ............................ 1 1.2 Overview ............................. 3 2 General Background 5 2.1 Topics ............................... 5 2.1.1 Formal Syntax ...................... 5 2.1.1.1 Context-Free Grammars ............ 7 2.1.1.2 Ambiguity ................... 8 2.1.2 Parsing .......................... 10 2.1.2.1 Pushdown Automata . 10 2.1.2.2 The Determinism Hypothesis . 12 2.2 Scope ............................... 13 2.2.1 Parsers for Programming Languages . 13 2.2.1.1 Front Ends ................... 13 2.2.1.2 Requirements on Parsers . 15 2.2.2 Deterministic Parsers . 16 2.2.2.1 LR(k) Parsers . 17 2.2.2.2 Other Deterministic Parsers . 18 2.2.2.3 Issues with Deterministic Parsers . 20 2.2.3 Ambiguity in Programming Languages . 21 2.2.3.1 Further Nondeterminism Issues . 23 xii Contents 3 Advanced Parsers and their Issues 25 3.1 Case Studies ........................... 26 3.1.1 Java Modifiers ...................... 26 3.1.1.1 Grammar .................... 26 3.1.1.2 Input Examples . 26 3.1.1.3 Resolution ................... 28 3.1.2 Standard ML Layered Patterns . 28 3.1.2.1 Grammar .................... 28 3.1.2.2 Input Examples . 28 3.1.2.3 Resolution ................... 29 3.1.3 C++ Qualified Identifiers . 29 3.1.3.1 Grammar .................... 30 3.1.3.2 Input Examples . 30 3.1.3.3 Resolution ................... 30 3.1.4 Standard ML Case Expressions . 31 3.1.4.1 Grammar .................... 31 3.1.4.2 Input Examples . 32 3.1.4.3 Resolution ................... 34 3.2 Advanced Deterministic Parsing . 34 3.2.1 Predicated Parsers ...................