Adaptive LL(*) Parsing: the Power of Dynamic Analysis

Total Page:16

File Type:pdf, Size:1020Kb

Adaptive LL(*) Parsing: the Power of Dynamic Analysis Adaptive LL(*) Parsing: The Power of Dynamic Analysis rtifact A Comple * t * te n * A * te Terence Parr Sam Harwell Kathleen Fisher s W i E s A e n C l l L o D C S o * * c P u e m s University of San Francisco University of Texas at Austin Tufts University E u O e e n v R t e O o d t a y * s E a * l u [email protected] [email protected] kfi[email protected] d a e t Abstract istic LALR(k) or LL(k) parser generators.1 As machine Despite the advances made by modern parsing strategies resources grew, researchers developed more powerful, but such as PEG, LL(*), GLR, and GLL, parsing is not a solved more costly, nondeterministic parsing strategies following problem. Existing approaches suffer from a number of weak- both “bottom-up” (LR-style) and “top-down” (LL-style) ap- nesses, including difficulties supporting side-effecting em- proaches. Strategies include GLR [26], Parser Expression bedded actions, slow and/or unpredictable performance, and Grammar (PEG) [9], LL(*) [20] from ANTLR 3, and re- counter-intuitive matching strategies. This paper introduces cently, GLL [25], a fully general top-down strategy. the ALL(*) parsing strategy that combines the simplicity, ef- Although these newer strategies are much easier to use ficiency, and predictability of conventional top-down LL(k) than LALR(k) and LL(k) parser generators, they suffer parsers with the power of a GLR-like mechanism to make from a variety of weaknesses. First, nondeterministic parsers parsing decisions. The critical innovation is to move gram- sometimes have unanticipated behavior. GLL and GLR re- mar analysis to parse-time, which lets ALL(*) handle any turn multiple parse trees (forests) for ambiguous grammars non-left-recursive context-free grammar. ALL(*) is O(n4) because they were designed to handle natural language in theory but consistently performs linearly on grammars grammars, which are typically ambiguous. For computer used in practice, outperforming general strategies such as languages, ambiguity is almost always an error. One can GLL and GLR by orders of magnitude. ANTLR 4 generates certainly walk a parse forest to disambiguate it, but that ALL(*) parsers and supports direct left-recursion through approach costs extra time, space, and machinery for the un- grammar rewriting. Widespread ANTLR 4 use (5000 down- common case. PEGs are unambiguous by definition but have a quirk where rule A a ab (meaning “A matches either a loads/month in 2013) provides evidence that ALL(*) is ef- ! | fective for a wide variety of applications. or ab”) can never match ab since PEGs choose the first alter- native that matches a prefix of the remaining input. Nested Categories and Subject Descriptors F.4.2 Grammars and backtracking makes debugging PEGs difficult. Other Rewriting Systems [Parsing]; D.3.1 Formal Lan- Second, side-effecting programmer-supplied actions (mu- guages [syntax] tators) like print statements should be avoided in any strat- General Terms Algorithms, Languages, Theory egy that continuously speculates (PEG) or supports multiple interpretations of the input (GLL and GLR) because such ac- Keywords nondeterministic parsing, DFA, augmented tran- tions may never really take place [17]. (Though DParser [24] sition networks, grammar, ALL(*), LL(*), GLR, GLL, PEG supports “final” actions when the programmer is certain a re- duction is part of an unambiguous final parse.) Without side 1. Introduction effects, actions must buffer data for all interpretations in im- Computer language parsing is still not a solved problem in mutable data structures or provide undo actions. The former practice, despite the sophistication of modern parsing strate- mechanism is limited by memory size and the latter is not gies and long history of academic study. When machine re- always easy or possible. The typical approach to avoiding sources were scarce, it made sense to force programmers to mutators is to construct a parse tree for post-parse process- contort their grammars to fit the constraints of determin- ing, but such artifacts fundamentally limit parsing to input files whose trees fit in memory. Parsers that build parse trees cannot analyze large files or infinite streams, such as net- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed work traffic, unless they can be processed in logical chunks. for profit or commercial advantage and that copies bear this notice and the full citation Third, our experiments (Section 7) show that GLL and on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, GLR can be slow and unpredictable in time and space. Their to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. 1 OOPSLA ’14, October 20–24, 2014, Portland, OR, USA. We use the term deterministic in the way that deterministic finite automata Copyright c 2014 ACM 978-1-4503-2585-1/14/10. $15.00. (DFA) differ from nondeterministic finite automata (NFA): The next sym- http://dx.doi.org/10.1145/2660193.2660202 bol(s) uniquely determine action. 579 complexities are, respectively, O(n3) and O(np+1) where p ity and resolves it in favor of the lowest production number is the length of the longest production in the grammar [14]. associated with a surviving subparser. (Productions are num- (GLR is typically quoted as O(n3) as Kipps [15] gave such bered to express precedence as an automatic means of re- an algorithm, albeit with a constant so high as to be imprac- solving ambiguities like PEGs; Bison also resolves conflicts tical.) In theory, general parsers should handle deterministic by choosing the production specified first.) Programmers can grammars in near-linear time. In practice, we found GLL and also embed semantic predicates [22] to choose between am- GLR to be ˜135x slower than ALL(*) on a corpus of 12,920 biguous interpretations. Java 6 library source files (123M) and 6 orders of magnitude ALL(*) parsers memoize analysis results, incrementally slower on a single 3.2M Java file, respectively. and dynamically building up a cache of DFA that map looka- LL(*) addresses these weaknesses by providing a mostly head phrases to predicted productions. (We use the term deterministic parsing strategy that uses regular expressions, analysis in the sense that ALL(*) analysis yields lookahead represented as deterministic finite automata (DFA), to po- DFA like static LL(*) analysis.) The parser can make fu- tentially examine the entire remaining input rather than the ture predictions at the same parser decision and lookahead fixed k-sequences of LL(k). Using DFA for lookahead lim- phrase quickly by consulting the cache. Unfamiliar input its LL(*) decisions to distinguishing alternatives with regular phrases trigger the grammar analysis mechanism, simultane- lookahead languages, even though lookahead languages (set ously predicting an alternative and updating the DFA. DFA of all possible remaining input phrases) are often context- are suitable for recording prediction results, despite the fact free. But the main problem is that the LL(*) grammar con- that the lookahead language at a given decision typically dition is statically undecidable and grammar analysis some- forms a context-free language. Dynamic analysis only needs times fails to find regular expressions that distinguish be- to consider the finite context-free language subsets encoun- tween alternative productions. ANTLR 3’s static analysis de- tered during a parse and any finite set is regular. tects and avoids potentially-undecidable situations, failing To avoid the exponential nature of nondeterministic sub- over to backtracking parsing decisions instead. This gives parsers, prediction uses a graph-structured stack (GSS) [25] LL(*) the same a ab quirk as PEGs for such decisions. to avoid redundant computations. GLR uses essentially the | Backtracking decisions that choose the first matching alter- same strategy except that ALL(*) only predicts productions native also cannot detect obvious ambiguities such A with such subparsers whereas GLR actually parses with ! ↵ ↵ where ↵ is some sequence of grammar symbols that them. Consequently, GLR must push terminals onto the GSS | makes ↵ ↵ non-LL(*). but ALL(*) does not. | ALL(*) parsers handle the task of matching terminals and expanding nonterminals with the simplicity of LL but have 1.1 Dynamic grammar analysis O(n4) theoretical time complexity (Theorem 6.3) because In this paper, we introduce Adaptive LL(*) , or ALL(*) , in the worst-case, the parser must make a prediction at each parsers that combine the simplicity of deterministic top- input symbol and each prediction must examine the entire re- down parsers with the power of a GLR-like mechanism to maining input; examining an input symbol can cost O(n2). make parsing de- cisions. Specifically, LL parsing suspends O(n4) is in line with the complexity of GLR. In Section 7, at each prediction decision point (nonterminal) and then re- we show empirically that ALL(*) parsers for common lan- sumes once the prediction mechanism has chosen the ap- guages are efficient and exhibit linear behavior in practice. propriate production to expand. The critical innovation is The advantages of ALL(*) stem from moving grammar to move grammar analysis to parse-time; no static gram- analysis to parse time, but this choice places an additional mar analysis is needed. This choice lets us avoid the un- burden on grammar functional testing. As with all dynamic decidability of static LL(*) grammar analysis and lets us approaches, programmers must cover as many grammar generate correct parsers (Theorem 6.1) for any non-left- position and input sequence combinations as possible to recursive context-free grammar (CFG).
Recommended publications
  • Derivatives of Parsing Expression Grammars
    Derivatives of Parsing Expression Grammars Aaron Moss Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada [email protected] This paper introduces a new derivative parsing algorithm for recognition of parsing expression gram- mars. Derivative parsing is shown to have a polynomial worst-case time bound, an improvement on the exponential bound of the recursive descent algorithm. This work also introduces asymptotic analysis based on inputs with a constant bound on both grammar nesting depth and number of back- tracking choices; derivative and recursive descent parsing are shown to run in linear time and constant space on this useful class of inputs, with both the theoretical bounds and the reasonability of the in- put class validated empirically. This common-case constant memory usage of derivative parsing is an improvement on the linear space required by the packrat algorithm. 1 Introduction Parsing expression grammars (PEGs) are a parsing formalism introduced by Ford [6]. Any LR(k) lan- guage can be represented as a PEG [7], but there are some non-context-free languages that may also be represented as PEGs (e.g. anbncn [7]). Unlike context-free grammars (CFGs), PEGs are unambiguous, admitting no more than one parse tree for any grammar and input. PEGs are a formalization of recursive descent parsers allowing limited backtracking and infinite lookahead; a string in the language of a PEG can be recognized in exponential time and linear space using a recursive descent algorithm, or linear time and space using the memoized packrat algorithm [6]. PEGs are formally defined and these algo- rithms outlined in Section 3.
    [Show full text]
  • Parser Tables for Non-LR(1) Grammars with Conflict Resolution Joel E
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Science of Computer Programming 75 (2010) 943–979 Contents lists available at ScienceDirect Science of Computer Programming journal homepage: www.elsevier.com/locate/scico The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution Joel E. Denny ∗, Brian A. Malloy School of Computing, Clemson University, Clemson, SC 29634, USA article info a b s t r a c t Article history: There has been a recent effort in the literature to reconsider grammar-dependent software Received 17 July 2008 development from an engineering point of view. As part of that effort, we examine a Received in revised form 31 March 2009 deficiency in the state of the art of practical LR parser table generation. Specifically, LALR Accepted 12 August 2009 sometimes generates parser tables that do not accept the full language that the grammar Available online 10 September 2009 developer expects, but canonical LR is too inefficient to be practical particularly during grammar development. In response, many researchers have attempted to develop minimal Keywords: LR parser table generation algorithms. In this paper, we demonstrate that a well known Grammarware Canonical LR algorithm described by David Pager and implemented in Menhir, the most robust minimal LALR LR(1) implementation we have discovered, does not always achieve the full power of Minimal LR canonical LR(1) when the given grammar is non-LR(1) coupled with a specification for Yacc resolving conflicts. We also detail an original minimal LR(1) algorithm, IELR(1) (Inadequacy Bison Elimination LR(1)), which we have implemented as an extension of GNU Bison and which does not exhibit this deficiency.
    [Show full text]
  • Adaptive LL(*) Parsing: the Power of Dynamic Analysis
    Adaptive LL(*) Parsing: The Power of Dynamic Analysis Terence Parr Sam Harwell Kathleen Fisher University of San Francisco University of Texas at Austin Tufts University [email protected] [email protected] kfi[email protected] Abstract PEGs are unambiguous by definition but have a quirk where Despite the advances made by modern parsing strategies such rule A ! a j ab (meaning “A matches either a or ab”) can never as PEG, LL(*), GLR, and GLL, parsing is not a solved prob- match ab since PEGs choose the first alternative that matches lem. Existing approaches suffer from a number of weaknesses, a prefix of the remaining input. Nested backtracking makes de- including difficulties supporting side-effecting embedded ac- bugging PEGs difficult. tions, slow and/or unpredictable performance, and counter- Second, side-effecting programmer-supplied actions (muta- intuitive matching strategies. This paper introduces the ALL(*) tors) like print statements should be avoided in any strategy that parsing strategy that combines the simplicity, efficiency, and continuously speculates (PEG) or supports multiple interpreta- predictability of conventional top-down LL(k) parsers with the tions of the input (GLL and GLR) because such actions may power of a GLR-like mechanism to make parsing decisions. never really take place [17]. (Though DParser [24] supports The critical innovation is to move grammar analysis to parse- “final” actions when the programmer is certain a reduction is time, which lets ALL(*) handle any non-left-recursive context- part of an unambiguous final parse.) Without side effects, ac- free grammar. ALL(*) is O(n4) in theory but consistently per- tions must buffer data for all interpretations in immutable data forms linearly on grammars used in practice, outperforming structures or provide undo actions.
    [Show full text]
  • The Design & Implementation of an Abstract Semantic Graph For
    Clemson University TigerPrints All Dissertations Dissertations 12-2011 The esiD gn & Implementation of an Abstract Semantic Graph for Statement-Level Dynamic Analysis of C++ Applications Edward Duffy Clemson University, [email protected] Follow this and additional works at: https://tigerprints.clemson.edu/all_dissertations Part of the Computer Sciences Commons Recommended Citation Duffy, Edward, "The eD sign & Implementation of an Abstract Semantic Graph for Statement-Level Dynamic Analysis of C++ Applications" (2011). All Dissertations. 832. https://tigerprints.clemson.edu/all_dissertations/832 This Dissertation is brought to you for free and open access by the Dissertations at TigerPrints. It has been accepted for inclusion in All Dissertations by an authorized administrator of TigerPrints. For more information, please contact [email protected]. THE DESIGN &IMPLEMENTATION OF AN ABSTRACT SEMANTIC GRAPH FOR STATEMENT-LEVEL DYNAMIC ANALYSIS OF C++ APPLICATIONS A Dissertation Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Computer Science by Edward B. Duffy December 2011 Accepted by: Dr. Brian A. Malloy, Committee Chair Dr. James B. von Oehsen Dr. Jason P. Hallstrom Dr. Pradip K. Srimani In this thesis, we describe our system, Hylian, for statement-level analysis, both static and dynamic, of a C++ application. We begin by extending the GNU gcc parser to generate parse trees in XML format for each of the compilation units in a C++ application. We then provide verification that the generated parse trees are structurally equivalent to the code in the original C++ application. We use the generated parse trees, together with an augmented version of the gcc test suite, to recover a grammar for the C++ dialect that we parse.
    [Show full text]
  • Parsing Techniques
    Dick Grune, Ceriel J.H. Jacobs Parsing Techniques 2nd edition — Monograph — September 27, 2007 Springer Berlin Heidelberg NewYork Hong Kong London Milan Paris Tokyo Contents Preface to the Second Edition : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : v Preface to the First Edition : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : xi 1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 1.1 Parsing as a Craft. 2 1.2 The Approach Used . 2 1.3 Outline of the Contents . 3 1.4 The Annotated Bibliography . 4 2 Grammars as a Generating Device : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 5 2.1 Languages as Infinite Sets . 5 2.1.1 Language . 5 2.1.2 Grammars . 7 2.1.3 Problems with Infinite Sets . 8 2.1.4 Describing a Language through a Finite Recipe . 12 2.2 Formal Grammars . 14 2.2.1 The Formalism of Formal Grammars . 14 2.2.2 Generating Sentences from a Formal Grammar . 15 2.2.3 The Expressive Power of Formal Grammars . 17 2.3 The Chomsky Hierarchy of Grammars and Languages . 19 2.3.1 Type 1 Grammars . 19 2.3.2 Type 2 Grammars . 23 2.3.3 Type 3 Grammars . 30 2.3.4 Type 4 Grammars . 33 2.3.5 Conclusion . 34 2.4 Actually Generating Sentences from a Grammar. 34 2.4.1 The Phrase-Structure Case . 34 2.4.2 The CS Case . 36 2.4.3 The CF Case . 36 2.5 To Shrink or Not To Shrink . 38 xvi Contents 2.6 Grammars that Produce the Empty Language . 41 2.7 The Limitations of CF and FS Grammars . 42 2.7.1 The uvwxy Theorem . 42 2.7.2 The uvw Theorem .
    [Show full text]
  • GLR* - an Efficient Noise-Skipping Parsing Algorithm for Context Free· Grammars
    GLR* - An Efficient Noise-skipping Parsing Algorithm For Context Free· Grammars Alon Lavie and Masaru Tomita School of Computer Scienc�, Carnegie Mellon University 5000 Forbes- Avenue, Pittsburgh, PA 15213 email: lavie©cs. emu . edu Abstract This paper describes GLR*, a parser that can parse any input sentence by ignoring unrec­ ognizable parts of the sentence. In case the standard parsing procedure fails to parse an input sentence, the parser nondeterministically skips some word(s) in the sentence, and returns the parse with fewest skipped words. Therefore, the parser will return some parse(s) with any input sentence, unless no part of the sentence can be recognized at all. The problem can be defined in the following way: Given a context-free grammar G and a sentence S, find and parse S' - the largest subset of words of S, such that S' E L(G). The algorithm described in this paper is a modificationof the Generalized LR (Tomita) parsing algorithm [Tomita, 1986]. The parser accommodates the skipping of words by allowing shift operations to be performed from inactive state nodes of the Graph Structured Stack. A heuristic similar to beam search makes the algorithm computationally tractable. There have been several other approaches to the problem of robust parsing, most of which are special purpose algorithms [Carbonell and Hayes, 1984), [Ward, 1991] and others. Because our approach is a modification to a standard context-free parsing algorithm, all the techniques and grammars developed for the standard parser can be applied as they are. Also, in case the input sentence is by itself grammatical, our parser behaves exactly as the standard GLR parser.
    [Show full text]
  • A Simple, Possibly Correct LR Parser for C11 Jacques-Henri Jourdan, François Pottier
    A Simple, Possibly Correct LR Parser for C11 Jacques-Henri Jourdan, François Pottier To cite this version: Jacques-Henri Jourdan, François Pottier. A Simple, Possibly Correct LR Parser for C11. ACM Transactions on Programming Languages and Systems (TOPLAS), ACM, 2017, 39 (4), pp.1 - 36. 10.1145/3064848. hal-01633123 HAL Id: hal-01633123 https://hal.archives-ouvertes.fr/hal-01633123 Submitted on 11 Nov 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 14 A simple, possibly correct LR parser for C11 Jacques-Henri Jourdan, Inria Paris, MPI-SWS François Pottier, Inria Paris The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied with English prose that describes the concept of “scope” and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a “lexical feedback” mechanism. It draws on folklore knowledge and adds several original as- pects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments in the grammar.
    [Show full text]
  • Lecture 3: Recursive Descent Limitations, Precedence Climbing
    Lecture 3: Recursive descent limitations, precedence climbing David Hovemeyer September 9, 2020 601.428/628 Compilers and Interpreters Today I Limitations of recursive descent I Precedence climbing I Abstract syntax trees I Supporting parenthesized expressions Before we begin... Assume a context-free struct Node *Parser::parse_A() { grammar has the struct Node *next_tok = lexer_peek(m_lexer); following productions on if (!next_tok) { the nonterminal A: error("Unexpected end of input"); } A → b C A → d E struct Node *a = node_build0(NODE_A); int tag = node_get_tag(next_tok); (A, C, E are if (tag == TOK_b) { nonterminals; b, d are node_add_kid(a, expect(TOK_b)); node_add_kid(a, parse_C()); terminals) } else if (tag == TOK_d) { What is the problem node_add_kid(a, expect(TOK_d)); node_add_kid(a, parse_E()); with the parse function } shown on the right? return a; } Limitations of recursive descent Recall: a better infix expression grammar Grammar (start symbol is A): A → i = A T → T*F A → E T → T/F E → E + T T → F E → E-T F → i E → T F → n Precedence levels: Nonterminal Precedence Meaning Operators Associativity A lowest Assignment = right E Expression + - left T Term * / left F highest Factor No Parsing infix expressions Can we write a recursive descent parser for infix expressions using this grammar? Parsing infix expressions Can we write a recursive descent parser for infix expressions using this grammar? No Left recursion Left-associative operators want to have left-recursive productions, but recursive descent parsers can’t handle left recursion
    [Show full text]
  • OPTIMAL AMBIGUITY PACKING in CONTEXT-FREE PARSERS with Intaloner Laleavie VED Unicarolynfica Pensttionein Rose Language Technologies Inst
    OPTIMAL AMBIGUITY PACKING IN CONTEXT-FREE PARSERS WITH INTAlonER LaLEAvie VED UNICarolynFICA PenstTIONein Rose Language Technologies Inst. Learning Res. and Dev. Center Carnegie Mellon University University of Pittsburgh Pittsburgh, PA 15213 Pittsburgh, PA 15213 alavie©cs . cmu .edu rosecpCpitt .edu Abstract Ambiguity packing is a well known technique for enhancing the efficiency of context-free parsers. However, in the case of unification-augmented context-free parsers where parsing is interleaved with feature unification, the propagation of feature structures imposes difficulties on the ability of the parser to effectively perform ambiguity packing. We demonstrate that a clever heuristic for prioritizing the execution order of grammar rules and parsing actions can achieve a high level of ambiguity packing that is provably optimal. We present empirical evaluations of the proposed technique, performed with both a Generalized LR parser and a chart parser, that demonstrate its effectiveness. 1 Introduction Efficient parsing algorithms for purely context-free grammars have long been known. Most natural language applications, however, require a far more linguistically detailed level of analysis that cannot be supported in a natural way by pure context-free grammars. Unification-based grammar formalisms (such as HPSG), on the other hand, are linguistically well founded, but are difficult to parse efficiently. Unification-augmented context-free grammar formalismshave thus become popular approaches where parsing can be based on an efficient context-free algorithm enhanced to produce linguistically rich representations. Whereas in some formalisms, such as the ANLT Grammar Formalism [2] [3], a context-free "backbone" is compiled from a pure unification grammar, in other formalisms [19] the grammar itself is written as a collection of context-free phrase structure rules augmented with unifi­ cation constraints that apply to feature structures that are associated with the grammar constituents.
    [Show full text]
  • Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice
    Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY August Schwerdfeger IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY July, 2010 c August Schwerdfeger 2010 ALL RIGHTS RESERVED Acknowledgments I would like to thank all my colleagues in the MELT group for all their input and assis- tance through the course of the project, and especially for their help in providing ready- made tests and applications for my software. Thanks specifically to Derek Bodin and Ted Kaminski for their efforts in integrating the Copper parser generator into MELT’s attribute grammar tools; to Lijesh Krishnan for developing the ableJ system of Java ex- tensions and assisting me extensively with its further development; and to Yogesh Mali for implementing and testing the Promela grammar with Copper. I also thank my advisor, Eric Van Wyk, for his continuous guidance throughout my time as a student, and also for going well above and beyond the call of duty in helping me to edit and prepare this thesis for presentation. I thank the members of my thesis committee, Mats Heimdahl, Gopalan Nadathur, and Wayne Richter, for their time and efforts in serving and reviewing. I especially wish to thank Prof. Nadathur for his many detailed questions and helpful suggestions for improving this thesis. Work on this thesis has been partially funded by National Science Foundation Grants 0347860 and 0905581. Support has also been received from funds provided by the Institute of Technology (soon to be renamed the College of Science and Engi- neering) and the Department of Computer Science and Engineering at the University of Minnesota.
    [Show full text]
  • Elkhound: a Fast, Practical GLR Parser Generator
    Published in Proc. of Conference on Compiler Construction, 2004, pp. 73–88. Elkhound: A Fast, Practical GLR Parser Generator Scott McPeak and George C. Necula ? University of California, Berkeley {smcpeak,necula}@cs.berkeley.edu Abstract. The Generalized LR (GLR) parsing algorithm is attractive for use in parsing programming languages because it is asymptotically efficient for typical grammars, and can parse with any context-free gram- mar, including ambiguous grammars. However, adoption of GLR has been slowed by high constant-factor overheads and the lack of a general, user-defined action interface. In this paper we present algorithmic and implementation enhancements to GLR to solve these problems. First, we present a hybrid algorithm that chooses between GLR and ordinary LR on a token-by-token ba- sis, thus achieving competitive performance for determinstic input frag- ments. Second, we describe a design for an action interface and a new worklist algorithm that can guarantee bottom-up execution of actions for acyclic grammars. These ideas are implemented in the Elkhound GLR parser generator. To demonstrate the effectiveness of these techniques, we describe our ex- perience using Elkhound to write a parser for C++, a language notorious for being difficult to parse. Our C++ parser is small (3500 lines), efficient and maintainable, employing a range of disambiguation strategies. 1 Introduction The state of the practice in automated parsing has changed little since the intro- duction of YACC (Yet Another Compiler-Compiler), an LALR(1) parser genera- tor, in 1975 [1]. An LALR(1) parser is deterministic: at every point in the input, it must be able to decide which grammar rule to use, if any, utilizing only one token of lookahead [2].
    [Show full text]
  • A Simple, Possibly Correct LR Parser for C11
    14 A simple, possibly correct LR parser for C11 Jacques-Henri Jourdan, Inria Paris, MPI-SWS François Pottier, Inria Paris The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied with English prose that describes the concept of “scope” and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a “lexical feedback” mechanism. It draws on folklore knowledge and adds several original as- pects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments in the grammar. Although not formally verified, our parser avoids several pitfalls that other implementations have fallen prey to. We believe that its simplicity, its mostly-declarative nature, and its high similarity with the C11 grammar are strong informal arguments in favor of its correctness. Our parser is accompanied with a small suite of “tricky” C11 programs. We hope that it may serve as a reference or a starting point in the implementation of compilers and analysis tools. CCS Concepts: •Software and its engineering ! Parsers; Additional Key Words and Phrases: Compilation; parsing; ambiguity; lexical feedback; C89; C99; C11 ACM Reference Format: Jacques-Henri Jourdan and François Pottier. 2017. A simple, possibly correct LR parser for C11 ACM Trans.
    [Show full text]