Lavie, A., "GLR* : a Robust Grammar-Focused Parser For


GLR*: A Robust Grammar-Focused Parser for Spontaneously Spoken Language

Alon Lavie
May 1996
CMU-CS-96-126

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Thesis Committee: Masaru Tomita (Chair), Jaime Carbonell, Alex Waibel, Edward Gibson (MIT)

Copyright © 1996 Alon Lavie

Keywords: Natural Language Processing, Speech Understanding, Machine Translation, Parsing, Generalized LR Parsing, JANUS

Abstract

The analysis of spoken language is widely considered to be a more challenging task than the analysis of written text. All of the difficulties of written language can generally be found in spoken language as well. Parsing spontaneous speech must, however, also deal with problems such as speech disfluencies, the looser notion of grammaticality, and the lack of clearly marked sentence boundaries. The contamination of the input with errors of a speech recognizer can further exacerbate these problems.

Most natural language parsing algorithms are designed to analyze “clean” grammatical input. Because they reject any input which is found to be ungrammatical in even the slightest way, such parsers are unsuitable for parsing spontaneous speech, where completely grammatical input is the exception more than the rule.

This thesis describes GLR*, a parsing system based on Tomita's Generalized LR parsing algorithm, that was designed to be robust to two particular types of extra-grammaticality: noise in the input, and limited grammar coverage. GLR* attempts to overcome these forms of extra-grammaticality by ignoring the unparsable words and fragments and conducting a search for the maximal subset of the original input that is covered by the grammar. The parser is coupled with a beam search heuristic, that limits the combinations of skipped words considered by the parser, and ensures that the parser will operate within feasible time and space bounds.
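
The search problem described above (find the largest subset of the input words that the grammar covers) can be illustrated with a small brute-force sketch. This is not the GLR* algorithm, which works on the graph-structured stack of a GLR parser; it is only meant to make the problem statement concrete. The toy grammar, the lexicon, the function names `recognizes` and `max_parsable_subset`, and the `max_skips` bound (a crude stand-in for the search-limiting heuristics) are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy grammar in Chomsky normal form (not taken from the thesis).
BINARY = {            # rules of the form A -> B C, keyed by (B, C)
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {           # rules of the form A -> word
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "sees": "V", "chases": "V",
}


def recognizes(words, start="S"):
    """CYK recognizer: True if the toy grammar derives `words` from `start`."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        if word in LEXICON:
            table[i][i].add(LEXICON[word])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left in table[i][k]:
                    for right in table[k + 1][j]:
                        parent = BINARY.get((left, right))
                        if parent:
                            table[i][j].add(parent)
    return start in table[0][n - 1]


def max_parsable_subset(words, max_skips=3):
    """Try 0, 1, 2, ... skipped words; return (kept, skipped) or None."""
    n = len(words)
    for skips in range(0, min(max_skips, n) + 1):
        for dropped in combinations(range(n), skips):
            kept = [w for i, w in enumerate(words) if i not in dropped]
            if recognizes(kept):
                return kept, [words[i] for i in dropped]
    return None


if __name__ == "__main__":
    print(max_parsable_subset("the dog um sees uh a cat".split()))
```

Because candidates are enumerated in order of increasing skip count, the first subset that parses also skips the fewest words. The enumeration is exponential in the worst case, which is one way to see why a practical parser needs heuristics that limit the combinations of skipped words it considers.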
The developed parsing system includes several tools designed to address the difficulties of parsing spontaneous speech. To cope with high levels of ambiguity, we developed a statistical disambiguation module, in which probabilities are attached directly to the actions in the LR parsing table. The parser must also determine the “best” parse from among the different parsable subsets of an input. We thus designed a general framework for combining a collection of parse evaluation measures into an integrated heuristic for evaluating and ranking the parses produced by the GLR* parser. This framework was applied to a set of four parse scoring measures developed for the JANUS scheduling domain and the ATIS domain. We added a parse quality heuristic, that allows the parser to self-judge the quality of the parse chosen as best, and to detect cases in which important information is likely to have been skipped.

To demonstrate its suitability to parsing spontaneous speech, the GLR* parser was integrated into the JANUS speech translation system. Our evaluations on both transcribed and speech recognized input have indicated that the version of the system that uses GLR* produces between 15% and 30% more acceptable translations than a corresponding version that uses the original non-robust GLR parser. We also developed a version of GLR* that is suitable to parsing word lattices produced by the speech recognizer, and investigated how lattice parsing can potentially overcome errors of the speech recognizer and further improve end-to-end performance of the speech translation system.

Acknowledgements

There are many who have helped me along the long road that has culminated in this thesis, some in direct and obvious ways, others in small and supposedly unrelated ways, but important nonetheless.
I would like to thank my advisor, Masaru Tomita, for inspiring and supporting my research interests in parsing algorithms, for introducing me to the problems of spoken language parsing and understanding, for suggesting the topic of this thesis, and for guiding me in my thesis work. I also wish to thank the other members of my thesis committee: Jaime Carbonell, Alex Waibel and Ted Gibson. Alex brought me into the JANUS speech-to-speech translation project, which provided a natural and practical setting for applying and testing my work. I am particularly grateful for his guidance on system performance issues and evaluation methods. Ted Gibson provided me with objective insight about my work, and with careful and well thought comments on my thesis draft. Special thanks are due to Jaime Carbonell, for sharing his experienced perspective on Machine Translation and AI, for stepping into the “advisor shoes” in Tommy's absence, and for his very helpful comments and suggestions on the preliminary draft of this thesis.

I would like to thank all my friends and colleagues in the JANUS project, for providing a fun and exciting environment for conducting research on speech understanding and translation. Particular thanks go to Lori Levin, Donna Gates, Oren Glickman, Noah Coccaro, Carolyn Rosé, Marsal Gavaldà, Laura Mayfield, Keiko Horiguchi and Kaori Shima, who worked closely with me on the project, and assisted me in a variety of experiments and evaluations reported in this thesis.

On the personal side, there is a whole bunch of friends, colleagues and family members, whom I would like to thank for their support and encouragement. I will not even attempt to list them all, for fear that surely someone will be forgotten.
Yet, I feel a need to mention a handful of people to whom special thanks are due: To my friend and ex-officemate Shai, who greeted and hosted me when I first arrived here in Pittsburgh, spent five years with me in a windowless Wean office, and helped me with so many “system hacking” questions... To my best friend and next door neighbor Dean, who is always there to listen and give advice, and is really good at putting things in perspective. To my good friend Orna, for being just that, but also for the numerous dinners and coffee breaks during the last two intense months of writing, that made them so much more bearable. To my family in Israel, for their support, and in particular to my father, who not only provided constant encouragement and advice, but has also been a true role model for me to follow. And most of all, to Bob, for taking care of “things” during the very busy months of final thesis writing, and for traveling with me along the longest and most difficult segments of the road to the PhD. I couldn't have done it without you.
Contents

1 Introduction and Overview
  1.1 Introduction
    1.1.1 Extra-grammaticalities in Spontaneous Speech
  1.2 Research Goals
  1.3 Thesis Summary
    1.3.1 Foundation: GLR Parsing
    1.3.2 The GLR* Parsing Algorithm
    1.3.3 Statistical Disambiguation
    1.3.4 Parse Evaluation Heuristics
    1.3.5 Parsing Spontaneous Speech using GLR*
    1.3.6 Parsing Speech Lattices using GLR*
  1.4 Thesis Contributions
  1.5 Previous Related Work
    1.5.1 Other Approaches to Robust Parsing
    1.5.2 Other Work on Parsing Speech
    1.5.3 Adaptive and Self-learning Approaches
2 Generalized LR Parsing
  2.1 Principles of LR Parsing
  2.2 The GLR Parsing Algorithm
    2.2.1 The Graph Structured Stack
    2.2.2 Local Ambiguity Packing
    2.2.3 Shared Packed Forests
  2.3 Computational Complexity and Performance of the GLR Parser
  2.4 GLR and Unification Based Grammars
3 The GLR* Parsing Algorithm
  3.1 Introduction
  3.2 The Unrestricted GLR* Parsing Algorithm
    3.2.1 Design Considerations
    3.2.2 Outline of the Unrestricted GLR* Algorithm
  3.3 Enhanced Local Ambiguity Packing
  3.4 An Example
  3.5 Complexity and Performance of the Unrestricted GLR* Algorithm
    3.5.1 Time complexity of Unrestricted GLR*
    3.5.2 Runtime Performance of Unrestricted GLR*
  3.6 Controlling Parser Search
    3.6.1 The k-word Skip Limit Heuristic
    3.6.2 The Beam Search Heuristic
    3.6.3 Empirical Evaluation
Recommended publications
  • Parser Tables for Non-LR(1) Grammars with Conflict Resolution Joel E
    Science of Computer Programming 75 (2010) 943–979. The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution. Joel E. Denny, Brian A. Malloy. School of Computing, Clemson University, Clemson, SC 29634, USA. Article history: received 17 July 2008; received in revised form 31 March 2009; accepted 12 August 2009; available online 10 September 2009. Keywords: Grammarware; Canonical LR; LALR; Minimal LR; Yacc; Bison. Abstract: There has been a recent effort in the literature to reconsider grammar-dependent software development from an engineering point of view. As part of that effort, we examine a deficiency in the state of the art of practical LR parser table generation. Specifically, LALR sometimes generates parser tables that do not accept the full language that the grammar developer expects, but canonical LR is too inefficient to be practical particularly during grammar development. In response, many researchers have attempted to develop minimal LR parser table generation algorithms. In this paper, we demonstrate that a well known algorithm described by David Pager and implemented in Menhir, the most robust minimal LR(1) implementation we have discovered, does not always achieve the full power of canonical LR(1) when the given grammar is non-LR(1) coupled with a specification for resolving conflicts. We also detail an original minimal LR(1) algorithm, IELR(1) (Inadequacy Elimination LR(1)), which we have implemented as an extension of GNU Bison and which does not exhibit this deficiency.
  • The Design & Implementation of an Abstract Semantic Graph For
    Clemson University TigerPrints, All Dissertations, 12-2011. The Design & Implementation of an Abstract Semantic Graph for Statement-Level Dynamic Analysis of C++ Applications. Edward Duffy, Clemson University. Recommended Citation: Duffy, Edward, "The Design & Implementation of an Abstract Semantic Graph for Statement-Level Dynamic Analysis of C++ Applications" (2011). All Dissertations. 832. https://tigerprints.clemson.edu/all_dissertations/832. A Dissertation Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy, Computer Science, by Edward B. Duffy, December 2011. Accepted by: Dr. Brian A. Malloy (Committee Chair), Dr. James B. von Oehsen, Dr. Jason P. Hallstrom, Dr. Pradip K. Srimani. In this thesis, we describe our system, Hylian, for statement-level analysis, both static and dynamic, of a C++ application. We begin by extending the GNU gcc parser to generate parse trees in XML format for each of the compilation units in a C++ application. We then provide verification that the generated parse trees are structurally equivalent to the code in the original C++ application. We use the generated parse trees, together with an augmented version of the gcc test suite, to recover a grammar for the C++ dialect that we parse.
  • Parsing Techniques
    Dick Grune, Ceriel J.H. Jacobs. Parsing Techniques, 2nd edition. Monograph, September 27, 2007. Springer: Berlin, Heidelberg, New York, Hong Kong, London, Milan, Paris, Tokyo.
    Contents
      Preface to the Second Edition
      Preface to the First Edition
      1 Introduction
        1.1 Parsing as a Craft
        1.2 The Approach Used
        1.3 Outline of the Contents
        1.4 The Annotated Bibliography
      2 Grammars as a Generating Device
        2.1 Languages as Infinite Sets
          2.1.1 Language
          2.1.2 Grammars
          2.1.3 Problems with Infinite Sets
          2.1.4 Describing a Language through a Finite Recipe
        2.2 Formal Grammars
          2.2.1 The Formalism of Formal Grammars
          2.2.2 Generating Sentences from a Formal Grammar
          2.2.3 The Expressive Power of Formal Grammars
        2.3 The Chomsky Hierarchy of Grammars and Languages
          2.3.1 Type 1 Grammars
          2.3.2 Type 2 Grammars
          2.3.3 Type 3 Grammars
          2.3.4 Type 4 Grammars
          2.3.5 Conclusion
        2.4 Actually Generating Sentences from a Grammar
          2.4.1 The Phrase-Structure Case
          2.4.2 The CS Case
          2.4.3 The CF Case
        2.5 To Shrink or Not To Shrink
        2.6 Grammars that Produce the Empty Language
        2.7 The Limitations of CF and FS Grammars
          2.7.1 The uvwxy Theorem
          2.7.2 The uvw Theorem
  • GLR* - an Efficient Noise-skipping Parsing Algorithm for Context Free Grammars
    GLR* - An Efficient Noise-skipping Parsing Algorithm For Context Free Grammars. Alon Lavie and Masaru Tomita. School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213. email: lavie@cs.cmu.edu. Abstract: This paper describes GLR*, a parser that can parse any input sentence by ignoring unrecognizable parts of the sentence. In case the standard parsing procedure fails to parse an input sentence, the parser nondeterministically skips some word(s) in the sentence, and returns the parse with fewest skipped words. Therefore, the parser will return some parse(s) with any input sentence, unless no part of the sentence can be recognized at all. The problem can be defined in the following way: Given a context-free grammar G and a sentence S, find and parse S', the largest subset of words of S, such that S' ∈ L(G). The algorithm described in this paper is a modification of the Generalized LR (Tomita) parsing algorithm [Tomita, 1986]. The parser accommodates the skipping of words by allowing shift operations to be performed from inactive state nodes of the Graph Structured Stack. A heuristic similar to beam search makes the algorithm computationally tractable. There have been several other approaches to the problem of robust parsing, most of which are special purpose algorithms [Carbonell and Hayes, 1984], [Ward, 1991] and others. Because our approach is a modification to a standard context-free parsing algorithm, all the techniques and grammars developed for the standard parser can be applied as they are. Also, in case the input sentence is by itself grammatical, our parser behaves exactly as the standard GLR parser.
  • A Simple, Possibly Correct LR Parser for C11 Jacques-Henri Jourdan, François Pottier
    A Simple, Possibly Correct LR Parser for C11. Jacques-Henri Jourdan, François Pottier. To cite this version: Jacques-Henri Jourdan, François Pottier. A Simple, Possibly Correct LR Parser for C11. ACM Transactions on Programming Languages and Systems (TOPLAS), ACM, 2017, 39 (4), pp. 1-36. 10.1145/3064848. hal-01633123. HAL Id: hal-01633123, https://hal.archives-ouvertes.fr/hal-01633123, submitted on 11 Nov 2017. Article 14. A simple, possibly correct LR parser for C11. Jacques-Henri Jourdan, Inria Paris, MPI-SWS; François Pottier, Inria Paris. The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied with English prose that describes the concept of “scope” and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a “lexical feedback” mechanism. It draws on folklore knowledge and adds several original aspects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments in the grammar.
  • Optimal Ambiguity Packing in Context-Free Parsers with Interleaved Unification
    Optimal Ambiguity Packing in Context-Free Parsers with Interleaved Unification. Alon Lavie, Language Technologies Inst., Carnegie Mellon University, Pittsburgh, PA 15213, alavie@cs.cmu.edu. Carolyn Penstein Rosé, Learning Res. and Dev. Center, University of Pittsburgh, Pittsburgh, PA 15213, rosecp@pitt.edu. Abstract: Ambiguity packing is a well known technique for enhancing the efficiency of context-free parsers. However, in the case of unification-augmented context-free parsers where parsing is interleaved with feature unification, the propagation of feature structures imposes difficulties on the ability of the parser to effectively perform ambiguity packing. We demonstrate that a clever heuristic for prioritizing the execution order of grammar rules and parsing actions can achieve a high level of ambiguity packing that is provably optimal. We present empirical evaluations of the proposed technique, performed with both a Generalized LR parser and a chart parser, that demonstrate its effectiveness. 1 Introduction: Efficient parsing algorithms for purely context-free grammars have long been known. Most natural language applications, however, require a far more linguistically detailed level of analysis that cannot be supported in a natural way by pure context-free grammars. Unification-based grammar formalisms (such as HPSG), on the other hand, are linguistically well founded, but are difficult to parse efficiently. Unification-augmented context-free grammar formalisms have thus become popular approaches where parsing can be based on an efficient context-free algorithm enhanced to produce linguistically rich representations. Whereas in some formalisms, such as the ANLT Grammar Formalism [2] [3], a context-free "backbone" is compiled from a pure unification grammar, in other formalisms [19] the grammar itself is written as a collection of context-free phrase structure rules augmented with unification constraints that apply to feature structures that are associated with the grammar constituents.
  • Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice
    Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice. A thesis submitted to the Faculty of the Graduate School of the University of Minnesota by August Schwerdfeger, in partial fulfillment of the requirements for the degree of Doctor of Philosophy. July, 2010. © August Schwerdfeger 2010. All rights reserved. Acknowledgments: I would like to thank all my colleagues in the MELT group for all their input and assistance through the course of the project, and especially for their help in providing ready-made tests and applications for my software. Thanks specifically to Derek Bodin and Ted Kaminski for their efforts in integrating the Copper parser generator into MELT’s attribute grammar tools; to Lijesh Krishnan for developing the ableJ system of Java extensions and assisting me extensively with its further development; and to Yogesh Mali for implementing and testing the Promela grammar with Copper. I also thank my advisor, Eric Van Wyk, for his continuous guidance throughout my time as a student, and also for going well above and beyond the call of duty in helping me to edit and prepare this thesis for presentation. I thank the members of my thesis committee, Mats Heimdahl, Gopalan Nadathur, and Wayne Richter, for their time and efforts in serving and reviewing. I especially wish to thank Prof. Nadathur for his many detailed questions and helpful suggestions for improving this thesis. Work on this thesis has been partially funded by National Science Foundation Grants 0347860 and 0905581. Support has also been received from funds provided by the Institute of Technology (soon to be renamed the College of Science and Engineering) and the Department of Computer Science and Engineering at the University of Minnesota.
  • Elkhound: a Fast, Practical GLR Parser Generator
    Published in Proc. of Conference on Compiler Construction, 2004, pp. 73–88. Elkhound: A Fast, Practical GLR Parser Generator. Scott McPeak and George C. Necula, University of California, Berkeley, {smcpeak,necula}@cs.berkeley.edu. Abstract: The Generalized LR (GLR) parsing algorithm is attractive for use in parsing programming languages because it is asymptotically efficient for typical grammars, and can parse with any context-free grammar, including ambiguous grammars. However, adoption of GLR has been slowed by high constant-factor overheads and the lack of a general, user-defined action interface. In this paper we present algorithmic and implementation enhancements to GLR to solve these problems. First, we present a hybrid algorithm that chooses between GLR and ordinary LR on a token-by-token basis, thus achieving competitive performance for deterministic input fragments. Second, we describe a design for an action interface and a new worklist algorithm that can guarantee bottom-up execution of actions for acyclic grammars. These ideas are implemented in the Elkhound GLR parser generator. To demonstrate the effectiveness of these techniques, we describe our experience using Elkhound to write a parser for C++, a language notorious for being difficult to parse. Our C++ parser is small (3500 lines), efficient and maintainable, employing a range of disambiguation strategies. 1 Introduction: The state of the practice in automated parsing has changed little since the introduction of YACC (Yet Another Compiler-Compiler), an LALR(1) parser generator, in 1975 [1]. An LALR(1) parser is deterministic: at every point in the input, it must be able to decide which grammar rule to use, if any, utilizing only one token of lookahead [2].
  • A Simple, Possibly Correct LR Parser for C11
    Article 14. A simple, possibly correct LR parser for C11. Jacques-Henri Jourdan, Inria Paris, MPI-SWS; François Pottier, Inria Paris. The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied with English prose that describes the concept of “scope” and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a “lexical feedback” mechanism. It draws on folklore knowledge and adds several original aspects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments in the grammar. Although not formally verified, our parser avoids several pitfalls that other implementations have fallen prey to. We believe that its simplicity, its mostly-declarative nature, and its high similarity with the C11 grammar are strong informal arguments in favor of its correctness. Our parser is accompanied with a small suite of “tricky” C11 programs. We hope that it may serve as a reference or a starting point in the implementation of compilers and analysis tools. CCS Concepts: • Software and its engineering → Parsers. Additional Key Words and Phrases: Compilation; parsing; ambiguity; lexical feedback; C89; C99; C11. ACM Reference Format: Jacques-Henri Jourdan and François Pottier. 2017. A simple, possibly correct LR parser for C11. ACM Trans.
  • A Tool for Generalised LR Parsing in Haskell
    A Tool for Generalised LR Parsing in Haskell. Ben Medlock, St. John's College. Single Honours CS Project Report, April 2002. This report is submitted as part of the degree of Ben Medlock to the Board of Examiners in the Department of Computer Science, University of Durham. Word count: 17,857. Abstract: Parsing is the process of deriving structure from sentences in a given language. This structure is derived from a specification of the language defined by a formal grammar. Many different types of grammar exist, but those most often used in the field of computer science are known as context-free (CF) grammars. The LR parsing technique can be used to efficiently parse a large class of unambiguous CF grammars. However, many languages can only be specified using ambiguous grammars. These include natural language (NL) grammars, as well as a host of simpler grammars. Tomita (85) proposed an extension to LR parsing known as generalised LR (GLR) parsing which allows languages derived from ambiguous CF grammars to be parsed efficiently. The project implements a version of Tomita’s algorithm in the functional programming language Haskell and integrates it with the Haskell-based LR parser-generator tool Happy. The amendments to Happy allow it to generate a GLR parser, based on Tomita’s algorithm, capable of parsing languages derived from ambiguous CF grammars. Our implementation of Tomita’s algorithm is analysed both theoretically (time and space orders) and through the use of Haskell profiling tools. Table of Contents: Ch 1. Introduction
  • Incremental Scannerless Generalized LR Parsing
    SPLASH: G: Incremental Scannerless Generalized LR Parsing. Maarten P. Sijm, Delft University of Technology, Delft, The Netherlands. Abstract: We present the Incremental Scannerless Generalized LR (ISGLR) parsing algorithm, which combines the benefits of Incremental Generalized LR (IGLR) parsing and Scannerless Generalized LR (SGLR) parsing. The ISGLR parser can reuse parse trees from unchanged regions in the input and thus only needs to parse changed regions. We also present incremental techniques for imploding the parse tree to an Abstract Syntax Tree (AST) and syntax highlighting. Scannerless parsing relies heavily on non-determinism during parsing, negatively impacting the incrementality of ISGLR parsing. We evaluated the ISGLR parsing algorithm using file histories from Git, achieving a speedup of up to 25 times over non-incremental SGLR. CCS Concepts: • Software and its engineering → Incre- Introduction (excerpt): [...] a separate lexing (or scanning) phase, supports modelling the entire language syntax in one single grammar, and allows composition of grammars for different languages. One notable disadvantage is that the SGLR parsing algorithm of Visser [9] is a batch algorithm, meaning that it must process each entire file in one pass. This becomes a problem for software projects that have large files, as every small change requires the entire file to be parsed again. Incremental Generalized LR (IGLR) parsing is an improvement over batch Generalized LR (GLR) parsing. Amongst others, Wagner [10] and TreeSitter [8] have created parsing algorithms that allow rapid parsing of changes to large files. However, these algorithms use a separate incremental lexical analysis phase which complicates the implementation of incremental parsing [11] and does not directly allow language composition.
  • Faster Scannerless GLR Parsing
    Faster Scannerless GLR Parsing. Giorgios Economopoulos, Paul Klint, Jurgen Vinju. Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Abstract: Analysis and renovation of large software portfolios requires syntax analysis of multiple, usually embedded, languages and this is beyond the capabilities of many standard parsing techniques. The traditional separation between lexer and parser falls short due to the limitations of tokenization based on regular expressions when handling multiple lexical grammars. In such cases scannerless parsing provides a viable solution. It uses the power of context-free grammars to be able to deal with a wide variety of issues in parsing lexical syntax. However, it comes at the price of less efficiency. The structure of tokens is obtained using a more powerful but more time and memory intensive parsing algorithm. Scannerless grammars are also more non-deterministic than their tokenized counterparts, increasing the burden on the parsing algorithm even further. In this paper we investigate the application of the Right-Nulled Generalized LR parsing algorithm (RNGLR) to scannerless parsing. We adapt the Scannerless Generalized LR parsing and filtering algorithm (SGLR) to implement the optimizations of RNGLR. We present an updated parsing and filtering algorithm, called SRNGLR, and analyze its performance in comparison to SGLR on ambiguous grammars for the programming languages C, Java, Python, SASL, and C++. Measurements show that SRNGLR is on average 33% faster than SGLR, but is 95% faster on the highly ambiguous SASL grammar. For the mainstream languages C, C++, Java and Python the average speedup is 16%.