Parsing for Agile Modeling

Total Page:16

File Type:pdf, Size:1020Kb

Parsing for Agile Modeling Parsing For Agile Modeling Inauguraldissertation der Philosophisch-naturwissenschaftlichen Fakultat¨ der Universitat¨ Bern vorgelegt von Jan Kursˇ von Tschechien | downloaded: 7.10.2021 Leiter der Arbeit: Prof. Dr. Oscar Nierstrasz Institut fur¨ Informatik https://doi.org/10.24442/boristheses.891 source: Parsing For Agile Modeling Inauguraldissertation der Philosophisch-naturwissenschaftlichen Fakultat¨ der Universitat¨ Bern vorgelegt von Jan Kursˇ von Tschechien Leiter der Arbeit: Prof. Dr. Oscar Nierstrasz Institut fur¨ Informatik Von der Philosophisch-naturwissenschaftlichen Fakultat¨ angenommen. Bern, 25.10.2016 Der Dekan: Prof. Dr. Gilberto Colangelo This dissertation can be downloaded from scg.unibe.ch. Copyright c 2016 by Jan Kursˇ This work is licensed under a Creative Commons Attribution-Non-Commercial-No derivative works 2.5 Switzerland license. To see the license go to http://creativecommons.org/licenses/by-sa/2.5/ch/ Attribution–ShareAlike Abstract Agile modeling refers to a set of methods that allow for a quick initial development of an importer and its further refinement. These requirements are not met simultaneously by the current parsing technology. Problems with parsing became a bottleneck in our research of agile modeling. In this thesis we introduce a novel approach to specify and build parsers. Our approach allows for expressive, tolerant and composable parsers with- out sacrificing performance. The approach is based on a context-sensitive extension of parsing expression grammars that allows a grammar engineer to specify complex language restrictions. To insure high parsing perfor- mance we automatically analyze a grammar definition and choose differ- ent parsing strategies for different parts of the grammar. We show that context-sensitive parsing expression grammars allow for highly composable, tolerant and variable-grained parsers that can be easily refined. Different parsing strategies significantly insure high-performance of parsers without sacrificing expressiveness of the underlying grammars. 2 Contents 1 Introduction 8 1.1 Agile Modeling . .9 1.2 Parsing Obstacles of Agile Modeling . 12 1.3 Thesis . 13 1.4 Our Contribution . 13 2 Overview of Parsing Technologies 15 2.1 Parsing in the Wild . 15 2.1.1 Expressive Power . 15 2.1.2 Composability . 17 2.1.3 Tolerant Grammars and Semi-Parsing . 18 2.1.4 Performance . 19 2.1.5 Parsing Frameworks . 20 2.2 Existing Limitations . 22 2.3 Our Solution . 23 3 Parsing Expression Grammars and PetitParser 24 3.1 Parsing Expression Grammars . 24 3.1.1 PEG Analysis . 25 3.1.2 Parser Combinators . 26 3.2 PetitParser . 29 4 Context Sensitivity in Parsing Expression Grammars 32 4.1 Motivating Example . 33 4.2 Parsing Contexts . 34 4.2.1 Context-Sensitive Extension . 35 4.2.2 Indentation Stack . 35 4.3 Parsing Contexts in Parsing Expression Grammars . 37 4.3.1 Parser Combinators . 40 4.3.2 CS-PEG analysis . 40 4.4 Implementation . 43 4.4.1 Performance . 44 4.5 Case Studies . 47 4.5.1 Python . 47 4.5.2 Markdown . 49 4.6 Related Work . 52 4.7 Conclusion . 54 4 CONTENTS 5 5 Semi-Parsing with Bounded Seas 55 5.1 Motivating Example . 56 5.1.1 Why not use Regular Expressions? . 56 5.1.2 A Na¨ıve Island Grammar . 57 5.1.3 An Advanced Island Grammar . 57 5.2 Bounded Seas . 59 5.2.1 The Sea Boundary . 60 5.2.2 The Context Sensitivity of Bounded Seas . 61 5.3 Bounded Seas in Parsing Expression Grammars . 62 5.3.1 The Water Operator . 63 5.3.2 The NEXT function . 67 5.3.3 BS-PEG analysis . 71 5.4 Implementation . 72 5.4.1 Performance . 73 5.5 Java Parser Case Study . 73 5.5.1 Without Nested Classes . 75 5.5.2 With Nested Classes . 75 5.5.3 With Return Types . 76 5.5.4 Performance . 77 5.6 Related Work . 79 5.7 Conclusion . 82 6 Adaptable Parsing Strategies 83 6.1 Motivating Example . 84 6.1.1 Composition Overhead . 86 6.1.2 Superfluous Intermediate Objects . 86 6.1.3 Backtracking Overhead . 87 6.1.4 Context-Sensitivity Overhead . 87 6.2 A Parser Combinator Compiler . 88 6.2.1 Adaptable Strategies . 88 6.3 Parser Optimizations . 90 6.3.1 Regular Optimizations . 92 6.3.2 Context-Free Optimizations . 94 6.3.3 Context-Sensitive Optimizations . 96 6.4 Performance analysis . 100 6.4.1 PetitParser compiler . 100 6.4.2 Benchmarks . 101 6.4.3 Parsing Strategies Impact . 103 6.4.4 Scanner Impact . 105 6.4.5 Memoization Impact . 107 6.4.6 Java Parsers Comparison . 108 6.4.7 Smalltalk Parsers Comparison . 108 6.5 Related Work . 109 6.6 Conclusion . 110 7 Ruby Case study 112 7.1 Ruby Structure . 112 7.1.1 The Dangling End Problem . 113 7.1.2 Measurements . 115 7.2 Ruby Method Calls . 116 CONTENTS 6 7.2.1 Measurements . 117 7.3 Performance . 119 7.4 Conclusion . 121 8 Conclusion 123 A Formal development of PEGs 136 B Bounded Seas Examples 141 B.1 Example of Dynamic NEXT computation . 141 B.2 Example of Static NEXT computation . 142 B.3 Overlapping Seas Example . 147 C Implementation 152 C.1 Bounded seas . 152 D Layout Sensitivity in the Wild 159 D.1 Haskell . 159 D.2 Python . 160 D.3 F#.................................... 161 D.4 YAML . 162 D.5 OCaml . 162 D.6 CoffeeScript . 163 D.7 Grace . 163 D.8 SRFI 49 — Indentation-Sensitive Scheme . 164 D.9 Elastic Tabstops . 164 E Scanner 166 E.1 Scanners in PEG-based parsers . 166 E.1.1 Tokens and Scannable Parsing Expressions . 166 E.1.2 Scannable Choices . 167 E.1.3 Scanner . 167 E.2 Regular Parsing Expressions . 170 E.3 Regular Parsing Expression Languages . 172 E.4 Finite State Automata . 175 E.4.1 Construction of finite state automata from regular parsing ex- pressions (FSA)........................ 178 E.4.2 Determinization of the automata with epsilons and priorities (D) 181 F Measurements 183 F.1 Summary . 183 F.2 Strategies Details . 187 F.2.1 Expressions . 190 F.2.2 IS Expressions . 191 F.2.3 CF Python . 192 F.2.4 Python . 193 F.2.5 Smalltalk . 194 F.2.6 Java . 195 F.2.7 Java Sea . 196 F.3 Scanner Impact . 197 F.3.1 Expressions . 199 CONTENTS 7 F.3.2 Smalltalk . 200 F.4 Memoization Details . 201 F.5 Smalltalk Parsers . 202 F.6 Java Parsers . ..
Recommended publications
  • Ece351 Lab Manual
    DEREK RAYSIDE & ECE351 STAFF ECE351 LAB MANUAL UNIVERSITYOFWATERLOO 2 derek rayside & ece351 staff Copyright © 2014 Derek Rayside & ECE351 Staff Compiled March 6, 2014 acknowledgements: • Prof Paul Ward suggested that we look into something with vhdl to have synergy with ece327. • Prof Mark Aagaard, as the ece327 instructor, consulted throughout the development of this material. • Prof Patrick Lam generously shared his material from the last offering of ece251. • Zhengfang (Alex) Duanmu & Lingyun (Luke) Li [1b Elec] wrote solutions to most labs in txl. • Jiantong (David) Gao & Rui (Ray) Kong [3b Comp] wrote solutions to the vhdl labs in antlr. • Aman Muthrej and Atulan Zaman [3a Comp] wrote solutions to the vhdl labs in Parboiled. • TA’s Jon Eyolfson, Vajih Montaghami, Alireza Mortezaei, Wenzhu Man, and Mohammed Hassan. • TA Wallace Wu developed the vhdl labs. • High school students Brian Engio and Tianyu Guo drew a number of diagrams for this manual, wrote Javadoc comments for the code, and provided helpful comments on the manual. Licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) version 2.5 or greater. http://creativecommons.org/licenses/by-sa/2.5/ca/ http://creativecommons.org/licenses/by-sa/3.0/ Contents 0 Overview 9 Compiler Concepts: call stack, heap 0.1 How the Labs Fit Together . 9 Programming Concepts: version control, push, pull, merge, SSH keys, IDE, 0.2 Learning Progressions . 11 debugger, objects, pointers 0.3 How this project compares to CS241, the text book, etc. 13 0.4 Student work load . 14 0.5 How this course compares to MIT 6.035 .......... 15 0.6 Where do I learn more? .
    [Show full text]
  • Validating LR(1) Parsers
    Validating LR(1) Parsers Jacques-Henri Jourdan1;2, Fran¸coisPottier2, and Xavier Leroy2 1 Ecole´ Normale Sup´erieure 2 INRIA Paris-Rocquencourt Abstract. An LR(1) parser is a finite-state automaton, equipped with a stack, which uses a combination of its current state and one lookahead symbol in order to determine which action to perform next. We present a validator which, when applied to a context-free grammar G and an automaton A, checks that A and G agree. Validating the parser pro- vides the correctness guarantees required by verified compilers and other high-assurance software that involves parsing. The validation process is independent of which technique was used to construct A. The validator is implemented and proved correct using the Coq proof assistant. As an application, we build a formally-verified parser for the C99 language. 1 Introduction Parsing remains an essential component of compilers and other programs that input textual representations of structured data. Its theoretical foundations are well understood today, and mature technology, ranging from parser combinator libraries to sophisticated parser generators, is readily available to help imple- menting parsers. The issue we focus on in this paper is that of parser correctness: how to obtain formal evidence that a parser is correct with respect to its specification? Here, following established practice, we choose to specify parsers via context-free grammars enriched with semantic actions. One application area where the parser correctness issue naturally arises is formally-verified compilers such as the CompCert verified C compiler [14]. In- deed, in the current state of CompCert, the passes that have been formally ver- ified start at abstract syntax trees (AST) for the CompCert C subset of C and extend to ASTs for three assembly languages.
    [Show full text]
  • A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages
    Journal of Information Processing Vol.24 No.2 256–264 (Mar. 2016) [DOI: 10.2197/ipsjjip.24.256] Regular Paper A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages Tetsuro Matsumura1,†1 Kimio Kuramitsu1,a) Received: May 18, 2015, Accepted: November 5, 2015 Abstract: Parsing Expression Grammars are a popular foundation for describing syntax. Unfortunately, several syn- tax of programming languages are still hard to recognize with pure PEGs. Notorious cases appears: typedef-defined names in C/C++, indentation-based code layout in Python, and HERE document in many scripting languages. To recognize such PEG-hard syntax, we have addressed a declarative extension to PEGs. The “declarative” extension means no programmed semantic actions, which are traditionally used to realize the extended parsing behavior. Nez is our extended PEG language, including symbol tables and conditional parsing. This paper demonstrates that the use of Nez Extensions can realize many practical programming languages, such as C, C#, Ruby, and Python, which involve PEG-hard syntax. Keywords: parsing expression grammars, semantic actions, context-sensitive syntax, and case studies on program- ming languages invalidate the declarative property of PEGs, thus resulting in re- 1. Introduction duced reusability of grammars. As a result, many developers need Parsing Expression Grammars [5], or PEGs, are a popu- to redevelop grammars for their software engineering tools. lar foundation for describing programming language syntax [6], In this paper, we propose a declarative extension of PEGs for [15]. Indeed, the formalism of PEGs has many desirable prop- recognizing context-sensitive syntax. The “declarative” exten- erties, including deterministic behaviors, unlimited look-aheads, sion means no arbitrary semantic actions that are written in a gen- and integrated lexical analysis known as scanner-less parsing.
    [Show full text]
  • Implementation of Processing in Racket 1 Introduction
    P2R Implementation of Processing in Racket Hugo Correia [email protected] Instituto Superior T´ecnico University of Lisbon Abstract. Programming languages are being introduced in several ar- eas of expertise, including design and architecture. Processing is an ex- ample of one of these languages that was created to teach architects and designers how to program. In spite of offering a wide set of features, Pro- cessing does not support the use of traditional computer-aided design applications, which are heavily used in the architecture industry. Rosetta is a generative design tool based on the Racket language that attempts to solve this problem. Rosetta provides a novel approach to de- sign creation, offering a set of programming languages that generate de- signs in different computer-aided design applications. However, Rosetta does not support Processing. Therefore, the goal is to add Processing to Rosetta's language set, offering architects that know Processing, an alternative solution that supports computer-aided design applications. In order to achieve this goal, a source-to-source compiler that translates Processing into Racket will be developed. This will also give the Racket community the ability to use Processing in the Racket environment, and, at the same time, will allow the Processing community to take advantage of Racket's libraries and development environment. In this report, an analysis of different language implementation mecha- nisms will be presented, focusing on the different steps of the compilation phase, as well as higher-level solutions, including Language Workbenches. In order to gain insight of source-to-source compiler creation, relevant existing source-to-source compilers are presented.
    [Show full text]
  • Adaptive LL(*) Parsing: the Power of Dynamic Analysis
    Adaptive LL(*) Parsing: The Power of Dynamic Analysis Terence Parr Sam Harwell Kathleen Fisher University of San Francisco University of Texas at Austin Tufts University [email protected] [email protected] kfi[email protected] Abstract PEGs are unambiguous by definition but have a quirk where Despite the advances made by modern parsing strategies such rule A ! a j ab (meaning “A matches either a or ab”) can never as PEG, LL(*), GLR, and GLL, parsing is not a solved prob- match ab since PEGs choose the first alternative that matches lem. Existing approaches suffer from a number of weaknesses, a prefix of the remaining input. Nested backtracking makes de- including difficulties supporting side-effecting embedded ac- bugging PEGs difficult. tions, slow and/or unpredictable performance, and counter- Second, side-effecting programmer-supplied actions (muta- intuitive matching strategies. This paper introduces the ALL(*) tors) like print statements should be avoided in any strategy that parsing strategy that combines the simplicity, efficiency, and continuously speculates (PEG) or supports multiple interpreta- predictability of conventional top-down LL(k) parsers with the tions of the input (GLL and GLR) because such actions may power of a GLR-like mechanism to make parsing decisions. never really take place [17]. (Though DParser [24] supports The critical innovation is to move grammar analysis to parse- “final” actions when the programmer is certain a reduction is time, which lets ALL(*) handle any non-left-recursive context- part of an unambiguous final parse.) Without side effects, ac- free grammar. ALL(*) is O(n4) in theory but consistently per- tions must buffer data for all interpretations in immutable data forms linearly on grammars used in practice, outperforming structures or provide undo actions.
    [Show full text]
  • Efficient Recursive Parsing
    Purdue University Purdue e-Pubs Department of Computer Science Technical Reports Department of Computer Science 1976 Efficient Recursive Parsing Christoph M. Hoffmann Purdue University, [email protected] Report Number: 77-215 Hoffmann, Christoph M., "Efficient Recursive Parsing" (1976). Department of Computer Science Technical Reports. Paper 155. https://docs.lib.purdue.edu/cstech/155 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. EFFICIENT RECURSIVE PARSING Christoph M. Hoffmann Computer Science Department Purdue University West Lafayette, Indiana 47907 CSD-TR 215 December 1976 Efficient Recursive Parsing Christoph M. Hoffmann Computer Science Department Purdue University- Abstract Algorithms are developed which construct from a given LL(l) grammar a recursive descent parser with as much, recursion resolved by iteration as is possible without introducing auxiliary memory. Unlike other proposed methods in the literature designed to arrive at parsers of this kind, the algorithms do not require extensions of the notational formalism nor alter the grammar in any way. The algorithms constructing the parsers operate in 0(k«s) steps, where s is the size of the grammar, i.e. the sum of the lengths of all productions, and k is a grammar - dependent constant. A speedup of the algorithm is possible which improves the bound to 0(s) for all LL(l) grammars, and constructs smaller parsers with some auxiliary memory in form of parameters
    [Show full text]
  • Conflict Resolution in a Recursive Descent Compiler Generator
    LL(1) Conflict Resolution in a Recursive Descent Compiler Generator Albrecht Wöß, Markus Löberbauer, Hanspeter Mössenböck Johannes Kepler University Linz, Institute of Practical Computer Science, Altenbergerstr. 69, 4040 Linz, Austria {woess,loeberbauer,moessenboeck}@ssw.uni-linz.ac.at Abstract. Recursive descent parsing is restricted to languages whose grammars are LL(1), i.e., which can be parsed top-down with a single lookahead symbol. Unfortunately, many languages such as Java, C++, or C# are not LL(1). There- fore recursive descent parsing cannot be used or the parser has to make its deci- sions based on semantic information or a multi-symbol lookahead. In this paper we suggest a systematic technique for resolving LL(1) conflicts in recursive descent parsing and show how to integrate it into a compiler gen- erator (Coco/R). The idea is to evaluate user-defined boolean expressions, in order to allow the parser to make its parsing decisions where a one symbol loo- kahead does not suffice. Using our extended compiler generator we implemented a compiler front end for C# that can be used as a framework for implementing a variety of tools. 1 Introduction Recursive descent parsing [16] is a popular top-down parsing technique that is sim- ple, efficient, and convenient for integrating semantic processing. However, it re- quires the grammar of the parsed language to be LL(1), which means that the parser must always be able to select between alternatives with a single symbol lookahead. Unfortunately, many languages such as Java, C++ or C# are not LL(1) so that one either has to resort to bottom-up LALR(1) parsing [5, 1], which is more powerful but less convenient for semantic processing, or the parser has to resolve the LL(1) con- flicts using semantic information or a multi-symbol lookahead.
    [Show full text]
  • LLLR Parsing: a Combination of LL and LR Parsing
    LLLR Parsing: a Combination of LL and LR Parsing Boštjan Slivnik University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia [email protected] Abstract A new parsing method called LLLR parsing is defined and a method for producing LLLR parsers is described. An LLLR parser uses an LL parser as its backbone and parses as much of its input string using LL parsing as possible. To resolve LL conflicts it triggers small embedded LR parsers. An embedded LR parser starts parsing the remaining input and once the LL conflict is resolved, the LR parser produces the left parse of the substring it has just parsed and passes the control back to the backbone LL parser. The LLLR(k) parser can be constructed for any LR(k) grammar. It produces the left parse of the input string without any backtracking and, if used for a syntax-directed translation, it evaluates semantic actions using the top-down strategy just like the canonical LL(k) parser. An LLLR(k) parser is appropriate for grammars where the LL(k) conflicting nonterminals either appear relatively close to the bottom of the derivation trees or produce short substrings. In such cases an LLLR parser can perform a significantly better error recovery than an LR parser since the most part of the input string is parsed with the backbone LL parser. LLLR parsing is similar to LL(∗) parsing except that it (a) uses LR(k) parsers instead of finite automata to resolve the LL(k) conflicts and (b) does not perform any backtracking.
    [Show full text]
  • Keyword Based Search Engine for Web Applications Using Lucene and Javacc Priyesh Wani, Nikita Shah, Kapil Thombare, Chaitalee Zade
    Priyesh Wani et al IJCSET |April 2012| Vol 2, Issue 4,1143-1146 Keyword based Search Engine for Web Applications Using Lucene and JavaCC Priyesh Wani, Nikita Shah, Kapil Thombare, Chaitalee Zade Information Technology, Computer Science University of Pune [email protected] [email protected] [email protected] [email protected] Abstract—Past Few years IT industry has taken leap towards functionality to an application or website. Lucene is able to developing Web based Applications. The need aroused due to achieve fast search responses because, instead of searching increasing globalization. Web applications has revolutionized the text directly, it searches an index instead. Which is the way business processes. It provides scalability and equivalent to retrieving pages in a book related to a extensibility in managing different processes in respective keyword by searching the index at the back of a book, as domains. With evolving standard of these web applications some functionalities has become part of that standard, Search opposed to searching the words in each page of the book. Engine being one of them. Organization dealing with large JavacCC: Java Compiler Compiler (JavaCC) is the most number of clients has to face an overhead of maintaining huge popular parser generator for use with Java applications. A databases. Retrieval of data in that case becomes difficult from parser generator is a tool that reads a grammar specification developer’s point of view. In this paper we have presented an and converts it to a Java program that can recognize efficient way of implementing keyword based search engine matches to the grammar.
    [Show full text]
  • Yacc (Yet Another Compiler-Compiler)
    Yacc (Yet Another Compiler-Compiler) EECS 665 Compiler Construction 1 The Role of the Parser Source Intermediate Program Token Parse Representation Tree Lexical Parser Rest of Front Analyzer End Get next token Symbol Table 2 Yacc – LALR Parser Generator Input *.y Yacc y.tab.c (yyparse) C Executable y.tab.h Compiler *.l Lex lex.yy.c (yylex) Output 3 Yacc Specification declarations %% translation rules %% supporting C routines 4 Yacc Declarations ● %{ C declarations %} ● Declare tokens ● %token name1 name2 … ● Yacc compiler %token INTVAL → #define INTVAL 257 5 Yacc Declarations (cont.) ● Precedence ● %left, %right, %nonassoc ● Precedence is established by the order the operators are listed (low to high) ● Encoding precedence manually ● Left recursive = left associative (recommended) ● Right recursive = right associative 6 Yacc Declarations (cont.) ● %start ● %union %union { type type_name } %token <type_name> token_name %type <type_name> nonterminal_name ● yylval 7 Yacc Translation Rules ● Form: A : Body ; where A is a nonterminal and Body is a list of nonterminals and terminals (':' and ';' are Yacc punctuation) ● Semantic actions can be enclosed before or after each grammar symbol in the body ● Ambiguous grammar rules ● Yacc chooses to shift in a shift/reduce conflict ● Yacc chooses the first production in reduce/reduce conflict 8 Yacc Translation Rules (cont.) ● When there is more than one rule with the same left hand side, a '|' can be used A : B C D ; A : E F ; A : G ; => A : B C D | E F | G ; 9 Yacc Actions ● Actions are C code segments
    [Show full text]
  • Compiler Design
    CCOOMMPPIILLEERR DDEESSIIGGNN -- PPAARRSSEERR http://www.tutorialspoint.com/compiler_design/compiler_design_parser.htm Copyright © tutorialspoint.com In the previous chapter, we understood the basic concepts involved in parsing. In this chapter, we will learn the various types of parser construction methods available. Parsing can be defined as top-down or bottom-up based on how the parse-tree is constructed. Top-Down Parsing We have learnt in the last chapter that the top-down parsing technique parses the input, and starts constructing a parse tree from the root node gradually moving down to the leaf nodes. The types of top-down parsing are depicted below: Recursive Descent Parsing Recursive descent is a top-down parsing technique that constructs the parse tree from the top and the input is read from left to right. It uses procedures for every terminal and non-terminal entity. This parsing technique recursively parses the input to make a parse tree, which may or may not require back-tracking. But the grammar associated with it ifnotleftfactored cannot avoid back- tracking. A form of recursive-descent parsing that does not require any back-tracking is known as predictive parsing. This parsing technique is regarded recursive as it uses context-free grammar which is recursive in nature. Back-tracking Top- down parsers start from the root node startsymbol and match the input string against the production rules to replace them ifmatched. To understand this, take the following example of CFG: S → rXd | rZd X → oa | ea Z → ai For an input string: read, a top-down parser, will behave like this: It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e.
    [Show full text]
  • Advanced Parsing Techniques
    Advanced Parsing Techniques Announcements ● Written Set 1 graded. ● Hard copies available for pickup right now. ● Electronic submissions: feedback returned later today. Where We Are Where We Are Parsing so Far ● We've explored five deterministic parsing algorithms: ● LL(1) ● LR(0) ● SLR(1) ● LALR(1) ● LR(1) ● These algorithms all have their limitations. ● Can we parse arbitrary context-free grammars? Why Parse Arbitrary Grammars? ● They're easier to write. ● Can leave operator precedence and associativity out of the grammar. ● No worries about shift/reduce or FIRST/FOLLOW conflicts. ● If ambiguous, can filter out invalid trees at the end. ● Generate candidate parse trees, then eliminate them when not needed. ● Practical concern for some languages. ● We need to have C and C++ compilers! Questions for Today ● How do you go about parsing ambiguous grammars efficiently? ● How do you produce all possible parse trees? ● What else can we do with a general parser? The Earley Parser Motivation: The Limits of LR ● LR parsers use shift and reduce actions to reduce the input to the start symbol. ● LR parsers cannot deterministically handle shift/reduce or reduce/reduce conflicts. ● However, they can nondeterministically handle these conflicts by guessing which option to choose. ● What if we try all options and see if any of them work? The Earley Parser ● Maintain a collection of Earley items, which are LR(0) items annotated with a start position. ● The item A → α·ω @n means we are working on recognizing A → αω, have seen α, and the start position of the item was the nth token. ● Using techniques similar to LR parsing, try to scan across the input creating these items.
    [Show full text]