10. Pegs, Packrats and Parser Combinators
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Derivatives of Parsing Expression Grammars
Derivatives of Parsing Expression Grammars Aaron Moss Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada [email protected] This paper introduces a new derivative parsing algorithm for recognition of parsing expression gram- mars. Derivative parsing is shown to have a polynomial worst-case time bound, an improvement on the exponential bound of the recursive descent algorithm. This work also introduces asymptotic analysis based on inputs with a constant bound on both grammar nesting depth and number of back- tracking choices; derivative and recursive descent parsing are shown to run in linear time and constant space on this useful class of inputs, with both the theoretical bounds and the reasonability of the in- put class validated empirically. This common-case constant memory usage of derivative parsing is an improvement on the linear space required by the packrat algorithm. 1 Introduction Parsing expression grammars (PEGs) are a parsing formalism introduced by Ford [6]. Any LR(k) lan- guage can be represented as a PEG [7], but there are some non-context-free languages that may also be represented as PEGs (e.g. anbncn [7]). Unlike context-free grammars (CFGs), PEGs are unambiguous, admitting no more than one parse tree for any grammar and input. PEGs are a formalization of recursive descent parsers allowing limited backtracking and infinite lookahead; a string in the language of a PEG can be recognized in exponential time and linear space using a recursive descent algorithm, or linear time and space using the memoized packrat algorithm [6]. PEGs are formally defined and these algo- rithms outlined in Section 3. -
CS 432 Fall 2020 Top-Down (LL) Parsing
CS 432 Fall 2020 Mike Lam, Professor Top-Down (LL) Parsing Compilation Current focus "Back end" Source code Tokens Syntax tree Machine code char data[20]; 7f 45 4c 46 01 int main() { 01 01 00 00 00 float x 00 00 00 00 00 = 42.0; ... return 7; } Lexing Parsing Code Generation & Optimization "Front end" Review ● Recognize regular languages with finite automata – Described by regular expressions – Rule-based transitions, no memory required ● Recognize context-free languages with pushdown automata – Described by context-free grammars – Rule-based transitions, MEMORY REQUIRED ● Add a stack! Segue KEY OBSERVATION: Allowing the translator to use memory to track parse state information enables a wider range of automated machine translation. Chomsky Hierarchy of Languages Recursively enumerable Context-sensitive Context-free Most useful Regular for PL https://en.wikipedia.org/wiki/Chomsky_hierarchy Parsing Approaches ● Top-down: begin with start symbol (root of parse tree), and gradually expand non-terminals – Stack contains leaves that still need to be expanded ● Bottom-up: begin with terminals (leaves of parse tree), and gradually connect using non-terminals – Stack contains roots of subtrees that still need to be connected A V = E Top-down a E + E Bottom-up V V b c Top-Down Parsing root = createNode(S) focus = root A → V = E push(null) V → a | b | c token = nextToken() E → E + E loop: | V if (focus is non-terminal): B = chooseRuleAndExpand(focus) for each b in B.reverse(): focus.addChild(createNode(b)) push(b) A focus = pop() else if (token -
Total Parser Combinators
Total Parser Combinators Nils Anders Danielsson School of Computer Science, University of Nottingham, United Kingdom [email protected] Abstract The library has an interface which is very similar to those of A monadic parser combinator library which guarantees termination classical monadic parser combinator libraries. For instance, con- of parsing, while still allowing many forms of left recursion, is sider the following simple, left recursive, expression grammar: described. The library’s interface is similar to those of many other term :: factor term '+' factor parser combinator libraries, with two important differences: one is factor ::D atom j factor '*' atom that the interface clearly specifies which parts of the constructed atom ::D numberj '(' term ')' parsers may be infinite, and which parts have to be finite, using D j dependent types and a combination of induction and coinduction; We can define a parser which accepts strings from this grammar, and the other is that the parser type is unusually informative. and also computes the values of the resulting expressions, as fol- The library comes with a formal semantics, using which it is lows (the combinators are described in Section 4): proved that the parser combinators are as expressive as possible. mutual The implementation is supported by a machine-checked correct- term factor ness proof. D ] term >> λ n j D 1 ! Categories and Subject Descriptors D.1.1 [Programming Tech- tok '+' >> λ D ! niques]: Applicative (Functional) Programming; E.1 [Data Struc- factor >> λ n D 2 ! tures]; F.3.1 [Logics and Meanings of Programs]: Specifying and return .n n / 1 C 2 Verifying and Reasoning about Programs; F.4.2 [Mathematical factor atom Logic and Formal Languages]: Grammars and Other Rewriting D ] factor >> λ n Systems—Grammar types, Parsing 1 j tok '*' >>D λ ! D ! General Terms Languages, theory, verification atom >> λ n2 return .n n / D ! 1 ∗ 2 Keywords Dependent types, mixed induction and coinduction, atom number parser combinators, productivity, termination D tok '(' >> λ j D ! ] term >> λ n 1. -
Adaptive LL(*) Parsing: the Power of Dynamic Analysis
Adaptive LL(*) Parsing: The Power of Dynamic Analysis Terence Parr Sam Harwell Kathleen Fisher University of San Francisco University of Texas at Austin Tufts University [email protected] [email protected] kfi[email protected] Abstract PEGs are unambiguous by definition but have a quirk where Despite the advances made by modern parsing strategies such rule A ! a j ab (meaning “A matches either a or ab”) can never as PEG, LL(*), GLR, and GLL, parsing is not a solved prob- match ab since PEGs choose the first alternative that matches lem. Existing approaches suffer from a number of weaknesses, a prefix of the remaining input. Nested backtracking makes de- including difficulties supporting side-effecting embedded ac- bugging PEGs difficult. tions, slow and/or unpredictable performance, and counter- Second, side-effecting programmer-supplied actions (muta- intuitive matching strategies. This paper introduces the ALL(*) tors) like print statements should be avoided in any strategy that parsing strategy that combines the simplicity, efficiency, and continuously speculates (PEG) or supports multiple interpreta- predictability of conventional top-down LL(k) parsers with the tions of the input (GLL and GLR) because such actions may power of a GLR-like mechanism to make parsing decisions. never really take place [17]. (Though DParser [24] supports The critical innovation is to move grammar analysis to parse- “final” actions when the programmer is certain a reduction is time, which lets ALL(*) handle any non-left-recursive context- part of an unambiguous final parse.) Without side effects, ac- free grammar. ALL(*) is O(n4) in theory but consistently per- tions must buffer data for all interpretations in immutable data forms linearly on grammars used in practice, outperforming structures or provide undo actions. -
Disambiguation Filters for Scannerless Generalized LR Parsers
Disambiguation Filters for Scannerless Generalized LR Parsers Mark G. J. van den Brand1,4, Jeroen Scheerder2, Jurgen J. Vinju1, and Eelco Visser3 1 Centrum voor Wiskunde en Informatica (CWI) Kruislaan 413, 1098 SJ Amsterdam, The Netherlands {Mark.van.den.Brand,Jurgen.Vinju}@cwi.nl 2 Department of Philosophy, Utrecht University Heidelberglaan 8, 3584 CS Utrecht, The Netherlands [email protected] 3 Institute of Information and Computing Sciences, Utrecht University P.O. Box 80089, 3508TB Utrecht, The Netherlands [email protected] 4 LORIA-INRIA 615 rue du Jardin Botanique, BP 101, F-54602 Villers-l`es-Nancy Cedex, France Abstract. In this paper we present the fusion of generalized LR pars- ing and scannerless parsing. This combination supports syntax defini- tions in which all aspects (lexical and context-free) of the syntax of a language are defined explicitly in one formalism. Furthermore, there are no restrictions on the class of grammars, thus allowing a natural syntax tree structure. Ambiguities that arise through the use of unrestricted grammars are handled by explicit disambiguation constructs, instead of implicit defaults that are taken by traditional scanner and parser gener- ators. Hence, a syntax definition becomes a full declarative description of a language. Scannerless generalized LR parsing is a viable technique that has been applied in various industrial and academic projects. 1 Introduction Since the introduction of efficient deterministic parsing techniques, parsing is considered a closed topic for research, both by computer scientists and by practi- cioners in compiler construction. Tools based on deterministic parsing algorithms such as LEX & YACC [15,11] (LALR) and JavaCC (recursive descent), are con- sidered adequate for dealing with almost all modern (programming) languages. -
Introduction to Programming Languages and Syntax
Programming Languages Session 1 – Main Theme Programming Languages Overview & Syntax Dr. Jean-Claude Franchitti New York University Computer Science Department Courant Institute of Mathematical Sciences Adapted from course textbook resources Programming Language Pragmatics (3rd Edition) Michael L. Scott, Copyright © 2009 Elsevier 1 Agenda 11 InstructorInstructor andand CourseCourse IntroductionIntroduction 22 IntroductionIntroduction toto ProgrammingProgramming LanguagesLanguages 33 ProgrammingProgramming LanguageLanguage SyntaxSyntax 44 ConclusionConclusion 2 Who am I? - Profile - ¾ 27 years of experience in the Information Technology Industry, including thirteen years of experience working for leading IT consulting firms such as Computer Sciences Corporation ¾ PhD in Computer Science from University of Colorado at Boulder ¾ Past CEO and CTO ¾ Held senior management and technical leadership roles in many large IT Strategy and Modernization projects for fortune 500 corporations in the insurance, banking, investment banking, pharmaceutical, retail, and information management industries ¾ Contributed to several high-profile ARPA and NSF research projects ¾ Played an active role as a member of the OMG, ODMG, and X3H2 standards committees and as a Professor of Computer Science at Columbia initially and New York University since 1997 ¾ Proven record of delivering business solutions on time and on budget ¾ Original designer and developer of jcrew.com and the suite of products now known as IBM InfoSphere DataStage ¾ Creator of the Enterprise -
Staged Parser Combinators for Efficient Data Processing
Staged Parser Combinators for Efficient Data Processing Manohar Jonnalagedda∗ Thierry Coppeyz Sandro Stucki∗ Tiark Rompf y∗ Martin Odersky∗ ∗LAMP zDATA, EPFL {first.last}@epfl.ch yOracle Labs: {first.last}@oracle.com Abstract use the language’s abstraction capabilities to enable compo- Parsers are ubiquitous in computing, and many applications sition. As a result, a parser written with such a library can depend on their performance for decoding data efficiently. look like formal grammar descriptions, and is also readily Parser combinators are an intuitive tool for writing parsers: executable: by construction, it is well-structured, and eas- tight integration with the host language enables grammar ily maintainable. Moreover, since combinators are just func- specifications to be interleaved with processing of parse re- tions in the host language, it is easy to combine them into sults. Unfortunately, parser combinators are typically slow larger, more powerful combinators. due to the high overhead of the host language abstraction However, parser combinators suffer from extremely poor mechanisms that enable composition. performance (see Section 5) inherent to their implementa- We present a technique for eliminating such overhead. We tion. There is a heavy penalty to be paid for the expres- use staging, a form of runtime code generation, to dissoci- sivity that they allow. A grammar description is, despite its ate input parsing from parser composition, and eliminate in- declarative appearance, operationally interleaved with input termediate data structures and computations associated with handling, such that parts of the grammar description are re- parser composition at staging time. A key challenge is to built over and over again while input is processed. -
Packrat Parsing: Simple, Powerful, Lazy, Linear Time
Packrat Parsing: Simple, Powerful, Lazy, Linear Time Functional Pearl Bryan Ford Massachusetts Institute of Technology Cambridge, MA [email protected] Abstract 1 Introduction Packrat parsing is a novel technique for implementing parsers in a There are many ways to implement a parser in a functional pro- lazy functional programming language. A packrat parser provides gramming language. The simplest and most direct approach is the power and flexibility of top-down parsing with backtracking and top-down or recursive descent parsing, in which the components unlimited lookahead, but nevertheless guarantees linear parse time. of a language grammar are translated more-or-less directly into a Any language defined by an LL(k) or LR(k) grammar can be rec- set of mutually recursive functions. Top-down parsers can in turn ognized by a packrat parser, in addition to many languages that be divided into two categories. Predictive parsers attempt to pre- conventional linear-time algorithms do not support. This additional dict what type of language construct to expect at a given point by power simplifies the handling of common syntactic idioms such as “looking ahead” a limited number of symbols in the input stream. the widespread but troublesome longest-match rule, enables the use Backtracking parsers instead make decisions speculatively by try- of sophisticated disambiguation strategies such as syntactic and se- ing different alternatives in succession: if one alternative fails to mantic predicates, provides better grammar composition properties, match, then the parser “backtracks” to the original input position and allows lexical analysis to be integrated seamlessly into parsing. and tries another. -
Safestrings Representing Strings As Structured Data
1 SafeStrings Representing Strings as Structured Data DAVID KELLY, University College London MARK MARRON, Microso Research DAVID CLARK, University College London EARL T. BARR, University College London Strings are ubiquitous in code. Not all strings are created equal, some contain structure that makes them incompatible with other strings. CSS units are an obvious example. Worse, type checkers cannot see this structure: this is the latent structure problem. We introduce SafeStrings to solve this problem and expose latent structure in strings. Once visible, operations can leverage this structure to efficiently manipulate it; further, SafeStrings permit the establishment of closure properties. SafeStrings harness the subtyping and inheritance mechanics of their host language to create a natural hierarchy of string subtypes. SafeStrings define an elegant programming model over strings: the front end use of a SafeString is clear and uncluered, with complexity confined inside the definition of a particular SafeString. ey are lightweight, language- agnostic and deployable, as we demonstrate by implementing SafeStrings in TypeScript. SafeStrings reduce the surface area for cross-site scripting, argument selection defects, and they can facilitate fuzzing and analysis. 1 Introduction Strings are the great dumping ground of programming. eir meagre structural obligations mean that it is easy to pour information into them. Developers use them to store semi-structured, even structured, data. omas Edison (paraphrasing Reynolds) captured the reason: “ ere is no expe- dient to which a man will not resort to avoid the real labor of thinking.” Take colour assignment in css. As flat strings they take the form '#XXX' or '#XXXXXX', but they implicitly encode three integer values for rgb. -
Parser Combinator Compiler (Pc-Compiler)
Optimizing Parser Combinators Jan Kurˇs†, Jan Vran´y‡, Mohammad Ghafari†, Mircea Lungu¶ and Oscar Nierstrasz† †Software Composition Group, ‡Software Engineering Group, ¶Software Engineering and University of Bern, Czech Technical University, Architecture Group, Switzerland Czech Republic Universtiy of Groningen, scg.unibe.ch swing.fit.cvut.cz Netherlands http://www.cs.rug.nl/search Abstract also powerful in their expressiveness as they can parse Parser combinators are a popular approach to pars- beyond context-free languages (e.g., layout-sensitive ing. Parser combinators follow the structure of an un- languages (Hutton and Meijer 1996; Adams and A˘gacan derlying grammar, are modular, well-structured, easy 2014)). to maintain, and can recognize a large variety of lan- Nevertheless, parser combinators yet remain a tech- guages including context-sensitive ones. However, their nology for prototyping and not for the actual deploy- universality and flexibility introduces a noticeable per- ment. Firstly, naive implementations do not handle left- formance overhead. Time-wise, parser combinators can- recursive grammars, though there are already solutions not compete with parsers generated by well-performing to this issue (Frost et al. 2007; Warth et al. 2008). Sec- parser generators or optimized hand-written code. ondly, the expressive power of parser combinators comes Techniques exist to achieve a linear asymptotic per- at a price of less efficiency. A parser combinator uses formance of parser combinators, yet there is still a sig- the full power of a Turing-equivalent formalism to rec- nificant constant multiplier. This can be further lowered ognize even simple languages that could be recognized using meta-programming techniques. by finite state machines or pushdown automata. -
Compiler Construction
Compiler construction PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 10 Dec 2011 02:23:02 UTC Contents Articles Introduction 1 Compiler construction 1 Compiler 2 Interpreter 10 History of compiler writing 14 Lexical analysis 22 Lexical analysis 22 Regular expression 26 Regular expression examples 37 Finite-state machine 41 Preprocessor 51 Syntactic analysis 54 Parsing 54 Lookahead 58 Symbol table 61 Abstract syntax 63 Abstract syntax tree 64 Context-free grammar 65 Terminal and nonterminal symbols 77 Left recursion 79 Backus–Naur Form 83 Extended Backus–Naur Form 86 TBNF 91 Top-down parsing 91 Recursive descent parser 93 Tail recursive parser 98 Parsing expression grammar 100 LL parser 106 LR parser 114 Parsing table 123 Simple LR parser 125 Canonical LR parser 127 GLR parser 129 LALR parser 130 Recursive ascent parser 133 Parser combinator 140 Bottom-up parsing 143 Chomsky normal form 148 CYK algorithm 150 Simple precedence grammar 153 Simple precedence parser 154 Operator-precedence grammar 156 Operator-precedence parser 159 Shunting-yard algorithm 163 Chart parser 173 Earley parser 174 The lexer hack 178 Scannerless parsing 180 Semantic analysis 182 Attribute grammar 182 L-attributed grammar 184 LR-attributed grammar 185 S-attributed grammar 185 ECLR-attributed grammar 186 Intermediate language 186 Control flow graph 188 Basic block 190 Call graph 192 Data-flow analysis 195 Use-define chain 201 Live variable analysis 204 Reaching definition 206 Three address -
Incremental Scannerless Generalized LR Parsing
SPLASH: G: Incremental Scannerless Generalized LR Parsing Maarten P. Sijm Delft University of Technology Delft, The Netherlands [email protected] Abstract a separate lexing (or scanning) phase, supports modelling We present the Incremental Scannerless Generalized LR the entire language syntax in one single grammar, and al- (ISGLR) parsing algorithm, which combines the benefits of lows composition of grammars for different languages. One Incremental Generalized LR (IGLR) parsing and Scanner- notable disadvantage is that the SGLR parsing algorithm of less Generalized LR (SGLR) parsing. The ISGLR parser can Visser [9] is a batch algorithm, meaning that it must process reuse parse trees from unchanged regions in the input and each entire file in one pass. This becomes a problem for soft- thus only needs to parse changed regions. We also present ware projects that have large files, as every small change incremental techniques for imploding the parse tree to an requires the entire file to be parsed again. Abstract Syntax Tree (AST) and syntax highlighting. Scan- Incremental Generalized LR (IGLR) parsing is an improve- nerless parsing relies heavily on non-determinism during ment over batch Generalized LR (GLR) parsing. Amongst parsing, negatively impacting the incrementality of ISGLR others, Wagner [10] and TreeSitter [8] have created parsing parsing. We evaluated the ISGLR parsing algorithm using algorithms that allow rapid parsing of changes to large files. file histories from Git, achieving a speedup of up to 25 times However, these algorithms use a separate incremental lexical over non-incremental SGLR. analysis phase which complicates the implementation of in- cremental parsing [11] and does not directly allow language CCS Concepts • Software and its engineering → Incre- composition.