Parsing Wrap-Up Thoughts on Assignment 4 PL Feature: Type System
Total Page:16
File Type:pdf, Size:1020Kb
Parsing Wrap-Up Thoughts on Assignment 4 PL Feature: Type System CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Wednesday, February 19, 2020 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks [email protected] © 2017–2020 Glenn G. Chappell Review 2020-02-19 CS F331 / CSCE A331 Spring 2020 2 Review Overview of Lexing & Parsing Parsing Lexer Parser Character Lexeme AST or Stream Stream Error return (*dp + 2.6); //x return (*dp + 2.6); //x returnStmt key op id op num binOp: + lit punct punct unOp: * numLit: 2.6 Two phases: id: dp § Lexical analysis (lexing) § Syntax analysis (parsing) The output of a parser is typically an abstract syntax tree (AST). Specifications of these will vary. 2020-02-19 CS F331 / CSCE A331 Spring 2020 3 Review The Basics of Syntax Analysis — Categories of Parsers Parsing methods can be divided into two broad categories. Top-Down Parsers § Go through derivation top to bottom, expanding nonterminals. § Important subcategory: LL parsers (read input Left-to-right, produce Leftmost derivation). § Often hand-coded—but not always. § Method we look at: Predictive Recursive Descent. Bottom-Up Parsers § Go through the derivation bottom to top, reducing substrings to nonterminals. § Important subcategory: LR parsers (read input Left-to-right, produce Rightmost derivation). § Almost always automatically generated. § Method we look at: Shift-Reduce. 2020-02-19 CS F331 / CSCE A331 Spring 2020 4 Review The Basics of Syntax Analysis — Categories of Grammars LL(k) grammars: those usable by LL parsers that k is a number. make decisions based on the next k lexemes. Similarly: LR(k) grammars: those usable by LR parsers that make decisions based on the next k lexemes. All Grammars CFGs LR(1) Grammars Regular Grammars LL(1) Grammars 2020-02-19 CS F331 / CSCE A331 Spring 2020 5 Review Recursive-Descent Parsing [1/2] Recursive Descent is a top-down parsing method. § When we avoid backtracking: predictive. Predictive Recursive- Descent is an LL parsing method. § There is one parsing function for each nonterminal. This parses all strings that the nonterminal can be expanded into. The natural grammar for expressions with left-associative binary operators is not LL(1). But we can transform it so that it is usable by a predictive recursive-descent parser. Not Usable (left recursion) Usable e → t e → t { ( “+” | “-” ) t } | e ( “+” | “-” ) t 2020-02-19 CS F331 / CSCE A331 Spring 2020 6 Review Recursive-Descent Parsing [2/2] On a correct parse, a parser typically returns an abstract syntax tree (AST). We specify the format of an AST for each line in our grammar. It is helpful to include information in the AST telling what kind of entity each node represents. Expression: a + 2 AST (diagram): binOp: + simpleVar: a numLit: 2 AST (Lua): {{ BIN_OP, "+" }, { SIMPLE_VAR, "a" }, { NUMLIT_VAL, "2" }} See rdparser4.lua. 2020-02-19 CS F331 / CSCE A331 Spring 2020 7 Review Shift-Reduce Parsing [1/2] We looked at a class of table-based bottom-up parsing algorithms. expr 8 Parser execution uses a state machine with Stack = 4 a stack, called a Shift-Reduce parser. ID (a) 2 Each stack item holds a symbol and a state. 1 Parser operation is specified by a parsing table in two parts— action table & goto table—constructed before execution. At each time step a lookup in the action table is done. One of four operations will be specified. § SHIFT. Shift the next input symbol onto the stack. § REDUCE. Apply a production, reducing a string of symbols on the stack to a single nonterminal. See the goto table for the new state. § ACCEPT. Done, syntactically correct. § ERROR. Done, syntactically incorrect. 2020-02-19 CS F331 / CSCE A331 Spring 2020 8 Review Shift-Reduce Parsing [2/2] In its most general form, Shift-Reduce parsing is an LR parsing method that can handle all LR(1) grammars. But generating Shift-Reduce parsing tables in a straightforward way tends to result in far too many states. Various ways of generating smaller tables are known, but these do not work for all LR(1) grammars. Probably the most common way of generating smaller parsing tables is Lookahead LR (LALR) [F. DeRemer 1969]. The LR(k) grammars for which this gives correct results are the LALR(k) grammars. The Yacc (“Yet Another Compiler Compiler”) parser generator and its descendants (GNU Bison, etc.), use LALR. 2020-02-19 CS F331 / CSCE A331 Spring 2020 9 Parsing Wrap-Up 2020-02-19 CS F331 / CSCE A331 Spring 2020 10 Parsing Wrap-Up Efficiency of Parsing [1/6] We have discussed parsing algorithms that can handle some, but not all, CFLs. We have not yet discussed their efficiency. This leads to several questions. § How fast are the parsing algorithms we have covered? § How fast are other practical parsing algorithms? § Are there parsing algorithms that can handle all CFLs? § If so, how fast are these algorithms? 2020-02-19 CS F331 / CSCE A331 Spring 2020 11 Parsing Wrap-Up Efficiency of Parsing [2/6] How do we determine the time efficiency of a parser? In order to express efficiency using asymptotic notation (big-O, Θ, etc.), we need to know three things. § How do we measure the size of the input? § What operations are allowed? § What operations do we count (basic operations)? We will use the following conventions. § The size of the input, denoted by n, is the number of symbols. So for our parsers, n is the number of lexemes. If lexical analysis is included, then n is the number of characters. § Allowed operations are retrieving the next input symbol and any internal-processing operations we want. § Basic operations are C operators on fundamental types and calls to client-code-provided functions (e.g., retrieve next lexeme). 2020-02-19 CS F331 / CSCE A331 Spring 2020 12 Parsing Wrap-Up Efficiency of Parsing [3/6] Based on these conventions: Practical parsers (and lexers) run in linear time. This includes state-machine lexers, Predictive Recursive-Descent parsers, and Shift-Reduce parsers. For state-machine lexers, a state cannot be repeated without advancing the input, and there are a fixed number of states. For Predictive Recursive-Descent parsers, similarly note that a parsing function cannot be called repeatedly without advancing the input, and there are a fixed number of parsing functions. For Shift-Reduce parsers, also note that, while a REDUCE is not constant-time, each stack item will be pushed and popped once. 2020-02-19 CS F331 / CSCE A331 Spring 2020 13 Parsing Wrap-Up Efficiency of Parsing [4/6] In the late 1960s and 1970s, parsing methods were found that can handle any CFL. These generally cannot handle arbitrary CFGs; the grammar may need to be transformed. § Earley Parser [J. Earley 1968] § CYK Parser [J. Cocke & J.T. Schwartz 1970, D.H. Younger 1967, T. Kasami 1965] § Generalized LR (GLR) Parser [B. Lang 1974, M. Tomita 1984] Note! All of these have a worst-case time of Θ(n3) for an arbitrary CFL. Some can have better performance for some CFLs. 2020-02-19 CS F331 / CSCE A331 Spring 2020 14 Parsing Wrap-Up Efficiency of Parsing [5/6] An interesting method is Valiant’s Algorithm [L.G. Valiant 1975]. This is an implementation of CYK using matrix multiplication. It can thus benefit from fast matrix-multiplication algorithms. Using Strassen’s matrix-multiplication algorithm [V. Strassen 1969], Valiant’s Algorithm parses any CFL in about Θ(n2.807355). Using Le Gall’s matrix-multiplication algorithm [F. Le Gall 2014], Valiant’s Algorithm parses any CFL in about Θ(n2.372864). But Le Gall’s only achieves speed-ups for extremely large matrices; this version of Valiant’s Algorithm is slow for realistic input. Again: Practical parsers (and lexers) run in linear time. 2020-02-19 CS F331 / CSCE A331 Spring 2020 15 Parsing Wrap-Up Efficiency of Parsing [6/6] Another method of interest that is actually used in practice is the Generalized LR (GLR) parser mentioned two slides back. A GLR parser uses a variant of the Shift-Reduce idea, but it allows for grammars that cannot be parsed by a deterministic automaton. It does this, is, roughly, by simulating a nondeterministic automaton. It allows for multiple possible actions for a particular state and symbol, and it tries all of them. As noted earlier, GLR parsing has a worst-case time of Θ(n3). But it is much faster for some grammars. As a result, GLR is considered to be a practical parsing method—but only for those grammars for which it is fast. GLR can easily handle some situations that are problematic for other parsing methods. For this reason GLR is increasingly used. 2020-02-19 CS F331 / CSCE A331 Spring 2020 16 Parsing Wrap-Up Parsing in Practice [1/3] Many parsing methods have been developed. Some of them are significantly different from the ones we have looked at. However, in my experience, the parser used in a production compiler is generally one of the following two kinds. § A hand-coded top-down parser using Recursive Descent—or something similar. § An automatically generated bottom-up parser using a table-based Shift-Reduce method—or something similar (usually LALR or GLR). So, while we have by no means looked at all known parsing methods, what we have covered should give you the flavor of the parsers that are used in production compilers. 2020-02-19 CS F331 / CSCE A331 Spring 2020 17 Parsing Wrap-Up Parsing in Practice [2/3] Producing a parser is a very practical skill. This might seem unlikely; as a member of a software-development team, it is unlikely you will write a compiler.