Recursive-Descent continued

CS F331 Programming Languages CSCE A331 Concepts Lecture Slides Friday, February 14, 2020

Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks [email protected]

© 2017–2020 Glenn G. Chappell

Review

2020-02-14 CS F331 / CSCE A331 Spring 2020 2 Review Overview of Lexing & Parsing

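Parsing consists of two phases:
§ Lexical analysis (lexing)
§ Syntax analysis (parsing)

[Figure: a character stream goes into the lexer, which produces a lexeme stream; the lexeme stream goes into the parser, which produces an AST or an error. Example character stream: return (*dp + 2.6); //x. Its lexemes are labeled by category (key, punct, op, id, num lit). The resulting AST is a returnStmt node whose subtree is binOp: + with children unOp: * (child id: dp) and numLit: 2.6.]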

The output of a parser is typically an abstract syntax tree (AST). Specifications of these will vary.

Review: The Basics of Syntax Analysis - Categories of Parsers

Parsing methods can be divided into two broad categories.

Top-Down Parsers
§ Go through the derivation top to bottom, expanding nonterminals.
§ Important subcategory: LL parsers (read input Left-to-right, produce Leftmost derivation).
§ Often hand-coded, but not always.
§ Method we look at: Predictive Recursive Descent.

Bottom-Up Parsers
§ Go through the derivation bottom to top, reducing substrings to nonterminals.
§ Important subcategory: LR parsers (read input Left-to-right, produce Rightmost derivation).
§ Almost always automatically generated.
§ Method we look at: Shift-Reduce.

Review: The Basics of Syntax Analysis - Categories of Grammars

LL(k) grammars: those usable by LL parsers that make decisions based on the next k lexemes. Here k is a number. Similarly, LR(k) grammars: those usable by LR parsers that make decisions based on the next k lexemes.

[Figure: diagram of grammar categories, with regions labeled All Grammars, CFGs, LR(1) Grammars, LL(1) Grammars, and Regular Grammars.]

Review: Recursive-Descent Parsing - Intro thru Example #2 [1/2]

Recursive Descent is a top-down parsing method. When we avoid backtracking: predictive. Predictive Recursive-Descent is an LL parsing method.

There is one parsing function for each nonterminal. This parses all strings that the nonterminal can be expanded into. Parsing-function code is a translation of the right-hand side of the production for the nonterminal.
§ A terminal in the right-hand side becomes a check that the input string contains the proper lexeme.
§ A nonterminal in the right-hand side becomes a call to its parsing function.
§ If that returns false, our function must return false: no backtracking.
§ Brackets [ … ] in the right-hand side become a conditional.
§ Braces { … } in the right-hand side become a loop.

See rdparser2.lua & userdparser2.lua.
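The translation rules above can be sketched on a small grammar. The grammar below, the one-character "lexemes", and the names recognize and current are assumptions made for this illustration; they are not taken from rdparser2.lua.

```lua
-- Hypothetical grammar (an assumption for this sketch):
--   args → "(" [ item { "," item } ] ")"
--   item → "x" | args
-- Each lexeme is a single character, so no separate lexer is needed.
local function recognize(text)
    local pos = 1                       -- index of the current lexeme

    local function current()
        return text:sub(pos, pos)
    end

    -- Terminal: check that the input contains the proper lexeme.
    local function matchString(s)
        if current() == s then
            pos = pos + 1
            return true
        end
        return false
    end

    local parse_args, parse_item       -- one parsing function per nonterminal

    function parse_item()
        if matchString("x") then
            return true
        end
        return parse_args()             -- nonterminal: call its function
    end

    function parse_args()
        if not matchString("(") then
            return false
        end
        -- Brackets [ ... ] become a conditional on the current lexeme.
        if current() == "x" or current() == "(" then
            if not parse_item() then
                return false            -- no backtracking
            end
            -- Braces { ... } become a loop.
            while matchString(",") do
                if not parse_item() then
                    return false
                end
            end
        end
        return matchString(")")
    end

    -- Accept only if the parse succeeds and all input was consumed.
    return parse_args() and pos > #text
end

print(recognize("(x,(x),())"))   -- true
print(recognize("(x,)"))         -- false
```

Each decision is made by looking only at the current lexeme, which is what makes this sketch predictive.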

Review: Recursive-Descent Parsing - Intro thru Example #2 [2/2]

A naively written parser may call some inputs correct when it cannot parse the entire input. Example: ((x))) Here the initial ((x)) is syntactically correct, and the final ) is extra stuff.

Two Solutions
§ Introduce a new end of input lexeme (standard symbol: $). Revise the grammar to include it.
§ After parsing, check whether the end of the input was reached.

Our parsers use the second solution.

See rdparser2.lua & userdparser2.lua.
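A minimal sketch of the second solution follows. The grammar and the two-value return (good, done) are assumptions chosen for illustration; this is not the code in rdparser2.lua.

```lua
-- Hypothetical grammar (an assumption for this sketch):
--   item → "x" | "(" item ")"
-- Each lexeme is a single character.
-- Returns two Booleans: was the parse correct, and was the end of
-- the input reached?
local function parse(text)
    local pos = 1                       -- index of the current lexeme

    local function matchString(s)
        if text:sub(pos, pos) == s then
            pos = pos + 1
            return true
        end
        return false
    end

    local function parse_item()
        if matchString("x") then
            return true
        elseif matchString("(") then
            if not parse_item() then
                return false
            end
            return matchString(")")
        end
        return false
    end

    local good = parse_item()
    local done = pos > #text            -- the end-of-input check
    return good, done
end

print(parse("((x))"))    -- true   true
print(parse("((x)))"))   -- true   false  (extra ")" left over)
print(parse("((x"))      -- false  true
```

The caller treats the input as syntactically correct only when both values are true.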

Review: Recursive-Descent Parsing - Example #3: Expressions [1/3]

We wish to parse arithmetic expressions in their usual form, with variables, numeric literals, binary +, -, *, and / operators, and parentheses. When given a syntactically correct expression, our parser should return an abstract syntax tree (AST).

All operators will be binary and left-associative: “a + b + c” means “(a + b) + c”. Precedence will be as usual: “a + b * c” means “a + (b * c)”. These may be overridden using parentheses: “(a + b) * c”.

Due to the limitations of our in-class lexer, the expression “k-4” will need to be written as “k - 4”.

Review: Recursive-Descent Parsing - Example #3: Expressions [2/3]

We started with Grammar 3, which has start symbol expr. This grammar encodes our associativity and precedence rules, and it allows us to use parentheses to override them.

Grammar 3
    expr   → term
           | expr ( “+” | “-” ) term
    term   → factor
           | term ( “*” | “/” ) factor
    factor → ID
           | NUMLIT
           | “(” expr “)”

But Grammar 3 is not an LL(1) grammar, so we cannot use it. We see the problem if we try to write a Recursive-Descent parser.

Review: Recursive-Descent Parsing - Example #3: Expressions [3/3]

Grammar 3
    expr   → term
           | expr ( “+” | “-” ) term
    term   → factor
           | term ( “*” | “/” ) factor
    factor → ID
           | NUMLIT
           | “(” expr “)”

    function parse_expr()
        if parse_term() then
            … -- Construct AST
            return true
        elseif parse_expr() then
            …

The above code has problems.
§ First, we are not supposed to be doing backtracking; if the call to parse_term returns false, then we must return false.
§ But even if we solve that, there is another problem. If parse_expr is called with input that does not begin with a valid term, then we get infinite recursion.
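The infinite recursion can be made visible with a small experiment. In this sketch, parse_term is a stub that always fails (standing in for input that does not begin with a valid term), and MAX_DEPTH is a hypothetical guard added only so that the demonstration terminates.

```lua
local MAX_DEPTH = 50     -- hypothetical guard; a real parser has no such limit
local depth = 0          -- number of calls to parse_expr so far
local pos = 1            -- input position; nothing below ever advances it

-- Stub: pretend the input does not begin with a valid term.
local function parse_term()
    return false
end

local function parse_expr()
    depth = depth + 1
    if depth > MAX_DEPTH then
        error("parse_expr is still being called at input position " .. pos)
    end
    if parse_term() then
        return true
    elseif parse_expr() then     -- left recursion: same function, same position
        return true
    end
    return false
end

local ok, message = pcall(parse_expr)
print(ok)        -- false
print(depth)     -- 51
```

Each recursive call sees exactly the same input position as its caller, so nothing can ever change the outcome; without the guard the calls would continue until the stack is exhausted.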

Review: Recursive-Descent Parsing - LL(1) Grammars

An LL(1) grammar is a CFG that can be handled by an LL parsing method—such as Predictive Recursive Descent—without lookahead. The name comes from the fact that these parsers handle their input in a Left-to-right order, and they go through the steps required to generate a Leftmost derivation.

When there is a choice to be made, an LL parser without lookahead must be able to make that choice based on the current lexeme. If this cannot be done, then the grammar is not LL(1).

Review: Recursive-Descent Parsing - Transforming Grammars

Sometimes, we can transform a non-LL(1) grammar into an LL(1) grammar that generates the same language.

For example, here is Grammar C, which is not LL(1), along with an LL(1) grammar that generates the same language.

Grammar C
    xx → yy | zz
    yy → “” | “a”
    zz → “” | “b”

Grammar Ca
    xx → yy | zz | “”
    yy → “a”
    zz → “b”
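In Grammar C, both yy and zz can expand to the empty string, so the current lexeme cannot settle which one to choose. In Grammar Ca, each choice is determined by the current lexeme. A minimal recognizer sketch for Grammar Ca, treating each character as a lexeme (the function name recognize_xx is an assumption for this illustration):

```lua
-- Recognizer for Grammar Ca:
--   xx → yy | zz | ""
--   yy → "a"
--   zz → "b"
local function recognize_xx(text)
    local pos = 1                       -- index of the current lexeme

    local function matchString(s)
        if text:sub(pos, pos) == s then
            pos = pos + 1
            return true
        end
        return false
    end

    local function parse_yy()
        return matchString("a")
    end

    local function parse_zz()
        return matchString("b")
    end

    -- The choice among the three right-hand sides is made by looking
    -- only at the current lexeme.
    local current = text:sub(pos, pos)
    local good
    if current == "a" then
        good = parse_yy()
    elseif current == "b" then
        good = parse_zz()
    else
        good = true                     -- xx → ""
    end

    return good and pos > #text         -- also require end of input
end

print(recognize_xx(""))     -- true
print(recognize_xx("a"))    -- true
print(recognize_xx("ab"))   -- false
```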

Review: Recursive-Descent Parsing - Back to Ex. #3 (Left-Associativity)

Grammar 3b, below, is what we want. It works with a Predictive Recursive-Descent parser, and we can use it to parse left-associative binary operators.

Grammar 3b
    expr   → term { ( “+” | “-” ) term }
    term   → factor { ( “*” | “/” ) factor }
    factor → ID
           | NUMLIT
           | “(” expr “)”

    function parse_expr()
        if not parse_term() then
            return false
        end
        while matchString("+") or matchString("-") do
            if not parse_term() then
                return false
            end
        end
        return true
    end

Now, what about generating an AST?

Recursive-Descent Parsing continued

Recursive-Descent Parsing: Back to Example #3: Expressions - ASTs [1/4]

We want to write a parser that returns an abstract syntax tree (AST).

First we review some rooted-tree terminology.

A node with one or more children (that is, a node that is not a leaf) is an internal node. The tree below has two internal nodes, labeled A and C.

        A
       / \
      B   C
         / \
        D   E

The subtrees of an internal node are those rooted at each of its children. In the tree above, the two subtrees of the A node are the single node B and the tree rooted at C (containing C, D, and E).

Recursive-Descent Parsing: Back to Example #3: Expressions - ASTs [2/4]

Recall that a parse tree, or concrete syntax tree, includes one leaf node for each lexeme in the input, and also one node for each nonterminal in the derivation.

Below is the parse tree for a + 2, based on Grammar 3b. This is not new! It is the same kind of tree we drew before.

           expr
         /  |  \
     term   +   term
       |          |
    factor     factor
       |          |
       a          2

Recursive-Descent Parsing: Back to Example #3: Expressions - ASTs [3/4]

In an abstract syntax tree (AST), each internal node represents an operation. Its subtrees are the ASTs for the entities the operation is applied to. First is our parse tree for a + 2. Second is an AST for the same expression. The AST is new.

Parse Tree

           expr
         /  |  \
     term   +   term
       |          |
    factor     factor
       |          |
       a          2

Abstract Syntax Tree

        +
       / \
      a   2

Above, operation is broadly defined. For example, an AST node might represent the operation “execute all these statements, in order”. Its subtrees would be ASTs for the statements.

Recursive-Descent Parsing: Back to Example #3: Expressions - ASTs [4/4]

The subtrees of an internal node in an AST are ASTs themselves. So we can build an AST from smaller ASTs in the same way we build an expression from smaller expressions.

AST for a + 2

        +
       / \
      a   2

AST for (a + 2) * b

          *
         / \
        +   b
       / \
      a   2

Notes
§ ASTs omit portions of syntax that only serve to guide the parser, like semicolons, parentheses, commas, extra nonterminals.
§ There is no universal specification for an AST. Instead the ASTs used in a software project are specified so as to best meet the needs of that project.

Recursive-Descent Parsing: Back to Example #3: Expressions - Representing ASTs [1/7]

First AST (for a + 2)

        +
       / \
      a   2

Second AST (for (a + 2) * b)

          *
         / \
        +   b
       / \
      a   2

We need to represent ASTs like these in Lua.
§ Represent a single node by the string form of its lexeme.
§ If there is more than one node in an AST, then represent it as an array whose first item represents the root, while each remaining item represents one of the subtrees of the root, in order.

The first AST, in Lua: { "+", "a", "2" }

TRY IT: What is the Lua representation of the second AST?


The second AST, in Lua: { "*", { "+", "a", "2" }, "b" }
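To see how this representation can be walked, here is a small sketch that turns an AST back into a fully parenthesized expression. The function name astToString is an assumption for this illustration, and the sketch assumes every array node has an operator followed by exactly two subtrees, as in the examples above.

```lua
-- Convert an AST in the representation above to a string.
-- A single node is a string; anything else is an array
-- { operator, left subtree, right subtree }.
local function astToString(ast)
    if type(ast) == "string" then
        return ast
    end
    local op, left, right = ast[1], ast[2], ast[3]
    return "(" .. astToString(left) .. " " .. op .. " "
               .. astToString(right) .. ")"
end

print(astToString({ "+", "a", "2" }))                  -- (a + 2)
print(astToString({ "*", { "+", "a", "2" }, "b" }))    -- ((a + 2) * b)
```

The parentheses in the output are recomputed from the tree structure; the AST itself does not store them.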

Recursive-Descent Parsing: Back to Example #3: Expressions - Representing ASTs [3/7]

It is better to describe our ASTs in a way that does not require tree drawings. So we specify the format of an AST for each line in our grammar. (Lines are numbered so we can refer to them.)

Grammar 3b
    1. expr   → term { ( “+” | “-” ) term }
    2. term   → factor { ( “*” | “/” ) factor }
    3. factor → ID
    4.        | NUMLIT
    5.        | “(” expr “)”

1. If there is only a term, then the AST for the expr is the AST for the term. Otherwise, the AST is { OO, AAA, BBB }, where OO is the string form of the last operator, AAA is the AST for everything before it, and BBB is the AST for the last term.
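Rule 1 amounts to folding the terms together from left to right. A minimal sketch, assuming the ASTs for the terms and the operators between them have already been collected into two arrays (the function name buildLeftAssoc and this two-array interface are assumptions for illustration; a parser would normally build the AST inside its loop):

```lua
-- Build a left-associative AST.
-- operands:  array of ASTs, e.g. { "a", "b", "c" }
-- operators: array of the operator strings between them,
--            e.g. { "-", "+" } for a - b + c
local function buildLeftAssoc(operands, operators)
    local ast = operands[1]             -- only a term: AST is that term's AST
    for i, op in ipairs(operators) do
        -- OO is the latest operator, AAA is everything before it,
        -- BBB is the AST for the latest term.
        ast = { op, ast, operands[i + 1] }
    end
    return ast
end

local ast = buildLeftAssoc({ "a", "b", "c" }, { "-", "+" })
-- ast is now { "+", { "-", "a", "b" }, "c" }, that is, (a - b) + c
```

After the final iteration, the root holds the last operator, as rule 1 requires.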

Recursive-Descent Parsing: Back to Example #3: Expressions - Representing ASTs [4/7]

A term is handled similarly.


2. If there is only a factor, then the AST for the term is the AST for the factor. Otherwise, the AST is { OO, AAA, BBB }, where OO is the string form of the last operator, AAA is the AST for everything before it, and BBB is the AST for the last factor.

Recursive-Descent Parsing: Back to Example #3: Expressions - Representing ASTs [5/7]

A factor has multiple options.

3. AST for the factor: string form of the ID.
4. AST for the factor: string form of the NUMLIT.
5. AST for the factor: AST for the expr.

Recursive-Descent Parsing: Back to Example #3: Expressions - Representing ASTs [6/7]

Applying the various rules, the AST for (a + 2) * b is

{ "*", { "+", "a", "2" }, "b" }

Each parsing function can now return a pair: a Boolean and an AST. The Boolean indicates a correct parse, as before. The AST is only valid if the Boolean is true, in which case it will be in the specified format.
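The following is a sketch of such a parser for Grammar 3b, under stated assumptions: the tiny lex function is a stand-in for the in-class lexer, the top-level parse function and its three return values (good, done, ast) are choices made for this illustration, and none of this is the code in rdparser3.lua.

```lua
-- Stand-in lexer (an assumption): splits text into identifiers,
-- numeric literals, and single-character lexemes, skipping whitespace.
local function lex(text)
    local lexemes = {}
    local pos = 1
    while pos <= #text do
        local space = text:match("^%s+", pos)
        if space then
            pos = pos + #space
        else
            local lexeme = text:match("^[%a_][%w_]*", pos)   -- ID
                        or text:match("^%d+%.?%d*", pos)     -- NUMLIT
                        or text:sub(pos, pos)                -- other
            lexemes[#lexemes + 1] = lexeme
            pos = pos + #lexeme
        end
    end
    return lexemes
end

-- Returns: good (correct parse?), done (end of input reached?), ast.
local function parse(text)
    local lexemes = lex(text)
    local pos = 1                       -- index of the current lexeme

    local function matchString(s)
        if lexemes[pos] == s then
            pos = pos + 1
            return true
        end
        return false
    end

    local parse_expr, parse_term, parse_factor

    function parse_factor()
        local lexeme = lexemes[pos]
        if lexeme and (lexeme:match("^[%a_]") or lexeme:match("^%d")) then
            pos = pos + 1
            return true, lexeme         -- rules 3 and 4
        elseif matchString("(") then
            local good, ast = parse_expr()
            if not good or not matchString(")") then
                return false, nil
            end
            return true, ast            -- rule 5
        end
        return false, nil
    end

    function parse_term()
        local good, ast = parse_factor()
        if not good then return false, nil end
        while true do
            local op = lexemes[pos]
            if not (matchString("*") or matchString("/")) then break end
            local good2, right = parse_factor()
            if not good2 then return false, nil end
            ast = { op, ast, right }    -- rule 2
        end
        return true, ast
    end

    function parse_expr()
        local good, ast = parse_term()
        if not good then return false, nil end
        while true do
            local op = lexemes[pos]
            if not (matchString("+") or matchString("-")) then break end
            local good2, right = parse_term()
            if not good2 then return false, nil end
            ast = { op, ast, right }    -- rule 1
        end
        return true, ast
    end

    local good, ast = parse_expr()
    return good, pos > #lexemes, ast
end

local good, done, ast = parse("(a + 2) * b")
-- good and done are true; ast is { "*", { "+", "a", "2" }, "b" }
```

As in the text, the returned AST should be used only when good is true, and the input is fully correct only when done is also true.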

Recursive-Descent Parsing: Back to Example #3: Expressions - Representing ASTs [7/7]

Grammar 3b
    1. expr   → term { ( “+” | “-” ) term }
    2. term   → factor { ( “*” | “/” ) factor }
    3. factor → ID
    4.        | NUMLIT
    5.        | “(” expr “)”

1. If there is only a term, then the AST for the expr is the AST for the term. Otherwise, the AST is { OO, AAA, BBB }, where OO is the string form of the last operator, AAA is the AST for everything before it, and BBB is the AST for the last term.
2. Similar to (1).
3. AST for the factor: string form of the ID.
4. AST for the factor: string form of the NUMLIT.
5. AST for the factor: AST for the expr.

TO DO
§ Based on Grammar 3b, write a Predictive Recursive-Descent parser that produces an AST, as described.

Done. See rdparser3.lua & userdparser3.lua.

Recursive-Descent Parsing: Example #4: Better ASTs [1/4]

The ASTs we have specified are not quite what we need.

We need to know whether each node represents an operator, a variable identifier, etc. The lexer already figured this out, and it told the parser, but this information is not in the AST.

And there is other information we can store. For example, in many PLs—including the one you will write a parser for—a minus sign can be a binary operator (a - b) or a unary operator (-x). The lexer does not know which it is. But the parser knows, and it can include this information in the AST.

Recursive-Descent Parsing: Example #4: Better ASTs [2/4]

Let’s mark each node in the AST, indicating the kind of entity it represents. We have three kinds of nodes: binary operators, simple variables, and numeric literals.

Unmarked AST

          *
         / \
        +   b
       / \
      a   2

Marked AST

                  binOp: *
                 /        \
          binOp: +        simpleVar: b
         /        \
  simpleVar: a    numLit: 2

Recursive-Descent Parsing: Example #4: Better ASTs [3/4]

In the Lua form of our AST, we can replace each string with a two-item array. The first item in the array will be one of three constants: BIN_OP, SIMPLE_VAR, or NUMLIT_VAL. The second item will be the string form of the lexeme.

    "/"    →  { BIN_OP, "/" }
    "abc"  →  { SIMPLE_VAR, "abc" }
    "123"  →  { NUMLIT_VAL, "123" }

So the AST for a + 2 changes as shown below.

    Before: { "+", "a", "2" }
    After:  {{ BIN_OP, "+" }, { SIMPLE_VAR, "a" }, { NUMLIT_VAL, "2" }}
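A short sketch of how code that receives such an AST might use the markings. The numeric values given to the three constants, the KIND_NAME table, and the function describe are assumptions for this illustration; the text above does not say what values the constants have.

```lua
-- Assumed values; only the names come from the text above.
local BIN_OP, SIMPLE_VAR, NUMLIT_VAL = 1, 2, 3

-- Hypothetical table giving a printable label for each kind of node.
local KIND_NAME = {
    [BIN_OP]     = "binOp",
    [SIMPLE_VAR] = "simpleVar",
    [NUMLIT_VAL] = "numLit",
}

-- Produce a one-line description of a marked AST.
local function describe(ast)
    if type(ast[1]) == "number" then
        -- A single node: { KIND, "lexeme" }
        return KIND_NAME[ast[1]] .. ": " .. ast[2]
    end
    -- Otherwise: a root node followed by its subtrees.
    local parts = {}
    for i = 2, #ast do
        parts[#parts + 1] = describe(ast[i])
    end
    return describe(ast[1]) .. " (" .. table.concat(parts, ", ") .. ")"
end

local ast = {{ BIN_OP, "+" }, { SIMPLE_VAR, "a" }, { NUMLIT_VAL, "2" }}
print(describe(ast))   -- binOp: + (simpleVar: a, numLit: 2)
```

Because each node carries its kind, the code never has to inspect the lexeme string to guess whether it is a variable or a literal.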

Recursive-Descent Parsing: Example #4: Better ASTs [4/4]

TO DO
§ Rewrite the parser based on Grammar 3b so that it constructs and returns these improved ASTs.

Done. See rdparser4.lua & userdparser4.lua.
