Parsing Wrap-Up
Thoughts on Assignment 4
PL Feature: Type System

CS F331 Programming Languages
CSCE A331 Programming Language Concepts
Lecture Slides
Wednesday, February 19, 2020

Glenn G. Chappell
Department of Computer Science
University of Alaska Fairbanks

© 2017–2020 Glenn G. Chappell

Review

2020-02-19 CS F331 / CSCE A331 Spring 2020

Review
Overview of Lexing & Parsing

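Two phases:
§ Lexical analysis (lexing)
§ Syntax analysis (parsing)

[Diagram: Parsing. A character stream enters the lexer, which produces a lexeme stream; the parser turns the lexeme stream into an AST or an error. Example: the characters return (*dp + 2.6); //x become lexemes labeled key, op, id, num lit, and punct, which are parsed into the AST returnStmt with child binOp: +, whose children are unOp: * (with child id: dp) and numLit: 2.6.]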

The output of a parser is typically an abstract syntax tree (AST). Specifications of these will vary.

Review
The Basics of Syntax Analysis: Categories of Parsers

Parsing methods can be divided into two broad categories.

Top-Down Parsers
§ Go through the derivation top to bottom, expanding nonterminals.
§ Important subcategory: LL parsers (read input Left-to-right, produce Leftmost derivation).
§ Often hand-coded, but not always.
§ Method we look at: Predictive Recursive Descent.

Bottom-Up Parsers
§ Go through the derivation bottom to top, reducing substrings to nonterminals.
§ Important subcategory: LR parsers (read input Left-to-right, produce Rightmost derivation).
§ Almost always automatically generated.
§ Method we look at: Shift-Reduce.

Review
The Basics of Syntax Analysis: Categories of Grammars

LL(k) grammars: those usable by LL parsers that make decisions based on the next k lexemes. Here k is a number.

Similarly, LR(k) grammars: those usable by LR parsers that make decisions based on the next k lexemes.

[Diagram: regions labeled All Grammars, CFGs, LR(1) Grammars, LL(1) Grammars, and Regular Grammars, showing how these categories of grammars relate.]

Review
Recursive-Descent Parsing [1/2]

Recursive Descent is a top-down parsing method.
§ When we avoid backtracking: predictive. Predictive Recursive-Descent is an LL parsing method.
§ There is one parsing function for each nonterminal. This parses all strings that the nonterminal can be expanded into.

The natural grammar for expressions with left-associative binary operators is not LL(1). But we can transform it so that it is usable by a predictive recursive-descent parser.

Not Usable (left recursion): e → t | e ( “+” | “-” ) t

Usable: e → t { ( “+” | “-” ) t }
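The usable form of the grammar maps directly onto a loop in the parsing function for e. The following is a minimal C++ sketch, not the course's Lua code. The class name ExprParser is hypothetical, t is assumed to be an unsigned integer literal, and the "AST" is simply the computed value, so that left associativity is easy to see.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical mini-parser for the transformed grammar
//   e -> t { ( "+" | "-" ) t }
// Assumptions for brevity: t is an unsigned integer literal, there
// is no whitespace, and the "AST" is just the computed value.
class ExprParser {
public:
    explicit ExprParser(const std::string & input)
        : in_(input), pos_(0) {}

    // Returns {success flag, value}. Success requires that the
    // whole input was consumed.
    std::pair<bool, long> parse() {
        long value = 0;
        bool good = parse_e(value) && pos_ == in_.size();
        return std::make_pair(good, value);
    }

private:
    // e -> t { ( "+" | "-" ) t }
    // The loop folds each new term into the running result, which
    // gives left associativity without left recursion.
    bool parse_e(long & value) {
        if (!parse_t(value)) return false;
        while (pos_ < in_.size() &&
               (in_[pos_] == '+' || in_[pos_] == '-')) {
            char op = in_[pos_++];
            long rhs = 0;
            if (!parse_t(rhs)) return false;
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return true;
    }

    // t -> digit { digit }   (assumption made for this sketch)
    bool parse_t(long & value) {
        if (pos_ >= in_.size() ||
            !std::isdigit(static_cast<unsigned char>(in_[pos_])))
            return false;
        value = 0;
        while (pos_ < in_.size() &&
               std::isdigit(static_cast<unsigned char>(in_[pos_])))
            value = value * 10 + (in_[pos_++] - '0');
        return true;
    }

    std::string in_;
    std::size_t pos_;
};
```

With this sketch, ExprParser("10-4-3").parse() yields a true flag and the value 3, that is, (10-4)-3 rather than 10-(4-3).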

Review
Recursive-Descent Parsing [2/2]

On a correct parse, a parser typically returns an abstract syntax tree (AST). We specify the format of an AST for each line in our grammar. It is helpful to include information in the AST telling what kind of entity each node represents.

Expression: a + 2

AST (diagram):

    binOp: +
        simpleVar: a
        numLit: 2

AST (Lua): {{ BIN_OP, "+" }, { SIMPLE_VAR, "a" }, { NUMLIT_VAL, "2" }}

See rdparser4.lua.

Review
Shift-Reduce Parsing [1/2]

We looked at a class of table-based bottom-up parsing algorithms.

Parser execution uses a state machine with a stack, called a Shift-Reduce parser. Each stack item holds a symbol and a state.

[Diagram: an example stack whose items pair symbols, such as expr and ID (a), with state numbers.]

Parser operation is specified by a parsing table in two parts (action table & goto table), constructed before execution. At each time step a lookup in the action table is done. One of four operations will be specified.
§ SHIFT. Shift the next input symbol onto the stack.
§ REDUCE. Apply a production, reducing a string of symbols on the stack to a single nonterminal. See the goto table for the new state.
§ ACCEPT. Done, syntactically correct.
§ ERROR. Done, syntactically incorrect.
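The four operations can be illustrated with a tiny recognizer. The following C++ sketch is built on assumptions: the toy grammar, the state numbering, and the function name shift_reduce_accepts are made up for illustration, and the table was worked out by hand rather than produced by a parser generator. The action table is written as a chain of if-statements, and the stack holds only states.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical hand-built LR parsing table for the toy grammar
//   (1) E -> E "+" "n"
//   (2) E -> "n"
// Input symbols are the characters 'n' and '+'; '$' is appended to
// mark end of input. For brevity the stack holds only states, not
// (symbol, state) pairs.
bool shift_reduce_accepts(const std::string & input) {
    if (input.find('$') != std::string::npos)
        return false;                    // '$' is reserved
    std::string in = input + '$';
    std::vector<int> stack(1, 0);        // Start in state 0
    std::size_t pos = 0;

    while (true) {
        int state = stack.back();
        char sym = in[pos];

        // Action table, written as code
        if (state == 0 && sym == 'n') {          // SHIFT, to state 1
            stack.push_back(1); ++pos;
        } else if (state == 2 && sym == '+') {   // SHIFT, to state 3
            stack.push_back(3); ++pos;
        } else if (state == 3 && sym == 'n') {   // SHIFT, to state 4
            stack.push_back(4); ++pos;
        } else if (state == 1 && (sym == '+' || sym == '$')) {
            // REDUCE by rule 2 (E -> n): pop 1 item.
            stack.pop_back();
            // Goto table: state 0 is now on top; (0, E) -> 2
            stack.push_back(2);
        } else if (state == 4 && (sym == '+' || sym == '$')) {
            // REDUCE by rule 1 (E -> E + n): pop 3 items.
            stack.resize(stack.size() - 3);
            // Goto table: state 0 is now on top; (0, E) -> 2
            stack.push_back(2);
        } else if (state == 2 && sym == '$') {
            return true;                         // ACCEPT
        } else {
            return false;                        // ERROR
        }
    }
}
```

For this grammar the goto table has a single entry, because state 0 is always on top of the stack after a REDUCE. Realistic grammars need a full goto table, indexed by state and nonterminal.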

Review
Shift-Reduce Parsing [2/2]

In its most general form, Shift-Reduce parsing is an LR parsing method that can handle all LR(1) grammars. But generating Shift-Reduce parsing tables in a straightforward way tends to result in far too many states. Various ways of generating smaller tables are known, but these do not work for all LR(1) grammars.

Probably the most common way of generating smaller parsing tables is Lookahead LR (LALR) [F. DeRemer 1969]. The LR(k) grammars for which this gives correct results are the LALR(k) grammars.

The Yacc (“Yet Another Compiler-Compiler”) parser generator and its descendants (GNU Bison, etc.) use LALR.

Parsing Wrap-Up

Parsing Wrap-Up
Efficiency of Parsing [1/6]

We have discussed parsing algorithms that can handle some, but not all, CFLs. We have not yet discussed their efficiency. This leads to several questions.
§ How fast are the parsing algorithms we have covered?
§ How fast are other practical parsing algorithms?
§ Are there parsing algorithms that can handle all CFLs?
§ If so, how fast are these algorithms?

Parsing Wrap-Up
Efficiency of Parsing [2/6]

How do we determine the time efficiency of a parser?

In order to express efficiency using asymptotic notation (big-O, Θ, etc.), we need to know three things.
§ How do we measure the size of the input?
§ What operations are allowed?
§ What operations do we count (basic operations)?

We will use the following conventions.
§ The size of the input, denoted by n, is the number of symbols. So for our parsers, n is the number of lexemes. If lexical analysis is included, then n is the number of characters.
§ Allowed operations are retrieving the next input symbol and any internal-processing operations we want.
§ Basic operations are operators on fundamental types and calls to client-code-provided functions (e.g., retrieve next lexeme).

Parsing Wrap-Up
Efficiency of Parsing [3/6]

Based on these conventions:

Practical parsers (and lexers) run in linear time.

This includes state-machine lexers, Predictive Recursive-Descent parsers, and Shift-Reduce parsers.

For state-machine lexers, a state cannot be repeated without advancing the input, and there are a fixed number of states. For Predictive Recursive-Descent parsers, similarly note that a parsing function cannot be called repeatedly without advancing the input, and there are a fixed number of parsing functions. For Shift-Reduce parsers, also note that, while a REDUCE is not constant-time, each stack item will be pushed and popped once.

Parsing Wrap-Up
Efficiency of Parsing [4/6]

In the late 1960s and 1970s, parsing methods were found that can handle any CFL. These generally cannot handle arbitrary CFGs; the grammar may need to be transformed.
§ Earley Parser [J. Earley 1968]
§ CYK Parser [J. Cocke & J.T. Schwartz 1970, D.H. Younger 1967, T. Kasami 1965]
§ Generalized LR (GLR) Parser [B. Lang 1974, M. Tomita 1984]

Note! All of these have a worst-case time of Θ(n^3) for an arbitrary CFL. Some can have better performance for some CFLs.
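To see where the cubic time comes from, here is a minimal C++ sketch of a CYK recognizer. The grammar (nonempty balanced parentheses, already transformed to Chomsky normal form) and the names cyk_accepts and BinaryRule are assumptions chosen for illustration. The loops over substring length, start position, and split point are what give the Θ(n^3) behavior for a fixed grammar.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// CYK recognizer for a hypothetical grammar in Chomsky normal form
// that generates nonempty balanced strings of parentheses:
//   S -> A B | A C | S S      C -> S B      A -> "("      B -> ")"
enum { S, A, B, C, NUM_NT };

struct BinaryRule { int lhs, left, right; };

bool cyk_accepts(const std::string & w) {
    static const BinaryRule rules[] = {
        {S, A, B}, {S, A, C}, {S, S, S}, {C, S, B}
    };
    const std::size_t n = w.size();
    if (n == 0) return false;   // This grammar has no empty string

    // table[len-1][i][X] is true if nonterminal X derives the
    // substring of length len that starts at position i.
    std::vector<std::vector<std::vector<bool> > > table(
        n, std::vector<std::vector<bool> >(
            n, std::vector<bool>(NUM_NT, false)));

    // Substrings of length 1: terminal rules
    for (std::size_t i = 0; i < n; ++i) {
        if (w[i] == '(') table[0][i][A] = true;
        else if (w[i] == ')') table[0][i][B] = true;
        else return false;      // Symbol not in the alphabet
    }

    // Longer substrings: try every split point and binary rule.
    for (std::size_t len = 2; len <= n; ++len)
        for (std::size_t i = 0; i + len <= n; ++i)
            for (std::size_t split = 1; split < len; ++split)
                for (const BinaryRule & r : rules)
                    if (table[split - 1][i][r.left] &&
                        table[len - split - 1][i + split][r.right])
                        table[len - 1][i][r.lhs] = true;

    return table[n - 1][0][S];
}
```

Note that the grammar had to be put in a special form first, which matches the remark above that these methods generally cannot handle arbitrary CFGs directly.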

Parsing Wrap-Up
Efficiency of Parsing [5/6]

An interesting method is Valiant’s Algorithm [L.G. Valiant 1975]. This is an implementation of CYK using matrix multiplication. It can thus benefit from fast matrix-multiplication algorithms.

Using Strassen’s matrix-multiplication algorithm [V. Strassen 1969], Valiant’s Algorithm parses any CFL in about Θ(n^2.807355).
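The exponent 2.807355 is log base 2 of 7. As a brief sketch of where it comes from: Strassen's algorithm multiplies two n × n matrices using 7 multiplications of half-size matrices plus Θ(n^2) work for additions, which gives the following recurrence and solution.

```latex
T(n) = 7\,T(n/2) + \Theta(n^2)
\quad\Longrightarrow\quad
T(n) = \Theta\!\left(n^{\log_2 7}\right),
\qquad \log_2 7 \approx 2.807355 .
```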

Using Le Gall’s matrix-multiplication algorithm [F. Le Gall 2014], Valiant’s Algorithm parses any CFL in about Θ(n^2.372864). But Le Gall’s only achieves speed-ups for extremely large matrices; this version of Valiant’s Algorithm is slow for realistic input.

Again: Practical parsers (and lexers) run in linear time.

Parsing Wrap-Up
Efficiency of Parsing [6/6]

Another method of interest that is actually used in practice is the Generalized LR (GLR) parser mentioned two slides back.

A GLR parser uses a variant of the Shift-Reduce idea, but it allows for grammars that cannot be parsed by a deterministic automaton. It does this, roughly, by simulating a nondeterministic automaton. It allows for multiple possible actions for a particular state and symbol, and it tries all of them.

As noted earlier, GLR parsing has a worst-case time of Θ(n^3). But it is much faster for some grammars. As a result, GLR is considered to be a practical parsing method, but only for those grammars for which it is fast. GLR can easily handle some situations that are problematic for other parsing methods. For this reason GLR is increasingly used.

Parsing Wrap-Up
Parsing in Practice [1/3]

Many parsing methods have been developed. Some of them are significantly different from the ones we have looked at.

However, in my experience, the parser used in a production compiler is generally one of the following two kinds.
§ A hand-coded top-down parser using Recursive Descent, or something similar.
§ An automatically generated bottom-up parser using a table-based Shift-Reduce method, or something similar (usually LALR or GLR).

So, while we have by no means looked at all known parsing methods, what we have covered should give you the flavor of the parsers that are used in production.

Parsing Wrap-Up
Parsing in Practice [2/3]

Producing a parser is a very practical skill. This might seem unlikely; as a member of a software-development team, it is unlikely you will write a compiler.

But what is parsing? We have defined it: parsing is determining whether input is syntactically correct and, if so, finding its structure.

However, there is another way of looking at it:

Parsing is making sense of input.

And that is something that computer programs need to do a lot.

Parsing Wrap-Up
Parsing in Practice [3/3]

Lastly, writing a parser is generally not a terribly difficult task.

So knowing how to produce a parser can be a useful addition to your personal toolbox.

Thoughts on Assignment 4

Thoughts on Assignment 4
Introduction

In Assignment 4 you will be writing a Recursive-Descent parser, in the form of Lua module parseit. It will be similar to module rdparser4 (written in class), but it will involve a different grammar & AST specification.

In Assignment 6 you will write an interpreter that takes, as input, an AST of the form your parser returns. Your lexer, parser, and interpreter together will form an implementation of a programming language called Degu.

Module parseit is to parse a complete programming language, while module rdparser4 only parses expressions. Nonetheless, Degu has expressions, and they work in much the same way as those handled by rdparser4. I suggest using file rdparser4.lua as a starting point for Assignment 4.

Thoughts on Assignment 4
The Goal

Here again is a sample Degu program.

    # fibo
    # Parameter is in variable n.
    # Return Fibonacci number F(n).
    func fibo()
        curr_fibo = 0
        next_fibo = 1
        i = 0  # Loop counter
        while i < n
            tmp = curr_fibo + next_fibo
            curr_fibo = next_fibo
            next_fibo = tmp
            i = i+1
        end
        return curr_fibo
    end

    # Main program
    # Print some Fibonacci numbers
    how_many_to_print = 20
    print("Fibonacci Numbers\n")
    j = 0  # Loop counter
    while j < how_many_to_print
        n = j  # Set parameter for fibo
        print("F(",j,") = ",fibo(),"\n")
        j = j+1
    end

Thoughts on Assignment 4
The Grammar

 1. program    → stmt_list
 2. stmt_list  → { statement }
 3. statement  → ‘print’ ‘(’ [ print_arg { ‘,’ print_arg } ] ‘)’
 4.            | ‘func’ ID ‘(’ ‘)’ stmt_list ‘end’
 5.            | ‘if’ expr stmt_list { ‘elif’ expr stmt_list } [ ‘else’ stmt_list ] ‘end’
 6.            | ‘while’ expr stmt_list ‘end’
 7.            | ‘return’ expr
 8.            | ID ( ‘(’ ‘)’ | [ ‘[’ expr ‘]’ ] ‘=’ expr )
 9. print_arg  → STRLIT
10.            | ‘char’ ‘(’ expr ‘)’
11.            | expr
12. expr       → comp_expr { ( ‘and’ | ‘or’ ) comp_expr }
13. comp_expr  → arith_expr { ( ‘==’ | ‘!=’ | ‘<’ | ‘<=’ | ‘>’ | ‘>=’ ) arith_expr }
14. arith_expr → term { ( ‘+’ | ‘-’ ) term }
15. term       → factor { ( ‘*’ | ‘/’ | ‘%’ ) factor }
16. factor     → ‘(’ expr ‘)’
17.            | ( ‘+’ | ‘-’ | ‘not’ ) factor
18.            | NUMLIT
19.            | ( ‘true’ | ‘false’ )
20.            | ‘input’ ‘(’ ‘)’
21.            | ID [ ‘(’ ‘)’ | ‘[’ expr ‘]’ ]

Thoughts on Assignment 4
Common Mistakes [1/3]

The most common mistake I made when writing module parseit was to return only one value from a parsing function.

A parsing function will always return two values: boolean & AST.
§ If the boolean is true, then the AST needs to be in the proper form.
§ If the boolean is false, then the AST can be anything (it might as well be nil).

Thoughts on Assignment 4
Common Mistakes [2/3]

A mistake I have seen many students make on an assignment of this kind involves failing to “trust” their parsing functions in some way.

Remember:

If you have a function that does something, and you need to do that thing, then call the function.

So:
§ If you need something parsed, then call the appropriate parsing function, and trust that function to do it right.
§ Do not try to do a function’s work for it.
§ Do not rewrite something that has already been written.

Thoughts on Assignment 4
Common Mistakes [3/3]

Three kinds of Degu statements are terminated with an end keyword:
§ Function definitions: func ID() … end
§ If-statements: if … elif … else … end
§ While-loops: while … end

Other kinds of statements have no end.

This means, for example, that you generally do not want to exit the parsing function early if you think the if-statement is over. So code like the following is not good.

    if not matchString("else") then
        return true, ast
    end
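The following C++ sketch shows one way to structure if-statement parsing so that the end keyword is always required, while also following the convention that every parsing function returns a success flag and an AST. It is not the posted parseit.lua code. The class name IfParser, the placeholder statement "pass", the single-token conditions, and the string form of the AST are all assumptions made to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch of a parser for a simplified if-statement:
//   statement -> "if" ID stmt_list { "elif" ID stmt_list }
//                [ "else" stmt_list ] "end"
//              | "pass"
// Assumptions for brevity: each condition is a single token, "pass"
// is a made-up placeholder statement, and the AST is a string.
class IfParser {
public:
    explicit IfParser(const std::vector<std::string> & tokens)
        : toks_(tokens), pos_(0) {}

    // Every parsing function returns two values: success flag & AST.
    std::pair<bool, std::string> parse() {
        std::pair<bool, std::string> result = parse_statement();
        if (!result.first || pos_ != toks_.size())
            return std::make_pair(false, std::string());
        return result;
    }

private:
    bool match(const std::string & s) {
        if (pos_ < toks_.size() && toks_[pos_] == s) {
            ++pos_;
            return true;
        }
        return false;
    }

    bool at_block_end() const {
        return pos_ >= toks_.size() || toks_[pos_] == "elif" ||
               toks_[pos_] == "else" || toks_[pos_] == "end";
    }

    std::pair<bool, std::string> parse_stmt_list() {
        std::string ast = "(STMT_LIST";
        while (!at_block_end()) {
            std::pair<bool, std::string> st = parse_statement();
            if (!st.first) return std::make_pair(false, std::string());
            ast += " " + st.second;
        }
        return std::make_pair(true, ast + ")");
    }

    // Parses one condition followed by its statement list.
    bool parse_branch(std::string & ast) {
        if (at_block_end()) return false;       // Condition is missing
        ast += " " + toks_[pos_++];
        std::pair<bool, std::string> body = parse_stmt_list();
        if (!body.first) return false;
        ast += " " + body.second;
        return true;
    }

    std::pair<bool, std::string> parse_statement() {
        const std::pair<bool, std::string> fail(false, std::string());
        if (match("pass"))
            return std::make_pair(true, std::string("PASS"));
        if (!match("if")) return fail;

        std::string ast = "(IF";
        if (!parse_branch(ast)) return fail;
        while (match("elif"))
            if (!parse_branch(ast)) return fail;
        if (match("else")) {
            std::pair<bool, std::string> body = parse_stmt_list();
            if (!body.first) return fail;
            ast += " " + body.second;
        }
        // No early return when "else" is absent: the statement is
        // not over until its "end" keyword has been matched.
        if (!match("end")) return fail;
        return std::make_pair(true, ast + ")");
    }

    std::vector<std::string> toks_;
    std::size_t pos_;
};
```

Given the tokens if, x, pass, end, this sketch returns a true flag and the string (IF x (STMT_LIST PASS)). Given the same tokens without end, it returns a false flag.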

Thoughts on Assignment 4
Some Code

I have posted a file containing a portion of my version of parseit.lua. See assn4_code.txt.

PL Feature: Type System

PL Feature: Type System
Basic Concepts [1/3]

These slides are an incomplete summary of the reading “A Primer on Type Systems”.

A type system is a way of classifying entities in a program by the kinds of values they represent, in order to prevent undesirable program states. The classification assigned to an entity is its type.

In C++:

    int abc;
    abc = 123 + 456;
    cout << 4.2;

§ int is a type.
§ abc is a variable of type int.
§ 123 and 456 are literals of type int.
§ 123 + 456 is an expression of type int.
§ 4.2 is a literal of type double.
§ cout is a variable of type std::ostream.

PL Feature: Type System
Basic Concepts [2/3]

The great majority of PLs include some kind of type system.

In the past, many PLs had a fixed set of types. Many modern PLs have an extensible type system: one that allows programmers to define new types.

    class Zebra {  // New C++ type named "Zebra"
    …

Type checking means checking & enforcing the restrictions associated with a type system.

The various actions involved with a type system (determining types, type checking) are collectively known as typing.

PL Feature: Type System
Basic Concepts [3/3]

Types are used in three ways. They are used to determine:

1. Which values an entity may take on.

    int abc = vector();  // Type error: RHS is not int

2. Which operations are legal.

    cout << *abc;  // Type error: cannot dereference int

3. Which of multiple possible operations to perform.

    cout << 123 + 456;  // + does int addition
    string ss1, ss2;
    cout << ss1 + ss2;  // + does string concatenation
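The third use can be made concrete with a short C++ sketch. The function names describe and combine are hypothetical. In both cases the compiler chooses what to do based only on the static types of the arguments.

```cpp
#include <string>

// Overloads: the argument's type selects which function is called.
std::string describe(int) { return "int"; }
std::string describe(double) { return "double"; }
std::string describe(const std::string &) { return "string"; }

// Hypothetical helper: the type T decides which operator+ is used.
template <typename T>
T combine(const T & a, const T & b) {
    return a + b;   // int addition or string concatenation
}
```

For example, combine(123, 456) performs int addition and gives 579, while combine applied to the strings "ab" and "cd" performs concatenation and gives "abcd". Similarly, describe(123 + 456) returns "int" and describe(4.2) returns "double".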

PL Feature: Type System
Classifying Type Systems [1/2]

We classify type systems along three axes.
1. Overall type system: static or dynamic.
2. How are types specified: manifest or implicit.
3. How are types checked: nominal or structural.

We can also consider type safety.

PL Feature: Type System
Classifying Type Systems [2/2]

The following table shows how the type systems of various PLs can be classified along our first two axes.

                                     Type Specification
                           Mostly Manifest       Mostly Implicit
    Overall     Static     C, C++, Java          Haskell, OCaml
    Type
    System      Dynamic    Not much goes here    Python, Lua, Ruby, JavaScript, Scheme

PL Feature: Type System
Type Safety [1/3]

A PL or PL construct is type-safe if it forbids operations that are incorrect for the types on which they operate. Some PLs/constructs discourage incorrect operations without forbidding them. We may compare their level of type safety.

The C/C++ printf function is not type-safe. The following assumes age has type int, but does not check. It may behave oddly if age has a different type.

    printf("I am %d years old.", age);

C++ stream I/O is type-safe. Below, age is output correctly, based on its type. This will not compile if that type cannot be output.

    cout << "I am " << age << " years old.";
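The claim that stream output "will not compile" for an unprintable type can be checked at compile time. The following C++ sketch assumes a hypothetical trait is_streamable, a made-up type Opaque that has no operator<<, and a hypothetical function describe_age. It illustrates the idea and is not part of the standard library.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical trait: is_streamable<T>::value is true exactly when
// "os << value" compiles for a std::ostream os and a value of type T.
template <typename T, typename = void>
struct is_streamable : std::false_type {};

template <typename T>
struct is_streamable<T,
    decltype(void(std::declval<std::ostream &>()
                  << std::declval<const T &>()))>
    : std::true_type {};

struct Opaque {};   // Made-up type with no operator<<

// Formats a sentence like the one above. The static_assert turns an
// unprintable type into a clear compile-time error.
template <typename T>
std::string describe_age(const T & age) {
    static_assert(is_streamable<T>::value,
                  "age must be printable with <<");
    std::ostringstream out;
    out << "I am " << age << " years old.";
    return out.str();
}
```

Here describe_age(30) produces "I am 30 years old.", and the same call works for a std::string argument. A call such as describe_age(Opaque()) is rejected by the compiler instead of misbehaving at run time.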

PL Feature: Type System
Type Safety [2/3]

Two unfortunate terms are often used in discussions of type safety: strong typing (or strongly typed) and weak typing (or weakly typed). These generally have something to do with the overall level of type safety offered by a PL. But these terms have no standard definitions. They are used in different ways by different people. (I have seen at least three definitions of “strongly typed” in common use. C is strongly typed by one of them and weakly typed by the other two.)

Therefore:

Avoid using the terms “strong” and “weak” typing, or “strongly” and “weakly” typed.

PL Feature: Type System
Type Safety [3/3]

A static type system is sound if it guarantees that operations that are incorrect for a type will not be performed; otherwise it is unsound.

Haskell has a sound type system. C and C++ have unsound type systems. This is not a criticism!

In the world of dynamic typing, there does not seem to be any standard terminology corresponding to soundness. However, we can still talk about whether a dynamic type system strictly enforces type safety.
