
Shift-Reduce Parsing continued
Parsing Wrap-Up
PL Feature: Type System

CS F331 / CSCE A331 Programming Languages Concepts Lecture Slides, Wednesday, February 20, 2019

Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks [email protected]

© 2017–2019 Glenn G. Chappell

Review: Overview of Lexing & Parsing

Parsing

Character Stream → Lexer → Lexeme Stream → Parser → AST or Error

Two phases:
§ Lexical analysis (lexing)
§ Syntax analysis (parsing)

Example: the character stream cout << ff(12.6); is lexed into a stream of categorized lexemes (id, op, numLit, punct), which is then parsed into an AST:

binOp: <<
├── id: cout
└── funcCall
    ├── id: ff
    └── numLit: 12.6

The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary.
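As a rough illustration of the lexing phase, here is a minimal Python sketch of a lexer for this example. The token categories and patterns here are hypothetical choices for illustration, not the course's actual lexer specification:

```python
import re

# Hypothetical token categories and patterns (not the course's actual
# lexer specification); patterns are tried in order at each position.
TOKEN_SPEC = [
    ("numLit", r"\d+(\.\d+)?"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"<<|[()+\-*/=]"),
    ("punct",  r"[;,]"),
    ("skip",   r"\s+"),
]

def lex(source):
    """Turn a character stream into a stream of (category, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for cat, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if cat != "skip":             # drop whitespace
                    tokens.append((cat, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"bad character at position {pos}")
    return tokens
```

For the slide's example, lex("cout << ff(12.6);") yields the seven lexemes cout, <<, ff, (, 12.6, ), and ;, each paired with a category.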

20 Feb 2019 CS F331 / CSCE A331 Spring 2019

Review: The Basics of Syntax Analysis — Categories of Parsers

Parsing methods can be divided into two broad categories.

Top-Down
§ Go through the derivation from top to bottom, expanding nonterminals.
§ Important subcategory: LL parsers (read input Left-to-right, produce a Leftmost derivation).
§ Often hand-coded, but not always.
§ Method we look at: Predictive Recursive Descent.

Bottom-Up
§ Go through the derivation from bottom to top, reducing substrings to nonterminals.
§ Important subcategory: LR parsers (read input Left-to-right, produce a Rightmost derivation).
§ Almost always automatically generated.
§ Method we look at: Shift-Reduce.

Review: The Basics of Syntax Analysis — Categories of Grammars

LL(1) grammars: those usable by LL parsers without lookahead.
LR(1) grammars: those usable by LR parsers without lookahead.

Containments (diagram rendered as text):

All Grammars
  ⊇ CFGs
      ⊇ LR(1) Grammars
          ⊇ LL(1) Grammars
      (Regular Grammars also lie inside the CFGs)

Review: Recursive-Descent Parsing [1/2]

Recursive Descent is a top-down parsing method.
§ When we avoid backtracking: predictive. Predictive Recursive Descent is an LL parsing method.
§ There is one parsing function for each nonterminal. This parses all strings that the nonterminal can be expanded into.

The natural grammar for expressions with left-associative binary operators is not LL(1). But we can transform it appropriately.

Not Usable (left recursion):  e → t | e ( "+" | "-" ) t
Usable:                       e → t { ( "+" | "-" ) t }
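The transformed (EBNF-style) rule maps directly onto a loop in a parsing function. Below is a minimal Python sketch of a predictive recursive-descent parser for e → t { ("+" | "-") t }, with t a number, evaluating as it parses. The function names and token representation are hypothetical; this is not the course's rdparser code:

```python
# Predictive recursive-descent sketch for the usable rule
#   e -> t { ("+" | "-") t }
# Tokens are a list of strings; each function returns (value, rest).
def parse_expr(tokens):
    val, tokens = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):   # the { ... } repetition
        op, tokens = tokens[0], tokens[1:]
        rhs, tokens = parse_term(tokens)
        val = val + rhs if op == "+" else val - rhs   # left-associative
    return val, tokens

def parse_term(tokens):
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected a number")
    return int(tokens[0]), tokens[1:]
```

With the left-recursive rule, parse_expr would have to call itself before consuming any input, recursing forever; the loop form consumes a t first, so it always makes progress, and the left-to-right loop preserves left associativity.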

Review: Recursive-Descent Parsing [2/2]

On a correct parse, a parser typically returns an abstract syntax tree (AST). We specify the format of an AST for each line in our grammar. It is helpful to include information in the AST telling what kind of entity each node represents.

Expression: a + 2

AST (diagram):
binOp: +
├── simpleVar: a
└── numLit: 2

AST (Lua):
{{ BIN_OP, "+" }, { SIMPLE_VAR, "a" }, { NUMLIT_VAL, "2" }}

See rdparser4.lua.
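The same AST shape can be sketched in Python, with nested lists standing in for the Lua tables. This is a sketch for illustration only; rdparser4.lua is the course's real implementation:

```python
# expr -> primary { ("+" | "-") primary }, building an AST shaped
# like the Lua tables above: [[BIN_OP, op], left, right].
def parse_expr(tokens):
    ast, tokens = parse_primary(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op, tokens = tokens[0], tokens[1:]
        rhs, tokens = parse_primary(tokens)
        # The new operator node takes the old AST as its left child,
        # which gives left associativity.
        ast = [["BIN_OP", op], ast, rhs]
    return ast, tokens

def parse_primary(tokens):
    tok, rest = tokens[0], tokens[1:]
    if tok.isdigit():
        return ["NUMLIT_VAL", tok], rest
    return ["SIMPLE_VAR", tok], rest
```

Parsing the tokens a + 2 produces [["BIN_OP", "+"], ["SIMPLE_VAR", "a"], ["NUMLIT_VAL", "2"]], matching the Lua AST above.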

Review: Shift-Reduce Parsing [1/3]

We are looking at a class of table-based bottom-up parsing algorithms.

Tables are produced before execution. Shortly we will take a brief look at this.

Parser execution uses a state machine with a stack, called a Shift-Reduce parser. The name comes from its two operations:
§ Shift. Advance to the next input symbol.
§ Reduce. Apply a production: replace a substring with a nonterminal.

In the form in which we presented it, Shift-Reduce parsing is an LR parsing method that can handle all LR(1) grammars. As we will see, in practice the class of grammars is usually restricted further.

Review: Shift-Reduce Parsing [2/3]

The parser runs as a state machine with an associated stack. § A stack item holds a symbol—terminal or nonterminal—and a state. § The current state is the state in the top stack item.

Stack (symbol on the left, state on the right; top item first):

    expr  3   ← top stack item; its state, 3, is the current state
    ID    7
    ==    5

The parsing table includes an action table (columns are terminals) and a goto table (columns are nonterminals). Rows are states.

Operation
§ Begin by pushing an item holding the start state and any symbol.
§ At each step, do a lookup in the action table using the current state and the current input symbol. Do what the action table entry says.

Review: Shift-Reduce Parsing [3/3]

Action-Table Entries

S# (# is the number of a state)
Shift. Push an item: current symbol + state #. Advance the input.

R# (# is the number of a production)
Reduce. Pop the RHS of production #. Push the LHS + a state from the goto table (lookup: state before the push + the LHS nonterminal).

ACCEPT
Terminate: the input is syntactically correct.

ERROR (I represent this by a blank table cell)
Terminate: syntax error.

Shift-Reduce Parsing continued: ASTs

How can a Shift-Reduce parser construct and return an AST?

Hold three things in each stack item: symbol, state, AST.

Process
§ When doing Shift, push the AST for the lexeme being shifted.
§ When doing Reduce, construct and push the AST for the new nonterminal, based on the ASTs in the popped stack items.
§ When doing ACCEPT, the top of the stack should hold the start symbol (program, say) and its AST. Return this AST to the caller.

If an AST is stored as a pointer-based tree, and a stack item holds a pointer to the root node of the AST, then this process can be very efficient. In particular, AST nodes never need to be copied.
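To make the whole process concrete, here is a small Python sketch of a Shift-Reduce parser that builds an AST, for the toy grammar S → ( S ) | x. The tables were worked out by hand for this one grammar (a real parser generator would produce them automatically), and the AST node tags PAREN and X are invented for the example:

```python
# Hand-worked tables for the toy grammar
#   1: S -> ( S )      2: S -> x
ACTION = {
    0: {"(": "s2", "x": "s3"},
    1: {"$": "acc"},
    2: {"(": "s2", "x": "s3"},
    3: {")": "r2", "$": "r2"},
    4: {")": "s5"},
    5: {")": "r1", "$": "r1"},
}
GOTO = {0: {"S": 1}, 2: {"S": 4}}
PRODS = {1: ("S", 3), 2: ("S", 1)}   # production number -> (LHS, RHS length)

def parse(lexemes):
    """Return the AST for the input, or raise SyntaxError."""
    stack = [(None, 0, None)]             # items: (symbol, state, AST)
    toks = list(lexemes) + ["$"]          # end marker
    i = 0
    while True:
        state, a = stack[-1][1], toks[i]
        act = ACTION[state].get(a)
        if act is None:                   # blank cell: ERROR
            raise SyntaxError(f"unexpected {a!r}")
        if act == "acc":                  # ACCEPT
            return stack[-1][2]           # AST of the start symbol
        n = int(act[1:])
        if act[0] == "s":                 # Shift: push symbol + state n
            stack.append((a, n, a))       # a lexeme's AST is the lexeme
            i += 1
        else:                             # Reduce by production n
            lhs, length = PRODS[n]
            popped = stack[-length:]
            del stack[-length:]
            # Build the nonterminal's AST from the popped items' ASTs.
            # (Node tags PAREN/X are invented for this example.)
            ast = ["PAREN", popped[1][2]] if n == 1 else ["X"]
            stack.append((lhs, GOTO[stack[-1][1]][lhs], ast))
```

For example, parse("((x))") returns ["PAREN", ["PAREN", ["X"]]], and parse("(x") raises a syntax error when it hits the end marker in a state whose action-table row has no entry for it.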

Shift-Reduce Parsing: Parsing-Table Generation [1/2]

Shift-Reduce parsers (and variations) dominate the field of automatically generated parsers.

A parsing table can be generated using a formalized, automatic process that is similar to the process we followed when writing a state-machine lexer.

However, for a grammar describing a real-world PL, the resulting parsing tables are typically far too large. Thus, when Shift-Reduce parsers were first introduced [D. Knuth 1965], they were not considered practical.

This changed as ways were found of producing much smaller parsing tables.

Shift-Reduce Parsing: Parsing-Table Generation [2/2]

One practical way of generating smaller parsing tables is Lookahead LR (LALR) [F. DeRemer 1969]. Despite the name, this does not necessarily do lookahead as we have described it; rather, during parsing-table generation, a kind of hypothetical lookahead is done, with the results used to collapse multiple states into a single state.

However, this state-collapse idea does not work for all LR(k) grammars. The LR(k) grammars for which it gives correct results are called—you guessed it—LALR(k) grammars.

LALR parsers appear to be the most common of the automatically generated parsers. In particular, the Yacc ("Yet Another Compiler-Compiler") parser generator and its descendants (GNU Bison, etc.) use LALR.

Shift-Reduce Parsing: Lookahead [1/2]

Multiple-lexeme lookahead makes Predictive Recursive-Descent parsing a more powerful technique. Without lookahead, such a parser can handle LL(1) grammars and the associated LL(1) languages. With an additional lexeme of lookahead, we can handle LL(2) grammars and LL(2) languages. This allows us to use grammars and languages that we could not handle without lookahead.

But the situation with LR parsers is different.

Shift-Reduce Parsing: Lookahead [2/2]

We can modify a Shift-Reduce parser to do multiple-lexeme lookahead. The grammars that can be used by an LR parser (like Shift-Reduce) that makes each decision based on k upcoming lexemes are the LR(k) grammars. And the languages that can be generated by an LR(k) grammar are the LR(k) languages.

There are LR(2) grammars that are not LR(1) grammars, there are LR(3) grammars that are not LR(2) grammars, and so on. However, the class of LR(k) languages is exactly the same class for all values of k. For example, given an LR(2) grammar, we can always transform it into an LR(1) grammar that generates the same language. Adding lookahead thus allows an LR parser to handle additional grammars, but no additional languages.

Shift-Reduce Parsing: Automata [1/2]

A Deterministic Push-Down Automaton (DPDA) is a DFA with a stack added. Each transition has an associated push or pop.

DPDAs are not capable of recognizing all CFLs. Those they can recognize are the Deterministic Context-Free Languages.

Note, for those interested:

The machine that can recognize any CFL is the Nondeterministic Push-Down Automaton (NPDA). An automaton is nondeterministic if its operation is not fully specified; perhaps there are choices along the way. A nondeterministic automaton accepts its input if there is some sequence of choices that leads to acceptance in the ordinary deterministic sense.
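As a concrete example, here is a DPDA-style recognizer, sketched in Python, for the deterministic CFL { aⁿbⁿ : n ≥ 1 }. The state names and stack encoding are one arbitrary choice among many; the point is that no step ever involves a choice:

```python
# A DPDA-style recognizer for L = { a^n b^n : n >= 1 }, which is a
# deterministic CFL. Every step is fully determined by the current
# state, input symbol, and top-of-stack symbol.
def dpda_accepts(w):
    stack = ["$"]                 # "$" marks the bottom of the stack
    state = "reading_a"
    for ch in w:
        if state == "reading_a" and ch == "a":
            stack.append("A")     # push one "A" per "a" read
        elif ch == "b" and stack[-1] == "A":
            stack.pop()           # pop one "A" per "b" read
            state = "reading_b"
        else:
            return False          # no applicable transition: reject
    # Accept iff at least one "b" was read and every "A" was matched.
    return state == "reading_b" and stack == ["$"]
```

By contrast, a language such as the even-length palindromes needs nondeterminism (an NPDA must "guess" the midpoint), so it is a CFL but not a deterministic CFL.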

Shift-Reduce Parsing: Automata [2/2]

A Shift-Reduce parser is almost a DPDA. The main difference is that a DPDA only pops one stack item at a time, while a Reduce operation may pop multiple items.

But this difference does not affect the languages that can be recognized. In fact, the following three categories of languages are identical.
§ The languages that can be generated by an LR(1) grammar, that is, the LR(k) languages (for any k).
§ The languages that can be parsed by a Shift-Reduce parser.
§ The languages that can be recognized by a DPDA, that is, the Deterministic Context-Free Languages.

Parsing Wrap-Up: Efficiency of Parsing [1/5]

We have discussed parsing algorithms that can handle some, but not all, CFLs. We have not yet discussed their efficiency. This leads to several questions.
§ How fast are the parsing algorithms we have covered?
§ How fast are other practical parsing algorithms?
§ Are there parsing algorithms that can handle all CFLs?
§ If so, how fast are these algorithms?

Parsing Wrap-Up: Efficiency of Parsing [2/5]

How do we determine the time efficiency of a parser?

In order to express efficiency using asymptotic notation (big-O, Θ, etc.), we need to know three things.
§ How do we measure the size of the input?
§ What operations are allowed?
§ What operations do we count (basic operations)?

We will use the following conventions.
§ The size of the input, denoted by n, is the number of symbols. So for our parsers, n is the number of lexemes. If lexical analysis is included, then n is the number of characters.
§ Allowed operations are retrieving the next input symbol and any internal-processing operations we want.
§ Basic operations are operations on fundamental types and calls to client-provided functions (e.g., retrieve the next lexeme).

Parsing Wrap-Up: Efficiency of Parsing [3/5]

Based on these conventions:

Practical parsers (and lexers) run in linear time.

This includes Predictive Recursive-Descent parsers and Shift-Reduce parsers.

Parsing Wrap-Up: Efficiency of Parsing [4/5]

In the late 1960s and 1970s, methods for parsing any CFL were found. Some of these cannot handle an arbitrary CFG directly; the grammar may need to be transformed first (CYK, for example, requires the grammar to be in Chomsky Normal Form).
§ Earley's Algorithm [J. Earley 1968]
§ The CYK Algorithm [J. Cocke & J.T. Schwartz 1970, D.H. Younger 1967, T. Kasami 1965]
§ The GLR Algorithm [B. Lang 1974]

All of these have a worst-case time of Θ(n^3) for an arbitrary CFL. Some can have better performance for some CFLs.
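Of the three, CYK is the easiest to sketch. Below is a minimal Python version for a grammar already in Chomsky Normal Form; the triple loop over substring lengths, start positions, and split points is exactly where the Θ(n^3) comes from. The grammar used in the example (generating aⁿbⁿ, n ≥ 1) is one simple choice for illustration:

```python
def cyk(word, unary, binary, start="S"):
    """CYK recognizer for a CNF grammar.
    unary:  list of (A, terminal) rules A -> a
    binary: list of (A, (B, C)) rules A -> B C
    """
    n = len(word)
    if n == 0:
        return False
    # T[l][i] = set of nonterminals deriving word[i : i+l]
    T = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, ch in enumerate(word):
        T[1][i] = {A for A, a in unary if a == ch}
    for l in range(2, n + 1):            # substring length
        for i in range(n - l + 1):       # start position
            for k in range(1, l):        # split point
                for A, (B, C) in binary:
                    if B in T[k][i] and C in T[l - k][i + k]:
                        T[l][i].add(A)
    return start in T[n][0]

# A CNF grammar for { a^n b^n : n >= 1 }:
#   S -> A B | A C,  C -> S B,  A -> "a",  B -> "b"
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "C")), ("C", ("S", "B"))]
```

For instance, cyk("aabb", UNARY, BINARY) accepts, while cyk("aab", UNARY, BINARY) rejects.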

Parsing Wrap-Up: Efficiency of Parsing [5/5]

An interesting method is Valiant's Algorithm [L.G. Valiant 1975]. This is an implementation of CYK using matrix multiplication, and it can thus benefit from fast matrix-multiplication algorithms.

Using Strassen’s matrix-multiplication algorithm [V. Strassen 1969], Valiant’s Algorithm parses any CFL in about Θ(n^2.8).

Using Le Gall’s matrix-multiplication algorithm [F. Le Gall 2014], Valiant’s Algorithm parses any CFL in about Θ(n^2.372864). But Le Gall’s algorithm only achieves speed-ups for extremely large matrices; this version of Valiant’s Algorithm is slow for realistic input.

Again:

Practical parsers (and lexers) run in linear time.

Parsing Wrap-Up: Parsing in Practice [1/3]

Many parsing methods have been developed. Some of them are significantly different from the ones we have looked at.

However, in my experience, the parser used in a production compiler is generally one of the following two kinds.
§ A hand-coded top-down parser using Recursive Descent, or something similar.
§ An automatically generated bottom-up parser that runs as a table-based Shift-Reduce parser, or something similar.

So, while we have by no means looked at all known parsing methods, what we have covered should give you the flavor of the parsers that are used in production compilers.

Parsing Wrap-Up: Parsing in Practice [2/3]

Producing a parser is a very practical skill.

This might seem an odd claim: as a member of a software-development team, you are unlikely to write a compiler or interpreter.

But what is parsing?

We have defined it: parsing is determining whether input is syntactically correct and, if so, finding its structure. However, there is another way of looking at it:

Parsing is making sense of input.

And that is something that computer programs need to do a lot.

Parsing Wrap-Up: Parsing in Practice [3/3]

Lastly, writing a parser is generally not a terribly difficult task.

So knowing how to produce a parser can be a useful addition to your personal toolbox.

PL Feature: Type System — Basic Concepts [1/3]

These slides are an incomplete summary of the reading “A Primer on Type Systems”.

A type system is a way of classifying entities in a program by the kinds of values they represent, in order to prevent undesirable program states. The classification assigned to an entity is its type.

In C++:

    int abc;
    abc = 123 + 456;
    cout << 4.2;

§ int is a type.
§ abc is a variable of type int.
§ 123 and 456 are literals of type int.
§ 123 + 456 is an expression of type int.
§ 4.2 is a literal of type double.
§ cout is a variable of type std::ostream.

PL Feature: Type System — Basic Concepts [2/3]

The great majority of PLs include some kind of type system.

In the past, many PLs had a fixed set of types. Many modern PLs have an extensible type system: one that allows programmers to define new types.

    class Zebra { // New C++ type named "Zebra"
        …

Type checking means checking & enforcing the restrictions associated with a type system.

The various actions involved with a type system (determining types, type checking) are collectively known as typing.

PL Feature: Type System — Basic Concepts [3/3]

Types are used in three ways. They are used to determine:

§ Which values an entity may take on.

    int abc = vector<int>(); // Type error: RHS is not int

§ Which operations are legal.

    cout << *abc; // Type error: cannot dereference int

§ Which of multiple possible operations to perform.

    cout << 123 + 456;  // + does int addition
    string ss1, ss2;
    cout << ss1 + ss2;  // + does string concatenation

PL Feature: Type System — Classifying Type Systems [1/2]

We classify type systems along three axes.
§ Overall type system: static or dynamic.
§ How types are specified: manifest or implicit.
§ How types are checked: nominal or structural.

We can also consider the level of type safety.

PL Feature: Type System — Classifying Type Systems [2/2]

The following table shows how the type systems of various PLs can be classified along our first two axes.

                             Type Specification
                      Mostly Manifest      Mostly Implicit
Overall    Static     C, C++, Java         Haskell, OCaml
Type       Dynamic    Not much goes here   Python, Lua, Ruby,
System                                     JavaScript, Scheme

PL Feature: Type System — Type Safety [1/3]

A PL or PL construct is type-safe if it forbids operations that are incorrect for the types on which they operate. Some PLs/constructs discourage incorrect operations without forbidding them. We may compare their level of type safety.

The C/C++ printf function is not type-safe. The call below assumes age has type int, but does not check; it may behave oddly if age has a different type.

    printf("I am %d years old.", age);

C++ stream I/O is type-safe. Below, age is output correctly, based on its type; this will not compile if that type cannot be output.

    cout << "I am " << age << " years old.";

PL Feature: Type System — Type Safety [2/3]

Two unfortunate terms are often used in discussions of type safety: strong typing (or strongly typed) and weak typing (or weakly typed). These generally have something to do with the overall level of type safety offered by a PL. But these terms have no standard definitions. They are used in different ways by different people. (I have seen at least three definitions of “strongly typed” in common use. C is strongly typed by one of them and weakly typed by the other two.)

Therefore: Avoid using the terms “strong” and “weak” typing, or “strongly” and “weakly” typed.

PL Feature: Type System — Type Safety [3/3]

A static type system is sound if it guarantees that operations that are incorrect for a type will not be performed; otherwise it is unsound.

Haskell has a sound type system. C & C++ have unsound type systems. This is not a criticism!

In the world of dynamic typing, there does not seem to be any standard terminology corresponding to soundness. However, we can still talk about whether a dynamic type system strictly enforces type safety.
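For example, Python (dynamically typed, from the table above) strictly enforces type safety for its built-in operations: a type-incorrect + is detected and rejected at run time, before it can corrupt anything. A small sketch, where describe_add is a hypothetical helper written for this illustration:

```python
# Dynamic type checking in Python: "+" is checked at run time,
# and an unsupported operand mix raises TypeError.
def describe_add(x, y):
    try:
        return x + y
    except TypeError as exc:
        return f"rejected: {exc}"

print(describe_add(1, 2))          # int addition
print(describe_add("ab", "cd"))    # string concatenation
print(describe_add("1", 2))        # rejected at run time
```

Contrast this with C, where an analogous type-incorrect operation may simply compute a wrong answer with no run-time check at all.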
