Grammars and Parsing for SSC1

Hayo Thielecke

November 2008

Outline of the parsing part of the module

1 Introduction
2 Grammars
    Derivations
    Parse trees abstractly
3 From grammars to code
    Translation to Java methods
    Parse trees in Java
4 Parser generators
    Yacc, ANTLR, JavaCC and SableCC
5 Summary

Motivation: a challenge

Write a program that reads strings like this and evaluates them:

(2+3*(2-3-4))*2

In particular, brackets and precedence must be handled correctly (* binds more tightly than +).
If you attempt brute-force hacking, you may end up with something that is inefficient, incorrect, or both. Try it!
With parsing, that sort of problem is straightforward.
Moreover, the techniques scale up to more realistic problems.

Books

I hope these slides are detailed enough, but if you want to dig deeper:
There are lots of books on compilers.
The ones which I know best are:

Appel, Modern Compiler Implementation in Java.
Aho, Sethi, and Ullman, nicknamed “The Dragon Book”

Parsing is covered in both, but the Composite Pattern for trees and Visitors for tree walking only in Appel.
See also the websites for ANTLR (http://antlr.org) and SableCC (http://sablecc.org).

Why do you need to learn about grammars?

Grammars are widespread in programming.
XML is based on grammars and needs to be parsed.
Knowing grammars makes it much easier to learn the syntax of a new programming language.
Powerful tools exist for parsing (e.g., yacc, bison, ANTLR, SableCC, ...). But you have to understand grammars to use them.
Grammars give us examples of some more advanced object-oriented programming: Composite Pattern and polymorphic methods.
You may need parsing in your final-year project for reading complex input.

Describing syntax

One often has to describe syntax precisely, particularly when dealing with computers.
“A block consists of a sequence of statements enclosed in curly brackets”
Informal English descriptions are too clumsy and not precise enough. Rather, we need precise rules, something like Block → ....
Such precise grammar rules exist for all programming languages. See for example http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html.

Example from programming language syntax

Some rules for statements S in Java or C:

S → if ( E ) S else S
S → while ( E ) S
S → V = E;
S → { B }
B → SB
B →

Here V is for variables and E for expressions.

E → E - 1
E → ( E )
E → 1
E → E == 0
V → foo

Nesting in syntax

Specifically, we need to be very careful about bracketing and nesting. Compare:

while(i < n)
    a[i] = 0;
    i = i + 1;

and

while(i < n)
{
    a[i] = 0;
    i = i + 1;
}

These snippets look very similar. But their difference is clear in the parse tree.

What do we need in a grammar

Some symbols can occur in the actual syntax, like while or cat.
We also need other symbols that act like variables (for nouns in English or statements in Java, say).
Rules then say how to replace these symbols, e.g. a noun can be cat or mat.

Grammars: formal definition

A context-free grammar consists of
some terminal symbols a, b, ..., +, ), ...
some non-terminal symbols A, B, S, ...
a distinguished non-terminal start symbol S
some rules of the form

A → X1 ... Xn

where n ≥ 0, A is a non-terminal, and the Xi are symbols.

Notation: Greek letters

Mathematicians and computer scientists are inordinately fond of Greek letters.

α   alpha
β   beta
γ   gamma
ε   epsilon

Notational conventions for grammars

We will use Greek letters α, β, ..., to stand for strings of symbols that may contain both terminals and non-terminals.
In particular, ε is used for the empty string (of length 0).
We will write A, B, ... for non-terminals.
Terminal symbols are usually written in typewriter font, like for, while, [.
These conventions are handy once you get used to them and are found in most books.

Some abbreviations in grammars (BNF)

A rule with an alternative, written as a vertical bar |,

A → α | β

is the same as having two rules for the same non-terminal:

A → α
A → β

There is also some shorthand notation for repetitions:
α* stands for zero or more occurrences of α, and
α+ stands for one or more occurrences of α.

Derivations

If A → α is a rule, we can replace A by α for any strings β and γ on the left and right:

β A γ ⇒ β α γ

This is one derivation step.
A string w consisting only of terminal symbols is generated by the grammar if there is a sequence of derivation steps leading to it from the start symbol S:

S ⇒ ··· ⇒ w

An example derivation

Recall the rules

S → { B }
B → SB
B →

Replacing always the leftmost non-terminal symbol, we have:

S ⇒ { B }
  ⇒ { SB }
  ⇒ {{ B } B }
  ⇒ {{} B }
  ⇒ {{} SB }
  ⇒ {{}{ B } B }
  ⇒ {{}{} B }
  ⇒ {{}{}}

The language of a grammar
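As a small illustration of "the strings a grammar generates", the following sketch enumerates the short words of the block grammar S → { B }, B → SB, B → (empty) used in the example derivation. It explores leftmost derivation steps and drops any intermediate form that can no longer fit the length bound. The class name LanguageSketch, the method names, and the encoding of symbols as single characters are assumptions made for this sketch only.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

// Sketch: enumerate the short strings generated by
//   S -> { B }    B -> S B    B -> (empty)
// by exploring leftmost derivations. All names here are hypothetical.
class LanguageSketch {

    // All terminal strings of length <= maxLen derivable from S.
    static Set<String> language(int maxLen) {
        Set<String> words = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.add("S");                                // start symbol
        while (!todo.isEmpty()) {
            String form = todo.remove();
            int i = leftmostNonTerminal(form);
            if (i < 0) {                              // only terminals left: a word
                words.add(form);
                continue;
            }
            String before = form.substring(0, i);
            String after = form.substring(i + 1);
            if (form.charAt(i) == 'S') {
                push(todo, before + "{B}" + after, maxLen);   // S -> { B }
            } else {                                  // the non-terminal is B
                push(todo, before + "SB" + after, maxLen);    // B -> S B
                push(todo, before + after, maxLen);           // B -> (empty)
            }
        }
        return words;
    }

    static int leftmostNonTerminal(String form) {
        for (int i = 0; i < form.length(); i++) {
            char c = form.charAt(i);
            if (c == 'S' || c == 'B') return i;
        }
        return -1;
    }

    // Keep a form only if it can still yield a word of length <= maxLen:
    // each terminal counts 1, each S yields at least "{}" (2), each B at least 0.
    static void push(Deque<String> todo, String form, int maxLen) {
        int min = 0;
        for (char c : form.toCharArray()) {
            if (c == 'S') min += 2;
            else if (c != 'B') min += 1;
        }
        if (min <= maxLen) todo.add(form);
    }

    public static void main(String[] args) {
        System.out.println(language(6));
    }
}
```

With a bound of 6 the sketch finds four words: {}, {{}}, {{}{}} and {{{}}}. Raising the bound gives more words without end, which is the sense in which the language is infinite.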

In this context, a language is a set of strings (of terminal symbols).
For a given grammar, the language of that grammar consists of all strings of terminals that can be derived from the start symbol.
For useful grammars, there are usually infinitely many strings in its language (e.g., all Java programs).
Two different grammars can define the same language. Sometimes we may redesign the grammar as long as the language remains the same.

Grammars and brackets

Grammars are good at expressing various forms of bracketing structure.
In XML: an element begins with a start tag and ends with the matching end tag.
The most typical context-free language is the Dyck language of balanced brackets:

D →
D → DD
D → ( D )
D → [ D ]

(This makes grammars more powerful than Regular Expressions.)

Matching brackets
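A minimal sketch of what "the brackets have to match" means in code: a recognizer for the language of balanced brackets. It does not follow the rules D → DD, D → ( D ), D → [ D ] literally; it assumes the equivalent form D → ( D ) D | [ D ] D | (empty), which defines the same language and is easier to turn into a method. The class name DyckSketch and its methods are hypothetical.

```java
// Sketch of a recognizer for balanced brackets. It follows the grammar
//   D -> ( D ) D   |   [ D ] D   |   (empty)
// assumed here as an equivalent, easier-to-parse form of the Dyck grammar.
class DyckSketch {
    private final String input;
    private int pos = 0;                   // index of the next unread symbol

    DyckSketch(String input) { this.input = input; }

    // True if the whole string is balanced.
    static boolean balanced(String s) {
        DyckSketch p = new DyckSketch(s);
        return p.parseD() && p.pos == s.length();
    }

    // Parses one D starting at pos; returns false on a mismatch.
    private boolean parseD() {
        while (pos < input.length()) {
            char open = input.charAt(pos);
            char close;
            if (open == '(') close = ')';
            else if (open == '[') close = ']';
            else return true;              // D -> (empty): leave the symbol to the caller
            pos++;                         // consume the opening bracket
            if (!parseD()) return false;   // the D inside the brackets
            if (pos >= input.length() || input.charAt(pos) != close) return false;
            pos++;                         // consume the matching closing bracket
        }
        return true;                       // end of input: D -> (empty)
    }

    public static void main(String[] args) {
        System.out.println(balanced("[][[]]"));  // true
        System.out.println(balanced("[[]"));     // false
        System.out.println(balanced("[)"));      // false
    }
}
```

The recursive call for the inner D is what a regular expression cannot imitate: the method has to remember, for each open bracket, which closing bracket it is still waiting for.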

Note that the brackets have to match in the rule

D → [ D ]

This is different from “any number of [ and then any number of ]”.
For comparison, we could have a different grammar in which the brackets do not have to match:

D → ODC
O → [ O
O → ε
C → C ]
C → ε

Another pitfall in reading rules

Recall this rule:

D → DD

Note that the repetition DD does not mean that the same string has to be repeated. It means any string derived from a D followed by any string derived from a D. They may be the same, but need not be.

D ⇒ DD
... ⇒ [] D
... ⇒ [][[]]

Recursion in grammars

A symbol may occur on the right-hand side of one of its rules:

E → (E)

We often have mutual recursion in grammars:

S → { B }
B → SB

Mutual recursion also exists in Java: for example, method f calls g and g calls f.

Recursion in grammars and in Java

Compare recursion in grammars, methods and classes:

T → ... T ... T ...

int sumTree(Tree t)
{ ...
  ... return sumTree(t.left) + sumTree(t.right);
}

and classes

public class Tree
{ ...
  public Tree left;
  public Tree right;
}

Parse trees abstractly

The internal nodes are labelled with nonterminals.
If there is a rule A → X1 ... Xn, then an internal node can have the label A and children X1, ..., Xn.
The root node of the whole tree is labelled with the start symbol.
The leaf nodes are labelled with terminal symbols or ε.

[Figure: a tree whose root is labelled "Start symbol", with internal nodes "Non-terminal A" ... "Non-terminal B" and leaves "Terminal a" ... "Terminal z"]

Example: parse trees

We define a grammar such that the parse trees are binary trees:

B → 1 | 2 | ...
B → BB

Here are two parse trees (for the strings “1” and “1 1 2”):

B          B
|         / \
1        B   B
         |  / \
         1 B   B
           |   |
           1   2

Parse trees and derivations

Parse trees and derivations are equivalent; they contain the same information.
For each derivation of a word, there is a parse tree for the word.
(Idea: each step using A → α tells us that the children of some A-labelled node are labelled with the symbols in α.)
For each parse tree, there is a (unique leftmost) derivation.
(Idea: walk over the tree in depth-first order; each internal node gives us a rule.)

Automata and grammars: what we are not going to do

Pushdown automata (stack machines) are covered in Models of Computation.
The use of a stack for parsing will be covered in more detail in Compilers and Languages.
Independently of formal automata models, we can use a programming perspective: grammars give us classes or methods.
For experts: we let Java manage the stack for us (as its call stack).

Overview: from grammars to Java code

Methods for processing the language follow the structure of the grammar:
Each non-terminal gives a method (with mutual recursion between such methods).
Each grammar gives us a Java class hierarchy (with mutual recursion between such classes).
Each word in the language of the grammar gives us an object of the class corresponding to the start symbol (a parse tree).

Recursive methods

From grammars to mutually recursive methods:
For each non-terminal A there is a method A. The method body is a switch statement that chooses a rule for A.
For each rule A → X1 ... Xn, there is a branch in the switch statement. There are method calls for all the non-terminals among X1, ..., Xn.
Each grammar gives us some recursive methods.
For each derivation in the language, we have a sequence of method calls.

Reminder: the switch statement in Java

Assume test() returns an integer. We can use switch to branch on which integer is returned:

switch(test()) {
case 1:
    // test() is 1
    break; // we are done with this case
case 2: case 3:
    // test() is 2 or 3
    break;
default:
    // test() had some other value
    break;
}

Switch and if

We could write the same more clumsily with if and else:

temp = test();
if (temp == 1) {
    // test() returned 1
}
else if (temp == 2 || temp == 3) {
    // test() returned 2 or 3
}
else {
    // test() returned some other value
}

Grammar of switch

We extend the grammar for statements S with switch statements:

S → switch ( E ) { B }
B → CB | ε
C → LT
L → K | default:
K → case V : | case V : K
T → break; | return E; | ST

Here E can be any expression and V any constant value.

Example: a Java method for a grammar rule

The grammar rule for while-loops S → while ( E ) S gives us this Java code:

public static void S() {   // method for S
    switch(...) {          // which rule for S?
    case ...:              // if this one:
        ... "while (" ...  // process some text
        E();               // call method for E
        ... ")" ...        // process some text
        S();               // call method for S
        break;             // finished with this rule
    case ...:              // other rules
    }
}

Recognizing or parsing

Abstractly, a parser reads a string and constructs a parse tree for that string if there is one; otherwise, the parser reports an error. That is equivalent to recognizing.
The parse tree need not be constructed as an actual data structure in memory, since that can consume a lot of storage. It is enough if the parser provides enough information to construct the parse tree in principle, using events or parsing actions.
Example: the Xerces XML parser constructs a parse tree, SAX does not.

Recognizing a grammar

There are many different parsing technologies (LL(k), LR(1), LALR(1), ...).
Here we consider only predictive parsers, sometimes also called recursive descent parsers. They correspond to translating grammar rules into code as described above.
The hard part is choosing the rules according to the lookahead.

Lookahead example

S → if ( E ) S else S
S → while ( E ) S
S → V = E;
S → { B }
B → SB
B →

If we see if in the input, we choose the first rule; if we see while, the second.
Less obvious: how to choose between the last two rules.

An easy grammar to parse

Many constructs start with a keyword telling us immediately what it is.
Some languages are easy to parse with lookahead, e.g. Lisp and Scheme. Idea: everything is a prefix operator.

E → (+ E E)
E → (* E E)
E → 1

Obvious with 2 symbols of lookahead. Can do with 1 by tweaking the grammar.
C and Java are not so easy to parse (and C++ is worse). Any type name can begin a function definition, for instance.

Predictive parser

A predictive parser can be constructed from grammar rules.
The parser is allowed to “look ahead” in the input; based on what it sees there, it then makes predictions.
Canonical example: matching brackets.
If the parser sees a [ as the next input symbol, it “predicts” that the input contains something in brackets.
More technically: switch on the lookahead; [ labels one case.

The lookahead and match methods

A predictive parser relies on two methods for accessing the input string:
char lookahead() returns the next symbol in the input, without removing it.
void match(char c) compares the next symbol in the input to c. If they are the same, the symbol is removed from the input. Otherwise, the parsing is stopped with an error; in Java, this can be done by throwing an exception.

Simplified view of parsing
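In this simplified, character-based setting, lookahead and match need only a string and a position. The sketch below is one possible way to write them, not a definitive implementation: the class name InputSketch, the ParseError class body, and the use of '$' as an end-of-input marker are all assumptions made for illustration.

```java
// Sketch of the two input-access methods over a plain String.
// ParseError's shape, the end marker '$' and the class names are assumptions.
class ParseError extends Exception {
    ParseError(String message) { super(message); }
}

class InputSketch {
    private final String input;
    private int pos = 0;           // index of the next unread symbol

    InputSketch(String input) { this.input = input; }

    // Returns the next symbol without removing it; '$' marks the end of input.
    char lookahead() {
        return pos < input.length() ? input.charAt(pos) : '$';
    }

    // Removes the next symbol if it equals c; otherwise stops with an error.
    void match(char c) throws ParseError {
        if (lookahead() != c) {
            throw new ParseError("expected " + c + " but saw " + lookahead()
                                 + " at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) throws ParseError {
        InputSketch in = new InputSketch("[+]");
        System.out.println(in.lookahead());   // [
        in.match('[');
        System.out.println(in.lookahead());   // +
    }
}
```

Calling lookahead() twice in a row returns the same symbol; only match advances the position.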

In the real world, lookahead and match are calls to the lexical analyzer, and they return tokens, not characters.
There are efficiency issues of buffering the input file, etc.
We ignore all that to keep the parser as simple as possible. (Only single-letter keywords.)
But this setting is sufficient to demonstrate the principles.

Parsing with lookahead

Parsing with lookahead is easy if every rule for a given non-terminal starts with a different terminal symbol:

S → [ S ]
S → +

Idea: suppose you are trying to parse an S. Look at the first symbol in the input:
if it is a [, use the first rule;
if it is a +, use the second rule.

Translation to code (only recognizing)

void parseS() throws ParseError
{
    switch(lookahead()) {   // what is in the input?
    case '[':               // If I have seen a [
        match('[');         // remove the [
        parseS();           // now parse what is inside [...]
        match(']');         // make sure there is a ]
        break;              // done in this case
    case '+':               // If I have seen a +
        match('+');         // get rid of it
        break;              // and we are done
    default: error();       // throws ParseError
    }
}

How do we get the symbols for the case labels?

Parsing with lookahead is easy if every rule for a given non-terminal starts with a different terminal symbol.
In that case, the lookahead immediately tells us which rule to choose.
But what if not? The right-hand side could instead start with a non-terminal, or be the empty string.
More general methods for using the lookahead: FIRST and FOLLOW construction.

FIRST and FOLLOW sets

We define FIRST, FOLLOW and nullable:
A terminal symbol b is in FIRST(α) if there is a derivation

α ⇒* b β

(b is the first symbol in something derivable from α).
A terminal symbol b is in FOLLOW(X) if there is a derivation

S ⇒* α X b γ

(b can appear immediately behind X in some derivation).
α is nullable if α ⇒* ε (we can derive the empty string from it).

FIRST and FOLLOW give the case labels
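To see where such case labels can come from, here is a hedged sketch that computes nullable, FIRST and FOLLOW by repeating the definitions until nothing changes, for the block grammar S → { B }, B → SB, B → (empty). It is only an illustration of the idea, not the algorithm of any particular parser generator. The class name FirstFollowSketch and the rule encoding (upper-case letters are non-terminals, a rule is written "A=rhs") are assumptions of this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: nullable, FIRST and FOLLOW by fixed-point iteration for
//   S -> { B }    B -> S B    B -> (empty)
// Encoding (an assumption): upper-case letters are non-terminals,
// every other character is a terminal, and a rule is written "A=rhs".
class FirstFollowSketch {
    static final String[] RULES = { "S={B}", "B=SB", "B=" };

    final Set<Character> nullable = new HashSet<>();
    final Map<Character, Set<Character>> first = new HashMap<>();
    final Map<Character, Set<Character>> follow = new HashMap<>();

    FirstFollowSketch() {
        for (String r : RULES) {
            first.put(r.charAt(0), new TreeSet<>());
            follow.put(r.charAt(0), new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {                       // repeat until nothing new is found
            changed = false;
            for (String r : RULES) {
                char a = r.charAt(0);
                String rhs = r.substring(2);
                if (isNullable(rhs)) changed |= nullable.add(a);
                changed |= first.get(a).addAll(firstOf(rhs));
                for (int i = 0; i < rhs.length(); i++) {
                    char x = rhs.charAt(i);
                    if (!Character.isUpperCase(x)) continue;
                    String rest = rhs.substring(i + 1);
                    // whatever can start the rest of the rule can follow x
                    changed |= follow.get(x).addAll(firstOf(rest));
                    // if the rest can vanish, whatever follows a also follows x
                    if (x != a && isNullable(rest)) {
                        changed |= follow.get(x).addAll(follow.get(a));
                    }
                }
            }
        }
    }

    // A string of symbols is nullable if every symbol in it is.
    boolean isNullable(String alpha) {
        for (char x : alpha.toCharArray()) {
            if (!nullable.contains(x)) return false;
        }
        return true;
    }

    // FIRST of a string of symbols, using the FIRST sets found so far.
    Set<Character> firstOf(String alpha) {
        Set<Character> result = new TreeSet<>();
        for (char x : alpha.toCharArray()) {
            if (!Character.isUpperCase(x)) { result.add(x); break; }
            result.addAll(first.get(x));
            if (!nullable.contains(x)) break;
        }
        return result;
    }

    public static void main(String[] args) {
        FirstFollowSketch g = new FirstFollowSketch();
        System.out.println("FIRST(SB) = " + g.firstOf("SB"));
        System.out.println("FOLLOW(B) = " + g.follow.get('B'));
    }
}
```

For this grammar the sketch finds { in FIRST(SB) and } in FOLLOW(B). So in the method for B, a lookahead of { selects B → SB, and a lookahead of } selects the empty rule.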

FIRST and FOLLOW give us the case labels for the branches of the switch statement.
The branch for A → α gets the labels in FIRST(α).
The branch for A → ε gets the labels in FOLLOW(A).
FIRST and FOLLOW are tedious to compute by hand. We won’t go into the details here.
Parser generators compute this sort of information automatically.

From grammars to Java classes

We translate a grammar to some mutually recursive Java classes:
For each non-terminal A there is an abstract class A.
For each rule A → X1 ... Xn, there is a concrete subclass of A. It has fields for all non-terminals among X1, ..., Xn.
The toString method of A concatenates the toString of all the Xi:
if Xi is a terminal symbol, it is already a string;
if Xi is a non-terminal, its toString method is called.
Thus each grammar gives us a class hierarchy.
Instead of an abstract class, we could also use an interface for each non-terminal A, which the classes for the rules then implement.

Parse trees as Java objects

Suppose we translate the grammar to Java classes. Then for each string in the language, we have an object. Its fields may refer to other objects.
Together, these objects represent the parse tree for that word.
Its toString method constructs a (leftmost) derivation of the sentence.

Example: binary trees

abstract class BinTree {
    public abstract String toString();
}

class Leaf extends BinTree
{
    private int label;

    Leaf(int n) { label = n; }

    public String toString() { return "" + label; }
}

Example continued

class Node extends BinTree
{
    private BinTree left, right;

    Node(BinTree l, BinTree r)
    {
        left = l; right = r;
    }

    public String toString() {
        return "(" + left.toString() + ","
                   + right.toString() + ")"; }
}

Exercise: create an object whose toString() prints ((1,2),(1,2)).

Example: a Java class for a grammar rule

The grammar rule for while-loops

S → while ( E ) S

gives us this Java class:

public class While extends S
{
    private E testExp;
    private S loopBody;

    public String toString () {
        return "while (" + testExp.toString() + ")"
                + loopBody;
    }
}

COMPOSITE pattern and (parse) trees

The representation of parse trees is an instance of the COMPOSITE pattern in object-oriented software. Gamma et al. define it as:
“Compose objects into tree structures to represent part-whole hierarchies.”
(From Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.)
This is essentially the representation of parse trees described above. It describes a “hierarchy” as a tree, where the nodes are the whole and its children the parts.

Composite pattern in UML notation

Consider the binary tree grammar above. In UML notation, we have the following class diagram:

[Class diagram: abstract class B, with class Leaf and class Node below it]

A B “is a” Leaf or a Node, and a Node “has a” B.

Building the parse tree

For the parser that only recognizes, we had a void return type. We extend our translation of grammar rules to code:
The method for non-terminal A has as its return type the abstract class that we created for A.
Whenever we call a method for a non-terminal, we remember its return value in a local variable.
At the end of the translation of a rule, we call the constructor.

Classes for the parse tree

Grammar: S → [ S ] | +

abstract class S { ... }

class Bracket extends S
{
    private S inBrackets;

    Bracket(S s) { inBrackets = s; }
    ...
}

class Plus extends S { ... }

Parsing while building the parse tree

S parseS() // return type is abstract class S
{
    switch(lookahead()) {
    case '[':
    {
        S treeS;
        match('[');
        treeS = parseS(); // remember tree
        match(']');
        return new Bracket(treeS);
        // call the constructor for this rule
    }
    ...
    }
}

Methods in the parse tree
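As an end-to-end sketch of parsing and then calling a method of the tree, the following puts the pieces for the grammar S → [ S ] | + into one runnable class: the classes for the rules, lookahead and match, the tree-building parseS, and one method that does some work on the finished tree. The depth() method, the '$' end marker, the parse helper and the outer class name TreeParserSketch are assumptions added for illustration; the elided parts of the slide code are filled in one plausible way only.

```java
// End-to-end sketch for the grammar  S -> [ S ] | +  :
// parse, build the tree, then call a method of the tree.
// depth(), parse(), the '$' end marker and the outer class name are assumptions.
class TreeParserSketch {

    static class ParseError extends Exception {
        ParseError(String message) { super(message); }
    }

    static abstract class S {
        abstract int depth();            // number of bracket pairs around the +
    }

    static class Bracket extends S {
        private final S inBrackets;
        Bracket(S s) { inBrackets = s; }
        int depth() { return 1 + inBrackets.depth(); }
        public String toString() { return "[" + inBrackets + "]"; }
    }

    static class Plus extends S {
        int depth() { return 0; }
        public String toString() { return "+"; }
    }

    private final String input;
    private int pos = 0;                 // index of the next unread symbol

    TreeParserSketch(String input) { this.input = input; }

    char lookahead() { return pos < input.length() ? input.charAt(pos) : '$'; }

    void match(char c) throws ParseError {
        if (lookahead() != c) throw new ParseError("expected " + c + " at position " + pos);
        pos++;
    }

    S parseS() throws ParseError {
        switch (lookahead()) {
        case '[': {
            match('[');
            S treeS = parseS();          // remember the subtree
            match(']');
            return new Bracket(treeS);   // constructor for the rule S -> [ S ]
        }
        case '+':
            match('+');
            return new Plus();           // constructor for the rule S -> +
        default:
            throw new ParseError("unexpected " + lookahead() + " at position " + pos);
        }
    }

    // Parses a complete input string and rejects trailing symbols.
    static S parse(String text) throws ParseError {
        TreeParserSketch p = new TreeParserSketch(text);
        S tree = p.parseS();
        if (p.pos != text.length()) throw new ParseError("trailing input at position " + p.pos);
        return tree;
    }

    public static void main(String[] args) throws ParseError {
        S tree = parse("[[+]]");
        System.out.println(tree + " has depth " + tree.depth());   // [[+]] has depth 2
    }
}
```

The parser itself never computes the depth. It only builds Bracket and Plus objects, and the call tree.depth() is answered by whichever subclass each node happens to be.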

To perform useful work after the parse tree has been constructed, we call its methods.
Canonical example: expression evaluation. Suppose we have a grammar for arithmetical expressions:

E → E + E
E → E ∗ E
E → 1 | 2 | 3 | ...

Then each expression “knows” how to evaluate itself (depending on whether it is an addition, a multiplication, ...).
The methods in the parse tree can be given additional parameters to move information around the tree.

Expression evaluation as a method of the parse tree

abstract class Expression { abstract int eval(); ... }

class Plus extends Expression
{
    ....
    public int eval()
    {
        return left.eval() + right.eval();
    }
    ....
}

Adding parameters to the methods

Suppose we also have variables: E → x | ....

abstract class Expression
{
    abstract int eval(Environment env);
}

class Variable extends Expression
{
    private String name;

    public int eval(Environment env) {
        return env.get(name);
    }
}

OO and parsing
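A compact sketch of dynamic dispatch at work in expression evaluation: each subclass of Expression overrides eval, and one call on the root walks the whole tree. The source does not show Environment, so it is assumed here to be a thin wrapper around a map from variable names to integers; Num, Times, the constructors and the outer class name EvalSketch are likewise hypothetical. The tree is built by hand rather than by a parser.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of eval with an environment parameter and dynamic dispatch.
// Environment's shape, Num, Times and the outer class name are assumptions.
class EvalSketch {

    // Assumed shape of Environment: variable names mapped to integer values.
    static class Environment {
        private final Map<String, Integer> values = new HashMap<>();
        void put(String name, int value) { values.put(name, value); }
        int get(String name) {
            Integer v = values.get(name);
            if (v == null) throw new IllegalArgumentException("unbound variable " + name);
            return v;
        }
    }

    static abstract class Expression {
        abstract int eval(Environment env);
    }

    static class Num extends Expression {          // E -> 1 | 2 | 3 | ...
        private final int value;
        Num(int value) { this.value = value; }
        int eval(Environment env) { return value; }
    }

    static class Variable extends Expression {     // E -> x
        private final String name;
        Variable(String name) { this.name = name; }
        int eval(Environment env) { return env.get(name); }
    }

    static class Plus extends Expression {         // E -> E + E
        private final Expression left, right;
        Plus(Expression l, Expression r) { left = l; right = r; }
        int eval(Environment env) { return left.eval(env) + right.eval(env); }
    }

    static class Times extends Expression {        // E -> E * E
        private final Expression left, right;
        Times(Expression l, Expression r) { left = l; right = r; }
        int eval(Environment env) { return left.eval(env) * right.eval(env); }
    }

    public static void main(String[] args) {
        Environment env = new Environment();
        env.put("x", 4);
        // Tree for 2 + 3 * x, built by hand instead of by a parser.
        Expression e = new Plus(new Num(2), new Times(new Num(3), new Variable("x")));
        System.out.println(e.eval(env));           // prints 14
    }
}
```

The variable e has the static type Expression, yet e.eval(env) runs the eval of Plus, which in turn reaches Num, Times and Variable. The environment is simply passed down the tree as a parameter.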

Parse trees are an instance of the composite pattern.
The classes made for rules A → α extend the class for A.
The return type of a parsing method for A is the abstract class A, but what is actually returned is an instance of one of its subclasses.
The methods in the abstract class are overridden by the subclasses. During a treewalk, we rely on dynamic dispatch.
Polymorphism is crucial.

Parser generators

Except when the grammar is very simple, one typically does not program a parser by hand from scratch. Instead, one uses a parser generator.

Compare:

              Compiler
    Java        −→        JVM code

           Parser generator
    Grammar     −→        Parser

Examples of parser generators: yacc, bison, ANTLR, JavaCC, SableCC, ...

More on parser generators

Parser generators use some ASCII syntax rather than symbols like →.
With yacc, one attaches parsing actions to each production that tell the parser what to do.
Some parsers construct the parse tree automatically. All one has to do is tree-walking.
Parser generators often come with a collection of useful grammars for Java, XML, HTML and other languages.
If you need to parse a non-standard language, you need a grammar suitable for input to the parser generator.
Pitfalls: ambiguous grammars, left recursion.
A parser generator is often combined with a lexical analyzer generator (like lex and yacc).

Yacc parser generator

“Yet Another Compiler-Compiler”
An early (1975) parser generator geared towards C.
Technically, it generates LALR(1) parsers. LALR(1) is too complicated to explain here.
Very influential. You should have heard about it for historical reasons (like 1066).
Still widely referred to: “This is the yacc for ⟨blah⟩” means “This largely automates doing ⟨blah⟩ while hiding much of the complexity”.
The GNU version, standard on Linux, is called bison, see http://www.gnu.org/software/bison/.

ANTLR parser generator

Works with Java or C
Download and documentation at http://antlr.org
uses LL(k): similar to our predictive parsers, but using more than one symbol of lookahead
Parse tree can be constructed automatically
    you just have to annotate the grammar to tell ANTLR when to construct a node
The parse trees are a somewhat messy data structure, not very typed or OO
ANTLR has been used in several student projects here
See http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator for a simple example.

JavaCC parser generator

JavaCC is a parser generator aimed at Java.
See https://javacc.dev.java.net/ for downloads and documentation.
uses LL(1) if possible
Blurb: “Java Compiler Compiler [tm] (JavaCC [tm]) is the most popular parser generator for use with Java [tm] applications.”

SableCC parser generator

Works with Java
Download and documentation at http://sablecc.org
uses LALR(1) for parsing, just like Yacc
SableCC has no problem with left recursion, as LR parsing does not only depend on the look-ahead
    you may get cryptic errors about shift/reduce and reduce/reduce conflicts if the grammar is unsuitable
SableCC constructs an object-oriented parse tree, similar to the way we have constructed Java classes
    uses the Visitor Pattern for processing the parse tree
SableCC has been used in several student projects here

Visitor pattern for walking the parse tree
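The Visitor pattern that SableCC relies on can be pictured with a small hand-written sketch. Everything here is an assumption made for illustration: the interface Visitor, its method names, and the two visitor classes are hypothetical and are not the classes SableCC generates. The tree classes contain nothing but an accept method, and each kind of processing lives in its own visitor.

```java
// Hypothetical visitor interface: one visit method per tree class.
interface Visitor<R> {
    R visitConstant(Constant c);
    R visitPlus(Plus p);
}

abstract class Expression {
    // The only method the tree classes need for tree walking.
    abstract <R> R accept(Visitor<R> visitor);
}

class Constant extends Expression {
    final int value;
    Constant(int value) { this.value = value; }

    @Override
    <R> R accept(Visitor<R> visitor) { return visitor.visitConstant(this); }
}

class Plus extends Expression {
    final Expression left, right;
    Plus(Expression left, Expression right) { this.left = left; this.right = right; }

    @Override
    <R> R accept(Visitor<R> visitor) { return visitor.visitPlus(this); }
}

// Evaluation, written without touching the tree classes.
class EvalVisitor implements Visitor<Integer> {
    public Integer visitConstant(Constant c) { return c.value; }
    public Integer visitPlus(Plus p) {
        return p.left.accept(this) + p.right.accept(this);
    }
}

// A second, unrelated piece of functionality plugged into the same tree.
class PrintVisitor implements Visitor<String> {
    public String visitConstant(Constant c) { return Integer.toString(c.value); }
    public String visitPlus(Plus p) {
        return "(" + p.left.accept(this) + "+" + p.right.accept(this) + ")";
    }
}

class VisitorDemo {
    public static void main(String[] args) {
        Expression e = new Plus(new Plus(new Constant(1), new Constant(2)), new Constant(3));
        // prints "((1+2)+3) = 6"
        System.out.println(e.accept(new PrintVisitor()) + " = " + e.accept(new EvalVisitor()));
    }
}
```

Adding a third operation, such as a type checker, would mean writing one more visitor class and leaving Expression, Constant and Plus unchanged.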

Having to modify the methods in the tree classes is poor software engineering.
It would be much better if we had a general-purpose tree walker into which we can plug whatever functionality is desired.
The canonical way to do that is the Visitor Design pattern.
The tree is separate from specialized code that “visits” it.
The tree classes only have “accept” methods for visitor objects.
See the Design Patterns book by Gamma, Helm, Johnson and Vlissides, or www.sablecc.org.

A problem: left recursion

Left recursion is a problem for the simple parsers we have written, and for some parser generators, like ANTLR and JavaCC.
Example:

E → E − E
E → 1

The symbol 1 is in the lookahead for both rules, so it cannot guide our choice between them.
If you have this situation, and you want to use ANTLR, you need to rewrite the grammar.
There is a standard technique for eliminating left recursion, described in most compiler books.
Left recursion is not a problem for LR parsers like yacc or SableCC.

Another problem: ambiguous grammars
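The grammar E → E − E | 1 is used for both the left-recursion problem and the ambiguity problem, and one rewrite addresses both. As a sketch under stated assumptions: first restrict the right operand, giving E → E − 1 | 1, then apply the standard elimination of left recursion to obtain E → 1 E' and E' → − 1 E' | ε. The nonterminal E', the class names One, Minus and MinusParser, and the choice to write E' as a loop are all illustrative and not taken from the course code.

```java
abstract class Expression {
    abstract int eval();
}

// E -> 1
class One extends Expression {
    public int eval() { return 1; }
}

// A subtraction node; the parser below always nests these to the left.
class Minus extends Expression {
    private final Expression left, right;
    Minus(Expression left, Expression right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() - right.eval(); }
}

// Hypothetical predictive parser for the rewritten grammar
//   E  -> 1 E'
//   E' -> - 1 E' | epsilon
class MinusParser {
    private final String input;
    private int pos = 0;

    MinusParser(String input) { this.input = input; }

    Expression parseE() {
        Expression result = parseOne();
        // E' is written as a loop: the lookahead '-' decides whether to
        // continue, so no rule ever calls parseE on the same input position.
        while (pos < input.length() && input.charAt(pos) == '-') {
            pos++;
            result = new Minus(result, parseOne()); // grows the tree to the left
        }
        return result;
    }

    // Parses the whole input and rejects trailing characters.
    Expression parseAll() {
        Expression e = parseE();
        if (pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + pos);
        }
        return e;
    }

    private Expression parseOne() {
        if (pos >= input.length() || input.charAt(pos) != '1') {
            throw new IllegalArgumentException("expected 1 at position " + pos);
        }
        pos++;
        return new One();
    }
}

class LeftRecursionDemo {
    public static void main(String[] args) {
        // Parsed as (1-1)-1, so this prints -1 rather than 1.
        System.out.println(new MinusParser("1-1-1").parseAll().eval());
    }
}
```

The rewritten grammar accepts the same strings as the original, but every string now has exactly one parse tree, and the usual left-associative reading of subtraction is built in.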

A grammar is ambiguous if there is a string that has more than one parse tree.
Standard example:

E → E − E
E → 1

One such string is 1-1-1. It could mean (1-1)-1 or 1-(1-1) depending on how you parse it.
Ambiguous grammars are a problem for parsing, as we do not know what is intended.
Again, one can try to rewrite the grammar to eliminate the problem.

Parser generator overview

Parser generator    LR or LL    Tree processing
Yacc/Bison          LALR(1)     Parsing actions in C
SableCC             LALR(1)     Visitor Pattern in Java
ANTLR               LL(k)       Tree grammars + Java or C++
JavaCC              LL(k)       JJTree + Visitors in Java

Summary

We have seen some basics of parsing (grammars, derivations, parse trees).
We have translated grammars to Java code.
    You should be able to do this (exercise, exam).
We have touched upon some more advanced material (FIRST/FOLLOW, parser generators, Visitor pattern).
The code contains generally useful programming concepts: mutual recursion of methods and data, composite pattern, abstract classes, polymorphism, exceptions, ...
In practice, one would use a parser generator rather than reinvent the wheel.
One still needs to have some idea of what a parser generator does, as it is not just a button to click on.