Grammars and Parsing for SSC1
Hayo Thielecke
November 2008

Outline of the parsing part of the module

1 Introduction
2 Grammars
    Derivations
    Parse trees abstractly
3 From grammars to code
    Translation to Java methods
    Parse trees in Java
4 Parser generators
    Yacc, ANTLR, JavaCC and SableCC
5 Summary

Motivation: a challenge

Write a program that reads strings like this and evaluates them:

    (2+3*(2-3-4))*2

In particular, brackets and precedence must be handled correctly (* binds more tightly than +).
If you attempt brute-force hacking, you may end up with something that is inefficient, incorrect, or both. Try it!
With parsing, that sort of problem is straightforward.
Moreover, the techniques scale up to more realistic problems.

Books

I hope these slides are detailed enough, but if you want to dig deeper:
There are lots of books on compilers.
The ones which I know best are:
    Appel, Modern Compiler Implementation in Java.
    Aho, Sethi, and Ullman, nicknamed "The Dragon Book".
Parsing is covered in both, but the Composite Pattern for trees and Visitors for tree walking are covered only in Appel.
See also the websites for ANTLR (http://antlr.org) and SableCC (http://sablecc.org).

Why do you need to learn about grammars?

Grammars are widespread in programming.
XML is based on grammars and needs to be parsed.
Knowing grammars makes it much easier to learn the syntax of a new programming language.
Powerful tools exist for parsing (e.g., yacc, bison, ANTLR, SableCC, ...). But you have to understand grammars to use them.
Grammars give us examples of some more advanced object-oriented programming: Composite Pattern and polymorphic methods.
You may need parsing in your final-year project for reading complex input.

Describing syntax

One often has to describe syntax precisely, particularly when dealing with computers.
"A block consists of a sequence of statements enclosed in curly brackets"
Informal English descriptions are too clumsy and not precise enough. Rather, we need precise rules, something like Block → ...
Such precise grammar rules exist for all programming languages. See for example http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html.

Example from programming language syntax

Some rules for statements S in Java or C:

    S → if ( E ) S else S
    S → while ( E ) S
    S → V = E;
    S → { B }
    B → S B
    B →

Here V is for variables and E for expressions.

    E → E - 1
    E → ( E )
    E → 1
    E → E == 0
    V → foo

Nesting in syntax

Specifically, we need to be very careful about bracketing and nesting. Compare:

    while(i < n)
        a[i] = 0;
        i = i + 1;

and

    while(i < n)
    {
        a[i] = 0;
        i = i + 1;
    }

These snippets look very similar. But their difference is clear in the parse tree.

What do we need in a grammar

Some symbols can occur in the actual syntax, like while or cat.
We also need other symbols that act like variables (for nouns in English or statements in Java, say).
Rules then say how to replace these symbols, e.g. a noun can be cat or mat.

Grammars: formal definition

A context-free grammar consists of
    some terminal symbols a, b, ..., +, ), ...
    some non-terminal symbols A, B, S, ...
    a distinguished non-terminal start symbol S
    some rules of the form

        A → X₁ … Xₙ

    where n ≥ 0, A is a non-terminal, and the Xᵢ are symbols.

Notation: Greek letters

Mathematicians and computer scientists are inordinately fond of Greek letters.
    α   alpha
    β   beta
    γ   gamma
    ε   epsilon

Notational conventions for grammars

We will use Greek letters α, β, ..., to stand for strings of symbols that may contain both terminals and non-terminals.
In particular, ε is used for the empty string (of length 0).
We will write A, B, ... for non-terminals.
Terminal symbols are usually written in typewriter font, like for, while, [.
These conventions are handy once you get used to them and are found in most books.

Some abbreviations in grammars (BNF)

A rule with an alternative, written as a vertical bar |,

    A → α | β

is the same as having two rules for the same non-terminal:

    A → α
    A → β

There is also some shorthand notation for repetitions:
α* stands for zero or more occurrences of α, and
α+ stands for one or more occurrences of α.

Derivations

If A → α is a rule, we can replace A by α for any strings β and γ on the left and right:

    β A γ ⇒ β α γ

This is one derivation step.
A string w consisting only of terminal symbols is generated by the grammar if there is a sequence of derivation steps leading to it from the start symbol S:

    S ⇒ ··· ⇒ w

An example derivation

Recall the rules

    S → { B }
    B → S B
    B →

Always replacing the leftmost non-terminal symbol, we have:

    S ⇒ { B }
      ⇒ { S B }
      ⇒ { { B } B }
      ⇒ { { } B }
      ⇒ { { } S B }
      ⇒ { { } { B } B }
      ⇒ { { } { } B }
      ⇒ { { } { } }

The language of a grammar

In this context, a language is a set of strings (of terminal symbols).
For a given grammar, the language of that grammar consists of all strings of terminals that can be derived from the start symbol.
A useful grammar usually has infinitely many strings in its language (e.g., all Java programs).
Two different grammars can define the same language. Sometimes we may redesign the grammar as long as the language remains the same.

Grammars and brackets

Grammars are good at expressing various forms of bracketing structure.
In XML: an element begins with <tag> and ends with </tag>.
The most typical context-free language is the Dyck language of balanced brackets:

    D →
    D → D D
    D → ( D )
    D → [ D ]

(This makes grammars more powerful than regular expressions.)
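To see the Dyck language from the program's side, here is a minimal sketch of a recognizer in Java. It is an illustration under stated assumptions, not the method these slides develop: the class name DyckRecognizer and its methods are hypothetical, and the code follows a different grammar for the same language, D → ( D ) D | [ D ] D | (empty), because the rule D → D D cannot be turned directly into a recursive method without looping.

```java
/** Sketch of a recognizer for the Dyck language over ( ) [ ]. Names are illustrative. */
final class DyckRecognizer {
    private final String input;
    private int pos = 0;   // index of the next unread character

    private DyckRecognizer(String input) {
        this.input = input;
    }

    /** Returns true if the whole string can be derived from D. */
    static boolean accepts(String s) {
        DyckRecognizer r = new DyckRecognizer(s);
        return r.parseD() && r.pos == s.length();
    }

    // Assumed equivalent grammar:  D -> ( D ) D  |  [ D ] D  |  (empty)
    private boolean parseD() {
        if (pos < input.length()) {
            char open = input.charAt(pos);
            if (open == '(' || open == '[') {
                char close = (open == '(') ? ')' : ']';
                pos++;                        // consume the opening bracket
                if (!parseD()) {
                    return false;             // the nested D failed
                }
                if (pos >= input.length() || input.charAt(pos) != close) {
                    return false;             // closing bracket missing or of the wrong kind
                }
                pos++;                        // consume the matching closing bracket
                return parseD();              // the trailing D
            }
        }
        return true;                          // empty alternative: consume nothing
    }

    public static void main(String[] args) {
        System.out.println(accepts("([])[]"));  // true
        System.out.println(accepts("[[]"));     // false: one bracket is never closed
        System.out.println(accepts("(]"));      // false: wrong kind of closing bracket
    }
}
```

The recursion depth grows with the nesting depth and length of the input, so this sketch suits short strings only. Swapping the grammar is legitimate here because two different grammars can define the same language.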
Matching brackets

Note that the brackets have to match in the rule

    D → [ D ]

This is different from "any number of [ and then any number of ]".
For comparison, we could have a different grammar in which the brackets do not have to match:

    D → O D C
    O → [ O
    O → ε
    C → C ]
    C → ε

Another pitfall in reading rules

Recall this rule:

    D → D D

Note that the repetition DD does not mean that the same string has to be repeated.
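A short worked derivation may make this concrete. It uses only the Dyck rules D → (empty), D → D D, D → ( D ) and D → [ D ], and always replaces the leftmost D; the choice of which rule to apply at each step is just one possibility among many.

```latex
\begin{align*}
D &\Rightarrow D\,D         && \text{rule } D \to D\,D \\
  &\Rightarrow (\,D\,)\,D   && \text{rule } D \to (\,D\,) \text{ on the first } D \\
  &\Rightarrow (\,)\,D      && \text{rule } D \to \varepsilon \\
  &\Rightarrow (\,)\,[\,D\,] && \text{rule } D \to [\,D\,] \text{ on the second } D \\
  &\Rightarrow (\,)\,[\,]   && \text{rule } D \to \varepsilon
\end{align*}
```

The first D produced by D → D D ends up as ( ) and the second as [ ], so the two occurrences of D derive different strings.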