Chapter 3: Syntax

Topics

- Introduction
- English Description
- Regular Expressions
- BNF
- Variations of BNF
- Chomsky Hierarchy
- Parsing
- Ambiguity, associativity, and precedence

Chapter 3: Syntax 2

Introduction

Language definition
- Syntax
- Semantics

Syntax
- Form, format, well-formedness, and compositional structure of the language.
- Description of the ways different parts of the language may be combined to form other parts.

Semantics
- Meaning and interpretation of the language.
- Description of what happens during the execution of a program.

Introduction: English Description

- Early days
  - Syntax and semantics: lengthy English explanations and many examples
- Example (Syntax): the if-statement in Pascal may be described in words:
  An if-statement consists of the word "if" followed by a condition, followed by the word "then", followed by a statement, followed by an optional else part consisting of the word "else" and another statement.
- Example (Semantics): the if-statement in Pascal may be described in words:
  An if-statement is executed by first evaluating its condition. If the condition evaluates to true, then the statement following the "then" is executed. If the condition evaluates to false, and there is an else part, then the statement following the "else" is executed.


Syntax

Definition of what constitutes a grammatically valid program in that language.
Syntax is specified as a set of rules, just as it is for natural languages.
A clear, concise, and formal definition of syntax is especially important for programmers, implementers, language designers, etc.
The syntax of a language has a profound effect on the ease of use of a language.

Definition of a Language

Formal language: set of finite strings of atomic symbols.
Alphabet: set of symbols.
Sentences: the strings that belong to the language.

Alphabet: {a,b}
L1 = {a, b, ab}                   This language is finite: just three strings
L2 = {aa, aba, abba, abbba, … }   This language is infinite
- More precise methods for defining languages are desirable than just using "…".

Regular Expressions

Invented by the mathematical logician Stephen Kleene.
A regular expression over A denotes a language with alphabet A and is defined by the following set of rules:
1. Empty. The symbol ∅ is a regular expression, denoting the language consisting of no strings {}.
2. Atom. Any single symbol a ∈ A is a regular expression denoting the language consisting of the single string {a}.
3. Alternation. If r1 and r2 are regular expressions, then (r1+r2) is a regular expression. The language it denotes has all the strings from the language denoted by r1 and all the strings from the language denoted by r2.

Definition of a Regular Expression

4. Concatenation. If r1 and r2 are regular expressions, then (r1·r2) is a regular expression. The language it denotes is the set of all strings formed by concatenating a string from the set denoted by r2 to the end of a string in the set denoted by r1.
5. Closure. If r is a regular expression, then r* is a regular expression. The language it denotes consists of all strings formed by concatenating zero or more strings in the language denoted by r.

- Plus, dot, asterisk, the empty set symbol, and parentheses are part of the notation for regular expressions, not part of the languages being defined.
- Definitions are recursive.

Conventions for Writing Regular Expressions

Regular Expression   Meaning
x                    A character (stands for itself)
"xyz"                A literal string (stands for itself)
M | N                M or N
M N                  M followed by N (concatenation)
M*                   Zero or more occurrences of M
M+                   One or more occurrences of M
[a-zA-Z]             Any alphabetic character
[0-9]                Any digit
.                    Any single character

Regular Expressions: Example

Sequences of ASCII characters make up a legal identifier.
l stands for the regular expression denoting any lowercase or uppercase letter.
d stands for the regular expression denoting any decimal digit.
- Modula-3   l · ( l+d+_ )*
- ML         l · ( l+d+_+' )*
- Ada        l · ( l+d )* · ( _ · ( l+d ) · ( l+d )* )*
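The three identifier expressions can be tried out with Python's `re` module. This is only a sketch: the names `LETTER`, `DIGIT`, `MODULA3`, `ML`, `ADA`, and `is_identifier` are illustrative, and the slide's `+` (alternation) is written as `|` in Python's notation.

```python
import re

LETTER = "[A-Za-z]"   # l: any lowercase or uppercase letter
DIGIT = "[0-9]"       # d: any decimal digit

# Each pattern mirrors one identifier expression from the slide.
# The slide's "+" (alternation) becomes "|" in Python's regex syntax.
MODULA3 = re.compile(f"{LETTER}({LETTER}|{DIGIT}|_)*")
ML = re.compile(f"{LETTER}({LETTER}|{DIGIT}|_|')*")
ADA = re.compile(
    f"{LETTER}({LETTER}|{DIGIT})*(_({LETTER}|{DIGIT})({LETTER}|{DIGIT})*)*"
)

def is_identifier(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None
```

Under these patterns `x'` is accepted by the ML expression but not by the Modula-3 one, and the Ada expression rejects both a trailing underscore (`a_`) and a doubled underscore (`a__b`).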


Regular Expressions

Very popular tool in language design.
Generic lexical-analyzer generator:
- The regular expression is submitted directly.
- Two commonly used generators:
  "Lex" (for generating C code).
  "Jlex" (for generating Java code).
What is the problem with regular expressions? Why don't we use them to describe the syntax of programming languages?
- Major shortcoming: bracketing is not expressible.
- Regular expressions are incapable of generating the language {a^n b^n}, where the number of as must be equal to the number of bs. This kind of matching is very useful for nested beginning and ending tags, such as occur in expressions (parentheses) and statement lists (braces).

Formal Methods of Describing Syntax: BNF

Metalanguage: a language used to define other languages.
BNF: notation invented to describe the syntax of ALGOL 60
- Backus-Naur Form in honor of John Backus and Peter Naur, who developed the notation of this metalanguage in unrelated research efforts.
- Language: a sequence of tokens.
- Tokens: identifiers, numbers, keywords, punctuation, etc. that constitute the lexicon of the language.
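The bracketing limitation of regular expressions can be illustrated with a small sketch. A regular expression can check the shape "some as followed by some bs", but the equal-count requirement has to be checked separately. The function name `is_anbn` is illustrative, and allowing n = 0 is an assumption not stated on the slide.

```python
import re

def is_anbn(text):
    """Accept exactly the strings a^n b^n (assumption: n >= 0 is allowed)."""
    # A regular expression can check the shape: as first, then bs.
    if re.fullmatch(r"a*b*", text) is None:
        return False
    # The equal-count requirement is outside what a*b* can express,
    # so it is enforced here by explicit counting.
    return text.count("a") == text.count("b")
```

The pattern `a*b*` on its own also accepts strings such as `aab`, which is why the counting step is needed.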


BNF: Grammar

- BNF grammar: a set of rewriting rules.
  A left-hand side: a syntactic category
  - A syntactic category is a name for a set of token sequences
  A right-hand side: sequences of tokens and syntactic categories.
    <syntactic category> ::= sequence of tokens and syntactic categories
  There may be many rules with the same left-hand side.
- A token sequence belongs to a syntactic category if it can be derived by taking the right-hand sides of rules for the category and replacing each syntactic category occurring in the right-hand side with any token sequence belonging to that category.

BNF: Notation

A BNF definition typically contains the following meta-symbols:
  "::=" meaning "is defined to be"
  "<>" to delimit syntactic categories
  "|" meaning "or"
Strictly speaking, the symbol for "or" is not necessary, but it is convenient for combining multiple right-hand sides for the same syntactic category.


BNF: Examples (1)

Describe the syntax of regular expressions over the alphabet {a,b}.

BNF: Examples (2)

Describe the ALGOL 60 for construct.
::= |


BNF: Examples (3)

Describe simple integer arithmetic expressions with addition and multiplication.
  <exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | <number>
  <number> ::= <number> <digit> | <digit>
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The lexicon of a programming language contains the grammatical categories:
- Identifier (variable names, function names, etc.)
- Literal or constant (integer and decimal numbers)
- Operator (+, -, *, /, etc.)
- Separator (;, ., {, }, etc.)
- Keyword or reserved word (int, main, if, for, etc.)

BNF: A Stream of Tokens

The previous categories allow a program to be viewed as a stream of tokens.
- Each one a member of a particular grammatical category
- Separated from the next token by whitespace or a comment.

  // compute result = the nth Fibonacci number
  void main () {
    int n;
    n = 8;

[Figure callouts on the code above: Comment, Keyword, Separator, Identifier, Literal, Operator]
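The idea of viewing a program as a stream of categorized tokens can be sketched with a small tokenizer. All names here (`KEYWORDS`, `TOKEN_SPEC`, `tokenize`) are illustrative, the keyword set is a small assumed subset, and the token patterns are simplifications rather than the lexicon of any particular language.

```python
import re

# Assumption: a small illustrative keyword set, not a full language lexicon.
KEYWORDS = {"int", "void", "main", "if", "for"}

# Order matters: Comment is tried before Operator so "//" is not read as "/".
TOKEN_SPEC = [
    ("Comment",    r"//[^\n]*"),
    ("Literal",    r"\d+(?:\.\d+)?"),
    ("Identifier", r"[A-Za-z_]\w*"),
    ("Operator",   r"[+\-*/=]"),
    ("Separator",  r"[;.,{}()]"),
    ("Skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Return (category, text) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind == "Skip":
            continue
        # Keywords look like identifiers, so they are reclassified afterwards.
        if kind == "Identifier" and text in KEYWORDS:
            kind = "Keyword"
        tokens.append((kind, text))
    return tokens
```

For example, `tokenize("int n;")` yields a Keyword, an Identifier, and a Separator, in that order.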


Variations of BNF: EBNF

Several extensions to BNF have been proposed to make BNF definitions more readable.
- They do not add to the expressive power of the formalism, just to the convenience.
Extended BNF (EBNF for short) was introduced to simplify the specification of recursion in grammar rules (curly brackets), and to introduce the idea of an optional part in a rule's right-hand side (square brackets).

EBNF: Example

Describe simple integer arithmetic expressions with addition and multiplication.
  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  ::= {
  ::= {+ } | {- }
The curly brackets "{ }" denote zero or more repetitions.
The square brackets enclose a series of alternatives from which one must choose.
Definitions of language syntax in EBNF tend to be slightly clearer and briefer than BNF definitions.
EBNF does not force the use of recursive definitions on the reader in every instance.
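One practical consequence of the curly-bracket notation is that each "zero or more repetitions" maps directly onto a loop in a hand-written parser. The sketch below assumes an illustrative EBNF grammar that is not taken from the slides: exp ::= term { + term }, term ::= factor { * factor }, factor ::= number | ( exp ), number ::= digit { digit }. The function name `parse_exp` and its helpers are likewise hypothetical.

```python
def parse_exp(text):
    """Evaluate an expression built from digits, +, *, and parentheses.

    Assumed grammar (illustrative, not from the slides):
        exp    ::= term { + term }
        term   ::= factor { * factor }
        factor ::= number | ( exp )
        number ::= digit { digit }
    """
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def number():
        nonlocal pos
        start = pos
        while peek().isdigit():        # { digit }: repetition becomes a loop
            pos += 1
        if start == pos:
            raise SyntaxError(f"digit expected at position {pos}")
        return int(text[start:pos])

    def factor():
        nonlocal pos
        if peek() == "(":
            pos += 1
            value = exp()
            if peek() != ")":
                raise SyntaxError(f"')' expected at position {pos}")
            pos += 1
            return value
        return number()

    def term():
        nonlocal pos
        value = factor()
        while peek() == "*":           # { * factor }
            pos += 1
            value *= factor()
        return value

    def exp():
        nonlocal pos
        value = term()
        while peek() == "+":           # { + term }
            pos += 1
            value += term()
        return value

    result = exp()
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return result
```

No recursion is needed for the repeated operators. Only the parenthesized case calls back into `exp`.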


Variations of BNF: Syntax Diagrams

Graphical representation that indicates the sequence of terminals and nonterminals encountered in the right-hand side of the rule.
- Circles or ovals for terminals
- Squares or rectangles for nonterminals
- Connected with lines and arrows to indicate appropriate sequencing.
Syntax diagrams can also condense several productions into one diagram.

EBNF: Example

The Ada reference manual uses extended BNF.
- Uses a different convention to distinguish syntactic categories from terminals.
- Syntactic categories are denoted by simple identifiers possibly containing underscores.
- Keywords and punctuation are in bold face.

  block ::= [ block_identifier: ] [ declare {declaration} ]
            begin statement {statement}
            [ exception handler {handler} ]
            end [ block_identifier ];


Syntax Diagrams (1)

[Figure: syntax diagrams for digit (alternatives 0 through 9), number (built from digit), and exp (built from term and +)]

Syntax Diagrams (2)

As an example of how to express the square brackets in syntax diagrams: the Pascal if-statement.

[Figure: syntax diagram for If-statement: if, expression, then, statement, with an optional branch else, statement]

Syntax diagrams are always written from the EBNF, not the BNF.


Definition of a Grammar

BNF notation is a slightly different form of what are called context-free grammars. These grammars and others were developed independently by Chomsky.
Grammar: a 4-tuple (T, N, S, P)
- T is the set of terminal symbols
- N is the set of nonterminal symbols, T ∩ N = ∅
- S, a nonterminal, is the start symbol
- P are the productions of the grammar
A production has the form α → β where α and β are strings of terminals and nonterminals (α ≠ ε).

Chomsky Hierarchy

In the mid 1950s, Chomsky described a hierarchy that relates the power of different types of grammars.
- Type-0 grammars (unrestricted grammars) include all formal grammars. They generate exactly all languages that can be recognized by a Turing machine.
- Type-1 grammars (context-sensitive grammars). These grammars have rules of the form αAβ → αγβ with A a nonterminal and α, β and γ strings of terminals and nonterminals. The strings α and β may be empty, but γ must be nonempty. The rule S → ε is allowed if S does not appear on the right side of any rule. The languages described by these grammars are exactly all languages that can be recognized by a non-deterministic Turing machine whose tape is bounded by a constant times the length of the input.


Chomsky Hierarchy

- Type-2 grammars (context-free grammars) generate the context-free languages. These are defined by rules of the form A → γ with A a nonterminal and γ a string of terminals and nonterminals. Context-free languages are the theoretical basis for the syntax of most programming languages.
- Type-3 grammars (regular grammars) generate the regular languages. Such a grammar restricts its rules to a single nonterminal on the left-hand side and a right-hand side consisting of a single terminal, possibly followed (or preceded, but not both in the same grammar) by a single nonterminal. The rule S → ε is also allowed here if S does not appear on the right side of any rule. This family of formal languages can be obtained by regular expressions. Regular languages are commonly used to define search patterns and the lexical structure of programming languages.

Parsing

Parsing problem: the interesting problem concerning grammars is how to efficiently recognize when a string is a sentence of a grammar.
BNF: a simple arithmetic expression grammar
  <exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | <number>
  <number> ::= <number> <digit> | <digit>
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9


Parsing: Example

Justify 352 as a Number.
- Derive the string from the rules in a sequence of steps.
- Begin with the start symbol S=number
1. Form the string Number Digit as a particular kind of Number, from the first alternative in the second rule.
2. Substitute Number Digit for Number in the string, again using the second rule, gaining the string Number Digit Digit.
3. Substitute Digit for Number, using the second alternative in the second rule, gaining Digit Digit Digit.
4. Substitute 3 as a particular kind of Digit from the third rule, achieving 3 Digit Digit.
5. Substitute 5 for Digit in the string, achieving the string 3 5 Digit.
6. Finally, substitute 2 for Digit in the string, achieving 3 5 2.

We just parsed the string 352 as an instance of the grammatical category Integer.
The parsing process is used in the design and analysis of programming language syntax.
- Clearer style than English for expressing a sequence of steps.
- Describe the parse graphically in the form of a parse tree.
The parse tree is labeled by nonterminals at interior nodes and terminals at leaves.
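The six substitution steps for 352 can be replayed mechanically. The following sketch records each intermediate string of the derivation. The function name `derive_number` is illustrative, and the rule names in the comments follow the Number and Digit categories used in the steps.

```python
def derive_number(digits):
    """Return the successive strings in a derivation of `digits` from Number."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    form = ["Number"]
    steps = [list(form)]
    # Number ::= Number Digit, applied once for every digit after the first.
    for _ in range(len(digits) - 1):
        form = ["Number", "Digit"] + form[1:]
        steps.append(list(form))
    # Number ::= Digit ends the recursion.
    form = ["Digit"] + form[1:]
    steps.append(list(form))
    # Digit ::= 0 | ... | 9, replacing the leftmost remaining Digit each time.
    for i, ch in enumerate(digits):
        form = form[:i] + [ch] + form[i + 1:]
        steps.append(list(form))
    return [" ".join(step) for step in steps]
```

Calling `derive_number("352")` yields seven strings: the start symbol followed by the results of steps 1 through 6.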


Parse Tree: Example

[Figure: parse tree for the number 352, with leaves 3, 5, 2]

Abstract Syntax Tree

Not all the terminals and nonterminals may be necessary to determine completely the syntactic structure of an expression.
Example: the structure of the number 352

[Figure: abstract syntax tree for 352, with leaves 3, 5, 2]


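Abstract Syntax Tree: Example

3+4*5

[Figure: parse tree and abstract syntax tree for 3+4*5, each with + at the root, * below it, and leaves 3, 4, 5]

Abstract Syntax Tree

An abstract syntax tree may also do away with terminals that are redundant once the structure of the tree is determined.
Example: in an if-then-else rule, the keywords if, then, and else are redundant once the tree structure is determined.

[Figure: if-then-else rule and its abstract syntax tree]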


Ambiguity

Parsing natural languages is difficult due to the ambiguity of language.
Ambiguity: the sentence can be understood in two different ways (two or more different parse trees).
Either the grammar must be revised to remove the ambiguity, or a disambiguating rule must be stated to establish which structure is meant.

Ambiguity: Example

Same example: 3+4*5

      +            *
     / \          / \
    3   *        +   5
       / \      / \
      4   5    3   4

Which of the two parse trees is the correct one for the expression 3+4*5?


Ambiguity: Resolution

If operations are applied in a different order then the resulting semantics are quite different.
- First syntax tree: 23
- Second syntax tree: 35
- Meaning from mathematics: choose the first tree, since multiplication has precedence over addition.
State a disambiguation rule separately from the grammar or revise the grammar.
- The usual way to revise the grammar is to write a new grammar rule that establishes a "precedence cascade" to force the matching of the "*" at a lower point in the parse tree.
- <exp> ::= <exp> + <exp> | <term>
- <term> ::= <term> * <term> | ( <exp> ) | <number>

Associativity

There is still an ambiguity problem:
- The rule for <exp> still allows 3+4+5 to be parsed as either (3+4)+5 or 3+(4+5).
Addition is either right- or left-associative.

      +            +
     / \          / \
    3   +        +   5
       / \      / \
      4   5    3   4
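The values 23 and 35 quoted for the two syntax trees of 3+4*5 can be checked by evaluating each tree directly. In this sketch a tree is either an integer leaf or a hypothetical `(operator, left, right)` tuple. The names `evaluate`, `first_tree`, and `second_tree` are illustrative.

```python
def evaluate(tree):
    """Evaluate a tree given as an int leaf or an (op, left, right) tuple."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unknown operator {op!r}")

first_tree = ("+", 3, ("*", 4, 5))    # "*" lower in the tree: 3 + (4 * 5)
second_tree = ("*", ("+", 3, 4), 5)   # "+" lower in the tree: (3 + 4) * 5
```

The operator that sits lower in the tree is applied first, which is why the precedence cascade pushes "*" downward.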


Associativity: Example (1)

In the case of addition this does not affect the result.
In the case of subtraction it surely would: 8-4-2=2 if minus is left-associative, but 8-4-2=6 if minus is right-associative.
Replace rule
  <exp> ::= <exp> + <exp>
with
  <exp> ::= <exp> + <term>
or
  <exp> ::= <term> + <exp>
The first rule is left-recursive while the second is right-recursive.

Associativity: Example (2)

A left-recursive rule for an operation causes it to left-associate, while a right-recursive rule causes it to right-associate.

[Figure: left-associative and right-associative parse trees for 3+4+5]
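The effect of associativity on 8-4-2 can be reproduced with two folds over the operand list, one grouping from the left and one from the right. The function names `left_assoc` and `right_assoc` are illustrative.

```python
from functools import reduce

def left_assoc(values):
    """Group as ((a - b) - c) ..., as a left-recursive rule would."""
    return reduce(lambda acc, x: acc - x, values)

def right_assoc(values):
    """Group as a - (b - (c ...)), as a right-recursive rule would."""
    # Folding over the reversed list builds the innermost (rightmost)
    # subtraction first.
    return reduce(lambda acc, x: x - acc, reversed(values))
```

With the operands `[8, 4, 2]` the left grouping computes (8-4)-2 and the right grouping computes 8-(4-2).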


Ambiguity

The BNF for simple arithmetic expressions is now unambiguous.
Sometimes the process of rewriting a grammar to eliminate ambiguity causes the grammar to become extremely complex, and in such cases we prefer to state a disambiguation rule.

Dangling else problem

A classical example of ambiguity in programming languages.
Occurs when two adjacent if statements are followed by an else statement.

  if (x<0)
  if (y<0) y = y-1;
  else y=0;

Parse attaches the else clause to the second if statement
- y will become 0 whenever x<0 and y>=0.
Parse attaches the else clause to the first if statement
- y will become 0 whenever x>=0.
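The two readings of the dangling else can be written out explicitly. Python's indentation forces a choice, so each interpretation becomes its own function. The function names are illustrative, and each returns the final value of y.

```python
def else_with_second_if(x, y):
    """Reading 1: the else belongs to the inner if (y < 0)."""
    if x < 0:
        if y < 0:
            y = y - 1
        else:
            y = 0      # reached when x < 0 and y >= 0
    return y

def else_with_first_if(x, y):
    """Reading 2: the else belongs to the outer if (x < 0)."""
    if x < 0:
        if y < 0:
            y = y - 1
    else:
        y = 0          # reached whenever x >= 0
    return y
```

The two functions agree when x < 0 and y < 0 and differ otherwise. With x = -1 and y = 5, for instance, the first reading sets y to 0 while the second leaves it unchanged.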


Dangling else problem

ALGOL 60 introduced the if-then and the if-then-else statements
  S → if C then S | if C then S else S | S'
- Sequence of tokens: if C1 then S1 else if C2 then S2 else S3 has only one interpretation.
- Sequence of tokens: if C1 then if C2 then S1 else S2 has two interpretations:

[Figure: two parse trees. In the first, S → if C1 then S, where the inner S is if C2 then S1 else S2. In the second, S → if C1 then S else S2, where the inner S is if C2 then S1.]

Dangling else problem: Solutions

ALGOL 60:
- Prohibited the nested if statement, as it could always be avoided by using the begin/end statement.
PL/I and Pascal:
- Adopted the solution of matching dangling else to the nearest unmatched if statement.
ALGOL 68:
- Introduced the keyword fi.
- Ada solves the problem with end if.
The "terminating keyword" solution appears to be generally favored over the "nearest unmatched" solution in more recent programming languages.

Dangling else problem: Solutions

Java solves the problem by expanding the BNF grammar for if statements in a rather bizarre way.
- Separates the definition into two different syntactic categories (IfThenStatement, IfThenElseStatement), each of which is a subcategory of the general category Statement.
  IfThenStatement → if (Expression) Statement
  IfThenElseStatement → if (Expression) StatementNoShortIf else Statement

Variations on BNF and EBNF

In place of the arrow, a colon is used and the RHS is placed on the next line.
Instead of a vertical bar to separate alternative RHSs, they are simply placed on separate lines.
In place of square brackets to indicate something being optional, the subscript opt is used.
Rather than using the | symbol in a parenthesized list of elements to indicate a choice, the words "one of" are used.


Derivation

Method for describing the parse of a string.
A derivation is a simple linear representation of a parse tree
- more helpful when the string being derived has a simple grammatical structure
Example: derivation of 352
  Number ⇒ Number Digit ⇒ Number Digit Digit ⇒ Digit Digit Digit
         ⇒ 3 Digit Digit ⇒ 3 5 Digit ⇒ 3 5 2

Derivation

Sentential form: each string on the right of a double arrow.
- Contains terminal and nonterminal symbols.
- Left end of the derivation is the start symbol S
- Each intermediate step creates a sentential form
  It results from replacing the leftmost nonterminal symbol by a string of terminals and nonterminals that appears on the right-hand side of some rule that has the same symbol on its left-hand side.
- Derivations that use this order of replacement are called leftmost derivations.


Derivation

In addition to leftmost, a derivation may be rightmost or in an order that is neither leftmost nor rightmost.
Derivation order has no effect on the language generated by a grammar.
Different sentences in the language can be generated
- by choosing alternative RHSs of rules with which to replace nonterminals in the derivation.
The language defined by a BNF grammar is the set of all strings that can be parsed, or derived, using the rules of the grammar.
