Some Preliminaries • For the next several weeks we’ll look at how one can define a programming language • What is a language, anyway? “Language is a system of gestures, grammar, signs, 3 sounds, symbols, or words, which is used to represent and communicate concepts, ideas, meanings, and thoughts” • Human language is a way to communicate representaons from one (human) mind to another Syntax • What about a programming language? A way to communicate representaons (e.g., of data or a procedure) between human minds and/or machines CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.

Introducon Why and How We usually break down the problem of defining Why? We want specificaons for several a programming language into two parts communies: • Other language designers • defining the PL’s syntax • Implementers • defining the PL’s semancs • Machines? Syntax - the form or structure of the • Programmers (the users of the language) expressions, statements, and program units How? One way is via natural language descrip- Semancs - the meaning of the expressions, ons (e.g., user manuals, text books) but there statements, and program units are more formal techniques for specifying the There’s not always a clear boundary between syntax and semancs the two Syntax part Syntax Overview

• Language preliminaries • Context-free grammars and BNF • Syntax diagrams

This is an overview of the standard process of turning a text file into an executable program.

Introducon Lexical Structure of Programming Languages A sentence is a string of characters over some

alphabet (e.g., def add1(n): return n + 1) • The structure of its lexemes (words or tokens) – token is a category of lexeme A language is a set of sentences • The scanning phase (lexical analyser) collects characters into tokens A lexeme is the lowest level syntacc unit of a • phase (syntacc analyser) determines language (e.g., *, add1, begin) syntacc structure

A token is a category of lexemes (e.g., idenfier) Stream of tokens and Result of characters values parsing Formal approaches to describing syntax:

• Recognizers - used in compilers lexical Syntactic • Generators - what we'll study analyser analyser Role of Grammars in PLs • A (formal) grammar is a set of rules for • A grammar defines a (programming) strings in a language, at least w.r.t. its syntax

• The rules describe how to form strings • It can be used to from the language’s alphabet that are valid – Generate sentences in the language according to the language's syntax – Test whether a string is in the language • A grammar does not describe the meaning – Produce a structured representation of of the strings or what can be done with the string that can be used in them in whatever context — only their form subsequent processing

Adapted from Wikipedia

Grammars

Context-Free Grammars • Developed by Noam Chomsky in the mid-1950s • Language generators, meant to describe the syntax of natural languages • Define a class of languages called context-free languages

Backus Normal/Naur Form (1959) • Invented by John Backus to describe Algol 58 and refined by Peter Naur for Algol 60. • BNF is equivalent to context-free grammars

•Chomsky & Backus independently came up with equiv- alent formalisms for specifying the syntax of a language •Backus focused on a praccal way of specifying an BNF (connued) arficial language, like Algol •Chomsky made fundamental contribuons to mathe- macal linguiscs and was movated by the study of A metalanguage is a language used to describe human languages another language. In BNF, abstracons are used to represent classes of syntacc structures -- they act like NOAM CHOMSKY, MIT Instute Professor; syntacc variables (also called nonterminal Professor of Linguiscs, Linguisc Theory, Syntax, symbols), e.g. Semancs, Philosophy of Language ::= while do This is a rule; it describes the structure of a while Six parcipants from the 1960 Algol conference at the 1974 statement ACM conference on the history of programming languages. Top: John McCarthy, Fritz Bauer, Joe Wegstein. Boom: John Backus, Peter Naur, Alan Perlis.

BNF Non-terminals, pre-terminals & terminals • A rule has a le-hand side (LHS) which is a single • A non-terminal symbol is any symbol in the RHS of a non-terminal symbol and a right-hand side (RHS), rule. They represent abstracons in the language, e.g. one or more terminal or non-terminal symbols ::= if • A grammar is a finite, nonempty set of rules then else • A non-terminal symbol is “defined” by its rules • A terminal symbol (AKA lexemes) is a symbol that’s not • Mulple rules can be combined with the vercal-bar in any rule’s LHS. They are literal symbols that will ( | ) symbol (read as “or”) appear in a program (e.g., if, then, else in rule above) • These two rules: ::= if ::= then else ::= ; • A pre-terminal symbol is one that appears as a LHS of are equivalent to this one: rules, but in every case, the RHSs consist of single ::= | ; terminal symbol, e.g., in ::= 0 | 1 | 2 | 3 … 7 | 8 | 9 BNF BNF Example • Repeon is done with recursion A simple grammar for a subset of English • E.g., Syntacc lists are described in BNF A sentence is noun phrase and verb phrase using recursion followed by a period ! • An is a sequence of one or more ::= .! ::=

! s separated by commas.
::= a | the! ::= man | apple | worm | penguin! ::= |! ::= | ::= eats | throws | sees | is! ,

Derivaons Derivaon using BNF • A derivation is a repeated application of rules, beginning with the start symbol and ending with a sequence of just Here is a derivation for “the man eats the apple.” terminal symbols • It demonstrates, or proves, that the derived sentence is -> . “generated” by the grammar and is thus in the language

. that the grammar defines the . • As an example, consider our baby English grammar the man . ::= .! the man . ::=
! This is a
::= a | the! lemost the man eats . ::= man | apple | worm | penguin! derivaon the man eats
< noun> . ::= | ! ::= eats | throws | sees | is the man eats the . the man eats the apple . Derivaon Another BNF Example -> Note: There is some Every string of symbols in the derivaon is a variaon in notaon -> for BNF grammars. sentenal form (e.g., the man eats .) | ; Here we are using -> in the rules instead -> = of ::= . A sentence is a sentenal form that has only -> a | b | c | d terminal symbols (e.g., the man eats the apple .) -> + | - -> | const A lemost derivaon is one in which the le- most nonterminal in each sentenal form is the one that is expanded in the next step A derivaon may be either lemost or rightmost or something else

Another BNF Example Finite and Infinite languages -> Note: There is some variaon in notaon -> for BNF grammars. • A simple language may have a finite number | ; Here we are using -> in the rules instead of sentences -> = of ::= . -> a | b | c | d • The set of strings represenng integers -> + | - between -10**6 and +10**6 is a finite -> | const language Here is a derivaon: • A finite language can be defined by enumer- => => ang the sentences, but using a grammar is => = usually easier (e.g., small integers) => a = => a = + • Most interesng languages have an infinite => a = + number of sentences => a = b + => a = b + const! Is English a finite or infinite language? Rules with epsilons

• Assume we have a finite set of words • An epsilon (ε) symbol expands to ‘nothing’ • Consider adding rules like the following to the • It is used as a compact way to make previous example something optional ::= .! • For example: ::= and | or | because NP -> DET NP PP • Hint: When you see recursion in a BNF, the PP -> PREP NP | ε language is probably infinite PPEP -> in | on | of | under –When might it not be? • You can always rewrite a grammer with ε –The recursive rule might not be reachable. symbols to an equivalent one without There might be epsilons.

Parse Tree Another Parse Tree A parse tree is a hierarchical representation of a derivation ! ! ! ! ! ! ! !

! = ! ! the man eats !
a + ! ! the apple ! const! ! ! b ! Ambiguous English Sentences

• A grammar is ambiguous if and only if • (iff) it generates a sentenal form that I saw the man on the hill with a has two or more disnct parse trees telescope • Ambiguous grammars are, in general, • Time flies like an arrow very undesirable in formal languages • Fruit flies like a banana • Can you guess why? • Buffalo buffalo Buffalo buffalo • We can eliminate ambiguity by revising buffalo buffalo Buffalo buffalo the grammar

See: Syntacc Ambiguity

-> -> 1|2|3 An ambiguous grammar Two derivaons for 1+2*3 -> +|-|*|/ Here is a simple grammar for expressions that -> ! -> ! -> 1 ! -> ! is ambiguous -> 1 + ! -> 1 ! -> 1 + ! -> 1 + ! -> ! -> 1 + 2 ! -> 1 + 2 ! -> 1 + 2 * ! -> 1 + 2 * ! -> 1|2|3! Fyi… In a programming language, an expression is some code that is evaluated and produces a -> 1 + 2 * 3! -> 1 + 2 * 3! -> +|-|*|/! value. A statement is code that is executed and does something but does not produce a value. ! ! e e The sentence 1+2*3 can lead to two different e op e e op op parse trees corresponding to 1+(2*3) and * (1+2)*3 1 + e op e e op e 3

2 * 3 1 + 2 -> -> -> 1|2|3 -> 1|2|3 Two derivaons for 1+2*3 -> +|-|*|/ Two derivaons for 1+2*3 -> +|-|*|/

-> ! -> ! ! -> 1 ! -> ! e e -> 1 + ! -> 1 ! -> 1 + ! -> 1 + ! -> 1 + 2 ! -> 1 + 2 ! e op e e op e -> 1 + 2 * ! -> 1 + 2 * ! -> 1 + 2 * 3! -> 1 + 2 * 3! * 3 1 + e op e ! ! e op e e e 2 * 3 1 + 2 e op e e op op + *

1 + e op e 3 e op e * 3 1 * Abstract syntax trees +

2 * 3 1 + 2 2 3 1 2 The leaves of the trees are terminals and correspond to the sentence Lower trees represent the way we think about the sentence ‘meaning’

Operators • The tradional operator notaon introduces many problems • Operators are used in – Prefix notaon: Expression (* (+ 1 3) 2) in Lisp – Infix notaon: Expression (1 + 3) * 2 in Java – Posix notaon: Increment foo++ in C • Operators can have one or more operands – Increment in C is a unary operator: foo++ – Subtracon in C is a binary operator: foo - bar – Condional expression in C is a ternary operators: (foo == 3 ? 0 : 1) Operator notaon Precedence and Associavity • How do we interpret expressions like • Precedence and associavity deal with the evaluaon order within expressions (a) 2 + 3 + 4 (b) 2 + 3 * 4 • Precedence rules specify order in which • Is 2 + 3 * 4 computed as operators of different precedence level are – 5 * 4 = 20 evaluated, e.g.: – 2 + 12 = 14 “*” Has a higher precedence that “+”, so “*” groups more ghtly than “+” • You might argue that it doesn’t maer for (a), • Higher-precedence operators applied first but it can when the limits of representaon are hit (e.g., round off in numbers) 2+3*4 = 2+(3*4) = 2+12 = 14 • Key concepts: explaining rules in terms of • What is the results of 4 * 3 ** 2 if * has lower operator precedence and associavity and precedence that **? realizing these in BNF/CFG grammars

Precedence and Associavity Precedence and Associavity

• A language’s precedence hierarchy should • A language’s precedence hierarchy should match our convenons and intuions, but the match our convenons and intuions, but the result’s not always perfect, as in this Pascal result’s not always perfect, as in this Pascal example: example: If A > 0 and A < 10 then A := 0 ; If A > 0 and A < 10 then A := 0 ; • Pascal’s relaonal operators (e.g., <) have • Pascal’s relaonal operators (e.g., <) have lowest precedence! lowest precedence! • So how is this interpreted? • So we have:

If A > 0 and A < 10 then A := 0 ; If (A > (0 and A) < 10) then A := 0 ;

Pascal compiler finds a type error Operator Precedence: Precedence Table If (A > (0 and A) < 10) then A := 0 ;

http://www.compileonline.com/

Operator Precedence: Precedence Table Operators: Associavity • Associavity rules specify order to evaluate operators of same precedence level • Operators are typically either le associave or right associave. • Le associavity is typical for +, - , * and / • So A + B + C – Means: (A + B) + C – And not: A + (B + C) • Does it maer? Operators: Associavity Operators: Associavity

• For + and * it doesn’t maer in theory •Most languages use a similar associavity rules for the most common operators (though it can in pracce) but for – and / it maers in theory, too • But Languages diverge on exponenaon and some others • What should A-B-C mean? – In Fortran, ** associates from right-to-le, (A – B) – C ≠ A – (B – C) as in normally the case for mathemacs • What is the results of 2 ** 3 ** 4 ? – In Ada, ** doesn’t associate; you must – 2 ** (3 ** 4) = 2 ** 81 = write the previous expression as 2 ** (3 ** 2417851639229258349412352 4) to obtain the expected answer – (2 ** 3) ** 4 = 8 ** 4 = 256 • Ada’s goal of reliability may explain the decision: make programmers be explicit here

Associavity in C Comment on example in Python

• In C, as in most languages, most of the >>> a = b = 1 # this is a special Python idiom and does! operators associate le to right >>> print a,b # what you would expect! 1 1! >>> a = (b = 2) # but assignment is a statement and does! a + b + c => (a + b) + c File "", line 1 # not return a value. ! a = (b = 2)! • The various assignment operators however ^! associate right to le SyntaxError: invalid syntax! >>> a ! = += -= *= /= %= >>= <<= &= ^= |= 1! >>> a += 1 # Python has the += assignment operator! >>> a # which works just as in other languages! • Consider a += b += c, which is interpreted as 2! >>> a += b += 1 # But this idiom is not supported! a += (b += c) and not as (a += b) += c File "", line 1 # and it is probably not a big loss! a += b += 1! • Why? What does it do in Python? ^! SyntaxError: invalid syntax! Precedence and associavity in Grammar

If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity An unambiguous expression grammar: -> - | -> / const | const !

Precedence and associavity in Grammar Grammar (connued) Sentence: const – const / const Operator associavity can also be Derivaon: => - indicated by a grammar => - => const - -> + | const (ambiguous)! Parse tree: => const - / const ! => const - const / const -> + const | const (unambiguous) ! - ! ! / const + const! ! Does this grammar rule + const! const const ! make the + operator right const! or le associave?

An Expression Grammar A derivaon Here is a grammar to define simple arithmec A derivaon of a+b*2 using the expression grammar: expressions over variables and numbers.

Exp => // Exp ::= Exp BinOp Exp! Exp ::= num Another common Exp BinOp Exp => // Exp ::= id! Exp ::= id notaon variant id BinOp Exp => // BinOp ::= '+'! Exp ::= UnOp Exp where single id + Exp => // Exp ::= Exp BinOp Exp! Exp := Exp BinOp Exp quotes are used to indicate terminal id + Exp BinOp Exp => // Exp ::= num! Exp ::= '(' Exp ')' id + Exp BinOp num => // Exp ::= id! symbols and UnOp ::= '+' unquoted symbols id + id BinOp num => // BinOp ::= '*'! UnOp ::= '-' are taken as non- id + id * num! terminals. BinOp ::= '+' | '-' | '*' | '/ a + b * 2!

A parse tree Precedence • Precedence refers to the order in which A parse tree for a+b*2: operaons are evaluated • Usual convenon: exponents > mult, div > add, sub __Exp__! • Deal with operaons in categories: exponents, / | \! mulops, addops. Exp BinOp Exp! • A revised grammar that follows these convenons: | | / | \! Exp ::= Exp AddOp Exp! id + Exp BinOp Exp! Exp ::= Term! Term ::= Term MulOp Term! | | | |! Term ::= Factor! a id * num! Factor ::= '(' + Exp + ')‘! | |! Factor ::= num | id! AddOp ::= '+' | '-’! b 2! MulOp ::= '*' | '/'! Associavity Adding associavity to the grammar

• Associavity refers to the order in which two Adding associavity to the BinOp expression of the same operaon should be computed grammar Exp ::= Exp AddOp Term! • 3+4+5 = (3+4)+5, le associave (all BinOps) Exp ::= Term ! • 3^4^5 = 3^(4^5), right associave Term ::= Term MulOp Factor! • We’ll see that condionals right associate but Term ::= Factor ! Factor ::= '(' Exp ')'! with a wrinkle: an else clause associates with Factor ::= num | id! closest unmatched if AddOp ::= '+' | '-'! if a then if b then c else d MulOp ::= '*' | '/'! = if a then (if b then c else d)

Grammar Exp ::= Exp AddOp Term! Derivaon Exp ::= Term! Exp =>! Term ::= Term MulOp Factor! Term ::= Factor ! Exp AddOp Term =>! Factor ::= '(' Exp ')’! Exp AddOp Exp AddOp Term =>! Factor ::= num | id! AddOp ::= '+' | '-‘! Term AddOp Exp AddOp Term =>! MulOp ::= '*' | '/'! Factor AddOp Exp AddOp Term =>! Num AddOp Exp AddOp Term =>! Parse tree Num + Exp AddOp Term =>! E Num + Factor AddOp Term =>! Num + Num AddOp Term =>! E A T Num + Num - Term =>! - F Num + Num - Factor =>! E A T Num + Num - Num! num ! T + F F num

num Example: condionals Example: condionals • Most languages allow two condional forms, • All languages use a standard rule to determine with and without an else clause: which if expression an else clause aaches to – if x < 0 then x = -x • The rule: – if x < 0 then x = -x else x = x+1 – An else clause aaches to the nearest if to • But we’ll need to decide how to interpret: its le that does not yet have an else clause – if x < 0 then if y < 0 x = -1 else x = -2 • Example: • To which if does the else clause aach? – if x < 0 then if y < 0 x = -1 else x = -2 • This is like the syntacc ambiguity in aach- – if x < 0 then if y < 0 x = -1 else x = -2 ment of preposional phrases in English – the man near a cat with a hat

Example: condionals Example: condionals • Goal: create a correct grammar for condionals The final unambiguous grammar • Must be non-ambiguous and associate else with Statement ::= Matched | Unmatched nearest unmatched if Matched ::= 'if' test 'then' Matched 'else' Matched Statement ::= Conditional | 'whatever'! | 'whatever' Conditional ::= 'if' test 'then' Statement 'else‘ Statement! Unmatched ::= 'if' test 'then' Statement Conditional ::= 'if' test 'then' Statement! | 'if' test 'then' Matched ‘else’ Unmatched ! • The grammar is ambiguous. The first Condional allows unmatched ifs to be Condionals – Good: if test then (if test then whatever else whatever) – Bad: if test then (if test then whatever) else whatever • Goal: write grammar that forces an else clause to aach to the nearest if w/o an else clause Syntacc Sugar Extended BNF • Syntacc sugar: syntacc features designed to Syntacc sugar: doesn’t extend the expressive make code easier to read or write while power of the formalism, but makes it easier to alternaves exist use, i.e., more readable and more writable • Makes a language sweeter for humans to use: • Oponal parts are placed in brackets ([]) things can be expressed more clearly, con- -> ident [ ( )] cisely or in an preferred style • Put alternave parts of RHSs in parentheses and

• Syntacc sugar can be removed from language separate them with vercal bars without effecng what can be done -> (+ | -) const • All applicaons of the construct can be systemacally replaced with equivalents that • Put repeons (0 or more) in braces ({}) don’t use it -> leer {leer | digit} adapted from Wikipedia

BNF vs EBNF Syntax Graphs BNF: Syntax Graphs - Put the terminals in ellipses and -> + put the nonterminals in rectangles; connect with | - lines with arrowheads | e.g., Pascal type declaraons -> * | / Provides an intuive, graphical notaon! | type_identifier! EBNF:! (! identifier! )! -> {(+ | -) } ,! constant! constant! -> {(* | /) } ..! Parsing • A grammar describes the strings of tokens that are syntaccally legal in a PL • A recogniser simply accepts or rejects strings. • A generator produces sentences in the language described by the grammar • A parser construct a derivaon or parse tree for a sentence (if possible) • Two common types of parsers are: – boom-up or data driven – top-down or hypothesis driven • A is a way to implement a top-down parser that is parcularly simple.

Parsing complexity Parsing complexity • How hard is the parsing task? 3 • Parsing an arbitrary context free grammar is O(n ), • How hard is the parsing task? e.g., it can take me proporonal the cube of the number of symbols in the input. This is bad! • Parsing an arbitrary context free grammar 3 • If we constrain the grammar somewhat, we can is O(n ) in the worst case. always parse in linear me. This is good! • E.g., it can take me proporonal the cube • Linear-me parsing of the number of symbols in the input – LL parsers • LL(n) : Le to right, Lemost derivaon, • So what? » Recognize LL grammar look ahead at most n » Use a top-down strategy symbols. • This is bad! – LR parsers • LR(n) : Le to right, Right derivaon, look » Recognize LR grammar ahead at most n » Use a boom-up strategy symbols. Parsing complexity Linear complexity parsing • Praccal parsers have me complexity that is linear • If it takes t1 seconds to parse your C program in the number of tokens, i.e., O(n) with n lines of code, how long will it take to • If v2.0 or your program is twice as long, it will take take if you make it twice as long? twice as long to parse 3 * -me(n) = t 1, me(2n) = 2 me(n) • This is achieved by modifying the grammar so it can be parsed more easily -8 mes longer • Linear-me parsing • Suppose v3 of your code is has 10n lines? • LL(n) : Le to right, – LL parsers Lemost derivaon, • 103 or 1000 mes as long » Recognize LL grammar look ahead at most n » Use a top-down strategy symbols. • Windows Vista was said to have ~50M lines – LR parsers • LR(n) : Le to right, of code » Recognize LR grammar Right derivaon, look » Use a boom-up strategy ahead at most n symbols.

Recursive Decent Parsing Hierarchy of Linear Parsers

• Basic containment relationship •Each nonterminal in the grammar has a – All CFGs can be recognized by LR parser subprogram associated with it; the – Only a subset of all the CFGs can be recognized by LL parsers subprogram parses all sentenal forms that the nonterminal can generate

• The recursive descent parsing CFGs LR parsing subprograms are built directly from the grammar rules LL parsing • Recursive descent parsers, like other top- down parsers, cannot be built from le- recursive grammars (why not?) Recursive Decent Parsing Example The Example: For the grammar: -> {(*|/)}

We could use the following recursive descent parsing subprogram (e.g., one in C)

void term() { factor(); /* parse first factor*/ •The Chomsky hierarchy while (next_token == ast_code || has four types of languages and their associated grammars and machines. next_token == slash_code) { lexical(); /* get next token */ •They form a strict hierarchy; that is, regular languages < context-free factor(); /* parse next factor */ languages < context-sensive languages < recursively enumerable languages. } •The syntax of computer languages are usually describable by regular or } context free languages.

Summary

• The syntax of a programming language is usually defined using BNF or a context free grammar • In addion to defining what programs are syntaccally legal, a grammar also encodes meaningful or useful abstracons (e.g., block of statements) • Typical syntacc noons like operator precedence, associavity, sequences, oponal statements, etc. can be encoded in grammars • A parser is based on a grammar and takes an input string, does a derivaon and produces a parse tree.