Some Preliminaries • For the next several weeks we’ll look at how one can define a programming language • What is a language, anyway? “Language is a system of gestures, grammar, signs, 3 sounds, symbols, or words, which is used to represent and communicate concepts, ideas, meanings, and thoughts” • Human language is a way to communicate representa ons from one (human) mind to another Syntax • What about a programming language? A way to communicate representa ons (e.g., of data or a procedure) between human minds and/or machines CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Introduc on Why and How We usually break down the problem of defining Why? We want specifica ons for several a programming language into two parts communi es: • Other language designers • defining the PL’s syntax • Implementers • defining the PL’s seman cs • Machines? Syntax - the form or structure of the • Programmers (the users of the language) expressions, statements, and program units How? One way is via natural language descrip- Seman cs - the meaning of the expressions, ons (e.g., user manuals, text books) but there statements, and program units are more formal techniques for specifying the There’s not always a clear boundary between syntax and seman cs the two Syntax part Syntax Overview
• Language preliminaries • Context-free grammars and BNF • Syntax diagrams
This is an overview of the standard process of turning a text file into an executable program.
Introduc on Lexical Structure of Programming Languages A sentence is a string of characters over some
alphabet (e.g., def add1(n): return n + 1) • The structure of its lexemes (words or tokens) – token is a category of lexeme A language is a set of sentences • The scanning phase (lexical analyser) collects characters into tokens A lexeme is the lowest level syntac c unit of a • Parsing phase (syntac c analyser) determines language (e.g., *, add1, begin) syntac c structure
A token is a category of lexemes (e.g., iden fier) Stream of tokens and Result of characters values parsing Formal approaches to describing syntax:
• Recognizers - used in compilers lexical Syntactic • Generators - what we'll study analyser analyser Formal Grammar Role of Grammars in PLs • A (formal) grammar is a set of rules for • A grammar defines a (programming) strings in a formal language language, at least w.r.t. its syntax
• The rules describe how to form strings • It can be used to from the language’s alphabet that are valid – Generate sentences in the language according to the language's syntax – Test whether a string is in the language • A grammar does not describe the meaning – Produce a structured representation of of the strings or what can be done with the string that can be used in them in whatever context — only their form subsequent processing
Adapted from Wikipedia
Grammars
Context-Free Grammars • Developed by Noam Chomsky in the mid-1950s • Language generators, meant to describe the syntax of natural languages • Define a class of languages called context-free languages
Backus Normal/Naur Form (1959) • Invented by John Backus to describe Algol 58 and refined by Peter Naur for Algol 60. • BNF is equivalent to context-free grammars
•Chomsky & Backus independently came up with equiv- alent formalisms for specifying the syntax of a language •Backus focused on a prac cal way of specifying an BNF (con nued) ar ficial language, like Algol •Chomsky made fundamental contribu ons to mathe- ma cal linguis cs and was mo vated by the study of A metalanguage is a language used to describe human languages another language. In BNF, abstrac ons are used to represent classes of syntac c structures -- they act like NOAM CHOMSKY, MIT Ins tute Professor; syntac c variables (also called nonterminal Professor of Linguis cs, Linguis c Theory, Syntax, symbols), e.g. Seman cs, Philosophy of Language
BNF Non-terminals, pre-terminals & terminals • A rule has a le -hand side (LHS) which is a single • A non-terminal symbol is any symbol in the RHS of a non-terminal symbol and a right-hand side (RHS), rule. They represent abstrac ons in the language, e.g. one or more terminal or non-terminal symbols
Deriva ons Deriva on using BNF • A derivation is a repeated application of rules, beginning with the start symbol and ending with a sequence of just Here is a derivation for “the man eats the apple.” terminal symbols • It demonstrates, or proves, that the derived sentence is
Another BNF Example Finite and Infinite languages
• Assume we have a finite set of words • An epsilon (ε) symbol expands to ‘nothing’ • Consider adding rules like the following to the • It is used as a compact way to make previous example something optional
Parse Tree Another Parse Tree A parse tree is a hierarchical representation of a derivation
• A grammar is ambiguous if and only if • (iff) it generates a senten al form that I saw the man on the hill with a has two or more dis nct parse trees telescope • Ambiguous grammars are, in general, • Time flies like an arrow very undesirable in formal languages • Fruit flies like a banana • Can you guess why? • Buffalo buffalo Buffalo buffalo • We can eliminate ambiguity by revising buffalo buffalo Buffalo buffalo the grammar
See: Syntac c Ambiguity
2 * 3 1 + 2
1 + e op e 3 e op e * 3 1 * Abstract syntax trees +
2 * 3 1 + 2 2 3 1 2 The leaves of the trees are terminals and correspond to the sentence Lower trees represent the way we think about the sentence ‘meaning’
Operators • The tradi onal operator nota on introduces many problems • Operators are used in – Prefix nota on: Expression (* (+ 1 3) 2) in Lisp – Infix nota on: Expression (1 + 3) * 2 in Java – Pos ix nota on: Increment foo++ in C • Operators can have one or more operands – Increment in C is a unary operator: foo++ – Subtrac on in C is a binary operator: foo - bar – Condi onal expression in C is a ternary operators: (foo == 3 ? 0 : 1) Operator nota on Precedence and Associa vity • How do we interpret expressions like • Precedence and associa vity deal with the evalua on order within expressions (a) 2 + 3 + 4 (b) 2 + 3 * 4 • Precedence rules specify order in which • Is 2 + 3 * 4 computed as operators of different precedence level are – 5 * 4 = 20 evaluated, e.g.: – 2 + 12 = 14 “*” Has a higher precedence that “+”, so “*” groups more ghtly than “+” • You might argue that it doesn’t ma er for (a), • Higher-precedence operators applied first but it can when the limits of representa on are hit (e.g., round off in numbers) 2+3*4 = 2+(3*4) = 2+12 = 14 • Key concepts: explaining rules in terms of • What is the results of 4 * 3 ** 2 if * has lower operator precedence and associa vity and precedence that **? realizing these in BNF/CFG grammars
Precedence and Associa vity Precedence and Associa vity
• A language’s precedence hierarchy should • A language’s precedence hierarchy should match our conven ons and intui ons, but the match our conven ons and intui ons, but the result’s not always perfect, as in this Pascal result’s not always perfect, as in this Pascal example: example: If A > 0 and A < 10 then A := 0 ; If A > 0 and A < 10 then A := 0 ; • Pascal’s rela onal operators (e.g., <) have • Pascal’s rela onal operators (e.g., <) have lowest precedence! lowest precedence! • So how is this interpreted? • So we have:
If A > 0 and A < 10 then A := 0 ; If (A > (0 and A) < 10) then A := 0 ;
Pascal compiler finds a type error Operator Precedence: Precedence Table If (A > (0 and A) < 10) then A := 0 ;
http://www.compileonline.com/
Operator Precedence: Precedence Table Operators: Associa vity • Associa vity rules specify order to evaluate operators of same precedence level • Operators are typically either le associa ve or right associa ve. • Le associa vity is typical for +, - , * and / • So A + B + C – Means: (A + B) + C – And not: A + (B + C) • Does it ma er? Operators: Associa vity Operators: Associa vity
• For + and * it doesn’t ma er in theory •Most languages use a similar associa vity rules for the most common operators (though it can in prac ce) but for – and / it ma ers in theory, too • But Languages diverge on exponen a on and some others • What should A-B-C mean? – In Fortran, ** associates from right-to-le , (A – B) – C ≠ A – (B – C) as in normally the case for mathema cs • What is the results of 2 ** 3 ** 4 ? – In Ada, ** doesn’t associate; you must – 2 ** (3 ** 4) = 2 ** 81 = write the previous expression as 2 ** (3 ** 2417851639229258349412352 4) to obtain the expected answer – (2 ** 3) ** 4 = 8 ** 4 = 256 • Ada’s goal of reliability may explain the decision: make programmers be explicit here
Associa vity in C Comment on example in Python
• In C, as in most languages, most of the >>> a = b = 1 # this is a special Python idiom and does! operators associate le to right >>> print a,b # what you would expect! 1 1! >>> a = (b = 2) # but assignment is a statement and does! a + b + c => (a + b) + c File "
If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity An unambiguous expression grammar:
Precedence and associa vity in Grammar Grammar (con nued) Sentence: const – const / const Operator associa vity can also be Deriva on:
An Expression Grammar A deriva on Here is a grammar to define simple arithme c A deriva on of a+b*2 using the expression grammar: expressions over variables and numbers.
Exp => // Exp ::= Exp BinOp Exp! Exp ::= num Another common Exp BinOp Exp => // Exp ::= id! Exp ::= id nota on variant id BinOp Exp => // BinOp ::= '+'! Exp ::= UnOp Exp where single id + Exp => // Exp ::= Exp BinOp Exp! Exp := Exp BinOp Exp quotes are used to indicate terminal id + Exp BinOp Exp => // Exp ::= num! Exp ::= '(' Exp ')' id + Exp BinOp num => // Exp ::= id! symbols and UnOp ::= '+' unquoted symbols id + id BinOp num => // BinOp ::= '*'! UnOp ::= '-' are taken as non- id + id * num! terminals. BinOp ::= '+' | '-' | '*' | '/ a + b * 2!
A parse tree Precedence • Precedence refers to the order in which A parse tree for a+b*2: opera ons are evaluated • Usual conven on: exponents > mult, div > add, sub __Exp__! • Deal with opera ons in categories: exponents, / | \! mulops, addops. Exp BinOp Exp! • A revised grammar that follows these conven ons: | | / | \! Exp ::= Exp AddOp Exp! id + Exp BinOp Exp! Exp ::= Term! Term ::= Term MulOp Term! | | | |! Term ::= Factor! a id * num! Factor ::= '(' + Exp + ')‘! | |! Factor ::= num | id! AddOp ::= '+' | '-’! b 2! MulOp ::= '*' | '/'! Associa vity Adding associa vity to the grammar
• Associa vity refers to the order in which two Adding associa vity to the BinOp expression of the same opera on should be computed grammar Exp ::= Exp AddOp Term! • 3+4+5 = (3+4)+5, le associa ve (all BinOps) Exp ::= Term ! • 3^4^5 = 3^(4^5), right associa ve Term ::= Term MulOp Factor! • We’ll see that condi onals right associate but Term ::= Factor ! Factor ::= '(' Exp ')'! with a wrinkle: an else clause associates with Factor ::= num | id! closest unmatched if AddOp ::= '+' | '-'! if a then if b then c else d MulOp ::= '*' | '/'! = if a then (if b then c else d)
Grammar Exp ::= Exp AddOp Term! Deriva on Exp ::= Term! Exp =>! Term ::= Term MulOp Factor! Term ::= Factor ! Exp AddOp Term =>! Factor ::= '(' Exp ')’! Exp AddOp Exp AddOp Term =>! Factor ::= num | id! AddOp ::= '+' | '-‘! Term AddOp Exp AddOp Term =>! MulOp ::= '*' | '/'! Factor AddOp Exp AddOp Term =>! Num AddOp Exp AddOp Term =>! Parse tree Num + Exp AddOp Term =>! E Num + Factor AddOp Term =>! Num + Num AddOp Term =>! E A T Num + Num - Term =>! - F Num + Num - Factor =>! E A T Num + Num - Num! num ! T + F F num
num Example: condi onals Example: condi onals • Most languages allow two condi onal forms, • All languages use a standard rule to determine with and without an else clause: which if expression an else clause a aches to – if x < 0 then x = -x • The rule: – if x < 0 then x = -x else x = x+1 – An else clause a aches to the nearest if to • But we’ll need to decide how to interpret: its le that does not yet have an else clause – if x < 0 then if y < 0 x = -1 else x = -2 • Example: • To which if does the else clause a ach? – if x < 0 then if y < 0 x = -1 else x = -2 • This is like the syntac c ambiguity in a ach- – if x < 0 then if y < 0 x = -1 else x = -2 ment of preposi onal phrases in English – the man near a cat with a hat
Example: condi onals Example: condi onals • Goal: create a correct grammar for condi onals The final unambiguous grammar • Must be non-ambiguous and associate else with Statement ::= Matched | Unmatched nearest unmatched if Matched ::= 'if' test 'then' Matched 'else' Matched Statement ::= Conditional | 'whatever'! | 'whatever' Conditional ::= 'if' test 'then' Statement 'else‘ Statement! Unmatched ::= 'if' test 'then' Statement Conditional ::= 'if' test 'then' Statement! | 'if' test 'then' Matched ‘else’ Unmatched ! • The grammar is ambiguous. The first Condi onal allows unmatched ifs to be Condi onals – Good: if test then (if test then whatever else whatever) – Bad: if test then (if test then whatever) else whatever • Goal: write grammar that forces an else clause to a ach to the nearest if w/o an else clause Syntac c Sugar Extended BNF • Syntac c sugar: syntac c features designed to Syntac c sugar: doesn’t extend the expressive make code easier to read or write while power of the formalism, but makes it easier to alterna ves exist use, i.e., more readable and more writable • Makes a language sweeter for humans to use: • Op onal parts are placed in brackets ([]) things can be expressed more clearly, con-
• Syntac c sugar can be removed from language separate them with ver cal bars without effec ng what can be done
BNF vs EBNF Syntax Graphs BNF: Syntax Graphs - Put the terminals in ellipses and
Parsing complexity Parsing complexity • How hard is the parsing task? 3 • Parsing an arbitrary context free grammar is O(n ), • How hard is the parsing task? e.g., it can take me propor onal the cube of the number of symbols in the input. This is bad! • Parsing an arbitrary context free grammar 3 • If we constrain the grammar somewhat, we can is O(n ) in the worst case. always parse in linear me. This is good! • E.g., it can take me propor onal the cube • Linear- me parsing of the number of symbols in the input – LL parsers • LL(n) : Le to right, Le most deriva on, • So what? » Recognize LL grammar look ahead at most n » Use a top-down strategy symbols. • This is bad! – LR parsers • LR(n) : Le to right, Right deriva on, look » Recognize LR grammar ahead at most n » Use a bo om-up strategy symbols. Parsing complexity Linear complexity parsing • Prac cal parsers have me complexity that is linear • If it takes t1 seconds to parse your C program in the number of tokens, i.e., O(n) with n lines of code, how long will it take to • If v2.0 or your program is twice as long, it will take take if you make it twice as long? twice as long to parse 3 * - me(n) = t 1, me(2n) = 2 me(n) • This is achieved by modifying the grammar so it can be parsed more easily -8 mes longer • Linear- me parsing • Suppose v3 of your code is has 10n lines? • LL(n) : Le to right, – LL parsers Le most deriva on, • 103 or 1000 mes as long » Recognize LL grammar look ahead at most n » Use a top-down strategy symbols. • Windows Vista was said to have ~50M lines – LR parsers • LR(n) : Le to right, of code » Recognize LR grammar Right deriva on, look » Use a bo om-up strategy ahead at most n symbols.
Recursive Decent Parsing Hierarchy of Linear Parsers
• Basic containment relationship •Each nonterminal in the grammar has a – All CFGs can be recognized by LR parser subprogram associated with it; the – Only a subset of all the CFGs can be recognized by LL parsers subprogram parses all senten al forms that the nonterminal can generate
• The recursive descent parsing CFGs LR parsing subprograms are built directly from the grammar rules LL parsing • Recursive descent parsers, like other top- down parsers, cannot be built from le - recursive grammars (why not?) Recursive Decent Parsing Example The Example: For the grammar: Chomsky hierarchy
We could use the following recursive descent parsing subprogram (e.g., one in C)
void term() { factor(); /* parse first factor*/ •The Chomsky hierarchy while (next_token == ast_code || has four types of languages and their associated grammars and machines. next_token == slash_code) { lexical(); /* get next token */ •They form a strict hierarchy; that is, regular languages < context-free factor(); /* parse next factor */ languages < context-sensi ve languages < recursively enumerable languages. } •The syntax of computer languages are usually describable by regular or } context free languages.
Summary
• The syntax of a programming language is usually defined using BNF or a context free grammar • In addi on to defining what programs are syntac cally legal, a grammar also encodes meaningful or useful abstrac ons (e.g., block of statements) • Typical syntac c no ons like operator precedence, associa vity, sequences, op onal statements, etc. can be encoded in grammars • A parser is based on a grammar and takes an input string, does a deriva on and produces a parse tree.