Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

Not all languages are regular

- So what happens to the languages that are not regular?
- Can we still come up with a language recognizer?
  - i.e., something that will accept (or reject) strings that belong (or do not belong) to the language?

Context-Free Languages

- A language class larger than the class of regular languages
- Supports a natural, recursive notation called “context-free grammar”
- Applications:
  - Parse trees, compilers
  - XML

[Figure: the regular languages (FA/RE) drawn as a subset of the context-free languages (PDA/CFG).]

An Example

- A palindrome is a word that reads the same from both ends
  - E.g., madam, redivider, malayalam, 010010010
- Let L = { w | w is a binary palindrome }
- Is L regular?
  - No.
  - Proof (by contradiction, assuming N to be the pumping lemma constant):
    - Let w = 0^N 1 0^N
    - By the pumping lemma, w can be rewritten as xyz, such that xy^k z is also in L (for any k ≥ 0)
    - But |xy| ≤ N and y ≠ ε
    - ==> y = 0^+ (y consists only of 0s from the first block)
    - ==> xy^k z will NOT be in L for k = 0 (it has fewer 0s before the 1 than after it)
    - ==> Contradiction

But the language of palindromes is a CFL, because it supports recursive substitution (in the form of a CFG)

- This is because we can construct a “grammar” like this:
  1. A ==> ε
  2. A ==> 0
  3. A ==> 1
  4. A ==> 0A0
  5. A ==> 1A1

  Same as: A => 0A0 | 1A1 | 0 | 1 | ε

  Here A is the variable (or non-terminal), 0 and 1 are the terminals, and each numbered rule is a production.

How does this grammar work?

How does the CFG for palindromes work?

An input string belongs to the language (i.e., is accepted) iff it can be generated by the CFG.

G: A => 0A0 | 1A1 | 0 | 1 | ε

- Example: w = 01110
- G can generate w as follows:
  1. A => 0A0
  2. => 01A10
  3. => 01110

Generating a string from a grammar:
  1. Pick and choose a sequence of productions that would allow us to generate the string.
  2. At every step, substitute one variable with one of its productions.
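The generation procedure can be sketched in Python. This is only an illustration and not part of the original slides: the function name `derive_palindrome` is hypothetical, and the sketch assumes the grammar A => 0A0 | 1A1 | 0 | 1 | ε, choosing at each step the production that matches the outermost symbols still to be generated.

```python
def derive_palindrome(w):
    """Return the sentential forms of a derivation of w from A, or None if w is not in L(G)."""
    if set(w) - {"0", "1"}:                      # only binary strings are considered
        return None
    steps = ["A"]
    left, right = "", ""                         # terminals already placed around the variable A
    while True:
        inner = w[len(left):len(w) - len(right)]  # part of w that A still has to generate
        if inner in ("", "0", "1"):              # productions A => ε | 0 | 1
            steps.append(left + inner + right)
            return steps
        if inner[0] != inner[-1]:                # neither A => 0A0 nor A => 1A1 fits
            return None
        left, right = left + inner[0], inner[-1] + right   # apply A => 0A0 or A => 1A1
        steps.append(left + "A" + right)


print(derive_palindrome("01110"))   # ['A', '0A0', '01A10', '01110']
```

Each list element is one sentential form, so consecutive elements correspond to one substitution step.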

Context-Free Grammar: Definition

- A context-free grammar is G = (V, T, P, S), where:
  - V: set of variables or non-terminals
  - T: set of terminals (the alphabet; a production body may also be the empty string ε)
  - P: set of productions, each of which is of the form A ==> α1 | α2 | …
    - where A is a variable in V and each αi is an arbitrary string of variables and terminals
  - S: start variable

CFG for the language of binary palindromes:
  G = ({A}, {0,1}, P, A)
  P: A ==> 0A0 | 1A1 | 0 | 1 | ε

More examples

- Parenthesis matching in code
- Syntax checking
- In scenarios where there is a general need for:
  - Matching a symbol with another symbol, or
  - Matching a count of one symbol with that of another symbol, or
  - Recursively substituting one symbol with a string of other symbols

Example #2

- Language of balanced parentheses, e.g., ()(((())))((()))…
- CFG?
  G: S => (S) | SS | ε

How would you “interpret” the string “(((()))()())” using this grammar?
  S => (S) => (SS) =>* ((S)SS) =>* (((S))(S)(S)) =>* (((()))()())
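A membership test for this language can be sketched as a small recursive recognizer. Note the assumption: instead of S => (S) | SS | ε, the sketch follows the equivalent form S => (S)S | ε, which is easier to parse deterministically. The names `parse_S` and `is_balanced` are hypothetical.

```python
def parse_S(s, i=0):
    """Consume the longest prefix of s[i:] derivable from S => (S)S | ε.

    Returns the index just past that prefix, or -1 if an opened
    parenthesis is never closed.
    """
    while i < len(s) and s[i] == "(":        # S => (S)S, the trailing S handled by the loop
        i = parse_S(s, i + 1)                # the inner S
        if i == -1 or i >= len(s) or s[i] != ")":
            return -1
        i += 1                               # consume the matching ")"
    return i                                 # S => ε


def is_balanced(s):
    """True iff the whole string is derivable from S."""
    return parse_S(s) == len(s)


print(is_balanced("(((()))()())"))   # True
print(is_balanced("(()"))            # False
```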

Example #3

- A grammar for L = { 0^m 1^n | m ≥ n }
- CFG?
  G: S => 0S1 | A
     A => 0A | ε

How would you interpret the string “00000111” using this grammar?
  S => 0S1 => 00S11 => 000S111 => 000A111 => 0000A111 => 00000A111 => 00000111
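The shape of every derivation in this grammar is the same: apply S => 0S1 n times, switch with S => A, apply A => 0A (m − n) times, and finish with A => ε. A minimal Python sketch of that recipe follows; the function name `derive` is hypothetical.

```python
def derive(m, n):
    """Sentential forms deriving 0^m 1^n (m >= n) with S => 0S1 | A, A => 0A | ε."""
    if m < n:
        raise ValueError("0^m 1^n is in the language only when m >= n")
    forms = ["S"]
    for k in range(1, n + 1):                          # S => 0S1, applied n times
        forms.append("0" * k + "S" + "1" * k)
    forms.append("0" * n + "A" + "1" * n)              # S => A
    for k in range(1, m - n + 1):                      # A => 0A, applied m - n times
        forms.append("0" * (n + k) + "A" + "1" * n)
    forms.append("0" * m + "1" * n)                    # A => ε
    return forms


print(" => ".join(derive(5, 3)))
# S => 0S1 => 00S11 => 000S111 => 000A111 => 0000A111 => 00000A111 => 00000111
```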

Example #4

A program containing if-then(-else) statements:
  if Condition then Statement else Statement
  (or)
  if Condition then Statement
CFG?

More examples

- L1 = { 0^n | n ≥ 0 }
- L2 = { 0^n | n ≥ 1 }
- L3 = { 0^i 1^j 2^k | i = j or j = k, where i, j, k ≥ 0 }
- L4 = { 0^i 1^j 2^k | i = j or i = k, where i, j, k ≥ 1 }

Applications of CFLs & CFGs

- Compilers use parsers for syntactic checking
- Parsers can be expressed as CFGs
  1. Balancing parentheses:
     - B ==> BB | (B) | Statement
     - Statement ==> …
  2. If-then-else:
     - S ==> SS | if Condition then Statement else Statement | if Condition then Statement | Statement
     - Condition ==> …
     - Statement ==> …
  3. C parenthesis matching { … }
  4. Pascal begin-end matching
  5. YACC (Yet Another Compiler-Compiler)

More applications

- Markup languages
  - Nested tag matching
    - HTML
      - …
    - XML
      - …

Tag-Markup Languages

  Roll ==> Class Students
  Class ==> Text
  Text ==> Char Text | Char
  Char ==> a | b | … | z | A | B | … | Z
  Students ==> Student Students | ε
  Student ==> Text

Here, the left-hand side of each production denotes a non-terminal (e.g., “Roll”, “Class”, etc.). Those symbols on the right-hand side for which no productions (i.e., substitutions) are defined are terminals (e.g., ‘a’, ‘b’, ‘|’, ‘<’, ‘>’, “ROLL”, etc.).

Structure of a production

  head   derivation   body
   A     ==>          α1 | α2 | … | αk

The above is the same as:

  1. A ==> α1
  2. A ==> α2
  3. A ==> α3
  …
  k. A ==> αk

CFG conventions

- Terminal symbols <== a, b, c, …
- Non-terminal symbols <== A, B, C, …
- Terminal or non-terminal symbols <== X, Y, Z
- Terminal strings <== w, x, y, z
- Arbitrary strings of terminals and non-terminals <== α, β, γ, …

Syntactic Expressions in Programming Languages

  result = a*b + score + 10 * distance + c

[Figure annotations: parts of the expression are labeled as terminals and as variables; operators are also terminals.]

Regular languages have only terminals:

- Reg expression = [a-z][a-z0-1]*
- If we allow only letters a & b, and 0 & 1 for constants (for simplification)
  - Regular expression = (a+b)(a+b+0+1)*
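The simplified regular expression (a+b)(a+b+0+1)* can be tried out with Python's `re` module, where the textbook "+" (union) becomes a character class. This is a small sketch; the names `IDENTIFIER` and `is_identifier` are hypothetical.

```python
import re

# (a+b)(a+b+0+1)* in textbook notation: one letter a or b, then any mix of a, b, 0, 1
IDENTIFIER = re.compile(r"[ab][ab01]*")


def is_identifier(s):
    """True iff the whole string matches the regular expression."""
    return IDENTIFIER.fullmatch(s) is not None


print(is_identifier("ab01"))   # True
print(is_identifier("0ab"))    # False: must start with a letter
```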

String membership

How do we tell whether a string belongs to the language defined by a CFG?
  1. Derivation
     - Head to body
  2. Recursive inference
     - Body to head
Both are equivalent forms.

Example:
  G: A => 0A0 | 1A1 | 0 | 1 | ε
  - w = 01110
  - Is w a palindrome?
    A => 0A0 => 01A10 => 01110

Simple Expressions…

- We can write a CFG for accepting simple expressions
- G = (V, T, P, S)
  - V = {E, F}
  - T = {0, 1, a, b, +, *, (, )}
  - S = E
  - P:
    - E ==> E+E | E*E | (E) | F
    - F ==> aF | bF | 0F | 1F | a | b | 0 | 1
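One way to make the four components concrete is to write the grammar down as plain Python data and check it against the definition of a CFG. The encoding below (a dict from heads to lists of bodies, single-character symbols) and the helper name `is_well_formed` are assumptions of this sketch, not something the slides prescribe.

```python
V = {"E", "F"}                      # variables
T = set("01ab+*()")                 # terminals
S = "E"                             # start variable
P = {                               # productions: head -> list of bodies
    "E": ["E+E", "E*E", "(E)", "F"],
    "F": ["aF", "bF", "0F", "1F", "a", "b", "0", "1"],
}


def is_well_formed(V, T, P, S):
    """Check the conditions in the definition of a CFG G = (V, T, P, S)."""
    return (
        S in V                                    # the start symbol is a variable
        and not (V & T)                           # variables and terminals are disjoint
        and set(P) <= V                           # every head is a variable
        and all(sym in V | T                      # bodies use only known symbols
                for bodies in P.values() for body in bodies for sym in body)
    )


print(is_well_formed(V, T, P, S))   # True
```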

Generalization of derivation

- Derivation is head ==> body
- A ==> X (A derives X in a single step)
- A ==>*_G X (A derives X in zero or more steps, using the productions of G)
- Transitivity:
  IF A ==>*_G B, and B ==>*_G C, THEN A ==>*_G C

Context-Free Language

- The language of a CFG, G = (V, T, P, S), denoted by L(G), is the set of terminal strings that have a derivation from the start variable S.
  - L(G) = { w in T* | S ==>*_G w }
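The definition of L(G) can be explored by brute force: start from S, repeatedly expand the leftmost variable, and collect every terminal string reached. The sketch below is illustrative only. The name `language_up_to` and the dict encoding are hypothetical, symbols are assumed to be single characters, and the search is only guaranteed to stop for grammars (such as the palindrome grammar) in which every recursive production adds a terminal.

```python
from collections import deque

G_PAL = {"A": ["0A0", "1A1", "0", "1", ""]}   # "" stands for ε


def language_up_to(productions, start, max_len):
    """All terminal strings w with |w| <= max_len such that start =>* w."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # index of the leftmost variable, if any
        idx = next((i for i, c in enumerate(form) if c in productions), None)
        if idx is None:                        # no variables left: a terminal string
            found.add(form)
            continue
        for body in productions[form[idx]]:
            new = form[:idx] + body + form[idx + 1:]
            terminals = sum(c not in productions for c in new)
            if terminals <= max_len and new not in seen:   # prune forms that are too long
                seen.add(new)
                queue.append(new)
    return found


print(sorted(language_up_to(G_PAL, "A", 3), key=lambda w: (len(w), w)))
# ['', '0', '1', '00', '11', '000', '010', '101', '111']
```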

Left-most & Right-most Derivation Styles

G: E => E+E | E*E | (E) | F
   F => aF | bF | 0F | 1F | ε

Derive the string a*(ab+10) from G: E ==>*_G a*(ab+10)

Left-most derivation (always substitute the leftmost variable):
  E
  ==> E * E
  ==> F * E
  ==> aF * E
  ==> a * E
  ==> a * (E)
  ==> a * (E + E)
  ==> a * (F + E)
  ==> a * (aF + E)
  ==> a * (abF + E)
  ==> a * (ab + E)
  ==> a * (ab + F)
  ==> a * (ab + 1F)
  ==> a * (ab + 10F)
  ==> a * (ab + 10)

Right-most derivation (always substitute the rightmost variable):
  E
  ==> E * E
  ==> E * (E)
  ==> E * (E + E)
  ==> E * (E + F)
  ==> E * (E + 1F)
  ==> E * (E + 10F)
  ==> E * (E + 10)
  ==> E * (F + 10)
  ==> E * (aF + 10)
  ==> E * (abF + 10)
  ==> E * (ab + 10)
  ==> F * (ab + 10)
  ==> aF * (ab + 10)
  ==> a * (ab + 10)

Leftmost vs. Rightmost derivations

Q1) For every leftmost derivation, there is a rightmost derivation, and vice versa. True or False?

True - will use parse trees to prove this

Q2) Does every word generated by a CFG have a leftmost and a rightmost derivation?
    Yes – easy to prove (reverse direction)

Q3) Could there be words which have more than one leftmost (or rightmost) derivation?
    Yes – depending on the grammar

How to prove that your CFGs are correct? (using induction)

Gpal: CFG & CFL

Gpal: A => 0A0 | 1A1 | 0 | 1 | ε

- Theorem: A string w in (0+1)* is in L(Gpal) if and only if w is a palindrome.
- Proof:
  - Use induction
    - on string length for the IF part
    - on length of derivation for the ONLY IF part

Parse Trees

- Each derivation in a CFG can be represented using a parse tree:
  - Each internal node is labeled by a variable in V
  - Each leaf is labeled by a terminal symbol (or by ε)
  - If a production A ==> X1 X2 … Xk is used, then the internal node labeled A has k children, labeled X1, X2, …, Xk from left to right

Parse tree for the production A ==> X1 … Xi … Xk (each Xi that is a variable is expanded further by subsequent productions):

            A
        /   |   \
      X1 …  Xi … Xk

Examples

Parse tree for a + 1, with G: E => E+E | E*E | (E) | F, F => aF | bF | 0F | 1F | 0 | 1 | a | b

        E
      / | \
     E  +  E
     |     |
     F     F
     |     |
     a     1

Parse tree for 0110, with G: A => 0A0 | 1A1 | 0 | 1 | ε

        A
      / | \
     0  A  0
      / | \
     1  A  1
        |
        ε

Reading a tree from the root down gives a derivation; reading it from the leaves up gives a recursive inference.

Parse Trees, Derivations, and Recursive Inferences

Production: A ==> X1 … Xi … Xk

            A
        /   |   \
      X1 …  Xi … Xk

[Figure: diagram relating derivation, left-most derivation, right-most derivation, recursive inference, and parse tree.]

Interchangeability of different CFG representations

- Parse tree ==> left-most derivation
  - DFS left to right
- Parse tree ==> right-most derivation
  - DFS right to left
- ==> every left-most derivation has a corresponding right-most derivation (both come from the same parse tree), and vice versa
- Derivation ==> Recursive inference
  - Reverse the order of productions
- Recursive inference ==> Parse trees
  - Bottom-up construction of the parse tree
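The conversions from a parse tree to a left-most or right-most derivation can be sketched in a few lines of Python. The tree encoding used here (an internal node is a pair of a variable and a list of children, a leaf is a terminal string, and the empty string stands for ε) and the names `tree_yield` and `derivation` are assumptions of this sketch.

```python
# Parse tree for a + 1 in the grammar E => E+E | E*E | (E) | F, F => ... | a | 1 | ...
TREE = ("E", [("E", [("F", ["a"])]), "+", ("E", [("F", ["1"])])])


def tree_yield(node):
    """Concatenate the leaves from left to right."""
    if isinstance(node, str):
        return node
    return "".join(tree_yield(child) for child in node[1])


def derivation(tree, leftmost=True):
    """Sentential forms obtained by always expanding the leftmost (or rightmost) variable."""
    form = [tree]                                  # current sentential form, as a list of subtrees
    steps = [tree[0]]
    while any(not isinstance(n, str) for n in form):
        positions = [i for i, n in enumerate(form) if not isinstance(n, str)]
        i = positions[0] if leftmost else positions[-1]
        form[i:i + 1] = form[i][1]                 # replace the variable by its children
        steps.append("".join(n if isinstance(n, str) else n[0] for n in form))
    return steps


print(derivation(TREE))                   # ['E', 'E+E', 'F+E', 'a+E', 'a+F', 'a+1']
print(derivation(TREE, leftmost=False))   # ['E', 'E+E', 'E+F', 'E+1', 'F+1', 'a+1']
```

Both derivations come from the same tree and use the same productions, only in a different order.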

Connection between CFLs and RLs

What kind of grammars result for regular languages?

CFLs & Regular Languages

- A CFG is said to be right-linear if all the productions are one of the following two forms:
    A ==> wB (or) A ==> w
  where:
  • A & B are variables,
  • w is a string of terminals
- Theorem 1: Every right-linear CFG generates a regular language
- Theorem 2: Every regular language has a right-linear grammar
- Theorem 3: Left-linear CFGs also represent RLs

Some Examples

[Figure: two finite automata over {0, 1}, each with states A, B, C. For each one the question is: right-linear CFG?]

  A => 01B | C
  B => 11B | 0C | 1A
  C => 1A | 0 | 1
  Finite automaton?
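The usual construction behind Theorem 2 turns each DFA transition from state A to state B on symbol a into a production A ==> aB, and adds A ==> ε for every accepting state A. A minimal Python sketch follows. The function name `dfa_to_right_linear` and the two-state example DFA (binary strings ending in 1) are hypothetical and are not the automata from the slide's figure.

```python
def dfa_to_right_linear(transitions, accepting):
    """Build right-linear productions from a DFA.

    transitions maps (state, symbol) -> state; state names double as grammar variables.
    Returns a dict mapping each variable to its list of production bodies ("" is ε).
    """
    productions = {}
    for (state, symbol), target in transitions.items():
        productions.setdefault(state, []).append(symbol + target)   # A ==> wB, with |w| = 1
    for state in accepting:
        productions.setdefault(state, []).append("")                # A ==> w, with w = ε
    return productions


# Hypothetical DFA accepting binary strings that end in 1; A is the start state.
TRANSITIONS = {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "A", ("B", "1"): "B"}
print(dfa_to_right_linear(TRANSITIONS, accepting={"B"}))
# {'A': ['0A', '1B'], 'B': ['0A', '1B', '']}
```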

Ambiguity in CFGs and CFLs

Ambiguity in CFGs

- A CFG is said to be ambiguous if there exists a string which has more than one left-most derivation
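For a small grammar, this definition can be checked on a given string by counting its left-most derivations. The sketch below uses the grammar S ==> AS | ε, A ==> A1 | 0A1 | 01 and the string 00111. The names `GRAMMAR` and `count_leftmost` are hypothetical, and the pruning is only guaranteed to terminate for grammars like this one, where repeated expansion keeps adding terminals.

```python
GRAMMAR = {"S": ["AS", ""], "A": ["A1", "0A1", "01"]}   # "" stands for ε


def count_leftmost(form, target):
    """Number of left-most derivations of target starting from the sentential form."""
    idx = next((i for i, c in enumerate(form) if c in GRAMMAR), None)
    if idx is None:                                   # only terminals left
        return 1 if form == target else 0
    terminals = sum(c not in GRAMMAR for c in form)
    # prune: too many terminals, or the terminal prefix already disagrees with target
    if terminals > len(target) or not target.startswith(form[:idx]):
        return 0
    return sum(count_leftmost(form[:idx] + body + form[idx + 1:], target)
               for body in GRAMMAR[form[idx]])


print(count_leftmost("S", "00111"))   # 2, so the grammar is ambiguous
```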

Example:
  S ==> AS | ε
  A ==> A1 | 0A1 | 01

Input string: 00111 can be derived in two ways:

  LM derivation #1:     LM derivation #2:
  S => AS               S => AS
    => 0A1S               => A1S
    => 0A11S              => 0A11S
    => 00111S             => 00111S
    => 00111              => 00111

Why does ambiguity matter?

E ==> E + E | E * E | (E) | a | b | c | 0 | 1

string = a * b + c

• LM derivation #1:
  E => E + E => E * E + E ==>* a * b + c
  Parse tree, read as (a*b)+c:

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  c
    |     |
    a     b

• LM derivation #2:
  E => E * E => a * E => a * E + E ==>* a * b + c
  Parse tree, read as a*(b+c):

          E
        / | \
       E  *  E
       |   / | \
       a  E  +  E
          |     |
          b     c

Values are different!!! The calculated value depends on which of the two parse trees is actually used.

Removing Ambiguity in Expression Evaluations

- It MAY be possible to remove ambiguity for some CFLs
  - E.g., in a CFG for expression evaluation, by imposing rules & restrictions such as precedence
  - This would imply a rewrite of the grammar
- Precedence: (), *, +

Ambiguous version:
  E ==> E + E | E * E | (E) | a | b | c | 0 | 1

Modified unambiguous version:
  E => E + T | T
  T => T * F | F
  F => I | (E)
  I => a | b | c | 0 | 1

How will this avoid ambiguity?

Inherently Ambiguous CFLs

- However, for some languages, it may not be possible to remove ambiguity
- A CFL is said to be inherently ambiguous if every CFG that describes it is ambiguous

Example:
- L = { a^n b^n c^m d^m | n, m ≥ 1 } U { a^n b^m c^m d^n | n, m ≥ 1 }
- L is inherently ambiguous
- Why? Input string: a^n b^n c^n d^n

Summary

- Context-free grammars
- Context-free languages
- Productions, derivations, recursive inference, parse trees
- Left-most & right-most derivations
- Ambiguous grammars
- Removing ambiguity
- CFL/CFG applications
  - parsers, markup languages