
4.3 Writing a Grammar

Regular Expressions vs. CFG's

• Can convert an NFA into a CFG for the same language:
  - For each state i of the NFA, create a nonterminal symbol A_i.
  - If state i goes to state j on symbol a, introduce the production A_i → aA_j.
  - If state i goes to state j on symbol ε, introduce the production A_i → A_j.
  - If i is an accepting state, introduce A_i → ε.
  - If i is the start state, make A_i be the start symbol.

• Can rewrite a reg. exp. as a CFG: e.g., (a | b)*abb:
    A_0 → aA_0 | bA_0 | aA_1
    A_1 → bA_2
    A_2 → bA_3
    A_3 → ε
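The NFA-to-CFG rules can be sketched in a few lines of Python. This is only an illustration: the function name `nfa_to_cfg`, the triple format for transitions, and the use of `None` for an ε-move are assumptions of the sketch, not part of the slides. The NFA below is the usual four-state automaton for (a | b)*abb.

```python
def nfa_to_cfg(transitions, start, accepting):
    """Build CFG productions from an NFA.

    transitions: iterable of (state, symbol, next_state); symbol None is an ε-move.
    Returns (start_symbol, productions), where productions maps each
    nonterminal to a list of right-hand sides (tuples of symbols; () is ε).
    """
    states = {start, *accepting}
    for src, _, dst in transitions:
        states.update((src, dst))

    def name(state):
        return f"A{state}"  # one nonterminal A_i per state i

    productions = {name(s): [] for s in sorted(states)}
    for src, sym, dst in transitions:
        if sym is None:
            productions[name(src)].append((name(dst),))      # A_i → A_j
        else:
            productions[name(src)].append((sym, name(dst)))  # A_i → a A_j
    for s in accepting:
        productions[name(s)].append(())                      # A_i → ε
    return name(start), productions


# Assumed NFA for (a | b)*abb: states 0..3, start 0, accepting {3}.
nfa = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
start_symbol, grammar = nfa_to_cfg(nfa, start=0, accepting={3})
for head, bodies in grammar.items():
    print(head, "→", " | ".join(" ".join(body) or "ε" for body in bodies))
```

Run on this NFA, the sketch reproduces the four productions of the (a | b)*abb grammar, with A0 as the start symbol.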

• So why not just use CFG's instead of regular expressions (“one big parser”)?
  1. Don't need full power of CFG's.
  2. Reg. exp.'s are more concise and easier to understand.
  3. Can build more efficient lexers from reg. exp.'s.
  4. Get more modular design by separating lexer from parser (easier to design and maintain).

Verifying the Language Generated by a Grammar

• For a grammar G, want to prove that G generates all and only the strings of some language L.
• In general, it's easier to prove “only” than “all”.
• Sticking to a leftmost (rightmost) derivation makes it easier to use induction on the length of strings of L.
• E.g., prove that S → ( S ) S | ε generates all and only the members of the language of balanced parens.
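Before proving the claim, it can be checked on short strings. The sketch below is a finite sanity check, not a proof. The helper names `parse_S`, `derives` and `is_balanced` are made up for the illustration. The recognizer uses one token of lookahead: it picks S → ( S ) S when it sees a left paren and S → ε otherwise.

```python
from itertools import product


def parse_S(s, i=0):
    """Return the index just past the part of s[i:] matched by S, or None on failure."""
    if i < len(s) and s[i] == "(":
        j = parse_S(s, i + 1)              # S → ( S ) S : inner S
        if j is not None and j < len(s) and s[j] == ")":
            return parse_S(s, j + 1)       # trailing S
        return None                        # no matching ")"
    return i                               # S → ε


def derives(s):
    """True if the whole string s is derivable from S."""
    return parse_S(s) == len(s)


def is_balanced(s):
    """Independent definition: depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


# Compare the two definitions on every paren string of length 0..8.
mismatches = [
    "".join(chars)
    for n in range(9)
    for chars in product("()", repeat=n)
    if derives("".join(chars)) != is_balanced("".join(chars))
]
print("mismatches:", mismatches)
```

An empty mismatch list is consistent with "all and only", but only for the lengths tested. The inductive argument is still needed for the general claim.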

• Claim: S → ( S ) S | ε generates only the members of the language of balanced parens.
• Proof: By induction on # of leftmost derivation steps.
  - Basis: S ⇒ ε (one step), which is balanced vacuously.
  - Inductive step: Assume that all derivations of fewer than n steps (n > 1) produce balanced parens. Consider a leftmost derivation of n steps, which must have the form S ⇒ ( S ) S ⇒+ ( x ) S ⇒+ ( x ) y. By the inductive hypothesis, x and y must be balanced (because they were derived in fewer than n steps). Therefore, the string produced by the derivation of n steps is also balanced.

• Claim: S → ( S ) S | ε generates the entire language of balanced parens.
• Proof: By induction on the length of balanced strings.
  - Basis: The empty string (length zero) is balanced vacuously and is derived by S ⇒ ε.

  - Inductive step: Assume that every balanced string of length less than 2n is derivable from S. (All balanced strings have length 2n for some n ≥ 0 because left and right parens come in pairs.) Consider a balanced string w of length 2n, n ≥ 1. This w must begin with a left paren. Let (x) be the shortest prefix of w having an equal number of left and right parens. Then w can be written as (x) y, where both x and y are balanced and shorter than 2n, so (by the inductive hypothesis) both are derivable from S. Such a string can be derived as S ⇒ ( S ) S ⇒+ ( x ) S ⇒+ ( x ) y, proving that w is also derivable from S.

Elimination of Left Recursion

• Recall our simple left-recursive grammar from Chapter 2: expr → expr + term | term
• Such a grammar causes an infinite regress for recursive-descent parsers.
• So we fixed it using a transformation replacing rules of the form A → Aα | β with A → βA', A' → αA' | ε.
• But left recursion may not always be immediate (A → Aα). E.g., S → Aa | b, A → Ac | Sd | ε
• I.e., there may be a “chain” of productions containing left recursion.
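The transformation for immediate left recursion (A → Aα | β becomes A → βA', A' → αA' | ε) can be sketched as follows. The function name and the representation of right-hand sides as tuples are assumptions of this illustration. It also assumes that there is no production A → A and that the primed name is not already in use.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A → Aα | β as A → βA', A' → αA' | ε.

    bodies: list of right-hand sides, each a tuple of symbols; () stands for ε.
    Returns a dict mapping nonterminals to their new lists of right-hand sides.
    """
    new_head = head + "'"  # hypothetical fresh name A'
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: list(bodies)}  # nothing to do: no immediate left recursion
    return {
        head: [beta + (new_head,) for beta in betas],                # A  → β A'
        new_head: [alpha + (new_head,) for alpha in alphas] + [()],  # A' → α A' | ε
    }


# expr → expr + term | term
result = eliminate_immediate_left_recursion("expr", [("expr", "+", "term"), ("term",)])
for head, bodies in result.items():
    print(head, "→", " | ".join(" ".join(body) or "ε" for body in bodies))
```

On the expr grammar this yields expr → term expr' and expr' → + term expr' | ε. It does not handle a chain such as S → Aa | b, A → Ac | Sd | ε, where the recursion on S only appears after substituting for A.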

• Following works in the general case:
  1. Arrange the nonterminals in some order A_1, A_2, ..., A_n.
  2. for i := 1 to n do begin
       for j := 1 to i - 1 do
         replace each production of the form A_i → A_j γ by the productions
         A_i → δ_1 γ | δ_2 γ | ... | δ_k γ, where A_j → δ_1 | δ_2 | ... | δ_k
         are all the current A_j-productions;
       eliminate the immediate left recursion among the A_i-productions
     end

Left Factoring

• Recall that predictive parsing requires disjoint FIRST sets.
• E.g., grammar
    stmt → if expr then stmt else stmt | if expr then stmt
  has if in FIRST of both RHS's.
• In general, can replace A → αβ_1 | αβ_2 with the productions
    A → αA', A' → β_1 | β_2
  where α is longest common prefix of RHS's.
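One left-factoring step can be sketched as below. The names `common_prefix` and `left_factor_once` are invented for the illustration, and the sketch factors only the first group of alternatives that share a leading symbol. Repeating the step until nothing changes would be needed for a full treatment.

```python
def common_prefix(bodies):
    """Longest common prefix of several right-hand sides (tuples of symbols)."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


def left_factor_once(head, bodies):
    """Replace A → αβ_1 | αβ_2 | ... by A → αA', A' → β_1 | β_2 | ... (one group only)."""
    groups = {}
    for body in bodies:
        groups.setdefault(body[:1], []).append(body)  # group by first symbol
    for first, group in groups.items():
        if first and len(group) > 1:
            alpha = common_prefix(group)
            new_head = head + "'"  # hypothetical fresh name A'
            others = [body for body in bodies if body not in group]
            return {
                head: [alpha + (new_head,)] + others,
                new_head: [body[len(alpha):] for body in group],  # () stands for ε
            }
    return {head: list(bodies)}  # no two alternatives share a first symbol


# stmt → if expr then stmt else stmt | if expr then stmt
factored = left_factor_once(
    "stmt",
    [("if", "expr", "then", "stmt", "else", "stmt"), ("if", "expr", "then", "stmt")],
)
for head, bodies in factored.items():
    print(head, "→", " | ".join(" ".join(body) or "ε" for body in bodies))
```

For the if-then-else grammar the common prefix α is `if expr then stmt`, giving stmt → if expr then stmt stmt' and stmt' → else stmt | ε.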

Non-Context-Free Language Constructs

• Not all constraints in a programming language can be expressed by context-free rules.
• Example #1: Java requirement that an identifier be declared before it is used:
    int foo = 5; ...; y = 3 * foo;
• Grammatically, this requirement is equivalent to the language wcw, where w is in (a | b)*.
• This language is not context-free – why?
• Such constraints are typically handled by other mechanisms (e.g., symbol table).
• Example #2: Number of arguments in method call must match number in declaration:
    void foo(int s, int t) { ... }
    void bar(int x, int y, int z) { ... }
    ...
    foo(1, 2);
    bar(5, 4, 3);
  Equivalent to non-CF language a^n b^m c^n d^m.
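Both examples can be handled with a symbol table after parsing. The sketch below is a toy illustration only. The event tuples (`declare_var`, `declare_method`, `use`, `call`) are a hypothetical stand-in for what a real front end would extract from the parse tree, and declarations are assumed to appear before uses in the list.

```python
def check_program(events):
    """Check declared-before-use and argument counts with a symbol table.

    events: list of tuples in source order (a hypothetical representation):
      ("declare_var", name), ("declare_method", name, n_params),
      ("use", name), ("call", name, n_args)
    Returns a list of error messages.
    """
    symbols = {}  # name -> None for a variable, parameter count for a method
    errors = []
    for event in events:
        kind, name = event[0], event[1]
        if kind == "declare_var":
            symbols[name] = None
        elif kind == "declare_method":
            symbols[name] = event[2]
        elif name not in symbols:
            errors.append(f"{name}: used before declaration")
        elif kind == "call" and symbols[name] != event[2]:
            errors.append(f"{name}: expected {symbols[name]} argument(s), got {event[2]}")
    return errors


# The calls from Example #2: every argument count matches its declaration.
ok_program = [
    ("declare_method", "foo", 2),
    ("declare_method", "bar", 3),
    ("call", "foo", 2),
    ("call", "bar", 3),
]
# A variant with one use before declaration and one wrong argument count.
bad_program = [
    ("use", "foo"),
    ("declare_var", "foo"),
    ("declare_method", "bar", 3),
    ("call", "bar", 2),
]
print(check_program(ok_program))
print(check_program(bad_program))
```

The grammar accepts both programs. Only the table lookup distinguishes them, which is the sense in which these constraints fall outside the context-free rules.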