Chomsky and Greibach Normal Forms
Total Page:16
File Type:pdf, Size:1020Kb
Chomsky and Greibach Normal Forms Teodor Rus [email protected] The University of Iowa, Department of Computer Science Computation Theory – p.1/27 Simplifying a CFG • It is often convenient to simplify a CFG; • One of the simplest and most useful simplified forms of CFG is called the Chomsky normal form; • Another normal form usually used in algebraic specifications is Greibach normal form. Note the difference between grammar cleaning and grammar simplification! Computation Theory – p.2/27 Fact Normal forms are useful when more advanced topics in computation theory are approached, as we shall see further. Computation Theory – p.3/27 Definition A context-free grammar G is in Chomsky normal form if every rule is of the form: A −→ BC A −→ a where a is a terminal, A,B,C are nonterminals, and B,C may not be the start variable (the axiom). Computation Theory – p.4/27 Observation The rule S −→ ǫ, where S is the start variable, is not excluded from a CFG in Chomsky normal form. Computation Theory – p.5/27 Theorem 2.9 Any context-free language is generated by a context-free grammar in Chomsky normal form. Proof idea: • Show that any CFG G can be converted into a CFG G′ in Chomsky normal form; • Conversion procedure has several stages where the rules that violate Chomsky normal form conditions are replaced with equivalent rules that satisfy these conditions; • Order of transformations:(1) add a new start variable, (2) eliminate all ǫ-rules, (3) eliminate unit-rules, (4) convert other rules; • Check that the obtained CFG G′ define the same language as the initial CFG G. Computation Theory – p.6/27 Proof Let G = (N, Σ,R,S) be the original CFG. Step 1: add a new start symbol S0 to N, and the rule S0 −→ S to R Note: this change guarantees that the start symbol of G′ does not occur on the rhs of any rule. Computation Theory – p.7/27 Step 2: eliminate ǫ-rules Repeat 1. Eliminate the ǫ rule A −→ ǫ from R where A is not the start symbol; 2. For each occurrence of A on the rhs of a rule, add a new rule to R with that occurrence of A deleted; Examples: (1) replace B −→ uAv by B −→ uAv|uv; (2) replace B −→ uAvAw by B −→ uAvAw|uvAw|aAvw|uvw. 3. Replace the rule B −→ A, (if it is present) by B −→ A|ǫ unless the rule B −→ ǫ has been previously eliminated; until all ǫ rules are eliminated. Computation Theory – p.8/27 Step 3: remove unit rules Repeat: 1. Remove a unit rule A −→ B ∈ R; 2. For each rule B −→ u ∈ R, add the rule A −→ u to R, unless B → u was a unit rule previously removed; until all unit rules are eliminated. Note: u is a string of variables and terminals. Computation Theory – p.9/27 Convert all remaining rules Repeat: 1. Replace a rule A −→ u1u2 ...uk, k ≥ 3, where each ui, 1 ≤ i ≤ k, is a variable or a terminal, by: A −→ u1A1, A1 −→ u2A2, ..., Ak−2 −→ uk−1uk where A1,A2, ..., Ak−2 are new variables; 2. If k ≥ 2 replace any terminal ui with a new variable Ui and add the rule Ui −→ ui; until no rules of the form A −→ u1u2 ...uk with k ≥ 3 remain. Computation Theory – p.10/27 Example CFG conversion Consider the grammar G6 whose rules are: S −→ ASA|aB A −→ B|S B −→ b|ǫ Notation: symbols removed are green and those added are red. After first step of transformation we get: S0 −→ S S −→ ASA|aB A −→ B|S B −→ b|ǫ Computation Theory – p.11/27 Removing ǫ rules Removing B → ǫ: S0 −→ S S −→ ASA|aB|a A −→ B|S|ǫ B −→ b|ǫ Removing A → ǫ: S0 −→ S S −→ ASA|aB|a|SA|AS|S A −→ B|S|ǫ B −→ b Computation Theory – p.12/27 Removing unit rule Removing S → S: S0 −→ S S −→ ASA|aB|a|SA|AS|S A −→ B|S B −→ b Removing S0 → S: S0 −→ S|ASA|aB|a|SA|AS S −→ ASA|aB|a|SA|AS A −→ B|S B −→ b Computation Theory – p.13/27 More unit rules Removing A → B: S0 −→ ASA|aB|a|SA|AS S −→ ASA|aB|a|SA|AS A −→ B|S|b B −→ b Removing A → S: S0 −→ ASA|aB|a|SA|AS S −→ ASA|aB|a|SA|AS A −→ S|b|ASA|aB|a|SA|AS B −→ b Computation Theory – p.14/27 Converting remaining rules S0 −→ AA1|UB|a|SA|AS S −→ AA1|UB|a|SA|AS A −→ b|AA1|UB|a|SA|AS A1 −→ SA U −→ a B −→ b Computation Theory – p.15/27 Observation • The conversion procedure produces several variables Ui along with several rules Ui → a; • Since all these represent the same rule, we may simplify the result using a single variable U and a single rule U → a. Computation Theory – p.16/27 Greibach Normal Form A context-free grammar G = (V, Σ,R,S) is in Greibach normal form if each rule r ∈ R has the property: lhs(r) ∈ V , rhs(r)= aα, a ∈ Σ and α ∈ V ∗. Note: Greibach normal form provides a justification of operator prefix-notation usually employed in algebra. Computation Theory – p.17/27 Greibach Theorem Every CFL L where ǫ 6∈ L can be generated by a CFG in Greibach normal form. Proof idea: Let G =(V, Σ,R,S) be a CFG generating L. Assume that G is in Chomsky normal form • Let V = {A1, A2,...,Am} be an ordering of nonterminals. • Construct the Greibach normal form from Chomsky normal form. Computation Theory – p.18/27 Construction 1. Modify the rules in R so that if Ai → Aj γ ∈ R then j > i 2. Starting with A1 and proceeding to Am this is done as follows: (a) Assume that productions have been modified so that for 1 ≤ i ≤ k, Ai → Aj γ ∈ R only if j > i; (b) If Ak → Ajγ is a production with j<k, generate a new set of productions substituting for the Aj the rhs of each Aj production; (c) Repeating (b) at most k − 1 times we obtain rules of the form Ak → Apγ, p ≥ k; (d) Replace rules Ak → Akγ by removing left-recursive rules. Computation Theory – p.19/27 Facts Let m be the highest order nonterminal generated during Greibach normal form construction algorithm. Note: if no left-recursive rules (Ai → Aiα, for some 1 ≤ i ≤ m) are in the grammar then we have: 1. The leftmost symbol on the rhs of each rule for Am must be a terminal, since Am is the highest order variable. 2. The leftmost symbol on the rhs of any rule for Am−1 must be either Am or a terminal symbol. If it is Am we can generate new rules replacing Am with the rhs of the Am rules. 3. Repeating (2) for Am−2,...,A2, A1 we obtain only rules whose rhs start with terminal symbols. Computation Theory – p.20/27 Left recursion A CFG containing rules of the form A → Aα|β is called left-recursive in A. The language generated by such rules is of the form A ⇒∗ βαn. If we replace the rules A → Aα|β by A → βB|β, B → αB|α where B is a new variable, then the language generated by A is the same while no left-recursive A-rules are used in the derivation. Computation Theory – p.21/27 Removing left-recursion Left-recursion can be eliminated by the following scheme: • If A → Aα1|Aα2 ... |Aαr are all A left recursive rules, and A → β1|β2| ... |βs are all remaining A-rules then chose a new nonterminal, say B; • Add the new B-rules B → αi|αiB, 1 ≤ i ≤ r; • Replace the A-rules by A → βi|βiB, 1 ≤ i ≤ s. This construction preserve the language L. Computation Theory – p.22/27 Conversion algorithm for (k=1; k<=m; k++) {for (j=1; j<=k-1; j++) {for (each A_k ---> A_j alpha) { for all rules A_j ---> beta add A_k ---> beta alpha remove A_k ---> A_j alpha } for (each rule A_k ---> A_k alpha) { add rules B_k ---> alpha | alpha B_k remove A_k ---> A_k alpha } for (each rule A_k ---> beta, beta does not begin with A_k) add rule A_k ---> beta B_k } } Computation Theory – p.23/27 More on Greibach NF See Introduction to Automata Theory, Languages, and Computation, J.E, Hopcroft and J.D Ullman, Addison- Wesley 1979, p. 94–96. Computation Theory – p.24/27 Example Convert the CFG G =({A1, A2, A3}, {a, b},R,A1) where R = {A1 → A2A3, A2 → A3A1|b, A3 → A1A2|a} into Greibach normal form. Computation Theory – p.25/27 Solution 1. Step 1: ordering the rules: (Only A3 rules violate ordering conditions, hence only A3 rules need to be changed). Following the procedure we replace A3 rules by: A3 → A3A1A3A2|bA3A2|a 2. Eliminating left-recursion we get: A3 → bA3A2B3|aB3|bA3A2|a, B3 → A1A3A2|A1A3A2B3; 3. All A3 rules start with a terminal. We use them to replace A1 → A2A3. This introduces the rules B3 → A1A3A2|A1A3A2B3; 4. Use A1 production to make them start with a terminal. Computation Theory – p.26/27 The result A3 → bA3A2B3|bA3A2|aB3|a A2 → bA3A2B3A1|bA3A2A1|aB3A1|aA1|b A1 → bA3A2B3A1A3|bA3A2A1A3|aB3A1A3|aA1A3|bA3 B3 → bA3A2B3A1A3A3A2B3|bA3A2B3A1A3A1A3A3A2 B3 → bA3A3A2B3|bA3A3A2|bA3A2A1A3A3A2B3|bA3A2A1A3A3A2 B3 → aB3A1A3A3A2B3|aB3A1A3A3A2 B3 → aA1A3A3A2B3|aA1A3A3A2 Disclaimer: I copied this result from Hopcroft and Ulman, page 98-99! Hence, I cannot guarantee its correctness.