Automata Theory and Formal Languages
Alberto Pettorossi
Third Edition
ARACNE

Contents

Preface
Chapter 1. Formal Grammars and Languages
  1.1. Free Monoids
  1.2. Formal Grammars
  1.3. The Chomsky Hierarchy
  1.4. Chomsky Normal Form and Greibach Normal Form
  1.5. Epsilon Productions
  1.6. Derivations in Context-Free Grammars
  1.7. Substitutions and Homomorphisms
Chapter 2. Finite Automata and Regular Grammars
  2.1. Deterministic and Nondeterministic Finite Automata
  2.2. Nondeterministic Finite Automata and S-extended Type 3 Grammars
  2.3. Finite Automata and Transition Graphs
  2.4. Left Linear and Right Linear Regular Grammars
  2.5. Finite Automata and Regular Expressions
  2.6. Arden Rule
  2.7. Equations Between Regular Expressions
  2.8. Minimization of Finite Automata
  2.9. Pumping Lemma for Regular Languages
  2.10. A Parser for Regular Languages
    2.10.1. A Java Program for Parsing Regular Languages
  2.11. Generalizations of Finite Automata
    2.11.1. Moore Machines
    2.11.2. Mealy Machines
    2.11.3. Generalized Sequential Machines
  2.12. Closure Properties of Regular Languages
  2.13. Decidability Properties of Regular Languages
Chapter 3. Pushdown Automata and Context-Free Grammars
  3.1. Pushdown Automata and Context-Free Languages
  3.2. From PDA’s to Context-Free Grammars and Back: Some Examples
  3.3. Deterministic PDA’s and Deterministic Context-Free Languages
  3.4. Deterministic PDA’s and Grammars in Greibach Normal Form
  3.5. Simplifications of Context-Free Grammars
    3.5.1. Elimination of Nonterminal Symbols That Do Not Generate Words
    3.5.2. Elimination of Symbols Unreachable from the Start Symbol
    3.5.3. Elimination of Epsilon Productions
    3.5.4. Elimination of Unit Productions
    3.5.5. Elimination of Left Recursion
  3.6. Construction of the Chomsky Normal Form
  3.7. Construction of the Greibach Normal Form
  3.8. Theory of Language Equations
  3.9. Summary on the Transformations of Context-Free Grammars
  3.10. Self-Embedding Property of Context-Free Grammars
  3.11. Pumping Lemma for Context-Free Languages
  3.12. Ambiguity and Inherent Ambiguity
  3.13. Closure Properties of Context-Free Languages
  3.14. Basic Decidable Properties of Context-Free Languages
  3.15. Parsers for Context-Free Languages
    3.15.1. The Cocke-Younger-Kasami Parser
    3.15.2. The Earley Parser
  3.16. Parsing Classes of Deterministic Context-Free Languages
  3.17. Closure Properties of Deterministic Context-Free Languages
  3.18. Decidable Properties of Deterministic Context-Free Languages
Chapter 4. Linear Bounded Automata and Context-Sensitive Grammars
  4.1. Recursiveness of Context-Sensitive Languages
Chapter 5. Turing Machines and Type 0 Grammars
  5.1. Equivalence Between Turing Machines and Type 0 Languages
Chapter 6. Decidability and Undecidability in Context-Free Languages
  6.1. Some Basic Decidability and Undecidability Results
    6.1.1. Basic Undecidable Properties of Context-Free Languages
  6.2. Decidability in Deterministic Context-Free Languages
  6.3. Undecidability in Deterministic Context-Free Languages
  6.4. Undecidable Properties of Linear Context-Free Languages
Chapter 7. Appendices
  7.1. Iterated Counter Machines and Counter Machines
  7.2. Stack Automata
  7.3. Relationships Among Various Classes of Automata
  7.4. Decidable Properties of Classes of Languages
  7.5. Algebraic and Closure Properties of Classes of Languages
  7.6. Abstract Families of Languages
  7.7. From Finite Automata to Left Linear and Right Linear Grammars
  7.8. Context-Free Grammars over Singleton Terminal Alphabets
  7.9. The Bernstein Theorem
  7.10. Existence of Functions That Are Not Computable
Index
Bibliography

Preface

These lecture notes present some basic notions and results on Automata Theory, Formal Languages Theory, Computability Theory, and Parsing Theory. I prepared these notes for a course on Automata, Languages, and Translators which I am teaching at the University of Roma Tor Vergata. More material on these topics and on parsing techniques for context-free languages can be found in standard textbooks such as [1, 8, 9]. The reader is encouraged to look at those books.

A theorem denoted by the triple k.m.n is in Chapter k and Section m, and within that section it is identified by the number n. An analogous numbering system is used for algorithms, corollaries, definitions, examples, exercises, figures, and remarks. We use ‘iff’ to mean ‘if and only if’.

Many thanks to my colleagues of the Department of Informatics, Systems, and Production of the University of Roma Tor Vergata. I am also grateful to my students and co-workers and, in particular, to Lorenzo Clemente, Corrado Di Pietro, Fulvio Forni, Fabio Lecca, Maurizio Proietti, and Valerio Senni for their help and encouragement.

Finally, I am grateful to Francesca Di Benedetto, Alessandro Colombo, Donato Corvaglia, Gioacchino Onorati, and Leonardo Rinaldi of the Aracne Publishing Company for their kind cooperation.

Roma, June 2008

In the second edition we have corrected a few mistakes and added Section 7.7 on the derivation of left linear and right linear regular grammars from finite automata and Section 7.8 on context-free grammars with singleton terminal alphabets.

Roma, July 2009

In the third edition we have made a few minor improvements in various chapters.
Roma, July 2011

Alberto Pettorossi
Department of Informatics, Systems, and Production
University of Roma Tor Vergata
Via del Politecnico 1, I-00133 Roma, Italy
http://www.iasi.cnr.it/~adp

CHAPTER 1
Formal Grammars and Languages

In this chapter we introduce some basic notions and some notations we will use in the book.

The set of natural numbers $\{0, 1, 2, \ldots\}$ is denoted by $N$. Given a set $A$, $|A|$ denotes the cardinality of $A$, and $2^A$ denotes the powerset of $A$, that is, the set of all subsets of $A$. Instead of $2^A$, we will also write Powerset($A$).

We say that a set $S$ is countable iff either $S$ is finite or there exists a bijection between $S$ and the set $N$ of natural numbers.

1.1. Free Monoids

Let us consider a countable set $V$, also called an alphabet. The elements of $V$ are called symbols. The free monoid generated by the set $V$ is the set, denoted $V^*$, consisting of all finite sequences of symbols in $V$, that is,

$$V^* = \{ v_1 \ldots v_n \mid n \ge 0 \text{ and for } i = 1, \ldots, n,\ v_i \in V \}.$$

The unary operation $^*$ (pronounced ‘star’) is called Kleene star (or Kleene closure, or $^*$ closure). Sequences of symbols are also called words or strings. The length of a sequence $v_1 \ldots v_n$ is $n$. The sequence of length 0 is called the empty sequence or empty word and it is denoted by $\varepsilon$. The length of a sequence $w$ is also denoted by $|w|$.

Given two sequences $w_1$ and $w_2$ in $V^*$, their concatenation, denoted $w_1 \cdot w_2$ or simply $w_1 w_2$, is the sequence in $V^*$ defined by recursion on the length of $w_1$ as follows:

$$w_1 \cdot w_2 = \begin{cases} w_2 & \text{if } w_1 = \varepsilon \\ v_1 ((v_2 \ldots v_n) \cdot w_2) & \text{if } w_1 = v_1 v_2 \ldots v_n \text{ with } n > 0. \end{cases}$$

We have that $|w_1 \cdot w_2| = |w_1| + |w_2|$. The concatenation operation is associative and its neutral element is the empty sequence $\varepsilon$.

Any set of sequences which is a subset of $V^*$ is called a language (or a formal language) over the alphabet $V$.

Given two languages $A$ and $B$, their concatenation, denoted $A \cdot B$, is defined as follows:

$$A \cdot B = \{ w_1 \cdot w_2 \mid w_1 \in A \text{ and } w_2 \in B \}.$$

Concatenation of languages is associative and its neutral element is the singleton $\{\varepsilon\}$. When $B$ is a singleton, say $\{w\}$, the concatenation $A \cdot B$ will also be written as $A \cdot w$ or simply $Aw$. Obviously, if $A = \emptyset$ or $B = \emptyset$ then $A \cdot B = \emptyset$.

We have that:

$$V^* = V^0 \cup V^1 \cup V^2 \cup \ldots \cup V^k \cup \ldots,$$

where for each $k \ge 0$, $V^k$ is the set of all sequences of length $k$ of symbols of $V$, that is,

$$V^k = \{ v_1 \ldots v_k \mid \text{for } i = 1, \ldots, k,\ v_i \in V \}.$$

Obviously, $V^0 = \{\varepsilon\}$, $V^1 = V$, and for $h, k \ge 0$, $V^h \cdot V^k = V^{h+k} = V^{k+h}$. By $V^+$ we denote $V^* - \{\varepsilon\}$. The unary operation $^+$ (pronounced ‘plus’) is called positive closure or $^+$ closure. The set $V^0 \cup V^1$ is also denoted by $V^{0,1}$.

Given an element $a$ in a set $V$, $a^*$ denotes the set of all finite sequences of zero or more $a$'s (thus, $a^*$ is an abbreviation for $\{a\}^*$), $a^+$ denotes the set of all finite sequences of one or more $a$'s (thus, $a^+$ is an abbreviation for $\{a\}^+$), $a^{0,1}$ denotes the set $\{\varepsilon, a\}$ (thus, $a^{0,1}$ is an abbreviation for $\{a\}^{0,1}$), and $a^\omega$ denotes the infinite sequence made out of all $a$'s.

Given a word $w$, for any $k \ge 0$, the prefix of $w$ of length $k$, denoted $w_k$, is defined as follows:

$$w_k = \text{if } |w| \le k \text{ then } w \text{ else } u, \text{ where } w = u \cdot v \text{ and } |u| = k.$$

In particular, for any $w$, we have that: $w_0 = \varepsilon$ and $w_{|w|} = w$.

Given a language $L \subseteq V^*$, we introduce the following notation:

(i) $L^0 = \{\varepsilon\}$
(ii) $L^1 = L$
(iii) $L^{n+1} = L \cdot L^n$
(iv) $L^* = \bigcup_{k \ge 0} L^k$
(v) $L^+ = \bigcup_{k > 0} L^k$
(vi) $L^{0,1} = L^0 \cup L^1$

We also have that $L^{n+1} = L^n \cdot L$ and, if $\varepsilon \notin L$, $L^+ = L^* - \{\varepsilon\}$ (if $\varepsilon \in L$, then $\varepsilon$ belongs to $L^+$ as well).
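
The definitions of this section can be tried out on small finite languages. The following Python fragment is only an illustrative sketch and not part of the original notes. It assumes that every symbol of the alphabet is a single character, so that a word is a Python string, the empty word is the empty string, and a finite language is a Python set of strings. The names concat, power, star_up_to, and prefix are hypothetical and chosen here for illustration.

```python
EPSILON = ""  # the empty word (assumption: words are Python strings)


def concat(a, b):
    """Concatenation of two finite languages: all words w1 w2 with w1 in a, w2 in b."""
    return {w1 + w2 for w1 in a for w2 in b}


def power(lang, n):
    """n-th power of a language, using L^0 = {empty word} and L^(n+1) = L L^n."""
    result = {EPSILON}
    for _ in range(n):
        result = concat(lang, result)
    return result


def star_up_to(lang, max_k):
    """Finite approximation of the Kleene star: union of L^k for 0 <= k <= max_k."""
    result = set()
    for k in range(max_k + 1):
        result |= power(lang, k)
    return result


def prefix(w, k):
    """Prefix of w of length k: w itself if |w| <= k, otherwise its first k symbols."""
    return w if len(w) <= k else w[:k]


if __name__ == "__main__":
    L = {"a", "bc"}
    print(sorted(power(L, 2)))       # the words of L^2
    print(sorted(star_up_to(L, 2)))  # the words of L^0, L^1, and L^2
    print(prefix("abc", 2))          # the prefix of length 2 of abc
```

The Kleene star itself cannot be returned as a finite set, because it is infinite as soon as the language contains a nonempty word. For this reason the sketch only computes the union of the powers up to a given bound.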