An Introduction to Formal Language Theory That Integrates Experimentation and Proof
An Introduction to Formal Language Theory that Integrates Experimentation and Proof

Allen Stoughton
Kansas State University

Draft of Fall 2004

Copyright © 2003–2004 Allen Stoughton

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

The LaTeX source of this book and associated lecture slides, and the distribution of the Forlan toolset, are available on the WWW at http://www.cis.ksu.edu/~allen/forlan/.

Contents

Preface  v

1 Mathematical Background  1
1.1 Basic Set Theory  1
1.2 Induction Principles for the Natural Numbers  11
1.3 Trees and Inductive Definitions  16

2 Formal Languages  21
2.1 Symbols, Strings, Alphabets and (Formal) Languages  21
2.2 String Induction Principles  26
2.3 Introduction to Forlan  34

3 Regular Languages  44
3.1 Regular Expressions and Languages  44
3.2 Equivalence and Simplification of Regular Expressions  54
3.3 Finite Automata and Labeled Paths  78
3.4 Isomorphism of Finite Automata  86
3.5 Algorithms for Checking Acceptance and Finding Accepting Paths  94
3.6 Simplification of Finite Automata  99
3.7 Proving the Correctness of Finite Automata  103
3.8 Empty-string Finite Automata  114
3.9 Nondeterministic Finite Automata  120
3.10 Deterministic Finite Automata  129
3.11 Closure Properties of Regular Languages  145
3.12 Equivalence-testing and Minimization of Deterministic Finite Automata  174
3.13 The Pumping Lemma for Regular Languages  193
3.14 Applications of Finite Automata and Regular Expressions  199

4 Context-free Languages  204
4.1 (Context-free) Grammars, Parse Trees and Context-free Languages  204
4.2 Isomorphism of Grammars  213
4.3 A Parsing Algorithm  215
4.4 Simplification of Grammars  219
4.5 Proving the Correctness of Grammars  221
4.6 Ambiguity of Grammars  225
4.7 Closure Properties of Context-free Languages  227
4.8 Converting Regular Expressions and Finite Automata to Grammars  230
4.9 Chomsky Normal Form  233
4.10 The Pumping Lemma for Context-free Languages  236

5 Recursive and R.E. Languages  242
5.1 A Universal Programming Language, and Recursive and Recursively Enumerable Languages  243
5.2 Closure Properties of Recursive and Recursively Enumerable Languages  246
5.3 Diagonalization and Undecidable Problems  249

A GNU Free Documentation License  253
Bibliography  261
Index  263

List of Figures

1.1 Example Diagonalization Table for Cardinality Proof  9
3.1 Regular Expression to FA Conversion Example  151
3.2 DFA Accepting AllLongStutter  194
4.1 Visualization of Proof of Pumping Lemma for Context-free Languages  239
5.1 Example Diagonalization Table for R.E. Languages  249

Preface

Background

Since the 1930s, the subject of formal language theory, also known as automata theory, has been developed by computer scientists, linguists and mathematicians. (Formal) Languages are sets of strings over finite sets of symbols, called alphabets, and various ways of describing such languages have been developed and studied, including regular expressions (which "generate" languages), finite automata (which "accept" languages), grammars (which "generate" languages) and Turing machines (which "accept" languages).
For example, the set of identifiers of a given programming language is a formal language, one that can be described by a regular expression or a finite automaton. And the set of all strings of tokens that are generated by a programming language's grammar is another example of a formal language. Because of its many applications to computer science, e.g., to compiler construction, most computer science programs offer both undergraduate and graduate courses in this subject.

Many of the results of formal language theory are proved constructively, using algorithms that are useful in practice. In typical courses on formal language theory, students apply these algorithms to toy examples by hand, and learn how they are used in applications. But they are not able to experiment with them on a larger scale.

Although much can be achieved by a paper-and-pencil approach to the subject, students would obtain a deeper understanding of the subject if they could experiment with the algorithms of formal language theory using computer tools. Consider, e.g., a typical exercise of a formal language theory class in which students are asked to synthesize an automaton that accepts some language, L. With the paper-and-pencil approach, the student is obliged to build the machine by hand, and then (perhaps) prove that it is correct. But, given the right computer tools, another approach would be possible. First, the student could try to express L in terms of simpler languages, making use of various language operations (union, intersection, difference, concatenation, closure). He or she could then synthesize automata accepting the simpler languages, enter these machines into the system, and then combine these machines using operations corresponding to the language operations used to express L. With some such exercises, a student could solve the exercise in both ways, and could compare the results. Other exercises of this type could only be solved with machine support.

Integrating Experimentation and Proof

Over the past several years, I have been designing and developing a computer toolset, called Forlan, for experimenting with formal languages. Forlan is implemented in the functional programming language Standard ML [MTHM97, Pau96], a language whose notation and concepts are similar to those of mathematics. Forlan is used interactively. In fact, a Forlan session is simply a Standard ML session in which the Forlan modules are pre-loaded. Users are able to extend Forlan by defining ML functions.

In Forlan, the usual objects of formal language theory (automata, regular expressions, grammars, labeled paths, parse trees, etc.) are defined as abstract types, and have concrete syntax. The standard algorithms of formal language theory are implemented in Forlan, including conversions between different kinds of automata and grammars, the usual operations on automata and grammars, equivalence testing and minimization of deterministic finite automata, etc. Support for the variant of the programming language Lisp that we use (instead of Turing machines) as a universal programming language is planned.

While developing Forlan, I have also been writing lecture notes on formal language theory that are based around Forlan, and this book is the outgrowth of those notes. I am attempting to keep the conceptual and notational distance between the textbook and toolset as small as possible. The book treats each concept or algorithm both theoretically, especially using proof, and through experimentation, using Forlan.
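To give a concrete flavor of this kind of experimentation, here is a minimal sketch in plain Standard ML. It deliberately does not use Forlan's actual concrete syntax or modules (those are introduced in Chapter 2); it simply defines, by hand, a deterministic finite automaton for the identifier language mentioned above (a letter followed by zero or more letters or digits), together with a function that tests acceptance.

  datatype state = Start | InId | Dead

  (* transition function: where the machine goes from state q on input character c *)
  fun step (Start, c) = if Char.isAlpha c then InId else Dead
    | step (InId, c)  = if Char.isAlpha c orelse Char.isDigit c then InId else Dead
    | step (Dead, _)  = Dead

  (* a state is accepting exactly when a nonempty identifier has been read *)
  fun accepting InId = true
    | accepting _    = false

  (* run the machine on a string, starting from the start state *)
  fun accepts s =
    accepting (List.foldl (fn (c, q) => step (q, c)) Start (String.explode s))

  (* for instance, accepts "x25" evaluates to true,
     while accepts "25x" and accepts "" evaluate to false *)

In Forlan itself, such a machine would instead be written in Forlan's concrete syntax, loaded into a session, and manipulated with the toolset's operations; the plain-ML version above is only meant to suggest how small the gap is between the toolset's style and ordinary mathematical definitions.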
Special proofs that are carried out assuming the correctness of Forlan's implementation are labeled "[Forlan]", and theorems that are only proved in this way are also so-labeled.

Readers of this book are assumed to have a significant amount of experience reading and writing informal mathematical proofs, of the kind one finds in mathematics books. This experience could have been gained, e.g., in courses on discrete mathematics, logic or set theory. The core sections of the book assume no previous knowledge of Standard ML. Eventually, advanced sections covering the implementation of Forlan will be written, and these sections will assume the kind of familiarity with Standard ML that could be obtained by reading [Pau96] or [Ull98].

Outline of the Book

The book consists of five chapters. Chapter 1, Mathematical Background, consists of the material on set theory, induction principles for the natural numbers, and trees and inductive definitions that is required in the remaining chapters.

In Chapter 2, Formal Languages, we say what symbols, strings, alphabets and (formal) languages are, introduce and show how to use several string induction principles, and give an introduction to the Forlan toolset. The remaining three chapters introduce and study more restricted sets of languages.

In Chapter 3, Regular Languages, we study regular expressions and languages, four kinds of finite automata, algorithms for processing and converting between regular expressions and finite automata, properties of regular languages, and applications of regular expressions and finite automata to searching in text files and lexical analysis.

In Chapter 4, Context-free Languages, we study context-free grammars and languages, algorithms for processing grammars and for converting regular expressions and finite automata to grammars, and properties of context-free languages. It turns out that the set of all context-free languages is a proper superset of the set of all regular languages.

Finally, in Chapter 5, Recursive and Recursively Enumerable Languages, we study a universal programming language based on Lisp, which we use to define the recursive and recursively enumerable languages. We study algorithms for processing programs and for converting grammars to programs, and properties of recursive and recursively enumerable languages. It turns out that the context-free languages are a proper subset of the recursive languages, that the recursive languages are a proper subset of the recursively enumerable languages, and that there are languages that are not recursively enumerable. Furthermore, there are problems, like the halting problem (the problem of determining whether a program P halts when run on an input w), or the problem of determining if two grammars generate the same language, that can't be solved by programs.

Further Reading and Related Work

This book covers the core material that is typically presented in an undergraduate course on formal language theory. On the other hand, a typical textbook on formal language theory covers much more of the subject than we do.