CIT 425 - AUTOMATA THEORY, COMPUTABILITY AND FORMAL LANGUAGES
LECTURE NOTE BY DR. OYELAMI M. O.

Status: Core

Description: Words and Strings, Concatenation, Word Length; Language Definition; Regular Expression, Regular Language, Recursive Languages; Finite State Automata (FSA), State Diagrams; Pumping Lemma, Grammars; Applications in Computer Science and Engineering, Compiler Specification and Design, Text Editor and Implementation, Very Large Scale Integrated (VLSI) Circuit Specification and Design, Natural Language Processing (NLP) and Embedded Systems.

Introduction

This course constitutes the theoretical foundation of computer science. Loosely speaking, we can think of automata, grammars, and computability as the study of what can be done by computers in principle, while complexity addresses what can be done in practice. This course has applications in the following areas:
o Digital design
o Programming languages
o Compiler construction

Languages

Dictionaries define the term "language" informally as a system suitable for the expression of certain ideas, facts, or concepts, including a set of symbols and rules for their manipulation. While this gives us an intuitive idea of what a language is, it is not sufficient as a definition for the study of formal languages. We need a precise definition for the term. A formal language is an abstraction of the general characteristics of programming languages.

Languages can be specified in various ways. One way is to list all the words in the language. Another is to give some criteria that a word must satisfy to be in the language. A third, and important, way is to build the definition from the following terms:

Alphabet: A finite, nonempty set Σ of symbols.

String: A finite sequence of symbols from the alphabet. For example, if the alphabet is Σ = {a, b}, then abab and aaabbba are strings on Σ.
Concatenation: The concatenation of two strings w and v is the string obtained by appending the symbols of v to the right end of w. That is, if w = a₁a₂…aₙ and v = b₁b₂…bₘ, then the concatenation of w and v, denoted by wv, is
wv = a₁a₂…aₙb₁b₂…bₘ.

Reverse: The reverse of a string is obtained by writing the symbols in reverse order. If w is the string shown above, then its reverse wᴿ is
wᴿ = aₙ…a₂a₁.

Length: The length of a string w, denoted by |w|, is the number of symbols in the string.

Empty String: A string with no symbols at all, denoted by λ. The following simple relations hold for all w:
|λ| = 0,
λw = wλ = w.

Substring: Any string of consecutive symbols in some string w is a substring of w. If w = vu, then the substrings v and u are said to be a prefix and a suffix of w, respectively. For example, if w = abbab, then {λ, a, ab, abb, abba, abbab} is the set of all prefixes of w, while bab, ab, b are some of its suffixes.

If u and v are strings, then the length of their concatenation is the sum of the individual lengths, that is,
|uv| = |u| + |v|.

If w is a string, then wⁿ stands for the string obtained by repeating w n times. As a special case, we define w⁰ = λ for all w.

If Σ is an alphabet, then we use Σ* to denote the set of strings obtained by concatenating zero or more symbols from Σ. The set Σ* always contains λ. To exclude the empty string, we define
Σ⁺ = Σ* − {λ}.
While Σ is finite by assumption, Σ* and Σ⁺ are always infinite, since there is no limit on the length of the strings in these sets.

A language is defined very generally as a subset of Σ*. A string in a language L is called a word or sentence of L. This definition is quite broad: any set of strings on an alphabet Σ can be considered a language.

GRAMMARS

To study languages mathematically, we need a mechanism to describe them.
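Before grammars are formalized, the string operations defined earlier in this section (concatenation, reversal, length, prefixes and suffixes, powers, and Σ*) can be tried out in a short Python sketch. This is only an illustration: the function names are hypothetical, strings over Σ are modelled as ordinary Python strings, and λ is modelled as the empty string "". Since Σ* is infinite, the sketch lists only the strings up to a chosen length.

```python
from itertools import product

LAMBDA = ""  # assumption: the empty string λ is modelled as Python's ""


def reverse(w):
    """Return w written in reverse order."""
    return w[::-1]


def power(w, n):
    """Return w repeated n times; power(w, 0) is the empty string."""
    return w * n


def prefixes(w):
    """All prefixes of w, from the empty string up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]


def suffixes(w):
    """All suffixes of w, from w itself down to the empty string."""
    return [w[i:] for i in range(len(w) + 1)]


def strings_up_to(sigma, max_len):
    """Finite slice of Sigma*: every string over sigma of length <= max_len."""
    return ["".join(p)
            for n in range(max_len + 1)
            for p in product(sigma, repeat=n)]


w, v = "abbab", "ba"
print(w + v)                                  # concatenation wv
print(len(w + v) == len(w) + len(v))          # |wv| = |w| + |v|
print(reverse(w))
print(prefixes(w))
print(power("ab", 0) == LAMBDA)               # w^0 = λ
print(strings_up_to("ab", 2))
```

Removing the empty string from the result of strings_up_to gives the corresponding finite slice of Σ⁺.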
Definition: A grammar G is defined as a quadruple G = (V, T, S, P), where
• V is a finite set of objects called variables (non-terminals),
• T is a finite set of objects called terminal symbols,
• S ∈ V is a special symbol called the start variable,
• P is a finite set of productions.

Derivation

The production rules are the heart of a grammar; they specify how the grammar transforms one string into another, and through this they define a language associated with the grammar. The production rules are of the form
x → y,
where x is an element of (V ∪ T)⁺ and y is in (V ∪ T)*. The productions are applied in the following manner: given a string w of the form w = uxv, we say the production x → y is applicable to this string, and we may use it to replace x with y, thereby obtaining a new string
z = uyv.

Language Generated by a Grammar

The language generated by G, denoted L(G), is the set of all strings of terminals that can be derived from the start variable S by applying productions a finite number of times.

Example

Let G be the grammar with vocabulary V = {S, A, a, b} (all symbols, terminal and non-terminal), set of terminals T = {a, b}, starting symbol S, and productions P = {S → aA, S → b, A → aa}. What is L(G), the language of this grammar?

Solution

From the start symbol S we can derive aA using the production S → aA. We can also use the production S → b to derive b. From aA, the production A → aa can be used to derive aaa. No additional words can be derived. Hence, L(G) = {b, aaa}.

Example

State a grammar that generates the set {0ⁿ1ⁿ | n = 0, 1, 2, …}.

Solution

One solution is the grammar G = (V, T, S, P), where V = {A, S}, T = {0, 1}, S is the starting symbol, and the productions are
S → 0A1
S → A
A → 0A1
A → λ.

Chomsky Classification of Grammars

Grammars can be classified according to the types of productions that are allowed. According to Noam Chomsky, there are four types of grammars: Type 0, Type 1, Type 2, and Type 3. The following table shows how they differ from each other:

[Table: Chomsky classification of grammars, Type 0 to Type 3]

Type 3 grammar is also called Regular Grammar.

Regular Languages and Regular Grammars

A language is regular if there exists a finite accepter for it.
Therefore, every regular language can be described by some DFA or some NFA. Such a description can be very useful, for example, if we want to show the logic by which we decide if a given string is in a certain language. But in many instances, we need more concise ways of describing regular languages.

Regular Expressions

Regular languages can be described by means of regular expressions. A regular expression is a formula that describes a set of strings of a language.
• Regular expressions can express a language, finite or infinite, by defining a pattern for finite strings of symbols.
• The grammar defined by regular expressions is known as regular grammar.
• The language defined by regular grammar is known as regular language.
• As in mathematics, parentheses ( and ) are used for grouping.

There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate regular expressions.

Operations

The three operations on languages employed by regular expressions are:
• Union of two languages L and M, written L ∪ M = {s | s is in L or s is in M}
• Concatenation of two languages L and M, written LM = {st | s is in L and t is in M}
• The Kleene closure of a language L, written L*, which is the set of strings formed by concatenating zero or more strings from L.

Notations

If r and s are regular expressions denoting the languages L(r) and L(s), then
• Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s), also written L(r + s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s), also written L(r.s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*, also written L(r*)
• (r) is a regular expression denoting L(r)

Precedence and Associativity

• *, concatenation (.), and | (pipe sign) are left associative.
• * has the highest precedence.
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.

Representing Valid Tokens of a Language in Regular Expression

If x is a regular expression, then:
x* means zero or more occurrences of x,
i.e., it can generate {ε, x, xx, xxx, xxxx, …}, where ε denotes the empty string (written λ earlier in these notes).
x+ means one or more occurrences of x, i.e., it can generate {x, xx, xxx, xxxx, …}; it is equivalent to x.x*.
x? means at most one occurrence of x, i.e., it can generate either x or ε.
[a-z] is all lower-case letters of the English alphabet.
[A-Z] is all upper-case letters of the English alphabet.
[0-9] is all decimal digits.

Representing Occurrence of Symbols using Regular Expressions

letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, or [0-9]
sign = + | -

Representing Language Tokens using Regular Expressions

Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*

SUMMARY OF THE COMPONENTS OF REGULAR EXPRESSION

Regular Operators

Example: If language L is {nb, qb} and language M is {cf, cpq}, then
L+M is {nb, qb, cf, cpq}
LM is {nbcf, nbcpq, qbcf, qbcpq}
L* is {ε, nb, qb, nbnb, …, qbnbnbqb, …}

Note: Σ is an alphabet and Σ* is the set of all strings from the alphabet.

Examples of Regular Expression

Union

The symbol + means union, that is, "or".

Example 1: 0+1 means either a zero or a one.

Example 2: Consider the expression (0+1)01*. The language described by this expression is the set of all binary strings
• that start with either 0 or 1, as indicated by (0+1),
• for which the second symbol is 0,
• that end with zero or more 1s, as indicated by 1*.
The language described by this expression is {00, 001, 0011, 00111, …, 10, 101, 1011, 10111, …}.

Concatenation

The concatenation of two REs is obtained by writing one after the other.
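The operations and token patterns in this section can be checked with a minimal Python sketch. The assumptions are these: languages are modelled as Python sets of strings, the helper names (union, concat, star_up_to) are hypothetical, and the patterns are written in the syntax of Python's re module, where union is | rather than +. Because L* is infinite, star_up_to lists only the concatenations of at most k strings from L.

```python
import re


def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string st with s in L and t in M."""
    return {s + t for s in L for t in M}


def star_up_to(L, k):
    """Finite part of L*: concatenations of at most k strings from L."""
    result = {""}          # zero strings concatenated gives the empty string
    layer = {""}
    for _ in range(k):
        layer = concat(layer, L)
        result |= layer
    return result


L = {"nb", "qb"}
M = {"cf", "cpq"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(star_up_to(L, 2)))

# (0+1)01* from Example 2, written with | for union.
PATTERN = re.compile(r"(0|1)01*")
# Decimal = (sign)?(digit)+ and Identifier = (letter)(letter | digit)*
DECIMAL = re.compile(r"[+-]?[0-9]+")
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

# fullmatch requires the whole string to match, which corresponds to
# membership in the language described by the expression.
for s in ["00", "1011", "010", "1"]:
    print(s, bool(PATTERN.fullmatch(s)))
for s in ["-42", "7", "+"]:
    print(s, bool(DECIMAL.fullmatch(s)))
for s in ["x1", "1x"]:
    print(s, bool(IDENTIFIER.fullmatch(s)))
```

Python's re module supports features beyond the three regular operators, but the patterns above use only union, concatenation, and closure (plus the ?, + and character-class shorthands defined in this section).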
