CS243, Logic and Computation Beyond Regular Languages 1

CS243, Prof. Alvarez 1 REGULAR LANGUAGES

Prof. Sergio A. Alvarez
Maloney Hall, room 569
Computer Science Department
Boston College
Chestnut Hill, MA 02467 USA

http://www.cs.bc.edu/~alvarez/
[email protected]
voice: (617) 552-4333
fax: (617) 552-6790

CS243, Logic and Computation

Beyond regular languages

Many computational problems may be cast as language recognition problems. For example, determining divisibility of an integer, $n$, by a given modulus, say 7, is equivalent to determining membership of the decimal representation of $n$ in a particular set of strings. The resulting language recognition problem can be solved by a suitable machine. The machine models that we have discussed so far (DFA, NFA) can solve some, but not all such problems. We will explore the limits of FA, then push beyond them.

1 Regular languages

Definition 1.1. A language $L$ over an alphabet $\Sigma$ is regular if $L = L(R)$ for some regular expression $R$ over $\Sigma$.

Since regular expressions are linguistically equivalent to DFA and to NFA, a language is regular if, and only if, it is recognized by some DFA.

1.1 Examples of computational problems that correspond to regular languages

1.1.1 Syntax of numerical constants and identifiers

Integers. Consider the problem of determining if a string represents a valid decimal integer. This problem occurs during lexical analysis of computer programs. The integer corresponds to a string over the alphabet $\{0-9\} \cup \{+, -\}$. The sign symbol, if any, should occur at the beginning (left) of the string, and the first numerical digit of the string should not be 0 unless there are no other digits. For example, 35, 0, −759, and −0 are valid integers, but 03, −078, −, and 57−0 are not.
Any valid decimal integer string should match the following regular expression, where dashes provide a shorthand notation to indicate ranges of values (e.g., $1-9$ represents the union of the individual nonzero digit symbols):

$$I = (\varepsilon \cup + \cup -)\,(0 \cup (1-9)(0-9)^{*})$$

Starting from this expression, it is easy to construct an NFA that recognizes the corresponding regular language $L(I)$ of all valid decimal integers, and to convert that NFA to a DFA.

Floating-point numbers. Floating-point numbers can be dealt with similarly. For example, consider floating-point numbers in the format below, where parentheses indicate optional elements (the parentheses are not valid symbols in the actual number strings), and nesting indicates dependence: the fraction can only occur if the decimal point occurs, and each of the E and the exponent can only occur if the other occurs.

    whole (.(fraction)) (E exponent)

The whole and exponent must be valid decimal integers as defined via the regular expression $I$ above. The fraction must be a valid decimal integer with no sign. For example, the strings −12.5, 7E2, −9.73E−25, 7., −0.39, 5E0, and 0E4 are considered valid numbers, but the strings −01, 75E03, 3E, .25, and 50.−79 are not. A regular expression that captures the syntax of this type of floating-point number can be constructed in terms of the regular expression for decimal integers described above (exercise).

Identifiers. Python provides the following naming rules:

    identifier ::= (letter|"_") (letter | digit | "_")*
    letter ::= lowercase | uppercase
    lowercase ::= "a"..."z"
    uppercase ::= "A"..."Z"
    digit ::= "0"..."9"

Each line represents a regular expression. The symbol | is used instead of $\cup$, and the dots ... are shorthand for a range of values. The top line defines the syntax of an identifier in terms of regular expressions for certain elements that are defined in the remaining lines.
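As a rough check of the integer expression $I$ and the identifier rules above, both can be transcribed into Python's `re` syntax and run against the example strings from the text. This is only a sketch: the names `INTEGER`, `IDENTIFIER`, `is_integer`, and `is_identifier` are illustrative, and the ASCII hyphen `-` is assumed to stand for the minus sign.

```python
import re

# Transcription of I: an optional sign, then either a lone 0
# or a nonzero digit followed by any number of digits.
INTEGER = re.compile(r"[+-]?(0|[1-9][0-9]*)")

# Transcription of the identifier rule quoted in the text
# (ASCII letters, digits, and underscore only).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_integer(s: str) -> bool:
    """True if the whole string s matches the expression I."""
    return INTEGER.fullmatch(s) is not None


def is_identifier(s: str) -> bool:
    """True if the whole string s matches the identifier rule."""
    return IDENTIFIER.fullmatch(s) is not None


# Example strings taken from the text (ASCII hyphen assumed for the sign).
valid = ["35", "0", "-759", "-0"]
invalid = ["03", "-078", "-", "57-0"]

print([is_integer(s) for s in valid])    # every entry should be True
print([is_integer(s) for s in invalid])  # every entry should be False
```

`fullmatch` is used rather than `match` because membership in $L(I)$ requires the entire string to match, not just a prefix.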
1.1.2 Divisibility by 3

Consider the problem of determining if a decimal integer is a multiple of 3. This problem can be solved efficiently by relying on basic rules of modular arithmetic. Since a string of decimal digits $d_1 \cdots d_n$ represents the numerical value

$$v(d) = \sum_{i=1}^{n} d_i \, 10^{n-i},$$

and since $10 \equiv 1 \bmod 3$, we see that

$$v(d) \equiv \sum_{i=1}^{n} d_i \, (10 \bmod 3)^{n-i} \equiv \sum_{i=1}^{n} d_i \bmod 3.$$

Therefore, determining divisibility of $d$ by 3 only requires computation of the mod 3 sum of the digits of $d$. In turn, this can be accomplished using a DFA with only three states, one for each possible remainder mod 3 (exercise).

1.2 Pumping property of regular languages

Finiteness of the state space of a DFA leads to an interesting property of the language recognized by such a machine. This property provides clues about what languages are regular.

Lemma 1.1. If $M$ is a DFA that has a state space with $N$ states, then the computation of $M$ on any input string of length $N$ or greater contains a closed path in the state space.

Proof. Suppose that $M$ and $N$ are as in the statement, and that $w$ is a string of length $m \geq N$ over the input alphabet of $M$. Let $q_0, q_1, \cdots, q_m$ be the state sequence of $M$ on input $w$. Thus, $q_t$ is the state immediately after reading the $t$-th symbol $w_t$ of $w$. Since $M$ has only $N$ different states, but the state sequence has at least $N + 1$ elements, there must exist two different "times" $s < t$ such that $q_s = q_t$. The portion of the computation $q_s, \cdots, q_t$ is a closed path in the state space.

Theorem 1.2. If $L$ is a regular language, then there is a finite integer $p$ such that any string $w \in L$ of length $p$ or greater may be split as a concatenation $w = xyz$, such that:

1. $y \neq \varepsilon$
2. $xy$ has length $p$ or less
3. $\forall n \in \mathbb{Z}^{+} \cup \{0\}$, $x y^{n} z \in L$

Furthermore, $p$ can be taken to be the number of states in a DFA that recognizes $L$.

Proof. Suppose that $M$ is a DFA that recognizes $L$.
Let $p$ be the number of elements in the state space of $M$, and let $w$ be any string of length $m \geq p$ that is accepted by $M$. By Lemma 1.1, the computation of $M$ on input $w$, say $q_0, q_1, \cdots, q_m$, contains a closed path in the state space. In fact, we know that such a loop occurs on input $w_1 \cdots w_p$ as well. For concreteness, say that $q_s = q_t$ for the two distinct "times" $s < t \leq p$. Split $w$ as follows. Let $x = w_1 \cdots w_s$, $y = w_{s+1} \cdots w_t$, and $z = w_{t+1} \cdots w_m$. Clearly, $w = xyz$, since each $w_i$ appears in precisely one of $x, y, z$. Also, $y \neq \varepsilon$, because $s < t$, and $xy$ has length $t \leq p$ as discussed above. Finally, notice that $y$ drives $M$ from state $q_s$ back to itself, since $q_s = q_t$. Let $n$ be any non-negative integer, and consider the string $x y^{n} z$. Notice that $y^{n}$ drives $M$ from state $q_s$ back to itself, since $y$ does (this is true even if $n = 0$, since that simply skips the state loop altogether). It follows that $x y^{n} z$ is accepted by $M$, because the final portion of the computation of $M$ on input $x y^{n} z$ is identical to that of $M$ on input $w$, since it starts at the same state $q_t$ and is driven by $z$ in both cases. This completes the proof.

Theorem 1.2 is known as the Pumping Lemma for regular languages. Any integer $p$ with the property stated in the Pumping Lemma for a given language $L$ is known as a "pumping length" for $L$. The Pumping Lemma asserts that every regular language has a finite pumping length. Notice the following interesting consequence of the Pumping Lemma.

Corollary 1.3. Suppose that $L$ is a regular language and that $p$ is a pumping length of $L$. If $L$ contains any strings of length $p$ or greater, then $L$ contains arbitrarily long strings and in particular is an infinite set.

Proof. Since the middle portion $y$ in the split $w = xyz$ of any string of length $p$ or greater is non-empty, the pumped string $x y^{n} z$, which belongs to $L$ by Theorem 1.2, will be as long as desired by making $n$ large enough (the length of $x y^{n} z$ is at least $n$).

Example 1.1.
The language

$$L = \{ w \in \{0, 1\}^{*} \mid w \text{ contains an odd number of 1 symbols} \}$$

is recognized by a 2-state DFA, and therefore, by Theorem 1.2, has 2 as a pumping length. To confirm directly that 2 is a pumping length of $L$, consider any string $w \in L$ that has length 2 or greater. I claim that there is some nonempty substring $y$ of consecutive symbols of $w$ that has length 2 or less, occurs no later than the second position of $w$, and contains an even number of 1s: if $w$ contains any 0s among the first two positions, just isolate a single one of those 0s and call it $y$; if the length two prefix of $w$ contains only 1s, pick that prefix as $y$ (this can be done because the length of $w$ is two or greater). With this choice of $y$, split $w$ as $w = xyz$. Then the pumped string $x y^{n} z$ contains an odd number of 1s (because this is true for $n = 1$, and the number of 1s differs by an even number between different powers, $n$). This shows that 2 is a pumping length of $L$.

Example 1.2. The language

$$L = \{ w \in \{0, 1, 2\}^{*} \mid w \text{ represents a multiple of 3 in base 3 positional notation} \}$$

is regular (exercise), and therefore, by Theorem 1.2, has some finite pumping length, $p$.
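The split constructed in the proof of Theorem 1.2 can be sketched in Python by simulating a DFA and cutting the input at the first repeated state. Everything here is illustrative: the helper names (`run`, `accepts`, `pumping_split`, `pumps`) and the dictionary encoding of the transition function are assumptions, `ODD_ONES` is one possible 2-state DFA for the language of Example 1.1, and `MOD3` is one possible 3-state DFA for the decimal divisibility-by-3 problem of Section 1.1.2, built from the digit-sum rule.

```python
def run(delta, start, w):
    """Return the state sequence q_0, ..., q_m of the DFA on input w."""
    states = [start]
    for symbol in w:
        states.append(delta[(states[-1], symbol)])
    return states


def accepts(delta, start, accepting, w):
    """True if the DFA ends in an accepting state after reading w."""
    return run(delta, start, w)[-1] in accepting


def pumping_split(delta, start, w):
    """Split w = xyz at the first repeated state q_s = q_t with s < t."""
    first_visit = {}  # state -> earliest time it was visited
    for t, q in enumerate(run(delta, start, w)):
        if q in first_visit:
            s = first_visit[q]
            return w[:s], w[s:t], w[t:]
        first_visit[q] = t
    raise ValueError("no state repeats on this input")


def pumps(delta, start, accepting, w, max_n=5):
    """Check that x y^n z is accepted for n = 0, ..., max_n."""
    x, y, z = pumping_split(delta, start, w)
    return all(accepts(delta, start, accepting, x + y * n + z)
               for n in range(max_n + 1))


# Hypothetical 2-state DFA for "odd number of 1s" (Example 1.1).
ODD_ONES = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd", ("odd", "1"): "even",
}

# Hypothetical 3-state DFA tracking the digit sum mod 3 (Section 1.1.2).
MOD3 = {(r, str(d)): (r + d) % 3 for r in range(3) for d in range(10)}

print(pumping_split(ODD_ONES, "even", "1101"))  # ('', '11', '01')
print(pumps(ODD_ONES, "even", {"odd"}, "1101"))  # True
print(pumping_split(MOD3, 0, "759"))             # ('', '75', '9')
print(pumps(MOD3, 0, {0}, "759"))                # True
```

Cutting at the first repetition guarantees that $xy$ is no longer than the number of states, which matches condition 2 of the theorem. Note that `pumps` only tests finitely many values of $n$, so it illustrates the property rather than proving it.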
