
CS243, Prof. Alvarez 1 REGULAR LANGUAGES

Prof. Sergio A. Alvarez http://www.cs.bc.edu/~alvarez/ Maloney Hall, room 569 [email protected] Department voice: (617) 552-4333 Boston College fax: (617) 552-6790 Chestnut Hill, MA 02467 USA

CS243, Logic and Computation

Beyond regular languages

Many computational problems may be cast as language recognition problems. For example, determining divisibility of an integer, n, by a given modulus, say 7, is equivalent to determining membership of the representation of n in a particular set of strings. The resulting language recognition problem can be solved by a suitable machine. The machine models that we have discussed so far (DFA, NFA) can solve some, but not all, such problems. We will explore the limits of FA, then push beyond them.

1 Regular languages

Definition 1.1. A language L over an alphabet Σ is regular if L = L(R) for some regular expression R over Σ. Since regular expressions are linguistically equivalent to DFA and to NFA, a language is regular if, and only if, it is recognized by some DFA.

1.1 Examples of computational problems that correspond to regular languages

1.1.1 Syntax of numerical constants and identifiers

Integers. Consider the problem of determining if a string represents a valid decimal integer. This problem occurs during lexical analysis of computer programs. The integer corresponds to a string over the alphabet {0 − 9} ∪ {+, −}. The sign symbol, if any, should occur at the beginning (left) of the string, and the first digit of the string should not be 0 unless there are no other digits. For example, 35, 0, −759, and −0 are valid integers, but 03, −078, −, and 57−0 are not. Any valid decimal integer string should match the following regular expression, where dashes provide a shorthand notation to indicate ranges of values (e.g., 1 − 9 represents the union of the individual nonzero digit symbols):

I = (ε ∪ + ∪ −)(0 ∪ (1 − 9)(0 − 9)∗)

Starting from this expression, it is easy to construct an NFA that recognizes the corresponding regular language L(I) of all valid decimal integers, and to convert that NFA to a DFA.
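As a quick sanity check of the expression I, the sketch below translates it into Python's `re` syntax and tries it on the example strings from the text. The name `is_valid_integer` is hypothetical, and the sketch assumes that the sign symbols are the ASCII characters `+` and `-`.

```python
import re

# Translation of I = (ε ∪ + ∪ −)(0 ∪ (1−9)(0−9)*) into Python regex syntax.
# Assumption: signs are the ASCII characters '+' and '-'.
INTEGER = re.compile(r"[+-]?(?:0|[1-9][0-9]*)")


def is_valid_integer(s: str) -> bool:
    """Return True if the entire string s matches the expression I."""
    return INTEGER.fullmatch(s) is not None


valid = ["35", "0", "-759", "-0"]
invalid = ["03", "-078", "-", "57-0"]
print([is_valid_integer(s) for s in valid])
print([is_valid_integer(s) for s in invalid])
```

`fullmatch` is used rather than `match` because membership in L(I) requires the whole string, not just a prefix, to match.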

Floating-point numbers. Floating-point numbers can be dealt with similarly. For example, consider floating-point numbers in the format below, where parentheses indicate optional elements (the parentheses are not valid symbols in the actual strings), and nesting indicates dependence: the fraction can only occur if the decimal point occurs, and each of the E and the exponent can only occur if the other occurs.

whole (.(fraction)) (E exponent)

The whole and exponent must be valid decimal integers as defined via the regular expression I above. The fraction must be a valid decimal integer with no sign. For example, the strings −12.5, 7E2, −9.73E−25, 7., −0.39, 5E0, and 0E4 are considered valid numbers, but the strings −01, 75E03, 3E, .25, and 50.−79 are not. A regular expression that captures the syntax of this type of floating-point number can be constructed in terms of the regular expression for decimal integers described above (exercise).

Identifiers. Python provides the following naming rules:

identifier ::= (letter|"_") (letter | digit | "_")*
letter ::= lowercase | uppercase
lowercase ::= "a"..."z"
uppercase ::= "A"..."Z"
digit ::= "0"..."9"

Each line represents a regular expression. The symbol | is used instead of ∪, and the dots ... are shorthand for a range of values. The top line defines the syntax of an identifier in terms of regular expressions for certain elements that are defined in the remaining lines.
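The grammar above can be assembled into a single regular expression by substituting the lower lines into the top one. The following is a minimal sketch of that substitution using Python's `re` module; the names `LETTER`, `DIGIT`, and `is_identifier` are hypothetical, and the check covers only the syntax shown here (it does not, for instance, exclude reserved words).

```python
import re

# letter ::= lowercase | uppercase, digit ::= "0"..."9"
LETTER = "[a-zA-Z]"
DIGIT = "[0-9]"

# identifier ::= (letter|"_") (letter | digit | "_")*
IDENTIFIER = re.compile(rf"(?:{LETTER}|_)(?:{LETTER}|{DIGIT}|_)*")


def is_identifier(s: str) -> bool:
    """Return True if the entire string s matches the identifier grammar."""
    return IDENTIFIER.fullmatch(s) is not None


print(is_identifier("_count1"), is_identifier("1count"))
```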

1.1.2 Divisibility by 3

Consider the problem of determining if a decimal integer is a multiple of 3. This problem can be solved efficiently by relying on basic rules of modular arithmetic. Since a string of decimal digits $d = d_1 \cdots d_n$ represents the numerical value

$$v(d) = \sum_{i=1}^{n} d_i \, 10^{n-i},$$

and since 10 ≡ 1 mod 3, we see that

$$v(d) \equiv \sum_{i=1}^{n} d_i \, (10 \bmod 3)^{n-i} \equiv \sum_{i=1}^{n} d_i \pmod 3.$$

Therefore, determining divisibility of d by 3 only requires computation of the mod 3 sum of the digits of d. In turn, this can be accomplished using a DFA with only three states, one for each possible remainder mod 3 (exercise).

1.2 Pumping property of regular languages

Finiteness of the state space of a DFA leads to an interesting property of the language recognized by such a machine. This property provides clues about what languages are regular.

Lemma 1.1. If M is a DFA that has a state space with N states, then the computation of M on any input string of length N or greater contains a closed path in the state space.

Proof. Suppose that M and N are as in the statement, and that w is a string of length m ≥ N over the input alphabet of M. Let $q_0, q_1, \ldots, q_m$ be the state sequence of M on input w. Thus, $q_0$ is the start state and $q_t$ is the state immediately after reading the t-th symbol $w_t$ of w. Since M has only N different states, but the state sequence has at least N + 1 elements, there must exist two different “times” s < t such that $q_s = q_t$. The portion of the computation $q_s, \ldots, q_t$ is a closed path in the state space.
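The pigeonhole argument in this proof is constructive, and a small sketch may make it concrete. The code below records the state sequence of a DFA and returns the first pair of times s < t at which a state repeats. The two-state DFA `DELTA` (tracking the parity of the number of 1s read) and the function names are assumptions chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical example DFA with N = 2 states: parity of the number of 1s.
DELTA: Dict[Tuple[str, str], str] = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd", ("odd", "1"): "even",
}


def state_sequence(delta: Dict[Tuple[str, str], str], start: str, w: str) -> List[str]:
    """Return q_0, q_1, ..., q_m for the computation on input w."""
    states = [start]
    for symbol in w:
        states.append(delta[(states[-1], symbol)])
    return states


def first_closed_path(states: List[str]) -> Optional[Tuple[int, int]]:
    """Return the first times (s, t) with s < t and q_s == q_t, or None."""
    seen: Dict[str, int] = {}
    for t, q in enumerate(states):
        if q in seen:
            return seen[q], t
        seen[q] = t
    return None


seq = state_sequence(DELTA, "even", "11")
print(seq, first_closed_path(seq))
```

For an input of length N or greater, the lemma guarantees that `first_closed_path` does not return `None`, and that the returned t is at most N.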

Theorem 1.2. If L is a regular language, then there is a finite integer p such that any string w ∈ L of length p or greater may be split as a concatenation w = xyz, such that:

1. y ≠ ε

2. xy has length p or less

3. ∀n ∈ Z+ ∪ {0}, $xy^n z$ ∈ L

Furthermore, p can be taken to be the number of states in a DFA that recognizes L.

Proof. Suppose that M is a DFA that recognizes L. Let p be the number of elements in the state space of M, and let w be any string of length m ≥ p that is accepted by M. By Lemma 1.1, the computation of M on input w, say $q_0, q_1, \ldots, q_m$, contains a closed path in the state space. In fact, since the prefix $w_1 \cdots w_p$ already has length p, Lemma 1.1 shows that such a loop occurs on input $w_1 \cdots w_p$ as well. For concreteness, say that $q_s = q_t$ for the two distinct “times” s < t ≤ p. Split w as follows. Let $x = w_1 \cdots w_s$, $y = w_{s+1} \cdots w_t$, and $z = w_{t+1} \cdots w_m$. Clearly, w = xyz, since each $w_i$ appears in precisely one of x, y, z. Also, y ≠ ε, because s < t, and xy has length t ≤ p as discussed above. Finally, notice that y drives M from state $q_s$ back to itself, since $q_s = q_t$. Let n be any non-negative integer, and consider the string $xy^n z$. Notice that $y^n$ drives M from state $q_s$ back to itself, since y does (this is true even if n = 0, since that simply skips the state loop altogether). It follows that $xy^n z$ is accepted by M, because the final portion of the computation of M on input $xy^n z$ is identical to that of M on input w, since it starts at the same state $q_t$ and is driven by z in both cases. This completes the proof.

Theorem 1.2 is known as the Pumping Lemma for regular languages. Any integer p with the property stated in the Pumping Lemma for a given language L is known as a “pumping length” for L. The Pumping Lemma asserts that every regular language has a finite pumping length. Notice the following interesting consequence of the Pumping Lemma.

Corollary 1.3. Suppose that L is a regular language and that p is a pumping length of L. If L contains any strings of length p or greater, then L contains arbitrarily long strings and in particular is an infinite set.

Proof. Since the middle portion y in the split w = xyz of any string of length p or greater is non-empty, the pumped string $xy^n z$ will be as long as desired by making n large enough (the length of $xy^n z$ is at least n).

Example 1.1. The language L = {w ∈ {0, 1}∗ | w contains an odd number of 1 symbols} is recognized by a 2-state DFA, and therefore, by Theorem 1.2, has 2 as a pumping length. To confirm directly that 2 is a pumping length of L, consider any string w ∈ L that has length 2 or greater. I claim that there is some nonempty substring y of consecutive symbols of w that has length 2 or less, ends no later than the second position of w, and contains an even number of 1s: if w contains any 0s among the first two positions, just isolate a single one of those 0s and call it y; if the length two prefix of w contains only 1s, pick that prefix as y (this can be done because the length of w is two or greater). With this choice of y, split w as w = xyz. Then the pumped string $xy^n z$ contains an odd number of 1s (because this is true for n = 1, and the number of 1s differs by an even number between different powers, n). This shows that 2 is a pumping length of L.

Example 1.2. The language L = {w ∈ {0, 1, 2}∗ | w represents a multiple of 3 in base 10} is regular (exercise), and therefore, by Theorem 1.2, has some finite pumping length, p. We will determine a pumping length of L directly, without constructing a DFA that recognizes L. Let w be any string in L that has length 3 or greater. A key observation (see Section 1.1.2) is that the strings of L are precisely those strings over {0, 1, 2} whose digits sum to 0 mod 3. Consider the first three digits of w. I claim that there is some nonempty substring y of consecutive digits among these three that has a sum of 0 mod 3: this is certainly the case if all three digits are the same or if there is at least one 0 among the three; otherwise, one has either two 1s and one 2, or else one has two 2s and one 1, and in either case one can take y as consisting of a 2 and a contiguous 1. Pumping the associated split w = xyz, for the choice of y just described, produces a string $xy^n z$ with the same digit sum as w mod 3 (for any non-negative integer value of n), and therefore the pumped string $xy^n z$ also belongs to L. This proves that L has 3 as a pumping length.
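The pumping lengths claimed in Examples 1.1 and 1.2 can be spot-checked by brute force. The sketch below is only a bounded search, not a proof: it tries every string up to a length bound and every split with |xy| ≤ p, and pumps only for n = 0, ..., `max_n`. All function names and the bounds `max_len` and `max_n` are assumptions; the second language is tested through the digit-sum criterion used in Example 1.2.

```python
from itertools import product
from typing import Callable


def has_pumpable_split(w: str, p: int, in_language: Callable[[str], bool],
                       max_n: int = 4) -> bool:
    """Look for w = xyz with y nonempty, |xy| <= p, and x y^n z in the
    language for every n in 0..max_n (a finite check only)."""
    for end in range(1, min(p, len(w)) + 1):   # end = |xy|
        for start in range(end):               # start = |x|
            x, y, z = w[:start], w[start:end], w[end:]
            if all(in_language(x + y * n + z) for n in range(max_n + 1)):
                return True
    return False


def check_pumping_length(alphabet: str, p: int,
                         in_language: Callable[[str], bool],
                         max_len: int = 8) -> bool:
    """Check the pumping condition on all members of length p..max_len."""
    for length in range(p, max_len + 1):
        for symbols in product(alphabet, repeat=length):
            w = "".join(symbols)
            if in_language(w) and not has_pumpable_split(w, p, in_language):
                return False
    return True


def odd_ones(w: str) -> bool:
    """Example 1.1: an odd number of 1 symbols."""
    return w.count("1") % 2 == 1


def digit_sum_mod3(w: str) -> bool:
    """Example 1.2, via its criterion: digits sum to 0 mod 3."""
    return sum(int(c) for c in w) % 3 == 0


print(check_pumping_length("01", 2, odd_ones))
print(check_pumping_length("012", 3, digit_sum_mod3))
```

Running the same check with p = 1 for the first language, or p = 2 for the second, finds strings with no suitable split (for instance 10 and 111, respectively), so the values 2 and 3 cannot be lowered for these two languages.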

2 Non-regular languages

The Pumping Lemma (Theorem 1.2) states a condition that is satisfied by every regular language. While the condition is not sufficient for regularity, it is certainly necessary. Thus, any language that can be shown to fail the pumping condition is not regular. Do any such languages actually exist?

Example 2.1. Consider the language L = {$0^n 1^n$ | n ∈ Z+} that consists of all binary strings consisting of some nonzero number of 0s followed by precisely the same number of 1s. Heuristically, any DFA that is capable of keeping track of the precise length of the initial segment of 0s will need to have at least as many states as there are 0s in the segment (otherwise states will be repeated during the computation and the count will be lost). This is not quite a rigorous argument, however, so we will argue more methodically below that L is, in fact, not regular.

Let p be any candidate for a pumping length of L. In other words, we would need that all strings w of L that have length p or greater can be split as w = xyz, such that y is nonempty, xy has length p or less, and the pumped string $xy^n z$ is in L for all non-negative integers n. Consider the specific string $w = 0^p 1^p$. This string has length 2p, which is certainly p or greater. However, if y is any non-empty substring of the first p symbols in w, then y consists solely of one or more 0s, and pumping the corresponding split w = xyz with any n ≠ 1 changes the number of 0s but not the number of 1s, producing strings without an equal number of 0s and 1s. Therefore, L does not satisfy the Pumping Lemma, so L is not regular.
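For small values of p, the failure described in this example can be observed directly. The sketch below enumerates every split of the string consisting of p 0s followed by p 1s, with y nonempty and |xy| at most p, and reports the splits whose pumped strings remain in L for n = 0, ..., `max_n`. The function names and the bound `max_n` are assumptions, and a finite search of this kind only illustrates the argument; it does not replace it.

```python
from typing import List, Tuple


def in_L(w: str) -> bool:
    """Membership in {0^n 1^n | n >= 1}."""
    n = len(w) // 2
    return n >= 1 and w == "0" * n + "1" * n


def surviving_splits(p: int, max_n: int = 3) -> List[Tuple[str, str, str]]:
    """Splits (x, y, z) of 0^p 1^p with y nonempty and |xy| <= p whose
    pumped strings x y^n z all lie in L for n = 0..max_n."""
    w = "0" * p + "1" * p
    found = []
    for end in range(1, p + 1):        # end = |xy|
        for start in range(end):       # start = |x|
            x, y, z = w[:start], w[start:end], w[end:]
            if all(in_L(x + y * n + z) for n in range(max_n + 1)):
                found.append((x, y, z))
    return found


for p in range(1, 8):
    print(p, surviving_splits(p))
```

Every list printed is empty, in line with the claim that no p can serve as a pumping length of L.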

Example 2.2. Let L = {ww | w ∈ {0, 1}∗}. That is, a binary string belongs to L if, and only if, it can be written as the concatenation of some binary string with itself. We will show that L fails to satisfy the conclusion of the Pumping Lemma, and is therefore not regular. Suppose that p is any positive integer, and consider the string $w = 0^p 1^p 0^p 1^p$ ∈ L. This w has length 4p, which is, in particular, p or greater. Let w = xyz be any split in which y is nonempty and xy has length at most p. Because of the choice of w, the prefix xy will therefore only consist of 0s. If p were a pumping length for L, then “pumping down” to xz (equal to $xy^0 z$) should produce a string of L. In other words, it should be possible to remove p or fewer consecutive 0s from the left end of w to obtain a strictly shorter string of the form uu. We will show that this is not possible. That will prove that L has no finite pumping length, and therefore that L is not regular.

Note first that the length of the removed portion y must be even if the remaining portion xz is to be in L, since every string of L has even length. If an even number 2k of 0s (with k ≥ 1) is removed from the left end of w, then the remaining string xz will be $0^{p-2k} 1^p 0^p 1^p$, which has length 4p − 2k. The left half of this string is $0^{p-2k} 1^p 0^k$, which ends in 0. The right half ends in 1, and doesn’t match the left half.
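The pumping-down step can also be checked mechanically for small p. Since xy lies inside the initial block of 0s, removing y always leaves the same string as deleting |y| leading 0s, so the sketch below simply deletes j leading 0s for each j from 1 to p and tests whether the result has the form uu. The function names are hypothetical.

```python
from typing import List


def is_square(s: str) -> bool:
    """Membership in {uu | u in {0,1}*}."""
    half, odd = divmod(len(s), 2)
    return odd == 0 and s[:half] == s[half:]


def pump_down_survivors(p: int) -> List[int]:
    """Values j in 1..p for which deleting j leading 0s from
    0^p 1^p 0^p 1^p leaves a string of the form uu."""
    w = ("0" * p + "1" * p) * 2
    return [j for j in range(1, p + 1) if is_square(w[j:])]


for p in range(1, 8):
    print(p, pump_down_survivors(p))
```

Each list printed is empty: no admissible choice of y survives pumping down, which matches the parity and half-comparison argument above.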

Example 2.3. We show that determining primality is beyond the reach of DFA. Determining primality is equivalent to determining membership in the language L = {$1^p$ | p is prime}, where the alphabet consists of the single symbol 1. We claim that this language fails the conclusion of the Pumping Lemma and therefore is not regular. Suppose that N is any positive integer. We will show that N is not a pumping length of L. Since the argument will apply to all positive integer values N, we will conclude that L has no finite pumping length. Let w be any string of L that has length N or greater. In other words, the length of w is a prime p ≥ N. Let w = xyz be any split in which y is nonempty and xy has length N or less. We claim that some pumped version $xy^n z$ will have a non-prime length and therefore will not belong to L. If x, y, and z have lengths a, b, c, respectively, where a + b + c = p, then $xy^n z$ has length a + bn + c = (a + b + c) + (n − 1)b = p + (n − 1)b. We claim that the latter value fails to be prime for some positive integer n. Indeed, we can simply let n = p + 1, and then the length of the pumped string will be p + pb = p(1 + b), which is not prime because both factors are at least 2 (p is prime and b ≥ 1).

3 Exercises

1. Is the language {$a^{795}$ mod 117 | a ∈ Z+} regular? Explain carefully.

2. Let n be any positive integer. What is the smallest m = f(n) such that there is a DFA with precisely m states that recognizes some finite language that contains precisely n strings? Prove your answer. Explain carefully.

3. Suppose that M is a DFA with N states, and that $w = w_1 \cdots w_m$ is a string over the input alphabet of M. Let $q_0, q_1, \ldots, q_m$ be the computation of M on input w. Consider any “times” 0 < s < t ≤ m such that t − s ≥ N. Show that the statement of Lemma 1.1 implies that the corresponding portion $q_s, \ldots, q_t$ of the computation of M on input w contains a closed path in the state space. Note carefully that you should not provide a proof “from scratch” of the existence of a closed path, but instead show that the existence of such a path follows from Lemma 1.1 as stated. Explain carefully.