Regular Languages

CSCI 2670

Department of Computer Science

Fall 2014

CSCI 2670 Regular Languages Outline

- Regular Expressions
- Converting Regular Expressions to NFAs
- Generalized Nondeterministic Finite Automata
- Converting GNFAs
- Nonregular Languages
- The Pumping Lemma

CSCI 2670 Regular Languages Regular Expressions

- Regular languages can also be defined via regular expressions (regexps), a shorthand for languages built up using the regular operations.

Definition Let Σ be any alphabet.
1. Each symbol a ∈ Σ is a regular expression;
2. ε is a regular expression;
3. ∅ is a regular expression;
4. if R1 and R2 are regular expressions, then (R1 ∪ R2) is a regular expression;
5. if R1 and R2 are regular expressions, then (R1 ◦ R2) is a regular expression;
6. if R1 is a regular expression, then (R1∗) is a regular expression.

- If a ∈ Σ, the regexp a denotes the language {a}.
- The regexp ε denotes the language {ε}.
- The regexp (a ∪ b) denotes the language {a} ∪ {b}.
- The regexp (a ◦ b) denotes the language {a} ◦ {b}.
- The regexp (a ∪ b)∗ ◦ a denotes {wa | w is any string over {a, b}}.

Note the following:

- R1 ◦ R2 is often abbreviated as R1R2.
- If Σ = {a1, a2, ...}, Σb is used in place of (a1 ∪ a2 ∪ ...)b.
- For any regexp R, R∅ = ∅.
- ∅∗ = {ε}.
- (R1 ∪ R2) is sometimes written (R1|R2).
- R^+ is the concatenation of one or more strings from L(R).
- R^k is the concatenation of k strings from L(R).
- The precedence of operators (greatest to least) is: ∗, ◦, ∪.

Example What languages do the following denote (where Σ = {0, 1})?
- 0∗10∗
- Σ∗1Σ∗
- Σ∗001Σ∗
- 1∗(01^+)∗
- (ΣΣ)∗
- (ΣΣΣ)∗
- (01 ∪ 10)
- 0Σ∗0 ∪ 1Σ∗1 ∪ 0 ∪ 1
- (0 ∪ ε)1∗
- (0 ∪ ε)(1 ∪ ε)
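One way to explore these examples is to translate them into Python's `re` syntax and list the short strings each one matches. This is only a sketch for experimentation: the translation (Σ as `[01]`, ∪ as `|`, ε as an empty alternative) and the helper name `language_up_to` are assumptions made here, not part of the course material.

```python
import re
from itertools import product

# Hand translation of the slide's regexps into Python re syntax.
# Σ = {0, 1} becomes [01], ∪ becomes |, and ε becomes an empty alternative.
EXAMPLES = [
    r"0*10*",
    r"[01]*1[01]*",
    r"[01]*001[01]*",
    r"1*(01+)*",
    r"([01][01])*",
    r"([01][01][01])*",
    r"01|10",
    r"0[01]*0|1[01]*1|0|1",
    r"(0|)1*",
    r"(0|)(1|)",
]


def language_up_to(pattern, max_len):
    """Return every string over {0, 1} of length <= max_len that the pattern matches in full."""
    found = []
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            w = "".join(symbols)
            if re.fullmatch(pattern, w) is not None:
                found.append(w)
    return found


if __name__ == "__main__":
    for pattern in EXAMPLES:
        print(pattern, language_up_to(pattern, 3))
```

Listing the matches up to a small length is a quick way to check a guess about what a regexp denotes before arguing it in general.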

CSCI 2670 Regular Languages Two Identities

For any regular expression R:
- R ∪ ∅ = R
- R ◦ ε = R

The following do not hold in general, however.
- R ∪ ε = R
- R ◦ ∅ = R

Example If R = ab, then
- L(R) = {ab}
- L(ab ∪ ∅) = {ab}
- L(ab ◦ ∅) = ∅
- L(ab ∪ ε) = {ab, ε}
- L(ab ◦ ε) = {ab}
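The example above can be checked mechanically by modelling finite languages as Python sets of strings. This is a minimal sketch; the names `concat`, `EMPTY`, and `EPS` are chosen here for illustration.

```python
def concat(A, B):
    """Concatenation of two languages: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}


R = {"ab"}        # L(ab)
EMPTY = set()     # L(∅), the language with no strings
EPS = {""}        # L(ε), the language containing only the empty string

# The two identities that always hold.
assert R | EMPTY == R
assert concat(R, EPS) == R

# The two equations that fail in general.
assert R | EPS == {"ab", ""}      # ε is added, so the result differs from R
assert concat(R, EMPTY) == set()  # nothing to pair with, so the result is empty
```

The key difference is that ∅ has no strings at all, while {ε} has exactly one string, which happens to have length 0.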

CSCI 2670 Regular Languages Equivalence between Regular Expressions and Finite Automata

Theorem A language is regular if and only if some regular expression describes it.

- That is, a language L is regular if and only if there exists a regular expression R such that L(R) = L.
- Note that the theorem is a biconditional statement, and so to prove it, both directions must be proven:
  - If a language is described by a regular expression, then it is regular.
  - If a language is regular, then it is described by some regular expression.

Lemma If a language is described by a regular expression, then it is regular.

Proof. The proof proceeds by structural induction, constructing for each regexp R an NFA N that recognizes L(R). Basis:

1. R = a, where a ∈ Σ: N = ({q0, q1}, Σ, δ, q0, {q1}), where δ(q0, a) = {q1} and δ(q, b) = ∅ whenever q ≠ q0 or b ≠ a.

2. R = ε: N = ({q0}, Σ, δ, q0, {q0}), where δ(q, a) = ∅ for all states q and a ∈ Σ.

3. R = ∅: N = ({q0}, Σ, δ, q0, ∅), where δ(q, a) = ∅ for all states q and a ∈ Σ.

[Figures: NFAs for the three basis cases]

1. L(a) = {a}

2. L(ε) = {ε}

3. L(∅) = ∅

Recursion: The recursive cases are taken care of by the proofs that regular languages are closed under union, concatenation, and star.

4. R = R1 ∪ R2

5. R = R1 ◦ R2

6. R = R1∗

Example
1. Convert the regular expression (ab ∪ a)∗ to an NFA.
2. Convert the regular expression (a ∪ b)∗aba to an NFA.
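The structural induction can be turned into a small program. The sketch below uses a Thompson-style variant in which every fragment has one start and one accept state joined by ε-transitions; this is an assumption made to keep the code short and is not the exact construction from the closure proofs. The tuple encoding of regexps and the names `NFA`, `build`, and `accepts` are likewise hypothetical.

```python
from itertools import count


class NFA:
    """ε-NFA under construction: delta maps (state, symbol) to a set of states."""

    def __init__(self):
        self.delta = {}
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, sym, dst):
        # sym is a one-character string, or None for an ε-transition.
        self.delta.setdefault((src, sym), set()).add(dst)


def build(nfa, r):
    """Add a fragment recognizing L(r) to nfa and return its (start, accept) states."""
    kind = r[0]
    s, f = nfa.new_state(), nfa.new_state()
    if kind == "sym":                      # basis case 1: R = a
        nfa.add(s, r[1], f)
    elif kind == "eps":                    # basis case 2: R = ε
        nfa.add(s, None, f)
    elif kind == "empty":                  # basis case 3: R = ∅ (no path to f)
        pass
    elif kind == "union":                  # case 4: R = R1 ∪ R2
        for sub in (r[1], r[2]):
            s1, f1 = build(nfa, sub)
            nfa.add(s, None, s1)
            nfa.add(f1, None, f)
    elif kind == "concat":                 # case 5: R = R1 ◦ R2
        s1, f1 = build(nfa, r[1])
        s2, f2 = build(nfa, r[2])
        nfa.add(s, None, s1)
        nfa.add(f1, None, s2)
        nfa.add(f2, None, f)
    elif kind == "star":                   # case 6: R = R1∗
        s1, f1 = build(nfa, r[1])
        nfa.add(s, None, s1)
        nfa.add(s, None, f)                # zero repetitions
        nfa.add(f1, None, s1)              # repeat
        nfa.add(f1, None, f)               # stop
    else:
        raise ValueError(f"unknown regexp node: {kind!r}")
    return s, f


def eps_closure(nfa, states):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in nfa.delta.get((q, None), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(r, w):
    """Build an NFA for regexp r and simulate it on the string w."""
    nfa = NFA()
    start, accept = build(nfa, r)
    current = eps_closure(nfa, {start})
    for ch in w:
        moved = set()
        for q in current:
            moved |= nfa.delta.get((q, ch), set())
        current = eps_closure(nfa, moved)
    return accept in current


A, B = ("sym", "a"), ("sym", "b")
EX1 = ("star", ("union", ("concat", A, B), A))                               # (ab ∪ a)∗
EX2 = ("concat", ("star", ("union", A, B)), ("concat", A, ("concat", B, A)))  # (a ∪ b)∗aba
```

For instance, `accepts(EX1, "aab")` asks whether the NFA built for (ab ∪ a)∗ accepts aab. The machines produced this way have more states than a hand-drawn answer would, but they recognize the same languages.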

CSCI 2670 Regular Languages Generalized NFAs

- To prove that each regular language is described by a regular expression, we define generalized nondeterministic finite automata (GNFAs).
- GNFAs are like NFAs, except:
  1. The start state qstart has edges leading to every other state, but no incoming edges.
  2. There is a unique accept state qaccept, qaccept ≠ qstart, with incoming edges from every other state. It has no outgoing edges.
  3. For all q, q′ ∈ Q − {qaccept, qstart}, there is exactly one edge from q to q′. Note that q and q′ might be the same.

- The labels of the edges in a GNFA are arbitrary regular expressions.
- A DFA M can be converted into a GNFA:
  - Add a new start state qstart with an ε edge leading to the old start state.
  - Add a new accept state qaccept with an ε edge leading from each accept state in M to qaccept.
  - If edges q →a q′ and q →b q′ exist, replace both with q →(a∪b) q′.
  - If no edge leads from q to q′, add q →∅ q′.
- It’s not proven in the text, but it should be clear that none of these alterations changes the language accepted by the automaton.

CSCI 2670 Regular Languages From GNFAs to regular expressions

- The conversion from GNFA to regular expression proceeds by combining nodes and labels in the graph. If a state qrip (other than qstart and qaccept) exists such that:
  - qi →R1 qrip,
  - qrip →R2 qrip,
  - qrip →R3 qj, and
  - qi →R4 qj,
- then:
  - Delete qrip and each edge above.
  - Add the edge qi →(R1 R2∗ R3 ∪ R4) qj.
  - Do this for each qi and qj connected via qrip.
- Repeat the process until only two nodes exist, qstart and qaccept.

Let CONVERT (G) be the regexp obtained as a result of this process.
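The DFA-to-GNFA step and the CONVERT procedure can be sketched together in Python. Several choices here are assumptions for illustration: labels are stored as strings in Python `re` syntax so the result can be tested, ∅ is represented by `None` (missing edges are treated as ∅ edges), the names `qstart` and `qaccept` are assumed not to clash with DFA state names, and the two-state DFA at the bottom is a hypothetical example.

```python
import re
from itertools import product


def union(r, s):
    """R ∪ S, using R ∪ ∅ = R."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"


def concat(*parts):
    """R1 R2 ... Rn, using R∅ = ∅."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def star(r):
    """R∗, using ∅∗ = {ε} and ε∗ = ε (ε is the empty pattern)."""
    return "" if r in (None, "") else f"(?:{r})*"


def dfa_to_gnfa(delta, start, accepts):
    """Return GNFA edge labels {(qi, qj): regexp}; absent pairs stand for ∅ edges."""
    g = {}
    for (q, a), r in delta.items():
        g[(q, r)] = union(g.get((q, r)), re.escape(a))   # merge parallel edges
    g[("qstart", start)] = ""                             # ε edge from new start
    for q in accepts:
        g[(q, "qaccept")] = ""                            # ε edges to new accept
    return g


def convert(states, g):
    """Rip out the DFA states one at a time; return the final label (None means ∅)."""
    remaining = list(states)
    while remaining:
        rip = remaining.pop()
        loop = star(g.get((rip, rip)))                    # R2∗
        new = {}
        for qi in ["qstart"] + remaining:
            for qj in remaining + ["qaccept"]:
                via = concat(g.get((qi, rip)), loop, g.get((rip, qj)))  # R1 R2∗ R3
                label = union(via, g.get((qi, qj)))                     # ... ∪ R4
                if label is not None:
                    new[(qi, qj)] = label
        g = new
    return g.get(("qstart", "qaccept"))


def run_dfa(delta, start, accepts, w):
    q = start
    for ch in w:
        q = delta[(q, ch)]
    return q in accepts


def agrees(regex, delta, start, accepts, alphabet, max_len):
    """Compare the regexp with the DFA on every string up to max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            if (re.fullmatch(regex, w) is not None) != run_dfa(delta, start, accepts, w):
                return False
    return True


# Hypothetical DFA over {a, b}: state 2 is reached by the first b and never left.
DELTA = {(1, "a"): 1, (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}
REGEX = convert([1, 2], dfa_to_gnfa(DELTA, 1, {2}))
```

The resulting pattern is rarely the shortest one, since no simplification is attempted; `agrees(REGEX, DELTA, 1, {2}, "ab", 6)` compares it against the DFA on all strings up to length 6.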

CSCI 2670 Regular Languages Generalized NFAs (Definition)

Definition
- A generalized nondeterministic finite automaton (GNFA) is a 5-tuple (Q, Σ, δ, qstart, qaccept):
  - Q is a finite, nonempty set of states.
  - Σ is a finite, nonempty alphabet.
  - δ : (Q − {qaccept}) × (Q − {qstart}) → R is the transition function, where R is the set of regular expressions over Σ.
  - qstart ∈ Q is the start state.
  - qaccept ∈ Q is the unique accept state.

- The function δ gives the label for each edge (qi, qj), where qi ∈ Q − {qaccept} and qj ∈ Q − {qstart}.
  - Here, qi can’t be the accept state, because no edge originates there.
  - Here, qj can’t be the start state, because no edge ends there.

CSCI 2670 Regular Languages Language Recognition for GNFAs

Definition Let G be a GNFA and w = w1w2 ... wk a string, where each wi ∈ Σ∗. G accepts w iff there is a sequence of states q0, q1, ..., qk such that
- q0 = qstart.
- qk = qaccept.
- for each i, wi ∈ L(Ri), where Ri = δ(qi−1, qi).

- We split w into w1w2 ... wk, where each wi is a string (not necessarily a single symbol) generated by the regular expression on an edge.
- Specifically, wi is in the language denoted by the label on the edge from qi−1 to qi.
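The acceptance condition can be checked by a search over pairs (state, position in w): from each pair, every outgoing edge may consume any substring that its label matches. The sketch below assumes edge labels are given in Python `re` syntax and that ∅ edges are simply left out of the dictionary; the function name and the small GNFA `G` are hypothetical.

```python
import re


def gnfa_accepts(g, w, start="qstart", accept="qaccept"):
    """Decide whether the GNFA with edge labels g = {(qi, qj): pattern} accepts w."""
    frontier = [(start, 0)]
    seen = set(frontier)
    while frontier:
        q, i = frontier.pop()
        if q == accept and i == len(w):
            return True
        for (src, dst), label in g.items():
            if src != q:
                continue
            # Try every way of letting this edge consume the piece w[i:j].
            for j in range(i, len(w) + 1):
                if (dst, j) not in seen and re.fullmatch(label, w[i:j]):
                    seen.add((dst, j))
                    frontier.append((dst, j))
    return False


# Hypothetical GNFA: qstart →ε 1, 1 →a 1, 1 →b(a∪b)∗ qaccept.
G = {
    ("qstart", "1"): "",
    ("1", "1"): "a",
    ("1", "qaccept"): "b(?:a|b)*",
}
```

Because each pair (state, position) is visited at most once, the search terminates even when some labels match the empty string.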

CSCI 2670 Regular Languages Equivalence between Regular Expressions and GNFAs

Proposition For any GNFA G, CONVERT (G) is equivalent to G.

Proof. The proof proceeds by induction on the number of nodes in G.

Basis: If G has only 2 nodes, then they must be the distinct start and accept states, and the regular expression between them is CONVERT (G) and describes exactly the strings accepted by G.

Induction: Suppose the claim holds for GNFAs with k − 1 states and that G has k states (where k > 2). Since G has more than 2 states, it can be reduced. Let G′ be a GNFA obtained by removing a state qrip from G according to the procedure described earlier. Let δ′ be the transition function for G′.

- Let w = w1w2 ... wn be a string accepted by G. Then there exists a sequence qstart, q1, q2, ..., qaccept demonstrating that G accepts w. Note that for each i, wi ∈ L(Ri), where Ri = δ(qi−1, qi).
- State qrip is either in this sequence, or it’s not.

1. If not, then the sequence qstart, q1, q2, ..., qaccept demonstrates that G′ accepts w, since for each qi and qi+1, δ′(qi, qi+1) = R ∪ S, where δ(qi, qi+1) = R and S is some other regular expression, so L(R) ⊆ L(R ∪ S).

2. If qrip is in the sequence qstart, q1, q2, ..., qaccept, then the sequence with all occurrences of qrip removed constitutes an accepting computation path for G′. This is clear from the construction of G′: the pieces of w read on a run qi, qrip, ..., qrip, qj form a string in L(R1 R2∗ R3), which is included in the label δ′(qi, qj).

- [A similar argument in the opposite direction shows that if G′ accepts w, then G accepts w.]
- So, for any string w, G accepts w if and only if G′ does. That is, G and G′ are equivalent.
- By the inductive hypothesis, CONVERT(G′) and G′ are equivalent.
- Since CONVERT(G′) is CONVERT(G) (they return the same regular expression), it follows that G and CONVERT(G) are equivalent.

Given the previous proposition, the following holds:

Lemma If a language is regular, then it is described by a regular expression.

Given this lemma and the previous lemma, the following theorem holds.

Theorem A language is regular if and only if some regular expression describes it.

CSCI 2670 Regular Languages Converting DFAs to Regular Expressions

The following examples show how DFAs can be converted into regular expressions.

Example 1 (Pg. 75): Convert the following 2-state DFA into a regular expression.

First we convert to a GNFA by adding a new start state and a new accept state, each with appropriate ε edges.

- Draw an ε-edge from the new qstart to the old start state.
- Draw an ε-edge from each old accept state to qaccept.

- In the GNFA, we must ensure exactly 1 edge connects each pair from (Q − {qaccept}) × (Q − {qstart}).
- If multiple edges from qi to qj exist, combine them into a single edge using ∪.
- If no edge from qi to qj already exists, add an edge labeled ∅.
- Unofficially, the ∅ edges are typically omitted.

- After the GNFA is constructed, we begin removing nodes.
- Here, node 2 has been removed.
- Here, node 1 has been removed.
- Since the resulting machine has only 2 states, we stop.

Example 2 (Pg. 76): Convert the following 3-state DFA into a regular expression.

Add the new start state and accept state. Combine multiple edges between nodes, if needed.

Then start removing nodes (start with node 1)...

- Node 1 removed.
- Node 2 removed.
- Node 3 removed. Multiple edges should be combined.
- Node 3 removed. Multiple edges combined.

CSCI 2670 Regular Languages The Pumping Lemma

- Consider DFA M over Σ = {a, b, c, d} (with edges leading to a trap state omitted).
- M recognizes the language ab(cb)∗d.
- The computation sequence for abcbcbd is

1 →a 2 →b 3 →c 2 →b 3 →c 2 →b 3 →d 4.

- A cycle exists in the graph.
- The pattern cb can be repeated any number of times (“pumped”), each time yielding another string in the language.

- All regular languages have this property.
- Any string in the language of length at least p (the pumping length) has a nonempty substring that can be pumped.

Theorem If A is a regular language, then there exists an integer p such that every s ∈ A with |s| ≥ p may be divided into pieces s = xyz such that
1. for each i ≥ 0, xy^i z ∈ A,
2. |y| > 0, and
3. |xy| ≤ p.

- This theorem is useful when showing that a language is not regular.
- Technique: Use a proof by contradiction.
  - Assume A is regular.
  - Use the pumping lemma to show that a string both is and is not in A.
  - Conclude that A is not regular.
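For a language that is regular, the split s = xyz can be read off a DFA: y is the part of s read between the first two visits to a repeated state. The sketch below does this for the machine M from the earlier slide; its transition table is reconstructed from the computation sequence shown there, so treat it as an assumption, and the function names are hypothetical.

```python
# Transitions of M read off the computation 1 →a 2 →b 3 →c 2 ... 3 →d 4.
# Missing entries stand for the omitted edges to the trap state.
DELTA = {(1, "a"): 2, (2, "b"): 3, (3, "c"): 2, (3, "d"): 4}
START, ACCEPT = 1, {4}


def accepts(w):
    """Simulate M on w; any missing transition means the trap state, so reject."""
    q = START
    for ch in w:
        q = DELTA.get((q, ch))
        if q is None:
            return False
    return q in ACCEPT


def split_at_first_repeat(s):
    """Split an accepted string s into x, y, z where y traverses the first cycle."""
    q = START
    first_seen = {q: 0}                 # state -> number of symbols read on first visit
    for pos, ch in enumerate(s, start=1):
        q = DELTA[(q, ch)]
        if q in first_seen:
            i = first_seen[q]
            return s[:i], s[i:pos], s[pos:]
        first_seen[q] = pos
    raise ValueError("no state repeats; s is too short to pump")


x, y, z = split_at_first_repeat("abcbcbd")
pumped = [x + y * i + z for i in range(4)]   # xz, xyz, xyyz, xyyyz
```

On abcbcbd the first repeated state is 2, so the split is x = a, y = bc, z = bcbd. This y is a rotation of the cb pattern mentioned on the slide; both correspond to going around the same cycle once.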

Example The language B = {0^n 1^n | n ≥ 0} is not regular.

Proof. Suppose that B is regular and so has pumping length p. Let w be any string of B of length at least p. Then w = xyz such that
1. for each i ≥ 0, xy^i z ∈ B,
2. |y| > 0, and
3. |xy| ≤ p.
It cannot be that y consists solely of 0s or solely of 1s: xy^0 z = xz ∈ B, so removing y must leave equal numbers of 0s and 1s. So y must contain as many 0s as 1s, and since w has the form 0^n 1^n, y must be of the form 0^m 1^m, where m > 0, with x consisting solely of 0s and z consisting solely of 1s. However, since xy^2 z ∈ B, it must be that x 0^m 1^m 0^m 1^m z ∈ B. But this string has a 1 before a 0, so it is not of the form 0^n 1^n and cannot be in B. A contradiction! And so B cannot be regular.

Alternative Proof. Suppose that B is regular and so has pumping length p. Let w be the string 0^p 1^p, which is clearly in B and has length at least p, so w = xyz as in the pumping lemma. Given that |xy| ≤ p, it must be that x and y consist solely of 0s. By the pumping lemma, xyyz ∈ B. However, since |y| > 0, this string has more 0s than 1s and so can’t be in B. A contradiction! And so B cannot be regular.
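The argument can be sanity-checked by brute force for small values of p: for w = 0^p 1^p, every split allowed by conditions 2 and 3 should fail condition 1 for some i. This is a sketch that only tests particular values of p and therefore does not replace the proof; the helper names are chosen here for illustration.

```python
def in_B(w):
    """Membership test for B = {0^n 1^n | n >= 0}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def splits(s, p):
    """All ways to write s = xyz with |y| > 0 and |xy| <= p."""
    limit = min(p, len(s))
    for i in range(limit + 1):              # i = |x|
        for j in range(i + 1, limit + 1):   # j = |xy|
            yield s[:i], s[i:j], s[j:]


def every_split_fails(member, s, p, max_i=3):
    """True if each split has some i in 0..max_i with x y^i z outside the language."""
    return all(
        any(not member(x + y * i + z) for i in range(max_i + 1))
        for x, y, z in splits(s, p)
    )


# No candidate pumping length from 1 to 7 works for the string 0^p 1^p.
RESULT = all(every_split_fails(in_B, "0" * p + "1" * p, p) for p in range(1, 8))
```

The same helper can be pointed at other membership tests and witness strings, which is a convenient way to check a proposed choice of w before writing the proof.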

Example The language C = {w | w has an equal number of 0s and 1s} is not regular.

Proof. Suppose that C is regular and so has pumping length p. Let w be the string 0^p 1^p. Since |w| ≥ p, w = xyz such that
1. for each i ≥ 0, xy^i z ∈ C,
2. |y| > 0, and
3. |xy| ≤ p.
Given the third condition above, since w = 0^p 1^p, it must be that y consists solely of 0s. Given the first condition of the pumping lemma, it must be that xz ∈ C. However, since y is nonempty and consists solely of 0s, xz has more 1s than 0s and so can’t be in language C. A contradiction! And so C cannot be regular.

Alternative Proof. Regular languages are closed under intersection (the proof of this proceeds similarly to the proof that they are closed under union). Given this result, if we assume C is regular, then C ∩ 0∗1∗ is regular (0∗1∗ is clearly regular). However, C ∩ 0∗1∗ = {0^n 1^n | n ≥ 0}, which we just showed to be nonregular. And so C cannot be regular, either.
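The key identity in this proof, C ∩ 0∗1∗ = {0^n 1^n | n ≥ 0}, can be spot-checked on all short strings. The sketch below is only a finite check with hypothetical helper names, not a proof of the identity.

```python
import re
from itertools import product


def in_C(w):
    """w has an equal number of 0s and 1s."""
    return w.count("0") == w.count("1")


def in_zero_star_one_star(w):
    """w is in L(0∗1∗): some 0s followed by some 1s."""
    return re.fullmatch(r"0*1*", w) is not None


def in_B(w):
    """w is of the form 0^n 1^n."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def identity_holds_up_to(max_len):
    """Check membership in C ∩ 0∗1∗ against membership in B for every string up to max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            w = "".join(symbols)
            if (in_C(w) and in_zero_star_one_star(w)) != in_B(w):
                return False
    return True
```

Calling `identity_holds_up_to(10)` compares the two sides on every binary string of length at most 10.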

Example The language F = {ww | w ∈ {0, 1}∗} is not regular.

Proof. Suppose F is regular with pumping length p, and let w = 0^p 1. As such, s = ww = 0^p 1 0^p 1 ∈ F. Since |s| > p, s can be split into s = xyz such that all conditions of the pumping lemma apply. From the 3rd condition, |xy| ≤ p and so x must have the form 0^i for some i ≥ 0, y must have the form 0^k for some k > 0, and z must have the form 0^j 1 0^i 0^k 0^j 1 for some j ≥ 0, where i + k + j = p. So, s = 0^i 0^k 0^j 1 0^i 0^k 0^j 1. Since y can be pumped, 0^i 0^(2k) 0^j 1 0^i 0^k 0^j 1 ∈ F. However, this string is 0^(p+k) 1 0^p 1 with k > 0, while any string of the form ww that contains exactly two 1s and ends in 1 must be 0^a 1 0^a 1 for some a. So it is not of the form ww and cannot be in F. A contradiction, and so F cannot be regular.

Example
1. The language A1 = {0^i 1^j | i > j} is not regular.
2. The language A2 = {w | w is a palindrome} is not regular. (A palindrome is a string that reads the same forward and backward.)
3. The language A3 = {0^n 1^n 2^n | n ≥ 0} is not regular.
4. The language A4 = {a^(2^n) | n ≥ 0} is not regular. (Here, a^(2^n) means a string of 2^n a’s.)

