CS 236 Language and Computation Course Notes Sect 2.2: The Chomsky Hierarchy and Regular Languages

Anton Setzer (Based on a book draft by J. V. Tucker and K. Stephenson) Dept. of Computer Science, Swansea University

http://www.cs.swan.ac.uk/~csetzer/lectures/languageComputation/09/index.html

December 12, 2009

2.2.1. Chomsky Hierarchy (12.1)

2.2.2. Regular Languages (12.2)

2.2.3. Regular Expressions (13.8)

2.2.1. Chomsky Hierarchy (12.1)

The Chomsky hierarchy classifies grammars into four types according to the form of their production rules:

- Unrestricted grammars.
  - The most general kind of grammar.
- Context-sensitive grammars.
  - It's usually an accident if a grammar of a language is context-sensitive. C and C++ have some context-sensitive aspects (dealt with by selecting correct strings after the parsing).
- Context-free grammars.
  - Easy to understand and supported by parser generators.
  - In language design one aims at languages having an underlying context-free grammar.
- Regular grammars.
  - Simple to parse. Used for dividing the input stream of characters into tokens.

Unrestricted Grammars

In the following four definitions, let G = (T, N, S, P) be a grammar.

Definition

Any grammar G is of Type 0, or unrestricted: any production

u −→ v

with u ∈ (T ∪ N)+, v ∈ (T ∪ N)∗ is allowed.

Context-Sensitive Grammars

Definition

A grammar G is of Type 1, or context-sensitive, if all its productions have the form uAv −→ uwv, where A ∈ N is a nonterminal that rewrites to a non-empty string w ∈ (T ∪ N)+, but only where A is in the context of strings u, v ∈ (T ∪ N)∗. Furthermore, a production A −→ ε is allowed, but only if A does not occur on the right-hand side of any production.

Context-Free Grammars

Definition

A grammar G is of Type 2, or context-free, if all its productions have the form A −→ w, where A ∈ N is a nonterminal that rewrites to a string w ∈ (T ∪ N)∗.

Regular Grammars

Definition

1. A grammar G is left-linear iff all its productions have the form

A −→ Ba or A −→ a or A −→ ε

2. A grammar G is right-linear iff all its productions have the form

A −→ aB or A −→ a or A −→ ε

3. A grammar G is of Type 3, or regular, iff it is left-linear or right-linear.

In the above, A, B ∈ N are nonterminals and a ∈ T.

Note that in a regular grammar either all productions must be left-linear or all productions must be right-linear, so no mixing of left-linear and right-linear productions is allowed.

A Hierarchy of Languages

Definition

A language L ⊆ T∗ is unrestricted, context-sensitive, context-free, or regular iff there exists a grammar G of the relevant type such that L(G) = L.

Remark

For any L we have L regular ⇒ L context-free ⇒ L context-sensitive ⇒ L unrestricted.

Hierarchy of Languages

We have that

- every regular grammar is context-free.
- every context-sensitive grammar is an unrestricted grammar.

However, not every context-free grammar is context-sensitive, since context-sensitive grammars allow a production A −→ ε only if A does not occur on the right-hand side of a production. (Otherwise all unrestricted languages would be context-sensitive.) However, from a context-free grammar one can construct a context-free grammar of the same language which has a production A −→ ε only if A does not occur on the right-hand side of a production. This grammar is therefore context-sensitive as well.

Hierarchy of Languages

[Figure: nested sets of languages, regular ⊆ context-free ⊆ context-sensitive ⊆ unrestricted]

Examples of Equivalent Grammars

We give grammars of each type for defining the language

L^{a^{2n}} := {a^i | i is even}

Unrestricted Grammar for L^{a^{2n}}

grammar G^{unrestricted, a^{2n}}

terminals a

nonterminals S

start symbol S

productions
  S −→ ε
  S −→ aa
  a −→ aaa

Context-Sensitive Grammar for L^{a^{2n}}

grammar G^{context-sensitive, a^{2n}}

terminals a

nonterminals S, T

start symbol S

productions
  S −→ ε
  S −→ aa
  S −→ aaT
  aT −→ aTaa
  aT −→ aaa

Context-Free Grammar for L^{a^{2n}}

grammar G^{context-free, a^{2n}}

terminals a

nonterminals S

start symbol S

productions
  S −→ ε
  S −→ aSa

Regular Grammar for L^{a^{2n}}

grammar G^{regular, a^{2n}}

terminals a

nonterminals S, A

start symbol S

productions
  S −→ ε
  S −→ aA
  A −→ aS
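The claim that the four grammars above define the same language can be spot-checked mechanically. The following is a minimal sketch and not part of the original notes: the function name `generate`, the encoding of productions as pairs of strings (with "" standing for ε, and every symbol a single character), and the `slack` bound on the length of intermediate sentential forms are assumptions made for illustration. It only explores words up to a fixed length, so it is a sanity check rather than a proof.

```python
from collections import deque


def generate(productions, start, terminals, max_len, slack=2):
    """Return all terminal words of length <= max_len derivable from `start`.

    Only sentential forms of length <= max_len + slack are explored
    (an assumption that is sufficient for the small grammars below).
    """
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        form = queue.popleft()
        if len(form) <= max_len and all(c in terminals for c in form):
            words.add(form)
        # Terminal-only forms are still rewritten: an unrestricted grammar
        # may have terminals on the left-hand side (e.g. a -> aaa).
        for lhs, rhs in productions:
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len + slack and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return words


# The four grammars from the slides, each as (productions, start symbol).
GRAMMARS = {
    "unrestricted": ([("S", ""), ("S", "aa"), ("a", "aaa")], "S"),
    "context-sensitive": (
        [("S", ""), ("S", "aa"), ("S", "aaT"), ("aT", "aTaa"), ("aT", "aaa")],
        "S",
    ),
    "context-free": ([("S", ""), ("S", "aSa")], "S"),
    "regular": ([("S", ""), ("S", "aA"), ("A", "aS")], "S"),
}

if __name__ == "__main__":
    for name, (prods, start) in GRAMMARS.items():
        print(name, sorted(generate(prods, start, {"a"}, 8), key=len))
```

Each grammar should yield exactly the words of even length over {a} up to the chosen bound.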

Example 1 (Grammars of the Levels of the Chomsky Hierarchy)

grammar G

terminals a, b

nonterminals S

start symbol S

productions S −→ aSa, S −→ bSb, S −→ ε

L(G) = ? G is of which type?

Example 2

grammar G

terminals a

nonterminals S

start symbol S

productions S −→ a, S −→ aS

L(G) = ? G is of which type?

Example 3

grammar G

terminals a, b

nonterminals S

start symbol S

productions S −→ ab, S −→ aSb

L(G) = ? G is of which type?

Example 4

grammar G^{a^n b^n c^n}

terminals a, b, c

nonterminals S, B, C, H

start symbol S

productions
  S −→ aSBC, S −→ aBC,
  CB −→ HB, HB −→ HC, HC −→ BC,
  aB −→ ab, bB −→ bb, bC −→ bc, cC −→ cc.

L(G) = {a^n b^n c^n | n ≥ 1}. G is of which type?

Examples

[Figure: the nested language classes regular ⊆ context-free ⊆ context-sensitive ⊆ unrestricted, with the example languages {a^n | n ≥ 1}, {a^n b^n | n ≥ 1} and {a^n b^n c^n | n ≥ 1}]

2.2.2. Regular Languages (12.2)

Finite languages are regular

grammar G^{ab,aabb,aaabbb}

terminals a, b

nonterminals S

start symbol S

productions
  S −→ ab
  S −→ aabb
  S −→ aaabbb

The above grammar is not regular, since in a regular grammar the right-hand side of a production may contain only one terminal. But we can amend this:

Finite languages are regular

grammar G^{ab,aabb,aaabbb}

terminals a, b

nonterminals S, S1, S2, S3, S4, S5, S6, S7, S8, S9

start symbol S

productions
  S −→ aS1, S1 −→ b
  S −→ aS2, S2 −→ aS3, S3 −→ bS4, S4 −→ b
  S −→ aS5, S5 −→ aS6, S6 −→ aS7, S7 −→ bS8, S8 −→ bS9, S9 −→ b

Observation

The above can be generalised to the following

Lemma

1. Assume a grammar G which has only productions of the form

A −→ Bw or A −→ w′

for some w ∈ T+, w′ ∈ T∗, A, B ∈ N. Then L(G) = L(G′) for some left-linear grammar G′.

2. Assume a grammar G which has only productions of the form

A −→ wB or A −→ w′

for some w ∈ T+, w′ ∈ T∗, A, B ∈ N. Then L(G) = L(G′) for some right-linear grammar G′.

Proof

- In (2) replace
  - productions A −→ a1a2···anB with n ≥ 2 by A −→ a1A1, A1 −→ a2A2, ..., An−1 −→ anB for some new nonterminals Ai;
  - productions A −→ a1a2···an with n ≥ 2 by A −→ a1A1, A1 −→ a2A2, ..., An−1 −→ an for some new nonterminals Ai.
- (1) is proved similarly.
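The replacement step of the proof for case (2) can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: a production A −→ wB is encoded as the triple `(A, w, B)` and A −→ w as `(A, w, None)`, terminals are single characters, and the fresh nonterminals are named `X1`, `X2`, ... (hypothetical names, assumed not to clash with existing nonterminals).

```python
from itertools import count


def to_right_linear(productions):
    """Split productions A -> wB / A -> w with |w| >= 2 into one-terminal steps.

    productions: list of (lhs, word, tail) with tail a nonterminal or None.
    Returns a list of triples in which every word has length at most 1.
    """
    fresh = count(1)  # supplies indices for new nonterminals X1, X2, ...
    result = []
    for lhs, word, tail in productions:
        if len(word) <= 1:
            # Already of the form A -> aB, A -> a or A -> epsilon.
            result.append((lhs, word, tail))
            continue
        current = lhs
        for symbol in word[:-1]:
            new_nonterminal = f"X{next(fresh)}"
            result.append((current, symbol, new_nonterminal))
            current = new_nonterminal
        # The last terminal keeps the original tail (B or nothing).
        result.append((current, word[-1], tail))
    return result


if __name__ == "__main__":
    finite = [("S", "ab", None), ("S", "aabb", None), ("S", "aaabbb", None)]
    for lhs, a, tail in to_right_linear(finite):
        print(f"{lhs} -> {a}{tail or ''}")
```

Applied to the grammar for {ab, aabb, aaabbb}, the sketch produces twelve productions that correspond, up to the names of the new nonterminals, to the amended grammar shown earlier.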

Lemma

All finite languages are regular.

Proof: Extend the example above.

A Left-Linear Grammar for a^m b^n

The following left-linear grammar generates {a^m b^n | m, n ≥ 1}.

grammar G^{left-linear, a^m b^n}

terminals a, b

nonterminals S, T

start symbol S

productions
  S −→ Sb
  S −→ Tb
  T −→ Ta
  T −→ a

A Right-Linear Grammar for a^m b^n

The following right-linear grammar generates {a^m b^n | m, n ≥ 1}:

grammar G^{right-linear, a^m b^n}

terminals a, b

nonterminals S, T

start symbol S

productions
  S −→ aS
  S −→ aT
  T −→ bT
  T −→ b

Right-Linear Grammar for Numbers

Here is a right-linear grammar for numbers without leading zeros. We use “|” as for BNF.

grammar G^{Number}

terminals 0, 1, ..., 9

nonterminals Number, Digits

start symbol Number

productions
  Number −→ 0
  Number −→ 1 Digits | 2 Digits | ··· | 9 Digits
  Digits −→ 0 Digits | 1 Digits | ··· | 9 Digits
  Digits −→ ε

Right-Linear Grammar for Numbers

Why didn’t we use the following as in the section on BNF?

grammar G^{Number}

terminals 0, 1, ..., 9

nonterminals Number, Digit, NonZeroDigit, Digits

start symbol Number

productions
  Number −→ Digit | NonZeroDigit Digits
  Digits −→ Digit | Digit Digits
  Digit −→ 0 | NonZeroDigit
  NonZeroDigit −→ 1 | 2 | ··· | 9

Answer:

Right-Linear Grammar for Post Codes

The next grammar generates the postcodes of the form SA1 8PP, or in general LLd dLL for digits d and capital letters L, without any leading zeros. We use the notation | as in BNF. We write ␣ for blank.

Right-Linear Grammar for Post Codes

grammar G^{Postcode}

terminals 0, 1, ..., 9, A, B, ..., Z, ␣

nonterminals postcode, letter2, digit1, blank1, digit2, letter3, letter4

start symbol postcode

productions
  postcode −→ A letter2 | B letter2 | ··· | Z letter2
  letter2 −→ A digit1 | B digit1 | ··· | Z digit1
  digit1 −→ 0 blank1 | 1 blank1 | ··· | 9 blank1
  blank1 −→ ␣ digit2
  digit2 −→ 0 letter3 | 1 letter3 | ··· | 9 letter3
  letter3 −→ A letter4 | B letter4 | ··· | Z letter4
  letter4 −→ A | B | ··· | Z

Example Derivation

Here is a derivation of SA1␣8PP ∈ L(G^{Postcode}):

postcode ⇒ S letter2
         ⇒ SA digit1
         ⇒ SA1 blank1
         ⇒ SA1␣ digit2
         ⇒ SA1␣8 letter3
         ⇒ SA1␣8P letter4
         ⇒ SA1␣8PP

Easier Proof that Postcodes are Regular

Can you give an easier proof that the language of postcodes is regular (both left-linear and right-linear)?

Adding Silent Productions

We can generalise the previous lemma by also allowing productions of the form A −→ B:

Lemma

1. Assume a grammar G which has only productions of the form

A −→ Bw or A −→ w

for some w ∈ T∗, A, B ∈ N. Then L(G) = L(G′) for some left-linear grammar G′.

2. Assume a grammar G which has only productions of the form

A −→ wB or A −→ w

for some w ∈ T∗, A, B ∈ N. Then L(G) = L(G′) for some right-linear grammar G′.

Multi-step Right-Linear/Left-Linear/Regular Grammars

We call grammars as above multistep right-linear/left-linear/regular grammars.

Proof

In a first step we omit all transitions A −→ B for A, B ∈ N: Let G = (T, N, S, P) be a grammar having such transitions. We form a grammar G′ having no such transitions as follows:

grammar G′

terminals T

nonterminals N

start symbol S

productions
  A −→ w if A ⇒∗ A′ in G and A′ −→ w ∈ P for some A′ ∈ N, w ∈ T∗
  A −→ wB if A ⇒∗ A′ in G and A′ −→ wB ∈ P for some A′, B ∈ N, w ∈ T+

Proof

So in G′ we just jump over all silent transitions A −→ B in G. We can in fact decide whether A ⇒∗ A′, since such a derivation must have the form A = A′ or A = A1 ⇒ A2 ⇒ ··· ⇒ An = A′ for some Ai ∈ N. If such a derivation exists, then a derivation exists in which all Ai are distinct (omit loops). Therefore n can be restricted to the number of elements in N, so there are only finitely many possible derivations, which we can enumerate. For each of them we can check whether it is in fact a derivation, and therefore determine all possible derivations A ⇒∗ A′.
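The step of jumping over silent transitions can be sketched as follows for the right-linear case. This is a minimal illustration under assumptions of my own: a production A −→ wB is encoded as `(A, w, B)`, A −→ w as `(A, w, None)`, and a silent production A −→ B as `(A, "", B)`; the function name `remove_silent` is hypothetical. Instead of enumerating all loop-free chains, the sketch computes the same reachability relation A ⇒∗ A′ by iterating to a fixed point.

```python
def remove_silent(productions, nonterminals):
    """Return an equivalent production set without silent productions A -> B."""
    # reach[A] collects every A' with A =>* A' using silent productions only.
    reach = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, word, tail in productions:
            if word == "" and tail is not None:  # silent production lhs -> tail
                for x in nonterminals:
                    if lhs in reach[x] and tail not in reach[x]:
                        reach[x].add(tail)
                        changed = True
    result = set()
    for a in nonterminals:
        for a_prime in reach[a]:
            for lhs, word, tail in productions:
                is_silent = word == "" and tail is not None
                if lhs == a_prime and not is_silent:
                    # A =>* A' and A' -> w(B) in P gives A -> w(B) in the new grammar.
                    result.add((a, word, tail))
    return result


if __name__ == "__main__":
    prods = {
        ("S2", "", "S"), ("S2", "", "T"),   # silent productions
        ("S", "a", "S"), ("S", "a", None),
        ("T", "b", None),
    }
    for p in sorted(remove_silent(prods, {"S2", "S", "T"}), key=str):
        print(p)
```

The fixed-point loop terminates because each pass either adds a pair to `reach` or stops, and there are at most |N|² pairs.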

Proof

Now one can easily see that for w ∈ T∗

S ⇒∗ w in G iff S ⇒∗ w in G′

Proof

We have now obtained a grammar which fulfills the assumption of the first lemma in this section. So the languages are definable by left-linear or right-linear grammars.

Remark

The left/right-linear grammars as in the previous lemma can be computed from the corresponding multistep left/right-linear grammars.

Derivations in Regular Grammars

Theorem

(a) Let G = (T, N, S, P) be a left-linear grammar, A ∈ N, w ∈ (N ∪ T)∗, A ⇒∗ w. Then the derivation of A ⇒∗ w is

A ⇒ A1a1 ⇒ A2a2a1 ⇒ ··· ⇒ Anan···a2a1 = w   (1)

or

A ⇒ A1a1 ⇒ A2a2a1 ⇒ ··· ⇒ Anan···a2a1 ⇒ an+1an···a2a1 = w   (2)

or

A ⇒ ε   (3)

for productions Ai −→ Ai+1ai+1 (in (1), (2), where A0 = A), An −→ an+1 (in (2)) and A −→ ε (in (3)).

Derivations in Regular Grammars

Theorem

(b) Let G = (T, N, S, P) be a right-linear grammar, A ∈ N, w ∈ (N ∪ T)∗, A ⇒∗ w. Then the derivation of A ⇒∗ w is

A ⇒ a1A1 ⇒ a1a2A2 ⇒ ··· ⇒ a1a2···anAn = w   (1)

or

A ⇒ a1A1 ⇒ a1a2A2 ⇒ ··· ⇒ a1a2···anAn ⇒ a1a2···anan+1 = w   (2)

or

A ⇒ ε   (3)

for productions Ai −→ ai+1Ai+1 (in (1) and (2), where A0 = A), An −→ an+1 (in (2)) and A −→ ε (in (3)).

Proof

The above are the only derivations possible, noting that A ⇒ ε only occurs if A does not occur on the right-hand side of a production.
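Because a derivation in a right-linear grammar produces the word strictly from left to right, membership can be tested by tracking which nonterminals may currently stand at the right end. The following is a hedged sketch, not taken from the notes: the name `accepts` and the triple encoding `(A, a, B)` for A −→ aB, `(A, a, None)` for A −→ a and `(A, "", None)` for A −→ ε are assumptions. It additionally allows a final step An −→ ε, as used by the Number grammar above.

```python
def accepts(productions, start, word):
    """Decide whether `word` is derivable from `start` in a right-linear grammar."""
    current = {start}  # nonterminals that may stand at the right end
    for position, symbol in enumerate(word):
        is_last = position == len(word) - 1
        # Form (2) of the theorem: finish with a production An -> a.
        if is_last and any(
            lhs in current and a == symbol and tail is None
            for lhs, a, tail in productions
        ):
            return True
        # Otherwise continue with productions Ai -> a Ai+1.
        current = {
            tail
            for lhs, a, tail in productions
            if lhs in current and a == symbol and tail is not None
        }
    # The whole word has been read: an epsilon-production may finish the derivation.
    return any(
        lhs in current and a == "" and tail is None
        for lhs, a, tail in productions
    )


# Right-linear grammar for {a^m b^n | m, n >= 1} from the slides.
AMBN = {("S", "a", "S"), ("S", "a", "T"), ("T", "b", "T"), ("T", "b", None)}

# Right-linear grammar for numbers without leading zeros from the slides.
NUMBER = (
    {("Number", "0", None), ("Digits", "", None)}
    | {("Number", str(d), "Digits") for d in range(1, 10)}
    | {("Digits", str(d), "Digits") for d in range(10)}
)

if __name__ == "__main__":
    print(accepts(AMBN, "S", "aabbb"), accepts(AMBN, "S", "abab"))
    print(accepts(NUMBER, "Number", "105"), accepts(NUMBER, "Number", "01"))
```

The set `current` plays the role of the state set of a nondeterministic finite automaton, which is one way to see why regular grammars are simple to parse.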

Mixing of Left- and Right-Linear

Remark

In a regular grammar we are not allowed to mix left-linear and right-linear productions. Otherwise we would obtain truly context-free languages.

Example (Mixing Left/Right-Linear Rules)

The following grammar generates the language L(G) = ?, which (as we will see later) is context-free but not regular.

grammar G

terminals a, b

nonterminals S, T

start symbol S

productions
  S −→ ab
  S −→ aT
  T −→ Sb

2.2.3. Regular Expressions (13.8)

Operators for Forming Languages

Definition

Let L1, L2, L ⊆ T∗ be languages over the alphabet T.

1. The concatenation L1.L2 of L1 and L2 is defined as

L1.L2 := {w1w2 | w1 ∈ L1, w2 ∈ L2}

2. The union L1 | L2 of L1 and L2 is defined as

L1 | L2 := L1 ∪ L2

The union is sometimes denoted by +.

3. The iteration or Kleene-star L∗ of L is defined as

L∗ := {w1w2···wn | n ≥ 0, w1, ..., wn ∈ L}

Note that ε ∈ L∗.

Regular Expressions

Definition

Let T be an alphabet. We define the set of regular expressions over T inductively, where each regular expression will be a language L ⊆ T∗.

- ∅, {ε} are regular expressions.
- If a ∈ T then {a} is a regular expression.
- If L, L′ are regular expressions, so are L.L′, L | L′, L∗.
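Since each regular expression is itself a language, the three operators can be tried out directly on finite sets of strings. The sketch below is an illustration under assumptions: the function names `concat`, `union` and `star` are my own, and because L∗ is usually infinite, `star` only returns the words up to a length bound `max_len`.

```python
def concat(l1, l2):
    """L1.L2: every word of l1 followed by every word of l2."""
    return {u + v for u in l1 for v in l2}


def union(l1, l2):
    """L1 | L2: set union."""
    return l1 | l2


def star(l, max_len):
    """The words of L* that have length at most max_len."""
    result = {""}      # n = 0 contributes the empty word
    frontier = {""}
    while frontier:
        # Append one more word of l to the words found in the last round.
        frontier = {
            u + v for u in frontier for v in l if len(u + v) <= max_len
        } - result
        result |= frontier
    return result


# The Number example, restricted to at most three digits.
non_zero_digit = {str(d) for d in range(1, 10)}
digit = union({"0"}, non_zero_digit)
number = union({"0"}, concat(non_zero_digit, star(digit, 2)))

if __name__ == "__main__":
    print(sorted(star({"ab"}, 4)))
    print("907" in number, "042" in number)
```

With the bound 2 on the iterated part, `number` consists exactly of the decimal representations of 0 to 999.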

Examples of Regular Expressions

- The set of non-zero digits is defined as

  NonZeroDigit = {1} | {2} | ··· | {9}

- The set of digits is defined as

  Digit = {0} | NonZeroDigit

- The set of numbers without leading zero is

  Number = {0} | (NonZeroDigit.(Digit∗))

- The set of capital letters is defined by

  CapitalLetter = {A} | {B} | ··· | {Z}

Examples of Regular Expressions

- The set of postcodes can be defined as

  postcode = CapitalLetter.CapitalLetter.Digit.{␣}.Digit.CapitalLetter.CapitalLetter

  where ␣ denotes the blank.
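The Number and postcode expressions above translate almost literally into the notation of Python's `re` module. This is a small sketch for illustration; the names `NUMBER`, `POSTCODE`, `is_number` and `is_postcode` are assumptions, and `fullmatch` is used so that the whole string, not just a prefix, has to belong to the language.

```python
import re

# Number = {0} | (NonZeroDigit.(Digit*))
NUMBER = re.compile(r"0|[1-9][0-9]*")

# postcode = CapitalLetter.CapitalLetter.Digit.{blank}.Digit.CapitalLetter.CapitalLetter
POSTCODE = re.compile(r"[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]")


def is_number(text):
    """True iff text is a number without leading zeros."""
    return NUMBER.fullmatch(text) is not None


def is_postcode(text):
    """True iff text has the form LLd dLL."""
    return POSTCODE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["0", "120", "012"]:
        print(candidate, is_number(candidate))
    for candidate in ["SA1 8PP", "SA18PP"]:
        print(candidate, is_postcode(candidate))
```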

Regular Expressions in Programming

- Regular expressions occur very often in programming.
- They
  - occur in Linux/Unix (commands grep/egrep),
  - occur in scripting languages (Perl, Python, Ruby)
    (one of the main innovations of Ruby over Python was an improved notation ∼ for matching of regular expressions),
  - occur in SQL,
  - are supported in most programming languages by libraries.

Notations for Regular Expressions

- One usually writes a instead of {a}. In order to avoid ambiguities, one has to keep the symbols of T distinct from the operations on regular expressions, and writes \( for the symbol ( in the alphabet, and similarly \), \|, \∗.
- One writes LL′ for L.L′.
- One writes [a1···an] for {a1} | ··· | {an}.
- One writes [a-z] for [abc···z], and similarly [0-9].
- One writes the star on the line for the Kleene star of L, and L+ (with + on the line or as a superscript) for L.L∗ (so L+ := {w1w2···wn | n ≥ 1, w1, ..., wn ∈ L}).
- Lots of other useful operators for constructing regular expressions have been defined.
- Each language has its own set of operators on regular expressions (often using different notations) and its own syntax. Sometimes operators are introduced which go beyond regular languages.
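As one concrete instance of these notations, the following sketch uses Python's `re` module, which is only one of many dialects. The helper name `in_language` is an assumption made for this example; `re.escape` is the library function that inserts the backslashes needed to treat operator symbols as alphabet symbols.

```python
import re


def in_language(pattern, text):
    """True iff the whole of `text` matches the regular expression `pattern`."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # Escaping: the alphabet symbols ( | ) * are written with a backslash.
    print(in_language(r"\(a\|b\)\*", "(a|b)*"))
    # re.escape produces an escaped pattern for an arbitrary literal string.
    print(in_language(re.escape("(a|b)*"), "(a|b)*"))
    # Character ranges, and + versus *: L+ excludes the empty word, L* includes it.
    print(in_language(r"[a-z]+", "abc"), in_language(r"[a-z]+", ""))
    print(in_language(r"[a-z]*", ""))
    # Juxtaposition is concatenation; parentheses group.
    print(in_language(r"(ab)+", "abab"), in_language(r"(ab)+", "aba"))
```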

Example Use of Regular Expressions

- Assume you have files called logiccomputationch1.tex, logiccomputationch2.tex, logiccomputationch3.tex, ... Concatenate all of them into one file:

    cssetzer@cs-svr1:> cat logiccomputationch[0-9].tex > logiccomputationall.tex

- Process the lines of a file containing entries separated by “,”, and do something if the first field is a student number (a string consisting of digits only). Python code:

    import re

    # filename is assumed to hold the path of the input file
    regExpStud = re.compile('^[0-9]*$')
    with open(filename) as file:
        for line in file:
            a = line.split(',')
            if regExpStud.match(a[0]):
                print(a[1][:-1])  # cut off trailing '\n'

Closure of Regular Languages

In order to show that all regular expressions are regular, we first show the following

Lemma

Let G, G′ be both left-linear grammars or both right-linear grammars. Then we can define left-linear or right-linear grammars Gi (i = 1, 2, 3) s.t.

1. L(G1) = L(G) | L(G′),
2. L(G2) = L(G).L(G′),
3. L(G3) = L(G)∗.

Proof

Assume in 1./2./3.

G = (T, N, S, P), G′ = (T′, N′, S′, P′).

After renaming of nonterminals we can assume N ∩ N′ = ∅. Let S″ be a new symbol not in N ∪ N′ ∪ T ∪ T′. We define multi-step left/right-linear grammars with those properties, from which one can construct ordinary (one-step) left/right-linear grammars with those properties. We only carry out the proof for right-linear grammars.

Proof of 1.

We define G1 as follows:

grammar G1

terminals T ∪ T′

nonterminals N ∪ N′ ∪ {S″}

start symbol S″

productions
  S″ −→ S
  S″ −→ S′
  P
  P′

Proof of 1.

So G1 has the productions from G and G′ plus S″ −→ S and S″ −→ S′.

Derivations in G1 have the form

S″ ⇒ S ⇒∗ w and S″ ⇒ S′ ⇒∗ w′

for derivations S ⇒∗ w in G and S′ ⇒∗ w′ in G′. So for w″ ∈ (T ∪ T′)∗ we have

S″ ⇒∗ w″ in G1 iff S ⇒∗ w″ in G or S′ ⇒∗ w″ in G′,

so L(G1) = L(G) ∪ L(G′).

Proof of 2.

We define G2 as follows:

grammar G2

terminals T ∪ T′

nonterminals N ∪ N′

start symbol S

productions
  A −→ aA′ for A −→ aA′ ∈ P (A, A′ ∈ N, a ∈ T)
  A −→ aS′ for A −→ a ∈ P (A ∈ N, a ∈ T)
  P′

Proof of 2.

So G2 has

- the productions from G′,
- the productions of the form A −→ aA′ from G, and
- productions A −→ aS′, if A −→ a is a production from G.

A derivation in G2 starts with a derivation

S ⇒ a1A1 ⇒ a1a2A2 ⇒ a1a2a3A3 ⇒ ··· ⇒ a1a2···an−1An−1 ⇒ a1a2···anS′

for derivations in G of the form

S ⇒ a1A1 ⇒ a1a2A2 ⇒ a1a2a3A3 ⇒ ··· ⇒ a1a2···an−1An−1 ⇒ a1a2···an.

Proof of 2.

Then this is followed by a derivation

a1a2···anS′ ⇒ a1a2···anb1B1 ⇒ a1a2···anb1b2B2 ⇒ ··· ⇒ a1a2···anb1b2···bm−1Bm−1 ⇒ a1a2···anb1b2···bm,

for a derivation in G′ of the form

S′ ⇒ b1B1 ⇒ b1b2B2 ⇒ ··· ⇒ b1b2···bm−1Bm−1 ⇒ b1b2···bm

Therefore S ⇒∗ w in G2 for some w ∈ (T ∪ T′)∗ if and only if S ⇒∗ w′ in G and S′ ⇒∗ w″ in G′ for some w′, w″ s.t. w = w′w″. So L(G2) = L(G).L(G′).
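The constructions of G1 and G2 can be sketched and spot-checked in Python. The code is an illustration under assumptions, not the notes' own implementation: productions are triples `(A, a, B)` for A −→ aB and `(A, a, None)` for A −→ a, a silent production A −→ B is `(A, "", B)`, the grammar G is assumed to contain only productions of the forms A −→ aB and A −→ a (as in the proof), and the helper `words` enumerates derivable words only up to a length bound.

```python
def union_grammar(p1, s1, p2, s2, new_start="S''"):
    """G1: a new start symbol with silent productions to both old start symbols."""
    return p1 | p2 | {(new_start, "", s1), (new_start, "", s2)}, new_start


def concat_grammar(p1, s1, p2, s2):
    """G2: every terminating production A -> a of G becomes A -> a S'."""
    linked = {(lhs, a, s2 if tail is None else tail) for lhs, a, tail in p1}
    return linked | p2, s1


def words(productions, start, max_len):
    """All terminal words of length <= max_len derivable from `start`."""
    result = set()
    seen = {("", start)}
    stack = [("", start)]
    while stack:
        prefix, nonterminal = stack.pop()
        for lhs, a, tail in productions:
            if lhs != nonterminal or len(prefix + a) > max_len:
                continue
            if tail is None:
                result.add(prefix + a)
            elif (prefix + a, tail) not in seen:
                seen.add((prefix + a, tail))
                stack.append((prefix + a, tail))
    return result


# G generates a+, G' generates b+; the nonterminal sets {S} and {T} are disjoint.
P = {("S", "a", "S"), ("S", "a", None)}
P_PRIME = {("T", "b", "T"), ("T", "b", None)}

if __name__ == "__main__":
    g1, start1 = union_grammar(P, "S", P_PRIME, "T")
    g2, start2 = concat_grammar(P, "S", P_PRIME, "T")
    print(sorted(words(g1, start1, 3)))
    print(sorted(words(g2, start2, 3)))
```

Note that `union_grammar` returns a multi-step grammar, since it contains silent productions; these could be removed afterwards as described in the proof of the lemma on silent productions.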

Proof of 3.

We define G3 as follows:

grammar G3

terminals T

nonterminals N

start symbol S

productions
  S −→ ε
  A −→ aA′ for A −→ aA′ ∈ P (A, A′ ∈ N, a ∈ T)
  A −→ aS for A −→ a ∈ P (A ∈ N, a ∈ T)

Proof of 3.

Derivations in G3 are S ⇒ ε, or they start similarly as for concatenation with

S ⇒∗ wS

for a derivation S ⇒∗ w in G with w ∈ T+. In the latter case the derivation can continue either (using S −→ ε) with

wS ⇒ w

or with

wS ⇒∗ ww′S

for a derivation S ⇒∗ w′ in G. Again, in the latter case we can continue (using S −→ ε) with

ww′S ⇒ ww′

or with

ww′S ⇒∗ ww′w″S

for a derivation S ⇒∗ w″ in G, etc.

Proof of 3.

We obtain that in G3, for w ∈ T∗, we have

S ⇒∗ w

iff there exist derivations in G of

- S ⇒∗ w1
- S ⇒∗ w2
- ···
- S ⇒∗ wn

s.t. w = w1w2···wn. So we get

L(G3) = {w1w2···wn | n ≥ 0, w1, ..., wn ∈ L(G)} = L(G)∗

Regular Expressions are Regular

Lemma

Let L be a regular expression. Then there exist a left-linear grammar G and a right-linear grammar G′ s.t.

L(G) = L(G′) = L

Proof

Induction on the definition of regular expressions.

Case 1: L = ∅, {ε}, {a} (where a ∈ T). Then L is finite, therefore definable by a left/right-linear grammar.

Case 2: L = L1 | L2 or L = L1.L2 or L = L1∗. By IH the Li are defined by left/right-linear grammars Gi. By the last lemma it follows that L can be defined by a left/right-linear grammar.
