Introduction to Languages and Grammars

Alphabets and Languages

An alphabet is a finite non-empty set. Let S and T be alphabets.

S • T = { st | s ∈ S, t ∈ T } (We'll often write ST for S•T.)

λ = the empty string, the string of length zero

S^0 = { λ }
S^1 = S
S^n = S^(n-1) • S, n > 1
S^+ = S^1 ∪ S^2 ∪ S^3 ∪ . . .
S^* = S^0 ∪ S^+

A language L over an alphabet S is a subset of S^*.

How Many Languages Are There?

• How many languages over a particular alphabet are there? Uncountably infinitely many!
• Then, any finite method of describing languages cannot include all of them.
• Formal language theory gives us techniques for defining some languages over an alphabet.

Why should I care about ?

• Concepts of syntax and semantics are used widely in computer science:
  – Basic compiler functions
  – Development of computer languages
  – Exploring the capabilities and limitations of algorithmic problem solving

Methods for Defining Languages

• Grammar – Rules for defining which strings over an alphabet are in a particular language
• Automaton (plural is automata) – A mathematical model of a computer which can determine whether a particular string is in the language

Definition of a Grammar

• A grammar G is a 4-tuple G = (N, Σ, P, S), where
  – N is an alphabet of nonterminal symbols
  – Σ is an alphabet of terminal symbols
  – N and Σ are disjoint
  – S is an element of N; S is the start symbol or initial symbol of the grammar
  – P is a set of productions of the form α -> β where
    • α is in (N U Σ)* N (N U Σ)*
    • β is in (N U Σ)*

Definition of a Language Generated by a Grammar

We define => by γ α δ => γ β δ if α -> β is in P, and γ and δ are in (N U Σ)*

=>+ is the transitive closure of =>

=>* is the reflexive transitive closure of =>
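The relations => and =>* can be tried out mechanically. The following is a minimal sketch, not part of the original slides: it assumes every symbol is a single character, that uppercase letters are nonterminals, and that no production shortens a sentential form (so forms longer than `max_len` can be discarded). The grammar `ANBN` and the names `step` and `generated` are hypothetical, chosen only for illustration.

```python
from collections import deque

def step(form, productions):
    """Yield every form reachable from `form` by one => step."""
    for lhs, rhs in productions:
        start = form.find(lhs)
        while start != -1:
            # Rewrite this occurrence of lhs: gamma lhs delta => gamma rhs delta
            yield form[:start] + rhs + form[start + len(lhs):]
            start = form.find(lhs, start + 1)

def generated(productions, start, max_len):
    """Terminal strings x with start =>* x and len(x) <= max_len.

    Assumption: no production shrinks a form, so pruning by length is safe.
    """
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            words.add(form)          # only terminals left
            continue
        for nxt in step(form, productions):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words

# Hypothetical example grammar: S -> aSb | ab
ANBN = [("S", "aSb"), ("S", "ab")]
print(sorted(generated(ANBN, "S", 6), key=len))  # ['ab', 'aabb', 'aaabbb']
```

Because `step` rewrites any occurrence of a left-hand side, the same sketch also accepts productions whose left side has several symbols, as in phrase structure grammars.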

The language L generated by grammar G = (N, Σ, P, S), is defined by

L = L(G) = { x | S =>* x and x is in Σ* }

Classes of Grammars (The Chomsky Hierarchy)

• Type 0, Phrase Structure (same as basic grammar definition)
• Type 1, Context Sensitive
  – (1) α -> β where α is in (N U Σ)* N (N U Σ)*, β is in (N U Σ)+, and length(α) ≤ length(β)
  – (2) γ A δ -> γ β δ where A is in N, β is in (N U Σ)+, and γ and δ are in (N U Σ)*
• Type 2, Context Free
  – A -> β where A is in N, β is in (N U Σ)*
• Linear
  – A -> x or A -> x B y, where A and B are in N and x and y are in Σ*
• Type 3, Regular
  – (1) left linear: A -> B a or A -> a, where A and B are in N and a is in Σ
  – (2) right linear: A -> a B or A -> a, where A and B are in N and a is in Σ

Comments on the Chomsky Hierarchy (1)

• Definitions (1) and (2) for context sensitive are equivalent.
• Definitions (1) and (2) for regular grammars are equivalent.
• If a grammar has productions of all three of the forms described in definitions (1) and (2) for regular grammars (A -> B a, A -> a B, and A -> a), then it is a linear grammar.
• Each definition of context sensitive is a restriction on the definition of phrase structure.
• Every context free grammar can be converted to a context sensitive grammar which satisfies definition (2) and generates the same language, except that the language generated by the context sensitive grammar cannot contain the empty string λ.
• The definition of linear grammar is a restriction on the definition of context free.
• The definitions of left linear and right linear are restrictions on the definition of linear.

Comments on the Chomsky Hierarchy

• Every language generated by a left linear grammar can be generated by a right linear grammar, and every language generated by a right linear grammar can be generated by a left linear grammar.
• Every language generated by a left linear or right linear grammar can be generated by a linear grammar.
• Every language generated by a linear grammar can be generated by a context free grammar.
• Let L be a language generated by a context free grammar. If L does not contain λ, then L can be generated by a context sensitive grammar. If L contains λ, then L-{λ} can be generated by a context sensitive grammar.
• Every language generated by a context sensitive grammar can be generated by a phrase structure grammar.

Example: A Left Linear Grammar for Identifiers

Productions:
• S -> S a
• S -> S b
• S -> S 1
• S -> S 2
• S -> a
• S -> b

Derivations:
• S => a
• S => S 1 => a 1
• S => S 2 => S b 2 => S 1 b 2 => a 1 b 2

Example: A Right Linear Grammar for Identifiers

Productions:
• S -> a T
• S -> b T
• S -> a
• S -> b
• T -> a T
• T -> b T
• T -> 1 T
• T -> 2 T
• T -> a
• T -> b
• T -> 1
• T -> 2

Derivations:
• S => a
• S => a T => a 1
• S => a T => a 1 T => a 1 b T => a 1 b 2

Example: A Right Linear Grammar for {a^n b^m c^p | n, m, p > 0}

Productions:
• S -> a A
• A -> a A
• A -> b B
• B -> b B
• B -> c C
• B -> c
• C -> c C
• C -> c

Derivation:
S => a A => a a A => a a a A => a a a b B => a a a b c C => a a a b c c

Example: A Linear Grammar for {a^n b^n | n > 0}

Productions:
• S -> a S b
• S -> a b

Derivation:
S => a S b => a a S b b => a a a S b b b => a a a a b b b b

Example: A Linear Grammar for {a^n c^m d^m b^n | n, m > 0} (Context Free Grammar)

Productions:
• S -> a S b
• S -> a T b
• T -> c T d
• T -> c d

Derivation:
S => a S b => a a S b b => a a a T b b b => a a a c T d b b b => a a a c c d d b b b

Another Example: A Context Free Grammar for {a^n b^n c^m d^m | n, m > 0}

Productions:
• S -> R T
• R -> a R b
• R -> a b
• T -> c T d
• T -> c d

Derivation:
S => R T => a R b T => a a R b b T => a a a b b b T => a a a b b b c T d => a a a b b b c c d d

A Context Free Grammar for Expressions

Productions:
• S -> E
• E -> E + T
• E -> E - T
• E -> T
• T -> T * F
• T -> T / F
• T -> F
• F -> ( E )
• F -> a
• F -> b
• F -> c
• F -> d
• F -> e

Derivation:
• S => E => E + T
  – => E - T + T
  – => T - T + T
  – => F - T + T
  – => a - T + T
  – => a - T * F + T
  – => a - F * F + T
  – => a - b * F + T
  – => a - b * c + T
  – => a - b * c + F
  – => a - b * c + d

A Context Sensitive Grammar for {a^n b^n c^n | n > 0}

Productions:
• S -> a S B C
• S -> a B C
• a B -> a b
• b B -> b b
• C B -> B C
• b C -> b c
• c C -> c c

Derivation:
S => a S B C => a a B C B C => a a b C B C => a a b B C C => a a b b C C => a a b b c C => a a b b c c

The Chomsky Hierarchy and the Block Diagram of a Compiler

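[Figure: block diagram of a compiler. Components: Scanner (Type 3), Parser (Type 2), Intermediate Code Generator, Optimizer, Code Generator, Symbol Table Manager, Symbol Table, Error Handler. Labels on the flow: source language program, tokens, tree, intermediate code, object language program, error messages. A Type 1 label also appears in the diagram.]

Automata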

• Turing machine (Tm)
• Linear bounded automaton (lba)
• 2-stack pushdown automaton (2pda)
• (1-stack) pushdown automaton (pda)
• 1 turn pushdown automaton
• finite state automaton (fsa)

Recursive Definition

Primitive regular expressions: ∅, λ, α

Given regular expressions r1 and r2,

r1 + r2
r1 ⋅ r2
r1*
(r1)

are regular expressions.

Example (a + b) ⋅ a*

L((a + b) ⋅ a*) = L((a + b)) L(a*)
= L(a + b) L(a*)
= (L(a) ∪ L(b)) (L(a))*
= ({a} ∪ {b}) ({a})*
= {a, b}{λ, a, aa, aaa, ...}
= {a, aa, aaa, ..., b, ba, baa, ...}

Example

• Regular expression r = (aa )*(bb )*b

L(r) = {a^(2n) b^(2m) b : n, m ≥ 0}

Example

Regular expression r = (0 +1)*00 (0 +1)*

L(r) = { all strings with at least two consecutive 0 } Example

• Regular expression r = (1+ 01 )*(0 + λ)

L(r) = { all strings without two consecutive 0 } Equivalent Regular Expressions

• Definition: Regular expressions r1 and r2 are equivalent if L(r1) = L(r2)

Example

L = { all strings without two consecutive 0 }

r1 = (1 + 01)*(0 + λ)

r2 = (1*011*)*(0 + λ) + 1*(0 + λ)
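A brute-force spot check of these two expressions is easy to write. The sketch below is an illustration only and proves nothing beyond the tested lengths: it translates r1 and r2 into Python `re` syntax (`+` becomes `|`, and `(0 + λ)` becomes `0?`) and compares both against the direct test "does not contain 00" on every binary string up to a chosen length. The names `R1`, `R2`, and `agree_up_to` are hypothetical.

```python
import re
from itertools import product

# r1 = (1 + 01)*(0 + λ)
R1 = re.compile(r"(?:1|01)*0?")
# r2 = (1*011*)*(0 + λ) + 1*(0 + λ)
R2 = re.compile(r"(?:1*011*)*0?|1*0?")

def agree_up_to(max_len):
    """True if r1, r2, and 'no two consecutive 0' agree on all strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            expected = "00" not in s
            if bool(R1.fullmatch(s)) != expected:
                return False
            if bool(R2.fullmatch(s)) != expected:
                return False
    return True

print(agree_up_to(10))  # True
```

`fullmatch` is used so that the whole string, not just a prefix, has to be described by the expression.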

L(r1) = L(r2) = L, so r1 and r2 are equivalent regular expressions.

Linear Grammars

Grammars with at most one variable at the right side of a production

Examples:

S → aSb
S → λ

S → Ab
A → aAb
A → λ

A Non-Linear Grammar

Grammar G:
S → SS
S → λ
S → aSb
S → bSa

L(G) = {w : n_a(w) = n_b(w)}, where n_a(w) is the number of a in string w

Another Linear Grammar

Grammar G:
S → A
A → aB | λ
B → Ab

L(G) = {a^n b^n : n ≥ 0}

Right-Linear Grammars

• All productions have form: A → xB or A → x, where x is a string of terminals

• Example:
S → abS
S → a

Left-Linear Grammars

All productions have form: A → Bx or A → x, where x is a string of terminals

Example:
S → Aab
A → Aab | B
B → a

Regular Grammars

A regular grammar is any right-linear or left-linear grammar
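Whether a grammar fits these forms can be checked production by production. The following is a rough sketch under stated assumptions, not a definitive tool: symbols are single characters, uppercase letters are variables, and the function name `classify` and its return labels are hypothetical. It is applied to the right-linear and left-linear example grammars shown above.

```python
def classify(productions):
    """Classify a grammar given as (head, body) pairs of strings.

    Assumption for this sketch: one character per symbol,
    uppercase = variable, "" stands for λ.
    """
    right = left = True
    for _head, body in productions:
        positions = [i for i, sym in enumerate(body) if sym.isupper()]
        if len(positions) > 1:
            return "not linear"       # more than one variable on the right side
        if positions:
            if positions[0] != len(body) - 1:
                right = False         # variable is not the last symbol
            if positions[0] != 0:
                left = False          # variable is not the first symbol
    if right and left:
        return "right-linear and left-linear"
    if right:
        return "right-linear"
    if left:
        return "left-linear"
    return "linear"

G1 = [("S", "abS"), ("S", "a")]
G2 = [("S", "Aab"), ("A", "Aab"), ("A", "B"), ("B", "a")]
print(classify(G1))  # right-linear
print(classify(G2))  # left-linear
```

A grammar that mixes the two forms, or that places the variable in the middle as in S → aSb, comes out as "linear", which matches the earlier remark that such a grammar is linear but need not be regular.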

Examples:

G1:
S → abS
S → a

G2:
S → Aab
A → Aab | B
B → a

Observation

Regular grammars generate regular languages

Examples:

L(G1) = (ab)*a
L(G2) = aab(ab)*

Closure under language operations

• Theorem. The set of regular languages is closed under all of the following operations. In other words, if L1 and L2 are regular, then so is:
  – Union: L1 ∪ L2
  – Intersection: L1 ∩ L2
  – Complement: L1^c = Σ* \ L1
  – Difference: L1 \ L2
  – Concatenation: L1 L2
  – Kleene star: L1*

Regular expressions versus regular languages

• We defined a regular language to be one that is accepted by some DFA (or NFA).
• We can prove that a language is regular by this definition if and only if it corresponds to some regular expression. Thus,
  – 1) Given a DFA or NFA, there is some regular expression to describe the language it accepts
  – 2) Given a regular expression, we can construct a DFA or NFA to accept the language it represents

Application: Lexical-analyzers and lexical-analyzer generators

• Lexical analyzer:
  – Input: Character string comprising a computer program in some language
  – Output: A string of symbols representing tokens, the elements of that language
• Ex in C++ or Java:
  – Input: if (x == 3) y = 2;
  – Output (sort of): if-token, expression-token, variable-name-token, assignment-token, numeric-constant-token, statement-separator-token.

Lexical-analyzer generators

• Input: A list of tokens in a programming language, described as regular expressions
• Output: A lexical analyzer for that language
• Technique: Builds an NFA recognizing the language tokens, then converts to DFA.

Regular Expressions: Applications and Limitations

• Regular expressions have many applications:
  – Specification of syntax in programming languages
  – Design of lexical analyzers for compilers
  – Representation of patterns to match in search engines
  – Provide an abstract way to talk about programming problems (a language corresponds to the inputs that produce output yes in a yes/no programming problem)
• Limitations
  – There are lots of reasonable languages that are not regular, hence lots of programming problems that can't be solved using the power of a DFA or NFA

CFL

• Context-Free Languages: {a^n b^n}, {ww^R}

[Figure: the regular languages drawn inside the context-free languages.]

[Figure: context-free grammars paired with pushdown automata (an automaton with a stack).]

Example

A context-free grammar G:
S → aSb
S → λ

A derivation: S ⇒ aSb ⇒ aaSbb ⇒ aabb

L(G) = {a^n b^n : n ≥ 0}

Describes parentheses: (((( ))))

Example

A context-free grammar G:
S → aSa
S → bSb
S → λ

A derivation: S ⇒ aSa ⇒ abSba ⇒ abba

L(G) = {ww^R : w ∈ {a, b}*}

Example

A context-free grammar G:
S → aSb
S → SS
S → λ

A derivation: S ⇒ SS ⇒ aSbS ⇒ abS ⇒ ab

A derivation: S ⇒ SS ⇒ aSbS ⇒ abS ⇒ abaSb ⇒ abab

Definition: Context-Free Languages

A language L is context-free if and only if there is a context-free grammar G with L = L(G)

Derivation Order

1. S → AB
2. A → aaA
3. A → λ
4. B → Bb
5. B → λ

Leftmost derivation (productions 1, 2, 3, 4, 5): S ⇒ AB ⇒ aaAB ⇒ aaB ⇒ aaBb ⇒ aab

Rightmost derivation (productions 1, 4, 5, 2, 3): S ⇒ AB ⇒ ABb ⇒ Ab ⇒ aaAb ⇒ aab

S → aAB
A → bBb
B → A | λ

Leftmost derivation: S ⇒ aAB ⇒ abBbB ⇒ abAbB ⇒ abbBbbB ⇒ abbbbB ⇒ abbbb

Rightmost derivation: S ⇒ aAB ⇒ aA ⇒ abBb ⇒ abAb ⇒ abbBbb ⇒ abbbb

S → AB A → aaA | λ B → Bb | λ

S ⇒ AB ⇒ aaAB ⇒ aaABb ⇒ aaBb ⇒ aab

Derivation Tree

[Figure: derivation tree built up step by step along the derivation. Root S has children A and B; A has children a, a, A; B has children B, b; the inner A and the inner B each have the single child λ.]

yield: aaλλb = aab

Sometimes, derivation order doesn't matter

Leftmost: S ⇒ AB ⇒ aaAB ⇒ aaB ⇒ aaBb ⇒ aab

Rightmost: S ⇒ AB ⇒ ABb ⇒ Ab ⇒ aaAb ⇒ aab

Same derivation tree

E → E + E | E ∗ E | (E) | a

a + a ∗ a

Leftmost derivation: E ⇒ E + E ⇒ a + E ⇒ a + E ∗ E ⇒ a + a ∗ E ⇒ a + a ∗ a

[Figure: derivation tree with root E → E + E; the left child derives a and the right child expands as E ∗ E.]

Leftmost derivation: E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ a + E ∗ E ⇒ a + a ∗ E ⇒ a + a ∗ a

[Figure: derivation tree with root E → E ∗ E; the left child expands as E + E and the right child derives a.]

Two derivation trees

The grammar E → E + E | E ∗ E | (E) | a is ambiguous:

string a + a ∗ a has two derivation trees
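The number of derivation trees can be counted mechanically for this particular grammar. The following is a small sketch, assuming single-character tokens and a hand-written recursion over token spans that is specific to E → E + E | E ∗ E | (E) | a (with `*` written in ASCII); the name `count_trees` is hypothetical and this is not a general parser.

```python
from functools import lru_cache

def count_trees(tokens):
    """Count derivation trees of `tokens` from E in E -> E+E | E*E | (E) | a."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees that derive tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "a" else 0       # E -> a
        total = 0
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)              # E -> (E)
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):               # E -> E + E or E -> E * E
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens)) if tokens else 0

print(count_trees("a+a*a"))    # 2
print(count_trees("(a+a)*a"))  # 1
```

Each choice of top-level operator corresponds to a different root production, which is why a + a ∗ a is counted twice while the parenthesized string is counted once.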

[Figure: the two derivation trees for a + a ∗ a, one with root E + E and one with root E ∗ E.]

Definition:

A context-free grammar G is ambiguous if some string w ∈ L(G) has two or more derivation trees.

Equivalently, a context-free grammar G is ambiguous if some string w ∈ L(G) has two or more leftmost derivations (or rightmost derivations).

[Figure: the two derivation trees for 2 + 2 ∗ 2 evaluate differently. With root E + E, the subtree 2 ∗ 2 evaluates to 4 and the result is 2 + 2 ∗ 2 = 6. With root E ∗ E, the subtree 2 + 2 evaluates to 4 and the result is 2 + 2 ∗ 2 = 8.]

• Ambiguity is bad for programming languages

• We want to remove ambiguity

New non-ambiguous grammar:
E → E + T
E → T
T → T ∗ F
T → F
F → (E)
F → a

E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ a + T ⇒ a + T ∗ F ⇒ a + F ∗ F ⇒ a + a ∗ F ⇒ a + a ∗ a

[Figure: derivation tree for a + a ∗ a. Root E has children E, +, T; the left E derives a through T and F; the right T has children T, ∗, F, each deriving a.]

Copyright © 2006 Addison-Wesley. All rights reserved.

The grammar G: E → E + T, E → T, T → T ∗ F, T → F, F → (E), F → a is non-ambiguous: every string w ∈ L(G) has a unique derivation tree.

Another Ambiguous Grammar

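IF_STMT → if EXPR then STMT | if EXPR then STMT else STMT

[Figure: first derivation tree. IF_STMT expands as "if expr1 then STMT", and the inner STMT expands as "if expr2 then stmt1 else stmt2".]

[Figure: second derivation tree. IF_STMT expands as "if expr1 then STMT else stmt2", and the inner STMT expands as "if expr2 then stmt1".]

Inherent Ambiguity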

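• Some context free languages have only ambiguous grammars

Example: L = {a^n b^n c^m} ∪ {a^n b^m c^m}

[Figure: two derivation trees with root S, one through S1 (with nodes S1 and c below it) and one through S2 (with nodes a and S2 below it).]

Simplification of Context Free Grammar

A Substitution Rule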

Grammar:
S → aB
A → aaA
A → abBc
B → aA
B → b

Substitute B → b

Equivalent grammar:
S → aB | ab
A → aaA
A → abBc | abbc
B → aA

Substitute B → aA

Equivalent grammar:
S → aB | ab | aaA
A → aaA
A → abBc | abbc | abaAc

In general: given A → xBz and B → y1, substitute B → y1 to obtain the equivalent grammar A → xBz | x y1 z

Nullable Variables

λ-production: A → λ

Nullable variable: A ⇒ ... ⇒ λ

Example Grammar:
S → aMb
M → aMb
M → λ

Nullable variable: M
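The set of nullable variables can be found by a fixed-point iteration: a variable is nullable once it has some production whose right side consists only of nullable variables (the empty right side, λ, trivially qualifies). The following is a minimal sketch; the encoding of productions as `(head, body)` pairs with tuple bodies and the name `nullable_variables` are assumptions made for illustration.

```python
def nullable_variables(productions):
    """Return the variables A with A => ... => λ.

    productions: list of (head, body); body is a tuple of symbols, () is λ.
    """
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            # all(...) is True for the empty body, so A -> λ seeds the set.
            if head not in nullable and all(sym in nullable for sym in body):
                nullable.add(head)
                changed = True
    return nullable

GRAMMAR = [
    ("S", ("a", "M", "b")),
    ("M", ("a", "M", "b")),
    ("M", ()),
]
print(nullable_variables(GRAMMAR))  # {'M'}
```

S is not reported because every S production contains the terminals a and b, which can never derive λ.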

Unit Production: A → B (a single variable on both sides)

Observation: A → A is removed immediately

Example Grammar:
S → aA
A → a
A → B
B → A
B → bb

After substituting the unit productions:
S → aA | aB | aA
A → a
B → bb

Final grammar (repeated productions removed):
S → aA | aB
A → a
B → bb
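The slides remove unit productions by substitution. A different, commonly taught route is the unit-closure construction: for each variable A, collect every variable reachable from A through unit productions alone, then give A all of their non-unit productions. The sketch below implements that alternative, not the substitution steps above, so its productions differ from the final grammar shown, although both generate {aa, abb} for this example. The dictionary encoding and the name `remove_unit_productions` are assumptions.

```python
def remove_unit_productions(productions):
    """Unit-closure construction (an alternative to the substitution on the slides).

    productions: dict mapping each variable to a set of bodies (tuples of symbols).
    """
    variables = set(productions)

    def is_unit(body):
        return len(body) == 1 and body[0] in variables

    result = {}
    for a in variables:
        # Variables reachable from `a` using unit productions only.
        reach, stack = {a}, [a]
        while stack:
            x = stack.pop()
            for body in productions[x]:
                if is_unit(body) and body[0] not in reach:
                    reach.add(body[0])
                    stack.append(body[0])
        # `a` inherits every non-unit production of the reachable variables.
        result[a] = {body for b in reach for body in productions[b]
                     if not is_unit(body)}
    return result

GRAMMAR = {
    "S": {("a", "A")},
    "A": {("a",), ("B",)},
    "B": {("A",), ("b", "b")},
}
for head, bodies in sorted(remove_unit_productions(GRAMMAR).items()):
    print(head, "->", sorted(bodies))
```

With this construction A and B both end up with the bodies a and bb, and B is no longer reachable from S, so a later useless-variable pass would drop it.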

Some derivations never terminate: S ⇒ A ⇒ aA ⇒ aaA ⇒ ... ⇒ aa...aA ⇒ ...

S → A
A → aA
A → λ
B → bA

Useless production: B → bA (not reachable from S)

If there is a derivation S ⇒ ... ⇒ xAy ⇒ ... ⇒ w with w ∈ L(G), then variable A is useful; otherwise, variable A is useless.

A production A → x is useless if any of its variables is useless.

Productions:
S → aSb
S → λ
S → A (useless)
A → aA (useless)
B → C (useless)
C → D (useless)

Variables: A, B, and C are useless (as is D, which has no productions).

Removing Useless Productions

Example Grammar:
S → aS | A | C
A → a
B → aa
C → aCb

First: find all variables that can produce strings with only terminals

Round 1: {A, B} (from A → a and B → aa)

Round 2: {A, B, S} (from S → A)

Keep only the variables that produce terminal symbols: {A, B, S} (the remaining variables are useless)

Remove useless productions. The grammar S → aS | A | C, A → a, B → aa, C → aCb becomes:
S → aS | A
A → a
B → aa

Second: Find all variables reachable from S

Use a Dependency Graph

[Figure: dependency graph for S → aS | A, A → a, B → aa. There is an edge from S to A; B is not reachable.]

Keep only the variables reachable from S (the remaining variables are useless)

Final Grammar:
S → aS | A
A → a
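The two passes above can be sketched in a few lines. The following is an illustration under simple assumptions (one character per symbol, uppercase letters are variables, productions given as `(head, body)` string pairs); the name `remove_useless` is hypothetical. The order matters: the generating pass runs first, then reachability is computed on what is left.

```python
def remove_useless(productions, start):
    """Drop productions that involve a useless variable.

    productions: list of (head, body) string pairs; uppercase = variable.
    """
    # Pass 1: variables that can produce a string of terminals.
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            if head not in generating and all(
                s in generating for s in body if s.isupper()
            ):
                generating.add(head)
                changed = True
    kept = [
        (h, b) for h, b in productions
        if h in generating and all(s in generating for s in b if s.isupper())
    ]

    # Pass 2: variables reachable from the start variable (dependency graph).
    reachable, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for h, b in kept:
            if h == v:
                for s in b:
                    if s.isupper() and s not in reachable:
                        reachable.add(s)
                        frontier.append(s)
    return [(h, b) for h, b in kept if h in reachable]

GRAMMAR = [("S", "aS"), ("S", "A"), ("S", "C"),
           ("A", "a"), ("B", "aa"), ("C", "aCb")]
print(remove_useless(GRAMMAR, "S"))  # [('S', 'aS'), ('S', 'A'), ('A', 'a')]
```

On the example grammar the first pass finds {A, B, S} and the second pass drops B, matching the final grammar above.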

Removing All

•Step 1: Remove Nullable Variables

•Step 2: Remove Unit-Productions

•Step 3: Remove Useless Variables

Each production has the form A → BC (B and C are variables) or A → a (a is a terminal).

Chomsky Normal Form:
S → AS
S → a
A → SA
A → b

Not Chomsky Normal Form:
S → AS
S → AAS
A → SA
A → aa

• Example (not Chomsky Normal Form):
S → ABa
A → aab
B → Ac

Introduce variables for terminals: Ta, Tb, Tc

S → A B Ta
A → Ta Ta Tb
B → A Tc
Ta → a
Tb → b
Tc → c

Introduce intermediate variable: V1

S → A V1
V1 → B Ta
A → Ta Ta Tb
B → A Tc
Ta → a
Tb → b
Tc → c

Introduce intermediate variable: V2

S → A V1
V1 → B Ta
A → Ta V2
V2 → Ta Tb
B → A Tc
Ta → a
Tb → b
Tc → c

Final grammar in Chomsky Normal Form:
S → A V1
V1 → B Ta
A → Ta V2
V2 → Ta Tb
B → A Tc
Ta → a
Tb → b
Tc → c

Initial grammar:
S → ABa
A → aab
B → Ac

In general:

From any context-free grammar (which doesn't produce λ) not in Chomsky Normal Form, we can obtain an equivalent grammar in Chomsky Normal Form.

First remove:
• Nullable variables
• Unit productions

Then, for every terminal a: add production Ta → a (new variable: Ta), and in productions whose right side has length at least two, replace a with Ta.

Replace any production A → C1 C2 ... Cn with
A → C1 V1
V1 → C2 V2
...
V(n-2) → C(n-1) Cn

New intermediate variables: V1, V2, ..., V(n-2)

Observations

• Chomsky normal forms are good for parsing and proving theorems

• It is very easy to find the Chomsky normal form for any context-free grammar
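The two conversion steps described earlier (a variable Ta for each terminal, then a chain of intermediate variables V1, V2, ...) can be sketched as follows. This is a minimal illustration, assuming the input already has no λ-productions and no unit productions, that terminals are lowercase single characters, and that the names `Ta` and `Vk` do not clash with existing variables; `to_cnf` is a hypothetical name.

```python
def to_cnf(productions):
    """Convert to Chomsky Normal Form.

    productions: list of (head, body) with body a list of symbols.
    Assumes no λ-productions and no unit productions.
    """
    term_var = {}
    step1 = []
    # Step 1: in bodies of length >= 2, replace each terminal a with Ta.
    for head, body in productions:
        if len(body) == 1:
            step1.append((head, list(body)))        # A -> a is already fine
            continue
        new_body = []
        for sym in body:
            if sym.islower():
                term_var.setdefault(sym, "T" + sym)
                new_body.append(term_var[sym])
            else:
                new_body.append(sym)
        step1.append((head, new_body))

    # Step 2: break bodies longer than two with intermediate variables.
    result, counter = [], 0
    for head, body in step1:
        while len(body) > 2:
            counter += 1
            v = f"V{counter}"
            result.append((head, [body[0], v]))     # A -> C1 V
            head, body = v, body[1:]                # V -> C2 ... Cn
        result.append((head, body))

    # Finally add Ta -> a for every terminal that was replaced.
    for t, v in sorted(term_var.items()):
        result.append((v, [t]))
    return result

GRAMMAR = [("S", ["A", "B", "a"]), ("A", ["a", "a", "b"]), ("B", ["A", "c"])]
for head, body in to_cnf(GRAMMAR):
    print(head, "->", " ".join(body))
```

Run on the example S → ABa, A → aab, B → Ac, the sketch prints the same eight productions as the final grammar in the worked example, with V1 and V2 numbered in order of appearance.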

All productions have form: A → a V1 V2 ... Vk, k ≥ 0 (a is a symbol; V1, ..., Vk are variables)

Greibach Normal Form:
S → cAB
A → aA | bB | b
B → b

Not Greibach Normal Form:
S → abSb
S → aa

• Greibach normal forms are very good for parsing

• It is hard to find the Greibach normal form of any context-free grammar

Context-free languages are closed under: Union

If L1 is context free and L2 is context free, then L1 ∪ L2 is context-free.

Example:
L1 = {a^n b^n}, with grammar S1 → a S1 b | λ
L2 = {ww^R}, with grammar S2 → a S2 a | b S2 b | λ
Union: L = {a^n b^n} ∪ {ww^R}, with added production S → S1 | S2

In general: for context-free languages L1, L2 with context-free grammars G1, G2 and start variables S1, S2, the grammar of the union L1 ∪ L2 has new start variable S and additional production S → S1 | S2.

Concatenation

Context-free languages are closed under: Concatenation

If L1 is context free and L2 is context free, then L1 L2 is context-free.

Example:
L1 = {a^n b^n}, with grammar S1 → a S1 b | λ
L2 = {ww^R}, with grammar S2 → a S2 a | b S2 b | λ
Concatenation: L = {a^n b^n}{ww^R}, with added production S → S1 S2

In general: for context-free languages L1, L2 with context-free grammars G1, G2 and start variables S1, S2, the grammar of the concatenation L1 L2 has new start variable S and additional production S → S1 S2.

Star Operation

Context-free languages are closed under: Star-operation

If L is context free, then L* is context-free.

Example:
Language L = {a^n b^n}, with grammar S → aSb | λ
Star operation: L = {a^n b^n}*, with added production S1 → S S1 | λ

In general: for a context-free language L with context-free grammar G and start variable S, the grammar of the star operation L* has new start variable S1 and additional production S1 → S S1 | λ.

Intersection

Context-free languages are not closed under: intersection

L1 is context free and L2 is context free, but L1 ∩ L2 is not necessarily context-free.

Intersection: for L1 = {a^n b^n c^m} and L2 = {a^n b^m c^m}, L1 ∩ L2 = {a^n b^n c^n}, which is NOT context-free.

Complement

Context-free languages are not closed under: complement

L is context free, but the complement of L is not necessarily context-free.

Example

L1 = {a^n b^n c^m} is context-free:
S → AC
A → aAb | λ
C → cC | λ

L2 = {a^n b^m c^m} is context-free:
S → AB
A → aA | λ
B → bBc | λ

Complement: the complement of (complement of L1 ∪ complement of L2) equals L1 ∩ L2 = {a^n b^n c^n}, which is NOT context-free. Since context-free languages are closed under union, closure under complement would make L1 ∩ L2 context-free.

The intersection of a context-free language and a regular language is a context-free language: if L1 is context free and L2 is regular, then L1 ∩ L2 is context-free.

Summary

[Figure: nested language classes, listed from outermost to innermost]

Non-recursively enumerable
Recursively-enumerable
Recursive
Context-sensitive
Context-free
Regular

Who is Noam Chomsky Anyway?

• Philosopher of Languages
• Professor of Linguistics at MIT
• Constructed the idea that language was not a learned "behavior", but that it was cognitive and innate, as opposed to stimulus-response driven
• In an effort to explain these theories, he developed the Chomsky Hierarchy

Chomsky Hierarchy

Language | Grammar | Machine | Example
Regular language | Regular grammar (right-linear grammar, left-linear grammar) | Deterministic or nondeterministic finite-state acceptor | a*
Context-free language | Context-free grammar | Nondeterministic pushdown automaton | a^n b^n
Context-sensitive language | Context-sensitive grammar | Linear-bounded automaton | a^n b^n c^n
Recursively enumerable language | Unrestricted grammar | Turing machine | Any computable function

Chomsky Hierarchy

• Comprises four types of languages and their associated grammars and machines.
• Type 3: Regular Languages
• Type 2: Context-Free Languages
• Type 1: Context-Sensitive Languages
• Type 0: Recursively Enumerable Languages
• These languages form a strict hierarchy