Great Theoretical Ideas in CS V. Adamchik CS 15-251 Outline Lecture 21 Carnegie Mellon University DFAs Finite Automata Regular Languages 0n1n is not regular Union Theorem Kleene’s Theorem NFAs Application: KMP

11 1 Deterministic Finite Automata 0 0,1 1

A machine so simple that you can 0111 111 1 ϵ understand it in just one minute 0 0 1  The machine processes a string and accepts it if the process ends in a double circle The unique string of length 0 will be denoted by ε and will be called the empty or null string

accept states (F) Anatomy of a Deterministic Finite start state (q0) 11 0 Automaton 0,1 1 The singular of automata is automaton. 1 The alphabet Σ of a finite automaton is the 0111 111 1 ϵ where the symbols come from, for 0 0 example {0,1}

transitions 1 The language (M) of a finite automaton is the set of strings that it accepts states L(M) = {x∈Σ: M accepts x} The machine accepts a string if the process It’s also called the ends in an accept state (double circle) “language decided/accepted by M”.

1 The Language L(M) of Machine M The Language L(M) of Machine M

0 0 0 0,1 1 q 0 q1 q0 1 1 What language does this DFA decide/accept? L(M) = All strings of 0s and 1s

The language of a finite automaton is the set of strings that it accepts L(M) = { w | w has an even number of 1s}

M = (Q, Σ, , q0, F) Q = {q0, q1, q2, q3} Formal definition of DFAs where Σ = {0,1} A finite automaton is a 5-tuple M = (Q, Σ, , q0, F) q0 Q is start state Q is the of states F = {q1, q2} Q accept states : Q Σ → Q transition function Σ is the alphabet

: Q Σ → Q is the transition function q 1 0 1 0 1 0,1 q0 Q is the start state q0 q0 q1 1 q q1 q2 q2 F Q is the set of accept states 0 M q2 0 0 q2 q3 q2 q q q 1 3 0 2 L(M) = the language of machine M q3 = set of all strings machine M accepts

EXAMPLE Determine the language An automaton that accepts all recognized by and only those strings that contain 001

1 0,1

0,1 0 1 0

0 0 1 {0} {00} {001}

1 L(M)={1,11,111, …}

2 Membership problem Determine the language decided by Determine whether some word belongs to the language.

0 0,1

0 1 0,1

1

L(M)={1, 01}

Regular Languages DFA Membership problem

A language over Σ is a set of strings over Σ Determine whether some word belongs to the language. A language L ⊆ Σ is regular if it is recognized by a deterministic finite automaton Theorem: The DFA Membership Problem is A language L ⊆ Σ is regular if there is solvable in linear time. a DFA which decides it. Let M = (Q, Σ, , q0, F) and w = w1...wm. L = { w | w contains 001} is regular Algorithm for DFA M: p := q0; L = { w | w has an even number of 1s} is regular for i := 1 to m do p := (p,wi); if pF then return Yes else return No.

Theorem: L = {0n1n : n∈ℕ} is not regular Are all languages regular? Notation: If a∈Σ is a symbol and n∈N then an denotes the string aaa∙∙∙a (n times).

E.g., a3 means aaa, a5 means aaaaa, 1 0 a means a, a means ϵ, etc. Theorem: Any finite language is regular Thus L = {ϵ, 01, 0011, 000111, 00001111, …}.

3 Theorem: L = {0n1n : n∈ℕ} is not regular L = strings where the number of occurrences of 01 is equal to the number of occurrences of 10 Wrong Intuition: 1 0 For a DFA to decide L, it seems like it needs 1 to “remember” how many 0’s it sees at the 0 0 beginning of the string, so that it can 1 0 “check” there are equally many 1’s. 0

But a DFA has only finitely many states — 1 1 shouldn’t be able to handle arbitrary n. M accepts only the strings with an equal number of 01’s and 10’s! For example, 010110

How to prove a language is not Theorem: L = {0n1n : n∈ℕ} is not regular regular… Full proof: Assume for contradiction there is a DFA M with L(M) = L. Suppose M is a DFA deciding L with, say, k states.

i Argue (usually by Pigeonhole) there are two Let ri be the state M reaches after processing 0 . strings x and y which reach the same state in M. By Pigeonhole, there is a repeat among

r0, r1, r2, …, rk. So say that rs = rt for some s ≠ t. Show there is a string z such that xz∈L but yz∉L. Contradiction, since M accepts either both (or s s s neither.) Since 0 1 ∈ L, starting from rs and processing 1 causes M to reach an accepting state.

Theorem: L = {0n1n : n∈ℕ} is not regular Regular Languages

Full proof:

So on input 0s1s ∈ L, M will reach an accepting state. Definition: A language L ⊆ Σ is regular if there is a DFA which decides it. Consider input 0t1s ∉ L, s≠t.

t M will process 0 , reach state rt = rs Questions: 1. Are all languages regular? then M will process 1s, and reach an accepting state. 2. Are there other ways to tell if L is regular? Contradiction!

4 Equivalence of two DFAs Union Theorem

Definition: Two DFAs M and M over the same 1 2 Given two languages, L and L , define alphabet are equivalent if they 1 2 the union of L1 and L2 as accept the same language: L(M1) = L(M2). L1 L2 = { w | w L1 or w L2 }

Given a few equivalent machines, we are Theorem: The union of two regular naturally interested in the smallest one languages is also a . with the least number of states.

Theorem: The union of two regular Union Theorem languages is also a regular language L1 = strings with qeven 0 Proof (Sketch): Let even # of 1’s M1

M1 = (Q1, Σ, 1, q0, F1) be finite automaton for L1 L = strings x with and 2 1 1 |x| div. by 3 M2 = (Q2, Σ, 2, q0, F2) be finite automaton for L2

qodd 0 We want to construct a finite automaton

M = (Q, Σ, , q0, F) that recognizes L = L1 L2 M 2 0,1 0,1 Idea: Run both M and M at the same time. p0 p1 p2 1 2 0,1

Union Theorem Union Theorem

qeven 0 qeven 0 M1 M1 Input: 101001 Input: 101001 1 1 1 1

qodd 0 qodd 0

M M 2 0,1 0,1 2 0,1 0,1

p0 p1 p2 p0 p1 p2 0,1 0,1

5 Union Theorem Union Theorem

qeven 0 qeven 0 M1 M1 Input: 101001 Input: 101001 1 1 1 1

qodd 0 qodd 0

M M 2 0,1 0,1 2 0,1 0,1

p0 p1 p2 p0 p1 p2 0,1 0,1

Union Theorem Union Theorem

qeven 0 qeven 0 M1 M1 Input: 101001 Input: 101001 1 1 1 1

qodd 0 qodd 0

M M 2 0,1 0,1 2 0,1 0,1

p0 p1 p2 p0 p1 p2 0,1 0,1

Union Theorem Union Theorem Make a DFA keeping qeven 0 qeven 0 M1 track of both at once. M1 Input: 101001 1 1 1 1 Accept.

qodd 0 qodd 0

M M 2 0,1 0,1 2 0,1 0,1

p0 p1 p2 p0 p1 p2 0,1 0,1

6 Union Theorem The Regular Operations Q = pairs of states, one from M and one from M 1 2 Union: A B = { w | w A or w B } = { (q1, q2) | q1 Q1 and q2 Q2 } = Q Q Intersection: A B = { w | w A and w B } 1 2 0 0 0 Negation: A = { w | w A } q p q p q p even, 0 even, 1 even, 2 R Reverse: A = { w1 …wk | wk …w1 A } 1 1 : A B = { vw | v A and w B }

Star: A* = { w …w | k ≥ 0 and each w A } qodd, p0 0 qodd, p1 0 qodd, p2 1 k i

1 1

The Kleene : A* The Kleene closure: A*

Star: A* = { w1 …wk | k ≥ 0 and each wi A } What is A* of A={0,1}?

From the definition of the concatenation, All binary strings we definite An, n =0, 1, 2, … recursively A0 = {ε} An+1 = An A What is A* of A={11}?

A* is a set consisting of of arbitrary many strings from A. All binary strings of an even number of 1s A* UAk k 0

Regular Languages Are Closed The Kleene Theorem (1956) Under The Regular Operations

An axiomatic system for regular languages Every regular language over Σ can be constructed from ∅ and {a}, a ∈ Σ, using only Vocabulary: Languages over alphabet Σ the operations union, concatenation Axioms: ∅, {a} for each a∈Σ and . Deduction rules:

Given L1, L2, can obtain L1 ⋃ L2

Given L1, L2, can obtain L1 ⋅ L2 * Given L, can obtain L

7 Reverse Nondeterministic finite automaton

R Reverse: A = { w1 …wk | wk …w1 A } There is another type How to construct a DFA for the reversal machine in which there a of a language? may be several possible a 1 0,1 next states. Such qk The direction in which we machines called read a string should be 0 nondeterministic. a q irrelevant. 0 q1

If we flip transitions 1 0,1 around we might not get 0 Allows transitions from qk on the same a DFA. q0 q 1 symbol to many states

Nondeterministic finite automaton Nondeterministic finite automaton (NFA) (NFA)

An NFA is defined using the same Nondeterminism can arise from two different notations M = (Q, Σ, , I, F) sources: as DFA except the initial states I and -Transition nondeterminism the transition function assigns a set of -Initial state nondeterminism states to each pair Q Σ of state and . input.

Note, every DFA is automatically also NFA.

NFA for {0k | k is a multiple of 2 or 3} Find the language recognized by this

0 NFA

0 0 s1 s 0 0 3 ε 1 1 s0 0,1 0 ε 0 0 1 s2 s4

0 L = {0n, 0n01, 0n11 | n = 0, 1, 2…}

8 What does it mean that for an NFA to What does it mean that for a NFA to recognize a string? recognize a string?

0 0 s1 s 0 3 Here we are going formally define this. 1

1 * s0 0,1 For a state q and string w, (q, w) is the set of states that the NFA can reach when it reads the string w starting at the state q. 0 1 s2 s4 Thus for NFA= (Q, Σ, , q0, F), the function *: Q x Σ -> 2Q Since each input symbol xj (for j>1) takes the previous state to a set of states, we shall use a is defined by *(q, y x ) =  (p,x ) union of these states. k p *(q,y) k

Find the language recognized by this Nondeterministic finite automaton NFA

0 Theorem. 1 1 If the language L is recognized by an NFA,

1 0 then L is also recognized by a DFA. s0

In other words, 0 1 if we ask if there is a NFA that is not equivalent to any DFA. The answer is No. L = 1* (01, 1, 10) (00)*

Nondeterministic finite automaton NFA vs. DFA

Theorem (Rabin, Scott 1959). Advantages. For every NFA there is an equivalent DFA. Easier to construct and manipulate. Sometimes exponentially smaller. For this they won the Turing Award. Sometimes algorithms much easier.

Drawbacks Acceptance testing slower. CMU prof. Sometimes algorithms more complicated. emeritus

Rabin Scott

9 Pattern Matching Pattern Matching

Input: Text T, length n. Pattern P, length k. Input: Text T of length k, string/pattern P of length n Output: Does P occur in T? Problem: Does pattern P appear inside text T? Naïve method: Automata solution:

a1, a2, a3, a4, a5, …, an The language P is regular!

Cost: Roughly O(n k) comparisons There is some DFA MP which decides it. Once you build M , feed in T: takes time O(n). may occur in images and DNA P unlikely in English text

DFA Construction Build DFA from pattern a a b a a a b b

The alphabet is {a, b}. The pattern is a a b a a a b b.

To create a DFA we consider all prefixes b ε, a, aa, aab, aaba, aabaa, aabaaa, aabaaab, a aabaaabb 0 1

These prefixes are states. The initial state is ε. The pattern is the accepting state.

DFA Construction DFA Construction a a b a a a b b a a b a a a b b

b b a b 0 a 1 a 2 0 a 1 a 2 3

b b

10 DFA Construction DFA Construction a a b a a a b b a a b a a a b b

b

b a b a b a b 0 1 2 3 4 0 1 2 3 4 5 a a a a a a b b b b

DFA Construction DFA Construction a a b a a a b b a a b a a a b b

b b

b a b a b b b 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 a a a a a a a a a a b b b b b b

a

DFA Construction The Knuth-Morris-Pratt Algorithm (1976) a a b a a a b b 1970 Cook published a paper about a possibility of b existence of a linear time algorithm a b a Knuth and Pratt developed an algorithm b a b 0 1 3 6 a a 2 a 4 a 5 b 7 8 b b Morris discovered the same algorithm b a Pittsburgh native, CMU professor.

11 The KMP Algorithm - Motivation Languages DFAs Algorithm compares the The regular operations pattern to the text in n n left-to-right, but shifts a b a a b x 0 1 is not regular the pattern more Union Theorem intelligently than the a b a a b a Kleene’s Theorem brute-force algorithm. j NFAs When a mismatch a b a a b a Application: KMP occurs, we compute the length of the longest No need to Resume Here’s What prefix of P that is a repeat these comparing You Need to proper suffix of P. comparisons here Know…

12