
CS 172: Computability and Complexity

Regular Expressions

Sanjit A. Seshia EECS, UC Berkeley

Acknowledgments: L. von Ahn, L. Blum, M. Blum

The Picture So Far

[Figure: diagram relating DFA, NFA, and Regular language]

Today’s Lecture

[Figure: diagram relating DFA, NFA, and Regular expression]

Regular Expressions

• Q. What is a regular expression?
• A. It’s a “textual” / “algebraic” representation of a regular language
  – A DFA can be viewed as a “pictorial” / “explicit” representation

• We will prove that regular expressions (regexps) indeed represent regular languages

Regular Expressions: Definition

σ is a regular expression representing {σ} (σ ∈ Σ)
ε is a regular expression representing {ε}
∅ is a regular expression representing ∅

If R1 and R2 are regular expressions representing L1 and L2 then:

(R1R2) represents L1 ⋅ L2

(R1 ∪ R2) represents L1 ∪ L2

(R1)* represents L1*
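The six cases of the definition map directly onto a small recursive data type. The following is a minimal sketch, not part of the lecture: the class names (`Sym`, `Eps`, `Empty`, `Cat`, `Alt`, `Star`) and the function `lang` are hypothetical, and `lang` only enumerates the strings of length at most n, since the represented language may be infinite.

```python
from dataclasses import dataclass

# Hypothetical AST: one class per case of the definition.
@dataclass(frozen=True)
class Sym:      # σ, a single symbol of Σ
    ch: str

@dataclass(frozen=True)
class Eps:      # ε
    pass

@dataclass(frozen=True)
class Empty:    # ∅
    pass

@dataclass(frozen=True)
class Cat:      # (R1 R2)
    left: object
    right: object

@dataclass(frozen=True)
class Alt:      # (R1 ∪ R2)
    left: object
    right: object

@dataclass(frozen=True)
class Star:     # (R1)*
    inner: object

def lang(r, n):
    """Return the set of strings of length <= n in the language represented by r."""
    if isinstance(r, Sym):
        return {r.ch} if n >= 1 else set()
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Alt):
        return lang(r.left, n) | lang(r.right, n)
    if isinstance(r, Cat):
        return {x + y for x in lang(r.left, n) for y in lang(r.right, n)
                if len(x + y) <= n}
    if isinstance(r, Star):
        base = lang(r.inner, n)
        result, frontier = {""}, {""}
        while frontier:  # keep appending pieces until nothing new fits in length n
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")

example = lang(Star(Cat(Sym("a"), Sym("b"))), 4)   # {"", "ab", "abab"}
```

The recursion in `lang` follows the definition case by case, which is the same shape the later inductive proof uses.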

Operator Precedence

1. *

2. ⋅ (often left out; a ⋅ b is written ab)

3. ∪

Example of Precedence

R1*R2 ∪ R3 = ((R1*)R2) ∪ R3
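Python's `re` module uses the same ordering (star, then concatenation, then alternation, with `|` playing the role of ∪), so the grouping can be checked on a few strings. This is only an illustrative sketch; the choice R1 = a, R2 = b, R3 = c and the sample strings are assumptions made for the example.

```python
import re

implicit = re.compile(r"a*b|c")            # R1*R2 ∪ R3, no parentheses
explicit = re.compile(r"(?:(?:a*)b)|c")    # ((R1*)R2) ∪ R3, fully parenthesized
other = re.compile(r"a*(?:b|c)")           # a different grouping: R1*(R2 ∪ R3)

samples = ["", "b", "c", "ab", "aab", "ac", "aac", "abab"]

# The unparenthesized form agrees with the fully parenthesized one...
same = all(bool(implicit.fullmatch(s)) == bool(explicit.fullmatch(s))
           for s in samples)

# ...but not with the alternative grouping.
differs = [s for s in samples
           if bool(implicit.fullmatch(s)) != bool(other.fullmatch(s))]
```

Here `same` is true, while `differs` collects the sample strings (those of the form a…ac) that only the alternative grouping accepts.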

What’s the regexp?

{ w | w has exactly a single 1 }

0*10*

What language does ∅* represent?

{ε}

What’s the regexp?

{ w | w has length ≥ 3 and its 3rd symbol is 0 }

Σ² 0 Σ*

where Σ = (0 ∪ 1)
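Answers like these can be sanity-checked by brute force: enumerate all short binary strings and compare the regexp with the set-builder description. The sketch below uses Python's `re` syntax, writing Σ as `(?:0|1)`; the helper names `all_strings` and `agrees` are hypothetical, and a bounded check is evidence, not a proof.

```python
import re
from itertools import product

SIGMA = "(?:0|1)"                                          # Σ = (0 ∪ 1)
single_one = re.compile(r"0*10*")                          # exactly a single 1
third_is_zero = re.compile(SIGMA + "{2}0" + SIGMA + "*")   # Σ² 0 Σ*

def all_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for t in product("01", repeat=n):
            yield "".join(t)

def agrees(pattern, predicate, max_len=8):
    """True if the regexp and the predicate accept the same short strings."""
    return all(bool(pattern.fullmatch(w)) == predicate(w)
               for w in all_strings(max_len))

ok_single = agrees(single_one, lambda w: w.count("1") == 1)
ok_third = agrees(third_is_zero, lambda w: len(w) >= 3 and w[2] == "0")
```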

Some Identities

Let R, S, T be regular expressions

• R ∪ ∅ = ?

• R ⋅ ∅ = ?

• Prove: R (S ∪ T) = R S ∪ R T (what’s the proof idea?)
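One way to build intuition before proving such identities is to evaluate both sides on concrete languages. The sketch below treats R, S, T as small finite sets of strings (hypothetical example values) and defines concatenation on sets; agreement on examples is not a proof, but it suggests the proof idea: argue about the represented languages and show each side is contained in the other.

```python
def concat(A, B):
    """Concatenation of two languages given as finite sets of strings."""
    return {x + y for x in A for y in B}

# Hypothetical example languages.
R = {"", "a", "ab"}
S = {"b", "ba"}
T = {"", "b"}
EMPTY = set()                              # the language of ∅

union_with_empty = R | EMPTY               # R ∪ ∅
concat_with_empty = concat(R, EMPTY)       # R ⋅ ∅

lhs = concat(R, S | T)                     # R (S ∪ T)
rhs = concat(R, S) | concat(R, T)          # R S ∪ R T
```

On these values `union_with_empty` equals R, `concat_with_empty` is empty (there is no string to pick from ∅), and `lhs` equals `rhs`.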

Some Applications of Regular Expressions

• String matching & searching
  – Utilities like …
  – Search in editors: …
• Programming Languages
  – design: …/yacc
• Computer Security
  – Virus signatures

Virus Signature as String

Sequence of words, one for each instruction:

…
pop ecx              i0
jecxz SFModMark      i1
mov esi, ecx         i2
mov eax, 0d601h      i3
pop edx              i4
pop ecx              i0
…

Chernobyl virus code fragment

Word sequence … i0 i1 i2 i3 i4 i0 … : virus!

Virus Signature as Regexp

Sequence of words doesn’t work!

…
nop
nop
pop ecx              i0
nop
jecxz SFModMark      i1
nop
nop
mov esi, ecx         i2
nop
nop
nop
mov eax, 0d601h      i3
pop edx              i4
nop
nop
pop ecx              i0
…

Simple obfuscated Chernobyl virus code fragment

Word sequence … i0 i1 i2 i3 i4 i0 … : virus!

Equivalence Theorem

A language is regular if and only if some regular expression describes it

Part I (“if part”)

Some regular expression R describes a language ⇒ That language is regular

There exists NFA N such that R describes L(N)

Given regular expression R, we show there exists NFA N such that R represents L(N)

Proof Idea: Induction on the length of R

Base Cases (R has length 1):

R = σ

R = ε

R = ∅

Inductive Step: Assume R has length k > 1 and that any regular expression of length < k represents a language that can be recognized by an NFA

What might R look like?

R = R1 ∪ R2

R = R1R2

R = (R1)*

(remember: we have NFAs for R1 and R2)

Part I (“if part”)

Some regular expression R describes a language ⇒ That language is regular

There exists NFA N such that R describes L(N)

DONE!

An Example

Transform (1(0 ∪ 1))* to an NFA

[Figure: resulting NFA, with edges labeled ε, 1, and 1,0]
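The inductive construction from Part I can be sketched in code and applied to this example. The version below is a Thompson-style variant in which every fragment has exactly one start and one accept state, so the machine it builds has more ε-edges than the hand-drawn NFA; the tuple encoding of regular expressions and the names `build`, `eps_closure`, and `accepts` are assumptions made for illustration.

```python
from itertools import count

_fresh = count()   # supplies fresh state numbers

def build(r, delta):
    """Add an NFA fragment for r to delta and return its (start, accept) states.
    r is a tuple: ("sym", a), ("eps",), ("empty",), ("alt", r1, r2),
    ("cat", r1, r2) or ("star", r1).
    delta maps (state, symbol) -> set of states; symbol None denotes ε."""
    s, f = next(_fresh), next(_fresh)

    def edge(p, a, q):
        delta.setdefault((p, a), set()).add(q)

    kind = r[0]
    if kind == "sym":                 # base case R = σ
        edge(s, r[1], f)
    elif kind == "eps":               # base case R = ε
        edge(s, None, f)
    elif kind == "empty":             # base case R = ∅: no path from s to f
        pass
    elif kind == "alt":               # R = R1 ∪ R2: branch into either fragment
        for sub in r[1:]:
            s1, f1 = build(sub, delta)
            edge(s, None, s1)
            edge(f1, None, f)
    elif kind == "cat":               # R = R1R2: chain the two fragments
        s1, f1 = build(r[1], delta)
        s2, f2 = build(r[2], delta)
        edge(s, None, s1)
        edge(f1, None, s2)
        edge(f2, None, f)
    elif kind == "star":              # R = (R1)*
        s1, f1 = build(r[1], delta)
        edge(s, None, f)              # zero iterations
        edge(s, None, s1)             # enter the loop
        edge(f1, None, s1)            # repeat
        edge(f1, None, f)             # leave the loop
    else:
        raise ValueError(kind)
    return s, f

def eps_closure(states, delta):
    """All states reachable from `states` using only ε-edges."""
    stack, seen = list(states), set(states)
    while stack:
        p = stack.pop()
        for q in delta.get((p, None), ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(r, w):
    """Build the NFA for r and simulate it on the string w."""
    delta = {}
    s, f = build(r, delta)
    cur = eps_closure({s}, delta)
    for ch in w:
        nxt = set()
        for p in cur:
            nxt |= delta.get((p, ch), set())
        cur = eps_closure(nxt, delta)
    return f in cur

# (1(0 ∪ 1))*
example = ("star", ("cat", ("sym", "1"), ("alt", ("sym", "0"), ("sym", "1"))))
```

For instance, `accepts(example, "1011")` holds because the string splits into the blocks 10 and 11, while `accepts(example, "101")` fails because the last block is incomplete.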

Part II (“only if part”)

A language is regular ⇒ Some regular expression R describes it

Turn DFA into equivalent regular expression

Proof Sketch

1. DFA → Generalized NFA (GNFA)
   • NFA with edges labeled by regexps, 1 start state, and 1 accept state
2. GNFA with k states → GNFA with 2 states
   • k > 2; delete states but maintain equivalence
3. 2-state GNFA → regular expression R

GNFA Example & Definition

[Figure: GNFA edge labeled 01*0]

A GNFA is a tuple (Q, Σ, δ, qstart, qaccept)
• Q – finite set of states
• Σ – finite alphabet (not regexps)
• qstart – initial state (unique, no incoming edges)
  • ε transitions to old start state
• qaccept – accepting state (unique, no outgoing edges)
  • ε transitions from old accept states
• δ : (Q \ {qaccept}) × (Q \ {qstart}) → R
  – R is the set of all regexps over Σ

Example: Any string matching 01*0 can cause the transition.

Step 1: DFA to GNFA

[Figure: example DFA; edge labels include a, b, and “a, b”]

What’s the corresponding GNFA?

Step 1: DFA to GNFA

[Figure: qstart, with an ε edge into the DFA, and ε edges from the DFA into qaccept]

Add unique and distinct start and accept states

Edges with multiple labels → regexp labels

If internal states (q1, q2) don’t have an edge between them, add one labeled with ∅
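The three rules of Step 1 can be sketched as a function that turns a DFA transition table into a table of GNFA edge labels. Everything about the representation is an assumption for illustration: labels are Python `re` pattern strings, `None` stands for ∅, the empty string stands for ε, and the state names `q_start` / `q_accept` and the two-state DFA are hypothetical.

```python
import re

def dfa_to_gnfa(states, alphabet, delta, start, accepting):
    """delta: dict (state, symbol) -> state.
    Returns a dict (p, q) -> label covering every pair with p != q_accept
    and q != q_start; a label is a pattern string, or None for ∅."""
    QS, QA = "q_start", "q_accept"        # fresh, distinct start and accept states
    label = {}
    for p in [QS] + list(states):
        for q in list(states) + [QA]:
            label[(p, q)] = None          # missing edges are labeled ∅
    label[(QS, start)] = ""               # ε into the old start state
    for p in accepting:
        label[(p, QA)] = ""               # ε out of each old accept state
    for p in states:
        for q in states:
            syms = [re.escape(a) for a in alphabet if delta[(p, a)] == q]
            if syms:                      # several labels become one union regexp
                label[(p, q)] = "|".join(syms)
    return label

# Hypothetical DFA over {a, b} accepting strings that contain at least one b.
dfa_delta = {("s", "a"): "s", ("s", "b"): "t",
             ("t", "a"): "t", ("t", "b"): "t"}
gnfa = dfa_to_gnfa(["s", "t"], "ab", dfa_delta, "s", {"t"})
```

In the result, the self-loop on `t` carries the single label `a|b`, and pairs without a DFA edge, such as `("t", "s")`, are labeled ∅ (`None`).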

Step 2: Eliminate states from GNFA

While machine has more than 2 states: pick an internal state, rip it out, and relabel the arrows with regular expressions to account for the missing state

[Figure: an internal state entered on 0, with a self-loop on 1, and exited on 0 is ripped out and replaced by a single arrow labeled 01*0]

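[Figure: GNFA with edges q0 –ε→ q1, q1 –b→ q2, q2 –ε→ q3, a self-loop a on q1, and a self-loop a ∪ b on q2]

[Figure: after ripping out q1: q0 –a*b→ q2, q2 –ε→ q3, self-loop a ∪ b on q2]

[Figure: after ripping out q2: q0 –(a*b)(a ∪ b)*→ q3]

δ(q0, q3) = (a*b)(a ∪ b)*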

Formally: Add qstart and qaccept to create GNFA G

Run CONVERT(G) to eliminate states & get regexp:

If #states = 2
    return the expression on the arrow going from qstart to qaccept

If #states > 2
    select qrip ∈ Q different from qstart and qaccept
    define Q′ = Q – {qrip}
    define δ′ as:
        δ′(qi, qj) = δ(qi, qrip) δ(qrip, qrip)* δ(qrip, qj) ∪ δ(qi, qj)
    return CONVERT(G′) /* recursion */

(what does this look like, pictorially?)

Prove: CONVERT(G) is equivalent to G

Proof by induction on k (number of states in G)

Base Case: k = 2

Inductive Step: Assume claim is true for k-1 states
Prove that G and G′ are equivalent
By the induction hypothesis, G′ is equivalent to CONVERT(G′)
Since CONVERT(G) returns CONVERT(G′), CONVERT(G) is equivalent to G
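CONVERT and the equivalence claim can be exercised on the four-state example from the earlier slides. This is a sketch under assumptions: regexps are Python `re` pattern strings, `None` stands for ∅ and the empty string for ε, the helpers `star`, `cat`, and `union` are hypothetical names, and the edge labels of the example GNFA are read off the figures.

```python
import re
from itertools import product

def star(r):
    if r is None or r == "":
        return ""                       # ∅* = ε* = ε
    return f"(?:{r})*"

def cat(*parts):
    if any(p is None for p in parts):
        return None                     # concatenating with ∅ gives ∅
    return "".join(f"(?:{p})" for p in parts if p != "")

def union(r, s):
    if r is None:
        return s                        # ∅ ∪ S = S
    if s is None:
        return r
    return f"(?:{r})|(?:{s})"

def convert(states, label, q_start, q_accept):
    """states: list of GNFA states; label: dict (p, q) -> pattern or None (∅),
    defined for every p != q_accept and q != q_start."""
    if len(states) == 2:
        return label[(q_start, q_accept)]
    q_rip = next(q for q in states if q not in (q_start, q_accept))
    rest = [q for q in states if q != q_rip]          # Q' = Q - {q_rip}
    new_label = {}
    for qi in rest:
        if qi == q_accept:
            continue
        for qj in rest:
            if qj == q_start:
                continue
            through = cat(label[(qi, q_rip)],
                          star(label[(q_rip, q_rip)]),
                          label[(q_rip, qj)])
            new_label[(qi, qj)] = union(through, label[(qi, qj)])
    return convert(rest, new_label, q_start, q_accept)  # recursion on G'

# Example GNFA: q0 -ε-> q1, loop a on q1, q1 -b-> q2, loop a ∪ b on q2, q2 -ε-> q3.
states = ["q0", "q1", "q2", "q3"]
label = {(p, q): None for p in states[:-1] for q in states[1:]}
label.update({("q0", "q1"): "", ("q1", "q1"): "a", ("q1", "q2"): "b",
              ("q2", "q2"): "a|b", ("q2", "q3"): ""})

result = re.compile(convert(states, label, "q0", "q3"))

# Bounded check against the language of (a*b)(a ∪ b)*: strings containing a b.
words = ["".join(t) for n in range(7) for t in product("ab", repeat=n)]
equivalent = all(bool(result.fullmatch(w)) == ("b" in w) for w in words)
```

The pattern produced is more heavily parenthesized than (a*b)(a ∪ b)*, but it accepts the same strings up to the tested length. If the language were empty, `convert` would return `None`, which this sketch does not pass to `re.compile`.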

The Complete Picture

[Figure: diagram relating DFA, NFA, Regular language, and Regular expression]

Which language is regular?

C = { w | w has equal number of 1s and 0s} NOT REGULAR

D = { w | w has equal number of occurrences of 01 and 10} REGULAR!
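The claim about D becomes plausible once one notices that occurrences of 01 and 10 must alternate, so the two counts are equal exactly when w is empty or starts and ends with the same symbol. The sketch below checks a candidate regexp for D by brute force on short strings; the regexp and the helper names are assumptions for illustration, and a bounded check is not a proof. No such check is offered for C, which the slide marks as not regular.

```python
import re
from itertools import product

# Candidate regexp for D: ε ∪ 0 ∪ 1 ∪ 0Σ*0 ∪ 1Σ*1, written in Python `re` syntax.
D_REGEX = re.compile(r"(?:0(?:0|1)*0|1(?:0|1)*1|0|1)?")

def in_D(w):
    """Equal number of occurrences of 01 and 10."""
    return w.count("01") == w.count("10")

words = ["".join(t) for n in range(11) for t in product("01", repeat=n)]
d_matches = all(bool(D_REGEX.fullmatch(w)) == in_D(w) for w in words)
```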

Next Steps

• Read Sipser 1.4 in preparation for next lecture
