Regular Expressions

CS 172: Computability and Complexity
Regular Expressions
Sanjit A. Seshia, EECS, UC Berkeley
Acknowledgments: L. von Ahn, L. Blum, M. Blum

The Picture So Far
[Diagram: DFA, NFA, regular language]

Today’s Lecture
[Diagram: DFA, NFA, regular language, regular expression]

Regular Expressions
• Q. What is a regular expression?
• A. It’s a “textual” / “algebraic” representation of a regular language
  – A DFA can be viewed as a “pictorial” / “explicit” representation
• We will prove that regular expressions (regexps) indeed represent regular languages

Regular Expressions: Definition
σ is a regular expression representing {σ} (σ ∈ Σ)
ε is a regular expression representing {ε}
∅ is a regular expression representing ∅
If R1 and R2 are regular expressions representing L1 and L2, then:
(R1R2) represents L1·L2
(R1 ∪ R2) represents L1 ∪ L2
(R1)* represents L1*

Operator Precedence
1. *
2. · (often left out; a·b is written ab)
3. ∪

Example of Precedence
R1*R2 ∪ R3 = ((R1*)R2) ∪ R3

What’s the regexp?
{ w | w has exactly a single 1 }
0*10*

What language does ∅* represent?
{ε}

What’s the regexp?
{ w | w has length ≥ 3 and its 3rd symbol is 0 }
Σ²0Σ*, where Σ = (0 ∪ 1)

Some Identities
Let R, S, T be regular expressions
• R ∪ ∅ = ?
• R·∅ = ?
• Prove: R(S ∪ T) = RS ∪ RT (what’s the proof idea?)

Some Applications of Regular Expressions
• String matching & searching
  – Utilities like grep, awk, …
  – Search in editors: emacs, …
• Programming Languages
  – Perl
  – Compiler design: lex/yacc
• Computer Security
  – Virus signatures
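The two "What's the regexp?" answers above can be tried out with Python's re module. This is a minimal sketch, not part of the slides: it assumes the alphabet Σ = {0, 1} is written as the character class [01], and the names EXACTLY_ONE_1, THIRD_SYMBOL_0, and in_language are made up for illustration.

```python
import re

# Assumption: the slide notation is translated to Python syntax, with
# Sigma = (0 ∪ 1) written as the character class [01].
EXACTLY_ONE_1 = re.compile(r"0*10*")            # 0*10*
THIRD_SYMBOL_0 = re.compile(r"[01]{2}0[01]*")   # Σ²0Σ*

def in_language(pattern, w):
    """Return True if the whole string w is described by the pattern."""
    # fullmatch is used because a regexp describes entire strings,
    # not substrings.
    return pattern.fullmatch(w) is not None

if __name__ == "__main__":
    for w in ["00100", "0110", ""]:
        print(repr(w), in_language(EXACTLY_ONE_1, w))
    for w in ["110", "0001", "111", "10"]:
        print(repr(w), in_language(THIRD_SYMBOL_0, w))
```

Note that re.search would answer a different question (whether some substring matches), which is how tools like grep usually behave.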

Virus Signature as String
Sequence of words, one for each instruction:
  pop ecx             i0
  jecxz SFModMark     i1
  mov esi, ecx        i2
  mov eax, 0d601h     i3
  pop edx             i4
  pop ecx             i0
  …
(Chernobyl virus code fragment)
The word sequence i0 i1 i2 i3 i4 i0 signals: virus!

Virus Signature as Regexp
Sequence of words doesn’t work!
[Figure: the same code fragment with nop instructions inserted between the original instructions (a simple obfuscated Chernobyl virus code fragment); it is still a virus]

Equivalence Theorem
A language is regular if and only if some regular expression describes it

Part I (“if part”)
Some regular expression R describes a language ⇒ that language is regular
That is: there exists an NFA N such that R describes L(N)

Given regular expression R, we show there exists NFA N such that R represents L(N)
Proof idea: induction on the length of R
Base cases (R has length 1):
R = σ
R = ε
R = ∅
[Figure: one NFA for each base case]

Inductive step: assume R has length k > 1 and that any regular expression of length < k represents a language that can be recognized by an NFA
What might R look like?
R = R1 ∪ R2
R = R1R2
R = (R1)*
(remember: we have NFAs for R1 and R2)

Part I (“if part”): DONE!

An Example
Transform (1(0 ∪ 1))* to an NFA
[Figure: the resulting NFA, with ε-transitions and edges labeled 1 and 1,0]

Part II (“only if part”)
A language is regular ⇒ some regular expression R describes it
Turn DFA into equivalent regular expression

Proof Sketch
1. DFA → Generalized NFA (GNFA)
   • NFA with edges labeled by regexps, 1 start state, and 1 accept state
2. GNFA with k states → GNFA with 2 states
   • k > 2; delete states but maintain equivalence
3. 2-state GNFA → regular expression R
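Returning to Part I, the inductive construction (three base cases, then union, concatenation, and star) can be sketched in code. This is an illustrative sketch only: the class NFA and the helpers nfa_symbol, nfa_epsilon, nfa_empty, nfa_union, nfa_concat, nfa_star, and accepts are hypothetical names, and the construction assumes each NFA has a single start state and a single accept state glued together with ε-edges, which is one common way to carry out the proof and not necessarily the one drawn in the lecture.

```python
from itertools import count

_fresh = count()  # source of fresh state names

class NFA:
    def __init__(self, start, accept, edges):
        self.start = start    # single start state
        self.accept = accept  # single accept state
        self.edges = edges    # state -> list of (label, target); label None = ε

def _two_states():
    return next(_fresh), next(_fresh)

def _copy_edges(*nfas):
    return {q: list(out) for n in nfas for q, out in n.edges.items()}

# Base cases: R = σ, R = ε, R = ∅
def nfa_symbol(c):
    s, t = _two_states()
    return NFA(s, t, {s: [(c, t)], t: []})

def nfa_epsilon():
    s, t = _two_states()
    return NFA(s, t, {s: [(None, t)], t: []})

def nfa_empty():
    s, t = _two_states()
    return NFA(s, t, {s: [], t: []})

# Inductive cases. Each argument must be a separately built NFA
# (do not pass the same object twice, since states would be shared).
def nfa_union(a, b):
    s, t = _two_states()
    edges = _copy_edges(a, b)
    edges[s] = [(None, a.start), (None, b.start)]
    edges[t] = []
    edges[a.accept].append((None, t))
    edges[b.accept].append((None, t))
    return NFA(s, t, edges)

def nfa_concat(a, b):
    edges = _copy_edges(a, b)
    edges[a.accept].append((None, b.start))
    return NFA(a.start, b.accept, edges)

def nfa_star(a):
    s, t = _two_states()
    edges = _copy_edges(a)
    edges[s] = [(None, a.start), (None, t)]        # enter a, or skip it
    edges[t] = []
    edges[a.accept] += [(None, a.start), (None, t)]  # repeat, or leave
    return NFA(s, t, edges)

def _closure(nfa, states):
    """All states reachable from `states` using only ε-edges."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for label, r in nfa.edges[q]:
            if label is None and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(nfa, w):
    current = _closure(nfa, {nfa.start})
    for c in w:
        step = {r for q in current for label, r in nfa.edges[q] if label == c}
        current = _closure(nfa, step)
    return nfa.accept in current

# The example from the slides: (1(0 ∪ 1))*
example = nfa_star(nfa_concat(nfa_symbol("1"),
                              nfa_union(nfa_symbol("0"), nfa_symbol("1"))))
```

For instance, accepts(example, "1011") holds because the string splits into the blocks 10 and 11, while accepts(example, "101") does not because the length is odd.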

GNFA Example & Definition
[Figure: a GNFA edge labeled 01*0]
A GNFA is a tuple (Q, Σ, δ, q_start, q_accept)
• Q – set of states
• Σ – finite alphabet (not regexps)
• q_start – initial state (unique, no incoming edges)
  • ε transitions to old start state
• q_accept – accepting state (unique, no outgoing edges)
  • ε transitions from old accept states
• δ : (Q \ {q_accept}) × (Q \ {q_start}) → R, where R is the set of all regexps over Σ
Example: any string matching 01*0 can cause the transition.

Step 1: DFA to GNFA
[Figure: a DFA with edges labeled a; a, b; and b]
What’s the corresponding GNFA?

Step 1: DFA to GNFA
[Figure: q_start with an ε edge into the DFA; ε edges from the DFA’s accept states to q_accept]
• Add unique and distinct start and accept states
• Edges with multiple labels → regexp labels
• If internal states (q1, q2) don’t have an edge between them, add one labeled with ∅

Step 2: Eliminate states from GNFA
While machine has more than 2 states:
Pick an internal state, rip it out and re-label the arrows with regular expressions to account for the missing state
[Figure: ripping out a state with incoming edge 0, self-loop 1, and outgoing edge 0 leaves a single edge labeled 01*0]

Example
[Figure: states q0, q1, q2, q3 with edge labels ε, a, b, a ∪ b, ε]
[Figure: after ripping out q1: states q0, q2, q3 with edge labels a*b, a ∪ b, ε]
[Figure: after ripping out q2: states q0, q3 with a single edge labeled (a*b)(a ∪ b)*]
δ(q0, q3) = (a*b)(a ∪ b)*

Formally:
Add q_start and q_accept to create GNFA G
Run CONVERT(G) to eliminate states & get regexp:
If #states = 2
  return the expression on the arrow going from q_start to q_accept
If #states > 2
  select q_rip ∈ Q different from q_start and q_accept
  define Q′ = Q − {q_rip}
  define δ′ as:
    δ′(q_i, q_j) = δ(q_i, q_rip) δ(q_rip, q_rip)* δ(q_rip, q_j) ∪ δ(q_i, q_j)
  return CONVERT(G′) /* recursion */
(what does this look like, pictorially?)
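The CONVERT procedure above can be sketched as a short program. This is a sketch under stated assumptions, not the lecture's implementation: regexps are kept as strings in Python re syntax (| for ∪), None stands for ∅, the empty string stands for ε, missing entries of delta are treated as ∅, and the names re_concat, re_union, re_star, and convert are made up for illustration. The loop replaces the recursion but applies the same δ′ rule at each step.

```python
import re

def re_concat(*parts):
    """Concatenation; None (∅) annihilates, '' (ε) is the identity."""
    if any(p is None for p in parts):
        return None
    # Wrap each part in a non-capturing group to preserve precedence.
    return "".join(f"(?:{p})" for p in parts if p != "")

def re_union(r, s):
    """Union; None (∅) is the identity."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def re_star(r):
    """Kleene star; ∅* = ε* = ε."""
    if r is None or r == "":
        return ""
    return f"(?:{r})*"

def convert(states, delta, q_start, q_accept):
    """Rip out internal states until only q_start and q_accept remain.

    delta maps (q_i, q_j) to a regexp string; missing pairs mean ∅.
    Returns the final regexp string, or None if the language is ∅.
    """
    states = list(states)
    delta = dict(delta)
    while len(states) > 2:
        q_rip = next(q for q in states if q not in (q_start, q_accept))
        states.remove(q_rip)
        loop = re_star(delta.get((q_rip, q_rip)))
        new_delta = {}
        for qi in states:
            if qi == q_accept:
                continue  # no outgoing edges from q_accept
            for qj in states:
                if qj == q_start:
                    continue  # no incoming edges to q_start
                via = re_concat(delta.get((qi, q_rip)), loop,
                                delta.get((q_rip, qj)))
                label = re_union(via, delta.get((qi, qj)))
                if label is not None:
                    new_delta[(qi, qj)] = label
        delta = new_delta
    return delta.get((q_start, q_accept))

# The four-state example from the slides (edge layout as assumed there):
# q0 -ε-> q1, q1 loops on a, q1 -b-> q2, q2 loops on a ∪ b, q2 -ε-> q3.
EXAMPLE_DELTA = {
    ("q0", "q1"): "",
    ("q1", "q1"): "a",
    ("q1", "q2"): "b",
    ("q2", "q2"): "a|b",
    ("q2", "q3"): "",
}
RESULT = convert(["q0", "q1", "q2", "q3"], EXAMPLE_DELTA, "q0", "q3")
```

The returned string is heavily parenthesized, but it describes the same language as (a*b)(a ∪ b)*, which can be spot-checked with re.fullmatch(RESULT, w).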

Prove: CONVERT(G) is equivalent to G
Proof by induction on k (number of states in G)
Base case: k = 2
Inductive step: assume the claim is true for k−1 states
Prove that G and G′ are equivalent
By the induction hypothesis, G′ is equivalent to CONVERT(G′)

The Complete Picture
[Diagram: DFA, NFA, regular language, regular expression]

Which language is regular?
C = { w | w has equal number of 1s and 0s }
NOT REGULAR
D = { w | w has equal number of occurrences of 01 and 10 }
REGULAR!

Next Steps
• Read Sipser 1.4 in preparation for next lecture
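As a sanity check on the claim that D is regular, the sketch below compares D against a candidate regexp by brute force over short strings. The candidate is an assumption that does not appear in the slides: over Σ = {0, 1}, a string has equally many occurrences of 01 and 10 exactly when it is empty or starts and ends with the same symbol. The names CANDIDATE, occurrences, in_D, and agree_up_to are hypothetical.

```python
import re
from itertools import product

# Assumed candidate for D: ε, a single symbol, or a string whose first and
# last symbols are equal.
CANDIDATE = re.compile(r"(?:0[01]*0|1[01]*1|[01])?")

def occurrences(w, pair):
    """Count positions where the two-symbol string `pair` occurs in w."""
    return sum(1 for i in range(len(w) - 1) if w[i:i + 2] == pair)

def in_D(w):
    return occurrences(w, "01") == occurrences(w, "10")

def agree_up_to(n):
    """Check that D and the candidate agree on every string of length <= n."""
    return all(
        in_D(w) == (CANDIDATE.fullmatch(w) is not None)
        for k in range(n + 1)
        for w in map("".join, product("01", repeat=k))
    )
```

Agreement on all short strings is evidence, not a proof; the proof is the observation that switches between 0 and 1 alternate between 01 and 10, so the two counts differ by at most one and are equal exactly when the first and last symbols match.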
