Regular Expressions

 Another means to describe languages Regular Expressions accepted by Finite Automata.

 In some books, regular languages, by definition, are described using regular expressions.

Specifying Languages Regular Languages

 Recall: how do we specify languages?  A describes a  If language is finite, you can list all of its strings. language using only the operations  L = {a, aa, aba, aca} of:  Descriptive:  Union  L = {x | na(x) = nb(x)}  Using basic Language operations  * *  L= {aa, ab} ∪ {b}{bb}  Kleene Star  Regular languages are described using this last method

Kleene Star Operation Regular Expressions

 The set of strings that can be obtained by  Regular expressions are the mechanism concatenating any number of elements of a by which regular languages are * language L is called the Kleene Star, L described: " * i 0 1 2 3 4 L = L = L ! L ! L ! L ! L ...  Take the “set operation” definition of the Ui=0 language and:  Replace ∪ with + * 0  Note that since, L contains L , λ is an  Replace {} with () * element of L  And you have a regular expression

1 Regular expressions Regular Expression

{λ} λ  Recursive definition of regular languages / expression over Σ : {011} 011 1. ∅ is a and its regular {0,1} 0 + 1 expression is ∅

{0, 01} 0 + 01 2. {λ} is a regular language and λ is its regular expression {110}*{0,1} (110)*(0+1) 3. For each a ∈ Σ, {a} is a regular language and {10, 11, 01}* (10 + 11 + 01)* its regular expression is a {0, 11}*({11}* ∪ {101, λ}) (0 + 11)*((11)* + 101 + λ)

Regular Expression Regular Expressions

4. If L1 and L2 are regular languages with regular  Some shorthand expressions r1 and r2 then  If we apply precedents to the operators, we can -- L L is a regular language with regular 1 ∪ 2 relax the full parenthesized definition: expression (r1 + r2)  Kleene star has highest precedent -- L1L2 is a regular language with regular  Concatenation had mid precedent expression (r1r2) -- book uses (r1•r2 ) *  + has lowest precedent -- L1 is a regular language with regular expression *  Thus (r1 ) * * -- Regular expressions can be parenthesized to  a + b c is the same as (a + ((b )c)) * *  (a + b) is not the same as a + b indicate operator precidence I.e. (r1)

Only languages obtainable by using rules 1-4 are regular languages.

Regular Expressions Regular Expressions

 More shorthand  Even more shorthand

 Equating regular expressions.  Sometimes you might see in the book: n  Two regular expressions are considered equal if  r where n indicates the number of they describe the same language of r (e.g. r6) * * * +  1 1 = 1  r to indicate one or more concatenations of r. * *  (a + b) ≠ a + b

 Note that this is only shorthand! 6 +  r and r are not regular expressions.

2 Regular Expressions Regular Expressions

 Important thing to remember  Questions?  A regular expression is not a language

 A regular expression is used to describe a language.

 It is incorrect to say that for a language L, *  L = (a + b + c)

 But it’s okay to say that L is described by *  (a + b + c)

Examples of Regular Languages Examples of Regular Languages

 All finite languages can be described by  All finite languages can be described regular expressions using regular expressions  A finite language L can be expressed as  Can anyone tell me why? the union of languages each with one string corresponding to a string in L

 Example:

 L = {a, aa, aba, aca}

 L = {a} ∪ {aa} ∪ {aba} ∪ {aca}

 Regular expression: (a + aa + aba + aca)

Examples of Regular Languages Examples of Regular Languages

* *  L = {x ∈ {0,1} | |x| is even}  L = {x ∈ {0,1} | x does not end in 01 }  Any string of even length can be obtained by concatenating strings length 2.  If x does not end in 01, then either

 Any concatenation of strings of length 2 will be  |x| < 2 or even  x ends in 00, 10, or 11 *  L = {00, 01, 10, 11}  A regular expression that describes L is: *  + 0 + 1 + (0 + 1) (00 + 10 + 11)  Regular expressions describing L: ε *  (00 + 01 + 10 + 11) *  ((0 + 1)(0 + 1))

3 Useful properties of regular Examples of Regular Languages expressions

*  L = {x ∈ {0,1} | x contains an odd  Commutative number of 0s }  L + M = M + L   Express x = yz Associative i j  (L + M) + N = L + (M + N)  y is a string of the form y=1 01  (LM)N = L(MN)  In z, there must be an even number of additional 0s or z = (01k01m)*  Identities * * * * *  ∅ + L = L + ∅ = L  x can be described by (1 01 )(01 01 )  λL = L λ = L

 Questions?  ∅L = L ∅ = ∅

Useful properties of regular Useful properties of regular expressions expressions

 Distributed  Closures * * *  (L ) = L  L (M + N) = LM + LN *  ∅ = λ  (M + N)L = ML + NL *  λ = λ  Idempotent + *  L = LL  L + L = L * +  L = L + λ

 Questions?

Practical uses for regular expressions Practical uses for regular expressions

 grep  How a compiler works

 Global (search for) Regular Expressions and Print lexer Stream parser Parse of tokens codegen  Finds patterns of characters in a text file. Tree Object code  grep man foo.txt Source  grep [ab]*c[de]? foo.txt file

4 Practical uses for regular expressions Practical uses for regular expressions

 How a compiler works  How a compiler works

 The Lexical Analyzer (lexer) reads source  Tokens can be described using regular code and generates a stream of tokens expressions!  What is a token?

 Identifier

 Keyword

 Number

 Operator

 Punctuation

Examples of Regular Languages Examples of Regular Languages

 L = set of valid C keywords  L = set of valid C identifiers

 This is a  A valid C identifier begins with a letter or _  A valid C identifier contains letters,  L can be described by numbers, and _  if + then + else + while + do + goto + break + switch + …  If we let:  l = {a , b , … , z , A , B , … , Z}

 d = {1 , 2 , … , 9 , 0}

 Then a regular expression for L: *  (l + _)(l + d + _)

Practical uses for regular expressions Summary

 Regular languages can be expressed using only the  lex set operations of union, concatenation, Kleene Star.  Program that will create a lexical analyzer.  Regular languages

 Input: set of valid tokens  Means of describing: Regular Expression  Machine for accepting: Finite Automata  Tokens are given by regular expressions.  Practical uses

 Text search (grep)

 Compilers / Lexical Analysis (lex)  Questions?

 Questions?

5 For next time The bottom line

 Chicken or the egg?  Regular expressions and finite automata are

 Which came first, the regular expression or the equivalent in their ability to describe finite automata? languages.  McCulloch/Pitts -- used finite automata to model neural  Every regular expression has a FA that accepts the networks (1943) language it describes  Kleene (mid 1950s) -- Applied to regular sets  The language accepted by an FA can be described  Ken Thompson/ Bell Labs folk (1970s) -- QED / ed / grep / lex / awk / … by some regular expression.

 Recall:  The Kleene Theorem! (1956)

 Princeton dudes (1937)  But that’s next time….

6