<<

To Ponder

Computer Science and Engineering n The Ohio State University What is a language? Regular Expressions

Computer Science and Engineering n College of Engineering n The Ohio State University

Lecture 7 Language

Computer Science and Engineering n The Ohio State University o Definition: a set of strings o Examples n ℒ = cat, dog, ish n ℒ = �� � and � are hex digits n ℒ = ��� … � � > 0 ∧ ∀ � = � o Activity: For each ℒ above, find n ℒ (the cardinality of the set) n max � ∈ℒ Programming Languages

Computer Science and Engineering n The Ohio State University o Q: Are , Java, Ruby, Python, … languages in this formal sense? Programming Languages

Computer Science and Engineering n The Ohio State University o Q: Are C, Java, Ruby, Python, … languages in this formal sense? o A: !

n ℒ!"#$ is the set of well-formed Ruby programs n What the interpreter () accepts n The syntax of the language o But what does one such string mean? n The semantics of the language n Not part of formal definition of "language" n But necessary to know to claim "I know Ruby" (RE)

Computer Science and Engineering n The Ohio State University o A formal mechanism for defining a language n Precise, unambiguous, well-defined o In math, a clear distinction between: n Characters in strings (the "alphabet") n Meta characters used in writing a RE o � ⋃ � ∗� � ⋃ � � ⋃ � � ⋃ � o In computer applications, there isn't n Is '*' a Kleene star or an asterisk? o (a|b)*a(a|b)(a|b)(a|b) Literals

Computer Science and Engineering n The Ohio State University o A literal represents a character from the alphabet o Some are easy: n f, i, s, h, … o Whitespace is hard (invisible!) n \t is a tab ( 0x09) n \n is a (ascii 0x0A) n \r is a (ascii 0x0D) o So the character '\' needs to be escaped! o \\ is a \ (ascii 0x5c) Basic Operators

Computer Science and Engineering n The Ohio State University o ( ) for grouping, | for choice o Examples n cat|dog|fish n (h|H)ello n R(uby|ails) n (G|g)r(a|e)y o These operators are meta-characters too n To represent the literal: \( \) \| n \(61(3|4)\) o Activity: For each RE above, write out the corresponding language explicitly (ie, as a set of strings) Character Class

Computer Science and Engineering n The Ohio State University o Set of possible characters n (0|1|2|3|4|5|6|7|8|9) is annoying! o Syntax: [ ] n Explicit list as [0123456789] n Range as [0-9] o Negate with ^ at the beginning n [^A-Z] a character that is not a capital letter o Activity: Write the language defined by n Gr[ae]y n 0[xX][0-9a-fA-F] n [Qq][^u] Character Class Shorthands

Computer Science and Engineering n The Ohio State University o Common n \d for digit, ie [0-9] n \s for whitespace, ie [ \t\r\n] n \w for word character, ie [0-9a-zA-Z_] o And negations too n \D, \S, \W (ie [^\d], [^\s], [^\w]) n Warning: [^\d\s] ≠ [\D\S] o POSIX standard (& Ruby) includes n [[:alpha:]] alphabetic character n [[:lower:]] lowercase alphabetic character (١ n [[:digit:]] decimal digit (! Eg n [[:xdigit:]] digit n [[:space:]] whitespace including Wildcards

Computer Science and Engineering n The Ohio State University o A . matches any character (almost) n Includes space, tab, punctuation, etc! n But does not include newline o So add . to list of meta-characters n Use \. for a literal period o Examples n Gr.y n buckeye\.\d o Problem: What is RE for OSU email address for everyone named Smith? n Answer is not: smith\.\d@osu\.edu Repetition

Computer Science and Engineering n The Ohio State University o Applies to preceding character or ( ) group n ? means 0 or 1 time n * means 0 or more times (unbounded) n + means 1 or more times (unbounded) n {k} means exactly k times n {a,b} means k times, for a ≤ k ≤ b o More meta-characters to escape! n \? \* \+ \{ \} Examples

Computer Science and Engineering n The Ohio State University o colou?r o smith\.[1-9]\d*@osu\.edu o 0[xX](0|[1-9a-fA-F][0-9a-fA-F]*) o .*\.jpe?g Your Turn

Computer Science and Engineering n The Ohio State University o (Language consisting of) strings that: n Contain only letters, numbers, and _ n Start with a letter n Do not contain 2 consecutive _'s n Do not end with _ o Exemplars and counter-exemplars: n EOF, 4Temp, Test_Case3, _class, a4_Sap_X, S__T_2 o Write the corresponding RE Your Turn

Computer Science and Engineering n The Ohio State University o (Language consisting of) strings that: n Contain only letters, numbers, and _ n Start with a letter n Do not contain 2 consecutive _'s n Do not end with _ o Exemplars and counter-exemplars: n EOF, 4Temp, Test_Case3, _class, a4_Sap_X, S__T_2 o Write the corresponding RE Your Turn

Computer Science and Engineering n The Ohio State University o (Language consisting of) strings that: n Contain only letters, numbers, and _ n Start with a letter n Do not contain 2 consecutive _'s n Do not end with _ o Exemplars and counter-exemplars: n EOF, 4Temp, Test_Case3, _class, a4_Sap_X, S__T_2 o Write the corresponding RE [a-zA-Z](_[a-zA-Z0-9]|[a-zA-Z0-9])* Finite State Automota (FSA)

Computer Science and Engineering n The Ohio State University o An FSA is an "accepting rule" n Finite set of states n Transition function (relation) between states based on next character in string o DFA vs NFA

n Start state (s0) n Set of accepting states o An FSA "accepts" a string if you can

start in s0 and end up in an accepting state, consuming 1 character per step Example

Computer Science and Engineering n The Ohio State University o What language is defined by this FSA?

1 0 1

S0 S1

0 Example

Computer Science and Engineering n The Ohio State University o What language is defined by this FSA? o A. Binary strings (0's and 1's) with an even number of 0's

1 0 1

S0 S1

0 Your Turn

Computer Science and Engineering n The Ohio State University o (Language consisting of) strings that: n Contain only letters, numbers, and _ n Start with a letter n Do not contain 2 consecutive _'s n Do not end with _ o Exemplars and counter-exemplars: n EOF, 4Temp, Test_Case3, _class, a4_Sap_X, S__T_2 o Write the corresponding FSA Solution

Computer Science and Engineering n The Ohio State University Solution

Computer Science and Engineering n The Ohio State University

a-zA-Z0-9

a-zA-Z

S0 S1

a-zA-Z0-9 _

S2 Fundamental Results

Computer Science and Engineering n The Ohio State University o Expressive power of RE is the same as FSA o Expressive power of RE is limited n Write a RE for "strings of balanced parens" o ()(()()), ()(), (((()))), … o (((, ())((), … n Can not be done! (impossibility result) o Take CSE 3321… REs in Practice

Computer Science and Engineering n The Ohio State University o REs often used to find a "match" n A substring s within a longer string such that s is in the language defined by the RE (CSE|cse) ?3901 o Possible uses: n Report matching substrings and locations n Replace match with something else o Practical aspects of using REs this way n Anchors n Greedy vs lazy matching Anchors

Computer Science and Engineering n The Ohio State University o Used to specify where matching string should be with respect to a line of text o Newlines are natural breaking points n ^ anchors to the beginning of a line n $ anchors to the end of a line n Ruby: \A \z for beginning/end of string o Examples ^Hello World$ \A[Tt]he ^[^\d].\.jpe?g end\.\z Greedy vs Lazy

Computer Science and Engineering n The Ohio State University o Repetition (+ and *) allows multiple matches to begin at same place n Example: <.*>

Title

Title

o The match selected depends on whether the repetition matching is n greedy, ie matches as much as possible n lazy, ie matches as little as possible o Default is typically greedy o For lazy matching, use *? or +? Regular Expressions in Ruby

Computer Science and Engineering n The Ohio State University o Instance of a class (Regexp) pattern = Regexp.new('^Rub.') o But literal notation is common: /pattern/ /[aeiou]*/ %r{hello+} #no need to escape / o Match operator =~ (negated as !~) n Operands: String and Regexp (in either order) n Returns index of first match (or nil if not present) 'hello world' =~ /o/ #=> 4 /or/ =~ 'hello' #=> nil o Case equality, Regexp === String, à Boolean o Options post-pended: /pattern/options n i ignore case n x ignore whitespace & comments ("free spacing") Strings and Regular Expressions

Computer Science and Engineering n The Ohio State University o Find all matches as an array a.scan /[[:alpha:]]/ o Delimeter for spliting string into array a.split /[aeiou]/ o Substitution: sub and gsub (+/- !) n Replace first match vs all ("globally") a = "the quick brown fox" a.sub /[aeiou]/, '@' #=> "th@ quick brown fox" a.gsub /[aeiou]/, '@' #=> "th@ q@@ck br@wn f@x" Your Turn (Regular Expressions)

Computer Science and Engineering n The Ohio State University o Check if phone number in valid format phone = "614-292-2900" #not ok phone = "(614) 292-2900" #ok

format = ? if phone ? format #well-formatted … Your Turn (Regular Expressions)

Computer Science and Engineering n The Ohio State University o Check if phone number in valid format phone = "614-292-2900" #not ok phone = "(614) 292-2900" #ok

format = /\A\(\d{3}\) \d{3}-\d{4}\z/ if phone =~ format #well-formatted … Summary

Computer Science and Engineering n The Ohio State University o Language: A set of strings o RE: Defines a language n Recipe for making elements of language o Literals n Distinguish characters and metacharacters o Character classes n Represent 1 character in RE o Repetition o FSA n Expressive power same as RE