
CS 211 Regular Expressions

2-1 Processing Input

• If we know how to read in a line of input, what else might we want to do with it?
    • Analyze it in some way, based on some pattern
    • Extract certain values out of it, based on some pattern

• We can create regular expressions to identify patterns, and then use them to extract the relevant info out of the matching text.
• A regular expression represents a pattern
    • Can be used to "match" a particular string
      → e.g., with Scanner’s findInLine() method

• Java represents a regular expression with a String.

Regular Expressions: appendix H in the text.

CREATING REGULAR EXPRESSIONS

3 Special Symbols: Repetition

symbol                  meaning
.                       any single character
*                       zero or more of the previous thing
+                       one or more of the previous thing
?                       zero or one of the previous thing
{n}                     n of the previous thing
{n,m}                   n-to-m of the previous thing
{n,}                    n-or-more of the previous thing
any non-special char    matches itself

Each of these determines how many of the one thing on its left to repeat.
• a* matches zero or more a's.
• xyz+ matches an x, a y, and then one or more z's.
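As a quick check of the repetition symbols, here is a minimal sketch that runs a few patterns through String.matches, which succeeds only when the whole string fits the pattern. The class name RepetitionDemo, the helper fits, and the sample strings are made up for illustration.

```java
class RepetitionDemo {
    // Hypothetical helper: true when the entire input fits the pattern.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("a*", ""));          // true: zero a's is allowed
        System.out.println(fits("a*", "aaaa"));      // true: many a's
        System.out.println(fits("xyz+", "xyzzz"));   // true: + repeats only the z
        System.out.println(fits("xyz+", "xyzxyz"));  // false: xyz is not repeated as a unit
        System.out.println(fits("a{2,3}", "aaaa"));  // false: four a's is too many
    }
}
```

Note that the symbol binds only to the single thing on its left, which is why xyz+ rejects "xyzxyz".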

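4 A Simple Method for Testing Regular Expressions

import java.util.Scanner;

// the class name RegexTester is only a placeholder
public class RegexTester {
    public static void main (String[] args) {
        Scanner sc = new Scanner(System.in);
        String str = "";
        String regex = "k.t+y";
        System.out.println("\nPattern: " + regex + "\n");
        // keep testing inputs against the pattern until the user types "quit"
        while (!str.equals("quit")) {
            System.out.println("next input please: ");
            str = sc.nextLine();
            // matches() is true only when the whole input fits the pattern
            System.out.println("\n\nstring: \"" + str + "\"\nPattern: " + regex
                    + "\nmatches: " + str.matches(regex) + "\n");
        }
    }
}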

5 Practice Problems

• What are some strings that each pattern would match, and would not match? Test them out in live code.

1. k.tty
2. a*
3. b+*
4. ba{2,5}
5. go{1,}al!
6. a+b?c*

6 Special Symbols: grouping

grouping pattern    meaning
(pattern)           parentheses group things
a | b               matches pattern a, or pattern b, exactly

Parentheses are useful for grouping things to be repeated with our repetition symbols.
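To see how grouping changes what a repetition symbol applies to, the sketch below compares a grouped and an ungrouped pattern, and tries an alternation. The class name GroupingDemo, the helper fits, and the sample patterns are illustrative choices, not taken from the slides.

```java
class GroupingDemo {
    // Hypothetical helper: true when the entire input fits the pattern.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("(ha)+", "hahaha"));    // true: the group "ha" repeats
        System.out.println(fits("ha+", "hahaha"));      // false: only the a repeats
        System.out.println(fits("ha+", "haaa"));        // true
        System.out.println(fits("gr(a|e)y", "grey"));   // true: e is one of the choices
        System.out.println(fits("gr(a|e)y", "graey"));  // false: exactly one choice is taken
    }
}
```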

7 Practice Problems

• What are some strings that each pattern would match, and would not match? Test them out in live code.

1. ba(na)*
2. b(a|e|i|o|u)t
3. (un)?sure
4. d(o|i)*t
5. ((c|l|r|d)(o|a))+

• How would you represent the following regular expressions?

1. "tweedledee" or "tweedledum"
2. "I like you", "I like like you", "I like like like you", etc.
3. An accusation in the game "Clue" ("it was Col. Mustard in the Library with the Wrench!")

8 Special Symbols: "character classes"

"character class" pattern    meaning
[chars]                      any single char between []'s
[a-z]                        any single char from a-to-z.
[^a-z]                       any single char not from a-to-z.
[abc[def]]                   union of [abc], [def] (one char from either)
[aeiou&&[a-m]]               intersection of [aeiou], [a-m] (one char, that must be in both)

Many more character classes can be found at: http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
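The following sketch tries each kind of character class from the table on single characters. The class name CharClassDemo, the helper fits, and the particular classes tested (such as [a-f&&[d-z]]) are assumptions chosen for illustration.

```java
class CharClassDemo {
    // Hypothetical helper: true when the entire input fits the pattern.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("[abc]", "b"));          // true: b is listed
        System.out.println(fits("[abc]", "ab"));         // false: a class matches one char only
        System.out.println(fits("[^a-z]", "Q"));         // true: Q is not in a-z
        System.out.println(fits("[^a-z]", "q"));         // false
        System.out.println(fits("[abc[def]]", "e"));     // true: union of both sets
        System.out.println(fits("[a-f&&[d-z]]", "e"));   // true: e is in both ranges
        System.out.println(fits("[a-f&&[d-z]]", "b"));   // false: b is only in a-f
    }
}
```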

9 Practice Problems

• What are some strings that each pattern would match, and would not match? Test them out in live code.

1. b[aeiou]t
2. a[^aeiou]e
3. 5[a-z[A-Z]]8
4. [aeiou&&[a-m]]
5. [a-zA-Z]
6. [0-9]+

• How would you represent the following regular expressions?

1. a positive odd integer
2. 3-5 consonants
3. three letters that are not in your initials

10 Special Symbols: Boundaries

boundary representation    meaning
^                          beginning of line*
$                          end of input/line.†
\b                         word boundary (between chars: one is letter, one is not)
\B                         not a word boundary

* note: second usage of the ^ symbol! This one is always outside []'s.

† to use $ with a scanner's findInLine method, we need an embedded flag, (?m), in order for it to behave as we'd expect. (see next slide).

The (?m) flag instructs Java to consider the end of a line as the "end of input", allowing $ to match not just the end of the stream but the end of any line.
example: (?m)the\send$ instead of the\send$

11 Embedded Flag Expressions

Different modes can be used in regular expressions.
• Different languages indicate them for regular expressions in different ways
• Java: we can add embedded flag expressions at the beginning of the pattern:

flag    name                meaning
(?d)    unix lines          only \n counts as a line terminator (for ., ^, $)
(?i)    case insensitive    letters match regardless of case
(?x)    comments            allow embedded comments/whitespace in pattern
(?m)    multi-line          let ^, $ match at each line break, not just at the ends of the entire input
(?s)    dotall              let . match line terminators such as \n as well
(?u)    unicode case        make 'case-insensitive' unicode-consistent
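The effect of a few of these flags can be observed directly. The sketch below is only illustrative: the class name FlagDemo and the helper found are made up, and found uses Pattern and Matcher.find() (from java.util.regex) so that a pattern can be located anywhere inside the text rather than having to fit the whole string.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: true when the pattern occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "the end\nor is it?";
        System.out.println("hELlo".matches("(?i)hello"));  // true: case is ignored
        System.out.println(found("the end$", text));        // false: $ is not at the end of the input
        System.out.println(found("(?m)the end$", text));    // true: $ also matches before the line break
        System.out.println("a\nb".matches("a.b"));          // false: . does not match \n by default
        System.out.println("a\nb".matches("(?s)a.b"));      // true: dotall lets . match \n
    }
}
```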

(?i)hello matches hELlo
(?m)the end$ can be found within "the end\n…or is it?"

2-12 Practice Problems

• What are some strings that would contain each given pattern? Test them out in live code.

1. \ban\b
2. \Bable\b
3. \bpre\B
4. \Bati\B

• How would you represent the following regular expressions?

1. a string containing a word with prefix "sub"
2. a string containing an integer
3. a string containing a real number

13 Special Symbols: Pre-defined groups

pattern    representation    meaning
\d         [0-9]             any single digit char
\D         [^0-9]            any single non-digit char
\s         [ \t\n\f\r]       any whitespace char*
\S         [^ \t\n\f\r]      any non-whitespace char*
\w         [a-zA-Z0-9_]      any identifier char (any 'word' char)
\W         [^a-zA-Z0-9_]     any non-identifier char

* note: there is a space char in this. Other whitespace chars also, but their unicode representations were omitted here.

14 Practice Problems

• What are some strings that would contain each given pattern? Test them out in live code.

1. \d+
2. \d\d\d\s+\w+\sstreet
3. [1-9]\d+

• How would you represent the following regular expressions?

1. a phone number of the form (123)123-1234
2. an even number
3. a legal Java identifier
4. a Java identifier without uppercase letters

15 Special Symbols: everything else

representation          meaning
\★                      represents ★ instead of its special meaning †
any non-special char    matches itself

The \ is used to escape any special character, so that we can match the character itself.
• a* matches zero or more a's
• a\* matches an a followed by a star

\b "matches" the gap between characters, instead of a particular character.

\bhe\b would match within "if he is" → wouldn't match within "if she is" or "anthem".
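The escaping and word-boundary behavior described above can be checked with a short sketch. The class name BoundaryDemo and the helper found are hypothetical. Note that inside a Java String literal each backslash of the regular expression has to be written twice.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: true when the pattern occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // regex a\* : an a followed by a literal star
        System.out.println("a*".matches("a\\*"));           // true
        System.out.println("aaa".matches("a\\*"));          // false: the star is no longer a repetition

        // regex \bhe\b : "he" as a whole word
        System.out.println(found("\\bhe\\b", "if he is"));  // true
        System.out.println(found("\\bhe\\b", "if she is")); // false: no boundary between s and h
        System.out.println(found("\\bhe\\b", "anthem"));    // false: "he" sits inside a word
    }
}
```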

† here, ★ could be [, ], *, +, ?, {, }, and so on. It's a placeholder for the special symbols, and ★ would not show up in a regular expression itself.

16 Representing Regular Expressions in Java

• We use a String literal to represent a regular expression in Java.
• This means that " must be escaped: \"
• This also means the \ must also be escaped: \\ (represents \)

• Suggested conversion: write the regExp on paper, carefully represent each character correctly inside the String, one at a time:

regular expression     Java String representation    an example matching String
                                                     (without the surrounding quotes)
\(\d{3}\)              "\\(\\d{3}\\)"                (456)
I "hate" airquotes     "I \"hate\" airquotes"        I "hate" airquotes
\\d means digits       "\\\\d means digits"          \d means digits
abc\n123               "abc\\n123"                   abc\n123 (where \n is an actual newline character)
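The rows of the table can be verified in code. In this sketch the class name EscapeDemo and the constant names are made up; each constant is the Java String form of one regular expression from the table, and printing a constant shows the regular expression that the regex engine actually receives.

```java
class EscapeDemo {
    static final String PHONE = "\\(\\d{3}\\)";           // regex: \(\d{3}\)
    static final String QUOTES = "I \"hate\" airquotes";  // regex: I "hate" airquotes
    static final String BACKSLASH = "\\\\d means digits"; // regex: \\d means digits
    static final String NEWLINE = "abc\\n123";            // regex: abc\n123

    public static void main(String[] args) {
        System.out.println(PHONE);                                    // prints \(\d{3}\)
        System.out.println("(456)".matches(PHONE));                   // true
        System.out.println("I \"hate\" airquotes".matches(QUOTES));   // true
        System.out.println("\\d means digits".matches(BACKSLASH));    // true: input holds one backslash
        System.out.println("abc\n123".matches(NEWLINE));              // true: input holds a real newline
    }
}
```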

17 Extracting Strings via Scanner

// getting each match of the pattern from the standard input
Scanner sc = new Scanner (System.in);

// define the regular expression.
String regex = "kit{2,4}y";

// test with "here, kitty kitty kitty!"

// try to get the first match in the line.
String temp = sc.findInLine(regex);

// keep printing matches and looking for the next one.
// null is returned upon failure to match
while (temp != null) {
    System.out.println(temp);
    temp = sc.findInLine(regex);
}
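For experimenting without typing at the console, the same loop can be run against a Scanner built from a String. This is a sketch under that assumption: the class name FindInLineDemo and the helper findAll are hypothetical, and the helper simply collects the matches instead of printing them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class FindInLineDemo {
    // Hypothetical helper: collects every match of regex on the first line of text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Scanner sc = new Scanner(text);       // scan a String instead of System.in
        String temp = sc.findInLine(regex);
        while (temp != null) {                // null means no further match on this line
            matches.add(temp);
            temp = sc.findInLine(regex);
        }
        sc.close();
        return matches;
    }

    public static void main(String[] args) {
        // prints [kitty, kitty, kitty]
        System.out.println(findAll("kit{2,4}y", "here, kitty kitty kitty!"));
    }
}
```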

18 Using Regular Expressions in Java

• We could mostly use a few methods from the String class:

public boolean matches (String regex)
→ does this entire String match the regex parameter? (A match of only part of the String is not enough.)

public String replace (String target, String replacement)
→ in this String, replace all occurrences of target with replacement. No actual regex here, just a straight-up substring search. (replaceAll allows for finding a regex and replacing each match with another string.)

public String[] split (String regex) {..}
→ in this String, using regex as a delimiter, return all substrings that weren't part of the regex matches as an array of Strings.
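A short sketch of these String methods side by side may help. The class name StringRegexDemo, the helper names, and the sample string are made up for illustration.

```java
import java.util.Arrays;

class StringRegexDemo {
    // Hypothetical helper: replace every run of digits with a single '#'.
    static String digitsToHash(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // Hypothetical helper: use runs of digits as the delimiter.
    static String[] splitOnDigits(String s) {
        return s.split("\\d+");
    }

    public static void main(String[] args) {
        String s = "a1b22c333";
        System.out.println(s.matches("[a-c0-9]+"));            // true: the whole string fits
        System.out.println(s.matches("\\d+"));                 // false: digits are only part of it
        System.out.println(s.replace("2", "*"));               // a1b**c333 (plain substring search)
        System.out.println(digitsToHash(s));                   // a#b#c#
        System.out.println(Arrays.toString(splitOnDigits(s))); // [a, b, c]
    }
}
```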

19 Java Pattern class

The java.util.regex package contains a few classes; its Pattern class has some useful methods.

public static boolean matches (String regex, CharSequence† input) {..}
→ tells us if the entire input matches regex.

public String[] split (CharSequence input) {..}
→ returns a String array of the pieces of input "between" the matches of this pattern.

public static Pattern compile (String regex) {..}
→ stores the compiled version of a pattern from a regex String.
→ can significantly speed up Pattern usage within a loop
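As a rough illustration of these three Pattern methods, the sketch below compiles one pattern up front and reuses it for every line it splits. The class name PatternDemo, the constant COMMA, the helper fields, and the sample inputs are assumptions for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternDemo {
    // Compiled once and reused: a comma with optional whitespace around it.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Hypothetical helper: the pieces of line "between" the comma matches.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("\\d+", "2024"));       // true: the entire input is digits
        System.out.println(Pattern.matches("\\d+", "year 2024"));  // false: only part of it is digits

        String[] lines = {"a, b ,c", "1,2 , 3"};
        for (String line : lines) {
            System.out.println(Arrays.toString(fields(line)));      // [a, b, c] then [1, 2, 3]
        }
    }
}
```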

† CharSequence is an interface that sort of represents Strings. Pretend it says String.

2-20 Using the Pattern and Matcher classes

• Going from a String to calculations on regular expressions is costly, so we can save that "regex compilation" in a Pattern object:

Pattern pat = Pattern.compile(regExpString);

• We can then get a Matcher from a Pattern:

Matcher mat = pat.matcher(stringToLookThrough);

• Finally, we can check for a match:

if (mat.matches()) { …
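Putting the three steps together, here is a minimal sketch. Besides matches(), which checks the whole string, it also uses the Matcher method find(), which steps through successive matches inside a longer string, and group(), which returns the text of the current match. The class name MatcherDemo, the constant NUMBER, and the helper allNumbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    // Compile once, then create a Matcher per input string.
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Hypothetical helper: every run of digits found in text, in order.
    static List<String> allNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher mat = NUMBER.matcher(text);
        while (mat.find()) {           // advance to the next match, if any
            numbers.add(mat.group());  // text of the current match
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(NUMBER.matcher("42").matches());       // true
        System.out.println(NUMBER.matcher("room 42").matches());  // false: whole string must match
        System.out.println(allNumbers("3 cats, 12 dogs"));        // [3, 12]
    }
}
```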

2-21 Capture Groups

• After a match is performed, each parenthesized portion from a successful match can be individually grabbed:

Pattern pat = Pattern.compile("(\\d),(\\d),(\\d)");
Matcher mat = pat.matcher("3,4,5");
if (mat.matches()) {
    int first = Integer.parseInt(mat.group(1));
    int second = Integer.parseInt(mat.group(2));
    int third = Integer.parseInt(mat.group(3));
    // group 0 is always the whole match
    String wholeMatch = mat.group(0);
    System.out.println(wholeMatch + " -> " + (first*2) + "," + (second*2) + "," + (third*2));
} else {
    System.out.println("no match.");
}

2-22 Capture Groups, Back-references

• A regular expression can have a back-reference to a capture group. Just escape the group number:

Pattern pat = Pattern.compile("(\\d) is \\1");
// the String to search (not a regular expression itself)
String input = "9 is 9";
Matcher mat = pat.matcher(input);
if (mat.matches()) {
    int first = Integer.parseInt(mat.group(1));
    // group 0 is always the whole match
    String wholeMatch = mat.group(0);
    System.out.println(wholeMatch + ".first(only grp):" + first);
} else {
    System.out.println("no match.");
}

2-23 Capture Groups (summary)

• Note that the method group(int g) always returns a String.

• Capture groups let us access the individual parts of a match directly, without having to pick the matched text apart ourselves.

• Back-references let us require that different parts of one pattern match the same text, while each part can still be as expressive as any other regular expression. Pretty cool!
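One possible use of a back-reference, sketched under made-up names (BackReferenceDemo, DOUBLED, firstDoubledWord), is spotting a word that was accidentally typed twice in a row. The capture group grabs a word and \1 demands the same text again.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // regex \b(\w+) \1\b : a whole word, a space, then the same word again
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Hypothetical helper: the first doubled word in text, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher mat = DOUBLED.matcher(text);
        return mat.find() ? mat.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is is a typo"));  // is
        System.out.println(firstDoubledWord("this is fine"));       // null
        System.out.println(firstDoubledWord("is island"));          // null: \b rejects "is" inside "island"
    }
}
```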
