<<

Pattern Matching Pattern Matching and String search. Search for given string in a large text . Regular expression.

Natural and compact way to express multiple text patterns.

Quintessential programmer's tool.

Grep : fragile X syndrome is a common cause of mental retardation. Regular expressions Human genome contains triplet repeats of CGG or AGG, starting with Nondeterministic finite state automata GCG and ending with CTG.

Number of repeats is variable, and correlated with syndrome.

Use regular expression to specify pattern: GCG(CGG|AGG)*CTG.

Reference: Chapter 7.1, Introduction to Computer Science, R. Sedgewick and K. Wayne.

Princeton University • COS 226 • Algorithms and Data Structures • Spring 2004 • Kevin Wayne • http://www.Princeton.EDU/~cos226 2

More Pattern Matching Applications Regular Expressions: Basic Operations

Test if a string matches some pattern. Regular expression.

Process natural language. Compact notation to specify a set of . Scan for virus signatures. Core operations. Search for information using Google.

Access information in digital libraries. Operation RE No Java file containing certain string. every other Retrieve information from Lexis/Nexis. Concatenation aabaab aabaab string Search-and-replace in a word processors. babbaaa abba Filter text (SpamAssassin, NetNanny, Carnivore, AdAware). Wildcard .abba.. aabbaab abbabb Validate data-entry fields (dates, email, URL, credit card). aa every other Search for markers in human genome using PROSITE patterns. Union aa | baab baab string aa  Closure ab*a Parse text files. abbbba ababa Compile a Java program. aaaab every other a(a|b)aab Crawl and index the Web. abaab string Read in data stored in some file format. Grouping a  (ab)*a Automatically create Java documentation from Javadoc comments. ababa aa

3 4 Regular Expressions: Examples Using Regular Expressions

Regular expression examples. Additional operations typically added for convenience.

Notation is surprisingly expressive. Ex: [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*.

Regular Expression Yes No  b a* | (a*ba*ba*ba*)* bbb bb Operation Regular Expression Yes No multiple of three b’s abbbaababb baabbbaa a  Any single character ..oo..oo. spoonfood choochoo a | a.*a aba ab begins and ends with a abbaabba ba One or a(bc)+de abcbcde ade abba  Character classes [a-e]+ decade Upper45 .*abba.* bbabbabb abb Exactly N times [a-e]{6} decade ade contains the substring abba abbaabba bbaaba Negations [^aeiou]{6} rhythm decade

5 6

Regular Expressions in Java Regular Expressions in Java

Validity checking. Is text in the set described by the pattern? Broadly applicable programmer's tool.

Many other languages also support extended regular expressions.

public class Validator { Built into grep, , emacs, Perl, PHP, Python, JavaScript. public static void main(String[] args) { String pattern = args[0]; String text = args[1]; print all lines containing NEWLINE grep NEWLINE */*.java System.out.println(text.matches(pattern)); occurs in any file with a .java extension } } egrep '^[qwertyuiop]*[zxcvbnm]*$' dict.txt | egrep '...... '

% java Validator "0*10*10*10*" 010000011 true legal Java identifier PERL. Practical Extraction and Report Language. % java Validator "[$_A-Za-z][$_A-Za-z0-9]*" ident123 true replace all occurrences of from legal email address (simplified) perl -pi -e 's|from|to|g' input.txt with to in the file input.txt % java Validator "[a-z]+@([a-z]+\.)+(edu|com)" [email protected] true Social Security numbers perl -ne 'print if /^[A-Z][A-Za-z]*$/' dict.txt % java Validator "[0-9]{3}-[0-9]{2}-[0-9]{4}" 166-11-5655 true

7 8 Regular Expression Caveats Perl RE for Valid RFC822 Email Addresses

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: Writing a RE is like writing a program. \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\

](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: Need to learn syntax. (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ Can be easier to write than read. \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? "Sometimes you have a programming problem and :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| it seems like the best solution is to use regular \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ expressions; now you have two problems." ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

Reference: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html 9 10

Pattern Matching in Google Pattern Matching in TiVo

Google supports * for full word wildcard and | for union. TiVo WishList has very limited pattern matching.

Reference: page 76, Hughes DirectTV TiVo manual

11 12 Engineering Grep Deterministic Finite State Automata

Generalized regular expression print. DFA review.

First implemented in 1973 by for text-to-speech.

Quintessential programmer's tool.

Approach to develop grep algorithm.

Define class of abstract machines.

Write simulator for machine.

Write translator from REs to machines.

Example of essential paradigm in computer science. int pc = 0; while (!tape.isEmpty()) { Build intermediate abstractions. boolean bit = tape.read(); Pick the right ones! if (pc == 0) { if (bit) pc = 0; else pc = 1; } else if (pc == 1) { if (bit) pc = 1; else pc = 2; } Solve important practical problem. else if (pc == 2) { if (bit) pc = 2; else pc = 0; } } if (pc == 0) System.out.println("accepted"); else System.out.println("rejected");

13 14

Duality Nondeterministic Finite State Automata

RE. Concise way to describe a set of strings. NFA.

DFA. Machine to recognize whether a given string is in a given set. Finite state automata.

May have 0, 1, or more transitions for each input symbol. Kleene's theorem (1956): for any DFA, there exists a RE that May have -transitions. describes the same set of strings; for any RE, there exists a DFA that Accept if any sequence of transitions leads to accept state. recognizes the same set of strings.

0* | (0*10*10*10*)*

multiple of three 1's yes: 111, 00011, 101001011 Stephen Kleene no: 110, 00011011, 00110

Good news: to match RE build DFA and simulate DFA on input string. Bad news: the DFA can be exponentially large. Consequence: need more efficient abstract machine.

15 16 Simulating an NFA NFA Simulation

How to simulate an NFA? Maintain list of all possible states that NFA could be in after reading in the first i symbols.

17 18

NFA Representation NFA: Java Implementation

NFA representation. Maintain several graphs, one for each symbol in the alphabet, plus one for . public class NFA { private int START = 0; // start state private int ACCEPT = 1; // accept state private int N = 2; // number of states private String ALPHABET = "01"; // RE alphabet private int EPS = ALPHABET.length(); // symbols in alphabet private G[];

public NFA(String re) { G = new Graph[EPS + 1]; for (int i = 0; i <= EPS; i++) G[i]=new Graph(); build(0, 1, re); }

private void build(int from, int to, String re) { } public boolean simulate(Tape tape) { }

1-graph -graph 0-graph

19 20 NFA Simulation NFA Simulation: Java Implementation

How to simulate an NFA?

Maintain List of all possible states that NFA could be in after public boolean simulate(Tape tape) { states reachable from reading in the first i symbols. List pc = G[EPS].reachable(START); start by -transitions Use Graph adjacency and reachability ops to update. // simulate NFA using input from tape while (!tape.isEmpty()) { char c = tape.read(); int i = ALPHABET.indexOf(c); all possible states after reading in c next = neighbors states reachable List next = G[i].neighbors(pc); pc of pc in G[c] from next in G[] updated pc pc = G[EPS].reachable(next); follow -transitions }

while (!pc.isEmpty()) check if end in an if (pc.remove() == ACCEPT) return true; accept state return false; }

21 22

NFA Simulation Running Extended NFA

Input: Text with N characters, NFA with M transitions. Some extended NFAs.

Running time. O(M N) 0 Bottleneck = 1 graph reachability per input character. a Can be substantially faster in practice if few -transitions.

2 Note: Easy to extend graph search to handle multiple sources. ba* Implicit assumption: alphabet size is a small constant. 0

b 3 ab* | ba*

1 1

23 24 Converting from an RE to an NFA: Basic Transformations Converting from an RE to an NFA

Create NFA for ab* | a*b.

from R to 0 start a 0

a 2 from a*b from 0 0

c 2 c ba* b from 3 from from from ab* | a*b ab* ba* c R b* R S c * S S R | S R

1 1 to to 1 1 to to to to

union closure concatenation

25 26

NFA Construction: Java Implementation Grep Running Time

private void build(int from, int to, String re) { Input: Text with N characters, RE with M characters. if (re.length() == 0) G[EPSILON].insert(from, to);

else if (re.length() == 1) { single char Claim. The number of edges in the NFA is 2M. char c = re.charAt(0); for (int i = 0; i < EPSILON; i++) Single character: consumes 1 symbol, creates 1 edge. if (c == ALPHABET.charAt(i)||c == '.') G[i].insert(from, to); Wildcard character: consumes 1 symbol, creates 2 edges. } Concatenation: consumes 1 symbols, creates 0 edges. int or = re.indexOf('|'); union Union: consumes 1 symbol, creates 1 edges. else if (or > 0) { build(from, to, re.substring(0, or)); Closure: consumes one symbol, creates 2 edges. build(from, to, re.substring(or + 1)); union }

else if (re.charAt(1)=='*') { closure NFA simulation: O(MN) since NFA has 2M transitions. G[EPSILON].insert(from, N); NFA construction: Ours is O(M2) but not hard to O(M). build(N, N, re.substring(0, 1)); build(N++, to, re.substring(2)); } Surprising bottom line. Worst case cost for grep is the same as for else { concatenation elementary string match! build(from, N, re.substring(0, 1)); build(N++, to, re.substring(1)); } } closure

27 28 Industrial Strength Grep Implementation Converting from an RE to an NFA

To complete grep implementation. Transformations for parsing parentheses.

Parentheses.

Documentation.

Extend the alphabet.

Add character classes.

Add capturing capabilities. from from

Deal with meta characters. R Extend the closure operator. R from from from from Error checking and recovery.

Greedily match longest possible match. ( R ) R ( R* ) S S ( R ) S S

to to to to to to

grouping closure concatenation

29 30

Application: Harvester Application: Harvester

Harvesting : Print all occurrences of regexp from text file or URL. Harvesting info: Print all occurrences of regexp from text file or URL.

Pattern. Compiles RE to an NFA. Word puzzles. Matcher. Simulate the NFA. java Harvester ".*hh.*" dictionary.txt beachhead import java.util.regex.Pattern; highhanded withheld import java.util.regex.Matcher;

public class Harvester { public static void main(String[] args) { String regexp = args[0]; In in = new In(args[1]); Harvest email addresses for spam campaign. String text = in.readAll(); Pattern pattern = Pattern.compile(regexp); Matcher matcher = pattern.matcher(text); java Harvester "[a-z]+@([a-z]+\.)+(edu|com|net|tv)" http://www.princeton.edu while (matcher.find()) [email protected] System.out.println(matcher.group()); [email protected] simple email validator [email protected] } }

31 32 Application: Data File Parser Application: Web Crawler

Parsing input files: Internet movie database, NCBI genome file, . . . . Crawling the Web: Find all web page reachable from site s.

LOCUS AC146846 128142 bp DNA linear HTG 13-NOV-2003 DEFINITION Ornithorhynchus anatinus clone CLM1-393H9, Queue q = new Queue(); // queue of sites to crawl ACCESSION AC146846 HashSet visited = new HashSet(); // ST of visited websites VERSION AC146846.2 GI:38304214 KEYWORDS HTG; HTGS_PHASE2; HTGS_DRAFT. q.enqueue(s); // start crawl from site s SOURCE Ornithorhynchus anatinus (platypus) ORIGIN visited.add(s); 1 tgtatttcat ttgaccgtgc tgttttttcc cggtttttca gtacggtgtt agggagccac while (!q.isEmpty()) { 61 gtgattctgt ttgttttatg ctgccgaata gctgctcgat gaatctctgc atagacagct // a comment 121 gccgcaggga gaaatgacca gtttgtgatg acaaaatgta ggaaagctgt ttcttcataa String v =(String) q.dequeue(); ... System.out.println(v); 128101 ggaaatgcga cccccacgct aatgtacagc ttctttagat tg read in raw html // In in = new In(v); String input = in.readAll(); http://xxx.yyy.zzz String regexp = "[ ]*[0-9]+([actg ]*).*"; String regexp = "http://(\\w+\\.)*(\\w+)"; Pattern pattern = Pattern.compile(regexp); Pattern pattern = Pattern.compile(regexp); In in = new In(filename); Matcher matcher = pattern.matcher(input); String line; while (matcher.find()) { while ((line = in.readLine()) != null) { String w = matcher.group(); search using regular expression Matcher matcher = pattern.matcher(line); if (!visited.contains(w)) { if (matcher.find()) { just the RE part in parentheses visited.add(w); q.enqueue(w); if unvisited, mark as visited String s = matcher.group(1).replaceAll(" ", ""); and put on queue // do something with s } } replace this RE with this string } } } 33 34

Algorithmic Complexity Attacks Not-So-Regular Expressions

Warning: most everyday implementations ( grep, Java, Perl) do not Back-references. guarantee performance! \1 notation matches subexpression that was matched earlier.

Supported by most RE implementations.

java Validator "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 1.6 seconds java Validator "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 3.7 seconds java Harvester "^(.*)\1$" dictionary.txt java Validator "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 9.7 seconds java Validator "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 23.2 seconds beriberi java Validator "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 62.2 seconds couscous java Validator "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 161.6 seconds

Some non-regular languages. SpamAssassin regular expression. All strings of the form ww for some string w: couscous, beriberi.

java RE "[a-z]+@[a-z]+([a-z\.]+\.)+[a-z]+" spammer@x...... All bitstring with an equal number of 0s and 1s: 10, 01110100.

All Watson-Crick complemented palindromes: atttcggaaat. Takes exponential time. . . . Spammer could use a pathological return email addresses to DOS a server running SpamAssassin. Remark. Pattern matching with back-references is NP-hard.

35 36 Context Summary

Abstract machines, languages, and nondeterminism. Programmer.

Basis of the theory of computation. REs are a powerful pattern matching tool.

Intensively studied since the 1930s. Implement regular expressions with NFAs.

Compiler: a program that translates from one language to another. Theoretician.

grep:RE  NFA. RE is a compact description of a set of strings.

javac: Java language  Java byte code. NFA is an abstract machine equivalent in power to RE.

DFAs and REs have limitations.

Abstract Machine NFA Computer You. Practical application of core CS principles.

Pattern Word in CFL Word in CFL

Parser Check if legal RE Check if legal Java program

Compiler Output NFA Output machine executable

Simulator Find match Run program in hardware

37 38