Regular Expressions and Pattern Matching
james.wasmuth@ed.ac.uk
Regular Expression (regex): a separate language for constructing patterns. Used in most programming languages; very powerful in Perl.
Pattern Match: using regex to search data and look for a match.
Overview:
  how to create regular expressions
  how to use them to match and extract data
  biological context

So Why Regex?
Parse files of data and information:
  fasta
  embl / genbank format
  html (web-pages)
  user input to programs
Check format
Find illegal characters (validation)
Search for sequence motifs

Simple Patterns
Place the regex between a pair of forward slashes (/ /).
try:
#!/usr/bin/perl
while (
Can also match strings from files.
genomes_desc.txt contains a few lines of text with information about three genomes.
try:
#!/usr/bin/perl
open IN, "
Parses each line in turn. Looks for elegans anywhere in the line ($_).

Flexible matching
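As a baseline before any metacharacters, here is a minimal sketch of the literal match described above. It is not the course script: the three lines in @lines are invented stand-ins for the contents of genomes_desc.txt, and lines_with_elegans is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample lines standing in for genomes_desc.txt (assumption).
my @lines = (
    "Caenorhabditis elegans is a free-living nematode\n",
    "Homo sapiens is a primate\n",
    "Drosophila melanogaster is a fruit fly\n",
);

# Return the lines that contain the literal text 'elegans'.
sub lines_with_elegans {
    my @hits;
    for (@_) {                        # each line is placed in $_ in turn
        push @hits, $_ if /elegans/;  # a bare /.../ is matched against $_
    }
    return @hits;
}

print lines_with_elegans(@lines);
```

Only the first line is printed, because 'elegans' can sit anywhere in the line for the match to succeed.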
There are many characters with special meanings – metacharacters.
star (*) matches zero or more instances
  /ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
         => abc or abbbbbbbc or ac
plus (+) matches one or more instances
  /ab+c/ => 'a' followed by one or more 'b' followed by 'c'
         => abc or abbc or abbbbbbbbbbbbbbc, NOT ac
question mark (?) matches zero or one instance
  /ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'
         => abc or ac

More General Quantifiers
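The three one-character quantifiers *, + and ? can be compared side by side before moving on. A minimal sketch, with which_match as a hypothetical helper that reports which of the three slide patterns a string satisfies:

```perl
use strict;
use warnings;

# Report which of the patterns /ab*c/, /ab+c/ and /ab?c/ match a string.
sub which_match {
    my ($string) = @_;
    my @matched;
    for ($string) {                     # put the string in $_ for the bare matches
        push @matched, 'ab*c' if /ab*c/;   # zero or more b
        push @matched, 'ab+c' if /ab+c/;   # one or more b
        push @matched, 'ab?c' if /ab?c/;   # zero or one b
    }
    return @matched;
}

for my $s (qw(ac abc abbbc)) {
    printf "%-6s => %s\n", $s, join(' ', which_match($s));
}
```

'ac' fails only the + pattern, 'abbbc' fails only the ? pattern, and 'abc' satisfies all three.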
Match a character a specific number of times, or within a range of times.
{x} will match exactly x instances.
  /ab{3}c/ => abbbc
{x,y} will match between x and y instances.
  /a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x or more instances.
  /abc{3,}/ => abccc or abccccccccc or any longer run of c

More metacharacters
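To round off the quantifiers, the brace forms {x}, {x,y} and {x,} can be checked the same way. A minimal sketch; brace_matches is a hypothetical helper and the test strings are made up. Note that the patterns are not tied to the start or end of the string, so a longer string can still match on a part of itself.

```perl
use strict;
use warnings;

# Report which of the three brace-quantifier patterns match a string.
sub brace_matches {
    my ($string) = @_;
    my @matched;
    for ($string) {
        push @matched, 'ab{3}c'   if /ab{3}c/;    # exactly three b
        push @matched, 'a{2,4}bc' if /a{2,4}bc/;  # two to four a
        push @matched, 'abc{3,}'  if /abc{3,}/;   # three or more c
    }
    return @matched;
}

for my $s (qw(abbbc aaabc abcccc abc aaaaabc)) {
    my @m = brace_matches($s);
    printf "%-8s => %s\n", $s, @m ? join(' ', @m) : 'no match';
}
```

'aaaaabc' has five a's yet still matches /a{2,4}bc/, because the last four a's followed by bc satisfy the pattern.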
dot (.) matches any single character, including tab (\t) and space, but not newline (\n).
  /a.*c/ => 'a' followed by any number of any characters followed by 'c'

Escaping
But I want to use these symbols in my regex!?!
To use a * , + , ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.
  /C\. elegans/ => C. elegans only
  /C. elegans/ => Ca , Cb , C3 , C> , C. , etc...
The 'delimiter' of the regex, forward slash “/”, and the 'escape' character, backslash “\”, are also metacharacters. These need to be escaped if required in the regex.
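A minimal sketch of the difference an escape makes, using made-up strings in @names and @paths. Perl's built-in grep keeps the items for which the match against $_ succeeds.

```perl
use strict;
use warnings;

# Made-up test strings (assumption, not course data).
my @names = ('C. elegans', 'Cx elegans', 'C> elegans');
my @paths = ('data/embl', 'data-embl');

my @loose  = grep { /C. elegans/ }  @names;  # unescaped dot: any character after C
my @strict = grep { /C\. elegans/ } @names;  # escaped dot: a literal full stop only
my @slash  = grep { /data\/embl/ }  @paths;  # escaped delimiter: a literal /

print "unescaped dot matches: @loose\n";
print "escaped dot matches:   @strict\n";
print "escaped slash matches: @slash\n";
```

The unescaped pattern accepts all three names; the escaped one accepts only 'C. elegans'.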
Important when trying to match URLs and email addresses.
  /joe\.bloggs\@darwin\.co\.uk/
  /www\.envgen\.nox\.ac\.uk\/biolinux\.html/

Using metacharacters.
The file nemaglobins.embl contains 21 embl database files that contain a globin protein within their sequence.
try:
#!/usr/bin/perl
$count;
open IN, "
Can group patterns in parentheses “()”.
Useful when coupled with quantifiers:
  /elegans+/ => eleganssssssssssssss
  /(elegans)+/ => eleganselegans...elegans (the whole word repeated 1, 2, ... n times)
  /eleg(ans){4}/ => elegansansansans (ans repeated 4 times)

Alternatives
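Grouping changes what a quantifier applies to, which the following minimal sketch makes visible. The words in @words are invented for the comparison; {2} is used instead of + so that each pattern picks out exactly one word.

```perl
use strict;
use warnings;

# Invented test words (assumption).
my @words = ('eleganss', 'eleganselegans', 'elegansansansans', 'elegansans');

my @last_letter = grep { /elegans{2}/ }   @words;  # {2} applies to the final 's' only
my @whole_word  = grep { /(elegans){2}/ } @words;  # {2} applies to the whole group
my @suffix      = grep { /eleg(ans){4}/ } @words;  # 'ans' must occur four times in a row

print "elegans{2}   => @last_letter\n";
print "(elegans){2} => @whole_word\n";
print "eleg(ans){4} => @suffix\n";
```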
Want either this pattern or that pattern.
Two ways:
1.) the vertical bar '|': either the left side matches or the right side matches
  /(human|mouse|rat)/ => any string with human or mouse or rat.
  Combine with previous examples:
  /Fugu( |\t)+rubripes/ matches if Fugu and rubripes are separated by any mixture of spaces and tabs
2.) a character class is a list of characters within '[]'. It will match any single character within the class.
  /[wxyz1234\t]/ => any of the nine.
  a range can be specified with '-'
  /[w-z1-4\t]/ => as above
  to match a literal hyphen, put it first (or last) in the class, or escape it
  /[-a-zA-Z]/ => any letter or a hyphen
  negate a class with '^' as its first character
  /[^z]/ => any character except z
  /[^abc]/ => any character except a or b or c

Other Shortcuts
\d => any digit [0-9]
\w => any “word” character [A-Za-z0-9_]
\s => any white space [\t\n\r\f ]
\D => any character except a digit [^\d]
\W => any character except a “word” character [^\w]
\S => any character except a white space [^\s]
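The vertical bar, the character class and the shortcuts often express the same thing. A minimal sketch using made-up strings: three equivalent ways to allow spaces and tabs between Fugu and rubripes, plus a negated class that flags anything other than A, C, G or T.

```perl
use strict;
use warnings;

# Made-up test strings (assumption). The second has tab, space, tab between the words.
my @species = ("Fugu rubripes", "Fugu\t \trubripes", "Fugurubripes");

my @by_bar      = grep { /Fugu( |\t)+rubripes/ } @species;  # alternation
my @by_class    = grep { /Fugu[ \t]+rubripes/ }  @species;  # character class
my @by_shortcut = grep { /Fugu\s+rubripes/ }     @species;  # \s shortcut

# A negated class finds sequences with at least one illegal character.
my @illegal = grep { /[^ACGT]/ } qw(ACGT ACGU ACNT);

printf "bar: %d, class: %d, shortcut: %d\n",
    scalar @by_bar, scalar @by_class, scalar @by_shortcut;
print "illegal: @illegal\n";
```

All three Fugu patterns accept the first two strings and reject the third, which has no white space at all. \s is slightly broader than the other two because it also accepts newlines and other white space.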
Can use any of these in conjunction with quantifiers:
  /\s*/ => any amount of white space

Using alternatives to find a hydrophobic region...
try:
open IN, "< nippo_sigpept.fsa" or die;
while (
Could also have used /(V|I|L|M|F|W|C|A){8,}/

Binding Operator
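Still matching against $_, here is a minimal sketch of the hydrophobic-region search on in-memory data rather than nippo_sigpept.fsa. The three peptide strings are invented, and the character-class form [VILMFWCA]{8,} is an assumption about the alternative to the vertical-bar pattern above.

```perl
use strict;
use warnings;

# Invented peptide sequences (assumption, not from nippo_sigpept.fsa).
my @seqs = (
    'MKTLLVVAIFLLCAWSQ',   # run of 12 residues from the set
    'MSDEKRTGNQPSHE',      # no run longer than 1
    'GSVILMFWCDK',         # run of exactly 7, one short of the threshold
);

# Eight or more consecutive residues from V, I, L, M, F, W, C, A.
my @by_class = grep { /[VILMFWCA]{8,}/ }          @seqs;
my @by_bar   = grep { /(V|I|L|M|F|W|C|A){8,}/ }   @seqs;

print "class form: @by_class\n";
print "bar form:   @by_bar\n";
```

Both forms select the same sequence; the class is simply shorter to write when every alternative is a single character.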
Revisited? So far matching against $_
The binding operator “=~” matches the pattern on the right against the string on the left.
Usually add the m operator (optional).
$sumthing = 'Ascaris suum is a nematode';
if ($sumthing =~ m/suum.*nematode/) {
    print "this organism infects pigs!\n";
}

Anchors
/pattern/ will match anywhere in the string. Use anchors to hold the pattern to a point in the string.
caret “^” (shift 6) marks the beginning of the string, while dollar “$” marks the end of the string.
  /^elegans/ => elegans only at start of string. Not C. elegans.
  /Canis$/ => Canis only at end of string. Not Canis lupus.
  /^\s*$/ => a blank line.
“$” ignores a trailing newline character “\n”: it matches at the very end of the string or just before a final newline.
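A minimal sketch of the three anchored patterns above, run over a made-up list that includes an empty string, a white-space-only string, and a name with a trailing newline.

```perl
use strict;
use warnings;

# Made-up test strings (assumption).
my @names = ('elegans', 'C. elegans', 'Canis', 'Canis lupus', '', "  \t", "Canis\n");

my @at_start = grep { /^elegans/ } @names;  # must begin with elegans
my @at_end   = grep { /Canis$/ }   @names;  # must end with Canis (a final \n is allowed)
my @blank    = grep { /^\s*$/ }    @names;  # nothing but optional white space

printf "start: %d, end: %d, blank: %d\n",
    scalar @at_start, scalar @at_end, scalar @blank;
```

/Canis$/ accepts both 'Canis' and "Canis\n", which is why lines read from a file can be tested without removing the newline first.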
N.B. compare the use of “^” as an anchor with its use in a character class.

Anchors (2)
Word Boundary \b matches the start or end of a word.
/\bmus\b/ would match mus but not musculus
/la\b/ => Drosophila but not Plasmodium
/\btes/ => Comamonas testosteroni but not Pan troglodytes
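The three \b examples above can be checked with a minimal sketch over the same organism names:

```perl
use strict;
use warnings;

my @names = ('mus', 'musculus', 'Drosophila', 'Plasmodium',
             'Comamonas testosteroni', 'Pan troglodytes');

my @whole_word = grep { /\bmus\b/ } @names;  # mus as a complete word
my @word_end   = grep { /la\b/ }    @names;  # la at the end of a word
my @word_start = grep { /\btes/ }   @names;  # tes at the start of a word

print "\\bmus\\b => @whole_word\n";
print "la\\b    => @word_end\n";
print "\\btes   => @word_start\n";
```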
\b ignores the newline character: a word at the end of a line still ends at a word boundary.
Be careful with full stops: they're characters too!

Memory Variables
Able to extract sections of the pattern match and store them in variables. Anything matched inside parentheses “()” is written into a special variable. The first group is $1, the second $2, the fourth $4 and so on.
Extract from file: Organism: Homo sapiens ...
Extract from Perl script: while ($line=
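A minimal sketch of memory variables, assuming a line of the form 'Organism: Homo sapiens' as in the file extract. parse_organism is a hypothetical helper, not the course script.

```perl
use strict;
use warnings;

# Capture the two words after 'Organism:' and return them as a list.
sub parse_organism {
    my ($line) = @_;
    if ($line =~ /^Organism:\s+(\w+)\s+(\w+)/) {
        return ($1, $2);   # first () is stored in $1, second () in $2
    }
    return;                # no match: empty list
}

my ($genus, $species) = parse_organism('Organism: Homo sapiens');
print "genus: $genus, species: $species\n";
```

Only read $1 and $2 after a successful match, as the if test does here; after a failed match they still hold whatever the last successful match stored.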
Able to replace a pattern within a string with another string.
Use the “s” operator:
  s/abc/xyz/ => find abc and replace with xyz
By default only the first match is replaced. The 'g' modifier (global) finds and replaces all matches.
$line = 'abccdcbabc';
$line =~ s/abc/xyz/g;
print $line;    # produces xyzcdcbxyz
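A minimal sketch contrasting substitution with and without 'g' on a short made-up DNA string. Writing RNA by replacing every T with U is an assumption made for illustration; dna_to_rna is a hypothetical helper.

```perl
use strict;
use warnings;

# Replace every T with U and return the new string, leaving the input untouched.
sub dna_to_rna {
    my ($dna) = @_;
    my $rna = $dna;
    $rna =~ s/T/U/g;
    return $rna;
}

my $first_only = 'ATTGCT';
$first_only =~ s/T/U/;                   # no g: only the first T changes

my $all = 'ATTGCT';
my $changed = ($all =~ s/T/U/g);         # s/// returns the number of replacements

print "first only: $first_only\n";
print "global:     $all ($changed replaced)\n";
print "via helper: ", dna_to_rna('ATTGCT'), "\n";
```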
Run dna2rna.pl
Now look at dna2rna.pl

dna2rna.pl
#!/usr/bin/perl print "Enter DNA sequence\n"; while ($line =
Hints:
  The lines of interest are AC, OS, and SQ.
  Three regular expressions - one for each query.
  Use a series of if and elsif statements to test the regular expressions. Print when matched.
  Bonus point - remove the semi-colon from the accession id.
Shout if you need help.