<<

\W Any non-word . Warning \w != \S Search&Replace: substitution operator s/// A Quick Guide To =~ s/MOTIF/REPLACE/egimosx POSIX Character class Regular Expressions Example: correct typo for the word rabbit [[:class:]] class can be any of: $ex =~ s/rabit/rabbit/g; alnum alpha blank cntrl digit graph lower This is a Quick reference Guide for PERL regular expressions Here is the content of $ex: print punct space upper xdigit (also known as regexps or regexes). The ID sp:UBP5_RAT is similar to the rabbit AC tr:Q12345 These tools are used to describe text as “motifs” or “patterns” for matching, quoting, substituting or translitterating. Each Example: find and tag any TrEMBL AC Special characters (Perl, C, , Python...) define its $ex =~ s/tr:/trembl_ac=/g; \a alert (bell) Here is the content of $ex: own regular expressions although the syntax might differ from \b backspace details to extensive changes. In this guide we will concentrate The ID sp:UBP5_RAT is similar to the rabit AC trembl_ \e escape ac=Q12345 on the Perl regexp syntax, we assume that the reader has some \f feed preliminary knowledge of Perl programming. Options \n \r e evaluate REPLACE as an expression Perl uses a Traditional Nondeterministic Finite Automata \t horizontal tabulation (NFA) match engine. This means that it will compare each g global matches (matches all occurrences) element of the motif to the input string, keeping track of i case insensitive \nnn octal nnn the positions. The engine choose the first leftmost match m multiline, allow “^” and “$” to match with (\n) \xnn hexadecimal nn after greedy (i.e., longest possible match) quantifiers have o compile MOTIF only once \cX X matched. s single line, dot “.” matches new-line (\n) x ignore whitespace and allow comments “#” in MOTIF References Repetitions For more information on Perl regexps and other syntaxes Zero or one occurrence of the previous item. you can refer to O’Reilly’s book “Mastering Regular Quoting: quote and compile operator qr// ? Zero or more occurrences of the previous item. Expressions”. EXPR =~ qr/MOTIF/imosx * + One or more occurrences of the previous item. Examples: Example: reuse of a precompiled regexp Match at least times but no more than times the The following sentence will be used in all our examples: $myregexp = qr/\w{2,5}_\w{2,5}/; {n,m} n m previous item. The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345 if ($ex =~ m/$myregexp/) { print “SwissProtID\n”; } will match: {n,} Match n or more times The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345 {n} Match exactly n times Motif finding: match operator m// and as a result will print SwissProtID. {}? Non-greedy match (i.e., match the shortest string) EXPR =~ m/MOTIF/cgimosx EXPR =~ /MOTIF/cgimosx Options EXPR !~ m/MOTIF/cgimosx i case insensitive Anchors EXPR !~ /MOTIF/cgimosx m multiline, allow “^” and “$” to match with (\n) ^ or \A Match beginning of the string/line o compile MOTIF only once $ or \Z Match end of the string/line Examples: match any SwissProt ID for a rat protein s single line, dot “.” matches new-line (\n) \z End of string in any match mode if ($ex =~ m/\w{2,5}_RAT/) { print “Rat entry\n”; } ignore whitespace and allow comments in MOTIF \b Match word boundary will match x “#” \B Match non-word boundary The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345 and as a result print Rat entry. Character classes Capture & Grouping Options [...] Match any one character of a class [^...] Match any one character not in the (...) Group several characters together for later use or cg continue after a failure in /g . Match any character (except newline [^\n]) in non capture as a single unit g global matches (matches all occurrences) single-line mode (/s) | Match either subexpressions (equivalent to “OR”) i case insensitive \d Any digit. Equivalent to [0..9] or [[:digit:]] m multiline, allow “^” and “$” to match with (\n) \D Any non-digit. Example: match any database code in the list o compile MOTIF only once \s Any whitespace. [ \t\s\n\r\f\v] or [[:space:]] $ex =~ m/(sp:|tr:|rs:)/g; s single line, dot “.” matches new-line (\n) \S Any non-whitespace. will match: x ignore whitespace and allow comments “#” in MOTIF \w Any word character. [a-zA-Z0-9_] or [[:alnum:_]] The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345 \n Back reference. Match the same as the captured Example: reverse and complement a DNA sequence group number n that was previously matched in $DNA = AAATATTTCATCGTACAT; the same MOTIF. $revcom = reverse $DNA; $n Substring of captured group n $revcom =~ tr/ACGTacgt/TGCAtgca/;

Example: match several instances with back reference $ex =~ m/(the).+\1/i; will match: The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345 The transliteration will produce the following: print($DNA); AAATATTTCATCGTACAT print($revcom); Example: rename any tr:AC to trembl_AC= using a capture ATGTACGATGAAATATTT $ex =~ s/tr:([[:alnum:]]{6})/trembl_AC=$1/gi; will match: Options complement REPLACELIST The ID sp:UBP5_RAT is similar to the rabit AC trembl_ c AC=Q12345 d delete non-replaced characters s single replace of duplicated characters

UniCode matches Perl 5.8 supports 3.2. However it would be too long to describe all the properties in details here. For more information see “Mastering Regular Expressions”. Text-span modifiers \p{PROP} Matches a UniCode property Quote following until or end of \Q \E \P{PROP} Matches anything but a UniCode property motif (allow the use of scalars in regexp) \u Force next character to uppercase \l Force next character to lowecase This document was written and designed by Laurent Falquet \U Force all following characters to uppercase and Vassilios Ioannidis from the Swiss EMBnet node and being \L Force all following characters to lowercase distributed by P&PR Publications Committee of EMBnet. \E End a span started with \Q, \U or \L EMBnet - European Molecular Biology Network - is a bioinformatics support network of bioinformatics support Extended Regexp centers situated primarily in Europe. countries have a (?#...) Substring “...” is a comment national node which can provide training courses and other forms of help for users of bioinformatics software. (?=...) Positive lookahead. Match if exists next match (e.g., allow overlapping matches in global mode) You can find information about your national node from the Negative lookahead. Match if no next match (?!...) EMBnet site: (?<=...) Positive lookbehind. Fixed length only. Negative lookbehind. Fixed length only. (?

Transliteration is not - and does not use - a regular expression, but it is frequently associated with the regexp in PERL. Thus we decided to include it in this guide.