Regular Expressions • Unix Filters – Cat, Head, Tail, Tee, Wc Grep and Sed Intro – Cut, Paste – Find – Sort, Uniq – Comm, Diff, Cmp –Tr

Previously • Basic UNIX Commands Lecture 4 – Files: rm, cp, mv, ls, ln – Processes: ps, kill Regular Expressions • Unix Filters – cat, head, tail, tee, wc grep and sed intro – cut, paste – find – sort, uniq – comm, diff, cmp –tr Subtleties of commands Today • Executing commands with find • Regular Expressions • Specification of columns in cut – Allow you to search for text in files • Specification of columns in sort – grep command • Methods of input •Stream manipulation: – Standard in – sed – File name arguments • But first, a command we didn’t cover last time… – Special "-" filename • Options for uniq xargs find utility and xargs • Unix limits the size of arguments and environment • find . -type f -print | xargs wc -l that can be passed down to child – -type f for files • What happens when we have a list of 10,000 files – -print to print them out to send to a command? – xargs invokes wc 1 or more times • xargs solves this problem – Reads arguments as standard input • wc -l a b c d e f g – Sends them to commands that take file lists wc -l h i j k l m n o – May invoke program several times depending on size … of arguments • Compare to: find . -type f –exec wc -l {} \; cmd a1 a2 … a1 … a300 xargs cmd a100 a101 … cmd cmd a200 a201 … 1 What Is a Regular Expression? • A regular expression (regex) describes a set of possible input strings. Regular Expressions • Regular expressions descend from a fundamental concept in Computer Science called finite automata theory • Regular expressions are endemic to Unix – vi, ed, sed, and emacs – awk, tcl, perl and Python – grep, egrep, fgrep – compilers Regular Expressions regular expression c k s • The simplest regular expressions are a UNIX Tools rocks. string of literal characters to match. • The string matches the regular expression if match it contains the substring. UNIX Tools sucks. match UNIX Tools is okay. no match Regular Expressions Regular Expressions • A regular expression can match a string in • The . regular expression can be used to more than one place. match any character. regular expression regular expression a p p l e o . Scrapple from the apple. For me to poop on. match 1 match 2 match 1 match 2 2 Character Classes Negated Character Classes • Character classes [] can be used to match • Character classes can be negated with the any specific set of characters. [^] syntax. regular expression b [eor] a t regular expression b [^eo] a t beat a brat on a boat beat a brat on a boat match 1 match 2 match 3 match More About Character Classes Named Character Classes – [aeiou] will match any of the characters a, e, i, o, or u • Commonly used character classes can be – [kK]orn will match korn or Korn referred to by name (alpha, lower, upper, • Ranges can also be specified in character classes alnum, digit, punct, cntrl) – [1-9] is the same as [123456789] • Syntax [:name:] – [abcde] is equivalent to [a-e] – [a-zA-Z] [[:alpha:]] – You can also combine multiple ranges – [a-zA-Z0-9] [[:alnum:]] • [abcde123456789] is equivalent to [a-e1-9] – [45a-z] [45[:lower:]] – Note that the - character has a special meaning in a character class but only if it is used within a range, • Important for portability across languages [-123] would match the characters -, 1, 2, or 3 Anchors regular expression ^ b [eor] a t • Anchors are used to match at the beginning or end beat a brat on a boat of a line (or both). • ^ means beginning of the line match • $ means end of the line regular expression b [eor] a t $ beat a brat on a boat match ^word$ ^$ 3 regular expression y a * y Repetition I got mail, yaaaaaaaaaay! • The * is used to define zero or more occurrences of the single regular expression match preceding it. regular expression o a * o For me to poop on. match .* Repetition Ranges Subexpressions • Ranges can also be specified • If you want to group part of an expression so that – {}notation can specify a range of repetitions for the immediately preceding regex * or { } applies to more than just the previous character, use ( ) notation – {n} means exactly n occurrences – {n,} means at least n occurrences • Subexpresssions are treated like a single character – {n,m} means at least n occurrences but no – a* matches 0 or more occurrences of a more than m occurrences – abc* matches ab, abc, abcc, abccc, … • Example: – (abc)* matches abc, abcabc, abcabcabc, … – .{0,} same as .* – (abc){2,3} matches abcabc or abcabcabc – a{2,} same as aaa* grep Family Differences • grep comes from the ed (Unix text editor) search • grep - uses regular expressions for pattern command “global regular expression print” or matching g/re/p • fgrep - file grep, does not use regular expressions, • This was such a useful command that it was only matches fixed strings but can get search written as a standalone utility strings from a file • There are two other variants, egrep and fgrep that • egrep - extended grep, uses a more powerful set of comprise the grep family regular expressions but does not support • grep is the answer to the moments where you backreferencing, generally the fastest member of know you want the file that contains a specific the grep family phrase but you can’t remember its name • agrep – approximate grep; not standard 4 Protecting Regex Syntax Metacharacters • Regular expression concepts we have seen so • Since many of the special characters used in far are common to grep and egrep. regexs also have special meaning to the • grep and egrep have different syntax shell, it’s a good idea to get in the habit of – grep: BREs single quoting your regexs – egrep: EREs (enhanced features we will discuss) – This will protect any special characters from being operated on by the shell • Major syntax differences: – If you habitually do it, you won’t have to worry – grep: $ and $, \{ and \} about when it is necessary – egrep: ( and ), { and } Escaping Special Characters Egrep: Alternation • Even though we are single quoting our regexs so the • Regex also provides an alternation character for shell won’t interpret the special characters, some | matching one or another subexpression characters are special to grep (eg * and .) – (T|Fl)an will match ‘Tan’ or ‘Flan’ • To get literal characters, we escape the character with – ^(From|Subject): will match the From and a \ (backslash) Subject lines of a typical email message • It matches a beginning of line followed by either the characters • Suppose we want to search for the character sequence ‘From’ or ‘Subject’ followed by a ‘:’ 'a*b*' • Subexpressions are used to limit the scope of the – Unless we do something special, this will match zero or alternation more ‘a’s followed by zero or more ‘b’s, not what we want – At(ten|nine)tion then matches “Attention” or – ‘a\*b\*’ will fix this - now the asterisks are treated as “Atninetion”, not “Atten” or “ninetion” as would regular characters happen without the parenthesis - Atten|ninetion Egrep: Repetition Shorthands Egrep: Repetition Shorthands cont •The ‘?’ (question mark) specifies an optional character, the • The * (star) has already been seen to specify zero single character that immediately precedes it or more occurrences of the immediately preceding July? will match ‘Jul’ or ‘July’ character Equivalent to {0,1} • + (plus) means “one or more” Also equivalent to (Jul|July) abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’ but •The *, ?, and + are known as quantifiers because they will not match ‘abd’ specify the quantity of a match • Quantifiers can also be used with subexpressions Equivalent to {1,} – (a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not match ‘a’ or a blank line 5 Grep: Backreferences Practical Regex Examples • Sometimes it is handy to be able to refer to a • Variable names in C match that was made earlier in a regex – [a-zA-Z_][a-zA-Z_0-9]* • This is done using backreferences • Dollar amount with optional cents n is the backreference specifier, where n is a number – \ – \$[0-9]+(\.[0-9][0-9])? • Looks for nth subexpression • Time of day • For example, to find if the first word of a line is – (1[012]|[1-9]):[0-5][0-9] (am|pm) the same as the last: <h1> <H1> <h2> … – ^$[[:alpha:]]\{1,\}$ .* \1$ • HTML headers – <[hH][1-4]> –The $[[:alpha:]]\{1,\}$ matches 1 or more letters grep Family grep Examples • Syntax • grep 'men' GrepMe • grep 'fo*' GrepMe grep [-hilnv] [-e expression] [filename] • egrep 'fo+' GrepMe egrep [-hilnv] [-e expression] [-f filename] [expression] • egrep -n '[Tt]he' GrepMe [filename] • fgrep 'The' GrepMe • egrep 'NC+[0-9]*A?' GrepMe fgrep [-hilnxv] [-e string] [-f filename] [string] [filename] • fgrep -f expfile GrepMe – -h Do not display filenames • Find all lines with signed numbers – -i Ignore case $ egrep ’[-+][0-9]+\.?[0-9]*’ *.c – -l List only filenames containing matching lines bsearch. c: return -1; – -n Precede each matching line with its line number compile. c: strchr("+1-2*3", t-> op)[1] - ’0’, dst, convert. c: Print integers in a given base 2-16 (default 10) – -v Negate matches convert. c: sscanf( argv[ i+1], "% d", &base); – -x Match whole line only (fgrep only) strcmp. c: return -1; – -e expression Specify expression as option strcmp. c: return +1; – -f filename Take the regular expression (egrep) or • egrep has its limits: For example, it cannot match all lines that a list of strings (fgrep) from filename contain a number divisible by 7. Fun with the Dictionary Other Notes • /usr/dict/words contains about 25,000 words – egrep hh /usr/dict/words •Use /dev/null as an extra file name • beachhead • highhanded – Will print the name of the file that matched •withheld • grep test bigfile • withhold – This is a test.

Load more