Previously

• Basic Commands Lecture 4 – Files: , , , , – Processes: , Regular Expressions • Unix Filters – , , , , and intro – , , , ,

Subtleties of commands Today

• Executing commands with find • Regular Expressions • Specification of columns in cut – Allow you to search for text in files • Specification of columns in sort – grep command • Methods of input •Stream manipulation: – Standard in – sed – name arguments • But first, a command we didn’t cover last … – Special "-" filename • Options for uniq

find utility and xargs

• Unix limits the size of arguments and environment • find . - f -print | xargs wc -l that can be passed down to child – -type f for files • What happens when we have a list of 10,000 files – -print to print them out to send to a command? – xargs invokes wc 1 or times • xargs solves this problem – Reads arguments as standard input • wc -l a c d e f g – Sends them to commands that take file lists wc -l h i j k l m n o – May invoke program several times depending on size … of arguments • Compare to: find . -type f –exec wc -l {} \; cmd a1 a2 … a1 … a300 xargs cmd a100 a101 … cmd cmd a200 a201 …

1 What Is a ?

• A regular expression (regex) describes a set of possible input . Regular Expressions • Regular expressions descend from a fundamental concept in Computer Science called finite automata theory • Regular expressions are endemic to Unix – , , sed, and emacs – , tcl, and Python – grep, egrep, fgrep – compilers

Regular Expressions regular expression c k s

• The simplest regular expressions are a UNIX Tools rocks. string of literal characters to match. • The string matches the regular expression if match it contains the substring. UNIX Tools sucks.

match

UNIX Tools is okay.

no match

Regular Expressions Regular Expressions

• A regular expression can match a string in • The . regular expression can be used to more than one place. match any character.

regular expression regular expression a p p l e o .

Scrapple from the apple. For me to poop on.

match 1 match 2 match 1 match 2

2 Character Classes Negated Character Classes

• Character classes [] can be used to match • Character classes can be negated with the any specific set of characters. [^] syntax.

regular expression b [eor] a t regular expression b [^eo] a t

beat a brat on a boat beat a brat on a boat

match 1 match 2 match 3 match

More About Character Classes Named Character Classes – [aeiou] will match any of the characters a, e, i, o, or u • Commonly used character classes can be – [kK]orn will match korn or Korn referred to by name (alpha, lower, upper, • Ranges can also be specified in character classes alnum, digit, punct, cntrl) – [1-9] is the same as [123456789] • Syntax [:name:] – [abcde] is equivalent to [a-e] – [a-zA-Z] [[:alpha:]] – You can also combine multiple ranges – [a-zA-Z0-9] [[:alnum:]] • [abcde123456789] is equivalent to [a-e1-9] – [45a-z] [45[:lower:]] – Note that the - character has a special meaning in a character class but only if it is used within a range, • Important for portability across languages [-123] would match the characters -, 1, 2, or 3

Anchors regular expression ^ b [eor] a t

• Anchors are used to match the beginning or end beat a brat on a boat of a line (or both). • ^ means beginning of the line match • $ means end of the line regular expression b [eor] a t $

beat a brat on a boat

match

^word$ ^$

3 regular expression y a * y Repetition I got , yaaaaaaaaaay! • The * is used to define zero or more occurrences of the single regular expression match preceding it. regular expression o a * o

For me to poop on.

match

.*

Repetition Ranges Subexpressions • Ranges can also be specified • If you want to group part of an expression so that – {}notation can specify a range of repetitions for the immediately preceding regex * or { } applies to more than just the previous character, use ( ) notation – {n} means exactly n occurrences – {n,} means at least n occurrences • Subexpresssions are treated like a single character – {n,m} means at least n occurrences but no – a* matches 0 or more occurrences of a more than m occurrences – abc* matches ab, abc, abcc, abccc, … • Example: – (abc)* matches abc, abcabc, abcabcabc, … – .{0,} same as .* – (abc){2,3} matches abcabc or abcabcabc – a{2,} same as aaa*

grep Family Differences

• grep comes from the ed (Unix text editor) search • grep - uses regular expressions for pattern command “global regular expression print” or matching g/re/p • fgrep - file grep, does not use regular expressions, • This was such a useful command that it was only matches fixed strings but can get search written as a standalone utility strings from a file • There are two other variants, egrep and fgrep that • egrep - extended grep, uses a more powerful set of comprise the grep family regular expressions but does not support • grep is the answer to the moments where you backreferencing, generally the fastest member of know you want the file that contains a specific the grep family phrase but you can’t remember its name • – approximate grep; not standard

4 Protecting Regex Syntax Metacharacters • Regular expression concepts we have seen so • Since many of the special characters used in far are common to grep and egrep. regexs also have special meaning to the • grep and egrep have different syntax shell, it’s a good idea to get in the habit of – grep: BREs single quoting your regexs – egrep: EREs (enhanced features we will discuss) – This will protect any special characters from being operated on by the shell • Major syntax differences: – If you habitually do it, you won’t have to worry – grep: \( and \), \{ and \} about when it is necessary – egrep: ( and ), { and }

Escaping Special Characters Egrep: Alternation • Even though we are single quoting our regexs so the • Regex also provides an alternation character for shell won’t interpret the special characters, some | matching one or another subexpression characters are special to grep (eg * and .) – (T|Fl)an will match ‘Tan’ or ‘Flan’ • To get literal characters, we escape the character with – ^(From|Subject): will match the From and a \ (backslash) Subject lines of a typical email message • It matches a beginning of line followed by either the characters • Suppose we want to search for the character sequence ‘From’ or ‘Subject’ followed by a ‘:’ 'a*b*' • Subexpressions are used to limit the scope of the – Unless we do something special, this will match zero or alternation more ‘a’s followed by zero or more ‘b’s, not what we want – At(ten|nine)tion then matches “Attention” or – ‘a\*b\*’ will fix this - now the asterisks are treated as “Atninetion”, not “Atten” or “ninetion” as would regular characters happen without the parenthesis - Atten|ninetion

Egrep: Repetition Shorthands Egrep: Repetition Shorthands cont

•The ‘?’ (question mark) specifies an optional character, the • The * (star) has already been seen to specify zero single character that immediately precedes it or more occurrences of the immediately preceding ƒ July? will match ‘Jul’ or ‘July’ character ƒ Equivalent to {0,1} • + (plus) means “one or more” ƒ Also equivalent to (Jul|July) ƒ abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’ but •The *, ?, and + are known as quantifiers because they will not match ‘abd’ specify the quantity of a match • Quantifiers can also be used with subexpressions ƒ Equivalent to {1,} – (a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not match ‘a’ or a blank line

5 Grep: Backreferences Practical Regex Examples

• Sometimes it is handy to be able to refer to a • Variable names in C match that was made earlier in a regex – [a-zA-Z_][a-zA-Z_0-9]* • This is done using backreferences • Dollar amount with optional cents n is the backreference specifier, where n is a number – \ – \$[0-9]+(\.[0-9][0-9])? • Looks for nth subexpression • Time of day • For example, to find if the first word of a line is – (1[012]|[1-9]):[0-5][0-9] (am|pm) the same as the last:

… – ^\([[:alpha:]]\{1,\}\) .* \1$ • HTML headers – <[hH][1-4]> –The \([[:alpha:]]\{1,\}\) matches 1 or more letters

grep Family grep Examples • Syntax • grep 'men' GrepMe • grep 'fo*' GrepMe grep [-hilnv] [-e expression] [filename] • egrep 'fo+' GrepMe egrep [-hilnv] [-e expression] [-f filename] [expression] • egrep -n '[Tt]he' GrepMe [filename] • fgrep 'The' GrepMe • egrep 'NC+[0-9]*A?' GrepMe fgrep [-hilnxv] [-e string] [-f filename] [string] [filename] • fgrep -f expfile GrepMe – -h Do not display filenames • Find all lines with signed numbers – -i Ignore case $ egrep ’[-+][0-9]+\.?[0-9]*’ *.c – -l List only filenames containing matching lines bsearch. c: return -1; – -n Precede each matching line with its line number compile. c: strchr("+1-2*3", t-> op)[1] - ’0’, dst, convert. c: Print integers in a given base 2-16 (default 10) – - Negate matches convert. c: sscanf( argv[ i+1], "% d", &base); – -x Match whole line only (fgrep only) strcmp. c: return -1; – -e expression Specify expression as option strcmp. c: return +1; – -f filename Take the regular expression (egrep) or • egrep has its limits: For example, it cannot match all lines that a list of strings (fgrep) from filename contain a number divisible by 7.

Fun with the Dictionary Other Notes • /usr/dict/words contains about 25,000 words – egrep hh /usr/dict/words •Use /dev/null as an extra file name • beachhead • highhanded – Will print the name of the file that matched •withheld • grep bigfile • withhold – This is a test. • egrep as a simple spelling checker: Specify plausible • grep test /dev/null bigfile alternatives you know – bigfile:This is a test. egrep "n(ie|ei)ther" /usr/dict/words neither • Return code of grep is useful • How many words have 3 a’s one letter apart? – grep fred filename > /dev/null && rm filename – egrep a.a.a /usr/dict/words | wc –l •54 – egrep u.u.u /usr/dict/words • cumulus

6 This is one line of text input line o.*o regular expression Sed: Stream-oriented, Non-

x Ordinary characters match themselves Interactive, Text Editor ( and metacharacters excluded) fgrep, grep, egrep xyz Ordinary strings match themselves \m Matches literal character m • Look for patterns one line at a time, like grep ^ Start of line $ End of line . Any single character • Change lines of the file [xy^$x] Any of x, y, ^, $, or z [^xy^$z] Any one character other than x, y, ^, $, or z grep, egrep • Non-interactive text editor [a-z] Any single character in given range r* zero or more occurrences of regex r – Editing commands come in as script r1r2 Matches r1 followed by r2 \(r\) Tagged regular expression, matches r – There is an interactive editor ed accepts the same \n Set to what matched the nth tagged expression (n = 1-9) grep commands \{n,m\} Repetition r+ One or more occurrences of r • A Unix r? Zero or one occurrences of r r1|r2 Either r1 or r2 – Superset of previously mentioned tools (r1|r2)r3 Either r1r3 or r2r3 Quick (r1|r2)* Zero or more occurrences of r1|r2, e.g., r1, r1r1, egrep r2r1, r1r1r2r1,…) {n,m} Repetition Reference

Sed Architecture Conceptual overview

Input • All editing commands in a sed script are applied in order to each input line. scriptfile • If a command changes the input, subsequent Input line (Pattern Space) command address will be applied to the current (modified) line in the pattern space, not the original

Hold Space input line. • The original input file is unchanged (sed is a filter), and the results are sent to standard output (but can be redirected to a file). Output

Scripts Sed Flow of Control • A script is nothing more than a file of commands • sed then reads the next line in the input file and • Each command consists of up to two addresses restarts from the beginning of the script file and an action, where the address can be a regular • All commands in the script file are compared to, expression or line number. and potentially act on, all lines in the input file

script

address action command address action cmd 1 cmd 2 . . . cmd n address action address action address action print cmd output output script input only without -n

7