Searching Inside Files & Pattern Matching
Patricia J Riddle
Types of Searching
• Searching inside files – using regular expressions (this lecture)
• Searching for files – using recursive file find (next lecture)
Lecture 6 COMPSCI 215
REGULAR EXPRESSIONS
• Used by several different UNIX commands:
  . ed - a line-based text editor to create, display, modify, and otherwise manipulate text files
  . sed - a stream editor that reads input files, modifies the input in line with a list of commands, and writes to STDOUT
  . awk - a pattern-directed scanning and processing tool
  . grep - searches input files or STDIN for lines matching a given pattern and, by default, prints the matching lines
  . vi - a programmer's text editor to edit all kinds of plain text (it is especially useful for editing programs) (but it is not as good as emacs!)
• Provide a convenient and consistent way of specifying patterns to be matched
The awk Command
• The awk command, named after its authors, Aho, Weinberger, and Kernighan, was the most powerful utility for text manipulation and report writing before the advent of perl
  – The POSIX awk also appears as nawk (new awk) on most systems and as gawk (GNU awk) on Linux
  – It combines features of several filters, but also has two unique features:
    1. It can identify and manipulate individual fields in a line
    2. It is the only UNIX filter that can perform computations
  – Also, it accepts extended REs for pattern matching, has C-type programming constructs, and several built-in variables and functions
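The two features above (field handling and computation) can be sketched as follows. The sample data, the variable names, and the column meanings are made up for illustration; they are not from the lecture.

```shell
# Hypothetical sample data: name, quantity, unit price (whitespace-separated).
sample='apple 3 0.50
pear 2 1.25
plum 10 0.10'

# $1, $2, $3 are the fields of each input line. awk multiplies quantity by
# price for every line and keeps a running total, printed in the END block.
awk_out=$(printf '%s\n' "$sample" |
  awk '{ cost = $2 * $3; total += cost; printf "%s %.2f\n", $1, cost }
       END { printf "total %.2f\n", total }')
echo "$awk_out"
```

The program is quoted with single quotes so that the shell does not try to expand $1, $2, and $3 itself.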
WILDCARDS
A very limited form of regular expressions recognised by the shell when you use filename substitution:
  *      (asterisk, or “splat”) specifies zero or more characters to match
  ?      (question mark) specifies any single character
  […]    specifies any character listed within the brackets
  [abc]  (set) matches any one of the characters listed, i.e. a, b, or c
  [x-z]  (range) matches any one character in the range x-z
• Be advised that the asterisk and the question mark are treated differently by RE-based programs such as grep, sed, and awk than by the shell!
Shortcomings of Wildcards
• Wildcards are great for specifying files in a directory, but not really powerful enough when searching inside files (because they are evaluated by the shell before the command is run)
• What we need is something to search inside files:
  . egrep -n PATTERN [FILES…]
    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:(on a July estimate) did not return, and it was but a small consolation that
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July
REGULAR EXPRESSIONS
• A pattern that describes a set of strings
• A tiny, highly specialised programming language
  – Sometimes a part of other programming languages such as Python, Perl or Java
• Theoretical underpinnings - finite state automata
• A regular expression (RE) has an elaborate character set overshadowing the shell’s wildcards
  – REs take care of usual query requirements
Regular Expressions Vs. Wildcards
• Alert! Regular expressions are NOT wildcards
• Although they look similar, the meaning of wildcards is different from that of REs
  – Wildcards are expanded by the shell
  – REs are interpreted by some other program: often by grep or egrep (fgrep searches for fixed strings only), but also by awk, sed (stream editor) and perl (practical extraction and report language)
  – REs are also available in Java through the java.util.regex package
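A small sketch of the difference, using the same characters ab* once as a wildcard and once as a RE. The file names and input lines are invented for the example, and grep -E is used as the modern spelling of egrep.

```shell
# Scratch directory with three hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/ab" "$workdir/abc" "$workdir/b.txt"

# As a wildcard, ab* means "ab followed by any characters"; the shell expands
# it to matching file names before echo ever runs.
glob_out=$(cd "$workdir" && echo ab*)
echo "$glob_out"

# As a RE, ab* means "a followed by zero or more b"; grep applies it to lines
# of input. The quotes stop the shell from expanding it, and -x asks for the
# whole line to match.
re_out=$(printf 'a\nabbb\nacb\nxyz\n' | grep -E -x 'ab*')
echo "$re_out"

rm -rf "$workdir"
```

Note that the file abc matches the wildcard, while the line acb does not match the RE.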
Fixed Strings
Regular expressions:
• Most common case: most characters in a RE match themselves
• Exceptions are called meta-characters: . * ? + [ ] { } | \ ( ) ^ $
• To match a meta-character, escape it with a \
    $ egrep 'p\.j\.delmas' cs_staff.lst
  if you are looking for the name p.j.delmas
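To see why the escape matters, the following sketch counts matches with and without it. The three input lines are hypothetical stand-ins for a staff list, and grep -E is used as the modern spelling of egrep.

```shell
# Hypothetical staff list; only the first line holds the literal name p.j.delmas.
staff='p.j.delmas
pXjYdelmas
q.r.smith'

# Unescaped, each . matches any character, so the second line matches too.
loose=$(printf '%s\n' "$staff" | grep -E -c 'p.j.delmas')

# Escaped, each \. matches only a literal full stop.
strict=$(printf '%s\n' "$staff" | grep -E -c 'p\.j\.delmas')

echo "unescaped: $loose, escaped: $strict"
```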
A Single Character
• If not a meta-character, it matches itself
• Bracketed structure […]
  . Matches a single character in the set
  . Example: the pattern [ra] matches either an r or an a
  . Can contain a range, both for letters and for digits
  . [a-zA-Z0-9] matches a single alphanumeric character
  . Can be complemented (negated) by putting a caret (^) as the first character (like the wildcard !); this is the first use of ^
  . [^a-zA-Z] matches a single non-alphabetic character
• Full stop, or period (.) matches any single character
Repeating Characters
A regular expression may be followed by one of several repetition operators:
  ?        The preceding item will be matched zero or one times
  *        The preceding item will be matched zero or more times
  +        The preceding item will be matched one or more times
  \{n\}    The preceding item is matched exactly n times
  \{n,\}   The preceding item is matched n or more times
  \{n,m\}  The preceding item is matched at least n times, but no more than m times
The backslashed braces are the basic RE spelling used by grep; in extended REs (egrep) they are written {n}, {n,} and {n,m}.
Repeating Characters
• * (asterisk) refers to the immediately preceding character
  – Its interpretation differs totally from the * used by wildcards
  – It indicates that the previous character can occur many times, or not at all
      g*  matches a null string and also: g gg ggg gggg …
      g+  matches g gg ggg gggg …
      .*  matches a null string or any number of characters
  – The * has significance in a regular expression only if it is preceded by a character
    • If it is the first character in a basic regular expression, then it matches itself
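The repetition operators can be compared side by side on one small input. This is a sketch only: the word list is invented, grep -E stands in for egrep, and -x restricts each match to the whole line so that the counts are easy to read.

```shell
# Hypothetical input: the letter u appears 0, 1, or 2 times before the r.
lines='color
colour
colouur
clr'

# ? : zero or one u
q_out=$(printf '%s\n' "$lines" | grep -E -x 'colou?r')
# + : one or more u
plus_out=$(printf '%s\n' "$lines" | grep -E -x 'colou+r')
# * : zero or more u
star_out=$(printf '%s\n' "$lines" | grep -E -x 'colou*r')
# {2} : exactly two u (extended RE spelling of \{2\})
brace_out=$(printf '%s\n' "$lines" | grep -E -x 'colou{2}r')

printf '?: %s\n+: %s\n*: %s\n{2}: %s\n' \
  "$(echo $q_out)" "$(echo $plus_out)" "$(echo $star_out)" "$(echo $brace_out)"
```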
Example: the grep Command
• The grep command allows you to search one or more files for particular character patterns:
    grep pattern file(s)
  – Every line of each file that contains pattern is displayed at the terminal
  – If more than one file is specified to grep, each line is also immediately preceded by the name of the file (in order to identify the latter)
  – If the pattern does not exist in the specified file(s), the grep command simply displays nothing
  – It is generally a good idea to enclose the grep pattern inside a pair of single quotes to “protect” it from the shell
      $ grep * stars
    the shell sees the asterisk * and automatically substitutes the names of all the files in your current directory
      $ grep '*' stars
    the quotes remove its special meaning from the shell, so that the two arguments, * and stars, are passed to grep
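The effect of the quotes can be checked in a scratch directory. In this sketch the file stars and its two lines are invented; echo is placed in front of the unquoted command so that the arguments the shell would pass to grep are displayed rather than executed.

```shell
# Scratch directory containing a single hypothetical file called stars.
workdir=$(mktemp -d)
printf 'one * here\nno star here\n' > "$workdir/stars"

# Quoted: grep receives a literal * as its pattern and finds the first line.
quoted=$(cd "$workdir" && grep '*' stars)
echo "$quoted"

# Unquoted: the shell replaces * with the file names in the directory first.
unquoted_args=$(cd "$workdir" && echo grep * stars)
echo "$unquoted_args"

rm -rf "$workdir"
```

With only one file in the directory, the unquoted form turns into a search for the pattern stars in the file stars, which is not what was intended.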
Specifying Pattern Locations
Anchoring a pattern is necessary when it can occur in more than one place in a line
  ^ (caret) matches an empty string at the beginning of a line
    . ^pat matches the pattern pat at the beginning of a line
  $ (dollar) matches an empty string at the end of a line
    . pat$ matches the pattern pat at the end of a line
Without an anchor:
    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:5816 (on a July estimate) did not return, and
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July
    inter4.txt:480: 2 July
With the ^ anchor:
    ggim001$ egrep -n ^Ju.y *
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July
This is the second use of ^ (the first was negating a bracketed set).
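A minimal sketch of both anchors on the same input. The three lines are invented for the example, and grep -E stands in for egrep.

```shell
# Hypothetical input: July appears at the start, at the end, and in the middle.
text='July at start
the month of July
mid July here'

# ^ anchors the pattern to the beginning of the line.
at_start=$(printf '%s\n' "$text" | grep -E -n '^Ju.y')
# $ anchors the pattern to the end of the line.
at_end=$(printf '%s\n' "$text" | grep -E -n 'Ju.y$')

echo "$at_start"
echo "$at_end"
```

The patterns are quoted here because both ^ and $ can also be meaningful to the shell.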
Regular Expression Character Subset
  *        Matches zero or more occurrences of the previous character
  .        Matches a single character (like the ? wildcard)
  [prq]    Matches a single character p, r, or q
  [c1-c2]  Matches a single character in the range between c1 and c2
  [^pqr]   Matches a single character which is not p, q, or r
  ^pat     Matches pattern pat at beginning of line
  pat$     Matches pattern pat at end of line
• Some of these symbols are also meaningful to the shell, so REs should be quoted
  – The expressions are interpreted by the command, and quoting ensures that the shell is not able to interfere
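The set, range, negation, and dot entries of the table can be tried on a handful of three-character strings. The input below is made up for illustration; -x makes grep match whole lines only.

```shell
# Hypothetical input: the middle character varies.
words='cat
cot
cut
c9t
c-t'

# [ao]   : one character from the set
set_out=$(printf '%s\n' "$words" | grep -x 'c[ao]t')
# [0-9]  : one character from the range
range_out=$(printf '%s\n' "$words" | grep -x 'c[0-9]t')
# [^a-z] : one character that is NOT a lowercase letter
neg_out=$(printf '%s\n' "$words" | grep -x 'c[^a-z]t')
# .      : any single character (count of matching lines)
dot_count=$(printf '%s\n' "$words" | grep -c -x 'c.t')

echo "set: $(echo $set_out)"
echo "range: $range_out"
echo "negated: $(echo $neg_out)"
echo "dot: $dot_count lines"
```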
Large Regular Expressions
• Larger REs can be built from smaller REs in two ways:
  – Concatenation: the resulting RE matches any string formed by concatenating two substrings that respectively match the concatenated sub-expressions
    • Example: “[a-z][0-9]*” matches any string that begins with a lowercase letter followed by zero or more digits
  – Alternation: two REs may be joined by the infix operator |; the resulting RE matches any string matching either sub-expression
    • Example: [a-z]|[0-9]
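Both example REs from this slide can be run against a few test strings. The strings are invented, grep -E stands in for egrep, and -x is added so that a line is reported only when the whole line matches.

```shell
# Hypothetical identifiers to test against.
ids='a42
b
7
42
x-1'

# Concatenation: one lowercase letter, then zero or more digits.
concat_out=$(printf '%s\n' "$ids" | grep -E -x '[a-z][0-9]*')
# Alternation: either one lowercase letter or one digit.
alt_out=$(printf '%s\n' "$ids" | grep -E -x '[a-z]|[0-9]')

echo "concatenation: $(echo $concat_out)"
echo "alternation: $(echo $alt_out)"
```

Without -x, grep would report any line that merely contains a matching substring, so almost every line here would be printed.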
A Few Examples
The files chap1 and chap2 contain the lines text1 and text2, respectively.
    $ ls cha*
    chap1 chap2
    $ cat [a-z]hap[0-9]
    text1
    text2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-Z]'
    TEXT1
    TEXT2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-QU-Z]'
    WE]W1
    WE]W2
    $ egrep -n '[a-z]' chap*
    chap1:1:text1
    chap2:1:text2
Saving Matched Characters
• The characters matched within a regular expression are captured by enclosing the characters inside \(…\)
• The captured characters are stored in “registers” 1, …, 9 and retrieved as \1, …, \9, respectively
  – Successive occurrences of the \(…\) construct get assigned to successive registers: with ^\(...\)\(...\) the first three characters on the line are stored into register 1, and the next three characters into register 2
  – The RE ^\(.\) matches the first character in the line and stores it in register 1
  – The RE ^\(.\).*\1$ matches all lines in which the first character (^\(.\)) and the last character (\1$) on the line are the same
    • Here, the RE .* matches all the characters in-between
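The register mechanism can be tried with grep, which accepts the \(…\) and \1 notation in its basic REs. The input lines below are invented; the second pattern is an extra illustration that is not on the slide.

```shell
# Hypothetical input lines.
lines='level
radar!
abcabc
xyz'

# First and last character the same: \(.\) stores the first character in
# register 1, and \1$ requires the same character at the end of the line.
same_ends=$(printf '%s\n' "$lines" | grep '^\(.\).*\1$')

# First three characters repeated immediately: \(...\) stores three characters
# in register 1, and \1$ requires the same three to finish the line.
repeated=$(printf '%s\n' "$lines" | grep '^\(...\)\1$')

echo "same first and last character: $same_ends"
echo "three characters repeated: $repeated"
```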
An Example
• List all the ordinary files in your directory last modified in July in increasing order of file size (assuming file names do not contain “ Jul ”):
    ls -l | grep '^-.* Jul .*' | sort -n -k 5
• The first command ls -l lists directory contents
  – the -l option displays the long format: file mode, number of links, owner name, group name, number of bytes in the file, abbreviated month, day-of-month, hour:minute the file was last modified, and file name
    ggim001$ ls -l
    total 160
    drwxr-xr-x 13 ggim001 ggim001 442 Jul 9 11:35 Exercises
    -rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
    drwxr-xr-x 17 ggim001 ggim001 578 Jun 27 15:01 bak-dir
    -rwxr-xr-x 1 ggim001 ggim001 294 Jun 27 14:57 bak.bash
    -rwxr-xr-x 1 ggim001 ggim001 425 Jun 26 18:26 calc.bash
    …
An Example
• The second command grep '^-.* Jul .*' searches its input (i.e. the standard output of the first command ls -l) for a particular character pattern and outputs every line that contains the pattern
• For the desired ordinary files, the first character in the line is “-”
• Every desired line has to contain the month abbreviation surrounded by spaces: “ Jul ”
An Example
    ggim001$ ls -l | grep '^-.* Jul .*'
    -rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
    -rw-r--r-- 1 ggim001 ggim001 372 Jul 4 15:39 right.txt
    -rwxr-xr-x 1 ggim001 ggim001 467 Jul 5 13:31 selfd.bash
• The third command sort reads the input lines and sorts them into ascending order
  – The -n option to sort specifies that the sort key is to be considered a number, and the data is to be sorted arithmetically
  – The -k 5 option says to skip the first four fields on each line and start the sort key at the fifth field (here, the file size)
  – Fields are delimited by space or tab characters by default; to use a different delimiter, the -t option should be used (why t??)
An Example
    ggim001$ ls -l | grep '^-.* Jul .*' | sort -n -k 5
    -rw-r--r-- 1 ggim001 ggim001 372 Jul 4 15:39 right.txt
    -rwxr-xr-x 1 ggim001 ggim001 467 Jul 5 13:31 selfd.bash
    -rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
• One more example:
    ggim001$ ls -l | grep '^-.* Jul .*' | sort -n -k 5 | cut -c35-
    372 Jul 4 15:39 right.txt
    467 Jul 5 13:31 selfd.bash
    526 Jul 9 12:46 Test.txt
• The command cut -cchars file cuts out various fields of data from a data file or the standard input (if file is not specified)
    -c1,5-10  to extract characters 1 and 5 through 10
    -c35-     to extract characters 35 through the end of the line
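Because real ls -l output differs from directory to directory, the sketch below replays the pipeline on a fixed copy of the listing lines shown in these slides. The copy has single spaces between fields, so character positions differ from real ls output; for that reason the last step uses cut with -d and -f (fields) rather than the slide's -c35-, which is a substitution made for this example.

```shell
# Fixed stand-in for the output of ls -l (single spaces between fields).
listing='-rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
-rw-r--r-- 1 ggim001 ggim001 372 Jul 4 15:39 right.txt
-rwxr-xr-x 1 ggim001 ggim001 467 Jul 5 13:31 selfd.bash
drwxr-xr-x 13 ggim001 ggim001 442 Jul 9 11:35 Exercises
-rwxr-xr-x 1 ggim001 ggim001 294 Jun 27 14:57 bak.bash'

# Keep ordinary files (leading -) dated Jul, then sort numerically on field 5.
sorted=$(printf '%s\n' "$listing" | grep '^-.* Jul .*' | sort -n -k 5)

# Keep only the size (field 5) and the name (field 9), using a space delimiter.
sizes_and_names=$(printf '%s\n' "$sorted" | cut -d ' ' -f 5,9)
echo "$sizes_and_names"
```

The directory Exercises is dropped by the ^- anchor, and bak.bash is dropped because its line does not contain “ Jul ”.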
The grep Options
• The -v option allows you to find the lines that do not contain a specified pattern
    $ grep -v 'UNIX' intro.tx
  prints all lines that do not contain UNIX
• The -l option lists just the names of the files that contain the specified pattern, not the matching lines from the files
    $ grep -l 'UNIX' * | wc -l
  counts the number of files in the current directory that contain the specified pattern
  – What are you counting if you use grep without the -l option?
• The -n option: each line from the file that matches the specified pattern is preceded by its relative line number in the file
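The three options can be compared on two small files. The file names and contents below are invented for this sketch, and the last two commands put the counts with and without -l side by side.

```shell
# Scratch directory with two hypothetical files.
workdir=$(mktemp -d)
printf 'UNIX is old\nLinux is newer\nUNIX again\n' > "$workdir/a.txt"
printf 'no match here\n' > "$workdir/b.txt"

# -v: lines that do NOT contain the pattern.
inverted=$(grep -v 'UNIX' "$workdir/a.txt")
# -n: matching lines, each preceded by its line number.
numbered=$(grep -n 'UNIX' "$workdir/a.txt")
# -l piped to wc -l: one output line per matching file.
with_l=$(cd "$workdir" && grep -l 'UNIX' * | wc -l | tr -d ' ')
# Without -l, wc -l sees one output line per matching line instead.
without_l=$(cd "$workdir" && grep 'UNIX' * | wc -l | tr -d ' ')

echo "$inverted"
echo "$numbered"
echo "with -l: $with_l, without -l: $without_l"

rm -rf "$workdir"
```

The tr -d ' ' step only strips the padding that some versions of wc put in front of the number.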
Interpreting Quote Characters
• The shell recognises four different types of quote characters:
  – The single quote character '
  – The double quote character "
  – The backslash character \
  – The back quote character ` (magic quote)
• The single, double, and back quotes must occur in pairs, whereas the backslash character is unary in nature
• When the shell encounters the first ', it ignores any special characters until it finds the closing '
    $ echo '<>|;(){}>>`&'
    <>|;(){}>>`&
  – Even the Enter key is ignored by the shell if it is enclosed in quotes
Interpreting Quote Characters
• Double quotes work similarly to single quotes, except that they are not as restrictive
  – The single quotes tell the shell to ignore all enclosed characters
  – The double quotes say to ignore most of the enclosed characters; e.g. the dollar signs, back quotes, and backslashes are not ignored
    ggim001$ x=*.bash
    ggim001$ echo $x
    bak.bash calc.bash chegg.bash example.bash …
    ggim001$ echo '$x'
    $x
    ggim001$ echo "$x"
    *.bash
    ggim001$
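The session above depends on the files in the lecturer's directory. A self-contained version can be sketched with a scratch directory holding just two files; the directory and the choice of two files are assumptions made for this example.

```shell
# Scratch directory with two hypothetical .bash files.
workdir=$(mktemp -d)
touch "$workdir/bak.bash" "$workdir/calc.bash"

quote_out=$(
  cd "$workdir" || exit 1
  x=*.bash      # no expansion happens in the assignment; x holds *.bash
  echo $x       # unquoted: $x is substituted, then * is expanded to file names
  echo '$x'     # single quotes: nothing is interpreted
  echo "$x"     # double quotes: $x is substituted, but * is not expanded
)
echo "$quote_out"

rm -rf "$workdir"
```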
Interpreting Quote Characters
• The backslash quotes a single character immediately following it
  – One exception is when the backslash is used as the very last character on the line; then the shell treats it as a line continuation (it is most often used for typing long commands over multiple lines)
    ggim001$ echo one-two\
    > -three
    one-two-three
    ggim001$
  – You can use the backslash inside double quotes to remove the meaning of characters that otherwise would be interpreted inside these quotes (i.e. other backslashes, dollar signs, back quotes, newlines, and other double quotes)
• The back quote tells the shell to execute the enclosed command
    $ echo The working directory is `pwd`
    The working directory is /afs/ec.auckland.ac.nz/users/p/r/prid013/unixhome
    $
Chocolate Egg Question
• What does this command do?
    grep " plant[a-z]* [ .]" *.tx