
Searching Inside Files & Pattern Matching


Patricia J Riddle

Types of Searching

• Searching inside files – using regular expressions (this lecture)

• Searching for files – using recursive file find (next lecture)

Lecture 6 COMPSCI 215

REGULAR EXPRESSIONS

• Used by several different commands:
    . ed - a line-based text editor to create, display, modify, and otherwise manipulate text files
    . sed - a stream editor that reads input files, modifies the input in line with a list of commands, and writes to STDOUT
    . awk - a pattern-directed scanning and processing tool
    . grep - searches input files or STDIN for lines matching a given pattern and, by default, prints the matching lines
    . vim - a programmer's text editor to edit all kinds of plain text (it is especially useful for editing programs) (but it is not as good as emacs!)
• Provide a convenient and consistent way of specifying patterns to be matched

The awk Command

• The awk command, named after its authors Aho, Weinberger, and Kernighan, was the most powerful utility for text manipulation and report writing before the advent of perl
    – The POSIX awk also appears as nawk (new awk) on most systems and as gawk (GNU awk) on Linux
    – It combines features of several filters, but also has two unique features:
        1. It can identify and manipulate individual fields in a line
        2. It is the only UNIX filter that can perform computations
    – It also accepts extended REs for pattern matching, and has C-like programming constructs and several built-in variables and functions
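The two features above (field handling and computation) can be sketched as follows. The sample data and the variable names are made up for illustration; only awk's standard `$1`, `$2`, `NR` and `END` are relied on.

```shell
# Hypothetical sample data: item name, quantity, unit price
sales='apple 3 2
pear 5 1
plum 2 4'

# $1, $2, $3 are the whitespace-separated fields of each line;
# NR is a built-in variable holding the current line number.
report=$(printf '%s\n' "$sales" | awk '{ print NR, $1, $2 * $3 }')

# Accumulate a running total and print it once, after the last line.
total=$(printf '%s\n' "$sales" | awk '{ sum += $2 * $3 } END { print sum }')

printf '%s\n' "$report"
printf 'total: %s\n' "$total"
```

Each line of the report holds the line number, the item name, and quantity times price; the total is the sum of those products.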

WILDCARDS

A very limited form of regular expressions recognised by the shell when you use filename substitution:

    *      (asterisk, or "splat") matches zero or more characters
    ?      (question mark) matches any single character
    […]    matches any one character listed within the brackets
    [abc]  (set) matches any one of the characters listed, i.e. a, b, or c
    [x-z]  (range) matches any one character in the range x-z
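A small sketch of filename substitution in action, assuming a scratch directory with four made-up file names. The point is that the shell, not echo, turns each pattern into a list of names.

```shell
# Scratch directory with hypothetical file names, so the patterns
# only ever see these four files.
dir=$(mktemp -d)
touch "$dir/chap1" "$dir/chap2" "$dir/chapter" "$dir/notes.txt"

# Each echo runs in a subshell inside $dir; the shell expands the
# pattern into matching file names before echo is started.
star=$(cd "$dir" && echo chap*)        # zero or more characters
question=$(cd "$dir" && echo chap?)    # exactly one character
from_set=$(cd "$dir" && echo chap[12]) # one character from the set
range=$(cd "$dir" && echo [m-o]*)      # first character in the range m-o

printf '%s\n' "$star" "$question" "$from_set" "$range"
rm -rf "$dir"
```

Note that chap? does not match chapter, because ? stands for exactly one character.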

. Be advised that the asterisk and the question mark are treated differently by programs that use regular expressions than by the shell!

Shortcomings of Wildcards

• Wildcards are great for specifying files in a directory, but not really powerful enough for searching inside files (because they are evaluated by the shell before the command is run)
• What we need is something to search inside files:

    egrep -n PATTERN [FILES…]

    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:(on a July estimate) did not return, and it was but a small consolation that
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July

REGULAR EXPRESSIONS

• A pattern that describes a set of strings
• A tiny, highly specialised programming language
    – Sometimes a part of other programming languages such as Python, Perl or Java
• Theoretical underpinnings: finite state automata
• A regular expression (RE) has an elaborate character set overshadowing the shell's wildcards
    – REs take care of usual query requirements

Regular Expressions Vs. Wildcards

• Alert! Regular expressions are NOT wildcards
• Although they look similar, the meaning of wildcards is different from that of REs
    – Wildcards are expanded by the shell
    – REs are interpreted by some other program: often by grep or egrep, but also by awk, sed (stream editor) and perl (practical extraction and report language); fgrep, by contrast, matches fixed strings only
    – REs are also available in Java through the java.util.regex package

Fixed Strings

• In regular expressions, the most common case is that a character matches itself
• Exceptions are called meta-characters: . * ? + [ ] { } | \ ( ) ^ $
• To match a meta-character literally, escape it with a \. For example, to look for the name p.j.delmas:

    $ egrep 'p\.j\.delmas' cs_staff.lst
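To see why the escaping matters, here is a sketch that counts matching lines with and without the backslashes. The file contents are invented; `grep -E` is the same as egrep, and `grep -F` (fixed strings) is shown as an alternative to escaping.

```shell
# Hypothetical staff list: only the first line is the name we want.
staff=$(mktemp)
printf '%s\n' 'p.j.delmas' 'pxjydelmas' 'p.j.smith' > "$staff"

# Unescaped: each . matches ANY character, so pxjydelmas matches too.
unescaped=$(grep -E -c 'p.j.delmas' "$staff")

# Escaped: \. matches only a literal full stop.
escaped=$(grep -E -c 'p\.j\.delmas' "$staff")

# -F treats the whole pattern as a fixed string, so no escaping is needed.
fixed=$(grep -F -c 'p.j.delmas' "$staff")

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
rm -f "$staff"
```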

A Single Character

• If not a meta-character, it matches itself
• Bracketed structure […]
    . Matches a single character in the set
    . Example: the pattern [ra] matches either an r or an a
    . Can contain a range, for both letters and digits
    . [a-zA-Z0-9] matches a single alphanumeric character
    . Can be complemented (negated) by putting a caret (^) as the first character (like the wildcard !); this is the first use of ^
    . [^a-zA-Z] matches a single non-alphabetic character
• Full stop, or period (.), matches any single character

Repeating Characters

A regular expression may be followed by one of several repetition operators:

    ?        The preceding item is matched zero or one times
    *        The preceding item is matched zero or more times
    +        The preceding item is matched one or more times
    \{n\}    The preceding item is matched exactly n times
    \{n,\}   The preceding item is matched n or more times
    \{n,m\}  The preceding item is matched at least n times, but no more than m times

? and + belong to the extended REs used by egrep. The backslashed braces are the basic-RE form used by plain grep; egrep writes them without backslashes, as {n}, {n,} and {n,m}.

Repeating Characters

• * (asterisk) refers to the immediately preceding character
    – Its interpretation differs totally from the * used by wildcards
    – It indicates that the previous character can occur many times, or not at all
        g*   matches a null string and also: g gg ggg gggg …
        g+   matches: g gg ggg gggg …
        .*   matches a null string or any number of characters
    – The * has significance in a regular expression only if it is preceded by a character
        • If it is the first character in a regular expression, then it matches itself
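The repetition operators can be compared side by side on a few made-up words. In this sketch `match` is a hypothetical helper, `-x` makes grep accept only whole-line matches, and `grep -E` stands in for egrep.

```shell
# Hypothetical helper: run grep with the given options over four
# sample words and join the matching words on one line.
match() {
    printf '%s\n' ct cat caat caaat | grep "$@" | paste -s -d ' ' -
}

optional=$(match -E -x 'ca?t')     # zero or one a
star=$(match -E -x 'ca*t')         # zero or more a
plus=$(match -E -x 'ca+t')         # one or more a
exactly=$(match -E -x 'ca{2}t')    # exactly two a (extended RE braces)
at_least=$(match -E -x 'ca{2,}t')  # two or more a
basic=$(match -x 'ca\{1,2\}t')     # one or two a (basic RE braces)

printf '%s\n' "$optional" "$star" "$plus" "$exactly" "$at_least" "$basic"
```

The last call uses plain grep, which is why the braces carry backslashes there.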

Example: the grep Command

• The grep command allows you to search one or more files for particular character patterns:

    grep pattern file(s)

    – Every line of each file that contains pattern is displayed at the terminal
    – If more than one file is specified to grep, each line is also immediately preceded by the name of the file (in order to identify the source)
    – If the pattern does not exist in the specified file(s), grep simply displays nothing
    – It is generally a good idea to enclose the grep pattern inside a pair of single quotes to "protect" it from the shell

    $ grep * stars
      the shell sees the asterisk and automatically substitutes the names of all the files in your current directory
    $ grep '*' stars
      the quotes remove the asterisk's special meaning to the shell, so that the two arguments, * and stars, are passed to grep
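The difference between the two stars commands can be reproduced in a scratch directory. This is only a sketch: the file aaa and the contents of stars are invented so that the unquoted form has something to expand to.

```shell
dir=$(mktemp -d)
printf '%s\n' 'twinkle * twinkle' 'no asterisk here' > "$dir/stars"
touch "$dir/aaa"

# Unquoted: the shell rewrites the command to "grep aaa stars stars",
# so grep looks for the pattern aaa and finds nothing.
# (|| true keeps a "no match" exit status from aborting a strict script.)
unquoted=$(cd "$dir" && grep * stars || true)

# Quoted: grep itself receives *, which as the first character of a
# basic RE matches a literal asterisk.
quoted=$(cd "$dir" && grep '*' stars)

echo "unquoted: [$unquoted]"
echo "quoted:   [$quoted]"
rm -rf "$dir"
```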

Specifying Pattern Locations

Anchoring a pattern is necessary when it can occur in more than one place in a line:

    ^    (caret) matches an empty string at the beginning of a line
           . ^pat matches the pattern pat at the beginning of a line
    $    (dollar) matches an empty string at the end of a line
           . pat$ matches the pattern pat at the end of a line

This is the second use of ^.

    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:5816 (on a July estimate) did not return, and
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July
    inter4.txt:480: 2 July

    ggim001$ egrep -n ^Ju.y *
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July
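A self-contained variant of the anchoring comparison, using four invented lines instead of the lecture's text files. `count` is a hypothetical helper that reports how many lines match.

```shell
# Hypothetical helper: count the sample lines matching an extended RE.
count() {
    printf '%s\n' 'July' 'By 10 July General' '  2 July' 'Jury duty in June' |
        grep -E -c "$1"
}

anywhere=$(count 'Ju.y')     # pattern may sit anywhere in the line
at_start=$(count '^Ju.y')    # line must begin with the pattern
at_end=$(count 'Ju.y$')      # line must end with the pattern
whole=$(count '^Ju.y$')      # line must consist of the pattern only

echo "anywhere=$anywhere start=$at_start end=$at_end whole=$whole"
```

Quoting the patterns also keeps the shell from treating $ as the start of a variable name.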

Regular Expression Character Subset

    *        Matches zero or more occurrences of the previous character
    .        Matches a single character (like the ? wildcard)
    [prq]    Matches a single character p, r, or q
    [c1-c2]  Matches a single character in the range between c1 and c2
    [^pqr]   Matches a single character that is not p, q, or r
    ^pat     Matches pattern pat at the beginning of a line
    pat$     Matches pattern pat at the end of a line

Some of these symbols are also meaningful to the shell, so the REs should be quoted
    – The expressions are interpreted by the command, and quoting ensures that the shell is not able to interfere
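The bracket and dot entries of the table can be tried out on a handful of invented lines. This sketch sets LC_ALL=C as a precaution, so that ranges such as a-z follow plain ASCII order; `count` is a hypothetical helper.

```shell
# Hypothetical helper: count sample lines matching a pattern,
# with the locale fixed so that a-z means the ASCII lowercase letters.
count() {
    printf '%s\n' 'rat' 'art' 'x9' '---' 'Q' | LC_ALL=C grep -c "$@"
}

either=$(count '[ra]')           # lines containing an r or an a
alnum=$(count '[a-zA-Z0-9]')     # lines containing a letter or digit
non_alpha=$(count '[^a-zA-Z]')   # lines containing a non-letter
one_char=$(count -x '.')         # lines that are exactly one character
both_uses=$(count '^[^a-z]')     # lines starting with a non-lowercase character

echo "$either $alnum $non_alpha $one_char $both_uses"
```

The last pattern shows both uses of ^ together: the first caret anchors to the start of the line, the second negates the set.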

Large Regular Expressions

• Larger REs can be built from smaller REs in two ways:
    – Concatenation: the resulting RE matches any string formed by concatenating two substrings that respectively match the concatenated sub-expressions
        • Example: "[a-z][0-9]*" matches a lowercase letter followed by zero or more digits
    – Alternation: two REs may be joined by the infix operator | (an extended-RE operator, so use egrep); the resulting RE matches any string matching either sub-expression
        • Example: [a-z]|[0-9]
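Both examples can be checked against a few invented strings. As a sketch, `match` is a hypothetical helper; `-x` restricts grep to whole-line matches so the effect of each RE is easy to see, and LC_ALL=C keeps the ranges in ASCII order.

```shell
# Hypothetical helper: list the sample strings matched in full by an
# extended RE, joined on one line.
match() {
    printf '%s\n' 'a42' 'b' '7' 'Z9' '-' |
        LC_ALL=C grep -E -x "$1" | paste -s -d ' ' -
}

# Concatenation: one lowercase letter, then zero or more digits.
concatenated=$(match '[a-z][0-9]*')

# Alternation: a single lowercase letter OR a single digit.
alternated=$(match '[a-z]|[0-9]')

echo "concatenation: $concatenated"
echo "alternation:   $alternated"
```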

A Few Examples

    $ ls
    chap1 chap2
    $ cat cha*
    text1
    text2
    $ cat [a-z]hap[0-9]
    text1
    text2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-Z]'
    TEXT1
    TEXT2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-QU-Z]'
    WE]W1
    WE]W2
    $ egrep -n '[a-z]' chap*
    chap1:1:text1
    chap2:1:text2

Saving Matched Characters

• The characters matched within a regular expression are captured by enclosing them inside \(…\)
• The captured characters are stored in "registers" 1, …, 9 and retrieved as \1, …, \9, respectively
    – Successive occurrences of the \(…\) construct get assigned to successive registers:
        ^\(...\)\(...\) stores the first three characters on the line into register 1, and the next three characters into register 2
    – The RE ^\(.\) matches the first character in the line and stores it in register 1
    – The RE ^\(.\).*\1$ matches all lines in which the first character (^.) and the last character (\1$) are the same
        • Here, the RE .* matches all the characters in between
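The register mechanism can be exercised with plain grep on a few invented lines. This is a sketch; `match` is a hypothetical helper that joins the matching lines.

```shell
# Hypothetical helper: list the sample lines matched by a basic RE.
match() {
    printf '%s\n' 'abca' 'level' 'xyz' 'a' 'abcabc' |
        grep "$@" | paste -s -d ' ' -
}

# First and last character are the same. A one-character line does not
# match, because \1 needs a second character to compare against.
same_ends=$(match '^\(.\).*\1$')

# Three characters immediately followed by the same three characters.
repeated=$(match -x '\(...\)\1')

echo "same ends: $same_ends"
echo "repeated:  $repeated"
```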

An Example

• List all the ordinary files in your directory last modified in July, in increasing order of file size (assuming file names do not contain " Jul "):

    ls -l | grep '^-.* Jul .*' | sort -n -k 5

• The first command, ls -l, lists the directory contents
    – The -l option displays the long format: file mode, number of links, owner name, group name, number of bytes in the file, abbreviated month, day-of-month, hour:minute the file was last modified, and file name

    ggim001$ ls -l
    total 160
    drwxr-xr-x 13 ggim001 ggim001 442 Jul 9 11:35 Exercises
    -rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
    drwxr-xr-x 17 ggim001 ggim001 578 Jun 27 15:01 bak-dir
    -rwxr-xr-x 1 ggim001 ggim001 294 Jun 27 14:57 bak.bash
    -rwxr-xr-x 1 ggim001 ggim001 425 Jun 26 18:26 calc.bash
    …

An Example

• The second command, grep '^-.* Jul .*', searches its input (i.e. the standard output of the first command, ls -l) for a particular character pattern and outputs every line that contains the pattern
• For the desired ordinary files, the first character in the line is "-"
• Every desired line has to contain the month abbreviation: Jul
• The regular expression '^-.* Jul .*' gives the pattern to be found, where .* represents any character string between the - and Jul, and afterwards

An Example

    ggim001$ ls -l | grep '^-.* Jul .*'
    -rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
    -rw-r--r-- 1 ggim001 ggim001 372 Jul 4 15:39 right.txt
    -rwxr-xr-x 1 ggim001 ggim001 467 Jul 5 13:31 selfd.bash

• The third command, sort, takes the input lines and sorts them into ascending order
    – The -n option to sort specifies that the sort key is to be treated as a number, so the data is sorted arithmetically
    – The -k 5 says to skip the first four fields on each line and start the sort key at the fifth field (the file size)
    – Fields are delimited by space or tab characters by default; to use a different delimiter, the -t option should be used (why t?)

An Example

    ggim001$ ls -l | grep '^-.* Jul .*' | sort -n -k 5
    -rw-r--r-- 1 ggim001 ggim001 372 Jul 4 15:39 right.txt
    -rwxr-xr-x 1 ggim001 ggim001 467 Jul 5 13:31 selfd.bash
    -rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt

• One more example:

    ggim001$ ls -l | grep '^-.* Jul .*' | sort -n -k 5 | cut -c35-
    372 Jul 4 15:39 right.txt
    467 Jul 5 13:31 selfd.bash
    526 Jul 9 12:46 Test.txt

• The command cut -cchars file extracts the given character positions from each line of a data file, or of the standard input if file is not specified
    -c1,5-10   extracts characters 1 and 5 through 10
    -c35-      extracts characters 35 through the end of the line
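Because real ls -l output differs from machine to machine, the following sketch feeds a fixed listing (four lines taken from the slides) through the same grep and sort stages. As a swapped-in last stage it uses awk to print fields 5 and 9 instead of cut -c35-, since character positions depend on the exact column alignment of the listing.

```shell
# Fixed sample listing, so the result does not depend on the real directory.
listing='drwxr-xr-x 13 ggim001 ggim001 442 Jul 9 11:35 Exercises
-rw-r--r-- 1 ggim001 ggim001 526 Jul 9 12:46 Test.txt
-rwxr-xr-x 1 ggim001 ggim001 294 Jun 27 14:57 bak.bash
-rw-r--r-- 1 ggim001 ggim001 372 Jul 4 15:39 right.txt'

result=$(printf '%s\n' "$listing" |
    grep '^-.* Jul ' |        # ordinary files (leading -) dated Jul
    sort -n -k 5 |            # numeric sort starting at field 5, the size
    awk '{ print $5, $9 }')   # keep only the size and the file name

printf '%s\n' "$result"
```

The directory Exercises is dropped by the leading - in the pattern, and bak.bash by the month.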

The grep Options

• The -v option allows you to print the lines that do not contain a specified pattern

    $ grep -v 'UNIX' intro.tx

  prints all lines that do not contain UNIX
• The -l option lists just the files that contain the specified pattern, not the matching lines from the files

    $ grep -l 'UNIX' * | wc -l

  counts the number of files in the current directory that contain the specified pattern
    – What are you counting if you use grep without the -l option?
• The -n option: each line from the file that matches the specified pattern is preceded by its relative line number in the file
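The three options can be tried on a scratch directory. The file names and contents below are invented for illustration; tr -d ' ' only strips the padding that some versions of wc add.

```shell
dir=$(mktemp -d)
printf '%s\n' 'UNIX is old' 'Linux is newer' 'I use UNIX daily' > "$dir/a.txt"
printf '%s\n' 'nothing relevant' > "$dir/b.txt"
printf '%s\n' 'UNIX again' > "$dir/c.txt"

# -v: lines that do NOT contain the pattern
inverted=$(grep -v 'UNIX' "$dir/a.txt")

# -n: matching lines, each preceded by its line number
numbered=$(grep -n 'UNIX' "$dir/a.txt")

# -l: names of the files containing the pattern, counted with wc -l
file_count=$(cd "$dir" && grep -l 'UNIX' * | wc -l | tr -d ' ')

# Without -l, the same pipeline counts matching lines instead of files.
line_count=$(cd "$dir" && grep 'UNIX' * | wc -l | tr -d ' ')

printf '%s\n' "$inverted" "$numbered" "files=$file_count lines=$line_count"
rm -rf "$dir"
```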

Interpreting Quote Characters

• The shell recognises four different types of quote characters:
    – The single quote character '
    – The double quote character "
    – The backslash character \
    – The back quote character ` (magic quote)
• The single, double, and back quotes must occur in pairs, whereas the backslash character is unary in nature
• When the shell encounters the first ', it ignores any special characters until it finds the closing '

    $ echo '<>|;(){}>>`&'
    <>|;(){}>>`&

    – Even the Enter key (a newline) loses its special meaning to the shell if it is enclosed in quotes

Interpreting Quote Characters

• Double quotes work similarly to single quotes, except that they are not as restrictive
    – The single quotes tell the shell to ignore all enclosed characters
    – The double quotes say to ignore most of the enclosed characters; e.g. dollar signs, back quotes, and backslashes are not ignored

    ggim001$ x=*.bash
    ggim001$ echo $x
    bak.bash calc.bash chegg.bash example.bash …
    ggim001$ echo '$x'
    $x
    ggim001$ echo "$x"
    *.bash
    ggim001$
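The x=*.bash session can be replayed in a scratch directory with two of the file names from the slide. This is a sketch for sh or bash; each echo runs in a subshell inside the directory so that the unquoted form has files to match.

```shell
dir=$(mktemp -d)
touch "$dir/bak.bash" "$dir/calc.bash"

# The assignment stores the literal text *.bash; no file names yet.
x=*.bash

# Unquoted: $x is substituted, then the result is expanded as a wildcard.
unquoted=$(cd "$dir" && echo $x)

# Single quotes: nothing inside is interpreted, not even the $.
single=$(cd "$dir" && echo '$x')

# Double quotes: $x is substituted, but the * is not expanded.
double=$(cd "$dir" && echo "$x")

printf '%s\n' "$unquoted" "$single" "$double"
rm -rf "$dir"
```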

Interpreting Quote Characters

• The backslash quotes the single character immediately following it
    – One exception is when the backslash is used as the very last character on the line; then the shell treats it as a line continuation (it is most often used for typing long commands over multiple lines)

    ggim001$ echo one-two\
    > -three
    one-two-three
    ggim001$

    – You can use the backslash inside double quotes to remove the meaning of characters that otherwise would be interpreted inside these quotes (i.e. other backslashes, dollar signs, back quotes, newlines, and other double quotes)
• The back quote tells the shell to execute the enclosed command

    $ echo The working directory is `pwd`
    The working directory is /afs/ec.auckland.ac.nz/users/p/r/prid013/unixhome
    $

Chocolate Egg Question

• What does this command do?

    grep " plant[a-z]* [ .]" *.tx
