
SEARCHING INSIDE FILES & PATTERN MATCHING
Patricia J Riddle
Lecture 6 COMPSCI 215

Types of Searching
• Searching inside files
  – using regular expressions (this lecture)
• Searching for files
  – using recursive file find (next lecture)

REGULAR EXPRESSIONS
• Used by several different UNIX commands:
  – ed: a line-based text editor used to create, display, modify, and otherwise manipulate text files
  – sed: a stream editor that reads input files, modifies the input according to a list of commands, and writes to STDOUT
  – awk: a pattern-directed scanning and processing tool
  – grep: searches input files or STDIN for lines matching a given pattern and, by default, prints the matching lines
  – vi: a programmer's text editor for all kinds of plain text (it is especially useful for editing programs) (but it is not as good as emacs!)
• Provide a convenient and consistent way of specifying patterns to be matched

The awk Command
• The awk command, named after its authors Aho, Weinberger, and Kernighan, was the most powerful utility for text manipulation and report writing before the advent of perl
  – The POSIX awk also appears as nawk (new awk) on most systems and as gawk (GNU awk) on Linux
  – It combines features of several filters, but also has two unique features:
    1. It can identify and manipulate individual fields in a line
    2. It is the only UNIX filter that can perform computations
  – Also, it accepts extended REs for pattern matching, has C-type programming constructs, and several built-in variables and functions

WILDCARDS
A very limited form of regular expressions recognised by the shell when you use filename substitution
    *       (asterisk, or “splat”) matches zero or more characters
    ?       (question mark) matches any single character
    […]     matches any one character listed within the brackets
    [abc]   (set) matches any one of the characters listed, i.e. a, b, or c
    [x-z]   (range) matches any one character in the range x-z
• Be advised that the asterisk and the question mark are treated differently by programs such as grep, sed and awk than by the shell!

Shortcomings of Wildcards
• Wildcards are great for specifying files in a directory, but not powerful enough for searching inside files (because they are evaluated by the shell before the command is run)
• What we need is something to search inside files
  – egrep -n PATTERN [FILES…]

    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:(on a July estimate) did not return, and it was but a small consolation that
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July

REGULAR EXPRESSIONS
• A pattern that describes a set of strings
• A tiny, highly specialised programming language
  – Sometimes a part of other programming languages such as Python, Perl or Java
• Theoretical underpinnings: finite state automata
• A regular expression (RE) has an elaborate character set that overshadows the shell's wildcards
  – REs take care of the usual query requirements

Regular Expressions Vs. Wildcards
• Alert! Regular expressions are NOT wildcards
• Although they look similar, the meaning of wildcards is different from that of REs
  – Wildcards are expanded by the shell
  – REs are interpreted by some other program: often by grep or egrep (fgrep, by contrast, matches fixed strings only), but also by awk, sed (stream editor) and perl (practical extraction and report language)
  – REs can even be used from Java by importing the java.util.regex package

Regular Expressions - Fixed Strings
• Most common case: most characters in an RE match themselves
• Exceptions are called meta-characters: . * ? + [ ] { } | \ ( )
• To match a meta-character literally, escape it with a \
  – For example, if you look for the name p.j.delmas:
    $ egrep "p\.j\.delmas" cs_staff.lst

A Single Character
• If a character is not a meta-character, it matches itself
• Bracketed structure […]
  – Matches a single character in the set
  – Example: the pattern [ra] matches either an r or an a
  – Can contain a range, for both letters and digits
  – [a-zA-Z0-9] matches a single alphanumeric character
  – Can be complemented (negated) by putting a caret (^) as the first character (like the wildcard !); this is the first use of ^
  – [^a-zA-Z] matches a single non-alphabetic character
• Full stop, or period (.), matches any single character

Repeating Characters
A regular expression may be followed by one of several repetition operators:
    ?         The preceding item is matched zero or one times
    *         The preceding item is matched zero or more times
    +         The preceding item is matched one or more times
    \{n\}     The preceding item is matched exactly n times
    \{n,\}    The preceding item is matched n or more times
    \{n,m\}   The preceding item is matched at least n times, but no more than m times
• Note: ? and + are extended-RE operators (egrep); in extended REs the braces are written without backslashes, e.g. {n,m}

Repeating Characters
• * (asterisk) refers to the immediately preceding character
  – Its interpretation differs totally from that of the * used by wildcards
  – It indicates that the previous character can occur many times, or not at all
    g*    matches a null string and also: g gg ggg gggg …
    g+    matches g gg ggg gggg …
    .*    matches a null string or any number of characters
  – The * has significance in a regular expression only if it is preceded by a character
    • If it is the first character in a regular expression, then it matches itself

Example: the grep Command
• The grep command allows you to search one or more files for a particular character pattern:
    grep pattern file(s)
  – Every line of each file that contains pattern is displayed at the terminal
  – If more than one file is specified to grep, each line is also immediately preceded by the name of the file (in order to identify the latter)
  – If the pattern does not exist in the specified file(s), the grep command simply displays nothing
  – It is generally a good idea to enclose the grep pattern inside a pair of single quotes to “protect” it from the shell
    $ grep * stars      the shell sees the asterisk * and automatically substitutes the names of all the files in your current directory
    $ grep '*' stars    the quotes remove its special meaning from the shell, so that the two arguments, * and stars, are passed to grep

Specifying Pattern Locations
Anchoring a pattern is necessary when it can occur in more than one place in a line
    ^ (caret) matches an empty string at the beginning of a line (the second use of ^)
      – ^pat matches the pattern pat at the beginning of a line
    $ (dollar) matches an empty string at the end of a line
      – pat$ matches the pattern pat at the end of a line

    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:5816 (on a July estimate) did not return, and
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July
    inter4.txt:480: 2 July

    ggim001$ egrep -n ^Ju.y *
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July

Regular Expression Character Subset
    *         Matches zero or more occurrences of the previous character
    .         Matches a single character (like the ? wildcard)
    [prq]     Matches a single character p, r, or q
    [c1-c2]   Matches a single character in the range between c1 and c2
    [^pqr]    Matches a single character which is not p, q, or r
    ^pat      Matches pattern pat at the beginning of a line
    pat$      Matches pattern pat at the end of a line
Some of these symbols are also meaningful to the shell, so REs should be quoted
  – The expressions are interpreted by the command, and quoting ensures that the shell is not able to interfere

Large Regular Expressions
• Larger REs can be built from smaller REs in two ways:
  – Concatenation: the resulting RE matches any string formed by concatenating two substrings that respectively match the concatenated sub-expressions
    • Example: “[a-z][0-9]*” matches any string that begins with a lowercase letter followed by zero or more digits
  – Alternation: two REs may be joined by the infix operator |; the resulting RE matches any string matching either sub-expression
    • Example: [a-z]|[0-9]

A Few Examples
The files chap1 and chap2 contain the lines text1 and text2, respectively

    $ ls cha*
    chap1 chap2
    $ cat [a-z]hap[0-9]
    text1
    text2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-Z]'
    TEXT1
    TEXT2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-QU-Z]'
    WE]W1
    WE]W2
    $ egrep -n '[a-z]' chap*
    chap1:1:text1
    chap2:1:text2

Saving Matched Characters
• The characters matched within a regular expression are captured by enclosing the characters inside \(…\)
• The captured characters are stored in “registers” 1, …, 9 and retrieved as \1, …, \9, respectively
  – Successive occurrences of the \(…\) construct get assigned to successive registers: with ^\(...\)\(...\), the first three characters on the line are stored into register 1, and the next three characters into register 2
  – The RE ^\(.\) matches the first character in the line and stores it in register 1
  – The RE ^\(.\).*\1$ matches all lines in which the first (^.) and the last character (\1$) on the line are the same
    • Here, the RE .* matches all the characters in-between

An Example
• List all the ordinary files in your directory last modified in July, in increasing order of file size (assuming file names do not contain “ Jul ”):
    ls -l | grep '^-.* Jul .*' | sort -n -k 5
• The first command, ls -l, lists directory contents
  – the -l option displays the long format: file mode, number of links, owner name, group name, number of bytes in the file, abbreviated month, day-of-month, hour:minute the file was last modified, and file name

    ggim001$ ls -l
    total 160
    drwxr-xr-x 13 ggim001 ggim001 442 Jul  9 11:35 Exercises
    -rw-r--r--  1 ggim001 ggim001 526 Jul  9 12:46 Test.txt
    drwxr-xr-x 17 ggim001 ggim001 578 Jun 27 15:01 bak-dir
    -rwxr-xr-x  1 ggim001 ggim001 294 Jun 27 14:57 bak.bash
    -rwxr-xr-x  1 ggim001 ggim001 425 Jun 26 18:26 calc.bash
    …

An Example
• The second command, grep '^-.* Jul .*', allows you to search its input (i.e. the output of ls -l) …
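
The filtering and sorting stages of the July pipeline can be tried without a real directory. The following is a minimal sketch, assuming a POSIX shell with grep and sort available. The helper name listing is hypothetical, and so is the notes.txt entry, which is not in the lecture's listing and was added only so that the sort has two lines to order; the other lines are copied from the ls -l output shown in the example.

```shell
#!/bin/sh
# listing (hypothetical helper): prints sample `ls -l` output.
# All lines except the notes.txt entry are copied from the lecture's listing;
# notes.txt is an invented illustration so that sort has something to reorder.
listing() {
cat <<'EOF'
total 160
drwxr-xr-x 13 ggim001 ggim001 442 Jul  9 11:35 Exercises
-rw-r--r--  1 ggim001 ggim001 526 Jul  9 12:46 Test.txt
-rw-r--r--  1 ggim001 ggim001 120 Jul 10 09:00 notes.txt
drwxr-xr-x 17 ggim001 ggim001 578 Jun 27 15:01 bak-dir
-rwxr-xr-x  1 ggim001 ggim001 294 Jun 27 14:57 bak.bash
-rwxr-xr-x  1 ggim001 ggim001 425 Jun 26 18:26 calc.bash
EOF
}

# Keep only lines that begin with "-" (ordinary files) and contain " Jul ",
# then sort them numerically on field 5 (the size in bytes).
listing | grep '^-.* Jul .*' | sort -n -k 5
```

With this sample input, the directory lines (starting with d), the total line and the June entries are filtered out, leaving notes.txt followed by Test.txt. The pattern is enclosed in single quotes so that the shell passes it to grep unchanged.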