Searching Inside Files & Pattern Matching

Patricia J Riddle
Lecture 6 COMPSCI 215

Types of Searching
• Searching inside files – using regular expressions (this lecture)
• Searching for files – using recursive file find (next lecture)

REGULAR EXPRESSIONS
• Used by several different UNIX commands:
  – ed: a line-based text editor to create, display, modify, and otherwise manipulate text files
  – sed: a stream editor that reads input files, modifies the input according to a list of commands, and writes to STDOUT
  – awk: a pattern-directed scanning and processing tool
  – grep: searches input files or STDIN for lines matching a given pattern and, by default, prints the matching lines
  – vi: a programmer's text editor for all kinds of plain text, especially useful for editing programs (but it is not as good as emacs!)
• Provide a convenient and consistent way of specifying patterns to be matched

The awk Command
• The awk command, named after its authors Aho, Weinberger, and Kernighan, was the most powerful utility for text manipulation and report writing before the advent of perl
  – The POSIX awk also appears as nawk (new awk) on most systems and as gawk (GNU awk) on Linux
  – It combines features of several filters, but also has two unique features:
    1. It can identify and manipulate individual fields in a line
    2. It is the only UNIX filter that can perform computations
  – It also accepts extended REs for pattern matching, and has C-type programming constructs and several built-in variables and functions

WILDCARDS
A very limited form of regular expressions, recognised by the shell when you use filename substitution:
  *      (asterisk, or “splat”) matches zero or more characters
  ?      (question mark) matches any single character
  […]    matches any one character listed within the brackets
  [abc]  (set) matches any one of the characters listed, i.e. a, b, or c
  [x-z]  (range) matches any one character in the range x-z
• Be advised that the asterisk and the question mark are treated differently by the programs listed earlier (ed, sed, awk, grep, vi) than by the shell!

Shortcomings of Wildcards
• Wildcards are great for specifying files in a directory, but not powerful enough for searching inside files (they are evaluated by the shell before the command is run)
• What we need is something to search inside files:
    egrep -n PATTERN [FILES…]

    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:(on a July estimate) did not return, and it was but a small consolation that
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July

REGULAR EXPRESSIONS
• A pattern that describes a set of strings
• A tiny, highly specialised programming language
  – Sometimes a part of other programming languages such as Python, Perl or Java
• Theoretical underpinnings: finite state automata
• A regular expression (RE) has an elaborate character set that overshadows the shell's wildcards
  – REs take care of the usual query requirements

Regular Expressions Vs. Wildcards
• Alert! Regular expressions are NOT wildcards
• Although they look similar, wildcards and REs have different meanings
  – Wildcards are expanded by the shell
  – REs are interpreted by some other program: often by grep or egrep (fgrep, by contrast, matches fixed strings only), but also by awk, sed (stream editor) and perl (practical extraction and report language)
  – REs are even available in Java, by importing the java.util.regex package

Regular Expressions: Fixed Strings
• Most common case: most characters in a RE match themselves
• Exceptions are called meta-characters: . * ? + [ ] { } | \ ( )
• To match a meta-character, escape it with a \
    $ egrep "p\.j\.delmas" cs_staff.lst
  searches for the name p.j.delmas

A Single Character
• If not a meta-character, it matches itself
• Bracketed structure […]
  – Matches a single character in the set
  – Example: the pattern [ra] matches either an r or an a
  – Can contain a range, both for letters and for digits
  – [a-zA-Z0-9] matches a single alphanumeric character
  – Can be complemented (negated) by putting a caret (^) as the first character (like the wildcard !); this is the first use of ^
  – [^a-zA-Z] matches a single non-alphabetic character
• Full stop, or period (.), matches any single character

Repeating Characters
A regular expression may be followed by one of several repetition operators:
  ?        The preceding item is matched zero or one times
  *        The preceding item is matched zero or more times
  +        The preceding item is matched one or more times
  \{n\}    The preceding item is matched exactly n times
  \{n,\}   The preceding item is matched n or more times
  \{n,m\}  The preceding item is matched at least n times, but no more than m times
• The backslashed braces are the basic-RE form used by grep and sed; extended REs, as used by egrep, write {n}, {n,} and {n,m} instead, and ? and + are extended-RE operators

Repeating Characters
• * (asterisk) refers to the immediately preceding character
  – Its interpretation differs totally from the * used by wildcards
  – It indicates that the previous character can occur many times, or not at all
      g*   matches a null string and also: g gg ggg gggg …
      g+   matches g gg ggg gggg …
      .*   matches a null string or any number of characters
  – The * has significance in a regular expression only if it is preceded by a character
    • If it is the first character in a regular expression, then it matches itself

Example: the grep Command
• The grep command allows you to search one or more files for particular character patterns:
    grep pattern file(s)
  – Every line of each file that contains pattern is displayed at the terminal
  – If more than one file is specified to grep, each line is also immediately preceded by the name of the file (to identify where it came from)
  – If the pattern does not exist in the specified file(s), grep simply displays nothing
  – It is generally a good idea to enclose the grep pattern in a pair of single quotes to “protect” it from the shell
      $ grep * stars     the shell sees the asterisk and automatically substitutes the names of all the files in your current directory
      $ grep '*' stars   the quotes remove the asterisk's special meaning to the shell, so the two arguments, * and stars, are passed to grep

Specifying Pattern Locations
Anchoring a pattern is necessary when it can occur in more than one place in a line
  ^  (caret) matches an empty string at the beginning of a line; this is the second use of ^
     – ^pat matches the pattern pat at the beginning of a line
  $  (dollar) matches an empty string at the end of a line
     – pat$ matches the pattern pat at the end of a line

    ggim001$ egrep -n Ju.y *
    The_Relief_of_Tobruk.txt:23:5816 (on a July estimate) did not return, and
    The_Relief_of_Tobruk.txt:59:By 10 July General Freyberg (1) was able to point out to the New Zealand
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July
    inter4.txt:480: 2 July

    ggim001$ egrep -n ^Ju.y *
    inter1.txt:193:July
    inter1.txt:645:July
    inter2.txt:193:July

Regular Expression Character Subset
  *        Matches zero or more occurrences of the previous character
  .        Matches a single character (like the ? wildcard)
  [prq]    Matches a single character p, r, or q
  [c1-c2]  Matches a single character in the range between c1 and c2
  [^pqr]   Matches a single character which is not p, q, or r
  ^pat     Matches pattern pat at the beginning of a line
  pat$     Matches pattern pat at the end of a line
• Some of these symbols are also meaningful to the shell, so REs should be quoted
  – The expressions are interpreted by the command, and quoting ensures that the shell is not able to interfere

Large Regular Expressions
• Larger REs can be built from smaller REs in two ways:
  – Concatenation: the resulting RE matches any string formed by concatenating two substrings that respectively match the concatenated sub-expressions
    • Example: “[a-z][0-9]*” matches any string that begins with a lowercase letter followed by zero or more digits
  – Alternation: two REs may be joined by the infix operator |; the resulting RE matches any string matching either sub-expression
    • Example: [a-z]|[0-9]

A Few Examples
Two files, chap1 and chap2, contain the single lines text1 and text2 respectively:
    $ ls cha*
    chap1 chap2
    $ cat [a-z]hap[0-9]
    text1
    text2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-Z]'
    TEXT1
    TEXT2
    $ cat [a-z]hap[0-9] | tr '[a-z]' '[A-QU-Z]'
    WE]W1
    WE]W2
    $ egrep -n '[a-z]' chap*
    chap1:1:text1
    chap2:1:text2
• tr treats the brackets as ordinary characters, which is why x is translated to ] in the last tr example

Saving Matched Characters
• The characters matched within a regular expression are captured by enclosing them inside \(…\)
• The captured characters are stored in “registers” 1, …, 9 and retrieved as \1, …, \9, respectively
  – Successive occurrences of the \(…\) construct are assigned to successive registers: with ^\(...\)\(...\), the first three characters on the line are stored in register 1, and the next three characters in register 2
  – The RE ^\(.\) matches the first character on the line and stores it in register 1
  – The RE ^\(.\).*\1$ matches all lines in which the first character (^.) and the last character (\1$) are the same
    • Here, the RE .* matches all the characters in between

An Example
• List all the ordinary files in your directory last modified in July, in increasing order of file size (assuming file names do not contain “ Jul ”):
    ls -l | grep '^-.* Jul .*' | sort -n -k 5
• The first command, ls -l, lists the directory contents
  – the -l option displays the long format: file mode, number of links, owner name, group name, number of bytes in the file, abbreviated month, day-of-month, hour:minute the file was last modified, and file name

    ggim001$ ls -l
    total 160
    drwxr-xr-x  13 ggim001  ggim001  442 Jul  9 11:35 Exercises
    -rw-r--r--   1 ggim001  ggim001  526 Jul  9 12:46 Test.txt
    drwxr-xr-x  17 ggim001  ggim001  578 Jun 27 15:01 bak-dir
    -rwxr-xr-x   1 ggim001  ggim001  294 Jun 27 14:57 bak.bash
    -rwxr-xr-x   1 ggim001  ggim001  425 Jun 26 18:26 calc.bash
    …

An Example
• The second command grep '^-.* Jul .*' allows you to search its input (i.e.
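
The July pipeline from the An Example slides can be tried without touching a real directory. What follows is a minimal sketch, assuming a captured listing held in a shell variable: the first lines are the ls -l output shown on the slide, while the notes.txt entry is hypothetical and added only so that sort has two lines to put in order.

```shell
#!/bin/sh
# Listing copied from the slide, plus one hypothetical entry (notes.txt)
# so that the final sort has more than one line to order.
listing='total 160
drwxr-xr-x 13 ggim001 ggim001 442 Jul  9 11:35 Exercises
-rw-r--r--  1 ggim001 ggim001 526 Jul  9 12:46 Test.txt
drwxr-xr-x 17 ggim001 ggim001 578 Jun 27 15:01 bak-dir
-rwxr-xr-x  1 ggim001 ggim001 294 Jun 27 14:57 bak.bash
-rwxr-xr-x  1 ggim001 ggim001 425 Jun 26 18:26 calc.bash
-rw-r--r--  1 ggim001 ggim001 120 Jul 12 09:00 notes.txt'

# ^-        keeps lines whose mode string starts with "-" (ordinary files)
# .* Jul    then requires " Jul " with a space on either side
# sort -n -k 5 orders the surviving lines numerically by the size field
july_files=$(printf '%s\n' "$listing" | grep '^-.* Jul .*' | sort -n -k 5)

printf '%s\n' "$july_files"
```

Run as written, this should print the notes.txt line followed by the Test.txt line: the Exercises entry is dropped because its mode starts with d, and the June files are dropped because they lack “ Jul ”.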

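The anchoring, escaping and back-reference rules from this lecture can be checked on a small file. This is a sketch under stated assumptions: sample.txt and its contents are hypothetical and are not taken from the lecture's files, and grep -E is used as the POSIX spelling of egrep.

```shell
#!/bin/sh
# Work in a throwaway directory so the sketch leaves nothing behind.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical sample text (an assumption for illustration only).
cat > "$tmp/sample.txt" <<'EOF'
July is cold in Auckland
By 10 July it had rained twice
level
g*g
abcabc
EOF

# . matches any single character, anywhere on the line.
anywhere=$(grep -E -c 'Ju.y' "$tmp/sample.txt")

# ^ anchors the match to the beginning of the line.
at_start=$(grep -E -c '^Ju.y' "$tmp/sample.txt")

# A backslash turns the meta-character * into a literal asterisk.
literal_star=$(grep -c 'g\*g' "$tmp/sample.txt")

# Basic-RE back-reference: the first and last characters are the same.
same_ends=$(grep '^\(.\).*\1$' "$tmp/sample.txt")

printf 'anywhere=%s at_start=%s literal_star=%s\n' \
    "$anywhere" "$at_start" "$literal_star"
printf '%s\n' "$same_ends"
```

The patterns are single-quoted throughout, so the shell passes them to grep unchanged instead of expanding * against the file names in the current directory.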