Lecture 7 Regular Expressions and Grep 7.1
Total Page:16
File Type:pdf, Size:1020Kb
LECTURE 7 REGULAR EXPRESSIONS AND GREP 7.1 Regular Expressions 7.1.1 Metacharacters, Wild cards and Regular Expressions Characters that have special meaning for utilities like grep, sed and awk are called metacharacters. These special characters are usually alpha-numeric in nature; a "e# e$- ample metacharacters are: caret &�'( dollar &+'( question mar- &.'( asterisk &/'( backslash &1'("orward slash &2'( ampersand &3'( set braces &4 an) 5'( range brac-ets &[...]). These special characters are interpreted contextually 0y the utilities. If you need to use the spe- cial characters as is without any interpretation, they need to be prefixed with the escape character (\). 8or example( a literal + can be inserted as 1+! a literal \can be inserted as \\. Although not as commonly used, numbers also can be turned into metacharacters 0y using escape character. 8or example( a number 9 is a literal number tw*( while 19 has a special meaning. Often times( the same metacharacters ha:e special meaning "or the shell as #ell. Thus #e need to be careful when passing metacharacters as arguments to utilities. It is a standard practice to embed the metacharacter arguments in single quotes to stop the shell from interpreting the metacharacters. $grep a*b file . asteris- gets interpreted 0y shell $grep ’a*b’ file . asterisk gets interpreted 0y 6rep utility Although single quotes are needed only "or arguments with metacharacters( as a 6**) practice( the pattern arguments *" the 6rep( sed and a#- utilities are always embedded in single quotes. $grep ’ab’ file $sed ’/a*b/’ file 7.1 Regular Expressions 47 Wild card Meaning given 0y shell / Zero or more characters . Any single character [...] Range *" characters; [ab] selects a or 0 home directory *" the user Table 7.1: Wild cards "or Bourne again shell $awk ’{print $1}’ file As stated before( the shell uses metacharacters "or its *wn #ork. Typically shell uses metacharacters t* provide services like I/O redirection, command substitution, file name substitution etc. O" all the services provided 0y the shell, file name substitution is :ery popular and handy. The metacharacters used for file name substitution are called wild cards. Three wild cards "or Bourne again shell are given in table 7.1. The patterns containing the characters or metacharacters are called regular expres- sions. The users can 0uild up regular e$pressions using characters or metacharacters. A simple regular e$pression is a*; this regular expression combines the character @a’ and the metacharacter @/@ to "or a string @a*’. One regular expression used in the 6rep example is @a*b’. Another simple "or *" regular expression is @abc’. As a 6**) practice( all the reg- ular expression arguments to the utilities are place) in single quotes. The single quotes stop the shell from possibly interpreting the metacharacters in the regular expressions. ;e ust understand that regular expressions describe sequence of characters. Even though sometimes the sequence *" characters "or meaningful #ords( the proper context *" understandin6 the #ord is that it is a sequence *" characters. 7.1.2 Regular Expression Components A regular expression is made *" atoms and operators. The atoms specify #hat #e are looking "or and #here in the text the match is to be ade. The operators( which are not required in all regular expressions( combine atoms into comple$ expressions. Five types *" atoms are encountered in regular expressions. They are: Single Character This is the simplest "or *" regular expression. When a single charac- ter appears in a regular expression, it matches itsel". Example: s - matches @s’ Dot(.) A dot matches any single character except the newline character - \n. The dot can be combined with other types *" atoms to make useful regular expressions. 7.1 Regular Expressions 48 Examples: . matches any single character except \n a. matches @a’ "ollo#ed 0y any character si. matches @sid’( @sip@( sit’( @sir’( @sic’ etc. .t matches @it’( @at’ etc. Class The class matches a predefined set *" characters or a range *" characters. The class *" characters are placed in[] braces. The dash (-) character can be used to specify continuous sequence (range) *" characters. The caret(�' character #orks as not operator and can be used to indicate exclusion *" characters. Examples: [abc] matches a or 0 or c [afz] matches a or " or A [a-z] matches all the lo#ercase alphabets. [a-zA-Z] A #ay to specify ultiple ranges in the class. The expression matches all the lo#er and the uppercase alphabets. [a-z0-9] matches all the lo#ercase alphabets and the numbers. [�0-9] all the characters except the numbers. Because *" the #ay the class is represented, certain characters (namely -,[,] and �' need special treatment to represent themselves in the character class. - The dash ust appear as first character *" the class *r it ust be escaped 0y 1 ] The C ust appear as first character *" the class or it ust be escape 0y 1 [ The B ust appear as first character *" the class or it ust be escapes 0y 1 � It is best to always escape the � with 1 Examples: [-0-9] matches the dash and all the numbers []a-z] matches the C and all the lo#ercase alphabets [[a-z] matches the B and all the lo#ercase alphabets [b!�] matches 0 and � Anchor Anchors are position specifiers. Anchors specify the position *" the next char- acter in the matching text. 8our kinds *" anchors are :ery handy; They are: � (beginning *" a line), + (end *" a line), 1D (beginnin6 *" a #ord) and 1E (end *" a #ord). 8rom the semantic point *" vie#( there is a special restriction on location *" the � and + characters in the regular expression. A � ust appear at the beginning *" a regular expression to acquire special meanin6! if � appears anywhere else( it is treated as a normal character. Similarly( a + ust always appear at the end *" a regular expression; Otherwise( it is just a normal character. Back Reference Regular expressions ha:e the capability to mar- and sa:e matched subexpressions into sa:ed 0uffers. Bac- references provide a #ay to refer to these 7.1 Regular Expressions 49 Figure 7.1: Anchors "or regular expressions sa:ed 0uffers. A total *" nine sa:ed 0uffers are a:ailable "or usage. These nine 0uffers can be re"erenced 0y using escaped digits. The sa:ed 0u""ers are repre- sented as 1>( \2,...( \9. The second component *" regular expressions is operator. Regular expression operators are similar to mathematical operators; they combine atoms to "or interesting regular expressions. There are 7:e types *" operators. Figure 7.2: Operators "or regular expressions Sequence Sequence operator is a null operator. Sequence implies non-existent join op- erator between a series *" atoms. Examples: )*6 matches the pattern @)*6@ a..b matches the pattern a "ollo#ed any tw* characters ending with 0 [0-"][0-9] matches all the numbers between HH and IG �+ matches a 0lank line 1D0**-1E matches the #ord @0**-@ �[Ll]ong matches lines with @Long’ or @long’ at the beginnin6 *" the line Alternation The alternation operator &J' defines one or more alternatives. Example: UNIX|unix - matches either UNIX or unix 7.1 Regular Expressions ./ Repetition The repetition operator specifies the number *" repetitions *" an atom or a regular expression. It takes the "or \{m,n\}. The :arious possibilities *" the repeti- tion operator are% \{m,n\} The number *" repetitions allo#ed is between m and n. \{,n\} The number *" repetitions allo#ed is between / and n. 14 (15 The number *" repetitions allo#ed are m or more. {m,n} The number *" repetitions allo#ed is between m and n. 4 (5 The number *" repetitions allo#ed are m or more. {,n} The number *" repetitions allo#ed is between / and n. {n} The number *" repetitions allo#ed is exactl0 ’n’. The most popular *" the repetition possibilities ha:e been given a short notation.There are three widely used short "orms. Long "or Short "or Meaning 14H(15 / Aero or more times 14>(15 K one or more times 14H(>15 . Aero or one times The ability to match LAero" or "more" *" something is -nown as closure. Thus /(K(. are called closure operators. The closure operators give elasticity to a regular ex- pression. Group The 6roup operator pairs up the parts *" the regular e$pression. The braces & ' indicate the 6roup operator. Examples: Reg Ex Matching String A(BC){3} matches ABCBCBC (F(BC)\{2\}G) 14915 FBCBCGFBCBCG Save The sa:e operator is indicated 0y escaped parentheses( \(...\). The sa:e operator copies a matched text string to one *" nine 0uffers "*r later reference. With in an ex- pression, the first sa:ed text is copied to 0uffer >( the second sa:ed text is copied to 0uffer 9( and so *n. These sa:ed 0uffers can be accessed through bac- references in the later part *" the regular expression. Examples: Reg Ex Matching String �\(.\).*\1$ matches all the lines on which the first character is the same as the last character. 7.1.3 Regular Expression List The UNIX po#er tools like 6rep( sed, a#- and vi depend on regular expressions a lot. All the atoms and operators *" regular expressions can be separated into tw* categories: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). The 6rep( sed tools and vi editor support basic regular expressions (BRE); the egrep and a#- tools support both basic regular expressions(BRE) and extended regular expressions (ERE). 7.1 Regular Expressions 51 The BRE are shown in table 7.2 and the ERE are sh*wn in table 7.3. Modern :ersions *" the 6rep utility are capable *" supporting the ERE through @-e’ option. The ERE e$- tends the BRE to a:oid the clumsy \escape character "or 6roup and introduces the K(.( alternation operators. Metacharacter Name Matches Items to match single character .