
REGULAR EXPRESSIONS

1. WHAT ARE REGULAR EXPRESSIONS

Regular expressions (or ‘regex patterns’, or simply ‘patterns’) are expressions that stand for symbols or strings of symbols, or more often for a class of symbols or strings of symbols. You may be familiar with this from the SEARCH AND REPLACE function of word processors like MS Word, OpenOffice Writer, etc., where this mechanism is known as wildcards, or wildcard operators. For example, a period (.) may stand for ‘any character’, such that m.le would find male, mile, mole, mule, etc. Obviously, such a mechanism is very useful in corpus linguistics, e.g. in order to search for different tense forms of the same verb.

This handout explains the basic use of regular expressions using two software packages with regex support that can be downloaded at no charge from the Internet: TextPad (for Windows) and BBEdit Lite (for Mac OS and OS X). Both are powerful text editors that can actually be used as rudimentary concordancers. Note that this handout will only deal with searching, not with replacing. Read the documentation for the software packages to find out more about their replace functions; however, you should never use the replace function with your original corpus files.
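The wildcard idea can also be tried out outside a text editor. The following is a minimal sketch in Python, which is not one of the tools covered in this handout; it assumes Python's built-in re module, whose regex dialect differs in some details from those of TextPad and BBEdit. The word list is invented for illustration.

```python
import re

# Invented word list for illustration; in practice the words would come from a corpus.
words = ["male", "mile", "mole", "mule", "meal", "smile", "mle"]

# The period stands for exactly one arbitrary character.
pattern = re.compile(r"m.le")

# fullmatch() requires the whole word to fit the pattern,
# so "smile" (extra letter) and "mle" (no letter for the period) are rejected.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

Note that the period must be filled by some character: it does not stand for ‘nothing’, which is why mle is not retrieved.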

2. TWO REGEX TOOLS

2.1 TEXTPAD (WINDOWS)

This shareware software package can be downloaded at www.textpad.com. Before you work with it, go to the CONFIGURE menu, choose the PREFERENCES command and then the EDITOR subcommand and activate the USE POSIX control box. TextPad has two types of search commands, both in the SEARCH menu: FIND and FIND IN FILES. The first is shown in Figure 1a.

Figure 1a: TextPad FIND dialogue box

When working with regular expressions, make sure that the REGULAR EXPRESSION control box in the dialogue box is activated. The regex pattern is typed into the FIND WHAT box. If you click on FIND, the next occurrence of your pattern in the currently open document will be found. If you click on MARK ALL, TextPad will mark all lines containing an occurrence of your search pattern. You can then use the BOOKMARKED LINES subcommand from the COPY OTHER command in the EDIT menu to copy all marked lines and paste them into a new document. You can also search all open documents by activating the IN ALL DOCUMENTS control box. Note that you can perform case-sensitive and case-insensitive searches by activating or deactivating the appropriate control box. TextPad can also search multiple files in a single pass, even if they are not currently open. To do this, you use the FIND IN FILES command, whose dialogue box is shown in Figure 1b.

Corpus Linguistics © 2003 Anatol Stefanowitsch

Figure 1b: TextPad FIND IN FILES dialogue box

Here you have the same basic options as before, but in addition you can specify a file type (e.g. .txt) in the IN FILES box and a folder in the IN FOLDER box. TextPad will then search all files of the specified type in the specified folder, and create a new document listing all lines containing an occurrence of your search pattern (to do this, make sure the radio button ALL MATCHING LINES is activated).
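The logic behind such a multi-file search can be sketched in a few lines of Python. This is not how TextPad itself is implemented; it is an illustrative sketch under stated assumptions: the function name find_in_files, its parameters, and the folder name "corpus" are all hypothetical, and the pattern is interpreted in Python's regex dialect rather than TextPad's.

```python
import re
from pathlib import Path


def find_in_files(pattern, folder, file_type=".txt", ignore_case=False):
    """Return (file name, line number, line) for every line containing a match."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    # Only files of the specified type in the specified folder are searched.
    for path in sorted(Path(folder).glob("*" + file_type)):
        # errors="replace" keeps the search going if a file contains odd bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # "corpus" is a hypothetical folder name; replace it with your own.
    for name, number, line in find_in_files(r"m.le", "corpus"):
        print(f"{name}:{number}: {line}")
```

The files are only read, never written to, in keeping with the advice not to modify your original corpus files.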

2.2 BBEDIT LITE (MAC)

This freeware software package can be downloaded at www.barebones.com. BBEdit’s FIND & REPLACE dialogue box is shown in Figure 2. When working with regular expressions, make sure that the USE GREP control box is activated.

Figure 2: BBEdit FIND & REPLACE dialogue box

The regex pattern is typed into the SEARCH FOR box. If you click on FIND, the next occurrence of your pattern in the currently open document will be found. If you click on FIND ALL, a new document is created, which lists all lines containing an occurrence of your search pattern. Note that you can perform case-sensitive and case-insensitive searches by activating or deactivating the appropriate control box. Like TextPad, BBEdit can search multiple files in a single pass. To do this, you simply activate the MULTI-FILE SEARCH control box and then choose the folder containing the files you want to search using the right OTHER switch. The name of the folder which you have selected will appear in the lowest of the three text boxes. Again, by using the FIND ALL command, you can generate a document listing all lines that contain your search pattern (along with the name of the file in which each was found).

3. TWO DIALECTS OF REGEX

Table 1 lists the important regex characters in TextPad and BBEdit:

Table 1: Regex characters in TextPad and BBEdit

TEXTPAD   BBEDIT LITE   EXPLANATION
.         .             Any character (including whitespace characters) except a line break
[xyz]     [xyz]         Any of the characters x, y, z
                        Example: b[aeiou]t finds bat, bet, bit, bot, and but
[a-z]     [a-z]         Any character from a to z in the ASCII table
[^xyz]    [^xyz]        Any character except x, y, z
                        Example: b[^u]t finds e.g. bat, bit, and bet, but not but
^         ^             Beginning of a line (unless used in square brackets, cf. preceding entry)
$         $             End of a line (unless used in square brackets)
\<                      Left word boundary (beginning of a word)
\>                      Right word boundary (end of a word)
                        Example: ing\> finds ing at the end of a word, as in running, thinking, and ring
\t        \t            Tab
\f        \f            Page break (Form Feed)
\n        \n            Line break
          \r            Line break (Mac)
*         *             Zero or more occurrences of the preceding character
                        Example: but*s finds bus, buts, and butts; f[aeiou]*l finds e.g. fail, foil, feel, fool, foul, foal, etc.
?         ?             Zero or one occurrence of the preceding character
                        Example: but?s finds bus and buts; honou?r finds honor and honour
+         +             One or more occurrences of the preceding character
                        Example: but+s finds buts and butts, but not bus
{x}                     Exactly x occurrences of the preceding character
{x,}                    At least x occurrences of the preceding character
{x,y}                   At least x, but no more than y occurrences of the preceding character
(x|y)     (x|y)         Either x or y
                        Example: f(a|i)t finds fat or fit; (a|the) finds a and the; (a|the|this) finds a, the, and this
\         \             Cancels the status of a character as a wildcard; e.g. ? stands for zero or one occurrence of the preceding character, but \? finds question marks
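The behaviour of the quantifiers and of alternation can be checked with a short Python sketch. Python's re module is a third dialect that is not covered in this handout, so this is an approximation under stated assumptions: in particular, Python has no \< and \>, and the sketch uses \b (any word boundary) as a stand-in. The sample text and the helper name find_all are invented for illustration.

```python
import re

# Invented sample text for illustration.
text = "bus buts butts honor honour fat fit fot running ring ringer"


def find_all(pattern):
    """Return every non-overlapping match of pattern in the sample text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Assumption: \b approximates TextPad's separate \< and \> operators.
print(find_all(r"\bbut*s\b"))     # zero or more occurrences of t
print(find_all(r"\bbut?s\b"))     # zero or one occurrence of t
print(find_all(r"\bbut+s\b"))     # one or more occurrences of t
print(find_all(r"\bhonou?r\b"))   # optional u
print(find_all(r"\bf(a|i)t\b"))   # either a or i
print(find_all(r"ing\b"))         # ing only at the end of a word
```

Comparing the first three results shows how *, ?, and + differ only in how many occurrences of the preceding character they allow.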

In addition, there are some predefined expressions for whole classes of characters, as shown in Table 2:

Table 2: Regex character classes in TextPad and BBEdit

TEXTPAD     BBEDIT LITE   EXPLANATION
[:alpha:]                 Any alphabetical character
[:lower:]                 Any lowercase alphabetical character
[:upper:]                 Any uppercase alphabetical character
[:alnum:]   \w            Any alphanumeric character
[:word:]                  Any alphanumeric character, [...], and [...]
            \W            Any character (including whitespace) except alphanumeric characters
[:digit:]   \d or #       Any numerical character
            \D            Any character except numerical characters
[:blank:]                 Space or tab
[:space:]   \s            Any whitespace character
[:graph:]   \S            Any character except whitespace characters
[:punct:]                 Any character except alphanumeric and whitespace characters
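The backslash classes in the BBEdit column also exist in Python's re module, so their effect can be illustrated with a short sketch. Two caveats apply: Python does not support the bracketed class names of the TextPad column (an explicit range such as [A-Z] is used below as a rough stand-in for [:upper:]), and Python's \w additionally matches the underscore. The sample string is invented.

```python
import re

# Invented sample string; \t is a tab character.
sample = "Room 101, floor 3:\tOpen 9-5!"

digits = re.findall(r"\d+", sample)       # runs of numerical characters
words = re.findall(r"\w+", sample)        # runs of alphanumeric characters
spaces = re.findall(r"\s", sample)        # single whitespace characters
uppers = re.findall(r"[A-Z]", sample)     # rough stand-in for [:upper:]
puncts = re.findall(r"[^\w\s]", sample)   # rough stand-in for [:punct:]

print(digits)
print(words)
print(spaces)
print(uppers)
print(puncts)
```

The last pattern combines a negated set with two classes: it retrieves every character that is neither alphanumeric nor whitespace.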

4. EXERCISES

1. For each of the following adjectives, design a regex pattern that will retrieve all of its forms:

TALL (tall, taller, tallest) FIT (fit, fitter, fittest) NICE (nice, nicer, nicest) SCARY (scary, scarier, scariest)

2. For each of the following nouns, design a regex pattern that will retrieve all of its forms:

BOOK (book, books) CHILD (child, children) BUS (bus, buses) LEAF (leaf, leaves) WOMAN (woman, women) MOUSE (mouse, mice)

3. For each of the following verbs, design at least one regex pattern that will retrieve all of its forms:

WALK (walk, walks, walking, walked) HIT (hit, hits, hitting) FLIP (flip, flips, flipping, flipped) SIT (sit, sits, sitting, sat) STEAL (steal, steals, stealing, stole, stolen) FIND (find, finds, finding, found) SING (sing, sings, singing, sang, sung) TAKE (take, takes, taking, took, taken) FLY (fly, flies, flying, flew, flown) WREAK (wreaks, wreaked, wrought, wreaking)

Ger. SPRINGEN (spring, springe, springst, springt, springen, sprang, sprangst, sprangt, sprangen, gesprungen)

4. Use TextPad or BBEdit to search a 1-million-word corpus (like BROWN, FROWN, LOB, FLOB, etc.) for some of the patterns you have designed.
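Before running a pattern on a large corpus, it can help to test it against a list of forms it should and should not retrieve. The following Python sketch shows one possible way to do this; the helper name check_pattern is hypothetical, Python's regex dialect is assumed rather than TextPad's or BBEdit's, and the patterns tested are examples from Table 1 rather than solutions to the exercises.

```python
import re


def check_pattern(pattern, wanted, unwanted=()):
    """Return (missed, overmatched): wanted forms the pattern fails to match,
    and unwanted forms it matches anyway."""
    regex = re.compile(pattern)
    missed = [w for w in wanted if not regex.fullmatch(w)]
    overmatched = [w for w in unwanted if regex.fullmatch(w)]
    return missed, overmatched


# but+s should retrieve buts and butts, but not bus.
print(check_pattern(r"but+s", ["buts", "butts"], ["bus"]))

# but?s allows at most one t, so butts is reported as missed.
print(check_pattern(r"but?s", ["bus", "buts", "butts"]))
```

A pattern is ready for the corpus when both lists come back empty for the forms you have thought of.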