Lecture 2: Regular Expressions, Language Models, and Perl

Lecturer: Mark Hasegawa-Johnson ([email protected])
TA: Sarah Borys ([email protected])
May 19, 2005

1 Bash, Sed, Awk

1.1 Installation

If you are on a unix system, bash, sed, gawk, and perl are probably already installed. If not, ask your system administrator. If you are on Windows, download the Cygwin setup program from http://www.cygwin.com. Choose the “advanced installation” option, so that you can choose to install useful programs such as perl that would not be installed by default. Cygwin installation can be run as many times as you like; anything already installed on your PC will not be re-installed. In the screen that asks you which pieces of the OS you want to install, be sure to select (DOC)→(man), (Interpreters)→(gawk,perl), and (TEXT)→(less). I also recommend (Math)→(bc), a simple text-based calculator, and (Network)→(inetutils,openssh). You can also install a complete X-windows server and set of clients from (XFree86)→(fvwm,lesstif,XFree86-base,XFree86-startup,etc.), allowing you to (1) install X-based programs on your PC, and (2) run X-based programs on any unix computer on the network, with I/O coming from windows on your PC. Setting up X requires a little extra work; see http://xfree86.cygwin.com.

If cygwin is installed in your computer in the directory c:/cygwin, it will create a stump of a unix hierarchy starting in that directory. For example, the directory c:/cygwin/usr/local/bin is available under cygwin as /usr/local/bin. Arbitrary directories elsewhere on your c: drive are available as /cygdrive/c/... In order to use cygwin effectively, you need to set the environment variables HOME (to specify your home directory), DISPLAY (if you are using X-windows), and most importantly, PATH (to specify directories that should be searched for useful programs; this list should include at least /bin;/usr/bin;/usr/local/bin;/usr/X11R6/bin). In order to use bash, sed, awk, and perl, you will also need a good ASCII text editor. You can download Xemacs from http://www.xemacs.org.

1.2 Reading

Manual pages are available, among other places, at http://www.gnu.org/manual. If your computer is set up correctly, you can also read the bash man page by typing 'man bash' at the cygwin/bash prompt.

• Read the bash manual page, sections: (Basic Shell Features)→(Shell Syntax, Shell Commands, Shell Parameters, Shell Expansions) and (Shell Builtins)→(Builtins, Bash Conditional Expressions, Shell Arithmetic, Shell Scripts). Alternatively, you can try reading the tutorial chapter in the O'Reilly bash book.

• Read the sed manual page, or the sed tutorial chapter in the O'Reilly 'sed and awk' book.

• 'gawk' is another name for 'awk' on GNU-based systems — the 'gawk' program has a few more features than the original 'awk' program. You may eventually want to learn gawk or awk, but it's not required. The section of the gawk man page called 'Getting Started with awk' is pretty good. So are the tutorial chapters in the O'Reilly 'sed and awk' book.

1.3 A bash/sed example

Why not use C all the time? The answer is that some tasks are easier to perform with other programming languages:

• Manipulate file hierarchies: use bash and sed.
• Simple manipulation of tabular text files: use gawk.
• Manipulate text files: use perl.
• Manipulate non-text files: use C.

perl can do any of these things, but isn't very efficient for numerical calculations. C can also do any of the things listed, but perl has many builtin tools for string manipulation, so it's worthwhile to learn perl. gawk is easier than perl for simple manipulation of tabular text; it's up to you whether or not you want to try learning it.

bash is a POSIX-compliant command interpreter, meaning that, like the DOS shell, you can type in a program name, and the program will run. Unlike the DOS shell, bash is also a pretty good programming language (not as good as BASIC or perl, but better than the DOS shell or tcsh). For example, suppose you want to search through the entire ${HOME}/timit/train hierarchy*, apply the C program “extract” to all WAV files in order to create MFC files, create a file with extension TRP containing only the third column of each PHN file, and then move all of the resulting files to a directory hierarchy under ${HOME}/newfiles/train (but the new directory hierarchy doesn't exist yet). You could do all that by entering the following, either at the bash command prompt or in a shell script:

  if [ ! -e ${HOME}/newfiles/train ]; then
    mkdir ${HOME}/newfiles;
    mkdir ${HOME}/newfiles/train;
  fi
  for dr in dr{1,2,3,4,5,6,7,8}; do
    if [ ! -e ${HOME}/newfiles/train/${dr} ]; then
      echo mkdir ${HOME}/newfiles/train/${dr};
      mkdir ${HOME}/newfiles/train/${dr};
    fi
    for spkr in `ls ${HOME}/timit/train/${dr}`; do
      if [ ! -e ${HOME}/newfiles/train/${dr}/${spkr} ]; then
        echo mkdir ${HOME}/newfiles/train/${dr}/${spkr};
        mkdir ${HOME}/newfiles/train/${dr}/${spkr};
      fi
      cd ${HOME}/timit/train/${dr}/${spkr};
      for file in `ls`; do
        case ${file} in
          *.wav | *.WAV )
            MFCfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/wav/mfc/;s/WAV/MFC/'`;
            echo Copying file ${PWD}/${file} into file ${MFCfile};
            extract ${file} ${MFCfile};;
          *.phn | *.PHN )
            TRPfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/phn/trp/;s/PHN/TRP/'`;
            echo Extracting third column of file ${PWD}/${file} into file ${TRPfile};
            gawk '{print $3}' ${file} > ${TRPfile};;
        esac
      done
    done
  done

* There are several copies of TIMIT floating around the lab. You can also buy your own copy for $100 from http://www.ldc.upenn.edu, or download individual files from that web site for free.

Once you have created the entire new hierarchy, you can list the whole hierarchy using

  ls -R ${HOME}/newfiles | less

You may have noticed by now that bash suffers from cryptic syntax. bash inherits its syntax from 'sh', a command interpreter written at AT&T in the days when every ASCII character had to be chiseled on stone tablets in triplicate; thus bash uses characters economically. Three rules will help you to use bash effectively:

1. Keep in mind that ', `, and " mean very different things. { and ${ mean very different things. [ standing alone is a synonym for the command 'test'.

2. When trying to figure out how bash parses a line, you need to follow the seven steps of command expansion in the same order that bash follows them: brace expansion, tilde expansion, variable expansion, command substitution, arithmetic expansion, word splitting, and filename expansion, in that order. No, really, I'm serious. Trying to read bash as pseudo-English leads only to frustration.

3. When writing your own bash scripts, trial and error is usually the fastest method. Use the 'echo' command frequently, with appropriate variable expansions at each level, so you can see what bash thinks it is doing.

About awk or gawk: the only thing you absolutely need to know about gawk is that if you type the following command into the bash prompt, the file foo.txt will contain the M'th, N'th, and P'th columns from the file bar.txt (where M, N, and P should be any single digits):

  awk '{printf("%s\t%s\t%s\n",$M,$N,$P)}' bar.txt > foo.txt
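For comparison, perl can do the same column extraction from the command line. The one-liner below is a sketch of my own, using perl's -a (autosplit) switch, which splits each input line into the array @F; note that @F is indexed from 0, so awk's $2, $4, and $6 (an arbitrary choice of columns) become $F[1], $F[3], and $F[5]:

  perl -lane 'print join("\t", $F[1], $F[3], $F[5])' bar.txt > foo.txt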

2 Perl

2.1 Reading

Read chapter 1 of Programming Perl by Larry Wall and Randall Schwartz [5]. This book is sometimes available on-line at perl.com; in the perl community, it is called “The Camel Book” in order to distinguish it from all of the other perl books available. Larry Wall (the author of perl, as well as the author of the book) is an occasional linguist with a good sense of humor. This chapter is possibly the best written introduction to perl data structures and control flow, and contains better documentation on blocks, loops, and control flow statements (if, unless, while, until) than the man pages.

The manual pages are available in HTML format on-line at http://www.perl.com/doc/. Download the gzipped tar file, and unpack it on your own PC, so that you can keep it open while you program. You can read the manual pages using “man” under cygwin, but it is much easier to navigate this complicated document set using HTML. Before programming, you should read Chapter 1 of the Camel Book, plus the perldata man page and the first halves of the perlref and perlsub manual pages. While programming, you should have an HTML browser open so that you can easily look for useful information in the three manual pages just listed, and also in the perlsyn, perlop, perlfunc, perlref, and perlre pages.

2.2 Perl Data Types and Variable References

This section will rapidly introduce perl data types, perl syntax, and the perl method for creating variable references. If you get confused, try the perlsyn and perlref man pages.

Perl uses three data types: scalars, arrays, and hash tables. Variables are never declared in perl; perl creates them as soon as they are used. Therefore the syntax must specify the type of a variable. Scalars always start with $, e.g., $foo is a scalar variable. Arrays always start with @, e.g., @foo is an array. Hash tables always start with %, e.g., %foo is a hash table.

A scalar variable always starts with $. Scalars may be numbers, strings, or references to other objects. For example, the code below prints the sentences “Interesting programs fail the 1st time” and then “Interesting programs fail the 2st time” to the special file STDOUT (the screen), with a newline at the end of each sentence:

  $a=1;
  $b="Interesting programs";
  print STDOUT "$b fail the ${a}st time\n";
  $c=\$a;
  $a=2;
  $d=$$c;
  $d=-1;
  if ($a >= 2) {
      print STDOUT "$b fail the $${c}st time\n";
  } else {
      print STDOUT "test failed\n!";
  }

Here are a few things to notice.

• Variables need not be declared in perl. In the code above, all variables are created as soon as they are used.

• ${a} and $a mean the same thing. The brackets are used to specify the domain of the $ operator; without the brackets in line 3, perl would look for a variable called $ast.

• The line “$c=\$a” creates a reference to $a, and stores the reference in $c. From this point on, $$c is a synonym for $a. A reference in perl is similar to a pointer in C, if you find that analogy useful. If you hate C, you may find it more useful to think of references as “synonyms” (an interpretation that is somewhat closer to the truth).

• Strings can be quoted using either double quotes (") or single quotes ('). Variables in a double-quoted string are evaluated; variables in a single-quoted string are not. Thus the line

  print STDOUT "$b fail the $${c}st time\n";

prints “Interesting programs fail the 2st time”. If the last line read

  print STDOUT '$b fail the $${c}st time\n';

then the program would print out exactly the characters “$b fail the $${c}st time\n”.

• The if-else syntax works just like it should. Perl only does two unique things with if-else. First, the body must be enclosed in brackets ({ and }), so it's always clear where the if-else begins and ends. Second, since every scalar can be evaluated as either a string or a number, numeric comparisons and string comparisons are completely different functions. Numeric expressions look like ($a == $b), ($a >= $b), ($a < $b), and so on; string expressions look like ($a eq $b), ($a ne $b), and so on.

An array variable always starts with @. An array is just a list of scalars, stored and indexed in numerical order, starting with element number 0. The elements of an array are scalars, thus the elements of array @a are $a[0] through $a[$#a]. The notation “$#a” refers to the index of the last element stored in array @a. The notation “1..5” creates a list of all digits between 1 and 5; similarly “'a'..'z'” creates a list of all letters between 'a' and 'z'. A useful syntax using arrays is the “foreach” syntax, used like this:

  @array1=(0..2);
  foreach $i (0..2) {
      $array1[$i] = $i * $array1[$i];
  }
  foreach $entry (@array1) {
      print STDOUT "$entry\n";
  }
  $aref=\@array1;
  ${$aref}[2]=3;
  print STDOUT "@array1\n";
  print STDOUT "@{$aref}\n";

The code above prints the numbers 0, 1, and 4 on lines by themselves. Then it prints “0 1 3” on a line by itself, twice in a row. The line $aref=\@array1 creates a reference to @array1, and places the reference into $aref; from that point on, @$aref is a synonym for @array1, and $$aref[0] is a synonym for $array1[0]. In line 9, the brackets ({ and }) are used to specify the order of evaluation: $aref should be evaluated first to find the reference to @array1, and only then should its third element be extracted. In fact, the brackets in lines 9 and 11 are not necessary, because, without brackets, perl would evaluate these arguments in exactly the order specified here. However, it never hurts to put in extra brackets, if you're not sure of the order in which perl will evaluate your statement.

The third data type, the hash table, is the most important data structure in perl. A hash table always starts with %. A hash table is like an array, but the index (called the “key”) can be any scalar — a digit, a string, or even a reference. If the string “something” is one of the keys in hash table %foo, then the corresponding value is accessed using $foo{"something"}. Hash tables are initialized using even-length lists, containing key-value pairs. “keys %foo” returns a list of the keys; “values %foo” returns a list of the values. Thus, the following code prints “1 apple apple” on the first line and “65 bob 65” on the second line. The order “1 apple”, and the order “65 bob”, are specified by the “sort” function. Entries in a hash table may be stored in any order, so without sorting the output, you never know in what order they will be.

  %foo = (1, "bob");
  $foo{"apple"} = 65;
  @keylist = sort keys %foo;

  print STDOUT "@keylist apple\n";
  @vallist = sort values %foo;
  print STDOUT "@vallist $foo{'apple'}\n";

Doubly-indexed arrays and hash tables are extremely useful, and their syntax is cryptic. It is possible to create doubly-indexed tables on the fly, just like any other perl variable. For example, the following code prints out the lines “0 0 0, 0 0 0”, “0 1 2, 0 1 2”, and “0 2 4, 0 2 4”:

  foreach $i (0..2) {
      foreach $j (0..2) {
          ${$array2[$i]}[$j] = $i * $j;
          ${$hash2{$i}}{$j} = $i * $j;
      }
  }
  foreach $i (sort keys %hash2) {
      @foo = sort values %{$hash2{$i}};
      print STDOUT "@foo, @{$array2[$i]}\n";
  }

The brackets in lines 3 and 4 are necessary!! If you write $$array2[$i][$j], perl will try to dereference “$array2” as a scalar first, rather than doing what you wanted (dereferencing the array element $array2[$i]). However, the notation ${$array2[$i]}[$j] is really ugly and cumbersome, even by perl standards. Therefore perl provides the following mnemonically useful synonyms:

  $array2[$i][$j] is a synonym for ${$array2[$i]}[$j]
  $hash2{$i}{$j}  is a synonym for ${$hash2{$i}}{$j}

Therefore the code above could be written as

  foreach $i (0..2) {
      foreach $j (0..2) {
          $array2[$i][$j] = $i * $j;
          $hash2{$i}{$j} = $i * $j;
      }
  }
  foreach $i (sort keys %hash2) {
      @foo = sort values %{$hash2{$i}};
      print STDOUT "@foo, @{$array2[$i]}\n";
  }

2.3 Perl I/O

When the perl program “foo.pl” is called using the command line “foo.pl A B C,” the arguments are stored into a special array called @ARGV. Thus the following program prints out its arguments, one on each output line:

  #!/usr/bin/perl
  # This is a comment.
  # This is also a comment.
  # This comment is here to tell you that
  # the line below will print, to standard output,
  # one command line argument per line
  #
  foreach $i (0..$#ARGV) {
      print STDOUT "$ARGV[$i]\n";
  }

The line #!/usr/bin/perl is a special unix syntax. On a unix system, if the file “foo.pl” starts with this line, then one may type “foo.pl A B C” in order to run the program. Specifically, unix will open the file “foo.pl,” and look on its first line to see what interpreter should be used to run it. Upon discovering that /usr/bin/perl is the correct interpreter, unix will fire up /usr/bin/perl, and feed it the rest of the commands in the file. All other lines starting with # are treated as comments, and ignored by perl. Always comment your code.

Files may be opened using the “open” function. Opened files are accessed via filehandle variables. Filehandle variables are just alphanumeric strings in perl. The syntax “<FILE1>” reads in one line of text from FILE1. For example, suppose that the code below is in a program called “foo.pl.” Then the unix command line “foo.pl file1.txt file2.txt” will print the contents of file1.txt to STDOUT, and will then wait for the user to enter text into the text prompt. Any text that the user enters, up to the first control-D character, will be written out to file2.txt:

  #!/usr/bin/perl
  # Program foo.pl
  # Author: Mark Hasegawa-Johnson, 5/18/2005
  #
  # Function: Read from $ARGV[0] to STDOUT, write STDIN to $ARGV[1]
  #
  # Usage: foo.pl file1.txt file2.txt
  #
  # Open file1.txt. If the open fails, stop the program, and print out an
  # error message. The variable $! is a perl special variable,
  # containing the system error message returned when perl failed to open file1.txt
  open(FILE1,$ARGV[0]) || die "Unable to read from $ARGV[0]: $!";

  # Open file2.txt for writing.
  # The syntax ">$ARGV[1]" is a special syntax meaning the file
  # should be opened for writing, not for reading
  open(FILE2,">$ARGV[1]") || die "Unable to write to $ARGV[1]: $!";

  # Read all entries from file1.txt, and write them to stdout
  # The "while" command causes its loop to continue
  # until the end of the file is reached
  while( $input = <FILE1> ) {
      print STDOUT "$input";
  }
  close(FILE1);

  # Read from STDIN until the user hits control-D; write these to FILE2
  print STDOUT "Type as much as you like; end by hitting CTRL-D\n";
  while( $input = <STDIN> ) {
      print FILE2 $input;
  }
  close(FILE2);

2.4 Regular Expressions

Regular expressions are the main reason for using perl. Other languages have regular expressions that are almost as convenient, but not quite. For more information on ideas presented in this section, see the perlre man page.

A regular expression is a regular grammar [3], plus some other stuff. Let's talk about regular grammars first. A grammar is a set of rules, or any other specification, that distinguishes between “legal” and “illegal” strings. A parser is an algorithm or computer program that determines whether or not a given string satisfies a given grammar. A regular grammar (also called a finite state grammar) is any grammar that can be parsed by a finite state machine. Equivalently, a regular grammar is any grammar that can be specified by appropriate combinations of the following three rules.

1. A concatenation rule specifies that subgrammar C is composed of the concatenation of any string from subgrammar A, followed by any string from subgrammar B. In regular expression syntax, concatenation is expressed by sequence: C=AB.

2. A disjunction rule specifies that subgrammar C is composed of either any string from subgrammar A, OR any string from subgrammar B. In regular expression syntax, disjunction is expressed by |: C=(A|B).

3. A Kleene closure rule specifies that subgrammar C is composed of an arbitrary non-negative number of consecutive strings, each of which is selected from subgrammar A. In regular expression syntax, Kleene closure is represented using *: C=(A)*. The parentheses are optional if A contains only one character.

For example, consider the following regular expression:

  G1: (sc|g)o*(al|re)

Grammar G1 matches the following strings: “score”, “goal”, “goooooal”, “scooooooore”. It also matches a few strings that were probably not intended by the designer: “scoal”, “gore”, “scre”, “gal”, and “scal”, for example.

Perl provides the following extra matching operators. All of these can be broken down into various combinations of concatenation, disjunction, and closure, but it is often useful to have special notation for these frequent events:

• . — Any Character. Match any single character. Thus the grammar Mar. matches the strings “Mark”, “Marc”, “Mary”, and “Mart”. The grammar illi.* matches “illinois”, “illini”, “illiac”, and “illi45 18[2b}3”.

• + — Positive-definite Kleene closure. Concatenate one or more sub-grammars (remember that * requires zero or more). For example, (sc|g)o+(al|re) is almost the same as grammar G1, except that it doesn't match the strings “gal”, “scre”, “scal”, or “gre”.

• ? — Optional Character. Match zero or one instances of the preceding character. s?he matches the strings “he” and “she”.

• [] — Sub-vocabulary. Match exactly one character from the list of characters specified within the brackets. For example, the grammar [cdb](a|o|ea)[tgr] matches “cat”, “dog”, “bear”, “cear”, “dear”, “car”, et cetera.

• [a-z] — Range. Match exactly one character from the characters whose ASCII value is between 'a' and 'z'. The grammar [a-z]+ matches any string of lowercase letters.

• [^A] — Negated subvocabulary. Match any character except those specified in list A.

• \ — Quote. Interpret the following character literally, instead of using its special meaning. For example, the grammar \[[a-z]+\] matches the strings “[a]”, “[aoidb]”, and “[bigbadwolf]”.

• Special Matching Characters:
  – \d matches any digit
  – \s matches any whitespace (space, tab, newline) — thus \s+ matches any sequence of whitespace characters.
  – \S matches any non-whitespace
  – \w matches any character that can be part of a word — alphanumeric or underscore (_), but no other punctuation. Thus \w+ matches any single word.

  – \n matches a newline

Regular expressions in perl are not strictly regular. The following characters are context-dependent: they force the rest of the grammar to match only if the surrounding context is correct.

• ^ — match the beginning of the line. For example, ^\s+# matches any # symbol and the whitespace that precedes it, but ONLY if the # is the first non-whitespace character on the line.

• $ — match the end of the line. ;\s+$ matches a semicolon and its following whitespace, but only if the semicolon is the last non-whitespace on the line.

• \b — match a word boundary. The regular expression s?he matches the strings “she” and “he” regardless of their context — thus, for example, the string “he” in the word “withheld” is matched. In order to require the match to be a complete word, you can specify the grammar \bs?he\b.

Regular expressions are used, in perl, in one of the following four syntactic constructions:

1. Matching. Syntax:

VARIABLE =~ /REGEX/

For example,

  #!/usr/bin/perl
  print STDOUT "Do you want to proceed?";
  # Read in the user's response
  $response = <STDIN>;
  # If response was yes, Yes, yeah, yippeee, etc., then...
  if ($response =~ /^[yY].*/) {
      print STDOUT "Good! So do I\n";
  }

In this example, notice the “line begin” marker ^ at the beginning of the regular expression. A regular expression match is considered successful if it matches ANY PART of the variable — thus the expression [yY].* would match the final “y” in “no way”. The character ^ specifies that the match should fail unless the detected [yY] character is the first character in the line entered by the user. 2. Substitution. Syntax:
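As an additional illustration of the matching syntax, the following sketch (my own example, with arbitrary test strings) shows how the word-boundary anchor \b described above restricts a match to complete words:

  #!/usr/bin/perl
  # s?he matches "he" anywhere, even inside "withheld";
  # \bs?he\b requires a complete word.
  foreach $string ("he said", "withheld") {
      if ($string =~ /s?he/) {
          print STDOUT "s?he matches \"$string\"\n";
      }
      if ($string =~ /\bs?he\b/) {
          print STDOUT "\\bs?he\\b matches \"$string\"\n";
      }
  }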

VARIABLE =~ s/REGEX/REPLACEMENT/[MODIFIERS]

The optional modifiers may include “g” (perform the substitution as many times as possible, instead of performing it only once), or “i” (case insensitive matching). The following example reads in words from the user, and prints back the same words, with all digits replaced by the character *:

  #!/usr/bin/perl
  # Read in user responses until user enters control-D
  while($response = <STDIN>) {
      # Replace any digits with the character *
      $response =~ s/\d/\*/g;
      # Print the response
      print STDOUT $response;
  }
  print STDOUT "Last response was $response";

3. Translation. Syntax:

VARIABLE =~ tr/FROM_VOCAB/TO_VOCAB/

Translation is special, in that the TO_VOCAB must contain exactly as many characters as the FROM_VOCAB. Any character in the FROM_VOCAB is replaced by the corresponding character in the TO_VOCAB. This syntax is most often used to uppercase or lowercase a string, thus

  #!/usr/bin/perl
  # Read in user responses until user enters control-D
  while($response = <STDIN>) {
      # Uppercase any lowercase letters
      $response =~ tr/a-z/A-Z/;
      # Print the uppercased response
      print STDOUT $response;
  }
  print STDOUT "Last response was $response";

4. String Extraction. String extraction is actually a matching feature, not a syntax. It works like this: any time perl encounters the “(” character in a regular expression, it extracts the string matched by the regular expression between “(” and its matching “)”, and inserts that substring into one of the special variables $1 through $9. Thus, for example, to extract the components of a fully specified pathname, one could write

  #!/usr/bin/perl
  # Read in pathnames until user enters control-D
  while($pathname = <STDIN>) {
      # Check to see if this is a fully specified path
      if($pathname =~ /(.*)\/((.*)\.(.*))/) {
          # If so, extract the path, filename, root, and extension
          $path = $1;
          $filename = $2;
          $rootfilename = $3;
          $extension = $4;
          # Print them to standard output
          print STDOUT "Path is $path, Filename $filename is composed of\n";
          print STDOUT "  root name $rootfilename and extension $extension\n";
      } else {
          # If not, print an error message
          print "Unable to parse pathname\n";
      }
  }

3 N-Gram Language Modeling

A probabilistic grammar of language L may be considered useful if it satisfies one of the following two objectives:

1. Specifies the probability of observing any particular string of words, W = [w_1, \ldots, w_M], in language L.

2. Specifies the various ways in which the meanings of words [w_1, \ldots, w_M] may be combined in order to compute a sentence meaning, and specifies the probability that any one of the acceptable sentence meanings is what the talker was actually trying to say.

Shannon [4] proposed a simple language model satisfying criterion number 1, but not criterion number 2. His language model has since been called the “N-gram model.” The model proposes that the probability of any sentence is:

p(W) = \prod_{m=1}^{M} p(w_m | w_{m-N+1}, \ldots, w_{m-1})    (1)

where the words w_{-N}, \ldots, w_{-1} are defined to be the special symbol “SENTENCE START.” If the length of the N-gram, N, is larger than the length of the sentence, M, a correct N-gram specifies the probability of the sentence exactly. In practice, most N-grams are either bigrams (N = 2) or trigrams (N = 3), although a few sites have experimented with variable-length N-grams. The maximum likelihood estimates of the bigram, unigram, and “zero-gram” probabilities are

p_{ML}(w|v) = \frac{C(v,w)}{\sum_w C(v,w)}    (2)

p_{ML}(w) = \frac{C(w)}{\sum_w C(w)}    (3)

p_{ML}(\cdot) = \frac{1}{N(\cdot)}    (4)

where the “token count” C(v,w) is the number of times that the word sequence (v,w) was observed in a training corpus, and v and w are any two words in the dictionary. The “type count” N(\cdot) is the number of distinct words (so p_{ML}(\cdot) is the hypothesis that all words are equally likely). Training corpora vary in size: real conversational speech corpora may be as large as 1-10 million word tokens (Switchboard is 1 million, the Fisher corpus is 10 million); in order to get larger corpora, it is necessary to use written texts, e.g. the Gigaword corpus contains 1 billion word tokens.

Because of the infinite productivity of human language, there are always many perfectly reasonable word sequences that are not observed in any finite-sized training corpus. If we assign zero probability to unseen bigrams, then the recognizer is guaranteed to fail to recognize them if they ever occur in test data. It is usually better to “smooth” the N-gram using deleted interpolation or backoff. Deleted interpolation [2] is just a linear interpolation of probabilities:
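To make the token counts C(v,w) in Eqs. 2 and 3 concrete, here is a minimal perl sketch (my addition; the hash names %C and %C1, and the input format of one sentence per line on STDIN, are arbitrary choices):

  #!/usr/bin/perl
  # Accumulate unigram counts C(w) and bigram counts C(v,w).
  while ($line = <STDIN>) {
      @words = split(/\s+/, $line);
      $v = "SENTENCE_START";            # sentence-start padding symbol
      foreach $w (@words) {
          next if ($w eq "");           # skip empty field from leading whitespace
          $C1{$w}++;                    # unigram count C(w)
          $C{$v}{$w}++;                 # bigram count C(v,w), stored sparsely
          $v = $w;
      }
  }
  # Maximum likelihood bigram estimate, Eq. 2
  foreach $v (sort keys %C) {
      $total = 0;
      foreach $w (keys %{$C{$v}}) { $total += $C{$v}{$w}; }
      foreach $w (sort keys %{$C{$v}}) {
          printf("p(%s|%s) = %g\n", $w, $v, $C{$v}{$w}/$total);
      }
  }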

p_I(w|v) = \lambda p_{ML}(w|v) + (1 - \lambda) p_{ML}(w)    (5)

where 0 ≤ λ ≤ 1; multi-level interpolation can be used to combine maximum likelihood probabilities from several levels, e.g., 4-gram, 3-gram, 2-gram, 1-gram, and 0-gram (uniform). “Backoff” is like interpolation, but the coefficient λ depends on v, and the backoff probability p_1(w) is not necessarily a maximum likelihood estimate:

p_B(w|v) = \lambda(v) p_{ML}(w|v) + (1 - \lambda(v)) p_1(w)    (6)
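As a sketch of Eq. 5 (my addition, reusing the hypothetical %C and %C1 count hashes from the counting sketch above, with an arbitrary weight λ = 0.5):

  # Deleted interpolation, Eq. 5: mix ML bigram and unigram estimates.
  $lambda = 0.5;
  sub p_interp {
      my ($v, $w) = @_;
      my ($cv, $ctotal) = (0, 0);
      foreach my $u (keys %{$C{$v}}) { $cv += $C{$v}{$u}; }    # sum_w C(v,w)
      foreach my $u (keys %C1)       { $ctotal += $C1{$u}; }   # sum_w C(w)
      my $pb = $cv ? ($C{$v}{$w} || 0) / $cv : 0;              # p_ML(w|v)
      my $pu = $ctotal ? ($C1{$w} || 0) / $ctotal : 0;         # p_ML(w)
      return $lambda * $pb + (1 - $lambda) * $pu;
  }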

Many different schemes can be used to calculate λ(v) and p_1(w). The most common is the “add-one” probability estimate, also called Good-Turing backoff. Good-Turing backoff is computed by adding D ≈ 1 to the count of every event in the database:

p_{GT}(w|v) = \frac{C(v,w) + D}{\sum_w (C(v,w) + D)}    (7)

(Exercise for the reader: show that Eq. 7 can be written in the form of Eq. 6.) Finally, Kneser and Ney propose the following backoff:

p_{KN}(w|v) = \frac{\max(0, C(v,w) - D) + D\,\alpha(v,w)}{\sum_w C(v,w)}    (8)

where the backoff factor is

\alpha(v,w) = \frac{N(v,\cdot)\,N(\cdot,w)}{N(\cdot,\cdot)}    (9)

N(v,·) is the number of distinct bigrams observed in the training corpus that start with v, and N(·,w) is the number of distinct bigrams observed to end with w. Eq. 9 therefore has the interesting property that

\sum_w (\alpha(v,w) - 1) = \sum_v (\alpha(v,w) - 1) = 0    (10)

The zero-sum property in Eq. 10 has two useful consequences. First, p_{KN}(w|v) is a probability. Second, the marginal counts are correct. In equation form, these properties are written as

\sum_w p_{KN}(w|v) = 1    (11)

\sum_v p_{KN}(w|v)\, C(v) = C(w)    (12)
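The following sketch (my addition, again assuming the sparse bigram-count hash %C from the counting sketch above; the discount D = 1 is an arbitrary choice within the range discussed below) computes p_{KN}(w|v) directly from Eqs. 8 and 9:

  # Kneser-Ney bigram, Eqs. 8 and 9.
  foreach $v (keys %C) {
      foreach $w (keys %{$C{$v}}) {
          $Nv{$v}++;                # N(v,.): distinct bigrams starting with v
          $Nw{$w}++;                # N(.,w): distinct bigrams ending with w
          $Nall++;                  # N(.,.): distinct bigrams altogether
          $Cv{$v} += $C{$v}{$w};    # sum_w C(v,w)
      }
  }
  $D = 1;
  sub p_kn {
      my ($v, $w) = @_;
      my $c = $C{$v}{$w} || 0;
      my $disc = ($c > $D) ? ($c - $D) : 0;              # max(0, C(v,w)-D)
      my $alpha = $Nv{$v} * ($Nw{$w} || 0) / $Nall;      # Eq. 9
      return ($disc + $D * $alpha) / $Cv{$v};            # Eq. 8
  }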

Chen and Goodman [1] showed that Eq. 8 gives better N-gram probability estimates, in general, than Eqs. 5 or 7. They also demonstrated a method for extending Eq. 8 to allow multiple backoff, e.g., from a 4-gram to a 3-gram to a 2-gram to a unigram to a uniform distribution. They demonstrated that the optimal backoff factor D should be somewhere between D = 0.6 (ideal for a bigram that occurs only once in the training data) and D = 1.4 (ideal for a bigram that occurs three or more times in the training data).

The goal of a language model is to estimate the frequency of any particular word sequence in the test data. The ML model is exactly equal to the bigram frequency in the training data; the only reason that we use backoff is because the test and training data may be different. A useful measure of language model quality is the cross-entropy between the trained backoff model p_B(w|v) and the word frequencies C_T(v,w) in the test database:

H(T;B) = -\frac{1}{N_T} \sum_v \sum_w C_T(v,w) \log_2 p_B(w|v)    (13)

It can be shown that, by this definition, H(T;B) ≥ H(T;T), i.e., the cross-entropy is always larger than the entropy that would be obtained by using the test corpus frequencies p_T(w|v) = C_T(v,w)/C_T(v). Therefore, the better the language model is, the smaller H(T;B) becomes. Eq. 13 includes summation, over word types, of the test-corpus counts, which is equivalent to summing over word tokens, thus

H(T;B) = -\frac{1}{N_T} \sum_{n=1}^{N_T} \log_2 p_B(w_n|w_{n-1})    (14)

The “perplexity” of the language model can be roughly understood as the average number of words from which the recognizer must choose in order to recognize each new word in the test database. It is defined as

P(T;B) = 2^{H(T;B)}    (15)
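As an illustration of Eqs. 14 and 15, here is a minimal sketch (my addition, reusing the hypothetical p_kn function sketched above) that accumulates cross-entropy and perplexity over the word tokens of a test file on STDIN:

  $logsum = 0;
  $NT = 0;
  while ($line = <STDIN>) {
      @words = split(/\s+/, $line);
      $v = "SENTENCE_START";
      foreach $w (@words) {
          next if ($w eq "");
          $p = p_kn($v, $w);                # a real implementation must
          $logsum += log($p) / log(2);      # handle p = 0 for unseen words
          $NT++;
          $v = $w;
      }
  }
  $H = -$logsum / $NT;                      # Eq. 14
  printf("Cross-entropy %g bits, perplexity %g\n", $H, 2 ** $H);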

4 Homework

Download the Switchboard transcriptions from http://www.isip.msstate.edu/projects/switchboard/. Use perl or python to accumulate sufficient statistics from the first 2/3 of the Switchboard corpus (dialogues 2005 through 3999) for the estimation of a Kneser-Ney-smoothed bigram language model (Eq. 8). Then use the remaining dialogues (dialogues 4000 through 4940) to estimate the perplexity of your language model. Your code should have three distinct sections: (1) find the bigram counts C(v,w) for every word-pair observed in the training data; (2) manipulate the bigram counts in order to calculate p_{KN}(w|v); and (3) read in the test data, one transcription at a time, and estimate the perplexity of the model.

The Switchboard transcriptions include annotation of silences, background noises, word fragments, and other non-speech events. Use perl's regular expression capability to map all such events to the most reasonable language-modeling category. For example, a word fragment might be mapped to the intended word; or it might be mapped to the intended word followed by the special word token “DISFLUENCY,” i.e. bec[ause]- could become either “because” or “because DISFLUENCY.” For the purpose of language modeling, background noises such as [laughter] and [silence] and [mouthnoise] can all be eliminated from the transcription.

Notice that there are more than 30,000 distinct words in Switchboard. If you try to represent C(v,w) or p(w|v) as a fully enumerated table, you will wind up with a table of size 900M. Don't do that. Instead, use a perl hash table that includes entries for only the bigram pairs that are actually observed in the database (about 2M entries).

In order to read in all of the transcription files, you may do one of two things: (1) use perl's opendir and readdir functions to traverse the various subdirectories, or (2) use the unix ls command to list all of the desired transcription files into a master script file (e.g., type ls {2,3}*/*/*-trans.text > training.scp), and then write the perl program so that it reads, one at a time, the files specified by training.scp.
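As a starting point for option (2), here is a minimal sketch (my addition; the name training.scp comes from the ls command above, and the cleanup substitution shown is a placeholder, not a complete mapping of the Switchboard annotations):

  #!/usr/bin/perl
  # Read the list of transcription files from training.scp,
  # then read each transcription file one line at a time.
  open(SCP, "training.scp") || die "Unable to read training.scp: $!";
  while ($scpline = <SCP>) {
      chomp $scpline;
      open(TRANS, $scpline) || die "Unable to read $scpline: $!";
      while ($line = <TRANS>) {
          $line =~ s/\[laughter\]|\[silence\]|\[mouthnoise\]//g;  # placeholder cleanup
          # ... accumulate bigram counts here, as in the earlier sketches ...
      }
      close(TRANS);
  }
  close(SCP);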

References

[1] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13:359–394, 1999.

[2] Kai-Fu Lee, Hsiao-Wuen Hon, and Raj Reddy. An overview of the Sphinx speech recognition system. IEEE Trans. ASSP, 38(1):35–45, January 1990.

[3] S. E. Levinson. Structural methods in automatic speech recognition. Proc. IEEE, 73:1625–1650, 1985.

[4] Claude Shannon. The Mathematical Theory of Communication. University of Illinois Press, 1955.

[5] Larry Wall and Randall Schwartz. Programming Perl. O'Reilly and Associates, Inc., Sebastopol, CA, 1991.
