Lecture 2: Regular Expressions, Language Models, and Perl

Total Page:16

File Type:pdf, Size:1020Kb

Lecture 2: Regular Expressions, Language Models, and Perl Lecture 2: Regular Expressions, Language Models, and Perl Lecturer: Mark Hasegawa-Johnson ([email protected]) TA: Sarah Borys ([email protected]) May 19, 2005 1 Bash, Sed, Awk 1.1 Installation If you are on a unix system, bash, sed, gawk, and perl are probably already installed. If not, ask your system administrator. If you are on Windows, download the Cygwin setup program from http://www.cygwin.com. Choose the \advanced installation" option, so that you can choose to install useful programs such as perl that would not be installed by default. cygwin installation can be run as many times as you like; anything already installed on your PC will not be re-installed. In the screen that asks you which pieces of the OS you want to install, be sure to select (DOC)!(man), (Interpreters)!(gawk,perl), and (TEXT)!(less). I also recommend (Math)!(bc), a simple text-based calculator, and (Network)!(inetutils,openssh). You can also install a complete X-windows server and set of clients from (XFree86)!(fvwm,lesstif,Xfree86-base,XFree86- startup,etc.), allowing you to (1) install X-based programs on your PC, and (2) run X-based programs on any unix computer on the network, with I/O coming from windows on your PC. Setting up X requires a little extra work; see http://xfree86.cygwin.com. If cygwin is installed in your computer in the directory c:/cygwin, it will create a stump of a unix hierarchy starting in that directory. For example, the directory c:/cygwin/usr/local/bin is available under cygwin as /usr/local/bin. Arbitrary directories elsewhere on your c: drive are available as /cygdrive/c/.. In order to use cygwin effectively, you need to set the environment variables HOME (to specify your home directory), DISPLAY (if you are using X-windows), and most importantly, PATH (to specify directories that should be searched for useful programs. This list should include at least /bin;/usr/bin;/usr/local/bin;/usr/X11R6/bin). In order to use bash, sed, awk, and perl, you will also need a good ASCII text editor. You can download Xemacs from http://www.xemacs.org. 1.2 Reading Manual pages are available, among other places, at http://www.gnu.org/manual. If your computer is set up correctly, you can also read the bash man page by typing 'man bash' at the cygwin/bash prompt. • Read the bash manual page, sections: (Basic Shell Features)!(Shell Syntax, Shell Commands, Shell Parameters, Shell Expansions). (Shell Builtins)!(Bourne Shell Builtins, Bash Conditional Expres- sions, Shell Arithmetic, Shell Scripts). Alternatively, you can try reading the tutorial chapter in the O'Reilly bash book. • Read the sed manual page, or the sed tutorial chapter in the O'Reilly 'sed and awk' book. • 'gawk' is another name for 'awk' on GNU-based systems | the 'gawk' program has a few more features than the original 'awk' program. You may eventually want to learn gawk or awk, but it's not required. The section of the gawk man page called 'Getting Started with awk' is pretty good. So are the tutorial chapters in the O'Reilly 'sed and awk' book. 1 1.3 A bash/sed example Why not use C all the time? The answer is that some tasks are easier to perform with other programming languages: • Manipulate file hierarchies: use bash and sed. • (Simple manipulation of tabular text files: gawk) • Manipulate text files: use perl. • Manipulate non-text files: use C. perl can do any of these things, but isn't very efficient for numerical calculations. C can also do any of the things listed, but perl has many builtin tools for string manipulation, so it's worthwhile to learn perl. gawk is easier than perl for simple manipulation of tabular text; it's up to you whether or not you want to try learning it. bash is a POSIX-compliant command interpreter, meaning that, like the DOSshell, you can type in a program name, and the program will run. Unlike the DOSshell, bash is also a pretty good programming language (not as good as BASIC or perl, but better than DOSshell or tcsh). For example, suppose you want to search through the entire /data/timit/train hierarchy1, apply the C program \extract" to all WAV files in order to create MFC files, and create a file with extension TRP containing only the third column of each PHN file, and then move all of the resulting files to a directory hierarchy under /data/newfiles/train (but the new directory hierarchy doesn't exist yet). You could do all that by entering the following, either at the bash command prompt or in a shell script: if [ ! -e ${HOME}/newfiles/train ]; then mkdir ${HOME}/newfiles; mkdir ${HOME}/newfiles/train; fi for dr in dr{1,2,3,4,5,6,7,8}; do if [ ! -e ${HOME}/newfiles/train/${dr} ]; then echo mkdir ${HOME}/newfiles/train/${dr}; mkdir ${HOME}/newfiles/train/${dr}; fi for spkr in `ls ${HOME}/timit/train/${dr}`; do if [ ! -e ${HOME}/newfiles/train/${dr}/${spkr} ]; then echo mkdir ${HOME}/newfiles/train/${dr}/${spkr}; mkdir ${HOME}/newfiles/train/${dr}/${spkr}; fi cd ${HOME}/timit/train/${dr}/${spkr}; for file in `ls`; do case ${file} in *.wav | *.WAV ) MFCfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/wav/mfc/;s/WAV/MFC/'`; echo Copying file ${PWD}/${file} into file ${MFCfile}; extract ${file} ${MFCfile};; *.phn | *.PHN ) TRPfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/phn/trp/;s/PHN/TRP/'`; echo Extracting third column of file ${PWD}/${file} into file ${TRPfile}; gawk '{print $3}' ${file} > ${TRPfile}; esac done done done 1There are several copies of TIMIT floating around the lab. You can also buy your own copy for $100 from http://www.ldc.upenn.edu, or download individual files from that web site for free. 2 Once you have created the entire new hierarchy, you can list the whole hierarchy using ls -R ${HOME}/newfiles | less You may have noticed by now that bash suffers from cryptic syntax. bash inherits syntax from 'sh', a command interpreter written at AT&T in the days when every ASCII character had to be chiseled on stone tablets in triplicate; thus bash uses characters economically. Three rules will help you to use bash effectively: 1. Keep in mind that ', `, and " mean very different things. f and $f mean very different things. [ standing alone is a synonym for the command 'test'. 2. When trying to figure out how bash parses a line, you need to follow the seven steps of command expansion in the same order that bash follows them: brace expansion, tilde expansion, variable ex- pansion, command substitution, arithmetic expansion, word splitting, and filename expansion, in that order. No, really, I'm serious. Trying to read bash as pseudo-English leads only to frustration. 3. When writing your own bash scripts, trial and error is usually the fastest method. Use the 'echo' command frequently, with appropriate variable expansions at each level, so you can see what bash thinks it is doing. About awk or gawk: the only thing you absolutely need to know about gawk is that if you type the following command into the bash prompt, the file foo.txt will contain the M'th, N'th, and P'th columns from the file bar.txt (where M,N,and P should be any single digits): awk '{printf("%s\t%s\t%s\n",$M,$N,$P)}' bar.txt > foo.txt 3 2 Perl 2.1 Reading Read chapter 1 of Programming Perl by Larry Wall and Randall Schwartz [5]. This book is sometimes available on-line at perl.com; in the perl community, it is called \The Camel Book" in order to distinguish it from all of the other perl books available. Larry Wall (the author of perl, as well as the author of the book) is an occasional linguist with a good sense of humor. This chapter is possibly the best written introduction to perl data structures and control flow, and contains better documentation on blocks, loops, and control flow statements (if, unless, while, until) than the man pages. The manual pages are available in HTML format on-line at http://www.perl.com/doc/. Download the gzipped tar file, and unpack it on your own PC, so that you can keep it open while you program. You can read the manual pages using \man" under cygwin, but it is much easier to navigate this complicated document set using HTML. Before programming, you should read Chapter 1 of the Camel Book, plus the perldata man page and the first halves of the perlref and perlsub manual pages. While programming, you should have an HTML browser open so that you can easily look for useful information in the three manual pages just listed, and also in the perlsyn, perlop, perlfunc, perlref, and perlre pages. 2.2 Perl Data Types and Variable References This section will rapidly introduce perl data types, perl syntax, and the perl method for creating variable references. If you get confused, try the perlsyn and perlref man pages. Perl uses three data types: scalars, arrays, and hash tables. Variables are never declared in perl; perl creates them as soon as they are used. Therefore the syntax must specify the type of a variable. Scalars always start with $, e.g., $foo is a scalar variable. Arrays always start with @, e.g., @foo is an array. Hash tables always start with %, e.g., %foo is a hash table. A scalar variable always starts with $. Scalars may be numbers, strings, or references to other objects.
Recommended publications
  • Use Perl Regular Expressions in SAS® Shuguang Zhang, WRDS, Philadelphia, PA
    NESUG 2007 Programming Beyond the Basics Use Perl Regular Expressions in SAS® Shuguang Zhang, WRDS, Philadelphia, PA ABSTRACT Regular Expression (Regexp) enhance search and replace operations on text. In SAS®, the INDEX, SCAN and SUBSTR functions along with concatenation (||) can be used for simple search and replace operations on static text. These functions lack flexibility and make searching dynamic text difficult, and involve more function calls. Regexp combines most, if not all, of these steps into one expression. This makes code less error prone, easier to maintain, clearer, and can improve performance. This paper will discuss three ways to use Perl Regular Expression in SAS: 1. Use SAS PRX functions; 2. Use Perl Regular Expression with filename statement through a PIPE such as ‘Filename fileref PIPE 'Perl programm'; 3. Use an X command such as ‘X Perl_program’; Three typical uses of regular expressions will also be discussed and example(s) will be presented for each: 1. Test for a pattern of characters within a string; 2. Replace text; 3. Extract a substring. INTRODUCTION Perl is short for “Practical Extraction and Report Language". Larry Wall Created Perl in mid-1980s when he was trying to produce some reports from a Usenet-Nes-like hierarchy of files. Perl tries to fill the gap between low-level programming and high-level programming and it is easy, nearly unlimited, and fast. A regular expression, often called a pattern in Perl, is a template that either matches or does not match a given string. That is, there are an infinite number of possible text strings.
    [Show full text]
  • A First Course to Openfoam
    Basic Shell Scripting Slides from Wei Feinstein HPC User Services LSU HPC & LON [email protected] September 2018 Outline • Introduction to Linux Shell • Shell Scripting Basics • Variables/Special Characters • Arithmetic Operations • Arrays • Beyond Basic Shell Scripting – Flow Control – Functions • Advanced Text Processing Commands (grep, sed, awk) Basic Shell Scripting 2 Linux System Architecture Basic Shell Scripting 3 Linux Shell What is a Shell ▪ An application running on top of the kernel and provides a command line interface to the system ▪ Process user’s commands, gather input from user and execute programs ▪ Types of shell with varied features o sh o csh o ksh o bash o tcsh Basic Shell Scripting 4 Shell Comparison Software sh csh ksh bash tcsh Programming language y y y y y Shell variables y y y y y Command alias n y y y y Command history n y y y y Filename autocompletion n y* y* y y Command line editing n n y* y y Job control n y y y y *: not by default http://www.cis.rit.edu/class/simg211/unixintro/Shell.html Basic Shell Scripting 5 What can you do with a shell? ▪ Check the current shell ▪ echo $SHELL ▪ List available shells on the system ▪ cat /etc/shells ▪ Change to another shell ▪ csh ▪ Date ▪ date ▪ wget: get online files ▪ wget https://ftp.gnu.org/gnu/gcc/gcc-7.1.0/gcc-7.1.0.tar.gz ▪ Compile and run applications ▪ gcc hello.c –o hello ▪ ./hello ▪ What we need to learn today? o Automation of an entire script of commands! o Use the shell script to run jobs – Write job scripts Basic Shell Scripting 6 Shell Scripting ▪ Script: a program written for a software environment to automate execution of tasks ▪ A series of shell commands put together in a file ▪ When the script is executed, those commands will be executed one line at a time automatically ▪ Shell script is interpreted, not compiled.
    [Show full text]
  • Shell Scripting, Scripting Examples
    Last Time… on the website Lecture 6 Shell Scripting What is a shell? • The user interface to the operating system • Functionality: – Execute other programs – Manage files – Manage processes • Full programming language • A program like any other – This is why there are so many shells Shell History • There are many choices for shells • Shell features evolved as UNIX grew Most Commonly Used Shells – /bin/csh C shell – /bin/tcsh Enhanced C Shell – /bin/sh The Bourne Shell / POSIX shell – /bin/ksh Korn shell – /bin/bash Korn shell clone, from GNU Ways to use the shell • Interactively – When you log in, you interactively use the shell • Scripting – A set of shell commands that constitute an executable program Review: UNIX Programs • Means of input: – Program arguments [control information] – Environment variables [state information] – Standard input [data] • Means of output: – Return status code [control information] – Standard out [data] – Standard error [error messages] Shell Scripts • A shell script is a regular text file that contains shell or UNIX commands – Before running it, it must have execute permission: • chmod +x filename • A script can be invoked as: – ksh name [ arg … ] – ksh < name [ args … ] – name [ arg …] Shell Scripts • When a script is run, the kernel determines which shell it is written for by examining the first line of the script – If 1st line starts with #!pathname-of-shell, then it invokes pathname and sends the script as an argument to be interpreted – If #! is not specified, the current shell assumes it is a script in its own language • leads to problems Simple Example #!/bin/sh echo Hello World Scripting vs.
    [Show full text]
  • Unix (And Linux)
    AWK....................................................................................................................................4 BC .....................................................................................................................................11 CHGRP .............................................................................................................................16 CHMOD.............................................................................................................................19 CHOWN ............................................................................................................................26 CP .....................................................................................................................................29 CRON................................................................................................................................34 CSH...................................................................................................................................36 CUT...................................................................................................................................71 DATE ................................................................................................................................75 DF .....................................................................................................................................79 DIFF ..................................................................................................................................84
    [Show full text]
  • Chapter 10 SHELL Substitution and I/O Operations
    Chapter 10 SHELL Substitution and I/O Operations 10.1 Command Substitution Command substitution is the mechanism by which the shell performs a given set of commands and then substitutes their output in the place of the commands. Syntax: The command substitution is performed when a command is given as: `command` When performing command substitution make sure that you are using the backquote, not the single quote character. Example: Command substitution is generally used to assign the output of a command to a variable. Each of the following examples demonstrate command substitution: #!/bin/bash DATE=`date` echo "Date is $DATE" USERS=`who | wc -l` echo "Logged in user are $USERS" UP=`date ; uptime` echo "Uptime is $UP" This will produce following result: Date is Thu Jul 2 03:59:57 MST 2009 Logged in user are 1 Uptime is Thu Jul 2 03:59:57 MST 2009 03:59:57 up 20 days, 14:03, 1 user, load avg: 0.13, 0.07, 0.15 10.2 Shell Input/Output Redirections Most Unix system commands take input from your terminal and send the resulting output back to your terminal. A command normally reads its input from a place called standard input, which happens to be your terminal by default. Similarly, a command normally writes its output to standard output, which is also your terminal by default. Output Redirection: The output from a command normally intended for standard output can be easily diverted to a file instead. This capability is known as output redirection: If the notation > file is appended to any command that normally writes its output to standard output, the output of that command will be written to file instead of your terminal: Check following who command which would redirect complete output of the command in users file.
    [Show full text]
  • Bash Guide for Beginners
    Bash Guide for Beginners Machtelt Garrels Garrels BVBA <tille wants no spam _at_ garrels dot be> Version 1.11 Last updated 20081227 Edition Bash Guide for Beginners Table of Contents Introduction.........................................................................................................................................................1 1. Why this guide?...................................................................................................................................1 2. Who should read this book?.................................................................................................................1 3. New versions, translations and availability.........................................................................................2 4. Revision History..................................................................................................................................2 5. Contributions.......................................................................................................................................3 6. Feedback..............................................................................................................................................3 7. Copyright information.........................................................................................................................3 8. What do you need?...............................................................................................................................4 9. Conventions used in this
    [Show full text]
  • Introduction to Perl
    Introduction to Perl Science and Technology Support Group High Performance Computing Ohio Supercomputer Center 1224 Kinnear Road Columbus, OH 43212-1163 Introduction to Perl • Setting the Stage • Functions • Data Types • Exercises 3 • Operators • File and Directory Manipulation • Exercises 1 • External Processes • Control Structures • References • Basic I/O • Exercises 4 • Exercises 2 • Some Other Topics of Interest • Regular Expressions • For Further Information 2 Introduction to Perl Setting the Stage • What is Perl? • How to get Perl • Basic Concepts 3 Introduction to Perl What is Perl? • Practical Extraction and Report Language –Or: Pathologically Eclectic Rubbish Lister • Created by Larry Wall • A compiled/interpreted programming language • Combines popular features of the shell, sed, awk and C • Useful for manipulating files, text and processes – Also for sysadmins and CGI • Latest official version is 5.005 – Perl 4.036 still in widespread use –Use perl -v to see the version • Perl is portable •Perlis free! 4 Introduction to Perl How to Get Perl • Perl’s most natural “habitat” is UNIX • Has been ported to most other systems as well – MS Windows, Macintosh, VMS, OS/2, Amiga… • Available for free under the GNU Public License – Basically says you can only distribute binaries of Perl if you also make the source code available, including the source to any modifications you may have made • Can download source code (C) and compile yourself (unlikely to be necessary) • Pre-compiled binaries also available for most systems • Support available
    [Show full text]
  • A Crash Course on UNIX
    AA CCrraasshh CCoouurrssee oonn UUNNIIXX UNIX is an "operating system". Interface between user and data stored on computer. A Windows-style interface is not required. Many flavors of UNIX (and windows interfaces). Solaris, Mandrake, RedHat (fvwm, Gnome, KDE), ... Most UNIX users use "shells" (or "xterms"). UNIX windows systems do provide some Microsoft Windows functionality. TThhee SShheellll A shell is a command-line interface to UNIX. Also many flavors, e.g. sh, bash, csh, tcsh. The shell provides commands and functionality beyond the basic UNIX tools. E.g., wildcards, shell variables, loop control, etc. For this tutorial, examples use tcsh in RedHat Linux running Gnome. Differences are minor for the most part... BBaassiicc CCoommmmaannddss You need these to survive: ls, cd, cp, mkdir, mv. Typically these are UNIX (not shell) commands. They are actually programs that someone has written. Most commands such as these accept (or require) "arguments". E.g. ls -a [show all files, incl. "dot files"] mkdir ASTR688 [create a directory] cp myfile backup [copy a file] See the handout for a list of more commands. AA WWoorrdd AAbboouutt DDiirreeccttoorriieess Use cd to change directories. By default you start in your home directory. E.g. /home/dcr Handy abbreviations: Home directory: ~ Someone else's home directory: ~user Current directory: . Parent directory: .. SShhoorrttccuuttss To return to your home directory: cd To return to the previous directory: cd - In tcsh, with filename completion (on by default): Press TAB to complete filenames as you type. Press Ctrl-D to print a list of filenames matching what you have typed so far. Completion works with commands and variables too! Use ↑, ↓, Ctrl-A, & Ctrl-E to edit previous lines.
    [Show full text]
  • Minimal Perl for UNIX and Linux People
    Minimal Perl For UNIX and Linux People BY TIM MAHER MANNING Greenwich (74° w. long.) For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact: Special Sales Department Manning Publications Co. Cherokee Station PO Box 20386 Fax: (609) 877-8256 New York, NY 10021 email: [email protected] ©2007 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps. Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Manning Publications Co. Copyeditor: Tiffany Taylor 209 Bruce Park Avenue Typesetters: Denis Dalinnik, Dottie Marsico Greenwich, CT 06830 Cover designer: Leslie Haimes ISBN 1-932394-50-8 Printed in the United States of America 12345678910–VHG–1009080706 To Yeshe Dolma Sherpa, whose fortitude, endurance, and many sacrifices made this book possible. To my parents, Gloria Grady Washington and William N. Maher, who indulged my early interests in literature. To my limbic system, with gratitude for all the good times we’ve had together.
    [Show full text]
  • Shells and Shell Programming
    Shells & Shell Programming (Part A) Software Tools EECS2031 Winter 2018 Manos Papagelis Thanks to Karen Reid and Alan J Rosenthal for material in these slides SHELLS 2 What is a Shell • A shell is a command line interpreter that is the interface between the user and the OS. • The shell: – analyzes each command – determines what actions are to be performed – performs the actions • Example: wc –l file1 > file2 3 Which shell? • sh – Bourne shell – Most common, other shells are a superset – Good for programming • csh or tcsh – command-line default on EECS labs – C-like syntax – Best for interactive use. • bash – default on Linux (Bourne again shell) – Based on sh, with some csh features. • korn – written by David Korn – Based on sh – Some claim best for programming. – Commercial product. 4 bash versus sh • On EECS labs, when you run sh, you are actually running bash. • bash is a superset of sh. • For EECS2031, you will be learning only the features of the language that belong to sh. 5 Changing your shell • I recommend changing your working shell on EECS to bash – It will make it easier to test your shell programs. – You will only need to learn one set of syntax. • What to do: – echo $SHELL (to check your current shell) – chsh <userid> bash – Logout and log back in. – .profile is executed every time you log in, so put your environment variables there 6 Standard Streams • Preconnected input and output channels between a computer program and its environment. There are 3 I/O connections: – standard input (stdin) – standard output (stdout) – standard
    [Show full text]
  • Linux Programming [R15a0527] Lecture Notes
    LINUX PROGRAMMING [R15A0527] LECTURE NOTES B.TECH IV YEAR – I SEM (R15) (2019-2020) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY (Autonomous Institution – UGC, Govt. of India) Recognized under 2(f) and 12 (B) of UGC ACT 1956 (Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified) Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, India Syllabus (R15A0527) LINUX PROGRAMMING Objectives: • To develop the skills necessary for Unix systems programming including file system programming, process and signal management, and interprocess communication. • To make effective use of Unix utilities and Shell scripting language such as bash. • To develop the basic skills required to write network programs using Sockets. UNIT I Linux Utilities - File handling utilities, Security by file permissions, Process utilities, Disk utilities, Networking commands, Filters, Text processing utilities and Backup utilities. Sed- Scripts, Operation, Addresses, Commands, Applications, awk- Execution, Fields and Records, Scripts, Operation, Patterns, Actions, Associative Arrays, String and Mathematical functions, System commands in awk, Applications. Shell programming with Bourne again shell(bash) - Introduction, shell responsibilities, pipes and Redirection, here documents, running a shell script, the shell as a programming language, shell meta characters, file name substitution, shell variables, command substitution, shell
    [Show full text]
  • Unix Login Profile
    Unix login Profile A general discussion of shell processes, shell scripts, shell functions and aliases is a natural lead in for examining the characteristics of the login profile. The term “shell” is used to describe the command interpreter that a user runs to interact with the Unix operating system. When you login, a shell process is initiated for you, called your login shell. There are a number of "standard" command interpreters available on most Unix systems. On the UNF system, the default command interpreter is the Korn shell which is determined by the user’s entry in the /etc/passwd file. From within the login environment, the user can run Unix commands, which are just predefined processes, most of which are within the system directory named /usr/bin. A shell script is just a file of commands, normally executed at startup for a shell process that was spawned to run the script. The contents of this file can just be ordinary commands as would be entered at the command prompt, but all standard command interpreters also support a scripting language to provide control flow and other capabilities analogous to those of high level languages. A shell function is like a shell script in its use of commands and the scripting language, but it is maintained in the active shell, rather than in a file. The typically used definition syntax is: <function-name> () { <commands> } It is important to remember that a shell function only applies within the shell in which it is defined (not its children). Functions are usually defined within a shell script, but may also be entered directly at the command prompt.
    [Show full text]