
Appendix I: Brief reference

Operating systems

Whenever you are using a computer you interact with it with the help of an operating system (OS), a vital interface between hardware and user. The operating system does a number of different things. For instance, multiple programs are often run at the same time, and in this situation the operating system allocates resources to the different programs or may interrupt programs as appropriate. Another common feature of an operating system is a graphical user interface, originally developed for personal computers. Examples of popular operating systems are Microsoft Windows, Mac OS X and Linux.

Linux is an example of a Unix (or "Unix-like") operating system. Unix was originally developed in 1969 at Bell Laboratories in the US. Many different flavours of the Unix operating system have been developed, such as Solaris, HP-UX and AIX, and there are a number of freely available Unix or Unix-like systems such as GNU/Linux in different distributions such as Red Hat Enterprise Linux, Fedora, SUSE Linux Enterprise, openSUSE and Ubuntu.

If you are at a personal computer, your access to a Unix environment depends on what operating system you are using. For instance, Microsoft Windows is not based on Unix and does not provide a Unix interface. If you would like to have a Unix environment within Windows, a possible choice is to install Cygwin (http://www.cygwin.com). Cygwin has a lot of Unix functionality and useful Unix programs, and it also includes Perl (see also Appendix III). The Mac OS X operating system is based on Unix, and all you need to do to communicate with Unix commands is to open a terminal window (available under Applications-Utilities). The same applies if you are at a computer with Linux or any other Unix flavour; just open a terminal window to get started.

Accessing a UNIX computer

Even if you do not have Unix on your personal computer, you can connect to a Unix-based server and operate from there. In fact, this is a common mode of working in bioinformatics. You typically make use of an SSH (Secure Shell) client program to connect. SSH is a network protocol that allows data to be transferred between two networked computers. The SSH client program communicates with an SSH daemon running on the server side. A typical application of the SSH client program is to log in to a remote computer and execute commands there. Examples of freely available implementations of SSH are OpenSSH (http://www.openssh.com/), copSSH (http://www.itefix.no/i2/copssh) and PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/).

From now on, we assume that you have access to a Unix terminal window. We will be focusing on operations that are possible from the command line, as this is where you are most effective in the long run.

File system

Examples of directories and their contents

/            root of file system
/bin         executable binary files
/dev         special files used to represent real physical devices
/etc         commands and files used for system administration
/home        contains a home directory for each user of the system
/home/joe    home directory of user joe
/lib         libraries used by various programs and programming languages
/tmp         a "scratch" area where any user can store files temporarily
/usr         system files and directories that you share with other users

Moving around in the file system

Find out what directory you are currently in:

% pwd        "present working directory"

Move to a specific directory with cd, "change directory":

% cd /tmp    change directory to /tmp
% cd ..      go up one level in directory tree
% cd         (without argument) go to your home directory

Find out what files are in the current directory:

% ls         show files in current directory
% ls -al     show files in current directory in detail, including hidden files

The ls -al command will result in a detailed output, as in this example:

-rw-r--r-- 1 joe users 383269 2007-11-25 16:54 PF02854.txt
drwxr-xr-x 9 joe users    656 2003-04-04 20:12 scripts/
-rw-r--r-- 1 joe users   4898 2006-09-12 09:12 README.txt
-rwxr-xr-x 1 joe users 120635 2004-08-03 01:47 dnapars*
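The navigation commands above can be tried safely in a throwaway directory. The following is only a minimal sketch: the scratch directory is created with mktemp, and the names demo and HIV1.fa are hypothetical, chosen purely for illustration.

```shell
# Create a scratch directory (mktemp picks its name) with one subdirectory.
scratch=$(mktemp -d)
mkdir "$scratch/demo"
touch "$scratch/demo/HIV1.fa"   # hypothetical empty file

cd "$scratch/demo"       # move into the new directory
here=$(pwd)              # the directory we are in now
cd ..                    # go up one level in the directory tree
parent=$(pwd)
listing=$(ls -al demo)   # detailed listing of the demo directory

echo "was in: $here"
echo "now in: $parent"
echo "$listing"
```

The listing also contains the entries . and .., which stand for the directory itself and its parent; they are shown because of the -a option.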

In the ls -al listing shown above, each line begins with a set of characters describing the file status. In a string like -rw-r--r--, the first '-' means that it is a regular file (a 'd' says that it is a directory). The next symbols are three groups of three, where the first group is what the owner can do, the second what the group members can do and the third what other users can do. In each group of three, the symbols are 'r' = readable, 'w' = writable, 'x' = executable. To change the file attributes, see documentation on the Unix command chmod.

As with many other Unix commands, the wildcard * may be used with ls. The following command will list files with names beginning with 'HIV':

% ls HIV*

Manipulating Files and Directories

Copying files

% cp [source] [destination]

Examples:

1) Copy the file /tmp/seq.fa to the current working directory. The current directory is represented by a dot '.':

% cp /tmp/seq.fa .

2) Copy the file /tmp/seq.fa to another file in /tmp named seq2.fa:

% cp /tmp/seq.fa /tmp/seq2.fa

3) Copy all files (*) in the directory /home/joe/seqfiles to the directory /tmp:

% cp /home/joe/seqfiles/* /tmp

Moving and renaming files

% mv [source] [destination]

Examples:

1) Rename the file seq.fa in /tmp to seq2.fa:

% mv /tmp/seq.fa /tmp/seq2.fa

2) Move all files (*) in the directory /home/joe/seqfiles to the directory /tmp:

% mv /home/joe/seqfiles/* /tmp

Removing files

% rm seq.fa

Creating and removing directories

% mkdir dirname
% rmdir dirname

Viewing and editing files

% cat seq.fa    (will show the contents of seq.fa on the screen)

The cat command may also be used to merge (concatenate) files. This command will merge three different files, file1, file2 and file3, into a new file newfile:

% cat file1 file2 file3 > newfile

The symbol > means that we are redirecting the output of the cat program to a file instead of the standard output (which is the screen). We can also append to an existing file, using the symbols >>:

% cat file4 >> newfile

Viewing a text file on the screen one page at a time

To view a text file on the screen, use either more or less. With less:

% less seq.fa

Some useful keys for less are:

space : move down one page
enter : go down one line
u     : go up (back)
/HIV  : search for 'HIV'
q     : quit program

Viewing or extracting the first or last lines of a file

First and last lines of a file may be displayed using the commands head and tail, respectively.

% head seq.fa          (by default head will show the first 10 lines of the file)
% head -1000 seq.fa    (first 1000 lines of file will be extracted)
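The file operations described above can be rehearsed in a scratch directory so that no real data is touched. This is a sketch under stated assumptions: the file names file1 to file4 and newfile follow the text, while the one-line contents, copy.txt, results and renamed.txt are hypothetical.

```shell
# Work in a scratch directory created by mktemp.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical one-line input files.
echo "line from file1" > file1
echo "line from file2" > file2
echo "line from file3" > file3
echo "line from file4" > file4

cat file1 file2 file3 > newfile   # merge; > overwrites newfile if it exists
cat file4 >> newfile              # >> appends to the existing file

cp newfile copy.txt               # copy a file
mkdir results                     # create a directory
mv copy.txt results/renamed.txt   # move and rename in one step
rm file1 file2 file3 file4        # remove the input files

first=$(head -n 1 newfile)        # first line of the merged file
last=$(tail -n 1 newfile)         # last line of the merged file
echo "first: $first"
echo "last:  $last"
```

The spelling head -n 1 is the portable form of the shorter head -1 notation; both select the given number of lines on most current systems.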

Text-mode editor

% vi seq.fa

The editor vi is very useful whenever you do not have access to a graphical editor. A description of it, however, would require a book of its own. For more information on vi the reader is referred to other sources such as:

http://en.wikibooks.org/wiki/Learning_the_vi_Editor
http://unixhelp.ed.ac.uk/vi/

Graphical editors

Examples of graphical editors are emacs (http://www.gnu.org/software/emacs/), gedit (http://www.gedit.org) and nedit (http://www.nedit.org).

Extracting file components with cut

Consider the content of a file dat.txt where the columns are separated with tabs:

1    12    1300    1306
2    11    1500    1458
3    17    1620    1700

We may extract the columns 1 and 3 with cut:

% cut -f1,3 dat.txt

which produces:

1    1300
2    1500
3    1620

The fields or columns to be extracted are specified with the -f option. The default separator is tab, but we may use any separator. The separator is specified with the -d option to cut. Consider the file dat2.txt which contains:

A;2;4500
B;5;4505
F;4;4510

We try the cut command:

% cut -f1,2 -d ';' dat2.txt

The output will be:

A;2
B;5
F;4

Sorting

Consider the file dat3.txt that contains:

A 12 1300 1306
C 11 1500 1458
B 17 1620 1700

The lines may be sorted using sort:

% sort dat3.txt

The output will be:

A 12 1300 1306
B 17 1620 1700
C 11 1500 1458

The sort utility sorts lines alphabetically by default. Sorting is done numerically if we use the option -n. In addition, we may specify sorting with respect to a specific column, using the parameter -k:

% sort -n -k2 dat3.txt

The output is then:

C 11 1500 1458
A 12 1300 1306
B 17 1620 1700

Note that now the values in column 2 are in numerical order. We may also reverse the order of sorting with the -r parameter:

% sort -n -k2 -r dat3.txt

Unique lines

The command uniq is used to identify the unique lines in a file:

% uniq sortedfile

For this to work well the lines in the file need to be sorted first with sort. A useful option to uniq is -c. The effect is to list the number of times each line occurs:

% uniq -c sortedfile

Comparing files

Report differences between the two (sorted) files:

% diff sortedfile1 sortedfile2

Show lines that are shared between two (sorted) files:

% comm -12 sortedfile1 sortedfile2

Counting words

Count lines, words and number of bytes:

% wc filename

Count lines only:

% wc -l filename

Redirection and pipes

For the > and >> redirection symbols see above under "Viewing and editing files".

The output of a program may be directed as input to another program with a pipe '|' symbol, like:

% sort somefile | uniq | wc -l

In the above command line the output of sort will be sent to uniq and the output of uniq will in turn be sent to wc. The final result is the number of unique lines in the file somefile.

Finding text with grep

A highly useful Unix utility is the grep command, used to search files for text or regular expression matching:

% grep ">" seq.fa | wc -l

In this example grep will identify all lines in seq.fa that contain '>'. The output of grep will then be directed to wc. The final output is therefore the number of lines with '>'. This is a simple way of counting the number of sequences in a FASTA format file. Some useful parameters of the grep command are illustrated here:

% grep -v -i -l "AACGTA" seqfile

-v    report lines where AACGTA does not match
-i    ignore case, i.e., we consider also "aacgta"
-l    show only the file name, not the matching text

Finding files

The find command is very useful to locate files. To simplify somewhat, the find command has the following syntax:

find [path or list of files] [expression]

To locate files with extension '.fa':

% find -name "*.fa"

When no path is given, GNU find starts from the current directory; on systems where the path is required, write find . -name "*.fa" instead. By default find is recursive so that it will search all subdirectories as well. Use '-maxdepth 1' to restrict the search only to the current directory.

You may also want to locate files with a certain content. This command will show all lines in all files with extension .fa that contain the string 'HIV':

% find -name "*.fa" -exec grep HIV {} \;

The parameter -exec means that any program following -exec will be executed on the files found by find. There are some peculiar symbols towards the end of the command line. The curly brackets {} are a placeholder to indicate where the names of files found by find should be placed. The backslashed semicolon indicates the end of the command specified by -exec.

As another example, you may instead want to list the files that contain the string 'HIV':

% find -name "*.fa" -exec grep -l HIV {} \;
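The text utilities above are most useful in combination. The sketch below recreates dat3.txt from the sorting example and adds a few small hypothetical files (names.txt, a two-sequence seq.fa, sub/more.fa, and two short sorted lists) to show sort, cut, uniq, comm, grep and find working together through pipes. The file contents other than dat3.txt are assumptions made for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# dat3.txt as in the sorting example (columns separated by single spaces).
printf 'A 12 1300 1306\nC 11 1500 1458\nB 17 1620 1700\n' > dat3.txt

# Sort numerically on column 2, then keep only column 1.
order=$(sort -n -k2 dat3.txt | cut -d ' ' -f1 | paste -s -d ' ' -)
echo "order by column 2: $order"

# Count unique lines: sort first, because uniq only compares adjacent lines.
printf 'b\na\nb\nb\n' > names.txt
unique=$(sort names.txt | uniq | wc -l | tr -d ' ')
sort names.txt | uniq -c          # shows how often each line occurs

# Lines shared between two sorted files.
printf 'a\nb\nc\n' > sortedfile1
printf 'b\nc\nd\n' > sortedfile2
shared=$(comm -12 sortedfile1 sortedfile2 | paste -s -d ' ' -)
echo "shared lines: $shared"

# Count sequences in a small hypothetical FASTA file.
printf '>HIV_seq1\nAACGTA\n>other_seq2\nGGCCTT\n' > seq.fa
nseq=$(grep ">" seq.fa | wc -l | tr -d ' ')
echo "sequences in seq.fa: $nseq"

# List .fa files, in any subdirectory, that contain the string HIV.
mkdir sub
printf '>HIV_seq3\nTTTT\n' > sub/more.fa
hits=$(find . -name "*.fa" -exec grep -l HIV {} \; | sort | paste -s -d ' ' -)
echo "files with HIV: $hits"
```

The find call names the starting directory explicitly with a dot, which works whether or not the local find accepts an omitted path. The tr -d ' ' step only removes the padding that some versions of wc put before the number.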

Useful features of Unix shells when typing commands

When typing a command the Tab key may be used for command line completion. Thus, type the first few characters of a program or filename, and press the Tab key to fill in the rest of the item. With this function you save a lot of typing.

Arrows on the keyboard are used to recall and edit previous commands. Thus, there is no need to type a long and complex command that you already typed before.

When editing a command Ctrl-E is used to move to the end of the line and Ctrl-A is used to move to the beginning of the line.

More information about Unix commands

For more information on the different Unix commands, try the man command at the Unix command prompt. Many systems also have info. Examples:

% man cut
% info cut

Interrupting a program

On many occasions you find that you want to interrupt a program. For instance, you may find that it takes too long to run and you do not want to wait for the results. As another example, you discover that you typed the wrong parameters to a program. In many cases you can interrupt a program with the Ctrl-C key combination.

Program run in the background

When you are invoking a program with a graphical interface, or when you are initiating a program that takes a while to complete, you typically want to be able to return to the Unix command prompt while the program is still running. The way to do that is to put a & (ampersand) after the program command. The program is said to run "in the background". An example:

% xclock &

For further studies on the interruption of a program and on programs run in the background the reader is encouraged to read documentation on the Unix bg (background), fg (foreground) and related commands.

Utilities when retrieving data over the network

The program wget is a useful Unix utility to retrieve a specific URL without making use of a web browser. Here is how to retrieve from the NCBI FTP site Genbank records of Bacillus anthracis CDC 684:

% wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Bacillus_anthracis_CDC_684_uid59303/NC_012581.gbk

In some cases data has been compressed and archived and may have the extension "tar.gz". Consider using wget to retrieve this file, which contains a distribution of the Linux blast program:

% wget ftp://ftp.ncbi.nih.gov/blast/executables/release/2.0.10/blast-2.0.10-ia32-linux.tar.gz

Having downloaded the file you need to uncompress it:

% gunzip blast-2.0.10-ia32-linux.tar.gz

The resulting file is blast-2.0.10-ia32-linux.tar

Then unpack the contents of the tar archive:

% tar -xvf blast-2.0.10-ia32-linux.tar

You can also skip the gunzip step and use this option of tar:

% tar -zxvf blast-2.0.10-ia32-linux.tar.gz

Some files are compressed with bzip2 so as to generate files with extension .bz2. They can be uncompressed with:

% bunzip2 some_archive.tar.bz2

which leaves the file some_archive.tar to be unpacked with tar -xvf, or uncompressed and unpacked in one step with:

% tar -xjvf some_archive.tar.bz2

Learning more about UNIX

There is a whole range of books and free web resources about Unix. See for instance "Unix in a Nutshell" by Arnold Robbins (http://oreilly.com/catalog/9780596100292/) and "UNIXhelp for Users" (http://unixhelp.ed.ac.uk/).
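The uncompress and unpack steps described under "Utilities when retrieving data over the network" can be practised without a network connection. The sketch below first builds a small archive of its own and then unpacks it by both routes; the names demo, readme.txt, demo.tar.gz and copy.tar.gz are hypothetical and stand in for a downloaded file.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Build a small archive to practise on.
mkdir demo
echo "hello" > demo/readme.txt
tar -cf demo.tar demo       # -c creates an archive
gzip demo.tar               # compresses it; the result is demo.tar.gz
rm -r demo

# Two-step route: uncompress with gunzip, then unpack with tar.
cp demo.tar.gz copy.tar.gz
gunzip copy.tar.gz          # leaves copy.tar
tar -xf copy.tar
two_step=$(cat demo/readme.txt)
rm -r demo

# One-step route: the -z option lets tar do the uncompressing.
tar -zxf demo.tar.gz
one_step=$(cat demo/readme.txt)

echo "two-step: $two_step"
echo "one-step: $one_step"
```

Adding -v to any of the tar commands, as in the examples in the text, lists each file as it is processed.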