Introduction to Software on Bio-Linux

The aim of this practical is to give you experience with a number of programs using different command line and graphical interfaces. We will begin by looking at the practicalities and advantages of running programs on the command line.

At the end of this course, we hope that you have some feel for some of the software pre-installed on Bio-Linux and the different ways to access it. The main points we hope you take away with you are:

1.) If you have repetitive tasks to carry out, chances are there is a way of automating the job, at least to some extent.

2.) Web interfaces are easy, and have certain benefits, but there are other ways to access software, and sometimes they will suit your needs better.

3.) We can be contacted for help. Please mail us if you have questions or problems relating to your account or your analysis! Email [email protected]

Please note: The point of this practical is not to give you a good knowledge of any program in particular, but rather to introduce you to some of the software, ways of using it, and where to find documentation and help. Before you use any of the programs to analyse your own data, we highly recommend that you read the documentation!

The programs we will be using in this part of the practical are:

readseq      converting between different sequence formats
remap        restriction mapping (an EMBOSS program)
clustalw     multiple sequence alignment program (command line)
clustalx     multiple sequence alignment program (Xwindows)
blast        searching sequence databases with sequence data
MSPcrunch    post blast processing (command line and web-based)
blixem       Xwindows – further post-blast processing
jalview      Xwindows – multiple sequence editor and more
prettyplot   creating pretty versions of multiple sequence alignments (EMBOSS)

The sample sequences you need for this session are in a directory named intro_pract. Please move into this directory if you are not already there.

Interface choices - pros and cons of different interfaces

Command line

Pros
• Fast to run
• Very flexible – many options are available
• Repetitive tasks can be carried out easily and quickly

Cons
• Have to learn syntax – (just need to read the documentation!)

Prompted command line

Pros
• Can get flexibility of the command line without having to type in everything

Cons
• Easy to forget the diversity of options that exist for many programs
• Slow to run compared with “pure” command line

Xwindows

Pros
• More intuitive than the command line – windows-like
• Usually quite colourful
• Some programs can only be run through Xwindows (e.g. Staden and Blixem)
• Often, extensive help is readily available through the menu system

Cons
• Much slower to use than command line, especially for repetitive tasks

Web-based

Pros
• Usually very intuitive
• Some web-based programs are linked together so you can directly use the results of one program as input to the next

Cons
• Slow to use relative to the command line, especially with repetitive tasks
• Your data needs to be in an accessible location so you can either upload it, or copy/paste it into the web form
• You need to consider where and how to save the results
• Network speed affects how fast you get your results
• Security issues

The most effective way to analyse your data is often to use programs through a combination of the above interfaces, depending on your requirements.

For repetitive tasks – please consider using the command line, and learning to use scripting to automate what you need to do.

Some general points before you start

File naming conventions in bioinformatics

Some bioinformatics programs will present you with a default filename for the output. It is a good idea either to accept the default (depending on how sensible it is) or, if not, at least to take note of the default suffix (often following a dot). E.g. the default output filename for a clustal format multiple sequence alignment might be:

output.aln

It is common to call clustal format files something.aln.

If you were outputting a multiple sequence alignment in msf format, it might be called

output.msf

You are not restricted to naming your files in any particular way but we highly recommend that you follow the convention for the type of file you are generating/saving.

Examples of naming conventions will be pointed out throughout the practical. Benefits to following the general trends include:

• you will have at least some notion of what kind of data are in a file just by looking at the name of the file

• by following the standard conventions, (rather than making up your own), you will make it easier for other people looking at your files, (e.g. collaborators, or people helping you), to know what is in your files just by looking at the title.

Following file naming conventions will save you a lot of time!

Naming files and the danger of over-writing previous results

Many programs will suggest a name for your results file. Sometimes this name is generated by taking the beginning of the name of your input file, and adding a new suffix. However, sometimes it is just a generic name like prettyplot.ps or clustalw.aln. We encourage you to change generic names as soon as you can.

Apart from the fact that filenames like prettyplot.ps give you little idea what data you actually analysed, if you do not change the name, the next time a file of the same name is generated, you will overwrite previous results.
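Both points can be tried out safely on files that do not matter. The following is only a sketch, written in portable Bourne-shell syntax (sh or bash); the scratch directory and the stand-in results file are invented for the illustration. It builds an output name from an input name by swapping the suffix, renames a generic results file, and then switches on the shell's "noclobber" setting (set -C) so that > refuses to overwrite an existing file.

```shell
#!/bin/sh
# Work in a scratch directory so no real results are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Build an output name from the input name: strip the old suffix, add a new one.
infile=testseq1.embl
outfile="${infile%.embl}.tfa"
echo "$infile -> $outfile"

# Rename a generic results file so that its name says what was analysed.
echo "alignment of testseqs_all" > clustalw.aln   # stand-in for a real results file
mv clustalw.aln testseqs_all.aln

# With noclobber switched on (set -C), > will not overwrite an existing file.
set -C
if (echo "second run" > testseqs_all.aln) 2>/dev/null; then
    overwritten=yes
else
    overwritten=no
fi
set +C   # switch noclobber off again
echo "overwritten: $overwritten"
```

Noclobber only protects against the shell's own > redirection. A program that opens its output file itself (as most analysis programs do when you give them an output filename) can still replace an existing file, so renaming generic results files remains the safer habit.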

Sequence formats

A simple thing that often trips people up is sequence formats.

You can think of a sequence format as how the sequence looks on the page, or on the screen, as well as how it is stored.

Sequences are stored in text or binary formats. Text formats are human readable, binary formats are not human readable (for most humans).

Examples of text formats:

• embl
• plain/staden
• msf
• clustal
• gcg (seq format)
• fasta
• phylip

The reasons there are different sequence formats are both historical and functional.

When people first started writing biological analysis programs, they would design a format that their program would understand. As time went on, numerous formats came into existence.

We live with the legacy of this; we must be aware of what format a sequence is in, and whether the program we want to run understands it.

Functionally, some programs require information that can be handled by some formats, but not others. For example, embl format files can contain lots of descriptive information about a sequence, whereas plain format contains none, and fasta format can contain only a small amount. Clustal and msf formats can handle multiple sequences that are aligned, and phylip format files can contain information relevant to phylogenetic analysis programs.

To be able to analyse data, it must be presented to the analysis program in a format it can understand, and must be appropriate for the analysis you are performing. This seems obvious, but frequent errors (or worse, meaningless results) occur when the data entered into a program is not appropriate.

Note

EMBOSS is a large, and recommended, package of programs for sequence analysis. EMBOSS programs accept data in many different formats, and sequence format conversion is rarely required when using EMBOSS programs.

A common problem: what is a text file and what is not

Word documents may look like text, but they aren’t! The letters you see on the page of a Word document (or Word Perfect, or most word processing programs) are actually stored in what is known as binary format. Most sequence analysis programs expect text. Plain old, nothing fancy, text. It is an unusual situation to need to use sequence data that has been stored as a Word document (if it is not unusual to you, please ask a demonstrator as you may be doing things the hard way!). To get a text document when using Word, save it as text only.

Please note: If you are using Word at all in any part of the process of your bioinformatics analysis, you are probably doing things the hard way! Please contact us for alternatives!!
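If you are unsure whether a file is plain text, you can check it from the command line before giving it to an analysis program. The sketch below is an illustration only: the function name looks_like_text and the two sample files are invented for the example, it is written in portable Bourne-shell syntax (sh or bash), and it treats anything outside plain ASCII as "not text", a simplification that suits sequence files.

```shell
#!/bin/sh
# looks_like_text FILE: succeeds if FILE contains only printable ASCII
# characters and white space (a deliberate simplification).
looks_like_text() {
    # In the C locale, grep looks for any byte that is neither printable
    # nor white space; if it finds one, the file is not plain text.
    if LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"; then
        return 1
    fi
    return 0
}

scratch=$(mktemp -d)
printf '>testseq1\nACGTACGT\n' > "$scratch/plain.tfa"      # plain text fasta
printf 'ACGT\001\002ACGT\n' > "$scratch/notplain.doc"      # contains control bytes

for f in "$scratch/plain.tfa" "$scratch/notplain.doc"; do
    if looks_like_text "$f"; then
        echo "$f: plain text"
    else
        echo "$f: NOT plain text"
    fi
done
```

Where it is installed, the standard file command gives a more detailed description of what a file contains.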

Converting between different sequence formats

There are a number of programs available to convert one sequence format to another. One of the most versatile is readseq.

Readseq allows conversion between many different formats, both for files containing single sequences, and those containing more than one sequence.

Exercise

Converting sequences from embl to fasta format.

First look at testseq1.embl. Notice the type of information held in this file.

less testseq1.embl

Quit the command less by typing

q

Now, type readseq on the command line. You will be prompted for all additional required information. Read the prompts carefully! Readseq asks for the name of an output file before the input file! You don’t want to end up overwriting your original data!

Give the output filename as: testseq1.tfa

As long as your sequence is in a recognised format, readseq will understand it. You will need to specify what output format you want when you are prompted. In this case, it will be number 8, Fasta.

When prompted, give the input filename as testseq1.embl

You will now again be presented with the prompt

Name an input sequence or -option:

This gives you the opportunity to specify other filenames (if you were creating a multiple sequence file) or other formatting options. We don’t need to do this, so just press the return key.

Now list your files by typing

ls

You should see one called testseq1.tfa. Look at it using the command less:

less testseq1.tfa

Compare this fasta-formatted version to the embl formatted version testseq1.embl.

Now do the same thing for testseq2.embl and testseq3.embl.

Bring up the help for readseq by typing

readseq -h

Can you see how you might be able to do this type of conversion in a single command given on the command line?

Try using the full command line to convert testseq4.embl to fasta format. Do it again for testseq5.embl.

Exercise

Sequence format conversion and multiple sequence files

Multiple sequence files, that is, files containing more than one sequence, are often used as input to multiple sequence alignment programs, and for carrying out repetitive analysis.

Look again at some of the options available in readseq:

readseq -h

Notice the option -all. If we include this on the command line, it indicates to readseq that we want it to take all the sequences we name, reformat them, and place the output of all of them into a single file.

Try creating a multiple fasta sequence file called testseqs_all.tfa that contains sequences testseq1.embl, testseq2.embl, testseq3.embl and testseq4.embl.

If you succeeded, all the sequences in the output file, testseqs_all.tfa, will be in fasta format.

Note that the input sequences do not all have to be in the same format as each other; they need only be in a format recognised by readseq.
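A simple check on a multiple fasta file is to count its header lines, since every sequence in fasta format starts with a line beginning with >. The sketch below is an illustration in portable Bourne-shell syntax (sh or bash); the function name count_fasta and the small demonstration file are invented for the example. On your own data you would point it at testseqs_all.tfa and expect to see four sequences.

```shell
#!/bin/sh
# count_fasta FILE: print the number of sequences in a fasta file,
# i.e. the number of lines that start with ">".
count_fasta() {
    grep -c '^>' "$1"
}

# Small stand-in for testseqs_all.tfa (made-up sequences).
scratch=$(mktemp -d)
cat > "$scratch/demo_all.tfa" <<'EOF'
>demo1
ACGTACGTAC
>demo2
GGGTTTAAAC
CCGGA
>demo3
TTTTACGT
EOF

echo "sequences found: $(count_fasta "$scratch/demo_all.tfa")"

# Print the header lines only, to see which sequences are present.
grep '^>' "$scratch/demo_all.tfa"
```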

There are many ways of doing the above, and many are faster than the way described here. We cannot describe all the possible methods here, but if you already had all your files in the appropriate format (e.g. here we are using fasta format), and you wanted to create a file containing all your sequences, you could use the command:

cat testseq[1-4].tfa > testseq_all2.tfa

Or, if you wanted to add extra information to a file that already exists, you could use the command:

cat testseq5.tfa >> testseq_all2.tfa
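The difference between > and >> is worth seeing once on files that do not matter. The following is a small sketch in portable Bourne-shell syntax (sh or bash), using made-up one-sequence files in a scratch directory; the file names mirror those above but the contents are invented.

```shell
#!/bin/sh
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Five tiny made-up fasta files standing in for testseq1.tfa ... testseq5.tfa.
for n in 1 2 3 4 5; do
    printf '>demo%s\nACGT\n' "$n" > "testseq$n.tfa"
done

# > creates the output file, replacing any existing file of that name.
cat testseq[1-4].tfa > testseq_all2.tfa
grep -c '^>' testseq_all2.tfa      # number of sequences now in the file

# >> appends to the end of the existing file instead of replacing it.
cat testseq5.tfa >> testseq_all2.tfa
grep -c '^>' testseq_all2.tfa

# Using > again throws away what was there before.
cat testseq5.tfa > testseq_all2.tfa
grep -c '^>' testseq_all2.tfa
```

The three counts show the file growing when >> is used and being replaced when > is used, which is why it pays to be careful about which one you type.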

The message: If you suspect there may be a more efficient way to do what you are doing, there probably is!

Please email [email protected] and ask us if we know of programs or options that might help you.

Running programs via the command line

Most programs can be run by typing everything the program needs to know on the command line and pressing the return key. Usually this resembles the way you give Linux/Unix commands: you enter the command followed by flags or arguments specifying how you want the program to run.

Some programs will prompt you for the information they need if you have not entered everything required on the command line.

Even in the case of programs that will prompt you for information on the command line, there are good reasons to provide all of the information directly, rather than in response to the program prompts:

• Many programs can be tailored to run according to your needs, but when a program prompts you for input, it usually only prompts you for information that is absolutely necessary in order to run. There may be many useful options that you miss out on using if you only answer the prompts. That is, if you miss out any information that is not vital, but affects the way you want the program to act, it will run fine, but not as you wanted.

• If you have repetitive tasks to carry out, (e.g. say you have 100 sequences to analyse in a particular way), the easiest (i.e. fastest) method to set this up involves using the full command line and automating the task.

To illustrate these points, we will be using an EMBOSS program called remap. This program looks for restriction sites in nucleotide sequences and is also useful for looking for open reading frames and translations in any or all 6 frames.

Note

Some EMBOSS applications need setup steps before they will work, and remap is one of them. Remap relies on information from a restriction enzyme database called Rebase. Information is available on our website explaining how to get Rebase and how to get EMBOSS applications to interact with it.

http://envgen.nox.ac.uk/envgen/software/archives/000329.html#rebase

Exercise

Running Remap – prompted and complete command line

Using the prompted command line

We will start by running remap as simply as possible. Just type: remap

The sequence you want to run the program on is testseq1.tfa

Choose the default answers to all questions.

When your analysis has run you will find the results in a file called testseq1.remap.

Use the command less to look at this file: type

less testseq1.remap

You should see your sequence, with restriction sites marked out along it, with a six frame translation shown below.

Keep pressing the space bar until you near the bottom of the file. Here you should find three lists: one of enzymes that did cut your sequence, one of enzymes that did not cut, and the number of enzymes that did not match your criteria.

Specifying everything on the command line

There are lots of useful remap options available – these can be accessed by using the full command line. To find out what these are, you can go to the web page documentation for remap (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/remap.html) or can type:

remap -help

This time, let's try running remap, specifying that we do not wish to see any ORFs that are less than 20 amino acids long, and that we wish to list only 6-cutter enzymes. Try looking at the documentation and see if you can figure out what the command line would be yourself.

In case you had problems, the following should work:

remap -orfminsize=20 -sitelen=6

Notice that you are still prompted for the necessary information you didn’t provide on the command line – here this includes what enzyme set to choose from, the name of the input file and the name of the output file.

By providing all the necessary information on the command line, you can speed up your analysis, and take the first step towards automating your task should you have to run this mapping many times.

Try this command – decide what you think it should be doing before looking at the results file.

remap -orfminsize=20 -outfile=testseq3remap.html -highlight="24-80 blue" -sitelen=6 -html -enzymes=all testseq3.tfa

Notes:

• type the above command all on one line
• EMBOSS programs allow you to use the above syntax, with an = sign between parameters and values, or to use spaces instead. For example, -orfminsize 20 is acceptable. What syntax is accepted by a program depends on how it has been written. For example, readseq is not so flexible in what it accepts.

Notice that we gave the output name of this file the suffix .html. This is because we asked for an html output file from remap, and most browsers require the .html suffix to recognize that the page you are trying to view is an html document.

Take a look at the results of the above command by opening a web browser and loading the file testseq3remap.html.

The command line truly comes into its own when you need to run an analysis over and over again. For example, what if I had 100 sequences I wanted to map exactly the same way? The prompted command line would become tedious very quickly. And the full command line would too, although it would be faster.

A first foray into automation – the foreach loop

Foreach loops allow you to say to the computer:

“Foreach thing in this list, do the following:”

So, when running a restriction mapping analysis, you might want to do something like:

“Foreach sequence in my list, run the program remap to look for the enzymes that have recognition sites of at least six bases and that cut a minimum of twice.”

Mapping 100 sequences with a foreach loop would take only fractionally more of your time than mapping a single sequence, and requires practically no extra effort on your part.

Since the general idea is to get the computer to read a list, and run an analysis on each item in that list, you need to generate a list of the sequences you want analysed. We will go through one example here.

Please note: If you have no prior experience with Linux or unix, you may find it a challenge to set up your own foreach loops the first time. Please contact us at [email protected] if you have problems setting up your own foreach loops.

Exercise

Looping through multiple restriction mapping analyses

Type the following on the command line:

ls testseq[1-5].tfa

You should see 5 sequences listed:

testseq1.tfa testseq2.tfa testseq3.tfa testseq4.tfa testseq5.tfa

If we wish to map these five sequences, we know that the command ls testseq[1-5].tfa will list our sequences of interest. We can put this list into a foreach loop by stating

foreach i ( `ls testseq[1-5].tfa` )

Here there are several things to note:

• we have used the command “foreach”
• the “i” means “each thing” – for each thing in the list, i takes the value of that thing (in this case, a sequence name)
• the information in the brackets is the list of sequences you want to work on
• the quotation marks around the ls testseq[1-5].tfa command are backquotes (`), and the computer understands this to mean “take the results of the command inside these backquotes”
• the brackets are important: the computer needs them to understand what you want
• so the overall effect of that one line is: “foreach thing in the list that can be generated using the command ls testseq[1-5].tfa, do the following:”

If you have typed the foreach line in, you will now be seeing something like:

jbloggs@machine [demo] foreach i ( `ls testseq[1-5].tfa` )
foreach>

The foreach> is a prompt - we need to tell the computer what we want it to do with each item in the list. To do this, type:

remap -outfile=$i.remap -enzymes=all -sitelen=6 -mincuts=2 $i

and press the return key. Each $i in that command will be replaced by the name of a sequence file from the list, and the remap command is executed once for each file.

You will now see another foreach> prompt. Type

end

to let the computer know that you have told it all it needs to do for each item in the list.

Now type:

ls -l *.remap

You should see that you have run the mapping analysis on all 5 sequences (the results files will be called testseq#.tfa.remap, where # is the number of the sequence).

As you can see, running this analysis on 100, or 1000 sequences, would be relatively painless if you did it using a foreach loop.
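The foreach syntax is not available in every shell; plain sh and bash, for example, use for ... do ... done instead. The sketch below shows the same loop in that style. It is an illustration only: the function name map_all is invented, the remap options are simply those used in this exercise, and the command is printed rather than executed (a "dry run") so that you can check it first. Removing the word echo would run remap for real.

```shell
#!/bin/sh
# map_all FILE...: for each file named, print the remap command that
# would be run on it. Delete "echo" to execute the commands instead.
map_all() {
    for i in "$@"; do
        echo remap -outfile="$i.remap" -enzymes=all -sitelen=6 -mincuts=2 "$i"
    done
}

# Empty stand-in files in a scratch directory, so the pattern has something to match.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch testseq1.tfa testseq2.tfa testseq3.tfa testseq4.tfa testseq5.tfa

# The shell expands testseq[1-5].tfa itself, so no ls command is needed here.
map_all testseq[1-5].tfa
```

Printing the commands before running them is a cheap way to catch a mistyped option or file pattern before it is applied to 100 sequences.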

Note

Sometimes programs (like remap) send information to the screen that you don’t want. One way to get rid of it while carrying out a foreach loop as above is to send this screen output to a “garbage” area (aka a bit-bucket).

The information you see in the case of remap is being sent from the program to STDERR (which, in this case, is your screen). By writing 2>/dev/null at the end of the remap command, you are saying that anything sent to STDERR (2) should go to a garbage area (/dev/null).

Try replacing the line:

remap -outfile=$i.remap -enzymes=all -sitelen=6 -mincuts=2 $i

in your foreach loop with:

remap -outfile=$i.remap -enzymes=all -sitelen=6 -mincuts=2 $i 2>/dev/null

This may seem trivial at this point, but could be important if you start writing shell scripts or cron jobs where information is being sent to STDERR.
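To see what 2>/dev/null does without running remap, the sketch below uses a made-up shell function, noisy, that writes one line to standard output and one line to STDERR. It is an illustration only, written in portable Bourne-shell syntax (sh or bash); the function and the log file name are invented for the example.

```shell
#!/bin/sh
# noisy: a stand-in for a program that writes its results to standard
# output and its progress messages to STDERR (file descriptor 2).
noisy() {
    echo "result line"
    echo "progress message" >&2
}

scratch=$(mktemp -d)

noisy                                # both lines appear on the screen
noisy 2>/dev/null                    # only "result line" appears; STDERR is discarded
noisy 2>"$scratch/messages.log"      # STDERR is kept in a file instead of thrown away
```

Sending STDERR to a file rather than to /dev/null is often the better choice in scripts, because error messages are then still available if something goes wrong.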

Nicing – aka “Being a considerate user!”

If you are running a computationally intensive job (e.g. when you search databases, or run large alignments), you should consider being polite to other users of your system by setting your jobs to work at a low priority. The priority given to a job is referred to as its nice level.

We won’t be nicing any jobs today, but for the sake of all the other users of your Bio-Linux machine, please read the documentation on nice:

man nice

To nice a job you are about to run, use

nice -n level command

Levels range from -20 (highest priority) to 19 (lowest priority); ordinary users can normally only use levels from 0 upwards. For example, to nice a program called someprog.pl to level 15 (a low-ish priority), you could type:

nice -n 15 someprog.pl

You can also move a running program to a lower priority using the command renice.

Note: You may have to give the full path of the command you wish to run when using nice, rather than just the short name.

There are other facilities, such as queuing and load balancing systems, which are more sophisticated than just “nicing” a job, but nice is simple, built-in, and effective for machines with a very small number of users.

Running programs with graphical interfaces

Your Bio-Linux machine runs X windows, which means that you can run programs with graphical interfaces – either by working on the console, or by working remotely. Your system administrator should be able to help you if you do not know how to run graphical programs on Bio-Linux remotely.

Programs you can run using graphical interfaces include clustalx, blixem, jalview and Staden, among others.

Exercise

Running clustalx

Clustalx is a multiple alignment program. (A command line version called clustalw is also available.)

To start up clustalx, just type the name of the program on the command line:

clustalx &

The & allows us to work in the clustalx window and also to continue working in our original terminal window. (Nothing bad happens if you forget the &.)

Alternatively, you can start up clustalx by choosing the icon under the Bioinformatics drop down menu.

You should now see a new window appear with the title ClustalX. There are a number of drop-down menus available within the clustalx program: Files, Edit, Alignment, Trees, Colors, Quality, and Help.

Click on each of them and see what choices you are presented with.

Any choice in grey text is not available to you at this moment, any choice in black text is.

Please note: we have just discovered that the help menus for clustalx in Bio-Linux 2.0 are not working. There is an easy (though not completely elegant) fix for this. Please see our bioinformatics software FAQ at

http://envgen.nox.ac.uk/envgen/software/archives/000329.html#clustalx_help

We have several options for loading sequences into clustalx. Some of these are:

• we can load all the sequences at once using a multiple sequence fasta format file (e.g. testseqs_all.tfa)
• we can add the sequences one at a time by loading the first sequence using the menu option Load Sequences, and adding all subsequent sequences using the Append Sequences option
• we can do a mixture of both the above

Try loading testseqs_all.tfa into clustalx:

Choose the Load Sequences option, and choose testseqs_all.tfa from the menu.

Sequences should appear in the Clustalx window. Please ask a demonstrator if they do not.

To add a single sequence to those already loaded: choose the Append Sequences option and choose testseq5.tfa. This sequence should now be visible in your Clustalx window.

In order to carry out an alignment, you need to highlight the names of the sequences you want aligned, and click on the Alignment menu.

It is not part of the course today, but we highly advise you to click on the Alignment parameters and the Output Format Options choices and see what is available!

For now, just click on Do Complete Alignment under the Alignment menu.

You will be presented with output file names that you can change if you like.

Click on the button marked Align.

At the bottom of the clustalx window, you should see text describing the progression of the alignment.

To keep this alignment, you need to save it.

It can be saved in a text format so that it can be used in other programs, or you can save the output much as it looks in the ClustalX window: a coloured alignment. This is like a picture and cannot be used for further analysis.

To save the alignment in text form, click on the Save Sequences As… choice under the File menu. You will be presented with choices as to what format to save your alignment in, which portion of the alignment to save, and what to call the output file. Take note of what the output file will be called, and then click on OK.

Go back to your main window and type ls at the prompt. You should now see the file you just created.

Try to create a colour version of the alignment – name it testseqs_all.ps.

To view this postscript format file, you can use a command called ghostview:

ghostview testseqs_all.ps

Quit clustalx by choosing Quit from under the File menu.

Exercise

Fetching sequences using SRS at the EBI

The majority of this exercise will be carried out via concurrent demonstration.

Exercise

Running artemis

Artemis is a program for viewing and annotating DNA sequences. It is an Xwindows program.

You should have a file in your account called hsy14768.embl.

Start artemis by typing

artemis

Now choose the option Open from under the File menu, and select the file you just saved: hsy14768.embl. You may get an error, just hit the button OK.

This should open up a large window where this sequence will be displayed graphically.

In another xterm window, you may like to view the actual text of the entry using the command less.

less hsy14768.embl

Notice how Artemis has essentially transformed this text information into a picture.

For more information about how to work with Artemis, please refer to the web page:

http://www.sanger.ac.uk/Software/Artemis

Explore the options available to you. Not all options will be functional – if you need them, you will need to ask your system administrator to set up some of them on your local Bio-Linux machine.

Blast – from step one onwards

Running blast searches is one of the most frequent tasks in bioinformatics. Running blast searches locally, and learning how to automate this task, will greatly add to your efficiency.

To search a database with blast locally, you need to get that database and then make a copy in a format that blast can read – that is, you must index the database for blast.

There are two main types of blast – NCBI’s blast (aka blastall) and Washington University blast (aka wu-blast). Both do a good job, but they work slightly differently (under the hood), and can produce different results in some cases. In addition, wu-blast offers some features NCBI blast does not. Academic licenses for wu-blast are free and can be obtained by emailing [email protected].

NCBI blast (blastall) comes pre-installed on your Bio-Linux system.

We are going to download the peptide database swissprot, format it for searches with blast, and then carry out some blast searches against this formatted database.

Please note that where we put the blast database during this course is not the recommended location!!!

Please ask your system administrator to put blast databases in the location /home/db/blastdb OR to change the environmental variable BLASTDB set in the file /usr/software/bioenvrc to the appropriate location.

We recommend that you store all databases somewhere under /home, and preferably somewhere under /home/db.

Exercise

Running blast – from database formatting to database searching

Step 1 – get the database onto the machine

The majority of sequence and sequence-related databases are disseminated as flatfiles and there are a number of ways you can get hold of such databases. The most common is to ftp them from a central repository like the EBI, Expasy or the NCBI.

We will download the entire swissprot database from the EBI using ftp, and the command wget.

Create a directory to store your database in and move into that directory:

mkdir blastdb
cd blastdb

Download the appropriate file (the fasta file!) from the ftp site:

wget ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/fasta/sprot.fas.gz

Uncompress the database:

gunzip sprot.fas.gz

To format the database for use with blastall, you need to use a program called formatdb. Documentation for this program can be found by clicking on the desktop folder called Bioinformatics Software Manuals, then the sub folder db_search_docs, then the folder BLAST_docs. A reasonable command to run to format the above database, creating a blast version with the name “swissprot”, would be:

formatdb -i sprot.fas -p T -o T -n swissprot

Run the above command and look at the list of files created:

ls -l

It is worth looking at the file formatdb.log after you create a blast database.

Now move back to your original directory

cd ..

Blastall needs to know where to find the database you want to search. You can do this by giving the full path to the database, or by defining an environmental variable $BLASTDB as the directory where your blast database is. We will give the full path during this practical.

There are many command line options available for blastall, and we HIGHLY recommend you read the documentation for this program! Understanding how this program works and how it can be used will aid you greatly in searching databases effectively and understanding what your blast results really mean.

Blast comes in a number of different “flavours”, which carry out different types of searches.

FLAVOUR   SEARCH SEQUENCE TYPE                              DATABASE SEQUENCE TYPE
blastn    nucleotide                                        nucleotide
blastp    peptide                                           peptide
blastx    nucleotide (6 frame conceptual translation of)    peptide
tblastn   peptide                                           nucleotide (6 frame conceptual translation of)
tblastx   nucleotide (6 frame conceptual translation of)    nucleotide (6 frame conceptual translation of)

It is beyond the scope of this course to cover the details of blast searching. We will just run a basic blastp search, and then we’ll use a foreach loop to run 5 blastx searches.

A simple blastp search

blastall -p blastp -d blastdb/swissprot -i cd4_cerae.tfa -e 0.01 -o cd4_cerae.blastp

This means: run blastall, using the flavour (-p) blastp. The database (-d) to be searched is called swissprot and can be found in the blastdb directory. The input sequence (-i) is cd4_cerae.tfa. I only want to see results of sequences with e-values (-e) better than (i.e. lower than) 0.01, and I want the results of this search (-o) to be sent to the file cd4_cerae.blastp.

Please look at the results file.

Because you used the -o option when you created your blast database, you can use the fastacmd program to retrieve any sequences you are interested in using the sequence id you find in the blast report. E.g. for this search, you could retrieve the sequence with id COAD_BPFD by typing:

fastacmd -d blastdb/swissprot -s COAD_BPFD

Notice that this finds the sequence and returns the results to the screen. If you wish to keep this sequence, you need to capture the output. You can do this using the > redirect symbol. This will send the output to a file. For example:

fastacmd -d blastdb/swissprot -s COAD_BPFD > coad_bpfd.tfa

The output is now stored in a file called coad_bpfd.tfa. You can look at this file using cat, more, or less.

Try a blastx search:

blastall -p blastx -d blastdb/swissprot -i unknown.tfa -e 1 -o unknown.blastx

This may have seemed a lot of work when you could have just gone to a web site to do it!

There are many reasons to choose to blast locally including configurability, speed, security, and being able to automate your jobs.

Five blastx searches, one command

Remember the foreach loop?

Try to set up a foreach loop to run blastx searches of testseq1.tfa through to testseq5.tfa against the swissprot database.

The answer is given below, but try it yourself first and see how it goes.

foreach i (`ls testseq[1-5].tfa`)
    blastall -p blastx -d blastdb/swissprot -i $i -e 0.01 -o $i.blast
end
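The foreach loop is csh/tcsh syntax. If you ever work in a Bourne-style shell (sh or bash) instead, which is an assumption and not what this course uses, a for loop does the same job. The sketch below is a dry run: blast_cmd is a hypothetical helper that only prints the blastall command for each file, so you can check the commands before running anything. Unlike the loop above, it strips the .tfa suffix, so the report for testseq1.tfa would be named testseq1.blast rather than testseq1.tfa.blast.

```shell
# Dry-run sketch: print one blastall command per query file.
# Assumes a Bourne-style shell (sh/bash), not the csh used in the course.
blast_cmd() {
    # $1 = query file; the report name swaps the .tfa suffix for .blast
    printf 'blastall -p blastx -d blastdb/swissprot -i %s -e 0.01 -o %s\n' \
        "$1" "${1%.tfa}.blast"
}

for f in testseq1.tfa testseq2.tfa testseq3.tfa testseq4.tfa testseq5.tfa; do
    blast_cmd "$f"
done
```

In a directory that contains the files, the list can be shortened to the pattern testseq[1-5].tfa. Once the printed commands look right, piping the loop's output into sh would run them.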

If you type

ls *blast

you should see the blast reports you have just generated listed.

You could read through the testseq*.blast files by using the command less :

less testseq*.blast

When you get to the end of one document (or just want to go to the next document), type :n

If you want to quit, type q

MSPcrunch and Blixem

MSPcrunch is a program that processes the output of blast searches. It filters that output, the aim being to improve the chances of finding new biologically significant matches by reducing the display of redundant matches while retaining the best of the other matches.

There are many command line options for this command. You can list these by typing:

mspcrunch -h

Exercise

In our case, we will be using MSPcrunch specifically to convert our blast output into a format readable by the program blixem.

This means we need to use the -q option. Try the following:

mspcrunch -q unknown.blastx > mspcrunch.out

This makes a file called mspcrunch.out, which we will feed into Blixem.

Blixem stands for “BLast matches In an X-windows Embedded Multiple alignment” and is an interactive browser of pairwise Blast matches. The alignment that is produced is thus not a ‘true’ multiple alignment, such as produced by e.g. Clustalw, but a ‘one-to-many’ alignment (all sequences are aligned to your original search sequence.) There are many viewing options in Blixem and it is worth getting to know this program.

Running Blixem

Make sure you have saved your MSPcrunch results to the same directory as your sequence files.

To run Blixem, you need to give the program your MSPcrunch results AND the file containing the sequence you did the blast search with.

Blixem accepts only fasta formatted sequences.

On the command line, type:

blixem unknown.tfa mspcrunch.out &

You may have to wait a few seconds for Blixem to start up. Place your cursor on the blue box in the top section, click on the middle mouse button and drag the box to the left or right. What happens to the sequences in the bottom section?

There are many menus available if you click with your right mouse button. Place the cursor over different areas of the screen and click with the right mouse button. Choose some of the menu options and observe their effects.

A particularly good program available through Blixem (and also directly on the command line) is Dotter. Try opening the option Dotter query vs. itself when you see it in a sub-menu.

Warning

Many of the best features of Blixem rely on being able to fetch sequences from a database. For instance, when you double click on a sequence name, the full sequence should pop up, and when you right click and choose the program Dotter, it should fetch the sequence of interest and show you a graphical pairwise alignment of your query sequence against it.

To take advantage of these features requires two steps.

• change the default search program used by Blixem (WWW-efetch) to efetch (simple)
• write a script called efetch to get sequences from databases you hold on your machine (often not as simple). The script efetch must be on your PATH.

A simple example is given on our Bioinformatics FAQ page.

http://envgen.nox.ac.uk/envgen/software/archives/000329.html#blixem_db
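The FAQ entry is the place to look for a worked example. Purely as an illustration of what such a script might involve, here is a minimal sketch that passes a sequence id to fastacmd. It rests on assumptions this handout does not confirm: that Blixem passes the sequence name as the last argument, and that the sequences you want are in the blastdb/swissprot database used earlier. The names efetch_sketch, EFETCH_DB and FASTACMD are made up for this example.

```shell
# Hypothetical efetch stand-in. Assumptions (not confirmed by this handout):
#  - the sequence name is the last argument Blixem passes
#  - sequences live in the blastdb/swissprot database used earlier
efetch_sketch() {
    db=${EFETCH_DB:-blastdb/swissprot}   # database to fetch from
    fetcher=${FASTACMD:-fastacmd}        # program that does the lookup
    if [ "$#" -eq 0 ]; then
        echo "usage: efetch [options] sequence_id" >&2
        return 1
    fi
    for id in "$@"; do :; done           # leaves $id set to the last argument
    "$fetcher" -d "$db" -s "$id"         # print the matching record
}
```

To turn this into a script, you would save it in a file called efetch with a final line calling efetch_sketch "$@", make the file executable with chmod +x, and place it in a directory on your PATH. The database path is relative in this sketch, so a real script would need the full path to the database.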

If you want to try this out, feel free and ask a demonstrator if you have problems.

Quit Blixem by right clicking over the window and choosing the option Quit .

Further help and explanation about Blixem is available at:

http://www.cgr.ki.se/cgr/groups/sonnhammer/Blixem.html

Running Jalview

Jalview is a versatile program that allows you to do multiple sequence alignments, view and edit alignments, carry out a restricted amount of phylogenetic analysis, view trees, etc.

The only “bothersome” thing about running jalview from the command line (as opposed to when it is offered as a web-based application) is that the command line is ugly. You have to give both the name of the sequence file you want to view, and the format it is in.

The formats allowed are MSF, CLUSTAL, FASTA, BLC, MSP or PIR.

Exercise

We will load the multiple sequence fasta file capsall.tfa into jalview. Type the following:

jalview capsall.tfa FASTA

You can run a clustal alignment within Jalview by going to the Align menu and choosing Local Alignment . Try this now. This is a big file, so it may take a little time.

A new window with the aligned sequences should appear eventually. If it does not have the same colouring scheme as before, go to the menu Colour and choose Clustal colours.

Please note: Clustalw run in this way runs with default options. It is a better idea to run clustalw or clustalx and then load the aligned file into Jalview, so that you have full control over the parameters used when creating your alignment.

Try out some of the other options available to you under the menus.

Running Prettyplot

Prettyplot is a program to generate “pretty” versions of multiple sequence alignments.

By now you have generated a number of alignment files and are hopefully fairly comfortable in finding out information about programs, and trying out options available to you.

Try looking at the prettyplot documentation by typing

prettyplot -h

or referring to the web documentation at

http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/prettyplot.html

Try running this program on some of your alignments, choosing various display options.
