Introduction to Bioinformatics Software on Bio-Linux
The aim of this practical is to give you experience with a number of programs using different command line and graphical interfaces. We will begin by looking at the practicalities and advantages of running programs on the command line.

By the end of this course, we hope you will have a feel for some of the software pre-installed on Bio-Linux and the different ways to access it. The main points we hope you take away with you are:

1.) If you have repetitive tasks to carry out, chances are there is a way of automating the job, at least to some extent.
2.) Web interfaces are easy, and have certain benefits, but there are other ways to access software, and sometimes they will suit your needs better.
4.) We can be contacted for help. Please mail us if you have questions or problems relating to your account or your analysis! Email [email protected]

Please note: The point of this practical is not to give you a thorough knowledge of any particular program, but rather to introduce you to some of the software, ways of using it, and where to find documentation and help. Before you use any of the programs to analyse your own data, we highly recommend that you read the documentation!

The programs we will be using in this part of the practical are:

readseq      converting between different sequence formats
remap        restriction mapping (an EMBOSS program)
clustalw     multiple sequence alignment program (command line)
clustalx     multiple sequence alignment program (Xwindows)
blast        searching sequence databases with sequence data
MSPcrunch    post-blast processing (command line and web-based)
blixem       further post-blast processing (Xwindows)
jalview      multiple sequence editor and more (Xwindows)
prettyplot   creating pretty versions of multiple sequence alignments (EMBOSS)

The sample sequences you need for this session are in a directory named intro_pract. Please move into this directory if you are not already there.
Interface choices - pros and cons of different interfaces

Command line
Pros
• Fast to run
• Very flexible: many options are available
• Repetitive tasks can be carried out easily and quickly
Cons
• Have to learn syntax (just need to read the documentation!)

Prompted command line
Pros
• Can get the flexibility of the command line without having to type in everything
Cons
• Easy to forget the diversity of options that exist for many programs
• Slow to run compared with “pure” command line

Xwindows
Pros
• More intuitive than the command line (windows-like)
• Usually quite colourful
• Some programs can only be run through Xwindows (e.g. Staden and Blixem)
• Often, extensive help is readily available through the menu system
Cons
• Much slower to use than command line, especially for repetitive tasks

Web-based
Pros
• Usually very intuitive
• Some web-based programs are linked together so you can directly use the results of one program as input to the next
Cons
• Slow to use relative to the command line, especially with repetitive tasks
• Your data needs to be in an accessible location so you can either upload it, or copy/paste it into the web form
• You need to consider where and how to save the results
• Network speed affects how fast you get your results
• Security issues

The most effective way to analyse your data is often to use programs through a combination of the above interfaces, depending on your requirements. For repetitive tasks, please consider using the command line and learning to use scripting to automate what you need to do.

Some general points before you start

File naming conventions in bioinformatics

Some bioinformatics programs will present you with a default filename for the output. It is a good idea either to accept the default (depending on how sensible it is) or, if not, at least to take note of the default suffix (usually the part following a dot). E.g.
the default output filename for a clustal format multiple sequence alignment might be output.aln. It is common to call clustal format files something.aln. If you were outputting a multiple sequence alignment in msf format, it might be called output.msf.

You are not restricted to naming your files in any particular way, but we highly recommend that you follow the convention for the type of file you are generating or saving. Examples of naming conventions will be pointed out throughout the practical.

Benefits of following the general trends include:

• you will have at least some notion of what kind of data are in a file just by looking at the name of the file
• by following the standard conventions (rather than making up your own), you will make it easier for other people looking at your files (e.g. collaborators, or people helping you) to know what is in them just by looking at the filename.

Following file naming conventions will save you a lot of time!

Naming files and the danger of over-writing previous results

Many programs will suggest a name for your results file. Sometimes this name is generated by taking the beginning of the name of your input file and adding a new suffix. However, sometimes it is just a generic name like prettyplot.ps or clustalw.aln. We encourage you to change generic names as soon as you can. A filename like prettyplot.ps gives you little idea what data you actually analysed. Worse, if you do not change the name, the next time a file of the same name is generated, you will overwrite your previous results.

Sequence formats

A simple thing that often trips people up is sequence formats. You can think of a sequence format as how the sequence looks on the page, or on the screen, as well as how it is stored. Sequences are stored in text or binary formats. Text formats are human readable; binary formats are not human readable (for most humans).
Examples of text formats:

• embl
• plain/staden
• msf
• genbank
• clustal
• gcg (seq format)
• fasta
• phylip

The reasons there are different sequence formats are both historical and functional. When people first started writing biological analysis programs, they would design a format that their program would understand. As time went on, numerous formats came into existence. We live with the legacy of this; we must be aware of what format a sequence is in, and whether the program we want to run understands it.

Functionally, some programs require information that can be handled by some formats, but not others. For example, embl format files can contain lots of descriptive information about a sequence, whereas plain format contains none, and fasta format can contain only a small amount. Clustal and msf formats can handle multiple sequences that are aligned, and phylip format files can contain information relevant to phylogenetic analysis programs.

To be able to analyse data, it must be presented to the analysis program in a format it can understand, and must be appropriate for the analysis you are performing. This seems obvious, but frequent errors (or worse, meaningless results) occur when the data entered into a program is not appropriate.

Note: EMBOSS is a large, and recommended, package of programs for sequence analysis. EMBOSS programs accept data in many different formats, and sequence format conversion is rarely required when using EMBOSS programs.

A common problem: what is a text file and what is not

Word documents may look like text, but they aren’t! The letters you see on the page of a Word document (or Word Perfect, or most word processing programs) are actually stored in what is known as binary format. Most sequence analysis programs expect text. Plain old, nothing fancy, text.
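One rough way to check whether a file is plain text before passing it to an analysis program is sketched below in Python. This is only a heuristic, not something provided by Bio-Linux: the function name looks_like_plain_text and the 4096-byte sample size are assumptions made for illustration. It treats a file as binary if the start of the file contains a NUL byte or cannot be decoded as UTF-8 text.

```python
import codecs


def looks_like_plain_text(path, sample_size=4096):
    """Heuristic check: True if the start of the file looks like plain text.

    Hypothetical helper for illustration only. Word processor files
    usually contain NUL bytes or byte sequences that are not valid text.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)

    # NUL bytes essentially never occur in plain text sequence files.
    if b"\x00" in sample:
        return False

    # An incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Calling looks_like_plain_text on a fasta or embl file should return True, whereas a sequence saved as a Word document would normally return False. A True result only means the file is text; it does not tell you which sequence format it is in.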
It is an unusual situation to need to use sequence data that has been stored as a Word document (if it is not unusual to you, please ask a demonstrator as you may be doing things the hard way!). To get a text document when using Word, save it as text only.

Please note: If you are using Word in any part of your bioinformatics analysis, you are probably doing things the hard way! Please contact us for alternatives!

Converting between different sequence formats

There are a number of programs available to convert one sequence format to another. One of the most versatile is readseq. Readseq allows conversion between many different formats, both for files containing single sequences and for those containing more than one sequence.

Exercise

Converting sequences from embl to fasta format.

First look at testseq1.embl. Notice the type of information held in this file.

    less testseq1.embl

Quit less by typing ‘q’.

Now, type readseq on the command line. You will be prompted for all additional required information. Read the prompts carefully! Readseq asks for the name of an output file before the input file! You don’t want to end up overwriting your original data!

Give the output filename as: testseq1.tfa

As long as your sequence is in a recognised format, readseq will understand it. You will need to specify what output format you want when you are prompted. In this case, it will be number 8, Fasta.

When prompted, give the input filename as testseq1.embl

You will now again be presented with the prompt

    Name an input sequence or -option:

This gives you the opportunity to specify other filenames (if you were creating a multiple sequence file) or other formatting options. We don’t need to do this, so just press the return key.

Now list your files by typing

    ls

You should see one called testseq1.tfa. Look at it using the command less:

    less testseq1.tfa

Compare this fasta-formatted version to the embl-formatted version, testseq1.embl.
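To make the difference between the two formats concrete, here is a minimal Python sketch of what an embl to fasta conversion involves. It is not how readseq works internally, and the function name embl_to_fasta is hypothetical. It assumes the standard embl layout: an ID line with the entry name, optional DE description lines, an SQ line that starts the sequence block, and a // line that ends each entry. All other annotation is discarded, which is exactly the information you lose when moving from embl to fasta.

```python
def embl_to_fasta(embl_text, width=60):
    """Convert embl-format text to fasta-format text (illustrative sketch).

    Keeps only the entry name (ID line), the description (DE lines)
    and the sequence itself; every other embl line type is ignored.
    """
    records = []
    name, description, residues, in_sequence = None, [], [], False

    for line in embl_text.splitlines():
        if line.startswith("//"):
            # End of entry: emit a header line and the wrapped sequence.
            if name is not None:
                header = ">" + " ".join([name] + description)
                seq = "".join(residues)
                chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
                records.append("\n".join([header] + chunks))
            name, description, residues, in_sequence = None, [], [], False
        elif in_sequence:
            # Sequence lines hold residues plus spaces and a position count.
            residues.extend(ch for ch in line if ch.isalpha())
        elif line.startswith("ID"):
            fields = line[2:].split()
            name = fields[0].rstrip(";") if fields else "unknown"
        elif line.startswith("DE"):
            description.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_sequence = True

    return "".join(record + "\n" for record in records)
```

Reading testseq1.embl into a string and passing it to this function should give text comparable to testseq1.tfa, although the header line that readseq writes may differ in its details.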