EMBOSS Software for Sequence Analysis
Fall 08
Biochemistry 711 – Book 3 – EMBOSS Software for sequence analysis

Professor Ann Palmenberg
Institute for Molecular Virology & Department of Biochemistry
[email protected]

Dr. Jean-Yves Sgro
Biotechnology Center & Institute for Molecular Virology
[email protected]

University of Wisconsin-Madison
version 10/2008

Biochemistry 711 - 2008

This labbook is Copyright © 1997-2008 A.C. Palmenberg & J.-Y. Sgro, University of Wisconsin-Madison. All Rights Reserved (October 2008)

Foreword and Acknowledgements

The original laboratory exercises resulted from a long-term commitment to promote and foster genetic computing on the Madison campus by the Genetics Computer Group, Inc. (GCG) and its standing collaborative teaching efforts with Ann Palmenberg. John Devereux and Maggie Smith provided, through GCG, the original UNIX-based hardware and software licenses necessary to create the first such curriculum for UW students. We are thankful for their largess in providing the funding for the purchase and yearly upgrades of the original UW UNIX-based teaching computer.

The GCG exercises of this lab book were inspired by the original educational tutorials developed by Barbara Butler to teach this complex family of software programs. She has generously shared her materials and her knowledge for the benefit of UW students and staff.

GCG has now been replaced by an open source package, and the exercises have been adapted to this new package: EMBOSS, the European Molecular Biology Open Software Suite. We want to express special thanks to Ms. Marchel Hill, a course instructor, who has helped translate the GCG exercises to an EMBOSS equivalent and has unselfishly volunteered many hundreds of hours of her time, and her teaching skills, towards tutoring UW students both inside and outside of the scheduled classes.
Ann and Jean-Yves would also like to acknowledge Joshua Harder at the Digital Media Center (DMC) for the maintenance of the desktop computing classroom and John Koger for installing EMBOSS on both the Macintosh and Windows partitions.

The goal of these exercises is to provide an introduction to sequence analysis that will help students acquire expertise beneficial to their research programs. Two key lessons are (1) that computers are nothing to be afraid of, and (2) that they will only do what they are told. In this modern age of genomics, “what can I DO with my sequence, now that I have it?” and “how can I put my sequence into biological perspective?” are very important questions for the learned biologist. If by taking this lab course you simply increase your confidence when using a computer, it will be time well spent!

The BLOSUM62 matrix

BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences, and they are based on local alignments. BLOSUM was first introduced in a paper by Henikoff and Henikoff [1]. They scanned the BLOCKS database for very conserved regions of protein families (regions that have no gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. Then they calculated a log-odds score for each of the 210 possible substitutions of the 20 standard amino acids (190 pairs of different residues plus 20 identities). All BLOSUM matrices are based on observed alignments; unlike the PAM matrices, they are not extrapolated from comparisons of closely related proteins.

[1] Henikoff, S., Henikoff, JG. (1992). "Amino Acid Substitution Matrices from Protein Blocks". Proc Natl Acad Sci 89 (22): 10915–10919. doi:10.1073/pnas.89.22.10915.
PMID 1438297.
Source: http://en.wikipedia.org/wiki/BLOSUM

Introduction to EMBOSS

Table of Contents

Introduction: The EMBOSS Package
    1. History
    2. Overview
    3. License
    4. The EMBOSS software organization
        4.1. Applications
        4.2. Platforms & Interface
        4.3. Accessing the line-command
    5. Download and installation
        5.1. Windows
        5.2. Macintosh
    6. Manual, documentation and help
    7. Tutorial
EMBOSS Graphical Output
EMBOSS Commands Organized by Functional Group
GCG to EMBOSS Commands Equivalence

Introduction: The EMBOSS Package

1. History

The Genetics Computer Group package (GCG, or Wisconsin package), which originated in Madison¹, was a pioneering software package for sequence analysis that became commercial in 1992. EGCG, developed from 1988 by a group within EMBnet², provided extensions to the GCG package. Because of changes in the source code distribution rules of GCG, and other factors, the former EGCG developers created a totally new generation of academic sequence analysis software: the present EMBOSS project.

2. Overview

EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology community […]. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages³.

Citation: Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 16(6), pp. 276-277.

3. License

EMBOSS is licensed for use by everyone under the GNU General Public Licence (GPL) and the GNU Library General Public Licence (LGPL). No one individual or institute 'owns' the code. For developers who have their own licensing conditions already in effect […] the EMBASSY collection can include packages that use the EMBOSS core libraries and interfaces but under their own licensing conditions. They will be bound by the Library GPL […], but not necessarily by the full GPL. For more information see http://emboss.sourceforge.net/licence/

¹ Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387-95.
² EMBnet (http://www.embnet.org/) is the only organisation world-wide bringing bioinformatics professionals to work together to serve the expanding fields of genetics and molecular biology.
³ Rice, P., Longden, I. and Bleasby, A. "EMBOSS: The European Molecular Biology Open Software Suite", Trends in Genetics, June 2000, 16(6), pp. 276-277.

4. The EMBOSS software organization

4.1. Applications

EMBOSS is a set of a few hundred programs (applications), each of which handles a specific function. The EMBOSS applications are organized into 45 logical groups according to their function (http://emboss.sourceforge.net/apps/groups.html). The groups cover both the EMBOSS and the EMBASSY (see above) sets of applications.

For example, the group ALIGNMENT GLOBAL contains 4 applications:

Table - Global sequence alignment

Program name    Description
est2genome      Align EST and genomic DNA sequences
needle          Needleman-Wunsch global alignment
stretcher       Finds the best global alignment between two sequences
esim4           Align an mRNA to a genomic DNA sequence

while the group ALIGNMENT LOCAL contains 5 applications:

Table - Local sequence alignment

Program name    Description
matcher         Finds the best local alignments between two sequences
seqmatchall     All-against-all comparison of a set of sequences
supermatcher    Match large sequences against one or more other sequences
water           Smith-Waterman local alignment
wordmatch       Finds all exact matches of a given size between 2 sequences

4.2. Platforms & Interface

EMBOSS exists for multiple computer platforms. All platforms can support the basic line-command version of EMBOSS, including the cmd (DOS-style) interface of Microsoft Windows. The line-command applications are the core engine of EMBOSS. These commands can be called from a variety of graphical user interfaces (GUIs) that can be added over EMBOSS (some GUIs are not available for all platforms).

The most common GUI is the Java-based Jemboss, which is part of the EMBOSS development. Jemboss assumes a client-server set-up, but in some cases it can be available as a stand-alone application.

Some GUIs are specific to an operating system, such as EMBOSSrunner for Mac OS X.
There also exist various web GUI