Sequence Analysis in a Nutshell by Darryl Leon, Scott Markel
Total Page:16
File Type:pdf, Size:1020Kb
[ Team LiB ] • Table of Contents • Index • Reviews • Reader Reviews • Errata Sequence Analysis in a Nutshell By Darryl Leon, Scott Markel Publisher : O'Reilly Pub Date : January 2003 ISBN : 0-596-00494-X Pages : 302 Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases pulls together all of the vital information about the most commonly used databases, analytical tools, and tables used in sequence analysis. The book contains details and examples of the common database formats (GenBank, EMBL, SWISS- PROT) and the GenBank/EMBL/DDBJ Feature Table Definitions. It also provides the command line syntax for popular analysis applications such as Readseq and MEME/MAST, BLAST, ClustalW, and the EMBOSS suite, as well as tables of nucleotide, genetic, and amino acid codes. Written in O'Reilly's enormously popular, straightforward "Nutshell" format, this book draws together essential information for bioinformaticians in industry and academia, as well as for students. If sequence analysis is part of your daily life, you'll want this easy-to-use book on your desk. [ Team LiB ] [ Team LiB ] • Table of Contents • Index • Reviews • Reader Reviews • Errata Sequence Analysis in a Nutshell By Darryl Leon, Scott Markel Publisher : O'Reilly Pub Date : January 2003 ISBN : 0-596-00494-X Pages : 302 Copyright Preface Sequence Analysis Tools and Databases How This Book Is Organized Assumptions This Book Makes Conventions Used in This Book How to Contact Us Acknowledgments Part I: Data Formats Chapter 1. FASTA Format Section 1.1. NCBI's Sequence Identifier Syntax Section 1.2. NCBI's Non-Redundant Database Syntax Section 1.3. References Chapter 2. GenBank/EMBL/DDBJ Section 2.1. Example Flat Files Section 2.2. GenBank Example Flat File Section 2.3. DDBJ Example Flat File Section 2.4. GenBank/DDBJ Field Definitions Section 2.5. EMBL Example Flat File Section 2.6. EMBL Field Definitions Section 2.7. DDBJ/EMBL/GenBank Feature Table Section 2.8. References Chapter 3. SWISS-PROT Section 3.1. SWISS-PROT Example Flat File Section 3.2. SWISS-PROT Field Definitions Section 3.3. SWISS-PROT Feature Table Section 3.4. References Chapter 4. Pfam Section 4.1. Pfam Example Flat File Section 4.2. Pfam Field Definitions Section 4.3. References Chapter 5. PROSITE Section 5.1. PROSITE Example Flat File Section 5.2. PROSITE Field Definitions Section 5.3. References Part II: Tools Chapter 6. Readseq Section 6.1. Supported Formats Section 6.2. Command-Line Options Section 6.3. References Chapter 7. BLAST formatdb blastall megablast blastpgp PSI-BLAST PHI-BLAST bl2seq Section 7.1. References Chapter 8. BLAT Section 8.1. Command-Line Options Section 8.2. References Chapter 9. ClustalW Section 9.1. Command-Line Options Section 9.2. References Chapter 10. HMMER hmmalign hmmbuild hmmcalibrate hmmconvert hmmemit hmmfetch hmmindex hmmpfam hmmsearch Section 10.1. References Chapter 11. MEME/MAST Section 11.1. MEME Section 11.2. MAST Section 11.3. References Chapter 12. EMBOSS Section 12.1. Common Themes Section 12.2. List of All EMBOSS Programs Section 12.3. Details of EMBOSS Programs aaindexextract abiview alignwrap antigenic backtranseq banana biosed btwisted cai chaos charge checktrans chips cirdna codcmp coderet compseq cons contacts cpgplot cpgreport cusp cutgextract cutseq dan dbiblast dbifasta dbiflat dbigcg degapseq descseq dichet diffseq digest distmat domainer dotmatcher dotpath dottup dreg einverted embossdata embossversion emma emowse entret eprimer3 equicktandem est2genome etandem extractfeat extractseq findkm freak funky fuzznuc fuzzpro fuzztran garnier geecee getorf helixturnhelix hetparse hmmgen hmoment iep infoalign infoseq interface isochore lindna listor marscan maskfeat maskseq matcher megamerger merger msbar mwcontam mwfilter needle newcpgreport newcpgseek newseq noreturn notseq nrscope nthseq octanol oddcomp palindrome pasteseq patmatdb patmatmotifs pdbparse pdbtosp pepcoil pepinfo pepnet pepstats pepwheel pepwindow pepwindowall plotcon plotorf polydot preg prettyplot prettyseq primersearch printsextract profgen profit prophecy prophet prosextract pscan psiblast rebaseextract recoder redata remap restover restrict revseq scopalign scope scopnr scopparse scoprep scopreso scopseqs seealso seqalign seqmatchall seqnr seqret seqretsplit seqsearch seqsort seqwords showalign showdb showfeat showorf showseq shuffleseq sigcleave siggen sigscan silent skipseq splitter stretcher stssearch supermatcher swissparse syco textsearch tfextract tfm tfscan tmap tranalign transeq trimest trimseq union vectorstrip water whichdb wobble wordcount wordmatch wossname yank Section 12.4. References Part III: Appendixes Appendix A. Nucleotide andAmino Acid Tables Section A.1. Nucleotide Codes Section A.2. Amino Acid Codes Section A.3. References Appendix B. Genetic Codes Section B.1. The Standard Code Section B.2. Vertebrate Mitochondrial Code Section B.3. Yeast Mitochondrial Code Section B.4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code Section B.5. Invertebrate Mitochondrial Code Section B.6. Ciliate, Dasycladacean, and Hexamita Nuclear Code Section B.7. Echinoderm and Flatworm Mitochondrial Code Section B.8. Euplotid Nuclear Code Section B.9. Bacterial and Plant Plastid Code Section B.10. Alternative Yeast Nuclear Code Section B.11. Ascidian Mitochondrial Code Section B.12. Alternative Flatworm Mitochondrial Code Section B.13. Blepharisma Nuclear Code Section B.14. Chlorophycean Mitochondrial Code Section B.15. Trematode Mitochondrial Code Section B.16. Scenedesmus Obliquus Mitochondrial Code Section B.17. Thraustochytrium Mitochondrial Code Section B.18. References Appendix C. Resources Section C.1. Web Sites Section C.2. Books Section C.3. Journal Articles Appendix D. Future Plans crystalball Colophon Index [ Team LiB ] [ Team LiB ] Copyright Copyright © 2003 O'Reilly & Associates, Inc. Printed in the United States of America. Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected]. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a liger and the topic of sequence analysis is a trademark of O'Reilly & Associates, Inc. Material in Chapter 3 (SWISS-PROT) and Chapter 5 (PROSITE) is used with the permission of the Swiss Institute of Bioinformatics. Material in Chapter 8 (BLAT) is used with the permission of Jim Kent. Material in Chapter 10 (HMMER) is used with the permission of Sean Eddy. Material in Chapter 11 (MEME/MAST) is used with the permission of Michael Gribscov and Tim Baily. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. [ Team LiB ] [ Team LiB ] Preface Gene sequence data is the most abundant type of data available, and there is a rich array of computational methods and tools that can help analyze patterns within that data. This book brings together the detailed terms, definitions, and command-line options found in the key databases and tools used in sequence analysis. It's meant for use by bioinformaticians in both industry and academia, as well as students. This book is a handy resource and an invaluable reference for anyone who needs to know about the practical aspects and mechanics of sequence analysis. It's no coincidence that the gene sequences of related species of plants, animals, and microorganisms show complex patterns of similarity to one another. This is one of the most fascinating aspects of the study of evolution. In fact, many molecular biologists are convinced that an understanding of sequence evolution is the first step toward understanding evolution itself. The comparison of gene sequences, or biological sequence analysis, is one of the processes used to understand sequence evolution. It is an important discipline within computational biology and bioinformatics. If you're new to the field, this book won't teach you how to perform sequence analysis, but it will help you sort out the details of the common tools and data sources used for sequence analysis. If sequence analysis is part of your daily lives (as it is for us), you'll want this easy-to-use book on your desk. We've included many references (especially URLs) for further information on the tools we document, but with this book handy we hope you won't need to use them. [ Team LiB ] [ Team LiB ] Sequence Analysis Tools and Databases Many of the software tools used in studying genomes involve sequence analysis, which is one of the many subfields of computational molecular biology. The field of sequence analysis includes pattern and motif searching, sequence comparison, multiple sequence alignment, sequence composition determination, and secondary structure prediction. Because sequence data consists primarily of character strings, it's relatively easy