[ Team Lib ] BLAST (Basic Local Alignment Search Tool) Is a Set Of
Total Page:16
File Type:pdf, Size:1020Kb
[ Team LiB ] • Table of Contents • Index • Reviews • Examples • Reader Reviews • Errata BLAST By Joseph Bedell, Ian Korf, Mark Yandell Publisher: O'Reilly Pub Date: July 2003 ISBN: 0-596-00299-8 Pages: 360 BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs that explore all of the available sequence databases for protein or DNA. BLAST is the only book completely devoted to this popular and important technology and offers biologists, computational biology students, and bioinformatics professionals a clear understanding of this program. This book shows you how to get specific answers with BLAST and how to use the software to interpret results. If you have an interest in sequence analysis this is a book you should own. [ Team LiB ] [ Team LiB ] • Table of Contents • Index • Reviews • Examples • Reader Reviews • Errata BLAST By Joseph Bedell, Ian Korf, Mark Yandell Publisher: O'Reilly Pub Date: July 2003 ISBN: 0-596-00299-8 Pages: 360 Copyright Forward Preface Audience for This Book Structure of This Book A Little Math, a Little Perl Conventions Used in This Book URLs Referenced in This Book Comments and Questions Acknowledgments Part I: Introduction Chapter 1. Hello BLAST Section 1.1. What Is BLAST? Section 1.2. Using NCBI-BLAST Section 1.3. Alternate Output Formats Section 1.4. Alternate Alignment Views Section 1.5. The Next Step Section 1.6. Further Reading Part II: Theory Chapter 2. Biological Sequences Section 2.1. The Central Dogma of Molecular Biology Section 2.2. Evolution Section 2.3. Genomes and Genes Section 2.4. Biological Sequences and Similarity Section 2.5. Further Reading Chapter 3. Sequence Alignment Section 3.1. Global Alignment: Needleman-Wunsch Section 3.2. Local Alignment: Smith-Waterman Section 3.3. Dynamic Programming Section 3.4. Algorithmic Complexity Section 3.5. Global Versus Local Section 3.6. Variations Section 3.7. Final Thoughts Section 3.8. Further Reading Chapter 4. Sequence Similarity Section 4.1. Introduction to Information Theory Section 4.2. Amino Acid Similarity Section 4.3. Scoring Matrices Section 4.4. Target Frequencies, lambda, and H Section 4.5. Sequence Similarity Section 4.6. Karlin-Altschul Statistics Section 4.7. Sum Statistics and Sum Scores Section 4.8. Further Reading Part III: Practice Chapter 5. BLAST Section 5.1. The Five BLAST Programs Section 5.2. The BLAST Algorithm Section 5.3. Further Reading Chapter 6. Anatomy of a BLAST Report Section 6.1. Basic Structure Section 6.2. Alignments Chapter 7. A BLAST Statistics Tutorial Section 7.1. Basic BLAST Statistics Section 7.2. Using Statistics to Understand BLAST Results Section 7.3. Where Did My Oligo Go? Chapter 8. 20 Tips to Improve Your BLAST Searches Section 8.1. Don't Use the Default Parameters Section 8.2. Treat BLAST Searches as Scientific Experiments Section 8.3. Perform Controls, Especially in the Twilight Zone Section 8.4. View BLAST Reports Graphically Section 8.5. Use the Karlin-Altschul Equation to Design Experiments Section 8.6. When Troubleshooting, Read the Footer First Section 8.7. Know When to Use Complexity Filters Section 8.8. Mask Repeats in Genomic DNA Section 8.9. Segment Large Genomic Sequences Section 8.10. Be Skeptical of Hypothetical Proteins Section 8.11. Expect Contaminants in EST Databases Section 8.12. Use Caution When Searching Raw Sequencing Reads Section 8.13. Look for Stop Codons and Frame-Shifts to find Pseudo-Genes Section 8.14. Consider Using Ungapped Alignment for BLASTX, TBLASTN, and TBLASTX Section 8.15. Look for Gaps in Coverage as a Sign of Missed Exons Section 8.16. Parse BLAST Reports with Bioperl Section 8.17. Perform Pilot Experiments Section 8.18. Examine Statistical Outliers Section 8.19. Use links and topcomboN to Make Sense of Alignment Groups Section 8.20. How to Lie with BLAST Statistics Chapter 9. BLAST Protocols Section 9.1. BLASTN Protocols Section 9.2. BLASTP Protocols Section 9.3. BLASTX Protocols Section 9.4. TBLASTN Protocols Section 9.5. TBLASTX Protocols Part IV: Industrial-Strength BLAST Chapter 10. Installation and Command-Line Tutorial Section 10.1. NCBI-BLAST Installation Section 10.2. WU-BLAST Installation Section 10.3. Command-Line Tutorial Section 10.4. Editing Scoring Matrices Chapter 11. BLAST Databases Section 11.1. FASTA Files Section 11.2. BLAST Databases Section 11.3. Sequence Databases Section 11.4. Sequence Database Management Strategies Chapter 12. Hardware and Software Optimizations Section 12.1. The Persistence of Memory Section 12.2. CPUs and Computer Architecture Section 12.3. Compute Clusters Section 12.4. Distributed Resource Management Section 12.5. Software Tricks Section 12.6. Optimized NCBI-BLAST Part V: BLAST Reference Chapter 13. NCBI-BLAST Reference Section 13.1. Usage Statements Section 13.2. Command-Line Syntax Section 13.3. blastall Parameters Section 13.4. formatdb Parameters Section 13.5. fastacmd Parameters Section 13.6. megablast Parameters Section 13.7. bl2seq Parameters Section 13.8. blastpgp Parameters (PSI-BLAST and PHI-BLAST) Section 13.9. blastclust Parameters Chapter 14. WU-BLAST Reference Section 14.1. Usage Statements Section 14.2. Command-Line Syntax Section 14.3. WU-BLAST Parameters Section 14.4. xdformat Parameters Section 14.5. xdget Parameters Part VI: Appendixes Appendix A. NCBI Display Formats Section A.1. Brief Descriptions Section A.2. Detailed Descriptions and Examples Appendix B. Nucleotide Scoring Schemes Appendix C. NCBI-BLAST Scoring Schemes Section C.1. NCBI-BLAST Matrices and Gap Costs Appendix D. blast-imager.pl Appendix E. blast2table.pl Glossary Numbers A-G H-U Colophon Index [ Team LiB ] [ Team LiB ] Copyright Copyright © 2003 O'Reilly & Associates, Inc. Printed in the United States of America. Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 [email protected] . Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a coelacanth and the topic of BLAST is a trademark of O'Reilly & Associates, Inc. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. [ Team LiB ] [ Team LiB ] Forward Reading a book such as this brings home how much BLAST-now in its teenage years-has grown, and provides an occasion for fond reflection. BLAST was born in the first months of 1989 at the National Center for Biotechnology Information (NCBI). The Center had been created at the National Institutes of Health in November 1988, by an act of the U.S. Congress, to foster the development of a field that then had no widely accepted name, but which has since come to be known as "Bioinformatics." In early 1989, David Lipman, my post-doctoral advisor, who at the time was perhaps best known as a codeveloper of the FASTA program, was appointed director of NCBI. On the first of March we moved into new offices at the National Library of Medicine.The NCBI was small, but had large ambitions, and already a number of friends. Several of these well-wishers made it a point to drop by for a visit. Gene Myers, a computer scientist then at Arizona, arrived during a week in which Science was hyping a special-purpose computer chip for sequence comparison. He and David, software partisans both, were unimpressed and over dinner resolved to do better. Their original idea was to find not subtle sequence similarities, but fairly obvious ones, and to do it in a flash. Gene pursued a rigorous approach at first, but David, with a fine Darwinian wisdom, was willing to settle for imperfection. If one were to gamble, what kind of match could one expect a strong alignment to contain? Detailed algorithmic and code development on BLAST by Webb Miller-later to be joined by Warren Gish-had hardly begun before Sam Karlin, a Stanford mathematician, came calling. I had approached him a few months earlier with a conjecture concerning the asymptotic behavior of optimal ungapped local sequence alignments. Since then, he had spun this conjecture into a beautiful theory. Now, for the first time, rigorous statistics were available for alignment scoring systems of more than academic interest, and the essential nature of amino acid substitution matrices also began to come into clear focus. This theory dovetailed perfectly with the work that had just started on BLAST: both informing the selection of its algorithmic parameters, and yielding units for the alignment scores produced. Although David chose BLAST's name as a bit of a pun on "FASTA" (it was only later that I realized "BLAST" to be an acronym), the new program was never intended to vie with the earlier one. Rather, the idea was to turn the "threshold parameter" way up, to find undoubted homologies before you take more than one sip of coffee. It surprised us all when BLAST started returning most weak similarities as well. Thus was born a sort of friendly competition with Bill Pearson's and David's earlier creation. From the start, BLAST had two major advantages to FASTA and one major disadvantage.