Beginning Perl for Bioinformatics
James Tisdall
Publisher: O'Reilly
First Edition October 2001
ISBN: 0-596-00080-4, 384 pages
This book shows biologists with little or no programming experience how to use Perl, the ideal language for biological data analysis. Each chapter focuses on solving a particular problem or class of problems, so you'll finish the book with a solid understanding of Perl basics, a collection of programs for such tasks as parsing BLAST and GenBank, and the skills to tackle more advanced bioinformatics programming.

Preface
    What Is Bioinformatics?
    About This Book
    Who This Book Is For
    Why Should I Learn to Program?
    Structure of This Book
    Conventions Used in This Book
    Comments and Questions
    Acknowledgments
1. Biology and Computer Science
    1.1 The Organization of DNA
    1.2 The Organization of Proteins
    1.3 In Silico
    1.4 Limits to Computation
2. Getting Started with Perl
    2.1 A Low and Long Learning Curve
    2.2 Perl's Benefits
    2.3 Installing Perl on Your Computer
    2.4 How to Run Perl Programs
    2.5 Text Editors
    2.6 Finding Help
3. The Art of Programming
    3.1 Individual Approaches to Programming
    3.2 Edit-Run-Revise (and Save)
    3.3 An Environment of Programs
    3.4 Programming Strategies
    3.5 The Programming Process
4. Sequences and Strings
    4.1 Representing Sequence Data
    4.2 A Program to Store a DNA Sequence
    4.3 Concatenating DNA Fragments
    4.4 Transcription: DNA to RNA
    4.5 Using the Perl Documentation
    4.6 Calculating the Reverse Complement in Perl
    4.7 Proteins, Files, and Arrays
    4.8 Reading Proteins in Files
    4.9 Arrays
    4.10 Scalar and List Context
    4.11 Exercises
5. Motifs and Loops
    5.1 Flow Control
    5.2 Code Layout
    5.3 Finding Motifs
    5.4 Counting Nucleotides
    5.5 Exploding Strings into Arrays
    5.6 Operating on Strings
    5.7 Writing to Files
    5.8 Exercises
6. Subroutines and Bugs
    6.1 Subroutines
    6.2 Scoping and Subroutines
    6.3 Command-Line Arguments and Arrays
    6.4 Passing Data to Subroutines
    6.5 Modules and Libraries of Subroutines
    6.6 Fixing Bugs in Your Code
    6.7 Exercises
7. Mutations and Randomization
    7.1 Random Number Generators
    7.2 A Program Using Randomization
    7.3 A Program to Simulate DNA Mutation
    7.4 Generating Random DNA
    7.5 Analyzing DNA
    7.6 Exercises
8. The Genetic Code
    8.1 Hashes
    8.2 Data Structures and Algorithms for Biology
    8.3 The Genetic Code
    8.4 Translating DNA into Proteins
    8.5 Reading DNA from Files in FASTA Format
    8.6 Reading Frames
    8.7 Exercises
9. Restriction Maps and Regular Expressions
    9.1 Regular Expressions
    9.2 Restriction Maps and Restriction Enzymes
    9.3 Perl Operations
    9.4 Exercises
10. GenBank
    10.1 GenBank Files
    10.2 GenBank Libraries
    10.3 Separating Sequence and Annotation
    10.4 Parsing Annotations
    10.5 Indexing GenBank with DBM
    10.6 Exercises
11. Protein Data Bank
    11.1 Overview of PDB
    11.2 Files and Folders
    11.3 PDB Files
    11.4 Parsing PDB Files
    11.5 Controlling Other Programs
    11.6 Exercises
12. BLAST
    12.1 Obtaining BLAST
    12.2 String Matching and Homology
    12.3 BLAST Output Files
    12.4 Parsing BLAST Output
    12.5 Presenting Data
    12.6 Bioperl
    12.7 Exercises
13. Further Topics
    13.1 The Art of Program Design
    13.2 Web Programming
    13.3 Algorithms and Sequence Alignment
    13.4 Object-Oriented Programming
    13.5 Perl Modules
    13.6 Complex Data Structures
    13.7 Relational Databases
    13.8 Microarrays and XML
    13.9 Graphics Programming
    13.10 Modeling Networks
    13.11 DNA Computers
A. Resources
    A.1 Perl
    A.2 Computer Science
    A.3 Linux
    A.4 Bioinformatics
    A.5 Molecular Biology
B. Perl Summary
    B.1 Command Interpretation
    B.2 Comments
    B.3 Scalar Values and Scalar Variables
    B.4 Assignment
    B.5 Statements and Blocks
    B.6 Arrays
    B.7 Hashes
    B.8 Operators
    B.9 Operator Precedence
    B.10 Basic Operators
    B.11 Conditionals and Logical Operators
    B.12 Binding Operators
    B.13 Loops
    B.14 Input/Output
    B.15 Regular Expressions
    B.16 Scalar and List Context
    B.17 Subroutines and Modules
    B.18 Built-in Functions

Preface

What Is Bioinformatics?

Biological data is proliferating rapidly. Public databases such as GenBank and the Protein Data Bank have been growing exponentially for some time now. With the advent of the World Wide Web and fast Internet connections, the data contained in these databases and a great many special-purpose programs can be accessed quickly, easily, and cheaply from any location in the world. As a consequence, computer-based tools now play an increasingly critical role in the advancement of biological research.

Bioinformatics, a rapidly evolving discipline, is the application of computational tools and techniques to the management and analysis of biological data. The term bioinformatics is relatively new, and as defined here, it encroaches on such terms as "computational biology" and others. The use of computers in biology research predates the term bioinformatics by many years. For example, the determination of 3D protein structure from X-ray crystallographic data has long relied on computer analysis.

In this book I refer to the use of computers in biological research as bioinformatics. It's important to be aware, however, that others may make different distinctions between the terms. In particular, bioinformatics is often the term used when referring to the data and the techniques used in large-scale sequencing and analysis of entire genomes, such as C.
elegans, Arabidopsis, and Homo sapiens.

What Bioinformatics Can Do

Here's a short example of bioinformatics in action. Let's say you have discovered a very interesting segment of mouse DNA and you suspect it may hold a clue to the development of fatal brain tumors in humans. After sequencing the DNA, you perform a search of GenBank and other data sources using web-based sequence alignment tools such as BLAST. Although you find a few related sequences, you don't get a direct match or any information that indicates a link to the brain tumors you suspect exist.

You know that the public genetic databases are growing daily and rapidly. You would like to perform your searches every day, comparing the results to the previous searches, to see if anything new appears in the databases. But this could take an hour or two each day! Luckily, you know Perl. With a day's work, you write a program (using the Bioperl module among other things) that automatically conducts a daily BLAST search of GenBank for your DNA sequence, compares the results with the previous day's results, and sends you email if there has been any change. This program is so useful that you start running it for other sequences as well, and your colleagues also start using it. Within a few months, your day's worth of work has saved many weeks of work for your community.

This example is taken from real life. Programs now exist that you can use for this purpose, and there are even web sites where you can submit your DNA sequence and your email address, and they'll do all the work for you! This is only a small example of what happens when you apply the power of computation to a biological problem. This is bioinformatics.

About This Book

This book is a tutorial for biologists on how to program, and is designed for beginning programmers. The examples and exercises, with only a few exceptions, use biological data. The book's goal is twofold: it teaches programming skills and applies them to interesting biological areas.

I want to get you up and programming as quickly and painlessly as possible. I aim for simplicity of explanation, not completeness of coverage. I don't always strictly define the programming concepts, because formal definitions can be distracting. The Perl language makes it possible to start writing real programs quickly. As you continue reading this book and the online Perl documentation, you'll fill in the details, learn better ways of doing things, and improve your understanding of programming concepts.

Depending on your style of learning, you can approach this material in different ways. One way, as the King gravely said to Alice, is to "Begin at the beginning and go on till you come to the end: then stop." (This line from Alice in Wonderland is often used as a whimsical definition of an algorithm.) The material is organized to be read in this fashion, as a narrative.

Another approach is to get the programs into your computer, run them, see what they do, and perhaps try to alter this or that in the program to see what effect your changes have. This may be combined with a quick skim of the text of the chapter. This is a common approach used by programmers when learning a new language. Basically, you learn by imitation, looking at actual programs.

Anyone wishing to learn Perl programming for bioinformatics should try the exercises found at the end of most chapters. They are given in approximate order of difficulty, and some of the higher-numbered exercises are fairly challenging and may be appropriate for classroom projects. Because there's more than one way to do things in Perl, there is no one correct answer to an exercise.
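
To make the point about there being more than one way to do things concrete, here is a minimal sketch that counts the G bases in a DNA string in three different ways. It is an illustration only, not a program from the book: the sequence and the subroutine names (count_g_tr, count_g_loop, count_g_regex) are hypothetical and chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example sequence (an assumption, not data from the book).
my $dna = 'GATTACAGG';

# Way 1: tr/// returns the number of characters it matched.
# With an empty replacement list the string itself is left unchanged.
sub count_g_tr {
    my ($seq) = @_;
    return ($seq =~ tr/G//);
}

# Way 2: split the string into single characters and test each one.
sub count_g_loop {
    my ($seq) = @_;
    my $count = 0;
    foreach my $base (split //, $seq) {
        $count++ if $base eq 'G';
    }
    return $count;
}

# Way 3: a global match in list context returns every match;
# assigning that list through () to a scalar yields how many there were.
sub count_g_regex {
    my ($seq) = @_;
    my $count = () = $seq =~ /G/g;
    return $count;
}

print "tr:    ", count_g_tr($dna),    "\n";
print "loop:  ", count_g_loop($dna),  "\n";
print "regex: ", count_g_regex($dna), "\n";
```

All three subroutines should report the same count for any input. Which one is preferable is partly a matter of taste: the tr version is the shortest, while the loop is the easiest to extend, for example to count all four bases at once.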