Bioinformatics Programming Using Python

Mitchell L Model

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo

Bioinformatics Programming Using Python
by Mitchell L Model

Copyright © 2010 Mitchell L Model. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Mike Loukides
Production Editor: Sarah Schneider
Copyeditor: Rachel Head
Proofreader: Sada Preisch
Indexer: Lucie Haskins
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
December 2009: First Edition.

O’Reilly and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Bioinformatics Programming Using Python, the image of a brown rat, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-0-596-15450-9
[M]
1259959883

Table of Contents

Preface

1. Primitives
    Simple Values
        Booleans
        Integers
        Floats
        Strings
    Expressions
        Numeric Operators
        Logical Operations
        String Operations
        Calls
        Compound Expressions
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

2. Names, Functions, and Modules
    Assigning Names
    Defining Functions
        Function Parameters
        Comments and Documentation
        Assertions
        Default Parameter Values
    Using Modules
        Importing
        Python Files
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

3. Collections
    Sets
    Sequences
        Strings, Bytes, and Bytearrays
        Ranges
        Tuples
        Lists
    Mappings
        Dictionaries
    Streams
        Files
        Generators
    Collection-Related Expression Features
        Comprehensions
        Functional Parameters
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

4. Control Statements
    Conditionals
    Loops
        Simple Loop Examples
        Initialization of Loop Values
        Looping Forever
        Loops with Guard Conditions
    Iterations
        Iteration Statements
        Kinds of Iterations
    Exception Handlers
        Python Errors
        Exception Handling Statements
        Raising Exceptions
    Extended Examples
        Extracting Information from an HTML File
        The Grand Unified Bioinformatics File Parser
        Parsing GenBank Files
        Translating RNA Sequences
        Constructing a Table from a Text File
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

5. Classes
    Defining Classes
        Instance Attributes
        Class Attributes
    Class and Method Relationships
        Decomposition
        Inheritance
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

6. Utilities
    System Environment
        Dates and Times: datetime
        System Information
        Command-Line Utilities
        Communications
    The Filesystem
        Operating System Interface: os
        Manipulating Paths: os.path
        Filename Expansion: fnmatch and glob
        Shell Utilities: shutil
        Comparing Files and Directories
    Working with Text
        Formatting Blocks of Text: textwrap
        String Utilities: string
        Comma- and Tab-Separated Formats: csv
        String-Based Reading and Writing: io
    Persistent Storage
        Persistent Text: dbm
        Persistent Objects: pickle
        Keyed Persistent Object Storage: shelve
    Debugging Tools
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

7. Pattern Matching
    Fundamental Syntax
        Fixed-Length Matching
        Variable-Length Matching
        Greedy Versus Nongreedy Matching
        Grouping and Disjunction
    The Actions of the re Module
        Functions
        Flags
        Methods
    Results of re Functions and Methods
        Match Object Fields
        Match Object Methods
    Putting It All Together: Examples
        Some Quick Examples
        Extracting Descriptions from Sequence Files
        Extracting Entries From Sequence Files
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

8. Structured Text
    HTML
        Simple HTML Processing
        Structured HTML Processing
    XML
        The Nature of XML
        An XML File for a Complete Genome
        The ElementTree Module
        Event-Based Processing
        expat
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

9. Web Programming
    Manipulating URLs: urllib.parse
        Disassembling URLs
        Assembling URLs
    Opening Web Pages: webbrowser
        Module Functions
        Constructing and Submitting Queries
        Constructing and Viewing an HTML Page
    Web Clients
        Making the URLs in a Response Absolute
        Constructing an HTML Page of Extracted Links
        Downloading a Web Page’s Linked Files
    Web Servers
        Sockets and Servers
        CGI
        Simple Web Applications
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

10. Relational Databases
    Representation in Relational Databases
        Database Tables
        A Restriction Enzyme Database
    Using Relational Data
        SQL Basics
        SQL Queries
        Querying the Database from a Web Page
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

11. Structured Graphics
    Introduction to Graphics Programming
        Concepts
        GUI Toolkits
    Structured Graphics with tkinter
        tkinter Fundamentals
        Examples
    Structured Graphics with SVG
        SVG File Contents
        Examples
    Tips, Traps, and Tracebacks
        Tips
        Traps
        Tracebacks

A. Python Language Summary

B. Collection Type Summary

Index

Preface

This preface provides information I expect will be important for someone reading and using this book. The first part introduces the book itself. The second talks about Python. The third part contains other notes of various kinds.

Introduction

I would like to begin with some comments about this book, the field of bioinformatics, and the kinds of people I think will find it useful.
About This Book

The purpose of this book is to show the reader how to use the Python programming language to facilitate and automate the wide variety of data manipulation tasks encountered in life science research and development. It is designed to be accessible to readers with a range of interests and backgrounds, both scientific and technical. It emphasizes practical programming, using meaningful examples of useful code. In addition to meeting the needs of individual readers, it can also be used as a textbook for a one-semester upper-level undergraduate or graduate-level course.

The book differs from traditional introductory programming texts in a variety of ways. It does not attempt to detail every possible variation of the mechanisms it describes, emphasizing instead the most frequently used. It offers an introduction to Python programming that is more rapid and in some ways more superficial than what would be found in a text devoted solely to Python or introductory programming. At the same time, it includes some advanced features, techniques, and topics that are often omitted from entry-level Python books. These are included because of their wide applicability in bioinformatics programming, and they are used extensively in the book’s examples.

Python’s installation includes a large selection of optional components called “modules.” Python books usually cover a small selection of the most generally useful modules, and perhaps some others in less detail. Having bioinformatics programming as this book’s target had some interesting effects on the choice of which modules to discuss, and at what depth. The modules (or parts of modules) that are covered in this book are the ones that are most likely to be particularly valuable in bioinformatics programming. In some cases the discussions are more substantial than would be found in a generic Python book, and many of the modules covered here appear in few other books.
Chapter 6, in particular, describes a large number of narrowly focused “utility” modules. The remaining chapters focus on particular
