BookID _ChapID _Proof# 1 - 29/08/2009

David Edwards ● Jason Stajich ● David Hansen Editors

Bioinformatics

Tools and Applications

Editors

David Edwards
Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, QLD 4072, Australia

David Hansen
Australian E-Health Research Centre, CSIRO, QLD 4027, Brisbane, Australia

Jason Stajich
Department of Plant Pathology and Microbiology, University of California, Berkeley, CA, USA

ISBN 978-0-387-92737-4 e-ISBN 978-0-387-92738-1 DOI 10.1007/978-0-387-92738-1 Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2009927717

© Springer Science+Business Media, LLC 2009 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Biology has progressed tremendously in the last decade, due in part to increased automation in the generation of data, from sequences to genotypes to phenotypes. Biology is now very much an information science, and bioinformatics provides the means to connect biological data to hypotheses. Within this volume, we have collated chapters describing various areas of applied bioinformatics, from the analysis of sequence, literature, and functional data to the function and evolution of organisms. The ability to process and interpret large volumes of data is essential now that new high-throughput DNA sequencers are producing an overload of sequence data. Initial chapters provide an introduction to the analysis of DNA and protein sequences, from motif detection to gene prediction and annotation, with specific chapters on DNA and protein databases as well as data visualization. Additional chapters focus on gene expression analysis from the perspective of traditional microarrays and more recent sequence-based approaches. These are followed by an introduction to the evolving field of phenomics, with specific chapters detailing advances in plant and microbial phenome analysis and a chapter dealing with the important issue of standards for functional genomics. Further chapters present the area of literature databases and associated mining tools, which are becoming increasingly essential for interpreting the vast volume of published biological information. The final chapters present bioinformatics purely from a developer's point of view, describing the various data and databases as well as common programming languages used for bioinformatics applications. These chapters provide an introduction and motivation to further avenues for implementation.

Together, this volume aims to provide a resource for biology students wanting a greater understanding of the encroaching area of bioinformatics, as well as computer scientists who are interested in learning more about the field of applied bioinformatics.

Brisbane, QLD, David Edwards
Berkeley, CA, Jason E. Stajich
Brisbane, QLD, David Hansen

Contents

1 DNA Sequence Databases
David Edwards, David Hansen, and Jason E. Stajich

2 Sequence Comparison Tools
Michael Imelfort

3 Genome Browsers
Sheldon McKay and Scott Cain

4 Predicting Non-coding RNA Transcripts
Laura A. Kavanaugh and Uwe Ohler

5 Gene Prediction Methods
William H. Majoros, Ian Korf, and Uwe Ohler

6 Gene Annotation Methods
Laurens Wilming and Jennifer Harrow

7 Regulatory Motif Analysis
Alan Moses and Saurabh Sinha

8 Molecular Marker Discovery and Genetic Map Visualisation
Chris Duran, David Edwards, and Jacqueline Batley

9 Sequence Based Gene Expression Analysis
Lakshmi K. Matukumalli and Steven G. Schroeder

10 Protein Sequence Databases
Terry Clark

11 Protein Structure Prediction
Sitao Wu and Yang Zhang

12 Classification of Information About Proteins
Amandeep S. Sidhu, Matthew I. Bellgard, and Tharam S. Dillon

13 High-Throughput Plant Phenotyping – Data Acquisition, Transformation, and Analysis
Matthias Eberius and José Lima-Guerra

14 Phenome Analysis of Microorganisms
Christopher M. Gowen and Stephen S. Fong

15 Standards for Functional Genomics
Stephen A. Chervitz, Helen Parkinson, Jennifer M. Fostel, Helen C. Causton, Susanna-Assunta Sansone, Eric W. Deutsch, Dawn Field, Chris F. Taylor, Philippe Rocca-Serra, Joe White, and Christian J. Stoeckert

16 Literature Databases
J. Lynn Fink

17 Advanced Literature-Mining Tools
Pierre Zweigenbaum and Dina Demner-Fushman

18 Data and Databases
Daniel Damian

19 Programming Languages
John Boyle

Index

Contributors

Jacqueline Batley
Australian Centre for Plant Functional Genomics, Centre of Excellence for Integrative Legume Research, School of Land, Crop and Food Sciences, University of Queensland, Brisbane, QLD 4072, Australia

John Boyle
The Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98105, USA

Matthew Bellgard
Centre for Comparative Genomics, Murdoch University, Perth, WA, Australia

Scott Cain
Ontario Institute for Cancer Research, 101 College Street, Suite 800, Toronto, ON, Canada M5G 0A3

Helen C. Causton
MRC Clinical Sciences Centre, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK

Stephen A. Chervitz
Affymetrix Inc., Santa Clara, CA 95051, USA

Terry Clark
Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, QLD 4072, Australia

Daniel Damian
Biowisdom Ltd., CB 22 7GG, Cambridge, UK

Dina Demner-Fushman
Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, US National Library of Medicine, Bethesda, MD, USA

Eric W. Deutsch
The Institute for Systems Biology, Seattle, WA 98105, USA

Tharam Dillon
Digital Ecosystems and Business Intelligence Institute, Curtin University of Technology, Perth, WA, Australia

Chris Duran
Australian Centre for Plant Functional Genomics, School of Land, Crop and Food Sciences, University of Queensland, Brisbane, QLD 4072, Australia

Matthias Eberius
LemnaTec GmbH, Schumanstr. 1a, 52146 Wuerselen, Germany

David Edwards
Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, QLD 4072, Australia

Dawn Field
Natural Environment Research Council, Centre for Ecology and Hydrology, Oxford, OX1 3SR, UK

J. Lynn Fink
Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA, USA

Stephen S. Fong
Department of Chemical and Life Science Engineering, Virginia Commonwealth University, P.O. Box 843028, Richmond, VA 23284, USA

Jennifer M. Fostel
Division of Intramural Research, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA

Christopher M. Gowen
Department of Chemical and Life Science Engineering, Virginia Commonwealth University, P.O. Box 843028, Richmond, VA 23284, USA

David Hansen
Australian E-Health Research Centre, CSIRO, QLD 4027, Brisbane, Australia

Jennifer Harrow
Wellcome Trust Sanger Institute, Morgan Building, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, UK

Michael Imelfort
Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, QLD 4072, Australia

Laura A. Kavanaugh
Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27710, USA

Ian Korf
UC Davis Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, USA

José Lima-Guerra
Keygene N.V., Agrobusiness Park 90, 6708 PW Wageningen, The Netherlands

William H. Majoros
Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA

Lakshmi K. Matukumalli
Department of Bioinformatics and Computational Biology, George Mason University, Manassas, VA 20110, USA

Sheldon McKay
Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA

Alan Moses
Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON, Canada M5S 3B2

Uwe Ohler
Department of Biostatistics & Bioinformatics, Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA

Helen Parkinson
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Philippe Rocca-Serra
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Susanna-Assunta Sansone
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Steven G. Schroeder
Bovine Functional Genomics Laboratory, US Department of Agriculture, Beltsville, MD 20705, USA

Amandeep S. Sidhu
Centre for Comparative Genomics, Murdoch University, Perth, WA, Australia

Saurabh Sinha
Department of Computer Science, University of Illinois, Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA

Jason Stajich
Department of Plant Pathology and Microbiology, University of California, Berkeley, CA 94720-3102, USA

Christian J. Stoeckert Jr
Department of Genetics, Penn Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6021, USA

Chris F. Taylor
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Joe White
Dana-Farber Cancer Institute and Harvard School of Public Health, Harvard University, Boston, MA 02115, USA

Laurens Wilming
Wellcome Trust Sanger Institute, Morgan Building, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, UK

Sitao Wu
Center for Bioinformatics and Department of Molecular Bioscience, University of Kansas, Lawrence, KS 66047, USA

Yang Zhang
Center for Bioinformatics and Department of Molecular Bioscience, University of Kansas, Lawrence, KS 66047, USA

Pierre Zweigenbaum
LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France