Java for Bioinformatics and Biomedical Applications Java for Bioinformatics and Biomedical Applications

JAVA FOR BIOINFORMATICS AND BIOMEDICAL APPLICATIONS JAVA FOR BIOINFORMATICS AND BIOMEDICAL APPLICATIONS by Harshawardhan Bal Booz Allen Hamilton, Inc., Rockville, MD and Johnny Hujol Vertex Pharmaceuticals, Inc., Cambridge, MA ^ Sprringei r Library of Congress Control Number: 2006930294 ISBN-10: 0-387-37235-0 e-ISBN-10: 0-387-37237-7 ISBN-13: 978-0-387-37237-8 Printed on acid-free paper. © 2007 Springer Science-HBusiness Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science-t-Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America. 987654321 springer.com Contents Foreword IX Introduction IX Background and history IX Interfaces and standards X Java as a platform X The future XI Preface XIII Chapter 1 1 Introduction to Bioinformatics and Java 1 The Origins of Bioinformatics 1 Current State of Biomedical Research 3 The cancer Biomedical Informatics Grid program 6 caBIG™ Organization and Architecture 7 The Model-View-Controller Framework 9 Web Services and Service-Oriented Architecture 10 CaGrid 11 Let's look at each of the tools in turn and understand how they sub serve or address a small component of the bigger research problem. 14 CaArray 14 CaWorkbench 16 RProteomics 17 cPath 17 CaTissue Core, caTissue Clinical Annotation Engine and caTIES...18 CaTissue Core 18 Summary 20 Questions and Exercises 21 Additional Resources 21 Selected Reading 23 Chapter II 25 VI Introduction to Basic Local Alignment Search Tool 25 The Purpose of BLAST 25 Performing a BLAST Analysis 27 Developing the SwingBlast Application 32 Designing the SwingBlast Java Application 35 Java Event Model 36 Adding Events to Applications 37 Designing the SwingBlast GUI 39 Coding the SwingBlast GUI 45 Coding the SwingBlast Business Logic 48 Determining Sequence Type: Nucleotide or Protein? 53 Displaying Valid BLAST Options 63 Summary 80 Questions and Exercises 81 Additional Resources 81 Selected Reading 81 Chapter III 83 Running BLAST using SwingBlast 83 Introduction 83 The NCBI QBLAST Package 83 Strategy for Creating a QBlast Based System 84 Designing the BLAST API 86 Description of Blast Classes 88 Implementing JQBlast 92 Enhancing the SwingBlast Application 103 Retrieving Sequences From GenBank Using BioJava 124 Retrieving GenBank Without BioJava 129 Input Validation 132 Controlling Program Events and Responses 137 Reporting BLAST Status 139 Displaying BLAST Results Interactively 143 Summary 151 Questions and Exercises 152 Additional Resources 152 Selected Reading 153 Chapter IV 155 Facilitating PubMed Searches: JavaServer Pages and Java Servletsl55 Introduction 155 VII HTTP and CGI 155 HTTP Protocol 156 GET and POST Methods 157 CGI For Generating Dynamic Content 157 Servlets and JavaServer Pages Technologies 158 Java API for Servlets and JSPs 159 JavaServer Pages Standard Tag Library (JSTL) 160 Apache Tomcat Server 160 The NCBI Pub Med Literature Search and Retrieval Service 161 Accessing Biomedical Literature Through Entrez 161 Create Web Application With Servlets and JSPs 165 Web Application Structure 167 Creating a Servlet to Access Biomedical Literature 169 Displaying PubMed Abstracts 178 Highlighting Search Terms in Retrieved Abstracts 193 Summary 204 Questions and Exercises 205 Additional Resources 206 Selected Reading 206 Chapter V 209 Creating a Gene Prediction and BLAST Analysis Pipeline 209 Introduction 209 Gene Prediction Programs 209 DNA Transcription and Translation 210 Gene Prediction with Genscan 212 Running Genscan Analyses 213 Analyzing GenScan Output 215 Creating SwingGenscan 221 Writing the Code for SwingGenScan 222 The SwingGenScan User Interface 235 Running SwingGenScan 243 Summary 246 Questions and Exercises 247 Additional Resources 247 Selected Reading 247 Chapter VI 249 cancer Biomedical Informatics Grid (caBIG ™) 249 cancer Biomedical Informatics Grid 249 VIII Structure and Organization of caBIG™ 250 Data Integration and ETL 253 cancer Common Ontologic Representation Environment (caCORE) .255 Cancer Bioinformatics Infrastructure Objects (caBIO) 257 Downloading and Configuring caBIO 259 Creating the JcaBIO Application 260 JcaBIO Classes and Application Structure 261 Coding the SwingCaBIO Application 263 Coding JcaBIO: The CaBIOReportEngine Class 275 Coding JcaBIO: The CaBIOSearchEngine Class 282 Running the JcaBIO Application 286 Summary 289 Questions and Exercises 290 Additional Resources 291 Selected Reading 292 Appendix 295 Apache Ant and Tomcat 295 Downloading the Apache Tomcat server 295 Managing the Apache Tomcat Server 302 Installing and Configuring the Apache Ant Build Tool 306 Configuring environmental variables for Ant 309 Building and Deploying The Web Application 310 Building the WAR file 310 Deploying the application on Tomcat using Ant 312 Version Control Systems 314 Additional Resources 315 Foreword April 2006 Introduction Bioinformatics is at a crossroads. We work in a field that is changing every day, increasingly moving from specific solutions created by single researchers working alone or in small groups to larger, often geographically dispersed programs enabled by collaborative computing and open software. This book represents an important development, giving the reader an opportunity to discover how the use of open and reusable Java code can solve large bioinformatics problems in a software engineered and robust way. I work with one of the authors of this book every day, on the National Cancer Institute's cancer Biomedical Informatics Grid (caBIG™) project, and I can attest that they are well suited to share with their readers both their experience in the development and use of bioinformatics software, as well as their interest in solid software engineering and interoperability. Background and history In its short history, bioinformatics has become an increasingly important part of how scientists involved in biological research go about their work. This has lead to an explosion of interest in the subject, and a similar explosion in tools and data resources for researchers to learn and use in their work. Historically, tools for bioinformatics have been idiosyncratic and are custom-developed by the end-users (or those close to them) in an iterative fashion until the specific immediate problem is solved. This has led to a balkanization of informatics systems, sometimes yielding multiple, incompatible systems at a single institution for a single application. This trend is beginning to change, with groups throughout the research community developing standards and shared data models, in areas ranging X from gene expression arrays to pathways and proteomics. With a range of emerging software capabilities and a growing interest in interoperable tools and standards, bioinformatics practitioners have an ever-expanding toolbox from which to draw on to develop the basic software infrastructure behind their work. Similarly, with the increasing interest within the biomedical informatics community in the use of well-defined software engineering methodologies, and disciplines like design patterns and model- driven architecture, the software developed there will increasingly last longer, be easier to maintain, foster interoperability and reuse, and ultimately be more robust and cost effective. Interfaces and standards Interfaces and standards, as well as the use of well established development platforms, especially object-oriented programming, allow the bioinformatics practitioner to solve problems faster, with fewer lines of reusable, well-documented code than before. Through access to and study of well-established principles of software engineering and computer science, the solutions to problems in biomedical informatics will also be solid and optimally designed. With the increasing size of the datasets used in biomolecular informatics, derived from all manner of new high- throughput technologies and online databases, it is increasingly important to use thoughtful, efficient and well-established algorithms in the analysis of that data. Informatics students who can decompose complex, biologically significant informatics problems into simpler models, for which there are corresponding, validated and pre-existing software objects, will be amply rewarded for their efforts. It is by building on well- supported software platforms, using established and tested methodologies, that the most favorable balance can be achieved between effort and benefit. Java as a platform This book will teach you ways to make use of the Java programming language as a platform for your work in biomedical informatics, and in doing so, will open you up to the possibility of using a wide range of software objects in use throughout the large software engineering and computer science communities. Java is, of course, not the only object- oriented platform that is appropriate for bioinformatics. Perl is very well XI established, and are python, C++ and many others. The lessons that you can learn in Java are

Java for Bioinformatics and Biomedical Applications Java for Bioinformatics and Biomedical Applications

Benchmarking of Bioperl, Perl, Biojava, Java, Biopython, and Python for Primitive Bioinformatics Tasks 6 and Choosing a Suitable Language

The Bioperl Toolkit: Perl Modules for the Life Sciences

Review of Java

Plat: a Web Based Protein Local Alignment Tool

The Bioperl Toolkit: Perl Modules for the Life Sciences

The Biojava Tutorial the Biojava Tutorial

An Open-Source Sandbox for Increasing the Accessibility Of

The Bioperl Toolkit: Perl Modules for the Life Sciences

Persistent Bioperl

Biopython Tutorial and Cookbook

International Journal of Applied Science and Engineering Review

Bio-Programming, Utilities