Open Source Tools and Toolkits for Bioinformatics: Significance, and Where Are We? Jasone.Stajichandhilmarlapp

BRIEFINGS IN BIOINFORMATICS. VOL 7. NO 3. 287^296 doi:10.1093/bib/bbl026 Open source tools and toolkits for bioinformatics: significance, and where are we? JasonE.StajichandHilmarLapp Submitted: 11th October 2005; Received (in revised form): 28th June 2006 Abstract This review summarizes important work in open-source bioinformatics software that has occurred over the past couple of years. The survey is intended to illustrate how programs and toolkits whose source code has been developed or released under an Open Source license have changed informatics-heavy areas of life science research. Rather than creating a comprehensive list of all tools developed over the last 2^3 years, we use a few selected projects encompassing toolkit libraries, analysis tools, data analysis environments and interoperability standards to show how freely available and modifiable open-source software can serve as the foundation for building important applications, analysis workflows and resources. Keywords: bioinformatics; open source; software; genomics INTRODUCTION This picture is one of a multitude of projects, Compared to several years ago, there are numerous which at the surface seem to have commonalities, open source projects active in the life science arena, but upon closer inspection have many differences each offering freely available source code that and lack an overall organization or direction that promises to address a specific problem or problem would tie them together by more than just the domain of biological sciences in a reusable way. shared principle of open source. This scenario of Bioinformatics.org (http://www.bioinformatics.org) each project offering ‘stuff’ the utility or application alone is host to 275 projects, which address a of which is often not immediately clear, is much bioinformatics need by definition. In addition, reminiscent of a market bazaar where goods are sold, Sourceforge (http://www.sourceforge.net) hosts except that the ‘stuff’ on display here is software around 750 projects categorized as ‘Bioinformatics’ source code. Indeed, it was the metaphor of a bazaar including projects such as Generic Model Organism that Eric S. Raymond coined in an article about Database (GMOD), Microarray Gene Expression Data hallmarks of successful open-source software devel- society (MGED) and Life Sciences Identifiers (LSID). opment, ‘The Cathedral and the Bazaar’ [1]. The There are many projects hosted by the developers’ bazaar metaphor is contrasted with that of a cathedral home institutions or by other open source-devoted representing a software development model with umbrella organizations such as the Open Bioinformatics high aspirations for scope, strict development rules Foundation (http://www.open-bio.org). The latter in and designations for each individual piece of fact hosts some of the toolkits most widely used in the a project so that the sum of the parts forms an life sciences, such as BioPerl and Biojava. awe-inspiring coherent whole. Corresponding author. Jason E. Stajich, Department of Plant and Microbial Biology, University of California, Berkeley, 321 Koshland Hall Berkeley, CA 94720, USA. Tel: (919)684-2720; Fax: (919)681-1035; E-mail: [email protected] Jason Stajich received his PhD from Duke University in 2006 and is currently a Miller research fellow at University of California, Berkeley. He serves as the President of the Open Bioinformatics Foundation, and has authored numerous modules in BioPerl and GMOD. Hilmar Lapp joined NESCent as Assistant Director of Informatics in May 2006. He has served on the Board of Directors of the Open Bioinformatics Foundation since 2001, and leads development of several open source projects and modules. National Evolutionary Synthesis Center, Durham, NC 27708, USA. E-mail: [email protected] ß The Author 2006. Published by Oxford University Press. For Permissions, please email: [email protected] Downloaded from https://academic.oup.com/bib/article-abstract/7/3/287/327529 by guest on 24 August 2018 288 Stajich and Lapp Inevitably, the question for the end user, devel- to toolkit libraries and to standards for interopera- oper, or informatics manager in the life sciences bility and data dissemination. We select a few searching to find a solution for a particular informatics projects, primarily toolkit libraries that have had a problem is which role to give to open-source software significant impact on genomics science, to illustrate in the overall longer term strategy of software devel- what open source projects have been able to achieve opment. How can one decide between embracing over the past few years, and what they could help an open source project, purchasing a proprietary to accomplish in the future. We also highlight the solution, or finding a blend of the two for a particular establishment of several standards for data exchange, problem? This may involve weighing the potential of and the utilization of these standards in research. an open source product and its likelihood of successfully achieving its goals. Does the increasing amount of ‘stuff’ or ‘software’ being offered for free OPEN-SOURCE SOFTWARE mean that open source is the winning model in LIBRARIES bioinformatics? Does the wide variety of problems Probably the most ubiquitous task in bioinformatics addressed in open source bioinformatics projects is to parse the results of an analysis program. BLAST mean that there will be a freely available solution to is certainly the most commonly used program for almost any problem already? sequence analysis. Hence, it is no surprise that We feel that there is no universal answer for every writing a BLAST report parser has become the research group to these and related questions. The proverbial ‘Hello World’ programming exercise for term open source has achieved buzzword quality over newcomers to the field of bioinformatics. The the past few years, but the principles of freely learning process achieved through crafting one’s available software have been tenants of collaborative own parser is not to be underappreciated, especially science for quite a long time including the basic as typically one of the lessons is that it is much less premise of sharing strains, plasmids, and the trivial than it initially appears. For example, many ‘Bermuda accords’ that outline immediate release different formats (or flavors) exist for BLAST reports of data from publicly funded genome sequencing from NCBI, Washington University, and commer- projects [2]. Open source pledges more though than cial implementations of the algorithm. In addition, it simply inexpensive software. An Open Source- can become more complicated if one also wants to compliant license (http://www.opensource.org) build reusable and extensible software so that parsing states that not only is an application available for reports from different pairwise alignment programs use, but anyone can obtain, inspect, modify and still present the developer with the same program- build upon the source code in new ways and ming interface and data objects. This enumeration of redistribute the changed source. Freely available features describes the more salient capabilities of the software is not meant to be ‘free’ in the sense of BioPerl [3**] parsing system called Bio::SearchIO having no cost on the side of the end user or for BLAST and other pairwise search programs. The developer who adopts it, rather, open-source soft- example also generally describes the target that any ware projects offer intellectual property in a way that toolkit of reusable software components (‘modules’) anyone can learn from, improve upon, participate in will aim for when processing data that may come in and extend. In contrast, closed-source software is a different file formats such as multiple sequence black box that can only be used in a manner dictated alignments or sequence data. The development of by the software publisher. Organizations who adopt generic and reusable software libraries for data an open source project with the principles of processing is one of the most common tasks in collaboration and contribute to the extension of the bioinformatics and helps scientists and programmers software will likely reap more benefits than those who alike to focus their efforts on applying the software see it as something with an unbeatable price tag. and addressing one of the many unsolved problems in In this review we aim to arm the reader not with bioinformatics, rather than on the nuances of parsing a comprehensive laundry list of relevant open source different common data formats and their variants. projects, but rather with a sense of how far open- Software libraries have been developed for the source software has come in addressing needs in purpose of providing fundamental building blocks the bioinformatics arena that range from analysis for algorithms and applications alike. A vocabulary algorithms and providing interfaces to end users, for sequence analysis includes sequence data, features, Downloaded from https://academic.oup.com/bib/article-abstract/7/3/287/327529 by guest on 24 August 2018 Open source tools and toolkits for bioinformatics 289 annotation and sequence alignments. Therefore any EMBL format sequences files, processing a BLAST general-purpose software library for bioinformatics report and generating GFF (General Feature Format) will have parsers and data objects meant to read and files (http://www.sanger.ac.uk/Software/formats/ represent these data. This infrastructure performs GFF/and http://song.sourceforge.net/gff3.shtml). tasks such as parsing sequence data files or BLAST The Bio::Graphics package, which is part of reports, and forms the building blocks for constructing Bioperl and contains the drawing code for the workflows (protocols) for bioinformatics analyses. Generic Genome Browser [12**], can then be used Because there is no true language for bioinformatics to create sophisticated graphical representation of the toolkits in several different languages have been features on top of the reference sequence. The core written to provide a developer with choices so he or objects in BioPerl, particularly the sequence and she can choose the best language for the job at hand.

Open Source Tools and Toolkits for Bioinformatics: Significance, and Where Are We? Jasone.Stajichandhilmarlapp

Introduction to Bioinformatics Software and Computing Infrastructures

Biolib: Sharing High Performance Code Between Bioperl, Biopython, Bioruby, R/Bioconductor and Biojava by Pjotr Prins

Gestalt of Bioinformatics

Visual Programming for Next-Generation Sequencing Data Analytics Franco Milicchio1, Rebecca Rose2, Jiang Bian3, Jae Min4 and Mattia Prosperi4*

Bioruby 2010 Updates: Moving to Agile Bioinformatics

2 Project Overview and New Features Bioruby Project Was Started in Late

BOSC 2015 Dublin, Ireland July 10-11, 2015

Python for Gene Expression[Version 1; Peer Review

Bioruby: Object Oriented Open Source Library for Bioinformatics

UBIC2 – Towards Ubiquitous Bio-Information Computing: Data Protocols, Middleware, and Web Services for Heterogeneous Biological Information Integration and Retrieval

Bioruby Project Update

Bioruby Bioinformatics Software for the Ruby Programming Language