Community-Driven Computational Biology with Debian Linux

Möller et al. BMC Bioinformatics 2010, 11(Suppl 12):S5 http://www.biomedcentral.com/1471-2105/11/S12/S5 PROCEEDINGS Open Access Community-driven computational biology with Debian Linux Steffen Möller1,2*, Hajo Nils Krabbenhöft2,3, Andreas Tille2, David Paleino2,4, Alan Williams5, Katy Wolstencroft5, Carole Goble5, Richard Holland6, Dominique Belhachemi2,7, Charles Plessy2,8 From The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010 Boston, MA, USA. 9-10 July 2010 Abstract Background: The Open Source movement and its technologies are popular in the bioinformatics community because they provide freely available tools and resources for research. In order to feed the steady demand for updates on software and associated data, a service infrastructure is required for sharing and providing these tools to heterogeneous computing environments. Results: The Debian Med initiative provides ready and coherent software packages for medical informatics and bioinformatics. These packages can be used together in Taverna workflows via the UseCase plugin to manage execution on local or remote machines. If such packages are available in cloud computing environments, the underlying hardware and the analysis pipelines can be shared along with the software. Conclusions: Debian Med closes the gap between developers and users. It provides a simple method for offering new releases of software and data resources, thus provisioning a local infrastructure for computational biology. For geographically distributed teams it can ensure they are working on the same versions of tools, in the same conditions. This contributes to the world-wide networking of researchers. Background the huge potential for combining these tools for analysis The field of bioinformatics has gained momentum over is traded off for the technical complexity. Many of these the past two decades. The wealth and heterogeneity of tools run on a command line, where there are differ- biological data available in the public domain provides ences in formats and the semantics of files to be rich resources for bioinformaticians, biologists, chemists exchanged between them. The installation and mainte- and clinicians. The latest Nucleic Acids Research data- nance of these tools can introduce large overheads to bases special issue [1] recorded over 1000 biological data analysis. Alternative web-based interfaces are pro- data resources. This reflects the technological advance- vided for many tools. This reduces the installation over- ments in the field and the complexity of the field that head for the user and the propensity for version has developed many sub-disciplines. heterogeneity, but it also often reduces the function of Users of biological data resources, especially in clinical the tools. Users can submit large batch jobs, but they environments and whenever biological processes are are limited to numbers or time. modelled, need an integration of those specialised Workflows provide the possibility of programmatic resources. If we consider that many data resources have access to distributed tools and resources. The Taverna a collection of analysis tools associated with them, then workflow workbench [2] allows users to chain together pre-made building blocks of web services and other services to build complex analysis pipelines. The myExperi- * Correspondence: [email protected] 1University Clinics of Schleswig-Holstein, Department of Dermatology, ment [3] site allows for the sharing of these workflows, formerly University of Lübeck, Institute for Neuro- andBioinformatics, helping people share informatics methods in the com- Ratzeburger Allee 160, 23530 Lübeck. Germany munity. This way, researchers are helped to avoid Full list of author information is available at the end of the article © 2010 Möller et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Möller et al. BMC Bioinformatics 2010, 11(Suppl 12):S5 Page 2 of 6 http://www.biomedcentral.com/1471-2105/11/S12/S5 reinventing approaches and prefer reusing established However, the prospect of integration with Taverna and templates if readily available. This works nicely for pub- therefore being able to access web services and workflows lic data and web services. directly seemed particularly attractive. A Use-Case plugin A challenge remains to integrate public services with was developed [5] to allow regular applications to be inter- the freedom to execute command line applications. The preted as services by Taverna and can be accessed in a incorporation of command-line applications in such similar way to web services. workflows is a more recent development, not yet widely The dependence of the biological research community adopted. We will explain why we think that this is not on computational and data services will increase over surprising and how the availability of Linux distributions the upcoming years. The strong computational demands contributes to a wider acceptance of that principle. of the services and the increase in complexity of the in Applications may be executed either locally or remo- silico research fosters the collaboration of individuals tely, possibly as a service, in a queuing system or a com- from many sites. putational grid. The community has become excellent at sharing data, and similarly excellent in sharing the code Implementation (data) of applications, but there was yet no environment Debian-Med allowing for the integration of it all – either locally or The Debian project is an open society of enthusiasts remotely. The researcher demands the highest respon- from around the globe who collaborate on maintaining siveness from his applications and needs to protect pre- an operating system based on the Linux and FreeBSD cious data, especially in pharmaceutical or clinical kernels. Programs are distributed as binary packages environments. Both are in strong favour of local installa- ready for use, built on Debian’s network of autobuilding tions for applications and public data resources. [6] machines, from source code that is further annotated The traditional installation of Open Source software and uploaded as packages by individuals. Debian sup- requires its compilation from source code. The skills ports today’s most prominent platforms, thus rendering and interest needed for packaging software are different them available from mobiles to supercomputers and for from software development. More complex software will all common processors. Packages invite feedback from require the prior installation of other software, a recur- users with the Bug Tracking System [7]. sive process. This is laborious and often technically chal- Around 90,000 users have allowed the counting of lenging because of the differences between platforms their installations via Debian’s Popularity-Contest initia- and version incompatibilities. A workflow demanding tive [8], started in 2004. Separately counted are installa- the availability of a particular application on a local tions of packages that were forwarded to derived machine will thus be considered non-executable, unless distributions [9]. The most prominent of these is the given application has already been installed. Ubuntu [10], for which more than 1.3 million users are Once installation is successful, scientists are reluctant reporting. Packages are described verbosely and are to update a working environment, so computer net- translated to many languages [11]. More formally they works quickly become heterogeneous and not maintain- may be selected by manual assignment of terms from a able with multiple users. When different members of a controlled vocabulary [12]. community work with different versions of the same Technical constraints for the packaging are laid out in software, it becomes more difficult to collaborate. the Debian Policy document [13]. Changes to it are dis- On Linux, several distributions are preparing packages cussed on the project’s mailing lists and may be subject with readily usable software that are up- and down- to voting by contributors to the distribution. The gradable, and allow libraries of several versions to be Ubuntu Linux distribution adopts the Debian packages installed. Debian [4], founded in 1993, is one of the old- for their own software “universe” and several packages est distributions. It was for many years the only distribu- are co-maintained by developers of both distributions. tion that was managed as a democratic society, to which Everybody can volunteer to maintain a package in users could upload packages and vote for their destiny. Debian. There is no general exclusion of any software, The community of computational biologists should as long as it is redistributable. For auto-building on team up with the community behind the Linux distribu- many platforms, the source code and the libraries that it tions and extend the IT infrastructure respectively. This needs must be available. To allow for improvements, frees considerable resources for research labs and may onemustbeallowedtoeditthecode.Allthisismore be crucial for many smaller groups. formally specified in the Debian Free Software Guide- The remaining challenge was to connect those tools lines (DFSG). with the data and web services. One can certainly use var- For complex suites, packagers have an option to share ious forms

Community-Driven Computational Biology with Debian Linux

Istls Information Services to Life Science Internet Bioinformatics Resources Josef Maier [E-Mail: [email protected]] Last Checked August, 17Th, 2011

Debian Med Integrated Software Environment for All Medical Applications

Impacting the Bioscience Progress by Backporting Software for Bio-Linux

Operating Systems: from Every Palm to the Entire Cosmos in the 21St Century Lifestyle 5

COLLABORATIVE COMPUTATIONAL TECHNOLOGIES for BIOMEDICAL RESEARCH Wiley Series on Technologies for the Pharmaceutical Industry Sean Ekins , Series Editor

Bioinformatics I - Basic Tools and Resources 28.07.2009

Development and Evaluation of Cloud Computing Infrastructures for Next-Generation Sequence Data Analysis

19Th Coordinators Meet

Introduction to Linux

Information Theory, Graph Theory and Bayesian Statistics Based Improved and Robust Methods in Genome Assembly

From Famine to Feast

Free Software in Biology Using Debian-Med: a Resource for Information Agents and Computational Grids ∗