COMMENTARY

Open software for biologists: from famine to feast

Dawn Field, Bela Tiwari, Tim Booth, Stewart Houten, Dan Swan, Nicolas Bertrand & Milo Thurston

Developing and deploying specialized computing systems for specific research communities is achievable, cost effective and has wide-ranging benefits.

Every research scientist who depends daily on computers to store, manipulate and analyze data wants to arrive at work to a smoothly working computer system. Anything less than an up-to-date, complete and bug-free system can steal precious time away from research. Equally, the top priority of dedicated computing support services is to provide such systems.

The qualities of an ideal computing platform are, of course, in the eyes of the beholder. Important attributes include speed, stability, security, the potential to integrate effectively into existing networked environments and the inclusion of a wide range of standard and cutting edge software. We believe ideal solutions would be, whenever possible, freely available and give users access to the wealth of resources traditionally available only at the largest research centers.

In short, an ideal computer system is one that exactly matches user requirements, is readily available and is economical in terms of time and money. For the life sciences community, which is facing an exponentially increasing deluge of data, these attributes are not only desirable but increasingly essential. In particular, the advent of ‘omic technologies (genomics, transcriptomics, proteomics, metabolomics) is presenting biologists and bioinformaticians with the challenge of devising solutions for better and faster synthesis of raw data into scientific knowledge.

Building and delivering tailored computing solutions can require significant expertise, is often dependent on dedicated staff and hardware resources and sometimes involves the construction of large centralized facilities. Unfortunately, this has meant that many biological scientists are starved of access to adequate, let alone ideal, computing environments for their own research.

We argue here that the development and uptake of expertly configured computer workstations loaded with free and open source software (FOSS)¹ can help overcome these challenges. Such systems provide ready-made solutions for those who are uncertain about how to obtain suitable software and the skills to fully exploit it. This is especially relevant in environments where information technology groups are geared toward supporting generic, rather than specialized, scientific systems. Such systems also herald new ways of facilitating research and data management activities taking place across geographically distributed sites, an increasingly common phenomenon in the life sciences.

The concept of ‘turnkey’ computing systems is not new. Historically, a range of manufacturers have made their computers more appealing by providing a combination of hardware as well as an operating system, software and documentation. It is the availability of FOSS operating systems, software and their hardware independence that is now transforming the accessibility and affordability of such systems.

From famine to feast

FOSS software lends itself well to distribution and modification and is supported by an active development community. It is also an economical and powerful way of accessing some of the best computing solutions available¹. A driving force of the FOSS revolution is Linux. Technically speaking, the term Linux refers only to one core component of the operating system, but has become a catchall phrase for complete FOSS systems built around it. The popularity of Linux is steadily increasing because it delivers fast, cheap, powerful and flexible computing power.

The demands for shared, expert computing infrastructure in many fields, from physics (https://www.scientificlinux.org/) to electronic music (http://linux-sound.org/distro.html) and beyond (http://lwn.net/Distributions/), are being met by optimizing Linux systems to meet specific needs. In tandem, the number of projects delivering ‘out-of-the-box’ suites of software for life science research is also rapidly growing (Table 1)². The frequent use of the expressions ‘Bio,’ ‘Linux’ and related words within the names of these bioinformatics-centric projects has led to the emergence of ‘BioLinux’ as an umbrella term for projects making access to bioinformatics software on a Linux platform easier through (i) the provision of complete systems, (ii) the provision of software packages, (iii) community building and support systems. To avoid confusion, we use the generic term BioLinux in italics throughout this article to distinguish it from the names of specific projects.

Dawn Field, Bela Tiwari, Tim Booth and Stewart Houten are at the NERC Environmental Bioinformatics Centre, and Dawn Field and Milo Thurston are at the Molecular Evolution and Bioinformatics Section, Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR, UK; Dan Swan is at the Bioinformatics Support Unit, Institute for Cell and Molecular Biosciences, University of Newcastle upon Tyne, Newcastle, NE2 4HH, UK; and Nicolas Bertrand is at CEH Computing Services, Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR.
e-mail: [email protected]

© 2006 Nature Publishing Group http://www.nature.com/naturebiotechnology

NATURE BIOTECHNOLOGY VOLUME 24 NUMBER 7 JULY 2006 801–803

BioLinux for environmental genomics

In 2002, the UK’s Natural Environment Research Council (NERC), in support of its Environmental Genomics Program, established the NERC Environmental Bioinformatics Centre (NEBC)³. As part of its remit, NEBC developed and deployed NEBC Bio-Linux. Our user community includes bioinformaticians, software developers, system administrators and, most importantly, biological researchers, many of whom were new to both Linux and bioinformatics.

To build NEBC Bio-Linux, we followed the generic method outlined in Table 2. We installed open source bioinformatics software on a customized Debian GNU/Linux base system (http://www.debian.org). To maximize our ability to troubleshoot the installation process for users, provide improved downstream support, and ease uptake among users with no previous Linux experience, we chose to distribute NEBC Bio-Linux as a software ‘clone.’ Unlike a true Linux distribution, NEBC Bio-Linux is an image, or snapshot, of a machine we maintain at NEBC. This image is created with SystemImager (http://www.systemimager.org/), software usually used to install computing clusters.

Table 2 Basic method for building a customized computing platformᵃ

Step: Define user requirements
  NERC environmental genomics community user requirements: Powerful, cost-effective bioinformatics platform to support environmental ‘omic investigations
  NEBC Bio-Linux solution: Launch project to develop and deploy a suitable, standard computing system containing a broad range of bioinformatics software

Step: Select technology
  NERC environmental genomics community user requirements: Uniform computing platform that eases data sharing, electronic harvesting of data, troubleshooting which must be easy to install
  NEBC Bio-Linux solution: Distribute a Linux base system using the SystemImager software

Step: Add customizations
  NERC environmental genomics community user requirements: Secure machine, integrated backup system
  NEBC Bio-Linux solution: Custom firewall, second hard drive supporting automatic backups

Step: Add software
  NERC environmental genomics community user requirements: Wide range of bioinformatics software
  NEBC Bio-Linux solution: Select more than 60 core bioinformatics packages

Step: Document
  NERC environmental genomics community user requirements: User-friendly, comprehensive documentation
  NEBC Bio-Linux solution: Website, bioinformatics documentation project

Step: Test deployment
  NERC environmental genomics community user requirements: Well-tested system
  NEBC Bio-Linux solution: Deploy to pilot user group

Step: Full deployment
  NERC environmental genomics community user requirements: Simple installation on demand
  NEBC Bio-Linux solution: Application process to manage rapid and structured deployment

Step: Support and training
  NERC environmental genomics community user requirements: Access to help and training for Linux and bioinformatics
  NEBC Bio-Linux solution: Courses, website, helpdesk, phone, meetings

Step: Maintain as a community-driven tool
  NERC environmental genomics community user requirements: Rapid response to feedback
  NEBC Bio-Linux solution: Modify system according to feedback, for example, fix problems, add new software, further customizations

ᵃThis table outlines the steps involved in developing and supporting a customized distribution across a defined user community, with specific solutions implemented by the NEBC to provide a computing platform. Key factors affecting design choices are size of community, preexisting skill base and funding for developers and installation of hardware.

A bounty of software

Many BioLinux projects involve the creation of rich software repositories instead of complete systems (Table 1). The Linux community has simplified the process of distributing software and updates by engineering ‘packages.’ The software package manager technique has been refined over a period of years⁴, and most mainstream Linux distributions have some sort of package-management software. Most of these are based on the two most common systems: Red Hat’s (Raleigh, NC, USA) RPM and the Debian community’s APT.

Building a package for a piece of software saves users unwanted installation steps that can often be numerous and sometimes frustrating. Using packages, users can get tens to hundreds of pieces of software running in little more than the time it takes to download them and keep them up to date as new versions appear. It is the ability to use package repositories to track and redistribute the latest software and data analysis methods that holds the greatest allure.

Table 1 Examples of BioLinux projects. Only freely available projects are listed.

Project                                         URL
Complete systems
  BioBrew                                       http://bioinformatics.org/biobrew/
  BioLand                                       http://bioland.cbi.pku.edu.cn/
  BioLinuxBR                                    http://biolinux.df.ibilce.unesp.br/
  Debian-Med                                    http://www.debian.org/devel/debian-med/
  NEBC Bio-Linux                                http://nebc.nox.ac.uk/biolinux.html
Package repositories only
  BioLinux (Red Hat/Fedora)                     http://www.biolinux.org/
  BIOrpms (Red Hat/Fedora)                      http://apt.bea.ki.se/packages.html
  NEBC Bio-Linux (Debian)                       http://envgen.nox.ac.uk/repository.html
  UMDNJ Informatics Institute (Red Hat/Fedora)  http://serine.umdnj.edu/~golharam/biorpms/
Live CD-ROMs and DVDs
  BioKnoppix                                    http://bioknoppix.hpcf.upr.edu/
  DNALinux                                      http://www.dnalinux.com/
  NEBC Bio-Linux Live                           http://nebc.nox.ac.uk/biolinux.html
  Quantian                                      http://dirk.eddelbuettel.com/quantian.html
  Vigyaan                                       http://www.vigyaancd.org/
  Vlinux                                        http://bioinformatics.org/vlinux/index.php

Unifying a community

The installation of a common computing platform across laboratories facilitates data sharing, promotes the spread of best practice with regard to analysis and data management, and facilitates the provision of centralized, economical support. From a provider’s point of view, a diverse and active user base is an invaluable resource. For example, when researchers using NEBC Bio-Linux alert us to new programs they have seen or have developed themselves, we can package them and place them in our repository, where they become available to any NEBC Bio-Linux user. Likewise, a grab bag of community-specific programming code can easily be generated and maintained.

These projects also facilitate knowledge transfer between communities. For example, Keith Jolley of the University of Oxford redistributes NEBC Bio-Linux to research and reference laboratories engaged in performing multilocus sequence typing (MLST; http://www.pubmlst.org/). In doing so, he hopes to simplify and standardize the informatics and data management associated with building global databases containing genotypes of a range of important pathogens. In particular, building international mirrors of MLST databases has been facilitated by the use of a common platform, as the multiple software packages required to run the sites can be assumed to be available and installed in a standard location.

Building infrastructure

As the demand for bioinformatics training grows, access to suitable facilities is becoming increasingly important. Likewise, as the need for high-throughput processing of data grows, so does the need for access to the type of computing power that can be provided by computing clusters. The ‘cloning’ method of deploying machine images is particularly well suited for rapidly populating suites of computers and therefore provides a suitable option for larger scale projects.

For example, NEBC Bio-Linux runs in teaching laboratories at the University of Liverpool (http://www.genomics.liv.ac.uk/NIBHI_Cluster_page.html), the University of Cardiff (http://watson-bios.grid.cf.ac.uk/Resources/biolinux.html) and the Oxford Centre for Ecology and Hydrology (http://darwin.nerc-oxford.ac.uk). Students in such laboratories have access to a comprehensive bioinformatics computing environment, where clean systems can be easily installed if they are substantially modified during a course. In addition, NEBC Bio-Linux comes preinstalled with the Condor software⁵, allowing groups of machines, such as those in teaching laboratories, to be used as loose computing clusters when they might otherwise sit idle.

Our longer-term goal is to use NEBC Bio-Linux not only to bring Linux, bioinformatics software and computing power to our community, but also to advocate the e-Science vision of virtual communities that share trusted computing resources across geographical and institutional boundaries using GRID technologies (parallel processing architectures in which computers/CPU resources are shared across a network)⁶. In addition to Condor, we are now starting to include GRID middleware applications like Taverna⁷ in our software repository.

Improving user skills

The projects described in Table 1 all provide raw materials for use in bioinformatics studies. In all cases, users are responsible for their own systems; they learn as they go and can choose to become as expert in Linux and bioinformatics as suits their research and career-development needs. This is an ideal situation for promoting the uptake and improvement of user skills.

The amount of specialized software developed by the scientific FOSS community continues to grow at such a rate that it is now spilling over into mainstream Linux distributions. Most notably, Gentoo (http://www.gentoo.org) contains over 300 scientific packages, falling into categories including astronomy, biology, chemistry, geosciences, mathematics and visualization. Access to collections of software, promoted by BioLinux projects, is an excellent step towards realizing a future vision of integrated communities that share ‘ideal’ computing platforms, but we argue that reaching this goal will also take significant support and training activities.

Our center was established with the express remit of providing a wide range of support activities. Support for NEBC Bio-Linux comes in the form of a bioinformatics software documentation system, an electronic helpdesk, phone support, courses and workshops, and ad hoc meetings. Comprehensive user support delivered through this mechanism leads to high returns and can be provided economically.

The future

It is still early days in the history of specialized, open-source computing systems, such as BioLinux projects, but their potential is clear. The approach we have outlined for developing and delivering customized workstations (Table 2), when accompanied by support activities, is unparalleled in terms of community empowerment and democratizing access to computing environments and expertise. Expert system administrators with extensive bioinformatics expertise are rare and expensive, but using this approach, just one or two can effectively support the needs of a large, dispersed community.

Projects aimed at biological researchers can contribute to producing the next generation of highly computer-literate biologists, bioinformaticians and system administrators who are tightly integrated into the research groups in which they work. They have the added advantage of feeding back ideas and resources into the FOSS community, thus furthering the quality and quantity of projects available.

This approach should be of interest to funding bodies looking to invest in effective data management, training or the development of a workforce in a particular area. In the long run, the adoption of a BioLinux approach is a cost-effective and efficient investment in science infrastructure and grassroots community development. In a field such as bioinformatics where technologies, applications, and the methods of data collection and analysis change so rapidly, what users want from a system can change drastically, even in the period of a single grant. Projects taking advantage of the approach we have outlined can absorb almost any new solution at the community level efficiently and quickly, thereby ‘future-proofing’ investments of time, effort and money.

The combined features of a BioLinux approach make it possible for laboratories, including those that have never had access to adequate computing resources, to install a complete, well-supported computing environment optimized for bioinformatics research. In our case, the development and deployment of NEBC Bio-Linux strengthened ties between community members, reduced duplication of efforts and helped to disseminate knowledge. We believe that this approach, applied across the sciences in both academic and industrial settings, is set to make a profound impact on the way we build and use computing infrastructure.

ACKNOWLEDGMENTS

NEBC is supported by the NERC EG and PG&P science programs. Special thanks to Pete Kille and Jason Snape for championing the Bio-Linux concept, the Steering Committees of both programs for choosing to support it, and all NEBC Bio-Linux users for feedback.

1. Wheeler, D.A. http://www.dwheeler.com/oss_fs_why.html (2005).
2. Tiwari, B. & Field, D. LinuxUser and Developer 46, 110–115 (2005).
3. Field, D., Tiwari, B. & Snape, J. PLoS Biol 3, e297 (2005).
4. Foster-Johnson, E. http://fedora.redhat.com/docs/drafts/rpm-guide-en/ (2005).
5. Douglas, T., Todd, T. & Miron, L. Concurrency and Computation: Practice and Experience 17, 323–356 (2005).
6. Foster, I. The Grid: Blueprint for a New Computing Infrastructure, edn. 2 (Morgan Kaufmann, San Francisco, 2004).
7. Oinn, T. et al. Bioinformatics 20, 3045–3054 (2004).
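
The repository workflow described under ‘A bounty of software’ (users install packaged software and keep it up to date as new versions appear) can be illustrated with a minimal sketch. It is not how RPM or APT are implemented: the function names, package names and version numbers below are hypothetical, and versions are assumed to be plain dotted integers, whereas real package managers use richer version rules and also resolve dependencies.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.2.10' into (2, 2, 10) for comparison."""
    return tuple(int(part) for part in text.split("."))


def plan_updates(installed: Dict[str, str],
                 repository: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare a workstation's packages with a repository index.

    Returns package names grouped as 'upgrade' (a newer version is in the
    repository) and 'new' (in the repository but not yet installed).
    """
    plan: Dict[str, List[str]] = {"upgrade": [], "new": []}
    for name, repo_version in sorted(repository.items()):
        if name not in installed:
            plan["new"].append(name)
        elif parse_version(repo_version) > parse_version(installed[name]):
            plan["upgrade"].append(name)
    return plan


# Hypothetical package names and versions, for illustration only.
installed = {"blast": "2.2.10", "clustalw": "1.83"}
repository = {"blast": "2.2.13", "clustalw": "1.83", "hmmer": "2.3.2"}

plan = plan_updates(installed, repository)
print(plan)  # {'upgrade': ['blast'], 'new': ['hmmer']}
```

Comparing versions as integer tuples rather than strings matters even in a toy example: as text, '1.10' would sort before '1.9', so a workstation would miss the newer release.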
