From Famine to Feast
COMMENTARY

Open software for biologists: from famine to feast

Dawn Field, Bela Tiwari, Tim Booth, Stewart Houten, Dan Swan, Nicolas Bertrand & Milo Thurston

Developing and deploying specialized computing systems for specific research communities is achievable, cost effective and has wide-ranging benefits.

Every research scientist who depends daily on computers to store, manipulate and analyze data wants to arrive at work to a smoothly working computer system. Anything less than an up-to-date, complete and bug-free system can steal precious time away from research. Equally, the top priority of dedicated computing support services is to provide such systems.

The qualities of an ideal computing platform are, of course, in the eyes of the beholder. Important attributes include speed, stability, security, the potential to integrate effectively into existing networked environments and the inclusion of a wide range of standard and cutting edge software. We believe ideal solutions would be, whenever possible, freely available and give users access to the wealth of resources traditionally available only at the largest research centers.

In short, an ideal computer system is one that exactly matches user requirements, is readily available and is economical in terms of time and money. For the life sciences community, which is facing an exponentially increasing deluge of data, these attributes are not only desirable but increasingly essential. In particular, the advent of ‘omic technologies (genomics, transcriptomics, proteomics, metabolomics) is presenting biologists and bioinformaticians with the challenge of devising solutions for better and faster synthesis of raw data into scientific knowledge.

Building and delivering tailored computing solutions can require significant expertise, is often dependent on dedicated staff and hardware resources and sometimes involves the construction of large centralized facilities. Unfortunately, this has meant that many biological scientists are starved of access to adequate, let alone ideal, computing environments for their own research.

We argue here that the development and uptake of expertly configured computer workstations loaded with free and open source software (FOSS)1 can help overcome these challenges. Such systems provide ready-made solutions for those who are uncertain about how to obtain suitable software and the skills to fully exploit it. This is especially relevant in environments where information technology groups are geared toward supporting generic, rather than specialized, scientific systems. Such systems also herald new ways of facilitating research and data management activities taking place across geographically distributed sites, an increasingly common phenomenon in the life sciences.

The concept of ‘turnkey’ computing systems is not new. Historically, a range of manufacturers have made their computers more appealing by providing a combination of hardware as well as an operating system, software and documentation. It is the availability of FOSS operating systems, software and their hardware independence that is now transforming the accessibility and affordability of such systems.

From famine to feast

FOSS software lends itself well to distribution and modification and is supported by an active development community. It is also an economical and powerful way of accessing some of the best computing solutions available1. A driving force of the FOSS revolution is Linux. Technically speaking, the term Linux refers only to one core component of the operating system, but has become a catchall phrase for complete FOSS systems built around it. The popularity of Linux is steadily increasing because it delivers fast, cheap, powerful and flexible computing power.

The demands for shared, expert computing infrastructure in many fields, from physics (https://www.scientificlinux.org/) to electronic music (http://linux-sound.org/distro.html) and beyond (http://lwn.net/Distributions/), are being met by optimizing Linux systems to meet specific needs. In tandem, the number of projects delivering ‘out-of-the-box’ suites of software for life science research is also rapidly growing (Table 1)2. The frequent use of the expressions ‘Bio,’ ‘Linux’ and related words within the names of these bioinformatics-centric projects has led to the emergence of ‘BioLinux’ as an umbrella term for projects making access to bioinformatics software on a Linux platform easier through (i) the provision of complete systems, (ii) the provision of software packages, (iii) community building and support systems. To avoid confusion, we use the generic term BioLinux in italics throughout this article to distinguish it from the names of specific projects.

Dawn Field, Bela Tiwari, Tim Booth and Stewart Houten are at the NERC Environmental Bioinformatics Centre, and Dawn Field and Milo Thurston are at the Molecular Evolution and Bioinformatics Section, Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR, UK; Dan Swan is at the Bioinformatics Support Unit, Institute for Cell and Molecular Biosciences, University of Newcastle upon Tyne, Newcastle, NE2 4HH, UK; and Nicolas Bertrand is at CEH Computing Services, Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR. e-mail: [email protected]

© 2006 Nature Publishing Group http://www.nature.com/naturebiotechnology
NATURE BIOTECHNOLOGY VOLUME 24 NUMBER 7 JULY 2006 801

BioLinux for environmental genomics

In 2002, the UK’s Natural Environment Research Council (NERC), in support of its Environmental Genomics Program, established the NERC Environmental Bioinformatics Centre (NEBC)3. As part of its remit, NEBC developed and deployed NEBC Bio-Linux. Our user community includes bioinformaticians, software developers, system administrators and most importantly, biological researchers, many of whom were new to both Linux and bioinformatics.

To build NEBC Bio-Linux, we followed the generic method outlined in Table 2. We installed open source bioinformatics software on a customized Debian/GNU Linux base system (http://www.debian.org). To maximize our ability to troubleshoot the installation process for users, provide improved downstream support, and ease uptake among users with no previous Linux experience, we chose to distribute NEBC Bio-Linux as a software ‘clone.’ Unlike a true Linux distribution, NEBC Bio-Linux is an image, or snapshot, of a machine we maintain at NEBC. This image is created with SystemImager (http://www.systemimager.org/) software usually used to install computing clusters.

A bounty of software

Many BioLinux projects involve the creation of rich software repositories instead of complete systems (Table 1). The Linux community has [...] mainstream Linux distributions have some sort of package-management software. Most of these are based on the two most common systems: Red Hat’s (Raleigh, NC, USA) RPM and the Debian community’s APT.

Building a package for a piece of software saves users unwanted installation steps that can often be numerous and sometimes frustrating. Using packages, users can get tens to hundreds of pieces of software running in little more than the time it takes to download them and keep them up to date as new versions appear. It is the ability to use package repositories to track and redistribute the latest software and data analysis methods that holds the greatest allure.

Unifying a community

The installation of a common computing platform across laboratories facilitates data sharing, promotes the spread of best practice with regard to analysis and data management, and facilitates the provision of centralized, economic support. From a provider’s point of view, a diverse and active user base is an invaluable resource. For example, when researchers using NEBC Bio-Linux alert us to new programs they have seen or have developed themselves, we can package them and place them in our repository, where they become available to any NEBC Bio-Linux user. Likewise, a grab bag of community-specific programming code can easily be generated and maintained.

These projects also facilitate [...] multiLocus sequence typing (MLST; http://www.pubmlst.org/). In doing so, he hopes to simplify and standardize the informatics and data management associated with building global databases containing genotypes of a range of important pathogens. In particular, building international mirrors of MLST databases has been facilitated by the use of a common platform as the multiple software packages required to run the sites can be assumed to be available and installed in a standard location.

Building infrastructure

As the demand for bioinformatics training grows, access to suitable facilities is becoming increasingly important. Likewise, as the need for high-throughput processing of data grows, so does the need for access to the type of computing power that can be provided by computing clusters. The ‘cloning’ method of deploying machine images is particularly well suited for rapidly populating suites of computers and therefore provides a suitable option for larger scale projects.

For example, NEBC Bio-Linux runs in teaching laboratories at the University of Liverpool (http://www.genomics.liv.ac.uk/NIBHI_Cluster_page.html), the University of Cardiff (http://watson-bios.grid.cf.ac.uk/Resources/biolinux.html) and the Oxford Centre for Ecology and Hydrology (http://darwin.nerc-oxford.ac.uk). Students in such laboratories