Gearing for bioinformatics

Bela Tiwari and Dawn Field explore the tools and facilities that can be used by the budding open source bioinformatician

The bioinformatics playground

‘Bioinformatics’ is a buzz word that is becoming increasingly audible in the Linux world. Fast, economical, flexible, and extensible computing power is making Linux increasingly attractive to scientists in many areas of research, including biology. More generally, the open source movement has greatly benefited biological research; the most publicised project being the publicly funded effort to sequence and make freely available the human genome. Less well publicised is the huge amount of biological data that can be freely accessed. The combination of data availability and free software is revolutionising this field.

The ability to redistribute Linux, the existence of online documentation, active user and developer communities, and the fact that much bioinformatics software is developed for Linux/Unix systems, have opened the way for individual users without access to large centralised resources to be able to install and run bioinformatics software to analyse data, and to start developing for the wider community.

Here we outline projects that can help to significantly ease the experience of trying out, using, and providing computing platforms appropriate for bioinformatics analyses.

KNOWING WHEN
Turning data into knowledge is a complex task that demands data manipulation, comparison, statistical analysis, visualisation, as well as data storage and dissemination. Usually, the weight of many lines of evidence must be combined to answer a scientific question, and the interpretation of the output of many different software tools plays a key role in discerning and assembling data from which biological knowledge is born.

Finding and installing common tools for bioinformatics on your own machine, especially for those new to Linux, can be a daunting task. Projects with enough funding are able to hire dedicated system administrators to provide sustainable bioinformatics computing systems, but many of us are not that lucky and have to go it alone.

To add to the challenge, much bioinformatics software is written by academics, and while there are some very good, well tested packages out there, there are also many that were intended to answer a particular question, on a particular machine, for a particular group. Such packages were often not built with portability, future use or further development in mind. Knowing when to persevere or give up with a piece of software is all part of the key skills of a bioinformatician or bioinformatics systems provider. Even very experienced system administrators can sometimes find installing and integrating bioinformatics software and databases frustrating and tedious.

Many developers have faced these challenges already, and taking advantage of the resources some of them have made freely available can greatly reduce the overheads involved in establishing a new system for bioinformatics. Some of these resources are described in this article, including CD and DVD-based Live Linux distributions customized for bioinformatics analyses, full distributions that can be installed from iso images or installed over the network, and also specialised package repositories. Each of these solutions has its particular attractions for users with different requirements.

PICKING YOUR SOLUTION
Whether you plan to use a system yourself or provide it for others, give thought to your long-term requirements. Questions you might be asking yourself include how much computing power you are likely to need, whether you require a cluster-based solution, how many databases need to be stored locally, how many users will depend on the system, how they will access it, etc. Live CD or DVD distributions may be good for an individual and for demonstration purposes, but they are probably not the right choice for the provision of tools to a whole department.

LIVE DISTRIBUTIONS
Live Linux distributions are a relatively new phenomenon and offer some big advantages. You don’t have to install anything to run them. Just slot the CD or DVD into the drive and boot your machine. Et voila! If the developers have done their jobs correctly, the software should be configured to run properly without any further configuration. Live distributions may appeal to people who want to try a system out, those who want to demonstrate software to others, or those who want a portable Linux system for their own purposes. It is unlikely, however, that a live distribution will suffice as your primary bioinformatics system if you want to undertake serious bioinformatics work.

FULL SYSTEMS
Full systems customised for bioinformatics work are offered freely by a number of groups. Installed systems are very flexible. Unlike a Live distribution, you can always add extra software and customise to your heart’s content. The distributions reviewed here are available either by downloadable iso files (BioBrew and BioLand) or by network installation (Bio-Linux). Currently, BioBrew is the only distribution of the three reviewed that can also be purchased on DVD.

By nature a certain degree of knowledge is required for maintaining a machine running Linux, with the level required varying between the systems reviewed here. For example, if you are a biologist with little computing or systems knowledge, but you require access to a high

LinuxUser & Developer

powered bioinformatics computer cluster, BioBrew might be a distribution for you to consider, but doing this in collaboration with a skilled system administrator could make a huge difference.

It is also advisable to consider the level of security on these systems at the time of installation, and whether you need to take additional steps to keep out hackers. If you are going to use the machine within an organisation and connect it to the network, you may have to give assurances about the security of the system.

For bioinformatics systems providers, it may be worth using one of these distributions as a base system, which can then be further modified to meet your requirements. An example of such a project is the collaboration between Bio-Linux developers and Keith Jolley of the University of Oxford, who distributes this platform to users involved in studying pathogenic bacteria. A number of international mirror sites for his project, pubmlst.org, are in the process of being established and will also be using Bio-Linux as the base distribution.

PACKAGE REPOSITORIES
For those users who already have root access to a Linux machine and can install software to central locations, it may be desirable to add packages to an existing core system. An exciting development in this area is the creation of several repository projects that specialise in providing bioinformatics software as packages. Configure your system to accept automatic package updates and you can take much of the pain out of bioinformatics software installation and system maintenance.

The bioinformatics software currently available in package repositories is just a fraction of what exists out there. Many developers now provide the software pre-packaged on their websites, and many more provide source code bundles. If you find an application you really like for which there is no package available, and it is redistributable, perhaps you could collaborate with the maintainers of a repository to make it available as a package, thereby boosting the availability of that software to the wider community.

FINAL WORDS
The future is getting brighter for those needing to set up robust and scalable computing environments for research. The vast quantities of publicly accessible software and data, and the amount of online help and books about bioinformatics available, mean that anyone interested in the subject area can get involved at some level. If you are interested, why not visit Bioinformatics.org, or the bioinformatics section on SourceForge and see how you can get involved.

Setting up your bioinformatics toolbox is just one step in the research process. In order to carry out meaningful analyses, you need to have a question to answer and an understanding of the context of that question. The good news is that you can keep up with many of the new developments in this area of science as access to the scientific literature is increasingly free.

The freedom to make packages and distributions available to the wider community comes back to the commitment of many software developers, and others, to the open source ethos. Allied with free availability of biological datasets, and access to scientific literature, this truly puts the world of biology and bioinformatics within the grasp of anyone willing to learn.

Open
It has been to biology’s immense benefit that the ethos in the biological research community (and the requirement for publication in many journals) is that data be submitted to public repositories. If the data was behind locked doors, the rate of biological discovery would be slowed greatly from its present pace, and access to knowledge would be restricted to the few that have the financial resources to pay...

The one bioinformatics project that everyone knows about is the Human Genome Project. It was not a given that the human genome would become freely available. If not for the hard work of the publicly funded effort to sequence the human genome, undertaken on Linux and free software, we could have had a situation where our own genetic heritage was owned, at least partially, by commercial concerns.

Luckily the human genome was completed, or rather ‘completed enough’, by the publicly funded group to ensure it is freely available to all. The human genome, along with millions of other sequences, can be downloaded from various databases including GenBank, EMBL and DDBJ. Versions of the human genome, and other genomes, with additional annotation and information are also available and searchable freely from other sites such as EnsEMBL and the UCSC Genome Browser. Beyond genome sequences, there are literally hundreds of publicly accessible databases providing access to many different types of data for many different types of organism. For an easy way to peruse many of the major biological databases, try installing BioBar, a toolbar for Mozilla based browsers.

Key links
Bioconductor: www.bioconductor.org
EMBOSS: emboss.sourceforge.net
maxdLoad2 and maxdView: bioinf.man.ac.uk/microarray/maxd
NCBI Tools: www.ncbi.nlm.nih.gov/Tools
R and CRAN: www.r-project.org
Bio languages: bio.perl.org, www.biopython.org, www.biojava.org, bioruby.org
NCBI: www.ncbi.nlm.nih.gov
EBI: www.ebi.ac.uk
The Sanger Centre: www.sanger.ac.uk
GenBank: www.ncbi.nlm.nih.gov
EMBL: www.ebi.ac.uk/embl
DDBJ: www.ddbj.nig.ac.jp
EnsEMBL: www.ensembl.org
UCSC Genome Browser: genome.ucsc.edu
Biobar: biobar.mozdev.org

Life on a live CD: the roundup
Building on the overview of CD-based Linux distributions customised for bioinformatics given in the article, ‘Hacking the Code of Life’ (LinuxUser and Developer, issue 43), we expand on that information and outline issues that might help people new to this area choose which distributions to try first.

VLINUX
Developer: V. Vimalkumar
URL: bioinformatics.org/vlinux/
Availability: Free download of iso
Base system: Knoppix
VLinux contains a good range of bioinformatics software, mostly concentrating on sequence data manipulation and analysis, and structural viewing programs. A main benefit, apart from the range of software available, is the provision of graphical menu options for the bioinformatics software on the system. This includes a comprehensive listing of packages under the EMBOSS system, available both under the bioinformatics menu, and as a separate menu listing. Unlike some other distributions listed here, VLinux does not include ncbitools, which contains the popular BLAST program and many other tools from the NCBI, but for new users, it will probably be more enticing to run this sort of analysis via the web.

This is a good system for the new user to consider starting off with. It has an easy interface and the inclusion of the categorized EMBOSS graphical menu system is a boon.

VIGYAAN
Developer: P.K. Agarwal, Oak Ridge National Laboratory
URL: www.vigyaancd.org
Availability: Free download of iso
Base system: Knoppix
Vigyaan’s developer can be applauded for putting so much thought into the beginner-user’s experience. The provision of an easy to access and easy to use demo system is very useful. Labelled as a ‘bio/chemical software workbench’, it is no surprise that the range of software available on Vigyaan includes packages for chemical and structural analysis and viewing as well as sequence analysis tools. This system also gives prominence to a Knoppix script that creates a persistent home directory for a user. This means that you could continue to use the CD to run programs, without installing the system on a hard drive, but the system would then recognise a file system, for example on a floppy disc or memory stick, as an area where your work is stored, rather than data remaining purely in memory. Bioinformatics software can be run using graphical menu options, and while EMBOSS is not included, the ncbitools package is, and applications from it also run from under the graphical menu system. Documentation for this system is good, with information available both on the website and on static pages included in the system.

Overall, this system is worth trying and might be suitable particularly for new users who have an interest in chemistry and structure in addition to sequence analysis programs.

BIOKNOPPIX
Developer: C.M. Rodriguez, High Performance Computing Facility, University of Puerto Rico
URL: bioknoppix.hpcf.upr.edu/
Availability: Free download of iso
Base system: Knoppix
BioKnoppix is probably the most well known of the Live CD bioinformatics-centric systems reviewed here. It provides a graphical menu through which most of the bioinformatics software is available. BioKnoppix contains a smaller collection of bioinformatics software than the other systems reviewed here, mostly aimed at sequence analysis. It does include BioPerl and BioPython, important for anyone doing bioinformatics-related programming, as well as Bioconductor, making BioKnoppix and Quantian the only CDs listed here to include any software for microarray analysis. Unfortunately there is a problem with R on the latest version, 0.2.1, which prevents use of Bioconductor.

Overall, this is a good system for someone who needs access to BioPerl and BioPython on a portable system, but issues such as the relatively small number of packages available and the fact that bioinformatics software documentation on the system is not easy to track down for the uninitiated, may mean that new users should look to another system for their first foray into this area.

DNALINUX
Developer: Universidad Nacional de Quilmes, Argentina
URL: www.dnalinux.com/
Availability: free download of iso after registration
Base system: [...]
This distribution provides access to a fairly large range of bioinformatics tools, including both the EMBOSS package and ncbitools. Unfortunately, it does not provide easy access to these tools, either through graphical menu options or through installation of the binaries in a directory that is on the default PATH. A nice addition in DNALinux is the choice of window managers, though you will need to use the command line when you enter the system. As speed is one of the major issues when running Live CD distributions, having less memory intensive options than KDE, the window manager commonly run on [...]-derived systems, is very useful for those working on slower systems.

Overall, if you are happy using the command line and giving full paths to the programs, there are lots of programs to choose from on DNALinux. Otherwise, you may want to look at other distributions listed in this review.

QUANTIAN
Developer: Dirk Eddelbuettel
URL: dirk.eddelbuettel.com/quantian.html
Availability: free download of iso
Base system: clusterKnoppix
This distribution was originally developed with quantitative analysis in mind and the list of software included is long and impressive. Bioinformatics software was added as part of a recent move to DVD images and contains, among other things, both the EMBOSS and ncbitools packages, BioPerl and BioPython libraries, as well as Bioconductor. Quantian also includes essentially all of the statistical language modules available from the Comprehensive R Archive Network (CRAN). For any bioinformatician or statistician on the go, the inclusion of these is invaluable. At just over 1.9 GB, Quantian is about three times the size of any of the other distributions listed here. This means you will need a decent network connection to download the system, and will have to burn a DVD, or follow the instructions on the website explaining other ways you can run the system. Quantian DVDs can be purchased from a vendor.

A person lacking a mathematical background might be better off starting with one of the other distributions in this list, especially those that emphasise documentation or demonstrations. However, anyone looking for a portable system that would allow them to run analyses, write scripts, and generally use tools to a high level, would be hard put to find a more comprehensive live system than Quantian available today.

Life on a full installation: the roundup
As time goes on, standard Linux distributions may include bioinformatics software as a matter


of course. Already, the [...] distribution, although considered suitable for more experienced users and administrators, includes over 180 science packages as part of its standard system. Currently, the easiest way for many people to get a fully operational bioinformatics system will be to try one of the freely available Linux distributions customised for this purpose. In addition to the range of bioinformatics software available on each one, they differ as to which user community or communities they are aimed at.

BIO-LINUX
Developed by: The Environmental Genomics Thematic Programme Data Centre, Oxford
URL: envgen.nox.ac.uk/biolinux.html
Availability: free download of packages, register to install the full system
Base system: Debian
The Bio-Linux project was funded by the UK Natural Environment Research Council. It includes a large amount of sequence analysis and expression analysis software, and some other specialist tools of interest to those in the fields of proteomics and population genetics. The bioinformatics software comes fully configured and many programs can be run through graphical menus. Sample data is included with the system for those that want to investigate the use of programs.

Bio-Linux also provides for software developers; Eclipse and KDevelop are on the system, as well as the various ‘bio-libraries’ such as BioPerl, BioPython, BioJava and BioRuby. Microarray analysis and annotation is catered for by the inclusion of Bioconductor, and the graphical packages, maxdView and maxdLoad2. System and bioinformatics packages are updated nightly, minimising necessary system administration tasks.

There is a large amount of documentation available for Bio-Linux, on the website and on the system itself. Of particular note is the comprehensive bioinformatics software documentation system, which presents documentation about the software on the system through a categorised and searchable interface, and includes links to software homepages, local and remote documentation.

Machines running Bio-Linux can be clustered together for computer intensive tasks thanks to the inclusion of the Condor job scheduling software. The SystemImager software, also present, makes it simple to clone Bio-Linux systems onto other machines.

Bio-Linux is currently deployed as a full image using SystemImager technologies. This means that the whole system is installed over the network and is not a technically demanding process. With the release of Bio-Linux 4.0, the potential ways to install the system will increase and documentation will be provided on the website as these methods become available.

Bio-Linux is designed to meet the needs of desktop users as well as those requiring access to a powerful and extensible analysis and development platform.

BIOBREW
Developed by: Glen Otero, Callident
URL: bioinformatics.org/biobrew
Availability: free download of isos and packages
Base system: Rocks
BioBrew is a bioinformatics clustering solution, based on the NPACI Rocks cluster software. It automates cluster installation and includes high performance computing software and a number of bioinformatics applications, including versions for cluster environments. A BioBrew installation guide and a user guide are available from the website.

BioBrew can be obtained as a DVD iso, or as a ‘roll’ that is installed over a base Rocks installation. DVDs and printed documentation can be purchased from the website. Information on the mailing lists suggests that people who choose BioBrew find it straightforward to install. A version of BioBrew augmented with additional system tools is also available commercially through Callident.

BioBrew is designed primarily to provide a platform for high performance computer clusters for people running computer intensive bioinformatics analyses.

BIO-LAND
Developed by: Center of Bioinformatics, Peking University
URL: bioland.cbi.pku.edu.cn
Availability: free download of isos
Base system: Fedora Core 2
BioLand comes with a standard set of bioinformatics software installed, and also comes with several smaller biological databases. Although no explicit sample data is included, these databases can provide a start for new users. The desktop is a standard Fedora desktop. Software can be run easily from the command line and a small number of bioinformatics programs are available via graphical menus. BioLand currently does not include any of the ‘bio-libraries’ such as BioPerl or BioPython, nor does it, as yet, include software for microarray analysis.

The website gives promise that documentation for various aspects of the system will be available soon, but as this system was only released very recently, the site is still under development.

The BioLand system is contained on three iso images, downloadable from the website. Installation is straightforward, even for people new to Linux. Aimed more towards use as a server, rather than a desktop system, BioLand is the most recent addition to the freely available customized bioinformatics systems.

Package repositories
In addition to the software packages available from large package repositories, like those found under www.debian.org, there are a growing number of repositories specializing in providing bioinformatics software packages. The content of these repositories differs in terms of the format of packages provided, the systems they are tested on, and the number of packages available.

Packages from repositories can make your life a lot easier, but it should be kept in mind that using packages others have contributed means that you are dependent on the quality of their work, and the decisions they have made. In most cases, this will be fine, but installing and updating from packages does not completely remove the need for testing.

Some repositories are set up purely with the aim of providing packages needed by the general user community:

The BioLinux Project: www.biolinux.org/project
BIOrpms: apt.bea.ki.se/
Bio-Linux-BR: glu.df.ibilce.unesp.br
Debian-Med: www.debian.org/devel/debian-med

Others exist primarily to serve a community using an associated base system, but welcome


access to the software on their web sites:

BioBrew packages: ftp.bioinformatics.org/pub/biobrew/rpms/
Bio-Linux packages: envgen.nox.ac.uk/pkg_repository.html

Open reading
Good science means asking good questions. But asking good questions usually involves acquiring solid background knowledge of the subject, and a good understanding of current developments. Background knowledge is usually best obtained by referring to textbooks, but a good understanding of current developments means getting access to scientific literature, and that can be an expensive proposition, involving subscriptions to journals charging hefty fees, or perhaps paying companies that specialise in providing requested articles. This means that for industry, or for the lucky few in well funded academic institutions, information is on tap, but others are at a significant disadvantage.

But times are changing, and hot on the heels of the open source revolution for software, is the call for open access to scientific papers. Arild Bjørndal, Medical Director of the Norwegian Health Services Research Centre, puts it this way: “Open access is the way forward in medical publishing. We must put a stop to the way that scarce public resources are used; first to fund research and then again to pay to be able to read the results of that same research.” This topic is contentious though, as opinions differ as to the best way to disseminate and preserve research results for the future.

BioMed Central is an independent publishing house that is committed to open access. The UK, Norway and Finland have all taken out nationwide BioMed Central membership. These deals cover the costs of publishing in any of BioMed Central’s 120 journals. Moves are afoot in other countries to try and get research funding bodies to endorse, and even require, publishing to journals with an open access policy. With big players, like the Wellcome Trust, the UK’s largest biomedical research charity, and the National Institutes of Health in the US, backing the movement to make results of research funded by those bodies freely available, the future of open access to literature is looking positive.

The increasing open access to scientific literature means that anyone can start learning about recent developments in the biological sciences, with many articles already accessible over the internet. Popular sites for searching for literature online include PubMed Central, a freely available digital archive of life sciences journal literature, and the newly released Google Scholar. Another option for searching is to install the BioMed toolbar and the Google Scholar search plugin.

If you are interested in the issue of open access, Open Access Now is a newsletter designed to inform researchers about the Open Access movement, including the different models being considered to provide access to scientific literature, as well as news and views about this area.

BioMed Central: www.biomedcentral.com
PubMed: www.ncbi.nlm.nih.gov/entrez/query.fcgi
Google Scholar: scholar.google.com
BioMed toolbar and Google Scholar plugin: www.biomedcentral.com/info/about/toolbar
The Open Access Now newsletter: www.biomedcentral.com/openaccess

Brief Advice for the budding bioinformatician
Rule 1: If you find something exciting, make sure it’s real.

Rule 2: Once you are convinced it is real, double check that it’s a new finding before you tell the world about it. Whilst this might sound negative, the fact is that a really unusual result is often due to an error somewhere in the analysis, or it may already have been reported. For example, one common mistake when working in this area is running analysis on data in the wrong format. Sure, the program should give you an error and not report garbage results, but the error checking capabilities of bioinformatics software range from excellent to non-existent. How well do you know your software?

Rule 3: Document what you have done. Analyses need to be replicable if others are to judge the results as reliable. If you want to be able to duplicate your results, or importantly, allow someone else to replicate your results, some key information to note includes:
- the version of the software used
- the version of the database from which any data used was retrieved, and the site the data was taken from
- the parameters used for the analysis
- all the steps taken in the analysis
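
The record-keeping suggested in Rule 3 of the brief advice can be captured in a small, machine-readable log kept alongside each analysis. The following is only a minimal sketch in Python, not something taken from the article: the class name `AnalysisRecord`, its field names, and all of the example values (tool name, version strings, database URL, parameters) are hypothetical placeholders chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass
class AnalysisRecord:
    """One analysis run, holding the four kinds of information listed in Rule 3."""
    software: str                 # name of the program that was run
    software_version: str         # the version of the software used
    database: str                 # name of the database the data came from
    database_version: str         # the version (release) of that database
    database_source: str          # the site the data was taken from
    parameters: dict = field(default_factory=dict)  # parameters used for the analysis
    steps: list = field(default_factory=list)       # all the steps taken, in order
    run_date: str = field(default_factory=lambda: date.today().isoformat())

    def add_step(self, description: str) -> None:
        """Append a free-text description of one step of the analysis."""
        self.steps.append(description)

    def to_json(self) -> str:
        """Serialise the record so it can be saved next to the results."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example values; none of them come from the article.
record = AnalysisRecord(
    software="example-aligner",
    software_version="1.0.0",
    database="example-db",
    database_version="release 1",
    database_source="https://example.org/db",
    parameters={"evalue": 1e-5, "word_size": 11},
)
record.add_step("Downloaded sequences from the database source")
record.add_step("Ran the aligner with the parameters above")
print(record.to_json())
```

Writing the JSON output to a file in the same directory as the results would let someone else see exactly which software, data release and settings produced them. A plain text lab notebook serves the same purpose; the structured form simply makes the record easier to search and compare later.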
