Buying in to Bioinformatics: an Introduction to Commercial Sequence Analysis Software David Roy Smith

BRIEFINGS IN BIOINFORMATICS. VOL 16. NO 4. 700 ^709 doi:10.1093/bib/bbu030 Advance Access published on 1 September 2014 Buying in to bioinformatics: an introduction to commercial sequence analysis software David Roy Smith Submitted: 25th June 2014; Received (in revised form) : 7th August 2014 Abstract Advancements in high-throughput nucleotide sequencing techniques have brought with them state-of-the-art bioinformatics programs and software packages. Given the importance of molecular sequence data in contemporary life science research, these software suites are becoming an essential component of many labs and classrooms, and as such are frequently designed for non-computer specialists and marketed as one-stop bioinformatics toolkits. Although beautifully designed and powerful, user-friendly bioinformatics packages can be expensive and, as more arrive on the market each year, it can be difficult for researchers, teachers and students to choose the right software for their needs, especially if they do not have a bioinformatics background. This review highlights some of the currently available and most popular commercial bioinformatics packages, discussing their prices, usability, features and suitability for teaching. Although several commercial bioinformatics programs are arguably overpriced and overhyped, many are well designed, sophisticated and, in my opinion, worth the investment. If you are just beginning your foray into molecular sequence analysis or an experienced genomicist, I encourage you to explore proprietary software bundles.They have the potential to streamline your research, increase your productivity, ener- gize your classroom and, if anything, add a bit of zest to the often dry detached world of bioinformatics. Keywords: bioinformatics software; CLC bio; Geneious; genome assembly; nucleotide alignment; phylogenetics software INTRODUCTION Anyone who has ever had something sequenced, Most mornings I wake up to a slew of spam email such as a genome, transcriptome, gene or PCR from biotech companies offering unbeatable bargains product, or used nucleotide or protein sequence on next-generation sequencing (NGS). Yesterday, data in their research has probably dabbled in for example, Beckman Coulter kindly offered to bioinformatics. Not long after scientists started ‘take the stress out of sequencing’ for only a few generating molecular sequence information, com- thousand dollars. Illumina recently provided me puter-savvy biologists and biology-savvy computer with ‘a glimpse into the future of genomics’, just scientists began developing programs to analyse by clicking on their buyer’s guide. And Macrogen, those data [3]. Given the breadth and depth of ques- a South Korean sequencing conglomerate, dared me tions that can be addressed with primary biological to race the HiSeq ‘Xpressway to the $1000 genome’. sequence information, many of these programs have These irritating emails underscore an important become immensely popular. For example, the jour- point: massively parallel sequencing has arrived to nal article describing the basic local alignment search the masses. NGS is now standard fare in almost all tool (BLAST), which allows a query nucleotide facets of life science research [1]. It is also big busi- or amino acid sequence to be compared against a ness and intimately tied to another burgeoning database of sequences, has been cited >50 000 industry—bioinformatics [2]. times [4]. Corresponding author. David Roy Smith, University of Western Ontario, London, Ontario N6A 5B7, Canada. E-mail: [email protected] David Roy Smith is an assistant professor of biology at the University of Western Ontario, where he studies genome evolution of eukaryotic microbes. He can be found online at www.arrogantgenome.com and @arrogantgenome. ß The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, Downloaded from https://academic.oup.com/bib/article-abstract/16/4/700/348707please contact [email protected] by Sveriges Lantbruksuniversitet user on 26 February 2018 Buying in to bioinformatics 701 Today’s omics-obsessed scientific marketplace running concurrently. I would be desperately editing is overflowing with bioinformatics programs. and assembling Sanger sequences with Phred, Phrap Whatever your sequence analysis problem (assem- and Consed [11, 12] while blasting the resulting con- bling, aligning, annotating, folding, etc.), there is tigs locally against custom databases and annotating probably a program or online application to solve the output on an ongoing GenBank entry. A chug of it—skim through the community-maintained list of coffee and I would switch to gene and genome bioinformatics software at SEQanswers.com to alignments with ClustalW [13] and MAUVE [14], see what I mean: http://seqanswers.com/wiki/ which were pumped directly into PAML [15] to Software/list. The majority of these tools are open measure genetic diversity and substitution rates. source, but they can be difficult to learn, install and A quick blink of the eyes and I was plotting slid- run; some require an in-depth knowledge of com- ing-window GC contents across entire organelle puters [5]. There are, however, various commercial chromosomes, all while mainlining a medley of alternatives, which bring together multiple bioinfor- phylogenetic and tree-building programs, from matics programs into user-friendly stand-alone pack- MrBayes [16] to PhyML [17] to PAUP [18] to ages. Although beautifully designed, these software MacClade [19]. And because some of these applica- suites can come with a hefty price tag, meaning that tions worked only on a PC and others on a Mac, most researchers, teachers and students are lucky if I had Windows and Apple operating systems running they can afford just one. Like buying a car, choosing at the same time, with command-line terminals piled between different suites can be challenging, and on top of graphical user interfaces (GUIs). there is surprisingly little information appraising ‘There’s got to be an easier, more efficient way of the different programs. Here, I describe my own doing this’, I would say to myself, as I tossed another experiences with using commercial bioinformatics empty coffee cup into the trash. Sensing my angst, a packages, focusing on their cost, functions and edu- colleague recommended that I invest in a commer- cational utility. cial, cross-platform, GUI-based bioinformatics pack- I have no affiliation, past or present, with any of age, arguing that it would streamline and simplify my the programs, software or companies described in work. I was reluctant to take his advice. I felt that this manuscript, but being a longtime genomics en- paying for such programs went against the spirit of thusiast, I use many of these applications daily, and I academic research and that using GUI software am a strong proponent of ease of use and accessibility would weaken my computational skills. However, in bioinformatics [5]. Although the focus of this after failing for the fourth time to correctly install article is commercial software, there are a number and run an open-source genome assembly algorithm, of free browser-based bioinformatics toolkits worth I gave in and bought a user-friendly bioinformatics considering, e.g. [6, 7]. Two toolkits that I use regu- bundle, and have not regretted it. larly and recommend are MEGA [8] and Unipro UGENE [9]. See Vincent and Charette [10] for a Show me the money succinct but compelling summary of the drawbacks In 2007, with the grant support of my former PhD of commercial tools and arguments for freedom in supervisor, I purchased my first bioinformatics soft- bioinformatics. ware package. At the time, there was a small but strong cohort of commercial options available, Abioinformaticsmagicbullet most of which offered a free 30-day trial—a practice During my PhD I spent hours a day at the computer that, with few exceptions, continues today, although assembling and analysing organelle genomes. Friends in some cases the trial period has been reduced to 2 and colleagues would poke their heads into my weeks. After testing an assortment of programs, cubbyhole of an office and recoil at the sight of I decided on Geneious (Biomatters Ltd., Auckland, my bloodshot eyes and the genomic chaos playing New Zealand), which was first released in 2005 and out on the high-definition dual monitors that sur- is now among the more widely used cross-platform rounded me. ‘Dave, how many analyses are you commercial bioinformatics packages (Table 1). running?’ they would ask, gaping at the mosaic dis- I chose Geneious not because it was necessarily array of program windows scattered across the better than other software, but because the company screens. Like any decent genomics junkie, I usually offered, and continues to offer, student discounts. had half a dozen different bioinformatics applications Seven years ago, I paid approximately $200 (all Downloaded from https://academic.oup.com/bib/article-abstract/16/4/700/348707 by Sveriges Lantbruksuniversitet user on 26 February 2018 on 26 February 2018 by Sveriges Lantbruksuniversitet user Downloaded fromhttps://academic.oup.com/bib/article-abstract/16/4/700/348707 702 Ta b l e 1: Examples, features and comparisons of

Buying in to Bioinformatics: an Introduction to Commercial Sequence Analysis Software David Roy Smith

Current Status and Future Perspectives of Bioinformatics in Tanzania

White Paper on CLC Read Mapper

Fpgas in Bioinformatics

CLC Sequence Viewer

HMMER User's Guide

A Novice's Guide to Analyzing NGS-Derived Organelle And

CLC Genomics Workbench Features & Benefits

CLC Sequence Viewer Manual for CLC Sequence Viewer 6.5 Windows, Mac OS X and Linux

CLC Science Server Administrator

Consequences of Normalizing Transcriptomic and Genomic Libraries of Plant Genomes Using a Duplex- Specific Nuclease and Tetramethylammonium Chloride

Supplement on Visualizing Biological Data

Diploma Thesis a Module for Semi-Automated Annotation of Megabase-Sized DNA Sequences by Homolgy Search Stefan Michael Schuster