The Anatomy of Successful Computational Biology Software

Stephen Altschul¹, Barry Demchak², Richard Durbin³, Robert Gentleman⁴, Martin Krzywinski⁵, Heng Li⁶, Anton Nekrutenko⁷, James Robinson⁶, Wayne Rasband⁸, James Taylor⁹ & Cole Trapnell¹⁰

Creators of software widely used in computational biology discuss the factors that contributed to their success

The year was 1989 and Stephen Altschul had a problem. Sam Karlin, the brilliant mathematician whose help he needed, was so convinced of the power of a mathematically tractable but biologically constrained measure of protein sequence similarity that he would not listen to Altschul (or anyone else for that matter). So Altschul essentially tricked him into solving the problem stymying the field of computational biology by posing it in terms of pure mathematics, devoid of any reference to biology. The treat from that trick became known as the Karlin-Altschul statistics that are a key part of BLAST, arguably the most successful piece of computational biology software of all time.

Nature Biotechnology spoke with Altschul and several other originators of computational biology software programs widely used today (Table 1). The conversations explored what makes certain software tools successful, the unique challenges of developing them for biological research and how the field of computational biology, as a whole, can move research agendas forward. What follows is an edited compilation of interviews.

What factors determine whether scientific software is successful?

Stephen Altschul: BLAST was the first program to assign rigorous statistics to useful scores of local sequence alignments. Before then people had derived many different scoring systems, and it wasn’t clear why any should have a particular advantage. I had made a conjecture that every scoring system that people proposed using was implicitly a log-odds scoring system with particular ‘target frequencies’, and that the best scoring system would be one where the target frequencies were those you observed in accurate alignments of real proteins. It was the mathematician Sam Karlin who proved this conjecture and derived the formula for calculating the statistics of the scores [E-values] output by BLAST. This was the gravy to the algorithmic innovations of David Lipman, Gene Myers, Webb Miller and Warren Gish that yielded BLAST’s unprecedented combination of sensitivity and speed.

Another great aspect of the popularity of BLAST was that over time it was seamlessly linked to NCBI’s sequence and literature databases, which were updated daily. When we developed BLAST, the databases available were in relatively poor shape. In many instances, you had to wait for over a year between the publication of a paper and when its sequences appeared in a database. A lot of very talented and dedicated people worked to construct the infrastructure at NCBI that allowed you to search up-to-date databases online.

[Photo caption: Stephen Altschul co-developed BLAST.]

Cole Trapnell: Probably the most important thing is that Cufflinks, Bowtie (which is mainly Ben Langmead’s work) and TopHat were in large part at the right place at the right time. We were stepping into fields that were poised to explode, but which really had a vacuum in terms of usable tools. You get two things from being first. One is a startup user base. The second is the opportunity to learn directly from people what the right way, or one useful way, to do the analysis would be.

[Photo caption: Cole Trapnell developed the TopHat/Cufflinks suite of short-read analysis tools.]

Heng Li (developed MAQ, BWA, SAMtools and other genomics tools): I agree timing is important. When MAQ came out, there was no other software that could do integrated mapping and SNP [single-nucleotide polymorphism] calling. BWA was among the first batch of Burrows-Wheeler-based aligners (BWA, Bowtie and SOAP2 were all developed at about the same time). Similarly, SAMtools was the first generic SNP caller that worked with any aligner, as long as the aligner output SAM format.

Robert Gentleman: The real big success of R, I think, was around the package system. Anybody that wanted to could write a package to carry out a particular analysis. At the same time, this system allowed the standard R language to be developed, designed and driven forward by a core group of people.

For Bioconductor, which provides tools in R for analyzing genomic data, interoperability was essential to its success. We defined a handful of data structures that we expected people to use. For instance, if everybody puts their gene expression data into the same kind of box, it doesn’t matter how the data came about, but that box is the same and can be used by analytic tools. Really, I think it’s data structures that drive interoperability.

[Photo caption: Robert Gentleman is co-creator of the R language for statistical analyses.]

Wayne Rasband: Several factors have contributed to the usability of ImageJ. First, it has a relatively simple graphical user interface, similar to popular desktop software, such as Photoshop. Second, there is a large community of users and developers willing to answer questions, contribute plugins and macros, and find and fix bugs. Third, because it is written in Java, ImageJ runs on Linux, Macs and Windows. And fourth, ImageJ is open source, so users can inspect, modify and fix the source code.

[Photo caption: Wayne Rasband developed the ImageJ image analysis software.]

Richard Durbin: I think a key thing is that software or a data format does a clean job correctly, that it works. For the software I’ve been involved with, I think support isn’t a critical thing, in a strange way. Rather, it’s the lack of need for support that’s important. Also, I have always wanted to use the software I developed, or my group has wanted to use it to do our own job. John Sulston taught me to try to write something that does the best possible job for yourself and enables others to see what you would want to get out of the data. Don’t think that you’ll produce one version for yourself and then somehow have a different tool for others. For example, the people who built the C. elegans physical maps in the ’80s made the same software used to build the maps available to external scientists so they could look at the evidence and the data. It’s important, I think, to generate data at the same time as writing software for those data, being driven by problems that are at hand.

[Photo caption: Richard Durbin led the development of many tools and data standards in genomics.]

Does scientific software development differ from that for other types of software?

Durbin: Scientific software often requires quite a strong insight, that is, algorithmic development. The algorithm implements novel ideas, is based on deep scientific understanding of data and the problem, and takes a step beyond what has been done previously. In contrast, a lot of commercial software is doing specific cases of fairly straightforward things: book-keeping and moving things around and so on.

Gentleman: I have found that real hardcore software engineers tend to worry about problems that are just not existent in our space. They keep wanting to write clean, shiny software, when you know that the software that you’re using today is not the software you’re going to be using this time next year. At Genentech (S. San Francisco, California), we develop testing and deployment paradigms that are on somewhat shorter cycles.

¹National Center for Biotechnology Information, Bethesda, Maryland. ²University of California San Diego, La Jolla, California. ³Wellcome Trust Sanger Institute, Hinxton, UK. ⁴Genentech, South San Francisco, California. ⁵British Columbia Cancer Research Centre, Vancouver, Canada. ⁶Broad Institute, Cambridge, Massachusetts. ⁷Penn State University, University Park, Pennsylvania. ⁸National Institute of Mental Health, Bethesda, Maryland. ⁹Emory University, Atlanta, Georgia. ¹⁰Harvard University, Cambridge, Massachusetts.

Nature Biotechnology, volume 31, number 10, October 2013, page 894. © 2014 Nature America, Inc. All rights reserved.

Table 1 Software for the ages

Software | Purpose | Creators | Key capabilities | Year released | Citations (a)
BLAST | Sequence alignment | Stephen Altschul, Warren Gish, Gene Myers, Webb Miller, David Lipman | First program to provide statistics for sequence alignment, combination of sensitivity and speed | 1990 | 35,617
R | Statistical analyses | Robert Gentleman, Ross Ihaka | Interactive statistical analysis, extendable by packages | 1996 | N/A
ImageJ | Image analysis | Wayne Rasband | Flexibility and extensibility | 1997 | N/A
Cytoscape | Network visualization and analysis | Trey Ideker et al. | Extendable by plugins | 2003 | 2,374
Bioconductor | Analysis of genomic data | Robert Gentleman et al. | Built on R, provides tools to enhance reproducibility of research | 2004 | 3,517
Galaxy | Web-based analysis platform | Anton Nekrutenko, James Taylor | Provides easy access to high-performance computing | 2005 | 309 (b)
MAQ | Short-read mapping | Heng Li, Richard Durbin | Integrated read mapping and SNP calling, introduced mapping quality scores | 2008 | 1,027
Bowtie | Short-read mapping | Ben Langmead, Cole Trapnell, Mihai Pop, Steven Salzberg | Fast alignment allowing gaps and mismatches based on Burrows-Wheeler Transform | 2009 | 1,871
TopHat | RNA-seq read mapping | Cole Trapnell, Lior Pachter, Steven Salzberg | Discovery of novel splice sites | 2009 | 817
BWA | Short-read mapping | Heng Li, Richard Durbin | Fast alignment allowing gaps and mismatches based on Burrows-Wheeler Transform | 2009 | 1,556
Circos | Data visualization | Martin Krzywinski et al. | Compact representation of similarities and differences arising from comparison between genomes | 2009 | 431
SAMtools | Short-read data format and utilities | Heng Li, Richard Durbin | Storage of large nucleotide sequence alignments | 2009 | 1,551
Cufflinks | RNA-seq analysis | Cole Trapnell, Steven Salzberg, Barbara Wold, Lior Pachter | Transcript assembly and quantification | 2010 | 710
IGV | Short-read data visualization | James Robinson et al.
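
Altschul's remarks on log-odds scoring and E-values can be made concrete with the standard Karlin-Altschul formulas: a substitution score of the form (1/lambda) * ln(q_ij / (p_i * p_j)), where q_ij is the target frequency of an aligned residue pair and p_i, p_j are background frequencies, and an expected number of chance hits E = K * m * n * exp(-lambda * S). The Python sketch below is illustrative only. The function names and every numeric value in it (frequencies, K, lambda, sequence lengths) are assumptions chosen for the example, not BLAST's code or its fitted parameters.

```python
import math


def log_odds_score(q_ij, p_i, p_j, lam=1.0):
    """Log-odds score for aligning residues i and j.

    q_ij is the target frequency of the pair in trusted alignments;
    p_i and p_j are the background frequencies of each residue.
    The score is positive when the pair is seen more often than chance.
    """
    return math.log(q_ij / (p_i * p_j)) / lam


def expected_hits(score, m, n, K, lam):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lam * score).

    m and n are the query and database lengths; K and lam depend on
    the scoring system and are hypothetical inputs here.
    """
    return K * m * n * math.exp(-lam * score)


# Illustrative values only (assumptions, not real BLAST parameters).
pair_score = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1)
e_value = expected_hits(score=60.0, m=250, n=1_000_000, K=0.1, lam=0.3)
print(pair_score, e_value)
```

Because the score enters the E-value through an exponential, raising an alignment score by ln(2)/lambda halves the number of hits expected by chance, which is why modest score differences translate into large differences in significance.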
