feature

The anatomy of successful software

Stephen Altschul1, Barry Demchak2, Richard Durbin3, Robert Gentleman4, Martin Krzywinski5, Heng Li6, Anton Nekrutenko7, James Robinson6, Wayne Rasband8, James Taylor9 & Cole Trapnell10 Creators of software widely used in computational biology discuss the factors that contributed to their success

he year was 1989 and Stephen Altschul What factors determine whether scientific tant thing is that Thad a problem. Sam Karlin, the brilliant software is successful? Cufflinks, Bowtie mathematician whose help he needed, was Stephen Altschul: BLAST was the first program (which is mainly Ben so convinced of the power of a mathemati- to assign rigorous to useful scores of Langmead’s work) cally tractable but biologically constrained local sequence alignments. Before then people and TopHat were in measure of protein sequence similarity that had derived many large part at the right he would not listen to Altschul (or anyone different scoring sys- place at the right time. else for that matter). So Altschul essentially tems, and it wasn’t We were stepping into tricked him into solving the problem sty- clear why any should fields that were poised mying the field of computational biology have a particular developed the Tophat/ to explode, but which by posing it in terms of pure mathematics, advantage. I had made Cufflinks suite of short- really had a vacuum in devoid of any reference to biology. The treat a conjecture that read analysis tools. terms of usable tools. from that trick became known as the Karlin- every scoring system You get two things Altschul statistics that are a key part of that people proposed from being first. One is a startup user base. The BLAST, arguably the most successful piece of Stephen Altschul co- using was implicitly a second is the opportunity to learn directly from computational biology software of all time. developed BLAST. log-odds scoring sys- people what the right way, or one useful way, to Nature Biotechnology spoke with Altschul tem with particular do the analysis would be. Nature America, Inc. All rights reserved. America, Inc. © 201 4 Nature and several other originators of computa- ‘target frequencies’, tional biology software programs widely and that the best scoring system would be one Heng Li (developed MAQ, BWA, SAMtools used today (Table 1). The conversations where the target frequencies were those you and other genomics tools): I agree timing is npg explored what makes certain software tools observed in accurate alignments of real proteins. important. When MAQ came out, there was no successful, the unique challenges of develop- It was the mathematician Sam Karlin who other software that could do integrated mapping ing them for biological research and how the proved this conjecture and derived the for- and SNP [single-nucleotide polymorphism] field of computational biology, as a whole, mula for calculating the statistics of the scores calling. BWA was among the first batch of can move research agendas forward. What [E-values] output by BLAST. This was the Burrows-Wheeler–based aligners (BWA, Bowtie follows is an edited compilation of inter- gravy to the algorithmic innovations of David and SOAP2 were all developed at about the same views. Lipman, Gene Myers, and Warren time). Similarly, SAMtools was the first generic Gish that yielded BLAST’s unprecedented SNP caller that worked with any aligner, as long combination of sensitivity and speed. as the aligner output SAM format. 1National Center for Biotechnology Another great aspect of the popularity of Information, Bethesda, Maryland. 2University BLAST was that over time it was seamlessly Robert Gentleman: of California San Diego, La Jolla, California. linked to NCBI’s sequence and literature data- The real big success of 3Wellcome Trust Sanger Institute, Hinxton, bases, which were updated daily. When we R, I think, was around UK. 4Genentech, South San Francisco, developed BLAST, the databases available were the package system. California. 5British Columbia Cancer Research in relatively poor shape. In many instances, you Anybody that wanted Centre, Vancouver, Canada. 6Broad Institute, had to wait for over a year between the publica- to could write a pack- Cambridge, Massachusetts. 7Penn State tion of a paper and when its sequences appeared age to carry out a par- University, University Park, Pennsylvania. in a database. A lot of very talented and dedi- ticular analysis. At the 8 National Institute of Mental Health, Bethesda, cated people worked to construct the infra- Robert Gentleman is same time, this sys- 9 Maryland. Emory University, Atlanta, structure at NCBI that allowed you to search co-creator of the R tem allowed the stan- Georgia. 10Harvard University, Cambridge, up-to-date databases online. language for statistical dard R language to be Massachusetts. Cole Trapnell: Pro­bably the most impor- analyses. developed, designed

894 volume 31 NUMBER 10 OCTOBER 2013 nature biotechnology feature

and driven forward by a core group of people. Windows. And fourth, ImageJ is open source, to build the maps available to external scientists For Bioconductor, which provides tools in so users can inspect, modify and fix the source so they could look at the evidence and the data. R for analyzing genomic data, interoperability code. It’s important, I think, to generate data at the was essential to its success. We defined a hand- same time as writing software for those data, ful of data structures that we expected people to Richard Durbin: I think a key thing is that being driven by problems that are at hand. use. For instance, if everybody puts their gene software or a data format does a clean job cor- expression data into the same kind of box, it rectly, that it works. For the software I’ve been Does scientific software development doesn’t matter how the data came about, but involved with, I think differ from that for other types of software? that box is the same and can be used by ana- support isn’t a criti- Durbin: Scientific software often requires quite lytic tools. Really, I think it’s data structures that cal thing, in a strange a strong insight—that is, algorithmic develop- drive interoperability. way. Rather, it’s the ment. The algorithm implements novel ideas, is lack of need for sup- based on deep scientific understanding of data Wayne Rasband: Several factors have contrib- port that’s important. and the problem, and takes a step beyond what uted to the usability of ImageJ. First, it has a Also, I have always has been done previously. In contrast, a lot of relatively simple graphical user interface, simi- wanted to use the commercial software is doing specific cases of lar to popular desk- software I developed, fairly straightforward things—book-keeping top software, such as Richard Durbin led or my group has and moving things around and so on. Photoshop. Second, the development of wanted to use it to there is a large com- many tools and data do our own job. John Gentleman: I have found that real hardcore munity of users and standards in genomics. Sulston taught me to software engineers tend to worry about prob- developers willing try to write something lems that are just not existent in our space. They to answer questions, that does the best possible job for yourself and keep wanting to write clean, shiny software, contribute plugins enables others to see what you would want to get when you know that the software that you’re and macros, and find out of the data. Don’t think that you’ll produce using today is not the software you’re going and fix bugs. Third, one version for yourself and then somehow to be using this time next year. At Genentech Wayne Rasband because it is written have a different tool for others. For example, (S. San Francisco, California), we develop test- developed the ImageJ in Java, ImageJ runs the people who built the C. elegans physical ing and deployment paradigms that are on image analysis software. on Linux, Macs and maps in the ’80s made the same software used somewhat shorter cycles.

Table 1 Software for the ages Software Purpose Creators Key capabilities Year released Citationsa BLAST Sequence alignment Stephen Altschul, Warren Gish, First program to provide statistics for 1990 35,617 Gene Myers, Webb Miller, sequence alignment, combination of David Lipman sensitivity and speed R Statistical analyses Robert Gentleman, Ross Ihaka Interactive statistical analysis, 1996 N/A Nature America, Inc. All rights reserved. America, Inc. © 201 4 Nature extendable by packages ImageJ Image analysis Wayne Rasband Flexibility and extensibility 1997 N/A Cytoscape Network visualization and analysis Trey Ideker et al. Extendable by plugins 2003 2,374 npg Bioconductor Analysis of genomic data Robert Gentleman et al. Built on R, provides tools to enhance 2004 3,517 reproducibility of research Galaxy Web-based analysis platform Anton Nekrutenko, James Taylor Provides easy access to high- 2005 309b performance computing MAQ Short-read mapping Heng Li, Richard Durbin Integrated read mapping and SNP call- 2008 1,027 ing, introduced mapping quality scores Bowtie Short-read mapping Ben Langmead, Cole Trapnell, Fast alignment allowing gaps and 2009 1,871 Mihai Pop, mismatches based on Burrows- Wheeler Transform Tophat RNA-seq read mapping Cole Trapnell, , Discovery of novel splice sites 2009 817 Steven Salzberg BWA Short-read mapping Heng Li, Richard Durbin Fast alignment allowing gaps and 2009 1,556 mismatches based on Burrows- Wheeler Transform Circos Data visualization Martin Krzywinski et al. Compact representation of similarities 2009 431 and differences arising from compari- son between genomes SAMtools Short-read data format and utilities Heng Li, Richard Durbin Storage of large nucleotide sequence 2009 1,551 alignments Cufflinks RNA-seq analysis Cole Trapnell, Steven Salzberg, Transcript assembly and quantification 2010 710 Barbara Wold, Lior Pachter IGV Short-read data visualization James Robinson et al. Scalability, real-time data exploration 2011 335 aCitations in Web of Science as of September 2013 of the first paper describing the software. bA comprehensive list of publications (1,150 as of September 2013) related to Galaxy is at http://www.citeulike.org/group/16008/order/group_rating. N/A, paper not available in Web of Science.

nature biotechnology volume 31 NUMBER 10 OCTOBER 2013 895 feature

James Taylor: A lot of program, such as IGV or the UCSC browser, can isn’t necessarily even meant for adoption by a traditional software help with this. However, it is also important to community. It’s more a vehicle for producing engineering is about understand what the developers of the visual- data to argue that a computational method how to build software ization have chosen to emphasize, through the is sound or that it has the properties that are effectively with large use of color and other techniques, and what they being claimed. teams, whereas the have chosen to de-emphasize. way most scientific Gentleman: I’ll point to the ‘bump hunt- software is developed Martin Krzywinski: In a way, a fixed visualiza- ing’ tools for finding peaks in [chromatin is (and should be) tion is only one answer. It’s one projection, one immunoprecipitation] ChIP‑seq data. There James Taylor developed different. Scientific encoding, one view of the data. Depending on must be a hundred of those. Why are there so the Galaxy platform. software is often the complexity of the data and the number of many? They either all work equally well, and developed by one or dimensions, there are many views. We have it doesn’t matter which one you use. Or, each a handful of people. to accept that what we’re seeing is akin to a one of them does something that’s a little bit shadow on the wall. different, and we simply have not figured out Barry Demchak (lead Cytoscape software An object can cast how to decide which one is best. I argue that architect): The status quo of software devel- many different shad- it’s more the latter than the former. What’s opment in the 1990s is where computational ows, depending on its missing is, ‘How do we generate large data biologists are today—for loops, variables and shape. We can’t look sets with enormous numbers of false positives function calls. has moved at the shadow and say and false negatives?’ You need a sufficiently on, particularly in three areas: functional pro- that that’s the object. big and complicated data set, where you know gramming, service-oriented architectures and We have to remember the truth, to understand whether one method domain-specific languages. that that’s the shadow is better than another, whether I’m getting of the object and that exactly the same answer but I’m just getting it What misconceptions does the research Martin Krzywinski the object has some faster or whether I’m getting different answers community have about software developed the Circos higher dimensional that are both flawed. Those sorts of things are data visualization tool. development or use? properties. part of what can help you drive from a diver- Trapnell: I think the biggest misconception sity of computational tools down to a relative is that something you put on the Internet for Li: People not doing the computational work few that work better. people to use is a finished product. Each version tend to think that you can write a program very of Cufflinks and Tophat, for instance, offered fast. That, I think, is frankly not true. It takes Taylor: I don’t think there are good incen- performance improvements and bug fixes. But a lot of time to implement a prototype. Then tives for contributing to and improving exist- they often also had substantial new features that it actually takes a lot of time to really make it ing software instead of inventing something actually reflected a new understanding about better. new. The latter is more likely to be publish- what’s going on in the computer science and the able. There is also a problem with discovering mathematics and the statistics that are attached Demchak: Quite often, users don’t appreciate software that exists; often people reinvent the to RNA‑seq. Really fundamental stuff. The way the opportunities. Noncomputational biologists wheel just because they don’t know any better. Nature America, Inc. All rights reserved. America, Inc. © 201 4 Nature that translates into the software that people use don’t know when to complain about the status Good repositories for software and best prac- is that they download one version and they run quo. With modest amounts of computational tice workflows, especially if citable, would be the analysis and then they upgrade and then the consulting, long or impossible jobs can become a start. npg results change, sometimes a little bit, sometimes much shorter or richer. a lot. That creates the impression, which I think Anton Nekrutenko (co-creator of Galaxy): This is the wrong impression, that one or both of Are too many new software tools developed is the key idea behind the Galaxy Tool Shed, our those sets of results is just totally wrong. People that ultimately don’t get used? app store. So far, it contains about 2,700 tools. don’t, I think, frequently understand just how Durbin: This is kind of a debate of top down The goal of this mechanism is to make it easy much those programs are research projects that versus bottom up. In science, always there are to try each tool and then vote on which ones are continually evolving. lots of people looking at the same thing in dif- perform well. ferent ways. There are people trying out all James Robinson: Perhaps that the software is sorts of crazy things. It’s extremely successful How is the field of computational biology more sophisticated than it actually is, leading to not have top-down control. It can look a evolving? to too much faith in the results without critical little bit redundant when you have a person Durbin: Now there are a lot of strong, young, thinking. For analy- write yet another read mapper, but some- faculty members who label themselves as com- sis software, such as times things will be influential. New ideas putational analysts, yet very often want wet-lab mutation calling, it’s will come. Sometimes things can be relevant space. They’re not content just working off data important to know at to individual projects. I think for sure things sets that come from other people. They want to some level what the are done inefficiently. I accept that. It’s a bit be involved in data generation and experimental algorithms are, the like . Random mutation and testing design and mainstreaming computation as a valid biases in them, what is very powerful. research tool. Just as the boundaries of biochem- they assume and how istry and cell biology have kind of blurred, I think James Robinson they fail. Visualizing Trapnell: Maybe the way to look at it is the the same will be true of computational biology. It’s developed the Integrative algorithmic output software that gets produced is, in a sense, the going to be alongside biochemistry, or molecular Genomics Viewer. with a critical eye in a piece of supporting data for those papers, and biology or microscopy as a core component.

896 volume 31 NUMBER 10 OCTOBER 2013 nature biotechnology feature

Box 1 Does every new biology PhD student need to learn how to program?

Durbin: That’s a little like asking is it good for a molecular biologist But a similar question would be: does every graduate student in to know chemistry. I would say that computation is now as important biology need to learn grammar? Clearly, yes. Do they all need to to biology as chemistry is. Both are useful background knowledge. learn to speak? Clearly, yes. We just don’t leave it to the literature Data manipulation and use of information are part of the technology experts. That’s because we need to communicate. Do students of biology research now. Knowing how to program also gives people need to tie their shoes? Yes. It has now come to the point where some idea about what’s going on inside data analysis. It helps using a computer is as essential as brushing your teeth. If you them appreciate what they can and can’t expect from data analysis want some kind of a competitive edge, you’re going to want to software. make as much use of that computer as you can. The complexity of the task at hand will mean that canned solutions don’t exist. Trapnell: It’s probably not just that experimental biologists need to It means that if you’re using a canned solution, you’re not at the program, but it’s also enormously helpful when computational folks edge of research. learn how to do experiments. For me, for example, coming from a computer science background, the opposite way of thinking was hard Robinson: Yes. Even if they don’t program in their research, they to learn. How do I learn to argue with wet-lab data? How do I learn will have to use software and likely will communicate with software what to trust, what to distrust, how to cross‑validate things? That’s developers. It helps tremendously to have some basic knowledge. a radically different way of thinking when you’re used to proofs and Additionally, in the research environment, the ability to do basic writing code and validating it on a computer. tasks in the Linux/Unix environment is essential. Krzywinski: To some, the answer might be “no” because that’s left to the experts, to the people downstairs who sit in front of a computer. Rasband: All scientists should learn how to program.

Nekrutenko: Many people can learn how to that everybody needs in order to deal with sorts of complicated information in them, and program in C, but they still write horrific code that kind of data (Box 1). The computational a few big files with complicated information in that nobody can understand. Most of the biol- folks need to learn more about statistics. The them. How to deal with that is a different prob- ogy graduate students who can program, they’re biology folks need to understand basic compu- lem. We start with data in one format. We run it more dangerous than people who cannot pro- tation in order to even be able to communicate through a very complex set of transformations gram because they produce these things. It’s with the biostatistics crowd. in a very complicated computing environment. horrible, but that’s what you expect from a new Tracking it, knowing which output of which ver- field. It will change, and it needs to change just What are the emerging trends in sion of which tool we’re actually going to use, through graduate education. For example, at computational biology and software tools? and being sure that something that we started Penn State, we have a scientific programming Robinson: The emergence of sequencing, and actually finished—those are the really com- course designed for life science people with even very high–resolution SNP and expression plex cases for us. Those sorts of organizational sequence analysis in mind. It builds on ‘software platforms, means that virtually all computa- details are challenging to get right, but I think carpentry’ [http://software-carpentry.org/] by tional biologists working with human sam- most academics don’t have the problem on the teaching people that you need to version your ples now need to understand and deal with same scale. Nature America, Inc. All rights reserved. America, Inc. © 201 4 Nature software. You need to write tests. All these skills, implications of individual identification and that’s the missing part. privacy. Krzywinski: In terms of data visualization, the idea that we can show all the data that we are npg Trapnell: If you break down past work in Trapnell: The areas of computer science that will collecting is long gone. We now need to look computational biology, there’ve been a couple be needed to solve these privacy issues are ones at the differences in the data sets, and help the of historically really rich areas. One of them that biologists have never even been exposed to. user focus on the things that are important. stems from sequence alignment. From there, It has to do with secret sharing and public‑key Differences, and differences of differences, are you get paleogenetics and certain molecular cryptography and all these other areas that we now the data. In addition, you cannot dump that evolution studies. Then with microarrays, just never worry about because they don’t come output from a program on a user, otherwise they you had the advent of genomics as a measure- up. Now they’re coming up in a huge way. So I will become lost in this sea of detail. I think what ment science, where you’re actually trying to would expect that that is going to be a major software needs is a series of output filters that measure something about what’s happening in driver of a lot of serious computational work. can be used to select for the level of detail in some samples. With DNA sequencing, we are the output. seeing the convergence of those two things. Taylor: Crowdsourcing is potentially a deep You get all of the possibilities in terms of state- trend. We’ve seen a lot with people having Durbin: I once heard Nathan Myhrvold and ments you might make with alignment and the success with verification of results. If we can Sydney Brenner talk about “exponential tech- stuff that follows from that, but you also get develop infrastructures to allow greater partici- nologies,” which were all characterized by pro- all the crazy statistical issues and numerical pation and to take advantage of large communi- viding an exponential increase in information. analysis problems that arise when you’re mak- ties, the decision-making capabilities of groups Sequencing is like that. In the future, I think we ing quantitative measurements of biological is going to be a continuing trend. will get into cell biology, away from the genome. activity. I think the end‑stage result is that, I think it’s going to be done through high- now, sequencing is used not as a cataloging Gentleman: At Genentech, we have petabytes throughput data acquisition from instrumen- technology, but really as a routine, day‑to‑day of data, but it’s not ‘big data’ like at Amazon or tation of cell biological measurements of some measurement technology. That just reorients, Walmart or in the airline industry. Our problem sort. I’m not exactly sure how, but I’m interested, I think, the baseline computational skill set is that we have lots of little, tiny files that have all and excited by that. Corrected after print 9 May 2014.

nature biotechnology volume 31 NUMBER 10 OCTOBER 2013 897 errata

Erratum: The anatomy of successful computational biology software Stephen Altschul, Barry Demchak, Richard Durbin, Robert Gentleman, Martin Krzywinski, Heng Li, Anton Nekrutenko, James Robinson, Wayne Rasband, James Taylor & Cole Trapnell Nat. Biotechnol. 31, 894–897 (2013); published online 8 October 2013; corrected after print 9 May 2014

In the version of this article initially published, in Table 1, Steven Salzberg should have been listed as the second, and not the last, of the creators of the Cufflinks software. The error has been corrected in the HTML and PDF versions of the article. Nature America, Inc. All rights reserved. America, Inc. © 201 4 Nature npg

nature biotechnology