PROFILE John Cottrell, David Creasy & Darryl Pappin

Pioneers in proteomics data analysis reflect on what to focus on in computational biology.

In 1993, five papers, authored independently by William Henzel, Peter James, Matthias Mann, Darryl Pappin and John Yates, showed that masses measured by a mass spectrometer could be used as a ‘fingerprint’ to identify proteins. This insight spurred global analyses of the proteome and the need for new computational tools. At the 2012 international meeting of the Human Proteome Organization (HUPO), John Cottrell and David Creasy received the HUPO Award for Science and Technology for developing an algorithm created by Pappin into the widely used Mascot software package for peptide identification and protein inference. Nature Biotechnology spoke with Cottrell, Creasy and Pappin to understand Mascot’s route to successful commercialization and glean insights into current challenges in computational biology.

[Photo: David Creasy & John Cottrell]

How was Mascot commercialized?
David Creasy and John Cottrell: In the early 1990s, Darryl Pappin at the Imperial Cancer Research Fund (now Cancer Research UK) was one of the first people to recognize the possibility of protein identification by digesting a protein with an enzyme, such as trypsin, measuring the molecular masses of the peptides and matching these measured masses against masses calculated from the entries in a protein sequence database, the technique we now call peptide mass fingerprinting. These ideas were implemented in a program called MOWSE, for Molecular Weight Search, and published in 1993. We licensed MOWSE and founded the company Matrix Science [London] in 1998 and a year later launched the Mascot product.

Did you take outside funding?
D.C. & J.C.: Using our savings and not paying ourselves for the first year or two, it was possible to get going without any outside investment. This may not be an option if you want to start a hardware company or sell products or services to large numbers of consumers. For scientific software, it is very possible, and we consider starting without outside investors the smartest decision we ever made.

What lessons might your experiences hold for other fields of biology, such as genomics?
D.C. & J.C.: Both genomics and proteomics have been driven by advances in measurement technology, which generated new types of data at greater scales. For instance, when we first started developing Mascot, we thought that 300 spectra in a single search would be a reasonable limit. How wrong we were! By 2003, it was possible to analyze 1 million spectra in a single search, and now there is effectively no limit. When there have been big advances in experimental technology, as there have been in proteomics and genomics, new ideas in computational biology have been needed.

Are there big opportunities today for commercialization of academic ‘omics software?
D.C., J.C. & Darryl Pappin: In August 2001, Genome Technology magazine profiled ‘proteinformatics’ (not a term that ever caught on) in an article that featured search engines from seven companies. Only three are still around today. Matthias Mann was prescient in saying, “I don’t think there’s a big space for a lot of companies thriving on developing this software.” In general, there has been a low survival rate of scientific software companies. Is genomics different? I don’t know. But what hasn’t changed is that existing computational bottlenecks and high-activity areas of computational methods development often hint at the need for better experimental techniques. For instance, genome assembly is an important computational problem in genomics today that is being addressed by new computational methods and by new experimental methods for assaying long DNA molecules.

How did you keep your software company viable for 14 years?
D.C. & J.C.: We’ve deliberately stayed very small. We started with two people and now have 11. This seems about right for the size of our market. The number of labs who are potential customers for our software is probably in the tens of thousands. The Genome Technology article speculated that proteomics software could become a multi-billion dollar per year market; the reality is more like a multi-million dollar per year market. Of course, we shouldn’t underestimate the advantage of entering the field early enough to get well established.

What about cloud computing and offering software as a service?
D.C. & J.C.: People often suggest offering Mascot as a service (e.g., $1 per search). In recent years, this is usually associated with the supposed benefits of running Mascot ‘on the cloud’. In a presentation made in 2004, Darryl León of Lion Biosciences [Heidelberg, Germany] noted that this model of supplying software suffers from secure data transfer issues and offers limited control in the development and implementation of applications. All of the software service companies surveyed in his presentation have either gone bust or moved away from selling software. In contrast, what has succeeded are relatively small companies offering specialized applications for which the core technology existed at the inception of the company.

What advice would you give to those looking for important computational problems to solve?
D.C., J.C. & D.P.: Take a road less traveled. For instance, we know of at least 50 search engines for database matching of MS/MS data. There are a great many interesting and important problems in mass spectrometry-based proteomics that have not received anything like this amount of attention. For example, we really need some new ideas on protein inference. Data-independent acquisition and scoring high accuracy MS/MS data are two areas where it could be said that the software is lagging behind the hardware. Search software for other biopolymers, such as sugars and lipids, is lacking. As we mentioned, important computational problems usually arise from the need to process large quantities of interesting biological data. But perhaps a high-risk, high-reward approach would be to identify the so-called ‘important’ computational problems and eliminate them by developing new experimental technologies.

Interviewed by H. Craig Mak, Associate Editor, Nature Biotechnology

© 2012 Nature America, Inc. All rights reserved.
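For readers who want to see the peptide mass fingerprinting idea from the interview in concrete form (digest a protein, weigh the peptides, match the measured masses against masses calculated from database entries), the following is a minimal Python sketch. It is not MOWSE or Mascot. The scoring here is a plain count of matched masses rather than any published scoring scheme, the cleavage rule is the commonly used simplification for trypsin (cut after K or R unless followed by P), and the two database sequences, their names and the 0.2 Da tolerance are made-up assumptions for illustration only.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split a sequence after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_matches(observed_masses, sequence, tolerance=0.2):
    """Count observed masses that fall within tolerance of any theoretical mass."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(observed - mass) <= tolerance for mass in theoretical)
        for observed in observed_masses
    )


def rank_proteins(observed_masses, database, tolerance=0.2):
    """Return (name, matched-mass count) pairs, best match first."""
    scores = [
        (name, count_matches(observed_masses, sequence, tolerance))
        for name, sequence in database.items()
    ]
    return sorted(scores, key=lambda item: -item[1])


if __name__ == "__main__":
    # Hypothetical two-entry "database"; the sequences are invented.
    database = {"protA": "MKTAYIAKQR", "protB": "GSHMRLEDPK"}
    # Simulate a measurement by digesting protA in silico.
    observed = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
    for name, hits in rank_proteins(observed, database):
        print(name, hits)
```

A real search engine also has to handle missed cleavages, modifications, charge states and a statistically meaningful score, none of which this sketch attempts.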

Nature Biotechnology, volume 30, number 10, October 2012, p. 963