Challenges in Funding and Developing Genomic Software: Roots and Remedies Adam Siepel

Siepel Genome Biology (2019) 20:147
https://doi.org/10.1186/s13059-019-1763-7
OPINION, Open Access

Abstract

The computer software used for genomic analysis has become a crucial component of the infrastructure for life sciences. However, genomic software is still typically developed in an ad hoc manner, with inadequate funding, and by academic researchers not trained in software development, at substantial costs to the research community. I examine the roots of the incongruity between the importance of and the degree of investment in genomic software, and I suggest several potential remedies for current problems. As genomics continues to grow, new strategies for funding and developing the software that powers the field will become increasingly essential.

Correspondence: [email protected]
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

A traveler in late-eighteenth-century England who passed through the town of Slough, located just west of London and not far from present-day Heathrow Airport, might have come upon a massive 40-ft-long telescope, suspended in a wooden frame more than 50 ft tall (Fig. 1a). The telescope was located at the home of William Herschel and his sister Caroline, two of the greatest astronomers of their day. It was the largest telescope in the world until it was dismantled in 1839. Weighing over 1000 lbs., the “40-ft telescope”, as it was known, was sufficiently impressive to the general public to emerge as a regional tourist attraction. Its audacious scale inspired prominent thinkers and writers of the time, including Erasmus Darwin and William Blake [1, 5].

The 40-ft telescope took 5 years to build and was paid for by a grant of £4000 from King George III, who was strongly committed to scientific research throughout his reign. This grant represented a substantial sum at the time, roughly equivalent to £600,000 (about US$800,000) in 2019 [6]. There were no formal mechanisms at the time for grant applications for scientific research. Instead, William Herschel simply approached the King directly with a request for royal patronage. The 40-ft telescope is one of the earliest examples of government investment in the infrastructure for scientific research, to enable a project that simply would not have been possible with private funds alone.

The model of government investment in scientific infrastructure became increasingly well-established throughout the 19th and 20th centuries, culminating in the “Big Science” of the World War II and Post-War eras. Science in modern times has been dominated, in many ways, by these massive public investments. Prominent examples include the Manhattan Project (equivalent to $22 billion in 2016 [7, 8]), the Apollo program (equivalent to $107 billion [9]), the Space Shuttle program (equivalent to $219 billion [10]), the Large Hadron Collider (equivalent to $4.8 billion [11]) and, more pertinent to this article, the Human Genome Project (equivalent to $5.0 billion [12]; Fig. 1b).

Indeed, we now live in a world where much of the day-to-day work in science depends on a publicly funded infrastructure. In particular, many working in genomics rely heavily on data sets such as those generated by the ENCODE, Roadmap Epigenomics, 1000 Genomes, Genotype-Tissue Expression (GTEx), The Cancer Genome Atlas (TCGA), Genetic European Variation in Disease (GEUVADIS), and most recently, Human Cell Atlas projects. We store and search sequence data using GenBank, EMBL-Bank, DDBJ, UniProt, and Pfam, examine three-dimensional protein structures in PDB or EMDB, scour the literature using PubMed, and view genomic annotations using the UCSC Genome Browser and Ensembl Browser. All these resources have been maintained for decades either directly by government agencies or through long-term public funding to universities and research institutes.

Fig. 1 (a) William and Caroline Herschel’s 40-foot telescope in Slough, England, 1789 [1]. (b) Two of the teams of scientists that contributed to the Human Genome Project: (top) Sanger Center, Hinxton, UK [2]; (bottom) Washington University Genome Sequencing Center, St. Louis, USA, both circa 2000. (c) Some relics of the pre-Internet world: (clockwise from top left) the author learning to program in BASIC on a Commodore 64 in a cold upstate New York basement, 1983 (credit: Virginia Siepel), as one of the millions of children who were introduced to home computers during the 1980s and 1990s, some of whom would go on to write much of the software that powers genomics today; floppy disk for PAUP version 3.1.1, ©1993; Sun Microsystems SPARCstation 1 with Mosaic web browser faintly visible on screen, 1994 [3]; screen shot from the MASE alignment program [4]. (d) Prof. David Haussler of UC Santa Cruz with the original Dell computer cluster that his team used to assemble the human genome, 2000. Photo (c) UC Santa Cruz, used with permission

The computer software on which millions of scientists rely for genomic analysis is no less an essential part of the infrastructure of biological research than large shared data sets or public databases, yet the model for funding and developing computer software differs substantially. Most widely used genomic software is developed by independent investigators working in academic or not-for-profit institutions with support from government grants. This software is generally freely available to the community, typically with no subscription or licensing fees and nonrestrictive terms of use. At the same time, it is often meagerly funded, unreliable, hard to use, poorly documented, and/or poorly supported. How did we, as a community, arrive at this odd situation? Why is scientific software supported differently from other forms of scientific infrastructure? Why are adequate funds not set aside for this important work?

In this article, I offer my perspective on the unique problem of funding and developing software for genomics, based on my 25 years in the field as a developer and user of software, a professional programmer and principal investigator, an applicant for and reviewer of grant proposals, and an employee of government, university, and private research institutions. I first examine what makes genomic software development unusual and how the field has come to be the way it is. Overall, I argue that, despite some important strengths of our current model for software development, we as a community have “painted ourselves into a corner” in terms of developing robust, well-engineered software and are paying for it; we are, in a sense, addicted to free software. Finally, I suggest some possible remedies that attempt to strike a balance between addressing important deficiencies in the current model and maintaining its core strengths. My discussion of these topics necessarily has a US bias, but I believe that many of my points are internationally valid. Also, although this article focuses on genomics, similar trends occur in other areas of computational biology, such as structural biology and proteomics, as well as in some other areas of scientific software development.

Software for genomics is critical to the research infrastructure for the life sciences

During the past 25 years, genomic software development has grown from an obscure cottage industry to an essential part of the infrastructure of biological research. Researchers across the globe rely on computational tools for read mapping, genome assembly, multiple alignment, phylogenetics, population genomics, and visualization of genomic data, among many other applications. Importantly, these tools are no longer used only by genomic specialists, but across all the life sciences, including disparate fields such as ecology and evolution, molecular and cell biology, clinical genetics, plant breeding, biophysics, and bioengineering.

[...] suggesting approximately $16 billion is spent on basic life sciences research [153]). If even 10% of these funds are devoted to projects that rely in part on genomics and genomic software, which seems plausible, then this software would be instrumental in supporting more than a billion dollars per year in research. Furthermore, total R&D expenditures in the US are estimated at about four times those of the federal government, and scaling up to worldwide R&D expenditures requires about another factor of three. (Total R&D expenditures in the US, including those in the private sector and at other governmental levels, are estimated at about $500 billion annually. The US leads the world in spending on science, but China is not far behind, and several other countries, including Japan, Germany, South Korea, France, and the UK, also account for substantial amounts. Together, the top ten countries spend about $1.5 trillion per year on R&D [154]). Therefore, a rough calculation suggests that the worldwide research that depends, at least in part, on genomic software is likely to cost tens of billions of dollars annually.

Software for genomics lacks a sustainable model for development and maintenance

Despite the overwhelming importance of genomic software, there is broad agreement among practitioners that the current model for its development has serious flaws. As noted above, most genomic software is developed by academic groups and funded by government grants, yet there are relatively few dedicated granting opportunities for genomic software development, and those that exist
