Will Computers Crash Genomics?
Science, Vol. 331, 11 February 2011, pp. 666-668 | NewsFocus | Human Genome 10th Anniversary

New technologies are making sequencing DNA easier and cheaper than ever, but the ability to analyze and store all that data is lagging.

Lincoln Stein is worried. For decades, computers have improved at rates that have boggled the mind. But Stein, a bioinformaticist at the Ontario Institute for Cancer Research (OICR) in Toronto, Canada, works in a field that is moving even faster: genomics.

The cost of sequencing DNA has taken a nosedive in the decade since the human genome was published, and it is now dropping by 50% every 5 months. The amount of sequence available to researchers has consequently skyrocketed, setting off warnings about a "data tsunami." A single DNA sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project. Computers are central to archiving and analyzing this information, notes Stein, but their processing power isn't increasing fast enough, and their costs are decreasing too slowly, to keep up with the deluge. The torrent of DNA data and the need to analyze it "will swamp our storage systems and crush our computer clusters," Stein predicted last year in the journal Genome Biology.

Funding agencies have neglected bioinformatics needs, Stein and others argue. "Traditionally, the U.K. and the U.S. have not invested in analysis; instead, the focus has been investing in data generation," says computational biologist Chris Ponting of the University of Oxford in the United Kingdom. "That's got to change."

Within a few years, Ponting predicts, analysis, not sequencing, will be the main expense hurdle to many genome projects. And that's assuming there's someone who can do it; bioinformaticists are in short supply everywhere. "I worry there won't be enough people around to do the analysis," says Ponting.

Recent reviews, editorials, and scientists' blogs have echoed these concerns (see Perspective on p. 728). They stress the need for new software and infrastructures to deal with computational and storage issues.

In the meantime, bioinformaticists are trying new approaches to handle the data onslaught. Some are heading for the clouds: cloud computing, that is, a pay-as-you-go service, accessible from one's own desktop, that provides rented time on a large cluster of machines that work together in parallel as fast as, or faster than, a single powerful computer. "Surviving the data deluge means computing in parallel," says Michael Schatz, a bioinformaticist at Cold Spring Harbor Laboratory (CSHL) in New York.

Dizzy with data

The balance between sequence generation and the ability to handle the data began to shift after 2005. Until then, and even today, most DNA sequencing occurred in large centers, well equipped with the computer personnel and infrastructure to support the analysis of a genome's data. DNA sequences churned out by these centers were deposited and stored in centralized public databases, such as those run by the European Bioinformatics Institute (EBI) in Hinxton, U.K., and the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. Researchers elsewhere could then download the data for study. By 2007, NCBI had 150 billion bases of genetic information stored in its GenBank database.

Then several companies in quick succession introduced "next-generation" machines, faster sequencers that spit out data more cheaply. But the technologies behind these machines generate such short stretches of sequence, typically just 50 to 120 bases, that far more sequencing is required to assemble those fragments into a cohesive genome, which in turn greatly ups the computer memory and processing required. It once was enough to sequence a genome 10 times over to put together an accurate genome; now it takes 40 or more passes. In addition, the next-generation machines produce their sequence data at incredible rates that devour computer memory and storage. "We all had a moment of panic when we saw the projections for next-generation sequencing," recalls Schatz.

Those projections are already being realized. A massive study of genetic variation, the 1000 Genomes Project, generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21-year existence. And ambitious projects like ENCODE, which aims to characterize every DNA sequence in the human genome that has a function, offer jaw-dropping data challenges. Among other efforts, the project has investigated dozens of cell lines to identify every DNA sequence to which 40 transcription factors bind, yielding a complex matrix of data that needs to be not only stored but also represented in a way that makes sense to researchers. "We're moving very rapidly from not having enough data to going, 'Oh, where do we start?'" says EBI bioinformaticist Ewan Birney.

Moreover, as so-called third-generation machines, which promise even cheaper, faster production of DNA sequences (Science, 5 March 2010, p. 1190), become available, more, and smaller, labs will start genome projects of their own. As a result, the amount and kinds of DNA-related data available will grow even faster, and the sheer volume could overwhelm some databases and software programs, says Katherine Pollard, a biostatistician at the Gladstone Institutes of the University of California (UC), San Francisco. Take Genome Browser, a popular UC Santa Cruz Web site. The site's programs can compare 50 vertebrate genomes by aligning their sequences and looking for conserved or nonconserved regions, which reveal clues about the evolutionary history of the human genome. But the software, like most available genome analyzers, "won't scale to thousands of genomes," says Pollard.

The spread of sequencing technology to smaller labs could also increase the disconnect between data generation and analysis. "The new technology is thought of [as] being democratizing, but the analytical capacity is still focused in the hands of a few," warns Ponting. Although large centers may be stretching their computing, and their labor power, to new limits, they basically still have the means to interpret what they find. But small labs, many of which underestimate computational needs when budgeting time and resources for a sequencing project, could be in over their heads, he warns.

Clouds on the horizon

James Taylor, a bioinformaticist at Emory University in Atlanta, saw some of the demands for data analysis coming. In 2005, he and Anton Nekrutenko of Pennsylvania State University (Penn State), University Park, pulled together various computer genomics tools and databases under one easy-to-use framework. The goal was "to make collaborations between experimental and computational researchers easier and more efficient," Taylor explains. They created Galaxy, a software package that can be downloaded to a personal computer or accessed on Penn State's computers via any Internet-connected machine. Galaxy allows any investigator to do basic genome analyses without in-house computer clusters or bioinformaticists. The public portal for Galaxy works well, but, as a shared resource, it can get bogged down, says Taylor.

So last year, he and his colleagues tried a cloud-computing approach to Galaxy. Cloud computing can mean various things, including simply renting off-site computing memory to store data, running one's own software on another facility's computers, or exploiting software programs developed and hosted by others. Amazon Web Services and Microsoft are among the heavyweights running cloud-computing facilities, and there are not-for-profit ones as well, such as the Open Cloud Consortium.

For Taylor's team, entering the cloud meant developing a version of Galaxy that would tap into rented off-site computing power. They set up a "virtual computer" that could run the Galaxy software on remote hardware using data uploaded temporarily into the cloud's off-site computers. To test their strategy, they worked with Penn State colleague Kateryna Makova, who wanted to look at how the genomes of mitochondria vary from cell to cell in an individual. That involved sequencing the mitochondrial genomes from the blood and cheek swabs of three mother-child pairs, generating in one study some 1.8 gigabases of DNA sequence, about one-tenth of the amount of information generated for the first human genome.

Analyzing these data on the Penn State computers would have been a long and costly process. But when they uploaded their data to the cloud system, the processing took just an hour and cost $20, Taylor reported in May 2010 at the Biology of Genomes meeting in Cold Spring Harbor, New York. "This is a particularly cost-effective solution when you need a lot of computing power on an occasional basis," he says. With the help of the cloud, he has access to many computers but doesn't have the overhead costs of maintaining a powerful computer network in-house. "We're going to encourage more people to move to the cloud," he adds.

CSHL's Schatz and Ben Langmead, a computer scientist at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, are already there and are helping to make that shift possible for others. In 2009, the pair published one of the first results from marrying cloud computing and genomics.
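The coverage figures in the story, roughly 10-fold for an accurate genome from long reads versus 40-fold or more for 50-to-120-base next-generation reads, translate directly into the number of reads a computer must store and align. A back-of-the-envelope sketch of that arithmetic in Python, assuming a 3-gigabase human genome and an approximately 800-base traditional Sanger read (both are illustrative assumptions, not figures from the article):

```python
# Back-of-the-envelope: how read length and required fold-coverage
# drive the number of reads to track for a human-sized genome.
GENOME_BASES = 3_000_000_000  # assumed human genome size

def reads_needed(coverage, read_len):
    """Reads required to cover GENOME_BASES at the given fold-coverage."""
    return coverage * GENOME_BASES // read_len

# Long reads at the older 10x standard vs. short reads at 40x.
sanger = reads_needed(coverage=10, read_len=800)   # 37,500,000 reads
nextgen = reads_needed(coverage=40, read_len=100)  # 1,200,000,000 reads

print(f"Long-read, 10x:  {sanger:,} reads")
print(f"Short-read, 40x: {nextgen:,} reads")
print(f"Ratio: {nextgen // sanger}x more reads to store and assemble")
```

The 32-fold jump in read count, not the raw base count, is what "greatly ups the computer memory and processing required" for assembly.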
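Schatz's remark that "surviving the data deluge means computing in parallel" describes a scatter-and-merge pattern: split the reads across many processors, compute on each chunk independently, then combine the partial results. A toy illustration of that pattern (my own sketch, not the software Schatz and Langmead published) using Python's standard multiprocessing module to count k-mers, the short substrings that assembly and alignment tools index:

```python
from collections import Counter
from multiprocessing import Pool

def count_kmers(reads, k=4):
    """Map step: count every length-k substring in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def parallel_kmer_counts(reads, k=4, workers=4):
    """Scatter reads across worker processes, then merge the partial counts."""
    chunks = [reads[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.starmap(count_kmers, [(chunk, k) for chunk in chunks])
    total = Counter()
    for partial in partials:
        total.update(partial)  # reduce step: sum per-worker counts
    return total

if __name__ == "__main__":
    reads = ["ACGTACGT", "TTTTACGT", "ACGTTTTT"]
    print(parallel_kmer_counts(reads)["ACGT"])  # prints 4
```

Because each chunk is processed independently, the same structure scales from the cores of one machine to a rented cloud cluster; only the scatter and merge steps need a network.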