on February 10, 2011

you-go service, accessible from one’s own desktop, that provides rented time on a large Will Computers Crash ? cluster of machines that work together in par- allel as fast as, or faster than, a single pow- New technologies are making sequencing DNA easier and cheaper than ever, but the erful computer. “Surviving the data deluge

ability to analyze and store all that data is lagging means computing in parallel,” says Michael www.sciencemag.org Schatz, a bioinformaticist at Cold Spring is worried. For decades, com- Funding agencies have neglected bio- Harbor Laboratory (CSHL) in New York. puters have improved at rates that have bog- informatics needs, Stein and others argue. gled the mind. But Stein, a bioinformaticist “Traditionally, the U.K. and the U.S. have Dizzy with data at the Ontario Institute for Cancer Research not invested in analysis; instead, the focus The balance between sequence generation (OICR) in Toronto, Canada, works in a fi eld has been investing in data generation,” and the ability to handle the data began to that is moving even faster: genomics. says computational biologist Chris shift after 2005. Until then, and even today, Downloaded from The cost of sequencing DNA Ponting of the University of Oxford most DNA sequencing occurred in large has taken a nosedive in the decade in the United Kingdom. “That’s centers, well equipped with the computer since the human genome was pub- got to change.” personnel and infrastructure to support the lished—and it is now dropping by Within a few years, Ponting analysis of a genome’s data. DNA sequences 50% every 5 months. The amount predicts, analysis, not sequencing, churned out by these centers were deposited of sequence available to research- will be the main expense hurdle to and stored in centralized public databases, ers has consequently skyrocketed, many genome projects. And that’s such as those run by the European Bioin- setting off warnings about a “data tsunami.” assuming there’s someone who can do it; formatics Institute (EBI) in Hinxton, U.K., A single DNA sequencer can now gener- bioinformaticists are in short supply every- and the National Center for ate in a day what it took 10 years to collect where. “I worry there won’t be enough peo- Information (NCBI) in Bethesda, Mary- for the Human Genome Project. Computers ple around to do the analysis,” says Ponting. land. Researchers elsewhere could then are central to archiving and analyzing this Recent reviews, editorials, and scientists’ download the data for study. By 2007, NCBI information, notes Stein, but their process- blogs have echoed these concerns (see Per- had 150 billion bases of genetic information ing power isn’t increasing fast enough, and spective on p. 728). They stress the need for stored in its GenBank database. their costs are decreasing too slowly, to keep new software and infrastructures to deal with Then several companies in quick suc- up with the deluge. The torrent of DNA data computational and storage issues. cession introduced “next-generation” and the need to analyze it “will swamp our In the meantime, bioinformaticists machines, faster sequencers that spit out storage systems and crush our computer are trying new approaches to handle the data more cheaply. But the technologies clusters,” Stein predicted last year in the data onslaught. Some are heading for the behind these machines generate such short

journal Genome Biology. clouds—cloud computing, that is, a pay-as- stretches of sequence—typically just 50 /ALVAREJO.COM ARTEAGA ALVARO CREDIT:

666 11 FEBRUARY 2011 VOL 331 SCIENCE www.sciencemag.org Published by AAAS CREDIT: ALVARO ARTEAGA /ALVAREJO.COM few,” warns Ponting. Although large centers cal capacity is still focused inthe hands ofa the analyti-[as] being democratizing, but ysis. “The new technologyis thoughtof betweenconnect anal- and generation data to smaller labs could also increase the dis- genomes,”thousands of says Pollard. able analyzers, genome “won’t to scale genome. But the software, like most avail- about the evolutionary ofthe human history regions,nonconserved which reveal clues orsequences and lookingfor conserved genomes by50 vertebrate aligning their Web site. site’sThe programs compare can Genome Browser, a popular UC Santa Cruz (UC), San Francisco.California Take Gladstone Institutes of the University of Katherine Pollard, a biostatistician at the databases andsoftware programs, says the sheer volume could overwhelm some data available will grow even faster, and the amount and kinds ofDNA-related genome projects of their own. As a result, available, more, and smaller, labs will start ( DNA of production faster sequences machines—which even promise cheaper, Ewanbioinformaticist Birney. going, ‘Oh, where dowe start?’ ” says EBI rapidlyvery from nothaving enough data to makes “We’re researchers. to sense moving also represented in a waystored but that plex matrix of data that needs tobe notonly bind,transcription factors yielding a com- identify every DNA sequence towhich 40 investigated has ect to lines cell of dozens challenges. Amongother efforts, the proj- that has a function, offer jaw-dropping data every DNA sequence inthe human genome like ENCODE, which aimsto characterize 21-year existence. And ambitiousprojects entire its in accumulated GenBank had than 6months DNA sequence data inits first Genomes 1000 the Project, generatedmore ized. A massive studyof genetic variation, generation sequencing,” recalls Schatz. panic when we saw the projections for next- and storage. “Weory all had a moment of incredible rates that devour computer mem- at data sequence their produce machines passes. Inaddition, the next-generation now genome; accurate takes it more or 40 a genome 10timesover toputtogether an required. Itonce was enough to sequence and processing ups the computer memory a cohesive genome, which greatly inturn required toassemble those fragments into more sequencing isto 120bases—that far Science, 5March 2010,p.1190)—become The spread of sequencing technology Moreover, as so-called third generation being real- already are projections Those www.sciencemag.org ers. To test their strategy,their ers. Totest cloud’s off-site comput- temporarily intothe ware using data uploaded software onremote hard- the Galaxy thatcould run computer” up a “virtual power.puting set They into rented off-site com- Galaxy that would tap developing a version of entering the cloud meant Open CloudConsortium. well,as ones the as such there are not-for-profit andcomputing facilities, weights cloud- running soft are among the heavy- Web Micro- and Services others. Amazon by hosted programs developedand ers, or exploiting software another facility’scomput- one’s own software on tostore data, running ory off-site computing mem- including simply renting can mean various things, Galaxy.approach to computing year, he and his colleagues tried a cloud- can boggedit get down, says Taylor. Solast axy works well, as a shared resource, but, bioinformaticists. publicThe forGal- portal or clusters computer in-house without ses any investigator todobasic genome analy- allowsmachine. Galaxy Internet-connected accessed on Penn State’s computers via any be downloadedto a personal computer or ated Galaxy, a software package that can more effi cient,” Taylor explains. They cre- and computational researchers easier and make collaborations between experimental easy-to-use framework. The goal was “to genomics tools and databases under one sity Park, pulled together various computer nia State University(Penn State), Univer- he and of Pennsylva-Anton Nekrutenko data analysisfor demands 2005, In coming. University in Atlanta, saw some of the James Taylor, at Emory a bioinformaticist horizon Clouds onthe over in be could warns. he heads, their project, asequencing for resources and time when needs computational mate budgeting But small labs,many ofwhich underesti- have what they the means tointerpret fi nd. laborpower, to new limits, they basically still may be stretching their computing, and their VOL 331 SCIENCE VOL 11 FEBRUARY 2011 For Taylor’steam, Cloud computing HUMAN GENOME 10TH ANNIVERSARY P u b l i s h e d b y

A A A S ers. In 2009,the pair published one ofthe helping tomake that shift possible for oth- timore,Maryland, are already thereandare Bloomberg School of Public Health inBal- computer scientist at the Johns Hopkins move tothe cloud,” he adds. “We’re going toencourage more people to ing a powerful computer network in-house. doesn’t havethe overheadcosts of maintain- cloud, he has access tomany computers but sional basis,” he says. With theof help the need a lotof computingpower onan occa- particularly cost-effectivewhen solution you Cold Spring Harbor, New York. “This is a 2010 at the Biology of Genomes meeting in an hour and cost $20, Taylorin May reported to the cloud system, the processing took just process. But when they uploaded theirdata computers would have been a long and costly for the first human genome. generated information of the amount 1/10 of some 1.8 gigabases of DNA sequence, about mother-child pairs, generating inone study from the blood and cheek swabs of three sequencing the mitochondrial genomes cell tocell inan individual. That involved thegenomes of mitochondria vary from Makova,eryna who wanted tolook at how they worked with PennState colleague Kat- CSHL’s Schatz and Ben Langmead, a Analyzing these data onthe Penn State | FOCUS NEWS 667

Downloaded from www.sciencemag.org on February 10, 2011 NEWSFOCUS

fi rst results from marrying cloud comput- the connections among the cloud’s proces- byte for high-end storage on a local system,” ing and genomics. They wanted to identify sors can be fairly slow, so computations Schatz says. For NCBI, however, it’s still common sites of DNA variation known as requiring processors to talk to each other more cost-effective to keep GenBank and its single-nucleotide polymorphisms (SNPs), can get bogged down, says Langmead. Some other databases in-house, says Don Preuss but to do so they needed to hunt through researchers worry that the burgeoning cloud- of NCBI. short sequences of human DNA totaling an computing industry won’t agree on standards Putting data in a cloud may help in other amount equivalent to 38 copies of the human that will allow for connections between ways as well. Right now, anyone wanting to genome. With the help of a cloud-based clus- clouds, such that data stored on one cloud analyze a genome has to download it from ter of 320 computers, they identifi ed 3.7 mil- can be accessible to another. “Cloud comput- a public archive such as GenBank—and lion SNPs in less than 4 hours and for less ing is hot and sexy,” says Bonazzi. “But it’s as these data sets get larger, such transfers than $100. “We estimate it would have taken not the answer to everything.” become slower. Moreover, downloaded cop- a single computer several hundred hours for ies of these data sets, some now out of date, the analysis,” says Schatz. Storage issues have proliferated around the world, each one At the Biology of Genomes meet- Cloud computing offers a possible solution taking up storage space that eats into bioin- ing, Langmead and Schatz unveiled two to other problems facing the bioinformat- formatics budgets. In his vision, says Stein, new cloud-computing initiatives. Lang- ics community: data storage and transfer. “you have one copy of the data located in mead described a computer program called Because storage costs are dropping much this common cloud that everyone uses” and Myrna that determines the differential more slowly than the costs of generating it won’t be necessary to download or upload expression of genes from RNA sequence sequence data, “there will come a point when the data between computers for processing.

data and is designed 300 100,000 Encouraged by the genomics commu- for the parallel pro- Cost and Growth of Bases nity, NCBI has put a copy of the data from cessing performed 250 $10,000 10,000 the pilot project of the 1000 Genomes effort

by cloud-comput- 200 into off-site storage run by a cloud-comput- ing facilities. Schatz Cost per million $1,000 1,000 ing provider. And U.S. East Coast users of base pairs of sequence 150 GenBank

introduced another Ensemble, the EBI sequence database, are on February 10, 2011 (log scale) $100 100 Cost ($) program, Contrail, 100 automatically funneled into a cloud environ- that can assemble $10 10 ment as part of a test of the strategy. Billions of bases 50 genomes from data SOURCE: NCBI One worry about this approach is the secu- that next-generation $1 1 rity of the data. Data involving the health of 0 sequencing machines 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 human subjects, which is being linked more generate and deposit and more to genome information, requires into a cloud. So much for so little. The decline in sequencing costs (red line) has led to extra precautions that make some research- Low cost and a surge in stored DNA data. ers hesitant about clouds. However, at least

speed aren’t the only one cloud-computing company already has www.sciencemag.org advantages of the cloud approach, says we will have to spend an exponential amount clients whose human data are covered by the Langmead. “The cloud user never has to on data storage,” says Birney. strict health information protection laws of replace hard drives, renew service contracts, That has created pressure to let go of the the United States, so there are indications worry about electricity usage and cooling, fi eld’s long-standing tendency to archive all that this concern can be allayed. deal with fl ooding or other natural disasters, raw sequence data. Because the raw mate- All these issues came to the fore last et cetera,” he points out. For small labs that rial from next-generation machines is in the year, when NHGRI hosted several meet- lack their own powerful computer clusters, form of high-resolution images, it soaks up ings on cloud computing and on informatics Downloaded from “cloud computing may represent the democ- huge amounts of computer storage. So sci- and analysis, says Bonazzi. Also, at a retreat ratization of computation,” says Schatz. entists are considering discarding the origi- last summer, the case was made for more But cloud computing is “not mature,” nal image fi les once they produce the pre- training and education. “One cautions Vivien Bonazzi, program direc- liminarily processed sequence data, which is thing that is clear is that as computation tor for and bioinfor- more easily kept. Eventually, it may be more becomes more and more necessary through- matics at NHGRI. Putting data into a cloud economical to save no raw data and just rese- out biomedical research, the way these cluster by way of the Internet can take many quence a DNA sample if necessary. But for [infrastructure] resources are funded will hours, even days, so cloud providers and now, as to what should be kept, “there’s a lot have to change to be more effi cient,” says their customers often resort to the “sneaker of thrashing still to happen,” says Bonazzi. Taylor. For now, NHGRI has no programs in net”: overnight shipment of data-laden hard Putting the data in an off-site facility place to address these needs. “But they are on drives. And with the exception of Galaxy, could relieve some of the pressure, says our radar,” says Bonazzi. Myrna, and a few other computer tools, not OICR’s Stein. The economies of scale avail- Like Stein, she worries about swamped much genomics software is confi gured for able to large cloud-providing companies can storage systems and overwhelmed computer the massively parallel processing approach produce signifi cant cost savings, meaning clusters. But Bonazzi remains sanguine. “Do taken by cloud computers. “It is currently it might be cheaper to rent transient storage I think these problems will be solved?” she too diffi cult to develop cloud software that’s space from the cloud in some cases. Stor- says. “I’m optimistic.” And even Stein is try- truly easy to use,” says Langmead. age costs at the Amazon Web Server top off ing to think positively. “I’m very good at pre- Also, cloud computing works best if an at 14 cents a gigabyte per month, according dicting disasters that never happen,” he says. analysis can be divided into many separate to Amazon’s Deepak Singh. “In comparison, There’s always sunlight above the clouds. tasks handled by multiple processors. But it commonly costs 50 cents to $1 per giga- –ELIZABETH PENNISI

668 11 FEBRUARY 2011 VOL 331 SCIENCE www.sciencemag.org Published by AAAS