TECHNOLOGY FEATURE

THE BIG CHALLENGES OF BIG DATA

As they grapple with increasingly large data sets, biologists and computer scientists uncork new bottlenecks.

Extremely powerful computers are needed to help biologists to handle big-data traffic jams.

BY VIVIEN MARX

Biologists are joining the big-data club. With the advent of high-throughput genomics, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists^1.

With every passing year, they turn more often to big data to probe everything from the regulation of genes and the evolution of genomes to why coastal algae bloom, what microbes dwell where in human body cavities and how the genetic make-up of different cancers influences how cancer patients fare^2. The European Bioinformatics Institute (EBI) in Hinxton, UK, part of the European Molecular Biology Laboratory and one of the world’s largest biology-data repositories, currently stores 20 petabytes (1 petabyte is 10^15 bytes) of data and back-ups about genes, proteins and small molecules. Genomic data account for 2 petabytes of that, a number that more than doubles every year^3 (see ‘Data explosion’).

This data pile is just one-tenth the size of the data store at CERN, Europe’s particle-physics laboratory near Geneva, Switzerland. Every year, particle-collision events in CERN’s Large Hadron Collider generate around 15 petabytes of data, the equivalent of about 4 million high-definition feature-length films. But the EBI and institutes like it face similar data-wrangling challenges to those at CERN, says Ewan Birney, associate director of the EBI. He and his colleagues now regularly meet with organizations such as CERN and the European Space Agency (ESA) in Paris to swap lessons about data storage, analysis and sharing.

All labs need to manipulate data to yield research answers. As prices drop for high-throughput instruments such as automated

13 JUNE 2013 | VOL 498 | NATURE | 255 © 2013 Macmillan Publishers Limited. All rights reserved BIG DATA TECHNOLOGY

genome sequencers, small biology labs can become big-data generators. And even labs without such instruments can become big-data users by accessing terabytes (10^12 bytes) of data from public repositories at the EBI or the US National Center for Biotechnology Information in Bethesda, Maryland. Each day last year, the EBI received about 9 million online requests to query its data, a 60% increase over 2011.

Biology data mining has challenges all of its own, says Birney. Biological data are much more heterogeneous than those in physics. They stem from a wide range of experiments that spit out many types of information, such as genetic sequences, interactions of proteins or findings in medical records. The complexity is daunting, says Lawrence Hunter, a computational biologist at the University of Colorado Denver. “Getting the most from the data requires interpreting them in light of all the relevant prior knowledge,” he says.

That means scientists have to store large data sets, and analyse, compare and share them — not simple tasks. Even a single sequenced human genome is around 140 gigabytes in size. Comparing human genomes takes more than a personal computer and online file-sharing applications such as DropBox.

In an ongoing study, Arend Sidow, a computational biologist at Stanford University in California, and his team are looking at specific changes in the genome sequences of tumours from people with breast cancer. They wanted to compare their data with the thousands of other published breast-cancer genomes and look for similar patterns in the scores of different cancer types. But that is a tall order: downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task. “If I could, I would routinely look at all sequenced cancer genomes,” says Sidow. “With the current infrastructure, that’s impossible.”

In 2009, Sidow co-founded a company called DNAnexus in Mountain View, California, to help with large-scale genetic analyses. Numerous other commercial and academic efforts also address the infrastructure needs of big-data biology. With the new types of data traffic jam honking for attention, “we now have non-trivial engineering problems”, says Birney.

LIFE OF THE DATA-RICH
Storing and interpreting big data takes both real and virtual bricks and mortar. On the EBI campus, for example, construction is under way to house the technical command centre of ELIXIR, a project to help scientists across Europe safeguard and share their data, and to support existing resources such as databases and computing facilities in individual countries. Whereas CERN has one supercollider producing data in one location, biological research generating high volumes of data is distributed across many labs — highlighting the need to share resources.

Much of the construction in big-data biology is virtual, focused on cloud computing — in which data and software are situated in huge, off-site centres that users can access on demand, so that they do not need to buy their own hardware and maintain it on site. Labs that do have their own hardware can supplement it with the cloud and use both as needed. They can create virtual spaces for data, software and results that anyone can access, or they can lock the spaces up behind a firewall so that only a select group of collaborators can get to them.

Working with the CSC — IT Center for Science in Espoo, Finland, a government-run high-performance computing centre, the EBI is developing Embassy Cloud, a cloud-computing component for ELIXIR that offers secure data-analysis environments and is currently in its pilot phase. External organizations can, for example, run data-driven experiments in the EBI’s computational environment, close to the data they need. They can also download data to compare with their own.

The idea is to broaden access to computing power, says Birney. A researcher in the Czech Republic, for example, might have an idea about how to reprocess cancer data to help the hunt for cancer drugs. If he or she lacks the computational equipment to develop it, he or she might not even try. But access to a high-powered cloud allows “ideas to come from any place”, says Birney.

Even at the EBI, many scientists access databases and software tools on the Web and through clouds. “People rarely work on straight hardware anymore,” says Birney. One heavily used resource is the Ensembl Genome Browser, run jointly by the EBI and the Wellcome Trust Sanger Institute in Hinxton. Life scientists use it to search through, download and analyse genomes from armadillo to zebrafish. The main Ensembl site is based on hardware in the United Kingdom, but when users in the United States and Japan had difficulty accessing the data quickly, the EBI resolved the bottleneck by hosting mirror sites at three of the many remote data centres that are part of Amazon Web Services’ Elastic Compute Cloud (EC2). Amazon’s data centres are geographically closer to the users than the EBI base, giving researchers quicker access to the information they need.

More clouds are coming. Together with CERN and ESA, the EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. Also involved are information-technology companies such as Atos in Bezons, France; CGI in Montreal, Canada; SixSq in Geneva; and T-Systems in Frankfurt, Germany.

Cloud computing is particularly attractive in an era of reduced research funding, says Hunter, because cloud users do not need to finance or maintain hardware. In addition to academic cloud projects, scientists can choose from many commercial providers, such as Rackspace, headquartered in San Antonio, Texas, or VMware in Palo Alto, California, as well as larger companies including Amazon, headquartered in Seattle, Washington, IBM in Armonk, New York, or Microsoft in Redmond, Washington.

[Figure: Data explosion. The amount of genetic sequencing data stored at the European Bioinformatics Institute takes less than a year to double in size. Y axis: terabases stored; X axis: 2004–2012; annotation: “Sequencers begin giving flurries of data”. Source: EMBL–EBI]

BIG-DATA PARKING
Clouds are a solution, but they also throw up fresh challenges. Ironically, their proliferation can cause a bottleneck if data end up parked on several clouds and thus still need to be moved to be shared. And using clouds means entrusting valuable data to a distant service provider who may be subject to power outages or other disruptions. “I use cloud services for many things, but always keep a local copy of scientifically important data and software,” says Hunter. Scientists experiment with different constellations to suit their needs and trust levels.

[Figure: Head in the clouds. In cloud computing, large data sets are processed on remote Internet servers, rather than on researchers’ local computers. Diagram elements: computer, firewall, cloud platform, storage, large data files. Source: Aspera]

Most researchers tend to download remote data to local hardware for analysis. But this method is “backward”, says Andreas Sundquist, chief technology officer of DNAnexus. “The data are so much larger than the tools, it makes no sense to be doing that.” The alternative is to use the cloud for both data storage and computing. If the data are on a cloud, researchers can harness both the computing power and the tools that they need online, without the need to move data and software (see ‘Head in the clouds’). “There’s no reason to move data outside the cloud. You can do analysis right there,” says Sundquist. Everything required is available “to the clever people with the clever ideas”, regardless of their local computing resources, says Birney.

[Photo: Andreas Sundquist says amounts of data are now larger than the tools used to analyse them.]

Various academic and commercial ventures are engineering ways to bring data and analysis tools together — and as they build, they have to address the continued data growth. Xing Xu, director of cloud computing at BGI (formerly the Beijing Genomics Institute) in Shenzhen, China, knows that challenge well. BGI is one of the largest producers of genomic data in the world, with 157 genome sequencing instruments working around the clock on samples from people, plants, animals and microbes. Each day, it generates 6 terabytes of genomic data. Every instrument can decode one human genome per week, an effort that used to take months or years and many staff.

DATA HIGHWAY
Once a genome sequencer has cranked out snippets of genomic information, or ‘reads’, they must be assembled into a continuous stretch of DNA using computing and software. Xu and his team try to automate as much of this process as possible to enable scientists to get to analyses quickly.

Next, either the reads or the analysis, or both, have to travel to scientists. Generally, researchers share biological data with their peers through public repositories, such as the EBI or ones run by the US National Center for Biotechnology Information in Bethesda, Maryland. Given the size of the data, this travel often means physically delivering hard drives — and risks data getting lost, stolen or damaged. Instead, BGI wants to use either its own clouds or others of the customer’s choosing for electronic delivery. But that presents a problem, because big-data travel often means big traffic jams.

Currently, BGI can transfer about 1 terabyte per day to its customers. “If you transfer one genome at a time, it’s OK,” says Xu. “If you sequence 50, it’s not so practical for us to transfer that through the Internet. That takes about 20 days.”

BGI is exploring a variety of technologies to accelerate electronic data transfer, among them fasp, software developed by Aspera in Emeryville, California, which helps to deliver data for film-production studios and the oil and gas industry as well as the life sciences. In an experiment last year, BGI tested a fasp-enabled data transfer between China and the University of California, San Diego (UCSD). It took 30 seconds to move a 24-gigabyte file. “That’s really fast,” says Xu.

Data transfer with fasp is hundreds of times quicker than methods using the normal Internet protocol, says software engineer Michelle Munson, chief executive and co-founder of Aspera. However, all transfer protocols share challenges associated with transferring large, unstructured data sets.

The test transfer between BGI and UCSD was encouraging because Internet connections between China and the United States are “riddled with challenges” such as variations in signal strength that interrupt data transfer, says Munson. The protocol has to handle such road bumps and ensure speedy transfer, data integrity and privacy. Data transfer often slows when the passage is bumpy, but with fasp it does not. Transfers can fail when a file is partially sent; with ordinary Internet connections, this relaunches the entire transfer. By contrast, fasp restarts where the previous transfer stopped. Data that are already on their way do not get resent, but continue on their travels.

Xu says that he liked the experiment with fasp, but the software does not solve the data-transfer problem. “The main problem is not technical, it is economical,” he says. BGI would need to maintain a large Internet connection bandwidth for data transfer, which would be prohibitively expensive, especially given that Xu and his team do not send out big data in a continuous flow. “If we only transfer periodically, it doesn’t make any economic sense for us to have this infrastructure, especially if the user wants that for free,” he says.

Data-sharing among many collaborators also remains a challenge. When BGI uses fasp to share data with customers or collaborators, it must have a software licence, which allows customers to download or upload the data for free. But customers who want to share data with each other using this transfer protocol will need their own software licences. Putting the data on the cloud and not moving them would bypass this problem; teams would go to the large data sets, rather than the other way around. Xu and his team are exploring this approach, alongside the use of Globus Online, a free Web-based file-transfer service from the Computation Institute at the University of Chicago and the Argonne National Laboratory in Illinois. In April, the Computation Institute team launched a genome-sequencing-analysis service called Globus Genomics on the Amazon cloud.

Munson says that Aspera has set up a pay-as-you-go system on the Amazon cloud to address the issue of data-sharing. Later this year, the company will begin selling an updated version of its software that can be embedded on the desktop of any kind of computer and will let users browse large data sets much like a file-sharing application. Files can be dragged and dropped from one location to another, even if those locations are commercial or academic clouds.

The cost of producing, acquiring and disseminating data is decreasing, says James Taylor, a computational biologist at Emory University in Atlanta, Georgia, who thinks that “everyone should have access to the skills and tools” needed to make sense of all the information. Taylor is a co-founder of an academic platform called Galaxy, which lets scientists analyse their data and share software tools and workflows for free.
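The transfer figures quoted here can be put in perspective with back-of-envelope arithmetic, and the resume-from-offset behaviour attributed to fasp can be sketched in miniature. The sketch below is a hypothetical illustration using local files; it is not Aspera's protocol or code, and the "network" is simulated by an ordinary byte copy.

```python
import os

CHUNK = 1 << 20  # 1 MiB

# Back-of-envelope rates from the figures quoted in the text:
FASP_GB_PER_S = 24 / 30  # 24 GB in 30 s -> 0.8 GB/s in the BGI-UCSD test
BGI_TB_PER_DAY = 1       # BGI's quoted bulk-transfer rate to customers

def resumable_copy(src_path: str, dst_path: str) -> int:
    """Copy src to dst, resuming from however many bytes dst already holds.

    Illustrates the resume-from-offset idea described for fasp: data that
    have already arrived are not resent after a failure. A hypothetical
    local-file sketch, not Aspera's implementation.
    """
    done = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(done)  # skip bytes the receiver already has
        while chunk := src.read(CHUNK):
            dst.write(chunk)
            done += len(chunk)
    return done
```

Sustained for a day, the 0.8 GB/s burst rate would move roughly 69 terabytes, against BGI's quoted 1 terabyte per day, which is why Xu frames the remaining obstacle as economic (keeping such a link provisioned) rather than technical.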


Through Web-based access to computing facilities at Pennsylvania State University (PSU) in University Park, scientists can download Galaxy’s platform of tools to their local hardware, or use it on the Galaxy cloud. They can then plug in their own data, perform analyses and save the steps in them, or try out workflows set up by their colleagues.

Spearheaded by Taylor and Anton Nekrutenko, a molecular biologist at PSU, the Galaxy project draws on a community of around 100 software developers. One feature is Tool Shed, a virtual area with more than 2,700 software tools that users can upload, try out and rate. Xu says that he likes the collection and its ratings, because without them, scientists must always check if a software tool actually runs before they can use it.

KNOWLEDGE IS POWER
Galaxy is a good fit for scientists with some computing know-how, says Alla Lapidus, a computational biologist in the algorithmic biology lab at St Petersburg Academic University of the Russian Academy of Sciences, which is led by Pavel Pevzner, a computer scientist at UCSD. But, she says, the platform might not be the best choice for less tech-savvy researchers. When Lapidus wanted to disseminate the software tools that she developed, she chose to put them on DNAnexus’s newly launched second-generation commercial cloud-based analysis platform.

That platform is also designed to cater to non-specialist users, says Sundquist. It is possible for a computer scientist to build his or her own biological data-analysis suite with software tools on the Amazon cloud, but DNAnexus uses its own engineering to help researchers without the necessary computer skills to get to the analysis steps.

[Photo: Arend Sidow wants to move data mountains without feeling pinched by infrastructure.]

Catering for non-specialists is important when developing tools, as well as platforms. The Biomedical Information Science and Technology Initiative (BISTI) run by the US National Institutes of Health (NIH) in Bethesda, Maryland, supports development of new computational tools and the maintenance of existing ones. “We want a deployable tool,” says Vivien Bonazzi, programme director in computational biology and bioinformatics at the National Human Genome Research Institute, who is involved with BISTI. Scientists who are not heavy-duty informatics types need to be able to set up these tools and use them successfully, she says. And it must be possible to scale up tools and update them as data volume grows.

Bonazzi says that although many life scientists have significant computational skills, others do not understand computer lingo enough to know that in the tech world, Python is not a snake and Perl is not a gem (they are programming languages). But even if biologists can’t develop or adapt the software, says Bonazzi, they have a place in big-data science. Apart from anything else, they can offer valuable feedback to their computationally fluent colleagues because of different needs and approaches to the science, she says.

Increasingly, big genomic data sets are being used in biotechnology companies, drug firms and medical centres, which also have specific needs. Robert Mulroy, president of Merrimack Pharmaceuticals in Cambridge, Massachusetts, says that his teams handle mountains of data that hide drug candidates. “Our view is that biology functions through systems dynamics,” he says.

Merrimack researchers focus on interrogating molecular signalling networks in the healthy body and in tumours, hoping to find new ways to corner cancer cells. They generate and use large amounts of information from the genome and other factors that drive a cell to become cancerous, says Mulroy. The company stores its data and conducts analysis on its own computing infrastructure, rather than a cloud, to keep the data private and protected.

Drug developers have been hesitant about cloud computing. But, says Sundquist, that fear is subsiding in some quarters: some companies that have previously avoided clouds because of security problems are now exploring them. To assuage these users’ concerns, Sundquist has engineered the DNAnexus cloud to be compliant with US and European regulatory guidelines. Its security features include encryption for biomedical information, and logs to allow users to address potential queries from auditors such as regulatory agencies, all of which is important in drug development.

[Photo: Various data-transfer protocols handle problems in different ways, says Michelle Munson.]

CHALLENGES AND OPPORTUNITIES
Harnessing powerful computers and numerous tools for data analysis is crucial in drug discovery and other areas of big-data biology. But that is only part of the problem. Data and tools need to be more than close — they must talk to one another. Lapidus says that results produced by one tool are not always in a format that can be used by the next tool in a workflow. And if software tools are not easily installed, computer specialists will have to intervene on behalf of those biologists without computer skills.

Even computationally savvy researchers can get tangled up when wrestling with software and big data. “Many of us are getting so busy analysing huge data sets that we don’t have time to do much else,” says Steven Salzberg, a computational biologist at Johns Hopkins University in Baltimore, Maryland. “We have to spend some of our time figuring out ways to make the analysis faster, rather than just using the tools we have.”

Yet other big-data pressures come from the need to engineer tools for stability and longevity. Too many software tools crash too often. “Everyone in the field runs into similar problems,” says Hunter. In addition, research teams may not be able to acquire the resources they need, he says, especially in countries such as the United States, where an academic does not gain as much recognition for software engineering as for publishing a paper. With its dedicated focus on data and software infrastructure designed to serve scientists, the EBI offers an “interesting contrast to the US model”, says Hunter.

US funding agencies are not entirely ignoring software engineering, however. In addition to BISTI, the NIH is developing Big Data to Knowledge (BD2K), an initiative focused on managing large data sets in biomedicine, with elements such as data handling and standards, informatics training and software sharing. And as the cloud emerges as a popular place to do research, the agency is also reviewing data-use policies. An approved study usually lays out specific data uses, which may not include placing genomic data on a cloud, says Bonazzi. When a person consents to have his or her data used in one way, researchers cannot suddenly change that use, she says. In a big-data age that uses the cloud in addition to local hardware, new technologies in encryption and secure transmission will need to address such privacy concerns.
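The interoperability gap Lapidus describes, where one tool's output is not in a format the next tool in a workflow can read, is typically bridged with small adapter scripts. A minimal, hypothetical example: suppose an upstream tool emits tab-separated records and the downstream tool expects JSON (both formats and the field names here are invented for illustration, not taken from any real pipeline).

```python
import csv
import io
import json

def tsv_to_json_records(tsv_text: str) -> str:
    """Re-shape one tool's tab-separated output into the JSON records a
    downstream tool expects. This is the kind of glue work required when
    formats in a workflow do not line up; the formats are invented here.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return json.dumps(rows, indent=2)
```

In practice each pair of mismatched tools needs its own such shim, which is part of why platforms that standardize tool inputs and outputs, as Galaxy and DNAnexus aim to, reduce the need for computer specialists to intervene.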

Big data takes large numbers of people. BGI employs more than 600 engineers and software developers to manage its information-technology infrastructure, handle data and develop software tools and workflows. Scores of informaticians look for biologically relevant messages in the data, usually tailored to requests from researchers and commercial customers, says Xu. And apart from its stream of research collaborations, BGI offers a sequencing and analysis service to customers. Early last year, the institute expanded its offerings with a cloud-based genome-analysis platform called EasyGenomics. In late 2012, it also bought the faltering US company Complete Genomics (CG), which offered human genome sequencing and analysis for customers in academia or drug discovery. Although the sale dashed hopes for earnings among CG’s investors, it doesn’t seem to have dimmed their view of the prospects for sequencing and analysis services. “It is now just a matter of time before sequencing data are used with regularity in clinical practice,” says one investor, who did not wish to be identified. But the sale shows how difficult it can be to transition ideas into a competitive marketplace, the investor says.

[Photo: A simplified array of breast-cancer subtypes, produced by researchers at Merrimack Pharmaceuticals, who use their own computational infrastructure to hunt for new cancer drugs.]

When tackling data mountains, BGI uses not only its own data-analysis tools, but also some developed in the academic community. To ramp up analysis speed and capacity as data sets grow, BGI assembled a cloud-based series of analysis steps into a workflow called Gaea, which uses the Hadoop open-source software framework. Hadoop was written by volunteer developers from companies and universities, and can be deployed on various types of computing infrastructure. BGI programmers built on this framework to instruct software tools to perform large-scale data analysis across many computers at the same time.

If 50 genomes are to be analysed and the results compared, hundreds of computational steps are involved. The steps can run either sequentially or in parallel; with Gaea, they run in parallel across hundreds of cloud-based computers, reducing analysis time rather like many people working on a single large puzzle at once. The data are on the BGI cloud, as are the tools. “If you perform analysis in a non-parallel way, you will maybe need two weeks to fully process those data,” says Xu. Gaea takes around 15 hours for the same amount of data.

To leverage Hadoop’s muscle, Xu and his team needed to rewrite software tools. But the investment is worth it because the Hadoop framework allows analysis to continue as the data mountains grow, he says.

They are still ironing out some issues with Gaea, comparing its performance on the cloud with its performance on local infrastructure. Once testing is complete, BGI plans to mount Gaea on a cloud such as Amazon for use by the wider scientific community.

Other groups are also trying to speed up analysis to cater to scientists who want to use big data. For example, Bina Technologies in Redwood City, California, a spin-out from Stanford University and the University of California, Berkeley, has developed high-performance computing components for its genome-analysis services. Customers can buy the hardware, called the Bina Box, with software, or use Bina’s analysis platform on the cloud.
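Gaea's speed-up comes from fanning independent per-genome steps out across many machines via Hadoop. The sequential-versus-parallel contrast Xu describes can be shown in a few lines with an ordinary worker pool; this is a generic sketch, not BGI's code, and `analysis_step` is an invented stand-in for a real analysis tool.

```python
from concurrent.futures import ThreadPoolExecutor

def analysis_step(genome_id: int) -> str:
    """Stand-in for one independent per-genome analysis step (hypothetical)."""
    return f"genome-{genome_id}: done"

def run_sequential(n: int) -> list:
    # One step after another: total time is the sum of all step times.
    return [analysis_step(i) for i in range(n)]

def run_parallel(n: int, workers: int = 8) -> list:
    # Independent steps fan out across workers, so wall-clock time shrinks
    # roughly with the worker count. Hadoop applies the same structure
    # across whole machines rather than threads in one process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analysis_step, range(n)))
```

Within a single Python interpreter, threads mainly help I/O-bound steps; CPU-bound genome analysis would use processes or, as with Gaea, separate machines. The point is the structure: because the steps are independent, both functions return identical results, but the parallel version finishes in a fraction of the wall-clock time.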
FROM VIVO TO SILICO
Data mountains and analysis are altering the way science progresses, and breeding biologists who get neither their feet nor their hands wet. “I am one of a small original group who made the first leap from the wet world to the in silico world to do biology,” says Marcie McClure, a computational biologist at Montana State University in Bozeman. “I never looked back.”

During her graduate training, McClure analysed a class of viruses known as retroviruses in fish, doing the work of a “wet-worlder”, as she calls it. Since then, she and her team have discovered 11 fish retroviruses without touching water in lake or lab, by analysing genomes computationally and in ways that others had not. She has also developed software tools to find such viruses in the genomes of other species, including humans. Her work generates terabytes of data, which she shares with other researchers.

Given that big-data analysis in biology is incredibly difficult, Hunter says, open science is becoming increasingly important. As he explains, researchers need to make their data available to the scientific community in a useful form, for others to mine. New science can emerge from the analysis of existing data sets: McClure generates some of her findings from other people’s data. But not everyone recognizes that kind of biology as an equal. “The cultural baggage of biology that privileges data generation over all other forms of science is holding us back,” says Hunter.

A number of McClure’s graduate students are microbial ecologists, and she teaches them how to rethink their findings in the face of so many new data. “Before taking my class, none of these students would have imagined that they could produce new, meaningful knowledge, and new hypotheses, from existing data, not their own,” she says. Big data in biology add to the possibilities for scientists, she says, because data sit “under-analysed in databases all over the world”. ■

Vivien Marx is technology editor at Nature and Nature Methods.

1. Mattmann, C. Nature 493, 473–475 (2013).
2. Greene, C. S. & Troyanskaya, O. G. PLoS Comput. Biol. 8, e1002816 (2012).
3. EMBL–European Bioinformatics Institute. EMBL-EBI Annual Scientific Report 2012 (EMBL–EBI, 2013).
