TECHNOLOGY FEATURE

THE BIG CHALLENGES OF BIG DATA

As they grapple with increasingly large data sets, biologists and computer scientists uncork new bottlenecks.

Extremely powerful computers are needed to help biologists to handle big-data traffic jams. (Image: EMBL–EBI)

BY VIVIEN MARX

Biologists are joining the big-data club. With the advent of high-throughput genomics, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists¹.

With every passing year, they turn more often to big data to probe everything from the regulation of genes and the evolution of genomes to why coastal algae bloom, what microbes dwell where in human body cavities and how the genetic make-up of different cancers influences how cancer patients fare². The European Bioinformatics Institute (EBI) in Hinxton, UK, part of the European Molecular Biology Laboratory and one of the world's largest biology-data repositories, currently stores 20 petabytes (1 petabyte is 10¹⁵ bytes) of data and back-ups about genes, proteins and small molecules. Genomic data account for 2 petabytes of that, a number that more than doubles every year³ (see 'Data explosion').

This data pile is just one-tenth the size of the data store at CERN, Europe's particle-physics laboratory near Geneva, Switzerland. Every year, particle-collision events in CERN's Large Hadron Collider generate around 15 petabytes of data — the equivalent of about 4 million high-definition feature-length films. But the EBI and institutes like it face similar data-wrangling challenges to those at CERN, says Ewan Birney, associate director of the EBI. He and his colleagues now regularly meet with organizations such as CERN and the European Space Agency (ESA) in Paris to swap lessons about data storage, analysis and sharing.

All labs need to manipulate data to yield research answers. As prices drop for high-throughput instruments such as automated genome sequencers, small biology labs can become big-data generators. And even labs without such instruments can become big-data users by accessing terabytes (10¹² bytes) of data from public repositories at the EBI or the US National Center for Biotechnology Information in Bethesda, Maryland. Each day last year, the EBI received about 9 million online requests to query its data, a 60% increase over 2011.

Biology data mining has challenges all of its own, says Birney. Biological data are much more heterogeneous than those in physics. They stem from a wide range of experiments that spit out many types of information, such as genetic sequences, interactions of proteins or findings in medical records. The complexity is daunting, says Lawrence Hunter, a computational biologist at the University of Colorado Denver. "Getting the most from the data requires interpreting them in light of all the relevant prior knowledge," he says.

That means scientists have to store large data sets, and analyse, compare and share them — not simple tasks. Even a single sequenced human genome is around 140 gigabytes in size. Comparing human genomes takes more than a personal computer and online file-sharing applications such as DropBox.

In an ongoing study, Arend Sidow, a computational biologist at Stanford University in California, and his team are looking at specific genome sequences of tumours from people with breast cancer. They wanted to compare their data with the thousands of other published breast-cancer genomes and look for similar patterns in the scores of different cancer types. But that is a tall order: downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task. "If I could, I would routinely look at all sequenced cancer genomes," says Sidow. "With the current infrastructure, that's impossible."
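To see why, it helps to put rough numbers on the transfer alone. The short sketch below is a back-of-envelope estimate, not a figure from the study: it assumes a hypothetical collection of 5,000 cancer genomes at the roughly 140 gigabytes each quoted above, pulled over an assumed sustained 1-gigabit-per-second connection.

```python
# Back-of-envelope: how long does it take just to download a large genome set?
# Assumptions (illustrative only): 5,000 genomes and a sustained 1 Gbit/s link.
# The ~140 GB-per-genome figure comes from the text above.

GENOME_SIZE_GB = 140      # one sequenced human genome, roughly
NUM_GENOMES = 5_000       # hypothetical collection size
LINK_GBIT_PER_S = 1.0     # assumed sustained network throughput

total_gigabits = GENOME_SIZE_GB * NUM_GENOMES * 8   # gigabytes -> gigabits
seconds = total_gigabits / LINK_GBIT_PER_S
print(f"~{seconds / 86_400:.0f} days of transfer before any analysis starts")
```

Under those assumptions the download alone runs to roughly two months of continuous transfer, which is why "time-consuming" quickly shades into "impossible" at the scale Sidow describes.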
In 2009, Sidow co-founded a company called DNAnexus in Mountain View, California, to help with large-scale genetic analyses. Numerous other commercial and academic efforts also address the infrastructure needs of big-data biology. With the new types of data traffic jam honking for attention, "we now have non-trivial engineering problems", says Birney.

Andreas Sundquist says amounts of data are now larger than the tools used to analyse them. (Image: DNAnexus)

LIFE OF THE DATA-RICH

Storing and interpreting big data takes both real and virtual bricks and mortar. On the EBI campus, for example, construction is under way to house the technical command centre of ELIXIR, a project to help scientists across Europe safeguard and share their data, and to support existing resources such as databases and computing facilities in individual countries. Whereas CERN has one supercollider producing data in one location, biological research generating high volumes of data is distributed across many labs — highlighting the need to share resources.

Much of the construction in big-data biology is virtual, focused on cloud computing — in which data and software are situated in huge, off-site centres that users can access on demand, so that they do not need to buy their own hardware and maintain it on site. Labs that do have their own hardware can supplement it with the cloud and use both as needed. They can create virtual spaces for data, software and results that anyone can access, or they can lock the spaces up behind a firewall so that only a select group of collaborators can get to them.

BIG-DATA PARKING

Working with the CSC — IT Center for Science in Espoo, Finland, a government-run high-performance computing centre, the EBI is developing Embassy Cloud, a cloud-computing component for ELIXIR that offers secure data-analysis environments and is currently in its pilot phase. External organizations can, for example, run data-driven experiments in the EBI's computational environment, close to the data they need. They can also download data to compare with their own.

The idea is to broaden access to computing power, says Birney. A researcher in the Czech Republic, for example, might have an idea about how to reprocess cancer data to help the hunt for cancer drugs. If he or she lacks the computational equipment to develop it, he or she might not even try. But access to a high-powered cloud allows "ideas to come from any place", says Birney.

Even at the EBI, many scientists access databases and software tools on the Web and through clouds. "People rarely work on straight hardware anymore," says Birney. One heavily used resource is the Ensembl Genome Browser, run jointly by the EBI and the Wellcome Trust Sanger Institute in Hinxton. Life scientists use it to search through, download and analyse genomes from armadillo to zebrafish. The main Ensembl site is based on hardware in the United Kingdom, but when users in the United States and Japan had difficulty accessing the data quickly, the EBI resolved the bottleneck by hosting mirror sites at three of the many remote data centres that are part of Amazon Web Services' Elastic Compute Cloud (EC2). Amazon's data centres are geographically closer to the users than the EBI base, giving researchers quicker access to the information they need.

More clouds are coming. Together with CERN and ESA, the EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. Also involved are information-technology companies such as Atos in Bezons, France; CGI in Montreal, Canada; SixSq in Geneva; and T-Systems in Frankfurt, Germany.

Cloud computing is particularly attractive in an era of reduced research funding, says Hunter, because cloud users do not need to finance or maintain hardware. In addition to academic cloud projects, scientists can choose from many commercial providers, such as Rackspace, headquartered in San Antonio, Texas, or VMware in Palo Alto, California, as well as larger companies including Amazon, headquartered in Seattle, Washington; IBM in Armonk, New York; and Microsoft in Redmond, Washington.

DATA EXPLOSION: The amount of genetic sequencing data stored at the European Bioinformatics Institute takes less than a year to double in size. (Chart of terabases stored, 2004-2012, annotated 'Sequencers begin giving flurries of data'. Source: EMBL-EBI)
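That "less than a year to double" line compounds quickly. A minimal sketch of the arithmetic, assuming a clean one-year doubling time and taking the roughly 2 petabytes of genomic data mentioned earlier as the starting point (the five-year horizon is arbitrary):

```python
# Compound growth under a fixed doubling time. The ~2 PB starting point is
# from the text; assuming exactly one doubling per year is a conservative
# reading of "more than doubles every year".

start_pb = 2.0   # petabytes of genomic data today

for year in range(1, 6):
    print(f"after {year} year(s): ~{start_pb * 2 ** year:.0f} PB")

# After five doublings the genomic slice alone passes 64 PB, more than three
# times the EBI's entire 20 PB holding described at the start of the article.
```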
Clouds are a solution, but they also throw up fresh challenges. Ironically, their proliferation can cause a bottleneck if data end up parked on several clouds and thus still need to be moved to be shared. And using clouds means entrusting valuable data to a distant service provider who may be subject to power outages or other disruptions. "I use cloud services for many things, but always keep a local copy of scientifically important data and software," says Hunter. Scientists experiment with different constellations to suit their needs and trust levels.

HEAD IN THE CLOUDS: In cloud computing, large data sets are processed on remote Internet servers, rather than on researchers' local computers. (Diagram of local computers and storage connecting through a firewall to a cloud platform. Source: Aspera)

Most researchers tend to download remote data to local hardware for analysis. But this method is "backward", says Andreas Sundquist, chief technology officer of DNAnexus. "The data are so much larger than the tools, it makes no sense to be doing that." The alternative is to use the cloud for both data storage and computing. If the data are on a cloud, researchers can harness both the computing power and the tools that they need online, without the need to move data and software (see 'Head in the clouds').
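Sundquist's point can be framed as a simple comparison: move terabytes of data to where the tools are, or move the comparatively tiny tools and fetch only the results from where the data already sit. The sketch below is purely illustrative; every size and the bandwidth figure are assumptions, not numbers from the article.

```python
# Two workflows compared: (1) download the data and analyse it locally versus
# (2) run the analysis beside the data and download only the results.
# All figures below are illustrative assumptions.

def transfer_hours(size_gb: float, gbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a gbit_per_s link."""
    return size_gb * 8 / gbit_per_s / 3600

data_gb = 100_000    # a multi-genome data set (hypothetical)
tools_gb = 5         # analysis software plus reference files (hypothetical)
results_gb = 50      # summary output to bring home (hypothetical)
link = 1.0           # assumed sustained Gbit/s

print(f"move the data to the tools: ~{transfer_hours(data_gb, link):,.0f} hours")
print(f"move the tools to the data: ~{transfer_hours(tools_gb + results_gb, link):.1f} hours")
```

Under these assumptions, shifting the computation rather than the data cuts the transfer burden by roughly three orders of magnitude, which is the case for keeping storage and computing in the same cloud.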