
Work / Technology & tools ILLUSTRATION BY THE PROJECT TWINS THE PROJECT BY ILLUSTRATION ASTRONOMY TAKES TO THE CLOUD Telescope users reveal six lessons of migrating big data from custom servers to the cloud. By Charles Q. Choi stronomers typically work by asking 2028. And the next-generation Very Large Array the LSST project from scratch could approach observatories for time on a telescope (ngVLA) will generate hundreds of petabytes US$150 million over 10 years. Instead, they and downloading the resulting annually, roughly 1,000 times more than the — like much of the rest of the astronomy data. But as the amount of data that VLA does today, says Brian Glendenning, assis- community — looked to the cloud. Here are telescopes produce grows, well, tant director for data management and software six lessons from astronomers’ experiences. Aastronomically, old methods can’t keep pace. at the National Radio Astronomy Observatory, The Vera C. Rubin Observatory in Chile is who is based in Albuquerque, New Mexico. Invest in computing power geared up to collect 20 terabytes per night as Such data sets are out of reach for It’s not enough to migrate data to the cloud; part of its 10-year Legacy Survey of Space and conventional workflows: it isn’t feasible to researchers need to be able to interact with Time (LSST), once it becomes operational in download that much data and store them them. “Instead of the traditional model where 2022. That’s as much as the Sloan Digital Sky locally, says Mario Juric, an astronomer at the astronomers brought their data to their Survey — which created the most detailed 3D University of Washington in Seattle. Building computer, we want them to upload their code maps of the Universe so far — collected in total and maintaining local computing resources to the data” and perform analyses remotely, between 2000 and 2010. The Square Kilometre to cope is similarly impractical. William says Frossie Economou, who manages the Array, which is set to become the world’s largest O’Mullane, project manager for LSST data Rubin Observatory’s science platform. radio telescope, with sites in Australia and South management in Tucson, Arizona, estimates The LSST, for instance, will provide free Africa, will generate 100 times that amount — that the cost of developing the computing online access to its science platform — a col- up to 2 petabytes daily — when it goes online in infrastructure and personnel needed to run lection of Jupyter computational notebooks, Nature | Vol 584 | 6 August 2020 | 159 ©2020 Spri nger Nature Li mited. All ri ghts reserved. Work / Technology & tools web portals and application programming he estimates, and the servers would have had whole new ecosystem of apps to develop that interfaces (APIs) for data analysis, browsing to be active 87% of the time over three years. we didn’t even know were possible”. and retrieval, says Leanne Guy, LSST data man- That level of usage is “highly unlikely”, he says. agement scientist at the Rubin Observatory in Time savings can influence decision-making. Train and train again Tucson. Using a web browser, the LSST’s users “If it takes nine months to process your analyses To set up a cloud project, users need to create will be able to write and run code in the Python at your data centre but one month on the cloud an account with a cloud provider, choose from programming language to analyse the entire for the same cost, that difference of eight a bewildering array of options and install their LSST data set remotely on servers hosted at the months becomes very interesting,” Juric notes. software, usually tweaking it so that it can run National Center for Supercomputing Applica- The choice isn’t necessarily binary. on many machines at once. Mistakes can be tions in Urbana, Illinois, rather than download- Projects can use local data centres for routine costly, warns Bourne. “Inexperienced grad ing the data to their own computer. storage and computing but supplement students have inadvertently burnt thousands Other disciplines have also found success those resources through ‘cloud bursting’ of CPU hours with runaway processes,” he with this approach. The Pangeo project, for when demand spikes and an extra boost of says, referring to computer tasks that never instance, which is a platform for analysing big complete owing to coding errors. geosciences data, has partnered with Google “Cloud computing allows To avoid that, Bruce Berriman, a senior Cloud to make petabytes of climate data pub- scientist at the Infrared Processing and licly available and computable — which makes even small institutions to Analysis Center at the California Institute of it easier for researchers to collaborate, scale make big discoveries.” Technology in Pasadena, advises users to train and reproduce their work, says climate scien- beforehand, for instance by running small- tist Joe Hamman at the National Center for scale pilot projects using local machines or Atmospheric Research in Boulder, Colorado. computing power is needed, says O’Mullane. academic clouds. In the cloud, he says, “the In the meantime, funding agencies might meter is always running”. No big data? No problem be able to help researchers negotiate better Don’t neglect security considerations, Juric “There are definitely times when projects rates, says Philip Bourne, dean of data science adds. Although privacy and security in the cloud involving just mid-size data can see a lot of at the University of Virginia in Charlottesville. surpass those of local resources, configuring benefits from cloud computing,” says Ivelina The US National Institutes of Health (NIH) does cloud resources can be tricky, and a mistake by Momcheva, a mission scientist at the Space Tele- this with its Science and Technology Research an inexperienced programmer can expose your scope Science Institute in Baltimore, Maryland. Infrastructure for Discovery, Experimentation data to the world. “Private data centres tend Researchers can access computational and Sustainability (STRIDES) Initiative — which to be more ‘institutionally locked down’,” Juric resources that greatly exceed those of their uses cloud resources to streamline NIH data. notes, whereas a commercial provider might laptops for relatively little cost, Momcheva “With STRIDES, if an institution commits a allow such a mistake to proceed unimpeded. notes. And some cloud providers offer free given amount of their dollars on grants, then computing resources for educational purposes. the Googles and Microsofts and Amazons of Focus on outreach In 2015, when Momcheva and her colleagues the world can compete with each other so By providing computing resources for little or had only an 8-core server available for investigators can get the best deals with ven- no cost, cloud computing allows even small their 3D-HST project, which analysed data dors,” Bourne says. Since its inception in 2018, institutions to make big discoveries. “I could from the Hubble Space Telescope to better STRIDES has helped researchers with more than set up a notebook in South Africa to run on the understand the forces that shape galaxies 225 projects exceeding 20 million total comput- LSST Science Platform that had all the same in the distant Universe. So they turned to ing hours, and saved roughly $6 million, says tools as if I was in Princeton,” O’Mullane says. Amazon Web Services (AWS) instead. They Susan Gregurick, who leads the NIH’s strategic “All I’d need is a web browser.” ended up renting five 32-core machines. “Our plan for data science in Bethesda, Maryland. But to do so effectively requires education, back-of-the-envelope calculations suggested says Dara Norman, a research astronomer everything we had to do would have taken us Consolidate data at the National Optical-Infrared Astronomy three months on our machines,” Momcheva By bringing multiple data sets together, cloud Research Laboratory in Tucson. Good starting says. “With a cloud provider, it took us five days computing can reveal insights that might not be points include Cloud Computing for Science and less than $1,000.” apparent from each data set alone. “Astronomi- and Engineering (go.nature.com/338hdpt), cal data becomes exponentially more useful the which Berriman calls “the best practical guide Price isn’t everything more there is of it in one place,” Momcheva says. to getting started on the cloud”. There is also Whether commercial cloud services are Inspired by the NIH’s Data Commons, a pilot the experimental MAST Labs project run by the cheaper than a researcher’s local data centre project in which researchers store and share Mikulski Archive for Space Telescopes (MAST), remains debatable. The US Department biomedical and behavioural data and software, which has sample notebooks for accessing of Energy’s 2011 Magellan report on cloud Juric and others have requested funding to build MAST data on AWS (go.nature.com/314gxyo); computing found that the department’s com- an Astronomy Data Commons to co-locate and an AWS tutorial using Hubble data puting centres were typically 3–7 times less astronomical data sets and tools in the cloud. (go.nature.com/33bad0a). expensive than commercial cloud providers. The hope is to “eliminate the infrastructure and Work with researchers at smaller institutions The cost difference remains roughly the same software barrier to entry for big-data analysis”, to make sure proposed ideas will actually work today. Yet with code optimizations, research- Juric says. He and his colleagues have already for them, Norman says. And encourage network- ers can reduce those differences. According to released one data set called the Zwicky Transient ing with their students. “If a lot of really good estimates from the University of Washington, Facility, which encompasses 100 billion students that you want in your institution are for instance, cloud-based processes that cost observations of some 2 billion celestial objects; coming from smaller, underserved places, it’s to $43 per experiment cost only $6 after a few if they can show that their work yields benefits, your benefit to involve them in your research to months of optimization, Juric says.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages3 Page
-
File Size-