Software as a Service for Data Scientists
DOI:10.1145/2076450.2076468

Globus Online manages fire-and-forget file transfers for big-data, high-performance scientific collaborations.

By Bryce Allen, John Bresnahan, Lisa Childers, Ian Foster, Gopi Kandaswamy, Raj Kettimuthu, Jack Kordas, Mike Link, Stuart Martin, Karl Pickett, and Steven Tuecke

Key Insights
- The costs of research data life-cycle management are growing dramatically as data becomes larger and more complex.
- SaaS approaches are a promising solution, outsourcing time-consuming research data management tasks to third-party services.
- Globus Online demonstrates the potential of SaaS for research data management, simplifying data movement for researchers and research facilities alike.

As Big Data emerges as a force in science,2,3 so, too, do new, onerous tasks for researchers. Data from specialized instrumentation, numerical simulations, and downstream manipulations must be collected, indexed, archived, shared, replicated, and analyzed. These tasks are not new, but the complexities involved in performing them for terabyte or larger datasets (increasingly common across scientific disciplines) are quite different from those that applied when data volumes were measured in kilobytes. The result is a computational crisis in many laboratories and a growing need for far more powerful data-management tools, yet the typical researcher lacks the resources and expertise to operate these tools.

The answer may be to deliver research data-management capabilities to users as hosted "software as a service," or SaaS,18 a software-delivery model in which software is hosted centrally and accessed by users through a thin client (such as a Web browser) over the Internet. As demonstrated in many business and consumer tools, SaaS leverages intuitive Web 2.0 interfaces, deep domain knowledge, and economies of scale to deliver capabilities that are easier to use, more capable, and/or more cost-effective than software accessed through other means. The opportunity for continuous improvement via dynamic deployment of new features and bug fixes is also significant, as is the potential for expert operators to intervene and troubleshoot on the user's behalf.

We report here on a research data-management system called Globus Online, or GO, that adopts this approach, focusing on GO's data-movement functions ("GO-Transfer"). We describe how GO leverages modern Web 2.0 technologies to provide intuitive interfaces for fire-and-forget file transfers between GridFTP endpoints1 while leveraging hosting for automatic fault recovery, high performance, simplified security configuration, and no client software installation. We also describe a novel approach for providing a command-line interface (CLI) to SaaS without distributing client software, as well as Globus Connect, which simplifies installation of a personal GridFTP server for use with GO. Our experiments show low overhead for small transfers and high performance for large transfers, relative to conventional tools.

Adoption of this new service has been notable. One year after product launch in November 2010, more than 3,000 registered users had in aggregate moved more than two petabytes (2×10^15 B) and 150 million files; numerous high-performance computing (HPC) facilities and experimental facilities recommend GO to their users; and several "science gateways" are integrating GO as a data upload/download solution. GO has also been adopted as a foundational element of the National Science Foundation's new (as of 2011) Extreme Science and Engineering Discovery Environment (XSEDE) supercomputer network (http://www.xsede.org/).

Data Movement
Researchers often must copy many files with potentially large aggregate size among two or more network-connected locations, or "endpoints," that may or may not include the computer from which the transfer command is issued; that is, third-party transfers may be (indeed, frequently are) involved. Our goal in designing GO is a solution that provides extreme ease of use without compromising reliability, speed, or security. A 2008 report by Childers et al.5 of Argonne National Laboratory makes clear the importance of usability. In particular, failure recovery is often a human-intensive process, as reported by Childers et al.5: "The tools that we use to move files typically are the standard Unix tools included with ssh… it's just painful. Painful in the sense of having to manage the transfers by hand, restarting transfers when they fail—all of this is done by hand."

In addition, datasets may have nested structures and contain many files of different size (see the sidebar "Other Approaches"). Source and destination may have different security requirements and authentication interfaces. Networks and storage servers may suffer transient failures. Transfers must be tuned to exploit high-speed research networks. Directories may have to be mirrored across multiple sites, but only some files differ between source and destination. Firewalls, Network Address Translation, and other network complexities may have to be addressed. For these and other reasons, it is not unusual to hear of even modest-scale wide-area data transfers requiring days of careful "babysitting" or of being abandoned for high-bandwidth but high-latency (frequently labor-intensive and error-prone) "sneakernet."11
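The mirroring problem noted above (re-copying only the files that differ) is a good example of work that tooling can automate. The following is a minimal sketch of that comparison logic for two locally mounted directory trees; the function names are ours, and a hosted service such as GO would perform such comparisons between remote endpoints rather than over local paths, so this illustrates the idea, not GO's implementation.

```python
# Sketch of "transfer only what differs" mirroring, in the spirit of rsync.
# Illustrative only: a real service compares remote endpoints, not local dirs.
import hashlib
import os
import shutil

def file_digest(path, chunk=1 << 20):
    """MD5 of a file, read in 1MB chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def mirror(src_root, dst_root):
    """Copy from src_root to dst_root only files that are new or changed."""
    copied = skipped = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            # Cheap size check first; fall back to a checksum on a size match.
            if (os.path.exists(dst)
                    and os.path.getsize(src) == os.path.getsize(dst)
                    and file_digest(src) == file_digest(dst)):
                skipped += 1
                continue
            shutil.copy2(src, dst)
            copied += 1
    return copied, skipped
```

Even this toy version shows why naive re-copying is wasteful: on a mostly unchanged tree, nearly every file is skipped after an inexpensive size check, and only the changed remainder moves over the network.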
Why Move Data at All?
Why not just leave data where it is created? Such an option is certainly preferred when possible, and we may hope that over time moving computation to data rather than the other way round will become more common. However, in practice, data scientists often find data is "in the wrong place" and thus must be moved for a variety of reasons. Data may be produced at a location (such as a telescope or sensor array) where large-scale storage cannot be located easily. It may be desirable to collocate data from many sources to facilitate analysis—a common requirement in, say, genomics. Remote copies may be required for disaster recovery. Data analysis may require computing power or specialized computing systems not available locally. Policy or sociology may require replication of data sets in distinct geographical regions; for example, in high-energy physics, all data produced at the Large Hadron Collider, Geneva, Switzerland, must essentially be replicated in the U.S. and elsewhere for independent analysis. It is also frequently the case that the aggregate data-analysis requirements of a community exceed the analysis capacity of a data provider, in which case data must be downloaded for local analysis. This is the case in, for example, the Earth System Grid, which delivers climate simulation output to its 25,000 users worldwide.

Another common question about GO is whether data can be moved more effectively through the physical movement of media rather than through communication over networks. After all, no network can exceed the bandwidth of a FedEx truck.
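A back-of-the-envelope comparison makes the trade-off concrete. The data volume, link speeds, efficiency factor, and overnight-shipping time below are assumptions chosen for illustration, not measurements:

```python
# Rough comparison of network transfer vs. shipped media; all numbers assumed.
TB = 1e12  # bytes

def network_hours(volume_bytes, link_gbps, efficiency=0.8):
    """Hours to move data over a link achieving a fraction of line rate."""
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return volume_bytes / bytes_per_sec / 3600

volume = 10 * TB
for gbps in (0.1, 1, 10):
    print(f"{gbps:>5} Gb/s: {network_hours(volume, gbps):7.1f} h")
# Roughly: 0.1 Gb/s -> ~278 h; 1 Gb/s -> ~28 h; 10 Gb/s -> ~2.8 h.
# A disk shipped overnight (~24 h) "delivers" 10 TB at an effective
# ~0.9 Gb/s: competitive with a 1 Gb/s link, far behind a tuned 10 Gb/s one.
```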
The answer is, again, that while physical shipment has its place (and may be much cheaper if the alternative is

Other Approaches
One alternative for data movement involves running tools on the user's computer; for example, rsync,20 scp, file transfer program (FTP), secure FTP, and bbftp13 are all used to move data between a client computer and a remote location. Other software (such as globus-url-copy, Reliable File Transfer, File Transfer Service, and Lightweight Data Replicator) can each manage large numbers of transfers. However, the need to download, install, and run software is a significant barrier to use. Users spend much time configuring, operating, and updating such tools, though they rarely have the IT and networking knowledge necessary to fix things when they do not "just work," which is all too often.

Some big-science projects have developed specialized solutions to the problem; for example, the PhEDEx high-throughput data-transfer-management system9 manages data movement among sites participating in the Compact Muon Solenoid experiment at CERN, and the Laser Interferometer Gravitational-Wave Observatory (LIGO) project developed the LIGO Data Replicator.4 These centrally managed systems allow users to hand off data-movement tasks to a third-party service that performs them on their behalf. However, these services require professional operators and function only among carefully controlled endpoints within these communities.

Managed services (such as YouSendIt and Dropbox) also provide data-management solutions but do not address researchers' need for high-performance movement of large quantities of data. BitTorrent8 and content distribution networks21 are good at distributing a relatively stable set of large files (such as movies) but do not address data scientists' need for many frequently updated files managed in directory hierarchies. The integrated Rule-Oriented Data System17 is often run in hosted configurations, but, though it performs some data-transfer operations (such as for data import), data transfer is not its primary function or focus.

The Kangaroo,19 Stork,14 and CATCH15 systems all manage data movement over wide-area networks, using intermediate storage systems where appropriate to optimize end-to-end reliability and/or performance. They are not designed as SaaS data-movement solutions, but their methods could be incorporated into GO.

Web and REST interfaces to centrally operated services are conventional in business, underpinning such services as Salesforce.com (customer relationship management), Google Docs, Facebook, and Twitter—an approach not yet common in science. Two exceptions are the PhEDEx Data Service,9 with both REST and CLIs, and the National Energy Research Scientific Computing Center (NERSC) Web toolkit called NEWT,7 which enables RESTful operations against HPC center resources.
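To make the SaaS interaction pattern concrete, the sketch below shows the fire-and-forget shape that such REST interfaces enable: submit a transfer request, receive a task identifier, and let the hosted service own retries and tuning. The base URL, JSON fields, task states, and the use of the third-party requests library are all our assumptions for illustration; this is not GO's actual wire protocol.

```python
# Sketch of the submit-and-walk-away pattern behind a hosted transfer service.
# Endpoint URL, field names, and states are hypothetical placeholders.
import time
import requests

BASE = "https://transfer.example.org/v1"     # hypothetical service endpoint
AUTH = {"Authorization": "Bearer <token>"}   # placeholder credential

def submit_transfer(src_endpoint, src_path, dst_endpoint, dst_path):
    """Hand the transfer to the service; it retries failures on our behalf."""
    req = {
        "source": {"endpoint": src_endpoint, "path": src_path},
        "destination": {"endpoint": dst_endpoint, "path": dst_path},
        "recursive": True,
    }
    resp = requests.post(f"{BASE}/transfers", json=req, headers=AUTH)
    resp.raise_for_status()
    return resp.json()["task_id"]            # fire-and-forget: safe to log off now

def wait(task_id, poll_seconds=60):
    """Optional: poll until the service reports a terminal state."""
    while True:
        status = requests.get(f"{BASE}/transfers/{task_id}", headers=AUTH).json()
        if status["state"] in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```

The key design point is that all state lives with the hosted service, not the client: once the request is accepted, the user's laptop can disconnect, and expert operators can observe and troubleshoot the task on the user's behalf.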