An Open Source Managed File Transfer Framework for Science Gateways
Dimuthu Wannipurage, Suresh Marru, Eroma Abeysinghe, Isuru Ranawaka, Anna Branam, Marlon Pierce
Cyberinfrastructure Integration Research Center, Pervasive Technology Institute
Indiana University, Bloomington, IN 47408
[email protected]

Abstract—Managed File Transfer (MFT) systems are cyberinfrastructure that provide higher-level functionality on top of basic point-to-point data transfer protocols such as FTP, HTTP, and SCP. This paper presents an open source MFT system that incorporates science gateway requirements. We describe the system requirements, system architecture, and core capabilities.

Keywords—Science gateways, cyberinfrastructure, data management, managed file transfer, open source software, Apache Airavata

I. INTRODUCTION

Managing data transfers between distributed resources across administrative domains is a fundamental capability of cyberinfrastructure, with many long-standing solutions [1][2][3][4]. This problem appears in business-to-business and other domains as well [5], where it is commonly called Managed File Transfer (MFT).

MFT systems build on and provide a layer above point-to-point transfer protocols such as FTP, HTTP, and their secure equivalents. Examples of additional capabilities include separation of the control and data layers, end-to-end security for multi-hop transfers, optimized transfers using software-defined networking, support for extremely large data transfers, reliable data transfers, scheduled transfers, and centralized audit logging across many systems.

All of these capabilities can be customized extensively for different users and scenarios, and the exploration of their optimizations is a topic for distributed systems research. There is a gap in the cyberinfrastructure ecosystem today in providing an open platform that enables community contributions, provides software-level and operational-level transparency, supports distributed systems research, and supports science-gateway-specific user scenarios.

Science gateways provide science-centric interfaces to cyberinfrastructure. The key element of a science gateway system is that it integrates diverse backend resources that cross multiple administrative domains on behalf of communities of users. This leads to several issues that a gateway-focused MFT system must accommodate. First, authenticated user identities may not be communicated end-to-end to all systems; delegation, such as with community accounts, is common [6]. Second, gateway users need ways to grant their collaborators fine-grained access to, and operational permissions on, particular data sets. Third, science gateways are inherently ad-hoc collections of diverse resources that may include multiple cloud providers, so minimizing unnecessary data ingress and egress is important. Finally, science gateways can be used as mechanisms for providing controlled access to sensitive data of all types. Transfers of data to, from, and between resources need to account for security requirements arising from standards such as NIST 800-53 and NIST 800-171, which are used to establish HIPAA alignment and to handle controlled unclassified data, respectively.

Movement of data across data centers by MFT cyberinfrastructure must address multiple factors: 1) access protocols to data on different resources may differ; 2) transfers crossing multiple administrative domains need end-to-end security, including authentication, access control, and encryption; 3) data must be discoverable and accessible to the MFT system, even when the resources are local or otherwise private; 4) transfer paths should be optimized to (depending on the scenario) reduce transfer times, provide required security, and minimize or eliminate costs such as egress charges from commercial clouds; 5) recurring transfers should be schedulable; 6) monitoring and control should be separate from the actual data transfer path; 7) the system should account for data replicas; 8) diverse storage systems with, for example, highly varying latencies and bandwidths should be seamlessly integrated; and 9) the system should be inherently asynchronous.

II. PRIOR WORK

Globus [1] is well known for scientific data management, particularly the transfer of very large data sets. Globus data transfer services use the high-performance GridFTP point-to-point data transfer protocol. The main advantage of this protocol is that the data and control paths are clearly separated: data can be transferred directly from one point to another while the control path is held by a third party. Globus software is closed source, and its operations are proprietary.

StorkCloud [2] was an open source MFT implementation that provided an extensible multi-protocol transfer job scheduler; a directory listing service that prefetches and caches remote directory metadata to minimize response time to users; a web API; pluggable transfer modules; and pluggable protocol-agnostic optimization modules that could dynamically tune transfer settings to improve performance.

Rclone [7] is an open source project that supports multi-protocol data transfers, including SFTP, FTP, Google Drive, Box, and Amazon S3. It has clean credential management APIs that keep track of the credentials for different sources. However, its data and control paths are interleaved, so data must pass through the Rclone server regardless of the source and destination.

Alluxio [8] is a data orchestration framework used heavily in "big data" and data analytics applications where streaming performance is critical. Alluxio is a fully self-contained distributed system with its own data catalog and credential management system. This is good design from a product perspective, but it makes Alluxio challenging to integrate with systems, such as Apache Airavata, that have their own catalog and credential management components.

Following our evaluation of these systems, we identified an opportunity to build a lightweight MFT system that could be open source, accommodate our requirements, and integrate with other projects.

III. AIRAVATA MFT ARCHITECTURE

Airavata MFT has three design goals. First, the system should separate the data path and the control path to provide better flexibility in handling transfers and utilizing resources efficiently. Second, the system should provide APIs that can be cleanly integrated with multiple systems, such as other Apache Airavata components and other frameworks. Finally, we should reuse and build on standard transfer protocols where possible.
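The first of these design goals, separating the control path from the data path, can be sketched in a few lines. This is a hypothetical illustration, not the actual Airavata MFT code or API: the class and field names (`TransferRequest`, `Controller`, `Agent`) are invented for clarity. The point is that the controller only routes small control messages, while the agent alone touches the data.

```python
# Hypothetical sketch of control/data-path separation (illustrative names,
# not the real Airavata MFT interfaces).
from dataclasses import dataclass

@dataclass
class TransferRequest:
    transfer_id: str
    source_resource_id: str   # resolved later via a resource service
    dest_resource_id: str
    source_secret_id: str     # resolved later via a secrets service
    dest_secret_id: str

class Controller:
    """Control path only: queues requests for agents; never touches data."""
    def __init__(self):
        self.queue = []       # stands in for a message store

    def submit(self, req: TransferRequest):
        self.queue.append(req)

class Agent:
    """Data path: pulls a request and performs the byte movement itself."""
    def poll_and_run(self, controller: Controller) -> str:
        req = controller.queue.pop(0)
        # ... open source/destination with protocol-specific connectors ...
        return f"transferred {req.source_resource_id} -> {req.dest_resource_id}"

controller = Controller()
controller.submit(TransferRequest("t1", "storage-a/file.dat",
                                  "storage-b/file.dat", "sec-a", "sec-b"))
print(Agent().poll_and_run(controller))
```

Because the controller never sees file contents, it can run anywhere, while agents can be placed close to the storage endpoints they serve.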
If a data source supports a standard transfer protocol, the system supports it as is and converts it into a common format inside the MFT engine.

To implement these design goals, we identified the following major components, illustrated in Fig. 1.

Fig. 1. MFT internal components. Numerical labels give steps in a data transfer, as described in the text.

Agent: The entity that handles the data path of transfers. An Agent should know the protocol for talking to the data source, but it does not locally store any resource information or credentials used to communicate with the data source.

Resource Service: The extension point for plugging in the resource APIs of the external gateway system. The Agent fetches resource metadata from this endpoint.

Secrets Service: The extension point for resource credentials. External systems can plug in their own credential stores here, including science gateways and middleware systems that want to use Airavata MFT as a standalone service.

Consul: We use HashiCorp's Consul as a mediator for state changes, as described below.

Fig. 1 includes a detailed view of the total message path among all components of MFT:
1. A Client (such as a science gateway) submits a transfer request to the MFT API Service.
2. The API Service delivers the transfer request to the Consul message store.
3. The MFT Controller fetches the request, determines the target Agent for the transfer, and puts another message into the Consul message store for delivery to that Agent.
4. The target Agent fetches the message and determines the resource IDs and secret IDs required for the transfer.
5. The Agent talks to the Resource Service to fetch resource information for the received IDs.
6. The Agent fetches the credentials required to talk to the resource endpoints from the Secrets Service.
7. Once both resource and credential data are collected, the Agent starts the transfer using compatible protocols.

IV. DATA TRANSFER SCENARIOS

Fig. 2 illustrates the main concepts of an Airavata MFT deployment: a controller communicates with agents to issue control methods, and the agents implement the data transfer. Agents can be deployed in several scenarios. Black solid lines indicate data paths, while dotted lines indicate control paths.
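The seven-step message path described above can be walked through end to end in a small sketch. This is a hypothetical illustration under stated assumptions: a plain dict stands in for the Consul key/value store, two dicts stub the Resource and Secrets Services, and all function names, keys, and IDs are invented, not the real Airavata MFT interfaces.

```python
# Hypothetical walk-through of the seven transfer steps (stand-in names).
consul = {}  # key/value message store (stand-in for HashiCorp Consul)

# Stubs for the Resource Service and Secrets Service lookups.
resource_service = {"res-src": "sftp://hostA/data.bin",
                    "res-dst": "s3://bucket/data.bin"}
secret_service = {"sec-src": "ssh-key-123", "sec-dst": "aws-cred-456"}

def api_submit(transfer):                 # steps 1-2: client -> API -> Consul
    consul["mft/transfers/t1"] = transfer

def controller_schedule():                # step 3: controller picks an agent
    req = consul.pop("mft/transfers/t1")
    consul["mft/agents/agent-1/t1"] = req

def agent_run():                          # steps 4-7: agent resolves and moves data
    req = consul.pop("mft/agents/agent-1/t1")      # step 4: fetch message
    src = resource_service[req["source_resource_id"]]   # step 5: resource lookup
    dst = resource_service[req["dest_resource_id"]]
    creds = (secret_service[req["source_secret_id"]],   # step 6: credential lookup
             secret_service[req["dest_secret_id"]])
    # step 7: start the transfer with protocol-specific connectors (elided)
    return f"moving {src} -> {dst} using {len(creds)} credentials"

api_submit({"source_resource_id": "res-src", "dest_resource_id": "res-dst",
            "source_secret_id": "sec-src", "dest_secret_id": "sec-dst"})
controller_schedule()
print(agent_run())
```

Note how the request carries only IDs: the agent resolves endpoints and credentials at the last moment, so neither the API service nor the controller ever holds secrets or data.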