Software-as-a-Service: The iPlant Foundation API

Rion Dooley, Matthew Vaughn, Dan Stanzione, Steve Terry
Texas Advanced Computing Center
The University of Texas
Austin, USA
[dooley, vaughn, dan, sterry1]@tacc.utexas.edu

Edwin Skidmore
iPlant Collaborative
Tucson, USA
[email protected]

Abstract—The iPlant Foundation API (fAPI) is a hosted, Software-as-a-Service (SaaS) resource for the computational biology field. Unlike many other grid-based approaches to distributed infrastructure, the fAPI provides a holistic view of core computing concepts such as security, data, applications, jobs, and systems. It also provides the support services, such as unified accounting, provenance, metadata, and global events, needed to bind the core concepts together into a cohesive interface. Operating as a set of RESTful web services, the fAPI bridges the gap between the HPC and web worlds by supporting synchronous and asynchronous interfaces, web-based callbacks, unified access control lists (ACL), and a publication-subscription (pubsub) model that allows modern applications to interact with the underlying infrastructure using technologies and access patterns familiar to them. In its first year of operation, the fAPI has grown to support thousands of users across both the Plant and Animal Science domains. In this paper we describe the fAPI and its underlying architecture, and briefly describe its adoption before concluding with future plans.

Keywords—cloud; biology; iplant; data; saas; web service; rest

I. INTRODUCTION

The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that has created an innovative, comprehensive, and foundational cyberinfrastructure (CI) in support of plant biology research [1]. iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists on the use of cyberinfrastructure in research and education. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input. It leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences, and is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs.

iPlant provides a wide range of computational resources aimed at supporting different use cases. The iPlant Data Store is a persistent, high performance, distributed storage solution built upon iRODS [2]. Atmosphere is a highly customized, on-demand virtualization solution built upon Eucalyptus [3][4]. Multiple high performance computing (HPC) resources are provided free of charge through allocations on XSEDE [5], FutureGrid [6], the Open Science Grid (OSG) [7], and the University of Texas System [8]. iPlant also manages its own high throughput computing (HTC) resources at the University of Arizona. Each of these resources comes with its own set of interfaces, allocations, and access mechanisms.

To lower the barrier of entry to these resources and facilitate broader adoption, community-led iPlant steering committees identified a need for a programmatic access layer. In-person surveys and discussions from over 100 subsequent community workshops validated this recommendation and provided an initial set of requirements for the desired access layer to meet. Obvious capabilities such as the ability to submit jobs, track provenance, move data, and share work topped the list. Additional common requests from the community were to apply and manage metadata, interoperate with existing private resources, interact with third-party services, and support ontological discovery. Other, larger development projects requested developer-friendly features such as a Representational State Transfer (REST) [9] interface, callback notifications, token-based access, and support for asynchronous communication.

Driven by this initial set of requirements, the development team conducted an evaluation of current technologies to identify projects they could adopt or extend to provide such an access layer. This evaluation is discussed as background material in Section II.

II. BACKGROUND

The evaluation began with hosted services. Globus Online [10] provided a production quality, hosted data movement service, but at the time of this writing it required separate authentication mechanisms, provided no metadata support, and could not be embedded into existing services and applications. There is hope that these features will be provided in the near future, but until then, deep integration is not possible. The Cyberaide project [11] showed great potential as a hosted job submission service; however, the project is currently unsupported, the software suffers from scalability issues associated with large input data, and the model of wrapping each individual application in a virtual appliance does not match well with usage scenarios in which users leverage the service as a mechanism for rapid application development. Airavata [12] is a well supported workflow orchestration and execution service, but it currently lacks key permission and data access concepts requested by the iPlant community. Several discussions with the Airavata project team indicated that significant refactoring would be required to support these concepts, and such work was not in the scope of their current roadmap.

Next, the development team looked at Infrastructure-as-a-Service (IaaS) solutions. Commercial companies such as Amazon, Rackspace, and Microsoft provided solutions with mature web service offerings allowing customers to interact with their infrastructure. However, the primary focus of their web services was the administration of the resources they own. VMWare, OpenStack, Nimbus, and Eucalyptus all provided IaaS solutions similar to the hosted commercial offerings, but run on local systems. These were attractive solutions that iPlant could, and did, adopt to expose its underlying cloud resource, but neither the hosted nor the private IaaS solutions provide services of sufficient abstraction or breadth to meet the requirements of the user community.

Finally, the development team turned its evaluation focus to popular "middleware" projects such as the Globus Toolkit [13], Grisu [14], NEWT [15], SAGA [16], and Unicore [17]. Each project had its own advantages, but the focus was generally on a subset of the functionality desired by the iPlant user community.

The Globus Toolkit provided solutions for cross-site authentication, data movement, job submission, and information management, but worked in a way largely disconnected from the underlying compute and data systems. Exposing the Globus GRAM scheduler as an outward facing service to iPlant users would create the possibility of a discrepancy between queue status reported by the system schedulers and the Globus scheduler. The development team learned that in other projects this led to accounting and monitoring difficulties that could never be completely worked out. Another roadblock to adopting the Globus Toolkit as a top level solution was that interaction with the Globus services required a separate technology stack, which would be needed on every system communicating with iPlant. Though much improved in the past year, this software is still challenging for novice users to install and configure. Given the 7000 users iPlant currently supports, the burden that adoption of the Globus Toolkit would potentially add to the iPlant support staff weighed significantly in the final decision to leverage some of the lower level Globus components, such as GridFTP, for internal use, but refrain from exposing any at the user-facing level.

Grisu is an open source project developed by the Australian Research Collaborative Service's National Grid. It provides SOAP, REST, and GUI interfaces for job submission. The approach of Grisu was in line with community requirements; however, the scope of the project was much smaller than that of iPlant, and the underlying technology was aging. Several of its core dependencies had been deprecated and no effort was being made to update to alternative technologies. At the time of the initial review the Grisu development team was phasing out support of the project due to lack of interest and funding. As a result, there was little opportunity to collaborate.

NEWT is a web service that allows users to access computing resources at the National Energy Research Scientific Computing Center (NERSC) through a simple REST API. NEWT exposes services for authentication, system monitoring, file uploads and downloads, directory listings, running commands, submitting jobs to batch queues, obtaining accounting information, and accessing a persistent object storage. NEWT is an internal project at NERSC and, as such, is deeply tied to the underlying NERSC infrastructure. After evaluating the NEWT architecture, it was found that while it did meet many of the requirements, it would require nearly a complete rewrite to port NEWT onto the iPlant cyberinfrastructure.

The Simple API for Grid Applications (SAGA) is a programmatic interface for managing jobs and data on a computational grid. It currently has bindings in Java, C++, and Python; however, it lacks any web service interface.

Unicore is a comprehensive middleware solution similar to the Globus Toolkit, but exposed as a set of SOAP web services. One benefit of Unicore is its use of community established protocols for job submission, information management, and authentication. The primary drawbacks to Unicore are the large buy-in required to leverage it, the technology stack required by clients to interact with the SOAP services, and the system-level installations required by both the clients and resource providers. Given iPlant's lack of administrative access to the underlying HPC hardware and the requirement for REST and web-accessible services, Unicore could not be an outward-facing technology in the desired iPlant programmatic access layer.

After completing the evaluation process, the development team determined that, for lack of a better description, there was abundant plumbing, but no kitchen sink. Despite the needs of the community, there was no solution that allowed significant re-use of existing code bases to meet the needs of the plant science community, and so the development team designed and created the Foundation API (fAPI). The remainder of this paper is organized as follows. Section III gives an overview of the fAPI design. Section IV describes the individual services comprising the API. Section V provides usage and adoption metrics from the fAPI's first year of operation, and the final section concludes with a brief roadmap for future development.

III. DESIGN AND ARCHITECTURE

The Foundation API is designed to operate as a multitenant, cloud-based, Software-as-a-Service (SaaS) solution. Two physical instances of the fAPI services sit behind an Apache load balancer that directs requests between the primary instance at the Texas Advanced Computing Center and the mirror instance at the University of Arizona. This single logical instance of the fAPI is available at https://foundation.iplantcollaborative.org and serves iPlant users and partner projects alike.

The fAPI services are implemented in Java and PHP as RESTful web services. The Java services leverage the Restlet [18] framework and run in a Tomcat 6 container. Hibernate is used to handle the ORM to a shared MySQL database [19].
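As a minimal illustration of the RESTful interface style just described, the Python sketch below shows how a client might call a protected fAPI-style endpoint over HTTP Basic. The route and account names here are hypothetical assumptions, not routes taken from the fAPI documentation; per the paper, a token obtained from the Auth service can stand in for the user's password.

```python
import base64
import urllib.request

# Base URL of the hosted fAPI instance; the route below ("/io/list/...")
# is purely illustrative and not a documented fAPI endpoint.
BASE = "https://foundation.iplantcollaborative.org"

def basic_auth_request(path, username, secret):
    """Build an HTTP Basic request over SSL. The `secret` may be either a
    password or, as the paper notes, a token issued by the Auth service."""
    req = urllib.request.Request(BASE + path)
    cred = base64.b64encode(f"{username}:{secret}".encode()).decode()
    req.add_header("Authorization", f"Basic {cred}")
    return req

# A token simply replaces the password in the Basic credentials:
req = basic_auth_request("/io/list/jdoe", "jdoe", "a1b2c3token")
```

Because the token is carried in ordinary Basic credentials, any HTTP-capable client or library can consume the API without grid-specific tooling.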
The scenario shown in Fig. 2 occurs when a user requests that the IO service fetch a file from an external data source, preprocess or convert it in some way, and store it in the iPlant Data Store. While fulfilling the request, the task passes through several queues, keeping the front-end service aware of its progress. Once the task completes, an optional notification is sent to the user via email or web hook, alerting them that their file staging request has finished.
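A staging request like the one above might be expressed as a small JSON document naming the source, an optional transform, a destination, and a notification target. The field names below are illustrative assumptions rather than the documented fAPI parameters; the `${STATUS}` placeholder mimics the template system the paper describes for customizing callback URLs.

```python
import json
from string import Template

# Hypothetical payload for an asynchronous file-staging request; these
# field names are invented for illustration, not the real fAPI fields.
staging_request = {
    "source": "ftp://ftp.example.org/genomes/raw_reads.fastq",
    "transform": "fastq-to-fasta",  # optional pre-processing step
    "destination": "/iplant/home/jdoe/data/reads.fasta",
    # Web-hook callback; the service fills in runtime values when it
    # POSTs the completion notification.
    "callbackUrl": "https://myapp.example.org/hooks/${STATUS}",
}
body = json.dumps(staging_request)

# Toy illustration of how the callback URL template would be resolved
# once the transfer completes.
resolved = Template(staging_request["callbackUrl"]).safe_substitute(STATUS="COMPLETE")
```

The same fire-and-forget pattern applies whether the callback is an email address or a web hook: the client submits the request and is free to disconnect.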

Figure 1. Conceptual overview of the Foundation API architecture.
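The queue-based hand-off in the architecture overview above, where front-end services enqueue tasks and workers claim them, can be modeled with a toy in-memory queue. In the real system the queue is a persistent Java service, so the Python fragment below is only a sketch of the claiming behavior, with all names invented.

```python
import queue

# Toy model of the front-end/worker hand-off: the front-end never talks
# to a worker directly; it only enqueues tasks, and any idle worker
# claims the next task as it appears.
task_queue = queue.Queue()

def front_end_submit(task):
    """Front-end service: accept the request and hand it to the queue."""
    task_queue.put(task)

def worker_claim():
    """Worker service: block briefly on the queue and claim a task. If a
    worker died mid-task, the task would be re-queued for another worker."""
    return task_queue.get(timeout=1)

front_end_submit({"type": "staging", "source": "ftp://ftp.example.org/a.txt"})
task = worker_claim()
```

Decoupling submission from execution this way is what lets the worker pool grow and shrink with load, as described in the text.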

Figure 2. A typical flow of execution when an asynchronous call is made to the fAPI. In this figure, the user requests a file be staged in from an external data source, transformed, and stored in the iPlant Data Store.

The PHP services leverage portions of the CodeIgniter framework and utilize the native PHP 5 PDO support [20]. The fAPI has both public and protected interfaces, depending on the nature of the data they expose. The protected endpoints are accessible via HTTP Basic over SSL. While iPlant uses LDAP as the backing identity store for the entire project, the fAPI does not expose this directly, but rather supports token-based authentication. Consumers request a token from the Auth service using their client credentials and use that token to interact with the rest of the API until it expires.

Supporting the front-end services is an elastic set of worker services. These workers handle the lengthy task requests made to the front-end services, such as file staging, job submission, and data transformation. The workers are implemented in Java as headless services running on distributed virtual machines. The workers do not directly communicate with the front-end services; the front-end services submit tasks to the underlying queue service, and the workers watch their respective queues and claim tasks as they appear. The queue service is implemented in Java using the Quartz scheduling framework backed by a MySQL database [21]. A separate Event service complements the queue service and provides publication-subscription (pubsub) functionality to the API.

As the overall workload on the fAPI increases, scaling up is trivial. Additional worker services are started as new virtual machines (VM) are spun up. When the overall workload decreases, the VMs can be shut down. If a worker fails or becomes unresponsive, another worker will claim the lost task and carry on. Fig. 1 illustrates the overall fAPI architecture.

As mentioned above, the fAPI services support both synchronous and asynchronous interaction. This allows client applications to fire and forget long running tasks such as job submission and data movement.

Because iPlant does not own the underlying HPC, storage, or network resources needed to conduct much of the science it enables, the quickest time to production came from running the fAPI under a single community account and brokering actions on behalf of its internally vetted users. The complexity of implementing the services in this way was significantly less than that of supporting authentication and usage scenarios outside of the project. Using a community account also made provenance achievable across the entire enterprise and placed a much smaller burden on the system administrators. The primary disadvantage of this approach was that it placed a significant security responsibility on the development team and increased the level of reporting needed at all levels of the system. It is also not a sustainable approach from a resource utilization standpoint. This was one of the lessons learned from the CIPRES project [22]: when useful software is available to the community, community requests will quickly and forevermore outpace the available resources. Given the already oversubscribed state of the underlying shared HPC resources, it is not possible to continue to grant larger and larger percentages of the available cycles to a single project. Thus, in the long term, the fAPI must move to enable users to charge work to their individual system allocations when appropriate. This topic is addressed more in the section on future work.

Three other considerations influenced the overall design of the fAPI without impacting the architecture. One of the most frequently requested features from the community during the requirements gathering phase was collaborative support. Users wanted a "share anything" model that would allow them to work with past, present, and future colleagues regardless of their affiliation with iPlant. As a result of this requirement, flexible access control lists (ACL) were built into all relevant pieces of the fAPI, allowing users to share anything at any time with anyone.

Second, the importance of a clean, friendly API became readily apparent for a variety of reasons. A 2008 study of nearly 2000 researchers found that over 96% of those surveyed said that self-study was important when learning how to develop software, while only 30% could say the same for formal education. The same study showed that 84% of the researchers surveyed claimed software development was critical to their research, yet they were only able to allocate 30% of their time to it [23]. Two things jump out from this study. First, the vast majority of users will not be professional programmers and will have neither significant experience nor the time to learn new techniques and methodologies. In many cases users are scientists who are simply programming out of necessity. For these researchers, the learning curve required to begin building their software can be daunting. They first need to discover what services are available to them. Next, given a set of disjoint services, they must decipher how to piece them together using different communication protocols, authentication mechanisms, and access patterns. Once they figure that out, they must familiarize themselves with several different APIs in order to write the necessary integration code. Once all that is in place, they can begin debugging the idiosyncrasies of each scheduler, compute, and storage system to find out why their integration code doesn't work as advertised. Finally, after all that, they can begin building the application they originally set out to build. Given the prevalence of graduate students doing much of the work and the inherent turnover rate of such employees, setting the barrier to entry that high is detrimental to the goal of advancing science.

The results of that study corresponded with observations made by the iPlant User Support and Engagement teams during the first 2.5 years of the project. Thus, as part of the design process, the development team took lessons from other popular APIs used in the commercial space, such as Yelp, Dropbox, Flickr, Facebook, Amazon, and Paypal. User stories were constructed around mobile, web, desktop, and command-line usage scenarios. The result was an API that erred on the side of usability rather than compliance. It promoted progressive user buy-in rather than forcing vendor lock-in. It promoted a grocery store approach to adoption by allowing users to take what they need and leave the rest for later.

Lastly, the development team, through many conversations with the Discovery Environment development team as well as external projects, identified great value in providing a solution that, from a user perspective, worked the way the rest of the web worked. It should be designed in such a way that it could be easily consumed and mashed up with other services in a short amount of time. It should recognize that polling for information is a bad design pattern and allow users to adopt better access patterns when they are ready. It should understand that every application has its own concept of "fresh" information and allow each to access information at whatever rate and level it feels necessary.

These three considerations were major influences on the design of the fAPI. In the next section we bring together these design decisions, along with several more subtle ones, to describe the individual services of the fAPI.

IV. FOUNDATION API SERVICES

A. Authentication Services

The first interaction users have with the fAPI is with authentication. Users access protected endpoints in the API using an authentication token, which they obtain from the Auth service. The Auth service uses LDAP to authenticate users and issues renewable, short-term credentials in the form of tokens. The service supports token revocation, blacklisting, delegation, and fixed usage scenarios. Tokens are implemented simply as alphanumeric strings and can be used in place of a user's password when making HTTP Basic requests to the protected service endpoints.

One use case not addressed by the Auth service is that of low-trust scenarios. Oftentimes, user workflows include interaction with third-party services. Granting these services unlimited access to the user's account is highly undesirable. The Auth service can issue single use tokens, but it does not restrict the URL for which that token is valid. As a result, that token could be used to invoke any action using the API, not just the one intended by the user. The PostIt service was created for this purpose.

The PostIt service is a pre-authenticated URL shortening service. Similar to popular URL shortening services such as TinyURL.com and Bit.ly, the PostIt service creates an obfuscated, short URL out of any endpoint in the fAPI. In the case of protected endpoints, the user can pre-authenticate the URL by authenticating to the PostIt service when creating the URL. As with the Auth service, the user can restrict the lifetime and number of uses of the generated URL, but with the PostIt service that permission is restricted to the registered URL. When the third-party service invokes the PostIt URL, the PostIt service adds the appropriate authentication token and proxies the request to the original URL. Any query parameters, form variables, or attachments are forwarded on, and the response is relayed back. As with the Auth service, PostIt URLs can be revoked as needed.
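The difference between a bare single-use token and a PostIt grant can be modeled in a few lines: the grant is bound to the one URL registered at creation time, in addition to carrying a use count and a lifetime. The Python sketch below illustrates the access rules described above; it is a toy model, not the service's implementation, and all names are invented.

```python
import time

class PostItGrant:
    """Toy model of PostIt semantics: unlike a bare single-use token,
    the grant is valid only for the URL registered when it was created,
    with a limited lifetime and number of uses."""

    def __init__(self, url, max_uses=1, lifetime=3600):
        self.url = url
        self.remaining = max_uses
        self.expires = time.time() + lifetime

    def redeem(self, requested_url):
        # Reject wrong URLs, exhausted grants, and expired grants.
        if (requested_url != self.url
                or self.remaining <= 0
                or time.time() > self.expires):
            return False
        self.remaining -= 1
        return True

grant = PostItGrant("https://foundation.iplantcollaborative.org/io/download/jdoe/results.csv")
ok = grant.redeem("https://foundation.iplantcollaborative.org/io/download/jdoe/results.csv")
denied = grant.redeem("https://foundation.iplantcollaborative.org/apps/list")
```

Binding the credential to a single URL is what makes PostIts safe to hand to third-party services that should perform exactly one action on the user's behalf.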
It is anticipated that the service will hold on the The IO service is the core data movement service of the order of a billion pieces of information in its first year. Foundation API. It serves as a value-added abstraction layer on Querying such a dataset becomes challenging as that number top of the iPlant Data Store.. The IO service, like the Data grows. To improve search times, various methods of data Store, provides basic file operations and exposes fine-grained reductions are being examined. We anticipate the Meta service ACL. Data can be transferred both synchronously via HTTP to be available in full production early in 2013. GET and POST, and asynchronously by requesting a data C. Execution Services transfer from an external data source using any of a number of protocols. Current protocols supported are FTP, GridFTP, The Apps service is designed to serve as a central registry for applications deployed in the iPlant software infrastructure. SFTP, HTTP, IRODS, and Amazon S3. When requesting data from the IO service, data can be requested in whole or in part In concert with the Jobs API, bioinformatic analysis applications can be deployed on a range of platforms, ranging using partial data queries. When scheduling long running transfers, a callback can be specified. Callbacks can be an from leadership cluster systems, to virtual machines running in an environment such as Atmosphere, to remote RESTful email address or a web hook in the form of a HTTP endpoint to which the service will POST a response. To allow services. Support for truly heterogeneous, user-defined computing infrastructure such as Amazon EC2 and Google customization of the notifications, a template system is provided for use in defining the callback URL query. App Engine is an area of future growth in development now. 
Users can search for applications by search terms such as The IO service also provides a chained pre-processing mechanism that allows users to transform data upon ontological terms, tags, name, and unique ID. As a convenience for users building their own web GUIs, the Apps upload/import before it reaches its final destination. This allows raw data to be imported, yet arrive in a polished form. service also generates HTML forms for registered applications that can be posted to the jobs service for execution. This is particularly useful when importing database dumps, raw experimental output, and data from deprecated file formats, but Applications are available both publicly and privately. Public applications are administered and maintained by iPlant staff also handy for automatically compressing or decompressing data and running metadata extraction on datasets as they come and made available for anyone to use. Private applications are registered by users and, by default, can be discovered and run in. only by their owners. Applications have their own ACL, so The Data service was created in response to the need for private applications may be shared with other users by setting scientists to collaborate without having to learn many different the appropriate permissions. software packages and data formats. The Data service acts as a Application registration is accomplished by posting a JSON Rosetta stone for plant biologists allowing them to convert their existing data from one format to another and stage it to a description of the application to the Apps service. The JSON description includes basic information about the application location they specify using any of the protocols supported by the IO service. There are currently 49 different data formats name, version, target system, invocation method, and any required inputs, parameters, and outputs. Currently supported supported by the Data service. 
It is not possible to transform in every pairwise manner, so the data service provides a invocation methods include CLI, SGE, PBS, LOADLEVELER, LSF, TORQUE, and CONDOR. The mechanism to determine the available transformations given a description of the inputs and outputs is used exclusively for source data format. Like the IO service, the Data service supports both synchronous and asynchronous requests, validating job submission requests and determining what information is needed by the executable to run. The process of callbacks, and partial data queries. wrapping an application is out of the scope of this paper. For While the IO, Transfer, and Data services exist to aid in the more information on this process, consult the API management of raw data, the Meta service allows for the documentation in [25]. management of metadata. The Meta service provides metadata The Job service is the core execution service in the API. Its support for all of the fAPI services. It exists as an unstructured data store backed by a column-oriented NoSQL database[24]. sole purpose is to start and manage jobs across the underlying systems. Users POST a job request to the service and the Because metadata is, by definition, data about data, it has no inherent meaning. Thus, the Meta service allows users to service validates that request against the application description registered with the Apps service. A job handle is returned to the register schemas that describe their data. Schemas are JSON or XML descriptions of data. These descriptions may be a series user and the job request is forwarded on to the submission queue. A job submission worker will pull the job off the queue, of tuples, or highly structured object definitions. The decision on how to describe their data is left up to the user. 
In order to transfer in any missing data using the IO service and stage all dependencies to the remote system before submitting the job to bridge the gap between different schemas, users can register mappings between schemas. These mappings allow the Meta the remote scheduler or forking the process. When the job completes, a callback will be made to the user. Like the IO service to identify opportunities to do apples-to-apples comparisons between metadata associated with different service, the Job service supports a template system to enable the user to dynamically construct a callback URL at run time. schemas. This is helpful when doing global searches, constructing user-derived responses, and when using metadata Jobs, like applications and files, have their own ACL and to reconcile different raw data sets. The Meta service is can be shared between iPlant users as well as the general currently in initial testing while undergoing scalability public. To support quick access to job data, the Job service provides convenience endpoints to easily locate and browse will change in the coming months as the size of data mandates their output folders. This is helpful given that a job’s output a move towards a NoSQL solution. folder may or may not exist and changes over time. When the The Event service provides pubsub capabilities to the rest job is running, the folder is on the execution host. Once a job has finished, the folder is archived in the Data Store. If the job of the fAPI. When a work action is taken in the fAPI such as a data or job operation, an event is triggered and pushed onto the was not archived or failed right away, there will not be an output folder associated with the job. user’s event queue. The message is then available to anyone subscribed to that queue. The fAPI currently uses a custom Archiving is built into the job service. 
By default, all data Event service implementation based on the OGCE WS- generated during job execution is archived back to the Data Message project [27]. This may change going forward in Store and placed in a location specified by the user or, if not anticipation of broad RabbitMQ [28] support across the rest of specified, in an archive folder generated by the service in the the project. user’s home space. The Manager service is a catchall service for controlling the D. User Services behavior of the other services. It provides reporting as well as The Profile service acts as a directory service for iPlant. It the ability to shut down and drain work from the IO, data, and allows users to view information about other iPlant users such jobs worker services. It is a critical service in the Foundation as their name, contact info, and institutional affiliation. The API that we restricted exclusively to administrative users. identity management within iPlant is currently going through V. USAGE AND ADOPTION an upgrade as the project moves towards providing OAuth2 [26] support. As such, the profile service is currently available The fAPI was released for public usage in September of as a read-only service. Please consult the Future Plans section 2011. Since then, over 250 users have leveraged the fAPI to for more information about fAPI roadmap for identity enable projects representing more than 10,000 scientists management. worldwide. Users burned more than half a million SUs running over 4000 jobs across HPC systems at PSC, SDSC, and TACC. The Audit service provides detailed usage and accounting The fAPI data services now serve access to nearly half a information across the iPlant cyberinfrastructure. From the petabyte of data. Fig. 3 shows the approximate usage history Audit service, users can check their allocation status and over the first year of operation. 
quotas, obtain reports on their system-level usage history, their API usage history, and obtain their remaining balances across resources. The Audit service is targeted towards individual users rather than the organization as a whole. As a result, separate administrative interfaces are available for staff to create systemwide and multi-user reports. E. Administration Services The Systems service is a discovery endpoint for the fAPI. It provides information on what systems are currently available through the Jobs service for execution and what the current status of those systems is. This service is currently read-only. The section on Future Plans describes the ongoing work of the development team to support user-defined dynamic resource registration. Figure 3. Foundation API monthly usage history. Usage increased consistently The Monitor service gives users a view into the health of over the year with anticipated dips during breaks in the academic calendar. the fAPI. Its main role is to tell users if any part of the fAPI is down and why. This information is obtained by running a suite In the remainder of this section we highlight three projects of tests as a cron process that check the availability and who have made significant contributions to this usage by response accuracy of the entire fAPI, its dependent services, engaging with the fAPI development team to integrate the fAPI and the underlying systems. Test results are pushed into a deeply into their existing applications. These projects are the separate MySQL database and exposed by the Monitor service. iPlant Discovery Environment, the BioExtract Server, and the Both the service and database are hosted physically separated Easy Terminal Alternative. from the rest of the services to ensure that it is accessibility The first adopter of the fAPI was the iPlant’s own even in the case of catastrophic failure in the rest of the fAPI. Discovery Environment (DE) team [29]. 
The first adopter of the fAPI was iPlant's own Discovery Environment (DE) team [29]. Said Lead Developer Andrew Lenards, "The a la carte approach to the Foundation API was extremely attractive to an integrating system like the Discovery Environment. It primarily means that the system can leverage the portions of the framework that make the most impact. For the Discovery Environment's current implementation, this includes the Foundation API's Data, Auth, Apps, and Jobs services. A token-based authentication approach available from the Auth service allows actions to be completed on behalf of the Discovery Environment's users. Aspects of the Data service are leveraged to enable users to manage their file-based data. And a simple 'shim' script allows the Discovery Environment to off-load computationally intensive applications to the appropriate resources available at TACC. Through this simple script, the Discovery Environment can access applications deployed within the TACC and XSEDE environments, execute them, and monitor their status via commodity hardware using both the Apps and Jobs services.

"The greatest value-added feature gained through integration with the Foundation API, however, was its monitoring capabilities. Experienced software developers are all too aware of the performance implications of polling, yet are often forced to implement such functionality due to a lack of effective alternatives. Using a pub-sub model, the Foundation API does not force an integrator to poll; they just need to operate a simple endpoint that allows the integrating system to subscribe to updates via webhooks available throughout the API. This makes bubbling up the status of operations within the Foundation API easily achievable. The Discovery Environment has benefited tremendously from its adoption of the Foundation API and, as our needs grow over time, our adoption of other Foundation API services will increase."

The primary lesson learned working with the DE team was the cost of technological debt. The DE team started development a full year before the fAPI. As a result, they had hundreds of thousands of lines of code already deployed and working, and even small changes to their infrastructure could have significant impacts on the overall system, both in terms of stability and developer effort. For them, the return on investment had to significantly outweigh the cost of any change they introduced. That placed a large responsibility on us to ensure that our services were stable and that we could clearly communicate the value of individual fAPI services so they could find integration points that fit into their existing development cycle.

The BioExtract Server is an open, Web-based system designed to aid researchers in the analysis of genomic data by providing a platform to facilitate the creation of bioinformatic workflows. Scientific workflows are created within the system by recording tasks performed by the user. These tasks may include querying multiple, distributed data sources, saving query results as searchable data extracts, and executing local and Web-accessible analytic tools. The series of recorded tasks can then be saved as a reproducible, sharable workflow available for subsequent execution with the original or modified inputs and parameter settings [30]. Over the last year, Dr. Carol Lushbough and the BioExtract team have leveraged the fAPI to enable researchers to execute iPlant analytic tools within the BioExtract Server, create iPlant workflows through the BioExtract Server, and manage analysis and provenance data across both platforms. This collaboration has produced a unified authentication mechanism between the two projects using the Auth service, an expansion of their data capacity by several orders of magnitude using the IO and Data services, the addition of over two dozen new scientific codes to their application through the Apps service, and over 500 HPC jobs run across systems at TACC and SDSC using the Jobs service.

The biggest lesson learned from working with the BioExtract team was the need to respond quickly and manage change well. The BioExtract Server already had an active user community relying on it for their daily work, so any changes made to the fAPI after integration needed to ensure backward compatibility or their application would break. As a result, the development team spent time developing processes for releasing updates, versioning the services, migrating users between versions, and ensuring that provenance was pervasive across the fAPI. These processes made the fAPI a better product overall and helped us manage our larger users more effectively.

Easy Terminal Alternative (ETA) is a web-based front end to any infrastructure that allows novice users to submit, monitor, and share jobs [31]. ETA is designed specifically to transition users from the terminal to the web by easing the running of applications, streamlining the creation of scientific pipelines, and managing application lifecycles. The ETA development team at the Center for Genome Research and Biocomputing, led by Alex Boyd, used the fAPI Auth, Jobs, IO, and Apps services to provide overflow cycles for their campus users when their system reached full capacity. Leveraging the webhook features of the Jobs service, ETA was able to reconstruct existing pipelines using their existing workflow execution engine.

The primary takeaway from the engagement with the ETA team was the need for multiple levels of documentation. While we had reference documentation in place for the fAPI, we did not have the big-picture description that explained the concepts of how the fAPI worked and its value proposition to potential users. To that end, a developer website will be released with version 2 of the fAPI.

VI. FUTURE PLANS

Looking forward, the Foundation API has an exciting roadmap. In the coming months the development team will release the 2.0 version of the API, expanding functionality in several new directions while increasing performance and reliability in the existing services. Provenance will continue to be improved across the entire iPlant cyberinfrastructure: a global UUID will be given to every digital object that touches the iPlant cyberinfrastructure, and the Meta service currently in testing will be rolled out as a first-class service as well.

A new area of focus for the development team is authentication and identity management. A new OAuth 2 service will be deployed that will provide better accounting, provenance, and control over individual services than was previously available. At the same time, a new Systems service will be deployed that allows users to register compute and data resources external to iPlant for use through the fAPI. This will move the fAPI closer towards a true SaaS offering and help bridge the gap for users between iPlant and their local and campus resources.

Through a joint effort with the HPC and HTC system providers, a global RabbitMQ-based event system will be deployed in the near future. This will allow for better monitoring and information management across the entire system.

Finally, the Apps and Jobs services will continue to expand with additional functionality. Support for new platforms and
execution mechanisms is ongoing and will bring commercial cloud support to the fAPI in the coming months. In order to support interoperability with other job execution services, additional job and application description formats will be supported. The addition of the new OAuth service will also impact the Jobs service in a positive way: once available, job submission using the user's personal system account and allocation will be supported.

This list of plans is in no way complete, but hopefully it casts a vision of the direction in which development is currently headed. For more information on the fAPI and to follow along with its development, consult [25] and [32].

ACKNOWLEDGMENT

The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191). This work was also partially supported by a grant from the National Science Foundation Cybersecurity Program (#1127210).

REFERENCES

[1] Dan Stanzione, "The iPlant Collaborative: Cyberinfrastructure to Feed the World," IEEE Computer, November 2011. doi: 10.1109/MC.2011.297.
[2] A. Rajasekar, M. Wan, R. Moore, and W. Schroeder, "A Prototype Rule-based Distributed Data Management System," HPDC Workshop on Next Generation Distributed Data Management, May 2006, Paris, France.
[3] E. Skidmore, S. J. Kim, S. Kuchimanchi, S. Singaram, N. Merchant, and D. Stanzione, "iPlant Atmosphere: A Gateway to Cloud Infrastructure for the Plant Sciences," Proceedings of Gateway Computing Environments 2011 at Supercomputing11 (2011).
[4] Eucalyptus. http://www.eucalyptus.com.
[5] The Extreme Science and Engineering Discovery Environment. http://xsede.org.
[6] FutureGrid: An Experimental, High-Performance Grid Test-Bed. http://portal.futuregrid.org.
[7] Mine Altunay, Paul Avery, Kent Blackburn, Brian Bockelman, Michael Ernst, Dan Fraser, Robert Quick, Robert Gardner, Sebastien Goasguen, Tanya Levshina, Miron Livny, John Mcgee, Doug Olson, Ruth Pordes, Maxim Potekhin, Abhishek Rana, Alain Roy, Chander Sehgal, Igor Sfiligoi, and Frank Wuerthwein. 2011. A Science Driven Production Cyberinfrastructure--the Open Science Grid. J. Grid Comput. 9, 2 (June 2011), 201-218.
[8] The University of Texas System. http://www.utsystem.edu.
[9] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, Information and Computer Science, University of California, Irvine, California, USA, 2000.
[10] Ian Foster. 2011. Globus Online: Accelerating and Democratizing Science through Cloud-Based Services. IEEE Internet Computing 15, 3 (May 2011), 70-73.
[11] T. Kurze, Lizhe Wang, G. von Laszewski, Jie Tao, M. Kunze, D. Kramer, and W. Karl, "Cyberaide onServe: Software as a Service on Production Grids," Parallel Processing (ICPP), 2010 39th International Conference on, pp. 395-403, 13-16 Sept. 2010. doi: 10.1109/ICPP.2010.47.
[12] Suresh Marru, Lahiru Gunathilake, Chathura Herath, Patanachai Tangchaisin, Marlon Pierce, Chris Mattmann, Raminder Singh, Thilina Gunarathne, Eran Chinthaka, Ross Gardler, Aleksander Slominski, Ate Douma, Srinath Perera, and Sanjiva Weerawarana. 2011. Apache Airavata: a framework for distributed applications and computational workflows. In Proceedings of the 2011 ACM Workshop on Gateway Computing Environments (GCE '11). ACM, New York, NY, USA, 21-28.
[13] Ian Foster, Carl Kesselman, and Steven Tuecke. 2001. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Int. J. High Perform. Comput. Appl. 15, 3 (August 2001), 200-222.
[14] C. Altinay, M. Binsteiner, L. Gross, and D. K. Weatherley, "High-Performance Scientific Computing for the Masses: Developing Secure Grid Portals for Scientific Workflows," e-Science (e-Science), 2010 IEEE Sixth International Conference on, pp. 254-260, 7-10 Dec. 2010. doi: 10.1109/eScience.2010.30.
[15] S. Cholia, D. Skinner, and J. Boverhof, "NEWT: A RESTful service for building High Performance Computing web applications," Gateway Computing Environments Workshop (GCE), 2010, 1-11.
[16] T. Goodale, S. Jha, H. Kaiser, T. Kielmann, P. Kleijer, A. Merzky, J. Shalf, and C. Smith, A Simple API for Grid Applications (SAGA), OGF Document Series 90, http://www.ogf.org/documents/GFD.90.pdf.
[17] Dietmar W. Erwin and David F. Snelling, UNICORE: A Grid Computing Environment, Proceedings of the 7th International Euro-Par Conference on Parallel Processing, Manchester, pp. 825-834, August 28-31, 2001.
[18] J. Louvel, T. Templier, and T. Boileau. Restlet in Action. Manning Press, 2012.
[19] Hibernate. http://www.hibernate.org.
[20] PHP Data Objects. http://php.net/manual/en/book.pdo.php.
[21] Quartz Enterprise Job Scheduler. http://quartz-scheduler.org.
[22] Mark A. Miller, Wayne Pfeiffer, and Terri Schwartz. 2011. The CIPRES science gateway: a community resource for phylogenetic analyses. In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery (TG '11). ACM, New York, NY, USA, Article 41, 8 pages. DOI=10.1145/2016741.2016785.
[23] Jo Erskine Hannay, Carolyn MacLeod, Janice Singer, Hans Petter Langtangen, Dietmar Pfahl, and Greg Wilson. 2009. How do scientists develop and use scientific software? In Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering (SECSE '09). IEEE Computer Society, Washington, DC, USA, 1-8.
J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
[24] Rick Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record, v. 39, n. 4, December 2010. doi: 10.1145/1978915.1978919.
[25] iPlant Foundation API Developer Documentation. https://foundation.iplantcollaborative.org/docs.
[26] OAuth2. http://oauth.net/2/.
[27] Y. Huang, A. Slominski, C. Herath, and D. Gannon, "WS-Messenger: A Web Services based Messaging System for Service-Oriented Grid Computing," 6th IEEE International Symposium on Cluster Computing and the Grid (CCGrid06).
[28] RabbitMQ. http://www.rabbitmq.com/blog/.
[29] Andrew Lenards, Nirav Merchant, and Dan Stanzione. 2011. Building an environment to facilitate discoveries for plant sciences. In Proceedings of the 2011 ACM Workshop on Gateway Computing Environments (GCE '11). ACM, New York, NY, USA, 51-58. DOI=10.1145/2110486.2110494.
[30] Lushbough, C., Jennewein, D., and Brendel, V. (2011). The BioExtract Server: a web-based bioinformatic workflow platform. Nucl. Acids Res., vol. 39, W528-W532.
[31] Easy Terminal Alternative. http://eta-pub.cgrb.oregonstate.edu/about.php.
[32] iPlant Foundation API Forums. https://foundation.iplantcollaborative.org/forums.