Software-as-a-Service: The iPlant Foundation API

Rion Dooley, Matthew Vaughn, Dan Stanzione, Steve Terry
Texas Advanced Computing Center
The University of Texas
Austin, USA
[dooley, vaughn, dan, sterry1]@tacc.utexas.edu

Edwin Skidmore
iPlant Collaborative
Tucson, USA
[email protected]

Abstract—The iPlant Foundation API (fAPI) is a hosted, Software-as-a-Service (SaaS) resource for the computational biology field. Unlike many other grid-based approaches to distributed infrastructure, the fAPI provides a holistic view of core computing concepts such as security, data, applications, jobs, and systems. It also provides the support services, such as unified accounting, provenance, metadata, and global events, needed to bind the core concepts together into a cohesive interface. Operating as a set of RESTful web services, the fAPI bridges the gap between the HPC and web worlds by supporting synchronous and asynchronous interfaces, web-based callbacks, unified access control lists (ACL), and a publication-subscription (pubsub) model that allows modern applications to interact with the underlying infrastructure using technologies and access patterns familiar to them. In its first year of operation, the fAPI has grown to support thousands of users across both the Plant and Animal Science domains. In this paper we describe the fAPI and its underlying architecture, and briefly describe its adoption before concluding with future plans.

Keywords—cloud; biology; iplant; data; saas; web service; rest

I. INTRODUCTION

The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that has created an innovative, comprehensive, and foundational cyberinfrastructure (CI) in support of plant biology research [1]. iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists on the use of cyberinfrastructure in research and education. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input. It leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences, and is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs.

iPlant provides a wide range of computational resources aimed at supporting different use cases. The iPlant Data Store is a persistent, high performance, distributed storage solution built upon iRODS [2]. Atmosphere is a highly customized, on-demand virtualization solution built upon Eucalyptus [3][4]. Multiple high performance computing (HPC) resources are provided free of charge through allocations on XSEDE [5], FutureGrid [6], the Open Science Grid (OSG) [7], and the University of Texas System [8]. iPlant also manages its own high throughput computing (HTC) resources at the University of Arizona. Each of these resources comes with its own set of interfaces, allocations, and access mechanisms.

To lower the barrier of entry to these resources and facilitate broader adoption, community-led iPlant steering committees identified a need for a programmatic access layer. In-person surveys and discussions from over 100 subsequent community workshops validated this recommendation and provided an initial set of requirements for the desired access layer to meet. Obvious capabilities such as the ability to submit jobs, track provenance, move data, and share work topped the list. Additional common requests from the community were to apply and manage metadata, interoperate with existing private resources, interact with third-party services, and support ontological discovery. Other, larger development projects requested developer-friendly features such as a Representational State Transfer (REST) [9] interface, callback notifications, token-based access, and support for asynchronous communication.

Driven by this initial set of requirements, the development team conducted an evaluation of current technologies to identify projects they could adopt or extend to provide such an access layer. This evaluation is discussed as background material in Section II.

II. BACKGROUND

The evaluation began with hosted services. Globus Online [10] provided a production quality, hosted data movement service, but at the time of this writing it required separate authentication mechanisms, provided no metadata support, and could not be embedded into existing services and applications. There is hope that these features will be provided in the near future, but until then, deep integration is not possible. The Cyberaide project [11] showed great potential as a hosted job submission service; however, the project is currently unsupported, the software suffers from scalability issues associated with large input data, and the model of wrapping each individual application in a virtual appliance does not match well with usage scenarios in which users leverage the service as a mechanism for rapid application development. Airavata [12] is a well supported workflow orchestration and execution service, but it currently lacks key permission and data access concepts requested by the iPlant community. Several discussions with the Airavata project team indicated that significant refactoring would be required to support these concepts, and such work was not in the scope of their current roadmap.

Next, the development team looked at Infrastructure-as-a-Service (IaaS) solutions. Commercial companies such as Amazon, Rackspace, and Microsoft provided solutions with mature web service offerings allowing customers to interact with their infrastructure. However, the primary focus of their web services was the administration of the resources they own. VMWare, OpenStack, Nimbus, and Eucalyptus all provided IaaS solutions similar to the hosted commercial offerings, but run on local systems. These were attractive solutions that iPlant could, and did, adopt to expose its underlying cloud resource, but neither the hosted nor the private IaaS solutions provide services of sufficient abstraction or breadth to meet the requirements of the user community.

Finally, the development team turned its evaluation focus to popular "middleware" projects such as the Globus Toolkit [13], Grisu [14], NEWT [15], SAGA [16], and Unicore [17]. Each project had its own advantages, but the focus was generally on a subset of the functionality desired by the iPlant user community.

The Globus Toolkit provided solutions for cross-site authentication, data movement, job submission, and information management, but worked in a way largely disconnected from the underlying compute and data systems. Exposing the Globus GRAM scheduler as an outward facing service to iPlant users would create the possibility of a discrepancy between queue status reported by the system schedulers and the Globus scheduler. The development team learned that in other projects this led to accounting and monitoring difficulties that could never be completely worked out. Another roadblock to adopting the Globus Toolkit as a top level solution was that interaction with the Globus services required a separate technology stack, which would be needed on every system communicating with iPlant. Though much improved in the past year, this software is still challenging for novice users to install and configure. Given the 7000 users iPlant currently supports, the burden that adoption of the Globus Toolkit would potentially add to the iPlant support staff weighed significantly in the final decision to leverage some of the lower level Globus components, such as GridFTP, for internal use, but refrain from exposing any at the user-facing level.

Grisu is an open source project developed by the Australian Research Collaborative Service's National Grid. It provides SOAP, REST, and GUI interfaces for job submission. The approach of Grisu was in line with community requirements; however, the scope of the project was much smaller than that of iPlant, and the underlying technology was aging. Several of its core dependencies had been deprecated and no effort was being made to update to alternative technologies. At the time of the initial review the Grisu development team was phasing out support of the project due to lack of interest and funding. As a result, there was little opportunity to collaborate.

NEWT is a web service that allows users to access computing resources at the National Energy Research Scientific Computing Center (NERSC) through a simple REST API. NEWT exposes services for authentication, system monitoring, file uploads and downloads, directory listings, running commands, submitting jobs to batch queues, obtaining accounting information, and accessing a persistent object storage. NEWT is an internal project at NERSC and, as such, is deeply tied to the underlying NERSC infrastructure. After evaluating the NEWT architecture, it was found that while it did meet many of the requirements, it would require nearly a complete rewrite to port NEWT onto the iPlant cyberinfrastructure.

The Simple API for Grid Applications (SAGA) is a programmatic interface for managing jobs and data on a computational grid. It currently has bindings in Java, C++, and Python; however, it lacks any web service interface.

Unicore is a comprehensive middleware solution similar to the Globus Toolkit, but exposed as a set of SOAP web services. One benefit of Unicore is its use of community established protocols for job submission, information management, and authentication. The primary drawbacks to Unicore are the large buy-in required to leverage it, the technology stack required by clients to interact with the SOAP services, and the system-level installations required by both the clients and resource providers. Given iPlant's lack of administrative access to the underlying HPC hardware and the requirement for REST and web-accessible services, Unicore could not be an outward-facing technology in the desired iPlant programmatic access layer.

After completing the evaluation process, the development team determined that, for lack of a better description, there was abundant plumbing, but no kitchen sink. Despite the needs of the community, there was no solution that allowed significant re-use of existing code bases to meet the needs of the plant science community, and so the development team designed and created the Foundation API (fAPI). The remainder of this paper is organized as follows. Section III gives an overview of the fAPI design. Section IV describes the individual services comprising the API. Section V provides usage and adoption metrics from the fAPI's first year of operation, and the final section concludes with a brief roadmap for future development.

III. DESIGN AND ARCHITECTURE

The Foundation API is designed to operate as a multitenant, cloud-based, Software-as-a-Service (SaaS) solution. Two physical instances of the fAPI services sit behind an Apache load balancer that directs requests between the primary instance at the Texas Advanced Computing Center and the mirror instance at the University of Arizona. This single logical instance of the fAPI is available at https://foundation.iplantcollaborative.org and serves iPlant users and partner projects alike.

The fAPI services are implemented in Java and PHP as RESTful web services. The Java services leverage the Restlet [18] framework and run in a Tomcat 6 container. Hibernate is used to handle the ORM to a shared MySQL database [19].
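As a minimal illustration of the RESTful interface style just described, the Python sketch below shows how a client might call a protected fAPI-style endpoint over HTTP Basic. The route and account names here are hypothetical assumptions, not routes taken from the fAPI documentation; per the paper, a token obtained from the Auth service can stand in for the user's password.

```python
import base64
import urllib.request

# Base URL of the hosted fAPI instance; the route below ("/io/list/...")
# is purely illustrative and not a documented fAPI endpoint.
BASE = "https://foundation.iplantcollaborative.org"

def basic_auth_request(path, username, secret):
    """Build an HTTP Basic request over SSL. The `secret` may be either a
    password or, as the paper notes, a token issued by the Auth service."""
    req = urllib.request.Request(BASE + path)
    cred = base64.b64encode(f"{username}:{secret}".encode()).decode()
    req.add_header("Authorization", f"Basic {cred}")
    return req

# A token simply replaces the password in the Basic credentials:
req = basic_auth_request("/io/list/jdoe", "jdoe", "a1b2c3token")
```

Because the token is carried in ordinary Basic credentials, any HTTP-capable client or library can consume the API without grid-specific tooling.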
The scenario shown in Fig. 2 occurs when a user requests that the IO service fetch a file from an external data source, preprocess or convert it in some way, and store it in the iPlant Data Store. While fulfilling the request, the task passes through several queues, keeping the front-end service aware of its progress. Once the task completes, an optional notification is sent to the user via email or web hook, alerting them that their file staging request has finished.
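A staging request like the one above might be expressed as a small JSON document naming the source, an optional transform, a destination, and a notification target. The field names below are illustrative assumptions rather than the documented fAPI parameters; the `${STATUS}` placeholder mimics the template system the paper describes for customizing callback URLs.

```python
import json
from string import Template

# Hypothetical payload for an asynchronous file-staging request; these
# field names are invented for illustration, not the real fAPI fields.
staging_request = {
    "source": "ftp://ftp.example.org/genomes/raw_reads.fastq",
    "transform": "fastq-to-fasta",  # optional pre-processing step
    "destination": "/iplant/home/jdoe/data/reads.fasta",
    # Web-hook callback; the service fills in runtime values when it
    # POSTs the completion notification.
    "callbackUrl": "https://myapp.example.org/hooks/${STATUS}",
}
body = json.dumps(staging_request)

# Toy illustration of how the callback URL template would be resolved
# once the transfer completes.
resolved = Template(staging_request["callbackUrl"]).safe_substitute(STATUS="COMPLETE")
```

The same fire-and-forget pattern applies whether the callback is an email address or a web hook: the client submits the request and is free to disconnect.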

Figure 1. Conceptual overview of the Foundation API architecture.
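The queue-based hand-off in the architecture overview above, where front-end services enqueue tasks and workers claim them, can be modeled with a toy in-memory queue. In the real system the queue is a persistent Java service, so the Python fragment below is only a sketch of the claiming behavior, with all names invented.

```python
import queue

# Toy model of the front-end/worker hand-off: the front-end never talks
# to a worker directly; it only enqueues tasks, and any idle worker
# claims the next task as it appears.
task_queue = queue.Queue()

def front_end_submit(task):
    """Front-end service: accept the request and hand it to the queue."""
    task_queue.put(task)

def worker_claim():
    """Worker service: block briefly on the queue and claim a task. If a
    worker died mid-task, the task would be re-queued for another worker."""
    return task_queue.get(timeout=1)

front_end_submit({"type": "staging", "source": "ftp://ftp.example.org/a.txt"})
task = worker_claim()
```

Decoupling submission from execution this way is what lets the worker pool grow and shrink with load, as described in the text.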

Figure 2. A typical flow of execution when an asynchronous call is made to the fAPI. In this figure, the user requests a file be staged in from an external data source, transformed, and stored in the iPlant Data Store.

The PHP services leverage portions of the CodeIgniter framework and utilize the native PHP 5 PDO support [20]. The fAPI has both public and protected interfaces, depending on the nature of the data they expose. The protected endpoints are accessible via HTTP Basic over SSL. While iPlant uses LDAP as the backing identity store for the entire project, the fAPI does not expose this directly, but rather supports token-based authentication. Consumers request a token from the Auth service using their client credentials and use that token to interact with the rest of the API until it expires.

Supporting the front-end services is an elastic set of worker services. These workers handle the lengthy task requests made to the front-end services, such as file staging, job submission, and data transformation. The workers are implemented in Java as headless services running on distributed virtual machines. The workers do not directly communicate with the front-end services; the front-end services submit tasks to the underlying queue service, and the workers watch their respective queues and claim tasks as they appear. The queue service is implemented in Java using the Quartz scheduling framework backed by a MySQL database [21]. A separate Event service complements the queue service and provides publication-subscription (pubsub) functionality to the API.

As the overall workload on the fAPI increases, scaling up is trivial. Additional worker services are started as new virtual machines (VM) are spun up. When the overall workload decreases, the VMs can be shut down. If a worker fails or becomes unresponsive, another worker will claim the lost task and carry on. Fig. 1 illustrates the overall fAPI architecture.

As mentioned above, the fAPI services support both synchronous and asynchronous interaction. This allows client applications to fire and forget long running tasks such as job submission and data movement.

Because iPlant does not own the underlying HPC, storage, or network resources needed to conduct much of the science it enables, the quickest time to production came from running the fAPI under a single community account and brokering actions on behalf of its internally vetted users. The complexity of implementing the services in this way was significantly less than that of supporting authentication and usage scenarios outside of the project. Using a community account also made provenance achievable across the entire enterprise and placed a much smaller burden on the system administrators. The primary disadvantage of this approach was that it placed a significant security responsibility on the development team and increased the level of reporting needed at all levels of the system. It is also not a sustainable approach from a resource utilization standpoint. This was one of the lessons learned from the CIPRES project [22]: when useful software is available to the community, community requests will quickly and forevermore outpace the available resources. Given the already oversubscribed state of the underlying shared HPC resources, it is not possible to continue to grant larger and larger percentages of the available cycles to a single project. Thus, in the long term, the fAPI must move to enable users to charge work to their individual system allocations when appropriate. This topic is addressed more in the section on future work.

Three other considerations influenced the overall design of the fAPI without impacting the architecture. One of the most frequently requested features from the community during the requirements gathering phase was collaborative support. Users wanted a "share anything" model that would allow them to work with past, present, and future colleagues regardless of their affiliation with iPlant. As a result of this requirement, flexible access control lists (ACL) were built into all relevant pieces of the fAPI, allowing users to share anything at any time with anyone.

Second, the importance of a clean, friendly API became readily apparent for a variety of reasons. A 2008 study of nearly 2000 researchers found that over 96% of those surveyed said that self-study was important when learning how to develop software, while only 30% could say the same for formal education. The same study showed that 84% of the researchers surveyed claimed software development was critical to their research, yet they were only able to allocate 30% of their time to it [23]. Two things jump out from this study. First, the vast majority of users will not be professional programmers and will have neither significant experience nor the time to learn new techniques and methodologies. In many cases users are scientists who are simply programming out of necessity. For these researchers, the learning curve required to begin building their software can be daunting. They first need to discover what services are available to them. Next, given a set of disjoint services, they must decipher how to piece them together using different communication protocols, authentication mechanisms, and access patterns. Once they figure that out, they must familiarize themselves with several different APIs in order to write the necessary integration code. Once all that is in place, they can begin debugging the idiosyncrasies of each scheduler, compute, and storage system to find out why their integration code doesn't work as advertised. Finally, after all that, they can begin building the application they originally set out to build. Given the prevalence of graduate students doing much of the work and the inherent turnover rate of such employees, setting the barrier to entry that high is detrimental to the goal of advancing science.

The results of that study corresponded with observations made by the iPlant User Support and Engagement teams during the first 2.5 years of the project. Thus, as part of the design process, the development team took lessons from other popular APIs used in the commercial space, such as Yelp, Dropbox, Flickr, Facebook, Amazon, and Paypal. User stories were constructed around mobile, web, desktop, and command-line usage scenarios. The result was an API that erred on the side of usability rather than compliance. It promoted progressive user buy-in rather than forcing vendor lock-in. It promoted a grocery store approach to adoption by allowing users to take what they need and leave the rest for later.

Lastly, the development team, through many conversations with the Discovery Environment development team as well as external projects, identified great value in providing a solution that, from a user perspective, worked the way the rest of the web worked. It should be designed in such a way that it could be easily consumed and mashed up with other services in a short amount of time. It should recognize that polling for information is a bad design pattern and allow users to adopt better access patterns when they are ready. It should understand that every application has its own concept of "fresh" information and allow each to access information at whatever rate and level it feels necessary.

These three considerations were major influences on the design of the fAPI. In the next section we bring together these design decisions, along with several more subtle ones, to describe the individual services of the fAPI.

IV. FOUNDATION API SERVICES

A. Authentication Services

The first interaction users have with the fAPI is with authentication. Users access protected endpoints in the API using an authentication token, which they obtain from the Auth service. The Auth service uses LDAP to authenticate users and issues renewable, short-term credentials in the form of tokens. The service supports token revocation, blacklisting, delegation, and fixed usage scenarios. Tokens are implemented simply as alphanumeric strings and can be used in place of a user's password when making HTTP Basic requests to the protected service endpoints.

One use case not addressed by the Auth service is that of low-trust scenarios. Oftentimes, user workflows include interaction with third-party services. Granting these services unlimited access to the user's account is highly undesirable. The Auth service can issue single use tokens, but it does not restrict the URL for which that token is valid. As a result, that token could be used to invoke any action using the API, not just the one intended by the user. The PostIt service was created for this purpose.

The PostIt service is a pre-authenticated URL shortening service. Similar to popular URL shortening services such as TinyURL.com and Bit.ly, the PostIt service creates an obfuscated, short URL out of any endpoint in the fAPI. In the case of protected endpoints, the user can pre-authenticate the URL by authenticating to the PostIt service when creating the URL. As with the Auth service, the user can restrict the lifetime and number of uses of the generated URL, but with the PostIt service that permission is restricted to the registered URL. When the third-party service invokes the PostIt URL, the PostIt service adds the appropriate authentication token and proxies the request to the original URL. Any query parameters, form variables, or attachments are forwarded on, and the response is relayed back. As with the Auth service, PostIt URLs can be revoked as needed.
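The difference between a bare single-use token and a PostIt grant can be modeled in a few lines: the grant is bound to the one URL registered at creation time, in addition to carrying a use count and a lifetime. The Python sketch below illustrates the access rules described above; it is a toy model, not the service's implementation, and all names are invented.

```python
import time

class PostItGrant:
    """Toy model of PostIt semantics: unlike a bare single-use token,
    the grant is valid only for the URL registered when it was created,
    with a limited lifetime and number of uses."""

    def __init__(self, url, max_uses=1, lifetime=3600):
        self.url = url
        self.remaining = max_uses
        self.expires = time.time() + lifetime

    def redeem(self, requested_url):
        # Reject wrong URLs, exhausted grants, and expired grants.
        if (requested_url != self.url
                or self.remaining <= 0
                or time.time() > self.expires):
            return False
        self.remaining -= 1
        return True

grant = PostItGrant("https://foundation.iplantcollaborative.org/io/download/jdoe/results.csv")
ok = grant.redeem("https://foundation.iplantcollaborative.org/io/download/jdoe/results.csv")
denied = grant.redeem("https://foundation.iplantcollaborative.org/apps/list")
```

Binding the credential to a single URL is what makes PostIts safe to hand to third-party services that should perform exactly one action on the user's behalf.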
It is anticipated that the service will hold on the The IO service is the core data movement service of the order of a billion pieces of information in its first year. Foundation API. It serves as a value-added abstraction layer on Querying such a dataset becomes challenging as that number top of the iPlant Data Store.. The IO service, like the Data grows. To improve search times, various methods of data Store, provides basic file operations and exposes fine-grained reductions are being examined. We anticipate the Meta service ACL. Data can be transferred both synchronously via HTTP to be available in full production early in 2013. GET and POST, and asynchronously by requesting a data C. Execution Services transfer from an external data source using any of a number of protocols. Current protocols supported are FTP, GridFTP, The Apps service is designed to serve as a central registry for applications deployed in the iPlant software infrastructure. SFTP, HTTP, IRODS, and Amazon S3. When requesting data from the IO service, data can be requested in whole or in part In concert with the Jobs API, bioinformatic analysis applications can be deployed on a range of platforms, ranging using partial data queries. When scheduling long running transfers, a callback can be specified. Callbacks can be an from leadership cluster systems, to virtual machines running in an environment such as Atmosphere, to remote RESTful email address or a web hook in the form of a HTTP endpoint to which the service will POST a response. To allow services. Support for truly heterogeneous, user-defined computing infrastructure such as Amazon EC2 and Google customization of the notifications, a template system is provided for use in defining the callback URL query. App Engine is an area of future growth in development now. 
Users can search for applications by search terms such as The IO service also provides a chained pre-processing mechanism that allows users to transform data upon ontological terms, tags, name, and unique ID. As a convenience for users building their own web GUIs, the Apps upload/import before it reaches its final destination. This allows raw data to be imported, yet arrive in a polished form. service also generates HTML forms for registered applications that can be posted to the jobs service for execution. This is particularly useful when importing database dumps, raw experimental output, and data from deprecated file formats, but Applications are available both publicly and privately. Public applications are administered and maintained by iPlant staff also handy for automatically compressing or decompressing data and running metadata extraction on datasets as they come and made available for anyone to use. Private applications are registered by users and, by default, can be discovered and run in. only by their owners. Applications have their own ACL, so The Data service was created in response to the need for private applications may be shared with other users by setting scientists to collaborate without having to learn many different the appropriate permissions. software packages and data formats. The Data service acts as a Application registration is accomplished by posting a JSON Rosetta stone for plant biologists allowing them to convert their existing data from one format to another and stage it to a description of the application to the Apps service. The JSON description includes basic information about the application location they specify using any of the protocols supported by the IO service. There are currently 49 different data formats name, version, target system, invocation method, and any required inputs, parameters, and outputs. Currently supported supported by the Data service. 
It is not possible to transform in every pairwise manner, so the data service provides a invocation methods include CLI, SGE, PBS, LOADLEVELER, LSF, TORQUE, and CONDOR. The mechanism to determine the available transformations given a description of the inputs and outputs is used exclusively for source data format. Like the IO service, the Data service supports both synchronous and asynchronous requests, validating job submission requests and determining what information is needed by the executable to run. The process of callbacks, and partial data queries. wrapping an application is out of the scope of this paper. For While the IO, Transfer, and Data services exist to aid in the more information on this process, consult the API management of raw data, the Meta service allows for the documentation in [25]. management of metadata. The Meta service provides metadata The Job service is the core execution service in the API. Its support for all of the fAPI services. It exists as an unstructured data store backed by a column-oriented NoSQL database[24]. sole purpose is to start and manage jobs across the underlying systems. Users POST a job request to the service and the Because metadata is, by definition, data about data, it has no inherent meaning. Thus, the Meta service allows users to service validates that request against the application description registered with the Apps service. A job handle is returned to the register schemas that describe their data. Schemas are JSON or XML descriptions of data. These descriptions may be a series user and the job request is forwarded on to the submission queue. A job submission worker will pull the job off the queue, of tuples, or highly structured object definitions. The decision on how to describe their data is left up to the user. 
In order to transfer in any missing data using the IO service and stage all dependencies to the remote system before submitting the job to bridge the gap between different schemas, users can register mappings between schemas. These mappings allow the Meta the remote scheduler or forking the process. When the job completes, a callback will be made to the user. Like the IO service to identify opportunities to do apples-to-apples comparisons between metadata associated with different service, the Job service supports a template system to enable the user to dynamically construct a callback URL at run time. schemas. This is helpful when doing global searches, constructing user-derived responses, and when using metadata Jobs, like applications and files, have their own ACL and to reconcile different raw data sets. The Meta service is can be shared between iPlant users as well as the general currently in initial testing while undergoing scalability public. To support quick access to job data, the Job service provides convenience endpoints to easily locate and browse will change in the coming months as the size of data mandates their output folders. This is helpful given that a job’s output a move towards a NoSQL solution. folder may or may not exist and changes over time. When the The Event service provides pubsub capabilities to the rest job is running, the folder is on the execution host. Once a job has finished, the folder is archived in the Data Store. If the job of the fAPI. When a work action is taken in the fAPI such as a data or job operation, an event is triggered and pushed onto the was not archived or failed right away, there will not be an output folder associated with the job. user’s event queue. The message is then available to anyone subscribed to that queue. The fAPI currently uses a custom Archiving is built into the job service. 
By default, all data Event service implementation based on the OGCE WS- generated during job execution is archived back to the Data Message project [27]. This may change going forward in Store and placed in a location specified by the user or, if not anticipation of broad RabbitMQ [28] support across the rest of specified, in an archive folder generated by the service in the the project. user’s home space. The Manager service is a catchall service for controlling the D. User Services behavior of the other services. It provides reporting as well as The Profile service acts as a directory service for iPlant. It the ability to shut down and drain work from the IO, data, and allows users to view information about other iPlant users such jobs worker services. It is a critical service in the Foundation as their name, contact info, and institutional affiliation. The API that we restricted exclusively to administrative users. identity management within iPlant is currently going through V. USAGE AND ADOPTION an upgrade as the project moves towards providing OAuth2 [26] support. As such, the profile service is currently available The fAPI was released for public usage in September of as a read-only service. Please consult the Future Plans section 2011. Since then, over 250 users have leveraged the fAPI to for more information about fAPI roadmap for identity enable projects representing more than 10,000 scientists management. worldwide. Users burned more than half a million SUs running over 4000 jobs across HPC systems at PSC, SDSC, and TACC. The Audit service provides detailed usage and accounting The fAPI data services now serve access to nearly half a information across the iPlant cyberinfrastructure. From the petabyte of data. Fig. 3 shows the approximate usage history Audit service, users can check their allocation status and over the first year of operation. 
quotas, obtain reports on their system-level usage history, their API usage history, and obtain their remaining balances across resources. The Audit service is targeted towards individual users rather than the organization as a whole. As a result, separate administrative interfaces are available for staff to create systemwide and multi-user reports. E. Administration Services The Systems service is a discovery endpoint for the fAPI. It provides information on what systems are currently available through the Jobs service for execution and what the current status of those systems is. This service is currently read-only. The section on Future Plans describes the ongoing work of the development team to support user-defined dynamic resource registration. Figure 3. Foundation API monthly usage history. Usage increased consistently The Monitor service gives users a view into the health of over the year with anticipated dips during breaks in the academic calendar. the fAPI. Its main role is to tell users if any part of the fAPI is down and why. This information is obtained by running a suite In the remainder of this section we highlight three projects of tests as a cron process that check the availability and who have made significant contributions to this usage by response accuracy of the entire fAPI, its dependent services, engaging with the fAPI development team to integrate the fAPI and the underlying systems. Test results are pushed into a deeply into their existing applications. These projects are the separate MySQL database and exposed by the Monitor service. iPlant Discovery Environment, the BioExtract Server, and the Both the service and database are hosted physically separated Easy Terminal Alternative. from the rest of the services to ensure that it is accessibility The first adopter of the fAPI was the iPlant’s own even in the case of catastrophic failure in the rest of the fAPI. Discovery Environment (DE) team [29]. 
The first adopter of the fAPI was iPlant's own Discovery Environment (DE) team [29]. Said Lead Developer Andrew Lenards, "The a la carte approach to the Foundation API was extremely attractive to an integrating system like the Discovery Environment. It primarily means that the system can leverage the portions of the framework that make the most impact. For the Discovery Environment's current implementation, this includes the Foundation API's Data, Auth, Apps, and Jobs services. A token-based authentication approach available from the Auth service allows actions to be completed on behalf of the Discovery Environment's users. Aspects of the Data service are leveraged to enable users to manage their file-based data. And a simple 'shim' script allows the Discovery Environment to off-load computationally intensive applications to the appropriate resources available at TACC. Through this simple script, the Discovery Environment can access applications deployed within the TACC and XSEDE environments, execute them, and monitor their status via commodity hardware using both the Apps and Jobs services.

"The greatest value-added feature gained through integration with the Foundation API, however, was its monitoring capabilities. Experienced software developers are all too aware of the performance implications of polling, yet are often forced to implement such functionality due to a lack of effective alternatives. Using a pub-sub model, the Foundation API does not force an integrator to poll; they just need to operate a simple endpoint that allows the integrating system to subscribe to updates via webhooks available throughout the API. This makes bubbling up the status of operations within the Foundation API easily achievable. The Discovery Environment has benefited tremendously from its adoption of the Foundation API and, as our needs grow over time, our adoption of other Foundation API services will increase."

The primary lesson learned working with the DE team was the cost of technological debt. The DE team started development a full year before the fAPI. As a result, they had hundreds of thousands of lines of code already deployed and working, and even small changes to their infrastructure could have significant impacts on the overall system, both in terms of stability and developer effort. For them, the return on investment had to significantly outweigh the cost of any change they introduced. That placed a large responsibility on us to ensure that our services were stable and that we could clearly communicate the value of individual fAPI services so they could find integration points that fit into their existing development cycle.

The BioExtract Server is an open, Web-based system designed to aid researchers in the analysis of genomic data by providing a platform to facilitate the creation of bioinformatic workflows. Scientific workflows are created within the system by recording tasks performed by the user. These tasks may include querying multiple, distributed data sources, saving query results as searchable data extracts, and executing local and Web-accessible analytic tools. The series of recorded tasks can then be saved as a reproducible, sharable workflow available for subsequent execution with the original or modified inputs and parameter settings [30]. Over the last year, Dr. Carol Lushbough and the BioExtract team have leveraged the fAPI to enable researchers to execute iPlant analytic tools within the BioExtract Server, create iPlant workflows through the BioExtract Server, and manage analysis and provenance data across both platforms. This collaboration has produced a unified authentication mechanism between the two projects using the Auth service, an expansion of their data capacity by several orders of magnitude using the IO and Data services, the addition of over two dozen new scientific codes to their application through the Apps service, and over 500 HPC jobs run across systems at TACC and SDSC using the Jobs service.

The biggest lesson learned from working with the BioExtract team was the need to respond quickly and manage change well. The BioExtract Server already had an active user community relying on it for their daily work, so any changes made to the fAPI after integration needed to ensure backward compatibility or their application would break. As a result, the development team spent time developing processes for releasing updates, versioning the services, migrating users between versions, and ensuring that provenance was pervasive across the fAPI. These processes made the fAPI a better product overall and helped us manage our larger users more effectively.

Easy Terminal Alternative (ETA) is a web-based front end to any infrastructure that allows novice users to submit, monitor, and share jobs [31]. ETA is designed specifically to transition users from the terminal to the web by easing the running of applications, streamlining the creation of scientific pipelines, and managing application lifecycles. The ETA development team at the Center for Genome Research and Biocomputing, led by Alex Boyd, used the fAPI Auth, Jobs, IO, and Apps services to provide overflow cycles for their campus users when their system reached full capacity. Leveraging the webhook features of the Jobs service, ETA was able to reconstruct existing pipelines using their existing workflow execution engine.

The primary takeaway from the engagement with the ETA team was the need for multiple levels of documentation. While we had reference documentation in place for the fAPI, we did not have the big-picture description that explained the concepts of how the fAPI worked and its value proposition to potential users. To that end, a developer website will be released with version 2 of the fAPI.

VI. FUTURE PLANS

Looking forward, the Foundation API has an exciting roadmap. In the coming months the development team will release the 2.0 version of the API, expanding functionality in several new directions while increasing performance and reliability in the existing services. Provenance will continue to be improved across the entire iPlant cyberinfrastructure: a global UUID will be given to every digital object that touches the iPlant cyberinfrastructure, and the Meta service currently in testing will be rolled out as a first-class service as well.

A new area of focus for the development team is authentication and identity management. A new OAuth 2 service will be deployed that will provide better accounting, provenance, and control over individual services than was previously available. At the same time, a new Systems service will be deployed that allows users to register compute and data resources external to iPlant for use through the fAPI. This will move the fAPI closer towards a true SaaS offering and help bridge the gap for users between iPlant and their local and campus resources.

Through a joint effort with the HPC and HTC system providers, a global RabbitMQ-based event system will be deployed in the near future. This will allow for better monitoring and information management across the entire system.

Finally, the Apps and Jobs services will continue to expand with additional functionality. Support for new platforms and
execution mechanisms is ongoing and will bring commercial cloud support to the fAPI in the coming months. In order to support interoperability with other job execution services, additional job and application description formats will be supported. The addition of the new OAuth service will also impact the Jobs service in a positive way: once available, job submission using the user's personal system account and allocation will be supported.

This list of plans is in no way complete, but hopefully it casts a vision of the direction in which development is currently headed. For more information on the fAPI and to follow along with its development, consult [25] and [32].

ACKNOWLEDGMENT

The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191). This work was also partially supported by a grant from the National Science Foundation Cybersecurity Program (#1127210).

REFERENCES

[1] Dan Stanzione, "The iPlant Collaborative: Cyberinfrastructure to Feed the World," IEEE Computer, November 2011. doi: 10.1109/MC.2011.297.
[2] A. Rajasekar, M. Wan, R. Moore, and W. Schroeder, "A Prototype Rule-based Distributed Data Management System," HPDC Workshop on Next Generation Distributed Data Management, May 2006, Paris, France.
[3] E. Skidmore, S. J. Kim, S. Kuchimanchi, S. Singaram, N. Merchant, and D. Stanzione, "iPlant Atmosphere: A Gateway to Cloud Infrastructure for the Plant Sciences," Proceedings of Gateway Computing Environments 2011 at Supercomputing11 (2011).
[4] Eucalyptus. http://www.eucalyptus.com.
[5] The Extreme Science and Engineering Discovery Environment. http://xsede.org.
[6] FutureGrid: An Experimental, High-Performance Grid Test-Bed. http://portal.futuregrid.org.
[7] Mine Altunay, Paul Avery, Kent Blackburn, Brian Bockelman, Michael Ernst, Dan Fraser, Robert Quick, Robert Gardner, Sebastien Goasguen, Tanya Levshina, Miron Livny, John Mcgee, Doug Olson, Ruth Pordes, Maxim Potekhin, Abhishek Rana, Alain Roy, Chander Sehgal, Igor Sfiligoi, and Frank Wuerthwein. 2011. A Science Driven Production Cyberinfrastructure--the Open Science Grid. J. Grid Comput. 9, 2 (June 2011), 201-218.
[8] The University of Texas System. http://www.utsystem.edu.
[9] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, Information and Computer Science, University of California, Irvine, California, USA, 2000.
[10] Ian Foster. 2011. Globus Online: Accelerating and Democratizing Science through Cloud-Based Services. IEEE Internet Computing 15, 3 (May 2011), 70-73.
[11] T. Kurze, Lizhe Wang, G. von Laszewski, Jie Tao, M. Kunze, D. Kramer, and W. Karl, "Cyberaide onServe: Software as a Service on Production Grids," Parallel Processing (ICPP), 2010 39th International Conference on, pp. 395-403, 13-16 Sept. 2010. doi: 10.1109/ICPP.2010.47.
[12] Suresh Marru, Lahiru Gunathilake, Chathura Herath, Patanachai Tangchaisin, Marlon Pierce, Chris Mattmann, Raminder Singh, Thilina Gunarathne, Eran Chinthaka, Ross Gardler, Aleksander Slominski, Ate Douma, Srinath Perera, and Sanjiva Weerawarana. 2011. Apache Airavata: a framework for distributed applications and computational workflows. In Proceedings of the 2011 ACM Workshop on Gateway Computing Environments (GCE '11). ACM, New York, NY, USA, 21-28.
[13] Ian Foster, Carl Kesselman, and Steven Tuecke. 2001. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Int. J. High Perform. Comput. Appl. 15, 3 (August 2001), 200-222.
[14] C. Altinay, M. Binsteiner, L. Gross, and D. K. Weatherley, "High-Performance Scientific Computing for the Masses: Developing Secure Grid Portals for Scientific Workflows," e-Science (e-Science), 2010 IEEE Sixth International Conference on, pp. 254-260, 7-10 Dec. 2010. doi: 10.1109/eScience.2010.30.
[15] S. Cholia, D. Skinner, and J. Boverhof, "NEWT: A RESTful service for building High Performance Computing web applications," Gateway Computing Environments Workshop (GCE), 2010, 1-11.
[16] T. Goodale, S. Jha, H. Kaiser, T. Kielmann, P. Kleijer, A. Merzky, J. Shalf, and C. Smith, A Simple API for Grid Applications (SAGA), OGF Document Series 90, http://www.ogf.org/documents/GFD.90.pdf.
[17] Dietmar W. Erwin and David F. Snelling, UNICORE: A Grid Computing Environment, Proceedings of the 7th International Euro-Par Conference on Parallel Processing, Manchester, pp. 825-834, August 28-31, 2001.
[18] J. Louvel, T. Templier, and T. Boileau. Restlet in Action. Manning Press, 2012.
[19] Hibernate. http://www.hibernate.org.
[20] PHP Data Objects. http://php.net/manual/en/book.pdo.php.
[21] Quartz Enterprise Job Scheduler. http://quartz-scheduler.org.
[22] Mark A. Miller, Wayne Pfeiffer, and Terri Schwartz. 2011. The CIPRES science gateway: a community resource for phylogenetic analyses. In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery (TG '11). ACM, New York, NY, USA, Article 41, 8 pages. DOI=10.1145/2016741.2016785.
[23] Jo Erskine Hannay, Carolyn MacLeod, Janice Singer, Hans Petter Langtangen, Dietmar Pfahl, and Greg Wilson. 2009. How do scientists develop and use scientific software? In Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering (SECSE '09). IEEE Computer Society, Washington, DC, USA, 1-8.
J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
[24] Rick Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record, v. 39, n. 4, December 2010. doi: 10.1145/1978915.1978919.
[25] iPlant Foundation API Developer Documentation. https://foundation.iplantcollaborative.org/docs.
[26] OAuth2. http://oauth.net/2/.
[27] Y. Huang, A. Slominski, C. Herath, and D. Gannon, "WS-Messenger: A Web Services based Messaging System for Service-Oriented Grid Computing," 6th IEEE International Symposium on Cluster Computing and the Grid (CCGrid06).
[28] RabbitMQ. http://www.rabbitmq.com/blog/.
[29] Andrew Lenards, Nirav Merchant, and Dan Stanzione. 2011. Building an environment to facilitate discoveries for plant sciences. In Proceedings of the 2011 ACM Workshop on Gateway Computing Environments (GCE '11). ACM, New York, NY, USA, 51-58. DOI=10.1145/2110486.2110494.
[30] Lushbough, C., Jennewein, D., and Brendel, V. (2011). The BioExtract Server: a web-based bioinformatic workflow platform. Nucl. Acids Res., vol. 39, W528-W532.
[31] Easy Terminal Alternative. http://eta-pub.cgrb.oregonstate.edu/about.php.
[32] iPlant Foundation API Forums. https://foundation.iplantcollaborative.org/forums.