An Execution Service for Grid Computing
Warren Smith, Computer Sciences Corporation, NASA Ames Research Center
Chaumin Hu, Advanced Management Technology Inc., NASA Ames Research Center

Abstract

This paper describes the design and implementation of the IPG Execution Service that reliably executes complex jobs on a computational grid. Our Execution Service is part of the IPG service architecture whose goal is to support location-independent computing. In such an environment, once a user ports an application to one or more hardware/software platforms, the user can describe this environment to the grid, the grid can locate instances of this platform, configure the platform as required for the application, and then execute the application. Our Execution Service runs jobs that set up such environments for applications and executes them. These jobs consist of a set of tasks for executing applications and managing data. The tasks have user-defined starting conditions that allow users to specify complex dependencies including tasks to execute when tasks fail, a frequent occurrence in a large distributed system, or are cancelled. The execution task provided by our service also configures the application environment exactly as specified by the user and captures the exit code of the application, features that many grid execution services do not support due to difficulties interfacing to local scheduling systems.

1. Introduction

The NASA Information Power Grid (IPG) project [2, 13] is one of the original grid computing projects and our goal has been to integrate, develop, and deploy a set of grid services to enable scientific discovery. The scientists we support perform tasks such as designing and analyzing aerospace vehicles, investigating the Earth’s climate, and archiving and analyzing astronomical data. We have based our grid on the Globus toolkit [11] and we are currently in the process of migrating from version 2 of Globus (GT2) to version 3 of Globus (GT3). We have also deployed services such as the Storage Resource Broker [4] and Condor [5].

While we have found existing grid services to be usable, they do not always satisfy all of our needs. In particular, we have found that the collection of available grid services and software does not add up to a usable grid. There are many reasons for this, but a few examples are that users still need to know details about the resources they want to use so that they can configure their applications to use the resources, and that users must handle even simple failures rather than the grid handling them.

For the past two years, the NASA Information Power Grid (IPG) project has been developing higher-level grid services that attempt to create a grid to address these problems. The services we are developing include resource brokering, automatic software dependency analysis and installation, configuring execution environments, and policy-based access control. In addition, we have developed the service we describe in this paper: an Execution Service to reliably execute complex jobs in a grid environment.

The jobs sent to our Execution Service consist of a set of tasks for executing applications and managing data. A job can consist of only a few, or a large number of tasks. Our service executes the tasks in a job based on user-defined starting conditions for each task, where the starting conditions are based on the states of other tasks. This formulation allows users to describe jobs that have tasks that execute in parallel and also tasks to execute when other tasks fail, a frequent occurrence in a large distributed system like a computational grid, or when the user cancels tasks. Another important feature of our Execution Service is that when it executes an application, the application is executed in the environment exactly as specified by the user and the exit code of the application is captured. This does not occur with many grid execution services because of difficulties interfacing to local scheduling systems.

This paper begins in the next section with a brief overview of the IPG service architecture and a description of how our Execution Service fits within this architecture. Section 3 provides an overview of the functionality of our Execution Service. Section 4 provides more information on the task-based job model our service supports. Section 5 describes how we are implementing our service as an OGSI service using the Globus toolkit. Section 6 presents related work and we provide our conclusions and future work in Section 7.

2. IPG Service Architecture

Our experience with Grid Computing has been that while there is a large amount of software available from various sources, this software does not add up to a very usable system once it is deployed. Functionality is missing from the software, the software is not as reliable as we would like, and resource differences are not hidden from our users so they end up needing to know a large amount of information about resources and their peculiarities. Our goal in the IPG is to provide a grid environment that addresses these problems and provides value to our users. To accomplish this, we are focusing on making Grid computing location-independent. What we mean by this is that once a user has an application that can execute on a certain hardware/software platform or platforms, the user can describe this environment to the grid, the grid can locate instances of this platform that can be used for the application, the grid can configure the platform as required for the application, and the grid can then execute the application.

Our approach to providing this location-independent environment is to build our own set of services and to use grid services implemented elsewhere. We more exactly describe our problem as providing support for location-independent execution of workflows. Figure 1 shows our architecture and provides an overview of the current status of our services. A workflow consists of a set of tasks and the dependencies between these tasks, where the dependencies consist of both control and data dependencies. Tasks consist of simple tasks, such as those for application execution and file management, and composite tasks that contain other tasks. A workflow is sent to a Workflow Manager to execute. The Workflow Manager decides which portion of the workflow to execute and asks the Resource Broker for resource suggestions for each task.

The Resource Broker makes suggestions using user-specified requirements such as resource type and user-specified preferences such as quick completion. The requirements sent to the broker describe the hardware/software platforms that are suitable for executing a task. To make selections, the Broker consults many other services. The Distributed Directory Service is used to search for resources with specific characteristics. The Resource Pricing Service is contacted to determine the cost of using these resources. The Allocation Management Service is used to determine if the user has an allocation that can be charged to when executing on specific resources. The Access Control Service is accessed to determine which resources the user can access. The Metadata Management Service is used to find virtual files that have the data the user requires. The Replica Management service is accessed to determine the physical locations of the user’s data. The Software Dependency Analysis Service is consulted to determine what software needs to be present on a system for an application to execute. The Software Catalog is used to locate where needed software is already installed or can be obtained. The Prediction Service provides predictions of application completion times and file transfer times.

Once resources have been selected, the Naturalization Service is used to make each task in a workflow compatible with the computer system(s) it will execute on by configuring environment variables, directories, and specifying any supporting software that needs to be copied to the system. The purpose of the Execution Service, described in this paper, is to reliably execute a task graph. A task graph is the resulting set of tasks after a workflow (or portion of a workflow) has had computer systems selected for it and has been naturalized to those systems. The Execution Service uses a Remote Execution Service, such as the one provided by the Globus toolkit, to execute applications on remote resources. During this execution, the Dynamic Accounts Service is used to map each grid user to a local account without the user having a pre-existing account. Event Management services are used by the Execution Service to notify clients of the status of the execution of a task graph and are used by the Monitoring Service [17, 18] to notify clients of the status of resources and services. Finally, the Management Service [17] is not visible to the general user but it

... failures. Further, even when the resources are available, the software and services located on those resources may be unavailable or not operating correctly. There are ways to mitigate this inherent unreliability by techniques such as pre-planning outages and monitoring the status of a grid [7, 16] so that failures can be quickly repaired, but this will not eliminate the problem. To help our users deal with failures, our Execution Service detects when tasks fail and retries them when appropriate. To determine how to handle a failure, information about the cause of the failure is needed.

After a job has been submitted to our Execution Service, users can monitor it in several ways. While the job is executing, users can either be notified when the state of the tasks in a job changes or they can query to obtain a history of state changes for each task in a job. Further, many applications indicate whether they executed successfully or not using the exit code of the application.
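The ideas in this passage (running an application in a user-specified environment, capturing its exit code, retrying failed tasks, and keeping a queryable history of state changes) can be illustrated with a minimal local sketch in Python. This is not the IPG Execution Service or its interface: the `Task` class, the `run_task` function, the state names, and the rule of retrying on any non-zero exit code are all assumptions made for illustration. The paper only says that retries happen "when appropriate", based on the cause of the failure.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    """Hypothetical task record (illustrative only, not the IPG API)."""
    name: str
    command: list                       # program and arguments to execute
    env: dict = field(default_factory=dict)
    max_attempts: int = 2               # assumed retry limit
    history: list = field(default_factory=list)  # ordered state changes
    exit_code: Optional[int] = None

    def set_state(self, state):
        # Record every state change so clients can query the history later.
        self.history.append(state)


def run_task(task):
    """Run a task locally, retrying on a non-zero exit code.

    Returns the final state name ("DONE" or "FAILED").
    """
    task.set_state("PENDING")
    for _ in range(task.max_attempts):
        task.set_state("RUNNING")
        # Passing env= replaces the inherited environment, so the application
        # sees only the variables supplied in task.env.
        result = subprocess.run(task.command, env=task.env)
        task.exit_code = result.returncode
        if task.exit_code == 0:
            task.set_state("DONE")
            return "DONE"
        task.set_state("FAILED")
    return "FAILED"


if __name__ == "__main__":
    import os
    # Start from the current environment here only so the interpreter can be
    # located on any platform; a real job would list its variables explicitly.
    demo = Task("demo", [sys.executable, "-c", "import sys; sys.exit(3)"],
                env=dict(os.environ))
    print(run_task(demo), demo.exit_code, demo.history)
```

In this sketch the exit code is the only failure signal. A service like the one described would also need to distinguish application failures from resource or middleware failures before deciding whether a retry makes sense.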