
Chapter 1

GRID RESOURCE MANAGEMENT IN LEGION

Anand Natrajan, Marty A. Humphrey, and Andrew S. Grimshaw
Department of Computer Science, University of Virginia

Abstract Grid resource management is not just about scheduling jobs on the fastest machines, but rather about scheduling all compute objects and all data objects on machines whose capabilities match the requirements, while preserving site autonomy, recognizing usage policies and respecting conditions for use. In this chapter, we present the Grid resource management of Legion, an object-based Grid infrastructure system. We argue that Grid resource management requires not a one-size-fits-all scheduler but an architectural framework that can accommodate different schedulers for different classes of problems.

Keywords: Legion, Grid Scheduling, Resource Management

1. INTRODUCTION

The Legion Project began in late 1993 with the recognition of the dramatic increases in wide-area network bandwidth, truly low-cost processors, and cheap disks looming on the horizon. Given the expected changes in the physical infrastructure, we asked what sorts of applications people would want, and what system software infrastructure would be needed to support those applications. As a result of this analysis, we designed and implemented the Legion Grid Computing system, which is reflective, object-based (to facilitate encapsulation), extensible, and in essence an operating system for Grids. Whereas Globus is a collection of tools in a toolkit [FK99], Legion provides standard operating system services – process creation and control, inter-process communication, persistent storage, security and resource management – on a Grid. By doing so, Legion abstracts the heterogeneity inherent in distributed resources and makes them look like part of one virtual machine. We feel strongly that having a common underlying architecture and a set of necessary services built over it is critical for success in Grids, particularly as the line between computational Grids and data Grids blurs [AVD01].

In this sense, the Legion architecture anticipates the drive to Web Services and the Open Grid Services Architecture (OGSA) [FKNT02]. There are many papers describing Legion’s core architecture and use (e.g., [GW97, GFKH99, NCWD+01, LFH+03]); in this chapter, we focus on the Legion resource management system.

2. OBJECT PLACEMENT IN A GRID

The scheduling process in Legion broadly translates to placing objects on processors. Scheduling is invoked not just for running users’ jobs but also to create any object on a Grid, such as a Grid file, a Grid directory, a Grid application or even a Grid scheduler. After an object is created on a processor, it can perform its tasks, for example, respond to read/write calls if the object is a Grid file, or respond to status requests if it is a Grid job. Therefore, object placement is crucial to the design of the Legion run-time system because it can influence an object’s run-time behavior greatly. An improper placement decision may impede an object from performing its tasks, for example, because it cannot start on any processor of a given architecture or because the processor is no longer available. Even if a placement decision ensures that an object starts correctly, it does not guarantee that the decision is beneficial to the user. A good placement decision will vary depending on the object being placed, the user’s goals and resource usage policies.

Determining good object placements in a large, distributed, heterogeneous environment such as a Grid is difficult because the underlying system is complex, and because object behavior can be influenced by many different factors, such as system status (number, type, and load of components), hardware capabilities (processor, network, I/O, memory, etc.), interactions between objects and object-specific characteristics (size, location of persistent state, estimated performance, etc.). Factors such as security concerns, fault-tolerance objectives and special resource requirements may place hard restrictions on where an object can be placed. These factors are usually expressed as constraints on the placement decision. In general, finding an optimal placement is prohibitively expensive. Several research efforts, such as Utopia [ZWZD93], NOW [ACP+94], Condor [LL90, PL95], Zoom [WASB95], Prophet [Wei95] and others [Cof76, FC90, GY93, WKN+92], have focused on algorithms and systems for near-optimal solutions or optimal solutions to very restricted problem sub-types or user goals [Kar96].

In Legion, we designed a scheduling framework that can accommodate different placement strategies for different classes of applications. In addition to the expected Grid goals of support for heterogeneity and multi-organizational control, the goals included [Kar96]:

Support for multiple placement algorithms. The framework must be flexible enough to be able to incorporate placement algorithms developed by others.

Support for user selection of placement. Users must be permitted to choose the placement approach that best matches their goals.

Ease of use. It should be easy for developers to add new placement algorithms as schedulers in the framework. Additionally, it should be easy for end-users to access the schedulers for performing their tasks. Default placement mechanisms ease the use of the framework for inexperienced users.

Ability to cope with uncertain, outdated or partial information. We expect that in Grids, information may be missing or inaccurate. The scheduling framework in general and the schedulers that are part of it must continue to perform acceptably even when the available system information is less than perfect.

Ability to resolve conflicts. In a system that supports shared objects and resources, conflicts may arise over their use. The framework must have a well-defined resolution behavior in the case of conflicts.

Scalability. The framework should not degrade significantly (or at least degrade gracefully) when the number of processors becomes large or when the requests for placement become more frequent.

Low overhead. The framework should not impose penalties on users who choose not to use it. For users who do choose to use it, the overheads involved in invoking the scheduling process should be small compared to the duration of the task performed.

Integration with other Legion services. In keeping with the Legion philosophy of providing an integrated Grid infrastructure, the scheduling framework must cooperate and communicate with other frameworks in Legion, such as those for security and fault-tolerance. The framework must also take into account persistent storage associated with the shared object space in Legion.

In the following sub-sections, we describe the main tasks of the Legion scheduling framework. The goal of this description is not to advocate one placement policy over another. Although we did select a particular placement policy in order to validate (and populate) our framework, we did not and do not claim that policy to be optimal or best-suited for all objects.

2.1 Initiating Placement

Placement decisions can be initiated in two ways. In the first case, an object can request the underlying infrastructure explicitly to place (or schedule) other objects with which it must interact. For example, a user (represented as an object in Legion) may request Legion to run a particular job on a particular machine. In this case, the placement initiation is explicit in two senses: the request to place is a direct implication of issuing the run command, and the placement location is provided by the user directly. Legion does not require a user to initiate placement in the latter sense – an undirected run command transfers the burden of finding a placement location from the user to Legion.

In the second case, placement initiation may be implicit, and therefore must be automatic. In other words, an object, say a user, may access another object without realizing that the latter is currently inactive. In this case, Legion will re-activate the second object automatically, which in turn may require it to be placed on an available processor. The processor chosen in this case may or may not be the same processor on which that object was active previously. Implicit or automatic placement initiation occurs frequently in Legion; in order to conserve resources, Legion may deactivate infrequently-used objects. When a subsequent action does require such objects to be available, Legion will re-activate them. Seemingly mundane Grid operations such as logging in, checking the contents of a Grid directory and issuing a run may cause several objects to be re-activated.

2.2 Preparing for Placement

Regardless of how placement is initiated, preparing for placement involves three tasks. The first task is selecting an appropriate scheduler from the framework. To be most effective, the scheduler chosen must factor in criteria that are important to the user [BW96, Ber99, LYFA02]. Since the scheduler itself may require time and CPU cycles to make a decision, its performance and cost must be weighed against its anticipated benefits. This selection may be made automatically by Legion, or may be specified by sophisticated users who choose to indicate which scheduler, or even which processor, must be used. When placement is initiated automatically, there exists a mechanism for indicating which scheduler to use. This mechanism is captured in attributes of class objects, which are managers or factories for creating instances of different kinds of objects. For example, when users decide to port an application to Legion, they use a tool that essentially creates an application class object in the Grid. A class object may be associated with any of the schedulers available in the Grid. When a user requests that this application be run, the class object consults its attributes, determines the associated scheduler and invokes this scheduler to perform a placement decision for an instance of the application, namely the run desired by the user.

The second task in preparing for placement is sending the selected scheduler a placement request. Each scheduler may implement a different algorithm and may require different system information for performing placement. Designing a format for the placement request is a non-trivial task; some may argue that if this problem can be solved, the problem of designing a general-purpose scheduler for all classes of applications becomes much easier. One approach to designing a placement request format is to design a general description language that is flexible and extensible enough to express most placement requests. The challenge with this approach is designing a scheduler that can take any program written in this language and do something useful with it. Another approach is to develop a standard interface for all schedulers. Unfortunately, a standard interface often implies being able to express only a small subset of the possible functionality just so that simpler schedulers can be accommodated. In Legion, we incorporated both approaches. The scheduling framework required conforming to a standard interface, but we also provided a language for querying the database object that collected information on all processors in a Grid so that other schedulers could be written.

The third task is to specify object-specific placement constraints to the scheduler. In Legion, specific placement constraints are specified as attributes on the associated class objects. Typically, these constraints permit specifying either processors that are suited (or unsuited) for this class object or arbitrary attributes that a processor is expected to possess in order to qualify as a match. When a class object receives a request to create an instance, it passes these constraints to the scheduler as part of the placement request. The default scheduler we provided with the system takes these constraints into account when making a decision; however, we did not require all schedulers in the framework to take those constraints into account.
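To illustrate the standard-interface approach, the sketch below shows how such an interface and the accompanying placement request might look. The names (PlacementRequest, Mapping, Scheduler) and the Python rendering are illustrative assumptions, not Legion's actual interfaces or wire format.

```python
# Hypothetical sketch of a uniform scheduler interface and placement request.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlacementRequest:
    """What a class object might send to a scheduler."""
    class_name: str                                             # object type being instantiated
    instances: int = 1                                          # how many instances to create
    architectures: List[str] = field(default_factory=list)      # platforms with registered binaries
    constraints: Dict[str, str] = field(default_factory=dict)   # e.g., desired host properties


@dataclass
class Mapping:
    """One instance-to-resource placement in a schedule."""
    host: str
    vault: str


class Scheduler(ABC):
    """The uniform interface every scheduler in the framework would implement."""

    @abstractmethod
    def schedule(self, request: PlacementRequest) -> List[Mapping]:
        """Return host/vault mappings satisfying the request."""
```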

2.3 Performing Placement

Placement is performed by the selected scheduler. The scheduler is clearly the heart of the placement process; however, we recognized that others were better at writing schedulers than we were. Therefore, we provided a framework into which experts could write schedulers and plug them in. Naturally, in order to validate the framework as well as provide default placement, we wrote our own scheduler. The main tasks of this scheduler are what we expected of any scheduler (a sketch combining them appears after the list):

Determine the application requirements. These are available as constraints passed in by the class object.

Determine the resources available. These are available from a database object, called a collection, which can be accessed programmatically as well as by using a query language.

Invoke a scheduling algorithm. Invoking the algorithm results in a schedule. For our default scheduler, we employed a random algorithm. Given that this scheduler was a default, we did not expect it to be used frequently. Moreover, given that we could not predict which objects in a Grid would end up using the default scheduler as opposed to a more appropriate scheduler, we felt that random was as good or as bad an algorithm as any.

Enforce the schedule. A schedule is of academic interest unless it results in the object actually being created on some suitable processor. As part of the placement process, the framework must ensure that the schedule generated results in object creation, or if it does not, invoke the scheduling algorithm again, perturbing it so that it generates a different schedule. Alternatively, the framework must communicate its failure clearly back to the class object or the user so that other actions can be taken.
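A minimal sketch of how these four tasks might compose in a default scheduler follows. The collection is reduced to an in-memory list of attribute records, the random algorithm is perturbed by discarding hosts that refuse, and all names (matches, default_schedule, try_start) are hypothetical rather than Legion code.

```python
import random
from typing import Callable, Dict, List


def matches(host: Dict[str, str], constraints: Dict[str, str]) -> bool:
    """A host qualifies only if it carries every constrained attribute value."""
    return all(host.get(key) == value for key, value in constraints.items())


def default_schedule(constraints: Dict[str, str],
                     collection: List[Dict[str, str]],
                     try_start: Callable[[Dict[str, str]], bool]) -> Dict[str, str]:
    """Pick hosts at random until one accepts the object, or give up."""
    # Tasks 1 and 2: requirements arrive as constraints; available resources
    # come from the collection (here reduced to a list of attribute records).
    candidates = [host for host in collection if matches(host, constraints)]
    # Tasks 3 and 4: apply the (random) algorithm and enforce the schedule,
    # perturbing the choice whenever a host refuses to start the object.
    while candidates:
        choice = random.choice(candidates)
        if try_start(choice):          # the host accepted; the object was created
            return choice
        candidates.remove(choice)      # perturb: do not retry a refusing host
    raise RuntimeError("no acceptable host; report failure to the class object")
```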

2.4 Gathering System Information

A key component in making any placement decision is gathering the necessary information, such as processor types, OS types and versions, attached devices, available disk space, memory and swap size, CPU load, run queue length, security and reservation policies, network bandwidth, network latency, packet drop percentages, etc. Earlier, we alluded to this step, but assumed that the information was already available when preparing for placement. However, when designing a scheduling framework, we had to design mechanisms to ensure that this kind of information was available to the scheduler. We designed a new object, called a collection (similar in spirit to MDS [CFFK01]), which functioned as a database for this information. We felt a collection object was necessary so that a scheduler could find reasonably-current information about available processors in one place instead of contacting every processor in a Grid. In turn, either the collection object polled every processor periodically for system information or processors themselves pushed this data into the collection. Collections are generic repositories of object attributes; collections that specifically store information about processors are associated with schedulers in order to aid placement.
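The pull and push update models might be sketched as follows, assuming each member can expose a snapshot of its attributes. The Collection class and its methods are illustrative only; Legion's collection additionally authenticates pushing members.

```python
import time
from typing import Callable, Dict


class Collection:
    """A passive database of attribute records, keyed by resource name."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, str]] = {}

    # Push model: the member initiates the update (in Legion the member is
    # authenticated before the update is accepted).
    def push(self, member: str, attributes: Dict[str, str]) -> None:
        self.records[member] = dict(attributes, updated=str(time.time()))

    # Pull model: the collection periodically polls a probe supplied by the member.
    def pull(self, member: str, probe: Callable[[], Dict[str, str]]) -> None:
        self.push(member, probe())
```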

2.5 Gathering Application Information

Accurate and detailed information about the behavioral characteristics of different objects can aid scheduling. In Legion, application-specific information can be specified in a class object, as we discussed earlier. The class object can be given a set of placement constraints that essentially restricts the processors on which its instances can run. Also, arbitrary attributes, typically called desired host properties, can be used to restrict the choice of processors; only processors that possess those properties may be selected. Setting these constraints and attributes can be done at any time in the lifetime of the Grid. An additional manner in which the choice of processors can be constrained is by controlling the platforms on which instances of the class object can run. For example, if a user provides only Solaris and Windows binaries for a particular class object, then instances of that class can never be scheduled on, say, a Linux or SGI machine. Furthermore, the user can instruct a particular run – which creates an instance of the class object – to run on any machine of a particular architecture. Thus, Legion provides users with mechanisms to control the scheduling process with application-level information.

3. MECHANICS OF RESOURCE MANAGEMENT

Legion is both an infrastructure for Grids and a collection of integrated tools constructed on top of that infrastructure. The basic infrastructure enables secure, dataflow-based, fault-tolerant communication between objects. Communicating objects could be diverse resources, such as applications, jobs, files, directories, schedulers, managers, authentication objects (representations of users in a Grid), databases, tools, etc. The Legion scheduling framework acts as a mediator to find a match between placement requests and processors. The scheduling process in Legion is one of negotiation between resource consumers, i.e., autonomous agents acting on behalf of applications or users or objects, and resource providers, i.e., autonomous agents acting on behalf of processors or machines or resources. By providing mechanisms for specifying security and usage policies, resource providers can control who runs what and when on their processors. Likewise, by specifying constraints and choosing schedulers, users can control how their applications run. The scheduling framework that sits between the providers and consumers attempts to satisfy the expectations of both.

In the context of Grid resource management, the main contribution of the Legion project is not the algorithm used by the default scheduler, but the surrounding infrastructure that takes security, fault-tolerance, matching, etc. into account for every single object created in a Grid. The infrastructure enables creating objects, jobs included, on any appropriate processor in a Grid, whether across a room or across the globe. The location transparency gained is a deep-rooted part of the Legion philosophy of providing a single virtual machine abstraction for the disparate resources in a Grid.

Figure 1.1. Scheduling Process in Legion

The components of the Legion resource management framework are: class objects, resource objects (hosts and vaults), information database objects (collections), scheduler objects, schedule implementor objects (enactors) and implementation objects [CKKG99]. Before we examine each component in detail, we will examine their interactions at a higher level (Figure 1.1). A typical chain of events in a Grid could involve a user initiating a tool to start an instance of an application on a machine. This chain results in a tool object contacting an application class object (to create an instance of this application), which in turn contacts a scheduler (to generate schedules for running this application on a machine), which contacts a collection (to procure information about machines), an enactor (to reserve time on the target machine) and a host object (to start the job on the machine). After the scheduler selects a host object, it contacts the application class object with enough information to start a job instance on the machine.

In the rest of this section, we describe the different objects that participate in resource management. The implementation and interaction of these objects echoes the philosophy we discussed above. However, we regard this set of objects as only one of many possible implementations of that philosophy.

3.1 Class Objects

In Legion, class objects define the type of their instances, as in other object-oriented systems, but in addition are also active entities, acting as managers for their instances. A class is the final authority in controlling the behavior of its instances, including object placement. When a Legion Grid is deployed, a variety of class objects are already pre-created. These class objects can create instances of commonly-used Grid objects such as directories, files, schedulers, collections and even other class objects. Later, other class objects may be added to the Grid. For example, a developer may add a new class object that creates instances of an entirely new kind of object, such as a network object. Alternatively, a developer may refine an existing class object, such as the file class object, in order to create a new class object that can create specialized instances, such as matrix or two-dimensional files. Finally, users who port their applications to a Grid typically create, unbeknownst to them, application class objects that are managers for every single run of that application.

All class objects define a create instance method, which is invoked during placement initiation. This method may take parameters for an explicit placement or may be called with minimum parameters for an implicit placement. If the placement is explicit, Legion bypasses schedulers, enactors and collections and attempts to start objects on host-vault pairs directly. If the placement is implicit, the scheduling framework is invoked with as much information as is available.
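A sketch of the create-instance path, under the assumption that an explicit placement names a host/vault pair directly while an implicit one defers to the associated scheduler; ClassObject, start_on and the other names are hypothetical stand-ins for the Legion objects described above.

```python
from typing import Optional, Tuple


class ClassObject:
    """Manager (factory) for its instances; placement initiation starts here."""

    def __init__(self, scheduler, constraints):
        self.scheduler = scheduler        # associated scheduler, read from attributes
        self.constraints = constraints    # placement constraints, read from attributes

    def create_instance(self, placement: Optional[Tuple[str, str]] = None):
        if placement is not None:
            host, vault = placement       # explicit: bypass the scheduling framework
            return self.start_on(host, vault)
        # implicit: hand whatever information is available to the scheduler
        mapping = self.scheduler.schedule(self.constraints)
        return self.start_on(mapping.host, mapping.vault)

    def start_on(self, host: str, vault: str):
        """Contact the host/vault pair and ask it to instantiate the object."""
        raise NotImplementedError
```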

3.2 Scheduler and Enactor Objects

A scheduler object maps requests to resources. As part of this process, the scheduler is given information by the class object about how many instances to create, as well as what constraints apply. Application-specific schedulers may demand, and may be supplied with, more information about the resource requirements of the individual objects to be created. In addition, a scheduler also requires information about the platforms or architectures on which instances of this class can run. All of this information is procured from the class object. A scheduler obtains resource information by querying a collection, and then computes a schedule for placing the requested objects. This schedule is passed to an enactor that bears the responsibility of ensuring that the schedule is successful. Each schedule has at least one master version and a list of variant versions. Master and variant versions contain a list of mappings, with each mapping indicating that an instance of the class should be started on the indicated host/vault pair. The master version of a schedule contains the scheduler’s best attempt to schedule the requested objects. A variant version differs slightly from the master version in terms of the resources selected, representing a poorer scheduling decision to which the enactor can resort if the master fails.
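The schedule structure described above might be modeled as below; the dataclass fields are assumptions chosen for illustration, not Legion's actual representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mapping:
    instance_id: int     # which requested instance this mapping places
    host: str
    vault: str


@dataclass
class Schedule:
    master: List[Mapping]                                        # scheduler's best attempt
    variants: List[List[Mapping]] = field(default_factory=list)  # poorer fallback versions
```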

Upon receiving a schedule from a scheduler, the enactor attempts to determine whether the schedule will be successful. In order to do so, it extracts the mappings from the master version, contacts each host/vault pair involved and inquires whether the sub-request on it will be successful. A host/vault pair may choose to reject this sub-request based on its current situation – such a rejection is part and parcel of the negotiation philosophy. If the master version cannot be satisfied because of such rejections, the enactor resorts to the variant versions to schedule successfully. If no version can be made successful, the enactor reports an error and cancels the rest of the scheduling process. If a successful version can be found, the enactor procures reservations from the host/vault pairs (if they support reservations) and reports back to the class object with the successful version.
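A sketch of the enactor's reservation pass, assuming a reserve call that returns a token or None and a release call that undoes a reservation; it walks the master version first and then each variant, in the spirit described above, but the function and parameter names are hypothetical.

```python
from typing import Callable, List, Optional


def enact(schedule,
          reserve: Callable[[object], Optional[str]],
          release: Callable[[str], None]) -> Optional[List[str]]:
    """Return reservation tokens for the first fully reservable version, else None."""
    for version in [schedule.master, *schedule.variants]:
        tokens: List[str] = []
        for mapping in version:
            token = reserve(mapping)     # a host may reject based on its local policy
            if token is None:
                for held in tokens:      # partially reserved version: give it back
                    release(held)
                tokens = []
                break
            tokens.append(token)
        else:
            return tokens                # every sub-request in this version succeeded
    return None                          # no version viable: report an error upstream
```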

3.3 Collection Objects

A collection is an object that acts as a repository for information describing the state of the resources in a Grid. Each record is stored as a set of Legion object attributes. Collections provide methods for joining them and for updating records. Typically, host and vault objects join collections, although other objects may also join. Members of a collection may supply their attributes in either a pull model or a push model. In a pull model, the collection takes on the responsibility of polling its members periodically for updates. In a push model, the members periodically initiate updates to the collection (Legion authenticates the member to ensure it is allowed to update the data in the collection). A push model is more appropriate in a scenario in which the members of a collection may lose and regain connectivity with the rest of the Grid. A pull model is more appropriate in a scenario in which we wish to avoid the update implosion of several members updating a single collection.

Users, or their agents such as schedulers, obtain information about resources by issuing queries to a collection. A collection query is a string conforming to some grammar. Currently, a collection is a passive database of static information, queried by schedulers. Collections can be extended to support the ability for users to install code that computes new description information dynamically and integrates it with existing description information for a resource. This capability is especially important to users of the Network Weather Service [WSH99], which predicts future resource availability based on statistical analysis of past behavior.

Another use of collections is to structure resources within the Legion system. Having only a few global collections can limit scalability. Therefore, collections may receive data from, and send data to, other collections. Making collections members of other collections gives us the flexibility to have a collection for each administrative domain and thus achieve hierarchical structuring of Grid resources.
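Hierarchical structuring might look like the following sketch, in which a collection answers a query from its own records and, recursively, from member (child) collections such as one per administrative domain; the class and method names are illustrative assumptions.

```python
from typing import Callable, Dict, List


class HierarchicalCollection:
    """A collection that can itself be a member of (or contain) other collections."""

    def __init__(self, children: List["HierarchicalCollection"] = ()):
        self.records: Dict[str, Dict[str, str]] = {}
        self.children = list(children)

    def query(self, predicate: Callable[[Dict[str, str]], bool]) -> Dict[str, Dict[str, str]]:
        """Return matching records from this collection and, recursively, its children."""
        result = {name: rec for name, rec in self.records.items() if predicate(rec)}
        for child in self.children:
            result.update(child.query(predicate))
        return result
```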

3.4 Host and Vault Objects

Host and vault objects represent two basic resource types in Legion, processors and disk space respectively. Typically, these objects are started on the same machine, but they are not required to be co-located. A host object encapsulates processor capabilities (e.g., a processor and its associated memory) and is responsible for instantiating objects on the processor. Thus, the host object acts as an arbiter for the processor’s capabilities. A host object can represent single-machine systems as well as a queue management system such as LoadLeveler [Cor93], NQS [Kin92], PBS [BHL+99] or LSF [Zho92]. A vault object encapsulates storage capabilities (e.g., available disk space) and is responsible for storing the persistent state of objects running on that machine. Every Legion object must have a vault to hold its Object Persistent Representation (OPR). The OPR holds the persistent state of the object, and is used for migration and for shutdown/restart purposes. When requested by an enactor, a host object grants reservations for future service. The exact form of the reservation may vary by implementation of the host object, but reservations must be non-forgeable tokens; the host object must recognize these tokens when they are passed in with subsequent requests from the class.

There are three broad groups of host/vault functions: reservation management, object management, and information reporting. Reservation functions are used by an enactor to obtain a reservation token for each sub-request in a schedule. When asked for a reservation, a host is responsible for ensuring that its vault is accessible, that sufficient resources are available, and that its local placement policy permits instantiating the object. A host/vault pair is responsible for managing an object during its lifetime. Object management may involve de-activation and re-activation if requested, as well as migration. Migrating an object involves collecting its OPR and transmitting it to some other host/vault pair. Hosts and vaults repopulate their meta-data after reassessing their local state periodically. This reassessment is done by invoking local resource management tools or calls on the underlying machine or queuing system. The resultant meta-data, also called attributes, may be pushed into or pulled by a collection object.
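One way to realize non-forgeable reservation tokens is to sign the reservation details with a key known only to the host object and verify the signature when the token is presented again, as in the sketch below; this HMAC-based scheme is an illustration of the property, not Legion's actual mechanism.

```python
import hashlib
import hmac
import secrets


class HostReservations:
    """Grants reservation tokens that only this host object can mint and verify."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)   # known only to this host object

    def grant(self, class_id: str, vault: str) -> str:
        payload = f"{class_id}:{vault}"
        signature = hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()
        return f"{payload}:{signature}"

    def verify(self, token: str) -> bool:
        payload, _, signature = token.rpartition(":")
        expected = hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(signature, expected)
```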

3.5 Implementation Objects

Implementation objects may be viewed as representations of the actual binaries required to run objects on a processor. Every object, whether it be a user’s job or a Legion object, requires a binary of the appropriate architecture to run on a processor. Registering these binaries with a class object is the porting process in Legion; the crux of Legion’s support for running legacy applications as well as Legion-aware applications is registering binaries with class objects. A class object tracks the implementation objects associated with itself when initiating placement. Therefore, a class that has only Solaris and Windows implementations will never request a schedule containing Linux or SGI machines. When a class receives a viable schedule from an enactor, it communicates with the host/vault objects in order to start objects. Host/vault objects in turn receive the names of the implementation objects for that class, and contact the implementation objects to download the associated binary for running the instance.

Since implementations are objects themselves, they are created in much the same way as any other object. Implementations for hosts and vaults, however, are more basic than implementations of most other class objects. Therefore, host/vault implementations are procured by looking in well-known directories in the Legion installation. Once the host/vault pairs are running, implementations for any other object can be procured from anywhere in the Grid. A minor but interesting point about implementations is that it is perfectly possible and reasonable that the Linux implementation of a particular application class object may actually be started on a Solaris machine. The distinction to remember is that the application’s Linux binary happens to be stored on a Solaris machine; therefore the implementation runs on a Solaris machine, but the application binary, when desired, will run only on a Linux machine.

3.6 Proxy Objects

Proxy objects are used to execute legacy application binaries on host and vault pairs and do not play a role in scheduling. However, they are a resource management component because they enable users to employ Legion tools to monitor the progress of a job on a remote machine even though the original job does not respond to Legion requests. Instead, the proxy responds to Legion requests about the status of the job. Since the proxy is not the job itself, it cannot give application-specific status of the job. However, the proxy can provide information such as the name of the machine on which the job runs, the current working directory of the job, the files present in that directory as well as contents of those files at any time, etc.

4. LESSONS LEARNED FROM THE LEGION RESOURCE MANAGEMENT SYSTEM

Several of the key lessons we learned about Grid resource management are captured in the design decisions we incorporated in the scheduling framework. First, in many ways, scheduling should be treated no differently than the other parts of the Grid infrastructure.

Figure 1.2. Protocol Stack in Legion

Although not shown in Figure 1.1, every object-to-object communication in the scheduling sequence requires the same reliability, efficiency, and privacy/integrity as object interactions not related to scheduling. We chose to implement the scheduling framework using the same policies and mechanisms available to all object interactions – every communication between any pair of objects must go through the Legion protocol stack (see Figure 1.2 for an example stack), which involves constructing program graphs, making method invocations, checking authorization, assembling or disassembling messages, encrypting messages, retransmitting messages, and so on. Since every communication goes through such a stack, Legion provides security and fault-tolerance as well as scheduling as part of an integrated resource management framework.

Second, scheduling in Legion is a process of negotiation. Most schedulers view CPU cycles as passive resources waiting to be utilized by the next available job. However, in a multi-organizational framework, a CPU is not necessarily available simply because it is idle. The owner of the CPU – the organization that controls the machine – may impose restrictions on its usage. Therefore, when matching a job to an available CPU, Legion initiates a negotiation protocol which respects the requirements of the job as well as the restrictions imposed by the CPU owner. In other words, we consider site autonomy an important part of the scheduling, or more correctly, the resource management process. Even if a scheduler selects a particular host for running a job, the host may reject the job based on its current policies. Depending on the implementation, the scheduler may investigate variant schedules or may inform the user of the failure to run the job.

Third, the scheduler can be replaced. Each and every component of a Legion Grid is replaceable. Thus the scheduler in the figure can be replaced by a new one that employs any algorithm of choice. Not just the scheduler, but the toolset that uses the scheduler can be changed as well. For example, we wrote a queue object that uses a similar chain of events to mimic the operation of a queuing system. Also, we wrote a parameter-space tool (similar in spirit to Nimrod [ASGH95]) that can start jobs immediately or send them to our queue. A Legion Grid can have multiple schedulers or even multiple instances of a particular scheduler. Applications can be configured to use a specific scheduler. Thus, the Legion Grid resource management framework explicitly allows for different schedulers for different classes of applications. Of course, users can bypass the entire scheduling mechanism by specifying machines directly or by using some non-Legion tool for constructing a schedule for their applications. Bypassing the scheduling mechanism does not mean bypassing security and fault-tolerance, because those functions are at lower levels in the stack. Naturally, if desired, lower levels can be replaced or eliminated as well, with the attendant implications.

Fourth, the scheduling infrastructure can be used as a meta-scheduling infrastructure as well. The host object shown in Figure 1.1 could be running on the front-end of a queuing system or the master node of an MPI cluster. Thus, Legion could be used to select such a host, but subsequent scheduling on the queue or the cluster could be delegated to the queuing system or the MPI system.
When designing the Legion Grid resource management framework, we had a wider definition of resource management than most other distributed systems. We tried to construct a framework within which other parties could write schedulers for different classes of applications. We consciously did not design for only the classic applications – long-running, compute-intensive, parallel applications requiring high performance. Naturally, we did provide a single reference implementation of a scheduler in order to perform resource management on a Legion Grid immediately upon installation. However, we intended this scheduler to be a default – a catch-all scheduler for users who wished to use a Legion Grid as-is. We always intended to permit other schedulers to be part of any Legion Grid.

We did make mistakes in the design of our Grid infrastructure; some of those mistakes were in the scheduling framework. Some of these mistakes are technical, whereas others are psychological. If we were to re-design Legion, here are some lessons we would keep in mind:

People are reluctant to write schedulers. We could not rely on Grid scheduling experts to write schedulers for Legion. Once we learned this lesson, we wrote two new schedulers to complement the default scheduler that already came with every Legion installation. One was a round-robin scheduler for creating instances of files, directories and other objects on a Grid. The round-robin scheduler made quick decisions based on a machine file that was part of its state, thus avoiding expensive scheduling decisions for simple object creation. The second scheduler was a performance-based scheduler for parameter-space studies. This scheduler took CPU speeds, number of CPUs and loads into account when choosing machines on which to run parameter-space jobs.
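The round-robin idea can be sketched as follows, assuming a machine file with one host name per line kept as part of the scheduler's state; the file name and class are hypothetical.

```python
import itertools


class RoundRobinScheduler:
    """Cycle through a fixed machine list so simple creations are placed instantly."""

    def __init__(self, machine_file: str = "machines.txt"):
        with open(machine_file) as handle:
            machines = [line.strip() for line in handle if line.strip()]
        self._cycle = itertools.cycle(machines)

    def next_host(self) -> str:
        return next(self._cycle)
```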

Writing schedulers deep into a framework is difficult. While we did provide a framework for writing schedulers, a mistake we made was requiring scheduler writers to know too much about Legion internals. Typically, in addition to the scheduling algorithm of interest, a scheduler writer would have to know about schedulers, enactors, hosts, classes and collections; their internal data structures; the data they packed on the wire for several method calls; and Legion program graphs. The effort required to write such a deep scheduler was too great. In essence, we had violated one of our own principles: ease of use. Our mistake lay in making Legion easy to use for end-users, but not necessarily so for developers. Once we recognized our error, we wrote a shallow scheduler, i.e., a scheduler that was about as complex as the default scheduler but did not require knowing too much about Legion internals. The performance-based scheduler for parameter-space studies mentioned earlier is an example of a shallow scheduler. This scheduler is a self-contained Perl script that requires knowing about the collection object (a database of attributes) and the one command to access it. Not having to know Legion details was a significant advantage in the design of this scheduler.

The lesson we learned from this experience was that a high cost of constructing new schedulers is a deterrent to development. Another lesson we learned was that a high cost of running a scheduler can hurt a Grid as well. Put differently, we learned that a quick and acceptable scheduler is much better than a slow but thorough scheduler.

High scheduler costs can undermine the benefits. In Legion, a scheduler is invoked every time an object must be placed on some machine on a Grid.

Given the Legion view of scheduling as a task for placing any object, not just a compute object, creating files and directories, implementations and queue services, consoles and classes all require an intermediate scheduling step. For a long time, the scheduler invoked for any creation was the default scheduler. While we fully understood the need for different schedulers for different kinds of objects, an artifact of our implementation was that we created only one scheduler – the default one.

The default scheduler’s algorithm was expensive in two respects. One, the actual processing took a long time, especially as the number of machines in a Grid grew. Moreover, the scheduler constructed variant versions for every request just in case the master version did not meet with success. Two, the process invoked methods on too many remote objects. Each method call (or outcall) was a relatively expensive operation. Therefore, even a simple schedule would take too long to generate. Accordingly, we built faster schedulers which perhaps did not find near-optimal and variant schedules, but were far quicker than the default. The round-robin scheduler made fewer outcalls and had a simple algorithm for choosing hosts, but was adequate for scheduling files and directories. Likewise, the shallow scheduler we wrote for performance-based scheduling scheduled parameter-space jobs quickly [NHG02]. It initially spent a few seconds building a schedule, but re-used the schedule for the duration of the application.

Over-complex schedulers are unnecessary. In Legion, we created a sophisticated scheduling framework, but we also implemented this framework in a complicated manner. In particular, splitting the scheduling process from the reservation process (the scheduler and enactor objects respectively) was overkill. The added flexibility this split gave us was never used, and we believe that it will not be used for a while because complex scheduling techniques, such as co-scheduling, that require reservations are useful for only a small subset of applications [SF02]. Too many objects were involved in the scheduling process, making it feel like the process had too many moving parts. The failure of any one object could derail the scheduling process, making it hard to create new objects – files, directories, implementations, jobs, etc. – on a Grid.

5. SUMMARY

In this chapter, we discussed the philosophy and mechanisms of the Legion resource management framework. In Legion, resource management is invoked not just for running jobs but also to place other Grid components, such as files, directories, databases, etc. The key element in resource management is placement, i.e., determining on which machine to start running an object. In Legion, placement is a negotiation process between the requirements of users and the policies of resource managers. This negotiation process is carried out by a scheduler, which also employs an algorithm to determine which of the available resources are best suited for starting the requested object. Every scheduler in Legion implements the negotiation process, although different schedulers may employ different algorithms.

As Grids mature, diverse resources will be included in Grids and Grid resource management will be central to the working of a Grid. We hope that our experience will serve to guide the design of resource managers. In particular, we believe that the pressing challenges facing the Grid community are the design of rich and flexible resource specification languages in order to match resources with requests, and the design of a framework that can incorporate different solutions for different aspects of Grid resource management.

Acknowledgments

We thank the members of the Legion Team at the University of Virginia for their hard work over the years. In particular, we thank John Karpovich, who developed much of the initial philosophy underlying resource management in Legion. This work was partially supported by DARPA (Navy) contract N66001-96-C-8527, DOE grant DE-FG02-96ER25290, DOE contract Sandia LD-9391, Logicon (for the DoD HPCMOD/PET program) DAHC 94-96-C-0008, DOE D459000-16-3C, DARPA (GA) SC H607305A, NSF-NGS EIA-9974968, NSF-NGS ACI-0203960, NSF-NPACI ASC-96-10920, and a grant from NASA-IPG.

References

[ACP+94] T. E. Anderson, D. E. Culler, D. A. Patterson, et al. A case for NOW (Networks of Workstations). In Principles of Distributed Computing, August 1994.

[ASGH95] D. Abramson, R. Sosic, J. Giddy, and B. Hall. Nimrod: A tool for performing parametised simulations using distributed workstations. In 4th IEEE Intl. Symp. on High-Perf. Dist. Computing (HPDC), August 1995.

[AVD01] D. C. Arnold, S. Vadhiyar, and J. Dongarra. On the convergence of computational and data grids. Parallel Processing Letters, 11(2-3):187–202, September 2001.

[Ber99] F. Berman. High Performance Schedulers, chapter 12, pages 279–309. Morgan Kaufmann Publishers, Inc., 1999.

[BHL+99] A. Bayucan, R. L. Henderson, C. Lesiak, N. Mann, T. Proett, and D. Tweten. Portable Batch System: External reference specification. Technical report, MRJ Technology Solutions, November 1999.

[BW96] F. Berman and R. Wolski. Scheduling from the perspective of the application. In High Performance Distributed Computing Conference, 1996.

[CFFK01] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid information services for distributed resource sharing. In Proceedings of the 10th IEEE Symposium on High-Performance Distributed Computing, August 2001.

[CKKG99] S. J. Chapin, D. Katramatos, J. F. Karpovich, and A. S. Grimshaw. Resource management in Legion. Future Generation Computing Systems, 15:583–594, October 1999.

[Cof76] E. G. Coffman, Jr. Computer and Job/Shop Scheduling Theory. John Wiley and Sons, New York, 1976.

[Cor93] International Business Machines Corporation. IBM LoadLeveler: User's Guide. Technical report, IBM, 1993.

[FC90] R. F. Freund and D. S. Conwell. Superconcurrency: A form of distributed heterogeneous supercomputing. Supercomputing Review, 3(10):47–50, October 1990.

[FK99] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure, chapter 11, pages 259–278. Morgan Kaufmann, 1999.

[FKNT02] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the Grid: An Open Grid Services Architecture for distributed systems integration. Technical report, Open Grid Service Infrastructure WG, Global Grid Forum, 2002.

[GFKH99] A. S. Grimshaw, A. J. Ferrari, F. Knabe, and M. A. Humphrey. Wide-area computing: Resource sharing on a large scale. IEEE Computer, 32(5), May 1999.

[GW97] A. S. Grimshaw and W. A. Wulf. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), January 1997.

[GY93] A. Ghafoor and J. Yang. A distributed heterogeneous supercomputing management system. IEEE Computer, 26(6):78–86, June 1993.

[Kar96] J. F. Karpovich. Support for object placement in wide area distributed systems. Technical Report CS-96-03, Univ. of Virginia, 1996.

[Kin92] B. A. Kingsbury. The Network Queueing System (NQS). Technical report, Sterling Software, 1992.

[LFH+03] M. J. Lewis, A. J. Ferrari, M. A. Humphrey, J. F. Karpovich, M. M. Morgan, A. Natrajan, A. Nguyen-Tuong, G. S. Wasson, and A. S. Grimshaw. Support for extensibility and site autonomy in the Legion grid system object model. Journal of Par. and Dist. Computing, 2003.

[LL90] M. Litzkow and M. Livny. Experience with the Condor distributed batch system. In IEEE Workshop on Experimental Distributed Systems, 1990.

[LYFA02] C. Liu, L. Yang, I. Foster, and D. Angulo. Design and evaluation of a resource selection framework for grid applications. In Proceedings of the 11th IEEE Symposium on High-Performance Distributed Computing, 2002.

[NCWD+01] A. Natrajan, M. Crowley, N. Wilkins-Diehr, M. A. Humphrey, A. D. Fox, A. S. Grimshaw, and C. L. Brooks III. Studying protein folding on the Grid: Experiences using CHARMM on NPACI resources under Legion. In 10th IEEE Intl. Symp. on High-Perf. Dist. Computing (HPDC), August 2001.

[NHG02] A. Natrajan, M. A. Humphrey, and A. S. Grimshaw. The Legion support for advanced parameter-space studies on a Grid. Future Generation Computer Systems, 18(8):1033–1052, October 2002. Elsevier Science.

[PL95] J. Pruyne and M. Livny. Parallel processing on dynamic resources with CARMI. In Work. on Scheduling Strategies for Par. Processing, Intl. Par. Processing Symp. (IPPS), April 1995.

[SF02] W. Smith, I. Foster, and V. Taylor. Scheduling with advanced reservations. In International Parallel and Distributed Processing Symposium (IPDPS), May 2002.

[WASB95] R. Wolski, C. Anglano, J. Schopf, and F. Berman. Developing heterogeneous applications using Zoom and HeNCE. In Proceedings of the Work. on Heterogeneous Computing Scientific Computing, April 1995.

[Wei95] J. Weissman. Scheduling Parallel Computations in a Heterogeneous Environment. PhD thesis, Univ. of Virginia, August 1995.

[WKN+92] M. C. Wang, S. D. Kim, M. A. Nichols, R. F. Freund, H. J. Siegel, and W. G. Nation. Augmenting the optimal selection theory for superconcurrency. In Proc. Work. on Heterogeneous Processing, pages 13–22, 1992.

[WSH99] R. Wolski, N. T. Spring, and J. Hayes. The Network Weather Service: A distributed resource performance forecasting service for metacomputing. The Journal of Future Generation Computing Systems, 1999.

[Zho92] S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. In Work. on Cluster Computing, December 1992.

[ZWZD93] S. Zhou, J. Wang, X. Zheng, and P. Delisle. Utopia: A load sharing facility for large, heterogeneous distributed computer systems. Soft. Prac. and Exp., 23(2), 1993.

