Scalable Fault-Tolerant Management in a Service Oriented Architecture

Scalable, Fault-tolerant Management in a Service Oriented Architecture Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce Community Grids Lab, Indiana University, Bloomington IN 47404 (hgadgil, gcf, spallick, marpierc)@indiana.edu Abstract of XML based interactions that facilitate The service-oriented architecture has come a implementations in different languages, running on long way in solving the problem of reusability of different platforms and over multiple transports. existing software resources. Grid applications today are composed of a large number of loosely 1.1. Aspects of Resource / Service Management coupled services. While this has opened up new avenues for building large, complex applications, Before we proceed further, we clarify the use of it has made the management of the application term Resource in this paper. Distributed applications components a non-trivial task. Management is are composed of components which are digital entities further complicated when services exist on on the network. We consider a specific case of different platforms, are written in different distributed applications where such digital entities can languages, present in varying administrative be controlled by zero or modest state. We define domains restricted by firewalls and are susceptible modest state as being one which can be exchanged to failure. using very few messages (typically 1 message This paper investigates problems that emerge exchange). These digital entities in turn can bootstrap when there is a need to uniformly manage a set of and control components with much higher state. Such distributed services. We present a scalable, fault- components could be hardware (e.g. network, CPU, tolerant management framework. Our empirical memory) or software (e.g. file-systems, databases, evaluation shows that the architecture adds an services). We consider the combination of such a acceptable number of additional resources making digital entity and the component associated with it as a the approach feasible. manageable resource. Note that we do not imply any Keywords: Scalable, Fault-tolerance, Service Oriented relation to other definitions of the term "resource" Architecture, Web Services elsewhere in literature (e.g. WS-Resource as defined by WSRF). Further, if the digital entity is a service we 1. Introduction add appropriate management interfaces. If the digital This Service Oriented Architecture (SOA) [1] entity is not a service we create a Web Service wrapper delivers unprecedented flexibility and cost savings by that augments a Web Service based management promoting reuse of software components. This has interface to the manageable resource. opened new avenues for building large complex Next, we discuss the scope of management with distributed applications by loosely coupling interacting respect to the discussion presented in this paper. The software services. A distributed application benefits primary goal of Resource Management is the efficient from properly managed (configured, deployed and and effective use of an organization’s resources. monitored) services. However the various technologies Resource management can be defined as “Maintaining used to deploy, configure, secure, monitor and control the system’s ability to provide its specified services distributed services have evolved independently. For with a prescribed quality of service”. Resource instance, many network devices use Simple Network management can be divided into two broad domains: Management Protocol [2], Java applications use Java one that primarily deals with efficient resource Management eXtensions (JMX) [3] while servers utilization and other that deals with resource implement management using Web-based Enterprise administration. Management [4] or Common Information Model [5]. In the first category, resource management provides The Web Services community has addressed this resource allocation and scheduling, the goal being challenge by adopting a SOA using Web Services sharing resources fairly while maintaining optimal technology to provide flexible and interoperable resource utilization. For example, operating systems management protocols. The flexibility comes from the [6] provide resource management by providing fair ability to quickly adapt to rapidly changing resource sharing via process and memory management. environments. The interoperability comes from the use Condor [7] provides specialized job management for - 1 - compute intensive tasks. Similarly Grid Resource hoc solutions to interoperate between different Allocation Manager [8] provides an interface for management protocols. requesting and using remote system resources for job Finally, a central management system poses execution. problems related to scalability and vulnerability to a The second category deals with appropriately single point of failure. configuring and deploying resources / services while These factors motivate the need for a distributed maintaining a valid run-time configuration according management infrastructure. We envisage a generic to some user-defined criteria. In this case management management framework that is capable of managing has static (configuring, bootstrapping) and dynamic any type of resource. By implementing interoperable (monitoring and event handling) aspects. management protocols we can effectively integrate This paper deals with the second category of existing management systems. Finally the management resource / service management. framework must automatically handle failures within itself. 1.2. Motivation 1.3. Desired Features As suggested by Moore’s Law [9], the phenomenal progress of technology has driven the deployment of In this section, we provide a summary of the desired an increasing number of devices ranging from RFID characteristics of the management framework: devices to supercomputers. These devices are widely Fault-tolerance: As applications span wide area deployed, spanning corporate administrative domains networks, they become difficult to maintain and typically protected by firewalls. Low cost of hardware resource failure is normal. Failure could be a result of has made replication a cost-effective approach to fault- the actual resource failure or because of some related tolerance especially when using software replication. component such as network making resource These factors have contributed to the increasing inaccessible or even because of the failure of the complexity of today’s applications which are management framework itself. While the framework composed of ever increasing number of resources: must provide self-healing capabilities, resource failure hardware (hard-drives, CPUs, networks) and software must be handled by providing appropriate policies to (services). Management is required for maintaining a detect and handle failures while avoiding properly running application; however existing inconsistencies. approaches have shown limitations to successfully Scalability: With the increase in number of manage such large scale systems. manageable resource, the framework must scale to First, as the size of an application increases in terms accommodate the management of additional resources. of factors such as hardware components, software Further, additional components are typically required components and geographical scale, it is certain that to provide fault-tolerance. The framework must also some parts of the application will fail. An analysis [10] scale in terms of the number of these additional of causes of failure in Internet services shows that most components. of the services’s downtime can be attributed to Performance: Additional components required to improper management (such as wrong configuration) support the framework increases the initialization cost. while software failures come second. While initialization costs are acceptable, they Second, administration tasks are mainly performed contribute to increasing the cost of recovery from by persons. A great deal of knowledge needed for failure. Runtime events generated by resources require administration tasks is not formalized and is part of a finite amount of time to process. The challenge is to administrator’s know-how and experience. With the achieve acceptable performance in terms of recovery growing size and complexity of applications, the cost from failure and responsiveness to faults. of administration is increasing while the difficulty of Interoperability: As previously discussed, resources administration tasks approaches the limits of exist on different platforms and may be written in administrator’s skills. different languages. While proprietary management Third, different types of resources in as system systems such as JMX and Windows Management require different resource specific management Instrumentation [11] have been quite successful, they frameworks. As discussed before, the resource are not interoperable limiting their use in management systems have evolved independently. heterogeneous systems and platforms. The This complicates application implementation by management framework must address the requiring the use of different proprietary technologies interoperability issue. We leverage a Web Service for managing different types of resources or using ad- based management protocol to address interoperability. - 2 - Generality: Resource management must be generic, 2.1.1. Resource As defined in Section 1.1 we refer to i.e. the framework must apply equally well to hardware Resource as the component that requires management. as

Load more