Evaluation issues in Autonomic Computing

Julie A. McCann, Markus Huebscher Department Of Computing, Imperial College London {jamm, mch1} @doc.ic.ac.uk

Abstract • Self-optimization: an autonomic comput- ing system optimises its use of resources. Autonomic Computing is a concept that brings It may decide to initiate a change to the together many fields of computing with the system proactively (as opposed to reactive purpose of creating computing systems that are behaviour) in an attempt to improve per- reflective and self-adaptive. In this paper we formance. draw upon our experience of this field to dis- • Self-healing: an autonomic computing cuss how we can attempt to evaluate auto- system detects and diagnoses problems. nomic systems. By looking at the diverse sys- What kinds of problems are detected can tems that describe themselves as autonomic, be interpreted broadly: they can be as low- we provide an introduction to the concepts of level as a bit-error in a memory chip Autonomic Computing and describe some (hardware failure) or as high-level as an achievements that have already been made. We erroneous entry in a directory service then discuss this work in terms of what is nec- (software problem) [3]. If possible, it essary to evaluate and compare such systems. should attempt to fix the problem, for ex- We conclude with a definitive set of metrics, ample by switching to a redundant compo- which we believe are useful to evaluate nent or by downloading and installing soft- autonomicity. ware updates. However, it is important that as a result of the healing process the system is not further harmed, for example 1. Introduction by the introduction of new bugs or the loss Autonomic computing is generally consid- of vital system settings. Fault-tolerance is ered to be a term first used by IBM in 2001 to an important aspect of self-healing. Typi- describe computing systems that are said to be cally, an autonomic system is said to be self-managing [1]. However the concept of reactive to failures or early signs of a pos- self-management and adaptation in computing sible failure. systems has been around for some time. The • Self-protection: an autonomic system pro- event of the combination of object-oriented tects itself from malicious attacks but also programming paralleled with component-based from end users who inadvertently make software engineering essentially paved the way software changes, e.g. by deleting an im- toward autonomic computing [49]. That is, it portant file. The system autonomously can be argued that without the concepts of tunes itself to achieve security, privacy dynamic re-binding of components a system and data protection. Thus, security is an cannot effectively reconfigure itself to adapt to important aspect of self-protection, not just improve its service. in software, but also in hardware (e.g. When reviewing the current state-of-the art TCPA [23]). A system may also be able to in autonomic systems, the concept of self- anticipate security breaches and prevent management usually groups into having four them from occurring in the first place. basic properties: self-configuration, self- Self-management requires that a system optimization, self-healing and self-protection. monitor its components (internal knowledge) Here is a brief description of these properties and its environment (external knowledge), so (for more information, see [1] and [2]): that it can adapt to changes that may occur, • Self-configuration: an autonomic comput- which may be known changes or unexpected ing system configures itself according to changes where a certain amount of artificial high-level goals, i.e. by specifying what is intelligence may be required. desired, not necessarily how to accomplish However, there is no agreed definition of it. This can mean being able to install itself what an Autonomic system is, their evaluation based on the needs of a given platform and and moreover comparison, is difficult. Fur- the user. thermore the very emergent nature of such systems adds further complexity to the evalua- tion of such systems. This paper is an attempt to look at Autonomic computing and try to way, as many settings such as web proxy highlight areas, which can be used to compare server addresses or VPN security settings must performance and derive some form of metrics. still be entered manually). PCs on a LAN can The structure of this paper is as follows. discover other devices, such as printers, and Initially, in section 2 we survey the area of install their drivers automatically (under the Autonomic computing to attempt to build a right conditions). Further many consider data- map of the subject. To this end we provide an base query optimisers an early form of auto- introduction to the concepts of Autonomic nomic computing [4].

Company Title Salient feature IBM eLiza servers and mainframes respond to unexpected changes and system components’ failure autonomously [3]. LEO autonomic query optimiser for IBM’s DB2 database systems [38]. LEO monitors queries as they execute and compares its estimates of the cost for each step in the query execution plan with actual results. The detection of estimation errors can also trigger reoptimisation of a query in mid-execution. Blue Gene/L IBM’s a supercomputer to be released in 2005, will implement self-healing and self- (BG/L), management [3] using application programmers placing checkpoints in code. IceCube The server tracks its components’ health and maximum capacity. It can then autono- mously distribute data among the available nodes. Failed nodes (bricks) are worked around automatically, [17] Sun Jini network allow Systems experiencing failure of a component could find components on the network with the required functionality that are still functioning and then reallocate resources to them autonomously Grid Engine Deadline policies can be used to make sure that projects nearing the deadline receive Enterprise Edi- a greater share of resources. Then share based policies consider the accumulated tion resource usage of a user, so that if a user “over-uses” resources, then the grid will lower the entitlement of that user to resources for a certain period of time [39]. HP Superdome adding self-healing properties to replace a processor if there are too many errors to automatically fix Figure 1. Sample Commercial Autonomic Effort

Computing and describe some research that is However our definition of what we mean taking place in various fields of computing and by autonomic computing is that of a self- some achievements that have already been adaptive system as opposed to an adaptive made, section 3. After that in section 4, we system. Therefore, here standard query opti- concentrate on research in the field of software misers would not be considered as providing engineering and describe projects that focus on autonomicity. However if while a query was adding autonomic behaviour to software sys- running and the DBMS was monitoring the tems. Finally, in sections 5 and 6 we combine query’s execution and deciding on a different this work together with a discussion on per- query plan, then we would consider that auto- formance evaluation and benchmarking, taking nomic. Nevertheless, we realise that the into account our experiences of measuring boundary from adaptive to self-adaptive sys- autonomic systems and provide some initial tems is fuzzy. ideas on how such systems can be compared. 3. Why Autonomic computing 2. Autonomic computing today In trying to understand how to evaluate an The ideas behind autonomic computing are not autonomic system one much understand the new. In fact, it is possible to find some aspects reason we would want such a system. This of autonomic computing already in today’s allows us to compare whether or not the objec- software products [2]. For instance, Windows tive has been met. XP optimises its user interface (UI) by creating The main reason for large blue-chip com- a list of most often used programs in the start panies, like IBM, being interested in auto- menu. Thus, it is self-configuring in that it nomic computing is the need to reduce the cost adapts the UI to the behaviour of the user, and complexity of owning and operating an IT although in a fairly basic way, by monitoring infrastructure [4]. In particular, there is a need what programs are called most often. It can to alleviate the complexity with which system also download and install new critical updates administrators of IT services are faced today. without user intervention, sometimes without The aim is to allow administrators to specify restarting the system. Therefore, it also exhib- high-level policies that define the goals of the its basic self-healing properties. DHCP and autonomic system, and let the system manage DNS services allow devices to self-configure itself to accomplish these goals. At present, to access a TCP/IP network (albeit in a limited system administrators must tweak hundreds of settings and often spend weeks before getting a in particular adhoc networking [52]. For ex- system to run optimally. Autonomic systems ample, Liu and Martonosi [22] discuss the are also faster at adapting to changes to the problem of propagating software updates in a environment, e.g. by distributing its resources network of devices that are spread differently when a critical-project requires over a large area and are not all reachable from more CPU processing power. Furthermore, as a base station. Sensors co-operate to propagate information systems in enterprises grow larger, software updates to the entire network of sen- it is becoming increasing difficult to identify a sors, but at the same time they must optimise failure in the system and repair the affected energy consumption, because of tight energy

Group Domain Main characteristics Refs Multi-agent systems Kuo-Ming, James, Framework for multi- Communication middleware based on [19] Norman agent systems CORBA for monitoring and cooperation Sterritt, Bustard Autonomic components Heartbeat or pulse monitor for monitoring [26] Georgiadis, Magee, Architectural constraints Self-organising components with a global [25] Kramer for self-organising view expressed as architecture description components Kumar, Cohen Adaptive Agent Archi- Broker agents used as to provide fault- [27] tecture tolerance to overlying problem-solving agents. Bigus et al. ABLE agent toolkit Framework for building multi-agent systems. [36] Working on including autonomic agents. Architecture design-based autonomic systems Garlan, Schmerl Architecture model- Probes, gauges for monitoring running sys- [7] based adaptation for tem, architecture manager implements adap- [8][9] autonomic systems tive behaviour, based on architecture-model [10] of system. de Lemos, Fiadeiro Architecture for fault- Components considered as black-boxes. [15] tolerance in adaptive systems Dashofy, van der Framework for architec- xADL 2.0 architecture description language, [11] Hoek, Taylor ture-based adaptive c2.fw development framework. [12] systems Valetto, Kaiser Adding autonomic be- Autonomic behaviour as a distributed multi- [28] haviour to existing sys- agent infrastructure called Workflakes. [29] tems Hot swapping components Rutherford et al. Reconfiguration in EJB BARK tool as an extension to EJB to support [33] model component replacement. Whisnant, Model for reconfigur- Adaptivity through replacement of bindings [34] Kalbarczyk, Iyer able software between operations and invoked code blocks. Appavoo et al. Hot-swapping at OS High-performance hot-swapping of fine- [35] level grained components in K42 OS. Kon, Campbell et al. Reflective middleware DynamicTAO, a middleware for dynamically [40] reconfigurable software. G. S. Blair et al. Reflective middleware OpenORB, reflective middleware for self- [41] healing systems. Table 1: Summary of reviewed research in software architectures for autonomic computing.

component quickly, as large systems are het- constraints. Further due to the autonomous erogeneous and no single person knows the nature of NASA’s DS1 (Deep Space 1) mis- entire system. Examples of such systems can sion and the Mars Pathfinder [24],[18] some be found in Figure 1. self-adaptation is required. That is, as mission Autonomic behaviour is a topic that has control cannot rapidly send new commands to found its way in many other computing fields a probe, it must quickly adapt to extraordinary Autonomic Element (agent)

Managed Component

Internal Repair plan situations, therefore it is important that a probe Monitor effector be able to make decisions and carry them out on its own.

4. Software architectures for auto- Adaptation planner System Knowledge nomic computing

The autonomic research activities in software External systems can broadly be categorised into four Monitor areas: monitoring of components, interpreta- tion of monitored data, creation of a repair plan (i.e. an adaptation of the system), and execu- tion of a repair plan. Based on this, we choose Figure 2: An autonomic agent to group the approaches to autonomic comput- ing systems orthogonally into two categories: intelligent multi-agent systems and architec- components are inseparable from the main ture design-based autonomic systems. How- application logic in the agent. ever, the two approaches have common con- Compared to the architecture design-based cepts it is sometimes difficult to place a re- approach, adaptive multi-agent systems have search project in one particular category. an innate distributed architecture. With no centralised monitoring infrastructure, agents Table 1 shows a summary of the research re- must monitor themselves (internal monitor) viewed in this section. but also other agents (external monitor). Ex- ternal monitoring can be achieved proactively 4.1. Multi-agent systems by having each agent send its heartbeat or Complex autonomic systems that are not com- pulse regularly on an autonomic signal channel posed of a single self-managing component that other agents send and listen on [26]. The can be built using intelligent agents (for infor- heartbeat provides a summary of the state of an mation on multi-agent systems, see [5]). Every agent to other agents responsible for monitor- agent has its own goals, which drive its deci- ing that state. Because the autonomic signal sions. An agent in an autonomic system is channel connects agents that are not necessar- proactive, and possesses social ability. The ily on the same physical system, but could very latter can potentially lead to instabilities of the well be peer-to-peer or networked, there is the overall system due to the chain reaction of danger here that external monitoring activity agents instructing other agents to change be- may flood the autonomic signal channel with haviour[1]. A difficult talk is also to define the lots of traffic. Thus, care must be taken in individual goals of the agents such that the designing the monitoring protocol. The heart- desired global goal is accomplished [1]. In an beat approach has already been used by the autonomic system, we want to be able to pro- Open Grid Services Architecture (OGSA) [31] vide goals in the form of high-level notions, and by NASA on its Deep Space 1 (DS1) mis- and expect the agents themselves to determine sion [32] (although in NASA’s case it is used what behaviour is necessary to reach them. in a different context). In both cases the heart- Wise et al. [37] propose a top-down hierar- beat is a compact message that provides very chical coordination model for agent applica- limited information about the monitored tions, in the form of their visual process lan- component. guage Little-JIL. A task is divided into steps, Georgiadis et al show how components can and each step can further be divided into sub self-configure their interactions in compliance steps. A step can then be assigned to an execu- with an overall architectural specification ex- tion agent, which keeps an agenda of tasks to pressed using the Alloy language [25],[16]. complete. Each component has a view of the entire sys- Although in multi-agent systems each tem maintained by the component manager. component exhibits its own autonomic behav- They measure the elapsed time from the mo- iour, there is usually a clean separation be- ment a new component is inserted into the tween the conventional component that per- system until the moment when all components forms a task and the autonomic manager which have the new consistent view of the system. implements self-management around it. Figure Because of message broadcasting, this time 2 (based on a figure from [26]) shows a gen- increasing roughly linearly in the number of eral diagram for an autonomic agent. One nodes in the system. example is [20] known as the BDI methodol- Kumar and Cohen [27] show with experi- ogy. However, in some systems the autonomic mental data how a team of broker agents can recover when a broker agent gets disconnected are updated during monitoring of the running from the rest of the system. Again Broker system and the constraints on them are used to agents share the same global knowledge of the decide when an adaptation is necessary. The system, and therefore when a broker agent autonomic infrastructure is loosely coupled discovers that another agent has been discon- with the running system. In fact, it can run on a nected, it shares this information with the rest different machine, so as not to hinder the run- of the team. ning system [6]. In some the code of compo- Bigus et al. [36] are extending their ABLE nents is augmented with checkpoints for ex- agent platform to support autonomic agents to ample to allow reporting of the occurrence of reduce the system administrator workload. specific method calls, thereby making monitor- ABLE agents are built on top of Enterprise ing more straightforward. JavaBeans. They use sensors to collect moni- toring data and effectors to perform resulting 4.2.1. Monitoring actions on an application. An important aspect thereof is the ability of an intelligent auto- Architecture Manager nomic agent to maintain a model of the exter- nal environment and its own components. Arch. Model

4.2. Architecture design-based auto- High-level notions nomic systems G G Gauges In the architecture-based approach, the indi- vidual components are not per se autonomic. Instead, the infrastructure that handles the Raw monitoring data P autonomic behaviour of the system uses an P P Probes architectural description model of the running Executing system (which is not autonomic itself) to moni- System Repair plan tor the running system, reason about it and execution determine appropriate adaptive actions. The (effectors) adaptivity infrastructure is typically clearly separated from the running system. Figure 2: Architecture model-based systems First of all, an architecture model is used to design the system (as is often the case with Figure 2 shows a diagram of an architecture software development). In essence, an archi- model-based autonomic system illustrating the tecture model can be considered a graph of monitoring infrastructure (it is based on figures interacting components [6]. The nodes of a from [6],[28], [47] and [48]. graph are called components, a general con- Probes can be inserted into the running cept, and what a component actually is de- system to monitor it. These probes are usually pends on the application. Often in research, the localised and deliver system-specific observa- example of a web server application is used, in tions. For example, a probe might be deployed which the components are web servers and to report the size of files that are loaded into a clients, and possibly databases. It may be de- system, in which case the appropriate system sirable, however, to have a finer level of com- call in the OS would be instrumented to allow ponentisation. For instance, user interfaces this type of monitoring by a probe [8]. The raw could be considered components. The arcs in monitoring data provided by the probes must the graph are called connectors and they repre- be aggregated and mapped to high-level no- sent the interaction paths between components tions in the architecture model. This job is (again, a broad notion). Georgiadis et al [25] performed by so-called gauges, which are also used a graph of components and connec- intermediary components between the probes tors, but there they are clearly mapped to self- in the running system and the architecture configuring software components and their manager, which controls adaptation of the communication connections, respectively. system at the architecture model level. Gauges Here, the granularity level of the notion of may need to collect data from various probes components in the model is not necessarily the to be able to compute high-level observations. same for all architecture descriptions, but is These high-level data allow the architecture determined be the designer of the architectural model to be updated based on the current state model of a specific system. of the executing system. When a property in Many systems allow components and con- the architecture model is updated through nectors to be annotated with a property list [6] monitoring, the architecture model is analysed [12] and constraints [46],[48]. These properties to determine whether the system is still per- forming adequately. If not, a repair plan is components in the system with new ones that created. The repair plan describes which possess similar functionality but with a differ- components or connectors are to be removed ent implementation and which are more appro- or adjusted and which ones are to be inserted. priate to the current condition of the system The repair plan is created based on repair and its environment. strategies that are defined in advance. For Much research has been carried out with many of the architectures the adaptive strategy regard to the hot swapping of components to is closed in the future we may see the knowl- reconfigure the system [11],[12],[13],[14] and edge of the success of past repair plans used to [51]. Typically this involves various stages: determine the best strategy [7]. This can be terminating a component that is to be replaced determined ‘of-line’ on another machine, and suspending any components and connec- however a considerable amount of bandwidth tors bordering the affected area; removing may be required for monitoring, and this can components and connectors and adding new become a problem if the monitoring data ones as defined by the repair plan; and resum- travels on the same network interface as ing components and connectors affected. application data as seen in Patia[48] and [6]. Rutherford et al. [33] show how an Enter- An advantage of the complete separation prise JavaBeans system can be extended such between autonomic behaviour and the running that components can be replaced with new system is that software adaptation can be versions. In particular, they must implement “plugged into” a pre-existing system [28]. The the reloading of parameters and refreshing its only direct interaction with the target system is bindings, two activities that are not part of the through the probes that monitor the system and standard JavaBean interface. Preliminary ex- the effectors that make adjustments and recon- periments show that loading a new component figurations. Changes can be coarse-grained, and binding it in the system takes on the order such as replacing entire components or rear- of a few seconds. ranging the connections among components, Whisnant, Kalbarczyk and Iyer [34] de- but they can also be fine-grained, such as scribe a system model in which operational changing the operational parameters, internal elements can be reconfigured by changing the state or functioning logic of individual compo- bindings between operations and the code nents [29]. Naturally, the architecture model blocks invoked through the operation. Interest- used for adaptation decisions has to be created ingly, this approach does not require entire after the existing system. Valetto and Kaiser components to be terminated, removed and [28] tested this concept on a server farm deliv- replaced. Modelling at the level of code blocks ering instant messaging (IM) services. monitor allows efficient adaptation of component the load of IM servers, and if a threshold is behaviour. Here adaptivity is fine-grained in exceeded, another IM server is started. This that it permeates the design of the system may also happen when an existing server fails. down to the code such as at tree data structure. The infrastructure implementing the adaptivity Appavoo et al. [35] show how the compo- of the system, e.g. the deployment of Worklets, nent-based K42 has been gauges and probes, is called Workflakes. The improved to support hot swapping of compo- workflow describing the adaptation process nents in the OS and support of applications. was initially expressed as a set of coding pat- This is carried out by the transfer of the com- terns in Java. ponent state from the old component to the Because of the centralised nature of the ar- new. This can be freely chosen by the compo- chitecture-based approach to self-healing sys- nent, and need not follow a canonical form. tems, deployment of such a system in a truly New components can be downloaded and distributed environment, where unplanned plugged into the running system at any time failures, such as that of the central architecture after the deployment of the original system. manager, can occur requires particular atten- Their experiments showed that overhead was tion to fault tolerance in the infrastructure and negligible compared to the performance gains the individual components. For example [28], achieved. This was possible because they im- use a decentralised workflow system that uses plemented hot swapping at the OS level. That agents that are part of the adaptivity infrastruc- is, the notion of a component here is fine- ture, which is cleanly, separated from the run- grained: a component is for example the File ning system. Cache Manager (FCM)

4.3. Hot swapping components 5. Metrics and evaluation In this section, research projects are described that focus on how to effect adaptation within With modern computing, consisting of new the system. This usually involves replacing paradigms such as planetary-wide computing, pervasive, and , systems mercial performance comparison has borne are more complex than before. Interestingly, cost in mind for some time. For example TPC when chip design became more complex we benchmarks have always taken cost per employed computers to design them. Today we performance into account [54]. are now at the point where humans have lim- Cost comparison is further complicated by ited input to chip design. With systems becom- the fact that adding autonomicity means add- ing more complex it is a natural progression to ing intelligence, monitors and adaptation have the system to not only automatically mechanisms – and these cost. In Patia we generate code but build systems, and carryout aimed to measure the cost of adding autonomic the day-to-day running and configuration of features to a webserver, which can cope with the live system. Therefore autonomic comput- fluctuating demand and sudden high demand ing has become inevitable and therefore will (flash crowds) [48]. We found that the costs of become more prevalent. Hence their evaluation adding monitors and monitor traffic were only is increasingly important. This section lists sets just outweighed by the benefits they provided of metrics and means by which we can com- under the normal operation of the server spe- pare such systems. cifically. As this was fairly predictable it was hardly worth-wile. However under duress the 5.1. Quality of Service (QoS) system would not work without the autonomic QoS is possibly the top-level means to com- features. Therefore would a comparative char- pare modern systems – it should reflect the acteristic be to do with added functionality degree to which the system is reaching its achievable that would otherwise not be primary goal. It is typically composed of a achieved in a non-autonomic system? As this number of metrics, e.g. data delivery turn- might be found in a serendipitous fashion, it around time over cost. It is a highly important could be difficult to predict what to test for in metric in autonomic systems as they are typi- advance. cally designed to improve some aspect of a The actual architecture can also impact in service. Most of the research in this field is the measurement of the cost of a self-adaptive looking at using autonomicity to improve per- system. For example most architecture-based formance (usually speed or efficiency). How- solutions consist of a service that has auto- ever other systems wish to improve the user’s nomic features added. For many of these archi- experience with the system in self-adaptive or tectures the intelligence to run the system is personalised GUI design for disabled people separate and centralised, the monitors or gages etc. Overall this metric is tightly coupled to the are external to what they are measuring and the application area or service that is expected of decision to adapt and its supervision is external the system. It can be measured as a global goal to the component. Here the question is do we metric or at the sub-service or component level compare systems that use other computing where each unit’s ability to met its local goal is nodes to run the autonomic services with those measured. who run the autonomic services on the same 5.2. Cost system? With the former, costs could be in terms of extra hardware and communications Autonomicity costs, the degree of this cost and to that hardware node, and the saving is that it its measurement is not clear-cut. Currently lessens the impact on the running of the main most performance studies of architecture-based system. Extra nodes dedicated to the auto- autonomic systems have measured its ability to nomic services means that they can be more reach its goal. However agent-based systems intelligent, checking the validity of a given typically compare the amount of communica- reconfiguration or if it is an optimum configu- tion, actions performed, and cost of actions ration of many candidates. Further extra nodes required to reach the goal. can allow resources for open intelligence For many commercial systems the aim is to where the autonomic decisions themselves are improve the cost of running an infrastructure, fed into the autonomic system for it to self- which includes primarily people costs in terms evaluate and learn. of systems administrators and maintenance. In AI and agent-based autonomic systems, This means that the reduction in cost for such the intelligence is highly distributed and usu- systems cannot be measured immediately but ally contained within the component or agent. over time and as the system becomes more and The latter type of system can have the intelli- more self-managing. Further many commercial gence to carry out its service tightly coupled to companies have had difficulties in sharing the the self-management intelligence continued vision of autonomic computing with their within its component. Therefore the self- shareholders for this reason [53]. It should management overhead is perhaps indistin- however be noted that current de facto com- guishable from the agent’s core function and therefore it is more difficult to separate out the failure e.g. using a mean time before failure costs – if sensible at all. metric of hardware and others to cope with Further, a class of application very fitting unpredicted environments. To measure this, to autonomic computing is that of Ubiquitous the nature of the failure and how predictable computing which typically consists of net- that failure is, needs to be varied and the sys- works of sensors working together to create tems’ ability to cope measured. Ability to cope intelligent homes, monitor the environment etc could be in terms of a Quality of Service met- [47]. This sort of application relies on self- ric that pertains to the application domain. reliance, distributed self-configuration intelli- For example in our Kendra1 audio server, gence and monitoring. However many of the which is a closed self-adaptive system, we nodes in such a system are limited in resources would test Kendra’s failure avoidance abilities and can be wireless, which means that the cost by varying the bandwidth in terms of available of autonomous computing involves resource bandwidth and how quickly that bandwidth consumption such as battery power. varied. This would test its ability to avoid periods of silence given certain environmental 5.3. Granularity/Flexibility circumstances. That is, in a network, who’s The granularity of autonomicity is an impor- bandwidth only varied slightly or in a predict- tant issue when comparing autonomic systems. able way, we observed that Kendra would Fine grained components with specific adapta- adapt more gracefully than in a bursty network tion rules will be highly flexible and perhaps which saw Kendra adapt up and down the adapt to situations better, however this may codecs sometimes even missing an opportunity cause more overhead in terms of the global to adapt as it did not notice environmental system. That is, if we assume that each finer- change as it was handling the previous adapta- grained component requires environmental tion [44]. data and is providing some form of feedback on its performance then potentially there is 5.5. Degree of Autonomy more monitoring data or at least environmental Related to failure avoidance, we can compare information flowing around the global system. how autonomous a system is. This would re- Of course may not be the case in systems late to AI and agent-based autonomic systems where the intelligence is more centralised. primarily as their autonomic process is usually Many current commercial autonomic endeav- to provide an autonomous activity. For exam- ours are at the thicker grained service level. ple the NASA pathfinder must cope with un- Granularity is important for eg in [33], predicted problems and learn to overcome where unbinding, loading and rebinding a them without external help. Decreasing the component took a few seconds. These few degree of predictability in the environment and seconds are tolerable in a thick-grained com- seeing how the system copes could measure ponent based architecture where the overheads this. Lower predictability could even reach it can be hidden in the system’s overall operation having to cope with things it was not designed and potentially change is not that regular. to. A degree of proactivity could also compare However in finer-grained architectures, such as these features. an Operating System or Ubiquitous computing where change is either more regular or the 5.6. Adaptivity components smaller, the hot swap time is We separate out the act of adaptation form the potentially too much. monitoring and intelligence that causes the One question we may ask is, can systems system to adapt. Adaptivity can be something that provide the same service be compared simple as a parameter begin changed in for with each other if the granularity of example self-configuration systems. Here the autonomicity is different? Perhaps at a high adaptation does not impact the performance so level yes. much as a component-based reconfiguration. In the latter a component may need to be hot- 5.4. Failure avoidance (Robustness) swapped where state is saved, the new compo- Typically many autonomic systems are de- nent located and then bound into the system. signed to avoid failure at some level. Many are Some systems are designed to continue execu- designed to cope with hardware failure such as tion whilst reconfiguring, while others cannot. a node in a cluster system or a component that 1 is no longer responding. Some avoid failure by Kendra is a self-adaptive audio player that was devel- retrieving a missing component. Either way the oped in 1995 and adapted the delivery of the audio codec predictability of failure is an aspect in compar- to best suit the available bandwidth between a client and the audio server. It monitored audio delivery and if band- ing such systems. Some systems will be de- width changed another codec was chosen. The aim was to signed for their ability to cope with predicted keep the audio quality as best as possible and avoid peri- ods of silence [43,44,45,50]. Furthermore the location of such components (how close the system thinks it is to a disaster again impacts the performance of the adaptiv- situation), monitoring sample rates (how much ity process. That is, a component object, which environmental data to monitor and store to use is currently local to the system verses a com- predict change in bandwidth). We found that in ponent (such as a printer driver for example), a generally low bandwidth link it is better that having to be retrieved over the Internet, will the system is not sensitive as that adaptation have significantly differing performance. Per- process impeded too much on the delivery of haps more future systems will have the equiva- the sound. However in good network condi- lent of a pre-fetch of components that are tions it is better to be more sensitive as this likely to be of use and are preloaded to speed delivers the best all round quality of sound up the re-configuration process. [48]. 5.7. Time to adapt and Reaction Time 5.9. Stabilisation Related to cost and sensitivity, these are meas- Another metric related to sensitivity is stabili- urements concerned with the system re- sation. That is the time taken for the system to configuration and adaptation. The time to learn its environment and stabilise its opera- adapt is the measurement of the time a system tion. This is particularly interesting for open takes to adapt to a change in the environment. adaptive systems that learn how to best re- That is, the time taken between the identifica- configure the system. For closed autonomic tion that a change is required until the change systems the sensitivity would be a product of has been effected safely and the system moves the static rule/constraint base and the stability to a continue state. Reaction time can be seen of the underlying environment the system must to partly envelop the adaptation time. This is adapt to. the time between when an environmental ele- ment has changed and the system recognises 5.10. Benchmarking that change, decides on what reconfiguration is Finally, it will become necessary to bring these necessary to react to the environmental change metrics together to form some sort of bench- and get the system ready to adapt. Further the marking tool. There are two approaches this reaction time affects the sensitivity of the can take; either we can derive new autonomic autonomic system to its environment (see systems benchmarks or we can augment cur- below). rent benchmarks to incorporate metrics, which measure autonomic characteristics. Our initial 5.8. Sensitivity attempt to do with was with the Patia project This is a measurement of how well the self- [Patia]. This project required we test our auto- adaptive system fits with the environment it is nomic webserver and compare its performance sitting in. At one extreme a highly sensitive with current webservers. We soon found that system will notice a subtle change as it hap- current webserver benchmarks would not only pens and adapt (perhaps subtly) to improve be able to test the autonomic aspects of Patia, itself based on that change. However in reality, but actually did not measure how traditional depending on the nature of the activity, there is webservers were being used. It soon became usually some form of delay in the feedback apparent that we would have to design and that some part of the environment has changed build a new webserver benchmark, namely effecting a change in the autonomic system. Aeolus [42]. This took research, which de- Further the changeover takes time. Therefore if scribes modern web access and data, character- a system is highly sensitive to its environment istics, and built a benchmark based on this. potentially it can cause the system to be con- Further, we wished to test the robustness of our stantly changing configuration etc and not Patia webserver under extreme conditions getting on with the job itself. where we simulated a flash crowd that would In measuring Kendra we made the parame- test the autonomic features of Patia to the ex- ters such that the system became more sensi- treme. Using many of the metrics we have tive to the fluctuations in bandwidth to see if it mentioned in this section, we extended the would improve the reaction and ultimately Aeolus webserver benchmark accordingly. have the delivery of the audio better match the However, we do not believe that deriving bandwidth available to it. As mentioned in benchmarks that measure autonomic systems is section 5.4, Kendra is a relatively simple self- the way forward. Instead, due to the diverse adaptation system, yet the numbers of parame- application of autonomic systems, it seems ters, which affected the sensitivity of the adap- better to augment application specific bench- tation mechanism, were many. For example we marks to include metrics which evaluate auto- could vary the buffer size (which is the data nomic features of that system e.g. robustness, area used to buffer audio), disaster horizon reaction speed, stability etc. In particular the Quality of Service benchmark, which we be- ing environment (networking) conditions. We lieve is the top-level measurement of how well felt that no concrete quantifiable conclusions the system is meeting its goals, is specific to were really made other than to say that over the application in question. Therefore we see sensitivity in bursty networks is bad which we traditional benchmarks such as the TPC possibly would have guessed. benchmarks being used to measure autonomic Therefore, finally, it is interesting that alle- DBMSs but perhaps extended to test the viate the maintenance and operation of our autonomous nature of the system. modern more complex systems we require that addition of even more complexity. It is our 6. Conclusions argument that this complexity makes such systems much more difficult to evaluate than Autonomic computing is an engineering con- before and therefore the need to derive metrics cept that has found its way in a myriad of and benchmarks is a highly important and computing fields. This paper is a review of interesting area. some typical examples of autonomic comput- ing attempting to give the reader a feel for the References nature of these types of systems and in doing so illustrate the complexities in trying to meas- [1] Kephart J. O., Chess D.M.. The Vision of Auto- ure the performance of such systems and com- nomic Computing. Computer, IEEE, Volume 36, pare them. We have presented two major types Issue 1, January 2003, Pages 41-50 of architecture that exhibit autonomic proper- [2] Bantz D. F. et al. Autonomic personal computing. IBM Systems Journal, Vol 42, No 1, 2003 ties and describe these as AI (agent-based) and [3] Dailey Paulson L.. Computer System, Heal Thy- architecture based. We have presented the self. Computer, IEEE, Volume 35, Issue 8, Au- common components found in each of these gust 2002, Pages 20-22 types of system, and from this derived a set of [4] Pescovitz D.. Helping computers help themselves. metrics and methods which we believe can be Spectrum, IEEE, Volume 39, Issue 9, September used to compare autonomic computing sys- 2002, Pages 49-53 tems. These are: [5] Wooldridge M., An Introduction to MultiAgent Systems. John Wiley & Sons Ltd • Quality of Service [6] Garlan D., B. Schmerl. Exploiting Architectural Design Knowledge to Support Self-Repairing Sys- • Cost tems. Proceedings of the 14th international con- • Granularity/Flexibility ference on Software engineering and knowledge • Failure avoidance (Robustness) engineering. July 2002 • Degree of Autonomy [7] Garlan D., B. Schmerl. Model-based Adaptation • Adaptivity for Self-Healing Systems. Proceedings of the first workshop on Self-healing systems, November • Time to adapt and Reaction Time 2002 • Sensitivity [8] Garlan D., Schmerl B., Chan J.. Using Gauges for • Stabilisation Architecture-Based Monitoring and Adaptation. Working Conference on Complex and Dynamic We realise that some of these metrics are more Systems Architecture, Brisbane, Australia, De- general than others and some pertain to some cember 2001. autonomic systems and not to others. However [9] Cheng S-W., Garlan D., Schmerl B., Sousa J. P., we believe that the next step is to take this Spitznagel B., Steenkiste P.. Using Architectural Syle as a Basis for System Self-repair. Software information and derive a more formal method Architecture: System Design, Development, and to compare performance of autonomic sys- Maintenance (Proceedings of the 3rd Working tems. IEEE/IFIP Conference on Software Architecture) A final note regarding our experience of Bosch J., Gentleman M., Hofmeister C. Kuusela , evaluating the Kendra architecture. Here we J. (Eds), Kluwer Academic Publishers, August set the top-level QoS goal to be that the audio 25-31, 2002. Pages 45-59 quality was as high as possible while avoiding [10] Cheng S-W., Garlan D., Schmerl B., Steenkiste P., Ningning Hu. Software Architecture-based periods of silence. When testing the system we th measured general quality levels, unnecessary Adaptation for Grid Computing. 11 IEEE Con- ference on High Performance Distributed Com- adaptation, missed opportunities to adapt, puting (HPDC’02), Edinburgh, Scotland, July sensitivity to environment etc. Kendra is a 2002. relatively simple system with closed self- [11] Dashofy E. M., van der Hoek A., Taylor R. N.. adaptation (i.e. the autonomic intelligence does Towards Architecture-based Self-Healing Sys- not grow), yet the performance statistics were tems. Proceedings of the first workshop on Self- of a large volume and difficult to interpret - healing systems, November 2002 especially in terms of relating behaviour to [12] Dashofy E. M., van der Hoek A., Taylor R.N.. An varying the many tuning parameters and differ- Infrastructure for the Rapid Development of XML-based Architecture Description Languages. [32] Wyatt J., Hotz, H. Sherwood R., Szijjaro J., Sue Proceedings Of the 24th International Conference M., Beacon Monitor Operations on the Deep on Software Engineering (ICSE2002), Orlando, Space One Mission. 5th Int. Sym. AI, Robotics Florida, May 2002. and Automation in Space, Tokyo, Japan, 1998. [13] xADL 2.0 Homepage. URL: [33] Rutherford M. J., Anderson K., Carzaniga A., http://www.isr.uci.edu/projects/xarchuci/ Heimbigner D., Wolf A. L., Reconfiguration in [14] ArchStudio 3 Foundations – c2.fw. the Enterprise Javabean Component Model. Pro- URL: http://www.isr.uci.edu/projects/archstudio/c ceedings of the IFIP/ACM Working Conference 2fw.html on Component Deployment, Berlin, 2002, Pages [15] de Lemos R., Fiadeiro J. L.. An Architectural 67-81. Support for Self-Adaptive Software for Treating [34] Whisnant K., Kalbarczyk Z. T., Iyer R. K., A Faults system model for dynamically reconfigurable [16] Kramer J. and Magee J.. The Evolving Philoso- software. IBM Systems Journal, Vol. 42, No. 1, pher Problem: Dynamic Change Management. 2003. IEEE Transactions on Software Engineering, Vol. [35] Appavoo J. et al. Enabling autonomic behaviour 16, No. 11, November 1990. in systems software with hot swapping. IBM Sys- [17] BlueGene/L Team at IBM and Lawrence Liver- tems Journal, Vol. 42, No. 1, 2003. more National Laboratory, IBM Research and [36] Bigus J. P. et al. ABLE: A toolkit for building IBM Rochester. An Overview of the BlueGene/L multiagent autonomic systems. IBM Systems Supercomputer. Journal, Vol. 41, No. 3, 2002. [18] Wolpert D. H., Wheeler K. R., Tumer K., Collec- [37] Wise A. et al. Using Little-JIL to Coordinate tive Intelligence for Control of Distributed Dy- Agents in Software Engineering. In Automated namical Systems. NASA Ames Research Center, Software Engineering Conference (ASE 2000), Technical Report No NASA-ARC-IC-99-44. September 2000. [19] Kuo-Ming C., James A., Norman P.. A Frame- [38] Markl V., Lohman G. M., Raman V.. LEO: An work for Intelligent Agents within Effective Con- autonomic query optimizer for DB2. IBM Sys- current Design. The Sixth International Confer- tems Journal, Vol. 42, No. 1, 2003. ence on Computer Supported Cooperative Work [39] How Sun™ Grid Engine, Enterprise Edition 5.3 in Design, 12-14 July 2001, Pages 338–343 Works. [20] Gamma E. et al. Design patterns: elements of URL: http://wwws.sun.com/software/gridware/sg reusable object-oriented software. Addison eee53/wp-sgeee/wp-sgeee.pdf Wesley. [40] Kon F., Campbell R. H., Mickunas M. D.,. [21] Object Management Group. URL: Nahrstedt, K Ballesteros F. J.. 2K: A Distributed http://www.omg.org/ Operating System for Dynamic Heterogeneous [22] Liu T., Martonosi M.. Impala: A Middleware Environments. IEEE, Proceedings of The Ninth System for Managing Autonomic Parallel Sensor International Symposium on High-Performance Systems. Proceedings of the ninth ACM Distributed Computing, 1-4 Aug. 2000, Pages SIGPLAN symposium on Principles and practice 201-208. of parallel programming, June 2003. [41] Blair G. S. et al. Reflection, Self-Awareness and [23] TCPA – The Trusted Computing Platform Alli- Self-Healing in OpenORB. ACM, Proceedings of ance. URL: http://www.trustedcomputing.org/ the first workshop on Self-healing systems, No- [24] Muscettola N., Nayak P. P., Pell B., Williams B. vember 2002. Pages 9-14. C.. Remote Agent: To Boldly Go Where No AI [42] Bletsas E. N., McCann, J. A AEOLUS: An Ex- System Has Gone Before. NASA Ames Research tensible Webserver Benchmarking Tool submit- Center. ted to 13th IW3C2 and ACM World Wide Web [25] Georgiadis I., Magee J., Kramer J.. Self- Conference (WWW04), New York City, 17-22 Organising Software Architectures for Distrib- May 2004. uted Systems. ACM, Proceedings of the first [43] McCann J. A,. Crane J.S.,'Kendra: Internet Dis- workshop on Self-healing systems, November tribution & Delivery System an introductory pa- 2002. per', Proc. SCS EuroMedia Conference, Leicester, [26] Sterritt R., Bustard D.. Towards an Autonomic UK, Ed. Verbraeck A., Al-Akaidi M., Society for Computing Environment. University of Ulster, Computer Simulation International, January 1998. Northern Ireland. pp 134-140 [27] Kumar S.,. Cohen P. R. Towards a Fault-Tolerant [44] McCann J.A., Howlett P., Crane J.S., 'Kendra: Multi-Agent System Architecture. Adaptive Internet System', Journal of Systems [28] Valetto G., Kaiser G.. A Case Study in Software and Software, Elsevier Science, Volume 55, Issue Adaptation. ACM, Proceedings of the first work- 1, 5 November 2000 .pp 3-17. shop on Self-healing systems. November 2002. [45] McCann J.A., 'The Kendra Cache Replacement [29] Valetto G., Kaiser G.. Combining Mobile Agents Policy and its Distribution', published in World and Process-based Coordination to Achieve Soft- Wide Web An International Journal, Volume 3, ware Adaptation. Columbia University. Number 4, Baltzer Science Publishers, ISSN [30] Cougaar – Cognitive Agent Architecture. 1386-145X, December 2000.pp231-240. URL: http://www.cougaar.org/ [46] McCann J .A. , Jawaheer G., Sun L., 'Patia: Adap- [31] The Globus Heartbeat Monitor Specification tive Distributed Webserver (a Position Paper)' In- v. 1.0. ternational Symposium on Autonomous Decen- URL: http://www- tralized Systems (ISADS). April 2003 fp.globus.org/hbm/heartbeat_spec.html [47] McCann J .A., 'ANS (Autonomic Networked [51] Kostkova P., McCann J .A., ' Support for Dy- System): A Position Paper', 1st UK-UbiNet namic Trading and Runtime Adaptability in Mo- Workshop,25-26th September 2003 bile Environments' chapter in Adaptive Evolu- [48] McCann J .A., Jawaheer G. 'Experiences in Build- tionary Information Systems. ed. N. V. Patel, ing the Patia Autonomic Webserver' 1st Interna- 2002. tional Workshop "Autonomic Computing Sys- [52] Barr R., Bicket J. C., Dantas D.S., Du B., T. W. tems", DEXA 2003 D. Kim, B. Zhou, E. Gün Sirer ‘On the need for [49] McCann J .A. 'The Database Machine: Old Story, system-level support for ad hoc and sensor net- New Slant?' Proceedings of the first Biennial works. ACM SIGOPS Operating Systems Re- Conference on Innovative Data Systems Re- view, Volume 36 Issue 2 search, VLDB, January 5-8 2003 [53] Nussey I and Telford R presentation part of the [50] McCann J .A., 'Adaptivity for Improving Web IBM Academic Autonomic Computing Day, Lon- Streaming Application Performance', chapter in don, 29th October 2003. Adaptive Evolutionary Information Systems. ed. [54] The Transaction and Performance Processing N. V. Patel, 2002 Council, PC http://www.tpc.org/