Performance Assurance via Software Rejuvenation: Monitoring, Statistics and Algorithms

Alberto Avritzer,∗ Andre Bondi,∗ Michael Grottke,† Kishor S. Trivedi,† and Elaine J. Weyuker‡

Abstract

We present three algorithms for detecting the need for software rejuvenation by monitoring the changing values of a customer-affecting performance metric, such as response time. Applying these algorithms can improve the values of this customer-affecting metric by triggering rejuvenation before performance degradation becomes severe. The algorithms differ in the way they gather and use sample values to arrive at a rejuvenation decision. Their effectiveness is evaluated for different sets of control parameters, including sample size, using simulation. The results show that applying the algorithms with suitable choices of control parameters can significantly improve system performance as measured by the response time.

1 Introduction

The increasing complexity of large industrial systems has created a need for sophisticated software monitoring to ensure performance and availability. We have introduced in [1] and [2] a new software rejuvenation approach that can ensure stable performance of degradable systems. Our approach was designed to distinguish between performance degradation that occurs as a result of burstiness in the arrival process and software degradation that occurs as a result of software aging. This is done by frequently monitoring the metric of greatest interest to the customer and triggering software rejuvenation whenever severe degradation is detected. We will refer to this metric as the customer-affecting metric. Software aging has been observed in complex systems and has been described extensively in the literature [3, 8, 9, 15]. We will describe a case study using a large complex industrial e-commerce system for which a severe fault that led to significant performance degradation eluded detection for several months. This happened because the metric of greatest interest to the customer, response time, was not being monitored, while metrics of less interest, including CPU utilization and memory usage, were being tracked. More detail about the fault can be found in [4].

Motivated by this experience, we developed software rejuvenation algorithms to track the customer-affecting metric, in this case response time (RT). In this paper we propose three new algorithms for software rejuvenation that use different sampling approaches to determine whether the RT has severely degraded. Our first algorithm, called static software rejuvenation with averaging, is a variant of the static software rejuvenation algorithm introduced in [1]. In this new algorithm, the observed values of the RT are averaged over a fixed sample size. Our second algorithm, called static software rejuvenation with averaging and sample acceleration, reduces the sample size requirement when a degradation in performance is detected. For this algorithm, the time to compute the average of the RT is proportional to the sample size, and so the overall time required to trigger a software rejuvenation is reduced when a significant performance degradation is detected. Our third algorithm is a direct application of the central limit theorem. Whenever the average of the samples exceeds the target quantile of the normal distribution, this algorithm will trigger a software rejuvenation. In this paper we evaluate the domain of applicability of the three algorithms by simulating their performance using the large e-commerce system mentioned above.

The outline of our paper is as follows. Section 2 provides a survey of related work. In Section 3 we present our model of software rejuvenation, which has been designed to identify performance degradation using observations of the RT. In Section 4 we present three new software rejuvenation algorithms and the analytical motivation for our algorithms. Section 5 presents simulation results for the e-commerce system, while Section 6 contains our conclusions and suggestions for future research.

∗Alberto Avritzer and Andre Bondi are with Siemens Corporate Research, Princeton, NJ 08540. E-mail: {alberto.avritzer, andre.bondi}@siemens.com
†Michael Grottke and Kishor Trivedi are with Duke University, Department of Electrical and Computer Engineering, Durham, NC 27708-0291. E-mail: {grottke,kst}@ee.duke.edu. This work was supported by a fellowship within the Postdoc Program of the German Academic Exchange Service.
‡Elaine Weyuker is with AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932. E-mail: [email protected]

In Proc. International Conference on Dependable Systems and Networks 2006, pages 435-444, 2006. © IEEE

2 Related work

Avritzer and Weyuker [3] introduced an approach to restoring degraded software to full capacity that was tailored particularly for telecommunications or other similar systems that are characterized by predictably periodic traffic. It is common for telecommunications operating companies to collect detailed traffic pattern data, thereby allowing them to identify system degradations and the subsequent restoration of their systems to full capacity by executing software procedures that will free allocated memory, release locks, and reinitialize other operating system tables.

Bobbio et al. [5] presented a method for the quantitative analysis of software rejuvenation policies. Their approach assumed that it was possible to quantify system degradation by monitoring a metric that was correlated with it. They defined a maximum degradation threshold level and introduced two rejuvenation policies based on this threshold. A risk-based approach was used for their first policy, defining a confidence level for the metric. Rejuvenation was performed with a probability proportional to this confidence level. They also introduced a deterministic policy for which rejuvenation was performed as soon as the threshold level was reached. The approach we study in this paper is closely related to this second policy but uses multiple threshold levels so that we can distinguish between bursts of arrivals and soft failures. When a soft failure occurs, the system remains operational, but in a degraded mode with the available system capacity greatly reduced.

In [1], a software rejuvenation approach for single server systems that tracks the value of the RT to determine the best times to perform software rejuvenation was introduced. In [2], it was extended by evaluating the applicability of the algorithms to clusters of hosts. The three rejuvenation algorithms that we consider in this paper base the decision to trigger rejuvenation on averages of several observations, rather than on the value of the RT itself.

A comparison of analytical models using Markov regenerative processes and measurement approaches using time series analysis, trend detection and estimation for software rejuvenation was presented in [15]. This paper provides motivation for the development of practical policies based on actual measurements.

Other papers of interest include [11], in which service failures were identified by monitoring the RT, and [6], which described the methodology used by IBM Director software for proactive software rejuvenation using statistical estimation of resource exhaustion.

3 The system and the simulation model

The subject of our case studies is a multi-tier distributed e-commerce system written in Java. It consists of 16 CPUs with a Java virtual machine (JVM) heap size of 3 GB. The maximum acceptable RT is 10 seconds, and the system must be able to handle up to 1.6 transactions/second. The two primary factors that were found to impact performance were the variability that occurred when garbage collections took place, and kernel overhead.

The system model used to generate the experimental results is a slightly modified version of the one used in [4]:

1. Whenever a thread arrives at the JVM, a new thread arrival is scheduled with an exponentially distributed inter-arrival time (with arrival rate λ), and the number of active threads is incremented by one.

2. The thread queues for a CPU.

3. The CPU processing time of the thread is sampled from an exponential distribution with service rate µ = 0.2 transactions/second.

4. If the number of threads executing in parallel is greater than the specified threshold of 50 threads, an overhead is incurred. In order to account for this, the processing time of the thread is multiplied by a factor of 2.0.

5. As soon as one of the 16 CPUs is obtained, the thread attempts to allocate 10 MB of memory.

6. If the remaining memory heap size is less than the memory threshold of 100 MB, a full garbage collection event is scheduled and all running threads are delayed by 60 seconds, the amount of time needed to perform a full garbage collection on a 3 GB heap.

7. When a thread completes service, the total RT of the related job, consisting of the waiting time plus the processing time, is computed.

8. Based on the observed RTs, an algorithm decides whether software rejuvenation should be carried out. We will discuss several versions of a rejuvenation algorithm in the next section. When the system is rejuvenated, all threads in execution are terminated and all resources held by threads are released. This includes both the JVM memory heap and busy CPUs.

The goal of our software rejuvenation algorithms is to monitor the RT and do a capacity restoration whenever the RT has severely deteriorated. During rejuvenation, all CPU and memory queues are cleared. The cost is defined to be the percentage of transactions lost during the rejuvenation event.

We assume that the RT can be sampled frequently, and that the average and standard deviation of the RT when the system is operating without performance degradation serve as the basis for the system performance requirement.
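The eight steps above can be condensed into a compact event-driven simulation. The following is our own minimal sketch, not the simulator used in the paper: it keeps the Poisson arrivals, the 16 exponential servers, the per-job 10 MB allocation and the full-GC pause, but omits the kernel-overhead multiplier (step 4) and rejuvenation (step 8), and, as a simplification, charges the 60-second GC delay only to the job that triggers the collection. All function and parameter names are ours.

```python
import random

def simulate_rts(lam, mu=0.2, c=16, heap_mb=3072, alloc_mb=10,
                 gc_threshold_mb=100, gc_pause=60.0, n_jobs=10_000, seed=1):
    """Simplified sketch of the simulation model in Section 3.

    Steps 1-3 and 5-7 are modeled; kernel overhead (step 4) and
    rejuvenation (step 8) are omitted, and the full-GC delay is
    charged only to the triggering job instead of to all threads.
    Returns the list of simulated response times (wait + service)."""
    rng = random.Random(seed)
    t = 0.0                       # current arrival time
    free_cpu_at = [0.0] * c       # time at which each CPU next becomes free
    free_heap = float(heap_mb)    # remaining JVM heap in MB
    rts = []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)             # step 1: Poisson arrivals
        gc_delay = 0.0
        if free_heap < gc_threshold_mb:       # step 6: full garbage collection
            free_heap = float(heap_mb)
            gc_delay = gc_pause
        free_heap -= alloc_mb                 # step 5: allocate 10 MB
        k = min(range(c), key=free_cpu_at.__getitem__)  # step 2: earliest-free CPU
        start = max(t, free_cpu_at[k])
        finish = start + rng.expovariate(mu) + gc_delay  # step 3: Exp(mu) service
        free_cpu_at[k] = finish
        rts.append(finish - t)                # step 7: waiting + processing time
    return rts
```

At a low arrival rate such as λ = 0.5 transactions/second, queueing is rare and the simulated mean RT stays near the 1/µ = 5 second service time, plus a small contribution from the occasional GC pause.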

4 The rejuvenation algorithms

4.1 Analytical motivation

Rejuvenation must be carried out often enough to avoid severe performance slowdowns, yet not unnecessarily, since rejuvenation incurs costs such as lost transactions. The algorithm that triggers rejuvenation must depend on recently observed values of the RT to decide whether the system behavior has deteriorated sufficiently from normal behavior to justify the costs of rejuvenation. While this idea conforms to our intuition, we have to make precise what we mean by "normal" system behavior and when the observed behavior is "significantly" different from normal. Abstracting from the garbage collections and the kernel overhead, which are the two characteristics that were responsible for the performance slowdowns, the simulation model described in the last section becomes a first-come, first-served (FCFS) queuing model with c = 16 parallel servers in which both the interarrival times and the service times follow an exponential distribution. For such an M/M/c queuing model, the number of jobs in the system can be depicted by the Markovian state diagram shown in Fig. 1. λ and µ represent the arrival rate and the service rate, respectively.

Figure 1. State diagram of the M/M/c queuing system

When the traffic intensity, ρ = λ/(cµ), is less than 1, the system is stable and will eventually reach a steady state; otherwise the number of jobs in the system tends to increase without bound. Let the random variable Xi denote the RT of a job randomly sampled from an FCFS M/M/c system in the steady state. Gross and Harris [7, p. 73] derive the cumulative distribution function of Xi as

F_{X_i}(x) = W_c \cdot (1 - e^{-\mu x}) + (1 - W_c) \cdot \left[ \frac{c\mu - \lambda}{(c-1)\mu - \lambda} \cdot (1 - e^{-\mu x}) - \frac{\mu}{(c-1)\mu - \lambda} \cdot \left(1 - e^{-(c\mu - \lambda)x}\right) \right],    (1)

where

W_c = 1 - \frac{(c\rho)^c}{c!} \cdot \frac{1}{1-\rho} \cdot \left[ \sum_{k=0}^{c-1} \frac{(c\rho)^k}{k!} + \frac{(c\rho)^c}{c!} \cdot \frac{1}{1-\rho} \right]^{-1}

represents the steady-state probability that fewer than c jobs are present in the system.

Both the expected value and the variance of the RT can easily be derived from equation (1) by noting that the RT distribution is a mixture of an exponential distribution and a two-stage hypoexponential distribution, and hence is a phase-type distribution [10, pp. 41-80] representable by a parallel and serial combination of exponential distributions. In the structure shown in Fig. 2, each circle stands for an exponential distribution with a hazard rate given by the expression noted in the circle.

Figure 2. Xi as a phase-type distribution

Applying the rules for calculating the moments of hypoexponential and hyperexponential distributions [14, pp. 223-224], we obtain the expectation of Xi,

\mu_X := E(X_i) = \frac{W_c}{\mu} + (1 - W_c) \cdot \left[ \frac{1}{\mu} + \frac{1}{c\mu - \lambda} \right] = \frac{1}{\mu} + \frac{1 - W_c}{c\mu - \lambda},    (2)

as well as the variance,

\sigma_X^2 := Var(X_i) = \frac{1}{\mu^2} + \frac{1 - W_c^2}{(c\mu - \lambda)^2}.    (3)

For our basic M/M/16 queuing system with a service rate of 0.2 transactions/second, when arrival rates are below 1 transaction/second, both the mean and the standard deviation of the RT remain constant at 5. The reason for this behavior is that under these low loads, the threads rarely have to queue for a CPU. Therefore, the RT basically follows an exponential distribution with service rate µ = 0.2 transactions/second. For higher loads, the expected value and the standard deviation start to diverge from their baseline value of 5.

One approach might be to use the upper quantiles of the cumulative distribution function (1) for monitoring the system behavior and to trigger rejuvenation when the observed RT xi exceeds some pre-determined quantile. However, such an approach would not be robust for short-term deviations in the system behavior, and we need to avoid costly rejuvenations under such circumstances. The static and the dynamic rejuvenation algorithms introduced in [1] and [2] try to comply with this requirement by triggering rejuvenation only after repeated occurrences of large RTs.
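Equations (1)-(3) can be evaluated numerically in a few lines. The sketch below (function names are ours) computes Wc via the Erlang-C formula and then the RT mean and variance following equations (2) and (3):

```python
from math import factorial

def erlang_wc(c, lam, mu):
    """Steady-state probability W_c that fewer than c jobs are present
    in an M/M/c system (one minus the Erlang-C waiting probability)."""
    rho = lam / (c * mu)
    a = c * rho                                   # offered load, c*rho
    tail = (a ** c / factorial(c)) / (1.0 - rho)  # (c*rho)^c / c! * 1/(1-rho)
    norm = sum(a ** k / factorial(k) for k in range(c)) + tail
    return 1.0 - tail / norm

def rt_moments(c, lam, mu):
    """Mean and variance of the steady-state response time,
    following equations (2) and (3)."""
    wc = erlang_wc(c, lam, mu)
    mean = 1.0 / mu + (1.0 - wc) / (c * mu - lam)
    var = 1.0 / mu ** 2 + (1.0 - wc ** 2) / (c * mu - lam) ** 2
    return mean, var
```

For c = 16, µ = 0.2 and λ ≤ 1.6, Wc is close to 1, so the mean stays near 1/µ = 5 seconds and the variance near 1/µ² = 25, matching the baseline values used in the text.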

Figure 3. Response time Xi as time to absorption in a continuous-time Markov chain

Figure 4. Average response time \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i as time to absorption in a continuous-time Markov chain

Another way of coping with short-term increases in the RT is to base the rejuvenation decision on averages calculated from several measured values of the RT rather than on the RT itself. This allows small numbers of subsequently observed large values to be "smoothed out". On the one hand, this idea can be included into the rejuvenation algorithms presented in [1] and [2]. On the other hand, it allows us to make use of the central limit theorem, which in its simplest form states that the average \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i of n independent and identically distributed random variables Xi approaches a normal distribution as n → ∞. Therefore, if n is sufficiently large, we can approximate the distribution of X̄n by a normal distribution, regardless of Xi's true distribution. We can then base the decision of whether to trigger rejuvenation on the quantiles of the normal distribution.

How large must n be for the normal approximation to be "good enough"? In general, calculating the convolution of n random variables is difficult, hampering an examination of how quickly X̄n approaches the normal distribution. However, there is a simple approach for deriving the distribution of X̄n, provided the distribution of Xi can be represented by the distribution of the time to absorption in a Markov chain [14]. This is the case for the RT in the M/M/c queuing system. From Fig. 2 we can derive the continuous-time Markov chain shown in Fig. 3. Assuming that the initial probability for being in state 1 is 100%, the time to reach absorbing state 3 follows the same distribution as the RT Xi.

To find a representation of X̄n as a time to absorption, we begin by multiplying all transition rates in Fig. 3 by n and get a Markov chain in which the distribution of the time to absorption is identical to the distribution of each individual Xi/n. This is true because all transition times in a Markov chain are exponential, and because dividing an exponentially distributed random variable with hazard rate z by some constant r yields an exponentially distributed random variable with hazard rate r·z. Therefore, \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} \frac{X_i}{n} can be represented as the time to absorption in the Markov chain derived by concatenating n such Markov chains, fusing state 3 of the nth sub-chain and state 1 of the (n+1)st sub-chain into state 2n+1 of the resulting Markov chain. This is shown in Fig. 4.

Using the SHARPE [12] tool, we are able to derive the cumulative distribution function of the average RT, F_{\bar{X}_n}(x). It is also possible to calculate the exact probability density function f_{\bar{X}_n}(x) via the relationship

f_{\bar{X}_n}(x) = \frac{dp_{2n+1}(x)}{dx} = p_{2n-1}(x) \cdot n\mu W_c + p_{2n}(x) \cdot n(c\mu - \lambda),    (4)

where p_i(x) is the probability that the Markov chain of Fig. 4 is in state i at time x.

For c = 16 servers, λ = 1.6 transactions/second and µ = 0.2 transactions/second, Fig. 5 shows how this probability density function of the sample average approaches a normal density as n increases. For each value of n, the dashed curve is the probability density function of a normal distribution with expected value equal to \mu_{\bar{X}_n} = \mu_X and variance equal to \sigma^2_{\bar{X}_n} = \sigma^2_X / n. While the full asymptotic properties only hold for an infinite sample size, we can see that the density of the sample average can be reasonably approximated by a normal density for sample sizes as low as 30 or even 15.

Since we want to detect "unnaturally" long RTs, we are especially interested in the upper tails of the distributions. For example, an algorithm could trigger rejuvenation whenever the sample mean exceeds the 97.5% quantile of the approximating normal distribution. Ideally, the false alarm probability of such a decision rule would be 2.5%. In fact, the probability mass that f_{\bar{X}_n}(x), equation (4), allocates to the right of the respective quantiles of the normal distribution amounts to 3.69% for n = 15 and 3.37% for n = 30. While the false alarm probabilities are somewhat inflated, the approximation seems to be good enough for our purposes.

In the simplest form of the central limit theorem, the sampled values are assumed to be independent. Clearly, the RTs observed in a queuing system can be highly correlated. If the RT for a job is large because it had to queue to be served, the same is more likely to be true for the jobs following it, especially as the load increases. When the load is low enough so that no queuing occurs, then the RTs follow independent and identical exponential distributions. As mentioned above, for an M/M/16 queuing system with a service rate of µ = 0.2 transactions/second, queueing starts to have an effect for arrival rates of about 1 transaction/second.
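The inflation of the upper-tail probability can also be checked by Monte Carlo, using the mixture structure behind equation (1): with probability Wc the RT is a single Exp(µ) stage, otherwise an Exp(µ) stage followed by an Exp(cµ − λ) stage. This is our own illustrative sketch; the constant WC is an approximate value of Wc for c = 16, λ = 1.6, µ = 0.2, not a figure taken from the paper.

```python
import math
import random

C, LAM, MU = 16, 1.6, 0.2
WC = 0.991                     # approx. W_c for these parameters (Erlang-C)
MEAN_X = 1 / MU + (1 - WC) / (C * MU - LAM)                          # eq. (2)
SD_X = math.sqrt(1 / MU ** 2 + (1 - WC ** 2) / (C * MU - LAM) ** 2)  # eq. (3)

def sample_rt(rng):
    """One response time from the phase-type mixture behind eq. (1)."""
    x = rng.expovariate(MU)                 # first exponential stage
    if rng.random() > WC:                   # with prob. 1 - W_c: second stage
        x += rng.expovariate(C * MU - LAM)
    return x

def tail_mass(n, reps=20_000, z=1.96, seed=7):
    """Fraction of n-sample averages exceeding the normal-approximation
    quantile MEAN_X + z*SD_X/sqrt(n); nominally 2.5% for z = 1.96."""
    rng = random.Random(seed)
    cut = MEAN_X + z * SD_X / math.sqrt(n)
    hits = sum(sum(sample_rt(rng) for _ in range(n)) / n > cut
               for _ in range(reps))
    return hits / reps
```

For n = 15 and n = 30 the estimated false-alarm probability comes out noticeably above the nominal 2.5%, in line with the exact 3.69% and 3.37% reported in the text.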

For an arrival rate of 1.6 transactions/second (the maximum arrival rate of interest), is the correlation too strong for the central limit theorem to be applicable? In order to investigate this question, we simulated the RTs for this configuration of the M/M/16 model. We obtained the simulation model from the one described in Section 3 by abstracting from the kernel overhead (step 4), the memory leaks (steps 5 and 6) and the rejuvenation (step 8). Five independent replications of 100,000 transactions were run. For each one, we estimated the first-order autocorrelation coefficient as [13, p. 26]

\hat{\gamma} = \frac{\sum_{i=10,001}^{99,999} (x_{i+1} - \bar{x})(x_i - \bar{x})}{\sum_{i=10,001}^{100,000} (x_i - \bar{x})^2},

where \bar{x} = \frac{1}{90,000} \sum_{i=10,001}^{100,000} x_i. Note that the first 10,000 transactions were omitted in order to eliminate transient effects. The autocorrelation coefficient is considered to be significantly different from zero at a confidence level of 95% if its absolute value exceeds 1.96/\sqrt{90,000}. This was the case for only one of our five replications. As expected, at high loads subsequent RTs can show some dependence; however, even at the maximum load of interest the first-order correlation seems to play a minor role. In our case, it should therefore be possible to use a rejuvenation algorithm employing the results of the central limit theorem as an approximation, regardless of the actual system load.

Figure 5. Probability density function of average response time \bar{X}_n for n = 1, 5, 15, 30 and corresponding approximating normal densities f_N(x; \mu_{\bar{X}_n}, \sigma^2_{\bar{X}_n}); λ = 1.6, µ = 0.2

4.2 Three algorithms

We now propose three new algorithms for detecting a significant and lasting deterioration in the RT. We assume that the service level agreement specifies the mean µX and the standard deviation σX of the RT under normal system behavior. Our algorithms further assume that small values of the observed metric are better than large ones, which is true for RT.

In [1] and [2], we have introduced the so-called static software rejuvenation algorithm. This algorithm tracks the deterioration of the RT based on a bucket metaphor. Beginning with the first of the K buckets, for each sampled value xi > µX, one ball is added to the bucket, while one ball is removed if xi ≤ µX. The bucket thus keeps a count of the difference between the number of times the expected value µX is exceeded and the number of times the sampled value is less than or equal to µX. When the number of balls in the bucket reaches its allowed depth D, the bucket "overflows", and we move on to the second bucket, for which the target value used for adding/removing balls is µX + σX. We use the index N (= 0, 1, 2, ..., K−1) as a pointer to the current bucket. Since one standard deviation is added to the target whenever moving to the next bucket, the target value for bucket N is µX + N·σX. When the last one of the K buckets overflows, the rejuvenation routine is executed. The algorithm moves to the previously filled bucket, N−1, when the current bucket N > 0 "underflows", i.e., the bucket is empty and xi ≤ µX + N·σX. In this version of the algorithm, the bucket depth D is constant for all buckets, and so the algorithm is said to be static. Activation of the rejuvenation routine in the presence of transient degradation is prevented by the bucket depth D and the total number of buckets K. The minimum delay before a degradation can be affirmed is at least D·K observations. The value of K also determines how much the distribution of the RT must have shifted before a rejuvenation should be carried out, using the implicit assumption that a shift by K−1 standard deviations σX is significant.

In the last section, we saw that one way to cope with short-term changes in the distribution of the RT is to calculate the sampling average x̄n from n individual observations. We now introduce the static rejuvenation algorithm with averaging (SRAA), shown in Fig. 6, which tracks the development of the sampling average x̄n of the RT. The target values used are still of the form µX + N·σX; this ensures that, regardless of the sample size n, rejuvenation is triggered as soon as the algorithm has detected evidence for a shift of the distribution of the RT Xi by K−1 standard deviations σX.

function SRAA (D, K, n, µX, σX) {
    u := 0; d := 0; N := 0;
    while (n additional observations available) {
        u := u + 1;
        x̄u := (1/n) · Σ_{t=(u−1)·n+1}^{u·n} x_t;
        if (x̄u > µX + N·σX) then
            { d := d + 1; } else { d := d − 1; }
        if (d > D) then
            { d := 0; N := N + 1; }
        if ((d < 0) AND (N > 0)) then
            { d := D; N := N − 1; }
        if ((d < 0) AND (N == 0)) then
            { d := 0; }
        if (N == K) then
            { rejuvenation_routine(); d := 0; N := 0; }}}

Figure 6. Pseudo-code for the static rejuvenation algorithm with averaging (SRAA)

The next two algorithms, however, are based on a different paradigm. They do not try to "verify" that the distribution of the RT has shifted by a specific amount. Instead they focus on detecting whether the distribution has moved at all, trying to falsify the hypothesis that there has been no right-shift. If this hypothesis can be rejected based on the observed data, then software rejuvenation is carried out. As a consequence of the changed paradigm, the following algorithms employ the standard deviation of the sampling average, \sigma_{\bar{X}_n} = \sigma_X/\sqrt{n}, to determine whether the average calculated from the last n observations deviates significantly from the expected behavior. Therefore, the target values used are of the form µX + N·σX/√n.

Our second algorithm, the sampling acceleration rejuvenation algorithm with averaging (SARAA), collects a series of n sample values of the RT and computes x̄u by taking the average. It then uses the bucket metaphor to keep a count of the difference between the number of times the target value (µX + N·σX/√n) of the current bucket has been exceeded, and the number of times the sampled value is less than or equal to this target value. N is again the pointer to the current bucket, K the total number of buckets used for the algorithm, and D is the depth of each bucket. When degradation is detected, sampling is accelerated by requiring fewer sample values to trigger a transition from one bucket to the next. The number of sample values used for a bucket is computed when the previous bucket overflows. The computation ensures that when degradation is detected, fewer samples are required to trigger a transition to the next bucket. SARAA uses linear sampling acceleration with rate −N/K. The floor(y) operator, which returns the largest integer less than or equal to y, guarantees that the calculation of the current sampling size always results in an integer value. The algorithm is initialized with norig, the sample size used for the first bucket. In Section 5, we study the performance of SARAA for norig values of 5 and 10.

function SARAA (D, K, norig, µX, σX) {
    u := 0; n := norig; d := 0; N := 0;
    while (n additional observations available) {
        u := u + 1;
        x̄u := (1/n) · Σ_{t=(u−1)·n+1}^{u·n} x_t;
        if (x̄u > µX + N·σX/√n) then
            { d := d + 1; } else { d := d − 1; }
        if (d > D) then
            { d := 0; N := N + 1;
              n := floor(1 + (norig − 1) · (1 − N/K)); }
        if ((d < 0) AND (N > 0)) then
            { d := D; N := N − 1;
              n := floor(1 + (norig − 1) · (1 − N/K)); }
        if ((d < 0) AND (N == 0)) then
            { d := 0; }
        if (N == K) then
            { rejuvenation_routine(); d := 0; N := 0; n := norig; }}}

Figure 7. Pseudo-code for the sampling acceleration rejuvenation algorithm with averaging (SARAA)

The two algorithms described so far smooth short-term deviations of the RT by using multiple buckets, the specification of a bucket depth, and by taking averages of successive observations. Typically, the sampling sizes used for these algorithms are too small for employing the normal approximation following from the central limit theorem, and the two algorithms do not rely on this approximation. However, they do benefit from symmetry characteristics of the probability density function of the sample average X̄n of

the RT. As shown in Fig. 5, even for a sample size as small as five, the probability density function of X̄n is much more symmetric than that of the RT itself. Therefore, the calculation of averages does help the two algorithms in a second way apart from the smoothing of short-term deviations: It increases the probability of values larger than µX and decreases the probability of values smaller than µX. As a consequence, for a larger sample size, both the SRAA and the SARAA are able to detect right shifts of the distribution faster.

Our third algorithm, the central limit theorem-based algorithm (CLTA), directly applies the theorem. It requires a larger sample size n, but it only waits for one significantly large sampling average x̄n. This means that the number of buckets as well as the bucket depth are both implicitly set to one. The factor N used in the calculation of the target value µX + N·σX/√n, which the sampling average has to exceed in order to cause a rejuvenation, is a quantile of the standard normal distribution chosen according to the acceptable probability of a false alarm. However, we have to take into account our earlier finding that due to the approximation the real error probabilities are larger than one might think. For example, if N is set to 1.96, the 97.5% quantile of the standard normal distribution, then the false alarm probability of decisions based on averages of n = 30 observations is 3.37%.

function CLTA (n, µX, σX, N) {
    u := 0;
    while (n additional observations available) {
        u := u + 1;
        x̄u := (1/n) · Σ_{t=(u−1)·n+1}^{u·n} x_t;
        if (x̄u > µX + N·σX/√n) then
            { rejuvenation_routine(); }}}

Figure 8. Pseudo-code for the central limit theorem-based rejuvenation algorithm (CLTA)

5 Empirical results

We have evaluated the three algorithms, SRAA, SARAA, and CLTA, by running the simulation model described in Section 3, which represents the e-commerce system discussed in [1]. We ran each simulation for 500,000 transactions divided into five replications of 100,000 transactions each. In each simulation experiment, the constant values of µX = σX = 5 were used. For an M/M/16 queuing system with a service rate of µ = 0.2 transactions/second and an arrival rate of up to 1.6 transactions/second, the mean and the standard deviation of the RT distribution are close to these values, as was seen in Section 4.1. Average RT and fractions of transactions lost are shown as functions of the offered load (λ/µ) expressed in terms of the number of CPUs. In each figure we kept the value of the product of sample size n, number of buckets K, and bucket depth D constant. Throughout the paper, we refer to this product as n·K·D.

In each experiment described below, our goal is to evaluate the individual impact of each parameter on the algorithm performance as assessed in terms of average RT at high loads and the amount of transaction loss at low loads.

5.1 SRAA, n·K·D = 15

Fig. 9 presents RT results for SRAA, with n·K·D = 15. For this evaluation, we performed experiments setting (n, K, D) = (1, 3, 5), (1, 5, 3), (3, 1, 5), (3, 5, 1), (5, 1, 3), (5, 3, 1), (15, 1, 1). We see a clear dichotomy of RTs in Fig. 9, with (n, K, D) = (5, 1, 3), (3, 1, 5), (15, 1, 1) providing better average RTs than (n, K, D) = (5, 3, 1), (1, 5, 3), (1, 3, 5). However, when we examine Fig. 10, we see that the improvement in performance achieved by (n, K, D) = (5, 1, 3), (3, 1, 5), (15, 1, 1) is obtained at the cost of a larger fraction of transaction loss at low loads. Therefore, we conclude that for n·K·D = 15, when only one bucket is used (K = 1) we observe better RTs over the entire range than for any other value of K, but incur a higher average transaction loss at low loads, and a lower average transaction loss at high loads. This is an interesting observation, since the introduction of multiple buckets in SRAA was intended to distinguish between performance degradation that occurs as a result of bursts of arrivals, and degradation that occurs as a result of software aging. Here we observe that when several buckets were used, SRAA tolerates bursts of arrivals at low loads with negligible transaction loss, as expected. However, this comes at the cost of a considerably higher transaction loss at high loads.

Figure 9. Response time results, SRAA, n·K·D = 15
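For concreteness, the SRAA pseudo-code of Fig. 6 transcribes almost line by line into executable form (the function and callback names below are ours):

```python
def sraa(observations, mu_x, sigma_x, n, K, D, rejuvenate):
    """Executable transcription of the SRAA pseudo-code (Fig. 6):
    average each batch of n observations, add or remove a ball in the
    current bucket depending on whether the batch average exceeds
    mu_x + N*sigma_x, and call rejuvenate() when bucket K-1 overflows."""
    d, N = 0, 0
    batch = []
    for x in observations:
        batch.append(x)
        if len(batch) < n:
            continue
        avg = sum(batch) / n
        batch = []
        if avg > mu_x + N * sigma_x:
            d += 1                 # ball added: target exceeded
        else:
            d -= 1                 # ball removed
        if d > D:                  # bucket overflows: move to next bucket
            d, N = 0, N + 1
        if d < 0 and N > 0:        # bucket underflows: move back one bucket
            d, N = D, N - 1
        if d < 0 and N == 0:       # first bucket cannot underflow below 0
            d = 0
        if N == K:                 # last bucket overflowed: rejuvenate
            rejuvenate()
            d, N = 0, 0
```

Under a persistently degraded stream, every batch average exceeds the current target, each bucket overflows after D + 1 exceedances, and the first rejuvenation therefore occurs after (D + 1)·K batches of n observations.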

In Proc. International Conference on Dependable Systems and Networks 2006, pages 435-444, 2006. © IEEE

Figure 10. Fraction of transaction loss results, SRAA, n · K · D = 15

Our next experiments are designed to investigate the impact of doubling the value of each parameter on the performance of the SRAA algorithm. In Sections 5.2, 5.3, and 5.4, we evaluate the performance of SRAA for parameters generated by doubling the sample size, the bucket depth, and the number of buckets, while keeping n · K · D = 30.

5.2 SRAA, sample size doubled

Fig. 11 presents RT results for SRAA, with n · K · D = 30. We set (n, K, D) = (2, 3, 5), (2, 5, 3), (6, 1, 5), (6, 5, 1), (10, 1, 3), (10, 3, 1), (30, 1, 1), by doubling the sample size component of the experiment settings from Section 5.1. The goal of this experiment is to assess the impact of the sample size parameter on the algorithm performance. We notice from Fig. 11 that doubling the value of the sample size has a negative impact on the RT. For example, for a load of 9.0 CPUs and (n, K, D) = (15, 1, 1), the average RT for SRAA is 6.2 seconds, while for (n, K, D) = (30, 1, 1), the average RT for SRAA is 9.9 seconds. For the same load of 9.0 CPUs, for (n, K, D) = (3, 5, 1), the average RT for SRAA is 10.45 seconds, while for (n, K, D) = (6, 5, 1), the average RT for SRAA is 14.3 seconds. Note that the observed RTs are longer because the larger sample size means that rejuvenation is triggered later, as it takes longer to collect a larger sample.

5.3 SRAA, bucket depth doubled

Fig. 12 presents RT results for SRAA with n · K · D = 30. We consider (n, K, D) = (1, 3, 10), (1, 5, 6), (3, 1, 10), (3, 5, 2), (5, 1, 6), (5, 3, 2), (15, 1, 2), doubling the bucket depth component of the experiment settings in Section 5.1 to assess the impact on the RT. Comparing Figs. 12 and 11, we see that doubling the bucket depth has a less severe impact on performance than doubling the sample size.

Figure 12. Response time results, SRAA, n · K · D = 30, impact of bucket depth doubling
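The delay effect of a larger sample size can be reproduced with a toy experiment (our own sketch; the 6.8-second threshold and the degraded mean of 20 seconds are illustrative values, not taken from the paper): a detector that averages non-overlapping windows of n observations cannot fire before n observations have been collected, so doubling n at least doubles the best-case detection delay.

```python
import random

def observations_until_trigger(rts, n, threshold):
    """Number of RT observations consumed before the mean of a full,
    non-overlapping window of size n first exceeds the threshold."""
    window = []
    for i, rt in enumerate(rts, start=1):
        window.append(rt)
        if len(window) == n:
            if sum(window) / n > threshold:
                return i
            window = []  # start the next window
    return None  # never triggered

random.seed(1)
# A severely degraded system: exponential RTs with mean 20 seconds,
# monitored against an illustrative 6.8-second threshold.
degraded = [random.expovariate(1 / 20.0) for _ in range(10_000)]
d15 = observations_until_trigger(degraded, 15, 6.8)
d30 = observations_until_trigger(degraded, 30, 6.8)
# With degradation this severe, detection takes essentially one full
# window, so d15 is close to 15 observations and d30 close to 30.
```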

Figure 11. Response time results, SRAA, n · K · D = 30, impact of sample size doubling

Figure 13. Fraction of transaction loss results, SRAA, n · K · D = 30, impact of bucket depth doubling

However, doubling the bucket depth does decrease the average fraction of transaction loss, as can be seen from Fig. 13 for the configurations with more than one bucket. Specifically, for (n, K, D) = (1, 3, 10), (1, 5, 6), (5, 3, 2), the fraction of transaction loss for a CPU load of 0.5 CPUs is negligible. However, for (n, K, D) = (3, 1, 10), (5, 1, 6), (15, 1, 2), we do observe measurable fractions of transaction loss for a low load of 0.5 CPUs.

5.4 SRAA, number of buckets doubled

Fig. 14 presents RT results for SRAA, with n · K · D = 30. For this evaluation we executed experiments with the values (n, K, D) = (1, 6, 5), (1, 10, 3), (3, 2, 5), (3, 10, 1), (5, 6, 1), (15, 2, 1), (15, 1, 2), by doubling the number of buckets component of the experiment settings from Section 5.1. The objective of these experiments is to assess the impact of the number of buckets parameter on the algorithm performance. We notice from Fig. 14 that doubling the number of buckets has a negative impact on performance. For example, for the load of 9.0 CPUs and (n, K, D) = (15, 1, 1), the average RT for SRAA is 6.2 seconds, while for (n, K, D) = (15, 2, 1) the average RT for SRAA is 11.05 seconds. For the same CPU load of 9.0 CPUs, for (n, K, D) = (3, 5, 1) the average RT for SRAA is 10.45 seconds, while for (n, K, D) = (3, 10, 1) the average RT for SRAA is 14.9 seconds. However, doubling the number of buckets generates the best tradeoff configuration for low transaction loss at low loads and reasonable RT at high loads. For example, for configuration (n, K, D) = (3, 2, 5), the average fraction of transaction loss for a 0.5 CPUs load is 0.000026 while the average RT for a 9.0 CPUs load is 10.3 seconds. In addition, the second best tradeoff is the configuration (n, K, D) = (5, 2, 3), for which the average fraction of transaction loss for 0.5 CPUs is 0.0003 while the average RT for 9.0 CPUs is 10.4 seconds. Therefore, we conclude that an adequate combination of bucket depth and number of buckets is required to deliver good performance at low loads.

Figure 14. Response time results, SRAA, n · K · D = 30, impact of number of buckets doubling

5.5 SARAA

Fig. 15 presents RT results for SARAA, with n · K · D = 30. We consider (n, K, D) = (2, 3, 5), (2, 5, 3), (6, 5, 1), (10, 3, 1). Fig. 15 shows that SARAA offers an improvement over SRAA in RTs at high loads, while maintaining the negligible average transaction loss characteristics at low loads. For example, for 9.0 CPUs and (n, K, D) = (2, 5, 3), the SRAA RT is measured as 11.94 seconds while the SARAA RT is measured as 10.5 seconds. For (n, K, D) = (2, 3, 5), the SRAA RT is 11.05 seconds while the SARAA RT is 9.8 seconds. For (n, K, D) = (6, 5, 1), the RT for SRAA is 14.3 seconds while the RT for SARAA is 11 seconds.

Figure 15. Response time results, SARAA, n · K · D = 30

5.6 Comparing SRAA, SARAA, and CLTA

Fig. 16 compares results for SRAA, SARAA, and CLTA. We use CLTA with (n, K, D) = (30, 1, 1) and N = 1.96, and SRAA and SARAA with (n, K, D) = (2, 5, 3). The objective of this experiment is to compare the performance of the three algorithms while n · K · D = 30. The intent of CLTA was to use a sample size large enough for employing the normal approximation following from the central limit theorem. The value of N is the quantile of the normal distribution selected based on the acceptable rate of false alarms. Therefore, comparing the performance of CLTA with (n, K, D) = (30, 1, 1) against SRAA and SARAA with (n, K, D) = (2, 5, 3) helps clarify the domain of applicability of each algorithm. Recalling that our basis for assessment is the average RT at high loads and the amount of transaction loss at low loads, we see from Fig. 16 that CLTA with (n, K, D) = (30, 1, 1) leads to performance degradation at both low and high loads as compared to SRAA and SARAA with the settings (n, K, D) = (2, 5, 3).
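SARAA's sample acceleration can be sketched as follows. This is a simplified reading in Python: the halving rule, the floor of 2, and the two-consecutive-degraded-windows trigger are our illustrative choices; the precise acceleration rule and the bucket bookkeeping of the full algorithm are given in [1] and [2].

```python
def saraa_monitor(rts, n0, threshold, n_min=2):
    """Monitor RTs with sample acceleration: after a degraded window the
    sample size is halved, so the next decision arrives sooner; a clean
    window restores the original sample size.  Rejuvenation is signalled
    after two consecutive degraded windows (illustrative rule), and the
    number of observations consumed so far is returned."""
    n = n0
    consumed = 0
    degraded_streak = 0
    it = iter(rts)
    while True:
        window = []
        for rt in it:
            window.append(rt)
            if len(window) == n:
                break
        if len(window) < n:
            return None  # ran out of observations without triggering
        consumed += n
        if sum(window) / n > threshold:
            degraded_streak += 1
            if degraded_streak == 2:
                return consumed  # trigger rejuvenation now
            n = max(n_min, n // 2)  # accelerate: smaller next sample
        else:
            degraded_streak = 0
            n = n0  # healthy again: back to the full sample size
```

Under this sketch, a static sample size of 30 would spend 60 observations on two degraded windows, whereas the accelerated second window needs only 15, so the trigger fires after 45 observations; this is the mechanism behind SARAA's shorter RTs at high loads.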

Figure 16. Response time results comparing the SRAA, SARAA, and CLTA algorithms, n · K · D = 30

For example, for 0.5 CPUs, SRAA and SARAA drop a negligible fraction of transactions while CLTA drops 0.001406; for 9.0 CPUs, the average RT is 10.5 seconds for SARAA, 11.94 seconds for SRAA, and 12.8 seconds for CLTA.

6 Conclusions

We have introduced three new software rejuvenation algorithms. We performed empirical studies applying the algorithms to a large e-commerce system for which we had observed significant performance degradation in the field. We used a simulation model of this system to assess the performance of the three algorithms and to study the sensitivity of performance to variations in three important algorithm parameters: sample size, number of buckets, and bucket depth. We conclude that optimal performance can be achieved by using the sample size to determine the precision of the RT estimation, the number of buckets to determine tolerance to bursts of arrivals, and the bucket depth to accurately detect performance degradation. As a result, configurations that use small values of each of the parameters are better than configurations that invest in only one dimension. For example, (n, K, D) = (2, 5, 3) yielded better performance than (n, K, D) = (30, 1, 1). Increasing the number of buckets produced large RTs. Very small numbers of buckets produced measurable rates of transaction loss at low loads. Therefore, the three new algorithms presented are able to control performance by determining appropriate times to perform software rejuvenation. However, care needs to be taken to match each algorithm and parameter configuration to its domain of applicability, since for each parameter configuration a significant range of performance was observed. We have found that adaptive techniques, as employed by SARAA, can provide a performance improvement by dynamically matching the algorithm's behavior to the real-time field conditions. Therefore, we plan to consider statistical estimation techniques to determine optimal algorithm parameters in real time.

References

[1] A. Avritzer, A. Bondi, and E. J. Weyuker, "Ensuring stable performance for systems that degrade," in Proc. Fifth International Workshop on Software and Performance, 2005, pp. 43–51.
[2] ——, "Ensuring system performance for cluster and single server systems," Journal of Systems and Software, 2006. (To appear).
[3] A. Avritzer and E. J. Weyuker, "Monitoring smoothly degrading systems for increased dependability," Empirical Software Engineering, vol. 2, no. 1, pp. 59–77, 1997.
[4] ——, "The role of modeling in the performance testing of e-commerce applications," IEEE Trans. Software Engineering, vol. 30, no. 12, pp. 1072–1083, 2004.
[5] A. Bobbio, A. Sereno, and C. Anglano, "Fine grained software degradation models for optimal rejuvenation policies," Performance Evaluation, vol. 46, no. 1, pp. 45–62, 2001.
[6] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert, "Proactive management of software aging," IBM Journal of Research and Development, vol. 45, no. 2, pp. 311–332, 2001.
[7] D. Gross and C. M. Harris, Fundamentals of Queueing Theory, 3rd ed. New York: John Wiley & Sons, 1998.
[8] M. Grottke and K. S. Trivedi, "Software faults, software aging and software rejuvenation," Journal of the Reliability Engineering Association of Japan, vol. 27, no. 7, pp. 425–438, 2005.
[9] Y. Huang, C. Kintala, N. Kolettis, and N. Fulton, "Software rejuvenation: Analysis, module and applications," in Proc. Twenty-fifth International Symp. on Fault-Tolerant Computing, 1995, pp. 381–390.
[10] M. F. Neuts, Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. Baltimore: The Johns Hopkins University Press, 1981.
[11] R. Rajamony and M. Elnozahy, "Measuring client-perceived response times on the WWW," in Proc. Third USENIX Symp. on Internet Technologies and Systems, 2001.
[12] R. A. Sahner, K. S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of Computer Systems: An Example-based Approach Using SHARPE. Boston: Kluwer Academic Publishers, 1996.
[13] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications. New York: Springer, 2000.
[14] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd ed. New York: John Wiley & Sons, 2001.
[15] K. S. Trivedi, K. Vaidyanathan, and K. Goševa-Popstojanova, "Modeling and analysis of software aging and rejuvenation," in Proc. IEEE Annual Simulation Symposium, 2000, pp. 270–279.
