University of Massachusetts Dartmouth

Department of Computer and Information Science

SOFTWARE RELIABILITY MODELS FOR CLOUD-BASED REJUVENATION USING DYNAMIC FAULT TREES

A Dissertation in

Engineering and Applied Sciences

by

Jean F. Rahme

Submitted in Partial Fulfillment of the

Requirements for the Degree of

Doctor of Computer Science and Information Science

December 2017

I grant the University of Massachusetts Dartmouth the non-exclusive right to use the work for the purpose of making single copies of the work available to the public on a not-for-profit basis if the University’s circulating copy is lost or destroyed.

______Jean F. Rahme

Date______

We approve the dissertation of Jean F. Rahme

Date of Signature

______Haiping Xu Associate Professor, Department of Computer and Information Science Dissertation Advisor

______Ramprasad Balasubramanian Professor, Department of Computer and Information Science Dean, College of Engineering Dissertation Committee

______Liudong Xing Professor, Department of Electrical and Computer Engineering Dissertation Committee

______Xiaoqin Zhang Associate Professor, Department of Computer and Information Science Dissertation Committee

______Jan Bergandy Professor and Chairperson, Department of Computer and Information Science Dissertation Committee

______Gaurav Khanna Graduate Program Director, Department of Engineering and Applied Sciences

______Tesfay Meressi Associate Provost for Graduate Studies

ABSTRACT

A trusted cloud-based software system is a highly reliable, available, and predictable advanced computing system with guaranteed Quality of Service (QoS). To maintain the high reliability of a cloud-based software system, it is critical to find a feasible solution to counteract the progressive degradation of system performance caused by the exhaustion of system resources, fragmentation, and the accumulation of errors.

In this thesis, we adopt a proactive technique, called software rejuvenation, to enhance the fault tolerance of a cloud-based system equipped with software standby spares. We extend the dynamic fault tree (DFT) formalism with Software SPare (SSP) gates to model the system reliability before and during a software rejuvenation process in an aging cloud-based software system. A novel analytical approach is presented to derive the reliability function of a cloud-based SSP gate with either one or two Hot Software Spares (HSSs). We verify our approach using Continuous Time Markov Chains (CTMC) for the case of constant failure rates. Then, to extend our approach to non-constant failure rates, we adopt the

Weibull distribution to model the increasing failure rates of software components with aging issues. We use case studies of a cloud-based software system with multiple HSSs to illustrate the validity of our approach for both the constant and non-constant failure rate cases. Based on the analytical reliability results, we show how software rejuvenation schedules can be created to keep the system reliability consistently above predefined critical levels.


Acknowledgments

I would like to express my deep appreciation and many thanks to my advisor, Dr. Haiping Xu, for being an incredible mentor to me. Thank you for all the encouragement and the lessons, and for standing by my side throughout my journey; you believed in me and kept pushing me to achieve my doctoral degree. I will never forget all that you have done for me.

I would also like to thank my committee members, Professor Ramprasad Balasubramanian, Professor Jan Bergandy, Professor Liudong Xing, and Professor Xiaoqin (Shelley) Zhang, for serving on my committee even in hardship. Furthermore, I would like to thank the graduate program director, Professor Gaurav Khanna, for all his support as well.

In addition, I would like to take a moment to thank all of you for your great and strong support, without which this dissertation would not have been written.

Moreover, I would like to thank the Computer and Information Science Department and its chairperson, Professor Jan Bergandy, for giving me the opportunity to help students by granting me a full-time teaching assistant position with full tuition assistantship throughout the course of my studies. Through it, I got to taste the true meaning of helping students, as stated in the proverb I used when applying for the PhD: "It was in my heart to help a little because I was helped much." It belongs to the famous poet Gibran Khalil Gibran, author of "The Prophet" and a native of my hometown, Becharre, Lebanon, and it is engraved on his memorial located on Dartmouth Street in Boston's Copley Square.


And how can I forget Professor Boleslaw Mikolajczak, the previous chairperson of the Computer Science Department, may his soul rest in peace. He was my first advisor upon my arrival at the University of Massachusetts Dartmouth, and he is the main reason I became interested in modeling techniques and formal verification, when I took CIS 361 (Models of Computation) with him in my first semester; building on that, I was the top student in CIS 560 with Dr. Xu.

A special thanks to my parents, who sacrificed a lot to make sure I got a proper education, and to both my brothers, Toufic and Joseph, and my Uncle Gaby, for all their intellectual and financial help and motivation along my journey; I owe them a lot.

I would like to remember my grandmother Therese, who passed away in 2016 and who always prayed for me to get my projects lined up; I will always mention her in my prayers.

Finally, I thank my heavenly Father and Mother: God in His Trinity, the Father, the Son, and the Holy Spirit, and Saint Mary, Mother of God. Without my faith and the gift of patience, I would not have endured all the milestones of this journey, and I ask for their blessing and support for the rest of it, and to employ the gifts I have been equipped with in the right place, with humbleness, patience, the joy of giving, and above all with love.


TABLE OF CONTENTS

LIST OF FIGURES ...... v

LIST OF TABLES ...... vii

Chapter 1: INTRODUCTION ...... 1

1.1 Motivation and the State of the Art ...... 1

1.2 Problem Statement ...... 3

Chapter 2: RELATED WORK ...... 6

Chapter 3: BACKGROUND KNOWLEDGE ...... 10

3.1 Introduction to Cloud Computing and Virtualization ...... 10

3.1.1 Cloud Computing ...... 10

3.1.2 Virtualization ...... 12

3.2 Introduction to Software Bugs, Software Aging and Software Rejuvenation ...... 14

3.2.1 Software Bugs Classification ...... 14

3.2.2 Software Aging (SA) ...... 16

3.2.3 Software Rejuvenation (SR) ...... 19

3.3 Introduction to Software Reliability Engineering (SRE) ...... 20

3.4 Introduction to Fault Trees ...... 21

3.4.1 Static Fault Trees (SFT) ...... 21

3.4.2 Dynamic Fault Trees (DFT) ...... 22

Chapter 4: REJUVENATION OF CLOUD-BASED COMPONENTS ...... 24

Chapter 5: MODEL WITH CONSTANT FAILURE RATE AND ONE HOT SPARE ...... 28

5.1 Modeling and Analysis ...... 28


5.1.1 Software Spare (SSP) Gate with One Hot Software Spare (1-HSS)...... 28

5.1.2 Verification Using CTMC ...... 33

5.1.3 Modeling and Analysis Using DFT in Two Phases ...... 37

5.2 Case Study 1 : Constant Failure Rate with One Hot Software Spare (1-HSS) ...... 39

Chapter 6: MODEL WITH CONSTANT FAILURE RATE AND TWO HOT SPARES ...... 48

6.1 Modeling and Analysis ...... 48

6.1.1 Software Spare Gate with Two Hot Software Spares (2-HSSs) ...... 48

6.1.2 Verification Using CTMC ...... 55

6.2 Case Study 2 : Constant Failure Rate with Two Hot Software Spares (2-HSSs) ...... 60

6.3 Case Studies 1 and 2 Comparison ...... 67

Chapter 7: MODEL WITH NON-CONSTANT FAILURE RATE AND ONE HOT SPARE ..... 71

7.1 Non-Constant Failure Rate and Common Distribution Functions ...... 71

7.2 Modeling and Analysis ...... 72

7.3 Case Study 3: Non-Constant Failure Rate with One Hot Software Spare...... 74

Chapter 8: MODEL WITH NON-CONSTANT FAILURE RATE AND TWO HOT SPARES .. 82

8.1 Modeling and Analysis ...... 82

8.2 Case Study 4: Non-Constant Failure Rate with Two Hot Spares ...... 85

8.3 Case Studies 3 and 4 Comparison ...... 92

8.4 Case Studies: Results Interpretation ...... 95

Chapter 9: CONCLUSIONS AND FUTURE RESEARCH PLAN ...... 99

REFERENCES ...... 101


List of Figures

Figure 3.2.1 Software Bugs and Possible Solutions ...... 15

Figure 4.1 An example of a reliable cloud-based system with spare software components ...... 24

Figure 5.1.1 An SSP gate with primary component P and a HSS component H ...... 29

Figure 5.1.2 Two cases for the failure of an SSP Gate with 1-HSS ...... 29

Figure 5.1.3 The initial unreliability of H* when P fails (i.e., the unreliability of H at time τ1) ...... 31

Figure 5.1.4 The CTMC model of the SSP gate in Figure 5.1.1 ...... 34

Figure 5.1.5 A DFT model with 2 SSP gates (Phase 2) ...... 38

Figure 5.2.1 A cloud-based system with 2 servers and their HSSs ...... 40

Figure 5.2.2 DFT model of the cloud-based system - Case Study1 - Phase1 ...... 41

Figure 5.2.3 DFT model of the cloud-based system in Case Study1-Phase 2 (Scenario 1) ...... 43

Figure 5.2.4. DFT model of the cloud-based system in Case study1-Phase 2 (Scenario 2 both cases) 44

Figure 5.2.5 Case study 1- Rejuvenation scheduling (Scenario 1 vs. Scenario 2) ...... 47

Figure 6.1.1 An SSP gate with a primary component P and two HSSs H1 and H2 ...... 48

Figure 6.1.2 Six cases for the failure of an SSP gate with two HSSs ...... 48

Figure 6.1.3.a Failure time of H1 and H2 for reliability analysis in Case 1 ...... 51

Figure 6.1.3.b A general view of Figure 6.1.3.a – Case 1 reliability analysis ...... 54

Figure 6.1.4 The CTMC model of the SSP gate with two hot spares ...... 55

Figure 6.2.1: Case Study 2 - A cloud-based system showing two hot spares for each server ...... 60

Figure 6.2.2. DFT model of the cloud-based system - with a) Phase 1 and b) Phase 2 Scenario1 ...... 61

Figure 6.2.3 DFT model of the cloud-based system in Case Study 2 - Phase 2 (Scenario 2 both cases) . 65


Figure 6.2.4 Case study 2- Rejuvenation scheduling (Scenario 1 vs. Scenario 2) ...... 67

Figure 6.3.1 Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario1) ...... 70

Figure 7.1. An SSP gate with a primary component and 1-HSS ...... 73

Figure 7.2. A DFT 1-HSS model with both Phases and both Scenarios (Chapter 5) ...... 76

Figure 7.3. Case study 3- Rejuvenation scheduling (Scenario 1 vs. Scenario 2) ...... 76

Figure 8.1. An SSP gate with a primary component and 1-HSS ...... 82

Figure 8.2. A DFT 2-HSS model with both Phases and both Scenarios (Chapter 6) ...... 87

Figure 8.3. Case study 4- Rejuvenation scheduling (Scenario 1 vs. Scenario 2) ...... 92

Figure 8.4. Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario1) ...... 94

Figure 8.5. Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario2) ...... 94

Figure 8.6. Hazard/failure rate function for Weibull distribution with p≥1 ...... 97


List of Tables

Table 5.1. Case Study 1- System Reliability with Software Rejuvenation (Scenario 1) ...... 45

Table 5.2. Case Study 1- System Reliability with Software Rejuvenation (Scenario 2) ...... 46

Table 6.1. R(t) Analysis Results - Proposed Method vs. CTMC ...... 59

Table 6.2. Case Study 2- System Reliability with Software Rejuvenation (Scenario 1) ...... 64

Table 6.3. Case Study 2- System Reliability with Software Rejuvenation (Scenario 2) ...... 65

Table 6.4. Application Server Reliability with both 1-HSS and 2-HSSs ...... 68

Table 6.5. Database Server Reliability with both 1-HSS and 2-HSSs ...... 68


Table 7.1. Case Study 3- System Reliability with Software Rejuvenation (Scenario 1) ...... 78

Table 7.2. Case Study 3- System Reliability with Software Rejuvenation (Scenario 2) ...... 79

Table 8.1. Case Study 4- System Reliability with Software Rejuvenation (Scenario 1) ...... 89

Table 8.2. Case Study 4- System Reliability with Software Rejuvenation (Scenario 2) ...... 90

Table 8.3. Application Server Reliability with both 1-HSS and 2-HSSs ...... 92

Table 8.4. Database Server Reliability with both 1-HSS and 2-HSSs ...... 93

Table 8.5. Database Server Reliability with both 1-HSS and 2-HSSs ...... 97


Chapter 1

INTRODUCTION

1.1 Motivation and the State of the Art

Due to recent advances in cloud computing technologies, cloud services have been used in many different areas such as traffic control, real-time sensor networks, healthcare, and mobile cloud computing. Cloud service providers have tried to deliver products with high quality of service (QoS), providing users with fault-tolerant hardware and reliable software platforms for deploying cloud-based applications [1][2]. However, cloud outages are still very common due to component failures, which can quite negatively affect the revenue of cloud-based systems. Previous research on the reliability of cloud-based systems has focused on hardware reliability and availability; as a result, hardware fault tolerance and fault management are well understood and developed [3]. With the promised high reliability and availability of physical facilities (i.e., the hardware) provided by cloud service providers, software faults have now become a major factor in cloud-based system failures. Since software reliability is considered one of the weakest points in system reliability, software fault tolerance and failure forecasting require more attention than hardware fault tolerance in modern computer systems [4][5]. Therefore, this work is motivated to deal with software faults in cloud computing in order to assure high reliability and availability of cloud-based software systems.

In many safety-critical computer-based systems, a failure of the software system may lead to unrecoverable losses, such as the loss of human life [6]. Such systems are required to be perfectly


reliable and never fail based on the discipline of fault-tolerant and reliable computing.

Reliability and availability are two common ways to express system fault tolerance in industry. A reliable computer-based system typically has high availability if unreliability is the major cause of unavailability. In this thesis, we focus on analyzing the reliability of cloud-based systems for software fault tolerance in software reliability engineering (SRE).

Traditional SRE has been based on the analysis of software defects and bugs such as Heisenbugs or Bohrbugs, without considering software aging-related bugs [4]. Bohrbugs are mainly design defects that can be eliminated by debugging or by adopting design diversity, while Heisenbugs are defined as faults that stop causing failures when one attempts to isolate them. The concept of the software aging phenomenon was introduced in the mid-1990s; it describes how the system resources used by software degrade gradually as a function of time [7]. Software aging shows up due to multiple factors, such as memory bloating, memory leaks, unterminated threads, data corruption, unreleased file locks, storage space fragmentation, and the accumulation of round-off errors while software is running. Software aging has considerably changed the SRE field of study and has become a major factor in the reliability of a fully tested and deployed software system. To deal with software aging and to assure software fault tolerance, the software rejuvenation process has been introduced as a proactive approach to counteracting software aging and maintaining a reliable software system [8]. Software rejuvenation involves actions such as occasionally stopping the running software and cleaning its internal state (e.g., garbage collection, flushing kernel tables, and reinitializing internal data structures). The simplest way to perform software rejuvenation is to restart the application that causes the aging problem or to reboot the whole system.


Due to the ever-growing cloud computing technology and its vast markets, the workload of a cloud-based system has increased dramatically. A heavy workload on a cloud-based system will inevitably lead to more software aging problems [9]. In this thesis, we introduce an approach to developing rejuvenation schedules for cloud-based systems in order to maintain their high system reliability. We adopt an analytical-based approach to compute the reliability of a cloud-based system using Dynamic Fault Trees (DFTs). To maintain high system reliability and ensure a zero-downtime rejuvenation process, we use cloud-based spare parts as major software components. Once the DFT model is developed, it is converted into Continuous Time Markov Chains (CTMC) to calculate the system reliability. We assume a practical reliability threshold for the core software components of the system. When the reliability threshold is reached, software rejuvenation is triggered, and the reliability of the cloud-based system is boosted to its initial state. Our case studies show that software rejuvenation scheduling based on the reliability analysis of a cloud-based system can significantly enhance its system reliability and availability.
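To make this scheduling idea concrete, here is a minimal sketch, assuming a single component with a constant failure rate; the rate and threshold values below are illustrative, not figures from our case studies.

```python
import math

# A minimal sketch, assuming a single component whose time-to-failure is
# exponentially distributed (constant failure rate). The rate and the
# threshold are hypothetical, chosen only for illustration.
LAMBDA = 0.002      # failure rate in failures per hour (illustrative)
THRESHOLD = 0.90    # predefined reliability threshold

def reliability(t: float, lam: float = LAMBDA) -> float:
    """R(t) = e^(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-lam * t)

# R(t) falls to the threshold at t = -ln(threshold) / lambda. Since
# rejuvenation boosts the component back to its initial reliability,
# the same interval repeats, giving a periodic schedule.
interval = -math.log(THRESHOLD) / LAMBDA
print(f"rejuvenate every {interval:.1f} hours to keep R(t) >= {THRESHOLD}")
```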

1.2 Problem Statement

Due to the promised high reliability of the physical facilities provided for cloud services, software faults have become one of the major factors in cloud system failures. This research focuses on developing reliability-based software rejuvenation schedules to maintain high reliability and fault tolerance for cloud-based software systems that are subject to software aging. The work is addressed in an infrastructure as a service (IaaS) cloud model framework, where software aging progressively degrades system performance because of the exhaustion of system resources.


In order to counteract the software aging problem and enhance the fault tolerance of cloud-based software systems, we adopt and customize a proactive technique, namely software rejuvenation, along with passive redundancy, specifically software standby spares.

Moreover, we take advantage of cloud scalability and virtualization technology to propose an innovative rejuvenation process that is cloud-specific. One way of software rejuvenation is a system reboot, e.g., restarting a virtual machine (VM) in the context of cloud-based systems. The basic idea of our approach is to create a new instance of a VM to replace the one to be rejuvenated. Since the newly deployed VM instance has not yet been affected by the software aging phenomenon, the reliability of the software component is boosted back to its initial condition. Along with the proposed rejuvenation process, we further adopt the software redundancy technique using two different types of software standby spares for cloud-based systems, namely cold spares and hot spares, within the dynamic fault tree (DFT) formalism.

In our approach, we define an extension of DFT, called the SSP gate, which is used to evaluate the reliability of a cloud-based system with multiple software spares for its critical components. We introduce a novel analytical-based approach to analyzing the extended DFT model for reliability calculation. Our approach does not suffer from the state-space explosion problem as it is compositional, where a DFT is decomposed into subtrees, and the system reliability is calculated by joining the reliabilities of the subtrees. The analytical approach is then formally verified using a Continuous Time Markov Chain (CTMC) model to ensure its correctness. As the CTMC approach has the intrinsic limitation of only supporting components with constant failure rates, our proposed analytical approach is a


formal way to correctly derive the reliability function of an SSP gate without such a limitation.

By employing the innovative rejuvenation process along with the extended DFT formalism and the proposed novel analytical approach, we can overcome software aging by generating the system reliability function based on the probability density function (pdf) of the time-to-failure, regardless of its distribution. When the computed reliability of a system component or of the whole system reaches a predefined threshold, the rejuvenation process is triggered, and the system reliability is kept above the predefined threshold, achieving zero-downtime rejuvenation.


Chapter 2

RELATED WORK

In 1995, researchers introduced the software rejuvenation technique to deal with aging-related faults [8]. This technique, in contrast to a reactive approach in which actions are taken after software failure, is considered a proactive approach that preemptively restarts the aging application and cleans up software aging-related bugs [10][11]. Previous studies on software aging and software rejuvenation for predicting a rejuvenation schedule can be classified into two categories, namely measurement-based and analytical-based [10]. In an analytical-based approach, a failure distribution is assumed for the software aging phenomenon, and software rejuvenation is executed at a fixed interval based on the analytical results for system reliability and availability. Several analytic models have been proposed to determine the optimal time for rejuvenation. In [12], a fine-grained software degradation model is proposed, which is based on tracking system parameters for the degradation level of the system. In their approach, optimal rejuvenation rules are presented based on an alert threshold and a risk criterion. In [13], an inspection-based software rejuvenation approach is proposed based on the current system degradation. Semi-Markov processes were the basis of the software rejuvenation model in [13], and the optimal rejuvenation strategy was based on the analytical analysis of system cost and steady-state availability. A time-based approach has been applied in cluster systems, where analytical models of the implementation of software rejuvenation show an increase in system availability and a decrease in expected cost [10], [14], [15]. To sum up, an analytical-based approach assumes a certain failure time distribution that is likely to describe the degradation model due to software aging, and then develops Markov, semi-Markov, stochastic Petri net, Bayesian network, and similar models to analyze system metrics such as reliability and availability, in order to schedule rejuvenation.

On the other hand, a measurement-based approach applies statistical analysis to measured data on resource usage and the degradation that leads to software aging. A monitoring program runs continuously to collect the measured data, which is analyzed to estimate the system degradation level. When exhaustion reaches a critical level, software rejuvenation is triggered. Hence, the analysis establishes a link between resource leaks or degradation and the failure rate: the time until exhaustion reaches a critical level is estimated, and software rejuvenation is performed automatically, which is called an optimal rejuvenation schedule. Machida et al. used the Mann-Kendall test to detect software aging from traces of computer system metrics [16]. They tested for the existence of monotonic trends in time series, which are often considered an indication of software aging.
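To make the trend-detection step concrete, the sketch below computes the Mann-Kendall S statistic over a resource-usage trace; the trace values are made up, and a complete test would also compute the variance of S and a significance level.

```python
def mann_kendall_s(series):
    """Mann-Kendall S statistic: the sum of sign(x_j - x_i) over all
    pairs i < j. A large positive S indicates a monotonically
    increasing trend, e.g., steadily growing memory usage that hints
    at software aging."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s += (series[j] > series[i]) - (series[j] < series[i])
    return s

# Hypothetical hourly samples of used swap space in MB.
trace = [100, 102, 101, 105, 108, 107, 112, 115, 118, 121]
print("S =", mann_kendall_s(trace))  # positive => upward (aging) trend
```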

Grottke et al. studied the resource usage in a web server subject to an artificial workload [17]. They applied non-parametric statistical methods to detect and estimate trends in the data sets for predicting future resource usage and software aging issues. In [18], Trivedi et al. introduce an analytical model that bases its analysis on resource measurements driven by the amount of workload for a defined software system. Measurement-based approaches thus provide useful insights about dynamic system behaviors and failure distributions related to software aging. As such, our research is complementary to the existing research efforts on measurement-based software rejuvenation techniques that investigate the relationship between software metrics and aging-related software faults using statistical analysis [19]. Although the above approaches addressed the issue of software rejuvenation using analytical and/or measurement-based techniques with formal methods for modeling, in this work we use the DFT formalism to address software aging and software rejuvenation for cloud-based systems.

Similar ideas were applied to virtualized systems and cloud systems after the wide adoption of virtualization. Machida et al. propose an availability model, using stochastic reward nets (SRNs), for virtualized systems with time-based rejuvenation for virtual machines (VMs) and the VM manager (VMM). They consider three different rejuvenation schemes for managing the running VMs during VMM rejuvenation: the VMs can be suspended, rebooted, or migrated. However, the same VMMs are reused, and the VMs need to be deployed back onto the initial VMMs after rejuvenation [20]. Thein et al. propose an analytical approach that models availability for application servers. They consider two cases of redundancy, but for the rejuvenation itself, they still use the same VMMs [21]. Thein et al. improve the work from [21] to cover more VMs in each rejuvenation scenario in order to improve availability and visualize downtime cost [22]. Bruno et al. studied VMM rejuvenation on cloud systems, addressing the rejuvenation aspect in the area of infrastructure as a service (IaaS), where computational resources (VMs) are provided to cloud end users. Time- and load-dependent degradations were modeled for VMMs in the context of steady-state availability, using a proportional hazards model with non-exponentially distributed failure rates. They addressed the rejuvenation issue from the service provider's perspective by considering only VMM rejuvenation and not VM rejuvenation [23]. However, the above approaches are not explicitly based on software reliability analysis. In contrast, our approach analyzes system reliability using DFT models, and can generate rejuvenation schedules that explicitly satisfy the predefined reliability and availability requirements of


a cloud-based system. Our approach views the cloud from the user's end: when a user uses the cloud as IaaS, the user is responsible for maintaining high reliability for the VMs provided by the cloud provider, since the VMs are subject to software aging and rejuvenation.

In this thesis, inspired by the previous work on software rejuvenation over regular datacenters, virtualized datacenters, and cloud-based systems, we propose a software rejuvenation technique that applies to the cloud IaaS model. We model the reliability of the VMs to be rejuvenated using DFT, and we suggest a rejuvenation schedule based on a predefined reliability safety threshold. As cloud users, we take advantage of cloud scalability, so that the rejuvenated VMs are not necessarily deployed on the same VMMs shared by the aged VMs; we can deploy a copied image of the original VM onto another highly available and reliable VMM. In this way, we show the benefit of working with the cloud: there is no commitment to physical resources or VMMs, unlike working with a local datacenter where resources are limited. In contrast, all the techniques discussed above consider reusing the same software resources after rejuvenation, and they are not explicitly based on software reliability analysis. Our approach analyzes system reliability using the DFT formalism and can generate rejuvenation schedules that explicitly satisfy the predefined reliability requirements of a cloud-based system. To the best of our knowledge, there is no previous work using dynamic fault trees to address software aging and software rejuvenation in terms of reliability modeling.


Chapter 3

BACKGROUND KNOWLEDGE

This chapter provides an overview of the concepts related to this research. Cloud computing, software reliability engineering, and the dynamic fault tree formalism are explained. Software aging and software rejuvenation are also explained, along with the related classes of software bugs.

3.1 Introduction to Cloud Computing and Virtualization

3.1.1 Cloud Computing

Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over the Internet. The datacenter hardware and software systems are what we will call a cloud [24]. When a cloud is made available in a pay-as-you-go manner to the public, it is called a public cloud. Current examples of public utility computing include Amazon Web Services (AWS), Google AppEngine, and Microsoft Azure. The term private cloud refers to internal datacenters of a business or other organization that are not made available to the public; such clouds require a security authentication mechanism in order to obtain the service. Cloud computing is a very broad concept, and it covers every sort of online service. There are three major computing models, or areas of operation, namely Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

SaaS represents the largest cloud market and is still growing quickly. SaaS uses the web to deliver applications that are managed by a third-party vendor and whose interface is accessed on the client side. Most SaaS applications can be run directly from a web browser without any downloads or installations required, although some require plugins. This reduces complicated and time-consuming setup procedures and allows the client to run more powerful applications than their client machine may be able to handle, as processing is done in the cloud.

The PaaS model is used for application development, while providing cloud components to software. What developers gain with PaaS is a framework they can build upon to develop or customize applications. PaaS makes the development, testing, and deployment of applications quick, simple, and cost-effective. With this technology, enterprise operations, or a third-party provider, can manage OSes, virtualization, servers, storage, networking, and the PaaS software itself, while developers manage the applications. In this model, developers deploy their own applications or services. Using this service, developers do not have to worry about the underlying software platform, such as the operating system, and can focus their attention on the development of their application. Once the application is developed and refined, they may use it for their own purposes or redistribute it to end users, which would be in line with the SaaS model [25].

Finally, the IaaS model is a self-service model for accessing, monitoring, and managing remote datacenter infrastructures, such as virtualized instances (VMs) on which a user can configure the operating system, underlying software management, applications, and services. With this model, a developer chooses or manually sets up an operating system on a virtual machine in the cloud. Compared to SaaS and PaaS, IaaS users are responsible for managing applications, data, runtime, middleware, and OSes, while providers still manage virtualization, servers, hard drives, storage, and networking [26].


Cloud services are rapidly becoming more reliable. Google and others are reaching uptimes in the range of 99.9%, approaching the most demanding enterprises' expectations. In fact, many IT systems do not need 99.99% availability or nightly backups, which come at a significant cost. Cloud computing is seen by many as the next wave of information technology for individuals, companies, and governments. Cloud technologies have become the basis for radical business innovation and new business models, and for significant improvements in the effectiveness of anyone using information technology, which, these days, increasingly means most of the world. More than eight in ten companies currently use some form of cloud computing solution, and more than half plan to increase cloud investments by 10 percent or more this year, according to a survey sponsored by CompTIA, a nonprofit IT industry association. In addition, nearly 28% of small businesses are currently using IaaS [26]. Cloud-based application reliability decreases with time due to hardware component deterioration caused by the well-known wear-out problem. Moreover, reliability decreases because of software failures due to a variety of bugs; in this thesis we consider software aging (SA) related bugs, which lead to the SA phenomenon affecting the software components within cloud-based systems.

3.1.2 Virtualization

A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. A system virtual machine provides a complete system platform which supports the execution of a complete operating system (OS). An essential characteristic of a virtual machine is that the software running


inside is limited to the resources and abstractions provided by the virtual machine—it cannot break out of its virtual environment.

Hardware virtualization is one of the commonly used technologies in cloud computing due to its numerous characteristics. First, virtualization allows multiple OS environments to coexist on the same computer, in strong isolation from each other. Moreover, virtualization facilitates application provisioning, maintenance, high availability, reliability, and disaster recovery. In addition, virtualization allows easy server migration between physical hardware and the consolidation of multiple inexpensive physical servers into one server. However, VMs are less efficient than a real machine because they access the hardware indirectly. Also, when multiple VMs run concurrently on the same physical host, each VM may exhibit varying and unstable performance depending on the imposed workload. This can be counteracted by proper techniques that assure temporal isolation among virtual machines [27].

The desire to run multiple operating systems was the original motivation for virtual machines, as it allowed time-sharing a single computer between several single-tasking operating systems. In some respects, a system VM can be considered a generalization of the concept of virtual memory that historically preceded it. IBM's CP/CMS, the first system to allow full virtualization, implemented time sharing by providing each user with a single-user operating system, the CMS. Unlike virtual memory, a system virtual machine allowed the user to use privileged instructions in their code. This approach had certain advantages; for instance, it allowed users to add input/output devices not allowed by the standard system.


The guest OSes do not have to be compliant with the hardware, making it possible to run different OSes on the same computer (e.g., Windows and Linux, or older versions of an OS to support software that has not yet been ported to the latest version).

3.2 Introduction to Software Bugs Classification, Software Aging and Software Rejuvenation

3.2.1 Software Bugs Classification

Bohrbugs:

This class of software faults can be easily removed; such faults are meant to be removed during the debugging phase. If they remain in the operational phase, then the only way out is design diversity, wherein applications providing the same functionality but using different designs/implementations are used to mask faults in individual implementations [28].

Heisenbugs:

Obviously, most design faults in software are likely to have been detected and removed during testing, and subsequently as a result of feedback during field use. Heisenbugs are bugs in the software that are revealed only during a specific occurrence of events. For instance, a sequence of operations may leave the software in a state that results in an error on the operation executed next. Synchronization oversights in multithreaded software are another example, where errors occur during some executions but do not occur when repeated. Such errors are said to be caused by transient faults. Simply retrying a failed operation, or, if the application process has crashed, restarting the process (the restarting could be done by

middleware providing Software Implemented Fault Tolerance, SIFT) might resolve the problem [28].

Figure 3.2.1. Software Bugs and Possible Solutions [29]

Software Aging bugs:

Another type of fault observed in software systems is due to the phenomenon of resource exhaustion. Operating system resources, such as swap space and free available memory, are progressively depleted due to defects in software such as memory leaks and incomplete cleanup of resources after use. These faults may exist in the operating system, middleware, and the application software. The estimated rate of resource exhaustion, and consequently the expected time of software failure, has been the focus of research on software rejuvenation techniques. Periodically restarting a process or rebooting a node, or performing prediction-based rejuvenation based on the observed rate of resource exhaustion, may help prevent the software from crashing (operating system, middleware, application). When a


software application executes continuously for long periods of time, some of the faults cause the software to age due to error conditions that accrue with time and/or load [28].

To sum up, software faults should ideally have been removed during the debugging phase. Even if a piece of software has been thoroughly tested, it may still contain some non-deterministic bugs, such as Heisenbugs, as well as aging-related bugs.

3.2.2 Software Aging (SA)

The software aging (SA) phenomenon was introduced almost two decades ago, and its occurrence in real systems has been researched and demonstrated. It is important to distinguish between two definitions of software aging discussed among scientists. The first refers to the degradation of a software system's performance over many years, as maintenance of the system implies significant changes while the customer's requirements evolve. The second definition, which is the one that concerns us in this thesis, states that software aging refers to the performance degradation of a software system over a period of hours, days, or months, as errors accumulate while the system is running. This second definition applies mainly to software systems that are designed to run for long periods of time, such as a server in a client-server application (e.g., a cloud-based e-commerce application system) or a control system on a long-term space mission. Software aging starts to show up due to multiple factors such as memory bloating, memory leaks, unterminated threads, data corruption, storage space fragmentation, accumulation of round-off errors, and unreleased file locks. Software aging has been observed not only in specialized software but also in widely used software, where rebooting to clear a problem is a common practice. Aging occurs because software is extremely complex and never completely free


of errors. It is almost impossible to fully test and verify that a piece of software is bug-free.

This situation is further exacerbated by the fact that software development tends to be extremely time-to-market-driven, which results in applications that meet short-term market needs yet do not account very well for long-term ramifications such as reliability.

Hence, residual faults have to be tolerated in the operational phase. These residual faults can take various forms, but the ones that we are concerned with cause long-term depletion of system resources such as memory, threads, and kernel tables. The essentially economic problem of developing and producing bug-free code is not the problem at hand; instead, we address one of the problems that arise from the prevailing approach to developing software, and one approach to attacking that problem is software rejuvenation [7].

Memory leaks:

A memory leak occurs when a computer program acquires memory but fails to release it back to the operating system. In object-oriented programming, a memory leak may happen when an object is stored in memory but cannot be accessed by the running code. Because they can exhaust available system memory as the application runs, memory leaks are often seen as a major contributing factor to software aging.

A memory leak can diminish the performance of the computer by reducing the amount of available memory. Eventually, in the worst case, too much of the available memory may become allocated and all or part of the system or device stops working correctly, the application fails, or the system slows down unacceptably due to thrashing.

Memory leaks may not be serious or even detectable by normal means. In modern operating systems, normal memory used by an application is released when the application


terminates. This means that a memory leak in a program that only runs for a short time may not be noticed and is rarely serious.

Memory leaks that are much more serious might happen when the program runs for an extended time and consumes additional memory over time, such as background tasks on servers, but especially in embedded devices which may be left running for many years.

Another case where leaks appear is when new memory is allocated frequently for one-time tasks, such as when rendering the frames of a computer game or animated video.

Moreover, memory leaks can occur when a program requests memory, such as shared memory, that is not released even when the program terminates. In addition, leaks show up when memory is very limited, such as in an embedded system or portable device.

One last example of a memory leak is when the system runs on an operating system that does not automatically release memory on program termination. Often on such machines, if memory is lost, it can only be reclaimed by a reboot [31], [32].
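To make the failure pattern concrete, here is a minimal, purely illustrative example of the leak pattern described above; all names are made up and do not come from any real system.

```python
# Illustrative only: a module-level cache that is filled on every
# request but never pruned, so a long-running server process slowly
# exhausts memory: the classic aging-related leak described above.
_request_cache = {}   # lives for the lifetime of the process

def handle_request(request_id: str, payload: bytes) -> None:
    # Bug: each payload is kept "for later", but no entry is ever
    # evicted, so memory use grows monotonically with uptime.
    _request_cache[request_id] = payload

# A fix would bound the cache (e.g., an LRU with a maximum size) or
# evict entries when requests complete. Rejuvenation (restarting the
# process) clears the accumulated state without removing the bug.
```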

Storage space and fragmentation:

Memory fragmentation is one of the most severe problems faced by system managers. Over time, it leads to degradation of system performance. Eventually, memory fragmentation may lead to complete loss of free memory. Memory fragmentation is a kernel programming level problem. During real-time computing of applications, fragmentation levels can reach as high as 99% and may lead to system crashes or other instabilities. This type of system crash can be difficult to avoid, as it is impossible to anticipate the critical rise in levels of memory fragmentation. According to research conducted by the International Data

Corporation, the performance degradation is largely due to external fragmentation; the lifespan of a server is shortened by 33% by external fragmentation alone. This leads to a


direct increase of 33% in the yearly budget for hardware upgrades. Thus it can be concluded that memory fragmentation has an undesirable effect not only on memory usage and processing speed of the system but also on hardware components and cost of a project [32].

3.2.3 Software Rejuvenation (SR)

To counteract software aging, a proactive technique called software rejuvenation has been developed. It involves stopping the running software occasionally, “cleaning” its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures) and restarting it. An extreme but well-known example of rejuvenation is a system reboot. Proactive fault management takes suitable corrective action to prevent a failure before the system experiences a fault. It has only recently gained recognition and importance for computer systems. Software rejuvenation is a specific form of proactive fault management which can be performed at suitable times, such as when there is no load on the system, and thus typically results in less downtime and cost than the reactive approach. Since proactive fault management incurs some overhead, an important research issue is to determine the optimal times to invoke it in operational software systems

[8].

Two policies have been studied for applying software rejuvenation: (a) scheduling periodic actions for rejuvenation, or (b) estimating the resource exhaustion time and performing proactive rejuvenation. While the first policy is simple to understand and apply, it does not provide the best result in terms of availability and cost, since it may trigger unnecessary rejuvenation actions. Prediction-driven proactive rejuvenation is potentially a better option.


There are two basic approaches to applying prediction for software rejuvenation: (i) the analytic-based approach and (ii) the measurement-based approach.

3.3 Introduction to Software Reliability Engineering (SRE)

In this thesis, we focus on analyzing the reliability of cloud-based systems for software fault tolerance in software reliability engineering (SRE). Traditional SRE has been based on the analysis of software defects and bugs such as Heisenbugs or Bohrbugs, without considering software aging (SA) related bugs. Bohrbugs are mainly design defects that can be eliminated by debugging or adopting design diversity, while Heisenbugs are defined as faults that stop causing failures when one attempts to isolate them.

Reliability Definition:

Reliability is the probability that a product or component will operate properly for a specified period of time under the design operating conditions without failure. In other words, reliability is used as a measure of the system’s success in providing its function properly. Reliability is one of the quality characteristics that consumers require from the manufacturer of products.

Mathematically speaking, reliability R(t) is the probability that a system will operate successfully in the interval from time 0 to time t: R(t) = P(T > t), where T is a random variable denoting the failure time, for t ≥ 0. Hence the unreliability is F(t) (also written U(t)) = P(T ≤ t), where F(t) is the failure distribution function. If the failure time T has a probability density function (pdf) f(t), this implies

$$R(t) = \int_t^{\infty} f(s)\,ds \qquad \text{and} \qquad U(t) = \int_0^t f(s)\,ds.$$

The pdf

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t}$$

gives the probability density that the failure time T occurs between the operating time t and the next interval of operation t + Δt. Consequently, the failure rate function is found to be h(t) = f(t)/R(t) [4]. If the pdf f(t) is exponential, then h(t) = λe^(−λt)/e^(−λt) = λ is constant; we will use the term constant failure rate in later chapters, in particular Chapters 5 and 6. Non-constant failure rates are discussed in Chapter 7.
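These definitions translate directly into code. The sketch below evaluates the hazard rate for the exponential case and, as a preview of the non-constant failure rates used later, for a Weibull distribution; all parameter values are illustrative.

```python
import math

def exp_R(t: float, lam: float) -> float:
    return math.exp(-lam * t)               # R(t) = e^(-lambda t)

def exp_h(t: float, lam: float) -> float:
    f = lam * math.exp(-lam * t)            # pdf f(t)
    return f / exp_R(t, lam)                # h(t) = f(t)/R(t) = lambda

def weibull_R(t: float, lam: float, p: float) -> float:
    return math.exp(-((lam * t) ** p))      # R(t) = e^(-(lambda t)^p)

def weibull_h(t: float, lam: float, p: float) -> float:
    # h(t) = p * lambda * (lambda t)^(p-1); increasing in t when p > 1,
    # which is how aging components are modeled later in this thesis.
    return p * lam * (lam * t) ** (p - 1)

for t in (10.0, 100.0):
    print(t, exp_h(t, 0.01), weibull_h(t, 0.01, 2.0))
# The exponential hazard stays at 0.01; the Weibull hazard grows with t.
```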

3.4 Introduction to Fault Trees

Fault trees provide a conceptual modeling framework for representing system-level reliability in terms of interactions between component reliabilities.

3.4.1 Static Fault Tree (SFT) Model:

The fault tree modeling technique was introduced in 1962 at Bell Telephone Laboratories, in connection with a safety calculation of the launching system for the intercontinental Minuteman missile. Fault tree analysis is by far the most commonly used technique for risk and reliability analysis. A fault tree is a logic diagram that displays the interrelationships between potentially critical events in a system and the reasons for these events. A fault tree model describes the system failure in terms of the failures of its components. A standard FT consists of combinatorial models built using static gates (the AND, the OR, and the K/M gates) and basic events (BEs). A combinatorial model only captures the combination of events, and not the order of occurrence of their failures. Combinatorial models are, therefore, inadequate for modeling today's complex dynamic systems [30].

Analysis of SFT models is based on various techniques such as binary decision diagrams (BDDs), inclusion-exclusion (I/E), or the sum of disjoint products (SDP). Binary decision diagrams are usually applied to large SFT models, since they render the analysis easier to comprehend. In this thesis, since the study involves one OR gate, we use the sum of disjoint products (SDP) to determine the occurrence probability of the top event (system failure), given the probabilities of occurrence of the basic events.

For an OR gate, the reliability of the top event depends on the reliability of the basic events. Consider an OR gate with two basic events A and B; given the unreliabilities of A and B, the unreliability of the top event is

Utop-event(t) = UA(t) + [1 - UA(t)] * UB(t)

Hence the reliability is as follows: Rtop-event(t) = 1 - Utop-event(t).
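The two-event formula above can be evaluated directly, as in the following sketch, where the basic events are given illustrative exponential unreliabilities.

```python
import math

def U_exp(t: float, lam: float) -> float:
    """Unreliability of a basic event with constant failure rate lam."""
    return 1.0 - math.exp(-lam * t)

def U_or(UA: float, UB: float) -> float:
    """Sum of disjoint products for a 2-input OR gate:
    A fails, or A survives and B fails."""
    return UA + (1.0 - UA) * UB

t = 100.0                                   # mission time (illustrative)
UA, UB = U_exp(t, 0.001), U_exp(t, 0.002)   # hypothetical rates
print("U_top =", U_or(UA, UB), " R_top =", 1.0 - U_or(UA, UB))
```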

3.4.2 Dynamic Fault Tree (DFT) Model:

DFTs augment the standard combinatorial (AND, OR, and M-out-of-N) gates that belong to regular fault trees and introduce three novel modeling capabilities: (1) spare component management and allocation, (2) functional dependency, and (3) failure sequence dependency. These modeling capabilities are realized using three main dynamic gates: the spare gate, the functional dependency (FDEP) gate, and the priority AND (PAND) gate. The PAND gate fails when all its inputs fail and they fail in left-to-right order. The spare gate has one primary input and one or more alternate inputs (i.e., the spares). The primary input is initially powered on, and when it fails, it is replaced by an alternate input. The spare gate fails when the primary and all the alternate inputs fail (or are unavailable). A spare can also be shared among multiple spare gates. The FDEP gate is comprised of a trigger event and a set of dependent components. When the trigger event occurs, it causes the dependent components to become inaccessible or unusable (i.e., essentially failed). The FDEP gate's output is a dummy output (i.e., it is not taken into account during the calculation of the system's failure probability).


Fault tree analysis may be quantitative, qualitative, or both, depending on the objective of the analysis. DFT modules are solved by converting them into Continuous Time Markov Chains (CTMCs).
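As a small preview of that conversion, the following sketch numerically solves a four-state CTMC for a spare gate with a primary P and one hot spare H under constant failure rates. The state space and the rates are an illustrative reading of such a gate, not the exact chains used later in this thesis.

```python
import numpy as np
from scipy.linalg import expm

# States of an illustrative spare-gate CTMC (constant, hypothetical rates):
#   0: P and H both up            1: P failed, H* acting as primary
#   2: H failed, P running alone  3: gate failed (absorbing)
lam_P, lam_H, lam_Hs = 0.01, 0.002, 0.012

Q = np.array([
    [-(lam_P + lam_H), lam_P,   lam_H,  0.0   ],
    [0.0,              -lam_Hs, 0.0,    lam_Hs],
    [0.0,              0.0,     -lam_P, lam_P ],
    [0.0,              0.0,     0.0,    0.0   ],
])

p0 = np.array([1.0, 0.0, 0.0, 0.0])   # both components up at t = 0
for t in (50.0, 100.0, 200.0):
    p_t = p0 @ expm(Q * t)            # transient state probabilities
    print(f"t={t:6.1f}  R(t)={1.0 - p_t[3]:.6f}")   # R(t) = 1 - P(failed)
```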


Chapter 4

REJUVENATION OF CLOUD-BASED COMPONENTS

Virtualization technology has been well adopted in cloud computing; it allows one to share a machine's physical resources among multiple virtual environments, called virtual machines (VMs). As shown in Figure 4.1, a VM is not bound to the hardware directly; rather, it is bound to generic drivers that are created by a virtual machine manager (VMM), or hypervisor. Since a VM can be easily created and destroyed, it is particularly useful in the disaster recovery process of a cloud-based system. In this thesis, a cloud-based system refers to a software system that consists of multiple VMs, where each VM is considered a software component within the system.


Figure 4.1. An example of a reliable cloud-based system with spare software components

As a proactive fault management technique, software rejuvenation has been used to refresh system internal states and prevent the occurrence of software failures due to software aging. As mentioned before, a simple way for software rejuvenation is system reboot, e.g., to restart a VM or all VMs in a cloud-based system. The basic idea of our approach is to create a new instance of VM that replaces the one to be rejuvenated. Since the newly deployed VM instance has not yet been affected by the software aging


phenomenon, the reliability of the software component, after being replaced, is boosted back to its initial condition. To achieve a reliable and zero-downtime rejuvenation, we define two types of VM spares, namely Hot Software Spare (HSS) and Cold Software

Spare (CSS).

In the context of cloud-based systems, an HSS is a hot standby VM instance that can be instantly available when a primary component fails. Despite the fact that an HSS is running alongside a primary component, it is not sharing any workload or processing any requests.

Therefore, an HSS is operated using much less CPU power, but can be scaled automatically to meet the workload requirements when a primary component fails. Critical data in an

HSS is mirrored in near real time from the primary VM instance, e.g., in the range of 200

µs, to ensure high fault tolerance. The failure rate of an HSS is much lower than that of a primary component, as an HSS is not subject to aging-related bugs. This makes a software-defined HSS differ significantly from a hardware-based Hot SPare (HSP) because, with physical wear-out, an HSP may have the same failure rate as a primary hardware component. Therefore, an HSS behaves more like a hardware warm spare component, with the difference that an HSS's failure rate becomes equal to the primary component's failure rate once software aging starts affecting it. On the other hand, a CSS refers to a software component that is available as an image of a VM, and can be replicated and deployed as a primary component or an HSS component. As an inactive VM instance, a CSS is mirrored for its critical data based on a specified schedule, remaining in cold standby most of the time.

Therefore, the reliability of a CSS is nearly perfect; it can reasonably be assumed never to fail, so we only consider the primary component and its HSSs when calculating the system reliability. The recovery time using a CSS is usually in the range of minutes up to


two hours, while the cost of a CSS is only its storage and very little CPU resource consumption.

A CSS can be rapidly deployed, which makes it quite different from a hardware-based Cold SPare (CSP), which is much more expensive and requires manual configuration when a primary one fails [31].

In our approach, a rejuvenation schedule of a cloud-based system is created based on its reliability modeling and the analytical results. When the reliability of a system component or the whole system reaches a predefined threshold, the rejuvenation process is triggered.

In addition, we assume in this work that we are dealing with a request-based software system; therefore, we assume that the rejuvenation process takes about 30 minutes, which is typically sufficient for starting a CSS and transferring all requests to the new VM. If we were working with a computation-based system, we would need to reconsider the rejuvenation design, taking care not to interrupt the running computation while rejuvenation takes place. As a simple example illustrated in Figure 4.1, suppose we have two instances, a primary component P and a hot standby one H, which are deployed on two different physical machines. The two physical machines usually belong to two different zones

(denoted as Zone 1 and Zone 2 in Figure 4.1), so a power/network outage in one zone will not affect the availability of the other [32]. To rejuvenate the whole system, we originally need one CSS for a particular component. Since it is an image, it can be replicated into as many images as needed and then deployed either as a primary or as a hot software spare (HSS). In this particular example, we can start two CSSs C1 and C2, denoted as P’ and H’ in Figure 4.1, to replace P and H, respectively. Although in Figure 4.1, P’ and H’ are deployed on the same physical machines where P and H are deployed, respectively, in reality this is not necessary, and both P’ and H’ can be deployed on any physical servers.


Once the spare components P’ and H’ are up and running, P’ serves as the new primary component and starts to process new user requests, while H’ serves as the new HSS, which is kept alive but does not take any workload. Meanwhile, we allow 30 minutes in total for the old components P and H to finish processing their existing requests. After 30 minutes, we shut down and delete the components P and H, which will have been successfully replaced by P’ and H’ once the rejuvenation process completes. Finally, a copy of a CSS for a particular software component is kept in storage, ready for the next round of the rejuvenation process. Note that in our rejuvenation strategy, we have chosen to shut down instances P and H rather than restart and reuse them. This is because, unlike a physical machine, a VM can be easily created and deployed; thus, deploying new instances P’ and H’ is a much more efficient way than restarting and reusing P and H.
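The sketch below restates this workflow as code. The cloud object and all of its method names (deploy_from_image, redirect_traffic, terminate) are hypothetical placeholders for an IaaS client library, not any real provider API.

```python
import time

DRAIN_SECONDS = 30 * 60   # allow 30 minutes for in-flight requests

def rejuvenate(cloud, css_image, old_primary, old_hss):
    # 1. Replicate the cold spare image into fresh instances P' and H'.
    new_primary = cloud.deploy_from_image(css_image, role="primary")
    new_hss = cloud.deploy_from_image(css_image, role="hot_spare")

    # 2. Route all new requests to P'; H' stays alive with no workload.
    cloud.redirect_traffic(to=new_primary)

    # 3. Let the aged instances drain their existing requests, then
    #    delete them instead of rebooting them: deploying fresh VMs is
    #    cheaper and cleaner than restarting and reusing aged ones.
    time.sleep(DRAIN_SECONDS)
    cloud.terminate(old_primary)
    cloud.terminate(old_hss)
    return new_primary, new_hss
```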

During the rejuvenation procedure, we need to consider two scenarios. The first scenario is to rejuvenate the major software components all together; in this case, we replicate the whole system when the system reliability reaches its threshold. We call this scenario system-specific rejuvenation. The second scenario is component-specific rejuvenation, where each time we only rejuvenate the critical component, typically the one whose reliability is the lowest when the system reliability reaches its reliability threshold [35, 36, 37].


Chapter 5

MODEL WITH CONSTANT FAILURE RATE AND ONE HOT SPARE

In this chapter, we show how to use the SSP gate, a DFT extension, to model and analyze the reliability of a cloud-based system that is subject to software rejuvenation and that employs one software hot spare to improve its fault tolerance. We assume that the time-to-failure for each software component (i.e., a VM) has a probability density function (pdf) that is exponentially distributed; in other words, all VMs have constant failure rates.

5.1 Modeling and Analysis

5.1.1 Software Spare (SSP) Gate with One Hot Software Spare (1-HSS)

The calculation of system reliability is based on an extended Dynamic Fault Tree (DFT) model of cloud-based systems with Software SPare (SSP) gates. Note that an SSP gate has one primary input and one or more alternate inputs (i.e., the spares). The primary input is initially powered on, and when it fails, it is replaced by an alternate input. The spare gate fails when the primary and all the alternate inputs have failed. Figure 5.1.1 shows an SSP gate with one primary component denoted as P and one hot spare component denoted as H. The SSP gate fails when both of the two components P and H fail. In this thesis, the novel analytical approach covers the cloud-based SSP gate with one and two hot software spares. This section first presents the approach for one spare; the details for the case of two spares follow in Chapter 6.



Figure 5.1.1 An SSP gate with primary component P and a HSS component H

Suppose the constant failure rates of components P and H are λP and λH, respectively.

Since H does not take any workload when P is functioning, its failure rate λH is typically lower than λP. When P fails, H takes over P’s workload and behaves as a primary component. H now has a higher constant failure rate λH* than λH due to the software aging phenomenon with H’s full workload. For this reason, we call the spare component H, after its role transition, H*. Note that λH* and λP do not have to be equal because P and H may have different configurations.

There are two cases where the SSP gate fails. In the first case, P fails before H fails, denoted as P ≺ H. This case is illustrated as "Case 1" in Figure 5.1.2, where P fails at τ1 and H* fails at τ2, with τ1 < τ2. In the second case, H fails before P fails, denoted as H ≺ P. In this case, H does not have a chance to behave as a primary component, and the failure of P immediately leads to the failure of the SSP gate. This case is illustrated as "Case 2" in Figure 5.1.2, where τ2 < τ1.

Figure 5.1.2. Two cases for the failure of an SSP gate with 1-HSS (Case 1: P fails at τ1 and H* fails at τ2; Case 2: H fails at τ2 and P fails at τ1)


We now derive the reliability function R(t) of the SSP gate by considering the above two cases.

Case 1: P fails before H fails, denoted as P ≺ H. In this case, it is guaranteed that H does not fail during (0, τ1]. After P fails, H takes over the workload and becomes H*. Intuitively, the distribution function F_{P≺H}(t) of the SSP gate, i.e., the probability that the SSP gate fails during (0, t], can be calculated as in Eq. (5.1).

F_{P \prec H}(t) = \Pr_{P \prec H}(T \le t) = \int_0^t \int_{\tau_1}^t (\lambda_P e^{-\lambda_P \tau_1})(\lambda_{H^*} e^{-\lambda_{H^*} \tau_2}) \, d\tau_2 \, d\tau_1    (5.1)

However, Eq. (5.1) works only when λ_{H*} = λ_H, i.e., when the constant failure rate of H does not change after it switches its role from a spare component to a primary one at time τ1.

When H*  H , as we can see from Figure 5.1.3, the integration of the pdf of H* from τ1 to t does not give the correct unreliability of the component at time t, because it incorrectly assumes that the component behaves as H* starting from time 0. Since the component actually behaves as H during (0, τ1], the unreliability of H* at time τ1 equals the unreliability of H at τ1 rather than the unreliability calculated by the integration of the pdf of H* from 0 to τ1. This requires us to calculate a new starting integration time τH* for H* such that the unreliability of H* at τH* (represented by the shaded area under the pdf of H*) is equal to the unreliability of H at τ1 (represented by the shaded area under the pdf of H). As the pdfs

H H* of H and H* are f ( )  H e and f ( )  H*e , respectively, such a relationship between

H and H* can be described as in Eq. (5.2) ) [36].


 H* 1  eH* d   eH d (5.2)  H*  H 0 0

Solving Eq. (5.2), we have \tau_{H^*} = \frac{\lambda_H}{\lambda_{H^*}} \tau_1. Since H* fails during a period of time (t − τ1), the integration range for H* now becomes [τ_{H*}, t − τ1 + τ_{H*}]. Based on the above analysis, the probability that P fails before H fails can be calculated as in Eq. (5.3) [36].

Figure 5.1.3. The initial unreliability of H* when P fails (i.e., the unreliability of H at time τ1)

H t  1   1 t H* F (t)  Pr (T  t)  ( eP 1 )( eH* 2 )d d (5.3) PH PH   P H * 2 1 0 H  1 H*

u( )    H  To simplify the integration range for H*, we can substitute 2 2 H* 1 for variable τ2

in Eq. (5.3), and derive the distribution function FPH (t) of the HSP gate appearing in Case

1 as in Eq. (5.4).


FPH (t)  PrPH (T  t)

t t  1 H H* (u  1 )  ( eP 1 )( e H* )dud   P H * 1 0 0

t t  1 (  )  u (5.4)  ( e P H 1 )( e H* )dud   P H * 1 0 0 t  ( e(P H ) 1 )(1 eH* (t  1 ) )d  P 1 0    P (1 e(P H )t )  P eH*t (1 e(P H H* )t ) P H P H H*    P (1 e(P H )t )  P (e(P H )t  eH*t ) P H P H H*

Case 2: H fails before P fails, denoted as H ≺ P. In this case, it is guaranteed that P does not fail during (0, τ2]. The distribution function F_{H≺P}(t) of the SSP gate, i.e., the probability that the SSP gate fails during (0, t], can be calculated as in Eq. (5.5).

FPH (t)  PrH P (T  t) t t  ( eP 1 )( eH 2 )d d   P H 1 2 0  2 t  (eP 2  eP t )( eH 2 )d (5.5)  H 2 0 t   (e(P H ) 2  eP teH 2 )d  H 2 0   H (1 e(P H )t )  eP t (1 eH t ) P H   1 eP t  P (1 e(P H )t ) P H

As the two cases are mutually exclusive, the unreliability of the SSP gate at time t is the sum of the unreliability values of the two cases at time t. Thus, we derive the unreliability function U(t) of the SSP gate as in Eq. (5.6).

U(t) = F_{P \prec H}(t) + F_{H \prec P}(t)
= \frac{\lambda_P}{\lambda_P+\lambda_H}(1 - e^{-(\lambda_P+\lambda_H)t}) - \frac{\lambda_P}{\lambda_P+\lambda_H-\lambda_{H^*}}(e^{-\lambda_{H^*} t} - e^{-(\lambda_P+\lambda_H)t})
+ 1 - e^{-\lambda_P t} - \frac{\lambda_P}{\lambda_P+\lambda_H}(1 - e^{-(\lambda_P+\lambda_H)t})    (5.6)
= 1 - e^{-\lambda_P t} - \frac{\lambda_P}{\lambda_P+\lambda_H-\lambda_{H^*}}(e^{-\lambda_{H^*} t} - e^{-(\lambda_P+\lambda_H)t})


Accordingly, the reliability function R(t) of the SSP gate can be derived as in Eq. (5.7).

λ t λ λ t (λ λ )t R(t)  1U(t)  e P  P (e H *  e P H ) (5.7) λ P λ H λ H *

It is worth noting that there is an obvious but subtle third case, where components P and H fail exactly at the same time, denoted as P = H. As the probability of failure associated with the event [T = τ] is 0, i.e., the probability that either P or H fails during [τ, τ] is 0, the unreliability of the SSP gate in the third case P = H must equal 0. This result can be easily derived as in Eq. (5.8), where P fails at time τ1 during (0, t], and H fails exactly at the same time τ1 as P.

t 1 t Pr (T  t)  ( eH 2 )( eP 1 )d d  (eH 1  eH 1 )( eP1 )d  0 (5.8) PH  H P 2 1  P 1 0 1 0

5.1.2 Verification Using CTMC

To formally verify the correctness of the reliability function R(t) of the SSP gate derived in Section 5.1.1, we now use a CTMC model and solve its state equations. Figure 5.1.4 shows the CTMC model corresponding to the SSP gate with 1-HSS presented in Figure 5.1.1. There are four states, 1 to 4, defined in the CTMC model, which are denoted as PH, P, H*, and FAILURE, respectively. The state PH (State 1) refers to the one in which both the primary component and the hot spare one are functioning. When the hot spare component or the primary one fails, the model enters its P state (State 2) or H* state (State 3), respectively. Note that we denote State 3 as H* instead of H because in State 3 the hot spare component has a different failure rate than the one in State 1.


Figure 5.1.4 The CTMC model of the SSP gate in Figure 5.1.1

Let Pi(t) be the probability of the system being in state i at time t, where 1 ≤ i ≤ 4, and let Pij(dt) = P[X(t+dt) = j | X(t) = i] be the incremental transition probability with random variable X(t). The following matrix [Pij(dt)], where 1 ≤ i, j ≤ 4, is the incremental one-step transition matrix [4] of the CTMC defined in Figure 5.1.4.

[P_{ij}(dt)] = \begin{bmatrix} 1-(\lambda_P+\lambda_H)dt & \lambda_H dt & \lambda_P dt & 0 \\ 0 & 1-\lambda_P dt & 0 & \lambda_P dt \\ 0 & 0 & 1-\lambda_{H^*} dt & \lambda_{H^*} dt \\ 0 & 0 & 0 & 1 \end{bmatrix}    (5.9)

The matrix [Pij(dt)], where 1 ≤ i, j ≤ 4, is a stochastic matrix, each row of which sums to 1. This matrix provides the probability of each state either remaining unchanged (when i = j) or transiting to a different state (when i ≠ j) during the time interval dt. Given the initial probabilities of the states, the matrix describes the state transition process completely. From the matrix defined in Eq. (5.9), we can derive the following relations as in Eqs. (5.10.1-5.10.4).

5.10.4).

P1(t  dt)  (1(p  H )dt)P1(t) (5.10.1)

P2 (t  dt )  ( H dt )P1(t )  (1 pdt )P2 (t ) (5.10.2)

P3 (t  dt )  (Pdt )P1(t ) (1 H* dt )P3(t ) (5.10.3)

P4 (t  dt )  (Pdt )P2 (t )  (H* dt )P3(t )  P4 (t ) (5.10.4)


where the initial probabilities are defined by the system starting in State 1. Thus we have P1(0) = 1 and P2(0) = P3(0) = P4(0) = 0. As dt goes to 0, we derive a set of linear first-order differential equations as in Eqs. (5.11.1-5.11.4), which are the state equations of the CTMC model.

P (t  dt )  P (t ) 1 1 (5.11.1) P1'(t )  lim  ( λP  λH )P1(t ) dt0 dt

P (t  dt )  P (t ) 2 2 (5.11.2) P2'(t )  lim  λH P1(t )  λP P2 (t ) dt0 dt

P (t  dt )  P (t ) 3 3 (5.11.3) P3'(t )  lim  λP P1(t )  λH* P3 (t ) dt0 dt

P (t  dt)  P (t) 4 4 (5.11.4) P4'(t)  lim  λP P2 (t)  λH* P3(t) dt0 dt

The state equations defined in Eqs. (5.11.1-5.11.4) can be solved using Laplace transformation, which allows transforming a linear first order differential equation into a linear algebraic equation that is easy to solve.

Let the Laplace transformation of Pi(t) be Fi(s), as defined in Eq. (5.12.1); the Laplace transformation of Pi'(t) can then be derived as in Eq. (5.12.2).

L\{P_i(t)\}(s) = \int_0^{\infty} e^{-st} P_i(t) \, dt = F_i(s)    (5.12.1)
L\{P_i'(t)\}(s) = \int_0^{\infty} e^{-st} P_i'(t) \, dt = sF_i(s) - P_i(0)    (5.12.2)

Now, applying the Laplace transformations defined in Eqs. (5.12.1-5.12.2) to both sides of Eqs. (5.11.1-5.11.4), we can derive Eqs. (5.13.1-5.13.4).


sF_1(s) - P_1(0) = -(\lambda_P+\lambda_H) F_1(s)    (5.13.1)
sF_2(s) - P_2(0) = \lambda_H F_1(s) - \lambda_P F_2(s)    (5.13.2)
sF_3(s) - P_3(0) = \lambda_P F_1(s) - \lambda_{H^*} F_3(s)    (5.13.3)
sF_4(s) - P_4(0) = \lambda_P F_2(s) + \lambda_{H^*} F_3(s)    (5.13.4)

Substituting the initial probabilities Pi(0), where 1 ≤ i ≤ 4, into Eqs. (5.13.1-5.13.4), we can solve for F1(s), F2(s) and F3(s). By further applying the inverse Laplace transformation to F1(s), F2(s) and F3(s), we can solve the original linear first-order differential equations in Eqs. (5.11.1-5.11.4).

F (s)  1  P(t)  e(P H )t 1 (sP H ) 1

F (s)  H  P (t)  ePt  e(P H )t 2 (sP H )( sP ) 2

  F (s)  P  P (t)  P (eH*t  e(P H )t ) 3 (sP H )( sH* ) 3 P H H*

The reliability function R(t) is the summation of P1(t), P2(t) and P3(t), which can be calculated as in Eq. (5.14),

R(t) = P_1(t) + P_2(t) + P_3(t) = e^{-\lambda_P t} + \frac{\lambda_P}{\lambda_P+\lambda_H-\lambda_{H^*}}(e^{-\lambda_{H^*} t} - e^{-(\lambda_P+\lambda_H)t})    (5.14)

It is easy to see that Eq. (5.14) gives exactly the same formula as the one defined in Eq. (5.7); thus, it verifies the correctness of our proposed analytical approach for calculating the reliability of the SSP gate at time t. Note that P4(t) is the probability that the system is in its FAILURE state at time t. Therefore, P4(t) actually defines the system unreliability function U(t) = P4(t) = 1 − R(t).
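The same verification can also be reproduced numerically by integrating the state equations (5.11.1-5.11.4) directly, instead of inverting the Laplace transforms by hand. The sketch below is an illustration only, assuming scipy and the Case Study 1 application-server rates; it checks the ODE solution against the closed form of Eq. (5.7).

```python
from scipy.integrate import solve_ivp

lam_p, lam_h, lam_hs = 0.004, 0.0025, 0.004   # assumed rates (per day)

def ctmc_rhs(t, P):
    # State equations (5.11.1)-(5.11.4) of the 4-state CTMC.
    p1, p2, p3, p4 = P
    return [-(lam_p + lam_h) * p1,
            lam_h * p1 - lam_p * p2,
            lam_p * p1 - lam_hs * p3,
            lam_p * p2 + lam_hs * p3]

sol = solve_ivp(ctmc_rhs, (0, 18), [1, 0, 0, 0], rtol=1e-10, atol=1e-12)
p1, p2, p3, p4 = sol.y[:, -1]
print(p1 + p2 + p3)   # R(18) ~ 0.996044, matching Eq. (5.7)
```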

5.1.3 Modeling and Analysis Using DFT in Two Phases


To model and analyze the reliability of a cloud-based system with spare components, we consider two different phases. Phase 1 represents the pre-rejuvenation stage where the reliability analysis is based on the failure rates of the primary components and their HSSs.

Chapter 4 explains why a CSS is assumed never to fail and hence is not considered in the reliability analysis in Phase 1. We model the system reliability using DFT, and then calculate its reliability based on the reliability function of the SSP gate derived in Section 5.1.1.

Phase 2 is the software rejuvenation phase. When the predefined reliability threshold is reached, the software rejuvenation process is initiated, and the system enters this phase. As we have mentioned, there are two rejuvenation scenarios, namely the system-specific rejuvenation and the component-specific one. To illustrate the basic idea of calculating the system reliability in this phase, we use the first scenario as an example, where the whole system is rejuvenated. In this scenario, two CSSs are started, one as a primary component P' and the other as an HSS H', to replace P and H, respectively. During the rejuvenation period, all four software components P, H, P' and H' coexist and are functioning. As shown in Figure 5.1.5, the dynamic fault tree model is decomposed into two subtrees, S1 and S2, both SSP gates, which are connected by an AND-gate. This is because the system fails only when both of the two SSP gates fail, and the failure of a single SSP gate during the rejuvenation phase will not lead to the failure of the whole system. Subtree S1 consists of components P and H that are to be rejuvenated, while subtree S2 consists of the newly deployed components P' and H', which are used to replace P and H. As both S1 and S2 are defined as SSP gates, they can be computed using the same analysis technique as described in Phase 1.


Figure 5.1.5: A DFT model with two SSP gates (Phase 2)

Once we have the distribution functions of S1 and S2, the static gate, i.e., the AND-gate, can be easily solved using the sum-of-disjoint-products (SDP) method [31]. Specifically, to calculate the reliability of the whole system in this phase, we first calculate the unreliability functions US1(t) and US2(t) for S1 and S2, respectively, based on Eq. (5.6). Then the reliability of the AND-gate can be calculated as in Eq. (5.15).

R(t) 1U AND (t) 1US1(t)*US2 (t) (5.15)

In the following case study, we will consider both of the two scenarios during the rejuvenation process, where Scenario 1 involves rejuvenation of the whole system, and in this case, we need to replicate all major software components when the system reliability reaches the threshold. On the other hand, Scenario 2 is component specific, thus we only rejuvenate the most critical component whose reliability is the lowest when the system reliability reaches its threshold.

5.2 Case Study 1: Constant Failure Rate with One Hot Spare (1-HSS)


A challenging task in cloud computing is to correctly measure the reliability of a cloud-based system and maintain its high reliability. This thesis presents four case studies that show how to model and analyze the reliability of cloud-based systems using DFT, and then estimate an effective rejuvenation schedule that meets the high reliability requirement of the system. The case studies reflect all the materials discussed, starting from software rejuvenation with the two scenarios proposed in Chapter 4, and covering one or two hot standby software spares along with constant and non-constant failure rates in Chapters 5, 6, 7 and 8, respectively. All four case studies are set up so that they can be compared. For instance, we compare Chapter 5 (Case Study 1) with Chapter 6 (Case Study 2) in Section 6.3, and Chapter 7 (Case Study 3) with Chapter 8 (Case Study 4) in Section 8.3. We also interpret the results of all case studies together at the end of Chapter 8, in Section 8.4.

All four case studies consider a typical cloud-based system consisting of an application server layer and a database server layer. Case Study 1 consists of an application server PA and a database server PB. To enhance the system reliability, two hot spare components HA and HB are set up for PA and PB, respectively, which are ready to take over the workload once the primary ones fail. The case study also involves CSS components, namely CSA and CSB, which are used in the rejuvenation process. Note that a CSS is a stored image of a deployed VM instance that can be easily duplicated; thus, only one CSS is needed for each pair of primary and HSS components. In addition, since a CSS is stored as an image, its failure rate is considered to be 0. However, once a CSS component is duplicated and deployed, it assumes the failure rate of its corresponding role, either as a running primary component (P') or as an HSS (H'), as will be shown in Figure 5.2.3. Note that each of the servers is deployed in a different availability zone for fault-tolerance purposes [33], as shown in Figure 5.2.1. As a clarification for the reliability analysis in all four case studies, we view a VM with its OS, the server software and the deployed services as a single software component. In addition, we assume the proxy server's reliability is ideal. Furthermore, we assume that the proxy server and the application server can monitor and detect failures of the application server and the database server, respectively.

Figure 5.2.1 A cloud-based system with 2 servers and their HSSs

The two HSSs are ready to take over the workload if the primary ones fail. The case study shows the reliability analysis applicable to an SSP gate with one HSS for each primary component. We set the reliability threshold to 0.99 as a minimum requirement for system reliability. For this case study, we assume typical constant failure rates for the servers, where λPA = 0.004/day, λHA = 0.0025/day, λPB = 0.005/day, and λHB = 0.003/day. Note that the failure rates of the hot spare parts are lower than those of their corresponding primary ones because the spare parts do not take any workload while the primary ones are functioning. Yet, when a primary server fails, the failure rate of the substituting HSS increases since it assumes the primary component's workload, i.e., λHA* and λHB* are raised to λHA* = λPA = 0.004 and λHB* = λPB = 0.005. The DFT model of the cloud-based software system for Phase 1 is shown in Figure 5.2.2.

Figure 5.2.2 DFT model of the cloud-based system - Case Study 1 - Phase 1

Because the system fails when either the application server or the database server fails, the two SSP gates are connected by an OR-gate. The reliability function of the OR-gate can be derived as in Eq. (5.16).

R(t) 1UOR (t) 1 (US1(t)  (1US1(t))*US2 (t)) (5.16)

where US1(t) and US2(t) are the unreliability functions of the subtrees S1 and S2, respectively. According to Eq. (5.6), US1(t) and US2(t) can be calculated as in Eq. (5.17) and Eq. (5.18), respectively. Note that Eqs. (5.17-5.18) have been simplified due to the assumed configurations, where λHA* = λPA and λHB* = λPB.


U_{S1}(t) = 1 - R_{S1}(t) = 1 - (1 + \frac{\lambda_{PA}}{\lambda_{HA}}) e^{-\lambda_{PA} t} + \frac{\lambda_{PA}}{\lambda_{HA}} e^{-(\lambda_{PA}+\lambda_{HA})t}    (5.17)
U_{S2}(t) = 1 - R_{S2}(t) = 1 - (1 + \frac{\lambda_{PB}}{\lambda_{HB}}) e^{-\lambda_{PB} t} + \frac{\lambda_{PB}}{\lambda_{HB}} e^{-(\lambda_{PB}+\lambda_{HB})t}    (5.18)

In Phase 2, we consider both of the scenarios mentioned in Chapter 4, so their impacts on system reliability as well as their consequent rejuvenation schedules can be compared.

Figure 5.2.3 shows the DFT model of the cloud-based system in Phase 2 based on Scenario 1. For the same reason as in Phase 1, the system reliability can be calculated as in Eq. (5.19). According to Eq. (5.15), US3(t) and US4(t) can be calculated as in Eq. (5.20) and Eq. (5.21), respectively.

R(t) 1UOR (t) 1 (US3(t)  (1US3(t))*US4 (t)) (5.19)

US3(t) US1(t)*US1' (t) (5.20)

US4 (t) US2 (t)*US2' (t) (5.21)

Note that in Eqs. (5.20-5.21), US1(t), US1’(t), US2(t) and US2’(t) can be calculated in a similar way as in Eqs. (5.17-5.18).

Figure 5.2.3. DFT model of the cloud-based system in Case Study 1 - Phase 2 (Scenario 1)

The reliability analysis results for Scenario 1 are listed in Table 5.1. The table shows that the reliability threshold (0.99) is reached every 18 days based on the reliability analysis results in Phase 1. Hence, both application and database servers are rejuvenated at the end of Phase 1. Phase 2 has a 30-minute time duration; therefore, we calculate the system reliability at 5, 10, 20 and 30 minutes in Phase 2 to illustrate how system reliability may change during the rejuvenation process. From the table, we can see that the system reliability is kept very high during the transition. After the 30 minutes, the newly deployed servers (P’ and H’), originally CSSs, completely take over the system, and the old servers to be rejuvenated are shut down. When this happens, the system returns to its initial state, and starts a new life cycle with very high initial reliabilities. Therefore, Table 5.1 suggests that the system should be rejuvenated every 18 days in order to maintain the system reliability above the predefined threshold.

By further looking into Table 5.1, we can see that when the system reliability reaches 0.99 after 18 days, the reliability of the database server subsystem is lower than that of the application server subsystem. This suggests that we may rejuvenate the most critical components (i.e., the component or subsystem with the lowest reliability) first. In this case study, we choose to rejuvenate the database servers first. Then we wait until the system reliability reaches the threshold again, and rejuvenate the application servers, which have now become the most critical components. This is exactly what happens in the rejuvenation scheduling of Scenario 2, where the application servers and the database servers are rejuvenated alternately. Figure 5.2.4 shows the DFT model of the cloud-based system in Phase 2 for both cases in Scenario 2. In particular, part b) shows the case where the database servers are rejuvenated. In this case, the system reliability can be calculated as in Eq. (5.22), and US1(t) and US4(t) can be calculated in a similar way as in Eq. (5.6) and Eq. (5.15), respectively.

Figure 5.2.4. DFT model of the cloud-based system in Case Study 1 - Phase 2 (Scenario 2, both cases: a) application servers rejuvenation; b) database servers rejuvenation)

R(t) 1U (t) 1 (U (t)  (1U (t))*U (t)) OR S1 S1 S4 (5.22)

The Scenario 2 system reliability for application server rejuvenation can be calculated in a similar way.

Table 5.1. Case Study 1- System Reliability with Software Rejuvenation (Scenario 1)

Phase | Time (Days) | App Servers Reliability | DB Servers Reliability | System Reliability
1 | 0 | 1 | 1 | 1
1 | 1 | 0.99998705 | 0.9999801 | 0.9999671502577
1 | 5 | 0.9996806 | 0.9995107 | 0.9991914562824
1 | 10 | 0.998745 | 0.998085 | 0.996832403325
1 | 18 | 0.996044 | 0.994004 | 0.990071720176
2 | 18.0035 | 0.999999999999 | 0.999999999999 | 0.9999999999979
2 | 18.0069 | 0.999999999997 | 0.999999999994 | 0.9999999999917
2 | 18.0139 | 0.99999999999 | 0.999999999977 | 0.9999999999669
2 | 18.0208 | 0.999999999978 | 0.99999999994 | 0.9999999999177
... | ... | ... | ... | ...
1 | 90 | 0.996044 | 0.994004 | 0.990071720176
2 | 90.0035 | 0.999999999999 | 0.999999999999 | 0.9999999999979
2 | 90.0069 | 0.999999999997 | 0.999999999994 | 0.9999999999917
2 | 90.0139 | 0.99999999999 | 0.999999999977 | 0.9999999999669
2 | 90.0208 | 0.999999999978 | 0.99999999994 | 0.9999999999177
1 | 91 | 0.99998705 | 0.9999801 | 0.9999671502577
1 | 95 | 0.9996806 | 0.9995107 | 0.9991914562824
1 | 100 | 0.998745 | 0.998085 | 0.996832403325
1 | 108 | 0.996044 | 0.994004 | 0.990071720176
2 | 108.0035 | 0.999999999999 | 0.999999999999 | 0.9999999999979
2 | 108.0069 | 0.999999999997 | 0.999999999994 | 0.9999999999917
2 | 108.0139 | 0.99999999999 | 0.999999999977 | 0.9999999999669
2 | 108.0208 | 0.999999999978 | 0.99999999994 | 0.9999999999177
1 | 109 | 0.99998705 | 0.9999801 | 0.9999671502577
1 | 113 | 0.9996806 | 0.9995107 | 0.9991914562824
1 | 118 | 0.998745 | 0.998085 | 0.996832403325

Table 5.2 shows the reliability analysis results for Scenario 2. At the end of each Phase 1, the server subsystem with its reliability marked by “=>” is the one to be rejuvenated. For example, after 18 days, the database servers are rejuvenated, and after 27 days, the application servers are rejuvenated.

Table 5.2. Case Study 1- System Reliability with Software Rejuvenation (Scenario 2)

Phase | Time (Days) | App Servers Reliability | DB Servers Reliability | System Reliability
1 | 0 | 1 | 1 | 1
1 | 1 | 0.99998705 | 0.9999801 | 0.999967150258
1 | 5 | 0.9996806 | 0.9995107 | 0.999191456282
1 | 10 | 0.998745 | 0.998085 | 0.996832403325
1 | 18 | 0.996044 | => 0.994004 | 0.990071720176
2 | 18.0035 | 0.9960425 | 0.99999999999868 | 0.996042499999
2 | 18.0069 | 0.996041 | 0.99999999999421 | 0.996004099994
2 | 18.0139 | 0.996038 | 0.99999999997683 | 0.996037999977
2 | 18.0208 | 0.996035 | 0.99999999994002 | 0.996034999940
1 | 20 | 0.99515 | 0.99992069 | 0.995071074654
1 | 25 | 0.992552 | 0.9990492 | 0.991608281558
1 | 27 | => 0.9914 | 0.998442 | 0.990000000000
2 | 27.0035 | 0.99999999999917 | 0.998441 | 0.998440999999
2 | 27.0069 | 0.99999999999749 | 0.998439 | 0.998438999999
2 | 27.0139 | 0.99999999999006 | 0.998437 | 0.99843699999
2 | 27.0208 | 0.9999999999777 | 0.998435 | 0.998434999978
1 | 30 | 0.9998842 | 0.997265 | 0.997149516713
1 | 35 | 0.999109 | 0.994628 | 0.993741786452
1 | 39 | 0.998205 | => 0.991942 | 0.990161464110
2 | 39.0035 | 0.998204 | 0.99999999999822 | 0.998203999998
2 | 39.0069 | 0.998202 | 0.99999999999222 | 0.998201999992
2 | 39.0139 | 0.9982 | 0.99999999996887 | 0.998199999969
2 | 39.0208 | 0.998199 | 0.99999999992992 | 0.99819899993
... | ... | ... | ... | ...
1 | 85 | 0.998486 | 0.99992069 | 0.998406810075
1 | 90 | 0.996853 | 0.9990492 | 0.995905192168
1 | 95 | 0.994671 | 0.997265 | 0.991950574815
1 | 96 | 0.994172 | 0.996804 | 0.990994626288
1 | 97 | => 0.993652 | 0.99631 | 0.99
2 | 97.0035 | 0.99999999999868 | 0.9963 | 0.996802999998
2 | 97.0069 | 0.99999999999602 | 0.996298 | 0.996800999996
2 | 97.0139 | 0.99999999998406 | 0.996294 | 0.996797999984
2 | 97.0208 | 0.99999999996424 | 0.996291 | 0.996793999964
1 | 100 | 0.9998842 | 0.994628 | 0.994512822078
1 | 105 | 0.999109 | => 0.991194 | 0.990310846146
2 | 105.0035 | 0.9991902 | 0.99999999999806 | 0.998743999998
2 | 105.0069 | 0.9991895 | 0.99999999999216 | 0.998741999992
2 | 105.0139 | 0.9991882 | 0.99999999996844 | 0.998739999968
2 | 105.0208 | 0.9991868 | 0.99999999992343 | 0.998737999923
1 | 110 | 0.9979 | 0.9995107 | 0.997411727538
1 | 115 | 0.996044 | 0.998085 | 0.99413657574
1 | 119 | => 0.994172 | 0.99631 | 0.99050350532

The rejuvenation schedules for both Scenario 1 and Scenario 2 are illustrated in Figure 5.2.5. In the figure, the initiation of a rejuvenation is indicated by the sudden increase in system reliability. By comparing the two rejuvenation schedules, we can see that during 119 days, Scenario 1 has 6 rejuvenation processes, each of which requires rejuvenating both the application and database servers. On the other hand, Scenario 2 has 9 rejuvenation processes, each of which only requires rejuvenating either the application servers or the database servers.


Figure 5.2.5 Case study 1- Rejuvenation scheduling (Scenario 1 vs. Scenario 2)

It is easy to see that Scenario 2 requires less management of the servers in order to keep the system reliability above the 0.99 threshold at all times. Supposing that the rejuvenation of the application servers has the same cost as that of the database servers, by using the rejuvenation scheduling defined in Scenario 2, the cost can be reduced by (6*2 − 9)/(6*2) = 25% compared to the rejuvenation scheduling defined in Scenario 1.


Chapter 6

MODEL WITH CONSTANT FAILURE RATE AND TWO HOT SPARES

In this chapter, we show how to use DFT to model and analyze the reliability of a cloud-based system that is subject to software rejuvenation, and that employs two software hot spares (2-HSSs) to improve its fault tolerance. Similar to Chapter 5, we assume that the time-to-failure for each software component (i.e., a VM) has a probability density function (pdf) that is exponentially distributed; in other words, all software components have constant failure rates.

6.1 Modeling and Analysis

6.1.1 SSP gate of cloud-based systems with Two Hot Software Spares

This section considers an SSP gate with one primary component denoted as P and two hot spare components denoted as H1 and H2, as shown in Figure 6.1.1. The spare gate fails when the primary and all the alternate inputs have failed. Suppose the constant failure rates of components P, H1 and H2 are λP, λH1, and λH2, respectively. Following the same logic as in Section 5.1.1, H1 and H2 do not take any workload when P is functioning; their failure rates λHi are typically lower than λP. When P fails, H1 takes over P's workload first and behaves as a primary component. H1 now has a higher constant failure rate λH1* than λH1 due to the software aging phenomenon under H1's full workload. The same rule applies between H1* and H2, where H2 takes over the workload upon H1*'s failure and becomes H2*. For this reason, we call the spare component Hi, after its role transition, Hi*. Note that λHi* and λP do not have to be equal because P and Hi may have different configurations. In addition, we designate τ1, τ2, and τ3 as the times to failure of P, H1 and H2, respectively.

Figure 6.1.1 An SSP gate with a primary component P and two HSSs H1 and H2

Similar to the previous chapter, to compute the reliability function of the SSP gate with two hot spares, we identify the failure scenarios in which the SSP gate fails, based on the components' failure sequence. We denote the event "component X fails before component Y" as X ≺ Y, and summarize six disjoint events (or cases) ei, where 1 ≤ i ≤ 6, as shown in Figure 6.1.2.

Figure 6.1.2 Six cases for the failure of an SSP gate with two HSSs


Let event A be the failure of an SSP gate at time t. We can calculate the probability of event A as in Eq. (6.1):

\Pr(A) = \sum_{i=1}^{6} \Pr(A \mid e_i) \Pr(e_i) = \sum_{i=1}^{6} \Pr(A \cap e_i)    (6.1)

It is worth noting that when event ei happens, the SSP gate also fails. Therefore, event A always happens with some event ei. Thus, Eq. (6.1) can be simplified as in Eq. (6.2).

\Pr(A) = \sum_{i=1}^{6} \Pr(e_i)    (6.2)

We now derive the reliability function R(t) of the SSP gate by considering the above six cases.

Case 1: Event e1, where P fails before H1, and H1 fails before H2, denoted as P≺H1≺H2. In this case, it is guaranteed that H1 does not fail during (0, τ1] and that H2 does not fail during (0, τ2]. After P fails, H1 takes over the workload and becomes H1*; likewise, after H1* fails, H2 takes over the workload and becomes H2*. Intuitively, the distribution function F(t) of the SSP gate, i.e., the probability that the SSP gate fails during (0, t], can be calculated as in Eq. (6.3).

t t t λ τ λ τ λ τ F (t)  Pr (T  t)  (λ e P 1 )(λ e H 1* 2 )(λ e H 2* 3 )dτ dτ dτ PH1H 2 P  H 1 H 2    P H1* H 2* 3 2 1 (6.3) 0 τ1 τ2

However, similar to Chapter 5, Eq. (6.3) works only when λHi* = λHi, i.e., when the constant failure rate of each spare does not change after it switches its role from a spare component to a primary one. When λH1* > λH1, the integration of the probability density function (pdf) of H1* from τ1 to t does not give the correct unreliability of the component at time t, as it incorrectly assumes that component H1 behaves as H1* starting from time 0. Since the component actually behaves as H1 during (0, τ1], the unreliability of H1* at time τ1 equals the unreliability of H1 at τ1 rather than the unreliability calculated by the integration of the pdf of H1* from 0 to τ1. This ensures the continuity of H1's unreliability before and after it serves as the primary component H1*. By calculating a new starting integration time τH1* for H1*, we take into consideration that τ2, originally the failure time of component H1, is shifted to the left by (τ1 − τH1*). As a result, when we consider the failure of H1*, we must add (τ1 − τH1*) to τ2, since H2* is activated based on the original non-shifted failure time variable τ2 of H1. Therefore, the value of τ2 after the adjustment is given as τ2|actual = τ2|shifted + (τ1 − τH1*) = τ2 + (τ1 − τH1*), as shown in Figure 6.1.3.a. As a rule of thumb, in the case of P≺H1≺H2≺…≺Hn, where n > 1 (τ1 does not get shifted, since it is the failure time of P, and P always acts as a primary component), when a component Hi* acts as a primary one, its actual time to failure equals τi+1 + (τi|actual − τHi*). This observation and adjustment is critical for yielding the correct reliability function.

Figure 6.1.3.a Failure times of H1 and H2 for the reliability analysis in Case 1

The first spare H1 has been covered in the previous chapter as the 1-HSS case, where τH1* was determined from Eq. (5.2) as \tau_{H_1^*} = \frac{\lambda_{H_1}}{\lambda_{H_1^*}} \tau_1. With regard to the second spare H2, it is guaranteed that H2 does not fail during (0, τ2]. After H1* fails, H2 takes over the workload and becomes H2*. Since the component actually behaves as H2 during (0, τ2], the unreliability of H2* at time τ2 equals the unreliability of H2 at τ2 rather than the unreliability calculated by the integration of the pdf of H2* from 0 to τ2. This requires us to calculate a new starting integration time τH2* for H2* such that the unreliability of H2* at τH2* is equal to the unreliability of H2 at τ2. As the pdfs of H2 and H2* are f_{H_2}(\tau_3) = \lambda_{H_2} e^{-\lambda_{H_2} \tau_3} and f_{H_2^*}(\tau_3) = \lambda_{H_2^*} e^{-\lambda_{H_2^*} \tau_3}, respectively, such a relationship between H2 and H2* can be described as in Eq. (6.4), taking into account the adjustment of τ2, the time to failure of H1*.

\int_0^{\tau_{H_2^*}} \lambda_{H_2^*} e^{-\lambda_{H_2^*} \tau_2} \, d\tau_2 = \int_0^{\tau_2 + (\tau_1 - \tau_{H_1^*})} \lambda_{H_2} e^{-\lambda_{H_2} \tau_2} \, d\tau_2    (6.4)

Solving Eq. (6.4), we have \tau_{H_2^*} = \frac{\lambda_{H_2}}{\lambda_{H_2^*}}(\tau_2 + (\tau_1 - \tau_{H_1^*})). Since H2* fails during a period of time (t − τ2|actual), the integration range for H2* now becomes [τ_{H2*}, t − (τ2 + (τ1 − τH1*)) + τ_{H2*}]. Based on the above analysis, the probability that P fails before H1, and H1 fails before H2, denoted as P≺H1≺H2, can be calculated as in Eq. (6.5) [37].

λ H (tτ  λ H τ ) (t[τ τ τ ]( 2 [τ τ τ ])) 1 1 2 1 H1* λ 2 1 H1* t λH * H2 * λ pτ1 λH1* τ2 λH2* τ3 Pr (T  t)     (λ pe )(λ H1* e )(λ H2* e )dτ3dτ2 dτ1 (6.5) PH1H2 0 λ H1 λ H 2 τ [τ2 τ1 τH ] λ 1 λ 1* H1* H2 *

To simplify the integration range in Eq. (6.5), we can substitute w(τ2) = τ2 + τ1 − τ_{H1*} for variable τ2, which simplifies it to Eq. (6.6).


twτ t t H2* λ pτ1 λH1* [wτ1 τH 1* ] λH2*τ3 Pr (T  t)     (λ pe )(λ H1*e )(λ H2*e )dτ3dwdτ1 (6.6) PH 1H 2 0 τ λ 1 H 2 [w] λ H2 *

Applying a second substitution in Eq. (6.6), z(τ3) = τ3 + w − τ_{H2*} for variable τ3, where dz/dτ3 = 1 and dz = dτ3, yields Eq. (6.7).

t t t p1 H1* [ w1 H 1* ] H2* [ zw H 2* ] Pr (T  t )    ( pe )( H1*e )( H2*e )dzdwd 1 (6.5) PH 1H 2 0 1  2

Finally, w and z can be replaced back by variables τ2 and τ3, respectively, resulting in the final simplified Eq. (6.8).

t t t p1 H1* [ 2 1 H 1* ] H2* [3 2  H 2* ] Pr (T  t )    ( pe )( H1*e )( H2*e )d 3d 2d 1 (6.8) PH 1H 2 0 1  2

Eq. (6.8) is the simplest final form resulting from the formal construction of the probability function based on the formal qualitative analysis. In fact, we can now construct this final form directly, by formally stating that we need to shift the probability density function of each spare Hn* to the right by (τn − τHn*) and integrate from τn to t.
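For constant failure rates, Eq. (6.8) can also be evaluated numerically with nested quadrature. The sketch below is an illustration only; it assumes scipy and uses the failure rates adopted in Section 6.1.2. The shift terms in the exponents fold each spare's idle-time survival into the integrand.

```python
import math
from scipy.integrate import tplquad

lp, lh1, lh2 = 0.005, 0.003, 0.003     # assumed rates (Section 6.1.2)
lh1s, lh2s = 0.005, 0.005              # rates after taking over as primary
t = 1000.0

def integrand(tau3, tau2, tau1):
    # Eq. (6.8): each spare's pdf is shifted right by (tau_i - tau_Hi*),
    # with tau_H1* = (lh1/lh1s)*tau1 and tau_H2* = (lh2/lh2s)*tau2.
    return (lp * math.exp(-lp * tau1)
            * lh1s * math.exp(-lh1s * (tau2 - tau1 + (lh1 / lh1s) * tau1))
            * lh2s * math.exp(-lh2s * (tau3 - tau2 + (lh2 / lh2s) * tau2)))

# tau1 in (0, t), tau2 in (tau1, t), tau3 in (tau2, t):
pr_case1, err = tplquad(integrand, 0, t,
                        lambda tau1: tau1, lambda tau1: t,
                        lambda tau1, tau2: tau2, lambda tau1, tau2: t)
print(pr_case1)   # the Case 1 contribution to U(t) in Eq. (6.14)
```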

It is important to generalize Case 1 so that its reliability analysis covers any number of spares, depending on the desired reliability requirement for a specific system. Figure 6.1.3.b shows a broader, more general view of Figure 6.1.3.a that helps in understanding the reliability analysis for Case 1 with n spares, with the respective form P≺H1≺H2≺…≺Hn.


Figure 6.1.3.b A general view of Figure 6.1.3.a - Case 1 reliability analysis with n spares

We start with Eq. (6.8.a), which leads to generating τHn*, taking into consideration the shifting that produces the actual time to failure \tau_{n|actual} = \tau_n + (\tau_{(n-1)|actual} - \tau_{H_{(n-1)}^*}). We also use the general form of the failure rate h(t), as our approach can be applied to any probability distribution.

h ( ) H1 2  H *  h ( ) 1 1 H1* 2 h ( ) h ( ) H2 3 H2 3  H *  h ( ) ( 2  (1  H * )  h ( ) ( 2 ) 2 H2* 3 1 H2* 3 |actual h ( ) h ( ) H3 4 H3 4  H *  h ( ) [ 3  ( 2|actual  H * )]  h ( ) ( 3 ) 3 H3* 4 2 H3* 4 |actual h ( ) h ( ) (6.8.a) H4 5 H4 5  H *  h ( ) [ 4  ( 3|actual  H * )]  h ( ) ( 4 ) 4 H4* 5 3 H4* 5 |actual  h ( ) h ( ) Hn ( n1) Hn ( n1)  H *  h ( ) [ n  ( (n1)|actual  H * )]  h ( ) ( n ) n Hn* ( n1) ( n1) Hn* ( n1) |actual

Now that we have defined τ_{n|actual} and τ_{Hn*}, the probability that P fails before H1, H1 before H2, H2 before H3, …, and Hn−1 before Hn, denoted as P≺H1≺H2≺…≺Hn−1≺Hn, can be calculated similarly to Eq. (6.5), as in Eq. (6.8.b).

(t(  ) (t(  )) (t(  )) t 1 H1* 2|actual H2 * n|actual Hn * n (6.8.b) (T  t)  ... f ( ). f ( )d ...d d Pr     p 1  Hk (k1) n 2 1 PH1H2 ...Hn 0    k1 H1* H2 * Hn *

We need to keep in mind that if we use n spares in the rejuvenation design, there are n! orderings in which the system may fail. We have just seen how to obtain the reliability analysis of the unique case P≺H1≺H2≺…≺Hn−1≺Hn, shown in Eq. (6.8.b). In future work, a research paper will shed light on how to resolve the other cases.

Case 2: Event e2, where P≺H2≺H1. This is where P fails first, and then H2 fails as a spare before H1* fails. The failure of H2 is independent of H1*, and the failure of H1* depends on P's failure but not on H2's failure. The integration for H1* requires obtaining τH1*, which is based on τ1; we also need to move the lower integration limit from τH1* to [τH1* + (τ3 − τ1)], resulting in Eq. (6.9) [37].

t(τ τ ) t t 1 H1* λ pτ1 λH1*τ2 λH2τ3 Pr (T  t)     (λ pe )(λ H1*e )(λ H2 e )dτ2dτ3dτ1 (6.9) PH 2H 1 0 τ λ 1 H1 τ (τ τ ) λ 1 3 1 H1*

Case 3: Event e3, where the components fail in the sequence H2, P, and H1, denoted as H2≺P≺H1. Note that this case is a simple one, similar to the 1-HSS case in Eq. (5.4). The probability that the SSP gate fails can be calculated as in Eq. (6.10).

t(τ τ ) t t 1 H 1* λ τ λ τ λ τ (T  t)  (λ e p 1 )(λ e H1* 2 )(λ e H2 3 )dτ dτ dτ Pr    p H1* H2 2 1 3 (6.10) H 2PH 1 0 τ λ 3 H1 τ λ 1 H1*

Case 4: Event e4, where the components fail in the sequence H1, H2, and P, denoted as H1≺H2≺P. In this case, it is impossible for P to fail during (0, τ3]. The probability that the SSP gate fails during (0, t] can be calculated as in Eq. (6.11).

t t t λ p τ1 λ H1τ2 λ H2τ3 Pr (T  t)    (λ pe )(λH1e )(λH2 e )dτ2dτ3dτ1 (6.11) H 1H 2P 0 τ2 τ3


Case 5: Event e5, where H1≺P≺H2. Similar to Case 3, this is where H1 fails first as a spare, then P, and then H2 as H2*. Note that the complexity is similar to the one-spare SSP gate case P≺H. The probability that the SSP gate fails is calculated as in Eq. (6.12).

t(τ τ ) t t 2 H2* λ pτ1 λH1τ2 λH2*τ3 Pr (T  t)     (λ pe )(λH1e )(λH2*e )dτ3dτ1dτ2 (6.12) H 1PH 2 0 τ 2 λH2 τ2 λH 2*

Case 6: Event e6, where H2≺H1≺P. Similar to Case 4, H2 and H1 fail as spares before P fails; as in the one-spare SSP gate, it is guaranteed that P does not fail during (0, τ2]. The probability that the SSP gate fails during (0, t] can be calculated as in Eq. (6.13).

t t t λ p τ1 λ H1τ 2 λ H2τ3 Pr (T  t)    (λ pe )(λH1e )(λH2 e )dτ3dτ2dτ1 (6.13) H 1H 2P 0 τ3 τ 2

Hence, the unreliability of the SSP gate with two spares is given in Eq. (6.14).

U (t) = Eq. (6.8) + Eq. (6.9) + Eq. (6.10) + Eq. (6.11) + Eq. (6.12) + Eq. (6.13) (6.14)
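As an independent sanity check of Eq. (6.14), the gate can also be simulated. The sketch below is an illustration only: it exploits the memorylessness of constant rates to sample component failures state by state, with the next available spare taking over the primary role at its elevated rate, and it should reproduce the U(1000) ≈ 0.969 value reported in Section 6.1.2 up to sampling noise.

```python
import random

LAM = {'P': 0.005, 'H1': 0.003, 'H2': 0.003}   # assumed rates (Section 6.1.2)
LAM_STAR = {'H1': 0.005, 'H2': 0.005}          # rates after taking over as primary

def sample_gate_failure(rng):
    """One sampled failure time of the SSP gate with two HSSs."""
    alive = dict(LAM)    # component -> current failure rate
    primary = 'P'
    t = 0.0
    while alive:
        total = sum(alive.values())
        t += rng.expovariate(total)            # time to the next failure
        u, acc, failed = rng.random() * total, 0.0, None
        for name, rate in alive.items():       # pick who failed, proportionally
            acc += rate
            if u <= acc:
                failed = name
                break
        del alive[failed]
        if failed == primary and alive:
            primary = next(iter(alive))        # H1 takes over before H2
            alive[primary] = LAM_STAR[primary] # aging under the full workload
    return t   # the gate fails once P and both spares have failed

rng, n = random.Random(42), 200_000
u_mc = sum(sample_gate_failure(rng) <= 1000.0 for _ in range(n)) / n
print(u_mc)   # should approach U(1000) ~ 0.969 from Eq. (6.14)
```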

6.1.2 Verification Using CTMC

In this case, we also solve the state equations of the CTMC model. We assume constant failure rates λP = 0.005, λH1 = λH2 = 0.003, and λH1* = λH2* = 0.005, with time variable t = 1000. We will show that the CTMC gives the same result as our proposed approach.


Figure 6.1.4 The CTMC model of the SSP gate with two hot spares

Following the same procedure as in the verification in Chapter 5, we start by constructing the incremental one-step transition matrix, which describes the state transition process completely.

[P_{ij}(dt)] = \begin{bmatrix}
1-(\lambda_P+\lambda_{H_1}+\lambda_{H_2})dt & \lambda_P dt & \lambda_{H_2} dt & \lambda_{H_1} dt & 0 & 0 & 0 & 0 \\
0 & 1-(\lambda_{H_1^*}+\lambda_{H_2})dt & 0 & 0 & \lambda_{H_2} dt & \lambda_{H_1^*} dt & 0 & 0 \\
0 & 0 & 1-(\lambda_P+\lambda_{H_1})dt & 0 & \lambda_P dt & 0 & \lambda_{H_1} dt & 0 \\
0 & 0 & 0 & 1-(\lambda_P+\lambda_{H_2})dt & 0 & \lambda_P dt & \lambda_{H_2} dt & 0 \\
0 & 0 & 0 & 0 & 1-\lambda_{H_1^*} dt & 0 & 0 & \lambda_{H_1^*} dt \\
0 & 0 & 0 & 0 & 0 & 1-\lambda_{H_2^*} dt & 0 & \lambda_{H_2^*} dt \\
0 & 0 & 0 & 0 & 0 & 0 & 1-\lambda_P dt & \lambda_P dt \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}    (6.15)

Based on the matrix defined in Eq. (6.15), we can derive the following relations, as shown in Section 5.1.2, resulting in Eqs. (6.16.1-6.16.7).

P_1(t+dt) = (1-(\lambda_P+\lambda_{H_1}+\lambda_{H_2})dt) P_1(t)    (6.16.1)
P_2(t+dt) = (\lambda_P dt) P_1(t) + (1-(\lambda_{H_1^*}+\lambda_{H_2})dt) P_2(t)    (6.16.2)
P_3(t+dt) = (\lambda_{H_2} dt) P_1(t) + (1-(\lambda_P+\lambda_{H_1})dt) P_3(t)    (6.16.3)
P_4(t+dt) = (\lambda_{H_1} dt) P_1(t) + (1-(\lambda_P+\lambda_{H_2})dt) P_4(t)    (6.16.4)
P_5(t+dt) = (\lambda_{H_2} dt) P_2(t) + (\lambda_P dt) P_3(t) + (1-\lambda_{H_1^*} dt) P_5(t)    (6.16.5)
P_6(t+dt) = (\lambda_{H_1^*} dt) P_2(t) + (\lambda_P dt) P_4(t) + (1-\lambda_{H_2^*} dt) P_6(t)    (6.16.6)
P_7(t+dt) = (\lambda_{H_1} dt) P_3(t) + (\lambda_{H_2} dt) P_4(t) + (1-\lambda_P dt) P_7(t)    (6.16.7)

Then we rewrite Eqs. (6.16.1-6.16.7) in the form of the CTMC state equations, Eqs. (6.17.1-6.17.7), as follows:

P(t  dt)  P(t) 1 1  P'(t)  (     )P(t) (6.17.1) dt 1 P H1 H 2 1

P (t  dt)  P (t) 2 2  P '(t)   P(t)  (   )P (t) (6.17.2) dt 2 P 1 H1* H2 2

P (t  dt)  P (t) 3 3  P '(t)   P(t)  (   )P (t) (6.17.3) dt 3 H 2 1 P H1 3

P (t  dt)  P (t) 4 4  P '(t)   P(t)  (   )P (t) (6.17.4) dt 4 H1 1 P H2 4

P (t  dt)  P (t) 5 5  P '(t)   P (t)   P (t)   P (t) (6.17.5) dt 5 H 2 2 P 3 H1* 5

P (t  dt)  P (t) 6 6  P '(t)   P (t)   P (t)   .P (t) (6.17.6) dt 6 H1* 2 P 4 H2* 6

P (t  dt)  P (t) 7 7  P '(t)   P (t)   P (t)   P (t) (6.17.7) dt 7 H1 3 H 2 4 P 7


The state equations (6.17.1-6.17.7) can be solved using the Laplace transform, as derived in Section 5.1.2. By applying the Laplace transforms from Eqs. (5.12.1-5.12.2) to both sides of Eqs. (6.17.1-6.17.7), we can derive Eqs. (6.18.1-6.18.7).

sF_1(s) - P_1(0) = -(\lambda_P+\lambda_{H_1}+\lambda_{H_2}) F_1(s)    (6.18.1)
sF_2(s) - P_2(0) = \lambda_P F_1(s) - (\lambda_{H_1^*}+\lambda_{H_2}) F_2(s)    (6.18.2)
sF_3(s) - P_3(0) = \lambda_{H_2} F_1(s) - (\lambda_P+\lambda_{H_1}) F_3(s)    (6.18.3)
sF_4(s) - P_4(0) = \lambda_{H_1} F_1(s) - (\lambda_P+\lambda_{H_2}) F_4(s)    (6.18.4)
sF_5(s) - P_5(0) = \lambda_{H_2} F_2(s) + \lambda_P F_3(s) - \lambda_{H_1^*} F_5(s)    (6.18.5)
sF_6(s) - P_6(0) = \lambda_{H_1^*} F_2(s) + \lambda_P F_4(s) - \lambda_{H_2^*} F_6(s)    (6.18.6)
sF_7(s) - P_7(0) = \lambda_{H_1} F_3(s) + \lambda_{H_2} F_4(s) - \lambda_P F_7(s)    (6.18.7)

Substituting the initial probabilities Pi(0), where 1 ≤ i ≤ 7, into Eqs. (6.18.1-6.18.7), we can solve for F1(s) through F7(s); by applying the inverse Laplace transformation, we can then solve the original linear first-order differential equations in Eqs. (6.17.1-6.17.7).

1 (   )t P H1 H 2 P1(s)   P1(t)  e (s  P  H1  H 2 )

P  P1(s) P P2 (s)   (s  H1*  H 2 ) (s  P  H1  H 2 )(s  H1*  H 2 )

P (H 1* H 2 )t (P H 1 H 2 )t  P2 (t)  (e  e ) P  H1  H1*

59

  P(s)  P (s)  H 2 1  H 2 3 (s  P  H1) (s  P  H1  H 2 )(s  P  H1)

(P H1 )t (P H 1 H 2 )t  P3 (t)  e  e

  P (s)  P (s)  H1 4  H1 4 (s  P  H 2 ) (s  P  H1  H 2 )(s  P  H 2 )

(P H 2 )t (P H 1 H 2 )t  P4 (t)  e  e

H 2  P2 (s) P  P3 (s) P.H 2 H 2.P P5 (s)     (s  H1* ) (s  H1* ) (s  P  H1  H 2 )(s  H1*  H 2 )(s  H1* ) (s  P  H1  H 2 )(s  P  H1)(s  H1* )

P (H 1* )t  P (H 1*H 2 )t (P H 1)t P (P  H1  H 2  H1* ) (P H 1H 2 )t  P5 (t)  [ (e ][ (e  e )][ e ] P  H1  H1* P  H1  H1* (P  H1  H 2  H1*)(P  H1  H1* )

H1  P3 (s) H 2  P4 (s) H1 H 2 H1 H 2 P7 (s)     (s  P ) (s  P ) (s  P  H1  H 2 )(s  P  H1)(s  P ) (s  P  H1  H 2 )(s  P  H 2 )(s  P )

(P )t (P H 1)t (P H 2 )t (P H 1H 2 )t  P7 (t)  e e  e  e

H1*  P2 (s) P  P4 (s) P H1* H1 P P6 (s)     (s  H 2* ) (s  H 2* ) (s  P  H1  H 2 )(s  H1*  H 2 )(s  H 2* ) (s  P  H1  H 2 )(s  P  H 2 )(s  H 2* ) (  ) (H 2* )t (  ) (H 1* H 2 )t (  ) (P H 1 H 2 )t P H1* e P H1* e P H1* e  P6 (t)  [   ] (P  H1  H 2  H 2* )(H1*  H 2  H 2* ) (P  H1  H1* )(H1*  H 2  H 2* ) (P  H1  H 2  H 2* )(P  H1  H1* ) (  ) (H 2* )t (  ) ( p H 2 )t (  ) (P H 1 H 2 )t [ P H1 e  P H1 e  P H1 e ] (P  H1  H 2  H 2* )(P  H 2  H 2* ) (P  H 2  H 2* )(H1 ) (P  H1  H 2  H 2* )(H1 )

(  )(     )  (  )(     ) ( )t   (  )t  [ P H1* P H 2 H 2* P H1 H 2 H1* H 2* e H 2* ][ P H1* (e H 2 H 1* ] (P  H1  H 2  H 2* )(H 2  H1*  H 2* )(P  H 2  H 2* ) (P  H1  H1* )(H 2*  H 2  H1* )

 (  )t  (   ) (   )t [ P (e P H 2 ][ P P H1 e P H 1 H 2 ]  P  H 2  H 2* (P  H1  H 2  H 2* )(P  H1  H1* )

The reliability function R(t) is the sum of the probabilities of the system being in a non-failure state, namely P1(t), P2(t), P3(t), P4(t), P5(t), P6(t) and P7(t), which can be calculated as in Eq. (6.19),


R(t)  P1(t) P2(t)  P3(t)  P4(t)  P5(t)  P6(t)  P7 (t) (6.19)

Picking the time value t = 1000 leads to U(1000) = 1 − R(1000) ≈ 0.969 in Eq. (6.19). On the other hand, the unreliability function in Eq. (6.14) also gives U(1000) ≈ 0.969, thereby verifying the proposed approach against the CTMC for constant failure rates.

We also compute the system reliabilities using the reliability function R(t) from both the proposed approach, Eq. (6.14), and the CTMC approach, Eq. (6.19), in Table 6.1. The results match perfectly.
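The CTMC side of this comparison can be reproduced by integrating the state equations (6.17.1-6.17.7) numerically, instead of carrying out the Laplace inversion by hand. The following minimal sketch is an illustration only (it assumes scipy) and regenerates the R(t) column of Table 6.1.

```python
from scipy.integrate import solve_ivp

lp, lh1, lh2 = 0.005, 0.003, 0.003     # rates from this section
lh1s, lh2s = 0.005, 0.005

def rhs(t, P):
    # State equations (6.17.1)-(6.17.7) of the seven operational states.
    p1, p2, p3, p4, p5, p6, p7 = P
    return [-(lp + lh1 + lh2) * p1,
            lp * p1 - (lh1s + lh2) * p2,
            lh2 * p1 - (lp + lh1) * p3,
            lh1 * p1 - (lp + lh2) * p4,
            lh2 * p2 + lp * p3 - lh1s * p5,
            lh1s * p2 + lp * p4 - lh2s * p6,
            lh1 * p3 + lh2 * p4 - lp * p7]

y0 = [1, 0, 0, 0, 0, 0, 0]             # the system starts in state 1 (PH1H2)
for t_end in (90, 180, 300, 1000):
    sol = solve_ivp(rhs, (0, t_end), y0, rtol=1e-10, atol=1e-12)
    print(t_end, round(sol.y[:, -1].sum(), 4))   # R(t), cf. Table 6.1
```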

Table 6.1. R(t) Analysis Results - Proposed Method vs. CTMC

Time (days) | R(t) - proposed method | R(t) - CTMC
90 | 0.9815 | 0.9815
180 | 0.9019 | 0.9019
300 | 0.7299 | 0.7299
1000 | 0.031 | 0.031

6.2 Case Study 2: Constant Failure Rate with Two Hot Software Spares

In this case study, we show how to model and analyze the reliability of a cloud-based system with two HSSs for each critical component using the extended DFT, and then estimate rejuvenation schedules based on the quantitative reliability analysis generated by the proposed approach in Section 6.1.1. Figure 6.2.1 shows a cloud-based system that consists of an application server PA and a database server PB. To enhance the system reliability, four hot spare components are deployed: HA1 and HA2 are set up for PA, and HB1 and HB2 are set up for PB. The four HSSs are ready to take over the workload if the primary ones fail. The case study shows the reliability analysis applicable to an SSP gate with two HSSs for each primary component. We set the reliability threshold to 0.99 as a minimum requirement for system reliability.


For this case study, we assume constant failure rates for the servers, where λPA = 0.004/day, λHA1 = λHA2 = 0.0025/day, λPB = 0.005/day, and λHB1 = λHB2 = 0.003/day, using the same failure rates as in Chapter 5 so that the results can be readily compared.

Figure 6.2.1: Case Study 2 - A cloud-based system with software spares

As stated earlier, the failure rates of the HSS servers are lower than those of their corresponding primary ones because HSSs are not subject to the same workload; thus they have no software aging issues and are less likely to fail. Yet, when a primary server fails, the failure rate of the substituting HSS increases since it assumes the primary component's workload, i.e., λPA = λHA1* = λHA2* = 0.004 and λPB = λHB1* = λHB2* = 0.005.

Figure 6.2.2. DFT model of the cloud-based system: a) Phase 1 and b) Phase 2 - Scenario 1

The case study also involves CSS components, namely CSA and CSB, which are used in the rejuvenation process. Note that a CSS is a stored image of a deployed VM instance that can be easily duplicated, thus only one CSS is needed for each of the primary and HSS components. In addition, since a CSS is stored as an image, its failure rate is considered to be 0. However, once a CSS component is duplicated and deployed, it will assume the failure rate of its corresponding role, either as a running primary component or as an HSS.

The DFT model of the cloud-based software system for Phase 1 is shown in Figure 6.2.2 a).

The same reliability analysis from Case Study 1 applies to Case Study 2. Because the system fails when either the application server or the database server fails, the two SSP gates are connected by an OR-gate. The reliability function of the OR-gate can be derived as in Case Study 1, Eq. (5.16).

R(t) 1UOR (t) 1 (US1(t)  (1US1(t))*US2 (t)) (5.16)

US1(t) and US2(t) are the unreliability functions of the subtrees S1 and S2, respectively. According to Eq. (6.14), detailed in the previous section, each of them is calculated as the sum of the six case probabilities in Eqs. (6.8)-(6.13).

In Phase 2, we also consider both of the scenarios mentioned at the end of Chapter 4, for comparison purposes between Case Study 1 and Case Study 2. Figure 6.2.2 b) shows the DFT model of the cloud-based system in Phase 2 based on Scenario 1. The same reasoning applies as in Phase 1: the system reliability can be calculated as in Case Study 1, Eq. (5.19). According to Eq. (5.15), US3(t) and US4(t) can be calculated as in Eq. (5.20) and Eq. (5.21), similar to Case Study 1, respectively.

R(t) 1UOR (t) 1 (US3(t)  (1US3(t))*US4 (t)) (5.19)

US3(t) US1(t)*US1' (t) (5.20)

US4 (t) US2 (t)*US2' (t) (5.21)

The reliability analysis results for Scenario 1 are listed in Table 6.2. The table shows that the reliability threshold (0.99) is reached every 48 days based on the reliability analysis results in Phase 1. Hence, both application and database servers are rejuvenated at the end of Phase 1.

Phase 2 has a 30-minute time duration for all case studies; therefore, the system reliability is calculated at 5, 10, 20 and 30 minutes in Phase 2 to illustrate how system reliability may change during the rejuvenation process. From the table, we can see that the system reliability is kept very high during the transition. After the 30 minutes, the newly deployed servers completely take over the system, and the servers to be rejuvenated are shut down. When this happens, the system returns to its initial state and starts a new life cycle with very high initial reliabilities. Therefore, Table 6.2 suggests that the system should be rejuvenated every 48 days in order to maintain the system reliability above the predefined threshold.

By further looking into Table 6.2, we can see that when the system reliability reaches 0.99 after 48 days, the reliability of the database server subsystem is lower than that of the application server subsystem. This suggests that we may rejuvenate the most critical components (i.e., the component or subsystem with the lowest reliability) first. In this case study, we choose to rejuvenate the database servers first. Then we wait until the system reliability reaches the threshold again, and rejuvenate the application servers, which have now become the most critical components. This is exactly what happens in the rejuvenation scheduling of Scenario 2, where the application servers and the database servers are rejuvenated alternately. Figure 6.2.3 shows the DFT model of the cloud-based system in Phase 2 for both cases in Scenario 2. In particular, part a) shows the case where the application servers are rejuvenated. In this case, similar to Case Study 1, the system reliability can be calculated as shown below, following Eq. (5.16), where US3(t) and US2(t) can be calculated in a similar way as in Eq. (5.20) and Eq. (6.14), respectively.

R(t)  1UOR (t)  1 (U S3 (t)  (1U S3 (t)) *U S2 (t)) (5.16)

The Scenario 2 system reliability for database server rejuvenation can be calculated in a similar way.


Table 6.2. Case Study 2- System Reliability with Software Rejuvenation (Scenario 1)

Phase | Time (Days) | App. Servers Reliability (PA/HA1/HA2) | Db. Servers Reliability (PB/HB1/HB2) | System Reliability
1 | 0 | 1 | 1 | 1
1 | 1 | 0.99999996119 | 0.99999992711 | 0.9999998883
1 | 5 | 0.999995242 | 0.999991104 | 0.99998634604
1 | 10 | 0.99996249 | 0.99993093 | 0.99989342259
1 | 20 | 0.99976169 | 0.9994793 | 0.999241114088
1 | 30 | 0.9990894 | 0.998344 | 0.99743490795
1 | 48 | 0.996578 | 0.993896 | 0.990494887888
2 | 48.003472 | 0.999999999999998 | 0.999999999999997 | 0.999999999999995
2 | 48.006944 | 0.999999999999987 | 0.999999999999975 | 0.999999999999962
2 | 48.01389 | 0.999999999999896 | 0.999999999999803 | 0.999999999999699
2 | 48.020833 | 0.999999999999647 | 0.999999999999337 | 0.999999999998984
1 | 49 | 0.99999996119 | 0.99999992711 | 0.9999998883
1 | 53 | 0.999995242 | 0.999991104 | 0.99998634604
1 | 58 | 0.99996249 | 0.99993093 | 0.99989342259
1 | 68 | 0.99976169 | 0.9994793 | 0.999241114088
1 | 78 | 0.9990894 | 0.998344 | 0.99743490795
1 | 96 | 0.996578 | 0.993896 | 0.990494887888
2 | 96.003472 | 0.999999999999998 | 0.999999999999997 | 0.999999999999995
2 | 96.006944 | 0.999999999999987 | 0.999999999999975 | 0.999999999999962
2 | 96.01389 | 0.999999999999896 | 0.999999999999803 | 0.999999999999699
2 | 96.020833 | 0.999999999999647 | 0.999999999999337 | 0.999999999998984
1 | 97 | 0.99999996119 | 0.99999992711 | 0.9999998883
1 | 101 | 0.999995242 | 0.999991104 | 0.99998634604
1 | 106 | 0.99996249 | 0.99993093 | 0.99989342259
1 | 116 | 0.99976169 | 0.9994793 | 0.999241114088
1 | 126 | 0.9990894 | 0.998344 | 0.99743490795
1 | 144 | 0.996578 | 0.993896 | 0.990494887888


Figure 6.2.3 DFT model of the cloud-based system in Case Study 2 - Phase 2 (Scenario 2, both cases: a) application servers rejuvenation; b) database servers rejuvenation)

Table 6.3 shows the reliability analysis results for Scenario 2. At the end of each Phase 1, the server subsystem with its reliability marked by "=>" is the one to be rejuvenated. For example, after 48 days, the database servers are rejuvenated, and after 69 days, the application servers are rejuvenated.

Table 6.3. Case Study 2- System Reliability with Software Rejuvenation (Scenario 2)

Phase | Time (Days) | App. Servers Reliability (PA/HA1/HA2) | Db. Servers Reliability (PB/HB1/HB2) | System Reliability
1 | 0 | 1 | 1 | 1
1 | 1 | 0.99999996119 | 0.99999992711 | 0.9999998883
1 | 5 | 0.999995242 | 0.999991104 | 0.99998634604
1 | 10 | 0.99996249 | 0.99993093 | 0.99989342259
1 | 20 | 0.99976169 | 0.9994793 | 0.999241114088
1 | 30 | 0.9990894 | 0.998344 | 0.99743490795
1 | 48 | 0.996578 | => 0.993896 | 0.990494887888
2 | 48.003472 | 0.996577 | 0.999999999999999981685 | 0.9965779
2 | 48.006944 | 0.996576 | 0.999999999999999847325 | 0.9965759
2 | 48.01389 | 0.996575 | 0.999999999999998796527 | 0.9965749
2 | 48.020833 | 0.996574 | 0.999999999999995948407 | 0.9965739
1 | 50 | 0.996169 | 0.9999994203 | 0.9961686
1 | 55 | 0.995021 | 0.99997588 | 0.994997
1 | 60 | 0.993687 | 0.9998821 | 0.993569
1 | 65 | 0.992262 | 0.9996745 | 0.991939
1 | 69 | => 0.990799 | 0.9994008 | 0.99020531
2 | 69.003472 | 0.999999999999999984982 | 0.9994005 | 0.99940049
2 | 69.006944 | 0.999999999999999879809 | 0.9994003 | 0.99940029
2 | 69.01389 | 0.999999999999999037973 | 0.9993997 | 0.99939969
2 | 69.020833 | 0.999999999999996753259 | 0.9993991 | 0.999399
1 | 70 | 0.99999996119 | 0.9993152 | 0.99931516
1 | 75 | 0.999991819 | 0.998771 | 0.9987638
1 | 80 | 0.99995079 | 0.998013 | 0.9979638
1 | 85 | 0.9998522 | 0.997018 | 0.9968706
1 | 90 | 0.9996738 | 0.995765 | 0.99544018
1 | 100 | 0.999 | 0.9924 | 0.9914076
1 | 102 | 0.998805 | => 0.991608 | 0.990513
2 | 102.003472 | 0.998804 | 0.9999999999999999745 | 0.9988039
2 | 102.006944 | 0.998802 | 0.9999999999999997901 | 0.9988019
2 | 102.01389 | 0.9988 | 0.9999999999999983457 | 0.998799
2 | 102.020833 | 0.998798 | 0.9999999999999944315 | 0.9987979
1 | 105 | 0.998471 | 0.999998055 | 0.998470805797
1 | 110 | 0.997795 | 0.99996421 | 0.997759288917
1 | 120 | 0.995954 | 0.9996159 | 0.995571454069
1 | 134 | => 0.992162 | 0.998013 | 0.990190570000

The rejuvenation schedules for both Scenario 1 and Scenario 2 are illustrated in Figure 6.2.4. In the figure, the initiation of a rejuvenation is indicated by the sudden increase in system reliability. By comparing the two rejuvenation schedules, we can see that during 119 days, Scenario 1 has 2 rejuvenation processes, each of which requires rejuvenating both the application and database servers. On the other hand, Scenario 2 has 3 rejuvenation processes, each of which only requires rejuvenating either the application servers or the database servers.


Figure 6.2.4 Case study 2- Rejuvenation scheduling (Scenario 1 vs. Scenario 2)

It is easy to see that Scenario 2 requires less management of the servers in order to keep the system reliability above the 0.99 threshold at all times. Suppose the rejuvenation of the application servers has the same cost as that of the database servers; by using the rejuvenation scheduling defined in Scenario 2, the cost can be reduced by (2*2-3)/(2*2) = 25% compared to the rejuvenation scheduling defined in Scenario 1.
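To make the arithmetic above concrete, the following minimal sketch (assuming one cost unit per subsystem rejuvenation, an illustrative assumption) reproduces the 25% figure:

    # Worked check of the 25% cost reduction, assuming one cost unit per
    # subsystem rejuvenation (illustrative assumption, not thesis code).
    scenario1_cost = 2 * 2   # 2 full rejuvenations, each covering 2 subsystems
    scenario2_cost = 3       # 3 rejuvenations, each covering 1 subsystem
    print((scenario1_cost - scenario2_cost) / scenario1_cost)  # 0.25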

6.3 Case Studies 1 and 2 Comparison

In this section, the results from Case Study 1 and Case Study 2 are arranged to visualize the difference and the impact of employing 2-HSSs vs. 1-HSS in terms of reliability and rejuvenation scheduling in a cloud-based system.

Table 6.4 shows the reliability analysis results for the application server subsystem in both the 1-HSS and 2-HSSs cases. It is easy to see that the 2-HSSs case is more reliable than the 1-HSS case, since the system design employs two HSSs for each primary component, and thus it is more fault-tolerant.


Table 6.4. Application Server Reliability with both 1-HSS and 2-HSSs

Time (Days)  1-HSS App. Server R(t)  2-HSSs App. Server R(t)
0            1                       1
1            0.99998705              0.99999996119
5            0.9996806               0.999995242
10           0.998745                0.99996249
20           0.99515                 0.99976169
30           0.98945                 0.9990894
60           0.96194                 0.993687
90           0.92259                 0.9815
120          0.8754                  0.96185
180          0.769                   0.90192
240          0.6593                  0.8214
300          0.5555                  0.7299
365          0.4546                  0.6276

In a similar way to Table 6.4, Table 6.5 shows the reliability analysis results for the database server subsystem for both the 1-HSS and 2-HSSs cases.

Table 6.5. Database Server Reliability with both 1-HSS and 2-HSSs

Time (Days)  1-HSS Db. Server R(t)  2-HSSs Db. Server R(t)
0            1                      1
1            0.9999801              0.99999992711
5            0.9995107              0.999991104
10           0.998085               0.99993093
20           0.99266                0.9994793
30           0.98417                0.998344
60           0.94421                0.98888
90           0.8891                 0.96842
120          0.8253                 0.93681
180          0.6893                 0.8466
240          0.5588                 0.7352
300          0.4438                 0.6184
365          0.34                   0.4987

Again, the 2-HSSs case is more reliable than the 1-HSS case, since the system design employs two HSSs for each primary component and is thus more fault-tolerant.


Figure 6.3.1 illustrates the details of the difference between the two cases based on Scenario 1. Note that the 1-HSS results were previously provided in Chapter 5, Section 5.2. From the figure, we can see that the system reliability is kept very high during the transition. According to Figure 6.3.1, the reliability threshold for 2-HSSs is reached at 48 days; hence it is suggested that the system should be rejuvenated every 48 days under Scenario 1. On the other hand, we can also see that the system needs to be rejuvenated every 18 days with a single HSS. Comparing the rejuvenation scheduling based on reliability analysis for both cases over a period of 120 days, we notice that the system with 2-HSSs only needs two rejuvenations (at 48 and 96 days), whereas six rejuvenations are required with a single HSS for its critical components. Therefore, Scenario 1 with 2-HSSs results in a (6*2-2*2)/(6*2) = 66% reduction in cost and management for software rejuvenation, while keeping the system above the same reliability threshold (0.99).

Figure 6.3.1 Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario 1)

Figure 6.3.2 shows Scenario 2 for component-specific software rejuvenation. According to the figure, when the system reliability reaches the threshold in 48 days, the components with the lowest reliability, i.e., the database servers, are scheduled for rejuvenation first.


Figure 6.3.2 Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario 2)

The rejuvenation induces a partial spike in the reliability curve, and the system reliability is then continuously monitored until it reaches the threshold again on the 69th day. At this point, the application server components become the ones with the lowest reliability. As a result, the rejuvenation process alternates between the two subsystems. We can see three rejuvenations for Scenario 2 with 2-HSSs vs. nine rejuvenations for the 1-HSS design. Therefore, Scenario 2 with 2-HSSs results in a (9-3)/9 = 66% reduction in cost and management for software rejuvenation, while keeping the system above the same reliability threshold (0.99).


Chapter 7

MODEL WITH NON-CONSTANT FAILURE RATE AND ONE HOT SPARE

7.1 Non-Constant Failure Rate and Common Distribution Functions

The exponential distribution is not the only common distribution function with applications in reliability engineering. Many distributions have interesting hazard/failure models that can be applied to diverse systems, including software systems. For example, the Normal distribution primarily applies to measurements of product vulnerability and external stress; it is a two-parameter distribution that describes systems in which failures are due to wear-out effects in mechanical systems. In addition, the Log-Normal distribution is a flexible model that can empirically fit many types of failure data, in both repairable and non-repairable systems. The Gamma distribution is used as a failure probability function for components whose distribution is skewed and unpredictable; it is similar to the Weibull distribution and can be related in special cases to the exponential and Weibull distributions. The Pareto distribution is a mix of the exponential and gamma distributions; it was originally developed to model income in a population, due to its long-tail characteristic, and can be used alongside the normal distribution in modeling population sizes and incomes. The Rayleigh distribution is a flexible lifetime distribution that can be applied to many degradation failure modes, and its two-parameter and three-parameter hazard rate functions can have either increasing or decreasing hazard/failure rates. Finally, the Weibull distribution, of which the exponential distribution is a special case, can also have either an increasing or a decreasing failure rate, depending on the choice of the shape and scale parameters [4]. In this thesis, we choose to investigate the Weibull distribution since it is a close generalization of the exponential distribution. At the same time, it allows us to test the correctness of the proposed analytical approach with non-constant failure rates, using the exponential distribution as a comparison benchmark for the results collected in Case Studies 3 and 4, presented in this chapter and in Chapter 8.

7.2 Modeling and Analysis

In this section, the modeling and qualitative analysis are the same as in Chapter 5, Section 5.1.1, which applies to the extended DFT software spare (SSP) gate with 1-HSS. The difference is in the quantitative analysis, as this case study considers the Weibull distribution for the time-to-failure probability density function (pdf) of software components.

Following the same model construct defined in Chapter 5, an SSP gate with one primary component P and one HSS component H is illustrated in Figure 7.1. An SSP gate fails when P and all alternate spares (the only spare in Figure 7.1 is H) fail. When P fails, H takes over P's workload, and then behaves as H* with λH* ≥ λH. This is due to the software aging phenomenon when an HSS takes on the full workload after it replaces the primary component. Based on Figure 7.1, we now consider two disjoint path events that lead to the failure of the SSP gate: P fails before H (path event e1), and H fails before P (path event e2).



Figure 7.1. An SSP gate with a primary component and 1-HSS

Case 1: Event e1, where P fails before H fails, denoted as P → H. Let τ1 and τ2 be the failure times of P and H, respectively. In this case, it is impossible for H to fail during (0, τ1]. Hence the probability of P failing before H fails, i.e., Pr(e1), can be calculated using double integration as in Eq. (7.1), vs. Eq. (5.1) in Chapter 5, which is based on the exponential distribution.

\Pr(e_1) = \int_0^t \int_{\tau_{H^*}}^{t-(\tau_1-\tau_{H^*})} f_P(\tau_1)\, f_{H^*}(\tau_2)\, d\tau_2\, d\tau_1
         = \int_0^t \int_{\tau_{H^*}}^{t-(\tau_1-\tau_{H^*})} p^2 \lambda_P^p \lambda_{H^*}^p (\tau_1 \tau_2)^{p-1} \big(e^{-(\lambda_P \tau_1)^p}\big)\big(e^{-(\lambda_{H^*} \tau_2)^p}\big)\, d\tau_2\, d\tau_1    (7.1)

with \tau_{H^*} = [h_H(\tau_1)/h_{H^*}(\tau_1)]\,\tau_1 = (\lambda_H/\lambda_{H^*})^p\,\tau_1 [38].

Case 2: Event e2, where H fails before P fails, denoted as H → P. In this case, it is impossible for P to fail during (0, τ2], where τ2 is the failure time of H. Hence the probability of H failing before P fails, i.e., Pr(e2), can be calculated as in Eq. (7.2).

\Pr(e_2) = \int_0^t \int_{\tau_2}^{t} p^2 \lambda_P^p \lambda_H^p (\tau_1 \tau_2)^{p-1} \big(e^{-(\lambda_H \tau_2)^p}\big)\big(e^{-(\lambda_P \tau_1)^p}\big)\, d\tau_1\, d\tau_2    (7.2)

The reliability function R(t) for an SSP gate with 1-HSS is given in the general form R(t) = 1 − U(t), where U(t) is given as in Eq. (7.3); refer to Chapter 5 for the detailed derivation of Eq. (7.3).

U(t) = \Pr(T_{P \to H} \le t) + \Pr(T_{H \to P} \le t) = \Pr(e_1) + \Pr(e_2)    (7.3)
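As an illustration only (not part of the original derivation), the double integrals in Eqs. (7.1)-(7.3) can be evaluated numerically; the sketch below uses SciPy's dblquad, with illustrative parameter values and the assumption λH* = λP (consistent with h_P(t) = h_{H*}(t) used later in this chapter). The names f and ssp_unreliability are hypothetical helpers.

    # A minimal numerical sketch of Eqs. (7.1)-(7.3) for an SSP gate with
    # one primary P and one HSS H under Weibull failure times. Parameter
    # values are illustrative assumptions, with lam_Hstar = lam_P.
    import numpy as np
    from scipy.integrate import dblquad

    p = 1.2                                        # common Weibull shape parameter
    lam_P, lam_H, lam_Hstar = 0.004, 0.0025, 0.004

    def f(lam, tau):   # Weibull pdf: p * lam^p * tau^(p-1) * exp(-(lam*tau)^p)
        return p * lam**p * tau**(p - 1) * np.exp(-(lam * tau)**p)

    def ssp_unreliability(t):
        shift = (lam_H / lam_Hstar)**p             # tau_H* = (lam_H/lam_H*)^p * tau1
        # Pr(e1), Eq. (7.1): P fails at tau1, then H* fails (shifted clock) before t
        pr_e1, _ = dblquad(lambda tau2, tau1: f(lam_P, tau1) * f(lam_Hstar, tau2),
                           0, t,
                           lambda tau1: shift * tau1,
                           lambda tau1: t - tau1 + shift * tau1)
        # Pr(e2), Eq. (7.2): H fails at tau2 first, then P fails before t
        pr_e2, _ = dblquad(lambda tau1, tau2: f(lam_H, tau2) * f(lam_P, tau1),
                           0, t, lambda tau2: tau2, lambda tau2: t)
        return pr_e1 + pr_e2                       # U(t) of Eq. (7.3)

    print(1 - ssp_unreliability(30))               # R(30) of the SSP gate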

7.3 Case Study 3: Non-Constant Failure Rate with One Hot Spare

The usage of the Weibull distribution in this case study serves multiple purposes. First of all, it is a general case of the exponential distribution, and hence it allows comparison with the earlier case studies in Chapter 5. That is the main reason why the Weibull probability density function (pdf) is selected, in the following form, where p and λ are the shape and scale parameters, respectively:

f(t) = p\,\lambda^p\, t^{p-1}\, e^{-(\lambda t)^p}

If p = 1, the pdf is exponentially distributed, as in the earlier case studies. Moreover, given that the reliability function of the Weibull distribution is R(t) = e^{-(\lambda t)^p} and the failure rate is h(t) = f(t)/R(t), the failure rate becomes

h(t) = p\,\lambda^p\, t^{p-1}

which is non-constant and a function of time. As a result, this case study allows us to demonstrate the validity of the proposed approach and its applicability to non-constant failure rates, for which CTMC cannot generate the reliability function. This case study has exactly the same setup and configuration as Case Study 1: it consists of an application server PA and a database server PB. To enhance the system reliability, two hot spare components HA and HB are set up for PA and PB, respectively, which are ready to take over the workload once the primary ones fail. The same assumptions made for the earlier case studies regarding the availability zone, redundancy, fault tolerance, reliability threshold, software components and rejuvenation duration hold for this case study, as shown in Figure 5.2.1. The only difference from Case Study 1 is the failure rate function for the software components. We keep the same λ parameters for each component: λPA = 0.004/day, λHA = 0.0025/day, λPB = 0.005/day, and λHB = 0.003/day. However, the shape parameter is selected as p > 1, which indicates a monotonically increasing failure rate function with respect to time, to simulate the effect of the software aging (SA) phenomenon on the software system. The shape parameter is set to p = 1.2 for the application servers and p = 1.1 for the database servers. The failure rates are stated as follows:

h_{PA}(t) = p\,\lambda_{PA}^p\, t^{p-1}, \quad h_{HA}(t) = p\,\lambda_{HA}^p\, t^{p-1}, \quad h_{PB}(t) = p\,\lambda_{PB}^p\, t^{p-1}, \quad h_{HB}(t) = p\,\lambda_{HB}^p\, t^{p-1}

with h_{PA}(t) = h_{HA^*}(t) and h_{PB}(t) = h_{HB^*}(t), as HA* and HB* will be acting as primary components after the primary servers fail. The case study also involves CSS components, namely CSA and CSB, which are used in the rejuvenation process. We name the servers in the second group PA', HA', PB', and HB'. As the CSS components are undeployed VM images, their failure rates are 0. Once deployed, they will have the same failure rates as their corresponding software components, due to the assumed identical configurations.
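For reference, a small sketch (using the shape and scale values of this case study; weibull_hazard is a hypothetical helper, not thesis code) defining these failure rate functions:

    # Weibull hazard functions for the case-study components,
    # h(t) = p * lam^p * t^(p-1), with the shape/scale values given above.
    def weibull_hazard(p, lam):
        return lambda t: p * lam**p * t**(p - 1)

    h_PA = weibull_hazard(1.2, 0.004)   # application primary
    h_HA = weibull_hazard(1.2, 0.0025)  # application hot spare
    h_PB = weibull_hazard(1.1, 0.005)   # database primary
    h_HB = weibull_hazard(1.1, 0.003)   # database hot spare

    print(h_PB(30))   # increasing with t, since p > 1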

Figure 7.2 shows the DFT model of the cloud-based system in Phase 1. Because the system fails when either the application server or the database server fails, the two SSP gates are connected by an OR-gate. The reliability function of the OR-gate can be derived as in Eq. (5.16), restated here as Eq. (7.4):

R(t) = 1 - U_{OR}(t) = 1 - \big(U_{S1}(t) + (1 - U_{S1}(t)) \cdot U_{S2}(t)\big)    (7.4)

where US1(t) and US2(t) are the unreliability functions of the subtrees S1 and S2, respectively. According to Eq. (7.3), US1(t) and US2(t) can be calculated at any given time value t.


[DFT diagrams: a) Phase 1; b) Phase 2, Scenario 1; c) Phase 2, Scenario 2, App. Servers Rejuvenation; d) Phase 2, Scenario 2, Db. Servers Rejuvenation]

Figure 7.2: The 1-HSS DFT model with both phases and both scenarios (cf. Chapter 5)

In Phase 2, we consider both of the scenarios mentioned at the end of Section 5.3, so that their impacts on system reliability as well as their consequent rejuvenation schedules can be compared. As in Case Study 1, Figure 7.2-b) shows the DFT model of the cloud-based system in Phase 2 based on Scenario 1. For the same reason as in Phase 1, the system reliability can be calculated as in Eq. (7.4), now with the subtrees S3 and S4:

R(t) = 1 - U_{OR}(t) = 1 - \big(U_{S3}(t) + (1 - U_{S3}(t)) \cdot U_{S4}(t)\big)

According to Eq. (5.16), US3(t) and US4(t) can be calculated as in Eq. (7.5) and Eq. (7.6), respectively.

U_{S3}(t) = U_{S1}(t) \cdot U_{S1'}(t)    (7.5)

U_{S4}(t) = U_{S2}(t) \cdot U_{S2'}(t)    (7.6)

Note that in Eqs. (7.5)-(7.6), US1(t), US1'(t), US2(t) and US2'(t) can be calculated in a similar way as in Eq. (7.3).
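A hedged sketch of how these gate compositions, Eqs. (7.4)-(7.6), combine subtree unreliabilities; u_s1, u_s1p, u_s2 and u_s2p stand for any unreliability values computed from Eq. (7.3), and the helper names are hypothetical:

    # OR-gate by sum of disjoint products, Eq. (7.4); AND-gates, Eqs. (7.5)-(7.6).
    def u_or(u1, u2):
        return u1 + (1.0 - u1) * u2      # either subtree failing fails the system

    def u_and(u1, u2):
        return u1 * u2                   # both subtrees must fail

    def system_reliability_phase2(u_s1, u_s1p, u_s2, u_s2p):
        # Phase 2, Scenario 1: S3 = S1 AND S1', S4 = S2 AND S2'
        return 1.0 - u_or(u_and(u_s1, u_s1p), u_and(u_s2, u_s2p))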


The reliability analysis results for Scenario 1 are listed in Table 7.1. The table shows that the reliability threshold (0.99) is reached every 25 days based on the reliability analysis results in Phase 1. Hence, both the application and database servers are rejuvenated at the end of Phase 1. Phase 2 has a 30-minute duration; therefore, we calculate the system reliability at 5, 10, 20 and 30 minutes into Phase 2 to illustrate how system reliability may change during the rejuvenation process. From the table, we can see that the system reliability is kept very high during the transition. After the 30 minutes, the newly deployed servers completely take over the system, and the servers to be rejuvenated are shut down. When this happens, the system returns to its initial state and starts a new life cycle with very high initial reliability. Therefore, Table 7.1 suggests that the system should be rejuvenated every 25 days in order to maintain the system reliability above the predefined threshold.

By looking further into Table 7.1, we can see that when the system reliability reaches 0.99 after 25 days, the reliability of the database server subsystem is lower than that of the application server subsystem. This suggests that we may rejuvenate the most critical components (i.e., the component or subsystem with the lowest reliability) first. In this case study, we choose to rejuvenate the database servers first. Then we wait until the system reliability reaches the threshold again, and rejuvenate the application servers, which have now become the most critical components. This is exactly what happens in the rejuvenation scheduling of Scenario 2, where the application servers and the database servers are rejuvenated alternately. Figure 7.2-c) and d) show the DFT model of the cloud-based system in Phase 2 for the two cases in Scenario 2. In particular, part d) shows the case where the database servers are rejuvenated. In this case, the system reliability can be calculated as in Eq. (7.7), and US1(t) and US4(t) can be calculated in a similar way as in Eq. (5.6) and Eq. (5.16), respectively.

R(t) = 1 - U_{OR}(t) = 1 - \big(U_{S1}(t) + (1 - U_{S1}(t)) \cdot U_{S4}(t)\big)    (7.7)

Scenario 2 system reliability for application server rejuvenation can be calculated in a similar way.
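The rejuvenation times reported in the tables below can also be located programmatically. A hedged sketch, where system_R stands for any reliability function such as the compositions above, and the 365-day horizon is an assumption:

    # Find the first time a monotonically decreasing reliability function
    # reaches the 0.99 threshold, via SciPy root finding.
    from scipy.optimize import brentq

    def next_rejuvenation_time(system_R, threshold=0.99, horizon=365.0):
        # R(t) - threshold changes sign between t ~ 0 (R = 1) and the horizon,
        # assuming the horizon is long enough for the threshold to be reached.
        return brentq(lambda t: system_R(t) - threshold, 1e-6, horizon)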

Table 7.1. Case Study 3- System Reliability with Software Rejuvenation (Scenario 1)

Phase  Time (Days)  App. Servers Reliability (PA/HA)  Db. Servers Reliability (PB/HB)  System Reliability
1      0            1                  1                  1
1      1            0.999998673        0.999993340        0.999992013
1      5            0.999938086        0.999774952        0.999774952
1      10           0.999682812        0.998990000        0.998673
1      20           0.998435928        0.995651480        0.994094209
1      25           0.997424208        0.993113690        0.990555636
2      25.003472    0.999999999999998  0.999999999999997  0.999999999999995
2      25.006944    0.999999999999987  0.999999999999975  0.999999999999962
2      25.01389     0.999999999999896  0.999999999999803  0.999999999999699
2      25.020833    0.999999999999647  0.999999999999337  0.999999999998984
1      26           0.999998673        0.999993340        0.999992013
1      30           0.999938086        0.999774952        0.999774952
1      35           0.999682812        0.998990000        0.998673
1      45           0.998435928        0.995651480        0.994094209
1      50           0.997424208        0.993113690        0.990555636
2      50.003472    0.999999999999998  0.999999999999997  0.999999999999995
2      50.006944    0.999999999999987  0.999999999999975  0.999999999999962
2      50.01389     0.999999999999896  0.999999999999803  0.999999999999699
2      50.020833    0.999999999999647  0.999999999999337  0.999999999998984
1      51           0.999998673        0.999993340        0.999992013
1      55           0.999938086        0.999774952        0.999774952
1      60           0.999682812        0.998990000        0.998673
1      70           0.998435928        0.995651480        0.994094209
1      75           0.997424208        0.993113690        0.990555636
2      75.003472    0.999999999999998  0.999999999999997  0.999999999999995
2      75.006944    0.999999999999987  0.999999999999975  0.999999999999962
2      75.01389     0.999999999999896  0.999999999999803  0.999999999999699
2      75.020833    0.999999999999647  0.999999999999337  0.999999999998984
1      80           0.999938086        0.999774952        0.999774952
1      85           0.999682812        0.998990000        0.998673
1      95           0.998435928        0.995651480        0.994094209
1      100          0.997424208        0.993113690        0.990555636
2      100.003472   0.999999999999998  0.999999999999997  0.999999999999995
2      100.006944   0.999999999999987  0.999999999999975  0.999999999999962
2      100.01389    0.999999999999896  0.999999999999803  0.999999999999699
2      100.020833   0.999999999999647  0.999999999999337  0.999999999998984
1      105          0.999938086        0.999774952        0.999774952
1      110          0.999682812        0.998990000        0.998673
1      120          0.998435928        0.995651480        0.994094209
1      125          0.997424208        0.993113690        0.990555636

Table 7.2 shows the reliability analysis results for Scenario 2. At the end of each Phase 1, the server subsystem whose reliability is marked by "=>" is the one to be rejuvenated. For example, after 25 days, the database servers are rejuvenated, and after 40 days, the application servers are rejuvenated.

Table 7.2. Case Study 3- System Reliability with Software Rejuvenation (Scenario 2)

Phase  Time (Days)  App. Servers Reliability (PA/HA)  Db. Servers Reliability (PB/HB)  System Reliability
1      0            1                  1                  1
1      1            0.999998673        0.999993340        0.999992013
1      5            0.999938086        0.999774952        0.999774952
1      10           0.999682812        0.998990000        0.998673
1      20           0.998435928        0.995651480        0.994094209
1      25           0.997424208        => 0.99311369      0.990555636
2      25.003472    0.997423373        0.999999999999821  0.99742372
2      25.006944    0.997422           0.999999999999178  0.997421
2      25.01389     0.99742            0.99999999999622   0.997419
2      25.020833    0.997418           0.999999999990784  0.9974178
1      26           0.99719112         0.999993340        0.997184479
1      30           0.99615889         0.999774952        0.995934706
1      35           0.99465212         0.998990000        0.993647521
1      40           => 0.99292004      0.997619011        0.9905
2      40.003472    0.999999999999988  0.9976191          0.99761909
2      40.006944    0.999999999999938  0.99761909         0.997619089
2      40.01389     0.999999999999671  0.99761908         0.997619079
2      40.020833    0.99999999999913   0.99761906         0.9976191
1      41           0.999998673        0.997272340        0.997271102
1      45           0.999938086        0.995651480        0.995589835
1      55           0.998587859        => 0.9903634       0.99
2      55.003472    0.998587858        0.999999999999750  0.998587857
2      55.006944    0.998587856        0.99999999999885   0.998587855
2      55.01389     0.998587854        0.999999999994713  0.998587853
2      55.020833    0.998587853        0.999999999987103  0.998587852
1      60           0.998435928        0.999774952        0.998211232
1      65           0.997424208        0.99899            0.99641681
1      70           0.99615919         0.997619011        0.993787346
1      75           0.99615919         => 0.99565148      0.991
2      75.003472    0.99615918         0.999999999999750  0.996159179
2      75.006944    0.99615916         0.99999999999885   0.996159159
2      75.01389     0.99615914         0.999999999994713  0.996159139
2      75.020833    0.99615913         0.999999999987103  0.996159129
1      80           0.99292004         0.999774952        0.992696585
1      85           => 0.990981980     0.99899            0.99
2      85.003472    0.999999999999988  0.99898            0.998979
2      85.006944    0.999999999999938  0.99896            0.998958
2      85.01389     0.999999999999671  0.99894            0.998949
2      85.020833    0.99999999999913   0.99892            0.998919
1      90           0.999938086        0.997619011        0.997557244
1      95           0.999682812        0.995651480        0.995335671
1      105          0.998435928        => 0.9903634       0.99
2      105.003472   0.998440           0.999999999999750  0.998439
2      105.006944   0.998439           0.99999999999885   0.9984388
2      105.01389    0.998436           0.999999999994713  0.9984359
2      105.020833   0.998434           0.999999999987103  0.9984337
1      110          0.997424208        0.999774952        0.99719974
1      115          0.99615919         0.99899            0.995153069
1      120          0.99615919         0.997619011        0.993787346
1      125          => 0.99292004      0.995651480        0.99

The rejuvenation schedules for both Scenario 1 and Scenario 2 are illustrated in Figure 7.3. In the figure, the initiation of a rejuvenation is indicated by a sudden increase in the system reliability. By comparing the two rejuvenation schedules in Table 7.1 and Table 7.2, we can see that over 125 days, Scenario 1 has 5 rejuvenation processes, each of which requires rejuvenating both the application and database servers. On the other hand, Scenario 2 has 7 rejuvenation processes, each of which only requires rejuvenating either the application servers or the database servers.

It is easy to see that Scenario 2 requires less management of the servers in order to keep the system reliability above the 0.99 threshold at all times. Suppose the rejuvenation of the application servers has the same cost as that of the database servers; by using the rejuvenation scheduling defined in Scenario 2, the cost can be reduced by (5*2-7)/(5*2) = 30% compared to the rejuvenation scheduling defined in Scenario 1.


Figure 7.3. Case study 3- Rejuvenation scheduling (Scenario 1 vs. Scenario 2)


Chapter 8

MODEL WITH NON-CONSTANT FAILURE RATE AND TWO HOT SPARES

8.1 Modeling and Analysis

In this section, the modeling and qualitative analysis are similar to those in Chapter 6, Section 6.1.1, which apply to the extended DFT software spare (SSP) gate with 2-HSSs. The difference is in the quantitative analysis, as this case study considers the Weibull distribution for the time-to-failure probability density function (pdf) of software components.

Following the same model construct defined in Chapter 6, an SSP gate with one primary component P and two HSS components H1 and H2 is illustrated in Figure 8.1. An SSP gate fails when the primary component and all the alternate inputs fail. When H1 takes over from P, it becomes H1*, with λH1* ≥ λH1, due to the software aging phenomenon when it takes on the full workload. In this case, H1* serves as a primary component, and H2 serves as its hot software spare. Similarly, when H1* fails, H2 replaces H1* and behaves as H2*, with λH2* ≥ λH2.


Figure 8.1. An SSP gate with a primary component and two HSSs


Let τ1, τ2 and τ3 be the failure times of components P, H1 and H2, respectively. We now identify all the possible cases/events that lead to the failure of an SSP gate according to the component failure sequence. To calculate the reliability function of an SSP gate, we investigate six disjoint path events (denoted e1 to e6, respectively) as follows.

Case 1: Event e1, where the components fail in the sequence of P, H1, and H2, denoted as P → H1 → H2. In this case, it is impossible for H1 to fail during (0, τ1] and for H2 to fail during (0, τ2]. The HSS H1 takes over the workload and becomes H1* right after P fails; similarly, H2 takes over the workload and becomes H2* right after H1* fails. Hence, the probability of the path event P → H1 → H2, i.e., Pr(e1), can be calculated as in Eq. (8.1), with

\tau_{H_2^*} = \frac{h_{H_2}(\cdot)}{h_{H_2^*}(\cdot)}\big(\tau_2 + (\tau_1 - \tau_{H_1^*})\big) = \Big(\frac{\lambda_{H_2}}{\lambda_{H_2^*}}\Big)^p \big(\tau_2 + (\tau_1 - \tau_{H_1^*})\big)

which is a generalized form of the time-shifting equation derived in Chapter 7 [38].

\Pr(e_1) = \int_0^t \int_{\tau_{H_1^*}}^{t-(\tau_1-\tau_{H_1^*})} \int_{\tau_{H_2^*}}^{t-[\tau_2+(\tau_1-\tau_{H_1^*})]+\tau_{H_2^*}} f_P(\tau_1)\, f_{H_1^*}(\tau_2)\, f_{H_2^*}(\tau_3)\, d\tau_3\, d\tau_2\, d\tau_1
         = \int_0^t \int_{\tau_{H_1^*}}^{t-(\tau_1-\tau_{H_1^*})} \int_{\tau_{H_2^*}}^{t-[\tau_2+(\tau_1-\tau_{H_1^*})]+\tau_{H_2^*}} p^3 \lambda_P^p \lambda_{H_1^*}^p \lambda_{H_2^*}^p (\tau_1\tau_2\tau_3)^{p-1} e^{-(\lambda_P\tau_1)^p} e^{-(\lambda_{H_1^*}\tau_2)^p} e^{-(\lambda_{H_2^*}\tau_3)^p}\, d\tau_3\, d\tau_2\, d\tau_1    (8.1)

Case 2: Event e2, where the components fail in the sequence of P, H2, and H1, denoted as P → H2 → H1. Similar to the derivation in Chapter 7, the integration for H1* requires shifting the lower integration limit from τH1* to τH1* + (τ3 − τ1), which leads to Eq. (8.2).

\Pr(e_2) = \int_0^t \int_{\tau_1}^{t} \int_{\tau_{H_1^*}+(\tau_3-\tau_1)}^{t-(\tau_1-\tau_{H_1^*})} p^3 \lambda_P^p \lambda_{H_1^*}^p \lambda_{H_2}^p (\tau_1\tau_2\tau_3)^{p-1} e^{-(\lambda_P\tau_1)^p} e^{-(\lambda_{H_1^*}\tau_2)^p} e^{-(\lambda_{H_2}\tau_3)^p}\, d\tau_2\, d\tau_3\, d\tau_1    (8.2)


Case 3: Event e3, where the components fail in the sequence of H2, P, and H1, denoted as H2 → P → H1. Note that this is a simple case, similar to the 1-HSS case of Chapter 7. The probability that the SSP gate fails can be calculated as in Eq. (8.3), with τH1* = (λH1/λH1*)^p τ1.

\Pr(e_3) = \int_0^t \int_{\tau_3}^{t} \int_{\tau_{H_1^*}}^{t-(\tau_1-\tau_{H_1^*})} p^3 \lambda_P^p \lambda_{H_1^*}^p \lambda_{H_2}^p (\tau_1\tau_2\tau_3)^{p-1} e^{-(\lambda_P\tau_1)^p} e^{-(\lambda_{H_1^*}\tau_2)^p} e^{-(\lambda_{H_2}\tau_3)^p}\, d\tau_2\, d\tau_1\, d\tau_3    (8.3)

Case 4: Event e4, where the components fail in the sequence of H1, H2, and P, denoted as H1 → H2 → P. In this case, it is impossible for P to fail during (0, τ3]. The probability that the SSP gate fails during (0, t] can be calculated as in Eq. (8.4).

\Pr(e_4) = \int_0^t \int_{\tau_2}^{t} \int_{\tau_3}^{t} p^3 \lambda_P^p \lambda_{H_1}^p \lambda_{H_2}^p (\tau_1\tau_2\tau_3)^{p-1} e^{-(\lambda_P\tau_1)^p} e^{-(\lambda_{H_1}\tau_2)^p} e^{-(\lambda_{H_2}\tau_3)^p}\, d\tau_1\, d\tau_3\, d\tau_2    (8.4)

Case 5: Event e5, where the components fail in the sequence of H1, P, and H2, denoted as H1 → P → H2. Similar to Case 3, this is where H1 fails first as a spare, and then P fails before H2 fails. In this case, the probability that the SSP gate fails can be calculated as in Eq. (8.5), where τH2* is defined similarly to τH1*, since H2's promotion depends only on P, i.e., τH2* = (λH2/λH2*)^p τ1.

\Pr(e_5) = \int_0^t \int_{\tau_2}^{t} \int_{\tau_{H_2^*}}^{t-(\tau_1-\tau_{H_2^*})} p^3 \lambda_P^p \lambda_{H_1}^p \lambda_{H_2^*}^p (\tau_1\tau_2\tau_3)^{p-1} e^{-(\lambda_P\tau_1)^p} e^{-(\lambda_{H_1}\tau_2)^p} e^{-(\lambda_{H_2^*}\tau_3)^p}\, d\tau_3\, d\tau_1\, d\tau_2    (8.5)


Case 6: Event e6, where the components fail in the sequence of H2, H1, and P, denoted as H2 → H1 → P. In this case, the probability that the SSP gate fails during (0, t] can be calculated as in Eq. (8.6).

\Pr(e_6) = \int_0^t \int_{\tau_3}^{t} \int_{\tau_2}^{t} p^3 \lambda_P^p \lambda_{H_1}^p \lambda_{H_2}^p (\tau_1\tau_2\tau_3)^{p-1} e^{-(\lambda_P\tau_1)^p} e^{-(\lambda_{H_1}\tau_2)^p} e^{-(\lambda_{H_2}\tau_3)^p}\, d\tau_1\, d\tau_2\, d\tau_3    (8.6)

The reliability function R(t) for an SSP gate with 2-HSSs is given in the general form R(t) = 1 − U(t), where U(t) is given as in Eq. (8.7).

U(t) = \Pr(e_1) + \Pr(e_2) + \Pr(e_3) + \Pr(e_4) + \Pr(e_5) + \Pr(e_6)    (8.7)
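As an illustration (not thesis code), the simplest of the six paths, Pr(e4) of Eq. (8.4), involves no clock shifting and can be evaluated directly with SciPy's triple integration; the parameter values and the pr_e4 helper name are assumptions:

    # Numerical sketch of Pr(e4), Eq. (8.4): failure sequence H1 -> H2 -> P.
    import numpy as np
    from scipy.integrate import tplquad

    p = 1.1
    lam_P, lam_H1, lam_H2 = 0.005, 0.003, 0.003   # assumed scale parameters

    def f(lam, tau):                              # Weibull pdf
        return p * lam**p * tau**(p - 1) * np.exp(-(lam * tau)**p)

    def pr_e4(t):
        # Outer: tau2 in (0, t); middle: tau3 in (tau2, t); inner: tau1 in (tau3, t)
        val, _ = tplquad(lambda tau1, tau3, tau2:
                             f(lam_P, tau1) * f(lam_H1, tau2) * f(lam_H2, tau3),
                         0, t,
                         lambda tau2: tau2, lambda tau2: t,
                         lambda tau2, tau3: tau3, lambda tau2, tau3: t)
        return val

    print(pr_e4(59))   # one of the six disjoint terms in Eq. (8.7)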

8.2 Case Study 4: Non-Constant Failure Rate with 2-HSSs

In this section, the modeling and qualitative analysis are the same as in Chapter 6, Section 6.1.1, which applies to the extended DFT SSP gate with 2-HSSs. The difference is in the quantitative analysis, as this case study considers the Weibull distribution, described in Section 8.1, for the time-to-failure probability density function (pdf) of software components. As in the case study of Chapter 7, using the Weibull distribution we get the following pdf, where p and λ are the shape and scale parameters, respectively:

f(t) = p\,\lambda^p\, t^{p-1}\, e^{-(\lambda t)^p}

If p = 1, the pdf is exponentially distributed, as in the earlier case studies. Moreover, given that the reliability function of the Weibull distribution is R(t) = e^{-(\lambda t)^p} and the failure rate is h(t) = f(t)/R(t), the failure rate becomes h(t) = p\,\lambda^p\, t^{p-1}, which is non-constant and a function of time.

This case study has exactly the same setup and configuration as Case Study 2: it consists of an application server PA and a database server PB. To enhance the system reliability, two HSSs are deployed for each primary server, namely HA1, HA2, HB1 and HB2. As already mentioned, these HSS components are ready to take over the workload once the primary ones fail. The same assumptions made for the earlier case studies regarding the availability zone, redundancy, fault tolerance, reliability threshold, software components and rejuvenation duration hold for this case study, as shown in Figure 6.2.1. The only difference from Case Study 2 is the failure rate function for the software components, which eventually affects the quantitative analysis, manifested in the reliability results and rejuvenation scheduling. In the case study, we define the following scale parameters: λPA = 0.004/day, λHA1 = λHA2 = 0.0025/day, λPB = 0.005/day, and λHB1 = λHB2 = 0.003/day. For comparison purposes, we set them to the same values as the constant failure rates of the exponential distribution used in Case Study 2 in Chapter 6. However, the shape parameter is selected as p > 1, which indicates a monotonically increasing failure rate function with respect to time, to simulate the effect of the SA phenomenon on the software system. The shape parameter is set to p = 1.2 for the application servers and p = 1.1 for the database servers. The failure rates are stated as follows:

h_{PA}(t) = p\,\lambda_{PA}^p\, t^{p-1}, \quad h_{HA_1}(t) = p\,\lambda_{HA_1}^p\, t^{p-1}, \quad h_{HA_2}(t) = p\,\lambda_{HA_2}^p\, t^{p-1},
h_{PB}(t) = p\,\lambda_{PB}^p\, t^{p-1}, \quad h_{HB_1}(t) = p\,\lambda_{HB_1}^p\, t^{p-1}, \quad h_{HB_2}(t) = p\,\lambda_{HB_2}^p\, t^{p-1}

with h_{PA}(t) = h_{HA_1^*}(t) = h_{HA_2^*}(t) and h_{PB}(t) = h_{HB_1^*}(t) = h_{HB_2^*}(t), as HA1*, HA2*, HB1* and HB2* will be acting as primary components after the primary servers fail. Note that the rejuvenation process also involves Cold Software Spare (CSS) components, which are images of VM instances that can be easily deployed. Since a CSS is simply a cloud image that is not running, its failure rate equals 0. As such, a CSS does not appear in the DFT model, because it does not affect the system reliability. We consider a CSS only when it is activated and deployed as a primary component or an HSS. Once deployed, they will have the same failure rates as their corresponding software components, due to the assumed identical configurations.

[DFT diagrams: a) Phase 1; b) Phase 2, Scenario 1; c) Phase 2, Scenario 2, App. Servers Rejuvenation; d) Phase 2, Scenario 2, Db. Servers Rejuvenation]

Figure 8.2: The 2-HSSs DFT model with both phases and both scenarios (cf. Chapter 6)

Figure 8.2-a) shows the DFT model of the cloud-based system in Phase 1. Because the system fails when either the application server or the database server fails, the two SSP gates are connected by an OR-gate. The reliability function of the OR-gate can be derived as in Eq. (5.16), restated here as Eq. (8.8):

R(t) = 1 - U_{OR}(t) = 1 - \big(U_{S1}(t) + (1 - U_{S1}(t)) \cdot U_{S2}(t)\big)    (8.8)

where US1(t) and US2(t) are the unreliability functions of the subtrees S1 and S2, respectively. Note that US1(t) and US2(t) can be evaluated using Eq. (8.7), detailed in the previous section.

In Phase 2, we consider both of the scenarios mentioned at the end of Chapter 4, so we can compare their impacts on system reliability as well as on the rejuvenation schedules. Figure 8.2-b) represents the DFT model of the cloud-based system with 2-HSSs in Phase 2 based on Scenario 1. Similar to the reliability analysis for Phase 1, we can analyze the DFT model for Phase 2 (Scenario 1) by decomposing it into subtrees. Thus, the unreliability functions of the subtrees, US1(t), US1'(t), US2(t) and US2'(t), can be computed using Eq. (8.7) defined in the previous section. As for US3(t) and US4(t), since they are AND-gates, their unreliabilities can be calculated using the sum-of-disjoint-products method, as shown in Eqs. (8.9)-(8.10).

U_{S3}(t) = U_{S1}(t) \cdot U_{S1'}(t)    (8.9)

U_{S4}(t) = U_{S2}(t) \cdot U_{S2'}(t)    (8.10)

The reliability analysis results for Scenario 1 are listed in Table 8.1. The table shows that the reliability threshold (0.99) is reached every 59 days based on the reliability analysis results in Phase 1. Hence, both the application and database servers are rejuvenated at the end of Phase 1. Phase 2 has a 30-minute duration; therefore, we calculate the system reliability at 5, 10, 20 and 30 minutes into Phase 2 to illustrate how system reliability may change during the rejuvenation process. From the table, we can see that the system reliability is kept very high during the transition. After the 30 minutes, the newly deployed servers completely take over the system, and the servers to be rejuvenated are shut down. When this happens, the system returns to its initial state and starts a new life cycle with very high initial reliability. Therefore, Table 8.1 suggests that the system should be rejuvenated every 59 days in order to maintain the system reliability above the predefined threshold.

Table 8.1. Case Study 4- System Reliability with Software Rejuvenation (Scenario 1)

Phase Time (Days) App. Servers Reliability Db. Servers Reliability System Reliability

1      1            0.99999999879      0.999999986232     0.999999985
1      5            0.999999605        0.999997256469     0.999996861
1      10           0.999995272        0.999973585767     0.999968858
1      20           0.999944405        0.999752173000     0.999696591
1      30           0.999768668        0.999102301329     0.998871177
1      59           0.997633368        0.992834925850     0.990485251
2      59.003472    0.999999999999999  0.999999999999999  0.999999999999999
2      59.006944    0.999999999999999  0.999999999999999  0.999999999999999
2      59.01389     0.999999999999999  0.999999999999999  0.999999999999999
2      59.020833    0.999999999999999  0.999999999999999  0.999999999999999
1      60           0.99999999879      0.999999986232     0.999999985
1      64           0.999999605        0.999997256469     0.999996861
1      69           0.999995272        0.999973585767     0.999968858
1      89           0.999768668        0.999102301329     0.998871177
1      118          0.997633368        0.992834925850     0.990485251
2      118.003472   0.999999999999999  0.999999999999999  0.999999999999999
2      118.006944   0.999999999999999  0.999999999999999  0.999999999999999
2      118.01389    0.999999999999999  0.999999999999999  0.999999999999999
2      118.020833   0.999999999999999  0.999999999999999  0.999999999999999
1      119          0.99999999879      0.999999986232     0.999999985
1      123          0.999999605        0.999997256469     0.999996861
1      128          0.999995272        0.999973585767     0.999968858
1      148          0.999768668        0.999102301329     0.998871177
1      177          0.997633368        0.992834925850     0.990485251

By looking further into Table 8.1, we can see that when the system reliability reaches 0.99 after 59 days, the reliability of the database server subsystem is lower than that of the application server subsystem. This suggests that we may rejuvenate the most critical components (i.e., the component or subsystem with the lowest reliability) first. In this case study, we choose to rejuvenate the database servers first. Then we wait until the system reliability reaches the threshold again, and rejuvenate the application servers, which have now become the most critical components. This is exactly what happens in the rejuvenation scheduling of Scenario 2, where the application servers and the database servers are rejuvenated alternately. Figure 8.2-c) and d) show the DFT model of the cloud-based system in Phase 2 for the two cases in Scenario 2. In particular, part d) of the figure shows the case where the database servers are rejuvenated. In this case, the system reliability can be calculated as in Eq. (8.11), where US1(t) can be evaluated based on Eq. (8.7); US4(t), since it is represented by an AND-gate, can be calculated using the SDP method, resulting in an equation similar to Eqs. (8.9)-(8.10), where US2(t) and US2'(t) can again be evaluated based on Eq. (8.7), since each is an SSP gate with 2-HSSs.

R(t) 1U (t) 1 (U (t)  (1U (t))*U (t)) OR S1 S1 S4 (8.11)

It is worth mentioning that the Scenario 2 system reliability for application server rejuvenation can be calculated in a similar way.

Table 8.2 shows the reliability analysis results for Scenario 2. At the end of each Phase 1, the server subsystem whose reliability is marked by "=>" is the one to be rejuvenated. For example, after 59 days, the database servers are rejuvenated, and after 90 days, the application servers are rejuvenated.

Table 8.2. Case Study 4- System Reliability with Software Rejuvenation (Scenario 2)

Phase Time (Days) App. Servers Reliability Db. Servers Reliability System Reliability

1      0            1.00000000000      1.000000000000     1
1      1            0.99999999879      0.999999986232     0.999999985
1      5            0.999999605        0.999997256469     0.999996861
1      10           0.999995272        0.999973585767     0.999968858
1      30           0.999768668        0.999102301329     0.998871177
1      59           0.997633368        => 0.992834925850  0.990485251
2      59.003472    0.997632899000     0.999999999999999  0.997632898000
2      59.006944    0.997632430926     0.999999999999999  0.997632430925
2      59.01389     0.997631493010     0.999999999999999  0.997631493009
2      59.020833    0.997630555247     0.999999999999999  0.997630555246
1      64           0.996891175500     0.999997256469     0.996888440000
1      69           0.996006353777     0.999973585767     0.995980045000
1      80           0.993350142500     0.999710333000     0.993062400000
1      90           => 0.990481969976  0.999004880000     0.990000000000
2      90.003472    0.999999999999999  0.999463020000     0.999463010000
2      90.006944    0.999999999999999  0.999463000000     0.999462900000
2      90.01389     0.999999999999999  0.999462900000     0.999462800000
2      90.020833    0.999999999999999  0.999462700000     0.999462600000
1      95           0.999999605        0.998412074000     0.998411679000
1      100          0.999995272        0.997624570000     0.997619850000
1      120          0.999768668        0.992088640000     0.991000000000
1      123          0.999677420000     => 0.990881620000  0.990560000000
2      123.003472   0.999677419000     0.999999999999999  0.999677418000
2      123.006944   0.999677417000     0.999999999999999  0.999677416000
2      123.01389    0.999677416000     0.999999999999999  0.999677415000
2      123.020833   0.999677414000     0.999999999999999  0.999677412000
1      128          0.999473600000     0.999997256469     0.999470858000
1      133          0.999193689000     0.999973585767     0.999167290000
1      150          0.997495760000     0.999356000000     0.996853300000
1      160          0.995811240000     0.998270930000     0.994089400000
1      170          0.993501425502     0.996389577000     0.990000000000

The rejuvenation schedules for both Scenario 1 and Scenario 2 are illustrated in Figure 8.3. In the figure, the initiation of a rejuvenation is indicated by a sudden increase in the system reliability. By comparing the two rejuvenation schedules in Table 8.1 and Table 8.2, we can see that over about 125 days, Scenario 1 has 2 rejuvenation processes, each of which requires rejuvenating both the application and database servers. On the other hand, Scenario 2 has 3 rejuvenation processes, each of which only requires rejuvenating either the application servers or the database servers.

It is easy to see that Scenario 2 requires less management of the servers in order to keep the system reliability above the 0.99 threshold at all times. Suppose the rejuvenation of the application servers has the same cost as that of the database servers; by using the rejuvenation scheduling defined in Scenario 2, the cost can be reduced by (2*2-3)/(2*2) = 25% compared to the rejuvenation scheduling defined in Scenario 1.



Figure 8.3. Case study 4- Rejuvenation scheduling (Scenario 1 vs. Scenario 2)

8.3 Case Studies 3 and 4 Comparison

In this section, we show the analysis results and visualize the differences and impacts of employing 2-HSSs vs. 1-HSS on rejuvenation schedules in a cloud-based system in the case of non-constant failure rates, based on the past two case studies.

Table 8.3 shows the reliability analysis results for the application server subsystem in both the 1-HSS and 2-HSSs cases. It is easy to see that the 2-HSSs case is more reliable than the 1-HSS case, since the system design employs two HSSs for each primary component, and thus it is more fault-tolerant.

Table 8.3. Application Server Reliability with both 1-HSS and 2-HSSs

Time (Days)  1-HSS App. Server R(t)  2-HSSs App. Server R(t)
0            1                       1
1            0.99998705              0.99999996119
5            0.9996806               0.999995242
10           0.998745                0.99996249
20           0.99515                 0.99976169
30           0.98945                 0.9990894
60           0.96194                 0.993687
90           0.92259                 0.9815
120          0.8754                  0.96185
180          0.769                   0.90192
240          0.6593                  0.8214
300          0.5555                  0.7299
365          0.4546                  0.6276

In a similar way to Table 8.3, Table 8.4 shows the reliability analysis results for the database server subsystem for both the 1-HSS and 2-HSSs cases.

Table 8.4. Database Server Reliability with both 1-HSS and 2-HSSs

Time (Days) 1-HSS Db. Server R(t) 2-HSS Db. Server R(t)

0            1                      1
1            0.9999801              0.99999992711
5            0.9995107              0.999991104
10           0.998085               0.99993093
20           0.99266                0.9994793
30           0.98417                0.998344
60           0.94421                0.98888
90           0.8891                 0.96842
120          0.8253                 0.93681
180          0.6893                 0.8466
240          0.5588                 0.7352
300          0.4438                 0.6184
365          0.34                   0.4987

Again, the 2-HSSs case is more reliable than the 1-HSS case, since the system design employs two HSSs for each primary component and is thus more fault-tolerant.

Figure 8.4 illustrates in detail the differences between the 1-HSS and 2-HSSs models under the Scenario 1 policy. We can see that the system reliability reaches the threshold after 25 days and 59 days for the 1-HSS and 2-HSSs cases, respectively. According to Scenario 1, the whole system is restarted when the threshold is reached, and the system returns to its initial state. As a result, the rejuvenation must be repeated regularly, every 25 and 59 days for the 1-HSS and 2-HSSs cases, respectively. Such rejuvenation strategies are reflected in Figure 8.4 as recurrent rejuvenation schedules for the two cases in Scenario 1.

Figure 8.4. Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario 1)

On the other hand, Figure 8.5 shows irregular occurrences of rejuvenation in Scenario 2, the component-specific rejuvenation policy. This is because in Scenario 2, we rejuvenate the component that has the lowest reliability when the system reliability reaches the 0.99 threshold. According to the figure, when the reliability threshold is reached, the component with the lowest reliability, e.g., the database server, is rejuvenated first.


Figure 8.5. Rejuvenation scheduling: 2-HSSs vs. 1-HSS (Scenario 2)


It is worth mentioning that in Scenario 2 with 1-HSS, the database server gets rejuvenated two consecutive times, on day 80 and day 101, as shown in Figure 8.5. We can see how this irregularity affects the reliability pattern in the same figure. One can notice that 3 rejuvenations are needed for Scenario 2 with 2-HSSs vs. 7 rejuvenations for the 1-HSS case, over a duration of around 125 days. Therefore, compared with Scenario 2 with 1-HSS, using Scenario 2 with 2-HSSs results in a (7-3)/7 = 57% reduction in cost and management for software rejuvenation, while keeping the system reliability well above the 0.99 threshold.

This result is as expected, because using two HSSs for each primary component makes the whole system more reliable and dependable.

8.4 Case Studies: Results Interpretation

This section sheds light on the results obtained from the four case studies in Chapters 5, 6, 7 and 8. First, the case studies with the same type of failure rate are addressed; then we turn to the case studies with the same number of hot software spares.

With regard to the first two case studies, all the components' time-to-failure pdfs are exponentially distributed; hence the components in both case studies have constant failure rates. Since the scope of this research is reliability-based software rejuvenation scheduling, it is important to point out that when we use more standby spares, the overall system reliability increases. Therefore, the rejuvenation frequency decreases, resulting in less rejuvenation management, but it costs the cloud consumer more resource spending to maintain a higher system reliability. The same interpretation applies to Case Studies 3 and 4, where the only difference is that the components' time-to-failure pdfs follow the Weibull distribution. We can still see the same pattern observed in the first two case studies.

Taking into consideration the case studies that have the same number of hot software spares, we can interpret the results of Case Studies 1 and 3 alongside those of Case Studies 2 and 4, respectively. With regard to Case Studies 1 and 3, Case Study 3 employs the Weibull distribution as the time-to-failure pdf, and we set the shape parameter p > 1 to yield an increasing failure rate, in order to translate the software aging factor into the failure rate. The scale parameter λ is kept the same in both case studies. It is important to understand the shape of the non-constant (Weibull) failure rate function in order to understand the difference in results between the two case studies: in the first case study the rejuvenation is triggered at 18 days based on the reliability analysis, whereas for Case Study 3 the rejuvenation is triggered at 25 days. Since we are working with an increasing failure rate, one might expect the system reliability of Case Study 3 to decay to 0.99 faster, which is not the case. We can observe the same thing for Case Studies 2 and 4, where the Case Study 2 rejuvenation is triggered at 48 days, whereas for Case Study 4 the rejuvenation is triggered at 59 days.

To explain this, we visually investigate the Weibull hazard/failure rate function h(t) = p\,\lambda^p\, t^{p-1} with respect to different shape parameters p ≥ 1 for a fixed λ = 0.5, as seen in Figure 8.6. Moreover, we check numerically where the constant rates λ = 0.005 and λ = 0.004, which are relevant to the case studies, intersect the non-constant hazard functions h(t) = (1.1)(0.005^{1.1})\,t^{0.1} and h(t) = (1.2)(0.004^{1.2})\,t^{0.2}, respectively.


Figure 8.6. Hazard/failure rate function for Weibull distribution with p≥1

We observe the following: for λ = 0.005 and h(t) = (1.1)(0.005^{1.1})\,t^{0.1}, h(t) is less than the constant failure rate λ for t < 77.109, equal to it at t = 77.109, and greater otherwise. For λ = 0.004 and h(t) = (1.2)(0.004^{1.2})\,t^{0.2}, h(t) is less than λ for t < 100.469, equal to it at t = 100.469, and greater otherwise.
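These crossover times follow directly from setting h(t) = λ and solving for t; a quick sketch (crossover_time is a hypothetical helper, not thesis code) verifying the quoted values:

    # Solve p * lam^p * t^(p-1) = lam for t: t = (lam^(1-p) / p)^(1/(p-1)).
    def crossover_time(p, lam):
        return (lam**(1 - p) / p)**(1 / (p - 1))

    print(crossover_time(1.1, 0.005))   # ~77.1 days
    print(crossover_time(1.2, 0.004))   # ~100.5 days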

Based on Figure 8.6 and these remarks comparing the non-constant hazard rate h(t) to the constant failure rate λ, it can be seen that the values of the hazard rate function h(t) at the rejuvenation times of 18 and 25 days (see Table 8.5) are less than the corresponding constant failure rates λ. For example, for h(t) = (1.1)(0.005^{1.1})\,t^{0.1} at time t = 30 days, h(t) = 0.00455 < λ = 0.005.

Table 8.5. Hazard/failure rate function h(t) values for Weibull distribution w.r.t time

Time (days)  h(t), p=1.2, λ=0.004  h(t), p=1.1, λ=0.005
10           0.002521              0.004076
30           0.003141              0.00455
300          0.004978              0.005728

Therefore, the results obtained in the case studies are consistent with this investigation, and in particular with the numerical values of the failure rate h(t). Moreover, this can be seen as evidence that the proposed analytical approach yields logical results for non-constant failure rates, just as it was verified for the constant failure rate cases by matching results with CTMC.


Chapter 9

Conclusion and Future Work

In this thesis, we proposed a reliability-based approach to establishing cost-effective software rejuvenation schedules for cloud-based systems. We defined an extension of DFT, called the SSP gate, which can be used to model and evaluate the reliability of a cloud-based system with multiple software spares. Our approach has been verified using CTMC for constant failure rates. We then extended our approach to non-constant failure rates, adopting the Weibull distribution to emulate the increasing failure rate due to software aging. We defined two phases for software rejuvenation, and discussed two scenarios of the rejuvenation process in Phase 2. The case studies showed that our proposed approach is feasible for non-constant failure rates, and that a rejuvenation schedule can be derived to keep the system reliability of a cloud-based software system with multiple software spare components above a certain level.

For future work, in order to forecast increasing failure rates for software components, we will develop an e-commerce application, deploy it on reputable cloud-based platforms, such as Amazon Web Services (AWS), Windows Azure, and Google App Engine, and collect empirical data related to resource degradation. Data fitting techniques will be used to derive the most suitable probability density function for the system time-to-failure. Stochastic partial differential equations may also be considered and applied in this field of study to help predict how software aging affects the failure rate. As such, more accurate results for system reliability can be used to derive preventive maintenance schedules for cloud-based systems. As an alternative and more ambitious research direction, we envision modeling and analyzing cloud-based systems with active standby spare components, which can share workload with the primary ones [34].


References

[1] K. V. Vishwanath and N. Nagappan, "Characterizing cloud computing hardware reliability," in Proc. of the 1st ACM Symposium on Cloud Computing (SoCC '10), Indianapolis, IN, USA, June 10-11, 2010, pp. 193-204.

[2] D. Fitch and H. Xu, "A RAID-based secure and fault-tolerant model for cloud information storage," International Journal of Software Engineering and Knowledge Engineering (IJSEKE), 23(5) (2013) 627-654.

[3] D. Siewiorek and R. Swarz, The Theory and Practice of Reliable System Design, Digital Press, Bedford, Massachusetts, USA, 1982.

[4] H. Pham, System Software Reliability, Springer Series in Reliability Engineering, Springer-Verlag London, 2006.

[5] A. Somani and N. Vaidya, "Understanding fault tolerance and reliability," IEEE Computer, 30(4) (1997) 45-50.

[6] E. Marshall, "Fatal error: how Patriot overlooked a Scud," Science, 255(5050) (1992) 1347.

[7] M. Grottke, R. Matias and K. S. Trivedi, "The fundamentals of software aging," in Proc. of the 1st International Workshop on Software Aging and Rejuvenation (WoSAR 2008), ISSRE, Seattle, WA, USA, November 11-14, 2008, pp. 1-6.

[8] Y. Huang, C. Kintala, N. Kolettis and N. Fulton, "Software rejuvenation: analysis, module and applications," in Proc. of the Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS '95), Pasadena, CA, USA, June 27-30, 1995, pp. 381-390.

[9] M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, "Analysis of software aging in a web server," IEEE Transactions on Reliability, 55(3) (2006) 411-420.

[10] V. Castelli, R. E. Harper, P. Heidelberger, et al., "Proactive management of software aging," IBM Journal of Research and Development, 45(2) (2001) 311-332.

[11] L. Jiang and G. Xu, "Modeling and analysis of software aging and software failure," Journal of Systems and Software, 80(4) (2007) 590-595.

[12] A. Bobbio, M. Sereno and C. Anglano, "Fine grained software degradation models for optimal rejuvenation policies," Performance Evaluation, 46(1) (2001) 45-62.

[13] K. Vaidyanathan, D. Selvamuthu and K. S. Trivedi, "Analysis of inspection-based preventive maintenance in operational software systems," in Proc. of the 21st IEEE Symposium on Reliable Distributed Systems (SRDS 2002), Suita, Japan, October 13-16, 2002, pp. 286-295.

[14] T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, "Statistical non-parametric algorithms to estimate the optimal software rejuvenation schedule," in Proc. of the International Symposium on Dependable Computing, Los Angeles, CA, USA, December 2000, pp. 77-84.

[15] V. P. Koutras and A. N. Platis, "Applying software rejuvenation in a two node cluster system for high availability," in Proc. of the International Conference on Dependability of Computer Systems, Szklarska Poreba, May 25-27, 2006, pp. 175-182.

[16] F. Machida, A. Andrzejak, R. Matias and E. Vicente, "On the effectiveness of Mann-Kendall test for detection of software aging," in Proc. of the IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pasadena, CA, USA, November 4-7, 2013, pp. 269-274.

[17] M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, "Analysis of software aging in a web server," IEEE Transactions on Reliability, 55(3) (2006) 411-420.

[18] Y. Bao, X. Sun and K. Trivedi, "A workload-based analysis of software aging and rejuvenation," IEEE Transactions on Reliability, 54(3), September 2005.

[19] D. Cotroneo, R. Natella and R. Pietrantuono, "Is software aging related to software metrics?" in Proc. of the IEEE Second International Workshop on Software Aging and Rejuvenation (WoSAR), San Jose, CA, USA, November 2, 2010, pp. 1-6.

[20] F. Machida, D. Kim and K. Trivedi, "Modeling and analysis of software rejuvenation in a server virtualized system," in Proc. of the IEEE Second International Workshop on Software Aging and Rejuvenation (WoSAR), 2010, pp. 1-6.

[21] T. Thein and S. Park, "Availability modeling and analysis on virtualized clustering with rejuvenation," IJCSNS International Journal of Computer Science and Network Security, 8(9), September 2008.

[22] T. Thein and S. Park, "Availability analysis of application servers using software rejuvenation and virtualization," Journal of Computer Science and Technology, 24(2) (2009) 339-346.

[23] D. Bruneo, F. Longo, A. Puliafito, M. Scarpa and S. Distefano, "Software rejuvenation in the cloud," in Proc. of the 5th International ICST Conference on Simulation Tools and Techniques (SIMUTOOLS '12), 2012, pp. 8-16.

[24] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica and M. Zaharia, "Above the clouds: a Berkeley view of cloud computing," Technical Report No. UCB/EECS-2009-28, University of California at Berkeley, USA, February 10, 2009.

[25] J. Varia, "Architecting for the cloud: best practices," Amazon Web Services, January 2010.

[26] N. Smith, "Why more businesses are using cloud computing," based on the survey sponsored by CompTIA, a nonprofit IT industry association, July 25, 2012; retrieved on July 29, 2012 from http://www.cnbc.com/id/48319526.

[27] M. Eisen, "Introduction to virtualization," The Long Island Chapter of the IEEE Circuits and Systems (CAS) Society, April 28, 2011, https://www.ieee.li/pdf/viewgraphs/introduction_to_virtualization.pdf.

[28] M. Grottke et al., "An empirical investigation of fault types in space mission system software," in Proc. of the IEEE International Conference on Dependable Systems and Networks, 2010, pp. 447-456.

[29] Software Reliability - Software Bugs, http://srel.ee.duke.edu/.

[30] J. B. Dugan, S. J. Bavuso and M. A. Boyd, "Dynamic fault-tree models for fault-tolerant computer systems," IEEE Transactions on Reliability, 41(3) (1992) 363-377.

[31] M. Rausand and A. Høyland, System Reliability Theory: Models, Statistical Methods, and Applications, Second Edition, John Wiley & Sons, Inc., Hoboken, New Jersey, USA, 2004.

[32] J. Barr, A. Narin and J. Varia, "Building fault-tolerant applications on AWS," Amazon Web Services (AWS), Amazon, October 2011; retrieved on July 15, 2015.

[33] G. Robinson, A. Narin and C. Elleman, "Using Amazon Web Services for disaster recovery," Amazon Web Services (AWS), Amazon, October 2011; retrieved on July 15, 2015.

[34] L. Huang and Q. Xu, "Lifetime reliability for load-sharing redundant systems with arbitrary failure distributions," IEEE Transactions on Reliability, 59(2) (2010) 319-330.

[35] J. Rahme and H. Xu, "Reliability-based software rejuvenation scheduling for cloud-based systems," in Proc. of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), Pittsburgh, USA, July 6-8, 2015, pp. 298-303.

[36] J. Rahme and H. Xu, "A software reliability model for cloud-based software rejuvenation using dynamic fault trees," International Journal of Software Engineering and Knowledge Engineering (IJSEKE), 25(9-10) (2015) 1491-1513.

[37] J. Rahme and H. Xu, "Dependable and reliable cloud-based systems using multiple software spare components," to appear in Proc. of the International Conference on Advanced and Trusted Computing (ATC-17), San Francisco Bay Area, CA, USA, August 4-8, 2017.

[38] J. Rahme and H. Xu, "Preventive maintenance for cloud-based software systems subject to non-constant failure rates," to appear in Proc. of the International Conference on Advanced and Trusted Computing (ATC-17), San Francisco Bay Area, CA, USA, August 4-8, 2017.