
QUASI-SYNCHRONOUS CHECKPOINTING AND FAILURE RECOVERY IN DISTRIBUTED SYSTEMS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

D. Manivannan, B.S., M.S.

*****

The Ohio State University

1997

Dissertation Committee:

Dr. Mukesh Singhal, Adviser

Dr. Neelam Soundararajan

Dr. Anish Arora

Approved by: Adviser, Department of Computer and Information Science

UMI Number: 9801742

Copyright 1997 by Manivannan, D.

All rights reserved.

© Copyright by

D. Manivannan

1997

ABSTRACT

Checkpointing and rollback recovery are widely used for achieving fault-tolerance in distributed systems. When the state of a process is saved periodically, the saved states are called checkpoints of the process. A set of checkpoints, one from each process, is called a consistent global checkpoint if none of them causally happened before any other checkpoint in the set. In rollback recovery, processes roll back to a consistent global checkpoint when a failure occurs. Consistent global checkpoints of a distributed computation have applications not only in failure recovery but also in debugging distributed programs, output commit, monitoring distributed events, protocol specification and verification, and others.

When processes take checkpoints independently, some of the checkpoints may not be part of any consistent global checkpoint. In this thesis, we present a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints containing a target set of checkpoints. We illustrate the application of our results by presenting a simple and elegant algorithm for enumerating all consistent global checkpoints containing a target set of checkpoints.

We also present a characterization and classification of quasi-synchronous checkpointing algorithms, i.e., checkpointing algorithms which allow processes to take checkpoints independently as well as force processes to take communication-induced checkpoints. The classification helps analyze the properties and limitations of such algorithms and also provides guidelines for designing and evaluating new checkpointing algorithms. This classification also sheds light on some important open problems.

Our classification of quasi-synchronous checkpointing algorithms helped us design a new low-overhead quasi-synchronous checkpointing algorithm which makes every checkpoint useful in the sense that every checkpoint is part of a consistent global checkpoint. This property of the checkpointing algorithm is especially helpful in minimizing the rollback distance during failure recovery, because a failed process needs to roll back only to its latest checkpoint.

Based on the checkpointing algorithm, we also present an asynchronous recovery algorithm which can handle concurrent failure of multiple processes. Unlike existing algorithms, our recovery algorithm does not use vector timestamps to track dependency. Moreover, it uses selective message logging to cope with the messages lost due to rollback.

To my Parents, Wife, and Children

ACKNOWLEDGMENTS

I would like to express my sincere gratitude and thanks to my advisor Prof. Mukesh Singhal for his encouragement and guidance throughout my graduate studies. It is only through many stimulating discussions with him that I have been able to improve this dissertation both in content and presentation. Working with him has been a great pleasure. I would like to thank Prof. Neelam Soundararajan and Prof. Anish Arora for serving on my dissertation committee and providing me valuable feedback on my thesis.

I would like to thank Prof. Neelam Soundararajan, Prof. D. Jayasimha, and Prof. Dhabaleswar Panda for their guidance and encouragement during my graduate studies. I would like to thank Prof. Robert H. B. Netzer, Department of Computer Science, Brown University, for many useful email discussions on zigzag paths between checkpoints of a distributed computation.

I would also like to thank my wife Bala Soulossana and children Suganya and Vasudevan for their patience in putting up with my long hours of work, forgetfulness, impatience, etc., and for providing me with their love and support during the preparation of this dissertation. Finally, I would like to thank my parents for giving me the freedom to choose whatever I wanted to do.

VITA

May 5, 1953 ...... Born - Melmayil, Tamil Nadu, India.

1973 ...... B.S., Mathematics, University of Madras, Madras, India.

1992 ...... M.S., Mathematics, The Ohio State University, Columbus, Ohio, USA.

1993 ...... M.S., Computer Science, The Ohio State University, Columbus, Ohio, USA.

PUBLICATIONS

Research Publications

D. Manivannan, Robert H. B. Netzer and M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation", IEEE Transactions on Parallel and Distributed Systems, 8(6):623-627, June 1997.

D. N. Jayasimha, D. Manivannan, Jeff A. May, Loren Schwiebert, and Stephen L. Hary, "A Foundation for Designing Deadlock-free Routing Algorithms in Wormhole Networks", Proceedings of the International Symposium on Parallel and Distributed Processing, 190-197, New Orleans, October 1996.

D. Manivannan and M. Singhal, "A Low-overhead Recovery Technique using Quasi-synchronous Checkpointing", Proceedings of the 16th International Conference on Distributed Computing Systems, 100-107, Hong Kong, May 1996.

D. Manivannan and M. Singhal, "Decentralized Token Generation Scheme for Token-Based Algorithms", International Journal of Computer Systems Science and Engineering, 11(1):45-54, January 1996.

D. Manivannan and M. Singhal, "An Efficient Fault-tolerant Mutual Exclusion Algorithm for Distributed Systems", Proceedings of the ISCA International Conference on Parallel and Distributed Computing Systems, 525-530, October 1994.

FIELDS OF STUDY

Major Field: Computer and Information Science

Studies in:
Distributed Systems: Prof. Mukesh Singhal
Computer Architecture: Prof. Dhabaleswar Panda
Programming Languages: Prof. Neelam Soundararajan

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... xi

List of Figures ...... xii

Chapters:

1. Introduction ...... 1

1.1 Background and Motivation ...... 1
1.2 System Model and Notations ...... 3
1.3 Related Work ...... 5
1.3.1 Consistent Global Checkpoints ...... 5
1.3.2 Checkpointing Algorithms ...... 6
1.3.3 Recovery Algorithms ...... 8
1.4 Problem Statement ...... 10
1.5 Organization of the Thesis ...... 12

2. Finding Consistent Global Checkpoints in a Distributed Computation . . 13

2.1 Introduction ...... 13
2.2 Background and Related Work ...... 16
2.3 Finding Consistent Global Checkpoints ...... 23
2.3.1 Extending S to a Consistent Global Checkpoint ...... 24
2.3.2 Minimal and Maximal Consistent Global Checkpoints ...... 27
2.3.3 Algorithm for Enumerating Consistent Global Checkpoints ...... 32
2.4 Finding Z-paths in a Distributed Computation ...... 35
2.5 Summary of Results ...... 40

3. Characterization and Classification of Checkpointing Algorithms ...... 41

3.1 Introduction ...... 41
3.1.1 Objectives ...... 42
3.2 A Characterization of Quasi-Synchronous Checkpointing ...... 43
3.3 Classification of Quasi-Synchronous Checkpointing ...... 46
3.3.1 Strictly Z-path Free Checkpointing ...... 46
3.3.2 Z-path Free Checkpointing ...... 53
3.3.3 Z-cycle Free Checkpointing ...... 59
3.3.4 Partially Z-cycle Free Checkpointing ...... 63
3.4 Discussion ...... 65
3.5 Summary of Results ...... 67

4. Quasi-Synchronous Checkpointing Algorithm ...... 69

4.1 Introduction ...... 69
4.2 The Algorithm ...... 70
4.2.1 Informal Description of the Algorithm ...... 70
4.2.2 An Example ...... 72
4.3 Consistent Global Checkpoint Collection ...... 73
4.3.1 Correctness of Consistent Global Checkpoint Collection ...... 74
4.4 An Overhead Analysis ...... 77
4.5 Comparison with Existing Algorithms ...... 83
4.6 Summary of Results ...... 84

5. Recovery Under Single Process Failure ...... 85

5.1 Introduction ...... 85
5.2 Basic Recovery Algorithm ...... 86
5.2.1 An Explanation of the Basic Recovery Algorithm ...... 86
5.3 A Comprehensive Recovery Algorithm ...... 87
5.3.1 A Message Classification ...... 89
5.3.2 Restoring the Processes to a Consistent Global Checkpoint ...... 91
5.3.3 Handling Messages ...... 92
5.3.4 Formal Description of the Algorithm ...... 98
5.3.5 Correctness of the Algorithm ...... 100
5.4 Comparison With Existing Work ...... 103
5.5 Summary of Results ...... 104

6. Recovery Under Concurrent Failure of Multiple Processes ...... 105

6.1 Introduction ...... 105
6.2 System Model ...... 105
6.3 A Comprehensive Recovery Algorithm ...... 106
6.3.1 Basic Idea ...... 108
6.3.2 Handling Rollback Messages ...... 112
6.3.3 Handling Application Messages ...... 117
6.3.4 Message Logging and Message Replaying ...... 119
6.3.5 Formal Description of the Complete Process Recovery Algorithm ...... 120
6.4 Correctness Proof ...... 124
6.4.1 Reducing Message Overhead ...... 128
6.4.2 Asynchronous Garbage Collection ...... 129
6.5 Comparison With Existing Work ...... 130
6.6 Summary of Results ...... 133

7. Summary and Future Research ...... 134

7.1 Summary ...... 134
7.2 Future Research Directions ...... 138

Bibliography ...... 140

LIST OF TABLES

Table Page

6.1 Comparison with related work (N is the number of processes in the system, f is the maximum number of failures of any single process, and F is the total number of recovery lines established) ...... 132

LIST OF FIGURES

Figure Page

1.1 Space-time diagram of a distributed computation ...... 4

2.1 Example execution: B and C can be used to construct a consistent global checkpoint (dashed line) but A and C, or B and D, cannot ...... 14

2.2 Z-paths: (a) Z-path from A to B and Z-cycle involving C. (b) Any cut through A and B is inconsistent, as well as any cut involving C (all the dashed lines are inconsistent) ...... 18

2.3 The Z-cone and the C-cone associated with a set of checkpoints S ...... 27

2.4 The minimal and the maximal consistent global checkpoints containing a target set S ...... 29

2.5 Algorithm for computing all consistent global checkpoints containing S ...... 33

2.6 A distributed computation ...... 37

2.7 The R-graph of the distributed computation in Figure 2.6 ...... 37

3.1 A distributed computation with asynchronous checkpointing ...... 42

3.2 Non-causal Z-paths ...... 44

3.3 SZPF Checkpointing ...... 48

3.4 Checkpointing in NRAS method ...... 50

3.5 Checkpointing in CAS method ...... 51

3.6 Checkpointing in CBR method ...... 52

3.7 Checkpointing in CASBR method ...... 53

3.8 ZPF Checkpointing ...... 54

3.9 ZCF Checkpointing ...... 60

3.10 Relationship between the various checkpointing models proposed. . . 67

4.1 Example illustrating the checkpointing algorithm ...... 72

4.2 Communication-induced checkpointing coordination: (a) asynchronous checkpointing and communication pattern; (b) quasi-synchronous checkpointing for the same communication pattern ...... 78

4.3 Communication-induced checkpointing coordination: (a) asynchronous checkpointing and communication pattern; (b) checkpointing using our algorithm ...... 82

5.1 Various Types of Messages ...... 89

5.2 Handling of messages during recovery ...... 94

6.1 An example with concurrent failure of multiple processes ...... 107

6.2 Handling concurrent failure of multiple processes ...... 110

6.3 Recovery line established after concurrent multiple failures ...... 111

6.4 Partial recovery lines established due to concurrent failures ...... 113

6.5 Recovery lines established after complete recovery ...... 116

CHAPTER 1

INTRODUCTION

A distributed system is a set of computers connected by a communication network. Due to enormous advances in microprocessor technology and the availability of high-speed networks, distributed systems are replacing expensive centralized systems. Distributed systems can meet the increasing demand for both higher throughput and higher availability.

1.1 Background and Motivation

A distributed computation consists of a set of asynchronous processes running in a distributed system. These processes cooperate with each other in achieving a common goal. They do not share a global memory or a global clock and communicate solely by passing messages over the communication network. A checkpoint of a process is a recorded state of the process. A global checkpoint of a distributed computation is the union of checkpoints of the individual processes involved in the computation. When processes record their local states independently without any coordination, the union of all such local states may not capture any meaningful state of the computation. A global checkpoint is said to be consistent if it represents a snapshot of the process states that actually occurred simultaneously during the execution or had the potential of doing so [11].

Recording consistent global checkpoints of a distributed computation is an important paradigm and it finds applications at several places in distributed systems design [30], such as transparent failure recovery [29], distributed debugging [17, 19, 20], monitoring distributed events [49], setting distributed breakpoints [38], protocol specification and verification [18], and others [30, 56]. In the absence of a shared memory and a global clock, recording consistent global checkpoints of a distributed computation is difficult. When processes record their states periodically without any coordination with other processes, there is no efficient algorithm to construct consistent global checkpoints from the recorded local states of the processes. To solve this problem, we present a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints. Based on the framework, we present a simple and elegant algorithm for enumerating all consistent global checkpoints of a distributed computation. We also present a characterization and classification of quasi-synchronous checkpointing algorithms which led us to design low-overhead checkpointing and recovery algorithms.

The rest of this chapter is organized as follows. In the next section, we present the system model and some of the notations used. In Section 1.3, we summarize related work. In Section 1.4, we state the problems solved in this thesis, and in Section 1.5 we present the organization of the thesis.

1.2 System Model and Notations

The distributed computation we consider consists of N spatially separated asynchronous processes denoted by P_1, P_2, ..., P_N. The processes do not share a global memory or a global clock. They communicate with each other solely by passing messages over the communication network. The communication delay is finite but unpredictable. The execution of a process produces a sequence of events. Process execution and message transfer are asynchronous. A process can execute an event spontaneously; after sending a message, a process does not have to wait for the delivery of the message to be complete. No assumption is made about the FIFO (First-In First-Out) nature of the communication channels.

The execution of a process is modeled by four types of events: the send event of a message, the receive event of a message, a local event, and a checkpoint event. Figure 1.1 shows the space-time diagram of a distributed computation consisting of three processes. A horizontal line represents the progress of a process with time and a slanted arrow indicates a message transfer. Various types of events are also shown in the figure. Checkpoints of processes are denoted by letters A, B, C, .... The states of the processes involved in a distributed computation depend on one another due to interprocess communication.

Figure 1.1: Space-time diagram of a distributed computation. (Legend: message send event, message receive event, local event, and checkpoint event.)

Lamport's happened before relation [32] on events, denoted →, is defined as the transitive closure of the union of two other relations (→_xo ∪ →_msg). The relation →_xo captures the order in which the local events of a process are executed [40]: the i-th event of any process P_p (denoted e_{p,i}) always executes before the (i+1)-th event e_{p,i+1}. The relation →_msg captures the relation between the send and receive events of the same message: if a is the send event of a message and b is the corresponding receive event of the same message, then a →_msg b. If a is ordered before b by → (i.e., a → b), we say that a causal path exists from a to b. The events of any process are totally ordered under Lamport's happened before relation →. However, if the events of all the processes involved in a distributed computation are taken into consideration, they are not totally ordered by →. If neither a nor b happens before the other, we say that they are unordered or concurrent. Using Lamport's happened before relation, a consistent global checkpoint of a distributed computation can be defined as follows:

Definition 1. A set of checkpoints S = {C_1, C_2, ..., C_N} of a distributed computation is called a consistent global checkpoint if and only if C_p ↛ C_q for all 1 ≤ p, q ≤ N, where ↛ denotes the negation of →, and for all r, 1 ≤ r ≤ N, C_r is a checkpoint of process P_r.

In Figure 1.1, {A, B, C} is not a consistent global checkpoint since B → C. However, {E, D, C} is a consistent global checkpoint.

1.3 Related Work

In this section, we present an overview of the existing results which address the problem of finding consistent global checkpoints. We also summarize the existing checkpointing and recovery algorithms related to our work.

1.3.1 Consistent Global Checkpoints

Recording consistent global checkpoints finds applications in many areas of distributed systems such as transparent failure recovery, distributed debugging, protocol specification and verification, and others. One way of recording consistent global checkpoints is to allow processes to checkpoint their local states periodically; when there is a need to determine consistent global checkpoints, a process collects information about the causal dependencies among the checkpoints and determines which local checkpoints form a consistent global checkpoint.
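The dependency check just described can be sketched as follows. This is a hedged illustration with an invented event model (events are (process, index) pairs), not an algorithm from this thesis: happened-before is the transitive closure of program-order and send-to-receive edges, and a candidate set of checkpoints is consistent when its members are pairwise unordered.

```python
from itertools import product

def happened_before(num_events, messages):
    """Transitive closure of program order plus send->receive edges.
    num_events[p] = number of events of process p;
    messages = list of (send_event, receive_event) pairs."""
    edges = set(messages)
    for p, n in num_events.items():
        for i in range(n - 1):
            edges.add(((p, i), (p, i + 1)))      # program order within p
    events = [(p, i) for p, n in num_events.items() for i in range(n)]
    hb = set(edges)
    for k, a, b in product(events, events, events):  # Floyd-Warshall closure
        if (a, k) in hb and (k, b) in hb:
            hb.add((a, b))
    return hb

def is_consistent(candidate, hb):
    """A set of checkpoints is consistent iff pairwise unordered under hb."""
    return not any((a, b) in hb for a in candidate for b in candidate if a != b)
```

For instance, with a single message from P0's first event to P1's second event, P0's second event and P1's first event remain concurrent, so they can belong to the same consistent global checkpoint.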

The definition of consistency states that for a set S of local checkpoints to be a consistent global checkpoint, S must contain one local checkpoint from each of the N processes, and no causal path should exist between any two checkpoints in S. However, when |S| < N, the mere absence of causal paths between the checkpoints in S is not sufficient by itself to ensure that local checkpoints from processes not represented in S can be combined with S to form a consistent global checkpoint. Netzer and Xu [40] define a generalization of causal paths called zigzag paths and prove that the absence of zigzag paths between checkpoints in S guarantees that S can be extended to a consistent global checkpoint. The concept of zigzag path expresses the exact conditions for consistency and hence is a powerful notion for reasoning about consistent states. The notion of zigzag paths has been used recently in several problems [2, 3, 4, 6, 5]. Although Netzer and Xu prove the exact conditions under

which a set of checkpoints S can be used to build a consistent global checkpoint, they do not discuss how to construct the consistent global checkpoints containing S.

Wang [56] builds on the results of Netzer and Xu and presents methods for constructing the maximal and minimal consistent global checkpoints containing a given set of local checkpoints. However, he does not address the issue of finding all the consistent global checkpoints of a distributed computation. In this thesis, we build the theoretical framework necessary for constructing consistent global checkpoints and present an algorithm for constructing them.
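A simplified way to see why zigzag paths generalize causal paths: since the next message in a zigzag path may be sent in the interval in which the previous one is received, even before that receive, only the checkpoint intervals matter, and zigzag reachability reduces to plain graph reachability over interval-to-interval edges. The sketch below illustrates this idea in the spirit of Netzer and Xu; it is not the R-graph construction of Chapter 2, and the interval encoding is invented for illustration.

```python
from collections import defaultdict

def zigzag_reachable(messages, start, target):
    """messages: (send_interval, recv_interval) pairs, where an interval is a
    (process, index) pair naming the checkpoint interval in which the send or
    receive occurs.  Returns True if a zigzag path leads from interval
    `start` to interval `target`."""
    adj = defaultdict(set)
    for s, r in messages:
        adj[s].add(r)                # message edge between intervals
    seen, stack = {start}, [start]
    while stack:                     # ordinary reachability search
        for nxt in adj[stack.pop()]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False
```

Calling the function with `start == target` also detects a Z-cycle: a zigzag path from a checkpoint interval back to itself, which is exactly what makes a checkpoint useless.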

1.3.2 Checkpointing Algorithms

Checkpointing helps in achieving fault-tolerance in distributed systems. One of the primary objectives of checkpointing is to maintain at least one consistent global checkpoint in stable storage all the time so that when a failure occurs, processes can be restarted from the consistent global checkpoint. To reduce the amount of computation lost due to rollback, the consistent global checkpoint to which processes roll back in the event of a failure should be close to the maximum recoverable state [45]. Moreover, in order to make the checkpointing effort useful, processes should take checkpoints in such a way that every checkpoint is part of a consistent global checkpoint.

Several checkpointing algorithms have been proposed in the literature [8, 28, 29, 34, 36, 42, 14, 31, 51, 52, 54, 57]. These checkpointing algorithms have traditionally been classified as asynchronous and synchronous. In asynchronous checkpointing [8, 31, 52], processes take checkpoints periodically without any coordination with each other. To recover from a failure, a failed process rolls back to its latest checkpoint and communicates with other processes to determine if their current states are causally related to its current state. If they are, the processes that received the messages responsible for the causal dependencies roll back to eliminate them [45]. This process is repeated until the local states of all the processes are free from causal dependencies. Thus, recovery may suffer from the domino effect, in which processes roll back recursively while determining a consistent global checkpoint.

Message logging [24, 25, 47, 50, 55] and message reordering [53] have been suggested in the literature to cope with the domino effect. Asynchronous checkpointing requires multiple checkpoints to be stored at each process; thus, the storage requirement may be large. This problem can be solved by periodically establishing a recovery line, a globally consistent set of checkpoints, and deleting all checkpoints that precede the recovery line. When processes take checkpoints asynchronously, some of the checkpoints taken may be useless (i.e., may not be part of any consistent global checkpoint).
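The rollback propagation that causes the domino effect can be illustrated with a toy fixpoint loop. This is a hedged sketch with invented data structures, not a protocol from the literature: undoing a send orphans the corresponding receive, which forces the receiver back to an earlier checkpoint, possibly cascading across processes.

```python
def domino_rollback(checkpoints, current, messages, failed):
    """checkpoints[p]: ascending local checkpoint 'times' of process p.
    current[p]: p's current local time (times are local to each process;
    there is no global clock).  messages: (sender, send_time, receiver,
    recv_time) tuples.  The failed process restores its latest checkpoint;
    rollbacks then propagate while any message's receive survives although
    its send has been undone."""
    restart = dict(current)
    restart[failed] = checkpoints[failed][-1]
    changed = True
    while changed:
        changed = False
        for s, st, r, rt in messages:
            # orphan receive: send undone, but receive still in r's state
            if st > restart[s] and rt <= restart[r]:
                earlier = [c for c in checkpoints[r] if c < rt]
                point = earlier[-1] if earlier else 0
                if point < restart[r]:
                    restart[r] = point
                    changed = True
    return restart
```

A single failure of one process can thus drag several others back through a chain of message dependencies, which is exactly the behavior the quasi-synchronous algorithms of this thesis are designed to prevent.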

In synchronous checkpointing schemes, domino-free recovery is achieved by sacrificing process autonomy and incurring extra message overhead during checkpointing. In this approach, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [14, 29, 34]. The storage requirement for the checkpoints is minimal because each process keeps only one checkpoint in stable storage at any given time. However, process execution may have to be suspended during the checkpointing coordination as in [27, 29], resulting in performance degradation. Moreover, in large systems, the cost of synchronization can be prohibitive. On the other hand, recovery is simple because when a failure occurs, all processes have to roll back to their only checkpoint. To overcome the disadvantages of synchronous and asynchronous checkpointing, quasi-synchronous checkpointing has been proposed in the literature [28, 51, 54, 57].

The primary goal of a quasi-synchronous checkpointing algorithm is to allow processes to take checkpoints asynchronously and to minimize the number of useless checkpoints by forcing processes to take communication-induced checkpoints at appropriate places, in addition to the checkpoints taken asynchronously. We classify and characterize quasi-synchronous checkpointing algorithms based on the extent to which they minimize the number of useless checkpoints. Using the insight gained from the classification, we develop an efficient quasi-synchronous checkpointing algorithm that eliminates useless checkpoints without causing much overhead.
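A communication-induced (forced) checkpoint rule can be sketched as follows, in the spirit of sequence-number-based quasi-synchronous protocols; the actual Chapter 4 algorithm may differ in detail, and the class and method names here are invented for illustration. Each process piggybacks its checkpoint sequence number on every message; a receiver whose number lags takes a forced checkpoint before processing the message.

```python
class Process:
    def __init__(self):
        self.sn = 0                  # sequence number of the latest checkpoint
        self.checkpoints = []        # (kind, sn) records, for illustration

    def basic_checkpoint(self):      # taken periodically, asynchronously
        self.sn += 1
        self.checkpoints.append(("basic", self.sn))

    def send(self):
        return self.sn               # piggybacked control information

    def receive(self, m_sn):
        # receiver lags behind the sender: take a forced
        # (communication-induced) checkpoint before processing the message
        if m_sn > self.sn:
            self.sn = m_sn
            self.checkpoints.append(("forced", self.sn))
```

Note how little control information travels on each message: a single integer, in contrast with the vector timestamps used by many recovery protocols.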

1.3.3 Recovery Algorithms

In the literature, several recovery algorithms have appeared [9, 13, 21, 24, 26, 25, 41, 39, 48, 47, 50] in the past decade. The survey paper of Elnozahy et al. [15] serves as a good source of reference for the rollback-recovery protocols proposed in the literature.

Checkpointing along with message logging is used in many recovery algorithms to speed up recovery and also to restore the system to the maximum recoverable state when a failure occurs. Message logging has been classified as optimistic or pessimistic [50].

In pessimistic message logging, all messages received by a process are logged into stable storage before being processed [9, 39]. When a process fails, its last checkpoint is restored and the logged messages that were received after the checkpointed state are replayed in the order they were received. Pessimism in logging ensures that no other process needs to roll back. However, it results in performance degradation since every message must be logged before being processed. Therefore, it is not a desirable scheme when failures are rare and the number of messages exchanged is high.

In optimistic message logging, a process stores the received messages in volatile memory and flushes them to stable storage periodically [13, 24, 41, 48, 47, 50]. Since the contents of the volatile memory are lost when a failure occurs, some of the messages received may not be in the stable storage and hence cannot be replayed. Thus, some of the states of the failed process cannot be recreated. States of other processes that depend on the lost states of the failed process become orphans. The recovery protocol must roll back these orphan states to non-orphan states. Recovery protocols based on optimistic message logging have better performance than recovery protocols based on pessimistic message logging if failures are rare.
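The orphan computation is a fixpoint: a state is orphaned if it depends, directly or transitively, on a lost state of the failed process. A hedged sketch follows; the state names and the dependency map are invented for illustration, and real protocols track these dependencies with timestamps rather than an explicit graph.

```python
def find_orphans(states, depends_on, lost):
    """states: all surviving state intervals.  depends_on[s]: the set of
    states that s directly depends on (via messages it received).  lost:
    states of the failed process that were never flushed to stable storage
    and cannot be recreated.  Returns the surviving states that must be
    rolled back (the orphans)."""
    tainted = set(lost)
    changed = True
    while changed:                   # propagate taint to a fixpoint
        changed = False
        for s in states:
            if s not in tainted and depends_on.get(s, set()) & tainted:
                tainted.add(s)
                changed = True
    return tainted - set(lost)
```

For example, if the failed process P1 loses state p1:2, a state of P2 built on a message sent from p1:2 is an orphan, and so is any later state of P3 that heard from that orphaned state of P2.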

Strom and Yemini [50] introduced the area of optimistic recovery using checkpointing. Their recovery technique, however, suffers from the domino effect. As a result, when a failure occurs, a process may have to roll back an exponential number of times with respect to N. Moreover, this algorithm tolerates only a single failure and requires the channels to be FIFO. Sistla and Welch [47] presented an optimistic recovery protocol that avoids worst-case exponential rollbacks through synchronization between processes during recovery. Their protocol also requires the channels to be FIFO and can handle only a single failure.

Johnson and Zwaenepoel [24] present a general model for recovery in message logging systems. Based on the model, they present a centralized algorithm for determining the maximum recoverable system state at any given time. Peterson and Kearns [41] present a synchronous recovery protocol based on vector time. Their protocol cannot handle concurrent multiple failures. Richard and Singhal [21] present a recovery algorithm based on asynchronous checkpointing and optimistic message logging. Their algorithm also uses vector time to track dependency. Smith et al. [48] presented a completely asynchronous, optimistic recovery protocol which can handle concurrent failure of multiple processes. They also use vector timestamps. The algorithm of Damani and Garg [13] is based on the notion of a fault-tolerant vector clock, which helps in tracking causal dependencies in spite of failures. It also uses a history mechanism to detect orphan states and obsolete messages. These two mechanisms, along with checkpointing, are used to restore the system to a consistent state after the failure of one or more processes.

1.4 Problem Statement

As we mentioned earlier, consistent global checkpoints have many applications in distributed computations. A central question in applications that use consistent global checkpoints is to determine whether a consistent global checkpoint that includes a given set of local checkpoints can exist, and to determine one or all consistent global checkpoints that contain a given set of local checkpoints. Netzer and Xu [40] presented the necessary and sufficient conditions under which such a consistent global checkpoint can exist, but they did not explore how such consistent global checkpoints can be constructed. In this thesis, we present a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints containing a target set of checkpoints. As an application of our results, we present a simple and elegant algorithm for enumerating all consistent global checkpoints containing a target set of checkpoints, which can be used for enumerating all consistent global checkpoints of a distributed computation.

Checkpointing algorithms have traditionally been classified as synchronous and asynchronous. There exist algorithms which are neither asynchronous nor synchronous. We call such algorithms quasi-synchronous. These three classes of algorithms were treated in the literature as if they were three disjoint classes. In this thesis, we argue that there is a containment relation among these three classes of algorithms. This view results in a finer classification and characterization of the quasi-synchronous checkpointing algorithms. This finer classification helps in analyzing the properties and limitations of quasi-synchronous checkpointing algorithms belonging to the various classes. The classification also helps in identifying the issues involved in designing quasi-synchronous checkpointing algorithms and sheds light on some open problems.

Based on the insight gained from the classification, we design a quasi-synchronous checkpointing algorithm. The checkpointing algorithm has lower checkpointing overhead compared to the existing algorithms in its class. It makes every checkpoint useful. It also facilitates easy construction of a consistent global checkpoint containing any local checkpoint, which helps speed up recovery when a failure occurs.

Based on the checkpointing algorithm, we also present an asynchronous recovery algorithm. The recovery algorithm can handle concurrent failure of multiple processes. It does not use vector timestamps to track dependency as other recovery algorithms do. As a result, the control information piggybacked on the application messages is very small. The recovery algorithm uses neither pessimistic nor optimistic message logging for the purpose of recovery. Messages are logged selectively but pessimistically, which means that only those messages that could be required for replay during recovery are logged into the stable storage.

1.5 Organization of the Thesis

In Chapter 2, we present a theoretical framework for reasoning about the consistency of checkpoints and present an algorithm for finding consistent global checkpoints of a distributed computation. In Chapter 3, we present a characterization and classification of the quasi-synchronous checkpointing algorithms. Based on the classification presented in Chapter 3, we present a low-overhead quasi-synchronous checkpointing algorithm in Chapter 4 and compare it with the existing quasi-synchronous algorithms. Based on the quasi-synchronous checkpointing algorithm presented in Chapter 4, a low-overhead recovery algorithm that can tolerate a single failure is presented in Chapter 5. In Chapter 6, we extend the recovery algorithm to handle concurrent failures of multiple processes and compare it with the existing recovery algorithms. In Chapter 7, we present a summary of the results and directions for future research.

CHAPTER 2

FINDING CONSISTENT GLOBAL CHECKPOINTS IN A DISTRIBUTED COMPUTATION

2.1 Introduction

In this chapter, the problem we address is how to determine which local checkpoints of processes can be combined to construct consistent global checkpoints. A solution to this problem forms the basis for many algorithms and protocols that must record on-the-fly consistent global checkpoints or determine post-mortem which global checkpoints are consistent. To introduce the problem, let us consider how to construct a consistent global checkpoint containing some arbitrary set S of local checkpoints from some

(but not all) processes. First, since a consistent global checkpoint contains one local checkpoint from each process, we must select one candidate checkpoint from each process not represented in S, and combine these candidates with S to form a consistent global checkpoint. However, these candidates must be selected carefully to ensure that the resulting global checkpoint is consistent. Figure 2.1 illustrates the basic problem with an example three-process execution. Given two local checkpoints such as A and C, if they are to belong to the same consistent global checkpoint, consistency requires that neither happens before the other; this condition is necessary for the checkpoints to

be consistent. In Figure 2.1, since A happens before C, any global checkpoint containing them can never be consistent, since such a checkpoint would always have a causal path between two of its local checkpoints (A and C). In contrast, checkpoints B and C have no causal path between them, and indeed they can be combined with the second checkpoint of P2 to form a consistent global checkpoint, as shown by the dashed line.

Checkpoints lying on the dashed line form a consistent global checkpoint because no two of them are causally related.


Figure 2.1: Example execution: B and C can be used to construct a consistent global checkpoint (dashed line), but A and C, or B and D, cannot.
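The consistency condition just illustrated, one checkpoint per process with no two members causally related, can be checked mechanically. The following Python sketch is illustrative only and is not part of the dissertation: checkpoints are encoded as (process, index) pairs, and happens_before is an assumed, precomputed set of causally ordered checkpoint pairs.

```python
def is_consistent_global_checkpoint(G, n_procs, happens_before):
    """Check whether the set G of local checkpoints is a consistent
    global checkpoint: one checkpoint from each of the n_procs
    processes, and no causal path between any two members."""
    one_per_process = len({p for (p, _) in G}) == len(G) == n_procs
    pairwise_unordered = not any(
        (a, b) in happens_before for a in G for b in G)
    return one_per_process and pairwise_unordered

# Hypothetical three-process execution loosely modeled on Figure 2.1:
# checkpoint (0, 1) has a causal path to checkpoint (2, 1).
hb = {((0, 1), (2, 1))}
```

With this data, {(0, 1), (1, 1), (2, 1)} fails the check because two of its members are causally ordered, while {(0, 2), (1, 1), (2, 1)} passes.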

From this observation, one might be tempted to conclude that any set of local checkpoints, no two of which are causally related, can always be used to construct a consistent global checkpoint. However, this is not true. In Figure 2.1, for example, checkpoints B and D are not causally related, but they cannot be part of a consistent global checkpoint together. There is no checkpoint in P2 that can be combined with both B and D while maintaining consistency. Because of message m4, D cannot be combined with the second checkpoint of P2 (or any earlier checkpoint in P2), since those checkpoints have a causal path to D. Similarly, because of message m3, B cannot be combined with any later checkpoint in P2, since B has a causal path to those checkpoints. Thus, no checkpoint in P2 can be used to construct a consistent global checkpoint containing both B and D.

In this chapter, we explore these issues in detail, proving precisely which local checkpoints can be used in conjunction with a given set of checkpoints S to construct consistent global checkpoints. This problem was first considered by Netzer and Xu [40], who proved the conditions necessary and sufficient for some consistent global checkpoint to be built from S, but they did not define the set of possible consistent global checkpoints or present an algorithm to construct them. We build on this work by analyzing the set of all consistent global checkpoints that can be built from S. We prove exactly which sets of local checkpoints from each process can be combined with those in S while still retaining consistency. We also present an algorithm that enumerates all such consistent global checkpoints. These results provide a deeper understanding of what constitutes consistency and can be applied to applications that use consistent global checkpoints.

Throughout this chapter, the i-th checkpoint of process Pp is denoted by Cp,i. The i-th checkpoint interval of process Pp consists of all the events that lie between its (i-1)-th and i-th checkpoints (and includes the (i-1)-th checkpoint but not the i-th). We also assume that each process has a first checkpoint representing the initial state of the process and a last checkpoint representing the final state of the process, i.e., the state reached at termination. The checkpoint recording the initial state of process Pp is Cp,0.

The rest of the chapter is organized as follows. In Section 2.2, we present the related work. In Sections 2.3 and 2.4, we present a theoretical framework for reasoning about consistency and give an algorithm for finding all the consistent global checkpoints that contain S. We also characterize the location of the maximal and minimal consistent global checkpoints, and discuss a graph (the rollback-dependency graph), introduced by Wang [56], that can be used for implementing our algorithm for finding consistent global checkpoints. In Section 2.5, we summarize the results of this chapter.

2.2 Background and Related Work

The definition of consistency states that for a set S of local checkpoints to be a consistent global checkpoint, S must contain one local checkpoint from each process, and for any two checkpoints A, B in S, neither A -> B nor B -> A holds (where A -> B denotes the existence of a causal path from A to B). As discussed above, from this observation one might conclude that, if we consider a smaller set S containing local checkpoints from some processes but not others, having all local checkpoints in S be mutually unordered is also sufficient to ensure that a consistent global checkpoint containing S can be built. However, as shown in Figure 2.1, the subtle nature of consistency makes this conclusion false. Netzer and Xu [40] originally addressed this subtlety and proved the necessary and sufficient conditions for a given set of checkpoints to be part of a consistent global checkpoint. Our results in this chapter are based on this past work [40], discussed next.

The basic subtlety is that, although no causal path may exist between any two checkpoints in S, this alone is insufficient to ensure that local checkpoints from processes not represented in S can be added to S to form a consistent global checkpoint G. The absence of causal paths does not fully capture the requirements on S to be part of a consistent global checkpoint. To determine exactly when S can be part of a consistent global checkpoint, Netzer and Xu [40] generalize the notion of causal paths. They introduce the notion of zigzag paths, which we call Z-paths for brevity. They prove that it is the absence of Z-paths between checkpoints in S which guarantees that S can be extended to a consistent global checkpoint. Since Z-paths capture the exact conditions for consistency, they are a powerful notion for reasoning about problems that involve consistent global checkpoints [2, 3, 4, 6].

Definition 2 A Z-path exists from Cp,i to Cq,j iff

1. p = q and i < j (i.e., one checkpoint precedes the other in the same process),¹ or

2. there exist messages m1, m2, ..., mn (n >= 1) such that

(a) m1 is sent by process Pp after Cp,i,

(b) if mk (1 <= k < n) is received by Pr, then mk+1 is sent by Pr in the same or a later checkpoint interval (although mk+1 may be sent before or after mk is received), and

(c) mn is received by Pq before Cq,j.

A checkpoint C is said to be in a Z-cycle iff there exists a Z-path from C to itself.
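Definition 2 suggests a direct search for zigzag chains. The Python sketch below is illustrative and not from the dissertation: each message is modeled as a tuple (sender, send_interval, receiver, recv_interval), where interval i denotes the i-th checkpoint interval of a process, and a breadth-first search looks for a chain satisfying clauses (a)-(c).

```python
from collections import deque

def z_path_exists(src, dst, msgs):
    """Return True iff a Z-path exists from checkpoint src = (p, i)
    to checkpoint dst = (q, j) under Definition 2."""
    (p, i), (q, j) = src, dst
    if p == q and i < j:
        return True                      # clause 1: same-process precedence
    # Clause 2(a): m1 must be sent by P_p after C_{p,i}, i.e., in
    # checkpoint interval i+1 or later.
    start = [m for m in msgs if m[0] == p and m[1] >= i + 1]
    seen, queue = set(start), deque(start)
    while queue:
        sender, s_int, recv, r_int = queue.popleft()
        if recv == q and r_int <= j:
            return True                  # clause (c): received before C_{q,j}
        for m in msgs:                   # clause (b): next message sent by the
            if m not in seen and m[0] == recv and m[1] >= r_int:
                seen.add(m)              # receiver in the same or later interval
                queue.append(m)
    return False

def in_z_cycle(c, msgs):
    """A checkpoint is in a Z-cycle iff a Z-path runs from it to itself."""
    return z_path_exists(c, c, msgs)
```

In a hypothetical two-message execution where one message is sent by P2 after C2,1 and the other is received by P2 before C2,1, the checkpoint C2,1 lies on a Z-cycle even though no causal cycle exists.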

The notion of Z-paths is a generalization of the notion of causal paths. It is clear that if there exists a causal path from one checkpoint to another, then that path is also a Z-path; however, the converse is not true in general. For example, in Figure 2.2, the message sequence m1, m2 establishes a Z-path from A to B, but this message sequence does not form a causal path from A to B. Thus, a causal path is always a Z-path, but a Z-path need not be a causal path. The existence of Z-paths is also a transitive relation: if there exist Z-paths from A to B and from B to C, then there exists a Z-path from A to C.

¹Netzer and Xu's definition contains only the second clause, but it is convenient to also define a Z-path to exist from A to B if A and B belong to the same process and A precedes B.

Another difference stems from Z-paths not always defining a partial order: a Z-path can exist from a checkpoint back to itself (a Z-cycle). In contrast, causal paths never form cycles. In Figure 2.2a, a Z-cycle exists involving checkpoint C: message m3 is sent after C, m4 is sent in the same interval in which m3 is received, and m4 is received before C, completing the Z-path from C to itself (the Z-cycle).


Figure 2.2: Z-paths: (a) Z-path from A to B and Z-cycle involving C. (b) Any cut through A and B is inconsistent, as well as any cut involving C (all the dashed lines are inconsistent).

To see the significance of Z-paths between checkpoints, consider the dashed lines in Figure 2.2b, which show the global states that can include A and B, or C. The Z-path from A to B forces any global state (or cut) passing through A and B to be inconsistent, since the cut must cross either m1 or m2. Cutting through m1 means that A happened before the checkpoint taken after m1 is received; cutting through m2 means that the checkpoint taken before m2 is received happens before B. In either case, the existence of the Z-path renders it impossible to construct a consistent global checkpoint containing both A and B. Similarly, any cut passing through C is inconsistent, since it must cut across either m3 or m4, and as above the message being cut always renders the global checkpoint inconsistent.

To formally reason about Z-paths, we use the following notation.

Definition 3 Let A, B be individual checkpoints and R, S be sets of checkpoints. We define the Z-path relation ~> over checkpoints and sets of checkpoints as follows (we write A ~/> B when no Z-path exists from A to B):

1. A ~> B iff a Z-path exists from A to B,

2. A ~> S iff a Z-path exists from A to some member of S,

3. S ~> A iff a Z-path exists from some member of S to A, and

4. R ~> S iff a Z-path exists from some member of R to some member of S.

We define the causal relation -> among checkpoints and sets of checkpoints similarly, with negation -/->; thus, A -> B iff a causal path exists from A to B (i.e., A happens before B), etc. Using this notation, the results of Netzer and Xu [40] are easily described. Their basic result is that a set of checkpoints can be extended to a consistent global checkpoint if and only if there is no Z-path between any two (not necessarily distinct) checkpoints in the set.
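Given a precomputed table of pairwise Z-path facts, the set-level relations of Definition 3 and the Netzer-Xu condition reduce to one-liners. The sketch below is illustrative only; zp is an assumed set of ordered pairs (a, b) recording that a Z-path exists from checkpoint a to checkpoint b.

```python
def z_rel(R, S, zp):
    """R ~> S: a Z-path exists from some member of R to some member
    of S.  Pass singleton sets to relate individual checkpoints."""
    return any((a, b) in zp for a in R for b in S)

def extendable(S, zp):
    """Netzer and Xu's condition: S can be extended to a consistent
    global checkpoint iff no Z-path joins any two (not necessarily
    distinct) members of S."""
    return not z_rel(S, S, zp)
```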

Note that if S is any set of checkpoints, S ~/> S implies that the checkpoints in S are all from different processes, since any two checkpoints of the same process are related by a Z-path. First, we prove the following lemma, which comes in handy in proving other results. The main idea involved in the proof of this lemma is derived from Netzer and Xu [40].

Lemma 1 Let S be any set of checkpoints such that |S| < N and S ~/> S. Then, each process Pp not represented in S has at least one checkpoint, say Cp,i, such that (S ~/> Cp,i) and (Cp,i ~/> S).

Proof: Suppose the lemma is not true. Then, there exists a process Pp not represented in S such that for every checkpoint Cp,i of Pp, Cp,i ~> S or S ~> Cp,i. Let

S_left = {Cp,i | Cp,i ~> S} and S_right = {Cp,i | S ~> Cp,i}.

By our assumption, all the checkpoints of Pp are contained in (S_left ∪ S_right).

Claim: S_left ∩ S_right = ∅.

Suppose the claim is not true. Then, there exists C ∈ (S_left ∩ S_right). This means that, for some checkpoints D, E ∈ S, D ~> C and C ~> E. Since the existence of Z-paths is a transitive relation, it follows that D ~> E, contradicting the fact that S ~/> S. Hence the claim follows.

Observe that if Cp,i ~> S, then Cp,j ~> S for all j < i, since Cp,j ~> Cp,i for all j < i and ~> is a transitive relation; similarly, if S ~> Cp,i, then S ~> Cp,j for all j > i. From this observation, it follows that each checkpoint in S_left has index less than the indices of all the checkpoints in S_right.

Moreover, since no message is received by Pp before its initial checkpoint, there can be no Z-path from any of the checkpoints in S to the initial checkpoint of Pp; hence the initial checkpoint of Pp must lie in S_left. By similar reasoning, the final checkpoint of Pp must lie in S_right. So both the sets S_left and S_right are nonempty. Let Cp,l be the checkpoint with the largest index such that Cp,l ∈ S_left. Then Cp,l+1 ∈ S_right. Thus, Cp,l ~> S and S ~> Cp,l+1. This means there exists a message sequence m1, m2, ..., mn that forms a Z-path from a checkpoint C ∈ S to the checkpoint Cp,l+1, and a message sequence m'1, m'2, ..., m'k that forms a Z-path from Cp,l to a checkpoint D ∈ S. Then clearly the message sequence m1, m2, ..., mn, m'1, m'2, ..., m'k forms a Z-path from C to D. Hence S ~> S, which is a contradiction. This contradiction arose because of our assumption that every checkpoint of Pp has a Z-path from or to some checkpoint in S. Hence, every process Pp not represented in S has at least one checkpoint Cp,i such that (S ~/> Cp,i) and (Cp,i ~/> S). Hence the lemma. □

The following theorem and its two corollaries are due to Netzer and Xu [40]. We present a proof of the theorem for completeness.

Theorem 1 A set of checkpoints S can be extended to a consistent global checkpoint if and only if S ~/> S.

Proof: (<=) Suppose S ~/> S. If |S| = N, then since S ~/> S implies S -/-> S, it follows that S consists of one checkpoint from each process, and these checkpoints are pairwise causally unordered. Hence S is a consistent global checkpoint. Suppose |S| < N. Then, by Lemma 1, each process not represented in S has at least one checkpoint C such that (S ~/> C) and (C ~/> S). Let

S' = {Cp,i | Cp,i is the earliest checkpoint of a process Pp not represented in S such that (S ~/> Cp,i) and (Cp,i ~/> S)}.

By the choice of S', it follows that (S' ~/> S) and (S ~/> S'), and S' contains one checkpoint from each process not represented in S.

Claim: S' ~/> S'.

Suppose the claim is not true. Then there exist checkpoints Cp,i, Cq,j ∈ S' (not necessarily distinct) such that Cp,i ~> Cq,j. Let m1, m2, ..., mn be the message sequence that constitutes the Z-path Cp,i ~> Cq,j. Since the message mn is received by Pq before the checkpoint Cq,j, Cq,j cannot be the initial checkpoint of Pq. Hence there is at least one checkpoint of Pq that precedes Cq,j. All such checkpoints must have a Z-path to some checkpoint in S (since Cq,j is the earliest checkpoint of Pq with no Z-path to or from S, and a Z-path from S to an earlier checkpoint of Pq would extend to a Z-path from S to Cq,j). In particular, Cq,j-1 ~> S. Let m'1, m'2, ..., m'k be the message sequence that constitutes the Z-path Cq,j-1 ~> S. Clearly, the message sequence m1, m2, ..., mn, m'1, m'2, ..., m'k forms a Z-path from Cp,i to a checkpoint in S. This contradicts the fact that S' ~/> S. Hence our assumption is wrong and the claim holds.

Let T = (S ∪ S'). Clearly T contains one checkpoint from each process. Since (S' ~/> S') and (S ~/> S) and (S ~/> S') and (S' ~/> S), it follows that T ~/> T, and hence T is a consistent global checkpoint containing S.

(=>) Conversely, suppose S ~> S; we show that S cannot be part of a consistent global checkpoint. Since S ~> S, there exist two checkpoints A, B ∈ S (not necessarily distinct) such that A ~> B. We prove that A and B cannot be part of a consistent global checkpoint by induction on the number n of messages comprising the Z-path from A to B.

If n = 1, then A -> B, and hence A and B cannot be part of a consistent global checkpoint together. Now assume that a Z-path of n messages m1, ..., mn from one checkpoint to another implies that they cannot be part of the same consistent global checkpoint. We show that the existence of a Z-path of n + 1 messages m1, ..., mn, mn+1 from A to B implies that they cannot belong to the same consistent global checkpoint.

Let C be the checkpoint immediately following the receive event of mn in the process which receives mn; then m1, ..., mn constitute a Z-path from A to C. By the inductive hypothesis, A cannot belong to the same consistent global checkpoint with C or with any checkpoint following C in the same process. Thus, for A and B to belong to the same consistent global checkpoint, the global checkpoint must include a checkpoint in C's process that precedes C. However, by the definition of a Z-path, message mn+1 must be sent after any such checkpoint, meaning that any checkpoint that precedes C happens before B. Thus, B cannot be combined with any checkpoint preceding C, and A cannot be combined with C or any checkpoint succeeding C, to form a consistent global checkpoint. Hence C's process does not have any checkpoint that is consistent with both A and B, and hence A and B together cannot be part of a consistent global checkpoint. Therefore, S cannot be extended to a consistent global checkpoint. □

Corollary 1 A set of checkpoints S is a consistent global checkpoint if and only if S ~/> S and |S| = N, where N is the number of processes.

Proof: Follows from Theorem 1. □

Corollary 2 A checkpoint C can be part of a consistent global checkpoint if and only if it is not involved in a Z-cycle.

Proof: Follows by taking S = {C} in Theorem 1. □

2.3 Finding Consistent Global Checkpoints

Although Netzer and Xu proved the exact conditions under which a set of checkpoints S can be used to construct a consistent global checkpoint, they did not discuss how to actually construct all consistent global checkpoints. Our main results in this chapter concern this issue. Wang [56] presents algorithms for finding certain consistent global checkpoints that contain S, but he focuses only on the so-called minimal and maximal consistent global checkpoints (discussed below). He uses a graph called the rollback-dependency graph (or R-graph) and presents several applications of minimal and maximal consistent global checkpoints. We identify the checkpoints that can be used for constructing consistent global checkpoints containing any given set of checkpoints S, and also present an algorithm for constructing all the consistent global checkpoints containing S, showing the minimal and maximal ones as special cases. In Section 2.4, we discuss the R-graph and show how it can be used for checking the existence of Z-paths between checkpoints. Thus, the R-graph is useful for the construction of consistent global checkpoints.

2.3.1 Extending S to a Consistent Global Checkpoint

Given a set S of checkpoints such that S ~/> S, we first analyze which other checkpoints can be combined with S to construct a consistent global checkpoint. There are three important observations.

First, none of the checkpoints that have a Z-path to or from any of the checkpoints in S can be used because, by Theorem 1, no checkpoints between which a Z-path exists can ever be part of a consistent global checkpoint. Thus, only those checkpoints that have no Z-paths to or from any of the checkpoints in S are candidates. We call the set of all such candidates the Z-cone of S. Similarly, we call the set of all checkpoints that have no causal path to or from any checkpoint in S the C-cone of S.² The Z-cone and C-cone help us visualize the location of consistent global checkpoints containing S. Since causal paths are always Z-paths, the Z-cone of S is a subset of the C-cone of S, as shown in Figure 2.3 for some arbitrary S. Note that if a Z-path exists from a checkpoint in process Pi to a checkpoint in S, then a Z-path also exists from every earlier checkpoint in Pi to the same checkpoint in S (because Z-paths are transitive); causal paths are transitive as well. Formally, for a given set S of checkpoints, the Z-cone of S and the C-cone of S can be defined as follows:

Z-cone(S) = {Cp,i | (Cp,i ~/> S) and (S ~/> Cp,i)}

C-cone(S) = {Cp,i | (Cp,i -/-> S) and (S -/-> Cp,i)}

²These terms are inspired by the so-called light cone of an event e, which is the set of all events with causal paths from e (i.e., events in e's future) [33]. Although the light cone of e contains events ordered after e, we define the Z-cone and C-cone of S to be those events with no zigzag or causal ordering, respectively, to or from any member of S.


Figure 2.3: The Z-cone and the C-cone associated with a set of checkpoints S.

The second observation is that although candidates for constructing a consistent global checkpoint from S must lie in Z-cone(S), not all checkpoints in the Z-cone are usable. If a checkpoint in the Z-cone is involved in a Z-cycle, then it cannot be part of a consistent global checkpoint, by Corollary 2. From Lemma 2 below, it follows that if we remove from consideration all those checkpoints in the Z-cone that are in Z-cycles, then the remaining checkpoints are exactly those useful for constructing a consistent global checkpoint from S. That is, each such checkpoint can be combined with S to construct some consistent global checkpoint.

Definition 4 Let S be a set of checkpoints such that S ~/> S. For each process Pq, the set S_useful^q is defined as

S_useful^q = {Cq,i | (Cq,i ∈ Z-cone(S)) and (Cq,i ~/> Cq,i)}.

In addition, we define

S_useful = ∪q S_useful^q.
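Definition 4 translates directly into set comprehensions. In this illustrative sketch (not from the dissertation), checkpoints are (process, index) pairs and zp is an assumed set of pairwise Z-path facts; a pair (c, c) in zp marks a Z-cycle at c.

```python
def z_cone(S, checkpoints, zp):
    """Checkpoints with no Z-path to or from any member of S."""
    return {c for c in checkpoints
            if not any((c, s) in zp or (s, c) in zp for s in S)}

def s_useful(S, checkpoints, zp):
    """Definition 4: Z-cone members that are not involved in a Z-cycle."""
    return {c for c in z_cone(S, checkpoints, zp) if (c, c) not in zp}
```

In the small example below, checkpoint (1, 2) is excluded by the Z-path from S, and (1, 1) is in the Z-cone but excluded from S_useful by its Z-cycle.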

Lemma 2 Let S be a set of checkpoints such that S ~/> S, and let Cq,i be any checkpoint of process Pq such that Cq,i ∉ S. Then, S ∪ {Cq,i} can be extended to a consistent global checkpoint if and only if Cq,i ∈ S_useful.

Proof: Suppose S ∪ {Cq,i} can be extended to a consistent global checkpoint. Then, by Theorem 1, (S ∪ {Cq,i}) ~/> (S ∪ {Cq,i}). This implies (S ~/> Cq,i), (Cq,i ~/> S), and (Cq,i ~/> Cq,i), which implies Cq,i ∈ S_useful^q ⊆ S_useful.

Conversely, suppose Cq,i ∈ S_useful. Then Cq,i ∈ S_useful^q, since Cq,i is a checkpoint of process Pq. Hence S ~/> Cq,i, Cq,i ~/> S, and Cq,i ~/> Cq,i. Moreover, S ~/> S, and hence (S ∪ {Cq,i}) ~/> (S ∪ {Cq,i}). So, by Theorem 1, S ∪ {Cq,i} can be extended to a consistent global checkpoint. □

Lemma 2 states that, given a set S such that S ~/> S, any single checkpoint from S_useful can belong to some consistent global checkpoint that also contains S. However, our final observation is that, if we attempt to construct a consistent global checkpoint from S by choosing a subset T of checkpoints from S_useful to combine with S, we have no guarantee that the checkpoints in T have no Z-paths among them. In other words, simply because all the checkpoints in S_useful are in Z-cone(S) (and thus have no Z-paths to or from any checkpoint in S), Z-paths may still exist between members of S_useful. Therefore, we have one final constraint to place on the set T, namely, the checkpoints in T must have no Z-paths among them. Furthermore, since S ~/> S, by Theorem 1 at least one such T must exist.

Theorem 2 Let S be a set of checkpoints such that S ~/> S, and let T be any set of checkpoints such that S ∩ T = ∅. Then, S ∪ T is a consistent global checkpoint if and only if

1. T ⊆ S_useful,

2. T ~/> T, and

3. |S ∪ T| = N.

Proof: Suppose S ∪ T is a consistent global checkpoint. Let C ∈ T. S ∪ {C} can be extended to a consistent global checkpoint, since S ∪ T is one such extension. Hence C ∈ S_useful by Lemma 2. Thus, every checkpoint in T belongs to S_useful; hence T ⊆ S_useful. Since S ∪ T is a consistent global checkpoint, (S ∪ T) ~/> (S ∪ T) by Theorem 1. In particular, T ~/> T. By definition, the number of checkpoints in any consistent global checkpoint is N, so |S ∪ T| = N.

Conversely, suppose S is a set of checkpoints such that S ~/> S and T is any set of checkpoints disjoint from S satisfying the three conditions. Since T ⊆ S_useful, for any two checkpoints C and D such that C ∈ T and D ∈ S, C ~/> D and D ~/> C hold. Hence T ~/> S and S ~/> T. Moreover, S ~/> S and T ~/> T. Hence (S ∪ T) ~/> (S ∪ T). Since |S ∪ T| = N, S ∪ T is a consistent global checkpoint by Corollary 1. □

2.3.2 Minimal and Maximal Consistent Global Checkpoints

Theorem 2 shows us exactly which sets of checkpoints can be combined with S to form consistent global checkpoints. Two interesting special cases, which we discuss next, are the minimal and maximal consistent global checkpoints that contain S. Intuitively, these are the earliest and latest consistent global checkpoints containing S that can be constructed.

Definition 5 Let S be any set of checkpoints such that S ~/> S. Let M = S ∪ T be a consistent global checkpoint, where T = {Cp1,i1, Cp2,i2, ..., Cpk,ik} and T ∩ S = ∅. Then,

1. M is the maximal consistent global checkpoint containing S iff, for any consistent global checkpoint M' = S ∪ T' containing S, where T' = {Cp1,j1, Cp2,j2, ..., Cpk,jk}, we have jn <= in for all n : 1 <= n <= k.

2. M is the minimal consistent global checkpoint containing S iff, for any consistent global checkpoint M' = S ∪ T' containing S, where T' = {Cp1,j1, Cp2,j2, ..., Cpk,jk}, we have in <= jn for all n : 1 <= n <= k.

Wang [56] showed that the minimal (maximal) consistent global checkpoints containing S are those formed by choosing, from each process not represented in S, the earliest (latest) checkpoint that has no Z-path to or from any member of S. Viewed in terms of Z-cone(S), these checkpoints form what we might call the "leading" and "trailing" edges of the Z-cone, as illustrated in Figure 2.4.

This observation implies that the Z-cone possesses some interesting properties. First, the leading and trailing edges always exist. This follows from Theorems 1 and 2. In the degenerate case, the leading and trailing edges contain exactly the same checkpoints, and only one such T exists. In general, however, the leading and trailing edges may be distinct.


Figure 2.4: The minimal and the maximal consistent global checkpoints containing a target set S.

Second, the checkpoints making up the leading (and trailing) edge never have Z-paths between them (including any Z-cycles). Since the Z-cone is defined to be those checkpoints with no Z-paths to or from S, it follows that the global checkpoints containing the leading and trailing edges (i.e., the minimal and maximal consistent global checkpoints) are always consistent. These properties are embodied in the following theorems. In fact, the minimal consistent global checkpoint is constructed in the proof of Theorem 1.

Theorem 3 Let S be a set of checkpoints such that S ~/> S, and let

S_max = {Cq,i | (Cq,i ∈ S_useful^q) and (Cq,k ∉ S_useful^q for all k > i)}.

Then S_max is the maximal consistent global checkpoint containing S.

Proof: If Cp,i ∈ S, then Cp,i is the only checkpoint of Pp that belongs to S_useful^p, and hence Cp,i ∈ S_max. Thus, every checkpoint in S belongs to S_max, and hence S ⊆ S_max. Since S ~/> S, S_useful^q ≠ ∅ for every q, and hence |S_max| = N. Let S_max = S ∪ T for some T such that S ∩ T = ∅. Note that T is determined by S uniquely. If |S| = N, then T = ∅ and S itself is a consistent global checkpoint by Corollary 1, and hence it follows that S_max = S is the maximal consistent global checkpoint containing S. So, we assume |S| < N. First, we prove that S ∪ T is a consistent global checkpoint. To prove this, it is sufficient to prove that all the conditions stated in Theorem 2 are satisfied. T ⊆ S_useful since T ⊆ S_max ⊆ S_useful.

Claim: T ~/> T.

Proof of Claim: Suppose T ~> T. Since T ⊆ S_useful, C ~/> C for all C ∈ T. Hence, there exist distinct checkpoints Cp,i, Cq,j ∈ T such that Cp,i ~> Cq,j. Since Cq,j ∈ S_useful, S ∪ {Cq,j} can be extended to a consistent global checkpoint by Lemma 2. Since Cp,i ~> Cq,j, we have Cp,k ~> Cq,j for all k <= i. Also, since Cp,i ∈ T, Cp,k ∉ S_useful^p for all k > i, which implies (Cp,k ~> Cp,k) or (Cp,k ~> S) or (S ~> Cp,k) for all k > i. Thus, every checkpoint Cp,k of Pp satisfies (Cp,k ~> Cp,k) or (Cp,k ~> (S ∪ {Cq,j})) or ((S ∪ {Cq,j}) ~> Cp,k), and hence (S ∪ {Cq,j})_useful^p = ∅, which means Pp contains no checkpoint that can be used to extend S ∪ {Cq,j} to a consistent global checkpoint. This is a contradiction to Lemma 2. Hence, our assumption that T ~> T is incorrect, and hence T ~/> T. Clearly, |S ∪ T| = N. Thus, T satisfies all the conditions of Theorem 2, and hence S ∪ T = S_max is a consistent global checkpoint by Theorem 2.

Next, we prove that S_max = S ∪ T is the maximal consistent global checkpoint containing S. Let T = {Cp1,i1, Cp2,i2, ..., Cpk,ik}. Suppose T' = {Cp1,j1, Cp2,j2, ..., Cpk,jk} is a set of checkpoints such that S ∪ T' is a consistent global checkpoint. It is sufficient to prove that jn <= in for all n : 1 <= n <= k. Suppose not; then jn > in for some n. Then, for this n, Cpn,jn ∉ S_useful^pn, since Cpn,in ∈ S_max and jn > in. Hence, Cpn,jn cannot belong to any consistent global checkpoint containing S, by Lemma 2. This is a contradiction to the fact that S ∪ T' is a consistent global checkpoint containing S. Hence, jn <= in for all n : 1 <= n <= k, and S_max = S ∪ T is the maximal consistent global checkpoint containing S. □

Theorem 4 Let S be a set of checkpoints such that S ~/> S, and let

S_min = {Cq,j | (Cq,j ∈ S_useful^q) and (Cq,k ∉ S_useful^q for all k < j)}.

Then S_min is the minimal consistent global checkpoint containing S.

Proof: If Cp,i ∈ S, then Cp,i is the only checkpoint of Pp that belongs to S_useful^p, and hence Cp,i ∈ S_min. Thus, every checkpoint in S belongs to S_min, and hence S ⊆ S_min. Since S ~/> S, S_useful^q ≠ ∅ for every q, and hence |S_min| = N. Let S_min = S ∪ T for some T such that S ∩ T = ∅. Note that T is determined by S uniquely. If |S| = N, then T = ∅ and S itself is a consistent global checkpoint by Corollary 1, and hence it follows that S_min = S is the minimal consistent global checkpoint containing S. So, we assume |S| < N. First, we prove that S ∪ T is a consistent global checkpoint. To prove this, it is sufficient to prove that all the conditions stated in Theorem 2 are satisfied. T ⊆ S_useful since T ⊆ S_min ⊆ S_useful.

Claim: T ~/> T.

Proof of Claim: Suppose T ~> T. Since T ⊆ S_useful, C ~/> C for all C ∈ T. Hence, there exist distinct checkpoints Cp,i, Cq,j ∈ T such that Cp,i ~> Cq,j; moreover, Cp,i ∈ S_useful^p and Cq,j ∈ S_useful^q from the definition of T. Since Cp,i ~> Cq,j, we have Cp,i ~> Cq,k for all k > j. This implies Cp,i ~> C for all C ∈ S_useful^q, since Cq,j is the first checkpoint in S_useful^q. Since Cp,i ∈ T ⊆ S_useful, S ∪ {Cp,i} can be extended to a consistent global checkpoint by Lemma 2. Since S ⊆ (S ∪ {Cp,i}), we have (S ∪ {Cp,i})_useful^q ⊆ S_useful^q. Moreover, since Cp,i ~> C for all C ∈ S_useful^q, it follows that Cp,i ~> (S ∪ {Cp,i})_useful^q, which is a contradiction, since there can be no Z-path from Cp,i to any checkpoint in (S ∪ {Cp,i})_useful^q by the definition of the set (S ∪ {Cp,i})_useful^q. Hence T ~/> T. Clearly, |S ∪ T| = N. Hence, S ∪ T = S_min is a consistent global checkpoint by Theorem 2.

Next, we prove that S_min = S ∪ T is the minimal consistent global checkpoint containing S. Let T = {Cp1,i1, Cp2,i2, ..., Cpk,ik}. Suppose T' = {Cp1,j1, Cp2,j2, ..., Cpk,jk} is a set of checkpoints such that S ∪ T' is a consistent global checkpoint. It is sufficient to prove that in <= jn for all n : 1 <= n <= k. Suppose not; then jn < in for some n. Then, for this n, Cpn,jn ∉ S_useful^pn, since Cpn,in ∈ S_min and jn < in. Hence, Cpn,jn cannot belong to any consistent global checkpoint containing S, by Lemma 2. This contradicts the fact that S ∪ T' is a consistent global checkpoint. Hence, jn >= in for all n : 1 <= n <= k, and S_min is the minimal consistent global checkpoint containing S. □

2.3.3 Algorithm for Enumerating Consistent Global Checkpoints

So far we have shown exactly which checkpoints can be used to extend S to a consistent global checkpoint. Our next main result is an algorithm to enumerate all such consistent global checkpoints. Algorithms exist in the literature for computing various sets of consistent global checkpoints, such as for global predicate detection [12] and recovery [36], but none explicitly computes all the consistent global checkpoints that include a given set S. Wang's work [56] shows how to compute the maximal and the minimal consistent global checkpoints that contain S. Our algorithm is not targeted toward a specific application, as it simply computes the set of all consistent global checkpoints containing S, but it illustrates the use of our theoretical results in determining consistent global checkpoints. Our algorithm is novel in that it restricts its selection of checkpoints to those within Z-cone(S) and it checks for the presence of Z-cycles within the Z-cone. In the next section, we show how to perform this check using a graph, discussed originally by Wang [56], from which Z-cones and Z-paths can be detected.

ComputeAllCgs(S) {
 1     let G = ∅
 2     if S ⇝̸ S then
 3         let AllProcs be the set of all processes
 4             not represented in S
 5         ComputeAllCgsFrom(S, AllProcs)
 6     return G
 7 }
 8 ComputeAllCgsFrom(T, ProcSet) {
 9     if (ProcSet = ∅) then
10         G = G ∪ {T}
11     else
12         let Pq be any process in ProcSet
13         for each checkpoint C ∈ T_useful^q do
14             ComputeAllCgsFrom(T ∪ {C}, ProcSet \ {Pq})
15 }

Figure 2.5: Algorithm for computing all consistent global checkpoints containing S.

Our algorithm is shown in Figure 2.5. The function ComputeAllCgs(S) returns the set of all consistent global checkpoints that contain S. The crux of our algorithm is the function ComputeAllCgsFrom(T, ProcSet), which extends a set of checkpoints T in all possible consistent ways, but uses checkpoints only from processes in the set ProcSet. After verifying that S ⇝̸ S, ComputeAllCgs simply calls ComputeAllCgsFrom, passing a ProcSet consisting of the processes not represented in S (lines 2-5). The resulting consistent global checkpoints are collected in the global variable G, which is returned (line 6). It is worth noting that if S = ∅, the algorithm computes all consistent global checkpoints that exist in the execution.

The recursive function ComputeAllCgsFrom(T, ProcSet) works by choosing any process from ProcSet, say Pq, and iterating through all checkpoints C in T_useful^q. Recall that Lemma 2 states that each such checkpoint extends T part-way toward a consistent global checkpoint. This means that T ∪ {C} can itself be further extended, eventually arriving at a consistent global checkpoint. Since this further extension is simply another instance of constructing all consistent global checkpoints that contain checkpoints from a given set, we make a recursive call (line 14), passing T ∪ {C} and a ProcSet from which process Pq is removed. The recursion eventually terminates when the passed set contains checkpoints from all processes (i.e., ProcSet is empty). In this case T is finally a global checkpoint, as it contains one checkpoint from every process, and it is added to G (line 10). When the algorithm terminates, all candidates in S_useful have been used in extending S, so G contains every possible consistent global checkpoint that contains S. The following theorem argues correctness.
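As an illustration of the recursion just described, the following Python sketch (ours, not part of the dissertation) enumerates the same sets. The predicate z_path, which reports whether a Z-path exists from one checkpoint to another, is assumed to be given (it could, for instance, be implemented with the R-graph of Section 2.4), and checkpoints are modeled as (process, index) pairs:

```python
# Hypothetical sketch of ComputeAllCgs (Figure 2.5). z_path(a, b) is an
# assumed oracle for "there is a Z-path from checkpoint a to checkpoint b".

def compute_all_cgs(S, processes, checkpoints, z_path):
    """Enumerate all consistent global checkpoints containing S."""
    G = []
    # S must admit no Z-path between its own members (S must not Z-reach S).
    if any(z_path(a, b) for a in S for b in S):
        return G
    remaining = [p for p in processes if p not in {c[0] for c in S}]

    def useful(T, q):
        # Candidates of process q: not on a Z-cycle, and no Z-path to or
        # from any checkpoint already chosen (a sketch of line 13's set).
        return [c for c in checkpoints[q]
                if not z_path(c, c)
                and all(not z_path(c, t) and not z_path(t, c) for t in T)]

    def extend(T, procs):
        if not procs:
            G.append(frozenset(T))      # T is now a global checkpoint
            return
        q, rest = procs[0], procs[1:]
        for c in useful(T, q):
            extend(T | {c}, rest)

    extend(frozenset(S), remaining)
    return G
```

Here the candidate set T_useful^q of line 13 is approximated by a pairwise no-Z-path test; given the invariant T ⇝̸ T maintained by the recursion, this yields the same candidates.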

Theorem 5 Let S be a set of checkpoints and G be the set returned by the call ComputeAllCgs(S). If S ⇝̸ S, then T ∈ G if and only if T is a consistent global checkpoint containing S. That is, G contains exactly the consistent global checkpoints that contain S.

Proof: Claim: ((T ⇝̸ T) ∧ (ProcSet contains all the processes not represented in T) ∧ (S ⊆ T)) is an invariant during the execution of the algorithm, where T and ProcSet are the parameters being passed to ComputeAllCgsFrom.

Proof of Claim: It is sufficient to prove that the invariant holds before every call to ComputeAllCgsFrom. It holds before the first call because at that time T = S and the set AllProcs contains all the processes not represented in S. Note that if T ⇝̸ T and ProcSet ≠ ∅, then for each Pq ∈ ProcSet, T_useful^q ≠ ∅; this is because T_useful^q contains precisely those checkpoints of Pq that can be combined with T to extend T to a consistent global checkpoint, by Lemma 2, and each such process has at least one such checkpoint that can be combined with T to extend it to a consistent global checkpoint, by Theorem 1; also, for each checkpoint C ∈ T_useful^q, (T ∪ {C}) ⇝̸ (T ∪ {C}) by Lemma 2, and ProcSet \ {Pq} contains all the processes not represented in T ∪ {C}. Thus, if ((T ⇝̸ T) ∧ (ProcSet contains all the processes not represented in T) ∧ (S ⊆ T)) holds before the call to ComputeAllCgsFrom, then it also holds after the call to ComputeAllCgsFrom, and hence is an invariant. It follows from this invariant and Corollary 1 that for all T ∈ G, T is a consistent global checkpoint containing S, since T is added to the set G only if ProcSet = ∅. Conversely, suppose T is a consistent global checkpoint containing S. Let T \ S = {Cp1,i1, Cp2,i2, ..., Cpn,in}, Tj = {Cp1,i1, Cp2,i2, ..., Cpj,ij} (1 ≤ j ≤ n), and T0 = ∅. From Lemma 2,

∀ j : 1 ≤ j ≤ n, Cpj,ij ∈ (S ∪ Tj−1)_useful. Thus, T would have been included in the set

G. Hence, G consists of precisely those consistent global checkpoints that contain S. □

2.4 Finding Z-paths in a Distributed Computation

Tracking Z-paths on-the-fly seems to be difficult and remains an open problem. In this section, we describe a method for determining the existence of Z-paths between checkpoints in a distributed computation that has terminated or stopped execution, using the rollback-dependency graph (R-graph) introduced by Wang [56]. First, we present the definition of an R-graph.

Definition 6 The rollback-dependency graph of a distributed computation is a directed graph G = (V, E), where the vertices V are the checkpoints of the distributed computation, and an edge (Cp,i, Cq,j) from the checkpoint Cp,i to the checkpoint Cq,j belongs to E if

1. p = q and j = i + 1, or

2. p ≠ q and a message m sent from the i-th checkpoint interval of Pp is received by Pq in its j-th checkpoint interval (i, j > 0).

Construction of the R-graph

When a process Pp sends a message m in its i-th checkpoint interval, it piggybacks the pair (p, i) with the message. When the receiver Pq receives m in its j-th checkpoint interval, it records the existence of an edge from Cp,i to Cq,j. When a process wants to construct the R-graph for finding Z-paths between checkpoints, it broadcasts a request message to collect the existing direct dependencies from all other processes and constructs the complete R-graph. We assume that each process stops execution after it sends its reply to the request, so that additional dependencies between checkpoints are not formed while the R-graph is being constructed. For each process, a volatile checkpoint that represents the volatile state of the process [56] is added.
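A minimal sketch of this construction (ours; the data layout and names are illustrative), assuming every receiver has logged the piggybacked (p, i) pair of each message it consumed:

```python
# Hypothetical R-graph construction from logged message dependencies.
from collections import defaultdict

def build_rgraph(num_ckpts, received):
    """num_ckpts[p]: how many checkpoints Pp has (checkpoint 0 records the
    initial state; the last one is the volatile checkpoint).
    received: tuples (p, i, q, j), meaning a message piggybacked with (p, i)
    was received by Pq in its j-th checkpoint interval."""
    edges = defaultdict(set)
    for p, n in num_ckpts.items():
        for i in range(n - 1):
            edges[(p, i)].add((p, i + 1))   # rule 1: Cp,i -> Cp,i+1
    for (p, i, q, j) in received:
        if p != q:
            edges[(p, i)].add((q, j))       # rule 2: message-induced edge
    return edges
```

The result is an adjacency map from each checkpoint (p, i) to its R-graph successors, which is all the later reachability tests need.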

An Example of an R-graph

Figure 2.7 shows the R-graph of the computation in Figure 2.6. In Figure 2.7, the last vertex of each process represents its volatile checkpoint, the checkpoint representing the last state the process attained before terminating.


Figure 2.6: A distributed computation.


Figure 2.7: The R-graph of the distributed computation in Figure 2.6.

We denote the fact that there is a path from C to D in the R-graph by C ⇒ D (it only means that there is a path; it does not, however, specify any particular path). For example, in Figure 2.7, C1,0 ⇒ C3,2. When we need to specify a particular path, we give the sequence of checkpoints that constitute the path. The following theorem establishes the correspondence between the paths in the R-graph and the Z-paths between checkpoints. This correspondence is very useful in determining whether or not a Z-path exists between two given checkpoints.

Theorem 6 Let G = (V, E) be the R-graph of a distributed computation. Then, for any two checkpoints Cp,i and Cq,j, Cp,i ⇝ Cq,j if and only if

1. p = q and i < j, or

2. Cp,i+1 ⇒ Cq,j in G (note that in this case p could still be equal to q).

Proof: (⇐) Sufficiency: If p = q and i < j, then clearly Cp,i ⇝ Cq,j. So, we assume that ¬((p = q) ∧ (i < j)) holds and Cp,i+1 ⇒ Cq,j in G. Let (Cp,i+1 = Cp_0,i_0, Cp_1,i_1, ..., Cp_m,i_m = Cq,j) be a path of minimum length from Cp,i+1 to Cq,j in the graph G. To prove Cp,i ⇝ Cq,j, we use induction on m, the length of the minimal path from Cp,i+1 to Cq,j in G.

Let m = 1. In this case, we need to prove that if (Cp,i+1, Cq,j) ∈ E, then Cp,i ⇝ Cq,j, which follows from the definition of the R-graph.

Now, let m ≥ 2. Assume that for any two checkpoints C and D, if there exists a path of length k (k < m) from C to D in the graph G, then there exists a Z-path from C to D. By the induction hypothesis, since (Cp,i+1 = Cp_0,i_0, Cp_1,i_1, ..., Cp_{m-1},i_{m-1}) is a path of length m − 1 from Cp,i+1 to Cp_{m-1},i_{m-1} in G, we have Cp,i ⇝ Cp_{m-1},i_{m-1}.

Case (i): p_{m-1} = q. In this case, Cp_{m-1},i_{m-1} precedes Cq,j in the same process Pq, and hence Cp_{m-1},i_{m-1} ⇝ Cq,j. Since ⇝ is a transitive relation, it follows that Cp,i ⇝ Cq,j.

Case (ii): p_{m-1} ≠ q. In this case, a message M sent from the (i_{m-1})-th checkpoint interval of Pp_{m-1} was received by Pq in its j-th checkpoint interval. It is clear that the messages that caused the Z-path from Cp,i to Cp_{m-1},i_{m-1}, together with the message M, constitute a Z-path from Cp,i to Cq,j, and hence Cp,i ⇝ Cq,j.

(⇒) Necessity: Conversely, suppose Cp,i ⇝ Cq,j. Assume that ¬((p = q) ∧ (i < j)) holds. Then, there exists a sequence of messages M1, M2, ..., Mn (n ≥ 1) satisfying the conditions given in Definition 2. Now, we want to prove that Cp,i+1 ⇒ Cq,j in G. To prove this, we use induction on n, the length of the sequence of messages that constitute the Z-path. If n = 1, then M1 was sent after Cp,i and received before Cq,j. This implies that there exist integers i′ and j′, with i′ > i and j′ ≤ j, such that M1 was sent in the i′-th checkpoint interval of Pp and was received by Pq in its j′-th checkpoint interval. Thus, (Cp,i′, Cq,j′) ∈ E. Since i′ > i, i + 1 ≤ i′. Thus, i + 1 ≤ i′, j′ ≤ j, and (Cp,i′, Cq,j′) ∈ E together imply that Cp,i+1 ⇒ Cq,j.

Now, let n > 1. We assume that the result is true for all Z-paths consisting of fewer than n messages. Suppose the message sequence M1, M2, ..., Mn constitutes the Z-path from Cp,i to Cq,j. Then M1, M2, ..., Mn−1 constitutes a Z-path from Cp,i to Cr,k for some r and k. By the induction hypothesis, Cp,i+1 ⇒ Cr,k. Note that Mn must have been sent by Pr either in its k-th checkpoint interval or in some later checkpoint interval.

Case (i): Message Mn is sent by Pr in its k-th checkpoint interval. In this case, (Cr,k, Cq,j) ∈ E. Since Cp,i+1 ⇒ Cr,k, it follows that Cp,i+1 ⇒ Cq,j.

Case (ii): Message Mn is sent by Pr in its l-th checkpoint interval for some l > k. In this case, Cr,k ⇒ Cr,l and (Cr,l, Cq,j) ∈ E. Since Cp,i+1 ⇒ Cr,k, it follows that Cp,i+1 ⇒ Cq,j. This proves the theorem. □
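Theorem 6 reduces Z-path detection to plain graph reachability. The following small sketch (ours, not from the dissertation) makes this concrete; the R-graph is assumed to be given as an adjacency map from each checkpoint (p, i) to its successors:

```python
# Z-path test per Theorem 6: Cp,i ~> Cq,j iff p = q and i < j, or
# C_{p,i+1} reaches C_{q,j} in the R-graph.

def reachable(edges, src, dst):
    """Depth-first reachability in the R-graph (src ==> dst)."""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v in seen:
            continue
        seen.add(v)
        stack.extend(edges.get(v, ()))
    return False

def z_path(edges, p, i, q, j):
    """True iff a Z-path exists from C_{p,i} to C_{q,j}."""
    if p == q and i < j:
        return True
    return reachable(edges, (p, i + 1), (q, j))
```

In particular, z_path(edges, p, i, p, i) tests whether Cp,i lies on a Z-cycle, since by the theorem that holds exactly when C_{p,i+1} reaches C_{p,i} in the R-graph.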

The following corollary gives the necessary and sufficient conditions for a set of local checkpoints to be part of a consistent global checkpoint in terms of paths in the R-graph.

Corollary 3 Let G = (V, E) be the R-graph of a distributed computation and let S be any set of checkpoints. Then the following three statements are equivalent.

1. S can be extended to a consistent global checkpoint.

2. S ⇝̸ S.

3. There is no path in G from any checkpoint in S_next to any checkpoint in S, where S_next = {Cp,i+1 | Cp,i ∈ S}.

Proof: 1 ⇔ 2 follows from Theorem 1; 2 ⇔ 3 follows from Theorem 6. □
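Corollary 3 likewise turns extendability of S into a reachability question. A self-contained sketch (ours; the R-graph is again an adjacency map from each checkpoint (p, i) to its successors):

```python
# Corollary 3's test: S extends to a consistent global checkpoint iff no
# vertex of S_next = {C_{p,i+1} : C_{p,i} in S} reaches a vertex of S.

def extendable(edges, S):
    s_next = [(p, i + 1) for (p, i) in S]
    targets = set(S)
    for src in s_next:
        stack, seen = [src], set()
        while stack:                      # DFS from each member of S_next
            v = stack.pop()
            if v in targets:
                return False              # some C_{p,i+1} reaches a member of S
            if v in seen:
                continue
            seen.add(v)
            stack.extend(edges.get(v, ()))
    return True
```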

2.5 Summary of Results

In this chapter, we presented a theoretical framework for determining consistent global checkpoints in a distributed computation run. It is based on the notion of zigzag paths introduced by Netzer and Xu [40]. We presented a characterization of the maximal and the minimal consistent global checkpoints containing a given set of local checkpoints. Based on the theoretical framework, we presented an algorithm for finding all the consistent global checkpoints containing a given set S of local checkpoints. If we take S = ∅, then the algorithm gives the set of all the consistent global checkpoints of a distributed computation run. We also established a correspondence between the Z-paths and the paths in the R-graph, which helps in determining the existence of Z-paths between checkpoints.

CHAPTER 3

CHARACTERIZATION AND CLASSIFICATION OF CHECKPOINTING ALGORITHMS

3.1 Introduction

In Chapter 2, we developed a theory for locating the consistent global checkpoints containing a given set of checkpoints, when processes take checkpoints asynchronously. We also presented an algorithm for finding consistent global checkpoints based on the notion of Z-paths. On-the-fly determination of consistent global checkpoints requires an on-line method for finding the existence of Z-paths between checkpoints. Tracking Z-paths on-line is a difficult problem. As a result, this algorithm may not be useful for determining consistent global checkpoints on-the-fly. Moreover, when processes take checkpoints asynchronously, it could very well happen that processes took checkpoints such that none of the checkpoints is part of a consistent global checkpoint.

Figure 3.1 illustrates a distributed computation in which two processes take checkpoints asynchronously. Note that none of the checkpoints taken is useful, since all the checkpoints lie on Z-cycles. Figure 3.1 shows the worst-case scenario; in general, a number of checkpoints could be useful.

Figure 3.1: A distributed computation with asynchronous checkpointing.

The number of useless checkpoints taken by processes can be reduced by requiring processes to take communication-induced checkpoints, in addition to checkpoints taken independently [36, 54, 57]. Recall that we call the checkpointing algorithms that require processes to take communication-induced checkpoints quasi-synchronous checkpointing algorithms. Checkpoints taken by processes independently are called basic checkpoints, and the communication-induced checkpoints are called forced checkpoints. The primary goal of a quasi-synchronous checkpointing algorithm is to minimize the number of useless checkpoints and maximize the number of useful checkpoints in a computation. For example, in Figure 3.1, if each process took a forced checkpoint prior to receiving every message, then all the checkpoints taken would have been useful. Throughout this chapter, the i-th checkpoint of process Pp is denoted by Cp,i; the i-th checkpoint interval of process Pp consists of all the events that lie between its (i−1)-th and i-th checkpoints (and includes the (i−1)-th checkpoint but not the i-th). The checkpoint recording the initial state of process Pp is denoted by Cp,0.

3.1.1 Objectives

Quasi-synchronous checkpointing algorithms are attractive because they minimize the number of useless checkpoints without introducing any undesirable effects. In this chapter, we provide a theoretical framework for the characterization and classification of quasi-synchronous checkpointing algorithms. The characterization and the classification provide a deeper understanding of the principles underlying quasi-synchronous checkpointing algorithms, help us evaluate such algorithms, and provide guidelines for designing more efficient checkpointing algorithms. The classification also provides a clear understanding of the properties and limitations of the checkpointing algorithms belonging to each class.

The rest of this chapter is organized as follows. In Section 3.2, we provide a characterization of quasi-synchronous checkpointing algorithms. In Section 3.3, we present a classification of quasi-synchronous checkpointing algorithms. The merits and significance of the classification, as well as existing open problems, are discussed in Section 3.4. Section 3.5 summarizes the results of the chapter.

3.2 A Characterization of Quasi-Synchronous Checkpointing

As we saw earlier (Figure 3.1), when processes take checkpoints asynchronously, some or all of the checkpoints taken may be useless. In quasi-synchronous checkpointing, processes take communication-induced checkpoints to reduce the number of useless checkpoints: the message pattern and the knowledge gained about the dependency between checkpoints of processes trigger communication-induced checkpoints so that the number of useless checkpoints is minimized or eliminated. Let us first understand how checkpoints become useless and how we can convert useless checkpoints into useful checkpoints.

Definition 7 A non-causal Z-path from a checkpoint Cp,i to a checkpoint Cq,j is a sequence of messages m1, m2, ..., mn (n ≥ 2) satisfying the conditions of Definition 2 such that, for at least one i (1 ≤ i < n), mi is received by some process after it sends the message mi+1 in the same checkpoint interval.

Figure 3.2: Non-causal Z-paths.

Thus, non-causal Z-paths are those Z-paths that are not causal paths; in particular, Z-cycles are non-causal Z-paths. By Theorem 1, if there exists a non-causal Z-path between two (not necessarily distinct) checkpoints, then these two checkpoints together cannot be part of a consistent global checkpoint. Moreover, non-causal Z-paths between checkpoints are hard to track on-line, and hence the presence of non-causal Z-paths complicates the task of finding consistent global checkpoints. However, non-causal Z-paths between checkpoints are preventable if processes take additional checkpoints at appropriate places. For example, in Figure 3.2, the message sequence m1, m2 constitutes a non-causal Z-path from C1,1 to C3,1, since m2 is sent before receiving the message m1 in the same checkpoint interval; if the intermediate process had taken a checkpoint, say A, before receiving the message m1 but after sending the message m2, then this non-causal Z-path would have been prevented, and as a result the checkpoints C1,1 and C3,1 could have been used to construct the consistent global checkpoint {C1,1, A, C3,1}.

Similarly, the message sequence m3, m4 is a non-causal Z-path from C2,2 to itself (in fact, a Z-cycle); this Z-cycle could have been prevented if the process that sent m4 had taken a checkpoint, say B, after sending the message m4 but before receiving the message m3, which would have made C2,2 useful for constructing a consistent global checkpoint (in fact, {B, C2,2, C3,1} would have been one such consistent global checkpoint).

Thus, even though non-causal Z-paths between checkpoints are harmful, they are preventable if processes take additional checkpoints at appropriate places. Preventing all non-causal Z-paths between checkpoints by making processes take additional checkpoints at appropriate places not only makes all checkpoints useful but also facilitates constructing consistent global checkpoints incrementally and easily. This is because, in the absence of non-causal Z-paths, any set of checkpoints that are not pairwise causally related can be extended to a consistent global checkpoint by Theorem 1, and causality between checkpoints can be tracked on-line by using vector timestamps [37, 43, 44, 46].

Thus, the primary issues involved in designing a quasi-synchronous checkpointing algorithm are (i) how to efficiently determine appropriate events at which processes take communication-induced checkpoints so that non-causal Z-paths are eliminated, and (ii) how to minimize the number of communication-induced checkpoints taken.

Depending upon the strategy adopted to address these issues, non-causal Z-paths between checkpoints can be prevented to varying degrees. Depending on the degree to which the non-causal Z-paths are prevented, quasi-synchronous checkpointing algorithms exhibit different properties and can be classified into various classes. This classification helps in understanding the properties and limitations of various checkpointing algorithms, which is helpful for comparing their performance; it also helps in designing more efficient algorithms. In fact, from the knowledge gained from the classification, we developed an efficient quasi-synchronous checkpointing algorithm in Chapter 4. Next, we present a classification of quasi-synchronous checkpointing.

3.3 Classification of Quasi-Synchronous Checkpointing

We classify quasi-synchronous checkpointing algorithms into four different classes, namely, Strictly Z-Path Free (SZPF), Z-Path Free (ZPF), Z-Cycle Free (ZCF), and Partially Z-Cycle Free (PZCF). This classification is based on the degree to which the formation of non-causal Z-paths is prevented. We present the properties of the algorithms belonging to each class and discuss the advantages and disadvantages of algorithms belonging to one class over another. We also present the relationship of the classification to existing work in the literature.

3.3.1 Strictly Z-path Free Checkpointing

Strictly Z-path free checkpointing eliminates all non-causal Z-paths between checkpoints altogether and is the strongest of all the classes. We first give a formal definition of strictly Z-path free checkpointing and then discuss the advantages and disadvantages of a system that is strictly Z-path free.

Definition 8 A checkpointing pattern is said to be strictly Z-path free (or SZPF) if there exists no non-causal Z-path between any two (not necessarily distinct) checkpoints.

In a SZPF system, since there is no non-causal Z-path between checkpoints, every

Z-path is a causal path. The following theorem gives the necessary and sufficient conditions for a system to be SZPF. This theorem is helpful in verifying if a given checkpointing algorithm makes the system SZPF.

Theorem 7 A checkpointing pattern is SZPF if and only if, in every checkpoint interval, all the message-receive events precede all the message-send events.

Proof: (⇒) Suppose there exists a checkpoint interval in which a message-send event precedes a message-receive event. In other words, there exists a process Pp and messages m1 and m2 such that Pp sends m2 and then receives m1 in the same checkpoint interval. Further, suppose m1 is sent by Pq after taking checkpoint A, and m2 is received by Pr before taking checkpoint B. Then, the message sequence m1, m2 forms a non-causal Z-path from A to B, since m2 is sent in the same checkpoint interval before receiving m1. This implies that the system is not SZPF. Hence, in every checkpoint interval of a SZPF system, all the message-receive events precede all the message-send events.

(⇐) Conversely, if all the message-receive events precede all the message-send events in each checkpoint interval, then clearly there can be no non-causal Z-path between checkpoints, and hence the system is SZPF. □
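Theorem 7's condition is easy to check mechanically on a recorded event history. The following sketch is ours, not the dissertation's; the event encoding is illustrative:

```python
# SZPF checker per Theorem 7: within every checkpoint interval, every
# receive must precede every send. Each process history is a list holding
# the string 'ckpt' or tuples ('send', msg) / ('recv', msg).

def is_szpf(histories):
    for events in histories.values():
        sent_in_interval = False
        for ev in events:
            if ev == 'ckpt':
                sent_in_interval = False    # a checkpoint opens a new interval
            elif ev[0] == 'send':
                sent_in_interval = True
            elif ev[0] == 'recv' and sent_in_interval:
                return False                # a receive after a send: not SZPF
    return True
```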

From Theorem 7, it is clear that the checkpointing pattern shown in Figure 3.3 is SZPF. In this figure, forced checkpoints are taken to prevent non-causal Z-paths: one forced checkpoint prevents a non-causal Z-path from C1,0 to C3,1; another prevents a non-causal Z-path from C1,1 to C3,1; and a third prevents a non-causal Z-path (in fact, a Z-cycle) from C2,1 to itself. We next discuss the properties of a SZPF system.

Figure 3.3: SZPF Checkpointing (■ basic checkpoint, □ forced checkpoint).

Properties of a SZPF System

A SZPF system has many interesting and desirable properties. We first show that in a SZPF system, each checkpoint taken is useful and constructing consistent global checkpoints is easy.

Lemma 3 In a SZPF system, all the checkpoints are useful for constructing consistent global checkpoints.

Proof: A Z-cycle is a non-causal Z-path from a checkpoint to itself. However, in a SZPF system, there is no non-causal Z-path between any two (not necessarily distinct) checkpoints, and hence none of the checkpoints lies on a Z-cycle. Hence, by Corollary 2, all the checkpoints are useful for constructing consistent global checkpoints. □

In a SZPF system, any pair of checkpoints between which there is no causal path can be part of a consistent global checkpoint. In fact, the following theorem states a more general result: it not only presents a necessary and sufficient condition for a given set of local checkpoints to be part of a consistent global checkpoint but also provides a method for constructing such checkpoints incrementally.

Theorem 8 In a SZPF system, a set of checkpoints S can be extended to a consistent global checkpoint iff S ↛ S.

Proof: In a SZPF system, for any two checkpoints A and B, A ⇝ B if and only if A → B. Hence, for any set of checkpoints S, S ⇝ S if and only if S → S. Hence, the proof follows from Theorem 1. □

Thus, in a SZPF system, if S is any set of checkpoints that are not pairwise causally related (i.e., S ↛ S), each process not represented in S is guaranteed to have a checkpoint A such that A ↛ S and S ↛ A. After adding such a checkpoint A to S, the resulting set S′ = S ∪ {A} has the property S′ ↛ S′. Thus, we can incrementally extend S to a consistent global checkpoint. Since causality (i.e., the → relation) can be tracked on-line using vector timestamps or similar mechanisms [37, 43], constructing consistent global checkpoints incrementally using this method is simple and practical. In a non-SZPF system, however, it is not easy to construct consistent global checkpoints incrementally, because tracking non-causal Z-paths on-line is difficult.
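Under the stated assumptions (an SZPF pattern, and vector timestamps attached to checkpoints), the incremental construction can be sketched as follows; the data layout and names are ours, not the dissertation's:

```python
# Incremental extension in an SZPF system: two checkpoints are causally
# unrelated iff neither vector timestamp dominates the other, and Theorem 8
# guarantees the greedy loop below finds a candidate for every process.

def concurrent(v, w):
    """True iff neither vector timestamp dominates the other."""
    return any(a > b for a, b in zip(v, w)) and any(b > a for a, b in zip(v, w))

def extend_to_global(S, checkpoints_by_proc, vt):
    """S: {process: checkpoint id}; vt: checkpoint id -> vector timestamp.
    Returns a consistent global checkpoint containing S (SZPF assumed)."""
    result = dict(S)
    for p, cands in checkpoints_by_proc.items():
        if p in result:
            continue
        for c in cands:   # any checkpoint concurrent with all chosen ones works
            if all(concurrent(vt[c], vt[d]) for d in result.values()):
                result[p] = c
                break
        else:
            raise ValueError("S is causally related and cannot be extended")
    return result
```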

Lemma 4 In a SZPF system, for any set of checkpoints S, the C-cone(S) is the same as the Z-cone(S).

Proof: In a SZPF system, every Z-path is a causal path and hence the lemma follows. □

The fact that the Z-cone(S) and the C-cone(S) are identical in a SZPF system facilitates the construction of the maximal and the minimal consistent global checkpoints containing a given set S. If S ↛ S, then by adding to S the latest checkpoint from each process that is not causally related to any of the checkpoints in S (i.e., the checkpoints lying on the trailing edge of the C-cone(S)), we can obtain the maximal consistent global checkpoint containing S. The minimal consistent global checkpoint containing S can be constructed by adding to S the earliest checkpoint from each process that is not causally related to any of the checkpoints in S (i.e., the checkpoints lying on the leading edge of the C-cone(S)). Thus, in a SZPF system, determining maximal and minimal consistent global checkpoints is simple.

Relation to Existing Work

The No-Receive-After-Send method: The No-Receive-After-Send (NRAS) checkpointing method [1, 56] disallows any message from being received in a checkpoint interval once a message has been sent in that interval. Thus, all message-receive events precede all message-send events in each checkpoint interval. Hence, it follows from Theorem 7 that the NRAS checkpointing method makes the system SZPF. A distributed computation taking checkpoints using the NRAS checkpointing method is shown in Figure 3.4.

Figure 3.4: Checkpointing in NRAS method.

The Checkpoint-After-Send method: In the Checkpoint-After-Send (CAS) method [56], a checkpoint must be taken after every send event. Thus, any checkpoint interval can have at most one message-send event, and it must appear after all the receive events in the interval. Hence, the CAS method of checkpointing makes the system SZPF by Theorem 7. Since the NRAS method allows several message-send events to take place in the same checkpoint interval, the CAS method will have higher checkpointing overhead than the NRAS method. However, the CAS checkpointing method has the following very interesting and useful property: the set consisting of all the latest checkpoints of all the processes forms a consistent global checkpoint. This property comes at the expense of high checkpointing overhead. For the distributed computation in Figure 3.4, the checkpoints taken using the CAS method are shown in Figure 3.5.

Figure 3.5: Checkpointing in CAS method.

The Checkpoint-Before-Receive method: In the Checkpoint-Before-Receive (CBR) method [56], a checkpoint must be taken before every receive event. Thus, any checkpoint interval can have at most one message-receive event, and it precedes all the message-send events in that interval. Hence, the CBR method makes the system SZPF by Theorem 7. The checkpoints taken for the distributed computation of Figure 3.4 using the CBR method are shown in Figure 3.6. Note that the CAS method and the CBR method have the same checkpointing overhead, since the number of forced checkpoints taken in both cases equals the number of messages exchanged. However, in the CBR method, the latest checkpoints of the processes do not form a consistent global checkpoint.


Figure 3.6: Checkpointing in CBR method.

The Checkpoint-After-Send-Before-Receive method: In the Checkpoint-After-Send-Before-Receive (CASBR) method [56], a checkpoint must be taken after every message-send event and before every message-receive event. In the CASBR method, any checkpoint interval can have at most one message-receive event and/or one message-send event, and the receive event should precede the send event if an interval contains both; thus, the CASBR method makes the system SZPF by Theorem 7. In the CASBR method, processes take twice as many forced checkpoints as in either the CAS method or the CBR method; this is because, in the CAS method or the CBR method, one forced checkpoint is taken corresponding to each message, whereas under the CASBR method, two forced checkpoints (one by the sender and one by the receiver of the message) are taken corresponding to each message. For the distributed computation of Figure 3.4, the checkpoints taken using the CASBR method are shown in Figure 3.7.

Figure 3.7: Checkpointing in CASBR method.
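The relative overheads of these rules can be illustrated with a toy trace-counting sketch (ours, not from the dissertation): CAS forces one checkpoint per send, CBR one per receive, CASBR one per send and one per receive (hence twice as many per message), while NRAS forces one only when a receive follows a send in the same interval:

```python
# Toy comparison of forced-checkpoint counts on one event trace.
# Events are ('send', proc) or ('recv', proc); basic checkpoints omitted.

def forced_count(events, rule):
    count = 0
    sent = {}                     # NRAS only: has this process sent in its
    for kind, proc in events:     # current checkpoint interval?
        if rule == 'NRAS':
            if kind == 'recv' and sent.get(proc):
                count += 1          # forced checkpoint before the receive...
                sent[proc] = False  # ...which starts a new interval
            elif kind == 'send':
                sent[proc] = True
        elif rule == 'CAS' and kind == 'send':
            count += 1            # one forced checkpoint after every send
        elif rule == 'CBR' and kind == 'recv':
            count += 1            # one forced checkpoint before every receive
        elif rule == 'CASBR' and kind in ('send', 'recv'):
            count += 1            # one around each send and each receive
    return count
```

On a trace where process 0 sends two messages and process 1 receives them, NRAS forces no checkpoints at all, while CAS and CBR each force one per message and CASBR two per message.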

3.3.2 Z-path Free Checkpointing

In a SZPF system, the absence of non-causal Z-paths between checkpoints makes all the checkpoints useful and also facilitates the incremental construction of consistent global checkpoints. We can have these desirable features of a SZPF system without actually eliminating all non-causal Z-paths. It turns out that we can get all the benefits of a SZPF system by eliminating only those non-causal Z-paths for which there is no sibling causal path*. This relaxed requirement yields the ZPF model defined below.

Definition 9 A checkpointing pattern is said to be Z-path free (or ZPF) iff, for any two checkpoints A and B, A ⇝ B iff A → B.

Thus, in a ZPF system, even though non-causal Z-paths may exist between checkpoints, they always have a sibling causal path, and thus all such non-causal Z-paths can be tracked on-line through the sibling causal paths. Figure 3.8 shows a distributed computation that is ZPF but not SZPF. In this figure, forced checkpoints are taken to prevent the non-causal Z-paths for which there exists no sibling causal path. For example, the forced checkpoint C1,2 is taken to prevent a non-causal Z-path that is in fact a Z-cycle; the forced checkpoint C2,1 is taken by process P2 to prevent a non-causal Z-path from C1,0 to C3,1. However, even though there exists a non-causal Z-path from C1,1 to C3,2, no forced checkpoint is taken by process P2 to prevent this Z-path, because there exists a sibling causal path.

*If there exists a Z-path from A to B, and also a causal path from A to B, the causal path is called a sibling of the Z-path.


Figure 3.8: ZPF Checkpointing.

Properties of a ZPF System

A ZPF system has all the interesting properties of a SZPF system. The following lemma shows that each checkpoint taken is useful in a ZPF system.

L em m a 5 In a ZPF system, all the checkpoints are useful for constructing consistent global checkpoints.

Proof: In a ZPF system, C ⤳ C ⟹ C → C for any checkpoint C. However, C → C is never true, since the relation → is irreflexive. Therefore, ¬(C ⤳ C) for any C, and hence none of the checkpoints is on a Z-cycle. Hence, by Corollary 2, all the checkpoints are useful for constructing consistent global checkpoints. □

The following theorem gives a necessary and sufficient condition for a given set of local checkpoints to be part of a consistent global checkpoint and also provides a method for constructing consistent global checkpoints incrementally.

Theorem 9: In a ZPF system, a set of checkpoints S can be extended to a consistent global checkpoint iff ¬(S ⤳ S).

Proof: In a ZPF system, for any two checkpoints A and B, A ⤳ B if and only if A → B. Hence, for any set of checkpoints S, S ⤳ S if and only if S → S, and therefore ¬(S ⤳ S) if and only if ¬(S → S). Hence, the proof follows from Theorem 1. □

The following lemma shows that SZPF ⟹ ZPF.

Lemma 9: If a system is SZPF, then it is ZPF, but the converse is not true.

Proof: In a SZPF system, there is no non-causal Z-path between any two (not necessarily distinct) checkpoints, which trivially implies that for any two checkpoints A and B, A ⤳ B iff A → B. Hence, a SZPF system is a ZPF system. The converse is not true. For example, the checkpointing pattern in Figure 3.8 is ZPF but not SZPF, since the message sequence m2, m5 forms a non-causal Z-path from C1,1 to C3,2. □

It is also easy to see that for any given set of checkpoints S, Z-cone(S) and C-cone(S) are identical in a ZPF system, and hence finding the maximal and minimal consistent global checkpoints containing a target set of checkpoints is simple. Thus, a ZPF system has all the important features of a SZPF system: constructing consistent global checkpoints incrementally is simple, and every checkpoint taken is useful for constructing consistent global checkpoints. In addition, for a given computation, ZPF checkpointing is likely to have less checkpointing overhead than any SZPF checkpointing. This is because in ZPF checkpointing, processes have to take forced checkpoints only to prevent non-causal Z-paths for which there exists no sibling causal path, whereas in SZPF checkpointing, processes have to take forced checkpoints to prevent all the non-causal Z-paths.

Definition 10: A ZPF-checkpointing algorithm is "optimal" if it makes processes take forced checkpoints "only" to prevent non-causal Z-paths for which there is no sibling causal path.

Observation: Designing an optimal ZPF checkpointing algorithm seems to be impossible.

This is because designing an optimal ZPF algorithm requires processes to have knowledge about future events. For example, in Figure 3.8, upon receiving the message m2, P2 cannot decide whether or not to take a forced checkpoint in order to prevent the non-causal Z-path m2, m5: an optimal ZPF checkpointing algorithm should not force P2 to take a checkpoint before processing the message m2, because corresponding to the non-causal Z-path m2, m5 there is going to be a sibling causal path m6, m5 in the future.

Equivalence of RD-trackable and ZPF systems

We show that the Rollback-Dependency Trackable system (RD-trackable system) [56] is equivalent to the ZPF system. First, we define the RD-trackable system using the terminology of Z-paths. Each process Pp maintains a vector Dp of size N. Entry Dp[p], initialized to 1, is incremented every time a new checkpoint is taken, and thus always represents the current interval number or, equivalently, the sequence number of the next checkpoint of Pp; every other entry Dp[q], q ≠ p, is initialized to 0 and records the highest sequence number of any checkpoint interval of Pq on which Pp's current state transitively depends. When Pp sends a message M, the current value of Dp is piggybacked on M. When the receiver Pq receives M, Pq updates its vector Dq as follows: Dq[r] := max(M.D[r], Dq[r]), 1 ≤ r ≤ N, where M.D denotes the vector piggybacked on M. When Pq takes the next checkpoint Cq,j, the value of the vector Dq at that instant is assigned as the timestamp for the checkpoint Cq,j and is denoted by Cq,j.D; after taking the checkpoint, Dq[q] is incremented.
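The dependency-vector bookkeeping described above can be sketched as follows. This is a minimal illustration, not code from the dissertation; the `Process` class, the three-process system size, and the driver at the bottom are our own scaffolding.

```python
N = 3  # number of processes (illustrative)

class Process:
    def __init__(self, pid):
        self.pid = pid
        # D[pid] starts at 1: the sequence number of the NEXT checkpoint;
        # every other entry starts at 0.
        self.D = [0] * N
        self.D[pid] = 1
        self.checkpoints = []  # list of (sequence number, timestamp vector)

    def take_checkpoint(self):
        # The current D is the checkpoint's timestamp; then D[pid] advances.
        self.checkpoints.append((self.D[self.pid], list(self.D)))
        self.D[self.pid] += 1

    def send(self):
        # The current D is piggybacked on every outgoing message.
        return list(self.D)

    def receive(self, M_D):
        # Component-wise maximum merges transitive dependencies.
        self.D = [max(m, d) for m, d in zip(M_D, self.D)]

p0, p1 = Process(0), Process(1)
msg = p0.send()      # P0 sends in its 1st checkpoint interval
p1.receive(msg)      # P1 now transitively depends on interval 1 of P0
p1.take_checkpoint()
seq, ts = p1.checkpoints[0]
print(seq, ts)       # 1 [1, 1, 0]
```

The timestamp [1, 1, 0] records that P1's first checkpoint transitively depends on checkpoint interval 1 of P0.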

Definition 11: A checkpointing pattern is said to satisfy rollback-dependency trackability (or is RD-trackable) iff for any two checkpoints Cp,i and Cq,j, Cp,i ⤳ Cq,j if and only if Cq,j.D[p] ≥ i + 1.

The following theorem establishes the equivalence of the ZPF system and the RD-trackable system.

Theorem 10: A checkpointing pattern is ZPF if and only if it is RD-trackable.

Proof: Note that a process Pq updates the value Dq[p], p ≠ q, only as a result of a message reception. From the definition, it follows that in an RD-trackable system, for any two checkpoints Cp,i and Cq,j, Cp,i ⤳ Cq,j if and only if Pq received a message M before taking the checkpoint Cq,j and M causally depended on a message sent by Pp in its (i + 1)-th checkpoint interval or later (i.e., after taking the checkpoint Cp,i). In other words, in an RD-trackable system, Cp,i ⤳ Cq,j if and only if a message M that causally depended on a message sent by Pp after its checkpoint Cp,i was received by Pq before the checkpoint Cq,j. Thus, a system is RD-trackable if and only if for any two checkpoints Cp,i and Cq,j, Cp,i ⤳ Cq,j ⟺ Cp,i → Cq,j, thus proving the theorem. □

Relation to Existing Work

Fixed-Dependency-After-Send method: In the Fixed-Dependency-After-Send (FDAS) method [56], when a process Pp sends a message, it piggybacks the current value of the dependency vector Dp. After the first message send event in any checkpoint interval of a process Pq, if Pq receives a message M, then it processes the message if M.D[r] ≤ Dq[r] ∀r; otherwise, it first takes a checkpoint, updates its dependency vector Dq, and then processes the message. Thus, in each checkpoint interval, after the first send event, the dependency vector remains unchanged until the next checkpoint.

Fixed-Dependency-Interval method: In the Fixed-Dependency-Interval (FDI) method [56], when a process Pp sends a message, it piggybacks the current value of the dependency vector Dp. When a process Pq receives a message M, Pq processes the message if M.D[r] ≤ Dq[r] ∀r; otherwise, it first takes a checkpoint, updates its dependency vector Dq, and then processes the message. Thus, a process is allowed to send and receive messages in a checkpoint interval as long as doing so does not change the dependency vector in that interval.
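The FDI receive rule reduces to a component-wise comparison: a forced checkpoint is needed exactly when the piggybacked vector would enlarge the receiver's dependency vector. A minimal sketch; the function name and the example vectors are ours, not from [56]:

```python
# FDI receive test: process M directly iff M.D[r] <= Dq[r] for all r;
# otherwise a forced checkpoint must be taken first.
def must_take_forced_checkpoint(M_D, Dq):
    # True iff processing M would change the dependency vector, i.e.,
    # M carries a dependency the receiver has not yet recorded.
    return any(md > dq for md, dq in zip(M_D, Dq))

# Receiver already depends on interval 2 of P0 and interval 1 of P1.
Dq = [2, 1, 0]
print(must_take_forced_checkpoint([2, 1, 0], Dq))  # False: nothing new
print(must_take_forced_checkpoint([3, 0, 0], Dq))  # True: interval 3 of P0 is new
```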

Venkatesh et al.'s method: In the checkpointing algorithm of Venkatesh et al. [51], processes are allowed to perform checkpointing independently based on their individual requirements. Checkpoints are assigned unique sequence numbers. Each process Pp maintains a dependency vector Dp where, for each q, Dp[q] represents the sequence number of the latest checkpoint of Pq as perceived by Pp. When a process sends a message, it piggybacks the current value of the dependency vector with the message. When a process Pq receives a message M, Pq processes the message if M.D[r] ≤ Dq[r] ∀r; otherwise, it first takes a checkpoint, sets Dq[r] := max(Dq[r], M.D[r]) ∀r, and then processes the message.

Baldoni et al.'s method: Baldoni et al. [5] proposed a checkpointing algorithm that makes the system ZPF. Their method does not force processes to take checkpoints to prevent those non-causal Z-paths for which there exists a causal path in the past.

Wang [56] showed that both the FDAS and the FDI methods make the system RD-trackable. Hence, from Theorem 10, both the FDAS and FDI methods make the system ZPF. It is easy to see that Venkatesh et al.'s checkpointing method is similar to the FDI method described above and hence makes the system ZPF. Note that the FDAS, FDI, and Venkatesh et al.'s checkpointing methods are not SZPF, by Theorem 7. Of these four methods, Baldoni et al.'s method will have the lowest checkpointing overhead.

3.3.3 Z-cycle Free Checkpointing

All checkpoints taken in a ZPF system and a SZPF system are useful. If the objective of a quasi-synchronous checkpointing algorithm is just to make all checkpoints useful, it is not necessary to make the system either ZPF or SZPF. To make all checkpoints useful, it is sufficient to prevent only the Z-cycles, by Corollary 2. Thus, we propose a further weakened model below in which only Z-cycles are prevented.

Definition 12: A checkpointing pattern is said to be Z-cycle free (or ZCF) iff none of the checkpoints lies on a Z-cycle.

Figure 3.9 shows a distributed computation that is ZCF but not ZPF. In this figure, the forced checkpoint C1,2 is taken by P1 to prevent the Z-cycle m4, m3 from C2,2 to itself. The message sequence m4, m1 forms a non-causal Z-path from C1,0 to C3,1; also, the message sequence m2, m5 forms a non-causal Z-path from C1,1 to C3,2; however, P2 does not take forced checkpoints to prevent these non-causal Z-paths.

Figure 3.9: ZCF Checkpointing. (Legend: ■ basic checkpoint, □ forced checkpoint.)

The following theorem gives a sufficient condition for a system to be ZCF. For any message M, let M.sn denote the sequence number of the latest checkpoint of the sender of M that precedes the event send(M).

Theorem 11: A checkpointing pattern is ZCF if, for every message M received by any process Pq,

M.sn ≥ i ⟹ ∃ j ≥ i such that Cq,j → receive(M).

(That is, a message sent after taking a checkpoint with sequence number i is received by a process only after taking a checkpoint with sequence number ≥ i.)

Proof: The proof is by contradiction. Suppose there exists a Z-cycle from a checkpoint Cp,i to itself. Then, there exists a message sequence M1, M2, ..., Mn (n ≥ 1) such that:

1. M1 is sent by Pp after Cp,i (hence M1.sn ≥ i);

2. if Mk (1 ≤ k < n) is received by Pr, then Mk+1 is sent by Pr in the same or a later checkpoint interval (although Mk+1 may be sent before or after Mk is received); and

3. Mn is received by Pp before Cp,i (i.e., receive(Mn) → Cp,i).

Note that for each k (1 ≤ k < n), message Mk+1 is sent in the same or a later checkpoint interval than the one in which Mk is received. Since M1.sn ≥ i, it follows from the condition given in the theorem and condition 2 above that Mk.sn ≥ i ∀k (1 ≤ k ≤ n). In particular, Mn.sn ≥ i. Hence ∃ j ≥ i such that Cp,j → receive(Mn), which contradicts the fact that receive(Mn) → Cp,i: Cp,j → receive(Mn) → Cp,i with j ≥ i is impossible, as the sequence numbers of the checkpoints in a process increase monotonically.

Hence, no checkpoint lies on a Z-cycle. Hence the theorem. □

Theorem 11 could be useful in verifying whether a given checkpointing algorithm makes the system ZCF. It is easy to see that the condition given in the theorem is sufficient but not necessary for a system to be ZCF.

Properties of a ZCF system

An important feature of a ZCF system is that every checkpoint taken is useful, since none of the checkpoints is on a Z-cycle. It is also easy to see that a ZPF system is a ZCF system but not conversely. A ZCF system allows the formation of non-causal Z-paths among checkpoints as long as they are not Z-cycles; therefore, it has less checkpointing overhead than a ZPF system. If a Z-path between two checkpoints is not prevented, the two checkpoints together cannot be part of a consistent global checkpoint; however, each of the two checkpoints can individually still be part of a consistent global checkpoint if it is not on a Z-cycle. For example, in Figure 3.9, the checkpoints C1,1 and C3,2 cannot be part of a consistent global checkpoint together because of the Z-path m2, m5; however, the sets {C1,1, C2,1, C3,1} and {C1,2, C2,2, C3,2} are consistent global checkpoints containing C1,1 and C3,2, respectively.

Even though every checkpoint in a ZCF system is useful for constructing a consistent global checkpoint, constructing a consistent global checkpoint incrementally is difficult due to the presence of non-causal Z-paths. There is a trade-off between the weakness of the checkpointing model and the ease of constructing consistent global checkpoints.

Definition 13: A ZCF-checkpointing algorithm is "optimal" if it makes processes take the minimum number of forced checkpoints needed to prevent all Z-cycles.

Observation: Designing an optimal ZCF checkpointing algorithm seems to be impossible.

This is because Z-cycles are non-causal Z-paths, and tracking non-causal Z-paths on-line requires processes to have knowledge about future events. Next, we present a ZCF quasi-synchronous checkpointing algorithm and explain how the problem of finding consistent global checkpoints is handled in that algorithm.

Relation to Existing Work

Briatico et al.'s algorithm: The algorithm of Briatico et al. [10] forces the receiver of a message to take a checkpoint if the sender's checkpoint interval number tagged with the message is higher than the current checkpoint interval number of the receiver. From Theorem 11, this checkpointing method makes all checkpoints Z-cycle free, because a message sent in a checkpoint interval is never received in a checkpoint interval with a lower interval number. Checkpoints with the same sequence number form a consistent global checkpoint; however, the global checkpoint containing a given checkpoint found using this method may be far away from the maximal consistent global checkpoint containing the given checkpoint, since a process may be taking checkpoints at a low pace while not receiving any messages from the processes that are taking checkpoints at a faster pace. Briatico et al.'s algorithm does not actually track Z-cycles and prevent them. It prevents Z-cycles using a heuristic, and as a result it may force processes to take forced checkpoints even when there is no need.
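Briatico et al.'s rule reduces to a single comparison on interval numbers. A sketch under our own naming (the dissertation gives no code for this method):

```python
# Briatico et al.'s rule: on receiving a message tagged with the sender's
# checkpoint interval number, take a forced checkpoint (jump to that
# interval number) iff the tag exceeds the receiver's current interval.
def on_receive(receiver_interval, msg_interval):
    """Return the receiver's interval number after handling the message."""
    if msg_interval > receiver_interval:
        # Forced checkpoint: a message sent in interval i is thus never
        # processed in an interval with a lower interval number.
        return msg_interval
    return receiver_interval

print(on_receive(2, 5))  # 5: forced checkpoint taken
print(on_receive(4, 3))  # 4: no forced checkpoint
```

The heuristic nature is visible here: the receiver jumps forward whenever the tag is larger, whether or not an actual Z-cycle was forming.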

3.3.4 Partially Z-cycle Free Checkpointing

In ZCF checkpointing, none of the checkpoints taken lies on a Z-cycle, and hence all checkpoints are useful. This property of ZCF checkpointing is useful in many applications. For example, in rollback recovery, when a process fails, the failed process needs only to roll back to its latest checkpoint, and the other processes will have consistent local checkpoints to which they can roll back. In systems where failures are not frequent, or in systems which can afford to roll back a great distance in the event of a failure, it may not be necessary to make all checkpoints useful. Thus, it might be sufficient to make only some of the checkpoints useful. We call such checkpointing patterns partially Z-cycle free. Formally, a partially Z-cycle free checkpointing pattern is defined as follows.

Definition 14: A checkpointing pattern is said to be partially Z-cycle free (PZCF) if not all checkpoints are Z-cycle free.

The primary advantage of such checkpointing is that it has less checkpointing overhead than ZCF checkpointing; however, the presence of Z-cycles implies that some of the checkpoints taken will be useless, and it also complicates the process of finding consistent global checkpoints.

Relation to Existing Work

Now, we present two checkpointing algorithms from the literature which make the system PZCF.

Wang et al.'s Lazy Checkpoint Coordination: Wang and Fuchs [54] proposed lazy checkpoint coordination, in which each message is piggybacked with the sequence number of the current checkpoint interval. They define the laziness Z, a predefined system parameter which is a positive integer. During normal execution, each process Pp maintains a variable V which is initialized to Z and incremented by Z each time a checkpoint Cp,nZ is taken, where n is some positive integer. When Pp, in its x-th checkpoint interval, is about to process a message M tagged with the sender's checkpoint interval number y ≥ V, Pp is forced to take checkpoint Cp,lZ, where l = ⌊y/Z⌋. For any positive integer n, the checkpoints with sequence number nZ are Z-cycle free, because a message sent after a checkpoint with sequence number nZ is never received by a process before that process has taken a checkpoint with sequence number nZ. Checkpoints whose sequence numbers are not multiples of Z may be on Z-cycles. So, this checkpointing method makes the system PZCF but not ZCF. From this observation, it also follows that for any positive integer n, the checkpoints with sequence number nZ form a consistent global checkpoint. This checkpointing method will have less checkpointing overhead than that of Briatico et al. [10], since processes are forced to take a checkpoint only if the sequence number received in the message exceeds, by at least Z, the sequence number of the process's latest checkpoint whose sequence number is of the form nZ. However, the method for finding consistent global checkpoints will yield fewer consistent global checkpoints than the one proposed in [10] if Z > 1, and there will be useless checkpoints.

Xu and Netzer's Adaptive Checkpointing Algorithm: To our knowledge, Xu and Netzer [57] were the first to attempt the design of a checkpointing algorithm which tracks Z-cycles on-line and prevents them. In their method, each process Pp maintains a dependency vector DVp of size N. The entry DVp[p] denotes the current checkpoint number of Pp. For the purpose of detecting Z-cycles, when a new checkpoint is taken, the current value of the vector DVp is copied to another variable ZVp. When Pp sends a message M to Pq, the current value of the vector DVp as well as the current value of ZVp[q] is piggybacked on M. Note that ZVp[q], q ≠ p, is the sequence number of the latest checkpoint of Pq that has a causal path to the current checkpoint of Pp; the integer value ZVp[q] piggybacked with M is denoted by M.Zid. When the receiver Pq receives M, Pq takes a checkpoint before processing the message if M.Zid is the same as its current checkpoint number; after processing the message, Pq updates its vector DVq as follows: DVq[r] := max(M.DV[r], DVq[r]), 1 ≤ r ≤ N, where M.DV denotes the vector piggybacked on M. This checkpointing method prevents all Z-cycles of the form M1, M2, ..., Mn where Mk−1 → Mk ∀k > 2 (note that it is not necessary that M1 → M2). Thus, this checkpointing method prevents all those Z-cycles that have a significant causal component. It is clear that not all Z-cycles are prevented, and hence the method is PZCF but not ZCF.
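The Zid test can be exercised with a two-process scenario. In this sketch the `Process` scaffolding, the two-process size, and the convention that checkpoint numbers start at 1 are our own assumptions; only the piggybacking of (DV, ZV[q]) and the test `M.Zid == current checkpoint number` come from the description above.

```python
N = 2  # two processes, P0 and P1 (illustrative)

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.DV = [0] * N
        self.DV[pid] = 1          # checkpoint numbers start at 1 here
        self.ZV = list(self.DV)   # copy of DV taken at the last checkpoint

    def take_checkpoint(self):
        self.DV[self.pid] += 1
        self.ZV = list(self.DV)

    def send(self, q):
        # Piggyback the full vector DV plus the entry ZV[q] (= M.Zid).
        return list(self.DV), self.ZV[q]

    def receive(self, M_DV, M_Zid):
        # Forced checkpoint iff M.Zid equals the current checkpoint number:
        # processing M now would close a Z-cycle with a causal component.
        forced = (M_Zid == self.DV[self.pid])
        if forced:
            self.take_checkpoint()
        self.DV = [max(a, b) for a, b in zip(M_DV, self.DV)]
        return forced

p0, p1 = Process(0), Process(1)
p0.receive(*p1.send(0))   # P1 -> P0: no forced checkpoint
p0.take_checkpoint()      # P1's checkpoint 1 now reaches P0's checkpoint
forced = p1.receive(*p0.send(1))
print(forced)             # True: the forced checkpoint breaks the Z-cycle
```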

3.4 Discussion

Many of the existing quasi-synchronous checkpointing algorithms in the literature were designed without the realization that non-causal Z-paths between checkpoints are the primary cause of useless checkpoints; we have shown that, even though many of the designers of these algorithms did not realize it, preventing the formation of non-causal Z-paths between checkpoints is exactly what these algorithms do. We presented four models of quasi-synchronous checkpointing, namely PZCF, ZCF, ZPF, and SZPF, and showed that SZPF ⟹ ZPF ⟹ ZCF ⟹ PZCF. Clearly, synchronous checkpointing algorithms make the system SZPF. Figure 3.10 illustrates the relationship among these checkpointing models. In Figure 3.10, as we move away from the center, we come across checkpointing algorithms that give processes more autonomy in taking checkpoints, but the number of useless checkpoints may increase.

In a SZPF system, all checkpoints are useful for the purpose of constructing consistent global checkpoints. Constructing consistent global checkpoints incrementally is easy, since we can easily check for causality while constructing consistent global checkpoints.

In terms of finding consistent global checkpoints and making checkpoints useful for the purpose of constructing consistent global checkpoints, a ZPF system has the same advantages as a SZPF system. However, an optimal algorithm that makes the system ZPF is better than any algorithm that makes the system SZPF, because it has the potential for lower checkpointing overhead. Designing an optimal ZPF quasi-synchronous checkpointing algorithm remains an open problem.

In a ZCF system, all checkpoints are useful. However, due to the presence of non-causal Z-paths between checkpoints, constructing consistent global checkpoints incrementally is difficult. There are no efficient methods for constructing consistent global checkpoints in a ZCF system. So, finding a method to construct consistent global checkpoints efficiently in a ZCF system remains an open problem. Designing an optimal ZCF quasi-synchronous checkpointing algorithm also remains an open problem.


Figure 3.10: Relationship between the various checkpointing models proposed.

3.5 Summary of Results

When processes take checkpoints independently, some or all of the checkpoints taken may be useless for the purpose of constructing consistent global checkpoints. Quasi-synchronous checkpointing algorithms force processes to take communication-induced checkpoints to minimize the number of useless checkpoints. Depending on the extent to which the number of useless checkpoints is minimized, we classified quasi-synchronous checkpointing algorithms into various classes. This classification provides a clear understanding of the limitations and properties of the quasi-synchronous checkpointing algorithms belonging to the various classes. We discussed the merits of the checkpointing algorithms belonging to one class over those belonging to other classes. The classification also helps in designing more efficient algorithms and in evaluating existing algorithms. We pointed out that designing an optimal ZPF quasi-synchronous checkpointing algorithm remains an open problem.

Similarly, designing an optimal ZCF checkpointing algorithm, as well as finding an efficient method to determine consistent global checkpoints in a ZCF system, remain open problems.

CHAPTER 4

A QUASI-SYNCHRONOUS CHECKPOINTING ALGORITHM

4.1 Introduction

In Chapter 3, we observed that if our objective is to make all the checkpoints useful, we should make sure that none of the checkpoints taken lies on a Z-cycle. We also observed that avoiding the formation of Z-cycles by actually tracking them seems to be difficult. In this chapter, we present a low-overhead quasi-synchronous checkpointing algorithm which makes the system ZCF. We also present an efficient and simple method for finding consistent global checkpoints in the system. Since all checkpoints taken are useful, the algorithm ensures the existence of a recovery line* containing any checkpoint of any process. This property of the algorithm helps us bound rollback during failure recovery. There is no extra message overhead involved in the checkpointing coordination: only a sequence number is piggybacked with every computation message.

The rest of this chapter is organized as follows. In the next section, we present a quasi-synchronous checkpointing algorithm that makes the system Z-cycle free.

*a consistent global checkpoint

In Section 4.3, we present a method for determining consistent global checkpoints when processes take checkpoints using our checkpointing algorithm. We analyze the overhead involved in the checkpointing algorithm in Section 4.4. In Section 4.5, we compare our checkpointing algorithm with existing checkpointing algorithms. In Section 4.6, we summarize the results of this chapter.

4.2 The Algorithm

In this section, we present a quasi-synchronous checkpointing algorithm and prove

that it makes the system Z-cycle free.

4.2.1 Informal Description of the Algorithm

In the proposed algorithm, each process is allowed to take checkpoints asynchronously. In addition, processes are forced to take checkpoints as a result of the reception of some messages. The forced checkpoints help make all the checkpoints useful. Each checkpoint is assigned a unique sequence number. The sequence number assigned to a checkpoint is picked from a local counter which is incremented periodically. The local counters maintained by the individual processes are incremented at periodic time intervals, the time period being the same in all processes. Thus, the sequence numbers of the checkpoints in a process increase monotonically, and the sequence numbers of the latest checkpoints of all the processes remain close to each other. As we will see later, this property of the sequence numbers of the latest checkpoints helps in advancing the recovery line. When a process Pp sends a message, it appends the sequence number of its current checkpoint to the message. When a process receives a message, if the sequence number accompanying the message is greater than the sequence number of the receiver's latest checkpoint, then, before processing the message, the receiver takes a forced checkpoint and assigns the sequence number received in the message to the checkpoint taken. When it is time for a process to take a checkpoint, it skips taking a checkpoint if its latest checkpoint has a sequence number greater than or equal to the current value of its counter (this situation can occur as a result of forced checkpoints and drift in local clocks). This strategy helps in reducing the checkpointing overhead (i.e., the number of checkpoints taken).

Next, we present the quasi-synchronous checkpointing algorithm formally. The variable next_p of process Pp represents its local counter; sn_p contains the sequence number of the latest checkpoint of Pp; C.sn denotes the sequence number assigned to the checkpoint C; M.sn denotes the sequence number piggybacked with message M.

The Quasi-Synchronous Checkpointing Algorithm

Data structures at process Pp:
  sn_p := 0;    {sequence number of the current checkpoint, initialized to 0;
                 updated every time a new checkpoint is taken}
  next_p := 1;  {sequence number to be assigned to the next basic checkpoint,
                 initialized to 1}

When it is time for process Pp to increment next_p:
  next_p := next_p + 1;  {next_p is incremented at periodic time intervals of x time units}

When process Pp sends a message M:
  M.sn := sn_p;  {sequence number of the current checkpoint appended to M}
  send(M);

Process Pq, upon receiving a message M from process Pp:
  if sn_q < M.sn then   {if the sequence number of the current checkpoint is less than the
                         checkpoint number received in the message, take a new checkpoint
                         before processing the message}
    Take checkpoint C;
    C.sn := M.sn;
    sn_q := M.sn;
  Process the message.

When it is time for process Pp to take a basic checkpoint:
  if next_p > sn_p then  {skip taking a basic checkpoint if next_p ≤ sn_p, i.e., if the process
                          already took a forced checkpoint with sequence number ≥ next_p}
    sn_p := next_p;
    Take checkpoint C;
    C.sn := sn_p;
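The pseudocode above can be exercised with a small runnable sketch. The `Process` class and the event driver are our own scaffolding (we also treat the initial state as a checkpoint with sequence number 0); the fields `sn` and `next` and the three rules mirror the pseudocode:

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.sn = 0             # sequence number of the current checkpoint
        self.next = 1           # sequence number for the next basic checkpoint
        self.checkpoints = [0]  # the initial checkpoint has sequence number 0

    def timer_tick(self):
        # next is incremented at periodic time intervals of x time units.
        self.next += 1

    def send(self):
        return self.sn          # M.sn: piggybacked sequence number

    def receive(self, m_sn):
        if self.sn < m_sn:      # forced checkpoint before processing
            self.sn = m_sn
            self.checkpoints.append(self.sn)
        # ... process the message ...

    def basic_checkpoint(self):
        if self.next > self.sn:  # skip if a forced checkpoint already carries
            self.sn = self.next  # a sequence number >= next
            self.checkpoints.append(self.sn)

p1, p3 = Process(1), Process(3)
p1.basic_checkpoint()    # P1 takes a basic checkpoint with sequence number 1
p3.receive(p1.send())    # M.sn = 1 > sn = 0: P3 is forced to checkpoint 1
p3.basic_checkpoint()    # skipped: next (1) is not greater than sn (1)
print(p3.checkpoints)    # [0, 1]
```

The final call shows the skip rule in action: the forced checkpoint already carries sequence number 1, so the basic checkpoint is not taken.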

Figure 4.1: Example illustrating the checkpointing algorithm. (Legend: ■ basic checkpoint, □ forced checkpoint.)

4.2.2 An Example

We illustrate the checkpointing algorithm using the example in Figure 4.1. Message M1 forces P3 to take a checkpoint with sequence number 1 before processing M1, because M1.sn = 1 and sn3 (= 0) < 1 at the time the message is received. Similarly, message M5 also forces P3 to take a checkpoint before processing the message. However, no other message makes its receiving process take a forced checkpoint.

Theorem 12: The quasi-synchronous checkpointing algorithm presented above makes the system ZCF.

Proof: A message sent by a process after taking a checkpoint with sequence number n is received and processed by a process only after that process has taken a checkpoint with sequence number ≥ n. Hence the checkpointing algorithm makes the system ZCF, by Theorem 11. □

Next, we present a consistent global checkpoint collection algorithm based on the quasi-synchronous checkpointing algorithm.

4.3 Consistent Global Checkpoint Collection

When a process Pp wants to determine a consistent global checkpoint containing its latest local checkpoint, it sends a checkpoint request message, along with the sequence number m of its latest local checkpoint, to all the processes. Upon receiving this message, a process Pq sends the sequence number C.sn, where C is the earliest checkpoint of Pq such that C.sn ≥ m; if such a checkpoint does not exist (i.e., all the checkpoints of Pq have sequence numbers < m), then Pq takes a checkpoint, assigns m as the sequence number of the checkpoint taken, and sends the number m to Pp. After Pp receives responses from all the processes with sequence numbers m1, m2, ..., mN, it declares {C1,m1, C2,m2, ..., CN,mN} (here mp = m) as a consistent global checkpoint.

Thus, a process can always establish a recovery line consistent with its latest checkpoint in a non-intrusive manner. This is a very desirable feature for rollback recovery in distributed systems. For example, when a process Pp fails, it can restart from its latest checkpoint and request the other processes to roll back to a checkpoint consistent with Pp's latest checkpoint.

The Consistent Global Checkpoint Collection Algorithm

When process Pp wants to collect a consistent global checkpoint:
  send request_checkpoint(p, sn_p) to all processes, including process Pp;
  after receiving reply(q, m_q) from each process Pq:
    declare S = {Cq,mq | 1 ≤ q ≤ N} as a consistent global checkpoint;

Process Pq, upon receiving request_checkpoint(p, m) from process Pp:
  if (m > sn_q) then
    sn_q := m;
    Take checkpoint C;
    C.sn := sn_q;
    send reply(q, sn_q) to process Pp;
  else
    Find the earliest checkpoint C such that C.sn ≥ m;
    send reply(q, C.sn) to Pp.

An Example Illustrating the Consistent Global Checkpoint Collection

In Figure 4.1, if process P4 wants to establish a global checkpoint consistent with its local checkpoint C4,4, it sends request_checkpoint(4, 4) to all the processes, including itself. In response to this message, P3 sends reply(3, 4) and P4 sends reply(4, 4). However, P2 will take a new checkpoint with sequence number 4 before sending reply(2, 4) to P4. Similarly, P1 takes a new checkpoint with sequence number 4 before sending reply(1, 4) to P4. After receiving all these replies, P4 declares {C1,4, C2,4, C3,4, C4,4} as a consistent global checkpoint. In this example, all the checkpoints in the consistent global checkpoint have the same sequence number; however, in general, this is not necessarily the case.
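The collection algorithm can be sketched centrally, with the message exchange elided. The per-process checkpoint lists below are an illustrative reading of Figure 4.1, not exact data from the dissertation:

```python
def reply(checkpoint_sns, m):
    """Pq's reply: the earliest checkpoint with sequence number >= m,
    taking a new checkpoint with sequence number m if none exists."""
    for sn in checkpoint_sns:
        if sn >= m:
            return sn
    checkpoint_sns.append(m)  # forced checkpoint with sequence number m
    return m

# Checkpoint sequence numbers of P1..P4 (sorted lists, illustrative).
system = {1: [0, 1, 2, 3], 2: [0, 1, 2, 3], 3: [0, 1, 2, 4], 4: [0, 1, 2, 3, 4]}

m = system[4][-1]   # P4 initiates with its latest checkpoint, sequence number 4
recovery_line = {q: reply(sns, m) for q, sns in system.items()}
print(recovery_line)  # {1: 4, 2: 4, 3: 4, 4: 4}
```

As in the worked example, P1 and P2 take new checkpoints numbered 4, while P3 and P4 already have checkpoints with sequence number 4.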

4.3.1 Correctness of Consistent Global Checkpoint Collection

Observation 1: If Pp receives reply(q, mq) from Pq (1 ≤ q ≤ N) in response to its request_checkpoint(p, m) message, then

1. there exists a local checkpoint Cq,mq of process Pq such that mq ≥ m, and

2. all checkpoints taken by Pq prior to checkpoint Cq,mq have sequence numbers less than m.

Observation 2: For any message M sent by Pq (1 ≤ q ≤ N), Cq,mq → send(M) ⟹ M.sn ≥ mq.

Observation 3: For any message M received by Pq (1 ≤ q ≤ N), receive(M) → Cq,mq ⟹ M.sn < mq. Note that the converse is not true.

Observation 4: Pp processes a received message M only after it has taken a checkpoint with sequence number ≥ M.sn.

Safety Property:

Theorem 13: If a process Pp declares S = {C1,m1, C2,m2, ..., CN,mN} to be a consistent global checkpoint, then S indeed is a consistent global checkpoint.

Proof: Let m = mp. Note that process Pp must have sent request_checkpoint(p, m) in order to declare S a consistent global checkpoint. From Observation 1, we have

∀q, 1 ≤ q ≤ N : mq ≥ m.   (4.1)

Suppose S is not a consistent global checkpoint. Then, there exists a message M sent by some process Ps to some process Pr such that Cs,ms → send(M) and receive(M) → Cr,mr. So,

M.sn ≥ ms   (from Observation 2)   (4.2)

M.sn < mr   (from Observation 3)   (4.3)

From equations 4.1, 4.2, and 4.3 we get

m ≤ ms ≤ M.sn < mr.   (4.4)

From Observation 1, all checkpoints taken by process Pr prior to checkpoint Cr,mr have sequence numbers less than m. Since M.sn ≥ m, Pr must have processed M only after a checkpoint with sequence number ≥ m had been taken (from Observation 4). Since receive(M) → Cr,mr, message M must have been received by Pr before the checkpoint Cr,mr was taken. Hence there exists a checkpoint Cr,m' of Pr that was taken before Cr,mr such that m' ≥ m. This contradicts the fact that all the checkpoints taken by process Pr prior to checkpoint Cr,mr have sequence numbers less than m. Hence our assumption that S is not a consistent global checkpoint is wrong. Hence the theorem. □

The following corollary gives a sufficient condition for a set of local checkpoints to be a part of a consistent global checkpoint.

Corollary 4 Let S = {C_{p_1,m_{p_1}}, C_{p_2,m_{p_2}}, ..., C_{p_k,m_{p_k}}} be a set of local checkpoints from distinct processes. Let m = min{m_{p_1}, m_{p_2}, ..., m_{p_k}}. Then S can be extended to a consistent global checkpoint if, for all l (1 ≤ l ≤ k), C_{p_l,m_{p_l}} is the earliest checkpoint of P_{p_l} whose sequence number is ≥ m.

Proof: Without loss of generality, we can assume m = m_{p_1}. Now suppose P_{p_1} initiates a consistent global checkpoint collection by sending request_check_point(p_1, m_{p_1}) to all the processes and declares S' = {C_{1,m_1}, C_{2,m_2}, ..., C_{N,m_N}} as a consistent global checkpoint after receiving replies from all the processes.

Claim: S' ⊇ S.

Proof of Claim: In response to the message request_check_point(p_1, m_{p_1}) of P_{p_1}, each P_{p_l} (1 ≤ l ≤ k) must have sent reply(p_l, m_{p_l}), because C_{p_l,m_{p_l}} is the earliest checkpoint of P_{p_l} whose sequence number is ≥ m. Hence, for each l (1 ≤ l ≤ k), P_{p_1} must have included C_{p_l,m_{p_l}} as the checkpoint of P_{p_l} in the set S'. So S ⊆ S'. Hence S can be extended to a consistent global checkpoint. □

The following corollary gives a sufficient condition for a set of checkpoints, one from each process, to be a consistent global checkpoint.

Corollary 5 Let S = {C_{1,m_1}, C_{2,m_2}, ..., C_{N,m_N}} be a set of local checkpoints, one from each process. Let m = min{m_1, m_2, ..., m_N}. Then S is a consistent global checkpoint if, for all i (1 ≤ i ≤ N), C_{i,m_i} is the earliest checkpoint of P_i such that m_i ≥ m.

Proof: This is a special case of Corollary 4. □
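Corollary 5 translates directly into a procedure: given each process's checkpoint sequence numbers and a target value m, choosing each process's earliest checkpoint with sequence number ≥ m yields a consistent global checkpoint. A minimal illustrative sketch (the function name and data layout are assumptions, not from the dissertation; each process's checkpoints are given as a sorted list of sequence numbers):

```python
import bisect

def consistent_global_checkpoint(checkpoint_sns, m):
    """For each process, pick the earliest checkpoint with sequence number >= m
    (Corollary 5). checkpoint_sns: one sorted sequence-number list per process.
    Returns the chosen sequence number for each process."""
    line = []
    for sns in checkpoint_sns:
        i = bisect.bisect_left(sns, m)
        if i == len(sns):
            raise ValueError("process has no checkpoint with sequence number >= m")
        line.append(sns[i])
    return line
```

For example, with checkpoint sequence numbers [0, 2, 4], [0, 1, 5], and [0, 3, 4] for three processes and m = 3, the selected recovery line is [4, 5, 3].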

Liveness Property:

Theorem 14 The consistent global checkpoint collection algorithm terminates in finite time.

Proof: When a process initiates consistent global checkpoint collection, it sends a

request_check_point message to all the processes. Upon receiving this message, a

process either sends a reply immediately or sends a reply after taking a checkpoint.

Thus, assuming the message delay (i.e., the maximum time taken by a message to travel from one process to another) is bounded by d and the maximum time taken by a process to take a checkpoint is bounded by c, the initiating process would receive replies to its request from all the processes within 2d + c time (we assume the processing time of a message is included in d). Thus, the consistent global checkpoint collection

algorithm terminates in finite time. □

4.4 An Overhead Analysis

In this section, we analyze the overhead involved in the quasi-synchronous checkpointing algorithm. In checkpointing algorithms, three kinds of overhead generally exist: first, the extra control messages required for checkpoint coordination; second, the control information sent on computation messages; and third, the number of forced checkpoints that need to be taken (i.e., the checkpointing overhead).

In the quasi-synchronous checkpointing algorithm, each computation message is piggybacked with the sequence number of the current checkpoint. There is no other message overhead. So, we only analyze the checkpointing overhead involved in the quasi-synchronous checkpointing algorithm, as is done in [54].

Let basic_checknum denote the total number of checkpoints the processes would take asynchronously, and let quasi_synchronous_checknum denote the total number of checkpoints the processes would take for the same computation when the quasi-synchronous checkpointing algorithm is used. As in Wang [54], we define the induction ratio R for a computation as

    R = quasi_synchronous_checknum / basic_checknum.

We use R to analyze the checkpointing overhead of the quasi-synchronous checkpointing algorithm.


Figure 4.2: Communication-induced checkpointing coordination: (a) asynchronous checkpointing and communication pattern; (b) quasi-synchronous checkpointing for the same communication pattern.

In Figure 4.2(a), processes take checkpoints periodically once in every time interval

of x time units. Processes may take a checkpoint anywhere in that time interval. It is clear that none of the checkpoints taken in Figure 4.2(a) is useful, since all of them lie on Z-cycles. When we use the quasi-synchronous checkpointing algorithm for the same computation, the checkpoints taken are shown in Figure 4.2(b). Note that the number of checkpoints in Figure 4.2(b) is the same as the number of checkpoints in Figure 4.2(a). In Figure 4.2(b), every checkpoint taken is useful. Thus, without any additional checkpointing overhead, the quasi-synchronous checkpointing algorithm makes every checkpoint useful for this computation.

Theorem 15 Under periodic checkpointing, suppose each process takes a basic checkpoint in the latter half of every time interval of x time units, and the local clocks of the processes drift by at most δ, where δ < x/2. Suppose further that the variable next_p is incremented every x time units by each process P_p. Then the quasi-synchronous checkpointing algorithm has no additional checkpointing overhead over periodic checkpointing. Thus, R = 1.

Proof: Without loss of generality, we can assume that all processes start simultaneously (if processes start at different times, then they initialize the value of their variable next_p accordingly).

Claim: For any p, 1 ≤ p ≤ N, process P_p takes at most one forced checkpoint in the time interval (t * x, (t + 1) * x].

Proof of Claim: Suppose P_p takes two or more forced checkpoints in time interval (t * x, (t + 1) * x] for some integer t ≥ 0. Let C_1 and C_2 be any two such forced checkpoints, taken in that order. Since each process attempts to take a basic checkpoint once in each time interval (t * x, (t + 1) * x] and next_p is incremented once at the end of each such time interval, the sequence number of the current checkpoint of process P_p at the beginning of this time interval is ≥ t. Thus, C_1.sn ≥ t + 1 and C_2.sn ≥ t + 2. This implies that P_p, in its time interval (t * x, (t + 1) * x], received a message M such that M.sn ≥ t + 2, which implies that there exists some process P_q which took a basic checkpoint with sequence number ≥ t + 2. Thus, P_q must have already passed the first half of its time interval ((t + 1) * x, (t + 2) * x] while P_p is still in its time interval (t * x, (t + 1) * x]. This is impossible, since the drift in the local clock values of the processes is < x/2. This contradiction proves our Claim.

It is also clear that if a process P_p takes a forced checkpoint in a time interval (t * x, (t + 1) * x], it skips taking the basic checkpoint in that time interval, since next_p ≤ sn_p. So, in each time interval (t * x, (t + 1) * x], a process takes either a basic checkpoint or a forced checkpoint, but not both. Thus, after time t * x, each process would have taken exactly t + 1 checkpoints, including its initial checkpoint, if the quasi-synchronous checkpointing algorithm is used on top of periodic checkpointing. Thus, the total number of checkpoints taken by processes using the quasi-synchronous checkpointing algorithm is the same as the total number of checkpoints the processes would have taken under asynchronous periodic checkpointing. Hence R = 1. □

Theorem 16 Assume that the variable next_p is incremented every x time units, where x is the smallest of the checkpoint interval times of all the processes. If processes take basic checkpoints at their own pace, then the induction ratio R ≤ ⌈Q⌉, where Q is the maximum of the ratios of the lengths of the basic checkpoint intervals of any two processes. Moreover, no process needs to roll back to a distance of more than ⌈Q⌉ under rollback recovery.

Proof: We consider the checkpointing overhead in the worst-case scenario. The worst case occurs when each process broadcasts a message to every other process after taking a checkpoint and the message delay is 0. It is clear that the process with the smallest checkpoint interval will force all other processes to take a forced checkpoint after it takes a basic checkpoint. Thus, at any given time, the total number of checkpoints taken by any process is the same as the total number of basic checkpoints the process with the smallest checkpoint interval would have taken under the basic checkpointing scheme.

Thus, if Q denotes the maximum ratio of the lengths of the basic checkpoint intervals of any two processes, the induction ratio R ≤ ⌈Q⌉. Generally, R will be much less than ⌈Q⌉, because if the processes with smaller checkpoint intervals do not send messages to processes with larger checkpoint intervals frequently, then fewer forced checkpoints will be taken. Moreover, even if no messages are exchanged between the processes, |sn_p − sn_q| ≤ ⌈Q⌉ at any given time; i.e., the sequence numbers of the latest checkpoints of the processes will differ by at most ⌈Q⌉. This is because the process that has the smallest checkpoint interval will take a basic checkpoint every x time units, whereas the process with the largest checkpoint interval, Q * x, will take a basic checkpoint every Q * x time units. Hence, the sequence number of the latest checkpoint of the process that has the largest checkpoint interval will catch up with that of the process with the smallest checkpoint interval after every Q * x time units. This bounds the rollback distance to at most ⌈Q⌉ in the case of a failure. Hence the theorem. □
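The bound of Theorem 16 is easy to evaluate numerically. A small illustrative helper (the function name and the example intervals are assumptions, not from the dissertation):

```python
import math

def induction_ratio_bound(intervals):
    """Upper bound ceil(Q) on the induction ratio R of Theorem 16, where Q is
    the maximum ratio of the basic checkpoint interval lengths of any two
    processes."""
    q = max(intervals) / min(intervals)
    return math.ceil(q)
```

For example, with basic checkpoint intervals of 3, 5, and 7 time units, Q = 7/3 and R ≤ ⌈Q⌉ = 3; by the theorem, 3 is also the bound on the rollback distance.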

An example illustrating the case where each process takes basic checkpoints at its own pace is shown in Figure 4.3(a). Process P_1 takes only one basic checkpoint during

Figure 4.3: Communication-induced checkpointing coordination: (a) asynchronous checkpointing and communication pattern; (b) checkpointing using our algorithm.

which process P_2 takes three basic checkpoints. If P_1 fails after receiving a message but before taking checkpoint C_{1,1}, it will roll back to the beginning, and it will also force P_2 to roll back to the beginning under rollback recovery, thus rendering the checkpoints taken by process P_2 useless. For the same computation and basic checkpointing pattern, Figure 4.3(b) shows the checkpoints taken by the processes using the quasi-synchronous checkpointing algorithm. Even though P_1 is forced to take more checkpoints than it would have taken otherwise, this helps the recovery line to progress as the processes progress in their computation. This helps in reducing rollback in the event of a failure. For example, if P_1 fails after receiving a message, it would roll back to checkpoint C_{1,3}, and as a result P_2 will be forced to roll back to C_{2,3}.

4.5 Comparison with Existing Algorithms

In this section, we compare the quasi-synchronous checkpointing algorithm with

the existing algorithms. The algorithms proposed in [29, 7] have a two-phase structure. This causes processes to suspend normal computation while making checkpoint decisions, which greatly increases the overhead during normal computation. The quasi-synchronous checkpointing algorithm does not cause any such overhead and completely avoids the domino effect during recovery.

The synchronous checkpointing algorithm of Silva and Silva [14] requires a fixed process to take a checkpoint and send request messages to all the other processes for taking a checkpoint consistent with its current checkpoint; it also requires the sequence number of the current checkpoint to be piggybacked on each computation message. This incurs the extra overhead of control messages for each checkpoint taken; moreover, if the coordinator fails, a new coordinator needs to be elected. The checkpointing algorithm of Elnozahy and Zwaenepoel [16] is similar to that of Silva and Silva [14].

In Acharya et al.'s [1] checkpointing algorithm for mobile computing systems, a process takes a checkpoint whenever a message reception is preceded by a message transmission. This might force the processes to take as many checkpoints as the number of messages if message receptions and transmissions are interleaved, which would result in high checkpointing overhead. The checkpointing overhead of the quasi-synchronous checkpointing algorithm, however, does not depend on the communication pattern. Prakash and Singhal [42] proposed a low-cost checkpointing algorithm for mobile computing systems. In their algorithm, processes advance their checkpoints asynchronously. When a process advances its checkpoint, it requests all the processes from which it received a message to advance their checkpoints; all the processes that receive this request in turn request the processes from which they received a message since their last checkpoint, and so on.

Wang et al. [54] proposed lazy checkpoint coordination for bounding rollback propagation. Like our approach, their technique requires the checkpoint number to be piggybacked on computation messages so that the receiving processes can take an extra checkpoint when required. However, their approach has higher checkpointing overhead than our checkpointing algorithm, since processes do not skip taking basic checkpoints after taking forced checkpoints.

4.6 Summary of Results

In this chapter we presented a novel quasi-synchronous checkpointing algorithm.

The algorithm makes the system ZCF, and hence every checkpoint taken is useful. The algorithm does not require any additional control-message overhead. It has no additional checkpointing overhead when superimposed over traditional periodic checkpointing. However, when it is superimposed on a system which does not take checkpoints at periodic time intervals, it results in additional checkpointing overhead; in this case, the checkpoint induction ratio is ≤ ⌈Q⌉, where Q is the maximum of the ratios of the checkpoint interval times of any two processes.

CHAPTER 5

RECOVERY UNDER SINGLE PROCESS FAILURE

5.1 Introduction

In Chapter 4, we presented a quasi-synchronous checkpointing algorithm. In this chapter, we present a low-overhead recovery algorithm based on that quasi-synchronous checkpointing algorithm. The algorithm is fully asynchronous; i.e., a failed process needs only to roll back to its latest checkpoint and inform other processes about the rollback: it can resume normal computation without waiting for other processes to roll back to a consistent state. To handle the various types of messages that arise due to rollback, messages are selectively logged at the receiver end; this selective message logging reduces message-logging overhead during normal operation. Moreover, unlike the existing recovery algorithms, it does not use vector timestamps to track dependency among events.

The rest of the chapter is organized as follows. In Section 5.2, we present a basic recovery algorithm based on the quasi-synchronous checkpointing algorithm presented in Chapter 4. The basic recovery algorithm rolls back the processes to a consistent global checkpoint when a failure occurs. In Section 5.3, we extend the basic recovery algorithm to a comprehensive recovery algorithm which appropriately handles the various types of abnormal messages that arise during rollback and restores the system to a consistent state in the event of a failure. In Section 5.4, we compare the recovery algorithm with the existing work. Section 5.5 summarizes the results of this chapter.

5.2 Basic Recovery Algorithm

In this section, we present a basic recovery algorithm based on the quasi-synchronous checkpointing algorithm presented in Chapter 4. The basic recovery algorithm rolls back the processes to a consistent global checkpoint in the event of the failure of a process. The notations sn_p and C.sn have the same meaning as in Chapter 4. We assume that if a process fails, no other process fails until all the processes have rolled back to a consistent global checkpoint. Now we present the basic recovery algorithm (hereafter called the BRA).

The Basic Recovery Algorithm

When process P_p fails:
    Roll back to the latest checkpoint C;
    Send roll_back_to(C.sn) to all the other processes.

Process P_q, on receiving a roll_back_to(n) message:
    If sn_q ≥ n then
        Find the earliest checkpoint C of P_q such that C.sn ≥ n;
        Roll back to C;
        sn_q := C.sn;
        Discard all the checkpoints beyond C
    Else  {in this case the process does not roll back at all}
        Take a checkpoint C;
        C.sn := n;
        sn_q := C.sn   {update sn_q}

5.2.1 An Explanation of the Basic Recovery Algorithm

The BRA works as follows. When a process P_p fails, it restarts from its latest checkpoint and sends a roll_back_to(n) message to all the other processes, where n is the sequence number of the latest checkpoint of P_p. Upon receiving this message, a process P_q restarts from its earliest checkpoint whose sequence number is ≥ n; if there is no such checkpoint (i.e., all the existing checkpoints of P_q have sequence numbers < n), then it takes a checkpoint and assigns n as the sequence number of the checkpoint taken. So, we can assume for simplicity that all the processes (including the ones that just take a checkpoint) roll back to their earliest checkpoint whose sequence number is ≥ n. The following theorem establishes the fact that if a process fails, then all the processes in fact roll back to a consistent global checkpoint.
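The BRA's receiver-side rule can be sketched as a small function over a process's (increasing) list of checkpoint sequence numbers; the function name and representation are illustrative assumptions, not the dissertation's code:

```python
def handle_rollback(checkpoints, n):
    """BRA at a non-failed process (illustrative sketch): roll back to the
    earliest checkpoint with sequence number >= n, discarding all later
    checkpoints; if no such checkpoint exists, take a fresh checkpoint with
    sequence number n. checkpoints: strictly increasing sequence numbers."""
    for i, sn in enumerate(checkpoints):
        if sn >= n:
            return checkpoints[: i + 1]    # roll back to this checkpoint
    return checkpoints + [n]               # all sns < n: take a new checkpoint
```

For example, handle_rollback([0, 2, 5, 7], 4) yields [0, 2, 5] (the process rolls back to its checkpoint with sequence number 5), while handle_rollback([0, 1, 2], 4) yields [0, 1, 2, 4] (a new checkpoint with sequence number 4 is taken). By Theorem 17, the checkpoints selected this way across all processes form a consistent global checkpoint.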

Theorem 17 Assume that upon receiving a roll_back_to(n) message from process P_p, each process P_q (1 ≤ q ≤ N) rolls back to its checkpoint C_{q,m_q}. Then the set

S = {C_{1,m_1}, C_{2,m_2}, ..., C_{N,m_N}}

is a consistent global checkpoint. Here, m_p = n.

Proof: Note that n = min{m_q | C_{q,m_q} ∈ S}, and hence m_q ≥ n for all C_{q,m_q} ∈ S. Moreover, if C_{q,m_q} ∈ S, then all the checkpoints of P_q that precede C_{q,m_q} have sequence numbers less than n. Hence, by Corollary 5, S is a consistent global checkpoint. □

5.3 A Comprehensive Recovery Algorithm

Under the basic recovery algorithm, when a process fails, all the processes roll back to a consistent global checkpoint by Theorem 17. However, this alone may not restore the system to a consistent state. Restoring the system to a consistent state involves appropriately handling the messages received/sent before and after the rollback. Generally, message logging along with checkpointing is used to handle the various types of abnormal messages that arise during failure recovery. Existing message logging schemes can be broadly classified into two categories: optimistic and pessimistic.

In pessimistic message logging, messages received by a process are stored in stable storage before they are processed. This ensures that messages that have been processed are available in stable storage when required during recovery. Since all the messages received by a process are stored first in stable storage, handling lost messages becomes simple during failure recovery. The drawback of pessimistic message logging, however, is that it causes performance degradation, since each message received by a process must be stored in stable storage before processing.

On the other hand, in optimistic message logging, messages are logged to volatile storage as they are received and then processed; they are periodically flushed to a stable log. This allows the system to use its idle time for volatile-to-stable log flushes, resulting in more efficient use of the processor. However, since messages stored in volatile storage are lost during a process failure, when a process rolls back to a checkpoint due to a failure, some of the messages may not be available in the stable log for replay.

Our comprehensive recovery algorithm uses neither pessimistic nor optimistic message logging. It uses selective pessimistic message logging, in which only those messages that could be needed for replay after a rollback are logged. With each checkpoint C, there is an associated message log, denoted C.mesglog. A message received after checkpoint C is logged into C.mesglog only if the message is likely to be replayed if the process rolls back to C in the future.

Before presenting the comprehensive recovery algorithm, we classify the messages that need to be handled during recovery into different types. Classifying messages into various types will help clarify the message management issues involved in recovery.

5.3.1 A Message Classification

Figure 5.1: Various Types of Messages.

We use Figure 5.1 to help classify the different types of messages that need to be handled during a recovery. Let us assume that process P_1 in Figure 5.1 fails at X and rolls back to its latest checkpoint C_{1,8}. As a result, processes P_2, P_3, and P_4 roll back to their corresponding checkpoints (for example, C_{4,8} for P_4) under the BRA. The recovery line corresponding to this failure is shown in the figure by a zigzag line.

Lost messages: These are messages whose send events are not undone but whose receive events are undone. Such messages arise when a process rolls back to a checkpoint prior to the reception of the message while the sender does not roll back to a checkpoint prior to the send event of the message. In Figure 5.1, messages M1 and M3 are lost messages.

Delayed messages: These are messages whose receive events are not recorded because the message was received after the receiving process had rolled back. In this case, the receiving process should be able to determine whether the send event of the message will be undone due to the rollback. For example, message M2 in Figure 5.1 is a delayed message.

Orphan messages: These are messages whose send has been undone but whose receive has not been undone. Orphan messages do not arise if processes roll back to a consistent global checkpoint. So, orphan messages do not arise under the BRA.

Duplicate messages: Duplicate messages arise due to the replaying of messages from the log during recovery. For example, in Figure 5.1, after rolling back to C_{1,8}, if P_1 replays from its log a message that was sent by P_3, that message will be a duplicate, because it will be resent by P_3 after its rollback, if we assume deterministic computation.

Now we extend the BRA to a comprehensive recovery algorithm which handles all the above types of messages appropriately and restores the system to a consistent state in the event of a failure. Since processes roll back to a consistent global checkpoint under the BRA, orphan messages do not arise. Thus, we need only find ways to handle the following types of messages: messages lost because of rollback, delayed messages that are received after a failed process recovers from failure, and duplicate messages.
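The classification above can be summarized by whether a message's send and receive events are undone by the rollback; delayed and duplicate messages are runtime phenomena (late arrival and replay) rather than combinations of undone events, so they are handled separately. An illustrative helper (the names are assumptions, not from the dissertation):

```python
def classify_message(send_undone, receive_undone):
    """Classify a message relative to a recovery line by whether its send and
    receive events are undone (Section 5.3.1)."""
    if send_undone and not receive_undone:
        return "orphan"        # cannot arise under the BRA (Theorem 17)
    if receive_undone and not send_undone:
        return "lost"          # must be replayed from the receiver's log
    return "consistent"        # both undone or neither: no special handling
```

Under the BRA only the "lost" case (plus late-arriving delayed messages and replay-induced duplicates) needs special treatment, which is exactly what the comprehensive algorithm below addresses.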

5.3.2 Restoring the Processes to a Consistent Global Checkpoint

Each recovery has an associated incarnation number and a recovery line number. Whenever a process fails, it initiates recovery with a new incarnation number and a new recovery line number. The recovery line number and the incarnation number uniquely determine the recovery initiated by a process. Each process P_p keeps the current incarnation number and the recovery line number in the variables inc_p and rec_line_p, respectively. The values of these variables are kept in stable storage so that they are available in the event of a process failure. Initially, for all p, inc_p = 0 and rec_line_p = 0. When a process P_p fails and restarts, it rolls back to its latest checkpoint C, increments inc_p, sets rec_line_p := C.sn (= sn_p), and sends the message rollback(inc_p, rec_line_p) to all other processes. Upon receiving this message, a process P_q sets inc_q := inc_p and rec_line_q := rec_line_p and rolls back to its earliest checkpoint whose sequence number is ≥ rec_line_p; if P_q does not have such a checkpoint (i.e., all the checkpoints of P_q have sequence numbers less than rec_line_p), then it does not roll back but takes a new checkpoint with sequence number rec_line_p and continues. Thus, all the processes roll back to their earliest checkpoint whose sequence number is ≥ rec_line_p.

The checkpoints to which the processes roll back in response to the message rollback(inc_p, rec_line_p) form a consistent global checkpoint S_p (from Theorem 17), where

S_p = {C_{r,m_r} | C_{r,m_r} is the earliest checkpoint of P_r with sequence number ≥ rec_line_p}.

Note that S_p is uniquely determined by rec_line_p. So, we call rec_line_p "the recovery line number" of the rollback. Note that rec_line_p is basically the sequence number of the latest checkpoint of the failed process P_p. In general, if S_p is the recovery line established by the message rollback(inc_p, rec_line_p), then the checkpoints that lie to the left of S_p have sequence numbers less than rec_line_p and those that lie to the right have sequence numbers greater than rec_line_p. This nice property of the recovery line number helps in determining the lost, delayed, and duplicate messages. In Section 5.3.3, we explain how this property is exploited in our algorithm to handle the various types of messages during recovery.

For example, in Figure 5.2, if process P_1 fails at the point X, then it rolls back to its latest checkpoint C_{1,10} and sends a rollback(1, 10) message to all other processes. In response to this message, processes P_2, P_3, and P_4 roll back to the checkpoints C_{2,12}, C_{3,10}, and C_{4,10}, respectively. The recovery line for this rollback is shown by the zigzag line in the figure. The recovery line number for this recovery is 10. Note that the sequence numbers of all the checkpoints that lie to the left of this recovery line are less than 10 and the sequence numbers of all the checkpoints that lie to the right are greater than 10.

5.3.3 Handling Messages

In addition to the sequence number of the latest checkpoint, each message is piggybacked with the current recovery line number and the incarnation number associated with that recovery line number. The incarnation number and the recovery line number piggybacked with a message M are denoted by M.inc and M.rec_line, respectively.

We selectively log received messages to handle messages lost during rollback. This selective message logging helps reduce the overhead involved in message logging. First, we explain how messages lost due to rollback are determined and replayed, and also how duplicate messages are handled.

Handling Lost and Duplicate Messages

When a process P_q receives the message rollback(inc_p, rec_line_p) from a process P_p, it finds its earliest checkpoint C with sequence number ≥ rec_line_p, combines the message logs associated with the checkpoints that succeed C, and assigns the result to C.mesglog; in other words, all the messages that were logged after the checkpoint C form the message log of C. Then, it restores the checkpoint C and replays from C.mesglog only those messages whose send will not be undone; in other words, a process replays only those messages that originated to the left of the current recovery line and were delivered to the right of the current recovery line. This message replaying strategy takes care of the lost messages and eliminates duplicate messages. This is because lost messages are precisely those messages which were sent from the left of the recovery line and received to the right of the recovery line. Since only lost messages are replayed, duplicate messages do not arise due to replay. More formally:

Rule for handling lost and duplicate messages: After a process P_q rolls back to a checkpoint C, it replays a message M from its message log if and only if M was received after the checkpoint C and M.sn < rec_line_q.

In Figure 5.2, if process P_1 fails at the point X, it rolls back to its latest checkpoint C_{1,10}, increments inc_1 to 1, sets rec_line_1 := 10, and sends a rollback(1, 10) message to all other processes. When P_2 receives this rollback message, it sets inc_2 := 1 and rec_line_2 := 10, and rolls back to C_{2,12}, since C_{2,12} is the earliest checkpoint of P_2 whose sequence number is ≥ 10. After rolling back to C_{2,12}, P_2 replays the messages

Figure 5.2: Handling of messages during recovery.

M1 and M2 but does not replay M3; this is because the send of M1 has not been undone by P_1 and the send of M2 has not been undone by P_3, while the send of M3 was undone by P_4 due to its rollback to the checkpoint C_{4,10}. In the figure, messages sent before the rollback are shown by solid arrows and messages sent after the rollback are shown by broken arrows. So, M.inc = 0 for all the messages shown by solid arrows and M.inc = 1 for all the messages shown by broken arrows.
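In code, the replay decision after rolling back to a checkpoint C reduces to a filter over C's message log. The sketch below is an illustrative assumption (each log entry is assumed to record the piggybacked sequence number):

```python
def messages_to_replay(mesglog, rec_line):
    """Rule for handling lost and duplicate messages: after rolling back to a
    checkpoint C, replay exactly those logged messages whose send lies to the
    left of the recovery line, i.e. M.sn < rec_line. Messages sent to the
    right will be resent by their rolled-back senders, so replaying them
    would create duplicates."""
    return [m for m in mesglog if m["sn"] < rec_line]
```

With the recovery line number 10 of the example above, a logged message with sequence number 6 is replayed, while one with sequence number 11 is dropped (its sender will resend it).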

Next, we discuss how delayed messages are handled and how messages are logged selectively. To answer these questions, it is sufficient to answer the more general question of how a received message is handled.

Handling Delayed Messages and Message Logging

Suppose process P_p receives a message M from process P_q. The following three cases arise:

Case (i): M was sent in a previous incarnation (i.e., M.inc < inc_p).

This means that P_q was not aware, at the time of sending M, of the recoveries with incarnation numbers inc such that M.inc < inc ≤ inc_p. For any incarnation number n, let recnum(n) denote the recovery line number associated with the recovery with incarnation number n. When P_q comes to know of the recovery with incarnation number M.inc + 1, it will roll back to its earliest checkpoint whose sequence number is ≥ recnum(M.inc + 1). So, P_q will roll back to a point prior to the send event of M iff M.sn ≥ recnum(M.inc + 1). Hence, M should be processed by P_p if and only if M.sn < recnum(M.inc + 1); otherwise, it should be discarded. Thus, M is a delayed message in this case iff M.sn < recnum(M.inc + 1). If M.sn < recnum(M.inc + 1), then M is logged into cur_chkpt_p.mesglog before being processed, because if P_p has to roll back to its current checkpoint in the future due to the failure of some process, and the recovery line number of that rollback is greater than M.sn, then the send of M will not be undone but its receive will be undone, and hence M will have to be replayed. So, the rule for determining a delayed message is as follows:

Rule for determining delayed messages: A message M received by a process P_p is a delayed message iff M.inc < inc_p and M.sn < recnum(M.inc + 1).

For example, in Figure 5.2, M4 is logged and then processed by P_2 because M4.sn = 6 < recnum(M4.inc + 1) = 10 and M4.inc = 0 < inc_2 = 1, whereas M5 is discarded by P_3 because M5.sn = 11 > recnum(M5.inc + 1) = 10 and M5.inc = 0 < inc_3 = 1.

Case (ii): M was sent in the current incarnation (i.e., M.inc = inc_p).

In this case, if M.sn < sn_p, then M is logged before being processed; this is because if P_p ever needs to roll back to its current checkpoint, M will have to be replayed if the recovery line number of that rollback is greater than M.sn. If M.sn > sn_p, then M is processed by P_p after P_p takes a checkpoint with sequence number M.sn, and M is not logged. If M.sn = sn_p, the message is processed without logging or taking a new checkpoint. From Case (i) and Case (ii), we have the following message logging rule.

Message Logging Rule: When a process P_p receives a message M, it logs the message before processing it iff ((M.inc < inc_p) and (M.sn < recnum(M.inc + 1))) or ((M.inc = inc_p) and (M.sn < sn_p)).

For example, in Figure 5.2, message M_7 was sent by process P_1 after it had recovered, and it was received by P_2 after P_2 had rolled back. In that case, M_7.inc = inc_2 = 1 and M_7.sn = 10 < sn_2 = 12; M_7 has to be logged before being processed because if P_2 has to roll back to checkpoint C_{2,12} due to the failure of some process in the future, then P_2 will have to replay M_7 if the recovery line number of that rollback is > 10 (i.e., 11 or 12).

Case (iii): M was sent in a future incarnation (i.e., M.inc > inc_p).

In this case, P_p comes to know of a recovery initiated by a process through the message M (note that in this case inc_p + 1 = M.inc, since we assume that a new failure occurs only after all the processes have rolled back in response to the previous failure). So, it sets rec_line_p := M.rec_line and inc_p := M.inc, and rolls back to its earliest checkpoint with sequence number ≥ M.rec_line. After rolling back, since M.inc = inc_p, the message is handled as in Case (ii). In the figure, M_8 would be an example of such a message if it was received by P_3 before its rollback due to the failure of P_1.

Salient Features of the Recovery Algorithm

Following are the important features of the recovery algorithm.

• When a process fails and sends the rollback message, all other processes know exactly to which checkpoint they have to roll back, and all the processes know exactly which messages need to be replayed to cope with messages lost due to rollback. The rollback message sent by the failed process is the only control message exchanged during the recovery initiated by a failed process. No additional communication among processes is needed.

• Recovery is fully asynchronous. No process needs to wait for other processes to roll back. No explicit synchronization is required during the recovery. After all the processes roll back, the system is restored to a consistent state.

• A failed process only needs to roll back to its latest checkpoint. There always exists a consistent global checkpoint which contains this checkpoint.

• With each message, only three integers are piggybacked. No additional control messages are exchanged during failure-free operation. Thus, during failure-free operation, the message overhead is very low.

• Messages are only selectively logged, so the message logging overhead is low.

• Delayed, duplicate, and lost messages are handled easily. Handling all these types of messages does not require any explicit synchronization. Thus, the overhead involved during recovery is very low.

Now, we present the comprehensive algorithm formally.

5.3.4 Formal Description of the Algorithm

We first summarize the notations used.

C.sn : Sequence number of the checkpoint C.

C.mesglog : Message log associated with checkpoint C.

M.sn : Sequence number piggybacked with the message M.

M.inc : Incarnation number piggybacked with the message M.

M.rec_line : Recovery line number piggybacked with the message M.

Cur_chkpt_p : The current checkpoint (i.e., the latest checkpoint) of process P_p.

All the messages are application messages, except the rollback message, which is a control message sent by a failed process.

Comprehensive Recovery Algorithm

Initialization of Data Structures at P_p
  sn_p : integer (:= 0);       {sequence number of current checkpoint, initialized to 0}
  next_p : integer (:= 1);     {seq. number to be assigned to the next basic checkpoint, initialized to 1}
  inc_p : integer (:= 0);      {sequence number of current incarnation, initialized to 0}
  rec_line_p : integer (:= 0); {latest recovery line number, initialized to 0}

When it is time for process P_p to increment next_p
  next_p := next_p + 1;        {next_p incremented at periodic time intervals}

When it is time for process P_p to take a basic checkpoint
  if next_p > sn_p then        {skip a basic checkpoint if next_p <= sn_p, i.e., if a forced checkpoint C with C.sn >= next_p has already been taken}
    Take checkpoint C;
    C.sn := next_p;
    sn_p := C.sn;

When process P_p sends a message M
  M.sn := sn_p;                {sequence number of current checkpoint piggybacked with M}
  M.rec_line := rec_line_p;    {current recovery line number piggybacked with M}
  M.inc := inc_p;              {current incarnation number piggybacked with M}
  send(M);

When process P_p receives a message M
  if (M.inc < inc_p) then
    if (M.sn < recnum(M.inc + 1)) then   {M is a delayed message}
      log the message M into Cur_chkpt_p.mesglog;
      Process the message M;
    else
      Discard the message M;             {this message will be resent}
  else if (M.inc = inc_p) then
    if (M.sn < sn_p) then                {note that in this case, M.sn >= rec_line_p}
      log M into Cur_chkpt_p.mesglog;
    else if (M.sn > sn_p) then
      Take checkpoint C;
      C.sn := M.sn;
      sn_p := C.sn;
    Process M;
  else if (M.inc > inc_p) then
    rec_line_p := M.rec_line;
    inc_p := M.inc;
    Roll_back(p);                        {procedure call}
    Replay_logged_messages(p);           {procedure call}
    Process the message M;
  continue as normal;

Recovery initiated by process P_p after failure
  Restore Cur_chkpt_p;
  inc_p := inc_p + 1;
  rec_line_p := sn_p;
  send rollback(inc_p, rec_line_p) to all other processes;
  Replay_logged_messages(p);             {procedure call}
  continue as normal;

Process P_p upon receiving rollback(inc_q, rec_line_q) from process P_q
  if (inc_q > inc_p) then   {if inc_p >= inc_q, P_p is already aware of the recovery with incarnation inc_q through a message sent by a process that already rolled back}
    inc_p := inc_q;
    rec_line_p := rec_line_q;
    Roll_back(p);                        {procedure call}
    Replay_logged_messages(p);           {procedure call}
    continue as normal;
  else
    Ignore the rollback message;

Procedure Roll_back(p : integer)
  if rec_line_p > sn_p then   {no need to roll back, but a new checkpoint is taken that becomes part of the recovery line}
    Take checkpoint C;
    C.sn := rec_line_p;
    sn_p := C.sn;
  else
    Find the earliest checkpoint C such that C.sn >= rec_line_p;
    sn_p := C.sn;
    Cur_chkpt_p := C;
    Cur_chkpt_p.mesglog := Combine_{C'.sn >= rec_line_p}(C'.mesglog);
    Restore Cur_chkpt_p;
    Delete all the checkpoints beyond Cur_chkpt_p from the stable storage;

Procedure Replay_logged_messages(p : integer)
  for each message M in Cur_chkpt_p.mesglog do
    if M.sn < rec_line_p then
      Replay the message M;
    else
      Delete M from Cur_chkpt_p.mesglog;
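To make the rollback and replay procedures concrete, the following Python sketch models a process's checkpoints and message logs (the class and function names are ours, introduced only for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    sn: int                                       # sequence number
    mesglog: list = field(default_factory=list)   # logged (M.sn, payload) pairs

def roll_back(checkpoints, rec_line):
    """Restore the earliest checkpoint C with C.sn >= rec_line, combine the
    message logs of C and all later checkpoints, and delete the later
    checkpoints from stable storage (Procedure Roll_back)."""
    idx = min(i for i, c in enumerate(checkpoints) if c.sn >= rec_line)
    cur = checkpoints[idx]
    for later in checkpoints[idx + 1:]:
        cur.mesglog.extend(later.mesglog)
    del checkpoints[idx + 1:]
    return cur

def replay_logged_messages(cur, rec_line):
    """Replay exactly the lost messages (M.sn < rec_line); delete the rest
    (Procedure Replay_logged_messages)."""
    cur.mesglog = [m for m in cur.mesglog if m[0] < rec_line]
    return cur.mesglog

# A process rolls back to recovery line 10: the checkpoint with sequence
# number 12 is discarded, the logged message with sn = 6 is replayed, and
# the one with sn = 10 is deleted (its send will also be undone and resent).
cps = [Checkpoint(0), Checkpoint(10, [(6, "a")]), Checkpoint(12, [(10, "b")])]
cur = roll_back(cps, 10)
assert cur.sn == 10 and len(cps) == 2
assert replay_logged_messages(cur, 10) == [(6, "a")]
```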

5.3.5 Correctness of the Algorithm

First, we summarize the following observations.

Observation 5: After a process P_p rolls back to checkpoint C, it replays a message M from its message log if and only if it was received after the checkpoint C and M.sn < rec_line_p.

Observation 6: A message M with M.inc < inc_p received by a process P_p is a delayed message iff M.sn < recnum(M.inc + 1).

Observation 7: A message M received and processed by P_p after the current checkpoint C is logged into C.mesglog if and only if either (M.inc < inc_p and M.sn < recnum(M.inc + 1)) or (M.inc = inc_p and M.sn < sn_p). Also, P_p discards a received message M iff (M.inc < inc_p and M.sn ≥ recnum(M.inc + 1)). It follows that a message M received and processed by P_p is logged into Cur_chkpt_p.mesglog only if M.sn < sn_p.

Observation 8: A process P_p rolls back to a checkpoint in one of the following three situations:

1. P_p fails,

2. P_p receives a rollback message which carries an incarnation number > inc_p, or

3. P_p receives a message M with M.inc > inc_p (in this case, M must have been sent by a process which already rolled back).

Lemma 7: When a process P_p rolls back to a checkpoint C, a message M that was received after C is a lost message if and only if M.sn < rec_line_p.

Proof: If P_p rolls back to C, then C is the earliest checkpoint of P_p such that C.sn ≥ rec_line_p, and all the other processes also roll back to their earliest checkpoint whose sequence number is ≥ rec_line_p. So, the send event of M will not be undone by its sender iff M.sn < rec_line_p. Hence, a message M that was received after C is a lost message iff M.sn < rec_line_p. □

Since messages are logged selectively, we need to prove that all the messages lost due to a rollback will be available in stable storage for replay. This is proved in the following lemma.

Lemma 8: When a process P_p rolls back to a checkpoint C, lost messages are available in stable storage for replay.

Proof: When process P_p rolls back to a checkpoint C, each of the other processes also rolls back to its earliest checkpoint whose sequence number is ≥ rec_line_p. So, if process P_p rolls back to a checkpoint C, a message M that was received after C will not be resent by its sender if it was sent from the left of the current recovery line (i.e., if M.sn < rec_line_p). Since rec_line_p ≤ C.sn, from Observation 7, it follows that all lost messages are in the stable storage. □

Lemma 9: When a process P_p rolls back to a checkpoint C, it replays a message M from its stable storage if and only if M is a lost message.

Proof: From Observation 5, after P_p rolls back to C, it replays a message M from its stable storage if and only if M was received after checkpoint C and M.sn < rec_line_p. Thus, it is enough to prove that a message M is a lost message if and only if it was received after the checkpoint C was taken and M.sn < rec_line_p. This follows from Lemma 7. □

Lemma 10: The recovery algorithm handles delayed messages appropriately.

Proof: Note that a message M received by a process P_p is a delayed message if and only if M.inc < inc_p and M.sn < recnum(M.inc + 1), from Observation 6. It is also clear from the algorithm that P_p processes a message M with M.inc < inc_p if and only if M.sn < recnum(M.inc + 1); otherwise, it discards it. Thus, delayed messages are handled appropriately. □

Theorem 18: If process P_p sends the rollback(inc_p, rec_line_p) message after its failure, the system is restored to a consistent state after all processes roll back in response to this message.

Proof: From Observation 8, a process comes to know about this rollback message either after receiving it directly from the failed process or after receiving a message sent by a process that already received this rollback message. In either case, as a result of this rollback message, all processes roll back to their earliest checkpoint whose sequence number is ≥ rec_line_p. By Theorem 17, the checkpoints to which the processes roll back form a consistent global checkpoint. A process that rolls back handles lost and delayed messages appropriately by Lemmas 9 and 10. Duplicate messages do not arise because, by Lemma 9, only lost messages are replayed after a process rolls back. Thus, the system is restored to a consistent state after all processes roll back. □

A drawback of this recovery algorithm is that it can tolerate only one failure at a time. In practical applications, however, several processes could fail concurrently. In the next chapter, we present an asynchronous recovery algorithm to cope with concurrent failure of multiple processes.

5.4 Comparison With Existing Work

Many of the existing recovery algorithms use vector timestamps [22, 24, 41, 47, 48, 50] to track dependency between checkpoints and events. Vector timestamps generally result in high message overhead during failure-free operation. The optimistic recovery algorithm proposed by Strom and Yemini [50] suffers from the domino effect. The recovery protocol based on vector clocks proposed by Peterson and Kearns [41] is synchronous, tolerates a single process failure, and requires the channels to be FIFO.

The recovery algorithm proposed by Sistla and Welch [47] is synchronous, handles a single process failure, requires the channels to be FIFO, and uses vector timestamps. Smith and Johnson [48] proposed an asynchronous recovery algorithm for multiple process failures; however, the size of the vector timestamp is O(N² · f), where f is the maximum number of failures of any single process. Johnson and Zwaenepoel [24] proposed a centralized protocol to optimistically recover the maximum recoverable state.

Our recovery algorithm does not require vector timestamps. Channels need not be FIFO. Recovery is fully asynchronous. Recovery requires only one rollback message to be sent to all other processes. The maximum number of rollbacks of a process per failure is one. When the process whose latest checkpoint has the maximum sequence number fails, it will not force any other process to roll back. If the process whose latest checkpoint has the lowest sequence number fails, it will force the other processes to roll back a distance of at most Q checkpoints, where Q is the maximum of the ratios of the checkpoint interval lengths of any two processes.

5.5 Summary of Results

The quasi-synchronous checkpointing algorithm guarantees at all times the existence of a recovery line consistent with any checkpoint of any process. The recovery algorithm exploits this property of the checkpointing algorithm to asynchronously restore the system to a state consistent with the latest checkpoint of a failed process.

Unlike other existing recovery algorithms, our recovery algorithm does not use vector timestamps for tracking dependency between checkpoints. The overhead involved in the recovery is low, since messages are logged and replayed selectively and there is no explicit synchronization overhead involved during recovery.

CHAPTER 6

RECOVERY UNDER CONCURRENT FAILURE OF MULTIPLE PROCESSES

6.1 Introduction

In Chapter 5, we presented a recovery algorithm based on the quasi-synchronous checkpointing algorithm of Chapter 4. This recovery algorithm can handle only a single failure. However, in real-world situations, multiple failures can occur concurrently; i.e., before recovery due to one failure is completed, more failures could occur. In this chapter, we present a recovery algorithm which can deal with concurrent failure of multiple processes.

The rest of the chapter is organized as follows. In Section 6.2, we present the system model. In Section 6.3, we present the recovery algorithm. Correctness of the recovery algorithm is proved in Section 6.4. In Section 6.5, we compare the recovery algorithm with existing algorithms. Section 6.6 summarizes the results.

6.2 System Model

In this chapter, we assume that a distributed computation consists of N sequential processes denoted by P_0, P_1, ..., P_{N-1} running concurrently on a set of computers in the network. Note that processes are numbered from 0 to N - 1 instead of 1 to N. Since our algorithm uses the mod N operation on incarnation numbers of recoveries, this assumption on the range of process ids helps in the presentation of the algorithm. The computation is asynchronous. Messages are exchanged through reliable communication channels, whose transmission delays are finite but arbitrary. No assumption is made about the FIFO (First-In-First-Out) nature of the communication channels. Processes are fail-stop. All failures are detected immediately and result in halting the failed processes and initiating recovery action [50]. Multiple processes could fail concurrently. A process can be inactive due to failure for an arbitrarily long but finite time.

6.3 A Comprehensive Recovery Algorithm

In this section, we address the issues involved in handling concurrent failure of multiple processes and present a recovery algorithm which can handle such failures. First, let us explain by means of an example why the recovery algorithm given in the previous chapter will not restore the system to a consistent state when multiple failures occur concurrently. Consider the distributed computation of Figure 6.1. P_0 fails at the point F_0 and rolls back to its latest checkpoint C_{0,1}, increments inc_0 to 1, and sets rec_line_0 to 1. After the rollback, it sends the message rollback(1,1) to all other processes. As a result, all other processes roll back to the recovery line labeled (1,1).

After the recovery for the failure at F_0 is complete, suppose P_1 and P_2 fail concurrently at the points indicated by F_1 and F_2, respectively. As a result of the failure at F_1, P_1 rolls back to its latest checkpoint C_{1,5}, increments inc_1 to 2, and sets rec_line_1

[Figure: recovery lines labeled (0,0), (1,1), (2,3), and (2,5).]

Figure 6.1: An example with concurrent failure of multiple processes.

to 5; then it sends the message rollback(2,5) to all other processes. Similarly, P_2 rolls back to C_{2,3} and sends the message rollback(2,3) to all other processes. When P_0 receives the rollback message sent by P_1, it rolls back to its checkpoint C_{0,6} as shown in the figure. In the meantime, P_3 receives the rollback message sent by P_2 and rolls back to its checkpoint accordingly. The partial recovery lines established as a result of these rollbacks are shown in the figure with labels (2,3) and (2,5). Now, when P_0 and P_1 receive the rollback message sent by P_2, they will ignore it because the incarnation number received in the rollback message is the same as their current incarnation numbers. Similarly, when P_2 and P_3 receive the rollback message sent by P_1, they will ignore it for the same reason. However, this situation will put the system in an inconsistent state, because the message M_3 sent by P_2 prior to its rollback will be accepted and processed by P_1 (this is because, upon receiving this message, P_1 will log and process it, since M_3.inc = 1 < inc_1 = 2 and M_3.sn = 3 < recnum(M_3.inc + 1) = 5). So our recovery algorithm will not leave the system in a consistent state when multiple failures occur concurrently.

The main problem with this recovery algorithm is that processes cannot recognize concurrent failures from the incarnation numbers associated with the recoveries. So, to handle concurrent multiple failures, failed processes must assign incarnation numbers in such a way that the other processes will be able to recognize concurrent failures and take suitable actions.

6.3.1 Basic Idea

To handle concurrent failures correctly, we propose the following. First of all, to enable processes to recognize concurrent failures, a failed process increments its incarnation number in such a way that other processes can recognize concurrent failures by comparing the incarnation number received with their own incarnation numbers. To be precise, when a process P_p fails, it sets the incarnation number inc_p as inc_p := (((inc_p div N) + 1) * N) + p instead of simply incrementing it by 1 (here div is the integer division operator). In other words, inc_p is set to the next higher integer which differs from its current value by at least N and is also congruent to p mod N. During failure-free operation, inc_p = inc_q for all 0 ≤ p, q ≤ N - 1, and hence if two processes P_p and P_q fail concurrently, increment their incarnation numbers in this way, and initiate rollback, other processes can recognize the initiations to be due to concurrent failures by checking if (inc_p div N) = (inc_q div N). Let us illustrate with an example how processes identify concurrent recoveries and take suitable actions.
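The incarnation-number arithmetic can be sketched as follows (an illustrative Python fragment; the function names are ours):

```python
N = 4  # number of processes, with ids 0 .. N-1

def next_incarnation(inc_p, p):
    """On failure, P_p sets inc_p := ((inc_p div N) + 1) * N + p: the next
    higher integer at least N larger and congruent to p mod N."""
    return ((inc_p // N) + 1) * N + p

def concurrent(inc1, inc2):
    """Two recoveries are concurrent iff (inc1 div N) = (inc2 div N)."""
    return inc1 // N == inc2 // N

# P0 fails first; later P1 and P2 fail concurrently (as in Figure 6.2).
assert next_incarnation(0, 0) == 4            # inc_0 := 4
assert next_incarnation(4, 1) == 9            # inc_1 := 9
assert next_incarnation(4, 2) == 10           # inc_2 := 10
assert concurrent(9, 10) and not concurrent(4, 9)
assert 9 % N == 1 and 10 % N == 2             # initiator id = inc mod N
```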

Consider the distributed computation of Figure 6.2. P_0 fails at point F_0 and rolls back to its latest checkpoint C_{0,1}, sets inc_0 := ((inc_0 div 4) + 1) * 4 + 0 (i.e., inc_0 := 4), sets rec_line_0 := 1, and sends the message rollback(4,1) to all other processes. Upon receiving this message, all processes roll back to checkpoints on the recovery line labeled (4,1). After the recovery, suppose P_1 and P_2 fail concurrently at the points indicated by F_1 and F_2, respectively. After the failure at F_1, P_1 rolls back to its latest checkpoint C_{1,5}, sets inc_1 := ((inc_1 div 4) + 1) * 4 + 1 (i.e., inc_1 := 9), and sends the message rollback(9,5) to all other processes. Similarly, P_2 rolls back to C_{2,3} and sends the message rollback(10,3) to all other processes. When P_0 receives the rollback message sent by P_1, it rolls back to its checkpoint C_{0,6} as shown in the figure, since C_{0,6} is the earliest checkpoint of P_0 with sequence number ≥ 5. In the meantime, P_3 receives the rollback message sent by P_2 and rolls back to its checkpoint accordingly. The partial recovery lines established as a result of these rollbacks are shown in the figure with labels (10,3) and (9,5). Afterwards, when P_1 receives the rollback(10,3) message sent by P_2, even though the incarnation number received in the message is different from P_1's current incarnation number 9, since (10 div 4) = (9 div 4), P_1 recognizes that the rollback message received was due to a failure concurrent with the failure for which it rolled back earlier.

After P_1 recognizes that the rollback message is due to a concurrent failure, it decides whether to roll back or not by comparing the recovery line numbers. The recovery line number received in the message is 3, whereas P_1's current recovery line number is 5. Since 3 < 5, P_1's current state may not be consistent with P_2's state after the rollback; this is because it is possible that a message sent by P_2 after taking checkpoint C_{2,3} but before failure F_2 has been received and processed by P_1. Hence, P_1 has to roll back to its earliest checkpoint with sequence number ≥ 3; i.e., P_1 rolls back to C_{1,5} again. Similarly, P_0 rolls back to its checkpoint C_{0,4} in response to the

[Figure: recovery lines labeled (0,0), (4,1), (9,5), and (10,3).]

Figure 6.2: Handling concurrent failure of multiple processes.

rollback message sent by P_2 and deletes C_{0,6} from its stable storage. However, when P_2 and P_3 come to know about the rollback(9,5) message sent by P_1, they ignore that rollback message: even though the rollback message is due to a concurrent failure, the recovery line number received in the rollback message is greater than their current recovery line numbers. The complete recovery line established due to these concurrent failures is shown in Figure 6.3. Thus, when multiple failures occur concurrently, all processes roll back in response to the rollback message of a failed process that carries the smallest recovery line number. That is, among the processes that failed concurrently, the process which sent the rollback message with the smallest recovery line number succeeds in establishing the recovery line. In case of a tie (i.e., if two or more processes initiate recovery concurrently with the same recovery line number), the process with the smallest id succeeds in establishing the recovery line. (Note that inc mod N always gives the id of the process which initiated the recovery.)

[Figure: recovery lines labeled (0,0), (4,1), and (10,3).]

Figure 6.3: Recovery line established after concurrent multiple failures.

The example above does not address all types of concurrent failures. For example, in Figure 6.2, if P_0 or P_1 fails after establishing the partial recovery line with label (9,5), and they establish another partial recovery line before knowing about the failure at F_2, how will the partial recovery lines converge to total recovery lines when all processes come to know about all the failures? Moreover, the above example does not illustrate how the various types of messages that arise due to rollback are handled.

For handling all types of concurrent failures and the various types of messages that arise during recovery, it is not enough for each message to be piggybacked with only the current recovery line number and the associated incarnation number. The rollback message and all the application messages need to be piggybacked with the incarnation numbers and the associated recovery line numbers of all the recovery lines established so far. Next, we describe in detail what information is piggybacked with the application messages and the rollback messages, and how the piggybacked information is used to handle concurrent failures appropriately.

Each process P_p maintains a variable called inc_rec_set_p, which is an ordered set of ordered pairs of integers of the form (inc, rec), where inc represents the incarnation number and rec represents the associated recovery line number of a recovery line established in the past. When a process P_p fails, it rolls back to its latest checkpoint, updates its incarnation number inc_p as inc_p := ((inc_p div N) + 1) * N + p and the recovery line number rec_line_p as rec_line_p := sn_p, adds (inc_p, rec_line_p) to the set inc_rec_set_p, and sends a rollback(inc_rec_set_p) message to all other processes. (Each application message is also piggybacked with the current value of inc_rec_set_p, and the value of inc_rec_set_p piggybacked with a message M is denoted by M.inc_rec_set.) If (inc1, rec1) and (inc2, rec2) are two elements in the set inc_rec_set_p, we say (inc1, rec1) precedes (inc2, rec2), denoted as (inc1, rec1) ≺ (inc2, rec2), iff inc1 < inc2. We prove in Lemma 11 that under the ordering ≺, inc_rec_set_p is a totally ordered set. So, hereafter, whenever we talk about the largest and smallest elements in inc_rec_set_p, it is with respect to this relation ≺. Next, we discuss in detail how rollback messages are handled.

6.3.2 Handling Rollback Messages

When P_p receives a rollback(inc_rec_set_q) message from P_q, it compares its current value of inc_rec_set_p with the value of inc_rec_set_q received in the message. The following two cases arise.

Case (i): inc_rec_set_q ⊆ inc_rec_set_p.

In this case, P_p has already established recovery lines corresponding to every entry in inc_rec_set_q, which means P_p knew about all these recoveries through application messages received from other processes that have rolled back, and hence it ignores the rollback message (recall that each application message is piggybacked with the current value of inc_rec_set).

Case (ii): inc_rec_set_q ⊄ inc_rec_set_p.

Two sub-cases arise in this case.

Case (ii)(a): inc_rec_set_p ⊂ inc_rec_set_q (a proper subset).

In this case, inc_rec_set_q contains at least one element that does not belong to inc_rec_set_p. Let (inc1, rec1) be the smallest such element; (inc1, rec1) corresponds to the earliest recovery that P_q is aware of which P_p is not aware of. Hence, P_p assigns inc_p := inc1 and rec_line_p := rec1, adds (inc_p, rec_line_p) to inc_rec_set_p, rolls back to its earliest checkpoint with sequence number ≥ rec_line_p, and replays the lost messages (messages whose receives were undone but whose sends will not be undone). How lost messages are determined is discussed below in Section 6.3.4. After replaying the lost messages, if inc_rec_set_p ≠ inc_rec_set_q, then P_p has to catch up with the recoveries corresponding to the remaining entries in inc_rec_set_q. To do this, it only needs to add those entries to inc_rec_set_p and update inc_p and rec_line_p to correspond to the greatest element in inc_rec_set_p. After updating inc_p and rec_line_p, it takes a new checkpoint if the sequence number of its latest checkpoint is less than rec_line_p, in order to advance its recovery line.

Case (ii)(b): inc_rec_set_p ⊄ inc_rec_set_q (neither is a subset of the other).

In this case, each of the sets inc_rec_set_p and inc_rec_set_q contains at least one element that is not contained in the other. This means that there are two recoveries initiated concurrently by two different processes: one of the recoveries is known to P_q and the other is known to P_p, but neither of these two processes is aware of both recoveries. In this case, P_p needs to decide whether or not to roll back. Let (inc1, rec1) be the smallest element in (inc_rec_set_q - inc_rec_set_p) and let (inc2, rec2) be the smallest element in (inc_rec_set_p - inc_rec_set_q). So, P_p comes to know of the recovery (inc1, rec1), which P_q was aware of. P_p also learns that P_q was not aware of the recovery (inc2, rec2) while sending the rollback message, and that these two recoveries are concurrent; so, (inc1 div N) = (inc2 div N). If rec1 < rec2, then since P_q has already rolled back to its earliest checkpoint with sequence number ≥ rec1, it would have discarded messages that causally depend on any message sent by P_p with sequence number ≥ rec1; this means P_p has to roll back to its earliest checkpoint with sequence number ≥ rec1 after deleting the entry (inc2, rec2) from inc_rec_set_p and adding the entry (inc1, rec1) to inc_rec_set_p. If rec1 = rec2, then P_p rolls back to its earliest checkpoint with sequence number ≥ rec1 if and only if q < p (this is to ensure that if two processes initiate recovery concurrently with the same recovery line number, only the recovery initiated by the process with the smallest id succeeds in establishing the recovery line). If rec1 > rec2, then P_p ignores the rollback message, because P_q will roll back to the recovery line determined by (inc2, rec2) when P_q comes to know about the recovery (inc2, rec2). If P_p rolls back, then after rolling back and replaying the messages lost due to rollback, P_p needs to catch up with other recovery lines, if any. To catch up with the remaining recovery lines, P_p basically finds the element-wise minimum of the ordered sets inc_rec_set_p and inc_rec_set_q, assigns it as the new value for inc_rec_set_p, and updates inc_p and rec_line_p to correspond to the largest element in inc_rec_set_p. It also takes a checkpoint if its current checkpoint has sequence number < rec_line_p. The details of how the minimum of two sets is calculated are given in the formal description of the algorithm.
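The rollback decision above can be sketched as follows. This illustrative Python fragment (the names are ours, and it covers only the rollback decision; catching up, replay, and checkpointing are omitted) represents each inc_rec_set as a collection of (inc, rec) pairs:

```python
def rollback_decision(set_p, set_q, p, q):
    """Return the recovery line P_p must roll back to on receiving
    rollback(set_q) from P_q, or None if the message is ignored."""
    sp, sq = set(set_p), set(set_q)
    if sq <= sp:                        # case (i): all recoveries known; ignore
        return None
    inc1, rec1 = min(sq - sp)           # earliest recovery unknown to P_p
    if sp < sq:                         # case (ii)(a): P_q strictly ahead
        return rec1
    inc2, rec2 = min(sp - sq)           # case (ii)(b): concurrent recoveries
    if rec1 < rec2:
        return rec1                     # the smaller recovery line wins
    if rec1 == rec2 and q < p:          # tie: the smaller id wins
        return rec1
    return None                         # P_q will roll back instead

base = [(0, 0), (4, 1)]                 # known to all after P0's recovery (Fig. 6.2)
p1 = base + [(9, 5)]                    # P1's set after its failure
p2 = base + [(10, 3)]                   # P2's set after its concurrent failure
assert rollback_decision(p1, p2, 1, 2) == 3     # P1 rolls back again, to line 3
assert rollback_decision(p2, p1, 2, 1) is None  # P2 ignores rollback(9,5)
assert rollback_decision(base, p1, 3, 1) == 5   # P3 rolls back to line 5
```

Tuples compare by their first component, so `min` over a set of (inc, rec) pairs picks the smallest element under the relation ≺.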

An Example

Figure 6.4 shows a distributed computation consisting of four processes. P_0 and P_2 fail concurrently at the points indicated by F_0 and F_2, respectively. As a result, P_0 and P_2 respectively send rollback({(0,0), (4,3)}) and rollback({(0,0), (6,4)}) messages to all other processes. Suppose P_1 receives the rollback({(0,0), (4,3)}) message first and rolls back to its checkpoint with sequence number 4; the partial recovery line established by P_0 and P_1 is shown with label (4,3).


Figure 6.4: Partial recovery lines established due to concurrent failures.

Next, suppose P_3 receives the rollback({(0,0), (6,4)}) message from P_2 and rolls back to its earliest checkpoint with sequence number ≥ 4; the partial recovery line established by P_2 and P_3 is shown in the figure with label (6,4). Next, suppose (i) P_1 fails at the point indicated by F_1 before it comes to know of the failure of P_2 at F_2, and (ii) P_3 fails at the point indicated by F_3 before it comes to know of the failures at F_0 and F_1. As a result of the failure at F_1, P_1 rolls back to its checkpoint C_{1,7} and sends the message rollback({(0,0), (4,3), (9,7)}) to all other processes. Similarly, as a result of the failure at F_3, P_3 rolls back to its checkpoint C_{3,6} and sends the message rollback({(0,0), (6,4), (11,6)}) to all other processes. Next, suppose P_2 receives the rollback({(0,0), (6,4), (11,6)}) message from P_3 and rolls back to its earliest checkpoint with sequence number ≥ 6. In the meantime, suppose P_0 receives the rollback({(0,0), (4,3), (9,7)}) message from P_1 and rolls back to its checkpoint C_{0,8}. At this point, neither P_0 nor P_1 knows about the failures F_2 and F_3; similarly, neither P_2 nor P_3 knows about the failures F_0 and F_1. All the partial recovery lines established at this point are shown in Figure 6.4.


Figure 6.5: Recovery lines established after complete recovery.

When P_0 and P_1 come to know about the failures at F_2 and F_3, they will ignore the failure at F_2, since the recovery line number corresponding to the failure at F_2 is 4, which is greater than the recovery line number corresponding to the failure at F_0, namely 3. However, they will roll back in response to the failure at F_3. Similarly, when P_2 and P_3 come to know about the failures at F_0 and F_1, they will roll back in response to the failure F_0 and only take a checkpoint with sequence number 6 in response to the failure F_1. After all processes are aware of all four failures, the complete recovery lines established are as shown in Figure 6.5.

6.3.3 Handling Application Messages

When a process P_p receives a message M from a process P_q, three cases arise:

Case (i): M.inc_rec_set ⊂ inc_rec_set_p.

In this case, P_p is aware of at least one recovery initiated by some process which P_q was not aware of while sending the message M. Let (inc, rec) be the smallest element in inc_rec_set_p such that (inc, rec) ∉ M.inc_rec_set. Thus, while sending the message M, P_q was not aware of the recovery associated with the pair (inc, rec), but it was aware of every recovery prior to (inc, rec) that P_p is aware of. If M.sn ≥ rec, then P_q will undo send(M) when it comes to know of the recovery (inc, rec), since P_q will roll back to its earliest checkpoint with sequence number ≥ rec; hence, in this case, P_p discards M. If M.sn < rec, then the send event of M will not be undone when P_q rolls back to its earliest checkpoint with sequence number ≥ rec, and hence M should be processed by P_p.

Case (ii): M.inc_rec_set = inc_rec_set_p

In this case, the recoveries that Pp and Pq are aware of are the same. So M is a normal message and hence has to be processed.

Case (iii): M.inc_rec_set ⊄ inc_rec_set_p

Let (inc1, rec1) be the smallest of those elements in (M.inc_rec_set − inc_rec_set_p).

Case (iii)(a): (M.inc_rec_set ⊄ inc_rec_set_p) ∧ (inc_rec_set_p ⊂ M.inc_rec_set)

In this case, while sending the message M, Pq already knew about all the recoveries that Pp is currently aware of. In addition, it is aware of some new recoveries of which Pp is not aware. Hence, Pp must first roll back to the recovery line determined by (inc1, rec1) before processing the message. Thus, Pp assigns rec_line_p := rec1 and inc_p := inc1, adds (inc1, rec1) to inc_rec_set_p, rolls back to its earliest checkpoint with sequence number ≥ rec1, and replays the messages lost due to the rollback. Then, it has to catch up with the recoveries associated with all those elements in M.inc_rec_set that are not in inc_rec_set_p. The details of this catch-up are the same as the catch-up performed when a rollback(inc_rec_set_q) message with inc_rec_set_q ⊄ inc_rec_set_p is received. After catching up with the recoveries, Pp treats M as a newly arrived message and processes it.

Case (iii)(b): (M.inc_rec_set ⊄ inc_rec_set_p) ∧ (inc_rec_set_p ⊄ M.inc_rec_set)

In this case, each of the two sets M.inc_rec_set and inc_rec_set_p contains at least one element that does not belong to the other. So there are concurrent recoveries initiated by two distinct processes, and Pp took action on one of them while Pq took action on the other. In other words, corresponding to (inc1, rec1), there exists (inc2, rec2) ∈ (inc_rec_set_p − M.inc_rec_set) such that inc1 div N = inc2 div N. In this case, if (rec1 < rec2) or if (rec1 = rec2) ∧ (inc1 < inc2), Pp rolls back to its earliest checkpoint with sequence number ≥ rec1; otherwise, since Pq will roll back corresponding to the recovery (inc2, rec2), Pp does not roll back; instead, Pp processes the message only if M.sn < rec2. If Pp rolls back to the recovery line determined by the pair (inc1, rec1), it deletes (inc2, rec2) from inc_rec_set_p, adds (inc1, rec1) to it, and replays the messages whose receive events have been undone but whose send events will not be undone. After rolling back and replaying messages, if there are more elements in (M.inc_rec_set − inc_rec_set_p), Pp needs to catch up with the recoveries corresponding to them. The details of this catch-up are the same as the catch-up performed when a rollback(inc_rec_set_q) message with inc_rec_set_q ⊄ inc_rec_set_p is received. After catching up with the recoveries, Pp treats M as a newly arrived message and processes it.
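The three-way case analysis above can be condensed into a small executable sketch (our illustration, not the dissertation's code; the names `Message` and `classify_message` are hypothetical). Each inc_rec_set is modeled as a list of (inc, rec) pairs, and the function returns which action the receiver Pp takes:

```python
from dataclasses import dataclass

@dataclass
class Message:
    sn: int            # sequence number of the sender's current checkpoint
    inc_rec_set: list  # sender's known recoveries: ordered (inc, rec) pairs

def classify_message(p_set, m):
    """Decide how Pp handles message m, given Pp's own inc_rec_set p_set.
    Returns 'process', 'discard', or 'rollback-then-process'."""
    ps, ms = set(p_set), set(m.inc_rec_set)
    if ms < ps:  # Case (i): the sender lags behind Pp
        _, rec = min(ps - ms, key=lambda t: (t[1], t[0]))
        return 'process' if m.sn < rec else 'discard'
    if ms == ps:  # Case (ii): identical views of past recoveries
        return 'process'
    inc1, rec1 = min(ms - ps, key=lambda t: (t[1], t[0]))
    if ps < ms:  # Case (iii)(a): Pp is strictly behind the sender
        return 'rollback-then-process'
    # Case (iii)(b): concurrent recoveries, compare smallest differing pairs
    inc2, rec2 = min(ps - ms, key=lambda t: (t[1], t[0]))
    if (rec1, inc1) < (rec2, inc2):
        return 'rollback-then-process'
    return 'process' if m.sn < rec2 else 'discard'
```

For instance, with p_set = [(0,0), (4,3)], a message carrying [(0,0)] with sn = 2 is processed (its send precedes recovery line 3), while the same message with sn = 5 is discarded because its sender will undo the send upon learning of recovery (4,3).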

6.3.4 Message Logging and Message Replaying

When a process rolls back due to a failure, all the messages whose send events lie to the left of the current recovery line and whose receive events lie to the right of the current recovery line are lost messages. So, when a process Pp rolls back, any message M with M.sn < rec_line_p whose receive was undone is a lost message, and all such messages should be replayed. To cope with messages lost due to rollback, messages are selectively and pessimistically logged into stable storage. With each checkpoint C, there is an associated message log denoted by C.mesglog. The latest checkpoint of a process Pp is denoted by Cur_chkpt_p. Before a message M is processed by Pp, M is logged into Cur_chkpt_p.mesglog if M.sn < Cur_chkpt_p.sn. So not all messages are logged; only those messages that could be required for replay in the event of a rollback are logged. This simple selective message logging technique guarantees that all the messages lost due to rollback are available in stable storage for replay. Next, we present the formal description of the algorithm.
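The logging rule itself can be sketched in a few lines (an illustrative model; `Checkpoint` and `maybe_log` are our names). A message is appended to the current checkpoint's log only when its sequence number is smaller than that checkpoint's, since only such a message can later sit astride a recovery line:

```python
class Checkpoint:
    def __init__(self, sn):
        self.sn = sn        # sequence number of this checkpoint
        self.mesglog = []   # messages logged while this checkpoint is current

def maybe_log(cur_chkpt, m_sn, payload):
    """Selective pessimistic logging: keep M only if M.sn < Cur_chkpt.sn,
    i.e. only if M could become a lost message after some future rollback."""
    if m_sn < cur_chkpt.sn:
        cur_chkpt.mesglog.append((m_sn, payload))
        return True
    return False
```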

6.3.5 Formal Description of the Complete Recovery Algorithm

Recovery Algorithm Handling Concurrent Failure of Multiple Processes

Data structures and initialization at process Pp:
  sn_p : integer (:= 0);        {sequence number of the current checkpoint, initialized to 0}
  next_p : integer (:= 1);      {sequence number to be assigned to the next basic checkpoint}
  inc_p : integer (:= 0);       {incarnation number of the latest recovery line, initialized to 0}
  rec_line_p : integer (:= 0);  {latest recovery line number, initialized to 0}
  inc_rec_set_p : ordered set of pairs of integers (:= {(0, 0)});

When it is time for process Pp to increment next_p:
  next_p := next_p + 1;   {next_p is incremented at periodic time intervals}

When it is time for process Pp to take a basic checkpoint:
  if next_p > sn_p then   {Pp skips taking a basic checkpoint if next_p ≤ sn_p, i.e., if it has
                           already taken a forced checkpoint C with C.sn ≥ next_p}
    Take checkpoint C;
    C.sn := next_p;
    sn_p := C.sn;

When process Pp sends a message M:
  M.sn := sn_p;                    {sequence number of the current checkpoint, piggybacked with M}
  M.inc_rec_set := inc_rec_set_p;  {current value of inc_rec_set_p, piggybacked with M}
  send(M);

When process Pp receives a message M:
  if (M.inc_rec_set ⊂ inc_rec_set_p) then   {a proper subset}
    Find the smallest element (inc, rec) ∈ (inc_rec_set_p − M.inc_rec_set);
    if (M.sn < rec) then
      Log message M into Cur_chkpt_p.mesglog;
      Process message M
    else Discard message M;
  else if (M.inc_rec_set = inc_rec_set_p) then
    if (M.sn < sn_p) then   {note that in this case M.sn ≥ rec_line_p}
      Log message M into Cur_chkpt_p.mesglog;
    else if (M.sn > sn_p) then
      Take checkpoint C;
      C.sn := M.sn;
      sn_p := C.sn;
    Process the message M;
  else   {M.inc_rec_set ⊄ inc_rec_set_p}
    Find the smallest element (inc1, rec1) ∈ (M.inc_rec_set − inc_rec_set_p);
    if (inc_rec_set_p ⊄ M.inc_rec_set) then
      Find the smallest element (inc2, rec2) ∈ (inc_rec_set_p − M.inc_rec_set);
      if ((rec1 < rec2) ∨ ((rec1 = rec2) ∧ (inc1 < inc2))) then
        Delete (inc2, rec2) from inc_rec_set_p;
      else if (M.sn < rec2) then
        Log message M into Cur_chkpt_p.mesglog;
        Process the message M;
        goto last;
      else
        Discard the message M;
        goto last;
    rec_line_p := rec1;  inc_p := inc1;
    inc_rec_set_p := inc_rec_set_p ∪ {(inc_p, rec_line_p)};
    Roll_Back(p);
    Replay_logged_messages(p);
    Catch_up_with_recovery(M.inc_rec_set, p);
    Treat message M as if newly arrived and process it;
  last: continue as normal;

Recovery initiated by process Pp after a failure:
  Restore Cur_chkpt_p;
  inc_p := ((inc_p div N) + 1) * N + p;
  rec_line_p := sn_p;
  inc_rec_set_p := inc_rec_set_p ∪ {(inc_p, rec_line_p)};
  Send rollback(inc_rec_set_p) to all other processes;
  Replay_logged_messages(p);
  continue as normal;

Process Pp, upon receiving rollback(inc_rec_set_q) from process Pq:
  if (inc_rec_set_q ⊄ inc_rec_set_p) then
    Find the smallest (inc1, rec1) ∈ (inc_rec_set_q − inc_rec_set_p);
    if (inc_rec_set_p ⊄ inc_rec_set_q) then
      Find the smallest (inc2, rec2) ∈ (inc_rec_set_p − inc_rec_set_q);
      if ((rec1 < rec2) ∨ ((rec1 = rec2) ∧ (inc1 < inc2))) then
        Delete (inc2, rec2) from inc_rec_set_p;
      else goto last;
    rec_line_p := rec1;  inc_p := inc1;
    inc_rec_set_p := inc_rec_set_p ∪ {(inc_p, rec_line_p)};
    Roll_Back(p);
    Replay_logged_messages(p);
    Catch_up_with_recovery(inc_rec_set_q, p);
  last: continue as normal;

Procedure Roll_Back(p : integer);
begin
  if (rec_line_p > sn_p) then   {in this case there is no need to roll back, but a checkpoint
                                 is taken so that it will be part of the recovery line}
    Take checkpoint C;
    C.sn := rec_line_p;
    sn_p := C.sn;
  else
    Find the earliest checkpoint C such that C.sn ≥ rec_line_p;
    sn_p := C.sn;
    Cur_chkpt_p := C;
    Cur_chkpt_p.mesglog := Concat of C'.mesglog over all checkpoints C' with C'.sn ≥ rec_line_p;
    Restore Cur_chkpt_p;
    Delete all the checkpoints beyond Cur_chkpt_p from stable storage;
end;
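The rollback bookkeeping can be sketched as follows (our illustrative model; `roll_back` is a hypothetical name, and the rec_line_p > sn_p branch, which only takes a fresh checkpoint, is omitted): restore the earliest checkpoint whose sequence number is at least the recovery line, concatenate the logs of that checkpoint and every later one for replay, and discard the later checkpoints.

```python
def roll_back(checkpoints, rec_line):
    """checkpoints: list of (sn, mesglog) pairs in increasing sn order (mutated).
    Returns (restored_sn, combined_log), mirroring the Roll_Back procedure."""
    for i, (sn, _) in enumerate(checkpoints):
        if sn >= rec_line:
            # Concat of the mesglogs of all checkpoints with sn >= rec_line
            combined = [m for _, log in checkpoints[i:] for m in log]
            del checkpoints[i + 1:]  # delete checkpoints beyond the restored one
            return sn, combined
    raise ValueError("no checkpoint with sn >= rec_line")
```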

Procedure Replay_logged_messages(p : integer);
begin
  for each message M ∈ Cur_chkpt_p.mesglog do
    if (M.sn < rec_line_p) then Replay message M
    else Delete M from Cur_chkpt_p.mesglog;
end;

Procedure Catch_up_with_recovery(I : ordered set of pairs of integers; p : integer);
begin
  inc_rec_set_p := Min(I, inc_rec_set_p);
  Let (inc, rec) be the last element in inc_rec_set_p;
  rec_line_p := rec;  inc_p := inc;
  if sn_p < rec_line_p then
    Take checkpoint C;
    C.sn := rec_line_p;
    sn_p := C.sn;
end;

Function Min(I, J : ordered sets of pairs of integers) : ordered set of pairs of integers;
var temp : ordered set of pairs of integers;
begin
  if I = J then Min := I
  else begin
    temp := ∅;
    while ((I ≠ ∅) and (J ≠ ∅)) do
    begin
      Let (inc1, rec1) ∈ I and (inc2, rec2) ∈ J be the smallest elements of I and J;
      {note that for these elements inc1 div N = inc2 div N}
      if ((rec1 < rec2) ∨ ((rec1 = rec2) ∧ (inc1 < inc2))) then
        temp := temp ∪ {(inc1, rec1)}
      else temp := temp ∪ {(inc2, rec2)};
      I := I − {(inc1, rec1)};  J := J − {(inc2, rec2)};
    end;
    temp := (temp ∪ I ∪ J);
    Min := temp;   {return temp}
  end;
end;
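The Min function is essentially a positional merge of two ordered sets under the (rec, then inc) ordering. A runnable sketch (our `merge_min`, assuming the two lists are aligned so that entries at the same position share the same inc div N, as Lemma 11 below guarantees):

```python
def merge_min(I, J):
    """Component-wise minimum of two ordered inc_rec_sets, given as lists of
    (inc, rec) pairs: at each position the smaller pair under the (rec, inc)
    ordering wins, and any leftover tail is appended unchanged."""
    if I == J:
        return list(I)
    temp = []
    for (inc1, rec1), (inc2, rec2) in zip(I, J):
        if (rec1, inc1) < (rec2, inc2):
            temp.append((inc1, rec1))
        else:
            temp.append((inc2, rec2))
    k = min(len(I), len(J))
    return temp + I[k:] + J[k:]
```

On the four-failure example above, merging {(0,0), (4,3), (9,7)} with {(0,0), (6,4), (11,6)} yields {(0,0), (4,3), (11,6)}, so the final recovery line number is 6.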

6.4 Correctness Proof

To prove that the recovery algorithm for multiple failures proposed above is correct, we need to establish the following:

1. When one or more processes fail and initiate a recovery, the checkpoints to which all other processes roll back form a consistent global checkpoint.

2. The various types of messages that arise due to a rollback are handled appropriately.

We establish these facts about the recovery algorithm below by proving several lemmas and theorems.

Observation 1: For each p, the current values of inc_p and rec_line_p are the same as the first and second components, respectively, of the last element in inc_rec_set_p.

For, the set inc_rec_set_p is updated when Pp fails and initiates recovery, when Pp receives a rollback() message, or when it receives a message M with M.inc_rec_set ⊄ inc_rec_set_p. It is clear from the formal description of the algorithm that in all these cases, after inc_rec_set_p is updated, inc_p and rec_line_p are set to the values that correspond to the last element in inc_rec_set_p.

Lemma 11 If (inc1, rec1) and (inc2, rec2) are two distinct elements in inc_rec_set_p, then (inc1 div N) ≠ (inc2 div N). For each p, the set N_p = {inc div N | (inc, rec) ∈ inc_rec_set_p} is a set of consecutive natural numbers starting from 0. In particular, under the ordering ≺, inc_rec_set_p is a totally ordered set.

Proof: Note that a process Pp updates the value of inc_rec_set_p when

1. Pp fails and initiates recovery, or

2. Pp receives a message M with M.inc_rec_set ⊄ inc_rec_set_p, or

3. Pp receives a rollback(inc_rec_set_q) message from Pq such that inc_rec_set_q ⊄ inc_rec_set_p.

The set inc_rec_set_p is initialized to {(0,0)} for all p. Hence the lemma is true initially. So, to prove the lemma, it is sufficient to prove that it still holds after each update of inc_rec_set_p.

Case (i): Pp updates inc_rec_set_p after its failure.

From Observation 1, the current value of inc_p is the same as the first component of the last element in inc_rec_set_p. Whenever a process Pp fails, it increments its variable inc_p to the next higher value of the form k * N + p and adds (inc_p, rec_line_p) to the set inc_rec_set_p. Thus, the lemma holds in this case.

Case (ii): Pp updates inc_rec_set_p after receiving a message M with M.inc_rec_set ⊄ inc_rec_set_p.

In this case, if Pp rolls back, it sets the current value of inc_rec_set_p to be the component-wise minimum of the entries in inc_rec_set_p and M.inc_rec_set, and hence the lemma holds in this case also.

Case (iii): Pp updates inc_rec_set_p after receiving a rollback(inc_rec_set_q) message from Pq such that inc_rec_set_q ⊄ inc_rec_set_p.

This is similar to Case (ii). □

Theorem 19 When one or more processes fail and initiate recovery by sending rollback() messages concurrently, the checkpoints to which all processes roll back form a consistent global checkpoint.

Proof: During failure-free operation, inc_p = inc_q for all p, q. After a process Pp fails, it increments its incarnation number inc_p to the next integer of the form k * N + p, adds the pair (inc_p, rec_line_p) to the set inc_rec_set_p, and sends the message rollback(inc_rec_set_p) to every other process. So, when two processes Pp and Pq fail concurrently and increment their incarnation numbers, inc_p div N = inc_q div N, inc_p mod N = p, and inc_q mod N = q. Using this fact, a process receiving the rollback messages recognizes the concurrent recovery initiations; all the processes roll back to their earliest checkpoints with sequence number ≥ rec_line_p, where (inc_p, rec_line_p) is associated with the recovery initiated by the process Pp with the smallest recovery line number. By Corollary 5, all such checkpoints form a consistent global checkpoint. □
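The incarnation-number arithmetic used in this proof can be checked with a few lines (illustrative Python; N = 4 matches the four-process example, and the function names are ours):

```python
N = 4  # number of processes, as in the four-process example

def next_incarnation(inc, p):
    """On failure, Pp jumps to the next value of the form k*N + p, so that
    inc mod N identifies the initiator and inc div N the recovery round."""
    return (inc // N + 1) * N + p

def concurrent(inc_a, inc_b):
    """Two recovery initiations are concurrent iff they share a round."""
    return inc_a // N == inc_b // N
```

This reproduces the incarnation numbers of the example: P0's failure yields 4, P2's yields 6 (concurrent with 4), and the second-round failures of P1 and P3 yield 9 and 11.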

Next, we prove that the various types of abnormal messages that arise in rollback recovery are handled appropriately.

Lemma 12 A message M received and processed by Pp is logged into stable storage by Pp if and only if M.sn < sn_p (i.e., if M.sn is less than the sequence number of the current checkpoint of the receiving process, then that message is logged before being processed).

Proof: Let us assume that M was sent by Pq. Three cases arise.

Case (i): (M.inc_rec_set ⊂ inc_rec_set_p).

In this case, Pp has established more recovery lines than Pq. Let (inc, rec) be the smallest element in inc_rec_set_p that does not belong to M.inc_rec_set. Note from the formal description of the algorithm that such a message is processed by Pp only if M.sn < rec. Note also that in this case the message is logged into stable storage before being processed. Since rec ≤ sn_p, it follows that M.sn < sn_p.

Case (ii): (M.inc_rec_set = inc_rec_set_p).

In this case, M is logged into stable storage if and only if M.sn < sn_p.

Case (iii): (M.inc_rec_set ⊄ inc_rec_set_p).

This case is similar to Case (i). □

Lemma 13 When a process Pp rolls back due to the failure of itself or some other process, it replays a message from its stable storage if and only if the message's receive event was undone due to the rollback and its send event will not be undone by its sender when the sender rolls back.

Proof: A message M whose reception was undone due to the rollback is replayed if and only if M.sn < rec_line_p. M.sn < rec_line_p implies that the send event of M lies to the left of the current recovery line, and hence the process that sent M will not roll back beyond the send event of M. Thus, only messages lost due to the rollback are replayed. □
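The replay condition of this lemma is a one-line filter over the restored message log (an illustrative sketch; `messages_to_replay` is our name):

```python
def messages_to_replay(mesglog, rec_line):
    """A logged message is replayed iff M.sn < rec_line: its send lies to the
    left of the recovery line (and so survives), while its receive was undone
    by the rollback. All other entries are dropped from the log."""
    return [(sn, m) for sn, m in mesglog if sn < rec_line]
```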

Lemma 14 Delayed messages are handled appropriately by our algorithm.

Proof: Note that if a message M is a delayed message, then M.inc_rec_set ≠ inc_rec_set_p. Such a message should be processed only if its send event has not been undone and will not be undone. Two cases arise.

Case (i): M.inc_rec_set ⊂ inc_rec_set_p

Note that in this case M is processed if and only if M.sn < rec, where (inc, rec) is the smallest element in inc_rec_set_p such that (inc, rec) ∉ M.inc_rec_set. Note that when the process that sent the message M comes to know about the recovery (inc, rec), it will roll back to its earliest checkpoint with sequence number ≥ rec; hence it will undo the send event of M if and only if M.sn ≥ rec. So M is processed if and only if M.sn < rec; otherwise, it is discarded.

Case (ii): M.inc_rec_set ⊄ inc_rec_set_p

In this case also, M is processed only if M.sn < rec2, where (inc2, rec2) corresponds to the recovery with respect to which the sender of M will roll back. Hence a delayed message is processed if and only if the sender will not undo the send event of M when it rolls back. □

Theorem 20 The recovery algorithm restores the system to a consistent state when failures occur.

Proof: When failures occur, processes roll back to a consistent global checkpoint by Theorem 19. When a process rolls back, it replays only messages lost due to the rollback, by Lemma 13. Delayed messages are handled appropriately by Lemma 14. Since delayed messages and messages lost due to rollback are handled appropriately, duplicate messages do not arise. Orphan messages do not arise because processes roll back to a consistent global checkpoint in the event of a failure. Hence the system is restored to a consistent state in the event of failures. □

6.4.1 Reducing Message Overhead

In the recovery algorithm, the size of the set inc_rec_set piggybacked with each message grows with the number of recovery lines established. We can reduce this message overhead to a predetermined threshold as follows. When the size of the set reaches the maximum predetermined threshold, a process sends a message to collect the current value of inc_rec_set from all the processes. After collecting the values from all processes, it finds the common prefix of all these sets and informs all other processes about this common prefix. After receiving this common prefix, each process stores it in stable storage and deletes it from its set inc_rec_set. From that time on, processes need not piggyback this common prefix; the prefix needs to be kept in stable storage to handle delayed messages, since we assume message delay is unpredictable.
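The common-prefix computation can be sketched as follows (our illustration; `common_prefix` is a hypothetical helper run by the coordinating process after collecting every inc_rec_set):

```python
def common_prefix(sets):
    """Longest common prefix of the collected inc_rec_sets. This prefix can
    be moved to stable storage and omitted from future piggybacking; it is
    retained only so that delayed messages can still be interpreted."""
    prefix = []
    for entries in zip(*sets):
        if all(e == entries[0] for e in entries):
            prefix.append(entries[0])
        else:
            break
    return prefix
```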

6.4.2 Asynchronous Garbage Collection

During normal operation, checkpoints and logged messages must be kept in stable storage until they are no longer required for recovery from any possible future failure. In our recovery algorithm, each checkpoint has an associated message log. When a process rolls back to a checkpoint due to a failure, the messages that are required for replay are available in stable storage. Recall that when a process fails, the failed process rolls back to its latest checkpoint and all other processes roll back to their latest checkpoints with sequence number ≥ the sequence number of the latest checkpoint of the failed process. Thus, if every process has taken a checkpoint with sequence number ≥ k, where k is any positive integer, then each process can delete all of its checkpoints with sequence number less than k. So, if a process has received and processed at least one message M with M.sn ≥ k from every other process, then it can discard all of its checkpoints with sequence number less than k. In an application where each process communicates with every other process frequently, this asynchronous garbage collection method will be efficient, because all the processes will know the sequence numbers of the latest checkpoints of every other process.
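The garbage-collection rule reduces to tracking, per peer, the highest checkpoint sequence number learned from received messages (an illustrative sketch with hypothetical names):

```python
def collect_garbage(checkpoints, peer_latest_sn):
    """checkpoints: list of (sn, mesglog) pairs for this process.
    peer_latest_sn: for each other process, the highest M.sn seen from it,
    i.e. a lower bound on that process's latest checkpoint number.
    A checkpoint older than the minimum such bound can never be the target
    of a rollback, so it is discarded along with its message log."""
    k = min(peer_latest_sn.values())
    return [(sn, log) for sn, log in checkpoints if sn >= k]
```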

6.5 Comparison With Existing Work

In this section, we compare our recovery algorithm with existing recovery algorithms. Table 6.1 gives a comparison of our algorithm with existing recovery algorithms. In this table, the algorithms are compared in terms of their requirements on message ordering, the type of recovery, the maximum number of times a process has to roll back when a failure occurs, the size of the additional information piggybacked with each application message, and the number of concurrent failures the algorithm can handle. Sistla and Welch [47] presented two optimistic recovery protocols. Their first protocol requires adding extra information of size O(N) to each application message; it avoids worst-case exponential rollbacks by synchronizing recovery; for each failure, O(N²) messages are exchanged for synchronizing recovery. Their second protocol uses O(N²) messages for recovery, but the extra information added to each application message is of size O(1). Their protocols also require the channels to be FIFO and cannot handle concurrent failure of multiple processes. Johnson and Zwaenepoel [24] present a general model for reasoning about recovery in message logging systems. Based on the model, they present a centralized algorithm for determining the maximum recoverable system state at any given time. Juang and Venkatesan [26] present optimistic recovery algorithms that use O(N) messages in ring networks and O(N²) messages in arbitrary networks, each application message being appended with information of size O(1). Peterson and Kearns [41] present a synchronous recovery protocol based on vector time. Their protocol cannot handle concurrent multiple failures.

Strom and Yemini [50] introduced the area of optimistic recovery using checkpointing. Their recovery technique, however, suffers from the domino effect. As a result, when a failure occurs, a process may have to roll back a number of times exponential in N. Moreover, their algorithm tolerates only a single failure and requires the channels to be FIFO. Smith et al. [48] presented a completely asynchronous optimistic recovery protocol which can handle concurrent failure of multiple processes. They also use vector timestamps. The algorithm of Damani and Garg [13] is based on the notion of a fault-tolerant vector clock, which helps in tracking causal dependencies in spite of failures. It also uses a history mechanism to detect orphan states and obsolete messages. These two mechanisms, along with checkpointing, are used to restore the system to a consistent state after the failure of one or more processes. Richard and Singhal [21] presented a recovery algorithm based on asynchronous checkpointing and optimistic message logging. Their algorithm also uses vector time to track dependency, and it can handle concurrent failure of multiple processes.

As noted above, in all of the existing asynchronous recovery algorithms, the additional information piggybacked with each application message is of size at least O(N). We presented, for the first time, an asynchronous recovery algorithm which does not use vector timestamps. The additional information piggybacked with application messages is proportional to the number of recovery lines established. Existing recovery algorithms use pessimistic or optimistic message logging. Our algorithm uses neither purely optimistic nor purely pessimistic message logging: it uses selective pessimistic message logging, in which only messages that may be required for replay during recovery are logged into stable storage, resulting in low message logging overhead.

Algorithm                      | Required message ordering | Asynchronous | Max. rollbacks per failure | Message overhead | No. of concurrent failures
Sistla and Welch [47]          | FIFO                      | No           | 1                          | O(N)             | 1
Johnson and Zwaenepoel [24]    | None                      | No           | 1                          | O(1)             | N
Peterson and Kearns [41]       | FIFO                      | No           | 1                          | O(N)             | 1
Strom and Yemini [50]          | FIFO                      | Yes          | O(2^N)                     | O(N)             | 1
Smith, Johnson, and Tygar [48] | None                      | Yes          | 1                          | O(N²f)           | N
Damani and Garg [13]           | None                      | Yes          | 1                          | O(N)             | N
Richard and Singhal [21]       | None                      | Yes          | 1                          | O(N)             | N
Our algorithm                  | None                      | Yes          | 1                          | O(F)             | N

Table 6.1: Comparison with related work (N is the number of processes in the system, f is the maximum number of failures of any single process, and F is the total number of recovery lines established).

6.6 Summary of Results

In this chapter, we presented a comprehensive recovery algorithm based on the quasi-synchronous checkpointing algorithm. The recovery algorithm can handle concurrent failure of multiple processes. The checkpointing algorithm guarantees the existence, at all times, of a recovery line consistent with the latest checkpoint of any process. The recovery algorithm exploits this property of the checkpointing algorithm to restore the system asynchronously to a state consistent with the latest checkpoint of a failed process. When multiple processes fail concurrently, the system is restored to a state consistent with the latest checkpoint of one of the failed processes, namely, the latest checkpoint with the least sequence number among the latest checkpoints of all the failed processes. Unlike many of the existing recovery algorithms, our recovery algorithm does not use vector timestamps for tracking dependency between checkpoints. The additional information piggybacked with the application messages to track dependency is proportional to the number of recovery lines established. We also described a method for reducing this message overhead further. Since messages are logged selectively, the message logging overhead is low.

CHAPTER 7

SUMMARY AND FUTURE RESEARCH

The emergence of high-speed local area networks made distributed systems possible. The availability of low-cost, high-performance personal computers, workstations, and server computers, together with the availability of distributed system software to support the development of distributed applications, has made distributed systems more popular than expensive centralized multi-user computers. Nowadays, distributed systems have become the norm in academic and industrial organizations for their computing needs.

7.1 Summary

Fault-tolerance is an important aspect of distributed system design. Checkpointing and rollback recovery are established techniques for achieving fault-tolerance in distributed systems. In rollback recovery, when a process fails, all processes roll back to a consistent global checkpoint. Even though consistent global checkpoints have been used in rollback recovery, efficient algorithms for finding consistent global checkpoints were not available. In this thesis, we presented a theoretical framework for identifying the checkpoints that can be used to construct consistent global checkpoints containing a target set of checkpoints. The framework is based on the notion of zigzag paths introduced by Netzer and Xu. Zigzag paths capture the precise requirements on the suitability of a set of checkpoints for being part of a consistent global checkpoint. We applied the results of our framework and presented a simple and elegant algorithm for enumerating all consistent global checkpoints of a distributed computation, which had been an open problem. An important characteristic of this algorithm is that it limits its search space to only those checkpoints that are useful for constructing consistent global checkpoints, rather than searching the whole set of checkpoints.

When processes take checkpoints asynchronously, checkpoints could become useless, i.e., some of the checkpoints may not be part of any consistent global checkpoint. Quasi-synchronous checkpointing algorithms have been proposed to minimize useless checkpoints. Quasi-synchronous checkpointing algorithms proposed in the literature achieve this objective to varying degrees, even though the designers of most of these algorithms do not explicitly state their objectives. Traditionally, synchronous, asynchronous, and quasi-synchronous checkpointing algorithms have been treated as three disjoint classes of algorithms. We established a containment relationship among these three classes of algorithms. This containment relationship helps us classify the quasi-synchronous checkpointing algorithms into four finer sub-classes, namely, strictly z-path free, z-path free, z-cycle free, and partially z-cycle free. This finer classification reveals interesting properties of checkpointing algorithms belonging to the various classes, which are discussed next.

In strictly z-path free checkpointing, all the checkpoints of processes are useful for constructing consistent global checkpoints. Moreover, the absence of non-causal z-paths between checkpoints facilitates the construction of consistent global checkpoints. This is because, in the absence of non-causal z-paths between checkpoints, every checkpoint is part of a consistent global checkpoint, and to construct a consistent global checkpoint containing any set S of pairwise causally unrelated checkpoints, one only has to keep adding checkpoints that have no causal paths from or to any of the checkpoints in S. Since one can easily determine whether or not a causal path exists between two checkpoints by comparing the vector timestamps of the checkpoints, constructing consistent global checkpoints under strictly z-path free checkpointing is easy.

In z-path free checkpointing, a non-causal z-path between two checkpoints exists if and only if there exists a sibling causal path between the two checkpoints. Thus, under z-path free checkpointing, even though there will be non-causal z-paths between checkpoints, they can be tracked through the sibling causal paths. So, in terms of the ease of finding consistent global checkpoints, z-path free checkpointing has the same advantages as strictly z-path free checkpointing. In addition, an optimal z-path free checkpointing algorithm has the potential to have lower checkpointing overhead than any strictly z-path free checkpointing algorithm for the same computation run. We pointed out that an optimal strictly z-path free checkpointing algorithm already exists in the literature; however, an optimal z-path free checkpointing algorithm does not. So, designing an optimal z-path free checkpointing algorithm remains an open problem.

In z-cycle free checkpointing, none of the checkpoints lies on a z-cycle. So every checkpoint is useful for constructing consistent global checkpoints. Moreover, an optimal z-cycle free checkpointing algorithm has the potential for lower checkpointing overhead than any z-path free checkpointing algorithm. However, due to the presence of non-causal z-paths between checkpoints, constructing consistent global checkpoints in a z-cycle free system is difficult because non-causal z-paths are not on-line trackable. We also pointed out that designing an optimal z-cycle free checkpointing algorithm remains an open problem.

Thus, from the classification of quasi-synchronous checkpointing algorithms, we learned that if we can design a quasi-synchronous checkpointing algorithm that is z-cycle free, and if we can also present an efficient method for finding consistent global checkpoints under such checkpointing, then that would help in minimizing rollback and speeding up recovery when failures occur.

Based on the above insight, we designed a quasi-synchronous checkpointing algorithm which is z-cycle free and which also facilitates the construction of a consistent global checkpoint containing any given checkpoint. Since every checkpoint is useful, a failed process needs to roll back only to its latest checkpoint, and all other processes will have a checkpoint consistent with the latest checkpoint of the failed process to roll back to. We exploited this elegant property of our checkpointing algorithm to design a recovery algorithm that can handle concurrent failure of multiple processes. Unlike the existing recovery algorithms, our recovery algorithm does not use vector timestamps to track dependency between checkpoints. The control information piggybacked with the application messages to track dependency is very small. It increases with the number of recovery lines established due to failures in the past; thus, during failure-free operation, the overhead is very low. Moreover, our algorithm uses neither pessimistic nor optimistic message logging to deal with the various types of messages that arise due to rollback. Messages are logged selectively, i.e., only those messages that could be required for replay after a rollback are logged into stable storage before they are processed. This selective message logging reduces the message logging overhead for the purpose of recovery.

7.2 Future Research Directions

Over the last decade, several checkpointing and recover}- algorithms have appeared

in the literature. Performance of checkpointing algorithms have been studied by

simulation. Some recover}- algorithms have been implemented [16|. In general, the

performance of recover}- algorithms has been compared with respect to the parameters

like the number of concurrent failures they can handle, the number of times a process

has to roll back when a failure occurs, the additional message overhead involved,

the nature of recovery (synchronous or asynchronous), the delay introduced in recovery, the FIFO

nature of channels, etc. No performance study through simulation has been done

for the recovery algorithms. Future research should involve building a simulator

which can serve as a testbed for evaluating the performance of not only checkpointing algorithms but also recovery algorithms.

Our classification of checkpointing algorithms reveals that finding the optimal z-cycle free checkpointing algorithm and the optimal z-path free algorithm remain open

problems. Designing such optimal algorithms seems to be impossible because it requires information about future events. Future research should focus on developing z-path free and z-cycle free checkpointing algorithms that are close to the optimal ones.

The notion of z-paths between checkpoints captures the precise requirements for consistency of checkpoints. The use of z-paths has not been fully exploited in studying various problems in distributed system design. Future research should focus on finding

efficient solutions to problems in distributed systems, such as global predicate detection, using z-paths.
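As an illustration of how z-paths can be checked mechanically, the following sketch performs a breadth-first search over recorded messages following the zigzag-path definition of Netzer and Xu [40]. The `Msg` record format and the function names are assumptions for exposition; a practical implementation would index messages by sender rather than scan the whole list.

```python
from collections import namedtuple, deque

# A message record: sender, send-interval index, receiver, receive-interval
# index. Interval i of process p is the execution between C(p,i) and C(p,i+1).
Msg = namedtuple("Msg", "sender s_int receiver r_int")

def has_z_path(msgs, src, dst):
    """True if a zigzag path runs from checkpoint src = (p, i) to
    dst = (q, j): a message chain m1..mn where m1 is sent by p after
    C(p,i), each m_{k+1} is sent by m_k's receiver in the same or a
    later interval than the one in which m_k was received (possibly
    before receiving m_k), and mn is received by q before C(q,j)."""
    p, i = src
    q, j = dst
    # Start from every message p sends after checkpoint C(p,i).
    frontier = deque(m for m in msgs if m.sender == p and m.s_int >= i)
    seen = set(frontier)
    while frontier:
        m = frontier.popleft()
        if m.receiver == q and m.r_int < j:
            return True               # the chain reaches q before C(q,j)
        # Extend with messages the receiver sends in the same/later interval.
        for nxt in msgs:
            if (nxt.sender == m.receiver and nxt.s_int >= m.r_int
                    and nxt not in seen):
                seen.add(nxt)
                frontier.append(nxt)
    return False

def on_z_cycle(msgs, cpt):
    """A checkpoint is useless iff it lies on a z-cycle (z-path to itself)."""
    return has_z_path(msgs, cpt, cpt)
```

Note that z-paths are directional and need not be causal: the second message in a zigzag turn may be sent before the first is received, which is exactly what makes z-cycles harder to detect than plain happened-before cycles.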

Future research should also focus on developing efficient solutions for recovery in mobile computing systems. Recovery algorithms proposed for distributed systems are not suitable for mobile computing systems due to the mobile nature of the hosts as well as additional constraints on the mobile hosts, such as limited memory, unreliable stable storage at mobile hosts, and the limited bandwidth of the wireless channels through which a mobile host has to communicate with mobile support stations.

BIBLIOGRAPHY

[1] A. Acharya and B. R. Badrinath. "Checkpointing Distributed Applications on Mobile Computers". In Proceedings of the International Conference on Parallel and Distributed Information Systems, September 1994.

[2] R. Baldoni, J. M. Hélary, A. Mostefaoui, and M. Raynal. "Characterizing Consistent Checkpoints in Large-Scale Distributed Systems". In Proceedings of the 5th IEEE International Conference on Parallel and Distributed Computing, pages 314-323, Cheju Island (South Korea), August 1995.

[3] R. Baldoni, J. M. Hélary, A. Mostefaoui, and M. Raynal. "Consistent Checkpoints in Message Passing Distributed Systems". Rapport de Recherche No. 2564, INRIA, France, June 1995.

[4] R. Baldoni, J. M. Hélary, A. Mostefaoui, and M. Raynal. "On Modeling Consistent Checkpoints and the Domino Effect in Distributed Systems". Rapport de Recherche No. 2569, INRIA, France, June 1995.

[5] R. Baldoni, J. M. Hélary, A. Mostefaoui, and M. Raynal. "A Communication Induced Algorithm that Ensures the Rollback Dependency Trackability". In Proceedings of the 27th International Symposium on Fault-Tolerant Computing, Seattle, July 1997.

[6] R. Baldoni, J. M. Hélary, and M. Raynal. "About Recording in Asynchronous Computations". In Proceedings of the 15th ACM Symposium on the Principles of Distributed Computing, page 55, May 1996.

[7] B. Bhargava and P. Leu. "Concurrent Robust Checkpointing and Recovery in Distributed Systems". In Proceedings of the 4th IEEE International Conference on Data Engineering, pages 154-163, February 1988.

[8] B. Bhargava and S. R. Lian. "Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems - An Optimistic Approach". In Proceedings of the IEEE Symposium on Reliable Distributed Systems, pages 3-12, 1988.

[9] A. Borg, J. Baumbach, and S. Glazer. "A Message System Supporting Fault Tolerance". In Proceedings of the 9th ACM Symposium on Operating Systems Principles, pages 90-99, October 1983.

[10] D. Briatico, A. Ciuffoletti, and L. Simoncini. "A Distributed Domino-Effect Free Recovery Algorithm". In Proceedings of the IEEE Symposium on Reliability in Distributed Software and Database Systems, pages 207-215, IEEE, 1984.

[11] K. M. Chandy and L. Lamport. "Distributed Snapshots: Determining Global States of Distributed Systems". ACM Transactions on Computer Systems, 3(1):63-75, February 1985.

[12] R. Cooper and K. Marzullo. "Consistent Detection of Global Predicates". In Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, pages 167-174, 1991.

[13] Om P. Damani and Vijay K. Garg. "How to Recover Efficiently and Asynchronously when Optimism Fails". In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 108-115, 1996.

[14] Luis Moura e Silva and João Gabriel Silva. "Global Checkpointing for Distributed Programs". In Proceedings of the Symposium on Reliable Distributed Systems, pages 155-162, 1992.

[15] E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. "A Survey of Rollback-Recovery Protocols in Message-Passing Systems". Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 1996.

[16] E. N. Elnozahy and W. Zwaenepoel. "Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit". IEEE Transactions on Computers, 41(5):526-531, May 1992.

[17] E. Fromentin, N. Plouzeau, and M. Raynal. "An Introduction to the Analysis and Debug of Distributed Computations". In Proceedings of the 1st IEEE International Conference on Algorithms and Architectures for Parallel Processing, pages 545-554, Brisbane, Australia, April 1995.

[18] K. Geihs and M. Seifert. "Automated Validation of a Co-operation Protocol for Distributed Systems". In Proceedings of the 6th International Conference on Distributed Computing Systems, pages 436-443, 1986.

[19] O. Gerstel, M. Hurfin, N. Plouzeau, M. Raynal, and S. Zaks. "On-the-fly Replay: a Practical Paradigm and its Implementation for Distributed Debugging". In Proceedings of the 6th IEEE International Symposium on Parallel and Distributed Debugging, pages 266-272, Dallas, TX, October 1995.

[20] M. Hurfin, N. Plouzeau, and M. Raynal. "A Debugging Tool for Distributed Estelle Programs". Computer Communications, 16:328-333, 1993.

[21] G. Richard III and M. Singhal. "Complete Process Recovery: Using Vector Time to Handle Multiple Failures in Distributed Systems". IEEE Concurrency, 5(2):50-58, April-June 1997.

[22] Golden G. Richard III. "Techniques for Process Recovery in Message Passing and Distributed Shared Memory Systems". PhD thesis, The Ohio State University, 1994.

[23] D. B. Johnson and W. Zwaenepoel. "Sender-based Message Logging". In Proceedings of the IEEE Symposium on Fault-Tolerant Computing, pages 14-19, June 1987.

[24] D. B. Johnson and W. Zwaenepoel. "Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing". Journal of Algorithms, 11(3):462-491, September 1990.

[25] T-Y. Juang and S. Venkatesan. "Crash Recovery with Little Overhead". In Proceedings of the 11th International Conference on Distributed Computing Systems, pages 349-361, 1990.

[26] T-Y. Juang and S. Venkatesan. "Efficient Algorithm for Crash Recovery in Distributed Systems". In Proceedings of the 10th Conference on Foundations of Software Technology and Theoretical Computer Science, pages 349-361, 1990.

[27] Junguk L. Kim and Taesoon Park. "An Efficient Protocol for Checkpointing Recovery in Distributed Systems". IEEE Transactions on Parallel and Distributed Systems, 4(8):955-960, August 1993.

[28] K. H. Kim. "A Scheme for Coordinated Execution of Independently Designed Recoverable Distributed Processes". In Proceedings of the 16th IEEE Symposium on Fault-Tolerant Computing, pages 130-135, June 1986.

[29] R. Koo and S. Toueg. "Checkpointing and Roll-back Recovery for Distributed Systems". IEEE Transactions on Software Engineering, SE-13(1):23-31, January 1987.

[30] Ajay D. Kshemkalyani, Michel Raynal, and Mukesh Singhal. "An Introduction to Snapshot Algorithms in Distributed Computing". Distributed Systems Engineering Journal, 2(4):224-233, December 1995.

[31] K. Tsuruoka, A. Kaneko, and Y. Nishihara. "Dynamic Recovery Schemes for Distributed Process". In Proceedings of the IEEE 2nd Symposium on Reliability in Distributed Software and Database Systems, pages 124-130, 1981.

[32] L. Lamport. "Time, Clocks, and the Ordering of Events in a Distributed System". Communications of the ACM, 21(7):558-565, July 1978.

[33] L. Lamport. "The Mutual Exclusion Problem: Part I - A Theory of Interprocess Communication". JACM, 33(2):313-326, April 1986.

[34] K. Li, J. F. Naughton, and J. S. Plank. "Checkpointing Multicomputer Applications". In Proceedings of the 10th Symposium on Reliable Distributed Systems, pages 2-11, 1991.

[35] D. Manivannan, Robert H. B. Netzer, and M. Singhal. "Finding Consistent Global Checkpoints in a Distributed Computation". IEEE Transactions on Parallel and Distributed Systems, 8(6):623-627, June 1997.

[36] D. Manivannan and M. Singhal. "A Low-overhead Recovery Technique using Quasi-synchronous Checkpointing". In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 100-107, Hong Kong, May 1996.

[37] F. Mattern. "Virtual Time and Global States of Distributed Systems". In M. Cosnard et al., editors, Parallel and Distributed Algorithms, pages 215-226, Elsevier Science, North Holland, 1989.

[38] B. Miller and J. Choi. "Breakpoints and Halting in Distributed Programs". In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 316-323, 1988.

[39] M. Powell and D. Presotto. "Publishing: A Reliable Broadcast Communication Mechanism". In Proceedings of the 9th ACM Symposium on Operating System Principles, pages 100-109, 1983.

[40] Robert H. B. Netzer and Jian Xu. "Necessary and Sufficient Conditions for Consistent Global Snapshots". IEEE Transactions on Parallel and Distributed Systems, 6(2):165-169, February 1995.

[41] S. L. Peterson and Phil Kearns. "Rollback Based on Vector Time". In Proceedings of the 12th Symposium on Reliable Distributed Systems, pages 68-77, 1993.

[42] R. Prakash and M. Singhal. "Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems". IEEE Transactions on Parallel and Distributed Systems, 7(10):1035-1048, October 1996.

[43] Michel Raynal and Mukesh Singhal. "Logical Time: Capturing Causality in Distributed Systems". Computer, 29(2):49-56, February 1996.

[44] Mukesh Singhal and A. Kshemkalyani. "An Efficient Implementation of Vector Clocks". Information Processing Letters, 43:47-52, August 1992.

[45] Mukesh Singhal and Friedemann Mattern. "An Optimality Proof for Asynchronous Recovery Algorithms in Distributed Systems". Information Processing Letters, 55:117-121, 1995.

[46] Mukesh Singhal and Niranjan G. Shivaratri. "Advanced Concepts in Operating Systems". McGraw-Hill, 1994.

[47] A. P. Sistla and J. L. Welch. "Efficient Distributed Recovery Using Message Logging". In Proceedings of the 8th ACM Symposium on Principles of Distributed Computing, pages 223-238, August 1989.

[48] Sean W. Smith, David B. Johnson, and J. D. Tygar. "Completely Asynchronous Optimistic Recovery with Minimal Rollbacks". In Proceedings of the 25th International Symposium on Fault-Tolerant Computing, pages 361-370, IEEE, 1995.

[49] M. Spezialetti and P. Kearns. "Simultaneous Regions: A Framework for the Consistent Monitoring of Distributed Systems". In Proceedings of the 9th International Conference on Distributed Computing Systems, pages 61-68, 1989.

[50] R. E. Strom and S. Yemini. "Optimistic Recovery in Distributed Systems". ACM Transactions on Computer Systems, 3(3):204-226, August 1985.

[51] K. Venkatesh, T. Radhakrishnan, and H. F. Li. "Optimal Checkpointing and Local Encoding for Domino-free Rollback Recovery". Information Processing Letters, 25:295-303, July 1987.

[52] Y. M. Wang and W. K. Fuchs. "Optimistic Message Logging for Independent Checkpointing in Message Passing Systems". In Proceedings of the Symposium on Reliable Distributed Systems, pages 147-154, October 1992.

[53] Y. M. Wang and W. K. Fuchs. "Scheduling Message Processing for Reducing Rollback Propagation". In Proceedings of the IEEE Fault-Tolerant Computing Symposium, pages 204-211, July 1992.

[54] Y. M. Wang and W. K. Fuchs. "Lazy Checkpoint Coordination for Bounding Rollback Propagation". In Proceedings of the 12th IEEE Symposium on Reliable Distributed Systems, pages 78-85, October 1993.

[55] Y. M. Wang, Y. Huang, and W. K. Fuchs. "Progressive Retry for Software Recovery in Distributed Systems". In Proceedings of the IEEE Fault-Tolerant Computing Symposium, pages 138-144, June 1993.

[56] Yi-Min Wang. "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints". IEEE Transactions on Computers, (to appear).

[57] Jian Xu and Robert H. B. Netzer. "Adaptive Independent Checkpointing for Reducing Rollback Propagation". In Proceedings of the 5th IEEE Symposium on Parallel and Distributed Processing, December 1993.
