Probabilistic Clock Synchronization
Total Page:16
File Type:pdf, Size:1020Kb
Distributed Computing (1989) 3 : 146-158 9 Springer-Verlag 1989 Probabilistic clock synchronization Flaviu Cristian IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA Flaviu Cristian is a com- in some given maximum derivation from a time puter scientist at the IBM Al- reference external to the system. Internal clock syn- maden Research Center in San chronization keeps processor clocks within some Jose, California. He received his PhD from the University of maximum relative deviation of each other. Exter- Grenoble, France, in 1979. nally synchronized clocks are also internally syn- After carrying out research in chronized. The converse is not true: as time passes operating systems and pro- internally synchronized clocks can drift arbitrarily gramming methodology in France, and working on the far from external time. specification, design, and verifi- Clock synchronization is needed in many dis- cation of fault-tolerant pro- tributed systems. Internal clock synchronization grams in England, he joined enables one to measure the duration of distributed IBM in 1982. Since then he has activities that start on one processor and terminate worked in the area of fault-tol- erant distributed protocols and on another processor and to totally order distrib- systems. He has participated in the design and implementation uted events in a manner that closely approximates of a highly available system prototype at the Almaden Research their real time precedence. To allow exchange of Center and has reviewed and consulted for several fault-tolerant information about the timing of events with other distributed system designs, both in Europe and in the American systems and users, many systems require external divisions of IBM. He is now a technical leader in the design of a new U.S. Air Traffic Control System which must satisfy clock synchronization. For example external time very stringent availability requirements. can be used to record the occurrence of events for later analysis by humans, to instruct a system to take certain actions when certain specified (exter- Abstract. A probabilistic method is proposed for nal) time deadlines occur, and to order the occur- reading remote clocks in distributed systems sub- rence of related events observed by distinct sys- ject to unbounded random communication delays. tems. The method can achieve clock synchronization This paper proposes a new approach for read- precisions superior to those attainable by previous- ing remote clocks in networks subject to un- ly published clock synchronization algorithms. Its bounded random message delays. The method can use is illustrated by presenting a time service which be used to improve the precision of both internal maintains externally (and hence, internally) syn- and external synchronization algorithms. Our ap- chronized clocks in the presence of process, com- proach is probabilistic because it does not guaran- munication and clock failures. tee that a processor can always read a remote clock Key words: Communication - Distributed system with an a priori specified precision (such a guaran- - Fault-tolerance Time service - Clock synchro- tee cannot be provided when there is no bound nization on message delays). However, by retrying a suffi- cient number of times, a process can read the clock of another process with a given precision with a probability as close to one a desired. An important Introduction characteristic of our method is that when a process In a distributed system, external clock synchroniza- succeeds in reading a remote clock, it knows the tion consists of maintaining processor clocks with- actual reading precision achieved. F. Cristian: Probabilistic clock synchronization 147 The use of the remote clock reading method Measurements of process to process message is illustrated by describing a distributed time ser- delays in existing systems indicate that typically vice which maintains externally synchronized their distribution has a shape resembling that illus- clocks despite process, communication and clock trated in Fig. 1. This distribution has a maximum failures. The service is implemented by a group density at a mode point between the minimum de- of time servers which execute a simple probabilistic lay min and the median delay, usually close to min, clock synchronization protocol. After presenting with a long thin tail to the right. For instance, the protocol and its performance, we conclude by a sample measurement of 5000 message round trip comparing it with other published clock synchroni- delays between two light-weight MVS processes zation protocols. (running on two IBM 4381 processors connected via a channel-to-channel local area network) per- Message delays formed at the Almaden Research Center (Dong, private communication, June 1988), indicates a me- To synchronize the clocks of their host processors, dian round trip delay of 4.48 ms situated between time server processes communicate among them- a minimum delay of 4.22 ms and an average ob- selves by sending messages via a communication served delay of 4.91 ms. While the maximum ob- network. Since there is a one to one correspon- served delay in this experiment (during which no dence between time server processes and proces- route changes or 'halt' button pushes occurred) sors, we do not distinguish between processes and was very far at the right: 93.17 ms, 95% of all ob- processors. For example, when we say "the clock served delays were shorter than 5.2 ms. of process P" we mean "the clock of the processor on which P runs". In distributed systems the task of synchronizing Previous work clocks is made difficult (among other things) by Most published clock synchronization algorithms the existence of unpredictable communication de- (e.g., Cristian et al. 1985; Dolev et al. 1984; Lam- lays. Between the moment a process P sends a mes- port 1987; Lamport and Melliar-Smith 1985; Lun- sage to a process Q and the moment Q receives delius-Welch and Lynch 1988; Schneider 1987; the message, there is an arbitrary, random real time Srikanth and Toneg 1987) assume the existence of delay. A minimum rain for this delay exists. It can an upper bound max on real time message trans- be computed by counting the time needed to pre- mission delays. If the delays experienced by deliv- pare, transmit, and receive an empty message in ered messages are smaller than max with probabili- the absence of transmission errors and any other ty 1, these algorithms keep clocks within a maxi- system load. In general, one does not know an mum relative deviation greater than max-min with upper bound on message transmission delays. probability one. It is known (Lundelius and Lynch These depend on the amount of communication 1984) that the closeness with which clocks can be and computation going on in parallel in the system, synchronized with certainty (i.e., with probabil- on the possibility that transmission errors will ity one) is limited: n clocks cannot be synchron- cause messages to be retransmitted several times, ized with certainty closer than (max-min)(1- 1/n), and on other random events, such as page faults, even when no failures occur and clocks do not process switches, the establishment of new commu- drift. nication routes, or a freeze of the activity of a pro- Other authors (e.g., Gusella and Zatti 1987; cess caused by a human operator who pushes the Marzullo 1984) adopt the premise that message de- 'halt' button on the panel of the processor hosting lays are unbounded, and use as upper bounds on that process. synchronization message delays the timeout delays employed for detecting communication failures be- tween processes. Such timeouts are introduced by % of system designers to prevent situations in which msgs some process P waits forever for a message from another process Q that will never arrive (for exam- ple because of a failure of Q). Since message delays are unbounded, it is understood that a small per- centage of messages may need more than a given timeout delay to travel between processes, i.e., there rain detay is a chance that "false" communication failures are Fig. 1 detected. This is the price paid for letting systems 148 F. Cristian: Probabilistic clock synchronization subject to unbounded message transmission delays tion, we assume initially that processor clocks are continue to work despite process failures and mes- correct. We relax this assumption later, by showing sage losses. To reduce the likelihood of "false" how one can detect and handle arbitrary clock fail- alarms, a timeout delay is conservatively estimated ures. from network delay statistics to ensure that mes- We assume that message delays between pro- sage delays are smaller than the chosen timeout cesses are unbounded. As we will see later, the closer with a very high probability p (typically p > 0.99). the distribution of such delays resembles that illus- If such a timeout delay is denoted by "maxp", the trated in Fig. 1 (i.e., the closer the median delay best synchronization precision achievable by the is to rain), the better our probabilistic clock syn- algorithms proposed in Gusella and Zatti (1987) chronization algorithms perform. What is remark- and Marzullo (1984) can be characterized as being able, however, is that their correctness does not de- 4 (maxp-min). pend on any assumption about the particular shape of the message delay density function. We also assume that, to let processes continue to work Assumptions on clocks, processes, despite process failures or message losses, a timeout and communication delay maxp is chosen. The adoption of such a time- out delay divides observable network behaviors Each time server process has access to the hard- into two classes. A communication path (P, Q) be- ware clock H of its host processor. To simplify tween processes P and Q is said to function correct- our presentation, we assume these clocks have a ly if any message sent by P is delivered uncorrupted much higher resolution than the time intervals (e.g., to Q within maxp time units.