Distributed Computing (1989) 3 : 146-158

9 Springer-Verlag 1989

Probabilistic

Flaviu Cristian IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA

Flaviu Cristian is a com- in some given maximum derivation from a time puter scientist at the IBM Al- reference external to the system. Internal clock syn- maden Research Center in San chronization keeps processor within some Jose, California. He received his PhD from the University of maximum relative deviation of each other. Exter- Grenoble, France, in 1979. nally synchronized clocks are also internally syn- After carrying out research in chronized. The converse is not true: as time passes operating systems and pro- internally synchronized clocks can drift arbitrarily gramming methodology in France, and working on the far from external time. specification, design, and verifi- Clock synchronization is needed in many dis- cation of fault-tolerant pro- tributed systems. Internal clock synchronization grams in England, he joined enables one to measure the duration of distributed IBM in 1982. Since then he has activities that start on one processor and terminate worked in the area of fault-tol- erant distributed protocols and on another processor and to totally order distrib- systems. He has participated in the design and implementation uted events in a manner that closely approximates of a highly available system prototype at the Almaden Research their real time precedence. To allow exchange of Center and has reviewed and consulted for several fault-tolerant information about the timing of events with other distributed system designs, both in Europe and in the American systems and users, many systems require external divisions of IBM. He is now a technical leader in the design of a new U.S. Air Traffic Control System which must satisfy clock synchronization. For example external time very stringent availability requirements. can be used to record the occurrence of events for later analysis by humans, to instruct a system to take certain actions when certain specified (exter- Abstract. A probabilistic method is proposed for nal) time deadlines occur, and to order the occur- reading remote clocks in distributed systems sub- rence of related events observed by distinct sys- ject to unbounded random communication delays. tems. The method can achieve clock synchronization This paper proposes a new approach for read- precisions superior to those attainable by previous- ing remote clocks in networks subject to un- ly published clock synchronization algorithms. Its bounded random message delays. The method can use is illustrated by presenting a time service which be used to improve the precision of both internal maintains externally (and hence, internally) syn- and external synchronization algorithms. Our ap- chronized clocks in the presence of process, com- proach is probabilistic because it does not guaran- munication and clock failures. tee that a processor can always read a remote clock Key words: Communication - Distributed system with an a priori specified precision (such a guaran- - Fault-tolerance Time service - Clock synchro- tee cannot be provided when there is no bound nization on message delays). However, by retrying a suffi- cient number of times, a process can read the clock of another process with a given precision with a probability as close to one a desired. An important Introduction characteristic of our method is that when a process In a distributed system, external clock synchroniza- succeeds in reading a remote clock, it knows the tion consists of maintaining processor clocks with- actual reading precision achieved. F. Cristian: Probabilistic clock synchronization 147

The use of the remote clock reading method Measurements of process to process message is illustrated by describing a distributed time ser- delays in existing systems indicate that typically vice which maintains externally synchronized their distribution has a shape resembling that illus- clocks despite process, communication and clock trated in Fig. 1. This distribution has a maximum failures. The service is implemented by a group density at a mode point between the minimum de- of time servers which execute a simple probabilistic lay min and the median delay, usually close to min, clock synchronization protocol. After presenting with a long thin tail to the right. For instance, the protocol and its performance, we conclude by a sample measurement of 5000 message round trip comparing it with other published clock synchroni- delays between two light-weight MVS processes zation protocols. (running on two IBM 4381 processors connected via a channel-to-channel local area network) per- Message delays formed at the Almaden Research Center (Dong, private communication, June 1988), indicates a me- To synchronize the clocks of their host processors, dian round trip delay of 4.48 ms situated between time server processes communicate among them- a minimum delay of 4.22 ms and an average ob- selves by sending messages via a communication served delay of 4.91 ms. While the maximum ob- network. Since there is a one to one correspon- served delay in this experiment (during which no dence between time server processes and proces- route changes or 'halt' button pushes occurred) sors, we do not distinguish between processes and was very far at the right: 93.17 ms, 95% of all ob- processors. For example, when we say "the clock served delays were shorter than 5.2 ms. of process P" we mean "the clock of the processor on which P runs". In distributed systems the task of synchronizing Previous work clocks is made difficult (among other things) by Most published clock synchronization algorithms the existence of unpredictable communication de- (e.g., Cristian et al. 1985; Dolev et al. 1984; Lam- lays. Between the moment a process P sends a mes- port 1987; Lamport and Melliar-Smith 1985; Lun- sage to a process Q and the moment Q receives delius-Welch and Lynch 1988; Schneider 1987; the message, there is an arbitrary, random real time Srikanth and Toneg 1987) assume the existence of delay. A minimum rain for this delay exists. It can an upper bound max on real time message trans- be computed by counting the time needed to pre- mission delays. If the delays experienced by deliv- pare, transmit, and receive an empty message in ered messages are smaller than max with probabili- the absence of transmission errors and any other ty 1, these algorithms keep clocks within a maxi- system load. In general, one does not know an mum relative deviation greater than max-min with upper bound on message transmission delays. probability one. It is known (Lundelius and Lynch These depend on the amount of communication 1984) that the closeness with which clocks can be and computation going on in parallel in the system, synchronized with certainty (i.e., with probabil- on the possibility that transmission errors will ity one) is limited: n clocks cannot be synchron- cause messages to be retransmitted several times, ized with certainty closer than (max-min)(1- 1/n), and on other random events, such as page faults, even when no failures occur and clocks do not process switches, the establishment of new commu- drift. nication routes, or a freeze of the activity of a pro- Other authors (e.g., Gusella and Zatti 1987; cess caused by a human operator who pushes the Marzullo 1984) adopt the premise that message de- 'halt' button on the panel of the processor hosting lays are unbounded, and use as upper bounds on that process. synchronization message delays the timeout delays employed for detecting communication failures be- tween processes. Such timeouts are introduced by % of system designers to prevent situations in which msgs some process P waits forever for a message from another process Q that will never arrive (for exam- ple because of a failure of Q). Since message delays are unbounded, it is understood that a small per- centage of messages may need more than a given timeout delay to travel between processes, i.e., there rain detay is a chance that "false" communication failures are Fig. 1 detected. This is the price paid for letting systems 148 F. Cristian: Probabilistic clock synchronization subject to unbounded message transmission delays tion, we assume initially that processor clocks are continue to work despite process failures and mes- correct. We relax this assumption later, by showing sage losses. To reduce the likelihood of "false" how one can detect and handle arbitrary clock fail- alarms, a timeout delay is conservatively estimated ures. from network delay statistics to ensure that mes- We assume that message delays between pro- sage delays are smaller than the chosen timeout cesses are unbounded. As we will see later, the closer with a very high probability p (typically p > 0.99). the distribution of such delays resembles that illus- If such a timeout delay is denoted by "maxp", the trated in Fig. 1 (i.e., the closer the median delay best synchronization precision achievable by the is to rain), the better our probabilistic clock syn- algorithms proposed in Gusella and Zatti (1987) chronization algorithms perform. What is remark- and Marzullo (1984) can be characterized as being able, however, is that their correctness does not de- 4 (maxp-min). pend on any assumption about the particular shape of the message delay density function. We also assume that, to let processes continue to work Assumptions on clocks, processes, despite process failures or message losses, a timeout and communication delay maxp is chosen. The adoption of such a time- out delay divides observable network behaviors Each time server process has access to the hard- into two classes. A communication path (P, Q) be- ware clock H of its host processor. To simplify tween processes P and Q is said to function correct- our presentation, we assume these clocks have a ly if any message sent by P is delivered uncorrupted much higher resolution than the time intervals (e.g., to Q within maxp time units. If a message accepted process to process communication delays) which at one path end is never delivered at the other must be measured. For example, if the delays ob- end or is delivered after more than maxp time units, servable are of the order of milliseconds, we assume the path suffers a late timing or performance failure the hardware clocks have a microsecond resolu- (Cristian et al. 1985). We assume that communica- tion. A clock H is correct if it measures the length tion channels between processes can only be af- t-t' of any real time interval [t', t] with an error fected by performance failures. of at most p(t-t'), where p is the maximum clock Processes undergo state transitions in response drift rate from external (or real) time specified by to message arrivals and timeout events generated the clock manufacturer: by timers. To simplify our presentation we assume that between the occurrence of a timeout and the (1 --p)(t--t')<_H(t)--H(t')<_(1 +p)(t--t'). (C) invocation of the associated timeout handler there is a null (process scheduling) delay and that process In the above formula, it is implicit that the de- timers advance at the same rate as the clocks of lay t- t' is long enough so that the worst case error the underlying processors. Thus, a correct process in measuring its length caused by the discrete clock which at real time t sets a timer to measure W granularity is negligible compared to that due to time units, is awakened in the real time interval drift. For most types of quartz clocks, the constant [t+(1-p)W,, t+(1 +p)W]. We say that a process p is of the order of 10-6. For example the worst behaves correctly if in response to trigger events actual drift rate measured for the microsecond res- (such as message arrivals or timeout occurrences) olution clocks existing on the IBM 4381 processors it behaves in the manner specified. The specifica- in our laboratory is 6,10 _6 (Dong, private com- tion prescribes the state transitions which should munication, June 1988). Since p is such a small occur as well as the time intervals within which quantity, we ignore in this paper terms of the order these transitions should occur. If, in response to of p2 or smaller (e.g., we equate (l+p) -1 with (1 some trigger event, a process never performs its -p) and (1 _p)-i with (1 +p)). A clock failure oc- specified state transition or undergoes it too early curs if the clock correctness condition (C) is violat- or too late (i.e., outside the time interval specified), ed. Examples of clock failure types are: crash fail- the process is said to suffer a timing failure (Cristian ures (i.e., the clock stops), timing failures (e.g., a et al. 1985). Processes which crash, omit to send change in the frequency of the quartz oscillator certain messages, respond too slowly to trigger driving the clock counter causes the clock value events (because of excessive load or slow timers), to be incremented too fast or too slowly), and arbi- or time out too early (because of timers running trary, or Byzantine, failures (e.g., the clock counter at speeds greater than 1 + p) are examples of timing displays a nonmonotonic time because some of its failures. We assume that processes can suffer only bits are stuck at 0 or at 1). To simplify our presenta- timing failures. F. Cristian: Probabilistic clock synchronization 149

Attempting to read a remote clock assume that P does not know the drift rate of Q's clock or its own clock, the value CQ(t) can be any To read the clock of a process Q, a process P sends point in this interval. In other words: a message ("time = ?') to Q. When Q receives the [T+min(1 --p), T+2D(1 +2p)--min(1 +p)] is the message it replies with a message ("time=", T) smallest interval which P can determine in terms where T is the time on Q's clock. If P does not of T and D that covers Q's clock value. receive a reply because of a failure, its attempt at Since P has no means of knowing exactly where reading Q's clock fails. Assume that P receives a Q's clock is in the interval (6), the best it can do reply and let D be half of the round trip delay is to estimate CQ(t) by a function C~(T, D) of what measured on P' clock between the sending of the it knows, that is, Tand D. In doing so, the actual ("time=?") message and the reception of the error that P makes is: ("time = ", T) message. IC~(T, D) = Co(t) I. Theorem. If the clocks of processes P and Q are P minimizes the maximum error it can make in correct, the value displayed by Q's clock when P estimating CQ(t) by choosing C~(T, D) to be the receives the ("time = ", T) message is in the interval midpoint of the interval (6): [T+min(1 -p), T+2D(1 +2p)-min(1 +p)]. C~(T, D) = T+D(1 + 2p)- min p. (7) Proof Let t be the real-time when P receives the For this choice of C~(T, D), the maximum error ("time=", T) message from Q and CQ(t) be the e that P can make when reading Q's clock is half value displayed by Q's clock at that time. Let the length of the interval (6): min + e, min + fi, e > 0, fl > 0, be the real time delays experienced by the ("time = ?") and ("time = ", 7) e=D(l + 2p)--min. (8) messages, respectively, and let 2d be the real time Any other estimate choice leads to a bigger maxi- round trip delay: mum error. We refer to the expression (7) as "P's 2d=2 rain + c~+fi. (1) reading of Q's clock" and to (8) as "P's reading error" or "P's reading precision". Since e and fl are positive, (1) implies:

0

least min. To achieve the best possible precision to make. For a given choice of k, allowing for up for which there exists a positive probability of rap- to k reading attempts increases the probability of port, P must chose a timeout delay as close to success to 1 _pk. Since p < 1, this probability can Umi, as possible. For such a limit timeout delay, be made arbitrarily close to 1 by choosing a suffi- formula (8) implies that the best reading precision ciently large k. achievable by a clock reading experiment is For large values of k and a choice of W that ensures independence between successive reading

emin = 3p min. (11) attempts, Bernoulli's law yields that the average number of reading attempts needed for achieving The first two p s correspond to the relative drift rapport is (1-p)-1. Since each attempt costs two between Q's clock and P's clock while the messages, it follows that the average number of ("time = ", T) message travels between Q and P, messages ti for achieving rapport is and the third p corresponds to P's error in setting 2 its timeout delay so that it measures at least min ti = (1 - p)" (12) real time units. Let p be the probability that P observes a round trip delay greater than 2 U. The larger U is, the Formulae (8) and (12) indicate the existence of a smaller p will be. Conversely, the smaller U is, the continuum of different clock reading algorithms in- larger p will be. Thus, there exists a fundamental dexed by different timeout delays U: from aggres- trade-off between the precision achievable when at- sive but risky algorithms indexed by U's close to tempting to read a remote clock and the probabili- min which are capable of achieving high precisions ty 1-p of success. The better the desired precision by possibly using a very large number of messages, is, the smaller is the probability of success. Conver- to low risk "deterministic" algorithms indexed by sely, the worse the precision is, the greater is the U's close to max which achieve poor precisions probability of success. In the limiting case, if a max- by using a small number of messages. imum real time message delay max is known, by settling for a remote clock reading precision of A distributed time service max(1 + 3p)-min (corresponding to a timeout de- lay of max(1 +p)), one obtains a deterministic re- The probabilistic clock reading method described mote clock reading algorithm (similar to the ones above can be used to improve the precision achiev- used by the synchronization algorithms presented able by most of the internal clock synchronization in Cristian et al. (1986), Dolev et al. (1984), Lam- algorithms surveyed in Schneider (1987) by letting port (1987), Lamport and Melliar-Smith (1985), time servers read probabilistically the remote clock Lundelius-Welch and Lynch (1988), and Srikanth values used as inputs to the convergence functions and Toueg (1987)) which always achieves rapport. mentioned there. Instead of exploring this avenue, The price for such a choice is poor precision. we devote the rest of the paper to describing a Consider now a certain specified precision e and simple distributed time service which provides ex- the associated probability p that a reading attempt ternal clock synchronization. fails. For this precision, the probability that process The goal is to keep clocks synchronized to an P reaches rapport with process Q can be increased official source of external time signals, such as the if several clock reading attempts are allowed before Universal Time Coordinated (UTC) signals broad- P gives up. To achieve a certain degree of indepen- cast by the WWV radio station of the National dence between successive attempts, these should be Bureau of Standards. Commercially available re- separated by a minimum waiting delay W. This ceivers (e.g., Kinemetrics/Truetime 1987) can re- delay must be chosen so as to ensure that if P ceive such signals. The receivers can be attached and Q stay connected and correct, then any tran- to processors via dedicated busses. To guard sient network traffic bursts that may effect their against a physical receiver failure, it is possible to communication disappear within Wclock time un- pair physically independent receivers into a self- its with high probability. (A solution to the prob- checking receiver unit, by continuously comparing lem of how to adapt to slower, nonbursty, network their results, and interpreting any disagreement load changes is sketched later.) To avoid P at- among them as a failure of the pair (Kinemetrics/ tempting, ad infinitum, to read Q's clock when Q Truetime 1987). If no multiple failures occur, a self- is permanently partitioned from P or has crashed, checking receiver either displays correctly the ex- one must decide on a maximum value k for the ternal time or signals an error. We assume that number of successive attempts that P is allowed all radio receivers used by the time service are self- F. Cristian: Probabilistic clock synchronization 151 checking. We also assume that, for reasons of econ- od the slave clock displays the values omy, only certain processors, called masters, have L=H(I+m)+N and M+~=(H+~)(I+m)+N, time receivers attached to them. The other proces- respectively, where H is the hardware clock value sors are referred to as slaves. To simplify our pre- at rapport, by solving the above system of equa- sentation we initially assume the existence of a un- tions we conclude that the parameters m,N must ique, continuously available, master time source. be set to Issues related to the implementation of this master m=(M-L)/c~, N=L-(1 +m)*H (A) time source by a group of redundant physical mas- ters are discussed later. To further simplify the pre- for the e clock time units following rapport. After sentation, we do not distinguish between real (or the ~ amortization period elapses, at local time atomic) time and astronomical UTC time, that is, E = M + e, the slave clock C can be allowed to run we ignore problems related to the existence of year- again at the speed of the local hardware clock until ly UTC time discontinuities known as "leap sec- the next rapport by setting m to 0 and (to ensure onds". (For a discussion of the differences between continuity of C) N to E- H', where H' is the value these two time references, see Kopetz and Ochsen- displayed by the hardware clock at the end of the reiter (1987)). We furthermore assume that the offi- amortization period. cial source of external time is reliable and that its signals are always available for reception by the The master-slave synchronization protocol radio receivers attached to master processors. The investigation of the issues related to maintaining The time service is implemented by a group of dis- synchronization in the presence of erroneous exter- tributed time server processes, one per correctly nal time signals or in the absence of such signals functioning processor in the system. The master constitutes a research topic in its own right. server running on the master processor M keeps the master logical clock CM within a maximum deviation em (external-master) of external (or real) Continuously adjustable clocks time. A slave server S keeps its logical clock C Some processor architectures enable the speed of within a maximum deviation ms (master-slave) a hardware clock to be changed by software while from the master clock. In this way the maximum others do not. Since the former make clock man- deviation e s of a slave from external time will be agement dependent on the particular commands era+ms and the maximum relative deviation of available for changing clock speeds, in this paper two slaves will be s s = 2 ms. we chose to discuss the latter alternative. To com- Since the protocol used for synchronizing a pensate for the fact that the speed of a hardware master clock to the clock of an attached self-check- clock /-/ is not adjustable, a logical clock C with ing receiver is similar to that used for synchroniz- adjustable speed is implemented in software. The ing a slave clock to a master clock, we only describe value of C is defined as the sum of the local hard- the latter in detail. The main difference between ware clock H and a periodically computed adjust- the two protocols lies in the variability of observed ment function A: round trip delays. While a variability of the order of milliseconds is reasonable for master slave com- C(t)=-H(t)+A(t). munications, variabilities much smaller can be To avoid logical clock discontinuities (i.e., jumps) achieved for the communication between a master A must be a continuous function of time. For sim- time server and the self-checking receiver attached plicity we consider only linear adjustment func- via a dedicated bus (for instance by ensuring that tions the master server does not relinquish control of the master CPU during a receiver clock reading A(t)=m,H(t)+N, attempt). By formula (8) this yields a very high where the m and N parameters are computed per- receiver clock reading precision. If this high read- iodically as described below. If, at local rapport ing precision is supplemented by the adoption of time L, a slave estimates that the master clock dis- a high master clock resynchroniza~ion frequency, plays time M, M+L, the goal is to increase (if the e m constant can be kept so small that it is M > L) or decrease (if M < L) the speed of the slave reasonable to assume in what follows that a master clock C so that it will show time M +~ (instead clock runs at the same speed as the external time. of L + c~) ~ clock time units after rapport, where The absence of master drift, the fact that for is a positive clock amortization parameter. Since current local area network technology round trip at the beginning and end of the amortization peri- delays smaller than 10 s are the rule, and that a 152 F. Cristian: Probabilistic clock synchronization drift rate p of the order of 10 -6 or less makes We show later that if amortization ends before a terms of the form dp-where d is a round trip delay- next attempt at rapport, (11') is satisfied. insignificant, allows us to simplify the formulae (6)- Since the slave clock continues to drift from (9) as follows. When a slave S receives a successful the master clock after amortization ends, it follows round trip of length 2D from the master M, the that for any t>t~, the distance between C and CM master clock CM is in the interval [-T+min, T can be as large as e + p t. To keep C and CM within + 2 D - mini : ms of each other, i.e., CM (t) e [ T+ rain, T+ 2 O -- mini. (6') e + pt 2U, i.e., a slave knows whether DNA=(1-p) dna=p-l(1-p)(ms-e)-kW.. (15') its previous reading attempt has succeeded when it is time to try again reading. If during an attempt Note that the time interval which elapses between to reach rapport all k reading attempts fail, S must a rapport and the beginning of the next attempt leave the group of synchronized slaves (such a de- at rapport is variable, since it is a function of the parture can be followed by a later rejoin). Consider round trip delay 2D observed at the last rapport. now that one of the reading attempts results in If D is close to min, the tight synchronization pre- a round trip delay 2D<2U allowing S to reach cision achieved allows the delay to the next attempt rapport with M. At rapport, the speed of the slave at rapport to be as long as: logical clock C is set according to the equations DNAmax = p- ~(1 - p) ms - kW. (16') (A) for the next t~ real time units, (1- p) ~ _< t~ _< (1 +p)c~, so that during amortization, say t real time When rapport is achieved with a round trip delay units after rapport, O -1. For e+pt~ <_ms. (11') this, it is sufficient to chose ~ greater than ms + U F. Cristian: Probabilistic clock synchronization 153

-min (see (23') for more details). Since the amorti- slave deviation ms of i millisecond. The minimum, zation parameter e is positive average, and maximum delays between successive synchronization are 63, 67, and 108 seconds, re- 0_ U-min +ok(1 +p) W. (20) greater than 1-10 -9 by sending on the average ff Thus, for a given choice of the U, k, and W con- = 4 messages every 1.11 minutes. stants, the smallest master slave maximum devia- A more conservative choice of 2 U'= 5.2 milli- tion that can be achieved is seconds, yields a probability p' of an unsuccessful round trip of 0.05 and an average number g' of mSmi n = U -- min + p k (1 + p) W. (21') messages per successful rapport of 2.1. For this p', For aggressive risky algorithms for which U is to achieve a probability of successful rapport close to min, maximum deviations as small as greater than 1-10 -9, k must be chosen to be at p k (1 + p)W can be achieved at the expense of many least 7 (i.e., ((0.05)7< 10-9)). Since for this choice synchronization messages (recall p is of the order of U', the reading error is 0.98 milliseconds, we of 10-6). For sure "deterministic" algorithms for settle for the goal of achieving a maximum master which U is close to an assumed maximum delay slave deviation of ms' =2 milliseconds. Assuming max, we get synchronization precisions slightly as in the previous example p = 6. ! 0- 6, and W= 2 worse than max-min with only two messages per seconds, we find that to achieve the 2"milliseconds synchronization, a result comparable to the pre- maximum deviation, a slave must on the average cisions achievable by previously published deter- spend 2.1 messages to reach rapport with a master ministic synchronization algorithms (Cristian et al. every 231 seconds (3.85 minutes). The minimum 1986; Dolev et al. 1984; Lamport 1987; Lundelius and maximum delays between successive resyn- and Lynch 1984; Lamport and Melliar-Smith chronizations are 230 and 273 seconds, respective- 1985; Schneider 1987; Srikanth and Toueg 1987). ly. The clock reading method described naturally The above example precisions compare favora- tolerates communication failures: up to k-1 suc- bly with the best precision of at most 44.47 milli- cessive performance failures can be masked if they seconds achievable by the deterministic synchroni- are followed by a successful rapport. The existence zation algorithms described in Cristian et al. (1986), of variable delays between successive slave syn- Dolev et al. (1984), Lamport and Melliar-Smith chronizations is a useful property, since it will tend (1985), Lundelius and Lynch (1984), Marzullo to uniformly spread the synchronization traffic (1984), Schneider (1987), Srikanth and Toueg generated by independent slaves in time. (1987).

Performance: two numerical examples Extensions To illustrate the synchronization precisions achiev- In this section we relax two of the assumptions able by our time service, we analyze in this section made earlier: the existence of a continuously avail- its performance in the context of a simple system able master processor and the existence of reliable of two 4381 processors (Dong, private communica- clocks that never fail. We also mention how to tion, June 1988), assuming one plays the role of handle slave failures, how one can improve syn- master and the other one the role of slave. chronization accuracy by taking advantage of past If we chose 2 U to be the median round trip local clock drift statistics and how to adapt to a delay 2 U =4.48 milliseconds, the probability p of variable system load. an unsuccessful round trip is 0.5. By (12) this yields an average number ff of messages per successful rapport of ti--4. Assuming that a probability of Dealing with master server failures losing synchronization of 10 .9 is acceptable, we find that at least k = 30 successive attempts at rap- With a unique master server, the master time ser- port should be allowed ((0.5)3o< 10-9). Assuming vice fails when the unique process implementing a worst case drift rate of p=6.10 -6, a waiting it fails. The probability of a master service failure time constant between successive reading attempts can be reduced if the service is implemented by W of 2 seconds, formulae (16') and (17') indicate a group of redundant master servers, all synchro- that it is possible to achieve a maximum master nized within e m of external time. There are several 154 F. Cristian: Probabilistic clock synchronization ways in which such a group can be structured. We Detecting clock failures mention three alternatives. Let UM, min~t be the parameters of the probabilis- Active Master Set. In this arrangement each slave tic algorithm run by a master M to synchronize multicasts ("time-- ?") requests to all masters, each its clock CM to the clock CR of its attached receiver master answers each time request, and slaves pick R. The maximum difference which can exist be- up the first answer that arrives. With such a strate- tween C~t and M's estimate of CR at rapport is gy, synchronized slaves will stay within e s = e m + m s of external time, but since some slaves maxadj~=e~ + em. (22') might be synchronized to one master, and some The first term eM = Uu-minM represents the maxi- others to another, the relative deviation among mum error in reading CR while the second term slaves ss becomes 2es, instead of 2ms as before. accounts for the maximum distance which might This solution leads to an increase in message cost: exist between CR and C~t at rapport. If a previously 2m messages per attempt at rapport, where m is synchronized master M detects at rapport that the the number of members in the master group. Note, distance between Cu and its estimate of C R is however, that if all processors are on a broadcast greater than maxadj~t, a master clock failure has local area network, this number can be reduced occurred (recall our assumption that the source of to m+l. external time signals is reliable and that receivers Ranked Master Group. To reduce the message cost, are self-checking). Upon detecting the failure of its one can use a synchronous membership protocol local clock, a master server leaves the active master (Cristian 1988) to rank the group of active masters group after reporting the failure to an operator. into a primary synchronizer, back-up, and so on. Similarly, if a master M and a slave S are cor- With such an arrangement slaves would send their rect and synchronized within ms, the maximum requests only to the primary. This results in a mes- difference at rapport between the clock C of the sage cost per attempt at rapport of 2. Let C be slave and the slave's estimate of CM is the upper bound on the failure detection time guar- ma xadjs=(U- min) + ms + 2em. (23') anteed by the synchronous master membership protocol (C is a function of the timeout delay maxp The last term is added because S can successively chosen for detecting communication failures synchronize to different masters that are 2 e m apart (Cristian 1988)) and let A be the time needed to from each other. A slave which at rapport detects inform the slaves that all subsequent time requests that its clock is more than maxadjs apart from should be sent to a new master. If W is chosen a master, detects either a master or a local clock greater than C+A, a slave cannot distinguish be- failure. Assuming that masters synchronize more tween a master failure and an excessive synchroni- frequently than slaves, if a master clock failure has zation message delay, so the maximum deviations occurred, the master will detect the failure before es and ss stay as before, i.e., es=em+ms, ss=2es. the next rapport with the slave. Thus, a slave de- If Wis chosen smaller than C+A, then one has tects a local clock failure if it observes twice in to adopt a higher upper bound on the maximum a row that its distance to the same master clock number of successive attempts at rapport and the is greater than maxadjs. Upon detecting the fail- analysis of the probability of achieving rapport be- ure of its clock, a slave leaves the group of synchro- comes slightly more complex. nized time servers after reporting the failure to an operator. Active Master Ring. A third solution would use a master membership protocol to order all active masters on a virtual ring. To send a time request Better bounds on the actual drift rate a slave chooses an active master at random. If no of a slave answer arrives within 2 U, the slave asks the next active master on the ring, and so on. The message The delay between successive resynchronizations cost of each attempt at rapport is 2 as in the of a slave (see (15')) can be increased by using better Ranked Master Group case, but the maximum de- lower and upper bounds on the actual drift rate viations es and ss stay the same as in the Active PA of a slave's hardware clock than the manufac- Master Set architecture, irrespective of the relation turer specified lower and upper bounds -p, p. This between Wand C +A, where A corresponds in this will have the desirable consequence of decreasing case to the worst case time needed to inform all the number of messages which must be sent per slaves of a master group membership change. slave per time unit for synchronization purposes. F. Cristian: Probabilistic clock synchronization 155

It is possible to compute estimates of the actual of the interval [pro,x, pmi,] : drift rate PA as well as upper and lower bounds p,_ +pm..). for Pa under a variety of different assumptions. For example one might assume that PA is a constant, One can easily verify that or the PA is a time varying function possessing a lim (pmin-pm"x)=0 and hence: lim (pA-pi)=0. first derivative that is bounded by constant from i~oo i~oo above and below, etc. In what follows we limit our- The successive estimates p~ of the actual drift rate selves to discussing the problem of approximating PA can be used to define a sequence of "virtual" Pa and determining lower and upper bounds for hardware clocks VHi-(1-p~)H, such that when it under the assumption that it is a constant. i increases to oo the speed of VHi converges to- By definition wards I. Indeed, the drift rate vp~ of VH i is vp~ Hi -- Ho =(PA-Pi), and since Pi converges towards Pa, it PA-- ] t~-- t o follows that vp~ converges towards 0. For this rea- son, we call the sequence of clocks VHi self-adjust- where t i denotes the real time at which the i'th ing. rapport is achieved, i=1, ..., and Hi=H(ti) de- Self-adjusting clocks not only improve the ac- notes the value of the slave hardware clock at the curacy with which delays can be measured, but i'th rapport. A slave can read Hi but cannot read can also cause drastic reductions in the clock syn- t~. It is however possible to estimate t~ as being chronization related message traffic. Define the i'th the value displayed by the master clock at rapport: self-adjusting logical slave clock Ci as being (1 Mi= TiA-Di, where T~ is the clock value sent by + m)VHi + N during the amortization period fol- the master and Di is half the round trip delay ob- lowing the i'th rapport, and as VH~ between the served at the i'th rapport. In estimating h by M~, end of the amortization and the i+ l'th rapport. a slave makes an error of at most ei=D~-min, To compute the delay between successive resyn- that is: chronizations of C i we need to know an upper Mi -- ei <_ ti <-- Mi -t- e i . bound vpi for Ivpil. This can be computed as fol- lows. Since pmi, < Pa --< pp,X, it follows that The above set of inequalities (i= O, 1.... ) implies P~""- Pi <-- P A-- Pi <-- pm,X__ Pi" Hi -- Ho Hi-- Ho 1< 1 Thus, Mi -- Mo + ei + eo ti -- to < Hi - Ho 1. [PA--Pi[- max {[P~ in --Pi[, [P~n~x--Pil} = m~. -- - Mi-- Mo - ei - eo Pi ) = vPi" This leads us to define the i'th upper and lower By substituting vpi for p in (15'), we get: bound drift estimates pp"~ and ppi, as: DNA,i, = 2 (p?,X _ p~nin)- 1 { Hi-Ho 1}, (1 -l(pm~x-pmi~))(ms-eO-kW. (16") pmax = min p~_,x, Mi - Mo - ei -- eo Note that since pmi. pp~X converges towards 0 {. Hi-Ho 1} as better estimates of the actual drift rates become ppin = max PF:3, Mi - M0 + ei + eo known, the delays between successive resynchroni- zations can increase apparently without bound. In where p,~,X=p and p~in= _p. These bounds on practice these delays have to stay bounded if one PA can be used to compute a longer delay to the wants to ensure upper bounds on clock failure de- next attempt at achieving the i + l'th rapport by tection times by tests such as (23'). To determine using the formula (15") below instead of (15'): the new resynchronization delays corresponding to D NA,i = fi~ 1 (1 -- #i) (ms -- ei) -- k W, the use of self-adjusting clocks, one might also want to take into account terms of the order of p2 or where fii=max{lp~'i"l, ]p~"Xl}. (15") smaller, ignored (for simplicity reasons) in this paper. Self-adjusting logical clocks Since for each i, we know upper and lower bounds Dealing with slave server failures pm~X, pmi. on the actual drift rate p~, it is natural A slow slave, which takes too long to read its mes- to define the i'th estimate Pi of PA as the midpoint sages or to time out, eventually discovers that the 156 F. Cristian: Probabilistic clock synchronization distance between its clock and a master clock has one to achieve precisions better than the best pre- become unacceptable when it evaluates the test cision bound (max-min)(1-1/n) guaranteeable (23'). Early slave timing failures (e.g., caused by fast by previously published deterministic algorithms slave timers) lead to an increase in network syn- (Cristian et al. 1986; Dolev et al. 1984; Lamport chronization traffic. The occurrence of such failures 1987; Lundelius and Lynch 1984; Lamport and can be detected by the master group, if the masters Melliar-Smith 1985; Lundelius-Welch and Lynch keep track of the last time each slave has asked 1988; Schneider 1987; Srikanth and Toueg 1987). for the time. If a slave asks too often, the masters (Specialized hardware can be used to reduce the could simply ignore it. The faulty slave will then difference between max and min (Kopetz and Och- fail to synchronize and will eventually leave the senreiter 1987), but the inherent limitation of deter- group of synchronized servers. ministic protocols remains unchanged.) When in- dexed by a conservative parameter U, such as a communication failure detection timeout delay Adapting to changing system load maxp, our external clock synchronization algo- Another extension consists in making the choice rithm also achieves a relative deviation at rapport of the acutal U-indexed slave synchronization algo- 2(maxp-min) smaller than the best precision rithm used in a system at a given moment depen- 4(maxp-min) achievable by previously published dent on the system load. We sketch here a possible algorithms based on communication failure detec- way to take into consideration system load when tion timeouts (Gusella and Zatti 1987; Marzullo deciding on a round trip acceptability threshold 1984). One of the key observations of this paper U. The intention is that U should increase when is that no relation needs to exist between clock load increases and should decrease when load de- synchronization and communication detection ti- creases. This can be achieved in the following way. meout delays. Synchronization algorithms indexed If at least one slave experiences k' U) known in advance. The masters could clocks and local area networks. One can envisage then agree on this and diffuse a decision that begin- that by estimating actual clock drift rates and using ning with some time in the future everybody has self-adjusting clocks one could achieve precisions to switch to the new round trip acceptability even better than the above bound. threshold U' and, hence, to bigger master-slave Besides improving synchronization precision, maximum deviations. The effect will be to increase the new approach has other properties worth men- the maximum master slave clock deviation when tioning. Since a probabilistic approach does not the system load increases. assume an upper bound on message transmission To decrease the maximum clock deviation delays, it can be used to synchronize clocks in all when load is light, a slave processor could commu- systems, not only those which guarantee an upper nicate to the master group the fact that it measures bound on message delays. A probabilistic time ser- round trip delays that are consistently smaller than vice such as the one sketched previously distributes U", for some U" < U known in advance. If the mas- uniformly the clock synchronization traffic in time, ters receive such messages from all slave proces- avoiding the periodic synchronization traffic bursts sors, they could diffuse the information that begin- produced by the existence of synchronization ning with some time in the future, everybody points in previously known synchronization algo- should decrease their round trip acceptability rithms. The service is simple to implement (see Ap- threshold to U". The effect will be to decrease the pendix) and robust. Likely process and communi- master slave clock deviation when the system load cation failures are tolerated. Clock failures are de- decreases. tected and processes with faulty clocks are shut down. Finally the time service described is efficient: it uses a number of messages that is linear in the number of processes to be synchronized. Conclusions While deterministic synchronization protocols A new probabilistic approach for reading remote always succeed in synchronizing clocks, the proba- clocks was proposed and illustrated by presenting bilistic approach proposed in this paper carries a synchronization algorithm that achieves external with it a certain risk of not achieving synchroniza- clock synchronization. The new approach allows tion. In view of the impossibility result of Lundelius F. Cristian: Probabilistic clock synchronization 157 and Lynch (1984) that deterministic clock synchro- References nization algorithms cannot synchronize the clocks Cristian F (1988) Reaching agreement on processor group mem- of n processes closer than (max-min)(1- i/n), this bership in synchronous distributed systems. 18th Int Conf seems to be an unavoidable price for wanting to on Fault-Tolerant Computing, Tokyo, Japan (June 1988) achieve a higher precision. As the desired precision Cristian F, Aghili H, Strong R (1986) Approximate clock syn- becomes higher, more messages are sent and the chronization despite omission and performance failures and processor joins. 16th Int Syrup on Fault Tolerant Comput- risk of not achieving synchronization becomes ing, Vienna, Austria (June 1986) higher. Conversely, as the desired precision be- Cristian F, Aghili H, Strong R, Dolce D (1985) Atomic broad- comes lower, the risk of not achieving synchroniza- cast: from simple message diffusion to Byzantine agreement. tion becomes lower and fewer messages need to 15th Int Syrup on Fault-Tolerant Computing, Ann Arbor, be sent. Actually, the U-indexed family of master- Michigan (June 1985) Dolev D, Halpern J, Simons B, Strong R (1984) Fault-tolerant slave synchronization algorithms presented clock synchronization. Proc 3 rd ACM Symp on Principles achieves a continuum between, at one end, sure "de- of terministic" protocols (indexed by large Us close Gusella R, Zatti S (1987) The accuracy of the clock synchroniza- to maxp or max) that achieve poor precision with tion achieved by tempo in Berkeley Unix 4.3BSD. Rep UCB/CSD 87/337 a high probability and a small number of messages, Kinemetrics/Truetime (1986) Time and Frequency Receivers, and "aggressive" protocols (indexed by small Us Santa Rosa, California close to min) capable of achieving very high pre- Kopetz H, Ochsenreiter W (1987) Clock synchronization in dis- cision but which carry with them a significant risk tributed real-time systems. IEEE Trans Comput 36:933-940 of not achieving synchronization even when sub- Lamport L (1987) Synchronizing time servers, TR 18. DEC Systems Research Center, Palo Alto, California (June 1987) stantial numbers of messages are exchanged. In Lundelius J, Lynch N (1984) An upper and lower bound for practice, one needs to choose a parameter U that clock synchronization. Inf Control 62:191~204 achieves the right balance between precision and Lamport L, Melliar-Smith M (1985) Synchronizing clocks in message overhead, and reduces the risk of losing the presence of faults. J Assoe Comput Mach 32:52-78 Lundelius-Welch J, Lynch N (1988) A new fault-tolerant algo- synchronization to a level that is acceptably small. rithm for clock synchronization. Inf Comput 77:1 36 The new view cast on clock synchronization Marzullo K (1984) Maintaining the time in a distributed system. in this paper prompts a number of questions which Xerox Rep OSD-T8401 (March 1984) can lead to further research. We mention here sev- Schneider F (1987) Understanding protocols for Byzantine eral. How is it possible to improve the accuracy clock synchronization. TR 87-859, Cornell University (Au- gust 1987) of some of the internal clock synchronization algo- Srikanth TK, Toueg S (1987) Optimal clock synchronization. rithms surveyed in Schneider (1987) by using pro- JACM 34:626 645 babilistic remote clock reading methods? What lower bounds exist for probabilistic synchroniza- tion algorithms? How can the accuracy of clock synchronization be improved if one knows the dis- Appendix tribution obeyed by message delays? How can one estimate bounds on the actual drift rate of hard- Detailed description of the master slave protocol ware clocks if this rate is a function of time? And A detailed description of a slave time server under the simplify- finally, how can one design algorithms which adapt ing unique master assumption is given in Figs. 2 and 3. To simplify this presentation, we do not give a detailed description to variable system load? of the master time server, since it is similar to a slave server and we do not use self-adjusting logical slave clocks. In what Acknowledgements. The idea that the randomness inherent in follows, we refer to linej of Fig. i as (i.j). message transmission delays can be used to improve clock syn- The round trip acceptability threshold U, the maximum number chronization precision originated while the author was working of successive reading attempts k, waiting time between reading with John Rankin and Mike Swanson on problems related to attempts W,, the amortization delay c~, and the maximum devia- synchronizing the clocks of high end IBM processors in Pough- tion ms are parameters of the protocol (2.1). Once U is chosen, keepsie, NY, during February 1986. In the summer of 1987, the probability p that, under worst case load conditions, a round John Palmer independently proposed a similar clock synchroni- trip delay is greater than 2 U is also determined. The constant zation protocol for the high end Amoeba system prototype k must be chosen to ensure that the probability pk of observing under development at the Almaden Research Center. The round k successive round trip delays greater than 2 U is acceptably trip delay measurements used in this paper were performed small (typically two or more orders of magnitude smaller then by Margaret Dong in the Amoeba system. Discussions with the instantaneous crash rate of the underlying processor). We Danny Dolev, Frank Schmuck, Larry Stockmeyer, and Ray assume that the constant Wis chosen greater than the accept- Strong helped improve the contents and the form of this exposi- able round trip delay 2 U. tion. The research presented was partially sponsored by IBM's The slave protocol uses three timers (2.5): a "Synch" timer Systems Integration Division group located in Rockville, Mary- for measuring delays between successive synchronization at- land, as part of a FAA sponsored project to build a new air tempts, an "Attempt" timer for measuring delays between suc- traffic control system for the US. cessive master clock reading attempts, and an "Amort" timer 158 F. Cristian: Probabilistic clock synchronization

1 task Slave(U:Time, k:Int, W,c~,ms:Time); chronized" is false (3.4). To prevent any confusion between mes- 2 const min: Time; sages pertaining to old and current clock reading attempts in 3 master: processor; the presence of performance failures each message is identified 4 var T,T',T",D,N: Time; try: Integer; m: Real; by the value of the local hardware clock H when the message 5 Synch,Attempt,Amort: Timer; is sent. This ensures that the identifiers of messages are unique 6 synchronized: Boolean; H: hardware-clock; with extremely high probability. For example, if we assume a typical hardware clock of 64 bits, the probability that some 7 synchronized~-false; Synch.set(0); "old" message still existing undelivered in the communication 8 cycle network because of a performance failure - is confused with 9 when lreceive("tirne?") do send-local-time; a recently sent message by the test (2.14) is smaller than 5 10 when Synch.timeout x 10-x8, even if we make the unrealistic assumption that mes- 11 do try~l; T'~H; sages can stay undelivered in the network for all the time needed 12 send("time = ?",T') to master; Attempt.set(W); to wrap around a clock: approximately 58 centuries for a clock 13 when receive("time = ",ET,T") whose low level bit is incremented every micro-second. Another 14 do/fT'+T" then iterate fi; variable "try" counts the number of unsuccessful master clock 15 T~H; D~(T-- T')/2; reading attempts (2.4). 16 /fD>U then iterate fi; After initializing the "synchronized" local variable, an at- 17 Attempt.reset; compute-adjustment; tempt to synchronize with the master is immediately scheduled 18 Synch.set(p-l(l-p)(ms+min-D)-kW); (2.7). When the Synch.timeout event occurs (2.10), the counter 19 when Attempt.timeout for unsuccessful attempts "try" is initialized and a message iden- 20 do/f try_>k then synchronized~false; "leave" f/; tified by the local hardware clock value T' is sent to the master 21 try~try + 1; T'~H; (2.12). The master responds to this message by sending its logical 22 send(" time = ?",T') to master; Attempt.set(W); clock value (4.3). To simplify our presentation, we assume that 23 when Amort.timeout do moO; N~N--H; a master clock is always synchronized to external time (the 24 endcycle absence of synchrony with external time at the master can be handled in a manner similar to the absence of synchrony with Fig. 2 the master at a slave). When a master response for measuring amortization periods. There are two kinds of 1 task Master; operations that are defined on a timer T: set and reset. The 4 var N: Time; m: Real; H: Hardware-clock; meaning of a 2 cycle 3 when receive ("time= ?", T)from s 1 procedure send-local-time; 4 do send("time= ",H+N+m.H,T) to s 2 /f synchronized 5 endcycle; 3 then lsend(" time = ",H + N + m * H) 4 else lsend(" undefined") Fig. 4 5 fi; arrives (2.13), if the received and locally saved message identi- 6 procedure compute-adjustment; fiers match (2.14) the message is accepted, otherwise it is dis- 7 if synchronized carded (the iterate command terminates the current iteration 8 then if-n okadj then synchronized*-false; "leave"fi; of the loop (2.8-2.24) and begins a new iteration). Unacceptably 9 m ~((ET + D)-- (T + N + m T))/e; long round trips are discarded (2.16). Unsuccessful reading at- 10 N~N--m*T; Amort.set(c0; tempts cause Attempt.timeout events (2.19). Indeed, the "At- 11 else m~0; N~(ET+D)-T; tempt" timer is set by each new attempt at reaching rapport 12 synchronized~true; (2.12, 2.22) and is reset only when rapport is reached (2.17). 13 fi; If k successive unsuccessful attempts occur (2.20), a slave can Fig. 3 no longer be sure that its clock is within ms from the master clock and must leave the group of synchronized slaves. Such a departure can be followed by a later rejoin. T.set(6) invocation is "ignore all previous T.set invocations and Consider now that a matching answer which arrives in signal a T.timeout event 6 clock time units from now". The less than 2 U time units leads to a successful rapport and causes meaning of a T.reset invocation is "ignore all previous T.set the "Attempt" timer to be reset (2.17). If the slave logical clock invocations". Thus, if after invoking T.set(100) at local time was not previously synchronized, it is bumped to the estimate 200, a new T.set(100) invocation is made a local time 250, there (7') of the master time (3.11). If the slave was previously synchro- is no T.timeout event at time 300. If no other T.set or T.reset nized, and the adjustment to be made passes the resonableness invocation is made before time 350, a T.timeout event occurs test "okadj" (3.8) defined in (23'), the speed of the local logical at local time 350. clock C is set so as to reach the slave's estimate of the master A local Boolean variable "synchronized" (2.6) is true when clock within ~ clock time units (3.9 3.10) following equation the local clock is synchronized to the master clock and is false (A). After amortization ends (2.23) the logical slave clock is when the local clock can be out of synchrony with respect to let again to run at the speed of the hardware clock until the the master clock. The local logical time is undefined when "syn- next rapport (2.17).