Enforcing Determinism for the Consistent Replication of Multithreaded CORBA Applications

P. Narasimhan, L. E. Moser, P. M. Melliar-Smith

Department of Electrical and Computer Engineering

University of California, Santa Barbara, CA 93106

[email protected] , [email protected], [email protected]

This research has been supported by the Defense Advanced Research Projects Agency in conjunction with the Office of Naval Research and the Air Force Research Laboratory, Rome, under Contracts N00174-95-K-0083 and F3602-97-1-0248, respectively.

Abstract

In CORBA-based applications that depend on object replication for fault tolerance, inconsistencies in the states of the replicas of an object can arise when concurrent threads within those replicas perform updates in different orders. By imposing a single logical thread of control on every replicated multithreaded CORBA client or server object, and by providing deterministic scheduling of threads and operations across the replicas of each object, the Eternal system achieves consistent object replication. The Eternal system does this transparently, with no modification to the application, the ORB, or the concurrency model employed by the ORB.

1 Introduction

Distributed object systems, based on standards such as the Object Management Group's Common Object Request Broker Architecture (CORBA) [9], provide applications with valuable features such as language transparency, interoperability and location transparency. Unfortunately, most distributed object standards, including CORBA, make no provision for other desirable features, such as fault tolerance, that current and future CORBA-based applications will need. The Object Management Group (OMG) has recognized the need to provide fault tolerance for CORBA applications by issuing a Request for Proposals (RFP) [10] for fault-tolerant CORBA.

A major requirement of the OMG's RFP is that strong replica consistency must be maintained as operations are performed that change the states of the replicas. However, strong replica consistency requires that the behavior of the application objects is deterministic. Unfortunately, many practical CORBA applications use nondeterministic mechanisms such as local timers, processor-specific functions and, in particular, multithreading.

The Eternal system [7] addresses this challenge in its provision of fault tolerance for CORBA applications, while ensuring low overheads, transparency and interoperability. To ensure strong replica consistency, the Eternal system exploits reliable totally ordered multicasts for the communication of operations between replicated objects, and implements mechanisms for the detection and suppression of duplicate operations and the transfer of state to new, recovering and backup replicas. This paper focuses on the mechanisms that the Eternal system employs to guarantee consistent replication in the face of one specific source of nondeterminism, namely, multithreading in the ORB or the application.

2 Replicating Multithreaded Applications

Many commercial ORBs are multithreaded, and multithreading can yield substantial performance advantages. Unfortunately, the specification of multithreading in the CORBA 2.2 standard [9] does not place any guarantee on the order of operations dispatched by a multithreaded ORB. In particular, the specification of the Portable Object Adapter (POA), which is a key component of the CORBA standard, provides no guarantee about how the ORB or the POA dispatches requests across threads. The ORB may dispatch several requests for the same object within multiple threads at the same time.

In addition to ORB-level threads, the CORBA application may itself be multithreaded, with the thread scheduling having been determined by the application programmer. The application programmer must ensure correct sequencing of operations and must prevent
thread hazards. Careful application programming can ensure thread-safe operations within a single replica of an object; however, it does not guarantee that threads and operations are dispatched in the same order across all of the replicas of the object. Making the application programmer responsible for concurrency control and ordering of dispatched operations in replicated objects is unacceptable for maintaining strong replica consistency in a fault-tolerant system.

Several different concurrency models [13] are supported by current commercial ORBs. These include thread-per-request, where one thread is spawned for each new invocation on an object, and thread-per-object, where a single thread executes all invocations on an object. Most practical CORBA applications consist of servers that contain multiple objects, each having possibly multiple threads. Objects that are co-located within the same server process may access and update shared data; thus, irrespective of the threading model used by the ORB, multiple threads may exist within each server. Because a server may itself assume the role of a client, we do not distinguish between the problem of multithreading for clients and servers.

We use the term MT-domain to refer to any CORBA client or server that supports multiple application-level or ORB-level threads that may access shared data and that contains one or more CORBA objects. The MT-domain abstraction is independent of the concurrency model of the ORB, the role of the MT-domain as a client or server, and the commercial ORB that hosts the MT-domain.

In Sections 2.1 and 2.2, we provide examples that illustrate how replica inconsistency can arise for active and passive replication of MT-domains. We assume that all of the mechanisms that guarantee replica consistency for single-threaded objects are present, i.e., messages are delivered to the ORB in a reliable totally ordered sequence, duplicate operations are detected and suppressed, and state transfers are provided for recovering replicas. While these mechanisms suffice to guarantee replica consistency for single-threaded objects, they serve only to facilitate, but not to guarantee, replica consistency when either the ORB or the application is multithreaded. In particular, reliable totally ordered multicasts ensure only that the ORBs that host the various replicas receive the same messages in the same order. They do not guarantee that the ORBs will dispatch these incoming messages onto the threads of the replicas in the same order.

We assume further that other sources of nondeterminism, e.g., system calls such as local timers that return processor-specific information, are handled by other mechanisms. We also assume that all replicas of the application are located on the same type of platform, i.e., same vendor's ORB, same operating system, same type of workstation (same processing speed, same amount of memory, etc.). In addition, we assume deterministic behavior of the operating system. Thus, any replica inconsistency that arises can be attributed to the multithreading of the ORB or the application.

2.1 Active Replication

For active replication, strong replica consistency means that, at the end of each operation invoked on the replicas, each of the replicas of an object has the same state. The example shown in Figure 1 illustrates the problem of replica inconsistency when the only support for consistent replication is that provided for single-threaded ORBs.

In the figure, R1 and R2 are active replicas of the same MT-domain. The ORB at each replica receives messages in the same order, but dispatches two threads T1 and T2 to perform operations on the replicas. The two threads are simultaneously active within each replica, and can access and update shared data. Suppose that threads T1 and T2 issue update operations A and B, respectively, on the shared data and that operation B (A) is executed before operation A (B) in replica R1 (R2). Even if the MT-domain is programmed to avoid race conditions and other thread hazards, the order of operations in the two replicas of the MT-domain may differ and, thus, their states may be inconsistent at the end of the update operations.

Replica inconsistency can also arise when the ORB initially dispatches only a single thread, which later spawns other threads within the same replica.

Figure 1: Inconsistency with active replication of multithreaded objects. (Diagram: replicas R1 and R2 of an actively replicated object, each with threads T1 and T2 updating shared data above a multithreaded ORB that receives reliable totally ordered multicast messages; update B precedes A in R1, while A precedes B in R2.)

2.2 Passive Replication

For passive (primary-backup) replication, strong replica consistency means that, at the end of each state transfer, each of the replicas of an object has the same state. The example shown in Figure 2 illustrates the problem of replica inconsistency for passive replication.

In the figure, MT-domain C is passively replicated, with primary replica C2 and backup replica C1. Objects A, B, D and E (not shown), with which C interacts, are not necessarily multithreaded or replicated. The example focuses on the passive replication of MT-domain C, and how C's multithreading results in the inconsistency of its replicas.

The invocation I_AC (I_BC) of object A (B) on MT-domain C requires thread T1 (T2) to be dispatched. If thread T1 (T2) is dispatched, MT-domain C will issue the nested invocation I_CE (I_CD) to object E (D). Thus, MT-domain C invokes two different nested operations through its two threads, and must obtain results
from both operations. Consider the following sequence shown in Figure 2:

1. The primary replica C2 is initially operational. The ORB hosting C2 dispatches thread T2 to handle invocation I_BC first.

2. Thread T2 issues a nested invocation I_CD on object D.

3. The primary replica fails before handling invocation I_AC. The backup replica C1 becomes the new primary replica for MT-domain C.

4. The ORB hosting the new primary replica C1 dispatches thread T1 to handle invocation I_AC first.

5. Thread T1 issues a nested invocation I_CE on object E.

6. Before the new primary's ORB handles invocation I_BC, object D returns the response R_CD to the old primary's nested invocation I_CD. The receiving ORB delivers this response to C1, which is unable to handle this response to the nested invocation (I_CD) that it has no knowledge of ever having issued.

The inconsistency arises precisely because of the nondeterministic behavior of multithreaded ORBs, and the lack of specification of the order of dispatch of the operations that such ORBs receive.

Figure 2: Inconsistency with passive replication of multithreaded objects when (a) the primary replica is initially operational, and (b) the primary later fails and the backup replica becomes the new primary replica.

3 Enforcing Determinism

A MT-domain is the basic unit of replication for multithreaded applications in the Eternal system. To preserve replica consistency for MT-domains, the Eternal system provides mechanisms that govern the order in which the threads and operations are dispatched within each replica of a MT-domain, over and above the total order in which the messages containing the operations are delivered to the ORB.

The Eternal system enforces deterministic behavior within a MT-domain by allowing only a single logical thread of control, at any point in time, within each replica of the MT-domain. Although multiple threads may exist in a MT-domain, all of them must be related to (and required for the completion of) the single operation that "holds" the logical thread-of-control. Furthermore, at most one of these threads can be actively executing; all of the other threads must be suspended or awaiting a response.

The Eternal system controls the dispatching of threads and operations within every replicated MT-domain, transparently both to the objects and threads within the MT-domain, and to the multithreaded ORB that hosts the MT-domain. To achieve this, the Eternal system employs a deterministic operation scheduler that is inserted into the address space of every replica of a MT-domain, and that maps incoming invocations to the thread-of-control within the replica, or enqueues unrelated invocations for later dispatch. The scheduler dictates the creation, activation, deactivation and destruction of threads, within the replica of a MT-domain, as required for the execution of the current operation "holding" the logical thread-of-control. The scheduler is inserted into a position such that it can override any thread or operation scheduling performed by either the nondeterministic multithreaded ORB within the replica, or by the replica itself.

Footnote: The scheduling of operations for replica consistency is orthogonal to the real-time scheduling of operations. The operation scheduler described here does not factor in any considerations for real-time operation.

Operations are mapped identically onto the logical thread-of-control at all of the replicas of a MT-domain, thereby ensuring that the same operation "holds" the thread-of-control at each replica at any logical point in time. To enable this, the Eternal system ensures that the scheduler at each replica of a MT-domain receives the same sequence of totally ordered messages containing invocations and responses destined for the MT-domain. Based on this incoming sequence of messages, the scheduler at each replica decides on the immediate delivery, or the delayed delivery, of the messages to that replica. At each replica of a MT-domain, the scheduler's decisions are identical and, thus, operations and threads within the MT-domain are dispatched identically at each replica.

4 Scheduling for Consistency

While the MT-domain model may seem somewhat restrictive in terms of the effective concurrency achieved in the application, those restrictions are necessary to achieve replica consistency for replicated multithreaded CORBA applications. To ensure a single logical thread-of-control within the MT-domain, the scheduler may delay or reschedule invocations and responses on a MT-domain. This is necessary because another operation can assume the MT-domain's logical thread-of-control only when the current operation within the MT-domain completes.

Thus, the scheduler enqueues, in the order of their arrival, all incoming invocations and responses that are unrelated to the current operation or the logical thread-of-control. The next operation to be scheduled on the MT-domain, upon release of the thread-of-control, is the first operation that has been enqueued or, in the absence of enqueued operations, the next operation that the scheduler receives.

Figure 3 shows a replica of a MT-domain, along with its operation scheduler, and the sequence of actions of the MT-domain's scheduler for the given totally ordered messages. All of the replicas of the MT-domain are forced to behave identically, as the example illustrates.

Figure 3: Sequence of actions of the operation scheduler at a replica of a MT-domain. (Diagram, panels (a) to (c), with the totally ordered messages A_i, B_i, C_i, C_r: 1. dispatch of invocation A_i; 2. invocation C_i due to A_i; 3. object A awaiting the response to invocation C_i; 4. response C_r dispatched to A; 5. invocation A_i completes and frees the thread of control; 6. dispatch of invocation B_i.)

The MT-domain in this example consists of two objects A and B, each capable of supporting a thread. Invocations A_i, B_i and C_i are destined for objects A, B and C, respectively (object C is in some other MT-domain not shown in the figure). The invocation A_i gives rise to a nested invocation C_i; the invocation B_i is independent of both A_i and C_i.

At the start of the sequence of actions in Figure 3(a), there is no operation executing in the replica of the MT-domain, and the thread-of-control is free to be assumed by the next operation. Thus, the operation scheduler delivers the invocation A_i (which occurs first in the total order of messages), leading to a thread executing within object A. The scheduler assigns the MT-domain's thread-of-control to the logical
operation represented by the invocation A_i until A_i completes.

The invocation A_i on object A leads to the subsequent invocation C_i on object C, and A_i can complete only when the response C_r to the invocation C_i is returned to object A. Because the thread-of-control has been assigned to A_i, the scheduler delivers to the MT-domain only those incoming invocations and responses that correspond to A_i. In this case, the only message in the total order that is related to A_i is C_r. Note that the invocation B_i is independent of A_i. Thus, although B_i has been received by the scheduler ahead of C_r in the total order of messages, it is not delivered to its target object B until the thread-of-control is released by A_i. To deliver only those invocations related to the thread-of-control, the scheduler requires some means of recognizing, and relating, the operations contained in the totally ordered messages. Identifiers for associating operations with the thread-of-control are discussed in Section 4.2.

Figure 3(b) shows the receipt of the response C_r by the MT-domain, and its delivery to object A, when the scheduler determines that its delivery is appropriate.

After processing the response C_r, object A completes the invocation A_i and the thread-of-control again becomes available. The scheduler also performs garbage collection of the threads that were used by the thread-of-control. The MT-domain's scheduler then reassigns the thread-of-control to the next operation, which is the invocation B_i. The MT-domain's scheduler then delivers the invocation B_i, leading to a new thread of execution within the target object B, as shown in Figure 3(c).

This delaying of invocation B_i in favor of the response C_r (although B_i precedes C_r in the total order of operations) does not itself introduce any inconsistency between the replicas of the MT-domain. The reason is that the operation scheduler at each of the replicas of the MT-domain arrives at the same scheduling decision regarding the delivery of C_r before B_i.

Replica consistency is thus maintained as a result of the deterministic behavior across all of the replicas of a MT-domain (through the totally ordered messages that they receive), as well as the deterministic behavior within each replica of the MT-domain (through the identical scheduling of distinct operations onto a single thread-of-control).

Replica determinism, unfortunately but inevitably, reduces the degree of concurrency within the application. However, if objects are indeed independent of each other, and do not share data, they can be assigned to different processes and Eternal's operation scheduler will schedule them concurrently without restriction. The memory protection between processes, provided by the operating system, ensures that the objects do not share data and are indeed independent.
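The Figure 3 walk-through can be illustrated with a small Python sketch that models one replica's operation scheduler as a queue plus a single logical thread-of-control. This is only a toy model under stated assumptions: the names `OperationScheduler`, `receive`, `complete` and the `related_to` tag are hypothetical and are not Eternal's interfaces, the outgoing nested invocation C_i is omitted, and A_i is assumed to complete as soon as C_r has been delivered.

```python
from collections import deque


class OperationScheduler:
    """Toy model of one replica's scheduler (hypothetical, not Eternal's API)."""

    def __init__(self):
        self.current = None    # operation holding the logical thread-of-control
        self.queue = deque()   # unrelated messages, kept in arrival order
        self.delivered = []    # order in which messages reach the replica

    def receive(self, msg, related_to=None):
        """Handle the next totally ordered message.

        related_to names the operation that this message belongs to
        (for example, a response to its nested invocation), or None.
        """
        if self.current is None:
            self._dispatch(msg)            # thread-of-control is free
        elif related_to == self.current:
            self.delivered.append(msg)     # deliver within the thread-of-control
        else:
            self.queue.append(msg)         # unrelated: delay until release

    def complete(self):
        """The current operation finished: release and reassign."""
        self.current = None
        if self.queue:
            self._dispatch(self.queue.popleft())

    def _dispatch(self, msg):
        self.current = msg
        self.delivered.append(msg)


def run_replica(total_order):
    """Feed one replica the total order and return its delivery order."""
    scheduler = OperationScheduler()
    for msg, parent in total_order:
        scheduler.receive(msg, related_to=parent)
        if msg == "C_r":   # assumption: A_i completes once C_r is processed
            scheduler.complete()
    return scheduler.delivered


# B_i precedes C_r in the total order, yet C_r is delivered first.
TOTAL_ORDER = [("A_i", None), ("B_i", None), ("C_r", "A_i")]
replica_1 = run_replica(TOTAL_ORDER)
replica_2 = run_replica(TOTAL_ORDER)
print(replica_1)
print(replica_1 == replica_2)
```

In this sketch both replicas consume the same total order and apply the same rule, so each delivers A_i, then C_r, then B_i, which mirrors the argument that delaying B_i does not by itself introduce inconsistency.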

Replica Y1 of MT-Domain Y

Active on AB Active on Invocation I1 C Invocation I2

Dispatch of Dispatch of Unique sequence number Invocation I1 for the totally ordered Invocation I2 message containing I1 Invocation I2 s1 through Eternal s1 s2 s1

s1 Eternal’s Eternal’s Operation Scheduler for Y Operation Scheduler for X1 1

Unique sequence number Totally ordered Totally ordered for the totally ordered message sequence message sequence message containing I2

Replica X1 of MT-Domain X s2 s1 Replica Y1 of MT-Domain Y

Identifier for Parent AB Operation of I2 Suspended on Active on Active on Invocation I1 Invocation I C 3 Invocation I2 Unique sequence number for the totally ordered message containing I3

s2 s1 s3 s2 s1 s s s1 Dispatch of Invocation I3 3 2 only if I ’s thread is suspended Invocation I3 1 through Eternal Identifier for Parent Operation of I3 Eternal’s Eternal’s

Operation Scheduler for X1 Operation Scheduler for Y1

Totally ordered

message sequence

Figure 4: Nested reentrant op erations on a MT-domain. Scheduling identi ers assigned by the op eration

scheduler enable the detection of reentrant op erations.

4.1 Reentrant Op erations within the MT-domain X . I requires that I com-

1 2

pletes, and I requires that I completes. Thus, the

2 3

In the example of Figure 3, the op eration A was not

i

invo cation I must b e allowed to pro ceed inside every

3

reentrant on the MT-domain b ecause the execution of

replica of X and return a resp onse to ob ject C within

A did not lead to further invo cations on the same MT-

i

every replica of Y . Because the scheduler ensures iden-

domain. A reentrant op eration may lead to multiple

tical b ehavior at all of the replicas of a MT-domain, it

nested invo cations on the same MT-domain, eachof

suces to consider replicas X and Y shown in the

1 1

whichmust b e delivered and executed in order for the

gure of the MT-domains X and Y , resp ectively.

op eration to complete. Thus, for such an op eration,

the single logical thread-of-control is realized through To p ermit a second invo cation I on a MT-domain

3

multiple threads within the MT-domain, at most one X that already has the invo cation I p ending a re-

1 1

of which is actively executing, while the others are sp onse, the scheduler for X needs to verify that the

1

susp ended or awaiting a resp onse. second invo cation is a descendant of the rst parent

Figure 4 shows the interaction b etween a replicated op eration and that the parent op eration is susp ended.

MT-domain X and a replicated MT-domain Y . Here, A descendant of a particular parent op eration is an

invo cation I on ob ject A, when dispatched to the log- invo cation that arises from the execution of the par-

1

ical thread-of-control in X , results in the invo cation I ent op eration, and that must b e allowed to execute for

2

of ob ject C in the MT-domain Y . The invo cation I the parent op eration to complete. A descendantinvo-

2

cation may b e reentrant. In this example, invo cations

leads to a further nested invo cation I on ob ject B 3

within the MT-domain. For instance, the scheduler I and I are descendants of the parentinvo cation I ,

2 3 1

at X detects that invo cation I is a descendantofI with I b eing reentrant on the MT-domain X .

1 3 1 3 1

due to the presence of I 's identi er s in I 's identi-

After the scheduler for X determines that I is

1 1 3

1 3

er s s s . Once the scheduler veri es that an op era-

a descendantofI and that the thread for I is sus-

3 2 1

1 1

tion is indeed a descendant, it waits for the currently

p ended, awaiting a resp onse in this case from ob ject C

executing thread-of-control within the MT-domain to

in Y , it pro ceeds to activate a thread to handle invo-

1

susp end itself, and then dispatches the descendant op-

cation I . If the thread executing I is not susp ended

3 1

eration.

b efore I is allowed to start executing, the states of

3

the ob jects within X may b ecome inconsistent due to

1

4.3 Scheduling

the multiple threads b eing active. The logical thread-

of-control within X is still asso ciated with the rst

1

To dispatch an op eration to the thread-of-control, or

invo cation I b ecause all other op erations within the

1

to delay an op eration that may lead to inconsistency,

concurrency domain are direct descendants of I . Once

1

each op eration scheduler for a MT-domain replica

the invo cation I completes and B returns a resp onse

maintains: 3

to the invoking ob ject C , the scheduler for X disp oses

1

 The scheduling identi er s , and semantics syn-

of the thread for I .

D

3

chronous or asynchronous, of the current op era-

A reentrant descendantinvo cation on a MT-domain

tion I b eing executed by the logical thread-of-

D

can generate further reentrant descendantinvo cations

control T within the MT-domain. The schedul-

D

that are also reentrant on the same MT-domain. Thus,

ing identi er s is used by the op eration scheduler

D

the MT-domain scheduler must maintain a stackof

to detect reentrantinvo cations. The scheduler

invo cations dispatched in the MT-domain. Every

compares the scheduling identifer of every incom-

reentrant descendantinvo cation is pushed onto the

ing op eration with s to determine any descen-

D

stack when it is dispatched onto the MT-domain, and

dants of I . If an incoming message is a descen-

D

p opp ed o the stack when it completes.

dantofI , or a resp onse to an invo cation issued

D

by the MT-domain, the op eration scheduler dis-

4.2 Scheduling Identi ers

patches it when all of the threads within the MT-

To handle nested reentrant op erations, the op eration

domain are susp ended, or awaiting a resp onse.

scheduler uses scheduling identi ers that allow the

 A dispatch queue Q of op erations invo cations,

scheduler to asso ciate parent and descendant op era-

op

resp onses and state transfer messages waiting to

tions at the MT-domain level. These scheduling iden-

b e assigned to the thread of control when it b e-

ti ers are internal to, and examined by, Eternal's op-

comes available. When the current op eration I

eration schedulers, and are never seen by the CORBA

D

completes, the op eration scheduler can dispatcha

application or the ORB.

new op eration from Q . Op erations that are not

op At the p oint that it dispatches an invo cation I onto

q

related to the thread-of-control in the MT-domain

the replica of the MT-domain that it controls, the op-

are enqueued in the total order in which the op-

eration scheduler assigns I the scheduling identi er

q

eration scheduler receives them. Op erations that

s s , where s is the sequence numb er of the message

q p q

are descendants of I are scheduled for execution,

D containing I , and s is the scheduling identi er of the

q p

in the order of their arrival, ahead of all op erations

parent, if any,of I .

q

that are not descendants of I .

D

For the example of Figure 4, the MT-domain sched-

ulers assign the identi ers s , s s and s s s to the

1 2 1 3 2 1

 A stack of the reentrant descendant op erations

invo cations I , I and I , resp ectively. In this case, s

1 2 3 1

that have already b een dispatched onto threads

and s s  are is uniquely assigned by the scheduler

2 3

within the MT-domain, and are awaiting re-

at every replica of the MT-domain X Y . Further-

1 1

sp onses. When the thread-of-control b ecomes

more, the same unique identi ers are generated within

available, the stack is empty. The rst op eration

every replica of the MT-domain X Y  b ecause these

1 1

I to b e pushed onto the stack is the one that

D

identi ers are derived, in an identical manner, from the

assumes the thread-of-control T . Subsequently,

D

totally ordered messages that the op eration scheduler

descendants of I that are reentrant on the MT-

D

at each replica receives.

domain are also pushed onto the stack. The in-

To detect a reentrant incoming invo cation, the op-

vo cation on top of the stack is removed from the

eration scheduler uses the scheduling identi er to de-

stack as so on as it completes, and an invo cation is

termine if the invo cation is a descendantofany op era-

pushed onto the stack as so on as it is dispatched

tion that has b een invoked, and is awaiting a resp onse to a thread within the MT-domain.

 A thread p o ol Q of pre-spawned threads to

thr

avoid the overhead of thread creation with every

switch Reason for Activation

new dispatch of an op eration to a thread. This

thread p o ol is used purely for eciency, rather

== The thread-of-control for the MT-domain is

== available to be assigned to a new operation.

than for correctness.

case THREAD OF CONTROL RELEASED:

The op eration scheduler at every replica of a MT-

if op eration queue Q is not empty

op

domain executes the deterministic algorithm, shown in

I = op eration at the head of Q

op

D

Figure 5, to schedule op erations within the replica that s =scheduling identi er for I

D D

if thread p o ol Q is empty

thr

it controls. The execution of this algorithm is triggered

Create new threads into Q

thr

by the o ccurrence of any of the following events:

endif

T = rst available thread in Q

D thr

 The release of the MT-domain's thread-of-control

Dispatch op eration I onto the thread T

D D

Push I onto stack of reentrant descendantinvo cations

when the current op eration completes, allowing

D

endif

the next op eration to b e dispatched

return

endcase

 The susp ension of all threads within the MT-

domain, in anticipation of receiving a resp onse,

== A new operation intended for the MT-domain is

allowing the delivery of a received resp onse, or a

== delivered in the total ly ordered messages.

        // new reentrant descendant invocation
    case INCOMING INVOCATION OR RESPONSE:
        Insert incoming message at the end of Q_op
        return
    endcase

    // The thread-of-control for the MT-domain is
    // suspended on the operation I_D. In addition to T_D,
    // numDesc threads could be suspended due to incomplete
    // reentrant descendant operations of I_D. All of these
    // threads are awaiting responses.
    case THREAD OF CONTROL SUSPENDED:
        if dispatch queue Q_op is not empty
            if dispatch queue Q_op has descendants of I_D
                I_Top = reentrant descendant at the top of the stack
                Increment numDesc
                I_numDesc = first enqueued descendant of I_Top in Q_op
                if I_numDesc completes operation I_Top
                    Remove I_Top from the stack of operations
                else
                    Push I_numDesc onto the stack of operations
                endif
                i_numDesc = scheduling identifier for I_numDesc
                if thread pool Q_thr is empty
                    Create new threads into Q_thr
                endif
                T_numDesc = first available thread in Q_thr
                Dispatch operation I_numDesc onto thread T_numDesc
            endif
        endif
        return
    endcase
endswitch

Figure 5: Algorithm executed by the operation scheduler each time it is activated. The operation scheduler is associated with a MT-domain D, whose logical thread-of-control T_D executes the operation I_D with scheduling identifier s_D.

• The delivery of a totally ordered invocation or response message to the operation scheduler, requiring the scheduler to decide if the message should be enqueued, scheduled or dispatched.

The scheduling algorithm, as well as the enforcement of the thread-of-control, are independent of the specific concurrency model (thread-per-request, thread-per-object, etc.) adopted by the ORB that hosts the MT-domain.

5 Implementation in Eternal

As shown in Figure 6, the Eternal system transparently replicates the objects, and the MT-domains, of the application. For every replica of a MT-domain or an object, the Interceptor transparently captures its IIOP invocation and response messages, which were originally destined for TCP/IP, and diverts them instead to the Replication Manager. The Replication Manager performs the encapsulation (retrieval) of IIOP messages to (from) the messages of the underlying reliable totally ordered multicast group communication system. In addition, the Replication Manager implements mechanisms for the detection and suppression of duplicate invocations and duplicate responses, and for state transfer and recovery.

The Eternal system provides an operation scheduler within the address space of each replica of a MT-domain. Although each replica has its own scheduler, all of the schedulers for the replicas of a MT-domain reach identical scheduling decisions. This deterministic behavior of the schedulers ensures replica consistency for every MT-domain within the application.
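The claim that independent schedulers reach identical decisions can be illustrated with a small model. The following Python sketch is not Eternal's implementation; it is a toy under stated assumptions: the names `Op`, `OperationScheduler` and `replay` are hypothetical, operations complete instantly unless they wait for descendants, and the `awaits` field (the number of descendant responses an operation needs) is an invented modeling device. It keeps only the two rules of the scheduling algorithm that matter here: incoming messages are appended to the dispatch queue, and while the thread-of-control is suspended only descendants of the operation on top of the stack are dispatched.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    op_id: str
    parent: Optional[str] = None  # invoking operation, if this is a reentrant descendant
    awaits: int = 0               # hypothetical: descendant responses needed to complete


class OperationScheduler:
    """Toy scheduler for one replica; decisions depend only on message order."""

    def __init__(self):
        self.q_op = deque()   # dispatch queue of totally ordered messages
        self.stack = []       # [op, responses still awaited], innermost on top
        self.log = []         # order in which operations were dispatched

    def deliver(self, op):
        # Incoming invocation or response: append to the end of the queue.
        self.q_op.append(op)

    def activate(self):
        while True:
            if not self.stack:
                # Thread-of-control is free: take the first queued operation.
                if not self.q_op:
                    return
                op = self.q_op.popleft()
            else:
                # Thread-of-control is suspended: only a descendant of the
                # operation on top of the stack may run; the rest stay queued.
                top = self.stack[-1][0]
                op = next((m for m in self.q_op if m.parent == top.op_id), None)
                if op is None:
                    return
                self.q_op.remove(op)
            self._dispatch(op)

    def _dispatch(self, op):
        self.log.append(op.op_id)
        self.stack.append([op, op.awaits])
        # Unwind every operation that has nothing left to wait for.
        while self.stack and self.stack[-1][1] == 0:
            self.stack.pop()
            if self.stack:
                self.stack[-1][1] -= 1


def replay(batches):
    """Deliver message batches in order, activating the scheduler after each."""
    scheduler = OperationScheduler()
    for batch in batches:
        for op in batch:
            scheduler.deliver(op)
        scheduler.activate()
    return scheduler.log


messages = [Op("A", awaits=1), Op("B"), Op("A1", parent="A"), Op("C")]
one_batch = replay([messages])               # replica that sees everything at once
trickled = replay([[m] for m in messages])   # replica activated after every message
print(one_batch, trickled)
```

Both replicas dispatch A, A1, B, C, although the messages arrive in the order A, B, A1, C and the two schedulers are activated at different moments. The unrelated operation B is held back until A and its descendant A1 have completed, which is the behavior that makes the dispatch order a function of the total order alone.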

[Figure 6 diagram: a replica of MT-domain C_D running on the ORB, with operation I_CD and its reentrant descendants on the thread-of-control; the Interceptor, containing the thread-level and socket-level interposers and the operation scheduler with its thread pool, dispatch queue Q_op and stack of reentrant descendant invocations; the Replication Manager; and the reliable totally ordered multicast layer on the platform. Legend: thread creation by the ORB or the MT-domain; outgoing messages from the MT-domain; totally ordered messages for the MT-domain from the Replication Manager; multicast messages.]

Figure 6: Implementation of the MT-domain operation scheduler in the Eternal system, using the Interceptor, which is transparently co-located with the replica of the MT-domain.
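The thread-level interposition shown in Figure 6 can be mimicked in a few lines. Eternal interposes on symbols of the native thread library at the shared-library level; the Python sketch below is only an analogue of that idea, replacing `threading.Thread.start` for the duration of a `with` block. The helper names `interpose_thread_start` and `demo`, the `op-` naming convention for controlled threads, and the use of the thread name as a stand-in for a scheduling identifier are all assumptions made for illustration.

```python
import threading
from contextlib import contextmanager


@contextmanager
def interpose_thread_start(is_controlled, hold):
    """Temporarily replace threading.Thread.start with a scheduler-aware version."""
    original_start = threading.Thread.start

    def start(thread):
        if is_controlled(thread):
            hold(thread)             # the scheduler decides when it really starts
        else:
            original_start(thread)   # e.g. the listening thread runs unhindered

    threading.Thread.start = start
    try:
        yield original_start
    finally:
        threading.Thread.start = original_start   # undo the interposition


def demo():
    results = []
    held = []   # controlled threads waiting for the scheduler

    def work(name):
        results.append(name)

    with interpose_thread_start(
        is_controlled=lambda t: t.name.startswith("op-"),
        hold=held.append,
    ) as real_start:
        listener = threading.Thread(target=work, args=("listener",), name="listener")
        listener.start()          # not controlled: starts immediately
        listener.join()
        # The application asks for its operation threads in an arbitrary order.
        for name in ("op-2", "op-1"):
            threading.Thread(target=work, args=(name,), name=name).start()
        # The scheduler releases them one at a time in a fixed order
        # (here sorted by name, standing in for a scheduling identifier).
        for thread in sorted(held, key=lambda t: t.name):
            real_start(thread)
            thread.join()
    return results
```

Calling `demo()` returns `["listener", "op-1", "op-2"]` regardless of the order in which the application requested its threads. The application code is unchanged, and only the replaced routine knows that a scheduler is involved.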

The operation scheduler must operate at a level that allows it to govern the concurrency within each object of the MT-domain, irrespective of the ORB's multithreading policies. The operation scheduler must receive all of the totally ordered operations, and decide on their delivery to the application. In the Eternal system, the operation scheduler is implemented at the level of the Interceptor, as shown in Figure 6.

The Interceptor of the Eternal system exploits "hooks" [8] provided by standard operating systems to allow for the runtime modification of process behavior. A library interpositioning-based implementation allows the Interceptor to be mapped transparently into the address space of a MT-domain at runtime, which gives the Interceptor easy access to the sockets and threads that the MT-domain creates.

The transparency of the Interceptor has the added advantage that the operation scheduler can operate without modification of the ORB or the application. The interception mechanisms also enable the scheduler to override any dispatching or threading performed by the ORB or the application, without either of them being aware of the scheduler's existence. The operation scheduler exploits a number of interposers within the Interceptor, with each interposer overriding some specific behavior of the ORB or the application.

• Socket-level interposers "replace" the socket library routines that CORBA objects use to connect and communicate over TCP/IP. To ensure that the inter-MT-domain communication occurs over a reliable totally ordered multicast protocol instead of over TCP/IP, the communicating MT-domains must establish a path of communication through the Replication Manager. The Interceptor's socket-level interposer serves to redirect all TCP/IP communication to the Replication Manager. The Replication Manager then conveys the redirected messages over the underlying multicast protocol. By exploiting the socket-level interposer, the operation scheduler can receive, transparently, all of the operations destined for the replica of the MT-domain.

• Thread-level interposers "replace" the thread library routines that multithreaded ORBs and CORBA objects use to create and control threads. The thread library interposer does not need to replace all of the symbol definitions within, for instance, the Solaris thread library, libthread.so, or the POSIX thread library, libpthread.so [6]; it needs to interpose only on the symbols of interest. The operation scheduler employs a thread-level interposer to provide alternative implementations of some of the thread library routines in order to enable the MT-domain's operation scheduler to determine the status of the MT-domain's thread-of-control, as well as to control the creation, dispatch and destruction of threads spawned within the MT-domain.

Of course, not all of the threads that the MT-domain or the ORB creates need to be controlled. The listening thread that an MT-domain server first spawns must not be prevented from running because, otherwise, the MT-domain would not be able to function in its role as a server. However, the additional threads that are dispatched to handle client invocations must be controlled because they may modify the state of the MT-domain.

Thus, the operation scheduler must examine every IIOP message that it receives through the totally ordered messages from the Replication Manager to determine if it contains a method invocation or response. The combination of the socket-level and the thread-level interposers ensures that the dispatch of threads that execute operations within a MT-domain is dictated solely by the operation scheduler, rather than by the MT-domain or by the ORB.

Operation scheduling does not significantly increase the overheads in our implementation. All of the mechanisms of Eternal, including not only operation scheduling but also interception, replication, duplicate detection, message ordering and multicasting, increase the response time by about 10% over an unreplicated multithreaded application.

6 Related Work

Considerable research efforts have been expended in designing and implementing practical systems that employ strategies to enforce replica determinism, or to circumvent specific sources of nondeterminism.

The use of replication for fault tolerance requires replica determinism to ensure that no undesirable or unforeseen side-effects cause the states of the replicas of an object to become inconsistent. The Delta-4 project [12] employs a primary-backup, or passive, replication approach to overcome the problems associated with nondeterministic replicas. The restrictions required are rather severe.

For systems that must meet real-time requirements in addition to fault tolerance, the replicated data must be both consistent and timely. The fault-tolerant real-time MARS system [11] requires deterministic behavior in highly responsive applications, such as automotive applications, which exhibit nondeterminism due to time-triggered event activation and preemptive scheduling. Replica determinism is enforced using a combination of timed messages and a communication protocol for agreement on external events.

In the SCEPTRE 2 real-time system [1], nondeterministic behavior of the replicas also arises from preemptive scheduling. The developers of SCEPTRE 2 acknowledge the limitations of both active and passive replication of nondeterministic "capsules" for the purposes of ensuring replica consistency. Semi-active replication is used, with deterministic behavior enforced through the transmission of messages from a coordination entity to the multiple backup replicas for every nondeterministic decision taken by a designated primary replica. The messages force the backup replicas to override their own decisions.

Considerable research efforts have been directed towards building systems that augment CORBA with fault tolerance. While most of these systems exploit group communication to ensure replica consistency, to the best of our knowledge, none of them addresses the support that is required to ensure consistent object replication in the presence of nondeterminism such as multithreading in the ORB or the application.

The AQuA architecture [3] is a CORBA-based framework that provides fault tolerance to CORBA applications through object replication. The AQuA architecture exploits the group communication facilities and the ordering guarantees of the underlying Ensemble and Maestro toolkits [15] to ensure replica consistency for the application. The AQuA gateway translates CORBA object invocations into messages that are transmitted via Ensemble, and detects and filters duplicate invocations (responses).

The Object Group Service (OGS) [4] and GARF [5] provide fault tolerance, the former for CORBA applications. Replica consistency is enforced to some extent through the use of a consensus algorithm, which operates on the TCP/IP-based IIOP messages, to provide ordered message delivery to the application. However, the consensus algorithm controls the concurrency only within the protocol stack and the underlying layers; concurrency within the ORB and the application is not controlled.

Some of the issues surrounding replica consistency and multithreading have been addressed for fault-tolerant systems that are not based on CORBA. In [14], a technique is employed to track and record the nondeterminism due to asynchronous events and multithreading. While the nondeterminism is not eliminated, the nondeterministic executions are recorded so that they can be replayed to restore replica consistency in the event of rollback.

The Transparent Fault Tolerance (TFT) system [2] enforces deterministic computation on replicas at the level of the operating system interface. TFT sanitizes nondeterministic system calls by interposing a software layer between the application and the operating system. The object code of the application binaries is edited to insert code that redirects all nondeterministic system calls to a layer that returns identical results at all replicas of an object.

7 Conclusion

The Eternal system provides consistent object replication for multithreaded CORBA applications and ORBs. The basic unit of replication is an object or a MT-domain, which is a CORBA client or server that can support concurrent ORB- or application-level threads. An Interceptor, implemented using thread- and socket-level interposers, allows the Eternal system to schedule and dispatch threads and operations onto replicas of MT-domains transparently, thereby overriding the scheduling of operations by the ORB.

Each replica of a MT-domain is equipped with an operation scheduler, which enforces a single logical thread-of-control within the replica. The operation scheduler dispatches operations onto the thread-of-control, and enqueues operations not related to the thread-of-control for later dispatch. Replica consistency is maintained as a result of the deterministic behavior across the replicas of a MT-domain through the totally ordered messages that they receive, as well as the deterministic behavior within each replica of the MT-domain through the identical scheduling of operations onto a single thread-of-control.

The Eternal system maintains replica consistency, irrespective of the multithreading model (thread-per-request, thread-per-object, etc.) adopted by the ORB. The transparency of the operation scheduling and the replica consistency mechanisms enables Eternal to provide fault tolerance to unmodified multithreaded CORBA applications, without requiring the modification of either the ORB or its concurrency model.

References

[1] S. Bestaoui. One solution for the nondeterminism problem in the SCEPTRE 2 fault tolerance technique. In Proceedings of the Euromicro 7th Workshop on Real-Time Systems, pages 352–358, Odense, Denmark, June 1995.

[2] T. C. Bressoud. TFT: A software system for application-transparent fault tolerance. In Proceedings of the IEEE 28th International Conference on Fault-Tolerant Computing, pages 128–137, Munich, Germany, June 1998.

[3] M. Cukier, J. Ren, C. Sabnis, W. H. Sanders, D. E. Bakken, M. E. Berman, D. A. Karr, and R. Schantz. AQuA: An adaptive architecture that provides dependable distributed objects. In Proceedings of the IEEE 17th Symposium on Reliable Distributed Systems, pages 245–253, West Lafayette, IN, October 1998.

[4] P. Felber, R. Guerraoui, and A. Schiper. The implementation of a CORBA object group service. Theory and Practice of Object Systems, 4(2):93–105, 1998.

[5] R. Guerraoui, P. Felber, B. Garbinato, and K. Mazouni. System support for object groups. SIGPLAN Notices, ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications, 33(10):244–258, October 1998.

[6] S. Kleiman, D. Shah, and B. Smaalders. Programming with Threads. Prentice Hall, Mountain View, 1996.

[7] L. E. Moser, P. M. Melliar-Smith, and P. Narasimhan. Consistent object replication in the Eternal system. Theory and Practice of Object Systems, 4(2):81–92, 1998.

[8] P. Narasimhan, L. E. Moser, and P. M. Melliar-Smith. Using interceptors to enhance CORBA. IEEE Computer, pages 62–68, July 1999.

[9] Object Management Group. The Common Object Request Broker: Architecture and specification, 2.2 edition. OMG Technical Committee Document formal/98-07-01, February 1998.

[10] Object Management Group. Fault tolerant CORBA using entity redundancy: Request for proposals. OMG Technical Committee Document orbos/98-04-01, April 1998.

[11] S. Poledna. Replica Determinism in Fault-Tolerant Real-Time Systems. PhD thesis, Technical University of Vienna, Vienna, Austria, April 1994.

[12] D. Powell. Delta-4: A Generic Architecture for Dependable Distributed Computing. Springer-Verlag, 1991.

[13] D. C. Schmidt. Evaluating architectures for multithreaded Object Request Brokers. Communications of the ACM, 41(10):54–60, October 1998.

[14] J. H. Slye and E. N. Elnozahy. Supporting nondeterministic execution in fault-tolerant systems. In Proceedings of the IEEE 26th International Symposium on Fault-Tolerant Computing, pages 250–259, Sendai, Japan, June 1996.

[15] A. Vaysburd and K. Birman. The Maestro approach to building reliable interoperable distributed applications with multiple execution styles. Theory and Practice of Object Systems, 4(2):73–80, 1998.