Lock-Free Transactions for Real-Time Systems

James H. Anderson, Srikanth Ramamurthy, Mark Moir, and Kevin Jeffay

Department of Computer Science, University of North Carolina at Chapel Hill

The first three authors were supported, in part, by NSF contract CCR 9216421, and by a Young Investigator Award from the U.S. Army Research Office, grant number DAAH04-95-1-0323. The third author was also supported by a UNC Alumni Fellowship. The fourth author was supported by grants from Intel and IBM.

Abstract

We show that previous algorithmic and scheduling work concerning the use of lock-free objects in hard real-time systems can be extended to support real-time transactions on memory-resident data. Using our approach, transactions are not susceptible to priority inversion or deadlock, do not require complicated mechanisms for data-logging or for rolling back aborted transactions, and are implemented as library routines that require no special kernel support.

1 Introduction

In most real-time database systems, conventional mechanisms such as locks, timestamps, and serialization graphs are used for concurrency control. The main problem when using any of these mechanisms is that of handling conflicting operations. If an operation of a transaction creates a conflict, then one of two strategies may be used: either that transaction may be blocked, or one or more of the transactions involved in the conflict may be aborted. When using conflict resolution schemes that employ blocking, deadlock is a key issue, and mechanisms for deadlock avoidance or resolution are required. In contrast, deadlock is not a problem when using schemes that always resolve conflicts by aborting transactions. However, with such schemes, overhead associated with undoing the effects of partially completed transactions and then restarting them is a key issue.

Most existing conflict resolution schemes for real-time database systems are susceptible to priority inversion; a priority inversion occurs when a given transaction is blocked by another transaction of lower priority. Priority inversion is a problem even for systems that employ optimistic techniques, because in such systems, certain phases of a transaction (e.g., the validation phase or write phase) must be executed as critical sections [7,8]. Priority inversion is often dealt with through the use of priority inheritance protocols or priority ceiling protocols [14]. These protocols dynamically adjust transaction priorities to ensure that a transaction within a critical section executes at a priority that is sufficiently high to bound the duration of any priority inversion. This functionality comes at the expense of additional overhead associated with maintaining information about transactions and the data they access.

In addition, in many database systems, major functional components are implemented as separate modules (e.g., transaction manager, lock manager, etc.), each of which consists of one or more processes. Transactions interact with these modules by invoking special primitives (e.g., Begin_Transaction, End_Transaction, Read, Write, Commit, Abort). Although structuring a system in this way is attractive from a software engineering standpoint, such an arrangement can result in significant interprocess communication overhead.

To summarize, although conventional concurrency control schemes provide a general framework for real-time transactions, actual implementations of these schemes can entail high overhead. In this paper, we propose a new approach for implementing real-time transactions on memory-resident data on uniprocessors. In our approach, transactions are implemented using a collection of library routines that hide from transaction programmers all details associated with managing concurrency. Transactions implemented using these routines are not susceptible to priority inversion or deadlock, and do not require complicated mechanisms for data-logging or for rolling back aborted transactions. In addition, they do not require special kernel support or additional processes for concurrency control; they therefore avoid interprocess communication overhead.

Our approach to implementing real-time transactions is based upon previous research involving the use of lock-free objects in real-time systems. An object implementation is lock-free iff it guarantees that, after a finite number of steps of any operation, some operation on the object completes. Lock-free objects

are usually implemented using "retry loops". Figure 1 depicts an example of a lock-free operation: an enqueue taken from a shared queue implementation given in [11]. In this example, an enqueue is performed by trying to thread an item onto the tail of the queue using a two-word compare-and-swap (CAS2) instruction. [Footnote 1: The first two parameters of CAS2 are shared variables, the next two parameters are values to which the shared variables are compared, and the last two parameters are new values to be placed in the shared variables should both comparisons succeed.] This threading is attempted repeatedly until it succeeds. Note that the queue is never actually "locked".

    type obj = record data: valtype; next: *obj end
    shared variable queue_tail: *obj
    procedure Enqueue(input: valtype)
        local variable old_tail, new_tail: *obj
        *new_tail := (input, NULL);
        repeat old_tail := queue_tail
        until CAS2(queue_tail, old_tail->next,
                   old_tail, NULL,
                   new_tail, new_tail)

Figure 1: Lock-free enqueue operation.

Using lock-free object implementations, objects are optimistically accessed without locking, and an access is retried if an interference occurs. Operations are atomically validated and committed by invoking a strong synchronization primitive (like CAS2). In Figure 1, this validation step can fail for a task $\tau$ only if a higher-priority task performs a successful CAS2 between the read of queue_tail by $\tau$ and the subsequent CAS2 operation by $\tau$.

From a real-time perspective, lock-free object implementations are of interest because they avoid priority inversions with no kernel support. On the surface, however, it is not immediately apparent that lock-free shared objects can be employed if tasks must adhere to strict timing constraints. In particular, repeated interferences can cause a given operation to take an arbitrarily long time to complete. Fortunately, such interferences can be bounded by scheduling tasks appropriately [4]. As explained in the next section, the key to scheduling such tasks is to allow enough spare processor time to accommodate the failed object updates that can occur over any interval. The number of failed updates in such an interval is bounded by the number of task preemptions within that interval.

In this paper, we show that previous work on lock-free objects can be extended to apply to lock-free transactions on memory-resident databases. Our approach to implementing such transactions is to first implement a multi-word CAS primitive (MWCAS). This primitive can then be used as the basis for a lock-free retry loop in which operations on many objects are validated at once. Such an implementation differs from conventional optimistic concurrency control protocols in two respects. First, our implementations do not use locking. Therefore, they do not require special support for dealing with priority inversion. Second, our implementations allow transactions to be aborted at arbitrary points during their execution without the aid of recovery procedures. An important implication of this point is that data-logging is greatly simplified.

The rest of this paper is organized as follows. In Section 2, we review previous work involving the use of lock-free objects in real-time systems. Then, in Section 3, we outline our approach for implementing lock-free transactions. We end the paper with concluding remarks in Section 4.

2 Lock-Free Objects

We begin this section by reviewing previous work on scheduling hard real-time tasks that share lock-free objects. We then consider the issue of hardware support for lock-free synchronization.

2.1 Scheduling with Lock-Free Objects

Although it should be clear that lock-free objects do not give rise to priority inversions, it may seem that unbounded retry loops render such objects useless in real-time systems. Nonetheless, Anderson, Ramamurthy, and Jeffay have shown that if tasks on a uniprocessor are scheduled appropriately, then such loops are indeed bounded [4]. We now explain intuitively why such bounds exist. For the sake of explanation, let us call an iteration of a retry loop a successful update if it successfully completes, and a failed update otherwise. Thus, a single invocation of a lock-free operation consists of any number of failed updates followed by one successful update.

Consider two tasks $\tau_i$ and $\tau_j$ that access a common lock-free object B. Suppose that $\tau_i$ causes $\tau_j$ to experience a failed update of B. On a uniprocessor, this can only happen if $\tau_i$ preempts the access of $\tau_j$ and then updates B successfully. However, $\tau_i$ preempts $\tau_j$ only if $\tau_i$ has higher priority than $\tau_j$. Thus, at each priority level, there is a correlation between failed updates and task preemptions. The maximum number of task preemptions within a time interval can be determined from the timing requirements of the tasks. Using this information, it is possible to determine a bound on the number of failed updates in that interval. A set of tasks that share lock-free objects is schedulable if there is enough free processor time to accommodate the failed

updates that can occur over any interval.

In [4], scheduling conditions are established for deadline-monotonic (DM) [9] and earliest-deadline-first (EDF) [10] priority assignments. (The rate-monotonic (RM) [10] priority scheme is also considered, but the corresponding scheduling condition is a special case of that presented for the DM case.) In order to state these conditions, we must first define some notation. Each condition applies to a collection of N periodic tasks $\{\tau_1, \ldots, \tau_N\}$. The period of task $\tau_i$ is denoted by $p_i$, and the relative deadline of task $\tau_i$ is denoted by $l_i$, where $l_i \le p_i$; under the EDF scheme, we assume that $l_i = p_i$. Tasks are assumed to be sorted in nondecreasing order by their deadlines, i.e., $l_i < l_j \Rightarrow i < j$. Let $c_i$ denote the worst-case computational cost (execution time) of task $\tau_i$ when it is the only task executing on the processor, i.e., when there is no contention for the processor or for shared objects. Let $s$ denote the execution time required for one loop iteration in the implementation of a lock-free object. For simplicity, all such loops are assumed to have the same cost. Note that $s$ is also the extra computation required in the event of a failed object update. Given this notation, the DM condition can be stated as follows.

Theorem 1: (Sufficiency under DM) A set of periodic tasks that share lock-free objects on a uniprocessor can be scheduled under the DM scheme if, for every task $\tau_i$, there exists some $t \in (0, l_i]$ such that

$$\sum_{j=1}^{i} \left\lceil \frac{t}{p_j} \right\rceil c_j + \sum_{j=1}^{i-1} \left\lceil \frac{t-1}{p_j} \right\rceil s \le t. \qquad \Box$$

Informally, this condition states that a task set is schedulable if, for each job of every task $\tau_i$, there exists a point in time $t$ between the release of that job and its deadline, such that the demand placed on the processor in the interval between the job's release and time $t$ is at most the available processor time in that interval. Demand in this interval can be broken into two components: demand due to job releases, ignoring object invocations (this is given by the first summation); and demand due to failed object updates, which is bounded by the number of preemptions by higher-priority tasks in the interval (this is given by the second summation). In comparing the above condition to the DM condition for independent tasks given in [1], we see that our condition essentially requires that the computation time of each task be "dilated" by the time it takes for one lock-free loop iteration.

We now turn our attention to the EDF scheme.

Theorem 2: (Sufficiency under EDF) A set of periodic tasks that share lock-free objects on a uniprocessor can be scheduled under the EDF scheme if

$$\sum_{j=1}^{N} \frac{c_j + s}{p_j} \le 1. \qquad \Box$$

Informally, this condition states that a task set is schedulable if processor utilization is at most 1. As in the case of DM scheduling, this condition extends the corresponding condition for independent tasks [10] by requiring that the computation time of each task be dilated by the time it takes for one lock-free loop iteration.

The results presented above suggest a general strategy for determining the schedulability of tasks that share lock-free objects. First, determine a bound on demand due to failed updates over any interval of time. Then, modify existing scheduling conditions for independent tasks by incorporating this demand. Scheduling conditions derived in this manner are applicable not only for tasks that perform single-object updates, but also for tasks that perform multi-object transactions.

The bounds on failed updates given in the theorems above are not very tight; these bounds are based on the conservative assumption that each task interferes with all lower-priority tasks that execute concurrently. Scheduling conditions that incorporate tighter bounds will be given in a forthcoming paper. In these new conditions, tighter bounds are obtained by more accurately accounting for interferences that can occur between tasks. Also, the new conditions allow different lock-free accesses to have different loop costs. When implementing transactions, significant variations in loop costs are likely.

2.2 Hardware Support for Lock-Free Synchronization

A possible criticism of the lock-free algorithm given in Figure 1 is that it requires a strong synchronization primitive, namely CAS2. The fact that many lock-free object implementations are based on strong synchronization primitives is no accident. Herlihy has shown that such strong primitives are, in general, necessary for these implementations [5]. Herlihy's results are based upon a categorization of objects by "consensus number". An object has consensus number N if it can be used to solve N-process consensus, but not (N + 1)-process consensus, in a wait-free manner. [Footnote 2: Wait-freedom is a strong form of lock-freedom that precludes all waiting dependencies among tasks, including potentially unbounded operation retries.] Herlihy's results show that primitives with unbounded consensus numbers are necessary in general-purpose lock-free (or wait-free) object implementations. Such objects are called universal because they can be used to implement any other object. Herlihy's consensus-number

hierarchy is shown in Figure 2.

    Consensus Number    Object
    1                   read/write registers
    2                   test&set, swap, fetch&add, queue
    ...                 ...
    2n-2                n-register assignment
    ...                 ...
    ∞                   memory-to-memory move and swap, compare&swap, load-linked/store-conditional

Figure 2: Herlihy's consensus-number hierarchy.

Although Herlihy's hierarchy is of fundamental importance for truly asynchronous systems, Ramamurthy, Moir, and Anderson recently showed that, for uniprocessor-based real-time systems, Herlihy's hierarchy collapses, i.e., reads and writes can be used to implement any lock-free object [13]. The basis for this result is the realization that certain task interleavings cannot occur within real-time systems. In particular, if a task $\tau_i$ accesses an object in the time interval $[t, t']$, and if another task $\tau_j$ accesses that object in the interval $[u, u']$, then it is not possible to have $t < u < t' < u'$, because the higher-priority task must finish its operation before relinquishing the processor. Requiring an object implementation to correctly deal with this interleaving is therefore pointless, because it cannot arise in practice. The distinction between traditional asynchronous systems, to which Herlihy's work is directed, and hard real-time systems is illustrated in Figure 3.

[Figure 3 diagram, panels (a) and (b), not reproduced.]

Figure 3: Line segments denote operations on shared objects with time running from left to right. Each level corresponds to operations by a different task. (a) Interleaved operations in an asynchronous system. Operations may overlap arbitrarily. (b) Interleaved operations in a uniprocessor real-time system. Two operations overlap only if one is contained within the other.

The results of [13] are based upon an execution model like that depicted in Figure 3(b). Two key assumptions underlie this model: (i) a task $\tau_i$ may preempt another task $\tau_j$ only if $\tau_i$ has higher priority than $\tau_j$; (ii) a task's priority can change over time, but not during any object access. Assumption (i) is common to all priority-driven scheduling policies. Assumption (ii) holds for most common policies, including RM, DM, and EDF scheduling. The only common scheduling policy that we know of that violates assumption (ii) is least-laxity-first (LLF) scheduling [12].

The collapse of Herlihy's hierarchy for hard real-time uniprocessor systems is established in [13] by giving a wait-free (and hence lock-free) algorithm that solves the consensus problem in such systems using only reads and writes. This algorithm is shown in Figure 4. In this algorithm, each task chooses a decision value by invoking the procedure decide.

    shared var Fin, Prp: valtype ∪ ⊥
    private var v: valtype
    initially Fin = ⊥ ∧ Prp = ⊥

    procedure decide(val: valtype) returns valtype
    1: if Fin = ⊥ then
    2:     if Prp = ⊥ then
    3:         Prp := val
           fi
       fi;
    4: if Fin = ⊥ then
    5:     v := Prp;
    6:     Fin := v
       fi;
    7: return Fin

Figure 4: Consensus using reads and writes.

Procedure decide uses two shared variables, Prp ("propose") and Fin ("final"). Each task that does not detect the input of another task in Prp or Fin writes its own value into Prp. Having ensured that some value has been proposed (lines 1 to 3), a task copies the proposed value to Fin, if necessary (lines 4 to 6). It is easy to see that no task returns before some task's input value is written into Fin, and that all tasks return a value read from Fin. It can be further shown that the first value written into Prp is the only value written into Fin. The correctness of the algorithm easily follows from this fact, yielding the following theorem. (It is easy to show that the algorithm of Figure 4 does not correctly solve the consensus problem in a truly asynchronous system consisting of two or more processes. Thus, it does not contradict the fact that reads and writes have consensus number 1 for such systems.)

Theorem 3: Consensus can be implemented with constant time and space using reads and writes on a hard

real-time uniprocessor system. □

Given an implementation of consensus objects, any shared object can be implemented in a lock-free manner [5]. However, such implementations usually entail high overhead. More practical lock-free implementations are based on universal primitives such as CAS and load-linked/store-conditional (LL/SC) [6]. To enable such implementations to be used in real-time uniprocessor systems, Ramamurthy, Moir, and Anderson present two implementations of an object that supports Read and CAS. (LL/SC can be implemented using Read and CAS in constant time [2].) These implementations, which are summarized in the following theorems, use read/write and memory-to-memory Move instructions, respectively. Although Move is rare in multiprocessors, it is widely available on uniprocessors. For example, uniprocessor systems based on Intel's 80x86 processor and the Pentium line of processors support the move instruction. (In these theorems, N denotes the number of tasks that share an object.)

Theorem 4: Read and CAS can be implemented in a wait-free manner on a hard real-time uniprocessor system with O(N) time and space complexity using reads and writes. □

Theorem 5: Read and CAS can be implemented in a wait-free manner on a hard real-time uniprocessor system with constant time and O(N) space complexity using Move. □

3 Lock-Free Transactions

In this section, we outline an implementation of lock-free transactions on memory-resident data. We assume that transactions are invoked by a collection of prioritized tasks executing on the same processor. Our implementation is mostly based on the universal lock-free constructions of large objects and multiple objects by Anderson and Moir [2,3].

In contrast to conventional schemes for concurrency control, when lock-free algorithms are used, transactions are executed as if they do not access any shared data, i.e., such transactions can be viewed as being independent. As a result, mechanisms are not needed to handle priority inversion, deadlock, or data conflicts, or to undo the effects of partially-completed transactions that are aborted.

In our lock-free implementation, which is shown in Figure 5, a transaction commits and validates at the same time by performing a MWCAS operation. [Footnote 3: The MWCAS procedure takes four input parameters. The first parameter specifies the number of words on which a CAS operation is to be performed. The remaining parameters specify the words' addresses, old values, and new values, respectively. The operation returns true if a CAS operation can be successfully performed on all the words simultaneously, and returns false otherwise.] A transaction successfully completes, and the corresponding modifications to data take effect, only if the MWCAS operation succeeds. Thus, transactions can be aborted arbitrarily without executing expensive recovery procedures. This can be advantageous in situations that require transactions with firm deadlines to be aborted due to transient overload conditions. In such situations, transactions can be arbitrarily terminated by the system without fear of compromising data consistency.

One possible criticism of our implementation is that the MWCAS primitive is not supported by the hardware in most systems. However, Anderson and Moir have shown that MWCAS can be implemented using the single-word CAS primitive in general asynchronous systems [3]. Furthermore, because real-time systems only allow a subset of the transaction interleavings in asynchronous systems, these constructions can be simplified for real-time systems using techniques similar to those presented in Section 2.2.

In our lock-free construction, all data is stored in a single array. However, we do not require the array to be stored in contiguous memory locations. Instead, we provide the "illusion" of a contiguous array. Before describing the code in Figure 5, we explain how this is achieved, and how the details of the implementation are hidden from the programmer.

The implemented array MEM is partitioned into B blocks of size S. (Figure 6 depicts this arrangement for B = 5.) The first block contains array locations 0 through S-1, the second block contains locations S through 2S-1, and so on. (We assume that blocks have the same size only because it simplifies the presentation of the ideas in the paper.) A bank of pointers, one for each block, is used to point to the blocks that make up the array. In order to modify the contents of the array, a task makes a copy of each block to be changed, and then attempts to update the bank of pointers so that they point to the modified blocks.

When a transaction of task p accesses a word in the array, say MEM[x], the block containing the xth word is identified. If p's transaction writes into MEM[x], then p must replace the corresponding block. The details of identifying blocks and replacing modified blocks are hidden from the programmer by means of the Read and Write routines, which perform all necessary address translation and record-keeping. These routines are called by the transaction in order to read or write an element of the MEM array. Thus, instead of writing

    type blktype = array[0..S-1] of wordtype; ptrtype = record addr: 0..B+NT-1; ver: integer end
    shared variable BANK: array[0..B-1] of ptrtype;      /* Bank of pointers to array blocks */
                    BLK: array[0..B+NT-1] of blktype;    /* Array and copy blocks */
    initially (∀k : 0 ≤ k < ...)
    private variable addrlist, copy: array[0..T-1] of 0..B+NT-1; old_lst, ptrs: array[0..B-1] of ptrtype;
                     dirty, touch: array[0..B-1] of boolean; src, dest: *blktype; nw, dirtycnt: 0..T; i, blkidx: 0..B-1;
                     oldval, newval: array[0..B-1] of ptrtype; blklist: sorted list of 0..B-1; env: jmp_buf
    initially (∀k : 0 ≤ k < ...)

    procedure Read(address: 0..BS-1) returns wordtype
    1:  blkidx := address div S;
    2:  if ¬touch[blkidx] then touch[blkidx] := true; blklist.insert(blkidx); ptrs[blkidx] := BANK[blkidx] fi;
    3:  v := BLK[ptrs[blkidx].addr][address mod S];
    4:  if BANK[blkidx].ver = ptrs[blkidx].ver then return v else longjmp(env, 1) fi

    procedure Write(address: 0..BS-1; val: wordtype)
    5:  blkidx := address div S;
    6:  if ¬touch[blkidx] then
            touch[blkidx] := true;
            blklist.insert(blkidx);
            ptrs[blkidx] := BANK[blkidx]
        fi;
        if ¬dirty[blkidx] then
            dirty[blkidx] := true;
    7:      src := BLK[ptrs[blkidx].addr];
    8:      dest := BLK[copy[dirtycnt]];
            memcpy(dest, src, sizeof(blktype));
    9:      if BANK[blkidx].ver = ptrs[blkidx].ver then
                old_lst[blkidx] := ptrs[blkidx];
                ptrs[blkidx].addr := copy[dirtycnt];
                dirtycnt := dirtycnt + 1
            else longjmp(env, 1)
            fi
        fi;
    10: BLK[ptrs[blkidx].addr][address mod S] := val

    procedure LF_Transaction(tr: function pointer)
        while true do
            dirtycnt, nw := 0, 0;
    11:     if setjmp(env) ≠ 1 then tr() fi;
    12:     while ¬blklist.empty() do
                i := blklist.delete_list_head();
                addrlist[nw] := i;
                if dirty[i] then
                    oldval[nw] := old_lst[i];
                    newval[nw] := (ptrs[i].addr, ptrs[i].ver + 1);
                    dirty[i] := false
                else oldval[nw], newval[nw] := ptrs[i], ptrs[i]
                fi;
                touch[i], nw := false, nw + 1
            od;
    13:     if MWCAS(nw, addrlist, oldval, newval) then
                for i := 0 to dirtycnt-1 do copy[i] := old_lst[i].addr od;
                return
            fi
        od

Figure 5: Implementation of lock-free transactions (LF_Transaction) and the associated Read and Write procedures.
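A minimal single-threaded Python sketch of the copy-on-write scheme in Figure 5 may help fix ideas. It is not the authors' code and rests on several assumptions: the names Store, Txn, Retry, mem, and lf_transaction are hypothetical; an exception stands in for longjmp; the mwcas method is an ordinary compare-then-swap that is atomic only because nothing here runs concurrently; and copy blocks are simply appended to the block array rather than drawn from a per-task pool of T blocks and reclaimed. Preemption by a higher-priority task is imitated by running a second transaction to completion in the middle of the first.

```python
class Retry(Exception):
    """Stands in for longjmp(env, 1): abandon the current attempt."""


class Store:
    """BANK of (block address, version) pairs over a growable list of blocks."""

    def __init__(self, nblocks, block_size):
        self.S = block_size
        self.blk = [[0] * block_size for _ in range(nblocks)]
        self.bank = [(i, 0) for i in range(nblocks)]

    def mwcas(self, idxs, old, new):
        # Single-threaded stand-in for MWCAS: compare every word, then swap all.
        if any(self.bank[i] != o for i, o in zip(idxs, old)):
            return False
        for i, n in zip(idxs, new):
            self.bank[i] = n
        return True


class Txn:
    def __init__(self, store):
        self.st = store
        self.ptrs = {}  # touched block -> local (addr, ver), maybe redirected to a copy
        self.old = {}   # dirty block -> (addr, ver) originally read from BANK

    def _touch(self, b):
        if b not in self.ptrs:
            self.ptrs[b] = self.st.bank[b]
        return self.ptrs[b]

    def read(self, address):
        b, off = divmod(address, self.st.S)
        addr, ver = self._touch(b)
        value = self.st.blk[addr][off]
        if self.st.bank[b][1] != ver:  # block replaced since first access
            raise Retry
        return value

    def write(self, address, val):
        b, off = divmod(address, self.st.S)
        addr, ver = self._touch(b)
        if b not in self.old:          # first write to this block: copy it
            copy = list(self.st.blk[addr])
            if self.st.bank[b][1] != ver:
                raise Retry
            self.st.blk.append(copy)
            self.old[b] = (addr, ver)
            addr = len(self.st.blk) - 1
            self.ptrs[b] = (addr, ver)
        self.st.blk[addr][off] = val

    def commit(self):
        idxs = sorted(self.ptrs)
        old = [self.old.get(b, self.ptrs[b]) for b in idxs]
        new = [(self.ptrs[b][0], self.ptrs[b][1] + 1) if b in self.old
               else self.ptrs[b] for b in idxs]
        return self.st.mwcas(idxs, old, new)


def lf_transaction(store, body):
    """Retry body until its commit succeeds; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        txn = Txn(store)
        try:
            body(txn)
        except Retry:
            continue
        if txn.commit():
            return attempts


def mem(store, address):
    """Read MEM[address] through the current BANK (outside any transaction)."""
    b, off = divmod(address, store.S)
    return store.blk[store.bank[b][0]][off]


store = Store(nblocks=5, block_size=4)
state = {"preempted": False}


def high(txn):                 # hypothetical higher-priority transaction
    txn.write(0, 7)


def low(txn):                  # hypothetical lower-priority transaction
    x = txn.read(0)
    if not state["preempted"]:
        state["preempted"] = True
        lf_transaction(store, high)  # imitated preemption, runs to completion
    txn.write(4, x + 1)


attempts = lf_transaction(store, low)
```

In this run the lower-priority transaction needs two attempts: its first commit fails because the version of block 0 changed after it was read, and the retry observes the new value.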

[Figure 6 diagram not reproduced. It shows the BANK of current block pointers referring to Blocks 1 through 5, which make up the MEM array of S-word blocks, together with replacement blocks (Copy of Block 1, Copy of Block 2, Copy of Block 5); the legend distinguishes modified from unmodified block pointers. Transaction T1 writes block 1 and reads blocks 3 and 5; transaction T2 writes block 2 and reads block 3; transaction T3 writes block 5 and reads block 4.]

Figure 6: Implementation of the MEM array for lock-free transactions (depicted for B = 5).
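The read and write sets listed for Figure 6 can be checked for potential interference with a few lines of Python. This is only an illustrative sketch: it uses the rule stated later in this section (a transaction can interfere with another only if it has higher priority and writes a block the other accesses), and the class Txn6, the function can_interfere, and the priority values are all hypothetical, since the figure does not assign priorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn6:
    name: str
    priority: int        # assumed convention: larger number = higher priority
    reads: frozenset
    writes: frozenset


def can_interfere(ti, tj):
    """True if ti may cause tj to retry: ti outranks tj and writes a block tj accesses."""
    accessed_by_tj = tj.reads | tj.writes
    return ti.priority > tj.priority and bool(ti.writes & accessed_by_tj)


# Block sets from Figure 6; the priorities 1 < 2 < 3 are assumed for illustration.
t1 = Txn6("T1", 1, frozenset({3, 5}), frozenset({1}))
t2 = Txn6("T2", 2, frozenset({3}), frozenset({2}))
t3 = Txn6("T3", 3, frozenset({4}), frozenset({5}))

t3_hits_t1 = can_interfere(t3, t1)   # T3 writes block 5, which T1 reads
t2_hits_t1 = can_interfere(t2, t1)   # T2 writes block 2, which T1 never touches
```

Under these assumed priorities only T3 can force T1 to retry; T1 and T2 share block 3 but neither writes it.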

action reads a blo ck but do es not mo dify it, then the \MEM [1] := MEM [10]", the programmer would write

blo ck is marked as \touched" but not \dirty", and the \Write (1; Read (10))".

subsequent successful MWCAS do es not change the blo ck

In Figure 5, BANK is a B -word shared variable.

pointer or version number.

Each elementofBANK contains a p ointer to a blo ckof

A complication arises in our implementation when size S and a version numb er for the p ointer. (Actually,

the BANK variable is mo di ed by a higher-priority each p ointer is an index into an arrayofblocks BLK.)

task's transaction during the execution of task p's The B blo cks p ointed to by BANK constitute the cur-

transaction, thereby causing p to read inconsistentval- rentvalue of the array MEM.We assume that an upp er

ues from the MEM array. Because its MWCAS op eration b ound T is known on the number of blocks mo di ed by

will subsequently fail, p will not b e able to install cor- any transaction. Because a task's transaction copies a

rupted data. However, there is a risk that p's sequential blo ck b efore mo difying it, T \copy" blo cks are required

op eration might cause an error, such as a division by p er task. Therefore, a total of B + NT blo cks are used.

zero or a range error. This problem is solved by ensur- These blo cks are stored in the array BLK . Although

ing that if task p detects that the version numb er of one blo cks BLK [NT ]toBLK [NT + B 1] are the initial

of the blo cks accessed by it has b een mo di ed since its blo cks of the array,andBLK [pT ]to BLK [(p +1)T 1]

most recent access, then control is returned from the are task p's initial copy blo cks, the roles of these blo cks

Read or Write pro cedure to line 12 in LF are not xed. If p successfully completes a transaction Transaction

by p erforming a MWCAS op eration, then p reclaims the using Unix-like setjmp and longjmp system calls. Task

replaced blo cks as copy blo cks. Thus, the copy blo cks p then \cleans up" by reinitializing blklist , dirty ,and

of one task may b ecome blo cks of the array MEM , and touch , fails the MWCAS op eration, and retries the trans-

then later b ecome copyblocks of another task. action. Transactions can takeadvantage of this mecha-

nism by re-reading previously accessed blo cks in order

Task p p erforms a lo ck-free transaction by calling the

to fail early in the event that the blo ck has b een mo d-

LF Transaction pro cedure, which rep eatedly p erforms

i ed by another transaction.

the transaction until it executes a successful MWCAS op-

An example transaction that up dates the temp era- eration. The user-supplied transaction co de accesses

ture display of a b oiler is given in Figure 7. Under our the MEM array in a sequential manner using the Read

implementation, the transaction is executed by call- and Write pro cedures. The Read pro cedure rst com-

ing LF Transaction (update display ). As can b e seen, putes the index of the blo ckcontaining the accessed

transactions implemented under conventional schemes word, and then marks the blo ck as \touched". Before

can b e easily mo di ed to work under our lo ck-free returning the value from the appropriate o set within

scheme. that blo ck, the blo ck index is inserted into a sorted list

blklist , and the blo ckpointer is recorded in ptrs .

In our implementation, concurrent read op erations

do not interfere with one another. Also, concurrent Like the Read pro cedure, the Write pro cedure rst

transactions that mo dify disjoint sets of blo cks do not computes the blo ck to b e accessed and records it as

interfere with one another, as illustrated in Figure 6. \touched", if necessary. Then, if Write is mo difying

The gure depicts three concurrent transactions T , T , the blo ck for the rst time during the transaction, the

1 2

and T .Transactions T and T do not interfere with blo ck is copied into one of the task's copy blo cks, and

3 1 2

each other b ecause neither of them mo di es a blo ck the blo ckismarked as \dirty". The copy blo ck is then

accessed by b oth. However, T can p otentially interfere made part of the lo cal version of MEM by linking it

3

with T b ecause T mo di es blo ck5,whichisreadby into ptrs .Finally, the displaced blo ckisrecordedin

1 3

transaction T . old lst for p ossible reclaiming later, and the appropriate

1

word is mo di ed in the lo cal copy of the blo ck.

The ver counter⁴ associated with each block pointer in BANK records the block's current "version" number. If a transaction successfully replaces a modified block, then it increments the block's version number. Thus, a transaction determines whether the $i$th block has been changed by comparing the version number that it last read from BANK[i] to the current version number of BANK[i]. Note that if a successful trans-

⁴ Our implementation uses unbounded counters to implement block version numbers. In a forthcoming paper, we show that these counters can be bounded.

Observe that the scheduling conditions presented in Section 2.2 apply to task sets. These conditions can be applied directly to periodic (or sporadic) transactions even if the objects accessed by a transaction are not known in advance, i.e., a transaction can access data anywhere in the MEM array. In many applications, it should be possible to tighten these scheduling conditions by more carefully accounting for the kinds of conflicts that can arise among transactions. Note that a transaction $T_i$ can interfere with another transaction $T_j$ only if $T_i$ has higher priority than $T_j$ and $T_i$ writes a block that is accessed by $T_j$.
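The interference condition stated above (a higher-priority transaction that writes a block accessed by a lower-priority one) can be expressed as a small predicate. The sketch below is illustrative only: the Transaction record, its field names, and the convention that a larger number means higher priority are assumptions. The block sets and priorities given to T1, T2, and T3 are hypothetical, chosen to be consistent with the description of Figure 6 (T3 writes block 5, which T1 reads).

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    name: str
    priority: int                              # assumption: larger = higher priority
    reads: set = field(default_factory=set)    # indices of blocks read
    writes: set = field(default_factory=set)   # indices of blocks written


def can_interfere(ti, tj):
    """True if ti can interfere with tj under the condition in the text."""
    accessed_by_tj = tj.reads | tj.writes
    return ti.priority > tj.priority and bool(ti.writes & accessed_by_tj)


# Hypothetical access sets and priorities for the three transactions.
t1 = Transaction("T1", priority=1, reads={5}, writes={2})
t2 = Transaction("T2", priority=2, reads={5}, writes={7})
t3 = Transaction("T3", priority=3, reads={8}, writes={5})

pairs = {(a.name, b.name): can_interfere(a, b)
         for a in (t1, t2, t3) for b in (t1, t2, t3) if a is not b}
```

A schedulability analysis that tightens the conditions of Section 2.2 could, under these assumptions, count only the pairs for which the predicate holds instead of charging every higher-priority transaction as a potential source of interference.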

procedure update_display()
    local variable t: integer
    t := Read(Boiler_temp);
    if Read(Disp_temp) ≠ t then Write(Disp_temp, t)

Figure 7: An example transaction.
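To make the execution of a transaction such as update_display more concrete, the following Python model runs a transaction body optimistically and retries it when a version check fails. It is a sketch under stated assumptions, not the paper's LF_Transaction: the Bank class, the try_commit method (a stand-in for whatever atomic commit step the implementation uses), the on_attempt hook, and the keying of data by name instead of by block index are all hypothetical.

```python
class Bank:
    """Shared store with one version counter per item (illustrative model)."""

    def __init__(self, values):
        self.values = dict(values)
        self.ver = {name: 0 for name in values}

    def try_commit(self, seen_versions, writes):
        # Stand-in for the atomic commit step: succeed only if every item the
        # transaction accessed still has the version the transaction observed.
        if any(self.ver[k] != v for k, v in seen_versions.items()):
            return False
        for k, value in writes.items():
            self.values[k] = value
            self.ver[k] += 1      # only modified items get a new version
        return True


def lf_transaction(bank, body, on_attempt=None):
    """Run body against a private view, retrying until the commit succeeds."""
    attempts = 0
    while True:
        attempts += 1
        seen, writes = {}, {}

        def read(k):
            seen.setdefault(k, bank.ver[k])
            return writes[k] if k in writes else bank.values[k]

        def write(k, value):
            seen.setdefault(k, bank.ver[k])
            writes[k] = value     # buffered locally until commit

        body(read, write)
        if on_attempt:
            on_attempt(attempts)  # hook used below to simulate a preemption
        if bank.try_commit(seen, writes):
            return attempts


def update_display(read, write):
    t = read("Boiler_temp")
    if read("Disp_temp") != t:
        write("Disp_temp", t)


bank = Bank({"Boiler_temp": 80, "Disp_temp": 75})


def interfere(attempt):
    # Simulate a higher-priority transaction changing Boiler_temp once.
    if attempt == 1:
        bank.values["Boiler_temp"] = 85
        bank.ver["Boiler_temp"] += 1


attempts = lf_transaction(bank, update_display, interfere)
```

In this run the first attempt is invalidated by the simulated interference, and the second attempt commits the fresh reading. Because writes are buffered privately, the failed attempt needs no rollback.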

In some applications, it is essential to back up the database in stable storage. This is usually achieved by logging operations on the database. The techniques presented here have the potential to greatly simplify transaction logging, mainly because they do not require procedures for recovery from aborted transactions. When a task performs a transaction, it can update the database and the log file simultaneously (treating the log file as simply another block of memory to update). Because all transactions access the log file, a simplistic application of this approach would result in each transaction interfering with all lower-priority concurrent transactions, which would adversely impact schedulability. Log-file updates should therefore be handled differently from other memory accesses. In particular, if a transaction fails to update the log file, then it should retry only the log file update (using a very short retry loop like that in Figure 1), as opposed to retrying the entire transaction.
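The idea of retrying only the log update can be illustrated with a short compare-and-swap loop. The sketch below is a Python model and makes several assumptions: CasCell is a hypothetical stand-in for a memory word that supports compare-and-swap (its atomicity is modelled with a lock purely for simulation), the log is represented as an immutable tuple of records, and append_log is an invented name. It is not the retry loop of Figure 1, only a loop of the same general kind.

```python
import threading


class CasCell:
    """A memory word with compare-and-swap, modelled with a lock."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically replace the value if it is still the object we read.
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


def append_log(log_cell, record):
    """Retry only the log update until it succeeds; return the number of tries."""
    tries = 0
    while True:
        tries += 1
        old = log_cell.load()        # immutable tuple of records so far
        new = old + (record,)
        if log_cell.compare_and_swap(old, new):
            return tries


log = CasCell(())


def worker(task_id):
    for n in range(50):
        append_log(log, (task_id, n))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Each failed compare-and-swap costs only one reload and one tuple construction, which is the sense in which the retry loop is short: a conflict on the log does not force the enclosing transaction to be re-executed.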

4 Concluding Remarks

The research outlined above leaves many opportunities for further research. Of foremost importance are experimental studies that compare lock-free transactions with more conventional implementations. It would also be interesting to determine if lock-free algorithms could be used in systems with disk-resident data. Finally, it would be interesting to investigate the applicability of these techniques within multiprocessors and distributed systems.

Acknowledgement: We thank Steve Goddard and the anonymous referees for their valuable comments.
