Lock-Free Transactions for Real-Time Systems
James H. Anderson, Srikanth Ramamurthy, Mark Moir, and Kevin Jeffay
Department of Computer Science, University of North Carolina at Chapel Hill
The first three authors were supported, in part, by NSF contract CCR 9216421, and by a Young Investigator Award from the U.S. Army Research Office, grant number DAAH04-95-1-0323. The third author was also supported by a UNC Alumni Fellowship. The fourth author was supported by grants from Intel and IBM.

Abstract

We show that previous algorithmic and scheduling work concerning the use of lock-free objects in hard real-time systems can be extended to support real-time transactions on memory-resident data. Using our approach, transactions are not susceptible to priority inversion or deadlock, do not require complicated mechanisms for data-logging or for rolling back aborted transactions, and are implemented as library routines that require no special kernel support.

1 Introduction

In most real-time database systems, conventional mechanisms such as locks, timestamps, and serialization graphs are used for concurrency control. The main problem when using any of these mechanisms is that of handling conflicting operations. If an operation of a transaction creates a conflict, then one of two strategies may be used: either that transaction may be blocked, or one or more of the transactions involved in the conflict may be aborted. When using conflict resolution schemes that employ blocking, deadlock is a key issue, and mechanisms for deadlock avoidance or resolution are required. In contrast, deadlock is not a problem when using schemes that always resolve conflicts by aborting transactions. However, with such schemes, overhead associated with undoing the effects of partially completed transactions and then restarting them is a key issue.

Most existing conflict resolution schemes for real-time database systems are susceptible to priority inversion; a priority inversion occurs when a given transaction is blocked by another transaction of lower priority. Priority inversion is a problem even for systems that employ optimistic techniques, because in such systems, certain phases of a transaction (e.g., the validation phase or write phase) must be executed as critical sections [7,8]. Priority inversion is often dealt with through the use of priority inheritance protocols or priority ceiling protocols [14]. These protocols dynamically adjust transaction priorities to ensure that a transaction within a critical section executes at a priority that is sufficiently high to bound the duration of any priority inversion. This functionality comes at the expense of additional overhead associated with maintaining information about transactions and the data they access.

In addition, in many database systems, major functional components are implemented as separate modules (e.g., transaction manager, lock manager, etc.), each of which consists of one or more processes. Transactions interact with these modules by invoking special primitives (e.g., Begin_Transaction, End_Transaction, Read, Write, Commit, Abort). Although structuring a system in this way is attractive from a software engineering standpoint, such an arrangement can result in significant interprocess communication overhead.

To summarize, although conventional concurrency control schemes provide a general framework for real-time transactions, actual implementations of these schemes can entail high overhead. In this paper, we propose a new approach for implementing real-time transactions on memory-resident data on uniprocessors. In our approach, transactions are implemented using a collection of library routines that hide from transaction programmers all details associated with managing concurrency. Transactions implemented using these routines are not susceptible to priority inversion or deadlock, and do not require complicated mechanisms for data-logging or for rolling back aborted transactions. In addition, they do not require special kernel support or additional processes for concurrency control; they therefore avoid interprocess communication overhead.

Our approach to implementing real-time transactions is based upon previous research involving the use of lock-free objects in real-time systems. An object implementation is lock-free iff it guarantees that, after a finite number of steps of any operation, some operation on the object completes. Lock-free objects
are usually implemented using "retry loops". Figure 1 depicts an example of a lock-free operation: an enqueue taken from a shared queue implementation given in [11]. In this example, an enqueue is performed by trying to thread an item onto the tail of the queue using a two-word compare-and-swap (CAS2) instruction.^1 This threading is attempted repeatedly until it succeeds. Note that the queue is never actually "locked".

    type obj = record data: valtype; next: obj end
    shared variable queue_tail: obj
    procedure Enqueue(input: valtype)
      local variable old_tail, new_tail: obj
      new_tail := (input, NULL);
      repeat old_tail := queue_tail
      until CAS2(queue_tail, old_tail->next,
                 old_tail, NULL,
                 new_tail, new_tail)

Figure 1: Lock-free enqueue operation.

^1 The first two parameters of CAS2 are shared variables, the next two parameters are values to which the shared variables are compared, and the last two parameters are new values to be placed in the shared variables should both comparisons succeed.

Using lock-free object implementations, objects are optimistically accessed without locking, and an access is retried if an interference occurs. Operations are atomically validated and committed by invoking a strong synchronization primitive (like CAS2). In Figure 1, this validation step can fail for a task $\tau$ only if a higher-priority task performs a successful CAS2 between the read of queue_tail by $\tau$ and the subsequent CAS2 operation by $\tau$.

From a real-time perspective, lock-free object implementations are of interest because they avoid priority inversions with no kernel support. On the surface, however, it is not immediately apparent that lock-free shared objects can be employed if tasks must adhere to strict timing constraints. In particular, repeated interferences can cause a given operation to take an arbitrarily long time to complete. Fortunately, such interferences can be bounded by scheduling tasks appropriately [4]. As explained in the next section, the key to scheduling such tasks is to allow enough spare processor time to accommodate the failed object updates that can occur over any interval. The number of failed updates in such an interval is bounded by the number of task preemptions within that interval.

In this paper, we show that previous work on lock-free objects can be extended to apply to lock-free transactions on memory-resident databases. Our approach to implementing such transactions is to first implement a multi-word CAS primitive (MWCAS). This primitive can then be used as the basis for a lock-free retry loop in which operations on many objects are validated at once. Such an implementation differs from conventional optimistic concurrency control protocols in two respects. First, our implementations do not use locking. Therefore, they do not require special support for dealing with priority inversion. Second, our implementations allow transactions to be aborted at arbitrary points during their execution without the aid of recovery procedures. An important implication of this point is that data-logging is greatly simplified.

The rest of this paper is organized as follows. In Section 2, we review previous work involving the use of lock-free objects in real-time systems. Then, in Section 3, we outline our approach for implementing lock-free transactions. We end the paper with concluding remarks in Section 4.

2 Lock-Free Objects

We begin this section by reviewing previous work on scheduling hard real-time tasks that share lock-free objects. We then consider the issue of hardware support for lock-free synchronization.

2.1 Scheduling with Lock-Free Objects

Although it should be clear that lock-free objects do not give rise to priority inversions, it may seem that unbounded retry loops render such objects useless in real-time systems. Nonetheless, Anderson, Ramamurthy, and Jeffay have shown that if tasks on a uniprocessor are scheduled appropriately, then such loops are indeed bounded [4]. We now explain intuitively why such bounds exist. For the sake of explanation, let us call an iteration of a retry loop a successful update if it successfully completes, and a failed update otherwise. Thus, a single invocation of a lock-free operation consists of any number of failed updates followed by one successful update.

Consider two tasks $\tau_i$ and $\tau_j$ that access a common lock-free object B. Suppose that $\tau_i$ causes $\tau_j$ to experience a failed update of B. On a uniprocessor, this can only happen if $\tau_i$ preempts the access of $\tau_j$ and then updates B successfully. However, $\tau_i$ preempts $\tau_j$ only if $\tau_i$ has higher priority than $\tau_j$. Thus, at each priority level, there is a correlation between failed updates and task preemptions. The maximum number of task preemptions within a time interval can be determined from the timing requirements of the tasks. Using this information, it is possible to determine a bound on the number of failed updates in that interval. A set of tasks that share lock-free objects is schedulable if there is enough free processor time to accommodate the failed updates that can occur over any interval.
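The relationship between preemptions and failed updates described above can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions: the class `SharedWord`, the function `lock_free_add`, and the two preempting tasks are hypothetical names introduced here, the CAS is an ordinary method rather than a hardware instruction, and a preemption is modeled as a callback that runs between a task's read and its CAS.

```python
class SharedWord:
    """A shared word with a compare-and-swap method (hypothetical helper).

    On real hardware CAS is one atomic instruction; an ordinary method
    suffices here because the simulation is single-threaded.
    """

    def __init__(self, value):
        self.value = value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


def lock_free_add(word, delta, preempters=()):
    """Add delta to word using a retry loop; return the number of failed updates.

    Each callable in `preempters` simulates one preemption by a
    higher-priority task that runs between this task's read and its CAS.
    """
    pending = list(preempters)
    failed = 0
    while True:
        old = word.value                # read the shared object
        if pending:
            pending.pop(0)(word)        # simulated preemption
        if word.cas(old, old + delta):  # validate and commit in one step
            return failed               # the successful update ends the call
        failed += 1                     # interference caused a failed update


def interfering(word):
    """A higher-priority task that updates the same object."""
    lock_free_add(word, 10)


def harmless(word):
    """A higher-priority task that does not access the object."""


if __name__ == "__main__":
    w = SharedWord(0)
    failures = lock_free_add(w, 1, [interfering, interfering, harmless])
    print("failed updates:", failures, "final value:", w.value)
```

In this sketch the low-priority operation is preempted three times but experiences only two failed updates, because the third preempting task does not modify the object. The number of preemptions is therefore an upper bound on the number of failed updates, not an exact count.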
In [4], scheduling conditions are established for deadline-monotonic (DM) [9] and earliest-deadline-first (EDF) [10] priority assignments. (The rate-monotonic (RM) [10] priority scheme is also considered, but the corresponding scheduling condition is a special case of that presented for the DM case.) In order to state these conditions, we must first define some notation. Each condition applies to a collection of $N$ periodic tasks $\{\tau_1, \ldots, \tau_N\}$. The period of task $\tau_i$ is denoted by $p_i$, and the relative deadline of task $\tau_i$ is denoted by $l_i$, where $l_i \le p_i$; under the EDF scheme, we assume that $l_i = p_i$. Tasks are assumed to be sorted in nondecreasing order by their deadlines, i.e., $l_i < l_j \Rightarrow i < j$. Let $c_i$ denote the worst-case computational cost (execution time) of task $\tau_i$ when it is the only task executing on the processor, i.e., when there is no contention for the processor or for shared objects. Let $s$ denote the execution time required for one loop iteration in the implementation of a lock-free object. For simplicity, all such loops are assumed to have the same cost. Note that $s$ is also the extra computation required in the event of a failed object update. Given this notation, the DM condition can be stated as follows.

Theorem 1: (Sufficiency under DM) A set of periodic tasks that share lock-free objects on a uniprocessor can be scheduled under the DM scheme if, for every task $\tau_i$, there exists some $t \in (0, l_i]$ such that

\[ \sum_{j=1}^{i} \left\lceil \frac{t}{p_j} \right\rceil c_j + \sum_{j=1}^{i-1} \left\lceil \frac{t-1}{p_j} \right\rceil s \le t. \] □

Informally, this condition states that a task set is schedulable if, for each job of every task $\tau_i$, there exists a point in time $t$ between the release of that job and its deadline, such that the demand placed on the processor in the interval between the job's release and time $t$ is at most the available processor time in that interval. Demand in this interval can be broken into two components: demand due to job releases, ignoring object invocations (this is given by the first summation); and demand due to failed object updates, which is bounded by the number of preemptions by higher-priority tasks in the interval (this is given by the second summation). In comparing the above condition to the DM condition for independent tasks given in [1], we see that our condition essentially requires that the computation time of each task be "dilated" by the time it takes for one lock-free loop iteration.

We now turn our attention to the EDF scheme.

Theorem 2: (Sufficiency under EDF) A set of periodic tasks that share lock-free objects on a uniprocessor can be scheduled under the EDF scheme if

\[ \sum_{j=1}^{N} \frac{c_j + s}{p_j} \le 1. \] □

Informally, this condition states that a task set is schedulable if processor utilization is at most 1. As in the case of DM scheduling, this condition extends the corresponding condition for independent tasks [10] by requiring that the computation time of each task be dilated by the time it takes for one lock-free loop iteration.

The results presented above suggest a general strategy for determining the schedulability of tasks that share lock-free objects. First, determine a bound on demand due to failed updates over any interval of time. Then, modify existing scheduling conditions for independent tasks by incorporating this demand. Scheduling conditions derived in this manner are applicable not only for tasks that perform single-object updates, but also for tasks that perform multi-object transactions.

The bounds on failed updates given in the theorems above are not very tight; these bounds are based on the conservative assumption that each task interferes with all lower-priority tasks that execute concurrently. Scheduling conditions that incorporate tighter bounds will be given in a forthcoming paper. In these new conditions, tighter bounds are obtained by more accurately accounting for interferences that can occur between tasks. Also, the new conditions allow different lock-free accesses to have different loop costs. When implementing transactions, significant variations in loop costs are likely.

2.2 Hardware Support for Lock-Free Synchronization

A possible criticism of the lock-free algorithm given in Figure 1 is that it requires a strong synchronization primitive, namely CAS2. The fact that many lock-free object implementations are based on strong synchronization primitives is no accident. Herlihy has shown that such strong primitives are, in general, necessary for these implementations [5]. Herlihy's results are based upon a categorization of objects by "consensus number". An object has consensus number $N$ if it can be used to solve $N$-process consensus, but not $(N+1)$-process consensus, in a wait-free manner.^2 Herlihy's results show that primitives with unbounded consensus numbers are necessary in general-purpose lock-free (or wait-free) object implementations. Such objects are called universal because they can be used to implement any other object. Herlihy's consensus-number hierarchy is shown in Figure 2.

^2 Wait-freedom is a strong form of lock-freedom that precludes all waiting dependencies among tasks, including potentially unbounded operation retries.

    Consensus Number    Object
    1                   read/write registers
    2                   test&set, swap, fetch&add, queue
    ...                 ...
    2n-2                n-register assignment
    ...                 ...
    ∞                   memory-to-memory move and swap, compare&swap, load-linked/store-conditional

Figure 2: Herlihy's consensus-number hierarchy.

Although Herlihy's hierarchy is of fundamental importance for truly asynchronous systems, Ramamurthy, Moir, and Anderson recently showed that, for uniprocessor-based real-time systems, Herlihy's hierarchy collapses, i.e., reads and writes can be used to implement any lock-free object [13]. The basis for this result is the realization that certain task interleavings cannot occur within real-time systems. In particular, if a task $\tau_i$ accesses an object in the time interval $[t, t']$, and if another task $\tau_j$ accesses that object in the interval $[u, u']$, then it is not possible to have $t < u < t' < u'$ because the higher-priority task must finish its operation before relinquishing the processor. Requiring an object implementation to correctly deal with this interleaving is therefore pointless, because it cannot arise in practice. The distinction between traditional asynchronous systems, to which Herlihy's work is directed, and hard real-time systems is illustrated in Figure 3.

Figure 3: Line segments denote operations on shared objects with time running from left to right. Each level corresponds to operations by a different task. (a) Interleaved operations in an asynchronous system. Operations may overlap arbitrarily. (b) Interleaved operations in a uniprocessor real-time system. Two operations overlap only if one is contained within the other.

The results of [13] are based upon an execution model like that depicted in Figure 3(b). Two key assumptions underlie this model: (i) a task $\tau_i$ may preempt another task $\tau_j$ only if $\tau_i$ has higher priority than $\tau_j$; (ii) a task's priority can change over time, but not during any object access. Assumption (i) is common to all priority-driven scheduling policies. Assumption (ii) holds for most common policies, including RM, DM, and EDF scheduling. The only common scheduling policy that we know of that violates assumption (ii) is least-laxity-first (LLF) scheduling [12].

The collapse of Herlihy's hierarchy for hard real-time uniprocessor systems is established in [13] by giving a wait-free (and hence lock-free) algorithm that solves the consensus problem in such systems using only reads and writes. This algorithm is shown in Figure 4. In this algorithm, each task chooses a decision value by invoking the procedure decide.

    shared var Fin, Prp: valtype ∪ {⊥}
    private var v: valtype
    initially Fin = ⊥ ∧ Prp = ⊥

    procedure decide(val: valtype) returns valtype
    1: if Fin = ⊥ then
    2:    if Prp = ⊥ then
    3:       Prp := val;
    4:    if Fin = ⊥ then
    5:       v := Prp;
    6:       Fin := v;
    7: return Fin

Figure 4: Consensus using reads and writes.

Procedure decide uses two shared variables, Prp ("propose") and Fin ("final"). Each task that does not detect the input of another task in Prp or Fin writes its own value into Prp. Having ensured that some value has been proposed (lines 1 to 3), a task copies the proposed value to Fin, if necessary (lines 4 to 6). It is easy to see that no task returns before some task's input value is written into Fin, and that all tasks return a value read from Fin. It can be further shown that the first value written into Prp is the only value written into Fin. The correctness of the algorithm easily follows from this fact, yielding the following theorem. (It is easy to show that the algorithm of Figure 4 does not correctly solve the consensus problem in a truly asynchronous system consisting of two or more processes. Thus, it does not contradict the fact that reads and writes have consensus number 1 for such systems.)

Theorem 3: Consensus can be implemented with constant time and space using reads and writes on a hard real-time uniprocessor system. □

Given an implementation of consensus objects, any shared object can be implemented in a lock-free manner [5]. However, such implementations usually entail high overhead. More practical lock-free implementations are based on universal primitives such as CAS and load-linked/store-conditional (LL/SC) [6]. To enable such implementations to be used in real-time uniprocessor systems, Ramamurthy, Moir, and Anderson present two implementations of an object that supports Read and CAS. (LL/SC can be implemented using Read and CAS in constant time [2].) These implementations, which are summarized in the following theorems, use read/write and memory-to-memory Move instructions, respectively. Although Move is rare in multiprocessors, it is widely available on uniprocessors. For example, uniprocessor systems based on Intel's 80x86 processor and the Pentium line of processors support the move instruction. (In these theorems, $N$ denotes the number of tasks that share an object.)

Theorem 4: Read and CAS can be implemented in a wait-free manner on a hard real-time uniprocessor system with $O(N)$ time and space complexity using reads and writes. □

Theorem 5: Read and CAS can be implemented in a wait-free manner on a hard real-time uniprocessor system with constant time and $O(N)$ space complexity using Move. □

3 Lock-Free Transactions

In this section, we outline an implementation of lock-free transactions on memory-resident data. We assume that transactions are invoked by a collection of prioritized tasks executing on the same processor. Our implementation is mostly based on the universal lock-free constructions of large objects and multiple objects by Anderson and Moir [2,3].

In contrast to conventional schemes for concurrency control, when lock-free algorithms are used, transactions are executed as if they do not access any shared data, i.e., such transactions can be viewed as being independent. As a result, mechanisms are not needed to handle priority inversion, deadlock, or data conflicts, or to undo the effects of partially-completed transactions that are aborted.

In our lock-free implementation, which is shown in Figure 5, a transaction commits and validates at the same time by performing a MWCAS operation.^3 A transaction successfully completes, and the corresponding modifications to data take effect, only if the MWCAS operation succeeds. Thus, transactions can be aborted arbitrarily without executing expensive recovery procedures. This can be advantageous in situations that require transactions with firm deadlines to be aborted due to transient overload conditions. In such situations, transactions can be arbitrarily terminated by the system without fear of compromising data consistency.

^3 The MWCAS procedure takes four input parameters. The first parameter specifies the number of words on which a CAS operation is to be performed. The remaining parameters specify the words' addresses, old values, and new values, respectively. The operation returns true if a CAS operation can be successfully performed on all the words simultaneously, and returns false otherwise.

One possible criticism of our implementation is that the MWCAS primitive is not supported by the hardware in most systems. However, Anderson and Moir have shown that MWCAS can be implemented using the single-word CAS primitive in general asynchronous systems [3]. Furthermore, because real-time systems only allow a subset of the transaction interleavings in asynchronous systems, these constructions can be simplified for real-time systems using techniques similar to those presented in Section 2.2.

In our lock-free construction, all data is stored in a single array. However, we do not require the array to be stored in contiguous memory locations. Instead, we provide the "illusion" of a contiguous array. Before describing the code in Figure 5, we explain how this is achieved, and how the details of the implementation are hidden from the programmer.

The implemented array MEM is partitioned into $B$ blocks of size $S$. (Figure 6 depicts this arrangement for $B = 5$.) The first block contains array locations 0 through $S-1$, the second block contains locations $S$ through $2S-1$, and so on. (We assume that blocks have the same size only because it simplifies the presentation of the ideas in the paper.) A bank of pointers, one for each block, is used to point to the blocks that make up the array. In order to modify the contents of the array, a task makes a copy of each block to be changed, and then attempts to update the bank of pointers so that they point to the modified blocks.

When a transaction of task $p$ accesses a word in the array, say MEM[x], the block containing the xth word is identified. If $p$'s transaction writes into MEM[x], then $p$ must replace the corresponding block. The details of identifying blocks and replacing modified blocks are hidden from the programmer by means of the Read and Write routines, which perform all necessary address translation and record-keeping. These routines are called by the transaction in order to read or write an element of the MEM array. Thus, instead of writing "MEM[1] := MEM[10]", the programmer would write "Write(1, Read(10))".
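The address translation and block replacement performed behind Read and Write can be sketched as follows. This is a minimal single-task illustration, not the construction of Figure 5: the class `BlockedMemory` and its method names are hypothetical, version numbers and the sorted block list are omitted, and the pointer update that the paper performs atomically with MWCAS at commit time is shown here as a plain assignment. The block size of 4 words is an assumed value.

```python
class BlockedMemory:
    """A logically contiguous array stored as blocks reached through a bank of pointers."""

    def __init__(self, num_blocks, block_size):
        self.S = block_size
        # BLK: the blocks of the array followed by one spare copy block.
        # The array is initialised so that MEM[x] == x.
        self.blk = [[b * block_size + w for w in range(block_size)]
                    for b in range(num_blocks)]
        self.blk.append([0] * block_size)
        # BANK[i]: index in blk of the block currently holding the i-th
        # group of S words.
        self.bank = list(range(num_blocks))

    def read(self, x):
        """Return MEM[x] by translating x into (block index, offset)."""
        blkidx, offset = divmod(x, self.S)
        return self.blk[self.bank[blkidx]][offset]

    def write(self, x, val, copy_blk):
        """Write MEM[x] by installing a modified copy of its block.

        Returns the index of the displaced block, which can be reclaimed
        as the next copy block.
        """
        blkidx, offset = divmod(x, self.S)
        displaced = self.bank[blkidx]
        self.blk[copy_blk] = list(self.blk[displaced])  # copy the block
        self.blk[copy_blk][offset] = val                # modify only the copy
        self.bank[blkidx] = copy_blk                    # point the bank at the copy
        return displaced


if __name__ == "__main__":
    mem = BlockedMemory(num_blocks=5, block_size=4)
    spare = mem.write(1, mem.read(10), copy_blk=5)      # MEM[1] := MEM[10]
    print(mem.read(1), spare, mem.bank)
```

Because the original block is never modified in place, abandoning the write before the pointer update leaves the array unchanged, which is the property that lets a transaction be aborted without a recovery procedure.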
    type blktype = array[0..S−1] of wordtype;
         ptrtype = record addr: 0..B+NT−1; ver: integer end

    shared variable
      BANK: array[0..B−1] of ptrtype;       /* Bank of pointers to array blocks */
      BLK: array[0..B+NT−1] of blktype      /* Array and copy blocks */
    initially (∀k : 0 ≤ k …

    private variable
      addrlist, copy: array[0..T−1] of 0..B+NT−1;
      old_lst, ptrs: array[0..B−1] of ptrtype;
      dirty, touch: array[0..B−1] of boolean;
      src, dest: blktype;
      nw, dirtycnt: 0..T;
      i, blkidx: 0..B−1;
      oldval, newval: array[0..B−1] of ptrtype;
      blklist: sorted list of 0..B−1;
      env: jmp_buf
    initially (∀k : 0 ≤ k …

    procedure Read(address: 0..BS−1) returns wordtype
    1:  blkidx := address div S;
    2:  if ¬touch[blkidx] then
           touch[blkidx] := true;
           blklist.insert(blkidx);
           ptrs[blkidx] := BANK[blkidx];
    3:  v := BLK[ptrs[blkidx].addr][address mod S];
    4:  if BANK[blkidx].ver = ptrs[blkidx].ver then return v
        else longjmp(env, 1)

    procedure Write(address: 0..BS−1; val: wordtype)
    5:  blkidx := address div S;
    6:  if ¬touch[blkidx] then
           touch[blkidx] := true;
           blklist.insert(blkidx);
           ptrs[blkidx] := BANK[blkidx];
        if ¬dirty[blkidx] then
           dirty[blkidx] := true;
    7:     src := BLK[ptrs[blkidx].addr];
    8:     dest := BLK[copy[dirtycnt]];
           memcpy(dest, src, sizeof(blktype));
    9:     if BANK[blkidx].ver = ptrs[blkidx].ver then
              old_lst[blkidx] := ptrs[blkidx];
              ptrs[blkidx].addr := copy[dirtycnt];
              dirtycnt := dirtycnt + 1
           else longjmp(env, 1);
    10: BLK[ptrs[blkidx].addr][address mod S] := val

    procedure LF_Transaction(tr: function pointer)
        while true do
           dirtycnt, nw := 0, 0;
    11:    if setjmp(env) ≠ 1 then tr();
    12:    while ¬blklist.empty() do
              i := blklist.delete_list_head();
              addrlist[nw] := i;
              if dirty[i] then
                 oldval[nw] := old_lst[i];
                 newval[nw] := (ptrs[i].addr, ptrs[i].ver + 1);
                 dirty[i] := false
              else oldval[nw], newval[nw] := ptrs[i], ptrs[i];
              touch[i], nw := false, nw + 1
           od;
    13:    if MWCAS(nw, addrlist, oldval, newval) then
              for i := 0 to dirtycnt − 1 do copy[i] := old_lst[i].addr od;
              return
        od

Figure 5: Implementation of lock-free transactions (LF_Transaction) and the associated Read and Write procedures.

[Figure 6 diagram: a BANK of pointers refers to the current blocks (Block 1 through Block 5) of the MEM array, which is made up of S-word blocks. Modified block pointers refer to replacement blocks (copies of Blocks 1, 2, and 5); unmodified block pointers refer to the current blocks. Transaction T_1 writes block 1 and reads blocks 3 and 5; transaction T_2 writes block 2 and reads block 3; transaction T_3 writes block 5 and reads block 4.]

Figure 6: Implementation of the MEM array for lock-free transactions (depicted for B = 5).

In Figure 5, BANK is a $B$-word shared variable. Each element of BANK contains a pointer to a block of size $S$ and a version number for the pointer. (Actually, each pointer is an index into an array of blocks BLK.) The $B$ blocks pointed to by BANK constitute the current value of the array MEM. We assume that an upper bound $T$ is known on the number of blocks modified by any transaction. Because a task's transaction copies a block before modifying it, $T$ "copy" blocks are required per task. Therefore, a total of $B + NT$ blocks are used. These blocks are stored in the array BLK. Although blocks BLK[NT] to BLK[NT + B − 1] are the initial blocks of the array, and BLK[pT] to BLK[(p+1)T − 1] are task $p$'s initial copy blocks, the roles of these blocks are not fixed. If $p$ successfully completes a transaction by performing a MWCAS operation, then $p$ reclaims the replaced blocks as copy blocks. Thus, the copy blocks of one task may become blocks of the array MEM, and then later become copy blocks of another task.

Task $p$ performs a lock-free transaction by calling the LF_Transaction procedure, which repeatedly performs the transaction until it executes a successful MWCAS operation. The user-supplied transaction code accesses the MEM array in a sequential manner using the Read and Write procedures. The Read procedure first computes the index of the block containing the accessed word, and then marks the block as "touched". Before returning the value from the appropriate offset within that block, the block index is inserted into a sorted list blklist, and the block pointer is recorded in ptrs.

Like the Read procedure, the Write procedure first computes the block to be accessed and records it as "touched", if necessary. Then, if Write is modifying the block for the first time during the transaction, the block is copied into one of the task's copy blocks, and the block is marked as "dirty". The copy block is then made part of the local version of MEM by linking it into ptrs. Finally, the displaced block is recorded in old_lst for possible reclaiming later, and the appropriate word is modified in the local copy of the block.

The ver counter associated with each block pointer in BANK records the block's current "version" number.^4 If a transaction successfully replaces a modified block, then it increments the block's version number. Thus, a transaction determines whether the ith block has been changed by comparing the version number that it last read from BANK[i] to the current version number of BANK[i]. Note that if a successful transaction reads a block but does not modify it, then the block is marked as "touched" but not "dirty", and the subsequent successful MWCAS does not change the block pointer or version number.

^4 Our implementation uses unbounded counters to implement block version numbers. In a forthcoming paper, we show that these counters can be bounded.

A complication arises in our implementation when the BANK variable is modified by a higher-priority task's transaction during the execution of task $p$'s transaction, thereby causing $p$ to read inconsistent values from the MEM array. Because its MWCAS operation will subsequently fail, $p$ will not be able to install corrupted data. However, there is a risk that $p$'s sequential operation might cause an error, such as a division by zero or a range error. This problem is solved by ensuring that if task $p$ detects that the version number of one of the blocks accessed by it has been modified since its most recent access, then control is returned from the Read or Write procedure to line 12 in LF_Transaction using Unix-like setjmp and longjmp library calls. Task $p$ then "cleans up" by reinitializing blklist, dirty, and touch, fails the MWCAS operation, and retries the transaction. Transactions can take advantage of this mechanism by re-reading previously accessed blocks in order to fail early in the event that the block has been modified by another transaction.

An example transaction that updates the temperature display of a boiler is given in Figure 7. Under our implementation, the transaction is executed by calling LF_Transaction(update_display). As can be seen, transactions implemented under conventional schemes can be easily modified to work under our lock-free scheme.

    procedure update_display()
      local variable t: integer
      t := Read(Boiler_temp);
      if Read(Disp_temp) ≠ t then Write(Disp_temp, t)

Figure 7: An example transaction.

In our implementation, concurrent read operations do not interfere with one another. Also, concurrent transactions that modify disjoint sets of blocks do not interfere with one another, as illustrated in Figure 6. The figure depicts three concurrent transactions $T_1$, $T_2$, and $T_3$. Transactions $T_1$ and $T_2$ do not interfere with each other because neither of them modifies a block accessed by both. However, $T_3$ can potentially interfere with $T_1$ because $T_3$ modifies block 5, which is read by transaction $T_1$.

Observe that the scheduling conditions presented in Section 2.1 apply to task sets. These conditions can be applied directly to periodic (or sporadic) transactions even if the objects accessed by a transaction are not known in advance, i.e., a transaction can access data anywhere in the MEM array. In many applications, it should be possible to tighten these scheduling conditions by more carefully accounting for the kinds of conflicts that can arise among transactions. Note that a transaction $T_i$ can interfere with another transaction $T_j$ only if $T_i$ has higher priority than $T_j$ and $T_i$ writes a block that is accessed by $T_j$.

In some applications, it is essential to back up the database in stable storage. This is usually achieved by logging operations on the database. The techniques presented here have the potential to greatly simplify transaction logging, mainly because they do not require procedures for recovery from aborted transactions. When a task performs a transaction, it can update the database and the log file simultaneously (treating the log file as simply another block of memory to update). Because all transactions access the log file, a simplistic application of this approach would result in each transaction interfering with all lower-priority concurrent transactions, which would adversely impact schedulability. Log-file updates should therefore be handled differently from other memory accesses. In particular, if a transaction fails to update the log file, then it should retry only the log file update (using a very short retry loop like that in Figure 1), as opposed to retrying the entire transaction.

4 Concluding Remarks

The research outlined above leaves many opportunities for further research. Of foremost importance are experimental studies that compare lock-free transactions with more conventional implementations. It would also be interesting to determine if lock-free algorithms could be used in systems with disk-resident data. Finally, it would be interesting to investigate the applicability of these techniques within multiprocessors and distributed systems.

Acknowledgement: We thank Steve Goddard and the anonymous referees for their valuable comments.

References

[1] N. Audsley, A. Burns, M. Richardson, and A. Wellings, "Hard Real-Time Scheduling: The

[3] J. Anderson and M. Moir, "Universal Constructions for Large Objects", Proceedings of the Ninth International Workshop on Distributed Algorithms, Lecture Notes in Computer Science 972, Springer-Verlag, September 1995, pp. 168-182.

[4] J. Anderson, S. Ramamurthy, and K. Jeffay, "Real-Time Computing with Lock-Free Shared Objects", Proceedings of the 16th IEEE Real-Time Systems Symposium, 1995, pp. 28-37.

[5] M. Herlihy, "Wait-Free Synchronization", ACM Transactions on Programming Languages and Systems, Vol. 13, No. 1, 1991, pp. 124-149.

[6] M. Herlihy, "A Methodology for Implementing Highly Concurrent Data Objects", ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, 1993, pp. 745-770.

[7] J. Huang, J. Stankovic, K. Ramamritham, and D. Towsley, "Experimental Evaluation of Real-Time Optimistic Concurrency Control Schemes", Proceedings of the Seventeenth International Conference on Very Large Databases, 1991, pp. 35-46.

[8] H. Kung and J. Robinson, "On Optimistic Methods for Concurrency Control", ACM Transactions on Database Systems, Vol. 6, No. 2, 1981, pp. 213-226.

[9] J. Leung and J. Whitehead, "On the Complexity of Fixed-Priority Scheduling of Periodic, Real-Time Tasks", Performance Evaluation, Vol. 2, No. 4, 1982, pp. 237-250.

[10] C. Liu and J. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment", Journal of the ACM, Vol. 20, No. 1, 1973, pp. 46-61.

[11] H. Massalin, Synthesis: An Efficient Implementation of Fundamental Operating System Services, Ph.D. Dissertation, Columbia University, 1992.

[12] A. Mok, Fundamental Design Problems of Distributed Systems for the Hard Real-Time Environment, Ph.D. Thesis, MIT Laboratory for Computer Science, 1983.

[13] S. Ramamurthy, M. Moir, and J.
Log- le up dates should therefore b e Towsley, \Exp erimental Evaluation of Real-Time handled di erently from other memory accesses. In Optimistic Concurrency Control Schemes", Pro- particular, if a transaction fails to up date the log le, ceedings of the Seventeenth International Confer- then it should retry only the log le up date (using a enceonVery Large Databases , 1991, pp. 35-46. very short retry lo op like that in Figure 1), as opp osed to retrying the entire transaction. [8] H. Kung and J. Robinson, \On Optimistic Meth- o ds for Concurrency Control", ACM Transactions on Database Systems ,Vol. 6, No. 2, 1981, pp. 213- 4 Concluding Remarks 226. [9] J. Leung and J. Whitehead, \On the Complex- The research outlined ab oveleaves many opp ortunities ity of Fixed-PriorityScheduling of Perio dic, Real- for further research. Of foremost imp ortance are ex- Time Tasks", Performance Evaluation ,Vol. 2, No. p erimental studies that compare lo ck-free transactions 4, 1982, pp. 237-250. with more conventional implementations. It would also be interesting to determine if lo ck-free algorithms could [10] C. Liu and J. Layland, \Scheduling Algorithms for b e used in systems with disk-resident data. Finally, multiprogramming in a Hard Real{Time Environ- it would b e interesting to investigate the applicabil- ment", Journal of the ACM ,Vol. 30, No. 1, 1973, ity of these techniques within multipro cessors and dis- pp. 46-61. tributed systems. [11] H. Massalin, Synthesis: An Ecient Implementa- Acknowledgement: We thank Steve Go ddard and the tion of Fundamental Operating System Services , anonymous referees for their valuable comments. Ph.D. Dissertation, Columbia University,1992. [12] A. Mok, Fundamental Design Problems of Dis- References tributed Systems for the HardReal-Time Environ- ment , Ph.D. Thesis, MIT Lab oratory for Com- [1] N. Audsley, A. Burns, M. Richardson, and A. puter Science, 1983. Wellings, \Hard Real-Time Scheduling: The [13] S. Ramamurthy, M. Moir, and J. 
Anderson, \Real- Deadline Monotonic Approach", Proceedings of the Time Ob ject Sharing with Minimal System Sup- 8th IEEE Workshop on Real-Time Operating Sys- port", Proceedings of the 15th Annual ACM Sym- tems and Software , Oxford, UK, 1992, pp. 127- posium on Principles of Distributed Computing , 132. to app ear. [2] J. Anderson and M. Moir, \Universal Construc- [14] L. Sha, R. Ra jkumar, S. Son, and C. Chang, \A tions for Multi-Ob ject Op erations", Proceedings of Real-Time Lo cking Proto col", IEEE Transactions the 14th Annual ACM Symposium on Principles of on Computers ,Vol. 40, No. 7, 1991, pp. 793-800. Distributed Computing , 1995, pp. 184-193. 8
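As a supplementary illustration of the copy-on-write Write procedure and the per-block version counters described earlier, the following Python sketch may help. It is not the authors' implementation: the names Store, Txn, BLOCK_SIZE, begin, commit, and demo are hypothetical, only ptrs, old_lst, and ver echo identifiers from the text, and the atomic commit step of a real lock-free implementation is not modelled. A plain single-threaded version check stands in for it, with interleaving simulated by hand. The scenario mirrors the description of Figure 6: T_1 and T_2 touch disjoint blocks, while T_3 writes block 5, which T_1 reads.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # words per block (illustrative value, not from the paper)


@dataclass
class Txn:
    """Hypothetical per-transaction state."""
    ptrs: dict = field(default_factory=dict)      # block index -> local view of block
    seen_ver: dict = field(default_factory=dict)  # block index -> version at first touch
    dirty: set = field(default_factory=set)       # blocks copied and modified locally
    old_lst: list = field(default_factory=list)   # displaced blocks, reclaimable later


class Store:
    """Block-structured memory with one version counter per block."""

    def __init__(self, n_blocks):
        self.blocks = [[0] * BLOCK_SIZE for _ in range(n_blocks)]
        self.ver = [0] * n_blocks

    def begin(self):
        return Txn()

    def _touch(self, t, b):
        # Record the block as "touched": remember its pointer and version once.
        if b not in t.seen_ver:
            t.seen_ver[b] = self.ver[b]
            t.ptrs[b] = self.blocks[b]

    def read(self, t, addr):
        b, off = divmod(addr, BLOCK_SIZE)
        self._touch(t, b)
        return t.ptrs[b][off]

    def write(self, t, addr, val):
        b, off = divmod(addr, BLOCK_SIZE)
        self._touch(t, b)
        if b not in t.dirty:
            # First write to this block: copy it and link the copy into ptrs.
            t.old_lst.append(t.ptrs[b])
            t.ptrs[b] = list(t.ptrs[b])
            t.dirty.add(b)
        t.ptrs[b][off] = val

    def commit(self, t):
        # Assumption: this whole step would be atomic in a real implementation.
        if any(self.ver[b] != v for b, v in t.seen_ver.items()):
            return False  # a touched block changed; the transaction must retry
        for b in t.dirty:
            self.blocks[b] = t.ptrs[b]
            self.ver[b] += 1  # only replaced blocks get a new version
        return True


def demo():
    s = Store(8)
    t1, t2, t3 = s.begin(), s.begin(), s.begin()
    s.read(t1, 20)       # T1 reads block 5
    s.write(t1, 0, 7)    # T1 writes block 0
    s.write(t2, 8, 1)    # T2 writes block 2 (disjoint from T1)
    s.write(t3, 21, 9)   # T3 writes block 5 (read by T1)
    c2, c3, c1 = s.commit(t2), s.commit(t3), s.commit(t1)
    t1 = s.begin()       # T1 retries from scratch
    s.read(t1, 20)
    s.write(t1, 0, 7)
    return c2, c3, c1, s.commit(t1)


print(demo())
```

In this sketch T_2 and T_3 commit, T_1 fails its first commit because block 5 changed after T_1 read it, and the retried T_1 then succeeds. Committed blocks are never modified in place, which is what keeps a transaction's recorded pointer consistent with its recorded version.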