Software

Nir Shavit Dan Touitou

TelAviv University TelAviv University

Abstract

As we learn from the literature exibilityincho osing synchronization op erations greatly sim

plies the task of designing highly concurrent programs Unfortunately existing hardware is in

LinkedStore Conditional op eration on a single word exible and is at b est on the level of a Load

Building on the hardware based transactional synchronization metho dology of Herlihy and Moss

we oer softwaretransactional memory STManovel software metho d for supp orting exible

transactional programming of synchronization op erations STM is nonblo cking and can b e imple

mented on existing machines using only a Load LinkedStore Conditional op eration We use STM

to provide a general highly concurrent metho d for translating sequential ob ject implementations to

nonblo cking ones based on implementing a k word compareswap STMtransaction Empirical

evidence collected on simulated multipro cessor architectures shows that the our metho d always

outp erforms all the nonblo cking translation metho ds in the style of Barnes and outp erforms Her

lihys translation metho d for suciently large numb ers of pro cessors The key to the eciency

of our softwaretransactional approachisthatunlike Barnes style metho ds it is not based on a

costly recursive helping p olicy

Contact Author Email shanirtheorylcsmitedu

y

A preliminary version of this pap er app eared in the th ACM Symp osium on the Principles of Distributed

Computing Ottowa Ontario Canada

Intro duction

A ma jor obstacle on the way to making multipro cessor machines widely acceptable is the diculty

of programmers in designing highly concurrent programs and data structures Given the growing

realization that unpredictable delay is an increasingly serious problem in mo dern multipro cessor

architectures we argue that conventional techniques for implementing concurrent ob jects by means

of critical sections are unsuitable since they limit parallelism increase contention for memory

and interconnect and make the system vulnerable to timing anomalies and pro cessor failures

The key to highly concurrent programming is to decrease the numb er and size of critical sections

amultipro cessor program uses p ossibly eliminating critical sections altogether by constructing

classes of implementations that are nonblocking As we learn from the literature

exibilityincho osing the synchronization op erations greatly simplies the task of designing non

blo cking concurrent programs Examples are the nonblo cking datastructures of Massalin and

Pu which use a CompareSwap on twowords Andersons parallel path compression on

lists which uses a sp ecial Splice op eration the counting networks of which use combination of

FetchComplement and FetchInc Israeli and Rapp op orts Heap which can b e implemented

using a threeword CompareSwap and many more Unfortunately most of the currentorsoon

Load LinkedStore Conditional to b e develop ed architectures supp ort op erations on the level of a

op eration for a single word making most of these highly concurrent algorithms impractical in the

near future

Bershad suggested to overcome the problem of providing ecient programming primitives on

existing machines by employing op erating system supp ort Herlihy and Moss have prop osed an

ingenious hardware solution transactional memory By adding a sp ecialized asso ciative cache and

making several minor changes to the cache consistency proto cols they are able to supp ort a exible

transactional language for writing synchronization op erations Any synchronization op eration can

b e written as a transaction and executed using an optimistic algorithm built into the consistency

proto col Unfortunately though this solution is blocking

This pap er prop oses to adopt the transactional approach but not its hardware based implemen

tation Weintro duce softwaretransactional memory STM a novel design that supp orts exible

transactional programming of synchronization op erations in software Though we cannot aim for

the same overall p erformance our software transactional memory has clear advantages in terms of

applicabili tytotodays machines p ortability among machines and resiliency in the face of timing

anomalies and pro cessor failures

We fo cus on implementations of a software transactional memory that supp ort static transactions

that is transactions which access a predetermined sequence of lo cations This class includes most

of the known and prop osed synchronization primitives in the literature

STM in a nutshell

In a nonfaultyenvironment the way to ensure the atomicity of the op erations is usually based on

lo cking or acquiring exclusively ownerships on the memory lo cations accessed by an op eration Op

If a transaction cannot capture an ownerships it fails and releases the ownerships already acquired

Otherwise it succeeds in executing Op and frees the ownerships acquired To guarantee liveness

one must rst eliminate deadlo cks which for static transactions is done by acquiring the ownerships

needed in some increasing order In order to continue ensuring liveness in a faultyenvironment

wemust make certain that every transaction completes even if the pro cess which executes it has

b een delayed swapp ed out or crashed This is achieved by a helping metho dology forcing other

transactions which are trying to capture the same lo cation to help the owner of this lo cation to

complete its own transaction The key feature in the transactional approach is that in order to

free a lo cation one need only help its single owner transaction Moreover one can eectively avoid

the overhead of co ordination among several transactions attempting to help release a lo cation by

employing a reactive helping p olicy whichwe call nonredundanthelping

SequentialtoNonBlo cking Translation

One can use STM to provide a general highly concurrent metho d for translating sequential ob ject

implementations into nonblo cking ones based on the caching approachof The approachis

straightforward use transactional memory to to implementany collection of changes to a shared

ob ject p erforming them as an atomic kword CompareSwap transaction see Figure on the

desired lo cations The nonblo cking STM implementation guarantees that some transaction will

always succeed

Herlihyin referred to in the sequel as Herlihys method was the rst to oer a general

transformation of sequential ob jects into nonblo cking concurrent ones According to his metho d

ology up dating a data structure is done by rst copying it into a new allo cated blo ckofmemory

making the changes on the new version and tentatively switching the p ointer to the new data struc

LinkedStore Conditional atomic op erations Unfortunately ture all that with the help of Load

Herlihys method do es not provide a suitable solution for large data structures and like the stan

dard approachoflocking the whole ob ject do es not supp ort concurrent up dating Alemanyand

Felten and LaMarca suggested to improve the eciency of this general metho d at the price

of lo osing p ortabilityby using op erating system supp ort making a set of strong assumptions on

system b ehavior

Toovercome the limitations of Herlihys method Barnes in intro duced his caching metho d

that avoids copying the whole ob ject and allows concurrent disjoint up dating A similar approach

was indep endently prop osed byTurek Shasha and Prakash According to Barnes a pro cess

rst simulates the execution of the up dating in its private memory ie reading a lo cation for the

rst time is done from the shared memory but writing is done into the private memory Then

the pro cess uses an nonblo cking kword ReadModifyWrite atomic op eration whichchecks if the

values contained in the memory are equivalent tothethevalue read in the cache up date If this

is the case the op eration stores the new values in the memory Otherwise the pro cess restarts

from the b eginning Barnes suggested to implement the kword ReadMo difyWrite bylocking in

ascending order of their key the lo cations involved in the up date executing the op eration and

after executing the op eration needed releasing the lo cks The key to achieving the nonblo cking

resilientbehavior in the caching approachof is the cooperative method whenever a pro cess

needs a lo cation already lo cked by another pro cess it helps the lo cking pro cess to complete its own

op eration and this is done recursively along the dep endency chain Though Barnes and Turek

Shasha and Prakash are vague on sp eci implementation details a recent pap er by Israeli and

Rapp op ort gives using the co op erative metho d a clean and streamlined implementation of

a nonblo cking kword CompareSwap using Load LinkedStore Conditional However as our

empirical results suggest b oth the general metho d and its sp ecic implementation havetwo ma jor

drawbacks whichareovercome by our STM based translation metho d

The cooperative method has a recursive structure of helping which frequently causes pro cesses

to help other pro cesses which access a disjoint part of the data structure

Unlike STMs transactional kword CompareSwap op erations which mostly fail on the

transaction level and are thus not help ed a high p ercentage of co op erative kword Com

pareSwap op erations fail but generate contention since they are nevertheless help ed by other

pro cesses

Take for example a pro cess P which executes a word CompareSwap on lo cations a and b

Assume that some other pro cess Q already owns b According to the co op erative metho d P

rst helps Q complete its op eration and only then acquires b and continues on its own op eration

However in many cases P s CompareSwap will not change the memory since Q changed b after

will havetoretry All the pro cesses waiting for lo cation a will haveto P already read it and P

rst help P then Q and again P when in anycase P s op eration will likely fail Moreover after

P has acquired b all the pro cesses requesting b will also redundantly help to P

On the other hand if P executes the word CompareSwap as an STM transaction P will fail

to acquire bhelp Q release a and restart The pro cesses waiting for a will have to help only P

The pro cesses waiting for b will not havetohelp P FinallyifQ hasnt changed b P will most

likely nd the value of b in its own cache

Empirical Results

To make sequentialtononblo cking translation metho ds acceptable one needs to reduce the p erfor

mance overhead one has to pay when the system is stable nonfaulty We present see Section

the rst exp erimental comparison of the p erformance under stable conditions of the translation

techniques cited ab ove We use the well accepted Proteus Parallel Hardware Simulator

We found that on a simulated Alewife cachecoherent distributed sharedmemory machine

as the p otential for concurrency in accessing the ob ject grows the STM nonblo cking translation

metho d outp erforms b oth Herlihys method and the cooperative metho d Unfortunately our ex

p eriments show that in general STM and other nonblo cking techniques are inferior to standard

nonresilient lo ckbased metho ds such as queuelo cks Results for a shared bus architecture

were similar in avor

In summary STM oers a novel software package of exible co ordinationop eration for the design

of highly concurrent shared ob jects which ensures resiliency in faulty runs and improved p erfor

mance in nonfaulty ones The following section intro duces STM In Section we describ e our

implementation and and provide a sketch of the correctness pro of Finally in Section we present

our empirical p erformance evaluation

Transactional Memory

We b egin by presenting softwaretransactional memory avariant of the transactional memory of

A transaction is a nite sequence of lo cal and shared memory machine instructions

Readtransactional reads the value of a shared lo cation into a lo cal register

Writetransactional stores the contents of a lo cal register into a shared lo cation

transactional and The data set of a transaction is the set of shared lo cations accessed by the Read

transactional instructions Any transaction may either fail or complete successfully in which Write

case its changes are visible atomically to other pro cesses For example dequeuing a value from the

head of a doubly linked list as in Figure may b e p erformed as a transaction If the transaction

terminates successfully it returns the dequeued item or an Empty value

A kword CompareSwap transaction as in Figure is a transaction which gets as parameters

the data set its size and twovectors Old and New of the data sets size A successful kword

CompareSwap transaction checks whether the values stored in the memory are equivalent to old

In that case the transaction stores the New values into the memory and returns a CSSuccess

value otherwise it returns CSFailure

Dequeue

BeginTransaction

DeletedItem ReadtransactionalHead

if DeletedItem Null

ReturnedValue Empty

else

WritetransactionalHeadDeletedItemNext

if DeletedItemNext Null

WritetransactionalTailNull

ReturnedValue DeletedItemValue

EndTransaction

end Dequeue

Figure A Non Static Transaction

k word CSSizeDataSetOldNew

BeginTransaction

for i to Size do

if ReadtransactionalDataSeti  Oldi

ReturnedValue CSFailure

ExitTransaction

for i to Size do

Writetransactional DataSetiNewi

ReturnedValue CSSuccess

EndTransaction

word CS end k

Figure A Static Transaction

A softwaretransactional memory STM is a shared ob ject whichbehaves like a memory that

supp orts multiple changes to its addresses by means of transactions A transaction is a

of control that applies a nite sequence of primitive op erations to memory The basic correctness

requirement for a STM implementation is linearizability every concurrent history is equiva

lent to some legal sequential history which which is consistent with the realtime order induced

by the concurrent history

A static transaction is a sp ecial form of transaction in which the data set is known in advance

and can thus b e thought of as an atomic pro cedure which gets as parameters the data set and

a deterministic transition function which determines the new values to b e stored in the data set

This pro cedure up dates the memory and returns the previous value stored This pap er we will

fo cus on implementations of a transactional memory that supp orts static transactions a class that

includes most of the known and prop osed synchronization op erations in the literature The kword

CompareSwap transaction in Figure is an example of a static transaction while the Dequeue

pro cedure in Figure is not

An STM implementation is waitfree if any pro cess which rep eatedly executes the transaction

terminates successfully after a nite numb er of attempts It is nonblocking if the rep eated execution

of some transaction by a pro cess implies that some pro cess not necessarily the same one and

with a p ossibly dierent transaction will terminate successfully after a nite numb er of attempts

in the whole system An STM implementation is swap tolerantifitisnonblocking under the

assumption that a pro cess cannot b e swapp ed out innitely many times The hardware implemented

transactions of could in theory rep eatedly fail forever if pro cesses try to write two lo cations in

dierent order as when up dating a doubly linked list However if used only for static transactions

their implementation can b e made swaptolerant but not nonblo cking since a single pro cess which

is rep eatedly swapp ed during the execution of a transaction will never terminates successfully

The System Mo del

Our computation mo del follows Herlihy and Wing and can also b e cast in terms of the IO

automata mo del of Lynch and Tuttle A concurrent system consists of a collection of pro

cesses Pro cesses communicate through shared data structures called objectsEach ob ject has a

set of primitive operations that provide the only means to manipulate that ob ject Each pro cess

is a sequential thread of control which applies a sequence of op erations to ob jects by issuing

an invo cation and receiving the asso ciated resp onse A history is a sequence of invo cations and

resp onses of some system execution Each history induces a realtime order of op erations 

where an op eration A precedes another op eration B if As resp onse o ccurs b efore Bs invo cation

Two op erations are concurrent if they are unrelated by the realtime order A sequential history

is a history in which eachinvo cation is followed immediately by its corresp onding resp onse The

sequential specication of an ob ject is the set of legal sequential histories asso ciated with it The

basic correctness requirement for a concurrent implementation is linearizability every con

current history is equivalent to some legal sequential history which which is consistent with the

partial realtime order induced by the concurrent history In a linearizable implementation op era

tions app ear to take eect atomically at some p ointbetween their invo cation and resp onse In our

mo del every shared memory lo cation L of a multipro cessor machines memory is formally mo deled

as an ob ject which provides every pro cessor i n four t yp es of p ossible op erations with the

following sequential sp ecication

i

W r ite L v writes the value v to lo cation L

i

Read L reads lo cation L and returns its value v

i

i

Linked L reads lo cation L and returns its value v Marks lo cation L as read by i Load

i

Store Conditional L v iflocation L is marked as read by i the op eration writes the value v to

L erases all existing marks by other pro cessors on L and returns a success status Otherwise

returns a failure status

The a more detailed formal sp ecication of these op eration can b e found in

A Sequential Sp ecication of STM

The following is the sequential sp ecication of STM Let L  N b e a set of lo cations A memory

state is a function s L  V which returns for each lo cation of L avalue from some set V Let S

b e the set of all p ossible memory states A transition function t S  S isacomputable function

which gets as a parameter a state and returns a new state Given a subset ds  Lwesay that

a transition function t is ds dependent if the following conditions hold a for every state s and

every lo cation l if l  ds then sl tsl bif s and s are two states st for every l  ds



s l s l then for every l  ds ts l ts l

 

Given a set L of lo cations a Static Transactional Memory over L is a concurrent ob ject which

provides every pro cess i with a Tran iDataSet f r status op eration Its has as input DataSet

a subset of L and f a transition function whichisD ataS et dep endent It returns a function

r D ataS et  V and a b o olean value statusWe omit the subscript of a Tran op eration when the

id of the pro cessor p erforming the op eration is unimp ortant

Let h o o o b e a nite or innite sequential history where o is the ith op eration executed

  i

m m m

For every nite prex h o o o o of hwe dene the terminating state of h TSh inthe

  m

m

following inductiveway If m then TSh e where e is the function el  for every l  L

m

o TranDSfr status and let h o o o o If If m then assume wlog that

m   m

m m m m

status success then TSh f TSh otherwise TSh TSh

We can now pro ceed to dene the sequential sp ecication of the static transactional memory

  

Given a function f A  B and A  Awe dene the restriction of f on A denoted f  A to

   

b e the function f A  B st a  A f a f a We require that a correct implementation of

an STM ob ject meet the following sequential sp ecication Memory

Ownerships

status status status version version version description description description size size . . . . size OldValues OldValues OldValues

Rec

Rec12 Rec n

Figure STM implementation shared data structures

Denition The set of sequential histories such that for each nite or innite history h

o o o it is the case that for al l k ifo TranD ataS et f r status and status success

  k

then r TSo o o o  DataS et

  k

A NonBlo cking Implementation of STM

We implement a nonblo cking static STM of size M using the following data structures See

Figure

MemoryM a vector whichcontains the data stored in the transactional memory

Ownerships M a vector which determines for anycellin Memory which transaction owns it

Each pro cess i keeps in the shared memory a record p ointed to byRec that will b e used to store

i

information on the current transaction it initiated It has the following elds Size which contains

the size of the data set A dd avector whichcontains the data set addresses in increasing order

OldValues a vector whose cells are initialized to Nul l at the b eginning of every transaction In

case of a successful transaction this vector will contain the former values stored in the involved

StartTransactionDataSet

InitializeRec DataSet

i

Rec stable True

i

TransactionRec Rec versionTrue

i i

Rec stable False

i

Rec version

i

if Rec status Success then

i

return SuccessRec OldValues

i

else

return Failure

Figure StartTransaction

lo cations The other elds are used in order to synchronize b etween the owner of the record and

the pro cesses whichmayeventually help its transactions Version is an integer initially which

determines the instance numb er of the transaction This eld is incremented every time the pro cess

terminates a transaction

A pro cess i initiates the execution of a transaction by calling the Transaction routine of Figure

Transaction rst initializes the pro cesss record and then declares the record as stable ensuring that

any pro cessors helping the transaction complete will read a consistent description of the transaction

After executing the transaction the pro cess checks if the transaction has succeeded and if so returns

the content of the vector OldValues

The pro cedure Transaction Figure gets as parameters Rec the records address of the trans

action executed and a b o olean value IsInitiator indicating whether Transaction was called by the

initiating pro cess or by a helping pro cess The parameter version contains the instance number

of the record executed This parameter is not used when the routine is called by the initiating

pro cess since the version eld will never change during the call Transaction rst tries to acquire

ownership on the data sets lo cations by calling AquireOwnership If it fails to do so then up on

returning from AquireOwnership the status eld will b e set to Failurefailadd If the status eld

alue yet the pro cess sets it to Success In case of success the pro cess writes do esnt haveav

the old values into the transactions record calculates the new values to b e stored writes them

to the memory and releases the ownerships Otherwise the status eld contains the lo cation that

caused the failure The pro cess rst releases the ownerships that it already owns and in the case

that it is not a helping pro cess it helps the transaction whichowns the failing lo cation Helping is

p erformed only if the help ed transactions record is in a stable state

The use of this unb ounded eld can b e avoided if an additional Validate op eration is available

TransactionrecversionIsInitiator

AcquireOwnershipsrecversion

statusfailadd LLrecstatus

if status Null then

if version  recversion then return

SCrecstatusSuccess

statusfailadd LLrecstatus

if status Success then

AgreeOldValuesrecversion

NewValues CalcNewValuesrecOldValues

UpdateMemoryrecversionNewValues

ReleaseOwnershipsrecversion

else

ReleaseOwnershipsrecversion

if IsInitiator then

failtran Ownershipsfailadd

if failtran Nob o dy then

return

else

failversion failtranversion

if failtranstable

TransactionfailtranfailversionFalse

Figure Transaction

AcquireOwnershipsrecversion

transize recsize

for j to size do

while true do

lo cation recaddj

if LLrecstatus  Null then return

owner LL OwnershipsrecAddj

if recversion version return

if owner rec then exit while lo op

if owner Nob o dy then

if SCrecstatus Null then

if SCOwnershipslo cationrec then

exit while lo op

else

if SCrecstatus Failurej then

return

ReleaseOwnershipsrecversion

size recsize

for j to size do

lo cation recAddj

if LLOwnershipslo cation rec then

if recversion version then return

SCOwnershipslo cationNob o dy

AgreeOldValuesrecversion

size recsize

for j to size do

lo cation recAddj

if LLrec OldValueslo cation  Null then

if recversion version then return

SCrecOldValueslo cationMemorylo cati on

UpdateMemoryrecversionnewvalues

size recsize

for j to size do

lo cation recAddj

oldvalue LLMemorylo cation

if recAllWritten then return

if version recversion then return

if oldvaluenewvaluesj then

SCMemorylo cationnewvaluesj

if not LLrecAllWritten then

if version recversion then return

SCrecAllWrittenTrue

Figure Ownerships and Memory access

Since AcquireOwnerships of Figure may b e called either by the initiator or by the helping pro

cesses wemust ensure that all pro cesses will try to acquire ownership on the same lo cations this

is done bychecking the version b etween the Load Linked and the Store Conditional instructions

from the moment that the status of the transaction b ecomes xed no additional ownerships are

allowed for that transaction The second prop erty is essential for proving not only atomicity but

also the nonblo cking prop ertyAny pro cess which reads a free lo cation will have b efore acquiring

ownership on it to conrm that the transaction status is still undecided This is done by writing

Conditional Nul l in the status eld This prevents any pro cess which read the with Store

lo cation in the past while it was owned by a dierent transaction to set the status to Failure

When writing the new values to the UpdateMemory as in Figure the pro cesses synchronize in

order to prevent a slow pro cess from up dating the memory after the ownerships have b een released

Todosoevery pro cess sets the Al lWritten eld to b e True after up dating the memory and b efore

releasing the ownerships

Correctness Pro of

Given a run we freely interchange b etween run and history of the STM implementation the nth

transaction execution of pro cess i is marked as T i n The transaction record for T i n is denoted

as R and by denition only pro cess i up dates R Itisthus clear that the number n in T i nis

i i

equal to the contentofR v er sion during T i ns execution The executing pro cesses of T i n

i

consist of pro cess i called the initiator and the helping pro cesses those executing Transaction

with parameters R nFalse

i

The following are the denitions of the register op erations where the sup erscript of an op eration

marks the id of the pro cess which executed it and the subscript marks the transaction instance

that the pro cess executes Sometimes when subscript and sup erscript are not needed we will omit

them

i

W variablevalue Pro cess i p erforms a Write op eration on variable with value while executing

T

transaction T

i

R variablevalue Pro cess i p erforms a Read op eration on variable which returns value while

T

executing transaction T

i

Linked op eration on variable which returns value LL variablevalue Pro cess i deforms is a Load

T

while executing T

i

Conditional op eration on variable with SC variablevalue Pro cess i p erforms a successful Store

T

value while executing T

i i

R variable is a short form for R variablevalue value for some predicate

T T

Clearlyany implementation of transactional memory which is based on an ownerships p olicy

only without helping will satisfy the linearizabil ity requirement if a single pro cess is able to lo ck

all the needed memory lo cations it will b e able to up date the memory atomically Consequently

in order to prove the linerizabil ity of our implementation we will have mainly to show that the

fact that many pro cesses may execute the same transaction will b ehaveasiftheywere a single

pro cess running alone In the following pro of we will rst show that all the executing pro cesses of a

transaction p erform the same transaction that the initiator intended Then we will prove that all

the executing pro cesses agree on the nal status of the transaction Finallywe will demonstrate

that the executing pro cess of a successful transaction will up date the memory correctly

The nonblo cking prop erty of the implementation will b e established by showing rst that no

executing pro cess will ever b e able to acquire an ownership after the transaction has failed and

then showing that since lo cations are acquired in increasing order some transaction will eventually

succeed

Linearizability

We rst show that although pro cess i uses the same record for all its transactions and mayeven

tually change it while some executing pro cess reads its content all the executing pro cesses of a

transaction read a consistent description of what is is supp osed to do

Claim Given an execution r of the STM implementation the helping processes of a transaction

T i n in r read the same data set vector which was storedbyiAny executing of T i n

which read a dierent data set wil l not update any of the the shared data structures

Pro of Assume bywayofcontradiction that there is a helping pro cess j of T i n whichread

a dierent description of the transaction That means that for some lo cation a from T i ns

j

i

description R a x and W a y but x  y By the algorithm only pro cess i up dates the

T in T in

i

description elds in Rec and it do es it only once p er transaction Assume rst that W a y 

i

T in

j

i

y there is some write op eration W a x st R a x Since x 

T in 

T in

j

i i

a x  R a x a y  W W

T in  T in

T in



where n nSince

j

i i

a x W Rec v er sion n  W a x  R

i

T in T in 

T in

and all the helping pro cesses of T n i compare b etween n and Rec v er sion b efore executing

i

a SC op eration j will not from this p oint on up date any shared data structure Assume that

j

i

a x  W a y By the algorithm lines in the Transaction pro cedure R

T in

T in

j j j

R Rec v er sion n  R Rec stabl e tr ue  R a x

i i

T in T in T in

and in that case

j

i i

stabl e tr ue  W a y W Rec v er sion n  W Rec

i i

T in T in

T in

whichisacontradiction to the description of the StartTransaction pro cedure

Next we show that all the executing pro cesses of a transaction agree on its terminating status

Claim Assume that i and j aretwoexecuting processes of some transaction T i nIfi and

j read dierent values of the terminating status line in Transaction procedure at least one of

them wil l henceforth not update the shared data structures

Pro of Assume bywayofcontradiction that

j

k

Rec status Failure and R Rec status Success R

i i

T in

T in

and assume wlog that

j

k

R Rec status Failure  R Rec status Success

i i

T in

T in

In that case there is some pro cess z such that

k k

R Rec status Failure  LL Rec status Nul l 

i i

T in

j

k

SC Rec status Success  R Rec status Success

i i

T in

Since i is the only pro cess which initializes Rec status it follows that

i

k z

Nul l  R Rec status Failure  W Rec status

i i

T in

j

z

SC Rec status Success  R Rec status Success

i i

T in

By the algorithm

k k k

R Rec v er sion n  R Rec stabl e tr ue  R Rec status Failure

i i i

T in T in T in

and

i i

W Rec v er sion n  W Rec status Nul l

i i

wemay therefore conclude that pro cess j will not up date the shared data structures anymore after

j

Rec status Success executing R

i

T in

Thanks to Claim wecannow dene a transaction as successful if its terminating status is

Success and failing otherwise From the algorithm and Claim it is clear that executing pro cesses

of failing transactions will never change the Memory data structure

Claim Every successful transaction has

a only one executing process which writes Success as the terminating status of the transaction

b only one executing process who sets the Al lWritten eld to true

Pro of Assume that during a successful transaction T i n one of those elds f was up dated

bytwo executing pro cesses k and j Both have executed

R Rec v er sion n  LL f Nul l  R Rec v er sion n

T in i T in T in i

LLf Nul l  SC f v  W Rec stabl e T r ue 

T in i

j

k

f v By the sp ecication of the f v  SC Assume wlog that SC

T in T in

Load LinkedStore Conditional op eration

j j

k k

LL f Nul l  SC f v  LL f Nul l  SC f v

T in T in

T in T in

But since only pro cess i writes Nul l into eld f itfollows that

i i i

f Nul l W Rec stabl e F al se  W Rec v er sion n  W

i i

T in

Pro cess j thus read Rec stabl e as false and therefore should not havehelped T i n

i

For any successful transaction T i n let SU n i b e the SC op eration which has set T i ns

status to Success and let AW n i b e the SC op eration which has set the Al lWritten eld to True

By the ab ove claims those op erations are well dened The following lemma shows that successful

transactions access the memory atomically

Lemma For every n and every process iifT i n is a successful transaction then

a between SU n i and AW n i al l the entries in the Ownerships vector from T i ns data set

contain Rec and

i

i

b at W Rec v er sion n no entry contains Rec

i i

T in



Pro of The pro of is by joint induction on n Assume that the prop erties hold for n n and let

us prove them for n

To prove a consider j the pro cess which executed SU n i By the algorithm j has p erformed

j

j

Ownerships x Rec  R Rec v er sion n 

i i

Tranin

j

 Ownerships x Rec  SU n i

l i

Tranin

where is either a SC or R op eration and x x are Trani ns data set lo cations Assume

l

that for some lo cation x at SU n i Ownerships x dier from those of Rec By the algorithm

r j i

this may happ en only if

j

k

Ownerships x Rec  SC Ownerships x Nul l

r i r

Tranin

 

ownerships during Trani n forn nMore Therefore there is some pro cess k executing release

precisely the following sequence of op erations has o ccured

k k  i

LL O w ner shipsx Rec  R Rec v er sion n  W Rec v er sion n

r i i i

Tranin  Tranin 

j k

 R Rec v er sion n  SC O w ner shipsx Nul l

i r

Tranin 

i

By the induction hyp othesis on prop erty batW Rec v er sion n O w ner shipsx diers from

i r

k

and therefore the SC O w ner shipsx Nul l should have failed A contradiction Rec

r i

Tranin 

To prove b note that from the algorithm it follows that pro cess i has executed

i i

Rec status Success  Ownerships x Nul l 

i

Tranin Tranin

j

i

 Ownerships x Nul l  W Rec v er sion n

l i

T in

Tranin

i

Assume bywayofcontradiction that at W Rec v er sion n for some lo cation x where

i r

T  in

Ownerships x Rec By the induction hyp othesis on prop erty b x b elongs to Tran s data set

r i r i

Let k b e the pro cessor that has written Rec on O w ner shipsx By the algorithm k p erformed

i r

k k k

LL Rec status Nul l  LL Ownerships x Nul l  SC Rec status Nul l 

i r i

k i

Rec status Success  SC O w ner shipsx Rec

i r i

Tranin

By prop erty a atthepoint of exeuting SC Rec status Success O w ner shipsx Rec

Tranin i j i

k

and therefore SC O w ner shipsx Nul l should have failed a contradiction

r

The following corollary will b e useful when proving the non blo cking prop erty of the implemen

tation The pro of is similar to the pro of of part b in Lemma

i

Corollary Let T i n be a failing transaction then at the point of executing W Rec

i

T in

v er sion n no entry contains Rec

i

We can now complete the pro of of linearizabili tyWe dene the execution state of the imple

mentation at anypoint of the execution to b e the function F st F xMemoryx for every

x  L

Lemma Let T i n bea successful transaction and let F and F be the execution states of

SU i n and AW i n respectively The fol lowing properties hold

values F  DataSet ol d a At AW i n Rec

in i

b If F and F are the execution states of SU i n and AW i n respectively then F 

DataSet f F  DataSet where f is the transition function of T i n

in in in in

c After AW i n no process executing T wil l update Memory

in

Pro of The pro of is by joint induction on the length of the execution

yofcontradiction that To prove aletF b e the execution state at SU i n Assume bywa

for some lo cation x  DataSet Rec ol d valuesx  F x That means that Memoryx

in i

was changed b etween SU i n and the p oint in the execution in which Rec ol d valuesxwas

i

values b efore set Since by the algorithm all the executing pro cesses of T n i up date Rec ol d

i

up dating the Memory Memoryxwas altered by an executing pro cess of some other successful

   

transaction T i n By Lemma AW i n SU i n and therefore by the induction hyp oth

esis on prop erty cwehaveacontradiction

To prove b assume bywayofcontradiction that at AW i n there is some lo cation x 

r

DataSet st Memoryx  f F x Let j b e the pro cess which has executed AW i n

in r in r

By the algorithm as a part of the UpdateMemory pro cedure j p erformed either

j

R Memoryx f F x  AW i n

r in r

T in

or

j j

R Memoryx  f F x  SC Memoryx f F x  AW i n

r in r r in r

T in T in

k

Therefore there is some pro cess k which p erformed SC on Memoryx witha value dierent

r

j

and b efore AW i n Assume wlog that k is executing the then f F x after R x

in r r

   

transaction Trani n If i i then clearly n nand by the induction hyp othesis on prop erty c

   k  

k s writing should have failed Therefore i  iIfAW i n  SC then AW i n  AW i n

k

and using the induction hyp othesis on prop erty c we again haveacontradiction Therefore SC 

  k

AW i n In that case wehave a contradiction to Lemma since at SC Ownershipsx is

r

supp osed to contain b oth Rec and Rec

i i

To prove c assume bywayofcontradiction that some executing pro cess j of T i n up dated a

lo cation x in memory after AW i n Pro cess j p erformed the following sequence of op erations

r

j j

LL Memoryx val  R Rec Al l W r itten F al se 

r i

Tranin Tranij 

j j

Memoryx f F x Rec v er sion n  SC R

r in r i

Tranin ij  Tran

where

j

val  f F x and therefore LL Memoryx val  AW i n

in r r

Tranin

By prop erty aatAW i n Memory x contains f F x and therefore

r in r

j

f F x should have failed SC Memoryx

in r r

Tranin

In order to prove that the implementation is linearizable let us rst consider executions of

the STM implementation whichcontain successful transactions only Let HS b e one of those

executions and let AW  AW  AW b e the sequential subsequence of all the AW events

that o ccured during HS Since an AW event o ccurs only once for every successful transaction let

H b e the sequence

T DataSet f ov success T DataSet f ov success T DataSet f ov success

AW AW     AW 

of transaction executions induced by the AW events where for every T the triple

AW n

valuesand status elds DataSet f ov success represents the contentofthe DataSet F old

n n n

resp ectively in T s records at AW By Lemma it is a simple exercise to showby induction

AW n

n

that H is a legal sequential history according to Denition Since failing transactions do not

cause anychange to Memorywemay conclude that

Theorem The implementation is linearizable

Nonblo cking

We denote the executing pro cess which wrote Failure to Rec status of a transaction T i nas

i

its failing process In order to prove the nonblo cking prop erty of the implementation The failing

location of T i n is the lo cation that the failing pro cess has failed to acquire

Claim Given a failing transaction T i nalltheexecuting process of T i n wil l never acquire

alocation which is higher or equal to the failing location of T i n

Pro of Assume bywayofcontradiction that some executing pro cess of T i n acquired a lo cation

higher or equal to T i ns failing lo cation Since the pro cess saw all the smaller lo cations captured

for T i n let j b e executing pro cess of T i n that captured the failing lo cation x of T i n By

r

the algorithm j p erformed the following sequence of op erations

j j

LL Rec status Nul l  LL O w ner shipsx N obody 

i r

T in T in

j j

SC Rec status Nul l  SC O w ner shipsx Rec

i r i

T in T in

The failing pro cess of T i n k has p erformed the following sequence of op erations

k k

LL Rec status Nul l  LL O w ner shipsx other

i r

T in T in

k

 SC Rec status Failure

i

T in

where other is neither Nul l nor Rec Assume that

i

j

k

SC Rec status Failure  SC O w ner shipsx Rec

i r i

T in

T in

In that case

j

k

Rec status Nul l  Nul l  LL SC Rec status

i i

T in

T in

k k

LL O w ner shipsx other  SC Rec status Failure

r i

T in T in

and consequently k

j

must have seen Ownershipsx already owned by i or the SC O w ner shipsx Rec should

r r i

T in

j

k

Rec status Failure Now if have failed Therefore SC O w ner shipsx Rec  SC

i r i

in T

T in

j

k

SC Ownershipsx Rec  LL O w ner shipsx other wehaveacontradiction since

r i r

T in T in

a pro cess executing T i nnever releases an ownerships b efore the status was set and pro cesses

 

executing T i n n n will see that the Rec version has changed For that reason

i

k k

LL O w ner shipsx other  SC Rec status Failur e

r i

T in T in

j

O w ner shipsx Rec  SC

r i

T in

In that case

k k

LL Rec status Nul l  LL O w ner shipsx other 

i r

T in T in

j j

LL O w ner shipsx N obody  SC Rec status Nul l

r i

T in T in

j

 SC O w ner shipsx Rec

r i

T in

k

Rec status Failuremust have failed a contradiction and SC

i

T in

Theorem The implementation is nonblocking

Pro of Assume bywayofcontradiction that there is an innite schedule in which no transaction

terminates successfully Assume that the numb er of failing transaction is nite This happ ens only

if from some p oint on in the computation all the pro cesses are stuck in the AcquireOwnerships

routine In this case there are several pro cesses which try to get ownership on the same lo cation for

the same transaction But in that case at least one pro cess will succeed or will fail the transaction

a contradiction It must thus b e the case that the numb er of failing transactions is innite In

that case there is at least one lo cation which is a failing address innitely often Consider A

the highest of those addresses Since the initiator of the transaction tries to help the transaction

which has failed him b efore retrying and since by Corollary all acquired lo cations are released

b efore helping it follows that there are innitely many transactions whichhave acquired ownership

on A but have failed By Claim those transactions have failed on addresses higher than Aa

contradiction to the fact that A is the highest failed lo cation

Toavoid ma jor overheads when no Failures o ccur any algorithm based on the helping paradigm

must avoid as much as p ossible redundant helping In the STM implementation given ab ove

redundant helping o ccurs when a failing transaction helps another nonfaulty pro cess Such

helping will only increase contention and consequently will cause the help ed pro cess to release the

ownerships later then it would have released if not help ed In our algorithm a pro cess increases or

decreases the interval b etween helps as a function of the redundant helps it discovered

An Empirical Evaluation of Translation Metho ds

Metho dology

We compared the p erformance of STM and other software metho ds on pro cessor bus and network

architectures using the Proteus simulator develop ed byBrewer Dellaro cas Colbro ok and Weihl

Proteus simulates parallel co de bymultiplexing several parallel threads on a single CPU Each

thread runs on its own virtual CPU with accompanying lo cal memorycache and communications

hardware keeping trackofhowmuch time is sp ent using each comp onent In order to facilitate

fast simulations Proteus do es not do complete hardware simulations Instead op erations which

are lo cal do not interact with the parallel environment are run uninterrupted on the simulating

machines CPU and memory The amount of time used for lo cal calculations is added to the

time sp ent p erforming simulated globally visible op erations to deriveeach threads notion of the

current time Proteus makes sure a thread can only see global events within the scop e of its lo cal

time

In the simulated bus architecture pro cessors communicate with shared memory mo dules through

a common bus Uniform sharedmemory access is assumed that is access of any memory mo dule

from any pro cessor takes the same amount of time which is cycles ignoring delays due to bus

contention Each pro cessor has a cache with lines of bytes and the cache coherence is

maintained using Go o dmans sno opy cachecoherence proto col

The simulated network architecture is similar to that of the Alewife cachecoherent distributed

memory machine currently under development at MIT Each no de of the machines Torus

shap ed communication grid consists of a pro cessor cache memory a router and a p ortion of the

globallyaddressable memory The cost of switching or wiring in the Alewife architecture was

cyclepacket As for the bus architecture each pro cessor has a cache with lines of bytes

The cache coherence is provided using a using a version of Chaikens directorybased cache

coherence proto col

LinkedStore Conditional instructions The currentversion of Proteus do es not supp ort Load

Instead we used a slightly mo died version that supp orts a bit CompareSwap op eration

where bits serve as a time stamp Naturally this op eration is less ecient than the theoret

ical Load LinkedStore Conditional prop osed in whichwe could have built directly into

Proteus since a failing CompareSwap will cost a memory access while a failing Store Conditional

wont However we b elieve the bit CompareSwap is closer to the real world then the theoretical

Load LinkedStore Conditional since existing implementations of Load LinkedStore Conditional

Linked as on Alpha orPowerPC do not allow access to the shared memory b etween the Load

and the Store Conditional op erations On existing machines the bits CompareSwap may b e im

LinkedStore Conditional as on the Alpha or using Bershads plemented by using the a bits Load



lo ckfree metho dology

We used four synthetic b enchmarks for evaluating various metho ds for implementing shared data

structures The metho ds vary in the size of the data structure and the amount of parallelism

Counting Eachofn pro cesses increments a shared counter n times In this b enchmark

up dates are short change the whole ob ject state and have no built in parallelism

Resource Allo cation A resource allo cation scenario a few pro cesses share a set of resources

and from time to time a pro cess tries to atomically acquire a subset of size s of those resources

This is the typical b ehavior of a well designed distributed data structure For lack of space

we show only the b enchmark which has n pro cesses atomically increment n times with

s lo cations chosen uniformly at random from a vector of length The b enchmark

captures the b ehavior of highly concurrent queue and counter implementations as in



The nonblo cking prop erty will b e achieved only if the numb er of spurious failures is nite

Priority Queue A shared priority queue on a heap of size nWeusedavariant of a sequential

heap implementation In this b enchmark each of the n pro cesses consequently enqueues

a random value in a heap and dequeues the greatest value from it n times The heap is

initially empty and its maximal size is n This is probably the most trying b enchmark since

there is no p otential for concurrency and the size of the data structure increases with n

Doubly Linked Queue An implementation of a queue as a doubly linked list in an array The

rst two cells of the arraycontain the head and the tail of the list Every item in the list

is a couple of cells in the array which represent the index of the previous and next element

resp ectivelyEach pro cess enqueues a new item by up dating tail to contain the new items

index and dequeues an item by up dating the head to contain the index of the next item in

the list Each pro cess executes n couples of enqueuedequeue op erations on a queue

of initial size n This b enchmark supp orts limited parallelism since when the queue is not

empty enqueuesdequeues up date the tailhead of the queue without interfering each other

Forahighnumb er of pro cesses the size of the up dated lo cations in each enqueuedequeue is

relatively small compared to the ob ject size

We implemented the kword CompareSwap transaction given in Figure as sp ecialization of

ve The simplication is that pro cesses do not have to agree on the general STM scheme given ab o

the value stored in the data set b efore the transaction started only on a b o olean value whichsays

if the value is equal to old or not

We used the ab ovebenchmarks to compare STM to the two nonblo cking software translation

metho ds describ ed earlier and a blo cking MCS queue based solution the data structure is

accessed in a mutually exclusive manner The nonblo cking metho ds include Herlihys Method and

Israeli and Rapp op orts kword CompareSwap based implementation of the cooperative method

All the nonblo cking metho ds use exp onential backo to reduce contention

Results

The data to b e presented leads us to conclude that there are three factors dierentiating among

the p erformance of the four metho ds

Potential paral lelism Bothlocking and Herlihys metho d do not exploit p otential parallelism

and only one pro cess at a time is allowed to up date the data structure The software

transactional and the co op erative metho ds allow concurrent pro cesses to access disjoint parts

of the data structure

The price of a failing update InHerlihys nonblo cking metho d the numb er of memory ac

cesses of a failing up date in is at least the size of the ob ject reading the ob ject and copying BUS Alewife 12000 6000 STM STM 10000 Coordinated method 5000 Coordinated method Herlihy’s method Herlihy’s method QUEUE spin lock QUEUE spin lock 8000 4000

6000 3000

4000 2000

operations per 10**6 cycles 2000 operations per 10**6 cycles 1000

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure Counting Benchmark

it to the private copy and reading and writing to the p ointer Fortunately the nature of the

cache coherence proto cols is such that almost all accesses p erformed when the pro cess up dates

its private copy are lo cal In b oth caching metho ds the price of a failure is a least the number

lo cations accessed during the cached execution

The amount of helping by other processes Helping exists only in the softwaretransactional

and the co op erative metho ds In the co op erative implementation kword CompareSwap

including failing ones are help ed not only by the kword CompareSwap op erations that

access the same lo cations concurrently but also by all the op erations that are in turn helping

them and so on In the STM metho d an kword CompareSwap is help ed only by op erations

that need the same lo cations Moreover and this is a crucial p erformance factor in STM

most of the unsuccessful up dates terminate as failing transactions not as failing kword

CompareSwap and when a transaction fails on the rst lo cation it is not help ed

The results for the counting b enchmark are given in Figure The horizontal axis shows the

numb er of pro cessors and the vertical axis shows the throughput achieved This b enchmark is cruel

to the caching based metho ds since the amount of up dated memory is equivalent to the size of the

ob ject and there is no p otential for parallelism On the bus architecture lo cking and Herlihys

metho d give signicantly higher throughput than the caching metho ds

Resource Allo cation Benchmark

The results of the resource allo cation b enchmark are shown in Figure We measured the

p otential for parallelism as a p ercentage of the atomic swordincrements that succeeded on rst

attempt When s this p ercentage varies b etween at pro cessors down to at

pro cessors For s the p otential for parallelism is at pro cessors down to at

pro cessors and when s itvaries b etween at pro cessors to at pro cessors

In general as the numb er of pro cessors increases lo cal work can b e p erformed concurrentlyand

thus the p erformance of the STM improves Beyond a certain numb er of pro cessors the p otential

for parallelism declines causing a growing number of kword CompareSwap conicts and the

throughput degrades This is the reason for the relatively low throughput of the STM metho d for

small numb ers of pro cessors and the concave form of STM graphs As one can see when s on

the bus or when s on Alewife architecture the STM metho d outp erforms even the queuelo ck

metho d

A priority queue is a data structure that do es not allow concurrency and as the number of

pro cessors increases the numb er of lo cations accessed increases to o Still the numb er of accessed

lo cations is smaller than the size of the ob ject Therefore the STM p erforms b etter than Herlihys

metho d in most concurrency level

Figure contains the doubly linked queue results There is more concurrency in accessing the

ob ject than in the counter b enchmark though it is limited at most two pro cesses may concurrently

ys metho d p erforms p o orly b ecause the p enalty paid for a failed up date up date the queue Herlih

grows linearly with queue size usually twice the numb er of the pro cesses In the STM metho d

the low granularity of the twoword CompareSwap transactions implies that the price of a failure

remains constant in all concurrency levels though lo cal work is still higher than the TestandTest

andSet

A comparison of nonblo cking metho ds only

Every theoretical metho d can b e improved in manyways when implemented in practice In order

to get a fair comparison b etween the nonblo cking metho ds one should use them in their purest

form Therefore we compare the p erformance of all the nonblo cking metho ds without backo in

all the metho ds and without the nonredundanthelpin g p olicy in STM We also compare the

co op erativekword CompareSwap with STM for a sp ecic implementation which explicitly needs

such a software supp orted op eration Wechose Israeli and Rapp op orts algorithm for a concurrent

priority queue since it is based on recursive helping Therefore whenever a pro cess during

the execution of a kword CompareSwap helps another remote disjoint pro cess it should givean

advantage to Israeli and Rapp op ort metho d Our implementation is slightly dierent since it uses BUS Alewife 12000 12000 STM STM 10000 Coordinated method 10000 Coordinated method Herlihy’s method Herlihy’s method QUEUE spin lock QUEUE spin lock 8000 8000

6000 6000

4000 4000

operations per 10**6 cycles 2000 operations per 10**6 cycles 2000

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

P P P Two lo cations P P P BUS Alewife 3500 2500

3000 2000 2500

2000 1500

1500 1000

1000 500 operations per 10**6 cycles 500 operations per 10**6 cycles

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

P P P Four lo cations P P P BUS Alewife 2500 1200 1100 2000 1000 900 800 1500 700 600 1000 500 400 500

operations per 10**6 cycles operations per 10**6 cycles 300 200 0 100 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

P P P Six lo cations P P P

Figure Resource Allo cation Benchmark BUS Alewife 4000 2500 STM STM 3500 Coordinated method Coordinated method Herlihy’s method 2000 Herlihy’s method 3000 QUEUE spin lock QUEUE spin lock

2500 1500 2000

1500 1000

1000 500 operations per 10**6 cycles operations per 10**6 cycles 500

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure Priority Queue Benchmark

BUS Alewife 9000 3500 STM STM 8000 Coordinated method 3000 Coordinated method 7000 Herlihy’s method Herlihy’s method QUEUE spin lock QUEUE spin lock 2500 6000 5000 2000

4000 1500 3000 1000 2000 operations per 10**6 cycles operations per 10**6 cycles 500 1000 0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure Doubly Linked Queue Benchmark



aword CompareSwap op eration instead of a word StoreConditional op eration

We ran the same b enchmark as for the regular priority queue The results of the concurrent

priority queue b enchmark are given in Figure In spite of the advantage that the inherent

structure of the algorithm should give to Israeli and Rapp op ort metho d STM provides the highest

throughput As in the counter and the sequential priority queue b enchmarks the reason for this is



In fact using word CompareSwap simplies the implementation since it avoids freezing no des BUS Alewife 6000 4000 STM STM 3500 5000 Coordinated method Coordinated method Herlihy’s method Herlihy’s method 3000 4000 2500

3000 2000

1500 2000 1000

operations per 10**6 cycles 1000 operations per 10**6 cycles 500

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure nonblo cking comparison Counting Benchmark

the high numb er of failing kword CompareSwap op erations in Israeli and Rapp op ort metho d

up to times the numb er of successful kword CompareSwap Every theoretical metho d can

b e improved in manyways when implemented in practice

In general the result we got were that STM metho d outp erforms Barnes metho d in all circum

stances and except from the counter b enchmark STM outp erforms Herlihys metho d to o The

results of the counter b enchmark are shown in Figure As in the previous tests Herlihys metho d

p erforms b etter than caching metho ds on b oth architectures In Bus architecture Herlihys metho d

is from times faster than STM on pro cessors down to time faster than STM on

pro cessors On Alewife architecture Herlihys metho d is from times faster than STM on

pro cessors down to times faster than STM on pro cessors STM is from faster than

Israeli and Rapp op ort metho d on Bus architecture up to time faster than Israeli and Rap

p op ort metho d on pro cessors On Alewife architecture STM is from times up to time

faster than Israeli and Rapp op ort metho d This degradation in the p erformance of Israeli and

Rapp op ort metho d is due to the high numb er of failing kword CompareSwap up to

times the numb er of successful kword CompareSwap while in STM the numb er of failing kword

CompareSwap is at most times the numb er of successful kword CompareSwap Most of

the transactions in STM terminate ad failing transactions and are not help ed since they failed in

acquiring the rst and last lo cation needed In the priority queue b enchmark on a simulated bus

architecture Herlihys metho d is from times faster than STM down to times slower than

STM On the Alewife architecture Herlihys metho d has a throughput that is times higher

than STM throughput down to times lower than in the STM metho d The results of the doubly

linked queue are in Figure On the Bus architecture our STM metho d is up to faster than BUS Alewife 1400 900 STM STM 800 1200 Coordinated method Coordinated method Herlihy’s method 700 Herlihy’s method 1000 600 800 500

600 400 300 400 200 operations per 10**6 cycles 200 operations per 10**6 cycles 100 0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure nonblo cking comparison Priority Queue Benchmark

BUS Alewife 2500 1600 STM STM Coordinated method 1400 Coordinated method 2000 Herlihy’s method Herlihy’s method 1200

1500 1000 800

1000 600

400 500 operations per 10**6 cycles operations per 10**6 cycles 200

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure nonblo cking comparison Doubly Linked Queue Benchmark

Israeli and Rapp op ort metho d and up to times faster than Herlihys metho d When simulating

Alewife architecture our metho d had a times higher throughput than Herlihys metho d and

higher throughput than Israeli and Rapp op ort metho d In the resource allo cation b enchmark

to o Figure STM outp erforms other metho ds On Bus simulations STM is b etween

times faster than Israeli and Rapp op ort metho d on pro cessors and times faster on pro ces

sors On Alewife simulations STM is times faster than Israeli and Rapp op ort metho d on

pro cessors up to time faster on pro cessors Note that in this b enchmark the factor

that eect Israeli and Rapp op ort p erformance is not the numb er of failing k word CompareSwaps

which is relatively low but more the remote redundant help that pro cessors execute

The results of the concurrent priority queue b enchmark are given in Figure Though the

advantage that the inherent structure of the algorithm should give to Isreali and Rapp op ort metho d

STM give here also the highest throughput As in the counter and the sequential priority queue

b enchmarks the reason for this is the high numb er of failing kword CompareSwap op erations in

Israeli and Rapp op ort metho d up to times the numb er of successful k word CompareSwaps

Conclusions

Our pap er intro duces a nonblo cking software version of Herlihy and Moss transactional memory

approach There are many p ossible directions in which it can b e extended One issue is to design

b etter nonblo cking translation engines p ossibly by limiting STMs expressability to a smaller set

of implementable transactions Another interesting question is what p erformance guarantees one

can get with a less robust STM software package p ossibly programmed on the machines message

passing level Finally the ability to add an STM comp onent to existing software based virtual

shared memory systems raises theoretical questions on the computational p ower of a programming

abstraction based on having a variety of op erations that can b e applied to memory lo cations vs

the traditional approach of thinking of synchronization op erations as ob jects

Acknowledgments

We wish to thank Greg Barnes and Maurice Herlihy for their many helpful comments

References

A Agarwal et al The MIT Alewife Machine A LargeScale DistributedMemory Multipro ces

oceedings of Workshop on Scalable Shared Memory MultiprocessorsKluwer Academic sor In Pr

Publishers An extended version of this pap er has b een submitted for publication and

app ears as MITLCS Memo TM

R J Anderson Primitives for Asynchronous List Compression Proceeding of the th ACM

Symposium on Paral lel Algorithms and Architectures pages BUS Alewife 9000 11000 STM 10000 STM 8000 Coordinated method Coordinated method 9000 7000 Herlihy’s method Herlihy’s method 8000 6000 7000 5000 6000 4000 5000 4000 3000 3000 2000

operations per 10**6 cycles operations per 10**6 cycles 2000 1000 1000 0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Two lo cations BUS Alewife 3000 2500

2500 2000

2000 1500 1500 1000 1000

500 operations per 10**6 cycles 500 operations per 10**6 cycles

0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Four lo cations BUS Alewife 1200 900

800 1000 700 800 600

600 500

400 400 300

operations per 10**6 cycles 200 operations per 10**6 cycles 200

0 100 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Six lo cations

Figure nonblo cking comparison Resource Allo cation Benchmark BUS Alewife 600 400 550 STM STM Coordinated method 350 Coordinated method 500 450 300 400 350 250

300 200 250 200 150

operations per 10**6 cycles 150 operations per 10**6 cycles 100 100 50 50 0 10 20 30 40 50 60 0 10 20 30 40 50 60

Processors Processors

Figure nonblo cking comparison Israeli Rapp op ort Priority Queue

TE Anderson The p erformance of spin lo ck alternatives for shared memory multipro cessors

In IEEE Transaction on Paral lel and Distributed Systems January

J AlemanyEWFelten Performance Issues in NonBlo cking Synchronization on Shared

Memory Multipro cessors In Proceedings of th ACM Symposium on Principles of Distributed

Computation Pages August

J Aspnes MP Herlihy and N Shavit Counting Networks Journal of the ACMVol

No Septemb er pp

G Barnes A Metho d for Implementing Lo ckFree Shared Data Structures In Proceedings of

the th ACM Symposium on Paral lel Algorithms and Architectures

BN Bershad Practical consideration for lo ckfree concurrent ob jects Technical Rep ort CMU

CS Carnegie Mellon UniversitySeptemb er

EA Brewer CN Dellaro cas A Colbro ok and W E Weihl Proteus A HighPerformance

ParallelArchitecture Simulator MITLCSTR September

EA Brewer CN Dellaro cas Proteus User Do cumentation

K Chandy and J Misra The Drinking Philosophers Problem In ACM Transaction on

amming Languages and Systems Octob er Progr

TH Cormen CE Leiserson and RL Rivest Intro duction to algorithms MIT Press

David Chaiken Cache Coherence Proto cols for LargeScale Multipro cessors SM thesis

Massachusetts Institute of Technology Lab oratory for Computer Science Technical Rep ort

MITLCSTR Septemb er

DEC Alpha system reference manual

M Herlihy and JM Wing Linearizabili ty A correctness condition for concurrent ob jects In

ACM Transaction on Programming Languages and Systems pages July

M HerlihyWaitFree Synchronization In ACM Transaction on Programming Languages and

Systems pages January

M Herlihy A metho dology for implementing highly concurrent data ob jects ACM Transac

tions on Programming Languages and Systems November

M Herlihy and JE B Moss Transactional Memory Architectural Supp ort for Lo ckFree

Data Structures In th Annual Symposium on Computer Architecture pages May

J R Go o dman Using cachememory to reduce pro cessormemory trac In Proceeding of the

th International Symposium on Computer Architectures pages June

wer PC Reference manual IBM Po

A Israeli and L Rapp op ort EcientWait Free Implementation of a Concurrent Priority

Queue In WDAG Lecture Notes in Computer Science Springer Verlag pages

A Israeli and L Rapp op ort DisjointAccessParallel Implementations of Strong Shared Mem

ory Proc of the th ACM Symposium on Principles of Distributed Computing pages

A LaMarca A Performance Evaluation of Lo ckFree Synchronization Proto cols Proc of the

th ACM Symposium on Principles of Distributed Computing pages

N Lynch and M Tuttle Hierachical Correctness Pro ofs for Distributed Algorithm In Pro

ceedings of th ACM Symposium on Principles of Distributed Computation Pages

August Full version available as MIT Technical Rep ort MITLCSTR

H Massalin and C Pu A lo ckfree multipro cessor OS kernel Technical Rep ort CUCS

Columbia University Mars

JM MellorCrummey and ML Scott Synchronization without Contention In Proceedings

of the th International ConferenceonArchitectureSupport for Progr amming Languages and

Operating Systems April

L Rudolph M Slivkin and E Upfal A Simple Load Balancing Scheme for Task Allo cation

in Parallel Machines In Proceedings of the rdACM SymposiumonParal lel Algorithms and

Architectures pages July

N Shavit and A Zemach Diracting Trees In Proceedings of the Annual Symposium on

Paral lel Algorithms and Architectures SPAA June

J Turek D Shasha and S Prakash Lo cking without blo cking Making Lo ck Based Concurrent

Data Structure Algorithms Nonblo cking In Proceedings of the Principle of Database

Systems pages

D Touitou Lo ckFree Programming A Thesis Prop osal Tel Aviv University April